| # | Title | Summary | Keywords | Status |
|---|-------|---------|----------|--------|
| 1 | Efficient Anti-exploration via VQVAE and Fuzzy Clustering in Offline Reinforcement Learning | Proposes an anti-exploration method for offline reinforcement learning based on VQ-VAE and fuzzy clustering, improving efficiency and performance. | reinforcement learning, policy learning, offline RL | |
| 2 | Epigraph-Guided Flow Matching for Safe and Performant Offline Reinforcement Learning | EpiFlow: epigraph-guided flow matching for safe and performant offline reinforcement learning. | reinforcement learning, offline RL | |
| 3 | MARTI-MARS$^2$: Scaling Multi-Agent Self-Search via Reinforcement Learning for Code Generation | Proposes MARTI-MARS$^2$, which scales multi-agent self-search via reinforcement learning to improve code generation. | reinforcement learning, policy learning, large language model | |
| 4 | Horizon Imagination: Efficient On-Policy Rollout in Diffusion World Models | Proposes Horizon Imagination, which accelerates on-policy rollouts of diffusion world models in reinforcement learning. | reinforcement learning, world model | ✅ |
| 5 | Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection | Proposes OGPSA, which mitigates capability forgetting during safety alignment via orthogonal gradient projection. | DPO, direct preference optimization, large language model | ✅ |
| 6 | Preference Conditioned Multi-Objective Reinforcement Learning: Decomposed, Diversity-Driven Policy Optimization | D³PO: a decomposed, diversity-driven policy optimization method for preference-conditioned multi-objective reinforcement learning. | reinforcement learning, PPO | |
| 7 | When Is Compositional Reasoning Learnable from Verifiable Rewards? | Learning compositional reasoning from verifiable rewards: the task advantage ratio is the key factor. | reinforcement learning, large language model | |
| 8 | A Kinetic-Energy Perspective of Flow Matching | Proposes a kinetic-energy perspective on flow matching, improving generation quality and reducing memorization. | flow matching | |
| 9 | Interpretable Analytic Calabi-Yau Metrics via Symbolic Distillation | Obtains interpretable, analytic Calabi-Yau metrics via symbolic distillation. | distillation | |
| 10 | rePIRL: Learn PRM with Inverse RL for LLM Reasoning | rePIRL: learns a process reward model (PRM) for LLM reasoning via inverse reinforcement learning. | reinforcement learning, deep reinforcement learning | |