| # | Title | Summary | Keywords |
|---|-------|---------|----------|
| 1 | Reverse Flow Matching: A Unified Framework for Online Reinforcement Learning with Diffusion and Flow Policies | Proposes the Reverse Flow Matching (RFM) framework, unifying online reinforcement learning training for diffusion and flow policies. | reinforcement learning, diffusion policy, flow matching |
| 2 | Model-Agnostic Solutions for Deep Reinforcement Learning in Non-Ergodic Contexts | Proposes time-dependent deep reinforcement learning methods to address policy suboptimality in non-ergodic environments. | reinforcement learning, deep reinforcement learning |
| 3 | Coverage Improvement and Fast Convergence of On-policy Preference Learning | Proposes a coverage-improvement principle that accelerates the convergence of on-policy preference learning for language model alignment. | preference learning, DPO, distillation |
| 4 | Structure Detection for Contextual Reinforcement Learning | Proposes the SD-MBTL framework, improving generalization in contextual reinforcement learning via online structure detection. | reinforcement learning, zero-shot transfer |
| 5 | ORBIT: On-policy Exploration-Exploitation for Controllable Multi-Budget Reasoning | ORBIT: an on-policy exploration-exploitation framework for controllable multi-budget reasoning. | reinforcement learning, distillation, chain-of-thought |
| 6 | Scalable Multiagent Reinforcement Learning with Collective Influence Estimation | Proposes a scalable multi-agent reinforcement learning framework based on a Collective Influence Estimation Network (CIEN). | reinforcement learning, SAC |
| 7 | Provably Safe Reinforcement Learning using Entropy Regularizer | Proposes an entropy-regularized safe reinforcement learning algorithm that improves safety and stability during learning. | reinforcement learning |
| 8 | Your Group-Relative Advantage Is Biased | Reveals bias in group-relative advantage estimation and proposes HA-DW to improve RLVR reasoning performance. | reinforcement learning, large language model |