| 1 |
Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models |
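Proposes an on-policy self-distillation framework that improves the token efficiency of large language models on mathematical reasoning tasks. |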
reinforcement learning distillation privileged information |
|
|
| 2 |
Learned harmonic mean estimation of the marginal likelihood for multimodal posteriors with flow matching |
Proposes a flow-matching-based harmonic mean estimator that improves the accuracy of marginal likelihood computation for multimodal posteriors. |
flow matching multimodal |
|
|
| 3 |
POPE: Learning to Reason on Hard Problems via Privileged On-Policy Exploration |
POPE: learns to solve hard reasoning problems via privileged on-policy exploration. |
reinforcement learning privileged information large language model |
|
|
| 4 |
Rank-1 Approximation of Inverse Fisher for Natural Policy Gradients in Deep Reinforcement Learning |
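Proposes a natural policy gradient method based on a rank-1 approximation of the inverse Fisher information matrix to accelerate deep reinforcement learning. |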
reinforcement learning deep reinforcement learning |
|
|
| 5 |
Just-In-Time Reinforcement Learning: Continual Learning in LLM Agents Without Gradient Updates |
Proposes JitRL, which performs just-in-time reinforcement learning for LLM agents without gradient updates, improving continual learning. |
reinforcement learning large language model |
✅ |
|
| 6 |
TriPlay-RL: Tri-Role Self-Play Reinforcement Learning for LLM Safety Alignment |
Proposes TriPlay-RL, which improves LLM safety alignment through tri-role self-play reinforcement learning. |
reinforcement learning large language model |
|
|
| 7 |
FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning |
Proposes FP8-RL, which accelerates LLM reinforcement learning through low-precision inference while maintaining training stability. |
reinforcement learning large language model |
|
|
| 8 |
Beyond Static Datasets: Robust Offline Policy Optimization via Vetted Synthetic Transitions |
MoReBRAC: improves the robustness of offline reinforcement learning in robotics through vetted synthetic data. |
reinforcement learning offline reinforcement learning world model |
|
|
| 9 |
Multi-Objective Reinforcement Learning for Efficient Tactical Decision Making for Trucks in Highway Traffic |
Proposes a multi-objective reinforcement learning method for optimizing truck driving policies on highways. |
reinforcement learning |
|
|
| 10 |
ART for Diffusion Sampling: A Reinforcement Learning Approach to Timestep Schedule |
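Proposes ART-RL, which uses reinforcement learning to optimize the timestep schedule for diffusion model sampling, improving generation quality. |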
reinforcement learning |
|
|
| 11 |
CASSANDRA: Programmatic and Probabilistic Learning and Inference for Stochastic World Modeling |
CASSANDRA: uses LLMs for programmatic and probabilistic learning to build stochastic world models. |
world model |
|
|
| 12 |
Learning long term climate-resilient transport adaptation pathways under direct and indirect flood impacts using reinforcement learning |
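Proposes a reinforcement-learning-based method for long-term, climate-resilient transport planning that addresses direct and indirect flood impacts. |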
reinforcement learning |
|
|
| 13 |
K-Myriad: Jump-starting reinforcement learning with unsupervised parallel agents |
K-Myriad: jump-starts reinforcement learning with unsupervised parallel agents, improving exploration efficiency. |
reinforcement learning |
|
|
| 14 |
Enhance the Safety in Reinforcement Learning by ADRC Lagrangian Methods |
Proposes the ADRC-Lagrangian method, which improves safety in reinforcement learning and reduces oscillations. |
reinforcement learning |
|
|
| 15 |
Enhancing Control Policy Smoothness by Aligning Actions with Predictions from Preceding States |
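Proposes ASAP, which improves the smoothness of reinforcement learning control policies by aligning actions with predictions from preceding states. |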
reinforcement learning deep reinforcement learning |
|
|