| # | Title | Summary | Keywords |
|---|-------|---------|----------|
| 1 | Optimistic World Models: Efficient Exploration in Model-Based Deep Reinforcement Learning | Proposes Optimistic World Models (OWMs), which achieve efficient exploration via reward-biased maximum likelihood estimation. | reinforcement learning, deep reinforcement learning, world model |
| 2 | Long Chain-of-Thought Compression via Fine-Grained Group Policy Optimization | Proposes the FGO algorithm to address long chain-of-thought compression. | reinforcement learning, large language model, chain-of-thought |
| 3 | Towards Uniformity and Alignment for Multimodal Representation Learning | Proposes a multimodal representation learning method that decouples alignment from uniformity, mitigating distribution gaps between modalities. | representation learning, multimodal |
| 4 | Diffusion-Guided Pretraining for Brain Graph Foundation Models | Proposes a diffusion-guided pretraining framework for brain graphs, improving the robustness of brain-connectome representation learning. | masked autoencoder, foundation model |
| 5 | Answer First, Reason Later: Aligning Search Relevance via Mode-Balanced Reinforcement Learning | Proposes the AFRL paradigm with Mode-Balanced RL to balance low latency against high performance in search ranking. | reinforcement learning, distillation, large language model |
| 6 | ExO-PPO: an Extended Off-policy Proximal Policy Optimization Algorithm | ExO-PPO: an extended off-policy proximal policy optimization algorithm that improves sample efficiency and stability. | reinforcement learning, deep reinforcement learning, PPO |
| 7 | ADORA: Training Reasoning Models with Dynamic Advantage Estimation on Reinforcement Learning | ADORA: trains reasoning models with dynamic advantage estimation in reinforcement learning, improving performance on geometry and math tasks. | reinforcement learning |
| 8 | Flexible Entropy Control in RLVR with Gradient-Preserving Perspective | Proposes a flexible entropy control method from a gradient-preserving perspective to address policy entropy collapse in RLVR. | reinforcement learning, large language model |
| 9 | Rollout-Training Co-Design for Efficient LLM-Based Multi-Agent Reinforcement Learning | FlexMARL: an efficient rollout-training co-design framework for large-scale LLM-based multi-agent reinforcement learning. | reinforcement learning |
| 10 | Beyond Student: An Asymmetric Network for Neural Network Inheritance | Proposes InherNet, which inherits a neural network's structure and knowledge via asymmetric low-rank decomposition, surpassing knowledge distillation. | distillation, multimodal |
| 11 | Squeezing More from the Stream: Learning Representation Online for Streaming Reinforcement Learning | Proposes an online self-predictive representation learning method for streaming reinforcement learning, improving sample efficiency. | reinforcement learning |
| 12 | Latent Poincaré Shaping for Agentic Reinforcement Learning | LaPha: trains AlphaZero-style LLM agents in a Poincaré latent space, improving mathematical problem solving. | reinforcement learning |
| 13 | Features as Rewards: Scalable Supervision for Open-Ended Tasks via Interpretability | Proposes the RLFR framework, which uses interpretability features as rewards to improve language-model truthfulness on open-ended tasks. | reinforcement learning, affordance |
| 14 | A Controlled Study of Double DQN and Dueling DQN Under Cross-Environment Transfer | A controlled comparison of how Double DQN and Dueling DQN differ under cross-environment transfer. | reinforcement learning, deep reinforcement learning |