| 1 |
PriorZero: Bridging Language Priors and World Models for Decision Making |
Proposes PriorZero to address the dynamics mismatch between LLMs and RL |
reinforcement learning world model world models |
✅ |
|
| 2 |
Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models |
Proposes Block-R1 to resolve cross-domain block-size conflicts in multi-domain reinforcement learning for diffusion large language models. |
reinforcement learning large language model |
✅ |
|
| 3 |
ORCE: Order-Aware Alignment of Verbalized Confidence in Large Language Models |
ORCE: an order-aware framework for calibrating verbalized confidence in large language models, improving reliability. |
reinforcement learning large language model |
|
|
| 4 |
Discrete Flow Matching for Offline-to-Online Reinforcement Learning |
DRIFT: a discrete flow matching method for offline-to-online reinforcement learning |
reinforcement learning flow matching |
|
|
| 5 |
Intrinsic Vicarious Conditioning for Deep Reinforcement Learning |
Proposes a deep reinforcement learning method based on intrinsic vicarious conditioning to address single-life and continual learning |
reinforcement learning deep reinforcement learning |
|
|
| 6 |
MaskTab: Scalable Masked Tabular Pretraining with Scaling Laws and Distillation for Industrial Classification |
MaskTab: scalable masked tabular pretraining for industrial classification, combining scaling laws and knowledge distillation |
distillation foundation model |
|
|
| 7 |
On the Importance of Multistability for Horizon Generalization in Reinforcement Learning |
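Proposes a theoretical framework for temporal horizon generalization, revealing the importance of multistability for long-term memory in reinforcement learning |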
reinforcement learning state space model |
|
|
| 8 |
Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction |
Proposes semantically decoupled correction methods for the missing old logits problem in asynchronous agentic reinforcement learning. |
reinforcement learning PPO large language model |
✅ |
|
| 9 |
GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation |
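Proposes the GEAR framework, which uses self-distillation for granularity-adaptive advantage reweighting in LLM agents, improving performance on long-horizon tasks. |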
reinforcement learning distillation |
|
|
| 10 |
Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training |
Proposes a sparse-to-dense reward principle that improves language-model post-training on verifiable math problems |
distillation |
|
|
| 11 |
Model-based Bootstrap of Controlled Markov Chains |
Proposes a model-based bootstrap method for offline policy evaluation and optimization in controlled Markov chains. |
reinforcement learning offline RL offline reinforcement learning |
|
|
| 12 |
OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning |
Proposes OGLS-SD, which achieves on-policy self-distillation for LLM reasoning through outcome-guided logit steering. |
distillation |
|
|
| 13 |
Events as Triggers for Behavioral Diversity in Multi-Agent Reinforcement Learning |
Proposes an event-driven framework to address behavioral diversity in multi-agent reinforcement learning |
reinforcement learning |
|
|
| 14 |
Transferable Delay-Aware Reinforcement Learning via Implicit Causal Graph Modeling |
Proposes a transferable delay-aware reinforcement learning method based on implicit causal graph modeling |
reinforcement learning |
|
|
| 15 |
Delay-Empowered Causal Hierarchical Reinforcement Learning |
Proposes Delay-Empowered Causal Hierarchical Reinforcement Learning (DECHRL) for decision making under delay uncertainty |
reinforcement learning |
|
|
| 16 |
Optimal Policy Learning under Budget and Coverage Constraints |
Proposes an optimal policy learning method under budget and coverage constraints |
policy learning |
|
|
| 17 |
Multi-Task Representation Learning for Conservative Linear Bandits |
Proposes the CMTRL framework for multi-task representation learning in conservative linear bandits |
representation learning |
|
|
| 18 |
Expected Batch Optimal Transport Plans and Consequences for Flow Matching |
Proposes expected batch optimal transport plans for flow matching |
flow matching |
|
|
| 19 |
Stochastic Minimum-Cost Reach-Avoid Reinforcement Learning |
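Proposes an RAPC-based reinforcement learning method for cost optimization under probabilistic reach-avoid constraints in stochastic environments |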
reinforcement learning |
|
|
| 20 |
Towards Order Fairness: Mitigating LLMs Order Sensitivity through Dual Group Advantage Optimization |
Proposes DGAO, which mitigates order sensitivity in large language models through dual group advantage optimization. |
reinforcement learning large language model |
✅ |
|
| 21 |
Adaptive TD-Lambda for Cooperative Multi-agent Reinforcement Learning |
Proposes ATD($λ$), an adaptive TD($λ$) algorithm, to address the difficulty of computing the policy distribution in MARL |
reinforcement learning |
|
|
| 22 |
Information theoretic underpinning of self-supervised learning by clustering |
A study of the information-theoretic foundations of self-supervised learning by clustering |
distillation foundation model |
|
|
| 23 |
GRAFT: Graph-Tokenized LLMs for Tool Planning |
GRAFT: graph-tokenized LLMs for tool planning, addressing the difficulty of modeling dependencies |
distillation large language model |
|
|
| 24 |
Evolutionary Task Discovery: Advancing Reasoning Frontiers via Skill Composition and Complexity Scaling |
EvoTD: improves the reasoning ability of large language models through skill composition and complexity scaling |
reinforcement learning large language model |
✅ |
|
| 25 |
From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation |
Proposes CREDIT, which uses contrastive learning to improve input-specific rewards in on-policy self-distillation. |
distillation |
|
|
| 26 |
Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information |
Proposes AntiSD, which improves language models' mathematical reasoning through anti-self-distillation. |
distillation |
|
|
| 27 |
Sharpen Your Flow: Sharpness-Aware Sampling for Flow Matching |
Proposes SharpEuler, an adaptive sampling method for flow matching that improves generation quality. |
flow matching |
|
|
| 28 |
BSO: Safety Alignment Is Density Ratio Matching |
Proposes BSO as a simpler approach to safety alignment |
reinforcement learning direct preference optimization |
|
|
| 29 |
Autoregressive Learning in Joint KL: Sharp Oracle Bounds and Lower Bounds |
Proposes autoregressive learning in joint KL to address long-sequence modeling |
policy learning imitation learning |
|
|
| 30 |
Variance-aware Reward Modeling with Anchor Guidance |
Proposes anchor-guided variance-aware reward modeling to address the non-uniqueness of reward models under diverse human preferences. |
PPO RLHF |
|
|
| 31 |
OUI as a Structural Observable: Towards an Activation-Centric View of Neural Network Training |
Proposes OUI as a structural observable for neural network training, revealing training dynamics from an activation-centric view |
reinforcement learning PPO |
|
|