| 1 |
HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents |
Proposes HyperEyes: a dual-grained, efficiency-aware reinforcement learning framework for parallel multimodal search agents |
reinforcement learning distillation multimodal |
|
|
| 2 |
ExpThink: Experience-Guided Reinforcement Learning for Adaptive Chain-of-Thought Compression |
Proposes the ExpThink framework: adaptive chain-of-thought compression via experience-guided reinforcement learning |
reinforcement learning reward shaping chain-of-thought |
|
|
| 3 |
Convergence and Emergence of In-Context Reinforcement Learning with Chain of Thought |
Reveals the convergence mechanism and emergence principles of chain-of-thought (CoT) in in-context reinforcement learning (ICRL) |
reinforcement learning chain-of-thought |
|
|
| 4 |
Interpreting Reinforcement Learning Agents with Susceptibilities |
Proposes a susceptibility-based interpretability framework for deep reinforcement learning that reveals how the model evolves in parameter space. |
reinforcement learning deep reinforcement learning RLHF |
|
|
| 5 |
Prototype Guided Post-pretraining for Single-Cell Representation Learning |
Proposes CellRefine, a post-pretraining framework that uses marker-gene priors to improve single-cell representation learning |
representation learning large language model foundation model |
|
|
| 6 |
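Toward Privileged Foundation Models: LUPI for Accelerated and Improved Learning |
Proposes the PIQL framework: uses privileged information (PI) to accelerate the training of tabular foundation models (TFMs) and improve their generalization |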
privileged information foundation model |
|
|
| 7 |
Trajectory as the Teacher: Few-Step Discrete Flow Matching via Energy-Navigated Distillation |
Proposes trajectory-shaping discrete flow matching (TS-DFM), which achieves efficient text generation via energy-navigated distillation |
flow matching distillation |
|
|
| 8 |
Beyond Linear Attention: Softmax Transformers Implement In-Context Reinforcement Learning |
Reveals the ICRL mechanism of softmax Transformers: proves that it is equivalent to weighted softmax temporal-difference learning |
reinforcement learning linear attention |
|
|
| 9 |
Neurosymbolic Imitation Learning with Human Guidance: A Privileged Information Approach |
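Proposes a neurosymbolic imitation learning framework based on privileged information to improve data efficiency and generalization in complex environments. |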
imitation learning privileged information |
|
|
| 10 |
KL for a KL: On-Policy Distillation with Control Variate Baseline |
Proposes the vOPD method: introduces a control variate baseline to address unstable gradient variance in on-policy distillation. |
distillation large language model |
|
|
| 11 |
RelAgent: LLM Agents as Data Scientists for Relational Learning |
Proposes the RelAgent framework, which uses large language models as autonomous data scientists to solve relational learning tasks |
predictive model large language model foundation model |
|
|
| 12 |
SHRED: Retain-Set-Free Unlearning via Self-Distillation with Logit Demotion |
SHRED: retain-set-free knowledge unlearning for large language models via self-distillation with logit demotion |
distillation large language model |
|
|
| 13 |
TRACE: Transport Alignment Conformal Prediction via Diffusion and Flow Matching Models |
Proposes the TRACE framework: conformal prediction based on transport alignment, using diffusion and flow matching models |
flow matching multimodal |
|
|
| 14 |
Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning |
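Proposes the Prune-OPD framework, which improves on-policy distillation for long-horizon reasoning tasks through dynamic truncation and reward weighting |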
teacher-student distillation |
|
|
| 15 |
Structured Coupling for Flow Matching |
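Proposes structured coupling flow matching (SCFM), which jointly learns structured latent variables and a continuous transport map to balance generation quality and representation interpretability. |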
flow matching representation learning |
|
|
| 16 |
Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States |
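Proposes the POISE framework: estimates values from the policy model's internal states for efficient reinforcement learning of large language models |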
reinforcement learning PPO |
|
|
| 17 |
Rubric-based On-policy Distillation |
Proposes ROPD, a rubric-based on-policy distillation framework, enabling efficient alignment of black-box models |
teacher-student distillation |
✅ |
|
| 18 |
Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control |
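Proposes the Star Elastic training framework, which produces nested sub-models in a single post-training run and supports dynamic budget control at inference time. |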
SSM distillation large language model |
|
|
| 19 |
Adaptive Negative Reinforcement for LLM Reasoning: Dynamically Balancing Correction and Diversity in RLVR |
Proposes the adaptive negative reinforcement (A-NSR) framework, which improves LLM reasoning through a dynamic penalty strategy |
reinforcement learning PPO large language model |
|
|
| 20 |
Reinforcement Learning for Exponential Utility: Algorithms and Convergence in Discounted MDPs |
Proposes value-based reinforcement learning algorithms for exponential utility optimization |
reinforcement learning |
|
|
| 21 |
Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph |
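Proposes the GraphDPO algorithm, which optimizes language model alignment by modeling preference graphs, addressing the limitations of pairwise preference learning. |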
reinforcement learning DPO direct preference optimization |
|
|
| 22 |
Debiased Counterfactual Generation via Flow Matching from Observations |
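Proposes a flow-matching-based framework for debiased counterfactual generation that improves the accuracy of counterfactual inference by exploiting the observational data distribution. |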
flow matching |
|
|
| 23 |
A Refined Generalization Analysis for Extreme Multi-class Supervised Contrastive Representation Learning |
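Proposes a refined generalization analysis framework for extreme multi-class supervised contrastive learning, yielding sample complexity bounds that do not depend on the class distribution. |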
representation learning |
|
|
| 24 |
StreamPhy: Streaming Inference of High-Dimensional Physical Dynamics via State Space Models |
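Proposes the StreamPhy framework, which uses state space models for real-time streaming inference of high-dimensional physical field dynamics |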
state space model |
|
|
| 25 |
Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models |
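Proposes the Mutual Reinforcement Learning (MRL) framework, which enables experience sharing and collaborative training among heterogeneous large language models |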
reinforcement learning |
|
|
| 26 |
Improved Model-based Reinforcement Learning with Smooth Kernels |
Proposes a smooth-kernel-based online reinforcement learning method that improves regret bounds via Bernstein-style exploration bonuses |
reinforcement learning |
|
|
| 27 |
Coupling Models for One-Step Discrete Generation |
Proposes Coupling Models for efficient one-step generation of discrete data |
distillation large language model |
✅ |
|
| 28 |
Stabilized neural Hamilton-Jacobi-Bellman solvers: Error analysis and applications in model-based reinforcement learning |
Proposes stabilized neural Hamilton-Jacobi-Bellman solvers for model-based reinforcement learning. |
reinforcement learning |
|
|
| 29 |
Where to Spend Rollouts: Hit-Utility Optimal Rollout Allocation for Group-Based RLVR |
Proposes the HORA algorithm: improves the reasoning efficiency of group-based RLVR through hit-utility optimal rollout allocation |
reinforcement learning large language model |
|
|
| 30 |
Almost Sure Convergence Rates of Stochastic Approximation and Reinforcement Learning via a Poisson-Moreau Drift |
Improves almost sure convergence rates of stochastic approximation and reinforcement learning via a Poisson-Moreau drift |
reinforcement learning |
|
|
| 31 |
Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective |
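Proposes Cumulative Token Policy Optimization (CTPO), which uses cumulative importance sampling ratios to resolve the bias-variance dilemma in LLM reinforcement learning. |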
reinforcement learning PPO |
✅ |
|
| 32 |
Theoretical Limits of Language Model Alignment |
Presents theoretical limits of KL-regularized language model alignment, with the aim of improving alignment outcomes |
reinforcement learning PPO |
|
|
| 33 |
Actor-Critic with Active Importance Sampling |
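Proposes the Actor-Critic with Active Importance Sampling (AISAC) algorithm, which significantly reduces the variance of gradient estimates by optimizing the behavior policy. |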
reinforcement learning TD3 |
|
|