| # | Title | Summary | Keywords |
|---|-------|---------|----------|
| 1 | BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping | BAPO stabilizes off-policy reinforcement learning for LLMs through balanced policy optimization with adaptive clipping. | reinforcement learning; PPO; large language model |
| 2 | Preference-based Reinforcement Learning beyond Pairwise Comparisons: Benefits of Multiple Options | Proposes the M-AUPO algorithm to improve the sample efficiency of preference-based reinforcement learning. | reinforcement learning; large language model |
| 3 | From Competition to Synergy: Unlocking Reinforcement Learning for Subject-Driven Image Generation | Proposes a customized GRPO that resolves the fidelity–editability trade-off in subject-driven image generation. | reinforcement learning; reward shaping |
| 4 | Towards Universal Solvers: Using PGD Attack in Active Learning to Increase Generalizability of Neural Operators as Knowledge Distillation from Numerical PDE Solvers | Proposes a PGD-attack-based active learning framework that improves the generalizability of neural operators for solving PDEs. | teacher-student distillation |
| 5 | Reinforcement Learning with Imperfect Transition Predictions: A Bellman-Jensen Approach | Proposes a Bellman-Jensen approach to reinforcement learning with imperfect transition predictions. | reinforcement learning; model-based RL |
| 6 | Learning to Navigate Under Imperfect Perception: Conformalised Segmentation for Safe Reinforcement Learning | Proposes COPPOL, which combines conformal prediction with reinforcement learning for safe navigation. | reinforcement learning; policy learning |
| 7 | ADPO: Anchored Direct Preference Optimization | ADPO improves policy alignment by decoupling response quality from prior popularity. | reinforcement learning; direct preference optimization |
| 8 | Higher Embedding Dimension Creates a Stronger World Model for a Simple Sorting Task | Shows that higher embedding dimensions give a Transformer a stronger world model for a sorting task. | reinforcement learning; world model |
| 9 | Towards Identifiability of Hierarchical Temporal Causal Representation Learning | Proposes the CHiLD framework, addressing the identifiability of hierarchical latent causal representations learned from time-series data. | latent dynamics; representation learning |
| 10 | What Makes a Good Curriculum? Disentangling the Effects of Data Ordering on LLM Mathematical Reasoning | Disentangles the effects of data ordering on LLM mathematical reasoning to identify effective curriculum-learning strategies. | curriculum learning; large language model |
| 11 | POLAR: Policy-based Layerwise Reinforcement Learning Method for Stealthy Backdoor Attacks in Federated Learning | POLAR: a policy-gradient reinforcement learning method for stealthy backdoor attacks in federated learning. | reinforcement learning |
| 12 | Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting | Uses on-policy data to mitigate catastrophic forgetting during language-model fine-tuning. | reinforcement learning; instruction following |
| 13 | Pay Attention to the Triggers: Constructing Backdoors That Survive Distillation | Proposes T-MTB, a transferable LLM backdoor attack whose triggers survive distillation, exposing security risks in distillation pipelines. | distillation |
| 14 | Simple and Efficient Heterogeneous Temporal Graph Neural Network | Proposes SE-HTGNN, which efficiently learns heterogeneous temporal graph representations via a dynamic attention mechanism and LLM prompting. | representation learning; large language model |
| 15 | Condition-Invariant fMRI Decoding of Speech Intelligibility with Deep State Space Model | Proposes a deep state-space-model method for condition-invariant fMRI decoding of speech intelligibility. | state space model |
| 16 | Why Policy Gradient Algorithms Work for Undiscounted Total-Reward MDPs | Provides a convergence analysis of policy gradient algorithms for undiscounted total-reward MDPs. | reinforcement learning; large language model |
| 17 | Scalable, Explainable and Provably Robust Anomaly Detection with One-Step Flow Matching | Proposes Time-Conditioned Contraction Matching (TCCM) for scalable, explainable, and robust anomaly detection on tabular data. | flow matching |
| 18 | RESCUE: Retrieval Augmented Secure Code Generation | RESCUE: a retrieval-augmented secure code generation framework that improves the security of LLM-generated code. | distillation; large language model |
| 19 | Noise-corrected GRPO: From Noisy Rewards to Unbiased Gradients | Proposes a noise-corrected GRPO framework that removes the policy-optimization bias caused by noisy rewards in RLHF. | reinforcement learning; RLHF |