| # | Title | Summary | Keywords | ✅ |
|---|---|---|---|---|
| 1 | Forward versus Backward: Comparing Reasoning Objectives in Direct Preference Optimization | Compares forward and backward reasoning objectives to improve the reliability of direct preference optimization on math problems. | DPO, direct preference optimization, large language model | |
| 2 | Segmental Advantage Estimation: Enhancing PPO for Long-Context LLM Training | Proposes Segmental Advantage Estimation (SAE) to improve PPO performance when training long-context LLMs under sparse rewards. | reinforcement learning, PPO, large language model | |
| 3 | d3LLM: Ultra-Fast Diffusion LLM using Pseudo-Trajectory Distillation | Proposes d3LLM, which accelerates diffusion language models via pseudo-trajectory distillation, balancing accuracy and parallelism. | distillation, large language model | ✅ |
| 4 | On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training | Shows that supervised fine-tuning and reinforcement learning cannot be decoupled in post-training without incurring a performance loss. | reinforcement learning, large language model | |
| 5 | Stable On-Policy Distillation through Adaptive Target Reformulation | Proposes Veto, which achieves stable on-policy distillation through adaptive target reformulation. | distillation, large language model | |
| 6 | Reinforcement Learning for Micro-Level Claims Reserving | Proposes a reinforcement learning approach to micro-level claims reserving, improving the accuracy and stability of outstanding-claims liability prediction. | reinforcement learning, reward design | |
| 7 | Stagewise Reinforcement Learning and the Geometry of the Regret Landscape | A theory of stagewise reinforcement learning grounded in the geometry of the regret landscape, revealing Bayesian phase transitions in policy evolution. | reinforcement learning, deep reinforcement learning | |
| 8 | Offline Meta-Reinforcement Learning with Flow-Based Task Inference and Adaptive Correction of Feature Overgeneralization | Proposes FLORA, which tackles feature overgeneralization in offline meta-reinforcement learning via flow-based task inference and adaptive feature correction. | reinforcement learning, offline RL | |
| 9 | Improving Domain Generalization in Contrastive Learning using Adaptive Temperature Control | Proposes contrastive learning with adaptive temperature control to improve domain generalization. | contrastive learning | |
| 10 | TFEC: Multivariate Time-Series Clustering via Temporal-Frequency Enhanced Contrastive Learning | Proposes the TFEC framework, which improves multivariate time-series clustering through temporal-frequency enhanced contrastive learning. | contrastive learning | ✅ |
| 11 | Land-then-transport: A Flow Matching-Based Generative Decoder for Wireless Image Transmission | Proposes a flow matching-based generative decoder for wireless image transmission. | flow matching | |
| 12 | Explaining Machine Learning Predictive Models through Conditional Expectation Methods | Proposes the MUCE method, which explains machine learning model predictions through conditional expectations, improving transparency and trustworthiness. | predictive model | |
| 13 | Pseudodata-guided Invariant Representation Learning Boosts the Out-of-Distribution Generalization in Enzymatic Kinetic Parameter Prediction | O$^2$DENet improves out-of-distribution generalization in enzymatic kinetic parameter prediction via pseudodata-guided invariant representation learning. | representation learning | |
| 14 | Reward-Preserving Attacks For Robust Reinforcement Learning | Proposes α-reward-preserving attacks to improve the robustness of reinforcement learning in adversarial environments. | reinforcement learning | |