| 22 | Provably Mitigating Corruption, Overoptimization, and Verbosity Simultaneously in Offline and Online RLHF/DPO Alignment | Proposes the RLHF-COV and DPO-COV algorithms, which simultaneously mitigate data corruption, overoptimization, and verbosity in offline and online RLHF/DPO alignment. | reinforcement learning, RLHF, DPO |  |
| 23 | EARL: Efficient Agentic Reinforcement Learning Systems for Large Language Models | Presents EARL, an efficient agentic reinforcement learning system for large language models. | reinforcement learning, large language model |  |
| 24 | Multimodal Trajectory Representation Learning for Travel Time Estimation | Proposes the MDTI framework, which fuses multimodal trajectory data to improve travel time estimation accuracy. | representation learning, multimodal | ✅ |
| 25 | Primal-Dual Direct Preference Optimization for Constrained LLM Alignment | Proposes a primal-dual DPO method for constrained alignment of large language models, improving safety and efficiency. | DPO, direct preference optimization, large language model |  |
| 26 | Semantic-Cohesive Knowledge Distillation for Deep Cross-modal Hashing | Proposes SODA, a semantically cohesive knowledge distillation method for deep cross-modal hashing. | distillation, multimodal |  |
| 27 | Stratified GRPO: Handling Structural Heterogeneity in Reinforcement Learning of LLM Search Agents | Proposes Stratified GRPO to handle structural heterogeneity in reinforcement learning of LLM search agents. | reinforcement learning, large language model |  |
| 28 | Multi-Task Reinforcement Learning with Language-Encoded Gated Policy Networks | Proposes Lexical Policy Networks (LEXPOL), which use language-encoded gated policy networks for multi-task reinforcement learning. | reinforcement learning, language conditioned |  |
| 29 | The Alignment Auditor: A Bayesian Framework for Verifying and Refining LLM Objectives | Proposes a Bayesian framework for verifying and refining large language model objectives. | reinforcement learning, inverse reinforcement learning, RLHF |  |
| 30 | Learning from Failures: Understanding LLM Alignment through Failure-Aware Inverse RL | Proposes failure-aware IRL, which improves LLM alignment by focusing on failure cases. | reinforcement learning, inverse reinforcement learning, RLHF |  |
| 31 | GUIDE: Guided Initialization and Distillation of Embeddings | Proposes GUIDE, guided initialization and distillation of embeddings that improves student model quality with no additional overhead. | teacher-student distillation |  |
| 32 | From Learning to Mastery: Achieving Safe and Efficient Real-World Autonomous Driving with Human-In-The-Loop Reinforcement Learning | Proposes H-DSAC to achieve safe and efficient real-world autonomous driving. | reinforcement learning, policy learning |  |
| 33 | Online Matching via Reinforcement Learning: An Expert Policy Orchestration Strategy | Proposes a reinforcement learning based expert policy orchestration strategy for online matching. | reinforcement learning |  |
| 34 | Nearly Instance-Optimal Parameter Recovery from Many Trajectories via Hellinger Localization | Achieves nearly instance-optimal parameter recovery from many trajectories via Hellinger localization. | linear attention, foundation model |  |
| 35 | Edit-Based Flow Matching for Temporal Point Processes | Proposes an edit-based flow matching model that improves the generation efficiency and flexibility of temporal point processes. | flow matching |  |
| 36 | Untangling Component Imbalance in Hybrid Linear Attention Conversion Methods | Reveals the component imbalance problem in hybrid linear attention conversion methods and proposes a remedy. | linear attention |  |
| 37 | Deciphering Invariant Feature Decoupling in Source-free Time Series Forecasting with Proxy Denoising | Proposes TimePD, which uses proxy denoising to address invariant feature decoupling in source-free time series forecasting. | distillation, large language model |  |
| 38 | Permutation-Invariant Representation Learning for Robust and Privacy-Preserving Feature Selection | Proposes FedCAPS, a federated learning framework for robust and privacy-preserving feature selection. | representation learning |  |
| 39 | Traj-Transformer: Diffusion Models with Transformer for GPS Trajectory Generation | Proposes Traj-Transformer, which combines a Transformer with diffusion models to generate high-quality GPS trajectories. | trajectory transformer, spatiotemporal |  |
| 40 | Monte Carlo Permutation Search | Proposes Monte Carlo Permutation Search (MCPS), improving general game-playing AI performance under limited compute budgets. | reinforcement learning, deep reinforcement learning |  |
| 41 | Implicit Updates for Average-Reward Temporal Difference Learning | Proposes an implicit average-reward TD(λ) algorithm, improving the numerical stability and efficiency of temporal difference learning. | reinforcement learning, policy learning |  |