| # | Title | Summary | Keywords | Status |
|---|-------|---------|----------|--------|
| 1 | Preferred-Action-Optimized Diffusion Policies for Offline Reinforcement Learning | Proposes preferred-action-optimized diffusion policies to improve offline reinforcement learning performance. | reinforcement learning, offline RL, offline reinforcement learning | |
| 2 | Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF | Proposes Value-Incentivized Preference Optimization (VPO), unifying online and offline RLHF to improve LLM alignment (see the DPO sketch below the table). | reinforcement learning, offline RL, RLHF | |
| 3 | CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning | Proposes s-CLIPLoss and NormSim to improve data selection for multimodal contrastive learning (see the data-selection sketch below). | contrastive learning, multimodal | |
| 4 | Self-Exploring Language Models: Active Preference Elicitation for Online Alignment | Proposes Self-Exploring Language Models (SELM), which align LLMs online through active preference elicitation. | reinforcement learning, RLHF, DPO | ✅ |
| 5 | Robust Preference Optimization through Reward Model Distillation | Proposes a robust preference optimization method based on reward model distillation, improving language models' robustness to distribution shift in preference data. | reinforcement learning, DPO, direct preference optimization | |
| 6 | Preference Learning Algorithms Do Not Learn Preference Rankings | Shows a limitation of preference learning algorithms: the response rankings induced by trained models diverge substantially from human preference rankings. | preference learning, RLHF, DPO | |
| 7 | Efficient Preference-based Reinforcement Learning via Aligned Experience Estimation | Proposes SEER to improve feedback efficiency in preference-based reinforcement learning. | reinforcement learning, policy learning | |
| 8 | Spectral-Risk Safe Reinforcement Learning with Convergence Guarantees | Proposes spectral-risk-constrained policy optimization (SRCPO), addressing convergence in risk-constrained reinforcement learning (see the spectral-risk sketch below). | reinforcement learning | |
| 9 | Forward-Backward Knowledge Distillation for Continual Clustering | Proposes FBCC, a forward-backward knowledge distillation method for unsupervised continual clustering that addresses catastrophic forgetting (see the distillation sketch below). | distillation | |
| 10 | Learning Human-Aligned Representations with Contrastive Learning and Generative Similarity | Proposes a contrastive learning method based on generative similarity to learn representations aligned with human cognition (see the contrastive-loss sketch below). | contrastive learning | |
| 11 | Stress-Testing Capability Elicitation With Password-Locked Models | Proposes password-locked models to evaluate how effective fine-tuning is at eliciting capabilities from large language models. | reinforcement learning, large language model | |
| 12 | Deep Bayesian Filter for Bayes-faithful Data Assimilation | Proposes Deep Bayesian Filter for data assimilation with non-Gaussian posteriors in nonlinear state-space models (see the filtering sketch below). | SSM, state space model | |
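
Entries 2, 4, 5, and 6 all build on direct preference optimization (DPO). As a shared reference point, here is a minimal sketch of the standard DPO objective (Rafailov et al., 2023); it is the generic baseline these papers modify, not any one paper's method, and the function name, tensor names, and `beta=0.1` are illustrative assumptions.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    Each argument is a 1-D tensor of summed token log-probabilities for a batch
    of (chosen, rejected) response pairs under the trainable policy and the
    frozen reference model, respectively.
    """
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```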
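
Entry 3 is about scoring image-text pairs for pretraining-data selection. The sketch below is a plain CLIP-score top-k filter, included only as the baseline the paper improves on; s-CLIPLoss and NormSim refine this scoring and are not reproduced here. The function name, `keep_frac`, and the normalization assumption are illustrative.

```python
import numpy as np

def clip_score_filter(image_embs, text_embs, keep_frac=0.3):
    """Baseline CLIP-score data selection: keep the image-text pairs whose
    embeddings are most aligned. Both inputs are assumed L2-normalized with
    shape (n, d); returns indices of the kept subset.
    """
    scores = np.sum(image_embs * text_embs, axis=1)  # cosine similarity per pair
    k = int(len(scores) * keep_frac)
    return np.argsort(scores)[-k:]                   # top-k most aligned pairs
```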
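
Entry 8 optimizes under a spectral risk measure: a weighted average of the sorted loss distribution, with non-decreasing weight on worse outcomes. A minimal empirical version is sketched below (names are illustrative, not from the paper); CVaR is the special case of a step-function spectrum, shown in the usage line.

```python
import numpy as np

def spectral_risk(losses, spectrum):
    """Empirical spectral risk: a spectrum-weighted average of sorted losses.

    `spectrum` maps quantile levels in (0, 1) to non-negative, non-decreasing
    weights; heavier weight on high quantiles encodes stronger risk aversion.
    """
    losses = np.sort(np.asarray(losses))   # ascending order statistics
    n = len(losses)
    p = (np.arange(n) + 0.5) / n           # midpoint quantile level of each loss
    w = spectrum(p)
    return np.sum(w * losses) / np.sum(w)  # normalized weighted average

# CVaR at level 0.9 = spectral risk with a step spectrum over the worst 10%.
cvar_90 = spectral_risk(np.random.randn(1000), lambda p: (p >= 0.9).astype(float))
```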
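
Entry 9's FBCC builds on knowledge distillation. For context, the generic soft-label distillation objective (Hinton et al., 2015) is sketched below; FBCC's forward-backward teacher arrangement for continual clustering is the paper's contribution and is not reproduced here.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Generic soft-label knowledge distillation: KL divergence between
    temperature-softened teacher and student distributions, scaled by T^2
    so gradient magnitudes stay comparable across temperatures.
    """
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T
```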
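
Entry 10 changes how positive pairs are defined, using generative similarity inside a contrastive objective. The standard InfoNCE loss it builds on is sketched below; the generative-similarity pairing itself is the paper's contribution and is not shown.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Standard InfoNCE: z1[i] and z2[i] are embeddings of two views of the
    same item (shape (n, d)); every other row is treated as a negative.
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                    # (n, n) similarities
    labels = torch.arange(z1.size(0), device=z1.device)   # positives on diagonal
    return F.cross_entropy(logits, labels)
```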
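
Entry 12 targets filtering with non-Gaussian posteriors in nonlinear state-space models. The classical baseline for that setting is the bootstrap particle filter; the sketch below (scalar state; `transition` and `loglik` are hypothetical user-supplied callables) shows the predict-weight-resample loop that a learned filter such as entry 12's replaces.

```python
import numpy as np

def bootstrap_particle_filter(observations, transition, loglik,
                              n_particles=500, rng=None):
    """Bootstrap particle filter for a scalar state-space model.

    transition(x, rng): samples x_t for each particle given x_{t-1}.
    loglik(y_t, x): observation log-likelihood of y_t for each particle.
    Returns the posterior-mean state estimate at each time step.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    x = rng.standard_normal(n_particles)  # particles drawn from an x_0 prior
    means = []
    for y_t in observations:
        x = transition(x, rng)            # predict: propagate particles
        logw = loglik(y_t, x)             # weight: score against observation
        w = np.exp(logw - logw.max())
        w /= w.sum()
        means.append(np.sum(w * x))       # posterior mean under the weights
        x = x[rng.choice(n_particles, n_particles, p=w)]  # resample
    return np.array(means)
```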