| # | Title | Summary | Keywords |
|---|-------|---------|----------|
| 1 | Evolutionary Contrastive Distillation for Language Model Alignment | Proposes Evolutionary Contrastive Distillation (ECD) to improve LLM performance on complex instruction-following tasks. | DPO, contrastive learning, distillation |
| 2 | Large Vision Model-Enhanced Digital Twin with Deep Reinforcement Learning for User Association and Load Balancing in Dynamic Wireless Networks | Proposes a deep reinforcement learning method built on a large vision model-enhanced digital twin to solve user association and load balancing in dynamic wireless networks. | reinforcement learning, deep reinforcement learning (DRL) |
| 3 | VerifierQ: Enhancing LLM Test Time Compute with Q-Learning-based Verifiers | VerifierQ: a Q-learning-based verifier model for enhancing LLM test-time compute. | reinforcement learning, CQL, IQL |
| 4 | COS-DPO: Conditioned One-Shot Multi-Objective Fine-Tuning Framework | Proposes COS-DPO, a conditioned one-shot multi-objective fine-tuning framework for solving multi-objective optimization problems. | DPO, direct preference optimization |
| 5 | Offline Hierarchical Reinforcement Learning via Inverse Optimization | Proposes the OHIO framework, which uses inverse optimization to address the problem of inferring high-level actions in offline hierarchical reinforcement learning. | reinforcement learning, offline reinforcement learning |
| 6 | Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning | Proposes Process Advantage Verifiers (PAVs), which improve LLM reasoning by rewarding progress. | reinforcement learning, large language model |