| 1 |
"So, Tell Me About Your Policy...": Distillation of interpretable policies from Deep Reinforcement Learning agents |
Proposes distilling interpretable reinforcement learning policies via advantage functions, improving applicability in domains such as financial trading. |
reinforcement learning, deep reinforcement learning, DRL
|
|
| 2 |
CTRLS: Chain-of-Thought Reasoning via Latent State-Transition |
Proposes the CTRLS framework, which models chain-of-thought reasoning as latent state transitions to improve the reasoning ability of LLMs. |
reinforcement learning, large language model, chain-of-thought
|
|
| 3 |
Latent Space Data Fusion Outperforms Early Fusion in Multimodal Mental Health Digital Phenotyping Data |
Proposes a latent-space fusion model for depression prediction that outperforms traditional early-fusion approaches. |
predictive model, multimodal
|
|
| 4 |
EXPO: Stable Reinforcement Learning with Expressive Policies |
Proposes EXPO, which achieves stable reinforcement learning with expressive policies. |
reinforcement learning, imitation learning, flow matching
|
|
| 5 |
Quantile Reward Policy Optimization: Alignment with Pointwise Regression and Exact Partition Functions |
Proposes Quantile Reward Policy Optimization (QRPO), enabling offline policy alignment with absolute (pointwise) rewards. |
PPO, DPO, large language model
|
|
| 6 |
Bradley-Terry and Multi-Objective Reward Modeling Are Complementary |
Proposes a joint training framework combining Bradley-Terry and multi-objective reward modeling to improve the generalization and scoring ability of reward models. |
reinforcement learning, RLHF, large language model
|
|
| 7 |
BEAVER: Building Environments with Assessable Variation for Evaluating Multi-Objective Reinforcement Learning |
Proposes the BEAVER framework to address multi-objective reinforcement learning for building energy-efficiency management. |
reinforcement learning, policy learning
|
|
| 8 |
Space-Filling Regularization for Robust and Interpretable Nonlinear State Space Models |
Proposes a space-filling regularization method that improves the robustness and interpretability of nonlinear state space models. |
state space model
|
|
| 9 |
Principled Foundations for Preference Optimization |
Provides principled theoretical foundations for preference optimization, revealing connections between DPO, loss functions, and stochastic choice theory. |
DPO, direct preference optimization
|
|