| # | Title | Summary | Keywords |  |
| --- | --- | --- | --- | --- |
| 1 | A Comparative Study of Deep Reinforcement Learning Models: DQN vs PPO vs A2C | Compares the performance of DQN, PPO, and A2C on the Breakout game as a reference point for game AI. | reinforcement learning, deep reinforcement learning, PPO |  |
| 2 | BOND: Aligning LLMs with Best-of-N Distillation | Proposes BOND, which improves LLM performance by imitating Best-of-N sampling (sketched below the table) while reducing inference-time compute. | reinforcement learning, RLHF, distillation |  |
| 3 | Longhorn: State Space Models are Amortized Online Learners | Longhorn: treats state space models as amortized online learners, improving sequence modeling performance. | Mamba, SSM, state space model |  |
| 4 | Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification | Shows that KL-divergence regularization in RLHF breaks down under heavy-tailed reward misspecification: the catastrophic Goodhart phenomenon. | reinforcement learning, RLHF |  |
| 5 | On Policy Evaluation Algorithms in Distributional Reinforcement Learning | Proposes a new policy evaluation algorithm for distributional reinforcement learning that applies to MDPs with arbitrary probabilistic reward mechanisms. | reinforcement learning, DRL |  |
| 6 | Investigating the Indirect Object Identification circuit in Mamba | Studies the indirect object identification circuit in Mamba, shedding light on its internal mechanisms. | Mamba, SSM |  |
| 7 | Decomposed Direct Preference Optimization for Structure-Based Drug Design | Proposes DecompDPO, which applies multi-granularity preference optimization to diffusion models for structure-based drug design. | DPO, direct preference optimization |  |
| 8 | A Comprehensive Guide to Combining R and Python code for Data Science, Machine Learning and Reinforcement Learning | Uses the reticulate package to combine R and Python efficiently for data science, machine learning, and reinforcement learning. | reinforcement learning |  |
| 9 | OASIS: Conditional Distribution Shaping for Offline Safe Reinforcement Learning | OASIS: a conditional distribution shaping approach to offline safe reinforcement learning. | reinforcement learning |  |
| 10 | Data-Centric Human Preference with Rationales for Direct Preference Alignment | Proposes a data-centric, rationale-based approach to human preference alignment that improves the efficiency of direct preference optimization. | reinforcement learning, preference learning, direct preference optimization |  |
| 11 | L^2CL: Embarrassingly Simple Layer-to-Layer Contrastive Learning for Graph Collaborative Filtering | Proposes L^2CL, an embarrassingly simple layer-to-layer contrastive learning method for graph collaborative filtering that improves recommendation performance. | contrastive learning | ✅ |
| 12 | Towards the Causal Complete Cause of Multi-Modal Representation Learning | Proposes the C³ regularization method, which improves multi-modal representation learning via causal completeness. | representation learning |  |
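For entry 2, a minimal sketch of the Best-of-N sampling procedure that BOND distills: draw N candidate responses and keep the one a reward model scores highest. The `generate` and `reward` callables here are hypothetical stand-ins, not the paper's implementation.

```python
import random

def best_of_n(prompt, generate, reward, n=8):
    """Draw n candidate responses and return the highest-reward one.

    BOND trains the policy to imitate this selection directly, so the
    n-fold sampling cost is not paid at inference time.
    """
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=reward)

# Toy stand-ins: "generation" samples a number, "reward" prefers larger values.
generate = lambda prompt: random.gauss(0.0, 1.0)
reward = lambda response: response
print(best_of_n("example prompt", generate, reward, n=16))
```

Larger n raises the expected reward of the selected sample but costs n generations per prompt; that per-query overhead is the inference compute the paper's distillation aims to remove.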