| 1 |
RLAX: Large-Scale, Distributed Reinforcement Learning for Large Language Models on TPUs |
RLAX: a large-scale distributed reinforcement learning framework for large language models on TPUs |
reinforcement learning large language model |
|
|
| 2 |
Beyond Token-level Supervision: Unlocking the Potential of Decoding-based Regression via Reinforcement Learning |
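Proposes a reinforcement-learning-based approach to decoding-based regression that addresses the mismatch between token-level supervision and the numerical prediction objective |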
reinforcement learning large language model |
|
|
| 3 |
A-3PO: Accelerating Asynchronous LLM Training with Staleness-aware Proximal Policy Approximation |
A-3PO: accelerates asynchronous LLM training by approximating the proximal policy, improving training efficiency. |
reinforcement learning PPO large language model |
✅ |
|
| 4 |
LLM-Upgraded Graph Reinforcement Learning for Carbon-Aware Job Scheduling in Smart Manufacturing |
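Proposes the Luca framework, which uses an LLM to enhance graph reinforcement learning for carbon-aware job scheduling in smart manufacturing. |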
reinforcement learning deep reinforcement learning |
|
|
| 5 |
Predictive Modeling of Flood-Prone Areas Using SAR and Environmental Variables |
Combines SAR and environmental data to build a random-forest-based model for predicting flood-prone areas |
predictive model |
|
|
| 6 |
Why Goal-Conditioned Reinforcement Learning Works: Relation to Dual Control |
Analyzes why goal-conditioned reinforcement learning is effective, based on optimal control theory |
reinforcement learning |
|
|
| 7 |
When Distance Distracts: Representation Distance Bias in BT-Loss for Reward Models |
Proposes NormBT, an adaptive normalization scheme, to address representation distance bias in the BT loss for reward models. |
RLHF large language model |
|
|
| 8 |
Networked Restless Multi-Arm Bandits with Reinforcement Learning |
Proposes a networked RMAB framework to address interaction effects in decision-making |
reinforcement learning |
|
|
| 9 |
Learning When to Switch: Adaptive Policy Selection via Reinforcement Learning |
Proposes a reinforcement-learning-based adaptive policy selection method to address policy switching in complex navigation tasks. |
reinforcement learning |
|
|
| 10 |
Auto-exploration for online reinforcement learning |
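Proposes an auto-exploration algorithm for online reinforcement learning that addresses the exploration-exploitation dilemma and attains optimal policies in a parameter-free manner. |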
reinforcement learning |
|
|
| 11 |
DDFI: Diverse and Distribution-aware Missing Feature Imputation via Two-step Reconstruction |
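Proposes DDFI, which achieves diverse and distribution-aware missing feature imputation via two-step reconstruction, improving graph neural network performance. |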
masked autoencoder MAE |
|
|
| 12 |
Learning Without Time-Based Embodiment Resets in Soft-Actor Critic |
Proposes a continuing SAC algorithm that addresses reinforcement learning's reliance on resets and terminations |
reinforcement learning SAC |
|
|