| 1 |
Actor-Curator: Co-adaptive Curriculum Learning via Policy-Improvement Bandits for RL Post-Training |
Proposes Actor-Curator, which achieves co-adaptive curriculum learning for LLM post-training via a policy-improvement bandit algorithm. |
reinforcement learning curriculum learning large language model |
|
|
| 2 |
Localized Dynamics-Aware Domain Adaption for Off-Dynamics Offline Reinforcement Learning |
Proposes LoDADA, which addresses dynamics mismatch in offline reinforcement learning through localized dynamics-aware domain adaptation. |
reinforcement learning offline RL offline reinforcement learning |
|
|
| 3 |
SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards |
SELAUR: a self-evolving LLM agent based on uncertainty-aware rewards. |
reinforcement learning reward design reward shaping |
|
|
| 4 |
Scaling State-Space Models on Multiple GPUs with Tensor Parallelism |
Proposes a communication-efficient tensor-parallel scheme that accelerates inference of selective state-space models on multiple GPUs. |
Mamba SSM state space model |
|
|
| 5 |
TrajGPT-R: Generating Urban Mobility Trajectory with Reinforcement Learning-Enhanced Generative Pre-trained Transformer |
TrajGPT-R: proposes a reinforcement-learning-enhanced generative pre-trained Transformer for generating urban mobility trajectories. |
reinforcement learning offline reinforcement learning inverse reinforcement learning |
✅ |
|
| 6 |
Test-Time Training with KV Binding Is Secretly Linear Attention |
Reveals the true nature of test-time training with KV binding: it is in fact a linear attention mechanism. |
linear attention |
|
|
| 7 |
Matching Multiple Experts: On the Exploitability of Multi-Agent Imitation Learning |
Proposes a new method for multi-agent imitation learning to address the Nash equilibrium problem. |
imitation learning |
|
|
| 8 |
Rethink Efficiency Side of Neural Combinatorial Solver: An Offline and Self-Play Paradigm |
Proposes ECO, an efficient offline self-play learning paradigm for neural combinatorial optimization. |
DPO direct preference optimization Mamba |
|
|
| 9 |
Fuz-RL: A Fuzzy-Guided Robust Framework for Safe Reinforcement Learning under Uncertainty |
Proposes Fuz-RL, a fuzzy-logic-guided robust reinforcement learning framework that improves safety under uncertainty. |
reinforcement learning |
|
|
| 10 |
A Generalized Apprenticeship Learning Framework for Capturing Evolving Student Pedagogical Strategies |
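Proposes the THEMES framework, which uses generalized apprenticeship learning to capture students' dynamically evolving pedagogical strategies. |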
reinforcement learning deep reinforcement learning DRL |
|
|