| 1 |
RL-Exec: Impact-Aware Reinforcement Learning for Opportunistic Optimal Liquidation, Outperforms TWAP and a Book-Liquidity VWAP on BTC-USD Replays |
RL-Exec: an impact-aware optimal liquidation strategy based on reinforcement learning that outperforms TWAP and Book-Liquidity VWAP |
reinforcement learning PPO TAMP |
|
|
| 2 |
ReSpec: Towards Optimizing Speculative Decoding in Reinforcement Learning Systems |
ReSpec: a framework for optimizing speculative decoding in reinforcement learning systems |
reinforcement learning distillation large language model |
|
|
| 3 |
Limits of Generalization in RLVR: Two Case Studies in Mathematical Reasoning |
Explores the limitations of RLVR in mathematical reasoning and methods for improving it |
reinforcement learning reward design large language model |
✅ |
|
| 4 |
Offline Clustering of Preference Learning with Active-data Augmentation |
Proposes the Off-C$^2$PL and A$^2$-Off-C$^2$PL algorithms to address user clustering and data imbalance in offline preference learning. |
reinforcement learning preference learning |
|
|
| 5 |
Jasmine: A Simple, Performant and Scalable JAX-based World Modeling Codebase |
Jasmine: a simple, high-performance, and scalable JAX-based world modeling codebase |
world model |
|
|
| 6 |
Defeating the Training-Inference Mismatch via FP16 |
Uses FP16 precision to resolve the training-inference mismatch in reinforcement learning fine-tuning of LLMs |
reinforcement learning large language model |
|
|
| 7 |
Bridging the Gap between Empirical Welfare Maximization and Conditional Average Treatment Effect Estimation in Policy Learning |
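Reveals the equivalence between empirical welfare maximization and conditional average treatment effect estimation in policy learning |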
policy learning |
|
|
| 8 |
Data-Efficient RLVR via Off-Policy Influence Guidance |
Proposes CROPI, which uses off-policy influence functions to guide RLVR data selection and improve LLM reasoning. |
reinforcement learning large language model |
|
|
| 9 |
Co-Evolving Latent Action World Models |
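Proposes CoLA-World, which learns a latent action world model through co-evolution, improving video generation quality and visual planning ability. |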
world model |
|
|
| 10 |
Adaptive Context Length Optimization with Low-Frequency Truncation for Multi-Agent Reinforcement Learning |
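Proposes a MARL framework with adaptive context length optimization based on low-frequency truncation to address long-term dependencies. |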
reinforcement learning |
|
|
| 11 |
A Game-Theoretic Spatio-Temporal Reinforcement Learning Framework for Collaborative Public Resource Allocation |
Proposes a game-theoretic spatio-temporal reinforcement learning framework for collaborative public resource allocation |
reinforcement learning |
|
|
| 12 |
Efficient Generative AI Boosts Probabilistic Forecasting of Sudden Stratospheric Warmings |
Proposes FM-Cast, a Flow Matching-based generative AI model for efficient forecasting of sudden stratospheric warmings |
flow matching spatiotemporal |
|
|
| 13 |
Clone Deterministic 3D Worlds |
Proposes the Geometrically Regularized World Model (GRWM) for high-fidelity cloning of deterministic 3D worlds. |
world model contrastive learning |
|
|
| 14 |
Think Outside the Policy: In-Context Steered Policy Optimization |
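Proposes ICPO, which uses in-context learning to steer policy optimization, improving the reasoning of large reasoning models in reinforcement learning with verifiable rewards. |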
reinforcement learning reward shaping |
|
|