| # | Title | Summary | Keywords | |
|---|---|---|---|---|
| 1 | When Learning Rates Go Wrong: Early Structural Signals in PPO Actor-Critic | Proposes an OUI-based analysis of early structural signals in PPO to accelerate hyperparameter search. | reinforcement learning, deep reinforcement learning, PPO | |
| 2 | Good Reasoning Makes Good Demonstrations: Implicit Reasoning Quality Supervision via In-Context Reinforcement Learning | Proposes In-Context RLVR, which improves the reasoning quality of large language models via in-context reinforcement learning. | reinforcement learning, large language model | |
| 3 | Reward-Zero: Language Embedding Driven Implicit Reward Mechanisms for Reinforcement Learning | Proposes Reward-Zero, an implicit reward mechanism for reinforcement learning driven by language embeddings. | reinforcement learning, PPO, reward shaping | |
| 4 | Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards | Proposes the DCPO framework, which decouples reasoning from confidence to improve calibration in reinforcement learning from verifiable rewards. | reinforcement learning, large language model | |
| 5 | A Multi-Prototype-Guided Federated Knowledge Distillation Approach in AI-RAN Enabled Multi-Access Edge Computing System | Proposes a multi-prototype-guided federated knowledge distillation method for AI-RAN-enabled multi-access edge computing systems. | MAE, distillation | |
| 6 | ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning | Proposes ActiveUltraFeedback, which uses active learning to generate preference data efficiently and improve LLM alignment efficiency. | reinforcement learning, RLHF, large language model | ✅ |
| 7 | From Representation to Clusters: A Contrastive Learning Approach for Attributed Hypergraph Clustering | Proposes CAHC, an end-to-end contrastive-learning approach to attributed hypergraph clustering. | representation learning, contrastive learning | |
| 8 | SPAARS: Safer RL Policy Alignment through Abstract Exploration and Refined Exploitation of Action Space | SPAARS: safer RL policy alignment through abstract exploration and refined exploitation of the action space. | reinforcement learning, IQL, curriculum learning | |
| 9 | Task Aware Modulation Using Representation Learning for Upsaling of Terrestrial Carbon Fluxes | Proposes the TAM-RL framework, which uses representation learning to improve the accuracy and generalization of terrestrial carbon flux estimation. | representation learning | |
| 10 | Towards a Neural Debugger for Python | Proposes a neural debugger that simulates debugging operations to enable interactive control over Python code execution. | world model, large language model | |
| 11 | Strategically Robust Multi-Agent Reinforcement Learning with Linear Function Approximation | Proposes the RQRE-OVI algorithm, which improves policy robustness in multi-agent reinforcement learning via risk-sensitive quantal response equilibria. | reinforcement learning | |
| 12 | Wrong Code, Right Structure: Learning Netlist Representations from Imperfect LLM-Generated RTL | Learns netlist representations from imperfect LLM-generated RTL, breaking the data bottleneck in circuit representation learning. | representation learning, large language model | |
| 13 | PPO-Based Hybrid Optimization for RIS-Assisted Semantic Vehicular Edge Computing | Proposes a PPO-based hybrid optimization algorithm to achieve low latency in RIS-assisted semantic vehicular edge computing. | PPO | |
| 14 | Learning Adaptive LLM Decoding | Proposes an adaptive LLM decoding method that uses reinforcement learning to dynamically adjust the sampling strategy for better performance. | reinforcement learning, large language model | |
| 15 | Compiler-First State Space Duality and Portable $O(1)$ Autoregressive Caching for Inference | Uses XLA compiler optimizations to achieve efficient, portable Mamba-2 inference across multiple platforms. | Mamba, SSM | ✅ |