| 43 |
Entropy-Regularized Adjoint Matching for Offline RL |
Proposes Maximum-Entropy Adjoint Matching (ME-AM) to address popularity bias and support constraints in offline reinforcement learning. |
reinforcement learning offline RL offline reinforcement learning |
|
|
| 44 |
Causal Reinforcement Learning for Complex Card Games: A Magic The Gathering Benchmark |
Proposes the MTG-Causal-RL benchmark for evaluating causal reinforcement learning algorithms in complex card games. |
reinforcement learning PPO world model |
|
|
| 45 |
Adaptive Q-Chunking for Offline-to-Online Reinforcement Learning |
Proposes Adaptive Q-Chunking (AQC) to address fixed action-chunk sizes in offline-to-online reinforcement learning. |
reinforcement learning VLA |
|
|
| 46 |
On the Safety of Graph Representation Learning |
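Proposes GRL-Safety, a safety-evaluation benchmark for graph representation learning, revealing reliability problems of existing methods under deployment stress. |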
representation learning foundation model |
✅ |
|
| 47 |
SNAPO: Smooth Neural Adjoint Policy Optimization for Optimal Control via Differentiable Simulation |
SNAPO: smooth neural adjoint policy optimization for optimal control via differentiable simulation. |
reinforcement learning differentiable simulation |
|
|
| 48 |
A Unified Pair-GRPO Family: From Implicit to Explicit Preference Constraints for Stable and General RL Alignment |
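Proposes the Pair-GRPO family, which uses implicit and explicit preference constraints to improve the stability and generality of RLHF alignment. |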
reinforcement learning preference learning RLHF |
|
|
| 49 |
A Flow Matching Algorithm for Many-Shot Adaptation to Unseen Distributions |
Proposes FP-FM, which uses function projection to enable fast many-shot adaptation of generative models to unseen distributions. |
flow matching language conditioned |
|
|
| 50 |
Beyond Autoregressive RTG: Conditioning via Injection Outside Sequential Modeling in Decision Transformer |
SlimDT: injects conditioning information outside the sequence model to improve Decision Transformer efficiency and performance. |
reinforcement learning offline reinforcement learning decision transformer |
|
|
| 51 |
Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level |
Proposes Asymmetric On-Policy Distillation (AOPD) to improve token-level imitation learning on mathematical reasoning tasks. |
reinforcement learning distillation |
|
|
| 52 |
Physical Fidelity Reconstruction via Improved Consistency-Distilled Flow Matching for Dynamical Systems |
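Proposes a consistency-distilled flow matching method that accelerates high-fidelity reconstruction of physical fields in dynamical systems. |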
flow matching distillation |
|
|
| 53 |
Dynamic Treatment on Networks |
Proposes the Q-Ising framework for optimizing dynamic treatment (intervention) policies on networks. |
reinforcement learning offline RL offline reinforcement learning |
|
|
| 54 |
Operator-Guided Invariance Learning for Continuous Reinforcement Learning |
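Proposes VPSD-RL, which uses operator-guided invariance learning to improve the data efficiency and robustness of continuous reinforcement learning. |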
reinforcement learning |
|
|
| 55 |
Flow Matching with Arbitrary Auxiliary Paths |
Proposes AuxPath-FM, which extends flow matching generative models with arbitrary auxiliary paths. |
flow matching |
|
|
| 56 |
Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex |
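Proposes Listwise Policy Optimization (LPO), which improves LLM reasoning while preserving optimization stability and response diversity. |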
reinforcement learning large language model |
|
|
| 57 |
PRISM: Iterative Cross-Modal Posterior Refinement for Dynamic Text-Attributed Graphs |
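Proposes the PRISM framework, which improves representation learning on dynamic text-attributed graphs through iterative cross-modal posterior refinement. |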
representation learning multimodal |
|
|
| 58 |
Normalized Architectures are Natively 4-Bit |
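Shows that normalized architectures such as nGPT natively support 4-bit quantized training, improving the efficiency of large models. |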
Mamba large language model |
✅ |
|
| 59 |
Requests of a Feather Must Flock Together: Batch Size vs. Prefix Homogeneity in LLM Inference |
Feather: a reinforcement-learning-based scheduler that balances batch size against prefix homogeneity in LLM inference. |
reinforcement learning large language model |
|
|
| 60 |
Soft Deterministic Policy Gradient with Gaussian Smoothing |
Proposes Soft Deterministic Policy Gradient (Soft-DPG), based on Gaussian smoothing, to address unstable policy gradients under sparse rewards. |
reinforcement learning deep reinforcement learning |
|
|
| 61 |
Optimal Transport for LLM Reward Modeling from Noisy Preference |
Proposes the SelectiveRM framework, which uses optimal transport to handle noisy preferences in LLM reward modeling. |
reinforcement learning RLHF |
|
|
| 62 |
How to Compress KV Cache in RL Post-Training? Shadow Mask Distillation for Memory-Efficient Alignment |
Proposes Shadow Mask Distillation to address the policy bias caused by KV cache compression during RL post-training. |
reinforcement learning PPO RLHF |
|
|
| 63 |
Offline Reinforcement Learning for Rotation Profile Control in Tokamaks |
Proposes an offline reinforcement learning method for controlling the plasma rotation profile in tokamaks. |
reinforcement learning offline RL offline reinforcement learning |
|
|
| 64 |
Causal-Aware Foundation-Model for Bilevel Optimization in Discrete Choice Settings |
Proposes C3PO, a causal-aware foundation model for bilevel price optimization in discrete choice settings. |
imitation learning foundation model |
|
|
| 65 |
Entropy-Regularized Adjoint Matching for Offline Reinforcement Learning |
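Proposes the Maximum-Entropy Adjoint Matching (ME-AM) framework to address popularity bias and support-set restrictions in offline reinforcement learning. |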
reinforcement learning offline reinforcement learning flow matching |
|
|
| 66 |
$f$-Divergence Regularized RLHF: Two Tales of Sampling and Unified Analyses |
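Proposes a theoretical framework for online RLHF regularized by general f-divergences, with optimal regret bounds and convergence analysis. |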
reinforcement learning RLHF large language model |
|
|
| 67 |
MDN: Parallelizing Stepwise Momentum for Delta Linear Attention |
Proposes Momentum DeltaNet (MDN), which improves linear attention models with a chunkwise-parallel momentum mechanism. |
Mamba linear attention large language model |
✅ |
|
| 68 |
Reward Shaping and Action Masking for Compositional Tasks using Behavior Trees and LLMs |
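Proposes the Masked Reward Behavior Tree (MRBT) framework, which combines LLMs with neuro-symbolic reinforcement learning to solve compositional tasks efficiently. |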
reinforcement learning reward shaping large language model |
|
|
| 69 |
Multi-Dimensional Behavioral Evaluation of Agentic Stock Prediction Systems Using LLM Judges with Closed-Loop Reinforcement Learning Feedback |
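Proposes a behavioral evaluation framework for agentic stock prediction systems based on LLM judges and closed-loop reinforcement learning feedback. |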
reinforcement learning SAC large language model |
|
|
| 70 |
Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators |
Proposes the Echo architecture, which uses spectral Koopman operators for KV-cache-free associative recall. |
Mamba SSM chain-of-thought |
|
|
| 71 |
Revisiting Adam for Streaming Reinforcement Learning |
Revisits the Adam optimizer in streaming reinforcement learning and proposes Adaptive Q(λ) for efficient online learning. |
reinforcement learning deep reinforcement learning |
|
|
| 72 |
Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level |
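Proposes Asymmetric On-Policy Distillation (AOPD), which uses token-level feedback to address training bottlenecks in reinforcement learning. |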
reinforcement learning distillation |
|
|
| 73 |
Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing |
Proposes Near-Policy Distillation, which accelerates knowledge distillation for autoregressive models while mitigating distribution mismatch. |
reinforcement learning distillation |
|
|
| 74 |
RepFlow: Representation Enhanced Flow Matching for Causal Effect Estimation |
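Proposes the RepFlow framework, which estimates causal effects by combining representation enhancement with conditional flow matching. |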
flow matching representation learning |
|
|
| 75 |
AeroJEPA: Learning Semantic Latent Representations for Scalable 3D Aerodynamic Field Modeling |
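Proposes the AeroJEPA architecture, which uses joint-embedding prediction for scalable 3D aerodynamic field modeling and semantic representation learning. |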
Joint-Embedding Predictive Architecture joint-embedding predictive architecture latent optimization |
|
|
| 76 |
A Unified Measure-Theoretic View of Diffusion, Score-Based, and Flow Matching Generative Models |
Proposes a unified measure-theoretic framework for analyzing diffusion, score-based, and flow matching generative models. |
flow matching |
|
|
| 77 |
Towards Differentially Private Reinforcement Learning with General Function Approximation |
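Proposes the first theoretical framework for differentially private online reinforcement learning with general function approximation. |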
reinforcement learning |
|
|
| 78 |
Adaptive Memory Decay for Log-Linear Attention |
Proposes an adaptive memory decay mechanism that improves long-range context modeling in log-linear attention models. |
linear attention |
|
|
| 79 |
Physics-Based Flow Matching for Full-Field Prediction of Silicon Photonic Devices |
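Proposes PIC-Flow, a generative neural surrogate model that uses physics-constrained flow matching to predict full-field electromagnetic fields in silicon photonic devices. |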
flow matching |
|
|
| 80 |
Gradient Extrapolation-Based Policy Optimization |
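Proposes Gradient Extrapolation-Based Policy Optimization (GXPO), which uses efficient gradient prediction to improve reinforcement learning for LLM reasoning. |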
reinforcement learning large language model |
|
|
| 81 |
Beyond Uniform Credit Assignment: Selective Eligibility Traces for RLVR |
Proposes selective eligibility traces (S-trace), which use fine-grained credit assignment to improve reasoning in RLVR. |
reinforcement learning large language model |
|
|
| 82 |
FedeKD: Energy-Based Gating for Robust Federated Knowledge Distillation under Heterogeneous Settings |
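Proposes the FedeKD framework, which uses energy-based gating to address negative transfer in heterogeneous federated knowledge distillation. |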
distillation |
|
|
| 83 |
Semantic State Abstraction Interfaces for LLM-Augmented Portfolio Decisions: Multi-Axis News Decomposition and RL Diagnostics |
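Proposes the Semantic State Abstraction Interface (SSAI) framework, which uses multi-axis news decomposition to enable interpretable diagnostics for LLM-augmented portfolio decisions. |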
PPO SAC |
|
|
| 84 |
Measuring Learning Progress via Gradient-Momentum Coupling |
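Proposes Gradient-Momentum Coupling (GMC), which quantifies learning progress from optimization dynamics to improve exploration efficiency in reinforcement learning. |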
reinforcement learning curriculum learning |
|
|