| 1 | JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning | Proposes JEDI, a joint embedding diffusion world model for online model-based reinforcement learning. | reinforcement learning, world model, world models | |
| 2 | Learning POMDP World Models from Observations with Language-Model Priors | Pinductor: efficiently learns partially observable Markov decision process (POMDP) world models using language-model prior knowledge. | world model, world models, generalist agent | ✅ |
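| 3 | Bridging Domain Gaps with Target-Aligned Generation for Offline Reinforcement Learning | Proposes the TCE framework, which bridges cross-domain gaps in offline reinforcement learning through target-aligned generation. | reinforcement learning, offline RL, offline reinforcement learning | |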
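| 4 | Trajectory-Level Data Augmentation for Offline Reinforcement Learning | Proposes a trajectory-level data augmentation method that improves offline reinforcement learning performance on active localization problems. | reinforcement learning, offline reinforcement learning | |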
| 5 | Dynamical Predictive Modelling of Cardiovascular Disease Progression Post-Myocardial Infarction via ECG-Trained Artificial Intelligence Model | Proposes an AI model trained on electrocardiograms (ECGs) for dynamic prediction of cardiovascular disease after myocardial infarction. | predictive model, contrastive learning, foundation model | |
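| 6 | Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization | Proposes RDPO, which optimizes multi-objective, mixed-reward reinforcement learning by decorrelating rewards, improving instruction following and writing quality. | reinforcement learning, instruction following | |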
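| 7 | MARLIN: Multi-Agent Game-Theoretic Reinforcement Learning for Sustainable LLM Inference in Cloud Datacenters | Proposes MARLIN, which uses multi-agent game-theoretic reinforcement learning to optimize the energy consumption and latency of LLM inference in cloud datacenters. | reinforcement learning, large language model | |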
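| 8 | Teacher-Guided Policy Optimization for LLM Distillation | Proposes TGPO, an algorithm that uses teacher-guided policy optimization to address the negative-feedback problem in LLM distillation. | reinforcement learning, imitation learning, distillation | |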
| 9 | Coreset-Induced Conditional Velocity Flow Matching | Proposes coreset-induced conditional velocity flow matching (CCVFM) to improve generative model performance. | flow matching, multimodal | |
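| 10 | Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation | Proposes Contrastive Proximal Policy Optimisation (CPPO), enabling self-supervised on-policy reinforcement learning without a reward function. | reinforcement learning, PPO | |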
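| 11 | ERPPO: Entropy Regularization-based Proximal Policy Optimization | Proposes ERPPO, an entropy-regularization-based proximal policy optimization algorithm that addresses MAPPO policy optimization in multi-dimensional environments. | reinforcement learning, PPO, spatiotemporal | |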
| 12 | CO-MAP: A Reinforcement Learning Approach to the Qubit Allocation Problem | Proposes CO-MAP to solve the qubit allocation problem. | reinforcement learning | |
| 13 | HLS-Seek: QoR-Aware Code Generation for High-Level Synthesis via Proxy Comparative Reward Reinforcement Learning | HLS-Seek: QoR-aware code generation for high-level synthesis based on proxy comparative reward reinforcement learning. | reinforcement learning | |
| 14 | Reward-Weighted On-Policy Distillation with an Open Property-Equivalence Verifier for NL-to-SVA Generation | Proposes a reward-weighted on-policy distillation method that improves property equivalence in NL-to-SVA generation. | distillation | |
| 15 | Path-independent Flow Matching for Multi-parameter Generative Dynamics | Proposes path-independent flow matching (PiFM) for learning path-independent transformations in multi-parameter generative dynamics. | flow matching | |
| 16 | OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention | OSDN: improves the delta rule in linear attention through provable online preconditioning. | linear attention | |
| 17 | Twincher: Bijective Representation Learning for Robust Inversion of Continuous Systems | Proposes Twincher to solve robust inverse problems for continuous systems. | representation learning | |
| 18 | Q-Flow: Stable and Expressive Reinforcement Learning with Flow-Based Policy | Proposes Q-Flow, which uses flow-based models for stable and expressive policy optimization in reinforcement learning. | reinforcement learning | |
| 19 | Support-Conditioned Flow Matching Is Kernel Smoothing | Shows that conditioned flow matching amounts to kernel smoothing, and uses Gaussian-kernel attention for efficient conditional generation. | flow matching | |
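| 20 | Switching Successor Measures for Hierarchical Zero-shot Reinforcement Learning | Proposes a hierarchical zero-shot reinforcement learning method based on switching successor measures that requires no additional supervision. | reinforcement learning | ✅ |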
| 21 | Stable Attention Response for Reliable Precipitation Nowcasting | HARECast: achieves more reliable precipitation nowcasting through stable attention responses. | representation learning, multimodal | |
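| 22 | On the Generalization of Knowledge Distillation: An Information-Theoretic View | Analyzes the generalization of knowledge distillation from an information-theoretic perspective and derives corresponding generalization bounds. | distillation | |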
| 23 | Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy | Shows that sycophancy in multi-agent systems is not caused by RLHF alone, and proposes activation-space interventions to mitigate it. | RLHF | |
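| 24 | Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective | Proposes the ConSPO framework, which uses contrastive learning to optimize LLM reasoning under RLVR and significantly improves mathematical reasoning performance. | reinforcement learning | |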
| 25 | Achieving $ε^{-2}$ Sample Complexity for Single-Loop Actor-Critic under Minimal Assumptions | Shows that a single-loop actor-critic algorithm achieves $ε^{-2}$ sample complexity under minimal assumptions. | reinforcement learning, policy learning | |
| 26 | SpikeProphecy: A Large-Scale Benchmark for Autoregressive Neural Population Forecasting | SpikeProphecy: a large-scale benchmark for autoregressive neural population forecasting. | SSM, distillation | |