| 12 |
EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields |
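Proposes EA-WM, an event-aware generative world model that addresses the alignment of precise control with visual perception in robotic manipulation. |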
policy learning world model world models |
|
|
| 13 |
Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models |
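Examines the choice of latent space for robotic world models and argues that semantically aligned representations outperform reconstruction-based ones. |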
world model world models JEPA |
|
|
| 14 |
Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement |
Proposes NOVA, a weight-space world model with latent structural disentanglement for controllable video prediction. |
world model world models latent dynamics |
|
|
| 15 |
Closing the Loop: Unified 3D Scene Generation and Immersive Interaction via LLM-RL Coupling |
Proposes a unified framework based on LLM-RL coupling that closes the loop between 3D scene generation and immersive interaction. |
reinforcement learning large language model |
✅ |
|
| 16 |
DINORANKCLIP: DINOv3 Distillation and Injection for Vision-Language Pretraining with High-Order Ranking Consistency |
DINORANKCLIP: vision-language pretraining via DINOv3 distillation and high-order ranking consistency. |
distillation |
|
|
| 17 |
HumanNet: Scaling Human-centric Video Learning to One Million Hours |
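Proposes HumanNet, a large-scale human-centric video corpus whose massive interaction data supports the training of embodied-intelligence models. |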
representation learning motion generation human-object interaction |
|
|
| 18 |
Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement |
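Proposes the NOVA world model framework, which uses weight-space implicit neural representations to achieve structural disentanglement and efficient video prediction. |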
world model world models latent dynamics |
|
|
| 19 |
VISD: Enhancing Video Reasoning via Structured Self-Distillation |
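Proposes VISD, a structured self-distillation framework that uses multi-dimensional diagnostic feedback to improve the reasoning ability and training efficiency of large video models. |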
reinforcement learning distillation privileged information |
|
|
| 20 |
Not All Tokens Need 40 Steps: Heterogeneous Step Allocation in Diffusion Transformers for Efficient Video Generation |
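Proposes the Heterogeneous Step Allocation (HSA) algorithm, which achieves efficient video generation by dynamically adjusting the denoising steps. |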
flow matching spatiotemporal |
✅ |
|