| 1 |
HQ-JEPA: Hybrid Quantum Joint-Embedding Predictive Architecture for Cross-Modal Remote Sensing Representation Learning |
Proposes HQ-JEPA, a hybrid quantum joint-embedding predictive architecture for cross-modal remote sensing representation learning |
JEPA Joint-Embedding Predictive Architecture joint-embedding predictive architecture |
|
|
| 2 |
iVGR: Internalizing Visually Grounded Reasoning for MLLMs with Reinforcement Learning |
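Proposes iVGR, which uses reinforcement learning to internalize visual grounding into the textual reasoning of multimodal large language models |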
reinforcement learning large language model multimodal |
|
|
| 3 |
VolFill: Single-View Amodal 3D Scene Reconstruction with Volumetric Flow Matching |
VolFill: single-view amodal 3D scene reconstruction using volumetric flow matching |
flow matching scene reconstruction foundation model |
|
|
| 4 |
Learning from Fine-Grained Visual Discrepancies: Mitigating Multimodal Hallucinations via In-Context Visual Contrastive Optimization |
Proposes IC-VCO, which mitigates multimodal hallucinations in vision-language models through in-context visual contrastive optimization |
DPO direct preference optimization distillation |
✅ |
|
| 5 |
DriveMA: Driving Vision-Language-Action Models with verifiable Meta-Actions |
DriveMA: driving vision-language-action models for autonomous driving with verifiable meta-actions |
reinforcement learning vision-language-action VLA |
|
|
| 6 |
Light Interaction: Training-Free Inference Acceleration for Interactive Video World Models |
Light Interaction: training-free inference acceleration for interactive video world models |
world model world models latent dynamics |
|
|
| 7 |
Attend to Evidence: Evidence-Anchored Spatial Attention Supervision for Multimodal RLVR |
Proposes EASE, which improves multimodal RLVR performance through evidence-anchored spatial attention supervision |
reinforcement learning multimodal visual grounding |
|
|
| 8 |
Task-Focused Memorization for Multimodal Agents |
Proposes TaskMem, a reinforcement-learning-based framework for learning task-focused memorization policies for multimodal agents |
reinforcement learning policy learning multimodal |
|
|
| 9 |
Astra: a generalizable report generation foundation model for 3D computed tomography |
Astra: a generalizable foundation model for 3D CT report generation that improves diagnostic efficiency and accuracy |
reinforcement learning foundation model |
|
|
| 10 |
Robust Dreamer: Deviation-Aware Latent Gaussian Memory for Action-Controlled AR Video Generation |
Robust Dreamer: proposes deviation-aware latent Gaussian memory for action-controlled AR video generation |
dreamer gaussian splatting splatting |
|
|
| 11 |
HiERO-StepG @ Ego4D Step Grounding Challenge: hierarchical activity understanding enables zero-shot step grounding |
HiERO-StepG: uses hierarchical activity understanding to achieve zero-shot step grounding on Ego4D |
representation learning Ego4D |
✅ |
|
| 12 |
Detect in Any Scene: An Agentic Framework for Object Detection with Experience-Aware Reasoning |
Proposes DetAS, an agent-based object detection framework with experience-aware reasoning that improves detection performance in complex scenes. |
representation learning large language model multimodal |
|
|
| 13 |
NTR: Neural Token Reconstruction for Scene Token Bottleneck in End-to-End Driving |
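Proposes Neural Token Reconstruction (NTR), which strengthens the visual representation capacity of scene tokens in end-to-end autonomous driving. |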
representation learning distillation foundation model |
|
|
| 14 |
CoFiDA-M: Concept-Aware Feature Modulation for Cross-Domain Adaptation with Image-Only Inference |
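Proposes CoFiDA-M, which uses concept-aware feature modulation for cross-domain image adaptation, addressing deployment challenges in skin cancer screening. |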
distillation privileged information foundation model |
✅ |
|
| 15 |
Equivariant Latent Alignment via Flow Matching under Group Symmetries |
Proposes Residual Latent Flow to address equivariant latent-space alignment under group symmetries, improving novel view synthesis quality. |
flow matching representation learning |
|
|
| 16 |
PEEK: Picking Essential frames via Efficient Knowledge distillation |
PEEK: selects key video frames via efficient knowledge distillation, improving video captioning efficiency. |
distillation |
✅ |
|
| 17 |
GUI-C$^2$: Coarse-to-Fine GUI Grounding via Difficulty-Aware Reinforcement Learning |
Proposes GUI-C$^2$, which achieves precise GUI element grounding through difficulty-aware reinforcement learning |
reinforcement learning |
|
|
| 18 |
DecMem: Towards Minute-Long Consistent World Generation with Decoupled Memory |
Proposes DecMem, which achieves minute-long consistent world generation through decoupled memory. |
world model world models |
|
|
| 19 |
Remembering by Reconstructing: Domain Incremental Learning With Test-Time Training on Video Streams |
Proposes a domain-incremental learning method based on test-time training that addresses catastrophic forgetting on video streams. |
masked autoencoder MAE |
|
|