| 23 |
Mobile-GS: Real-time Gaussian Splatting for Mobile Devices |
Mobile-GS: high-quality real-time Gaussian splatting rendering for mobile devices |
distillation 3D gaussian splatting 3DGS |
|
|
| 24 |
LatentGeo: Learnable Auxiliary Constructions in Latent Space for Multimodal Geometric Reasoning |
LatentGeo: improves multimodal geometric reasoning through learnable auxiliary constructions in latent space |
reinforcement learning spatial relationship large language model |
|
|
| 25 |
Multimodal Emotion Recognition via Bi-directional Cross-Attention and Temporal Modeling |
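Proposes a multimodal emotion recognition framework based on bi-directional cross-attention and temporal modeling, improving emotion recognition performance on in-the-wild videos. |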
representation learning motion prediction multimodal |
|
|
| 26 |
O3N: Omnidirectional Open-Vocabulary Occupancy Prediction |
O3N: an omnidirectional, open-vocabulary 3D occupancy prediction framework |
world model Mamba open-vocabulary |
✅ |
|
| 27 |
Hoi3DGen: Generating High-Quality Human-Object-Interactions in 3D |
Hoi3DGen: generates high-quality 3D human-object interaction models, significantly improving text consistency and model quality. |
distillation human-object interaction large language model |
|
|
| 28 |
EgoIntent: An Egocentric Step-level Benchmark for Understanding What, Why, and Next |
EgoIntent: a step-level benchmark for understanding intent in egocentric videos |
imitation learning egocentric large language model |
|
|
| 29 |
IDRL: An Individual-Aware Multimodal Depression-Related Representation Learning Framework for Depression Diagnosis |
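Proposes the IDRL framework, which addresses individual differences and cross-modal inconsistency in multimodal depression diagnosis. |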
representation learning multimodal |
|
|
| 30 |
Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation |
FIRM: faithful image editing and generation through robust reward modeling and reinforcement learning |
reinforcement learning instruction following |
|
|
| 31 |
Risk-Controllable Multi-View Diffusion for Driving Scenario Generation |
Proposes RiskMV-DPO for risk-controllable multi-view driving scenario generation |
DPO direct preference optimization world model |
|
|
| 32 |
Beyond Single-Sample: Reliable Multi-Sample Distillation for Video Understanding |
Proposes the R-MSD framework, which improves the reliability of LVLMs in video understanding through multi-sample distillation. |
distillation multimodal |
|
|
| 33 |
SVLL: Staged Vision-Language Learning for Physically Grounded Embodied Task Planning |
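Proposes the SVLL framework, which addresses temporal binding and physical-constraint violations by vision-language models in embodied task planning. |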
reinforcement learning DPO direct preference optimization |
|
|
| 34 |
Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing |
AutoGaze: efficient and scalable video understanding through autoregressive gazing |
reinforcement learning spatiotemporal large language model |
✅ |
|
| 35 |
Detect Anything in Real Time: From Single-Prompt Segmentation to Multi-Class Detection |
DART: a real-time, training-free general object detection framework that significantly accelerates SAM3 inference. |
distillation open-vocabulary open vocabulary |
✅ |
|
| 36 |
DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning |
DreamVideo-Omni: omni-motion controlled multi-subject video customization based on latent identity reinforcement learning |
reinforcement learning |
|
|
| 37 |
Linking Perception, Confidence and Accuracy in MLLMs |
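Proposes confidence-driven reinforcement learning and test-time scaling to address confidence calibration in multimodal large language models |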
reinforcement learning large language model |
|
|
| 38 |
Towards High-Fidelity CAD Generation via LLM-Driven Program Generation and Text-Based B-Rep Primitive Grounding |
FutureCAD: a high-fidelity CAD generation framework based on LLM-driven program generation and text-based B-Rep primitive grounding |
reinforcement learning large language model |
|
|
| 39 |
InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model |
InSpatio-WorldFM: an open-source real-time generative frame model for low-latency spatial intelligence |
world model distillation |
|
|
| 40 |
Unleashing Video Language Models for Fine-grained HRCT Report Generation |
Proposes the AbSteering framework, which uses video language models for fine-grained HRCT report generation. |
direct preference optimization foundation model chain-of-thought |
|
|
| 41 |
CalliMaster: Mastering Page-level Chinese Calligraphy via Layout-guided Spatial Planning |
CalliMaster: masters page-level Chinese calligraphy generation through layout-guided spatial planning |
flow matching multimodal |
|
|