| 1 |
Toward Cognitive Supersensing in Multimodal Large Language Model |
Proposes a Cognitive Supersensing training paradigm that improves multimodal large language models on complex cognitive tasks. |
reinforcement learning open-vocabulary open vocabulary |
|
|
| 2 |
UniDriveDreamer: A Single-Stage Multimodal World Model for Autonomous Driving |
UniDriveDreamer: a single-stage multimodal world model for autonomous driving |
world model dreamer multimodal |
|
|
| 3 |
ClueTracer: Question-to-Vision Clue Tracing for Training-Free Hallucination Suppression in Multimodal Reasoning |
Proposes ClueTracer, which suppresses hallucinations in multimodal reasoning without any training |
Eureka multimodal visual grounding |
|
|
| 4 |
DenVisCoM: Dense Vision Correspondence Mamba for Efficient and Real-time Optical Flow and Stereo Estimation |
Proposes the DenVisCoM Mamba module and a hybrid architecture for efficient, real-time optical flow and stereo matching estimation |
Mamba optical flow |
✅ |
|
| 5 |
VQ-Style: Disentangling Style and Content in Motion with Residual Quantized Representations |
Proposes VQ-Style, a framework based on residual quantized representations that disentangles style and content in human motion data |
contrastive learning VQ-VAE human motion |
|
|
| 6 |
Unified Personalized Reward Model for Vision Generation |
Proposes UnifiedReward-Flex to improve the performance of personalized reward models for vision generation. |
reinforcement learning DPO direct preference optimization |
|
|
| 7 |
Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation |
Causal Forcing: high-quality, real-time interactive video generation via autoregressive diffusion distillation |
distillation instruction following |
✅ |
|
| 8 |
One Size, Many Fits: Aligning Diverse Group-Wise Click Preferences in Large-Scale Advertising Image Generation |
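Proposes the OSMF framework, which aligns the click preferences of different user groups in large-scale advertising image generation. |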
DPO large language model multimodal |
✅ |
|
| 9 |
Contribution-aware Token Compression for Efficient Video Understanding via Reinforcement Learning |
Proposes CaCoVID, which uses reinforcement learning for contribution-aware token compression to make video understanding more efficient. |
reinforcement learning large language model |
|
|
| 10 |
Enhancing Indoor Occupancy Prediction via Sparse Query-Based Multi-Level Consistent Knowledge Distillation |
Proposes DiScene to address the trade-off between efficiency and accuracy in indoor occupancy prediction |
distillation feature matching |
✅ |
|
| 11 |
Teacher-Guided Student Self-Knowledge Distillation Using Diffusion Model |
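Proposes DSKD, a diffusion-model-based method for teacher-guided student self-knowledge distillation that addresses the mismatch between teacher and student feature distributions. |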
teacher-student distillation |
|
|
| 12 |
SMTrack: State-Aware Mamba for Efficient Temporal Modeling in Visual Tracking |
Proposes SMTrack, which uses a state-aware Mamba model for efficient temporal modeling in visual tracking |
Mamba state space model |
|
|
| 13 |
Know Your Step: Faster and Better Alignment for Flow Matching Models via Step-aware Advantages |
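Proposes the TAFS GRPO framework, which speeds up the alignment of flow matching models with human preferences and improves few-step text-to-image generation quality. |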
reinforcement learning flow matching |
|
|
| 14 |
HandMCM: Multi-modal Point Cloud-based Correspondence State Space Model for 3D Hand Pose Estimation |
Proposes HandMCM, which uses multimodal point clouds and Correspondence Mamba to address occlusion in 3D hand pose estimation |
Mamba state space model |
|
|
| 15 |
Infinite-World: Scaling Interactive World Models to 1000-Frame Horizons via Pose-Free Hierarchical Memory |
Infinite-World: scaling interactive world models to 1000 frames via pose-free hierarchical memory |
world model |
|
|
| 16 |
LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization |
LongVPO: optimizes long-form video preferences through self-reasoning, without requiring long-video annotations. |
direct preference optimization large language model |
|
|
| 17 |
GPD: Guided Progressive Distillation for Fast and High-Quality Video Generation |
Proposes the Guided Progressive Distillation (GPD) framework to accelerate diffusion models for high-quality video generation. |
distillation |
|
|
| 18 |
Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention |
Proposes TempCache, AnnCA, and AnnSA to speed up inference of autoregressive video diffusion models and reduce GPU memory usage. |
world model |
|
|
| 19 |
Research on World Models Is Not Merely Injecting World Knowledge into Specific Tasks |
Proposes a unified design specification for world models to overcome the task-level fragmentation of existing methods. |
world model |
|
|
| 20 |
Samba+: General and Accurate Salient Object Detection via A More Unified Mamba-based Framework |
Proposes Samba+, a general Mamba-based salient object detection framework applicable to a variety of SOD tasks. |
Mamba |
|
|
| 21 |
Rotation-free Online Handwritten Character Recognition Using Linear Recurrent Units |
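Proposes a rotation-free online handwritten character recognition framework based on SW-PS and LRU that improves robustness to rotation |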
SSM state space model |
|
|