cs.CV (2025-12-14)

📊 26 papers in total | 🔗 5 with code

🎯 Browse by interest area

Pillar 9: Embodied Foundation Models (13 🔗2) · Pillar 2: RL & Architecture (6 🔗2) · Pillar 4: Generative Motion (2 🔗1) · Pillar 8: Physics-based Animation (2) · Pillar 7: Motion Retargeting (1) · Pillar 1: Robot Control (1) · Pillar 3: Perception & Semantics (1)

🔬 Pillar 9: Embodied Foundation Models (13 papers)

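| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | Adapting Multimodal Foundation Models for Few-Shot Learning: A Comprehensive Study on Contrastive Captioners | Studies how the Contrastive Captioner (CoCa) adapts to few-shot learning and proposes optimization strategies. | foundation model, multimodal, zero-shot transfer | |
| 2 | Vision-Enhanced Large Language Models for High-Resolution Image Synthesis and Multimodal Data Interpretation | Proposes a vision-enhanced LLM framework for high-resolution image synthesis and multimodal data interpretation. | large language model, multimodal | |
| 3 | Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space | Proposes the DMLR framework, which improves MLLM reasoning and perception through dynamic multimodal reasoning in latent space. | large language model, multimodal, chain-of-thought | |
| 4 | DL$^3$M: A Vision-to-Language Framework for Expert-Level Medical Reasoning through Deep Learning and Large Language Models | DL$^3$M: a vision-to-language framework that combines deep learning with large language models for expert-level medical reasoning. | large language model | |
| 5 | Lemon: A Unified and Scalable 3D Multimodal Model for Universal Spatial Understanding | Lemon: a unified, scalable 3D multimodal model for universal spatial understanding. | multimodal | |
| 6 | DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning | DrivePI: a spatial-aware 4D MLLM for unified autonomous driving understanding, perception, prediction, and planning. | vision-language-action, VLA, large language model | |
| 7 | CoRe3D: Collaborative Reasoning as a Foundation for 3D Intelligence | Proposes CoRe3D to address insufficient reasoning in 3D intelligence. | multimodal, chain-of-thought | |
| 8 | FysicsWorld: A Unified Full-Modality Benchmark for Any-to-Any Understanding, Generation, and Reasoning | FysicsWorld: the first unified full-modality benchmark, supporting any-to-any understanding, generation, and reasoning. | large language model, multimodal | |
| 9 | Efficient Vision-Language Reasoning via Adaptive Token Pruning | Proposes Adaptive Token Pruning (ATP) to accelerate inference in vision-language models. | multimodal, visual grounding | |
| 10 | Complex Mathematical Expression Recognition: Benchmark, Large-Scale Dataset and Strong Baseline | Proposes CMER-Bench, a large-scale dataset, and CMERNet to improve complex mathematical expression recognition. | large language model, multimodal | |
| 11 | StreamingAssistant: Efficient Visual Token Pruning for Accelerating Online Video Understanding | StreamingAssistant: efficient visual token pruning to accelerate online video understanding. | large language model, multimodal | |
| 12 | SignRAG: A Retrieval-Augmented System for Scalable Zero-Shot Road Sign Recognition | SignRAG: a scalable retrieval-augmented system for zero-shot road sign recognition. | large language model | |
| 13 | JointAVBench: A Benchmark for Joint Audio-Visual Reasoning Evaluation | Proposes the JointAVBench benchmark for evaluating Omni-LLMs on joint audio-visual reasoning. | large language model | |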

🔬 Pillar 2: RL & Architecture (6 papers)

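| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 14 | DiG: Differential Grounding for Enhancing Fine-Grained Perception in Multimodal Large Language Model | DiG: differential grounding to enhance fine-grained perception in multimodal large language models. | curriculum learning, large language model, multimodal | |
| 15 | Supervised Contrastive Frame Aggregation for Video Representation Learning | Proposes a supervised contrastive frame aggregation method for efficient video representation learning. | representation learning, contrastive learning | |
| 16 | GenieDrive: Towards Physics-Aware Driving World Model with 4D Occupancy Guided Video Generation | GenieDrive: proposes a physics-aware driving world model guided by 4D occupancy for high-quality driving video generation. | world model | |
| 17 | $β$-CLIP: Text-Conditioned Contrastive Learning for Multi-Granular Vision-Language Alignment | Proposes β-CLIP, which achieves multi-granular vision-language alignment through text-conditioned contrastive learning. | contrastive learning | |
| 18 | CogDoc: Towards Unified thinking in Documents | CogDoc: proposes a unified document understanding framework that addresses scalability and detail fidelity in long-document processing. | reinforcement learning, multimodal | |
| 19 | Animus3D: Text-driven 3D Animation via Motion Score Distillation | Animus3D: proposes a text-driven 3D animation generation framework based on motion score distillation. | distillation | |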

🔬 Pillar 4: Generative Motion (2 papers)

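| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 20 | Robust Motion Generation using Part-level Reliable Data from Videos | Proposes a robust motion generation method based on reliable part-level data from videos, addressing the problem of missing data. | motion generation, human motion, character animation | |
| 21 | InteracTalker: Prompt-Based Human-Object Interaction with Co-Speech Gesture Generation | InteracTalker: proposes a prompt-based framework for human-object interaction with co-speech gesture generation. | text-to-motion, human-object interaction, motion adaptation | |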

🔬 Pillar 8: Physics-based Animation (2 papers)

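| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 22 | Generative Spatiotemporal Data Augmentation | Proposes a spatiotemporal data augmentation method based on video diffusion models to improve model performance in low-data settings. | spatiotemporal, foundation model | |
| 23 | StegaVAR: Privacy-Preserving Video Action Recognition via Steganographic Domain Analysis | StegaVAR: proposes a privacy-preserving video action recognition framework based on steganographic domain analysis. | spatiotemporal | |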

🔬 Pillar 7: Motion Retargeting (1 paper)

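| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 24 | Content-Aware Ad Banner Layout Generation with Two-Stage Chain-of-Thought in Vision Language Models | Proposes a content-aware ad banner layout generation method based on vision language models and a two-stage chain of thought. | spatial relationship, chain-of-thought | |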

🔬 Pillar 1: Robot Control (1 paper)

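| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 25 | D3D-VLP: Dynamic 3D Vision-Language-Planning Model for Embodied Grounding and Navigation | Proposes the D3D-VLP model for 3D vision-language-planning tasks in embodied environments, achieving stronger reasoning and navigation. | manipulation, mobile manipulation, chain-of-thought | |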

🔬 Pillar 3: Perception & Semantics (1 paper)

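| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 26 | Fast 2DGS: Efficient Image Representation with Deep Gaussian Prior | Fast-2DGS: efficient image representation using a deep Gaussian prior. | gaussian splatting, splatting | |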
