cs.CV (2026-01-15)

📊 23 papers in total | 🔗 3 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (9 🔗1) | Pillar 3: Perception & Semantics (4 🔗1) | Pillar 2: RL & Architecture (4) | Pillar 1: Robot Control (3) | Pillar 6: Video Extraction (1 🔗1) | Pillar 8: Physics-based Animation (1) | Other (1)

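🔬 Pillar 9: Embodied Foundation Models (9 papers)

#	Title	Key point	Tags	🔗	⭐
1	DR$^2$Seg: Decomposed Two-Stage Rollouts for Efficient Reasoning Segmentation in Multimodal Large Language Models	Proposes the DR$^2$Seg framework, improving the efficiency and accuracy of multimodal large language models on reasoning segmentation.	large language model multimodal
2	Handling Missing Modalities in Multimodal Survival Prediction for Non-Small Cell Lung Cancer	Proposes a missingness-aware multimodal survival prediction framework to address missing data in non-small cell lung cancer.	foundation model multimodal
3	ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding	ROMA: a real-time omni-multimodal assistant for interactive streaming understanding.	large language model multimodal
4	See Less, Drive Better: Generalizable End-to-End Autonomous Driving via Foundation Models Stochastic Patch Selection	Proposes a generalizable end-to-end autonomous driving method based on stochastic patch selection, improving generalization and efficiency.	foundation model
5	Hierarchical Refinement of Universal Multimodal Attacks on Vision-Language Models	Proposes HRA, a hierarchically refined universal multimodal attack framework, improving the robustness of vision-language models.	multimodal
6	Advancing Adaptive Multi-Stage Video Anomaly Reasoning: A Benchmark Dataset and Method	Proposes the video anomaly reasoning task and dataset, and designs the adaptive multi-stage reasoning model Vad-R1-Plus.	large language model multimodal chain-of-thought
7	V-Zero: Self-Improving Multimodal Reasoning with Zero Annotation	V-Zero: a self-improving multimodal reasoning framework based on unannotated data.	multimodal	✅
8	VERHallu: Evaluating and Mitigating Event Relation Hallucination in Video Large Language Models	Proposes the VERHallu evaluation benchmark and designs the KFP strategy to mitigate event relation hallucination in video large language models.	large language model
9	Fine-Grained Human Pose Editing Assessment via Layer-Selective MLLMs	Proposes a fine-grained human pose editing assessment method based on layer-selective multimodal large language models.	large language model multimodal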

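🔬 Pillar 3: Perception & Semantics (4 papers)

#	Title	Key point	Tags	🔗	⭐
10	Thinking Like Van Gogh: Structure-Aware Style Transfer via Flow-Guided 3D Gaussian Splatting	Proposes a structure-aware style transfer method based on flow-guided 3D Gaussian splatting, achieving Van Gogh-style geometric abstraction.	3D gaussian splatting 3DGS gaussian splatting
11	RSATalker: Realistic Socially-Aware Talking Head Generation for Multi-Turn Conversation	Proposes RSATalker for realistic, socially-aware talking head generation that supports multi-turn conversation.	3D gaussian splatting 3DGS gaussian splatting
12	UEOF: A Benchmark Dataset for Underwater Event-Based Optical Flow	Proposes UEOF, a benchmark dataset for underwater event-camera optical flow, to advance underwater event-based vision research.	optical flow	✅
13	Unleashing the Capabilities of Large Vision-Language Models for Intelligent Perception of Roadside Infrastructure	Proposes a domain-adaptive framework that uses large vision-language models for intelligent perception of roadside infrastructure.	open-vocabulary open vocabulary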

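🔬 Pillar 2: RL & Architecture (4 papers)

#	Title	Key point	Tags	🔗	⭐
14	LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning	LaViT: aligns latent visual thoughts for multimodal reasoning.	distillation multimodal visual grounding
15	Action100M: A Large-scale Video Action Dataset	Proposes Action100M, a large-scale video action dataset, to advance research on video understanding and world modeling.	world model open-vocabulary open vocabulary
16	Inference-time Physics Alignment of Video Generative Models with Latent World Models	Proposes WMReward, which improves the physical plausibility of video generative models through inference-time physics alignment.	world model
17	Difficulty-guided Sampling: Bridging the Target Gap between Dataset Distillation and Downstream Tasks	Proposes difficulty-guided sampling (DGS) to bridge the target gap between dataset distillation and downstream tasks.	distillation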

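🔬 Pillar 1: Robot Control (3 papers)

#	Title	Key point	Tags	🔗	⭐
18	FlowAct-R1: Towards Interactive Humanoid Video Generation	FlowAct-R1: a humanoid video generation framework for real-time interaction that balances high fidelity and low latency.	humanoid distillation
19	RAG-3DSG: Enhancing 3D Scene Graphs with Re-Shot Guided Retrieval-Augmented Generation	Proposes RAG-3DSG, which improves 3D scene graph quality through re-shot guided retrieval-augmented generation.	manipulation open-vocabulary open vocabulary
20	EditEmoTalk: Controllable Speech-Driven 3D Facial Animation with Continuous Expression Editing	EditEmoTalk: a controllable speech-driven 3D facial animation framework that supports continuous expression editing.	manipulation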

🔬 Pillar 6: Video Extraction (1 paper)

#	Title	Key point	Tags	🔗	⭐
21	Optimizing Multimodal LLMs for Egocentric Video Understanding: A Solution for the HD-EPIC VQA Challenge	Optimizes multimodal LLMs for egocentric video understanding as a solution to the HD-EPIC VQA Challenge.	egocentric large language model multimodal	✅

🔬 Pillar 8: Physics-based Animation (1 paper)

#	Title	Key point	Tags	🔗	⭐
22	From One-to-One to Many-to-Many: Dynamic Cross-Layer Injection for Deep Vision-Language Fusion	Proposes dynamic Cross-Layer Injection (CLI) to address the visual feature bottleneck in vision-language models.	AMP large language model multimodal

📄 Other

#	Title	Key point	Tags	🔗	⭐
23	CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos	CoMoVi: a co-generation framework that jointly generates 3D human motions and realistic videos.
