cs.CV (2026-01-13)

📊 27 papers in total | 🔗 7 with code

🎯 Navigation by interest area

Pillar 9: Embodied Foundation Models (13 🔗3) · Pillar 2: RL & Architecture (6 🔗2) · Pillar 7: Motion Retargeting (2 🔗1) · Pillar 6: Video Extraction (2) · Pillar 8: Physics-based Animation (2 🔗1) · Pillar 3: Perception & Semantics (2)

🔬 Pillar 9: Embodied Foundation Models (13 papers)

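#	Title	One-sentence summary	Tags	🔗	⭐
1	KidVis: Do Multimodal Large Language Models Possess the Visual Perceptual Capabilities of a 6-Year-Old?	KidVis: evaluates whether multimodal large language models have the visual perceptual capabilities of a 6-year-old child.	large language model multimodal
2	GI-Bench: A Panoramic Benchmark Revealing the Knowledge-Experience Dissociation of Multimodal Large Language Models in Gastrointestinal Endoscopy Against Clinical Standards	GI-Bench: a benchmark revealing the dissociation between knowledge and experience in multimodal large language models applied clinically to gastrointestinal endoscopy.	large language model multimodal	✅
3	M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding	Proposes M3CoTBench for evaluating the chain-of-thought reasoning of multimodal large language models in medical image understanding.	large language model multimodal chain-of-thought	✅
4	Reasoning Matters for 3D Visual Grounding	Proposes Reason3DVG-8B, which improves reasoning for 3D visual grounding through synthetic data and LLM fine-tuning.	large language model visual grounding
5	Edge-Optimized Multimodal Learning for UAV Video Understanding via BLIP-2	Proposes a BLIP-2-based, edge-optimized multimodal learning framework to improve UAV video understanding.	multimodal
6	UM-Text: A Unified Multimodal Model for Image Understanding	UM-Text: proposes a unified multimodal model that addresses visual text editing and style consistency in image understanding.	multimodal
7	HIPPO: Accelerating Video Large Language Models Inference via Holistic-aware Parallel Speculative Decoding	HIPPO: accelerates video large language model inference through holistic-aware parallel speculative decoding.	large language model
8	Improving Zero-shot ADL Recognition with Large Language Models through Event-based Context and Confidence	Proposes a zero-shot ADL recognition method for large language models based on event context and confidence.	large language model
9	Semantic Misalignment in Vision-Language Models under Perceptual Degradation	Studies semantic misalignment in vision-language models under perceptual degradation.	embodied AI multimodal
10	Where Does Vision Meet Language? Understanding and Refining Visual Fusion in MLLMs via Contrastive Attention	Uses a contrastive attention mechanism to understand and refine visual fusion in MLLMs.	large language model multimodal
11	Closed-Loop LLM Discovery of Non-Standard Channel Priors in Vision Models	Proposes a closed-loop, LLM-based method for discovering channel priors, improving vision model performance.	large language model
12	Enhancing Image Quality Assessment Ability of LMMs via Retrieval-Augmented Generation	Proposes IQARAG, which uses retrieval-augmented generation to improve large models on image quality assessment.	multimodal
13	Instruction-Driven 3D Facial Expression Generation and Transition	Proposes an instruction-driven framework for 3D facial expression generation and transition, enabling realistic expression simulation.	multimodal	✅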

🔬 Pillar 2: RL & Architecture (6 papers)

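#	Title	One-sentence summary	Tags	🔗	⭐
14	Incentivizing Cardiologist-Like Reasoning in MLLMs for Interpretable Echocardiographic Diagnosis	Proposes CardiacMind, which uses reinforcement learning to incentivize MLLMs to reason like cardiologists for interpretable echocardiographic diagnosis.	reinforcement learning large language model foundation model
15	MMLGNet: Cross-Modal Alignment of Remote Sensing Data using CLIP	MMLGNet: uses CLIP for cross-modal alignment of remote sensing data, enabling semantic understanding.	contrastive learning HSI multimodal	✅
16	ReCo-KD: Region- and Context-Aware Knowledge Distillation for Efficient 3D Medical Image Segmentation	Proposes ReCo-KD, which improves the efficiency of 3D medical image segmentation through region- and context-aware knowledge distillation.	teacher-student distillation
17	Representation Learning with Semantic-aware Instance and Sparse Token Alignments	Proposes the SISTA framework, which improves medical VLP representation learning through semantic-aware instance and sparse token alignment.	representation learning contrastive learning
18	SfMamba: Efficient Source-Free Domain Adaptation via Selective Scan Modeling	SfMamba: achieves efficient source-free domain adaptation through selective scan modeling.	Mamba	✅
19	CD^2: Constrained Dataset Distillation for Few-Shot Class-Incremental Learning	Proposes the CD^2 framework, which addresses catastrophic forgetting in few-shot class-incremental learning through constrained dataset distillation.	distillation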

🔬 Pillar 7: Motion Retargeting (2 papers)

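#	Title	One-sentence summary	Tags	🔗	⭐
20	3AM: Segment Anything with Geometric Consistency in Videos	3AM: enhances SAM with geometric consistency for segmentation in videos.	geometric consistency	✅
21	SPARK: Scalable Real-Time Point Cloud Aggregation with Multi-View Self-Calibration	SPARK: a scalable, real-time point cloud aggregation method with multi-view self-calibration.	geometric consistency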

🔬 Pillar 6: Video Extraction (2 papers)

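#	Title	One-sentence summary	Tags	🔗	⭐
22	Near-perfect photo-ID of the Hula painted frog with zero-shot deep local-feature matching	Uses zero-shot deep local-feature matching to achieve near-perfect individual identification of the Hula painted frog.	feature matching
23	Instance-Aligned Captions for Explainable Video Anomaly Detection	Proposes instance-aligned captions for video anomaly detection, improving explainability and spatial localization.	egocentric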

🔬 Pillar 8: Physics-based Animation (2 papers)

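#	Title	One-sentence summary	Tags	🔗	⭐
24	VideoHEDGE: Entropy-Based Hallucination Detection for Video-VLMs via Semantic Clustering and Spatiotemporal Perturbations	VideoHEDGE: entropy-based hallucination detection for video VLMs using semantic clustering and spatiotemporal perturbations.	spatiotemporal	✅
25	AIMC-Spec: A Benchmark Dataset for Automatic Intrapulse Modulation Classification under Variable Noise Conditions	Proposes the AIMC-Spec dataset for automatic intrapulse modulation classification of radar signals under noisy conditions.	PULSE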

🔬 Pillar 3: Perception & Semantics (2 papers)

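#	Title	One-sentence summary	Tags	🔗	⭐
26	How Do Optical Flow and Textual Prompts Collaborate to Assist in Audio-Visual Semantic Segmentation?	Proposes the SSP framework, which fuses optical flow with textual prompts to improve audio-visual semantic segmentation accuracy.	optical flow
27	CogniMap3D: Cognitive 3D Mapping and Rapid Retrieval	CogniMap3D: proposes a bio-inspired framework for cognitive 3D map construction and rapid retrieval.	depth estimation scene understanding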
