cs.CV (2026-01-13)

📊 27 papers in total | 🔗 7 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (13 🔗3) · Pillar 2: RL & Architecture (6 🔗2) · Pillar 7: Motion Retargeting (2 🔗1) · Pillar 6: Video Extraction (2) · Pillar 8: Physics-based Animation (2 🔗1) · Pillar 3: Perception & Semantics (2)

🔬 Pillar 9: Embodied Foundation Models (13 papers)

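# | Title | One-sentence summary | Tags
1 | KidVis: Do Multimodal Large Language Models Possess the Visual Perceptual Capabilities of a 6-Year-Old? | KidVis: evaluates whether multimodal large language models have the visual perceptual capabilities of a 6-year-old child | large language model, multimodal
2 | GI-Bench: A Panoramic Benchmark Revealing the Knowledge-Experience Dissociation of Multimodal Large Language Models in Gastrointestinal Endoscopy Against Clinical Standards | GI-Bench: a benchmark revealing the gap between knowledge and experience of multimodal large language models in clinical gastrointestinal endoscopy | large language model, multimodal
3 | M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding | Proposes M3CoTBench for evaluating the chain-of-thought reasoning of multimodal large language models in medical image understanding | large language model, multimodal, chain-of-thought
4 | Reasoning Matters for 3D Visual Grounding | Proposes Reason3DVG-8B, which improves reasoning for 3D visual grounding through synthetic data and LLM fine-tuning | large language model, visual grounding
5 | Edge-Optimized Multimodal Learning for UAV Video Understanding via BLIP-2 | Proposes a BLIP-2-based, edge-optimized multimodal learning framework to improve UAV video understanding | multimodal
6 | UM-Text: A Unified Multimodal Model for Image Understanding | UM-Text: proposes a unified multimodal model addressing visual text editing and style consistency in image understanding | multimodal
7 | HIPPO: Accelerating Video Large Language Models Inference via Holistic-aware Parallel Speculative Decoding | HIPPO: accelerates video large language model inference via holistic-aware parallel speculative decoding | large language model
8 | Improving Zero-shot ADL Recognition with Large Language Models through Event-based Context and Confidence | Proposes a zero-shot ADL recognition method that uses large language models with event-based context and confidence | large language model
9 | Semantic Misalignment in Vision-Language Models under Perceptual Degradation | Studies semantic misalignment in vision-language models under perceptual degradation | embodied AI, multimodal
10 | Where Does Vision Meet Language? Understanding and Refining Visual Fusion in MLLMs via Contrastive Attention | Analyzes and refines visual fusion in MLLMs via a contrastive attention mechanism | large language model, multimodal
11 | Closed-Loop LLM Discovery of Non-Standard Channel Priors in Vision Models | Proposes a closed-loop, LLM-based method for discovering channel priors to improve vision model performance | large language model
12 | Enhancing Image Quality Assessment Ability of LMMs via Retrieval-Augmented Generation | Proposes IQARAG, which uses retrieval-augmented generation to improve large models' image quality assessment ability | multimodal
13 | Instruction-Driven 3D Facial Expression Generation and Transition | Proposes an instruction-driven framework for 3D facial expression generation and transition, enabling realistic expression simulation | multimodal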

🔬 Pillar 2: RL & Architecture (6 papers)

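# | Title | One-sentence summary | Tags
14 | Incentivizing Cardiologist-Like Reasoning in MLLMs for Interpretable Echocardiographic Diagnosis | Proposes CardiacMind, which uses reinforcement learning to incentivize cardiologist-like, interpretable echocardiographic diagnostic reasoning in MLLMs | reinforcement learning, large language model, foundation model
15 | MMLGNet: Cross-Modal Alignment of Remote Sensing Data using CLIP | MMLGNet: uses CLIP for cross-modal alignment of remote sensing data, enabling semantic understanding | contrastive learning, HSI, multimodal
16 | ReCo-KD: Region- and Context-Aware Knowledge Distillation for Efficient 3D Medical Image Segmentation | Proposes ReCo-KD, which improves the efficiency of 3D medical image segmentation via region- and context-aware knowledge distillation | teacher-student distillation
17 | Representation Learning with Semantic-aware Instance and Sparse Token Alignments | Proposes the SISTA framework, which improves medical VLP representation learning through semantic-aware instance and sparse token alignment | representation learning, contrastive learning
18 | SfMamba: Efficient Source-Free Domain Adaptation via Selective Scan Modeling | SfMamba: efficient source-free domain adaptation via selective scan modeling | Mamba
19 | CD^2: Constrained Dataset Distillation for Few-Shot Class-Incremental Learning | Proposes the CD^2 framework, which addresses catastrophic forgetting in few-shot class-incremental learning through constrained dataset distillation | distillation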

🔬 Pillar 7: Motion Retargeting (2 papers)

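# | Title | One-sentence summary | Tags
20 | 3AM: Segment Anything with Geometric Consistency in Videos | 3AM: enhances SAM with geometric consistency for segmentation in videos | geometric consistency
21 | SPARK: Scalable Real-Time Point Cloud Aggregation with Multi-View Self-Calibration | SPARK: a scalable, real-time point cloud aggregation method with multi-view self-calibration | geometric consistency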

🔬 Pillar 6: Video Extraction (2 papers)

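# | Title | One-sentence summary | Tags
22 | Near-perfect photo-ID of the Hula painted frog with zero-shot deep local-feature matching | Achieves near-perfect individual identification of the Hula painted frog using zero-shot deep local-feature matching | feature matching
23 | Instance-Aligned Captions for Explainable Video Anomaly Detection | Proposes instance-aligned captions for video anomaly detection, improving explainability and spatial localization | egocentric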

🔬 Pillar 8: Physics-based Animation (2 papers)

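# | Title | One-sentence summary | Tags
24 | VideoHEDGE: Entropy-Based Hallucination Detection for Video-VLMs via Semantic Clustering and Spatiotemporal Perturbations | VideoHEDGE: entropy-based hallucination detection for video VLMs using semantic clustering and spatiotemporal perturbations | spatiotemporal
25 | AIMC-Spec: A Benchmark Dataset for Automatic Intrapulse Modulation Classification under Variable Noise Conditions | Proposes the AIMC-Spec dataset for automatic intrapulse modulation classification of radar signals under noisy conditions | PULSE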

🔬 Pillar 3: Perception & Semantics (2 papers)

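# | Title | One-sentence summary | Tags
26 | How Do Optical Flow and Textual Prompts Collaborate to Assist in Audio-Visual Semantic Segmentation? | Proposes the SSP framework, which fuses optical flow and textual prompts to improve audio-visual semantic segmentation accuracy | optical flow
27 | CogniMap3D: Cognitive 3D Mapping and Rapid Retrieval | CogniMap3D: proposes a bio-inspired framework for cognitive 3D map construction and rapid retrieval | depth estimation, scene understanding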
