cs.CV (2025-07-17)

📊 33 papers in total | 🔗 12 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (13 🔗4) · Pillar 2: RL & Architecture (8 🔗5) · Pillar 3: Perception & Semantics (7 🔗2) · Pillar 1: Robot Control (4 🔗1) · Pillar 4: Generative Motion (1)

🔬 Pillar 9: Embodied Foundation Models (13 papers)

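# | Title | One-sentence summary | Tags | 🔗
1 | SE-VLN: A Self-Evolving Vision-Language Navigation Framework Based on Multimodal Large Language Models | Proposes SE-VLN, a self-evolving vision-language navigation framework based on multimodal large language models | VLN, large language model, multimodal |
2 | Analysis of Image-and-Text Uncertainty Propagation in Multimodal Large Language Models with Cardiac MR-Based Applications | Proposes a multimodal uncertainty propagation model to analyze image-text uncertainty in MLLMs, applied to cardiac MR analysis | large language model, multimodal |
3 | MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval | Proposes the MCoT-RE framework, which uses multi-faceted CoT and re-ranking for training-free zero-shot composed image retrieval | large language model, multimodal, chain-of-thought |
4 | VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding | Proposes VideoITG, which improves multimodal video understanding through instructed temporal grounding | large language model, multimodal |
5 | Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion | Proposes a coronary artery segmentation framework with parallel ViT-CNN encoding and variational fusion, improving the accuracy of CAD diagnostic support | foundation model |
6 | AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning | AnyCap Project: proposes a unified framework, dataset, and benchmark for controllable omni-modal image/video captioning | foundation model, multimodal, instruction following |
7 | Semantic-guided Fine-tuning of Foundation Model for Long-tailed Visual Recognition | Proposes a semantic-guided fine-tuning method for foundation models to address long-tailed visual recognition | foundation model |
8 | Think-Before-Draw: Decomposing Emotion Semantics & Fine-Grained Controllable Expressive Talking Head Generation | Proposes the Think-Before-Draw framework for text-driven talking head generation with fine-grained, controllable emotional expression | large language model, multimodal, chain-of-thought |
9 | Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images | Pixel Perfect MegaMed: a megapixel-scale vision-language foundation model for generating high-resolution medical images | foundation model |
10 | Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark | Corrects reliability issues in the RPE benchmark, improving the evaluation quality of reasoning-based pose estimation | large language model, multimodal |
11 | Leveraging Language Prior for Infrared Small Target Detection | Proposes an infrared small target detection framework that leverages language priors, significantly improving detection accuracy | multimodal |
12 | DeQA-Doc: Adapting DeQA-Score to Document Image Quality Assessment | Proposes DeQA-Doc, which uses multimodal large language models for document image quality assessment, significantly improving accuracy and generalization | large language model |
13 | Transformer-based Spatial Grounding: A Comprehensive Survey | A survey of Transformer-based spatial grounding: a systematic review of methods, datasets, and evaluation metrics | multimodal |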

🔬 Pillar 2: RL & Architecture (8 papers)

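# | Title | One-sentence summary | Tags | 🔗
14 | Differential-informed Sample Selection Accelerates Multimodal Contrastive Learning | Proposes DISSect, a differential-informed sample selection method that accelerates multimodal contrastive learning | contrastive learning, multimodal |
15 | A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains | Proposes a real-time hand-object interaction detection system for industrial domains, improving human-machine interaction efficiency | Mamba, egocentric, egocentric vision |
16 | VITA: Vision-to-Action Flow Matching Policy | VITA: a noise-free, conditioning-free vision-to-action flow matching policy that speeds up robot control | policy learning, flow matching, Aloha |
17 | Unified Medical Image Segmentation with State Space Modeling Snake | Proposes Mamba Snake, based on state space modeling, for unified medical image segmentation | Mamba, state space model, spatiotemporal |
18 | Orbis: Overcoming Challenges of Long-Horizon Prediction in Driving World Models | Orbis: proposes a driving world model for long-horizon prediction that performs well in complex scenarios | flow matching, world model |
19 | Hierarchical Rectified Flow Matching with Mini-Batch Couplings | Proposes hierarchical rectified flow matching with mini-batch couplings, improving generative models' ability to model complex distributions | flow matching |
20 | VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning | Proposes VisionThink, which uses reinforcement learning to dynamically adjust the number of visual tokens, improving vision-language model efficiency | reinforcement learning |
21 | Label-Consistent Dataset Distillation with Detector-Guided Refinement | Proposes a detector-guided, label-consistent dataset distillation framework that improves synthetic data quality | distillation |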

🔬 Pillar 3: Perception & Semantics (7 papers)

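# | Title | One-sentence summary | Tags | 🔗
22 | Argus: Leveraging Multiview Images for Improved 3-D Scene Understanding With Large Language Models | Argus: leverages multi-view images to enhance the 3D scene understanding of large language models | scene understanding, large language model, foundation model |
23 | DINO-VO: A Feature-based Visual Odometry Leveraging a Visual Foundation Model | DINO-VO: a feature-based visual odometry built on the visual foundation model DINOv2, improving robustness and generalization | visual odometry, visual SLAM, feature matching |
24 | SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation | Proposes the SCORE framework, which uses scene context to improve open-vocabulary instance segmentation of remote sensing images | open-vocabulary, open vocabulary |
25 | S²M²: Scalable Stereo Matching Model for Reliable Depth Estimation | Proposes S²M², a scalable stereo matching model for reliable depth estimation | depth estimation |
26 | Advancing Complex Wide-Area Scene Understanding with Hierarchical Coresets Selection | Proposes a hierarchical coreset selection mechanism that improves VLM adaptability in complex wide-area scene understanding | scene understanding |
27 | FantasyPortrait: Enhancing Multi-Character Portrait Animation with Expression-Augmented Diffusion Transformers | Proposes FantasyPortrait, which uses expression-augmented diffusion transformers to improve multi-character portrait animation | implicit representation, character animation, character control |
28 | $π^3$: Permutation-Equivariant Visual Geometry Learning | Proposes $π^3$, a permutation-equivariant network for visual geometry reconstruction without a reference view | depth estimation |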

🔬 Pillar 1: Robot Control (4 papers)

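# | Title | One-sentence summary | Tags | 🔗
29 | AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation | AnyPos: an automated, task-agnostic action learning framework for bimanual manipulation | manipulation, bi-manual, bimanual manipulation |
30 | City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning | City-VLM: multi-domain perception scene understanding via multimodal incomplete learning | humanoid, scene understanding, multimodal |
31 | Beyond Fully Supervised Pixel Annotations: Scribble-Driven Weakly-Supervised Framework for Image Manipulation Localization | Proposes a weakly supervised framework based on scribble annotations for image manipulation localization | manipulation |
32 | IConMark: Robust Interpretable Concept-Based Watermark For AI Images | Proposes IConMark, a robust and interpretable concept-based watermarking method for AI images | manipulation |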

🔬 Pillar 4: Generative Motion (1 paper)

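# | Title | One-sentence summary | Tags | 🔗
33 | cIDIR: Conditioned Implicit Neural Representation for Regularized Deformable Image Registration | Proposes cIDIR, a regularized deformable image registration framework based on conditioned implicit neural representations | physically plausible |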
