cs.CV (2025-07-17)

📊 33 papers in total | 🔗 12 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (13 🔗4) Pillar 2: RL & Architecture (8 🔗5) Pillar 3: Perception & Semantics (7 🔗2) Pillar 1: Robot Control (4 🔗1) Pillar 4: Generative Motion (1)

🔬 Pillar 9: Embodied Foundation Models (13 papers)

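#	Title	Key Point	Tags	🔗	⭐
1	SE-VLN: A Self-Evolving Vision-Language Navigation Framework Based on Multimodal Large Language Models	Proposes SE-VLN, a self-evolving vision-language navigation framework based on multimodal large language models	VLN large language model multimodal
2	Analysis of Image-and-Text Uncertainty Propagation in Multimodal Large Language Models with Cardiac MR-Based Applications	Proposes a multimodal uncertainty propagation model to analyze image-text uncertainty in MLLMs, applied to cardiac MR analysis	large language model multimodal	✅
3	MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval	Proposes the MCoT-RE framework, which addresses training-free zero-shot composed image retrieval through multi-faceted CoT and re-ranking.	large language model multimodal chain-of-thought
4	VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding	Proposes VideoITG, which improves multimodal video understanding through instructed temporal grounding	large language model multimodal
5	Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion	Proposes a coronary artery segmentation framework with parallel ViT-CNN encoding and variational fusion, improving the accuracy of assisted CAD diagnosis.	foundation model	✅
6	AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning	AnyCap Project: proposes a unified framework, dataset, and benchmark for controllable omni-modal image/video captioning.	foundation model multimodal instruction following
7	Semantic-guided Fine-tuning of Foundation Model for Long-tailed Visual Recognition	Proposes a semantic-guided fine-tuning method for foundation models to address long-tailed visual recognition	foundation model
8	Think-Before-Draw: Decomposing Emotion Semantics & Fine-Grained Controllable Expressive Talking Head Generation	Proposes the Think-Before-Draw framework for text-driven, fine-grained controllable expressive talking head generation.	large language model multimodal chain-of-thought
9	Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images	Pixel Perfect MegaMed: a megapixel-scale vision-language foundation model for generating high-resolution medical images	foundation model	✅
10	Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark	Fixes reliability issues in the RPE benchmark, improving the quality of reasoning-based pose estimation evaluation	large language model multimodal
11	Leveraging Language Prior for Infrared Small Target Detection	Proposes an infrared small target detection framework that leverages language priors, significantly improving detection accuracy.	multimodal
12	DeQA-Doc: Adapting DeQA-Score to Document Image Quality Assessment	Proposes DeQA-Doc, which uses multimodal large language models for document image quality assessment, significantly improving accuracy and generalization.	large language model	✅
13	Transformer-based Spatial Grounding: A Comprehensive Survey	A survey of Transformer-based spatial grounding: a systematic review of methods, datasets, and evaluation metrics	multimodal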

🔬 Pillar 2: RL & Architecture (8 papers)

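#	Title	Key Point	Tags	🔗	⭐
14	Differential-informed Sample Selection Accelerates Multimodal Contrastive Learning	Proposes DISSect, a differential-informed sample selection method that accelerates multimodal contrastive learning.	contrastive learning multimodal	✅
15	A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains	Proposes a real-time hand-object interaction detection system for industrial domains, improving human-machine interaction efficiency.	Mamba egocentric egocentric vision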
16	VITA: Vision-to-Action Flow Matching Policy	VITA: a noise-free, conditioning-free vision-to-action flow matching policy that speeds up robot control.	policy learning flow matching Aloha	✅
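17	Unified Medical Image Segmentation with State Space Modeling Snake	Proposes Mamba Snake, based on state space modeling, for unified medical image segmentation	Mamba state space model spatiotemporal
18	Orbis: Overcoming Challenges of Long-Horizon Prediction in Driving World Models	Orbis: proposes a driving world model for long-horizon prediction that performs well in complex scenarios.	flow matching world model	✅
19	Hierarchical Rectified Flow Matching with Mini-Batch Couplings	Proposes a hierarchical rectified flow matching method based on mini-batch couplings, improving generative models' ability to model complex distributions.	flow matching	✅
20	VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning	Proposes VisionThink, which uses reinforcement learning to dynamically adjust the number of visual tokens, improving vision-language model efficiency.	reinforcement learning	✅
21	Label-Consistent Dataset Distillation with Detector-Guided Refinement	Proposes a detector-guided, label-consistent dataset distillation framework that improves synthetic data quality.	distillation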

🔬 Pillar 3: Perception & Semantics (7 papers)

#	Title	Key Point	Tags	🔗	⭐
22	Argus: Leveraging Multiview Images for Improved 3-D Scene Understanding With Large Language Models	Argus: leverages multiview images to enhance the 3D scene understanding of large language models	scene understanding large language model foundation model
23	DINO-VO: A Feature-based Visual Odometry Leveraging a Visual Foundation Model	DINO-VO: a feature-based visual odometry that leverages the visual foundation model DINOv2, improving robustness and generalization.	visual odometry visual SLAM feature matching
24	SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation	Proposes the SCORE framework, which uses scene context to improve open-vocabulary instance segmentation of remote sensing images.	open-vocabulary open vocabulary	✅
25	S²M²: Scalable Stereo Matching Model for Reliable Depth Estimation	Proposes S²M², a scalable stereo matching model for reliable depth estimation	depth estimation
26	Advancing Complex Wide-Area Scene Understanding with Hierarchical Coresets Selection	Proposes a hierarchical coreset selection mechanism that improves VLM adaptability in complex wide-area scene understanding	scene understanding
27	FantasyPortrait: Enhancing Multi-Character Portrait Animation with Expression-Augmented Diffusion Transformers	Proposes FantasyPortrait, which uses expression-augmented diffusion Transformers to improve multi-character portrait animation	implicit representation character animation character control	✅
28	π³: Permutation-Equivariant Visual Geometry Learning	Proposes π³, a permutation-equivariant network for visual geometry reconstruction without a reference view.	depth estimation

🔬 Pillar 1: Robot Control (4 papers)

#	Title	Key Point	Tags	🔗	⭐
29	AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation	AnyPos: an automated, task-agnostic action learning framework for bimanual manipulation	manipulation bi-manual bimanual manipulation	✅
30	City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning	City-VLM: multidomain perception scene understanding via multimodal incomplete learning	humanoid scene understanding multimodal
31	Beyond Fully Supervised Pixel Annotations: Scribble-Driven Weakly-Supervised Framework for Image Manipulation Localization	Proposes a scribble-annotation-based weakly supervised framework for image manipulation localization	manipulation
32	IConMark: Robust Interpretable Concept-Based Watermark For AI Images	Proposes IConMark, a robust and interpretable concept-based watermarking method for AI images	manipulation

🔬 Pillar 4: Generative Motion (1 paper)

#	Title	Key Point	Tags	🔗	⭐
33	cIDIR: Conditioned Implicit Neural Representation for Regularized Deformable Image Registration	Proposes cIDIR, a regularized deformable image registration framework based on conditioned implicit neural representations	physically plausible
