cs.CV (2025-09-30)

📊 62 papers in total | 🔗 17 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (20 🔗5) · Pillar 2: RL Algorithms & Architecture (15 🔗5) · Pillar 1: Robot Control (10 🔗3) · Pillar 3: Spatial Perception & Semantics (9 🔗3) · Pillar 6: Video Extraction & Matching (6) · Pillar 7: Motion Retargeting (1 🔗1) · Pillar 5: Interaction & Reaction (1)

🔬 Pillar 9: Embodied Foundation Models (20 papers)

| # | Title | One-line Takeaway | Tags |
|---|---|---|---|
| 1 | AIMCoT: Active Information-driven Multimodal Chain-of-Thought for Vision-Language Reasoning | Proposes AIMCoT, improving vision-language reasoning through active information-driven multimodal CoT | multimodal, chain-of-thought |
| 2 | LMOD+: A Comprehensive Multimodal Dataset and Benchmark for Developing and Evaluating Multimodal Large Language Models in Ophthalmology | Proposes LMOD+ to address the evaluation of multimodal large language models in ophthalmology | large language model, multimodal |
| 3 | GeoLink: Empowering Remote Sensing Foundation Model with OpenStreetMap Data | GeoLink: enhances remote sensing foundation models with OpenStreetMap data for stronger geospatial intelligence | foundation model, multimodal |
| 4 | MuSLR: Multimodal Symbolic Logical Reasoning | Introduces the MuSLR benchmark and the LogiCAM framework to improve VLMs' multimodal symbolic logical reasoning | multimodal, chain-of-thought |
| 5 | Query-Kontext: An Unified Multimodal Model for Image Generation and Editing | Proposes Query-Kontext, bridging VLMs and diffusion models through multimodal context for image generation and editing | multimodal |
| 6 | MR$^2$-Bench: Going Beyond Matching to Reasoning in Multimodal Retrieval | Proposes MR$^2$-Bench, a benchmark for evaluating reasoning ability in multimodal retrieval | multimodal |
| 7 | LLaVAShield: Safeguarding Multimodal Multi-Turn Dialogues in Vision-Language Models | Proposes LLaVAShield to safeguard multimodal multi-turn dialogues in vision-language models | multimodal |
| 8 | A Multimodal LLM Approach for Visual Question Answering on Multiparametric 3D Brain MRI | Proposes mpLLM for visual question answering on multiparametric 3D brain MRI and builds a clinically validated dataset | multimodal |
| 9 | SQUARE: Semantic Query-Augmented Fusion and Efficient Batch Reranking for Training-free Zero-Shot Composed Image Retrieval | Proposes the SQUARE framework, achieving training-free zero-shot composed image retrieval via semantic query augmentation and efficient reranking | large language model, multimodal |
| 10 | VELA: An LLM-Hybrid-as-a-Judge Approach for Evaluating Long Image Captions | VELA: an LLM-hybrid-as-a-judge approach for evaluating long image captions | large language model, multimodal |
| 11 | Logo-VGR: Visual Grounded Reasoning for Open-world Logo Recognition | Proposes Logo-VGR, using visual grounded reasoning for open-world logo recognition to make product moderation smarter | large language model, multimodal |
| 12 | FinCap: Topic-Aligned Captions for Short-Form Financial YouTube Videos | FinCap: topic-aligned caption generation for short-form financial videos | large language model, multimodal |
| 13 | Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation | Ovi: audio-video generation via twin-backbone cross-modal fusion | multimodal |
| 14 | TTT3R: 3D Reconstruction as Test-Time Training | Proposes TTT3R to address length generalization in 3D reconstruction | foundation model |
| 15 | Video Object Segmentation-Aware Audio Generation | Proposes SAGANet, using video object segmentation for controllable audio generation to streamline Foley workflows | multimodal |
| 16 | TimeScope: Towards Task-Oriented Temporal Grounding In Long Videos | Proposes TimeScope to tackle task-oriented temporal grounding in long videos | chain-of-thought |
| 17 | An Experimental Study on Generating Plausible Textual Explanations for Video Summarization | Proposes a method based on large models and semantic overlap for generating and evaluating plausible explanations of video summaries | multimodal |
| 18 | Towards Reliable and Holistic Visual In-Context Learning Prompt Selection | Proposes RH-Partial2Global, making prompt selection in visual in-context learning more reliable and holistic | foundation model |
| 19 | PatchEAD: Unifying Industrial Visual Prompting Frameworks for Patch-Exclusive Anomaly Detection | PatchEAD: a unified industrial visual prompting framework for patch-exclusive anomaly detection | foundation model |
| 20 | Adapting SAM with Dynamic Similarity Graphs for Few-Shot Parameter-Efficient Small Dense Object Detection: A Case Study of Chickpea Pods in Field Conditions | Proposes a dynamic-similarity-graph adaptation of SAM for few-shot, parameter-efficient dense small-object detection, with in-field chickpea pods as a case study | foundation model |

🔬 Pillar 2: RL Algorithms & Architecture (15 papers)

| # | Title | One-line Takeaway | Tags |
|---|---|---|---|
| 21 | Seeing Space and Motion: Enhancing Latent Actions with Spatial and Dynamic Awareness for VLA | Proposes Farsighted-LAM and SSM-VLA, adding spatial and dynamic awareness to latent action models in VLA systems | SSM, vision-language-action, VLA |
| 22 | Importance Sampling for Multi-Negative Multimodal Direct Preference Optimization | MISP-DPO: improves multimodal direct preference optimization via importance sampling over multiple negatives | DPO, direct preference optimization, multimodal |
| 23 | IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance | Proposes IMG, calibrating diffusion models through implicit multimodal guidance for tighter image-text alignment | DPO, large language model, multimodal |
| 24 | Generalized Contrastive Learning for Universal Multimodal Retrieval | Proposes generalized contrastive learning (GCL) to improve generalization across modality combinations in universal multimodal retrieval | contrastive learning, multimodal |
| 25 | ProbMed: A Probabilistic Framework for Medical Multimodal Binding | ProbMED: a probabilistic multimodal medical binding framework for better medical decision support | contrastive learning, multimodal |
| 26 | Revealing the Power of Post-Training for Small Language Models via Knowledge Distillation | Proposes a knowledge-distillation-based post-training pipeline that boosts small language models on edge devices | distillation, large language model |
| 27 | PRPO: Paragraph-level Policy Optimization for Vision-Language Deepfake Detection | Proposes PRPO, paragraph-level policy optimization that improves large vision-language models on deepfake detection | reinforcement learning, large language model, multimodal |
| 28 | Self-Supervised Anatomical Consistency Learning for Vision-Grounded Medical Report Generation | Proposes a self-supervised anatomical consistency learning framework for vision-grounded medical report generation | contrastive learning, foundation model, visual grounding |
| 29 | More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models | Reveals the dual nature of reasoning in vision-language models and proposes VAPO to strengthen visual perception | reinforcement learning, large language model, multimodal |
| 30 | Free Lunch Alignment of Text-to-Image Diffusion Models without Preference Image Pairs | Proposes text preference optimization (TPO), "free-lunch" alignment of text-to-image diffusion models without preference image pairs | reinforcement learning, RLHF, DPO |
| 31 | Dolphin v1.0 Technical Report | Dolphin v1.0: the first large-scale multimodal ultrasound foundation model, unifying a range of clinical tasks | reinforcement learning, foundation model, multimodal |
| 32 | Beyond Pixels: Efficient Dataset Distillation via Sparse Gaussian Representation | Proposes GSDD, dataset distillation with a sparse Gaussian representation for better efficiency and performance | distillation, splatting |
| 33 | Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents | Presents Ferret-UI Lite, a compact on-device GUI agent for cross-platform interaction | reinforcement learning, chain-of-thought |
| 34 | FLOWER: A Flow-Matching Solver for Inverse Problems | Proposes FLOWER, solving inverse problems with flow-matching models for high-quality reconstruction | flow matching |
| 35 | Generalized Fine-Grained Category Discovery with Multi-Granularity Conceptual Experts | Proposes MGCE, a multi-granularity conceptual expert network for generalized fine-grained category discovery | representation learning, contrastive learning |

🔬 Pillar 1: Robot Control (10 papers)

| # | Title | One-line Takeaway | Tags |
|---|---|---|---|
| 36 | DeepSketcher: Internalizing Visual Manipulation for Multimodal Reasoning | DeepSketcher: multimodal reasoning through internalized visual manipulation | manipulation, multimodal, chain-of-thought |
| 37 | Point-It-Out: Benchmarking Embodied Reasoning for Vision Language Models in Multi-Stage Visual Grounding | Introduces the Point-It-Out benchmark for evaluating VLMs' embodied reasoning in multi-stage visual grounding | manipulation, visual grounding |
| 38 | LaTo: Landmark-tokenized Diffusion Transformer for Fine-grained Human Face Editing | LaTo: a landmark-tokenized diffusion Transformer for fine-grained face editing | manipulation, classifier-free guidance, multimodal |
| 39 | Behavioural Classification in C. elegans: a Spatio-Temporal Analysis of Locomotion | Proposes a spatio-temporal analysis for automatic classification of C. elegans behaviour that requires no full-body view | locomotion |
| 40 | Editable Noise Map Inversion: Encoding Target-image into Noise For High-Fidelity Image Manipulation | Proposes editable noise map inversion (ENM Inversion), improving fidelity and editability in diffusion-based image editing | manipulation |
| 41 | DGM4+: Dataset Extension for Global Scene Inconsistency | DGM4+: a dataset extension covering global scene inconsistency to strengthen multimodal forgery detection | manipulation, multimodal |
| 42 | SGS: Segmentation-Guided Scoring for Global Scene Inconsistencies | Proposes SGS to address global scene inconsistencies | manipulation, multimodal |
| 43 | PinPoint3D: Fine-Grained 3D Part Segmentation from a Few Clicks | PinPoint3D: an interactive framework for fine-grained 3D part segmentation from a few clicks | manipulation, embodied AI |
| 44 | DiffCamera: Arbitrary Refocusing on Images | DiffCamera: arbitrary image refocusing with a diffusion Transformer | manipulation |
| 45 | Dragging with Geometry: From Pixels to Geometry-Guided Image Editing | Proposes GeoDrag, geometry-guided image editing with improved precision and consistency | manipulation |

🔬 Pillar 3: Spatial Perception & Semantics (9 papers)

| # | Title | One-line Takeaway | Tags |
|---|---|---|---|
| 46 | Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models | Introduces the Human-MME benchmark for holistic evaluation of human-centric multimodal large language models | scene understanding, large language model, multimodal |
| 47 | Learning Egocentric In-Hand Object Segmentation through Weak Supervision from Human Narrations | Proposes NS-iHOS, egocentric in-hand object segmentation weakly supervised by human narrations | open-vocabulary, human-object interaction |
| 48 | Stylos: Multi-View 3D Stylization with Single-Forward Gaussian Splatting | Stylos: multi-view 3D stylization with single-forward Gaussian splatting | gaussian splatting, splatting |
| 49 | DA$^{2}$: Depth Anything in Any Direction | Proposes DA$^{2}$, achieving zero-shot generalization for panoramic depth estimation in any direction | depth estimation, Depth Anything, geometric consistency |
| 50 | EasyOcc: 3D Pseudo-Label Supervision for Fully Self-Supervised Semantic Occupancy Prediction Models | EasyOcc: 3D pseudo-label supervision for fully self-supervised semantic occupancy prediction, with large performance gains | depth estimation, Metric3D, scene understanding |
| 51 | PFDepth: Heterogeneous Pinhole-Fisheye Joint Depth Estimation via Distortion-aware Gaussian-Splatted Volumetric Fusion | PFDepth: a distortion-aware framework for joint depth estimation across heterogeneous pinhole and fisheye views | depth estimation |
| 52 | Image-Plane Geometric Decoding for View-Invariant Indoor Scene Reconstruction | Proposes an image-plane geometric decoding framework to remove view dependence in indoor scene reconstruction | scene reconstruction |
| 53 | VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs | VLM-FO1: bridges the gap between high-level reasoning and fine-grained perception in VLMs via feature retrieval | scene understanding, visual grounding |
| 54 | DEPTHOR++: Robust Depth Enhancement from a Real-World Lightweight dToF and RGB Guidance | DEPTHOR++: a robust depth enhancement framework that refines real-world lightweight dToF depth with RGB guidance | depth estimation, monocular depth |

🔬 Pillar 6: Video Extraction & Matching (6 papers)

| # | Title | One-line Takeaway | Tags |
|---|---|---|---|
| 55 | V-HUB: A Visual-Centric Humor Understanding Benchmark for Video LLMs | Introduces V-HUB, a visual-centric humor understanding benchmark for evaluating video large language models | HuMoR, large language model, multimodal |
| 56 | MotionRAG: Motion Retrieval-Augmented Image-to-Video Generation | MotionRAG: realistic image-to-video generation through retrieval-augmented motion priors | motion retrieval, motion adaptation |
| 57 | LieHMR: Autoregressive Human Mesh Recovery with $SO(3)$ Diffusion | Proposes autoregressive human mesh recovery with $SO(3)$ diffusion, resolving the ambiguity of monocular 3D human pose estimation | human mesh recovery, HMR |
| 58 | Benchmarking Egocentric Visual-Inertial SLAM at City Scale | Presents a city-scale egocentric visual-inertial SLAM benchmark that challenges the robustness of existing algorithms in complex environments | egocentric |
| 59 | SETR: A Two-Stage Semantic-Enhanced Framework for Zero-Shot Composed Image Retrieval | Proposes SETR, a two-stage semantic-enhanced framework for zero-shot composed image retrieval | feature matching, multimodal |
| 60 | ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation | ProfVLM: a lightweight video-language model for multi-view proficiency estimation | egocentric |

🔬 Pillar 7: Motion Retargeting (1 paper)

| # | Title | One-line Takeaway | Tags |
|---|---|---|---|
| 61 | Stitch: Training-Free Position Control in Multimodal Diffusion Transformers | Stitch: a training-free method for position control in multimodal diffusion Transformers | spatial relationship, multimodal |

🔬 Pillar 5: Interaction & Reaction (1 paper)

| # | Title | One-line Takeaway | Tags |
|---|---|---|---|
| 62 | HART: Human Aligned Reconstruction Transformer | HART: a human-aligned reconstruction Transformer for sparse-view human reconstruction | human-object interaction, SMPL, SMPL-X |
