cs.CV (2025-08-07)

📊 41 papers in total | 🔗 10 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (18 🔗4)
Pillar 2: RL Algorithms & Architecture (10 🔗3)
Pillar 3: Spatial Perception & Semantics (8 🔗2)
Pillar 4: Generative Motion (2)
Pillar 8: Physics-based Animation (1 🔗1)
Pillar 6: Video Extraction & Matching (1)
Pillar 1: Robot Control (1)

🔬 Pillar 9: Embodied Foundation Models (18 papers)

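# | Title | One-sentence summary | Tags | 🔗
1 | LLaVA-RE: Binary Image-Text Relevancy Evaluation with Multimodal Large Language Model | Proposes LLaVA-RE, which uses a multimodal large language model for binary image-text relevancy evaluation. | large language model, multimodal
2 | PhysPatch: A Physically Realizable and Transferable Adversarial Patch Attack for Multimodal Large Language Models-based Autonomous Driving Systems | PhysPatch: a physically realizable and transferable adversarial patch attack on autonomous driving systems built on multimodal large language models. | large language model, multimodal
3 | Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision | Proposes Uni-CoT for unified chain-of-thought reasoning across text and vision, achieving SOTA performance on multimodal tasks. | large language model, multimodal, chain-of-thought
4 | AI vs. Human Moderators: A Comparative Evaluation of Multimodal LLMs in Content Moderation for Brand Safety | Evaluates multimodal LLMs on content moderation for brand safety, comparing AI with human moderators. | large language model, multimodal
5 | mKG-RAG: Multimodal Knowledge Graph-Enhanced RAG for Visual Question Answering | Proposes mKG-RAG, which enhances RAG with multimodal knowledge graphs to improve visual question answering. | large language model, multimodal
6 | Finding Needles in Images: Can Multimodal LLMs Locate Fine Details? | Proposes the NiM benchmark and the Spot-IT method to improve the ability of multimodal LLMs to locate fine-grained details in complex documents. | large language model, multimodal
7 | MedPatch: Confidence-Guided Multi-Stage Fusion for Multimodal Clinical Data | MedPatch: a confidence-guided multi-stage fusion method for multimodal clinical data analysis. | multimodal
8 | AdaFusion: Prompt-Guided Inference with Adaptive Fusion of Pathology Foundation Models | AdaFusion: a prompt-guided adaptive fusion method for pathology foundation models. | foundation model
9 | A Context-aware Attention and Graph Neural Network-based Multimodal Framework for Misogyny Detection | Proposes a multimodal framework based on context-aware attention and graph neural networks for misogyny detection. | multimodal
10 | Follow-Your-Instruction: A Comprehensive MLLM Agent for World Data Synthesis | Proposes Follow-Your-Instruction, a comprehensive MLLM-based agent for automatic world data synthesis. | large language model, multimodal
11 | MELLA: Bridging Linguistic Capability and Cultural Groundedness for Low-Resource Language MLLMs | MELLA: bridges the gap between linguistic capability and cultural groundedness for low-resource language MLLMs. | large language model, multimodal
12 | B4DL: A Benchmark for 4D LiDAR LLM in Spatio-Temporal Understanding | Proposes the B4DL benchmark for spatio-temporal understanding in 4D LiDAR LLMs. | large language model, multimodal
13 | Symmetry Understanding of 3D Shapes via Chirality Disentanglement | Proposes an unsupervised chirality feature extraction method based on the Diff3F framework for disentangling left-right symmetry in 3D shapes. | foundation model
14 | Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions | Proposes FIxLIP, which explains similarity in vision-language encoders using weighted Banzhaf interactions and outperforms first-order methods. | multimodal
15 | Segmenting the Complex and Irregular in Two-Phase Flows: A Real-World Empirical Study with SAM2 | Uses fine-tuned SAM2 to segment irregular bubbles in complex gas-liquid two-phase flows, addressing the limitations of traditional methods. | foundation model
16 | VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization | VFlowOpt: a visual information flow-guided token pruning framework for LMMs that improves inference efficiency. | multimodal
17 | IAD-R1: Reinforcing Consistent Reasoning in Industrial Anomaly Detection | Proposes the IAD-R1 framework to strengthen the reasoning consistency of vision-language models in industrial anomaly detection. | chain-of-thought
18 | Surformer v1: Transformer-Based Surface Classification Using Tactile and Vision Features | Proposes Surformer v1, which uses a Transformer to fuse tactile and vision features for surface classification. | multimodal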

🔬 Pillar 2: RL Algorithms & Architecture (10 papers)

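# | Title | One-sentence summary | Tags | 🔗
19 | RegionMed-CLIP: A Region-Aware Multimodal Contrastive Learning Pre-trained Model for Medical Image Understanding | Proposes RegionMed-CLIP, which improves medical image understanding through region-aware multimodal contrastive learning. | representation learning, contrastive learning, multimodal
20 | Multimodal Causal-Driven Representation Learning for Generalizable Medical Image Segmentation | Proposes the MCDRL framework, which uses causal inference and VLMs to improve the generalizability of medical image segmentation. | representation learning, multimodal
21 | ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking | Proposes ReasoningTrack, which uses chain-of-thought reasoning for long-term vision-language tracking. | reinforcement learning, chain-of-thought
22 | SPEX: A Vision-Language Model for Land Cover Extraction on Spectral Remote Sensing Images | SPEX: a vision-language model for land cover extraction from spectral remote sensing images. | visual pre-training, large language model, multimodal
23 | ImpliHateVid: A Benchmark Dataset and Two-stage Contrastive Learning Framework for Implicit Hate Speech Detection in Videos | Proposes the ImpliHateVid dataset and a two-stage contrastive learning framework for implicit hate speech detection in videos. | contrastive learning, multimodal
24 | Test-Time Reinforcement Learning for GUI Grounding via Region Consistency | Proposes a test-time reinforcement learning method based on region consistency to improve GUI element grounding accuracy. | reinforcement learning, consistency policy
25 | Synthetic Data Generation for Emotional Depth Faces: Optimizing Conditional DCGANs via Genetic Algorithms in the Latent Space and Stabilizing Training with Knowledge Distillation | Proposes a method for synthesizing emotional depth faces that optimizes conditional DCGANs with genetic algorithms and uses knowledge distillation. | distillation
26 | How and Why: Taming Flow Matching for Unsupervised Anomaly Detection and Localization | Proposes a Flow Matching-based method for unsupervised anomaly detection and localization that overcomes limits on model expressivity. | flow matching
27 | Latent Expression Generation for Referring Image Segmentation and Grounding | Proposes a visual grounding framework based on latent expression generation to improve referring image segmentation and grounding. | contrastive learning, visual grounding
28 | Revealing Latent Information: A Physics-inspired Self-supervised Pre-training Framework for Noisy and Sparse Events | Proposes a physics-inspired self-supervised pre-training framework to address sparsity and noise in event camera data. | contrastive learning, optical flow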

🔬 Pillar 3: Spatial Perception & Semantics (8 papers)

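# | Title | One-sentence summary | Tags | 🔗
29 | DART: Dual Adaptive Refinement Transfer for Open-Vocabulary Multi-Label Recognition | DART: a dual adaptive refinement transfer framework for open-vocabulary multi-label recognition. | open-vocabulary, open vocabulary, large language model
30 | Propagating Sparse Depth via Depth Foundation Model for Out-of-Distribution Depth Completion | Proposes a sparse depth propagation method based on a depth foundation model to improve the robustness of out-of-distribution depth completion. | depth estimation, monocular depth, foundation model
31 | 3DGabSplat: 3D Gabor Splatting for Frequency-adaptive Radiance Field Rendering | Proposes 3DGabSplat, which uses 3D Gabor primitives for frequency-adaptive radiance field rendering, improving detail and efficiency. | 3D gaussian splatting, 3DGS, gaussian splatting
32 | Textual Inversion for Efficient Adaptation of Open-Vocabulary Object Detectors Without Forgetting | Proposes a Textual Inversion method that efficiently adapts open-vocabulary object detectors while avoiding catastrophic forgetting. | open-vocabulary, open vocabulary
33 | UGOD: Uncertainty-Guided Differentiable Opacity and Soft Dropout for Enhanced Sparse-View 3DGS | UGOD: uncertainty-guided differentiable opacity and soft dropout for enhanced sparse-view 3DGS. | 3D gaussian splatting, 3DGS, gaussian splatting
34 | CF3: Compact and Fast 3D Feature Fields | CF3: a compact and fast method for building 3D Gaussian feature fields that improves efficiency while preserving geometric detail. | 3D gaussian splatting, 3DGS, gaussian splatting
35 | MZEN: Multi-Zoom Enhanced NeRF for 3-D Reconstruction with Unknown Camera Poses | MZEN: a multi-zoom enhanced NeRF that addresses 3D reconstruction with unknown camera poses in industrial inspection. | NeRF, neural radiance field
36 | GAP: Gaussianize Any Point Clouds with Text Guidance | GAP: Gaussianizes arbitrary point clouds with text guidance to generate high-quality 3D Gaussian models. | 3D gaussian splatting, 3DGS, gaussian splatting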

🔬 Pillar 4: Generative Motion (2 papers)

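# | Title | One-sentence summary | Tags | 🔗
37 | X-MoGen: Unified Motion Generation across Humans and Animals | X-MoGen: the first unified motion generation framework across humans and animals, improving motion realism and generalization. | text-driven motion, motion generation
38 | HOLODECK 2.0: Vision-Language-Guided 3D World Generation with Editing | Proposes HOLODECK 2.0 to address the challenges of 3D scene generation and editing. | physically plausible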

🔬 Pillar 8: Physics-based Animation (1 paper)

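# | Title | One-sentence summary | Tags | 🔗
39 | A Survey on Video Temporal Grounding with Multimodal Large Language Model | A survey of research progress on video temporal grounding with multimodal large language models. | spatiotemporal, large language model, multimodal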

🔬 Pillar 6: Video Extraction & Matching (1 paper)

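# | Title | One-sentence summary | Tags | 🔗
40 | MagicHOI: Leveraging 3D Priors for Accurate Hand-object Reconstruction from Short Monocular Video Clips | MagicHOI: uses 3D priors to accurately reconstruct hand-object interactions from short monocular video clips. | hand-object reconstruction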

🔬 Pillar 1: Robot Control (1 paper)

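# | Title | One-sentence summary | Tags | 🔗
41 | A Neurosymbolic Framework for Interpretable Cognitive Attack Detection in Augmented Reality | Proposes CADAR, a neurosymbolic framework for interpretable cognitive attack detection in augmented reality. | manipulation, multimodal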
