cs.CV (2025-05-21)

📊 50 papers in total | 🔗 18 with code

🎯 Navigation by interest area

Pillar 9: Embodied Foundation Models (20 🔗6) · Pillar 2: RL & Architecture (14 🔗6) · Pillar 3: Perception & Semantics (9 🔗5) · Pillar 1: Robot Control (5) · Pillar 5: Interaction & Reaction (1) · Pillar 4: Generative Motion (1 🔗1)

🔬 Pillar 9: Embodied Foundation Models (20 papers)

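# | Title | One-line summary | Tags | 🔗
1 | Towards Zero-Shot Differential Morphing Attack Detection with Multimodal Large Language Models | Zero-shot differential morphing attack detection using multimodal large language models | large language model multimodal chain-of-thought |
2 | LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models | LENS: multi-level evaluation of the multimodal reasoning ability of large language models | large language model multimodal |
3 | Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought | Reveals how visual thoughts work in multimodal chain-of-thought to improve the reasoning ability of LVLMs | multimodal chain-of-thought |
4 | CP-LLM: Context and Pixel Aware Large Language Model for Video Quality Assessment | CP-LLM: a context- and pixel-aware large language model for video quality assessment | large language model multimodal |
5 | Exploring The Visual Feature Space for Multimodal Neural Decoding | Proposes a zero-shot neural decoding method based on multimodal large language models that makes better use of the visual feature space | large language model multimodal |
6 | The P$^3$ dataset: Pixels, Points and Polygons for Multimodal Building Vectorization | Proposes the P³ dataset for multimodal building vectorization, combining pixel, point cloud, and polygon information | multimodal |
7 | Multimodal Conditional Information Bottleneck for Generalizable AI-Generated Image Detection | Proposes InfoFD, a multimodal conditional information bottleneck network, to improve the generalization of AI-generated image detection | multimodal |
8 | Seeing the Trees for the Forest: Rethinking Weakly-Supervised Medical Visual Grounding | Proposes Disease-Aware Prompting (DAP) to improve the accuracy of weakly supervised medical visual grounding | visual grounding |
9 | Make LVLMs Focus: Context-Aware Attention Modulation for Better Multimodal In-Context Learning | Proposes CAMA, which strengthens multimodal in-context learning in LVLMs through context-aware attention modulation | multimodal |
10 | Pixels Versus Priors: Controlling Knowledge Priors in Vision-Language Models through Visual Counterfacts | Proposes the Pixels Versus Priors method, which controls knowledge priors in vision-language models through visual counterfactuals | large language model multimodal |
11 | Analyzing Hierarchical Structure in Vision Models with Sparse Autoencoders | Uses sparse autoencoders to analyze hierarchical structure in vision models, revealing how ImageNet hierarchy information is encoded | large language model foundation model |
12 | Highlighting What Matters: Promptable Embeddings for Attribute-Focused Image Retrieval | Proposes promptable image embeddings for attribute-focused image retrieval | large language model multimodal |
13 | Prompt Tuning Vision Language Models with Margin Regularizer for Few-Shot Learning under Distribution Shifts | Proposes PromptMargin, which uses a multimodal margin regularizer to improve few-shot learning of vision-language models under distribution shifts | foundation model multimodal |
14 | Blind Spot Navigation: Evolutionary Discovery of Sensitive Semantic Concepts for LVLMs | Proposes a blind spot navigation method based on semantic evolution that discovers the sensitivity of LVLMs to specific semantic concepts | large language model multimodal |
15 | From Pixels to Images: Deep Learning Advances in Remote Sensing Image Semantic Segmentation | Surveys the applications of and advances in deep learning for remote sensing image semantic segmentation | foundation model multimodal |
16 | CineTechBench: A Benchmark for Cinematographic Technique Understanding and Generation | CineTechBench: a new benchmark for cinematographic technique understanding and generation | large language model multimodal |
17 | Streamline Without Sacrifice -- Squeeze out Computation Redundancy in LMM | ProxyV: reduces computation redundancy in LMMs through proxy vision tokens, improving efficiency | multimodal |
18 | Bridging Sign and Spoken Languages: Pseudo Gloss Generation for Sign Language Translation | Proposes a pseudo gloss generation framework that enables sign language translation without manual annotation | large language model |
19 | SCENIR: Visual Semantic Clarity through Unsupervised Scene Graph Retrieval | SCENIR: proposes a method for visual semantic clarity based on unsupervised scene graph retrieval | multimodal |
20 | How Do Large Vision-Language Models See Text in Image? Unveiling the Distinctive Role of OCR Heads | Reveals the role of OCR heads in LVLMs by analyzing how they recognize text in images | chain-of-thought |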

🔬 Pillar 2: RL & Architecture (14 papers)

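# | Title | One-line summary | Tags | 🔗
21 | MMaDA: Multimodal Large Diffusion Language Models | MMaDA: multimodal large diffusion language models, with a unified architecture that achieves strong performance across domains | reinforcement learning foundation model multimodal |
22 | PlantDreamer: Achieving Realistic 3D Plant Models with Diffusion-Guided Gaussian Splatting | PlantDreamer: diffusion-guided Gaussian splatting for realistic 3D plant modeling | dreamer gaussian splatting splatting |
23 | STAR-R1: Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs | STAR-R1: spatial transformation reasoning by reinforcing multimodal LLMs | reinforcement learning large language model multimodal |
24 | ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning | Proposes ViaRL, which performs adaptive temporal grounding through visual iterated amplification reinforcement learning to improve intention-driven video understanding | reinforcement learning large language model multimodal |
25 | CAD: A General Multimodal Framework for Video Deepfake Detection via Cross-Modal Alignment and Distillation | Proposes the CAD framework to address multimodal fusion in video deepfake detection | distillation multimodal |
26 | HAMF: A Hybrid Attention-Mamba Framework for Joint Scene Context Understanding and Future Motion Representation Learning | Proposes HAMF, a hybrid attention-Mamba framework that jointly understands scene context and learns future motion representations, improving motion prediction for autonomous driving | Mamba representation learning scene understanding |
27 | Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning | Proposes Pixel Reasoner, which uses curiosity-driven reinforcement learning to incentivize vision-language models to reason in pixel space | reinforcement learning large language model chain-of-thought |
28 | Discovering Pathology Rationale and Token Allocation for Efficient Multimodal Pathology Reasoning | Proposes a bilateral reinforcement learning framework that improves the accuracy and efficiency of multimodal pathology reasoning | reinforcement learning multimodal |
29 | Contrastive Learning-Enhanced Trajectory Matching for Small-Scale Dataset Distillation | Proposes contrastive learning-enhanced trajectory matching for small-scale dataset distillation | contrastive learning distillation |
30 | Visual Perturbation and Adaptive Hard Negative Contrastive Learning for Compositional Reasoning in Vision-Language Models | Proposes adaptive hard-negative perturbation learning to improve the performance of vision-language models on compositional reasoning tasks | contrastive learning multimodal |
31 | DeepKD: A Deeply Decoupled and Denoised Knowledge Distillation Trainer | DeepKD: improves model performance through deeply decoupled and denoised knowledge distillation | curriculum learning distillation |
32 | OViP: Online Vision-Language Preference Learning for VLM Hallucination | Proposes OViP, an online vision-language preference learning framework, to address hallucination in VLMs | preference learning |
33 | Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs | Proposes adaptive Chain-of-Focus (CoF) reasoning, which improves the efficiency of vision-language models through dynamic visual search and zooming | reinforcement learning multimodal |
34 | AuxDet: Auxiliary Metadata Matters for Omni-Domain Infrared Small Target Detection | AuxDet: uses auxiliary metadata for omni-domain infrared small target detection | representation learning multimodal |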

🔬 Pillar 3: Perception & Semantics (9 papers)

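# | Title | One-line summary | Tags | 🔗
35 | MonoSplat: Generalizable 3D Gaussian Splatting from Monocular Depth Foundation Models | MonoSplat: generalizable 3D Gaussian splatting using monocular depth foundation models | monocular depth 3D gaussian splatting gaussian splatting |
36 | RUSplatting: Robust 3D Gaussian Splatting for Sparse-View Underwater Scene Reconstruction | RUSplatting: a robust 3D Gaussian splatting method for sparse-view underwater scene reconstruction | 3D gaussian splatting gaussian splatting splatting |
37 | RAZER: Robust Accelerated Zero-Shot 3D Open-Vocabulary Panoptic Reconstruction with Spatio-Temporal Aggregation | RAZER: robust, accelerated zero-shot 3D open-vocabulary panoptic reconstruction based on spatio-temporal aggregation | scene understanding semantic mapping semantic map |
38 | GS2E: Gaussian Splatting is an Effective Data Generator for Event Stream Generation | GS2E: uses Gaussian splatting to generate a high-quality event stream dataset, improving performance on event vision tasks | 3D gaussian splatting gaussian splatting splatting |
39 | ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation | ViQAgent: a zero-shot video question answering agent based on open-vocabulary grounding validation | open-vocabulary open vocabulary chain-of-thought |
40 | R3GS: Gaussian Splatting for Robust Reconstruction and Relocalization in Unconstrained Image Collections | R3GS: a Gaussian splatting method for robust reconstruction and relocalization in unconstrained image collections | 3DGS gaussian splatting splatting |
41 | GT2-GS: Geometry-aware Texture Transfer for Gaussian Splatting | GT2-GS: proposes a geometry-aware texture transfer framework that improves the quality and controllability of texture transfer for Gaussian splatting | gaussian splatting splatting |
42 | InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition | Proposes InstructSAM, a training-free framework for instruction-oriented object recognition in remote sensing imagery | open-vocabulary open vocabulary visual grounding |
43 | DC-Scene: Data-Centric Learning for 3D Scene Understanding | Proposes DC-Scene, a data-centric learning framework that improves the efficiency and performance of 3D scene understanding | scene understanding |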

🔬 Pillar 1: Robot Control (5 papers)

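# | Title | One-line summary | Tags | 🔗
44 | Leveraging Foundation Models for Multimodal Graph-Based Action Recognition | Proposes an action recognition framework based on dynamic multimodal graphs that incorporates pretrained models to improve fine-grained manipulation recognition | manipulation bi-manual bimanual manipulation |
45 | Seeing Through Deception: Uncovering Misleading Creator Intent in Multimodal News with Vision-Language Models | Proposes the DeceptionDecoded benchmark, revealing the limitations of vision-language models in understanding deceptive creator intent in multimodal news | manipulation multimodal |
46 | AvatarShield: Visual Reinforcement Learning for Human-Centric Synthetic Video Detection | AvatarShield: proposes a human-centric synthetic video detection framework based on visual reinforcement learning | manipulation reinforcement learning multimodal |
47 | Can VLMs Detect and Localize Fine-Grained AI-Edited Images? | Proposes the FragFake benchmark to study the ability of vision-language models to detect and localize AI-edited images | manipulation multimodal |
48 | Mouse Lockbox Dataset: Behavior Recognition for Mice Solving Lockboxes | Releases a dataset of mouse puzzle-solving behavior to support behavior recognition research in computational neuroscience | manipulation |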

🔬 Pillar 5: Interaction & Reaction (1 paper)

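# | Title | One-line summary | Tags | 🔗
49 | Parameter-Efficient Fine-Tuning of Multispectral Foundation Models for Hyperspectral Image Classification | Proposes KronA+, a method for efficiently fine-tuning the multispectral pretrained model SpectralGPT for hyperspectral image classification | HSI foundation model |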

🔬 Pillar 4: Generative Motion (1 paper)

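# | Title | One-line summary | Tags | 🔗
50 | Intentional Gesture: Deliver Your Intentions with Gestures for Speech | Proposes the Intentional-Gesture framework, which improves co-speech gesture generation quality through intention reasoning | motion tokenizer embodied AI |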
