cs.CV (2025-08-07)

📊 41 papers in total | 🔗 10 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (18 🔗4) · Pillar 2: RL & Architecture (10 🔗3) · Pillar 3: Perception & Semantics (8 🔗2) · Pillar 4: Generative Motion (2) · Pillar 8: Physics-based Animation (1 🔗1) · Pillar 6: Video Extraction (1) · Pillar 1: Robot Control (1)

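🔬 Pillar 9: Embodied Foundation Models (18 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
1	LLaVA-RE: Binary Image-Text Relevancy Evaluation with Multimodal Large Language Model	Proposes LLaVA-RE, which uses a multimodal large language model for binary image-text relevancy evaluation.	large language model multimodal
2	PhysPatch: A Physically Realizable and Transferable Adversarial Patch Attack for Multimodal Large Language Models-based Autonomous Driving Systems	PhysPatch: a physically realizable and transferable adversarial patch attack on autonomous driving systems built on multimodal large language models.	large language model multimodal
3	Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision	Proposes Uni-CoT for unified chain-of-thought reasoning across text and vision, achieving SOTA performance on multimodal tasks.	large language model multimodal chain-of-thought	✅
4	AI vs. Human Moderators: A Comparative Evaluation of Multimodal LLMs in Content Moderation for Brand Safety	Evaluates multimodal LLMs on content moderation for brand safety, comparing AI with human moderators.	large language model multimodal
5	mKG-RAG: Multimodal Knowledge Graph-Enhanced RAG for Visual Question Answering	Proposes mKG-RAG, which augments RAG with multimodal knowledge graphs to improve visual question answering.	large language model multimodal
6	Finding Needles in Images: Can Multimodal LLMs Locate Fine Details?	Proposes the NiM benchmark and the Spot-IT method to improve the ability of multimodal LLMs to locate fine-grained details in complex documents.	large language model multimodal
7	MedPatch: Confidence-Guided Multi-Stage Fusion for Multimodal Clinical Data	MedPatch: a confidence-guided multi-stage fusion method for multimodal clinical data analysis.	multimodal
8	AdaFusion: Prompt-Guided Inference with Adaptive Fusion of Pathology Foundation Models	AdaFusion: a prompt-guided adaptive fusion method for pathology foundation models.	foundation model
9	A Context-aware Attention and Graph Neural Network-based Multimodal Framework for Misogyny Detection	Proposes a multimodal framework based on context-aware attention and graph neural networks for misogyny detection.	multimodal
10	Follow-Your-Instruction: A Comprehensive MLLM Agent for World Data Synthesis	Proposes Follow-Your-Instruction, a comprehensive MLLM-based agent for automatic world data synthesis.	large language model multimodal
11	MELLA: Bridging Linguistic Capability and Cultural Groundedness for Low-Resource Language MLLMs	MELLA: bridges the gap between linguistic capability and cultural groundedness for low-resource language MLLMs.	large language model multimodal
12	B4DL: A Benchmark for 4D LiDAR LLM in Spatio-Temporal Understanding	Proposes the B4DL benchmark for spatio-temporal understanding with 4D LiDAR LLMs.	large language model multimodal	✅
13	Symmetry Understanding of 3D Shapes via Chirality Disentanglement	Proposes an unsupervised chirality feature extraction method based on the Diff3F framework for disentangling left-right symmetry in 3D shapes.	foundation model	✅
14	Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions	Proposes FIxLIP, which explains similarity in vision-language encoders using weighted Banzhaf interactions and outperforms first-order methods.	multimodal
15	Segmenting the Complex and Irregular in Two-Phase Flows: A Real-World Empirical Study with SAM2	Uses fine-tuned SAM2 to segment irregular bubbles in complex gas-liquid two-phase flows, addressing the limitations of traditional methods.	foundation model
16	VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization	VFlowOpt: a token pruning framework for large multimodal models guided by visual information flow, improving inference efficiency.	multimodal
17	IAD-R1: Reinforcing Consistent Reasoning in Industrial Anomaly Detection	Proposes the IAD-R1 framework to strengthen the reasoning consistency of vision-language models in industrial anomaly detection.	chain-of-thought	✅
18	Surformer v1: Transformer-Based Surface Classification Using Tactile and Vision Features	Proposes Surformer v1, which uses a Transformer to fuse tactile and vision features for surface classification.	multimodal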

🔬 Pillar 2: RL & Architecture (10 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
19	RegionMed-CLIP: A Region-Aware Multimodal Contrastive Learning Pre-trained Model for Medical Image Understanding	Proposes RegionMed-CLIP, which improves medical image understanding through region-aware multimodal contrastive learning.	representation learning contrastive learning multimodal
20	Multimodal Causal-Driven Representation Learning for Generalizable Medical Image Segmentation	Proposes the MCDRL framework, which uses causal inference and VLMs to improve the generalization of medical image segmentation.	representation learning multimodal
21	ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking	Proposes ReasoningTrack, which uses chain-of-thought reasoning to tackle long-term vision-language tracking.	reinforcement learning chain-of-thought	✅
22	SPEX: A Vision-Language Model for Land Cover Extraction on Spectral Remote Sensing Images	SPEX: a vision-language model for land cover extraction from spectral remote sensing images.	visual pre-training large language model multimodal	✅
23	ImpliHateVid: A Benchmark Dataset and Two-stage Contrastive Learning Framework for Implicit Hate Speech Detection in Videos	Proposes the ImpliHateVid dataset and a two-stage contrastive learning framework for implicit hate speech detection in videos.	contrastive learning multimodal
24	Test-Time Reinforcement Learning for GUI Grounding via Region Consistency	Proposes a test-time reinforcement learning method based on region consistency to improve GUI element grounding accuracy.	reinforcement learning consistency policy
25	Synthetic Data Generation for Emotional Depth Faces: Optimizing Conditional DCGANs via Genetic Algorithms in the Latent Space and Stabilizing Training with Knowledge Distillation	Proposes a method for synthesizing emotional depth faces that optimizes conditional DCGANs with genetic algorithms and uses knowledge distillation.	distillation
26	How and Why: Taming Flow Matching for Unsupervised Anomaly Detection and Localization	Proposes a flow matching-based method for unsupervised anomaly detection and localization that overcomes limits on model expressiveness.	flow matching
27	Latent Expression Generation for Referring Image Segmentation and Grounding	Proposes a visual grounding framework based on latent expression generation, improving referring image segmentation and grounding performance.	contrastive learning visual grounding
28	Revealing Latent Information: A Physics-inspired Self-supervised Pre-training Framework for Noisy and Sparse Events	Proposes a physics-inspired self-supervised pre-training framework to address sparsity and noise in event camera data.	contrastive learning optical flow	✅

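🔬 Pillar 3: Perception & Semantics (8 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
29	DART: Dual Adaptive Refinement Transfer for Open-Vocabulary Multi-Label Recognition	DART: a dual adaptive refinement transfer framework for open-vocabulary multi-label recognition.	open-vocabulary open vocabulary large language model
30	Propagating Sparse Depth via Depth Foundation Model for Out-of-Distribution Depth Completion	Proposes a sparse depth propagation method based on a depth foundation model, improving the robustness of out-of-distribution depth completion.	depth estimation monocular depth foundation model	✅
31	3DGabSplat: 3D Gabor Splatting for Frequency-adaptive Radiance Field Rendering	Proposes 3DGabSplat, which uses 3D Gabor primitives for frequency-adaptive radiance field rendering, improving detail and efficiency.	3D gaussian splatting 3DGS gaussian splatting
32	Textual Inversion for Efficient Adaptation of Open-Vocabulary Object Detectors Without Forgetting	Proposes a Textual Inversion method that efficiently adapts open-vocabulary object detectors while avoiding catastrophic forgetting.	open-vocabulary open vocabulary
33	UGOD: Uncertainty-Guided Differentiable Opacity and Soft Dropout for Enhanced Sparse-View 3DGS	UGOD: uncertainty-guided differentiable opacity and soft dropout for enhanced sparse-view 3DGS.	3D gaussian splatting 3DGS gaussian splatting
34	CF3: Compact and Fast 3D Feature Fields	CF3: a compact and fast method for constructing 3D Gaussian feature fields that improves efficiency while preserving geometric detail.	3D gaussian splatting 3DGS gaussian splatting
35	MZEN: Multi-Zoom Enhanced NeRF for 3-D Reconstruction with Unknown Camera Poses	MZEN: a multi-zoom enhanced NeRF that addresses industrial inspection challenges in 3D reconstruction with unknown camera poses.	NeRF neural radiance field
36	GAP: Gaussianize Any Point Clouds with Text Guidance	GAP: Gaussianizes arbitrary point clouds with text guidance to generate high-quality 3D Gaussian models.	3D gaussian splatting 3DGS gaussian splatting	✅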

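🔬 Pillar 4: Generative Motion (2 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
37	X-MoGen: Unified Motion Generation across Humans and Animals	X-MoGen: the first unified motion generation framework across humans and animals, improving motion realism and generalization.	text-driven motion motion generation
38	HOLODECK 2.0: Vision-Language-Guided 3D World Generation with Editing	Proposes HOLODECK 2.0 to address the challenges of 3D scene generation and editing.	physically plausible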

🔬 Pillar 8: Physics-based Animation (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
39	A Survey on Video Temporal Grounding with Multimodal Large Language Model	Survey: research progress on video temporal grounding with multimodal large language models.	spatiotemporal large language model multimodal	✅

🔬 Pillar 6: Video Extraction (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
40	MagicHOI: Leveraging 3D Priors for Accurate Hand-object Reconstruction from Short Monocular Video Clips	MagicHOI: uses 3D priors to accurately reconstruct hand-object interactions from short monocular video clips.	hand-object reconstruction

🔬 Pillar 1: Robot Control (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
41	A Neurosymbolic Framework for Interpretable Cognitive Attack Detection in Augmented Reality	Proposes CADAR, a neurosymbolic framework for interpretable cognitive attack detection in augmented reality.	manipulation multimodal
