cs.CV (2025-07-03)

📊 30 papers in total | 🔗 9 with code

🎯 Browse by interest area

Pillar 2: RL & Architecture (9 🔗2) · Pillar 3: Perception & Semantics (9 🔗4) · Pillar 9: Embodied Foundation Models (7 🔗2) · Pillar 6: Video Extraction (2 🔗1) · Pillar 1: Robot Control (1) · Pillar 8: Physics-based Animation (1) · Pillar 4: Generative Motion (1)

🔬 Pillar 2: RL & Architecture (9 papers)

#	Title	Key point	Tags	🔗	⭐
1	Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation	Proposes GCoT, which injects grounding information to improve the data efficiency of MLLMs on specialized vision tasks	distillation, large language model, multimodal
2	AIGI-Holmes: Towards Explainable and Generalizable AI-Generated Image Detection via Multimodal Large Language Models	AIGI-Holmes: explainable and generalizable AI-generated image detection via multimodal large language models	direct preference optimization, large language model, multimodal
3	Confidence-driven Gradient Modulation for Multimodal Human Activity Recognition: A Dynamic Contrastive Dual-Path Learning Approach	Proposes a dynamic contrastive dual-path learning network with confidence-driven gradient modulation for multimodal human activity recognition	contrastive learning, multimodal
4	FMOcc: TPV-Driven Flow Matching for 3D Occupancy Prediction with Selective State Space Model	FMOcc: 3D occupancy prediction based on TPV and flow matching, improving prediction accuracy in few-frame settings	flow matching, SSM, state space model
5	Linear Attention with Global Context: A Multipole Attention Mechanism for Vision and Physics	Proposes MANO, a linear attention mechanism based on multipole expansion, for vision and physics simulation tasks	linear attention, MANO	✅
6	Learning few-step posterior samplers by unfolding and distillation of diffusion models	Learns few-step posterior samplers by unfolding and distilling diffusion models	distillation
7	Temporally-Aware Supervised Contrastive Learning for Polyp Counting in Colonoscopy	Proposes temporally-aware supervised contrastive learning for polyp counting in colonoscopy	contrastive learning	✅
8	Weakly-supervised Contrastive Learning with Quantity Prompts for Moving Infrared Small Target Detection	Proposes a weakly supervised contrastive learning method with quantity prompts for moving infrared small target detection	contrastive learning
9	Spotlighting Partially Visible Cinematic Language for Video-to-Audio Generation via Self-distillation	Proposes a self-distillation method to handle partially visible cinematic language in video-to-audio generation	distillation

🔬 Pillar 3: Perception & Semantics (9 papers)

#	Title	Key point	Tags	🔗	⭐
10	HyperGaussians: High-Dimensional Gaussian Splatting for High-Fidelity Animatable Face Avatars	Proposes HyperGaussians, an extension of 3D Gaussian Splatting for high-fidelity animatable face avatars	3D gaussian splatting, 3DGS, gaussian splatting
11	LocalDyGS: Multi-view Global Dynamic Scene Modeling via Adaptive Local Implicit Feature Decoupling	LocalDyGS: multi-view global dynamic scene modeling via adaptive local implicit feature decoupling	3D gaussian splatting, gaussian splatting, splatting	✅
12	LiteReality: Graphics-Ready 3D Scene Reconstruction from RGB-D Scans	LiteReality: reconstructs interactive, graphics-ready 3D scenes from RGB-D scans	scene reconstruction, scene understanding
13	LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion	Proposes LangScene-X, which reconstructs generalizable 3D language-embedded scenes with TriMap video diffusion	scene understanding, open-vocabulary, open vocabulary	✅
14	SIU3R: Simultaneous Scene Understanding and 3D Reconstruction Beyond Feature Alignment	Proposes SIU3R, a framework for simultaneous scene understanding and 3D reconstruction that does not require feature alignment	scene understanding	✅
15	MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details	MoGe-2: an accurate monocular geometry estimation model that recovers 3D point clouds of a scene with metric scale and sharp details	MoGe
16	Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory	Point3R: streaming 3D reconstruction with an explicit spatial pointer memory	scene reconstruction	✅
17	From Pixels to Damage Severity: Estimating Earthquake Impacts Using Semantic Segmentation of Social Media Images	Proposes a SegFormer-based semantic segmentation method for assessing earthquake damage severity from social media images	depth estimation
18	Flow-CDNet: A Novel Network for Detecting Both Slow and Fast Changes in Bitemporal Images	Flow-CDNet: a novel network for detecting both slow and fast changes in bitemporal images	optical flow

🔬 Pillar 9: Embodied Foundation Models (7 papers)

#	Title	Key point	Tags	🔗	⭐
19	LaCo: Efficient Layer-wise Compression of Visual Tokens for Multimodal Large Language Models	Proposes LaCo for efficient layer-wise compression of visual tokens in multimodal large language models	large language model, multimodal
20	SurgVisAgent: Multimodal Agentic Model for Versatile Surgical Visual Enhancement	SurgVisAgent: a multimodal agentic model for versatile surgical visual enhancement	large language model, multimodal, chain-of-thought
21	From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding	Proposes the HIVE framework, which uses multimodal narrative understanding to automatically edit long videos into engaging short clips	large language model, multimodal
22	Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection	Proposes the VisCo attack, which jailbreaks multimodal large language models through image-driven context injection	large language model, multimodal	✅
23	Prompt learning with bounding box constraints for medical image segmentation	Proposes a prompt learning method with bounding box constraints for medical image segmentation	foundation model, multimodal	✅
24	Prompt Disentanglement via Language Guidance and Representation Alignment for Domain Generalization	Proposes a prompt disentanglement method based on language guidance and representation alignment to improve domain generalization	large language model, foundation model
25	Intelligent Histology for Tumor Neurosurgery	Intelligent histology: combines AI with stimulated Raman histology to transform real-time intraoperative analysis in tumor neurosurgery	foundation model, multimodal

🔬 Pillar 6: Video Extraction (2 papers)

#	Title	Key point	Tags	🔗	⭐
26	No time to train! Training-Free Reference-Based Instance Segmentation	Proposes a training-free, reference-based instance segmentation method that uses semantic priors for efficient segmentation	feature matching, foundation model
27	CrowdTrack: A Benchmark for Difficult Multiple Pedestrian Tracking in Real Scenarios	Proposes the CrowdTrack dataset for multi-pedestrian tracking in complex scenes	first-person view, foundation model	✅

🔬 Pillar 1: Robot Control (1 paper)

#	Title	Key point	Tags	🔗	⭐
28	DexVLG: Dexterous Vision-Language-Grasp Model at Scale	DexVLG: a large-scale dexterous-hand vision-language-grasp model for instruction-driven, part-level grasping	dexterous hand, flow matching, vision-language-action

🔬 Pillar 8: Physics-based Animation (1 paper)

#	Title	Key point	Tags	🔗	⭐
29	USAD: End-to-End Human Activity Recognition via Diffusion Model with Spatiotemporal Attention	Proposes USAD, an end-to-end human activity recognition method using a diffusion model with spatiotemporal attention	spatiotemporal

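🔬 Pillar 4: Generative Motion (1 paper)

#	Title	Key point	Tags	🔗	⭐
30	Reconstructing Close Human Interaction with Appearance and Proxemics Reasoning	Proposes an interaction reconstruction method based on appearance and proxemics reasoning to address human interaction pose estimation in complex scenes	penetration, foundation model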
