cs.CV (2025-09-19)

📊 41 papers in total | 🔗 6 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (14 🔗1)
Pillar 3: Spatial Perception & Semantics (13 🔗3)
Pillar 2: RL Algorithms & Architecture (10 🔗1)
Pillar 7: Motion Retargeting (2 🔗1)
Pillar 8: Physics-based Animation (1)
Pillar 6: Video Extraction & Matching (1)

🔬 Pillar 9: Embodied Foundation Models (14 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	Pointing to a Llama and Call it a Camel: On the Sycophancy of Multimodal Large Language Models	Addresses visual sycophancy in multimodal large language models by proposing Sycophantic Reflective Tuning.	large language model multimodal
2	Language-Instructed Reasoning for Group Activity Detection via Multimodal Large Language Model	Proposes LIR-GAD, which uses a multimodal large language model for language-instructed group activity detection.	large language model multimodal
3	TennisTV: Do Multimodal Large Language Models Understand Tennis Rallies?	TennisTV: the first benchmark for tennis video understanding, evaluating multimodal large language models on fast-motion scenes.	large language model multimodal
4	MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer	Manzano: a simple and scalable unified multimodal model built on a hybrid vision tokenizer.	large language model multimodal
5	ENSAM: an efficient foundation model for interactive segmentation of 3D medical images	ENSAM: an efficient foundation model for interactive segmentation of 3D medical images.	foundation model multimodal
6	Improving Autism Detection with Multimodal Behavioral Analysis	Proposes an autism detection method based on multimodal behavioral analysis that improves diagnostic accuracy.	multimodal
7	Qianfan-VL: Domain-Enhanced Universal Vision-Language Models	Proposes Qianfan-VL, a family of multimodal large language models that reach leading performance through domain enhancement.	large language model multimodal chain-of-thought
8	Multimodal Learning for Fake News Detection in Short Videos Using Linguistically Verified Data and Heterogeneous Modality Fusion	Proposes the heterogeneous fusion network HFN for fake news detection in short videos, making better use of multimodal information.	multimodal
9	EyePCR: A Comprehensive Benchmark for Fine-Grained Perception, Knowledge Comprehension and Clinical Reasoning in Ophthalmic Surgery	EyePCR: a comprehensive benchmark for fine-grained perception, knowledge comprehension, and clinical reasoning in ophthalmic surgery.	large language model multimodal
10	AutoArabic: A Three-Stage Framework for Localizing Video-Text Retrieval Benchmarks	AutoArabic: a three-stage framework for localizing video-text retrieval benchmarks into Arabic.	large language model	✅
11	Robust Vision-Language Models via Tensor Decomposition: A Defense Against Adversarial Attacks	Proposes a lightweight defense based on tensor decomposition that improves the robustness of vision-language models against adversarial attacks.	multimodal
12	Pyramid Token Pruning for High-Resolution Large Vision-Language Models via Region, Token, and Instruction-Guided Importance	Proposes Pyramid Token Pruning (PTP) to reduce the excessive computational cost of high-resolution large vision-language models.	multimodal
13	Enhancing Sa2VA for Referent Video Object Segmentation: 2nd Solution for 7th LSVOS RVOS Track	Proposes a Video-Language Checker and a Key-Frame Sampler that significantly improve Sa2VA on referring video object segmentation.	large language model
14	Lynx: Towards High-Fidelity Personalized Video Generation	Lynx: a high-fidelity personalized video generation model driven by a single image.	foundation model

🔬 Pillar 3: Spatial Perception & Semantics (13 papers)

#	Title	One-line summary	Tags	🔗	⭐
15	MS-GS: Multi-Appearance Sparse-View 3D Gaussian Splatting in the Wild	Proposes MS-GS, which uses multi-appearance 3D Gaussian splatting for sparse-view scene reconstruction in the wild.	depth estimation monocular depth 3D gaussian splatting
16	Zero-Shot Visual Grounding in 3D Gaussians via View Retrieval	Proposes GVR, which achieves zero-shot visual grounding in 3D Gaussian scenes via view retrieval.	3D gaussian splatting 3DGS gaussian splatting	✅
17	FingerSplat: Contactless Fingerprint 3D Reconstruction and Generation based on 3D Gaussian Splatting	Proposes a contactless fingerprint 3D reconstruction and generation method based on 3D Gaussian splatting.	3D gaussian splatting gaussian splatting splatting
18	GS-Scale: Unlocking Large-Scale 3D Gaussian Splatting Training via Host Offloading	GS-Scale: unlocks large-scale 3D Gaussian splatting training via host offloading.	3D gaussian splatting gaussian splatting splatting
19	Sparse Multiview Open-Vocabulary 3D Detection	Proposes a sparse multi-view open-vocabulary 3D detection method that requires no 3D training and reports strong performance.	open-vocabulary open vocabulary foundation model
20	StereoAdapter: Adapting Stereo Depth Estimation to Underwater Scenes	StereoAdapter: a framework for adapting stereo depth estimation to underwater scenes.	depth estimation stereo depth metric depth	✅
21	RangeSAM: On the Potential of Visual Foundation Models for Range-View represented LiDAR segmentation	RangeSAM: explores the potential of visual foundation models for range-view LiDAR segmentation.	scene understanding foundation model multimodal
22	Shedding Light on Depth: Explainability Assessment in Monocular Depth Estimation	Studies explainability in monocular depth estimation, improving model transparency through perturbation analysis and fidelity evaluation.	depth estimation monocular depth
23	Towards Sharper Object Boundaries in Self-Supervised Depth Estimation	Proposes self-supervised depth estimation based on mixture distributions, significantly sharpening object boundaries.	depth estimation monocular depth scene understanding
24	Camera Splatting for Continuous View Optimization	Proposes Camera Splatting, which achieves high-quality novel view synthesis through continuous view optimization.	3D gaussian splatting gaussian splatting splatting
25	3D Gaussian Flats: Hybrid 2D/3D Photometric Scene Reconstruction	Proposes a hybrid 2D/3D planar Gaussian representation that improves 3D reconstruction quality in textureless scenes.	depth estimation scene reconstruction
26	RadarGaussianDet3D: An Efficient and Effective Gaussian-based 3D Detector with 4D Automotive Radars	RadarGaussianDet3D: an efficient Gaussian-based 3D object detector for 4D millimeter-wave radar.	3D gaussian splatting 3DGS gaussian splatting
27	Global Regulation and Excitation via Attention Tuning for Stereo Matching	Proposes the GREAT framework, which uses attention to strengthen global context in stereo matching and improves matching accuracy in ill-posed regions.	scene flow	✅

🔬 Pillar 2: RL Algorithms & Architecture (10 papers)

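#	Title	One-line summary	Tags	🔗	⭐
28	DistillMatch: Leveraging Knowledge Distillation from Vision Foundation Model for Multimodal Image Matching	Proposes DistillMatch, which distills knowledge from a vision foundation model for multimodal image matching.	distillation foundation model multimodal
29	BaseReward: A Strong Baseline for Multimodal Reward Model	BaseReward: a strong baseline for multimodal reward models, offering an effective approach to MLLM alignment.	reinforcement learning RLHF large language model
30	UNIV: Unified Foundation Model for Infrared and Visible Modalities	Proposes UNIV, which uses cross-modal contrastive learning to address modality bias in infrared-visible fusion.	contrastive learning foundation model
31	DC-Mamba: Bi-temporal deformable alignment and scale-sparse enhancement for remote sensing change detection	DC-Mamba: proposes bi-temporal deformable alignment and scale-sparse enhancement for remote sensing change detection.	Mamba SSM state space model
32	Random Direct Preference Optimization for Radiography Report Generation	Proposes a chest radiograph report generation method based on Random Direct Preference Optimization, improving clinical metrics.	DPO direct preference optimization large language model
33	Robust Object Detection for Autonomous Driving via Curriculum-Guided Group Relative Policy Optimization	Proposes curriculum-guided Group Relative Policy Optimization to improve the robustness of object detection for autonomous driving.	reinforcement learning reward design large language model
34	SAMPO:Scale-wise Autoregression with Motion PrOmpt for generative world models	SAMPO: a scale-wise autoregressive generative world model with motion prompts that improves video prediction quality and inference efficiency.	world model scene understanding spatiotemporal
35	ChronoForge-RL: Chronological Forging through Reinforcement Learning for Enhanced Video Understanding	ChronoForge-RL: enhances video understanding through chronological forging with reinforcement learning.	reinforcement learning contrastive learning distillation
36	Enhancing WSI-Based Survival Analysis with Report-Auxiliary Self-Distillation	Proposes the Rasa framework, which uses report-auxiliary self-distillation to enhance WSI-based survival analysis and improve cancer prognosis prediction.	distillation large language model	✅
37	BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent	Proposes BTL-UI, a model that mimics human cognitive processes to improve the interaction capability of GUI agents.	reinforcement learning large language model multimodal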

🔬 Pillar 7: Motion Retargeting (2 papers)

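#	Title	One-line summary	Tags	🔗	⭐
38	See&Trek: Training-Free Spatial Prompting for Multimodal Large Language Model	Proposes SEE&TREK, which enhances the spatial understanding of multimodal large language models under vision-only input.	motion reconstruction large language model multimodal
39	Enriched Feature Representation and Motion Prediction Module for MOSEv2 Track of 7th LSVOS Challenge: 3rd Place Solution	Proposes SCOPE, which combines the strengths of SAM2 and Cutie to improve the robustness of video object segmentation.	motion prediction	✅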

🔬 Pillar 8: Physics-based Animation (1 paper)

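#	Title	One-line summary	Tags	🔗	⭐
40	SGMAGNet: A Baseline Model for 3D Cloud Phase Structure Reconstruction on a New Passive Active Satellite Benchmark	SGMAGNet: a baseline model for 3D cloud phase structure reconstruction on a new passive-active satellite benchmark.	spatiotemporal multimodal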

🔬 Pillar 6: Video Extraction & Matching (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
41	Simulated Cortical Magnification Supports Self-Supervised Object Learning	Simulated cortical magnification improves self-supervised object learning.	egocentric
