cs.CV (2025-09-19)

📊 40 papers in total | 🔗 5 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (15 🔗1) · Pillar 3: Perception & Semantics (13 🔗3) · Pillar 2: RL & Architecture (10 🔗1) · Pillar 8: Physics-based Animation (1) · Pillar 6: Video Extraction (1)

🔬 Pillar 9: Embodied Foundation Models (15 papers)

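| # | Title | Key point | Tags | 🔗 |
|---|-------|-----------|------|----|
| 1 | Pointing to a Llama and Call it a Camel: On the Sycophancy of Multimodal Large Language Models | Proposes SRT, a self-reflective fine-tuning method, to address visual sycophancy in multimodal large language models. | large language model, multimodal | |
| 2 | See&Trek: Training-Free Spatial Prompting for Multimodal Large Language Model | Proposes SEE&TREK, a training-free spatial prompting framework that improves the visual-spatial understanding of MLLMs. | large language model, multimodal | |
| 3 | Language-Instructed Reasoning for Group Activity Detection via Multimodal Large Language Model | Proposes LIR-GAD, which uses a multimodal large language model for language-instructed group activity detection. | large language model, multimodal | |
| 4 | TennisTV: Do Multimodal Large Language Models Understand Tennis Rallies? | Proposes the TennisTV benchmark to evaluate multimodal large language models on tennis video understanding. | large language model, multimodal | |
| 5 | MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer | Manzano: a simple, scalable unified multimodal model built on hybrid vision tokens. | large language model, multimodal | |
| 6 | ENSAM: an efficient foundation model for interactive segmentation of 3D medical images | ENSAM: an efficient foundation model for interactive segmentation of 3D medical images. | foundation model, multimodal | |
| 7 | Improving Autism Detection with Multimodal Behavioral Analysis | Proposes an autism detection method based on multimodal behavioral analysis, improving diagnostic accuracy. | multimodal | |
| 8 | Qianfan-VL: Domain-Enhanced Universal Vision-Language Models | Proposes Qianfan-VL, which uses domain-enhancement techniques to build a leading multimodal large language model. | large language model, multimodal, chain-of-thought | |
| 9 | Multimodal Learning for Fake News Detection in Short Videos Using Linguistically Verified Data and Heterogeneous Modality Fusion | Proposes HFN, a heterogeneous fusion network for fake news detection in short videos that makes better use of multimodal information. | multimodal | |
| 10 | EyePCR: A Comprehensive Benchmark for Fine-Grained Perception, Knowledge Comprehension and Clinical Reasoning in Ophthalmic Surgery | EyePCR: a comprehensive benchmark for fine-grained perception, knowledge comprehension, and clinical reasoning in ophthalmic surgery. | large language model, multimodal | |
| 11 | AutoArabic: A Three-Stage Framework for Localizing Video-Text Retrieval Benchmarks | AutoArabic: proposes a three-stage framework for localizing video-text retrieval benchmarks into Arabic. | large language model | |
| 12 | Robust Vision-Language Models via Tensor Decomposition: A Defense Against Adversarial Attacks | Proposes a lightweight tensor decomposition method to strengthen the robustness of vision-language models. | multimodal | |
| 13 | Pyramid Token Pruning for High-Resolution Large Vision-Language Models via Region, Token, and Instruction-Guided Importance | Proposes the Pyramid Token Pruning (PTP) strategy to reduce the computational overhead of high-resolution large vision-language models. | multimodal | |
| 14 | Enhancing Sa2VA for Referent Video Object Segmentation: 2nd Solution for 7th LSVOS RVOS Track | For referring video object segmentation, proposes a video-language checker and a key-frame sampling method that significantly improve Sa2VA's performance. | large language model | |
| 15 | Lynx: Towards High-Fidelity Personalized Video Generation | Lynx: a diffusion Transformer model for high-fidelity personalized video generation. | foundation model | |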

🔬 Pillar 3: Perception & Semantics (13 papers)

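| # | Title | Key point | Tags | 🔗 |
|---|-------|-----------|------|----|
| 16 | MS-GS: Multi-Appearance Sparse-View 3D Gaussian Splatting in the Wild | Proposes MS-GS, which reconstructs in-the-wild scenes using multi-appearance sparse-view 3D Gaussian splatting. | depth estimation, monocular depth, 3D gaussian splatting | |
| 17 | Zero-Shot Visual Grounding in 3D Gaussians via View Retrieval | Proposes GVR, which achieves zero-shot visual grounding in 3D Gaussian scenes via view retrieval. | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 18 | FingerSplat: Contactless Fingerprint 3D Reconstruction and Generation based on 3D Gaussian Splatting | FingerSplat: contactless fingerprint 3D reconstruction and generation based on 3D Gaussian splatting. | 3D gaussian splatting, gaussian splatting, splatting | |
| 19 | GS-Scale: Unlocking Large-Scale 3D Gaussian Splatting Training via Host Offloading | GS-Scale: unlocks large-scale 3D Gaussian splatting training via host offloading. | 3D gaussian splatting, gaussian splatting, splatting | |
| 20 | Sparse Multiview Open-Vocabulary 3D Detection | Proposes a sparse multi-view open-vocabulary 3D detection method that requires no training and performs strongly. | open-vocabulary, open vocabulary, foundation model | |
| 21 | StereoAdapter: Adapting Stereo Depth Estimation to Underwater Scenes | Proposes StereoAdapter to address depth estimation in underwater scenes. | depth estimation, stereo depth, metric depth | |
| 22 | RangeSAM: On the Potential of Visual Foundation Models for Range-View represented LiDAR segmentation | RangeSAM: explores the potential of visual foundation models for range-view LiDAR segmentation. | scene understanding, foundation model, multimodal | |
| 23 | Shedding Light on Depth: Explainability Assessment in Monocular Depth Estimation | A study of explainability in monocular depth estimation: proposes Attribution Fidelity to assess the reliability of explanations. | depth estimation, monocular depth | |
| 24 | Towards Sharper Object Boundaries in Self-Supervised Depth Estimation | Proposes self-supervised depth estimation based on mixture distributions, significantly sharpening object boundaries. | depth estimation, monocular depth, scene understanding | |
| 25 | Camera Splatting for Continuous View Optimization | Proposes Camera Splatting, which achieves high-quality novel view synthesis through continuous view optimization. | 3D gaussian splatting, gaussian splatting, splatting | |
| 26 | 3D Gaussian Flats: Hybrid 2D/3D Photometric Scene Reconstruction | Proposes a hybrid 2D/3D Gaussian flat representation that improves 3D reconstruction quality in texture-less scenes. | depth estimation, scene reconstruction | |
| 27 | RadarGaussianDet3D: An Efficient and Effective Gaussian-based 3D Detector with 4D Automotive Radars | RadarGaussianDet3D: an efficient Gaussian-based 3D object detector for 4D millimeter-wave radar. | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 28 | Global Regulation and Excitation via Attention Tuning for Stereo Matching | Proposes the GREAT framework, which uses attention to strengthen global context and geometric information in stereo matching, improving accuracy in ill-posed regions. | scene flow | |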

🔬 Pillar 2: RL & Architecture (10 papers)

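| # | Title | Key point | Tags | 🔗 |
|---|-------|-----------|------|----|
| 29 | DistillMatch: Leveraging Knowledge Distillation from Vision Foundation Model for Multimodal Image Matching | Proposes DistillMatch, which performs multimodal image matching using knowledge distillation from a vision foundation model. | distillation, foundation model, multimodal | |
| 30 | BaseReward: A Strong Baseline for Multimodal Reward Model | BaseReward: a new baseline for multimodal reward models that offers practical guidance for MLLM alignment. | reinforcement learning, RLHF, large language model | |
| 31 | UNIV: Unified Foundation Model for Infrared and Visible Modalities | Proposes UNIV to address cross-modal alignment between the infrared and visible modalities. | contrastive learning, foundation model | |
| 32 | DC-Mamba: Bi-temporal deformable alignment and scale-sparse enhancement for remote sensing change detection | DC-Mamba: improves remote sensing change detection through deformable alignment and scale-sparse enhancement. | Mamba, SSM, state space model | |
| 33 | Random Direct Preference Optimization for Radiography Report Generation | Proposes a chest X-ray report generation framework based on random direct preference optimization, improving clinical performance. | DPO, direct preference optimization, large language model | |
| 34 | Robust Object Detection for Autonomous Driving via Curriculum-Guided Group Relative Policy Optimization | Proposes a curriculum-guided group relative policy optimization algorithm that improves the robustness of object detection for autonomous driving. | reinforcement learning, reward design, large language model | |
| 35 | SAMPO:Scale-wise Autoregression with Motion PrOmpt for generative world models | SAMPO: a scale-wise autoregressive generative world model with motion prompts that improves video prediction quality and efficiency. | world model, scene understanding, spatiotemporal | |
| 36 | ChronoForge-RL: Chronological Forging through Reinforcement Learning for Enhanced Video Understanding | ChronoForge-RL: enhances video understanding through chronological forging with reinforcement learning. | reinforcement learning, contrastive learning, distillation | |
| 37 | Enhancing WSI-Based Survival Analysis with Report-Auxiliary Self-Distillation | Proposes the Rasa framework, which uses report-auxiliary self-distillation to enhance WSI-based survival analysis. | distillation, large language model | |
| 38 | BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent | Proposes the BTL-UI model, which mimics human cognitive processes to improve the interaction capabilities of GUI agents. | reinforcement learning, large language model, multimodal | |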

🔬 Pillar 8: Physics-based Animation (1 paper)

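| # | Title | Key point | Tags | 🔗 |
|---|-------|-----------|------|----|
| 39 | SGMAGNet: A Baseline Model for 3D Cloud Phase Structure Reconstruction on a New Passive Active Satellite Benchmark | SGMAGNet: a baseline model for 3D cloud phase structure reconstruction on a passive-active satellite benchmark. | spatiotemporal, multimodal | |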

🔬 Pillar 6: Video Extraction (1 paper)

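| # | Title | Key point | Tags | 🔗 |
|---|-------|-----------|------|----|
| 40 | Simulated Cortical Magnification Supports Self-Supervised Object Learning | Simulated cortical magnification improves self-supervised object learning. | egocentric | |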
