cs.CV (2025-09-19)

📊 41 papers in total | 🔗 6 with code

🎯 Navigation by interest area

Pillar 9: Embodied Foundation Models (14 🔗1) · Pillar 3: Perception & Semantics (13 🔗3) · Pillar 2: RL & Architecture (10 🔗1) · Pillar 7: Motion Retargeting (2 🔗1) · Pillar 8: Physics-based Animation (1) · Pillar 6: Video Extraction (1)

🔬 Pillar 9: Embodied Foundation Models (14 papers)

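# | Title | One-line summary | Tags | 🔗
1 | Pointing to a Llama and Call it a Camel: On the Sycophancy of Multimodal Large Language Models | Proposes Sycophantic Reflective Tuning to address visual sycophancy in multimodal large language models. | large language model, multimodal |
2 | Language-Instructed Reasoning for Group Activity Detection via Multimodal Large Language Model | Proposes LIR-GAD, which uses a multimodal large language model for language-instructed group activity detection. | large language model, multimodal |
3 | TennisTV: Do Multimodal Large Language Models Understand Tennis Rallies? | TennisTV: the first tennis video understanding benchmark, evaluating multimodal large models in fast-motion scenes. | large language model, multimodal |
4 | MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer | Manzano: a simple, scalable unified multimodal model based on hybrid visual tokens. | large language model, multimodal |
5 | ENSAM: an efficient foundation model for interactive segmentation of 3D medical images | ENSAM: an efficient foundation model for interactive segmentation of 3D medical images. | foundation model, multimodal |
6 | Improving Autism Detection with Multimodal Behavioral Analysis | Proposes an autism detection method based on multimodal behavioral analysis, improving diagnostic accuracy. | multimodal |
7 | Qianfan-VL: Domain-Enhanced Universal Vision-Language Models | Proposes Qianfan-VL, a leading multimodal large language model built with domain-enhancement techniques. | large language model, multimodal, chain-of-thought |
8 | Multimodal Learning for Fake News Detection in Short Videos Using Linguistically Verified Data and Heterogeneous Modality Fusion | Proposes the heterogeneous fusion network HFN for fake news detection in short videos, improving the use of multimodal information. | multimodal |
9 | EyePCR: A Comprehensive Benchmark for Fine-Grained Perception, Knowledge Comprehension and Clinical Reasoning in Ophthalmic Surgery | EyePCR: a comprehensive benchmark for fine-grained perception, knowledge comprehension, and clinical reasoning in ophthalmic surgery. | large language model, multimodal |
10 | AutoArabic: A Three-Stage Framework for Localizing Video-Text Retrieval Benchmarks | AutoArabic: proposes a three-stage framework for Arabic localization of video-text retrieval benchmarks. | large language model |
11 | Robust Vision-Language Models via Tensor Decomposition: A Defense Against Adversarial Attacks | Proposes a lightweight defense based on tensor decomposition that improves the robustness of vision-language models to adversarial attacks. | multimodal |
12 | Pyramid Token Pruning for High-Resolution Large Vision-Language Models via Region, Token, and Instruction-Guided Importance | Proposes the Pyramid Token Pruning (PTP) strategy to reduce the excessive computational cost of high-resolution large vision-language models. | multimodal |
13 | Enhancing Sa2VA for Referent Video Object Segmentation: 2nd Solution for 7th LSVOS RVOS Track | Proposes a Video-Language Checker and a Key-Frame Sampler, significantly improving Sa2VA on referring video object segmentation. | large language model |
14 | Lynx: Towards High-Fidelity Personalized Video Generation | Lynx: a high-fidelity personalized video generation model driven by a single image. | foundation model |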

🔬 Pillar 3: Perception & Semantics (13 papers)

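# | Title | One-line summary | Tags | 🔗
15 | MS-GS: Multi-Appearance Sparse-View 3D Gaussian Splatting in the Wild | Proposes MS-GS, which uses multi-appearance 3D Gaussian splatting for sparse-view scene reconstruction in the wild. | depth estimation, monocular depth, 3D gaussian splatting |
16 | Zero-Shot Visual Grounding in 3D Gaussians via View Retrieval | Proposes GVR, which achieves zero-shot visual grounding in 3D Gaussian scenes via view retrieval. | 3D gaussian splatting, 3DGS, gaussian splatting |
17 | FingerSplat: Contactless Fingerprint 3D Reconstruction and Generation based on 3D Gaussian Splatting | Proposes a contactless fingerprint 3D reconstruction and generation method based on 3D Gaussian splatting. | 3D gaussian splatting, gaussian splatting, splatting |
18 | GS-Scale: Unlocking Large-Scale 3D Gaussian Splatting Training via Host Offloading | GS-Scale: unlocks large-scale 3D Gaussian splatting training via host offloading. | 3D gaussian splatting, gaussian splatting, splatting |
19 | Sparse Multiview Open-Vocabulary 3D Detection | Proposes a sparse multiview open-vocabulary 3D detection method that requires no 3D training and reports strong performance. | open-vocabulary, open vocabulary, foundation model |
20 | StereoAdapter: Adapting Stereo Depth Estimation to Underwater Scenes | StereoAdapter: an adaptation framework for stereo depth estimation in underwater scenes. | depth estimation, stereo depth, metric depth |
21 | RangeSAM: On the Potential of Visual Foundation Models for Range-View represented LiDAR segmentation | RangeSAM: explores the potential of visual foundation models for range-view LiDAR segmentation. | scene understanding, foundation model, multimodal |
22 | Shedding Light on Depth: Explainability Assessment in Monocular Depth Estimation | A study of explainability in monocular depth estimation that improves model transparency through perturbation analysis and fidelity evaluation. | depth estimation, monocular depth |
23 | Towards Sharper Object Boundaries in Self-Supervised Depth Estimation | Proposes self-supervised depth estimation based on mixture distributions, significantly sharpening object boundaries. | depth estimation, monocular depth, scene understanding |
24 | Camera Splatting for Continuous View Optimization | Proposes Camera Splatting, which achieves high-quality novel view synthesis through continuous view optimization. | 3D gaussian splatting, gaussian splatting, splatting |
25 | 3D Gaussian Flats: Hybrid 2D/3D Photometric Scene Reconstruction | Proposes a hybrid 2D/3D Gaussian flat representation that improves 3D reconstruction quality in texture-poor scenes. | depth estimation, scene reconstruction |
26 | RadarGaussianDet3D: An Efficient and Effective Gaussian-based 3D Detector with 4D Automotive Radars | RadarGaussianDet3D: an efficient Gaussian-based 3D object detector for 4D millimeter-wave radar. | 3D gaussian splatting, 3DGS, gaussian splatting |
27 | Global Regulation and Excitation via Attention Tuning for Stereo Matching | Proposes the GREAT framework, which uses attention to strengthen global context in stereo matching and improves matching accuracy in ill-posed regions. | scene flow |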

🔬 Pillar 2: RL & Architecture (10 papers)

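# | Title | One-line summary | Tags | 🔗
28 | DistillMatch: Leveraging Knowledge Distillation from Vision Foundation Model for Multimodal Image Matching | Proposes DistillMatch, which uses knowledge distillation from a vision foundation model for multimodal image matching. | distillation, foundation model, multimodal |
29 | BaseReward: A Strong Baseline for Multimodal Reward Model | BaseReward: a new baseline for multimodal reward models, offering an effective approach to MLLM alignment. | reinforcement learning, RLHF, large language model |
30 | UNIV: Unified Foundation Model for Infrared and Visible Modalities | Proposes UNIV, which uses cross-modal contrastive learning to address modality bias in infrared-visible fusion. | contrastive learning, foundation model |
31 | DC-Mamba: Bi-temporal deformable alignment and scale-sparse enhancement for remote sensing change detection | DC-Mamba: proposes bi-temporal deformable alignment and scale-sparse enhancement for remote sensing change detection. | Mamba, SSM, state space model |
32 | Random Direct Preference Optimization for Radiography Report Generation | Proposes a chest X-ray report generation method based on Random Direct Preference Optimization, improving clinical metrics. | DPO, direct preference optimization, large language model |
33 | Robust Object Detection for Autonomous Driving via Curriculum-Guided Group Relative Policy Optimization | Proposes a curriculum-guided Group Relative Policy Optimization algorithm that improves the robustness of object detection for autonomous driving. | reinforcement learning, reward design, large language model |
34 | SAMPO: Scale-wise Autoregression with Motion PrOmpt for generative world models | SAMPO: a scale-wise autoregressive generative world model with motion prompts, improving video prediction quality and inference efficiency. | world model, scene understanding, spatiotemporal |
35 | ChronoForge-RL: Chronological Forging through Reinforcement Learning for Enhanced Video Understanding | ChronoForge-RL: enhances video understanding through chronological forging with reinforcement learning. | reinforcement learning, contrastive learning, distillation |
36 | Enhancing WSI-Based Survival Analysis with Report-Auxiliary Self-Distillation | Proposes the Rasa framework, which uses report-auxiliary self-distillation to enhance WSI survival analysis and improve cancer prognosis prediction. | distillation, large language model |
37 | BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent | Proposes the BTL-UI model, which mimics human cognitive processes to improve the interaction ability of GUI agents. | reinforcement learning, large language model, multimodal |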

🔬 Pillar 7: Motion Retargeting (2 papers)

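# | Title | One-line summary | Tags | 🔗
38 | See&Trek: Training-Free Spatial Prompting for Multimodal Large Language Model | Proposes SEE&TREK, which enhances the spatial understanding of multimodal large language models under vision-only input. | motion reconstruction, large language model, multimodal |
39 | Enriched Feature Representation and Motion Prediction Module for MOSEv2 Track of 7th LSVOS Challenge: 3rd Place Solution | Combines the strengths of SAM2 and Cutie in the proposed SCOPE model, improving the robustness of video object segmentation. | motion prediction |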

🔬 Pillar 8: Physics-based Animation (1 paper)

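# | Title | One-line summary | Tags | 🔗
40 | SGMAGNet: A Baseline Model for 3D Cloud Phase Structure Reconstruction on a New Passive Active Satellite Benchmark | SGMAGNet: a baseline model for 3D cloud phase structure reconstruction on a passive-active satellite benchmark. | spatiotemporal, multimodal |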

🔬 Pillar 6: Video Extraction (1 paper)

# | Title | One-line summary | Tags | 🔗
41 | Simulated Cortical Magnification Supports Self-Supervised Object Learning | Simulated cortical magnification improves self-supervised object learning. | egocentric |
