cs.CV (2025-05-20)

📊 41 papers total | 🔗 12 with code

🎯 Interest Area Navigation

Pillar 2: RL & Architecture (16 🔗4) · Pillar 9: Embodied Foundation Models (12 🔗5) · Pillar 3: Perception & Semantics (6 🔗3) · Pillar 1: Robot Control (3) · Pillar 5: Interaction & Reaction (1) · Pillar 4: Generative Motion (1) · Pillar 6: Video Extraction (1) · Pillar 8: Physics-based Animation (1)

🔬 Pillar 2: RL Algorithms & Architecture (16 papers)

# | Title | One-line summary | Tags | 🔗
1 | UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning | Proposes UniVG-R1, which strengthens reasoning via reinforcement learning to tackle universal visual grounding. | reinforcement learning, large language model, multimodal
2 | UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation | UniGen: unified multimodal understanding and generation through enhanced training and test-time strategies. | direct preference optimization, large language model, multimodal
3 | Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning | Visionary-R1: mitigates shortcut learning in visual reasoning via reinforcement learning. | reinforcement learning, large language model, multimodal
4 | Programmatic Video Prediction Using Large Language Models | ProgGen: interpretable, programmatic video prediction using large language models. | world model, large language model
5 | Investigating and Enhancing the Robustness of Large Multimodal Models Against Temporal Inconsistency | Introduces the TemRobBench benchmark and the PanoDPO optimization method, improving the robustness of large models under temporal-inconsistency perturbations. | direct preference optimization, multimodal
6 | VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank | Proposes VisualQuality-R1, reasoning-driven image quality assessment via reinforcement learning to rank. | reinforcement learning, large language model
7 | DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning | DeepEyes: uses reinforcement learning to incentivize "thinking with images" in vision-language models. | reinforcement learning, multimodal
8 | Towards Omnidirectional Reasoning with 360-R1: A Dataset, Benchmark, and GRPO-based Method | Introduces the OmniVQA dataset and the 360-R1 method to improve panoramic visual question answering. | reinforcement learning, embodied AI, large language model
9 | StPR: Spatiotemporal Preservation and Routing for Exemplar-Free Video Class-Incremental Learning | Proposes the StPR framework, which disentangles and preserves spatiotemporal information for exemplar-free video class-incremental learning. | distillation, spatiotemporal
10 | Intra-class Patch Swap for Self-Distillation | Proposes a self-distillation method based on intra-class patch swapping that improves performance without a teacher network. | teacher-student, distillation
11 | MultiMAE Meets Earth Observation: Pre-training Multi-modal Multi-task Masked Autoencoders for Earth Observation Tasks | Proposes MultiMAE pre-training for Earth observation, improving downstream-task performance on multimodal remote-sensing data. | masked autoencoder
12 | RETRO: REthinking Tactile Representation Learning with Material PriOrs | Introduces material-aware priors to improve the accuracy of tactile representation learning. | representation learning
13 | Unify Graph Learning with Text: Unleashing LLM Potentials for Session Search | Proposes the symbolic graph ranker SGR, which uses LLMs to unify graph learning with textual information, improving session search. | contrastive learning, large language model
14 | Scaling Vision Mamba Across Resolutions via Fractal Traversal | FractalMamba++: a vision Mamba built on fractal traversal, improving adaptability across resolutions. | Mamba
15 | Physics-Driven Local-Whole Elastic Deformation Modeling for Point Cloud Representation Learning | Proposes physics-driven local-whole elastic deformation modeling to improve point-cloud representation learning. | representation learning
16 | Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels | Ground-V: pixel-level instruction tuning that improves VLM grounding in complex scenes. | distillation, instruction following

🔬 Pillar 9: Embodied Foundation Models (12 papers)

# | Title | One-line summary | Tags | 🔗
17 | Speculative Decoding Reimagined for Multimodal Large Language Models | Proposes multimodal speculative decoding (MSD) to accelerate inference for multimodal large language models. | large language model, multimodal
18 | ViC-Bench: Benchmarking Visual-Interleaved Chain-of-Thought Capability in MLLMs with Free-Style Intermediate State Representations | ViC-Bench: evaluates the visual-interleaved chain-of-thought capability of MLLMs using free-style intermediate visual states. | large language model, chain-of-thought
19 | EmoSign: A Multimodal Dataset for Understanding Emotions in American Sign Language | EmoSign: a multimodal dataset for emotion understanding in American Sign Language, filling a gap in affective sign-language research. | multimodal
20 | RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding | Introduces the RAVENEA benchmark, using retrieval augmentation to address the shortfall in visual culture understanding in multimodal settings. | multimodal
21 | Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models | VidCom2: a plug-and-play inference-acceleration framework for video LLMs that improves efficiency while preserving performance. | large language model
22 | LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts | LoVR: a benchmark dataset for long-video retrieval in multimodal contexts. | multimodal
23 | Scaling and Enhancing LLM-based AVSR: A Sparse Mixture of Projectors Approach | Proposes Llama-SMoP, a scalable LLM-based speech recognition method built on a sparse mixture of projectors. | large language model, multimodal
24 | RADAR: Enhancing Radiology Report Generation with Supplementary Knowledge Injection | RADAR: enhances radiology report generation through supplementary knowledge injection. | large language model, multimodal
25 | VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation | Proposes VideoEval-Pro for more robust and realistic long-video understanding evaluation. | multimodal
26 | Unlocking the Power of SAM 2 for Few-Shot Segmentation | A few-shot segmentation method built on SAM 2 that addresses matching foreground objects across different identities. | foundation model
27 | Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting | Dolphin: document image parsing via heterogeneous anchor prompting. | multimodal
28 | AppleGrowthVision: A large-scale stereo dataset for phenological analysis, fruit detection, and 3D reconstruction in apple orchards | AppleGrowthVision: a large-scale stereo dataset for phenological analysis, fruit detection, and 3D reconstruction in apple orchards. | multimodal

🔬 Pillar 3: Spatial Perception & Semantics (6 papers)

# | Title | One-line summary | Tags | 🔗
29 | MGStream: Motion-aware 3D Gaussian for Streamable Dynamic Scene Reconstruction | MGStream: motion-aware 3D Gaussians for streamable dynamic scene reconstruction, addressing flickering artifacts and storage inefficiency. | 3D gaussian splatting, 3DGS, gaussian splatting
30 | M3Depth: Wavelet-Enhanced Depth Estimation on Mars via Mutual Boosting of Dual-Modal Data | M3Depth: depth estimation on the Martian surface via mutual boosting of dual-modal data. | depth estimation, stereo depth
31 | Personalize Your Gaussian: Consistent 3D Scene Personalization from a Single Image | Proposes the CP-GS framework, addressing viewpoint bias in personalized 3D scene generation from a single image. | 3D gaussian splatting, 3DGS, gaussian splatting
32 | Multi-Label Stereo Matching for Transparent Scene Depth Estimation | Proposes multi-label stereo matching for transparent-scene depth estimation, relaxing the unimodal-distribution assumption of traditional methods. | depth estimation, scene reconstruction
33 | Diving into the Fusion of Monocular Priors for Generalized Stereo Matching | Proposes monocular-prior fusion based on local ordering and adaptive alignment, improving the generalization of stereo matching. | monocular depth, scene flow, foundation model
34 | 4D-ROLLS: 4D Radar Occupancy Learning via LiDAR Supervision | Proposes 4D-ROLLS, which learns 4D radar occupancy prediction under LiDAR supervision, improving perception in adverse environments. | height map

🔬 Pillar 1: Robot Control (3 papers)

# | Title | One-line summary | Tags | 🔗
35 | Emerging Properties in Unified Multimodal Pretraining | BAGEL: an open-source unified pretraining model supporting multimodal understanding and generation. | manipulation, multimodal
36 | Vid2World: Crafting Video Diffusion Models to Interactive World Models | Vid2World: crafts video diffusion models into interactive world models. | manipulation, world model
37 | Visual Agentic Reinforcement Fine-Tuning | Proposes Visual-ARFT, improving LVLM reasoning and generalization on multimodal agent tasks. | manipulation, multimodal

🔬 Pillar 5: Interaction & Reaction (1 paper)

# | Title | One-line summary | Tags | 🔗
38 | Beyond Words: Multimodal LLM Knows When to Speak | Proposes MM-When2Speak, using multimodal cues to improve the accuracy of response-timing prediction in conversation. | dyadic interaction, large language model, multimodal

🔬 Pillar 4: Generative Motion (1 paper)

# | Title | One-line summary | Tags | 🔗
39 | EGFormer: Towards Efficient and Generalizable Multimodal Semantic Segmentation | EGFormer: an efficient and generalizable multimodal semantic segmentation framework. | MDM, multimodal

🔬 Pillar 6: Video Extraction & Matching (1 paper)

# | Title | One-line summary | Tags | 🔗
40 | Egocentric Action-aware Inertial Localization in Point Clouds with Vision-Language Guidance | Proposes the EAIL framework, using vision-language guidance for egocentric action-aware inertial localization in point clouds from a head-mounted IMU. | egocentric, multimodal

🔬 Pillar 8: Physics-based Animation (1 paper)

# | Title | One-line summary | Tags | 🔗
41 | Dynadiff: Single-stage Decoding of Images from Continuously Evolving fMRI | Dynadiff: single-stage decoding of images from continuously evolving fMRI, improving temporal resolution and semantic reconstruction. | diff-sim
