cs.CV (2025-05-29)

📊 46 papers in total | 🔗 19 with code

🎯 Navigation by interest area

Pillar 9: Embodied Foundation Models (25 🔗10) · Pillar 2: RL & Architecture (10 🔗5) · Pillar 3: Perception & Semantics (7 🔗3) · Pillar 4: Generative Motion (1) · Pillar 5: Interaction & Reaction (1 🔗1) · Pillar 1: Robot Control (1) · Pillar 6: Video Extraction (1)

🔬 Pillar 9: Embodied Foundation Models (25 papers)

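| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models | Impromptu VLA: open data and weights to empower vision-language-action models for autonomous driving | vision-language-action, VLA | |
| 2 | Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought | Argus: a chain-of-thought method based on visual attention grounding that improves multimodal reasoning | large language model, multimodal, chain-of-thought | |
| 3 | Preemptive Hallucination Reduction: An Input-Level Approach for Multimodal Language Model | Proposes an input-preprocessing approach to suppress hallucinations in multimodal language models | large language model, multimodal | |
| 4 | OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation | OpenUni: a simple baseline model for unified multimodal understanding and generation | large language model, multimodal | |
| 5 | MaskAdapt: Unsupervised Geometry-Aware Domain Adaptation Using Multimodal Contextual Learning and RGB-Depth Masking | MaskAdapt: unsupervised geometry-aware domain adaptation using multimodal contextual learning and RGB-D masking | multimodal | |
| 6 | Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence | Spatial-MLLM: enhances the visual spatial intelligence of MLLMs with visual geometry priors | large language model, foundation model, multimodal | |
| 7 | FMG-Det: Foundation Model Guided Robust Object Detection | FMG-Det: a foundation-model-guided robust object detection method for training under noisy annotations | foundation model | |
| 8 | VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos | Proposes VF-Eval to evaluate multimodal LLMs on generating feedback for AIGC videos | multimodal | |
| 9 | EndoBench: A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis | EndoBench: a comprehensive benchmark for evaluating multimodal large language models on endoscopy analysis | large language model | |
| 10 | OmniEarth-Bench: Towards Holistic Evaluation of Earth's Six Spheres and Cross-Spheres Interactions with Multimodal Observational Earth Data | Proposes OmniEarth-Bench for holistic evaluation of learning from multimodal observational data across Earth's six spheres and their cross-sphere interactions | multimodal | |
| 11 | VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning | VAU-R1: improves video anomaly understanding through reinforcement fine-tuning | large language model, multimodal, chain-of-thought | |
| 12 | MCFNet: A Multimodal Collaborative Fusion Network for Fine-Grained Semantic Classification | Proposes MCFNet to address cross-modal information fusion in fine-grained semantic classification | multimodal | |
| 13 | VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning? | VideoReasonBench: a benchmark for evaluating multimodal large models on vision-centric complex reasoning | large language model, multimodal, chain-of-thought | |
| 14 | ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks | ThinkGeo: evaluates the performance of tool-augmented agents on remote sensing tasks | large language model, multimodal | |
| 15 | Position Paper: Metadata Enrichment Model: Integrating Neural Networks and Semantic Knowledge Graphs for Cultural Heritage Applications | Proposes the Metadata Enrichment Model, which combines neural networks and knowledge graphs to improve metadata quality for digitized cultural heritage | large language model, TAMP | |
| 16 | CMIE: Combining MLLM Insights with External Evidence for Explainable Out-of-Context Misinformation Detection | Proposes the CMIE framework, which combines MLLM insights with external evidence to detect out-of-context misinformation | large language model, multimodal | |
| 17 | Vid-SME: Membership Inference Attacks against Large Video Understanding Models | Proposes Vid-SME, an efficient membership inference attack against large video understanding models | large language model, multimodal | |
| 18 | DGIQA: Depth-guided Feature Attention and Refinement for Generalizable Image Quality Assessment | DGIQA: depth-guided feature attention and refinement to improve the generalization of image quality assessment | multimodal | |
| 19 | VisualSphinx: Large-Scale Synthetic Vision Logic Puzzles for RL | VisualSphinx: a large-scale synthetic dataset of visual logic puzzles for reinforcement learning | multimodal | |
| 20 | ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding | Proposes ScaleLong, a multi-timescale benchmark for long video understanding that enables direct comparison of model performance across timescales | multimodal | |
| 21 | D-AR: Diffusion via Autoregressive Models | D-AR: recasts the image diffusion process as an autoregressive model for image generation | large language model | |
| 22 | ZeroSep: Separate Anything in Audio with Zero Training | ZeroSep: training-free audio separation using pretrained text-guided audio diffusion models | foundation model | |
| 23 | Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition | Uni-MuMER: handwritten mathematical expression recognition via unified multi-task fine-tuning of a vision-language model | chain-of-thought | |
| 24 | TerraIncognita: A Dynamic Benchmark for Species Discovery Using Frontier Models | TerraIncognita: a dynamic benchmark for species discovery that uses frontier models to identify unknown insect species | multimodal | |
| 25 | VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation | Proposes VCapsBench, a large-scale fine-grained benchmark for video caption quality evaluation, aimed at improving text-to-video generation quality | large language model | |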

🔬 Pillar 2: RL & Architecture (10 papers)

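| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 26 | DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models | DINO-R1: uses reinforcement learning to improve the reasoning capability of vision foundation models | reinforcement learning, open-vocabulary, open vocabulary | |
| 27 | UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning | Proposes UniRL, which improves unified multimodal models through self-generated data and reinforcement learning | reinforcement learning, large language model, multimodal | |
| 28 | VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models | VideoREPA: transfers physics knowledge from video understanding models to text-to-video models through relational alignment | distillation, physically plausible, foundation model | |
| 29 | Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles | Jigsaw-R1: studies rule-based visual reinforcement learning with jigsaw puzzles | reinforcement learning, large language model, multimodal | |
| 30 | UrbanCraft: Urban View Extrapolation via Hierarchical Sem-Geometric Priors | UrbanCraft: urban view extrapolation using hierarchical semantic-geometric priors | distillation, scene reconstruction, occupancy grid | |
| 31 | PixelThink: Towards Efficient Chain-of-Pixel Reasoning | PixelThink: improves segmentation efficiency and interpretability through chain-of-pixel reasoning | reinforcement learning, large language model, multimodal | |
| 32 | BioCLIP 2: Emergent Properties from Scaling Hierarchical Contrastive Learning | BioCLIP 2: scales a biological vision model with hierarchical contrastive learning, yielding emergent capabilities | contrastive learning, foundation model | |
| 33 | Hallo4: High-Fidelity Dynamic Portrait Animation via Direct Preference Optimization | Hallo4: high-fidelity dynamic portrait animation via direct preference optimization | direct preference optimization, spatiotemporal | |
| 34 | Grounded Reinforcement Learning for Visual Reasoning | Proposes ViGoRL, a visual reinforcement learning model that improves visual reasoning through spatial grounding | reinforcement learning | |
| 35 | Beyond Optimal Transport: Model-Aligned Coupling for Flow Matching | Proposes Model-Aligned Coupling (MAC) to improve the quality and efficiency of flow matching generation | flow matching | |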

🔬 Pillar 3: Perception & Semantics (7 papers)

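| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 36 | Bridging Geometric and Semantic Foundation Models for Generalized Monocular Depth Estimation | BriGeS: fuses geometric and semantic foundation models to improve generalized monocular depth estimation | depth estimation, monocular depth, foundation model | |
| 37 | AnySplat: Feed-forward 3D Gaussian Splatting from Unconstrained Views | AnySplat: feed-forward 3D Gaussian splatting from unconstrained views, without camera poses | 3D gaussian splatting, gaussian splatting, splatting | |
| 38 | ZPressor: Bottleneck-Aware Compression for Scalable Feed-Forward 3DGS | ZPressor: a bottleneck-aware compression method for scalable feed-forward 3DGS | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 39 | MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence | MMSI-Bench: a multi-image spatial intelligence benchmark that challenges the spatial reasoning of multimodal large language models | scene reconstruction, large language model, multimodal | |
| 40 | TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models | TextRegion: generates text-aligned region tokens from frozen image-text models for fine-grained visual understanding | open-vocabulary, open vocabulary | |
| 41 | CLDTracker: A Comprehensive Language Description for Visual Tracking | CLDTracker: a comprehensive language description framework for improving the robustness of visual tracking | open-vocabulary, open vocabulary | |
| 42 | PhysicsNeRF: Physics-Guided 3D Reconstruction from Sparse Views | PhysicsNeRF: physics-guided 3D reconstruction from sparse views | NeRF, neural radiance field | |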

🔬 Pillar 4: Generative Motion (1 paper)

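| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 43 | Semantics-Aware Human Motion Generation from Audio Instructions | Proposes a semantics-aware framework for generating human motion from audio instructions, improving the naturalness of interaction | motion generation, human motion | |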

🔬 Pillar 5: Interaction & Reaction (1 paper)

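| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 44 | To Trust Or Not To Trust Your Vision-Language Model's Prediction | Proposes TrustVLM, a training-free approach to improving the trustworthiness of vision-language model predictions | IMoS, multimodal | |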

🔬 Pillar 1: Robot Control (1 paper)

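| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 45 | Weakly-supervised Localization of Manipulated Image Regions Using Multi-resolution Learned Features | Proposes a weakly supervised method for localizing manipulated image regions using multi-resolution learned features | manipulation | |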

🔬 Pillar 6: Video Extraction (1 paper)

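| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 46 | VITON-DRR: Details Retention Virtual Try-on via Non-rigid Registration | VITON-DRR: detail-retaining virtual try-on via non-rigid registration | feature matching | |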
