cs.CV(2025-03-26)

📊 49 papers in total | 🔗 14 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (23 🔗6) · Pillar 2: RL & Architecture (9 🔗3) · Pillar 4: Generative Motion (5 🔗2) · Pillar 3: Perception & Semantics (4 🔗1) · Pillar 7: Motion Retargeting (2) · Pillar 6: Video Extraction (2 🔗1) · Pillar 1: Robot Control (2) · Pillar 8: Physics-based Animation (2 🔗1)

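🔬 Pillar 9: Embodied Foundation Models (23 papers)

# | Title | One-Sentence Summary | Tags | 🔗
1 | Dynamic Pyramid Network for Efficient Multimodal Large Language Model | Proposes the Dynamic Pyramid Network (DPN) for efficient multimodal large language models, improving performance while lowering computational cost. | large language model, multimodal
2 | Unified Multimodal Discrete Diffusion | Proposes UniDisc, a unified multimodal discrete diffusion model for joint text-image generation and understanding. | multimodal
3 | Math Blind: Failures in Diagram Understanding Undermine Reasoning in MLLMs | Reveals the "math blindness" of MLLMs in diagram understanding and proposes a graph-structure-based improvement. | large language model, multimodal, chain-of-thought
4 | MMMORRF: Multimodal Multilingual Modularized Reciprocal Rank Fusion | Proposes MMMORRF, which improves multimodal video retrieval through modality-aware weighted reciprocal rank fusion. | multimodal
5 | TerraTorch: The Geospatial Foundation Models Toolkit | TerraTorch: a toolkit for fine-tuning and benchmarking geospatial foundation models. | foundation model
6 | CryoSAMU: Enhancing 3D Cryo-EM Density Maps of Protein Structures at Intermediate Resolution with Structure-Aware Multimodal U-Nets | CryoSAMU: enhances intermediate-resolution cryo-EM density maps of protein structures with structure-aware multimodal U-Nets. | multimodal
7 | ViLBench: A Suite for Vision-Language Process Reward Modeling | Proposes ViLBench for evaluating the fine-grained feedback ability of vision-language process reward models. | large language model, multimodal, chain-of-thought
8 | Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models | Proposes a multimodal autoregressive model to address long-text image generation. | multimodal
9 | Multimodal Image Matching based on Frequency-domain Information of Local Energy Response | Proposes FILER, a multimodal image matching method based on frequency-domain information of local energy response, addressing challenges such as nonlinear differences. | multimodal
10 | Mitigating Low-Level Visual Hallucinations Requires Self-Awareness: Database, Model and Training Strategy | Proposes the SAFEQA model and the ESA-PO framework to mitigate hallucinations of multimodal large models in low-level vision tasks. | large language model, multimodal
11 | Vision-Amplified Semantic Entropy for Hallucination Detection in Medical Visual Question Answering | Proposes Vision-Amplified Semantic Entropy (VASE) for hallucination detection in medical VQA. | large language model, multimodal
12 | Instruction-Oriented Preference Alignment for Enhancing Multi-Modal Comprehension Capability of MLLMs | Proposes instruction-oriented preference alignment to improve multimodal comprehension. | large language model, multimodal
13 | Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping | Skip-Vision: accelerates vision-language models through adaptive token skipping, improving efficiency and scalability. | large language model, multimodal
14 | Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency | Proposes the Free4D framework for generating 4D scenes from a single image. | foundation model
15 | Dynamic Motion Blending for Versatile Motion Editing | Proposes MotionReFit, which achieves general text-guided motion editing through dynamic motion blending. | large language model
16 | Shape Generation via Weight Space Learning | Generates shapes via weight space learning, exploring a new paradigm for downstream tasks of 3D generative models. | foundation model
17 | MLLM-Selector: Necessity and Diversity-driven High-Value Data Selection for Enhanced Visual Instruction Tuning | Proposes MLLM-Selector, which enhances visual instruction tuning through necessity- and diversity-driven high-value data selection. | large language model
18 | From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment | Proposes a long video understanding method based on visual context sampling and self-reward alignment. | large language model
19 | VideoGEM: Training-free Action Grounding in Videos | Proposes VideoGEM, a training-free method for spatial action grounding in videos that outperforms existing trained methods. | foundation model
20 | Wan: Open and Advanced Large-Scale Video Generative Models | Wan: open and advanced large-scale video generative models that significantly improve generation capability and efficiency. | foundation model
21 | Protecting Your Video Content: Disrupting Automated Video-based LLM Annotations | Proposes the Ramblings and Mutes video watermarks to counter automated annotation by video-based LLMs. | large language model
22 | Faster Parameter-Efficient Tuning with Token Redundancy Reduction | Proposes FPET, which speeds up parameter-efficient tuning and lowers computational overhead through token redundancy reduction. | foundation model
23 | Exploring CLIP's Dense Knowledge for Weakly Supervised Semantic Segmentation | Proposes ExCEL, which explores CLIP's dense knowledge through patch-text alignment for weakly supervised semantic segmentation. | large language model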

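🔬 Pillar 2: RL & Architecture (9 papers)

# | Title | One-Sentence Summary | Tags | 🔗
24 | Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields | Feature4X: bridges monocular video to 4D agentic AI using Gaussian feature fields. | distillation, gaussian splatting, splatting
25 | DINeMo: Learning Neural Mesh Models with no 3D Annotations | DINeMo: learns neural mesh models without 3D annotations, improving category-level pose estimation. | contrastive learning, scene understanding, 6D pose estimation
26 | Cross-Modal Prototype Allocation: Unsupervised Slide Representation Learning via Patch-Text Contrast in Computational Pathology | Proposes ProAlign, which achieves unsupervised whole-slide image (WSI) representation learning through cross-modal prototype allocation. | representation learning, large language model, foundation model
27 | Mamba-3D as Masked Autoencoders for Accurate and Data-Efficient Analysis of Medical Ultrasound Videos | Proposes E-ViM³, a data-efficient Mamba network for accurate analysis of medical ultrasound videos. | Mamba, masked autoencoder
28 | Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models | Proposes Reason-RFT, which improves the generalization of vision-language models on visual reasoning tasks through reinforcement fine-tuning. | reinforcement learning, multimodal, chain-of-thought
29 | GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving | GAIA-2: a controllable multi-view generative world model for autonomous driving. | world model, spatiotemporal
30 | Flip Learning: Weakly Supervised Erase to Segment Nodules in Breast Ultrasound | Proposes a weakly supervised method based on Flip Learning for segmenting breast ultrasound nodules, requiring only 2D/3D bounding-box annotations. | reinforcement learning, curriculum learning, foundation model
31 | VPO: Aligning Text-to-Video Generation Models with Prompt Optimization | Proposes the VPO framework, which aligns text-to-video generation models through prompt optimization, improving safety and quality. | preference learning, RLHF, large language model
32 | EditCLIP: Representation Learning for Image Editing | EditCLIP: a representation learning method for image editing that learns edit representations by jointly encoding the input and edited images. | representation learning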

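🔬 Pillar 4: Generative Motion (5 papers)

# | Title | One-Sentence Summary | Tags | 🔗
33 | Guiding Human-Object Interactions with Rich Geometry and Relations | Proposes the ROG framework, which guides realistic human-object interaction synthesis with geometric relations. | motion generation, physically plausible, human-object interaction
34 | InsViE-1M: Effective Instruction-based Video Editing with Elaborate Dataset Construction | InsViE-1M: achieves effective instruction-based video editing through elaborate dataset construction. | classifier-free guidance, instruction following
35 | Video Motion Graphs | Proposes Video Motion Graphs, which generate realistic human motion videos through conditional control and frame interpolation. | motion diffusion model, motion diffusion
36 | PhysGen3D: Crafting a Miniature Interactive World from a Single Image | PhysGen3D: builds an interactive miniature 3D physical world from a single image. | physically plausible
37 | Unconditional Priors Matter! Improving Conditional Generation of Fine-Tuned Diffusion Models | Improves conditional generation of fine-tuned diffusion models by using high-quality unconditional priors. | classifier-free guidance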

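🔬 Pillar 3: Perception & Semantics (4 papers)

# | Title | One-Sentence Summary | Tags | 🔗
38 | TC-GS: Tri-plane based compression for 3D Gaussian Splatting | TC-GS: a compression method for 3D Gaussian Splatting based on tri-plane encoding. | 3D gaussian splatting, 3DGS, gaussian splatting
39 | EVolSplat: Efficient Volume-based Gaussian Splatting for Urban View Synthesis | EVolSplat: efficient voxel-based Gaussian splatting for novel view synthesis of urban scenes. | 3D gaussian splatting, 3DGS, gaussian splatting
40 | GLRD: Global-Local Collaborative Reason and Debate with PSL for 3D Open-Vocabulary Detection | GLRD: a PSL framework based on global-local collaborative reasoning and debate, for 3D open-vocabulary detection. | open-vocabulary, open vocabulary
41 | Synthetic-to-Real Self-supervised Robust Depth Estimation via Learning with Motion and Structure Priors | Proposes a self-supervised depth estimation framework based on motion and structure priors, improving robustness in adverse weather. | depth estimation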

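🔬 Pillar 7: Motion Retargeting (2 papers)

# | Title | One-Sentence Summary | Tags | 🔗
42 | Operating Room Workflow Analysis via Reasoning Segmentation over Digital Twins | Proposes the ORDiRS framework, which improves the accuracy of operating room workflow analysis through digital twins and reasoning segmentation. | spatial relationship, large language model, foundation model
43 | HierRelTriple: Guiding Indoor Layout Generation with Hierarchical Relationship Triplet Losses | HierRelTriple: guides indoor layout generation with hierarchical relationship triplet losses. | spatial relationship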

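🔬 Pillar 6: Video Extraction (2 papers)

# | Title | One-Sentence Summary | Tags | 🔗
44 | UFM: Unified Feature Matching Pre-training with Multi-Modal Image Assistants | Proposes UFM, a unified feature matching pre-training model assisted by multimodal images, improving cross-modal image matching performance. | feature matching, multimodal
45 | Leveraging 3D Geometric Priors in 2D Rotation Symmetry Detection | Proposes a 2D rotation symmetry detection method that leverages 3D geometric priors, improving robustness and accuracy. | feature matching, geometric consistency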

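🔬 Pillar 1: Robot Control (2 papers)

# | Title | One-Sentence Summary | Tags | 🔗
46 | Context-Aware Weakly Supervised Image Manipulation Localization with SAM Refinement | Proposes the CABL and CGSR modules for context-aware weakly supervised image manipulation localization, improving localization accuracy. | manipulation
47 | ARMO: Autoregressive Rigging for Multi-Category Objects | ARMO: an autoregressive rigging framework for multi-category objects, improving skeleton prediction accuracy. | humanoid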

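🔬 Pillar 8: Physics-based Animation (2 papers)

# | Title | One-Sentence Summary | Tags | 🔗
48 | UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines | UniSTD: proposes a unified spatio-temporal learning framework that addresses generalization across tasks from different domains. | spatiotemporal, foundation model
49 | SpikeDerain: Unveiling Clear Videos from Rainy Sequences Using Color Spike Streams | Proposes SpikeDerain for restoring clear video from rainy sequences. | spatiotemporal, multimodal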
