cs.CV (2025-09-29)

📊 60 papers in total | 🔗 23 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (24 🔗8) · Pillar 2: RL & Architecture (14 🔗6) · Pillar 3: Perception & Semantics (13 🔗6) · Pillar 1: Robot Control (3 🔗1) · Pillar 8: Physics-based Animation (3) · Pillar 4: Generative Motion (1) · Pillar 7: Motion Retargeting (1 🔗1) · Pillar 6: Video Extraction (1 🔗1)

🔬 Pillar 9: Embodied Foundation Models (24 papers)

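# | Title | One-line summary | Tags
1 | FishNet++: Analyzing the capabilities of Multimodal Large Language Models in marine biology | FishNet++: evaluates the capabilities of multimodal large language models in marine biology and builds a large-scale multimodal benchmark | large language model, multimodal
2 | MMRQA: Signal-Enhanced Multimodal Large Language Models for MRI Quality Assessment | Proposes the MMRQA framework, which fuses signal processing with multimodal large language models for MRI quality assessment | large language model, multimodal
3 | Toward a Vision-Language Foundation Model for Medical Data: Multimodal Dataset and Benchmarks for Vietnamese PET/CT Report Generation | Proposes the ViPET-ReportGen dataset and benchmarks to improve vision-language foundation model performance on Vietnamese PET/CT report generation | foundation model, multimodal
4 | LLM-RG: Referential Grounding in Outdoor Scenarios using Large Language Models | LLM-RG: uses large language models for referential grounding in outdoor scenarios | large language model, chain-of-thought
5 | GHOST: Hallucination-Inducing Image Generation for Multimodal LLMs | GHOST: a hallucination-inducing image generation method for stress-testing multimodal LLMs | large language model, multimodal
6 | Mitigating Hallucination in Multimodal LLMs with Layer Contrastive Decoding | Proposes Layer Contrastive Decoding (LayerCD) to mitigate hallucination in multimodal large language models | large language model, multimodal
7 | OIG-Bench: A Multi-Agent Annotated Benchmark for Multimodal One-Image Guides Understanding | Proposes the OIG-Bench benchmark for evaluating how well multimodal large language models understand one-image guides | large language model, multimodal
8 | Vision Function Layer in Multimodal LLMs | Reveals vision function layers in multimodal LLMs, enabling efficient and customizable visual capabilities | large language model, multimodal
9 | Multimodal Arabic Captioning with Interpretable Visual Concept Integration | VLCAP: a multimodal Arabic image captioning framework with interpretable visual concept integration | multimodal
10 | VideoAnchor: Reinforcing Subspace-Structured Visual Cues for Coherent Visual-Spatial Reasoning | VideoAnchor: reinforces subspace-structured visual cues for coherent visual-spatial reasoning | large language model, multimodal, visual grounding
11 | A Scalable Distributed Framework for Multimodal GigaVoxel Image Registration | Proposes the FFDP framework, enabling multimodal image registration at an unprecedented gigavoxel scale | multimodal
12 | Robust Multimodal Semantic Segmentation with Balanced Modality Contributions | Proposes EQUISeg, which improves the robustness of multimodal semantic segmentation by balancing modality contributions | multimodal
13 | Uni-X: Mitigating Modality Conflict with a Two-End-Separated Architecture for Unified Multimodal Models | Proposes the Uni-X model, which mitigates modality conflict in unified multimodal models with a two-end-separated architecture | multimodal
14 | Seeing Before Reasoning: A Unified Framework for Generalizable and Explainable Fake Image Detection | Proposes the Forensic-Chat framework, improving the generalizability and explainability of multimodal large language models in fake image detection | large language model, multimodal
15 | PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images | PixelCraft: a multi-agent system for high-fidelity visual reasoning on structured images | large language model, multimodal
16 | VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning | Proposes the VT-FSL framework, which uses LLMs to bridge vision and text and improve few-shot learning performance | large language model, multimodal
17 | Environment-Aware Satellite Image Generation with Diffusion Models | Proposes an environment-aware diffusion model for generating high-quality, environment-relevant satellite images | foundation model, multimodal
18 | FreeRet: MLLMs as Training-Free Retrievers | FreeRet: strong multimodal retrieval with MLLMs, no training required | large language model, multimodal
19 | Euclid's Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks | Proposes the Euclid30K dataset and fine-tunes vision-language models, significantly improving their spatial perception and reasoning | large language model, multimodal
20 | UI2V-Bench: An Understanding-based Image-to-video Generation Benchmark | UI2V-Bench: proposes an understanding-based image-to-video generation benchmark focused on semantic understanding and reasoning | large language model, multimodal
21 | VISOR++: Universal Visual Inputs based Steering for Large Vision Language Models | VISOR++: a behavior-steering method for vision-language models based on universal visual inputs | multimodal
22 | Perceive, Reflect and Understand Long Video: Progressive Multi-Granular Clue Exploration with Interactive Agents | CogniGPT: interactive multi-granular clue exploration that improves the efficiency and reliability of long-video understanding | large language model
23 | Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models | Proposes a training-free token pruning method to reduce the inference cost of vision-language models | multimodal
24 | Instruction Guided Multi Object Image Editing with Quantity and Layout Consistency | Proposes QL-Adapter to address quantity and layout consistency in multi-object image editing | instruction following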

🔬 Pillar 2: RL & Architecture (14 papers)

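# | Title | One-line summary | Tags
25 | LUMA: Low-Dimension Unified Motion Alignment with Dual-Path Anchoring for Text-to-Motion Diffusion Model | LUMA: a text-to-motion diffusion model with low-dimension unified motion alignment based on dual-path anchoring | contrastive learning, motion diffusion model, motion diffusion
26 | Vid-LLM: A Compact Video-based 3D Multimodal LLM with Reconstruction-Reasoning Synergy | Vid-LLM: proposes a compact video-based 3D multimodal LLM with reconstruction-reasoning synergy | distillation, metric depth, scene understanding
27 | DAM: Dual Active Learning with Multimodal Foundation Model for Source-Free Domain Adaptation | Proposes DAM, which uses a multimodal foundation model for dual active learning in source-free domain adaptation | distillation, foundation model, multimodal
28 | BRIDGE -- Building Reinforcement-Learning Depth-to-Image Data Generation Engine for Monocular Depth Estimation | Proposes BRIDGE, a reinforcement-learning-based depth-to-image generation engine for monocular depth estimation | reinforcement learning, depth estimation, monocular depth
29 | VTPerception-R1: Enhancing Multimodal Reasoning via Explicit Visual and Textual Perceptual Grounding | VTPerception-R1: enhances multimodal reasoning through explicit visual and textual perception | reinforcement learning, large language model, multimodal
30 | Score Distillation of Flow Matching Models | Successfully applies score distillation to flow matching models, enabling fast, high-quality image generation | flow matching, distillation
31 | Mitigating Visual Hallucinations via Semantic Curriculum Preference Optimization in MLLMs | Proposes the SCPO framework, which mitigates visual hallucinations in multimodal large language models through semantic curriculum preference optimization | DPO, direct preference optimization, large language model
32 | UI-UG: A Unified MLLM for UI Understanding and Generation | UI-UG: a unified multimodal large language model for user interface understanding and generation | DPO, direct preference optimization, large language model
33 | Geo-R1: Unlocking VLM Geospatial Reasoning with Cross-View Reinforcement Learning | Geo-R1: unlocks geospatial reasoning in vision-language models through cross-view reinforcement learning | reinforcement learning, chain-of-thought
34 | Visual Jigsaw Post-Training Improves MLLMs | Visual Jigsaw: improves multimodal large language models through visual jigsaw post-training | reinforcement learning, large language model, multimodal
35 | REALIGN: Regularized Procedure Alignment with Matching Video Embeddings via Partial Gromov-Wasserstein Optimal Transport | REALIGN: a procedural video alignment method based on regularized fused partial Gromov-Wasserstein optimal transport | representation learning, contrastive learning, egocentric
36 | Event-based Facial Keypoint Alignment via Cross-Modal Fusion Attention and Self-Supervised Multi-Event Representation Learning | Proposes an event-camera facial keypoint alignment method based on cross-modal fusion attention and self-supervised multi-event representation learning | representation learning
37 | Generalist Multi-Class Anomaly Detection via Distillation to Two Heterogeneous Student Networks | Proposes knowledge-distillation-based dual heterogeneous student networks for generalist multi-class anomaly detection | distillation
38 | Rolling Forcing: Autoregressive Long Video Diffusion in Real Time | Proposes Rolling Forcing, enabling real-time autoregressive long video diffusion generation with significantly reduced error accumulation | world model, distillation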

🔬 Pillar 3: Perception & Semantics (13 papers)

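# | Title | One-line summary | Tags
39 | GEM: 3D Gaussian Splatting for Efficient and Accurate Cryo-EM Reconstruction | GEM: an efficient and accurate cryo-EM reconstruction framework based on 3D Gaussian splatting | 3D gaussian splatting, 3DGS, gaussian splatting
40 | CORE-3D: Context-aware Open-vocabulary Retrieval by Embeddings in 3D | CORE-3D: open-vocabulary 3D scene retrieval through 3D embeddings and context awareness | scene understanding, semantic mapping, semantic map
41 | Proxy-GS: Efficient 3D Gaussian Splatting via Proxy Mesh | Proxy-GS: efficient 3D Gaussian splatting via a proxy mesh, improving rendering speed and quality | 3D gaussian splatting, 3DGS, gaussian splatting
42 | Triangle Splatting+: Differentiable Rendering with Opaque Triangles | Triangle Splatting+: proposes a differentiable rendering method based on opaque triangles, enabling efficient mesh reconstruction and novel view synthesis | 3D gaussian splatting, 3DGS, gaussian splatting
43 | VGGT-X: When VGGT Meets Dense Novel View Synthesis | VGGT-X: targets novel view synthesis for dense scenes, improving 3D foundation model performance | 3DGS, NeRF, VGGT
44 | Classifier-Centric Adaptive Framework for Open-Vocabulary Camouflaged Object Segmentation | Proposes a classifier-centric adaptive framework that improves open-vocabulary camouflaged object segmentation performance | open-vocabulary, open vocabulary
45 | GaussianLens: Localized High-Resolution Reconstruction via On-Demand Gaussian Densification | GaussianLens: localized high-resolution reconstruction based on on-demand Gaussian densification | 3D gaussian splatting, 3DGS, gaussian splatting
46 | HBSplat: Robust Sparse-View Gaussian Reconstruction with Hybrid-Loss Guided Depth and Bidirectional Warping | HBSplat: robust sparse-view Gaussian reconstruction based on hybrid-loss guided depth and bidirectional warping | depth estimation, 3D gaussian splatting, 3DGS
47 | DepthLM: Metric Depth From Vision Language Models | DepthLM: metric depth estimation with vision-language models, without modifying the architecture or loss function | depth estimation, metric depth
48 | ExGS: Extreme 3D Gaussian Compression with Diffusion Priors | ExGS: extreme 3D Gaussian compression using diffusion priors while preserving high-quality rendering | 3D gaussian splatting, 3DGS, gaussian splatting
49 | LVT: Large-Scale Scene Reconstruction via Local View Transformers | Proposes the Local View Transformer (LVT) for large-scale scene reconstruction and novel view synthesis | scene reconstruction
50 | Social 3D Scene Graphs: Modeling Human Actions and Relations for Interactive Service Robots | Proposes Social 3D Scene Graphs to help interactive service robots understand human actions and relations | scene understanding, open-vocabulary, open vocabulary
51 | PAD3R: Pose-Aware Dynamic 3D Reconstruction from Casual Videos | PAD3R: pose-aware dynamic 3D reconstruction from monocular videos | scene understanding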

🔬 Pillar 1: Robot Control (3 papers)

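# | Title | One-line summary | Tags
52 | Fast Feature Field ($\text{F}^3$): A Predictive Representation of Events | Proposes Fast Feature Field (F³) for predictive representation learning on event camera data | quadruped, depth estimation, metric depth
53 | SCOPE: Semantic Conditioning for Sim2Real Category-Level Object Pose Estimation in Robotics | SCOPE: Sim2Real category-level object pose estimation for robotics based on a semantically conditioned diffusion model | manipulation, sim2real
54 | NeoWorld: Neural Simulation of Explorable Virtual Worlds via Progressive 3D Unfolding | NeoWorld: neural simulation of explorable virtual worlds via progressive 3D unfolding | manipulation, world model, representation learning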

🔬 Pillar 8: Physics-based Animation (3 papers)

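# | Title | One-line summary | Tags
55 | StreamForest: Efficient Online Video Understanding with Persistent Event Memory | Proposes StreamForest, which uses persistent event memory for efficient online video understanding | spatiotemporal, large language model, multimodal
56 | PanoWorld-X: Generating Explorable Panoramic Worlds via Sphere-Aware Video Diffusion | PanoWorld-X: generates explorable panoramic worlds with sphere-aware video diffusion | spatiotemporal
57 | PHASE-Net: Physics-Grounded Harmonic Attention System for Efficient Remote Photoplethysmography Measurement | Proposes PHASE-Net, which achieves efficient remote photoplethysmography measurement through a physics-driven harmonic attention mechanism | PULSE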

🔬 Pillar 4: Generative Motion (1 paper)

# | Title | One-line summary | Tags
58 | LaMoGen: Laban Movement-Guided Diffusion for Text-to-Motion Generation | Proposes LaMoGen to address expressive control in text-to-motion generation | text-to-motion, motion synthesis, motion generation

🔬 Pillar 7: Motion Retargeting (1 paper)

# | Title | One-line summary | Tags
59 | DINOReg: Strong Point Cloud Registration with Vision Foundation Model | DINOReg: strong point cloud registration using a vision foundation model | spatial relationship, foundation model

🔬 Pillar 6: Video Extraction (1 paper)

# | Title | One-line summary | Tags
60 | SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs | Proposes SpinBench to evaluate the spatial reasoning abilities of vision-language models | egocentric
