cs.CV (2025-09-29)

📊 66 papers in total | 🔗 24 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (25 🔗8) · Pillar 2: RL & Architecture (16 🔗6) · Pillar 3: Perception & Semantics (15 🔗7) · Pillar 1: Robot Control (4 🔗1) · Pillar 8: Physics-based Animation (3) · Pillar 4: Generative Motion (1) · Pillar 7: Motion Retargeting (1 🔗1) · Pillar 6: Video Extraction (1 🔗1)

🔬 Pillar 9: Embodied Foundation Models (25 papers)

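| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | FishNet++: Analyzing the capabilities of Multimodal Large Language Models in marine biology | FishNet++: evaluates the capabilities of multimodal large language models in marine biology and builds a large-scale benchmark dataset. | large language model, multimodal | |
| 2 | MMRQA: Signal-Enhanced Multimodal Large Language Models for MRI Quality Assessment | Proposes the MMRQA framework, which fuses signal processing with multimodal large language models to improve MRI quality assessment. | large language model, multimodal | |
| 3 | Toward a Vision-Language Foundation Model for Medical Data: Multimodal Dataset and Benchmarks for Vietnamese PET/CT Report Generation | Proposes the ViPET-ReportGen dataset and benchmarks to advance research on medical vision-language foundation models for Vietnamese PET/CT report generation. | foundation model, multimodal | |
| 4 | EVLF-FM: Explainable Vision Language Foundation Model for Medicine | Proposes EVLF-FM, an explainable medical vision-language foundation model for multi-disease diagnosis and visual question answering. | foundation model, multimodal, visual grounding | |
| 5 | LLM-RG: Referential Grounding in Outdoor Scenarios using Large Language Models | LLM-RG: uses large language models for referring-expression grounding in outdoor scenes. | large language model, chain-of-thought | |
| 6 | GHOST: Hallucination-Inducing Image Generation for Multimodal LLMs | GHOST: a hallucination-inducing image generation method for stress-testing multimodal LLMs. | large language model, multimodal | |
| 7 | Mitigating Hallucination in Multimodal LLMs with Layer Contrastive Decoding | Proposes LayerCD, which mitigates hallucination in multimodal LLMs through layer contrastive decoding. | large language model, multimodal | |
| 8 | OIG-Bench: A Multi-Agent Annotated Benchmark for Multimodal One-Image Guides Understanding | OIG-Bench: a multi-agent annotated benchmark for multimodal one-image guide understanding. | large language model, multimodal | |
| 9 | Vision Function Layer in Multimodal LLMs | Reveals vision function layers in multimodal LLMs, enabling efficient, customizable models. | large language model, multimodal | |
| 10 | Multimodal Arabic Captioning with Interpretable Visual Concept Integration | VLCAP: a multimodal Arabic image captioning framework with interpretable visual concept integration. | multimodal | |
| 11 | VideoAnchor: Reinforcing Subspace-Structured Visual Cues for Coherent Visual-Spatial Reasoning | VideoAnchor: achieves coherent visual-spatial reasoning by reinforcing subspace-structured visual cues. | large language model, multimodal, visual grounding | |
| 12 | A Scalable Distributed Framework for Multimodal GigaVoxel Image Registration | Proposes the FFDP framework, enabling multimodal image registration at an unprecedented gigavoxel scale. | multimodal | |
| 13 | Robust Multimodal Semantic Segmentation with Balanced Modality Contributions | Proposes EQUISeg, which improves the robustness of multimodal semantic segmentation by balancing modality contributions. | multimodal | |
| 14 | Uni-X: Mitigating Modality Conflict with a Two-End-Separated Architecture for Unified Multimodal Models | Proposes the Uni-X architecture, which mitigates modality conflict in unified multimodal models through a two-end-separated structure. | multimodal | |
| 15 | Seeing Before Reasoning: A Unified Framework for Generalizable and Explainable Fake Image Detection | Proposes the Forensic-Chat framework to address the limited generalization and explainability of multimodal large language models in fake image detection. | large language model, multimodal | |
| 16 | PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images | PixelCraft: a multi-agent system for high-fidelity visual reasoning on structured images. | large language model, multimodal | |
| 17 | VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning | Proposes the VT-FSL framework, which uses LLMs to bridge vision and text and improve few-shot learning performance. | large language model, multimodal | |
| 18 | Environment-Aware Satellite Image Generation with Diffusion Models | Proposes an environment-aware diffusion model for generating high-quality, environment-relevant satellite images. | foundation model, multimodal | |
| 19 | FreeRet: MLLMs as Training-Free Retrievers | Proposes the FreeRet framework for training-free multimodal retrieval. | large language model, multimodal | |
| 20 | Euclid's Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks | Proposes the Euclid30K dataset and fine-tunes vision-language models, significantly improving their spatial perception and reasoning. | large language model, multimodal | |
| 21 | UI2V-Bench: An Understanding-based Image-to-video Generation Benchmark | Proposes UI2V-Bench to address semantic understanding in image-to-video generation. | large language model, multimodal | |
| 22 | VISOR++: Universal Visual Inputs based Steering for Large Vision Language Models | VISOR++: a behavior-steering method for vision-language models based on universal visual inputs. | multimodal | |
| 23 | Perceive, Reflect and Understand Long Video: Progressive Multi-Granular Clue Exploration with Interactive Agents | CogniGPT: an interactive multi-granular clue exploration framework for efficient long-video understanding. | large language model | |
| 24 | Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models | Proposes a training-free token pruning method to reduce the inference cost of vision-language models. | multimodal | |
| 25 | Instruction Guided Multi Object Image Editing with Quantity and Layout Consistency | Proposes QL-Adapter to address quantity and layout consistency in multi-object image editing. | instruction following | |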

🔬 Pillar 2: RL & Architecture (16 papers)

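| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 26 | LUMA: Low-Dimension Unified Motion Alignment with Dual-Path Anchoring for Text-to-Motion Diffusion Model | LUMA: a text-to-motion diffusion model with low-dimension unified motion alignment based on dual-path anchoring. | contrastive learning, motion diffusion model, motion diffusion | |
| 27 | Vid-LLM: A Compact Video-based 3D Multimodal LLM with Reconstruction-Reasoning Synergy | Vid-LLM: a compact video-based 3D multimodal LLM with reconstruction-reasoning synergy. | distillation, metric depth, scene understanding | |
| 28 | DAM: Dual Active Learning with Multimodal Foundation Model for Source-Free Domain Adaptation | Proposes DAM, which performs dual active learning with a multimodal foundation model for source-free domain adaptation. | distillation, foundation model, multimodal | |
| 29 | BRIDGE -- Building Reinforcement-Learning Depth-to-Image Data Generation Engine for Monocular Depth Estimation | Proposes BRIDGE, a reinforcement-learning-based depth-to-image generation engine for monocular depth estimation. | reinforcement learning, depth estimation, monocular depth | |
| 30 | VTPerception-R1: Enhancing Multimodal Reasoning via Explicit Visual and Textual Perceptual Grounding | VTPerception-R1: enhances multimodal reasoning through explicit visual and textual perception. | reinforcement learning, large language model, multimodal | |
| 31 | Latent Visual Reasoning | Proposes Latent Visual Reasoning (LVR), which performs autoregressive reasoning in the visual embedding space and improves visual question answering performance. | reinforcement learning, large language model, multimodal | |
| 32 | Score Distillation of Flow Matching Models | Successfully applies score distillation to flow matching models for fast, high-quality image generation. | flow matching, distillation | |
| 33 | Mitigating Visual Hallucinations via Semantic Curriculum Preference Optimization in MLLMs | Proposes the SCPO framework, which mitigates visual hallucinations in multimodal large language models through semantic curriculum preference optimization. | DPO, direct preference optimization, large language model | |
| 34 | UI-UG: A Unified MLLM for UI Understanding and Generation | UI-UG: a unified multimodal large language model for user interface understanding and generation. | DPO, direct preference optimization, large language model | |
| 35 | Geo-R1: Unlocking VLM Geospatial Reasoning with Cross-View Reinforcement Learning | Geo-R1: unlocks geospatial reasoning in vision-language models through cross-view reinforcement learning. | reinforcement learning, chain-of-thought | |
| 36 | Visual Jigsaw Post-Training Improves MLLMs | Visual Jigsaw: improves multimodal large language models through visual jigsaw post-training. | reinforcement learning, large language model, multimodal | |
| 37 | Scalable Audio-Visual Masked Autoencoders for Efficient Affective Video Facial Analysis | Proposes AVF-MAE++, scalable audio-visual masked autoencoders for efficient affective video facial analysis, reaching SOTA on multiple benchmarks. | masked autoencoder, MAE | |
| 38 | REALIGN: Regularized Procedure Alignment with Matching Video Embeddings via Partial Gromov-Wasserstein Optimal Transport | REALIGN: a procedural video alignment method based on regularized fused partial Gromov-Wasserstein optimal transport. | representation learning, contrastive learning, egocentric | |
| 39 | Event-based Facial Keypoint Alignment via Cross-Modal Fusion Attention and Self-Supervised Multi-Event Representation Learning | Proposes an event-camera facial keypoint alignment method based on cross-modal fusion attention and self-supervised multi-event representation learning. | representation learning | |
| 40 | Generalist Multi-Class Anomaly Detection via Distillation to Two Heterogeneous Student Networks | Proposes two heterogeneous student networks trained by knowledge distillation for generalist multi-class anomaly detection. | distillation | |
| 41 | Rolling Forcing: Autoregressive Long Video Diffusion in Real Time | Proposes Rolling Forcing, which enables real-time autoregressive long-video diffusion generation and significantly reduces error accumulation. | world model, distillation | |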

🔬 Pillar 3: Perception & Semantics (15 papers)

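| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 42 | GEM: 3D Gaussian Splatting for Efficient and Accurate Cryo-EM Reconstruction | GEM: an efficient and accurate cryo-EM reconstruction framework based on 3D Gaussian splatting. | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 43 | CORE-3D: Context-aware Open-vocabulary Retrieval by Embeddings in 3D | CORE-3D: open-vocabulary retrieval through 3D embeddings and context awareness. | scene understanding, semantic mapping, semantic map | |
| 44 | Proxy-GS: Efficient 3D Gaussian Splatting via Proxy Mesh | Proxy-GS: efficient 3D Gaussian splatting using a proxy mesh, improving rendering speed and quality. | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 45 | Triangle Splatting+: Differentiable Rendering with Opaque Triangles | Triangle Splatting+: proposes a differentiable rendering method based on opaque triangles for efficient mesh reconstruction and novel view synthesis. | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 46 | VGGT-X: When VGGT Meets Dense Novel View Synthesis | VGGT-X: targets dense novel view synthesis, improving the performance of 3D foundation models. | 3DGS, NeRF, VGGT | |
| 47 | Classifier-Centric Adaptive Framework for Open-Vocabulary Camouflaged Object Segmentation | Proposes a classifier-centric adaptive framework that improves open-vocabulary camouflaged object segmentation. | open-vocabulary, open vocabulary | |
| 48 | Forge4D: Feed-Forward 4D Human Reconstruction and Interpolation from Uncalibrated Sparse-view Videos | Forge4D: proposes a feed-forward 4D human reconstruction and interpolation method for fast reconstruction and novel view synthesis from sparse-view videos. | optical flow, motion prediction, TAMP | |
| 49 | GaussianLens: Localized High-Resolution Reconstruction via On-Demand Gaussian Densification | GaussianLens: localized high-resolution reconstruction via on-demand Gaussian densification. | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 50 | HBSplat: Robust Sparse-View Gaussian Reconstruction with Hybrid-Loss Guided Depth and Bidirectional Warping | HBSplat: robust sparse-view Gaussian reconstruction with hybrid-loss guided depth and bidirectional warping. | depth estimation, 3D gaussian splatting, 3DGS | |
| 51 | DepthLM: Metric Depth From Vision Language Models | DepthLM: metric depth estimation with vision-language models, without modifying the architecture or loss function. | depth estimation, metric depth | |
| 52 | ExGS: Extreme 3D Gaussian Compression with Diffusion Priors | ExGS: extreme 3D Gaussian compression with diffusion priors, balancing high compression ratios and high-quality rendering. | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 53 | Talk in Pieces, See in Whole: Disentangling and Hierarchical Aggregating Representations for Language-based Object Detection | Proposes the TaSe framework, which improves language-based object detection by disentangling and hierarchically aggregating language representations. | open-vocabulary, open vocabulary, multimodal | |
| 54 | LVT: Large-Scale Scene Reconstruction via Local View Transformers | Proposes the Local View Transformer (LVT) for large-scale scene reconstruction and novel view synthesis. | scene reconstruction | |
| 55 | Social 3D Scene Graphs: Modeling Human Actions and Relations for Interactive Service Robots | Proposes Social 3D Scene Graphs to help interactive service robots understand human actions and relations. | scene understanding, open-vocabulary, open vocabulary | |
| 56 | PAD3R: Pose-Aware Dynamic 3D Reconstruction from Casual Videos | PAD3R: pose-aware dynamic 3D reconstruction from monocular videos. | scene understanding | |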

🔬 Pillar 1: Robot Control (4 papers)

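| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 57 | Fast Feature Field ($\text{F}^3$): A Predictive Representation of Events | Proposes the Fast Feature Field (F³), a predictive representation of event-camera data for efficient scene understanding and motion estimation. | quadruped, depth estimation, metric depth | |
| 58 | FreeAction: Training-Free Techniques for Enhanced Fidelity of Trajectory-to-Video Generation | Proposes FreeAction, which uses training-free techniques to improve the fidelity of robot videos in trajectory-to-video generation. | manipulation, world model, classifier-free guidance | |
| 59 | SCOPE: Semantic Conditioning for Sim2Real Category-Level Object Pose Estimation in Robotics | SCOPE: uses semantic conditioning for Sim2Real category-level object pose estimation in robotics. | manipulation, sim2real | |
| 60 | NeoWorld: Neural Simulation of Explorable Virtual Worlds via Progressive 3D Unfolding | NeoWorld: neural simulation of explorable virtual worlds via progressive 3D unfolding. | manipulation, world model, representation learning | |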

🔬 Pillar 8: Physics-based Animation (3 papers)

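| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 61 | StreamForest: Efficient Online Video Understanding with Persistent Event Memory | Proposes StreamForest, which uses persistent event memory for efficient online video understanding. | spatiotemporal, large language model, multimodal | |
| 62 | PanoWorld-X: Generating Explorable Panoramic Worlds via Sphere-Aware Video Diffusion | PanoWorld-X: generates explorable panoramic worlds via sphere-aware video diffusion. | spatiotemporal | |
| 63 | PHASE-Net: Physics-Grounded Harmonic Attention System for Efficient Remote Photoplethysmography Measurement | Proposes the physics-informed PHASE-Net for efficient and accurate remote photoplethysmography measurement. | PULSE | |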

🔬 Pillar 4: Generative Motion (1 paper)

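| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 64 | LaMoGen: Laban Movement-Guided Diffusion for Text-to-Motion Generation | LaMoGen: a diffusion-based text-to-motion generation method guided by Laban movement analysis. | text-to-motion, motion synthesis, motion generation | |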

🔬 Pillar 7: Motion Retargeting (1 paper)

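| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 65 | DINOReg: Strong Point Cloud Registration with Vision Foundation Model | DINOReg: strong point cloud registration using a vision foundation model. | spatial relationship, foundation model | |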

🔬 Pillar 6: Video Extraction (1 paper)

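| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 66 | SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs | Proposes SpinBench to evaluate spatial reasoning in vision-language models. | egocentric | |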
