cs.CV (2025-10-09)

📊 50 papers in total | 🔗 12 with code

🎯 Navigation by interest area

Pillar 9: Embodied Foundation Models (14 🔗3)
Pillar 2: RL & Architecture (13 🔗3)
Pillar 3: Perception & Semantics (11 🔗5)
Pillar 8: Physics-based Animation (5 🔗1)
Pillar 1: Robot Control (2)
Pillar 6: Video Extraction (2)
Pillar 7: Motion Retargeting (1)
Pillar 4: Generative Motion (1)
Pillar 5: Interaction & Reaction (1)

🔬 Pillar 9: Embodied Foundation Models (14 papers)

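# | Title | Key point | Tags | 🔗
1 | NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints | NaViL: rethinks the scaling properties of native multimodal large language models under data constraints | large language model, multimodal |
2 | CIR-CoT: Towards Interpretable Composed Image Retrieval via End-to-End Chain-of-Thought Reasoning | Proposes CIR-CoT, which achieves interpretable composed image retrieval through end-to-end chain-of-thought reasoning | large language model, multimodal, chain-of-thought |
3 | BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities | BEAR: benchmarks and enhances multimodal language models for atomic embodied capabilities | large language model, multimodal |
4 | Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation | Puffin: proposes a unified multimodal model for camera-centric understanding and generation | multimodal |
5 | PIT-QMM: A Large Multimodal Model For No-Reference Point Cloud Quality Assessment | Proposes PIT-QMM, a large multimodal model for no-reference point cloud quality assessment | multimodal |
6 | MoA-VR: A Mixture-of-Agents System Towards All-in-One Video Restoration | Proposes MoA-VR, a mixture-of-agents system for general-purpose video restoration that effectively handles complex degradations | large language model, multimodal |
7 | InstructX: Towards Unified Visual Editing with MLLM Guidance | InstructX: a unified visual editing framework guided by an MLLM, covering both image and video editing | large language model, multimodal |
8 | The Visual Iconicity Challenge: Evaluating Vision-Language Models on Sign Language Form-Meaning Mapping | Proposes the Visual Iconicity Challenge to evaluate vision-language models on sign language form-meaning mapping | multimodal, visual grounding |
9 | UniVideo: Unified Understanding, Generation, and Editing for Videos | UniVideo: a multimodal framework that unifies video understanding, generation, and editing | large language model, multimodal |
10 | D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition | D-CoDe: extends image-pretrained VLMs to video through dynamic compression and question decomposition | large language model |
11 | ARTDECO: Towards Efficient and High-Fidelity On-the-Fly 3D Reconstruction with Structured Scene Representation | ARTDECO: efficient, high-fidelity on-the-fly 3D reconstruction based on a structured scene representation | foundation model |
12 | To Sink or Not to Sink: Visual Information Pathways in Large Vision-Language Models | For large vision-language models, proposes leveraging ViT attention sinks to strengthen visual reasoning | large language model |
13 | Improving Temporal Understanding Logic Consistency in Video-Language Models via Attention Enhancement | Proposes Temporally Conditioned Attention Sharpening (TCAS) to improve the logical consistency of temporal understanding in video-language models | large language model |
14 | AlignGS: Aligning Geometry and Semantics for Robust Indoor Reconstruction from Sparse Views | AlignGS: aligns geometry and semantics for robust indoor reconstruction from sparse views | foundation model |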

🔬 Pillar 2: RL & Architecture (13 papers)

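# | Title | Key point | Tags | 🔗
15 | Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools | Video-STAR: open-vocabulary action recognition via tool-augmented reinforcement learning | reinforcement learning, open-vocabulary, open vocabulary |
16 | FOLK: Fast Open-Vocabulary 3D Instance Segmentation via Label-guided Knowledge Distillation | Proposes FOLK, which achieves fast open-vocabulary 3D instance segmentation via label-guided knowledge distillation | distillation, open-vocabulary, open vocabulary |
17 | MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization | MM-HELIX: improves multimodal long-chain reflective reasoning with a holistic platform and adaptive hybrid policy optimization | reinforcement learning, large language model, multimodal |
18 | MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning | Proposes the MATRIX framework, which achieves robust tool-use reasoning through multimodal agent tuning | preference learning, multimodal |
19 | CVD-STORM: Cross-View Video Diffusion with Spatial-Temporal Reconstruction Model for Autonomous Driving | Proposes CVD-STORM, which uses a spatial-temporal reconstruction diffusion model to generate long multi-view videos for autonomous driving, with 4D reconstruction capability | world model, depth estimation, gaussian splatting |
20 | MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding | Proposes MARC, a video token compression method based on memory-augmented reinforcement learning, for efficient video understanding | reinforcement learning, distillation, large language model |
21 | Dream to Recall: Imagination-Guided Experience Retrieval for Memory-Persistent Vision-and-Language Navigation | Memoir: proposes imagination-guided experience retrieval to improve memory-persistent vision-and-language navigation | world model, VLN, language conditioned |
22 | Gaze on the Prize: Shaping Visual Attention with Return-Guided Contrastive Learning | Proposes a visual attention mechanism based on return-guided contrastive learning that improves the sample efficiency of reinforcement learning | reinforcement learning, contrastive learning |
23 | SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models | SpatialLadder: improves spatial reasoning in vision-language models through progressive training | reinforcement learning, multimodal |
24 | VideoVerse: How Far is Your T2V Generator from a World Model? | VideoVerse: builds a more comprehensive evaluation benchmark for text-to-video generation models, measuring their gap from world models | world model |
25 | SimCast: Enhancing Precipitation Nowcasting with Short-to-Long Term Knowledge Distillation | SimCast: enhances precipitation nowcasting with short-to-long-term knowledge distillation | distillation |
26 | FlowLensing: Simulating Gravitational Lensing with Flow Matching | FlowLensing: uses flow matching to accelerate gravitational lensing simulation, supporting dark matter research | flow matching |
27 | LinVideo: A Post-Training Framework towards O(n) Attention in Efficient Video Generation | LinVideo: a post-training framework that achieves O(n)-complexity attention for efficient video generation | linear attention, spatiotemporal |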

🔬 Pillar 3: Perception & Semantics (11 papers)

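# | Title | Key point | Tags | 🔗
28 | Efficient Label Refinement for Face Parsing Under Extreme Poses Using 3D Gaussian Splatting | Uses 3D Gaussian Splatting to refine face parsing labels, improving parsing accuracy under extreme poses | 3D gaussian splatting, 3DGS, gaussian splatting |
29 | PrismGS: Physically-Grounded Anti-Aliasing for High-Fidelity Large-Scale 3D Gaussian Splatting | PrismGS: a physically grounded anti-aliasing method for high-fidelity, large-scale 3D Gaussian Splatting | 3D gaussian splatting, 3DGS, gaussian splatting |
30 | DEGS: Deformable Event-based 3D Gaussian Splatting from RGB and Event Stream | Proposes DEGS, which combines RGB and event streams for deformable, dynamic 3D Gaussian Splatting | 3D gaussian splatting, 3DGS, gaussian splatting |
31 | XYZCylinder: Towards Compatible Feed-Forward 3D Gaussian Splatting for Driving Scenes via Unified Cylinder Lifting Method | XYZCylinder: compatible 3D Gaussian Splatting for driving scenes via a unified cylinder lifting method | 3D gaussian splatting, gaussian splatting, splatting |
32 | D$^2$GS: Depth-and-Density Guided Gaussian Splatting for Stable and Accurate Sparse-View Reconstruction | D$^2$GS: depth-and-density guided Gaussian Splatting for stable and accurate sparse-view reconstruction | 3D gaussian splatting, 3DGS, gaussian splatting |
33 | A Multimodal Depth-Aware Method For Embodied Reference Understanding | Proposes a multimodal depth-aware method for embodied reference understanding | open-vocabulary, open vocabulary, multimodal |
34 | An End-to-End Room Geometry Constrained Depth Estimation Framework for Indoor Panorama Images | Proposes an end-to-end depth estimation framework for indoor panorama images, constrained by room geometry | depth estimation |
35 | FMANet: A Novel Dual-Phase Optical Flow Approach with Fusion Motion Attention Network for Robust Micro-expression Recognition | Proposes FMANet, which uses dual-phase optical flow and a fusion motion attention network to make micro-expression recognition more robust | optical flow |
36 | ReSplat: Learning Recurrent Gaussian Splats | Proposes ReSplat, a recurrent model that iteratively refines Gaussian splats, improving rendering quality and efficiency | gaussian splatting, splatting |
37 | ComGS: Efficient 3D Object-Scene Composition via Surface Octahedral Probes | ComGS: efficient 3D object-scene composition via surface octahedral probes | gaussian splatting, splatting |
38 | The impact of abstract and object tags on image privacy classification | Studies the impact of abstract and object tags on image privacy classification, showing that tag type and number play a key role | scene understanding |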

🔬 Pillar 8: Physics-based Animation (5 papers)

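# | Title | Key point | Tags | 🔗
39 | SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models | SciVideoBench: proposes a scientific video reasoning benchmark to evaluate the cognitive abilities of large multimodal models in scientific domains | spatiotemporal, multimodal, visual grounding |
40 | VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning | VideoCanvas: unified video completion from arbitrary spatiotemporal patches via in-context conditioning | spatiotemporal |
41 | Physics-Driven Spatiotemporal Modeling for AI-Generated Video Detection | Proposes a physics-driven spatiotemporal modeling method for detecting AI-generated video | spatiotemporal |
42 | TTOM: Test-Time Optimization and Memorization for Compositional Video Generation | Proposes TTOM, a test-time optimization and memorization framework for compositional video generation | spatiotemporal, foundation model |
43 | Q-Router: Agentic Video Quality Assessment with Expert Model Routing and Artifact Localization | Q-Router: agentic video quality assessment based on expert model routing and artifact localization | spatiotemporal |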

🔬 Pillar 1: Robot Control (2 papers)

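# | Title | Key point | Tags | 🔗
44 | Unlocking 3D Affordance Segmentation with 2D Semantic Knowledge | Proposes CMAT and CAST, which use 2D semantic knowledge to improve 3D affordance segmentation | manipulation, affordance, embodied AI |
45 | Hierarchical Spatial Algorithms for High-Resolution Image Quantization and Feature Extraction | Proposes a hierarchical spatial algorithm framework for high-resolution image quantization and feature extraction | manipulation |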

🔬 Pillar 6: Video Extraction (2 papers)

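# | Title | Key point | Tags | 🔗
46 | VideoNorms: Benchmarking Cultural Awareness of Video Language Models | VideoNorms: builds a benchmark for the cultural awareness of video language models, revealing their shortcomings in cross-cultural understanding | HuMoR, large language model |
47 | SyncHuman: Synchronizing 2D and 3D Generative Models for Single-view Human Reconstruction | SyncHuman: synchronizes 2D and 3D generative models for single-view human reconstruction | SMPL |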

🔬 Pillar 7: Motion Retargeting (1 paper)

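# | Title | Key point | Tags | 🔗
48 | Beyond Textual CoT: Interleaved Text-Image Chains with Deep Confidence Reasoning for Image Editing | Proposes the MURE framework, which uses interleaved text-image chains and deep confidence reasoning for image editing | spatial relationship, large language model, multimodal |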

🔬 Pillar 4: Generative Motion (1 paper)

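# | Title | Key point | Tags | 🔗
49 | Fine-grained text-driven dual-human motion generation via dynamic hierarchical interaction | Proposes FineDual, which generates fine-grained, text-driven dual-human motion through dynamic hierarchical interaction | motion generation, large language model |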

🔬 Pillar 5: Interaction & Reaction (1 paper)

# | Title | Key point | Tags | 🔗
50 | MMHOI: Modeling Complex 3D Multi-Human Multi-Object Interactions | Proposes the MMHOI dataset and MMHOI-Net for modeling complex 3D multi-human, multi-object interactions | human-object interaction, HOI |
