cs.CV (2025-10-27)

📊 39 papers in total | 🔗 8 with code

🎯 Navigation by Interest Area

Pillar 9: Embodied Foundation Models (17 🔗3) · Pillar 2: RL & Architecture (11 🔗2) · Pillar 3: Perception & Semantics (7 🔗1) · Pillar 8: Physics-based Animation (1) · Pillar 6: Video Extraction (1 🔗1) · Pillar 5: Interaction & Reaction (1 🔗1) · Pillar 4: Generative Motion (1)

🔬 Pillar 9: Embodied Foundation Models (17 papers)

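# | Title | One-sentence summary | Tags | 🔗
1 | Survey of Multimodal Geospatial Foundation Models: Techniques, Applications, and Challenges | Surveys multimodal geospatial foundation models that address challenges in remote-sensing image analysis. | foundation model, multimodal
2 | A Survey on Efficient Vision-Language-Action Models | A survey of efficient vision-language-action (Efficient VLA) models aimed at reducing compute and data requirements. | vision-language-action, foundation model
3 | Towards Generalisable Foundation Models for 3D Brain MRI | BrainFound: a general-purpose foundation model for 3D brain MRI that improves disease detection and segmentation. | foundation model, multimodal
4 | PISA-Bench: The PISA Index as a Multilingual and Multimodal Metric for the Evaluation of Vision-Language Models | PISA-Bench: a multilingual, multimodal benchmark for evaluating vision-language models. | large language model, multimodal
5 | PRISM-Bench: A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection | PRISM-Bench: a puzzle-based benchmark for interpretable multimodal reasoning. | large language model, multimodal, chain-of-thought
6 | Multitask Multimodal Self-Supervised Learning for Medical Images | Proposes Medformer for multitask, multimodal self-supervised learning on medical images, reducing reliance on labeled data. | multimodal
7 | MMSD3.0: A Multi-Image Benchmark for Real-World Multimodal Sarcasm Detection | Proposes the MMSD3.0 multi-image sarcasm detection benchmark and the CIRM model for recognizing sarcasm from multi-image cues in real-world settings. | multimodal
8 | AG-Fusion: adaptive gated multimodal fusion for 3d object detection in complex scenes | Proposes an adaptive gated fusion method for 3D object detection in complex scenes. | multimodal
9 | Implicit Modeling for Transferability Estimation of Vision Foundation Models | Proposes implicit transferability modeling (ITM) to efficiently estimate how well vision foundation models transfer to downstream tasks. | foundation model
10 | Revisiting Multimodal Positional Encoding in Vision-Language Models | Proposes multi-head rotary position encoding (MHRoPE) and its variant MRoPE-I to improve multimodal positional encoding in vision-language models. | multimodal
11 | LightFusion: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation | LightFusion: a lightweight double-fusion framework for unified multimodal understanding and generation. | multimodal
12 | DynaStride: Dynamic Stride Windowing with MMCoT for Instructional Multi-Scene Captioning | DynaStride: dynamic stride windowing combined with MMCoT for generating multi-scene captions of instructional videos. | multimodal, chain-of-thought
13 | PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity | Proposes PixelRefer, a unified region-level MLLM framework for spatio-temporal object referring at arbitrary granularity. | large language model, multimodal
14 | On the Faithfulness of Visual Thinking: Measurement and Enhancement | Proposes the SCCM learning strategy to improve the reliability and sufficiency of visual information in the multimodal reasoning of vision-language models. | multimodal, chain-of-thought
15 | CountFormer: A Transformer Framework for Learning Visual Repetition and Structure in Class-Agnostic Object Counting | CountFormer: a Transformer framework that learns visual repetition and structure for class-agnostic object counting. | foundation model
16 | A Video Is Not Worth a Thousand Words | Proposes Shapley-value-based feature attribution and modality scoring to assess how much VLMs rely on text in VQA tasks. | large language model
17 | The Underappreciated Power of Vision Models for Graph Structural Understanding | Uses vision models for graph structural understanding, matching graph neural networks in performance and revealing their advantage in global perception. | foundation model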

🔬 Pillar 2: RL & Architecture (11 papers)

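# | Title | One-sentence summary | Tags | 🔗
18 | Finding 3D Scene Analogies with Multimodal Foundation Models | Uses multimodal foundation models for zero-shot 3D scene analogies, applied to transferring robot trajectories and waypoints. | imitation learning, open-vocabulary, open vocabulary
19 | VR-Drive: Viewpoint-Robust End-to-End Driving with Feed-Forward 3D Gaussian Splatting | VR-Drive: viewpoint-robust end-to-end autonomous driving via feed-forward 3D Gaussian Splatting. | distillation, 3D gaussian splatting, gaussian splatting
20 | Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning | Proposes Video-Thinker, which uses reinforcement learning to enable MLLMs to reason over videos. | reinforcement learning, large language model, multimodal
21 | VideoTG-R1: Boosting Video Temporal Grounding via Curriculum Reinforcement Learning on Reflected Boundary Annotations | VideoTG-R1: improves video temporal grounding via curriculum reinforcement learning on reflected boundary annotations. | reinforcement learning, large language model, multimodal
22 | Accurate and Scalable Multimodal Pathology Retrieval via Attentive Vision-Language Alignment | PathSearch: an accurate, scalable multimodal pathology image retrieval framework based on attentive vision-language alignment. | contrastive learning, multimodal
23 | HieraMamba: Video Temporal Grounding via Hierarchical Anchor-Mamba Pooling | HieraMamba: video temporal grounding via hierarchical Anchor-Mamba pooling. | Mamba, Ego4D, AMP
24 | FARMER: Flow AutoRegressive Transformer over Pixels | FARMER: a pixel-level generative model based on a flow autoregressive Transformer, offering exact likelihood estimation and high-quality image synthesis. | distillation, classifier-free guidance, large language model
25 | VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation | Proposes the VOLD framework, which transfers LLM reasoning ability to vision-language models via on-policy distillation. | reinforcement learning, distillation
26 | MergeMix: A Unified Augmentation Paradigm for Visual and Multi-Modal Understanding | Proposes MergeMix, a unified augmentation paradigm for visual and multimodal understanding that improves efficiency and alignment quality. | reinforcement learning, large language model
27 | Bi-Encoder Contrastive Learning for Fingerprint and Iris Biometrics | Proposes a Bi-Encoder contrastive learning method for cross-modal biometric recognition between fingerprint and iris. | contrastive learning
28 | Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations | Concerto: joint 2D-3D self-supervised learning from which spatial representations emerge. | distillation, scene understanding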

🔬 Pillar 3: Perception & Semantics (7 papers)

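# | Title | One-sentence summary | Tags | 🔗
29 | PlanarGS: High-Fidelity Indoor 3D Gaussian Splatting Guided by Vision-Language Planar Priors | PlanarGS: high-fidelity indoor 3D Gaussian Splatting guided by vision-language planar priors. | 3D gaussian splatting, 3DGS, gaussian splatting
30 | Gen-LangSplat: Generalized Language Gaussian Splatting with Pre-Trained Feature Compression | Gen-LangSplat: generalized language Gaussian Splatting that uses pre-trained feature compression to improve efficiency. | 3D gaussian splatting, gaussian splatting, splatting
31 | EndoWave: Rational-Wavelet 4D Gaussian Splatting for Endoscopic Reconstruction | EndoWave: rational-wavelet 4D Gaussian Splatting for endoscopic reconstruction. | 3DGS, gaussian splatting, splatting
32 | Improving Visual Discriminability of CLIP for Training-Free Open-Vocabulary Semantic Segmentation | Proposes LHT-CLIP, which improves the visual discriminability of CLIP for open-vocabulary semantic segmentation without any training. | open-vocabulary, open vocabulary
33 | More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models | Proposes MERGE, which unifies image generation and depth estimation via text-to-image diffusion models. | depth estimation
34 | Yesnt: Are Diffusion Relighting Models Ready for Capture Stage Compositing? A Hybrid Alternative to Bridge the Gap | Proposes the hybrid framework Yesnt, which improves the temporal stability of diffusion models when relighting dynamic volumetric video. | optical flow
35 | UrbanIng-V2X: A Large-Scale Multi-Vehicle, Multi-Infrastructure Dataset Across Multiple Intersections for Cooperative Perception | UrbanIng-V2X: a large-scale multi-vehicle, multi-infrastructure dataset spanning multiple intersections for cooperative perception. | scene understanding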

🔬 Pillar 8: Physics-based Animation (1 paper)

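# | Title | One-sentence summary | Tags | 🔗
36 | Positional Preservation Embedding for Multimodal Large Language Models | Proposes Positional Preservation Embedding (PPE) to improve the efficiency and performance of multimodal large language models on vision-language tasks. | spatiotemporal, large language model, multimodal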

🔬 Pillar 6: Video Extraction (1 paper)

# | Title | One-sentence summary | Tags | 🔗
37 | EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT | EgoThinker: uses spatio-temporal CoT to unveil egocentric reasoning ability. | egocentric, large language model, multimodal

🔬 Pillar 5: Interaction & Reaction (1 paper)

# | Title | One-sentence summary | Tags | 🔗
38 | DecoDINO: 3D Human-Scene Contact Prediction with Semantic Classification | Proposes DecoDINO for predicting contact between humans and scenes. | human-object interaction

🔬 Pillar 4: Generative Motion (1 paper)

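# | Title | One-sentence summary | Tags | 🔗
39 | VoMP: Predicting Volumetric Mechanical Property Fields | VoMP: predicts volumetric mechanical property fields of 3D objects to speed up physics simulation. | physically plausible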
