cs.CV (2025-10-27)

📊 38 papers in total | 🔗 8 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (16, 🔗 3) · Pillar 2: RL Algorithms & Architecture (11, 🔗 2) · Pillar 3: Spatial Perception & Semantics (7, 🔗 1) · Pillar 8: Physics-based Animation (1) · Pillar 6: Video Extraction & Matching (1, 🔗 1) · Pillar 5: Interaction & Reaction (1, 🔗 1) · Pillar 4: Generative Motion (1)

🔬 Pillar 9: Embodied Foundation Models (16 papers)

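# | Title | One-line summary | Tags
1 | Survey of Multimodal Geospatial Foundation Models: Techniques, Applications, and Challenges | Surveys multimodal geospatial foundation models, addressing heterogeneity and distribution shift in remote sensing image analysis. | foundation model, multimodal
2 | A Survey on Efficient Vision-Language-Action Models | Surveys efficient vision-language-action models, aiming to bridge the gap between digital knowledge and physical-world interaction. | vision-language-action, VLA
3 | Towards Generalisable Foundation Models for Brain MRI | BrainFound: a generalisable foundation model for brain MRI. | foundation model, multimodal
4 | PISA-Bench: The PISA Index as a Multilingual and Multimodal Metric for the Evaluation of Vision-Language Models | PISA-Bench: a multilingual, multimodal benchmark for evaluating vision-language models. | large language model, multimodal
5 | PRISM-Bench: A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection | PRISM-Bench: a puzzle-based visual task benchmark with CoT error detection. | large language model, multimodal, chain-of-thought
6 | Multitask Multimodal Self-Supervised Learning for Medical Images | Proposes Medformer for multitask multimodal self-supervised learning on medical images, reducing reliance on labeled data. | multimodal
7 | MMSD3.0: A Multi-Image Benchmark for Real-World Multimodal Sarcasm Detection | Proposes the MMSD3.0 multi-image sarcasm detection benchmark and the CIRM model to address sarcasm recognition from multi-image cues in real-world scenarios. | multimodal
8 | AG-Fusion: adaptive gated multimodal fusion for 3d object detection in complex scenes | Proposes AG-Fusion, an adaptive gated multimodal fusion method, to address the robustness of 3D object detection in complex scenes. | multimodal
9 | Implicit Modeling for Transferability Estimation of Vision Foundation Models | Proposes the Implicit Transferability Modeling (ITM) framework, improving the accuracy and efficiency of transferability estimation for vision foundation models. | foundation model
10 | Revisiting Multimodal Positional Encoding in Vision-Language Models | Proposes the multi-head rotary position encoding MHRoPE and its variants, improving the multimodal understanding of vision-language models. | multimodal
11 | LightFusion: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation | LightFusion: a lightweight double-fusion framework for unified multimodal understanding and generation. | multimodal
12 | DynaStride: Dynamic Stride Windowing with MMCoT for Instructional Multi-Scene Captioning | DynaStride: uses MMCoT and dynamic stride windowing for multi-scene captioning of instructional videos. | multimodal, chain-of-thought
13 | PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity | Proposes PixelRefer, a unified region-level multimodal large language model framework for spatio-temporal object referring at arbitrary granularity. | large language model, multimodal
14 | On the Faithfulness of Visual Thinking: Measurement and Enhancement | Proposes the SCCM learning strategy, improving the reliability and sufficiency of visual information in the multimodal reasoning of vision-language models. | multimodal, chain-of-thought
15 | CountFormer: A Transformer Framework for Learning Visual Repetition and Structure in Class-Agnostic Object Counting | CountFormer: Transformer-based class-agnostic object counting that learns visual repetition and structure. | foundation model
16 | The Underappreciated Power of Vision Models for Graph Structural Understanding | Explores the potential of vision models for graph structural understanding and proposes the GraphAbstract benchmark. | foundation model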

🔬 Pillar 2: RL Algorithms & Architecture (11 papers)

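# | Title | One-line summary | Tags
17 | Finding 3D Scene Analogies with Multimodal Foundation Models | Proposes a zero-shot 3D scene analogy method based on multimodal foundation models, for transferring robot trajectories and waypoints. | imitation learning, open-vocabulary, open vocabulary
18 | VR-Drive: Viewpoint-Robust End-to-End Driving with Feed-Forward 3D Gaussian Splatting | VR-Drive: viewpoint-robust end-to-end autonomous driving using feed-forward 3D Gaussian splatting. | distillation, 3D gaussian splatting, gaussian splatting
19 | Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning | Proposes Video-Thinker, which uses reinforcement learning to enable MLLMs to reason over videos. | reinforcement learning, large language model, multimodal
20 | VideoTG-R1: Boosting Video Temporal Grounding via Curriculum Reinforcement Learning on Reflected Boundary Annotations | VideoTG-R1: improves video temporal grounding via curriculum reinforcement learning on reflected boundary annotations. | reinforcement learning, large language model, multimodal
21 | Accurate and Scalable Multimodal Pathology Retrieval via Attentive Vision-Language Alignment | PathSearch: an accurate, scalable multimodal pathology retrieval framework based on attentive vision-language alignment. | contrastive learning, multimodal
22 | HieraMamba: Video Temporal Grounding via Hierarchical Anchor-Mamba Pooling | Proposes HieraMamba, which performs video temporal grounding via hierarchical Anchor-Mamba pooling. | Mamba, Ego4D, AMP
23 | FARMER: Flow AutoRegressive Transformer over Pixels | FARMER: a pixel-level generative model based on a flow autoregressive Transformer, achieving exact likelihood estimation and high-quality image synthesis. | distillation, classifier-free guidance, large language model
24 | VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation | Proposes VOLD, which transfers LLM reasoning ability to vision-language models via on-policy distillation. | reinforcement learning, distillation
25 | MergeMix: A Unified Augmentation Paradigm for Visual and Multi-Modal Understanding | Proposes MergeMix, a unified augmentation paradigm that improves visual and multimodal understanding. | reinforcement learning, large language model
26 | Bi-Encoder Contrastive Learning for Fingerprint and Iris Biometrics | Uses bi-encoder contrastive learning to explore the correlation between fingerprint and iris biometrics. | contrastive learning
27 | Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations | Concerto: joint 2D-3D self-supervised learning from which spatial representations emerge. | distillation, scene understanding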

🔬 Pillar 3: Spatial Perception & Semantics (7 papers)

# | Title | One-line summary | Tags
28 | PlanarGS: High-Fidelity Indoor 3D Gaussian Splatting Guided by Vision-Language Planar Priors | Proposes PlanarGS for high-fidelity reconstruction of indoor scenes. | 3D gaussian splatting, 3DGS, gaussian splatting
29 | Gen-LangSplat: Generalized Language Gaussian Splatting with Pre-Trained Feature Compression | Gen-LangSplat: generalized language Gaussian splatting using pre-trained feature compression, improving efficiency. | 3D gaussian splatting, gaussian splatting, splatting
30 | EndoWave: Rational-Wavelet 4D Gaussian Splatting for Endoscopic Reconstruction | EndoWave: rational-wavelet 4D Gaussian splatting for endoscopic reconstruction. | 3DGS, gaussian splatting, splatting
31 | Improving Visual Discriminability of CLIP for Training-Free Open-Vocabulary Semantic Segmentation | Proposes LHT-CLIP, which improves the visual discriminability of CLIP for open-vocabulary semantic segmentation without training. | open-vocabulary, open vocabulary
32 | More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models | Proposes MERGE, unifying image generation and depth estimation via text-to-image diffusion models. | depth estimation
33 | Yesnt: Are Diffusion Relighting Models Ready for Capture Stage Compositing? A Hybrid Alternative to Bridge the Gap | Proposes a hybrid relighting framework to address unstable relighting of volumetric video. | optical flow
34 | UrbanIng-V2X: A Large-Scale Multi-Vehicle, Multi-Infrastructure Dataset Across Multiple Intersections for Cooperative Perception | UrbanIng-V2X: a large-scale multi-vehicle, multi-infrastructure dataset across multiple intersections for cooperative perception. | scene understanding

🔬 Pillar 8: Physics-based Animation (1 paper)

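# | Title | One-line summary | Tags
35 | Positional Preservation Embedding for Multimodal Large Language Models | Proposes Positional Preservation Embedding (PPE), improving the efficiency and performance of visual token compression in multimodal large language models. | spatiotemporal, large language model, multimodal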

🔬 Pillar 6: Video Extraction & Matching (1 paper)

# | Title | One-line summary | Tags
36 | EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT | EgoThinker: uses spatio-temporal CoT to strengthen the egocentric reasoning of MLLMs. | egocentric, large language model, multimodal

🔬 Pillar 5: Interaction & Reaction (1 paper)

# | Title | One-line summary | Tags
37 | DecoDINO: 3D Human-Scene Contact Prediction with Semantic Classification | Proposes DecoDINO for predicting human-scene contact. | human-object interaction

🔬 Pillar 4: Generative Motion (1 paper)

# | Title | One-line summary | Tags
38 | VoMP: Predicting Volumetric Mechanical Property Fields | Proposes VoMP for predicting the volumetric mechanical properties of 3D objects. | physically plausible
