cs.CV (2025-05-18)

📊 12 papers in total | 🔗 5 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (7 🔗3) · Pillar 3: Perception & Semantics (3 🔗1) · Pillar 2: RL & Architecture (2 🔗1)

🔬 Pillar 9: Embodied Foundation Models (7 papers)

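| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 1 | LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images? | Proposes LogicOCR to address logical reasoning by multimodal models on text-rich images | multimodal, chain-of-thought |
| 2 | KGAlign: Joint Semantic-Structural Knowledge Encoding for Multimodal Fake News Detection | Proposes KGAlign to address knowledge encoding in multimodal fake news detection | multimodal |
| 3 | MMS-VPR: Multimodal Street-Level Visual Place Recognition Dataset and Benchmark | Proposes the MMS-VPR dataset to address the lack of multimodality in street-level visual place recognition | multimodal |
| 4 | SMFusion: Semantic-Preserving Fusion of Multimodal Medical Images for Enhanced Clinical Diagnosis | Proposes SMFusion to address the loss of semantic information in multimodal medical image fusion | multimodal |
| 5 | Towards Visuospatial Cognition via Hierarchical Fusion of Visual Experts | Proposes ViCA2 to address visuospatial cognition | large language model, multimodal |
| 6 | Visuospatial Cognitive Assistant | Proposes ViCA to address challenges in video-based spatial cognition | embodied AI |
| 7 | From Shots to Stories: LLM-Assisted Video Editing with Unified Language Representations | Proposes L-Storyboard to address the fusion of linguistic and visual information in video editing | large language model |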

🔬 Pillar 3: Perception & Semantics (3 papers)

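| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 8 | LLaVA-4D: Embedding SpatioTemporal Prompt into LMMs for 4D Scene Understanding | Proposes LLaVA-4D to address spatiotemporal representation in dynamic scene understanding | scene understanding, spatiotemporal, multimodal |
| 9 | Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind | Proposes AgroMind to address the shortage of agricultural remote sensing benchmarks | scene understanding, multimodal |
| 10 | VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold | Proposes VGGT-SLAM to address dense RGB SLAM with uncalibrated monocular cameras | scene reconstruction, VGGT |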

🔬 Pillar 2: RL & Architecture (2 papers)

| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 11 | PRETI: Patient-Aware Retinal Foundation Model via Metadata-Guided Representation Learning | Proposes PRETI to address data dependence in retinal image analysis | representation learning, foundation model |
| 12 | VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning | Proposes VideoRFT to address insufficient video reasoning capability | reinforcement learning, large language model, chain-of-thought |
