cs.CV (2025-05-18)
📊 12 papers in total | 🔗 5 with code
🎯 Interest Area Navigation
Pillar 9: Embodied Foundation Models (7 🔗3)
Pillar 3: Perception & Semantics (3 🔗1)
Pillar 2: RL & Architecture (2 🔗1)
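🔬 Pillar 9: Embodied Foundation Models (7 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 1 | LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images? | Proposes the LogicOCR benchmark to evaluate the logical reasoning ability of large multimodal models on text-rich images. | multimodal chain-of-thought | ✅ | |
| 2 | KGAlign: Joint Semantic-Structural Knowledge Encoding for Multimodal Fake News Detection | KGAlign: a multimodal fake news detection method that fuses semantic and structural knowledge. | multimodal | ✅ | |
| 3 | MMS-VPR: Multimodal Street-Level Visual Place Recognition Dataset and Benchmark | MMS-VPR: a multimodal street-level visual place recognition dataset and benchmark that fills the gap in non-Western urban scenes. | multimodal | ✅ | |
| 4 | SMFusion: Semantic-Preserving Fusion of Multimodal Medical Images for Enhanced Clinical Diagnosis | Proposes SMFusion, which uses semantic information to fuse multimodal medical images for improved clinical diagnosis. | multimodal | | |
| 5 | Towards Visuospatial Cognition via Hierarchical Fusion of Visual Experts | ViCA2: enhances visuospatial cognition in multimodal large language models through hierarchical fusion of visual experts. | large language model multimodal | | |
| 6 | Visuospatial Cognitive Assistant | Proposes the ViCA-322K dataset and the ViCA-7B model to improve embodied AI performance on video-based spatial cognition tasks. | embodied AI | | |
| 7 | From Shots to Stories: LLM-Assisted Video Editing with Unified Language Representations | Proposes L-Storyboard, which uses LLMs for video editing and bridges the gap between visual information and language reasoning. | large language model | | |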
🔬 Pillar 3: Perception & Semantics (3 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 8 | LLaVA-4D: Embedding SpatioTemporal Prompt into LMMs for 4D Scene Understanding | LLaVA-4D: embeds spatiotemporal prompts into LMMs for 4D scene understanding. | scene understanding spatiotemporal multimodal | | |
| 9 | Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind | Proposes AgroMind, an agricultural remote sensing benchmark that evaluates large multimodal models and reveals their limitations in agricultural scene understanding. | scene understanding multimodal | ✅ | |
| 10 | VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold | VGGT-SLAM: a dense RGB SLAM system optimized on the SL(4) manifold. | scene reconstruction VGGT | | |
🔬 Pillar 2: RL & Architecture (2 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 11 | PRETI: Patient-Aware Retinal Foundation Model via Metadata-Guided Representation Learning | PRETI: builds a patient-aware retinal foundation model through metadata-guided representation learning. | representation learning foundation model | ✅ | |
| 12 | VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning | Proposes VideoRFT, which improves the video reasoning capability of MLLMs through reinforced fine-tuning. | reinforcement learning large language model chain-of-thought | | |