cs.CV (2025-03-17)
📊 6 papers total | 🔗 3 with code
🎯 Interest Area Navigation
Pillar 9: Embodied Foundation Models (3 🔗1)
Pillar 3: Perception & Semantics (2 🔗1)
Pillar 2: RL & Architecture (1 🔗1)
🔬 Pillar 9: Embodied Foundation Models (3 papers)
| # | Title | One-Line Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 1 | MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research | Proposes MicroVQA, a benchmark dataset for evaluating multimodal reasoning on microscopy images. | large language model, multimodal, chain-of-thought | ✅ | |
| 2 | Exploring 3D Reasoning-Driven Planning: From Implicit Human Intentions to Route-Aware Activity Planning | Proposes 3D reasoning-driven planning to address implicit intention understanding and route planning in embodied AI. | embodied AI, multimodal | | |
| 3 | Cream of the Crop: Harvesting Rich, Scalable and Transferable Multi-Modal Data for Instruction Fine-Tuning | Proposes mmSSR, a method for efficient, scalable selection of high-quality, diverse multimodal instruction fine-tuning data. | large language model | | |
🔬 Pillar 3: Perception & Semantics (2 papers)
| # | Title | One-Line Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 4 | MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs | MM-Spatial: exploring 3D spatial understanding in multimodal LLMs. | depth estimation, monocular depth, metric depth | | |
| 5 | NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models | Proposes the NuPlanQA dataset and the BEV-LLM model to improve driving-scene understanding in multimodal large language models. | scene understanding, large language model | ✅ | |
🔬 Pillar 2: RL & Architecture (1 paper)
| # | Title | One-Line Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 6 | DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding | DeepPerception: improving MLLMs' cognitive visual perception for knowledge-intensive visual grounding. | reinforcement learning, large language model, multimodal | ✅ | |