cs.CV (2025-05-18)

📊 12 papers in total | 🔗 5 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (7 🔗3) Pillar 3: Spatial Perception & Semantics (3 🔗1) Pillar 2: RL Algorithms & Architecture (2 🔗1)

🔬 Pillar 9: Embodied Foundation Models (7 papers)

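#	Title	One-Sentence Summary	Tags	🔗	⭐
1	LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images?	Proposes the LogicOCR benchmark to evaluate the logical reasoning ability of large multimodal models on text-rich images	multimodal chain-of-thought	✅
2	KGAlign: Joint Semantic-Structural Knowledge Encoding for Multimodal Fake News Detection	KGAlign: a multimodal fake news detection method that fuses semantic and structural knowledge	multimodal	✅
3	MMS-VPR: Multimodal Street-Level Visual Place Recognition Dataset and Benchmark	MMS-VPR: a multimodal street-level visual place recognition dataset and benchmark that fills the gap in non-Western urban scenes	multimodal	✅
4	SMFusion: Semantic-Preserving Fusion of Multimodal Medical Images for Enhanced Clinical Diagnosis	Proposes SMFusion, which uses semantic information to fuse multimodal medical images for improved clinical diagnosis	multimodal
5	Towards Visuospatial Cognition via Hierarchical Fusion of Visual Experts	ViCA2: enhances visuospatial cognition in multimodal large language models through hierarchical fusion of visual experts	large language model multimodal
6	Visuospatial Cognitive Assistant	Proposes the ViCA-322K dataset and the ViCA-7B model to improve embodied AI performance on video-based spatial cognition tasks	embodied AI
7	From Shots to Stories: LLM-Assisted Video Editing with Unified Language Representations	Proposes L-Storyboard, which uses LLMs for video editing and bridges the gap between visual information and language reasoning	large language model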

🔬 Pillar 3: Spatial Perception & Semantics (3 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
8	LLaVA-4D: Embedding SpatioTemporal Prompt into LMMs for 4D Scene Understanding	LLaVA-4D: embeds spatiotemporal prompts into LMMs for 4D scene understanding	scene understanding spatiotemporal multimodal
9	Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind	Proposes AgroMind, an agricultural remote sensing benchmark that evaluates and reveals the limitations of large multimodal models in agricultural scene understanding	scene understanding multimodal	✅
10	VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold	VGGT-SLAM: a dense RGB SLAM system optimized on the SL(4) manifold	scene reconstruction VGGT

🔬 Pillar 2: RL Algorithms & Architecture (2 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
11	PRETI: Patient-Aware Retinal Foundation Model via Metadata-Guided Representation Learning	PRETI: builds a patient-aware retinal foundation model through metadata-guided representation learning	representation learning foundation model	✅
12	VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning	Proposes VideoRFT, which improves the video reasoning ability of MLLMs through reinforced fine-tuning	reinforcement learning large language model chain-of-thought
