cs.CV (2025-07-09)

📊 18 papers in total | 🔗 4 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (8 🔗2) · Pillar 2: RL & Architecture (6 🔗1) · Pillar 3: Perception & Semantics (3) · Pillar 4: Generative Motion (1 🔗1)

🔬 Pillar 9: Embodied Foundation Models (8 papers)

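| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | Towards Multimodal Understanding via Stable Diffusion as a Task-Aware Feature Extractor | Uses Stable Diffusion as a task-aware feature extractor to improve multimodal understanding. | large language model, multimodal | |
| 2 | Learning Deliberately, Acting Intuitively: Unlocking Test-Time Reasoning in Multimodal LLMs | Proposes the D2I framework to improve the flexibility and generalization of test-time reasoning in multimodal LLMs. | large language model, multimodal | |
| 3 | MK-Pose: Category-Level Object Pose Estimation via Multimodal-Based Keypoint Learning | Proposes MK-Pose to address category-level object pose estimation. | multimodal | |
| 4 | MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning | MagiC: proposes a comprehensive benchmark for evaluating multimodal cognition in grounded visual reasoning. | multimodal | |
| 5 | LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation | LinguaMark: proposes a multilingual multimodal bias evaluation benchmark that reveals fairness shortcomings in LMMs. | multimodal | |
| 6 | Integrating Pathology Foundation Models and Spatial Transcriptomics for Cellular Decomposition from Histology Images | Uses pathology foundation models and spatial transcriptomics to perform cellular decomposition from histology images. | foundation model | |
| 7 | FIFA: Unified Faithfulness Evaluation Framework for Text-to-Video and Video-to-Text Generation | Proposes the FIFA framework for unified evaluation of factual consistency in text-to-video and video-to-text generation. | large language model, multimodal | |
| 8 | DisenQ: Disentangling Q-Former for Activity-Biometrics | Proposes DisenQ, which performs activity-biometrics recognition by disentangling the Q-Former. | multimodal | |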

🔬 Pillar 2: RL & Architecture (6 papers)

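| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 9 | Robust Multimodal Large Language Models Against Modality Conflict | Proposes a robustness-improving method to address modality conflict in multimodal large language models. | reinforcement learning, large language model, multimodal | |
| 10 | Comprehensive Evaluation of Large Multimodal Models for Nutrition Analysis: A New Benchmark Enriched with Contextual Metadata | Proposes the ACETADA benchmark to evaluate large multimodal models augmented with contextual metadata on nutrition analysis. | MAE, multimodal, chain-of-thought | |
| 11 | Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning | Video-RTS: combines data-efficient reinforcement learning with adaptive test-time scaling to improve video reasoning. | reinforcement learning, large language model, chain-of-thought | |
| 12 | Entity Re-identification in Visual Storytelling via Contrastive Reinforcement Learning | Proposes a contrastive reinforcement learning method to improve the consistency of entity references in visual storytelling. | reinforcement learning, direct preference optimization, chain-of-thought | |
| 13 | Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models | Proposes the VLV auto-encoder, which distills knowledge from diffusion models to build high-quality vision-language models at low cost. | distillation, large language model | |
| 14 | MST-Distill: Mixture of Specialized Teachers for Cross-Modal Knowledge Distillation | Proposes MST-Distill, which uses a mixture of specialized teachers for cross-modal knowledge distillation. | distillation, multimodal | |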

🔬 Pillar 3: Perception & Semantics (3 papers)

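| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 15 | A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding | Proposes SpatialReasoner, which uses LLM-driven spatial reasoning to enhance open-vocabulary 3D visual grounding. | open-vocabulary, open vocabulary, embodied AI | |
| 16 | LangSplatV2: High-dimensional 3D Language Gaussian Splatting with 450+ FPS | LangSplatV2: achieves high-dimensional 3D language Gaussian splatting at 450+ FPS, accelerating open-vocabulary text queries. | gaussian splatting, splatting, open-vocabulary | |
| 17 | mmFlux: Crowd Flow Analytics with Commodity mmWave MIMO Radar | Proposes mmFlux, which uses mmWave radar for crowd flow analysis and semantic inference. | optical flow | |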

🔬 Pillar 4: Generative Motion (1 paper)

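| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 18 | Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data | Proposes the MotionMillion dataset and evaluation benchmark, enabling zero-shot generalization in text-to-motion generation. | text-to-motion, motion generation | |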
