cs.CV (2025-06-21)

📊 14 papers in total | 🔗 2 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (6) · Pillar 2: RL & Architecture (2) · Pillar 3: Perception & Semantics (2, 🔗 1) · Pillar 6: Video Extraction (2, 🔗 1) · Pillar 1: Robot Control (1) · Pillar 4: Generative Motion (1)

🔬 Pillar 9: Embodied Foundation Models (6 papers)

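| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | Can Generated Images Serve as a Viable Modality for Text-Centric Multimodal Learning? | Studies using text-to-image generation to augment text classification, exploring whether synthetic perception is viable in multimodal learning. | large language model, multimodal | |
| 2 | HIRE: Lightweight High-Resolution Image Feature Enrichment for Multimodal LLMs | HIRE: lightweight high-resolution image feature enrichment that improves multimodal LLM performance. | large language model, multimodal | |
| 3 | Histopathology Image Report Generation by Vision Language Model with Multimodal In-Context Learning | Proposes the PathGenIC framework, which uses multimodal in-context learning to generate histopathology image reports and significantly improves report quality. | multimodal | |
| 4 | JarvisArt: Liberating Human Artistic Creativity via an Intelligent Photo Retouching Agent | Proposes JarvisArt to address the usability barrier of traditional photo retouching tools. | large language model, instruction following, chain-of-thought | |
| 5 | A Multimodal In Vitro Diagnostic Method for Parkinson's Disease Combining Facial Expressions and Behavioral Gait Data | Proposes a multimodal in vitro diagnostic method for Parkinson's disease that fuses facial expression and gait data. | multimodal | |
| 6 | DreamJourney: Perpetual View Generation with Video Diffusion Models | DreamJourney: perpetual view generation with dynamic objects using video diffusion models. | large language model, multimodal | |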

🔬 Pillar 2: RL & Architecture (2 papers)

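| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 7 | Scene-R1: Video-Grounded Large Language Models for 3D Scene Reasoning without 3D Annotations | Scene-R1: a video-grounded large language model that performs 3D scene reasoning without 3D annotations. | reinforcement learning, scene understanding, open-vocabulary | |
| 8 | Pixel-Optimization-Free Patch Attack on Stereo Depth Estimation | Proposes PatchHunter, an adversarial attack on stereo depth estimation that requires no pixel optimization. | reinforcement learning, depth estimation, stereo depth | |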

🔬 Pillar 3: Perception & Semantics (2 papers)

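| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 9 | Programmable-Room: Interactive Textured 3D Room Meshes Generation Empowered by Large Language Models | Programmable-Room: an LLM-based framework for interactive generation of textured 3D room meshes. | semantic map, large language model | |
| 10 | DRAMA-X: A Fine-grained Intent Prediction and Risk Reasoning Benchmark For Driving | DRAMA-X: a benchmark for fine-grained intent prediction and risk reasoning in driving scenes. | open-vocabulary, open vocabulary, large language model | |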

🔬 Pillar 6: Video Extraction (2 papers)

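| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 11 | Domain Generalization using Action Sequences for Egocentric Action Recognition | Proposes SeqDG, which uses action sequences to improve domain generalization in egocentric action recognition. | egocentric, egocentric vision, first-person view | |
| 12 | CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning | Proposes CLiViS, which addresses long-horizon dependencies in embodied visual reasoning through a cognitive map built on linguistic-visual synergy. | egocentric, spatiotemporal, large language model | |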

🔬 Pillar 1: Robot Control (1 paper)

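| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 13 | VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models | VLA-OS: a unified vision-language-action model architecture for systematically analyzing the impact of planning paradigms and representations. | manipulation, dexterous hand, vision-language-action | |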

🔬 Pillar 4: Generative Motion (1 paper)

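| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 14 | PhysID: Physics-based Interactive Dynamics from a Single-view Image | PhysID: a method for generating physics-based interactive dynamics from a single-view image. | physically plausible, large language model, multimodal | |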
