cs.CV (2025-06-21)
📊 14 papers total | 🔗 2 with code
🎯 Interest Area Navigation
Pillar 9: Embodied Foundation Models (6)
Pillar 2: RL Algorithms & Architecture (2)
Pillar 3: Spatial Perception & Semantics (2 🔗1)
Pillar 6: Video Extraction & Matching (2 🔗1)
Pillar 1: Robot Control (1)
Pillar 4: Generative Motion (1)
🔬 Pillar 9: Embodied Foundation Models (6 papers)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 1 | Can Generated Images Serve as a Viable Modality for Text-Centric Multimodal Learning? | Studies the use of text-generated images to augment text classification, exploring whether synthetic perception is viable in multimodal learning. | large language model, multimodal | | |
| 2 | HIRE: Lightweight High-Resolution Image Feature Enrichment for Multimodal LLMs | HIRE: lightweight high-resolution image feature enrichment that improves multimodal LLM performance. | large language model, multimodal | | |
| 3 | Histopathology Image Report Generation by Vision Language Model with Multimodal In-Context Learning | Proposes the PathGenIC framework, which uses multimodal in-context learning to generate histopathology image reports and significantly improves report quality. | multimodal | | |
| 4 | JarvisArt: Liberating Human Artistic Creativity via an Intelligent Photo Retouching Agent | Proposes JarvisArt to lower the barrier to using traditional photo retouching tools. | large language model, instruction following, chain-of-thought | | |
| 5 | A Multimodal In Vitro Diagnostic Method for Parkinson's Disease Combining Facial Expressions and Behavioral Gait Data | Proposes a multimodal in vitro diagnostic method for Parkinson's disease that fuses facial expression and gait data. | multimodal | | |
| 6 | DreamJourney: Perpetual View Generation with Video Diffusion Models | DreamJourney: uses video diffusion models for perpetual view generation with dynamic objects. | large language model, multimodal | | |
🔬 Pillar 2: RL Algorithms & Architecture (2 papers)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 7 | Scene-R1: Video-Grounded Large Language Models for 3D Scene Reasoning without 3D Annotations | Scene-R1: video-grounded large language models that perform 3D scene reasoning without 3D annotations. | reinforcement learning, scene understanding, open-vocabulary | | |
| 8 | Pixel-Optimization-Free Patch Attack on Stereo Depth Estimation | Proposes PatchHunter, an adversarial attack on stereo depth estimation that requires no pixel optimization. | reinforcement learning, depth estimation, stereo depth | | |
🔬 Pillar 3: Spatial Perception & Semantics (2 papers)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 9 | Programmable-Room: Interactive Textured 3D Room Meshes Generation Empowered by Large Language Models | Programmable-Room: an LLM-based framework for interactive generation of textured 3D room meshes. | semantic map, large language model | ✅ | |
| 10 | DRAMA-X: A Fine-grained Intent Prediction and Risk Reasoning Benchmark For Driving | DRAMA-X: a benchmark for fine-grained intent prediction and risk reasoning in driving scenarios. | open-vocabulary, open vocabulary, large language model | | |
🔬 Pillar 6: Video Extraction & Matching (2 papers)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 11 | Domain Generalization using Action Sequences for Egocentric Action Recognition | Proposes SeqDG, which uses action sequences to improve domain generalization in egocentric action recognition. | egocentric, egocentric vision, first-person view | | |
| 12 | CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning | Proposes CLiViS, which addresses long-horizon dependencies in embodied visual reasoning through a cognitive map built on linguistic-visual synergy. | egocentric, spatiotemporal, large language model | ✅ | |
🔬 Pillar 1: Robot Control (1 paper)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 13 | VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models | VLA-OS: a unified vision-language-action model architecture for systematically analyzing the impact of planning paradigms and representations. | manipulation, dexterous hand, vision-language-action | | |
🔬 Pillar 4: Generative Motion (1 paper)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 14 | PhysID: Physics-based Interactive Dynamics from a Single-view Image | PhysID: a method for generating physics-based interactive dynamics from a single-view image. | physically plausible, large language model, multimodal | | |