cs.CV (2025-06-21)

📊 14 papers in total | 🔗 2 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (6) · Pillar 2: RL & Architecture (2) · Pillar 3: Perception & Semantics (2 🔗1) · Pillar 6: Video Extraction (2 🔗1) · Pillar 1: Robot Control (1) · Pillar 4: Generative Motion (1)

🔬 Pillar 9: Embodied Foundation Models (6 papers)

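#	Title	One-line summary	Tags	🔗	⭐
1	Can Generated Images Serve as a Viable Modality for Text-Centric Multimodal Learning?	Studies the use of text-generated images to augment text classification, exploring the viability of synthetic perception in multimodal learning.	large language model, multimodal
2	HIRE: Lightweight High-Resolution Image Feature Enrichment for Multimodal LLMs	HIRE: lightweight high-resolution image feature enrichment that improves multimodal LLM performance.	large language model, multimodal
3	Histopathology Image Report Generation by Vision Language Model with Multimodal In-Context Learning	Proposes the PathGenIC framework, which uses multimodal in-context learning to generate histopathology image reports, significantly improving report quality.	multimodal
4	JarvisArt: Liberating Human Artistic Creativity via an Intelligent Photo Retouching Agent	Proposes JarvisArt to lower the barrier to entry of traditional photo retouching tools.	large language model, instruction following, chain-of-thought
5	A Multimodal In Vitro Diagnostic Method for Parkinson's Disease Combining Facial Expressions and Behavioral Gait Data	Proposes a multimodal in vitro diagnostic method for Parkinson's disease that fuses facial expression and gait data.	multimodal
6	DreamJourney: Perpetual View Generation with Video Diffusion Models	DreamJourney: perpetual view generation with dynamic objects using video diffusion models.	large language model, multimodal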

🔬 Pillar 2: RL & Architecture (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
7	Scene-R1: Video-Grounded Large Language Models for 3D Scene Reasoning without 3D Annotations	Scene-R1: video-grounded large language models for 3D scene reasoning without 3D annotations.	reinforcement learning, scene understanding, open-vocabulary
8	Pixel-Optimization-Free Patch Attack on Stereo Depth Estimation	Proposes PatchHunter, a pixel-optimization-free adversarial attack on stereo depth estimation.	reinforcement learning, depth estimation, stereo depth

🔬 Pillar 3: Perception & Semantics (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
9	Programmable-Room: Interactive Textured 3D Room Meshes Generation Empowered by Large Language Models	Programmable-Room: an LLM-based framework for interactive generation of textured 3D room meshes.	semantic map, large language model	✅
10	DRAMA-X: A Fine-grained Intent Prediction and Risk Reasoning Benchmark For Driving	DRAMA-X: a benchmark for fine-grained intent prediction and risk reasoning in driving scenarios.	open-vocabulary, open vocabulary, large language model

🔬 Pillar 6: Video Extraction (2 papers)

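#	Title	One-line summary	Tags	🔗	⭐
11	Domain Generalization using Action Sequences for Egocentric Action Recognition	Proposes SeqDG, which uses action sequences to improve domain generalization in egocentric action recognition.	egocentric, egocentric vision, first-person view
12	CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning	Proposes CLiViS, which addresses long-term dependencies in embodied visual reasoning through a cognitive map built on linguistic-visual synergy.	egocentric, spatiotemporal, large language model	✅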

🔬 Pillar 1: Robot Control (1 paper)

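#	Title	One-line summary	Tags	🔗	⭐
13	VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models	VLA-OS: a unified vision-language-action model architecture for systematically analyzing the impact of planning paradigms and representations.	manipulation, dexterous hand, vision-language-action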

🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
14	PhysID: Physics-based Interactive Dynamics from a Single-view Image	PhysID: a method for generating physics-based interactive dynamics from a single-view image.	physically plausible, large language model, multimodal
