cs.CV (2024-12-11)

📊 21 papers in total | 🔗 2 with code

🎯 Navigate by interest area

Pillar 9: Embodied Foundation Models (11) · Pillar 3: Perception & Semantics (5 🔗1) · Pillar 2: RL & Architecture (2) · Pillar 4: Generative Motion (2 🔗1) · Pillar 1: Robot Control (1)

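🔬 Pillar 9: Embodied Foundation Models (11 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions	Euclid: strengthens the geometric perception of multimodal LLMs using high-quality synthetic visual descriptions.	large language model, multimodal
2	CogNav: Cognitive Process Modeling for Object Goal Navigation with LLMs	CogNav: models the cognitive process with LLMs, significantly improving performance on the ObjectNav task.	embodied AI, large language model, foundation model
3	Multimodal Approaches to Fair Image Classification: An Ethical Perspective	Proposes a multimodal fusion approach that improves fairness in image classification and mitigates demographic bias.	multimodal
4	Illusory VQA: Benchmarking and Enhancing Multimodal Models on Visual Illusions	Proposes Illusory VQA for evaluating and improving multimodal models on visual illusions.	multimodal
5	LLaVA-Zip: Adaptive Visual Token Compression with Intrinsic Image Information	LLaVA-Zip: adaptive visual token compression using intrinsic image information, improving multi-image and video processing.	large language model, instruction following
6	Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel	Proposes the Self-Refining Data Flywheel (SRDF) for language-guided navigation learning, with performance exceeding human level.	embodied AI, VLN
7	FILA: Fine-Grained Vision Language Models	FILA proposes HyViLM, which uses a hybrid encoder and feature fusion to improve vision-language model performance on high-resolution images.	large language model, multimodal
8	Doubly-Universal Adversarial Perturbations: Deceiving Vision-Language Models Across Both Images and Text with a Single Perturbation	Proposes doubly-universal adversarial perturbations that deceive vision-language models across both images and text.	large language model, multimodal
9	RoomTour3D: Geometry-Aware Video-Instruction Tuning for Embodied Navigation	RoomTour3D: geometry-aware video-instruction tuning for embodied navigation.	VLN
10	StreamChat: Chatting with Streaming Video	StreamChat: improves LMM interaction with streaming video by updating the visual context during decoding.	multimodal
11	Position-aware Guided Point Cloud Completion with CLIP Model	Proposes a position-aware guided point cloud completion method that uses the CLIP model to improve completion quality.	multimodal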

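🔬 Pillar 3: Perception & Semantics (5 papers)

#	Title	One-line summary	Tags	🔗	⭐
12	EOV-Seg: Efficient Open-Vocabulary Panoptic Segmentation	Proposes EOV-Seg, an efficient open-vocabulary panoptic segmentation framework that significantly speeds up inference.	open-vocabulary, open vocabulary	✅
13	BLADE: Single-view Body Mesh Learning through Accurate Depth Estimation	BLADE: single-view body mesh learning through accurate depth estimation, improving 3D pose and 2D alignment accuracy on close-range images.	depth estimation, human mesh recovery
14	Dense Depth from Event Focal Stack	Proposes a depth estimation method based on event-camera focal stacks, addressing the depth perception problems of conventional methods in dynamic scenes.	depth estimation
15	Utilizing Multi-step Loss for Single Image Reflection Removal	Proposes a multi-step loss training method that, combined with RefGAN-synthesized data, effectively improves single-image reflection removal.	depth estimation
16	Physics Based Differentiable Rendering for Inverse Problems and Beyond	Surveys physics-based differentiable rendering techniques for solving inverse problems and extending to further applications.	scene reconstruction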

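🔬 Pillar 2: RL & Architecture (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
17	Multi-level Matching Network for Multimodal Entity Linking	Proposes the multi-level matching network M3EL to address insufficient cross-modal interaction in multimodal entity linking.	representation learning, contrastive learning, multimodal
18	Visual Program Distillation with Template-Based Augmentation	Proposes visual program distillation with template-based augmentation, lowering the cost of code generation for visual tasks.	distillation, large language model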

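🔬 Pillar 4: Generative Motion (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
19	GMem: A Modular Approach for Ultra-Efficient Generative Models	GMem: a modular approach for ultra-efficient generative models that significantly improves training and sampling efficiency.	classifier-free guidance
20	ChatDyn: Language-Driven Multi-Actor Dynamics Generation in Street Scenes	ChatDyn: proposes a language-instruction-driven system for generating multi-agent dynamics in street scenes.	physically plausible	✅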

🔬 Pillar 1: Robot Control (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
21	Physics Context Builders: A Modular Framework for Physical Reasoning in Vision-Language Models	Proposes Physics Context Builders (PCBs), which improve vision-language model performance on physical reasoning tasks.	sim2real
