cs.CV (2024-06-20)

📊 16 papers in total | 🔗 4 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (8 🔗3) · Pillar 2: RL & Architecture (4 🔗1) · Pillar 1: Robot Control (2) · Pillar 3: Perception & Semantics (1) · Pillar 6: Video Extraction (1)

🔬 Pillar 9: Embodied Foundation Models (8 papers)

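#	Title	Key Point	Tags	🔗	⭐
1	Using Multimodal Foundation Models and Clustering for Improved Style Ambiguity Loss	Proposes a style ambiguity loss built on multimodal foundation models and clustering to improve the creativity of text-to-image generative models.	foundation model, multimodal
2	The Use of Multimodal Large Language Models to Detect Objects from Thermal Images: Transportation Applications	Uses multimodal large language models to detect objects in thermal images, with applications to intelligent transportation systems.	large language model, multimodal
3	A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models	Survey of multimodal-guided image editing techniques built on text-to-image diffusion models.	multimodal	✅
4	HeartBeat: Towards Controllable Echocardiography Video Synthesis with Multimodal Conditions-Guided Diffusion Models	HeartBeat: a diffusion model guided by multimodal conditions for controllable echocardiography video synthesis.	multimodal
5	Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs	Prism: a framework that decouples and evaluates the capabilities of vision-language models, improving performance while reducing cost.	large language model, multimodal	✅
6	E-ANT: A Large-Scale Dataset for Efficient Automatic GUI NavigaTion	Introduces E-ANT, a large-scale Chinese GUI navigation dataset, to advance the use of multimodal large models on mobile devices.	large language model, multimodal
7	Towards Event-oriented Long Video Understanding	Proposes the Event-Bench benchmark and the VIM method to improve MLLMs' event-oriented long-video understanding.	large language model, multimodal	✅
8	From Descriptive Richness to Bias: Unveiling the Dark Side of Generative Image Caption Enrichment	Reveals the downsides of generative image caption enrichment: bias and hallucination.	large language model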

🔬 Pillar 2: RL & Architecture (4 papers)

#	Title	Key Point	Tags	🔗	⭐
9	Deblurring Neural Radiance Fields with Event-driven Bundle Adjustment	Proposes EBAD-NeRF, which uses event camera data to address NeRF reconstruction in motion-blurred scenes.	representation learning, NeRF, neural radiance field
10	Apprenticeship-Inspired Elegance: Synergistic Knowledge Distillation Empowers Spiking Neural Networks for Efficient Single-Eye Emotion Recognition	Proposes a synergistic learning framework based on knowledge distillation that makes SNN-based single-eye emotion recognition more efficient.	distillation, multimodal
11	Regularized Distribution Matching Distillation for One-step Unpaired Image-to-Image Translation	Proposes regularized distribution matching distillation for unpaired image-to-image translation.	distillation
12	Seg-LSTM: Performance of xLSTM for Semantic Segmentation of Remotely Sensed Images	Seg-LSTM: evaluates xLSTM for semantic segmentation of remote sensing images and analyzes its limitations.	Mamba, large language model	✅

🔬 Pillar 1: Robot Control (2 papers)

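#	Title	Key Point	Tags	🔗	⭐
13	Mitigating the Human-Robot Domain Discrepancy in Visual Pre-training for Robotic Manipulation	Proposes a human-robot contrastive alignment method that mitigates the domain discrepancy in visual pre-training for robots.	manipulation, visual pre-training, language conditioned
14	Invertible Consistency Distillation for Text-Guided Image Editing in Around 7 Steps	Proposes invertible consistency distillation (iCD), enabling text-guided image editing in only about 7 steps.	manipulation, distillation, classifier-free guidance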

🔬 Pillar 3: Perception & Semantics (1 paper)

#	Title	Key Point	Tags	🔗	⭐
15	CityNav: A Large-Scale Dataset for Real-World Aerial Navigation	CityNav: a large-scale dataset for real-world aerial navigation.	semantic map, spatial relationship, VLN

🔬 Pillar 6: Video Extraction (1 paper)

#	Title	Key Point	Tags	🔗	⭐
16	VLM Agents Generate Their Own Memories: Distilling Experience into Embodied Programs of Thought	ICAL: VLM agents generate high-quality experience through self-reflection, improving performance on embodied tasks.	Ego4D, instruction following
