cs.CV (2024-07-05)

📊 23 papers in total | 🔗 3 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (10 🔗2) Pillar 3: Perception & Semantics (8 🔗1) Pillar 2: RL & Architecture (4) Pillar 7: Motion Retargeting (1)

🔬 Pillar 9: Embodied Foundation Models (10 papers)

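#	Title	Key point	Tags	🔗	⭐
1	Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge	Proposes a visual prompting method based on external knowledge that improves multimodal large language models' understanding of fine-grained visual information.	large language model multimodal
2	MobileFlow: A Multimodal LLM For Mobile GUI Agent	MobileFlow: a multimodal large language model for mobile GUI agents that improves Chinese GUI understanding and interaction.	large language model multimodal
3	Elevating All Zero-Shot Sketch-Based Image Retrieval Through Multimodal Prompt Learning	Proposes SpLIP, which improves zero-shot sketch-based image retrieval through multimodal prompt learning.	foundation model multimodal	✅
4	MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?	MJ-Bench: evaluates how well multimodal reward models judge text-to-image generation.	multimodal	✅
5	VCoME: Verbal Video Composition with Multimodal Editing Effects	VCoME: proposes a framework for automatic verbal video composition with multimodal editing effects, improving video clarity and visual appeal.	multimodal
6	Robust Multimodal Learning via Representation Decoupling	Proposes DMRNet, which achieves robust multimodal learning by decoupling multimodal representations.	multimodal
7	Second Place Solution of WSDM2023 Toloka Visual Question Answering Challenge	Proposes a three-stage, OFA-based visual question answering solution that placed second in the WSDM2023 Toloka VQA Challenge.	multimodal visual grounding
8	AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation	Proposes the AWT framework, which improves the transferability of vision-language models through augmentation, weighting, and transportation.	multimodal
9	Dude: Dual Distribution-Aware Context Prompt Learning For Large Vision-Language Model	Proposes Dude, a dual distribution-aware context prompt learning framework that improves large vision-language models on fine-grained classification tasks.	large language model
10	Towards Context-aware Support for Color Vision Deficiency: An Approach Integrating LLM and AR	Proposes a context-aware assistive system for color vision deficiency that combines LLM and AR.	large language model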

🔬 Pillar 3: Perception & Semantics (8 papers)

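#	Title	Key point	Tags	🔗	⭐
11	Unsupervised 4D Cardiac Motion Tracking with Spatiotemporal Optical Flow Networks	Proposes an unsupervised 4D cardiac motion tracking method based on spatiotemporal optical flow networks, improving the accuracy of echocardiography analysis.	optical flow spatiotemporal motion tracking
12	ZARRIO @ Ego4D Short Term Object Interaction Anticipation Challenge: Leveraging Affordances and Attention-based models for STA	Proposes STAformer, which fuses environmental awareness with attention mechanisms to improve short-term object interaction anticipation on Ego4D.	affordance egocentric Ego4D
13	GSD: View-Guided Gaussian Splatting Diffusion for 3D Reconstruction	GSD: single-view 3D reconstruction based on a Gaussian splatting diffusion model.	gaussian splatting splatting
14	CountGD: Multi-Modal Open-World Counting	Proposes CountGD, a multimodal open-world counting model with improved generality and accuracy.	open-vocabulary open vocabulary foundation model
15	Hybrid Primal Sketch: Combining Analogy, Qualitative Representations, and Computer Vision for Scene Understanding	Proposes the Hybrid Primal Sketch framework, which combines computer vision with cognitive models for scene understanding.	scene understanding
16	Segment Any 4D Gaussians	Proposes the SA4D framework, which segments arbitrary objects in 4D Gaussian models.	3D gaussian splatting gaussian splatting splatting	✅
17	A Physical Model-Guided Framework for Underwater Image Enhancement and Depth Estimation	Proposes a physical model-guided framework for underwater image enhancement and depth estimation.	depth estimation
18	Gaussian Eigen Models for Human Heads	Proposes Gaussian Eigen Models (GEM) for creating lightweight, high-quality, easily controllable human head avatars.	gaussian splatting splatting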

🔬 Pillar 2: RL & Architecture (4 papers)

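#	Title	Key point	Tags	🔗	⭐
19	Self-Supervised Representation Learning for Adversarial Attack Detection	Proposes a self-supervised representation learning framework that improves the generalization of adversarial attack detection.	representation learning
20	Fine-grained Context and Multi-modal Alignment for Freehand 3D Ultrasound Reconstruction	Proposes ReMamba, combined with multimodal alignment, for freehand 3D ultrasound reconstruction.	Mamba SSM state space model
21	AMD: Automatic Multi-step Distillation of Large-scale Vision Models	Proposes AMD, an automatic multi-step distillation method for compressing large-scale vision models.	distillation
22	MARS: Paying more attention to visual attributes for text-based person search	MARS: improves text-based person search by paying more attention to visual attributes.	masked autoencoder MAE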

🔬 Pillar 7: Motion Retargeting (1 paper)

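#	Title	Key point	Tags	🔗	⭐
23	Neural varifolds: an aggregate representation for quantifying the geometry of point clouds	Proposes a neural varifold representation for quantifying point cloud geometry, improving shape matching and few-shot classification.	geometric consistency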
