cs.CV (2024-08-05)

📊 19 papers in total | 🔗 3 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (8) · Pillar 3: Perception & Semantics (4 🔗1) · Pillar 2: RL & Architecture (4 🔗1) · Pillar 5: Interaction & Reaction (1 🔗1) · Pillar 7: Motion Retargeting (1) · Pillar 1: Robot Control (1)

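🔬 Pillar 9: Embodied Foundation Models (8 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
1	MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models	Proposes the MMIU benchmark for evaluating the multi-image understanding ability of large vision-language models.	multimodal
2	Modelling Visual Semantics via Image Captioning to extract Enhanced Multi-Level Cross-Modal Semantic Incongruity Representation with Attention for Multimodal Sarcasm Detection	Proposes an image-captioning-enhanced, multi-level cross-modal semantic incongruity representation method for multimodal sarcasm detection.	multimodal
3	Target-Dependent Multimodal Sentiment Analysis Via Employing Visual-to Emotional-Caption Translation Network using Visual-Caption Pairs	Proposes the VECTN model, which enhances target-dependent multimodal sentiment analysis through visual-to-emotional-caption translation.	multimodal
4	Geometric Algebra Meets Large Language Models: Instruction-Based Transformations of Separate Meshes in 3D, Interactive and Controllable Scenes	Proposes Shenlong, which combines an LLM with CGA to achieve precise, controllable object repositioning in interactive 3D scenes.	large language model
5	Fairness and Bias Mitigation in Computer Vision: A Survey	A survey of fairness and bias mitigation in computer vision: summarizes existing methods and outlines future trends.	multimodal
6	Infusing Environmental Captions for Long-Form Video Language Grounding	Proposes EI-VLG, which uses environmental captions to enhance long-form video language grounding and effectively exclude irrelevant frames.	large language model
7	Evaluating Vision-Language Models for Zero-Shot Detection, Classification, and Association of Motorcycles, Passengers, and Helmets	Uses OWLv2 for zero-shot detection of motorcycles, passengers, and helmet use, in support of traffic safety.	foundation model
8	ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning	ExoViP: uses exoskeleton modules for step-by-step verification and exploration to improve compositional visual reasoning.	large language model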

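🔬 Pillar 3: Perception & Semantics (4 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
9	Explain via Any Concept: Concept Bottleneck Model with Open Vocabulary Concepts	Proposes OpenCBM, which improves the interpretability of concept bottleneck models through open-vocabulary concepts.	open-vocabulary open vocabulary
10	Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining	Lumina-mGPT: a flexible, photorealistic text-to-image generation model based on multimodal pretraining.	depth estimation multimodal	✅
11	Latent-INR: A Flexible Framework for Implicit Representations of Videos with Discriminative Semantics	Latent-INR: a flexible framework for implicit video representations with discriminative semantics.	implicit representation
12	Gaussian Mixture based Evidential Learning for Stereo Matching	Proposes a Gaussian-mixture-based evidential learning method for stereo matching that improves depth estimation accuracy and cross-domain generalization.	depth estimation scene flow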

🔬 Pillar 2: RL & Architecture (4 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
13	Contrastive Learning-based Multi Modal Architecture for Emoticon Prediction by Employing Image-Text Pairs	Proposes a contrastive-learning-based multimodal architecture for emoticon prediction from image-text pairs.	contrastive learning multimodal
14	A Two-Stage Progressive Pre-training using Multi-Modal Contrastive Masked Autoencoders	Proposes a two-stage progressive pre-training method using multimodal contrastive masked autoencoders for RGB-D image understanding.	masked autoencoder contrastive learning distillation
15	LaMamba-Diff: Linear-Time High-Fidelity Diffusion Models Based on Local Attention and Mamba	Proposes LaMamba-Diff, which combines local attention with Mamba to obtain a high-fidelity diffusion model with linear complexity.	Mamba state space model	✅
16	CMR-Agent: Learning a Cross-Modal Agent for Iterative Image-to-Point Cloud Registration	Proposes CMR-Agent for iterative cross-modal image-to-point-cloud registration, improving accuracy and efficiency.	reinforcement learning imitation learning

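🔬 Pillar 5: Interaction & Reaction (1 paper)

#	Title	One-Sentence Summary	Tags	🔗	⭐
17	Exploring Conditional Multi-Modal Prompts for Zero-shot HOI Detection	Proposes conditional multi-modal prompts (CMMP) for zero-shot human-object interaction detection, improving generalization.	human-object interaction HOI foundation model	✅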

🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-Sentence Summary	Tags	🔗	⭐
18	REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models	The REVISION framework uses rendering tools to improve spatial fidelity in vision-language models.	spatial relationship large language model multimodal

🔬 Pillar 1: Robot Control (1 paper)

#	Title	One-Sentence Summary	Tags	🔗	⭐
19	Mixture-of-Noises Enhanced Forgery-Aware Predictor for Multi-Face Manipulation Detection and Localization	Proposes the MoNFAP framework, which strengthens detection and localization of multi-face forgeries.	manipulation
