cs.CV (2024-08-05)

📊 19 papers in total | 🔗 3 with code

🎯 Interest Area Navigation

- Pillar 9: Embodied Foundation Models (8)
- Pillar 3: Perception & Semantics (4 🔗1)
- Pillar 2: RL & Architecture (4 🔗1)
- Pillar 5: Interaction & Reaction (1 🔗1)
- Pillar 7: Motion Retargeting (1)
- Pillar 1: Robot Control (1)

🔬 Pillar 9: Embodied Foundation Models (8 papers)

| # | Title | One-sentence Takeaway | Tags | 🔗 |
|---|---|---|---|---|
| 1 | MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models | Proposes the MMIU benchmark for evaluating large vision-language models' multi-image understanding. | multimodal | |
| 2 | Modelling Visual Semantics via Image Captioning to extract Enhanced Multi-Level Cross-Modal Semantic Incongruity Representation with Attention for Multimodal Sarcasm Detection | Proposes an image-caption-enhanced, multi-level cross-modal semantic incongruity representation for multimodal sarcasm detection. | multimodal | |
| 3 | Target-Dependent Multimodal Sentiment Analysis Via Employing Visual-to Emotional-Caption Translation Network using Visual-Caption Pairs | Proposes VECTN, which strengthens target-dependent multimodal sentiment analysis via visual-to-emotional-caption translation. | multimodal | |
| 4 | Geometric Algebra Meets Large Language Models: Instruction-Based Transformations of Separate Meshes in 3D, Interactive and Controllable Scenes | Proposes Shenlong, combining LLMs with CGA for precise, controllable object repositioning in interactive 3D scenes. | large language model | |
| 5 | Fairness and Bias Mitigation in Computer Vision: A Survey | Surveys fairness and bias mitigation in computer vision, summarizing existing methods and outlining future directions. | multimodal | |
| 6 | Infusing Environmental Captions for Long-Form Video Language Grounding | Proposes EI-VLG, which uses environmental captions to enhance long-form video language grounding by filtering out irrelevant frames. | large language model | |
| 7 | Evaluating Vision-Language Models for Zero-Shot Detection, Classification, and Association of Motorcycles, Passengers, and Helmets | Uses OWLv2 for zero-shot detection of motorcycles, passengers, and helmet use, supporting traffic safety. | foundation model | |
| 8 | ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning | ExoViP improves compositional visual reasoning through step-by-step verification and exploration with exoskeleton modules. | large language model | |

🔬 Pillar 3: Perception & Semantics (4 papers)

| # | Title | One-sentence Takeaway | Tags | 🔗 |
|---|---|---|---|---|
| 9 | Explain via Any Concept: Concept Bottleneck Model with Open Vocabulary Concepts | Proposes OpenCBM, improving the interpretability of concept bottleneck models with open-vocabulary concepts. | open-vocabulary, open vocabulary | |
| 10 | Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining | Lumina-mGPT: a flexible, photorealistic text-to-image generation model built on multimodal generative pretraining. | depth estimation, multimodal | |
| 11 | Latent-INR: A Flexible Framework for Implicit Representations of Videos with Discriminative Semantics | Latent-INR: a flexible framework for implicit video representations with discriminative semantics. | implicit representation | |
| 12 | Gaussian Mixture based Evidential Learning for Stereo Matching | Proposes Gaussian-mixture-based evidential learning for stereo matching, improving depth-estimation accuracy and cross-domain generalization. | depth estimation, scene flow | |

🔬 Pillar 2: RL & Architecture (4 papers)

| # | Title | One-sentence Takeaway | Tags | 🔗 |
|---|---|---|---|---|
| 13 | Contrastive Learning-based Multi Modal Architecture for Emoticon Prediction by Employing Image-Text Pairs | Proposes a contrastive-learning-based multimodal architecture for emoticon prediction from image-text pairs. | contrastive learning, multimodal | |
| 14 | A Two-Stage Progressive Pre-training using Multi-Modal Contrastive Masked Autoencoders | Proposes a two-stage progressive pre-training method using multi-modal contrastive masked autoencoders for RGB-D image understanding. | masked autoencoder, contrastive learning, distillation | |
| 15 | LaMamba-Diff: Linear-Time High-Fidelity Diffusion Models Based on Local Attention and Mamba | Proposes LaMamba-Diff, combining local attention with Mamba for high-fidelity diffusion models with linear complexity. | Mamba, state space model | |
| 16 | CMR-Agent: Learning a Cross-Modal Agent for Iterative Image-to-Point Cloud Registration | Proposes CMR-Agent for iterative cross-modal image-to-point-cloud registration, improving accuracy and efficiency. | reinforcement learning, imitation learning | |

🔬 Pillar 5: Interaction & Reaction (1 paper)

| # | Title | One-sentence Takeaway | Tags | 🔗 |
|---|---|---|---|---|
| 17 | Exploring Conditional Multi-Modal Prompts for Zero-shot HOI Detection | Proposes conditional multi-modal prompts (CMMP) for zero-shot human-object interaction detection, improving generalization. | human-object interaction, HOI, foundation model | |

🔬 Pillar 7: Motion Retargeting (1 paper)

| # | Title | One-sentence Takeaway | Tags | 🔗 |
|---|---|---|---|---|
| 18 | REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models | The REVISION framework uses rendering tools to improve spatial fidelity in vision-language models. | spatial relationship, large language model, multimodal | |

🔬 Pillar 1: Robot Control (1 paper)

| # | Title | One-sentence Takeaway | Tags | 🔗 |
|---|---|---|---|---|
| 19 | Mixture-of-Noises Enhanced Forgery-Aware Predictor for Multi-Face Manipulation Detection and Localization | Proposes the MoNFAP framework, strengthening detection and localization of multi-face image forgeries. | manipulation | |
