cs.CV (2024-10-03)

📊 26 papers in total | 🔗 8 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (12 🔗4) · Pillar 3: Perception & Semantics (7 🔗3) · Pillar 2: RL & Architecture (3) · Pillar 1: Robot Control (3 🔗1) · Pillar 7: Motion Retargeting (1)

🔬 Pillar 9: Embodied Foundation Models (12 papers)

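# | Title | One-sentence summary | Tags | 🔗
1 | Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models | Proposes a controllable captioning pipeline to produce caption formats matched to the preferences of different multimodal pre-training models. | foundation model, multimodal |
2 | Contrastive Localized Language-Image Pre-Training | Proposes contrastive localized language-image pre-training to improve visual representations. | large language model, foundation model, multimodal |
3 | IC3M: In-Car Multimodal Multi-object Monitoring for Abnormal Status of Both Driver and Passengers | Proposes IC3M for in-car multimodal, multi-object monitoring of abnormal driver and passenger status. | multimodal |
4 | A Foundation Model for the Solar Dynamics Observatory | SDO-FM: a multimodal heliophysics foundation model for the Solar Dynamics Observatory. | foundation model |
5 | LLaVA-Video: Video Instruction Tuning With Synthetic Data | LLaVA-Video: video instruction tuning with synthetic data to improve video large multimodal models. | multimodal, instruction following |
6 | Dog-IQA: Standard-guided Zero-shot MLLM for Mix-grained Image Quality Assessment | Proposes Dog-IQA, a standard-guided, zero-shot, mix-grained image quality assessment method that draws on MLLM prior knowledge. | large language model, multimodal |
7 | SCA: Improve Semantic Consistent in Unrestricted Adversarial Attacks via DDPM Inversion | Proposes the SCA framework, which uses DDPM inversion and MLLM guidance to improve the semantic consistency and efficiency of unrestricted adversarial attacks. | large language model, multimodal |
8 | Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos | Vinoground: a benchmark dataset for evaluating LMMs on temporal reasoning over short videos. | multimodal |
9 | Loong: Generating Minute-level Long Videos with Autoregressive Language Models | Loong: a method for generating minute-level long videos with autoregressive language models. | large language model |
10 | Learning from Offline Foundation Features with Tensor Augmentations | LOFF-TA: uses offline foundation model features and tensor augmentations for efficient learning in resource-constrained settings. | foundation model |
11 | DTVLT: A Multi-modal Diverse Text Benchmark for Visual Language Tracking Based on LLM | DTVLT: an LLM-based diverse-text benchmark for visual language tracking. | large language model |
12 | Parameter Competition Balancing for Model Merging | Proposes PCB-Merging, which balances parameter competition for efficient model merging and better multi-task performance. | large language model |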

🔬 Pillar 3: Perception & Semantics (7 papers)

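# | Title | One-sentence summary | Tags | 🔗
13 | SuperGS: Super-Resolution 3D Gaussian Splatting Enhanced by Variational Residual Features and Uncertainty-Augmented Learning | Proposes SuperGS, which enhances super-resolution 3D Gaussian Splatting with variational residual features and uncertainty-augmented learning. | 3D gaussian splatting, 3DGS, gaussian splatting |
14 | GI-GS: Global Illumination Decomposition on Gaussian Splatting for Inverse Rendering | GI-GS: a Gaussian Splatting-based inverse rendering framework with global illumination decomposition, enabling photorealistic novel view synthesis and relighting. | 3D gaussian splatting, 3DGS, gaussian splatting |
15 | DivScene: Towards Open-Vocabulary Object Navigation with Large Vision Language Models in Diverse Scenes | DivScene: uses large vision language models for open-vocabulary object navigation in diverse scenes. | open-vocabulary, open vocabulary |
16 | RSA: Resolving Scale Ambiguities in Monocular Depth Estimators through Language Descriptions | RSA: uses language descriptions to resolve scale ambiguity in monocular depth estimation. | depth estimation, monocular depth, metric depth |
17 | Unleashing the Potential of the Diffusion Model in Few-shot Semantic Segmentation | Proposes DiffewS, which applies diffusion models to few-shot semantic segmentation. | open-vocabulary, open vocabulary, large language model |
18 | Flash-Splat: 3D Reflection Removal with Flash Cues and Gaussian Splats | Flash-Splat: 3D reflection removal using flash cues and Gaussian splats. | 3D gaussian splatting, gaussian splatting, splatting |
19 | RESSCAL3D++: Joint Acquisition and Semantic Segmentation of 3D Point Clouds | Proposes RESSCAL3D++ for joint acquisition and semantic segmentation of resolution-scalable 3D point clouds, significantly improving efficiency. | scene understanding |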

🔬 Pillar 2: RL & Architecture (3 papers)

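# | Title | One-sentence summary | Tags | 🔗
20 | LLaVA-Critic: Learning to Evaluate Multimodal Models | Proposes LLaVA-Critic, a generalist evaluator for assessing multimodal model performance. | preference learning, multimodal, instruction following |
21 | A Comprehensive Survey of Mamba Architectures for Medical Image Analysis: Classification, Segmentation, Restoration and Beyond | Surveys Mamba architectures in medical image analysis: classification, segmentation, restoration, and beyond. | Mamba, SSM, state space model |
22 | SynCo: Synthetic Hard Negatives for Contrastive Visual Representation Learning | SynCo: improves contrastive visual representation learning with synthetic hard negatives. | representation learning, contrastive learning |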

🔬 Pillar 1: Robot Control (3 papers)

# | Title | One-sentence summary | Tags | 🔗
23 | Articulate-Anything: Automatic Modeling of Articulated Objects via a Vision-Language Foundation Model | Articulate-Anything: automatically models articulated objects with a vision-language model. | manipulation, affordance, foundation model |
24 | FakeShield: Explainable Image Forgery Detection and Localization via Multi-modal Large Language Models | Proposes FakeShield, which uses multimodal large language models for explainable image forgery detection and localization. | manipulation, large language model |
25 | Capturing complex hand movements and object interactions using machine learning-powered stretchable smart textile gloves | Proposes machine learning-powered stretchable smart textile gloves for hand motion capture. | dexterous hand |

🔬 Pillar 7: Motion Retargeting (1 paper)

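# | Title | One-sentence summary | Tags | 🔗
26 | Uncovering Regional Defaults from Photorealistic Forests in Text-to-Image Generation with DALL-E 2 | Reveals the regional default biases of DALL-E 2 when generating photorealistic forests in text-to-image generation. | spatial relationship, foundation model |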
