cs.CV (2024-08-06)

📊 19 papers in total | 🔗 3 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (9 🔗2) · Pillar 2: RL & Architecture (4) · Pillar 3: Perception & Semantics (3 🔗1) · Pillar 1: Robot Control (2) · Pillar 4: Generative Motion (1)

🔬 Pillar 9: Embodied Foundation Models (9 papers)

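# | Title | Key point | Tags | 🔗
1 | MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine | MedTrinity-25M: a large-scale multimodal medical dataset that supports multigranular annotations and pretraining of medical AI models. | large language model, foundation model, multimodal |
2 | GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI | GMAI-MMBench: builds a comprehensive multimodal medical evaluation benchmark to advance general medical AI. | multimodal |
3 | Benchmarking In-the-wild Multimodal Disease Recognition and A Versatile Baseline | Proposes an in-the-wild multimodal plant disease recognition dataset and a multi-prototype fusion baseline model, addressing small inter-class and large intra-class variation. | multimodal |
4 | One Framework to Rule Them All: Unifying Multimodal Tasks with LLM Neural-Tuning | Proposes a unified multimodal framework based on LLM neural tuning to address multi-task generality. | multimodal |
5 | WWW: Where, Which and Whatever Enhancing Interpretability in Multimodal Deepfake Detection | Proposes the FakeMix benchmark and new metrics to improve the interpretability of multimodal deepfake detection in dynamic scenes. | multimodal |
6 | Targeted Visual Prompting for Medical Visual Question Answering | Proposes a targeted visual prompting method that improves the region understanding of multimodal large language models in medical visual question answering. | large language model, multimodal |
7 | Set2Seq Transformer: Temporal and Positional-Aware Set Representations for Sequential Multiple-Instance Learning | Proposes Set2Seq Transformer for temporal and positional-aware set representations in sequential multiple-instance learning. | multimodal |
8 | LLaVA-OneVision: Easy Visual Task Transfer | LLaVA-OneVision: a single model that transfers visual tasks across single-image, multi-image, and video scenarios. | multimodal |
9 | Sample-agnostic Adversarial Perturbation for Vision-Language Pre-training Models | Proposes a sample-agnostic adversarial perturbation method to improve the security of vision-language pre-training models. | multimodal |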

🔬 Pillar 2: RL & Architecture (4 papers)

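# | Title | Key point | Tags | 🔗
10 | Vision Foundation Models in Remote Sensing: A Survey | A survey of vision foundation models in remote sensing: analyzes architectures, datasets, and methods, and outlines future directions. | masked autoencoder, contrastive learning, foundation model |
11 | Leveraging Entity Information for Cross-Modality Correlation Learning: The Entity-Guided Multimodal Summarization | Proposes the EGMS model, which uses entity information to strengthen cross-modality correlation learning and improve the quality of multimodal summarization. | distillation, multimodal |
12 | Pose Magic: Efficient and Temporally Consistent Human Pose Estimation with a Hybrid Mamba-GCN Network | Proposes a hybrid Mamba-GCN network to address efficiency and accuracy in 3D human pose estimation. | Mamba, state space model, spatiotemporal |
13 | Contrastive Learning for Image Complexity Representation | Proposes CLIC, a contrastive-learning-based image complexity representation method that assesses image complexity effectively without manual annotation. | contrastive learning |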

🔬 Pillar 3: Perception & Semantics (3 papers)

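# | Title | Key point | Tags | 🔗
14 | LumiGauss: Relightable Gaussian Splatting in the Wild | Proposes LumiGauss to address complex 3D reconstruction and the separation of environment lighting. | gaussian splatting, splatting, NeRF |
15 | Efficient NeRF Optimization -- Not All Samples Remain Equally Hard | Proposes online hard sample mining to optimize NeRF, significantly improving training efficiency and rendering quality. | NeRF, neural radiance field |
16 | RayGauss: Volumetric Gaussian-Based Ray Casting for Photorealistic Novel View Synthesis | RayGauss: volumetric Gaussian-based ray casting for photorealistic novel view synthesis. | splatting, NeRF, neural radiance field |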

🔬 Pillar 1: Robot Control (2 papers)

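# | Title | Key point | Tags | 🔗
17 | Body of Her: A Preliminary Study on End-to-End Humanoid Agent | Proposes an end-to-end interactive humanoid agent model capable of real-time two-way communication and general object manipulation. | humanoid, manipulation, large language model |
18 | BodySLAM: A Generalized Monocular Visual SLAM Framework for Surgical Applications | BodySLAM: a generalized monocular visual SLAM framework for surgical applications. | manipulation, visual SLAM, depth estimation |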

🔬 Pillar 4: Generative Motion (1 paper)

# | Title | Key point | Tags | 🔗
19 | TextIM: Part-aware Interactive Motion Synthesis from Text | TextIM: proposes a text-driven, part-aware framework for interactive motion synthesis. | motion synthesis, large language model |
