cs.CV (2024-12-18)

📊 30 papers in total | 🔗 10 with code

🎯 Interest area navigation

Pillar 9: Embodied Foundation Models (12 🔗6)
Pillar 3: Perception & Semantics (7 🔗1)
Pillar 2: RL & Architecture (5 🔗1)
Pillar 1: Robot Control (4)
Pillar 5: Interaction & Reaction (1 🔗1)
Pillar 8: Physics-based Animation (1 🔗1)

🔬 Pillar 9: Embodied Foundation Models (12 papers)

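# | Title | One-line summary | Tags | 🔗
1 | A Review of Multimodal Explainable Artificial Intelligence: Past, Present and Future | Surveys multimodal explainable AI (MXAI) methods that address the AI black-box problem and improve transparency and trust. | large language model, foundation model, multimodal
2 | MetaMorph: Multimodal Understanding and Generation via Instruction Tuning | Proposes Visual-Predictive Instruction Tuning to improve multimodal understanding and generation. | multimodal, instruction following
3 | InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models | InstructSeg: a framework that unifies instructed visual segmentation with multimodal large language models. | large language model
4 | Zero-Shot Prompting and Few-Shot Fine-Tuning: Revisiting Document Image Classification Using Large Language Models | Explores zero-shot prompting and few-shot fine-tuning of large language models for document image classification. | large language model
5 | MedCoT: Medical Chain of Thought via Hierarchical Expert | Proposes MedCoT, a medical visual question answering method based on reasoning chains verified by hierarchical experts. | chain-of-thought
6 | G-VEval: A Versatile Metric for Evaluating Image and Video Captions Using GPT-4o | Proposes G-VEval, a GPT-4o-based metric for evaluating image and video caption quality, and builds the MSVD-Eval dataset. | large language model, multimodal, chain-of-thought
7 | CAD-Assistant: Tool-Augmented VLLMs as Generic CAD Task Solvers | Proposes CAD-Assistant, a tool-augmented VLLM that acts as a generic CAD task solver. | large language model, multimodal
8 | LLaVA-UHD v2: an MLLM Integrating High-Resolution Semantic Pyramid via Hierarchical Window Transformer | LLaVA-UHD v2: a multimodal large language model that integrates a high-resolution semantic pyramid via a hierarchical window Transformer. | large language model, multimodal
9 | AnySat: One Earth Observation Model for Many Resolutions, Scales, and Modalities | AnySat: proposes a unified Earth observation model that handles data of multiple resolutions, scales, and modalities. | multimodal
10 | Real Classification by Description: Extending CLIP's Limits of Part Attributes Recognition | Proposes the task of real classification by description, extending CLIP's ability to recognize part attributes. | large language model
11 | Prompt Categories Cluster for Weakly Supervised Semantic Segmentation | Proposes the Prompt Categories Cluster (PCC) framework, which uses an LLM for weakly supervised semantic segmentation and improves learning of inter-class relationships. | large language model
12 | Nullu: Mitigating Object Hallucinations in Large Vision-Language Models via HalluSpace Projection | Nullu: mitigates object hallucinations in large vision-language models via HalluSpace projection. | large language model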

🔬 Pillar 3: Perception & Semantics (7 papers)

# | Title | One-line summary | Tags | 🔗
13 | Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation | Proposes Prompt Depth Anything, which uses low-cost LiDAR prompts for accurate 4K metric depth estimation. | depth estimation, metric depth, Depth Anything
14 | A Simple yet Effective Test-Time Adaptation for Zero-Shot Monocular Metric Depth Estimation | Proposes a simple and effective test-time adaptation method for zero-shot monocular metric depth estimation. | depth estimation, monocular depth, metric depth
15 | Incorporating Feature Pyramid Tokenization and Open Vocabulary Semantic Segmentation | Proposes feature pyramid tokenization to improve open-vocabulary semantic segmentation. | open-vocabulary, open vocabulary
16 | GraphAvatar: Compact Head Avatars with GNN-Generated 3D Gaussians | GraphAvatar: compact head avatars built from GNN-generated 3D Gaussians. | 3D gaussian splatting, 3DGS, gaussian splatting
17 | MegaSynth: Scaling Up 3D Scene Reconstruction with Synthesized Data | MegaSynth: scales up 3D scene reconstruction with synthesized data. | scene reconstruction, affordance
18 | Dynamic semantic VSLAM with known and unknown objects | Proposes a dynamic semantic VSLAM that handles known and unknown objects and improves localization accuracy in dynamic environments. | optical flow
19 | MobiFuse: A High-Precision On-device Depth Perception System with Multi-Data Fusion | MobiFuse: a high-precision on-device depth perception system based on multi-data fusion. | stereo depth

🔬 Pillar 2: RL & Architecture (5 papers)

# | Title | One-line summary | Tags | 🔗
20 | Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces | Proposes VSI-Bench, a benchmark of MLLMs' visual-spatial reasoning in video, and explores cognitive map generation. | world model, large language model, multimodal
21 | AdvIRL: Reinforcement Learning-Based Adversarial Attacks on 3D NeRF Models | Proposes the AdvIRL framework to address adversarial attacks on 3D NeRF models. | reinforcement learning, NeRF, neural radiance field
22 | When Should We Prefer State-to-Visual DAgger Over Visual Reinforcement Learning? | Compares State-to-Visual DAgger with visual RL and shows where each is suited across visual policy learning tasks. | reinforcement learning, policy learning
23 | On Explaining Knowledge Distillation: Measuring and Visualising the Knowledge Transfer Process | Proposes UniCAM to explain the knowledge distillation process and improve the efficiency of student model feature learning. | distillation
24 | Multi-Exposure Image Fusion via Distilled 3D LUT Grid with Editable Mode | Proposes a multi-exposure image fusion method based on a distilled 3D LUT grid, with an editable mode. | teacher-student, implicit representation

🔬 Pillar 1: Robot Control (4 papers)

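# | Title | One-line summary | Tags | 🔗
25 | An Efficient Occupancy World Model via Decoupled Dynamic Flow and Image-assisted Training | Proposes DFIT-OccWorld, which efficiently predicts a 4D occupancy world model through decoupled dynamic flow and image-assisted training. | motion planning, world model
26 | PixelMan: Consistent Object Editing with Diffusion Models via Pixel Manipulation and Generation | PixelMan: consistent object editing with diffusion models via pixel manipulation and generation. | manipulation
27 | Mesoscopic Insights: Orchestrating Multi-scale & Hybrid Architecture for Image Manipulation Localization | Proposes the Mesorch architecture, which improves image manipulation localization through multi-scale hybrid modeling. | manipulation
28 | DragScene: Interactive 3D Scene Editing with Single-view Drag Instructions | DragScene: interactive 3D scene editing driven by single-view drag instructions. | manipulation, latent optimization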

🔬 Pillar 5: Interaction & Reaction (1 paper)

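# | Title | One-line summary | Tags | 🔗
29 | Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception | Proposes the DCE method to enhance descriptive image captions for multimodal perception. | human-object interaction, HOI, multimodal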

🔬 Pillar 8: Physics-based Animation (1 paper)

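# | Title | One-line summary | Tags | 🔗
30 | Do Language Models Understand Time? | Analyzes the limitations of large language models' temporal reasoning in video understanding. | spatiotemporal, large language model, multimodal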
