cs.CV (2024-12-18)

📊 30 papers in total | 🔗 10 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (12 🔗6) · Pillar 3: Perception & Semantics (7 🔗1) · Pillar 2: RL & Architecture (5 🔗1) · Pillar 1: Robot Control (4) · Pillar 5: Interaction & Reaction (1 🔗1) · Pillar 8: Physics-based Animation (1 🔗1)

🔬 Pillar 9: Embodied Foundation Models (12 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
1	A Review of Multimodal Explainable Artificial Intelligence: Past, Present and Future	Surveys multimodal explainable AI (MXAI) methods that address the black-box problem of AI and improve transparency and trust.	large language model foundation model multimodal	✅
2	MetaMorph: Multimodal Understanding and Generation via Instruction Tuning	Proposes Visual-Predictive Instruction Tuning to improve multimodal understanding and generation.	multimodal instruction following
3	InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models	InstructSeg: a framework that unifies instructed visual segmentation with multimodal large language models.	large language model	✅
4	Zero-Shot Prompting and Few-Shot Fine-Tuning: Revisiting Document Image Classification Using Large Language Models	Explores zero-shot prompting and few-shot fine-tuning of large language models for document image classification.	large language model
5	MedCoT: Medical Chain of Thought via Hierarchical Expert	Proposes MedCoT, a medical visual question answering method in which hierarchical experts verify the reasoning chain.	chain-of-thought
6	G-VEval: A Versatile Metric for Evaluating Image and Video Captions Using GPT-4o	Proposes G-VEval, a GPT-4o-based metric for evaluating image and video caption quality, and builds the MSVD-Eval dataset.	large language model multimodal chain-of-thought	✅
7	CAD-Assistant: Tool-Augmented VLLMs as Generic CAD Task Solvers	Proposes CAD-Assistant, a tool-augmented VLLM that serves as a generic CAD task solver.	large language model multimodal
8	LLaVA-UHD v2: an MLLM Integrating High-Resolution Semantic Pyramid via Hierarchical Window Transformer	LLaVA-UHD v2: a multimodal large language model that integrates a high-resolution semantic pyramid via a hierarchical window Transformer.	large language model multimodal
9	AnySat: One Earth Observation Model for Many Resolutions, Scales, and Modalities	AnySat: a unified Earth observation model that handles multi-resolution, multi-scale, and multimodal data.	multimodal	✅
10	Real Classification by Description: Extending CLIP's Limits of Part Attributes Recognition	Proposes the task of real classification by description, extending CLIP's ability to recognize part attributes.	large language model	✅
11	Prompt Categories Cluster for Weakly Supervised Semantic Segmentation	Proposes the Prompt Categories Cluster (PCC) framework, which uses an LLM for weakly supervised semantic segmentation and improves the learning of inter-category relationships.	large language model
12	Nullu: Mitigating Object Hallucinations in Large Vision-Language Models via HalluSpace Projection	Nullu: mitigates object hallucinations in large vision-language models via HalluSpace projection.	large language model	✅

🔬 Pillar 3: Perception & Semantics (7 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
13	Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation	Proposes Prompt Depth Anything, which uses low-cost LiDAR prompts to achieve accurate metric depth estimation at 4K resolution.	depth estimation metric depth Depth Anything
14	A Simple yet Effective Test-Time Adaptation for Zero-Shot Monocular Metric Depth Estimation	Proposes a simple yet effective test-time adaptation method for zero-shot monocular metric depth estimation.	depth estimation monocular depth metric depth
15	Incorporating Feature Pyramid Tokenization and Open Vocabulary Semantic Segmentation	Proposes feature pyramid tokenization to improve open-vocabulary semantic segmentation.	open-vocabulary open vocabulary
16	GraphAvatar: Compact Head Avatars with GNN-Generated 3D Gaussians	GraphAvatar: compact head avatars built from GNN-generated 3D Gaussians.	3D gaussian splatting 3DGS gaussian splatting	✅
17	MegaSynth: Scaling Up 3D Scene Reconstruction with Synthesized Data	MegaSynth: scales up 3D scene reconstruction with synthesized data.	scene reconstruction affordance
18	Dynamic semantic VSLAM with known and unknown objects	Proposes a dynamic semantic VSLAM that handles both known and unknown objects, improving localization accuracy in dynamic environments.	optical flow
19	MobiFuse: A High-Precision On-device Depth Perception System with Multi-Data Fusion	MobiFuse: a high-precision on-device depth perception system based on multi-data fusion.	stereo depth

🔬 Pillar 2: RL & Architecture (5 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
20	Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces	Proposes the VSI-Bench benchmark to test the visual-spatial reasoning of MLLMs on video, and explores cognitive map generation.	world model large language model multimodal
21	AdvIRL: Reinforcement Learning-Based Adversarial Attacks on 3D NeRF Models	Proposes the AdvIRL framework for reinforcement-learning-based adversarial attacks on 3D NeRF models.	reinforcement learning NeRF neural radiance field	✅
22	When Should We Prefer State-to-Visual DAgger Over Visual Reinforcement Learning?	Compares State-to-Visual DAgger with visual RL, showing which one suits which visual policy learning tasks.	reinforcement learning policy learning
23	On Explaining Knowledge Distillation: Measuring and Visualising the Knowledge Transfer Process	Proposes UniCAM to explain the knowledge distillation process, improving the efficiency of feature learning in student models.	distillation
24	Multi-Exposure Image Fusion via Distilled 3D LUT Grid with Editable Mode	Proposes a multi-exposure image fusion method based on a distilled 3D LUT grid, with support for an editable mode.	teacher-student implicit representation

🔬 Pillar 1: Robot Control (4 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
25	An Efficient Occupancy World Model via Decoupled Dynamic Flow and Image-assisted Training	Proposes DFIT-OccWorld, an efficient 4D occupancy world model that relies on decoupled dynamic flow and image-assisted training.	motion planning world model
26	PixelMan: Consistent Object Editing with Diffusion Models via Pixel Manipulation and Generation	PixelMan: consistent object editing with diffusion models via pixel manipulation and generation.	manipulation
27	Mesoscopic Insights: Orchestrating Multi-scale & Hybrid Architecture for Image Manipulation Localization	Proposes the Mesorch architecture, which improves image manipulation localization through multi-scale hybrid modeling.	manipulation
28	DragScene: Interactive 3D Scene Editing with Single-view Drag Instructions	DragScene: interactive 3D scene editing driven by single-view drag instructions.	manipulation latent optimization

🔬 Pillar 5: Interaction & Reaction (1 paper)

#	Title	One-Sentence Summary	Tags	🔗	⭐
29	Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception	Proposes the DCE method to enhance descriptive image captions for multimodal perception.	human-object interaction HOI multimodal	✅

🔬 Pillar 8: Physics-based Animation (1 paper)

#	Title	One-Sentence Summary	Tags	🔗	⭐
30	Do Language Models Understand Time?	Analyzes the limitations of large language models' temporal reasoning in video understanding.	spatiotemporal large language model multimodal	✅
