cs.CV (2024-10-04)

📊 29 papers in total | 🔗 10 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (11 🔗2) · Pillar 2: RL & Architecture (6 🔗4) · Pillar 3: Perception & Semantics (4 🔗1) · Pillar 4: Generative Motion (4 🔗2) · Pillar 1: Robot Control (2) · Pillar 8: Physics-based Animation (1 🔗1) · Pillar 6: Video Extraction (1)

🔬 Pillar 9: Embodied Foundation Models (11 papers)

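| # | Title | Summary | Tags |
|---|---|---|---|
| 1 | Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models | Proposes MemVR, which mitigates hallucination in multimodal large language models through visual retracing. | large language model, multimodal |
| 2 | Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models | Grounded-VideoLLM: improves fine-grained temporal grounding in video large language models. | large language model, TAMP |
| 3 | A Multimodal Framework for Deepfake Detection | Proposes a multimodal deepfake detection framework that fuses visual and audio information to improve detection accuracy. | multimodal |
| 4 | Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning | Proposes the Visual-O1 framework, which uses multimodal multi-turn chain-of-thought reasoning to understand ambiguous instructions in visual tasks. | chain-of-thought |
| 5 | Frame-Voyager: Learning to Query Frames for Video Large Language Models | Proposes Frame-Voyager, which learns to query combinations of video frames to improve Video-LLM performance on video understanding tasks. | large language model |
| 6 | ARB-LLM: Alternating Refined Binarizations for Large Language Models | Proposes ARB-LLM, which achieves efficient 1-bit quantization of large language models by alternately refining the binarization parameters. | large language model |
| 7 | Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition | Audio-Agent: uses LLMs for high-quality audio generation, editing, and composition. | large language model, multimodal |
| 8 | Unraveling Cross-Modality Knowledge Conflicts in Large Vision-Language Models | Proposes a method to detect and mitigate cross-modality parametric knowledge conflicts, improving the performance of large vision-language models. | multimodal |
| 9 | An X-Ray Is Worth 15 Features: Sparse Autoencoders for Interpretable Radiology Report Generation | Proposes SAE-Rad, which uses sparse autoencoders to improve the interpretability and efficiency of radiology report generation. | multimodal |
| 10 | Bridging the Gap between Text, Audio, Image, and Any Sequence: A Novel Approach using Gloss-based Annotation | Proposes the BGTAI model, which uses gloss-based annotation to bridge the gap in multimodal understanding across text, audio, images, and other modalities. | multimodal |
| 11 | AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark | Proposes AuroraCap, an efficient model for detailed video captioning, and builds the new VDC benchmark. | multimodal |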

🔬 Pillar 2: RL & Architecture (6 papers)

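| # | Title | Summary | Tags |
|---|---|---|---|
| 12 | CLoSD: Closing the Loop between Simulation and Diffusion for multi-task character control | CLoSD: a closed-loop method that combines simulation with diffusion models for multi-task character control. | reinforcement learning, motion diffusion model, motion diffusion |
| 13 | Efficient High-Resolution Visual Representation Learning with State Space Model for Human Pose Estimation | Proposes HRVMamba, which uses a dynamic visual state space model to efficiently learn high-resolution representations for human pose estimation. | Mamba, SSM, state space model |
| 14 | VEDIT: Latent Prediction Architecture For Procedural Video Representation Learning | VEDIT: a latent-space prediction framework for procedural video representation learning. | representation learning, Ego4D |
| 15 | Mamba in Vision: A Comprehensive Survey of Techniques and Applications | Surveys Mamba techniques and applications in vision, where Mamba is used to address the limitations of CNNs and ViTs. | Mamba, state space model |
| 16 | Depth-Guided Self-Supervised Human Keypoint Detection via Cross-Modal Distillation | Proposes Distill-DKP, which uses cross-modal distillation to improve the accuracy of self-supervised human keypoint detection. | distillation |
| 17 | DocKD: Knowledge Distillation from LLMs for Open-World Document Understanding Models | DocKD: distills knowledge from large language models to improve the generalization of open-world document understanding models. | distillation |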

🔬 Pillar 3: Perception & Semantics (4 papers)

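| # | Title | Summary | Tags |
|---|---|---|---|
| 18 | Variational Bayes Gaussian Splatting | Proposes Variational Bayes Gaussian Splatting (VBGS), which addresses catastrophic forgetting in 3D Gaussian splatting on continual data streams. | 3D gaussian splatting, gaussian splatting, splatting |
| 19 | SPARTUN3D: Situated Spatial Understanding of 3D World in Large Language Models | SPARTUN3D: a dataset and model for situated spatial understanding of the 3D world in large language models. | scene understanding, large language model |
| 20 | Refinement of Monocular Depth Maps via Multi-View Differentiable Rendering | Proposes a multi-view differentiable rendering method to improve the accuracy of monocular depth maps. | depth estimation, monocular depth |
| 21 | EvenNICER-SLAM: Event-based Neural Implicit Encoding SLAM | EvenNICER-SLAM: an event-camera-based neural implicit SLAM that improves robustness under fast motion. | implicit representation |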

🔬 Pillar 4: Generative Motion (4 papers)

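| # | Title | Summary | Tags |
|---|---|---|---|
| 22 | ECHOPulse: ECG controlled echocardio-grams video generation | ECHOPulse: proposes an ECG-controlled echocardiogram video generation model that improves synthetic data quality and automated monitoring capability. | VQ-VAE, PULSE |
| 23 | Scaling Large Motion Models with Million-Level Human Motions | Proposes the MotionLib dataset to address the shortage of data for human motion generation models. | motion generation, motion tokenizer |
| 24 | AutoLoRA: AutoGuidance Meets Low-Rank Adaptation for Diffusion Models | AutoLoRA: combines AutoGuidance with LoRA fine-tuning of diffusion models to improve generation quality and diversity. | classifier-free guidance |
| 25 | MDMP: Multi-modal Diffusion for supervised Motion Predictions with uncertainty | Proposes MDMP, a multimodal diffusion model for supervised motion prediction with uncertainty. | motion generation |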

🔬 Pillar 1: Robot Control (2 papers)

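| # | Title | Summary | Tags |
|---|---|---|---|
| 26 | Autonomous Character-Scene Interaction Synthesis from Text Instruction | Proposes a framework that synthesizes autonomous character-scene interaction motion from text instructions. | locomotion, human-object interaction |
| 27 | CLIPDrag: Combining Text-based and Drag-based Instructions for Image Editing | CLIPDrag: an image editing method that combines text and drag instructions. | manipulation |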

🔬 Pillar 8: Physics-based Animation (1 paper)

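| # | Title | Summary | Tags |
|---|---|---|---|
| 28 | Does SpatioTemporal information benefit Two video summarization benchmarks? | Questions the role of spatiotemporal information in video summarization: the benchmark datasets may be biased. | spatiotemporal |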

🔬 Pillar 6: Video Extraction (1 paper)

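| # | Title | Summary | Tags |
|---|---|---|---|
| 29 | Estimating Body and Hand Motion in an Ego-sensed World | EgoAllo: proposes a system that estimates body and hand motion from a head-mounted device. | egocentric |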
