cs.CV (2024-10-04)

📊 29 papers in total | 🔗 10 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (11 🔗2) · Pillar 2: RL & Architecture (6 🔗4) · Pillar 3: Perception & Semantics (4 🔗1) · Pillar 4: Generative Motion (4 🔗2) · Pillar 1: Robot Control (2) · Pillar 8: Physics-based Animation (1 🔗1) · Pillar 6: Video Extraction (1)

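🔬 Pillar 9: Embodied Foundation Models (11 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models	Proposes MemVR, which mitigates hallucination in multimodal large language models through visual retracing.	large language model multimodal	✅
2	Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models	Grounded-VideoLLM: sharpens fine-grained temporal grounding in video large language models.	large language model TAMP
3	A Multimodal Framework for Deepfake Detection	Proposes a multimodal deepfake detection framework that fuses visual and audio information to improve detection accuracy.	multimodal
4	Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning	Proposes the Visual-O1 framework, which handles ambiguous instructions in visual tasks through multimodal multi-turn CoT reasoning.	chain-of-thought
5	Frame-Voyager: Learning to Query Frames for Video Large Language Models	Proposes Frame-Voyager, which learns to query combinations of video frames to improve Video-LLM performance on video understanding tasks.	large language model
6	ARB-LLM: Alternating Refined Binarizations for Large Language Models	Proposes ARB-LLM, which achieves efficient 1-bit quantization of large language models by alternately refining the binarization parameters.	large language model	✅
7	Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition	Audio-Agent: uses LLMs for high-quality audio generation, editing, and composition.	large language model multimodal
8	Unraveling Cross-Modality Knowledge Conflicts in Large Vision-Language Models	Proposes methods to detect and mitigate cross-modal parametric knowledge conflicts, improving the performance of large vision-language models.	multimodal
9	An X-Ray Is Worth 15 Features: Sparse Autoencoders for Interpretable Radiology Report Generation	Proposes SAE-Rad, which uses sparse autoencoders to improve the interpretability and efficiency of radiology report generation.	multimodal
10	Bridging the Gap between Text, Audio, Image, and Any Sequence: A Novel Approach using Gloss-based Annotation	Proposes the BGTAI model, which uses gloss-based annotation to bridge the gap in multimodal understanding across text, audio, and images.	multimodal
11	AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark	Proposes AuroraCap, an efficient detailed video captioning model, and builds a new VDC benchmark.	multimodal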

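🔬 Pillar 2: RL & Architecture (6 papers)

#	Title	One-line summary	Tags	🔗	⭐
12	CLoSD: Closing the Loop between Simulation and Diffusion for multi-task character control	CLoSD: a closed-loop method that combines simulation with diffusion models for multi-task character control.	reinforcement learning motion diffusion model motion diffusion	✅
13	Efficient High-Resolution Visual Representation Learning with State Space Model for Human Pose Estimation	Proposes HRVMamba, which uses a dynamic visual state space model to efficiently learn high-resolution representations for human pose estimation.	Mamba SSM state space model	✅
14	VEDIT: Latent Prediction Architecture For Procedural Video Representation Learning	VEDIT: a latent-space prediction framework for procedural video representation learning.	representation learning Ego4D
15	Mamba in Vision: A Comprehensive Survey of Techniques and Applications	Surveys Mamba as a way to address the limitations of CNNs and ViTs in vision tasks.	Mamba state space model	✅
16	Depth-Guided Self-Supervised Human Keypoint Detection via Cross-Modal Distillation	Proposes Distill-DKP, which uses cross-modal distillation to improve the accuracy of self-supervised human keypoint detection.	distillation	✅
17	DocKD: Knowledge Distillation from LLMs for Open-World Document Understanding Models	DocKD: distills knowledge from large language models to improve the generalization of open-world document understanding models.	distillation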

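🔬 Pillar 3: Perception & Semantics (4 papers)

#	Title	One-line summary	Tags	🔗	⭐
18	Variational Bayes Gaussian Splatting	Proposes Variational Bayes Gaussian Splatting (VBGS) to address catastrophic forgetting in 3D Gaussian splatting on continual data streams.	3D gaussian splatting gaussian splatting splatting
19	SPARTUN3D: Situated Spatial Understanding of 3D World in Large Language Models	SPARTUN3D: a dataset and model for situated spatial understanding of the 3D world in large language models.	scene understanding large language model
20	Refinement of Monocular Depth Maps via Multi-View Differentiable Rendering	Proposes a multi-view differentiable rendering method to improve the accuracy of monocular depth maps.	depth estimation monocular depth	✅
21	EvenNICER-SLAM: Event-based Neural Implicit Encoding SLAM	EvenNICER-SLAM: an event-camera-based neural implicit SLAM that improves robustness under fast motion.	implicit representation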

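🔬 Pillar 4: Generative Motion (4 papers)

#	Title	One-line summary	Tags	🔗	⭐
22	ECHOPulse: ECG controlled echocardio-grams video generation	ECHOPulse: an ECG-controlled echocardiogram video generation model that improves synthetic data quality and automated monitoring capability.	VQ-VAE PULSE	✅
23	Scaling Large Motion Models with Million-Level Human Motions	Proposes the MotionLib dataset to address the shortage of data for human motion generation models.	motion generation motion tokenizer	✅
24	AutoLoRA: AutoGuidance Meets Low-Rank Adaptation for Diffusion Models	AutoLoRA: combines AutoGuidance with LoRA fine-tuning of diffusion models to improve generation quality and diversity.	classifier-free guidance
25	MDMP: Multi-modal Diffusion for supervised Motion Predictions with uncertainty	Proposes MDMP, a multimodal diffusion model for supervised motion prediction with uncertainty.	motion generation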

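🔬 Pillar 1: Robot Control (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
26	Autonomous Character-Scene Interaction Synthesis from Text Instruction	Proposes a framework for synthesizing autonomous character-scene interaction motion from text instructions.	locomotion human-object interaction
27	CLIPDrag: Combining Text-based and Drag-based Instructions for Image Editing	CLIPDrag: an image editing method that combines text-based and drag-based instructions.	manipulation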

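🔬 Pillar 8: Physics-based Animation (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
28	Does SpatioTemporal information benefit Two video summarization benchmarks?	Questions the role of spatiotemporal information in video summarization: the benchmark datasets may be biased.	spatiotemporal	✅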

🔬 Pillar 6: Video Extraction (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
29	Estimating Body and Hand Motion in an Ego-sensed World	EgoAllo: a system for estimating body and hand motion from a head-mounted device.	egocentric
