cs.CV (2024-10-14)

📊 36 papers in total | 🔗 9 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (17 🔗4) · Pillar 2: RL & Architecture (6) · Pillar 3: Perception & Semantics (5 🔗1) · Pillar 7: Motion Retargeting (3 🔗2) · Pillar 1: Robot Control (2 🔗2) · Pillar 4: Generative Motion (2) · Pillar 6: Video Extraction (1)

🔬 Pillar 9: Embodied Foundation Models (17 papers)

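| # | Title | Key point | Tags | 🔗 |
|---|---|---|---|---|
| 1 | ForgeryGPT: Multimodal Large Language Model For Explainable Image Forgery Detection and Localization | Proposes ForgeryGPT, which uses a multimodal large language model for explainable image forgery detection and localization. | large language model, multimodal, instruction following | |
| 2 | X-Fi: A Modality-Invariant Foundation Model for Multimodal Human Sensing | Proposes X-Fi, a modality-invariant foundation model for multimodal human sensing. | foundation model, multimodal | |
| 3 | TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning | Proposes the TWIST & SCOUT framework, which improves the visual grounding ability of MLLMs through forget-free tuning. | large language model, multimodal, visual grounding | |
| 4 | EchoApex: A General-Purpose Vision Foundation Model for Echocardiography | EchoApex: a general-purpose vision foundation model for echocardiography. | foundation model | |
| 5 | TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models | TemporalBench: a benchmark for fine-grained temporal understanding in multimodal video models. | multimodal | |
| 6 | Towards Foundation Models for 3D Vision: How Close Are We? | Proposes the UniQA-3D benchmark to evaluate and improve the capabilities of 3D vision foundation models. | foundation model | |
| 7 | CAFuser: Condition-Aware Multimodal Fusion for Robust Semantic Perception of Driving Scenes | Proposes CAFuser, a condition-aware multimodal fusion method that improves the robustness of semantic perception in driving scenes. | multimodal | |
| 8 | MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks | MEGA-Bench: a multimodal evaluation benchmark of 500+ real-world tasks covering a wide range of application scenarios. | multimodal | |
| 9 | Class Balancing Diversity Multimodal Ensemble for Alzheimer's Disease Diagnosis and Early Detection | Proposes IMBALMED, a class-balancing diversity multimodal ensemble method for early diagnosis of Alzheimer's disease. | multimodal | |
| 10 | Performance Evaluation of Deep Learning and Transformer Models Using Multimodal Data for Breast Cancer Classification | Proposes a deep learning model based on multimodal data fusion to improve breast cancer classification performance. | multimodal | |
| 11 | MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models | Proposes MMIE, a massive multimodal interleaved comprehension benchmark for evaluating large vision-language models. | multimodal | |
| 12 | LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content | Proposes LiveXiv, a live multimodal benchmark based on arXiv paper content for evaluating large multimodal models. | foundation model | |
| 13 | Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation | Proposes the SpatialSonic model for language-driven immersive spatial audio generation. | multimodal | |
| 14 | Cross-Modal Few-Shot Learning: a Generative Transfer Learning Framework | Proposes GTL, a generative transfer learning framework for cross-modal few-shot learning. | multimodal | |
| 15 | MoTE: Reconciling Generalization with Specialization for Visual-Language to Video Knowledge Transfer | Proposes the MoTE framework, which balances generalization and task-specific performance in video recognition. | foundation model | |
| 16 | Hybrid Transformer for Early Alzheimer's Detection: Integration of Handwriting-Based 2D Images and 1D Signal Features | Proposes a hybrid Transformer model that fuses handwriting images and signal features for early detection of Alzheimer's disease. | multimodal | |
| 17 | Spatial-Aware Efficient Projector for MLLMs via Multi-Layer Feature Aggregation | Proposes SAEP, a spatial-aware efficient projector that improves MLLM efficiency and spatial understanding through multi-layer feature aggregation. | multimodal | |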

🔬 Pillar 2: RL & Architecture (6 papers)

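| # | Title | Key point | Tags | 🔗 |
|---|---|---|---|---|
| 18 | 4DStyleGaussian: Zero-shot 4D Style Transfer with Gaussian Splatting | Proposes 4DStyleGaussian, which uses Gaussian splatting for zero-shot 4D style transfer. | distillation, gaussian splatting, splatting | |
| 19 | V2M: Visual 2-Dimensional Mamba for Image Representation Learning | Proposes V2M, a visual 2-dimensional Mamba model for image representation learning. | Mamba, SSM, state space model | |
| 20 | Hi-Mamba: Hierarchical Mamba for Efficient Image Super-Resolution | Hi-Mamba: a hierarchical Mamba network for efficient image super-resolution. | Mamba, SSM, state space model | |
| 21 | DrivingDojo Dataset: Advancing Interactive and Knowledge-Enriched Driving World Model | DrivingDojo: an interactive, knowledge-enriched dataset for driving world models that supports modeling of complex driving scenarios. | world model, instruction following | |
| 22 | GlobalMamba: Global Image Serialization for Vision Mamba | GlobalMamba: improves Vision Mamba performance through global image serialization. | Mamba | |
| 23 | Depth Any Video with Scalable Synthetic Data | Proposes the Depth Any Video model, which uses scalable synthetic data for video depth estimation. | flow matching, depth estimation | |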

🔬 Pillar 3: Perception & Semantics (5 papers)

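| # | Title | Key point | Tags | 🔗 |
|---|---|---|---|---|
| 24 | Few-shot Novel View Synthesis using Depth Aware 3D Gaussian Splatting | Proposes depth-aware 3D Gaussian splatting to address the performance drop in few-shot novel view synthesis. | monocular depth, 3D gaussian splatting, 3DGS | |
| 25 | 4-LEGS: 4D Language Embedded Gaussian Splatting | Proposes 4-LEGS, a language-embedded 4D Gaussian splatting method for spatiotemporal event localization. | 3D gaussian splatting, gaussian splatting, splatting | |
| 26 | Self-Assessed Generation: Trustworthy Label Generation for Optical Flow and Stereo Matching in Real-world | Proposes the Self-Assessed Generation (SAG) framework to improve the generalization of optical flow and stereo matching in real-world scenes. | optical flow, geometric consistency | |
| 27 | 3DArticCyclists: Generating Synthetic Articulated 8D Pose-Controllable Cyclist Data for Computer Vision Applications | Proposes the 3DArticCyclists framework, which generates controllable synthetic 3D cyclist data to address the scarcity of cyclist data in autonomous driving. | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 28 | StegaINR4MIH: steganography by implicit neural representation for multi-image hiding | StegaINR4MIH: steganography for multi-image hiding using implicit neural representations. | implicit representation | |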

🔬 Pillar 7: Motion Retargeting (3 papers)

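| # | Title | Key point | Tags | 🔗 |
|---|---|---|---|---|
| 29 | Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention | Cavia: a camera-controllable multi-view video diffusion model based on view-integrated attention. | geometric consistency, spatiotemporal | |
| 30 | DragEntity: Trajectory Guided Video Generation using Entity and Positional Relationships | DragEntity: trajectory-guided video generation using entity and positional relationships. | spatial relationship | |
| 31 | FlexGen: Flexible Multi-View Generation from Text and Image Inputs | FlexGen: a flexible multi-view generation framework that supports text and image inputs. | spatial relationship | |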

🔬 Pillar 1: Robot Control (2 papers)

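| # | Title | Key point | Tags | 🔗 |
|---|---|---|---|---|
| 32 | Sitcom-Crafter: A Plot-Driven Human Motion Generation System in 3D Scenes | Sitcom-Crafter: a plot-driven human motion generation system in 3D scenes. | locomotion, motion synthesis, motion generation | |
| 33 | Out-of-Bounding-Box Triggers: A Stealthy Approach to Cheat Object Detectors | Proposes a stealthy out-of-bounding-box trigger attack that deceives object detectors. | manipulation | |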

🔬 Pillar 4: Generative Motion (2 papers)

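| # | Title | Key point | Tags | 🔗 |
|---|---|---|---|---|
| 34 | MaskControl: Spatio-Temporal Control for Masked Motion Synthesis | MaskControl: introduces spatio-temporal control to generative masked motion models, improving control precision and motion quality. | motion diffusion model, motion diffusion, text-to-motion | |
| 35 | Boosting Camera Motion Control for Video Diffusion Transformers | Proposes Camera Motion Guidance (CMG), which significantly improves camera motion control accuracy in video diffusion Transformers. | classifier-free guidance | |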

🔬 Pillar 6: Video Extraction (1 paper)

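| # | Title | Key point | Tags | 🔗 |
|---|---|---|---|---|
| 36 | A Consistency-Aware Spot-Guided Transformer for Versatile and Hierarchical Point Cloud Registration | Proposes a consistency-aware spot-guided Transformer for versatile and hierarchical point cloud registration. | feature matching, geometric consistency | |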
