cs.CV (2024-05-31)

📊 20 papers in total | 🔗 6 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (8 🔗1) · Pillar 2: RL & Architecture (7 🔗2) · Pillar 3: Perception & Semantics (4 🔗2) · Pillar 7: Motion Retargeting (1 🔗1)

🔬 Pillar 9: Embodied Foundation Models (8 papers)

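| # | Title | One-line summary | Tags |
|---|-------|------------------|------|
| 1 | DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models | DeCo decouples token compression from semantic abstraction in multimodal large language models, improving vision-language alignment. | large language model, multimodal |
| 2 | Ovis: Structural Embedding Alignment for Multimodal Large Language Model | Ovis: a multimodal large language model with structural embedding alignment. | large language model, multimodal |
| 3 | MeshXL: Neural Coordinate Field for Generative 3D Foundation Models | MeshXL proposes generative 3D foundation models based on a neural coordinate field for high-quality mesh generation. | large language model, foundation model |
| 4 | Retrieval Meets Reasoning: Even High-school Textbook Knowledge Benefits Multimodal Reasoning | Proposes the RMR framework, which uses retrieval augmentation to improve the reasoning ability of multimodal vision-language models. | large language model, multimodal |
| 5 | Empowering Visual Creativity: A Vision-Language Assistant to Image Editing Recommendations | Proposes Creativity-VLA to address ambiguous user intent in image editing recommendation. | VLA, multimodal |
| 6 | Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis | Video-MME: the first comprehensive evaluation benchmark of multimodal large language models for video analysis. | large language model |
| 7 | Enhancing Vision Models for Text-Heavy Content Understanding and Interaction | Enhances vision models to understand and interact with text-heavy visual content. | multimodal |
| 8 | InsightSee: Advancing Multi-agent Vision-Language Models for Enhanced Visual Understanding | Proposes InsightSee, a multi-agent framework that improves the visual understanding of vision-language models in complex scenes. | multimodal |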

🔬 Pillar 2: RL & Architecture (7 papers)

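| # | Title | One-line summary | Tags |
|---|-------|------------------|------|
| 9 | Vision-Language Meets the Skeleton: Progressively Distillation with Cross-Modal Knowledge for 3D Action Representation Learning | Proposes the C²VL framework, which uses vision-language knowledge distillation to improve skeleton-based action representation learning. | representation learning, contrastive learning, distillation |
| 10 | Hybrid Fourier Score Distillation for Efficient One Image to 3D Object Generation | Proposes Hybrid Fourier Score Distillation (hy-FSD) for efficient single-image-to-3D object generation. | distillation, geometric consistency |
| 11 | StrucTexTv3: An Efficient Vision-Language Model for Text-rich Image Perception, Comprehension, and Beyond | StrucTexTv3: an efficient vision-language model for perception, comprehension, and applications on text-rich images. | representation learning, multimodal |
| 12 | S4Fusion: Saliency-aware Selective State Space Model for Infrared Visible Image Fusion | Proposes S4Fusion, which uses a saliency-aware selective state space model for infrared and visible image fusion. | state space model |
| 13 | Adv-KD: Adversarial Knowledge Distillation for Faster Diffusion Sampling | Proposes Adv-KD, an adversarial knowledge distillation method that accelerates diffusion model sampling. | distillation |
| 14 | MASA: Motion-aware Masked Autoencoder with Semantic Alignment for Sign Language Recognition | Proposes MASA, a masked autoencoder that combines motion awareness with semantic alignment for sign language recognition. | masked autoencoder |
| 15 | 4Diffusion: Multi-view Video Diffusion Model for 4D Generation | 4Diffusion proposes a multi-view video diffusion model for generating spatiotemporally consistent 4D content. | distillation, NeRF |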

🔬 Pillar 3: Perception & Semantics (4 papers)

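| # | Title | One-line summary | Tags |
|---|-------|------------------|------|
| 16 | ContextGS: Compact 3D Gaussian Splatting with Anchor Level Context Model | ContextGS: compact 3D Gaussian splatting based on an anchor-level context model. | 3D gaussian splatting, 3DGS, gaussian splatting |
| 17 | R$^2$-Gaussian: Rectifying Radiative Gaussian Splatting for Tomographic Reconstruction | Proposes R$^2$-Gaussian to accelerate and improve 3DGS-based sparse-view tomographic reconstruction. | 3D gaussian splatting, 3DGS, gaussian splatting |
| 18 | MetaGS: A Meta-Learned Gaussian-Phong Model for Out-of-Distribution 3D Scene Relighting | MetaGS: a meta-learned Gaussian-Phong model for out-of-distribution 3D scene relighting. | 3D gaussian splatting, gaussian splatting, splatting |
| 19 | Diversifying Query: Region-Guided Transformer for Temporal Sentence Grounding | Proposes the Region-Guided Transformer (RGTR) to address overlapping, redundant proposals in temporal sentence grounding. | open-vocabulary, open vocabulary |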

🔬 Pillar 7: Motion Retargeting (1 paper)

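| # | Title | One-line summary | Tags |
|---|-------|------------------|------|
| 20 | Hard Cases Detection in Motion Prediction by Vision-Language Foundation Models | Uses vision-language foundation models to detect hard cases in motion prediction, improving autonomous driving safety. | motion prediction, foundation model |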
