cs.CV (2024-10-17)

📊 28 papers in total | 🔗 8 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (11 🔗2) · Pillar 3: Perception & Semantics (7 🔗3) · Pillar 2: RL & Architecture (3) · Pillar 1: Robot Control (2 🔗1) · Pillar 8: Physics-based Animation (2) · Pillar 4: Generative Motion (2 🔗1) · Pillar 5: Interaction & Reaction (1 🔗1)

🔬 Pillar 9: Embodied Foundation Models (11 papers)

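| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 1 | $γ$-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models | Proposes $γ$-MoD, which improves the efficiency of multimodal large language models through mixture-of-depth adaptation. | large language model, multimodal |
| 2 | RAP: Retrieval-Augmented Personalization for Multimodal Large Language Models | RAP: a retrieval-augmented personalization framework for multimodal large language models. | large language model, multimodal |
| 3 | Improving Multi-modal Large Language Model through Boosting Vision Capabilities | Arcana: improves multimodal large language model performance by boosting vision capabilities. | large language model, multimodal |
| 4 | Fundus to Fluorescein Angiography Video Generation as a Retinal Generative Foundation Model | Proposes Fundus2Video, which generates dynamic FFA videos from color fundus photographs and serves as a retinal generative foundation model. | foundation model, multimodal |
| 5 | Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation | Janus: decouples visual encoding to unify multimodal understanding and generation. | multimodal |
| 6 | Movie Gen: A Cast of Media Foundation Models | Movie Gen: a suite of high-quality media foundation models for 1080p HD video generation and editing. | foundation model |
| 7 | Temporal-Enhanced Multimodal Transformer for Referring Multi-Object Tracking and Segmentation | Proposes TenRMOT, which uses a temporal-enhanced Transformer for referring multi-object tracking and segmentation. | multimodal |
| 8 | Harnessing Webpage UIs for Text-Rich Visual Understanding | Uses webpage UIs to improve text-rich visual understanding, addressing how multimodal large models interact with structured environments. | large language model, multimodal |
| 9 | Exploring the Design Space of Visual Context Representation in Video MLLMs | Proposes a design scheme for visual context representation to improve the performance of video multimodal large language models. | large language model, multimodal |
| 10 | Performance of Gaussian Mixture Model Classifiers on Embedded Feature Spaces | Proposes a GMM classifier with fewer parameters and evaluates its performance on CLIP and ImageBind embedded feature spaces. | multimodal |
| 11 | Trust but Verify: Programmatic VLM Evaluation in the Wild | Proposes a programmatic VLM evaluation method to address response verification for vision-language models. | large language model |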

🔬 Pillar 3: Perception & Semantics (7 papers)

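| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 12 | GlossyGS: Inverse Rendering of Glossy Objects with 3D Gaussian Splatting | GlossyGS: inverse rendering of glossy objects using 3D Gaussian splatting and material priors. | 3D gaussian splatting, gaussian splatting, splatting |
| 13 | DepthSplat: Connecting Gaussian Splatting and Depth | DepthSplat: connects Gaussian splatting and depth estimation for high-quality 3D reconstruction and depth prediction. | depth estimation, monocular depth, 3D gaussian splatting |
| 14 | MEGA: Memory-Efficient 4D Gaussian Splatting for Dynamic Scenes | MEGA: a memory-efficient 4D Gaussian splatting method for dynamic scenes. | gaussian splatting, splatting |
| 15 | VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding | VLM-Grounder: a zero-shot 3D visual grounding method based on vision-language models. | scene understanding, visual grounding |
| 16 | DN-4DGS: Denoised Deformable Network with Temporal-Spatial Aggregation for Dynamic Scene Rendering | Proposes DN-4DGS, which achieves real-time, high-quality dynamic scene rendering through denoising and temporal-spatial aggregation. | 3D gaussian splatting, 3DGS, gaussian splatting |
| 17 | ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding | ARKit LabelMaker: builds a large-scale indoor 3D scene understanding dataset that improves semantic segmentation performance. | scene understanding |
| 18 | Self-Supervised Scene Flow Estimation with Point-Voxel Fusion and Surface Representation | Proposes a self-supervised scene flow estimation method with point-voxel fusion and surface representation, improving 3D motion field prediction accuracy. | scene flow |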

🔬 Pillar 2: RL & Architecture (3 papers)

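| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 19 | DriveDreamer4D: World Models Are Effective Data Machines for 4D Driving Scene Representation | DriveDreamer4D: uses world models as data machines to improve 4D driving scene representation. | world model, dreamer, 3DGS |
| 20 | Stochastic Flow Matching for Resolving Small-Scale Physics | Proposes the stochastic flow matching (SFM) framework for super-resolution reconstruction of small-scale details in the physical sciences. | flow matching |
| 21 | Enhancing Dataset Distillation via Label Inconsistency Elimination and Learning Pattern Refinement | M-DATM: improves dataset distillation by eliminating label inconsistency and refining learning patterns. | distillation |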

🔬 Pillar 1: Robot Control (2 papers)

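| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 22 | Object Pose Estimation Using Implicit Representation For Transparent Objects | Proposes a pose estimation method for transparent objects based on NeRF implicit representation, reported to surpass the existing state of the art. | manipulation, NeRF, neural radiance field |
| 23 | PUMA: Empowering Unified MLLM with Multi-granular Visual Generation | PUMA: proposes a unified multimodal large language model that supports multi-granular visual generation tasks. | manipulation, large language model, foundation model |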

🔬 Pillar 8: Physics-based Animation (2 papers)

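| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 24 | Spatiotemporal Object Detection for Improved Aerial Vehicle Detection in Traffic Monitoring | Proposes a spatiotemporal vehicle detection model that improves vehicle detection accuracy in UAV traffic monitoring. | spatiotemporal |
| 25 | Training Compute-Optimal Vision Transformers for Brain Encoding | Studies compute-optimal training strategies for vision Transformers applied to brain encoding. | spatiotemporal |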

🔬 Pillar 4: Generative Motion (2 papers)

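| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 26 | MotionBank: A Large-scale Video Motion Benchmark with Disentangled Rule-based Annotations | MotionBank: builds a large-scale video motion benchmark for disentangled, rule-based motion caption generation. | motion generation, foundation model |
| 27 | L3DG: Latent 3D Gaussian Diffusion | L3DG: proposes a generative 3D modeling method based on latent 3D Gaussian diffusion that scales to room-level scenes. | VQ-VAE |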

🔬 Pillar 5: Interaction & Reaction (1 paper)

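| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 28 | GraspDiffusion: Synthesizing Realistic Whole-body Hand-Object Interaction | GraspDiffusion: synthesizes realistic whole-body human-object interaction scenes. | human-object interaction |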
