cs.CV (2024-12-19)

📊 44 papers in total | 🔗 19 with code

🎯 Interest Area Navigation

Pillar 2: RL & Architecture (13 🔗6) · Pillar 9: Embodied Foundation Models (12 🔗4) · Pillar 3: Perception & Semantics (9 🔗5) · Pillar 1: Robot Control (6 🔗3) · Pillar 4: Generative Motion (3) · Pillar 6: Video Extraction (1 🔗1)

🔬 Pillar 2: RL & Architecture (13 papers)

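# | Title | Key point | Tags | 🔗
1 | Multimodal Hypothetical Summary for Retrieval-based Multi-image Question Answering | Proposes the Multimodal Hypothetical Summary (MHyS) method to improve retrieval-based multi-image question answering | contrastive learning, large language model, multimodal
2 | MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis | MMAudio: a multimodal joint training framework for high-quality video-to-audio synthesis | flow matching, multimodal
3 | DiffSim: Taming Diffusion Models for Evaluating Visual Similarity | DiffSim: uses diffusion models to evaluate visual similarity, improving the quality of generative models | contrastive learning, diff-sim
4 | SqueezeMe: Mobile-Ready Distillation of Gaussian Full-Body Avatars | SqueezeMe: a distillation framework that makes Gaussian full-body avatars run in real time on mobile devices | distillation, splatting
5 | Flowing from Words to Pixels: A Noise-Free Framework for Cross-Modality Evolution | Proposes CrossFlow, a noise-free cross-modal flow matching framework that maps directly between modalities | flow matching, depth estimation, classifier-free guidance
6 | S$^3$-Mamba: Small-Size-Sensitive Mamba for Lesion Segmentation | Proposes S$^3$-Mamba, which makes Mamba models more sensitive to small lesions in segmentation, aiding early disease diagnosis | Mamba, curriculum learning
7 | OnlineVPO: Align Video Diffusion Model with Online Video-Centric Preference Optimization | OnlineVPO: aligns video diffusion models through online video-centric preference optimization, improving generation quality | policy learning, preference learning, DPO
8 | GURecon: Learning Detailed 3D Geometric Uncertainties for Neural Surface Reconstruction | GURecon: learns fine-grained 3D geometric uncertainties for neural surface reconstruction | distillation, geometric consistency
9 | SCKD: Semi-Supervised Cross-Modality Knowledge Distillation for 4D Radar Object Detection | Proposes SCKD, a semi-supervised cross-modality knowledge distillation method that improves 4D radar object detection | distillation
10 | Scaling 4D Representations | Scales 4D representations to significantly improve self-supervised video learning on spatio-temporal tasks | MAE, depth estimation
11 | Learning Visual Composition through Improved Semantic Guidance | Improves visual composition learning through better semantic guidance, significantly strengthening CLIP model performance | representation learning, contrastive learning
12 | Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM | Prompt-A-Video: a prompt optimization framework for video diffusion models based on a preference-aligned LLM | DPO, direct preference optimization
13 | Token Preference Optimization with Self-Calibrated Visual-Anchored Rewards for Hallucination Mitigation | Proposes token preference optimization with self-calibrated visual-anchored rewards to mitigate hallucination | DPO, direct preference optimization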

🔬 Pillar 9: Embodied Foundation Models (12 papers)

# | Title | Key point | Tags | 🔗
14 | Unveiling Uncertainty: A Deep Dive into Calibration and Performance of Multimodal Large Language Models | Studies calibration in multimodal large language models, introducing the IDK dataset and optimized prompts to improve uncertainty estimation | large language model, multimodal
15 | OpenEMMA: Open-Source Multimodal Model for End-to-End Autonomous Driving | OpenEMMA: an open-source multimodal large model for end-to-end autonomous driving | large language model, multimodal, chain-of-thought
16 | WikiStyle+: A Multimodal Approach to Content-Style Representation Disentanglement for Artistic Image Stylization | Proposes the WikiStyle+ dataset and a disentangled diffusion model for multimodal artistic image style transfer | multimodal
17 | Multi-QuAD: Multi-Level Quality-Adaptive Dynamic Network for Reliable Multimodal Classification | Proposes Multi-QuAD to address reliability problems in multimodal classification caused by varying sample quality | multimodal
18 | MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval | MegaPairs: large-scale data synthesis for universal multimodal retrieval | multimodal
19 | Multimodal Latent Diffusion Model for Complex Sewing Pattern Generation | Proposes SewingLDM for generating complex sewing patterns controlled by text, body shape, and sketches | multimodal
20 | FedPIA -- Permuting and Integrating Adapters leveraging Wasserstein Barycenters for Finetuning Foundation Models in Multi-Modal Federated Learning | FedPIA: permutes and integrates adapters using Wasserstein barycenters to fine-tune foundation models in multimodal federated learning | foundation model
21 | FiVL: A Framework for Improved Vision-Language Alignment through the Lens of Training, Evaluation and Explainability | FiVL: strengthens visual alignment in vision-language models through training, evaluation, and explainability | multimodal, visual grounding
22 | Movie2Story: A framework for understanding videos and telling stories in the form of novel text | Proposes the MSBench benchmark for evaluating multimodal large language models on long-video story generation | large language model
23 | TextSleuth: Towards Explainable Tampered Text Detection | Proposes TextSleuth for explainable tampered text detection | multimodal
24 | HiCM$^2$: Hierarchical Compact Memory Modeling for Dense Video Captioning | Proposes HiCM² for dense video captioning | large language model
25 | Llama Learns to Direct: DirectorLLM for Human-Centric Video Generation | Proposes DirectorLLM, which uses an LLM to orchestrate human poses and improve human-centric video generation quality | large language model

🔬 Pillar 3: Perception & Semantics (9 papers)

# | Title | Key point | Tags | 🔗
26 | LiHi-GS: LiDAR-Supervised Gaussian Splatting for Highway Driving Scene Reconstruction | Proposes LiHi-GS, which uses LiDAR-supervised Gaussian splatting to reconstruct highway driving scenes | gaussian splatting, splatting, NeRF
27 | GSRender: Deduplicated Occupancy Prediction via Weakly Supervised 3D Gaussian Splatting | GSRender: deduplicated occupancy prediction based on weakly supervised 3D Gaussian splatting, improving autonomous driving perception | 3D gaussian splatting, gaussian splatting, splatting
28 | Affordance-Aware Object Insertion via Mask-Aware Dual Diffusion | Proposes a Mask-Aware Dual Diffusion model for controllable, common-sense-consistent object insertion | affordance, affordance-aware
29 | Improving Geometry in Sparse-View 3DGS via Reprojection-based DoF Separation | Proposes a reprojection-based degree-of-freedom separation method that improves geometric reconstruction quality in sparse-view 3DGS | 3D gaussian splatting, 3DGS, gaussian splatting
30 | Bright-NeRF: Brightening Neural Radiance Field with Color Restoration from Low-light Raw Images | Bright-NeRF: a neural radiance field with color restoration for novel view synthesis from low-light RAW images | NeRF, neural radiance field
31 | SolidGS: Consolidating Gaussian Surfel Splatting for Sparse-View Surface Reconstruction | SolidGS: sparse-view surface reconstruction by consolidating Gaussian surfel splatting | gaussian splatting, splatting
32 | ObjVariantEnsemble: Advancing Point Cloud LLM Evaluation in Challenging Scenes with Subtly Distinguished Objects | ObjVariantEnsemble: an evaluation benchmark for point cloud LLMs on subtly distinguished objects | scene understanding, spatial relationship, embodied AI
33 | LiDAR-RT: Gaussian-based Ray Tracing for Dynamic LiDAR Re-simulation | Proposes LiDAR-RT, which uses ray tracing over Gaussian primitives for real-time dynamic LiDAR re-simulation | neural radiance field
34 | PC-BEV: An Efficient Polar-Cartesian BEV Fusion Framework for LiDAR Semantic Segmentation | Proposes PC-BEV, which efficiently fuses polar and Cartesian BEV features for LiDAR semantic segmentation | scene understanding

🔬 Pillar 1: Robot Control (6 papers)

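# | Title | Key point | Tags | 🔗
35 | Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations | Proposes Video Prediction Policy (VPP), a generalist robot policy that uses predictive visual representations to improve robot manipulation | manipulation, dexterous manipulation, contrastive learning
36 | Can We Get Rid of Handcrafted Feature Extractors? SparseViT: Nonsemantics-Centered, Parameter-Efficient Image Manipulation Localization through Spare-Coding Transformer | SparseViT: adaptively extracts non-semantic features with a sparse-coding Transformer for efficient image manipulation localization | manipulation
37 | Arti-PG: A Toolbox for Procedurally Synthesizing Large-Scale and Diverse Articulated Objects with Rich Annotations | Arti-PG: a toolbox for procedurally synthesizing large-scale, diverse articulated objects with rich annotations | manipulation, affordance
38 | A Light-Weight Framework for Open-Set Object Detection with Decoupled Feature Alignment in Joint Space | Proposes DOSOD, a lightweight decoupled open-set object detection framework that improves real-time performance in robotics applications | manipulation
39 | Efficient Neural Network Encoding for 3D Color Lookup Tables | Proposes an efficient neural network encoding to compress 3D color lookup tables | manipulation
40 | TDCNet: Transparent Objects Depth Completion with CNN-Transformer Dual-Branch Parallel Network | Proposes TDCNet, which uses a CNN-Transformer dual-branch parallel network for depth completion of transparent objects | manipulation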

🔬 Pillar 4: Generative Motion (3 papers)

# | Title | Key point | Tags | 🔗
41 | EnergyMoGen: Compositional Human Motion Generation with Energy-Based Diffusion Model in Latent Space | EnergyMoGen: compositional human motion generation with an energy-based diffusion model in latent space | text-to-motion, motion generation, motion latent
42 | ScaMo: Exploring the Scaling Law in Autoregressive Motion Generation Model | ScaMo: explores scaling laws in autoregressive motion generation models | motion generation, motion tokenizer
43 | Jet: A Modern Transformer-Based Normalizing Flow | Jet: a modern Transformer-based normalizing flow model that improves image generation quality | VQ-VAE

🔬 Pillar 6: Video Extraction (1 paper)

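# | Title | Key point | Tags | 🔗
44 | GenHMR: Generative Human Mesh Recovery | GenHMR: a generative human mesh recovery framework that handles the uncertainty of 3D reconstruction from monocular images | human mesh recovery, HMR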
