cs.CV (2024-12-12)

📊 56 papers in total | 🔗 19 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (22 🔗4) · Pillar 2: RL & Architecture (14 🔗8) · Pillar 3: Perception & Semantics (12 🔗4) · Pillar 4: Generative Motion (2 🔗1) · Pillar 1: Robot Control (2 🔗2) · Pillar 8: Physics-based Animation (2) · Pillar 5: Interaction & Reaction (1) · Pillar 7: Motion Retargeting (1)

🔬 Pillar 9: Embodied Foundation Models (22 papers)

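# | Title | One-sentence summary | Tags
1 | Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine | MedPLIB: a multimodal large language model with pixel-level understanding for biomedicine | large language model, multimodal
2 | EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM | EasyRef: uses a multimodal LLM to enable generalized multi-reference image generation for diffusion models | large language model, multimodal, instruction following
3 | InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions | Proposes InternLM-XComposer2.5-OmniLive, a multimodal system for long-term streaming video and audio interaction | large language model, foundation model, multimodal
4 | Exemplar Masking for Multimodal Incremental Learning | Proposes the Exemplar Masking framework to address storage and computation bottlenecks in multimodal incremental learning | large language model, multimodal
5 | Agtech Framework for Cranberry-Ripening Analysis Using Vision Foundation Models | Proposes a cranberry-ripeness analysis framework based on vision foundation models for precision agriculture | foundation model
6 | V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding | Proposes V2PE, which improves the multimodal long-context capability of vision-language models through variable visual position encoding | multimodal
7 | Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation | Proposes the VMB framework, which uses explicit bridges and retrieval augmentation for high-quality multimodal music generation | multimodal
8 | MaskTerial: A Foundation Model for Automated 2D Material Flake Detection | MaskTerial: a foundation model for automated detection of 2D material flakes | foundation model
9 | Foundation Models and Adaptive Feature Selection: A Synergistic Approach to Video Question Answering | Proposes the LGQAVE model, which improves video question answering through adaptive feature selection and foundation models | foundation model
10 | Enriching Multimodal Sentiment Analysis through Textual Emotional Descriptions of Visual-Audio Content | Proposes the DEVA framework, which enhances multimodal sentiment analysis of visual-audio content with textual emotional descriptions | multimodal
11 | Pinpoint Counterfactuals: Reducing social bias in foundation models via localized counterfactual generation | Proposes a localized counterfactual generation method to reduce social bias in foundation models | foundation model
12 | ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation | Proposes the ViCaS dataset to combine high-level (holistic) and pixel-level video understanding via captions with grounded segmentation | large language model, multimodal
13 | Olympus: A Universal Task Router for Computer Vision Tasks | Olympus: a universal task-routing framework for computer vision tasks | large language model, multimodal
14 | SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding | SynerGen-VL: uses vision experts and token folding for synergistic image understanding and generation | large language model, multimodal
15 | Do MLLMs Exhibit Human-like Perceptual Behaviors? HVSBench: A Benchmark for MLLM Alignment with Human Perceptual Behavior | Proposes the HVSBench benchmark to test whether MLLMs exhibit human-like perceptual behavior, revealing a significant gap | large language model, multimodal
16 | Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition | Lyra: an efficient, speech-centric framework for omni-cognition | large language model, multimodal
17 | GenEx: Generating an Explorable World | GenEx: builds an explorable 3D world through generative imagination to improve the capabilities of embodied agents | embodied AI
18 | TimeRefine: Temporal Grounding with Time Refining Video LLM | TimeRefine: temporal grounding with a time-refining video LLM | TAMP
19 | Vision-Language Models Generate More Homogeneous Stories for Phenotypically Black Individuals | Vision-language models generate more homogeneous stories for phenotypically Black individuals, revealing a within-group homogeneity bias | large language model
20 | FD2-Net: Frequency-Driven Feature Decomposition Network for Infrared-Visible Object Detection | Proposes FD2-Net, which improves infrared-visible object detection through frequency-driven feature decomposition | multimodal
21 | Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method | Proposes a long-horizon vision-language navigation task and benchmark, and designs a multi-granularity dynamic memory model to improve navigation performance | VLN
22 | Geo-LLaVA: A Large Multi-Modal Model for Solving Geometry Math Problems with Meta In-Context Learning | Proposes Geo-LLaVA, which uses meta in-context learning to solve geometry math problems | large language model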

🔬 Pillar 2: RL & Architecture (14 papers)

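# | Title | One-sentence summary | Tags
23 | Elevating Visual Perception in Multimodal LLMs with Visual Embedding Distillation | VisPer-LM: improves the visual perception of multimodal LLMs through visual embedding distillation | distillation, embodied AI, multimodal
24 | VLMs meet UDA: Boosting Transferability of Open Vocabulary Segmentation with Unsupervised Domain Adaptation | Proposes the UDA-FROVSS framework, which combines VLMs with UDA to improve cross-domain transfer of open-vocabulary semantic segmentation | distillation, open-vocabulary, open vocabulary
25 | YingSound: Video-Guided Sound Effects Generation with Multi-modal Chain-of-Thought Controls | YingSound: proposes video-guided sound effects generation with multimodal chain-of-thought (CoT) controls, targeting high-quality sound effects generation in few-shot settings | flow matching, foundation model, chain-of-thought
26 | Tuned Reverse Distillation: Enhancing Multimodal Industrial Anomaly Detection with Crossmodal Tuners | Proposes Tuned Reverse Distillation, which enhances multimodal industrial anomaly detection with crossmodal tuners | distillation, multimodal
27 | Towards Robust and Fair Vision Learning in Open-World Environments | Proposes methods to improve the fairness and robustness of vision learning in open-world environments | world model, foundation model, multimodal
28 | Physics-Driven Autoregressive State Space Models for Medical Image Reconstruction | Proposes MambaRoll, a physics-driven autoregressive state space model for high-quality medical image reconstruction | Mamba, SSM, state space model
29 | USDRL: Unified Skeleton-Based Dense Representation Learning with Multi-Grained Feature Decorrelation | Proposes the USDRL framework, which learns dense representations of skeleton-based actions through multi-grained feature decorrelation, improving action recognition, retrieval, and detection | representation learning, contrastive learning
30 | Is Contrastive Distillation Enough for Learning Comprehensive 3D Representations? | Proposes the CMCR framework, which improves 3D representations by jointly learning modality-shared and modality-specific features | representation learning, distillation
31 | Selective Visual Prompting in Vision Mamba | Proposes Selective Visual Prompting (SVP) for Vision Mamba to improve fine-tuning performance on downstream tasks | Mamba, state space model
32 | Dynamic Contrastive Knowledge Distillation for Efficient Image Restoration | Proposes a dynamic contrastive knowledge distillation framework that improves the performance of small networks on image restoration | contrastive learning, distillation
33 | Owl-1: Omni World Model for Consistent Long Video Generation | Proposes Owl-1, an Omni World Model for consistent long video generation | world model
34 | All You Need in Knowledge Distillation Is a Tailored Coordinate System | Proposes a tailored-coordinate-system distillation method that removes knowledge distillation's dependence on task-specific teacher models | distillation
35 | DomCLP: Domain-wise Contrastive Learning with Prototype Mixup for Unsupervised Domain Generalization | Proposes DomCLP, which addresses unsupervised domain generalization through domain-wise contrastive learning and prototype mixup | contrastive learning
36 | Inference-Time Diffusion Model Distillation | Proposes Distillation++, which improves diffusion model distillation through inference-time teacher-guided refinement | distillation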

🔬 Pillar 3: Perception & Semantics (12 papers)

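# | Title | One-sentence summary | Tags
37 | Feat2GS: Probing Visual Foundation Models with Gaussian Splatting | Feat2GS: uses Gaussian splatting to probe the 3D awareness of visual foundation models | 3DGS, gaussian splatting, splatting
38 | PBR-NeRF: Inverse Rendering with Physics-Based Neural Fields | PBR-NeRF: inverse rendering with physics-based neural fields for joint estimation of materials and lighting | 3D gaussian splatting, gaussian splatting, splatting
39 | FreeSplatter: Pose-free Gaussian Splatting for Sparse-view 3D Reconstruction | FreeSplatter: Gaussian splatting for sparse-view 3D reconstruction without camera poses | gaussian splatting, splatting
40 | GEAL: Generalizable 3D Affordance Learning with Cross-Modal Consistency | GEAL: uses cross-modal consistency to improve the generalization of 3D affordance learning | gaussian splatting, splatting, affordance
41 | eCARLA-scenes: A synthetically generated dataset for event-based optical flow prediction | Proposes eCARLA-scenes, a synthetic dataset for event-camera optical flow prediction, aimed at autonomous driving scenarios | visual odometry, optical flow
42 | Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos | Stereo4D: learns dynamic 3D scene reconstruction from Internet stereo videos | depth estimation, stereo depth, scene reconstruction
43 | LiftImage3D: Lifting Any Single Image to 3D Gaussians with Video Generation Priors | LiftImage3D: uses video generation priors to lift a single image to 3D Gaussians, addressing single-image 3D reconstruction | 3D gaussian splatting, gaussian splatting, splatting
44 | SLAM3R: Real-Time Dense Scene Reconstruction from Monocular RGB Videos | SLAM3R: a real-time dense scene reconstruction system for monocular RGB video | scene reconstruction
45 | ResFlow: Fine-tuning Residual Optical Flow for Event-based High Temporal Resolution Motion Estimation | ResFlow: fine-tunes residual optical flow for event-camera-based motion estimation at high temporal resolution | optical flow
46 | Labits: Layered Bidirectional Time Surfaces Representation for Event Camera-based Continuous Dense Trajectory Estimation | Proposes Labits, a layered bidirectional time-surface representation for event-camera-based continuous dense trajectory estimation | optical flow, TAMP
47 | Cross-View Completion Models are Zero-shot Correspondence Estimators | Proposes a zero-shot correspondence estimation method based on cross-view completion models | depth estimation
48 | Mojito: Motion Trajectory and Intensity Control for Video Generation | Mojito: proposes a video generation diffusion model with controllable motion trajectory and intensity | optical flow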

🔬 Pillar 4: Generative Motion (2 papers)

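# | Title | One-sentence summary | Tags
49 | Motion Generation Review: Exploring Deep Learning for Lifelike Animation with Manifold | Survey paper: explores deep learning applications of manifold learning in human motion generation | motion generation
50 | Video Creation by Demonstration | Proposes δ-Diffusion, which generates realistic, coherent new videos from a demonstration video and a context image | physically plausible, foundation model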

🔬 Pillar 1: Robot Control (2 papers)

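# | Title | One-sentence summary | Tags
51 | Doe-1: Closed-Loop Autonomous Driving with Large World Model | Proposes Doe-1, a closed-loop autonomous driving framework based on a large world model that unifies perception, prediction, and planning | motion planning, world model
52 | MS2Mesh-XR: Multi-modal Sketch-to-Mesh Generation in XR Environments | MS2Mesh-XR: multi-modal mesh generation from hand-drawn sketches and voice input in XR environments | manipulation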

🔬 Pillar 8: Physics-based Animation (2 papers)

# | Title | One-sentence summary | Tags
53 | Identity-Preserving Pose-Guided Character Animation via Facial Landmarks Transformation | Proposes the FLT method to address facial consistency in pose-guided character animation | character animation
54 | Reversing the Damage: A QP-Aware Transformer-Diffusion Approach for 8K Video Restoration under Codec Compression | Proposes DiQP, a QP-aware Transformer-Diffusion model for restoring 8K video quality lost to codec compression | spatiotemporal

🔬 Pillar 5: Interaction & Reaction (1 paper)

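# | Title | One-sentence summary | Tags
55 | ContextHOI: Spatial Context Learning for Human-Object Interaction Detection | ContextHOI: proposes a spatial context learning framework that improves human-object interaction detection in occluded scenes | human-object interaction, HOI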

🔬 Pillar 7: Motion Retargeting (1 paper)

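# | Title | One-sentence summary | Tags
56 | Weighted Poisson-disk Resampling on Large-Scale Point Clouds | Proposes a weighted Poisson-disk resampling method that improves the efficiency and geometric consistency of large-scale point cloud processing | geometric consistency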
