cs.CV (2025-01-23)

📊 31 papers in total | 🔗 8 with code

🎯 Interest Area Navigation

Pillar 2: RL & Architecture (10 🔗2) · Pillar 9: Embodied Foundation Models (9 🔗2) · Pillar 3: Perception & Semantics (5) · Pillar 1: Robot Control (3 🔗2) · Pillar 8: Physics-based Animation (2 🔗1) · Pillar 7: Motion Retargeting (1 🔗1) · Pillar 4: Generative Motion (1)

🔬 Pillar 2: RL & Architecture (10 papers)

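| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | Multi-aspect Knowledge Distillation with Large Language Model | Proposes a multi-aspect knowledge distillation method based on multimodal large language models to improve image classification performance. | distillation, large language model, multimodal | |
| 2 | QMamba: Post-Training Quantization for Vision State Space Models | QMamba: a post-training quantization framework for vision state space models. | Mamba, SSM, state space model | |
| 3 | MultiDreamer3D: Multi-concept 3D Customization with Concept-Aware Diffusion Guidance | MultiDreamer3D: proposes a multi-concept 3D customization method with concept-aware diffusion guidance. | dreamer, 3D gaussian splatting, gaussian splatting | |
| 4 | Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step | Proposes a CoT-based image generation approach that significantly improves autoregressive image generation quality through verification and reinforcement steps. | DPO, direct preference optimization, chain-of-thought | |
| 5 | MV-GMN: State Space Model for Multi-View Action Recognition | Proposes the MV-GMN model to efficiently handle multimodal, multi-view, and multi-temporal data in multi-view action recognition. | Mamba, state space model | |
| 6 | Contrast: A Hybrid Architecture of Transformers and State Space Models for Low-Level Vision | Proposes Contrast, a hybrid architecture combining Transformers and state space models, to improve image super-resolution performance. | Mamba, state space model | |
| 7 | Temporal Preference Optimization for Long-Form Video Understanding | Proposes the Temporal Preference Optimization (TPO) framework to improve the temporal grounding ability of large video models on long videos. | preference learning, multimodal | |
| 8 | Improving Video Generation with Human Feedback | Proposes a human-feedback-based optimization pipeline for video generation that addresses unsmooth motion and alignment issues. | reinforcement learning, DPO, direct preference optimization | |
| 9 | Retrievals Can Be Detrimental: A Contrastive Backdoor Attack Paradigm on Retrieval-Augmented Diffusion Models | Proposes BadRDM, a contrastive backdoor attack on retrieval-augmented diffusion models, revealing security risks introduced by RAG. | contrastive learning, multimodal | |
| 10 | A Cognitive Paradigm Approach to Probe the Perception-Reasoning Interface in VLMs | Proposes a cognitive-paradigm evaluation framework to dissect the perception-reasoning interface in vision-language models. | DRL, HOI | |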

🔬 Pillar 9: Embodied Foundation Models (9 papers)

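| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 11 | Revisiting CLIP: Efficient Alignment of 3D MRI and Tabular Data using Domain-Specific Foundation Models | Proposes an efficient method for aligning MRI and tabular data based on domain-specific 3D foundation models. | foundation model | |
| 12 | GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing | GeoPixel: the first large multimodal model for pixel-level grounding in remote sensing, supporting interactive mask generation. | multimodal | |
| 13 | EchoVideo: Identity-Preserving Human Video Generation by Multimodal Feature Fusion | EchoVideo: identity-preserving human video generation via multimodal feature fusion. | multimodal | |
| 14 | MetaWild: A Multimodal Dataset for Animal Re-Identification with Environmental Metadata | MetaWild: proposes a multimodal animal re-identification dataset with environmental metadata, along with a meta-feature adapter. | multimodal | |
| 15 | ReasVQA: Advancing VideoQA with Imperfect Reasoning Process | ReasVQA: improves video question answering by using imperfect reasoning processes. | large language model, multimodal | |
| 16 | Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge | Proposes the StreamChat framework, which enables streaming video understanding and multi-round interaction through memory-enhanced knowledge. | large language model, multimodal | |
| 17 | Eye Gaze as a Signal for Conveying User Attention in Contextual AI Systems | Uses eye tracking as a signal of user attention in contextual AI systems. | multimodal | |
| 18 | Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos | Proposes Video-MMMU to evaluate the ability of multimodal models to acquire knowledge from professional videos. | multimodal | |
| 19 | MPG-SAM 2: Adapting SAM 2 with Mask Priors and Global Context for Referring Video Object Segmentation | MPG-SAM 2: improves SAM 2 with mask priors and global context for referring video object segmentation. | multimodal | |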

🔬 Pillar 3: Perception & Semantics (5 papers)

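| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 20 | PromptMono: Cross Prompting Attention for Self-Supervised Monocular Depth Estimation in Challenging Environments | PromptMono: uses cross prompting attention to improve monocular depth estimation in challenging environments. | depth estimation, monocular depth | |
| 21 | GoDe: Gaussians on Demand for Progressive Level of Detail and Scalable Compression | Proposes GoDe, a Gaussians-on-demand method for progressive level of detail and scalable compression. | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 22 | GC-ConsFlow: Leveraging Optical Flow Residuals and Global Context for Robust Deepfake Detection | GC-ConsFlow: leverages optical flow residuals and global context to make deepfake detection more robust. | optical flow, spatiotemporal | |
| 23 | Deblur-Avatar: Animatable Avatars from Motion-Blurred Monocular Videos | Deblur-Avatar: reconstructs animatable, high-fidelity 3D human avatars from motion-blurred monocular videos. | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 24 | Symmetrization Weighted Binary Cross-Entropy: Modeling Perceptual Asymmetry for Human-Consistent Neural Edge Detection | Proposes the SWBCE loss function to address perceptual asymmetry in edge detection. | scene understanding | |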

🔬 Pillar 1: Robot Control (3 papers)

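| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 25 | LLM-guided Instance-level Image Manipulation with Diffusion U-Net Cross-Attention Maps | Proposes an LLM-guided instance-level image manipulation method that uses diffusion U-Net cross-attention maps for precise editing. | manipulation, open-vocabulary, open vocabulary | |
| 26 | Integrating Persian Lip Reading in Surena-V Humanoid Robot for Human-Robot Interaction | Integrates Persian lip reading into the Surena-V robot to improve human-robot interaction. | humanoid, humanoid robot | |
| 27 | mmEgoHand: Egocentric Hand Pose Estimation and Gesture Recognition with Head-mounted Millimeter-wave Radar and IMU | Proposes mmEgoHand, which uses a head-mounted millimeter-wave radar and IMU for hand pose estimation and gesture recognition. | teleoperation, egocentric | |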

🔬 Pillar 8: Physics-based Animation (2 papers)

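| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 28 | EventVL: Understand Event Streams via Multimodal Large Language Model | Proposes EventVL, the first generative multimodal large language model for event cameras, for explicit semantic understanding. | spatiotemporal, large language model, multimodal | |
| 29 | Towards Robust Multimodal Open-set Test-time Adaptation via Adaptive Entropy-aware Optimization | Proposes the AEO framework to address multimodal open-set test-time adaptation, improving discrimination of unknown-class samples. | AMP, multimodal | |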

🔬 Pillar 7: Motion Retargeting (1 paper)

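| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 30 | ME-CPT: Multi-Task Enhanced Cross-Temporal Point Transformer for Urban 3D Change Detection | Proposes ME-CPT for urban 3D change detection, improving extraction of semantic change features from multi-temporal point clouds. | spatial relationship, spatiotemporal | |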

🔬 Pillar 4: Generative Motion (1 paper)

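| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 31 | Implicit Neural Surface Deformation with Explicit Velocity Fields | Proposes an unsupervised neural implicit surface deformation method based on explicit velocity fields. | physically plausible | |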
