cs.CV (2025-03-14)

📊 53 papers in total | 🔗 15 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (16 🔗5) · Pillar 3: Perception & Semantics (11 🔗1) · Pillar 2: RL & Architecture (8 🔗4) · Pillar 8: Physics-based Animation (6 🔗2) · Pillar 1: Robot Control (5) · Pillar 7: Motion Retargeting (2 🔗1) · Pillar 4: Generative Motion (2 🔗2) · Pillar 6: Video Extraction (2) · Pillar 5: Interaction & Reaction (1)

🔬 Pillar 9: Embodied Foundation Models (16 papers)

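# | Title | One-line summary | Tags | 🔗
1 | A Framework for a Capability-driven Evaluation of Scenario Understanding for Multimodal Large Language Models in Autonomous Driving | Proposes a capability-driven evaluation framework for assessing how well multimodal large language models understand scenarios in autonomous driving | large language model, multimodal
2 | Towards a Unified Copernicus Foundation Model for Earth Vision | Proposes Copernicus-FM, a unified Earth vision foundation model that supports multimodal remote sensing data processing | foundation model, multimodal
3 | VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity | Proposes the VERIFY benchmark for evaluating the visual explanation ability of multimodal reasoning | large language model, multimodal
4 | MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens | Proposes MMS-LLaMA to address the computational efficiency problem in multimodal speech recognition | large language model, multimodal
5 | BannerAgency: Advertising Banner Design with Multimodal LLM Agents | Proposes BannerAgency, a fully automated advertising banner design framework based on multimodal LLM agents | large language model, multimodal
6 | Multimodal-Aware Fusion Network for Referring Remote Sensing Image Segmentation | Proposes MAFN, a multimodal-aware fusion network, for referring segmentation of remote sensing images | multimodal
7 | Perceive, Understand and Restore: Real-World Image Super-Resolution with Autoregressive Multimodal Generative Models | Proposes the PURE model, which uses an autoregressive multimodal generative model for robust real-world image super-resolution | multimodal
8 | Falcon: A Remote Sensing Vision-Language Foundation Model (Technical Report) | Falcon: a vision-language foundation model for remote sensing that handles multiple tasks in a unified way | foundation model
9 | Towards Privacy-preserved Pre-training of Remote Sensing Foundation Models with Federated Mutual-guidance Learning | Proposes the FedSense framework, which uses federated mutual-guidance learning for privacy-preserving pre-training of remote sensing foundation models | foundation model
10 | OmniDiff: A Comprehensive Benchmark for Fine-grained Image Difference Captioning | OmniDiff: a comprehensive benchmark for fine-grained image difference captioning, together with the proposed M$^3$Diff model | large language model, multimodal
11 | PARIC: Probabilistic Attention Regularization for Language Guided Image Classification from Pre-trained Vison Language Models | PARIC: proposes probabilistic attention regularization to improve the performance of pre-trained vision-language models on language-guided image classification | foundation model
12 | SpaceSeg: A High-Precision Intelligent Perception Segmentation Method for Multi-Spacecraft On-Orbit Targets | SpaceSeg: a high-precision intelligent perception segmentation method for multi-spacecraft on-orbit targets | foundation model
13 | Solution for 8th Competition on Affective & Behavior Analysis in-the-wild | Proposes an AU detection method based on audio-visual multimodal fusion to improve facial action unit recognition accuracy in the wild | multimodal
14 | Pruning the Paradox: How CLIP's Most Informative Heads Enhance Performance While Amplifying Bias | Proposes the Concept Consistency Score (CCS), revealing the intrinsic link between CLIP's performance and social bias | foundation model
15 | Augmenting Image Annotation: A Human-LMM Collaborative Framework for Efficient Object Selection and Label Generation | Proposes a human-LMM collaborative framework that improves image annotation efficiency and reduces annotator fatigue | multimodal
16 | Fine-Grained Instruction-Guided Graph Reasoning for Vision-and-Language Navigation | Proposes the OIKG framework, which improves vision-and-language navigation performance through fine-grained instruction-guided graph reasoning | VLN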

🔬 Pillar 3: Perception & Semantics (11 papers)

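# | Title | One-line summary | Tags | 🔗
17 | EgoSplat: Open-Vocabulary Egocentric Scene Understanding with Language Embedded 3D Gaussian Splatting | EgoSplat: open-vocabulary egocentric scene understanding based on language-embedded 3D Gaussian splatting | 3D gaussian splatting, gaussian splatting, splatting
18 | Industrial-Grade Sensor Simulation via Gaussian Splatting: A Modular Framework for Scalable Editing and Full-Stack Validation | Proposes an industrial-grade sensor simulation framework based on Gaussian splatting for full-stack validation of autonomous driving systems | gaussian splatting, splatting, NeRF
19 | Uncertainty-Aware Normal-Guided Gaussian Splatting for Surface Reconstruction from Sparse Image Sequences | Proposes UNG-GS, which uses uncertainty-aware, normal-guided Gaussian splatting for surface reconstruction from sparse image sequences | 3D gaussian splatting, 3DGS, gaussian splatting
20 | Cyclic Contrastive Knowledge Transfer for Open-Vocabulary Object Detection | Proposes cyclic contrastive knowledge transfer (CCKT-Det) for open-vocabulary object detection without additional supervision | open-vocabulary, open vocabulary
21 | VGGT: Visual Geometry Grounded Transformer | VGGT: a visual-geometry-grounded Transformer that infers a scene's 3D attributes from multi-view images in a single pass | depth estimation, VGGT
22 | Seeing and Seeing Through the Glass: Real and Synthetic Data for Multi-Layer Depth Estimation | Proposes the LayeredDepth dataset to tackle multi-layer depth estimation for transparent objects | depth estimation
23 | Simulating Dual-Pixel Images From Ray Tracing For Depth Estimation | Proposes Sdirt, a ray-tracing-based dual-pixel image simulation method, to improve the generalization of depth estimation models to real data | depth estimation
24 | NF-SLAM: Effective, Normalizing Flow-supported Neural Field representations for object-level visual SLAM in automotive applications | Proposes NF-SLAM to address object-level visual SLAM in automotive applications | visual SLAM
25 | Road Rage Reasoning with Vision-language Models (VLMs): Task Definition and Evaluation Dataset | Proposes a road rage reasoning task and dataset based on vision-language models for proactively preventing driving risks | scene understanding, spatial relationship
26 | EMoTive: Event-guided Trajectory Modeling for 3D Motion Estimation | EMoTive: proposes event-guided trajectory modeling to improve 3D motion estimation accuracy | optical flow
27 | FG-DFPN: Flow Guided Deformable Frame Prediction Network | Proposes FG-DFPN, which uses optical-flow-guided deformable convolution for video frame prediction, significantly improving prediction accuracy | optical flow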

🔬 Pillar 2: RL & Architecture (8 papers)

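# | Title | One-line summary | Tags | 🔗
28 | DecAlign: Hierarchical Cross-Modal Alignment for Decoupled Multimodal Representation Learning | DecAlign: decouples multimodal representation learning through hierarchical cross-modal alignment to improve modality fusion | representation learning, multimodal
29 | A Survey on Self-supervised Contrastive Learning for Multimodal Text-Image Analysis | Survey: applications of and advances in self-supervised contrastive learning for multimodal text-image analysis | contrastive learning, multimodal
30 | Towards General Multimodal Visual Tracking | Proposes QuadFusion, which uses multi-scale Mamba to fuse four modalities (RGB, thermal infrared, event, and language) for general visual tracking | Mamba, multimodal
31 | FMNet: Frequency-Assisted Mamba-Like Linear Attention Network for Camouflaged Object Detection | Proposes FMNet, a frequency-assisted Mamba-like linear attention network for camouflaged object detection | Mamba, linear attention
32 | Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers | Vamba: a hybrid Mamba-Transformer model for understanding hour-long videos | Mamba, multimodal
33 | MAVFlow: Preserving Paralinguistic Elements with Conditional Flow Matching for Zero-Shot AV2AV Multilingual Translation | Proposes MAVFlow, which uses conditional flow matching for zero-shot AV2AV multilingual translation while preserving speaker consistency | flow matching, multimodal
34 | Breaking Shallow Limits: Task-Driven Pixel Fusion for Gap-free RGBT Tracking | Proposes TPF to address the modality gap in RGBT tracking | Mamba, representation learning, distillation
35 | GaussianIP: Identity-Preserving Realistic 3D Human Generation via Human-Centric Diffusion Prior | GaussianIP: identity-preserving, realistic 3D human generation via a human-centric diffusion prior | distillation, mutual attention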

🔬 Pillar 8: Physics-based Animation (6 papers)

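# | Title | One-line summary | Tags | 🔗
36 | FastVID: Dynamic Density Pruning for Fast Video Large Language Models | FastVID: dynamic density pruning for fast video large language models | spatiotemporal, large language model
37 | Cafe-Talk: Generating 3D Talking Face Animation with Multimodal Coarse- and Fine-grained Control | Cafe-Talk: proposes a 3D talking face animation generation method with multimodal coarse- and fine-grained control | spatiotemporal, multimodal
38 | LLaVA-MLB: Mitigating and Leveraging Attention Bias for Training-Free Video LLMs | LLaVA-MLB: mitigates and leverages attention bias for training-free video LLMs | spatiotemporal, large language model
39 | Remote Photoplethysmography in Real-World and Extreme Lighting Scenarios | Proposes an end-to-end video transformer model for remote photoplethysmography under extreme lighting | spatiotemporal
40 | GMG: A Video Prediction Method Based on Global Focus and Motion Guided | Proposes the GMG model, which improves video prediction accuracy through global focus and motion guidance, particularly for meteorological data | spatiotemporal
41 | Non Line-of-Sight Optical Wireless Communication using Neuromorphic Cameras | Uses neuromorphic cameras for non-line-of-sight optical wireless communication | PULSE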

🔬 Pillar 1: Robot Control (5 papers)

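# | Title | One-line summary | Tags | 🔗
42 | Advancing 3D Gaussian Splatting Editing with Complementary and Consensus Information | Proposes a 3D Gaussian splatting editing framework based on complementary and consensus information, improving visual fidelity and consistency | manipulation, 3D gaussian splatting, 3DGS
43 | TASTE-Rob: Advancing Video Generation of Task-Oriented Hand-Object Interaction for Generalizable Robotic Manipulation | TASTE-Rob: task-oriented hand-object interaction video generation for generalizable robotic manipulation | manipulation, imitation learning, Ego4D
44 | Disentangled Object-Centric Image Representation for Robotic Manipulation | Proposes DOCIR, a disentangled object-centric image representation for robotic manipulation that improves manipulation skill learning in multi-object environments | manipulation, zero-shot transfer
45 | Safe Vision-Language Models via Unsafe Weights Manipulation | Proposes UWM, which improves the safety of vision-language models by manipulating unsafe weights while preserving knowledge | manipulation
46 | EmoAgent: A Multi-Agent Framework for Diverse Affective Image Manipulation | EmoAgent: a multi-agent framework for generating diverse affective image edits | manipulation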

🔬 Pillar 7: Motion Retargeting (2 papers)

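# | Title | One-line summary | Tags | 🔗
47 | Open3D-VQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space | Proposes Open3D-VQA, a benchmark for evaluating the spatial reasoning ability of multimodal large language models in open space | spatial relationship, large language model, multimodal
48 | V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning | V-STaR: a benchmark for evaluating the spatio-temporal reasoning ability of video large language models | spatial relationship, large language model, chain-of-thought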

🔬 Pillar 4: Generative Motion (2 papers)

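# | Title | One-line summary | Tags | 🔗
49 | HiTVideo: Hierarchical Tokenizers for Enhancing Text-to-Video Generation with Autoregressive Large Language Models | HiTVideo: hierarchical tokenizers for enhancing text-to-video generation with autoregressive large language models | VQ-VAE, spatiotemporal, large language model
50 | ACMo: Attribute Controllable Motion Generation | Proposes ACMo for attribute-controllable motion generation, addressing the limited control precision and poor generalization of existing methods | text-to-motion, motion generation, multimodal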

🔬 Pillar 6: Video Extraction (2 papers)

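# | Title | One-line summary | Tags | 🔗
51 | Bring Your Rear Cameras for Egocentric 3D Human Pose Estimation | Proposes a Transformer-based method that uses rear cameras to improve egocentric 3D human pose estimation accuracy | egocentric
52 | Human-in-the-Loop Local Corrections of 3D Scene Layouts via Infilling | Proposes a human-in-the-loop method for local correction of 3D scene layouts, improving accuracy through infilling-based editing | egocentric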

🔬 Pillar 5: Interaction & Reaction (1 paper)

# | Title | One-line summary | Tags | 🔗
53 | Provenance Detection for AI-Generated Images: Combining Perceptual Hashing, Homomorphic Encryption, and AI Detection Models | Proposes a three-part framework to address provenance detection for AI-generated images | OMOMO
