cs.CV (2025-03-31)

📊 48 papers in total | 🔗 13 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (25 🔗6) · Pillar 3: Perception & Semantics (12 🔗5) · Pillar 2: RL & Architecture (6 🔗1) · Pillar 5: Interaction & Reaction (2 🔗1) · Pillar 8: Physics-based Animation (2) · Pillar 6: Video Extraction (1)

🔬 Pillar 9: Embodied Foundation Models (25 papers)

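# | Title | One-line summary | Tags
1 | DeepDubber-V1: Towards High Quality and Dialogue, Narration, Monologue Adaptive Movie Dubbing Via Multi-Modal Chain-of-Thoughts Reasoning Guidance | Proposes DeepDubber-V1, which guides movie dubbing with multimodal chain-of-thought reasoning, improving quality and adapting to different styles. | large language model, multimodal, chain-of-thought
2 | FlexiMo: A Flexible Remote Sensing Foundation Model | FlexiMo: a flexible remote sensing foundation model that adapts to arbitrary spatial resolutions. | foundation model, multimodal
3 | Context-Independent OCR with Multimodal LLMs: Effects of Image Resolution and Visual Complexity | Studies how image resolution and visual complexity affect multimodal LLMs in context-independent OCR. | large language model, multimodal
4 | Leveraging Diffusion Model and Image Foundation Model for Improved Correspondence Matching in Coronary Angiography | Uses a diffusion model and an image foundation model to improve correspondence matching in coronary angiography. | foundation model
5 | Adapting Vision Foundation Models for Real-time Ultrasound Image Segmentation | Proposes an adapted vision foundation model for real-time ultrasound image segmentation. | foundation model
6 | PathOrchestra: A Comprehensive Foundation Model for Computational Pathology with Over 100 Diverse Clinical-Grade Tasks | PathOrchestra: a comprehensive foundation model for computational pathology that supports over 100 clinical tasks. | foundation model
7 | Can Test-Time Scaling Improve World Foundation Model? | Proposes the SWIFT framework, which improves world foundation model performance by scaling test-time compute. | foundation model
8 | FakeScope: Large Multimodal Expert Model for Transparent AI-Generated Image Forensics | Proposes FakeScope, a large multimodal expert model for transparent AI-generated image forensics. | multimodal
9 | MB-ORES: A Multi-Branch Object Reasoner for Visual Grounding in Remote Sensing | Proposes MB-ORES for visual grounding in remote sensing imagery based on multi-branch object reasoning. | visual grounding
10 | Foundation Models For Seismic Data Processing: An Extensive Review | Evaluates the potential of models pretrained on natural images for seismic data processing. | foundation model
11 | AI-Assisted Colonoscopy: Polyp Detection and Segmentation using Foundation Models | Uses foundation models for AI-assisted polyp detection and segmentation in colonoscopy. | foundation model
12 | IMPACT: A Generic Semantic Loss for Multimodal Medical Image Registration | IMPACT: a generic semantic loss function for multimodal medical image registration. | multimodal
13 | PolypSegTrack: Unified Foundation Model for Colonoscopy Video Analysis | PolypSegTrack: a unified foundation model for colonoscopy video analysis that performs polyp detection, segmentation, classification, and tracking. | foundation model
14 | HumanAesExpert: Advancing a Multi-Modality Foundation Model for Human Image Aesthetic Assessment | Proposes HumanAesExpert to address human image aesthetic assessment. | foundation model
15 | STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding? | STI-Bench: evaluates the spatial-temporal understanding capabilities of multimodal large language models. | embodied AI, large language model, multimodal
16 | Any2Caption:Interpreting Any Condition to Caption for Controllable Video Generation | Any2Caption: proposes a controllable video generation framework that uses a multimodal large language model to convert arbitrary conditions into detailed captions. | large language model, multimodal
17 | Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs | Chapter-Llama uses LLMs to efficiently handle chapter segmentation and title generation for long videos. | large language model, TAMP
18 | H2VU-Benchmark: A Comprehensive Benchmark for Hierarchical Holistic Video Understanding | Proposes the H2VU benchmark for comprehensively evaluating hierarchical holistic video understanding, especially for long videos and online streaming. | large language model, multimodal
19 | COSMO: Combination of Selective Memorization for Low-cost Vision-and-Language Navigation | Proposes the COSMO model, which uses selective memorization to lower the computational cost of vision-and-language navigation and improve performance. | VLN
20 | Towards Understanding How Knowledge Evolves in Large Vision-Language Models | Reveals the trajectory of knowledge evolution in large vision-language models, offering a new perspective on their internal mechanisms. | multimodal
21 | Style Quantization for Data-Efficient GAN Training | Proposes SQ-GAN, which uses style quantization to improve GAN training in data-scarce settings. | foundation model
22 | It's a (Blind) Match! Towards Vision-Language Correspondence without Parallel Data | Proposes a vision-language correspondence method that needs no parallel data, exploring unsupervised matching of model representations. | foundation model
23 | Boosting MLLM Reasoning with Text-Debiased Hint-GRPO | Proposes Hint-GRPO, which uses a text-debiased hint mechanism to improve MLLM performance on complex multimodal reasoning tasks. | multimodal
24 | MGD-SAM2: Multi-view Guided Detail-enhanced Segment Anything Model 2 for High-Resolution Class-agnostic Segmentation | MGD-SAM2: a multi-view guided, detail-enhanced SAM2 model for high-resolution class-agnostic segmentation. | foundation model
25 | Short-video Propagation Influence Rating: A New Real-world Dataset and A New Large Graph Model | Proposes the XS-Video dataset and the NetGPT model for rating the cross-platform propagation influence of short videos. | large language model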

🔬 Pillar 3: Perception & Semantics (12 papers)

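# | Title | One-line summary | Tags
26 | ExScene: Free-View 3D Scene Reconstruction with Gaussian Splatting from a Single Image | ExScene: free-view 3D scene reconstruction from a single image using Gaussian splatting. | depth estimation, 3D gaussian splatting, 3DGS
27 | LITA-GS: Illumination-Agnostic Novel View Synthesis via Reference-Free 3D Gaussian Splatting and Physical Priors | Proposes LITA-GS to address view synthesis under adverse lighting conditions. | 3D gaussian splatting, 3DGS, gaussian splatting
28 | StochasticSplats: Stochastic Rasterization for Sorting-Free 3D Gaussian Splatting | Proposes StochasticSplats, which uses stochastic rasterization to accelerate sorting-free 3D Gaussian splatting rendering. | 3D gaussian splatting, 3DGS, gaussian splatting
29 | DiET-GS: Diffusion Prior and Event Stream-Assisted Motion Deblurring 3D Gaussian Splatting | DiET-GS: motion-deblurring 3D Gaussian splatting assisted by a diffusion prior and event streams. | 3D gaussian splatting, 3DGS, gaussian splatting
30 | SonarSplat: Novel View Synthesis of Imaging Sonar via Gaussian Splatting | SonarSplat: a Gaussian splatting method for novel view synthesis of underwater imaging sonar that effectively models acoustic streaking. | gaussian splatting, splatting
31 | Free360: Layered Gaussian Splatting for Unbounded 360-Degree View Synthesis from Extremely Sparse and Unposed Views | Free360: proposes layered Gaussian splatting for novel view synthesis of unbounded 360-degree scenes from extremely sparse, unposed views. | gaussian splatting, splatting
32 | NeRF-Based defect detection | Proposes a NeRF-based automated defect detection framework for accurate, safe inspection of large machinery. | NeRF, neural radiance field
33 | ERUPT: Efficient Rendering with Unposed Patch Transformer | ERUPT: an efficient novel view synthesis method based on an unposed patch transformer. | gaussian splatting, splatting, NeRF
34 | Visual Acoustic Fields | Proposes Visual Acoustic Fields, which use 3DGS to bridge impact sounds and visual signals in 3D space. | 3D gaussian splatting, 3DGS, gaussian splatting
35 | Detail-aware multi-view stereo network for depth estimation | Proposes DA-MVSNet to address inaccurate depth estimation at object boundaries and in detailed regions in multi-view stereo. | depth estimation, metric depth
36 | SuperEvent: Cross-Modal Learning of Event-based Keypoint Detection for SLAM | SuperEvent: proposes an event-camera keypoint detection method based on cross-modal learning, for use in SLAM. | visual SLAM, feature matching
37 | Easi3R: Estimating Disentangled Motion from DUSt3R Without Training | Easi3R: disentangles motion information from DUSt3R without training to enable dynamic scene reconstruction. | optical flow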

🔬 Pillar 2: RL & Architecture (6 papers)

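# | Title | One-line summary | Tags
38 | Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 | SEED-Bench-R1: explores the effectiveness of reinforcement learning in post-training multimodal large models for video understanding. | reinforcement learning, large language model, multimodal
39 | Crossmodal Knowledge Distillation with WordNet-Relaxed Text Embeddings for Robust Image Classification | Proposes a cross-modal knowledge distillation framework based on WordNet-relaxed text embeddings to improve image classification robustness. | distillation, multimodal
40 | SALT: A Flexible Semi-Automatic Labeling Tool for General LiDAR Point Clouds with Cross-Scene Adaptability and 4D Consistency | Proposes SALT, a flexible semi-automatic labeling tool for LiDAR point clouds with cross-scene adaptability and 4D consistency. | distillation, foundation model
41 | AMMSM: Adaptive Motion Magnification and Sparse Mamba for Micro-Expression Recognition | Proposes the AMMSM framework, which improves micro-expression recognition accuracy through adaptive motion magnification and a sparse Mamba model. | Mamba
42 | HumanDreamer: Generating Controllable Human-Motion Videos via Decoupled Generation | HumanDreamer: proposes a decoupled generation framework that generates controllable, text-driven human-motion videos. | dreamer
43 | CIBR: Cross-modal Information Bottleneck Regularization for Robust CLIP Generalization | Proposes CIBR, which strengthens CLIP generalization through cross-modal information bottleneck regularization. | representation learning, contrastive learning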

🔬 Pillar 5: Interaction & Reaction (2 papers)

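# | Title | One-line summary | Tags
44 | HOIGen-1M: A Large-scale Dataset for Human-Object Interaction Video Generation | Proposes HOIGen-1M, a large-scale dataset that improves the accuracy of human-object interactions in text-to-video generation. | human-object interaction, HOI, large language model
45 | AMB-FHE: Adaptive Multi-biometric Fusion with Fully Homomorphic Encryption | Proposes AMB-FHE, an adaptive multi-biometric fusion method based on fully homomorphic encryption that improves privacy protection and system flexibility. | OMOMO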

🔬 Pillar 8: Physics-based Animation (2 papers)

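# | Title | One-line summary | Tags
46 | XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery? | XLRS-Bench: evaluates how well multimodal LLMs understand ultra-high-resolution remote sensing imagery. | spatiotemporal, large language model, multimodal
47 | An Explainable Neural Radiomic Sequence Model with Spatiotemporal Continuity for Quantifying 4DCT-based Pulmonary Ventilation | Proposes an explainable neural radiomic model with enhanced temporal saliency for quantifying 4DCT-based pulmonary ventilation. | spatiotemporal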

🔬 Pillar 6: Video Extraction (1 paper)

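# | Title | One-line summary | Tags
48 | LiM-Loc: Visual Localization with Dense and Accurate 3D Reference Maps Directly Corresponding 2D Keypoints to 3D LiDAR Point Clouds | LiM-Loc: proposes a visual localization method that directly matches 2D keypoints to 3D LiDAR point clouds to build dense, accurate 3D reference maps. | feature matching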
