cs.CV (2025-03-27)

📊 51 papers in total | 🔗 14 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (22 🔗7) · Pillar 3: Perception & Semantics (11 🔗3) · Pillar 2: RL & Architecture (7 🔗2) · Pillar 6: Video Extraction (4 🔗1) · Pillar 8: Physics-based Animation (3 🔗1) · Pillar 1: Robot Control (2) · Pillar 4: Generative Motion (2)

🔬 Pillar 9: Embodied Foundation Models (22 papers)

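# | Title | One-sentence summary | Tags | 🔗
1 | Harmonizing Visual Representations for Unified Multimodal Understanding and Generation | Proposes Harmon, a unified autoregressive framework for multimodal understanding and generation tasks. | multimodal |
2 | On Large Multimodal Models as Open-World Image Classifiers | Evaluates the performance and challenges of large multimodal models in open-world image classification. | multimodal |
3 | PS-ReID: Advancing Person Re-Identification and Precise Segmentation with Multimodal Retrieval | PS-ReID: combines image-text multimodal retrieval for more precise person re-identification and segmentation. | multimodal |
4 | Multimodal surface defect detection from wooden logs for sawing optimization | Proposes a multimodal-fusion method for detecting knots on wood surfaces to optimize sawing. | multimodal |
5 | HyperFree: A Channel-adaptive and Tuning-free Foundation Model for Hyperspectral Remote Sensing Imagery | Proposes HyperFree, a channel-adaptive, tuning-free foundation model for hyperspectral remote sensing imagery. | foundation model |
6 | AdaMHF: Adaptive Multimodal Hierarchical Fusion for Survival Prediction | Proposes AdaMHF, adaptive multimodal hierarchical fusion that improves survival prediction accuracy, especially when data are missing. | multimodal |
7 | iMedImage Technical Report | iMedImage: an end-to-end multimodal foundation model for general medical image recognition that improves chromosomal abnormality detection accuracy. | foundation model, multimodal, chain-of-thought |
8 | Online Reasoning Video Segmentation with Just-in-Time Digital Twins | Proposes an online reasoning video segmentation framework based on just-in-time digital twins, addressing the limited reasoning ability and fine-tuning dependence of existing methods. | embodied AI, large language model, multimodal |
9 | FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs | FaceBench: a multi-view, multi-level facial attribute VQA dataset for evaluating face-perception multimodal large language models. | large language model, multimodal |
10 | VALLR: Visual ASR Language Model for Lip Reading | VALLR: proposes a visual ASR language model for lip reading that significantly reduces word error rate. | large language model, multimodal |
11 | InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression | InternVL-X: improves the performance and efficiency of the InternVL series through efficient visual token compression. | large language model, multimodal |
12 | Differential Evolution for Grassmann Manifold Optimization: A Projection Approach | Proposes a projection-based differential evolution algorithm for optimization problems on the Grassmann manifold. | multimodal |
13 | StarFlow: Generating Structured Workflow Outputs From Sketch Images | StarFlow: uses vision-language models to generate structured workflows from sketches. | foundation model |
14 | Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model | Proposes Mobile-VideoGPT, an efficient video understanding language model with fewer than 1 billion parameters that achieves real-time throughput. | multimodal |
15 | Stable-SCore: A Stable Registration-based Framework for 3D Shape Correspondence | Proposes the Stable-SCore framework, which achieves more robust 3D shape correspondence through stable registration. | foundation model |
16 | Comparative Analysis of Image, Video, and Audio Classifiers for Automated News Video Segmentation | Proposes deep-learning-based image, video, and audio classifiers for automated news video segmentation. | multimodal |
17 | FineCIR: Explicit Parsing of Fine-Grained Modification Semantics for Composed Image Retrieval | Proposes the FineCIR framework, which improves composed image retrieval accuracy by explicitly parsing fine-grained semantics. | multimodal |
18 | Vision-to-Music Generation: A Survey | A survey of vision-to-music generation: systematically reviews technical progress and future directions in video-to-music and image-to-music generation. | multimodal |
19 | M-DocSum: Do LVLMs Genuinely Comprehend Interleaved Image-Text in Document Summarization? | Proposes M-DocSum-Bench to evaluate how well LVLMs comprehend multimodal documents in summarization. | multimodal |
20 | Towards Generalizable Forgery Detection and Reasoning | Proposes the FakeReasoning framework, which uses multimodal large language models for generalizable forgery detection and reasoning on AI-generated images. | large language model |
21 | DSU-Net: An Improved U-Net Model Based on DINOv2 and SAM2 with Multi-scale Cross-model Feature Enhancement | DSU-Net: a U-Net with multi-scale cross-model feature enhancement that fuses DINOv2 and SAM2 to improve image segmentation performance. | foundation model |
22 | A Multi-Modal Knowledge-Enhanced Framework for Vessel Trajectory Prediction | Proposes MAKER, a multimodal knowledge-enhanced framework that improves vessel trajectory prediction accuracy. | large language model |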

🔬 Pillar 3: Perception & Semantics (11 papers)

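# | Title | One-sentence summary | Tags | 🔗
23 | X$^{2}$-Gaussian: 4D Radiative Gaussian Splatting for Continuous-time Tomographic Reconstruction | Proposes X²-Gaussian, which achieves continuous-time tomographic reconstruction through dynamic radiative Gaussian splatting. | gaussian splatting, splatting, spatiotemporal |
24 | LandMarkSystem Technical Report | LandMarkSystem: a computing framework for large-scale, high-quality 3D reconstruction and rendering. | 3D gaussian splatting, 3DGS, gaussian splatting |
25 | Frequency-Aware Gaussian Splatting Decomposition | Proposes a frequency-aware Gaussian splatting decomposition for efficient, controllable novel view synthesis. | 3D gaussian splatting, gaussian splatting, splatting |
26 | Semantic Library Adaptation: LoRA Retrieval and Fusion for Open-Vocabulary Semantic Segmentation | Proposes the Semantic Library Adaptation framework for open-vocabulary semantic segmentation. | open-vocabulary, open vocabulary |
27 | UGNA-VPR: A Novel Training Paradigm for Visual Place Recognition Based on Uncertainty-Guided NeRF Augmentation | UGNA-VPR: a new training paradigm for visual place recognition based on uncertainty-guided NeRF augmentation. | NeRF |
28 | SC-NeRF: NeRF-based Point Cloud Reconstruction using a Stationary Camera for Agricultural Applications | Proposes SC-NeRF, a stationary-camera approach to point cloud reconstruction for high-throughput plant phenotyping in agriculture. | NeRF |
29 | StyledStreets: Multi-style Street Simulator with Spatial and Temporal Consistency | StyledStreets: proposes a spatially and temporally consistent multi-style street simulator for urban scene reconstruction. | gaussian splatting, splatting, scene reconstruction |
30 | Can Video Diffusion Model Reconstruct 4D Geometry? | Sora3R: uses a video diffusion model to reconstruct dynamic 4D geometry from monocular video. | optical flow, spatiotemporal |
31 | HS-SLAM: Hybrid Representation with Structural Supervision for Improved Dense SLAM | HS-SLAM: a hybrid representation with structural supervision that improves dense SLAM performance. | NeRF |
32 | ICG-MVSNet: Learning Intra-view and Cross-view Relationships for Guidance in Multi-View Stereo | ICG-MVSNet: learns intra-view and cross-view relationships to guide multi-view stereo. | depth estimation |
33 | GenFusion: Closing the Loop between Reconstruction and Generation via Videos | GenFusion: closes the loop between reconstruction and generation via videos, bridging the gap between 3D reconstruction and generation. | scene reconstruction |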

🔬 Pillar 2: RL & Architecture (7 papers)

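# | Title | One-sentence summary | Tags | 🔗
34 | Multimodal Data Integration for Sustainable Indoor Gardening: Tracking Anyplant with Time Series Foundation Model | Uses multimodal data fusion and the time series model Anyplant to monitor plant health for sustainable indoor gardening. | MAE, foundation model, multimodal |
35 | VADMamba: Exploring State Space Models for Fast Video Anomaly Detection | Proposes VADMamba, which uses state space models to speed up video anomaly detection inference. | Mamba, state space model, optical flow |
36 | Video-R1: Reinforcing Video Reasoning in MLLMs | Video-R1: improves video reasoning in multimodal large language models through rule-based reinforcement learning. | reinforcement learning, large language model, multimodal |
37 | What Changed and What Could Have Changed? State-Change Counterfactuals for Procedure-Aware Video Representation Learning | Uses state-change descriptions and counterfactual reasoning to improve procedure-aware video representation learning. | representation learning, large language model |
38 | AssistPDA: An Online Video Surveillance Assistant for Video Anomaly Prediction, Detection, and Analysis | Proposes AssistPDA for real-time video anomaly detection. | distillation, spatiotemporal, large language model |
39 | Q-MambaIR: Accurate Quantized Mamba for Efficient Image Restoration | Proposes Q-MambaIR, an accurate quantized Mamba model for efficient image restoration. | Mamba, SSM |
40 | Delving Deep into Semantic Relation Distillation | Proposes Semantic Relation Knowledge Distillation (SeRKD), which improves model compression and generalization. | distillation |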

🔬 Pillar 6: Video Extraction (4 papers)

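# | Title | One-sentence summary | Tags | 🔗
41 | Gaze-Guided 3D Hand Motion Prediction for Detecting Intent in Egocentric Grasping Tasks | Proposes a gaze-guided 3D hand motion prediction method for intent detection in assisted grasping tasks. | egocentric |
42 | AgRowStitch: A High-fidelity Image Stitching Pipeline for Ground-based Agricultural Images | AgRowStitch: a high-fidelity image stitching pipeline for ground-based agricultural images that requires no additional data. | feature matching |
43 | Reconstructing Humans with a Biomechanically Accurate Skeleton | Proposes a single-image 3D human reconstruction method based on a biomechanical skeleton model, improving reconstruction under extreme poses. | human mesh recovery |
44 | ClimbingCap: Multi-Modal Dataset and Method for Rock Climbing in World Coordinate | Proposes ClimbingCap to address the challenges of rock climbing motion capture. | HMR |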

🔬 Pillar 8: Physics-based Animation (3 papers)

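# | Title | One-sentence summary | Tags | 🔗
45 | Uni4D: Unifying Visual Foundation Models for 4D Modeling from a Single Video | Uni4D: unifies visual foundation models for 4D modeling from a single video. | motion tracking, foundation model |
46 | CMD-HAR: Cross-Modal Disentanglement for Wearable Human Activity Recognition | Proposes the CMD-HAR model, which uses cross-modal disentanglement to address data mixing and heterogeneity in wearable human activity recognition. | spatiotemporal, multimodal |
47 | DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High-quality Human Image Animation | Proposes the DynamiCtrl framework, which improves the controllability and semantic consistency of diffusion Transformers for high-quality human image animation. | character control |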

🔬 Pillar 1: Robot Control (2 papers)

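# | Title | One-sentence summary | Tags | 🔗
48 | Semantic Consistent Language Gaussian Splatting for Point-Level Open-vocabulary Querying | Proposes semantically consistent language Gaussian splatting for point-level open-vocabulary querying. | manipulation, 3D gaussian splatting, gaussian splatting |
49 | CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models | Proposes CoT-VLA, which improves the manipulation ability of vision-language-action models through visual chain-of-thought reasoning. | manipulation, vision-language-action, VLA |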

🔬 Pillar 4: Generative Motion (2 papers)

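# | Title | One-sentence summary | Tags | 🔗
50 | ChatAnyone: Stylized Real-time Portrait Video Generation with Hierarchical Motion Diffusion Model | ChatAnyone: stylized real-time portrait video generation based on a hierarchical motion diffusion model. | motion diffusion model, motion diffusion |
51 | StyleMotif: Multi-Modal Motion Stylization using Style-Content Cross Fusion | StyleMotif: proposes a multimodal stylized motion latent diffusion model for generating stylized motion. | motion synthesis, motion generation, motion latent |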
