cs.CV (2025-10-31)

📊 36 papers in total | 🔗 8 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (16 🔗5) · Pillar 2: RL & Architecture (9 🔗1) · Pillar 3: Perception & Semantics (5 🔗2) · Pillar 4: Generative Motion (2) · Pillar 8: Physics-based Animation (2) · Pillar 5: Interaction & Reaction (1) · Pillar 7: Motion Retargeting (1)

🔬 Pillar 9: Embodied Foundation Models (16 papers)

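# | Title | One-line summary | Tags | 🔗
1 | RzenEmbed: Towards Comprehensive Multimodal Retrieval | RzenEmbed: proposes a unified multimodal embedding framework that significantly improves video and document retrieval performance | large language model, multimodal, instruction following |
2 | Image Hashing via Cross-View Code Alignment in the Age of Foundation Models | Proposes CroVCA, which achieves efficient image hashing retrieval through cross-view code alignment | foundation model, multimodal |
3 | Understanding the Implicit User Intention via Reasoning with Large Language Model for Image Editing | Proposes CIELR, which uses LLM reasoning to decompose complex image-editing instructions into simple actions, without joint fine-tuning | large language model, foundation model |
4 | Can MLLMs Read the Room? A Multimodal Benchmark for Verifying Truthfulness in Multi-Party Social Interactions | Proposes the MIVA benchmark to evaluate how well multimodal large language models detect lies in multi-party social interactions | large language model, multimodal |
5 | Sketch-to-Layout: Sketch-Guided Multimodal Layout Generation | Proposes the Sketch-to-Layout framework, which uses sketches to guide multimodal layout generation and improve the design experience | multimodal |
6 | Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum | Proposes a universal video retrieval framework that generalizes video embeddings via a synthesized multimodal pyramid curriculum | multimodal |
7 | E-MMDiT: Revisiting Multimodal Diffusion Transformer Design for Fast Image Synthesis under Limited Resources | Proposes E-MMDiT, a lightweight multimodal diffusion Transformer for fast image synthesis under limited resources | multimodal |
8 | CompAgent: An Agentic Framework for Visual Compliance Verification | Proposes CompAgent, an agentic framework for visual compliance verification that improves fine-grained reasoning | large language model, multimodal |
9 | FOCUS: Efficient Keyframe Selection for Long Video Understanding | Proposes FOCUS, an efficient keyframe selection method that improves the performance of multimodal large language models on long video understanding | large language model, multimodal |
10 | Generating Accurate and Detailed Captions for High-Resolution Images | Proposes a multi-stage pipeline that combines vision-language models, large language models, and object detection to generate more accurate and detailed captions for high-resolution images | large language model, multimodal |
11 | FLoC: Facility Location-Based Efficient Visual Token Compression for Long Video Understanding | FLoC: an efficient visual token compression method for long videos based on facility location | multimodal |
12 | NegoCollab: A Common Representation Negotiation Approach for Heterogeneous Collaborative Perception | NegoCollab: a negotiation-based common representation approach for heterogeneous collaborative perception | multimodal |
13 | MapSAM2: Adapting SAM2 for Automatic Segmentation of Historical Map Images and Time Series | MapSAM2: adapts SAM2 for automatic segmentation of historical map images and time series | foundation model |
14 | Mitigating Semantic Collapse in Partially Relevant Video Retrieval | Proposes text relevance-preserving learning and cross-branch video alignment to mitigate semantic collapse in partially relevant video retrieval | foundation model |
15 | MoRE: 3D Visual Geometry Reconstruction Meets Mixture-of-Experts | Proposes MoRE, a mixture-of-experts framework for 3D visual geometry reconstruction that improves scalability and adaptability | foundation model |
16 | Multi-Modal Feature Fusion for Spatial Morphology Analysis of Traditional Villages via Hierarchical Graph Neural Networks | Proposes a multimodal feature fusion method based on hierarchical graph neural networks for spatial morphology analysis of traditional villages | multimodal |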

🔬 Pillar 2: RL & Architecture (9 papers)

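# | Title | One-line summary | Tags | 🔗
17 | Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model | Proposes DUST, a dual-stream diffusion model that improves the performance of world models in vision-language-action models | policy learning, flow matching, world model |
18 | Object-Aware 4D Human Motion Generation | Proposes the MSDI framework, which uses motion diffusion priors to generate realistic, physically plausible 4D human motion | distillation, motion diffusion model, motion diffusion |
19 | Fusion of Multi-scale Heterogeneous Pathology Foundation Models for Whole Slide Image Analysis | FuseCPath: fuses multi-scale heterogeneous pathology foundation models for whole slide image analysis | distillation, foundation model |
20 | End-to-End Framework Integrating Generative AI and Deep Reinforcement Learning for Autonomous Ultrasound Scanning | Proposes an end-to-end framework that integrates generative adversarial networks with deep reinforcement learning for autonomous ultrasound scanning | reinforcement learning, deep reinforcement learning, DRL |
21 | ANCHOR: Integrating Adversarial Training with Hard-mined Supervised Contrastive Learning for Robust Representation Learning | Proposes the ANCHOR framework, which combines adversarial training with hard-mined supervised contrastive learning to make representation learning more robust | representation learning, contrastive learning |
22 | MambaNetLK: Enhancing Colonoscopy Point Cloud Registration with Mamba | MambaNetLK: uses a Mamba SSM to improve the accuracy and robustness of colonoscopy point cloud registration | Mamba, SSM, state space model |
23 | Context-Gated Cross-Modal Perception with Visual Mamba for PET-CT Lung Tumor Segmentation | Proposes vMambaX, which uses context-gated cross-modal perception and visual Mamba for PET-CT lung tumor segmentation | Mamba, multimodal |
24 | Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals | Proposes Phased DMD, which distills via score matching within subintervals to improve the performance and diversity of multi-step generative models | distillation |
25 | C-LEAD: Contrastive Learning for Enhanced Adversarial Defense | C-LEAD: uses contrastive learning to strengthen adversarial defense | contrastive learning |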

🔬 Pillar 3: Perception & Semantics (5 papers)

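# | Title | One-line summary | Tags | 🔗
26 | SAGS: Self-Adaptive Alias-Free Gaussian Splatting for Dynamic Surgical Endoscopic Reconstruction | Proposes SAGS to address artifacts and aliasing in dynamic surgical endoscopic reconstruction | 3D gaussian splatting, 3DGS, gaussian splatting |
27 | NAUTILUS: A Large Multimodal Model for Underwater Scene Understanding | NAUTILUS: a large multimodal model for underwater scene understanding that improves robustness on underwater tasks | scene understanding, multimodal |
28 | WildfireX-SLAM: A Large-scale Low-altitude RGB-D Dataset for Wildfire SLAM and Beyond | WildfireX-SLAM: a large-scale low-altitude RGB-D dataset for wildfire SLAM and other applications | 3D gaussian splatting, 3DGS, gaussian splatting |
29 | BeetleFlow: An Integrative Deep Learning Pipeline for Beetle Image Processing | BeetleFlow: an integrated deep learning pipeline for beetle image processing | open-vocabulary, open vocabulary |
30 | A Retrospect to Multi-prompt Learning across Vision and Language | Proposes an energy-based multi-prompt learning method that improves the generalization of vision-language pre-trained models on downstream tasks | open-vocabulary, open vocabulary |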

🔬 Pillar 4: Generative Motion (2 papers)

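# | Title | One-line summary | Tags | 🔗
31 | Towards 1000-fold Electron Microscopy Image Compression for Connectomics via VQ-VAE with Transformer Prior | Proposes an electron microscopy image compression method based on VQ-VAE with a Transformer prior, reaching compression ratios of up to 1000x | VQ-VAE |
32 | Fine-Tuning Open Video Generators for Cinematic Scene Synthesis: A Small-Data Pipeline with LoRA and Wan2.1 I2V | Proposes a LoRA fine-tuned video generation pipeline for cinematic scene synthesis that addresses the small-dataset problem | motion generation |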

🔬 Pillar 8: Physics-based Animation (2 papers)

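# | Title | One-line summary | Tags | 🔗
33 | SilhouetteTell: Practical Video Identification Leveraging Blurred Recordings of Video Subtitles | SilhouetteTell: a video identification attack that exploits blurred recordings of video subtitles | spatiotemporal |
34 | M^3Detection: Multi-Frame Multi-Level Feature Fusion for Multi-Modal 3D Object Detection with Camera and 4D Imaging Radar | M^3Detection: multi-frame, multi-level feature fusion for multi-modal 3D object detection with camera and 4D radar | spatiotemporal |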

🔬 Pillar 5: Interaction & Reaction (1 paper)

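# | Title | One-line summary | Tags | 🔗
35 | SpecAware: A Spectral-Content Aware Foundation Model for Unifying Multi-Sensor Learning in Hyperspectral Remote Sensing Mapping | SpecAware: a spectral-content-aware foundation model for unifying multi-sensor learning in hyperspectral remote sensing | HSI, foundation model |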

🔬 Pillar 7: Motion Retargeting (1 paper)

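# | Title | One-line summary | Tags | 🔗
36 | HiGS: Hierarchical Generative Scene Framework for Multi-Step Associative Semantic Spatial Composition | HiGS: a hierarchical generative scene framework for multi-step associative semantic spatial composition | spatial relationship, geometric consistency |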
