cs.CV (2026-04-13)

📊 53 papers in total | 🔗 8 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (21 🔗1) · Pillar 2: RL Algorithms & Architecture (14 🔗2) · Pillar 3: Spatial Perception & Semantics (11 🔗4) · Pillar 6: Video Extraction & Matching (2 🔗1) · Pillar 1: Robot Control (2) · Pillar 5: Interaction & Reaction (1) · Pillar 4: Generative Motion (1) · Pillar 7: Motion Retargeting (1)

🔬 Pillar 9: Embodied Foundation Models (21 papers)

| # | Title | One-line Takeaway | Tags | 🔗 |
|---|---|---|---|---|
| 1 | Empowering Video Translation using Multimodal Large Language Models | Uses multimodal LLMs for video translation, overcoming the limitations of traditional pipelines. | large language model, multimodal | |
| 2 | BoxTuning: Directly Injecting the Object Box for Multimodal Model Fine-Tuning | BoxTuning: fine-tunes multimodal models by directly injecting object-box information, improving video QA performance. | large language model, multimodal | |
| 3 | Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models | Proposes an entropy-probing diagnostic framework for pseudo-unification, revealing inconsistent information flow in unified multimodal models. | large language model, multimodal | |
| 4 | LARY: A Latent Action Representation Yielding Benchmark for Generalizable Vision-to-Action Alignment | LARY: a latent-action-representation benchmark for generalizable vision-to-action alignment. | vision-language-action, VLA, foundation model | |
| 5 | HuiYanEarth-SAR: A Foundation Model for High-Fidelity and Low-Cost Global Remote Sensing Imagery Generation | HuiYanEarth-SAR: the first foundation model to generate high-fidelity global SAR imagery from geographic coordinates. | foundation model | |
| 6 | MedP-CLIP: Medical CLIP with Region-Aware Prompt Integration | MedP-CLIP: a medical CLIP model with region-aware prompt integration for fine-grained medical image understanding. | large language model, multimodal, zero-shot transfer | |
| 7 | Ambivalence/Hesitancy Recognition in Videos for Personalized Digital Health Interventions | Explores deep learning for recognizing ambivalence/hesitancy in videos, aimed at personalized digital health interventions. | large language model, multimodal | |
| 8 | POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs | POINTS-Long: a dual-mode visual-reasoning MLLM that addresses visual-token scalability in long-video and streaming scenarios. | large language model, multimodal | |
| 9 | MLLM-as-a-Judge Exhibits Model Preference Bias | Proposes Philautia-Eval to measure MLLM preference bias and mitigates it with the Pomms model ensemble. | large language model, multimodal | |
| 10 | Reasoning Resides in Layers: Restoring Temporal Reasoning in Video-Language Models with Layer-Selective Merging | Proposes MERIT, restoring temporal reasoning in video-language models via layer-selective model merging. | large language model, multimodal | |
| 11 | Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding | Proposes the multi-stream scene script MTSS, disentangling video information to improve MLLM performance on video understanding and generation. | large language model, multimodal | |
| 12 | rPPG-VQA: A Video Quality Assessment Framework for Unsupervised rPPG Training | Proposes rPPG-VQA, a framework that assesses video quality to improve unsupervised rPPG training. | large language model, multimodal | |
| 13 | Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding | Proposes DualComp, task-adaptive and efficient visual-token compression for ultra-high-resolution remote sensing imagery. | large language model, multimodal | |
| 14 | Test-time Scaling over Perception: Resolving the Grounding Paradox in Thinking with Images | Proposes the TTSP framework, resolving the grounding paradox in multimodal LLMs via test-time scaling over perception. | large language model, multimodal | |
| 15 | Panoptic Pairwise Distortion Graph | Proposes a region-structured distortion graph for fine-grained quality assessment of image pairs. | large language model, multimodal | |
| 16 | TraversalBench: Challenging Paths to Follow for Vision Language Models | TraversalBench: a new benchmark for evaluating vision-language models' reasoning over challenging visual paths. | multimodal, visual grounding | |
| 17 | SVD-Prune: Training-Free Token Pruning For Efficient Vision-Language Models | Proposes SVD-Prune, a training-free token-pruning method that improves vision-language model efficiency. | multimodal | |
| 18 | Sign Language Recognition in the Age of LLMs | Explores LLM capability in zero-shot sign language recognition, highlighting the importance of model scale and data diversity. | multimodal | |
| 19 | Hierarchical Textual Knowledge for Enhanced Image Clustering | Proposes KEC, using hierarchical textual knowledge to enhance image clustering. | large language model | |
| 20 | ReSpinQuant: Efficient Layer-Wise LLM Quantization via Subspace Residual Rotation Approximation | ReSpinQuant: efficient layer-wise LLM quantization via subspace residual rotation approximation. | large language model | |
| 21 | ArtiCAD: Articulated CAD Assembly Design via Multi-Agent Code Generation | ArtiCAD: articulated CAD assembly design via multi-agent code generation. | embodied AI | |

🔬 Pillar 2: RL Algorithms & Architecture (14 papers)

| # | Title | One-line Takeaway | Tags | 🔗 |
|---|---|---|---|---|
| 22 | FlowCoMotion: Text-to-Motion Generation via Token-Latent Flow Modeling | FlowCoMotion: text-to-motion generation via token-latent flow modeling. | distillation, text-to-motion, motion generation | |
| 23 | MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models | Proposes MMR-AD, a large-scale multimodal dataset for benchmarking MLLMs on general anomaly detection. | reinforcement learning, large language model, multimodal | |
| 24 | Towards Brain MRI Foundation Models for the Clinic: Findings from the FOMO25 Challenge | The FOMO25 challenge: findings toward clinical brain-MRI foundation models. | MAE, foundation model | |
| 25 | Geoparsing: Diagram Parsing for Plane and Solid Geometry with a Unified Formal Language | Proposes a unified formal language and the GDP-29K dataset, improving MLLM reasoning over plane and solid geometry. | reinforcement learning, geometric consistency, large language model | |
| 26 | OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video | Proposes OmniScript for audio-visual script generation from long-form cinematic video, improving temporal grounding and semantic accuracy. | reinforcement learning, large language model, multimodal | |
| 27 | UNIGEOCLIP: Unified Geospatial Contrastive Learning | UNIGEOCLIP: a unified geospatial contrastive-learning framework that aligns multimodal geospatial data. | representation learning, contrastive learning, multimodal | |
| 28 | Bootstrapping Video Semantic Segmentation Model via Distillation-assisted Test-Time Adaptation | Proposes DiTTA, label-free video semantic segmentation via distillation-assisted test-time adaptation. | distillation, foundation model | |
| 29 | TAG-Head: Time-Aligned Graph Head for Plug-and-Play Fine-grained Action Recognition | Proposes TAG-Head, a plug-and-play time-aligned graph head that improves fine-grained action recognition. | privileged information, optical flow, multimodal | |
| 30 | You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass | Proposes a single-forward-pass multi-response reward model, accelerating multimodal preference learning and improving open-ended generation quality. | reinforcement learning, preference learning, multimodal | |
| 31 | Scene Change Detection with Vision-Language Representation Learning | Proposes LangSCD, scene change detection with vision-language representation learning for urban monitoring and navigation. | representation learning | |
| 32 | Improving Layout Representation Learning Across Inconsistently Annotated Datasets via Agentic Harmonization | Proposes agentic harmonization to resolve inconsistent annotations across layout-analysis datasets. | representation learning | |
| 33 | Learning Long-term Motion Embeddings for Efficient Kinematics Generation | Proposes an efficient kinematics-generation method based on long-term motion embeddings, markedly improving generation efficiency. | flow matching, motion latent | |
| 34 | STS-Mixer: Spatio-Temporal-Spectral Mixer for 4D Point Cloud Video Understanding | Proposes STS-Mixer, enhancing 4D point cloud video understanding via spatio-temporal-spectral mixing. | representation learning, spatiotemporal | |
| 35 | MorphoFlow: Sparse-Supervised Generative Shape Modeling with Adaptive Latent Relevance | MorphoFlow: sparse-supervised generative shape modeling with adaptive latent relevance. | SSM, implicit representation | |

🔬 Pillar 3: Spatial Perception & Semantics (11 papers)

| # | Title | One-line Takeaway | Tags | 🔗 |
|---|---|---|---|---|
| 36 | GS4City: Hierarchical Semantic Gaussian Splatting via City-Model Priors | GS4City: hierarchical semantic Gaussian splatting with city-model priors for urban scene understanding. | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 37 | Unfolding 3D Gaussian Splatting via Iterative Gaussian Synopsis | Proposes Iterative Gaussian Synopsis, achieving compact, progressive rendering for 3D Gaussian splatting. | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 38 | Efficient Transceiver Design for Aerial Image Transmission and Large-scale Scene Reconstruction | Proposes an end-to-end 3DGS-based communication scheme that improves UAV image transmission efficiency and large-scale scene reconstruction quality. | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 39 | Seg2Change: Adapting Open-Vocabulary Semantic Segmentation Model for Remote Sensing Change Detection | Proposes Seg2Change, adapting an open-vocabulary semantic segmentation model to remote sensing change detection. | open-vocabulary, open vocabulary | |
| 40 | Naka-GS: A Bionics-inspired Dual-Branch Naka Correction and Progressive Point Pruning for Low-Light 3DGS | Proposes Naka-GS, tackling low-light 3D Gaussian splatting with dual-branch Naka correction and progressive point pruning. | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 41 | CDPR: Cross-modal Diffusion with Polarization for Reliable Monocular Depth Estimation | Proposes CDPR, a polarization-based cross-modal diffusion method that improves the reliability of monocular depth estimation. | depth estimation, monocular depth | |
| 42 | Data-Efficient Semantic Segmentation of 3D Point Clouds via Open-Vocabulary Image Segmentation-based Pseudo-Labeling | Proposes PLOVIS, using open-vocabulary image segmentation for 3D point cloud semantic segmentation under data scarcity. | open-vocabulary, open vocabulary | |
| 43 | GeomPrompt: Geometric Prompt Learning for RGB-D Semantic Segmentation Under Missing and Degraded Depth | GeomPrompt: learns geometric prompts to improve RGB-D semantic segmentation when depth is missing or degraded. | monocular depth, embodied AI, multimodal | |
| 44 | Do Thought Streams Matter? Evaluating Reasoning in Gemini Vision-Language Models for Video Scene Understanding | Evaluates how thought streams affect video scene understanding in Gemini vision-language models. | scene understanding | |
| 45 | LumiMotion: Improving Gaussian Relighting with Scene Dynamics | LumiMotion: improves Gaussian relighting by exploiting scene dynamics. | gaussian splatting, splatting | |
| 46 | STGV: Spatio-Temporal Hash Encoding for Gaussian-based Video Representation | Proposes STGV, spatio-temporal hash encoding that improves Gaussian-based video representation quality. | gaussian splatting, splatting | |

🔬 Pillar 6: Video Extraction & Matching (2 papers)

| # | Title | One-line Takeaway | Tags | 🔗 |
|---|---|---|---|---|
| 47 | EgoFun3D: Modeling Interactive Objects from Egocentric Videos using Function Templates | EgoFun3D: models interactive 3D objects from egocentric videos using function templates. | egocentric, embodied AI | |
| 48 | Who Handles Orientation? Investigating Invariance in Feature Matching | Investigates where rotation invariance should enter feature matching, improving multimodal and satellite image matching. | feature matching | |

🔬 Pillar 1: Robot Control (2 papers)

| # | Title | One-line Takeaway | Tags | 🔗 |
|---|---|---|---|---|
| 49 | LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation | Surveys the fusion of LMMs and object-centric vision for understanding, segmentation, editing, and generation. | manipulation, scene understanding, multimodal | |
| 50 | ReXSonoVQA: A Video QA Benchmark for Procedure-Centric Ultrasound Understanding | Proposes ReXSonoVQA, a video QA benchmark for procedure-centric ultrasound understanding. | manipulation | |

🔬 Pillar 5: Interaction & Reaction (1 paper)

| # | Title | One-line Takeaway | Tags | 🔗 |
|---|---|---|---|---|
| 51 | OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation | OmniShow: a framework unifying multimodal conditions for human-object interaction video generation. | human-object interaction, multimodal | |

🔬 Pillar 4: Generative Motion (1 paper)

| # | Title | One-line Takeaway | Tags | 🔗 |
|---|---|---|---|---|
| 52 | LiveGesture: Streamable Co-Speech Gesture Generation Model | Proposes LiveGesture, the first zero-latency, arbitrary-length streaming framework for speech-driven full-body gesture generation. | motion generation, motion tokenizer | |

🔬 Pillar 7: Motion Retargeting (1 paper)

| # | Title | One-line Takeaway | Tags | 🔗 |
|---|---|---|---|---|
| 53 | LottieGPT: Tokenizing Vector Animation for Autoregressive Generation | LottieGPT: tokenizes Lottie vector animations for autoregressive generation. | motion representation, multimodal | |
