cs.CV (2026-04-14)

📊 43 papers in total | 🔗 12 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (23 🔗8) · Pillar 3: Perception & Semantics (6 🔗2) · Pillar 1: Robot Control (5) · Pillar 2: RL & Architecture (5 🔗2) · Pillar 6: Video Extraction (2) · Pillar 4: Generative Motion (1) · Pillar 8: Physics-based Animation (1)

🔬 Pillar 9: Embodied Foundation Models (23 papers)

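# | Title | One-sentence summary | Tags | 🔗
1 | CLASP: Class-Adaptive Layer Fusion and Dual-Stage Pruning for Multimodal Large Language Models | CLASP: class-adaptive layer fusion and dual-stage pruning for multimodal large language models | large language model, multimodal |
2 | All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding | Proposes a unified synthetic data pipeline to address data scarcity in multimodal video understanding | large language model, multimodal, visual grounding |
3 | Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models | Proposes Chain-of-Models Pre-Training (CoM-PT), which accelerates vision foundation model training without performance loss | large language model, foundation model |
4 | Probabilistic Feature Imputation and Uncertainty-Aware Multimodal Federated Aggregation | Proposes P-FIN to address missing features and uncertainty in multimodal federated learning, improving the safety of medical diagnosis | multimodal |
5 | Towards Long-horizon Agentic Multimodal Search | Proposes LMM-Searcher for long-horizon multimodal search | multimodal |
6 | Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks | Proposes the Multimodal Semantic Lighting Attack (MSLA), challenging the physical-world security of vision-language models | multimodal |
7 | AffectAgent: Collaborative Multi-Agent Reasoning for Retrieval-Augmented Multimodal Emotion Recognition | Proposes AffectAgent to resolve modality ambiguity in multimodal emotion recognition | multimodal |
8 | Brain-DiT: A Universal Multi-state fMRI Foundation Model with Metadata-Conditioned Pretraining | Brain-DiT: a universal multi-state fMRI foundation model based on metadata-conditioned diffusion pretraining | foundation model |
9 | GeoAlign: Geometric Feature Realignment for MLLM Spatial Reasoning | GeoAlign improves MLLM spatial reasoning through geometric feature realignment | large language model, foundation model, multimodal |
10 | Relaxing Anchor-Frame Dominance for Mitigating Hallucinations in Video Large Language Models | Proposes Decoder-side Temporal Rebalancing (DTR) to mitigate hallucinations in video large language models | large language model |
11 | MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models | MODIX: a training-free, multimodal information-driven positional index scaling method that improves vision-language model performance | multimodal |
12 | Boosting Visual Instruction Tuning with Self-Supervised Guidance | Proposes V-GIFT, which boosts visual instruction tuning with self-supervised guidance and strengthens MLLM visual reasoning | large language model, multimodal |
13 | Distorted or Fabricated? A Survey on Hallucination in Video LLMs | A comprehensive survey of hallucination in video large language models, with a systematic taxonomy and mitigation strategies | large language model, visual grounding |
14 | DPC-VQA: Decoupling Quality Perception and Residual Calibration for Video Quality Assessment | Proposes DPC-VQA, which decouples quality perception from residual calibration for efficient video quality assessment | large language model, multimodal |
15 | NTIRE 2026 The 3rd Restore Any Image Model (RAIM) Challenge: Professional Image Quality Assessment (Track 1) | NTIRE 2026 RAIM Challenge: exploring MLLMs for professional image quality assessment | large language model, multimodal |
16 | Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding | Proposes the DSTP framework to address the performance drop of visual token pruning on complex reasoning tasks during MLLM decoding | large language model, multimodal |
17 | Agentic Discovery with Active Hypothesis Exploration for Visual Recognition | HypoExplore: an agentic framework for visual recognition architecture discovery based on active hypothesis exploration | large language model |
18 | Don't Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs | Proposes Perception Programs (P²), which use language-native cues to improve visual tool reasoning in multimodal large language models | multimodal |
19 | OmniFood8K: Single-Image Nutrition Estimation via Hierarchical Frequency-Aligned Fusion | Proposes the OmniFood8K dataset and a single-image nutrition estimation framework, addressing the challenge of nutrition estimation for Chinese food | multimodal |
20 | Combating Pattern and Content Bias: Adversarial Feature Learning for Generalized AI-Generated Image Detection | Proposes a multi-dimensional adversarial feature learning framework that improves the generalization of AI-generated image detection | multimodal |
21 | Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors | Uses 3D foundation priors to generate realistic and consistent orbital videos of objects | foundation model |
22 | Boosting Robust AIGI Detection with LoRA-based Pairwise Training | Proposes LPT, a LoRA-based pairwise training method that improves robust detection of AI-generated images (AIGI) under complex distortions | foundation model |
23 | Redefining Quality Criteria and Distance-Aware Score Modeling for Image Editing Assessment | Proposes the DS-IEQA framework to address rigid quality criteria and distance-agnostic score modeling in image editing quality assessment | multimodal |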

🔬 Pillar 3: Perception & Semantics (6 papers)

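# | Title | One-sentence summary | Tags | 🔗
24 | PDF-GS: Progressive Distractor Filtering for Robust 3D Gaussian Splatting | Proposes PDF-GS, which achieves robust 3D Gaussian Splatting through progressive distractor filtering | 3D gaussian splatting, 3DGS, gaussian splatting |
25 | ArtifactWorld: Scaling 3D Gaussian Splatting Artifact Restoration via Video Generation Models | ArtifactWorld: scaling 3D Gaussian Splatting artifact restoration via video generation models | 3D gaussian splatting, 3DGS, gaussian splatting |
26 | ELoG-GS: Dual-Branch Gaussian Splatting with Luminance-Guided Enhancement for Extreme Low-light 3D Reconstruction | ELoG-GS: dual-branch Gaussian Splatting with luminance-guided enhancement for extreme low-light 3D reconstruction | gaussian splatting, splatting, geometric consistency |
27 | Pi-HOC: Pairwise 3D Human-Object Contact Estimation | Proposes Pi-HOC for 3D human-object contact estimation in multi-person, multi-object interaction scenes | sam 3D, SAM 3D, human-object interaction |
28 | Grasp in Gaussians: Fast Monocular Reconstruction of Dynamic Hand-Object Interactions | GraG: fast, Gaussian-based monocular 3D reconstruction of dynamic hand-object interactions | sam 3D, SAM 3D |
29 | Cross-Attentive Multiview Fusion of Vision-Language Embeddings | Proposes CAMFusion, which improves 3D scene semantic segmentation via cross-attentive multiview fusion | open-vocabulary, open vocabulary |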

🔬 Pillar 1: Robot Control (5 papers)

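# | Title | One-sentence summary | Tags | 🔗
30 | PianoFlow: Music-Aware Streaming Piano Motion Generation with Bimanual Coordination | Proposes PianoFlow for piano motion generation with bimanual coordination | bi-manual, flow matching, motion synthesis |
31 | From Attenuation to Attention: Variational Information Flow Manipulation for Fine-Grained Visual Perception | Proposes the VIF framework to address information attenuation in the fine-grained visual perception of MLLMs | manipulation, large language model, multimodal |
32 | Detecting Precise Hand Touch Moments in Egocentric Video | Proposes the HiCE module for precisely detecting hand-object contact moments in egocentric video | manipulation, egocentric, spatiotemporal |
33 | Bridging the Micro-Macro Gap: Frequency-Aware Semantic Alignment for Image Manipulation Localization | Proposes the FASA framework to bridge the micro-macro gap in image manipulation localization | manipulation |
34 | Direct Discrepancy Replay: Distribution-Discrepancy Condensation and Manifold-Consistent Replay for Continual Face Forgery Detection | Proposes Direct Discrepancy Replay to address catastrophic forgetting in continual face forgery detection | manipulation |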

🔬 Pillar 2: RL & Architecture (5 papers)

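# | Title | One-sentence summary | Tags | 🔗
35 | RSGMamba: Reliability-Aware Self-Gated State Space Model for Multimodal Semantic Segmentation | Proposes RSGMamba to address feature degradation caused by differences in modality reliability in multimodal semantic segmentation | Mamba, state space model, scene understanding |
36 | SEATrack: Simple, Efficient, and Adaptive Multimodal Tracker | SEATrack: a simple, efficient, and adaptive multimodal object tracker that improves performance and efficiency | representation learning, multimodal |
37 | Hypergraph-State Collaborative Reasoning for Multi-Object Tracking | Proposes the HyperSSM framework, which tackles motion estimation in multi-object tracking through hypergraph-state collaborative reasoning | SSM, state space model, motion estimation |
38 | Visual Preference Optimization with Rubric Rewards | Proposes the rDPO framework to optimize visual preference evaluation | DPO, direct preference optimization, multimodal |
39 | Cross-Modal Knowledge Distillation for PET-Free Amyloid-Beta Detection from MRI | Proposes PET-guided cross-modal knowledge distillation for MRI-only detection of amyloid-beta (Aβ) in Alzheimer's disease | contrastive learning, distillation |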

🔬 Pillar 6: Video Extraction (2 papers)

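# | Title | One-sentence summary | Tags | 🔗
40 | A Dataset and Evaluation for Complex 4D Markerless Human Motion Capture | Proposes a dataset and evaluation benchmark for 4D markerless human motion capture in complex scenes, addressing performance bottlenecks in real interaction scenarios | SMPL, SMPL-X, human motion |
41 | EgoEsportsQA: An Egocentric Video Benchmark for Perception and Reasoning in Esports | EgoEsportsQA: an egocentric video question-answering benchmark for esports that evaluates perception and reasoning | egocentric, large language model |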

🔬 Pillar 4: Generative Motion (1 paper)

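# | Title | One-sentence summary | Tags | 🔗
42 | Conflated Inverse Modeling to Generate Diverse and Temperature-Change Inducing Urban Vegetation Patterns | Proposes a conflated inverse modeling framework that generates diverse, controllable urban vegetation patterns to regulate temperature | physically plausible |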

🔬 Pillar 8: Physics-based Animation (1 paper)

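# | Title | One-sentence summary | Tags | 🔗
43 | VideoFlexTok: Flexible-Length Coarse-to-Fine Video Tokenization | Proposes VideoFlexTok, a flexible-length, coarse-to-fine video tokenization method that improves video generation efficiency | spatiotemporal |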
