cs.CV (2026-04-17)

📊 37 papers in total | 🔗 13 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (11 🔗2) · Pillar 3: Perception & Semantics (7 🔗2) · Pillar 2: RL & Architecture (7 🔗3) · Pillar 1: Robot Control (5 🔗2) · Pillar 6: Video Extraction (2 🔗1) · Pillar 8: Physics-based Animation (2 🔗2) · Pillar 7: Motion Retargeting (2 🔗1) · Pillar 4: Generative Motion (1)

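🔬 Pillar 9: Embodied Foundation Models (11 papers)

| # | Title | One-line summary | Tags |
|---|-------|------------------|------|
| 1 | Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs | Finds that chain-of-thought reasoning degrades the visual spatial reasoning of multimodal LLMs. | multimodal, chain-of-thought |
| 2 | neuralCAD-Edit: An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing | Proposes neuralCAD-Edit, an expert benchmark for multimodal-instructed 3D CAD model editing. | foundation model, multimodal |
| 3 | PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation | Proposes PixDLM for reasoning segmentation of UAV remote-sensing imagery and builds the DRSeg benchmark dataset. | multimodal, chain-of-thought |
| 4 | GAViD: A Large-Scale Multimodal Dataset for Context-Aware Group Affect Recognition from Videos | GAViD: a large-scale multimodal dataset for context-aware group affect recognition in videos. | multimodal |
| 5 | Beyond a Single Frame: Multi-Frame Spatially Grounded Reasoning Across Volumetric MRI | Proposes the SGMRI-VQA benchmark for evaluating the multi-frame spatial reasoning of medical VLMs on volumetric MRI. | visual grounding, chain-of-thought |
| 6 | RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees | Proposes RefereeBench to evaluate video multimodal LLMs on multi-sport refereeing tasks. | large language model, multimodal |
| 7 | SIMMER: Cross-Modal Food Image--Recipe Retrieval via MLLM-Based Embedding | SIMMER: cross-modal food image-recipe retrieval using MLLM-based embeddings. | large language model, multimodal |
| 8 | VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects | Proposes VEFX-Bench, a holistic benchmark for evaluating generic video editing and visual effects. | instruction following |
| 9 | Information Router for Mitigating Modality Dominance in Vision-Language Models | Proposes a multimodal information router (MoIR) to mitigate modality dominance in vision-language models. | large language model |
| 10 | Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap | Proposes the CrossMath benchmark, revealing the limitations of vision-language models in visual reasoning. | multimodal |
| 11 | Where Do Vision-Language Models Fail? World Scale Analysis for Image Geolocalization | Evaluates the performance and limitations of vision-language models on image geolocalization. | multimodal |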

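🔬 Pillar 3: Perception & Semantics (7 papers)

| # | Title | One-line summary | Tags |
|---|-------|------------------|------|
| 12 | Splats in Splats++: Robust and Generalizable 3D Gaussian Splatting Steganography | Proposes Splats in Splats++ for steganography in 3D Gaussian Splatting. | 3D gaussian splatting, 3DGS, 3D reconstruction |
| 13 | AdaVFM: Adaptive Vision Foundation Models for Edge Intelligence via LLM-Guided Execution | AdaVFM: LLM-guided adaptive vision foundation models for edge intelligence. | open-vocabulary, open vocabulary, large language model |
| 14 | Neural Gabor Splatting: Enhanced Gaussian Splatting with Neural Gabor for High-frequency Surface Reconstruction | Proposes Neural Gabor Splatting, which enhances 3D Gaussian Splatting for high-frequency surface reconstruction. | 3D gaussian splatting, 3DGS, 3D reconstruction |
| 15 | SENSE: Stereo OpEN Vocabulary SEmantic Segmentation | SENSE: uses stereo vision to enhance open-vocabulary semantic segmentation and improve spatial precision. | scene understanding, open-vocabulary, open vocabulary |
| 16 | CLOTH-HUGS: Cloth Aware Human Gaussian Splatting | Cloth-HUGS: cloth-aware human reconstruction based on Gaussian splatting that disentangles body and clothing. | gaussian splatting, splatting, SMPL |
| 17 | Towards Realistic Open-Vocabulary Remote Sensing Segmentation: Benchmark and Baseline | Proposes OVRSISBenchV2 for open-vocabulary remote sensing image segmentation. | open-vocabulary, open vocabulary |
| 18 | PLAF: Pixel-wise Language-Aligned Feature Extraction for Efficient 3D Scene Understanding | Proposes PLAF, a pixel-wise language-aligned feature extraction method for more efficient 3D scene understanding. | scene understanding, open-vocabulary, open vocabulary |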

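🔬 Pillar 2: RL & Architecture (7 papers)

| # | Title | One-line summary | Tags |
|---|-------|------------------|------|
| 19 | SSMamba: A Self-Supervised Hybrid State Space Model for Pathological Image Classification | SSMamba: a self-supervised hybrid state space model for pathological image classification. | Mamba, state space model, foundation model |
| 20 | Find, Fix, Reason: Context Repair for Video Reasoning | Proposes the Find, Fix, Reason framework, which improves video reasoning performance through context repair. | reinforcement learning, spatiotemporal, instruction following |
| 21 | CollideNet: Hierarchical Multi-scale Video Representation Learning with Disentanglement for Time-To-Collision Forecasting | CollideNet: hierarchical, multi-scale, disentangled video representation learning for time-to-collision forecasting. | representation learning, spatiotemporal |
| 22 | MambaBack: Bridging Local Features and Global Contexts in Whole Slide Image Analysis | MambaBack: a whole slide image analysis method that combines local features with global context. | Mamba, SSM |
| 23 | CLIMB: Controllable Longitudinal Brain Image Generation using Mamba-based Latent Diffusion Model and Gaussian-aligned Autoencoder | CLIMB: controllable longitudinal brain image generation based on Mamba and a Gaussian-aligned autoencoder. | Mamba, state space model |
| 24 | The Amazing Stability of Flow Matching | Shows that flow matching models are remarkably stable and insensitive to perturbations of data and architecture. | flow matching |
| 25 | Repurposing 3D Generative Model for Autoregressive Layout Generation | LaviGen: uses a 3D generative model for autoregressive layout generation, improving the physical plausibility of scenes and generation efficiency. | distillation, physically plausible |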

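🔬 Pillar 1: Robot Control (5 papers)

| # | Title | One-line summary | Tags |
|---|-------|------------------|------|
| 26 | Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs | Proposes the Mind's Eye benchmark to evaluate the visual abstraction, transformation, and composition abilities of multimodal LLMs. | manipulation, large language model, multimodal |
| 27 | DINOv3 Beats Specialized Detectors: A Simple Foundation Model Baseline for Image Forensics | Proposes a DINOv3-based baseline for image forensics that outperforms specialized detectors. | manipulation, foundation model |
| 28 | Continual Hand-Eye Calibration for Open-world Robotic Manipulation | Proposes a continual hand-eye calibration framework that addresses catastrophic forgetting in open-world robotic manipulation. | manipulation, distillation |
| 29 | From Competition to Coopetition: Coopetitive Training-Free Image Editing Based on Text Guidance | Proposes CoEdit, a coopetitive approach to training-free, text-guided image editing. | manipulation |
| 30 | AHS: Adaptive Head Synthesis via Synthetic Data Augmentations | Proposes AHS, which achieves adaptive head synthesis through synthetic data augmentation and addresses the limitations of existing head-swapping methods in real-world scenarios. | manipulation |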

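🔬 Pillar 6: Video Extraction (2 papers)

| # | Title | One-line summary | Tags |
|---|-------|------------------|------|
| 31 | FineCog-Nav: Integrating Fine-grained Cognitive Modules for Zero-shot Multimodal UAV Navigation | Proposes FineCog-Nav to address zero-shot challenges in UAV vision-language navigation. | egocentric, VLN, foundation model |
| 32 | Watching Movies Like a Human: Egocentric Emotion Understanding for Embodied Companions | Proposes the EgoScreen-Emotion dataset for emotion understanding by embodied agents watching movies on a screen from an egocentric view. | egocentric, multimodal |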

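🔬 Pillar 8: Physics-based Animation (2 papers)

| # | Title | One-line summary | Tags |
|---|-------|------------------|------|
| 33 | NeuroLip: An Event-driven Spatiotemporal Learning Framework for Cross-Scene Lip-Motion-based Visual Speaker Recognition | NeuroLip: an event-driven spatiotemporal learning framework for cross-scene, lip-motion-based visual speaker recognition. | spatiotemporal |
| 34 | LP$^{2}$DH: A Locality-Preserving Pixel-Difference Hashing Framework for Dynamic Texture Recognition | Proposes the LP²DH framework to address the problem of high-dimensional features in dynamic texture recognition. | spatiotemporal |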

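🔬 Pillar 7: Motion Retargeting (2 papers)

| # | Title | One-line summary | Tags |
|---|-------|------------------|------|
| 35 | AIFIND: Artifact-Aware Interpreting Fine-Grained Alignment for Incremental Face Forgery Detection | Proposes AIFIND, which performs incremental face forgery detection by aligning forgery artifacts. | geometric consistency |
| 36 | APC: Transferable and Efficient Adversarial Point Counterattack for Robust 3D Point Cloud Recognition | Proposes Adversarial Point Counterattack (APC), which improves the adversarial robustness and transferability of 3D point cloud recognition models. | geometric consistency |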

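🔬 Pillar 4: Generative Motion (1 paper)

| # | Title | One-line summary | Tags |
|---|-------|------------------|------|
| 37 | Motion-Adapter: A Diffusion Model Adapter for Text-to-Motion Generation of Compound Actions | Proposes Motion-Adapter, which addresses action coverage and attention collapse in text-to-motion generation of compound actions. | motion diffusion model, motion diffusion, text-to-motion |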
