cs.CV (2026-04-29)

📊 35 papers in total | 🔗 5 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (11 🔗1) · Pillar 3: Perception & Semantics (8 🔗1) · Pillar 2: RL & Architecture (7 🔗1) · Pillar 5: Interaction & Reaction (2) · Pillar 1: Robot Control (2) · Pillar 6: Video Extraction (2 🔗1) · Pillar 7: Motion Retargeting (2 🔗1) · Pillar 4: Generative Motion (1)

🔬 Pillar 9: Embodied Foundation Models (11 papers)

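| # | Title | One-line takeaway | Tags | 🔗 |
|---|---|---|---|---|
| 1 | CheXthought: A global multimodal dataset of clinical chain-of-thought reasoning and visual attention for chest X-ray interpretation | Proposes CheXthought to improve multimodal reasoning for chest X-ray interpretation | multimodal, chain-of-thought | |
| 2 | Three-Step Nav: A Hierarchical Global-Local Planner for Zero-Shot Vision-and-Language Navigation | Proposes Three-Step Nav to address drift and premature stopping in zero-shot vision-and-language navigation | VLN, large language model, multimodal | |
| 3 | AnimateAnyMesh++: A Flexible 4D Foundation Model for High-Fidelity Text-Driven Mesh Animation | AnimateAnyMesh++: a flexible 4D foundation model for high-fidelity text-driven mesh animation | foundation model | |
| 4 | TAP into the Patch Tokens: Leveraging Vision Foundation Model Features for AI-Generated Image Detection | Leverages vision foundation model features and proposes TAP to improve AI-generated image detection | foundation model | |
| 5 | Decoupled Prototype Matching with Vision Foundation Models for Few-Shot Industrial Object Detection | Uses vision foundation models and decoupled prototype matching for few-shot industrial object detection | foundation model | |
| 6 | FASH-iCNN: Making Editorial Fashion Identity Inspectable Through Multimodal CNN Probing | FASH-iCNN: makes editorial fashion style interpretable through multimodal CNN probing | multimodal | |
| 7 | State Beyond Appearance: Diagnosing and Improving State Consistency in Dial-Based Measurement Reading | Proposes the TriSCA framework to improve the state consistency of MLLMs in dial reading, addressing performance degradation under viewpoint and lighting changes | large language model, multimodal | |
| 8 | Adaptive Transform Coding for Semantic Compression | Proposes an adaptive transform coding method for semantic compression that improves machine vision task performance | foundation model | |
| 9 | Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners | Proposes LILA, which uses linear in-context learning to learn pixel-level features from dynamic 3D scenes | foundation model | |
| 10 | Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection | Proposes a sparse-autoencoder-based anomaly (out-of-distribution) detection method for ViTs to improve model safety | large language model | |
| 11 | Beyond Shortcuts: Mitigating Visual Illusions in Frozen VLMs via Qualitative Reasoning | Proposes the SQI framework, which uses qualitative reasoning to make frozen VLMs more perceptually robust to visual illusions | visual grounding | |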

🔬 Pillar 3: Perception & Semantics (8 papers)

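| # | Title | One-line takeaway | Tags | 🔗 |
|---|---|---|---|---|
| 12 | MesonGS++: Post-training Compression of 3D Gaussian Splatting with Hyperparameter Searching | MesonGS++: post-training compression of 3D Gaussian Splatting via hyperparameter search, significantly reducing storage cost | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 13 | EnerGS: Energy-Based Gaussian Splatting with Partial Geometric Priors | EnerGS: energy-based 3D Gaussian Splatting that uses partial geometric priors to improve reconstruction quality | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 14 | MemOVCD: Training-Free Open-Vocabulary Change Detection via Cross-Temporal Memory Reasoning and Global-Local Adaptive Rectification | Proposes MemOVCD, which achieves training-free open-vocabulary change detection through cross-temporal memory reasoning and adaptive rectification | open-vocabulary, open vocabulary, foundation model | |
| 15 | Last-Layer-Centric Feature Recombination: Unleashing 3D Geometric Knowledge in DINOv3 for Monocular Depth Estimation | Proposes the Last-Layer-Centric Feature Recombination module to improve DINOv3's use of geometric information in monocular depth estimation | depth estimation, monocular depth, foundation model | |
| 16 | Seeking Consensus: Geometric-Semantic On-the-Fly Recalibration for Open-Vocabulary Remote Sensing Semantic Segmentation | Proposes the SeeCo framework, which improves open-vocabulary semantic segmentation of remote sensing imagery through geometric-semantic consensus recalibration | open-vocabulary, open vocabulary | |
| 17 | Color-Encoded Illumination for High-Speed Volumetric Scene Reconstruction | Proposes a high-speed volumetric 3D reconstruction method based on color-encoded illumination that requires no camera hardware modification | gaussian splatting, splatting, scene reconstruction | |
| 18 | AirZoo: A Unified Large-Scale Dataset for Grounding Aerial Geometric 3D Vision | AirZoo: a unified large-scale dataset and benchmark for aerial geometric 3D vision | metric depth, Depth Anything, 3D reconstruction | |
| 19 | Semantic Foam: Unifying Spatial and Semantic Scene Decomposition | Semantic Foam: unifies spatial and semantic scene decomposition to improve interactive graphics applications | 3D gaussian splatting, gaussian splatting, splatting | |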

🔬 Pillar 2: RL & Architecture (7 papers)

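| # | Title | One-line takeaway | Tags | 🔗 |
|---|---|---|---|---|
| 20 | GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents | GLM-5V-Turbo: a native foundation model for multimodal agents | reinforcement learning, foundation model, multimodal | |
| 21 | Multiple Consistent 2D-3D Mappings for Robust Zero-Shot 3D Visual Grounding | MCM-VG: robust zero-shot 3D visual grounding through multiple consistent 2D-3D mappings | distillation, open-vocabulary, open vocabulary | |
| 22 | World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning | World2VLM: distills world-model imagination into VLMs for dynamic spatial reasoning | world model, world models, egocentric | |
| 23 | A Multimodal Pre-trained Network for Integrated EEG-Video Seizure Detection | Proposes EEGVFusion, which integrates EEG and video information to make seizure detection in mice more reliable | representation learning, multimodal | |
| 24 | $\text{PKS}^4$: Parallel Kinematic Selective State Space Scanners for Efficient Video Understanding | Proposes $\text{PKS}^4$, which achieves efficient video understanding with parallel kinematic selective state space scanners | SSM, state space model, spatial relationship | |
| 25 | GaitKD: A Universal Decoupled Distillation Framework for Efficient Gait Recognition | GaitKD: a universal decoupled distillation framework for efficient gait recognition | teacher-student, distillation | |
| 26 | Edge AI for Automotive Vulnerable Road User Safety: Deployable Detection via Knowledge Distillation | Proposes a knowledge-distillation-based edge AI solution that improves INT8-quantized accuracy for vulnerable road user detection in autonomous driving | distillation | |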

🔬 Pillar 5: Interaction & Reaction (2 papers)

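| # | Title | One-line takeaway | Tags | 🔗 |
|---|---|---|---|---|
| 27 | Cross-Domain Transfer of Hyperspectral Foundation Models | Proposes cross-domain transfer of hyperspectral foundation models to improve semantic segmentation in proximal remote sensing | HSI, foundation model | |
| 28 | HOI-aware Adaptive Network for Weakly-supervised Action Segmentation | Proposes AdaAct, an HOI-aware adaptive network for weakly supervised action segmentation | human-object interaction, HOI | |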

🔬 Pillar 1: Robot Control (2 papers)

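| # | Title | One-line takeaway | Tags | 🔗 |
|---|---|---|---|---|
| 29 | Attribution-Guided Multimodal Deepfake Detection via Cross-Modal Forensic Fingerprints | Proposes an attribution-guided multimodal deepfake detection framework that improves detection accuracy via cross-modal fingerprints | manipulation, multimodal | |
| 30 | GIFGuard: Proactive Forensics against Deepfakes in Facial GIFs via Spatiotemporal Watermarking | Proposes GIFGuard, which enables proactive forensics against deepfakes in GIF images through spatiotemporal watermarking | manipulation, spatiotemporal | |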

🔬 Pillar 6: Video Extraction (2 papers)

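| # | Title | One-line takeaway | Tags | 🔗 |
|---|---|---|---|---|
| 31 | DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation | Proposes DenseStep2M, a scalable, training-free pipeline for dense instructional video annotation | egocentric, large language model, multimodal | |
| 32 | ViBE: Visual-to-M/EEG Brain Encoding via Spatio-Temporal VAE and Distribution-Aligned Projection | ViBE: visual-to-M/EEG brain encoding via a spatio-temporal VAE and distribution-aligned projection | feature matching | |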

🔬 Pillar 7: Motion Retargeting (2 papers)

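| # | Title | One-line takeaway | Tags | 🔗 |
|---|---|---|---|---|
| 33 | GateMOT: Q-Gated Attention for Dense Object Tracking | Proposes GateMOT with Q-Gated Attention to address the computational bottleneck of high-resolution features in dense object tracking | motion estimation | |
| 34 | Motion-Driven Multi-Object Tracking of Model Organisms in Space Science Experiments | ART-Track: motion-driven multi-object tracking of model organisms in space science experiments | motion estimation | |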

🔬 Pillar 4: Generative Motion (1 paper)

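| # | Title | One-line takeaway | Tags | 🔗 |
|---|---|---|---|---|
| 35 | Delta Score Matters! Spatial Adaptive Multi Guidance in Diffusion Models | Proposes Spatial Adaptive Multi Guidance (SAMG) to address missing details and artifacts in diffusion models | classifier-free guidance | |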
