cs.CV (2026-04-27)

📊 31 papers in total | 🔗 6 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (16 🔗3) · Pillar 2: RL & Architecture (6) · Pillar 3: Perception & Semantics (5 🔗2) · Pillar 1: Robot Control (2 🔗1) · Pillar 7: Motion Retargeting (2)

🔬 Pillar 9: Embodied Foundation Models (16 papers)

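| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies | Proposes CF-VLA, a coarse-to-fine two-stage action generation method that improves the efficiency of vision-language-action policies. | embodied AI, vision-language-action, VLA | |
| 2 | QEVA: A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering | Proposes QEVA, a reference-free evaluation metric for narrative video summarization based on multimodal question answering. | large language model, multimodal | |
| 3 | Hierarchical Prototype-based Domain Priors for Multiple Instance Learning in Multimodal Histopathology Analysis | Proposes the HPDP framework, which uses hierarchical prototypes and domain priors to improve MIL analysis of multimodal histopathology images. | large language model, multimodal | |
| 4 | Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation | Tuna-2: pixel embeddings outperform vision encoders for multimodal understanding and generation. | multimodal | |
| 5 | Benchmarking Pathology Foundation Models for Breast Cancer Survival Prediction | Benchmarks pathology pretrained models at scale on breast cancer survival prediction. | foundation model | |
| 6 | Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation | Proposes the Positive-and-Negative Decoding framework to mitigate object hallucination in vision-language models. | visual grounding | |
| 7 | EXACT: an explainable anomaly-aware vision foundation model for analysis of 3D chest CT | EXACT: an explainable, anomaly-aware vision foundation model for 3D chest CT analysis. | foundation model | |
| 8 | Robust Grounding with MLLMs against Occlusion and Small Objects via Language-guided Semantic Cues | Proposes language-guided semantic cues to make MLLMs more robust to occlusion and small objects. | large language model, multimodal | |
| 9 | NeuroClaw Technical Report | NeuroClaw: a domain-specific multi-agent research assistant for executable and reproducible neuroimaging research. | multimodal | |
| 10 | Meta-CoT: Enhancing Granularity and Generalization in Image Editing | Meta-CoT: enhancing image editing through finer granularity and better generalization. | chain-of-thought | |
| 11 | Majorization-Guided Test-Time Adaptation for Vision-Language Models under Modality-Specific Shift | Proposes MG-MTTA for test-time adaptation of vision-language models under modality-specific shift. | multimodal | |
| 12 | Zero-to-CAD: Agentic Synthesis of Interpretable CAD Programs at Million-Scale Without Real Data | Zero-to-CAD: synthesizes interpretable CAD programs at million scale without real data. | large language model | |
| 13 | Don't Pause! Every prediction matters in a streaming video | Proposes SPOT-Bench to evaluate the real-time capability of streaming video understanding models, and AsynKV to improve performance. | TAMP | |
| 14 | SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs | Proposes SMoES, which uses modality-guided expert specialization to improve the performance and efficiency of MoE-VLMs. | multimodal | |
| 15 | LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models | LearnPruner: rethinking attention-based token pruning in vision-language models. | large language model | |
| 16 | GoClick: Lightweight Element Grounding Model for Autonomous GUI Interaction | Proposes GoClick, a lightweight GUI element grounding model for autonomous GUI interaction on resource-constrained devices. | visual grounding | |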

🔬 Pillar 2: RL & Architecture (6 papers)

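| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 17 | DeepTaxon: An Interpretable Retrieval-Augmented Multimodal Framework for Unified Species Identification and Discovery | DeepTaxon: an interpretable retrieval-augmented multimodal framework for unified species identification and discovery. | reinforcement learning, multimodal, chain-of-thought | |
| 18 | World-R1: Reinforcing 3D Constraints for Text-to-Video Generation | World-R1: a text-to-video generation framework that uses reinforcement learning to strengthen 3D constraints. | reinforcement learning, geometric consistency, foundation model | |
| 19 | Multi-View Synergistic Learning with Vision-Language Adaption for Low-Resource Biomedical Image Classification | Proposes MVSL, a multi-view synergistic learning framework for low-resource biomedical image classification. | representation learning, contrastive learning, large language model | |
| 20 | Self-Supervised Representation Learning via Hyperspherical Density Shaping | Proposes HyDeS, a self-supervised representation learning method based on hyperspherical density shaping. | representation learning | |
| 21 | See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection | ForeSight: uses low-level visual cues and reflection to improve VLM reasoning. | reinforcement learning, multimodal | |
| 22 | POCA: Pareto-Optimal Curriculum Alignment for Visual Text Generation | Proposes the POCA framework, which uses Pareto-optimal curriculum alignment to address the trade-off between accuracy and consistency in visual text generation. | reinforcement learning, instruction following | |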

🔬 Pillar 3: Perception & Semantics (5 papers)

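| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 23 | Open-Vocabulary Semantic Segmentation Network Integrating Object-Level Label and Scene-Level Semantic Features for Multimodal Remote Sensing Images | Proposes TSMNet, which fuses text supervision with visual representations for open-vocabulary semantic segmentation of multimodal remote sensing images. | open-vocabulary, open vocabulary, multimodal | |
| 24 | Light 'em Up: Enabling Few-Shot Low-Light 3D Gaussian Splatting with Multi-Scale Explicit Retinex Illumination Decoupling | Proposes MERID-GS, which uses multi-scale Retinex decoupling to enable few-shot 3D Gaussian splatting in low-light conditions. | 3D gaussian splatting, gaussian splatting, splatting | |
| 25 | Monocular Depth Estimation via Neural Network with Learnable Algebraic Group and Ring Structures | LAGRNet: monocular depth estimation with a neural network that has learnable algebraic group and ring structures. | depth estimation, monocular depth, OMOMO | |
| 26 | Multivariate Gaussian NeRF for Wide Field-of-View Ultrasound Reconstruction | Proposes Ultra-Wide-NeRF for wide field-of-view ultrasound reconstruction, addressing artifacts and aliasing. | NeRF | |
| 27 | Computer Vision-Based Early Detection of Container Loss at Sea | Proposes a computer-vision-based system for early detection of container-ship instability, to reduce container losses at sea. | optical flow, IMoS | |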

🔬 Pillar 1: Robot Control (2 papers)

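| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 28 | DYMAPIA: A Multi-Domain Framework for Detecting AI-based Video Manipulation | DYMAPIA: a deepfake video detection framework that fuses multi-domain information to improve detection accuracy and efficiency. | manipulation, optical flow | |
| 29 | Probing CLIP's Comprehension of 360-Degree Textual and Visual Semantics | Proposes a new method for evaluating CLIP's comprehension of 360-degree textual and visual semantics. | manipulation | |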

🔬 Pillar 7: Motion Retargeting (2 papers)

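| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 30 | Unconstrained Multi-view Human Pose Estimation with Algebraic Priors | Proposes an algebraic-prior-based framework for unconstrained multi-view human pose estimation that needs no camera calibration. | human motion | |
| 31 | PointTransformerX: Portable and Efficient 3D Point Cloud Processing without Sparse Algorithms | PointTransformerX: efficient, portable 3D point cloud processing without sparse algorithms. | spatial relationship | |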
