Dual-Domain Representation Alignment: Bridging 2D and 3D Vision via Geometry-Aware Architecture Search

Authors: Haoyu Zhang, Zhihao Yu, Rui Wang, Yaochu Jin, Qiqi Liu, Ran Cheng

Categories: cs.CV, cs.AI

Published: 2026-03-20

🔗 Code/Project: https://github.com/EMI-Group/evonas

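💡 One-Sentence Takeaway

Proposes EvoNAS to address the deployment of large vision models on edge devices.

🎯 Matched Area: Pillar 2: RL Algorithms and Architecture (RL & Architecture)

Keywords: evolutionary neural architecture search, vision state space, vision transformer, knowledge distillation, edge computing, multi-objective optimization, computer vision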

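📋 Key Points

Existing methods for multi-objective optimization suffer from expensive candidate evaluation and inconsistent subnetwork rankings, which limits their practical use.
This paper proposes EvoNAS, which builds a hybrid supernet and applies a cross-architecture dual-domain knowledge distillation strategy to improve the model's representational capacity and evaluation efficiency.
Experiments show that EvoNets achieve lower inference latency and higher throughput on datasets including COCO and ADE20K while maintaining strong generalization.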

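📝 Abstract (translated from the Chinese summary)

Modern computer vision must balance predictive accuracy against real-time efficiency, but the high inference cost of large vision models limits their deployment on resource-constrained edge devices. Evolutionary neural architecture search (ENAS) is well suited to multi-objective optimization, yet its practical use is hampered by expensive candidate evaluation and inconsistent subnetwork rankings. To address this, the paper proposes EvoNAS, an efficient distributed framework for multi-objective evolutionary architecture search. By building a hybrid supernet that integrates Vision State Space and Vision Transformer modules, and by training it with a cross-architecture dual-domain knowledge distillation strategy, EvoNAS improves the supernet's representational capacity and ranking consistency and reduces the cost of large-scale validation. Experiments show that the searched architectures, EvoNets, achieve Pareto-optimal trade-offs between accuracy and efficiency.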

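🔬 Method Details

Problem definition: The paper targets the high inference cost of deploying large vision models on edge devices. Existing evolutionary neural architecture search methods fall short in two respects, costly candidate evaluation and inconsistent subnetwork rankings, which makes the search inefficient.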
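Ranking consistency can be made concrete with a rank correlation between the scores a supernet assigns to subnetworks and the scores those subnetworks reach when trained on their own. The sketch below uses Kendall's tau, a common choice in the architecture search literature; the summary does not say which metric the authors use, so the metric is an assumption here and the accuracy values are illustrative numbers, not results from the paper.

```python
from itertools import combinations


def kendall_tau(proxy_scores, true_scores):
    """Kendall's tau-a between two equal-length score lists (ties count as neither)."""
    n = len(proxy_scores)
    if n != len(true_scores):
        raise ValueError("score lists must have the same length")
    if n < 2:
        raise ValueError("at least two items are needed to compare rankings")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # The sign of the product says whether both lists order the pair the same way.
        product = (proxy_scores[i] - proxy_scores[j]) * (true_scores[i] - true_scores[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical accuracies for four subnetworks (illustrative only).
supernet_estimates = [0.61, 0.58, 0.66, 0.63]
standalone_results = [0.72, 0.70, 0.75, 0.76]
tau = kendall_tau(supernet_estimates, standalone_results)
```

A value near 1 means the supernet orders candidates almost exactly as standalone training would, so its estimates can drive selection directly; a value near 0 means the ordering is close to random.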

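Core idea: EvoNAS builds a hybrid supernet that integrates Vision State Space (VSS) and Vision Transformer (ViT) modules and trains it with a Cross-Architecture Dual-Domain Knowledge Distillation (CA-DDKD) strategy, improving the model's representational capacity and evaluation efficiency.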
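To illustrate what searching over a hybrid supernet might look like, the sketch below encodes a subnetwork as a list of per-stage choices and applies a simple mutation. Only the two block types (VSS and ViT) come from the text; the number of stages, the depth and width options, and the function names `sample_subnet` and `mutate` are hypothetical and do not describe the authors' actual search space.

```python
import random

# Block types come from the text; every other choice below is a made-up example.
BLOCK_TYPES = ("vss", "vit")
DEPTH_CHOICES = (1, 2, 3)
WIDTH_CHOICES = (64, 96, 128)
NUM_STAGES = 4


def sample_subnet(rng):
    """Sample one architecture encoding: a list of per-stage configurations."""
    return [
        {
            "block": rng.choice(BLOCK_TYPES),
            "depth": rng.choice(DEPTH_CHOICES),
            "width": rng.choice(WIDTH_CHOICES),
        }
        for _ in range(NUM_STAGES)
    ]


def mutate(subnet, rng, rate=0.25):
    """Return a copy in which each gene is resampled with probability `rate`."""
    options = {"block": BLOCK_TYPES, "depth": DEPTH_CHOICES, "width": WIDTH_CHOICES}
    child = []
    for stage in subnet:
        new_stage = dict(stage)  # copy so the parent is left untouched
        for key, choices in options.items():
            if rng.random() < rate:
                new_stage[key] = rng.choice(choices)
        child.append(new_stage)
    return child


rng = random.Random(0)
parent = sample_subnet(rng)
child = mutate(parent, rng)
```

In a weight-sharing setup, each such encoding would select a path through the shared supernet, so a candidate can be scored without training it from scratch.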

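Technical framework: EvoNAS consists of three parts: construction of the hybrid supernet, the CA-DDKD strategy, and the Distributed Multi-Model Parallel Evaluation (DMMPE) framework. DMMPE improves evaluation efficiency through GPU resource pooling and asynchronous scheduling.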
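The combination of a resource pool and asynchronous scheduling can be sketched with Python's `asyncio`: free GPU ids sit in a queue, each candidate takes one when available, and returns it when done. This is a toy model of the scheduling pattern only, not the DMMPE implementation; `evaluate_on_gpu` is a hypothetical stand-in that sleeps instead of running a validation pass, and no real GPU is touched.

```python
import asyncio


async def evaluate_on_gpu(model_id, gpu_id):
    """Stand-in for validating one candidate on one GPU (no real GPU work happens)."""
    await asyncio.sleep(0.01)  # placeholder for the actual validation pass
    return {"model": model_id, "gpu": gpu_id}


async def evaluate_population(model_ids, num_gpus):
    """Evaluate all candidates concurrently, at most one model per GPU at a time."""
    pool = asyncio.Queue()
    for gpu_id in range(num_gpus):
        pool.put_nowait(gpu_id)

    async def worker(model_id):
        gpu_id = await pool.get()  # wait until some GPU is free
        try:
            return await evaluate_on_gpu(model_id, gpu_id)
        finally:
            pool.put_nowait(gpu_id)  # hand the GPU back to the pool

    # gather returns results in the same order as model_ids.
    return await asyncio.gather(*(worker(m) for m in model_ids))


results = asyncio.run(evaluate_population(model_ids=list(range(6)), num_gpus=2))
```

Because a GPU is reused as soon as its current candidate finishes, fast candidates do not wait for slow ones, which is the intuition behind evaluating several models at once rather than splitting one model's data across all devices.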

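Key innovation: EvoNAS couples the computational efficiency of VSS blocks with the semantic expressiveness of ViT modules. This improves the supernet's representational capacity and ranking consistency, so fitness can be estimated reliably during evolution without extra fine-tuning.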
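The summary does not spell out the CA-DDKD loss, so the sketch below shows only the standard temperature-softened logit distillation loss that many distillation methods build on. It is a generic baseline for intuition, not the authors' cross-architecture dual-domain formulation, and the temperature value is an arbitrary example.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
    )
    return kl * temperature ** 2


# Hypothetical logits for a three-class problem (illustrative only).
teacher = [2.0, 0.5, -1.0]
student = [1.0, 1.0, -0.5]
loss = distillation_loss(student, teacher)
```

The loss is zero when the student reproduces the teacher's softened distribution and grows as the two diverge.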

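Key design: The design uses an efficient loss function and network structure and supports parallel execution across multiple GPUs. Compared with conventional data-parallel evaluation, DMMPE improves evaluation efficiency by over 70%.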

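📊 Experimental Highlights

Experiments on COCO, ADE20K, KITTI, and NYU-Depth v2 show that EvoNets achieve Pareto-optimal trade-offs between accuracy and efficiency. Compared with representative CNN-, ViT-, and Mamba-based models, EvoNets deliver lower inference latency and higher throughput under strict computational budgets while maintaining strong generalization on downstream tasks.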
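A Pareto-optimal trade-off means no other candidate is both at least as accurate and at least as fast while being strictly better on one of the two. The sketch below filters a list of candidates down to that non-dominated set; the `(accuracy, latency_ms)` pairs are hypothetical and are not results from the paper.

```python
def dominates(a, b):
    """True if a is no worse than b on both objectives and strictly better on one.

    Each candidate is (accuracy, latency_ms): higher accuracy and lower latency win.
    """
    acc_a, lat_a = a
    acc_b, lat_b = b
    return acc_a >= acc_b and lat_a <= lat_b and (acc_a > acc_b or lat_a < lat_b)


def pareto_front(candidates):
    """Return the candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Hypothetical (accuracy, latency in ms) pairs, illustrative only.
models = [(0.80, 30.0), (0.78, 20.0), (0.75, 25.0), (0.82, 45.0), (0.79, 35.0)]
front = pareto_front(models)
```

Here `(0.75, 25.0)` drops out because `(0.78, 20.0)` is both more accurate and faster, and `(0.79, 35.0)` drops out because `(0.80, 30.0)` beats it on both objectives.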

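🎯 Application Scenarios

Potential applications include edge computing, intelligent surveillance, and autonomous driving, where efficient visual inference is needed in resource-constrained environments. The design and technical framework of EvoNAS also offer a direction for future work on optimizing vision models.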

📄 Abstract (Original)

Modern computer vision requires balancing predictive accuracy with real-time efficiency, yet the high inference cost of large vision models (LVMs) limits deployment on resource-constrained edge devices. Although Evolutionary Neural Architecture Search (ENAS) is well suited for multi-objective optimization, its practical use is hindered by two issues: expensive candidate evaluation and ranking inconsistency among subnetworks. To address them, we propose EvoNAS, an efficient distributed framework for multi-objective evolutionary architecture search. We build a hybrid supernet that integrates Vision State Space and Vision Transformer (VSS-ViT) modules, and optimize it with a Cross-Architecture Dual-Domain Knowledge Distillation (CA-DDKD) strategy. By coupling the computational efficiency of VSS blocks with the semantic expressiveness of ViT modules, CA-DDKD improves the representational capacity of the shared supernet and enhances ranking consistency, enabling reliable fitness estimation during evolution without extra fine-tuning. To reduce the cost of large-scale validation, we further introduce a Distributed Multi-Model Parallel Evaluation (DMMPE) framework based on GPU resource pooling and asynchronous scheduling. Compared with conventional data-parallel evaluation, DMMPE improves efficiency by over 70% through concurrent multi-GPU, multi-model execution. Experiments on COCO, ADE20K, KITTI, and NYU-Depth v2 show that the searched architectures, termed EvoNets, consistently achieve Pareto-optimal trade-offs between accuracy and efficiency. Compared with representative CNN-, ViT-, and Mamba-based models, EvoNets deliver lower inference latency and higher throughput under strict computational budgets while maintaining strong generalization on downstream tasks such as novel view synthesis. Code is available at https://github.com/EMI-Group/evonas
