AHASD: Asynchronous Heterogeneous Architecture for LLM Adaptive Drafting Speculative Decoding on Mobile Devices

Authors: Ma Zirui, Fan Zhihua, Li Wenxing, Wu Haibin, Zhang Fulin, Ye Xiaochun, Li Wenming

Categories: cs.AR, cs.AI

Published: 2026-04-28

Comments: 7 pages, 9 figures, accepted by DAC 2026, repo: https://github.com/MAdrid1011/AHASD

DOI: 10.1145/3770743.3803965

💡 One-Sentence Summary

Proposes AHASD to improve the efficiency of adaptive-drafting LLM speculative decoding on mobile devices.

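🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: large language models, adaptive drafting, asynchronous architecture, mobile devices, energy efficiency optimization, task scheduling, PIM (processing-in-memory)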

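📋 Key Points

On a mobile single-NPU-PIM system, adaptive drafting inference suffers from idle overhead under traditional operator-level synchronous execution and from wasted computation under asynchronous execution, because draft length fluctuates.
AHASD decouples the DLM and the TLM at the task level so that drafting and verification run in parallel, and it adds dynamic control mechanisms to manage the drafting process.
Experiments show that AHASD improves throughput by up to 4.2× and energy efficiency by up to 5.6× over a GPU-only baseline, and by 1.5× and 1.24× respectively over the state-of-the-art GPU+PIM baseline.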

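📝 Abstract (Translated)

This paper proposes AHASD, an asynchronous heterogeneous architecture that aims to improve the inference efficiency of large language models (LLMs) on mobile devices. Speculative decoding generates drafts with a small draft language model (DLM) and verifies them in batches with a large target language model (TLM). AHASD targets the idle overhead of traditional synchronous execution and the wasted computation of asynchronous execution. Through task-level DLM-TLM decoupling, the architecture drafts on the PIM in parallel with verification on a single NPU. AHASD also introduces Entropy-History-Aware Drafting Control and Time-Aware Pre-Verification Control, which dynamically manage the execution of the adaptive drafting algorithm and the timing of pre-verification, suppressing invalid drafting that builds on low-confidence drafts. Experiments show up to 4.2× higher throughput and 5.6× higher energy efficiency than a GPU-only baseline.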
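The draft-then-verify idea behind speculative decoding can be sketched in a few lines. The following is a minimal illustration with greedy acceptance and toy models, not the paper's implementation: the `NextToken` interface, `speculative_step`, and both toy models are hypothetical names introduced here for illustration.

```python
from typing import Callable, List

# Hypothetical model interface (an assumption for this sketch):
# maps a token sequence to the greedily chosen next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, draft_len: int) -> List[int]:
    """Run one draft-then-verify round and return the newly accepted tokens."""
    # 1) The small model proposes draft_len tokens autoregressively.
    draft: List[int] = []
    for _ in range(draft_len):
        draft.append(draft_next(prefix + draft))

    # 2) The large model checks each draft position. A real system scores all
    #    positions in one batched forward pass; the loop is for clarity only.
    accepted: List[int] = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)

    # 3) Every draft matched, so the target model contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


# Toy models: the target counts upward; the draft agrees except after token 3.
def toy_target(seq: List[int]) -> int:
    return seq[-1] + 1


def toy_draft(seq: List[int]) -> int:
    return 0 if seq[-1] == 3 else seq[-1] + 1


new_tokens = speculative_step([1], toy_draft, toy_target, draft_len=4)
print(new_tokens)  # [2, 3, 4]
```

In this toy run the draft proposes four tokens but only two survive verification, so the last two draft steps are wasted work. That waste is what adaptive drafting tries to limit by choosing the draft length dynamically.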

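🔬 Method Details

Problem definition: The paper addresses the inefficiency of adaptive-drafting LLM inference on mobile devices, in particular the idle overhead caused by traditional synchronous execution and the wasted computation caused by asynchronous execution.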
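The idle overhead of synchronous execution can be made concrete with a simple timing model. The sketch below is an idealized illustration under stated assumptions, not a model from the paper: each round has a draft time and a verify time in arbitrary units, the numbers are made up, and the pipelined case assumes every draft is accepted, so it ignores the wasted computation that rejections cause in real asynchronous execution.

```python
from typing import List, Tuple

# Each round is (draft_time, verify_time) in arbitrary units (hypothetical values).
Round = Tuple[float, float]


def synchronous_time(rounds: List[Round]) -> float:
    """Draft and verify strictly alternate, so one device always sits idle."""
    return sum(draft + verify for draft, verify in rounds)


def pipelined_time(rounds: List[Round]) -> float:
    """Idealized two-stage pipeline: drafting runs ahead while verification
    of the previous round is still in progress."""
    draft_done = 0.0   # time at which the drafting device finishes its current round
    verify_done = 0.0  # time at which the verifying device finishes its current round
    for draft, verify in rounds:
        draft_done += draft
        # Verification starts once the draft exists and the verifier is free.
        verify_done = max(draft_done, verify_done) + verify
    return verify_done


# Draft times fluctuate between rounds while verify time stays fixed.
rounds = [(2.0, 3.0), (4.0, 3.0), (1.0, 3.0)]
sync_total = synchronous_time(rounds)
pipe_total = pipelined_time(rounds)
print(sync_total, pipe_total)  # 16.0 12.0
```

Even in this best case the verifier stalls for one unit in the second round, because the long draft is not ready when the previous verification finishes. Fluctuating draft lengths therefore hurt both execution styles, in different ways.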

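Core idea: AHASD decouples the DLM and the TLM at the task level so that drafting and verification proceed in parallel, and it uses dynamic control mechanisms to manage the drafting process, in particular to suppress low-confidence drafts.

Technical framework: The AHASD architecture has two main modules, a drafting module (DLM) and a verification module (TLM). It combines them with Entropy-History-Aware Drafting Control and Time-Aware Pre-Verification Control to schedule and execute tasks efficiently.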
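One way to picture an entropy-history-based drafting gate is shown below. This summary does not describe the paper's actual control rule, so the sketch is purely an assumption for illustration: the class name `EntropyHistoryGate`, the sliding-window mean, and the threshold value are all hypothetical.

```python
import math
from collections import deque
from typing import Deque, Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Entropy in bits of a next-token distribution; high means low confidence."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


class EntropyHistoryGate:
    """Hypothetical controller: stop drafting when recent entropy runs high."""

    def __init__(self, window: int = 4, threshold_bits: float = 1.0) -> None:
        # Only the most recent `window` entropy values are kept.
        self.history: Deque[float] = deque(maxlen=window)
        self.threshold_bits = threshold_bits

    def should_continue(self, probs: Sequence[float]) -> bool:
        """Record the entropy of the latest draft step and decide whether to go on."""
        self.history.append(shannon_entropy(probs))
        mean_entropy = sum(self.history) / len(self.history)
        return mean_entropy <= self.threshold_bits


confident = [1.0, 0.0, 0.0, 0.0]      # entropy 0 bits
uncertain = [0.25, 0.25, 0.25, 0.25]  # entropy 2 bits

gate = EntropyHistoryGate(window=2, threshold_bits=1.0)
decisions = [gate.should_continue(p) for p in (confident, uncertain, uncertain)]
print(decisions)  # [True, True, False]
```

Using a window rather than a single entropy value means one uncertain token does not immediately stop drafting. The gate closes only after uncertainty persists, which is the general intuition behind tracking a history.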

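Key innovation: AHASD's main innovation is its task-level asynchronous architecture, which drafts on the PIM while verifying on the NPU. This improves inference efficiency and limits drafting work that is likely to be rejected.

Key design: AHASD integrates Attention Algorithm Units and Gated Task Scheduling Units within LPDDR5-PIM, enabling attention link localization and sub-microsecond task switching on the PIM side, while keeping hardware overhead below 3% of the DRAM area.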

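📊 Experimental Highlights

Across different LLMs and adaptive drafting algorithms, AHASD achieves up to 4.2× higher throughput and 5.6× higher energy efficiency than a GPU-only baseline, and 1.5× higher throughput and 1.24× higher energy efficiency than the state-of-the-art GPU+PIM baseline.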

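🎯 Application Scenarios

Potential applications include on-device natural language processing, intelligent assistants, and real-time translation on mobile devices. By making LLM inference more efficient, AHASD could deliver faster responses and better energy efficiency in resource-constrained environments.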

📄 Abstract (Original)

Speculative decoding enhances the inference efficiency of large language models (LLMs) by generating drafts using a small draft language model (DLM) and verifying them in batches with a large target language model (TLM). However, adaptive drafting inference on a mobile single-NPU-PIM system faces idle overhead in traditional operator-level synchronous execution and wasted computation in asynchronous execution due to fluctuations in draft length. This paper introduces AHASD, a task-level asynchronous mobile NPU-PIM heterogeneous architecture for speculative decoding. Notably, AHASD achieves parallel drafting on the PIM and verification on a single NPU through task-level DLM-TLM decoupling and specifically, it incorporates Entropy-History-Aware Drafting Control and Time-Aware Pre-Verification Control to dynamically manage adaptive drafting algorithm execution and pre-verification timing, suppressing invalid drafting based on low-confidence drafts. Additionally, AHASD integrates Attention Algorithm Units and Gated Task Scheduling Units within LPDDR5-PIM to enable attention link localization and sub-microsecond task switching on the PIM side. Experimental results for different LLMs and adaptive drafting algorithms show that AHASD achieves up to 4.2× in throughput and 5.6× in energy efficiency improvements over a GPU-only baseline, and 1.5× in throughput and 1.24× in energy efficiency gains over the state-of-the-art GPU+PIM baseline, with hardware overhead below 3% of the DRAM area.
