Bounded Behavioral Indistinguishability for Black-Box LLM Distillation

Author: Munawar Hasan

Categories: cs.LG, cs.CL

Published: 2026-05-28

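💡 One-Sentence Takeaway

Proposes bounded behavioral indistinguishability as a stricter evaluation criterion for black-box LLM distillation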

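🎯 Matched Area: Pillar 2: RL Algorithms & Architecture (RL & Architecture)

Keywords: black-box LLM distillation, behavioral indistinguishability, adversarial evaluation, LoRA, natural language processing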

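📋 Key Points

Existing evaluations of black-box LLM distillation look only at output similarity and do not adequately test whether the student behaves like the teacher.
The paper proposes bounded behavioral indistinguishability, an evaluation framework defined over an explicit prompt distribution, to assess distillation more thoroughly.
Experiments show that LoRA distillation raises semantic similarity yet leaves behavioral differences, which underscores the need for adversarial evaluation.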

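📝 Abstract (translated from Chinese)

Black-box LLM distillation is usually treated as an output-matching problem: a student model is judged successful when its responses are semantically similar to the teacher's. Output similarity, however, does not imply that the student is behaviorally indistinguishable from the model it imitates. The paper introduces bounded behavioral indistinguishability, formalized as $(ε,q,t,\mathbb{A})$-behavioral indistinguishability and evaluated over an explicit prompt distribution. Experiments on Qwen and Llama teacher-student pairs use a 5,000-prompt behavioral probe suite to compare each teacher with both the base student and the LoRA-distilled student. The results show that LoRA improves semantic similarity but behavioral differences remain, so black-box LLM distillation needs bounded, adversarial, and category-aware evaluation.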

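🔬 Method Details

Problem definition: The paper addresses a gap in how black-box LLM distillation is evaluated. Existing methods measure only the semantic similarity of outputs and do not ask whether the student is behaviorally indistinguishable from the teacher.

Core idea: The paper proposes bounded behavioral indistinguishability, formalized as $(ε,q,t,\mathbb{A})$-behavioral indistinguishability, where $ε$ bounds distinguishing advantage, $q$ bounds oracle queries, $t$ bounds computation, and $\mathbb{A}$ denotes the adversary class. Evaluating the student against the teacher over an explicit prompt distribution gives a more complete criterion than output similarity alone.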
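One way to write the notion down, borrowing the standard cryptographic form of distinguishing advantage, is sketched below. The symbols $T$ (teacher), $S$ (student), $\mathcal{D}$ (prompt distribution), and $\mathrm{Adv}$ are notation assumed here for illustration; the paper's exact definition may differ.

$$
\mathrm{Adv}_{\mathcal{D}}(A) = \left| \Pr_{x_1,\dots,x_q \sim \mathcal{D}}\left[ A^{T}(x_1,\dots,x_q) = 1 \right] - \Pr_{x_1,\dots,x_q \sim \mathcal{D}}\left[ A^{S}(x_1,\dots,x_q) = 1 \right] \right|
$$

Under this reading, $S$ is $(ε,q,t,\mathbb{A})$-behaviorally indistinguishable from $T$ over $\mathcal{D}$ if $\mathrm{Adv}_{\mathcal{D}}(A) \le ε$ for every adversary $A \in \mathbb{A}$ that makes at most $q$ oracle queries and runs in time at most $t$.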

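Technical framework: The overall framework consists of a mathematical definition of behavioral indistinguishability, a 5,000-prompt behavioral probe suite, and adversarial evaluation that compares teacher and student behavior. The main modules are prompt generation, behavior evaluation, and adversarial testing.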
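As a rough illustration of the adversarial-testing step, the sketch below estimates a discriminator's distinguishing advantage from its predictions on held-out responses. The function name `empirical_advantage` and the choice of estimator (the gap between the rate of "teacher" guesses on teacher responses and on student responses) are assumptions made here, not details taken from the paper.

```python
def empirical_advantage(predictions, labels):
    """Estimate distinguishing advantage as |Pr[guess=1 | teacher] - Pr[guess=1 | student]|.

    labels: 1 if the response came from the teacher, 0 if from the student.
    predictions: the discriminator's guess (1 = "teacher") for each response.
    """
    teacher_preds = [p for p, y in zip(predictions, labels) if y == 1]
    student_preds = [p for p, y in zip(predictions, labels) if y == 0]
    if not teacher_preds or not student_preds:
        raise ValueError("need responses from both teacher and student")
    p_teacher = sum(teacher_preds) / len(teacher_preds)  # Pr[guess=1 | teacher]
    p_student = sum(student_preds) / len(student_preds)  # Pr[guess=1 | student]
    return abs(p_teacher - p_student)


if __name__ == "__main__":
    labels = [1, 1, 1, 1, 0, 0, 0, 0]
    predictions = [1, 1, 1, 0, 1, 0, 0, 0]
    print(empirical_advantage(predictions, labels))
```

A discriminator that guesses the same label for every response gets an advantage of zero under this estimator, which matches the intuition that it has learned nothing that separates the two models.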

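Key innovation: The central contribution is the notion of bounded behavioral indistinguishability itself. It shifts the goal of distillation evaluation from output similarity to behavioral consistency between student and teacher, which is a fundamentally different criterion from conventional output matching.

Key design: The experiments are parameterized by the bound $ε$ on distinguishing advantage, the number of queries $q$, and the computation time $t$. LoRA distillation is used to raise the student's semantic similarity to the teacher, and adversarial evaluation is then applied to detect the behavioral differences that remain. The experiments also explore different query budgets and sampling strategies.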
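The abstract names stratified random sampling as the baseline acquisition strategy under a query budget. A minimal sketch of that idea follows; the `(category, text)` prompt format, the round-robin allocation, and the function name `stratified_sample` are illustrative assumptions rather than the paper's procedure.

```python
import random
from collections import defaultdict


def stratified_sample(prompts, budget, seed=0):
    """Pick up to `budget` prompts, spreading them evenly across categories.

    prompts: list of (category, text) pairs; the category labels are
    hypothetical stand-ins for the probe suite's prompt categories.
    """
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for cat, text in prompts:
        by_cat[cat].append(text)
    for texts in by_cat.values():
        rng.shuffle(texts)  # random order within each category
    picked = []
    cats = sorted(by_cat)
    # Round-robin over categories so no single category uses up the budget.
    while len(picked) < budget and any(by_cat[c] for c in cats):
        for c in cats:
            if by_cat[c] and len(picked) < budget:
                picked.append((c, by_cat[c].pop()))
    return picked


if __name__ == "__main__":
    pool = [(c, f"{c}-{i}") for c in ("style", "robustness", "domain") for i in range(10)]
    for cat, text in stratified_sample(pool, budget=6):
        print(cat, text)
```

Spending the budget evenly across categories prioritizes coverage, which is the property the abstract credits when it reports that disagreement-guided acquisition does not consistently beat this baseline.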

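📊 Experimental Highlights

LoRA distillation raises semantic similarity from 0.788 to 0.862 for Qwen and from 0.814 to 0.874 for Llama. Adversarial evaluation nevertheless reveals remaining behavioral differences: Qwen's distinguishing advantage falls from 0.158 for the base student to 0.081 after LoRA distillation, so it is reduced but not eliminated, and behavioral consistency still needs attention.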
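The original abstract attributes the Qwen advantage figures to a pairwise teacher-identification adversary with A/B-swap consistency filtering. The sketch below shows one plausible reading of that setup. The `judge` callable, the rule of dropping pairs whose verdict flips with presentation order, and the conversion from accuracy to advantage as $|2 \cdot \text{acc} - 1|$ are all assumptions made here for illustration.

```python
def pairwise_advantage(judge, pairs):
    """Advantage of a pairwise teacher-identification judge with A/B-swap filtering.

    judge(prompt, first, second) returns "A" or "B": the position of the
    response it believes came from the teacher (hypothetical interface).
    pairs: iterable of (prompt, teacher_response, student_response).
    Returns (advantage, number_of_pairs_kept).
    """
    correct = 0
    kept = 0
    for prompt, teacher, student in pairs:
        picks_teacher_1 = judge(prompt, teacher, student) == "A"  # teacher shown first
        picks_teacher_2 = judge(prompt, student, teacher) == "B"  # teacher shown second
        if picks_teacher_1 != picks_teacher_2:
            continue  # verdict changed with position: treat as position bias and drop
        kept += 1
        correct += int(picks_teacher_1)
    if kept == 0:
        return 0.0, 0
    accuracy = correct / kept
    return abs(2 * accuracy - 1), kept


if __name__ == "__main__":
    # Toy judge that assumes the longer response is the teacher's.
    def longer_is_teacher(prompt, first, second):
        return "A" if len(first) > len(second) else "B"

    toy_pairs = [
        ("p1", "aaaa", "bb"),
        ("p2", "aaaa", "b"),
        ("p3", "aaa", "bb"),
        ("p4", "a", "bbb"),
        ("p5", "aa", "bb"),  # tie: the verdict flips under the swap, so it is dropped
    ]
    print(pairwise_advantage(longer_is_teacher, toy_pairs))
```

A judge that always answers with the same position is filtered out entirely under this rule, so position bias cannot be mistaken for an ability to identify the teacher.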

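🎯 Application Scenarios

Potential application areas include natural language processing, dialogue systems, and intelligent assistants. Better evaluation of black-box LLM distillation can make distilled models more reliable and consistent, which in turn improves user experience and practical value. The approach may also help establish more rigorous standards for model training and evaluation.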

📄 Abstract (Original)

Black-box LLM distillation is usually evaluated as an output-matching problem: a student is considered successful when its responses are semantically similar to, or task-consistent with, those of a teacher. However, output similarity does not imply that the student is behaviorally indistinguishable from the model it imitates. We introduce bounded behavioral indistinguishability, formalized as $(ε,q,t,\mathbb{A})$-behavioral indistinguishability over an explicit prompt distribution, where $ε$ bounds distinguishing advantage, $q$ bounds oracle queries, $t$ bounds computation, and $\mathbb{A}$ denotes the adversary class. We instantiate this notion on Qwen and Llama teacher-student pairs using a controlled $5,000$-prompt behavioral probe suite. For each family, we compare the teacher with both the base student and the LoRA-distilled student, measuring whether distillation reduces distinguishability rather than merely improving similarity. LoRA raises semantic similarity from $0.788$ to $0.862$ for Qwen and from $0.814$ to $0.874$ for Llama. Yet adversarial evaluation reveals remaining behavioral differences: learned discriminators retain nonzero advantage, and pairwise category analysis shows artifacts concentrated in style/format, robustness, and domain-technical prompts. A pairwise teacher-identification adversary confirms this trend. With a different-family Llama judge and A/B-swap consistency filtering, Qwen distinguishing advantage drops from $0.158$ for the base student to $0.081$ after LoRA distillation. Query-budget experiments show that disagreement-guided acquisition does not consistently outperform stratified random sampling, indicating that coverage and diversity remain strong baselines. Our results show that semantic fidelity is useful but insufficient: black-box LLM distillation requires bounded, adversarial, and category-aware evaluation.
