Anchoring Refusal Direction: Mitigating Safety Risks in Tuning via Projection Constraint

Authors: Yanrui Du, Fenglei Fan, Sendong Zhao, Jiawei Cao, Qika Lin, Kai He, Ting Liu, Bing Qin, Mengling Feng

Category: cs.CL

Published: 2025-09-08

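💡 One-Sentence Summary

Proposes ProCon, a method that uses a projection constraint to mitigate the safety risks that instruction fine-tuning introduces in large language models.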

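🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: large language models, instruction fine-tuning, safety, refusal direction, projection constraint, adversarial training, model safety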

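📋 Key Points

Instruction fine-tuning improves LLM capabilities but degrades safety, particularly the ability to refuse malicious instructions.
ProCon constrains the projection of hidden states onto the refusal direction, which mitigates refusal-direction drift during training.
Experiments show that ProCon substantially reduces safety risks while preserving task performance, and that it stabilizes the refusal direction.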

🔬 Method Details

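Problem definition: Instruction fine-tuning (IFT) improves the performance of large language models (LLMs) but significantly degrades their safety: a fine-tuned LLM becomes less able to refuse malicious instructions. Existing methods lack a deep understanding of LLM internal mechanisms and therefore cannot effectively address the safety degradation caused by IFT.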

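Core idea: The paper's central idea is to stabilize the "refusal direction" (r-direction) in the LLM's hidden states. The study finds that drift of the r-direction is one of the causes of the safety degradation. Constraining the projection of hidden states onto the r-direction during training mitigates this drift and thereby improves safety.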
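To make the notion of a projection onto the r-direction concrete, here is a minimal sketch. It assumes the direction is estimated as the normalized difference between the mean hidden state of harmful prompts and that of harmless prompts, a technique used in prior refusal-direction work; this summary does not confirm that ProCon uses exactly this estimator. The function names and toy arrays are hypothetical.

```python
import numpy as np

def estimate_r_direction(harmful_states, harmless_states):
    """Assumed estimator: unit vector pointing from the mean harmless
    hidden state to the mean harmful hidden state."""
    diff = harmful_states.mean(axis=0) - harmless_states.mean(axis=0)
    return diff / np.linalg.norm(diff)

def projection_onto(hidden_states, r_direction):
    """Scalar projection of each hidden state (one per row) onto the unit r-direction."""
    return hidden_states @ r_direction

# Toy 2-dimensional hidden states, invented purely for illustration.
harmful = np.array([[2.0, 0.0], [4.0, 0.0]])
harmless = np.array([[0.0, 0.0], [0.0, 0.0]])

r_direction = estimate_r_direction(harmful, harmless)
projections = projection_onto(np.array([[5.0, 7.0]]), r_direction)
```

In this toy setup the direction lies along the first axis, so the projection simply reads off the first coordinate of each hidden state.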

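Technical framework: ProCon has three main parts. (1) Refusal direction identification: the r-direction in the LLM's hidden states is determined using results from existing research. (2) Projection constraint loss: a projection-constrained loss term regularizes the magnitude of each training sample's hidden-state projection onto the r-direction. (3) Warm-up strategy: a stronger constraint is applied early in training, and the data distribution is broadened to strengthen the constraint signal.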
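The combination of a task loss and a projection constraint can be sketched as follows. This summary does not give the exact form of the constraint, so the sketch makes a loud assumption: the penalty is the mean squared change in projection relative to a frozen copy of the model taken before tuning. The names `projection_constraint`, `procon_style_loss`, and `coeff` are hypothetical and are not taken from the paper.

```python
import numpy as np

def projection_constraint(tuned_states, reference_states, r_direction):
    """Assumed penalty: mean squared difference between the projections of the
    tuned model's and a frozen reference model's hidden states onto r_direction."""
    tuned_proj = tuned_states @ r_direction
    ref_proj = reference_states @ r_direction
    return float(np.mean((tuned_proj - ref_proj) ** 2))

def procon_style_loss(task_loss, tuned_states, reference_states, r_direction, coeff):
    """Total loss = standard IFT loss + coeff * projection constraint."""
    penalty = projection_constraint(tuned_states, reference_states, r_direction)
    return task_loss + coeff * penalty

# Toy values, invented for illustration only.
r_direction = np.array([1.0, 0.0])
reference_states = np.array([[1.0, 0.0], [2.0, 5.0]])
tuned_states = np.array([[3.0, 0.0], [2.0, 9.0]])

penalty = projection_constraint(tuned_states, reference_states, r_direction)
total = procon_style_loss(1.5, tuned_states, reference_states, r_direction, coeff=0.5)
```

Note that the second sample changes only in the component orthogonal to the r-direction, so it contributes nothing to the penalty; under this assumed form, only movement along the r-direction is constrained.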

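Key innovations: (1) Safety improvement grounded in internal mechanisms: unlike earlier approaches that rely on external intervention or data filtering, ProCon acts directly on the LLM's internal representations and improves safety by stabilizing the r-direction. (2) Projection constraint loss: constraining the projection of hidden states onto the r-direction effectively mitigates refusal-direction drift. (3) Warm-up strategy: a warm-up strategy that targets the sharp drift observed early in training further improves ProCon's performance.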

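Key design: (1) Projection constraint loss function: the loss combines a standard IFT loss term with a projection constraint term, which regularizes the magnitude of the hidden state's projection onto the r-direction. (2) Warm-up strategy: a larger constraint coefficient is used early in training and is gradually reduced as training proceeds. The training data distribution is also broadened, including by adding data that contains malicious instructions, to strengthen the constraint signal. (3) Determining the refusal direction: an existing method is used to identify the r-direction in the LLM's hidden states; the exact procedure is not detailed here.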
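The warm-up idea (a strong constraint early, relaxed later) can be illustrated with a simple coefficient schedule. The linear decay, the function name, and the numeric values below are assumptions chosen for illustration; the summary says only that the coefficient starts large and is gradually reduced.

```python
def constraint_coefficient(step, warmup_steps, start_coeff, end_coeff):
    """Hypothetical schedule: decay linearly from start_coeff to end_coeff
    over warmup_steps training steps, then hold at end_coeff."""
    if warmup_steps <= 0:
        raise ValueError("warmup_steps must be positive")
    if step >= warmup_steps:
        return end_coeff
    fraction = step / warmup_steps
    return start_coeff + fraction * (end_coeff - start_coeff)

# Illustrative values only: strong constraint at step 0, weaker after 100 steps.
schedule = [constraint_coefficient(s, 100, 1.0, 0.1) for s in (0, 50, 100, 200)]
```

The returned value would multiply the projection constraint term at each step, so the constraint dominates while drift is reported to be sharpest and then relaxes to let the task loss drive training.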

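📊 Experimental Highlights

Experiments show that ProCon significantly reduces safety risks across multiple datasets, scenarios, and LLMs while preserving task performance. Compared with existing methods, ProCon delivers better overall performance. The analysis shows that ProCon helps stabilize the refusal direction during training, which supports the method's premise. Specific performance figures and improvement margins are reported in the paper.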

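🎯 Application Scenarios

The results apply to LLM deployments that require safety guarantees, such as customer-service assistants, content generation, and code generation. ProCon can lower the risk that an LLM produces harmful or inappropriate content, which improves user experience and reduces potential legal and ethical problems. In the future, the method could be extended to mitigate other kinds of safety risk, such as preventing a model from leaking sensitive information.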

📄 Abstract (Original)

Instruction Fine-Tuning (IFT) has been widely adopted as an effective post-training strategy to enhance various abilities of Large Language Models (LLMs). However, prior studies have shown that IFT can significantly compromise LLMs' safety, particularly their ability to refuse malicious instructions, raising significant concerns. Recent research into the internal mechanisms of LLMs has identified the refusal direction (r-direction) in the hidden states, which plays a pivotal role in governing refusal behavior. Building on this insight, our study reveals that the r-direction tends to drift during training, which we identify as one of the causes of the associated safety risks. To mitigate such drift, our proposed ProCon method introduces a projection-constrained loss term that regularizes the projection magnitude of each training sample's hidden state onto the r-direction. Our initial analysis shows that applying an appropriate constraint can effectively mitigate the refusal direction drift and associated safety risks, but remains limited by overall performance barriers. To overcome this barrier, informed by our observation of early-stage sharp drift and a data-driven perspective, we introduce a warm-up strategy that emphasizes early-stage strong constraints and broaden the data distribution to strengthen constraint signals, leading to an enhanced ProCon method. Experimental results under various datasets, scenarios, and LLMs demonstrate that our method can significantly mitigate safety risks posed by IFT while preserving task performance gains. Even compared with strong baselines, our method consistently delivers superior overall performance. Crucially, our analysis indicates that ProCon can contribute to stabilizing the r-direction during training, while such an interpretability-driven exploration of LLMs' internal mechanisms lays a solid foundation for future safety research.
