Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

Authors: ShiYing Huang, Liang Lin, Yuer Li, Kaiwen Luo, Zhenhong Zhou, An Zhang, Junhao Dong, Kun Wang, Zhigang Zeng

Category: cs.AI

Published: 2026-05-12

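💡 One-Sentence Takeaway

Proposes MORA, which breaks the safety-helpfulness ceiling of large language models by expanding the reward dimensions.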

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: LLM alignment, multi-objective optimization, reward dimension expansion, safety, helpfulness, prompt engineering, MORA

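📋 Key Points

Existing LLM alignment methods struggle to get past the inherent tension between safety and helpfulness, and end up trading one off against the other.
MORA expands the reward dimensions by rewriting prompts, so that the model can better understand and satisfy multi-faceted preferences.
Experiments show that MORA improves model performance in both sequential and simultaneous alignment, especially on harmlessness.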

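📝 Abstract (Summary)

Multi-objective alignment of large language models typically runs into a zero-sum conflict between different human preferences: optimizing one metric (e.g., helpfulness) often sacrifices another (e.g., harmlessness). Prior work focuses mainly on data selection, parameter merging, or algorithmic balancing during training, but these methods only compromise along a fixed Pareto frontier and do not fundamentally resolve the inherent trade-off. This paper takes a multi-dimensional reward perspective. By scaling up the model's rollouts and analyzing the outputs across different reward dimensions, it concludes that the conflict among objectives stems from the prompt itself restricting the achievable multi-dimensional rewards. Building on this, it proposes Multi-Objective Reward Assimilation (MORA), which isolates single-reward prompts through pre-sampling and expands reward diversity by rewriting the original questions to incorporate multi-dimensional intents. In sequential alignment, after multi-preference alignment across the helpful, harmless, and truthful dimensions, MORA achieves single-preference improvements of 5% to 12.4%, with the largest gains in harmlessness. In simultaneous alignment, MORA achieves an average overall reward improvement of 4.6%.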

🔬 Method Details

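Problem definition: Existing LLM alignment methods, such as data selection and parameter merging, usually have to compromise when optimizing several objectives at once (e.g., safety, helpfulness, and truthfulness). The compromise arises because these methods optimize along a fixed Pareto frontier, where improving one objective costs another, so they cannot raise all objectives together. The core problem is that the prompt itself may restrict the multi-dimensional rewards the model can achieve.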
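To make the notion of a fixed Pareto frontier concrete, here is a minimal sketch that extracts the non-dominated points from a set of two-dimensional reward vectors. The `(helpfulness, harmlessness)` pairs are made-up numbers for illustration, not results from the paper, and the function names are hypothetical.

```python
from typing import List, Tuple

Reward = Tuple[float, float]  # (helpfulness, harmlessness); higher is better


def dominates(a: Reward, b: Reward) -> bool:
    """True if a is at least as good as b on every dimension and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points: List[Reward]) -> List[Reward]:
    """Return the non-dominated points, sorted by the first coordinate."""
    front = [p for p in points if not any(dominates(q, p) for q in points)]
    return sorted(set(front))


# Made-up reward pairs for illustration only (not from the paper).
candidates = [(0.9, 0.2), (0.7, 0.5), (0.4, 0.8), (0.5, 0.4), (0.3, 0.3)]
front = pareto_front(candidates)
print(front)
```

Moving along the resulting front always gives up one reward to gain the other, which is the trade-off the summary says existing methods cannot escape.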

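Core idea: MORA removes this restriction by expanding the reward dimensions. It first identifies prompts that elicit reward mainly on one specific dimension (single-reward prompts), then rewrites the original question so that the prompt carries the intents of several reward dimensions, guiding the model to optimize along all of them. The model then no longer has to trade one objective against another and can improve several objectives at once.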

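Technical framework: MORA has two main stages, pre-sampling and reward assimilation. In the pre-sampling stage, MORA samples a large number of prompts and evaluates the model's outputs on each reward dimension, identifying the prompts that elicit reward on one specific dimension. In the reward assimilation stage, MORA uses these prompts to rewrite the original questions so that each prompt incorporates the intents of multiple reward dimensions. The rewritten prompts are then used to train the model so that it better understands and satisfies multi-faceted preferences.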
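The pre-sampling stage could be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the selection rule (a dimension counts as achievable if the best rollout reaches a threshold), the threshold value, the function names, and the reward numbers are all hypothetical.

```python
from typing import Dict, List

DIMENSIONS = ("helpful", "harmless", "truthful")


def achievable_dimensions(rollouts: List[Dict[str, float]], threshold: float = 0.5) -> List[str]:
    """Dimensions on which at least one sampled rollout reaches the threshold (assumed rule)."""
    return [d for d in DIMENSIONS if max(r[d] for r in rollouts) >= threshold]


def is_single_reward_prompt(rollouts: List[Dict[str, float]], threshold: float = 0.5) -> bool:
    """A prompt is flagged when exactly one dimension is achievable across its rollouts."""
    return len(achievable_dimensions(rollouts, threshold)) == 1


# Made-up per-rollout rewards for two prompts (illustration only).
presampled = {
    "prompt_a": [
        {"helpful": 0.9, "harmless": 0.1, "truthful": 0.2},
        {"helpful": 0.8, "harmless": 0.3, "truthful": 0.1},
    ],
    "prompt_b": [
        {"helpful": 0.7, "harmless": 0.8, "truthful": 0.6},
        {"helpful": 0.6, "harmless": 0.9, "truthful": 0.4},
    ],
}
single_reward = [name for name, rollouts in presampled.items() if is_single_reward_prompt(rollouts)]
print(single_reward)
```

Taking the maximum over rollouts mirrors the idea that the prompt limits the achievable reward; a mean or quantile would be an equally plausible choice.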

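Key innovation: MORA approaches multi-objective alignment from the prompt side, expanding the reward dimensions instead of rebalancing objectives. Unlike existing methods, it does not compromise along a fixed Pareto frontier; it changes the prompt to push the frontier outward, so that several objectives improve together. This makes fuller use of the model's capacity and yields better performance.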

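Key design: MORA's key design choices are: 1) pre-sampling to identify single-reward prompts; 2) rewriting to fold the intents of multiple reward dimensions into the prompt; 3) training the model on the rewritten prompts. The specific rewriting method is not known from this summary; the paper may not describe it in detail.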
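Since the rewriting method is not specified here, the sketch below is purely a hypothetical stand-in for step 2: a template-based rewrite that appends hints for the reward dimensions a prompt does not yet elicit. The hint texts, `INTENT_HINTS`, and `rewrite_prompt` are assumptions for illustration; the paper may well use a different mechanism, such as an LLM-based rewriter.

```python
from typing import List

# Hypothetical hint text per reward dimension (not taken from the paper).
INTENT_HINTS = {
    "helpful": "Give a complete, actionable answer.",
    "harmless": "Avoid content that could cause harm.",
    "truthful": "State only what is factually supported.",
}


def rewrite_prompt(question: str, active: List[str]) -> str:
    """Append hints for every dimension the original prompt does not already elicit."""
    missing = [d for d in INTENT_HINTS if d not in active]
    if not missing:
        return question  # already multi-dimensional, leave unchanged
    hints = " ".join(INTENT_HINTS[d] for d in missing)
    return f"{question}\n\nAdditional requirements: {hints}"


# A prompt that pre-sampling flagged as eliciting only the "helpful" reward.
rewritten = rewrite_prompt("How do I clean a wound?", ["helpful"])
print(rewritten)
```

The rewritten prompts would then replace the originals in the training set (step 3), which this sketch does not cover.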

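📊 Experimental Highlights

In sequential alignment, MORA achieves single-preference improvements of 5% to 12.4%, with the most pronounced gains on harmlessness. In simultaneous alignment, MORA achieves an average overall reward improvement of 4.6%. These results indicate that MORA can break the safety-helpfulness ceiling and improve the multi-objective alignment of large language models.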

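🎯 Application Scenarios

MORA applies to LLM alignment settings that must balance several objectives, such as dialogue systems, content generation, and intelligent assistants. By improving a model's safety, helpfulness, and truthfulness, MORA can raise user satisfaction, reduce potential risks, and support the use of large language models in a wider range of domains.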

📄 Abstract (Original)

In the realm of multi-objective alignment for large language models, balancing disparate human preferences often manifests as a zero-sum conflict. Specifically, the intrinsic tension between competing goals dictates that aggressively optimizing for one metric (e.g., helpfulness) frequently incurs a substantial penalty on another (e.g., harmlessness). While prior work mainly focuses on data selection, parameter merging, or algorithmic balancing during training, these approaches merely force compromises between divergent preferences along a fixed Pareto frontier, failing to fundamentally resolve the inherent trade-off. In this work, we approach this problem from a novel perspective of multi-dimensional rewards. By scaling up the model's rollouts and analyzing the outputs across different reward dimensions, we arrive at a critical conclusion: the conflict among multiple objectives stems from the fact that the prompt itself inherently restricts the achievable multi-dimensional rewards. Based on this core observation, we propose MORA: Multi-Objective Reward Assimilation. Specifically, MORA isolates single-reward prompts through pre-sampling and expands their reward diversity by rewriting the original questions to incorporate multi-dimensional intents. Extensive experiments demonstrate that: (1) in sequential alignment, MORA achieves single-preference improvements ranging from 5% to 12.4%, with exceptional gains in harmlessness, after multiple-preference alignment across helpful, harmless, and truthful dimensions. (2) In simultaneous alignment, MORA achieves an average overall reward improvement of 4.6%. Our codes are available at https://anonymous.4open.science/r/MORA-MPA.
