When2Speak: A Dataset for Temporal Participation and Turn-Taking in Multi-Party Conversations for Large Language Models
Authors: Vihaan Nama, Shreya Mendi, Zian Ye, Brinnae Bent
Categories: cs.CL, cs.AI
Published: 2026-05-07
Comments: Currently under review. The dataset is available at https://huggingface.co/datasets/duke-trust-lab/When2Speak
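💡 One-Sentence Summary
Introduces the When2Speak dataset and its four-stage generation pipeline to address how large language models decide when to intervene in multi-party conversations.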
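🎯 Matched Areas: Pillar 2: RL Algorithms and Architecture (RL & Architecture); Pillar 9: Embodied Foundation Models
Keywords: large language models, multi-party conversation, conversational intervention, reinforcement learning, synthetic data, social intelligence, temporal decision-making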
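📋 Key Points
- Existing LLMs lack temporal awareness in multi-party conversations: responding at every turn causes excessive interruptions and degrades conversational coherence.
- The paper builds the When2Speak dataset with a four-stage pipeline that combines real-world grounding, structured augmentation, controlled transcript synthesis, and fine-tuning-ready supervision, explicitly modeling the intervention decision.
- Experiments show that supervised fine-tuning followed by reinforcement learning with asymmetric reward shaping raises intervention recall from 0.479 to 0.78-0.81, mitigating the over-conservative behavior of SFT-trained models.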
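📝 Summary
Large language models (LLMs) excel at generating contextually appropriate responses but often struggle to judge when to intervene in multi-party conversations. In these settings, responding at every turn leads to excessive interruptions and degraded conversational coherence. The paper presents When2Speak, a grounded synthetic dataset of over 215,000 examples derived from 16,000 conversations, built with a four-stage generation pipeline and covering diverse conversational styles with 2-6 speakers. It explicitly models the SPEAK vs. SILENT decision at each turn. Supervised fine-tuning (SFT) on When2Speak significantly outperforms zero-shot baselines, with an average Macro F1 increase of 60% across 4B+ parameter models. Because SFT-trained models remain over-conservative, the authors further apply reinforcement learning with asymmetric reward shaping, which reduces the Missed Intervention Rate (MIR) from an average of 0.50 to 0.186-0.218 and raises recall from 0.479 to 0.78-0.81.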
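🔬 Method Details
Problem definition: the paper addresses turn-taking, that is, intervention timing, for LLMs in multi-party conversations. Existing models often cannot judge when they should speak, so in group interactions they either interrupt unnecessarily or miss warranted opportunities to contribute.
Core idea: model the intervention decision as a binary classification task (SPEAK vs. SILENT) made at each turn. A large, diverse synthetic dataset lets the model learn conversational dynamics, and reinforcement learning corrects the over-conservative bias that remains after supervised fine-tuning.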
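To make the binary framing concrete, the following minimal sketch expands one transcript into per-turn SPEAK/SILENT examples. The `DecisionExample` type, the `expand_conversation` helper, the sample transcript, and the labeling rule (SPEAK when the assistant actually took the next turn) are all illustrative assumptions; this summary does not describe the released dataset schema or the authors' labeling procedure.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DecisionExample:
    context: List[str]  # all turns visible when the decision is made
    label: str          # "SPEAK" or "SILENT"


def expand_conversation(
    turns: List[Tuple[str, str]], assistant: str
) -> List[DecisionExample]:
    """Create one decision example after every non-assistant turn.

    Assumed labeling rule (hypothetical): the label is SPEAK if the
    assistant took the very next turn in the transcript, else SILENT.
    """
    examples: List[DecisionExample] = []
    context: List[str] = []
    for i, (speaker, text) in enumerate(turns):
        context.append(f"{speaker}: {text}")
        if speaker == assistant:
            continue  # no decision point after the assistant's own turn
        next_is_assistant = i + 1 < len(turns) and turns[i + 1][0] == assistant
        label = "SPEAK" if next_is_assistant else "SILENT"
        examples.append(DecisionExample(context=list(context), label=label))
    return examples


# Hypothetical three-party transcript, invented for illustration only.
transcript = [
    ("Ana", "Anyone know when the report is due?"),
    ("Ben", "Friday, I think."),
    ("Ana", "Assistant, can you confirm?"),
    ("Assistant", "It is due Friday at 5 pm."),
    ("Ben", "Thanks!"),
]
examples = expand_conversation(transcript, assistant="Assistant")
labels = [ex.label for ex in examples]
print(labels)
```

For this transcript the labels come out as SILENT, SILENT, SPEAK, SILENT: one conversation yields several decision examples, which is in line with the reported ratio of roughly 215,000 examples to 16,000 conversations.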
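Technical framework: the dataset is produced by a four-stage generation pipeline: (1) real-world grounding; (2) structured augmentation; (3) controlled transcript synthesis; (4) fine-tuning-ready supervision. The pipeline is designed to yield data that varies in conversational style, tone, and participant dynamics. Supervised fine-tuning and reinforcement learning are then applied to models trained on the resulting data; they are not pipeline stages themselves.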
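Key innovation: asymmetric reward shaping. To counter the high Missed Intervention Rate (MIR) that models show after SFT, reinforcement learning reshapes the reward so that the model intervenes more readily when an intervention is warranted, raising recall from 0.479 to 0.78-0.81.
Key design: the dataset contains over 215,000 examples derived from 16,000 conversations with 2-6 speakers. In the reinforcement learning stage, the asymmetric reward penalizes missed interventions, reducing MIR from an average of 0.50 to 0.186-0.218.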
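One way to picture asymmetric reward shaping is a reward that penalizes a missed intervention more heavily than an unnecessary one. The sketch below is an assumption-laden illustration, not the authors' reward: the function name `asymmetric_reward` and the values `correct_reward`, `interrupt_penalty`, and `miss_penalty` are hypothetical, since this summary does not give the actual reward function or its coefficients.

```python
def asymmetric_reward(
    predicted: str,
    gold: str,
    correct_reward: float = 1.0,     # hypothetical value
    interrupt_penalty: float = 1.0,  # hypothetical cost of speaking when SILENT was right
    miss_penalty: float = 2.0,       # hypothetical, larger cost of staying silent when SPEAK was right
) -> float:
    """Score one SPEAK/SILENT decision with unequal error costs."""
    if predicted == gold:
        return correct_reward
    if gold == "SPEAK":
        # Missed intervention: the error type that MIR counts.
        return -miss_penalty
    # Unnecessary intervention (an interruption).
    return -interrupt_penalty


# (predicted, gold) pairs for a small hypothetical batch.
batch = [("SILENT", "SPEAK"), ("SPEAK", "SILENT"), ("SPEAK", "SPEAK")]
rewards = [asymmetric_reward(p, g) for p, g in batch]
print(rewards)
```

With `miss_penalty` larger than `interrupt_penalty`, a policy that defaults to SILENT on uncertain turns earns less expected reward than one that speaks, which is the direction of pressure needed to lower MIR.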
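📊 Experimental Highlights
Across models with 4B or more parameters, SFT raises Macro F1 by 60% on average over zero-shot baselines, with the largest increase reaching 120%. SFT-trained models nevertheless miss nearly half of warranted interventions (average MIR of 0.50), even at larger model sizes. Adding reinforcement learning with asymmetric reward shaping lowers MIR to 0.186-0.218 and raises recall from 0.479 to 0.78-0.81.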
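The metrics above can be computed from per-turn predictions as in the following sketch. Macro F1 and recall follow their standard definitions. The MIR formula used here, the fraction of gold SPEAK turns predicted as SILENT, is an assumption inferred from the metric's name; this summary does not state the paper's exact definition. The helper name `decision_metrics` and the toy labels are likewise hypothetical.

```python
from typing import Dict, Sequence


def _f1(tp: int, fp: int, fn: int) -> float:
    """F1 for one class; returns 0.0 when the class never appears."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def decision_metrics(gold: Sequence[str], pred: Sequence[str]) -> Dict[str, float]:
    """Compute Macro F1, SPEAK recall, and an assumed MIR for SPEAK/SILENT labels."""
    tp = sum(g == "SPEAK" and p == "SPEAK" for g, p in zip(gold, pred))
    fn = sum(g == "SPEAK" and p == "SILENT" for g, p in zip(gold, pred))
    fp = sum(g == "SILENT" and p == "SPEAK" for g, p in zip(gold, pred))
    tn = sum(g == "SILENT" and p == "SILENT" for g, p in zip(gold, pred))

    f1_speak = _f1(tp, fp, fn)
    # For the SILENT class the roles of the error types swap.
    f1_silent = _f1(tn, fn, fp)

    warranted = tp + fn  # turns where speaking was the gold decision
    return {
        "macro_f1": (f1_speak + f1_silent) / 2,
        "recall": tp / warranted if warranted else 0.0,
        # Assumed definition: missed interventions / warranted interventions.
        "mir": fn / warranted if warranted else 0.0,
    }


# Toy labels, invented for illustration.
gold = ["SPEAK", "SPEAK", "SILENT", "SILENT"]
pred = ["SPEAK", "SILENT", "SILENT", "SILENT"]
metrics = decision_metrics(gold, pred)
print(metrics)
```

Under this assumed definition MIR equals one minus recall on a single evaluation set, which roughly agrees with the post-RL figures (MIR of 0.186-0.218 against recall of 0.78-0.81). The pre-RL figures (average MIR of 0.50 against recall of 0.479) do not match exactly, possibly because they are aggregated differently.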
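🎯 Application Scenarios
The work could be applied to voice assistants, multi-user collaborative robots, virtual social agents, and online tutoring systems. Better temporal awareness in multi-party conversations should make human-computer interaction feel more natural, for example in meeting note-taking, collaborative group tasks, and NPC interaction in immersive games.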
📄 Abstract (Original)
Large Language Models (LLMs) excel at generating contextually appropriate responses but remain poorly calibrated for multi-party conversations, where deciding when to speak is as critical as what to say. In such settings, naively responding at every turn leads to excessive interruptions and degraded conversational coherence. We introduce When2Speak, a grounded synthetic dataset and four-stage generation pipeline for learning intervention timing in group interactions. The dataset comprises over 215,000 examples derived from 16,000 conversations involving 2-6 speakers, spanning diverse conversational styles, tones, and participant dynamics, and explicitly modeling SPEAK vs. SILENT decisions at each turn. Our pipeline combines real-world grounding, structured augmentation, controlled transcript synthesis, and fine-tuning-ready supervision, and is fully open-sourced to support reproducibility and adaptation to domain-specific conversational norms. Across multiple model families, supervised fine-tuning (SFT) on When2Speak significantly outperforms zero-shot baselines (e.g., the average Macro F1 increase across 4B+ parameter models was 60%, with the largest increase being 120%). However, SFT-trained models remain systematically over-conservative, missing nearly half of warranted interventions as seen through the Missed Intervention Rate (MIR), which was on average 0.50 and is noticed even at larger model sizes. To address this limitation, we apply reinforcement learning with asymmetric reward shaping, which reduces MIR to 0.186-0.218 and increases recall from 0.479 to 0.78-0.81. Our findings establish that temporal participation is a distinct and trainable dimension of conversational intelligence, and that grounded synthetic data provides an effective and scalable pathway for enabling LLMs to participate more naturally and appropriately in multi-party interactions.