Streaming Interventions: Can Video Large Language Models Correct Mistakes as They Occur?

Authors: Apratim Bhattacharyya, Shweta Mahajan, Sanjay Haresh, Rajeev Yasarla, Reza Pourreza, Litian Liu, Risheek Garrepalli, Roland Memisevic

Categories: cs.CV, cs.LG

Published: 2026-06-08

Comments: Qualcomm Interactive Cooking: Ego-MC-Bench, available at https://huggingface.co/datasets/neuripsedtracksub/ego-mistake-corrections, and Ego-CoMist, available at https://huggingface.co/datasets/neuripsedtracksub/ego-counterfactual-mistakes

💡 One-Sentence Takeaway

Introduces Ego-MC-Bench and Ego-CoMist to address real-time mistake correction by video LLMs.

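🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: video LLMs, proactive intervention, data synthesis, cooking guidance, edge computing, multimodal learning, mistake correction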

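📋 Key Points

Existing video LLMs are weak at correcting mistakes in real time, and suitable training data is scarce, which limits their performance in practice.
The paper proposes the Ego-MC-Bench benchmark and the Ego-CoMist synthetic dataset, the latter supplying training examples that strengthen a model's ability to intervene proactively.
Fine-tuning on Ego-CoMist yields performance gains on cooking tasks, especially for smaller, more efficient video LLMs suited to edge devices.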

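📝 Abstract (translated from Chinese)

Learning everyday skills such as cooking relies increasingly on instructional media such as online videos. This opens the door to using video (and multimodal) large language models (LLMs) as task guidance assistants. To evaluate how well such an assistant can intervene proactively when the user makes a mistake, the paper introduces Ego-MC-Bench, a benchmark focused on step-by-step task guidance in realistic cooking scenarios. Experiments show that Ego-MC-Bench is highly challenging for existing video LLMs; the authors argue that a key reason is the lack of datasets suitable for fine-tuning. To address this data limitation, the paper also proposes Ego-CoMist, a counterfactual synthetic dataset created by transforming non-interactive cooking videos into examples of proactive intervention. Fine-tuning on Ego-CoMist improves performance, especially for small, efficient video LLMs suited to delivering assistance on edge devices.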

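🔬 Method Details

Problem definition: The paper addresses the limited ability of video LLMs to correct mistakes in real time as the user makes them. The main obstacle for existing methods is the lack of training datasets that contain mistakes together with appropriately timed interventions.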

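Core idea: Introduce the Ego-MC-Bench benchmark to measure this capability and the Ego-CoMist synthetic dataset to supply training examples, strengthening the model's proactive intervention ability and hence its usefulness in practice.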

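Technical framework: The work has two main components. Ego-MC-Bench evaluates models in cooking scenarios, and Ego-CoMist provides synthetic data containing proactive interventions. Models are fine-tuned on Ego-CoMist to improve their performance on the actual task.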
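As a rough illustration of what evaluating timed interventions could involve, the sketch below scores only *when* a model speaks up, not what it says. Everything here is an assumption made for illustration: the `Event` type, the `score_timing` function, the rule that a prediction counts only if it arrives within `tolerance` seconds after the reference time, and the 3-second default are hypothetical and are not taken from Ego-MC-Bench.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Event:
    time: float  # seconds from the start of the video
    text: str    # correction text (ignored by this timing-only score)


def score_timing(
    predicted: List[Event],
    reference: List[Event],
    tolerance: float = 3.0,  # hypothetical window, in seconds
) -> Dict[str, float]:
    """Match each reference intervention to the earliest unused prediction
    that occurs at or after it and no more than `tolerance` seconds later."""
    preds = sorted(predicted, key=lambda e: e.time)
    used = set()
    hits = 0
    for ref in sorted(reference, key=lambda e: e.time):
        for i, pred in enumerate(preds):
            if i in used:
                continue
            if 0.0 <= pred.time - ref.time <= tolerance:
                used.add(i)
                hits += 1
                break
    precision = hits / len(preds) if preds else 0.0
    recall = hits / len(reference) if reference else 0.0
    return {"hits": hits, "precision": precision, "recall": recall}


reference = [Event(10.0, "use salt, not sugar"), Event(30.0, "lower the heat")]
predicted = [Event(11.5, "that is sugar"), Event(20.0, "stir more"), Event(36.0, "heat is high")]
result = score_timing(predicted, reference)
```

In this toy run only the first prediction lands inside a window; the second is a false alarm and the third arrives too late. A fuller evaluation would also need to judge the content of each correction.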

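Key innovation: The main technical contribution is the Ego-CoMist synthetic dataset, which transforms non-interactive videos into supervised training examples and fills a gap in existing datasets. Compared with existing cooking video datasets, Ego-CoMist provides training data targeted at mistake correction and improves the model's ability to correct mistakes.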
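One way to picture a counterfactual transformation is sketched below. It is not the authors' pipeline: it assumes a video already comes with timestamped step annotations, and it turns a correctly performed step into a "mistake" by pairing it with a different, hypothetical instruction. The names `Step`, `InterventionExample`, `make_counterfactual`, and the `delay` parameter are all invented for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    start: float  # seconds
    end: float    # seconds
    action: str   # what the video actually shows


@dataclass
class InterventionExample:
    instruction: str     # counterfactual instruction the user was "given"
    observed: str        # action actually visible in the video
    intervene_at: float  # time at which the assistant should speak up
    message: str         # target correction text


def make_counterfactual(
    steps: List[Step],
    index: int,
    alt_instruction: str,
    delay: float = 1.0,  # hypothetical lag before the mistake is "apparent"
) -> InterventionExample:
    """Pair the observed step with a different instruction so that the
    observed action becomes a mistake relative to that instruction."""
    step = steps[index]
    if alt_instruction.strip().lower() == step.action.strip().lower():
        raise ValueError("alternative instruction must differ from the observed action")
    # Intervene shortly after the step begins, but never after it ends.
    intervene_at = min(step.start + delay, step.end)
    message = (
        f"The instruction was to {alt_instruction}, "
        f"but the video shows: {step.action}."
    )
    return InterventionExample(alt_instruction, step.action, intervene_at, message)


steps = [Step(0.0, 5.0, "crack two eggs"), Step(5.0, 12.0, "add sugar")]
example = make_counterfactual(steps, 1, "add salt")
```

The appeal of this kind of construction is that the video itself is untouched; only the instruction context and the supervision target change, so ordinary cooking footage can yield intervention examples.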

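Key design: During fine-tuning, a task-specific loss function is used to optimize mistake detection and intervention, and small, efficient network architectures suited to edge devices are used so that the model can run in resource-constrained environments. The specific loss and architectures are not detailed here.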

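📊 Experimental Highlights

Fine-tuning on Ego-CoMist improves the performance of video LLMs on Ego-MC-Bench, with the largest gains for small models; the exact size of the improvement is not reported in this summary. This points to their potential for deployment on edge devices.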

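🎯 Application Scenarios

Potential applications include smart kitchen assistants, online education platforms, and robot tutoring systems. With better real-time mistake correction, video LLMs can guide users more effectively as they learn new skills and improve the user experience. The same capability may later extend to other domains and make human-computer interaction more intelligent.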

📄 Abstract (Original)

Learning everyday skills, like cooking a dish, relies increasingly on instructional media such as online videos. This opens the door to the use of video (and multimodal) large language models (LLMs) as task guidance assistants. A crucial capability for the real-world success of a prospective task guidance assistant is its ability to intervene proactively as soon as a mistake is apparent in order to guide the user. To evaluate this crucial capability, we introduce Ego-MC-Bench (Mistake Corrections), a benchmark for evaluating reactive, step-by-step task guidance in realistic cooking scenarios. Extensive experiments show that Ego-MC-Bench is highly challenging for state-of-the-art video LLMs. We argue that a key reason is the limited availability of training data for fine-tuning models on this task. Although there exists a wide range of cooking video datasets, existing datasets lack examples of mistakes along with appropriately timed interventions. To help address this data limitation, we also introduce Ego-CoMist, a counterfactual synthetic dataset created by transforming non-interactive cooking videos into supervised training examples showing proactive interventions. We show that fine-tuning on Ego-CoMist yields performance gains especially for smaller and more efficient video LLMs that are well suited for delivering assistance on edge devices.
