OmniActions: Predicting Digital Actions in Response to Real-World Multimodal Sensory Inputs with LLMs

Authors: Jiahao Nick Li, Yan Xu, Tovi Grossman, Stephanie Santosa, Michelle Li

Categories: cs.HC, cs.AI

Published: 2024-05-06

Comments: Paper accepted to the 2024 CHI Conference on Human Factors in Computing Systems (CHI 2024)

💡 One-Sentence Summary

OmniActions: using LLMs to predict digital actions in response to real-world multimodal inputs

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: multimodal input, action prediction, large language models, human-computer interaction, augmented reality

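📋 Key Points

When users are busy, existing interfaces struggle to offer the relevant digital actions quickly based on multimodal information, which creates interaction friction.
OmniActions uses large language models to process multimodal inputs and, grounded in a predefined design space, predicts the digital actions a user is likely to take.
Data was collected through a diary study, several LLM techniques were evaluated, and a prototype system was built to gather user feedback and validate the approach.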

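📝 Abstract (Translated from Chinese)

The move toward "Pervasive Augmented Reality" envisions effortless access to continuous multimodal information. In many everyday scenarios, however, users are physically, cognitively, or socially occupied. This can increase the friction of acting on the multimodal information they encounter in the world. To reduce this friction, future interactive interfaces should intelligently provide quick access to digital actions based on the user's context. To explore the range of possible digital actions, we conducted a diary study that asked participants to capture and share the media they intended to act on (for example, images or audio), along with their desired actions and other contextual information. Using this data, we generated a holistic design space of digital follow-up actions that can be performed in response to different types of multimodal sensory input. We then designed OmniActions, a pipeline driven by large language models (LLMs) that processes multimodal sensory inputs and predicts follow-up actions on the target information based on the derived design space. Using the empirical data collected in the diary study, we quantitatively evaluated three variants of LLM techniques (intent classification, in-context learning, and fine-tuning) and identified the technique best suited to our task. In addition, as an instantiation of the pipeline, we developed an interactive prototype and report preliminary user feedback on how people perceive and respond to the action predictions and their errors.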

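🔬 Method Details

Problem definition: The paper addresses how users can quickly and conveniently trigger the appropriate digital action when they encounter multimodal information. The pain point of existing approaches is that when users are physically, cognitively, or socially occupied, interacting with the digital world involves high friction, so a more intelligent interface is needed to predict and offer the digital actions users want.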

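Core idea: The paper uses large language models (LLMs) to understand multimodal inputs (such as images and audio) and predict the follow-up digital actions a user is likely to take. By building a design space that covers the range of possible actions, the LLM can use contextual information to recommend the actions the user needs more accurately. The approach aims to reduce the number of manual steps and make interaction more efficient.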

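Technical framework: The OmniActions pipeline consists of the following main stages: 1) Multimodal sensory input: receive images, audio, and other information from the user. 2) LLM processing: use an LLM to understand and analyze the input. 3) Action prediction: predict the digital actions the user is likely to take, based on the predefined design space. 4) Result presentation: show the predicted digital action options to the user. The paper evaluates three LLM techniques: intent classification, in-context learning, and fine-tuning.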
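
The four stages above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the action labels in `DESIGN_SPACE` are hypothetical placeholders borrowed from the restaurant example on this page, `SensoryInput` assumes the captured media has already been turned into a text description by some upstream model, and `fake_llm` is a stand-in for a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical action labels for illustration; NOT the design space from the paper.
DESIGN_SPACE = ["view menu", "reserve a table", "share to social media"]


@dataclass
class SensoryInput:
    """Stage 1: one captured piece of media (assumed already described as text)."""
    modality: str      # e.g. "image" or "audio"
    description: str   # text description of the media, produced upstream


def build_prompt(inp: SensoryInput, design_space: List[str]) -> str:
    """Stage 2: turn the input and the allowed actions into an LLM prompt."""
    options = "\n".join(f"- {action}" for action in design_space)
    return (
        f"The user captured {inp.modality} content: {inp.description}\n"
        f"Rank the most likely follow-up actions from this list:\n{options}\n"
        "Answer with one action per line."
    )


def predict_actions(inp: SensoryInput, llm: Callable[[str], str],
                    design_space: List[str], top_k: int = 3) -> List[str]:
    """Stage 3: query the LLM and keep only answers that are in the design space."""
    raw = llm(build_prompt(inp, design_space))
    ranked: List[str] = []
    for line in raw.splitlines():
        action = line.strip().lstrip("- ").strip().lower()
        if action in design_space and action not in ranked:
            ranked.append(action)
    return ranked[:top_k]


def render_options(actions: List[str]) -> str:
    """Stage 4: format the predictions as a numbered list for display."""
    return "\n".join(f"{i}. {action}" for i, action in enumerate(actions, start=1))


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM: a fixed ranking plus one out-of-space label."""
    return "reserve a table\nview menu\norder a taxi"


photo = SensoryInput(modality="image", description="a restaurant storefront")
print(render_options(predict_actions(photo, fake_llm, DESIGN_SPACE)))
```

Filtering the model's answer against the design space is one simple way to keep predictions within the set of supported actions; whether the actual pipeline does this is not stated here.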

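Key innovations: 1) A comprehensive design space of digital actions, which provides the basis for the LLM's predictions. 2) The OmniActions pipeline, which applies LLMs to action prediction from multimodal input. 3) Real user data collected through a diary study, used to train and evaluate the LLMs. Compared with existing approaches, OmniActions focuses more on what users need in real-world situations and relies on the capabilities of LLMs for intelligent prediction.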

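Key design: The paper collects data through a diary study and uses it to build the design space of digital actions. The LLM is applied with different strategies (intent classification, in-context learning, and fine-tuning) and optimized for the specific task. Technical details such as parameter settings, loss functions, and network architecture may not be described in detail in the paper and are unknown here.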
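
Of the three strategies, in-context learning is the easiest to illustrate without any training: a few labeled examples are placed in the prompt ahead of the new query. The sketch below is a generic few-shot prompt builder, not the prompt used in the paper. The function name `build_few_shot_prompt`, the prompt wording, and the example entries are all hypothetical.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]],
                          query: str,
                          actions: List[str]) -> str:
    """Build a few-shot prompt from (context, action) pairs.

    `examples` plays the role of labeled diary entries; the prompt ends with
    an open "Action:" line for the model to complete.
    """
    lines = [
        "Choose the follow-up action the user most likely wants.",
        "Allowed actions: " + ", ".join(actions),
        "",
    ]
    for context, action in examples:
        lines.append(f"Context: {context}")
        lines.append(f"Action: {action}")
        lines.append("")
    lines.append(f"Context: {query}")
    lines.append("Action:")
    return "\n".join(lines)


# Hypothetical example entries, for illustration only.
demo_examples = [
    ("photo of a restaurant storefront", "view menu"),
    ("photo of a friend's birthday cake", "share to social media"),
]
demo_actions = ["view menu", "reserve a table", "share to social media"]
print(build_few_shot_prompt(demo_examples, "photo of a cafe entrance", demo_actions))
```

The returned string would then be sent to whichever LLM is in use; the model's completion after the final "Action:" is read as the prediction.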

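📊 Experimental Highlights

The paper experimentally compares three LLM techniques (intent classification, in-context learning, and fine-tuning) on the action prediction task. Concrete performance figures and improvement margins are not given in the abstract and are unknown here. The results do show that the techniques perform differently within the OmniActions pipeline, which provides a reference point for future research.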
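
Since the evaluation metric is not stated on this page, the following is only one plausible way such a comparison could be scored: top-k accuracy, which counts a prediction as correct when the user's desired action appears among the first k ranked suggestions. The function name and the toy data are hypothetical and do not reproduce any result from the paper.

```python
from typing import List, Sequence


def top_k_accuracy(predictions: Sequence[List[str]],
                   gold: Sequence[str],
                   k: int = 3) -> float:
    """Fraction of samples whose gold action is among the first k predictions."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(1 for ranked, target in zip(predictions, gold) if target in ranked[:k])
    return hits / len(gold)


# Toy data for illustration only; not results from the paper.
toy_predictions = [
    ["view menu", "reserve a table"],
    ["share to social media", "view menu"],
    ["reserve a table", "share to social media"],
]
toy_gold = ["view menu", "view menu", "view menu"]

print(top_k_accuracy(toy_predictions, toy_gold, k=1))
print(top_k_accuracy(toy_predictions, toy_gold, k=2))
```

Running the same function over the ranked outputs of each technique on a shared test set would give directly comparable scores.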

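🎯 Application Scenarios

OmniActions could be applied in intelligent assistants, augmented reality applications, smart homes, and similar areas. For example, after a user photographs a restaurant, the system could automatically suggest actions such as "view menu", "reserve a table", or "share to social media". The work helps make human-computer interaction more efficient and convenient, and may contribute to more natural and seamless digital experiences in the future.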

📄 Abstract (Original)

The progression to "Pervasive Augmented Reality" envisions easy access to multimodal information continuously. However, in many everyday scenarios, users are occupied physically, cognitively or socially. This may increase the friction to act upon the multimodal information that users encounter in the world. To reduce such friction, future interactive interfaces should intelligently provide quick access to digital actions based on users' context. To explore the range of possible digital actions, we conducted a diary study that required participants to capture and share the media that they intended to perform actions on (e.g., images or audio), along with their desired actions and other contextual information. Using this data, we generated a holistic design space of digital follow-up actions that could be performed in response to different types of multimodal sensory inputs. We then designed OmniActions, a pipeline powered by large language models (LLMs) that processes multimodal sensory inputs and predicts follow-up actions on the target information grounded in the derived design space. Using the empirical data collected in the diary study, we performed quantitative evaluations on three variations of LLM techniques (intent classification, in-context learning and finetuning) and identified the most effective technique for our task. Additionally, as an instantiation of the pipeline, we developed an interactive prototype and reported preliminary user feedback about how people perceive and react to the action predictions and its errors.
