Gensors: Authoring Personalized Visual Sensors with Multimodal Foundation Models and Reasoning

Authors: Michael Xieyang Liu, Savvas Petridis, Vivian Tsai, Alexander J. Fiannaca, Alex Olwal, Michael Terry, Carrie J. Cai

Categories: cs.HC, cs.AI

Published: 2025-01-27

Venue: 30th International Conference on Intelligent User Interfaces (IUI'25), March 24-27, 2025, Cagliari, Italy. ACM, New York, NY, USA, 16 pages

DOI: 10.1145/3708359.3712085

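💡 One-Sentence Summary

Gensors: building personalized visual sensors with multimodal foundation models and their reasoning capabilities

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: multimodal large language models, personalized sensors, visual reasoning, human-computer interaction, requirements elicitation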

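📋 Key Points

With prompting alone, users struggle to define and debug personalized AI sensors, particularly when articulating their own requirements and spotting potential problems.
Gensors helps users define and debug sensors through automatically generated and manually created criteria, parallel testing of individual criteria, image-guided criteria suggestions, and stress-test cases.
In a user study, participants reported a significantly greater sense of control, understanding, and ease of communication when defining sensors with Gensors, and the system also exposed users' blind spots.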

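📝 Abstract (Translated)

With their broad world knowledge and reasoning capabilities, multimodal large language models (MLLMs) give end-users a unique opportunity to create personalized AI sensors that can reason about complex situations. A user can describe the desired sensing task in natural language (e.g., "alert if my toddler is getting into mischief"), and the MLLM analyzes the camera feed and responds within seconds. In a formative study, the authors found that users saw substantial value in defining their own sensors, yet struggled to articulate their unique personal requirements and to debug sensors through prompting alone. To address these challenges, they developed Gensors, a system that lets users define customized sensors backed by the reasoning capabilities of MLLMs. Gensors 1) helps users elicit requirements through automatically generated and manually created sensor criteria, 2) facilitates debugging by letting users isolate and test individual criteria in parallel, 3) suggests additional criteria based on user-provided images, and 4) proposes test cases that help users "stress test" sensors on potentially unforeseen scenarios. In a user study, participants reported a significantly greater sense of control, understanding, and ease of communication when defining sensors with Gensors. Beyond addressing model limitations, Gensors supported users in debugging, eliciting requirements, and expressing unique personal requirements to the sensor through criteria-based reasoning; it also helped uncover users' "blind spots" by exposing overlooked criteria and revealing unanticipated failure modes. Finally, the authors discuss how characteristics specific to MLLMs, such as hallucinations and inconsistent responses, can affect the sensor-creation process. These findings inform the design of future intelligent sensing systems that are intuitive and customizable for everyday users.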

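🔬 Method Details

Problem definition: The paper addresses the difficulty users face when creating and debugging personalized visual sensors with multimodal large language models (MLLMs). Existing approaches rely mainly on prompt engineering, which makes it hard for users to express their requirements fully, makes debugging difficult, and leaves room for unanticipated errors and blind spots.

Core idea: The paper decomposes the complex task of defining a sensor into a set of criteria that can each be tested and debugged independently. By giving users tools to define, test, and refine these criteria explicitly, the system increases their sense of control over and understanding of the sensor and reduces potential errors.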
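The criteria-based decomposition can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Criterion` and `Sensor` classes, the rule that the sensor fires when any enabled criterion holds, and the toy `keyword_judge` that stands in for an MLLM call are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-in for an MLLM call: (criterion text, scene description) -> verdict.
Judge = Callable[[str, str], bool]


@dataclass
class Criterion:
    text: str             # natural-language criterion, editable by the user
    enabled: bool = True  # lets a user isolate criteria while debugging


@dataclass
class Sensor:
    goal: str
    criteria: List[Criterion] = field(default_factory=list)

    def evaluate(self, scene: str, judge: Judge) -> Dict[str, object]:
        # Judge each enabled criterion on its own so failures can be traced to one criterion.
        per_criterion = {c.text: judge(c.text, scene) for c in self.criteria if c.enabled}
        # Assumed aggregation rule: the sensor fires if any enabled criterion holds.
        return {"fired": any(per_criterion.values()), "per_criterion": per_criterion}


def keyword_judge(criterion: str, scene: str) -> bool:
    # Toy judge (not an MLLM): true when the criterion's last word appears in the scene.
    return criterion.split()[-1].lower() in scene.lower()


sensor = Sensor(
    goal="alert if my toddler is getting into mischief",
    criteria=[
        Criterion("toddler is near the stove"),
        Criterion("toddler is climbing furniture"),
    ],
)
result = sensor.evaluate("a child standing next to the stove", keyword_judge)
print(result)
```

Keeping a per-criterion verdict alongside the overall decision is what makes the sensor inspectable: a user can see which criterion triggered an alert instead of guessing from a single prompt's output.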

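Technical framework: Gensors consists of four main modules. 1) Requirements elicitation: automatically generated and manually created sensor criteria help users express what they need. 2) Debugging: users can isolate individual criteria and test them in parallel to locate problems quickly. 3) Criteria suggestion: given user-provided images, the MLLM reasons about them and recommends additional criteria. 4) Stress testing: the system proposes test cases that help users discover unforeseen scenarios and errors. In the overall workflow, the user first defines an initial set of criteria, then refines them iteratively through debugging, suggestions, and stress testing until the sensor meets their needs.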
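The stress-testing step in this workflow can be sketched as follows. This is a hedged illustration, not the paper's implementation: in Gensors the test cases are proposed by the system, whereas here `cases` is a hand-written list, and `stress_test` and `naive_sensor` are hypothetical names with a keyword rule standing in for an MLLM-backed sensor.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical sensor interface: scene description -> True (alert) or False.
SensorFn = Callable[[str], bool]


def stress_test(sensor: SensorFn, cases: List[Tuple[str, bool]]) -> List[Dict[str, object]]:
    """Run the sensor on (scene, expected) pairs and return only the mismatches."""
    failures = []
    for scene, expected in cases:
        actual = sensor(scene)
        if actual != expected:
            failures.append({"scene": scene, "expected": expected, "actual": actual})
    return failures


def naive_sensor(scene: str) -> bool:
    # Toy rule (not an MLLM): alert whenever a stove is mentioned.
    return "stove" in scene.lower()


# Hand-written stand-ins for system-proposed test cases, with the user's expected verdicts.
cases = [
    ("toddler reaching for a hot stove", True),
    ("adult cooking at the stove, toddler asleep in another room", False),
    ("toddler climbing a bookshelf", True),
]
failures = stress_test(naive_sensor, cases)
for failure in failures:
    print(failure)
```

The two mismatches correspond to the kinds of blind spots the summary describes: a false alarm caused by an over-broad criterion, and a missed hazard that no criterion covers yet.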

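Key innovation: Gensors's main contribution is its criteria-based reasoning approach, which breaks the complex task of defining a sensor into manageable, debuggable units. Compared with conventional prompt engineering, this approach is more structured and interpretable, and it better supports users in expressing personal requirements and finding potential problems. The system also integrates several assistive tools (automatic criteria generation, image-guided suggestions, and stress testing) that further improve the user experience.

Key design: The main design decisions in Gensors are: 1) Criteria representation: criteria are written in natural language, so users can read and edit them easily. 2) Parallel testing: users can test multiple criteria at the same time, which speeds up debugging. 3) Image-guided suggestions: the MLLM analyzes user-provided images, extracts salient features, and generates related criteria. 4) Stress-test case generation: based on the user-defined criteria, the system automatically generates challenging test cases that help users uncover potential errors.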
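A minimal sketch of the parallel testing idea, assuming each criterion is judged by an independent model call. The function `check_criteria_in_parallel`, the `Judge` signature, and the substring-matching `toy_judge` are hypothetical; a real system would replace `toy_judge` with an MLLM request over a camera frame.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Hypothetical stand-in for an MLLM call: (criterion, scene description) -> verdict.
Judge = Callable[[str, str], bool]


def check_criteria_in_parallel(
    criteria: List[str], scene: str, judge: Judge, max_workers: int = 4
) -> Dict[str, bool]:
    """Evaluate every criterion independently and concurrently against one scene."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so verdicts line up with criteria.
        verdicts = list(pool.map(lambda criterion: judge(criterion, scene), criteria))
    return dict(zip(criteria, verdicts))


def toy_judge(criterion: str, scene: str) -> bool:
    # Toy judge (not an MLLM): true when the criterion text appears in the scene.
    return criterion.lower() in scene.lower()


verdicts = check_criteria_in_parallel(
    ["stove", "scissors", "stairs"],
    "a toddler holding scissors near the stairs",
    toy_judge,
)
print(verdicts)
```

Threads are a reasonable fit under the assumption that each judgment is a network-bound model request; the per-criterion dictionary then shows at a glance which criteria passed and which did not.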

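📊 Experimental Highlights

In the user study, participants reported a significantly greater sense of control, understanding, and ease of communication when defining sensors with Gensors. Gensors also helped users find overlooked criteria and unanticipated failure modes, which can in turn make the resulting sensors more reliable. These results suggest that Gensors is an effective tool for building personalized visual sensors.

🎯 Application Scenarios

Gensors has potential applications in areas such as smart homes, security monitoring, and driver assistance. Users could customize sensors to their own needs, for example to monitor child safety, detect unusual behavior, or recognize specific objects. The work may help make AI technology more accessible, so that everyday users can create and use intelligent sensors without specialist expertise.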

📄 Abstract (Original)

Multimodal large language models (MLLMs), with their expansive world knowledge and reasoning capabilities, present a unique opportunity for end-users to create personalized AI sensors capable of reasoning about complex situations. A user could describe a desired sensing task in natural language (e.g., "alert if my toddler is getting into mischief"), with the MLLM analyzing the camera feed and responding within seconds. In a formative study, we found that users saw substantial value in defining their own sensors, yet struggled to articulate their unique personal requirements and debug the sensors through prompting alone. To address these challenges, we developed Gensors, a system that empowers users to define customized sensors supported by the reasoning capabilities of MLLMs. Gensors 1) assists users in eliciting requirements through both automatically-generated and manually created sensor criteria, 2) facilitates debugging by allowing users to isolate and test individual criteria in parallel, 3) suggests additional criteria based on user-provided images, and 4) proposes test cases to help users "stress test" sensors on potentially unforeseen scenarios. In a user study, participants reported significantly greater sense of control, understanding, and ease of communication when defining sensors using Gensors. Beyond addressing model limitations, Gensors supported users in debugging, eliciting requirements, and expressing unique personal requirements to the sensor through criteria-based reasoning; it also helped uncover users' "blind spots" by exposing overlooked criteria and revealing unanticipated failure modes. Finally, we discuss how unique characteristics of MLLMs--such as hallucinations and inconsistent responses--can impact the sensor-creation process. These findings contribute to the design of future intelligent sensing systems that are intuitive and customizable by everyday users.
