Digi-Q: Learning Q-Value Functions for Training Device-Control Agents

Authors: Hao Bai, Yifei Zhou, Li Erran Li, Sergey Levine, Aviral Kumar

Category: cs.LG

Published: 2025-02-13

Note: Accepted to ICLR 2025

🔗 Code/Project: https://github.com/DigiRL-agent/digiq

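💡 One-Sentence Takeaway

Digi-Q: learns a Q-value function to train device-control agents, improving policy learning from previously collected experience.

🎯 Matched Areas: Pillar 2: RL Algorithms & Architecture (RL & Architecture); Pillar 9: Embodied Foundation Models

Keywords: device control, reinforcement learning, offline learning, Q-function, vision-language model, policy learning, Android-in-the-Wild

📋 Key Points

Existing agent methods based on prompting or fine-tuning fall short in dynamic environments such as mobile device control, so a more effective policy learning method is needed.
Digi-Q learns a Q-function offline on top of VLM features, which avoids direct interaction with the environment, lowers interaction cost, and improves scalability.
On Android device-control tasks, Digi-Q improves on the prior best-performing method by 21.2%, and in some cases matches state-of-the-art RL methods that require online interaction.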

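📝 Abstract (Translated from Chinese)

Existing approaches for building foundation-model agents, such as prompting or fine-tuning on human demonstrations, are not sufficient in dynamic environments such as mobile device control. On-policy reinforcement learning (RL) could address these limitations, but collecting real rollouts is often undesirable in open-ended agentic problems, because every interaction carries a cost. This paper proposes Digi-Q, a method that learns an action-value Q-function so that policy learning can use off-policy experience. Digi-Q trains the Q-function with offline temporal-difference (TD) learning on top of frozen intermediate-layer features of a VLM. Compared with fine-tuning the whole VLM, this saves compute and improves scalability. To make the VLM features suitable for representing the Q-function, an initial fine-tuning phase is needed to amplify coverage of the actionable information that the value function requires. Once trained, the Q-function is used through a Best-of-N policy extraction operator, which imitates the best of multiple candidate actions sampled from the current policy, as ranked by the value function, and thereby improves the policy without environment interaction. On user-scale device-control tasks in Android-in-the-Wild, Digi-Q outperforms several prior methods, with a 21.2% improvement over the prior best-performing method. In some cases, Digi-Q already matches state-of-the-art RL methods that require interaction. The project is open source.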

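🔬 Method Details

Problem definition: The paper targets the high cost of environment interaction when training agents in dynamic environments such as mobile device control. Existing methods such as prompting and imitation learning struggle to adapt to complex environments, while the interaction cost of online reinforcement learning limits its use. The key challenge is therefore how to learn a policy efficiently from offline data.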

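Core idea: Learn an action-value function (Q-function) from offline data, then use that Q-function to guide policy improvement, so that no direct interaction with the environment is needed. Training the Q-function on intermediate-layer features of a pretrained vision-language model (VLM) reuses the VLM's knowledge and reduces compute cost.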
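As a rough sketch in standard notation (these are textbook forms, not necessarily the exact objectives used in the paper), let $f(s,a)$ denote the frozen VLM features of a state-action pair, $Q_\theta$ the learned Q-function, $\bar\theta$ a target copy of its parameters (an assumption), $\gamma$ a discount factor, $\pi$ the current policy, and $\mathcal{D}$ the offline dataset. The first line is a generic TD regression loss; the second is the Best-of-N selection over $N$ sampled candidates.

```latex
\mathcal{L}_{\mathrm{TD}}(\theta)
  = \mathbb{E}_{(s,a,r,s',a') \sim \mathcal{D}}
    \Big[ \big( Q_\theta(f(s,a)) - r - \gamma\, Q_{\bar\theta}(f(s',a')) \big)^2 \Big]

a^\star(s) = \arg\max_{a_i \in \{a_1,\dots,a_N\}} Q_\theta(f(s,a_i)),
  \qquad a_i \sim \pi(\cdot \mid s)
```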

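Technical framework: Digi-Q consists of four main stages: 1) fine-tune the VLM on the offline dataset so that its features better cover actionable information; 2) freeze the VLM parameters and extract intermediate-layer features; 3) train the Q-function on the VLM features with offline temporal-difference learning; 4) use the trained Q-function through the Best-of-N policy extraction operator to select the best action and improve the policy.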
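Stage 3 can be illustrated with a toy example. The sketch below is not the paper's implementation: it assumes a linear Q-head, a SARSA-style TD(0) target, and hand-written two-dimensional feature vectors standing in for frozen VLM features. All names (`q_value`, `td_sweep`, `train_q_head`, `toy_dataset`) are hypothetical.

```python
from typing import List, Optional, Tuple

Features = List[float]
# (features of (s, a), reward, features of (s', a') or None if the episode ended)
Transition = Tuple[Features, float, Optional[Features]]


def q_value(weights: Features, feats: Features) -> float:
    """Linear Q-head (an assumption): Q(s, a) = w . phi(s, a)."""
    return sum(w * x for w, x in zip(weights, feats))


def td_sweep(weights: Features, dataset: List[Transition],
             gamma: float, lr: float) -> Features:
    """One pass of semi-gradient TD(0) updates over the fixed offline dataset."""
    for feats, reward, next_feats in dataset:
        # Terminal transitions have no bootstrap term.
        bootstrap = 0.0 if next_feats is None else gamma * q_value(weights, next_feats)
        td_error = reward + bootstrap - q_value(weights, feats)
        # Only the Q-head weights change; the features stay frozen.
        weights = [w + lr * td_error * x for w, x in zip(weights, feats)]
    return weights


def train_q_head(dataset: List[Transition], dim: int, gamma: float = 0.9,
                 lr: float = 0.5, sweeps: int = 200) -> Features:
    """Train the Q-head by sweeping repeatedly over the dataset, with no new rollouts."""
    weights = [0.0] * dim
    for _ in range(sweeps):
        weights = td_sweep(weights, dataset, gamma, lr)
    return weights


# Toy two-step episode with one-hot stand-in features: reward 1 arrives at the end.
toy_dataset: List[Transition] = [
    ([1.0, 0.0], 0.0, [0.0, 1.0]),
    ([0.0, 1.0], 1.0, None),
]
q_weights = train_q_head(toy_dataset, dim=2)
print(q_weights)
```

With these one-hot features the learned weights approach 0.9 and 1.0, the discounted returns of the two steps. In a real system the feature vectors would come from the frozen VLM and the head would typically be a small neural network.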

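Key innovations: 1) intermediate-layer VLM features serve as the input to the Q-function, so feature representations need not be learned from scratch; 2) the Q-function is trained with offline temporal-difference learning, without environment interaction; 3) a Best-of-N policy extraction operator uses the Q-function to pick the best of several candidate actions, which yields policy improvement. Compared with existing methods, Digi-Q learns policies efficiently from offline data, lowers interaction cost, and improves scalability.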
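The Best-of-N operator can be sketched as follows. This is a minimal illustration under stated assumptions: the action set, the toy policy, and the lookup-table Q-function are invented stand-ins, and the function names are hypothetical rather than taken from the Digi-Q code base.

```python
import random
from typing import Callable, Dict, List, Tuple

ACTIONS = ["tap", "scroll", "type"]  # hypothetical action set


def toy_policy(state: str, rng: random.Random) -> str:
    """Stand-in for the current policy: samples a candidate action uniformly."""
    return rng.choice(ACTIONS)


# Stand-in for a trained Q-function, written as a lookup table.
TOY_Q: Dict[Tuple[str, str], float] = {
    ("home", "tap"): 1.0, ("home", "scroll"): 0.2, ("home", "type"): 0.1,
    ("search", "tap"): 0.1, ("search", "scroll"): 0.3, ("search", "type"): 0.9,
}


def toy_q(state: str, action: str) -> float:
    return TOY_Q[(state, action)]


def best_of_n(state: str,
              sample_action: Callable[[str, random.Random], str],
              q_fn: Callable[[str, str], float],
              n: int,
              rng: random.Random) -> str:
    """Sample n candidates from the policy and return the one the Q-function ranks highest."""
    candidates = [sample_action(state, rng) for _ in range(n)]
    return max(candidates, key=lambda a: q_fn(state, a))


def build_imitation_targets(states: List[str], n: int = 8,
                            seed: int = 0) -> List[Tuple[str, str]]:
    """Pair each offline state with its Best-of-N action, to be used as imitation targets."""
    rng = random.Random(seed)
    return [(s, best_of_n(s, toy_policy, toy_q, n, rng)) for s in states]


targets = build_imitation_targets(["home", "search"])
print(targets)
```

The resulting `(state, action)` pairs would then be used as supervised targets for the policy. No environment step is taken anywhere: candidates come from the policy and scores come from the Q-function.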

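Key design choices: 1) the initial VLM fine-tuning, which aims to strengthen the VLM's sensitivity to actionable information; 2) the network structure of the Q-function, which must use the VLM features effectively and estimate action values accurately; 3) the design of the Best-of-N policy extraction operator, which must balance exploration and exploitation so that it selects the best action while avoiding local optima.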

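📊 Experimental Highlights

Digi-Q was evaluated on device-control tasks in Android-in-the-Wild, where it outperforms several prior methods, a 21.2% improvement over the prior best-performing method. On some tasks, Digi-Q even matches state-of-the-art reinforcement learning methods that require online interaction. These results suggest that Digi-Q is an effective and practical way to learn policies from offline data.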

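🎯 Application Scenarios

Digi-Q has broad potential applications, including mobile device control, robot control, and human-computer interaction. Because it trains agents from offline data, it lowers the cost of environment interaction and speeds up agent deployment. The method could also be applied in areas such as education and healthcare, for example to train intelligent assistants that help users complete specific tasks.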

📄 Abstract (Original)

While a number of existing approaches for building foundation model agents rely on prompting or fine-tuning with human demonstrations, it is not sufficient in dynamic environments (e.g., mobile device control). On-policy reinforcement learning (RL) should address these limitations, but collecting actual rollouts in an environment is often undesirable in truly open-ended agentic problems such as mobile device control or interacting with humans, where each unit of interaction is associated with a cost. In such scenarios, a method for policy learning that can utilize off-policy experience by learning a trained action-value function is much more effective. In this paper, we develop an approach, called Digi-Q, to train VLM-based action-value Q-functions which are then used to extract the agent policy. We study our approach in the mobile device control setting. Digi-Q trains the Q-function using offline temporal-difference (TD) learning, on top of frozen, intermediate-layer features of a VLM. Compared to fine-tuning the whole VLM, this approach saves us compute and enhances scalability. To make the VLM features amenable for representing the Q-function, we need to employ an initial phase of fine-tuning to amplify coverage over actionable information needed for value function. Once trained, we use this Q-function via a Best-of-N policy extraction operator that imitates the best action out of multiple candidate actions from the current policy as ranked by the value function, enabling policy improvement without environment interaction. Digi-Q outperforms several prior methods on user-scale device control tasks in Android-in-the-Wild, attaining 21.2% improvement over prior best-performing method. In some cases, our Digi-Q approach already matches state-of-the-art RL methods that require interaction. The project is open-sourced at https://github.com/DigiRL-agent/digiq
