Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning
Authors: Yimeng Zhang, Tian Wang, Jiri Gesi, Ziyi Wang, Yuxuan Lu, Jiacheng Lin, Sinong Zhan, Vianne Gao, Ruochen Jiao, Junze Liu, Kun Qian, Yuxin Tang, Ran Xue, Houyu Zhang, Qingjun Cui, Yufan Guo, Dakuo Wang
Category: cs.CL
Published: 2025-07-23
💡 One-Sentence Takeaway
Proposes the Shop-R1 framework to strengthen LLMs' ability to simulate human behavior in online shopping.
🎯 Matched Areas: Pillar 2: RL Algorithms & Architecture; Pillar 9: Embodied Foundation Models
Keywords: large language models, reinforcement learning, human behavior simulation, online shopping, reasoning enhancement
📋 Key Points
- Existing approaches are limited in how far they can enhance LLM reasoning ability, and the quality of the generated rationales directly bounds the accuracy of downstream action prediction.
- Shop-R1 decomposes the human behavior simulation task into two stages, rationale generation and action prediction, each guided by a distinct reward signal.
- Experiments show that Shop-R1 achieves a relative improvement of over 65% compared to the baseline, demonstrating its effectiveness.
📝 Abstract (Summary)
Large Language Models (LLMs) have shown strong potential for generating "believable human-like" behavior. Prior work enhanced reasoning by augmenting training data and applying supervised fine-tuning (SFT), but the performance of such approaches is bounded by the reasoning capability of the model used to generate the rationales. This paper proposes Shop-R1, a novel reinforcement learning framework that improves the reasoning ability of LLMs for simulating real human behavior in online shopping environments. Shop-R1 decomposes the human behavior simulation task into two stages, rationale generation and action prediction, each guided by a distinct reward signal. Experiments show that the method achieves a relative improvement of over 65% compared to the baseline.
🔬 Method Details
Problem Definition: This work addresses the limited reasoning ability of existing methods for simulating human online-shopping behavior. Prior approaches depend on the capability of the model used to generate rationales, which bounds their performance.
Core Idea: Shop-R1 strengthens LLM reasoning via reinforcement learning, handling rationale generation and action prediction in separate stages with distinct reward signals to improve learning.
Technical Framework: The overall architecture consists of two main modules: a rationale generation module and an action prediction module. The rationale generation module uses internal model signals for self-supervised learning, while the action prediction module uses a hierarchical reward structure for fine-grained reward assignment.
Key Innovation: The central innovation is decomposing the human behavior simulation task into two stages and designing a dedicated reward mechanism for each, in contrast to the single reward signal used by existing methods.
Key Designs: In the rationale generation stage, logit distributions guide the reasoning process; in the action prediction stage, a difficulty-aware reward-scaling mechanism ensures fine-grained and effective reward assignment.
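The two-stage reward design described above can be sketched in Python. This is a minimal illustration, not the paper's exact formulation: the mean-log-probability scoring and the `exp()` mapping to a bounded reward are assumptions standing in for the internal logit-distribution signal.

```python
import math

def rationale_reward(token_probs):
    """Self-supervised rationale reward (sketch): score a generated
    rationale by the model's own confidence, i.e. the mean
    log-probability of its tokens, as would be read off the logit
    distributions. The exp() mapping to (0, 1] is illustrative."""
    if not token_probs:
        return 0.0
    avg_logprob = sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_logprob)

# A rationale the model is confident about earns a higher reward.
confident = rationale_reward([0.9, 0.8, 0.95])
hesitant = rationale_reward([0.3, 0.2, 0.4])
assert confident > hesitant
```

In practice the per-token probabilities would come from the policy model itself, so the reward requires no external labels, which is what makes this stage self-supervised.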
🖼️ Key Figures
📊 Experimental Highlights
Experiments show that Shop-R1 achieves a relative improvement of over 65% compared to the baseline, validating its effectiveness for simulating online shopping behavior.
🎯 Applications
Potential application areas include online shopping platforms, recommender systems, and virtual assistants. Improving LLMs' ability to simulate human behavior can noticeably improve user experience and interaction quality, advancing intelligent shopping and personalized services.
📄 Abstract (Original)
Large Language Models (LLMs) have recently demonstrated strong potential in generating 'believable human-like' behavior in web environments. Prior work has explored augmenting training data with LLM-synthesized rationales and applying supervised fine-tuning (SFT) to enhance reasoning ability, which in turn can improve downstream action prediction. However, the performance of such approaches remains inherently bounded by the reasoning capabilities of the model used to generate the rationales. In this paper, we introduce Shop-R1, a novel reinforcement learning (RL) framework aimed at enhancing the reasoning ability of LLMs for simulation of real human behavior in online shopping environments. Specifically, Shop-R1 decomposes the human behavior simulation task into two stages: rationale generation and action prediction, each guided by distinct reward signals. For rationale generation, we leverage internal model signals (e.g., logit distributions) to guide the reasoning process in a self-supervised manner. For action prediction, we propose a hierarchical reward structure with difficulty-aware scaling to prevent reward hacking and enable fine-grained reward assignment. This design evaluates both high-level action types and the correctness of fine-grained sub-action details (attributes and values), rewarding outputs proportionally to their difficulty. Experimental results show that our method achieves a relative improvement of over 65% compared to the baseline.
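The hierarchical, difficulty-aware action reward described in the abstract might look like the following sketch. The credit weights (0.5 for the action type, 0.5 split across attributes), the dict schema, and the scalar `difficulty` factor are illustrative assumptions, not the paper's exact formula.

```python
def action_reward(pred, gold, difficulty=1.0):
    """Hierarchical action reward (sketch): coarse credit for predicting
    the correct high-level action type, plus fine-grained credit for
    each matching sub-action attribute/value pair, scaled by a
    difficulty factor so harder actions earn proportionally more."""
    if pred.get("type") != gold.get("type"):
        return 0.0  # wrong action type: no credit, which deters reward hacking
    reward = 0.5  # base credit for the correct high-level action type
    gold_attrs = gold.get("attrs", {})
    pred_attrs = pred.get("attrs", {})
    if gold_attrs:
        matches = sum(1 for k, v in gold_attrs.items() if pred_attrs.get(k) == v)
        reward += 0.5 * matches / len(gold_attrs)
    else:
        reward += 0.5  # no sub-action details to check
    return reward * difficulty

gold = {"type": "click", "attrs": {"element": "add_to_cart", "item": "B00X"}}
exact = action_reward({"type": "click", "attrs": dict(gold["attrs"])}, gold)
partial = action_reward({"type": "click", "attrs": {"element": "add_to_cart"}}, gold)
assert exact > partial > action_reward({"type": "scroll"}, gold)
```

The key property this sketch preserves is graded credit: an exactly correct action outscores a partially correct one, and a wrong action type earns nothing, so the policy cannot harvest reward from cheap, always-available actions.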