FlexRec: Adapting LLM-based Recommenders for Flexible Needs via Reinforcement Learning

📄 arXiv: 2603.11901v1

Authors: Yijun Pan, Weikang Qiu, Qiyao Ma, Mingxuan Ju, Tong Zhao, Neil Shah, Rex Ying

Category: cs.LG

Published: 2026-03-12


💡 One-Sentence Takeaway

Proposes FlexRec to address the problem of flexible, dynamic needs in recommender systems

🎯 Matched areas: Pillar 2: RL Algorithms & Architecture (RL & Architecture); Pillar 9: Embodied Foundation Models

Keywords: recommender systems, reinforcement learning, dynamic needs, personalized recommendation, deep learning, uncertainty-aware, item-level reward, counterfactual swap

📋 Key Points

  1. Existing recommender systems are mostly optimized for a single static objective, struggle to adapt to dynamic needs, and thus deliver suboptimal recommendations.
  2. FlexRec addresses sparse rewards and noisy feedback by introducing an item-level reward based on counterfactual swaps and critic-guided, uncertainty-aware scaling.
  3. Across diverse recommendation scenarios, FlexRec substantially improves performance, raising NDCG@5 by up to 59% and Recall@5 by up to 109.4%.

📝 Abstract (Condensed)

Modern recommender systems must adapt to dynamic, need-specific objectives across diverse recommendation scenarios. However, most traditional recommenders are optimized for a single static objective and struggle to reconfigure their behavior on demand. This paper proposes FlexRec, an RL-based post-training framework that tackles the coarse credit assignment caused by sequence-level rewards as well as sparse, noisy feedback. By introducing an item-level reward based on counterfactual swaps and critic-guided, uncertainty-aware scaling, FlexRec substantially improves results across diverse recommendation scenarios, with gains of up to 59% in NDCG@5 and up to 109.4% in Recall@5.

🔬 Method Details

Problem definition: This paper targets the limited adaptability of traditional recommenders under dynamic needs. When facing complex recommendation objectives, existing methods typically rely on sequence-level rewards, which yield coarse credit assignment; combined with sparse and noisy feedback, this hurts both learning efficiency and stability.

Core idea: FlexRec applies RL-based post-training and optimizes the recommender with item-level rewards and an uncertainty-aware mechanism. Specifically, the item-level reward is based on counterfactual swaps and provides finer-grained training signals, while uncertainty awareness helps stabilize the learning process.
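A minimal sketch of how a counterfactual-swap reward might be computed, assuming relevance labels are available for the candidate pool and using DCG as the ranking metric; function and variable names are illustrative, not taken from the paper:

```python
import math

def dcg(ranking, relevance):
    """Discounted cumulative gain of a ranked list of item ids."""
    return sum(relevance[item] / math.log2(pos + 2)
               for pos, item in enumerate(ranking))

def item_level_rewards(ranking, relevance):
    """For each position, compare the actual list against counterfactual
    lists where that item is swapped with each later (remaining) candidate.
    A positive mean delta means keeping this item here beats the swaps."""
    base = dcg(ranking, relevance)
    rewards = []
    for i in range(len(ranking)):
        deltas = []
        for j in range(i + 1, len(ranking)):
            swapped = list(ranking)
            swapped[i], swapped[j] = swapped[j], swapped[i]
            deltas.append(base - dcg(swapped, relevance))
        rewards.append(sum(deltas) / len(deltas) if deltas else 0.0)
    return rewards
```

Unlike a single sequence-level score, this assigns each position its own signal, so an item that is well placed earns credit even when the rest of the list is poor.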

Technical framework: FlexRec comprises two main modules: an item-level reward module and an uncertainty-aware module. The reward module generates rewards by analyzing counterfactual swaps of items within the remaining candidate pool, while the uncertainty-aware module models the uncertainty of each reward and adjusts the learning process accordingly.
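One simple way to realize the down-weighting of low-confidence rewards is an exponential confidence weight over the critic's estimated reward variance; the exact form used by FlexRec is not given in this summary, so the weighting below is an illustrative choice:

```python
import numpy as np

def uncertainty_scaled_rewards(rewards, reward_variances, temperature=1.0):
    """Scale each item-level reward by a confidence weight that decays with
    the critic's estimated reward variance (illustrative exponential form)."""
    r = np.asarray(rewards, dtype=float)
    var = np.asarray(reward_variances, dtype=float)
    confidence = np.exp(-var / temperature)  # in (0, 1]; low variance -> near 1
    return r * confidence
```

Rewards the critic is sure about pass through almost unchanged, while noisy ones shrink toward zero, which is the stabilizing effect described above.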

Key innovation: FlexRec's key innovations are the counterfactual-swap item-level reward and the critic-guided, uncertainty-aware mechanism. This contrasts sharply with the sequence-level rewards of traditional methods, making the learning process more stable and efficient.

Key design: FlexRec carefully designs the reward computation and adopts a dedicated loss function to exploit the item-level reward feedback. Architecturally, it combines the strengths of reinforcement learning and deep learning to ensure the model's efficiency and accuracy.
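As a sketch of how such a loss could combine the two signals (the exact objective is not given in this summary, so the REINFORCE-style surrogate and all names below are hypothetical):

```python
import math

def per_item_pg_loss(logprobs, item_rewards, reward_vars):
    """REINFORCE-style sketch: each position's log-probability is weighted by
    its uncertainty-scaled item-level reward, then summed and negated so that
    minimizing the loss reinforces high-confidence, high-reward placements."""
    loss = 0.0
    for lp, r, v in zip(logprobs, item_rewards, reward_vars):
        confidence = math.exp(-v)     # down-weight uncertain rewards
        loss -= lp * r * confidence   # policy-gradient surrogate term
    return loss
```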

📊 Experimental Highlights

FlexRec performs strongly across diverse recommendation scenarios, improving NDCG@5 by up to 59% and Recall@5 by up to 109.4%. Compared with traditional recommenders and LLM-based baselines, FlexRec shows clear performance advantages, demonstrating its effectiveness under dynamic needs.

🎯 Application Scenarios

FlexRec has broad application potential in personalized recommendation, e-commerce, social media, and related domains. By adapting better to users' dynamic needs, it can improve user experience, increase engagement, and drive higher conversion rates for businesses. Going forward, the framework could also extend to other intelligent systems that must adapt dynamically.

📄 Abstract (Original)

Modern recommender systems must adapt to dynamic, need-specific objectives for diverse recommendation scenarios, yet most traditional recommenders are optimized for a single static target and struggle to reconfigure behavior on demand. Recent advances in reinforcement-learning-based post-training have unlocked strong instruction-following and reasoning capabilities in LLMs, suggesting a principled route for aligning them to complex recommendation goals. Motivated by this, we study closed-set autoregressive ranking, where an LLM generates a permutation over a fixed candidate set conditioned on user context and an explicit need instruction. However, applying RL to this setting faces two key obstacles: (i) sequence-level rewards yield coarse credit assignment that fails to provide fine-grained training signals, and (ii) interaction feedback is sparse and noisy, which together lead to inefficient and unstable updates. We propose FlexRec, a post-training RL framework that addresses both issues with (1) a causally grounded item-level reward based on counterfactual swaps within the remaining candidate pool, and (2) critic-guided, uncertainty-aware scaling that explicitly models reward uncertainty and down-weights low-confidence rewards to stabilize learning under sparse supervision. Across diverse recommendation scenarios and objectives, FlexRec achieves substantial gains: it improves NDCG@5 by up to 59% and Recall@5 by up to 109.4% in need-specific ranking, and further achieves up to 24.1% Recall@5 improvement under generalization settings, outperforming strong traditional recommenders and LLM-based baselines.