Multi-Dimensional Behavioral Evaluation of Agentic Stock Prediction Systems Using LLM Judges with Closed-Loop Reinforcement Learning Feedback
Authors: Mohammad Al Ridhawi, Mahtab Haj Ali, Hussein Al Osman
Categories: cs.LG, cs.AI, cs.CL, q-fin.CP
Published: 2026-05-07
Comments: 9 pages, 2 figures, 8 tables. Short Communication submitted to Knowledge-Based Systems (Elsevier)
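💡 One-Sentence Summary
Proposes a behavioral evaluation framework for agentic stock prediction systems that combines LLM judges with closed-loop reinforcement learning feedback.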
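🎯 Matched areas: Pillar 2: RL Algorithms and Architecture (RL & Architecture); Pillar 9: Embodied Foundation Models
Keywords: agentic systems, reinforcement learning, LLM evaluation, quantitative finance, behavioral auditing, decision optimization, Soft Actor-Critic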
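📋 Key Points
- Existing stock prediction systems are assessed only with aggregate metrics such as MAPE, which cannot reveal where in an agent's chain of interdependent decisions performance breaks down.
- Proposes a behavioral evaluation framework built on an ensemble of LLM judges that scores decision traces along six behavioral dimensions and feeds the scores back into a Soft Actor-Critic (SAC) reinforcement learning loop.
- Experiments show improved forecast accuracy and Sharpe ratio, with gains concentrated in high-volatility episodes, closing the loop from behavioral evaluation to policy optimization.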
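📝 Abstract (translated from Chinese)
Agentic stock prediction systems make a series of interdependent decisions (regime detection, pathway routing, reinforcement learning control), and aggregate metrics such as mean absolute percentage error (MAPE) or directional accuracy tend to hide the quality of each individual decision. This paper proposes a behavioral evaluation framework to address that gap. Behavioral traces logged at every autonomous decision point are grouped into five-day episodes and scored by an ensemble of three large language model (LLM) judges along six domain-specific dimensions (regime detection, routing, adaptation, risk calibration, strategy coherence, error recovery). Perturbation-based validation shows targeted score drops of -1.6 to -2.4 on the intended dimensions, with much smaller effects on the remaining dimensions, and cross-model agreement of up to Krippendorff's α = 0.85. The composite behavioral score correlates with the 20-day Sharpe ratio at ρ = 0.72. Closing the loop, the framework converts deficient per-dimension scores into a credit-assigned penalty term added to the Soft Actor-Critic (SAC) reward. On the held-out 2017-2025 test period, this reduces MAPE by 11.5% in relative terms, raises directional accuracy to 74%, and improves the Sharpe ratio by 18%, with gains concentrated in high-volatility episodes.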
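🔬 Method Details
Problem definition: When an agentic stock prediction system makes sequential decisions, conventional aggregate metrics (such as MAPE) cannot expose flaws in individual decision steps such as regime detection and pathway routing, which makes targeted improvement of agent behavior difficult.
Core idea: Introduce a "behavioral evaluation" paradigm in which LLMs act as expert judges that assess the agent's decision traces qualitatively and quantitatively along multiple dimensions, and convert the evaluation results into a penalty on the reinforcement learning reward so that behavior can be improved in a targeted way.
Technical framework: The framework has an evaluation stage and an optimization stage. In the evaluation stage, an ensemble of LLM judges (GPT 5.4, Claude 4.6 Opus, Gemini 3.1 Pro) scores five-day decision episodes along six dimensions. In the optimization stage, deficient dimension scores are converted into a penalty term in the SAC reward, and the agent's policy is updated over three short fine-tuning cycles confined to the validation period.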
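The evaluation stage can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the trace representation, the simple mean across judges, the handling of leftover days, and every name (`DIMENSIONS`, `group_into_episodes`, `ensemble_scores`) are hypothetical, and the judge callables are stand-ins for real LLM calls.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence

# The six dimensions named in the abstract.
DIMENSIONS = (
    "regime_detection", "routing", "adaptation",
    "risk_calibration", "strategy_coherence", "error_recovery",
)

Trace = dict  # one logged decision point (hypothetical shape)
Judge = Callable[[Sequence[Trace]], Dict[str, float]]  # stand-in for an LLM judge


def group_into_episodes(daily_traces: List[List[Trace]], days: int = 5) -> List[List[Trace]]:
    """Concatenate the traces of each consecutive block of `days` trading days.

    An incomplete trailing block is dropped (an assumption of this sketch).
    """
    episodes = []
    for start in range(0, len(daily_traces) - days + 1, days):
        block = daily_traces[start:start + days]
        episodes.append([trace for day in block for trace in day])
    return episodes


def ensemble_scores(episode: Sequence[Trace], judges: Sequence[Judge]) -> Dict[str, float]:
    """Average each dimension's score across judges (assumed aggregation rule)."""
    per_judge = [judge(episode) for judge in judges]
    return {d: mean(scores[d] for scores in per_judge) for d in DIMENSIONS}
```

In practice each judge would wrap a prompt and a model call; keeping judges as plain callables lets the aggregation be tested without any model access.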
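Key innovation: LLM judges are embedded as a behavioral auditing tool inside the reinforcement learning loop. The chain from behavioral traces to dimension scores to reward correction targets the interpretability and targeted-improvement problems that arise when a complex agentic system is optimized as a black box.
Key design: Perturbation-based validation checks that each score is sensitive to its own dimension. Deficient dimension scores are mapped to a credit-assigned penalty term in the SAC reward; the resulting gains concentrate in high-volatility episodes, where the original system was most behaviorally deficient.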
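One way to picture the mapping from dimension scores to a reward penalty is the sketch below. The source does not give the functional form, so everything here is an assumption: the deficiency threshold, the linear shortfall, the penalty weight, and the proportional credit split across time steps are illustrative choices, and the function names are hypothetical.

```python
from typing import Dict, List


def dimension_penalty(scores: Dict[str, float], threshold: float = 6.0,
                      weight: float = 0.1) -> float:
    """Weighted sum of shortfalls below `threshold`; zero when nothing is deficient.

    `threshold` and `weight` are placeholder values, not taken from the paper.
    """
    return weight * sum(max(0.0, threshold - s) for s in scores.values())


def shaped_rewards(rewards: List[float], credit: List[float], penalty: float) -> List[float]:
    """Subtract the episode penalty from per-step rewards in proportion to `credit`."""
    if len(rewards) != len(credit):
        raise ValueError("rewards and credit must have the same length")
    total = sum(credit)
    if total <= 0:
        raise ValueError("credit weights must sum to a positive value")
    return [r - penalty * c / total for r, c in zip(rewards, credit)]
```

The shaped rewards would then replace the raw rewards in the transitions used for SAC updates; steps judged more responsible for a deficiency (larger `credit`) absorb more of the penalty.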
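📊 Experimental Highlights
On the held-out 2017-2025 test period, one-day MAPE falls from 0.61% to 0.54% (an 11.5% relative reduction; p<0.001, Cohen's d=0.31), and directional accuracy rises by 3 percentage points, from 71% to 74%. The most notable result is an 18% improvement in Sharpe ratio (95% bootstrap CI [8.2%, 27.4%]), with gains concentrated in the high-volatility episodes where the original system was most behaviorally deficient. All results are from offline backtesting.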
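The relative and absolute changes quoted here can be checked with a few lines of arithmetic. The helper name `relative_reduction` is hypothetical; the input values are the ones reported in the abstract.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction of `after` relative to `before`."""
    return (before - after) / before


mape_rel = relative_reduction(0.61, 0.54)  # MAPE values, in percent
da_gain_pp = 74 - 71                       # directional accuracy, percentage points

print(f"{mape_rel:.1%}", da_gain_pp)       # prints "11.5% 3"
```

Note the distinction between the two figures: the MAPE change is a relative reduction, while the directional accuracy change is an absolute difference in percentage points.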
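🎯 Application Scenarios
This research applies to high-frequency trading, quantitative investing, and the automated optimization of complex decision systems. Its core value lies in using LLMs to "audit" an agent's decision process, which can improve the accuracy of financial forecasts and may extend to other complex decision domains that require high interpretability and behavioral constraints, such as robot path planning and autonomous driving.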
📄 Abstract (original)
Agentic stock prediction systems make sequences of interdependent decisions (regime detection, pathway routing, reinforcement learning control) whose individual quality is hidden by aggregate metrics such as mean absolute percentage error (MAPE) or directional accuracy. We present a behavioral evaluation framework that addresses this gap. Behavioral traces logged at every autonomous decision point are grouped into five-day episodes and scored along six domain-specific dimensions (regime detection, routing, adaptation, risk calibration, strategy coherence, error recovery) by an ensemble of three large language model (LLM) judges (GPT 5.4, Claude 4.6 Opus, Gemini 3.1 Pro). Perturbation-based validation on 420 episodes yields targeted score drops of $-1.6$ to $-2.4$ on intended dimensions versus an average of $-0.32$ on the remaining five, with cross-model agreement up to Krippendorff's $\alpha = 0.85$. The composite behavioral score, used here only for cross-episode reporting, correlates at $\rho = 0.72$ with realized 20-day Sharpe ratio from offline backtesting. Closing the loop, the framework converts deficient per-dimension scores into a credit-assigned penalty term added to the Soft Actor-Critic (SAC) reward. Three short fine-tuning cycles, all confined to the validation period, produce on the held-out 2017-2025 test period a one-day MAPE reduction from 0.61% to 0.54% (an 11.5% relative reduction; $p<0.001$, Cohen's $d=0.31$), a directional accuracy increase from 71% to 74%, and an 18% Sharpe ratio improvement (95% bootstrap CI [8.2%, 27.4%]), with gains concentrated in high-volatility episodes where the original system was most behaviorally deficient. Results are from offline backtesting and do not address effects specific to live deployment.
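The perturbation check described in the abstract compares the score change on the intended dimension with the average change on the other five. A minimal sketch of that comparison for a single episode is given below; the function name, the dictionary layout, and the example numbers in the usage note are assumptions for illustration, not data from the paper.

```python
from statistics import mean
from typing import Dict, Tuple


def perturbation_effect(baseline: Dict[str, float], perturbed: Dict[str, float],
                        target: str) -> Tuple[float, float]:
    """Return (score change on `target`, mean change on all other dimensions)."""
    deltas = {d: perturbed[d] - baseline[d] for d in baseline}
    off_target = [v for d, v in deltas.items() if d != target]
    return deltas[target], mean(off_target)
```

For example, with made-up scores where a routing perturbation lowers `routing` from 8.0 to 6.0 and every other dimension from 8.0 to 7.5, the function returns a targeted drop of -2.0 and an off-target mean of -0.5. A well-behaved judge ensemble should show the first number clearly larger in magnitude than the second.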