TextReasoningBench: Does Reasoning Really Improve Text Classification in Large Language Models?
Authors: Xinyu Guo, Yazhou Zhang, Jing Qin
Category: cs.CL
Published: 2026-03-20
Comments: 20 pages
💡 One-Sentence Takeaway
Introduces TextReasoningBench to evaluate whether reasoning strategies actually help text classification
🎯 Matched area: Pillar 9: Embodied Foundation Models
Keywords: text classification, reasoning strategies, large language models, performance evaluation, natural language processing, efficiency analysis, benchmarking
📋 Key Points
- Whether existing reasoning strategies are actually effective for text classification remains insufficiently validated, especially given their high token and time costs.
- This paper introduces the TextReasoningBench benchmark, which fills this gap by systematically evaluating seven reasoning strategies on text classification.
- Experiments show that reasoning strategies do not always improve classification performance, and that many are inefficient, increasing token consumption substantially.
🔬 Method Details
Problem definition: The paper asks whether reasoning strategies are genuinely effective for text classification. Prior work has not adequately examined the impact of reasoning on classification performance, particularly with respect to token and time costs.
Core idea: The paper proposes the TextReasoningBench benchmark, which systematically evaluates the effectiveness and efficiency of different reasoning strategies for text classification, comparing seven strategies across multiple LLMs.
Technical framework: The overall architecture comprises three modules: dataset selection, reasoning-strategy application, and performance evaluation. The effect of each strategy is analysed with both traditional metrics and cost-aware metrics.
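The three-module pipeline can be sketched as a simple harness that applies one reasoning strategy to a dataset while tracking token usage, then scores the predictions. This is a minimal illustrative sketch, not the paper's implementation: the strategy interface, the token accounting, and all names below are assumptions.

```python
# Minimal sketch of the benchmark loop: dataset selection, reasoning-strategy
# application, and performance evaluation. A "strategy" here is any function
# mapping an input text to (predicted_label, tokens_used) -- a hypothetical
# interface, not the paper's actual API.

def accuracy(preds, golds):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def macro_f1(preds, golds):
    """Unweighted mean of per-class F1 scores."""
    labels = set(golds) | set(preds)
    f1s = []
    for lab in labels:
        tp = sum(p == lab and g == lab for p, g in zip(preds, golds))
        fp = sum(p == lab and g != lab for p, g in zip(preds, golds))
        fn = sum(p != lab and g == lab for p, g in zip(preds, golds))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

def run_strategy(strategy_fn, dataset):
    """Apply one reasoning strategy to every (text, gold) pair,
    accumulating token cost alongside the quality metrics."""
    preds, tokens = [], 0
    for text, _ in dataset:
        label, used = strategy_fn(text)
        preds.append(label)
        tokens += used
    golds = [g for _, g in dataset]
    return {"acc": accuracy(preds, golds),
            "macro_f1": macro_f1(preds, golds),
            "tokens": tokens}
```

Running `run_strategy` once per (strategy, model, dataset) triple yields the grid of accuracy, macro-F1, and token-cost numbers that the cost-aware metrics are computed from.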
Key innovation: The main novelty is a pair of cost-aware evaluation metrics that quantify the relationship between a reasoning strategy's performance gain and its token cost, giving a more complete picture of each strategy's value.
Key design: The experiments cover seven reasoning strategies (e.g., CoT, ToT), compared across ten LLMs using traditional metrics such as accuracy and macro-F1 alongside the new cost-aware metrics.
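The abstract describes the two cost-aware metrics only informally (gain per reasoning token, and gain relative to token-cost growth), so the following is a hedged sketch with plausible stand-in formulas; the paper's exact definitions may differ.

```python
# Illustrative stand-ins for the two cost-aware metrics. Both compare a
# reasoning strategy against a cheap baseline (e.g., direct IO prompting).
# The formulas are assumptions, not the paper's definitions.
import math

def gain_per_token(acc_strategy, acc_baseline, tok_strategy, tok_baseline):
    """Accuracy gain per additional reasoning token spent."""
    extra = tok_strategy - tok_baseline
    return (acc_strategy - acc_baseline) / extra if extra > 0 else 0.0

def gain_per_cost_growth(acc_strategy, acc_baseline, tok_strategy, tok_baseline):
    """Accuracy gain normalised by the (log) growth factor of token cost,
    so a 10x and a 100x blow-up are penalised on comparable scales."""
    growth = math.log(tok_strategy / tok_baseline)
    return (acc_strategy - acc_baseline) / growth if growth > 0 else 0.0
```

For example, a strategy that lifts accuracy from 0.80 to 0.82 while using 5000 tokens instead of the baseline's 100 (a 50x blow-up) earns only about 4e-6 accuracy per extra token, making the "+2 points" headline look far less attractive.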
📊 Experimental Highlights
The results show that reasoning strategies do not always improve classification performance: simple strategies such as CoT and SC-CoT yield only +1% to +3% on large models, while complex methods such as ToT and GoT can even degrade performance on small models. Moreover, many reasoning strategies increase token consumption by 10x to 100x, making them highly inefficient.
🎯 Application Scenarios
Potential application areas include NLP tasks such as text classification, sentiment analysis, and information retrieval. By assessing when reasoning strategies are actually worthwhile, the study offers practical guidance for deploying LLMs efficiently and may motivate the development of more cost-effective reasoning methods.
📄 Abstract (original)
Eliciting explicit, step-by-step reasoning traces from large language models (LLMs) has emerged as a dominant paradigm for enhancing model capabilities. Although such reasoning strategies were originally designed for problems requiring explicit multi-step reasoning, they have increasingly been applied to a broad range of NLP tasks. This expansion implicitly assumes that deliberative reasoning uniformly benefits heterogeneous tasks. However, whether such reasoning mechanisms truly benefit classification tasks remains largely underexplored, especially considering their substantial token and time costs. To fill this gap, we introduce TextReasoningBench, a systematic benchmark designed to evaluate the effectiveness and efficiency of reasoning strategies for text classification with LLMs. We compare seven reasoning strategies, namely IO, CoT, SC-CoT, ToT, GoT, BoC, and long-CoT across ten LLMs on five text classification datasets. Beyond traditional metrics such as accuracy and macro-F1, we introduce two cost-aware evaluation metrics that quantify the performance gain per reasoning token and the efficiency of performance improvement relative to token cost growth. Experimental results reveal three notable findings: (1) Reasoning does not universally improve classification performance: while moderate strategies such as CoT and SC-CoT yield consistent but limited gains (typically +1% to +3% on big models), more complex methods (e.g., ToT and GoT) often fail to outperform simpler baselines and can even degrade performance, especially on small models; (2) Reasoning is often inefficient: many reasoning strategies increase token consumption by 10$\times$ to 100$\times$ (e.g., SC-CoT and ToT) while providing only marginal performance improvements.