Translate-R1: Cost-Aware Translation Tool Use via Reinforcement Learning

📄 arXiv: 2606.06835v1

Authors: Pratik Jayarao, Chaitanya Dwivedi, Himanshu Gupta, Neeraj Varshney, Adithya M Devraj, Meet Vadera, Priyanka Nigam, Bing Yin

Categories: cs.CL

Published: 2026-06-05

Comments: 14 pages main text plus appendix, 7 figures, 11 tables


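💡 One-sentence takeaway

Translate-R1 learns, from reward alone, when calling a translation tool is worth its cost.

🎯 Matched area: Pillar 2: RL Algorithms and Architecture (RL & Architecture)

Keywords: translation tools, reinforcement learning, cost sensitivity, multilingual processing, adaptive policy, large language models, intelligent decision-making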

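📋 Key points

  1. Existing approaches rely on hand-engineered, language-specific rules, heuristics, or external routers, which adapt poorly to diverse inputs and languages.
  2. The paper trains a single reinforcement learning policy that assesses its own comprehension of the input and decides whether to translate.
  3. The gated policy lifts reward over the baseline by +4.6 on high-resource languages, +23.5 on low-resource languages, and +17.5 on extremely low-resource (XLow) languages, and by +18.7 on two synthetic languages that simulate completely unseen input.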

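📝 Abstract (summary)

The performance gap across languages in large language models (LLMs) is well documented. To close it, prior work relies on language-specific rules or external routers, all of which require manual engineering. This paper instead trains a single reinforcement learning policy that decides from the reward signal alone when to translate, so that the translation tool is invoked only when the model cannot solve the task natively. In experiments covering 22 languages and 5 domains, the policy raises reward under cost-sensitive tool use and also performs well on unseen languages.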

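🔬 Method details

Problem definition: The paper asks how an LLM in a multilingual setting should use a translation tool to improve its performance. Existing methods depend on manual rules, which makes them inflexible, while leaving the choice to the model fails because LLMs are overconfident and skip the tool even when they cannot understand the input.

Core idea: A single policy is trained with reinforcement learning and uses the reward signal to learn when to translate. It calls the translation tool only when it cannot handle the input natively, which avoids spending tool calls on languages the model already handles.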
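The idea of learning the translate-or-not decision from reward alone can be illustrated with a cost-shaped reward. This is a minimal sketch, not the paper's reward function: the `Rollout` record, the `cost_aware_reward` helper, and the `tool_cost` value of 0.2 are all hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """One sampled episode: whether the translator was called and whether the answer was right."""
    used_translation: bool
    answer_correct: bool


def cost_aware_reward(rollout: Rollout, tool_cost: float = 0.2) -> float:
    """Task reward minus a penalty for invoking the translation tool.

    `tool_cost` is a hypothetical cost-sensitivity knob, not a value from the paper.
    """
    task_reward = 1.0 if rollout.answer_correct else 0.0
    penalty = tool_cost if rollout.used_translation else 0.0
    return task_reward - penalty


# The four possible outcomes under this shaping.
outcomes = {
    "native, correct": cost_aware_reward(Rollout(False, True)),
    "translated, correct": cost_aware_reward(Rollout(True, True)),
    "native, wrong": cost_aware_reward(Rollout(False, False)),
    "translated, wrong": cost_aware_reward(Rollout(True, False)),
}
```

Under this assumed shaping, a correct native answer scores highest, and translating pays off only when it turns a wrong answer into a correct one, which is the behavior the policy is meant to discover per language and per domain.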

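Technical framework: The framework centers on a single policy that makes the translation decision adaptively. RL training is continued on the post-trained Qwen3-4B; the policy assesses whether it can comprehend the input and calls the translation tool when needed. Training data is built with an answer-preserving translation pipeline and spans 22 languages in 3 resource tiers (High, Low, XLow) and 5 domains.

Key innovation: The main contribution is language- and domain-adaptive introspection learned through reinforcement learning. The policy adjusts its translation behavior according to its own comprehension rather than relying on fixed rules or an external router.

Key design: Training uses confidence-gated GSPO with a cost-sensitivity setting, aiming to optimize the cost-benefit of tool use across all resource tiers.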
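For orientation, the sketch below shows the standard GSPO ingredients that the confidence-gated variant builds on: group-normalized advantages and a length-normalized, sequence-level importance ratio with clipping. How the confidence gate and the cost term enter the objective is not described in this summary, so neither is modeled here; the function names and the `clip_eps` default are illustrative assumptions.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def sequence_ratio(new_logps, old_logps):
    """Length-normalized sequence-level importance ratio.

    Inputs are per-token log-probabilities of the same sampled response
    under the current policy and the old (sampling) policy.
    """
    delta = sum(n - o for n, o in zip(new_logps, old_logps))
    return math.exp(delta / len(new_logps))


def gspo_surrogate(ratios, advantages, clip_eps=0.2):
    """Clipped surrogate averaged over the group (to be maximized).

    `clip_eps` is an illustrative value, not one taken from the paper.
    """
    terms = []
    for s, a in zip(ratios, advantages):
        clipped = min(max(s, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(s * a, clipped * a))
    return mean(terms)
```

The ratio is computed once per response rather than per token, so every token in a response shares the same clipping decision; a cost-shaped reward would simply feed into `group_advantages` as the per-response reward.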

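📊 Experimental highlights

Relative to the baseline, the gated policy improves reward by +4.6 on high-resource languages, +23.5 on low-resource languages, and +17.5 on XLow languages. On two synthetic languages built to simulate completely unseen input, it improves by +18.7 over the overconfident baseline, which underuses the tool even on these incomprehensible inputs. Against an unconstrained policy that almost always translates, it preserves full reward at 63% of the cost and is Pareto-optimal across 87% of the cost-sensitivity range. The policy also transfers zero-shot to 9 held-out languages.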
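One way to read a statement about a "cost-sensitivity range" is as a sweep over a weight λ that trades reward against tool cost. The sketch below implements that reading under stated assumptions: the utility form `reward - lam * cost`, the helper `best_policy_share`, and all reward values are hypothetical, and the paper's exact definition of Pareto-optimality is not given in this summary. Only the 0.63 relative cost echoes the figure quoted above.

```python
def best_policy_share(policies, target, lambdas):
    """Fraction of cost-sensitivity values at which `target` has the top utility.

    Utility is reward - lam * cost; ties count in favor of `target`.
    """
    wins = 0
    for lam in lambdas:
        utility = {name: r - lam * c for name, (r, c) in policies.items()}
        if utility[target] >= max(utility.values()):
            wins += 1
    return wins / len(lambdas)


# Hypothetical (reward, relative cost) pairs, NOT the paper's measurements.
policies = {
    "never_translate": (50.0, 0.0),
    "always_translate": (70.0, 1.0),
    "gated": (70.0, 0.63),
}
lambdas = list(range(0, 51))  # cost sensitivity swept from 0 to 50
share = best_policy_share(policies, "gated", lambdas)
```

With these made-up numbers the gated policy is never worse than always translating, because it reaches the same reward at lower cost, and it loses to never translating only once λ is large enough that its tool cost outweighs its reward advantage.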

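🎯 Application scenarios

Potential applications include multilingual translation systems, cross-lingual information retrieval, and intelligent customer service. By calling the translation tool only when it helps, such systems could improve both user experience and efficiency, and the approach may influence how language services are delivered globally.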

📄 Abstract (original)

The performance gap across languages in LLMs is well documented, and closing it natively requires pretraining or fine-tuning on corpora that, for most languages, do not exist. Translation offers an alternative: converting an input into the model's dominant language unlocks its full capabilities at once. Applying translation to every input, however, is wasteful for languages the model already handles, while leaving the choice to the model fails in the opposite way, as LLMs are overconfident and skip the tool even when they cannot understand the input. Prior work resolves this with language-specific rules, domain heuristics, language identifiers, or external routers, each requiring manual engineering. We instead learn a single policy that decides when to translate from reward alone, developing language- and domain-adaptive introspection that assesses its own comprehension and invokes translation only when it cannot solve a task natively. Using data built by our answer-preserving translation pipeline, we continue RL on the post-trained Qwen3-4B across 22 languages in 3 resource tiers (High, Low, XLow) and 5 domains, and introduce confidence-gated GSPO for cost-sensitive tool use. The gated policy lifts reward over the baseline by +4.6 on High, +23.5 on Low, and +17.5 on XLow. Against an unconstrained policy that almost always translates, it preserves full reward at 63% of the cost and is Pareto-optimal across 87% of the cost-sensitivity range. Additionally, to simulate behavior on a completely unseen language, we create 2 synthetic languages, where our gated policy improves +18.7 over the overconfident baseline that underutilizes the tool even on these incomprehensible inputs. The policy transfers zero-shot to 9 held-out languages, and we analyze how tool use emerges over training, per language and per domain.