Culture-Aware Machine Translation in Large Language Models: Benchmarking and Investigation

Authors: Zekun Yuan, Yangfan Ye, Xiaocheng Feng, Baohang Li, Qichen Hong, Yunfei Lu, Dandan Tu, Bing Qin

Categories: cs.CL

Published: 2026-04-27

Comments: 26 pages, 25 figures; ACL 2026 main conference, long paper

💡 One-Sentence Summary

Introduces CanMT to address the shortcomings of large language models in culture-aware translation

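🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: culture-aware translation, large language models, machine translation, dataset construction, evaluation framework, translation strategies, culture-specific knowledge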

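📋 Key Points

The culture-aware translation ability of current large language models is not well understood; open questions include performance differences across models and the effect of translation strategy.
The paper introduces the CanMT dataset and a multi-dimensional evaluation framework, and uses them to systematically evaluate how different LLMs perform on culture-aware translation.
The results show that translation strategy has a systematic effect on model behavior, and that supplying reference translations substantially improves evaluation reliability.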

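📝 Abstract (Translated from Chinese)

Large language models (LLMs) perform well in general machine translation, but their ability in culture-aware scenarios remains unclear. To address this, the paper introduces CanMT, a culture-aware, novel-driven parallel dataset, and proposes a theoretically grounded, multi-dimensional evaluation framework for assessing cultural translation quality. Using CanMT, the authors systematically evaluate a range of LLMs and translation systems under different translation strategy constraints. They find substantial performance disparities across models and a systematic influence of translation strategy on model behavior. In addition, translation difficulty varies across types of culture-specific items, and a gap remains between models' recognition of culture-specific knowledge and their ability to apply it correctly in translation outputs. Incorporating reference translations substantially improves the reliability of LLM-as-a-judge evaluation, underscoring their importance in assessing culture-aware translation quality.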

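🔬 Method Details

Problem definition: The paper targets the shortcomings of large language models in culture-aware translation. Existing methods handle culture-specific knowledge poorly, which leads to inconsistent translation quality.

Core idea: By introducing the CanMT dataset and a multi-dimensional evaluation framework, the paper systematically evaluates and analyzes how different translation strategies affect model behavior, with the aim of improving culture-aware translation quality.

Technical framework: The overall pipeline has three main modules: dataset construction, model evaluation, and result analysis. The dataset supplies culture-specific translation instances, and the evaluation module scores translation quality along multiple dimensions.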
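The three-module pipeline (dataset, evaluation, analysis) can be pictured with a minimal Python sketch. Everything named here is an assumption made for illustration: the record fields, the dimension names `fidelity` and `fluency`, and the `aggregate` helper are hypothetical and are not taken from the CanMT release.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class CultureExample:
    """One parallel instance (hypothetical schema, not the CanMT format)."""
    source: str
    reference: str
    csi_type: str  # assumed label for the type of culture-specific item


@dataclass
class Judgement:
    """Per-dimension scores assigned to one system output."""
    system: str
    example: CultureExample
    scores: Dict[str, float]  # dimension name -> score


def aggregate(judgements: List[Judgement]) -> Dict[str, Dict[str, float]]:
    """Average each dimension's scores separately for every system."""
    collected: Dict[str, Dict[str, List[float]]] = {}
    for j in judgements:
        dims = collected.setdefault(j.system, {})
        for dim, score in j.scores.items():
            dims.setdefault(dim, []).append(score)
    return {
        system: {dim: mean(values) for dim, values in dims.items()}
        for system, dims in collected.items()
    }


example = CultureExample(source="src text", reference="ref text", csi_type="custom")
results = aggregate([
    Judgement("system_a", example, {"fidelity": 4.0, "fluency": 2.0}),
    Judgement("system_a", example, {"fidelity": 2.0, "fluency": 4.0}),
    Judgement("system_b", example, {"fidelity": 5.0, "fluency": 5.0}),
])
```

Keeping scores per dimension, rather than collapsing them into a single number early, is what lets a later analysis step compare systems along each axis.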

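Key innovation: The main contribution is the CanMT dataset together with a theoretically grounded evaluation framework, which makes the assessment of culture-aware translation more systematic and reliable and fills a gap in existing research.

Key design: Models are evaluated under multiple translation strategies, and reference translations are supplied during evaluation to improve its reliability and support the accuracy and validity of the results.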
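As a rough illustration of reference-aware LLM-as-a-judge evaluation under a strategy constraint, the sketch below assembles a judge prompt with or without a reference translation. The prompt wording, the `DIMENSIONS` list, the 1 to 5 scale, and the function name `build_judge_prompt` are all assumptions for this example; the paper's actual prompts and dimensions are not given in this summary.

```python
from typing import Optional

# Hypothetical scoring dimensions; the paper's real dimensions are not listed here.
DIMENSIONS = ["cultural fidelity", "strategy adherence", "fluency"]


def build_judge_prompt(
    source: str,
    hypothesis: str,
    strategy: str,
    reference: Optional[str] = None,
) -> str:
    """Assemble a judge prompt; the reference line is added only if provided."""
    lines = [
        "You are evaluating a culture-aware translation.",
        f"Requested translation strategy: {strategy}",
        f"Source: {source}",
        f"Candidate translation: {hypothesis}",
    ]
    if reference is not None:
        lines.append(f"Reference translation: {reference}")
    lines.append(
        "Rate the candidate from 1 to 5 on each dimension: "
        + ", ".join(DIMENSIONS)
    )
    return "\n".join(lines)


# Placeholder strings stand in for real source text and system output.
with_ref = build_judge_prompt("src", "hyp", "strategy-name", reference="ref")
without_ref = build_judge_prompt("src", "hyp", "strategy-name")
```

Comparing judge scores obtained from `with_ref` and `without_ref` prompts against human ratings would be one way to examine the reliability effect described above, though the paper's own protocol may differ.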

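📊 Experimental Highlights

Experiments show substantial performance differences across models on culture-aware translation, along with a systematic effect of translation strategy on model behavior. Evaluation reliability improves substantially once reference translations are introduced, indicating that they play an essential role in evaluating culture-aware translation.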

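🎯 Application Scenarios

Potential application areas include cross-cultural communication, international business, education, and tourism. Making machine translation more culture-aware would better serve users from different cultural backgrounds and improve both translation quality and user experience. The work may also encourage the development of culturally sensitive translation systems and support global communication.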

📄 Abstract (Original)

Large language models (LLMs) have achieved strong performance in general machine translation, yet their ability in culture-aware scenarios remains poorly understood. To bridge this gap, we introduce CanMT, a Culture-Aware Novel-Driven Parallel Dataset for Machine Translation, together with a theoretically grounded, multi-dimensional evaluation framework for assessing cultural translation quality. Leveraging CanMT, we systematically evaluate a wide range of LLMs and translation systems under different translation strategy constraints. Our findings reveal substantial performance disparities across models and demonstrate that translation strategies exert a systematic influence on model behavior. Further analysis shows that translation difficulty varies across types of culture-specific items, and that a persistent gap remains between models' recognition of culture-specific knowledge and their ability to correctly operationalize it in translation outputs. In addition, incorporating reference translations is shown to substantially improve evaluation reliability in LLM-as-a-judge, underscoring their essential role in assessing culture-aware translation quality. The corpus and code are available at CanMT.
