Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures
Authors: Tyler A. Chang, Catherine Arnett, Abdelrahman Eldesokey, Abdelrahman Sadallah, Abeer Kashar, Abolade Daud, Abosede Grace Olanihun, Adamu Labaran Mohammed, Adeyemi Praise, Adhikarinayum Meerajita Sharma, Aditi Gupta, Afitab Iyigun, Afonso Simplício, Ahmed Essouaied, Aicha Chorana, Akhil Eppa, Akintunde Oladipo, Akshay Ramesh, Aleksei Dorkin, Alfred Malengo Kondoro, Alham Fikri Aji, Ali Eren Çetintaş, Allan Hanbury, Alou Dembele, Alp Niksarli, Álvaro Arroyo, Amin Bajand, Amol Khanna, Ana Chkhaidze, Ana Condez, Andiswa Mkhonto, Andrew Hoblitzell, Andrew Tran, Angelos Poulis, Anirban Majumder, Anna Vacalopoulou, Annette Kuuipolani Kanahele Wong, Annika Simonsen, Anton Kovalev, Ashvanth. S, Ayodeji Joseph Lana, Barkin Kinay, Bashar Alhafni, Benedict Cibalinda Busole, Bernard Ghanem, Bharti Nathani, Biljana Stojanovska Đurić, Bola Agbonile, Bragi Bergsson, Bruce Torres Fischer, Burak Tutar, Burcu Alakuş Çınar, Cade J. Kanoniakapueo Kane, Can Udomcharoenchaikit, Catherine Arnett, Chadi Helwe, Chaithra Reddy Nerella, Chen Cecilia Liu, Chiamaka Glory Nwokolo, Cristina España-Bonet, Cynthia Amol, DaeYeop Lee, Dana Arad, Daniil Dzenhaliou, Daria Pugacheva, Dasol Choi, Daud Abolade, David Liu, David Semedo, Deborah Popoola, Deividas Mataciunas, Delphine Nyaboke, Dhyuthy Krishna Kumar, Diogo Glória-Silva, Diogo Tavares, Divyanshu Goyal, DongGeon Lee, Ebele Nwamaka Anajemba, Egonu Ngozi Grace, Elena Mickel, Elena Tutubalina, Elias Herranen, Emile Anand, Emmanuel Habumuremyi, Emuobonuvie Maria Ajiboye, Eryawan Presma Yulianrifat, Esther Adenuga, Ewa Rudnicka, Faith Olabisi Itiola, Faran Taimoor Butt, Fathima Thekkekara, Fatima Haouari, Filbert Aurelian Tjiaranata, Firas Laakom, Francesca Grasso, Francesco Orabona, Francesco Periti, Gbenga Kayode Solomon, Gia Nghia Ngo, Gloria Udhehdhe-oze, Gonçalo Martins, Gopi Naga Sai Ram Challagolla, Guijin Son, Gulnaz Abdykadyrova, Hafsteinn Einarsson, Hai Hu, Hamidreza Saffari, Hamza Zaidi, Haopeng Zhang, Harethah Abu Shairah, Harry Vuong, Hele-Andra Kuulmets, Houda Bouamor, Hwanjo Yu, Iben Nyholm Debess, İbrahim Ethem Deveci, Ikhlasul Akmal Hanif, Ikhyun Cho, Inês Calvo, Inês Vieira, Isaac Manzi, Ismail Daud, Itay Itzhak, Iuliia Alekseenko, Ivan Belashkin, Ivan Spada, Ivan Zhelyazkov, Jacob Brinton, Jafar Isbarov, Jaka Čibej, Jan Čuhel, Jan Kocoń, Jauza Akbar Krito, Jebish Purbey, Jennifer Mickel, Jennifer Za, Jenny Kunz, Jihae Jeong, Jimena Tena Dávalos, Jinu Lee, João Magalhães, John Yi, Jongin Kim, Joseph Chataignon, Joseph Marvin Imperial, Jubeerathan Thevakumar, Judith Land, Junchen Jiang, Jungwhan Kim, Kairit Sirts, Kamesh R, Kamesh V, Kanda Patrick Tshinu, Kätriin Kukk, Kaustubh Ponkshe, Kavsar Huseynova, Ke He, Kelly Buchanan, Kengatharaiyer Sarveswaran, Kerem Zaman, Khalil Mrini, Kian Kyars, Krister Kruusmaa, Kusum Chouhan, Lainitha Krishnakumar, Laura Castro Sánchez, Laura Porrino Moscoso, Leshem Choshen, Levent Sencan, Lilja Øvrelid, Lisa Alazraki, Lovina Ehimen-Ugbede, Luheerathan Thevakumar, Luxshan Thavarasa, Mahnoor Malik, Mamadou K.
Keita, Mansi Jangid, Marco De Santis, Marcos García, Marek Suppa, Mariam D'Ciofalo, Marii Ojastu, Maryam Sikander, Mausami Narayan, Maximos Skandalis, Mehak Mehak, Mehmet İlteriş Bozkurt, Melaku Bayu Workie, Menan Velayuthan, Michael Leventhal, Michał Marcińczuk, Mirna Potočnjak, Mohammadamin Shafiei, Mridul Sharma, Mrityunjaya Indoria, Muhammad Ravi Shulthan Habibi, Murat Kolić, Nada Galant, Naphat Permpredanun, Narada Maugin, Nicholas Kluge Corrêa, Nikola Ljubešić, Nirmal Thomas, Nisansa de Silva, Nisheeth Joshi, Nitish Ponkshe, Nizar Habash, Nneoma C. Udeze, Noel Thomas, Noémi Ligeti-Nagy, Nouhoum Coulibaly, Nsengiyumva Faustin, Odunayo Kareemat Buliaminu, Odunayo Ogundepo, Oghojafor Godswill Fejiro, Ogundipe Blessing Funmilola, Okechukwu God'spraise, Olanrewaju Samuel, Olaoye Deborah Oluwaseun, Olasoji Akindejoye, Olga Popova, Olga Snissarenko, Onyinye Anulika Chiemezie, Orkun Kinay, Osman Tursun, Owoeye Tobiloba Moses, Oyelade Oluwafemi Joshua, Oyesanmi Fiyinfoluwa, Pablo Gamallo, Pablo Rodríguez Fernández, Palak Arora, Pedro Valente, Peter Rupnik, Philip Oghenesuowho Ekiugbo, Pramit Sahoo, Prokopis Prokopidis, Pua Niau-Puhipau, Quadri Yahya, Rachele Mignone, Raghav Singhal, Ram Mohan Rao Kadiyala, Raphael Merx, Rapheal Afolayan, Ratnavel Rajalakshmi, Rishav Ghosh, Romina Oji, Ron Kekeha Solis, Rui Guerra, Rushikesh Zawar, Sa'ad Nasir Bashir, Saeed Alzaabi, Sahil Sandeep, Sai Pavan Batchu, SaiSandeep Kantareddy, Salsabila Zahirah Pranida, Sam Buchanan, Samuel Rutunda, Sander Land, Sarah Sulollari, Sardar Ali, Saroj Sapkota, Saulius Tautvaisas, Sayambhu Sen, Sayantani Banerjee, Sebastien Diarra, SenthilNathan. M, Sewoong Lee, Shaan Shah, Shankar Venkitachalam, Sharifa Djurabaeva, Sharon Ibejih, Shivanya Shomir Dutta, Siddhant Gupta, Silvia Paniagua Suárez, Sina Ahmadi, Sivasuthan Sukumar, Siyuan Song, Snegha A., Sokratis Sofianopoulos, Sona Elza Simon, Sonja Benčina, Sophie Gvasalia, Sphurti Kirit More, Spyros Dragazis, Stephan P. Kaufhold, Suba. S, Sultan AlRashed, Surangika Ranathunga, Taiga Someya, Taja Kuzman Pungeršek, Tal Haklay, Tasi'u Jibril, Tatsuya Aoyama, Tea Abashidze, Terenz Jomar Dela Cruz, Terra Blevins, Themistoklis Nikas, Theresa Dora Idoko, Thu Mai Do, Tilek Chubakov, Tommaso Gargiani, Uma Rathore, Uni Johannesen, Uwuma Doris Ugwu, Vallerie Alexandra Putra, Vanya Bannihatti Kumar, Varsha Jeyarajalingam, Varvara Arzt, Vasudevan Nedumpozhimana, Viktoria Ondrejova, Viktoryia Horbik, Vishnu Vardhan Reddy Kummitha, Vuk Dinić, Walelign Tewabe Sewunetie, Winston Wu, Xiaojing Zhao, Yacouba Diarra, Yaniv Nikankin, Yash Mathur, Yixi Chen, Yiyuan Li, Yolanda Xavier, Yonatan Belinkov, Yusuf Ismail Abayomi, Zaid Alyafeai, Zhengyang Shan, Zhi Rui Tam, Zilu Tang, Zuzana Nadova, Baber Abbasi, Stella Biderman, David Stap, Duygu Ataman, Fabian Schmidt, Hila Gonen, Jiayi Wang, David Ifeoluwa Adelani
Category: cs.CL
Published: 2025-10-28
Comments: Preprint
💡 One-Sentence Takeaway
Introduces Global PIQA, a benchmark for evaluating the physical commonsense reasoning of large language models across 100+ languages and cultures.
🎯 Matched Area: Pillar 9: Embodied Foundation Models
Keywords: commonsense reasoning, multilinguality, cultural sensitivity, large language models, evaluation benchmarks
📋 Key Points
- There are almost no culturally-specific evaluation benchmarks for large language models that span many languages and cultures.
- Global PIQA is built through participatory contributions from researchers worldwide, providing a multilingual, culturally-grounded commonsense reasoning benchmark.
- Experiments show that current models perform worse on lower-resource languages, exposing gaps in their culturally-specific background knowledge.
📝 Abstract (Translated)
This paper presents Global PIQA, a participatory commonsense reasoning benchmark covering over 100 languages, constructed by hand by 335 researchers from 65 countries. Global PIQA comprises 116 language varieties spanning five continents, 14 language families, and 23 writing systems. In the non-parallel split of Global PIQA, over 50% of examples reference local foods, customs, traditions, or other culturally-specific elements. The study finds that state-of-the-art large language models perform well on Global PIQA in aggregate but worse in lower-resource languages (an accuracy gap of up to 37%, despite random chance at 50%). Open models generally perform worse than proprietary models. Global PIQA highlights that in many languages and cultures, everyday knowledge remains an area for improvement, alongside more widely-discussed capabilities such as complex reasoning and expert knowledge. Beyond LLM evaluation, the authors hope that Global PIQA offers a glimpse into the wide diversity of cultures in which human language is embedded.
🔬 Method Details
Problem definition: Large language models (LLMs) have made notable progress on commonsense reasoning, yet there are almost no evaluation benchmarks that span many languages and cultural contexts. This makes it difficult to assess how well LLMs handle culturally-specific knowledge and everyday commonsense, especially in lower-resource languages, and existing benchmarks cannot effectively measure how LLMs generalize across cultural settings.
Core idea: Global PIQA builds a multilingual, culturally-grounded commonsense reasoning dataset through participatory construction, with researchers around the world contributing examples so that the benchmark covers a wide range of languages and cultural backgrounds. The dataset focuses on physical commonsense and includes many examples tied to local culture, enabling a more faithful assessment of LLM reasoning across cultural settings.
Technical framework: Global PIQA is constructed in several stages (a hedged evaluation sketch follows the list):
1. Data collection: researchers from around the world contribute commonsense reasoning examples grounded in their own languages and cultures.
2. Translation and localization: contributed examples are rendered in the target languages and localized so that they remain meaningful in each cultural context.
3. Validation: translated and localized examples are checked for quality and accuracy.
4. Dataset organization: examples are organized into splits for evaluation; notably, over 50% of examples in the non-parallel split reference culturally-specific elements.
5. Model evaluation: existing LLMs are scored on Global PIQA and their performance analyzed across languages and cultures.
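For illustration only, here is a minimal sketch of how one might score a causal language model on a PIQA-style two-choice example: each candidate solution is scored by its token log-likelihood given the prompt, and the likelier candidate is chosen. The field names (`prompt`, `solution0`, `solution1`, `label`) and the model are placeholders, not the benchmark's actual schema.

```python
# Minimal PIQA-style scoring sketch; field names and model are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; substitute any causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def solution_logprob(prompt: str, solution: str) -> float:
    """Sum of log-probabilities of the solution tokens, given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + " " + solution, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Position i of the shifted logits predicts token i+1 of the input.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    # Count only positions predicting solution tokens (boundary tokenization
    # may shift by one token; acceptable for a sketch).
    start = prompt_ids.shape[1] - 1
    return sum(log_probs[i, targets[i]].item() for i in range(start, len(targets)))

def predict(example: dict) -> int:
    """Return the index (0 or 1) of the candidate the model finds more likely."""
    scores = [solution_logprob(example["prompt"], example[f"solution{i}"])
              for i in (0, 1)]
    return int(scores[1] > scores[0])
```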
Key innovation: Global PIQA's most important contribution is its participatory construction and culturally-grounded design. Traditional commonsense benchmarks are typically built by a small group of researchers and struggle to cover many languages and cultures. By recruiting contributors worldwide, Global PIQA reflects commonsense knowledge across diverse cultural contexts, and its many locally-grounded examples allow a more faithful evaluation of LLM reasoning in different cultural settings.
Key design choices (a usage sketch follows the list):
1. Question type: examples focus on physical commonsense, such as relationships between objects and cause-and-effect.
2. Language coverage: over 100 languages, spanning both high-resource and low-resource languages.
3. Cultural grounding: many examples reference local foods, customs, traditions, and other culturally-specific elements.
4. Dataset organization: the benchmark is organized into evaluation splits, including the non-parallel split highlighted in the abstract.
5. Evaluation metric: accuracy on the two-choice task, for which random chance is 50%.
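For concreteness, here is a hypothetical record in this two-choice format and the corresponding accuracy computation, reusing the `predict` helper sketched above. The example content and field names are invented for illustration, not drawn from the benchmark.

```python
# A hypothetical two-choice record; the real schema and content may differ.
example = {
    "prompt": "To keep injera fresh overnight, you should",
    "solution0": "wrap it in a clean cloth and store it in a covered basket.",
    "solution1": "leave it unwrapped in direct sunlight.",
    "label": 0,
}

def accuracy(examples: list[dict]) -> float:
    """Fraction of two-choice examples answered correctly; chance is 0.5."""
    correct = sum(predict(ex) == ex["label"] for ex in examples)
    return correct / len(examples)
```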
📊 Experimental Highlights
State-of-the-art LLMs perform well on Global PIQA in aggregate, but their performance drops markedly in lower-resource languages, with an accuracy gap of up to 37%. Open models generally underperform proprietary models. These results expose shortcomings in handling culturally-specific knowledge and lower-resource languages, suggesting that future work should pay greater attention to cultural grounding and multilingual capability.
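To put the reported numbers in context: with two answer choices, random guessing yields 50% accuracy, so a 37-point spread between the strongest and weakest languages leaves the weakest languages not far above chance. Below is a hedged sketch of how such a per-language gap could be computed, again reusing the hypothetical `predict` helper; the `language` field is an assumption, not the benchmark's confirmed schema.

```python
from collections import defaultdict

def per_language_gap(examples: list[dict]) -> float:
    """Max minus min per-language accuracy, in percentage points."""
    by_lang: dict[str, list[int]] = defaultdict(list)
    for ex in examples:
        # "language" is an assumed per-example field for grouping.
        by_lang[ex["language"]].append(int(predict(ex) == ex["label"]))
    accs = {lang: sum(hits) / len(hits) for lang, hits in by_lang.items()}
    return 100 * (max(accs.values()) - min(accs.values()))
```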
🎯 Application Scenarios
Global PIQA can be used to evaluate and improve the commonsense reasoning of large language models in multilingual, multicultural settings. The benchmark supports the development of more culturally-aware, better-generalizing AI systems, enabling more effective cross-cultural communication in domains such as education, healthcare, and customer service. Beyond evaluation, Global PIQA can also foster understanding and study of everyday knowledge across cultures.
📄 Abstract (Original)
To date, there exist almost no culturally-specific evaluation benchmarks for large language models (LLMs) that cover a large number of languages and cultures. In this paper, we present Global PIQA, a participatory commonsense reasoning benchmark for over 100 languages, constructed by hand by 335 researchers from 65 countries around the world. The 116 language varieties in Global PIQA cover five continents, 14 language families, and 23 writing systems. In the non-parallel split of Global PIQA, over 50% of examples reference local foods, customs, traditions, or other culturally-specific elements. We find that state-of-the-art LLMs perform well on Global PIQA in aggregate, but they exhibit weaker performance in lower-resource languages (up to a 37% accuracy gap, despite random chance at 50%). Open models generally perform worse than proprietary models. Global PIQA highlights that in many languages and cultures, everyday knowledge remains an area for improvement, alongside more widely-discussed capabilities such as complex reasoning and expert knowledge. Beyond its uses for LLM evaluation, we hope that Global PIQA provides a glimpse into the wide diversity of cultures in which human language is embedded.