Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation

Journal: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
DOI: https://doi.org/10.18653/v1/2025.acl-long.919
Publication Date: 2025-01-01
Author(s): Shivalika Singh et al.
Primary Topic: Second Language Learning and Teaching

Overview

The research addresses the challenges of reliable multilingual evaluation, particularly the cultural biases inherent in machine-translated evaluation sets. It emphasizes that such translations often perpetuate Western-centric assumptions, which can misalign with the knowledge relevant to diverse target audiences. The authors present a multilingual evaluation framework designed to mitigate these biases through enhanced translation and annotation practices. Their large-scale study reveals that state-of-the-art models predominantly learn concepts rooted in Western culture, with significant shifts in model rankings observed when evaluating culturally sensitive questions.

To combat these biases, the authors introduce Global-MMLU, an expanded multilingual version of the MMLU benchmark encompassing 42 languages. This new framework includes improved translation quality and categorizes questions into culturally sensitive (CS) and culturally agnostic (CA) subsets. Their findings indicate that 28% of MMLU questions require culturally sensitive knowledge, predominantly reflecting a Western bias. The introduction of Global-MMLU and Global-MMLU Lite allows for a more equitable evaluation of language models, as model performance varies significantly based on the cultural context of the questions. The authors recommend that multilingual large language models (LLMs) be evaluated using these culturally distinct subsets to provide a more comprehensive assessment of their capabilities.
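The recommendation to evaluate on the two subsets separately can be illustrated with a minimal sketch. It assumes each evaluated question carries a CS/CA tag; the record keys (`label`, `prediction`, `answer`) and the toy data are hypothetical and do not reflect the actual Global-MMLU schema or any reported result.

```python
from collections import defaultdict


def accuracy_by_subset(records):
    """Return accuracy per cultural label ("CS" or "CA").

    Each record is a dict with hypothetical keys:
    "label" (CS/CA tag), "prediction", and "answer".
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["label"]] += 1
        if record["prediction"] == record["answer"]:
            correct[record["label"]] += 1
    return {label: correct[label] / total[label] for label in total}


# Toy data, invented for illustration only.
toy_records = [
    {"label": "CS", "prediction": "A", "answer": "A"},
    {"label": "CS", "prediction": "B", "answer": "C"},
    {"label": "CA", "prediction": "D", "answer": "D"},
    {"label": "CA", "prediction": "A", "answer": "A"},
]

subset_accuracy = accuracy_by_subset(toy_records)
print(subset_accuracy)
```

Reporting the two numbers side by side, rather than a single pooled accuracy, keeps a gap between culturally sensitive and culturally agnostic performance visible.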

Introduction

The introduction addresses the limitations of current evaluations of generative AI, particularly large language models (LLMs), which rely predominantly on English benchmarks and therefore reflect a Western-centric perspective. This reliance raises concerns about the cultural inclusivity of multilingual assessments, because many existing benchmarks, such as the Massive Multitask Language Understanding (MMLU) dataset, are written in English and then translated for multilingual use. The authors highlight that such translations can perpetuate cultural biases: a significant portion of MMLU questions requires Western-centric knowledge, and geographic references focus predominantly on North America and Europe.

To tackle these issues, the authors introduce Global-MMLU, a new multilingual test set encompassing 42 languages, which improves on the original MMLU dataset through a combination of professional, crowdsourced, and machine translations. Their cultural bias analysis finds that 28% of sampled questions require culturally sensitive knowledge, most of it Western-centric, and they propose two evaluation subsets: Culturally-Sensitive (CS) and Culturally-Agnostic (CA). Model performance varies significantly between these subsets, with average ranking shifts of 3.7 positions on the CA subset and 7.3 positions on the CS subset. The authors advocate prioritizing Global-MMLU for more accurate evaluations and recommend reporting CA and CS performance separately to make assessments of multilingual models more transparent.
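One way to quantify a ranking shift is sketched below. This summary does not give the paper's exact definition, so the metric used here (mean absolute change in rank relative to the full benchmark) is an assumption, and the model names and scores are invented.

```python
def rank_models(scores):
    """Map model name -> rank (1 = best) from a dict of name -> accuracy."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def mean_rank_shift(reference_scores, subset_scores):
    """Mean absolute rank change between a reference set and a subset.

    Assumed metric for illustration; not necessarily the paper's definition.
    """
    reference = rank_models(reference_scores)
    subset = rank_models(subset_scores)
    shifts = [abs(reference[name] - subset[name]) for name in reference]
    return sum(shifts) / len(shifts)


# Hypothetical accuracies on the full benchmark and on each subset.
full_scores = {"model_a": 0.70, "model_b": 0.65, "model_c": 0.60}
cs_scores = {"model_a": 0.72, "model_b": 0.60, "model_c": 0.66}
ca_scores = {"model_a": 0.69, "model_b": 0.66, "model_c": 0.58}

cs_shift = mean_rank_shift(full_scores, cs_scores)
ca_shift = mean_rank_shift(full_scores, ca_scores)
print(f"CS shift: {cs_shift:.2f}, CA shift: {ca_shift:.2f}")
```

In this toy data the CS subset reorders two of the three models while the CA subset leaves the ranking unchanged, mirroring the direction of the reported effect without reproducing its magnitude.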

Results

In the section titled “K.3 Evaluation Results,” the authors present a comprehensive analysis of the evaluation metrics used to assess the performance of their proposed model. The results indicate that the model outperforms existing benchmarks across several key metrics, demonstrating significant improvements in accuracy and efficiency. Specifically, the evaluation reveals a notable increase in precision and recall, suggesting that the model effectively minimizes false positives and negatives.

Furthermore, the authors provide detailed comparisons with alternative approaches, highlighting the robustness of their method under various conditions. Statistical significance tests confirm that the observed improvements are not due to chance, reinforcing the validity of their findings. Overall, the results underscore the effectiveness of the proposed model in addressing the research problem, paving the way for future applications and enhancements.

Discussion

In this section, the authors discuss the data annotation process for a subset of the MMLU dataset, termed MMLU Annotated (MA), which aimed to identify cultural, regional, and linguistic sensitivities. A total of 200 annotators evaluated 2,850 samples across 57 subjects, categorizing questions as either Culturally-Sensitive (CS) or Culturally-Agnostic (CA). The findings reveal that 28% of the annotated questions require CS knowledge, with geographic knowledge the most prevalent sensitivity (54.7%), followed by cultural (32.7%) and dialect knowledge (0.5%). Notably, 86.5% of CS questions were linked to Western cultural knowledge, indicating a strong bias towards Western-centric perspectives, particularly those of the U.S. and U.K.
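The reported figures can be turned into approximate question counts. The sketch below assumes that the 28% share applies to the 2,850 annotated samples and that samples are spread evenly across subjects; because the percentages are rounded, the resulting counts are estimates rather than values taken from the paper.

```python
# Figures quoted in the summary above.
samples = 2850
subjects = 57
cs_share = 0.28             # share of questions tagged culturally sensitive
western_share_of_cs = 0.865  # share of CS questions tied to Western culture

# Assumption: samples are distributed evenly across subjects.
samples_per_subject = samples / subjects

# Approximate counts, rounded to whole questions.
cs_questions = round(samples * cs_share)
western_cs_questions = round(cs_questions * western_share_of_cs)

print(samples_per_subject, cs_questions, western_cs_questions)
```

Under these assumptions, roughly 800 of the annotated questions are culturally sensitive, and about 690 of those depend on Western cultural knowledge.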

The analysis further highlights that cultural sensitivity varies across subjects, with Humanities and Social Sciences exhibiting the highest rates of CS questions (68%), while STEM subjects showed minimal cultural bias (3.15%). The authors also introduce Global-MMLU, an enhanced benchmark with improved translations and dedicated CS and CA subsets, aiming to mitigate biases in multilingual evaluations. This new dataset includes professional and native speaker translations across 42 languages, allowing for a more comprehensive analysis of cultural biases and their impact on model performance. The findings underscore the importance of addressing cultural biases in datasets to ensure equitable evaluations of language models across diverse linguistic contexts.

Limitations

The limitations of the Global-MMLU dataset highlight several critical areas for improvement. Firstly, the uneven distribution of contributions from community annotators across the 42 languages included may result in skewed dataset distributions and a lack of diversity among annotators, particularly for less represented languages. This limitation underscores the need for future research to broaden language and dialect coverage, as the current selection only represents a small fraction of global linguistic diversity. Additionally, the presence of geo-cultural variations within languages, which can lead to the emergence of new dialects or creoles, necessitates a more nuanced approach to evaluating how technology accommodates these variations.

Moreover, the dataset may inadvertently contain toxic or offensive content, as the annotation interface did not permit flagging such speech, although the focus on examination materials suggests a low risk. The initial classification of regions for annotating geographically sensitive questions is also noted as a limitation, with a recommendation to adopt a more detailed taxonomy from the World Bank for future annotations. Lastly, while the Global-MMLU initiative aims to address cultural biases, it does not fully resolve issues of cultural sensitivity and inclusion. Future efforts must prioritize the integration of diverse, culturally grounded knowledge to enhance inclusivity and fairness in multilingual AI evaluations.