A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism

Journal: Findings of the Association for Computational Linguistics ACL 2024
DOI: https://doi.org/10.18653/v1/2024.findings-acl.103
Publication Date: 2024-01-01
Author(s): Brian J. Thompson et al.
Primary Topic: Digital Humanities and Scholarship

Overview

The paper shows that a significant portion of web content is translated into many languages, and that the low quality of these multi-way translations suggests they were produced predominantly by Machine Translation (MT). This phenomenon is particularly prevalent in lower-resource languages, where machine-generated, multi-way parallel content constitutes a substantial share of all web content in those languages. The study also identifies a selection bias in what gets translated: the evidence is consistent with primarily low-quality English content being translated en masse into these languages via MT.

These findings raise critical concerns regarding the training of multilingual large language models on data sourced from the web, both monolingual and bilingual. The implications suggest that reliance on such data may perpetuate the dissemination of low-quality content and potentially impact the performance and reliability of language models.

Introduction

The introduction discusses the central role of large-scale training data in modern AI, particularly in the context of Machine Translation (MT). The authors investigate how the long-term availability of low-cost MT has influenced web content, and find that a substantial amount of translated material, especially in lower-resource languages, was generated by MT. They present evidence that translation quality decreases as multi-way parallelism increases, suggesting that content translated into many languages is often of lower quality and more predictable. They also observe a selection bias in the kind of content that is translated into many languages, which they attribute to the prevalence of low-quality English content translated en masse for ad revenue.

The findings have critical implications for multilingual model development, as the limited fluency and accuracy of MT-generated data may lead to less effective models with more hallucinations. The authors emphasize the importance of data quality in training Large Language Models (LLMs) and note the challenges associated with low-resource MT. To facilitate further analysis, they introduce a multi-way parallel corpus comprising 6.4 billion unique sentences across 90 languages, alongside code for reproducing their corpus and analysis. This work builds on previous studies of large-scale corpora and machine translation detection, and uses the ccMatrix corpus, which was mined from Common Crawl, as the basis for the dataset.

Discussion

In this section, the authors describe how they built and analyzed a multi-way parallel corpus, termed Multi-Way ccMatrix (MWccMatrix), which consists of translation tuples: sets of two or more sentences in different languages that are translations of one another. The corpus was constructed by iterating through the existing ccMatrix bitext and merging sentence pairs into tuples, deduplicating sentences (including near duplicates) along the way. This process reduced the total sentence count from 21.7 billion to 6.4 billion unique sentences, grouped into 2.2 billion tuples. The analysis reveals that a significant portion of web content is translated, with high rates of multi-way parallelism, particularly in lower-resource languages, where the average parallelism is notably higher than in high-resource languages.
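
The tuple construction described above can be illustrated with a small sketch. It is not the authors' released code: it is a simplified stand-in that groups sentence pairs into connected components with a union-find structure and keeps one sentence per language in each group. The `normalize` rule for near duplicates (lowercasing and dropping non-letters) and the names `build_tuples` and `mean_parallelism_by_language` are assumptions made for illustration.

```python
from collections import defaultdict


def normalize(sentence: str) -> str:
    """Assumed near-duplicate key: lowercase and keep letters only."""
    return "".join(ch for ch in sentence.lower() if ch.isalpha())


class UnionFind:
    """Minimal union-find over hashable keys."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def build_tuples(bitext):
    """Group ((lang, sentence), (lang, sentence)) pairs into translation tuples.

    Returns a list of dicts mapping language code -> sentence. Sentences that
    share a normalized form within a language are treated as one sentence.
    """
    uf = UnionFind()
    surface = {}  # (lang, normalized text) -> first surface form seen
    for (lang_a, sent_a), (lang_b, sent_b) in bitext:
        key_a = (lang_a, normalize(sent_a))
        key_b = (lang_b, normalize(sent_b))
        surface.setdefault(key_a, sent_a)
        surface.setdefault(key_b, sent_b)
        uf.union(key_a, key_b)

    groups = defaultdict(dict)
    for key, sentence in surface.items():
        lang = key[0]
        # Keep only the first sentence per language in each group.
        groups[uf.find(key)].setdefault(lang, sentence)
    return list(groups.values())


def mean_parallelism_by_language(tuples):
    """Average tuple size (number of languages) over the tuples each language appears in."""
    sizes = defaultdict(list)
    for tup in tuples:
        for lang in tup:
            sizes[lang].append(len(tup))
    return {lang: sum(v) / len(v) for lang, v in sizes.items()}


# Toy input, invented for this example.
pairs = [
    (("en", "Hello world!"), ("fr", "Bonjour le monde !")),
    (("en", "hello world"), ("de", "Hallo Welt!")),
    (("en", "Good night."), ("es", "Buenas noches.")),
]
tuples = build_tuples(pairs)
print(sorted(len(t) for t in tuples))
print(mean_parallelism_by_language(tuples))
```

In this sketch the parallelism of a tuple is simply the number of languages it contains, so a 2-way tuple is an ordinary sentence pair. Grouping by connected components is a deliberate simplification: one noisy alignment can merge unrelated tuples, which a production pipeline would need to guard against.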

The findings indicate that multi-way parallel translations tend to be shorter and simpler, yet are of lower quality as assessed by Quality Estimation (QE). Specifically, translations with higher multi-way parallelism scored significantly lower on the CometQE metric than 2-way translations. This suggests that much of the translated content, particularly in lower-resource languages, was likely generated by MT. The highly parallel content also reflects a bias towards shorter, less complex text, often categorized under “CONVERSATION & OPINION.” The authors propose that this trend may stem from low-quality content being translated en masse for ad revenue. They call for further investigation into the implications of these findings and suggest that multi-way parallelism could serve as a useful indicator for identifying low-quality, machine-translated data in both bilingual and monolingual contexts.
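
The comparison of quality against parallelism, and the idea of using parallelism as a filter, can be sketched as below. This is an illustrative outline only: the bucket boundaries, the function names, and the threshold `max_parallelism=2` are assumptions, and the scores in the usage example are invented placeholders, not CometQE outputs or results from the paper.

```python
from collections import defaultdict
from statistics import mean


def bucket(parallelism: int) -> str:
    """Map a tuple size to a coarse bucket (boundaries chosen arbitrarily here)."""
    if parallelism <= 2:
        return "2"
    if parallelism <= 4:
        return "3-4"
    if parallelism <= 8:
        return "5-8"
    return "9+"


def mean_quality_by_parallelism(records):
    """records: iterable of (parallelism, quality_score) pairs.

    Returns the mean quality score for each parallelism bucket.
    """
    scores = defaultdict(list)
    for parallelism, score in records:
        scores[bucket(parallelism)].append(score)
    return {name: mean(values) for name, values in scores.items()}


def keep_low_parallelism(tuples, max_parallelism=2):
    """Drop translation tuples that appear in more than max_parallelism languages."""
    return [tup for tup in tuples if len(tup) <= max_parallelism]


# Placeholder scores, invented for this example.
records = [(2, 0.75), (2, 0.25), (3, 0.5), (10, 0.125)]
print(mean_quality_by_parallelism(records))

tuples = [
    {"en": "Good night.", "es": "Buenas noches."},
    {"en": "Hello world!", "fr": "Bonjour le monde !", "de": "Hallo Welt!"},
]
print(len(keep_low_parallelism(tuples)))
```

A hard cutoff like this is the bluntest possible use of the signal. Whether a threshold, a soft weight, or a combination with a QE score works best is left open by the summary above.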

Limitations

The authors note several limitations that may affect the validity and generalizability of the findings. First, the research covers 90 widely used languages, so the trends it reports may not hold for lower-resource languages outside that set, including those not supported by MT. Additionally, the analysis is conducted at the sentence level because ccMatrix is a sentence-level corpus, which complicates certain evaluations, such as topic analysis, that would benefit from document-level alignment. The absence of a publicly available document-level corpus further constrains the analysis.

Moreover, the study evaluates translation quality with CometQE, a metric trained on data from the WMT evaluations. The web data used in this research may differ significantly in domain distribution from WMT data, so the metric may be less reliable here than on the data it was trained for. The reliance on web-scraped bitext also introduces biases inherent in the scraping process, as demonstrated by previous studies. The Common Crawl dataset, while comprehensive, may favor previously identified translations, and the methodology used for corpus creation, including LASER margin scores, is known to have limitations in lower-resource languages. Lastly, the automatic language identification (LID) methods used in the analysis may also exhibit shortcomings, particularly for low-resource languages, which could further affect the reliability of the results.