A large-scale audit of dataset licensing and attribution in AI

Journal: Nature Machine Intelligence, Volume: 6, Issue: 8
DOI: https://doi.org/10.1038/s42256-024-00878-8
Publication Date: 2024-08-30
Author(s): Shayne Longpre et al.
Primary Topic: Ethics and Social Impacts of AI

Overview

The research paper addresses critical legal and ethical issues arising from the training of language models on extensive and inconsistently documented datasets. A multi-disciplinary team of legal and machine learning experts conducted a systematic audit of over 1,800 text datasets to enhance data transparency. They developed tools and standards to trace the lineage of these datasets, including their sources, creators, licenses, and uses. The analysis revealed significant disparities in the composition of datasets licensed for commercial use, with low-resource languages and synthetic data often facing restrictive licensing. Alarmingly, the study found that popular dataset hosting sites had license omission rates exceeding 70% and error rates above 50%, indicating a substantial crisis in data attribution and informed usage.

Furthermore, the paper highlights that the treatment of dataset collections as monolithic entities obscures their complex lineage, which includes multiple rounds of curation and licensing. This lack of transparency is exacerbated by the challenges of properly attributing vast data collections and increasing copyright scrutiny, leading to a decline in the documentation and understanding of training data. To address these issues, the authors released the Data Provenance Explorer, an interactive tool designed to help practitioners trace and filter data provenance for popular finetuning datasets, thereby promoting responsible use and transparency in dataset management.

Methods

The Methods section describes how the authors automatically annotate datasets using metadata tied to their academic publications. Licenses are extracted from Hugging Face configurations and GitHub pages, and the Semantic Scholar public API supplies publication dates and citation counts. The authors also compute text metrics for each dataset: the minimum, mean, and maximum lengths of inputs and targets, and the number of dialogue turns. Sequence length is measured in characters rather than tokens, because tokenizer performance varies across languages and scripts and a character count treats them more equitably.
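The text metrics described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names and the representation of a dataset as a list of dialogs, each a list of (input, target) string pairs, are assumptions made for the example.

```python
from statistics import mean


def summarize(values):
    """Return the minimum, mean, and maximum of a non-empty list of numbers."""
    return {"min": min(values), "mean": mean(values), "max": max(values)}


def dataset_text_metrics(dialogs):
    """Compute character-length and dialogue-turn statistics for a dataset.

    `dialogs` is a non-empty list of dialogs; each dialog is a list of
    (input_text, target_text) turns. This layout is hypothetical.
    """
    # len() on a Python str counts Unicode code points, so no tokenizer
    # is involved and every script is measured the same way.
    input_lengths = [len(inp) for dialog in dialogs for inp, _ in dialog]
    target_lengths = [len(tgt) for dialog in dialogs for _, tgt in dialog]
    turns_per_dialog = [len(dialog) for dialog in dialogs]
    return {
        "input_chars": summarize(input_lengths),
        "target_chars": summarize(target_lengths),
        "dialog_turns": summarize(turns_per_dialog),
    }


# Hypothetical toy dataset: one single-turn dialog and one two-turn dialog.
example_dialogs = [
    [("Hi", "Hello!")],
    [("abcd", "xy"), ("abc", "de")],
]
metrics = dataset_text_metrics(example_dialogs)
```

Counting characters keeps the metric independent of any particular tokenizer, which is the motivation the summary gives for the authors' choice.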

To assess data diversity beyond established task categories, the authors use OpenAI’s GPT-4 API for topic annotation: they randomly sample 100 examples from each dataset and prompt GPT-4 to identify up to ten topics per text. In addition, AI experts, including PhD students and postdocs, reviewed the original papers to document data sources and how the data was generated, whether synthetically or by human annotation. The GPT-4 API also served as an in-context retriever, processing the arXiv papers in 4,000-character segments to extract relevant snippets that the experts may have overlooked.
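The two preprocessing steps mentioned above, sampling 100 examples per dataset and splitting a paper into 4,000-character segments, might look roughly like the sketch below. The function names, the fixed seed, and the choice of non-overlapping segments are assumptions; the summary does not say how the authors seeded the sampling or whether their segments overlap. The GPT-4 calls themselves are omitted.

```python
import random


def sample_examples(examples, k=100, seed=0):
    """Randomly sample up to k examples without replacement.

    If the dataset has k or fewer examples, all of them are returned.
    The seed is an illustrative choice for reproducibility.
    """
    examples = list(examples)
    if len(examples) <= k:
        return examples
    return random.Random(seed).sample(examples, k)


def chunk_text(text, size=4000):
    """Split text into consecutive, non-overlapping segments of at most
    `size` characters. Non-overlap is an assumption of this sketch."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical usage: 250 placeholder examples and a 9,000-character "paper".
sampled = sample_examples([f"example {i}" for i in range(250)])
segments = chunk_text("a" * 9000)
```

Each sampled example or paper segment would then be placed in a prompt; the prompt wording is not given in this summary, so it is left out here.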

Discussion

The discussion section outlines the Data Provenance Initiative, which audits widely used datasets, particularly those for instruction or alignment finetuning, to improve understanding of dataset lineage and its associated risks. The initiative has audited 44 collections comprising 1,858 datasets, which are frequently downloaded from platforms like Hugging Face. The findings reveal that many datasets suffer from ambiguous or unspecified licensing, with over 70% of licenses on popular platforms classified as ‘unspecified’. This lack of clarity poses significant risks for practitioners who seek to comply with legal and ethical standards in data usage.
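To make the idea of an "unspecified" license share concrete, the sketch below computes two simple rates from a list of license labels. It is only an illustration: the function names, the label conventions, and in particular the definition of the disagreement rate are assumptions, and the paper's own definitions of omission and error rates may differ.

```python
def is_missing(label):
    """Treat None, empty strings, and 'unspecified' as a missing license.
    This convention is an assumption of the sketch."""
    return label is None or label.strip().lower() in {"", "unspecified"}


def omission_rate(hosted_labels):
    """Fraction of datasets whose hosted license label is missing."""
    missing = sum(1 for label in hosted_labels if is_missing(label))
    return missing / len(hosted_labels)


def disagreement_rate(hosted_labels, audited_labels):
    """Among datasets with a hosted label, the fraction whose hosted label
    differs from the manually audited one (illustrative definition)."""
    pairs = [(h, a) for h, a in zip(hosted_labels, audited_labels)
             if not is_missing(h)]
    if not pairs:
        return 0.0
    return sum(1 for h, a in pairs if h != a) / len(pairs)


# Hypothetical labels for five datasets.
hosted = ["MIT", None, "unspecified", "", "CC-BY-4.0"]
audited = ["MIT", "Apache-2.0", "CC-BY-NC-4.0", "MIT", "CC-BY-SA-4.0"]
```

With these toy labels, three of the five hosted entries count as missing, and one of the two specified entries disagrees with the audited label.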

The authors introduce the Data Provenance Explorer (DPExplorer), a tool designed to facilitate the exploration and filtering of dataset provenance and characteristics. This tool aims to assist researchers in generating human-readable summaries of datasets, thereby easing the burden of documentation. The paper highlights a critical divide between commercially open and closed datasets, noting that non-commercial datasets tend to exhibit greater diversity in tasks, topics, and sources. However, these datasets also present challenges regarding compliance with attribution requirements and share-alike clauses, which complicate the repackaging of data collections. The authors emphasize the need for improved transparency and legal clarity in dataset licensing to support responsible AI development and mitigate the risks associated with dataset usage.
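The kind of filtering described above can be illustrated with a small sketch. This is not the DPExplorer interface: the record fields, category strings, function names, and dataset names are all hypothetical, chosen only to show how license use, share-alike clauses, and attribution requirements could drive a selection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical provenance record for one dataset."""
    name: str
    license_use: str            # "commercial", "non-commercial", or "unspecified"
    requires_attribution: bool
    share_alike: bool


def filter_for_commercial_use(records, allow_share_alike=False):
    """Keep datasets licensed for commercial use, optionally excluding
    those with share-alike clauses."""
    return [
        r for r in records
        if r.license_use == "commercial"
        and (allow_share_alike or not r.share_alike)
    ]


def attribution_list(records):
    """Names of datasets that must be credited, sorted alphabetically."""
    return sorted(r.name for r in records if r.requires_attribution)


# Hypothetical records; the names do not refer to real datasets.
records = [
    DatasetRecord("alpha", "commercial", True, False),
    DatasetRecord("beta", "commercial", True, True),
    DatasetRecord("gamma", "non-commercial", False, False),
    DatasetRecord("delta", "unspecified", False, False),
]
selected = filter_for_commercial_use(records)
credits = attribution_list(selected)
```

Excluding share-alike datasets by default reflects the point made above that such clauses complicate repackaging; a practitioner with different constraints could pass allow_share_alike=True.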