نماذج اللغة الأكبر والأكثر قابلية للتعليم تصبح أقل موثوقية Larger and more instructable language models become less reliable

المجلة: Nature، المجلد: 634، العدد: 8032
DOI: https://doi.org/10.1038/s41586-024-07930-y
PMID: https://pubmed.ncbi.nlm.nih.gov/39322679
تاريخ النشر: 2024-09-25
المؤلف: Lexin Zhou وآخرون
الموضوع الرئيسي: نمذجة الموضوعات

نظرة عامة

تقدم هذه القسم نظرة عامة على تطور الموثوقية في نماذج اللغة الكبيرة (LLMs) من خلال دراسة العائلات الرئيسية، وبشكل خاص سلسلة GPT من OpenAI، وسلسلة LLaMA من Meta، ومجموعة BLOOM من BigScience. كانت نماذج GPT محورية في تقدم الحالة الفنية في LLMs، مما أثر بشكل كبير على هياكل المحولات، ومجموعات بيانات التدريب، وطرق التقييم، واستراتيجيات المحاذاة، كما تم تسليط الضوء عليه من خلال مسوحات مختلفة.

في المقابل، تمثل سلسلة LLaMA عائلة نماذج تم إصدار أوزانها علنًا، مما يعزز الوصول والتعاون في هذا المجال. في الوقت نفسه، تمثل مجموعة BLOOM مبادرة أكثر انفتاحًا يقودها المجتمع العلمي، تهدف إلى تعزيز قدرات ومحاذاة LLMs. تعكس كل من هذه العائلات الجهود المستمرة لتحسين موثوقية وأداء LLMs في تطبيقات مختلفة.

طرق

في هذا القسم، يوضح المؤلفون طرقهم التجريبية، بما في ذلك اختيار المعايير، وقوالب التحفيز، ودوال الصعوبة، وتقييم الاستجابات، ومقاييس التقييم لنماذج اللغة الكبيرة (LLMs). تضمنت الإعدادات التجريبية استعلامات لموديلات مختلفة مع تعيين معلمة درجة الحرارة على الصفر وعدم وجود تحفيز نظامي، باستخدام مجموعة مشتركة من ستة عقد مزودة بـ 8× NVIDIA A40 48 GB GPUs. كان إجمالي وقت الحوسبة للتجارب حوالي 100 يوم حوسبة على عقدة واحدة.

فحصت الدراسة نماذج متعددة من عائلة GPT، بما في ذلك النسخ الأصلية من GPT-3 (ada، babbage، curie، davinci) وخلفائها المعدلة (إصدارات InstructGPT وGPT-3.5-turbo). بالإضافة إلى ذلك، تم اختبار أربعة مقاييس من نموذج LLaMA، جنبًا إلى جنب مع مقاييس مختلفة من نماذج BLOOM وBLOOMz، حيث تتضمن الأخيرة تحسينًا متعدد اللغات ومتعدد المهام. تم إجراء الاستدلالات للنماذج الأصغر محليًا، بينما تم الوصول إلى النماذج الأكبر عبر واجهات برمجة التطبيقات. تم ضبط عدد الرموز لمعايير مختلفة بعناية لضمان أطوال استجابة كافية مع تحسين التكاليف، مع تكوينات محددة لكل نوع من أنواع النماذج.

نتائج

“تقدم الشكل 2 أداء نماذج مختلفة من عائلات GPT وLLaMA وBLOOM عبر خمسة مجالات: ‘الإضافة’، ‘الأنجرام’، ‘المحلية’، ‘العلم’، و’transforms’. تشير النتائج إلى أنه مع زيادة حجم النماذج وتشكيلها، تزداد نسبة الاستجابات الصحيحة، خاصة بالنسبة للنماذج الأكثر تقدمًا. ومع ذلك، بينما تظهر النماذج المشكّلة نسب صحة أعلى واستقرارًا تجاه تغييرات التحفيز، فإنها تظهر حذرًا أقل ولا تتماشى جيدًا مع مقاييس صعوبة البشر. ومن الجدير بالذكر أن النماذج تواجه صعوبة في المهام الأبسط، مما يكشف عن ظاهرة عدم توافق الصعوبة حيث تفشل في الحالات الأساسية على الرغم من قدرتها على حل مشاكل أكثر تعقيدًا.

تظهر تحليل الدقة بالنسبة للصعوبة انخفاضًا مستمرًا في الدقة مع زيادة صعوبة المهمة، مع وجود ارتباطات عالية بين الدقة ومؤشرات صعوبة البشر، باستثناء نموذج BLOOM في مجال ‘الإضافة’. على الرغم من التحسينات في المهام ذات الصعوبة المتوسطة إلى العالية، لا يحقق أي نموذج أكثر من 60% دقة في أبسط المستويات، حيث يظهر GPT-4 فقط مكاسب هامشية. علاوة على ذلك، تؤدي الانتقال من النماذج الخام إلى المشكّلة إلى زيادة كبيرة في الاستجابات غير الصحيحة، حيث تتناقص استراتيجيات التجنب، مما يؤدي إلى مخرجات معقولة ولكن غير صحيحة.”

نقاش

في هذا القسم، يناقش المؤلفون تداعيات دراستين بشريتين تركزان على موثوقية نماذج اللغة الكبيرة (LLMs) فيما يتعلق بتوقعات المستخدمين المدفوعة بالصعوبة ودقة مخرجات النموذج. يبرزون أهمية فهم تصورات البشر للصعوبة، والتي تعتبر حاسمة لتحديد المهام التي يرغب المستخدمون في تفويضها إلى LLMs. تكشف الدراسات أنه بينما يمكن لمؤشرات صعوبة البشر التنبؤ بدقة النموذج، فإن النماذج الحالية تظهر اتجاهًا مقلقًا: مع زيادة الصعوبة، لا تنخفض الأخطاء، ولا ترتفع سلوكيات التجنب بشكل متناسب، مما يؤدي إلى نقص في مجالات التشغيل الموثوقة للمستخدمين.

يحدد المؤلفون ثلاثة عناصر متداخلة تؤثر على موثوقية LLM: توافق الصعوبة، وتجنب المهام، واستقرار التحفيز. يلاحظون أن زيادة حجم النماذج وتشكيلها غالبًا ما يتبادل التجنب مقابل زيادة الأخطاء، مما يمكن أن يضلل المستخدمين في تقدير موثوقية مخرجات النموذج بشكل مبالغ فيه. علاوة على ذلك، تشير النتائج إلى أن حساسية التحفيز تختلف عبر النماذج ومستويات الصعوبة، مما يعقد تجربة المستخدم. يدعو المؤلفون إلى تحسين منهجيات التدريب التي تتضمن توقعات صعوبة البشر ويؤكدون على ضرورة تطوير نماذج متخصصة للتطبيقات الحرجة، والتي قد تشمل آليات لرفض الاستجابات غير المناسبة. ويختتمون بالاعتراف بالقيود في دراستهم والحاجة إلى مزيد من البحث لتعزيز موثوقية LLMs.

Journal: Nature, Volume: 634, Issue: 8032
DOI: https://doi.org/10.1038/s41586-024-07930-y
PMID: https://pubmed.ncbi.nlm.nih.gov/39322679
Publication Date: 2024-09-25
Author(s): Lexin Zhou et al.
Primary Topic: Topic Modeling

Overview

The section provides an overview of the evolution of reliability in large language models (LLMs) by examining key families, specifically the GPT series from OpenAI, the LLaMA series from Meta, and the BLOOM suite from BigScience. The GPT models have been pivotal in advancing the state of the art in LLMs, significantly influencing transformer architectures, training datasets, evaluation methods, and alignment strategies, as highlighted by various surveys.

In contrast, the LLaMA series exemplifies a model family with publicly released weights, promoting accessibility and collaboration in the field. Meanwhile, the BLOOM suite represents a more open initiative driven by the scientific community, aiming to enhance the capabilities and alignment of LLMs. Each of these families reflects ongoing efforts to improve the reliability and performance of LLMs in various applications.

Methods

In this section, the authors detail their experimental methods, including the selection of benchmarks, prompt templates, difficulty functions, response scoring, and evaluation metrics for large language models (LLMs). The experimental setup involved querying various models with a temperature parameter set to zero and no system prompt, utilizing a shared cluster of six nodes equipped with 8× NVIDIA A40 48 GB GPUs. The total compute time for the experiments was approximately 100 compute days on a single node.

The study examined multiple models from the GPT family, including the original GPT-3 variants (ada, babbage, curie, davinci) and their fine-tuned successors (InstructGPT versions and GPT-3.5-turbo). Additionally, four scales of the LLaMA model were tested, along with different scales of the BLOOM and BLOOMz models, the latter incorporating multilingual multitask fine-tuning. Inferences for smaller models were conducted locally, while larger models were accessed via APIs. Token counts for various benchmarks were carefully adjusted to ensure adequate response lengths while optimizing costs, with specific configurations for each model type.

Results

“Figure 2 presents the performance of various models from the GPT, LLaMA, and BLOOM families across five domains: ‘addition’, ‘anagram’, ‘locality’, ‘science’, and ‘transforms’. The results indicate that as models are scaled and shaped, the percentage of correct responses increases, particularly for the most advanced models. However, while shaped-up models demonstrate higher correctness proportions and stability to prompt variations, they exhibit lower prudence and do not align well with human difficulty metrics. Notably, the models struggle with simpler tasks, revealing a difficulty discordance phenomenon where they fail at basic instances despite being capable of solving more complex problems.

The analysis of correctness relative to difficulty shows a consistent decline in accuracy as task difficulty increases, with high correlations between correctness and human difficulty proxies, except for the BLOOM model in the ‘addition’ domain. Despite improvements in medium to high difficulty tasks, no model achieves over 60% correctness at the simplest levels, with GPT-4 showing only marginal gains. Furthermore, the transition from raw to shaped-up models results in a significant increase in incorrect responses, as avoidance strategies diminish, leading to plausible but incorrect outputs.”

Discussion

In this section, the authors discuss the implications of two human studies focused on the reliability of large language models (LLMs) in relation to user-driven expectations of difficulty and the accuracy of model outputs. They highlight the importance of understanding human perceptions of difficulty, which are crucial for determining the tasks users are willing to delegate to LLMs. The studies reveal that while human difficulty proxies can predict model correctness, the current models exhibit a concerning trend: as difficulty increases, errors do not decrease, and avoidance behaviors do not correspondingly rise, leading to a lack of reliable operating areas for users.

The authors identify three intertwined elements affecting LLM reliability: difficulty concordance, task avoidance, and prompting stability. They note that scaling and shaping models often trade avoidance for increased incorrectness, which can mislead users into overestimating the reliability of model outputs. Furthermore, the findings suggest that prompt sensitivity varies across models and difficulty levels, complicating the user experience. The authors advocate for improved training methodologies that incorporate human difficulty expectations and emphasize the necessity of developing specialized models for critical applications, which may include mechanisms for rejecting inappropriate responses. They conclude by acknowledging limitations in their study and the need for further research to enhance the reliability of LLMs.