HalluLens: LLM Hallucination Benchmark

Journal: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
DOI: https://doi.org/10.18653/v1/2025.acl-long.1176
Publication Date: 2025-01-01
Author(s): Yejin Bang et al.
Primary Topic: Hallucination evaluation in large language models

Overview

The paper presents HalluLens, a benchmark for evaluating hallucinations in large language models (LLMs). Hallucinations, defined as deviations from user input or training data, pose significant challenges to user trust and the broader adoption of generative AI systems. The authors address the lack of a unified framework for assessing hallucinations by proposing a taxonomy that distinguishes intrinsic from extrinsic hallucinations, with a particular focus on extrinsic hallucinations, where generated content diverges from the training data. This focus matters because existing benchmarks do not adequately address extrinsic hallucinations, even as LLMs continue to evolve.

HalluLens introduces three new extrinsic evaluation tasks: PreciseWikiQA, LongWiki, and NonExistentRefusal, which utilize dynamic test set generation to prevent data leakage and enhance robustness. The paper emphasizes the importance of these tasks in providing a structured approach for evaluating LLM outputs. By releasing the HalluLens codebase, the authors aim to equip researchers and practitioners with essential tools to improve the reliability and trustworthiness of future LLM developments, thereby advancing the field of generative AI.

Introduction

The introduction addresses the issue of “hallucination” in large language models (LLMs), where generated responses may diverge from user input or established knowledge, undermining user trust in generative AI systems. The authors emphasize the need for a comprehensive and reliable evaluation framework to identify and mitigate hallucinations, which are often conflated with “factuality.” They argue that although the two phenomena are similar, they require distinct benchmarks and solutions because they differ in nature: hallucination can be defined internally, based on model behavior, whereas factuality relies on an external oracle for ground truth.

To advance the understanding of LLM hallucinations, the authors propose a taxonomy that categorizes hallucinations into two main types: “intrinsic” and “extrinsic.” Intrinsic hallucinations occur when generated content contradicts the input context, making them verifiable against the input. In contrast, extrinsic hallucinations arise when generated content is inconsistent with the training data; because such content draws on the model’s internal knowledge rather than on the input context, it cannot be verified against the input. The paper introduces new evaluation tasks specifically targeting extrinsic hallucination and outlines a dynamic approach to benchmark design that mitigates data leakage risks, keeping evaluations robust over time. The overarching goals of the work are to establish a clear taxonomy of hallucinations, develop new evaluation tasks, and analyze existing benchmarks to separate hallucination assessments from factuality assessments.

Methods

The Methods section reports false acceptance rates for the evaluated models, as detailed in Table 2. The Llama-3.1-405B-Instruct model has the lowest false acceptance rate, suggesting a reduced tendency to hallucinate when faced with unfamiliar knowledge, because it opts to abstain from responding. The findings indicate an inverse relationship between false acceptance and false refusal rates: Llama models exhibit low false acceptance rates because of their higher refusal rates, while the Mistral model, which rarely refuses, shows a correspondingly high false acceptance rate.
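The summary does not define the false acceptance rate. A minimal sketch, assuming it is the share of prompts about nonexistent entities that a model answers instead of refusing; the function name and the sample judgments are hypothetical:

```python
from typing import Sequence


def false_acceptance_rate(refused: Sequence[bool]) -> float:
    """Share of prompts about nonexistent entities that the model answered.

    refused[i] is True when the model declined to answer prompt i.
    Assumption: every prompt asks about a made-up entity, so any
    non-refusal counts as a false acceptance.
    """
    if not refused:
        raise ValueError("need at least one judged response")
    accepted = sum(1 for r in refused if not r)
    return accepted / len(refused)


# Hypothetical judgments for two models on the same five prompts.
cautious_model = [True, True, True, False, True]
eager_model = [False, False, True, False, False]

print(false_acceptance_rate(cautious_model))  # 0.2
print(false_acceptance_rate(eager_model))     # 0.8
```

Under this reading, a model that refuses often will score low on false acceptance almost by construction, which is the inverse relationship described above.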

The analysis further reveals variability in model performance across two sub-tasks, as shown in Table 3. The authors calculated Kendall’s τ between the two sub-tasks and between each sub-task and the average, obtaining 0.5897 between the two sub-tasks and 0.7436 and 0.8462 for each sub-task against the average. These correlations are statistically significant, indicating that model rankings are broadly consistent across tasks. Additional analyses, including model performance variation across different domains, an ablation study on the round-robin approach using a single model, and performance variance with different seeds in random prompt generation, are elaborated in Appendix C.3.2.
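Kendall’s τ compares how two score lists rank the same items. The sketch below implements the tie-free variant (tau-a) in plain Python; the scores are hypothetical, and the summary does not say which variant or tooling the authors used:

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences of length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1   # both lists order the pair the same way
        elif sign < 0:
            discordant += 1   # the lists disagree on the pair
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores of four models on two sub-tasks.
task_a = [0.61, 0.48, 0.35, 0.72]
task_b = [0.58, 0.51, 0.40, 0.55]

print(round(kendall_tau(task_a, task_b), 4))  # 0.6667
```

As an inference from the numbers only, the reported values match a tie-free τ over 78 pairs, which would correspond to 13 ranked models: 46/78 ≈ 0.5897, 58/78 ≈ 0.7436, and 66/78 ≈ 0.8462. The summary itself does not state the model count.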

Results

The evaluation results indicate significant variability in false refusal rates among different models, with the Llama and Claude models exhibiting notably higher refusal rates compared to others. Specifically, the Llama-3.1-8B-Instruct model has the highest refusal rate at 83.09%, while GPT-4o demonstrates a much lower rate of 4.13%. This trend is consistent with previous findings for GPT-4o and Claude-3 models (Wang et al., 2024). Despite its high refusal rate, Llama-3.1-8B shows a lower hallucination rate (48.37%) than similar-sized models like Qwen2.5 7B (85.22%) and Mistral 7B (81.19%). Larger models generally refuse less than their smaller counterparts, although the Llama-3.1-405B-Instruct model, while having the lowest hallucination rate at 26.84%, still refuses to answer 56.77% of the time.
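The summary does not give the denominators behind these rates. A minimal sketch, assuming the refusal rate is computed over all prompts while the hallucination rate is computed only over prompts the model actually answered; the label names and counts are hypothetical:

```python
from collections import Counter
from typing import Dict, Iterable

VALID_LABELS = {"refused", "correct", "hallucinated"}


def refusal_and_hallucination_rates(labels: Iterable[str]) -> Dict[str, float]:
    """Summarize judged responses into refusal, hallucination, and correct rates."""
    counts = Counter(labels)
    unknown = set(counts) - VALID_LABELS
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one judged response")
    answered = total - counts["refused"]
    return {
        "refusal_rate": counts["refused"] / total,
        # Assumption: refusals are excluded from the hallucination denominator.
        "hallucination_rate": counts["hallucinated"] / answered if answered else 0.0,
        "correct_rate": counts["correct"] / total,
    }


# Hypothetical judgments: 5 refusals, 3 correct answers, 2 hallucinations.
judged = ["refused"] * 5 + ["correct"] * 3 + ["hallucinated"] * 2
print(refusal_and_hallucination_rates(judged))
# {'refusal_rate': 0.5, 'hallucination_rate': 0.4, 'correct_rate': 0.3}
```

With this denominator, a model can refuse often and still post a low hallucination rate, which matches the pattern reported for the Llama models.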

Further analysis reveals that models tend to refuse more frequently on difficult questions, highlighting a long-tail problem. GPT-4o stands out with a low false refusal rate of 0.13 and high precision and F1@32 scores, indicating that it produces less hallucinated content while answering more often. The Llama-3.1-405B-Instruct-FP8 model also performs well with an F1@32 score of 61.98, while the Llama-3.3-70B-Instruct model achieves a false refusal rate of 0.67 and a recall@32 of 75.46. In contrast, Mistral models show mixed results, with Mistral-Nemo-Instruct-2407 exhibiting no false refusals but lower precision (38.06). Claude models display varied performance: the Claude-3-haiku model achieves high precision (65.24) but lower recall (58.95), while the Claude-3-sonnet model has a similar F1@32 score but offers more content at the risk of hallucination. Overall, the findings underscore the trade-offs between precision, recall, and refusal rates across different models.
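The summary reports precision, recall@32, and F1@32 without defining them. The sketch below assumes the definition commonly used in long-form factuality evaluation, where a response is split into claims, precision is the share of supported claims, and recall@K saturates once K supported claims are present. The function name and the claim counts are hypothetical:

```python
from typing import Tuple


def precision_recall_f1_at_k(supported: int, unsupported: int, k: int = 32) -> Tuple[float, float, float]:
    """Return (precision, recall@k, F1@k) for one long-form response.

    Assumption: recall@k = min(supported / k, 1), so a response is not
    rewarded for supported claims beyond the k-th.
    """
    if supported < 0 or unsupported < 0 or k <= 0:
        raise ValueError("counts must be non-negative and k positive")
    claims = supported + unsupported
    precision = supported / claims if claims else 0.0
    recall = min(supported / k, 1.0)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# A concise response: 24 supported claims, 8 unsupported.
print(precision_recall_f1_at_k(24, 8))   # (0.75, 0.75, 0.75)

# A longer response: recall saturates, but precision drops.
p, r, f1 = precision_recall_f1_at_k(40, 40)
print(p, r, round(f1, 4))                # 0.5 1.0 0.6667
```

The second example illustrates the trade-off described above: producing more content can raise recall while lowering precision.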

Discussion

In the discussion section, the authors differentiate between LLM hallucination and factuality, emphasizing their distinct implications for model performance. Factuality concerns the accuracy of generated content against verified sources, while hallucination concerns whether outputs are inconsistent with the model’s training data or input context. The authors argue that conflating these concepts complicates model development and mitigation strategies. They propose a clear distinction between factuality, which relies on external verification, and hallucination, which is evaluated against the model’s internal knowledge and context. The paper advocates for dedicated hallucination benchmarks, highlighting the need for robust evaluation criteria that account for data leakage, real-world applicability, and reproducibility.

The authors categorize hallucinations into extrinsic and intrinsic types. Extrinsic hallucination occurs when generated content does not align with the training data, while intrinsic hallucination arises from inconsistencies with the input context. They identify potential sources of hallucination, including limited knowledge, contradictory training data, and modeling errors. The paper introduces a new benchmarking framework, HalluLens, which includes tasks designed to evaluate extrinsic hallucination through dynamically generated prompts from Wikipedia. This framework aims to provide a comprehensive assessment of model performance, ensuring that evaluations remain relevant and resistant to data leakage. The authors conclude that addressing hallucination will enhance overall model factuality and reliability.

Limitations

The authors note several limitations. The benchmark focuses on extrinsic hallucinations, although the authors acknowledge that intrinsic hallucinations are also needed for a complete assessment of model truthfulness. The benchmark is primarily designed for English, which may limit its applicability to multilingual contexts, although the methodology can be adapted for other languages. Additionally, the paper references related works that have developed frameworks for hallucination evaluation in languages such as Chinese, underscoring the need for broader multilingual benchmarks.

The paper identifies potential sources of hallucinations, categorized into data-related and model-related factors. Data-related issues include the presence of unseen or limited knowledge in training datasets, which can lead to extrinsic hallucinations when models fabricate responses to queries lacking relevant information. Furthermore, contradictory or noisy training data can confuse models, resulting in both extrinsic and intrinsic hallucinations. On the model side, the architecture and training strategies of LLMs, including reinforcement learning from human feedback (RLHF), can introduce biases that contribute to hallucinations. Notably, the alignment process in RLHF may lead to a loss of previously acquired capabilities, further complicating the model’s ability to generate accurate content.