Principles and Guidelines for the Use of LLM Judges

Journal: Proceedings of the 2025 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR)
DOI: https://doi.org/10.1145/3731120.3744588
Publication Date: 2025-07-18
Author(s): Laura Dietz et al.
Primary Topic: Information retrieval evaluation

Overview

The paper discusses the increasing reliance on Large Language Models (LLMs) for evaluating information retrieval (IR) systems, which have traditionally been assessed by human judges. While empirical studies indicate that LLM evaluations often align with human judgments, concerns remain about the reliability and validity of these assessments. The authors caution that as LLM-generated evaluations become integrated into IR system development, there is a risk of self-reinforcing biases and misleading conclusions. Key risks identified include bias reinforcement, reproducibility challenges, and inconsistencies in assessment methodologies. To mitigate these issues, the paper proposes tests that quantify the adverse effects, guardrails against them, and a collaborative framework for developing reusable test collections that responsibly incorporate LLM judgments.

In the conclusions, the authors emphasize the need for best practices to ensure that LLM-based evaluations maintain scientific rigor. They advocate for cross-validation of LLM evaluations with human-verified benchmarks and recommend treating automatic evaluation methodologies as validation tools rather than definitive measures of performance. The paper introduces a TREC-style Coopetition to annually identify effective LLM-evaluation approaches for state-of-the-art IR systems, ensuring the use of the best technology while continuously validating LLM evaluations against manual judgments. The authors outline conditions under which LLM-based judgments can be utilized in research, emphasizing the importance of recent validation against real user judgments, prevention of influence on system development, and thorough discussion of potential biases. This framework aims to ensure that LLM evaluations remain trustworthy, reproducible, and scientifically grounded, ultimately enhancing the utility of IR systems for human users.
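One simple way to picture the cross-validation step described above is a chance-corrected agreement check between LLM labels and human labels on the same topic-document pairs. The sketch below is illustrative only: the choice of Cohen's kappa, the function name `cohens_kappa`, and the toy labels are assumptions made here, not a procedure taken from the paper.

```python
from collections import Counter


def cohens_kappa(human, llm):
    """Cohen's kappa between two equal-length lists of relevance labels."""
    if len(human) != len(llm) or not human:
        raise ValueError("label lists must be non-empty and equally long")
    n = len(human)
    # Fraction of pairs on which the two label sources agree.
    observed = sum(h == m for h, m in zip(human, llm)) / n
    # Agreement expected by chance, from each source's label frequencies.
    human_counts = Counter(human)
    llm_counts = Counter(llm)
    expected = sum(human_counts[c] * llm_counts[c] for c in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both sources used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical binary relevance labels for eight topic-document pairs.
human_labels = [1, 0, 1, 1, 0, 0, 1, 0]
llm_labels = [1, 0, 1, 0, 0, 0, 1, 1]
kappa = cohens_kappa(human_labels, llm_labels)
print(f"kappa = {kappa:.2f}")
```

Label-level agreement is only one possible check; it says nothing about whether the ranking of systems is preserved, which has to be tested separately.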

Introduction

The introduction discusses the growing role of Large Language Models (LLMs) in evaluating information retrieval (IR) systems, particularly in generating relevance judgments that were traditionally made by human assessors. The authors highlight the challenge posed by the vast number of topic-document pairs: judging all of them is infeasible, so pooling methods are used to select a manageable subset for assessment. LLMs present a promising alternative, with the potential to scale relevance assessment well beyond what human assessors can cover. However, the authors caution against over-reliance on LLMs, referencing Soboroff’s critique that evaluating with LLM-generated judgments may set an artificial ceiling on the performance that can be measured.
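The pooling idea mentioned above can be sketched as depth-k pooling: for each topic, only the union of the top-k documents from every participating run is sent for judgment. This is a minimal illustration under assumed names (`depth_k_pool`, the `runs` layout, and the toy document IDs are hypothetical), not the specific pooling procedure used in any particular evaluation campaign.

```python
def depth_k_pool(runs, k):
    """Return, per topic, the union of the top-k documents across all runs.

    runs: dict mapping run name -> dict mapping topic id -> ranked list of doc ids
    """
    pool = {}
    for ranking_by_topic in runs.values():
        for topic, ranking in ranking_by_topic.items():
            # Only the first k documents of each ranking enter the pool.
            pool.setdefault(topic, set()).update(ranking[:k])
    return pool


# Two hypothetical runs over a single topic.
runs = {
    "bm25": {"t1": ["d1", "d2", "d3", "d4"]},
    "dense": {"t1": ["d3", "d5", "d1", "d6"]},
}
pool = depth_k_pool(runs, k=2)
print(sorted(pool["t1"]))
```

Documents outside the pool (here `d4` and `d6`) stay unjudged, which is the gap that LLM-based labeling is often proposed to fill.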

The paper aims to strike a balance by identifying 13 ways in which LLMs can adversely affect evaluation processes and proposing strategies to mitigate these impacts. A key advantage of LLMs is their speed in generating relevance labels, which reduces the time and cost associated with human annotation. This efficiency enables the assessment of larger datasets and a broader range of retrieval tasks, leading to their rapid integration into evaluation pipelines, as exemplified by Microsoft’s use of OpenAI’s GPT models for relevance assessment in Bing. Additionally, the introduction of tools like Umbrela, which utilizes proprietary LLMs for labeling unjudged documents, further underscores the potential for LLMs to replace human assessors in certain contexts.

Discussion

The discussion section of the paper addresses the challenges and implications of using large language models (LLMs) for information retrieval (IR) evaluation. While empirical evidence supports the effectiveness of LLM-based judgments, concerns about their reliability, validity, and potential biases persist. The authors highlight the inadequacy of traditional pooled judgment methodologies in the context of generative and multi-modal systems, emphasizing the need for a principled framework to guide the responsible adoption of LLMs in evaluation studies. They introduce the concept of “LLM Evaluation Tropes,” a taxonomy of recurring issues that can undermine the validity of LLM-based evaluations, such as circularity, LLM narcissism, and loss of variety of opinion.

The paper advocates for the establishment of best practices to ensure that LLM-based evaluations remain valid and reliable. It underscores the importance of integrating human judgments to counteract biases and maintain scientific rigor. The authors propose guardrails to mitigate the identified tropes, emphasizing the necessity of continuous validation of LLM evaluators against human assessments. They call for the development of novel evaluation paradigms that uphold the validity of experimental results, alongside the creation of reusable test collections that minimize the need for additional human assessments. Overall, the paper aims to foster a more robust framework for evaluating IR systems in the evolving landscape of LLMs.
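A common way to make the continuous validation of LLM evaluators concrete is to check whether LLM-based judgments rank systems in the same order as human judgments. The sketch below computes Kendall's tau (the tau-a variant, where tied pairs count as neither concordant nor discordant) over per-system scores. The function name, the system names, and the scores are hypothetical, and the paper may recommend different or additional statistics.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two dicts mapping system name -> effectiveness score."""
    systems = sorted(set(scores_a) & set(scores_b))
    if len(systems) < 2:
        raise ValueError("need at least two systems scored under both conditions")
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        # Positive product: both score sets order the pair the same way.
        product = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    total_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical mean scores for four systems under human and LLM judgments.
human_scores = {"A": 0.42, "B": 0.38, "C": 0.31, "D": 0.25}
llm_scores = {"A": 0.45, "B": 0.47, "C": 0.33, "D": 0.20}
tau = kendall_tau(human_scores, llm_scores)
print(f"tau = {tau:.3f}")
```

In this toy data the LLM judgments swap systems A and B, so one of six pairs is discordant. A high overall correlation can still hide swaps among the top systems, which is where such disagreements matter most.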