Current and future state of evaluation of large language models for medical summarization tasks

Journal: npj Health Systems, Volume: 2, Issue: 1
DOI: https://doi.org/10.1038/s44401-024-00011-2
PMID: https://pubmed.ncbi.nlm.nih.gov/40124388
Publication Date: 2025-02-03
Author(s): Emma Croxford et al.
Primary Topic: Topic Modeling

Introduction

The introduction highlights the urgent need for reliable evaluation strategies in the clinical domain, particularly as Generative AI (GenAI) technologies advance faster than they can be validated. Human evaluation rubrics provide high reliability and accuracy, but they impose significant time burdens on healthcare professionals who are already under pressure. This creates a paradox: the technologies aim to reduce cognitive load, yet assessing their performance demands additional clinician time. Automated evaluations present a potential solution, but traditional methods have not consistently matched the rigor of human evaluations and often neglect critical aspects such as hallucinations, reasoning quality, and the relevance of generated content.

As large language models (LLMs) emerge as alternatives to human evaluators, it is essential to address the unique requirements of the clinical setting. Systematic reviews indicate that evaluation dimensions such as safety, bias, and information quality are paramount, given the potential consequences of incorrect information for patient outcomes. LLMs must be scrutinized for biases stemming from non-objective training data, which can lead to harmful stereotypes. Evaluating the quality of LLM-generated information in clinical contexts also requires attention to factors such as factuality, relevance, and completeness. Evaluation frameworks must therefore be tailored to prioritize these clinically relevant dimensions over traditional metrics that focus on string matching or structural similarity. An optimally designed LLM evaluator could combine the reliability of human assessments with the efficiency of automated methods, enhancing clinical safety while maintaining assessment quality.

Discussion

The discussion section of the paper outlines the methodologies and frameworks used for evaluating clinical documentation, particularly in the context of summarization and question-answering tasks. A comprehensive literature search was conducted across multiple databases, yielding 262 abstracts related to human evaluation frameworks and 95 abstracts concerning large language models (LLMs). The inclusion criteria focused on novel evaluation frameworks, clinically relevant tasks, and improvements over traditional metrics like ROUGE. The review ultimately encompassed 130 papers and reaffirms human evaluation as the gold standard, despite its resource-intensive nature.
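To make the limitation of traditional overlap metrics concrete, the following is a minimal sketch of a ROUGE-1 style unigram overlap score. It is a simplified illustration, not the official ROUGE implementation: the function name `rouge1` is hypothetical, and tokenization is assumed to be lowercase whitespace splitting with no stemming.

```python
from collections import Counter


def rouge1(reference: str, candidate: str) -> dict:
    """Simplified ROUGE-1: clipped unigram overlap between two texts.

    Assumption: lowercase whitespace tokenization, no stemming or
    stopword handling (the official toolkit offers more options).
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())

    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(cand_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Two sentences with opposite clinical meaning still share 3 of 4 tokens.
scores = rouge1("patient denies chest pain", "patient reports chest pain")
print(scores)
```

In this example the candidate reverses the clinical meaning of the reference ("denies" versus "reports"), yet three of the four tokens still match. This is the kind of gap between string matching and clinical correctness that the review describes.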

The section further elaborates on existing human evaluation rubrics, such as SaferDx, PDQI-9, and Revised-IDEA, which assess clinical documentation quality based on various criteria, including diagnostic accuracy and completeness. While these rubrics were not originally designed for LLM-generated content, they provide valuable insights into essential evaluation criteria. The discussion also identifies seven broad criteria for evaluating LLM outputs, including hallucination, omission, and fluency, and describes various analysis methods, such as binary categorizations and edit distance metrics. Despite the advantages of automated evaluations in terms of efficiency, challenges remain in capturing the nuanced requirements of clinical text, underscoring the need for a balanced approach that combines both human and automated evaluations to ensure high-quality clinical documentation.
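As an illustration of the edit distance metrics mentioned above, here is a minimal sketch of a word-level Levenshtein distance. It assumes the metric is applied between a generated summary and a clinician-corrected version, which is one plausible use rather than a procedure confirmed by the source; the function name `word_edit_distance` is hypothetical.

```python
def word_edit_distance(generated: str, corrected: str) -> int:
    """Word-level Levenshtein distance.

    Counts the minimum number of word insertions, deletions, and
    substitutions needed to turn `generated` into `corrected`.
    Assumption: whitespace tokenization, case-sensitive comparison.
    """
    a_tokens = generated.split()
    b_tokens = corrected.split()

    # prev[j] holds the distance between the processed prefix of
    # a_tokens and the first j tokens of b_tokens.
    prev = list(range(len(b_tokens) + 1))
    for i, token_a in enumerate(a_tokens, start=1):
        curr = [i]
        for j, token_b in enumerate(b_tokens, start=1):
            cost = 0 if token_a == token_b else 1
            curr.append(min(
                prev[j] + 1,         # delete token_a
                curr[j - 1] + 1,     # insert token_b
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


# A single-word dose correction registers as one edit.
print(word_edit_distance("start aspirin 81 mg daily",
                         "start aspirin 325 mg daily"))
```

A low edit count does not imply low clinical impact: a single substituted dose is one edit but may matter more than many stylistic changes, which is consistent with the call for combining automated measures with human review.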

Keywords: Automatic summarization, Artificial intelligence, Natural language, Resource (disambiguation), Engineering, Natural language generation, State (computer science), Workflow, Management science, Data science, Computer science, Programming language, Unified Medical Language System