Benchmarking the diagnostic performance of open source LLMs in 1933 Eurorad case reports

Journal: npj Digital Medicine, Volume: 8, Issue: 1
DOI: https://doi.org/10.1038/s41746-025-01488-3
PMID: https://pubmed.ncbi.nlm.nih.gov/39934372
Publication Date: 2025-02-12
Author(s): Su Hwan Kim et al.
Primary Topic: Artificial Intelligence in Healthcare and Education

Overview

Recent advances in large language models (LLMs) have made them increasingly useful for supporting radiological diagnosis. This study assessed the diagnostic performance of fifteen open-source LLMs alongside one closed-source model, GPT-4o, on a dataset of 1,933 cases from the Eurorad library. Each model generated differential diagnoses from the clinical history and imaging findings, and a response was counted as correct if the true diagnosis appeared among its top three suggestions.
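The top-three criterion described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's code: the names `top_k_accuracy`, `exact_match`, and `toy_cases` are hypothetical, the cases are invented, and normalized string equality is only a stand-in for the matching step (the study used an LLM judge to decide whether a suggestion matched the true diagnosis).

```python
from typing import Callable, Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def exact_match(truth: str, suggestion: str) -> bool:
    """Stand-in matcher (assumption): string equality after normalization."""
    return normalize(truth) == normalize(suggestion)


def top_k_accuracy(
    cases: Sequence[tuple[str, Sequence[str]]],
    k: int = 3,
    is_match: Callable[[str, str], bool] = exact_match,
) -> float:
    """Fraction of cases whose true diagnosis appears among the first k suggestions."""
    if not cases:
        raise ValueError("cases must not be empty")
    hits = sum(
        any(is_match(truth, s) for s in suggestions[:k])
        for truth, suggestions in cases
    )
    return hits / len(cases)


# Hypothetical toy cases, not taken from the Eurorad dataset.
toy_cases = [
    # True diagnosis at rank 2: counts as a hit.
    ("Glioblastoma", ["Metastasis", "glioblastoma", "Abscess"]),
    # True diagnosis absent: miss.
    ("Osteoid osteoma", ["Stress fracture", "Osteomyelitis", "Brodie abscess"]),
    # True diagnosis at rank 1: hit.
    ("Acute appendicitis", ["Acute appendicitis", "Crohn disease", "Mesenteric adenitis"]),
    # True diagnosis at rank 4: miss when k=3.
    ("Meningioma", ["Schwannoma", "Metastasis", "Lymphoma", "Meningioma"]),
]

print(top_k_accuracy(toy_cases))  # 0.5
```

Passing the matcher in as a parameter keeps the metric separate from the question of how a match is decided, so a stricter or more lenient judge could be swapped in without changing the accuracy computation.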

The models were also evaluated on 60 non-public brain MRI cases from a tertiary hospital to test their generalizability. GPT-4o outperformed the other models, with Llama-3-70B closely following. These findings suggest that open-source LLMs are rapidly approaching the performance of proprietary models, highlighting their potential as decision-support tools for radiological differential diagnosis in complex, real-world scenarios.

Methods

The Ethics Committee of the Technical University of Munich waived the requirement for informed consent because the study was retrospective and used only publicly available data and de-identified local data. The analysis was deemed to pose minimal risk to participants and therefore proceeded without individual consent.

Results

GPT-4o achieved the highest diagnostic accuracy (79.6%), followed closely by Meta’s Llama-3-70B (73.2%). Accuracy varied across radiological subspecialties and was higher in interventional and cardiovascular imaging than in breast and musculoskeletal imaging.

Model size and diagnostic accuracy showed a moderate positive correlation (Pearson coefficient of 0.54), although some smaller models outperformed larger ones. Used as an automated judge, Llama-3-70B reached 87.8% accuracy against human expert evaluations in a subset of cases.

Discussion

In this study, the diagnostic performance of fifteen leading open-source large language models (LLMs) was evaluated on a dataset of 1,933 challenging radiology case reports from the Eurorad library. A filtering step excluded cases in which the diagnosis was explicitly stated, so the remaining cases required inference and provided a more rigorous assessment of the models’ capabilities. The dataset consisted predominantly of neuroradiology, abdominal imaging, and musculoskeletal imaging cases, reflecting the prevalence of these subspecialties in clinical practice. Llama-3-70B served as an automated judge of whether model responses were correct; in a subset of cases, its judgments were 87.8% accurate when compared with human expert evaluations.
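Validating an automated judge against human raters amounts to measuring how often the two agree on the same responses. The sketch below shows that calculation under simple assumptions: each response receives a binary correct/incorrect label from both the judge and a human expert, and the human label is treated as the reference. The function name `judge_accuracy` and the labels are hypothetical and are not taken from the study.

```python
from typing import Sequence


def judge_accuracy(judge_labels: Sequence[bool], human_labels: Sequence[bool]) -> float:
    """Share of responses where the automated judge's label equals the human label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("both raters must label the same responses")
    if not judge_labels:
        raise ValueError("at least one labeled response is required")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Hypothetical labels for five model responses (True = judged correct).
judge = [True, True, False, True, False]
human = [True, False, False, True, False]

print(judge_accuracy(judge, human))  # 0.8
```

A single agreement rate does not show whether the judge tends to be too lenient or too strict; counting the two kinds of disagreement separately would be a natural extension.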

GPT-4o outperformed all open-source models with a diagnostic accuracy of 79.6%, while Meta’s Llama-3-70B followed closely at 73.2%. Notably, the performance of these models was comparable to that of experienced radiologists, particularly on the local brain MRI dataset. Accuracy also varied across radiological subspecialties and was higher in interventional and cardiovascular imaging than in breast and musculoskeletal imaging. Furthermore, a moderate positive correlation (Pearson coefficient of 0.54) was observed between model size and diagnostic accuracy, although some smaller models outperformed larger counterparts, challenging the assumption that larger models inherently yield better results. The findings underscore the potential of open-source LLMs as decision-support tools in radiology, while also highlighting the need to explore how they can be integrated into clinical workflows and to address the associated technical, economic, and regulatory challenges.
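The Pearson coefficient mentioned above measures the strength of a linear relationship between two variables. The following sketch computes it from first principles for illustration only. The function name `pearson_r` is hypothetical, the sizes and accuracies are invented rather than taken from the study, and using raw parameter counts in billions (rather than, for example, their logarithm) is an assumption.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical model sizes (billions of parameters) and accuracies, not study data.
sizes_b = [7, 8, 13, 34, 70]
accuracies = [0.55, 0.62, 0.58, 0.60, 0.71]

r = pearson_r(sizes_b, accuracies)
print(f"Pearson r = {r:.2f}")
```

In the invented data the 8B model scores higher than the 13B and 34B models while the overall trend remains positive, which mirrors the pattern described above: a positive correlation does not mean every larger model beats every smaller one.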