MEDAI-LLM-SUMM: a reporting checklist for medical text summarization studies using large language models

Journal: Frontiers in Digital Health, Volume: 8
DOI: https://doi.org/10.3389/fdgth.2026.1761601
PMID: https://pubmed.ncbi.nlm.nih.gov/41847286
Publication Date: 2026-03-02
Author(s): Anna N. Khoruzhaya et al.
Primary Topic: Topic Modeling

Overview

This paper, published in Frontiers in Digital Health, proposes MEDAI-LLM-SUMM, a reporting checklist for studies that use large language models (LLMs) to summarize medical text. The checklist comprises 24 items across six sections and targets reproducibility, safety assessment, and clinical relevance.

The work is motivated by a systematic review of 216 publications, in which 98% cited no established reporting standard and 89% showed a high risk of bias. The checklist was developed through a modified consensus process with a multidisciplinary expert panel, and the authors propose international validation as a next step.

Introduction

The introduction describes substantial progress in the use of large language models (LLMs) for medical text summarization, noting a marked improvement in performance between 2024 and 2025. Recent studies, including work by Van Veen et al., indicate that LLM-generated summaries of radiology reports and clinical records can match or even surpass those written by medical experts. Critical questions nevertheless remain about the testing, evaluation, safety, and clinical readiness of LLMs. A scoping review by Bednarczyk et al. characterizes the field as still exploratory: only 7% of studies performed external validation, and only 3% analyzed patient safety risks. The variability in hallucination rates, which can significantly affect clinical outcomes, further underscores the gap between laboratory performance and real-world applicability.

To address these shortcomings, the authors propose the MEDAI-LLM-SUMM checklist, which aims to improve the rigor of research on LLM-based medical text summarization by strengthening reproducibility, safety assessment, and clinical relevance. It serves three primary user groups: researchers; journal editors and peer reviewers; and regulatory bodies evaluating commercial LLM tools. The checklist is intended to fill gaps left by existing guidelines, which do not adequately address hallucinations, evaluation methodologies, or pilot testing requirements.

Methods

In this section, the authors describe how they developed a checklist covering prompt engineering, model architectures, and hallucination metrics. They chose a modified consensus-building method based on open discussion over the traditional anonymized Delphi method, citing the complexity of the task as the main reason. Decisions were made over three rounds of documented discussions, which provided transparency and allowed face-to-face interaction, fostering interdisciplinary collaboration.

The development of the checklist adhered to the general principles established by the EQUATOR Network for reporting guidelines, ensuring that the resulting tool is both rigorous and relevant to the field. This approach highlights the importance of direct communication and collaborative input in addressing intricate topics that may not be effectively captured through standard questionnaire methods.

Results

The systematic review by Vasilev et al. (2025) analyzed 216 publications and revealed significant deficiencies in reporting standards within the field. The assessment, guided by PROBAST criteria, found that 98% (211 of 216) of the studies did not reference any established reporting standard. Furthermore, 89% (192 of 216) showed a high risk of bias, largely because of insufficient documentation of prompt engineering, model versioning, and evaluation methodologies.
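The reported percentages follow directly from the counts. As a quick arithmetic check, the sketch below recomputes them and adds a Wilson 95% score interval. The interval is an assumption of this example, included only to show the uncertainty around each proportion; the summary does not say that the review reported one.

```python
import math

def proportion_with_wilson(successes: int, total: int, z: float = 1.96):
    """Return (proportion, lower, upper) using the Wilson score interval."""
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return p, centre - half, centre + half

# Counts taken from the text: 211 and 192 out of 216 studies.
no_standard = proportion_with_wilson(211, 216)
high_bias = proportion_with_wilson(192, 216)

for label, (p, lo, hi) in [
    ("No reporting standard cited", no_standard),
    ("High risk of bias", high_bias),
]:
    print(f"{label}: {p:.1%} (Wilson 95% interval {lo:.1%} to {hi:.1%})")
```

Rounded to whole percentages, the two proportions match the 98% and 89% quoted above.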

These findings point to a pervasive lack of technical detail about model configurations, prompting strategies, and evaluation protocols, which severely hampers independent verification of results. The absence of rigorous reporting standards and methodological transparency is a critical obstacle to the reproducibility and credibility of research in this domain.
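One way to make such details verifiable is to record them in a structured form alongside the results. The sketch below is purely illustrative: the class `SummarizationRunRecord`, its field names, and the example values are hypothetical and are not taken from the checklist or from any study.

```python
from dataclasses import dataclass, asdict, field
from typing import List
import json

@dataclass
class SummarizationRunRecord:
    """Hypothetical record of the details a reader would need to rerun a study."""
    model_name: str
    model_version: str
    prompt_template: str
    temperature: float
    evaluation_metrics: List[str] = field(default_factory=list)
    human_evaluation_protocol: str = ""

    def missing_fields(self) -> List[str]:
        # A field counts as missing when it is an empty string, empty list, or None.
        return [name for name, value in asdict(self).items()
                if value in ("", [], None)]

# Example values are placeholders, not real model identifiers.
record = SummarizationRunRecord(
    model_name="example-llm",
    model_version="2025-01-15",
    prompt_template="Summarize the following radiology report: {report}",
    temperature=0.0,
    evaluation_metrics=["ROUGE-L"],
)

print(json.dumps(asdict(record), indent=2))
print("Undocumented fields:", record.missing_fields())
```

Here the record flags the human evaluation protocol as undocumented, which is the kind of omission the review associates with a high risk of bias.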

Discussion

The MEDAI-LLM-SUMM checklist was developed to address significant gaps in reporting standards for medical AI research, specifically for medical text summarization. Using a modified iterative consensus approach, the authors built the checklist from a systematic literature review and input from a multidisciplinary expert panel, reducing it from 44 items to 24 essential items across six sections. The framework covers clinical validity, model selection, data quality, quality assessment, safety, and data availability, with the goal of making studies transparent and reproducible.
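A reviewer applying a checklist of this shape might tally how many items a manuscript reports in each section. The sketch below assumes the six section names listed above; the even split of four items per section and the example answers are hypothetical, since this summary does not give the real distribution of the 24 items or their wording.

```python
from typing import Dict, List

# Section names as listed in the text.
SECTIONS = [
    "clinical validity", "model selection", "data quality",
    "quality assessment", "safety", "data availability",
]

def coverage_by_section(reported: Dict[str, List[bool]]) -> Dict[str, float]:
    """Fraction of items marked as reported in each section."""
    return {name: sum(flags) / len(flags) for name, flags in reported.items()}

def overall_coverage(reported: Dict[str, List[bool]]) -> float:
    """Fraction of all items marked as reported."""
    flags = [flag for items in reported.values() for flag in items]
    return sum(flags) / len(flags)

# ASSUMPTION: four items per section, used only to reach 24 items in total.
reported = {name: [True] * 4 for name in SECTIONS}
# Hypothetical manuscript that reports only one of its safety items.
reported["safety"] = [True, False, False, False]

for name, share in coverage_by_section(reported).items():
    print(f"{name}: {share:.0%}")
print(f"overall: {overall_coverage(reported):.1%}")
```

Reporting coverage per section, and not only overall, keeps a weak section such as safety from being hidden by strong results elsewhere.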

The authors report that 33% of the checklist items are novel and tailored specifically to medical text summarization. The checklist sets strict requirements for hallucination assessment and reference summary creation, addressing critical issues identified in previous studies, such as the prevalence of inaccuracies in AI-generated summaries. The authors acknowledge methodological limitations, including potential bias from the homogeneity of the expert panel, and propose that the checklist undergo international validation to improve its applicability and robustness across diverse regulatory contexts. The initiative aims to standardize medical AI research and improve the reliability of AI applications in clinical settings.