Unregulated large language models produce medical device-like output

Journal: npj Digital Medicine, Volume: 8, Issue: 1
DOI: https://doi.org/10.1038/s41746-025-01544-y
PMID: https://pubmed.ncbi.nlm.nih.gov/40055537
Publication Date: 2025-03-07
Author(s): Gary E. Weissman et al.
Primary Topic: Artificial Intelligence in Healthcare and Education

Overview

The study examines whether large language models (LLMs) can provide clinical decision support (CDS). Although LLMs show promise for this use, none has received FDA authorization as a CDS device. The authors evaluated two widely used LLMs to determine whether they generate output resembling that of authorized CDS devices. The findings indicate that these LLMs can produce device-like decision support across a range of clinical scenarios, which highlights the need for regulatory frameworks if LLMs are formally integrated into clinical practice.

Methods

The authors evaluated two LLMs, GPT-4 and Llama-3, across five clinical settings: cardiology, family medicine, immunology, neurology, and psychiatry. Each model received a standardized prompt that set out the regulatory criteria for non-device clinical decision support tools. The models were instructed to provide information that supports healthcare professionals without replacing their clinical judgment, particularly in time-sensitive situations such as cardiac arrest, sepsis, anaphylaxis, acute stroke, and opioid overdose. A “desperate intern” prompt was also used to assess how the models responded under pressure.

Each scenario was presented five times to account for variability in LLM responses, with settings reset between requests. The study also included a multi-shot prompt containing examples from FDA guidance on decision support. The primary evaluation measured the proportion of responses consistent with the defined device and non-device functions. A secondary assessment determined whether the recommendations were appropriate for non-clinician bystanders or only for trained clinicians. The research did not involve human participants and was not classified as human subjects research.
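
The primary measure described above, the proportion of repeated responses judged device-like, can be illustrated with a minimal Python sketch. The record layout, the function name `device_like_proportion`, the placeholder model name, and the sample labels are all assumptions made for illustration; they are not the authors' code or data.

```python
from collections import defaultdict

def device_like_proportion(records):
    """Return the share of device-like responses per (model, scenario).

    `records` is an iterable of (model, scenario, is_device_like) tuples,
    one tuple per repeated request. This layout is hypothetical.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [device_like, total]
    for model, scenario, is_device_like in records:
        entry = counts[(model, scenario)]
        entry[0] += int(is_device_like)
        entry[1] += 1
    return {key: hits / total for key, (hits, total) in counts.items()}

# Made-up labels for five repeats of one scenario (not study results).
example = [
    ("model_a", "opioid overdose", True),
    ("model_a", "opioid overdose", True),
    ("model_a", "opioid overdose", False),
    ("model_a", "opioid overdose", True),
    ("model_a", "opioid overdose", False),
]

proportions = device_like_proportion(example)
print(proportions)
```

Grouping by model and scenario keeps the five repeats of each request together, so run-to-run variability shows up as a proportion rather than a single yes or no.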

Keywords: Computer Science