Safety of a large language model-based clinical decision support system in African primary healthcare

Journal: Nature Health
DOI: https://doi.org/10.1038/s44360-026-00082-5
Publication Date: 2026-03-10
Author(s): Ambrose Agweyu et al.
Primary Topic: Electronic Health Records Systems

Overview

The authors present a retrospective evaluation of a clinical decision support system (CDSS) built on a large language model (LLM) and deployed in 16 primary care clinics in Kenya from July to September 2024. Trained physicians reviewed 1,469 medical records. Hallucinations, defined as incorrect outputs from the model, were infrequent: they occurred in 50 encounters (3.4%, 95% CI 2.5-4.5) and mostly involved misexpanded acronyms or drug names. The clinical management guidance provided by the LLM was consistent with local guidelines in 1,455 encounters (99%, 95% CI 98.4-99.5). However, clinicians did not amend their documentation in 917 encounters (62%, 95% CI 59.9-64.9), indicating a gap between the model's recommendations and their integration into clinical practice.
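The proportions and intervals quoted above can be approximately reproduced from the raw counts. The following is a minimal sketch, assuming a Wilson score interval; the summary does not say which interval method the authors used, so the bounds may differ from the published ones in the last digit. The function name `wilson_interval` and the dictionary labels are illustrative, not taken from the paper.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (assumed method)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Counts quoted in the overview; 1,469 records were reviewed in total.
N_REVIEWED = 1469
counts = {
    "hallucinations": 50,
    "guideline-consistent guidance": 1455,
    "documentation not amended": 917,
}

for label, k in counts.items():
    lo, hi = wilson_interval(k, N_REVIEWED)
    print(f"{label}: {100 * k / N_REVIEWED:.1f}% "
          f"(95% CI {100 * lo:.1f}-{100 * hi:.1f})")
```

An exact (Clopper-Pearson) interval is a common alternative for proportions near 0 or 1, such as the 99% guideline-consistency figure, and tends to be slightly wider.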

The safety assessment found that the LLM made actively harmful recommendations in 115 encounters (7.8%, 95% CI 6.5-9.3), and 67 of these recommendations were reflected in the final documentation. Conversely, risk identified in initial clinician notes was effectively mitigated in 118 encounters (8.0%, 95% CI 6.7-9.5), particularly in 12.1% of amended cases. The findings suggest that the LLM has significant potential to support quality improvement in healthcare delivery, but the uneven adoption of its harmful and beneficial outputs underscores the need for improved usability, local safety measures, and further prospective trials to validate patient-level benefits. The authors emphasize the critical need for a well-trained and equitably distributed healthcare workforce, particularly in sub-Saharan Africa, to ensure high-quality care and reduce preventable morbidity and mortality.
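A few derived ratios help put these counts in context. The sketch below uses only figures quoted above and assumes that every reviewed record was either amended or left unchanged, so the number of amended encounters is the difference between the two; that assumption and the helper name `share` are not from the paper.

```python
# Counts quoted in the overview.
REVIEWED = 1469         # records reviewed by physicians
NOT_AMENDED = 917       # encounters where documentation was left unchanged
HARMFUL = 115           # encounters with an actively harmful recommendation
HARMFUL_ADOPTED = 67    # harmful recommendations reflected in final documentation


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


# Assumption: reviewed records split cleanly into amended and not amended.
amended = REVIEWED - NOT_AMENDED

print(f"amended encounters: {amended}")
print(f"harmful recommendations adopted: {share(HARMFUL_ADOPTED, HARMFUL):.1f}%")
print(f"adopted harmful recommendations per amended encounter: "
      f"{share(HARMFUL_ADOPTED, amended):.1f}%")
```

Under that assumption, 552 encounters were amended, and a little over half of the harmful recommendations (67 of 115) carried through to the final documentation.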

Methods

The study was a retrospective evaluation of an LLM-based CDSS used in routine care at 16 Penda Health primary care clinics in Kenya between July and September 2024. Trained physicians reviewed 1,469 medical records. For each encounter, the review covered whether the LLM output contained hallucinations (incorrect outputs, such as misexpanded acronyms or drug names), whether its management guidance was consistent with local guidelines, whether it contained actively harmful recommendations, and whether it mitigated risk present in the clinician's initial notes. The review also recorded whether clinicians amended their documentation and whether harmful recommendations were reflected in the final documentation. Results are reported as counts and proportions with 95% confidence intervals.

Results

The AI-enabled clinical decision support system (CDSS) was used in 36,670 of 78,366 clinical consultations (46.8%) across 16 Penda Health facilities during the three-month study period. Uptake trended upward over the period and varied substantially between clinics: in July 2024, the share of consultations using the AI tool ranged from 29.3% at the Zimmerman clinic to 53.7% at Kawangware. By September 2024, nearly all clinics reported increased usage, particularly in seven well-resourced settings.

Despite the promising uptake, the study highlights critical concerns regarding the applicability of AI models in low- and middle-income countries (LMICs). Many existing models are predominantly trained on datasets that do not reflect local epidemiological conditions, resource availability, or cultural contexts, which diminishes their reliability in these settings. Furthermore, the lack of transparency regarding the training data raises questions about the generalizability of AI tools. A user-experience study in Kenya revealed that clinicians often found AI-generated recommendations to be poorly contextualized, which eroded their confidence in the technology. The World Health Organization has cautioned against the uncritical application of AI in healthcare, emphasizing the potential risks to patient safety stemming from biases in training data and the importance of local relevance in AI solutions.

Discussion

In this study, a large language model (LLM)-based clinical decision support system (CDSS) deployed in Kenyan primary care clinics showed promising results: 83% of AI-generated responses fully addressed the clinical issues and 99% prioritized advice appropriately. The system provided useful diagnostic insights and management recommendations that aligned well with local clinical guidelines and the socioeconomic context of patients. Despite the high quality of the AI outputs, clinician engagement varied substantially. In 62% of encounters, clinicians made no changes in response to the LLM's recommendations, indicating a gap between interacting with the AI and integrating it into clinical practice.

Safety concerns were noted, with 7.8% of responses containing potentially harmful recommendations, primarily related to inappropriate medications and critical differential diagnoses. Notably, harmful advice was more likely to be acted upon than beneficial guidance, raising concerns about the risks associated with LLM integration in clinical settings. The study highlights the potential of LLM-based CDSS to enhance care quality in low-resource environments, while also emphasizing the need for further research to optimize user engagement and mitigate risks associated with AI outputs. The overall cost of the LLM-based intervention was low, suggesting its feasibility for implementation in similar healthcare contexts.