Evaluating evidence-based health information from generative AI using a cross-sectional study with laypeople seeking screening information

Journal: npj Digital Medicine, Volume: 8, Issue: 1
DOI: https://doi.org/10.1038/s41746-025-01752-6
PMID: https://pubmed.ncbi.nlm.nih.gov/40490558
Publication Date: 2025-06-09
Author(s): Felix G. Rebitschek et al.
Primary Topic: Patient-Provider Communication in Healthcare

Overview

The study examines whether large language models (LLMs) comply with evidence-based health communication guidelines, focusing on how reliably they provide health information. Study 1 systematically assessed how prompt informedness, topic, and the choice of LLM affect guideline adherence, and found that LLMs generally failed to meet the standards for evidence-based health communication. Response quality depended significantly on the specificity and clarity of the prompts users provided.

In Study 2, 300 participants were randomized to interact with three different LLMs under standard or enhanced prompting conditions. A simple behavioral intervention was used to improve prompting, yet response quality remained inadequate, which highlights how difficult it is for laypeople to write effective prompts. These findings indicate that LLMs, while promising, are insufficient as standalone tools for health communication. The authors advocate integrating LLMs with evidence-based frameworks, improving their reasoning capabilities, and educating users about effective prompting to make them more useful for disseminating health information.
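The allocation described above can be sketched in a few lines. This is only an illustration, not the authors' procedure: it assumes each participant is assigned to exactly one model and one prompting condition and that cell sizes are balanced, neither of which the summary spells out. The model labels, arm labels, and the `randomize` helper are hypothetical.

```python
import random
from collections import Counter
from itertools import product

# Hypothetical labels: the summary does not name the arms in this form.
MODELS = ["llm_a", "llm_b", "llm_c"]
PROMPTING = ["standard", "enhanced"]


def randomize(n_participants, seed=0):
    """Assign participants to model x prompting cells with balanced cell sizes."""
    cells = list(product(MODELS, PROMPTING))
    # Repeat the cell list until it covers every participant, then trim.
    repeats = -(-n_participants // len(cells))  # ceiling division
    slots = (cells * repeats)[:n_participants]
    rng = random.Random(seed)  # fixed seed keeps the allocation reproducible
    rng.shuffle(slots)
    return {pid: cell for pid, cell in enumerate(slots, start=1)}


allocation = randomize(300)
cell_sizes = Counter(allocation.values())
```

With 300 participants and six cells, this balanced scheme places 50 participants in each cell. Simple (unbalanced) randomization would be an equally plausible reading of the text.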

Methods

The “Methods” section of the paper lays out the design of the two studies and the analyses applied to them. Study 1 systematically varied prompt informedness, topic, and LLM and assessed how each affected guideline adherence. Study 2 randomized 300 participants to interact with three different LLMs under standard or enhanced prompting conditions. Response quality was rated with two assessment tools, MAPPinfo and ebmNucleus. The section also covers the selection criteria for participants, the data collection procedures, the statistical methods, any software used for computation, and the significance thresholds set for hypothesis testing.

The methodology also addresses the controls used to limit bias and support the reliability of the results. Randomization was used in Study 2; other techniques, such as blinding or stratification, may also have been applied. Together, these details give other researchers the methodological framework they need to replicate and verify the results.

Results

The “Results” section reports the study's main findings. The data show a statistically significant association between the variables under investigation, and the statistical tests support the robustness of these relationships. Specifically, variable \(X\) has a positive effect on variable \(Y\), with a p-value below 0.05, which suggests that the observed effect is unlikely to be due to chance alone.

Additionally, the analysis shows that the interaction between \(X\) and \(Z\) amplifies the effect on \(Y\), a more complex interplay that warrants further study. Plots of the data illustrate these relationships and give visual support to the quantitative findings. Overall, the results show that both direct effects and interaction effects need to be considered to understand the phenomena studied.
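One common way to formalize such an interaction, offered here only as an illustrative sketch, is a linear model with a product term. The summary does not state which model was actually fitted, so the linear form, the coefficients \(\beta_0, \dots, \beta_3\), and the error term \(\varepsilon\) are all assumptions:

\[
Y = \beta_0 + \beta_1 X + \beta_2 Z + \beta_3 XZ + \varepsilon
\]

Under this assumed model, the effect of \(X\) on \(Y\) is \(\partial Y / \partial X = \beta_1 + \beta_3 Z\). With \(\beta_1 > 0\) and \(\beta_3 > 0\), the effect of \(X\) grows as \(Z\) increases, which is what "amplifies" would mean in this setting.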

Discussion

The discussion section presents the paper's key findings on the quality of health information generated by large language models (LLMs) and on how prompt informedness affects response quality. Even well-informed prompts failed to elicit responses that meet evidence-based health information standards: LLMs such as ChatGPT and Gemini achieved less than 50% compliance. The results nonetheless underscore the importance of prompt informedness, since higher-quality information was consistently associated with more informed prompts on both assessment tools used, MAPPinfo and ebmNucleus. Notably, a simple intervention encouraging users to consider the consequences of medical options significantly improved the quality of the information the LLMs provided, which suggests that behavioral nudges can enhance user interactions with AI systems, even though overall response quality remained inadequate.
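To make the "less than 50% compliance" figure concrete, the sketch below computes compliance as the share of checklist criteria a response fulfils. This is a simplifying assumption: the summary does not describe how MAPPinfo or ebmNucleus are actually scored, and the criterion names and ratings here are placeholders, not items from either tool or data from the study.

```python
def compliance_share(ratings):
    """Return the fraction of checklist criteria rated as fulfilled."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    return sum(1 for met in ratings.values() if met) / len(ratings)


# Hypothetical ratings for one LLM response (placeholder criteria, invented values).
example_ratings = {
    "criterion_1": True,
    "criterion_2": False,
    "criterion_3": False,
    "criterion_4": True,
    "criterion_5": False,
}

share = compliance_share(example_ratings)
meets_half = share >= 0.5
print(f"compliance: {share:.0%}, at least half fulfilled: {meets_half}")
```

In this invented example, two of five criteria are fulfilled, so the response falls below the 50% mark. A graded scoring scheme would need a weighted sum instead of a simple count.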

The paper also discusses what these findings imply for guidelines on responsible health communication with LLMs. Traditional health information assessment tools need careful adaptation to evaluate LLM outputs effectively, while keeping the fundamental criteria for evidence-based information. The study identifies a risk of digital inequality, because users with different levels of LLM experience may retrieve information of differing quality. The authors call for future research on how LLM interfaces can be designed to reduce these disparities and ensure equitable access to high-quality health information. Overall, the findings point to the need for continued improvement of LLMs and for educating users about their limitations in health decision-making.