Evaluation of validity, reliability, and readability of AI chatbots for gestational diabetes mellitus: a multi-model comparative study

Journal: Frontiers in Public Health, Volume: 14
DOI: https://doi.org/10.3389/fpubh.2026.1760871
PMID: https://pubmed.ncbi.nlm.nih.gov/41717624
Publication Date: 2026-02-04
Author(s): Zhenyun Du et al.
Primary Topic: Mobile Health and mHealth Applications

Overview

This study evaluates how well six AI chatbots provide information on gestational diabetes mellitus (GDM), a condition with significant health implications for mothers and their offspring. The researchers assessed ChatGPT-5, ChatGPT-4o, DeepSeek-V3.2, DeepSeek-R1, Gemini 2.5 Pro, and Claude Sonnet 4.5 using 200 multiple-choice questions (MCQs) covering the epidemiology, clinical manifestations, outcomes, and management of GDM. Accuracy differed significantly among the chatbots (p < 0.0001), with ChatGPT-5 achieving the highest mean accuracy, 92.17%. Newer models consistently outperformed their predecessors, particularly in diagnostic validity. For public-facing educational content, ChatGPT-5 also had the highest reliability scores, although all models showed low transparency and produced text above the recommended sixth-grade reading level. The findings suggest that contemporary AI chatbots can provide generally accurate and moderately reliable information on GDM but are not yet suitable as standalone tools for patient education. The study highlights the need for future chatbots to improve transparency, sourcing, and readability to better serve diverse patient populations.

Introduction

The introduction addresses the rising prevalence of gestational diabetes mellitus (GDM), which is characterized by glucose intolerance during pregnancy. Recent meta-analyses indicate a global prevalence of approximately 14.0%, with significant regional variation influenced by factors such as maternal age and obesity. GDM is associated with adverse outcomes including preeclampsia, cesarean delivery, and long-term health risks for both mothers and their offspring, which underlines the need for accurate diagnosis and effective management.

The section also highlights the role of the internet as a primary source of health information for pregnant women, particularly those with GDM, who often use online resources for self-management. The quality of this information, however, is inconsistent. AI-driven chatbots and large language models (LLMs) open new opportunities for health communication, although evaluations so far report mixed reliability and readability problems. The study aims to systematically assess four AI chatbots (ChatGPT-5, DeepSeek-V3.2, Claude Sonnet 4.5, and Gemini 2.5 Pro) on their GDM-related content. It evaluates the validity of their responses to standardized questions, their reliability in answering public education queries derived from Google Trends, and the readability of their outputs against established benchmarks.

Methods

Validity was assessed with 200 multiple-choice questions (MCQs) covering the epidemiology, clinical manifestations, outcomes, and management of GDM. Each chatbot's answers were scored for accuracy, and accuracy was compared statistically across chatbots.

For public-facing educational content, the chatbots answered public education queries derived from Google Trends. The reliability of these responses was rated on the DISCERN scale, and their transparency was judged against the JAMA benchmarks. Readability was compared with established benchmarks, including the recommended sixth-grade reading level.
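As an illustration of how MCQ-based validity could be scored, the following Python sketch computes accuracy against an answer key and averages it over several runs. The answer key, the responses, and the function names `accuracy` and `mean_accuracy` are hypothetical. The summary does not state how the study derived its mean accuracy, so averaging over repeated runs of the same question set is an assumption.

```python
from statistics import mean


def accuracy(responses, answer_key):
    """Return the percentage of responses that match the answer key."""
    if len(responses) != len(answer_key):
        raise ValueError("responses and answer_key must have the same length")
    # Compare case-insensitively and ignore surrounding whitespace.
    correct = sum(
        r.strip().upper() == k.strip().upper()
        for r, k in zip(responses, answer_key)
    )
    return 100.0 * correct / len(answer_key)


def mean_accuracy(runs, answer_key):
    """Average the accuracy over repeated runs (assumed study design)."""
    return mean(accuracy(run, answer_key) for run in runs)


# Hypothetical data: 5 questions answered in 3 repeated runs.
answer_key = ["A", "C", "B", "D", "A"]
runs = [
    ["A", "C", "B", "D", "A"],  # 5 of 5 correct
    ["A", "C", "B", "A", "A"],  # 4 of 5 correct
    ["a", "C", "D", "D", "A"],  # 4 of 5 correct (case is ignored)
]

print(round(mean_accuracy(runs, answer_key), 2))  # prints 86.67
```

The same two functions would apply unchanged to a 200-item answer key; only the input lists grow.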

Results

Accuracy on the MCQs differed significantly among the chatbots (p < 0.0001). ChatGPT-5 achieved the highest mean accuracy, 92.17%, and newer models consistently outperformed their predecessors, particularly in diagnostic validity.

For public-facing educational content, ChatGPT-5 had the highest reliability scores, but no model reached an “excellent” rating on the DISCERN scale, and all showed low transparency against the JAMA benchmarks. ChatGPT-5 also produced the most accessible text, yet every model generated content above the recommended sixth-grade reading level.

Discussion

This study assesses the performance of four AI chatbots (ChatGPT-5, DeepSeek-V3.2, Claude Sonnet 4.5, and Gemini 2.5 Pro) in responding to standardized questions on gestational diabetes mellitus (GDM). The evaluation focused on the diagnostic validity, reliability, and readability of the chatbots' outputs. ChatGPT-5 consistently outperformed the other models in accuracy across clinical domains, achieving the highest mean scores on GDM-related multiple-choice questions (MCQs). Iterative model updates were associated with improved accuracy, with newer versions outperforming their predecessors. In the reliability assessments, ChatGPT-5 provided the most reliable content, but no model reached an “excellent” rating on the DISCERN scale, and all showed low transparency according to the JAMA benchmarks.

On readability, ChatGPT-5 produced the most accessible text, yet all models failed to meet the recommended sixth-grade reading level and often generated content that requires a higher educational background. This points to a significant gap in the accessibility of AI-generated health information, which may hinder effective patient engagement, particularly for individuals with limited health literacy. The study underscores the need for continued improvement so that chatbot outputs are not only accurate and reliable but also comprehensible to the general public, especially for managing conditions such as GDM.
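The summary does not name the readability formula used. As a minimal sketch, assuming the Flesch-Kincaid Grade Level (a common choice for health materials), a chatbot response could be checked against the sixth-grade target as follows. The syllable counter is a rough heuristic, and `count_syllables`, `flesch_kincaid_grade`, `TARGET_GRADE`, and the sample texts are illustrative, not taken from the study.

```python
import re

TARGET_GRADE = 6  # recommended sixth-grade reading level


def count_syllables(word):
    """Rough heuristic: count vowel groups, discounting a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text):
    """Flesch-Kincaid Grade Level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


# Hypothetical sample texts, written for this illustration only.
plain = "Eat well. Walk each day. Check your sugar."
dense = ("Gestational diabetes mellitus necessitates individualized "
         "nutritional intervention and continuous glycemic surveillance.")

for label, text in (("plain", plain), ("dense", dense)):
    grade = flesch_kincaid_grade(text)
    verdict = "meets" if grade <= TARGET_GRADE else "exceeds"
    print(f"{label}: grade {grade:.1f} {verdict} the sixth-grade target")
```

Because the syllable count is approximate, scores from this sketch will differ somewhat from those of dedicated readability tools; it is meant only to show how sentence length and word length drive the grade level.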
