Comparative evaluation of responses from DeepSeek-R1, ChatGPT-o1, ChatGPT-4, and dental GPT chatbots to patient inquiries about dental and maxillofacial prostheses

Journal: BMC Oral Health, Volume: 25, Issue: 1
DOI: https://doi.org/10.1186/s12903-025-06267-w
PMID: https://pubmed.ncbi.nlm.nih.gov/40450291
Publication Date: 2025-05-31
Author(s): Tuğgen Özcivelek et al.
Primary Topic: Artificial Intelligence in Healthcare and Education

Overview

This in silico study evaluated the performance of four AI chatbots (DeepSeek-R1, ChatGPT-o1, ChatGPT-4, and Dental GPT) in answering 35 frequently asked patient questions about dental and maxillofacial prostheses. The assessment covered the accuracy, quality, readability, understandability, and actionability of the responses; two prosthodontists rated them using several instruments, including a five-point Likert scale and the Patient Education Materials Assessment Tool. Statistical analyses, including the Kruskal-Wallis test and one-way ANOVA, revealed significant differences in accuracy and readability among the chatbots (p < 0.05). Dental GPT achieved the highest accuracy score and ChatGPT-4 the lowest. Readability followed a different pattern: DeepSeek-R1 scored highest, whereas Dental GPT scored lowest. Although the responses were of high overall quality and flow, the study highlighted the need for improved readability, particularly for Dental GPT. The results underscore the importance of monitoring the accuracy and readability of chatbot responses in healthcare settings to mitigate the risks of misinformation. Dental GPT, a domain-specific tool, and ChatGPT-o1 were more accurate than DeepSeek-R1 and ChatGPT-4, which points to the need for careful selection of chatbots for patient guidance.

Introduction

The introduction highlights the growing proportion of people aged 60 and older, who often face oral health challenges such as edentulism, which has a global prevalence of 22.7% in this age group. Tooth loss can lead to significant health problems, including pain, aspiration pneumonia, and nutritional deficiencies, and ultimately reduces quality of life. The section emphasizes the importance of prosthetic solutions, such as intraoral and extraoral maxillofacial prostheses, and the need for proper maintenance and access to dental care, particularly for geriatric patients who may struggle with mobility and healthcare access.

The introduction further discusses the role of artificial intelligence (AI) and generative pre-trained transformers (GPT) in providing accessible information about dental and maxillofacial prostheses. It outlines the evolution of AI chatbots, including ChatGPT-4 and Dental GPT, which are designed to assist patients and caregivers by offering reliable information. The study aims to evaluate and compare the responses of several AI chatbots regarding dental prostheses, focusing on accuracy, quality, readability, understandability, and actionability. The null hypothesis stated that there would be no significant differences among the chatbots in these parameters and that their responses would not align adequately with the existing literature.

Methods

In this study, an in silico approach was employed to explore common queries regarding dental and maxillofacial prostheses, eliminating the need for Institutional Review Board approval due to the absence of human or animal subjects. The researchers posed the question, “Which questions do people ask about dental and maxillofacial prostheses?” to four AI chatbots: DeepSeek-R1, ChatGPT-o1, ChatGPT-4, and Dental GPT. A total of 135 questions were initially retrieved, from which 35 refined questions were selected based on criteria such as content similarity and relevance to patient concerns, validated by two prosthodontists. These questions were categorized into themes including general inquiries, treatment procedures, care, and aesthetic concerns.

Responses from the chatbots were collected and rated by two prosthodontists for accuracy, quality, understandability, and actionability, utilizing a five-point Likert scale for accuracy and the Global Quality Scale (GQS) for quality assessment. Readability was evaluated using the Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL) scores, calculated through specific formulas to determine text difficulty. The Patient Education Materials Assessment Tool (PEMAT-P) was also employed to assess the understandability and actionability of the responses, ensuring that the materials met acceptable comprehension standards for patient education.
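As a rough illustration of how a PEMAT-style percentage is derived: each item is rated agree (1), disagree (0), or not applicable, and the score is the share of applicable items rated agree. The sketch below assumes that scoring rule. The ratings are made-up values, not data from the study, and `pemat_score` is a hypothetical helper name.

```python
from typing import Optional, Sequence


def pemat_score(ratings: Sequence[Optional[int]]) -> float:
    """Return the percentage of applicable items rated 'agree'.

    Each rating is 1 (agree), 0 (disagree), or None (not applicable).
    """
    applicable = [r for r in ratings if r is not None]
    if not applicable:
        raise ValueError("at least one applicable item is required")
    if any(r not in (0, 1) for r in applicable):
        raise ValueError("ratings must be 0, 1, or None")
    return 100.0 * sum(applicable) / len(applicable)


# Hypothetical actionability ratings for one chatbot response (not study data).
example_actionability = [1, 1, 0, 0, None, 1, None]
print(f"Actionability: {pemat_score(example_actionability):.1f}%")
```

Excluding not-applicable items from the denominator keeps scores comparable across responses for which different items apply.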

Results

The results show significant differences among the chatbots in accuracy (p < 0.001, reported as 0.000) and in readability (p = 0.034). Specifically, ChatGPT-o1 and Dental GPT outperformed DeepSeek-R1 and ChatGPT-4 in accuracy scores (p < 0.001). All chatbots achieved a median Global Quality Scale (GQS) score of 5, indicating high quality and flow and general usefulness for patients, with no significant difference among them (p = 0.191). The median understandability and actionability scores ranged from 87.81 to 90.83 and from 40 to 60, respectively, with no significant differences in the PEMAT-P scores (p = 0.645 and p = 0.082, respectively).
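A Kruskal-Wallis comparison of the kind used for the accuracy ratings can be sketched as follows. This is a minimal pure-Python illustration that assumes made-up Likert ratings rather than the study's data; `kruskal_wallis_h` is a hypothetical helper, and this summary does not state which software the authors used. With four groups, H is compared against the chi-square distribution with 3 degrees of freedom, whose critical value at the 5% level is 7.815.

```python
from collections import Counter
from typing import Dict, Sequence

CHI2_CRIT_DF3_ALPHA05 = 7.815  # chi-square critical value, df = 3, alpha = 0.05


def kruskal_wallis_h(groups: Sequence[Sequence[float]]) -> float:
    """Kruskal-Wallis H statistic with the standard tie correction."""
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    # Give each distinct value the average of the ranks it occupies.
    rank_of: Dict[float, float] = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        i = j
    rank_term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12.0 / (n * (n + 1)) * rank_term - 3.0 * (n + 1)
    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n).
    ties = sum(t**3 - t for t in Counter(pooled).values())
    correction = 1.0 - ties / (n**3 - n)
    if correction == 0:
        raise ValueError("all values are identical")
    return h / correction


# Hypothetical five-point accuracy ratings per chatbot (not study data).
ratings = {
    "DeepSeek-R1": [4, 4, 3, 4, 5, 3],
    "ChatGPT-o1": [5, 5, 4, 5, 5, 4],
    "ChatGPT-4": [3, 4, 3, 3, 4, 3],
    "Dental GPT": [5, 5, 5, 4, 5, 5],
}
h_stat = kruskal_wallis_h(list(ratings.values()))
print(f"H = {h_stat:.3f}, significant at 0.05: {h_stat > CHI2_CRIT_DF3_ALPHA05}")
```

A rank-based test suits ordinal Likert ratings because it does not assume normally distributed scores; the tie correction matters here since a five-point scale produces many tied values.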

In terms of readability, as illustrated in Figure 1, DeepSeek-R1 had the highest median Flesch Reading Ease (FRE) score, while Dental GPT scored the lowest. Post hoc analysis confirmed that Dental GPT’s readability was significantly inferior to that of the other models (p = 0.034). The Flesch-Kincaid Grade Level (FKGL) values for all chatbots ranged from 9.42 to 10.70, showing no significant differences, and the estimated educational levels required for comprehension were at least at the eighth to ninth-grade level.
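For reference, both readability indices are computed from the average sentence length and the average number of syllables per word. The sketch below uses the standard published formulas; the syllable counter is a crude heuristic introduced here for illustration, so its output will differ somewhat from dedicated readability tools, and the tool used in the study is not named in this summary. The function names and sample text are hypothetical.

```python
import re
from typing import Tuple


def count_syllables(word: str) -> int:
    """Crude heuristic: count vowel groups, discounting a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability(text: str) -> Tuple[float, float]:
    """Return (FRE, FKGL) for English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)  # average words per sentence
    spw = syllables / len(words)       # average syllables per word
    fre = 206.835 - 1.015 * wps - 84.6 * spw   # Flesch Reading Ease
    fkgl = 0.39 * wps + 11.8 * spw - 15.59     # Flesch-Kincaid Grade Level
    return fre, fkgl


# Hypothetical chatbot-style answer (not taken from the study).
sample = (
    "Remove your denture every night. Clean it with a soft brush. "
    "Store it in water so that it keeps its shape."
)
fre, fkgl = readability(sample)
print(f"FRE = {fre:.1f}, FKGL = {fkgl:.1f}")
```

Higher FRE values indicate easier text, whereas FKGL approximates the school grade needed to follow it, so the two indices move in opposite directions as sentences and words get longer.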

Discussion

This study statistically evaluated the accuracy, quality, readability, understandability, and actionability of responses about dental and maxillofacial prostheses from four chatbots: DeepSeek-R1, ChatGPT-o1, ChatGPT-4, and Dental GPT. The null hypothesis was retained for quality, understandability, and actionability but rejected for accuracy and readability at the specified significance level. Notably, Dental GPT and ChatGPT-o1 achieved significantly higher accuracy ratings, underscoring the importance of domain-specific datasets in enhancing chatbot performance. Although ChatGPT-4 showed improved accuracy compared with previous studies, its limitations in providing evidence-based information were acknowledged.

The study also assessed the readability of the chatbot responses. Dental GPT scored lowest on the Flesch Reading Ease (FRE) scale, indicating a need for improvement. All chatbots produced text at a reading level equivalent to the 8th to 9th grade, which may hinder patient comprehension. Understandability scores ranged from 87.81% to 90.83%, while actionability values were lower, suggesting that although the information was generally clear, its practical application could be limited. The results emphasize the need to improve the readability of chatbot responses so that patients, particularly those with limited mobility or cognitive decline, can make effective use of the information provided. Overall, the chatbots demonstrated potential as patient education tools, but their limitations must be weighed carefully and the information they provide should be verified.