Performance of artificial intelligence on Turkish dental specialization exam: can ChatGPT-4.0 and Gemini Advanced achieve comparable results to humans?

Journal: BMC Medical Education, Volume: 25, Issue: 1
DOI: https://doi.org/10.1186/s12909-024-06389-9
PMID: https://pubmed.ncbi.nlm.nih.gov/39930399
Publication Date: 2025-02-10
Author(s): Soner Şişmanoğlu et al.
Primary Topic: Artificial Intelligence in Healthcare and Education

Overview

This study evaluates two AI-powered chatbots, ChatGPT-4.0 and Gemini Advanced, on the Dental Specialization Exam (DUS) administered in Turkey in 2020 and 2021. The DUS questions were entered into the chatbots in Turkish, and the responses were analyzed with Pearson’s chi-squared test, with p < 0.05 taken as statistically significant. ChatGPT-4.0 answered 83.3% of questions correctly in 2020 and 80.5% in 2021, outperforming Gemini Advanced, which answered 65% and 60.2% correctly, respectively. Both chatbots nevertheless scored below the top human performers, whose exam scores were 68.5 in 2020 and 72.3 in 2021. Although both chatbots passed the exam by exceeding the minimum threshold score of 45, they performed inadequately in basic and clinical sciences and in specific clinical specialties such as endodontics and orthodontics. The findings suggest that AI-powered chatbots show promise for dental education and clinical assistance, but their current capabilities fall short of human experts, indicating a need for further development and adaptation in these fields.

Introduction

The paper's introduction discusses the transformative impact of artificial intelligence (AI), particularly large language models (LLMs) such as ChatGPT and Google Gemini, on sectors including healthcare and dental education. Built on deep learning and neural networks, these technologies can process extensive datasets and give coherent responses to complex queries. The paper highlights the emergence of AI-powered chatbots, notably ChatGPT, its upgraded version ChatGPT-4.0, and Google Gemini, which have gained traction in medical and dental education because they can generate detailed answers across multiple languages and subjects.

The study aims to evaluate these AI chatbots on the Dental Specialization Exam (DUS) in Turkey, which assesses the knowledge of dental graduates through multiple-choice questions (MCQs). Despite growing interest in AI applications in dentistry, there is a notable lack of research on how well AI chatbots answer dentistry questions, especially in comprehensive exams that cover multiple specialties. The study compares the performance of ChatGPT-4.0 and Gemini Advanced with that of the human dentists who sat the DUS, and it tests two hypotheses: that both chatbots can pass the exam, and that there is no significant performance difference between them. The study is presented as a pioneering effort to assess these advanced AI models in a dental examination context.

Methods

The authors used the latest versions of two widely recognized AI-powered chatbots: ChatGPT-4.0, developed by OpenAI, and Gemini Advanced, from Google LLC. The questions from the 2020 and 2021 DUS were submitted in Turkish to both chatbots, and the responses were scored to compare the accuracy of the two systems in an educational context.
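The Overview names Pearson's chi-squared test as the statistical method. As a rough illustration of how such a comparison between two chatbots might be computed, the sketch below applies the test to a 2x2 table of correct and incorrect answers. The function name `pearson_chi2_2x2` and the counts in the example are assumptions made for illustration: the counts describe a hypothetical 100-question exam and are not figures from the paper, and whether the authors applied a continuity correction is not stated in this summary (none is applied here).

```python
import math


def pearson_chi2_2x2(a_correct, a_wrong, b_correct, b_wrong):
    """Pearson's chi-squared test for a 2x2 table, without continuity correction.

    Rows are the two chatbots; columns are correct and incorrect answers.
    Returns the chi-squared statistic and the p-value for 1 degree of freedom.
    """
    observed = [[a_correct, a_wrong], [b_correct, b_wrong]]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under the null hypothesis of equal accuracy.
            expected = row_totals[i] * col_totals[j] / total
            if expected == 0:
                raise ValueError("expected count is zero; the test is undefined")
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, the chi-squared survival function
    # reduces to erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts (NOT from the paper): 100 questions per chatbot,
# chosen only to roughly mirror the reported 2020 percentages.
chi2, p = pearson_chi2_2x2(83, 17, 65, 35)
print(f"chi2 = {chi2:.3f}, p = {p:.4f}, significant at 0.05: {p < 0.05}")
```

With real data, the four counts would come from each chatbot's answer sheet for a given exam or section. A library routine such as SciPy's `chi2_contingency` would be the more usual choice in practice; the standard-library version is used here only to keep the sketch self-contained.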

Results

ChatGPT-4.0 significantly outperformed Gemini Advanced in both the basic sciences and clinical sciences sections (p < 0.001) and achieved higher overall scores. Although the difference in the clinical sciences section as a whole was significant, subgroup analyses of the individual specialties revealed no statistically significant differences. Both models performed best in periodontology and worst in endodontics and orthodontics.

In terms of question complexity, higher-order questions made up 40.3% of the 238 questions, and ChatGPT-4.0 performed significantly better than Gemini Advanced in both the higher-order and lower-order categories (p < 0.05). Correct response rates for both chatbots fell as Bloom's taxonomy level increased. ChatGPT-4.0 scored 65.5 points (83.3% correct) on the 2020 exam and 65.6 points (80.5% correct) on the 2021 exam, while Gemini Advanced scored 50.1 points (65% correct) and 48.6 points (60.2% correct), respectively; the difference between the chatbots was statistically significant in both years (p < 0.05). ChatGPT-4.0 also achieved a perfect score in the basic sciences section of the 2020 exam, whereas Gemini Advanced had the lower score in every section. Of the 7 image-based questions, ChatGPT-4.0 answered 3 correctly, while Gemini Advanced left 3 unanswered and answered the other 4 incorrectly.
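The figures in this summary can be cross-checked with a little arithmetic. The sketch below uses only numbers stated in the text: the chatbot point scores, the pass threshold of 45 and top human scores from the Overview, and the 40.3% share of higher-order questions. The variable and function names are hypothetical, and rounding the question count to the nearest integer is an assumption.

```python
# All figures below are taken from this summary; names are illustrative.
PASS_THRESHOLD = 45.0       # minimum passing score stated in the Overview
TOTAL_QUESTIONS = 238       # total questions analyzed
HIGHER_ORDER_SHARE = 0.403  # 40.3% higher-order questions

TOP_HUMAN = {2020: 68.5, 2021: 72.3}
CHATBOT_POINTS = {
    "ChatGPT-4.0": {2020: 65.5, 2021: 65.6},
    "Gemini Advanced": {2020: 50.1, 2021: 48.6},
}


def higher_order_count(total=TOTAL_QUESTIONS, share=HIGHER_ORDER_SHARE):
    """Approximate number of higher-order questions implied by the share."""
    return round(total * share)


def compare_to_benchmarks(points, year):
    """Report whether a score passes and how far it is below the top human."""
    return {
        "passed": points > PASS_THRESHOLD,
        "gap_to_top_human": round(TOP_HUMAN[year] - points, 1),
    }


print("higher-order questions (approx.):", higher_order_count())
for bot, by_year in CHATBOT_POINTS.items():
    for year, points in by_year.items():
        print(bot, year, compare_to_benchmarks(points, year))
```

On these figures both chatbots clear the threshold in both years, and ChatGPT-4.0 comes within a few points of the top human score while Gemini Advanced trails by a much wider margin.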

Discussion

The discussion evaluates how the two AI-powered chatbots, ChatGPT-4.0 and Gemini Advanced, performed on the Dental Specialization Exam (DUS), which assesses knowledge of basic medical sciences and clinical dentistry. ChatGPT-4.0 significantly outperformed Gemini Advanced across all sections of the exam, with an overall correct response rate of approximately 80% compared with approximately 60% for Gemini Advanced. The discussion links this disparity to how the two systems source information: ChatGPT-4.0 has no real-time web access and relies on training data from before 2022, whereas Gemini Advanced can draw on online information.

The findings indicate that both chatbots can pass the DUS, yet they still fall short of the top-performing human candidates, particularly in basic and clinical sciences. Performance also varied across specialties: ChatGPT-4.0 excelled in periodontology but struggled in endodontics and orthodontics. The study highlights the potential of AI in dental education while cautioning against over-reliance on these technologies because of limitations such as the “hallucination phenomenon”, and it calls for further research into their efficacy in clinical settings. Overall, AI-powered chatbots show promise, but they should be viewed as supplementary tools rather than replacements for human expertise in dentistry.