Artificial intelligence performance in answering multiple-choice oral pathology questions: a comparative analysis

Journal: BMC Oral Health, Volume: 25, Issue: 1
DOI: https://doi.org/10.1186/s12903-025-05926-2
PMID: https://pubmed.ncbi.nlm.nih.gov/40234873
Publication Date: 2025-04-15
Author(s): Zhenyun Du et al.
Primary Topic: Artificial Intelligence in Healthcare and Education

Overview

The study investigates the performance of eight large language models (LLMs) in answering multiple-choice oral pathology questions from the Turkish Dental Specialization Examination (DUS). A total of 100 questions, each categorized as either “case-based” or “knowledge-based,” were used to assess the accuracy of Gemini 1.5, Gemini 2, ChatGPT o1, ChatGPT 4o, ChatGPT 4, Copilot, Claude 3.5, and Deepseek. Performance differed significantly among the models (p < 0.001): ChatGPT o1 achieved the highest accuracy (96 correct answers), followed by Claude (84 correct) and by Gemini 2 and Deepseek (82 correct each). ChatGPT o1 also led in both question categories, and the differences among models were significant for case-based (p = 0.034) and knowledge-based (p < 0.001) questions. The findings suggest that LLMs vary in their proficiency on oral pathology questions, with ChatGPT o1 the most accurate of the models tested. This points to the potential of LLMs as supplementary tools in dental education and exam preparation. The study nonetheless emphasizes the need for further validation of the reliability, currency, and clinical relevance of these models. As the technology advances, the supportive role of AI and LLMs in dental education is expected to grow, and their performance is expected to improve.

Introduction

The introduction of the research paper highlights the transformative impact of artificial intelligence (AI) on healthcare and education, particularly within the fields of medicine and dentistry. Over the past two decades, advancements in big data, computational power, and AI algorithms have facilitated significant improvements in diagnostic processes, treatment planning, and educational methodologies. The integration of AI technologies into routine tasks is reshaping medical and dental education, providing substantial benefits to both students and educators. Notably, AI models can process extensive datasets to enable objective clinical and histopathological analyses, which may enhance treatment methods and prognostic outcomes.

Despite these advancements, few studies have assessed how accurately AI models process theoretical knowledge or how they perform relative to humans in multiple-choice examinations. This research focuses on oral and maxillofacial pathology (OMP), a critical area of dental education that requires comprehensive theoretical knowledge and clinical experience for accurate diagnosis and treatment. The study evaluates eight large language models (Gemini 1.5, Gemini 2, ChatGPT 4o, ChatGPT 4, ChatGPT o1, Copilot, Claude 3.5, and Deepseek) by analyzing their accuracy on multiple-choice oral pathology questions. The aim is to clarify performance differences among these models and thereby contribute to the understanding of AI’s role in improving educational outcomes in oral pathology.

Methods

In this study, conducted on February 5, 2025, in Türkiye, the performance of various large language models (LLMs) was evaluated using questions from the Dental Specialization Entrance Examination (DUS). The DUS, administered by the Student Selection and Placement Center (OSYM), assesses candidates for dental specialty training and consists of 120 questions, including 40 from basic sciences and 80 from clinical sciences. The study specifically focused on oral pathology questions, which cover topics such as odontogenic and developmental jaw cysts, odontogenic tumors, and infectious diseases affecting oral tissues.

The researchers analyzed all oral pathology questions from DUS exams held between 2012 and 2021, sourced from the publicly accessible OSYM website. The LLMs tested were Gemini 1.5, Gemini 2, ChatGPT 4o, ChatGPT 4, ChatGPT o1, Copilot, Claude 3.5, and Deepseek. Because the study did not involve human or animal subjects, it did not require ethics committee approval. The findings are intended to provide insight into how well AI-based LLMs handle specialized dental examination content.
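As a rough illustration of the kind of scoring this setup implies, the Python sketch below tallies correct answers per model and question type. It is not the authors' procedure: the `Question` record, the `score` function, the model names, and the three toy questions are all hypothetical, and the actual DUS answer key and model responses are not reproduced here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    """Hypothetical record for one multiple-choice question."""
    qid: int
    qtype: str   # "case-based" or "knowledge-based"
    answer: str  # correct option letter, upper case


def score(questions, responses):
    """Count correct answers per (model, question type).

    `responses` maps a model name to {qid: chosen option letter}.
    Missing answers count as incorrect.
    """
    tally = defaultdict(int)
    for model, picks in responses.items():
        for q in questions:
            if picks.get(q.qid, "").strip().upper() == q.answer:
                tally[(model, q.qtype)] += 1
    return dict(tally)


# Toy data, invented purely for illustration.
questions = [
    Question(1, "case-based", "A"),
    Question(2, "knowledge-based", "C"),
    Question(3, "knowledge-based", "E"),
]
responses = {
    "model_x": {1: "A", 2: "c", 3: "B"},
    "model_y": {1: "D", 2: "C", 3: "E"},
}
result = score(questions, responses)
for (model, qtype), n_correct in sorted(result.items()):
    print(f"{model} / {qtype}: {n_correct} correct")
```

Keeping the tally keyed by question type makes it straightforward to report overall and per-category counts from the same pass, which matches how the results are broken down in the study.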

Results

The models differed significantly in overall accuracy on the 100 DUS oral pathology questions (p < 0.001). ChatGPT o1 performed best with 96 correct answers, followed by Claude (84 correct) and by Gemini 2 and Deepseek (82 correct each). Copilot performed worst, with only 61 correct answers. Pairwise comparisons showed significant differences between Copilot and other models, particularly Gemini 1.5, Gemini 2, Deepseek, Claude, and ChatGPT 4o (p < 0.0031).

When the questions were split into case-based and knowledge-based types, the models again differed significantly within each type (case-based: p = 0.034; knowledge-based: p < 0.001). On case-based questions, ChatGPT o1 led with 27 correct answers, closely followed by Claude with 26. On knowledge-based questions, ChatGPT o1 gave 69 correct answers and Claude 58. Copilot differed significantly from other models on both question types (p < 0.0031).

The models were also compared across five specific topics. Significant differences appeared only in the "Mucosal and Tongue Diseases" category, where ChatGPT o1 answered every question correctly while Copilot answered 19 correctly and 11 incorrectly.
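To make the reported counts easier to compare, the following Python sketch converts them to accuracies and runs one illustrative pairwise test. This summary does not state which statistical test the authors used, so the Pearson chi-square here is an assumption chosen for simplicity. Because every model answered the same 100 questions, a paired test such as McNemar's would be more appropriate, but it needs per-question results that are not given here. Only the five counts reported above are included; the function name `chi2_2x2` is hypothetical.

```python
import math

TOTAL = 100  # questions answered by each model, as reported

# Correct-answer counts reported in the summary (five of the eight models).
correct = {
    "ChatGPT o1": 96,
    "Claude": 84,
    "Gemini 2": 82,
    "Deepseek": 82,
    "Copilot": 61,
}

accuracy = {model: n / TOTAL for model, n in correct.items()}


def chi2_2x2(a_correct, b_correct, n):
    """Pearson chi-square (no continuity correction) comparing two models
    that each answered n questions. Returns (statistic, p_value), df = 1.
    Treats the two samples as independent, which is a simplification."""
    a_wrong, b_wrong = n - a_correct, n - b_correct
    grand_total = 2 * n
    numerator = grand_total * (a_correct * b_wrong - a_wrong * b_correct) ** 2
    denominator = n * n * (a_correct + b_correct) * (a_wrong + b_wrong)
    stat = numerator / denominator
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


stat, p = chi2_2x2(correct["ChatGPT o1"], correct["Copilot"], TOTAL)

for model, acc in accuracy.items():
    print(f"{model}: {acc:.0%}")
print(f"ChatGPT o1 vs Copilot: chi2 = {stat:.2f}, p = {p:.2e}")
```

Under these simplifying assumptions, the gap between the best and worst model is large enough that the p-value falls far below the 0.0031 threshold quoted for the pairwise comparisons.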

Discussion

In the discussion, the authors review how the various large language models (LLMs) performed on DUS questions related to oral pathology. A power analysis indicated that at least 97 questions were needed for sufficient statistical power, so 100 multiple-choice DUS questions were included. The questions were categorized as case-based or knowledge-based, and each model was assessed for the accuracy of its answers. ChatGPT o1 outperformed the other models, which the authors take to suggest a stronger knowledge base and understanding, particularly in clinical scenarios. Claude and Deepseek also performed strongly, while Copilot was the least successful, possibly because of a more limited ability to handle medical terminology in context.

The findings align with existing literature, which reports varying success rates for LLMs across medical specialties. The study found significant differences among models on both case-based and knowledge-based questions, with ChatGPT o1 and Claude leading on the case-based ones. Limitations included the use of Turkish-language questions, which may have introduced translation inaccuracies, and the limited number of questions, which points to the need for further validation with more diverse question types. The authors concluded that LLMs such as ChatGPT o1 can be valuable tools in dental education, and that future research should use broader question types and larger datasets to improve reliability and applicability.