ChatGPT-4 Omni’s superiority in answering multiple-choice oral radiology questions

Journal: BMC Oral Health, Volume: 25, Issue: 1
DOI: https://doi.org/10.1186/s12903-025-05554-w
PMID: https://pubmed.ncbi.nlm.nih.gov/39893407
Publication Date: 2025-02-01
Author(s): Melek Taşsöker
Primary Topic: Artificial Intelligence in Healthcare and Education

Overview

This overview outlines the significant advances in generative AI chatbots since OpenAI introduced ChatGPT in November 2022, highlighting their growing influence in higher education. Various chatbots, including Ernie, Bard (now Gemini), and Grok, have been developed using extensive datasets, and their effectiveness depends on factors such as domain expertise, update frequency, and the complexity of the inquiry. These AI tools are expected to offer personalized solutions that may eventually replace traditional search engines.

In conclusion, ChatGPT-4o is identified as the most accurate of the chatbots evaluated, and ongoing improvements in educational content and AI architectures are making these tools increasingly important in academia. They show promise for handling complex scenarios in dentistry and medicine efficiently, potentially improving outcomes. However, the current limitations of AI tools compared with human specialists highlight the need for further research and development. The future of these technologies in education and healthcare remains a topic of significant interest.

Introduction

The introduction of this research paper describes the evolution of Natural Language Processing (NLP) and its application in developing AI chatbots, tracing it back to Alan Turing’s foundational inquiries in the 1950s. Various studies have demonstrated the effectiveness of AI chatbots across multiple medical specialties; for example, ChatGPT-4 achieved an 87.11% correct answer rate on the American College of Radiology’s examination, outperforming Bard at 70.44%. Research on chatbot performance in dentistry remains scarce, although recent studies indicate that ChatGPT-3.5 and ChatGPT-4 attained success rates of 61.3% and 76.9%, respectively, on dental board exams.

The paper emphasizes the transformative potential of large language models (LLMs) in education, particularly for dental students preparing for multiple-choice exams. It underscores the importance of evaluating the accuracy and reliability of AI-generated responses, especially in specialized fields like oral radiology, which faces a shortage of practitioners and educators. The study aims to assess and compare the performance of advanced chatbots—ChatGPT-4o, ChatGPT-3.5, Google Bard, and Microsoft Copilot—on text-based multiple-choice questions related to oral radiology in the context of Türkiye’s Dental Specialty Admission Exam (DUS), which includes specific questions on this specialty.

Methods

The “Methods” section describes the experimental design and analytical techniques used in the study. The researchers took a quantitative approach, posing the same set of questions to each chatbot and recording accuracy, response time, and word count. Statistical analyses, including regression models and hypothesis testing, were used to evaluate the relationships between the independent and dependent variables.

Data collection covered 123 text-based multiple-choice questions from the DUS question bank (2012-2021), a sample size intended to give adequate power for the analyses. The methodology also relied on standardized measurement instruments, which strengthens the reliability and validity of the findings. The section concludes with the ethical considerations addressed during the research, including informed consent and data confidentiality.

Results

The results show significant differences in response accuracy among the four chatbots: ChatGPT-4o achieved the highest accuracy at 86.1%, followed by Google Bard at 61.8%, while ChatGPT-3.5 and Microsoft Copilot scored lower at 43.9% and 41.5%, respectively (p < 0.001). Word counts also differed significantly, with Google Bard producing the longest responses and ChatGPT-3.5 the shortest (p < 0.001).
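The accuracy comparison can be reproduced approximately from the published figures. The sketch below is illustrative only: the summary does not name the statistical test that was used, so a chi-square test of homogeneity is an assumption here, as is recovering the correct-answer counts by rounding each reported rate times 123 questions. Because all four chatbots answered the same questions, a paired test such as Cochran's Q may be more appropriate in practice.

```python
import math

# Accuracy rates as reported in the summary.
reported_accuracy = {
    "ChatGPT-4o": 0.861,
    "Google Bard": 0.618,
    "ChatGPT-3.5": 0.439,
    "Microsoft Copilot": 0.415,
}
N_QUESTIONS = 123  # size of the question set given in the Discussion

# Assumption: counts of correct answers are recovered by rounding rate * 123.
correct = {bot: round(rate * N_QUESTIONS) for bot, rate in reported_accuracy.items()}


def chi_square_homogeneity(correct_counts, n_per_group):
    """Chi-square statistic for a (groups x 2) table with equal group sizes."""
    groups = len(correct_counts)
    expected_correct = sum(correct_counts) / groups
    expected_wrong = n_per_group - expected_correct
    stat = 0.0
    for c in correct_counts:
        stat += (c - expected_correct) ** 2 / expected_correct
        stat += ((n_per_group - c) - expected_wrong) ** 2 / expected_wrong
    return stat


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Four chatbots, two outcomes (correct / incorrect): (4 - 1) * (2 - 1) = 3 df.
stat = chi_square_homogeneity(list(correct.values()), N_QUESTIONS)
p_value = chi2_sf_df3(stat)
print(correct)
print(f"chi-square = {stat:.2f}, p = {p_value:.3g}")
```

Under these assumptions the p-value falls far below 0.001, which is consistent with the significance level reported in the summary.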

Response times also differed significantly; ChatGPT-3.5 provided the fastest responses, while ChatGPT-4o was the slowest (p < 0.001). Pairwise comparisons showed significant differences in word count between ChatGPT-3.5 and Google Bard, and between Microsoft Copilot and Google Bard. Significant differences in mean response time were also observed among ChatGPT-3.5, ChatGPT-4o, and Google Bard. The study also examined chatbot responses by oral radiology topic, identifying jaw pathologies and systemic diseases as the most frequently queried subjects; further summaries and graphs are provided in the accompanying tables and figures.

Discussion

The study evaluated the performance of four AI chatbots—ChatGPT-3.5, ChatGPT-4 Omni (4o), Google Bard, and Microsoft Copilot—in answering multiple-choice questions related to oral radiology. Utilizing a dataset of 123 questions from the DUS question bank (2012-2021), the research aimed to assess the accuracy, response time, and word count of each chatbot’s answers. The questions were categorized into 17 topics, grouped into three educational content areas: Fundamental Knowledge, Imaging and Equipment, and Image Interpretation. Reliability was assessed using Cohen’s Kappa, revealing high agreement rates, particularly for ChatGPT-4o (κ = 0.86).
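For readers unfamiliar with Cohen's Kappa, the statistic corrects raw agreement for the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the chance agreement. The summary does not say which two sets of ratings were compared, so the sketch below uses two hypothetical answer runs (`run_1`, `run_2`) purely for illustration; the function name `cohen_kappa` is likewise an assumption, not something taken from the study.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: share of items on which both sequences agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the marginal frequencies, summed over labels.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical data: options chosen for four questions in two separate runs.
run_1 = ["A", "A", "B", "B"]
run_2 = ["A", "A", "B", "A"]
print(cohen_kappa(run_1, run_2))
```

In this toy example the runs agree on three of four items (p_o = 0.75) and chance agreement is 0.5, so kappa is lower than the raw agreement rate.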

The findings indicated that ChatGPT-4o achieved the highest accuracy rate at 86.1%, outperforming the other chatbots, which exhibited varying levels of accuracy. ChatGPT-3.5, while the fastest in response time, demonstrated lower accuracy, likely due to its concise answers. In contrast, Google Bard provided detailed responses but did not match the accuracy of ChatGPT-4o. The study highlights the potential of large language models (LLMs) in enhancing medical education and practice, particularly in specialized fields like oral radiology. However, it also emphasizes the need for further research to address the limitations of these AI tools, as they currently cannot fully replicate the expertise of human specialists.

Limitations

The primary limitation of this study is its exclusive focus on Turkish text-based questions, which restricts the generalizability of the findings. Future research would benefit from exploring a variety of question formats, such as open-ended and rank-based questions. In addition, increasing the number of questions and submitting radiological images to different chatbots could enable a comparison of their responses with those of human students.

Moreover, the generation of new questions based on the uploaded ones could yield valuable insights into the support mechanisms available to students during exam preparation. Addressing these limitations could significantly enrich the understanding of chatbot efficacy in educational contexts.