Performance of five free large language models in dental trauma: a 30-day longitudinal benchmark study

Journal: Frontiers in Oral Health, Volume: 6
DOI: https://doi.org/10.3389/froh.2025.1737114
PMID: https://pubmed.ncbi.nlm.nih.gov/41473869
Publication Date: 2025-12-15
Author(s): Rafaela Mancini Lisboa et al.
Primary Topic: Dental Trauma and Treatments

Overview

This study evaluated the accuracy and consistency of five large language models (LLMs), namely ChatGPT, Google Gemini, Microsoft Copilot, DeepSeek, and Meta AI, in answering questions about dental trauma. Over 30 days, 60 dichotomous questions were submitted to each model daily under two prompting conditions (zero-shot and zero-shot with context), yielding 18,000 responses in total. Responses were assessed against the International Association of Dental Traumatology (IADT) guidelines. Performance was summarized with sensitivity, specificity, accuracy, and area under the ROC curve (AUC), and analyzed with a generalized linear mixed model (GLMM). Temporal stability was measured using the intraclass correlation coefficient (ICC).
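The response count follows directly from the design, and the reported metrics come from comparing each true/false answer with the guideline-based key. The Python sketch below is illustrative only: the function name `binary_metrics` and the toy data are assumptions made for this example, not material from the study.

```python
# Illustrative sketch; names and toy data are hypothetical, not from the study.

# Design arithmetic: models x prompting conditions x questions x days.
N_MODELS, N_CONDITIONS, N_QUESTIONS, N_DAYS = 5, 2, 60, 30
total_responses = N_MODELS * N_CONDITIONS * N_QUESTIONS * N_DAYS


def binary_metrics(truth, answers):
    """Return sensitivity, specificity and accuracy for true/false answers.

    `truth` is the reference key and `answers` the model responses;
    both are equal-length sequences of booleans.
    """
    if len(truth) != len(answers):
        raise ValueError("truth and answers must have the same length")
    pairs = list(zip(truth, answers))
    tp = sum(t and a for t, a in pairs)              # true statement, answered true
    tn = sum((not t) and (not a) for t, a in pairs)  # false statement, answered false
    fp = sum((not t) and a for t, a in pairs)        # false statement, answered true
    fn = sum(t and (not a) for t, a in pairs)        # true statement, answered false
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        "accuracy": (tp + tn) / len(pairs) if pairs else float("nan"),
    }


# Toy example: four true statements followed by four false statements.
truth = [True, True, True, True, False, False, False, False]
answers = [True, True, True, False, False, False, True, True]
metrics = binary_metrics(truth, answers)
```

In this framing, sensitivity is the share of true statements a model accepts and specificity is the share of false statements it rejects.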

All LLMs achieved an accuracy above 85%. Microsoft Copilot (91.1%) and DeepSeek (90%) performed best, with no significant difference between the two (p > 0.05). Both models also showed strong consistency over the evaluation period (ICC > 0.90). Notably, adding context to the prompt did not significantly improve the accuracy or stability of any model. The study concludes that Microsoft Copilot and DeepSeek are particularly effective in providing reliable information on dental trauma, and that free LLMs can serve as valuable supplementary tools for disseminating such information, provided they are used alongside credible scientific sources and professional oversight.
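To make the stability measure concrete, the sketch below computes an ICC over a small table of graded responses (rows are questions, columns are days, 1 means the answer matched the key). The summary does not say which ICC form the authors used, so the one-way random-effects form ICC(1,1) is an assumption here, as are the function name `icc_oneway` and the toy data.

```python
# Illustrative sketch; the ICC form, function name and data are assumptions.

def icc_oneway(scores):
    """One-way random-effects ICC(1,1).

    `scores` is a list of rows, one per question, each holding the same
    number of repeated measurements (for example, one per day).
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two rows")
    k = len(scores[0])
    if k < 2 or any(len(row) != k for row in scores):
        raise ValueError("each row needs the same number (>= 2) of values")

    grand_mean = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]

    # Between-question and within-question mean squares.
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(scores, row_means) for x in row
    ) / (n * (k - 1))

    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        return float("nan")  # no variance at all: ICC is undefined
    return (ms_between - ms_within) / denominator


# Each question gets the same grade on every day: perfectly stable.
stable = [[1, 1, 1], [0, 0, 0], [1, 1, 1], [1, 1, 1]]
# Grades flip from day to day within each question: unstable.
unstable = [[1, 0], [0, 1]]

icc_stable = icc_oneway(stable)
icc_unstable = icc_oneway(unstable)
```

Values near 1 mean that almost all variation lies between questions rather than between days, which is the pattern an ICC above 0.90 describes.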

Introduction

The introduction describes the emergence and significance of large language models (LLMs), which generate human-like responses using natural language processing and deep neural networks. Since OpenAI released ChatGPT publicly in late 2022, LLMs have been recognized as potentially accessible sources of medical guidance. This matters for dental trauma, which poses a substantial public health challenge because of the associated pain and risk of tooth loss. Despite their promise, LLMs are prone to hallucinations and to biases in their training data, both of which can undermine the information they provide.

The paper highlights the need for accurate management of dental trauma and emphasizes the role of treatment guidelines and professional platforms in standardizing care. Previous studies have evaluated the accuracy of various LLMs in providing information on dental trauma, with varying results. For instance, one study reported an accuracy of 80.8% for the Gemini model, while another found that ChatGPT 4.0 Plus achieved the highest accuracy, 95.6%, across multiple interactions. However, these studies neither examined the longitudinal consistency of responses nor included newer models. The primary objective of this study is therefore to compare the accuracy and 30-day consistency of five modern LLMs (Copilot, DeepSeek, ChatGPT-4o, Gemini, and MetaAI) against the IADT guidelines, and to investigate how the prompting strategy affects their performance.

Methods

The study used an observational longitudinal design. Because the data involved no human subjects or identifiable information, ethics approval was not required. Reporting followed the TRIPOD-LLM guidelines and the CHART statement to support rigor and transparency. Supplementary Material 1 provides additional details on adherence to these guidelines.

Results

The study evaluated five AI-based chatbots (Copilot, DeepSeek, MetaAI, ChatGPT, and Gemini) as sources of information for dental trauma decision-making, analyzing a total of 18,000 responses collected over 30 days. Copilot was the most accurate, with a zero-shot accuracy of 0.91 (95% CI: 0.84; 0.98) and a sensitivity of 0.93. DeepSeek followed closely with a zero-shot accuracy of 0.90 (95% CI: 0.82; 0.98). All models reached accuracy above 85%, supporting their reliability as information sources. Adding context to the prompt did not significantly improve accuracy. Responses varied substantially between questions but only minimally from day to day.

Statistical analyses showed that the choice of LLM had a significant effect on performance, whereas the prompting condition and its interaction with the model did not. Copilot and DeepSeek outperformed ChatGPT and Gemini, although the effect sizes indicate that the differences between models were small. The findings suggest an upward trend in chatbot accuracy compared with previous studies, attributed to ongoing model updates. The study also emphasized the importance of specificity, that is, the ability to correctly reject false statements, and all models achieved specificity above 80%. Limitations included the focus on free, web-accessible versions of the chatbots and the exclusive use of Portuguese interactions, which points to a need for broader multilingual evaluations in future research. Overall, the study underscores the potential of these chatbots as supplementary tools in dental trauma management, particularly in urgent care scenarios.
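One way to write down a GLMM with the fixed and random effects described in the Results is shown below. This is a hedged sketch, not the authors' stated specification: the binomial family with logit link, the random intercepts for question and day, and all symbols are assumptions chosen to match the reported model, prompt, and interaction effects and the question-level and day-to-day variability.

```latex
% Y_{mpqd} = 1 if model m, under prompting condition p, answered question q
% correctly on day d (all notation is assumed for illustration).
\[
\operatorname{logit}\,\Pr\left(Y_{mpqd} = 1\right)
  = \beta_0 + \alpha_m + \gamma_p + (\alpha\gamma)_{mp} + u_q + v_d,
\qquad
u_q \sim \mathcal{N}\left(0, \sigma_q^2\right),\quad
v_d \sim \mathcal{N}\left(0, \sigma_d^2\right)
\]
```

Under this reading, a significant model effect corresponds to the \(\alpha_m\) terms, non-significant prompting and interaction effects correspond to \(\gamma_p\) and \((\alpha\gamma)_{mp}\), and substantial question-level but minimal day-to-day variability corresponds to \(\sigma_q^2\) being large relative to \(\sigma_d^2\).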

Discussion

In this study, two experienced endodontists developed a benchmark of 60 true/false questions on the diagnosis and management of traumatic dental injuries, based on the latest International Association of Dental Traumatology (IADT) guidelines. External experts validated the benchmark for clarity and objectivity. It was then used to assess the performance of five large language models (LLMs) in a clinical context. The models were tested under two conditions, zero-shot and zero-shot with context, to simulate different user scenarios. A total of 18,000 interactions were collected, and responses were evaluated against a predefined ground truth.
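The two prompting conditions and the grading step can be sketched in Python as follows. Everything specific here is hypothetical: the context wording, the instruction text, the example statement, and the parsing rule are assumptions made for illustration, and the study itself used Portuguese interactions through the chatbots' web interfaces.

```python
# Illustrative sketch; wording, names and the example item are hypothetical.

# Assumed context sentence for the "zero-shot with context" condition.
CONTEXT_PREFIX = "You are answering questions based on the IADT guidelines. "
INSTRUCTION = "Answer only 'true' or 'false': "


def build_prompt(statement, with_context):
    """Build the prompt for one benchmark statement under one condition."""
    prefix = CONTEXT_PREFIX if with_context else ""
    return f"{prefix}{INSTRUCTION}{statement}"


def grade(response_text, ground_truth):
    """Compare a free-text reply with the predefined ground truth.

    Returns True or False when the reply starts with 'true' or 'false',
    and None when it cannot be parsed as either.
    """
    normalized = response_text.strip().lower()
    if normalized.startswith("true"):
        answer = True
    elif normalized.startswith("false"):
        answer = False
    else:
        return None
    return answer == ground_truth


# Hypothetical benchmark item, not taken from the study's question set.
item = {
    "statement": "An avulsed permanent tooth should be replanted as soon as possible.",
    "truth": True,
}
prompts = {
    "zero-shot": build_prompt(item["statement"], with_context=False),
    "zero-shot with context": build_prompt(item["statement"], with_context=True),
}
```

Returning `None` for unparseable replies keeps them distinct from wrong answers, so they can be counted or reviewed separately.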

Statistical analysis was conducted in R. Sensitivity, specificity, and overall diagnostic performance were analyzed with a generalized linear mixed model (GLMM). Microsoft Copilot and DeepSeek-V3 showed the highest accuracy and consistency over the testing period. Notably, the use of calibration prompts did not significantly improve the models’ performance. The findings suggest that free LLMs can be valuable tools for disseminating dental trauma information, but their use should be guided by reliable scientific sources and professional oversight to ensure accuracy and reliability in clinical practice.