Evaluating the accuracy of generative artificial intelligence models in dental age estimation based on the Demirjian’s method

Journal: Frontiers in Dental Medicine, Volume: 6
DOI: https://doi.org/10.3389/fdmed.2025.1634006
PMID: https://pubmed.ncbi.nlm.nih.gov/40800006
Publication Date: 2025-07-29
Author(s): Allan Abuabara et al.
Primary Topic: Dental Radiography and Imaging

Overview

This study evaluated the efficacy of large language models (LLMs) such as ChatGPT, Gemini, and DeepSeek in estimating dental age using Demirjian’s scoring system. The findings indicate that while these models can provide estimates, their performance is inferior to traditional methods employed by human examiners. Among the models tested, DeepSeek-V3 demonstrated the best results, exhibiting the lowest mean errors and greater stability over time. In contrast, Gemini showed higher variability and a decline in performance, while ChatGPT yielded intermediate results with relative stability but lower accuracy.

The research underscores significant limitations of LLMs not specifically trained for dental age estimation, including a propensity for overestimation and inconsistent results. Consequently, the authors caution against the indiscriminate use of these models in clinical settings without prior task-specific training and validation. They emphasize the need for further studies to fine-tune these models and assess their performance across diverse populations before considering their practical application, given the potential clinical and legal ramifications of incorrect interpretations.

Introduction

The introduction explains why dental age estimation matters: it is used in archaeology, anthropology, medicine, and forensic dentistry, and it supports the diagnosis of developmental disorders and the planning of orthodontic treatment. Among the methods developed for this purpose, Demirjian’s method is a prominent approach, particularly for children and adolescents. It classifies the development of the seven left mandibular permanent teeth into eight distinct stages (A-H).
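The staging scheme described above maps naturally onto a small validated record. The following is a minimal sketch, not part of the study: the use of FDI tooth codes 31 to 37 as keys and the names `validate_staging` and `format_staging` are assumptions made here for illustration, and the sketch deliberately stops short of converting stages to an age, since that requires the published Demirjian score tables.

```python
from typing import Dict

# The seven left mandibular permanent teeth, keyed by FDI code
# (31 = central incisor ... 37 = second molar). Using FDI codes as
# keys is an assumption of this sketch, not something the paper states.
DEMIRJIAN_TEETH = ("31", "32", "33", "34", "35", "36", "37")
VALID_STAGES = set("ABCDEFGH")  # the eight developmental stages A-H


def validate_staging(record: Dict[str, str]) -> Dict[str, str]:
    """Return a normalized copy of a staging record or raise ValueError."""
    missing = [t for t in DEMIRJIAN_TEETH if t not in record]
    extra = [t for t in record if t not in DEMIRJIAN_TEETH]
    if missing or extra:
        raise ValueError(f"missing teeth: {missing}; unexpected teeth: {extra}")
    normalized = {}
    for tooth in DEMIRJIAN_TEETH:
        stage = record[tooth].strip().upper()
        if stage not in VALID_STAGES:
            raise ValueError(f"tooth {tooth}: invalid stage {record[tooth]!r}")
        normalized[tooth] = stage
    return normalized


def format_staging(record: Dict[str, str]) -> str:
    """Render a validated record as one line of text in fixed tooth order."""
    clean = validate_staging(record)
    return ", ".join(f"{tooth}={clean[tooth]}" for tooth in DEMIRJIAN_TEETH)
```

Rejecting incomplete or malformed records before they reach any estimator, human or model, keeps staging errors separate from estimation errors.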

The paper also introduces Large Language Models (LLMs), a form of Artificial Intelligence (AI) that mimics human language understanding and generation. LLMs, such as the GPT series, Gemini, and DeepSeek, have shown promise in enhancing diagnostic accuracy in medical and dental fields. A recent scoping review indicates a growing interest in the application of LLMs in orthodontics, suggesting an increasing acceptance of these technologies. However, concerns regarding their accuracy and reliability persist, as LLMs can produce irrelevant or vague information, potentially leading to harmful healthcare decisions if misapplied. Notably, the study aims to investigate the performance of untrained LLMs in estimating dental age using Demirjian’s stages, addressing a gap in existing research regarding their effectiveness without prior training.

Methods

The study used a comparative design in which each LLM (ChatGPT, Gemini, and DeepSeek) received the same standardized prompt for simulated scenarios built from real data, and its dental age estimates were compared with traditional Demirjian assessments by trained dentists. The sample, selection criteria, and error metrics are described under Discussion.

Results

The LLMs produced dental age estimates, but with larger errors than the traditional Demirjian assessment by human examiners and with a tendency to overestimate age. DeepSeek-V3 had the lowest mean errors among the models and the most stable output across evaluation days, Gemini showed the highest variability and a decline in performance over time, and ChatGPT gave intermediate results that were relatively stable but less accurate. The error values are reported under Discussion.

Discussion

The study evaluated how well several large language models (LLMs) estimate dental age with Demirjian’s method, using a comparative design with simulated scenarios based on real data. The sample consisted of 30 digital panoramic radiographs from healthy individuals aged 3 to 16 years, selected under strict inclusion and exclusion criteria to ensure data integrity. Each LLM (ChatGPT, Gemini, and DeepSeek) received the same standardized prompt to minimize variability and isolate model-specific behavior. Model performance was compared with traditional assessments by trained dentists using Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and the Coefficient of Determination ($R^2$).
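The three metrics can be sketched as follows. This is a minimal illustration using the standard textbook definitions, not the authors' analysis code. The function name `evaluate_estimates`, the extra signed `bias` term (mean of estimated minus chronological age, a simple way to expose systematic overestimation), and the example ages are assumptions introduced here.

```python
import math
from typing import Dict, Sequence


def evaluate_estimates(actual: Sequence[float],
                       predicted: Sequence[float]) -> Dict[str, float]:
    """Compare estimated ages with chronological ages (both in years)."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and equal in length")
    n = len(actual)
    # Signed error per subject: positive means the age was overestimated.
    errors = [p - a for a, p in zip(actual, predicted)]

    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    bias = sum(errors) / n

    # R^2 = 1 - SS_res / SS_tot; it drops below zero when the estimates
    # are worse than predicting the mean chronological age for everyone.
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all chronological ages are equal")
    r2 = 1.0 - ss_res / ss_tot

    return {"mae": mae, "rmse": rmse, "r2": r2, "bias": bias}


if __name__ == "__main__":
    # Hypothetical ages in years, invented for illustration only.
    chronological = [9.0, 10.0, 11.0]
    estimated = [11.0, 12.0, 13.0]
    print(evaluate_estimates(chronological, estimated))
```

In the hypothetical example every subject is overestimated by two years, so MAE, RMSE, and bias all equal 2.0 while $R^2$ is -5.0: a uniform overestimate that is large relative to the age spread of the sample is enough to push $R^2$ below zero.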

The results indicated substantial performance disparities among the models. The traditional Demirjian method yielded an MAE of 1.21 years, while ChatGPT and DeepSeek produced higher errors (MAE: 1.98-2.05 years) and negative $R^2$ values, indicating poor model fit. DeepSeek demonstrated the best reliability and accuracy, with lower mean errors across evaluation days, while Gemini exhibited the highest variability and a decline in performance over time. The findings underscore the limitations of LLMs without task-specific training in clinical applications, highlighting the need for further training and validation before their integration into dental practice. Future research should focus on fine-tuning these models with domain-specific datasets to enhance their accuracy and reliability in structured clinical tasks.
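A note on the negative $R^2$ values, assuming the coefficient of determination is computed in its standard form against chronological age. The summary does not give the formula, and $y_i$, $\hat{y}_i$, and $\bar{y}$ are notation introduced here for chronological age, estimated age, and mean chronological age:

$$
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
$$

Under this definition $R^2 < 0$ whenever the squared estimation errors sum to more than the spread of the true ages around their mean, meaning the estimates predict age worse than assigning every subject the sample mean age would.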