Can deepseek and ChatGPT be used in the diagnosis of oral pathologies?

Journal: BMC Oral Health, Volume: 25, Issue: 1
DOI: https://doi.org/10.1186/s12903-025-06034-x
PMID: https://pubmed.ncbi.nlm.nih.gov/40281436
Publication Date: 2025-04-25
Author(s): Ömer Faruk Kaygisiz et al.
Primary Topic: Artificial Intelligence in Healthcare and Education

Overview

The research paper evaluates the diagnostic accuracy of two artificial intelligence (AI) applications, ChatGPT-4o and DeepSeek-v3, in the context of 16 clinical case scenarios related to oral pathologies. The methodology involved presenting these scenarios to both AI models, which were tasked with generating three preliminary diagnoses and citing relevant literature. The diagnostic outputs were assessed by 20 specialists using a Likert scale. Results indicated that DeepSeek-v3 achieved a mean score of $4.02 \pm 0.36$, outperforming ChatGPT-4o, which scored $3.15 \pm 0.41$. Specifically, DeepSeek-v3 was statistically superior in 9 out of 16 scenarios, while ChatGPT-4o excelled in only one. Notably, ChatGPT-4o produced a higher number of inaccurate references, with 50 out of 62 being deemed fake, compared to 8 out of 48 for DeepSeek-v3.

The conclusions drawn emphasize the potential of AI applications to assist clinicians by enhancing diagnostic efficiency and reducing workload. However, the study highlights that both models require further refinement before they can be reliably integrated into routine clinical practice. The findings suggest that while AI can provide valuable support, it should not replace clinicians, as it currently lacks the capability to interpret data meaningfully. The authors advocate for future research involving real case histories and larger sample sizes to improve the reliability of AI in healthcare settings.

Introduction

The introduction of the research paper discusses the significance of accurately diagnosing oral pathologies, which can arise from various factors including infections, autoimmune diseases, and trauma. These conditions may present symptoms such as redness, pain, and swelling, and delayed treatment can lead to severe health complications. Traditional diagnostic methods, while commonly employed, often face challenges such as time consumption and the potential for misdiagnosis due to subjective clinician evaluations. As a response, there is a growing interest in utilizing Artificial Intelligence (AI) technologies, particularly deep learning, to enhance diagnostic accuracy and efficiency.

The paper highlights the emergence of Large Language Models (LLMs), such as ChatGPT and DeepSeek, which have shown promise in generating human-like text and assisting in preliminary diagnoses. These models leverage extensive datasets and advanced neural network architectures to process and produce natural language, potentially transforming the medical field. Despite their advantages, the application of AI in diagnosing oral pathologies remains limited, with current studies indicating that AI’s diagnostic accuracy is not yet on par with traditional methods. The study aims to evaluate the reliability of AI responses to text-based clinical scenarios, thereby assessing the utility of different LLMs in aiding clinicians’ decision-making processes in oral pathology diagnosis.

Methods

The study used 16 text-based clinical scenarios describing oral lesions. The scenarios were hypothetical and contained the essential clinical information without histopathological confirmation. Each scenario was presented to both ChatGPT-4o and DeepSeek-v3, and each model was asked to give three preliminary diagnoses and to cite relevant literature.

Twenty specialists in oral and maxillofacial radiology and oral and maxillofacial surgery then rated the models' answers individually on a Likert scale, and the references each model cited were checked to determine whether they were real or fabricated. Mean Likert scores were compared between the two models, both overall and for each scenario.

Results

The study involved 20 specialists, 40% of them Oral and Maxillofacial Radiologists and 60% Oral and Maxillofacial Surgeons, with a mean experience of 7.40 ± 3.28 years. Both large language models (LLMs) were rated generally acceptable or good for oral pathology diagnosis, receiving favorable ratings on 13 of the 16 questions. DeepSeek-v3 scored significantly higher than ChatGPT-4o on 9 questions and ChatGPT-4o scored higher on one, with no significant difference on the remaining questions. The overall mean Likert score was significantly higher for DeepSeek-v3 (4.02 ± 0.36) than for ChatGPT-4o (3.15 ± 0.41), with a p-value of 0.024.

In terms of reference accuracy, DeepSeek-v3 provided 48 references, of which 40 were real, drawn from 11 different publications. ChatGPT-4o provided 62 references, of which only 12 were real, drawn from 6 sources. The accuracy rate for ChatGPT-4o was therefore notably low at 19.35%, indicating a much higher incidence of fictitious references. Both models included inaccurate references, but DeepSeek-v3 was considerably more reliable at citing real ones. Both also cited the same publications repeatedly, which raises a further concern about the overall reliability of their citations.
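The reference figures above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the counts are taken from the paragraph above, while the function names and the "citations per distinct source" measure are assumptions introduced here, not part of the study. That measure also assumes the real references map onto the reported number of distinct publications.

```python
# Counts as reported in the Results paragraph above.
REFERENCES = {
    "DeepSeek-v3": {"total": 48, "real": 40, "distinct_sources": 11},
    "ChatGPT-4o": {"total": 62, "real": 12, "distinct_sources": 6},
}


def reference_stats(total, real, distinct_sources):
    """Derive summary figures for one model's cited references.

    Hypothetical helper: returns the number of fabricated references,
    the share of real references in percent, and the mean number of
    real citations per distinct publication (an illustrative measure
    of how often the same source is repeated).
    """
    return {
        "fabricated": total - real,
        "accuracy_pct": round(100 * real / total, 2),
        "citations_per_source": round(real / distinct_sources, 2),
    }


def summarize(references=REFERENCES):
    """Apply reference_stats to every model in the mapping."""
    return {model: reference_stats(**counts) for model, counts in references.items()}


if __name__ == "__main__":
    for model, stats in summarize().items():
        print(model, stats)
```

Under these counts, ChatGPT-4o's 12 real references out of 62 give the 19.35% accuracy quoted above, and DeepSeek-v3's 40 out of 48 work out to about 83.33%.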

Discussion

In this study, the authors explored the diagnostic capabilities of two large language models (LLMs), ChatGPT-4o and DeepSeek-v3, in identifying oral lesions through hypothetical scenarios. The scenarios were designed to include essential clinical information without histopathological confirmation, and participating clinicians evaluated the LLM responses using a Likert scale. The findings indicated that while both models provided acceptable diagnostic suggestions, DeepSeek-v3 outperformed ChatGPT-4o in accuracy. Notably, the study highlighted the prevalence of “fake” references generated by both models, raising concerns about the reliability of AI-generated information in clinical settings.

The discussion emphasized the transformative potential of AI in healthcare, particularly in enhancing diagnostic processes and decision-making. However, it underscored the limitations of current LLMs, including their inability to consistently provide evidence-based medical information and the ethical implications of relying on AI for clinical decisions. The authors concluded that while LLMs can serve as valuable tools to assist clinicians, they should not replace human judgment. They called for further research to refine these technologies and to ensure that their integration into clinical practice is both effective and safe, and they advocated a collaborative approach between AI and healthcare professionals.

Limitations

The study has several limitations that may affect the validity of its findings. First, the sample was relatively small, comprising only 16 case scenarios, which could limit the generalizability of the results. To mitigate this, the researchers consulted 20 different experts individually. However, no oral pathologists took part, although they typically share clinical responsibilities with maxillofacial surgeons and radiologists, and their absence further constrains the study's comprehensiveness. Additionally, the criteria underlying the experts' ratings, such as the length and academic rigor of the answers, were not systematically assessed.

Another significant limitation is the subjective nature of the participants’ evaluations, which were based on personal knowledge and clinical experience rather than standardized criteria. This introduces variability in the assessment of the responses generated by large language models (LLMs), which, while capable of producing seemingly coherent answers, may rely on popular beliefs rather than evidence-based knowledge. Although previous research indicated a high repeatability rate (85.44%) in LLM-generated responses, this does not guarantee absolute consistency, potentially leading to confusion among users seeking reliable information. Furthermore, the diagnoses made by LLMs in the context of oral lesions were not corroborated by biopsy findings, raising concerns about accuracy. The risk of generating fictitious references also poses a challenge to the credibility of the information provided.