Multimodal large language models for oral lesion diagnosis: a systematic review of diagnostic performance and clinical utility

Journal: Frontiers in Oral Health, Volume: 7
DOI: https://doi.org/10.3389/froh.2026.1748450
PMID: https://pubmed.ncbi.nlm.nih.gov/41816099
Publication Date: 2026-02-24
Author(s): Fatma E. A. Hassanein et al.
Primary Topic: AI in cancer detection

Overview

The paper evaluates the diagnostic performance and clinical utility of large language models (LLMs) in identifying oral lesions, a task complicated by overlapping visual features and by reliance on histopathology. The authors systematically reviewed studies published up to July 2025 in which LLMs such as ChatGPT and DeepSeek were given text, image, or multimodal inputs for diagnosis. The analysis included 17 studies covering more than 1,200 cases. Reported diagnostic accuracy ranged widely, from 25% to 96%, depending on model version, input modality, and lesion complexity. Multimodal inputs improved performance, with Cohen’s κ values of 0.85 to 0.90, and advanced models approached expert-level performance in certain tasks.

Despite these advances, the findings indicate that LLMs do not consistently match expert diagnostic accuracy and should not be treated as autonomous diagnostic systems. Their main value is as adjunctive tools that support clinical reasoning and communication, particularly in structuring differential diagnoses and identifying critical features. The overall risk of bias in the included studies was assessed as low to moderate. The authors call for standardized methodologies, prospective validation, and real-world evaluations to better define the clinical role of LLMs in oral lesion assessment, and they advocate using LLMs to complement rather than replace expert judgment.
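The agreement statistic quoted in the overview, Cohen's κ, compares the observed agreement p_o between two raters with the agreement p_e expected by chance: κ = (p_o - p_e) / (1 - p_e). The following is a minimal Python sketch of that calculation for a model and an expert reference. The diagnosis labels and the six toy cases are invented for illustration and are not data from the review.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of cases where both raters give the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # and this sketch (an assumption) reports it as perfect agreement.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical toy data: model output vs. expert diagnosis.
# "OPMD" is shorthand here for an oral potentially malignant disorder.
model = ["OSCC", "benign", "benign", "OPMD", "OSCC", "benign"]
expert = ["OSCC", "benign", "OPMD", "OPMD", "OSCC", "benign"]
kappa = cohens_kappa(model, expert)
print(round(kappa, 3))
```

In this toy example the raters agree on 5 of 6 cases (p_o ≈ 0.83) while chance agreement is p_e ≈ 0.33, which gives κ = 0.75. A κ of 0.85 to 0.90, as reported for multimodal inputs, therefore indicates agreement well beyond what matching label frequencies alone would produce.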

Introduction

The introduction outlines the challenges of clinically diagnosing oral lesions, which range from benign conditions to potentially malignant disorders and oral squamous cell carcinoma (OSCC). Visual similarities among these lesions often lead to diagnostic ambiguity. This matters because delayed diagnosis is associated with advanced disease stage and poor survival. Current diagnostic methods such as histopathological analysis are invasive and resource-intensive, which underscores the need for rapid, accessible decision-support tools, particularly in primary care settings.

The paper discusses the role of artificial intelligence (AI), specifically deep learning and convolutional neural networks (CNNs), in improving diagnostic accuracy in dentistry. AI has shown promise in detecting various dental conditions and in distinguishing benign from malignant oral lesions. The emergence of large language models (LLMs) such as GPT-4 and Gemini represents a significant shift, because these models can synthesize clinical histories and generate explanations rather than merely classify images. LLMs can integrate multimodal data, including patient history and clinical descriptors, to provide differential diagnoses and management suggestions. Their application to oral lesion diagnosis is still at an early stage, however, and the existing literature lacks comprehensive evaluations of their diagnostic accuracy and clinical utility. This systematic review aims to fill that gap by assessing how well LLMs identify oral lesions and by exploring their limitations and biases, with attention to their reasoning capabilities and multimodal diagnostic potential.

Methods

The study is a systematic review rather than an experiment, so its methods concern how existing studies were identified and appraised. The review followed a registered protocol and covered studies published up to July 2025 that applied LLMs such as ChatGPT, Gemini, and DeepSeek to oral lesion diagnosis using text, image, or multimodal inputs. Studies were evaluated for diagnostic accuracy and agreement with experts. Primary outcomes were objective diagnostic performance metrics; secondary outcomes were subjective perceptions of model outputs.

The included studies were also assessed for risk of bias. Heterogeneity in prompting strategies, case presentation formats, lesion categorization schemes, and outcome definitions made quantitative meta-analysis challenging.

Results

The review included 17 studies covering more than 1,200 cases. Reported diagnostic accuracy ranged from 25% to 96%, varying with model version, input modality, and lesion complexity. Multimodal inputs were associated with better performance, with Cohen’s κ values of 0.85 to 0.90.

Newer models, particularly ChatGPT-4 and its variants, outperformed earlier models such as ChatGPT-3.5, and advanced models approached expert-level performance in certain tasks. Even so, LLMs did not consistently match expert accuracy, particularly in complex cases. The overall risk of bias across the included studies was rated low to moderate.

Discussion

The systematic review evaluates the diagnostic capabilities of large language models (LLMs) for oral lesions and highlights three key patterns: progressive performance improvements across model generations, substantial gains from multimodal input integration, and the limitations of LLMs as autonomous diagnostic systems. The review adhered to a registered protocol and included studies that applied various LLMs, such as ChatGPT and Gemini, across different input modalities (text, image, multimodal) to assess diagnostic accuracy and expert agreement. Primary outcomes focused on objective diagnostic performance metrics, while secondary outcomes addressed subjective perceptions of model outputs.

The findings indicate that newer models, particularly ChatGPT-4 and its variants, achieved higher diagnostic accuracy than earlier models such as ChatGPT-3.5, especially with multimodal inputs. However, performance varied considerably with model architecture, input type, and lesion complexity. LLMs showed potential as adjunctive tools for structuring differential diagnoses and supporting clinical reasoning, but they fell short of expert-level accuracy and reliability, particularly in complex cases. The review emphasizes the need for prospective validation studies to ensure the safe and effective integration of LLMs into clinical practice, and it cautions against overreliance on these models given their current limitations.
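Because performance depends on model version and input type, a single pooled accuracy figure can hide the differences described above. The following is a minimal sketch of reporting accuracy separately for each (model, modality) stratum. The record layout and every outcome in it are hypothetical placeholders, not results from the included studies.

```python
from collections import defaultdict

# Hypothetical per-case records: (model, input modality, was the diagnosis correct?)
records = [
    ("ChatGPT-3.5", "text", False),
    ("ChatGPT-3.5", "text", True),
    ("ChatGPT-4", "text", True),
    ("ChatGPT-4", "text", False),
    ("ChatGPT-4", "multimodal", True),
    ("ChatGPT-4", "multimodal", True),
]


def accuracy_by_stratum(rows):
    """Return {(model, modality): (correct, total, accuracy)}."""
    counts = defaultdict(lambda: [0, 0])  # stratum -> [correct, total]
    for model, modality, correct in rows:
        counts[(model, modality)][0] += int(correct)
        counts[(model, modality)][1] += 1
    return {key: (c, n, c / n) for key, (c, n) in counts.items()}


strata = accuracy_by_stratum(records)
for (model, modality), (c, n, acc) in sorted(strata.items()):
    print(f"{model:12s} {modality:10s} {c}/{n} = {acc:.0%}")
```

Keeping the raw counts next to each percentage makes it clear how few cases may sit behind a given stratum, which matters when individual studies are small.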

Limitations

The limitations section highlights significant methodological constraints and biases in the current evidence base. Many studies used curated benchmark datasets, synthetic case vignettes, or selectively chosen clinical examples, which introduce spectrum bias and case-enrichment effects that may artificially inflate diagnostic performance. The reference standards also varied considerably, from histopathology-confirmed diagnoses to expert consensus, leading to verification bias and subjective variability in results.

Furthermore, heterogeneity in prompting strategies, case presentation formats, lesion categorization schemes, and definitions of reported outcomes limits the comparability of findings across studies and makes quantitative meta-analysis challenging. These issues underscore the need for standardized evaluation frameworks, as advocated by emerging reporting and quality-assessment tools such as STARD-AI and QUADAS-AI, to improve the validity and reliability of future research in this domain.
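The spectrum bias and case-enrichment effects mentioned in this section can be made concrete with simple arithmetic. Overall accuracy is the case-mix-weighted average of per-category accuracy, so a test set dominated by textbook presentations scores higher than a clinic-like mix even when the model is unchanged. The sketch below assumes two invented lesion categories; all numbers are hypothetical and chosen only for illustration.

```python
def overall_accuracy(case_mix, accuracy_by_category):
    """Weighted average of per-category accuracy; case-mix shares must sum to 1."""
    if abs(sum(case_mix.values()) - 1.0) > 1e-9:
        raise ValueError("case-mix proportions must sum to 1")
    return sum(share * accuracy_by_category[cat] for cat, share in case_mix.items())


# Hypothetical per-category accuracy of one fixed model.
accuracy_by_category = {"classic": 0.90, "atypical": 0.50}

# Hypothetical case mixes: a curated benchmark vs. a consecutive clinic sample.
curated_mix = {"classic": 0.80, "atypical": 0.20}
clinic_mix = {"classic": 0.40, "atypical": 0.60}

curated = overall_accuracy(curated_mix, accuracy_by_category)
clinic = overall_accuracy(clinic_mix, accuracy_by_category)
print(f"curated benchmark: {curated:.2f}, clinic-like sample: {clinic:.2f}")
```

With these assumed values the same model scores 0.82 on the curated set and 0.66 on the clinic-like sample, which is why accuracy figures from enriched datasets may not transfer to routine practice.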