Performance of successive generative pretrained transformers (GPT) models in medical cases and board style questions

Journal: Scientific Reports, Volume: 16, Issue: 1
DOI: https://doi.org/10.1038/s41598-025-34939-8
PMID: https://pubmed.ncbi.nlm.nih.gov/41495183
Publication Date: 2026-01-06
Author(s): Anshum Patel et al.
Primary Topic: Artificial Intelligence in Healthcare and Education

Overview

This study evaluates six successive generative pre-trained transformer (GPT) models in the specialized domain of sleep medicine, focusing on their diagnostic and knowledge-based accuracy. The models assessed are GPT-3.5 Turbo, GPT-4 Turbo, GPT-4o, GPT-4.1, GPT-o3, and GPT-5, tested on two datasets: 78 clinical case vignettes for diagnostic reasoning and 897 multiple-choice questions (MCQs) for domain knowledge. Diagnostic accuracy improved markedly across model generations: GPT-3.5 Turbo achieved 74.4% on the clinical vignettes, while GPT-o3 and GPT-5 reached 93.6% and 91.0%, respectively. MCQ performance likewise improved from 56.9% for GPT-3.5 Turbo to 93.0% for GPT-5. Statistical analyses confirmed that the accuracy gains of the more recent models were significant (P < 0.05), indicating strong performance on these clinical tasks. The study concludes that, although successive models show substantial improvements in diagnostic capability, their convergence toward high accuracy suggests that they may be nearing their performance limits on sleep medicine tasks. Reaching clinical-grade reliability may therefore require curated medical datasets and training specialized for the medical domain.

Introduction

The introduction of the “self-attention” mechanism, integral to the transformer architecture proposed by Vaswani et al. in 2017, has catalyzed remarkable advancements in the training and performance of large language models (LLMs). This innovation has paved the way for the development of notable models, including OpenAI’s Chat Generative Pre-Trained Transformer (ChatGPT) series, which gained significant attention following the initial release of the Generative Pre-trained Transformer (GPT) in 2018.

Subsequent iterations, such as GPT-2, GPT-3, GPT-3.5, and the latest models GPT-4 and GPT-5, have demonstrated increasingly sophisticated capabilities, reflecting the rapid evolution of deep learning technologies in the field of natural language processing. These advancements underscore the transformative impact of self-attention mechanisms on the scalability and effectiveness of LLMs.

Methods

The study compared six GPT models (GPT-3.5 Turbo, GPT-4 Turbo, GPT-4o, GPT-4.1, GPT-o3, and GPT-5) on two text-based sleep medicine datasets: 78 clinical case vignettes, used to assess diagnostic reasoning, and 897 multiple-choice questions (MCQs), used to assess domain knowledge. Each model's accuracy was measured on both datasets.

Differences in accuracy between models were analyzed statistically, with significant results reported at P < 0.05.
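
This summary does not name the statistical tests the authors used. As one illustration of how two models answering the same set of items could be compared, the sketch below implements an exact McNemar test on paired correct/incorrect flags. The function name `mcnemar_exact` and the ten-item toy data are hypothetical and are not taken from the study.

```python
from math import comb

def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired per-item correctness flags."""
    # Discordant pairs: items where exactly one of the two models was correct.
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the models never disagree, so there is no evidence of a difference
    k = min(only_a, only_b)
    # Binomial(n, 0.5) lower-tail probability, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)

# Hypothetical toy data (NOT from the study): 1 = correct, 0 = incorrect.
older = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
newer = [1, 1, 1, 1, 1, 1, 0, 1, 1, 1]

p_value = mcnemar_exact(older, newer)
print(f"exact McNemar p = {p_value:.4f}")
```

A paired test is a natural choice here because every model sees the same vignettes and questions, so only the items on which two models disagree carry information about which one is better.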

Results

The six GPT models were evaluated on their ability to provide accurate diagnoses for the clinical vignettes. Performance varied significantly across the vignettes, with some models achieving higher accuracy in specific clinical scenarios while others struggled with more complex cases.

On the 78 vignettes, the reported accuracies were 74.4% for GPT-3.5 Turbo, 73.1% for GPT-4 Turbo, 78.2% for GPT-4o, 93.6% for GPT-o3, and 91.0% for GPT-5, with notable discrepancies in performance depending on the complexity of the cases presented. On the 897 MCQs, accuracy rose from 56.9% for GPT-3.5 Turbo to 93.0% for GPT-5. These findings underscore the potential of GPT models in clinical decision-making, while also highlighting the need for further refinement to enhance their diagnostic capabilities in sleep medicine.
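
The percentages reported in this summary can be converted back into approximate counts of correct answers, which makes the small size of the vignette set easier to appreciate. The sketch below does that and adds a 95% Wilson score interval for each model. The interval is an illustrative choice made here, not something the summary attributes to the authors, and the helper names are hypothetical.

```python
import math

VIGNETTE_N = 78
MCQ_N = 897

# Accuracies (percent correct) as reported in this summary.
vignette_acc = {
    "GPT-3.5 Turbo": 74.4,
    "GPT-4 Turbo": 73.1,
    "GPT-4o": 78.2,
    "GPT-o3": 93.6,
    "GPT-5": 91.0,
}

def implied_correct(percent, n):
    """Nearest whole number of correct answers consistent with a rounded percentage."""
    return round(percent / 100 * n)

def wilson_interval(correct, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = correct / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

for name, pct in vignette_acc.items():
    k = implied_correct(pct, VIGNETTE_N)
    low, high = wilson_interval(k, VIGNETTE_N)
    print(f"{name}: {k}/{VIGNETTE_N} vignettes, 95% CI {low:.1%} to {high:.1%}")
```

With only 78 vignettes the intervals are wide, so small differences between models on this dataset should be read with caution.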

Discussion

In this study, we evaluated six successive generations of GPT models in sleep medicine, focusing on their diagnostic accuracy and foundational knowledge through clinical vignettes and multiple-choice questions (MCQs). Diagnostic accuracy trended upward overall: the best-performing model on the vignettes, GPT-o3, reached 93.6%, significantly outperforming earlier models such as GPT-4 Turbo (73.1%) and GPT-4o (78.2%). MCQ performance improved similarly, with GPT-5 achieving 93.0% accuracy compared with 56.9% for GPT-3.5 Turbo. These findings underscore the progressive improvement in the clinical reasoning and knowledge retrieval of newer models.

Despite these advancements, the study highlights a convergence in accuracy among the latest models, suggesting that further improvements may require specialized training approaches rather than relying solely on generalist data. The observed performance gains raise important considerations for the integration of these models into clinical workflows, particularly regarding ethical challenges such as data bias and patient privacy. Limitations of the study include the use of text-based vignettes that may not fully capture the complexities of real-world diagnostics and the lack of assessment of reasoning quality or response consistency. Future research should address these gaps and explore the potential of multimodal data and advanced prompting techniques to enhance model performance in clinical settings.
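
To make the convergence among the latest models concrete, the following sketch expresses accuracy gaps both in percentage points and in whole vignettes, and computes the relative reduction in MCQ errors between the oldest and newest model. It uses only the figures reported in this summary. The function names are hypothetical, and the arithmetic is an illustration rather than an analysis from the paper.

```python
VIGNETTES = 78

# Vignette accuracies (percent) as reported in this summary.
reported = {"GPT-4o": 78.2, "GPT-o3": 93.6, "GPT-5": 91.0}

def gap_in_cases(pct_a, pct_b, n=VIGNETTES):
    """Return the accuracy gap in percentage points and in whole cases."""
    gap_pp = pct_a - pct_b
    return gap_pp, round(gap_pp / 100 * n)

def relative_error_reduction(pct_old, pct_new):
    """Fraction of the older model's errors that the newer model eliminates."""
    old_err = 100 - pct_old
    new_err = 100 - pct_new
    return (old_err - new_err) / old_err

# One vignette expressed in percentage points of accuracy.
step_pp = 100 / VIGNETTES

top_gap_pp, top_gap_cases = gap_in_cases(reported["GPT-o3"], reported["GPT-5"])
jump_pp, jump_cases = gap_in_cases(reported["GPT-o3"], reported["GPT-4o"])
mcq_reduction = relative_error_reduction(56.9, 93.0)

print(f"one vignette = {step_pp:.2f} percentage points")
print(f"GPT-o3 vs GPT-5: {top_gap_pp:.1f} points, about {top_gap_cases} vignettes")
print(f"GPT-o3 vs GPT-4o: {jump_pp:.1f} points, about {jump_cases} vignettes")
print(f"MCQ errors removed, GPT-3.5 Turbo to GPT-5: {mcq_reduction:.1%}")
```

Seen this way, the gap between the two strongest models amounts to about two vignettes, while the gap between GPT-o3 and GPT-4o is about twelve.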