Towards generalist foundation model for radiology by leveraging web-scale 2D&3D medical data

Journal: Nature Communications, Volume: 16, Issue: 1
DOI: https://doi.org/10.1038/s41467-025-62385-7
PMID: https://pubmed.ncbi.nlm.nih.gov/40849424
Publication Date: 2025-08-23
Author(s): Chaoyi Wu et al.
Primary Topic: AI in cancer detection

Overview

In this study, we introduce the Radiology Foundation Model (RadFM), a proof of concept for generalist medical foundation models in radiology. Our approach covers three aspects: dataset construction, model design, and comprehensive evaluation. We contribute four multimodal datasets comprising 13 million 2D images and 615,000 3D scans. Combined with a large collection of existing datasets, they form the training corpus for RadFM, termed the Medical Multi-modal Dataset (MedMD).

We propose a novel architecture that integrates text inputs with 2D and 3D medical scans, enabling the model to perform a variety of radiologic tasks, such as diagnosis, visual question answering, report generation, and rationale diagnosis. To evaluate RadFM, we assess its performance on nine established datasets and introduce a new benchmark, RadBench, which includes three tasks designed to evaluate foundation models comprehensively. Our findings indicate that RadFM outperforms existing multimodal foundation models, including GPT-4V. When adapted to public benchmarks, it also surpasses several state-of-the-art models. This work addresses critical challenges in the field, including the scarcity of multimodal datasets, the need for a unified architectural framework, and the absence of effective benchmarks to monitor progress in medical foundation model development.

Introduction

In this section, the authors discuss the implementation of domain-specific instruction tuning using the RadMD dataset, which comprises over 3 million radiologic images paired with high-quality language instructions and responses. To build it, the authors first exclude the datasets that do not originate from real clinical scenarios, namely PMC-Inline and PMC-OA. They then keep only radiology content by filtering non-radiologic images out of the remaining sources: the MPx-series, RP3D-series, and PMC-CaseReport datasets.

The filtering process for the MPx-series and RP3D-series is straightforward, as these datasets provide imaging modalities linked to each case. For the PMC-CaseReport dataset, which is derived from case reports using ChatGPT, the authors utilize image captions to identify relevant cases, retaining only those that explicitly mention radiology terms such as “MRI,” “CT,” “X-ray,” “ultrasound,” or “mammography.” The authors acknowledge the potential presence of noisy cases in the dataset and note that the evaluation dataset, RadBench, undergoes additional manual inspection to ensure the quality of the selected test cases.
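The caption-based filtering described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `caption` field name, the case-insensitive matching, and the word-boundary rule are assumptions, and only the five terms quoted in the text are used.

```python
import re

# Radiology terms quoted in the text. The matching rules below are assumptions.
RADIOLOGY_TERMS = ["MRI", "CT", "X-ray", "ultrasound", "mammography"]

# Require non-alphanumeric boundaries so that "CT" does not match inside
# ordinary words such as "actual".
_PATTERN = re.compile(
    r"(?<![A-Za-z0-9])("
    + "|".join(re.escape(term) for term in RADIOLOGY_TERMS)
    + r")(?![A-Za-z0-9])",
    re.IGNORECASE,
)


def mentions_radiology(caption: str) -> bool:
    """Return True if the caption explicitly names one of the radiology terms."""
    return _PATTERN.search(caption) is not None


def filter_cases(cases):
    """Keep only cases whose (hypothetical) 'caption' field mentions a radiology term."""
    return [case for case in cases if mentions_radiology(case["caption"])]


# Made-up example captions, for illustration only.
example_cases = [
    {"id": 1, "caption": "Axial CT of the chest showing a nodule."},
    {"id": 2, "caption": "Histology slide with H&E staining."},
    {"id": 3, "caption": "Chest x-ray on admission."},
    {"id": 4, "caption": "The actual specimen after resection."},
]
kept_ids = [case["id"] for case in filter_cases(example_cases)]
```

A keyword filter of this kind is cheap but imperfect, which is consistent with the authors' note that noisy cases may remain and that RadBench is additionally inspected by hand.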

Methods

In this section, the authors outline the methodology employed in their study, which is primarily based on data sourced from open-access websites, as detailed in Supplementary Table 5. The ethical considerations surrounding the use of this data are dictated by the original data-uploading processes specified in each dataset’s collection pipeline, with further information accessible via the respective dataset websites listed in the supplementary material.

A significant portion of the dataset is derived from Radiopaedia, a peer-reviewed, open-edit platform dedicated to radiology resources. Radiopaedia aims to provide a comprehensive and freely accessible radiology reference. The authors have secured non-commercial use permissions from various contributors as well as from the founder of Radiopaedia, ensuring compliance with the ethical guidelines outlined in Radiopaedia’s privacy policy.

Results

In this section, the authors present the evaluation results of their proposed model, RadFM, across nine public medical datasets and their newly developed RadBench, which includes tasks such as medical Visual Question Answering (VQA), report generation, and rationale diagnosis. The model’s performance is compared to existing multimodal foundation models, including Open-flamingo, MedVInT, LLaVA-Med, and MedFlamingo, under both zero-shot and few-shot prompting conditions. Notably, RadFM outperforms these models by a clear margin, reaching accuracies of 59.96%, 68.82%, 56.32%, 83.62%, and 72.95% on five diagnosis datasets, whereas the existing models score near random chance (approximately 50%).
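To put the reported accuracies in perspective, the short calculation below computes each dataset's margin over the approximate 50% chance level. It uses only the five figures quoted above, in the order listed. The unweighted mean is a convenience for this summary and is not a number reported by the authors.

```python
# Accuracies (%) quoted in the text for the five diagnosis datasets, in order.
radfm_accuracy = [59.96, 68.82, 56.32, 83.62, 72.95]

# Approximate chance level cited in the text.
CHANCE = 50.0

# Margin of each reported accuracy over chance, in percentage points.
margins = [round(acc - CHANCE, 2) for acc in radfm_accuracy]

# Unweighted mean across the five datasets (illustrative only).
mean_accuracy = round(sum(radfm_accuracy) / len(radfm_accuracy), 2)

print(margins)        # margins over chance, per dataset
print(mean_accuracy)  # unweighted mean accuracy
```

The margins range from about 6 to about 34 percentage points, so the size of the improvement varies considerably from one dataset to another.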

Further analysis reveals that RadFM excels in long sentence generation tasks, such as medical VQA and report generation, demonstrating consistent improvements across various clinical tasks and imaging modalities. The authors also conduct task-specific fine-tuning experiments, showing that RadFM enhances diagnosis results and text generation quality when adapted to specific datasets. Qualitative results indicate that while RadFM effectively generates coherent responses and identifies underlying diseases, it occasionally struggles with nuanced distinctions in complex cases. Overall, the findings suggest that RadFM provides a robust foundation for advancing medical AI applications, particularly in the context of diverse imaging modalities and clinical tasks.

Discussion

The discussion section of the paper evaluates RadFM across medical tasks, including medical visual question answering (VQA), report generation, and rationale diagnosis. Ablation studies reveal that domain-specific instruction tuning on the RadMD dataset is essential for effective model performance. The introduction of newly collected datasets, such as PMC-Inline and RP3D, significantly enhances the model’s capabilities across diverse clinical tasks. The results indicate that RadFM outperforms existing models, including MedVInT, particularly in handling complex 3D medical scans, where it achieves higher UMLS_Precision and UMLS_Recall scores.
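As a rough illustration of what precision and recall over medical terms measure, the sketch below compares two sets of terms. It assumes the medical terms have already been extracted from the generated and reference texts and mapped to a common vocabulary such as UMLS. The function name and the exact definition are illustrative assumptions and may differ from how the paper computes UMLS_Precision and UMLS_Recall.

```python
def term_precision_recall(predicted_terms, reference_terms):
    """Set-overlap precision and recall between predicted and reference terms.

    Assumes both inputs are iterables of already-extracted medical terms.
    Matching is by lowercased string equality, which is a simplification.
    """
    predicted = {term.lower() for term in predicted_terms}
    reference = {term.lower() for term in reference_terms}
    overlap = predicted & reference

    # Precision: share of predicted terms that also appear in the reference.
    precision = len(overlap) / len(predicted) if predicted else 0.0
    # Recall: share of reference terms that the prediction recovered.
    recall = len(overlap) / len(reference) if reference else 0.0
    return precision, recall


# Made-up example: the generated report names one finding the reference lacks.
precision, recall = term_precision_recall(
    ["Pneumonia", "effusion", "cardiomegaly"],
    ["pneumonia", "effusion"],
)
```

In this made-up example the prediction recovers both reference terms (recall of 1.0) but adds one unsupported term (precision of 2/3), which is the kind of over-generation that precision penalizes.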

RadFM’s architecture allows it to process both 2D and 3D radiologic images, addressing a critical gap in existing models, which typically handle only one image dimensionality (2D or 3D). The model’s ability to support multi-image inputs and interleaved text formats enhances its applicability in clinical settings, facilitating more nuanced decision-making. Despite its advancements, the paper acknowledges limitations, such as the need for improved long-form text generation and the challenges posed by the limited availability of 3D images in the training dataset. Future work is suggested to refine the model’s capabilities and address these limitations, ultimately aiming to establish RadFM as a robust tool in medical AI applications.
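One way to picture interleaved multi-image input with mixed 2D and 3D scans is sketched below. It is a conceptual illustration only, not RadFM's implementation: the `Scan` class, the `<image-N>` placeholder format, and the choice to treat a 2D image as a 3D volume of depth 1 are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Scan:
    """Hypothetical stand-in for an image: only its shape is tracked here."""
    shape: Tuple[int, ...]  # (H, W) for 2D, (H, W, D) for 3D


def unified_shape(scan: Scan) -> Tuple[int, int, int]:
    """Return a 3D shape for any scan, treating a 2D image as depth 1 (assumption)."""
    if len(scan.shape) == 2:
        height, width = scan.shape
        return (height, width, 1)
    if len(scan.shape) == 3:
        height, width, depth = scan.shape
        return (height, width, depth)
    raise ValueError(f"Expected a 2D or 3D scan, got shape {scan.shape}")


def build_prompt(segments: List[Union[str, Scan]]):
    """Interleave text and scans: each scan becomes a numbered placeholder.

    Returns the prompt string and the list of unified scan shapes, in order.
    The '<image-N>' placeholder is a hypothetical convention for this sketch.
    """
    parts: List[str] = []
    shapes: List[Tuple[int, int, int]] = []
    for segment in segments:
        if isinstance(segment, Scan):
            shapes.append(unified_shape(segment))
            parts.append(f"<image-{len(shapes)}>")
        else:
            parts.append(segment)
    return " ".join(parts), shapes


prompt, shapes = build_prompt(
    ["Compare", Scan((512, 512)), "with", Scan((256, 256, 64)), "and describe the change."]
)
```

Giving every scan the same rank means a single visual pathway can, in principle, accept both kinds of input, while the placeholders record where each scan sits relative to the surrounding text.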