CXR-LLaVA: a multimodal large language model for interpreting chest X-ray images

Journal: European Radiology, Volume: 35, Issue: 7
DOI: https://doi.org/10.1007/s00330-024-11339-6
PMID: https://pubmed.ncbi.nlm.nih.gov/39812665
Publication Date: 2025-01-15
Author(s): Seowoo Lee et al.
Primary Topic: COVID-19 diagnosis using AI

Overview

This study presents the development of CXR-LLaVA, an open-source multimodal large language model designed for interpreting chest X-ray (CXR) images. The model leverages recent advancements in large language models (LLMs) to emulate the image interpretation capabilities of human radiologists. Training involved a substantial dataset of 592,580 CXRs, with 374,881 labeled for specific radiographic abnormalities and 217,699 accompanied by free-text radiology reports. A vision transformer was pretrained on the labeled dataset and subsequently integrated with an LLM based on the LLaVA architecture, followed by fine-tuning with the report dataset.

CXR-LLaVA achieved an average F1 score of 0.81 for six major pathological findings on the MIMIC internal test set and 0.56 on an external test set, outperforming both GPT-4-vision and Gemini-Pro-Vision. In evaluations by human radiologists, the model attained a 72.7% success rate in autonomous reporting, compared to 84.0% for ground truth reports. These findings underscore the potential of multimodal LLMs in CXR interpretation and suggest that autonomously generated reports could improve radiologist efficiency, although the authors acknowledge the model's current performance limitations. The authors advocate for the open-source release of the model to stimulate further research and improve its clinical applicability.

Introduction

The introduction highlights significant advancements in deep learning, particularly convolutional neural networks (CNNs) and vision transformers (ViTs), which have transformed the field of radiology. While these models excel at specific tasks such as classification and segmentation, their specialization may limit their effectiveness on more complex challenges within radiology. Concurrently, breakthroughs in natural language processing (NLP) have led to large language models (LLMs) capable of sophisticated text understanding and generation. Combining NLP with image processing has produced multimodal models, such as contrastive language-image pre-training (CLIP) and bootstrapping language-image pre-training (BLIP-2), which can interpret images and generate contextual captions.

Despite the emergence of general-purpose multimodal models like GPT-4-vision and Gemini-Pro-Vision, their efficacy in interpreting chest X-rays (CXRs) remains uncertain. While some models, such as Google’s ELIXR and the open-source LLaVA-MED, have been developed for medical applications, detailed evaluations of their performance on CXRs are scarce. Given the increasing workload on radiologists and its potential impact on diagnostic accuracy, this study aims to leverage a multimodal LLM for generating radiologic reports for CXRs, exploring its capabilities for autonomous CXR reporting. A preliminary version of this research has been made publicly available as a preprint on arXiv.

Methods

In this retrospective study, the authors used publicly available datasets, so institutional review board approval was not required. Relying on existing data sources avoids the ethical review typically associated with primary data collection and makes the findings easier to reproduce, because the same data are accessible to other researchers.

Results

CXR-LLaVA achieved an average F1 score of 0.81 for six major pathological findings on the MIMIC internal test set, with a sensitivity of 0.80 and a specificity of 0.89, and an average F1 score of 0.56 on an external test set, outperforming both GPT-4-vision and Gemini-Pro-Vision.

In the evaluation by human radiologists, the model attained a 72.7% success rate in autonomous reporting, compared to 84.0% for ground truth reports. The findings are supported by p-values less than 0.05, indicating that the results are statistically significant.

Discussion

In this study, a multimodal large language model (CXR-LLaVA) was developed to detect major pathological findings in chest X-rays (CXRs) and generate corresponding radiologic reports. The model was trained on a dataset of 592,580 frontal CXRs, comprising both images with free-text radiologic reports and images labeled for various abnormalities. A custom image encoder based on the ViT-L/16 architecture was created to improve the model's ability to interpret radiographic images, and it was then fine-tuned together with a large language model (LLaMA-2) to generate coherent text responses from image inputs.
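To make the encoder description more concrete, the short sketch below counts how many image patch tokens a ViT-style encoder passes on to the language model. The "/16" in ViT-L/16 denotes 16×16-pixel patches. The 512×512 input resolution used in the example is an assumption for illustration and is not stated in this summary; the function name `patch_token_count` is also hypothetical. Any extra class token an encoder may prepend is not counted.

```python
def patch_token_count(height: int, width: int, patch_size: int = 16) -> int:
    """Number of non-overlapping patch tokens a ViT produces for one image.

    Assumes the image dimensions are exact multiples of the patch size,
    as is usual after resizing the input for a ViT encoder.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be multiples of the patch size")
    return (height // patch_size) * (width // patch_size)


# Assumed input resolution (illustrative only, not taken from the summary).
tokens = patch_token_count(512, 512)
print(tokens)  # 32 * 32 = 1024 patch tokens
```

In a LLaVA-style design, each of these patch embeddings is projected into the language model's embedding space and placed alongside the text tokens, so the patch count directly affects the length of the sequence the LLM has to process.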

The performance evaluation revealed that CXR-LLaVA achieved an average F1 score of 0.81, sensitivity of 0.80, and specificity of 0.89 on the internal MIMIC test set, outperforming other general-purpose multimodal models like GPT-4-vision and Gemini-Pro-Vision. However, the model exhibited limitations, particularly in identifying certain pathologies such as pneumothorax and consolidation, and its diagnostic performance was inferior to that of human radiologists. The study also highlighted the challenges of autonomous reporting, with a success rate of 72.7% for reports deemed acceptable with minimal revisions. The authors acknowledge the need for further validation and larger-scale studies to ensure the model’s clinical applicability, while emphasizing its potential to assist radiologists and improve patient outcomes through enhanced reporting capabilities.
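For readers less familiar with the metrics quoted above, the following minimal sketch shows how F1 score, sensitivity, and specificity can be computed from per-finding confusion counts and then averaged across findings. The counts, the finding names, and the choice of an unweighted (macro) average are illustrative assumptions; they are not the study's data, and the summary does not state how the authors averaged.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives


def sensitivity(c: Counts) -> float:
    """Fraction of truly positive cases that were detected."""
    denom = c.tp + c.fn
    return c.tp / denom if denom else 0.0


def specificity(c: Counts) -> float:
    """Fraction of truly negative cases that were correctly ruled out."""
    denom = c.tn + c.fp
    return c.tn / denom if denom else 0.0


def f1(c: Counts) -> float:
    """Harmonic mean of precision and recall, written in count form."""
    denom = 2 * c.tp + c.fp + c.fn
    return 2 * c.tp / denom if denom else 0.0


def macro_average(metric, per_finding: dict) -> float:
    """Unweighted mean of a metric over all findings (an assumed choice)."""
    return sum(metric(c) for c in per_finding.values()) / len(per_finding)


# Hypothetical counts for two placeholder findings (not study data).
example = {
    "finding_a": Counts(tp=80, fp=10, fn=20, tn=90),
    "finding_b": Counts(tp=60, fp=20, fn=40, tn=80),
}

for name, metric in [("F1", f1), ("sensitivity", sensitivity), ("specificity", specificity)]:
    print(f"{name}: {macro_average(metric, example):.3f}")
```

A gap between sensitivity and specificity, such as the 0.80 versus 0.89 reported on the internal test set, indicates that a model misses true findings more often than it raises false alarms.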