The limits of fair medical imaging AI in real-world generalization

Journal: Nature Medicine, Volume: 30, Issue: 10
DOI: https://doi.org/10.1038/s41591-024-03113-4
PMID: https://pubmed.ncbi.nlm.nih.gov/38942996
Publication Date: 2024-06-28
Author(s): Yuzhe Yang et al.
Primary Topic: Artificial Intelligence in Healthcare and Education

Overview

This study investigates how artificial intelligence (AI) in medical imaging may exacerbate healthcare disparities by relying on demographic shortcuts in disease classification. Previous research has shown that AI models can infer demographic information from medical images, raising concerns about the fairness of predictions across subpopulations. The analysis spans three medical imaging fields (radiology, dermatology, and ophthalmology) and incorporates data from six global chest X-ray datasets. The authors confirm that AI models do leverage demographic encodings, which can lead to biased predictions that particularly affect marginalized groups.

The authors found that while algorithmic correction of demographic shortcuts can yield ‘locally optimal’ models within the training data distribution, these models may not remain fair in external test settings. Notably, models that encode less demographic information tend to be more ‘globally optimal’, showing better fairness when evaluated in new test environments. The study underscores the need for best practices in developing medical imaging AI that maintains both performance and fairness across populations and clinical environments, and for addressing bias before AI is deployed in healthcare.

Methods

The authors evaluate disease classification performance with four metrics: the area under the receiver operating characteristic curve (AUROC), the true positive rate (TPR), the true negative rate (TNR), and the expected calibration error (ECE). TPR and TNR follow the standard definitions, where TP, FN, TN, and FP are the counts of true positives, false negatives, true negatives, and false positives:

\[
\text{TPR} = \frac{TP}{TP + FN}, \quad \text{TNR} = \frac{TN}{TN + FP}
\]
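
The two rates above can be computed directly from confusion-matrix counts. The following is a minimal sketch in plain Python; the function name `tpr_tnr` and the toy labels are illustrative assumptions and do not come from the study.

```python
def tpr_tnr(y_true, y_pred):
    """Return (TPR, TNR) for binary labels and binary predictions (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    # Guard against a class that is absent from the sample.
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")
    return tpr, tnr


# Toy data for illustration only (not from the study).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tpr, tnr = tpr_tnr(y_true, y_pred)
print(f"TPR={tpr:.2f}, TNR={tnr:.2f}")  # prints "TPR=0.75, TNR=0.75"
```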

The decision threshold used for sensitivity (TPR) and specificity (TNR) was chosen to maximize the F1 score, separately for each combination of dataset, task, algorithm, and attribute. The authors also report 95% confidence intervals for these metrics. ECE was computed with the netcal library and indicates how well the models are calibrated.
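
As a rough illustration of the two steps described above, the sketch below picks an F1-maximizing threshold and computes a simple binned ECE. It is written in plain Python under stated assumptions: the helper names, the 10-bin default, and the positive-class formulation of ECE are choices made here for illustration. It is not netcal's implementation and may differ from it in binning details.

```python
def f1_score(y_true, y_pred):
    """F1 = 2TP / (2TP + FP + FN); returns 0.0 when undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(y_true, scores):
    """Return (threshold, F1), predicting 1 when score >= threshold."""
    best_t, best_f1 = None, -1.0
    for t in sorted(set(scores)):  # candidate thresholds: observed scores
        preds = [1 if s >= t else 0 for s in scores]
        f1 = f1_score(y_true, preds)
        if f1 > best_f1:  # ties keep the lowest threshold
            best_t, best_f1 = t, f1
    return best_t, best_f1


def expected_calibration_error(y_true, scores, n_bins=10):
    """Binned ECE on positive-class probabilities in [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for t, s in zip(y_true, scores):
        idx = min(int(s * n_bins), n_bins - 1)  # score 1.0 goes in last bin
        bins[idx].append((t, s))
    n = len(scores)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        observed = sum(t for t, _ in b) / len(b)   # fraction of positives
        predicted = sum(s for _, s in b) / len(b)  # mean predicted score
        ece += (len(b) / n) * abs(observed - predicted)
    return ece


# Toy data for illustration only (not from the study).
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
print(best_f1_threshold(y_true, scores))
print(expected_calibration_error(y_true, scores))
```

In the study, this selection is repeated per dataset, task, algorithm, and attribute; the sketch covers a single such combination.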

To evaluate fairness, the authors computed these metrics separately for each demographic group and compared them across groups. They focus on equality of TPR and TNR between groups, a criterion known as equalized odds (also called equal odds). They also note that false positives (FP) and false negatives (FN) carry different costs depending on the task. For ‘No Finding’ predictions, FPs are deemed more costly, since a false positive there means a patient with a finding is predicted to be healthy. The analysis further examines per-group ECE and the ECE gap between groups, because calibration disparities can cause treatment inequities: miscalibrated risk estimates may lead to under-treatment or over-treatment of a group.
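
The group-wise comparison described above can be sketched as follows. This is an illustrative example in plain Python, not the authors' code: the function names, the group labels "A" and "B", and the use of the max-minus-min difference as the gap are assumptions made here.

```python
def rates(y_true, y_pred):
    """Return (TPR, TNR); NaN if the corresponding class is absent."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")
    return tpr, tnr


def group_gaps(y_true, y_pred, groups):
    """Per-group (TPR, TNR) plus the largest between-group gap in each rate."""
    per_group = {}
    for g in sorted(set(groups)):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        per_group[g] = rates([y_true[i] for i in idx], [y_pred[i] for i in idx])
    tprs = [r[0] for r in per_group.values()]
    tnrs = [r[1] for r in per_group.values()]
    return per_group, max(tprs) - min(tprs), max(tnrs) - min(tnrs)


# Toy data for illustration only (not from the study).
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
per_group, tpr_gap, tnr_gap = group_gaps(y_true, y_pred, groups)
print(per_group)         # {'A': (1.0, 0.5), 'B': (0.5, 1.0)}
print(tpr_gap, tnr_gap)  # 0.5 0.5
```

Equalized odds holds when both gaps are zero. The same grouping pattern applies to ECE: compute it per group and take the difference.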

Discussion

The study highlights significant fairness gaps in deep learning models for medical imaging, in chest X-ray (CXR) prediction as well as other modalities. Using a transfer learning approach, the authors show that models trained for disease classification inadvertently encode demographic attributes such as age, race, and sex, and that performance differs across these groups. Gaps in false negative rate (FNR) and false positive rate (FPR) can be as high as 30% for certain demographic subgroups, indicating a critical need to address algorithmic bias in clinical AI applications.

The findings further show that while mitigating demographic shortcuts can improve fairness in in-distribution (ID) settings, these improvements do not necessarily carry over to out-of-distribution (OOD) settings. The study identifies a complex interplay between fairness and distribution shift: models that are fair in ID settings may exhibit poor fairness under OOD conditions. The authors also propose model selection strategies that prioritize minimal encoding of demographic attributes, with the aim of obtaining globally optimal models that stay fair across diverse datasets. Overall, the work underscores the importance of comprehensive evaluation in medical AI to ensure equitable outcomes for all demographic groups, particularly in real-world clinical environments.
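
One way to picture a selection rule that prioritizes low demographic encoding is sketched below. Everything here is a hypothetical illustration rather than the authors' procedure: the `Candidate` fields, the use of a probe AUROC as the measure of attribute encoding, the tolerance `max_auroc_drop=0.02`, and the example numbers are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    id_auroc: float    # disease AUROC on in-distribution validation data
    attr_auroc: float  # AUROC of a probe predicting the demographic
                       # attribute from the model's features (assumed proxy)


def select_least_encoding(candidates, max_auroc_drop=0.02):
    """Among models close to the best ID AUROC, pick the one whose
    features encode the demographic attribute the least."""
    best_auroc = max(c.id_auroc for c in candidates)
    eligible = [c for c in candidates if c.id_auroc >= best_auroc - max_auroc_drop]
    return min(eligible, key=lambda c: c.attr_auroc)


# Made-up numbers for illustration only (not results from the study).
candidates = [
    Candidate("model_a", id_auroc=0.85, attr_auroc=0.90),
    Candidate("model_b", id_auroc=0.84, attr_auroc=0.75),
    Candidate("model_c", id_auroc=0.80, attr_auroc=0.60),
]
print(select_least_encoding(candidates).name)  # prints "model_b"
```

The tolerance expresses a design choice: model_c encodes the attribute least but is excluded because it gives up too much ID performance, so the rule trades a small loss in AUROC for lower attribute encoding.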