Nationwide real-world implementation of AI for cancer detection in population-based mammography screening

Journal: Nature Medicine, Volume: 31, Issue: 3
DOI: https://doi.org/10.1038/s41591-024-03408-6
PMID: https://pubmed.ncbi.nlm.nih.gov/39775040
Publication Date: 2025-01-07
Author(s): Nora Eisemann et al.
Primary Topic: AI in cancer detection

Overview

The PRAIM study is a multicenter, observational, noninferiority study comparing AI-supported double reading with standard double reading in mammography screening. It was conducted at 12 sites in Germany from July 2021 to February 2023 with 119 participating radiologists and involved 463,094 women aged 50-69, of whom 260,739 were screened with AI support. The AI-supported group achieved a breast cancer detection rate of 6.7 per 1,000, which was 17.6% higher than the control group’s rate of 5.7 per 1,000 (95% confidence interval: +5.7%, +30.8%). The recall rate in the AI group was 37.4 per 1,000, slightly lower than the control group’s 38.3 per 1,000, demonstrating noninferiority (percentage difference: -2.5%; confidence interval: -6.5%, +1.7%).
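The relative differences quoted above can be approximated from the rounded rates. The following is a minimal sketch, not the study's analysis: the function name is illustrative, and because the published figures (17.6% and -2.5%) come from the study's regression model and unrounded rates, the values computed here (about +17.5% and -2.3%) differ slightly.

```python
def relative_difference(rate_ai: float, rate_control: float) -> float:
    """Relative difference of the AI rate versus the control rate, in percent."""
    return (rate_ai - rate_control) / rate_control * 100.0


# Rounded rates per 1,000 women, as quoted in the overview.
bcdr_diff = relative_difference(6.7, 5.7)      # breast cancer detection rate
recall_diff = relative_difference(37.4, 38.3)  # recall rate

print(f"BCDR relative difference:   {bcdr_diff:+.1f}%")
print(f"Recall relative difference: {recall_diff:+.1f}%")
```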

The positive predictive value (PPV) of recall was also higher in the AI group, at 17.9% versus 14.9% in the control group, as was the PPV of biopsy, at 64.5% versus 59.2%. These findings suggest that AI-supported double reading increases breast cancer detection without increasing recalls, indicating a potential for AI to improve mammography screening outcomes. The study highlights the ongoing need for advancements in breast cancer screening to enhance sensitivity and specificity, ultimately reducing false positives and the associated psychological and financial burdens on patients and healthcare systems.
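The PPV of recall is the share of recalled women who are diagnosed with breast cancer, so it can be checked as the detection rate divided by the recall rate. The sketch below uses the rounded per-1,000 rates from the overview and a hypothetical helper name. The PPV of biopsy cannot be reproduced the same way, because this summary does not give per-group biopsy rates.

```python
def ppv_of_recall(detection_per_1000: float, recall_per_1000: float) -> float:
    """Percentage of recalled women who were diagnosed with breast cancer."""
    return detection_per_1000 / recall_per_1000 * 100.0


ppv_ai = ppv_of_recall(6.7, 37.4)       # AI-supported group
ppv_control = ppv_of_recall(5.7, 38.3)  # control group

print(f"PPV of recall, AI group:      {ppv_ai:.1f}%")
print(f"PPV of recall, control group: {ppv_control:.1f}%")
```

Both values round to the figures reported in the text (17.9% and 14.9%).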

Methods

In this study, a sample size of 200,000 women was targeted to evaluate the noninferiority of AI-supported reading with respect to the breast cancer detection rate (BCDR). Because real-world reading behavior introduces bias, a simulation study was conducted to identify a statistical method that corrects for it and to estimate the expected statistical power. Among the methods evaluated, a simple regression model with a quasi-binomial error distribution and overlap weighting based on propensity scores was the only approach that yielded unbiased results with adequate power. Propensity scores were calculated using logistic regression, accounting for the reader set and AI predictions, and overlap weighting was applied to mitigate the influence of extreme propensity scores.
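To make the idea of overlap weighting concrete, here is a minimal sketch on made-up counts, not study data. It rests on two simplifying assumptions. First, the covariates are reduced to two hypothetical strata, so the propensity score that a saturated logistic regression would fit is simply the share of AI-supported readings in each stratum. Second, only point estimates are computed: a weighted regression with a group indicator alone reproduces these weighted rates, and the quasi-binomial variance estimation is omitted. Overlap weights are 1 - e for AI-group examinations and e for control examinations, where e is the propensity of AI-supported reading.

```python
# Hypothetical aggregated counts per stratum (NOT study data).
# Each stratum stands in for one covariate pattern, e.g. an AI prediction category.
strata = {
    "ai_normal":     {"n_ai": 6000, "n_ctrl": 4000, "cancers_ai": 6,  "cancers_ctrl": 4},
    "ai_suspicious": {"n_ai": 1000, "n_ctrl": 2000, "cancers_ai": 30, "cancers_ctrl": 60},
}


def crude_rate(group: str) -> float:
    """Unweighted cancers per 1,000 examinations for 'ai' or 'ctrl'."""
    cancers = sum(s[f"cancers_{group}"] for s in strata.values())
    n = sum(s[f"n_{group}"] for s in strata.values())
    return 1000.0 * cancers / n


def overlap_weighted_rate(group: str) -> float:
    """Overlap-weighted cancers per 1,000 examinations for 'ai' or 'ctrl'."""
    num = den = 0.0
    for s in strata.values():
        # Propensity of AI-supported reading within this stratum.
        e = s["n_ai"] / (s["n_ai"] + s["n_ctrl"])
        # Overlap weight: 1 - e for AI examinations, e for control examinations.
        w = (1.0 - e) if group == "ai" else e
        num += w * s[f"cancers_{group}"]
        den += w * s[f"n_{group}"]
    return 1000.0 * num / den


for g in ("ai", "ctrl"):
    print(g, f"crude={crude_rate(g):.2f}", f"weighted={overlap_weighted_rate(g):.2f}")
```

In this toy data the detection rate is identical in both groups within each stratum, yet the crude rates differ because AI-supported reading was chosen more often for the low-risk stratum. After overlap weighting both groups have the same stratum mix, and the weighted rates coincide.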

For the assessment of noninferiority, AI-supported reading was deemed acceptable if its detection rate was at most 10% lower than that of the control group, judged by whether the lower bound of the two-sided 95% confidence interval (CI) for the relative difference exceeded -10%. Similarly, for recall rates, noninferiority was established if the AI group’s recall rate was at most 10% higher than the control group’s. Model-based predictions for BCDR and recall rates were derived from the regression analyses, with subgroup analyses conducted by screening round, breast density, and age. Sensitivity analyses were performed to address potential confounding factors. All statistical analyses were executed using R and Python, and the code is available for further review.
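The decision rule can be sketched as two small checks applied to the confidence intervals reported in the overview. The function names are illustrative. Comparing the upper CI bound of the recall difference against +10% is an assumption made here by symmetry with the detection-rate rule, since the text states the recall criterion only in terms of the rate itself.

```python
MARGIN = 10.0  # noninferiority margin in percent, as stated in the text


def detection_noninferior(ci_lower: float, margin: float = MARGIN) -> bool:
    """Detection is noninferior if the lower CI bound exceeds -margin."""
    return ci_lower > -margin


def recall_noninferior(ci_upper: float, margin: float = MARGIN) -> bool:
    """Recall is noninferior if the upper CI bound stays below +margin (assumed rule)."""
    return ci_upper < margin


def excludes_zero(ci_lower: float, ci_upper: float) -> bool:
    """True if the interval lies entirely above or entirely below zero."""
    return ci_lower > 0.0 or ci_upper < 0.0


# Relative differences (percent) with CI bounds, as reported in the overview.
bcdr_ci = (5.7, 30.8)
recall_ci = (-6.5, 1.7)

print("BCDR noninferior:    ", detection_noninferior(bcdr_ci[0]))
print("BCDR CI excludes 0:  ", excludes_zero(*bcdr_ci))
print("Recall noninferior:  ", recall_noninferior(recall_ci[1]))
print("Recall CI excludes 0:", excludes_zero(*recall_ci))
```

Under these rules the detection rate is noninferior and its interval lies above zero, while the recall rate is noninferior but its interval includes zero.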

Results

The study evaluated the effectiveness of an AI-supported double reading system within Germany’s organized breast cancer screening program, involving 461,818 women aged 50-69 screened between July 1, 2021, and February 23, 2023. Participants were divided into two groups: 260,739 women in the AI group and 201,079 in the control group. The results indicated that 41.9 per 1,000 women had suspicious findings leading to recalls, with 10.4 per 1,000 undergoing biopsies and 6.2 per 1,000 ultimately diagnosed with breast cancer. Notably, 79.4% of detected cancers were invasive, while 18.9% were classified as ductal carcinoma in situ (DCIS).
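The overall figures above describe a screening funnel from recall to biopsy to diagnosis, and a few ratios follow directly from them. This is a minimal arithmetic sketch with illustrative variable names. The ratios are crude overall values computed from rounded rates, so they are not directly comparable with the group-level values quoted in the overview, which appear to be model-based.

```python
# Group sizes as reported in the results.
n_ai, n_control, n_total = 260_739, 201_079, 461_818
assert n_ai + n_control == n_total  # the two groups account for all participants

# Overall rates per 1,000 women screened, as reported in the results.
recalled, biopsied, diagnosed = 41.9, 10.4, 6.2

share_ai = n_ai / n_total * 100.0                  # share screened with AI support
biopsy_given_recall = biopsied / recalled * 100.0  # recalled women who had a biopsy
cancer_given_biopsy = diagnosed / biopsied * 100.0
cancer_given_recall = diagnosed / recalled * 100.0

print(f"Screened with AI support:  {share_ai:.1f}%")
print(f"Biopsied among recalled:   {biopsy_given_recall:.1f}%")
print(f"Diagnosed among biopsied:  {cancer_given_biopsy:.1f}%")
print(f"Diagnosed among recalled:  {cancer_given_recall:.1f}%")
```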

The AI system tagged 56.7% of examinations as normal, with a higher share in the AI group (59.4%) than in the control group (53.3%). The safety net mechanism was activated for 1.5% of AI-tagged examinations, resulting in 541 recalls and 204 breast cancer diagnoses. Conversely, 3.1% of AI-tagged normal examinations were further evaluated by the consensus group, leading to additional recalls and a small number of breast cancer diagnoses. These findings suggest that AI can influence screening outcomes and potentially improve triaging efficiency in breast cancer detection.

Discussion

The PRAIM study, involving 461,818 women screened for breast cancer across 12 sites, assessed the integration of artificial intelligence (AI) into mammography workflows. The breast cancer detection rate (BCDR) in the AI group was 6.70 per 1,000 women, significantly higher than the control group’s 5.70 per 1,000, a 17.6% relative increase. The AI group also had a numerically lower recall rate (37.4 per 1,000) than the control group (38.3 per 1,000). Because the confidence interval for this difference (-6.5%, +1.7%) includes zero, the result supports noninferiority rather than a demonstrated reduction in unnecessary follow-ups. The positive predictive value (PPV) of biopsy was higher in the AI group, suggesting improved accuracy in identifying true positives.

Subgroup analyses revealed consistent improvements in BCDR across various demographics, including age and breast density, while sensitivity analyses confirmed the robustness of these findings. The study’s design allowed for real-world applicability, as it included diverse screening sites and radiologists, and it utilized a decision referral approach where AI predictions informed but did not replace radiologist assessments. Overall, the results support the feasibility and safety of AI in mammography screening, highlighting its potential to alleviate radiologist workload and improve cancer detection rates, while also raising questions about the implications of increased detection of ductal carcinoma in situ (DCIS) and the need for further investigation into long-term outcomes.