Toward scalable early cancer detection: evaluating EHR-based predictive models against traditional screening criteria

Journal: npj Precision Oncology
DOI: https://doi.org/10.1038/s41698-026-01483-1
PMID: https://pubmed.ncbi.nlm.nih.gov/42141079
Publication Date: 2026-05-15
Author(s): Jiheum Park et al.
Primary Topic: Machine Learning in Healthcare

Overview

Current cancer screening guidelines cover only a limited number of cancer types and rely on specific criteria, such as age or smoking history, to identify high-risk individuals. The authors propose that predictive models built on electronic health records (EHRs) can improve the identification of high-risk groups by uncovering subtle prediagnostic signals of cancer. Despite advances in large language and foundation models, there is insufficient evidence comparing the performance of EHR-based models with that of the traditional risk factors used in screening.

The study systematically assesses the clinical utility of EHR-based predictive models against conventional risk factors, including genetic mutations and family cancer history, across eight major cancers: breast, lung, colorectal, prostate, ovarian, liver, pancreatic, and stomach. Using data from the All of Us Research Program, which encompasses EHR, genomic, and survey data from over 865,000 participants, the authors find that EHR-based models yield a 3-to-6-fold increase in the enrichment of true cancer cases among high-risk individuals compared with traditional risk factors. Furthermore, the EHR foundation model, which is trained on comprehensive patient trajectories, significantly enhances predictive performance across 26 cancer types, highlighting the potential of EHR-based predictive modeling for more precise and scalable early detection strategies.

Introduction

The introduction highlights the critical importance of early cancer detection in improving survival rates, particularly for aggressive cancers such as pancreatic, liver, ovarian, and stomach cancers, which currently lack effective screening guidelines. The stark contrast in five-year survival for pancreatic cancer (44% when detected at an early stage versus 3% at a late stage) emphasizes the need for new approaches to identifying high-risk individuals. Electronic health records (EHRs) are presented as a valuable resource for this purpose, offering a non-invasive and cost-effective means of analyzing patient data for prediagnostic symptoms and healthcare utilization patterns.

Recent advances in artificial intelligence, particularly the development of EHR foundation models akin to large language models, have shown promise in enhancing predictive capabilities across various clinical tasks. Existing EHR-based cancer risk models have focused on methodological innovation and predictive performance for individual cancer types, leaving a notable gap in evaluating their clinical utility for stratifying high-risk populations for targeted screening. The findings indicate that even a baseline model such as XGBoost can outperform traditional risk factors in identifying high-risk cohorts, and that integrating EHR foundation models significantly boosts predictive performance across multiple cancer types. This research lays the groundwork for a scalable, unified framework for early cancer detection using EHR data.

Methods

The study compares EHR-based predictive models with conventional risk factors, including genetic mutations and family cancer history, for identifying individuals at high risk of each cancer type. Cohorts of cancer cases and controls were built from structured EHR data in two sources, the All of Us Research Program and Columbia University Irving Medical Center. Models were judged both on discrimination and on how strongly they enrich true cancer cases among the individuals they flag as high-risk, and sensitivity analyses tested whether performance held once near-diagnosis codes were excluded. Cohort construction and modeling details are summarized in the Discussion.

Results

Across the eight major cancers, EHR-based models enriched true cancer cases among high-risk individuals 3 to 6 times more strongly than traditional risk factors. The EHR foundation model, trained on comprehensive patient trajectories, further improved predictive performance across 26 cancer types. Cancer-specific figures, including AUROC and lift, are covered in the Discussion.

Discussion

The discussion covers the methodology and findings behind the development and evaluation of EHR-based predictive models for cancer risk assessment. The study used structured EHR data standardized under the Observational Medical Outcomes Partnership (OMOP) common data model to identify cancer cohorts. A prompt-based classification approach using OpenAI’s GPT-4 model sorted approximately 54,000 malignancy-related terms into 52 distinct cancer categories, with high mapping accuracy (94.4% for the Columbia University Irving Medical Center (CUIMC) dataset and 98.4% for the All of Us dataset). The final analytic cohorts comprised around 38,000 cancer cases and 210,000 controls from All of Us, and 120,000 cancer cases and 1.1 million controls from CUIMC.
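
The prompt-based term classification can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' pipeline: the prompt wording, the four-item category list, the function names, and the `stub_model` stand-in (used here in place of a real GPT-4 call) are all hypothetical. Mapping accuracy is taken to mean the fraction of manually reviewed terms whose assigned category matches the reviewer's label.

```python
from typing import Callable, Dict, List

# Illustrative subset only; the study's 52 categories are not listed in this summary.
CATEGORIES: List[str] = ["breast", "lung", "pancreatic", "other"]


def build_prompt(term: str, categories: List[str]) -> str:
    """Compose a single-label classification prompt for one malignancy term."""
    options = ", ".join(categories)
    return (
        "Assign the following malignancy-related term to exactly one category.\n"
        f"Categories: {options}\n"
        f"Term: {term}\n"
        "Answer with the category name only."
    )


def parse_label(response: str, categories: List[str], fallback: str = "other") -> str:
    """Normalize a free-text reply to a known category, else use the fallback."""
    cleaned = response.strip().strip(".").lower()
    return cleaned if cleaned in categories else fallback


def classify_terms(
    terms: List[str], ask_model: Callable[[str], str], categories: List[str]
) -> Dict[str, str]:
    """Classify each term by sending one prompt per term to `ask_model`."""
    labels = {}
    for term in terms:
        reply = ask_model(build_prompt(term, categories))
        labels[term] = parse_label(reply, categories)
    return labels


def mapping_accuracy(predicted: Dict[str, str], reference: Dict[str, str]) -> float:
    """Fraction of reviewed terms whose predicted category matches the reference."""
    matches = sum(predicted.get(term) == label for term, label in reference.items())
    return matches / len(reference)


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call: a keyword lookup, for demonstration."""
    term = prompt.split("Term: ")[1].splitlines()[0].lower()
    if "breast" in term:
        return "Breast"
    if "lung" in term:
        return " lung. "
    if "pancrea" in term:
        return "Pancreatic"
    return "I am not sure"


terms = [
    "Malignant neoplasm of upper lobe of lung",
    "Malignant neoplasm of female breast",
    "Adenocarcinoma of pancreas",
    "Malignant melanoma of skin",
]
predicted = classify_terms(terms, stub_model, CATEGORIES)

# Hypothetical reviewer labels; "skin" is outside the toy category list, so it counts as a miss.
reference = dict(zip(terms, ["lung", "breast", "pancreatic", "skin"]))
accuracy = mapping_accuracy(predicted, reference)
print(f"mapping accuracy: {accuracy:.2f}")  # 0.75 with the stub above
```

Passing the model in as a callable keeps the prompt construction and answer validation testable without network access, and restricting replies to a closed category list means an off-list answer falls back to a catch-all rather than creating a new category.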

The predictive models, developed using XGBoost, outperformed traditional logistic regression models, with area under the receiver operating characteristic curve (AUROC) values varying by cancer type. Notably, the EHR-based models enriched high-risk cohorts substantially more than traditional risk factors did, as measured by the lift metric. For instance, in pancreatic cancer, the EHR model achieved a lift of 15.3, compared with 4.71 for genetic carrier status alone. The study also highlighted the importance of integrating EHR models with existing clinical guidelines to enhance risk stratification. Sensitivity analyses confirmed that model performance remained robust even when near-diagnosis codes were excluded, suggesting that broader health patterns contribute to predictive accuracy. Overall, the findings underscore the potential of EHR-based models to improve early cancer detection and inform personalized screening strategies, and the study also addresses challenges related to bias and data completeness in clinical applications.
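
To make the lift comparison concrete, here is a small Python sketch on toy data. It assumes the common definition of lift (the cancer rate among flagged individuals divided by the cancer rate in the whole cohort) and assumes the model flags as many people as the risk factor does so the two groups are the same size. The paper's exact thresholding may differ, and the scores, carrier flags, and function names below are invented for illustration.

```python
from typing import List, Sequence


def lift(flagged: Sequence[bool], has_cancer: Sequence[bool]) -> float:
    """Cancer rate among flagged individuals divided by the overall cancer rate."""
    if len(flagged) != len(has_cancer):
        raise ValueError("inputs must have the same length")
    n_flagged = sum(flagged)
    n_cases = sum(has_cancer)
    if n_flagged == 0 or n_cases == 0:
        raise ValueError("need at least one flagged individual and one case")
    cases_in_flagged = sum(f and c for f, c in zip(flagged, has_cancer))
    rate_flagged = cases_in_flagged / n_flagged
    prevalence = n_cases / len(has_cancer)
    return rate_flagged / prevalence


def flag_top_k(scores: Sequence[float], k: int) -> List[bool]:
    """Flag the k highest-scoring individuals as high-risk."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:k])
    return [i in top for i in range(len(scores))]


# Toy cohort of 16 people: the first 4 are cancer cases (prevalence 0.25).
has_cancer = [True] * 4 + [False] * 12

# Hypothetical model risk scores; three of the four highest belong to cases.
scores = [0.9, 0.8, 0.7, 0.2, 0.6, 0.5, 0.4, 0.3] + [0.1] * 8

# Hypothetical binary risk factor (e.g. carrier status): 4 people flagged, 2 are cases.
carrier = [i in {0, 3, 5, 6} for i in range(16)]

model_flags = flag_top_k(scores, k=sum(carrier))
model_lift = lift(model_flags, has_cancer)
factor_lift = lift(carrier, has_cancer)

print(f"model lift:  {model_lift:.2f}")   # 3.00
print(f"factor lift: {factor_lift:.2f}")  # 2.00
print(f"fold increase: {model_lift / factor_lift:.2f}")  # 1.50
```

Applied to the pancreatic figures quoted above, the same ratio is 15.3 / 4.71 ≈ 3.2, which is in line with the 3-to-6-fold enrichment reported in the overview.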