Understanding overfitting in random forest for probability estimation: a visualization and simulation study

Journal: Diagnostic and Prognostic Research, Volume: 8, Issue: 1
DOI: https://doi.org/10.1186/s41512-024-00177-1
PMID: https://pubmed.ncbi.nlm.nih.gov/39334348
Publication Date: 2024-09-27
Author(s): Lasai Barreñada et al.
Primary Topic: AI in cancer detection

Overview

The authors investigate how random forests behave when used for clinical risk prediction, motivated by a case study on predicting ovarian malignancy. In that case study, training area under the curve (AUC) values were high, which suggests overfitting, yet the models performed competitively on test datasets. The study combines real-world case studies with a simulation study covering 48 logistic data-generating mechanisms (DGMs) to explore how random forests estimate probabilities. The simulations varied predictor distributions, correlations between predictors, and training sample sizes, and random forest models were trained under each condition.

The findings revealed that random forests tend to learn localized probability peaks, which pushes training AUCs close to 1 while performance on test data remains reasonable. In most scenarios, median training AUCs ranged from 0.97 to 1, and the median discrimination loss (the gap between the true AUC and the median test AUC) was 0.025. Median test AUCs were higher with more events per variable and with larger minimum node sizes. The results therefore go against the common recommendation to use fully grown trees in random forests for probability estimation: the authors found that fully grown trees did not meaningfully improve calibration or performance on unseen data.
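
The two quantities reported above can be made concrete with a small sketch. It assumes the AUC is computed as the proportion of event/non-event pairs that are ranked correctly (ties counted as one half), and that discrimination loss is the true AUC minus the test AUC. The function names and the toy numbers are illustrative only and are not taken from the paper.

```python
def auc(y_true, y_score):
    """Share of (event, non-event) pairs in which the event has the higher score."""
    events = [s for y, s in zip(y_true, y_score) if y == 1]
    non_events = [s for y, s in zip(y_true, y_score) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    wins = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                wins += 1.0
            elif e == n:
                wins += 0.5  # ties count as half a correctly ranked pair
    return wins / (len(events) * len(non_events))


def discrimination_loss(true_auc, test_auc):
    """Assumed definition: how far the test AUC falls short of the true AUC."""
    return true_auc - test_auc


# Toy data (made up): four cases, two of them events.
toy_outcomes = [0, 0, 1, 1]
toy_scores = [0.10, 0.40, 0.35, 0.80]
toy_auc = auc(toy_outcomes, toy_scores)  # 3 of the 4 pairs are ranked correctly
```

A model that gives every training event a higher estimate than every training non-event reaches a training AUC of 1 under this definition, whatever happens on new data.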

Introduction

The introduction describes the random forest (RF) algorithm, an ensemble learning method introduced by Leo Breiman in 2001. RF differs from other tree ensemble methods in that its trees are grown independently of one another. Each tree is built on a bootstrap sample, and at each node only a random subset of predictors is considered for splitting, which reduces the correlation between trees. RF has gained traction in clinical prediction modeling because it performs well with little hyperparameter tuning. Notably, RF ensembles have been reported to perform well even when the individual trees are overfitted.
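
The two sources of randomness described above can be sketched in a few lines. This is not the ranger implementation. It only illustrates drawing a bootstrap sample of rows for one tree and a random subset of predictors for one node, and it assumes the subset size is ⌈√P⌉, the setting reported in the Methods section. The function names are hypothetical.

```python
import math
import random


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement: the sample used to grow one tree."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def candidate_predictors(n_predictors, rng):
    """Pick ceil(sqrt(P)) distinct predictor indices to consider at one node."""
    mtry = math.ceil(math.sqrt(n_predictors))
    return sorted(rng.sample(range(n_predictors), mtry))


rng = random.Random(0)                  # fixed seed so the sketch is repeatable
tree_rows = bootstrap_indices(10, rng)  # some rows repeat, others are left out
node_predictors = candidate_predictors(16, rng)  # 4 of the 16 predictors
```

Because each tree sees a different bootstrap sample and each node a different predictor subset, the trees disagree with one another, and averaging over them reduces variance.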

Although RF is widely used as a classifier, there is limited literature on how well it works for probability estimation, that is, as an ensemble of probability estimation trees (PETs). The authors conducted a comparative study of several machine learning algorithms, including RF, to estimate the probabilities of five ovarian tumor types using a training dataset of 5,909 patients. Discrimination was evaluated with the Polytomous Discrimination Index (PDI): RF achieved a near-perfect PDI of 0.93 on training data but a PDI of 0.54 at external validation, which was still competitive. This discrepancy raised concerns about overfitting and prompted the authors to investigate RF's behavior in probability estimation through case studies and a simulation study. The paper first summarizes the RF algorithm for probability estimation, then presents visualizations of the case studies, and concludes with a simulation study examining the effects of tree depth, training sample size, and data-generating mechanisms.

Methods

The authors visualized estimated probabilities in data space to understand how Random Forest (RF) models can show near-perfect discrimination on training data while remaining competitive at external validation. For each case study, the data were split at random into training and test sets, and RF and multinomial logistic regression (MLR) models were developed using two continuous predictors and several categorical predictors. By fixing the categorical predictors at their most common values, the authors obtained a two-dimensional representation of the data space in which estimated probabilities are shown as a heatmap and individual cases are overlaid as a scatter plot. This made it possible to examine how RF and MLR translate predictor values into probability estimates, and it revealed differences in the range of estimated probabilities between the two models.
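
A minimal sketch of how such a two-dimensional grid could be built is given below. It assumes each case is stored as a dictionary. The predictor names (`age`, `marker`, `ascites`) and the toy patients are hypothetical and do not come from the paper. Each grid point would then be passed to a fitted model to obtain the probability shown in the heatmap.

```python
from collections import Counter


def most_common(values):
    """Return the most frequent value (the mode) of a categorical predictor."""
    return Counter(values).most_common(1)[0][0]


def linspace(lo, hi, n):
    """Return n evenly spaced points from lo to hi inclusive (n >= 2)."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def heatmap_grid(rows, x_name, y_name, categorical_names, n=5):
    """Cross two continuous predictors on an n-by-n grid, categoricals fixed at their mode."""
    fixed = {c: most_common([r[c] for r in rows]) for c in categorical_names}
    xs = linspace(min(r[x_name] for r in rows), max(r[x_name] for r in rows), n)
    ys = linspace(min(r[y_name] for r in rows), max(r[y_name] for r in rows), n)
    return [{x_name: x, y_name: y, **fixed} for y in ys for x in xs]


# Hypothetical toy cases, for illustration only.
patients = [
    {"age": 30.0, "marker": 1.0, "ascites": "no"},
    {"age": 50.0, "marker": 4.0, "ascites": "no"},
    {"age": 70.0, "marker": 9.0, "ascites": "yes"},
]
grid = heatmap_grid(patients, "age", "marker", ["ascites"])
```

Fixing the categorical predictors means the heatmap shows only one slice of the full data space, so cases with other categorical values do not lie exactly on the plotted surface.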

The RF models were trained with the ranger package using fixed settings (ntree = 500, mtry = ⌈√P⌉, min.node.size = 2), with probabilities estimated using Malley's probability machine approach. The MLR models used restricted cubic splines (rcs) with three knots to allow for nonlinear relationships. The simulation study covered 192 scenarios, obtained by crossing the data-generating mechanisms with the model settings and two training dataset sizes (200 and 4,000). Each scenario was simulated 1,000 times to ensure robustness, and performance was evaluated on a large test dataset (N = 100,000) to minimize sampling variability. The code for model training and plot generation is available in the Open Science Framework repository.
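
The scenario count can be checked with a short sketch. It assumes the 192 scenarios come from fully crossing the 48 data-generating mechanisms, the two training sizes, and two minimum node sizes. The second node size (20) is an assumption here, since this section only mentions the value 2, and the DGMs are represented by placeholder indices rather than their actual settings.

```python
from itertools import product

dgm_ids = range(1, 49)         # placeholder indices for the 48 DGMs
training_sizes = (200, 4000)   # training dataset sizes given in the text
min_node_sizes = (2, 20)       # 2 is from the text; 20 is an assumed second value

scenarios = list(product(dgm_ids, training_sizes, min_node_sizes))
n_scenarios = len(scenarios)   # 48 * 2 * 2
n_model_fits = n_scenarios * 1000  # 1,000 simulated training sets per scenario
```

Under these assumptions the design gives 192 scenarios and 192,000 fitted random forests, which is why a single large test dataset per DGM is a practical way to evaluate them.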

Results

The simulation results are reported in aggregated form: discrimination and calibration are summarized with the median and interquartile range, and mean squared error with the mean and standard deviation. Detailed results are given in Additional file 1 (Table S6). The complete simulation, including the code and the 1,000 simulation runs for each of the 192 scenarios, is available in the Open Science Framework (OSF) repository.
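
The summaries described above could be computed per scenario along the following lines. This is a sketch using Python's standard `statistics` module. The input numbers are made up for illustration and are not results from the study, and the quartile method is an arbitrary choice.

```python
import statistics


def median_iqr(values):
    """Median and (Q1, Q3), as used for discrimination and calibration."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q2, (q1, q3)


def mean_sd(values):
    """Mean and sample standard deviation, as used for mean squared error."""
    return statistics.mean(values), statistics.stdev(values)


# Made-up values standing in for the per-run results of one scenario.
toy_test_aucs = [0.70, 0.72, 0.74, 0.76, 0.78]
toy_mse = [0.01, 0.02, 0.03]

auc_median, auc_quartiles = median_iqr(toy_test_aucs)
mse_mean, mse_sd = mean_sd(toy_mse)
```

The median and interquartile range are less sensitive to a few extreme simulation runs than the mean and standard deviation, which matters for quantities such as the calibration slope that can take very large values in individual runs.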

Discussion

In this section, the authors discuss the use of Random Forest (RF) for probability estimation, particularly for categorical outcomes. RF builds many decision trees on bootstrap samples of the training data. Each split is chosen to optimize a criterion such as the Gini index, and randomness is introduced by considering only a random selection of predictors at each split. The authors highlight that RF models tend to learn local probability peaks around events in the training data, which can lead to high apparent discrimination on the training data but poorer performance on unseen data. This is more pronounced with deeper trees (that is, a small minimum node size), which leads to overfitting and poorly calibrated probability estimates.
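
As an illustration of the split criterion mentioned above, the sketch below computes the Gini impurity of a node and the impurity decrease achieved by a candidate split. It uses the standard textbook definition, 1 minus the sum of squared class proportions. The function names are hypothetical and this is not the ranger code.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions in the node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted


parent_node = [0, 0, 1, 1]                                  # two events, two non-events
perfect_split = gini_decrease(parent_node, [0, 0], [1, 1])  # children are pure
useless_split = gini_decrease(parent_node, [0, 1], [0, 1])  # children mirror the parent
```

With a very small minimum node size, splitting can continue until nodes contain only a handful of cases, so a single isolated event can end up with its own high-probability region. That is consistent with the local peaks described above.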

The authors present case studies on ovarian cancer diagnosis, traumatic brain injury prognosis, and stroke type diagnosis to illustrate the differences in calibration and discrimination between RF and traditional models such as multinomial logistic regression (MLR). In these case studies, RF showed high training performance but struggled with calibration. This was particularly visible in the training data, where calibration slopes were consistently above 1, meaning that the estimated probabilities were too moderate (underconfident). The authors conclude that although RF can achieve high discrimination, its calibration is often suboptimal, so hyperparameters such as the minimum node size need careful tuning to improve risk estimates. They call for a more nuanced understanding of how RF behaves in probability estimation and emphasize the need for proper validation and calibration assessment in clinical applications.
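
The interpretation of the calibration slope can be illustrated with a small sketch. It assumes the slope is obtained in the usual way, as the coefficient from a logistic regression of the observed binary outcome on the logit of the estimated probability. The regression is fitted here with a few Newton-Raphson steps in plain Python. The data are made up and the function names are hypothetical.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def calibration_fit(outcomes, probabilities, n_iter=25):
    """Fit outcome ~ intercept + slope * logit(probability); return (intercept, slope)."""
    x = [logit(p) for p in probabilities]
    a, b = 0.0, 1.0  # start at perfect calibration
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, outcomes):
            mu = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = mu * (1.0 - mu)
            g0 += yi - mu          # score for the intercept
            g1 += (yi - mu) * xi   # score for the slope
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det   # Newton step using the 2x2 information matrix
        b += (h00 * g1 - h01 * g0) / det
    return a, b


# Made-up example: the estimates are 0.4 and 0.6, but the observed event
# rates in those two groups are 0.2 and 0.8, so the estimates are too moderate.
toy_outcomes = [1] * 2 + [0] * 8 + [1] * 8 + [0] * 2
toy_probs = [0.4] * 10 + [0.6] * 10
toy_intercept, toy_slope = calibration_fit(toy_outcomes, toy_probs)
```

In this toy example the fitted slope is well above 1 because the estimates would have to be pushed further from 0.5 to match the observed event rates. A slope below 1 would indicate the opposite problem, estimates that are too extreme.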