optRF: Optimising random forest stability by determining the optimal number of trees

Journal: BMC Bioinformatics, Volume: 26, Issue: 1
DOI: https://doi.org/10.1186/s12859-025-06097-1
PMID: https://pubmed.ncbi.nlm.nih.gov/40165065
Publication Date: 2025-03-31
Author(s): Thomas M. Lange et al.
Primary Topic: Probabilistic and Robust Engineering Design

Overview

In this study, we investigate the non-deterministic nature of random forest algorithms, which can lead to variability in model predictions and variable importance estimates when using the same input data. We propose a method to quantify this impact and demonstrate that increasing the number of trees in a random forest enhances model stability in a non-linear manner, while the computational time increases linearly. Our findings suggest that there is an optimal number of trees for each dataset that balances stability and computational efficiency. To facilitate this process, we developed the R package optRF, which models the relationship between the number of trees and the stability of random forests, offering recommendations for optimal tree counts.
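The trade-off described above can be illustrated with a toy simulation. This is only a sketch under a strong simplifying assumption: each tree is treated as an independent noisy estimate of the same value, which is not how trees in a real forest behave, and the names `forest_prediction` and `run_to_run_spread` are hypothetical helpers, not part of optRF. It shows the run-to-run spread of the aggregated prediction shrinking with diminishing returns while the work done grows in proportion to the number of trees.

```python
import random
import statistics


def forest_prediction(n_trees, rng, true_value=10.0, tree_noise=2.0):
    """Toy forest: the mean of n_trees independent, noisy per-tree predictions."""
    return statistics.fmean(rng.gauss(true_value, tree_noise) for _ in range(n_trees))


def run_to_run_spread(n_trees, n_repeats=100, seed=0):
    """Standard deviation of the forest prediction over repeated fits to the same data."""
    rng = random.Random(seed)
    predictions = [forest_prediction(n_trees, rng) for _ in range(n_repeats)]
    return statistics.stdev(predictions)


if __name__ == "__main__":
    # Work grows linearly with n_trees; the spread shrinks roughly like 1/sqrt(n_trees).
    for n_trees in (50, 500, 5000):
        print(n_trees, round(run_to_run_spread(n_trees), 4))
```

In this toy setting a tenfold increase in trees reduces the spread by only about a factor of three, which is the kind of diminishing return that motivates looking for an optimal tree count rather than simply using as many trees as possible.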

The results indicate that the default setting of 500 trees often leaves random forests too unstable for decision-making across a range of datasets. The number of trees required for a stable model depends on the dataset’s characteristics, including the number of variables and observations. Unlike existing R packages that focus on maximizing prediction accuracy, our approach prioritizes model stability. We encourage researchers to use the optRF package to determine the optimal number of trees and to report both the number of trees used and the stability of their random forest models, as these factors are critical for the reproducibility of results in predictive analytics and decision-making contexts.

Introduction

The introduction of this research paper discusses the significance of machine learning, particularly the random forest method, in analyzing complex datasets across various scientific fields. Machine learning enables computers to identify patterns and make predictions, facilitating data-driven decision-making. Random forest is highlighted for its non-parametric nature, ease of input preparation, and ability to handle diverse types of response variables. The method constructs multiple decision trees and aggregates their predictions, with key parameters such as the number of variables considered at each split (mtry), sample size for training, and tree depth influencing prediction accuracy.
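The two usual sources of randomness in a random forest, bootstrap sampling of observations and random selection of mtry candidate variables, can be sketched as follows. This is a minimal illustration, not the implementation of any particular library: `draw_tree_randomness` is a hypothetical helper, and a real forest redraws the candidate variables at every split rather than once per tree.

```python
import random


def draw_tree_randomness(n_obs, n_vars, mtry, rng):
    """Return the random ingredients for one tree.

    bootstrap_rows: n_obs row indices drawn with replacement.
    split_candidates: mtry distinct variable indices (drawn once here for brevity;
    a real forest repeats this draw at every split).
    """
    bootstrap_rows = [rng.randrange(n_obs) for _ in range(n_obs)]
    split_candidates = sorted(rng.sample(range(n_vars), mtry))
    return bootstrap_rows, split_candidates


if __name__ == "__main__":
    # Different seeds give different trees; fixing the seed makes a run repeatable.
    for seed in (1, 2):
        print(seed, draw_tree_randomness(8, 20, 4, random.Random(seed)))
```

Because every tree depends on such draws, two forests fitted to identical data generally differ unless the random seed is fixed, and fixing the seed hides the variability rather than removing it.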

The paper emphasizes the challenges posed by the “small n, large p” problem in genomic research, where the number of observations is limited compared to the number of variables. Random forest can address this issue by estimating variable importance and selecting significant predictors. However, the non-deterministic nature of random forest can lead to variability in predictions and variable importance estimates, which is a serious concern when decisions are based on them. The authors quantify the stability of random forests, focusing on the relationship between the number of trees and model stability, and find that this relationship is non-linear. They introduce the R package optRF, designed to recommend an optimal number of trees that enhances stability without excessive computational demands, thus improving the reliability of data-driven decisions in genomic selection and other applications.

Methods

The “Materials and Methods” section outlines the experimental design and procedures employed in the study. It details the specific materials used, including any reagents, equipment, and biological samples, ensuring reproducibility of the experiments. The methodology is described in a systematic manner, highlighting the techniques for data collection and analysis, such as statistical methods or computational models applied.

Additionally, the section may include information on the sample size, controls, and any ethical considerations taken into account during the research. Overall, this section serves to provide a comprehensive overview of the experimental framework, allowing for critical evaluation and replication of the study’s findings.

Results

The results from the application of the optRF package to 45 diverse data sets demonstrate its effectiveness in optimizing random forest models for both prediction and variable importance. The analysis revealed that the optimal number of trees varies significantly not only with the number of variables but also with the number of observations and the nature of the response variable. For instance, data sets 35 and 38 have similar characteristics yet required very different numbers of trees (75,000 vs. 9,000) to achieve stability. The type of response variable also influenced the optimal tree count, as seen in the comparison between data sets 43 and 44.

The findings indicate a marked increase in prediction and selection stability when the number of trees is changed from the default of 500 to the number recommended by the optRF package, with average prediction stability rising from 0.9545 to 0.9957 and average selection stability from 0.7934 to 0.964. A fixed rule of setting the number of trees to ten times the number of variables yielded slightly improved stability but was generally less effective than the optRF recommendation. The optRF package was also computationally efficient, particularly for data sets with over 30,000 variables, where it required less computation time than the fixed rule of ten times the number of variables. Overall, these results underscore the importance of a tailored number of trees for model stability and efficiency in random forest applications.
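Selection stability asks whether repeated runs select the same individuals. The Discussion names Fleiss' kappa as the agreement measure for categorical responses, so a minimal sketch of that statistic is given here, treating each repeated run as a "rater" and "selected" versus "not selected" as the categories. The helper names and the top-k selection rule are assumptions made for illustration and are not taken from the package.

```python
from collections import Counter


def top_k_selection(predictions, k):
    """Mark the k individuals with the highest predicted value as 1, all others as 0."""
    ranked = sorted(range(len(predictions)), key=lambda i: predictions[i], reverse=True)
    chosen = set(ranked[:k])
    return [1 if i in chosen else 0 for i in range(len(predictions))]


def fleiss_kappa(decisions):
    """Fleiss' kappa, where decisions[r][i] is the category run r gave to individual i."""
    n_runs = len(decisions)
    n_items = len(decisions[0])
    if n_runs < 2:
        raise ValueError("need at least two runs")
    category_totals = Counter()
    agreement_sum = 0.0
    for i in range(n_items):
        counts = Counter(run[i] for run in decisions)
        category_totals.update(counts)
        # Share of agreeing pairs of runs for individual i.
        agreement_sum += (sum(c * c for c in counts.values()) - n_runs) / (n_runs * (n_runs - 1))
    p_bar = agreement_sum / n_items
    # Agreement expected by chance from the overall category frequencies.
    p_e = sum((total / (n_items * n_runs)) ** 2 for total in category_totals.values())
    if p_e == 1.0:
        # Only one category ever used: kappa is undefined; this sketch returns 1.0.
        return 1.0
    return (p_bar - p_e) / (1.0 - p_e)
```

A value of 1 means every run made the same selection decisions, while values near 0 mean the runs agree no more often than chance would predict.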

Discussion

The optRF package developed in this study provides a robust framework for optimizing the number of trees in random forest models, with the aim of improving the stability of both predictions and variable importance estimates. The package includes two primary functions: `opt_prediction`, which optimizes the number of trees for predictions and the selection of top individuals, and `opt_importance`, which optimizes it for estimating variable importance. Both functions use stability metrics, namely the intraclass correlation coefficient (ICC) for continuous responses and Fleiss’ kappa for categorical responses, to assess model reliability across repeated runs. The package also employs a two-parameter logistic (2PL) model to estimate the optimal number of trees, defined as the point where additional trees yield negligible improvements in stability.
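The idea of fitting a two-parameter logistic curve and stopping where extra trees barely help can be sketched as follows. Several things here are assumptions rather than details from the paper: the curve is parametrised on the logarithm of the number of trees, it is fitted by least squares on logit-transformed stability values, and the threshold `tol` for a "negligible" gain per additional tree is an arbitrary illustrative value. The function names are hypothetical and this is not the optRF implementation.

```python
import math


def two_pl(n_trees, a, b):
    """Assumed 2PL curve: stability = 1 / (1 + exp(-a * (ln(n_trees) - b)))."""
    return 1.0 / (1.0 + math.exp(-a * (math.log(n_trees) - b)))


def fit_two_pl(tree_counts, stabilities):
    """Estimate (a, b) by linear regression of logit(stability) on ln(n_trees).

    Requires at least two distinct tree counts and stabilities strictly between 0 and 1.
    """
    xs = [math.log(t) for t in tree_counts]
    ys = [math.log(s / (1.0 - s)) for s in stabilities]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    intercept = y_mean - slope * x_mean
    return slope, -intercept / slope  # logit = a * ln(T) - a * b


def recommend_trees(a, b, tol=1e-6, max_trees=1_000_000):
    """Smallest tree count past the curve's midpoint where one more tree adds less than tol."""
    n = max(1, math.ceil(math.exp(b)))  # start at the midpoint, where gains only shrink
    while n < max_trees:
        if two_pl(n + 1, a, b) - two_pl(n, a, b) < tol:
            return n
        n += 1
    return max_trees


if __name__ == "__main__":
    # Synthetic, noise-free stability measurements generated from a known curve.
    counts = [250, 500, 1000, 2000]
    measured = [two_pl(t, 1.0, math.log(100)) for t in counts]
    a_hat, b_hat = fit_two_pl(counts, measured)
    print(a_hat, b_hat, recommend_trees(a_hat, b_hat))
```

The design point is that stability only needs to be measured at a few small tree counts; the fitted curve is then extrapolated to find where the marginal gain falls below the chosen threshold.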

In practical applications, the optRF package was tested on 45 genomic datasets, demonstrating its effectiveness in addressing the “small n, large p” problem common in genomic research. Results indicated that increasing the number of trees significantly enhanced the stability of both predictions and variable importance estimates, with optimal tree counts varying with the characteristics of each dataset. For instance, for a maize dataset the recommended number of trees was 5,000 rather than the default of 500, resulting in substantial improvements in stability metrics. Similarly, for a larger chicken dataset, the optimal count was 137,000 trees, underscoring the necessity of adjusting this hyperparameter to achieve reliable model performance. Overall, the findings highlight the importance of tuning the number of trees in random forest models to make their predictions and variable selections reliable.