Enhancing the performance of gradient boosting trees on regression problems

Journal: Journal Of Big Data, Volume: 12, Issue: 1
DOI: https://doi.org/10.1186/s40537-025-01071-3
Publication Date: 2025-02-17
Author(s): Lydia Wahid Rizkallah
Primary Topic: Machine Learning and Data Classification

Overview

In this paper, the author presents a hybrid approach that improves the predictive power of Gradient Boosting Trees (GBT) by combining them with K-means and Bisecting K-means clustering. A single GBT model may not adequately capture the diverse characteristics of a complex dataset, so the approach clusters the training data and uses an ensemble of GBT models, one per cluster. The method is evaluated on 40 regression datasets from UCI and Kaggle, where it outperforms a single GBT model trained on the full dataset.

The results show significant improvements in prediction accuracy, measured by Root Mean Square Error (RMSE) and Mean Absolute Error (MAE), with improvements exceeding 75% in some cases. The Friedman and Wilcoxon signed-rank tests confirm that the gains of the hybrid approach over a single GBT model are statistically significant. Overall, the findings underscore the effectiveness of combining clustering with ensemble modeling on regression tasks.
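The two error metrics and a percentage improvement can be sketched as follows. This is a minimal illustration, assuming that "improvement" means the relative reduction in error against the single-model baseline; that definition and all numbers below are made up for the example and are not taken from the paper.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def pct_improvement(baseline_error, new_error):
    """Relative error reduction in percent (assumed definition of 'improvement')."""
    return 100.0 * (baseline_error - new_error) / baseline_error

# Hypothetical targets and predictions, for illustration only.
y_true = [3.0, 5.0, 2.0, 7.0]
single_pred = [5.0, 3.0, 4.0, 5.0]    # stands in for a single model
ensemble_pred = [4.0, 4.0, 3.0, 6.0]  # stands in for a cluster-based ensemble

rmse_gain = pct_improvement(rmse(y_true, single_pred), rmse(y_true, ensemble_pred))
mae_gain = pct_improvement(mae(y_true, single_pred), mae(y_true, ensemble_pred))
print(f"RMSE improvement: {rmse_gain:.1f}%, MAE improvement: {mae_gain:.1f}%")
```

Both metrics are in the units of the target, so lower values are better and a positive percentage means the ensemble reduced the error.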

Introduction

The introduction discusses boosting, an ensemble machine learning technique that improves predictive performance by combining multiple weak learners, typically decision trees, into a strong learner. Boosting operates sequentially: each weak learner focuses on correcting the errors made by its predecessor. Several boosting algorithms are highlighted, including AdaBoost, Gradient Boosting, XGBoost, CatBoost, and LightGBM, each with its own characteristics and implementation. For instance, AdaBoost increases the weights of misclassified data points so that subsequent learners concentrate on them, while Gradient Boosting fits each new learner to reduce the loss of the models built so far, using gradient descent.
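For reference, the gradient descent step mentioned above is usually written as follows. The notation is the standard textbook one and is an assumption here, not taken from the paper: $F_m$ is the ensemble after $m$ rounds, $h_m$ the weak learner added in round $m$, $L$ the loss, and $\nu$ the step size.

$$
r_{im} = -\left[\frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)}\right]_{F = F_{m-1}},
\qquad
F_m(x) = F_{m-1}(x) + \nu\, h_m(x)
$$

Here $h_m$ is fitted to the pairs $(x_i, r_{im})$. For the squared loss $L(y, F) = \tfrac{1}{2}(y - F)^2$ the pseudo-residual reduces to the ordinary residual, $r_{im} = y_i - F_{m-1}(x_i)$.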

The introduction also surveys applications of gradient boosting algorithms across diverse prediction problems, citing numerous studies. These range from intrusion detection and retail promotion forecasting to landslide susceptibility mapping and illegal activity prediction in cryptocurrency infrastructures. The paper also notes earlier combinations of clustering with gradient boosting, such as K-means and Gaussian mixture models for lithology classification and fuzzy C-means clustering for Android malware detection. Overall, the introduction establishes the significance of boosting methods for improving predictive accuracy across domains.

Methods

This section outlines the methodology of the proposed approach, which uses gradient boosting as the primary predictive model for regression problems. To improve its predictive performance, the training data is first clustered with K-means or bisecting K-means. An ensemble of gradient boosting models is then built, with each model trained exclusively on one of the resulting clusters.
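The cluster-then-train strategy can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the paper's implementation: features are single numbers, K-means starts from a deterministic quantile initialisation, a least-squares line stands in for each per-cluster GBT model, and a new point is routed to the model of its nearest centroid (the summary does not spell out the routing rule). All function names are hypothetical.

```python
def nearest(x, centroids):
    """Index of the centroid closest to the scalar x."""
    return min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))

def kmeans_1d(xs, k, iters=100):
    """Lloyd's algorithm on scalars with a deterministic quantile start."""
    s = sorted(xs)
    centroids = [s[int((i + 0.5) * len(s) / k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[nearest(x, centroids)].append(x)
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return centroids

def fit_line(xs, ys):
    """Least-squares line; a stand-in for a per-cluster GBT model."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)

def fit_cluster_ensemble(xs, ys, k):
    """Cluster the training data, then fit one model per cluster."""
    centroids = kmeans_1d(xs, k)
    members = [([], []) for _ in range(k)]
    for x, y in zip(xs, ys):
        cx, cy = members[nearest(x, centroids)]
        cx.append(x)
        cy.append(y)
    # An empty cluster falls back to a model trained on all data.
    models = [fit_line(cx, cy) if cx else fit_line(xs, ys) for cx, cy in members]

    def predict(x):
        # Assumed routing rule: use the model of the nearest centroid.
        return models[nearest(x, centroids)](x)

    return predict

# Toy data with two regimes that a single line cannot fit well.
xs = [float(i) for i in range(10)] + [float(i) for i in range(20, 30)]
ys = [2 * x for x in xs[:10]] + [100 - 3 * x for x in xs[10:]]
predict = fit_cluster_ensemble(xs, ys, k=2)
print(predict(4.0), predict(22.0))
```

On data like this, where different regions follow different relationships, each local model only has to explain its own region, which is the motivation the paper gives for training on clusters.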

The method is implemented in PySpark, the Python API for Apache Spark, which enables efficient computation on large datasets. The paper's subsequent subsections discuss the specific techniques used within this framework and how each contributes to the overall predictive capability of the model.

Results

The results section evaluates the proposed ensemble of Gradient Boosting Trees (GBT) models, each trained on one cluster of the data, against a single GBT model trained on the entire dataset. The study uses 40 regression datasets from the UCI Machine Learning Repository and Kaggle, grouped by number of instances: fewer than 10,000 (Category 1), between 10,000 and 100,000 (Category 2), and more than 100,000 (Category 3). The ensemble is assessed with both K-means and bisecting K-means clustering, for varying values of $K$, the number of clusters (and thus the number of GBT models in the ensemble).

The analysis shows that the proposed ensemble approach significantly outperforms the single GBT model across all dataset categories on both Root Mean Square Error (RMSE) and Mean Absolute Error (MAE). The average percentage improvement increases with $K$, particularly for the larger datasets in Category 3, where the improvement roughly doubles at $K=7$ compared to $K=3$. The Friedman test and the Wilcoxon signed-rank test confirm the significance of these improvements, with p-values well below 0.05, which is strong evidence against the null hypothesis of no difference between the approaches. K-means performs better on smaller datasets, while bisecting K-means performs better on larger ones, suggesting that the effectiveness of a clustering algorithm may vary with dataset size. Even so, the choice of clustering algorithm matters less for ensemble performance than the act of clustering itself.
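To make the paired comparison concrete, the following sketch computes a Wilcoxon signed-rank statistic for two sets of per-dataset errors. It is an illustration under stated assumptions: the p-value uses the plain normal approximation without tie or continuity corrections, which is rough for small samples, and the error values are hypothetical rather than results from the paper. In practice a statistics library would normally be used.

```python
import math

def average_ranks(values):
    """1-based ranks of values, with ties sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def wilcoxon_signed_rank(a, b):
    """Return (statistic, two-sided p-value) via the normal approximation."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    stat = min(w_plus, w_minus)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (stat - mu) / sigma  # z <= 0 because stat is the smaller rank sum
    p = 1 + math.erf(z / math.sqrt(2))  # equals 2 * Phi(z)
    return stat, min(p, 1.0)

# Hypothetical per-dataset RMSE values, for illustration only.
single_rmse = [20.0, 18.0, 22.0, 17.0, 19.0, 25.0, 21.0, 16.0, 24.0, 23.0]
ensemble_rmse = [21.0, 16.0, 19.0, 13.0, 14.0, 19.0, 14.0, 8.0, 15.0, 13.0]

stat, p_value = wilcoxon_signed_rank(single_rmse, ensemble_rmse)
print(f"W = {stat}, p = {p_value:.4f}")
```

A small statistic means almost all of the rank mass sits on one side, i.e. one method is better on nearly every dataset, which is what a p-value below 0.05 reflects.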

Discussion

In the discussion section, the paper elaborates on gradient boosting, K-means clustering, and bisecting K-means clustering, and on the role each plays in improving predictive performance. Gradient boosting is described as an iterative ensemble technique that combines multiple weak learners to minimize a differentiable loss function. Each iteration computes pseudo-residuals and fits a new weak learner to them. Overfitting is mitigated with regularization techniques such as shrinking the step size of each update or reducing the complexity of the weak learners.
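The iterative procedure described above can be sketched as a toy gradient boosting regressor. This is a minimal illustration assuming squared loss (so the pseudo-residuals are ordinary residuals), a single scalar feature, and depth-1 trees (stumps) as weak learners; the function names and hyperparameter values are hypothetical and not those used in the paper.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump on a scalar feature (squared error)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # candidate splits of the form x <= threshold
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    if best is None:  # all feature values identical: predict the mean residual
        mean_r = sum(residuals) / len(residuals)
        return lambda x: mean_r
    _, threshold, lmean, rmean = best
    return lambda x: lmean if x <= threshold else rmean

def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Fit stumps sequentially to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)  # initial constant prediction
    stumps = []
    preds = [base] * len(ys)
    for _ in range(n_rounds):
        # Pseudo-residuals for squared loss: y minus the current prediction.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Shrunken update: the learning rate is the step size.
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

xs = [float(i) for i in range(20)]
ys = [x * x for x in xs]
model = gradient_boost(xs, ys)

baseline_mse = mse(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mse(ys, [model(x) for x in xs])
print(baseline_mse, boosted_mse)
```

The two regularization levers mentioned in the text appear here as `learning_rate` (step size) and the choice of stumps as very simple weak learners.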

K-means clustering is presented as an unsupervised method that partitions data into K clusters by iteratively reassigning points and updating centroids, so that points lie close to their own cluster's centroid and clusters are well separated. Bisecting K-means is a divisive variant that starts from a single cluster and repeatedly splits clusters in two until the desired number of clusters is reached. The proposed algorithm combines these clustering techniques with gradient boosting by training a separate GBT model on each cluster. Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) are used to compare the proposed method with a single GBT model, and they show significant improvements across various datasets. The results indicate that the cluster-based ensemble can improve predictive accuracy substantially, with improvements exceeding 75% in some cases.
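The splitting procedure of bisecting K-means can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the paper: the cluster chosen for each split is the one with the largest sum of squared errors (one common rule; libraries differ), and each 2-way split uses a deterministic initialisation instead of random restarts. Function names are hypothetical.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def sse(points):
    """Sum of squared distances from each point to the cluster mean."""
    m = mean(points)
    return sum(dist2(p, m) for p in points)

def two_means(points, iters=50):
    """Split one cluster in two with Lloyd's algorithm (deterministic start)."""
    center = mean(points)
    c0 = max(points, key=lambda p: dist2(p, center))  # farthest from the mean
    c1 = max(points, key=lambda p: dist2(p, c0))      # farthest from c0
    a, b = points, []
    for _ in range(iters):
        new_a = [p for p in points if dist2(p, c0) <= dist2(p, c1)]
        new_b = [p for p in points if dist2(p, c0) > dist2(p, c1)]
        if not new_b or (new_a, new_b) == (a, b):
            break
        a, b = new_a, new_b
        c0, c1 = mean(a), mean(b)
    return a, b

def bisecting_kmeans(points, k):
    """Start from one cluster and bisect until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        splittable = [c for c in clusters if sse(c) > 0]
        if not splittable:
            break  # every cluster is a single repeated point
        target = max(splittable, key=sse)  # assumed rule: split the largest-SSE cluster
        clusters.remove(target)
        clusters.extend(two_means(target))
    return clusters

blob_a = [(0, 0), (0, 1), (1, 0), (1, 1)]
blob_b = [(5, 5), (5, 6), (6, 5), (6, 6)]
blob_c = [(20, 20), (20, 21), (21, 20), (21, 21)]
clusters = bisecting_kmeans(blob_a + blob_b + blob_c, k=3)
for cluster in clusters:
    print(sorted(cluster))
```

In the proposed method, each resulting cluster would then receive its own GBT model; the clustering step itself is independent of the regressor.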