Ensemble learning algorithms based on easyensemble sampling for financial distress prediction

Journal: Annals of Operations Research, Volume: 346, Issue: 3
DOI: https://doi.org/10.1007/s10479-025-06494-y
Publication Date: 2025-02-18
Author(s): Wei Liu et al.
Primary Topic: Financial Distress and Bankruptcy Prediction

Overview

The paper investigates how well ensemble learning algorithms forecast financial distress, focusing on the often-overlooked problem of imbalanced data. It introduces the EasyEnsemble method, which combines undersampling with ensemble learning models, and compares its performance against the SMOTE sampling technique. The findings show that EasyEnsemble improves prediction accuracy substantially compared with SMOTE. The study also uses Permutation Importance (PIMP) and Recursive Feature Elimination (RFE) to show that effective feature selection can reduce the number of indicators without compromising prediction accuracy, which improves efficiency and cuts computation time. The key indicators identified for predicting financial distress are profitability, cash flow, solvency, and structural ratios.

In conclusion, the study builds on previous research by showing that ensemble learning algorithms, particularly when combined with the EasyEnsemble method, outperform non-ensemble models such as SVM, LR, and DNN in predicting financial distress. It suggests that integrating imbalanced-data processing with machine learning is a promising area for future research, particularly for AI applications in corporate finance. The authors call for testing the proposed techniques in other binary forecasting scenarios, such as medical prediction, while acknowledging the need for better interpretability and further refinement of existing algorithms.

Introduction

The introduction addresses the problem of predicting financial distress in firms, emphasizing its importance for assessing solvency and safeguarding investor interests. Traditional statistical methods, such as multiple discriminant analysis and logistic regression, have been widely used for this purpose; however, they often rely on historical data and fail to adapt dynamically to new information, which can compromise prediction accuracy. Moreover, these methods rest on key assumptions, such as multivariate normality of the independent variables, that real financial data frequently violate, which limits their effectiveness (Qian et al., 2022; Sun & Li, 2008).

In contrast, machine learning techniques have gained traction because they learn from data automatically without assuming a specific distribution. Approaches such as artificial neural networks (ANN), support vector machines (SVM), and ensemble methods, including random forests (RF) and gradient boosting decision trees (GBDT), have shown promising results in financial distress prediction (Elhoseny et al., 2022; Liu et al., 2024). However, existing research often overlooks the challenge of imbalanced data, where distressed firms are significantly outnumbered by healthy ones. This study aims to improve financial distress prediction by preprocessing imbalanced data with the synthetic minority over-sampling technique (SMOTE) and the EasyEnsemble undersampling method, and then applying various machine learning models, including XGBoost, AdaBoost, and deep neural networks (DNN). The research also incorporates feature selection techniques to improve model performance, using Chinese listed firms as the setting.
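
To make the undersampling idea concrete, the following is a minimal sketch of an EasyEnsemble-style procedure, not the authors' implementation: several balanced subsets are drawn, each keeping every distressed firm plus an equally sized random draw of normal firms, one base learner is fitted per subset, and the scores are averaged. The function names, the label convention (1 = distressed, 0 = normal), and the toy centroid-based learner are assumptions made for illustration; in practice the base learner would be one of the benchmark models such as AdaBoost or XGBoost.

```python
import random
from typing import Callable, List, Sequence

Row = Sequence[float]
Scorer = Callable[[Row], float]  # maps a feature row to a distress score in [0, 1]


def balanced_subsets(y: Sequence[int], n_subsets: int, seed: int = 0) -> List[List[int]]:
    """Return index lists: all minority rows plus an equal-size random draw of majority rows."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == 1]
    majority = [i for i, label in enumerate(y) if label == 0]
    return [minority + rng.sample(majority, len(minority)) for _ in range(n_subsets)]


def fit_centroid_scorer(X: Sequence[Row], y: Sequence[int]) -> Scorer:
    """Toy base learner (hypothetical placeholder): score by relative distance to class centroids."""
    def centroid(cls: int) -> List[float]:
        rows = [x for x, label in zip(X, y) if label == cls]
        return [sum(col) / len(rows) for col in zip(*rows)]

    c_distressed, c_normal = centroid(1), centroid(0)

    def dist(a: Row, b: Row) -> float:
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

    def score(x: Row) -> float:
        d1, d0 = dist(x, c_distressed), dist(x, c_normal)
        # Closer to the distressed centroid gives a score nearer to 1.
        return d0 / (d0 + d1) if (d0 + d1) else 0.5

    return score


def easy_ensemble_fit(
    X: Sequence[Row],
    y: Sequence[int],
    n_subsets: int = 5,
    fit_base: Callable[[Sequence[Row], Sequence[int]], Scorer] = fit_centroid_scorer,
    seed: int = 0,
) -> Scorer:
    """Fit one base learner per balanced subset and average their scores."""
    members = []
    for idx in balanced_subsets(y, n_subsets, seed):
        members.append(fit_base([X[i] for i in idx], [y[i] for i in idx]))

    def ensemble_score(x: Row) -> float:
        return sum(member(x) for member in members) / len(members)

    return ensemble_score
```

Because every subset reuses all distressed firms while sampling different normal firms, the ensemble as a whole sees much more of the majority class than a single undersampled training set would.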

Methods

The methods section describes the benchmark models, the sampling procedures, the feature selection methods, and the evaluation metrics used in the study. The experimental procedure, illustrated in Figure 2, consists of three key comparisons: the effect of processing imbalanced data, the impact of feature selection, and a contribution analysis of 80 features. The benchmark models are XGBoost, AdaBoost, Gradient Boosting Decision Trees (GBDT), Random Forest (RF), Support Vector Machine (SVM), Logistic Regression (LR), and Deep Neural Networks (DNN). These models are evaluated with eight classification metrics: Accuracy (ACC), Precision, F1 Score, Recall, Specificity, Area Under the Curve (AUC), H-measure, and the Kolmogorov-Smirnov statistic (KS).
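
As a rough illustration of how most of these metrics are computed, the sketch below derives ACC, Precision, Recall, Specificity, and F1 from confusion counts, and AUC and KS from predicted scores. It assumes the label convention 1 = distressed (positive) and 0 = normal, and it takes KS as the maximum gap between the true positive rate and the false positive rate over all thresholds. The H-measure is omitted because it requires a cost-weighting distribution that this summary does not specify. Function names are illustrative only.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) with 1 = distressed (positive) and 0 = normal."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def safe_div(num: float, den: float) -> float:
    """Divide, returning 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def threshold_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Metrics that depend on hard 0/1 predictions."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "acc": safe_div(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": safe_div(tn, tn + fp),
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


def auc_and_ks(y_true: Sequence[int], scores: Sequence[float]) -> Tuple[float, float]:
    """AUC via pairwise ranking (ties count 0.5); KS as max |TPR - FPR| over thresholds."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))
    ks = 0.0
    for thr in sorted(set(scores)):
        tpr = sum(1 for p in pos if p >= thr) / len(pos)
        fpr = sum(1 for n in neg if n >= thr) / len(neg)
        ks = max(ks, abs(tpr - fpr))
    return auc, ks
```

The pairwise AUC loop is quadratic in the sample size, which is acceptable for a small illustration but would normally be replaced by a rank-based computation on real data.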

The research begins by collecting historical financial indicators to build the original dataset, which is then divided into a training set and a test set in a 9:1 ratio. Holding out the test set means the model's generalization error is assessed on data that played no part in training. The section concludes with a presentation of the experimental results, including the performance of baseline models, feature selections, and partial dependence plots, which provide insight into the relationships between features and model predictions.
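
A 9:1 split of this kind could be sketched as follows. The stratification by class is an assumption added here, since the summary does not say how the split was drawn; with so few distressed firms, a stratified split keeps at least some of them in the test set. The function name and parameters are illustrative.

```python
import random
from typing import List, Sequence, Tuple


def stratified_split(
    labels: Sequence[int], test_fraction: float = 0.1, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices), holding out test_fraction of each class."""
    rng = random.Random(seed)
    train_idx: List[int] = []
    test_idx: List[int] = []
    for cls in sorted(set(labels)):
        idx = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(idx)
        # Keep at least one row of every class in the test set.
        n_test = max(1, round(len(idx) * test_fraction))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)
```

Any resampling step such as SMOTE or EasyEnsemble would then be applied to the training indices only, so that the test set keeps the original class ratio.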

Results

The results section reports the performance of the benchmark models after Bayesian hyperparameter optimization, which identified the optimal parameters for each training sample set (T-2, T-3, and T-4), as detailed in Table 4. It highlights the predictive performance metrics, including Accuracy (ACC), Recall, and Specificity, and stresses the imbalanced nature of the dataset, in which normal firms outnumber distressed firms by a factor of more than 30, as noted in Table 2.

The findings, presented in Table 5, indicate that while most models (excluding the SVM) achieved ACC values above 90% across all sample sets, Recall values were notably low, with many models recording zero, meaning they failed to identify any financially distressed firm. In contrast, Specificity values were high, often equal to 1, which shows that the models classified nearly every firm as financially normal. This outcome underscores the challenge posed by the imbalanced training sample: despite their high accuracy, models trained on it are ineffective at predicting financial distress.
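
The combination of high ACC, zero Recall, and Specificity of 1 is what a classifier produces when it labels every firm as normal. The short calculation below illustrates this with hypothetical counts at a 30:1 ratio (3000 normal firms and 100 distressed firms, numbers chosen for illustration and not taken from the paper's tables).

```python
from typing import Dict


def all_normal_baseline(n_normal: int, n_distressed: int) -> Dict[str, float]:
    """Metrics for a classifier that predicts 'normal' for every firm."""
    tp, fp = 0, 0                      # nothing is ever predicted distressed
    tn, fn = n_normal, n_distressed    # all normals correct, all distressed missed
    return {
        "acc": (tp + tn) / (tp + tn + fp + fn),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts at a 30:1 imbalance ratio.
baseline = all_normal_baseline(n_normal=3000, n_distressed=100)
print(baseline)
```

Here ACC equals 3000/3100, about 96.8%, even though not a single distressed firm is detected, which is why accuracy alone is a misleading measure on data this imbalanced.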

Discussion

The discussion section reviews statistical and machine learning methods for forecasting financial distress, tracing the evolution from traditional statistical models to advanced machine learning techniques. Early models such as Beaver’s discriminant analysis and Ohlson’s logistic regression laid the groundwork, but they are limited by assumptions such as normality and the absence of multicollinearity. In contrast, machine learning approaches, including Artificial Neural Networks (ANN), Support Vector Machines (SVM), and ensemble methods, have shown superior performance by extracting patterns from data automatically without strict distributional assumptions. Notably, studies indicate that models such as SVM and ensemble algorithms (e.g., Random Forest, AdaBoost, and XGBoost) outperform traditional methods in prediction accuracy, particularly on the imbalanced datasets typical of financial distress forecasting.

The section also emphasizes the importance of addressing imbalanced data through techniques such as the synthetic minority over-sampling technique (SMOTE) and EasyEnsemble, which improve model training by balancing the representation of distressed and non-distressed firms. The paper further discusses Bayesian hyperparameter optimization and feature selection methods, such as Permutation Importance (PIMP) and Recursive Feature Elimination (RFE), as ways to refine model performance. Evaluation metrics, including accuracy, precision, recall, and the area under the curve (AUC), are used to assess the predictive capabilities of the models, underscoring the need for robust methodology in financial distress prediction.
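
The idea behind permutation-based feature importance can be sketched as follows: shuffle one feature column at a time and measure how much a fitted model's score drops. This is plain permutation importance measured on accuracy; whether the paper's PIMP variant adds further steps, such as a significance test, is not stated in this summary, and the function names and the `predict` callable are assumptions for illustration.

```python
import random
from typing import Callable, List, Sequence


def accuracy(predict: Callable[[Sequence[float]], int],
             X: Sequence[Sequence[float]], y: Sequence[int]) -> float:
    """Share of rows where the model's prediction matches the label."""
    return sum(1 for row, label in zip(X, y) if predict(row) == label) / len(y)


def permutation_importance(
    predict: Callable[[Sequence[float]], int],
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    n_repeats: int = 10,
    seed: int = 0,
) -> List[float]:
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the label
            X_perm = [list(row[:j]) + [v] + list(row[j + 1:]) for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances
```

Features whose shuffling leaves the score unchanged are candidates for removal, which is how a feature selection step can shrink the indicator set without hurting predictive accuracy.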