Accuracy, precision, recall, F1-score, or MCC? Empirical evidence from advanced statistics, ML, and XAI for evaluating business predictive models

Journal: Journal of Big Data, Volume: 12, Issue: 1
DOI: https://doi.org/10.1186/s40537-025-01313-4
Publication Date: 2025-12-09
Author(s): Khaled Mahmud Sujon et al.
Primary Topic: Imbalanced Data Classification Techniques

Overview

The research addresses the challenges posed by imbalanced datasets in business data mining, particularly in high-stakes areas like financial risk prediction and customer churn analysis. It highlights the shortcomings of traditional evaluation metrics (accuracy, precision, recall, F1-score, and the Matthews Correlation Coefficient, MCC) in providing reliable performance assessments under real-world conditions. The study presents a comprehensive evaluation framework that incorporates threshold sensitivity, input noise, and interpretability, using two benchmark datasets: Default of Credit Card Clients and Telco Customer Churn. Five machine learning models (Logistic Regression, Decision Tree, Random Forest, Extreme Gradient Boosting, and k-Nearest Neighbors) were assessed with rigorous statistical methods, including bootstrap confidence intervals and analysis of variance (ANOVA).
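To make the bootstrap confidence intervals mentioned above concrete, here is a minimal sketch of a percentile bootstrap for the F1-score, using only the Python standard library. The labels and predictions are hypothetical, the helper names `f1_score` and `bootstrap_ci` are illustrative, and the paper's exact resampling procedure is not described in this summary, so this should be read as one common variant rather than the authors' implementation.

```python
import random

def f1_score(y_true, y_pred):
    """F1 for binary labels, with 1 as the positive (minority) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def bootstrap_ci(y_true, y_pred, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a metric on paired labels and predictions."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        # Resample test cases with replacement, keeping label/prediction pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int(alpha / 2 * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper

# Hypothetical imbalanced test set: 20 positives, 80 negatives.
y_true = [1] * 20 + [0] * 80
y_pred = [1] * 12 + [0] * 8 + [1] * 6 + [0] * 74
point = f1_score(y_true, y_pred)
lower, upper = bootstrap_ci(y_true, y_pred, f1_score)
print(f"F1 = {point:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

A narrow interval suggests the metric estimate is stable under resampling of the test set; comparing interval widths across metrics is one way to judge which metric is more stable.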

The findings indicate that the F1-score consistently delivers the most stable and balanced evaluation across datasets and testing conditions, while MCC provides complementary insights. In contrast, accuracy and precision were found to be less robust under class imbalance. The study introduces a two-stage explainable artificial intelligence (XAI) framework using SHapley Additive exPlanations (SHAP), which enhances interpretability by linking feature contributions to variations in classification thresholds and evaluation metrics. The research offers a statistically grounded and interpretable methodology for selecting performance metrics in imbalanced classification tasks, with significant implications for model deployment in finance, marketing, and customer analytics. The authors suggest that future work extend the framework to multi-class problems and incorporate real-world business constraints.

Introduction

In the introduction, the authors emphasize the critical role of machine learning (ML) in contemporary business decision-making, particularly in applications such as customer churn prediction, fraud detection, and financial credit scoring. They highlight the prevalent issue of class imbalance in real-world datasets, which complicates both model performance and evaluation because accuracy tends to favor the majority class. Metrics like precision, recall, and the Matthews Correlation Coefficient (MCC) are suggested as more appropriate, yet they remain underutilized in business data mining.
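The point about accuracy can be illustrated with the standard confusion-matrix definitions of the five metrics. The sketch below uses hypothetical data (90 negatives, 10 positives) and illustrative function names. It compares a model that always predicts the majority class with one that finds half of the positives: both reach the same accuracy, but F1 and MCC separate them.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1 and MCC; undefined ratios are reported as 0.0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}

# Hypothetical imbalanced data: 90 negatives followed by 10 positives.
y_true = [0] * 90 + [1] * 10
always_negative = [0] * 100                           # never predicts the minority class
partial = [1] * 5 + [0] * 85 + [1] * 5 + [0] * 5      # 5 false alarms, 5 hits, 5 misses

baseline = metrics(y_true, always_negative)
model = metrics(y_true, partial)
print(baseline)  # accuracy 0.9, all other metrics 0.0
print(model)     # accuracy 0.9, precision/recall/F1 0.5, MCC about 0.444
```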

The authors identify several gaps in the existing literature, including a lack of comprehensive evaluations of performance metrics, insufficient exploration of advanced statistical techniques for metric selection, and the need for noise robustness testing in business contexts. They propose a study that addresses these gaps through a detailed statistical validation of five key performance metrics, sensitivity analyses of classification thresholds, and an innovative two-stage explainable AI (XAI) framework that uses SHAP for feature impact analysis. The research questions guiding their investigation focus on the reliability of various metrics, the effects of threshold sensitivity, the application of statistical methods for metric validation, the impact of noise on metric stability, and the interpretative power of XAI techniques. The subsequent sections of the paper cover the literature review, methodology, experimental results, discussion, and conclusions.
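One simple way to probe the impact of noise on metric stability is to perturb the inputs with zero-mean Gaussian noise of increasing standard deviation and track how a metric changes. The sketch below assumes a hypothetical fixed linear scorer (`predict`) in place of a trained model, along with made-up features and labels; the paper's actual noise protocol is not detailed in this summary.

```python
import random

def predict(row, weights=(1.0, -0.5), bias=-0.2):
    """Hypothetical fixed linear scorer standing in for a trained classifier."""
    score = sum(w * x for w, x in zip(weights, row)) + bias
    return 1 if score >= 0 else 0

def f1_score(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def noise_curve(X, y, sigmas, seed=0):
    """F1 after adding Gaussian noise N(0, sigma^2) to every feature, per sigma."""
    rng = random.Random(seed)
    curve = []
    for sigma in sigmas:
        noisy = [[x + rng.gauss(0.0, sigma) for x in row] for row in X]
        curve.append((sigma, f1_score(y, [predict(row) for row in noisy])))
    return curve

# Hypothetical two-feature data and labels.
X = [[0.1, 0.2], [0.9, 0.1], [0.5, 0.5], [0.8, 0.9], [0.3, 0.0], [0.2, 0.6]]
y = [0, 1, 1, 1, 1, 0]
curve = noise_curve(X, y, sigmas=[0.0, 0.1, 0.5, 1.0])
for sigma, score in curve:
    print(f"sigma={sigma:.1f}  F1={score:.3f}")
```

A metric whose value degrades smoothly as sigma grows is easier to rely on than one that swings sharply under small perturbations.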

Methods

The study outlines a comprehensive methodology for evaluating and selecting optimal performance metrics for machine learning models in business data mining, with a focus on large imbalanced datasets. It employs a multi-faceted analytical framework that integrates machine learning, advanced statistical techniques, and explainable AI (XAI). The framework, illustrated in Figure 1, proceeds through several analysis phases to ensure a systematic approach.

The experimental results are derived from extensive evaluations on two real-world benchmark datasets: the Default of Credit Card Clients dataset and the Telco Customer Churn dataset. These datasets differ in size and imbalance level and serve as representative examples of practical business prediction scenarios. For the larger dataset, a comprehensive suite of evaluations was performed, including exploratory data analysis, cross-validation, and various robustness tests. The analysis of the smaller dataset focused on key evaluations such as cross-validation and Gaussian noise robustness testing, supplemented by advanced statistical assessments like McNemar’s test and Cohen’s kappa to evaluate model agreement and classification significance. SHAP-based techniques were then used to analyze feature importance across different performance metrics and classification thresholds, with the aim of making the findings robust, interpretable, and applicable across diverse business data mining contexts.
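For readers unfamiliar with the two agreement statistics named above, the following is a minimal standard-library sketch. McNemar’s test is shown in its exact binomial form, which is one of several variants (the summary does not say which the paper used), and the predictions are hypothetical.

```python
import math
from collections import Counter

def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on the cases where exactly one model is correct."""
    b = c = 0
    for t, a, p in zip(y_true, pred_a, pred_b):
        if a == t and p != t:
            b += 1          # model A right, model B wrong
        elif a != t and p == t:
            c += 1          # model A wrong, model B right
    n = b + c
    if n == 0:
        return b, c, 1.0    # the models never disagree on correctness
    tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

def cohen_kappa(labels_a, labels_b):
    """Agreement between two label sequences, corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # undefined when both sequences use one identical label
    return (observed - expected) / (1 - expected)

# Hypothetical predictions from two models on the same ten test cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred_a = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
pred_b = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

b, c, p_value = mcnemar_exact(y_true, pred_a, pred_b)
kappa = cohen_kappa(pred_a, pred_b)
print(b, c, p_value)    # 2 1 1.0
print(round(kappa, 3))  # 0.286
```

With so few discordant cases the test cannot distinguish the two models, which is why such comparisons are normally run on the full test set.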

Results

The results of the 5-fold stratified cross-validation, presented in Table 6, show that Logistic Regression (LR) achieved the highest accuracy (0.803) but a recall of only 0.548, meaning it missed roughly 45% of churned customers. Random Forest (RF) followed closely with an accuracy of 0.793 and an even lower recall of 0.493, which is problematic for imbalanced classification scenarios.
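Stratified cross-validation keeps the class ratio roughly constant in every fold, which matters when the positive class is rare. The sketch below is a simplified standard-library version with an illustrative function name and hypothetical labels; it is not the splitter the authors used, and libraries such as scikit-learn provide more careful implementations.

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs with similar class ratios per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets its share of that class.
        for pos, i in enumerate(indices):
            folds[pos % k].append(i)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test

# Hypothetical labels: 25 positives and 75 negatives.
labels = [1] * 25 + [0] * 75
splits = list(stratified_kfold(labels, k=5))
for train, test in splits:
    positives = sum(labels[i] for i in test)
    print(len(train), len(test), positives)  # 80 20 5 for every fold
```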

In contrast, XGBoost (XGB) attained the highest F1-score of 0.562, reflecting a more favorable balance between precision (0.611) and recall (0.521). This finding underscores the utility of the F1-score as a robust metric for evaluating performance in business data mining, particularly with imbalanced datasets. Both k-Nearest Neighbors (KNN) and Decision Tree (DT) performed poorly relative to the other models, particularly in terms of precision and MCC, indicating their limited effectiveness for the task at hand.
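As a quick sanity check, the reported XGBoost F1-score can be reproduced from the reported precision and recall using the harmonic-mean definition. The function name is illustrative. An F1 averaged over folds need not equal the harmonic mean of averaged precision and recall in general, but here the two agree to three decimals.

```python
def f1_from(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

xgb_f1 = f1_from(0.611, 0.521)
print(round(xgb_f1, 3))  # 0.562
```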

Discussion

The discussion highlights the persistent challenges posed by imbalanced datasets in business data mining, particularly in high-stakes areas like credit risk and fraud detection. It emphasizes that traditional evaluation metrics such as accuracy are inadequate because they favor the majority class, leading to biased models and misleading performance assessments. The authors advocate alternative metrics like the F1-score and the Matthews Correlation Coefficient (MCC), which have been shown to provide more reliable evaluations across varying imbalance ratios. Despite these recommendations, many studies continue to rely on suboptimal practices, often neglecting statistical validation and explainability in their metric selection.

The paper further critiques the existing literature for treating performance metrics and explainability as separate concerns, and proposes a novel two-stage explainable AI (XAI) framework that integrates metric evaluation with SHAP (SHapley Additive exPlanations) visualizations. This framework supports analysis of how feature contributions relate to metric behavior across thresholds, thereby enhancing interpretability. The authors present a unified, statistically grounded approach to performance evaluation, employing rigorous statistical tests (ANOVA, McNemar’s test) and threshold sensitivity analysis on two real-world imbalanced datasets: Default of Credit Card Clients and Telco Customer Churn. The methodology aims to bridge the gap in the literature by providing a generalizable framework for metric selection that is both statistically validated and explainability-aware, ultimately enhancing the practical utility of performance evaluations in business contexts.
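Threshold sensitivity analysis can be sketched as a sweep over candidate decision thresholds, recomputing the metrics at each one. The scores, labels, thresholds, and function name below are hypothetical and serve only to show how precision, recall, and F1 move as the threshold changes; the paper's own sweep settings are not given in this summary.

```python
def threshold_sweep(y_true, scores, thresholds):
    """Return (threshold, precision, recall, f1) for each candidate threshold."""
    rows = []
    for thr in thresholds:
        y_pred = [1 if s >= thr else 0 for s in scores]
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        rows.append((thr, precision, recall, f1))
    return rows

# Hypothetical predicted probabilities for ten cases (three positives).
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.40, 0.60, 0.70, 0.90]

rows = threshold_sweep(y_true, scores, thresholds=[0.3, 0.5, 0.7])
for thr, precision, recall, f1 in rows:
    print(f"thr={thr:.1f}  precision={precision:.3f}  recall={recall:.3f}  F1={f1:.3f}")

best_threshold = max(rows, key=lambda row: row[3])[0]
print(best_threshold)  # 0.7
```

In this toy data, raising the threshold trades recall for precision, and the F1-optimal threshold is not the default 0.5. This is the kind of behavior a threshold sensitivity analysis is meant to expose.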

Limitations

The limitations of this study highlight several areas for future research on evaluating performance metrics in imbalanced business data mining. First, while the analysis used two real-world datasets, one large (credit card default) and one small (customer churn), both come from structured tabular domains. Future investigations should explore a wider variety of dataset sizes, structures, and industries to validate the generalizability of the findings. Second, the study covered five commonly used machine learning models, including linear and tree-based methods, and excluded deep learning architectures, which may exhibit different performance and explainability patterns, especially in high-dimensional or unstructured data scenarios.

Additionally, although the primary analysis centered on five key evaluation metrics (accuracy, precision, recall, F1-score, and MCC), alternative metrics such as Area Under the Curve (AUC), G-Mean, Balanced Accuracy, and Information-Based Accuracy (IBA) were analyzed only descriptively. Future research could benefit from a more rigorous statistical examination of these metrics. Furthermore, while a cost-sensitive evaluation framework was introduced, the study did not incorporate domain-specific constraints such as real-time decision thresholds and risk-based optimization. Lastly, because the study focuses on classification-based predictive modeling, extending the framework to other decision-support contexts, such as forecasting or optimization, is a promising direction for future work.