Evaluation metrics and statistical tests for machine learning

Journal: Scientific Reports, Volume: 14, Issue: 1
DOI: https://doi.org/10.1038/s41598-024-56706-x
PMID: https://pubmed.ncbi.nlm.nih.gov/38480847
Publication Date: 2024-03-13
Author(s): Oona Rainio et al.
Primary Topic: COVID-19 diagnosis using AI

Overview

The authors survey evaluation metrics for common machine learning (ML) tasks, including binary and multi-class classification, regression, image segmentation, and object detection. They stress that statistical testing is needed to determine whether an observed difference in metric values between models is statistically significant or merely due to chance. Which test is appropriate depends on the task, the evaluation metric, and whether multiple test sets are available.

The authors highlight the limitations of traditional tests such as the paired t-test, which may underestimate variance when a metric yields only a single value from a single test set. For these situations they recommend non-parametric tests, specifically the Wilcoxon signed-rank test and Friedman’s test, as more reliable alternatives for comparing model performance. The aim is to make comparisons between ML models more robust.
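As an illustration of the kind of non-parametric comparison described above, the following is a minimal sketch of a two-sided exact Wilcoxon signed-rank test using only the Python standard library. It is not the authors' code: the function name `wilcoxon_signed_rank` and the per-fold scores are hypothetical, and the sketch assumes the two models were scored on the same test sets, so the values are paired.

```python
def wilcoxon_signed_rank(x, y):
    """Two-sided exact Wilcoxon signed-rank test for paired samples.

    Zero differences are dropped and tied absolute differences share
    their average rank. Returns (W_plus, p_value).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences (1-based), averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)

    # Exact null distribution: each rank carries a + or - sign with
    # probability 1/2. Doubling the ranks keeps half-ranks integral.
    doubled = [int(round(2 * r)) for r in ranks]
    total = sum(doubled)
    counts = [0] * (total + 1)  # counts[s] = sign patterns with 2*W_plus == s
    counts[0] = 1
    for r in doubled:
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]

    w2 = int(round(2 * w_plus))
    lower = sum(counts[: w2 + 1])   # patterns with W_plus <= observed
    upper = sum(counts[w2:])        # patterns with W_plus >= observed
    p_value = min(1.0, 2 * min(lower, upper) / 2 ** n)
    return w_plus, p_value


# Hypothetical F1-scores (in percent) of two models on eight test folds.
model_a = [90, 85, 88, 92, 79, 95, 83, 91]
model_b = [88, 80, 85, 91, 72, 89, 79, 83]
print(wilcoxon_signed_rank(model_a, model_b))  # (36.0, 0.0078125)
```

The exact enumeration is practical only for a small number of pairs; for larger samples a normal approximation, or an established statistics library, would be the usual choice.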

Results

The results of the binary classification task are presented as a contingency table (Table 3) and as evaluation metrics (Table 4). McNemar’s test indicates that the modified U-Net achieved significantly higher sensitivity for COVID-19 patients (p < 5.07e-5) but significantly lower specificity for COVID-19-negative patients (p < 0.0207). The ROC curves of the modified U-Net and InceptionV3, shown in Fig. 1, reveal no significant difference in area under the curve (AUC) according to the DeLong test (p = 0.137).

In the multi-class classification task, the modified U-Net outperformed InceptionV3 on all metrics, as confirmed by t-tests and Wilcoxon tests. The t-test p-value was 6.47e-4 for the macro-average F1-score and less than 2.38e-5 for the other metrics. Similarly, the Wilcoxon test yielded a p-value of 0.00116 for the macro-average F1-score and less than 6.37e-5 for the remaining metrics.

For the segmentation task, the median and standard deviation of the Dice and Intersection over Union (IoU) values are provided in Table 6. Statistical tests indicate that neither the Dice nor the IoU values are normally distributed. The deeper U-Net performed significantly better on both metrics. It also exhibited a higher standard deviation, but according to Levene’s test this difference was significant only for the IoU values.
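To make the McNemar comparison above more concrete, here is a minimal standard-library sketch of the exact (binomial) form of the test. It is an illustration only: the summary does not say which variant of McNemar’s test the authors used, and the function name `mcnemar_exact` and the counts are hypothetical. The test uses only the discordant pairs, that is, the cases that one model classified correctly and the other did not. One plausible way to compare sensitivities, assumed here, is to count the discordant pairs among the truly positive cases only, and likewise among the truly negative cases for specificity.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar test.

    b: cases model A classified correctly and model B incorrectly
    c: cases model B classified correctly and model A incorrectly
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Probability of a split at least as lopsided as the observed one.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts among the truly positive test cases.
print(mcnemar_exact(1, 9))  # 0.021484375
```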

Discussion

The discussion section of the paper covers evaluation methodologies for multi-class and multi-label classification, regression, image segmentation, object detection, and information retrieval. For multi-class classification, it emphasizes confusion matrices and the macro- and micro-averaging of evaluation metrics, which handle class imbalance differently. Specifically, macro-averaging treats all classes equally, while micro-averaging is influenced by class sizes. The section also introduces metrics such as Cohen’s kappa and the Matthews correlation coefficient (MCC), which can be applied to multi-class scenarios.
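The difference between the two averaging schemes can be seen in a small worked example. The sketch below is illustrative and not taken from the paper: the function names and the toy labels are hypothetical. It computes the macro- and micro-averaged F1-score for a single-label multi-class problem in which a classifier predicts only the majority class.

```python
def f1(tp, fp, fn):
    """F1-score from true-positive, false-positive and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(y_true, y_pred):
    """Return (macro-averaged F1, micro-averaged F1)."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = {c: [0, 0, 0] for c in labels}  # per class: [tp, fp, fn]
    for t, p in zip(y_true, y_pred):
        if t == p:
            counts[t][0] += 1
        else:
            counts[p][1] += 1  # false positive for the predicted class
            counts[t][2] += 1  # false negative for the true class
    # Macro: average the per-class scores, so every class weighs the same.
    macro = sum(f1(*counts[c]) for c in labels) / len(labels)
    # Micro: pool the counts first, so large classes dominate.
    tp, fp, fn = (sum(v[i] for v in counts.values()) for i in range(3))
    micro = f1(tp, fp, fn)
    return macro, micro


# Hypothetical imbalanced data: the classifier always predicts class "a".
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 10
macro, micro = macro_micro_f1(y_true, y_pred)
print(round(macro, 3), round(micro, 3))  # 0.444 0.8
```

In this example the minority class is never detected, which halves the macro-average but barely affects the micro-average. In the single-label setting the micro-averaged F1-score equals plain accuracy.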

In multi-label classification, the focus shifts to evaluating models that can assign several labels to each instance, using metrics such as Hamming loss and average precision. For regression, the paper discusses correlation coefficients, particularly Pearson’s and Spearman’s, alongside error metrics such as mean absolute error (MAE) and mean squared error (MSE). It then covers image segmentation and object detection, detailing the Sørensen-Dice coefficient and Intersection over Union (IoU) for assessing segmentation accuracy and mean average precision (mAP) for object detection. Lastly, it addresses information retrieval, where precision and recall are used to evaluate search results, and outlines statistical tests for comparing model performance, including the Wilcoxon signed-rank test for two models and Friedman’s test for multiple models.
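The two segmentation metrics mentioned above are closely related, as a short sketch shows. The code below is an illustration under stated assumptions, not the authors' implementation: masks are represented as nested lists of 0/1 values, the function name `dice_and_iou` is hypothetical, and returning 1.0 when both masks are empty is a convention chosen for this example.

```python
def dice_and_iou(pred, target):
    """Dice coefficient and IoU for two binary masks of equal shape.

    pred, target: nested lists (rows) of 0/1 values.
    """
    pairs = [
        (bool(p), bool(t))
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    ]
    inter = sum(p and t for p, t in pairs)   # pixels set in both masks
    pred_sum = sum(p for p, _ in pairs)      # pixels set in the prediction
    target_sum = sum(t for _, t in pairs)    # pixels set in the ground truth
    union = pred_sum + target_sum - inter
    if union == 0:
        return 1.0, 1.0  # both masks empty: treated as perfect agreement
    dice = 2 * inter / (pred_sum + target_sum)
    iou = inter / union
    return dice, iou


# Hypothetical 2x3 masks.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
dice, iou = dice_and_iou(pred, target)
print(round(dice, 3), round(iou, 3))  # 0.667 0.5
```

For a single pair of masks the two values are linked by Dice = 2·IoU / (1 + IoU), so they rank individual predictions identically. Averages over a test set can still differ between the two metrics.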