Addressing Imbalance in Health Datasets: A New Method NR-Clustering SMOTE and Distance Metric Modification

Journal: Computers, Materials & Continua, Volume: 82, Issue: 2
DOI: https://doi.org/10.32604/cmc.2024.060837
Publication Date: 2025-01-01
Author(s): Didik Dwi Prasetya et al.
Primary Topic: Artificial Intelligence in Healthcare

Overview

The study addresses the challenges that imbalanced datasets pose in machine learning, particularly in classification tasks where minority classes are underrepresented and models become biased as a result. To mitigate these issues, it introduces NR-Clustering SMOTE, a method that tackles both the noise and the class overlap that the traditional Synthetic Minority Over-sampling Technique (SMOTE) can introduce into synthetic minority-class data. The proposed method consists of three stages: (1) filtering noise out of the minority-class data with the k-nearest neighbors (k-NN) method, (2) clustering the data with K-means to establish decision boundaries, and (3) applying SMOTE with a modified Manhattan distance metric within each cluster to reduce overlap.
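The three stages can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: a minority sample is treated as noise when all of its k nearest neighbours belong to the majority class, K-means is run on the filtered minority samples only, the number of synthetic samples is split evenly across clusters, and plain Manhattan distance stands in for the paper's modified metric, whose exact form is not given in this summary. All function names are hypothetical.

```python
import random

def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def sq_euclidean(a, b):
    """Squared Euclidean distance, used here only for the K-means assignment."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def filter_noise(minority, majority, k=3):
    """Stage 1 (assumed rule): drop a minority point when all of its k nearest
    neighbours, searched over both classes, belong to the majority class."""
    kept = []
    for i, p in enumerate(minority):
        labelled = [(manhattan(p, q), True) for j, q in enumerate(minority) if j != i]
        labelled += [(manhattan(p, q), False) for q in majority]
        nearest = sorted(labelled, key=lambda pair: pair[0])[:k]
        if any(is_minority for _, is_minority in nearest):
            kept.append(p)
    return kept

def kmeans(points, n_clusters, rng, iters=20):
    """Stage 2: plain Lloyd's K-means; returns the non-empty clusters."""
    centers = rng.sample(points, n_clusters)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: sq_euclidean(p, centers[c]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its members (keep it if empty).
        centers = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return [cl for cl in clusters if cl]

def smote_in_cluster(cluster, n_new, rng, k=3):
    """Stage 3: SMOTE interpolation inside one cluster, with neighbours ranked
    by plain Manhattan distance (a stand-in for the modified metric)."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(cluster)
        others = [q for q in cluster if q is not base]
        neighbours = sorted(others, key=lambda q: manhattan(base, q))[:k]
        target = rng.choice(neighbours)
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + lam * (t - b) for b, t in zip(base, target)])
    return synthetic

def nr_clustering_smote(minority, majority, n_clusters=2, k=3, seed=0):
    """Run the three stages; returns (filtered minority, synthetic samples)."""
    rng = random.Random(seed)
    clean = filter_noise(minority, majority, k)
    clusters = kmeans(clean, min(n_clusters, len(clean)), rng)
    usable = [cl for cl in clusters if len(cl) >= 2]  # need a neighbour to interpolate
    deficit = max(len(majority) - len(clean), 0)
    synthetic = []
    for i, cl in enumerate(usable):
        # Assumption: spread the deficit as evenly as possible over the clusters.
        share = deficit // len(usable) + (1 if i < deficit % len(usable) else 0)
        synthetic += smote_in_cluster(cl, share, rng, k)
    return clean, synthetic

# Toy data: two minority groups plus one minority point inside the majority region.
minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
            [5.0, 5.0], [5.1, 4.9], [4.9, 5.2],
            [9.0, 0.0]]
majority = [[9.0, 0.2], [8.8, 0.1], [9.1, -0.1], [8.9, -0.2], [9.2, 0.0],
            [8.7, 0.0], [9.0, 0.3], [9.3, 0.1], [8.6, 0.2], [9.0, -0.3]]
clean, synthetic = nr_clustering_smote(minority, majority)
print(len(clean), "minority samples kept,", len(synthetic), "synthetic samples added")
```

Because interpolation happens only between members of the same cluster, each synthetic point stays on a segment inside one minority group instead of bridging two groups across a region that may belong to the majority class.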

Empirical results show that NR-Clustering SMOTE significantly improves classification accuracy across several classifiers, including Random Forest, SVM, and Naïve Bayes. Specifically, it improves accuracy by 15.34% on the Pima dataset and 20.96% on the Haberman dataset compared to SMOTE-LOF, and it shows consistent improvements over other variants such as Radius-SMOTE and RN-SMOTE. The study suggests that future research could refine the approach by handling small disjuncts through cluster density analysis, extending it to multi-class imbalance, and integrating it with deep learning frameworks for large-scale health data analytics.

Introduction

The introduction emphasizes the significance of dataset balancing in machine learning, particularly for classification tasks involving imbalanced datasets, where minority classes are often underrepresented. This underrepresentation can lead to biased models, prompting the development of the Synthetic Minority Over-Sampling Technique (SMOTE) to generate synthetic data for minority classes. Despite its widespread application, SMOTE has notable weaknesses, including the introduction of noise and overlapping in synthetic data, which can adversely affect classification performance.
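For reference, the interpolation step of standard SMOTE can be written as follows. The notation is ours rather than the paper's: x_i is a minority sample, x_nn is one of its k nearest minority-class neighbours chosen at random, and λ is a random weight.

```latex
x_{\text{new}} = x_i + \lambda \, (x_{nn} - x_i), \qquad \lambda \sim \mathcal{U}(0, 1)
```

Every synthetic point lies on the segment between x_i and x_nn. If either endpoint is a noisy sample, or if the segment passes through a region dominated by the majority class, the new point inherits that problem, which is how noise and overlap can enter the synthetic data.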

Several studies have tried to improve SMOTE by addressing these limitations. For instance, SMOTE-LOF and Radius-SMOTE focus on identifying and removing noise but do not establish clear decision boundaries. Other approaches, such as RN-SMOTE, use clustering to detect and eliminate noise after oversampling. However, many of these studies do not address noise and overlap at the same time. This research proposes a novel method, Noise Reduction-Clustering SMOTE with Manhattan Distance (NR-Clustering SMOTE), which integrates filtering, clustering, and a modified distance metric to improve data balancing. Its key contributions are the application of the k-nearest neighbors (k-NN) method for noise filtering, K-means clustering for establishing decision boundaries, and a modified Manhattan distance in SMOTE to minimize overlap, with the aim of improving model robustness and generalization in healthcare data classification. The method is evaluated with three classifiers (Random Forest, SVM, and Naïve Bayes), using accuracy, F1-measure, and AUC as performance metrics.
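The three metrics named above can be computed as in the following standard-library Python sketch. The function names are hypothetical and the toy labels are invented for illustration; none of the numbers come from the paper. The example also shows why accuracy alone can mislead on imbalanced data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_measure(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive (minority) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives, so the score is reported as zero
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def auc(y_true, scores, positive=1):
    """Probability that a random positive is scored above a random negative
    (ties count one half); this equals the area under the ROC curve."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A classifier that always predicts the majority class on a 9:1 dataset.
y_true = [0] * 9 + [1]
always_majority = [0] * 10
print(accuracy(y_true, always_majority))    # 0.9
print(f1_measure(y_true, always_majority))  # 0.0
```

The always-majority classifier scores 90% accuracy while never detecting a minority case, which is why F1-measure and AUC are reported alongside accuracy. In practice a library such as scikit-learn would normally be used for these metrics.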

Discussion

The discussion section outlines the methodology and findings for the proposed NR-Clustering SMOTE (Noise Reduction Clustering Synthetic Minority Over-sampling Technique), which targets class imbalance in health datasets, specifically the Pima and Haberman datasets. The Pima dataset consists of 768 instances with an imbalance ratio of 1.87, while the Haberman dataset contains 306 instances with an imbalance ratio of 2.78. The NR-Clustering SMOTE method involves three key steps: filtering noise out of the minority class with the k-nearest neighbors (k-NN) method, clustering the data with K-means to establish decision boundaries, and applying SMOTE with a modified Manhattan distance metric to generate synthetic minority-class instances. This approach reduces noise and overlap in the data, which improves classification performance.
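The imbalance figures quoted above are consistent with a majority-to-minority count ratio rather than a percentage. The short check below assumes the class counts of the commonly distributed versions of the two datasets (Pima: 500 negative and 268 positive; Haberman: 225 and 81), which this summary does not state but which match the instance totals of 768 and 306.

```python
def imbalance_ratio(n_majority, n_minority):
    """Number of majority-class instances per minority-class instance."""
    return n_majority / n_minority

# Class counts are assumed from the widely used public versions of the datasets.
pima = imbalance_ratio(500, 268)
haberman = imbalance_ratio(225, 81)
print(round(pima, 2), round(haberman, 2))  # 1.87 2.78
```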

The results show that NR-Clustering SMOTE significantly outperforms traditional SMOTE and other modifications across several classifiers, including Random Forest, Support Vector Machine (SVM), and Naïve Bayes. For instance, with NR-Clustering SMOTE the Random Forest classifier reached an accuracy of 89.56% on the Pima dataset, higher than with standard SMOTE. The method also improved accuracy by 15.34% on the Pima dataset and 20.96% on the Haberman dataset compared to SMOTE-LOF, and it consistently outperformed other SMOTE variants. The study concludes that NR-Clustering SMOTE is a robust solution for mitigating class imbalance, although future research could explore its application to multi-class scenarios and address challenges related to small disjuncts within the minority class.