Enhancing liver disease diagnosis with hybrid SMOTE-ENN balanced machine learning models: an empirical analysis of Indian patient liver disease datasets

Journal: Frontiers in Medicine, Volume: 12
DOI: https://doi.org/10.3389/fmed.2025.1502749
PMID: https://pubmed.ncbi.nlm.nih.gov/40495970
Publication Date: 2025-05-27
Author(s): Ritu Rani et al.
Primary Topic: Artificial Intelligence in Healthcare

Overview

The paper addresses liver disease, a major global health threat responsible for millions of deaths each year. It evaluates several machine learning algorithms, including Logistic Regression, K-Nearest Neighbors (KNN), Support Vector Machine, and others, for diagnosing chronic liver disease on an imbalanced dataset, the Indian Liver Patient Dataset (ILPD). To counter the imbalance, the authors propose two hybrid models, SMOTEENN-KNN and SMOTEENN-AdaBoost (also written KNN-SMOTE-ENN and AdaBoost-SMOTE-ENN), which pair SMOTE-ENN data balancing with a KNN classifier and an AdaBoost ensemble, respectively, to improve predictive accuracy.

Both hybrid models reach the same accuracy of 91.89%, but AdaBoost-SMOTE-ENN shows better precision and recall than KNN-SMOTE-ENN. The study concludes that AdaBoost-SMOTE-ENN is the better fit for real-time applications because its inference time is shorter than that of KNN-SMOTE-ENN. The authors suggest that future work could incorporate deep learning models and imaging methods to improve liver disease prediction further. Overall, the findings underscore the importance of timely diagnosis and the effectiveness of hybrid machine learning approaches for liver disease.

Introduction

The introduction highlights the global impact of liver disease, which accounts for approximately 2 million deaths annually, about 4% of all deaths. Liver cancer and cirrhosis are notable contributors, with cirrhosis ranked as the 11th leading cause of death worldwide. Liver disease arises from several factors, including alcohol misuse, viral hepatitis, and fatty liver disease, and can lead to severe complications such as liver failure and cancer. The liver, a vital organ responsible for over 500 functions, is crucial for maintaining overall health.

The study uses the Indian Liver Patient Dataset (ILPD), which is class-imbalanced, as real-world clinical data often is. This imbalance is a problem for machine learning (ML) algorithms, which typically assume balanced class distributions and therefore tend to favor the majority class. The introduction also outlines key challenges in liver disease detection: the disease is asymptomatic in its early stages, public awareness is limited, and there is a shortage of specialized clinical centers equipped to diagnose liver conditions effectively.
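To make the effect of imbalance concrete, the following minimal sketch uses the ILPD class counts given in the Methods section (416 liver patients, 167 non-patients) and a hypothetical classifier that always predicts the majority class. The function names are illustrative and not taken from the paper. Such a classifier reaches about 71% accuracy while never identifying a single non-patient, which is why accuracy alone can be misleading on this dataset.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, label):
    """Fraction of samples whose true class is `label` that were predicted as `label`."""
    preds_for_label = [p for t, p in zip(y_true, y_pred) if t == label]
    if not preds_for_label:
        return 0.0
    return sum(p == label for p in preds_for_label) / len(preds_for_label)


# Class counts as reported for ILPD: 416 liver patients (1), 167 non-patients (0).
y_true = [1] * 416 + [0] * 167

# Hypothetical degenerate classifier: always predict the majority class.
y_pred = [1] * len(y_true)

baseline_accuracy = accuracy(y_true, y_pred)
majority_recall = recall(y_true, y_pred, label=1)
minority_recall = recall(y_true, y_pred, label=0)

print(f"accuracy        = {baseline_accuracy:.4f}")
print(f"majority recall = {majority_recall:.4f}")
print(f"minority recall = {minority_recall:.4f}")
```

Any model evaluated on ILPD should therefore be compared against this baseline and judged on per-class precision and recall as well as accuracy.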

Methods

The methodology focuses on predicting liver disease using the Indian Liver Patient Dataset (ILPD), which comprises 416 records of liver patients and 167 records of patients without liver disease. The dataset includes attributes such as age, sex, and biochemical markers (e.g., bilirubin, albumin, alkaline phosphatase, and liver enzyme levels). Preprocessing handled missing values by replacing null entries in the “Albumin_and_Globulin_Ratio” column with the column mean, and duplicate records were identified and removed. A correlation matrix was used to assess relationships among features, guiding the selection of relevant attributes and the removal of those with negligible correlation to the target variable.
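The preprocessing steps described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' code: the toy rows, the `Age` and `Target` column names, and the helper names are hypothetical; only the `Albumin_and_Globulin_Ratio` column name comes from the text. In practice a library such as pandas would usually handle these steps.

```python
from statistics import mean


def impute_mean(rows, column):
    """Replace None in `column` with the mean of the observed values."""
    fill = mean(r[column] for r in rows if r[column] is not None)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def drop_duplicates(rows):
    """Keep the first occurrence of each identical record."""
    seen, unique = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0


# Toy records (hypothetical values, not taken from ILPD).
rows = [
    {"Age": 65, "Albumin_and_Globulin_Ratio": 0.9, "Target": 1},
    {"Age": 62, "Albumin_and_Globulin_Ratio": None, "Target": 1},
    {"Age": 17, "Albumin_and_Globulin_Ratio": 1.3, "Target": 0},
    {"Age": 17, "Albumin_and_Globulin_Ratio": 1.3, "Target": 0},  # duplicate
    {"Age": 40, "Albumin_and_Globulin_Ratio": 1.1, "Target": 0},
]

clean = drop_duplicates(impute_mean(rows, "Albumin_and_Globulin_Ratio"))

# Correlation of each feature with the target, as one column of a correlation matrix.
target = [r["Target"] for r in clean]
correlations = {
    name: pearson([r[name] for r in clean], target)
    for name in ("Age", "Albumin_and_Globulin_Ratio")
}
print(correlations)
```

Features whose absolute correlation with the target falls below some chosen cutoff could then be dropped; the summary does not state the cutoff the authors used.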

The proposed methodology addresses the imbalanced dataset in several steps. The data were first cleaned and denoised, with outliers detected using the Z-score method. Balancing techniques were then applied to generate synthetic samples for the minority class, followed by feature normalization with StandardScaler so that all features share a consistent scale. Several machine learning models, including K-nearest neighbors, logistic regression, and support vector machines, were used for classification, with hyperparameters tuned by grid search and 5-fold cross-validation to optimize performance and mitigate overfitting. The aim is to improve the predictive accuracy of liver disease diagnosis while keeping the models robust.
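A minimal standard-library sketch of three of these steps follows: Z-score outlier removal, standardization, and 5-fold index splitting. It makes several assumptions that the summary does not confirm: the cutoff of |z| > 3 is a common convention rather than the authors' stated value, the toy data are invented, and the balancing step that the paper places between outlier removal and scaling is omitted here. Standardization uses the population standard deviation, which is what scikit-learn's StandardScaler does.

```python
from statistics import mean, pstdev


def zscores(values):
    """Z-scores using the population standard deviation; zeros if constant."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def remove_outliers(rows, threshold=3.0):
    """Drop rows in which any feature has |z| above the (assumed) threshold."""
    z_cols = [zscores(col) for col in zip(*rows)]
    return [
        row for i, row in enumerate(rows)
        if all(abs(z[i]) <= threshold for z in z_cols)
    ]


def standardize(rows):
    """Scale every feature column to zero mean and unit variance."""
    scaled_cols = [zscores(col) for col in zip(*rows)]
    return [list(r) for r in zip(*scaled_cols)]


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) for k contiguous folds, without shuffling."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


# Toy data: 20 ordinary rows plus one row with an extreme second feature.
data = [[float(i % 5), 10.0 + (i % 3)] for i in range(20)] + [[2.0, 500.0]]

filtered = remove_outliers(data)
scaled = standardize(filtered)
folds = list(k_fold_indices(len(scaled), k=5))

print(len(data), "->", len(filtered), "rows after outlier removal")
print("validation fold sizes:", [len(val) for _, val in folds])
```

In a grid search, each candidate hyperparameter setting would be trained on the `train` indices and scored on the `val` indices of every fold, and the setting with the best average validation score would be kept.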

Results

The authors present the results of the implemented hybrid models through a comparative analysis of their performance. The hybrid models outperform the traditional approaches, particularly on imbalanced data. The traditional methodology does not adequately handle class imbalance, which leads to suboptimal results.

The section also outlines the system workflow under the traditional approach, giving a clear contrast with the improved performance of the hybrid models. This comparison underscores the value of combining data balancing with classification to improve predictive accuracy on imbalanced datasets.

Discussion

The discussion highlights the challenges and advances in detecting liver disease, focusing on the limitations of traditional diagnostic methods and the potential of machine learning (ML) algorithms. Traditional approaches, which include liver function tests, imaging techniques, and biopsies, are often invasive and costly, and may miss early-stage liver disease. The authors emphasize that machine learning can support earlier detection and treatment initiation, thereby improving patient outcomes. However, they note that many existing datasets, such as the Indian Liver Patient Dataset (ILPD), are imbalanced, which reduces the predictive accuracy of conventional ML models.

To address these challenges, the paper discusses various balancing techniques and hybrid ML models that combine traditional algorithms with resampling methods such as the Synthetic Minority Oversampling Technique (SMOTE) and Edited Nearest Neighbors (ENN). The study reports that the proposed hybrid ensemble model, which integrates Recursive Feature Elimination (RFE) for feature selection and SMOTE-ENN for data balancing, achieved an accuracy of 93.2% on the ILPD dataset. This model outperformed traditional models and also generalized when applied to the BUPA Liver Disorders Dataset. The authors conclude by underscoring the importance of addressing dataset imbalance and using hybrid models to improve liver disease detection, and point to this as a direction for future research.
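The idea behind SMOTE-ENN can be illustrated with a small standard-library sketch. It is a simplified teaching version, not the authors' implementation: SMOTE creates synthetic minority points by interpolating between a minority sample and one of its nearest minority neighbors, and ENN then removes every sample whose label disagrees with the majority vote of its nearest neighbors. The neighbor counts (5 and 3), the toy data, and all function names are assumptions. For real work, the imbalanced-learn package offers a `SMOTEENN` class in `imblearn.combine`.

```python
import math
import random
from collections import Counter


def nearest(idx, points, k):
    """Indices of the k points closest to points[idx], excluding idx itself."""
    others = (j for j in range(len(points)) if j != idx)
    return sorted(others, key=lambda j: math.dist(points[idx], points[j]))[:k]


def smote(minority, n_new, k, rng):
    """Create n_new synthetic points on segments between minority neighbors."""
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        j = rng.choice(nearest(i, minority, k))
        gap = rng.random()  # position along the segment from point i to point j
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(minority[i], minority[j]))
        )
    return synthetic


def edited_nearest_neighbours(X, y, k):
    """Keep only samples whose label matches the majority of their k neighbors."""
    keep = [
        i for i in range(len(X))
        if Counter(y[j] for j in nearest(i, X, k)).most_common(1)[0][0] == y[i]
    ]
    return [X[i] for i in keep], [y[i] for i in keep]


def smote_enn(X, y, minority_label, k_smote=5, k_enn=3, seed=0):
    """Oversample the minority class to parity, then clean with ENN."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_new = max(len(X) - 2 * len(minority), 0)
    new_points = smote(minority, n_new, k_smote, random.Random(seed))
    return edited_nearest_neighbours(
        list(X) + new_points, list(y) + [minority_label] * len(new_points), k_enn
    )


# Toy data: a majority cluster near the origin, a minority cluster near (5, 5),
# and one mislabeled majority point sitting inside the minority cluster.
majority = [(0.1 * a, 0.1 * b) for a in range(4) for b in range(3)]
minority = [(5.0, 5.0), (5.2, 5.1), (5.1, 4.9), (4.9, 5.2)]
noise = [(5.05, 5.05)]

X = majority + noise + minority
y = [0] * (len(majority) + len(noise)) + [1] * len(minority)

X_res, y_res = smote_enn(X, y, minority_label=1)
print("before:", Counter(y))
print("after: ", Counter(y_res))
```

On this toy data the oversampling step brings the minority class up to the size of the majority class, and the ENN step then discards the mislabeled majority point inside the minority cluster, which is the combined balancing and cleaning effect the hybrid models rely on.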