Cyclical hybrid imputation technique for missing values in data sets

Journal: Scientific Reports, Volume: 15, Issue: 1
DOI: https://doi.org/10.1038/s41598-025-90964-7
PMID: https://pubmed.ncbi.nlm.nih.gov/39994302
Publication Date: 2025-02-24
Author(s): Kurban Kotan et al.
Primary Topic: Statistical Methods and Inference

Overview

The paper by Kurban Kotan and Serdar Kırışoğlu addresses missing data in datasets, particularly in health-related modeling, where accurate imputation is essential to prediction performance. The authors emphasize that improper handling of missing values can significantly degrade model accuracy and reliability. They argue that imputing with artificial intelligence models, rather than with simple statistics such as the mean or mode, leads to better outcomes.

To address these challenges, the authors propose a novel cyclical hybrid imputation algorithm that integrates row-based and column-based techniques. The algorithm estimates missing values by combining column-based features, which consider the individual data points, with row-based features, which draw on the overall structure of the data. The authors report that the proposed algorithm achieved 100% accuracy on three different datasets and outperformed existing imputation methods, which highlights its potential for improving data quality in predictive modeling.
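
The combination of a column-based step with repeated row-based steps can be illustrated with a minimal sketch. This is not the authors' published procedure, and this summary does not specify the details of the cycle. The sketch assumes a column-mean start, a nearest-neighbor re-estimate in each cycle, and a stop when the estimates no longer change. The function names and the toy data are hypothetical.

```python
import math


def column_means(rows):
    """Mean of the observed (non-None) values in each column."""
    means = []
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return means


def knn_estimate(filled, i, j, k):
    """Average column j over the k rows closest to row i on all other columns."""
    others = [c for c in range(len(filled[i])) if c != j]
    scored = []
    for r, row in enumerate(filled):
        if r == i:
            continue
        dist = math.sqrt(sum((row[c] - filled[i][c]) ** 2 for c in others))
        scored.append((dist, row[j]))
    scored.sort(key=lambda pair: pair[0])
    nearest = scored[:k]
    return sum(value for _, value in nearest) / len(nearest)


def cyclical_hybrid_impute(rows, k=2, max_cycles=10, tol=1e-6):
    """Assumed scheme: column-mean start, then cyclic row-based refinement."""
    missing = [(i, j) for i, row in enumerate(rows)
               for j, v in enumerate(row) if v is None]
    means = column_means(rows)
    # Column-based step: start every missing cell at its column mean.
    filled = [[means[j] if v is None else v for j, v in enumerate(row)]
              for row in rows]
    # Row-based cycles: re-estimate each missing cell from similar rows
    # until the estimates stop changing or the cycle budget is used up.
    for _ in range(max_cycles):
        largest_change = 0.0
        for i, j in missing:
            new_value = knn_estimate(filled, i, j, k)
            largest_change = max(largest_change, abs(new_value - filled[i][j]))
            filled[i][j] = new_value
        if largest_change < tol:
            break
    return filled


# Toy data (made up): the second column is ten times the first.
data = [
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [9.0, None],
    [10.0, 100.0],
    [8.0, 80.0],
]
result = cyclical_hybrid_impute(data)
```

In this toy case the column mean alone would fill the gap with 48.0, while the row-based cycle moves the estimate toward the values of the two most similar rows. When several cells are missing, each cycle reuses the estimates produced earlier, which is the reason for repeating the pass.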

Methods

Three datasets were used to investigate missing data imputation methods in healthcare. The primary dataset, the Chronic Kidney Disease (CKD) dataset, comprises clinical and laboratory results from over 400 patients, with 24 features and a diagnostic label. Sourced from the UCI Machine Learning Repository, it is notable for its high proportion of missing values, which makes it well suited to evaluating machine learning imputation techniques. The second dataset, the Heart Disease Dataset from the Cleveland Clinic Foundation, includes 13 features related to coronary artery disease diagnosis. The third, the Mice Protein Expression Dataset from the University of London, contains protein expression levels from mouse brains across various stages of Huntington’s disease.

The study employed several machine learning models for row-based imputation: K-Nearest Neighbors (KNN), Linear Regression, Support Vector Regression (SVR), Decision Tree Regression, Random Forest Regression, and Gradient Boosting Regression together with its histogram-based variant. KNN estimates a missing value by averaging the values of the nearest neighbors, while Linear Regression assumes a linear relationship between the other features and the value being estimated. SVR adapts the Support Vector Machine algorithm to continuous target values, aiming to keep errors within a defined margin. Decision Tree Regression uses a tree structure to make predictions based on feature relationships, and Random Forest Regression aggregates the predictions of multiple decision trees to improve accuracy. Gradient Boosting and its histogram-based variant train weak models sequentially, each correcting the errors of the previous ones, and the histogram-based variant is aimed at large datasets. Deep learning techniques based on artificial neural networks were also applied to capture complex, nonlinear relationships in the data.
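
The KNN case described above can be sketched in a few lines. This is a minimal illustration, not the study's implementation: it assumes Euclidean distance over the features observed in the incomplete row, uses only donor rows that are complete on those features, and the helper name `knn_impute_cell` and the toy values are hypothetical.

```python
import math


def knn_impute_cell(rows, row_idx, col_idx, k=2):
    """Estimate rows[row_idx][col_idx] from the k most similar donor rows."""
    target = rows[row_idx]
    # Features used for the distance: observed in the target row,
    # excluding the column that is being imputed.
    feature_cols = [j for j, v in enumerate(target)
                    if v is not None and j != col_idx]
    candidates = []
    for i, row in enumerate(rows):
        if i == row_idx or row[col_idx] is None:
            continue  # a donor must have the value we want to borrow
        if any(row[j] is None for j in feature_cols):
            continue  # a donor must be comparable on every feature used
        dist = math.sqrt(sum((row[j] - target[j]) ** 2 for j in feature_cols))
        candidates.append((dist, row[col_idx]))
    if not candidates:
        raise ValueError("no usable donor rows for this cell")
    candidates.sort(key=lambda pair: pair[0])
    nearest = candidates[:k]
    # KNN regression: average the target column over the nearest donors.
    return sum(value for _, value in nearest) / len(nearest)


# Toy table (made up): the last column of the last row is missing.
rows = [
    [1.0, 2.0, 10.0],
    [1.1, 2.1, 12.0],
    [5.0, 6.0, 50.0],
    [1.05, 2.05, None],
]
estimate = knn_impute_cell(rows, row_idx=3, col_idx=2, k=2)
```

The two rows closest to the incomplete row supply the estimate, so the distant third row does not influence it. In practice the features would usually be scaled first so that no single column dominates the distance.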

Results

The study uses a confusion matrix to assess classification models by comparing predicted outcomes with actual outcomes. The matrix supports the calculation of key performance metrics, including accuracy, sensitivity, and specificity. The true positive (TP) count is highlighted in particular: it is the number of instances in which the model correctly identifies the positive class, that is, the number of patients who have the disease and are correctly predicted as having it.
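
As an illustration of how these metrics follow from the confusion matrix, the sketch below counts the four cells for a binary problem and derives accuracy, sensitivity, and specificity using their standard definitions. The labels are made up and are not results from the study.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # diseased patient correctly flagged
            else:
                fp += 1  # healthy patient wrongly flagged
        else:
            if actual == positive:
                fn += 1  # diseased patient missed
            else:
                tn += 1  # healthy patient correctly cleared
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Standard definitions; assumes both classes occur in the data."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # share of actual positives found
        "specificity": tn / (tn + fp),  # share of actual negatives found
    }


# Made-up labels: 1 = disease, 0 = no disease.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
scores = metrics(tp, fp, tn, fn)
```

For these labels the counts are TP = 3, FP = 1, TN = 3, FN = 1, so all three metrics equal 0.75.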

The findings underscore the utility of the confusion matrix in evaluating model efficacy, since it gives a clear summary of classification outcomes. By analyzing these metrics, the study aims to improve understanding of the models' predictive capabilities and their potential implications in clinical settings.

Discussion

Several column-based imputation methods were evaluated for their effectiveness in handling missing data: Mean, Median, Mode, Forward/Backward Fill, KNN Imputer, and Interpolation. Mean and Median imputation fill missing entries with the average or the median of the respective column, while Mode imputation is suited to categorical data. Forward/Backward Fill is particularly useful for time series data, filling gaps with the adjacent valid value. The KNN Imputer estimates missing values from the most similar samples, and Interpolation estimates them from the existing data points around the gap. The performance of these imputation techniques was assessed using various classifiers, including KNN, Decision Trees, Random Forest, Gaussian Naïve Bayes, Multi-Layer Perceptron, Logistic Regression, and Deep Learning models.
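
The simpler column-based methods can be sketched with the Python standard library alone. This is an illustrative sketch rather than the study's code: missing entries are represented as `None`, the function names are hypothetical, and the interpolation shown is the linear kind, which is an assumption because the summary does not say which variant was used.

```python
from statistics import mean, median, mode


def fill_constant(column, value):
    """Replace every missing entry with one fixed value."""
    return [value if x is None else x for x in column]


def mean_fill(column):
    return fill_constant(column, mean(x for x in column if x is not None))


def median_fill(column):
    return fill_constant(column, median(x for x in column if x is not None))


def mode_fill(column):
    # Most frequent observed value; suited to categorical columns.
    return fill_constant(column, mode(x for x in column if x is not None))


def forward_fill(column):
    """Carry the last observed value forward; leading gaps stay None."""
    filled, last = [], None
    for x in column:
        if x is not None:
            last = x
        filled.append(last)
    return filled


def backward_fill(column):
    """Carry the next observed value backward; trailing gaps stay None."""
    return forward_fill(column[::-1])[::-1]


def linear_interpolate(column):
    """Fill interior gaps on a straight line between observed neighbors."""
    filled = list(column)
    known = [i for i, x in enumerate(column) if x is not None]
    for left, right in zip(known, known[1:]):
        step = (column[right] - column[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = column[left] + step * (i - left)
    return filled


# Made-up example columns.
values = [2.0, None, 4.0, None, None, 10.0]
labels = ["yes", None, "no", "yes"]
```

For `values`, forward fill repeats 2.0 and 4.0 into the gaps, while linear interpolation yields 3.0, 6.0, and 8.0. This shows why the two methods suit ordered data such as time series, whereas the mean, median, and mode ignore row order entirely.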

The proposed Cyclical Hybrid Imputation Technique (CHIT) was introduced to improve the imputation process, and it yielded significant improvements in prediction accuracy on health-related datasets, particularly the chronic kidney disease (CKD) and heart disease datasets. CHIT outperformed traditional statistical methods, achieving high accuracy across multiple classifiers, with some models reaching 100% accuracy. The results underscore the need for advanced machine learning approaches to missing data imputation, particularly in structured datasets where conventional methods fall short. Overall, the findings support integrating machine learning techniques into health data analysis to improve data quality and predictive performance.