Deep self-supervised machine learning algorithms with a novel feature elimination and selection approaches for blood test-based multi-dimensional health risks classification

Journal: BMC Bioinformatics, Volume: 25, Issue: 1
DOI: https://doi.org/10.1186/s12859-024-05729-2
PMID: https://pubmed.ncbi.nlm.nih.gov/38459463
Publication Date: 2024-03-08
Author(s): Önder Tutsoy et al.
Primary Topic: Artificial Intelligence in Healthcare

Overview

This research paper presents a novel deep self-supervised machine learning algorithm designed to analyze 5-dimensional raw blood test data, incorporating multi-dimensional adaptive feature elimination, self-weighting, and innovative feature selection techniques. The study modifies four distinct machine learning algorithms, ranging from model-free to gradient-driven approaches, to classify health risks based on processed blood test data. Initial statistical analyses indicate no significant correlation between gender and the blood test data, while a notable correlation between Mean Platelet Volume (MPV) and Hematocrit (HTC) is observed in the 18-37 age group.

The findings demonstrate that the proposed algorithms effectively eliminate unnecessary features, assign self-importance weights, and select the most informative features for health risk classification, successfully identifying both worst-case low and worst-case high blood test values. However, limitations include a gender imbalance in the dataset, restricted diversity of abnormal data, and insufficient worst-case positive (abnormally high) data. The authors suggest that future research should explore additional multi-layer machine learning algorithms and address the potential for missing data in real-world applications. Furthermore, they propose expanding the algorithms to incorporate treatment policies that could influence future blood test responses.

Introduction

The introduction of this research paper highlights the critical role of blood in transporting essential elements and protecting against infections, emphasizing the increasing demand for blood tests due to population growth and healthcare needs. Despite their cost-effectiveness, the rising volume of blood tests places a significant burden on healthcare systems, leading to potential errors in sample collection, analysis, and diagnosis. To address these challenges, the paper proposes a novel deep self-supervised machine learning algorithm that incorporates advanced techniques such as feature elimination, self-feature weighting, and multi-dimensional feature selection for health risk classification based on blood test data.

The authors review existing statistical and machine learning approaches, noting their limitations in handling multi-dimensional, unlabeled blood test data. They argue that traditional methods often focus on single-dimensional analyses and require labeled datasets, a requirement that raw blood test results do not meet. The proposed algorithm aims to enhance predictive accuracy by employing a multi-dimensional adaptive feature elimination strategy, a self-feature weighting mechanism to categorize data, and a comprehensive feature selection process. The paper outlines the methodology for data preprocessing and the optimization of four machine learning algorithms, ultimately aiming to improve health risk classification from raw blood test data.
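The three stages named above (feature elimination, self-weighting, feature selection) can be pictured with a toy sketch. This is not the authors' algorithm: the function names, the rule "drop features that fall inside a reference range", the normalized-deviation weights, and the numeric ranges are all assumptions made for illustration, and the ranges are placeholders rather than clinical values.

```python
# Toy sketch only. Names, rules, and ranges are illustrative assumptions,
# not the paper's method and not clinical reference values.
REFERENCE_RANGES = {
    "HTC": (36.0, 50.0),
    "HGB": (12.0, 17.0),
    "WBC": (4.0, 11.0),
    "PLT": (150.0, 450.0),
    "MPV": (7.0, 12.0),
}


def signed_deviation(value, low, high):
    """Return 0.0 inside [low, high], else the signed relative distance to the nearest bound."""
    if value < low:
        return (value - low) / low
    if value > high:
        return (value - high) / high
    return 0.0


def eliminate_weight_select(sample, top_k=2):
    """Run the three toy stages on one 5-dimensional sample."""
    deviations = {
        name: signed_deviation(sample[name], low, high)
        for name, (low, high) in REFERENCE_RANGES.items()
    }
    # Elimination (assumed rule): in-range features carry no risk signal here.
    kept = {name: dev for name, dev in deviations.items() if dev != 0.0}
    # Self-weighting (assumed rule): weights proportional to |deviation|, summing to 1.
    total = sum(abs(dev) for dev in kept.values())
    weights = {name: abs(dev) / total for name, dev in kept.items()}
    # Selection: keep the top_k features by weight.
    selected = sorted(weights, key=weights.get, reverse=True)[:top_k]
    return kept, weights, selected


sample = {"HTC": 30.0, "HGB": 14.0, "WBC": 22.0, "PLT": 300.0, "MPV": 9.0}
kept, weights, selected = eliminate_weight_select(sample)
print(kept, weights, selected)
```

In this made-up sample only HTC (below range) and WBC (above range) survive elimination, and WBC receives the larger weight because its relative deviation is larger.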

Results

The results section of the research paper begins by detailing the parameters of the machine learning algorithms used, followed by an extensive analysis of their performance. The training results, as presented in Table 5, indicate that the BLS, LSLC, and SCE algorithms effectively capture the characteristics of abnormally low blood test data, as illustrated in Figure 9a. In contrast, the INN algorithm struggles to learn the labeled output around -100%, primarily due to its iterative nature, which leads to forgetting previous learnings as the sample horizon approaches 0%. The BLS algorithm demonstrates a strong ability to mitigate the effects of unknown uncertainties, closely aligning with the actual output.
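The forgetting effect attributed to iterative learning can be illustrated with a deliberately simple toy model. It is not the INN algorithm: the one-parameter model, the constant learning rate, and the ordering of the targets (first around -100%, later around 0%) are assumptions chosen only to show how a sample-by-sample update can overwrite what earlier samples taught it, whereas a batch fit weighs all samples equally.

```python
def online_fit(targets, learning_rate=0.2):
    """Fit a single constant by sample-by-sample gradient steps on squared error."""
    w = 0.0
    for t in targets:
        w += learning_rate * (t - w)  # each step pulls w toward the latest target
    return w


def batch_fit(targets):
    """Least-squares fit of a single constant over all samples at once (the mean)."""
    return sum(targets) / len(targets)


# Hypothetical target sequence: 50 samples labeled -100% (-1.0), then 50 labeled 0%.
targets = [-1.0] * 50 + [0.0] * 50

online_estimate = online_fit(targets)
batch_estimate = batch_fit(targets)
print(f"online: {online_estimate:.6f}, batch: {batch_estimate:.6f}")
```

After the second half of the sequence the online estimate sits very close to 0, having lost almost all trace of the -1.0 targets, while the batch estimate stays at the average of the two halves.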

For abnormally high blood test data, depicted in Figure 9b, all algorithms except INN fail to learn the desired output because the data are insufficient and imbalanced, resulting in poor robustness. The subsequent statistical analyses, shown in Figure 10, reveal that the BLS algorithm exhibits a significant mean error with test data due to its sensitivity to variations from matrix inversion. The INN algorithm shows similar mean errors but with nearly double the standard deviation. The LSLC algorithm has a mean error of 0.07 with training data, which increases to 0.82 with test data, although it exhibits a lower standard deviation in the latter. Notably, the SCE algorithm achieves the smallest mean error on both training and test data, with values close to those of LSLC, while also producing lower standard deviations with test data. The section concludes by indicating that the next part of the paper will summarize the findings and outline future research directions.
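The train-versus-test comparison above rests on two summary statistics per algorithm: the mean error and its standard deviation. A minimal sketch of that computation follows. The summary does not say whether errors are signed or absolute, or which standard deviation is used, so absolute errors and the population standard deviation are assumptions here, and the numbers are made up rather than taken from the paper.

```python
from statistics import mean, pstdev


def error_stats(predicted, actual):
    """Return (mean, population standard deviation) of the absolute prediction errors."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    errors = [abs(p - a) for p, a in zip(predicted, actual)]
    return mean(errors), pstdev(errors)


# Made-up outputs for one hypothetical classifier on a training and a test split.
train_mean, train_std = error_stats([-0.95, -0.40, 0.05], [-1.00, -0.50, 0.00])
test_mean, test_std = error_stats([-0.20, 0.30, 0.10], [-1.00, 0.00, 0.00])

print(f"train: mean={train_mean:.3f}, std={train_std:.3f}")
print(f"test:  mean={test_mean:.3f}, std={test_std:.3f}")
```

A test mean much larger than the training mean, as in the LSLC figures quoted above, points to a model that does not generalize well, while the standard deviation indicates how consistent the errors are across samples.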

Discussion

In the discussion section of the research paper, the authors analyze the characteristics of a deep machine learning algorithm applied to a dataset comprising 5-dimensional blood test data, specifically hematocrit (HTC), hemoglobin (HGB), white blood cell count (WBC), platelet count (PLT), and mean platelet volume (MPV). The dataset includes 58,490 samples, predominantly male (53,459 males vs. 5,031 females, or about 91% male), which raises concerns about potential biases in machine learning outcomes due to the gender imbalance. Correlation analyses reveal that while female MPV values correlate significantly with HTC, the male MPV data shows only slight correlation with WBC. The authors highlight that the limited representation of female subjects may diminish the impact of gender on machine learning solutions.
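A per-group correlation analysis of the kind described above can be sketched as follows. The summary does not state which correlation coefficient the authors used, so Pearson's coefficient is an assumption, and the record layout, key names, and values are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def correlation_by_group(records, group_key, x_key, y_key):
    """Compute the x/y correlation separately for each value of group_key."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append((record[x_key], record[y_key]))
    return {
        group: pearson([x for x, _ in pairs], [y for _, y in pairs])
        for group, pairs in groups.items()
    }


# Made-up records; the keys "sex", "MPV", and "HTC" are illustrative.
records = [
    {"sex": "F", "MPV": 8.0, "HTC": 38.0},
    {"sex": "F", "MPV": 9.0, "HTC": 40.0},
    {"sex": "F", "MPV": 10.0, "HTC": 42.0},
    {"sex": "M", "MPV": 8.0, "HTC": 45.0},
    {"sex": "M", "MPV": 9.0, "HTC": 44.0},
    {"sex": "M", "MPV": 10.0, "HTC": 43.0},
]
print(correlation_by_group(records, "sex", "MPV", "HTC"))
```

The same helper would work with any other grouping key, for example a hypothetical age-band field. With a group as small as the female subset relative to the male one, the estimated coefficient is less reliable, which is one way the imbalance noted above can distort conclusions.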

The analysis extends to age groups, indicating that HTC and MPV data are significantly correlated across two age categories (18-37 and 38-64 years), with the latter group exhibiting stronger correlations. The authors propose a multi-dimensional feature elimination approach to enhance the efficiency of the machine learning algorithms by reducing dimensionality and focusing on informative features. They emphasize that the adaptive nature of their feature selection method allows for real-time adjustments, thereby mitigating common issues of overfitting and underfitting. The discussion concludes by noting the limitations of the study, including the underrepresentation of female subjects and the need for further exploration of multi-layer machine learning algorithms and the incorporation of treatment policies to enhance predictive capabilities.