Improved bio-inspired with machine learning computing approach for thyroid prediction

Journal: Scientific Reports, Volume: 15, Issue: 1
DOI: https://doi.org/10.1038/s41598-025-03299-8
PMID: https://pubmed.ncbi.nlm.nih.gov/40595742
Publication Date: 2025-07-01
Author(s): Divya Kesavulu et al.
Primary Topic: Artificial Intelligence in Healthcare

Overview

The paper investigates machine learning (ML) and deep learning (DL) techniques for predicting thyroid disorders, which are common and carry significant health consequences, particularly for women in developing regions. The study evaluates several ML algorithms, including random forest (RF), decision tree, support vector machine (SVM), and k-nearest neighbors (KNN), each enhanced by a novel optimization method called particle snake swarm optimization (PSSO). The RF model optimized with PSSO performed best, with an accuracy of 98.7%, an F1-score of 98.47%, precision of 98.51%, recall of 98.7%, and specificity of 98%. It outperformed a baseline CNN-LSTM deep learning approach by 2.98% in accuracy, underscoring the effectiveness of bio-inspired optimization in improving traditional ML models.

The study attributes the improvement over previous research to PSSO-based hyperparameter optimization. Balancing the class samples with SMOTEENN reduces model bias toward the majority classes and yields more robust predictions. The hybrid PSSO method has a complexity of O(T⋅S⋅F), which is linear in each of the three factors and keeps it efficient on high-dimensional medical datasets. Overall, the optimized models offer a reliable and effective approach to the early detection of thyroid disorders, with potential applications in real-time diagnostic decision-support systems, particularly in resource-constrained healthcare environments.
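To make the O(T⋅S⋅F) bound concrete, the sketch below runs a plain particle swarm optimizer and counts coordinate updates. It is standard PSO, not the paper's PSSO hybrid, and it assumes that T is the number of iterations, S the swarm size, and F the number of dimensions (the summary does not define these symbols). The function names, the `sphere` toy objective, and all parameter values are illustrative.

```python
import random


def pso_minimize(objective, n_dims, swarm_size=10, n_iters=20, seed=0,
                 inertia=0.7, c1=1.5, c2=1.5):
    """Plain PSO (an illustrative stand-in, not the paper's PSSO).

    Returns (best_position, best_value, coordinate_updates).
    """
    rng = random.Random(seed)
    pos = [[rng.uniform(-1.0, 1.0) for _ in range(n_dims)]
           for _ in range(swarm_size)]
    vel = [[0.0] * n_dims for _ in range(swarm_size)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(swarm_size), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    updates = 0
    for _ in range(n_iters):              # assumed T: iterations
        for i in range(swarm_size):       # assumed S: particles
            for d in range(n_dims):       # assumed F: dimensions
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
                updates += 1
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val, updates


def sphere(x):
    """Toy objective with its minimum at the origin."""
    return sum(v * v for v in x)


best, best_val, updates = pso_minimize(sphere, n_dims=5, swarm_size=10, n_iters=20)
_, _, updates_doubled = pso_minimize(sphere, n_dims=5, swarm_size=20, n_iters=20)
print(updates, updates_doubled)  # 1000 2000
```

Doubling the swarm size doubles the update count, which is what linear scaling in each factor means. The count covers only the position updates; when the objective is a cross-validated classifier, the cost of each fitness evaluation usually dominates the running time.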

Methods

The paper presents an ML methodology for predicting thyroid disease using a dataset from the UCI Machine Learning Repository that comprises 22,632 samples with 31 attributes. The study focuses on five target classes of thyroid disorders, selected for their sample abundance, and excludes sparsely populated classes. To address class imbalance, the authors used SMOTEENN, which combines the Synthetic Minority Over-sampling Technique (SMOTE) with Edited Nearest Neighbors (ENN) cleaning. SimpleImputer was used to replace faulty values, and the dataset was split into training (80%) and testing (20%) sets. StandardScaler was applied to standardize the features to a mean of 0 and a standard deviation of 1.
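The preprocessing steps can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic data, not the authors' code: the array shapes, the median imputation strategy, the stratified split, and the choice to fit the imputer and scaler on the training split only are all assumptions. The SMOTEENN step is left out to keep the example dependent on scikit-learn alone.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the thyroid data (the real set has 31 attributes).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 4))
y = rng.integers(0, 2, size=200)
X[rng.random(X.shape) < 0.05] = np.nan   # simulate missing entries

# 80/20 split as described above; stratification is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# The imputation strategy is assumed; the summary does not state it.
imputer = SimpleImputer(strategy="median")
scaler = StandardScaler()

# Fit on the training split only so no test statistics leak into training.
X_train_p = scaler.fit_transform(imputer.fit_transform(X_train))
X_test_p = scaler.transform(imputer.transform(X_test))

# Resampling (for example SMOTEENN from imbalanced-learn) would be applied
# here to (X_train_p, y_train) only, never to the test split.
print(X_train_p.shape, X_test_p.shape)
```

After this step each training column has mean 0 and standard deviation 1. The test columns are only approximately standardized because they reuse the training statistics.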

The methodology involved various ML algorithms, including Random Forest (RF), Support Vector Machine (SVM), Decision Trees (DT), Naive Bayes (NB), and Logistic Regression (LR), with a reported accuracy of 97.05% on 519 samples. Because that sample count differs from the 22,632-sample dataset described above, this figure appears to refer to a smaller comparison study. Hybrid models such as ALO-LSTM and DeepCNN combined with optimization techniques yielded accuracies of 98.6% and 92%, respectively. The particle snake swarm optimization (PSSO) method was used for hyperparameter optimization and for identifying the features most significant for model training. The models were evaluated on accuracy, recall, precision, F1 score, and specificity. The experiments ran on a 12th-generation Intel Evo i7 machine and were implemented in Jupyter Notebook using Python and the scikit-learn library.
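The five evaluation metrics can all be derived from a confusion matrix. The sketch below computes them for a multiclass problem with macro averaging, which is an assumption, since the summary does not say how per-class scores were combined. The function name and the toy labels are illustrative.

```python
import numpy as np


def macro_metrics(y_true, y_pred, n_classes):
    """Accuracy plus macro-averaged precision, recall, F1, and specificity."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                      # rows: true class, columns: predicted

    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp               # predicted as class k but wrong
    fn = cm.sum(axis=1) - tp               # class k samples that were missed
    tn = cm.sum() - tp - fp - fn

    def safe_div(a, b):
        # Return 0 where the denominator is 0 instead of dividing by zero.
        return np.divide(a, b, out=np.zeros_like(a, dtype=float), where=b != 0)

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    specificity = safe_div(tn, tn + fp)
    f1 = safe_div(2 * precision * recall, precision + recall)

    return {
        "accuracy": float(tp.sum() / cm.sum()),
        "precision": float(precision.mean()),
        "recall": float(recall.mean()),
        "f1": float(f1.mean()),
        "specificity": float(specificity.mean()),
    }


y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 0]
metrics = macro_metrics(y_true, y_pred, n_classes=3)
print(metrics["accuracy"])  # 0.75
```

Specificity is the one metric in this list that scikit-learn does not expose as a dedicated scorer, which is why the example works from the confusion matrix directly. For each class it is computed one-vs-rest as TN / (TN + FP).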

Results

The results section reports the performance metrics of the ML models, comparing those tuned with the particle snake swarm optimizer against those without it. A heatmap (Fig. 3) shows the pairwise Pearson correlation coefficients (|r| ≥ 0.1) among clinical and laboratory characteristics in the thyroid dataset. Hierarchical clustering was applied to both rows and columns, grouping features with similar correlation patterns.

The heatmap’s color intensity and labeled values indicate the strength and direction of correlations, with yellow representing strong positive correlations and dark purple indicating weak or negative correlations. This visual tool aids in identifying clusters of features and potential multicollinearity, which is crucial for informing subsequent feature selection and modeling strategies.
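The computation behind such a clustered heatmap can be sketched without plotting. The example below builds a Pearson correlation matrix, masks entries with |r| < 0.1, and reorders rows and columns by hierarchical clustering. The synthetic features, their names, and the average-linkage choice are assumptions made for illustration; the summary does not state the linkage method used for Fig. 3.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage

# Synthetic features: f2 tracks f0 and f3 tracks f1 (hypothetical names).
rng = np.random.default_rng(0)
n = 300
base_a = rng.normal(size=n)
base_b = rng.normal(size=n)
features = np.column_stack([
    base_a,
    base_b,
    base_a + 0.1 * rng.normal(size=n),
    base_b + 0.1 * rng.normal(size=n),
])
names = ["f0", "f1", "f2", "f3"]

# Pairwise Pearson correlations between columns.
corr = np.corrcoef(features, rowvar=False)

# Hide weak correlations, mirroring the |r| >= 0.1 display threshold.
masked = np.where(np.abs(corr) >= 0.1, corr, np.nan)

# Cluster features by their correlation profiles (average linkage assumed)
# and apply the same leaf order to rows and columns.
order = leaves_list(linkage(corr, method="average"))
clustered = corr[np.ix_(order, order)]
ordered_names = [names[i] for i in order]
print(ordered_names)
```

In the reordered matrix, strongly correlated features sit next to each other, so blocks of near-duplicate features (candidate multicollinearity) show up as bright squares along the diagonal.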

Discussion

The discussion stresses the need for advanced computational tools in healthcare, particularly for thyroid diagnostics, where traditional methods are slow, costly, and prone to human error. The study highlights the potential of ML and DL techniques to analyze large datasets and identify diagnostic patterns that manual evaluation may miss. The effectiveness of these models, however, depends on careful feature selection and hyperparameter tuning to avoid overfitting and improve generalization. The primary objective of the research is to optimize feature selection and achieve high accuracy in predicting thyroid diseases. It addresses three questions: the optimal number of features, suitable search methods for feature selection, and appropriate performance metrics.

To tackle these challenges, the study combines bio-inspired optimization, specifically a hybrid particle snake swarm optimization (PSSO) approach, with various ML models. The method aims to improve prediction accuracy and computational efficiency, contributing to reliable automated diagnostic tools for thyroid conditions. The research uses a dataset of 22,632 samples, with preprocessing to remove noise and handle missing values. The proposed PSSO method, combined with classical ML classifiers, achieves an accuracy of 98.70%, exceeding the existing methods it is compared against. The findings underscore the importance of advanced feature selection and hyperparameter optimization for diagnostic accuracy and efficiency in medical applications.