Supervised machine learning for classification and prediction of stunting among under-five Egyptian children

Journal: BMC Pediatrics, Volume: 25, Issue: 1
DOI: https://doi.org/10.1186/s12887-025-06138-x
PMID: https://pubmed.ncbi.nlm.nih.gov/40963124
Publication Date: 2025-09-18
Author(s): Abdelaziz Hendy et al.
Primary Topic: Child Nutrition and Water Access

Overview

The study investigates stunting among children under five in Egypt, a consequence of chronic undernutrition that affects millions of children and hampers development in low- and middle-income countries. It applies five supervised machine learning (ML) algorithms (XGBoost, Logistic Regression, Random Forest, Gradient Boosting, and K-Nearest Neighbors) to classify and predict stunting, using data from the Egypt Demographic and Health Surveys (DHS) of 2005, 2008, and 2014. After preprocessing, the models were evaluated with 10-fold stratified cross-validation and compared on accuracy, precision, recall, F1 score, and ROC-AUC.
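To make the evaluation metrics concrete, the following is a minimal sketch of how they can be computed for a binary stunted/not-stunted classifier. It is not the authors' code: the function names, the toy labels, and the 0.5 decision threshold are assumptions for illustration, and in practice a library such as scikit-learn would normally supply these metrics.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for binary labels (1 = stunted)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case
    (ties count as half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one case of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): true labels and predicted probabilities of stunting.
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1, 0.4, 0.05]
toy_pred = [1 if s >= 0.5 else 0 for s in toy_scores]  # assumed 0.5 threshold

print(classification_metrics(toy_true, toy_pred))
print(roc_auc(toy_true, toy_scores))
```

Accuracy alone can look favorable when most children are not stunted, which is one reason precision, recall, F1, and ROC-AUC are usually reported alongside it.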

The findings reveal that Gradient Boosting and Random Forest algorithms achieved the highest predictive accuracy, exceeding 90% with ROC-AUC values above 0.96, while Logistic Regression also demonstrated strong performance. Key predictors of stunting identified in the study include the child’s nutritional status, maternal education, birth size, wealth index, and rural residence. This research marks a pioneering application of machine learning techniques to the stunting issue in Egypt, underscoring the potential for these methods to inform public health interventions. The authors recommend further studies with more recent data and enhanced model optimization to refine predictive capabilities and address the limitations of the current dataset.

Introduction

The introduction frames stunting as a critical public health issue. Stunting is a growth impairment in which a child's height-for-age z-score falls more than two standard deviations below the median of the World Health Organization (WHO) reference population. It predominantly affects children in the first 24 months of life, significantly hindering their physical and cognitive development and posing long-term health risks. Prevalence is particularly high in low- and middle-income countries, with approximately 149.2 million children under five affected globally as of 2020. In Egypt, one study reported an 18.4% prevalence of stunting among schoolchildren, linked to factors such as parasitic infections, anemia, and low body mass index.
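The definition above can be sketched in a few lines. This is a simplified illustration only: WHO z-scores are derived from age- and sex-specific reference tables, whereas the reference median and standard deviation used here are hypothetical placeholders, and the function names are assumptions.

```python
def height_for_age_z(height_cm: float, ref_median_cm: float, ref_sd_cm: float) -> float:
    """Simplified height-for-age z-score: distance from the reference
    median, expressed in reference standard deviations."""
    if ref_sd_cm <= 0:
        raise ValueError("reference standard deviation must be positive")
    return (height_cm - ref_median_cm) / ref_sd_cm


def is_stunted(haz: float, cutoff: float = -2.0) -> bool:
    """A child is classified as stunted when HAZ is below -2."""
    return haz < cutoff


# Hypothetical reference values for one age/sex group (NOT actual WHO figures).
REF_MEDIAN_CM = 87.0
REF_SD_CM = 3.0

for height in (80.0, 84.0, 90.0):
    haz = height_for_age_z(height, REF_MEDIAN_CM, REF_SD_CM)
    print(f"height={height} cm  HAZ={haz:.2f}  stunted={is_stunted(haz)}")
```

A binary label of this kind (stunted versus not stunted) is what a supervised classifier would be trained to predict.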

The study applies supervised machine learning (ML) techniques to nationally representative data from the Egypt Demographic and Health Surveys (DHS) conducted in 2005, 2008, and 2014, aiming to identify key predictors and risk factors associated with stunting. Using XGBoost, Logistic Regression, Random Forest, Gradient Boosting, and K-Nearest Neighbors (k-NN), it seeks to uncover complex relationships that traditional statistical methods may overlook. Model performance is evaluated on accuracy, precision, recall, F1 score, and ROC-AUC, with the expectation that ML can support earlier identification of risk factors and better-targeted interventions. The authors present the work as the first to apply a comprehensive range of ML classifiers to Egyptian stunting data, addressing a gap in the existing literature.

Results

The results section describes the demographics of the surveyed population. Of the respondents, 62.4% live in rural areas, and maternal education varies widely: 12.5% of mothers have higher education, while 25.8% have no formal education. On the wealth index, 20.3% of households are classified as middle and 22.2% as poorest. Birth statistics show that 96% of children are singleton births, with a reported sex distribution of 62.2% female and 37.8% male. In total, 67.1% of births were not by Cesarean section, and most children (81.6%) were of average size at birth. Nutritional assessments classify 90.6% of children as having normal nutritional status, while 13.0% are stunted.

Regression analysis highlights significant associations between several factors and height-for-age z-scores (HAZ). Height in centimeters is positively associated with HAZ (B = 0.26, p < 0.05), while weight shows a negative association (B = -0.07, p < 0.05). Maternal weight is positively associated with HAZ (B = 0.015, p < 0.05), whereas maternal height and BMI are negatively associated with it. Age in months, birth order, and breastfeeding duration are also negatively associated with HAZ, while the number of living children and recent births are positively associated. The overall regression model fits well, explaining 76% of the variance in HAZ (R² = 0.76, p < 0.001), underscoring the importance of prenatal monitoring and interventions to mitigate stunting risks.
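As a reminder of what "explaining 76% of the variance" means, R² compares the model's squared prediction errors with the total variation of the observed values around their mean. The sketch below computes it on made-up HAZ values; the numbers and the function name are illustrative assumptions, not data from the study.

```python
from typing import Sequence


def r_squared(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values have zero variance")
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and model-predicted HAZ values for five children.
observed_haz = [-2.5, -1.0, 0.0, 0.5, 1.0]
predicted_haz = [-2.0, -1.5, 0.0, 1.0, 1.0]

print(round(r_squared(observed_haz, predicted_haz), 3))
```

An R² of 0.76 would correspond to residual squared error equal to 24% of the total squared variation in HAZ.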

Discussion

In this study, several supervised machine learning (ML) algorithms were used to predict the risk of stunting among Egyptian children under five, addressing a significant public health issue. Data preparation involved merging the 2005, 2008, and 2014 datasets, followed by preprocessing to handle missing values and class imbalance. A 10-fold stratified cross-validation approach was used so that each fold preserved the proportion of stunted and non-stunted children, making the performance estimates more robust. Among the five models evaluated (XGBoost, Logistic Regression, Random Forest, Gradient Boosting, and K-Nearest Neighbors), Random Forest and Gradient Boosting consistently outperformed the others, achieving accuracies of 90.5% and 90.15%, respectively. These models handled missing data and class imbalance effectively, demonstrating their suitability for health and nutrition datasets.
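The idea behind stratified k-fold cross-validation can be sketched as follows: indices are grouped by class and dealt out across the folds, so each fold keeps roughly the overall share of stunted children. This is a minimal standard-library illustration, not the authors' pipeline; the function name, the seed, and the toy labels (13 stunted out of 100, loosely echoing the reported 13.0% prevalence) are assumptions.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def stratified_k_fold(
    labels: Sequence[Hashable], k: int = 10, seed: int = 0
) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) pairs with class
    proportions approximately preserved in every test fold."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets its share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    splits = []
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        splits.append((train_idx, test_idx))
    return splits


# Toy labels: 13 stunted (1) and 87 not stunted (0).
toy_labels = [1] * 13 + [0] * 87
toy_splits = stratified_k_fold(toy_labels, k=10, seed=42)

for fold_no, (train_idx, test_idx) in enumerate(toy_splits, start=1):
    stunted_in_test = sum(toy_labels[i] for i in test_idx)
    print(f"fold {fold_no}: test size={len(test_idx)}, stunted in test={stunted_in_test}")
```

Without stratification, a random fold of an imbalanced dataset could contain very few stunted cases, which would make per-fold recall and F1 unstable.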

The findings revealed that key predictors of stunting included maternal education, socioeconomic status, and child nutritional status, with specific thresholds identified for maternal height and birth weight. The study underscores the potential of ML algorithms in informing public health interventions by targeting high-risk groups based on accessible maternal and child health indicators. While the models showed promising predictive capabilities, the research also highlighted the need for external validation to assess generalizability beyond the Egyptian context. Overall, this study contributes valuable insights into the application of ML in public health, particularly in addressing childhood malnutrition and stunting.

Limitations

The limitations of this study stem primarily from its reliance on secondary data from the Demographic and Health Surveys (DHS), which may restrict the generalizability of the findings. The data cover only 2005, 2008, and 2014, so they do not capture more recent trends or the effects of newer public health interventions, particularly as there have been no subsequent surveys on stunting in Egypt since 2014. In addition, the findings may not be applicable to other regions or countries with differing socioeconomic and environmental contexts.

The study also lacks a thorough examination of feature selection and hyperparameter tuning for the machine learning models. All available features were included to keep the approach comprehensive, but advanced feature selection methods, such as Recursive Feature Elimination (RFE) or LASSO regression, could sharpen the identification of critical predictors and improve model performance. Likewise, because the classifiers were run with default hyperparameters, tuning through grid search or random search could improve predictive performance. Future research should address these limitations to strengthen the robustness and accuracy of the models.
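To illustrate what grid search involves, the sketch below tunes a single hyperparameter, the number of neighbors k in a small k-NN classifier, by scoring every candidate with leave-one-out cross-validation and keeping the best. It is a toy, standard-library illustration rather than the tuning the authors recommend in detail: the two features, the data points, the candidate grid, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def knn_predict(train_x: Sequence[Point], train_y: Sequence[int], query: Point, k: int) -> int:
    """Predict the majority label among the k nearest training points."""
    nearest = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def loo_accuracy(xs: List[Point], ys: List[int], k: int) -> float:
    """Leave-one-out accuracy: each point is predicted from all the others."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        correct += int(knn_predict(train_x, train_y, xs[i], k) == ys[i])
    return correct / len(xs)


def grid_search_k(xs: List[Point], ys: List[int], candidates: Sequence[int]) -> Tuple[int, Dict[int, float]]:
    """Score every candidate k and return the best one with all scores."""
    scores = {k: loo_accuracy(xs, ys, k) for k in candidates}
    best_k = max(scores, key=scores.get)  # first candidate wins ties
    return best_k, scores


# Toy data: two standardized, hypothetical features per child; 1 = stunted.
toy_x: List[Point] = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2),
    (2.9, 3.0), (3.1, 2.8), (3.0, 3.2), (2.8, 3.1),
    (0.2, 0.2),  # a noisy point labeled against its neighbors
]
toy_y: List[int] = [0, 0, 0, 0, 1, 1, 1, 1, 1]

best_k, k_scores = grid_search_k(toy_x, toy_y, candidates=[1, 3, 5])
print("scores by k:", k_scores)
print("selected k:", best_k)
```

Random search follows the same pattern but samples candidate settings instead of enumerating them, which tends to scale better when several hyperparameters are tuned at once.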