Prediction of university dropouts through random forest-based models

Journal: Journal Of Advanced Pharmacy Education And Research, Volume: 15, Issue: 1
DOI: https://doi.org/10.51847/pfb18qb60j
Publication Date: 2025-01-01
Author(s): Fred Torres‐Cruz et al.
Primary Topic: Artificial Intelligence in Healthcare

Overview

This research paper investigates the prediction of university dropout using machine learning, specifically the Random Forest algorithm. The model was built on key academic variables: year of enrollment, program of study, semester attended, and academic performance measured by grade point average (GPA). The dropout threshold was set at a GPA below 11, which is consistent with the 0-20 grading scale used in Peru, and the dataset was split into 70% for training and 30% for testing. The model achieved an overall accuracy of 85.9%. Semester attended and year of enrollment were identified as the most significant predictors of dropout. The model showed a specificity of 91% for identifying students unlikely to drop out, but a sensitivity of only 52% for those at risk.
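
The labeling rule and the 70/30 split described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the record layout, the helper names (`label_dropout`, `train_test_split_70_30`), the random seed, and the sample values are all assumptions made for the example.

```python
import random

DROPOUT_GPA_THRESHOLD = 11  # threshold reported in the summary


def label_dropout(gpa):
    """Return 1 (dropout) when the GPA is below the threshold, else 0."""
    return 1 if gpa < DROPOUT_GPA_THRESHOLD else 0


def train_test_split_70_30(records, seed=0):
    """Shuffle a copy of the records and split it 70% / 30%."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical records: (enrollment_year, program, semester, gpa).
records = [
    (2020, "civil", 1, 9.5),
    (2020, "civil", 2, 12.0),
    (2020, "systems", 1, 10.9),
    (2021, "systems", 3, 14.2),
    (2021, "mining", 2, 11.0),
    (2021, "mining", 5, 15.1),
    (2022, "civil", 1, 8.0),
    (2022, "systems", 4, 13.3),
    (2022, "mining", 6, 16.0),
    (2022, "civil", 3, 10.0),
]

# Attach the dropout label to every record, then split.
labeled = [(rec, label_dropout(rec[3])) for rec in records]
train, test = train_test_split_70_30(labeled)
print(len(train), len(test))
```

A fixed seed keeps the split reproducible; a stratified split, which preserves the dropout share in both parts, would be a reasonable alternative when dropouts are a small minority.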

The findings affirm the initial hypothesis that specific academic indicators significantly influence dropout likelihood, suggesting that a focus on these variables can yield effective predictive models without reliance on external factors such as socioeconomic status. This methodological advancement allows for targeted interventions based on academic data that institutions already hold, enhancing the potential for improving student retention. While the Random Forest model offers competitive accuracy and interpretability, its limited sensitivity in detecting at-risk students indicates a need to explore other machine learning techniques or adjustments that address class imbalance. Overall, this study lays a robust foundation for implementing predictive models centered on academic performance, advocating for their integration into student management systems to proactively mitigate dropout rates across educational institutions.

Introduction

The introduction of this research paper addresses the pressing issue of university dropout rates, which have significant implications for educational institutions and broader socio-economic development. The primary objective of the study is to develop a predictive model utilizing the Random Forest algorithm to identify students at risk of dropping out based on academic data. This method is favored for its capacity to manage large, heterogeneous datasets and its effectiveness in interpreting complex variables, thereby providing a robust tool for implementing proactive interventions aimed at enhancing student retention.

The study builds on previous research that has explored various socio-economic and academic factors influencing dropout rates, emphasizing the importance of institutional commitment and social integration. While earlier studies have employed machine learning techniques like logistic regression and neural networks, the application of Random Forest in this context is relatively novel. The research aims to achieve three specific goals: constructing a high-accuracy Random Forest model for student classification, evaluating the significance of academic variables (such as GPA and enrollment year) in dropout risk, and offering an interpretative framework for institutions to understand dropout patterns. By focusing on academic factors, the study seeks to provide a practical tool for educational institutions to enhance retention strategies, ultimately contributing to improved student outcomes and institutional effectiveness.

Methods

The research employs a quantitative methodology to analyze university dropout rates, focusing on identifying predictive patterns in students’ academic data. This study, conducted at the National University of the Altiplano in Puno, Peru, aims to establish causal relationships between academic variables—such as program of study, semester attended, and year of enrollment—and the risk of dropout. The central hypothesis posits that these academic variables significantly influence dropout probability and can be modeled using the Random Forest algorithm. The study encompasses a comprehensive dataset of 249,043 enrollment and grade records from undergraduate engineering students over three years (2020-2022), ensuring a robust representation of the student population.

Data analysis utilized the Random Forest algorithm for its effectiveness in handling non-linear relationships among variables. The dataset was divided into training (70%) and testing (30%) sets, with model performance evaluated through metrics such as overall accuracy, sensitivity, and specificity. The model achieved an accuracy of 85.9%, with a sensitivity of 52% for identifying at-risk students and a specificity of 91% for those who persisted. Variable importance was assessed using the Mean Decrease in Impurity (MDI) metric, revealing that semester attended and year of enrollment were the most significant predictors of dropout risk. The methodological design emphasizes the importance of data quality and comprehensive analysis to inform proactive educational interventions.
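
To make the MDI metric concrete: in tree ensembles, a feature's MDI is typically obtained by summing the impurity decrease of every split made on that feature (weighted by the share of samples reaching the node) and averaging over the trees. The sketch below computes that quantity for a single split. It assumes Gini impurity, which the summary does not specify, and uses made-up labels and a made-up split.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Gini decrease obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


# Hypothetical node: 1 = dropout, 0 = persisted.
parent = [1, 1, 1, 0, 0, 0, 0, 0]
left = [1, 1, 1, 0]   # e.g. students with semester <= 3 (illustrative split)
right = [0, 0, 0, 0]  # remaining students

decrease = impurity_decrease(parent, left, right)
print(decrease)
```

A split that separates dropouts from persisting students more cleanly produces a larger decrease, which is why a feature used in many effective splits ends up with a high MDI.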

Results

The study’s findings indicate that the Random Forest model effectively predicts university dropout rates, achieving an overall accuracy of 85.9%. The model demonstrates high specificity (91%) in identifying students likely to persist, but its sensitivity is limited at 52% for detecting at-risk students. Key predictors of dropout include semester attended and year of enrollment, with the semester attended having the highest Mean Decrease in Impurity (MDI) value of 0.328, followed by year of enrollment (MDI = 0.296) and GPA (MDI = 0.162). The Receiver Operating Characteristic (ROC) curve analysis yielded an Area Under the Curve (AUC) of 0.77, indicating reasonable discrimination between dropouts and persisting students, although opportunities for improving sensitivity remain.
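
One way to read an AUC of 0.77 is as the probability that a randomly chosen dropout receives a higher predicted risk than a randomly chosen persisting student. The following sketch computes the AUC directly from that definition. The scores are hypothetical and do not come from the study, and the function name `auc_pairwise` is introduced here for illustration only.

```python
def auc_pairwise(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correctly ranked pair.
    """
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical predicted dropout probabilities (not data from the study).
dropout_scores = [0.9, 0.6, 0.4]
persist_scores = [0.7, 0.5, 0.3, 0.2, 0.1]

auc = auc_pairwise(dropout_scores, persist_scores)
print(auc)
```

The pairwise form is quadratic in the number of students and is meant only to show the definition; rank-based implementations are the usual choice for a dataset of this size.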

Further analysis using logistic regression confirmed the significance of semester attended (B = 0.452, p < 0.001) and year of enrollment (B = 0.376, p < 0.001) as predictors, reinforcing the Random Forest model's findings. The Random Forest model exhibited a high false-negative rate (48%, the complement of its 52% sensitivity) compared to a false-positive rate of 9% (the complement of its 91% specificity), indicating a tendency to overlook at-risk students. The study also revealed that students in their early semesters (1st to 3rd) are at greater risk of dropout, emphasizing the need for targeted interventions during these critical periods. Overall, the research supports the development of a predictive monitoring system to enhance academic retention strategies, with recommendations for raising model sensitivity so that fewer at-risk students are missed and preventive policies are better informed.
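
The reported error rates follow directly from the sensitivity and specificity, and the three headline metrics together imply an approximate share of dropouts in the test set. The short calculation below works this out. It assumes that accuracy, sensitivity, and specificity were all measured on the same test set, and because the inputs are rounded the implied share is only approximate.

```python
import math

# Figures reported in the summary.
accuracy = 0.859
sensitivity = 0.52
specificity = 0.91

# Error rates are the complements of sensitivity and specificity.
false_negative_rate = 1.0 - sensitivity
false_positive_rate = 1.0 - specificity

# With p the share of dropouts in the test set:
#   accuracy = sensitivity * p + specificity * (1 - p)
# Solving for p gives the implied dropout share.
implied_dropout_share = (specificity - accuracy) / (specificity - sensitivity)

print(round(false_negative_rate, 2), round(false_positive_rate, 2))
print(round(implied_dropout_share, 3))
```

Under these assumptions roughly 13% of test records belong to the dropout class, which is consistent with the class imbalance the summary mentions and helps explain how an accuracy near 86% can coexist with a sensitivity of 52%.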

Discussion

The research findings indicate that university dropout can be effectively predicted using advanced machine learning techniques, particularly the Random Forest model, which achieved an overall accuracy of 85.9% and a specificity of 91%. The model’s sensitivity of 52% suggests a moderate ability to identify at-risk students, with semester attended and year of enrollment identified as the most significant predictors. This underscores the importance of academic progress as a central factor in dropout prediction, supporting the hypothesis that academic variables can serve as robust indicators without the need for external contextual data, such as socioeconomic factors.

The study highlights a methodological advancement by demonstrating that a focus solely on academic performance can yield actionable insights for intervention, contrasting with previous research that incorporated demographic variables. While the Random Forest model offers competitive accuracy and interpretability through feature importance analysis, its limited sensitivity points to a need for further exploration of alternative machine learning approaches or adjustments for class imbalance. Combining Random Forest, itself an ensemble method, with other ensemble techniques or with longitudinal data tracking could enhance the model’s effectiveness. Overall, this research lays a strong foundation for implementing predictive models centered on academic performance, facilitating early warnings for at-risk students and simplifying intervention strategies within educational institutions.
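
One common adjustment for class imbalance is to lower the probability threshold at which a student is flagged as at risk, trading some specificity for higher sensitivity. The summary does not say which adjustment the authors have in mind, so the sketch below is only an illustration of the general idea, with hypothetical labels and scores.

```python
def sensitivity_specificity(y_true, scores, threshold):
    """Return (sensitivity, specificity) when flagging scores >= threshold."""
    tp = fn = tn = fp = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical data: 1 = dropout, 0 = persisted; scores are predicted risks.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.8, 0.6, 0.4, 0.3, 0.7, 0.35, 0.2, 0.2, 0.1, 0.05]

default_rates = sensitivity_specificity(y_true, scores, threshold=0.5)
lowered_rates = sensitivity_specificity(y_true, scores, threshold=0.3)
print(default_rates, lowered_rates)
```

In this toy data, lowering the threshold catches every dropout at the cost of more false alarms; in practice the threshold would be chosen on validation data according to how costly a missed at-risk student is relative to an unnecessary intervention.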