Student dropout prediction through machine learning optimization: insights from Moodle log data

Journal: Scientific Reports, Volume: 15, Issue: 1
DOI: https://doi.org/10.1038/s41598-025-93918-1
PMID: https://pubmed.ncbi.nlm.nih.gov/40119104
Publication Date: 2025-03-21
Author(s): Zhenyun Du et al.
Primary Topic: Online Learning and Analytics

Overview

The study addresses the persistent problems of student attrition and academic failure, and highlights how learning management systems such as Moodle can supply data for predictive analytics. It uses machine learning, specifically the CatBoost algorithm, to predict dropout and failure from student activity logs. To cope with a small, imbalanced dataset, the researchers applied data balancing techniques such as Adaptive Synthetic Sampling (ADASYN) and performed multi-objective hyperparameter optimization with the Non-dominated Sorting Genetic Algorithm II (NSGA-II).

The results indicate that a model trained on data aggregated across multiple weeks outperformed models trained on a single week of data. It achieved an average F1-score of approximately 0.8 on the holdout test and was particularly strong at identifying at-risk students. This supports the use of targeted machine learning approaches for early identification of students who need intervention, which in turn can improve educational outcomes. The findings also emphasize the importance of considering both academic and non-academic factors, such as socio-economic conditions, to gain a holistic understanding of the challenges students face.
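
As a reminder of what the reported metric measures, the F1-score is the harmonic mean of precision and recall. The sketch below computes it per class and then takes the unweighted (macro) mean. The summary says only "average F1-score", so the macro average is an assumption here, and the function names are illustrative rather than taken from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the ratio is undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: list[int], y_pred: list[int]) -> float:
    """Unweighted mean of the per-class F1 over the labels present in y_true."""
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        scores.append(f1_score(tp, fp, fn))
    return sum(scores) / len(scores)


# Toy labels (not from the study): 1 = at risk, 0 = not at risk.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(round(macro_f1(y_true, y_pred), 3))
```

Because F1 ignores true negatives, it is less flattering than plain accuracy when the at-risk class is rare, which is why it suits imbalanced dropout data.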

Methods

The methodology section of this study outlines a comprehensive approach to predicting student dropout and course failure, building on the LANSE project. The process encompasses data preprocessing, model training, and hyperparameter optimization utilizing the Non-dominated Sorting Genetic Algorithm II (NSGA-II). The research involves two distinct experiments with varying training and test datasets, structured to facilitate a thorough evaluation of model performance.

Key stages of the methodology begin with data preprocessing, in which raw data is cleaned and formatted for analysis, followed by feature engineering, which creates meaningful features to improve model accuracy. The optimization phase combines feature selection to reduce dimensionality, hyperparameter tuning to improve performance, and model selection to identify the most effective algorithm. In addition, resampling techniques such as SMOTE and ADASYN are used to mitigate class imbalance. The final step evaluates the model’s performance with relevant metrics, using iterative fine-tuning to improve robustness and accuracy.
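
To illustrate the resampling step, the following is a minimal, standard-library sketch of the ADASYN idea. Minority points whose nearest neighbours are mostly majority points receive more synthetic samples, and each synthetic sample is a random interpolation between a minority point and one of its minority neighbours. It is a simplified illustration, not the implementation used in the study. The function name, the parameters `k` and `beta`, and the toy data are assumptions, and because of rounding the number of generated points may differ slightly from the target.

```python
import math
import random


def adasyn_sketch(minority, majority, k=3, beta=1.0, seed=0):
    """Generate synthetic minority points, favouring hard-to-learn regions.

    minority, majority: lists of equal-length numeric tuples (hypothetical features).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    target = int((len(majority) - len(minority)) * beta)
    if target <= 0:
        return []

    labelled = [(p, 1) for p in minority] + [(p, 0) for p in majority]

    # Step 1: for each minority point, the share of majority points among
    # its k nearest neighbours measures how hard that region is.
    hardness = []
    for i, x in enumerate(minority):
        others = [e for j, e in enumerate(labelled) if j != i]
        nearest = sorted(others, key=lambda e: math.dist(x, e[0]))[:k]
        hardness.append(sum(1 for _, label in nearest if label == 0) / k)

    total = sum(hardness)
    if total == 0:
        return []  # no minority point borders the majority class

    # Step 2: allocate synthetic samples in proportion to hardness, then
    # interpolate towards a randomly chosen minority neighbour.
    synthetic = []
    for i, (x, h) in enumerate(zip(minority, hardness)):
        n_new = round(h / total * target)
        peers = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(peers, key=lambda p: math.dist(x, p))[:k]
        for _ in range(n_new):
            z = rng.choice(neighbours)
            lam = rng.random()
            synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x, z)))
    return synthetic
```

SMOTE differs mainly in step 2: it spreads the synthetic samples evenly across minority points instead of weighting them by neighbourhood hardness.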

Results

The “Results” section of the paper reports the outcomes of the two experiments. As noted in the overview, the model trained on data aggregated across multiple weeks outperformed models trained on weekly data alone, reaching an average F1-score of approximately 0.8 on the holdout test, and it was strongest at identifying at-risk students. According to the discussion, the NSGA-II-optimized CatBoost model also outperformed previous models in recall and precision.

The section may also include graphs or tables that illustrate the relationships between variables and support the authors’ conclusions. Overall, the findings add to the existing body of knowledge on dropout prediction and may have implications for future research and practical applications.

Discussion

The discussion section highlights advances in predicting student dropout and academic failure with machine learning techniques. It traces the evolution of predictive models that draw on data from Learning Management Systems (LMS) and other educational sources. Recent studies have sought to improve model accuracy by addressing challenges such as class imbalance and feature selection, and by integrating non-academic factors such as socio-economic status. Machine learning algorithms, including logistic regression, decision trees, artificial neural networks (ANN), and Long Short-Term Memory (LSTM) networks, have shown varying degrees of success in dropout prediction. Models such as CatBoost have demonstrated superior performance when rigorously optimized with the NSGA-II algorithm.

The section further underscores the critical role of feature engineering and the quality of data sources in developing effective predictive models. It discusses the importance of LMS log data, which captures student engagement metrics such as attendance and participation, as key predictors of academic outcomes. The study utilized a dataset comprising 567 unique student-course instances across 23 courses, employing techniques like cross-validation and oversampling to mitigate overfitting and class imbalance. The findings indicate that the newly proposed CatBoost model, optimized through NSGA-II, outperformed previous models in terms of recall and precision, thus providing a more reliable framework for identifying at-risk students and facilitating timely interventions. Overall, the research contributes valuable insights into the methodologies for dropout prediction, emphasizing the need for continuous refinement of predictive models to adapt to diverse educational contexts.
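
The combination of cross-validation and oversampling mentioned above can be sketched as follows. The splits are stratified so that each test fold keeps the class proportions, and oversampling is applied only to the training indices so that duplicated minority rows never leak into the test fold. This is an illustrative standard-library sketch. Random duplication stands in for SMOTE or ADASYN, the function names are hypothetical, and the summary does not state how the authors ordered these two steps.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs with similar class proportions per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % n_splits].append(idx)
    for i, test_fold in enumerate(folds):
        train_idx = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train_idx, sorted(test_fold)


def oversample(train_idx, labels, seed=0):
    """Duplicate random training indices until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in train_idx:
        by_class[labels[idx]].append(idx)
    largest = max(len(indices) for indices in by_class.values())
    balanced = list(train_idx)
    for indices in by_class.values():
        balanced += rng.choices(indices, k=largest - len(indices))
    return balanced


# Toy labels (not the study's data): 10 dropouts among 50 student-course instances.
labels = [1] * 10 + [0] * 40
for train_idx, test_idx in stratified_folds(labels, n_splits=5):
    balanced_train = oversample(train_idx, labels)
    # Fit on balanced_train and evaluate on the untouched test_idx.
```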

Limitations

The limitations section highlights several challenges encountered in developing predictive models for student dropout. A significant issue is class imbalance: dropout rates vary by course type, which can lead to biased models that fail to identify at-risk students accurately. To mitigate this, the study employs the Adaptive Synthetic Sampling (ADASYN) technique for resampling and the Non-dominated Sorting Genetic Algorithm II (NSGA-II) for hyperparameter optimization. While these methods have shown promise, their effectiveness can be constrained by computational resources, necessitating adjustments to mutation strategies, crossover techniques, and population parameters.
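
The core ranking step of NSGA-II, non-dominated sorting, can be sketched in a few lines. Each hyperparameter candidate is scored on several objectives, and candidates are grouped into successive Pareto fronts. The sketch assumes two objectives to maximize, recall and precision, which this summary does not confirm as the objectives actually used. It uses a simple quadratic loop per front rather than the bookkeeping of the original fast algorithm, and it omits crowding distance, mutation, and crossover.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated_sort(scores):
    """Group candidate indices into Pareto fronts; front 0 holds the best trade-offs."""
    remaining = set(range(len(scores)))
    fronts = []
    while remaining:
        # A candidate joins the current front if no remaining candidate dominates it.
        front = [
            i for i in remaining
            if not any(dominates(scores[j], scores[i]) for j in remaining if j != i)
        ]
        fronts.append(sorted(front))
        remaining -= set(front)
    return fronts


# Hypothetical (recall, precision) scores for five hyperparameter settings.
candidates = [(0.80, 0.70), (0.75, 0.78), (0.70, 0.65), (0.85, 0.60), (0.60, 0.60)]
print(non_dominated_sort(candidates))  # [[0, 1, 3], [2], [4]]
```

Sorting a population this way costs time that grows with the square of the population size in each generation, and every candidate also needs a full model fit to be scored, which is one reason population parameters must be tuned to the available compute.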

Additionally, the study’s dataset, comprising 567 unique student-course instances across 23 courses, is limited in size, which restricts the model’s generalizability and its ability to capture diverse patterns of student engagement and dropout behavior. The absence of longitudinal data and of insights from asynchronous learning formats further constrains the analysis. Future research should validate the model on larger datasets that cover a broader range of student participation and external factors such as socio-economic status and teaching quality, which are known to influence dropout rates. While the current approach is designed for institutions with limited resources, integrating these additional variables could improve the model’s predictive accuracy and provide a more nuanced understanding of dropout dynamics.