Advancing educational data mining for enhanced student performance prediction: a fusion of feature selection algorithms and classification techniques with dynamic feature ensemble evolution

Journal: Scientific Reports, Volume: 15, Issue: 1
DOI: https://doi.org/10.1038/s41598-025-92324-x
PMID: https://pubmed.ncbi.nlm.nih.gov/40082624
Publication Date: 2025-03-13
Author(s): Saleem Malik et al.
Primary Topic: Online Learning and Analytics

Overview

The integration of technology into education has generated large volumes of data, driving advances in Educational Data Mining (EDM) aimed at improving learning outcomes. This study presents a novel feature selection model, “Dynamic Feature Ensemble Evolution for Enhanced Feature Selection” (DE-FS), which combines established methods (correlation matrix analysis with heat maps, information gain, and Chi-square) to identify the features most relevant to predicting student performance. A key innovation of DE-FS is its dynamic, adaptive thresholding mechanism, which adjusts selection thresholds as data patterns change, addressing the limitations of static methods and reducing overfitting and underfitting. The study’s main contribution is an ensemble-based feature selection methodology that improves predictive accuracy and flexibility, with DE-FS reported to outperform alternatives on the educational datasets tested.
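
The summary does not state how DE-FS fuses the scores of its component methods, so the following is only a minimal sketch of one plausible ensemble step: rescale each method's scores to a common range, average them, and rank the features. The function names, the averaging rule, and all score values are assumptions made for illustration; they are not taken from the study.

```python
def min_max_normalise(scores):
    """Rescale a {feature: score} mapping to the [0, 1] range."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:  # all scores equal: no feature stands out
        return {name: 0.0 for name in scores}
    return {name: (value - lo) / (hi - lo) for name, value in scores.items()}


def fuse_scores(score_tables):
    """Average the normalised scores that each method assigns to a feature."""
    normalised = [min_max_normalise(table) for table in score_tables]
    features = normalised[0].keys()
    return {
        name: sum(table[name] for table in normalised) / len(normalised)
        for name in features
    }


def top_features(fused, k):
    """Return the k feature names with the highest fused score."""
    return sorted(fused, key=fused.get, reverse=True)[:k]


# Hypothetical scores, invented for illustration (not values from the study).
chi_square = {"G1": 40.0, "G2": 50.0, "Medu": 10.0, "Walc": 0.0}
info_gain = {"G1": 0.6, "G2": 0.8, "Medu": 0.2, "Walc": 0.0}
correlation = {"G1": 0.8, "G2": 0.9, "Medu": 0.2, "Walc": 0.1}  # absolute values

fused = fuse_scores([chi_square, info_gain, correlation])
selected = top_features(fused, k=2)
print(selected)
```

Normalising before averaging keeps a method with a large numeric range (such as chi-square) from dominating methods whose scores lie between 0 and 1.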

The findings indicate that DE-FS achieves an accuracy of 95.8% while reducing the feature set to 12 attributes, a substantial improvement over traditional static feature selection methods. By minimizing feature redundancy, the model improves both predictive accuracy and computational efficiency, and it is presented as useful for dropout prediction, academic interventions, and personalized learning pathways. However, challenges remain regarding scalability to large educational datasets and the interpretability of the selected features for educators. Future research aims to refine DE-FS for multi-modal learning analytics, integrate it with deep learning models for feature extraction, and optimize it for real-time adaptive learning applications. Overall, DE-FS is presented as a significant advance in EDM that combines accuracy, flexibility, and reduced computational complexity to improve educational outcomes.

Methods

The research presents a two-phase machine learning methodology aimed at enhancing the accuracy of identifying high-risk students. Various algorithms, including K-Nearest Neighbors (KNN), decision trees, Naive Bayes, and linear regression, were employed, achieving an accuracy of 81% with the decision tree method. However, the study notes a significant limitation in feature selection techniques, which could impact model performance. The methodology integrates both supervised and unsupervised learning approaches to address this issue.

The experimental setup used Python and the scikit-learn library to implement the DE-FS algorithm on a system equipped with an Intel Core i7 processor, an NVIDIA GeForce GTX 1650 GPU, and 16 GB of DDR4 RAM. The model was trained and tested on two public datasets from Portuguese high schools. Feature selection and classification were conducted using multiple techniques, including Decision Trees, Random Forests, Support Vector Machines, Neural Networks, and J48. Accuracy, F1-score, and CV-score were used to evaluate the models. Key features identified by the various methods included academic performance indicators (e.g., G1, G2), demographic factors (e.g., Medu, Fedu), and behavioral aspects (e.g., Dalc, Walc).
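
To make the evaluation metrics concrete, here is a small standard-library sketch of k-fold cross-validation that reports a mean fold accuracy and a pooled F1-score. It assumes that "CV-score" means the mean accuracy over folds, which the summary does not define. The cutoff "model", the data values, and every function name are hypothetical stand-ins; the study itself used scikit-learn classifiers.

```python
from statistics import mean


def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        stop = start + base + (1 if fold < extra else 0)
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def fit_cutoff(g2, passed):
    """Toy model: midpoint between the mean G2 of passing and failing students."""
    pass_mean = mean(g for g, p in zip(g2, passed) if p == 1)
    fail_mean = mean(g for g, p in zip(g2, passed) if p == 0)
    return (pass_mean + fail_mean) / 2


# Invented toy data: second-period grade (G2) and a pass/fail label.
g2 = [15, 6, 12, 8, 14, 9, 5, 13, 11, 7]
passed = [1, 0, 1, 0, 1, 1, 0, 1, 1, 0]

fold_scores, predictions = [], [None] * len(g2)
for train, test in k_fold_splits(len(g2), k=5):
    cutoff = fit_cutoff([g2[i] for i in train], [passed[i] for i in train])
    for i in test:
        predictions[i] = 1 if g2[i] >= cutoff else 0  # predict "pass" above cutoff
    fold_scores.append(
        accuracy([passed[i] for i in test], [predictions[i] for i in test])
    )

cv_score = mean(fold_scores)
pooled_f1 = f1_score(passed, predictions)
print(f"CV-score: {cv_score:.3f}, F1: {pooled_f1:.3f}")
```

Each student is predicted by a model that never saw that student during fitting, which is what makes the cross-validated score a fairer estimate than accuracy on the training data.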

Results

The study evaluates how well different feature selection methods and classification algorithms predict student performance across the datasets. Three experiments used chi-square, information gain, and correlation heat map techniques to identify the key attributes influencing student grades, with G3 (final grade) as the target. The correlation heat map method outperformed the others by capturing relevant features while minimizing redundancy. On accuracy and F1-score, Neural Networks (NN) and Random Forest (RF) were the strongest classifiers, both exceeding 91% accuracy, with NN reaching 92.43% when paired with the correlation heat map.
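
The three scoring techniques named above can be illustrated on a toy example. The sketch below uses textbook definitions (the Pearson chi-square statistic of a contingency table, entropy-based information gain, and absolute Pearson correlation) written with the standard library only. It is not the study's code, scikit-learn's own scoring functions may compute these quantities differently, and the two binary features are invented.

```python
from collections import Counter
from math import log2, sqrt


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature, target):
    """Entropy of the target minus its entropy after splitting on the feature."""
    total = len(target)
    remainder = 0.0
    for value in set(feature):
        subset = [t for f, t in zip(feature, target) if f == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(target) - remainder


def chi_square(feature, target):
    """Pearson chi-square statistic of the feature/target contingency table."""
    total = len(target)
    observed = Counter(zip(feature, target))
    f_counts, t_counts = Counter(feature), Counter(target)
    stat = 0.0
    for f_value, f_n in f_counts.items():
        for t_value, t_n in t_counts.items():
            expected = f_n * t_n / total
            stat += (observed[(f_value, t_value)] - expected) ** 2 / expected
    return stat


def abs_pearson(x, y):
    """Absolute Pearson correlation; assumes neither sequence is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return abs(cov) / sqrt(var_x * var_y)


# Invented toy data: a pass/fail label derived from G3 and two binary features.
passed = [1, 1, 1, 1, 0, 0, 0, 0]
informative = [1, 1, 1, 1, 0, 0, 0, 0]    # mirrors the label exactly
uninformative = [1, 0, 1, 0, 1, 0, 1, 0]  # independent of the label

for name, feature in [("informative", informative), ("uninformative", uninformative)]:
    print(
        name,
        chi_square(feature, passed),
        information_gain(feature, passed),
        abs_pearson(feature, passed),
    )
```

On this toy data all three methods agree, but on real attributes they can rank features differently, which is the motivation for comparing them and for combining them in an ensemble.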

In the Mathematics dataset, chi-square feature selection combined with Random Forest performed best, which the study attributes to its ability to handle categorical attributes and class imbalance. In the Portuguese dataset, the chi-square method was also effective, while combinations such as information gain with Decision Tree and with Random Forest showed promising accuracy, precision, and recall. The study emphasizes the importance of matching the feature selection method to the characteristics of the dataset. Notably, the chi-square + RF combination yielded the lowest misclassification rate, 4.20%, while information gain + Naive Bayes had the highest, 11.24%. Overall, the findings underscore the critical role of both feature selection and classification strategy in predicting student performance.
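
Assuming the misclassification rate is defined in the usual way, as 100% minus accuracy (the summary does not spell this out), the two reported error rates convert to accuracies as in the short sketch below. The helper name is hypothetical.

```python
def accuracy_from_error(error_rate_percent):
    """Accuracy (%) implied by a misclassification rate (%), assuming they sum to 100."""
    return round(100.0 - error_rate_percent, 2)


# Error rates as reported in the text above.
reported_error = {
    "chi-square + RF": 4.20,
    "information gain + Naive Bayes": 11.24,
}
implied_accuracy = {
    name: accuracy_from_error(rate) for name, rate in reported_error.items()
}
print(implied_accuracy)
```

Under that assumption, the 4.20% error of chi-square + RF corresponds to 95.80% accuracy, which appears consistent with the 95.8% figure quoted in the overview.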

Discussion

The discussion highlights the limitations of existing student performance prediction models, particularly their reliance on static feature selection methods and their lack of adaptability to dynamic educational contexts. The study emphasizes the importance of early intervention in improving student outcomes, citing the range of machine learning models that have been applied to predict student performance and dropout rates. The proposed DE-FS algorithm integrates traditional and ensemble feature selection techniques with a dynamic thresholding mechanism, addressing the shortcomings of previous methods by improving prediction accuracy and adaptability to changing data patterns.

The DE-FS algorithm employs a systematic approach to feature selection, utilizing techniques such as Chi-square, Information Gain, and correlation analysis to identify relevant features while dynamically adjusting to evolving educational datasets. This adaptability is crucial for maintaining model performance in the face of fluctuating data characteristics. The research underscores the potential of DE-FS to improve educational outcomes by providing a robust framework for predicting student performance, thereby enabling timely interventions and informed decision-making by educators. Future work aims to refine the feature selection process further by incorporating advanced optimization algorithms, enhancing the model’s accuracy and scalability in diverse educational settings.
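
The exact thresholding rule used by DE-FS is not given in this summary. As a hedged illustration of why a data-dependent cutoff can adapt where a fixed one cannot, the sketch below compares a static cutoff with a threshold recomputed from the current score distribution (mean plus half a standard deviation). The rule, the constant `k`, and the scores are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def adaptive_threshold(scores, k=0.5):
    """Cutoff derived from the current score distribution: mean + k * std."""
    values = list(scores.values())
    return mean(values) + k * pstdev(values)


def select(scores, threshold):
    """Keep the features whose score reaches the threshold (sorted by name)."""
    return sorted(name for name, value in scores.items() if value >= threshold)


STATIC_CUTOFF = 0.5  # chosen once and never revisited

# Hypothetical feature-relevance scores for two snapshots of the data.
term_1 = {"G1": 0.9, "G2": 0.8, "absences": 0.3, "Walc": 0.2}
term_2 = {"G1": 0.45, "G2": 0.40, "absences": 0.15, "Walc": 0.10}  # all halved

for label, scores in [("term 1", term_1), ("term 2", term_2)]:
    print(
        label,
        "static:", select(scores, STATIC_CUTOFF),
        "adaptive:", select(scores, adaptive_threshold(scores)),
    )
```

When every score shrinks in the second snapshot, the static cutoff discards all features even though their relative ordering is unchanged, while the recomputed threshold still keeps the two strongest ones.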

Limitations

The limitations of DE-FS highlight several critical challenges that impact its effectiveness and adoption in educational settings. Primarily, the performance of DE-FS is contingent upon the accuracy and quality of the training data; incomplete or biased datasets can significantly diminish its efficacy. Additionally, the method demands substantial computational resources, posing a barrier for institutions with limited technological infrastructure.

Furthermore, the success of DE-FS relies on well-defined features within the datasets; poorly characterized attributes can hinder its performance. Institutions lacking data science tools face difficulties in implementing DE-FS, because implementation requires personnel skilled in data analysis. To enhance the usefulness of DE-FS in educational data mining, it is essential to improve data quality, optimize resource utilization, and provide adequate training for educators.