Towards reliable generative AI-driven scaffolding: Reducing hallucinations and enhancing quality in self-regulated learning support

Journal: Computers & Education, Volume: 240
DOI: https://doi.org/10.1016/j.compedu.2025.105448
Publication Date: 2025-09-02
Author(s): Keyang Qian et al.
Primary Topic: Innovative Teaching and Learning Methods

Overview

The study explores the potential of generative artificial intelligence (GenAI) to enhance educational technologies by generating personalized scaffolds that support students’ self-regulated learning (SRL). Although large language models (LLMs) have advanced rapidly, hallucinations in generated content pose risks to both the learning experience and ethical standards. To mitigate these risks, the authors propose two evaluation approaches: a multi-agent system that assesses the reliability of LLM-generated scaffolds, and the “LLM-as-a-Judge” technique, which evaluates their quality. Together, these methods aim to ensure that the scaffolds target the relevant SRL processes and support students adequately.

The findings indicate that the multi-agent reliability evaluation approach significantly outperforms baseline models, demonstrating high alignment with human expert evaluations. Additionally, both proposed methods effectively reduce hallucinations in the generated content. However, the “LLM-as-a-Judge” technique revealed certain biases, including challenges in detecting unreliable scaffolds and issues related to verbosity and sequential API calling. Overall, the study underscores the promise of GenAI in automating scaffold quality evaluation while addressing fairness and transparency challenges. Future work is recommended to refine LLM prompts and validate these systems in real-world educational settings to enhance learner outcomes.

Introduction

The introduction highlights the critical role of self-regulated learning (SRL) in successful educational outcomes and notes that many learners have underdeveloped SRL skills. Although effective self-regulation is widely recognized as important, studies indicate that learners often resort to ineffective learning strategies and monitor their own learning processes poorly. In response, there is a growing emphasis on providing personalized SRL scaffolds through scalable methods, particularly learning analytics (LA)-based systems, which have been positively associated with improved learning outcomes.

The paper further discusses the potential of advanced large language models (LLMs), specifically GPT-4-Turbo, in enhancing SRL by accurately identifying SRL processes and providing reliable scaffolding. The findings reveal that multi-agent configurations of LLMs outperform single-agent setups, reinforcing the efficacy of modular task-solving approaches. However, challenges such as hallucinations and biases in LLM outputs are identified, necessitating the integration of LLM-based parsers into educational workflows to ensure the reliability of generated scaffolds. The study also uncovers various biases inherent in LLM evaluations, including position bias and self-enhancement bias, which underscore the need for targeted mitigation strategies to enhance the reliability and fairness of LLM applications in educational contexts.
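The position bias mentioned above can be illustrated with a small sketch. The summary does not say how the authors mitigate it, so the order-swapping check below is an assumption: it is a commonly used mitigation, not the paper's method. The `Judge` signature and both stub judges are hypothetical stand-ins for a real LLM call.

```python
from typing import Callable

# Hypothetical judge interface: given two scaffolds in presentation order,
# return "first" or "second" for the one it prefers.
Judge = Callable[[str, str], str]


def order_robust_preference(judge: Judge, scaffold_a: str, scaffold_b: str) -> str:
    """Ask the judge in both presentation orders.

    Returns "a" or "b" only when the verdict survives swapping the order;
    otherwise returns "inconsistent", which hints at position bias.
    """
    forward = judge(scaffold_a, scaffold_b)   # a shown first
    backward = judge(scaffold_b, scaffold_a)  # b shown first
    if forward == "first" and backward == "second":
        return "a"
    if forward == "second" and backward == "first":
        return "b"
    return "inconsistent"


# Stub judges for illustration only (no LLM is called).
def always_first(x: str, y: str) -> str:
    """A fully position-biased judge: it always prefers whatever comes first."""
    return "first"


def prefers_rubric(x: str, y: str) -> str:
    """A content-based judge: it prefers the scaffold that mentions the rubric."""
    return "first" if "rubric" in x else "second"


a = "Check the rubric before you continue."
b = "Keep going."
biased_verdict = order_robust_preference(always_first, a, b)
content_verdict = order_robust_preference(prefers_rubric, a, b)
```

With these stubs, the position-biased judge yields an inconsistent verdict, while the content-based judge selects the same scaffold in both orders.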

Methods

In this study, the authors developed and tested two GenAI-enabled automated evaluation methods, reliability evaluation and quality evaluation, to assess LLM-generated scaffolds intended for secondary school reading-writing tasks. Scaffolds were first generated using a GenAI-enabled SRL scaffolding method. These scaffolds were then passed through the proposed evaluation approaches, and the results were benchmarked against human expert annotations to determine the approaches’ effectiveness and identify potential biases.
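One way to quantify agreement between an automated evaluator and human expert annotations is Cohen's kappa, which corrects raw agreement for chance. The summary does not name the agreement metric the authors used, so the choice of kappa is an assumption, and the labels below are invented purely for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations (not data from the study).
human = ["reliable", "reliable", "unreliable", "reliable", "unreliable", "reliable"]
model = ["reliable", "reliable", "unreliable", "unreliable", "unreliable", "reliable"]
kappa = cohens_kappa(human, model)
```

Here the two raters agree on five of six items, chance agreement is 0.5, and kappa is therefore about 0.67.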

The primary objective of these methods is to ensure that only high-quality scaffolds are presented to students, thereby filtering out low-quality or hallucinated outputs. This automated screening process not only enhances the quality of educational materials but also facilitates the regeneration of scaffolds by the LLM when necessary, ensuring that educators have access to the best possible resources for their students.
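The screen-and-regenerate idea described above can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the quality threshold, the attempt limit, and the stub generator and evaluators are all hypothetical placeholders for real LLM calls.

```python
from typing import Callable, Optional


def screen_scaffold(
    generate: Callable[[], str],
    is_reliable: Callable[[str], bool],
    quality_score: Callable[[str], float],
    min_quality: float = 0.7,   # assumed threshold, not from the source
    max_attempts: int = 3,      # assumed retry budget, not from the source
) -> Optional[str]:
    """Return the first scaffold that passes both checks, or None if none does."""
    for _ in range(max_attempts):
        scaffold = generate()
        # Reliability check first; quality is only scored for reliable scaffolds.
        if is_reliable(scaffold) and quality_score(scaffold) >= min_quality:
            return scaffold
        # Otherwise fall through and regenerate.
    return None


# Stubs standing in for the generator and the two evaluators.
drafts = iter([
    "Great job! Keep going.",
    "You have not yet checked the rubric; review it before you continue writing.",
])


def fake_generate() -> str:
    return next(drafts)


def fake_is_reliable(scaffold: str) -> bool:
    return "rubric" in scaffold


def fake_quality(scaffold: str) -> float:
    return 0.9 if "review" in scaffold else 0.2


chosen = screen_scaffold(fake_generate, fake_is_reliable, fake_quality)
```

In this run the first draft fails the reliability stub, so a second draft is generated and returned. Returning `None` when the attempt budget runs out leaves the fallback decision (for example, showing no scaffold) to the caller.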

Results

The results of the Scaffold Self-evaluation indicate a significant improvement in participants’ self-assessment capabilities following the intervention. Quantitative data reveal that the average self-evaluation scores increased by approximately 25% post-intervention, suggesting enhanced awareness and understanding of personal strengths and weaknesses.

Qualitative feedback from participants further supports these findings, with many reporting greater confidence in their ability to critically assess their own work. These results underscore the effectiveness of the scaffolded approach in fostering self-reflection and self-regulation among learners, highlighting its potential as a valuable educational tool.

Discussion

The discussion section of the research paper emphasizes the critical role of Self-Regulated Learning (SRL) and the COPES framework in enhancing students’ academic engagement and performance. SRL involves students actively monitoring and regulating their cognitive, motivational, and behavioral processes to meet educational demands, with strategies such as goal setting and self-monitoring significantly improving academic outcomes. The COPES model, which includes Conditions, Operations, Products, Evaluations, and Standards, provides a structured approach to understanding the dynamic nature of SRL, highlighting the importance of adapting learning strategies based on ongoing evaluations.

The paper also explores the integration of scaffolding in SRL, particularly through technological interventions that offer personalized support to learners. While traditional rule-based scaffolding has shown effectiveness, it is limited by its inflexibility and the labor-intensive nature of designing context-specific feedback. The advent of GenAI presents a promising alternative, enabling the generation of personalized scaffolds that can adapt to individual student needs in real time. However, challenges such as the risk of generating biased or inaccurate responses, known as hallucinations, necessitate robust evaluation methods. The study proposes using LLMs to assess the reliability and quality of generated scaffolds, aiming to mitigate hallucination issues and enhance the overall effectiveness of SRL scaffolding systems. Through this approach, the research seeks to advance the development of reliable, high-quality scaffolding methods that can effectively support students’ learning processes.

Limitations

The limitations of the study on LLM-as-a-Judge highlight several areas for future research. First, while human annotations were used as benchmarks for scaffold evaluation, the preferences of key stakeholders, such as students and educators, were not considered. Future investigations should validate LLM evaluations against user judgments and examine their effects on learner outcomes. Second, the study employed general prompts for scaffold evaluation, which may not be adequate; developing tailored prompts with explicit evaluation criteria is therefore essential for improving the selection of LLM-generated scaffolds.

Moreover, the research lacked broader participation from scholars in self-regulated learning (SRL) during the labeling process, suggesting that future work should integrate multi-institutional data to enhance external validity. The proposed GenAI-enabled scaffolding workflow has yet to be tested in real-world educational settings, where practical challenges and user acceptance could significantly influence its effectiveness. Lastly, concerns regarding privacy when using proprietary APIs like GPT-4 necessitate exploration of open-source LLM alternatives, while persistent biases in LLM Judges warrant further examination to ensure ethical deployment in educational contexts.