Automatic Short Answer Grading in the LLM Era: Does GPT-4 with Prompt Engineering beat Traditional Models?

Journal: Proceedings of the 15th International Learning Analytics and Knowledge Conference
DOI: https://doi.org/10.1145/3706468.3706481
Publication Date: 2025-02-21
Author(s): Rafael Ferreira Mello et al.
Primary Topic: Topic Modeling

Overview

Assessing short answers in educational contexts is difficult to scale, and Automatic Short Answer Grading (ASAG) has emerged as a solution. Traditional machine learning models, including ensemble methods and embeddings, have been studied extensively but often struggle to generalize. Large Language Models (LLMs) such as GPT-4 have recently been proposed as alternatives, yet prior research has not thoroughly evaluated their performance against traditional models, particularly with respect to prompt engineering strategies.

The study presented here conducts a comparative analysis of traditional machine learning models and GPT-4 within the ASAG framework. It examines various models and text representation techniques while also exploring the impact of prompt engineering on LLM performance. Findings reveal that traditional models generally outperform LLMs; however, GPT-4 demonstrates significant potential when enhanced with optimized prompts, such as few-shot examples and explicit instructions. This research contributes to the existing literature by offering a comprehensive evaluation of LLMs in a multilingual ASAG context, thereby providing valuable insights for the development of more effective automatic grading systems.

Introduction

The introduction of the research paper highlights the critical role of assessment in the learning process, particularly formative assessments that guide both students and teachers in enhancing educational outcomes. While closed-response formats like multiple-choice questions are efficient for grading, they suffer from limitations such as reliance on test-taking strategies and a lack of face validity. Open-ended questions, although providing deeper insights into student understanding, pose significant challenges for large-scale grading. Automatic Short Answer Grading (ASAG) has emerged as a viable solution, leveraging technology to evaluate open-ended responses more efficiently. However, traditional ASAG systems face difficulties due to the variability in student responses and the need for extensive annotated datasets.

The paper posits that integrating Large Language Models (LLMs) like GPT-4 into ASAG systems could address these challenges, as LLMs have shown promise in evaluating diverse responses with minimal prompt engineering. Despite preliminary findings indicating their potential, it remains unclear how LLMs perform across different languages and subjects compared to traditional machine learning models. This study aims to fill that gap with a comparative analysis of LLMs and traditional models in ASAG, using datasets in both Portuguese and English. It explores various text representation techniques and prompt engineering strategies, providing insights into how effective these models are in different linguistic contexts and addressing the limitations of existing ASAG systems.

Methods

The authors evaluate model performance with Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE). These metrics were selected for their effectiveness in measuring prediction accuracy in regression settings, particularly in text mining and the ASAG task. RMSE is the square root of the mean of the squared differences between predicted and actual values; because errors are squared before averaging, larger errors weigh more heavily. MAE is the mean of the absolute differences, so it captures the average magnitude of errors regardless of their direction and gives a straightforward measure of prediction accuracy.
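The two metric definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the function names and the grade values are hypothetical, chosen so that a single large miss shows why RMSE exceeds MAE.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error: square root of the mean squared difference."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def mae(predicted, actual):
    """Mean absolute error: mean of the absolute differences."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    absolute = [abs(p - a) for p, a in zip(predicted, actual)]
    return sum(absolute) / len(absolute)


# Hypothetical grades (not taken from the paper): three exact matches, one miss of 4.
human_grades = [2.0, 3.0, 1.0, 4.0]
model_grades = [2.0, 3.0, 1.0, 0.0]

print(rmse(model_grades, human_grades))  # 2.0
print(mae(model_grades, human_grades))   # 1.0
```

With one error of 4 across four answers, MAE averages it to 1.0, while RMSE squares it first and reports 2.0, which is the sense in which RMSE emphasizes larger errors.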

To address Research Question 1 (RQ1), the authors evaluated 128 prompts for each dataset and ranked them by RMSE and MAE, where a higher rank (lower error) indicates closer agreement with the ground truth. They then analyzed how often each prompt component appeared among the top-ranked prompts to identify trends. For Research Questions 2 (RQ2) and 3 (RQ3), the study compares traditional machine learning algorithms with the top three GPT prompt configurations across two dataset splits, one based on student answers and the other on teacher questions, as detailed in Tables 5 and 6. This design allows model performance to be analyzed under different conditions.
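The ranking and component-frequency analysis for RQ1 could look roughly like the sketch below. Several things here are assumptions: the component names, the error values (placeholders, not results from the paper), and the rule of sorting by RMSE with MAE as a tie-breaker, since the summary does not say how the two metrics were combined.

```python
from collections import Counter


def rank_prompts(results):
    """Order prompt results best-first: lowest RMSE, with MAE breaking ties (assumed rule)."""
    return sorted(results, key=lambda r: (r["rmse"], r["mae"]))


def component_frequency(ranked, top_k):
    """Count how often each prompt component appears among the top_k prompts."""
    counts = Counter()
    for result in ranked[:top_k]:
        counts.update(result["components"])
    return counts


# Placeholder scores for four hypothetical prompt configurations.
results = [
    {"components": ("instructions",), "rmse": 1.3, "mae": 1.0},
    {"components": (), "rmse": 1.6, "mae": 1.2},
    {"components": ("instructions", "few_shot"), "rmse": 0.9, "mae": 0.7},
    {"components": ("few_shot",), "rmse": 1.1, "mae": 0.8},
]

ranked = rank_prompts(results)
top_components = component_frequency(ranked, top_k=2)
print(ranked[0]["components"])  # best-ranked configuration
print(top_components)           # component counts among the two best prompts
```

In a real run, `results` would hold one entry per evaluated prompt (128 per dataset in the study), and `top_k` would control how many top-ranked prompts feed the frequency count.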

Results

The results compare traditional machine learning models with the top three GPT prompt configurations on both dataset splits, as detailed in Tables 5 and 6. Traditional models generally outperform GPT-4, although GPT-4 shows significant potential when its prompts include few-shot examples and explicit instructions.

Discussion

The discussion section of the paper highlights the growing interest in using Large Language Models (LLMs) for Automatic Short Answer Grading (ASAG), emphasizing key areas such as prompt engineering, few-shot learning, and fine-tuning that significantly affect model performance in educational settings. Recent studies, including those by Naismith et al. and Nguyen et al., demonstrate the potential of models like GPT-4 to closely align with human evaluators in grading discourse coherence and self-explanation responses, respectively. However, challenges remain, particularly in handling specific tasks and ensuring generalizability and fairness, which are critical in both traditional machine learning and human evaluations.

The paper identifies gaps in existing research, particularly the limited exploration of prompt engineering techniques and the comparative analysis of LLMs against traditional machine learning models in ASAG contexts. The authors propose three research questions aimed at investigating the effectiveness of various prompt elements for LLMs in both English and Brazilian Portuguese (RQ1), assessing the performance of LLMs relative to traditional models (RQ2), and evaluating the generalizability of these models when faced with unseen questions (RQ3). The findings suggest that while LLMs, particularly GPT-4, show promise, traditional models still outperform them in certain contexts, indicating a need for further refinement in prompt design and model training to enhance grading accuracy and effectiveness.
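The difference between grading new answers to known questions and grading answers to unseen questions (RQ3) comes down to how the data is split. The sketch below contrasts the two, under the assumption that the answer-based split is a random split over individual answers and the question-based split holds out whole questions; the row fields, function names, and toy data are hypothetical.

```python
import random


def split_by_answer(rows, test_fraction, seed=0):
    """Random split over individual answers; a question may appear on both sides."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def split_by_question(rows, test_fraction, seed=0):
    """Hold out whole questions so every test question is unseen during training."""
    questions = sorted({row["question_id"] for row in rows})
    random.Random(seed).shuffle(questions)
    n_test = max(1, int(len(questions) * test_fraction))
    held_out = set(questions[:n_test])
    train = [row for row in rows if row["question_id"] not in held_out]
    test = [row for row in rows if row["question_id"] in held_out]
    return train, test


# Toy data: four questions with five graded answers each.
rows = [
    {"question_id": q, "answer": f"answer {i}", "grade": float(i % 3)}
    for q in ("q1", "q2", "q3", "q4")
    for i in range(5)
]

a_train, a_test = split_by_answer(rows, test_fraction=0.25)
q_train, q_test = split_by_question(rows, test_fraction=0.25)

train_questions = {row["question_id"] for row in q_train}
test_questions = {row["question_id"] for row in q_test}
print(train_questions.isdisjoint(test_questions))  # True
```

A model trained on `q_train` never sees any answer to a test question, which is the stricter setting for measuring generalization to unseen questions.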

Limitations

In the “Limitations” section of the study, the authors acknowledge several constraints that may affect the findings. Firstly, while the evaluation included various prompt components, the specific configurations of these components could significantly influence performance. The study primarily focused on simple texts to assess the overall impact of each component, but future research intends to involve course instructors in crafting more detailed prompts, thereby enhancing the understanding of how the articulation of each component affects outcomes.

Secondly, the study’s analysis was limited to 30% of the datasets used due to financial constraints, although the authors note that their results align with existing literature that often uses even smaller datasets. Future investigations aim to incorporate a larger sample size across different languages and contexts to strengthen the robustness of the findings. Lastly, the research concentrated solely on GPT models, which are known for their strong performance, potentially narrowing the study’s scope. Future work will expand the analysis to include various large language models (LLMs), particularly open-source options, to enable a more comprehensive comparison of their capabilities in Automatic Short Answer Grading (ASAG).