Can AI grade your essays? A comparative analysis of large language models and teacher ratings in multidimensional essay scoring

Journal: Proceedings of the 15th International Learning Analytics and Knowledge Conference
DOI: https://doi.org/10.1145/3706468.3706527
Publication Date: 2025-02-21
Author(s): Kathrin Seßler et al.
Primary Topic: Text Readability and Simplification

Overview

In this study, the authors evaluated open-source and closed-source large language models (LLMs) on how well they assess German-language student essays, comparing the models' ratings with those given by human teachers. The study used a multidimensional evaluation framework (Figure 1 of the paper) in which each essay was rated against predefined criteria. The findings showed that the recent o1 model was highly reliable and correlated strongly with human ratings, particularly on language-related criteria, although it tended to assign higher overall scores than the human raters.

In contrast, open-source models such as LLaMA 3 and Mixtral showed low variance and weak correlations with human ratings, which limits their usefulness for educational assessment. The study concludes that o1 shows potential as a support tool for teachers, but that its ability to evaluate content needs further improvement. That improvement matters because feedback should not only match human standards but also show students how to improve their writing.

Introduction

The introduction highlights the considerable time teachers spend correcting student texts, particularly in German secondary schools, where proofreading accounts for 14% of their working hours. A nationwide teacher shortage in Germany makes the problem worse and creates a need for efficient ways to support teachers in assessment. Artificial intelligence (AI), and LLMs in particular, could reduce this workload by automating parts of essay evaluation and feedback generation.

Despite these potential benefits, the introduction identifies critical gaps in what is known about how well LLMs work in educational settings. Existing research often relies on holistic scoring or a small set of criteria, which fails to capture the complexity of student writing. The subjective nature of teacher evaluations and the limited robustness of the datasets used to train LLMs pose further challenges. To address these issues, the study evaluates several LLMs against predefined criteria for assessing student essays and compares their outputs with human evaluations. The research questions concern the reliability of LLM assessments, their strengths and limitations in judging multidimensional essay quality, and the influence of individual criteria on the reasoning of both humans and LLMs.

Methods

The study investigates how effectively LLMs assess student essays on ten specified criteria. The methods section describes the essays that were evaluated, the scoring criteria, the participants, and how the LLMs were used in the evaluation. It also details the metrics used to analyze the LLMs' performance.

Results

This section examines the differences between teachers' actual assessments of student texts and the assessments generated by open-source and closed-source LLMs. The analysis is organized around the three research questions. The findings are intended to show how effective and reliable LLM-generated evaluations are compared with teacher assessments, and what this implies for educational practice and the use of AI in assessment.
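
As a rough illustration of this kind of comparison, the sketch below computes a rank correlation and a mean score difference between teacher and model ratings for one criterion. The summary does not say which correlation measure the authors used, so Spearman's rank correlation is an assumption here, and the ratings, the 1 to 6 scale, and all function names are invented for illustration.

```python
from statistics import mean

def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            result[idx] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; returns NaN when either input has zero variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else float("nan")

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical ratings of six essays on one criterion (scale 1 to 6)
teacher = [2, 3, 3, 4, 5, 6]
model = [3, 4, 4, 5, 6, 6]   # same ordering, but scored higher
flat = [3, 3, 3, 3, 3, 3]    # a model that always answers with one value

rho = spearman(teacher, model)
bias = mean(model) - mean(teacher)   # positive: model scores higher on average
rho_flat = spearman(teacher, flat)   # NaN: no variance, nothing to correlate
```

In this toy data the model ranks the essays almost exactly as the teacher does while scoring them higher on average, which is why a correlation and a mean difference are worth reporting side by side. A model whose ratings never vary yields an undefined correlation.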

Discussion

The study evaluated the performance and reliability of open-source and closed-source LLMs in assessing German-language student essays. The closed-source models, particularly o1, were more reliable and correlated more strongly with human assessments than open-source models such as LLaMA 3 and Mixtral. Specifically, o1 achieved significant correlations in eight of the ten evaluation categories and did especially well on language-related aspects. The open-source models showed minimal correlation with human ratings, and their scores often clustered around the scale midpoint with low variance. This suggests that the open-source models are less able to distinguish between levels of essay quality, as reflected in their poor intraclass correlation coefficient (ICC) scores.
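
To make the ICC point concrete, the following sketch computes a one-way random-effects ICC, often written ICC(1,1), over a table of repeated ratings. The summary does not state which ICC variant the authors used, so the choice of formula is an assumption, and the score tables are invented for illustration.

```python
from statistics import mean

def icc_oneway(scores):
    """One-way random-effects ICC(1,1).

    scores: list of rows, one row per essay, one column per repeated rating.
    """
    n = len(scores)       # number of essays
    k = len(scores[0])    # number of ratings per essay
    grand = mean(v for row in scores for v in row)
    row_means = [mean(row) for row in scores]
    # Mean squares between essays and within essays (one-way ANOVA)
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (v - m) ** 2 for row, m in zip(scores, row_means) for v in row
    ) / (n * (k - 1))
    denom = ms_between + (k - 1) * ms_within
    return (ms_between - ms_within) / denom if denom else float("nan")

# Hypothetical ratings: four essays, three runs each, scale 1 to 6
consistent = [[2, 2, 2], [4, 4, 5], [6, 6, 6], [3, 3, 3]]
midpoint_cluster = [[3, 4, 3], [4, 3, 3], [3, 3, 4], [4, 4, 3]]

icc_consistent = icc_oneway(consistent)      # close to 1
icc_cluster = icc_oneway(midpoint_cluster)   # near or below 0
```

When ratings hover around the midpoint, almost all of the variance lies within essays rather than between them, so the ICC collapses even though no single rating looks unreasonable.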

The study also highlights how difficult it is for LLMs to align their assessments with human judgment, especially on content-heavy criteria. The closed-source models tended to focus on linguistic surface features such as spelling and punctuation, whereas the human raters placed more weight on the overall structure and content of the essays. This discrepancy raises concerns about the suitability of LLMs in educational contexts, where a nuanced understanding of content quality is essential. Overall, the research underscores the need to explore LLM capabilities in essay assessment further, particularly for non-English texts and multidimensional evaluation criteria.

Limitations

The study's limitations point to several areas for future research. No prompt engineering was applied, so future work could benefit from techniques such as Chain-of-Thought (CoT) prompting, which has been shown to improve model performance. The study covered a limited set of models (GPT-3.5, GPT-4, o1, LLaMA 3, and Mixtral); adding others, such as Claude and Gemini, could give a more comprehensive picture of LLM capabilities across architectures.

The research also covered a single essay type, which limits the generalizability of the findings. Future studies should examine other essay formats, such as argumentative essays, and use a broader, more diverse dataset to better understand the challenges posed by different writing styles. The variability in LLM output across multiple runs, particularly for the open-source models, shows the need for mechanisms that aggregate scores and mitigate outliers. In addition, variability among human raters means there is no gold standard, which complicates automated evaluation and calls for further work on incorporating the nuances of human assessment into LLM training. Finally, the trend toward better reliability and closer alignment with human ratings across OpenAI's closed-source models suggests that continued advances in LLM technology will further increase their usefulness for automated essay evaluation in education.
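
One simple way to aggregate scores across runs, sketched here as an illustration rather than as the authors' method, is to take the per-criterion median of several runs so that a single outlying rating does not move the final score. The criterion names and ratings below are hypothetical.

```python
from statistics import median

def aggregate_runs(runs):
    """Combine repeated ratings of one essay into one score per criterion.

    runs: list of dicts, each mapping criterion name to the score from one run.
    The median is used because it is insensitive to a single extreme value.
    """
    criteria = runs[0].keys()
    return {c: median(run[c] for run in runs) for c in criteria}

# Hypothetical ratings of one essay from five runs of the same model
runs = [
    {"spelling": 4, "structure": 3},
    {"spelling": 4, "structure": 3},
    {"spelling": 5, "structure": 1},  # the "structure" score is an outlier
    {"spelling": 4, "structure": 3},
    {"spelling": 4, "structure": 4},
]

final_scores = aggregate_runs(runs)
```

A mean would pull the structure score down to 2.8 because of the single outlier, whereas the median stays at 3. The trade-off is that every essay has to be rated several times, which multiplies inference cost.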
