LLM-based automatic short answer grading in undergraduate medical education

Journal: BMC Medical Education, Volume: 24, Issue: 1
DOI: https://doi.org/10.1186/s12909-024-06026-5
PMID: https://pubmed.ncbi.nlm.nih.gov/39334087
Publication Date: 2024-09-27
Author(s): Christian Grévisse
Primary Topic: Innovative Teaching Methods

Overview

This summary covers the application of Large Language Models (LLMs) to Automatic Short Answer Grading (ASAG) in medical education. The underlying study evaluated 2,288 student responses from 12 courses, written in three languages, using GPT-4 and Gemini 1.0 Pro. GPT-4 assigned significantly lower grades than the human graders, but it showed a low false positive rate and high precision for fully correct answers. Gemini 1.0 Pro’s grades, in contrast, were not significantly different from those of the teachers. Both LLMs showed moderate agreement with the human assessments. The study also found only a weak correlation between grading outcomes and the length or language of student responses, and it observed a potential bias when an LLM was made aware of the prior human grade.

The conclusions emphasize that while LLM-based ASAG can enhance grading efficiency, particularly for high-quality answers, human oversight remains essential, especially in high-stakes assessments. The study suggests that LLMs possess sufficient training knowledge for Bachelor-level medical questions, negating the need for fine-tuning. Teachers are encouraged to invest time in developing high-quality grading rubrics to maximize the benefits of LLMs, which could lead to significant time savings. Future research directions include exploring the capabilities of newer LLM versions and employing LLMs for formative assessments, potentially enhancing the feedback process for medical students.

Introduction

The introduction highlights the critical role of assessment in teaching and learning. It notes that multiple choice questions (MCQs), which are used especially in medical education, mostly address the lower levels of Bloom’s taxonomy. MCQs test whether students recognize knowledge, but they do not engage deeper cognitive processes such as recall or creation, which are better assessed through questions answered in natural language. Manual grading of such questions, however, is time-consuming and prone to bias, inconsistency, and grader fatigue, all of which reduce the reliability of assessments. Automatic short answer grading (ASAG) is presented as a solution to these challenges: it offers immediate feedback to students and valuable insights to educators, which in turn supports instructional decisions.

The paper then traces the evolution of ASAG from rule-based methods in the 1960s through later advances in statistical techniques and machine learning, particularly in natural language processing (NLP). The advent of Large Language Models (LLMs) marks a significant development for ASAG, promising improved fairness and efficiency while matching or exceeding the accuracy of human graders. However, the author cautions about potential issues around consistency, biases in training data, knowledge limitations, hallucinations, prompt injections, transparency, privacy, and technological accessibility. The study investigates the performance of LLM-based ASAG in medical education by comparing grades assigned by GPT-4 and Gemini 1.0 Pro with those given by human evaluators across several languages and answer lengths. Its research questions concern grading differences between humans and LLMs, the influence of prior human grades, consistency, and the effect of language and answer length on grading outcomes. The work is positioned as a pioneering application of LLM-based ASAG in medical education, with the potential to significantly improve grading efficiency for medical instructors.

Methods

In this study, 2,288 student responses to 82 questions were collected from 12 undergraduate medical education courses at the University of Luxembourg, spanning Summer Term 2021/2022 to Winter Term 2023/2024. Only questions that came with an evaluation rubric or sample solution were analyzed, which left 82 of an initial pool of 196 questions. Most responses originated from biopathology and oncology courses, with a median of 30 students answering each question. The questions were administered through Moodle, and responses were graded by a human evaluator, typically the author, on a scale from 0 to 10. The median answer length was 190 characters. Responses were predominantly in French (62%) and English (37%), with a small fraction in German (1%).

To grade the answers, two large language models (LLMs), GPT-4 and Gemini 1.0 Pro, were called through the OpenAI and VertexAI Python packages, respectively. Both models were run with their default hyperparameters: a temperature of 1.0 for GPT-4 and 0.9 for Gemini, values that balance consistent and diverse outputs. The prompts provided to the LLMs included the question stem, the evaluation rubric, the student answer, and the maximum points. Additionally, a biased version of GPT-4 (GPT-4b) was tested, whose prompt also included the human evaluator’s grade, in order to study the influence of prior human grades. The analysis was conducted with Python libraries such as pandas, SciPy, and scikit-learn, with visualizations created using Matplotlib and seaborn. A significance level of α = 0.05 was applied.
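
The prompt structure described above can be illustrated with a minimal sketch. It only assembles the four inputs named in the text (plus the optional human grade used for the biased variant) and parses a numeric reply. The function names, the prompt wording, and the example question are hypothetical; the study’s actual prompt is not reproduced in this summary, and no API call is made here.

```python
import re
from typing import Optional


def build_grading_prompt(stem: str, rubric: str, answer: str,
                         max_points: float,
                         human_grade: Optional[float] = None) -> str:
    """Assemble a grading prompt from the inputs named in the text.

    The wording is an assumption made for illustration only.
    """
    parts = [
        "You are grading a short answer from a medical exam.",
        f"Question: {stem}",
        f"Evaluation rubric: {rubric}",
        f"Student answer: {answer}",
        f"Maximum points: {max_points:g}",
    ]
    if human_grade is not None:
        # Biased variant (GPT-4b in the text): the human grade is disclosed.
        parts.append(f"A human grader awarded: {human_grade:g}")
    parts.append("Reply with the awarded points as a single number.")
    return "\n".join(parts)


def parse_grade(reply: str, max_points: float) -> Optional[float]:
    """Take the first unsigned number in the reply, clamped to [0, max_points]."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return None
    return min(max(float(match.group()), 0.0), max_points)


# Made-up example question, not taken from the study.
prompt = build_grading_prompt(
    stem="Name two hallmarks of cancer.",
    rubric="One point per correct hallmark.",
    answer="Sustained proliferation and evasion of apoptosis.",
    max_points=2,
)
biased_prompt = build_grading_prompt(
    "Name two hallmarks of cancer.", "One point per correct hallmark.",
    "Sustained proliferation.", 2, human_grade=1,
)
```

The resulting string would be sent as the user message to whichever model is being evaluated. Clamping the parsed value is a defensive choice, since a model may reply with a number outside the allowed range.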

Results

The “Results” section reports how the LLM grades compare with the human grades at a significance level of α = 0.05. GPT-4 assigned significantly lower grades than the human grader, whereas the grades from Gemini 1.0 Pro did not differ significantly from the human ones. Both models showed moderate agreement with the human assessments, and grading outcomes correlated only weakly with the length and language of the answers.

The results are illustrated through figures and tables, including Bland-Altman plots, which show the grade distributions and the relationships between graders. The analysis also covers GPT-4b, the variant that was given the human grade, and points to a potential bias when an LLM is aware of prior human grading.

Discussion

The discussion compares the human evaluators with three LLM-based graders: GPT-4, GPT-4b, and Gemini. The human grades for the 2,288 responses had a mean normalized score of 0.68, indicating generally high student performance. Significant differences were found between human and LLM grades: GPT-4 and GPT-4b assigned lower grades (mean normalized scores of 0.65 and 0.64, respectively) than the human grader, while Gemini’s grades were comparable to the human ones. Bland-Altman plots indicated low systematic bias among the LLM graders, although variability was noted, particularly for Gemini, whose grades tended to fluctuate.
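
The Bland-Altman comparison mentioned above can be sketched in a few lines. The example normalizes raw points to the range 0 to 1 and then computes the bias (mean difference) and the conventional 95% limits of agreement for paired grades. The function names and the five grades are made up for illustration; the summary does not state how the study computed its plots.

```python
from statistics import mean, stdev


def normalize(points, max_points):
    """Scale raw points to [0, 1] so questions with different maxima compare."""
    return [p / m for p, m in zip(points, max_points)]


def bland_altman(grades_a, grades_b):
    """Return (bias, lower limit, upper limit) for two paired grade lists.

    The bias is the mean of the differences a - b. The limits of agreement
    are the bias plus or minus 1.96 sample standard deviations of the
    differences.
    """
    diffs = [a - b for a, b in zip(grades_a, grades_b)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)
    return bias, bias - spread, bias + spread


# Made-up grades for five answers, each out of 10 points.
max_pts = [10, 10, 10, 10, 10]
human = normalize([8, 6, 10, 4, 7], max_pts)
llm = normalize([7, 6, 10, 3, 6], max_pts)

bias, lower, upper = bland_altman(llm, human)
print(f"bias={bias:.2f}, limits=({lower:.2f}, {upper:.2f})")
```

A negative bias means the first grader (here the LLM) awards fewer points on average than the second, which is the pattern the text reports for GPT-4. Wide limits of agreement would correspond to the variability noted for Gemini.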

The study also examined how answer length and language correlate with grading outcomes and found only weak correlations across all graders. Accuracy assessments indicated that while LLMs could reliably identify fully correct answers, they were more hesitant to assign zero points to incorrect responses, which underlines the need for human oversight. The precision of identifying fully correct answers was highest for GPT-4b (0.98), followed by GPT-4 (0.91) and Gemini (0.72). The findings suggest that LLMs can serve as effective grading assistants, particularly for clear-cut cases, allowing educators to focus on more nuanced evaluations. However, high-quality grading rubrics and human supervision remain crucial, especially in high-stakes assessments.
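
To make the precision figures concrete, the following sketch computes precision for the class “fully correct answer”. It assumes that an answer counts as predicted-positive when the LLM awards the maximum points and as actually-positive when the human grader does. This definition and the function name are assumptions for illustration, and the grades are made up.

```python
def full_credit_precision(llm_points, human_points, max_points):
    """Precision of the LLM on the class "fully correct answer".

    Predicted-positive: the LLM awards the maximum points.
    Actually-positive: the human grader awards the maximum points.
    Returns None if the LLM never awards full credit.
    """
    true_pos = false_pos = 0
    for llm, human, top in zip(llm_points, human_points, max_points):
        if llm == top:
            if human == top:
                true_pos += 1
            else:
                false_pos += 1
    predicted = true_pos + false_pos
    return true_pos / predicted if predicted else None


# Made-up grades: the LLM awards full credit three times, twice in
# agreement with the human grader.
llm_pts = [10, 10, 7, 10, 0]
human_pts = [10, 9, 7, 10, 2]
max_pts = [10] * 5

precision = full_credit_precision(llm_pts, human_pts, max_pts)
print(f"precision for full credit: {precision:.2f}")
```

Under this reading, a precision of 0.91 would mean that about nine out of ten answers to which the LLM gives full marks also received full marks from the human grader, which is why such cases are candidates for lighter human review.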