AI versus human-generated multiple-choice questions for medical education: a cohort study in a high-stakes examination

Journal: BMC Medical Education, Volume: 25, Issue: 1
DOI: https://doi.org/10.1186/s12909-025-06796-6
PMID: https://pubmed.ncbi.nlm.nih.gov/39923067
Publication Date: 2025-02-08
Author(s): Alex Kwok-Keung Law et al.
Primary Topic: Artificial Intelligence in Healthcare and Education

Overview

This study investigates the efficacy of ChatGPT-4o in generating high-quality multiple-choice questions (MCQs) for medical education, specifically in the context of a high-stakes licensing exam. Conducted among medical doctors preparing for the Primary Examination on Emergency Medicine (PEEM), the research compared AI-generated MCQs with those created by human experts. Expert reviewers assessed the questions for factual correctness, relevance, difficulty, and alignment with Bloom’s taxonomy, while psychometric analyses evaluated their performance metrics, including difficulty and discrimination indices.

The findings revealed that AI-generated MCQs were significantly easier than human-generated ones (mean difficulty index of 0.78 vs. 0.69, p < 0.01) but showed similar discrimination. However, AI-generated questions contained more factual inaccuracies and were less effective at assessing higher-order cognitive skills. Notably, using AI substantially reduced the time required for question generation (24.5 hours compared to 96 hours for human-generated questions). The study concludes that while ChatGPT-4o can produce MCQs efficiently, it lacks the depth needed for comprehensive assessment, which underscores the importance of human oversight. A hybrid model combining AI efficiency with expert review is recommended to optimize question quality in high-stakes medical examinations. Future research should focus on enhancing AI capabilities to produce more complex and contextually appropriate questions.
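As a quick check on the reported time figures, the short Python sketch below works out the absolute and relative savings implied by 24.5 hours versus 96 hours. The function name is hypothetical and the calculation is simple arithmetic on the two numbers quoted above; it is not part of the study's analysis.

```python
def time_savings(ai_hours: float, human_hours: float) -> dict:
    """Summarize how much faster one process is than another."""
    hours_saved = human_hours - ai_hours
    return {
        "hours_saved": hours_saved,
        # Share of the human effort that was avoided.
        "relative_reduction": hours_saved / human_hours,
        # How many times longer the human process took.
        "speedup": human_hours / ai_hours,
    }


# Figures quoted in the summary above.
summary = time_savings(ai_hours=24.5, human_hours=96.0)
print(f"Hours saved: {summary['hours_saved']:.1f}")
print(f"Relative reduction: {summary['relative_reduction']:.1%}")
print(f"Speed-up factor: {summary['speedup']:.1f}x")
```

On these figures the AI workflow saves 71.5 hours, roughly three quarters of the human effort.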

Introduction

The introduction highlights the significance of high-quality multiple-choice questions (MCQs) in evaluating medical trainees’ knowledge, particularly within the context of the Primary Examination on Emergency Medicine (PEEM) conducted by the Hong Kong College of Emergency Medicine (HKCEM). The increasing demand for such assessments has intensified pressure on exam developers, traditionally reliant on human experts for MCQ creation—a process that is both time-consuming and resource-intensive. Preliminary studies suggest that artificial intelligence (AI) could streamline this process by generating a large volume of MCQs more efficiently, yet the effectiveness of AI-generated questions in high-stakes examinations remains inadequately explored.

Existing literature primarily emphasizes the efficiency of AI in educational settings, often lacking rigorous validation of the quality and psychometric robustness of AI-generated MCQs. While some studies have reported favorable expert review scores for AI-generated questions, these evaluations are often limited in scope and may not reflect the true potential of AI in high-stakes contexts. Notably, research on ChatGPT-3.5 has indicated a tendency to produce questions focused on lower-order cognitive skills, raising concerns about their appropriateness for rigorous assessments. This study aims to fill the gap by rigorously evaluating ChatGPT-4o’s capabilities in generating high-quality MCQs, comparing them directly with human-generated questions in terms of psychometric quality, candidate performance, expert reviews, and time efficiency, thereby addressing critical questions about AI’s role in medical licensing examinations.

Methods

This prospective cohort study enrolled medical doctors preparing for the Primary Examination on Emergency Medicine (PEEM), organized by the Hong Kong College of Emergency Medicine (HKCEM) in August 2024. Participants were required to hold a basic medical degree; they were recruited through email invitations and provided informed consent. The study used convenience sampling and excluded individuals who did not complete both assessments or who withdrew consent. The PEEM consists of 100 best-of-five multiple-choice questions (MCQs) covering applied medical sciences relevant to emergency medicine, with equal representation from anatomy, pathology, pharmacology, and physiology.

Participants were assessed using two sets of MCQs: one generated by ChatGPT-4o and another by a panel of 26 human subject matter experts. The AI-generated questions were crafted based on specific prompts aligned with the PEEM’s learning objectives, while the human team followed established guidelines for question development. Both sets underwent rigorous evaluation by a panel of six expert reviewers, who assessed factual correctness, relevance, difficulty level, alignment with Bloom’s taxonomy, and item writing flaws. The review process included iterative feedback for both AI and human-generated questions, ensuring adherence to PEEM standards. Participants completed a mock examination with AI-generated MCQs three weeks prior to the actual PEEM, allowing for a direct comparison of performance between the two question sets. Data on participant characteristics and performance were systematically collected and analyzed.
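To make the psychometric comparison of the two question sets more concrete, the following Python sketch shows one common way to compute a difficulty index and a discrimination index from scored responses. It is an illustration only: the response matrix is invented toy data, the function names are hypothetical, and the upper-lower 27% grouping is a widely used convention assumed here, not necessarily the procedure the study's authors followed.

```python
from typing import List

# Each row is one candidate, each column one item; 1 = correct, 0 = incorrect.
# Invented toy data, not taken from the study.
toy_responses: List[List[int]] = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]


def difficulty_index(responses: List[List[int]]) -> List[float]:
    """Proportion of candidates answering each item correctly (higher = easier)."""
    n_candidates = len(responses)
    n_items = len(responses[0])
    return [sum(row[j] for row in responses) / n_candidates for j in range(n_items)]


def discrimination_index(responses: List[List[int]], fraction: float = 0.27) -> List[float]:
    """Upper-lower discrimination index per item.

    Candidates are ranked by total score; the index is the difference in
    proportion correct between the top and bottom groups.
    """
    n_items = len(responses[0])
    group_size = max(1, round(len(responses) * fraction))
    ranked = sorted(responses, key=sum, reverse=True)
    upper, lower = ranked[:group_size], ranked[-group_size:]
    return [
        (sum(r[j] for r in upper) - sum(r[j] for r in lower)) / group_size
        for j in range(n_items)
    ]


print(difficulty_index(toy_responses))
print(discrimination_index(toy_responses))
```

On the toy data, the first item is answered correctly by five of six candidates (difficulty index about 0.83), so it is easy, and its discrimination index of 0.5 shows it still separates the top and bottom groups to some degree.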

Results

Twenty-four participants were recruited, short of the target sample of 34. AI-generated MCQs were significantly easier than human-generated ones, with a mean difficulty index of 0.78 versus 0.69 (p < 0.01); a higher difficulty index means a larger share of candidates answered correctly. Discrimination was comparable between the two sets, with discrimination indices of 0.22 for AI-generated and 0.26 for human-generated questions.

Expert review found that AI-generated questions contained more factual inaccuracies and more often targeted lower-order cognitive skills under Bloom's taxonomy. Generating the AI question set took 24.5 hours, compared with 96 hours for the human-generated set.

Discussion

The discussion addresses how well AI-generated multiple-choice questions (MCQs) perform in a high-stakes medical examination. The required sample size was determined to be 34 participants, and the statistical analysis used several psychometric measures: the difficulty index, the discrimination index (DI), and Kuder-Richardson Formula 20 (KR-20) reliability. AI-generated MCQs were significantly easier, with a mean difficulty index of 0.78 compared with 0.69 for human-generated MCQs (p < 0.01). Both sets showed comparable discrimination, with DIs of 0.22 for AI-generated and 0.26 for human-generated questions, suggesting that AI-generated questions, although less challenging, still differentiate between high and low performers.

The study emphasizes the importance of expert review: AI-generated MCQs contained more factual inaccuracies and were more likely to assess lower-order cognitive skills under Bloom's taxonomy. Despite the efficiency of AI in generating questions, the findings underscore the need for human oversight to ensure educational appropriateness and accuracy.

The authors advocate a hybrid AI-human approach to MCQ development, in which AI generates initial questions that human experts then refine. Future research should focus on iterative collaboration models and explore multimodal capabilities to increase the cognitive complexity of assessments, so that AI tools complement rather than replace human expertise in medical education.
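For readers unfamiliar with KR-20, the reliability coefficient named above, the Python sketch below shows how it is typically computed for dichotomously scored items. It is a minimal illustration under stated assumptions: the data are invented, the function name is hypothetical, and the population variance of total scores is used (some references use the sample variance instead). The source does not report the study's KR-20 values or its exact computation.

```python
from statistics import pvariance
from typing import List

# Invented toy data: rows are candidates, columns are items (1 = correct).
sample_responses: List[List[int]] = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]


def kr20(responses: List[List[int]]) -> float:
    """Kuder-Richardson Formula 20 for items scored 0 or 1.

    KR-20 = k / (k - 1) * (1 - sum(p_j * q_j) / variance of total scores)
    """
    n_candidates = len(responses)
    k = len(responses[0])
    if k < 2:
        raise ValueError("KR-20 needs at least two items")

    total_scores = [sum(row) for row in responses]
    total_variance = pvariance(total_scores)  # population variance (assumption)
    if total_variance == 0:
        raise ValueError("KR-20 is undefined when all total scores are equal")

    # Sum of item variances p * q, where p is the proportion correct per item.
    pq_sum = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n_candidates
        pq_sum += p * (1 - p)

    return (k / (k - 1)) * (1 - pq_sum / total_variance)


print(f"KR-20 on toy data: {kr20(sample_responses):.3f}")
```

Values closer to 1 indicate that the items measure the same underlying ability more consistently.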

Limitations

The study presents several limitations that may impact the validity and generalizability of its findings. A primary concern is the small sample size, with only 24 out of the targeted 34 participants recruited, potentially leading to an underpowered analysis. The majority of participants were junior doctors, including interns and early-career residents, whose limited experience with complex clinical reasoning tasks could have influenced their performance on both AI-generated and human-generated multiple-choice questions (MCQs). This demographic factor raises questions about the applicability of the results to a broader population of medical professionals.

Additionally, the prompts used to generate the AI MCQs, while aligned with PEEM standards, did not specifically target higher-order cognitive skills, which may have affected question quality. The AI-generated questions were administered in a mock examination, whereas the human-generated questions were taken in the actual PEEM; this difference in setting may have affected participant motivation and performance. Participants may also have studied more during the three-week interval between the two assessments, further confounding the comparison. Furthermore, the study reflects the AI model as of its training cutoff, so later advances in AI capabilities are not captured in the findings. Lastly, neither the expert reviewers nor the participants were blinded, which introduces potential bias that should be considered when interpreting the results.