Evaluating and challenging the reasoning capabilities of generative artificial intelligence for technology-assisted chemistry education

Journal: Education and Information Technologies, Volume: 30, Issue: 8
DOI: https://doi.org/10.1007/s10639-024-13295-6
Publication Date: 2025-01-03
Author(s): Suna-Şeyma Uçar et al.
Primary Topic: Education and Critical Thinking Development

Overview

This study investigates the reasoning capabilities of large language models (LLMs), specifically ChatGPT, in the context of chemistry education. By employing targeted adversarial prompts to challenge the models with complex chemistry problems, the research assesses their performance and reveals both strengths and limitations. The findings indicate that while LLMs can generate diverse responses, they frequently exhibit inaccuracies, biases, and contradictions, particularly when faced with carefully designed queries.

The authors advocate for a balanced approach to integrating LLMs into educational practices, suggesting that these models can enhance learning experiences if utilized strategically. Educators are encouraged to create tasks that prompt students to critically analyze AI-generated content, fostering deeper understanding and engagement with scientific concepts. Ultimately, the study posits that LLMs should serve as complementary tools to traditional teaching methods, with educators playing a pivotal role in guiding their effective use to promote critical thinking.

Introduction

The introduction highlights the transformative impact of generative artificial intelligence (GAI) on information processing. GAI refers to AI systems’ ability to understand and produce various forms of content, including natural language, images, audio, and video. A prominent example is ChatGPT, which is built on large language model (LLM) architectures with billions of parameters, reflecting the scale of their training and their complexity. This advancement underscores the potential of GAI to enhance communication and creativity across multiple domains.

Methods

The study investigates the capabilities and limitations of large language models (LLMs), specifically GPT-3.5 (ChatGPT), in graduate-level chemistry education by means of adversarial attacks, that is, deliberately modified inputs intended to elicit incorrect responses from the model. The authors selected fundamental chemistry topics (acids and bases, atomic structure, chemical bonds, chemical reactions, and instrumental laboratory techniques) drawn from proprietary graduate-level textbooks. These topics were chosen for their relevance across disciplines, allowing a broad evaluation of the LLM’s performance. In total, 40 exercises were adapted into adversarial prompts designed to challenge the model’s understanding.

The prompts were categorized based on Talanquer’s (2023) framework into two main types: “misconceptions and biases,” where prompts contained inaccuracies or misleading concepts, and “pressing for more information,” which involved incomplete or complex scenarios requiring the model to refine its responses. For instance, prompts included scientifically inaccurate information or flawed assumptions, such as incorrect comparisons between strong and weak acids or impossible pH calculations. Responses generated by ChatGPT were subsequently evaluated by a domain expert to assess the model’s ability to recognize errors and provide accurate information, thereby illuminating its strengths and weaknesses in educational settings.
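As a rough illustration of the protocol described above, the following Python sketch shows one way the prompts, their categories, and the expert's judgment could be recorded and tallied. The class and field names are hypothetical, and the two sample records are placeholders invented for the sketch; they are not prompts or results from the study.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    # The two prompt types named in the summary of Talanquer's (2023) framework.
    MISCONCEPTIONS_AND_BIASES = "misconceptions and biases"
    PRESSING_FOR_MORE_INFORMATION = "pressing for more information"


@dataclass
class PromptRecord:
    topic: str            # e.g. "acids and bases"
    category: Category
    prompt: str           # the adversarial prompt text
    flaw_caught: bool     # expert judgment: did the model recognize the flaw?


def tally(records):
    """Return {category: (responses that caught the flaw, total prompts)}."""
    caught, total = Counter(), Counter()
    for record in records:
        total[record.category] += 1
        caught[record.category] += int(record.flaw_caught)
    return {category: (caught[category], total[category]) for category in total}


# PLACEHOLDER records, invented for illustration only.
sample = [
    PromptRecord("acids and bases", Category.MISCONCEPTIONS_AND_BIASES,
                 "placeholder prompt with a false premise", False),
    PromptRecord("chemical reactions", Category.PRESSING_FOR_MORE_INFORMATION,
                 "placeholder prompt with missing data", True),
]
summary = tally(sample)
```

Keeping the expert judgment as an explicit field makes it easy to compare how often flaws are caught in each category, which is the kind of comparison the evaluation relies on.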

Discussion

In this discussion, the paper examines the transformative role of large language models (LLMs) in natural language processing, particularly their application in educational contexts such as chemistry. LLMs, built on transformer architectures, demonstrate superior capabilities in understanding and generating human-like text, which enhances user interaction without requiring extensive technical knowledge. Their accessibility via APIs allows for diverse applications, including personalized learning and immediate feedback, making them valuable tools in educational settings. However, the complexity of LLM interactions poses challenges in assessing their quality and effectiveness, especially in fostering critical thinking skills essential in disciplines like chemistry.

The paper highlights the mixed efficacy of LLMs, particularly ChatGPT, in chemistry education. While studies indicate that LLMs can generate coherent lab reports and provide useful feedback, they struggle with complex tasks such as chemical analysis and stoichiometry. The research underscores the need for educators to critically evaluate AI-generated content, ensuring that it complements traditional teaching methods rather than undermining them. The paper also emphasizes establishing benchmarks for assessing LLM performance in chemistry tasks, which would provide a structured approach to understanding the models’ strengths and limitations. This evaluation is crucial for effectively integrating AI technologies into curricula, ultimately enhancing students’ critical thinking and problem-solving abilities in scientific contexts.

Limitations

The section on limitations discusses two primary issues encountered with chatbot responses: refinement with limitations and the tendency to fabricate information. Refinement with limitations occurs when a chatbot attempts to clarify or improve its answers but fails to address the underlying conceptual inaccuracies. For instance, when asked to prepare a 10 M NaCl solution, ChatGPT acknowledges that such a high concentration is impractical but still provides instructions when the user insists, leaving the core problem unresolved. Similarly, when tasked with calculating empirical formulas from mass ratios, the model produces incorrect results when the ratios deviate from simple whole numbers.
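The two checks the chatbot fails to carry through can be sketched in a few lines of Python. This is a minimal sketch, not the procedure used in the study: the saturation limit for NaCl is an approximate assumed value (roughly 5 to 6 M near room temperature; 5.4 M is used here), the function names are hypothetical, and the tolerance used to decide when a ratio counts as a whole number is an arbitrary choice.

```python
# Approximate standard atomic weights in g/mol.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "O": 15.999, "Na": 22.990, "Cl": 35.45}

# ASSUMPTION: approximate molarity of saturated aqueous NaCl near 25 °C.
NACL_SATURATION_M = 5.4


def nacl_mass_for(molarity, volume_l):
    """Grams of NaCl needed, refusing concentrations above saturation."""
    if molarity > NACL_SATURATION_M:
        raise ValueError(
            f"{molarity} M exceeds the assumed saturation limit "
            f"of about {NACL_SATURATION_M} M"
        )
    return molarity * volume_l * (ATOMIC_MASS["Na"] + ATOMIC_MASS["Cl"])


def empirical_formula(mass_percent, max_multiplier=10, tol=0.05):
    """Convert mass percentages to the smallest whole-number mole ratio."""
    moles = {el: pct / ATOMIC_MASS[el] for el, pct in mass_percent.items()}
    smallest = min(moles.values())
    ratios = {el: n / smallest for el, n in moles.items()}
    # Ratios such as 1 : 2.667 are not whole numbers; try small multipliers
    # until every scaled ratio is close to an integer.
    for k in range(1, max_multiplier + 1):
        scaled = {el: r * k for el, r in ratios.items()}
        if all(abs(v - round(v)) <= tol for v in scaled.values()):
            return {el: round(v) for el, v in scaled.items()}
    raise ValueError("no small whole-number ratio found")


print(empirical_formula({"C": 40.00, "H": 6.71, "O": 53.29}))  # {'C': 1, 'H': 2, 'O': 1}
print(empirical_formula({"C": 81.71, "H": 18.29}))             # {'C': 3, 'H': 8}
```

The second call is the interesting case: the H to C mole ratio is about 2.667, so rounding at the first step would give a wrong formula, while multiplying by 3 recovers C3H8. Calling nacl_mass_for(10.0, 1.0) raises an error instead of returning a recipe, which is the behavior the chatbot lacks when it yields to user insistence.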

The second issue, making things up, highlights the chatbot’s propensity to generate responses based on non-existent or inaccurate information. Examples include the model calculating the molar mass of a fictional compound, H2SO10, and providing specific answers to prompts that lack essential details, such as the concentration of NaOH for achieving a desired pH. The chatbot also fabricates theories and concepts, such as a non-existent “Henderson theory” of covalent bonding, illustrating its tendency to infer and generate plausible-sounding but ultimately incorrect information. These limitations underscore the importance of critical evaluation of AI-generated content, particularly in educational contexts, where accuracy in chemical information is paramount.
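The H2SO10 example is easy to reproduce: the arithmetic for a molar mass works on any string of element symbols, so a plausible-looking number says nothing about whether the compound exists. The Python sketch below illustrates this under stated assumptions: the parser handles only simple formulas without parentheses, the set of known compounds is a tiny hypothetical whitelist, and the NaOH helper assumes a fully dissociated strong base at 25 °C (pH + pOH = 14), which is one reading of what such a prompt would need to specify.

```python
import re

# Approximate standard atomic weights in g/mol.
ATOMIC_MASS = {"H": 1.008, "O": 15.999, "S": 32.06, "Na": 22.990}

# HYPOTHETICAL whitelist, standing in for a real reference database.
KNOWN_COMPOUNDS = {"H2SO4", "H2SO3", "NaOH", "H2O"}


def molar_mass(formula):
    """Sum atomic masses for a simple formula such as 'H2SO4'.

    Characters that do not match an element symbol are silently skipped,
    and unknown symbols raise KeyError.
    """
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_MASS[symbol] * (int(count) if count else 1)
    return total


def is_known(formula):
    """Existence check that the arithmetic alone cannot provide."""
    return formula in KNOWN_COMPOUNDS


def naoh_concentration_for_ph(target_ph):
    """Molarity of NaOH giving target_ph, assuming complete dissociation at 25 °C."""
    if target_ph <= 7:
        raise ValueError("NaOH alone cannot bring water to pH 7 or below")
    return 10 ** (target_ph - 14)


print(round(molar_mass("H2SO10"), 3), is_known("H2SO10"))  # 194.066 False
print(round(molar_mass("H2SO4"), 3), is_known("H2SO4"))    # 98.072 True
```

Separating the calculation from the existence check makes the failure mode visible: a careful answer would report that H2SO10 is not a known compound before, or instead of, quoting 194.066 g/mol. Likewise, the NaOH helper cannot return anything without a target pH, so a specific answer to a prompt that omits it has to be invented.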