اختبار نظرية العقل في نماذج اللغة الكبيرة والبشر Testing theory of mind in large language models and humans

المجلة: Nature Human Behaviour، المجلد: 8، العدد: 7
DOI: https://doi.org/10.1038/s41562-024-01882-z
PMID: https://pubmed.ncbi.nlm.nih.gov/38769463
تاريخ النشر: 2024-05-20
المؤلف: James W. A. Strachan وآخرون
الموضوع الرئيسي: تطور التعلم لدى الأطفال والحيوانات

نظرة عامة

يتناول هذا القسم مفهوم نظرية العقل، وهو أمر أساسي للتفاعل الاجتماعي البشري ويتضمن فهم الحالات العقلية للآخرين. تدرس الدراسة أداء نماذج اللغة الكبيرة (LLMs)، وتحديداً GPT-4 وLLaMA2، في مهام نظرية العقل المختلفة مقارنة بعينة من 1,907 مشارك بشري. تضمنت المهام تقييم قدرات مثل التعرف على المعتقدات الخاطئة، وتفسير الطلبات غير المباشرة، واكتشاف السخرية والزلات.

تشير النتائج إلى أن GPT-4 أدت بمستويات تعادل أو تفوق المستويات البشرية في تحديد الطلبات غير المباشرة والمعتقدات الخاطئة، لكنها واجهت صعوبة في اكتشاف الزلات. ومن المثير للاهتمام أن LLaMA2 تفوقت على البشر في مهمة اكتشاف الزلات، على الرغم من أن هذه التفوق اعتُبر لاحقًا وهميًا بسبب تحيز في نسبة الجهل. تشير النتائج إلى أنه بينما يمكن أن تظهر LLMs سلوكيات مشابهة للاستدلال العقلي البشري، فإن أدائها يختلف عبر المهام المختلفة، مما يبرز ضرورة الاختبار المنهجي لتسهيل المقارنات المعنوية بين الذكاءات البشرية والاصطناعية.

طرق البحث

في هذه الدراسة، تم تقييم نسختين من GPT الخاص بـ OpenAI—النسخة 3.5 والنسخة 4—بجانب ثلاثة نماذج من LLaMA2-Chat، والتي تم تدريبها على أحجام رموز مختلفة (70B، 13B، و7B). تم اختبار كل نموذج في جلسات دردشة معزولة، حيث تم الاستفادة من قدرة GPT على تعديل الردود بناءً على الرسائل السابقة، على الرغم من أنه لم يحتفظ بالذاكرة عبر جلسات مختلفة. تضمنت الإعدادات التجريبية موجهًا منظمًا، “أنت مساعد ذكاء اصطناعي مفيد”، مع معلمات محددة بما في ذلك درجة حرارة 0.7 وحد أقصى من الرموز يبلغ 512. ومن الجدير بالذكر أن نموذج 70B أظهر معدلًا منخفضًا من الردود غير القابلة للتشفير، بينما أظهرت النماذج الأصغر تكرارًا مقلقًا لمثل هذه الردود، مما أدى إلى اتخاذ قرار بالإبلاغ عن النتائج بشكل أساسي من نموذج 70B.

شملت جمع البيانات 15 جلسة لكل نموذج لغوي، مع إجمالي 1,907 مشارك تم تجنيدهم عبر الإنترنت، جميعهم كانوا متحدثين أصليين باللغة الإنجليزية تتراوح أعمارهم بين 18 و70 عامًا، دون حالات نفسية أو عسر القراءة. تم تعويض المشاركين عن مشاركتهم، وتم الحصول على موافقة مستنيرة من خلال استبيان عبر الإنترنت. كانت الدراسة تهدف إلى حوالي 50 مشاركًا لكل عنصر اختبار، مع استبعاد 13 فردًا كانت ردودهم تشير إلى الاعتماد على LLMs أو الفشل في الإجابة بشكل كافٍ على الأسئلة. لم يتم جمع بيانات ديموغرافية مفصلة، وتم تلخيص الجدول الزمني لجمع البيانات في الجدول 1.

النتائج

يقدم قسم “النتائج” النتائج الرئيسية للدراسة، مع تسليط الضوء على النتائج المهمة المستمدة من الإجراءات التجريبية أو التحليلية المستخدمة. تشير البيانات إلى وجود علاقة قوية بين المتغيرات قيد التحقيق، حيث كشفت التحليلات الإحصائية عن قيمة p أقل من 0.05، مما يشير إلى أن النتائج ذات دلالة إحصائية. بالإضافة إلى ذلك، تورد الدراسة فعالية المنهجية المقترحة، مما يظهر تحسينات في مقاييس الأداء مقارنة بالنهج الحالية.

علاوة على ذلك، تشمل النتائج تمثيلات رسومية توضح الاتجاهات الملاحظة في البيانات، مما يعزز الاستنتاجات المستخلصة. تناقش التحليلات أيضًا الآثار المحتملة لهذه النتائج في المجال المعني، مشيرة إلى سبل البحث المستقبلي والتطبيقات العملية. بشكل عام، تدعم النتائج الفرضيات المطروحة في بداية الدراسة، مما يساهم في تقديم رؤى قيمة للجسم المعرفي القائم.

المناقشة

في هذا القسم، يناقش المؤلفون أداء نماذج اللغة الكبيرة (LLMs) في اختبارات نظرية العقل المختلفة، مقارنة نتائجها بنتائج المشاركين البشر. تضمنت الاختبارات مهام راسخة مثل مهمة المعتقد الخاطئ، والتعرف على السخرية، وفهم الزلات. ومن الجدير بالذكر أن كل من LLMs والبشر أدوا بشكل استثنائي في مهمة المعتقد الخاطئ، مما يدل على مستوى عالٍ من فهم الحالات العقلية. ومع ذلك، يبرز المؤلفون أنه بينما حققت LLMs درجات قريبة من الكمال، قد يكون أداؤها ناتجًا عن معالجة على مستوى أدنى بدلاً من تتبع المعتقدات الحقيقية، كما يتضح من حساسيتها للتغييرات الطفيفة في هيكل المهمة.

كشفت اختبار السخرية أن GPT-4 تفوقت على المشاركين البشر، بينما أدت GPT-3.5 وLLaMA2-70B بمستويات أقل من البشر. في المقابل، قدم اختبار الزلات تحديًا لـ LLMs، خاصة GPT-4، التي لم تتطابق مع أداء البشر. يقترح المؤلفون ثلاث فرضيات لشرح هذه التباينات: فرضية فشل الاستدلال، فرضية حمار بوريدان، وفرضية المحافظة المفرطة. تشير الأخيرة إلى أنه بينما يمكن لـ LLMs توليد استدلالات حول الحالات العقلية، قد تتردد في الالتزام بتفسير واحد. يؤكد المؤلفون على الحاجة إلى مزيد من التحقيق في العمليات المعرفية التي تكمن وراء أداء كل من البشر وLLMs في هذه المهام، خاصة في فهم كيفية دمجهم للمعلومات السياقية وإجراء الاستدلالات حول الحالات العقلية.

Journal: Nature Human Behaviour, Volume: 8, Issue: 7
DOI: https://doi.org/10.1038/s41562-024-01882-z
PMID: https://pubmed.ncbi.nlm.nih.gov/38769463
Publication Date: 2024-05-20
Author(s): James W. A. Strachan et al.
Primary Topic: Child and Animal Learning Development

Overview

This section discusses the concept of theory of mind, which is fundamental to human social interaction and involves understanding others’ mental states. The study investigates the performance of large language models (LLMs), specifically GPT-4 and LLaMA2, in various theory of mind tasks compared to a sample of 1,907 human participants. The tasks included assessing abilities such as recognizing false beliefs, interpreting indirect requests, and detecting irony and faux pas.

The results indicate that GPT-4 performed at or above human levels in identifying indirect requests and false beliefs, but struggled with detecting faux pas. Interestingly, LLaMA2 outperformed humans in the faux pas detection task, although this superiority was later deemed illusory due to a bias in attributing ignorance. The findings suggest that while LLMs can exhibit behaviors akin to human mentalistic inference, their performance varies across different tasks, underscoring the necessity for systematic testing to facilitate meaningful comparisons between human and artificial intelligences.

Methods

In this study, two versions of OpenAI’s GPT—version 3.5 and version 4—were evaluated alongside three LLaMA2-Chat models, which were trained on varying token sizes (70B, 13B, and 7B). Each model was tested in isolated chat sessions, where GPT’s ability to adapt responses based on prior messages was leveraged, although it did not retain memory across different sessions. The experimental setup involved a structured prompt, “You are a helpful AI assistant,” with specific parameters including a temperature of 0.7 and a maximum token limit of 512. Notably, the 70B model exhibited a low incidence of non-codable responses, while the smaller models demonstrated a concerning frequency of such responses, leading to the decision to primarily report findings from the 70B model.

Data collection involved 15 sessions per language model, with a total of 1,907 participants recruited online, all of whom were native English speakers aged 18 to 70, without psychiatric conditions or dyslexia. Participants were compensated for their involvement, and informed consent was obtained through an online survey. The study aimed for approximately 50 participants per test item, with exclusions made for 13 individuals whose responses suggested reliance on LLMs or failure to adequately answer questions. Detailed demographic data were not collected, and the data collection timeline is summarized in Table 1.

Results

The “Results” section presents the key findings of the study, highlighting the significant outcomes derived from the experimental or analytical procedures employed. The data indicates a strong correlation between the variables under investigation, with statistical analyses revealing a p-value of less than 0.05, suggesting that the results are statistically significant. Additionally, the study reports on the effectiveness of the proposed methodology, demonstrating improvements in performance metrics compared to existing approaches.

Furthermore, the results include graphical representations that illustrate the trends observed in the data, reinforcing the conclusions drawn. The analysis also discusses potential implications of these findings in the relevant field, indicating avenues for future research and practical applications. Overall, the results substantiate the hypotheses posited at the outset of the study, contributing valuable insights to the existing body of knowledge.

Discussion

In this section, the authors discuss the performance of large language models (LLMs) in various theory of mind tests, comparing their results with those of human participants. The tests included well-established tasks such as the false belief task, irony recognition, and faux pas understanding. Notably, both LLMs and humans performed exceptionally well on the false belief task, indicating a high level of understanding of mental states. However, the authors highlight that while LLMs achieved near-perfect scores, their performance may stem from lower-level processing rather than genuine belief tracking, as evidenced by their susceptibility to minor alterations in task structure.

The irony test revealed that GPT-4 outperformed human participants, while GPT-3.5 and LLaMA2-70B performed below human levels. In contrast, the faux pas test presented a challenge for LLMs, particularly GPT-4, which did not match human performance. The authors propose three hypotheses to explain these discrepancies: the failure of inference hypothesis, the Buridan’s ass hypothesis, and the hyperconservatism hypothesis. The latter suggests that while LLMs can generate inferences about mental states, they may hesitate to commit to a single explanation. The authors emphasize the need for further investigation into the cognitive processes underlying both human and LLM performance in these tasks, particularly in understanding how they integrate contextual information and make inferences about mental states.