Large language models are proficient in solving and creating emotional intelligence tests

Journal: Communications Psychology, Volume: 3, Issue: 1
DOI: https://doi.org/10.1038/s44271-025-00258-x
PMID: https://pubmed.ncbi.nlm.nih.gov/40399566
Publication Date: 2025-05-21
Author(s): Katja Schlegel et al.
Primary Topic: Emotional Intelligence and Performance

Overview

This research investigates the emotional intelligence (EI) capabilities of large language models (LLMs) by examining their performance on established EI tests. The LLMs tested, including ChatGPT-4, significantly outperformed human participants, achieving an average accuracy of 81% compared with the human average of 56%. In addition, ChatGPT-4 generated new test items that proved equivalent in difficulty to the original tests when administered to human participants (N = 467). The original and generated tests differed in item clarity, realism, and content diversity, but the differences were small, with no effect size exceeding Cohen's d = ±0.25.
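The effect-size bound quoted above can be made concrete with a small calculation. The following is a minimal sketch of Cohen's d using the pooled sample standard deviation, which is one common definition; the study may have derived d differently. The function name `cohens_d` and the rating values are hypothetical and are not data from the study.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical clarity ratings on a 1-5 scale (illustrative only).
original_ratings = [4, 5, 4, 3, 5, 4, 4, 5]
generated_ratings = [4, 4, 5, 3, 4, 4, 5, 4]

d = cohens_d(original_ratings, generated_ratings)
print(f"d = {d:.2f}; within +/-0.25: {abs(d) <= 0.25}")
```

A value of d inside ±0.25 means the two group means differ by no more than a quarter of a pooled standard deviation.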

The study acknowledges limitations concerning cultural applicability and the complexity of real-world emotional interactions. Even so, it highlights the potential of LLMs, particularly ChatGPT-4, both to solve and to generate EI assessments. The findings suggest that LLMs could be valuable tools for improving socio-emotional outcomes and for helping users make informed decisions in emotional situations. This positions LLMs as promising candidates for integration into human-computer interaction and supports the idea that they may play a role in the development of artificial general intelligence (AGI) systems.

Methods

The study used a quantitative design with two parts. In the first, six LLMs were prompted multiple times to solve five EI tests (105 items in total), and their accuracy was compared with that of the human respondents in the tests' original validation studies. In the second, ChatGPT-4 generated new test items, and the original and generated tests were administered to human participants (N = 467) to compare their difficulty, clarity, realism, and content diversity. A separate similarity rating study checked whether the generated items were merely paraphrases of the originals.

The reported analyses include regression, ANOVA with post-hoc comparisons, and effect sizes expressed as Cohen's d, with significance set at p < 0.05. Agreement among the LLMs was quantified with an intraclass correlation (ICC), and the correspondence between human and LLM performance at the item level with a correlation coefficient (r).

Results

All six LLMs tested outperformed the human validation samples in solving EI test items. Mean LLM accuracy was 81%, significantly higher than the 56% mean accuracy observed in the human samples. ChatGPT-o1 and DeepSeek V3 in particular scored more than two standard deviations above the human mean. Each of the five EI tests showed the same pattern, with substantial effect sizes. The six LLMs also agreed closely with one another, as reflected in an intraclass correlation (ICC) of 0.88 across the 105 test items.
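The text does not say which form of the ICC was computed. As an illustration only, the sketch below computes two common single-rater forms from the Shrout and Fleiss (1979) mean-square definitions: ICC(2,1), which measures absolute agreement, and ICC(3,1), which measures consistency. Treating test items as rows and models as columns is an assumption, and the function name `icc_two_way` and the score table are hypothetical.

```python
def icc_two_way(scores):
    """Return ICC(2,1) and ICC(3,1) for a rows-by-columns table of scores.

    Rows are the rated targets (assumed here: test items) and columns are
    the raters (assumed here: models).
    """
    n = len(scores)      # number of items
    k = len(scores[0])   # number of models
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    icc_2_1 = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
    icc_3_1 = (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)
    return {"ICC(2,1)": icc_2_1, "ICC(3,1)": icc_3_1}


# Hypothetical proportion-correct scores: 5 items (rows) x 3 models (columns).
example_scores = [
    [1.0, 0.9, 1.0],
    [0.4, 0.5, 0.3],
    [0.8, 0.7, 0.9],
    [0.2, 0.3, 0.2],
    [0.9, 1.0, 0.8],
]
for name, value in icc_two_way(example_scores).items():
    print(f"{name} = {value:.2f}")
```

Values near 1 indicate that the models rank and score the items almost identically; which ICC form is appropriate depends on whether systematic differences between models should count as disagreement.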

Further analysis revealed a correlation of r = 0.46 between the proportion of human test takers and the proportion of LLM runs answering each item correctly. In other words, items that more humans answered correctly also tended to be answered correctly more often by the LLMs. Detailed item-level comparisons are available in the supplementary materials.
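An item-level correlation of this kind can be sketched as follows, assuming a Pearson correlation computed over one proportion-correct value per item for humans and one for the LLMs. The function name `pearson_r` and the five data points are hypothetical; the study's own item-level values are in its supplementary materials.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of cross-products of deviations, and each sequence's deviation norm.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cross / (norm_x * norm_y)


# Hypothetical proportions correct for five items (illustrative only).
human_proportion_correct = [0.35, 0.50, 0.62, 0.71, 0.80]
llm_proportion_correct = [0.70, 0.60, 0.90, 0.85, 1.00]

r = pearson_r(human_proportion_correct, llm_proportion_correct)
print(f"r = {r:.2f}")
```

A positive r means that item difficulty is partly shared: the items humans find hard are, on average, also the ones the models miss more often.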

Discussion

The EI tests used in the study include the Situational Test of Emotion Management (STEM), the Situational Test of Emotion Understanding (STEU), the Geneva EMOtion Knowledge Test-Blends (GEMOK-B), and the Geneva Emotional Competence Test in the workplace (GECo). Each test consists of vignettes that present emotional scenarios and require participants to select the appropriate response, drawing on emotion regulation strategies or emotional understanding. The tests were validated with undergraduate samples in Australia and with other populations. Responses are scored as correct or incorrect and summed into total scores that reflect participants' emotional intelligence.

The study evaluates the LLMs, including ChatGPT-4, on two tasks: solving EI test items and generating new ones. The LLMs were prompted to solve the tests multiple times, and their performance was compared with that of the human respondents in the original validation studies. A similarity rating study then assessed how closely the LLM-generated items resembled the originals and found that most of the newly created scenarios were not perceived as mere paraphrases. This suggests that ChatGPT-4 can generate novel scenarios rather than simply replicating existing items, which points to its potential utility in emotional intelligence assessment.