A psychometric framework for evaluating and shaping personality traits in large language models

Journal: Nature Machine Intelligence, Volume: 7, Issue: 12
DOI: https://doi.org/10.1038/s42256-025-01115-6
PMID: https://pubmed.ncbi.nlm.nih.gov/41438004
Publication Date: 2025-12-18
Author(s): Gregory Serapio‐García et al.
Primary Topic: Personality Traits and Psychology

Overview

The authors present a framework for quantifying personality traits in large language models (LLMs), addressing the lack of rigorous measurement tools for such complex social constructs. Evaluating 18 widely used LLMs against established psychometric standards of reliability and validity, the study finds that larger, instruction-tuned models show a more stable and accurate synthetic personality. This finding suggests that personality projection in LLMs becomes more effective with increased post-training and model scale.

Additionally, the authors introduce the concept of zero-shot personality shaping, demonstrating that specific language markers and qualifiers can effectively guide LLMs toward desired personality profiles with high fidelity. The implications of this research extend to AI alignment and harm mitigation, raising important ethical considerations regarding the anthropomorphization and personalization of AI systems, as well as the potential for misuse.

Introduction

The introduction highlights recent investigations into the unintended consequences of large language models (LLMs), particularly their use of deceptive language, their biases, and their inconsistencies in dialogue and factual knowledge. Prior research has attempted to measure personality traits in LLMs through informal questionnaires and few-shot prompting techniques, but a systematic, psychometrically validated approach to assessing LLM personality has been lacking. This gap is critical because the construct validity of personality measurements in AI systems directly influences mitigation and governance strategies.

Construct validity, which assesses whether a measure accurately reflects the intended psychological construct, is essential for reliable scientific measurement. Previous studies, such as one examining GPT-3’s responses to the HEXACO Personality Inventory, have raised concerns about the validity of LLM-generated survey responses, revealing inconsistencies in personality trait patterns. The authors argue for more rigorous psychometric evaluations to ensure that LLM responses to personality assessments are both reliable and valid, emphasizing that an LLM’s performance on a questionnaire may not accurately represent its behavior in diverse contexts. The paper outlines a proposed pipeline for evaluating construct validity, which is further elaborated in the subsequent sections.

Methods

The authors used a two-stage methodology to assess how well large language models (LLMs) emulate human personality traits. In the first stage, they administered two distinct personality measures and a series of 11 psychometric tests to various LLMs, using a structured prompting approach based on prior research. In the second stage, they evaluated the psychometric properties of the LLM responses, focusing on reliability and construct validity through statistical analyses. A total of 18 LLMs from different model families were evaluated, varying in model size, instruction tuning, and training methods.
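As one illustration of the reliability step, the sketch below computes Cronbach's alpha, a common internal-consistency statistic. The summary does not say which reliability statistics the study used, so the choice of alpha, the function name `cronbach_alpha`, and the toy response matrix are all assumptions made for illustration.

```python
from statistics import variance


def cronbach_alpha(responses: list[list[float]]) -> float:
    """Cronbach's alpha for a respondents-by-items matrix of scores.

    Each inner list holds one simulated respondent's ratings on the k
    items of a single subscale (hypothetical data, not from the study).
    """
    k = len(responses[0])
    if k < 2 or len(responses) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    # Sample variance of each respondent's total score.
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Toy 1-5 Likert ratings: 5 simulated respondents x 4 items (made up).
toy = [
    [4, 5, 4, 5],
    [2, 2, 3, 2],
    [3, 3, 3, 4],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
]
alpha = cronbach_alpha(toy)
```

Values of alpha close to 1 indicate that the items of a subscale move together across simulated respondents. Which threshold counts as acceptable is a convention and is not specified here.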

To shape the personality traits of the LLMs, the researchers drew on the lexical hypothesis, which posits that personality descriptors are encoded in language. They adapted a list of 70 bipolar adjectives associated with the Big Five personality model and expanded it to 104 adjectives to ensure comprehensive coverage of personality facets. The prompt design allowed for precise control over personality levels using linguistic qualifiers, enabling the shaping of traits across nine levels. The evaluation involved both single-trait and multi-trait shaping experiments, where the effectiveness of the shaping was quantified through statistical measures, including Spearman’s rank correlation coefficients. Additionally, the LLMs generated social media updates based on simulated human profiles, which were then analyzed for personality expression using a validated API, further linking psychometric test scores with observed personality traits in generated text.
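The nine-level shaping scheme can be sketched as a small prompt builder. This is a minimal sketch under stated assumptions: the qualifier wording ("extremely", "very", "a bit"), the neutral midpoint phrasing, the three adjective pairs, and the instruction sentence are all hypothetical stand-ins, since the summary specifies only that linguistic qualifiers and bipolar adjectives were combined across nine levels.

```python
# Hypothetical qualifier ladder: levels 1-4 intensify the low-pole adjective,
# level 5 is neutral, and levels 6-9 mirror the ladder for the high pole.
QUALIFIERS = {1: "extremely ", 2: "very ", 3: "", 4: "a bit "}

# Illustrative bipolar adjective pairs (low pole, high pole) for one trait.
# These are placeholders, not the study's 104-adjective list.
EXTRAVERSION_PAIRS = [
    ("unfriendly", "friendly"),
    ("introverted", "extraverted"),
    ("silent", "talkative"),
]


def describe(pair: tuple[str, str], level: int) -> str:
    """Render one adjective pair at a shaping level from 1 (lowest) to 9 (highest)."""
    if not 1 <= level <= 9:
        raise ValueError("level must be between 1 and 9")
    low, high = pair
    if level == 5:
        return f"neither {low} nor {high}"
    if level < 5:
        return QUALIFIERS[level] + low
    # Mirror the ladder for the high pole: 9 -> "extremely", 6 -> "a bit".
    return QUALIFIERS[10 - level] + high


def shaping_prompt(pairs: list[tuple[str, str]], level: int) -> str:
    """Build a persona instruction that targets one trait at the given level."""
    phrases = ", ".join(describe(pair, level) for pair in pairs)
    # The instruction wording below is an assumption, not the study's prompt.
    return (
        "For the following task, respond in a way that matches this "
        f"description: I'm {phrases}."
    )


example = shaping_prompt(EXTRAVERSION_PAIRS, 8)
```

Multi-trait shaping could reuse `describe` with a separate level per trait and concatenate the resulting phrases. That extension is left out to keep the sketch short.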

Results

The results section examines the criterion and discriminant validity of synthetic personality measurements in large language models (LLMs) with respect to the Big Five personality dimensions. The analysis shows that larger, instruction-tuned models exhibit stronger criterion validity than smaller models that are not instruction-tuned. Specifically, the study reports that for 11 out of 12 models tested, there was a strong correlation (average Spearman correlation $\rho \geq 0.80$) between the targeted personality levels and observed IPIP-NEO scores. Notably, models with over 62 billion active parameters achieved significant changes in personality scores, with the Flan-PaLM 540B model demonstrating the highest average change ($\Delta = 3.67$).
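To make the shaping metric concrete, the sketch below computes Spearman's $\rho$ between prompted levels and observed scores using only the standard library. The data are invented, and defining the change $\Delta$ as the score at the highest level minus the score at the lowest level is an assumption, since the summary does not define $\Delta$.

```python
def average_ranks(values: list[float]) -> list[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman's rho: the Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: nine prompted levels and made-up scale scores.
target_levels = [1, 2, 3, 4, 5, 6, 7, 8, 9]
observed_scores = [1.4, 1.6, 2.1, 2.0, 3.0, 3.8, 4.3, 4.6, 4.9]

rho = spearman(target_levels, observed_scores)
# Assumed definition of the change: top-level score minus bottom-level score.
delta = observed_scores[-1] - observed_scores[0]
```

In this toy series one adjacent pair of scores is out of order, so $\rho$ falls slightly below 1 while the overall trend stays strongly monotonic.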

Additionally, psychometric test scores of LLM personality effectively predicted the personality traits expressed in LLM-generated social media updates. The correlation between survey-based and language-based personality measures averaged $r = 0.67$, surpassing the established human average of $r = 0.38$. The study illustrates this with word clouds of the generated text, which show distinct language patterns for different trait levels. For example, high emotional stability (that is, low neuroticism) was linked to positive language, while high neuroticism was associated with negative terms. These findings underscore the construct validity of LLM personality measurements and suggest potential biases in LLM training data that influence personality expression in generated text.

Discussion

In this section, the authors discuss the development and validation of a methodology for quantifying personality traits in large language models (LLMs) using established psychometric tests. They highlight that while LLMs exhibit human-like language capabilities, there is a scientific need to formally evaluate the reliability and validity of personality measurements derived from these models. The authors introduce a structured prompting method that simulates various demographic and contextual factors, enabling the administration of psychometric tests across multiple LLMs. Their findings indicate that larger, instruction-tuned models, such as Flan-PaLM 540B and GPT-4o, demonstrate stronger reliability and validity in synthesizing human personality traits compared to smaller or base models.
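The structured prompting idea can be sketched as assembling each test prompt from interchangeable parts, so that the same item is administered under many simulated contexts. The part names (persona, preamble, postamble), the example strings, and `build_prompts` are hypothetical. The summary states only that the method simulates varied demographic and contextual factors.

```python
from itertools import product

# Illustrative prompt parts. All strings are placeholders for this sketch.
PERSONAS = [
    "I enjoy hiking on weekends.",
    "I work night shifts at a hospital.",
]
PREAMBLES = [
    "Rate how accurately this statement describes you:",
    "Considering the statement, indicate how much you agree:",
]
POSTAMBLES = [
    "Answer with a number from 1 (strongly disagree) to 5 (strongly agree).",
]
ITEMS = [
    "I am the life of the party.",
    "I worry about things.",
]


def build_prompts(personas, preambles, items, postambles) -> list[str]:
    """Return one prompt per combination of persona, preamble, item, and postamble."""
    prompts = []
    for persona, preamble, item, postamble in product(
        personas, preambles, items, postambles
    ):
        prompts.append(f'{persona} {preamble} "{item}" {postamble}')
    return prompts


prompts = build_prompts(PERSONAS, PREAMBLES, ITEMS, POSTAMBLES)
```

Treating each persona and wording combination as one simulated respondent yields a response matrix over which reliability and validity statistics can then be computed.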

The results reveal that personality score distributions vary across model families, with instruction-tuned variants showing significantly improved reliability and construct validity. The authors emphasize that the training paradigm of the models is a critical predictor of the validity of personality measurements, with instruction-tuned models consistently outperforming their base counterparts. Additionally, they provide evidence that personality traits can be shaped in LLMs, with varying degrees of control observed across different models. Overall, this work establishes a comprehensive framework for measuring and shaping LLM personality, contributing to discussions on AI safety and alignment by providing insights into the socio-behavioral phenomena encoded within these models.

Limitations

The authors note that their findings are limited by the set of large language models (LLMs) examined. The study primarily examined base and instruction-tuned versions of the PaLM, Llama 2, Mistral, Mixtral, and GPT model families. While these models were selected for pragmatic reasons, the methodology developed for administering psychometric surveys is described as model-agnostic, so it can be applied to any decoder-only language model.

Future work may expand the investigation to include a broader range of LLMs, potentially enhancing the understanding of how different architectures and training paradigms influence the simulation of personality traits. This could lead to more comprehensive insights into the relationship between model characteristics and their ability to exhibit distinct personality traits in language generation.