Large language models without grounding recover non-sensorimotor but not sensorimotor features of human concepts

Journal: Nature Human Behaviour, Volume: 9, Issue: 9
DOI: https://doi.org/10.1038/s41562-025-02203-8
PMID: https://pubmed.ncbi.nlm.nih.gov/40468013
Publication Date: 2025-06-04
Author(s): Qihui Xu et al.
Primary Topic: Language, Metaphor, and Cognition

Overview

This section examines how closely model-generated word ratings align with human-generated ratings across a range of dimensions. Spearman rank correlation was used to measure the similarity between the two sets of ratings, and it shows that models such as ChatGPT and the Google LLMs correlate strongly with human ratings (Rs > 0.50) on non-sensorimotor dimensions. The correlations are significantly weaker on sensory and motor dimensions, as shown by Mann-Whitney U tests (e.g., GPT-4: $U(N_1 = 7, N_2 = 11) = 65.00$, $P = 0.018$, rank-biserial correlation $r_{rb} = 0.69$).
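The statistics named above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' analysis code: the function names and the example rating lists are hypothetical, and the rank-biserial formula $r_{rb} = 2U/(N_1 N_2) - 1$ is one common convention that happens to reproduce the reported GPT-4 value from $U = 65$, $N_1 = 7$, $N_2 = 11$.

```python
from itertools import product


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def mann_whitney_u(group1, group2):
    """U statistic for group1: wins over group2 count 1, ties count 0.5."""
    return sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a, b in product(group1, group2)
    )


def rank_biserial(u, n1, n2):
    """Effect size for a Mann-Whitney U test (assumed convention)."""
    return 2 * u / (n1 * n2) - 1


# Hypothetical ratings of five words on one dimension (not data from the study).
human = [2.1, 7.4, 5.0, 3.3, 8.2]
model = [3.0, 6.5, 6.0, 2.0, 9.0]
rho = spearman(model, human)

# Consistency check against the values reported for GPT-4.
print(round(rank_biserial(65.0, 7, 11), 2))  # 0.69
```

In a comparison like the one summarized here, each dimension would yield one model-human correlation, and the Mann-Whitney U test would then compare the set of non-sensorimotor correlations with the set of sensory and motor correlations.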

The findings suggest that while large language models (LLMs) align well with human ratings on some cognitive dimensions, they struggle with sensory and motor experience, which raises questions about whether embodiment is necessary for human knowledge representation. The section also covers arousal as a rated dimension: participants rated words on a 9-point scale from “VERY UNAROUSING” (1) to “VERY AROUSING” (9), which highlights the subjective nature of word perception and its implications for understanding human cognition in relation to LLMs.

Methods

The “Methods” section of the paper outlines the experimental design and analytical techniques used to investigate the research question. The study took a quantitative approach, applying statistical analyses to the data collected across its experiments. The methodologies included controlled laboratory experiments in which variables were systematically manipulated to observe their effects on the outcomes of interest.

Data were collected with standardized instruments to ensure reliability and validity, and were then analyzed with statistical software. The section also describes the sampling techniques used to select participants, aiming for a representative sample that supports the generalizability of the findings. Overall, the methods are robust and designed to yield reliable insights into the phenomena under investigation.

Results

This section reports a comparison of conceptual word ratings generated by several large language models (specifically OpenAI’s GPT-3.5 and GPT-4, and Google’s PaLM and Gemini) with human ratings from established norms. The LLMs were given standardized prompts to keep the procedure consistent with the human data collection, and model-human similarity was then analyzed through dimension-wise correlations and representational similarity analysis (RSA). The results indicate that although LLMs show human-like performance on cognitive tasks, their conceptual representations diverge notably from human ones, more so in the sensory and motor domains than in the non-sensorimotor domain.
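Representational similarity analysis can be illustrated with a small sketch. The version below is an assumption-laden outline rather than the procedure used in the paper: it treats each word as a vector of ratings across dimensions, uses Euclidean distance as the dissimilarity measure (other measures are possible), and scores alignment as the Spearman correlation between the model's and the humans' pairwise dissimilarities. All names and numbers are hypothetical.

```python
from itertools import combinations
from math import dist


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def dissimilarity_vector(word_vectors):
    """Flattened upper triangle of the word-by-word distance matrix."""
    return [dist(a, b) for a, b in combinations(word_vectors, 2)]


def rsa_score(model_vectors, human_vectors):
    """Second-order similarity: do both raters space the words alike?"""
    return spearman(
        dissimilarity_vector(model_vectors),
        dissimilarity_vector(human_vectors),
    )


# Hypothetical ratings: four words (rows) on three dimensions (columns).
human = [[1.0, 5.0, 2.0], [2.0, 4.5, 2.5], [6.0, 1.0, 7.0], [5.0, 2.0, 3.0]]
model = [[1.5, 4.0, 2.0], [2.5, 4.0, 3.5], [5.0, 2.0, 6.0], [5.5, 1.0, 2.0]]
print(rsa_score(model, human))
```

Because the score depends only on the rank order of the pairwise distances, it compares the geometry of the two rating spaces rather than the raw rating values, which is why it complements the dimension-wise correlations.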

The study also examines how word concreteness affects model-human alignment. Using partial Spearman correlations and bin analyses, the authors found that the ratings remained strongly similar after controlling for word concreteness, suggesting that the observed divergence is not substantially driven by this factor. Interaction effects were nonetheless observed, indicating that model-human correlations may be stronger for concrete words in the sensory domain. Overall, the findings highlight the complexities of conceptual representation in LLMs and raise questions about the necessity of multimodal experiences for effective conceptual understanding.
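A partial correlation removes the part of the model-human association that both ratings share with a third variable, here concreteness. The sketch below uses the standard first-order partial correlation formula; applying it to three pairwise Spearman coefficients gives a partial Spearman correlation. The function name and the input values are hypothetical, and the paper's exact computation may differ.

```python
def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation of x and y after controlling for z.

    r_xy: model-human correlation on a dimension
    r_xz: correlation of model ratings with concreteness
    r_yz: correlation of human ratings with concreteness
    """
    denominator = ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5
    return (r_xy - r_xz * r_yz) / denominator


# Hypothetical coefficients: concreteness explains part of the association.
print(round(partial_correlation(0.8, 0.6, 0.6), 4))  # 0.6875
```

If the partial coefficient stays close to the raw one, as the summary reports, concreteness accounts for little of the model-human similarity.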

Discussion

In this study, we explored the capacity of large language models (LLMs) to acquire conceptual knowledge from language and visual inputs, focusing on their alignment with human conceptual representations. Our findings indicate that while LLMs can effectively capture non-sensorimotor dimensions, such as valence and emotional arousal, they struggle with sensorimotor knowledge. Specifically, representational similarity analyses revealed that LLMs align significantly less with human representations in the sensory and motor domains than in the non-sensorimotor domain, suggesting that grounding in sensorimotor experience is crucial for human-like conceptual understanding.

We also investigated the impact of visual training on model-human alignment by comparing visual LLMs (e.g., GPT-4 and Gemini) with text-only models (e.g., GPT-3.5 and PaLM). Our results showed that visual inputs improve alignment with human representations, particularly on dimensions closely associated with visual processing, such as imageability and haptic perception. This underscores the importance of multimodal learning, where integrating diverse inputs can lead to richer and more coherent conceptual representations. The study also highlights the limitations of LLMs, particularly in capturing nuanced sensory and motor features, and points to the need for further research into the role of embodied experience in language and concept development.
