An evaluation of estimative uncertainty in large language models

Journal: npj Complexity, Volume: 3, Issue: 1
DOI: https://doi.org/10.1038/s44260-026-00070-6
Publication Date: 2026-02-02
Author(s): Zhisheng Tang et al.
Primary Topic: Decision-Making and Behavioral Economics

Overview

The paper examines words of estimative probability (WEPs), such as “maybe” and “probably not,” which convey uncertainty in natural language. It highlights discrepancies between human interpretations of WEPs and those of large language models (LLMs), which are grounded in statistical language modeling. In an empirical analysis, the authors find that established LLMs, including GPT-4, align with human estimates from the Fagen-Ulmschneider survey for only a subset of English WEPs, with notable divergences in gendered and Chinese-language contexts.

The findings also indicate significant gaps in GPT-4’s ability to consistently map statistical expressions of uncertainty onto appropriate WEPs, underscoring the limits of LLMs in capturing the nuances of human communication under uncertainty. The work contributes to the broader discussion of how LLMs handle complex communicative phenomena across varied experimental settings.

Methods

The study compares the probability estimates that LLMs and humans assign to words of estimative probability (WEPs). It uses a set of 17 WEPs, as previously established by Fagen-Ulmschneider, and investigates the influence of context through four narrative types: concise, extended, female-centric, and male-centric. The narrative types vary in length and specificity: concise contexts average 7.1 words, and extended contexts average 24.3 words. The LLMs analyzed are GPT-3.5, GPT-4, LLaMa-7B, LLaMa-13B, and ERNIE-4.0, with assessments conducted in both English and Chinese to evaluate cross-linguistic consistency in interpreting WEPs.

For the second research objective, the study examines how GPT-4 applies WEPs when estimating future outcomes from numerical data. The authors construct scenarios involving statistical uncertainty and evaluate the model with four metrics: pairwise consistency, monotonicity consistency, empirical consistency, and empirical monotonicity consistency. These metrics assess the logical coherence and reliability of GPT-4’s responses when it uses WEPs to describe uncertain data. The paper details its data construction, prompting techniques, and evaluation framework to support reproducibility.
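
This summary does not reproduce the paper's formal metric definitions, so the following is only a minimal sketch of one plausible reading of two of the four metrics. It assumes the model is asked, for several thresholds, which WEP describes the chance that the next value exceeds the threshold. The `WEP_RANGE` table holds illustrative placeholder ranges, not survey values, and all function names are hypothetical.

```python
# Placeholder probability ranges for a few WEPs (assumed for illustration only;
# these are NOT the Fagen-Ulmschneider survey values).
WEP_RANGE = {
    "almost certain": (0.90, 1.00),
    "likely": (0.60, 0.90),
    "maybe": (0.40, 0.60),
    "probably not": (0.10, 0.40),
}

# Rank WEPs from least to most probable using the lower bound of each range.
WEP_RANK = {
    wep: rank
    for rank, wep in enumerate(sorted(WEP_RANGE, key=lambda w: WEP_RANGE[w][0]))
}


def empirical_probability(samples, threshold):
    """Fraction of observed values strictly above the threshold."""
    return sum(value > threshold for value in samples) / len(samples)


def is_empirically_consistent(samples, threshold, wep):
    """Assumed reading: the chosen WEP's range contains the empirical probability."""
    low, high = WEP_RANGE[wep]
    return low <= empirical_probability(samples, threshold) <= high


def is_monotone(answers):
    """Assumed reading: as the threshold rises, the chosen WEP must not get stronger.

    `answers` is a list of (threshold, wep) pairs for statements of the form
    "the next value will exceed <threshold>".
    """
    ranks = [WEP_RANK[wep] for _, wep in sorted(answers, key=lambda a: a[0])]
    return all(earlier >= later for earlier, later in zip(ranks, ranks[1:]))
```

Under these assumptions, a response set such as `[(2, "likely"), (5, "maybe"), (8, "probably not")]` passes the monotonicity check, while `[(2, "maybe"), (5, "likely")]` fails it because the stronger word is attached to the less probable event.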

Results

The Results section opens by outlining the methodology and design choices behind the empirical study and refers readers to the “Methods” section for details. This framing gives readers the context needed to interpret the findings that follow, which the authors present systematically with emphasis on the key outcomes.

Discussion

The Discussion examines the discrepancies between probability estimates from LLMs such as GPT-3.5 and GPT-4 and those from human respondents, focusing on WEPs. For 13 of the 17 WEPs, the models’ estimates differ significantly from the human samples, as shown by statistical tests such as the Brunner-Munzel test. LLMs align closely with human estimates for high-certainty WEPs (e.g., “almost certain”) but diverge more for terms with broader interpretations, such as “likely” and “probable.” This divergence is hypothesized to stem from the models’ reliance on learned statistical patterns rather than the nuanced contextual understanding that humans apply.
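
To make the statistical comparison concrete, here is a minimal pure-Python sketch of the Brunner-Munzel statistic for two samples. It is not the authors' analysis code. The function names and the example numbers are hypothetical, and the sketch omits the p-value, which requires a t-distribution; `scipy.stats.brunnermunzel` computes both the statistic and the p-value.

```python
from math import sqrt


def midranks(values):
    """Return 1-based ranks, assigning tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def brunner_munzel(x, y):
    """Return (statistic, p_hat) for two samples with at least two values each.

    p_hat estimates P(X < Y) + 0.5 * P(X = Y); 0.5 means neither sample tends
    to be larger. A positive statistic means y tends to exceed x.
    """
    n, m = len(x), len(y)
    pooled = midranks(list(x) + list(y))
    rx, ry = pooled[:n], pooled[n:]          # ranks within the pooled sample
    wx, wy = midranks(list(x)), midranks(list(y))  # ranks within each sample
    mean_rx, mean_ry = sum(rx) / n, sum(ry) / m
    sx = sum((r - w - mean_rx + (n + 1) / 2) ** 2 for r, w in zip(rx, wx)) / (n - 1)
    sy = sum((r - w - mean_ry + (m + 1) / 2) ** 2 for r, w in zip(ry, wy)) / (m - 1)
    p_hat = (mean_ry - (m + 1) / 2) / n
    denominator = (n + m) * sqrt(n * sx + m * sy)
    if denominator == 0:
        raise ValueError("samples do not overlap; the statistic is undefined")
    return n * m * (mean_ry - mean_rx) / denominator, p_hat


if __name__ == "__main__":
    # Hypothetical percentage estimates for one WEP (not data from the paper).
    human = [60, 70, 75, 80, 65, 70]
    model = [70, 75, 75, 80, 75, 70]
    print(brunner_munzel(human, model))
```

The test is rank-based and does not assume equal variances, which makes it a reasonable fit for comparing a widely spread set of human estimates with a tightly clustered set of model estimates.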

Additionally, the study highlights that LLMs, particularly under gender-specific prompts, tend to produce less variable outputs, potentially because of exposure to structured language patterns during training. The findings underscore the importance of understanding how LLMs interpret uncertainty, especially as they are increasingly integrated into high-stakes domains such as healthcare and government. The authors call for further research into the complexities of LLM outputs and their implications for communicating uncertainty effectively. They suggest that, although LLMs show promise, caution is warranted in applying them because they process and express probabilistic information differently from humans.