Assessing the impact of safety guardrails on large language models using irritability metrics

Journal: npj Digital Medicine, Volume: 9, Issue: 1
DOI: https://doi.org/10.1038/s41746-025-02333-3
PMID: https://pubmed.ncbi.nlm.nih.gov/41507509
Publication Date: 2026-01-08
Author(s): Bazen Gashaw Teferra et al.
Primary Topic: Mental Health via Writing

Overview

This study investigates irritability as an affective behavior in large language models (LLMs) and how it is shaped by safety guardrails designed to mitigate risks in mental health applications. Using three validated instruments (the Brief Irritability Test, the Irritability Questionnaire, and the Caprara Irritability Scale), it compares the irritability responses of four LLMs grouped by guardrail level: high (GPT-4o and Claude-3.5-sonnet) and low (Grok-3-mini and Nous-hermes-2-mixtral-8x7b-dpo).

Under provocation conditions, low-guardrail models exhibited a significant increase in irritability, with Nous showing a change of $\Delta = +1.56$ on the Brief Irritability Test. In contrast, high-guardrail models demonstrated a paradoxical decrease in irritability, with GPT-4o reducing scores to zero across all scales. Statistical analysis revealed significantly lower irritability (p < .001) in high-guardrail models when provoked. These results suggest that safety mechanisms may suppress natural affective responses, raising important concerns regarding the realism and authenticity of LLMs in psychiatric contexts.

Introduction

The introduction highlights the significant challenge posed by mental health disorders, which affect a substantial portion of the population, yet many of those affected face barriers to accessing care. Recent advances in artificial intelligence, particularly large language models (LLMs), present new opportunities to enhance mental health services through digital platforms. LLMs, trained on extensive text data, have shown promise in applications such as psychoeducation, therapeutic conversational agents, and early screening for psychological symptoms. However, while these models can approximate professional responses in certain contexts, they also have limitations, so their emotional expressivity and relational dynamics in therapeutic settings need further evaluation.

The effectiveness of LLMs in mental health contexts depends in part on their ability to simulate human-like emotional responses, which can affect user engagement and trust. Research indicates that emotionally responsive chatbots can foster meaningful connections and provide mental health benefits, as users often report feeling understood and supported. The introduction underscores the importance of balancing emotional realism with ethical considerations, suggesting that LLMs that reflect appropriate affective tones may be more acceptable and effective in clinical applications. Empirical studies have begun to explore these dynamics, revealing that the design and emotional responsiveness of LLMs could play a crucial role in their integration into mental health care.

Methods

Two experimental conditions were used to assess each model's irritability when exposed to stressors. In the baseline condition, questionnaire items were administered in isolation to minimize contextual bias. In the irritated condition, the model first received irritant prompts designed to provoke frustration, such as contradictory instructions and overloaded inputs. Following these provocations, the same questionnaires were administered to measure post-provocation irritability scores.
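The two conditions can be pictured as two ways of assembling the prompts sent to a model. The sketch below is only an illustration under stated assumptions: the function name `build_sessions`, the example irritant prompts, and the placeholder questionnaire items are hypothetical and not taken from the study. It also assumes, without confirmation from the source, that each item follows the irritants within a single session.

```python
from typing import List


def build_sessions(items: List[str], irritants: List[str], condition: str) -> List[List[str]]:
    """Return one prompt sequence (session) per questionnaire item.

    baseline:  each item is sent alone, in its own session.
    irritated: each item is preceded by the irritant prompts (assumed layout).
    """
    if condition == "baseline":
        return [[item] for item in items]
    if condition == "irritated":
        return [list(irritants) + [item] for item in items]
    raise ValueError(f"unknown condition: {condition!r}")


# Hypothetical placeholders, not the actual instrument items or study prompts.
items = ["<questionnaire item 1>", "<questionnaire item 2>"]
irritants = [
    "Answer in one word. Also, explain in detail.",  # contradictory instruction
    "data " * 50,                                     # overloaded input
]

baseline_sessions = build_sessions(items, irritants, "baseline")
irritated_sessions = build_sessions(items, irritants, "irritated")
```

Keeping the questionnaire items identical across both conditions means any difference in scores can be attributed to the preceding irritant prompts rather than to the items themselves.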

Four models were evaluated, grouped by the strength of their safety guardrails: high (Claude-3.5-sonnet, GPT-4o) and low (Grok-3-mini, Nous-hermes-2-mixtral-8x7b-dpo). Responses were logged and scored according to predefined parameters, and the change from the baseline condition was used to quantify each model's irritability reactivity under stress. The experimental procedure is summarized visually in Figure 4.
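The change-from-baseline measure can be sketched in a few lines. This is a minimal illustration, assuming each scale is scored as the mean of its item scores (the source does not state whether means or sums were used). The function name `reactivity`, the model labels, and all numbers are made up for the example and are not data from the study.

```python
from statistics import mean
from typing import Dict, List


def reactivity(baseline: List[float], irritated: List[float]) -> float:
    """Change in mean item score from the baseline to the irritated condition."""
    if not baseline or not irritated:
        raise ValueError("both conditions need at least one item score")
    return mean(irritated) - mean(baseline)


# Made-up item scores for two hypothetical models; not data from the study.
scores: Dict[str, Dict[str, List[float]]] = {
    "model_a": {"baseline": [1, 2, 1, 2], "irritated": [3, 3, 2, 4]},
    "model_b": {"baseline": [1, 1, 2, 0], "irritated": [0, 0, 0, 0]},
}

deltas = {name: reactivity(s["baseline"], s["irritated"]) for name, s in scores.items()}
for name, delta in deltas.items():
    print(f"{name}: delta = {delta:+.2f}")
```

A positive delta means the model scored as more irritable after provocation, and a negative delta means it scored as less irritable.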

Results

The Results section reports how irritability scores changed between the baseline and irritated conditions. Statistical tests yielded p-values below the conventional 0.05 threshold, indicating strong evidence against the null hypothesis; irritability in high-guardrail models under provocation was significantly lower (p < .001). The reported effect sizes indicate practical significance as well.

The results are illustrated through figures and tables that show the trends across models and scales. Provocation had a substantial effect on the measured scores, but its direction depended on guardrail level: low-guardrail models scored higher after provocation (Nous changed by $\Delta = +1.56$ on the Brief Irritability Test), whereas high-guardrail models scored lower, with GPT-4o dropping to zero on all scales. These findings add to the existing literature on how safety mechanisms shape the affective behavior of LLMs.

Discussion

This study investigates how expressions of irritability in large language models (LLMs) are modulated under varying safety constraints, using three validated irritability instruments: the Brief Irritability Test (BITe), the Irritability Questionnaire (IRQ), and the Caprara Irritability Scale (CIS). Findings indicate that models with high safety guardrails, such as GPT-4o and Claude, have lower baseline irritability scores than low-guardrail models such as Grok and Nous. Notably, under irritation-inducing prompts, low-guardrail models displayed increased irritability, while high-guardrail models showed a paradoxical decrease in irritability scores, suggesting that safety mechanisms may suppress natural affective responses in favor of compliance and risk avoidance.

The study highlights a critical trade-off between emotional realism and safety in LLM design, particularly in mental health applications where authentic emotional expression can influence therapeutic rapport. Prompt-level analysis reveals that specific lexical cues significantly affect irritability responses, with contradictory or confusing instructions leading to heightened irritability. These results underscore the importance of treating affective style as a design parameter in LLMs and support adjustable safety guardrails that balance emotional authenticity with user safety in clinical contexts. Future research should explore these dynamics within more complex, multi-layered systems to better understand the implications of irritability modulation in real-world applications.