A Systematic Evaluation of LLM Strategies for Mental Health Text Analysis: Fine-tuning vs. Prompt Engineering vs. RAG

Journal: Proceedings of the 10th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2025)
DOI: https://doi.org/10.18653/v1/2025.clpsych-1.14
Publication Date: 2025-01-01
Author(s): Zhenyun Du et al.
Primary Topic: Mental Health via Writing

Overview

This study systematically compares three strategies for analyzing mental health text with large language models (LLMs): prompt engineering, retrieval-augmented generation (RAG), and fine-tuning. Using the LLaMA 3 model, it evaluates each approach on emotion classification and mental health condition detection across two datasets. Fine-tuning yields the highest accuracy, 91% for emotion classification and 80% for mental health condition detection, but requires significant computational resources and large training datasets. Prompt engineering and RAG offer more flexible deployment with moderate performance, ranging from 40% to 68% accuracy.

The findings underscore the trade-offs between accuracy, computational demands, and deployment flexibility in LLM-based mental health applications. While fine-tuning shows strong potential for reliable screening, its performance varies across emotional states and conditions, which suggests it is better suited to initial assessments than to definitive diagnoses. The study calls for future research into hybrid methods that combine the strengths of the different approaches, more efficient fine-tuning techniques, and better detection of nuanced psychological states. It also calls for further validation of these methodologies in clinical settings and among diverse populations.

Introduction

The introduction highlights the urgent need for scalable mental health assessment methods, given the rising prevalence of mental health conditions and limited access to professionals. Traditional diagnostic approaches, which rely primarily on clinical interviews and self-reported questionnaires, are often time-consuming and prone to bias. Recent advances in large language models (LLMs), such as GPT-4 and LLaMA 2, offer promising avenues for enhancing mental health assessment through automated text analysis. While LLMs have shown potential in various medical applications, their use in mental health assessment poses unique challenges because emotional expression is complex and clinical contexts demand accuracy.

This study aims to fill the gap in understanding how effective different LLM deployment strategies are for mental health text classification. It systematically compares three approaches: prompt engineering (including zero-shot and few-shot methods), retrieval-augmented generation (RAG), and fine-tuning. Using two datasets, the DAIR-AI Emotion dataset and the Reddit SuicideWatch and Mental Health Collection (SWMH), the research demonstrates that LLaMA 3-based models can achieve accuracy of up to 91% in emotion classification and 80% in mental health condition classification through fine-tuning. The findings provide practical insights into the implementation challenges and resource requirements of each approach, suggesting that while fine-tuning yields the highest accuracy, prompt engineering and RAG are viable alternatives with distinct trade-offs. This work contributes to the development of reliable, scalable tools aimed at supporting mental health professionals and improving access to mental health assessment.

Methods

The study evaluates three methodologies for analyzing mental health texts using large language models (LLMs): fine-tuning, prompt engineering, and retrieval-augmented generation (RAG), with the LLaMA 3 model (8B parameters) serving as the common base model for all three. The experiments ran on an A100 GPU, with memory use reduced through 4-bit quantization and float16 precision. Fine-tuning yielded high F1 scores for basic emotions, such as joy (0.94) and sadness (0.95), while more complex emotions like love and surprise showed lower scores (0.81 and 0.72, respectively). The fine-tuned model also performed well in detecting anxiety and bipolar disorder (F1 scores of 0.86 and 0.85), while zero-shot prompting was notably effective in identifying mental health conditions, particularly depression (0.70) and anxiety (0.74).
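The per-class F1 scores quoted above can be read as one-vs-rest scores computed separately for each label. The following is a minimal sketch of that computation, assuming single-label classification; the function name `per_class_f1` and the toy labels are invented for illustration and are not data or code from the study.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # the predicted label was a false positive
            fn[gold] += 1  # the gold label was missed
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


# Toy labels, invented for illustration only (not results from the study).
gold = ["joy", "joy", "sadness", "love", "surprise", "sadness"]
pred = ["joy", "love", "sadness", "love", "joy", "sadness"]
scores = per_class_f1(gold, pred)
for label, f1 in scores.items():
    print(f"{label}: {f1:.2f}")
```

Reporting F1 per class rather than a single accuracy figure is what makes it visible that some labels (here, the rarer ones) are handled much worse than others.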

Conversely, RAG and few-shot prompting exhibited limitations. RAG's performance varied significantly with retrieval quality, reaching up to 64% accuracy with relevant context but dropping to 31% with less relevant context. Few-shot prompting underperformed zero-shot prompting, suggesting that in-prompt examples may introduce conflicting patterns that harm classification in mental health contexts. Overall, while fine-tuning and zero-shot prompting showed promising results, RAG and few-shot prompting struggled to deliver reliable performance for mental health text analysis.
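To make the difference between the prompting strategies concrete, the sketch below builds a zero-shot prompt, a few-shot prompt, and a retrieval-augmented prompt for the same post. Everything here is an assumption made for illustration: the label set, the prompt wording, the helper names (`zero_shot_prompt`, `few_shot_prompt`, `retrieve`, `rag_prompt`), the toy corpus, and the use of bag-of-words cosine similarity as the retriever. The summary does not describe the prompts or the retriever the authors actually used.

```python
import math
import re
from collections import Counter

LABELS = ["depression", "anxiety", "bipolar"]  # hypothetical label set
INSTRUCTION = "Classify the post into one of: " + ", ".join(LABELS) + "."


def zero_shot_prompt(text):
    """Instruction plus the post, with no labeled examples."""
    return f"{INSTRUCTION}\n\nPost: {text}\nLabel:"


def few_shot_prompt(text, examples):
    """Instruction, then labeled (post, label) examples, then the post."""
    shots = "".join(f"Post: {t}\nLabel: {lab}\n\n" for t, lab in examples)
    return f"{INSTRUCTION}\n\n{shots}Post: {text}\nLabel:"


def _bow(text):
    """Lowercased bag-of-words counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, corpus, k=1):
    """Return the k corpus entries most similar to the query."""
    q = _bow(query)
    ranked = sorted(corpus, key=lambda item: cosine(q, _bow(item[0])), reverse=True)
    return ranked[:k]


def rag_prompt(text, corpus, k=1):
    """Few-shot prompt whose examples are retrieved for this specific post."""
    return few_shot_prompt(text, retrieve(text, corpus, k))


# Toy corpus, invented for illustration only.
corpus = [
    ("I feel hopeless and empty every day", "depression"),
    ("My heart races and I worry about everything", "anxiety"),
    ("My mood swings between extreme highs and lows", "bipolar"),
]
query = "I worry constantly and my heart keeps racing"
print(rag_prompt(query, corpus))
```

In this framing, the retrieval-augmented prompt is only as good as the examples `retrieve` returns: an unrelated retrieved example places a misleading label directly in the prompt, which is one plausible reading of why accuracy depended so strongly on retrieval quality.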

Results

The results reveal significant performance differences among fine-tuning, prompt engineering, and retrieval-augmented generation (RAG) for mental health text analysis. Fine-tuning the LLaMA 3 model yielded the highest accuracies, 91% on the DAIR-AI Emotion dataset and 80% on the SWMH dataset. Zero-shot prompting was the second-best method, achieving 49% and 68% accuracy on the same datasets and outperforming both few-shot prompting and RAG. This indicates that well-designed prompts can draw effectively on the model's pre-trained knowledge for mental health text classification without additional examples.

The performance gap between approaches is considerably wider in emotion classification than in mental health condition detection. Notably, the fine-tuned model achieved a 15.3 percentage-point absolute improvement in classification accuracy over previous work using relation networks (Ji et al., 2021) and showed improved F1 scores on the SWMH dataset. The study highlights that zero-shot prompting also surpassed the baseline performance, suggesting that LLMs can generalize effectively to mental health text classification without extensive domain-specific feature engineering. Additionally, the advantage of fine-tuning on the DAIR-AI Emotion dataset, which has a larger training set (54.4K compared to 20K for SWMH), underscores the importance of dataset size in achieving optimal model performance.
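The size of that gap follows directly from the accuracies reported in this section. As a small worked example, the snippet below computes the difference between fine-tuning and zero-shot prompting on each dataset; the dictionary layout and variable names are illustrative, and only the four accuracy figures come from the summary.

```python
# Accuracies (%) for fine-tuning and zero-shot prompting, as reported above.
accuracy = {
    "DAIR-AI Emotion": {"fine_tuning": 91, "zero_shot": 49},
    "SWMH": {"fine_tuning": 80, "zero_shot": 68},
}

# Gap in percentage points between the two best-performing approaches.
gaps = {
    name: acc["fine_tuning"] - acc["zero_shot"] for name, acc in accuracy.items()
}

for name, gap in gaps.items():
    print(f"{name}: fine-tuning leads zero-shot by {gap} percentage points")
```

On these figures, the lead of fine-tuning is 42 points for emotion classification and 12 points for mental health condition detection.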

Discussion

The discussion highlights the evolving role of large language models (LLMs) in mental health assessment, emphasizing their potential to enhance healthcare delivery. Recent LLMs such as Med-PaLM 2 and Clinical-Camel have demonstrated capabilities in clinical decision support and patient communication, with fine-tuning strategies achieving performance comparable to healthcare professionals in diagnostic tasks. The paper identifies three primary strategies for deploying LLMs: fine-tuning, prompt engineering, and retrieval-augmented generation (RAG), each with distinct advantages and limitations in mental health applications.

The section further explores the challenges of mental health text analysis, where traditional methods often fail to capture the nuanced emotional expressions necessary for accurate assessments. LLMs have shown promise in detecting signs of mental health issues from social media content, achieving human-level accuracy in qualitative coding tasks. However, the authors stress the importance of developing comprehensive evaluation frameworks that consider not only model performance metrics like accuracy and F1 scores but also the clinical relevance and interpretability of outputs. Ethical considerations, including the potential for bias and the need for clinical oversight, are also emphasized, particularly given the sensitive nature of mental health assessments. The findings suggest that while fine-tuning offers superior performance, further research is needed to validate these models in clinical settings and to explore hybrid approaches that leverage the strengths of various methodologies.

Limitations

The study highlights the potential of large language models, specifically LLaMA 3 8B, for psychological assessments while acknowledging several limitations. A primary concern is the substantial computational resources required for fine-tuning such models, which may put the approach out of reach for researchers with limited computational capabilities. Additionally, training and evaluation were conducted on the DAIR-AI Emotion and SWMH datasets, which, despite their diversity, may not capture the full complexity and variability of real-world psychological text data. This limitation could affect the generalizability of the findings across different domains, languages, and text formats, such as short versus long texts.

Moreover, the research does not explore the practical integration of these language models into clinical workflows, an aspect that necessitates collaboration with domain experts and thorough validation. Addressing these limitations in future studies could enhance the accessibility, generalizability, and ethical application of large language model-based tools in psychological assessment.