Comparing traditional natural language processing and large language models for mental health status classification: a multi-model evaluation

Journal: Scientific Reports, Volume: 15, Issue: 1
DOI: https://doi.org/10.1038/s41598-025-08031-0
PMID: https://pubmed.ncbi.nlm.nih.gov/40619512
Publication Date: 2025-07-06
Author(s): Thomas Kallstenius et al.
Primary Topic: Mental Health via Writing

Overview

This study addresses the need for effective tools to detect and classify mental health conditions in digital environments by comparing three computational approaches: traditional natural language processing (NLP) with advanced feature engineering, prompt-engineered large language models (LLMs), and fine-tuned LLMs. Using a dataset of over 51,000 text statements from social media, each labeled with one of seven mental health statuses, the study evaluated accuracy, precision, recall, and F1-score. The traditional NLP model achieved 95% accuracy, ahead of the fine-tuned LLM (91%) and well ahead of the prompt-engineered LLM (65%). The findings indicate that fine-tuning improves LLM performance substantially, but that traditional NLP with optimized feature engineering remained the strongest approach for this classification task.

Beyond its higher accuracy, the optimized NLP model offered advantages in data privacy, efficiency, and interpretability. The fine-tuned LLM handled complex conditions better than the prompt-engineered LLM, but it did not match the NLP model's overall performance. The study suggests that combining fine-tuned LLMs with traditional NLP methods, complemented by human expertise, could improve mental health monitoring in digital spaces. Future research should adapt these models for neurodiverse populations and expand datasets to improve classification of underrepresented mental health conditions.

Methods

The study compares three classifiers on a dataset of text statements collected from social media platforms, each labeled with one of seven mental health statuses: Normal, Depression, Suicidal, Anxiety, Stress, Bipolar Disorder, and Personality Disorder. The data were divided with a stratified train-test split, so that each status appears in the training and test sets in the same proportion as in the full dataset.

The three approaches were a traditional NLP model that combines TF-IDF vectorization with a support vector machine (SVM), a prompt-engineered GPT-4o-mini, and a fine-tuned GPT-4o-mini trained for up to four epochs. Each model was evaluated on accuracy, precision, recall, and F1-score. Because the data are social media posts rather than clinical records, the labels are self-reported or inferred and are not clinically validated.

Results

The Results section reports the performance of three mental health classifiers in terms of accuracy, precision, recall, and F1-score: a prompt-engineered GPT-4o-mini, a fine-tuned GPT-4o-mini, and a traditional NLP model that pairs TF-IDF vectorization with a support vector machine (SVM). All three were evaluated on a stratified train-test split, which keeps the class proportions of the full dataset in both partitions so that per-condition scores are comparable.
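As a rough illustration of the TF-IDF and SVM combination named above, the following sketch builds such a classifier with scikit-learn. The toy statements, the choice of `LinearSVC`, and the n-gram range are assumptions made for this example; the summary does not state the paper's actual preprocessing or hyperparameters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy, made-up statements for illustration only; not taken from the study's dataset.
train_texts = [
    "had a nice walk with friends today",
    "looking forward to the weekend trip",
    "i feel empty and hopeless all the time",
    "nothing brings me joy anymore",
    "my heart races and i cannot stop worrying",
    "i am constantly nervous about everything",
]
train_labels = ["Normal", "Normal", "Depression", "Depression", "Anxiety", "Anxiety"]

# TF-IDF maps each text to a sparse vector of weighted term counts,
# and a linear SVM learns a separating boundary per class (one-vs-rest).
# LinearSVC and ngram_range=(1, 2) are assumptions, not the paper's settings.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, ngram_range=(1, 2)),
    LinearSVC(random_state=0),
)
model.fit(train_texts, train_labels)

predictions = model.predict(
    ["i keep worrying and feel nervous", "nice weekend with friends"]
)
print(list(predictions))
```

A pipeline like this runs entirely on local hardware, which is consistent with the privacy and efficiency advantages the study attributes to the traditional approach.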

The prompt-engineered GPT-4o-mini achieved an overall accuracy of 65%, with precision ranging from 25% for Stress to 88% for Normal. Its recall was highest for Suicidal at 80%, but it struggled with underrepresented conditions such as Personality Disorder, which suggests that prompt engineering alone does not capture the complexity of mental health classification. Fine-tuning GPT-4o-mini raised accuracy to 91% after three epochs, with high precision for common categories (e.g., 99% for Normal) and strong recall for underrepresented conditions (e.g., 94% for Personality Disorder). Accuracy then dropped to 85% after a fourth epoch, suggesting overfitting. The NLP model outperformed both, reaching the highest accuracy of 95% and showing that traditional machine learning remains effective in this setting.
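Per-condition precision and recall figures like those above come from counting, for each label, how often it was predicted correctly, predicted wrongly, or missed. The sketch below shows that computation in plain Python; the function name `per_class_metrics` and the sample labels are hypothetical and are not the study's data.

```python
from collections import Counter

def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)}, treating each label one-vs-rest."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # the predicted label was wrong
            fn[truth] += 1  # the true label was missed
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics

# Made-up labels for illustration only.
y_true = ["Normal", "Normal", "Stress", "Suicidal", "Suicidal", "Stress"]
y_pred = ["Normal", "Stress", "Stress", "Suicidal", "Suicidal", "Normal"]

metrics = per_class_metrics(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
for label, (precision, recall, f1) in metrics.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Reporting these per label matters here because overall accuracy can hide weak performance on small classes.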

Discussion

The Discussion section assesses how well the models classify mental health conditions on a dataset of 52,681 unique text statements. The dataset, sourced from social media platforms, covers seven mental health statuses: Normal, Depression, Suicidal, Anxiety, Stress, Bipolar Disorder, and Personality Disorder. The study highlights the challenges posed by class imbalance, particularly the overrepresentation of conditions like Depression and the underrepresentation of others like Personality Disorder. A stratified train-test split was used so that every class keeps the same proportion in the training and test sets as in the full dataset. This does not remove the imbalance, but it ensures that rare classes are present in both partitions and makes the evaluation more reliable.
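A stratified split can be sketched in a few lines of plain Python by splitting each class separately. The function name `stratified_split`, the 20% test fraction, and the 80/20 class counts below are assumptions chosen for illustration; the summary does not give the study's split ratio.

```python
import random
from collections import Counter, defaultdict

def stratified_split(texts, labels, test_fraction=0.2, seed=0):
    """Split (text, label) pairs so each label keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        # Send at least one example of every class to the test set.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    train = [(texts[i], labels[i]) for i in train_idx]
    test = [(texts[i], labels[i]) for i in test_idx]
    return train, test

# Made-up imbalanced data: one common class and one rare class.
labels = ["Depression"] * 80 + ["Personality Disorder"] * 20
texts = [f"statement {i}" for i in range(len(labels))]

train, test = stratified_split(texts, labels, test_fraction=0.2)
train_counts = Counter(label for _, label in train)
test_counts = Counter(label for _, label in test)
print(train_counts, test_counts)
```

In this example the rare class makes up 20% of both partitions, which is the property that a purely random split does not guarantee.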

Among the models evaluated, the traditional NLP model outperformed both the fine-tuned and the prompt-engineered LLMs, achieving an accuracy of 95%. The fine-tuned GPT-4o-mini peaked at 91% accuracy after three epochs and showed signs of overfitting when trained for a fourth. The NLP model's strengths included high precision and recall across conditions, particularly for critical categories such as Depression and Suicidal. The findings suggest that fine-tuned models can adapt to specific linguistic patterns, while traditional NLP approaches offer advantages in efficiency, interpretability, and local data processing that make them suitable for real-time use in sensitive contexts like mental health. The study also calls for future research on neurodiverse expressions of mental health and for benchmarking a wider range of models.
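The overfitting pattern described here is usually handled by keeping the checkpoint from the best-scoring epoch and stopping once accuracy no longer improves. The sketch below illustrates that idea; `best_epoch` and `should_stop` are hypothetical helpers, and the only figures used are the two reported in this summary (91% after epoch 3, 85% after epoch 4). Whether the authors used a separate validation set for this decision is not stated.

```python
def best_epoch(accuracy_by_epoch):
    """Return (epoch, accuracy) with the highest accuracy; ties go to the earliest epoch."""
    return max(sorted(accuracy_by_epoch.items()), key=lambda item: item[1])

def should_stop(history, patience=1):
    """Return True when the last `patience` epochs failed to beat the best earlier accuracy."""
    if len(history) <= patience:
        return False
    best_before = max(history[:-patience])
    return all(acc <= best_before for acc in history[-patience:])

# Accuracies reported in this summary for the fine-tuned model.
reported = {3: 0.91, 4: 0.85}

epoch, accuracy = best_epoch(reported)
print(f"keep the checkpoint from epoch {epoch} (accuracy {accuracy:.2f})")

# In epoch order: accuracy fell from 0.91 to 0.85, so training should stop.
print(should_stop([0.91, 0.85], patience=1))
```

Applied to the reported numbers, this rule keeps the epoch-3 checkpoint and discards the fourth epoch.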

Limitations

The study has several strengths. Its dataset of over 52,000 social media posts is large and diverse enough to support training and evaluation across the seven mental health statuses. The preprocessing and evaluation steps are described in detail, which supports reproducibility and gives later work a clear starting point. The simplicity, efficiency, and interpretability of the NLP model make it suitable for real-world applications, particularly in resource-constrained environments, and the confidence scores it provides add practical value for healthcare professionals.

The study also has limitations. The dataset relies on self-reported or inferred mental health indicators from social media, which lack clinical validation and may introduce bias. The exclusion of neurodiverse populations could lead to underrepresentation or misclassification of the distinct ways these groups express mental health states. The focus on text alone leaves out contextual and multimodal signals that a comprehensive mental health assessment requires. Finally, although prompt-engineered LLMs performed poorly in this study, future advances in these models may narrow the gap.