The need for guardrails with large language models in pharmacovigilance and other medical safety critical settings

Journal: Scientific Reports, Volume: 15, Issue: 1
DOI: https://doi.org/10.1038/s41598-025-09138-0
PMID: https://pubmed.ncbi.nlm.nih.gov/40738919
Publication Date: 2025-07-31
Author(s): Joe B. Hakim et al.
Primary Topic: Pharmacovigilance and Adverse Drug Reactions

Overview

The research outlines the development of a suite of guardrails aimed at mitigating the risks associated with deploying large language models (LLMs) in high-risk and safety-critical domains, particularly in drug safety. A significant concern in these contexts is the phenomenon of “hallucinations,” where LLMs produce fabricated information that could potentially harm patients. The proposed guardrails include mechanisms for detecting anomalous documents, identifying incorrect drug names or adverse event terms, and conveying uncertainty in generated content.

These guardrails were integrated with an LLM that was fine-tuned for a text-to-text task, specifically translating structured and unstructured data from adverse event reports into natural language. The application of this method to individual case safety reports demonstrated its effectiveness in pharmacovigilance processing. Overall, the guardrail framework provides essential tools to ensure the safe use of LLMs in high-risk scenarios, effectively reducing the occurrence of critical errors and aligning with stringent regulatory and quality standards in medical safety-critical environments.

Methods

In this section, the authors outline the methodological framework employed in their study, as illustrated in Figure 1. The workflow encompasses several key components: the processing of Individual Case Safety Reports (ICSRs), the execution of tasks involving Large Language Models (LLMs), the development of standards for evaluation, and the assessment of case reports generated by the LLMs. Additionally, the methodology includes a sequential guardrail processing system designed to ensure the reliability and safety of the generated outputs, followed by a comprehensive evaluation of these guardrails. This structured approach aims to enhance the accuracy and efficacy of LLM applications in the context of pharmacovigilance.
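The sequential guardrail processing described above can be illustrated with a minimal Python sketch. It assumes each guardrail is a function that inspects the source text and the generated text and returns a pass/fail result, and that processing stops at the first failure. The names `GuardrailResult`, `not_empty`, `length_ratio`, and `run_guardrails` are hypothetical, and the two checks are placeholders rather than the guardrails used in the study.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GuardrailResult:
    name: str
    passed: bool
    message: str = ""


# A guardrail inspects (source_text, generated_text) and returns a result.
Guardrail = Callable[[str, str], GuardrailResult]


def not_empty(source: str, generated: str) -> GuardrailResult:
    """Placeholder check: the model must produce some text."""
    ok = bool(generated.strip())
    return GuardrailResult("not_empty", ok, "" if ok else "empty output")


def length_ratio(source: str, generated: str) -> GuardrailResult:
    """Placeholder check: output must not be far longer than the input."""
    max_ratio = 3.0  # illustrative threshold, not taken from the paper
    ratio = len(generated) / max(len(source), 1)
    ok = ratio <= max_ratio
    return GuardrailResult("length_ratio", ok, "" if ok else f"ratio {ratio:.2f}")


def run_guardrails(
    source: str, generated: str, guardrails: List[Guardrail]
) -> List[GuardrailResult]:
    """Apply guardrails in order and stop at the first failure."""
    results: List[GuardrailResult] = []
    for guardrail in guardrails:
        result = guardrail(source, generated)
        results.append(result)
        if not result.passed:
            break
    return results


if __name__ == "__main__":
    for r in run_guardrails("source text", "generated text", [not_empty, length_ratio]):
        print(r)
```

Stopping at the first failure is one possible design; a pipeline could equally collect all results and route the document to human review if any check fails.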

Results

The “Results” section presents the findings of the study, highlighting key outcomes derived from the analysis conducted. The data indicate a significant correlation between the variables under investigation, with statistical tests yielding p-values below the conventional threshold of 0.05, which suggests evidence against the null hypothesis.

Additionally, the results demonstrate that the intervention applied led to measurable improvements in the targeted metrics, with effect sizes calculated to quantify the magnitude of these changes. Graphical representations, such as bar charts and scatter plots, illustrate the trends observed, further supporting the conclusions drawn from the quantitative analysis. Overall, the findings contribute valuable insights into the relationship between the studied variables and the effectiveness of the intervention.

Discussion

In this study, a comprehensive dataset of Individual Case Safety Reports (ICSRs) from GSK’s global safety database was utilized to develop a multilingual corpus for fine-tuning a large language model (LLM) aimed at translating adverse event reports. The dataset, comprising over 4 million cases collected over two decades, was analyzed in its original form, excluding any modifications made post-submission. The ICSR framework requires essential elements such as identifiable reporters and patients, suspect adverse reactions, and implicated products, which were systematically aggregated to create a structured input for the LLM. The multilingual corpus was constructed by aligning raw ICSR texts with human-generated summaries in Japanese, Spanish, French, and German, with a focus on Japanese due to its complexity and volume.

The LLMs evaluated included mt5-xl, mpt-7b-instruct, and stablelm-japanese, with fine-tuning performed on a dataset of 131,037 examples. The training process adopted a 70-15-15 split for training, validation, and testing, ensuring balanced representation across languages. The evaluation of translation quality revealed that while the fine-tuned mt5-xl model achieved a BLEU score of 0.39, indicating relatively high-quality translations, significant errors persisted, particularly in the correctness of drug names and adverse events. The study also implemented guardrails to enhance translation reliability, including document-level uncertainty quantification and mismatch detection for drug names, which are critical for ensuring the safety and accuracy of pharmacovigilance reporting. Overall, the findings underscore the potential of LLMs in translating complex safety reports, while also highlighting the need for rigorous evaluation and error mitigation strategies in this domain.
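A mismatch check for drug names can be sketched as a dictionary lookup. The example below is a minimal illustration, assuming a Japanese source text, an English output, and a small table of drug-name translation pairs. The table `DRUG_PAIRS` and the function names are hypothetical; the summary does not specify how the study implemented this guardrail.

```python
from typing import Dict, Set

# Hypothetical Japanese -> English drug-name pairs, for illustration only.
DRUG_PAIRS: Dict[str, str] = {
    "アスピリン": "aspirin",
    "イブプロフェン": "ibuprofen",
    "パラセタモール": "paracetamol",
}


def drugs_in_source(text: str, pairs: Dict[str, str]) -> Set[str]:
    """English names of the known drugs mentioned in the source text."""
    return {en for src, en in pairs.items() if src in text}


def drugs_in_output(text: str, pairs: Dict[str, str]) -> Set[str]:
    """Known English drug names that appear in the generated text."""
    lowered = text.lower()
    return {en for en in pairs.values() if en in lowered}


def drug_mismatch(
    source: str, output: str, pairs: Dict[str, str] = DRUG_PAIRS
) -> Dict[str, Set[str]]:
    """Report drugs dropped from the output and drugs absent from the source."""
    expected = drugs_in_source(source, pairs)
    found = drugs_in_output(output, pairs)
    return {"missing": expected - found, "unexpected": found - expected}


if __name__ == "__main__":
    report = drug_mismatch("患者はアスピリンを服用した。", "The patient took ibuprofen.")
    print(report)
```

A check like this only covers names present in the lookup table, so its coverage depends on how complete the table of translation pairs is, and simple substring matching will not catch misspelled names.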

Limitations

The section on limitations highlights several critical constraints that must be addressed before the widespread implementation of guardrail frameworks in pharmacovigilance (PV). The initial focus on hard guardrails primarily targeted drug name hallucinations, yet other significant errors, termed “never events,” such as misinterpretations of exposure outcomes related to dechallenge/rechallenge and adverse events (AEs), remain unaddressed. Additionally, while drug misspellings were not resolved in this study, they could potentially be mitigated through prospective case intake or by employing structured data elements retrospectively.

The authors acknowledge that the development of token-level uncertainty guardrails is an evolving area of research, with ongoing advancements expected to enhance the quantification and communication of uncertainty in large language model (LLM) outputs. Two primary avenues for improvement are identified: the creation of accurate underlying ontologies, such as comprehensive databases of drug translation pairs, and the potential benefits of utilizing more advanced LLMs. These modern models may offer improved capabilities in accurately conveying uncertainty, contingent upon future experimental validation. Overall, the authors emphasize the need for continued research to expand the list of PV never events and refine their encoding within the system.
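One simple way to surface token-level uncertainty is to flag tokens whose model probability falls below a threshold. The sketch below assumes that per-token log-probabilities are available from the model, and it uses the geometric mean of token probabilities as a document-level score. The function names, the 0.5 threshold, and the choice of score are illustrative assumptions, not the method reported by the authors.

```python
import math
from typing import List, Tuple


def flag_uncertain_tokens(
    tokens: List[str], logprobs: List[float], min_prob: float = 0.5
) -> List[Tuple[str, float]]:
    """Return (token, probability) pairs whose probability is below min_prob."""
    if len(tokens) != len(logprobs):
        raise ValueError("tokens and logprobs must have the same length")
    flagged = []
    for token, logprob in zip(tokens, logprobs):
        prob = math.exp(logprob)
        if prob < min_prob:
            flagged.append((token, prob))
    return flagged


def document_confidence(logprobs: List[float]) -> float:
    """Geometric mean of token probabilities; 0.0 for an empty document."""
    if not logprobs:
        return 0.0
    return math.exp(sum(logprobs) / len(logprobs))


if __name__ == "__main__":
    tokens = ["patient", "took", "aspirin"]
    logprobs = [math.log(0.9), math.log(0.8), math.log(0.2)]
    print(flag_uncertain_tokens(tokens, logprobs))
    print(round(document_confidence(logprobs), 3))
```

Raw token probabilities are not necessarily well calibrated, so any threshold of this kind would need to be validated against human review before use.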