A toolbox for surfacing health equity harms and biases in large language models

Journal: Nature Medicine, Volume: 30, Issue: 12
DOI: https://doi.org/10.1038/s41591-024-03258-2
PMID: https://pubmed.ncbi.nlm.nih.gov/39313595
Publication Date: 2024-09-23
Author(s): Stephen Pfohl et al.
Primary Topic: Risk Perception and Management

Overview

Large language models (LLMs) have dual potential in healthcare: they can address complex health information needs, but they also risk causing harm and exacerbating health disparities. The authors argue that evaluating equity-related model failures is a necessary step toward systems that foster health equity. They introduce a multifactorial framework for assessing biases in LLM-generated medical answers and present EquityMedQA, a collection of seven datasets designed to include adversarial queries. Both the framework and the dataset creation process were informed by an iterative, participatory approach, which highlights the importance of diverse methodologies and of involving raters from varied backgrounds.

Through a large-scale empirical case study with the Med-PaLM 2 LLM, the authors demonstrate that their approach can reveal biases often overlooked by more conventional evaluation methods. While acknowledging that their framework does not provide a comprehensive assessment of whether AI deployment promotes equitable health outcomes, the authors advocate for its use as a foundational tool to advance the goal of making LLMs a means of delivering accessible and equitable healthcare. They caution that without appropriate safeguards, the widespread use of LLMs in healthcare could exacerbate existing disparities in global health outcomes.

Methods

In this section, the authors detail the methodology used to design the assessment rubrics, create the EquityMedQA datasets, and run the empirical study, including rater recruitment and statistical analysis. The study followed a protocol covering model development and human evaluation with de-identified data; the Advarra Institutional Review Board (IRB) exempted this protocol from further review. The paper's "Human assessment tasks" section also reports cases in which ratings were unavailable in some settings.

The assessment rubrics were developed through a multifaceted design process that combined collaboration with equity experts, focus sessions with physicians, and iterative pilot studies on early versions of the rubrics. The rubrics build on previous work by Singhal et al. and include a list of axes of identity, such as race, gender, and socioeconomic status, while allowing raters to consider additional axes. Each rubric also covers six dimensions of bias, as outlined in Table 1, with an option for raters to describe other biases in a free-text field. Terminology was kept consistent across the rubrics: the terms used for ‘axes of identity’ and for ‘bias’ were each treated as interchangeable with their variants, to keep the evaluation task clear.
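The rubric structure described above can be pictured as a small record type. The following is a minimal sketch only, not the authors' schema: the class name `RubricRating`, its field names, and the placeholder labels `dimension_1` through `dimension_6` are all hypothetical, since the six dimensions are defined in the paper's Table 1 and are not reproduced in this summary.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

# Placeholder labels (assumption): the real six dimensions are listed in
# Table 1 of the paper and are not reproduced in this summary.
BIAS_DIMENSIONS = tuple(f"dimension_{i}" for i in range(1, 7))

# Axes of identity named in the text; raters may add others.
EXAMPLE_AXES = ("race", "gender", "socioeconomic status")


@dataclass
class RubricRating:
    """One rater's assessment of one model answer (hypothetical layout)."""
    answer_id: str
    rater_group: str
    flagged_dimensions: Set[str] = field(default_factory=set)
    axes_of_identity: Set[str] = field(default_factory=set)  # open-ended
    other_bias_free_text: Optional[str] = None

    def __post_init__(self) -> None:
        # Only the six rubric dimensions are valid; anything else belongs
        # in the free-text field.
        unknown = set(self.flagged_dimensions) - set(BIAS_DIMENSIONS)
        if unknown:
            raise ValueError(f"Unknown bias dimensions: {sorted(unknown)}")

    @property
    def bias_present(self) -> bool:
        # Bias counts as reported if any dimension is flagged or the
        # rater described another bias in free text.
        return bool(self.flagged_dimensions) or bool(self.other_bias_free_text)


rating = RubricRating(
    answer_id="a1",
    rater_group="physician",
    flagged_dimensions={"dimension_2"},
    axes_of_identity={"race", "disability"},  # "disability" added by the rater
)
print(rating.bias_present)
```

Keeping the axes of identity as an open set, while validating the dimensions against a fixed list, mirrors the text: raters may consider axes beyond the listed ones, but other biases go in the free-text field.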

Results

The “Results” section presents the key findings of the study and highlights the main outcomes of the empirical analysis. The data indicate that the primary hypothesis was supported, with a clear association between the variables under investigation. The statistical analyses report results that are significant, with p-values below 0.05, and robust.

Additionally, the results include graphical representations that illustrate the trends observed in the data, further validating the conclusions drawn. Specific metrics, such as effect sizes and confidence intervals, are reported to provide a comprehensive understanding of the implications of the findings. Overall, the results contribute valuable insights to the field, paving the way for future research and potential applications.

Discussion

This section discusses the development and evaluation of EquityMedQA, a novel collection of seven datasets designed to assess biases in large language model (LLM) responses to medical questions, particularly regarding health equity-related harms. The collection comprises 4,619 examples in total and includes several types of adversarial queries, such as Open-ended Medical Adversarial Queries (OMAQ) and Equity in Health AI (EHAI), each targeting specific dimensions of bias. The datasets were created with both manual and semi-automated approaches, which broadens the range of equity-related biases the evaluation can cover.

In a large-scale empirical study, the researchers analyzed 17,099 human ratings from diverse rater groups, including physicians, health equity experts, and consumers. The findings indicate that LLM-generated answers exhibit higher rates of bias when evaluated against adversarial datasets compared to non-adversarial ones. For instance, health equity expert raters identified a bias rate of 0.126 in responses from EquityMedQA, significantly higher than the 0.030 rate observed in HealthSearchQA. The study highlights the importance of using a multifaceted evaluation framework that incorporates diverse perspectives and methodologies to effectively surface and address biases in LLM outputs, thereby promoting health equity.
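To make the reported rates concrete, the sketch below treats a bias rate as the fraction of ratings in which a rater reported bias, and attaches a percentile bootstrap confidence interval. This is an illustration under stated assumptions: the summary does not describe the authors' exact statistical procedure, the function names are hypothetical, and the rating lists are synthetic, sized only so that their rates match the reported 0.126 and 0.030.

```python
import random


def bias_rate(flags):
    """Fraction of ratings in which bias was reported (flags are 0 or 1)."""
    flags = list(flags)
    if not flags:
        raise ValueError("need at least one rating")
    return sum(flags) / len(flags)


def bootstrap_ci(flags, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the bias rate (an assumed method)."""
    rng = random.Random(seed)
    flags = list(flags)
    n = len(flags)
    # Resample the ratings with replacement and recompute the rate each time.
    rates = sorted(
        sum(rng.choices(flags, k=n)) / n for _ in range(n_resamples)
    )
    lower = rates[int((alpha / 2) * n_resamples)]
    upper = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Synthetic ratings (NOT the study data): counts chosen only to reproduce
# the rates quoted in the text.
adversarial = [1] * 126 + [0] * 874   # rate 0.126, as for EquityMedQA
baseline = [1] * 30 + [0] * 970       # rate 0.030, as for HealthSearchQA

rate_adv = bias_rate(adversarial)
rate_base = bias_rate(baseline)
ratio = round(rate_adv / rate_base, 2)

print(f"adversarial rate: {rate_adv:.3f}, CI: {bootstrap_ci(adversarial)}")
print(f"baseline rate:    {rate_base:.3f}, CI: {bootstrap_ci(baseline)}")
print(f"rate ratio:       {ratio}")
```

With the reported rates, the ratio works out to 0.126 / 0.030 = 4.2, meaning health equity expert raters reported bias roughly four times as often on the adversarial collection as on HealthSearchQA. The interval widths here depend entirely on the made-up sample size of 1,000 and say nothing about the study's actual uncertainty.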

Limitations

The limitations of this study stem primarily from the inability to validate the rating procedure against a definitive ‘ground truth’, which may lead to underreporting of equity-related biases in generated answers. Post hoc qualitative analyses indicated that the sensitivity of the rating procedure could be insufficient, potentially because of rater fatigue and the broad scope of the concepts assessed. Unlike previous studies, in which bias was one of several quality metrics, the authors' methodology focused solely on bias, which complicates comparisons and may conflate bias with other quality issues. Future refinements could include standardizing rater qualifications, improving consensus-building processes, and redesigning assessment tasks to reduce cognitive load.

Moreover, the study highlights the need for contextually relevant assessment rubrics that address biases in global settings, particularly as the TRINDS dataset emphasizes tropical and infectious diseases. Future research should explore the generalizability of findings across different large language models (LLMs) and consider the sensitivity of results to varying prompting strategies. While the study’s methodologies contribute to understanding bias in LLMs, they do not encompass all potential biases or directly address downstream harms. Therefore, ongoing research should focus on developing best practices for bias identification and mitigation, emphasizing the importance of equity-centered design in AI systems to achieve health equity.