Diagnostic reasoning prompts reveal the potential for large language model interpretability in medicine

Journal: npj Digital Medicine, Volume: 7, Issue: 1
DOI: https://doi.org/10.1038/s41746-024-01010-1
PMID: https://pubmed.ncbi.nlm.nih.gov/38267608
Publication Date: 2024-01-24
Author(s): Thomas Savage et al.
Primary Topic: Artificial Intelligence in Healthcare and Education

Overview

The authors address a significant obstacle to applying large language models (LLMs) in medicine: the perception that these models reach clinical decisions through uninterpretable methods that differ from the cognitive processes clinicians use. The study introduces diagnostic reasoning prompts to test whether LLMs, specifically GPT-4, can imitate clinical reasoning while maintaining diagnostic accuracy.

The findings indicate that GPT-4 can indeed be prompted to emulate typical clinical reasoning processes without compromising its ability to make accurate diagnoses. This capability is crucial, as it allows LLMs to provide interpretable rationales for their decisions, enabling physicians to assess the reliability of the model’s responses in patient care contexts. The authors suggest that employing diagnostic reasoning prompts could alleviate the “black box” concerns associated with LLMs, thereby enhancing their potential for safe and effective integration into medical practice.

Introduction

The introduction of this research paper discusses the capabilities and applications of large language models (LLMs) in the medical field, highlighting their ability to generate clinical notes, pass medical exams, and respond to patient inquiries. It emphasizes the importance of understanding LLMs’ clinical reasoning abilities, which encompass problem-solving processes essential for diagnosing and managing medical conditions. Traditional assessments have primarily focused on multiple-choice questions, but recent studies indicate that advanced models like GPT-4 show promise in diagnosing complex clinical cases through free-response questions.

The paper introduces the concept of prompt engineering, particularly the use of Chain-of-Thought (CoT) prompting, which encourages LLMs to break down tasks into smaller, manageable steps. This approach aligns well with the stepwise nature of clinical reasoning. The authors propose to evaluate the performance of GPT-3.5 and GPT-4 on open-ended clinical questions, utilizing a modified MedQA USMLE dataset and the NEJM case series. They aim to compare traditional CoT prompting with specialized “diagnostic reasoning” prompts that reflect clinical cognitive processes, hypothesizing that LLMs will perform better with these tailored prompts than with standard CoT methods.
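
The contrast between traditional CoT prompting and a strategy-specific diagnostic reasoning prompt can be sketched as a small prompt builder. This is only an illustration: the instruction wording, the strategy keys, and the `build_prompt` function are assumptions made here, not the prompts used in the study.

```python
# Hypothetical instruction texts; the wording is illustrative, not taken from the paper.
INSTRUCTIONS = {
    "traditional_cot": "Let's think step by step.",
    "differential_diagnosis": (
        "Use step-by-step deduction to build a differential diagnosis, "
        "then narrow it to the single most likely diagnosis."
    ),
    "intuitive": (
        "Use pattern recognition over the symptoms, signs, and lab results "
        "to reach a diagnosis step by step."
    ),
    "analytic": (
        "Use analytic reasoning about the underlying pathophysiology "
        "to deduce the diagnosis step by step."
    ),
    "bayesian": (
        "Use step-by-step Bayesian inference: state a prior for each "
        "candidate diagnosis and update it with each finding."
    ),
}


def build_prompt(case_text: str, strategy: str = "traditional_cot") -> str:
    """Return a free-response prompt that applies one reasoning strategy to a case."""
    if strategy not in INSTRUCTIONS:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return (
        f"Clinical case:\n{case_text.strip()}\n\n"
        "Question: What is the most likely diagnosis?\n"
        f"{INSTRUCTIONS[strategy]}"
    )


# Example: the same (made-up) case rendered with two different strategies.
case = "A 34-year-old presents with fever, a new murmur, and splinter hemorrhages."
print(build_prompt(case, "traditional_cot"))
print(build_prompt(case, "differential_diagnosis"))
```

Keeping one instruction per strategy mirrors the idea of comparing reasoning styles on identical cases, so any accuracy difference can be attributed to the prompt rather than to the case text.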

Methods

The authors developed their diagnostic reasoning prompts through an iterative process of prompt engineering. They ran multiple rounds of experiments with different prompt types, measuring the accuracy of GPT-3.5 on the MedQA training set. Prompts that encouraged step-by-step reasoning without explicitly spelling out the steps performed better. Prompts that targeted a single diagnostic reasoning strategy also outperformed prompts that combined several strategies, which highlights the importance of specificity in prompt design for model accuracy.
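
One round of this kind of iteration can be sketched as a loop that scores each candidate prompt on the same development cases. Everything below is a stand-in: `accuracy_by_prompt`, the templates, the stub model, and the string-match grader are hypothetical, and a real free-response evaluation would rely on human raters rather than substring matching.

```python
from typing import Callable, Dict, List, Tuple

Case = Tuple[str, str]  # (case text, reference diagnosis)


def accuracy_by_prompt(
    templates: Dict[str, str],
    cases: List[Case],
    model: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
) -> Dict[str, float]:
    """Score each prompt template by its accuracy over the same set of cases."""
    if not cases:
        raise ValueError("at least one case is required")
    scores = {}
    for name, template in templates.items():
        hits = sum(
            is_correct(model(template.format(case=text)), reference)
            for text, reference in cases
        )
        scores[name] = hits / len(cases)
    return scores


# Toy setup: two hypothetical templates and a stub standing in for an LLM call.
templates = {
    "no_cot": "{case}\nAnswer:",
    "cot": "{case}\nLet's think step by step.",
}
cases = [("Case A", "appendicitis"), ("Case B", "pneumonia")]


def stub_model(prompt: str) -> str:
    # Pretend the model only answers usefully when asked to reason stepwise.
    return "Most likely appendicitis." if "step by step" in prompt else "Unknown."


def contains_reference(answer: str, reference: str) -> bool:
    # Crude stand-in for human grading of a free-text answer.
    return reference.lower() in answer.lower()


scores = accuracy_by_prompt(templates, cases, stub_model, contains_reference)
print(scores)
```

Scoring every template on the same cases keeps the comparison paired, which is what makes it reasonable to pick the better-performing prompt for the next round.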

Results

GPT-3.5 achieved 46% accuracy with traditional chain-of-thought (CoT) prompting, significantly higher than the 31% it achieved with zero-shot non-CoT prompting. With intuitive reasoning prompts it scored slightly higher, at 48%, while its accuracy was significantly lower with analytic reasoning (40%) and differential diagnosis (38%) prompts, with p-values indicating strong statistical significance for both declines. The drop with Bayesian inference prompts, to 42%, fell just short of statistical significance. Inter-rater agreement for the MedQA evaluation was high, with a Cohen's kappa of 0.93.
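
Cohen's kappa measures how much two raters agree beyond what chance alone would produce: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e is the agreement expected from each rater's label frequencies. A minimal sketch for two raters follows; the function name and the rating lists are toy assumptions, not the study's gradings.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so this sketch treats it as perfect agreement.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Toy gradings (1 = answer judged correct, 0 = incorrect); the raters differ on one item.
rater_a = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]
rater_b = [1, 1, 1, 0, 0, 1, 1, 0, 1, 1]
print(round(cohens_kappa(rater_a, rater_b), 2))  # 0.78
```

In the toy data the raters agree on 9 of 10 items (p_o = 0.9), yet kappa is only about 0.78 because much of that agreement would be expected by chance. Values of 0.93 and above therefore indicate very consistent grading.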

In contrast, GPT-4 demonstrated improved overall accuracy, achieving 76% with traditional CoT prompting. Its performance across different reasoning types was also notable, with 77% accuracy in intuitive reasoning, 78% in both differential diagnosis and analytic reasoning, and 72% in Bayesian inference. The inter-rater agreement for the GPT-4 MedQA evaluation was even higher at 0.98. Additionally, on the NEJM challenge case set, GPT-4 scored 38% with traditional CoT compared to 34% with differential diagnosis CoT, although this difference was not statistically significant. Overall, these findings indicate a marked improvement in performance from GPT-3.5 to GPT-4 across various reasoning tasks.
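
The MedQA figures reported above are easier to compare as percentage-point differences from each model's traditional CoT baseline. The sketch below uses only the accuracies quoted in this summary; the dictionary layout and the `delta_vs_baseline` helper are assumptions made for illustration, and the differences alone say nothing about statistical significance.

```python
from typing import Dict

# Accuracies (%) on the modified MedQA questions, as quoted in this summary.
ACCURACY: Dict[str, Dict[str, int]] = {
    "GPT-3.5": {
        "traditional_cot": 46,
        "intuitive": 48,
        "analytic": 40,
        "differential_diagnosis": 38,
        "bayesian": 42,
    },
    "GPT-4": {
        "traditional_cot": 76,
        "intuitive": 77,
        "analytic": 78,
        "differential_diagnosis": 78,
        "bayesian": 72,
    },
}


def delta_vs_baseline(
    accuracy: Dict[str, Dict[str, int]], baseline: str = "traditional_cot"
) -> Dict[str, Dict[str, int]]:
    """Percentage-point difference of each strategy from the baseline, per model."""
    return {
        model: {
            strategy: value - scores[baseline]
            for strategy, value in scores.items()
            if strategy != baseline
        }
        for model, scores in accuracy.items()
    }


deltas = delta_vs_baseline(ACCURACY)
for model, by_strategy in deltas.items():
    for strategy, delta in by_strategy.items():
        print(f"{model:8s} {strategy:24s} {delta:+d} points")
```

Laid out this way, GPT-3.5 ranges from +2 to -8 points around its baseline while GPT-4 ranges from +2 to -4, which is consistent with the summary's reading that GPT-4 holds its accuracy under diagnostic reasoning prompts.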

Discussion

The study evaluated GPT-3.5 and GPT-4 on clinical reasoning tasks using USMLE Step 2 and Step 3 questions, focusing on diagnostic ability. GPT-3.5 performed comparably with traditional and intuitive reasoning prompts but struggled significantly with differential diagnosis and analytic reasoning prompts, which indicates its limitations in imitating advanced clinical reasoning. GPT-4, by contrast, performed similarly across traditional and diagnostic reasoning prompts, suggesting improved reasoning abilities. However, diagnostic reasoning did not raise GPT-4's accuracy as it would for human clinicians. The authors propose three possible explanations: differences in reasoning mechanisms, post-hoc explanation of its diagnostic assessments, or a ceiling on accuracy imposed by the information provided in the vignette.

The study emphasizes the interpretability of GPT-4's responses: the model can generate rationales that allow clinicians to assess the validity of its diagnoses. This interpretability is crucial for mitigating the "black box" nature of LLMs, although clinicians must remain alert to potential reasoning errors. The findings underscore the need for further research to refine diagnostic reasoning prompts and to explore other models and languages, since the current study was limited to GPT-3.5 and GPT-4, US-centric questions, and the English language.