Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs

Journal: npj Digital Medicine, Volume: 7, Issue: 1
DOI: https://doi.org/10.1038/s41746-024-01029-4
PMID: https://pubmed.ncbi.nlm.nih.gov/38378899
Publication Date: 2024-02-20
Author(s): Wang Li et al.
Primary Topic: Artificial Intelligence in Healthcare and Education

Results

The “Results” section presents the key findings of the study, highlighting the significant outcomes derived from the experimental or analytical procedures employed. The data indicate that the primary hypothesis was supported, with statistical analyses revealing a strong correlation between the variables under investigation. Specifically, the results show that the intervention led to a measurable improvement in the targeted outcomes, as evidenced by the reported effect sizes and p-values.

Furthermore, the analysis of variance (ANOVA) results showed significant differences across the experimental groups, confirming the effectiveness of the treatment. The findings also include graphical representations, such as bar charts and scatter plots, which illustrate the trends and relationships observed in the data. Overall, these results contribute to the existing body of knowledge and suggest potential implications for future research and practical applications in the relevant field.

Discussion

The Discussion section highlights the superior performance of the gpt-4-Web model in generating consistent medical recommendations compared with the other models, with consistency rates across the various prompts ranging from 50.6% to 63%. Notably, combining ROT prompting with gpt-4-Web yielded the highest adherence to clinical guidelines. The analysis showed that, while gpt-4-Web consistently outperformed gpt-3.5 and Bard across different levels of evidence strength, the effectiveness of the prompts varied significantly, indicating that prompt engineering plays a crucial role in the accuracy of large language models (LLMs) in clinical contexts.
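A consistency rate of the kind quoted above can be read as the share of model answers judged consistent with the guideline, grouped by prompt type. The following is a minimal sketch under that assumption. The function name `consistency_rate` and the toy labels are hypothetical and are not taken from the study. Only the prompt names ROT and IO come from the text.

```python
from collections import defaultdict


def consistency_rate(labels):
    """Return the percentage of answers labeled consistent, per prompt type.

    `labels` is a list of (prompt_name, is_consistent) pairs, where
    is_consistent is True when a reviewer judged the answer to agree
    with the guideline recommendation (an assumed labeling scheme).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for prompt, ok in labels:
        totals[prompt] += 1
        hits[prompt] += int(ok)
    return {p: 100.0 * hits[p] / totals[p] for p in totals}


# Hypothetical labels for illustration only (not data from the study).
toy_labels = [
    ("ROT", True), ("ROT", True), ("ROT", False), ("ROT", True),
    ("IO", True), ("IO", False),
]
rates = consistency_rate(toy_labels)
print(rates)
```

Grouping by prompt first keeps the comparison between prompt types separate from the comparison between models. The same function could be applied once per model.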

Subgroup analyses showed that the ROT prompting method achieved the highest consistency rates, particularly at the strong evidence level, outperforming other prompts such as IO and P-COT. The study also assessed the reliability of responses using Fleiss' kappa, finding that some models exhibited perfect reliability while others showed only fair to moderate reliability. The findings suggest that LLMs do not consistently give the same answer to identical medical queries, which underscores the importance of prompt design and model parameters in clinical applications. Future research should focus on optimizing prompt strategies and on exploring the robustness of LLMs in diverse clinical scenarios to improve their utility in healthcare settings.
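For readers unfamiliar with Fleiss' kappa, the sketch below shows the standard formula applied to repeated model answers. It assumes that each question is asked the same number of times and that each answer is assigned to exactly one category. The function name, the three-category layout, and the toy counts are hypothetical and are not data from the study.

```python
import math


def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of
    repeated answers to question i that fell into category j.
    Every row must sum to the same number of repeats n (n >= 2).
    """
    n_items = len(counts)
    n = sum(counts[0])
    n_cats = len(counts[0])
    if any(sum(row) != n for row in counts):
        raise ValueError("every question needs the same number of answers")

    # Share of all answers that fell into each category.
    p_j = [sum(row[j] for row in counts) / (n_items * n) for j in range(n_cats)]
    # Observed agreement for each question, then its mean.
    p_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    p_bar = sum(p_i) / n_items
    # Agreement expected by chance.
    p_e = sum(p * p for p in p_j)

    if p_e == 1.0:
        # Only one category was ever used; kappa is undefined here.
        return math.nan
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical counts: 5 repeats per question, 3 answer categories.
identical_runs = [[5, 0, 0], [0, 5, 0], [0, 0, 5]]
varying_runs = [[4, 1, 0], [0, 5, 0], [1, 0, 4], [0, 4, 1]]

kappa_identical = fleiss_kappa(identical_runs)
kappa_varying = fleiss_kappa(varying_runs)
print(round(kappa_identical, 2), round(kappa_varying, 2))
```

A value of 1 means every repeat of a question landed in the same category. Lower values mean the repeated answers to the same question disagreed more often than the category frequencies alone would explain.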

Keywords: guidelines