Evaluation of LLMs accuracy and consistency in the registered dietitian exam through prompt engineering and knowledge retrieval

Journal: Scientific Reports, Volume: 15, Issue: 1
DOI: https://doi.org/10.1038/s41598-024-85003-w
PMID: https://pubmed.ncbi.nlm.nih.gov/39789057
Publication Date: 2025-01-09
Author(s): Iman Azimi et al.
Primary Topic: Topic Modeling

Overview

The authors highlight the growing role of large language models (LLMs) in health and well-being applications, particularly in patient engagement, clinical decision-making, and medical education. Despite advances in conversational applications, LLMs have rarely been evaluated specifically in the nutrition and diet domain. To address this gap, the study uses the Registered Dietitian (RD) exam as a benchmark for state-of-the-art LLMs, including GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro, across 1,050 questions covering a range of nutrition topics and proficiency levels.

The findings indicate that although the LLMs performed acceptably overall, their accuracy and consistency varied significantly with the prompting technique and the question domain. Notably, GPT-4o with Chain of Thought with Self-Consistency (CoT-SC) achieved the highest performance, while Gemini 1.5 Pro exhibited the best consistency with Zero-Shot (ZS) prompting. The study also found that CoT improved accuracy for both GPT-4o and Claude 3.5 Sonnet, while CoT-SC improved both accuracy and consistency. Retrieval Augmented Prompting (RAP) proved particularly effective for GPT-4o on Expert-level questions. The authors conclude that selecting an appropriate LLM and prompting strategy, tailored to the specific domain and proficiency level, is crucial for minimizing errors and risks in diet and nutrition chatbots.

Introduction

The paper evaluates several prompting techniques applied to large language models (LLMs) across the four domains of the RD exam: D1) Principles of Dietetics, D2) Nutrition Care for Individuals and Groups, D3) Food Service Systems, and D4) Management of Food and Nutrition Programs and Services. The question set comprises 1,050 questions in total: 237 in D1, 392 in D2, 185 in D3, and 236 in D4.
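As a quick sanity check on the domain breakdown, the short sketch below sums the per-domain counts quoted above and converts them to percentage shares of the exam. Only the four counts come from the summary; the variable names are illustrative.

```python
# Question counts per RD exam domain, as quoted in the summary above.
domain_counts = {"D1": 237, "D2": 392, "D3": 185, "D4": 236}

# Total number of questions across all four domains.
total_questions = sum(domain_counts.values())

# Share of the question set contributed by each domain, in percent.
domain_share = {
    domain: round(100 * count / total_questions, 1)
    for domain, count in domain_counts.items()
}

print(total_questions)  # 1050
print(domain_share)     # {'D1': 22.6, 'D2': 37.3, 'D3': 17.6, 'D4': 22.5}
```

D2 alone accounts for more than a third of the questions, so per-domain error counts are easier to compare once they are normalized by domain size.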

The effectiveness of prompt engineering techniques varies significantly across these domains. For instance, compared with Zero-Shot (ZS) prompting, Chain of Thought with Self-Consistency (CoT-SC) reduced the average error count in D3 from 27.4 to 12 and in D4 from 28.4 to 18.2. CoT and RAP also brought improvements, although RAP produced higher error rates in D2. Claude 3.5 Sonnet showed a general reduction in errors across all domains when moving from ZS to CoT or CoT-SC prompting. Gemini 1.5 Pro, by contrast, showed minimal changes in error counts across D1, D3, and D4, and ZS prompting gave its lowest error count in D2, albeit still higher than GPT-4o’s. Overall, the results underscore that the impact of a prompting strategy on LLM accuracy depends on both the model and the domain of the nutrition-related questions.
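To put the quoted error counts on a common scale, the following sketch converts them to per-domain error rates and relative reductions. It uses only the ZS and CoT-SC averages and the domain sizes given above; the function names are illustrative, and the summary does not state how many runs the averages were taken over.

```python
def error_rate(avg_errors: float, n_questions: int) -> float:
    """Average errors as a percentage of the questions in the domain."""
    return 100 * avg_errors / n_questions


def relative_reduction(before: float, after: float) -> float:
    """Percentage drop in errors when moving from `before` to `after`."""
    return 100 * (before - after) / before


# (domain size, average errors with ZS, average errors with CoT-SC)
reported = {"D3": (185, 27.4, 12.0), "D4": (236, 28.4, 18.2)}

for domain, (n, zs_errors, cot_sc_errors) in reported.items():
    print(
        domain,
        round(error_rate(zs_errors, n), 1),       # ZS error rate in %
        round(error_rate(cot_sc_errors, n), 1),   # CoT-SC error rate in %
        round(relative_reduction(zs_errors, cot_sc_errors), 1),
    )
# D3 14.8 6.5 56.2
# D4 12.0 7.7 35.9
```

On these figures the reduction is proportionally larger in D3 (roughly 56%) than in D4 (roughly 36%).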

Methods

In this study, the researchers benchmark three leading large language models (LLMs), GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro, using questions from the Registered Dietitian (RD) exam, which covers a wide range of nutrition topics. The primary performance metrics are accuracy and consistency. To assess them, the study employs four prompting techniques: Zero-Shot (ZS) prompting, Chain of Thought (CoT), Chain of Thought with Self-Consistency (CoT-SC), and Retrieval Augmented Prompting (RAP), which incorporates external nutrition knowledge.
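The difference between CoT and CoT-SC can be sketched in a few lines: CoT-SC samples several reasoning paths for the same question and keeps the most frequent final answer. The code below is a minimal illustration, not the authors' implementation. The prompt templates, the sample count of 5, and the `make_stub_model` helper are all assumptions, and `ask_model` stands in for a real LLM call that already returns the extracted answer letter.

```python
from collections import Counter
from typing import Callable, Iterable

# Hypothetical prompt templates; the paper's actual wording is not given here.
ZS_TEMPLATE = "Answer the following multiple-choice question with a single letter.\n{question}"
COT_TEMPLATE = (
    "Answer the following multiple-choice question. Think step by step, "
    "then give the final answer as a single letter.\n{question}"
)


def self_consistency_answer(
    ask_model: Callable[[str], str], prompt: str, n_samples: int = 5
) -> str:
    """Query the model n_samples times and return the most frequent answer."""
    answers = [ask_model(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]


def make_stub_model(canned_answers: Iterable[str]) -> Callable[[str], str]:
    """Stand-in for a real LLM call: replays canned answers in order."""
    replies = iter(canned_answers)
    return lambda prompt: next(replies)


prompt = COT_TEMPLATE.format(question="(question text and options go here)")
stub = make_stub_model(["B", "B", "C", "B", "A"])
final_answer = self_consistency_answer(stub, prompt, n_samples=5)
print(final_answer)  # B
```

Majority voting is why CoT-SC can improve consistency as well as accuracy: an occasional divergent reasoning path is outvoted by the others.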

The evaluation process involves comparing the LLM responses to established ground truth answers, allowing for a comprehensive analysis of each model’s performance across different prompting strategies. Additionally, inter-rater and intra-rater analyses are conducted to discern the strengths and weaknesses inherent in each approach. The section concludes with a detailed description of the RD exam, the LLMs, and the prompting techniques utilized, followed by an outline of the methods for response collection and analysis in this benchmarking exercise.
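The two metrics can be illustrated with a small sketch: accuracy against the ground truth, and chance-corrected agreement between two sets of answers. The summary does not name the agreement statistic used, so Cohen's kappa is an assumption here, chosen as a common measure. The two answer lists could represent either two repeated runs of one configuration (intra-rater) or two different prompting techniques (inter-rater). All data below is made up for illustration.

```python
from collections import Counter
from typing import Sequence


def accuracy(predicted: Sequence[str], truth: Sequence[str]) -> float:
    """Fraction of questions where the predicted option matches the ground truth."""
    correct = sum(p == t for p, t in zip(predicted, truth))
    return correct / len(truth)


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Chance-corrected agreement between two answer lists of equal length."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    labels = counts_a.keys() | counts_b.keys()
    # Agreement expected by chance from each list's label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1:
        return 1.0  # both lists use one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up example data: six questions, two sets of model answers.
truth = ["A", "B", "C", "D", "A", "B"]
run_1 = ["A", "B", "C", "D", "A", "C"]
run_2 = ["A", "B", "C", "A", "A", "C"]

print(round(accuracy(run_1, truth), 3))      # 0.833
print(round(cohens_kappa(run_1, run_2), 2))  # 0.76
```

Note that the two runs agree with each other on the last question while both being wrong, which is why consistency and accuracy are reported as separate metrics.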

Results

The “Results” section of the research paper presents key findings derived from the conducted experiments or analyses. It details the outcomes of various tests, highlighting significant trends and patterns observed in the data. The results are often supported by statistical analyses, which may include p-values, confidence intervals, or effect sizes, providing a quantitative basis for the conclusions drawn.

Additionally, the section may include visual representations such as graphs or tables to illustrate the findings clearly. These visual aids serve to enhance the understanding of the results and facilitate comparisons across different experimental conditions or groups. Overall, the results underscore the main hypotheses of the study and contribute to the broader implications of the research within the field.

Discussion

The discussion compares the three large language models (GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro) across the prompting techniques on the 1,050 Registered Dietitian (RD) exam questions. All models scored above 88%. GPT-4o with Chain of Thought with Self-Consistency (CoT-SC) prompting achieved the highest accuracy (91-95%) and the lowest error count (58 errors on average). In contrast, Gemini 1.5 Pro performed worst, particularly with CoT prompting, which produced the highest error counts. The analysis also showed that while CoT and CoT-SC improved accuracy across proficiency levels, performance varied significantly among models, with GPT-4o consistently outperforming the others.
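The error count and the accuracy figures above are two views of the same quantity. The sketch below converts between them for the 1,050-question set, using only the numbers quoted in the summary; the function names are illustrative.

```python
TOTAL_QUESTIONS = 1050


def accuracy_from_errors(errors: float, total: int = TOTAL_QUESTIONS) -> float:
    """Accuracy in percent implied by an error count."""
    return 100 * (total - errors) / total


def errors_at_accuracy(accuracy_pct: float, total: int = TOTAL_QUESTIONS) -> float:
    """Error count implied by an accuracy given in percent."""
    return total * (100 - accuracy_pct) / 100


# 58 errors on average (GPT-4o with CoT-SC) corresponds to about 94.5% accuracy.
print(round(accuracy_from_errors(58), 2))  # 94.48

# A score above 88% means fewer than this many errors out of 1,050.
print(round(errors_at_accuracy(88)))       # 126
```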

The discussion also examines the consistency of the models’ responses, assessed through inter-rater and intra-rater reliability analyses. GPT-4o showed high agreement levels, particularly between the CoT and CoT-SC prompting techniques, while Gemini 1.5 Pro showed lower consistency. The findings underscore that the prompting technique influences both accuracy and consistency: CoT improves performance on reasoning tasks but introduces variability in responses. Retrieval Augmented Prompting (RAP) gave mixed results, improving performance in some areas while raising error rates in others, particularly for Gemini. Overall, the study emphasizes the potential of LLMs in nutrition-related tasks while highlighting the need to choose prompting strategies carefully to make them effective in practical applications.