Comparative Analysis of Large Language Models in Hemodialysis Vascular Access: ChatGPT-5, Gemini-2.5, and DeepSeek-V3

Journal: The European Research Journal
DOI: https://doi.org/10.18621/eurj.1839146
Publication Date: 2026-01-20
Author(s): Muhammet Hüseyin Erkan et al.
Primary Topic: Central Venous Catheters and Hemodialysis

Overview

This study investigates the effectiveness of three large language models (LLMs), ChatGPT-5, Gemini-2.5, and DeepSeek-V3, in addressing patient-centered questions related to hemodialysis vascular access. Twenty-five frequently asked questions were compiled and submitted to each model, and four cardiovascular surgeons rated the responses on a 5-point Likert scale for accuracy, clarity, and scientific depth. Statistical analyses revealed that DeepSeek-V3 outperformed the other models in scientific depth and overall reliability, while ChatGPT-5 received significantly lower accuracy scores than both Gemini-2.5 and DeepSeek-V3.

The findings suggest that while DeepSeek-V3 is a more reliable source for patient education regarding hemodialysis vascular access, the clarity of responses across all models indicates their potential as supportive tools. However, the study emphasizes the need for critical verification of LLM-generated information, particularly given the ethical implications of patients placing trust in AI-based resources. Clinicians are urged to review and validate LLM-based educational materials to ensure their accuracy and clinical relevance, thereby mitigating risks associated with misinformation.

Methods

The study assessed the performance of three large language models (LLMs) in addressing patient inquiries regarding hemodialysis vascular access. The models evaluated were ChatGPT-5, Gemini-2.5, and DeepSeek-V3, accessed through their public interfaces on August 21, 2025. A panel of expert physicians conducted a blinded evaluation of the models’ responses to a set of 25 questions covering fistula surgery preparation, maturation, daily life considerations, complications, and catheter care. The questions were identified through systematic analysis of online platforms and clinical observations.

To ensure the integrity of the data collection, each question was presented with a standardized prompt, and precautions were taken to prevent bias by clearing browser cache and cookies before each new session. Responses were recorded in plain text and anonymized for evaluation. A panel of four cardiovascular surgery specialists scored the responses on a 5-point Likert scale based on three criteria: accuracy of medical content, clarity and comprehensibility for patients, and scientific depth. A calibration meeting was held to standardize interpretation of the scoring criteria, and a minimum of 72 hours was allowed between evaluations to mitigate memory effects.
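The scoring scheme described above can be summarized per model and criterion by averaging the four raters' scores for each question and then averaging across questions. The following is a minimal Python sketch under that assumption. The source does not state how the authors aggregated scores, and the model labels, the data layout, the function name `mean_scores`, and the ratings shown are hypothetical illustrations, not study data.

```python
from statistics import mean

# Hypothetical toy ratings (NOT study data).
# ratings[model][criterion] is a list of questions; each question holds
# the four raters' scores on the 1-5 Likert scale.
ratings = {
    "Model A": {
        "accuracy": [[4, 5, 4, 4], [3, 4, 4, 3]],
        "clarity": [[5, 5, 4, 5], [4, 4, 5, 4]],
    },
    "Model B": {
        "accuracy": [[3, 3, 4, 2], [2, 3, 3, 2]],
        "clarity": [[4, 5, 4, 4], [4, 4, 4, 5]],
    },
}


def mean_scores(ratings):
    """Average raters within each question, then questions within each criterion."""
    summary = {}
    for model, criteria in ratings.items():
        summary[model] = {}
        for criterion, questions in criteria.items():
            for question in questions:
                # Reject anything outside the 5-point Likert range.
                if any(score not in (1, 2, 3, 4, 5) for score in question):
                    raise ValueError(f"non-Likert score in {model}/{criterion}")
            per_question = [mean(question) for question in questions]
            summary[model][criterion] = mean(per_question)
    return summary


if __name__ == "__main__":
    for model, criteria in mean_scores(ratings).items():
        print(model, criteria)
```

Averaging raters first keeps each question equally weighted even if a rater's score were missing for some question; other aggregation choices (for example, medians for ordinal data) are equally plausible.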

Results

Response length differed across the various experimental conditions. Specifically, the analysis revealed that responses generated under Condition A had a mean word count of $M_A$ words, while Condition B yielded a mean of $M_B$ words, with $M_A \neq M_B$ (p < 0.05). Furthermore, Condition C produced the shortest responses, averaging $M_C$ words, significantly fewer than under either Condition A or Condition B (p < 0.01). In addition to response length, the quality of the generated answers was assessed using a predefined rubric. Responses from Condition A were rated significantly higher in quality than those from Conditions B and C, with mean quality scores of $Q_A$, $Q_B$, and $Q_C$ respectively, where $Q_A > Q_B$ (p < 0.05) and $Q_A > Q_C$ (p < 0.01). These findings suggest that the experimental conditions influenced not only the length of the responses but also their quality, highlighting the importance of context in generating effective answers.
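A three-group comparison of this kind can be illustrated with a rank-based test. The sketch below assumes a Kruskal-Wallis test, which is a common choice for word counts and ordinal ratings; the source does not name the test that produced the reported p-values, so this is an illustration rather than the authors' method. The function names and the sample word counts are hypothetical. With exactly three groups the chi-square approximation has 2 degrees of freedom, for which the p-value has the closed form exp(-H/2); the approximation is rough for small samples.

```python
import math
from collections import Counter


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def kruskal_wallis_3(a, b, c):
    """Kruskal-Wallis H and approximate p-value for exactly three groups.

    Raises ZeroDivisionError if every pooled value is identical.
    """
    groups = [a, b, c]
    pooled = [x for group in groups for x in group]
    n = len(pooled)
    ranks = average_ranks(pooled)

    total = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)

    # Standard tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n).
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    h /= 1 - ties / (n ** 3 - n)

    p = math.exp(-h / 2)  # chi-square survival function with 2 df
    return h, p


if __name__ == "__main__":
    # Hypothetical word counts per condition (NOT study data).
    cond_a = [310, 295, 340, 322]
    cond_b = [260, 275, 250, 281]
    cond_c = [180, 195, 170, 205]
    h, p = kruskal_wallis_3(cond_a, cond_b, cond_c)
    print(f"H = {h:.3f}, p = {p:.4f}")
```

A significant omnibus result only says that at least one condition differs; pairwise statements such as "C is shorter than A and B" need follow-up comparisons with a multiple-testing correction.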

Discussion

The discussion evaluates the performance of three large language models (LLMs), ChatGPT-5, Gemini-2.5, and DeepSeek-V3, in the context of patient education for hemodialysis vascular access. Using IBM SPSS Statistics for data analysis, the study found that DeepSeek-V3 significantly outperformed the other models in accuracy and scientific depth, while all models demonstrated similar clarity in their responses. Notably, DeepSeek-V3’s higher word count correlated with its superior scientific depth scores, suggesting a more in-depth and evidence-based presentation of medical information. However, the study cautions that overly lengthy responses may confuse patients, emphasizing the need for a balance between detail and clarity.

Conversely, ChatGPT-5 received a notably lower accuracy score, driven in particular by its failure to provide critical information regarding fistula maturation. The study interprets this as a higher risk of hallucinations in its responses, reinforcing the necessity for critical evaluation of LLM outputs in medical contexts. Gemini-2.5, while not excelling in any specific area, maintained consistent performance across the evaluated domains, indicating a conservative approach that may be beneficial in patient education. The findings underscore the importance of multi-rater evaluations to mitigate subjective biases in assessing LLM outputs, as different evaluators may judge response quality differently based on their clinical experience. Overall, while DeepSeek-V3 emerges as a more reliable source for scientific depth, the clarity achieved by all models indicates their potential as supportive tools in patient education, provided that clinicians oversee the accuracy of the information.

Limitations

This study presents a robust evaluation of three large language models (LLMs) in the domain of hemodialysis vascular access. Its strengths include the use of clinically relevant questions and independent expert assessments, both of which enhance reliability. The clear differentiation of each model’s performance in terms of accuracy, clarity, and scientific depth provides valuable insights into their respective capabilities in patient education.

However, several limitations must be acknowledged. The evaluation was conducted at a specific time point (August 2025), which may not reflect the models’ future performance due to ongoing updates and variations in model parameters that influence output randomness. Additionally, while the 25-question set is comprehensive, it may not encompass all potential scenarios related to hemodialysis access. The exclusive reliance on cardiovascular surgeons for evaluation restricts the generalizability of findings, as input from nephrologists and dialysis nurses is absent. Furthermore, the intraclass correlation coefficient (ICC) analyses were inconclusive due to the subjective nature of expert evaluations, potentially leading to variability in results. Lastly, the study does not assess the real-world impact of LLM-generated responses on patient comprehension, compliance, and satisfaction, highlighting a critical area for future investigation.
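For readers unfamiliar with the intraclass correlation coefficient, the following Python sketch shows one way inter-rater agreement can be computed from a responses-by-raters score table. It assumes the two-way random-effects, absolute-agreement, single-rater form, often written ICC(2,1); the source does not say which ICC form the authors used, and the function name `icc_2_1` and the example ratings are hypothetical.

```python
def icc_2_1(scores):
    """ICC(2,1) from a table where scores[i][j] is rater j's score for response i.

    Two-way random effects, absolute agreement, single rater.
    Raises ZeroDivisionError if every score in the table is identical.
    """
    n = len(scores)                      # number of rated responses
    k = len(scores[0]) if n else 0       # number of raters
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need a rectangular table with >= 2 rows and >= 2 raters")

    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


if __name__ == "__main__":
    # Hypothetical Likert ratings: 5 responses scored by 4 raters (NOT study data).
    example = [
        [4, 5, 4, 4],
        [3, 3, 4, 3],
        [5, 5, 5, 4],
        [2, 3, 2, 3],
        [4, 4, 5, 4],
    ]
    print(f"ICC(2,1) = {icc_2_1(example):.3f}")
```

Values near 1 indicate that raters largely agree on absolute scores, while values near 0 indicate that rater disagreement dominates the differences between responses.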