Benchmarking readability, reliability, and scientific quality of large language models in communicating organoid science

Journal: Frontiers in Bioengineering and Biotechnology, Volume: 14
DOI: https://doi.org/10.3389/fbioe.2026.1750225
PMID: https://pubmed.ncbi.nlm.nih.gov/41625978
Publication Date: 2026-01-16
Author(s): Meng Sun et al.
Primary Topic: Artificial Intelligence in Healthcare and Education

Overview

This study evaluates how well five large language models (LLMs), namely GPT-5, DeepSeek, Doubao, Tongyi Qianwen, and Wenxin Yiyan, explain organoids, a key technology in precision oncology. Using a curated set of thirty organoid-related questions, the authors assessed model responses with the C-PEMAT-P scale, the Global Quality Score (GQS), and several readability indices. The results showed significant performance disparities: GPT-5 achieved the highest scores (C-PEMAT: 16.05 ± 1.10; GQS: 4.70 ± 0.47), while Tongyi Qianwen and Wenxin Yiyan scored lowest (C-PEMAT: 7.85 ± 1.09 and 9.00 ± 2.05; GQS: 1.55 ± 0.51 and 2.10 ± 0.55, respectively). Readability also varied significantly across models and question types, with complex topics producing the highest linguistic difficulty.
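The scores above are reported in "mean ± SD" form. As a minimal sketch of how such a summary is typically computed, the snippet below formats a list of ratings that way. The ratings, the model labels, the 1-5 rating range, and the choice of the sample (n - 1) standard deviation are all assumptions made for illustration; none of them are taken from the study.

```python
from statistics import mean, stdev


def summarize(scores):
    """Format ratings as 'mean ± SD' using the sample (n - 1) standard deviation."""
    return f"{mean(scores):.2f} ± {stdev(scores):.2f}"


# Hypothetical ratings on an assumed 1-5 scale; NOT data from the study.
ratings = {
    "model_a": [5, 4, 5, 5, 4],
    "model_b": [2, 1, 2, 1, 2],
}

for name, scores in ratings.items():
    print(name, summarize(scores))
```

Whether the authors used the sample or the population standard deviation is not stated in this summary; swapping `stdev` for `pstdev` would give the population version.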

The study concludes that the variability in LLM performance has critical implications for patient education and translational decision-making in oncology. The weak correlation between readability and scientific quality suggests that linguistic simplification does not guarantee reliable interpretation of organoid information. These findings underscore the necessity for AI systems tailored to organoid science, emphasizing the integration of domain-specific knowledge and the clear communication of uncertainty to foster safe and equitable translation of scientific information.
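To make the idea of a weak readability-quality correlation concrete, here is a small sketch of a rank correlation between the two measures. This summary does not say which coefficient the authors used; Spearman's rho is assumed here only because it is a common choice for ordinal ratings. The readability values and quality ratings are hypothetical.

```python
from statistics import mean


def ranks(values):
    """Assign 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(ranks(x), ranks(y))


# Hypothetical per-response values; NOT data from the study.
readability = [62.0, 48.5, 55.0, 70.2, 41.3, 58.8]
quality = [3, 4, 2, 3, 5, 4]
print(f"Spearman rho = {spearman(readability, quality):.2f}")
```

A coefficient near zero would mean that knowing how easy a response is to read says little about how scientifically sound it is, which is the pattern the study describes.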

Introduction

The introduction of this research paper highlights the rapid advancement of organoid technology as a pivotal platform in bioengineering and translational oncology. Organoids provide three-dimensional models that better mimic human tissue architecture and treatment responses compared to traditional models, facilitating applications in precision oncology such as drug screening and early therapeutic development. As the use of organoids expands, the need for clear and accurate communication regarding their complexities has become increasingly critical. Large language models (LLMs) have emerged as key tools for disseminating organoid-related information; however, their performance in this domain remains inadequately characterized, raising concerns about the accuracy and reliability of the information they provide.

The study aims to systematically evaluate the performance of five widely used LLMs in communicating organoid science. It employs a multi-dimensional benchmarking analysis that assesses model outputs across various domains, including technical cognition, diagnostic value, safety concerns, and cost considerations. The evaluation utilizes validated metrics for patient education, scientific quality, and readability, revealing significant disparities in readability, scientific accuracy, and reliability among the models. The findings underscore the necessity for careful integration of LLMs into biomedical communication, particularly in high-complexity fields like organoid research, and suggest that future AI systems must be tailored to meet the specific conceptual and ethical demands of this area.

Results

The results show statistically significant differences among the five models (p < 0.05). Performance was tiered: GPT-5 scored highest on both educational suitability and scientific quality (C-PEMAT: 16.05 ± 1.10; GQS: 4.70 ± 0.47), while Tongyi Qianwen and Wenxin Yiyan scored lowest (C-PEMAT: 7.85 ± 1.09 and 9.00 ± 2.05; GQS: 1.55 ± 0.51 and 2.10 ± 0.55, respectively).

Readability also varied significantly across models and question types, with complex topics producing the highest linguistic difficulty. Readability correlated only weakly with scientific quality, so simpler language did not reliably indicate a more trustworthy answer.

Discussion

The discussion section of the research paper addresses the ethical considerations and methodological framework of the study, which evaluates the communication of organoid technology concepts by large language models (LLMs). The study emphasizes that all data were generated by LLMs without involving human participants or sensitive information, thus not requiring ethical approval. A structured set of 30 questions, developed by specialists in organoid biology and translational oncology, was used to assess the practical information needs related to organoid technology. These questions were categorized into five domains: Technical Cognition, Diagnostic and Therapeutic Value, Safety Concerns, Cost and Process, and Decision Reference. The responses from five widely accessible LLMs were analyzed for readability, reliability, and scientific quality, revealing significant performance disparities among the models.
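The five-domain design described above lends itself to a per-domain breakdown of scores. The sketch below shows one way to organize scored responses and average them within each domain. The domain names come from the study; the `ScoredResponse` record, its fields, and the `mean_by_domain` helper are hypothetical names introduced for illustration, and how the 30 questions are distributed across domains is not assumed.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

# Domain names as listed in the study.
DOMAINS = (
    "Technical Cognition",
    "Diagnostic and Therapeutic Value",
    "Safety Concerns",
    "Cost and Process",
    "Decision Reference",
)


@dataclass(frozen=True)
class ScoredResponse:
    """One rated model response (hypothetical record layout)."""
    model: str
    domain: str
    quality: float


def mean_by_domain(responses):
    """Average the quality rating within each domain, in the order of DOMAINS."""
    buckets = defaultdict(list)
    for r in responses:
        if r.domain not in DOMAINS:
            raise ValueError(f"unknown domain: {r.domain}")
        buckets[r.domain].append(r.quality)
    # Domains with no responses are left out rather than reported as zero.
    return {d: mean(buckets[d]) for d in DOMAINS if d in buckets}
```

Calling `mean_by_domain` on the responses of a single model would show whether that model is consistently strong or only strong in some domains.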

Key findings indicate that model performance is tiered, with GPT-5 consistently outperforming others in educational suitability and scientific quality. In contrast, models like Tongyi Qianwen and Wenxin Yiyan exhibited fragmented logic and limited educational relevance. The analysis also highlighted that readability and content quality function as partially independent dimensions; while higher-quality responses often contained moderate complexity, overly simplified responses lacked the depth necessary for informed decision-making. The study underscores the importance of using reliable models for communicating complex biotechnological concepts, as inaccuracies and oversimplifications can lead to misinterpretations and hinder effective patient communication. Overall, the findings suggest that while LLMs can be valuable tools, their deployment in clinical contexts must be approached with caution to ensure the integrity and reliability of the information conveyed.