Evaluating tonsillectomy-related YouTube videos via a human expert review and the ChatGPT-4: a multi-method quality analysis

Journal: BMC Medical Education, Volume: 25, Issue: 1
DOI: https://doi.org/10.1186/s12909-025-07739-x
PMID: https://pubmed.ncbi.nlm.nih.gov/40790575
Publication Date: 2025-08-11
Author(s): Serkan Şerifler et al.
Primary Topic: Health Literacy and Information Accessibility

Overview

This study examines the quality and reliability of tonsillectomy-related content on YouTube using a multi-method framework that combines expert human review, analysis by the large language model ChatGPT-4, and readability assessment of video transcripts. A total of 76 English-language videos were evaluated, with two otolaryngologists rating quality using the DISCERN instrument and the JAMA benchmarks. Professionally produced videos outperformed patient-generated content on these quality metrics. ChatGPT-4’s accuracy scores correlated significantly with JAMA ratings ($\rho = 0.56$), and its completeness scores correlated with DISCERN scores ($\rho = 0.72$). Visually rich videos also received higher ChatGPT-4 accuracy scores than transcript-heavy ones (Cohen’s $d = 0.600$, $p = 0.030$), suggesting that visual context enhances interpretability. However, the mean readability of the transcripts (FKGL = 8.38) exceeded the sixth-grade level recommended for patient education materials.

The findings underscore how variable the quality of tonsillectomy-related YouTube content is, and they support combining expert assessment with LLM evaluation for a more comprehensive analysis. Physician-produced videos consistently showed higher quality, while patient-generated content often lacked medical accuracy. The higher AI accuracy scores for visually rich videos suggest that incorporating visual elements can improve educational effectiveness. The study calls for prioritizing multimodal evaluation and design in future digital health content to address readability concerns and improve accessibility for patients.
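The two effect sizes reported above, a correlation coefficient $\rho$ and Cohen’s $d$, can be illustrated with a short Python sketch. It assumes that $\rho$ denotes Spearman’s rank correlation (the usual meaning of the symbol; the summary does not name the method) and that Cohen’s $d$ uses the pooled standard deviation. All function names and sample values are hypothetical and are not study data.

```python
from math import sqrt
from statistics import mean, variance


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; fails with ZeroDivisionError on constant input."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def cohens_d(group_a, group_b):
    """Cohen's d with a pooled standard deviation (an assumed variant)."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical scores for six videos (illustrative only).
ai_completeness = [3, 4, 4, 5, 5, 2]
discern_totals = [45, 52, 58, 61, 70, 40]
print(round(spearman_rho(ai_completeness, discern_totals), 3))
print(cohens_d([2, 4, 6], [1, 3, 5]))  # 0.5
```

In practice a statistics package would normally be used for these calculations; the hand-written version only makes the rank transformation and the pooling step explicit.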

Introduction

Tonsillectomy is a common surgical procedure in otolaryngology, with approximately 530,000 operations performed annually in the United States. Although generally safe, the procedure can be followed by pain, feeding difficulties, and parental anxiety, so patients and families need reliable information to navigate recovery. Digital technologies, particularly social media platforms such as YouTube, have transformed how people seek medical information, but concerns about the quality and credibility of this content persist. Previous studies indicate that many health-related YouTube videos lack accuracy, with only a small fraction deemed medically useful.

To address the gap in trustworthy tonsillectomy information, the study evaluates YouTube videos about the procedure using a comprehensive methodology. It combines established human evaluation tools (the DISCERN and JAMA criteria) with assessments from the AI model ChatGPT-4 and readability metrics to characterize the educational quality of these videos systematically. This multi-faceted approach assesses video reliability and also explores the potential of AI to support the evaluation of online health information, with the aim of improving patient education and decision-making.

Methods

The study used a cross-sectional design. The sample reflects YouTube content available during May and June 2025 and comprised 76 English-language tonsillectomy-related videos that met the eligibility criteria. Two otolaryngologists rated each video with the DISCERN instrument and the JAMA benchmarks, and inter-rater reliability was assessed. Videos were compared by source, distinguishing patient-generated content from videos produced by healthcare professionals or institutions.

ChatGPT-4 rated the videos for accuracy and completeness; its evaluation was limited to the transcripts and did not cover visual or auditory elements. Transcript readability was measured with the Flesch-Kincaid Grade Level (FKGL). Agreement between ChatGPT-4 scores and expert ratings was summarized with correlation coefficients ($\rho$), and the difference between visually rich and transcript-heavy videos was reported as Cohen’s $d$.

Results

In this study, a total of 76 YouTube videos that fulfilled the eligibility criteria were analyzed. The videos garnered a mean view count of 167,743.1 (± 72,241.5) and had an average upload age of 29.7 months (± 2.62). Engagement metrics revealed an average of 1,664.4 likes (± 703.5) and 175.5 comments (± 51.17) per video.

Content quality was assessed using the DISCERN and JAMA benchmarks, yielding mean scores of 56.3 (± 8.7) and 2.41 (± 0.88), respectively. For context, totals on the standard 16-item DISCERN instrument range from 16 to 80, and JAMA benchmark scores range from 0 to 4. ChatGPT-4 rated content accuracy at a mean of 4.51 (± 0.66) and completeness at a mean of 4.26 (± 0.71).
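As a rough illustration of how such totals arise, the sketch below scores one video under the standard definitions of the two instruments: DISCERN has 16 items, each rated 1 to 5, and the JAMA benchmarks award one point each for authorship, attribution, disclosure, and currency. The helper names are hypothetical. Averaging the two raters’ scores is an assumption, since the summary does not say how the otolaryngologists’ ratings were combined.

```python
from statistics import mean

DISCERN_ITEMS = 16  # standard instrument: 16 items, each rated 1..5
JAMA_CRITERIA = ("authorship", "attribution", "disclosure", "currency")


def discern_total(item_scores):
    """Sum the 16 DISCERN item ratings (possible totals: 16..80)."""
    if len(item_scores) != DISCERN_ITEMS:
        raise ValueError("DISCERN expects exactly 16 item ratings")
    if any(not 1 <= score <= 5 for score in item_scores):
        raise ValueError("each DISCERN item is rated from 1 to 5")
    return sum(item_scores)


def jama_score(criteria_met):
    """Count how many of the four JAMA benchmarks are met (0..4)."""
    return sum(1 for name in JAMA_CRITERIA if criteria_met.get(name, False))


def combine_raters(score_a, score_b):
    """Assumed combination rule: the mean of the two raters' scores."""
    return mean([score_a, score_b])


# Hypothetical ratings for a single video (illustrative only).
rater_a = discern_total([4] * 16)            # 64
rater_b = discern_total([3] * 8 + [4] * 8)   # 56
print(combine_raters(rater_a, rater_b))      # 60
print(jama_score({"authorship": True, "currency": True}))  # 2
```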

Discussion

This study evaluated the educational quality and medical accuracy of YouTube videos related to tonsillectomy using a multi-method approach that integrated human expert assessments, AI evaluations via ChatGPT-4, and readability analyses. The overall quality of the videos was moderate, with significant discrepancies between patient-generated content and videos produced by healthcare professionals or institutions. Specifically, patient-generated videos received lower scores on both the DISCERN and JAMA benchmarks, in line with previous research that emphasizes the need for professional involvement in creating patient-facing health content.

The study also found moderate-to-strong correlations between ChatGPT-4’s evaluations and those of the expert reviewers ($\rho = 0.56$ for accuracy versus JAMA ratings and $\rho = 0.72$ for completeness versus DISCERN scores), indicating the potential of ChatGPT-4 as a reliable tool for assessing health information quality. The readability analysis showed that the mean Flesch-Kincaid Grade Level (FKGL) of 8.38 exceeded the sixth-grade level recommended for patient education materials, raising concerns about accessibility for viewers with lower health literacy. While AI models such as ChatGPT-4 can enhance the evaluation of online health content, future studies are needed to develop multimodal AI capabilities that can analyze both visual and auditory elements to improve the comprehensibility and educational impact of health-related videos.
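The FKGL figure can be made concrete with the standard Flesch-Kincaid formula, $0.39 \times (\text{words}/\text{sentences}) + 11.8 \times (\text{syllables}/\text{words}) - 15.59$. The Python sketch below applies it with a deliberately crude syllable heuristic. The function names are hypothetical, and the tool the authors actually used is not stated in this summary, so scores from this sketch will not match theirs exactly.

```python
import re


def count_syllables(word):
    """Rough heuristic: count vowel groups, discounting a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def fkgl(text):
    """Flesch-Kincaid Grade Level of a transcript, using the standard formula."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


# Hypothetical transcript snippet (illustrative only).
snippet = "Your child may have a sore throat after surgery. Offer cold drinks often."
print(round(fkgl(snippet), 2))
```

Shorter sentences and words with fewer syllables both lower the grade level, which is the practical route to bringing a transcript closer to the sixth-grade target.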

Limitations

Several limitations affect the interpretation and applicability of the findings. First, the analysis was confined to English-language videos, which restricts generalizability to non-English-speaking audiences and introduces selection bias. Language and cultural context shape both the creation and the reception of health education content, because communication styles and health beliefs can vary considerably across populations.

Second, although the study achieved excellent inter-rater reliability, the inherent subjectivity of human assessment remains a concern. Third, as a cross-sectional study, the findings represent a snapshot of the YouTube landscape during May and June 2025 and may not reflect ongoing changes in online content and search algorithms. Fourth, ChatGPT-4’s evaluation was limited to transcripts, omitting visual and auditory elements that could also affect educational quality. Previous research by Yüce et al. indicated that the model may struggle with lengthy or complex video content, and the potential positive correlation between video length and quality was not explored in this study. Future research should incorporate multilingual datasets and cross-cultural comparisons to improve understanding of video quality and accessibility across diverse audiences.