Evaluating the quality of medical content on YouTube using large language models

Journal: Scientific Reports, Volume: 15, Issue: 1
DOI: https://doi.org/10.1038/s41598-025-94208-6
PMID: https://pubmed.ncbi.nlm.nih.gov/40121315
Publication Date: 2025-03-22
Author(s): Mahmoud I. Khalil et al.
Primary Topic: Health Literacy and Information Accessibility

Overview

The study investigates how well large language models (LLMs) can evaluate the quality of medical content on YouTube, a platform increasingly relied on for health information despite the prevalence of inaccurate content. Using the DISCERN instrument, the researchers prompted twenty LLMs to rate a curated set of videos that experts had previously assessed. Inter-rater agreement between LLM and expert ratings, measured with Brennan-Prediger's kappa, varied widely, ranging from -1.10 to 0.82. LLMs tended to assign higher quality ratings than the human experts, and the discrepancies were most pronounced on individual assessment questions. Incorporating scoring guidelines into the prompts improved model performance.
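
As an illustration of the agreement statistic named above, the following is a minimal sketch of Brennan-Prediger's kappa for two raters, which replaces the data-dependent chance term of Cohen's kappa with a uniform one. The function name, the five-point scale, and the optional linear ordinal weights are assumptions made for this example; the summary does not state which variant the authors used. Because the unweighted coefficient cannot fall below -1 for any number of categories, the reported minimum of -1.10 suggests that some weighted variant was applied.

```python
from typing import Sequence


def brennan_prediger_kappa(
    rater_a: Sequence[int],
    rater_b: Sequence[int],
    num_categories: int = 5,      # assumed 1..5 rating scale
    linear_weights: bool = False,  # assumed option, not confirmed by the source
) -> float:
    """Brennan-Prediger kappa for two raters scoring the same items.

    Ratings are integers in 1..num_categories. Chance agreement is
    computed under a uniform distribution over categories.
    """
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("Both raters must score the same non-empty item list.")

    span = num_categories - 1

    def weight(i: int, j: int) -> float:
        # Unweighted: full credit only for exact matches.
        if not linear_weights:
            return 1.0 if i == j else 0.0
        # Linear ordinal weights: partial credit for near misses.
        return 1.0 - abs(i - j) / span

    observed = sum(weight(a, b) for a, b in zip(rater_a, rater_b)) / n
    categories = range(1, num_categories + 1)
    chance = sum(weight(i, j) for i in categories for j in categories) / num_categories**2
    return (observed - chance) / (1.0 - chance)


if __name__ == "__main__":
    expert = [1, 2, 3, 4, 5]  # hypothetical expert ratings
    model = [2, 2, 4, 4, 5]   # hypothetical LLM ratings
    print(brennan_prediger_kappa(expert, model))
    print(brennan_prediger_kappa(expert, model, linear_weights=True))
```

In the unweighted case the chance term reduces to 1 divided by the number of categories, so the result depends only on the share of exact matches.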

Assessing health-related information remains a complex challenge, made harder by the sheer volume of online content and a shortage of expert reviewers, and LLMs present a promising solution. Their advanced attention capabilities could significantly improve the evaluation of health information on the web, provided that certain practical issues are addressed. The findings suggest that LLMs could serve as standalone expert systems or be integrated into traditional recommendation frameworks to help mitigate quality concerns about online medical videos.

Methods

The methodology is illustrated in Figure 1 and consists of five distinct steps, each of which is elaborated in its own section of the paper. This structured, step-by-step presentation is intended to make the research design clear and replicable.

Results

The study's dataset comprises 194 videos across five distinct topics, evaluated by seven experts. The authors argue that this diversity in both content and evaluators supports the generalizability of the findings and strengthens the conclusions drawn about content-quality evaluation by large language models (LLMs). Nonetheless, they suggest that future research should expand the scope by incorporating videos on additional topics and a broader range of expert assessments to clarify the capabilities of LLMs in this domain.

Discussion

The discussion highlights the capabilities and limitations of large language models (LLMs) in evaluating the quality of health-related content, particularly on platforms like YouTube. Recent studies indicate that while LLMs excel at generating fluent summaries and at clinical reasoning, they often struggle with detailed medical inquiries because it is difficult for them to maintain accurate and current information. Retrieval-augmented models, such as Almanac, have shown promise in improving the accuracy of health-related responses. General-purpose models like ChatGPT still fall short in reliability on complex health questions, whereas fine-tuned models like Med-PaLM 2 demonstrate near-human-level performance. The use of LLMs has also expanded beyond text to audio and video analysis, with successful applications in detecting health conditions from YouTube audio data and in annotating objects and activities in video content.

The findings reveal that certain LLMs show substantial agreement with human experts when evaluating health-related videos using the DISCERN instrument, although performance varies significantly across models. Factors influencing this variability include training datasets, model size, and internal architecture. LLMs tend to assign higher scores than human experts, which may stem from discrepancies in qualitative assessments and from the inherent subjectivity of expert ratings. The study also emphasizes the importance of prompt engineering, demonstrating that guided-scoring prompting can improve model performance, particularly on specific evaluation criteria. Overall, while LLMs offer a promising avenue for assessing health information quality, deploying them in sensitive domains like medicine requires careful validation and oversight to mitigate the risks associated with inaccuracies.
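
The difference between a plain prompt and a guided-scoring prompt can be sketched as follows. This is a hypothetical illustration only: the prompt wording, the guideline text, and the helper names `build_prompt` and `parse_score` are assumptions for this example and are not taken from the study. The sample question is the first DISCERN item, and the 1 to 5 scale follows the DISCERN rating format.

```python
import re
from typing import Optional


def build_prompt(question: str, transcript: str, guidelines: Optional[str] = None) -> str:
    """Assemble a rating prompt; pass `guidelines` for the guided-scoring variant."""
    parts = [
        "You are rating a health video transcript with the DISCERN instrument.",
        f"Question: {question}",
        "Answer with a single integer from 1 (no) to 5 (yes).",
    ]
    if guidelines:
        # Guided scoring: tell the model what each end of the scale means.
        parts.append(f"Scoring guidelines: {guidelines}")
    parts.append(f"Transcript:\n{transcript}")
    return "\n\n".join(parts)


def parse_score(reply: str) -> Optional[int]:
    """Extract the first standalone digit 1-5 from a model reply, if any."""
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    question = "Are the aims clear?"  # DISCERN question 1
    transcript = "Today we explain what this treatment is for..."  # placeholder text
    # Hypothetical guideline wording, written for this sketch only.
    hint = "Give 5 if the aims are stated explicitly, 1 if no aims are stated."
    print(build_prompt(question, transcript))
    print(build_prompt(question, transcript, guidelines=hint))
    print(parse_score("Score: 4"))
```

Keeping prompt assembly separate from reply parsing makes it possible to compare the two prompting styles on the same transcripts without changing any other part of the pipeline.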

Limitations

The study highlights the potential of large language models (LLMs) to address quality problems in health-related content on YouTube and proposes a web application that evaluates and ranks videos by reliability and quality. Users, particularly patients and healthcare professionals, could use such a tool to identify trustworthy health information efficiently. However, the study acknowledges several limitations that warrant future work.
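
The ranking step of such an application could look roughly like the sketch below, which orders videos by the sum of their per-question DISCERN scores. The `RatedVideo` type, the field names, and the choice of a simple total as the ranking key are all assumptions for illustration; the summary does not describe how the proposed application would aggregate scores.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RatedVideo:
    """A video with one 1-5 rating per DISCERN question (hypothetical structure)."""
    video_id: str
    question_scores: List[int]

    @property
    def total(self) -> int:
        # Simple aggregate: sum of all question ratings.
        return sum(self.question_scores)


def rank_videos(videos: List[RatedVideo]) -> List[RatedVideo]:
    """Return videos ordered from highest to lowest total score."""
    return sorted(videos, key=lambda video: video.total, reverse=True)


if __name__ == "__main__":
    # Hypothetical ratings, assuming all 16 DISCERN questions were scored.
    catalogue = [
        RatedVideo("video-a", [3] * 16),
        RatedVideo("video-b", [5] * 16),
        RatedVideo("video-c", [1] * 16),
    ]
    for video in rank_videos(catalogue):
        print(video.video_id, video.total)
```

Python's `sorted` is stable, so videos with equal totals keep their original relative order; a real application would likely need an explicit tie-breaking rule.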

First, the scope of evaluated videos and topics was limited, primarily by the availability of expert evaluations, which restricts the generalizability of the findings. Future research aims to expand the dataset to a broader range of videos across various topics, increasing the depth and specificity of quality assessments. Second, the analysis was confined to transcripts of videos under 10 minutes, omitting visual and auditory elements that could influence content quality. Future work could use large multimodal models (LMMs) to integrate these aspects and enrich the evaluation process. Finally, the study relied on only two prompting techniques, and the financial cost of LLM API usage is a further challenge that future investigations should address to optimize the analysis of health-related video content.