Benchmark evaluation of video large language models in quality assessment of science popularization videos for dry eye

Journal: Scientific Reports, Volume: 16, Issue: 1
DOI: https://doi.org/10.1038/s41598-026-39444-0
PMID: https://pubmed.ncbi.nlm.nih.gov/41688579
Publication Date: 2026-02-13
Author(s): Shiqi Zhou et al.
Primary Topic: Health Literacy and Information Accessibility

Overview

The rapid proliferation of short-video platforms has transformed how health information is disseminated, including information about dry eye, a common ocular disorder. This study addresses the need for effective methods to identify and mitigate misinformation in online health content. It proposes a framework that uses Video Large Language Models (VideoLLMs) to evaluate science popularization videos automatically. Three VideoLLMs (VideoLLaMA3, QwenVL, and InternVL) were benchmarked against expert ratings made with established quality assessment tools: the Patient Education Materials Assessment Tool for Audiovisual Materials (PEMAT-A/V), the Global Quality Score (GQS), and the Video Information and Quality Index (VIQI). Ophthalmologists annotated a dataset of 185 Chinese-language videos, and the VideoLLMs generally showed poor agreement with these expert evaluations (Intraclass Correlation Coefficient, ICC < 0.40), with some exceptions in the actionability dimension of PEMAT-A/V. The findings indicate that established quality assessment instruments can be adapted for multimodal AI models, but current VideoLLMs cannot yet reliably approximate expert evaluations and remain inadequate for fully automated assessment. The study provides an updated annotated dataset and an open-source framework, laying the groundwork for future work in this area. The authors emphasize the potential of these tools to improve the standardization and accessibility of digital health education, and they make the human quality assessment results available upon request.

Introduction

The introduction of the research paper addresses the significant public health issue of dry eye, a multifactorial ocular surface disorder that leads to tear film instability, discomfort, and visual disturbances. The condition affects a substantial portion of the population, with global prevalence estimates ranging from 5% to 50%, particularly impacting older adults, women, and individuals with high screen exposure. The disorder not only diminishes quality of life but also increases healthcare utilization and psychosocial distress. Effective patient education is emphasized as vital for early detection and management, especially in the context of the growing reliance on online health information.

The rise of short-video platforms like TikTok has transformed how health information is disseminated, yet this open-access model risks the spread of unverified content, potentially leading to harmful self-treatment practices. The paper highlights the potential of Large Language Models (LLMs) to improve the quality assessment of online health content through automated analysis. Despite their promising capabilities in language comprehension and knowledge retrieval, concerns about the reliability and reproducibility of LLM outputs in medical contexts remain. The authors propose a novel framework utilizing Video Large Language Models (VideoLLMs) to systematically evaluate and enhance the quality of ophthalmic health content available online, addressing the urgent need for trustworthy information in this domain.

Methods

The study builds and evaluates a VideoLLM-based framework for scoring science popularization videos about dry eye. The dataset comprises 185 Chinese-language videos, which ophthalmologists annotated with three established instruments: PEMAT-A/V, GQS, and VIQI. The same instruments were adapted so that three VideoLLMs (VideoLLaMA3, QwenVL, and InternVL) could score each video through video question answering, with frames sampled at 1 fps and only visual information passed to the models.

Agreement between model-generated scores and expert ratings was quantified with the Intraclass Correlation Coefficient (ICC), with values below 0.40 treated as poor agreement.

Results

The study systematically benchmarks VideoLLMs for the automated quality assessment of science popularization videos focused on dry eye, using three established video quality assessment instruments: PEMAT-A/V, GQS, and VIQI. Although the VideoLLM-based framework could apply these instruments automatically, agreement between model-generated scores and expert assessments was generally poor, with Intraclass Correlation Coefficient (ICC) values below 0.40. The actionability dimension of PEMAT-A/V was the main exception: QwenVL and InternVL reached moderate agreement, with ICC values of 0.50 and 0.43, respectively, while VideoLLaMA3 reached an ICC of 0.39, just under the 0.40 threshold.
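To make the agreement statistic concrete, the following is a minimal sketch of how an ICC between expert and model scores could be computed. The summary does not say which ICC form the authors used, so the two-way random-effects, absolute-agreement, single-rater form (often written ICC(2,1)) is an assumption here, and the function name `icc_2_1` and the example scores are hypothetical.

```python
from typing import Sequence


def icc_2_1(ratings: Sequence[Sequence[float]]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` has one row per video and one column per rater
    (for example, column 0 = expert score, column 1 = model score).
    The choice of ICC form is an assumption, not taken from the paper.
    """
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a rectangular table with >= 2 rows and >= 2 columns")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: videos (rows), raters (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    denom = ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    if denom == 0:
        raise ValueError("ratings have no variance")
    return (ms_rows - ms_error) / denom


if __name__ == "__main__":
    # Hypothetical scores for five videos: (expert, model).
    scores = [(4, 3), (2, 2), (5, 3), (3, 4), (1, 2)]
    print(f"ICC(2,1) = {icc_2_1(scores):.2f}")
```

Because this form measures absolute agreement, a model that ranks videos in the same order as the experts but scores them systematically higher or lower is still penalized, which is one reason a model can track expert judgments loosely and still fall below 0.40.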

The study further explores the limitations of the VideoLLMs, attributing their suboptimal performance to their general-purpose training, which emphasizes tasks like object recognition and scene understanding rather than the specific requirements of science video quality assessment. The static nature of science popularization videos, which rely heavily on auditory and textual elements rather than dynamic visual cues, poses additional challenges. The models’ focus on visual processing, combined with inadequate audio capabilities and potential information loss during frame sampling, hampers their effectiveness. Furthermore, the use of figurative language in Chinese popular science videos may lead to misinterpretations by the models, complicating the assessment process.

Discussion

In this study, the authors evaluated three mainstream Video Large Language Models (VideoLLMs) on the task of assessing the quality of online medical science popularization videos about dry eye disease. The models showed only weak agreement with expert evaluations overall, with moderate agreement limited to the actionability dimension for some models, which highlights their limitations in assessing specialized medical content. The study used established assessment tools, including the Patient Education Materials Assessment Tool for Audiovisual Materials (PEMAT-A/V), the Global Quality Score (GQS), and the Video Information and Quality Index (VIQI), to evaluate 185 videos. While the videos scored relatively high on understandability, their actionability and overall quality were notably lower, indicating a lack of practical guidance for viewers.
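The understandability and actionability figures mentioned above are percentage scores. As a rough illustration, the sketch below follows the standard PEMAT scoring convention, in which each item is rated agree (1) or disagree (0), not-applicable items are excluded, and the score is the share of applicable items rated agree. The function name `pemat_score` and the item ratings are hypothetical, and this is not the authors' implementation.

```python
from typing import Optional, Sequence


def pemat_score(item_ratings: Sequence[Optional[int]]) -> float:
    """Return a PEMAT-style percentage score for one dimension.

    Each entry is 1 (agree), 0 (disagree), or None (not applicable).
    Not-applicable items are dropped from both numerator and denominator.
    """
    applicable = [r for r in item_ratings if r is not None]
    if not applicable:
        raise ValueError("no applicable items to score")
    if any(r not in (0, 1) for r in applicable):
        raise ValueError("ratings must be 0, 1, or None")
    return 100.0 * sum(applicable) / len(applicable)


if __name__ == "__main__":
    # Hypothetical ratings for one video.
    understandability_items = [1, 1, 0, 1, None, 1]
    actionability_items = [1, 0, 0, None]
    print(f"Understandability: {pemat_score(understandability_items):.1f}%")
    print(f"Actionability: {pemat_score(actionability_items):.1f}%")
```

A video can therefore score well on understandability while scoring poorly on actionability if it explains the condition clearly but names few concrete steps a viewer can take.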

The findings underscore the urgent need for high-quality online health information, especially given the rising prevalence of dry eye disease and its significant impact on patients’ quality of life. Despite the limitations of current VideoLLMs, the authors advocate for the development of automated evaluation systems to enhance the quality of health information disseminated through platforms like TikTok. They propose integrating advanced VideoLLMs with clinical practice to enable proactive quality control and suggest that future improvements could include audio analysis and refined prompting strategies tailored to medical contexts. Overall, while the study lays the groundwork for future advancements in automated content evaluation, it also emphasizes the necessity for caution in interpreting the capabilities of existing models.

Limitations

The research makes a significant contribution as the first to use Video Large Language Models (VideoLLMs) for the automated quality assessment of online medical science popularization videos, focusing specifically on dry eye. It establishes an open-source framework and provides a manually annotated dataset of Chinese-language videos, addressing the previously identified suboptimal quality of ophthalmic educational content. However, several limitations are acknowledged.

First, current VideoLLMs exhibit performance constraints that restrict their use for professional or educational guidance. Second, the study's focus on dry eye limits the generalizability of the findings to other ophthalmic conditions, such as keratitis, glaucoma, and cataract. Third, the analysis used a single frame sampling rate (1 fps) for video question answering, which may introduce sampling bias. Fourth, the dataset contains only Chinese-language videos, which constrains cross-cultural applicability because other languages prevalent in medical communication are not represented. Finally, the framework relies solely on visual information and neglects audio features and metadata, which are crucial for comprehensive evaluation, particularly for assessing the consistency between video titles and content. Future research aims to address these limitations by expanding the dataset to cover various ophthalmic diseases and multiple languages, thereby improving the robustness and applicability of the findings.
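The frame sampling limitation can be illustrated with a short sketch of what sampling at 1 fps means in terms of frame indices. The function `sample_frame_indices` and the example video parameters are hypothetical, and the sketch makes no claim about how the authors' framework or any particular VideoLLM selects frames.

```python
from typing import List


def sample_frame_indices(total_frames: int, native_fps: float,
                         target_fps: float = 1.0) -> List[int]:
    """Pick the indices of frames kept when sampling at `target_fps`.

    One frame is taken at the start of each 1/target_fps interval.
    Every frame between two sampled indices is never seen by the model.
    """
    if total_frames < 0 or native_fps <= 0 or target_fps <= 0:
        raise ValueError("frame count must be >= 0 and frame rates must be > 0")

    # Native frames between consecutive samples; never less than one frame.
    step = max(1.0, native_fps / target_fps)

    indices: List[int] = []
    i = 0
    while True:
        idx = int(i * step)
        if idx >= total_frames:
            break
        indices.append(idx)
        i += 1
    return indices


if __name__ == "__main__":
    # Hypothetical 30-second clip recorded at 30 frames per second.
    kept = sample_frame_indices(total_frames=900, native_fps=30.0)
    print(f"kept {len(kept)} of 900 frames")
```

Under these assumptions, a caption or on-screen label that appears only between two sampled frames would not reach the model at all, which is one way a fixed sampling rate can bias the assessment.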