Research quality evaluation by AI in the era of large language models: advantages, disadvantages, and systemic effects – an opinion paper

Journal: Scientometrics, Volume: 130, Issue: 10
DOI: https://doi.org/10.1007/s11192-025-05361-8
Publication Date: 2025-07-28
Author: Mike Thelwall
Primary Topic: scientometrics and bibliometrics research

Overview

The article discusses the implications of Artificial Intelligence (AI) technologies, particularly Large Language Models (LLMs), for bibliometrics as indicators of research quality. It notes that LLMs are already used in some research evaluation systems and, informally, by peer reviewers. It then reviews the advantages and disadvantages of AI-generated quality scores: LLMs may surpass bibliometrics in accuracy and coverage, and may reflect a broader range of research quality dimensions. Their significant drawbacks include unknown biases and less transparency than traditional citation counts.

The conclusion emphasizes the need for responsible use of LLMs in research evaluation. Although LLM scores could complement or even replace bibliometrics, uncertainties about their biases and the risk of manipulation must be addressed before they are used in important evaluation contexts. The author suggests that, with further investigation of these issues, LLM scores could eventually serve as supplementary information for expert reviewers in future research evaluations, such as the UK's national REF exercise.

Introduction

The introduction describes post-publication research quality evaluation as a critical but time-consuming part of modern research systems. Research varies widely in quality, and institutions need to identify and reward the best scholarship. For instance, the UK Research Excellence Framework (REF) 2021 involved the assessment of 185,594 research outputs by 1,120 experts, with the outcomes influencing research funding until 2029. The example shows how heavily such exercises rely on expert judgment, which can be slow. This has prompted a search for quicker quality proxies such as journal prestige and citation metrics, despite the controversies surrounding these methods.

Against this background, the emergence of LLMs such as ChatGPT offers a potential alternative for research evaluation. The article explores the capabilities of LLMs in assessing research quality and discusses the implications of integrating them into the evaluation process, addressing both the opportunities and the concerns associated with their use.

Discussion

In discussing research quality assessment, the paper highlights that definitions and criteria vary across contexts. Core dimensions such as rigor, originality, and significance are commonly acknowledged, but their interpretation can differ substantially among disciplines and evaluators. Peer review, particularly post-publication expert review, is identified as a crucial mechanism for assessing research quality, although it is subject to biases and subjective judgments shaped by reviewers' expertise and personal perspectives. The paper notes that bibliometric indicators such as citation counts can provide insights into scholarly impact, but they are often inadequate for evaluating the broader dimensions of research quality and may inadvertently encourage behaviors that prioritize quantity over quality.

The paper also explores the potential of machine learning and LLMs for evaluating research quality. LLMs such as ChatGPT have shown promise in providing peer review-style feedback, and their scores correlate with expert evaluations, which suggests they could be effective tools for research quality assessment. However, the author cautions that LLM evaluations have their own biases and limitations, particularly their reliance on titles and abstracts rather than full texts. Despite these challenges, the advantages of LLMs, such as their ability to assess a wider range of research outputs and to provide timely evaluations, make them valuable complements to traditional bibliometric methods, particularly in fields where citation data is sparse or less relevant. Overall, the paper advocates a nuanced approach to research quality assessment that combines expert review with AI-driven methods.
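
The claim that LLM scores correlate with expert evaluations can be made concrete with a rank correlation. The following is a minimal sketch, not the paper's method: it assumes Spearman's rank correlation as the agreement measure, and the score lists, the 1-4 expert scale, and the function names `average_ranks` and `spearman` are all hypothetical, chosen only for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the window over all items tied with the one at `pos`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        tied_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = tied_rank
        pos = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: expert scores on a 1-4 scale and averaged LLM scores
# for the same eight articles. These numbers are invented for illustration.
expert_scores = [4, 3, 3, 2, 1, 4, 2, 3]
llm_scores = [3.6, 3.1, 2.8, 2.9, 1.7, 3.9, 2.2, 2.5]

rho = spearman(expert_scores, llm_scores)
print(f"Spearman correlation: {rho:.2f}")
```

A rank-based measure is a natural choice here because expert scores are ordinal and often tied, so only the ordering of articles is compared, not the raw score values.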
