Research evaluation with ChatGPT: is it age, country, length, or field biased?

Journal: Scientometrics, Volume: 130, Issue: 10
DOI: https://doi.org/10.1007/s11192-025-05393-0
Publication Date: 2025-08-08
Author(s): Mike Thelwall et al.
Primary Topic: scientometrics and bibliometrics research

Overview

This study investigates potential biases in ChatGPT’s quality scores for journal articles, which are assigned from titles and abstracts alone. Analyzing 117,650 articles from 26 fields published in 2003, 2008, 2013, 2018, and 2023, it finds that ChatGPT tends to give higher scores to more recent articles; the trend is modest but consistent across all fields. Scores also vary substantially by discipline and by the first author’s country. Longer abstracts correlate with higher scores, which is attributed both to longer abstracts being associated with articles in higher-impact journals and to ChatGPT having more text to analyze.

The findings suggest that normalization of ChatGPT scores by field and publication year is essential for accurate research quality evaluations. This involves calculating a ratio of an article’s score to the average score within its field and year, allowing for fair comparisons. The study also highlights potential biases related to the first author’s country and abstract length, indicating a need for further investigation into these factors. The authors recommend that evaluators use ChatGPT scores as supplementary information rather than definitive measures, emphasizing the importance of expert judgment in research evaluation. Additionally, the implications of abstract length restrictions on scoring warrant further exploration to determine if normalization for this factor is necessary.
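The field and year normalization described above can be sketched in a few lines. This is a minimal illustration rather than the authors’ code: the function name `normalize_scores`, the dictionary keys, and the sample values are all hypothetical, and the arithmetic mean is assumed as the "average".

```python
from collections import defaultdict
from statistics import mean

def normalize_scores(articles):
    """Divide each article's score by the mean score of its (field, year) group."""
    groups = defaultdict(list)
    for article in articles:
        groups[(article["field"], article["year"])].append(article["score"])
    # Assumption: "average" means the arithmetic mean of the group.
    group_means = {key: mean(values) for key, values in groups.items()}
    return [
        article["score"] / group_means[(article["field"], article["year"])]
        for article in articles
    ]

# Made-up illustrative data, not values from the study.
sample = [
    {"field": "Chemistry", "year": 2003, "score": 2.0},
    {"field": "Chemistry", "year": 2003, "score": 3.0},
    {"field": "Chemistry", "year": 2023, "score": 3.0},
    {"field": "Chemistry", "year": 2023, "score": 3.0},
]
normalized = normalize_scores(sample)
```

A normalized value above 1 means the article scored above the average for its own field and year, so a 2003 article is compared only with other 2003 articles in the same field, not with more recent ones.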

Introduction

The introduction emphasizes the critical role of expert review in evaluating published research for academic appointments, promotions, and tenure, while acknowledging that bibliometrics can enhance the accuracy of these evaluations. It then discusses the use of Artificial Intelligence (AI) methods in research evaluation, highlighting the promise of ChatGPT as a tool whose quality scores for journal articles correlate positively with human expert evaluations across most fields, except possibly clinical medicine. However, the paper raises concerns about potential biases in AI evaluations, particularly with respect to publication year, field differences, and the length of titles and abstracts.

The primary focus of the study is to investigate whether ChatGPT’s quality scores exhibit systematic variations based on publication year, academic field, title and abstract length, and country of origin. The research questions aim to explore these potential biases and their implications for the reliability of AI-generated evaluations. Additionally, the paper seeks to understand the relationship between ChatGPT scores and citation counts, providing a comparative context against the well-documented biases associated with traditional citation-based indicators. Overall, the study aims to critically assess the efficacy and fairness of using AI in academic research evaluations.

Methods

The researchers analyzed research articles sampled across fields and years. They first produced a descriptive analysis of average scores by year and field. They then used regression analysis to evaluate the simultaneous effects of all variables, which reduces the risk of bias from second-order effects (for example, an apparent year effect that is actually driven by differences in abstract length or author country).
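To illustrate what evaluating the simultaneous effects of several variables means, the sketch below fits an ordinary least squares model in pure Python. It is an assumption-laden example: the summary only says "regression analysis", so plain OLS, the function name `ols_fit`, the three predictors, and the synthetic data are all hypothetical.

```python
def ols_fit(rows, targets):
    """Fit y = b0 + b1*x1 + ... + bk*xk by solving the normal equations."""
    X = [[1.0] + list(row) for row in rows]  # prepend an intercept column
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear")
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]

# Synthetic data (not from the study): years since 2003, abstract length in
# words, and a 0/1 dummy for one hypothetical author country.
rows = [(0, 150, 0), (5, 200, 1), (10, 250, 0),
        (15, 180, 1), (20, 300, 0), (20, 220, 1)]
# Scores generated from known coefficients so the fit can be checked.
targets = [2.0 + 0.01 * yr + 0.002 * words + 0.1 * c for yr, words, c in rows]
intercept, b_year, b_length, b_country = ols_fit(rows, targets)
```

Because all predictors enter the same model, `b_year` is the year effect with abstract length and country held fixed, which is the property the regressions rely on.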

Although previous work (Thelwall, 2024; Thelwall, 2025) indicates that submitting an article to ChatGPT multiple times can yield more accurate scores, accuracy was not the primary concern of this study, except for Research Question 5 (RQ5), where it was included for reference. Each article was therefore submitted only once.

Results

The results section presents 26 field-based regression analyses of the factors influencing ChatGPT scores for academic articles. The year variable had a positive and statistically significant coefficient in every regression, indicating that the lower average scores for older articles are not attributable to differences in authors’ countries or abstract lengths. Longer abstracts were associated with higher scores in 23 of the 26 fields, despite a minimum length cutoff designed to exclude shorter contributions.

Canadian authors were uniquely associated with higher ChatGPT scores across all fields, with statistical significance in 21 of them. Other countries with substantial publication volumes in Scopus generally also had higher scores; India was an exception, likely due to its lower per capita research investment. The analysis revealed no clear advantage for English-speaking nations, and UK authors did not dominate the ChatGPT scores despite the relevance of the UK Research Excellence Framework (REF) criteria. The R² values of the regression models ranged from 0.05 to 0.21, averaging 0.12, so the examined factors account for only a modest portion of the variation in ChatGPT scores and additional, unmeasured variables may be influential.
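For readers less familiar with R², the following sketch shows how the statistic is computed from observed and model-predicted scores. The function name `r_squared` and the numbers are illustrative only and are not taken from the study.

```python
def r_squared(observed, predicted):
    """Share of the variance in `observed` that is explained by `predicted`."""
    mean_observed = sum(observed) / len(observed)
    # Residual sum of squares: variation the model leaves unexplained.
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: variation around the mean.
    ss_total = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_residual / ss_total

# Made-up scores for four articles and the values a model predicted for them.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 2.0, 3.0, 3.0]
r2 = r_squared(observed, predicted)
```

On this scale, the reported average of 0.12 means that roughly 12% of the variation in ChatGPT scores is accounted for by year, country, and abstract length together, leaving most of it unexplained by these factors.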

Discussion

The discussion section of the paper addresses the pervasive issue of bias in AI systems, particularly in text processing and large language models (LLMs). It highlights that biases can arise from design choices, training data, and the inherent characteristics of the models. For instance, an AI system might inadvertently learn to associate gender with job preferences based on biased training data, leading to biased outcomes in candidate selection. The authors emphasize that many LLMs are trained predominantly on Western data, which can marginalize global majority cultural perspectives. Despite efforts to mitigate biases in newer models, such as ChatGPT, older versions still exhibit significant biases, particularly in their associations with gender and ethnicity, which can affect critical areas like medical recommendations.

The paper also explores prompt-based strategies to reduce bias in LLM outputs, suggesting that approaches like cognitive bias warnings and example-based learning can yield modest improvements. However, the authors caution that the effectiveness of these strategies may vary with the context and the specific LLM architecture. The study also acknowledges limitations in its methodology, particularly the selection of articles from Scopus, which may not fully represent the diversity of academic scholarship. The findings show average ChatGPT scores increasing with publication year, which could reflect improvements in research quality, a bias toward recent work, or both, and they highlight the need for careful normalization across fields and publication years to avoid penalizing older research. Overall, the discussion underscores the complexity of addressing bias in AI and the need for ongoing evaluation and refinement of LLMs in academic contexts.