A large-scale semi-automated approach for assessing document-type classification errors in bibliometric databases

Journal: Scientometrics, Volume: 130, Issue: 3
DOI: https://doi.org/10.1007/s11192-025-05244-y
Publication Date: 2025-03-01
Author(s): Domenico Maisano et al.
Primary Topic: scientometrics and bibliometrics research

Overview

This paper examines the accuracy of document type (DT) classifications in two bibliometric databases, Scopus and Web of Science (WoS). That accuracy matters to the academic community because bibliometric indicators influence research funding and reputation. The authors propose a semi-automated methodology that efficiently identifies discrepancies in DT classifications between the two databases, so that only the potentially misclassified documents need closer analysis. The methodology was applied to a sample of 27,734 documents from two universities in Turin, revealing error rates of approximately 2.7% for Scopus and 2.3% for WoS. The most frequent errors involved misclassifications between research articles and reviews, with notable differences in error types between the databases.

The study emphasizes the practical implications of the proposed methodology, which can assist researchers, institutions, and database managers in detecting and correcting DT-classification errors. By focusing on a subset of documents identified as potentially misclassified, the methodology allows for a more extensive analysis with limited manual effort. However, the authors acknowledge certain limitations, such as the exclusion of specific DTs and the need for future research to address double DT classifications in WoS. Overall, the findings highlight the significance of DT-classification accuracy and suggest avenues for improving the quality of bibliometric databases.

Introduction

The introduction discusses the classification of scientific publications into document types (DTs), such as research articles, conference proceedings, and reviews, which is essential for organizing knowledge and facilitating information retrieval in academic research. This classification not only aids the evaluation of scientific contributions but also influences bibliometric indicators, including the impact factor of journals. However, DT classification is error-prone: research articles may be incorrectly categorized as reviews or conference proceedings, primarily because standardized rules are lacking and publishers and bibliometric databases interpret the categories subjectively.

The paper highlights that existing literature on DT-classification errors is limited, with studies indicating that such errors in major databases like Scopus and Web of Science (WoS) can be notable. The authors propose a novel semi-automated methodology to analyze DT-classification errors on a larger scale, addressing the limitations of previous studies that relied on smaller datasets and manual analysis. This new approach involves the automatic identification of a large corpus of publications from Scopus and WoS, followed by a targeted manual review of documents with inconsistent DTs. The methodology is exemplified through a case study involving publications from two universities in Turin, Italy, aiming to enhance the accuracy of DT classifications and provide insights into the prevalence of misclassifications in bibliometric databases.

Methods

The methodology is semi-automated: automated steps handle the large-scale collection and comparison of records, and manual review is reserved for the documents most likely to be misclassified. A large corpus of publications is first retrieved from both Scopus and WoS, and the analysis is restricted to the documents indexed by both databases.

The DT assigned to each document by Scopus is then compared with the DT assigned by WoS. Manual review is targeted at the documents whose DTs are discordant, in order to establish the correct DT and attribute the error to the responsible database. Concentrating the manual effort on this discordant subset is what makes the analysis feasible on a corpus of this size.

Results

In the results section, the authors present a detailed analysis of DT-classification errors in the Scopus and Web of Science (WoS) databases. They constructed contingency tables that compare the DT assigned by each database with the DT established by manual analysis, highlighting the most significant misclassifications. For Scopus, most errors involved the article and review categories, with 217 articles misclassified as reviews and 102 reviews misclassified as articles. Additionally, editorial documents were frequently misclassified, indicating a lack of clarity in the definitions provided by Scopus, particularly for the “editorial” category.

The analysis for WoS revealed similar patterns, with 178 articles misclassified as reviews and 74 reviews misclassified as articles. Furthermore, various document types, including proceedings papers and letters, were also misclassified as articles. The authors emphasize that the ambiguous definitions of DTs in both databases contribute to these classification errors, particularly for categories like “notes,” which are inconsistently recognized. The findings underscore the need for clearer definitions and more precise classification criteria within bibliometric databases to reduce misclassification rates and improve the accuracy of academic document categorization.
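
For scale, the reported article/review confusions can be set against the size of the intersection. The short calculation below uses only figures quoted in this summary (217 and 102 for Scopus, 178 and 74 for WoS, 27,734 documents). The variable names are illustrative, and the resulting shares are derived here rather than reported by the authors.

```python
CORPUS_SIZE = 27_734  # documents indexed by both databases

# Article/review confusions quoted above:
# (articles labelled as reviews, reviews labelled as articles)
confusions = {"Scopus": (217, 102), "WoS": (178, 74)}

shares = {}
for database, (a_as_r, r_as_a) in confusions.items():
    total = a_as_r + r_as_a
    # Share of the whole intersection, in percent, rounded to two decimals.
    shares[database] = round(100 * total / CORPUS_SIZE, 2)
    print(f"{database}: {total} article/review confusions, "
          f"{shares[database]}% of the intersection")
```

Taking the quoted counts at face value, article/review confusions amount to roughly 1.15% of the intersection for Scopus and 0.91% for WoS, which is a substantial part of the overall error rates of about 2.7% and 2.3%.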

Discussion

The proposed methodology identifies document type (DT) classification errors in Scopus and Web of Science (WoS) through a systematic, semi-automated process. First, a large corpus of scientific publications is defined using specific criteria, and both databases are queried to determine the set of publications indexed by both. A concordance matrix is then constructed to compare the DT classifications from Scopus and WoS, which isolates the discordant classifications that may indicate errors. These discordant publications are analyzed manually to ascertain the “true” DT and to determine which database is responsible for the classification error. Finally, error statistics are generated, detailing the rates of missing assignments and false classifications for each database.
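
The comparison step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the record layout (dicts with `doi` and `dt` keys), the DOI-based matching, the helper names, and the sample records are all assumptions made for the example. It also assumes the DT labels of the two databases have already been mapped to a common vocabulary.

```python
from collections import Counter

def normalize_doi(doi):
    """Lower-case a DOI and strip a resolver prefix so both exports match."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi

def concordance(scopus_records, wos_records):
    """Return (matrix, discordant) for documents indexed by both databases.

    matrix maps (scopus_dt, wos_dt) -> count; discordant lists the DOIs whose
    two labels differ and therefore need manual review.
    """
    scopus = {normalize_doi(r["doi"]): r["dt"] for r in scopus_records}
    wos = {normalize_doi(r["doi"]): r["dt"] for r in wos_records}
    shared = sorted(scopus.keys() & wos.keys())  # intersection of both indexes
    matrix = Counter((scopus[d], wos[d]) for d in shared)
    discordant = [d for d in shared if scopus[d] != wos[d]]
    return matrix, discordant

# Hypothetical toy records (invented DOIs, not data from the study).
scopus_records = [
    {"doi": "10.1000/a1", "dt": "article"},
    {"doi": "https://doi.org/10.1000/A2", "dt": "review"},
    {"doi": "10.1000/a3", "dt": "article"},
]
wos_records = [
    {"doi": "10.1000/a1", "dt": "article"},
    {"doi": "10.1000/a2", "dt": "article"},
    {"doi": "10.1000/a4", "dt": "review"},
]
matrix, discordant = concordance(scopus_records, wos_records)
```

In this toy input only two documents appear in both exports. The off-diagonal cell `("review", "article")` holds the one document that would be passed on to manual review, while the diagonal cell holds the concordant one.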

The data collection phase focuses on publications from two major universities in Turin, Italy, covering a five-year period. The resulting intersection subset consists of 27,734 documents, predominantly classified into three DTs: articles, reviews, and proceedings papers. Notably, discrepancies in DT classifications between the two databases are highlighted, particularly in specialized categories. The methodology emphasizes the importance of manual verification for discordant classifications, while also acknowledging the potential for simultaneous misclassifications in both databases. The error statistics reveal a slightly higher error rate for Scopus compared to WoS, with specific DTs exhibiting notable rates of misclassification, underscoring the need for ongoing scrutiny of database classification practices.
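
To make the final step concrete, the sketch below shows one way error statistics could be derived once reviewers have assigned a “true” DT to each discordant document. The pair layout, the function name `error_stats`, and the toy data are hypothetical. The rate is computed over the whole intersection under the simplifying assumption that concordant documents are correct in both databases; since simultaneous misclassifications are possible, a rate computed this way is best read as a lower bound.

```python
from collections import Counter

def error_stats(adjudicated, corpus_size):
    """Summarize classification errors for one database.

    adjudicated: iterable of (true_dt, assigned_dt) pairs for reviewed documents.
    corpus_size: number of documents in the whole intersection.
    Returns (error_rate, missing, false), where missing[dt] counts documents of
    true type dt that received another label, and false[dt] counts documents
    that were wrongly given the label dt.
    """
    missing, false = Counter(), Counter()
    for true_dt, assigned_dt in adjudicated:
        if true_dt != assigned_dt:
            missing[true_dt] += 1
            false[assigned_dt] += 1
    errors = sum(missing.values())
    return errors / corpus_size, missing, false

# Hypothetical reviewed sample for one database (not data from the study).
reviewed = [
    ("article", "review"),
    ("article", "article"),
    ("review", "article"),
    ("editorial", "article"),
]
rate, missing, false = error_stats(reviewed, corpus_size=100)
```

Keeping missing assignments and false classifications as separate tallies per DT mirrors the two kinds of rates mentioned in the discussion: a category can lose documents that belong to it and, independently, gain documents that do not.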