Measuring agreement among several raters classifying subjects into one or more (hierarchical) categories: A generalization of Fleiss’ kappa

Journal: Behavior Research Methods, Volume: 57, Issue: 10
DOI: https://doi.org/10.3758/s13428-025-02746-8
PMID: https://pubmed.ncbi.nlm.nih.gov/40954368
Publication Date: 2025-09-15
Author(s): Filip Moons et al.
Primary Topic: Reliability and Agreement in Measurement

Overview

The paper discusses the limitations of Cohen’s and Fleiss’ kappa in measuring inter-rater agreement when subjects can belong to multiple categories. The authors propose a generalized version of Fleiss’ kappa, a new κ statistic that allows raters to assign each subject to more than one category. The new measure can weight categories by their importance and can accommodate hierarchical structures, such as primary disorders and sub-disorders. It is also designed to handle missing data and variation in the number of raters per subject or category.

The paper details the derivation of the new κ statistic, demonstrating its equivalence to Fleiss’ kappa under the condition of single category selection. It also explores the assumptions and potential paradoxes associated with the measure, providing guidelines for interpretation. The authors illustrate the application of their κ statistic in evaluating the reliability of a new mathematics assessment method and conclude with a practical example involving psychiatrists diagnosing patients with multiple disorders. To enhance accessibility, all calculations are made available through R scripts and an Excel sheet.

Introduction

The introduction addresses inter-rater agreement, that is, the consistency among independent observers assessing the same phenomenon. The authors highlight that many measurements rely on subjective evaluations by human raters, which can lead to significant discrepancies in ratings. This variability threatens the reproducibility and accuracy of scientific measurements: ideally, only differences in the subject’s attributes should influence the ratings, not differences between raters. The paper emphasizes the importance of measuring inter-rater agreement, particularly through established metrics such as Cohen’s and Fleiss’ kappa, both of which are limited to mutually exclusive categories.

To address this limitation, the authors propose the development of a new chance-corrected measure that accommodates multiple classifications by raters. The introduction outlines the structure of the paper, which includes a discussion of Cohen’s and Fleiss’ kappa, a review of existing methods that attempt to overcome the exclusivity constraint, and a derivation of the proposed measure. This new metric not only generalizes Fleiss’ kappa for regular categories but also allows for the incorporation of weighted categories and hierarchical interdependencies. The authors plan to examine the assumptions and potential paradoxes of the new measure, provide guidelines for its interpretation, and compare it with existing methods, supported by illustrative examples.

Methods

In this section, the authors discuss alternative methods for assessing inter-rater agreement when raters can classify subjects into multiple categories, highlighting a gap in the literature regarding the limitations of mutually exclusive categories. They reference the foundational work of Mezzich et al. (1981) and provide R-functions for the described methods as supplementary material.

The authors also compare their proposed κ statistic with established methods and note significant limitations of each. The proportional overlap method yields κ = 0.602, but it assumes that all items are equally weighted and that all categories are available to all raters, which can distort the results. The chance-corrected intraclass correlations return κ = 0.379, but they do not adequately account for cases in which some raters select no category for a subject, which leads to misleading conclusions. The authors emphasize that their κ statistic addresses these issues by valuing non-selections equally, and they acknowledge that excluding non-selections can produce unsatisfactory results. They conclude that methods based on chance-corrected rank correlations are not applicable in this context, because the classification does not involve ordered categories.
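To make the idea of overlap between multi-category ratings concrete, the following Python sketch computes a Jaccard-style overlap between two raters' category sets and averages it over rater pairs and subjects. The definition used here (shared categories divided by distinct categories chosen by either rater) is an assumption made for illustration and is not taken from the summary above; the function names, the treatment of two empty selections as full agreement, and the example diagnoses are likewise hypothetical. The chance-correction step is deliberately left out.

```python
from itertools import combinations


def proportional_overlap(set_a, set_b):
    """Overlap between two raters' category sets for one subject.

    Assumed definition: |A & B| / |A | B| (a Jaccard-style ratio).
    """
    union = set_a | set_b
    if not union:
        # Assumption of this sketch: two empty selections count as agreement.
        return 1.0
    return len(set_a & set_b) / len(union)


def mean_observed_overlap(ratings):
    """Average pairwise overlap, first within each subject, then over subjects.

    ratings[i] is a list of category sets, one per rater, for subject i.
    Each subject needs at least two raters.
    """
    per_subject = []
    for formulations in ratings:
        pairs = list(combinations(formulations, 2))
        if not pairs:
            raise ValueError("each subject needs at least two raters")
        total = sum(proportional_overlap(a, b) for a, b in pairs)
        per_subject.append(total / len(pairs))
    return sum(per_subject) / len(per_subject)


# Hypothetical example: two patients, three raters each.
example = [
    [{"depression", "anxiety"}, {"depression"}, {"depression"}],
    [{"ptsd"}, {"ptsd"}, {"ptsd"}],
]
observed = mean_observed_overlap(example)
```

Because every subject contributes one equally weighted average, this sketch also illustrates the equal-weighting assumption noted above.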

Discussion

In the discussion of inter-rater agreement measures, the paper highlights the evolution and application of various kappa statistics, particularly Cohen’s kappa and Fleiss’ kappa. Cohen’s kappa, introduced in 1960, is a chance-corrected measure of agreement that accounts for the likelihood of agreement occurring by chance. It is calculated using the formula \( \kappa = \frac{P_o - P_e}{1 - P_e} \), where \( P_o \) is the observed agreement and \( P_e \) is the expected agreement by chance. Fleiss’ kappa extends this concept to multiple raters, allowing for a fixed number of raters to classify subjects into categories, thus providing a more comprehensive framework for assessing agreement in diverse settings.
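As an illustration of the formula, here is a minimal Python sketch of the two classical statistics for mutually exclusive categories. It is not the authors' R or Excel material; the function names and the toy data are hypothetical, and the Fleiss version assumes the standard setup in which every subject is rated by the same number of raters.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who each assign one category per subject."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equally long, non-empty rating lists")
    n = len(ratings_a)
    # Observed agreement P_o: share of subjects rated identically.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected agreement P_e: product of the raters' marginal proportions,
    # summed over categories.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        raise ZeroDivisionError("kappa is undefined when P_e = 1")
    return (p_o - p_e) / (1 - p_e)


def fleiss_kappa(counts):
    """Fleiss' kappa.

    counts[i][j] is the number of raters who put subject i in category j.
    Every row must sum to the same number of raters (at least 2).
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each subject needs the same number (>= 2) of raters")
    # Per-subject agreement: share of rater pairs that agree on the subject.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_o = sum(p_i) / n_subjects
    # Chance agreement from the overall category proportions.
    n_categories = len(counts[0])
    p_j = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        raise ZeroDivisionError("kappa is undefined when P_e = 1")
    return (p_o - p_e) / (1 - p_e)


# Toy data: two raters, four subjects, categories "x" and "y".
k_cohen = cohen_kappa(["x", "x", "y", "y"], ["x", "y", "y", "y"])  # 0.5
# Toy data: three raters, two subjects, two categories.
k_fleiss = fleiss_kappa([[2, 1], [1, 2]])
```

In the Cohen example, \( P_o = 0.75 \) and \( P_e = 0.5 \), so \( \kappa = 0.25 / 0.5 = 0.5 \).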

The paper further discusses methods for averaging or pooling Cohen’s kappa when multiple categories are involved, emphasizing the limitations of direct averaging, particularly when undefined kappa values arise. Instead, it suggests pooling observed and expected agreements across categories to yield a more accurate overall kappa statistic. Additionally, it introduces the proportional overlap method and chance-corrected intraclass correlations as alternative approaches for calculating agreement among multiple raters, particularly in contexts where subjects can be classified into multiple categories. The proposed kappa statistic is presented as a generalization of Fleiss’ kappa, capable of handling hierarchical category structures and varying numbers of raters, thus enhancing its applicability in complex classification scenarios.
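The difference between averaging and pooling can be sketched in Python for two raters who make a yes/no decision per category. This is an illustrative sketch, not the authors' code: the function names and data are hypothetical, and pooling is implemented here as an unweighted mean of the per-category \( P_o \) and \( P_e \) values, which is an assumption about how the categories are combined.

```python
def binary_agreement(a, b):
    """Return (P_o, P_e) for two raters' yes/no decisions on one category."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    share_a, share_b = sum(a) / n, sum(b) / n
    # Chance agreement: both say yes, or both say no.
    p_e = share_a * share_b + (1 - share_a) * (1 - share_b)
    return p_o, p_e


def averaged_kappa(per_category):
    """Mean of per-category kappas; None if any single kappa is undefined."""
    kappas = []
    for a, b in per_category:
        p_o, p_e = binary_agreement(a, b)
        if p_e == 1.0:
            return None  # this category's kappa would divide by zero
        kappas.append((p_o - p_e) / (1 - p_e))
    return sum(kappas) / len(kappas)


def pooled_kappa(per_category):
    """Kappa computed from P_o and P_e averaged across categories."""
    pairs = [binary_agreement(a, b) for a, b in per_category]
    mean_p_o = sum(p_o for p_o, _ in pairs) / len(pairs)
    mean_p_e = sum(p_e for _, p_e in pairs) / len(pairs)
    if mean_p_e == 1.0:
        raise ZeroDivisionError("pooled kappa is undefined when mean P_e = 1")
    return (mean_p_o - mean_p_e) / (1 - mean_p_e)


# Hypothetical data: four subjects, two categories.
# The second category is never selected by either rater.
data = [
    ([True, True, False, False], [True, False, False, False]),
    ([False, False, False, False], [False, False, False, False]),
]
avg = averaged_kappa(data)   # None: the second category has P_e = 1
pooled = pooled_kappa(data)  # 0.5
```

Here the never-selected category makes its own kappa undefined, so direct averaging fails, while the pooled statistic remains computable.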