Sample size matters when estimating test–retest reliability of behaviour

Journal: Behavior Research Methods, Volume: 57, Issue: 4
DOI: https://doi.org/10.3758/s13428-025-02599-1
PMID: https://pubmed.ncbi.nlm.nih.gov/40119099
Publication Date: 2025-03-21
Author(s): B. A. Williams et al.
Primary Topic: Behavioral and Psychological Studies

Overview

This paper examines intraclass correlation coefficients (ICCs) as a metric for assessing test-retest reliability in behavioral and computational measures of reversal learning. Using a large online sample (N = 150), the study quantifies the reliability of these measures and explores how sample size affects the estimation of variance components. The findings indicate that reliable estimates of the between-subject, within-subject, and error variance components require substantially larger samples than reliability studies typically use, with the median sample size ranging from 20 to over 300 depending on the variance component.

ICC estimates were positively correlated with between-subject variance and negatively correlated with error variance, and this relationship was stable across sample sizes. By contrast, the correlation between ICC estimates and within-subject variance was weak or negligible, which highlights the need for variance decomposition in reliability assessments. The study underscores the importance of larger samples for obtaining robust reliability estimates in behavioral research.
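The relationship between the ICC and its variance components can be made explicit. As a sketch, assuming an absolute-agreement, single-measurement ICC from a two-way model (this summary does not state which ICC form the authors used), with $\sigma^2_{b}$ the between-subject variance, $\sigma^2_{w}$ the within-subject (session) variance, and $\sigma^2_{e}$ the error variance:

```latex
\mathrm{ICC} = \frac{\sigma^2_{b}}{\sigma^2_{b} + \sigma^2_{w} + \sigma^2_{e}}
```

Under this assumed form, the ICC rises with between-subject variance and falls as error variance grows. The same ICC value can also arise from quite different combinations of components, which is one reason to report the components separately.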

Introduction

The introduction of this research paper discusses the significance of conditioning tasks in understanding learning and decision-making processes within psychology and neuroscience. It highlights instrumental conditioning, where subjects learn to associate actions with outcomes, necessitating cognitive flexibility to adapt these associations in dynamic environments. A key measure of cognitive flexibility is the reversal learning task, which assesses how individuals adjust their action selections based on changing reward probabilities. The paper reviews previous findings on the reliability of behavioral and computational modeling measures in reversal learning, noting that while behavioral measures generally demonstrate good reliability, there is less consensus regarding the reliability of computational parameters derived from real subject data.

The authors reference two studies, Waltmann et al. (2022) and Schaaf et al. (2023), which explored the reliability of these measures using different methodologies and yielded varying conclusions. They emphasize the importance of sample size in estimating reliability, as smaller samples can lead to less precise estimates. The current study aims to replicate and extend previous research by utilizing a larger sample size (N = 150) to assess the reliability of reversal learning tasks and to investigate how sample size affects reliability estimates through synthetic datasets. Preliminary findings indicate that larger sample sizes are necessary for stable estimates of variance components, suggesting that typical sample sizes in test-retest studies may be insufficient for accurate reliability assessments.

Methods

The study used a test-retest design in which participants recruited online completed a reversal learning task in two sessions. Reliability was quantified with ICCs, computed both for behavioral measures (accuracy, perseveration, and reaction time) and for the parameters of reinforcement learning models fitted to participants' choices.
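As an illustration of the reliability metric, the following is a minimal sketch of how variance components and an ICC could be computed from a subjects-by-sessions score matrix using two-way ANOVA mean squares. The function names and the choice of an absolute-agreement, single-measurement ICC are assumptions made for this example; the summary does not specify the authors' exact estimator.

```python
import numpy as np

def variance_components(scores):
    """Estimate (between-subject, within-subject/session, error) variance.

    scores: array of shape (n_subjects, n_sessions).
    Uses two-way ANOVA mean squares; estimates are not clipped at zero.
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1, keepdims=True)   # one mean per subject
    col_means = x.mean(axis=0, keepdims=True)   # one mean per session
    ms_rows = k * np.sum((row_means - grand) ** 2) / (n - 1)
    ms_cols = n * np.sum((col_means - grand) ** 2) / (k - 1)
    resid = x - row_means - col_means + grand
    ms_err = np.sum(resid ** 2) / ((n - 1) * (k - 1))
    between = (ms_rows - ms_err) / k
    within = (ms_cols - ms_err) / n
    return between, within, ms_err

def icc_agreement(scores):
    """Absolute-agreement, single-measurement ICC (assumed form)."""
    between, within, error = variance_components(scores)
    return between / (between + within + error)
```

For example, a matrix whose second column equals the first yields an ICC of 1, while adding a constant shift to the second session lowers the ICC because the shift is counted as within-subject (session) variance.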

To examine the effect of sample size, the authors generated synthetic datasets and assessed how estimates of the between-subject, within-subject, and error variance components changed as the number of subjects varied.
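The idea of probing sample-size effects with synthetic data can be sketched as follows. This is not the authors' procedure: the generative model (a normally distributed subject effect, a fixed shift between two sessions, and normal noise), the parameter values, and all function names are hypothetical choices made for illustration.

```python
import numpy as np

def estimate_components(x):
    """Two-way ANOVA estimates of (between, within/session, error) variance."""
    n, k = x.shape
    grand = x.mean()
    rows = x.mean(axis=1, keepdims=True)
    cols = x.mean(axis=0, keepdims=True)
    ms_rows = k * np.sum((rows - grand) ** 2) / (n - 1)
    ms_cols = n * np.sum((cols - grand) ** 2) / (k - 1)
    ms_err = np.sum((x - rows - cols + grand) ** 2) / ((n - 1) * (k - 1))
    return np.array([(ms_rows - ms_err) / k, (ms_cols - ms_err) / n, ms_err])

def simulate_scores(n, rng, sd_between=1.0, session_shift=0.5, sd_error=0.7):
    """Synthetic two-session scores (all parameter values are hypothetical)."""
    subject = rng.normal(0.0, sd_between, size=(n, 1))   # stable trait
    session = np.array([[0.0, session_shift]])           # fixed session effect
    noise = rng.normal(0.0, sd_error, size=(n, 2))       # measurement error
    return subject + session + noise

def spread_by_sample_size(sizes, n_reps=500, seed=0):
    """Standard deviation of each component estimate across replications."""
    rng = np.random.default_rng(seed)
    spread = {}
    for n in sizes:
        estimates = np.array(
            [estimate_components(simulate_scores(n, rng)) for _ in range(n_reps)]
        )
        spread[n] = estimates.std(axis=0)
    return spread
```

Calling `spread_by_sample_size([20, 50, 150, 300])` returns, for each sample size, how much the three component estimates vary across replications; a smaller spread indicates a more stable estimate.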

Results

Across sample sizes, ICC estimates were positively correlated with between-subject variance and negatively correlated with error variance, whereas their correlation with within-subject variance was weak or negligible.

Stable estimates of the variance components required larger samples than test-retest studies typically use, with median sample sizes ranging from 20 to over 300 depending on the component.

Discussion

A total of 251 participants were recruited via the Prolific online platform across two waves; 222 completed the second phase after exclusions for failed attention checks. The final sample for analysis consisted of 150 participants (mean age = 35.45, SD = 13.30). Participants performed a reversal learning task involving two stimuli, one associated with a higher probability of reward (75%) than the other (25%). The task included nine reversals of the correct stimulus, and performance metrics such as accuracy, perseveration, and reaction time were derived from participants' responses. Participants received monetary compensation as an incentive to stay engaged throughout the task.
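The task structure described above can be sketched in code. The reward probabilities (75% and 25%) and the nine reversals come from the text; the number of trials per block, the evenly spaced reversals, and the function names are hypothetical, since the summary does not give them.

```python
import numpy as np

def make_schedule(n_reversals=9, trials_per_block=20):
    """Index (0 or 1) of the correct stimulus on each trial.

    Nine reversals give ten blocks; trials_per_block is a hypothetical value.
    """
    blocks = [np.full(trials_per_block, b % 2) for b in range(n_reversals + 1)]
    return np.concatenate(blocks)

def sample_reward(choice, correct, rng, p_high=0.75, p_low=0.25):
    """Return 1 (reward) or 0 (no reward) for a single choice."""
    p = p_high if choice == correct else p_low
    return int(rng.random() < p)

def accuracy(choices, schedule):
    """Proportion of trials on which the currently correct stimulus was chosen."""
    return float(np.mean(np.asarray(choices) == np.asarray(schedule)))
```

A simulated agent would loop over `make_schedule()`, pick a stimulus on each trial, and receive feedback from `sample_reward`; `accuracy` then scores the resulting choice sequence against the schedule.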

The study used reinforcement learning models to analyze the behavioral data, fitting two families of models that differed in how expected values influenced choices. In the softmax family, the best-fitting model was the dual update model with separate learning rates for wins and losses; in the reinforcement sensitivity family, it was a dual update model with distinct reinforcement sensitivities. Behavioral measures improved significantly between sessions, and reliability varied across metrics. The findings align with previous research on the reliability of these performance measures. Additionally, synthetic data generation was used to explore the impact of sample size on variance component estimates, revealing the sample sizes needed for these estimates to stabilize.
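To make the softmax-family model concrete, here is a minimal sketch of a dual update rule with separate learning rates for wins and losses. Several details are assumptions for illustration rather than the authors' specification: outcomes are coded +1 (win) and -1 (loss), the unchosen option is updated with the opposite outcome, expected values start at zero, and all function and parameter names are hypothetical. The fitting procedure used in the paper is not described in this summary; the function below only evaluates a likelihood.

```python
import numpy as np

def softmax_prob(q, beta):
    """Choice probabilities from expected values q with inverse temperature beta."""
    z = beta * (q - np.max(q))          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def dual_update(q, choice, outcome, alpha_win, alpha_loss):
    """Update both options after one trial (outcome: +1 win, -1 loss; assumed coding)."""
    alpha = alpha_win if outcome > 0 else alpha_loss
    q = q.copy()
    other = 1 - choice
    q[choice] += alpha * (outcome - q[choice])    # chosen option: actual outcome
    q[other] += alpha * (-outcome - q[other])     # unchosen option: opposite outcome
    return q

def negative_log_likelihood(params, choices, outcomes):
    """Negative log-likelihood of a choice sequence under the sketched model."""
    alpha_win, alpha_loss, beta = params
    q = np.zeros(2)
    nll = 0.0
    for c, r in zip(choices, outcomes):
        nll -= np.log(softmax_prob(q, beta)[c])
        q = dual_update(q, c, r, alpha_win, alpha_loss)
    return nll
```

Passing `negative_log_likelihood` to a generic optimizer would give per-participant parameter estimates, whose test-retest reliability could then be assessed in the same way as the behavioral measures.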