Beyond Bonferroni: new multiple contrast tests for time-to-event data under non-proportional hazards

Journal: Lifetime Data Analysis, Volume: 32, Issue: 1
DOI: https://doi.org/10.1007/s10985-025-09676-9
PMID: https://pubmed.ncbi.nlm.nih.gov/41533205
Publication Date: 2026-01-14
Author(s): Ina Dormuth et al.
Primary Topic: Statistical Methods in Clinical Trials

Overview

In clinical trials with multiple groups, it is important not only to detect that groups differ but also to locate which groups differ. This requires testing several individual hypotheses simultaneously, which raises the problem of controlling the familywise error rate (FWER). Traditional methods, such as the Bonferroni-corrected log-rank test, are often used, but they have notable limitations: the log-rank test loses power when the proportional hazards assumption is violated, and the Bonferroni correction loses further power because it ignores the dependence among the test statistics.
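As a point of reference, the Bonferroni adjustment mentioned above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `bonferroni_adjust`, the raw p-values, and the 5% level are assumptions chosen only for the example.

```python
def bonferroni_adjust(p_values):
    """Multiply each raw p-value by the number of tests and cap the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical raw p-values from three pairwise log-rank tests (illustrative only).
raw = [0.012, 0.030, 0.200]
adjusted = bonferroni_adjust(raw)

# Assumed 5% familywise level: reject a local hypothesis if its adjusted p-value is <= 0.05.
rejected = [p <= 0.05 for p in adjusted]
print(adjusted, rejected)
```

Because the adjustment uses only the number of tests and never the correlation between the test statistics, it becomes conservative when the statistics are strongly dependent, as they are when several comparisons share a control group.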

To address these issues, the authors propose two novel tests based on combined weighted log-rank tests: a straightforward multiple contrast test and an extension of the CASANOVA test, originally designed for factorial experiments. The proposed CASANOVA-based test demonstrates enhanced power in scenarios with crossing hazards and eliminates the need for additional p-value corrections. Through extensive Monte Carlo simulations, the performance of these new tests is evaluated across both proportional and non-proportional hazard scenarios. The results indicate that the new methods effectively control the FWER and maintain reasonable power, outperforming traditional adjusted approaches in certain non-proportional contexts.

Introduction

The introduction addresses the challenges of time-to-event (survival) analysis, which is common in fields such as medical research and the social sciences. Traditional global methods such as ANOVA-type tests are often inadequate for trials with multiple treatment arms, because they assess only whether any groups differ, not which ones. Current practice typically relies on multiple pairwise log-rank tests with a multiplicity adjustment such as the Bonferroni correction, but this can be inefficient because of restrictive assumptions about the correlation structure of the test statistics.

To address these limitations, recent advancements have introduced multiple contrast test procedures (MCTPs) and simultaneous confidence intervals (SCIs) that accommodate arbitrary correlations among test statistics. Notably, Munko et al. (2024) proposed an RMST-based MCTP for time-to-event data, but this approach is not suitable for scenarios with crossing hazards. The authors aim to fill this gap by presenting a new flexible MCTP designed for survival data with crossing hazards. They highlight the inadequacies of the log-rank test under non-proportional hazards and discuss alternative methods, such as weighted log-rank tests and the Cumulative Aalen Survival Analysis-of-Variance (CASANOVA) method, which have been developed to enhance analysis in these contexts. The proposed MCTP is supported by extensive simulation studies demonstrating its superior power in non-proportional hazard situations.
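To make the idea of a weighted log-rank test concrete, the sketch below computes a two-sample statistic with Fleming-Harrington weights, one standard weight family. It is an illustration under stated assumptions, not the procedure from the paper: the function name, the toy data, and the choice of weights are all hypothetical, and the paper's tests combine several weighted statistics rather than using a single one.

```python
import math


def weighted_logrank_z(times, events, groups, rho=0.0, gamma=0.0):
    """Two-sample Fleming-Harrington G(rho, gamma) weighted log-rank Z statistic.

    times  : observed times
    events : 1 = event, 0 = right-censored
    groups : 0 or 1 for each subject
    rho = gamma = 0 gives the ordinary (unweighted) log-rank statistic.
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    surv = 1.0  # pooled Kaplan-Meier estimate just before the current event time
    score, variance = 0.0, 0.0
    for t in event_times:
        at_risk = sum(1 for x in times if x >= t)
        at_risk_1 = sum(1 for x, g in zip(times, groups) if x >= t and g == 1)
        deaths = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        deaths_1 = sum(
            1 for x, e, g in zip(times, events, groups) if x == t and e == 1 and g == 1
        )
        weight = surv ** rho * (1.0 - surv) ** gamma
        frac_1 = at_risk_1 / at_risk
        # Observed minus expected events in group 1, weighted.
        score += weight * (deaths_1 - deaths * frac_1)
        if at_risk > 1:
            variance += (
                weight ** 2 * deaths * frac_1 * (1.0 - frac_1)
                * (at_risk - deaths) / (at_risk - 1)
            )
        surv *= 1.0 - deaths / at_risk
    return score / math.sqrt(variance) if variance > 0 else 0.0


# Hypothetical toy data: group 1 tends to fail earlier than group 0.
times = [2, 3, 5, 7, 8, 4, 6, 9, 10, 12]
events = [1, 1, 1, 0, 1, 1, 1, 0, 1, 1]
groups = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

z_logrank = weighted_logrank_z(times, events, groups)            # equal weights
z_late = weighted_logrank_z(times, events, groups, gamma=1.0)    # emphasizes late differences
print(z_logrank, z_late)
```

With `rho > 0` the statistic emphasizes early differences and with `gamma > 0` late ones, which is why combining several weights can retain power when no single hazard pattern is known in advance.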

Results

Simulations under the null hypothesis indicate that the familywise error rate (FWER) is well controlled across the survival scenarios and contrast matrix types considered, with the exception of the adjusted mdir test, which is slightly liberal for the Dunnett-type matrix. The new multiple-testing approaches, particularly the multiWeightedLR method, are notably more conservative than the adjusted methods, especially for Tukey-type contrast matrices. The FWER stayed within the expected binomial precision interval across 10,000 simulation runs.
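A binomial precision interval of this kind can be approximated as follows. This is a sketch under assumptions: the summary does not state the nominal level or how the interval was constructed, so a 5% level and the normal approximation to the binomial are assumed here, and the function name is hypothetical.

```python
import math


def binomial_precision_interval(alpha, n_runs, z=1.96):
    """Approximate 95% range for an empirical rejection rate when the true rate is alpha."""
    half_width = z * math.sqrt(alpha * (1.0 - alpha) / n_runs)
    return alpha - half_width, alpha + half_width


# Assumed nominal level of 5%; the 10,000 runs are taken from the text.
lower, upper = binomial_precision_interval(alpha=0.05, n_runs=10_000)
print(round(lower, 4), round(upper, 4))
```

Under these assumptions the interval is roughly 4.6% to 5.4%, so an empirical FWER above the upper bound would suggest liberal behavior and one below the lower bound conservative behavior.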

Under the alternative hypothesis, the power analysis showed that power for the local hypotheses decreases as the number of tests grows. The adjusted log-rank test had the highest power in proportional hazards settings, while all tests performed comparably well under non-proportional but non-crossing hazards. However, the power of the log-rank test dropped substantially in scenarios with crossing hazards. The adjusted mdir test was the most powerful in mixed settings, although it was slightly liberal under the null hypothesis. Overall, the new methods introduced in this study provide robust power across the scenarios, with the multiWeightedLR approach showing lower variability in power outcomes.
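The crossing-hazards situation can be illustrated with two Weibull hazards of different shape. The Weibull family, the shape values, and the function names are assumptions for illustration only; the summary does not specify which distributions the simulation study used.

```python
def weibull_hazard(t, shape, scale=1.0):
    """Hazard rate of a Weibull distribution at time t."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)


def crossing_time(shape_a, shape_b):
    """Time at which two unit-scale Weibull hazards with different shapes are equal."""
    return (shape_b / shape_a) ** (1.0 / (shape_a - shape_b))


# Hypothetical arms: a decreasing hazard (shape 0.5) and an increasing hazard (shape 2).
t_cross = crossing_time(0.5, 2.0)

# Sign of the hazard difference before and after the crossing point.
early_diff = weibull_hazard(0.1, 0.5) - weibull_hazard(0.1, 2.0)
late_diff = weibull_hazard(2.0, 0.5) - weibull_hazard(2.0, 2.0)
print(t_cross, early_diff, late_diff)
```

The hazard difference changes sign at the crossing time, so the early and late contributions to an unweighted log-rank statistic partly cancel, which is consistent with the loss of power described above.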

Discussion

The authors address the challenges of multiple contrasts in time-to-event analyses, emphasizing the risk of false discoveries when separate tests are applied without adjustment for multiple comparisons. They present a range of statistical methods, including traditional Bonferroni-adjusted log-rank tests and their newly developed approaches, such as the multi-directional log-rank test and maximum tests, which aim to improve robustness against multiple alternatives. The statistical framework is defined for studies with multiple treatment groups and uses cumulative hazard functions to assess treatment effects.
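The Dunnett-type and Tukey-type contrast matrices referred to in this summary follow a standard construction, sketched below. The function names and the convention that group 0 is the control are assumptions for illustration; each row defines one local comparison between two groups.

```python
from itertools import combinations


def dunnett_contrasts(k):
    """Many-to-one contrasts: each of groups 1..k-1 against group 0 (assumed control)."""
    rows = []
    for i in range(1, k):
        row = [0] * k
        row[0], row[i] = -1, 1
        rows.append(row)
    return rows


def tukey_contrasts(k):
    """All-pairs contrasts: one row for every pair of groups (i < j)."""
    rows = []
    for i, j in combinations(range(k), 2):
        row = [0] * k
        row[i], row[j] = -1, 1
        rows.append(row)
    return rows


# Example with four groups: 3 many-to-one comparisons versus 6 all-pairs comparisons.
for row in dunnett_contrasts(4):
    print("Dunnett", row)
for row in tukey_contrasts(4):
    print("Tukey  ", row)
```

For k groups the Dunnett-type matrix has k - 1 rows and the Tukey-type matrix has k(k - 1)/2, so the all-pairs design involves more simultaneous tests and a heavier multiplicity burden.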

The authors conducted a comprehensive simulation study to evaluate the performance of these methods under various survival scenarios, including proportional and non-proportional hazards. The results indicated that while most methods effectively controlled the FWER, the adjusted mdir test tended towards liberal behavior, particularly with Dunnett-type contrasts. Conversely, the multiWeightedLR and multiCASANOVA methods were more conservative, potentially reducing Type I errors but at the cost of statistical power. Notably, the adjusted log-rank test demonstrated superior power under proportional hazards, while the new methods maintained robustness in non-proportional scenarios. The findings suggest that while traditional methods may yield more statistically significant results, the new approaches provide reliable alternatives for analyzing time-to-event data and warrant further exploration in clinical research.