Wasserstein GAN-based estimation for conditional distribution function with current status data

Journal: Lifetime Data Analysis, Volume: 32, Issue: 1
DOI: https://doi.org/10.1007/s10985-026-09691-4
PMID: https://pubmed.ncbi.nlm.nih.gov/41618053
Publication Date: 2026-01-31
Author(s): Wen Su et al.
Primary Topic: Statistical Methods and Inference

Overview

In this section, the authors address the challenges posed by current status data, which are common in medicine, econometrics, and social science. They highlight the limitations of existing analytical methods, particularly when the underlying models are misspecified. To overcome these limitations, the authors propose a novel model-free, two-stage generative approach for estimating the conditional cumulative distribution function of the failure time given the predictors.

The first stage learns, nonparametrically, a conditional generator for the joint distribution of observation times and event status. In the second stage, the authors derive nonparametric maximum likelihood estimators for the conditional distribution functions using samples drawn from the conditional generator. They establish the convergence properties of their estimator, demonstrating its consistency. In simulation studies across various scenarios, the proposed deep conditional generative approach outperforms classical methods. Additionally, an application to Parvovirus B19 seroprevalence data illustrates the practical utility of the approach, yielding reasonable predictive outcomes.

Introduction

The introduction discusses the challenges of analyzing interval-censored data, which are common in fields such as medicine and the social sciences. Interval censoring occurs when the exact failure time is not observed and is known only to fall within a specific interval. Current status data are a special case in which each subject is examined once, so the only information available is whether the failure occurred before the examination time. This is exemplified by a seroprevalence survey of Parvovirus B19, where the time of infection is not recorded and only the infection status at the time of testing is known. The section highlights the complexity of analyzing such data and reviews various nonparametric and semiparametric methods developed to address these challenges, including the nonparametric maximum likelihood estimator (NPMLE) and likelihood ratio tests.
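To make the data structure concrete, the following is a minimal simulation sketch of current status data. The predictor, the exponential failure-time model, and the uniform examination times are illustrative assumptions, not the settings used by the authors; the function name `simulate_current_status` is hypothetical.

```python
import numpy as np

def simulate_current_status(n, seed=0):
    """Simulate (x, c, delta) with delta = 1{T <= C}; T itself is never returned."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=n)        # single predictor (assumed for illustration)
    t = rng.exponential(scale=np.exp(-x))    # latent failure time, depends on x (assumed model)
    c = rng.uniform(0.0, 2.0, size=n)        # examination time (assumed uniform on [0, 2])
    delta = (t <= c).astype(int)             # only the status at examination is observed
    return x, c, delta

x, c, delta = simulate_current_status(1000)
```

The analyst sees only `x`, `c`, and `delta`; the latent failure time `t` is discarded, which is what distinguishes current status data from right-censored data.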

Recent advances in computational power and machine learning, particularly the development of generative adversarial networks (GANs), have opened new avenues for analyzing current status data. The authors propose a novel generative conditional distribution estimator (GCDE) built on a two-step process: first learning a conditional generator for the observation time and event status, then estimating the conditional cumulative distribution function from generated samples. This method aims to overcome limitations of traditional approaches, particularly in cases of model misspecification. The introduction concludes by outlining the structure of the paper, which covers the methodology, asymptotic consistency, simulation studies, and an application to the Parvovirus B19 data.

Discussion

In this section, the authors propose a two-stage method based on generative adversarial networks (GANs) for estimating the conditional distribution of failure times in survival analysis, particularly for current status data, a form of interval censoring. The first stage learns a conditional generator that models the joint distribution of examination times and a binary indicator of failure status, given a set of predictors. The authors use the 1-Wasserstein distance as the discrepancy measure so that the learned distribution closely matches the true joint distribution of the observed data. The conditional generator is estimated with feedforward neural networks, which are trained by solving a minimax optimization problem.
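The minimax problem can be written out under assumed notation; the symbols below are not taken from the paper, and its exact formulation (empirical version, network classes, any Lipschitz penalty) may differ. Let $X$ denote the predictors, $C$ the examination time, $\delta = 1(T \le C)$ the failure indicator, $\eta$ noise drawn from a fixed reference distribution, $G$ the generator mapping $(\eta, X)$ to a pair $(C, \delta)$, and $D$ a critic. The first line is the Kantorovich-Rubinstein dual form of the 1-Wasserstein distance, and the second is the resulting generator objective:

```latex
W_1(P, Q) = \sup_{\|D\|_{\mathrm{Lip}} \le 1}
  \Bigl\{ \mathbb{E}_{U \sim P}\bigl[D(U)\bigr] - \mathbb{E}_{V \sim Q}\bigl[D(V)\bigr] \Bigr\},
\qquad
G^{*} \in \arg\min_{G} \, \max_{\|D\|_{\mathrm{Lip}} \le 1}
  \Bigl\{ \mathbb{E}\bigl[D\bigl(X, G(\eta, X)\bigr)\bigr] - \mathbb{E}\bigl[D(X, C, \delta)\bigr] \Bigr\}.
```

In practice both $G$ and $D$ would be restricted to neural network classes and the expectations replaced by sample averages.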

In the second stage, the authors estimate the conditional distribution function of the failure time nonparametrically from samples drawn from the conditional generator. They derive a nonparametric maximum likelihood estimator (NPMLE) for the conditional distribution function and show that the resulting estimator is valid and consistent. The proposed method is shown to outperform traditional models, such as the Cox proportional hazards model and accelerated failure time models, particularly in scenarios where model assumptions are violated. The authors highlight the robustness and flexibility of their approach, which uses machine learning techniques to strengthen statistical inference in survival analysis, with potential benefits for the analysis of clinical outcomes.
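For current status data, the NPMLE of a distribution function is known to coincide with the isotonic regression of the failure indicators on the ordered examination times, which the pool-adjacent-violators algorithm computes. The sketch below applies that classical result to a set of (examination time, indicator) pairs, such as pairs drawn from a conditional generator at a fixed predictor value. It is an illustration, not the authors' implementation; the function name `npmle_current_status` is hypothetical, and distinct examination times are assumed.

```python
import numpy as np

def npmle_current_status(c, delta):
    """NPMLE of F at the sorted examination times via pool-adjacent-violators.

    Assumes distinct examination times (ties are not pooled beforehand).
    """
    c = np.asarray(c, dtype=float)
    delta = np.asarray(delta, dtype=float)
    if c.size == 0:
        return c, delta
    order = np.argsort(c, kind="stable")
    c_sorted, d_sorted = c[order], delta[order]

    # Each block stores [sum of indicators, number of observations].
    blocks = []
    for d in d_sorted:
        blocks.append([d, 1])
        # Merge while the previous block's mean exceeds the last block's mean
        # (compared by cross-multiplication to avoid division).
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            s, m = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += m

    # Expand block means back to one fitted value per observation.
    f_hat = np.concatenate([np.full(m, s / m) for s, m in blocks])
    return c_sorted, f_hat

times, f_hat = npmle_current_status([1.0, 2.0, 3.0, 4.0], [0, 1, 0, 1])
# f_hat is [0.0, 0.5, 0.5, 1.0]: the violating pair (1, 0) is pooled to its mean.
```

The fitted values are nondecreasing and lie in [0, 1] by construction, so the result is a valid distribution function evaluated at the examination times.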