Variable Selection via Knockoffs in Missing Data Settings with Categorical Predictors

Journal: Psychometrika
DOI: https://doi.org/10.1017/psy.2026.10109
PMID: https://pubmed.ncbi.nlm.nih.gov/42117181
Publication Date: 2026-05-12
Author(s): Silvia Bacci et al.
Primary Topic: Bayesian Methods and Mixture Models

Overview

This paper extends the knockoff method for predictor selection to large-scale assessment data with missing values. The authors add a preliminary multiple imputation (MI) phase to handle the missing values and then apply a knockoff filter to each imputed dataset. Simulation studies indicate that the approach performs satisfactorily and is in line with state-of-the-art methods.

Furthermore, the method is applied to a case study involving INVALSI data on test scores of Italian fifth-grade students, which presents unique challenges due to the presence of unordered categorical predictors and missing values among key variables. The authors conclude that their proposed integration of the knockoffs method within an MI framework is not only feasible but also flexible and effective for handling such complex data scenarios.

Introduction

The introduction of the research paper addresses the critical challenge of selecting relevant predictors in statistical models, particularly in high-dimensional settings where numerous predictors may lead to the inclusion of null variables (predictors with a regression coefficient of zero) or the exclusion of non-null variables. This issue is notably prevalent in educational assessments, such as those conducted by the Italian National Institute for the Evaluation of the Education and Training System (INVALSI), which collects extensive data on student performance and associated variables. Effective variable selection methods are essential, especially given the high rates of missing data (up to 26%) in such datasets.

The paper discusses various variable selection strategies, including traditional methods such as forward and backward selection, as well as modern techniques such as the knockoff filter, which aims to control the false discovery rate (FDR) or the per-family error rate (PFER). While the knockoff filter has shown promise, existing methods are designed mainly for continuous predictors and require complete data. Recent advancements, such as those by Xie et al. (2023), attempt to accommodate mixed variable types and missing data but rely on restrictive assumptions regarding the missingness mechanism. The authors propose a novel approach that embeds the knockoff method in a multiple imputation framework, which allows a more flexible treatment of missing data and keeps the imputation and variable selection phases separate. The method aims to make variable selection more robust in complex datasets such as those from INVALSI and to address the limitations of current methodologies.
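To make the FDR-controlling step of a knockoff filter concrete, the following is a minimal sketch of the widely used knockoff+ thresholding rule. It assumes that a vector of feature statistics `W` has already been computed, where a large positive `W[j]` suggests that predictor `j` beats its knockoff copy. The function names, the target level `q`, and the example values are illustrative assumptions and are not taken from the paper, which may use a different statistic or a PFER-oriented variant.

```python
import numpy as np

def knockoff_plus_threshold(W, q=0.1):
    """Smallest t > 0 with (1 + #{W_j <= -t}) / max(1, #{W_j >= t}) <= q.

    Returns np.inf when no candidate threshold satisfies the bound.
    """
    W = np.asarray(W, dtype=float)
    # Candidate thresholds are the distinct non-zero magnitudes of W.
    candidates = np.sort(np.unique(np.abs(W[W != 0])))
    for t in candidates:
        # Negative statistics act as a proxy for the number of false positives.
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return t
    return np.inf

def knockoff_select(W, q=0.1):
    """Indices of predictors whose statistic reaches the knockoff+ threshold."""
    W = np.asarray(W, dtype=float)
    tau = knockoff_plus_threshold(W, q)
    return np.flatnonzero(W >= tau)

# Hypothetical statistics for ten predictors (illustration only).
W_example = [5.0, 4.0, 3.0, 2.5, 2.0, 1.5, -1.0, 0.5, -0.3, 0.2]
selected = knockoff_select(W_example, q=0.2)
```

The sketch covers only the thresholding step; constructing valid knockoff copies for categorical predictors, which is the hard part, is not shown.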

Methods

In this section, the authors detail a Monte Carlo simulation study that evaluates the proposed combination of the knockoff method with multiple imputation (MI) for handling missing data. The study considers two MI-based knockoff methods: the derandomized knockoff filter (MI-RWC) and the sparse sequential knockoff filter (MI-seq). Each knockoff method is run on 10 imputed datasets, and a selection proportion threshold of 0.5 is then used to combine the results into a single set of selected variables. The performance of these methods is compared against the established approach by Xie et al. (2023), referred to as XCDW, as well as a standard lasso applied to the imputed datasets (MI-lasso).
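The combination step across imputed datasets can be sketched as follows. This is a minimal illustration, assuming that each imputed dataset yields a set of selected predictor indices and that a predictor is kept when its selection proportion is greater than or equal to the threshold. The use of "greater than or equal", the function name `aggregate_selections`, and the toy inputs are assumptions, not details confirmed by the paper.

```python
import numpy as np

def aggregate_selections(selections, n_predictors, threshold=0.5):
    """Combine per-imputation selections into one set.

    selections   : list of index collections, one per imputed dataset
    n_predictors : total number of candidate predictors
    threshold    : minimum selection proportion required to keep a predictor
    Returns (kept_indices, proportions).
    """
    counts = np.zeros(n_predictors)
    for sel in selections:
        # np.unique guards against an index being counted twice in one dataset.
        idx = np.unique(np.asarray(sel, dtype=int))
        counts[idx] += 1
    proportions = counts / len(selections)
    return np.flatnonzero(proportions >= threshold), proportions

# Toy example with four imputed datasets and five predictors (hypothetical).
toy_selections = [[0, 1, 2], [0, 2], [0, 3], [0, 2, 4]]
kept_half, props = aggregate_selections(toy_selections, n_predictors=5, threshold=0.5)
kept_all, _ = aggregate_selections(toy_selections, n_predictors=5, threshold=1.0)
```

Raising `threshold` toward 1 keeps only predictors selected in every imputed dataset, which trades power for fewer false selections.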

The simulation setup consists of 50 binary and 50 continuous variables, with a specified correlation structure and a linear regression model generating the response variable. Missing values are introduced through a missing at random (MAR) mechanism that depends on both non-null and null observed variables. The authors systematically vary the rate of missingness and summarize the results using the per-family error rate (PFER, the expected number of falsely selected variables), the false discovery rate (FDR), and the true positive rate (TPR). The empirical versions of these indices are used to evaluate the different variable selection methods under varying amounts of missing data.
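The empirical indices can be computed from the selections made in each Monte Carlo replication. The sketch below uses the standard definitions: the empirical PFER is the average number of false selections, the empirical FDR is the average proportion of false selections among all selections (taken as 0 when nothing is selected), and the empirical TPR is the average share of non-null predictors that are recovered. The function name and the example replications are hypothetical.

```python
import numpy as np

def selection_metrics(selected_sets, non_null):
    """Empirical PFER, FDR and TPR over Monte Carlo replications.

    selected_sets : list of index collections, one per replication
    non_null      : indices of predictors with a non-zero true coefficient
    """
    non_null = set(non_null)
    false_counts, fdp, tpp = [], [], []
    for sel in selected_sets:
        sel = set(sel)
        n_false = len(sel - non_null)   # selected null predictors
        n_true = len(sel & non_null)    # selected non-null predictors
        false_counts.append(n_false)
        fdp.append(n_false / max(len(sel), 1))
        tpp.append(n_true / len(non_null))
    return {
        "PFER": float(np.mean(false_counts)),
        "FDR": float(np.mean(fdp)),
        "TPR": float(np.mean(tpp)),
    }

# Two hypothetical replications with four non-null predictors (0 to 3).
metrics = selection_metrics([[0, 1, 2, 5], [0, 1, 2, 3, 6, 7]], non_null=[0, 1, 2, 3])
```

Because the PFER is a count rather than a proportion, values above 1, such as those reported in the results, are expected.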

Results

The Monte Carlo simulations show that both derandomized and sparse sequential knockoffs effectively control the per-family error rate (PFER) in the absence of missing values, with empirical PFER values of 1.37 and 1.51, respectively, and high true positive rates (TPR) of 0.99 and 1.00. As the rate of missing data increases from 10% to 45%, the performance of all methods declines. MI-seq shows a clear advantage over MI-RWC, particularly at 32% missing values, where MI-RWC fails to control the PFER. In contrast, MI-seq keeps the empirical PFER below 2 when the selection proportion is set to 0.9, although the TPR then drops to 0.65 at 45% missingness.

Further analysis at a 32% missing data rate indicates that the XCDW method performs comparably under both missing data mechanisms (SMAR and MAR), whereas the performance of the MI-based approaches varies with the selection proportion. MI-RWC approaches the nominal PFER of 2 only at a selection proportion of 1, while MI-seq achieves similar performance with a selection proportion between 0.8 and 0.9. The findings suggest that the optimal selection proportion for MI-based methods depends on the extent of missing data, with higher missing rates requiring stricter selection rules to maintain error control. Overall, MI-seq emerges as a robust alternative to XCDW, particularly in scenarios with substantial missingness, although stricter selection proportions come at the cost of a lower TPR.

Discussion

In this section, the authors discuss a novel approach to variable selection with the knockoff filter in the presence of missing data, designed for both continuous and categorical variables. They propose a multi-step procedure termed MI-seq, which uses multiple imputation (MI) to handle missing values. The procedure first imputes the missing data with MI techniques, then applies the knockoff filter to each imputed dataset, and finally consolidates the selected variables across datasets. The authors emphasize the flexibility of their method, in particular its ability to accommodate various variable types and the fact that it requires only the missing at random (MAR) assumption, in contrast with existing methods that may impose stricter conditions.

The application of MI-seq is illustrated through a case study using INVALSI data on student achievement in mathematics. The results indicate a high degree of stability in variable selection, with 22 out of 30 predictors consistently identified across imputed datasets. Notably, the findings align with established literature on educational achievement, revealing significant effects of parental education and socioeconomic status on student performance. The authors also address potential limitations, such as the underestimation of standard errors due to the selection process and the challenges of post-selection inference. Overall, the MI-seq method performs well in identifying relevant predictors while maintaining predictive performance, making it a valuable tool for analyzing datasets with missing values.