Adjusting for confounding in population administrative data when confounders are only measured in a linked cohort

Journal: International Journal for Population Data Science, Volume: 11, Issue: 1
DOI: https://doi.org/10.23889/ijpds.v11i1.3015
PMID: https://pubmed.ncbi.nlm.nih.gov/41924190
Publication Date: 2026-03-06
Author(s): Richard J Silverwood et al.
Primary Topic: demographic modeling and climate adaptation

Overview

The research addresses the challenge of residual confounding in analyses of population administrative data, which often lack comprehensive control variables. The authors propose a multiple imputation-based approach to mitigate this issue, particularly when linked cohort and administrative data are siloed. This method is demonstrated through simulated scenarios and applied to a real-world case examining the relationship between pupil mobility and Key Stage 2 attainment using data from the UK National Pupil Database (NPD) and the Millennium Cohort Study (MCS).

The analysis included 509,670 individuals from the NPD, of whom 7,768 (1.5%) were MCS cohort members. The unadjusted estimate of the association between pupil mobility and attainment was -1.86 (95% CI -1.92, -1.81). It attenuated to -0.92 (95% CI -0.97, -0.88) after adjustment for NPD control variables, and to -0.76 (95% CI -0.86, -0.67) after additional MI-based adjustment for MCS control variables. The findings indicate that linking cohort data can address residual confounding and so strengthen analyses based on administrative data. The authors call for further research to explore broader applications of their proposed methodology.

Introduction

The introduction discusses the growing use of administrative data from UK government and public bodies. These datasets are large and population-representative, but they often lack information needed for comprehensive research. For example, they may rely on suboptimal indicators of socioeconomic status such as the Index of Multiple Deprivation, which can lead to erroneous conclusions. In contrast, national longitudinal cohort studies provide rich, multidisciplinary data but suffer from limited population coverage and attrition, both of which can affect their representativeness.

The paper emphasizes the potential benefits of linking cohort data with administrative data so that each source compensates for the other's weaknesses. In particular, linkage can reduce the confounding bias that arises from the limited control variables available in administrative datasets. The authors propose multiple imputation (MI) as the strategy: confounders that are measured in the cohort but unobserved in the administrative data are treated as a missing data problem. MI creates multiple completed datasets, analyzes each one, and combines the results so that uncertainty about the missing values is reflected in the final estimate. The paper aims to explore how well linked cohort data can be used to handle confounding in siloed administrative datasets, illustrating the approach with both simulated and real-world examples.
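The combining step in MI is conventionally done with Rubin's rules: the pooled estimate is the mean of the per-dataset estimates, and the total variance adds the average within-imputation variance to an inflated between-imputation variance. The sketch below is a minimal illustration of that standard rule, not the authors' code. The function name `pool_rubin` and the input numbers are hypothetical and are not values from the paper.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool M point estimates and their squared standard errors with Rubin's rules."""
    q = np.asarray(estimates, dtype=float)
    w = np.asarray(variances, dtype=float)
    m = q.size
    q_bar = q.mean()                 # pooled point estimate
    w_bar = w.mean()                 # average within-imputation variance
    b = q.var(ddof=1)                # between-imputation variance
    total = w_bar + (1 + 1 / m) * b  # total variance of the pooled estimate
    return q_bar, np.sqrt(total)

# Hypothetical estimates from M = 5 imputed datasets (illustrative only).
est, se = pool_rubin([-0.74, -0.78, -0.75, -0.77, -0.76], [0.002] * 5)
print(round(est, 3), round(se, 3))
```

The between-imputation term is what widens the confidence interval relative to an analysis that treats imputed values as if they had been observed.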

Methods

The authors organize their methods around three scenarios that differ in how the cohort and administrative datasets can be accessed and combined. In Scenario 1, cohort members can be identified within the administrative dataset. In Scenario 2, they cannot be identified. In Scenario 3, the cohort and administrative data cannot be analyzed together at all. Each scenario calls for a different implementation of the MI-based approach.

Matching the method to the data structure is intended to keep the resulting estimates valid whichever access arrangement applies.

Results

The results section first presents estimates of the X-Y association from the simulated example across nine models, as illustrated in Figure 2 and Table S1. Model 1 establishes a baseline using data from all 100,000 individuals, fully adjusted for confounders C and U, and represents the “known truth.” Models 2-4 use a cohort of 10,000 individuals with observed values for C and U. The unadjusted estimate in model 2 shows substantial bias due to confounding, which is partially reduced in model 3 by adjusting for C. Model 4, which adjusts for both C and U, yields estimates that closely align with the known truth, albeit with wider confidence intervals because of the smaller sample size. In the population administrative dataset, models 5 and 6 produce unadjusted and C-adjusted estimates similar to those from the cohort, but with greater precision. Further adjustment using the multiple imputation (MI) approaches in models 7-9 yields estimates that closely approximate the known truth across all scenarios, with only slightly wider confidence intervals.
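The pattern across the unadjusted, C-adjusted, and fully adjusted models can be reproduced in a small simulation. The sketch below is an illustration under assumptions of my own, not the authors' simulation design: it uses a continuous outcome with linear regression for simplicity, an arbitrary true effect of -0.5, and arbitrary confounder strengths. The exposure is drawn first and C and U are shifted by it, which is simply a convenient way to induce the X-C and X-U associations.

```python
import numpy as np

def ols_coef(y, *cols):
    """Return OLS coefficients of y on an intercept plus the given columns."""
    design = np.column_stack([np.ones(len(y)), *cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

rng = np.random.default_rng(0)
n = 100_000
beta_true = -0.5  # assumed true X-Y effect (hypothetical value)

# Hypothetical data-generating process: C and U are both associated with X and Y.
x = rng.binomial(1, 0.3, n).astype(float)
c = 0.8 * x + rng.normal(size=n)
u = 0.8 * x + rng.normal(size=n)
y = beta_true * x + c + u + rng.normal(size=n)  # continuous outcome (assumption)

unadjusted = ols_coef(y, x)[1]            # ignores both confounders
c_adjusted = ols_coef(y, x, c)[1]         # adjusts for C only, as admin data allows
fully_adjusted = ols_coef(y, x, c, u)[1]  # the "known truth" analysis

# Fully adjusted analysis restricted to a 10,000-person cohort subset.
cohort = slice(0, 10_000)
cohort_full = ols_coef(y[cohort], x[cohort], c[cohort], u[cohort])[1]

print(unadjusted, c_adjusted, fully_adjusted, cohort_full)
```

In this setup the bias shrinks as confounders are added, and the cohort-only estimate is centered near the truth but is noisier because it uses a tenth of the data.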

In the real-world application, the analysis sample comprised 509,670 individuals from the population NPD data, including 7,768 (1.5%) MCS cohort members. Descriptive statistics are provided in Table 4. The estimated associations between pupil mobility and average KS2 points score are detailed in Figure 3 and Table S2. In the linked MCS-NPD data, the analysis sample included 7,256 MCS cohort members. There, the unadjusted estimate of -1.70 (95% CI -2.24, -1.15) was attenuated to -0.80 (95% CI -1.28, -0.32) after adjusting for NPD control variables, and further to -0.67 (95% CI -1.16, -0.17) after also adjusting for MCS control variables. In the population NPD data, the unadjusted estimate of -1.86 (95% CI -1.92, -1.81) was similarly attenuated to -0.92 (95% CI -0.97, -0.88) after adjusting for NPD control variables. The MI-based approach allowed additional adjustment for MCS control variables and gave an estimate of -0.76 in all three scenarios. Precision decreased from scenario 1 to scenario 3, with 95% confidence intervals of (-0.86, -0.67), (-0.89, -0.63), and (-0.90, -0.61), respectively.
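The size of the attenuation and the loss of precision across scenarios can be read directly off the reported numbers. The short calculation below uses only figures quoted in this summary. The helper name `attenuation` is hypothetical, and expressing attenuation as a share of the unadjusted estimate is my own framing rather than a quantity reported by the authors.

```python
# Figures quoted in the summary for the population NPD data.
unadjusted = -1.86
npd_adjusted = -0.92
mi_adjusted = -0.76

def attenuation(crude, adjusted):
    """Share of the crude estimate removed by adjustment."""
    return 1 - adjusted / crude

att_npd = attenuation(unadjusted, npd_adjusted)
att_mi = attenuation(unadjusted, mi_adjusted)

# Width of the reported 95% confidence intervals for scenarios 1 to 3.
intervals = {1: (-0.86, -0.67), 2: (-0.89, -0.63), 3: (-0.90, -0.61)}
widths = {s: round(hi - lo, 2) for s, (lo, hi) in intervals.items()}

# Share of the NPD analysis sample who are MCS cohort members.
cohort_share = 7_768 / 509_670

print(round(att_npd, 3), round(att_mi, 3), widths, round(100 * cohort_share, 1))
```

On these numbers, NPD controls remove roughly half of the unadjusted association and the MCS controls remove a further share, while the interval width grows from 0.19 in scenario 1 to 0.29 in scenario 3.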

Discussion

The authors present a methodology for addressing confounding in analyses of population administrative data by leveraging linked cohort data, demonstrated through simulated and real-world examples. The approach uses multiple imputation (MI) to account for confounders that are unobserved in the administrative data. The simulated example considers a binary exposure ($X$) and a binary outcome ($Y$) in a large administrative dataset of 100,000 individuals. It distinguishes between confounders observed in the administrative data ($C$) and those observed only in the cohort ($U$), and shows that analyses based solely on the administrative dataset yield biased estimates because $U$ is omitted. By incorporating linked cohort data, the authors can impute values of $U$ and obtain unbiased estimates of the $X$-$Y$ association.
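A minimal sketch of the idea for the simplest case, where cohort members are flagged inside the administrative data, might look as follows. It is an illustration under assumptions of my own rather than the authors' implementation: the outcome is continuous, $U$ is imputed from a normal linear regression on $X$, $C$, and $Y$ fitted in the cohort, the first 10,000 rows play the role of the cohort, and the number of imputations (20) is arbitrary. The imputation model includes the outcome, which is standard practice when imputing a covariate.

```python
import numpy as np

def ols(y, design):
    """OLS fit: coefficients, their covariance matrix, and the residual variance."""
    xtx_inv = np.linalg.inv(design.T @ design)
    beta = xtx_inv @ design.T @ y
    resid = y - design @ beta
    sigma2 = resid @ resid / (len(y) - design.shape[1])
    return beta, sigma2 * xtx_inv, sigma2

rng = np.random.default_rng(1)
n, n_cohort, m = 100_000, 10_000, 20
beta_true = -0.5  # assumed true X-Y effect (hypothetical value)

# Hypothetical data-generating process with a continuous outcome (assumption).
x = rng.binomial(1, 0.3, n).astype(float)
c = 0.8 * x + rng.normal(size=n)
u = 0.8 * x + rng.normal(size=n)
y = beta_true * x + c + u + rng.normal(size=n)
in_cohort = np.arange(n) < n_cohort  # cohort members are identifiable
miss = ~in_cohort                    # U is treated as missing for everyone else

ones = np.ones(n)
imp_design = np.column_stack([ones, x, c, y])  # imputation model includes Y
b_hat, b_cov, s2_hat = ols(u[in_cohort], imp_design[in_cohort])
df = n_cohort - imp_design.shape[1]

estimates, variances = [], []
for _ in range(m):
    # Draw imputation-model parameters so their uncertainty is propagated.
    s2 = s2_hat * df / rng.chisquare(df)
    b = rng.multivariate_normal(b_hat, b_cov * s2 / s2_hat)
    u_imp = u.copy()  # keep observed cohort values, overwrite the rest
    u_imp[miss] = imp_design[miss] @ b + rng.normal(scale=np.sqrt(s2), size=miss.sum())
    coef, cov, _ = ols(y, np.column_stack([ones, x, c, u_imp]))
    estimates.append(coef[1])
    variances.append(cov[1, 1])

# Pool across imputations with Rubin's rules.
q = np.array(estimates)
pooled = q.mean()
pooled_se = np.sqrt(np.mean(variances) + (1 + 1 / m) * q.var(ddof=1))
c_adjusted = ols(y, np.column_stack([ones, x, c]))[0][1]
print(pooled, pooled_se, c_adjusted)
```

In this toy setup the pooled estimate lands much closer to the assumed true effect than the $C$-adjusted estimate does, which mirrors the qualitative pattern the authors describe.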

The discussion also covers the three data access scenarios. In Scenario 1, cohort members are identifiable within the administrative dataset, and MI is applied straightforwardly to impute $U$. In Scenario 2, cohort members cannot be identified within the administrative dataset, so care is needed to avoid duplicated observations. Scenario 3 is the most challenging: the cohort and administrative data cannot be analyzed together, which requires a novel MI implementation that separates development of the imputation model from its application. The authors conclude that their MI-based strategy mitigates confounding and improves the accuracy of estimates in both simulated and real-world contexts, as demonstrated by the analysis of pupil mobility and KS2 attainment using linked data from the Millennium Cohort Study and the National Pupil Database.
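One way to picture the separation described for Scenario 3 is that the imputation model is fitted where the cohort data live, only its summary parameters are exported, and those parameters are then used to impute $U$ where the administrative data live. The sketch below illustrates that reading and should not be taken as the authors' implementation, which this summary does not describe in detail. The function names, the JSON hand-off, the normal linear imputation model, and the simulated data are all assumptions.

```python
import json
import numpy as np

def fit_imputation_model(u, design):
    """Cohort side: fit U on the design by OLS and return summary parameters only."""
    xtx_inv = np.linalg.inv(design.T @ design)
    coef = xtx_inv @ design.T @ u
    resid = u - design @ coef
    df = len(u) - design.shape[1]
    sigma2 = float(resid @ resid / df)
    return {"coef": coef.tolist(), "xtx_inv": xtx_inv.tolist(), "sigma2": sigma2, "df": df}

def impute_u(params, design, rng):
    """Admin side: draw model parameters, then draw one set of imputed U values."""
    sigma2 = params["sigma2"] * params["df"] / rng.chisquare(params["df"])
    coef = rng.multivariate_normal(params["coef"], sigma2 * np.array(params["xtx_inv"]))
    return design @ coef + rng.normal(scale=np.sqrt(sigma2), size=design.shape[0])

def simulate(n, rng):
    """Hypothetical data-generating process; returns (imputation design, U)."""
    x = rng.binomial(1, 0.3, n).astype(float)
    c = 0.8 * x + rng.normal(size=n)
    u = 0.8 * x + rng.normal(size=n)
    y = -0.5 * x + c + u + rng.normal(size=n)
    return np.column_stack([np.ones(n), x, c, y]), u

rng = np.random.default_rng(2)
cohort_design, cohort_u = simulate(10_000, rng)
# admin_u_true exists only because this is a simulation; it is never used for fitting.
admin_design, admin_u_true = simulate(100_000, rng)

# Only this string of summary parameters crosses between the two environments.
exported = json.dumps(fit_imputation_model(cohort_u, cohort_design))
params = json.loads(exported)
admin_u_imputed = impute_u(params, admin_design, rng)
print(admin_u_imputed[:3])
```

Calling `impute_u` repeatedly would give the multiple completed datasets, each of which could then be analyzed and pooled in the administrative environment without any individual-level cohort records being moved.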