Improved distance correlation estimation

Journal: Applied Intelligence, Volume: 55, Issue: 4
DOI: https://doi.org/10.1007/s10489-024-05940-x
Publication Date: 2025-01-03
Author(s): Blanca E. Monroy-Castillo et al.
Primary Topic: Statistical Methods and Inference

Overview

This summary covers a paper on distance correlation, a multivariate dependence measure that takes values between 0 and 1 and applies to random vectors of arbitrary dimensions. Unlike the Pearson correlation coefficient, distance correlation is zero if and only if the random vectors are independent. The paper evaluates two estimators of distance correlation: the V-statistic estimator proposed by Székely et al. (2007) and the U-statistic estimator introduced by Székely and Rizzo (2014). It compares them in terms of mean squared error (MSE) and computation time across various dependence structures, and finds that the V-estimator is biased under independence, while the U-estimator can yield negative values, which complicates its use.
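For reference, both properties follow from the standard population definition given by Székely et al. (2007). The weight function \(w\) and the constants \(c_d\) below do not appear in this summary; they are quoted from that definition and assume finite first moments of \(X \in \mathbb{R}^p\) and \(Y \in \mathbb{R}^q\), with \(\varphi\) denoting characteristic functions and \(V^2(X) = V^2(X, X)\):

\[
V^2(X, Y) = \int_{\mathbb{R}^{p+q}} \left| \varphi_{X,Y}(t, s) - \varphi_X(t)\,\varphi_Y(s) \right|^2 w(t, s)\, dt\, ds,
\qquad
w(t, s) = \frac{1}{c_p\, c_q\, |t|^{1+p}\, |s|^{1+q}},
\qquad
c_d = \frac{\pi^{(1+d)/2}}{\Gamma\!\left((1+d)/2\right)}
\]

\[
R^2(X, Y) =
\begin{cases}
\dfrac{V^2(X, Y)}{\sqrt{V^2(X)\, V^2(Y)}}, & V^2(X)\, V^2(Y) > 0, \\[2ex]
0, & V^2(X)\, V^2(Y) = 0.
\end{cases}
\]

The integrand is non-negative and vanishes everywhere exactly when the joint characteristic function factorizes, which is why \(R(X, Y) = 0\) characterizes independence.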

In the conclusions, the authors stress that the best estimator of the distance correlation between vectors \(X\) and \(Y\) depends on the context. The U-estimator (dCorU) is generally preferable under independence, but negative values can prevent its calculation in low-dependence scenarios. The study suggests truncating the U-statistic or taking its absolute value as alternatives. For linear models, the V-estimator (dCorV) performs better in terms of MSE, while the U-estimator performs better in non-linear settings, although the differences are small. A new estimator, a convex linear combination of dCorU and dCorV, is proposed; it often outperforms the individual estimators, particularly when the parameter \(\lambda_0\) is estimated using bootstrap methods. However, computation time grows with the number of bootstrap iterations, which complicates practical use.
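To see why a single weight can be called optimal, the following worked derivation may help. It is a sketch under one assumption that this summary does not confirm: that \(\lambda_0\) is the weight minimizing the MSE of the combination \(\lambda\,\mathrm{dCorU} + (1 - \lambda)\,\mathrm{dCorV}\) with respect to the true distance correlation \(R\). The paper's exact definition may differ.

\[
\mathrm{MSE}(\lambda)
= E\!\left[ \bigl( \mathrm{dCorV} - R + \lambda\,(\mathrm{dCorU} - \mathrm{dCorV}) \bigr)^2 \right]
\]

\[
\frac{d}{d\lambda}\,\mathrm{MSE}(\lambda) = 0
\quad\Longrightarrow\quad
\lambda^{*} = \frac{E\!\left[ (\mathrm{dCorV} - R)\,(\mathrm{dCorV} - \mathrm{dCorU}) \right]}{E\!\left[ (\mathrm{dCorU} - \mathrm{dCorV})^2 \right]}
\]

Because \(\mathrm{MSE}(\lambda)\) is a convex quadratic in \(\lambda\), the best weight within \([0, 1]\) is \(\lambda^{*}\) clipped to that interval. The expectations involve the unknown \(R\) and the sampling distribution of both estimators, so in practice they have to be approximated, which is consistent with the bootstrap step described above.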

Introduction

The introduction discusses the importance of understanding dependencies among random observations across various fields, emphasizing the role of correlation coefficients as measures of these relationships. Pearson’s correlation coefficient is the best known, but it has limitations: it measures linear association only and can miss non-linear relationships. In contrast, distance correlation, introduced by Székely et al. (2007), offers a more robust measure: it is zero only when the random vectors are independent, and it captures both linear and non-linear dependence. The measure extends the concept of correlation to arbitrary dimensions and has been used in diverse applications, including variable selection and high-dimensional settings.

The paper also reviews the two estimators of distance correlation, the V-estimator and the U-estimator, which differ in computational complexity and statistical properties. The U-estimator is unbiased but can yield negative values of the squared distance covariance, which complicates its use, while the V-estimator is biased but computationally simpler. The authors carry out a simulation study to evaluate both estimators under various dependence models and find that neither consistently outperforms the other. To ease the choice between them, the paper introduces a new approach based on a convex linear combination of the two estimators. Subsequent sections detail the theoretical foundations, the simulation results, and practical applications of these methods.

Results

The results section presents a comprehensive analysis of the simulation study, focusing on the performance of the estimators at different sample sizes. For a sample size of \( n = 100 \), the proposed estimators dCorU(A) and dCorU(T) were evaluated alongside the convex linear combinations. The findings are summarized in tables that report the mean, bias, variance, and mean squared error (MSE) for each model and parameter combination, as well as the bootstrap estimate of \( \lambda_0 \). The comparison shows how the estimator \( dCor_\lambda \) differs when it is computed with the bootstrap estimate of \( \lambda_0 \) rather than with the true value \( \lambda_0 \).
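The four summary quantities reported in such tables can be computed from Monte Carlo replicates as in the following minimal sketch. The function name `summarize_estimates` is hypothetical, and the choice of the population variance (dividing by the number of replicates) is an assumption made so that the MSE equals the squared bias plus the variance exactly; the paper may use a different convention.

```python
import numpy as np


def summarize_estimates(estimates, true_value):
    """Summarize Monte Carlo replicates of an estimator against a known target.

    `estimates` holds one estimate per simulated sample; `true_value` is the
    population quantity being estimated (known in a simulation study).
    """
    est = np.asarray(estimates, dtype=float)
    mean = est.mean()
    bias = mean - true_value
    # Population variance (ddof=0), so that mse == bias**2 + variance.
    variance = est.var()
    mse = np.mean((est - true_value) ** 2)
    return {"mean": mean, "bias": bias, "variance": variance, "mse": mse}


if __name__ == "__main__":
    # Illustrative numbers only, not results from the paper.
    print(summarize_estimates([0.1, 0.3], true_value=0.0))
```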

For the larger sample sizes \( n = 1000 \) and \( n = 10000 \), the results cover only the original estimators dCorU and dCorV. Negative values appeared only under specific conditions, so the modified estimators did not differ significantly from the original ones and were left out of the presented results. This section underscores the robustness of the proposed estimators and their comparative performance across sample sizes.

Discussion

In this section, the paper discusses the estimation of distance correlation between random vectors \(X \in \mathbb{R}^p\) and \(Y \in \mathbb{R}^q\) using distance covariance and distance variance. The distance covariance \(V^2(X, Y)\) is defined through characteristic functions, and the distance correlation \(R^2(X, Y)\) is derived from it. The authors present empirical estimators of distance covariance and distance correlation, denoted \(V_n(X, Y)\) and \(R_n(X, Y)\) respectively, which can be computed from double-centered distance matrices. They also introduce alternative estimators based on U-statistics, which can be computed more efficiently, with complexity \(O(n \log n)\) compared with \(O(n^2)\) for the original estimators.
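The two estimators described above can be sketched in NumPy as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the function names are hypothetical, both functions return the squared statistic, the U version uses the U-centering of Székely and Rizzo (2014), and both use the direct \(O(n^2)\) pairwise computation rather than the faster algorithm mentioned above.

```python
import numpy as np


def _pairwise_dist(x):
    """Euclidean distance matrix of the rows of x (1-D input = n scalars)."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def _double_center(d):
    """Subtract row and column means, add back the grand mean."""
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def _u_center(d):
    """U-centering: like double centering, with n-2 and (n-1)(n-2) divisors."""
    n = d.shape[0]
    row = d.sum(axis=1, keepdims=True) / (n - 2)
    col = d.sum(axis=0, keepdims=True) / (n - 2)
    total = d.sum() / ((n - 1) * (n - 2))
    a = d - row - col + total
    np.fill_diagonal(a, 0.0)  # the diagonal is defined to be zero
    return a


def _ratio(dcov2, dvar_x, dvar_y):
    denom = np.sqrt(dvar_x * dvar_y)
    return dcov2 / denom if denom > 0 else 0.0


def dcor_v_squared(x, y):
    """Squared V-statistic distance correlation; always in [0, 1]."""
    a = _double_center(_pairwise_dist(x))
    b = _double_center(_pairwise_dist(y))
    return _ratio((a * b).mean(), (a * a).mean(), (b * b).mean())


def dcor_u_squared(x, y):
    """Squared U-statistic distance correlation; may be negative."""
    a = _u_center(_pairwise_dist(x))
    b = _u_center(_pairwise_dist(y))
    n = a.shape[0]
    if n <= 3:
        raise ValueError("the U-statistic estimator needs n > 3")
    scale = n * (n - 3)
    return _ratio((a * b).sum() / scale, (a * a).sum() / scale, (b * b).sum() / scale)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    print(dcor_v_squared(x, x ** 2), dcor_u_squared(x, rng.normal(size=200)))
```

Returning the squared statistic keeps the sign information of the U version visible; taking a square root is only safe once negative values have been handled.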

The paper highlights the problem of negative values of the U-statistic estimator under independence or low dependence, which complicates the computation of distance correlation. To address this, the authors propose two new estimators, \(dCorU(A)\) and \(dCorU(T)\), which modify the U-statistic to avoid negative values. Furthermore, they introduce a convex linear combination of the original estimators, \(dCor_\lambda = \lambda\, dCorU + (1 - \lambda)\, dCorV\), where the optimal \(\lambda\) is determined through a bootstrap procedure. This new estimator aims to reduce bias and improve performance across various dependence scenarios, ultimately providing a more robust method for estimating distance correlation.
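The modifications described above can be illustrated with a short sketch. Several points here are assumptions rather than facts from the paper: that (A) denotes the absolute-value variant and (T) the truncated variant, that each is applied to the squared U-statistic before taking the square root, and all function names. The bootstrap choice of \(\lambda\) is not reproduced because this summary does not describe the procedure, so \(\lambda\) is simply passed in.

```python
import math


def dcor_u_abs(dcor_u_squared):
    """Assumed form of dCorU(A): square root of the absolute value."""
    return math.sqrt(abs(dcor_u_squared))


def dcor_u_trunc(dcor_u_squared):
    """Assumed form of dCorU(T): negative values are truncated to zero."""
    return math.sqrt(max(dcor_u_squared, 0.0))


def dcor_lambda(dcor_u, dcor_v, lam):
    """Convex combination lam * dCorU + (1 - lam) * dCorV, with lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1] for a convex combination")
    return lam * dcor_u + (1.0 - lam) * dcor_v


if __name__ == "__main__":
    # Illustrative inputs only, not values from the paper.
    print(dcor_u_abs(-0.04), dcor_u_trunc(-0.04), dcor_lambda(0.2, 0.4, 0.25))
```

With \(\lambda = 1\) the combination reduces to dCorU and with \(\lambda = 0\) to dCorV, so the combined estimator contains both originals as special cases.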