A modified and weighted Gower distance-based clustering analysis for mixed type data: a simulation and empirical analyses

Journal: BMC Medical Research Methodology, Volume: 24, Issue: 1
DOI: https://doi.org/10.1186/s12874-024-02427-8
PMID: https://pubmed.ncbi.nlm.nih.gov/39696017
Publication Date: 2024-12-18
Author(s): Pinyan Liu et al.
Primary Topic: Oral microbiology and periodontitis research

Overview

This study presents DAFI-Gower, a clustering technique for mixed-type datasets that contain both continuous and categorical variables, designed to address the limitations of traditional clustering methods. It modifies the Gower distance so that different feature types contribute in a balanced way, and it incorporates feature importance as weights. The algorithm was evaluated against 13 existing clustering techniques on both simulated datasets and real-world data from the National Health and Nutrition Examination Survey (NHANES), where it showed superior clustering accuracy and cohesion.

DAFI-Gower consistently outperformed the baseline methods and achieved the highest silhouette score, 0.79, on the NHANES dataset of 3,760 participants. The analysis identified four distinct health clusters, with cardiovascular disease-related factors strongly influencing cluster formation. In addition, a multivariable logistic regression analysis showed an association between severe periodontitis and cardiovascular diseases (adjusted OR 1.95, 95% CI 1.50 to 2.55, p = 0.012). The findings point to DAFI-Gower's potential as an effective tool for mixed-type clustering that improves interpretability and supports data-driven decisions in clinical and epidemiological research.

Introduction

The introduction discusses the importance of data-driven population segmentation in clinical settings, which divides a heterogeneous population into homogeneous groups based on disease burden and healthcare characteristics. Such segmentation supports tailored care plans, healthcare resource planning, and evidence-based policymaking. Traditional segmentation methods face two obstacles: collecting patient information is labor-intensive, and conventional clustering algorithms are limited when applied to mixed-type data containing both continuous and categorical variables.

The paper describes the limitations of existing clustering techniques such as k-means and k-prototypes, which struggle with mixed data types and often lose information during data transformation. It also discusses KAMILA, an efficient k-means variant that clusters mixed data without heavy parametric assumptions. It further identifies a shortcoming of Gower's distance: it does not adequately balance the influence of continuous and categorical variables. To address these issues, the authors propose a two-step framework built on a modified and weighted Gower distance, termed DAFI-Gower, which aims to improve clustering quality and interpretability. The method is evaluated through simulations and a real-world application that identifies distinct health profiles related to periodontitis and cardiovascular diseases and uses them for confounder adjustment in association analyses.

Methods

The study introduces a two-step framework for clustering mixed-type data. The first step constructs a distance matrix that measures balanced distances across different variable types. The matrix is based on a modified Gower distance that adds distance adjustments and weights derived from mutual information-based feature importance (DAFI).
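For orientation, a feature-weighted Gower dissimilarity between observations i and j over p features is commonly written in the form below. The notation (w_k, d_{ijk}, R_k) is assumed here for illustration and is not taken from the paper. In the classic definition R_k is the range of feature k; according to this summary, DAFI-Gower adjusts how the per-feature terms are scaled and draws the weights w_k from mutual information-based feature importance.

```latex
d(i, j) = \frac{\sum_{k=1}^{p} w_k \, d_{ijk}}{\sum_{k=1}^{p} w_k},
\qquad
d_{ijk} =
\begin{cases}
\dfrac{|x_{ik} - x_{jk}|}{R_k} & \text{if feature } k \text{ is continuous},\\[6pt]
\mathbb{1}[x_{ik} \neq x_{jk}] & \text{if feature } k \text{ is categorical}.
\end{cases}
```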

Once the distance matrix has been built, the second step applies a clustering algorithm to it to identify patterns and groups in the data. The approach aims to improve the accuracy and relevance of the clustering results by relying on the more balanced distances provided by the DAFI-enhanced Gower distance.
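The second step can be illustrated with a small medoid-based clustering routine that works on any precomputed distance matrix. This is a minimal sketch and not the authors' implementation: it uses a simple alternating assign/update scheme rather than the full PAM build and swap procedure, and the function name `k_medoids` and the toy data are hypothetical.

```python
import numpy as np

def k_medoids(dist, k, n_iter=100, seed=0):
    """Simplified alternating k-medoids on a precomputed distance matrix.

    This is a stand-in for PAM, not the full build/swap algorithm.
    """
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        # Assign each observation to its nearest medoid.
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue  # keep the old medoid if a cluster is empty
            # New medoid: the member with the smallest total distance
            # to the other members of its cluster.
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break  # converged
        medoids = new_medoids
    labels = np.argmin(dist[:, medoids], axis=1)
    return labels, medoids

# Toy example: two well-separated groups on a line.
points = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
dist = np.abs(points[:, None] - points[None, :])
labels, medoids = k_medoids(dist, k=2)
print(labels, medoids)
```

Because the routine only needs pairwise distances, any mixed-type dissimilarity, including a weighted Gower variant, can be passed in without changing the clustering code.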

Results

The results section reports findings from simulation and empirical studies of mixed-type clustering methods, focusing on the DAFI-Gower distance combined with Partitioning Around Medoids (PAM). In the simulations, DAFI-Gower consistently outperformed the other methods, especially in scenarios with a high proportion of irrelevant features, and it also provided feature importance, which is central to the study's objectives. KAMILA performed comparably in some simulations but does not offer feature importance. Overall, techniques designed for mixed-type data generally performed better than those designed for purely quantitative or purely categorical data.

In the empirical study, the DAFI-Gower distance with PAM achieved the highest silhouette score (0.79), indicating better cluster cohesion and separation than the baseline methods. Four distinct clusters were identified. Cluster 1 was the healthiest group, with a younger mean age and a low prevalence of cardiovascular diseases (CVDs). Cluster 3 was the least healthy: it was older and had the highest proportion of individuals with CVDs. The findings show that both the distance adjustments and the feature importance weights improve clustering performance, and that specific CVD-related features contributed substantially to the cluster characteristics. Overall, the study underscores the advantages of the proposed DAFI-Gower method for mixed-type clustering.
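The silhouette score used for this comparison can be computed directly from a precomputed distance matrix, which suits a distance-based pipeline. The sketch below follows the standard silhouette definition; the function name `silhouette_from_distances` and the toy data are illustrative assumptions, not material from the paper.

```python
import numpy as np

def silhouette_from_distances(dist, labels):
    """Mean silhouette width from a precomputed distance matrix."""
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    scores = np.zeros(len(labels))
    for i in range(len(labels)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # singleton clusters get a silhouette of 0 by convention
        # a: mean distance to the other members of the same cluster
        # (dist[i, i] is 0, so dividing by n_own - 1 excludes the point itself).
        a = dist[i, own].sum() / (n_own - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return scores.mean()

# Toy example: a sensible partition versus an interleaved one.
points = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
dist = np.abs(points[:, None] - points[None, :])
good = silhouette_from_distances(dist, [0, 0, 0, 1, 1, 1])
bad = silhouette_from_distances(dist, [0, 1, 0, 1, 0, 1])
print(good, bad)
```

Values close to 1 indicate compact, well-separated clusters, while values near 0 or below indicate overlapping or misassigned observations.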

Discussion

The discussion presents DAFI-Gower as a new approach to clustering mixed-type data that addresses the limitations of the traditional Gower distance. The Gower distance is commonly used to measure dissimilarity in datasets with both continuous and categorical features, but it often produces an imbalance in which categorical variables disproportionately influence the distance. To correct this, the authors propose a modified Gower distance that combines distance adjustments with feature importance weights, so that continuous and categorical variables contribute more equitably. The adjustment is made by converting categorical variables into dummy variables and applying scaling factors based on the interquartile range (IQR) to continuous features.
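The distance adjustment described above can be sketched as follows. The summary does not give the paper's exact formula, so the details here are assumptions made only for illustration: continuous differences are divided by the feature's IQR, each categorical feature is one-hot (dummy) encoded, and the dummy-column differences are halved so that a category mismatch contributes 1. The function name `modified_gower` and its weight parameters are hypothetical.

```python
import numpy as np

def modified_gower(cont, cat, weights_cont=None, weights_cat=None):
    """Weighted Gower-style distance matrix for mixed data (illustrative).

    cont: (n, p) array of continuous features.
    cat:  (n, q) array of integer-coded categorical features.
    Scaling and dummy handling are assumptions, not the paper's formula.
    """
    cont = np.asarray(cont, dtype=float)
    cat = np.asarray(cat)
    n, p = cont.shape
    q = cat.shape[1]
    w_cont = np.ones(p) if weights_cont is None else np.asarray(weights_cont, dtype=float)
    w_cat = np.ones(q) if weights_cat is None else np.asarray(weights_cat, dtype=float)

    total = np.zeros((n, n))

    # Continuous part: absolute difference scaled by the interquartile range.
    q75, q25 = np.percentile(cont, [75, 25], axis=0)
    iqr = np.where(q75 - q25 > 0, q75 - q25, 1.0)  # guard against zero IQR
    for k in range(p):
        col = cont[:, k]
        total += w_cont[k] * np.abs(col[:, None] - col[None, :]) / iqr[k]

    # Categorical part: dummy coding, then L1 distance over the dummy columns.
    # A mismatch flips two dummy columns, so halve the sum to get 0 or 1.
    for k in range(q):
        levels = np.unique(cat[:, k])
        dummies = (cat[:, k][:, None] == levels[None, :]).astype(float)
        mismatch = np.abs(dummies[:, None, :] - dummies[None, :, :]).sum(axis=2) / 2.0
        total += w_cat[k] * mismatch

    # Weighted average over all features.
    return total / (w_cont.sum() + w_cat.sum())

# Toy example: one continuous feature (age) and one binary categorical feature.
cont = np.array([[20.0], [30.0], [40.0], [50.0]])
cat = np.array([[0], [0], [1], [1]])
D = modified_gower(cont, cat)
print(np.round(D, 3))
```

Unlike range scaling, IQR scaling does not bound the continuous terms by 1, so pairs that are far apart on a continuous feature can contribute more than a categorical mismatch. Under these assumptions, that is one way continuous features can gain influence relative to the classic Gower distance.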

DAFI-Gower also improves the interpretability of clustering by integrating feature importance weights derived from normalized mutual information (NMI), which quantifies the information shared between features. The authors demonstrate the method's effectiveness through simulation studies and a real-world application analyzing the association between periodontal disease (PD) severity and cardiovascular diseases (CVDs). The results indicate that DAFI-Gower not only improves clustering outcomes but also gives clearer insight into health risks, because the association analysis adjusts for cluster membership rather than for individual characteristics. Although its computation time is longer than that of other methods, DAFI-Gower is considered suitable for clinical applications and offers a robust tool for analyzing complex datasets.
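To make the NMI quantity concrete, the sketch below computes normalized mutual information between two discrete variables. It is illustrative only: the normalization by the geometric mean of the two entropies is one common convention and is an assumption here, the function names are hypothetical, and this summary does not specify how the paper pairs features or turns NMI values into weights. Continuous features would need to be discretized first.

```python
import numpy as np

def entropy(x):
    """Shannon entropy (in nats) of a discrete variable."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def normalized_mutual_information(x, y):
    """NMI between two discrete variables, normalized by sqrt(H(x) * H(y))."""
    x = np.asarray(x)
    y = np.asarray(y)
    # Joint entropy from the observed (x, y) pairs.
    pairs = np.stack([x, y], axis=1)
    _, counts = np.unique(pairs, axis=0, return_counts=True)
    p_xy = counts / counts.sum()
    h_xy = -np.sum(p_xy * np.log(p_xy))
    h_x, h_y = entropy(x), entropy(y)
    # Mutual information: I(x; y) = H(x) + H(y) - H(x, y).
    mi = h_x + h_y - h_xy
    denom = np.sqrt(h_x * h_y)
    return mi / denom if denom > 0 else 0.0

# Toy example: y is a relabeling of x (fully informative); z is independent of x.
x = np.array([0, 0, 1, 1])
y = np.array([1, 1, 0, 0])
z = np.array([0, 1, 0, 1])
nmi_xy = normalized_mutual_information(x, y)
nmi_xz = normalized_mutual_information(x, z)
print(nmi_xy, nmi_xz)
```

A feature with high NMI against a reference variable carries more shared information and would receive a larger weight under an importance scheme of this kind.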
