Optimizing Rare Disease Gait Classification through Data Balancing and Generative AI: Insights from Hereditary Cerebellar Ataxia

Journal: Sensors, Volume: 24, Issue: 11
DOI: https://doi.org/10.3390/s24113613
PMID: https://pubmed.ncbi.nlm.nih.gov/38894404
Publication Date: 2024-06-03
Author(s): Dante Trabassi et al.
Primary Topic: Genetic Neurodegenerative Diseases

Overview

This study addresses the challenges of gait analysis in people with primary hereditary cerebellar ataxia (pwCA), particularly the limitations imposed by small sample sizes and unbalanced datasets. The researchers evaluated several data balancing techniques, including generative artificial intelligence (AI) algorithms and in particular conditional tabular generative adversarial networks (ctGAN), for generating synthetic datasets that accurately reflect the gait abnormalities of pwCA. Gait data were collected from 30 pwCA and 100 healthy subjects, and the balanced datasets were used to train a random forest classifier. Balancing with ctGAN significantly improved classification performance compared to traditional methods, with higher accuracy, precision, recall, F1 score, and ROC AUC, and lower log loss.

The findings suggest that ctGAN is a robust method for balancing datasets in rare disease contexts, effectively mitigating type II (beta) errors in classification. The study also highlights the value of the Kolmogorov-Smirnov test and SHAP analysis for checking and interpreting machine learning outputs, thereby helping clinicians translate these findings into clinical practice. The results indicate that key gait characteristics, such as pelvic rotation and step length variability, are crucial for distinguishing pwCA from healthy subjects. This research not only sheds light on the gait patterns specific to pwCA but also proposes strategies for improving machine learning classifiers in rare disease scenarios, paving the way for more effective diagnostic tools and interventions. Future investigations should explore the application of ctGAN to other rare disorders and assess long-term clinical outcomes to further validate these findings.
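The link between class imbalance and missed cases can be illustrated with a minimal sketch. It is not code from the paper: the function names are hypothetical, and the only figures taken from the study are the cohort sizes (30 pwCA, 100 healthy subjects). It shows that a classifier that labels everyone as healthy still reaches a high accuracy while its recall for pwCA is zero, which is why the study reports several metrics rather than accuracy alone.

```python
import math

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for labels in {0, 1} (1 = pwCA)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

def log_loss(y_true, p_positive, eps=1e-15):
    """Mean negative log-likelihood of the true labels under predicted P(label = 1)."""
    total = 0.0
    for t, p in zip(y_true, p_positive):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)

# Cohort sizes reported in the study: 30 pwCA (label 1), 100 healthy subjects (label 0).
y_true = [1] * 30 + [0] * 100

# A degenerate classifier that always predicts "healthy".
always_healthy = [0] * len(y_true)
metrics = binary_metrics(y_true, always_healthy)
# accuracy is 100/130 (about 0.77) while recall is 0.0: every pwCA is missed.
```

Every false negative here is a missed patient, so recall and F1 expose the problem that accuracy hides on an unbalanced cohort.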

Introduction

The introduction highlights the methodological challenges faced in clinical studies of rare diseases, particularly in gait analysis of people with hereditary cerebellar ataxia (pwCA). Small sample sizes raise the risk of type II (beta) error, which undermines the validity of findings. Gait abnormalities in pwCA, characterized by impaired coordination and balance, can be quantitatively assessed using inertial measurement units (IMUs) that capture trunk acceleration, from which various gait indexes are derived. These indexes, such as the short-term largest Lyapunov exponent (sLLE) and the step length coefficient of variation (CV steplength), provide insights into dynamic stability and variability in gait patterns.
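As an illustration of one of these indexes, the sketch below computes a coefficient of variation as 100 times the standard deviation divided by the mean. The step lengths are made-up values, the function name is hypothetical, and the use of the sample standard deviation is an assumption; the summary does not state which estimator the authors used.

```python
import statistics

def coefficient_of_variation(values):
    """CV in percent: 100 * sample standard deviation / mean (estimator is an assumption)."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)

# Hypothetical step lengths in metres, not data from the study.
steady_steps = [0.60, 0.61, 0.59, 0.60, 0.60]
irregular_steps = [0.45, 0.70, 0.52, 0.66, 0.40]

cv_steady = coefficient_of_variation(steady_steps)
cv_irregular = coefficient_of_variation(irregular_steps)
# The irregular sequence yields the larger CV, i.e. more step-to-step variability.
```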

The paper also discusses the potential of machine learning (ML) classifiers to analyze complex gait datasets and distinguish between healthy individuals and those with gait abnormalities. However, the lack of large, balanced sample cohorts poses challenges for effective model training. To address this, advanced data augmentation techniques, particularly generative adversarial networks (GANs), are proposed as a means to enhance dataset quality and improve classification metrics. The study hypothesizes that these generative techniques will outperform traditional sample balancing methods, thereby offering a more effective diagnostic tool for rare diseases like hereditary cerebellar ataxia.

Methods

The “Methods” section outlines the workflow utilized in this study, as depicted in Figure 1. The authors emphasize that the proposed approach, which incorporates original distributions, is expected to surpass traditional methods in both effectiveness and accuracy. This advancement aims to enhance the diagnostic classification model specifically for rare diseases. Further details regarding the methodologies employed are elaborated in the subsequent subsections of the paper.

Results

In the results section, the study reports that the stance phase of gait was significantly correlated with other temporal variables, such as swing phase, double support, and single support. Consequently, of these only the stance phase was included in the final dataset, along with the harmonic ratio (HR) and the short-term largest Lyapunov exponent (sLLE) in the anterior-posterior (AP) direction, which were retained for their clinical relevance. From an initial pool of 25 gait parameters, eight variables were selected through feature selection, and a random forest (RF) analysis demonstrated strong classifier performance for seven of these features. The categorical variable ‘gender’ was excluded because it carried no more information than a dummy ‘noise’ feature, which had been added as a baseline for the feature selection process. Ultimately, a refined set of seven features was used in the supervised machine learning algorithm to distinguish persons with cerebellar ataxia (pwCA) from healthy subjects (HS).
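The dummy-feature idea can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes scikit-learn is installed, uses impurity-based `feature_importances_` as the importance measure (the summary does not say which measure was used), and runs on synthetic toy data. The helper name `select_above_noise` is hypothetical.

```python
import random
from sklearn.ensemble import RandomForestClassifier

def select_above_noise(X, y, names, seed=0):
    """Keep features whose RF importance exceeds that of an appended random column."""
    rng = random.Random(seed)
    # Append one column of pure noise to every row as the baseline feature.
    X_augmented = [list(row) + [rng.random()] for row in X]
    forest = RandomForestClassifier(n_estimators=200, random_state=seed)
    forest.fit(X_augmented, y)
    importances = forest.feature_importances_
    noise_importance = importances[-1]
    return [name for name, imp in zip(names, importances[:-1]) if imp > noise_importance]

# Synthetic toy data (not from the study): one feature tracks the label, one does not.
data_rng = random.Random(42)
y = [1] * 30 + [0] * 100
X = [[label + data_rng.gauss(0, 0.3), data_rng.random()] for label in y]

kept = select_above_noise(X, y, ["informative", "uninformative"])
```

A feature that ranks no higher than random noise, as reported for ‘gender’, gives the classifier nothing to work with and can be dropped.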

Additionally, the consistency and explainability of the approach were assessed. The Kolmogorov-Smirnov (KS) test indicated no significant differences in the distributions of variables between the original and ctGAN-generated datasets. The analysis of feature influence highlighted that step length CV, HR, and pelvic rotation were the most influential variables in the classification model, underscoring their importance in understanding gait dynamics in pwCA. Detailed descriptions of the selected gait indexes are provided in the supplementary materials.
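The quantity behind the KS comparison is the largest gap between the empirical cumulative distribution functions of two samples. The sketch below computes only that statistic in plain Python, with a hypothetical function name; a p-value would additionally be needed for a significance decision, and `scipy.stats.ks_2samp` provides both.

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: maximum distance between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    largest_gap = 0.0
    # The maximum is attained at one of the observed values, so only those are checked.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap

same = ks_statistic([1, 2, 3], [1, 2, 3])      # identical samples
disjoint = ks_statistic([1, 2, 3], [4, 5, 6])  # no overlap at all
```

A statistic near 0 for each variable is what one hopes to see when comparing real and synthetic records; a value near 1 would mean the generator missed the original distribution.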

Discussion

The discussion section outlines a comprehensive methodological framework for improving machine learning classification in the detection of rare diseases, focusing specifically on people with cerebellar ataxia (pwCA). The approach begins with data collection and preprocessing, including feature selection against a dummy noise feature to ensure data quality. To address class imbalance, several strategies were employed: undersampling, bootstrapping, the Synthetic Minority Oversampling Technique (SMOTE), Generative Adversarial Networks (GAN), and Conditional Tabular Generative Adversarial Networks (ctGAN). The LazyPredict package facilitated an initial model assessment, leading to the selection of a random forest classifier, which underwent hyperparameter tuning via Bayesian optimization. The model’s performance was evaluated using established criteria, and SHapley Additive exPlanations (SHAP) analysis was applied to enhance model interpretability.
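The three non-generative balancing strategies can be sketched in plain Python. This is a simplified illustration, not the study's pipeline: "bootstrapping" is read here as resampling the minority class with replacement, the interpolation function only mimics the core idea of SMOTE (the imbalanced-learn library offers a full implementation), and the data are random toy points. All function names are hypothetical. GAN and ctGAN are not sketched because they need a trained generator.

```python
import random

def undersample(majority, minority, rng):
    """Randomly drop majority rows until both classes have the same size."""
    return rng.sample(majority, len(minority)), list(minority)

def bootstrap_oversample(majority, minority, rng):
    """Resample minority rows with replacement until both classes have the same size."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority), list(minority) + extra

def interpolate_oversample(majority, minority, rng, k=3):
    """SMOTE-style: create points between a minority row and one of its k nearest minority neighbours."""
    def squared_distance(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    synthetic = []
    while len(minority) + len(synthetic) < len(majority):
        base = rng.choice(minority)
        others = [m for m in minority if m is not base]
        neighbours = sorted(others, key=lambda m: squared_distance(base, m))[:k]
        partner = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to partner
        synthetic.append([a + gap * (b - a) for a, b in zip(base, partner)])
    return list(majority), list(minority) + synthetic

# Toy two-feature data with the study's class sizes (100 healthy, 30 pwCA).
rng = random.Random(0)
healthy = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(100)]
ataxic = [[rng.gauss(2, 1), rng.gauss(2, 1)] for _ in range(30)]

balanced_majority, balanced_minority = interpolate_oversample(healthy, ataxic, rng)
```

Undersampling discards information from the larger class, while bootstrapping repeats existing minority rows; interpolation and generative models create new rows instead, which is the motivation for the comparison above.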

The gait data acquisition involved 30 pwCA and 100 healthy subjects, with strict inclusion criteria to ensure the reliability of gait assessments. Data were collected using an Inertial Measurement Unit (IMU) to capture various spatiotemporal and kinematic gait parameters. The preprocessing phase included rigorous data cleaning and normalization to prepare the dataset for analysis, while feature selection utilized correlation analysis and random forest techniques to identify the most relevant attributes. The study also compared the effectiveness of different data balancing strategies, including undersampling and SMOTE, to improve model training and performance. Overall, the research emphasizes the importance of robust data augmentation and machine learning techniques in the accurate classification of rare diseases, contributing to advancements in clinical diagnostics.
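The correlation step of feature selection can be sketched as a simple filter that keeps a feature only if it is not strongly correlated with one already kept. The 0.9 threshold, the function names, and the values are assumptions for illustration; the summary does not give the authors' cutoff.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

def drop_correlated(columns, threshold=0.9):
    """Keep a feature only if |r| with every already-kept feature is at most threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical values: swing phase is 100 minus stance phase (percent of gait cycle).
stance = [60.0, 62.5, 65.0, 61.0, 66.5]
columns = {
    "stance_phase": stance,
    "swing_phase": [100.0 - s for s in stance],
    "pelvic_rotation": [8.0, 5.5, 9.0, 4.0, 6.0],
}

kept = drop_correlated(columns)
```

Because swing phase is fully determined by stance phase in this toy example, it is dropped as redundant, mirroring the decision to keep only the stance phase among the correlated temporal variables.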