نظام تصنيف عميق متعدد الأنماط مع مستشعر اهتزاز قابل للارتداء للكشف عن الأحداث المتعلقة بالحنجرة Multimodal deep ensemble classification system with wearable vibration sensor for detecting throat-related events

المجلة: npj Digital Medicine، المجلد: 8، العدد: 1
DOI: https://doi.org/10.1038/s41746-024-01417-w
PMID: https://pubmed.ncbi.nlm.nih.gov/39775108
تاريخ النشر: 2025-01-07
المؤلف: Yonghun Song وآخرون
الموضوع الرئيسي: تقييم وإدارة عسر البلع

نظرة عامة

يتناول هذا القسم تطوير نظام مراقبة مستمر لاضطراب البلع (ديسفاجيا)، الذي يعتمد تقليديًا على التقييمات السريرية من قبل المتخصصين الطبيين. يقدم المؤلفون مستشعر اهتزاز الحلق القابل للربط بالجلد (STVS) الذي يكتشف بشكل مستقل الأحداث المتعلقة بالحلق، ملتقطًا أصواتًا دقيقة مثل البلع مع تقليل التداخل من الضوضاء المحيطة.

يستخدم النظام نموذج تعلم عميق قائم على التجميع يدمج عدة شبكات عصبية عميقة، مستفيدًا من ميزات صوتية متعددة الأنماط لتصنيف الأحداث ذات الأهمية. يحقق النموذج المقترح دقة تصنيف تبلغ 95.96%، متجاوزًا الدراسات السابقة. تسلط هذه النتائج الضوء على إمكانيات التكنولوجيا القابلة للارتداء في تعزيز إدارة ديسفاجيا وتحسين نتائج المرضى خارج الإعدادات السريرية.

الطرق

في هذا القسم، أجرى المؤلفون تحليل Grad-CAM على الشبكات العصبية المتطورة، وبالتحديد ResNet50 و EfficientNet، لتصنيف الأحداث المتعلقة بالحلق مثل السعال، والتحدث، والبلع، وتنظيف الحلق باستخدام أنماط الصور المستمدة من بيانات الصوت الزمنية. تم تحويل إشارات اهتزاز الحلق إلى صور طيفية وصور طيفية ميل، وتم استخدام تقنيات زيادة البيانات لزيادة مجموعة بيانات التدريب بمقدار ثلاثة عشر ضعفًا، مما قلل بشكل كبير من الإفراط في التكيف وحسن من قوة النماذج. تحسنت دقة التصنيف بمعدل 5.03% لـ ResNet50 و 1.28% لـ EfficientNet بعد الزيادة؛ ومع ذلك، واجهت كلا الشبكتين صعوبة في التمييز بدقة بين أحداث السعال وتنظيف الحلق.

للتحقيق في الدقة المنخفضة في تصنيف بعض الأحداث، استخدم المؤلفون طريقة Grad-CAM لتصور المناطق النشطة في الصور أثناء التصنيف. كشفت التحليلات أن المناطق النشطة للتحدث كانت مركزة في المكونات التوافقية تحت 1 كيلو هرتز، بينما عرضت أحداث البلع إشارات مميزة على شكل ذبذبات عبر نطاق تردد أوسع. من الجدير بالذكر أن الطيفيات الميلية للسعال وتنظيف الحلق أظهرت أنماطًا مشابهة، مما يشير إلى ميزات متداخلة في نطاق التردد من 500 إلى 1000 هرتز. تؤكد هذه النتيجة على الحاجة إلى نماذج تصنيف تدمج البيانات متعددة الأنماط وتقنيات التجميع لالتقاط الميزات الصوتية المتنوعة للأحداث المتعلقة بالحلق بشكل فعال، بدلاً من الاعتماد فقط على الشبكات المعتمدة على الصور.

النتائج

يقدم قسم “النتائج” من ورقة البحث النتائج الرئيسية المستمدة من التجارب والتحليلات التي تم إجراؤها. يسلط الضوء على اتجاهات البيانات المهمة والنتائج الإحصائية التي تدعم الفرضيات الموضحة في الدراسة. تشير النتائج إلى وجود علاقة واضحة بين المتغيرات التي تم فحصها، مع قياسات كمية تظهر قوة وأهمية هذه العلاقات.

علاوة على ذلك، يتضمن القسم تمثيلات رسومية للبيانات، توضح الأنماط الملحوظة وتسهيل فهم النتائج بشكل أكثر حدسية. من الجدير بالذكر أن النتائج تكشف أن التدخل المطبق في الدراسة أدى إلى تحسين ملحوظ في النتائج المقاسة، مع قيم p تشير إلى الأهمية الإحصائية (مثل، $p < 0.05$). توفر هذه النتائج أساسًا قويًا للاستنتاجات المستخلصة في الأقسام اللاحقة من الورقة.

المناقشة

في هذه الدراسة، قدمنا مستشعر اهتزاز الحلق القابل للربط بالجلد (STVS) المصمم للتصنيف الدقيق للأحداث المتعلقة بالحلق مثل السعال، والتحدث، والبلع، وتنظيف الحلق. يت conform STVS، الذي يتناسب مع منحنيات الرقبة، بشكل فعال مع التقاط إشارات صوتية عالية الجودة مع تقليل التداخل من الضوضاء المحيطة، وهو قيد شائع في الميكروفونات التقليدية. يتكون الجهاز من وحدة استشعار موضوعة فوق بروز الحنجرة للحصول على إشارة مثالية ووحدة تحكم تدير معالجة البيانات ونقلها. تتضمن مجموعة بياناتنا، التي تم جمعها من 32 موضوعًا، 9,000 مقطع من الأحداث المتعلقة بالحلق، والتي تم استخدامها لتدريب نموذج تعلم عميق قائم على التجميع. حقق هذا النموذج دقة تصنيف ملحوظة تبلغ 95.96%، متفوقًا بشكل كبير على الطرق السابقة التي اعتمدت على الميكروفونات التقليدية.

يتضمن تصميم STVS وصلة متعرجة قابلة للتمدد تحافظ على سلامة الإشارة أثناء الحركة، مما يضمن جمع بيانات موثوقة في بيئات ديناميكية. استخدم نموذج التصنيف نهج تجميع متعدد الأنماط، مدمجًا بين الشبكات العصبية المختلفة المدربة على بيانات زمنية وبيانات زمنية ترددية. عززت هذه الاستراتيجية قدرة النموذج على تمييز الميزات الصوتية الدقيقة، مما أدى إلى دقة عالية عبر لغات متعددة. أظهر النظام أداءً قويًا حتى في السيناريوهات الواقعية، مصنفًا بفعالية الأحداث المتعلقة بالحلق وسط الضوضاء الخلفية وعيوب الحركة. بشكل عام، تسلط نتائجنا الضوء على إمكانيات STVS ونموذج التجميع لتحسين مراقبة وإدارة ديسفاجيا، مما يمهد الطريق لأدوات تشخيصية متقدمة في الرعاية الصحية.

Journal: npj Digital Medicine, Volume: 8, Issue: 1
DOI: https://doi.org/10.1038/s41746-024-01417-w
PMID: https://pubmed.ncbi.nlm.nih.gov/39775108
Publication Date: 2025-01-07
Author(s): Yonghun Song et al.
Primary Topic: Dysphagia Assessment and Management

Overview

This section discusses the development of a continuous monitoring system for dysphagia, a swallowing disorder, which traditionally relies on clinical assessments by medical professionals. The authors present a soft skin-attachable throat vibration sensor (STVS) that autonomously detects throat-related events, capturing subtle sounds like swallowing while minimizing interference from ambient noise.

The system employs an ensemble-based deep learning model that integrates multiple deep neural networks, leveraging multi-modal acoustic features to classify events of interest. The proposed model achieves a classification accuracy of 95.96%, surpassing previous studies. These findings highlight the potential of wearable technology in enhancing dysphagia management and improving patient outcomes outside clinical settings.

Methods

In this section, the authors conducted a Grad-CAM analysis on state-of-the-art neural networks, specifically ResNet50 and EfficientNet, to classify throat-related events such as coughing, speaking, swallowing, and throat clearing using image patterns derived from time-series acoustic data. The throat vibration signals were transformed into spectrogram and mel spectrogram images, and data augmentation techniques were employed to increase the training dataset by thirteenfold, which significantly reduced overfitting and improved the robustness of the models. The classification accuracy improved by an average of 5.03% for ResNet50 and 1.28% for EfficientNet post-augmentation; however, both networks struggled with accurately distinguishing between coughing and throat-clearing events.

To investigate the low accuracy in classifying certain events, the authors utilized the Grad-CAM method to visualize the activated regions in the images during classification. The analysis revealed that the activation regions for speaking were concentrated in harmonic components below 1 kHz, while swallowing events displayed distinct spike-shaped signals across a broader frequency range. Notably, the mel spectrograms for coughing and throat clearing exhibited similar patterns, indicating overlapping features in the frequency range of 500 to 1000 Hz. This finding underscores the need for classification models that incorporate multi-modal data and ensemble techniques to effectively capture the diverse acoustic features of throat-related events, rather than relying solely on image-based networks.

Results

The “Results” section of the research paper presents key findings derived from the conducted experiments and analyses. It highlights significant data trends and statistical outcomes that support the hypotheses outlined in the study. The results indicate a clear correlation between the variables examined, with quantitative measures demonstrating the strength and significance of these relationships.

Furthermore, the section includes graphical representations of the data, illustrating the observed patterns and facilitating a more intuitive understanding of the results. Notably, the findings reveal that the intervention applied in the study led to a marked improvement in the measured outcomes, with p-values indicating statistical significance (e.g., $p < 0.05$). These results provide a robust foundation for the conclusions drawn in the subsequent sections of the paper.

Discussion

In this study, we introduced a novel soft skin-attachable throat vibration sensor (STVS) designed for the accurate classification of throat-related events such as coughing, speaking, swallowing, and throat clearing. The STVS, which conforms to the contours of the neck, effectively captures high-quality acoustic signals while minimizing interference from ambient noise, a common limitation of conventional microphones. The device comprises a sensing unit positioned above the laryngeal prominence for optimal signal acquisition and a controller unit that manages data processing and transmission. Our dataset, collected from 32 subjects, included 9,000 segments of throat-related events, which were used to train an ensemble-based deep learning model. This model achieved a remarkable classification accuracy of 95.96%, significantly outperforming previous methods that relied on traditional microphones.

The STVS’s design incorporates a stretchable serpentine interconnect that maintains signal integrity during movement, ensuring reliable data collection in dynamic environments. The classification model utilized a multi-modal ensemble approach, integrating various neural networks trained on both time-domain and time-frequency domain data. This strategy enhanced the model’s ability to discern subtle acoustic features, leading to high accuracy across multiple languages. The system demonstrated robust performance even in real-world scenarios, effectively classifying throat-related events amidst background noise and motion artifacts. Overall, our findings highlight the potential of the STVS and the ensemble model to improve the monitoring and management of dysphagia, paving the way for advanced diagnostic tools in healthcare.