Multimodal automatic assessment of acute pain through facial videos and heart rate signals utilizing transformer-based architectures

Journal: Frontiers in Pain Research, Volume: 5
DOI: https://doi.org/10.3389/fpain.2024.1372814
PMID: https://pubmed.ncbi.nlm.nih.gov/38601923
Publication Date: 2024-03-27
Author(s): Stefanos Gkikas et al.
Primary Topic: Emotion and Mood Recognition

Overview

The research presents a multimodal automatic assessment framework for acute pain, integrating video and heart rate signals to enhance pain management protocols. The framework consists of four key modules: the Spatial Module for video embedding extraction, the Heart Rate Encoder for mapping heart rate signals, the AugmNet for generating learning-based augmentations, and the Temporal Module for final pain assessment. The Spatial Module employs a two-stage pre-training strategy focusing on face and emotion recognition, enabling the extraction of high-quality embeddings. Experimental results using the BioVid database indicate that the framework achieves state-of-the-art performance, with accuracies of 82.74% for binary pain classification and 39.77% for multi-level classification, utilizing 9.62 million parameters.

The study concludes that the combination of facial video and heart rate data is effective for automatic pain assessment, demonstrating high classification accuracy while maintaining efficiency. The framework’s design allows for the generation of attention maps, providing insights into module functionality and input focus areas. Although the framework shows promise for real-world applications, potential enhancements may compromise efficiency and speed, necessitating careful consideration based on application needs. The authors advocate for transparency in computational costs in future research to facilitate comparisons and encourage the use of multimodal approaches for effective pain assessment in clinical settings.

Methods

In this section, the authors detail the methodology employed for preprocessing video and ECG data, the design of their proposed framework, and the implementation of various augmentation techniques. Two supplementary augmentation methods are introduced: the Basic technique, which combines polarity inversion and noise insertion to create variations of the input embeddings, and the Masking technique, which randomly sets 10%-20% of the elements in the embedding vectors to zero. Both methods operate within the latent space, that is, on the embeddings rather than on the raw video or ECG, and are intended to make training more robust.
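The two augmentation techniques described above can be illustrated with a minimal sketch. It assumes an embedding is a plain list of floats and that the Basic technique always applies both steps. The function names, the `noise_std` value, and the Gaussian noise model are illustrative assumptions, not details taken from the paper.

```python
import random

def basic_augment(embedding, noise_std=0.1, rng=random):
    """Basic technique (sketch): polarity inversion plus additive noise.

    noise_std and the Gaussian noise model are assumptions for illustration.
    """
    return [-x + rng.gauss(0.0, noise_std) for x in embedding]

def masking_augment(embedding, low=0.10, high=0.20, rng=random):
    """Masking technique (sketch): zero a random 10%-20% of the elements."""
    n = len(embedding)
    k = round(n * rng.uniform(low, high))   # how many elements to zero
    masked = set(rng.sample(range(n), k))   # which positions to zero
    return [0.0 if i in masked else x for i, x in enumerate(embedding)]

# Hypothetical 100-dimensional embedding, seeded for repeatability.
rng = random.Random(0)
embedding = [rng.uniform(-1.0, 1.0) for _ in range(100)]
augmented = basic_augment(embedding, rng=rng)
masked = masking_augment(embedding, rng=rng)
print(sum(1 for x in masked if x == 0.0), "of", len(masked), "elements zeroed")
```

Because both functions take and return an embedding of the same length, they can be applied after the encoders without changing any downstream input shape, which is one practical reason to augment in the latent space.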

The experimental setup utilized the BioVid dataset, encompassing video and ECG data from 87 subjects, with videos recorded at 25 frames per second and ECG sampled at 512 Hz. Each session lasted 5.5 seconds, yielding 138 video frames and an ECG vector of 2,816 elements. The authors employed leave-one-subject-out (LOSO) cross-validation to evaluate their automatic pain assessment framework on a binary task (No Pain vs. Very Severe Pain) and a multi-level classification task. The results indicate that their method outperformed existing approaches, achieving 77.10% accuracy for the binary task and 35.39% for the multi-level task in video-based studies. In ECG-based studies, their method surpassed the average performance of prior studies by 8.5% and 18.1% for the binary and multi-level tasks, respectively, while also achieving the highest classification performance for the multi-level task at 31.22%. In multimodal studies, their method reached an accuracy of 82.74% for the binary task, positioning it among the top-performing approaches in the literature.
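The sample sizes and the evaluation protocol above can be checked with a short sketch. It assumes the half frame at 5.5 seconds is rounded up, which matches the reported 138 frames. The `loso_folds` helper and the four-recordings-per-subject layout are hypothetical and serve only to show how LOSO folds are formed.

```python
import math

VIDEO_FPS = 25     # frames per second
ECG_HZ = 512       # ECG sampling rate in Hz
DURATION_S = 5.5   # length of one session in seconds

# 25 * 5.5 = 137.5; rounding up matches the 138 frames reported above.
frames_per_sample = math.ceil(VIDEO_FPS * DURATION_S)
# 512 * 5.5 = 2816 exactly.
ecg_points_per_sample = int(ECG_HZ * DURATION_S)

def loso_folds(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test

# Hypothetical layout: 87 subjects with 4 recordings each.
subject_ids = [s for s in range(87) for _ in range(4)]
folds = list(loso_folds(subject_ids))
print(frames_per_sample, ecg_points_per_sample, len(folds))  # prints: 138 2816 87
```

Each fold holds out every recording of one subject, so no person appears in both the training and the test split. This is what makes LOSO a subject-independent estimate of accuracy.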

Discussion

The discussion section of the research paper highlights the extensive efforts in estimating human pain levels through various modalities, particularly emphasizing multimodal approaches that integrate behavioral and physiological data. Utilizing datasets like the BioVid Heat Pain Database, researchers have developed diverse methodologies, each with unique advantages and challenges regarding complexity, computational cost, and performance. Notable advancements include the use of optical flow methods, long short-term memory networks, and 3D convolutional neural networks (CNNs) to analyze facial expressions, alongside physiological signals such as electrocardiograms (ECG) and electromyograms (EMG). These studies demonstrate that while unimodal approaches yield satisfactory results, the fusion of multiple modalities significantly enhances the specificity and sensitivity of pain assessments, especially in clinical settings where certain modalities may be inaccessible.

The proposed framework aims to integrate facial video data with heart rate information derived from ECG signals, marking a novel approach that leverages these two modalities for pain assessment. The preprocessing steps for both video and ECG data are meticulously outlined, ensuring that the modalities are appropriately prepared for analysis. The architecture of the framework consists of several components, including a Spatial Module for video embedding extraction, a Heart Rate Encoder for mapping heart rate signals, and an AugmNet for enhancing the robustness of the learned features. The Temporal Module combines these embeddings to produce a final pain intensity classification. The study underscores the importance of multimodal integration and the potential for improved pain assessment accuracy through innovative preprocessing and embedding techniques, ultimately contributing to more effective clinical applications in pain management.
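As a rough illustration of how heart rate information can be derived from an ECG signal, the sketch below converts R-peak positions into instantaneous heart rate using the standard relation of 60 divided by the RR interval in seconds. The summary does not describe the authors' R-peak detector or exact preprocessing, so the peak positions here are hypothetical inputs and the `heart_rate_bpm` helper is an assumption, not the paper's method.

```python
ECG_HZ = 512  # sampling rate reported for the ECG recordings

def heart_rate_bpm(r_peak_indices, fs=ECG_HZ):
    """Return one instantaneous heart rate (beats per minute) per RR interval."""
    rates = []
    for prev, curr in zip(r_peak_indices, r_peak_indices[1:]):
        rr_seconds = (curr - prev) / fs   # time between consecutive R peaks
        rates.append(60.0 / rr_seconds)   # beats per minute
    return rates

# Hypothetical R-peak positions (sample indices) inside one 2,816-sample window.
peaks = [100, 612, 1124, 1550, 1976]
print(heart_rate_bpm(peaks))
```

A sequence of this kind is far shorter than the raw 2,816-sample ECG vector, which suggests one reason a heart rate representation might be a compact input for an encoder.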