Video super resolution based on deformable 3D convolutional group fusion

Journal: Scientific Reports, Volume: 15, Issue: 1
DOI: https://doi.org/10.1038/s41598-025-93758-z
PMID: https://pubmed.ncbi.nlm.nih.gov/40091043
Publication Date: 2025-03-17
Author(s): Xiao Chen et al.
Primary Topic: Advanced Image Processing Techniques

Overview

In this study, the authors introduce a deep neural network for video super-resolution that reorganizes the input video sequence into multiple groups of sub-sequences, each with a different frame rate. This hierarchical temporal grouping facilitates the integration of inter-frame information and supports effective shallow feature extraction. An intra-group fusion module, comprising five deformable 3D convolutional residual blocks, exploits the temporal information within each group to maintain temporal consistency while capturing both appearance and motion dynamics.

The network further employs an inter-group attention fusion module that adaptively aggregates complementary information across different groups, culminating in the reconstruction of high-resolution video outputs. Experimental results on the Vid4 dataset demonstrate the proposed network’s superior reconstruction performance, highlighting its efficacy in extracting and fusing spatio-temporal features for improved video quality.

Methods

The authors propose a video super-resolution (VSR) approach that uses spatio-temporal information from low-resolution (LR) video frame sequences to reconstruct high-resolution (HR) reference frames. The network architecture comprises four key components: a shallow feature extraction module, a deformable 3D convolutional intra-group fusion module, an inter-group attention fusion module, and a reconstruction module. First, the input video frames are organized into temporal groups, so that each group covers a different frame rate. The frames of each group then undergo preliminary feature extraction through a 3D convolutional module, which is intended to capture the motion state of that group.
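The grouping step can be illustrated with a short sketch. It assumes a symmetric window with an odd number of frames around the reference frame, and it assumes that group d holds the reference frame together with the two frames at temporal distance d. The summary only states that frames are grouped by temporal proximity to the reference frame, so the exact group layout and the name `temporal_groups` are hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def temporal_groups(frames: Sequence[T]) -> List[List[T]]:
    """Split a window of frames into groups by distance to the center frame.

    Assumption (not confirmed by the source): group d contains
    [frame(t - d), frame(t), frame(t + d)], so larger d corresponds to a
    lower effective frame rate.
    """
    if len(frames) < 3 or len(frames) % 2 == 0:
        raise ValueError("expected an odd number of frames, at least 3")
    center = len(frames) // 2  # index of the reference frame
    return [
        [frames[center - d], frames[center], frames[center + d]]
        for d in range(1, center + 1)
    ]


if __name__ == "__main__":
    # Frame indices stand in for image tensors here.
    print(temporal_groups(list(range(7))))
```

Under these assumptions, a window of seven frames yields three groups, each of which contains the reference frame, so every group can be aligned to the same target.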

Subsequently, the extracted features are processed by the deformable 3D convolutional intra-group fusion module, which applies adaptive motion compensation to merge the features within each group. To integrate information across groups, an inter-group temporal attention mechanism is introduced. The final fused feature map is then passed to a reconstruction module consisting of six cascaded residual blocks and sub-pixel layers. This module generates a residual map that is added to the bicubically interpolated input reference frame, yielding the high-resolution reference frame $\hat{I}_t$. The overall architecture is shown in Figure 1.
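To make the idea of deformable sampling more concrete, the following is a deliberately simplified one-dimensional sketch, not the authors' 3D module. It assumes that each kernel tap is shifted by a fractional offset, which a real network would predict and which is passed in by hand here, and that values at fractional positions are obtained by linear interpolation with zero padding. The names `sample_linear` and `deformable_conv1d` are hypothetical.

```python
import math
from typing import List, Sequence


def sample_linear(signal: Sequence[float], pos: float) -> float:
    """Linearly interpolate `signal` at a fractional position (zero outside)."""
    lo = math.floor(pos)
    frac = pos - lo

    def at(i: int) -> float:
        return signal[i] if 0 <= i < len(signal) else 0.0

    return (1.0 - frac) * at(lo) + frac * at(lo + 1)


def deformable_conv1d(
    signal: Sequence[float],
    weights: Sequence[float],
    offsets: Sequence[Sequence[float]],
) -> List[float]:
    """1D analogue of a deformable convolution.

    offsets[p][k] shifts kernel tap k at output position p. With all offsets
    set to zero this reduces to an ordinary zero-padded convolution
    (correlation form).
    """
    radius = len(weights) // 2
    out = []
    for p in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(weights):
            acc += w * sample_linear(signal, p + (k - radius) + offsets[p][k])
        out.append(acc)
    return out


if __name__ == "__main__":
    x = [1.0, 2.0, 4.0]
    half_shift = [[0.0, 0.5, 0.0] for _ in x]  # shift the center tap by 0.5
    print(deformable_conv1d(x, [0.0, 1.0, 0.0], half_shift))
```

The deformable 3D case extends the same principle to offsets along height, width and time, with trilinear instead of linear interpolation.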

Results

The proposed super-resolution (SR) method was evaluated on the Y (luminance) channel against several existing models: VSRnet, VESPCN, TOFlow, D3Dnet, RBPN, and EDVR. On the Vid4 dataset, the proposed method achieved a Peak Signal-to-Noise Ratio (PSNR) of 27.39 dB and a Structural Similarity Index (SSIM) of 0.828. It outperformed RBPN and EDVR by 0.29 dB and 0.12 dB in PSNR, respectively, and surpassed D3Dnet by 0.87 dB in PSNR and 0.029 in SSIM.
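For readers unfamiliar with the metric, the sketch below shows how a Y-channel PSNR is commonly computed. It assumes 8-bit RGB input, the ITU-R BT.601 luma conversion, and a peak value of 255. The summary does not state which conversion or peak value the authors used, so these choices and the function names are assumptions.

```python
import math
from typing import Sequence


def rgb_to_y(r: float, g: float, b: float) -> float:
    """BT.601 luma (range 16..235) from 8-bit RGB values (assumed conversion)."""
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0


def psnr(reference: Sequence[float], restored: Sequence[float],
         peak: float = 255.0) -> float:
    """PSNR in dB between two equally sized lists of pixel values."""
    if not reference or len(reference) != len(restored):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


if __name__ == "__main__":
    ref = [rgb_to_y(200, 120, 40), rgb_to_y(10, 10, 10)]
    out = [rgb_to_y(198, 121, 43), rgb_to_y(12, 9, 10)]
    print(f"Y-channel PSNR: {psnr(ref, out):.2f} dB")
```

Because PSNR is logarithmic in the mean squared error, the sub-decibel gaps reported above correspond to small but consistent reductions in reconstruction error.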

Additionally, despite having a higher number of parameters, the proposed method showed favorable network complexity and running time, operating at only 14.1% of TOFlow's maximum running time and 16.1% of VESPCN's minimum running time. Visual comparisons indicated that the images restored by competing methods were less sharp and contained significant artifacts, particularly in fine details such as building windows and human faces. In contrast, the proposed model restored texture information and edge clarity more effectively, resulting in visibly sharper images.

Discussion

In this study, the authors present a video super-resolution (VSR) approach based on a deformable three-dimensional (3D) convolutional network. The proposed method addresses a limitation of traditional VSR techniques, which often fail to exploit inter-frame motion information effectively and can introduce artifacts in the reconstructed video frames. By hierarchically grouping input video frames according to their temporal proximity to a reference frame, the model improves the extraction and fusion of spatio-temporal features. This grouping allows complementary information from frames with different motion characteristics to be integrated, thereby improving the overall reconstruction quality.

The core of the proposed architecture is an intra-group fusion module that employs deformable 3D convolutions to capture and fuse features within each group, followed by an inter-group fusion module that uses a temporal attention mechanism to aggregate information across groups. This design enables the model to adaptively borrow information from frames at different temporal distances, preserving high-frequency details and improving the reconstruction of high-resolution video frames. Experimental results on the Vid4 dataset show that the proposed method outperforms existing VSR techniques in both peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM), supporting its effectiveness for video super-resolution.
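The inter-group aggregation can be sketched as a softmax-weighted sum over groups. This is a minimal illustration, assuming that each group provides a feature value and an attention score per position and that the scores are normalized across groups with a softmax. The summary only names a temporal attention mechanism, so the score computation, the softmax normalization, and the name `fuse_groups` are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_groups(features: Sequence[Sequence[float]],
                scores: Sequence[Sequence[float]]) -> List[float]:
    """Fuse per-group features with attention weights.

    features[g][i] is the feature of group g at position i, and scores[g][i]
    is its (assumed) attention score. Weights are normalized across groups
    separately for each position.
    """
    num_groups = len(features)
    length = len(features[0])
    fused = []
    for i in range(length):
        weights = softmax([scores[g][i] for g in range(num_groups)])
        fused.append(sum(weights[g] * features[g][i] for g in range(num_groups)))
    return fused


if __name__ == "__main__":
    feats = [[1.0, 2.0], [3.0, 4.0]]
    # Equal scores give a plain average of the two groups.
    print(fuse_groups(feats, [[0.0, 0.0], [0.0, 0.0]]))
```

With equal scores the fusion reduces to an average, while a strongly dominant score lets a single group determine the output at that position, which is one way to picture borrowing information adaptively from different temporal distances.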