Efficient Diffusion-Based 3D Human Pose Estimation with Hierarchical Temporal Pruning

Journal: IEEE Transactions on Circuits and Systems for Video Technology
DOI: https://doi.org/10.1109/tcsvt.2026.3666928
Publication Date: 2026-01-01
Author(s): Yuquan Bi et al.
Primary Topic: Human Pose and Action Recognition

Overview

The authors introduce a diffusion-based framework for 3D human pose estimation that targets the high computational cost of earlier diffusion-based methods. The proposed Hierarchical Temporal Pruning (HTP) strategy reduces redundancy by dynamically pruning pose tokens at both the frame level and the semantic level while preserving essential motion dynamics. HTP operates in three stages: (1) Temporal Correlation-Enhanced Pruning (TCEP) identifies key frames through adaptive temporal graph analysis of inter-frame motion correlations; (2) Sparse-Focused Temporal Multi-Head Self-Attention (SFT MHSA) reduces attention computation by concentrating on motion-relevant tokens; and (3) the Mask-Guided Pose Token Pruner (MGPTP) refines the selection of pose tokens through clustering, so that only the most informative tokens are retained.
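
The frame-level stage can be illustrated with a small sketch. The summary does not say how TCEP builds or analyzes its adaptive temporal graph, so the code below substitutes a much simpler stand-in: a dense cosine-similarity graph over per-frame feature vectors, from which the least redundant frames are kept. The function names, the similarity measure, and the `keep` parameter are assumptions made for illustration, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_key_frames(frames, keep):
    """Return the sorted indices of the `keep` least redundant frames.

    Hypothetical stand-in for frame-level pruning: a frame's redundancy is
    its mean similarity to every other frame in the clip.
    """
    n = len(frames)
    redundancy = []
    for i in range(n):
        sims = [cosine(frames[i], frames[j]) for j in range(n) if j != i]
        redundancy.append(sum(sims) / len(sims) if sims else 0.0)
    # Lowest redundancy first; ties are broken by frame index.
    ranked = sorted(range(n), key=lambda i: (redundancy[i], i))
    return sorted(ranked[:keep])


# Three identical frames and one distinct frame (toy 2-D features).
clip = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
key_frames = select_key_frames(clip, keep=2)  # [0, 3]
```

The distinct frame is always retained, and only one representative of the three identical frames survives.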

Experiments on the Human3.6M and MPI-INF-3DHP datasets show that HTP reduces multiply-accumulate operations (MACs) by 38.5% in training and 56.8% in inference, and raises inference speed by 81.1% on average, compared with previous diffusion-based approaches, while attaining state-of-the-art accuracy. The work shows that HTP can streamline 3D human pose estimation without compromising accuracy.
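
As a rough reading of these figures, a reduction of r percent in MACs leaves a fraction 1 - r/100 of the original work. That would translate into a throughput gain of 1 / (1 - r/100) only if runtime were strictly proportional to MACs. The sketch below does that arithmetic; the proportionality is an assumption, and the function name is illustrative.

```python
def ideal_speedup(mac_reduction_pct):
    """Throughput multiplier if runtime scaled exactly with the MAC count."""
    remaining = 1.0 - mac_reduction_pct / 100.0
    return 1.0 / remaining


training_bound = ideal_speedup(38.5)   # about 1.63x
inference_bound = ideal_speedup(56.8)  # about 2.31x
reported_gain = 1.0 + 81.1 / 100.0     # an 81.1% speed increase is 1.811x
```

The reported average gain of about 1.8x is below the MAC-proportional figure of about 2.3x. One plausible reading is that some runtime costs, such as memory traffic, do not shrink in step with MACs.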

Introduction

The paper's introduction reviews progress in 3D human pose estimation (HPE) from monocular video and its role in applications such as action recognition, human-robot interaction, and virtual reality. The dominant approach is a 2D-to-3D lifting pipeline, which is accurate and lightweight but is hampered by the lack of depth priors and by inherent ambiguities. Recent work exploits temporal correlations across video frames to improve pose reconstruction, particularly through transformer-based architectures that capture long-range dependencies. These methods carry significant computational overhead, especially during inference, because the cost of self-attention grows quadratically with the number of tokens.
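
The quadratic term is easy to see by counting multiply-accumulate operations in one attention layer. The sketch below counts only the two matrix products that involve the n x n attention map (the scores and the weighted sum) and ignores the linear projections. The clip lengths and channel width are illustrative values, not figures from the paper.

```python
def attention_map_macs(num_tokens, dim):
    """MACs for Q @ K^T plus attention @ V, with total channel width `dim`."""
    scores = num_tokens * num_tokens * dim    # Q (n x d) times K^T (d x n)
    weighted = num_tokens * num_tokens * dim  # attention (n x n) times V (n x d)
    return scores + weighted


full = attention_map_macs(num_tokens=243, dim=512)
pruned = attention_map_macs(num_tokens=81, dim=512)
ratio = full / pruned  # keeping one third of the tokens cuts this term ninefold
```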

To address these challenges, the authors propose Hierarchical Temporal Pruning (HTP), a framework that improves computational efficiency while preserving motion fidelity in diffusion-based 3D HPE. HTP combines three components: the Temporal Correlation-Enhanced Pruning (TCEP) module, which analyzes temporal correlations; Sparse-Focused Temporal Multi-Head Self-Attention (SFT MHSA), which directs attention toward relevant pose tokens; and the Mask-Guided Pose Token Pruner (MGPTP), which discards redundant tokens. The framework reduces training and inference computational costs by 38.5% and 56.8%, respectively. It also strengthens the model’s ability to preserve critical motion dynamics, achieving state-of-the-art accuracy in extensive experiments on benchmark datasets.
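
One simple way to picture attention that is focused on a subset of tokens is to let every query attend only to a chosen set of key positions. The single-head sketch below does exactly that. It is not the authors' SFT MHSA, and the `kept` index list (which would come from a pruning step) is a hypothetical input.

```python
import math


def sparse_attention(queries, keys, values, kept):
    """Scaled dot-product attention restricted to the key/value indices in `kept`."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Scores are computed only for the retained positions.
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
                  for j in kept]
        top = max(scores)  # subtract the maximum for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * values[j][c] for w, j in zip(weights, kept)) / total
            for c in range(len(values[0]))
        ])
    return outputs


queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
only_last = sparse_attention(queries, keys, values, kept=[2])
first_two = sparse_attention(queries, keys, values, kept=[0, 1])
```

In this sketch the work per query scales with `len(kept)` rather than with the full sequence length, which is the source of the saving.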

Results

In the results section, the authors evaluate Hierarchical Temporal Pruning (HTP) for 3D human pose estimation (HPE) on two datasets: Human3.6M and MPI-INF-3DHP. On Human3.6M, HTP achieves state-of-the-art (SOTA) performance. With 2D poses detected by CPN, it reaches a mean per joint position error (MPJPE) of 29.9 mm and a P-MPJPE of 23.3 mm; with ground-truth 2D poses, the reported error is 16.7 mm. Notably, HTP reduces parameter count by 81% and computational cost by 40% while outperforming the previous SOTA method, FinePose, by 2.0 mm in MPJPE. The method also integrates with transformer-based frameworks, where it improves pose accuracy and reduces multiply-accumulate operations (MACs).
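
For reference, MPJPE is the mean Euclidean distance between predicted and ground-truth joint positions, and P-MPJPE is the same quantity computed after rigidly aligning the prediction to the ground truth (Procrustes alignment). A minimal sketch of the unaligned metric follows, with joints given as (x, y, z) tuples in millimetres. The function name and the toy two-joint skeleton are illustrative.

```python
import math


def mpjpe(predicted, ground_truth):
    """Mean per joint position error: average Euclidean distance over joints."""
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)


predicted = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
ground_truth = [(3.0, 4.0, 0.0), (0.0, 0.0, 0.0)]
error_mm = mpjpe(predicted, ground_truth)  # (5.0 + 0.0) / 2 = 2.5
```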

On the MPI-INF-3DHP dataset, HTP maintains a competitive MPJPE while improving the percentage of correct keypoints (PCK) and the area under the curve (AUC) by 0.5%. The results indicate that HTP remains effective even after temporal redundancy is reduced. HTP also delivers significant efficiency gains: under low-sampling conditions (K=1), it achieves a 3× increase in inference frames per second (FPS) while maintaining a lower MPJPE than other methods operating at higher sampling rates. Overall, HTP demonstrates strong generalization and cost-efficiency, making it suitable for real-world 3D HPE applications.
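
PCK is the fraction of joints whose position error falls within a distance threshold, and AUC averages PCK over a sweep of thresholds. The sketch below computes both from a list of per-joint errors. The 150 mm threshold and the sweep from 0 to 150 mm in 5 mm steps are settings commonly used with this benchmark and are assumptions here, since the summary does not state them.

```python
def pck(errors_mm, threshold_mm=150.0):
    """Fraction of joints whose error is within the threshold."""
    return sum(e <= threshold_mm for e in errors_mm) / len(errors_mm)


def auc(errors_mm, max_threshold_mm=150.0, steps=31):
    """Mean PCK over `steps` evenly spaced thresholds from 0 to the maximum."""
    thresholds = [max_threshold_mm * i / (steps - 1) for i in range(steps)]
    return sum(pck(errors_mm, t) for t in thresholds) / steps


joint_errors = [50.0, 100.0, 200.0]  # toy per-joint errors in millimetres
pck_150 = pck(joint_errors)          # two of three joints are within 150 mm
auc_150 = auc(joint_errors)
```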

Discussion

The discussion reviews advances in 3D human pose estimation (HPE) with transformer-based and diffusion-based models. Transformer architectures such as PoseFormer and MixSTE use self-attention to capture long-range dependencies, which improves pose accuracy on video sequences, but their computational cost scales quadratically with sequence length. Diffusion models, which progressively add noise to data and then learn to remove it, have shown promise in generating high-fidelity 3D poses. Notable methods such as D3DP and FinePose use diffusion to address depth ambiguity, but their iterative refinement also incurs significant computational cost.
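
The "add noise, then denoise" idea can be made concrete with the standard DDPM forward process, x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, where abar_t is the cumulative product of (1 - beta) up to step t. The sketch below implements only that forward step with a linear noise schedule. The schedule values and function names are assumptions, and D3DP, FinePose, and HTP may use different schedules and parameterizations.

```python
import math
import random


def alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    bars, product = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        product *= 1.0 - beta
        bars.append(product)
    return bars


def add_noise(x0, alpha_bar, noise):
    """Forward diffusion: blend the clean pose with Gaussian noise."""
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [signal_scale * x + noise_scale * n for x, n in zip(x0, noise)]


schedule = alpha_bars(num_steps=1000)
rng = random.Random(0)
clean_pose = [0.1, -0.3, 0.5]  # toy flattened joint coordinates
noise = [rng.gauss(0.0, 1.0) for _ in clean_pose]
slightly_noisy = add_noise(clean_pose, schedule[0], noise)
mostly_noise = add_noise(clean_pose, schedule[-1], noise)
```

A denoiser is trained to invert this process step by step, and that repeated inversion at inference time is the iterative refinement whose cost the text refers to.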

The proposed Hierarchical Temporal Pruning (HTP) method improves efficiency in 3D HPE by integrating frame-level and semantic-level pruning. Combining the two levels allows more informed token selection, preserving critical motion information while reducing computational overhead. The pruning proceeds in two phases: first, HTP establishes a sparse temporal topology to filter out redundant frames; second, it aggregates the remaining tokens into high-level descriptors. This hierarchical design improves computational efficiency and keeps motion coherent during the denoising process, ultimately achieving superior performance compared with existing state-of-the-art methods in 3D HPE.
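
The second phase, aggregating the surviving tokens into a smaller set of descriptors, can be pictured as clustering followed by averaging. The sketch below uses plain k-means with a deterministic initialization as a stand-in. The summary says only that clustering is involved, so the choice of k-means, the initialization, and the function name are assumptions.

```python
def cluster_tokens(tokens, k, iterations=10):
    """Group tokens with k-means and return one mean descriptor per cluster."""
    centers = [list(t) for t in tokens[:k]]  # deterministic init: first k tokens
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for token in tokens:
            # Assign the token to its nearest center (squared Euclidean distance).
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(token, centers[c])),
            )
            groups[nearest].append(token)
        for c, group in enumerate(groups):
            if group:  # leave a center unchanged if its cluster is empty
                centers[c] = [sum(column) / len(group) for column in zip(*group)]
    return centers


tokens = [[0.0], [10.0], [2.0], [12.0]]    # toy 1-D pose tokens
descriptors = cluster_tokens(tokens, k=2)  # [[1.0], [11.0]]
```

Four tokens are reduced to two descriptors, so any later layer operates on half as many inputs.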

Limitations

Despite its efficiency gains, the HTP framework has notable limitations. Under severe self-occlusion, pruning can unintentionally remove frames that are essential for capturing complex articulations accurately. In addition, as a 2D-to-3D lifting framework, HTP is constrained by the quality of its 2D input, particularly in noisy outdoor environments.

To address these limitations, future research will focus on occlusion-aware attention mechanisms and spatial uncertainty modeling. These enhancements are intended to make the HTP framework more robust to occlusion and to improve performance in challenging scenarios.