A Hierarchical 3D Gaussian Representation for Real-Time Rendering of Very Large Datasets

Journal: ACM Transactions on Graphics, Volume: 43, Issue: 4
DOI: https://doi.org/10.1145/3658160
Publication Date: 2024-07-19
Author(s): Bernhard Kerbl et al.
Primary Topic: Computer Graphics and Visualization Techniques

Overview

In this research, the authors introduce a novel view synthesis method that renders, in real time, street-level scenes spanning several kilometers and captured with tens of thousands of input images. The method rests on three key contributions: an efficient hierarchy of 3D Gaussians that provides a smooth level-of-detail mechanism for displaying extensive scenes; an optimization technique for this hierarchy that improves the trade-off between quality and speed; and a divide-and-conquer, chunk-based algorithm that enables parallel processing of independent scene segments. Together, these allow a large dataset to be processed end to end within a few hours on a compute cluster.
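To make the chunk-based idea concrete, the following is a minimal sketch of one way to group input cameras into independent chunks. It assumes each camera has a known ground-plane position and that chunks are square grid cells; the `Camera` layout, the `assign_to_chunks` function, and the 50-unit cell size are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical camera record: (image name, x, y) position on the ground plane.
Camera = Tuple[str, float, float]
Cell = Tuple[int, int]


def assign_to_chunks(cameras: List[Camera], chunk_size: float) -> Dict[Cell, List[str]]:
    """Group cameras into square ground-plane cells of side `chunk_size`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks: Dict[Cell, List[str]] = defaultdict(list)
    for name, x, y in cameras:
        # Floor division also handles negative coordinates correctly.
        cell = (int(x // chunk_size), int(y // chunk_size))
        chunks[cell].append(name)
    return dict(chunks)


cameras = [
    ("a.jpg", 10.0, 5.0),
    ("b.jpg", 60.0, 5.0),
    ("c.jpg", 12.0, 8.0),
    ("d.jpg", -3.0, 40.0),
]
chunks = assign_to_chunks(cameras, chunk_size=50.0)
for cell, names in sorted(chunks.items()):
    print(cell, names)
```

Because each cell holds its own list of images, the cells could in principle be handed to separate workers and optimized independently, which is the property the divide-and-conquer description relies on.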

The implementation details of the system are discussed thoroughly, beginning with data cleanup, followed by a two-stage pose estimation pipeline built on COLMAP that calibrates a large number of cameras in a matter of hours. The authors also outline their regularization strategies for depth and exposure, which make the method more robust. Overall, this work makes capturing and navigating large environments in real time considerably more accessible.

Introduction

The introduction surveys advances in novel view synthesis, particularly radiance fields, which have significantly improved visual quality and enabled real-time rendering. The authors highlight 3D Gaussian Splatting (3DGS) as a primitive-based representation that balances high visual fidelity with fast training and rendering. However, they note that existing methods struggle to scale to large scenes because of memory constraints and the implicit nature of their representations. To address these limitations, the authors propose a divide-and-conquer approach that trains and renders significantly larger scenes by employing a hierarchical structure of 3D Gaussians. This structure allows interior nodes to be optimized and supports level-of-detail (LOD) rendering, improving both visual quality and rendering speed.

The proposed method subdivides large scenes into manageable chunks, optimizes each chunk independently, and builds a single hierarchical representation that can be rendered in real time. The key contributions are a new hierarchy for 3DGS that supports efficient level selection and interpolation, an optimization technique for interior node properties that improves visual quality, and a chunk-based training strategy that accommodates the sparse data typical of large scene captures. The authors demonstrate their approach on several datasets, showing that it enables real-time navigation in extensive 3D environments and can adapt to consumer-level equipment. The paper promises to release the source code and large datasets to facilitate further research and application in this domain.
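The interpolation between hierarchy levels mentioned above can be illustrated with a short sketch. It assumes that each node carries a screen-space size, that a pixel threshold `tau` decides when a parent replaces its children, and that attributes are blended linearly; the function names and the exact weight formula are illustrative assumptions, not the authors' stated implementation.

```python
from typing import List


def blend_weight(child_size: float, parent_size: float, tau: float) -> float:
    """Return the parent's weight in [0, 1].

    The weight is 0 when the child's size equals `tau` (show the child) and
    1 when the parent's size has shrunk to `tau` (show the parent).
    """
    if parent_size <= child_size:
        return 1.0  # degenerate case: nothing to interpolate
    t = (tau - child_size) / (parent_size - child_size)
    return min(1.0, max(0.0, t))


def lerp_attributes(child: List[float], parent: List[float], t: float) -> List[float]:
    """Linearly blend child attributes toward the parent's (t=0 child, t=1 parent)."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return [(1.0 - t) * c + t * p for c, p in zip(child, parent)]


# Child projects to 2 px, parent to 6 px, threshold is 3 px.
t = blend_weight(child_size=2.0, parent_size=6.0, tau=3.0)
blended = lerp_attributes(child=[0.0, 10.0], parent=[4.0, 2.0], t=t)
print(t, blended)  # 0.25 [1.0, 8.0]
```

Plain linear blending suits attributes such as position or color; orientation and covariance generally need more care, which this sketch does not attempt.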

Results

In this section, the authors present results on four captured scenes: SmallCity, BigCity, Campus, and an additional scene from Wayve. The results are illustrated primarily through supplemental videos that demonstrate real-time navigation within these large environments. The videos follow long paths through each scene: SmallCity covers a single urban block, the Campus and Wayve scenes each extend over more than a kilometer, and the BigCity scene spans several kilometers. The training data for these scenes is divided into 4, 11, 22, and 46 chunks, respectively.

The authors compare rendering quality and frames per second (FPS) against previous methods, using one chunk per scene because the competing methods cannot handle the full scenes. Their approach, denoted “Ours (leaves),” incorporates per-chunk optimization and the improvements outlined in Section 6.2. They also evaluate the impact of their novel level-of-detail components, with variants labeled “Ours (τ)” for different granularity settings (τ₁ = 3, τ₂ = 6, and τ₃ = 15 pixels) and “Ours opt” for optimized hierarchies. The results mark the best and second-best performance in each category, underscoring the effectiveness of the method in both rendering quality and efficiency.
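The role of the granularity settings can be sketched as a cut through the hierarchy: descend from the root and stop at the first node whose projected size falls at or below the pixel threshold. The sketch below assumes a simple pinhole projection (`focal_px * extent / depth`) and a hypothetical `Node` class; the threshold is called `tau` in the code, and none of these names come from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    extent: float  # world-space size of the node (hypothetical units: metres)
    depth: float   # distance from the camera, same units
    children: List["Node"] = field(default_factory=list)


def projected_size(node: Node, focal_px: float) -> float:
    """Approximate on-screen size in pixels under a pinhole camera."""
    return focal_px * node.extent / node.depth


def select_cut(root: Node, tau: float, focal_px: float) -> List[str]:
    """Return the names of the nodes to render for pixel threshold `tau`."""
    cut: List[str] = []
    stack = [root]
    while stack:
        node = stack.pop()
        # Leaves are always rendered; interior nodes only once small enough.
        if not node.children or projected_size(node, focal_px) <= tau:
            cut.append(node.name)
        else:
            stack.extend(reversed(node.children))  # visit children in order
    return cut


a = Node("A", 1.0, 100.0, [Node("A1", 0.4, 100.0), Node("A2", 0.2, 100.0)])
b = Node("B", 0.5, 100.0, [Node("B1", 0.25, 100.0), Node("B2", 0.25, 100.0)])
root = Node("root", 4.0, 100.0, [a, b])

for tau in (3.0, 6.0, 15.0):
    print(tau, select_cut(root, tau, focal_px=1000.0))
```

In this toy tree, raising the threshold from 3 to 15 pixels shrinks the cut from four leaves to two interior nodes, which is the quality-for-speed trade that coarser granularity settings are meant to expose.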

Discussion

In this section, the authors discuss the challenges and advances in reconstructing large-scale urban scenes with image-based, mesh-based, and neural rendering techniques. Traditional methods, such as Structure-from-Motion (SfM) and multi-view stereo, struggle with complex urban environments that contain vegetation and non-Lambertian surfaces. Recent approaches such as Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (3DGS) render high-quality visuals, but they often run into scalability and resource limits when applied to city-scale datasets. The authors position their contribution as a hierarchical 3DGS framework that improves rendering efficiency and visual fidelity through a divide-and-conquer strategy, allowing large scenes to be rendered in real time.

The proposed hierarchy consists of 3D Gaussian primitives that can be optimized for visual quality while keeping the rendering routine fast. The authors detail how the hierarchy is generated, including how Gaussian nodes are merged based on their spatial attributes and how level-of-detail (LOD) techniques manage rendering performance dynamically. They emphasize that transitions between levels of detail must be smooth to avoid visual artifacts. The authors also describe how they train large scenes by subdividing them into manageable chunks, which enables parallel processing and efficient optimization. The methodology aims to make neural rendering techniques scale to urban environments, contributing to novel view synthesis and real-time rendering.
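As an illustration of merging Gaussian nodes, the sketch below uses standard moment matching: the merged mean is the weighted average of the child means, and the merged covariance is the weighted sum of each child's covariance plus the outer product of its offset from the merged mean. This is a generic technique shown under assumptions; the weights here are arbitrary inputs, and the summary does not specify how the authors weight or merge nodes.

```python
from typing import List, Sequence, Tuple

Vec = List[float]
Mat = List[List[float]]


def merge_gaussians(
    weights: Sequence[float], means: Sequence[Vec], covs: Sequence[Mat]
) -> Tuple[Vec, Mat]:
    """Moment-matched merge of several Gaussians into a single Gaussian."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    w = [wi / total for wi in weights]  # normalise the weights
    dim = len(means[0])
    mean = [sum(wi * m[k] for wi, m in zip(w, means)) for k in range(dim)]
    cov = [[0.0] * dim for _ in range(dim)]
    for wi, m, c in zip(w, means, covs):
        d = [m[k] - mean[k] for k in range(dim)]  # offset from the merged mean
        for r in range(dim):
            for s in range(dim):
                cov[r][s] += wi * (c[r][s] + d[r] * d[s])
    return mean, cov


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
merged_mean, merged_cov = merge_gaussians(
    weights=[1.0, 1.0],
    means=[[-1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
    covs=[identity, identity],
)
print(merged_mean)  # [0.0, 0.0, 0.0]
print(merged_cov)   # [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```

The merged Gaussian is wider along the x axis because the two children are separated along it, which is how a parent node can stand in for its children when viewed from far away.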