ACT-R: Adaptive Camera Trajectories for Single-View 3D Reconstruction

Journal: 2026 International Conference on 3D Vision (3DV)
DOI: https://doi.org/10.1109/3dv69130.2026.00064
Publication Date: 2026-03-20
Author(s): Yizhi Wang et al.
Primary Topic: Advanced Vision and Imaging

Overview

The paper presents an approach to multi-view synthesis based on adaptive view planning, aimed at improving occlusion revelation and 3D consistency in single-view 3D reconstruction. Rather than relying on a fixed camera setup, the method generates a sequence of views along an adaptive camera trajectory (ACT) chosen to maximize the visibility of occluded regions. A video diffusion model synthesizes the novel views along this trajectory, so the pipeline reuses pre-trained models for both occlusion analysis and multi-view synthesis. The authors report significant improvements over state-of-the-art (SOTA) methods on the unseen GSO (Google Scanned Objects) dataset.
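To make the contrast between a fixed camera setup and an adaptive trajectory concrete, the sketch below builds look-at camera poses for two orbits around an object at the origin. It is an illustration only: the per-frame azimuth/elevation parameterization, the radius, the frame count, the sinusoidal elevation profile, and the helper names `look_at_pose` and `orbit_poses` are all assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def look_at_pose(cam_pos, target=(0.0, 0.0, 0.0), up=(0.0, 0.0, 1.0)):
    """Return a 4x4 camera-to-world matrix for a camera at cam_pos facing target.

    Assumed convention: the camera looks along its local -z axis and +z is world up.
    Not valid when the viewing direction is parallel to `up`.
    """
    cam_pos = np.asarray(cam_pos, dtype=float)
    forward = np.asarray(target, dtype=float) - cam_pos
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, np.asarray(up, dtype=float))
    right /= np.linalg.norm(right)
    true_up = np.cross(right, forward)
    pose = np.eye(4)
    pose[:3, 0] = right
    pose[:3, 1] = true_up
    pose[:3, 2] = -forward
    pose[:3, 3] = cam_pos
    return pose

def orbit_poses(azim_deg, elev_deg, radius=2.0):
    """Build one look-at pose per frame from per-frame azimuth/elevation in degrees."""
    az = np.radians(np.asarray(azim_deg, dtype=float))
    el = np.radians(np.asarray(elev_deg, dtype=float))
    positions = radius * np.stack(
        [np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)], axis=-1
    )
    return np.stack([look_at_pose(p) for p in positions])

num_frames = 12  # hypothetical frame count
azim = np.linspace(0.0, 360.0, num_frames, endpoint=False)

# Fixed setup: every frame shares the same elevation.
fixed = orbit_poses(azim, np.full(num_frames, 10.0))

# Adaptive-style trajectory: elevation changes per frame (made-up profile).
adaptive = orbit_poses(azim, 10.0 + 30.0 * np.sin(np.radians(azim)))

print(fixed.shape, adaptive.shape)
```

Both orbits visit the same azimuths; only the elevation schedule differs, which is enough to let the second trajectory look down into or up under parts of the object that a level orbit never faces.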

In the conclusion, the authors highlight the effectiveness of ACT-R in revealing occluded structures while maintaining multi-view consistency. They acknowledge limitations of the current implementation, such as restricted camera degrees of freedom and the quality of the generated videos, and suggest that adaptive view planning could be applied to other video generation models with greater flexibility. They also note that, although adaptive trajectories improve occlusion revelation, the synthesized images do not yet match the quality achieved with fixed camera setups. Future work will focus on integrating adaptive view planning with large-scale direct image-to-3D reconstruction models and on exploring further applications in robotics and autoscanning.

Introduction

The introduction surveys the challenges and recent advances in single-view 3D reconstruction, a prominent area of computer vision. Traditional methods often rely on large 3D datasets for training and use direct 3D supervision to generate or regress 3D representations from input images. More recent approaches build on novel view synthesis: they first perform multi-view synthesis, generating an unordered set of views either independently or simultaneously, and then run multi-view 3D reconstruction through differentiable rendering. These methods still struggle to reveal occluded structures and to keep the generated views 3D-consistent.

The authors address these problems by introducing adaptive view planning into multi-view synthesis. Unlike previous methods that synthesize unordered views, their approach generates an ordered sequence of views and exploits temporal consistency to improve 3D coherence. The goal is to overcome the limitations of state-of-the-art techniques in both occlusion revelation and the overall consistency of the reconstructed 3D model.

Discussion

This section discusses ACT-R (Adaptive Camera Trajectories for single-view 3D Reconstruction), developed by researchers at Simon Fraser University. ACT-R improves the visibility of occluded regions of a 3D object by generating a camera trajectory adapted to the specific input view. The pipeline uses Slice3D to produce volumetric slice images, which are then analyzed for semantic differences relative to the input image. The camera trajectory that maximizes the visibility of occluded areas is selected, and the Stable Video 3D (SV3D) model generates multi-view images along it. The generated views can then be processed by various 3D reconstruction models, such as NeuS and InstantMesh, without requiring direct 3D supervision.
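The trajectory selection step described above can be illustrated with a toy scoring loop. This is a minimal sketch under strong assumptions, not the authors' procedure: the occluded regions are given here as hypothetical 3D points with surface normals (standing in for whatever the slice analysis produces), visibility is approximated by whether a surface faces the camera, self-occlusion is ignored, and the names `visibility_score` and `select_trajectory` are invented for this example.

```python
import numpy as np

def camera_positions(azim_deg, elev_deg, radius=2.0):
    """Per-frame azimuth/elevation in degrees -> camera positions on a sphere."""
    az = np.radians(np.asarray(azim_deg, dtype=float))
    el = np.radians(np.asarray(elev_deg, dtype=float))
    return radius * np.stack(
        [np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)], axis=-1
    )

def visibility_score(cams, points, normals, min_cos=0.2):
    """Fraction of points whose surface faces at least one camera.

    A point counts as seen by a camera when the cosine between its normal and
    the direction to that camera exceeds min_cos (a made-up threshold).
    """
    points = np.asarray(points, dtype=float)
    normals = np.asarray(normals, dtype=float)
    to_cam = cams[:, None, :] - points[None, :, :]            # (frames, points, 3)
    to_cam /= np.linalg.norm(to_cam, axis=-1, keepdims=True)
    cos = np.einsum("fpc,pc->fp", to_cam, normals)            # (frames, points)
    return float((cos > min_cos).any(axis=0).mean())

def select_trajectory(candidates, points, normals):
    """Score every candidate trajectory and return the best name plus all scores."""
    scores = {
        name: visibility_score(camera_positions(az, el), points, normals)
        for name, (az, el) in candidates.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical occluded region: upward-facing surface patches near the object top.
points = np.array([[0.0, 0.0, 0.1], [0.1, 0.0, 0.1], [0.0, 0.1, 0.1]])
normals = np.tile([0.0, 0.0, 1.0], (3, 1))

azim = np.linspace(0.0, 360.0, 8, endpoint=False)
candidates = {
    "flat": (azim, np.zeros(8)),        # level orbit at 0 degrees elevation
    "high": (azim, np.full(8, 45.0)),   # elevated orbit at 45 degrees
}
best, scores = select_trajectory(candidates, points, normals)
print(best, scores)
```

In this toy setup the elevated orbit is selected, because a level orbit never faces the upward-pointing patches. A real pipeline would replace the facing test with an occlusion-aware visibility estimate and draw the candidate set from the trajectories the video model can actually follow.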

The results show that ACT-R outperforms existing state-of-the-art methods in both qualitative and quantitative evaluations on the GSO benchmark. In particular, it achieves better reconstruction quality in occluded regions while remaining efficient, since it needs no ground-truth data. Although ACT-R does not surpass some recent large reconstruction models trained with direct 3D supervision, it improves the performance of these models when integrated with them. Overall, the findings support adaptive view planning as a way to improve the accuracy and quality of 3D reconstruction from single-view images.