Optical generative models

Journal: Nature, Volume: 644, Issue: 8078
DOI: https://doi.org/10.1038/s41586-025-09446-5
PMID: https://pubmed.ncbi.nlm.nih.gov/40866675
Publication Date: 2025-08-27
Author(s): Shiqi Chen et al.
Primary Topic: Neural Networks and Reservoir Computing

Overview

The paper develops optical generative models inspired by diffusion models, motivated by the difficulty of achieving scalable, energy-efficient inference with large digital generative models. A shallow digital encoder converts random noise into phase patterns, which serve as optical generative seeds for the desired data distribution. A jointly trained, free-space-based reconfigurable decoder then processes these seeds optically to generate novel images that follow the target distributions, such as handwritten digits, fashion items, butterflies, human faces, and Van Gogh artworks.

The key advantage of these optical generative models is that image synthesis requires minimal computing power: the only costs are the illumination power and the generation of the initial random seeds. Experimental results demonstrate optical generation of both monochrome and multicolored images, with performance comparable to digital neural-network-based generative models. This points to a promising direction for energy-efficient and scalable inference in artificial intelligence, one that uses optics and photonics for content generation.

Methods

The authors describe the experimental setup and methodology used to demonstrate snapshot optical generative models with a reconfigurable system operating in the visible spectrum. A 520 nm laser illuminates a spatial light modulator (SLM) that displays pre-calculated phase patterns $\phi(x, y)$ computed by a shallow digital encoder. The optical fields modulated by these patterns are then processed by a second SLM acting as a static decoder. The experiments focused on generating images from the MNIST and Fashion-MNIST datasets and achieved Fréchet Inception Distance (FID) scores of 131.08 and 180.57, respectively, indicating that the generated images follow the target distributions.
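The optical path described above (phase seed on the first SLM, free-space propagation, phase-only decoder on the second SLM, intensity at the sensor) can be sketched numerically. The following is a minimal illustration under stated assumptions, not the authors' implementation: the single-layer `tanh` encoder, the 8 µm pixel pitch, the 0.1 m propagation distances, the 32×32 grid, and the random (untrained) decoder phase are all hypothetical; only the 520 nm wavelength comes from the text. Propagation uses the standard angular spectrum method.

```python
import numpy as np

def angular_spectrum(field, wavelength, pitch, distance):
    """Propagate a square complex field over `distance` in free space."""
    n = field.shape[0]
    fx = np.fft.fftfreq(n, d=pitch)                 # spatial frequencies (1/m)
    fxx, fyy = np.meshgrid(fx, fx, indexing="ij")
    arg = 1.0 / wavelength**2 - fxx**2 - fyy**2
    kz = 2 * np.pi * np.sqrt(np.maximum(arg, 0.0))
    # Unit-modulus transfer function; evanescent components are dropped.
    transfer = np.where(arg > 0, np.exp(1j * kz * distance), 0.0)
    return np.fft.ifft2(np.fft.fft2(field) * transfer)

def snapshot_generate(noise, encoder_weights, decoder_phase,
                      wavelength=520e-9, pitch=8e-6, distance=0.1):
    """Noise vector -> phase seed -> decoder -> sensor intensity."""
    n = decoder_phase.shape[0]
    # Hypothetical shallow encoder: one linear layer squashed into (0, 2*pi).
    seed_phase = np.pi * (1.0 + np.tanh(encoder_weights @ noise)).reshape(n, n)
    field = np.exp(1j * seed_phase)                 # SLM 1: phase-only seed
    field = angular_spectrum(field, wavelength, pitch, distance)
    field = field * np.exp(1j * decoder_phase)      # SLM 2: phase-only decoder
    field = angular_spectrum(field, wavelength, pitch, distance)
    return np.abs(field) ** 2                       # the sensor records intensity

rng = np.random.default_rng(0)
n = 32
noise = rng.standard_normal(64)
encoder_weights = rng.standard_normal((n * n, 64)) / 8.0   # untrained, illustrative
decoder_phase = rng.uniform(0.0, 2 * np.pi, (n, n))        # untrained, illustrative
image = snapshot_generate(noise, encoder_weights, decoder_phase)
```

With untrained weights the output is only a speckle pattern; in the paper the encoder and decoder are trained jointly so that the intensity follows the target distribution. Because every step here is phase-only or unitary, the total intensity equals the input power, which is a convenient sanity check for a simulation of this kind.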

The authors further explore the latent space of the generative model by investigating the relationship between random noise inputs and generated images, as well as the effects of limited phase-encoding space and decoder bit-depth on performance. They also extend their work to generate higher-resolution images in the style of Van Gogh, demonstrating the system’s versatility and creative variability. The results indicate that the diffractive decoder outperforms free-space-based decoding methods, achieving stable image generation with superior quality. Quantitative evaluations, including peak signal-to-noise ratios and CLIP scores, confirm the fidelity and semantic consistency of the generated images compared to simulations.
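Two of the quantities mentioned above are easy to make concrete: the effect of a limited decoder bit depth on a phase map, and the peak signal-to-noise ratio (PSNR) used to compare experimental outputs with simulations. The sketch below is illustrative only; the uniform quantizer, the 4-bit setting, and the helper names `quantize_phase` and `psnr` are assumptions rather than details taken from the paper.

```python
import numpy as np

def quantize_phase(phase, bits):
    """Round a phase map to 2**bits evenly spaced levels in [0, 2*pi)."""
    levels = 2 ** bits
    step = 2 * np.pi / levels
    index = np.round(np.mod(phase, 2 * np.pi) / step) % levels  # wrap 2*pi back to 0
    return index * step

def psnr(reference, test, peak=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, peak]."""
    diff = np.asarray(reference, dtype=float) - np.asarray(test, dtype=float)
    mse = np.mean(diff ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(0)
phase = rng.uniform(0.0, 2 * np.pi, (64, 64))
phase_4bit = quantize_phase(phase, bits=4)

# Largest wrapped phase error introduced by quantization (at most half a step).
worst_error = np.max(np.abs(np.angle(np.exp(1j * (phase - phase_4bit)))))
```

Halving the number of bits doubles the step size, so the worst-case phase error grows from π/256 at 8 bits to π/16 at 4 bits; feeding the quantized mask into a propagation simulation is one way to estimate how much image quality a given bit depth costs.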

Discussion

The paper demonstrates that optical generative models built on a diffractive network architecture can synthesize diverse images from noise patterns. Unlike traditional optical systems focused on tasks such as imaging or classification, this framework enables creative snapshot image generation: it produces images that adhere to specific data distributions without altering the physical setup. Reconfiguring the diffractive decoder adapts the model to different target distributions, which suggests its versatility for applications in edge computing and augmented or virtual reality.

The study employs a teacher digital generative model based on a denoising diffusion probabilistic model (DDPM) to distill knowledge of the target distributions, which helps the optical model capture semantic information. The iterative optical generative models extend this capability: they avoid mode collapse and generate diverse outputs through a self-supervised learning process. The results indicate that these models can produce high-quality images with clearer backgrounds and improved performance metrics, such as lower Fréchet Inception Distance (FID) and higher Inception Score (IS), compared to original datasets. Challenges remain, however, including hardware misalignments and phase bit-depth limitations, which can be addressed through integrated training strategies that optimize performance within the physical constraints.
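For readers unfamiliar with DDPMs, the forward (noising) process that such a teacher model learns to invert has a closed form: $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$ and $\bar{\alpha}_t = \prod_{s \le t}(1 - \beta_s)$. The sketch below shows only this generic step. The linear schedule from $10^{-4}$ to $0.02$ over 1000 steps is the common default from the original DDPM formulation and is an assumption here, as are the function names; the summary does not state which schedule the authors' teacher uses.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def noise_image(x0, t, alpha_bar, rng):
    """Sample x_t from q(x_t | x_0) in one shot; also return the noise used."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.uniform(-1.0, 1.0, (28, 28))      # stand-in for a normalized image
x_mid, eps_mid = noise_image(x0, t=500, alpha_bar=alpha_bar, rng=rng)
```

At the final step almost none of the original image survives, which is why sampling from a trained DDPM can start from pure Gaussian noise; the teacher's samples then serve as targets from which the optical model is distilled.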