Crystal structure generation with autoregressive large language modeling

Journal: Nature Communications, Volume: 15, Issue: 1
DOI: https://doi.org/10.1038/s41467-024-54639-7
PMID: https://pubmed.ncbi.nlm.nih.gov/39643601
Publication Date: 2024-12-06
Author(s): Luis M. Antunes et al.
Primary Topic: Machine Learning in Materials Science

Overview

This section compares model performance across training sets and model configurations. Bold numbers mark the best result obtained with \( n = 20 \) generated samples per test structure, and italic numbers mark the best result for \( n = 1 \). Results are grouped by configuration: (a) the small model trained only on the benchmark training sets, (b) the large model trained only on the benchmark training sets, and (c) the small model trained on the full dataset of 2.3 million structures, with the MPTS-52 validation and test sets excluded. Grouping the results this way separates the effect of model size from the effect of training data.
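
To make the \( n = 1 \) and \( n = 20 \) settings concrete, here is a minimal Python sketch of a match rate under a sampling budget. It assumes the convention common in crystal structure prediction benchmarks, namely that a test structure counts as matched if at least one of its first n generated samples matches the reference. The function name `match_rate` and the boolean outcome lists are hypothetical, and the outcomes are illustrative values, not results from the paper.

```python
def match_rate(outcomes, n):
    """Fraction of targets matched by at least one of their first n attempts.

    outcomes[i][j] is True when the j-th generated sample for target i
    matched the reference structure (illustrative data, not from the paper).
    """
    if not outcomes:
        raise ValueError("need at least one target")
    hits = sum(any(attempts[:n]) for attempts in outcomes)
    return hits / len(outcomes)


# Four hypothetical targets, three attempts each.
outcomes = [
    [True, False, False],   # matched on the first attempt
    [False, False, True],   # matched only on the third attempt
    [False, False, False],  # never matched
    [False, True, True],    # matched from the second attempt on
]

rate_at_1 = match_rate(outcomes, n=1)  # 1 of 4 targets
rate_at_3 = match_rate(outcomes, n=3)  # 3 of 4 targets
print(rate_at_1, rate_at_3)
```

Under this convention a larger budget can only add matches, so the rate at \( n = 20 \) is never lower than the rate at \( n = 1 \), which is why the two settings are reported separately.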

Methods

The authors describe how they generated and assessed novel crystalline materials with CrystaLLM, trained on a dataset of 2.3 million structures. They generated 1,000 crystal structures unconditionally; 900 were valid and 891 were structurally distinct. Of these, 102 were novel relative to the training dataset. The stability of the 102 novel compounds was evaluated with density functional theory (DFT), which gave a mean energy above the convex hull of 0.40 eV/atom. Twenty of them lay within 0.1 eV/atom of the hull, including three at 0.00 eV/atom, which suggests that they may be stable.
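
The counts in this paragraph are easier to compare as stage-to-stage fractions. The short Python sketch below does that arithmetic. It assumes that each count is a subset of the one before it (valid among generated, distinct among valid, and so on), which the summary implies but does not state outright; the helper name `stage_fractions` is hypothetical.

```python
def stage_fractions(counts):
    """Express each stage's count as a fraction of the stage before it."""
    names = list(counts)
    return {
        f"{cur} / {prev}": counts[cur] / counts[prev]
        for prev, cur in zip(names, names[1:])
    }


# Counts quoted in the summary above; the nesting order is an assumption.
counts = {
    "generated": 1000,
    "valid": 900,
    "unique": 891,
    "novel": 102,
    "within 0.1 eV/atom of hull": 20,
}

fractions = stage_fractions(counts)
for stage, value in fractions.items():
    print(f"{stage}: {value:.1%}")
```

Read this way, 90.0% of the generated structures were valid, 99.0% of the valid ones were distinct, about 11.4% of the distinct ones were novel, and about 19.6% of the novel ones lay within 0.1 eV/atom of the hull.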

The authors also discuss the uniqueness and novelty of the generated materials. Structures were sampled with top-k sampling and a temperature parameter to balance novelty against structural plausibility. Uniqueness was determined with the pymatgen StructureMatcher class, using specified tolerances on lattice dimensions and atomic positions. A compound counted as unique if its structure did not match that of any other generated compound, whereas novelty was defined relative to the training dataset. The findings suggest that the model is particularly good at generating stable ionic and semi-ionic compounds, likely because these materials follow better-defined coordination rules than metallic compounds do.
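
The distinction between uniqueness (compared against the other generated structures) and novelty (compared against the training set) can be sketched with a generic pairwise matcher. The Python below is an illustration under stated assumptions, not the authors' pipeline: `deduplicate`, `novel_only`, and the toy `same_structure` predicate are hypothetical names, and the toy "structures" are just (prototype label, lattice parameter) pairs. With pymatgen installed, a predicate such as `StructureMatcher(...).fit` could plausibly be passed as `same` instead, with tolerances chosen by the user.

```python
def deduplicate(items, same):
    """Keep the first representative of each group of matching items."""
    representatives = []
    for item in items:
        if not any(same(item, kept) for kept in representatives):
            representatives.append(item)
    return representatives


def novel_only(candidates, training_set, same):
    """Keep candidates that match nothing in the training set."""
    return [c for c in candidates if not any(same(c, t) for t in training_set)]


def same_structure(a, b, tolerance=0.05):
    """Toy matcher: same prototype label and lattice parameter within tolerance."""
    return a[0] == b[0] and abs(a[1] - b[1]) <= tolerance


# Illustrative data only: (prototype label, lattice parameter in angstrom).
generated = [("rocksalt", 5.64), ("rocksalt", 5.65), ("perovskite", 3.90), ("fluorite", 5.46)]
training = [("rocksalt", 5.63)]

unique = deduplicate(generated, same_structure)       # uniqueness: among generated
novel = novel_only(unique, training, same_structure)  # novelty: versus training set
print(unique)
print(novel)
```

The pairwise loop is quadratic in the number of structures, which is acceptable for a sample of about a thousand; a larger comparison against millions of training structures would normally be narrowed first, for example by composition.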

Results

The results section describes the development and evaluation of CrystaLLM, a Transformer-based, decoder-only language model designed to generate files in the Crystallographic Information File (CIF) format. The model was trained autoregressively on a corpus of approximately 2.2 million CIF files, using their standardized, tokenized text rather than a dedicated structural representation. During training, the model learns to predict the next token from the preceding tokens. At generation time, it produces a new CIF file from a prompt by sampling one token at a time until a termination condition is met.
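
The generation loop described here, combined with the top-k and temperature sampling mentioned in the Methods summary, can be sketched in a few lines of Python. This is a minimal illustration, not the CrystaLLM implementation: `sample_top_k`, `generate`, and `toy_logits` are hypothetical names, tokens are plain integers rather than CIF tokens, `end_token` stands in for whatever termination condition the real model uses, and the toy scoring function replaces the trained Transformer.

```python
import math
import random


def sample_top_k(logits, k, temperature, rng):
    """Draw one token id from the k highest logits after temperature scaling."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]  # unnormalized softmax
    return rng.choices(top, weights=weights, k=1)[0]


def generate(next_logits, prompt, end_token, max_new_tokens, k=10, temperature=1.0, seed=0):
    """Extend the prompt one sampled token at a time until end_token or the cap."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = sample_top_k(next_logits(tokens), k, temperature, rng)
        tokens.append(token)
        if token == end_token:
            break
    return tokens


def toy_logits(tokens, vocab_size=5):
    """Stand-in for the trained model: strongly prefers (last token + 1)."""
    logits = [0.0] * vocab_size
    logits[(tokens[-1] + 1) % vocab_size] = 5.0
    return logits


# With k=1 the sampler is greedy, so this toy run is deterministic.
sequence = generate(toy_logits, prompt=[0], end_token=4, max_new_tokens=10, k=1)
print(sequence)
```

Lower temperatures and smaller k concentrate probability on the most likely tokens, which favours plausible output; higher values spread it out, which favours variety.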

To evaluate CrystaLLM, a test set of about 10,000 CIF files was held out from the training data, and the model was asked to generate structures from prompts derived from this test set. In addition, a challenge set of 70 structures was assembled: 58 from the recent literature and 12 from the training set, covering a range of structural classes. The challenge set tests the model’s ability to generate both seen and unseen structures, while the test set supports a broader evaluation. The section also defines terminology for chemical formulas, such as “reduced formula” and “cell composition,” which is needed to interpret the model’s outputs and the structural characteristics of the generated compounds.
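
The relationship between the two formula terms can be shown with a small Python sketch. It assumes the usual convention that the cell composition is the reduced formula multiplied by Z, the number of formula units in the unit cell, and it takes Z as the greatest common divisor of the atom counts. The helper names `reduce_composition` and `to_formula` are hypothetical, and a CIF file may declare a different formula unit (for example a molecular formula), in which case its Z would differ from the one computed here.

```python
from functools import reduce
from math import gcd


def reduce_composition(cell_composition):
    """Split a cell composition into (reduced formula counts, Z)."""
    z = reduce(gcd, cell_composition.values())
    reduced = {element: count // z for element, count in cell_composition.items()}
    return reduced, z


def to_formula(counts):
    """Render element counts as a formula string, omitting counts of 1."""
    return "".join(el if n == 1 else f"{el}{n}" for el, n in counts.items())


# Rock-salt NaCl has four formula units per conventional cell.
reduced, z = reduce_composition({"Na": 4, "Cl": 4})
print(to_formula(reduced), z)

# Rutile TiO2 has two formula units per cell.
reduced_rutile, z_rutile = reduce_composition({"Ti": 2, "O": 4})
print(to_formula(reduced_rutile), z_rutile)
```

The same reduced formula can therefore correspond to several cell compositions, one per value of Z, which is why the two terms need to be kept apart.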

Discussion

The authors discuss the training and performance of CrystaLLM, which generates crystal structures from input cell compositions and space groups. The model was trained in two configurations: a small model with 25 million parameters and a large model with 200 million parameters. Training consisted of next-token prediction, and performance improved over 90,000 iterations, though with diminishing returns. The model learned meaningful representations of the symbols in its vocabulary, including atoms and space groups, as shown by sensible clustering in dimensionally reduced plots of the embedding vectors. Notably, the model generated valid CIF files for unseen structures, reaching an 88.1% match rate with test-set structures when allowed multiple attempts.

The authors also highlight that CrystaLLM outperforms other machine learning approaches to crystal structure generation, particularly in conditional settings. The model can switch between conditional and unconditional generation without architectural changes. It also supports space-group-constrained generation natively, a significant advantage over competing models. The results indicate that CrystaLLM can generate valid structures for both seen and unseen compounds, with a success rate of up to 40% on unseen structures. The authors conclude that these capabilities make CrystaLLM a versatile tool for materials discovery, able to generate plausible crystal structures across a variety of chemical compositions and structural motifs.
