استخراج وإعادة بناء المعرفة في أدبيات علوم المواد باستخدام نماذج اللغة الكبيرة Extracting and reconstructing knowledge in materials science literature using large language models

المجلة: Communications Materials، المجلد: 7، العدد: 1
DOI: https://doi.org/10.1038/s43246-025-01043-3
تاريخ النشر: 2026-01-08
المؤلف: Shuyuan Li وآخرون
الموضوع الرئيسي: تعلم الآلة في علوم المواد

نظرة عامة

تقدم البحث طريقة عامة لإعادة بناء المعرفة من الأدبيات العلمية غير العضوية، مع التركيز على الطرق الاصطناعية والخصائص. استخدم المؤلفون تصميمًا يعتمد على طلب واحد لإنشاء مجموعة بيانات شاملة باستخدام نموذج GPT-4، والذي تم استخدامه بعد ذلك لضبط أربعة نماذج لغوية كبيرة (LLMs): LLaMA3-8Binstruct، Gemma-7B، Phi3-mini-128k-instruct، وGPT3.5-turbo-1106. أظهرت هذه النماذج المضبوطة أداءً قويًا في استخراج الطرق الاصطناعية للتقليل الانتقائي المحفز (SCR)، محققة دقة قدرها 0.928، واسترجاع قدره 0.957، ودرجة F1 قدرها 0.962. نجحت النماذج في استخراج 48,925 كيانًا من 2,205 مقالات، مع تخزين البيانات في تنسيق JSON الذي يسهل البحث الإضافي.

بالإضافة إلى ذلك، أظهرت LLMs المضبوطة قابلية نقل عالية عبر خمسة مجالات أخرى، بما في ذلك بطاريات أيونات الليثيوم وتفاعلات محفزة متنوعة، مما يدل على تطبيقها الواسع في مهام علوم المواد. اختتمت الدراسة ببناء رسم بياني شامل للمعرفة يدمج كيانات متعددة – مثل بيانات المقالات، المحفزات، طرق التحضير، ومقاييس الأداء – مترابطة من خلال حواف علاقات تمثل الاعتماديات الفيزيائية الكيميائية والمعايير الإجرائية. يعزز هذا الإطار المنظم استكشاف العلاقات المعقدة داخل الأدبيات العلمية للمواد، مما يعزز التعرف السريع على المحفزات عالية الأداء ويمكّن من تحسين التركيب المدفوع بالبيانات.

مقدمة

تسلط مقدمة ورقة البحث الضوء على الدور الحاسم لتصنيع المواد الجديدة في تقدم العلوم والتكنولوجيا، مع معالجة التحديات الكامنة المرتبطة بهذه العملية. تتطلب تعقيدات علوم المواد فهمًا عميقًا للفيزياء والكيمياء، إلى جانب مهارات تجريبية وحسابية متقدمة. تؤثر عوامل مثل درجة الحرارة، الضغط، والبيئة الكيميائية بشكل كبير على خصائص المواد أثناء التصنيع، حيث تؤدي حتى التغيرات الطفيفة في طرق التحضير إلى اختلافات كبيرة في الأداء. تقتصر الطرق التقليدية لاكتشاف المواد، بما في ذلك التجارب التقليدية والنظرية، بشكل متزايد بسبب التكاليف، والكفاءة، وقيود الوقت.

للتغلب على هذه التحديات، ظهرت تكامل تقنيات الذكاء الاصطناعي (AI) المدفوعة بالبيانات وتقنيات التعلم الآلي (ML) كنهج تحويلي في تصميم المواد، وغالبًا ما يُشار إليها باسم “الثورة الصناعية الرابعة”. تم تطبيق منهجيات ML متنوعة، مثل نماذج الانحدار، الشبكات العصبية، والشبكات العصبية البيانية، بنجاح عبر مجالات متنوعة، بما في ذلك بطاريات الليثيوم وإطارات المعادن العضوية. تعتمد فعالية هذه النماذج على توفر بيانات علمية واسعة وموثوقة، مستمدة أساسًا من قواعد البيانات والأدبيات العلمية الموجودة. ومع ذلك، فإن استخراج المعلومات ذات الصلة يدويًا من الأدبيات يتطلب جهدًا كبيرًا، مما يعيق الحصول الفعال على البيانات. لقد سهلت التطورات الأخيرة في تقنيات معالجة اللغة الطبيعية (NLP) استخراج البيانات القيمة من النصوص العلمية، مما يسرع من تطوير الأنطولوجيا في علوم المواد ويعزز مبادئ إدارة البيانات مثل FAIR (قابلية الاكتشاف، الوصول، التشغيل البيني، وإعادة الاستخدام).

طرق

توضح قسم “الطرق” الأساليب التجريبية والتحليلية المستخدمة في الدراسة. استخدم الباحثون مجموعة من التقنيات الكمية والنوعية لجمع البيانات، مما يضمن فهمًا شاملاً للظواهر قيد التحقيق. شملت المنهجيات المحددة تجارب محكومة، وتحليلات إحصائية، وتقنيات نمذجة، تم تصميمها لاختبار الفرضيات التي تم وضعها في البداية.

شملت جمع البيانات أخذ عينات منهجية وبروتوكولات صارمة لضمان الموثوقية والصلاحية. تم إجراء التحليل باستخدام برامج إحصائية متقدمة، مما سمح بتطبيق اختبارات متنوعة لتحديد أهمية النتائج. يبرز القسم أهمية الصرامة المنهجية في استخلاص الاستنتاجات ويبرز الخطوات المتخذة للتخفيف من التحيزات والأخطاء المحتملة طوال عملية البحث.

مناقشة

في هذه الدراسة، تم بناء مجموعة بيانات شاملة لتعزيز استخراج طرق تصنيع المواد من الأدبيات العلمية، مع التركيز بشكل خاص على علوم المواد غير العضوية. تتكون مجموعة البيانات من 156 فقرة تصنيع وطرق تصنيع متCorresponding، تفصل المواد الخام، المنتجات، الطرق، وخطوات التصنيع، بإجمالي 63,337 كيانًا. تم ضبط أربعة نماذج لغوية كبيرة (LLMs) – Llama، Gemma، Phi، وGPT – باستخدام مجموعة بيانات تدريب منظمة، والتي أظهرت تعلمًا فعالًا كما يتضح من الانخفاض المستمر في خسارة التدريب. أشارت مقاييس التقييم إلى أن نموذج GPT تفوق في استخراج خطوات وظروف التصنيع، بينما كان نموذج Llama الأفضل في استخراج المنتجات، مما يظهر موثوقية النماذج في مهام التعرف على الكيانات المسماة.

استكشفت البحث أيضًا قابلية نقل النماذج عبر مجالات مختلفة، كاشفة أن Llama وGemma وGPT حافظت على أداء عالٍ في مهام علوم المواد المتنوعة، بينما أظهر نموذج Phi نتائج أضعف قليلاً. تم إنشاء رسم بياني للمعرفة لتنظيم وتصوير العلاقات بين المحفزات، وطرق التصنيع، والظروف التجريبية، مما يسهل البحث المدفوع بالبيانات وتوليد الفرضيات. لا يساعد هذا التمثيل المنظم فقط في فهم عمليات التصنيع المعقدة، بل يعزز أيضًا كفاءة اكتشاف المواد وتحسينها. بشكل عام، يوفر الإطار الذي تم تطويره في هذه الدراسة أساسًا قويًا للبحوث المستقبلية في علوم المواد، مع تطبيقات محتملة في دمج البيانات متعددة الأنماط وإثراء رسم المعرفة بمعلومات إضافية.

Journal: Communications Materials, Volume: 7, Issue: 1
DOI: https://doi.org/10.1038/s43246-025-01043-3
Publication Date: 2026-01-08
Author(s): Shuyuan Li et al.
Primary Topic: Machine Learning in Materials Science

Overview

The research presents a generalized method for knowledge reconstruction from inorganic science literature, focusing on synthetic routes and properties. The authors utilized a one-prompt design to create a comprehensive dataset with the GPT-4 model, which was then employed to fine-tune four large language models (LLMs): LLaMA3-8Binstruct, Gemma-7B, Phi3-mini-128k-instruct, and GPT3.5-turbo-1106. These fine-tuned models exhibited strong performance in extracting synthetic routes for selective catalytic reduction (SCR), achieving a precision of 0.928, recall of 0.957, and an F1 score of 0.962. The models successfully extracted 48,925 entities from 2,205 articles, with the data stored in a JSON format that facilitates further research.

Additionally, the fine-tuned LLMs demonstrated high transferability across five other domains, including lithium-ion batteries and various catalytic reactions, indicating their broad applicability in materials science tasks. The study culminated in the construction of a comprehensive knowledge graph that integrates multiple entities—such as article metadata, catalysts, preparation methods, and performance metrics—interconnected through relational edges that represent physicochemical dependencies and procedural parameters. This structured framework enhances the exploration of complex relationships within materials science literature, promoting the rapid identification of high-performance catalysts and enabling data-driven synthesis optimization.

Introduction

The introduction of the research paper highlights the critical role of new material synthesis in advancing scientific and technological progress, while also addressing the inherent challenges associated with this process. The complexity of materials science necessitates a deep understanding of physics and chemistry, alongside advanced experimental and computational skills. Factors such as temperature, pressure, and chemical environment significantly influence material properties during synthesis, with even minor variations in preparation methods leading to substantial differences in performance. Traditional methods of material discovery, including experimental trial-and-error and theoretical approaches, are increasingly limited by cost, efficiency, and time constraints.

To overcome these challenges, the integration of data-driven artificial intelligence (AI) and machine learning (ML) techniques has emerged as a transformative approach in materials design, often referred to as the “fourth industrial revolution.” Various ML methodologies, such as regression models, neural networks, and graph neural networks, have been successfully applied across diverse areas, including lithium batteries and metal-organic frameworks. The effectiveness of these ML models relies heavily on the availability of extensive, reliable scientific data, primarily sourced from existing databases and scientific literature. However, the manual extraction of relevant information from literature is labor-intensive, which hampers efficient data acquisition. Recent advancements in natural language processing (NLP) techniques have facilitated the extraction of valuable data from scientific texts, thereby accelerating ontology development in materials science and enhancing data management principles such as FAIR (Findability, Accessibility, Interoperability, Reusability).

Methods

The “Methods” section outlines the experimental and analytical approaches employed in the study. The researchers utilized a combination of quantitative and qualitative techniques to gather data, ensuring a comprehensive understanding of the phenomena under investigation. Specific methodologies included controlled experiments, statistical analyses, and modeling techniques, which were designed to test the hypotheses formulated at the outset.

Data collection involved systematic sampling and rigorous protocols to ensure reliability and validity. The analysis was conducted using advanced statistical software, allowing for the application of various tests to ascertain the significance of the findings. The section emphasizes the importance of methodological rigor in drawing conclusions and highlights the steps taken to mitigate potential biases and errors throughout the research process.

Discussion

In this study, a comprehensive dataset was constructed to enhance the extraction of material synthesis routes from scientific literature, specifically targeting inorganic materials science. The dataset comprises 156 synthesis paragraphs and corresponding synthesis routes, detailing raw materials, products, methods, and synthesis steps, totaling 63,337 entities. The fine-tuning of four large language models (LLMs)—Llama, Gemma, Phi, and GPT—was conducted using a structured training dataset, which demonstrated effective learning as evidenced by a steady decline in training loss. Evaluation metrics indicated that the GPT model excelled in extracting synthesis steps and conditions, while the Llama model performed best in product extraction, showcasing the models’ reliability in named entity recognition tasks.

The research further explored the models’ transferability across different domains, revealing that Llama, Gemma, and GPT maintained high performance in various material science tasks, while the Phi model exhibited slightly weaker results. A knowledge graph was generated to organize and visualize the relationships among catalysts, synthesis methods, and experimental conditions, facilitating data-driven research and hypothesis generation. This structured representation not only aids in understanding complex synthesis processes but also enhances the efficiency of material discovery and optimization. Overall, the framework developed in this study provides a robust foundation for future research in materials science, with potential applications in integrating multi-modal data and enriching the knowledge graph with additional information.