Knowledge integration for physics-informed symbolic regression using pre-trained large language models

Journal: Scientific Reports, Volume: 16, Issue: 1
DOI: https://doi.org/10.1038/s41598-026-35327-6
PMID: https://pubmed.ncbi.nlm.nih.gov/41526439
Publication Date: 2026-01-13
Author(s): Bilge Taskin et al.
Primary Topic: Machine Learning in Materials Science

Overview

The authors review advances in symbolic regression (SR) as a method for automated scientific discovery, particularly for deriving governing equations from experimental data. They highlight physics-informed symbolic regression (PiSR), which integrates domain knowledge to improve the generality and applicability of the discovered equations. However, existing methods often require specialized formulations and manual feature engineering, which restricts their use to domain experts.

To address these limitations, the study proposes using pre-trained large language models (LLMs) to automate the integration of domain knowledge into PiSR. By incorporating the LLM's contextual understanding into the SR loss function, the authors aim to streamline the process and make it accessible to a wider range of scientific inquiries. The methodology is evaluated using three SR algorithms (DEAP, gplearn, and PySR) and three LLMs (Falcon, Mistral, and Llama 2) across three physical dynamics: dropping balls, simple harmonic motion, and electromagnetic waves. The findings indicate that LLM integration significantly improves the reconstruction of physical dynamics from data and makes SR models more robust to noise and complexity. The study also finds that effective prompt engineering can further boost performance, which underscores the importance of informative prompts.

Results

Tables 2 through 4 summarize the performance of the LLM-integrated symbolic regression (SR) framework. Mistral consistently outperforms the other models, Llama 2 and Falcon, across metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), $1 - R^2$, and expression tree distance. PySR outperforms the other SR models, achieving perfect reconstruction of ground truth (GT) equations (expression tree distance of 1.00) when the GT equation or comprehensive variable descriptions are provided. Even when the predicted equations are identical, however, the fitness metrics fluctuate because the experimental samples vary.
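The three fit metrics named above can be sketched in a few lines of plain Python. This is only an illustration using the standard textbook definitions; the function names are hypothetical and the paper's own evaluation code may differ. Expression tree distance is left out because its exact definition is not given in this summary.

```python
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean Absolute Error: average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean Squared Error: average of (true - predicted)^2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def one_minus_r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """1 - R^2 = SS_res / SS_tot, so 0.0 means a perfect fit.

    Assumes y_true is not constant (otherwise SS_tot is zero).
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return ss_res / ss_tot


# Toy data (made up for illustration): only the last prediction is off by 1.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
print(mae(y_true, y_pred))           # 0.25
print(mse(y_true, y_pred))           # 0.25
print(one_minus_r2(y_true, y_pred))  # 0.2
```

All three are error-style quantities, so lower values indicate a better fit.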

The impact of noise on model performance is also examined. For the DEAP SR algorithm, expression tree distance increases from 0.07 at 1% noise to 0.18 at 5% noise, with even steeper increases observed under combined noise conditions. In a sensitivity analysis involving biased noise, DEAP and gplearn exhibit significant degradation, with expression tree distances rising from 0.16 to 0.32 for DEAP as noise levels increase from 1% to 5%. In contrast, PySR shows greater robustness against biased noise, maintaining lower expression tree distances across all noise types and levels, although it still experiences degradation under higher combined noise conditions.
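To make the noise levels concrete, the sketch below perturbs a clean signal with zero-mean or biased Gaussian noise. It rests on assumptions that this summary does not confirm: a level of "1%" is read as a standard deviation equal to 1% of the signal's own standard deviation, and "biased" is read as a non-zero noise mean on the same scale. The function name `add_noise` and its parameters are hypothetical, and how the paper combines noise types is not specified here.

```python
import random
import statistics
from typing import List, Optional, Sequence


def add_noise(
    values: Sequence[float],
    level: float,
    bias: float = 0.0,
    seed: Optional[int] = None,
) -> List[float]:
    """Return a noisy copy of `values`.

    Assumptions (not taken from the paper):
    - `level` is a fraction of the signal's standard deviation (0.01 = "1%").
    - `bias` shifts the noise mean by that fraction of the same scale;
      bias=0.0 gives ordinary zero-mean noise.
    """
    rng = random.Random(seed)          # local generator keeps runs reproducible
    scale = statistics.pstdev(values)  # population std of the clean signal
    return [v + rng.gauss(bias * scale, level * scale) for v in values]


# Illustrative clean signal: free-fall height h(t) = h0 - 0.5 * g * t^2.
g, h0 = 9.81, 100.0
times = [0.1 * i for i in range(40)]
clean = [h0 - 0.5 * g * t ** 2 for t in times]

noisy_1pct = add_noise(clean, level=0.01, seed=0)
noisy_5pct = add_noise(clean, level=0.05, seed=0)
biased_5pct = add_noise(clean, level=0.05, bias=0.05, seed=0)
```

Feeding such perturbed samples to an SR algorithm and comparing the recovered expression with the clean one is one way to reproduce the kind of sensitivity analysis described above.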

Discussion

The discussion section gives an overview of how symbolic regression (SR) and large language models (LLMs) can be combined for automated scientific discovery. It begins by categorizing SR methods into four main groups (brute force, sparse regression, deep learning, and genetic algorithms) and highlights their respective strengths and limitations. Physics-informed symbolic regression (PiSR) is then introduced as a method that incorporates physical principles to guide the search for symbolic expressions, thereby addressing the challenges posed by noisy and multivariate data in physical problems.

The section further explores the potential of LLMs to improve symbolic regression by integrating domain knowledge into the search process. It posits that LLMs can evaluate candidate equations based on criteria such as dimensional consistency and scientific validity, thus guiding the optimization of the SR process. The authors propose a novel loss function that combines traditional error metrics with LLM-derived scores, allowing for a more informed search for analytical equations that align with predefined physical constraints. The experimental design is outlined, focusing on three physical scenarios and employing various SR models and LLMs to assess the robustness and effectiveness of the proposed integration. Overall, the findings suggest that LLMs can significantly enhance the interpretability and physical plausibility of models generated through symbolic regression.
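The idea of combining a data-fit error with an LLM-derived score can be sketched as a weighted sum. This is a minimal illustration under stated assumptions, not the authors' loss: the weighted-sum form, the `weight` parameter, the choice of MSE as the error term, and the score range of [0, 1] are all assumptions, and `stub_score` is a hypothetical placeholder standing in for a real LLM call.

```python
from typing import Callable, Sequence


def combined_loss(
    y_true: Sequence[float],
    y_pred: Sequence[float],
    equation: str,
    llm_score: Callable[[str], float],
    weight: float = 0.5,
) -> float:
    """Blend data error with an LLM-derived plausibility penalty.

    Assumptions (not from the paper): `llm_score` returns a value in
    [0, 1], where 1 means physically plausible, and `weight` in [0, 1]
    sets how much the LLM term counts relative to the MSE.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    error = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    score = min(1.0, max(0.0, llm_score(equation)))  # clamp to [0, 1]
    return (1.0 - weight) * error + weight * (1.0 - score)


def stub_score(equation: str) -> float:
    """Hypothetical stand-in for an LLM judging dimensional consistency."""
    known = {
        "h0 - 0.5*g*t**2": 1.0,  # metres minus metres: consistent
        "h0 - 0.5*g*t": 0.0,     # metres minus metres per second: inconsistent
    }
    return known.get(equation, 0.5)


# Two candidates with the same (pretend) data error differ only in plausibility.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 3.5]
good = combined_loss(y_true, y_pred, "h0 - 0.5*g*t**2", stub_score)
bad = combined_loss(y_true, y_pred, "h0 - 0.5*g*t", stub_score)
print(good < bad)  # True
```

With a loss of this shape, a search algorithm that minimizes it prefers the dimensionally consistent candidate whenever two candidates fit the data equally well. In practice the error term and the LLM score would need to be put on comparable scales before weighting.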