Unlocking the potential of past research: using generative AI to reconstruct healthcare simulation models

Journal: Journal of the Operational Research Society
DOI: https://doi.org/10.1080/01605682.2025.2554751
Publication Date: 2025-09-10
Author(s): Thomas Monks et al.
Primary Topic: Healthcare Operations and Scheduling Optimization

Overview

This study investigates the potential of generative artificial intelligence (AI) to recreate discrete-event simulation (DES) models in healthcare using Free and Open Source Software (FOSS). The authors successfully generated and tested two DES models based on descriptions from academic literature, achieving replication for one model while encountering challenges with the other due to incomplete information on distributions. The complexity of the models exceeded that of previously published AI-generated DES models, highlighting the necessity of an iterative development approach, systematic testing, and the expertise of the research team.

The findings indicate that it is feasible to generate DES models in FOSS through engineered prompts derived from narrative descriptions, as demonstrated by the recreation of two healthcare models in Python, which passed verification tests and included user interfaces. However, the study also identified significant challenges related to prompt engineering, code generation, and model testing. The authors emphasize the importance of iterative refinement and systematic approaches in DES modeling, suggesting that while generative AI holds promise for healthcare modeling, further research is essential to explore its broader applicability and scalability.
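
To make concrete what a coded DES model and its verification checks look like, the following is a minimal sketch of a single-ward bed-occupancy simulation. It is not one of the two models recreated in the study, and it does not use SimPy: it relies only on the Python standard library (an event list kept in a heap) so that it stays self-contained. The function name `simulate_ward`, the exponential arrival and length-of-stay distributions, and all parameter values are assumptions chosen for illustration.

```python
import heapq
import random
from collections import deque


def simulate_ward(n_beds, mean_iat, mean_los, run_length, seed=42):
    """Simulate a ward with a fixed number of beds and a first-come queue.

    Hypothetical illustration: inter-arrival times and lengths of stay
    are both drawn from exponential distributions.
    """
    rng = random.Random(seed)
    events = []  # heap of (time, sequence number, kind)
    seq = 0      # unique tie-breaker so the heap never compares kinds
    heapq.heappush(events, (rng.expovariate(1.0 / mean_iat), seq, "arrival"))

    free_beds = n_beds
    queue = deque()  # arrival times of patients waiting for a bed
    waits = []       # waiting time of every admitted patient
    arrivals = 0
    discharges = 0

    while events:
        now, _, kind = heapq.heappop(events)
        if now > run_length:
            break  # stop at the end of the run
        if kind == "arrival":
            arrivals += 1
            queue.append(now)
            seq += 1
            next_arrival = now + rng.expovariate(1.0 / mean_iat)
            heapq.heappush(events, (next_arrival, seq, "arrival"))
        else:  # a discharge frees one bed
            discharges += 1
            free_beds += 1
        # Admit waiting patients while beds are free.
        while free_beds > 0 and queue:
            waits.append(now - queue.popleft())
            free_beds -= 1
            seq += 1
            discharge_time = now + rng.expovariate(1.0 / mean_los)
            heapq.heappush(events, (discharge_time, seq, "discharge"))

    return {
        "arrivals": arrivals,
        "admitted": len(waits),
        "discharges": discharges,
        "queue_length": len(queue),
        "beds_in_use": n_beds - free_beds,
        "mean_wait": sum(waits) / len(waits) if waits else 0.0,
    }


results = simulate_ward(n_beds=5, mean_iat=1.0, mean_los=4.0, run_length=200.0)
```

Typical verification checks for a model like this include confirming that two runs with the same seed give identical results, that every admitted patient is either discharged or still in a bed, and that an extreme scenario (for example, far more beds than patients) produces zero waiting time.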

Introduction

The introduction of this research paper discusses the significant role of Operational Research (OR) in enhancing health services through modeling, particularly focusing on Discrete-Event Simulation (DES) as the predominant simulation method. It highlights the extensive application of DES in various health domains, including chronic diseases and emergency care, with approximately 100 DES-related articles published annually. While conceptual models from these studies are often shared, the coded models—essential for practical application—are rarely made available, with only about 8% accessible to the broader community. This lack of availability hinders the long-term impact of DES research, as health services require timely access to models for decision-making.

The paper advocates for an Open Science approach to facilitate the reuse of DES models, emphasizing the high costs associated with developing coded models, which involve collaboration among various stakeholders. It introduces the “Natural Language Processing (NLP) Shortcut” framework, which leverages generative AI to potentially recreate coded models from conceptual descriptions found in academic publications. The authors aim to explore this framework’s effectiveness in generating complex DES models suitable for health services, thereby enhancing the accessibility and utility of simulation research in real-world applications.

Methods

The study followed a structured four-stage methodology: setup and model design (Stage 0), prompt engineering and code generation (Stage 1), internal replication (Stage 2), and evaluation and preservation (Stage 3). Each stage comprised specific activities, as illustrated in Figure 1.

For model generation, the authors used Perplexity.AI’s standard model, available in the free tier, which incorporates Retrieval-Augmented Generation (RAG). RAG allows the model to access and integrate current online sources related to simulation and SimPy, which is intended to improve the relevance and accuracy of the generated outputs.
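
As a rough illustration of the retrieval-augmented pattern described above, the sketch below ranks a small set of reference snippets by word overlap with a question and prepends the best matches to the prompt. It is a toy: it does not reflect how Perplexity.AI implements retrieval, and the function names, the snippets, and the overlap scoring are all assumptions made for illustration.

```python
import re


def tokenize(text):
    """Lower-case a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, k=2):
    """Return the k documents that share the most words with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, documents, k=2):
    """Prepend the retrieved snippets to the question as context."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents, k))
    return (
        "Use the context below when answering.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hypothetical knowledge base of short reference snippets.
SNIPPETS = [
    "SimPy models a queue for a limited resource with simpy.Resource.",
    "Streamlit apps are started with the streamlit run command.",
    "A SimPy process is a Python generator that yields events.",
]

prompt = build_prompt("How do I model a resource queue in SimPy?", SNIPPETS)
print(prompt)
```

In a production system the word-overlap ranking would usually be replaced by a search engine or an embedding-based similarity measure; the prompt-assembly step stays conceptually the same.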

Results

In the Results section, the study presents the outcomes of the analysis in visual and tabular form. Specifically, a Streamlit table displays the results for two datasets, `df_acute` and `df_rehab`. This supports interactive examination of the data and a clearer understanding of the findings.

Additionally, the section emphasizes the importance of visual data representation by showcasing all relevant plots generated during the analysis. The plotting functions employed return a tuple consisting of a figure and an axis, which are essential for effective data visualization. All necessary classes and functions for these operations are imported from a dedicated module named `stroke_rehab_model`, ensuring a structured and organized implementation of the analysis.
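
The figure-and-axis convention mentioned above can be sketched as follows. This is not code from `stroke_rehab_model`: the function name `plot_occupancy`, its parameters, and the sample values are hypothetical, and the example assumes Matplotlib is installed.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt


def plot_occupancy(bed_counts, relative_freq, title="Ward occupancy"):
    """Draw a bar chart of occupancy and return the (figure, axis) tuple."""
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.bar(bed_counts, relative_freq)
    ax.set_xlabel("Beds occupied")
    ax.set_ylabel("Relative frequency")
    ax.set_title(title)
    return fig, ax


# Made-up values purely for illustration.
fig, ax = plot_occupancy([8, 9, 10, 11, 12], [0.10, 0.20, 0.40, 0.20, 0.10])
ax.set_ylim(0, 0.5)  # the caller can still adjust the axis afterwards
```

Returning both objects lets the caller customise the axis after the fact and decide how to render the figure; in a Streamlit app, for example, the figure could be passed to `st.pyplot(fig)`.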

Discussion

The discussion section of the research paper highlights the advancements and challenges associated with Generative AI, particularly focusing on Large Language Models (LLMs) and their applications in various domains, including software engineering and computer simulation. Key concepts such as zero-shot learning, model scaling, and prompt engineering are emphasized, illustrating how LLMs can perform tasks without explicit training on specific categories. The evolution of models from GPT-1 to GPT-4, with increasing parameters, showcases the significant improvements in capabilities, particularly in generating human-like text and code.

The section also addresses hallucination and data contamination, both of which can lead to inaccuracies in model outputs. Techniques such as Retrieval-Augmented Generation (RAG) are proposed to mitigate these problems by giving LLMs access to up-to-date information from external knowledge bases. The alignment problem is also discussed, emphasizing the importance of ensuring that AI outputs align with human values through methods such as Reinforcement Learning from Human Feedback (RLHF). The paper concludes by summarizing lessons learned from existing literature, including the necessity for user expertise, the importance of thorough model validation, and the need for careful selection of test data to avoid data leakage, all of which are crucial for the effective application of generative AI in practical scenarios.

Limitations

The section on limitations highlights significant challenges in evaluating the zero-shot capabilities of large language models (LLMs), particularly concerning data contamination and hallucination. Data contamination, akin to leakage in traditional supervised learning, complicates the assessment of model accuracy, as it is often unclear whether training data overlaps with test data. This overlap can lead to inflated accuracy metrics, as models may simply reproduce memorized content rather than demonstrating genuine predictive capabilities. Additionally, hallucination poses a critical issue, where LLMs generate outputs that may be factually incorrect or logically flawed, potentially leading to serious consequences, such as erroneous decisions based on flawed simulations or fabricated references in academic contexts. The complexity of hallucination arises from various factors, including the presence of bugs in the training data, necessitating ongoing research into mitigation strategies like iterative information retrieval and uncertainty estimation.
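
One simple way to probe for the overlap described above is to measure how many word n-grams of a test item also appear in a reference corpus. The sketch below is a hypothetical illustration, not a method used in the study: the function names, the choice of 5-grams, and the example strings are assumptions, and real contamination checks are harder because the training data of commercial LLMs is usually not available for inspection.

```python
def ngrams(text, n=5):
    """Return the set of word n-grams in a lower-cased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(test_text, corpus_text, n=5):
    """Fraction of the test text's n-grams that also occur in the corpus."""
    test_grams = ngrams(test_text, n)
    if not test_grams:
        return 0.0  # text shorter than n words: nothing to compare
    return len(test_grams & ngrams(corpus_text, n)) / len(test_grams)


# Made-up strings standing in for a training corpus and two test items.
corpus = ("patients arrive at the stroke unit and wait for a free bed "
          "before admission")
seen_item = "patients arrive at the stroke unit and wait for a free bed"
unseen_item = "model the clinic as a single queue with three identical servers"

print(overlap_fraction(seen_item, corpus))
print(overlap_fraction(unseen_item, corpus))
```

A high fraction suggests the test item may have been memorized rather than solved, which is the situation in which accuracy metrics become inflated.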

The study itself also has limitations. It focused exclusively on coding models in Python and SimPy and investigated only two simulation models. This narrow scope allowed a deeper exploration of simulation coding tasks, but future research should extend to other programming languages and packages, such as R with the simmer package. The study’s reliance on a single AI tool’s free tier limits the generalizability of the findings, although informal testing indicates that the results may be applicable to paid tiers with larger models. Furthermore, the increasing prevalence of AI-generated content raises concerns about “model collapse,” where models trained predominantly on synthetic data may experience performance degradation due to recursive learning from erroneous outputs. Future work in the modeling and simulation community will need to prioritize the curation of high-quality training datasets to ensure the integrity and performance of LLMs.