Ontology- and LLM-based data harmonization for federated learning in healthcare

Journal: Frontiers in Digital Health, Volume: 8
DOI: https://doi.org/10.3389/fdgth.2026.1756555
PMID: https://pubmed.ncbi.nlm.nih.gov/41929608
Publication Date: 2026-03-18
Author(s): Natallia Kokash et al.
Primary Topic: Machine Learning in Healthcare

Overview

The paper presents an approach to semantic heterogeneity in electronic health records (EHRs): a two-step pipeline that combines ontology-based data harmonization with large language model (LLM) validation. First, candidate concepts are retrieved from a target ontology using embedding-based similarity search or ontology cross-references. Then an LLM acts as a semantic validator, evaluating these candidates against stated equivalence and subsumption criteria. The approach is demonstrated with mappings to the MONDO and HPO ontologies, achieving expert-LLM agreement rates of up to 92% across two clinical datasets.
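The first step, embedding-based candidate retrieval, can be sketched as a nearest-neighbour search under cosine similarity. This is a minimal illustration, not the authors' implementation: the concept identifiers, the three-dimensional vectors, and the function names `cosine` and `top_k_candidates` are all hypothetical, and a real system would obtain its vectors from an embedding model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_candidates(query_vec, concept_vecs, k=2):
    """Return the k concept IDs most similar to the query, best first."""
    scored = [(cosine(query_vec, vec), cid) for cid, vec in concept_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cid for _, cid in scored[:k]]

# Toy vectors with placeholder IDs (assumptions, not real ontology entries).
CONCEPTS = {
    "CONCEPT:A": [1.0, 0.0, 0.0],
    "CONCEPT:B": [0.9, 0.1, 0.0],
    "CONCEPT:C": [0.0, 1.0, 0.0],
}
query = [1.0, 0.05, 0.0]  # embedding of the local EHR term (made up)
candidates = top_k_candidates(query, CONCEPTS, k=2)
print(candidates)
```

The list returned here is only a shortlist; in the pipeline described above, it would then be passed to the LLM validator rather than accepted directly.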

The findings indicate that retrieval alone is not sufficient for reliable ontology mapping, and that adding LLM validation substantially improves precision. Overall pipeline performance ranged from 78% to 91%, depending on the candidate generation strategy. By turning the traditionally manual task of ontology harmonization into a reusable, configurable workflow, the approach supports scalable, privacy-preserving analytics in federated healthcare research. The separation of recall (from retrieval) and precision (from LLM validation) is highlighted as a generalizable framework for other vocabulary mapping tasks, including automated labeling and patient cohort selection for clinical trials.
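The split between recall from retrieval and precision from validation can be made concrete with a small calculation. The sketch below uses invented toy data (terms `t1` to `t4`, concepts `C1` to `C9`) and hypothetical helper names; it does not reproduce the paper's figures.

```python
def retrieval_recall(candidates, gold):
    """Fraction of source terms whose gold concept appears among the candidates."""
    hits = sum(1 for term, concept in gold.items()
               if concept in candidates.get(term, []))
    return hits / len(gold)

def mapping_precision(accepted, gold):
    """Fraction of accepted mappings that match the gold concept."""
    if not accepted:
        return 0.0
    correct = sum(1 for term, concept in accepted.items()
                  if gold.get(term) == concept)
    return correct / len(accepted)

# Toy gold standard and retrieval output (all values are made up).
gold = {"t1": "C1", "t2": "C2", "t3": "C3", "t4": "C4"}
candidates = {"t1": ["C1", "C9"], "t2": ["C8", "C2"], "t3": ["C7"], "t4": ["C4"]}

# Baseline: accept the top-ranked candidate without validation.
first_hit = {term: cands[0] for term, cands in candidates.items()}
# With validation: the validator picks the right candidate or rejects all.
validated = {"t1": "C1", "t2": "C2", "t4": "C4"}

recall = retrieval_recall(candidates, gold)
p_first = mapping_precision(first_hit, gold)
p_valid = mapping_precision(validated, gold)
print(recall, p_first, p_valid)
```

In this toy case retrieval bounds what can be found (the gold concept for `t3` was never retrieved), while validation determines how many of the accepted mappings are correct.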

Introduction

The introduction discusses how electronic health records (EHRs) have transformed healthcare, emphasizing their potential for data-driven medical research and clinical decision-making. Using this data is difficult, however, because privacy regulations such as HIPAA and GDPR complicate centralized data storage, and because centralized stores are more exposed to breaches. Federated learning (FL) is proposed as a solution: it allows collaborative model training without direct data sharing. In FL, each institution updates a shared global model using its local data, which preserves patient privacy and supports regulatory compliance.
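As an illustration of how a global model can be updated from local data without moving that data, the sketch below applies sample-weighted parameter averaging (the FedAvg rule). The summary does not say which aggregation rule the authors use, so this choice, the function name `federated_average`, and the numbers are assumptions.

```python
def federated_average(local_updates):
    """Weighted average of parameter vectors, weighted by local sample count.

    local_updates: list of (num_samples, parameter_list) tuples, one per site.
    Only parameters and counts leave each site; raw records stay local.
    """
    total = sum(n for n, _ in local_updates)
    dim = len(local_updates[0][1])
    return [
        sum(n * params[i] for n, params in local_updates) / total
        for i in range(dim)
    ]

# Two hypothetical institutions with different amounts of local data.
site_updates = [
    (100, [0.2, 1.0]),
    (300, [0.6, 2.0]),
]
global_params = federated_average(site_updates)
print(global_params)
```

Weighting by sample count lets larger sites contribute proportionally more to the global model; other aggregation rules are possible.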

The paper also addresses the data harmonization needed for FL to work in healthcare. Variability in EHR formats across institutions is a major obstacle and calls for automated techniques to reconcile disparate data types. The authors discuss the role of LLMs in data standardization and alignment, while acknowledging the need for trustworthiness and bias mitigation in clinical applications. The study aims to integrate LLM-based functions into a programmable FL framework to improve healthcare data alignment, and it details a two-step ontology- and LLM-based strategy that was applied in a real-world project. The introduction sets the stage for exploring the intersection of FL, data harmonization, and privacy-preserving data access in healthcare.

Methods

The Methods section outlines the experimental and analytical approaches used in the study. The researchers combined quantitative and qualitative techniques to gather data. Specific methodologies included controlled experiments, statistical analyses, and modeling techniques, chosen to address the research questions.

Data collection involved systematic sampling and rigorous protocols to minimize bias and ensure reliability. The analysis was conducted using statistical software, with various tests applied to validate the findings. The section emphasizes reproducibility and transparency, detailing the steps taken so that the results can be independently verified.

Results

The Results section presents the key findings from the experiments and analyses. It typically includes quantitative data, statistical analyses, and visual representations such as graphs or tables that illustrate the outcomes of the study. The results are often compared against the initial hypotheses or expectations, highlighting significant patterns or trends in the data.

In this section, the authors may also discuss the implications of their findings and how they contribute to existing knowledge in the field. Unexpected results or anomalies are addressed. The section serves as the foundation for the subsequent discussion and interpretation of the findings.

Discussion

The discussion describes the integration of federated learning (FL) frameworks, specifically Vantage6 and Brane, to support data processing and collaboration among healthcare organizations. Vantage6 runs machine learning operations on local data at worker nodes, while Brane provides a secure, programmable environment for data exchange and workflow orchestration. The Enabling Personalized Interventions (EPI) framework, built on these platforms, aims to support personalized health insights while ensuring compliance with data policies. The authors highlight the challenges of deploying FL in open consortia, where data alignment and workflow design become complex because organizational conventions differ.

To address these challenges, the authors propose a two-step pipeline that uses large language models (LLMs) to align electronic health records (EHRs) with standardized biomedical vocabularies. The first step generates candidates using Retrieval Augmented Generation (RAG) to identify potential matches; the second uses LLMs to validate these matches against defined acceptance criteria. The results indicate that this approach significantly improves mapping precision, reaching agreement with the expert of 78% to 92% across different datasets. The authors emphasize that the pipeline can yield reusable alignment functions for FL projects, which streamlines data harmonization in healthcare settings and addresses semantic heterogeneity in decentralized data sources.
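The second step, validating candidates against acceptance criteria, might look roughly like the following. Everything here is a hypothetical sketch: the verdict labels, the prompt wording, and `stub_llm` (a stand-in for a real model call that simply compares strings) are assumptions, not the prompts or models used in the paper.

```python
CRITERIA = (
    "Answer EQUIVALENT if the candidate means the same as the source term, "
    "BROADER if the candidate subsumes the source term, otherwise REJECT."
)

def build_prompt(source_term, candidate_label):
    """Assemble a validation prompt for one (term, candidate) pair."""
    return (
        f"{CRITERIA}\n"
        f"Source term: {source_term}\n"
        f"Candidate concept: {candidate_label}\n"
        "Verdict:"
    )

def validate_candidates(source_term, candidates, ask_llm):
    """Return (label, verdict) for the first accepted candidate, else None."""
    for label in candidates:
        verdict = ask_llm(build_prompt(source_term, label)).strip().upper()
        if verdict in {"EQUIVALENT", "BROADER"}:
            return label, verdict
    return None

def stub_llm(prompt):
    """Stand-in for a real LLM: exact (case-insensitive) match is EQUIVALENT."""
    fields = {}
    for line in prompt.splitlines():
        if line.startswith("Source term: "):
            fields["source"] = line[len("Source term: "):]
        elif line.startswith("Candidate concept: "):
            fields["candidate"] = line[len("Candidate concept: "):]
    same = fields["source"].lower() == fields["candidate"].lower()
    return "EQUIVALENT" if same else "REJECT"

result = validate_candidates("asthma", ["Asthma attack", "Asthma"], stub_llm)
print(result)
```

Passing the model in as a callable keeps the validation logic independent of any particular LLM, which fits the idea of a reusable, configurable alignment function.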

Limitations

The study's primary limitation lies in its evaluation methodology, which relied on a single human expert to assess agreement and may therefore introduce bias. To make the findings more reliable, future research should validate with multiple experts and measure inter-annotator agreement. In addition, deploying the federated Brane/EPI setup across different institutions and datasets remains a significant practical challenge.
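Inter-annotator agreement between two experts is commonly summarized with Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is one way to compute it; the choice of kappa and the example labels are assumptions, since the summary does not name a specific statistic.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)

# Hypothetical accept/reject judgements on six candidate mappings.
expert_1 = ["accept", "accept", "reject", "accept", "reject", "reject"]
expert_2 = ["accept", "reject", "reject", "accept", "reject", "accept"]
kappa = cohen_kappa(expert_1, expert_2)
print(round(kappa, 3))
```

Here the two annotators agree on four of six items, but half of that agreement is expected by chance, so kappa is well below the raw agreement rate.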

Future work will aim to address these limitations by creating a low-code federated learning (FL) environment within Brane/EPI, which will make it easier to integrate data-agnostic AI components and model aggregation techniques into scientific workflows. The researchers also intend to evaluate more open-source large language models (LLMs) under privacy-preserving conditions and to develop LLM-based assistants for workflow definition, policy specification, and data alignment.
