استرجاع الكيانات السريرية المعززة لاستخراج المعلومات السريرية Clinical entity augmented retrieval for clinical information extraction

المجلة: npj Digital Medicine، المجلد: 8، العدد: 1
DOI: https://doi.org/10.1038/s41746-024-01377-1
PMID: https://pubmed.ncbi.nlm.nih.gov/39828800
تاريخ النشر: 2025-01-19
المؤلف: Iván López وآخرون
الموضوع الرئيسي: تنقيب النصوص الطبية والأنطولوجيات

نظرة عامة

يقدم هذا القسم نظرة عامة على خط أنابيب جديد يسمى استرجاع معزز بالجيل (RAG) يُعرف باسم استرجاع الكيانات السريرية المعزز (CLEAR)، مصمم لتعزيز استخراج المعلومات من الملاحظات السريرية من خلال استخدام الكيانات السريرية بدلاً من الاعتماد فقط على التضمينات. تقارن الدراسة CLEAR بأساليب RAG التقليدية المعتمدة على التضمينات والأساليب الكاملة للملاحظات في استخراج 18 متغيرًا من مجموعة بيانات تحتوي على 20,000 ملاحظة سريرية، كاشفة أن CLEAR يحقق أداءً متفوقًا بمتوسط درجة F1 تبلغ 0.90، إلى جانب تقليصات كبيرة في وقت الاستدلال (4.95 ثوانٍ لكل ملاحظة) واستخدام الرموز (1.1k رمز لكل ملاحظة). بالمقابل، حققت طرق RAG المعتمدة على التضمينات والأساليب الكاملة للملاحظات متوسط درجات F1 تبلغ 0.86 و0.79، مع أوقات استدلال أطول وأعداد رموز أعلى.

تسلط الأبحاث الضوء على التحديات المتعلقة باستخراج معلومات قيمة من الملاحظات النصية الحرة في السجلات الصحية الإلكترونية (EHRs)، التي تحتوي على بيانات غنية مثل الأعراض والتشخيصات ووجهات نظر المرضى التي غالبًا ما تكون غائبة عن الحقول المنظمة. تعاني الطرق التقليدية لاستخراج المعلومات السريرية، التي تعتمد أساسًا على القواعد، من قيود في التقاط تعقيد اللغة السريرية. لا يحسن نهج CLEAR كفاءة الاستخراج فحسب، بل يسهل أيضًا تطبيقات متنوعة في البحث وتحسين الجودة، بما في ذلك اختيار المجموعات، وتحديد الأنماط، والنمذجة التنبؤية. بشكل عام، يمثل CLEAR تقدمًا كبيرًا في منهجيات استخراج المعلومات السريرية، حيث يعالج عدم الكفاءة في الأنظمة السابقة بينما يعزز الأداء.

الطرق

يستعرض قسم الطرق تصميم التجربة والتقنيات التحليلية المستخدمة في الدراسة. استخدم الباحثون نهجًا كميًا، حيث قاموا بإجراء تحليلات إحصائية لتقييم البيانات المجمعة من عينة سكانية. تضمنت المنهجيات الرئيسية تطبيق نماذج الانحدار لتقييم العلاقات بين المتغيرات، بالإضافة إلى استخدام ANOVA لمقارنة متوسطات المجموعات.

شملت جمع البيانات استبيانات منظمة وتجارب محكومة، مما يضمن موثوقية وصدق النتائج. تم تحديد حجم العينة بناءً على تحليل القوة لضمان قوة إحصائية كافية لاكتشاف التأثيرات المهمة. بالإضافة إلى ذلك، نفذ الباحثون بروتوكولات صارمة لتنظيف البيانات ومعالجتها مسبقًا لتقليل التحيزات والأخطاء في التحليل. بشكل عام، أسست الإطار المنهجي قاعدة قوية للنتائج اللاحقة المبلغ عنها في الدراسة.

النتائج

يقدم قسم “النتائج” من ورقة البحث النتائج الرئيسية المستمدة من التجارب أو التحليلات التي تم إجراؤها. يوضح النتائج التي توصلت إليها الدراسة، مع تسليط الضوء على نقاط البيانات والاتجاهات المهمة التي تم ملاحظتها. غالبًا ما تكون النتائج مصحوبة بتحليلات إحصائية ذات صلة، والتي قد تشمل قيم p، وفترات الثقة، أو أحجام التأثير، لدعم صحة النتائج.

بالإضافة إلى ذلك، يتم عادةً الإشارة إلى أي أشكال أو جداول أو رسوم بيانية ذات صلة لتوضيح النتائج بصريًا، مما يوفر فهمًا أوضح للبيانات. قد يناقش القسم أيضًا تداعيات هذه النتائج فيما يتعلق بالفرضيات الأصلية أو أسئلة البحث المطروحة في الدراسة، مع التأكيد على كيفية مساهمتها في المعرفة الحالية في هذا المجال.

المناقشة

في هذه الدراسة، نقدم CLEAR، وهو خط أنابيب استرجاع معزز بالجيل (RAG) مصمم لتعزيز استخراج المعلومات السريرية من السجلات الصحية الإلكترونية (EHRs). كانت موثوقية المراجعين عالية، حيث كانت قيم كابا لكوهين 0.86 لمجموعة بيانات Stanford MOUD و0.93 لمجموعة بيانات CheXpert، مما يدل على توافق ممتاز. أظهر نهجنا في التعرف على الكيانات المسماة (NER) بدون تدريب باستخدام Flan-T5 حساسية مثيرة للإعجاب، حيث تم التعرف على 96% و99% من الكيانات في مجموعتي بيانات NCBI disease وStanford MOUD، على التوالي. من الجدير بالذكر أن تعزيز مخرجات NER باستخدام الأنطولوجيا ونماذج اللغة الكبيرة (LLMs) زاد من الحساسية إلى 99% و100%، على التوالي. ومع ذلك، كشفت التحليلات أن حذف خطوة NER الأولية قلل بشكل كبير من الأداء العام، مما يبرز أهمية NER في التقاط التباين السريري.

تجاوز CLEAR الأساليب التقليدية المعتمدة على التضمينات الكاملة للملاحظات، محققًا متوسط درجة F1 تبلغ 0.90 عبر 13 متغيرًا في مجموعة بيانات Stanford MOUD، مع تقليص ملحوظ في استخدام الرموز ووقت المعالجة – 71% أقل من الرموز المدخلة و72% أسرع في الاستدلال. تم التحقق من كفاءة خط الأنابيب من خلال مقارنات مع نماذج بحجم BERT، التي أظهرت أداءً مشابهًا للنماذج الأكبر مع الحاجة إلى موارد أقل. على الرغم من نقاط قوته، تعترف الدراسة بالقيود، بما في ذلك التركيز على استخراج المتغيرات السريرية والحاجة إلى استكشاف المزيد من تطبيقات CLEAR لمهام أخرى مثل التلخيص والإجابة على الأسئلة. بشكل عام، يمثل CLEAR تقدمًا كبيرًا في كفاءة وفعالية معالجة المعلومات السريرية، مما يجعله أداة واعدة لتطبيقات الرعاية الصحية.

Journal: npj Digital Medicine, Volume: 8, Issue: 1
DOI: https://doi.org/10.1038/s41746-024-01377-1
PMID: https://pubmed.ncbi.nlm.nih.gov/39828800
Publication Date: 2025-01-19
Author(s): Iván López et al.
Primary Topic: Biomedical Text Mining and Ontologies

Overview

The section presents an overview of a novel retrieval-augmented generation (RAG) pipeline called CLinical Entity Augmented Retrieval (CLEAR), designed to enhance information extraction from clinical notes by utilizing clinical entities rather than relying solely on embeddings. The study compares CLEAR against traditional embedding RAG and full-note approaches in extracting 18 variables from a dataset of 20,000 clinical notes, revealing that CLEAR achieves superior performance with an average F1 score of 0.90, alongside significant reductions in inference time (4.95 seconds per note) and token usage (1.1k tokens per note). In contrast, the embedding RAG and full-note methods yielded average F1 scores of 0.86 and 0.79, with longer inference times and higher token counts.

The research highlights the challenges of extracting valuable information from free-text notes in electronic health records (EHRs), which contain rich data such as symptoms, diagnoses, and patient perspectives that are often absent from structured fields. Traditional clinical information extraction methods, primarily rule-based, have limitations in capturing the complexity of clinical language. CLEAR’s approach not only improves extraction efficiency but also facilitates various applications in research and quality improvement, including cohort selection, phenotyping, and predictive modeling. Overall, CLEAR represents a significant advancement in clinical information extraction methodologies, addressing the inefficiencies of previous systems while enhancing performance.

Methods

The Methods section outlines the experimental design and analytical techniques employed in the study. The researchers utilized a quantitative approach, employing statistical analyses to evaluate the data collected from a sample population. Key methodologies included the application of regression models to assess relationships between variables, as well as the use of ANOVA to compare group means.

Data collection involved structured surveys and controlled experiments, ensuring the reliability and validity of the findings. The sample size was determined based on power analysis to ensure adequate statistical power for detecting significant effects. Additionally, the researchers implemented rigorous protocols for data cleaning and preprocessing to minimize biases and errors in the analysis. Overall, the methodological framework established a robust basis for the subsequent findings reported in the study.

Results

The “Results” section of the research paper presents the key findings derived from the conducted experiments or analyses. It details the outcomes of the study, highlighting significant data points and trends observed. The results are often accompanied by relevant statistical analyses, which may include p-values, confidence intervals, or effect sizes, to support the validity of the findings.

Additionally, any relevant figures, tables, or graphs are typically referenced to illustrate the results visually, providing a clearer understanding of the data. The section may also discuss the implications of these findings in relation to the original hypotheses or research questions posed in the study, emphasizing how they contribute to the existing body of knowledge in the field.

Discussion

In this study, we introduce CLEAR, a retrieval-augmented generation (RAG) pipeline designed to enhance clinical information extraction from electronic health records (EHRs). The inter-rater reliability of annotators was high, with Cohen’s Kappa values of 0.86 for the Stanford MOUD dataset and 0.93 for the CheXpert dataset, indicating excellent agreement. Our zero-shot Named Entity Recognition (NER) approach using Flan-T5 demonstrated impressive sensitivity, identifying 96% and 99% of entities in the NCBI disease and Stanford MOUD datasets, respectively. Notably, augmenting NER outputs with ontology and large language models (LLMs) further improved sensitivity to 99% and 100%, respectively. However, the analysis revealed that omitting the initial NER step significantly decreased overall performance, underscoring the importance of NER in capturing clinical variability.

CLEAR outperformed traditional chunk embedding and full-note approaches, achieving an average F1 score of 0.90 across 13 variables in the Stanford MOUD dataset, with a notable reduction in token usage and processing time—71% fewer input tokens and 72% faster inference. The pipeline’s efficiency was further validated through comparisons with BERT-sized models, which showed comparable performance to larger models while requiring fewer resources. Despite its strengths, the study acknowledges limitations, including a focus on clinical variable extraction and the need for further exploration of CLEAR’s applicability to other tasks such as summarization and question answering. Overall, CLEAR represents a significant advancement in the efficiency and effectiveness of clinical information processing, making it a promising tool for healthcare applications.