التعرف الآلي على الكيانات الطبية الحيوية ذات الصلة بالسياق باستخدام نماذج اللغة الكبيرة المستندة إلى الواقع Automated identification of contextually relevant biomedical entities with grounded LLMs

المجلة: Scientific Reports، المجلد: 16، العدد: 1
DOI: https://doi.org/10.1038/s41598-026-35492-8
PMID: https://pubmed.ncbi.nlm.nih.gov/41530258
تاريخ النشر: 2026-01-13
المؤلف: Manuel Watter وآخرون
الموضوع الرئيسي: تنقيب النصوص الطبية والأنطولوجيات

نظرة عامة

تقيّم هذه الدراسة فعالية نماذج اللغة الكبيرة المختلفة (LLMs) لتوصيف الكيانات البيولوجية التلقائي في المقالات البحثية، مع التركيز على النتائج السياقية والمبنية على أسس. باستخدام سير عمل توليدي من أربع خطوات، تقوم الدراسة بتوليد وتحسين مرشحي الكيانات بشكل تكراري مع دمج مخطط بيانات وصفية للسياق واستخدام قاعدة بيانات PubTator 3 للتحقق. تم تحليل دقة هذا سير العمل من خلال تحليل ميتا بتأثيرات عشوائية، مستندًا إلى مقابلات مع مؤلفين من مركز البحث التعاوني (CRC) 1453 “NephGen.”

تكشف النتائج عن دقة إجمالية تبلغ 91.3%، حيث أظهرت نماذج GPT-4.1 وGPT-4o Mini وGemini 2.0 Flash أعلى معدلات دقة. ومن الجدير بالذكر أن GPT-4.1 وGemini 2.0 Flash حققتا أكبر عدد من التوصيفات الصحيحة، بينما تم تحديد GPT-4o Mini وGemini 2.0 Flash كأسرع وأكثر الخيارات فعالية من حيث التكلفة. تؤكد الدراسة على وجود تباينات كبيرة في عدد التوصيفات وأهمية المراجعة البشرية (“البشر في الحلقة”) بسبب التداخل بين التوصيفات الخاصة بالنشر ومجموعات البيانات. كما تسلط الضوء على التبادلات بين الدقة وحجم التوصيفات والتكلفة والسرعة، مشيرة إلى أنه بينما الجودة أمر حاسم في البحث التعاوني، قد تكون الفعالية من حيث التكلفة أكثر أهمية في التطبيقات العامة الأوسع.

مقدمة

تسلط المقدمة الضوء على الدور الحاسم لمشاركة البيانات المنظمة والبيانات الوصفية القابلة للتشغيل المتبادل في تسهيل إعادة استخدام البيانات على المدى الطويل داخل المؤسسات البحثية، مع الإشارة بشكل خاص إلى مراكز البحث التعاوني في ألمانيا (CRCs). يتماشى هذا النهج مع مبادئ FAIR—قابل للاكتشاف، قابل للوصول، قابل للتشغيل المتبادل، وقابل لإعادة الاستخدام—مؤكداً على ضرورة وجود ممارسات قوية لإدارة البيانات لتعزيز طول عمر البيانات البحثية وفائدتها. تمهد هذه القسم الطريق لمناقشة آثار هذه الممارسات على نتائج البحث والتقدم العام للمعرفة العلمية.

الطرق

تركز المنهجية الموضحة في هذا البحث على نهج منظم لتوقع البيانات الوصفية من خلال محادثات متعددة الأدوار مع نماذج اللغة الكبيرة (LLMs). تتكون العملية من أربع دورات محادثة: أولاً، تستخرج LLM الكيانات ذات الصلة من النصوص العلمية مع استبعاد المناقشات والمراجع؛ ثانياً، تتحقق من هذه الكيانات باستخدام أداة PubTator 3؛ ثالثاً، تنظم الكيانات المحددة وفقًا لهياكل هرمية محددة مسبقًا بناءً على مخطط البيانات الوصفية الذي تم تطويره لمركز البحث “NephGen”; وأخيرًا، تجمع النتائج من الخطوات السابقة. يتم تنسيق المخرجات كقائمة JSON منظمة، مما يسمح بتحسين تكراري لمرشحي الكيانات بناءً على الصلة السياقية.

شملت التقييم ستة مقالات من مركز البحث “NephGen”، تم اختيارها لتمثيل مجموعة متنوعة من مواضيع البحث. تشير النتائج إلى أن نماذج Gemini (2.0 Flash و2.5 Flash) تفوقت على الآخرين في اقتراح الكيانات البيولوجية، مع متوسط توصيلات صحيحة بلغ 47.5 و60.8، على التوالي، مقارنة بمتوسطات أقل من GPT-4o وGPT-4o Mini (22.8 و21.0). ومن الجدير بالذكر أن العدد الإجمالي للكيانات المقترحة لم يتوافق مع دقة الأساليب المستخدمة (p=0.710)، مما يبرز أن كل من GPT-4.1 وGemini Flash 2.0 حققتا معدلات دقة عالية على الرغم من الأعداد المتفاوتة من الكيانات المقترحة. تؤكد الدراسة على أهمية السيطرة الاجتماعية في عملية التقييم، التي تسهلها مشاركة عضو من فريق إدارة البيانات البحثية خلال المقابلات المباشرة.

النتائج

يقدم قسم “النتائج” من ورقة البحث النتائج الرئيسية المستمدة من التجارب أو التحليلات التي تم إجراؤها. يوضح النتائج الناتجة عن اختبارات مختلفة، مع تسليط الضوء على الاتجاهات والأنماط المهمة التي لوحظت في البيانات. غالبًا ما تكون النتائج مصحوبة بتحليلات إحصائية ذات صلة، بما في ذلك قيم p وفترات الثقة، لدعم النتائج.

بالإضافة إلى ذلك، قد يتضمن القسم تمثيلات بصرية مثل الرسوم البيانية أو الجداول التي توضح العلاقات بين المتغيرات أو فعالية التدخلات. تساعد هذه المساعدات البصرية في تعزيز وضوح النتائج وتسهيل فهم أعمق لآثار البحث. بشكل عام، تساهم النتائج في الجسم المعرفي القائم في هذا المجال وقد تقترح اتجاهات للبحث المستقبلي أو التطبيقات العملية.

المناقشة

في هذه المناقشة، يقيم المؤلفون فعالية نماذج اللغة الكبيرة المختلفة (LLMs) في توقع الكيانات البيولوجية لتوصيف البيانات الوصفية، مع التأكيد على أهمية السياق في صلة الكيانات. يبرزون أنه بينما تعزز الأساليب التلقائية مثل التعرف على الكيانات المسماة القابلية للاكتشاف، فإن اختيار الكيانات يعتمد على السياق ويختلف مع التركيز التجريبي. أنشأت الدراسة سير عمل من أربع خطوات يدمج التعلم في السياق وأدوات التحقق الخارجية، محققة دقة إجمالية تبلغ 91.3% عبر نماذج متعددة. ومن الجدير بالذكر أن نماذج مثل GPT-4.1 وGemini 2.0 Flash أظهرت دقة عالية، على الرغم من أن الفروق في الأداء لم تكن ذات دلالة إحصائية.

يعترف المؤلفون بالقيود، بما في ذلك الاعتماد على واجهات برمجة التطبيقات الخارجية، التي يمكن أن تؤثر على إمكانية التكرار، وحجم العينة الصغيرة المكونة من ست مقالات من مركز بحث واحد، مما يحد من إمكانية تعميم النتائج. يؤكدون على ضرورة الإشراف البشري في عملية التوصيف لضمان الصلة والدقة، خاصة في التمييز بين المقالات ومجموعاتها البيانية الأساسية. يتم تشجيع البحث المستقبلي على إنشاء توافق حول الحقيقة الأساسية لتحسين تقدير الاسترجاع وتقييم أداء النماذج بشكل شامل. سيساعد ذلك في فهم أوضح لتأثير المراجعة البشرية في تعزيز جودة التوصيفات التي تنتجها نماذج اللغة الكبيرة.

Journal: Scientific Reports, Volume: 16, Issue: 1
DOI: https://doi.org/10.1038/s41598-026-35492-8
PMID: https://pubmed.ncbi.nlm.nih.gov/41530258
Publication Date: 2026-01-13
Author(s): Manuel Watter et al.
Primary Topic: Biomedical Text Mining and Ontologies

Overview

This study evaluates the effectiveness of various large language models (LLMs) for automated biomedical entity annotation in research articles, emphasizing contextualized and grounded results. Utilizing a four-step generative workflow, the research iteratively generates and refines entity candidates while incorporating a metadata schema for context and employing the PubTator 3 database for validation. The precision of this workflow was analyzed through a random effects meta-analysis, informed by interviews with authors from the Collaborative Research Center (CRC) 1453 “NephGen.”

The findings reveal an overall precision of 91.3%, with models GPT-4.1, GPT-4o Mini, and Gemini 2.0 Flash demonstrating the highest precision rates. Notably, GPT-4.1 and Gemini 2.0 Flash achieved the greatest number of correct annotations, while GPT-4o Mini and Gemini 2.0 Flash were identified as the fastest and most cost-effective options. The study underscores significant variations in annotation counts and the importance of human review (“human-in-the-loop”) due to the conflation of publication and dataset-specific annotations. It also highlights the trade-offs between precision, annotation volume, cost, and speed, suggesting that while quality is crucial in collaborative research, cost-effectiveness may be more vital in broader public applications.

Introduction

The introduction highlights the critical role of structured data sharing and interoperable metadata in facilitating long-term data reusability within research institutions, specifically referencing Germany’s Collaborative Research Centers (CRCs). This approach aligns with the FAIR principles—Findable, Accessible, Interoperable, and Reusable—emphasizing the necessity for robust data management practices to enhance the longevity and utility of research data. The section sets the stage for discussing the implications of these practices on research outcomes and the overall advancement of scientific knowledge.

Methods

The methodology outlined in this research focuses on a structured approach for few-shot metadata prediction through multi-turn conversations with large language models (LLMs). The process consists of four conversational turns: first, the LLM extracts relevant entities from scientific texts while excluding discussions and bibliographies; second, it validates these entities using the PubTator 3 tool; third, it organizes the identified entities according to a predetermined hierarchical structure based on the metadata schema developed for the CRC “NephGen”; and finally, it consolidates the results from the previous steps. The output is formatted as a structured JSON list, which allows for iterative refinement of entity candidates based on contextual relevance.

The evaluation involved six articles from the CRC “NephGen,” selected to represent a diverse range of research topics. The findings indicate that the Gemini models (2.0 Flash and 2.5 Flash) outperformed others in proposing biomedical entities, with mean correct annotations of 47.5 and 60.8, respectively, compared to lower means from GPT-4o and GPT-4o Mini (22.8 and 21.0). Notably, the total number of suggested entities did not correlate with the precision of the approaches used (p=0.710), highlighting that both GPT-4.1 and Gemini Flash 2.0 achieved high precision rates despite varying numbers of suggested entities. The study emphasizes the importance of social control in the evaluation process, facilitated by the involvement of a research data management team member during face-to-face interviews.

Results

The “Results” section of the research paper presents the key findings derived from the conducted experiments or analyses. It details the outcomes of various tests, highlighting significant trends and patterns observed in the data. The results are often accompanied by relevant statistical analyses, including p-values and confidence intervals, to substantiate the findings.

Additionally, the section may include visual representations such as graphs or tables that illustrate the relationships between variables or the effectiveness of interventions. These visual aids serve to enhance the clarity of the results and facilitate a deeper understanding of the implications of the research. Overall, the findings contribute to the existing body of knowledge in the field and may suggest directions for future research or practical applications.

Discussion

In this discussion, the authors evaluate the effectiveness of various Large Language Models (LLMs) in predicting biomedical entities for metadata annotation, emphasizing the importance of context in entity relevance. They highlight that while automated approaches like named entity recognition enhance discoverability, the selection of entities is context-dependent and varies with the experimental focus. The study established a four-step workflow that integrates in-context learning and external verification tools, achieving an overall precision of 91.3% across multiple models. Notably, models such as GPT-4.1 and Gemini 2.0 Flash demonstrated high precision, although the differences in performance were not statistically significant.

The authors acknowledge limitations, including the reliance on external APIs, which can affect reproducibility, and the small sample size of six articles from a single research center, which restricts generalizability. They emphasize the necessity of human oversight in the annotation process to ensure relevance and accuracy, particularly in distinguishing between articles and their underlying datasets. Future research is encouraged to establish a consensus ground truth to better approximate recall and evaluate the models’ performance comprehensively. This would facilitate a clearer understanding of the impact of human review in enhancing the quality of LLM-generated annotations.