Integrating artificial intelligence and manual curation to enhance bioassay annotations in ChEMBL

Journal: Journal of Cheminformatics, Volume: 18, Issue: 1
DOI: https://doi.org/10.1186/s13321-026-01165-x
PMID: https://pubmed.ncbi.nlm.nih.gov/41680846
Publication Date: 2026-02-12
Author(s): Ines Šmit et al.
Primary Topic: Computational Drug Discovery Methods

Overview

The paper describes advances in the annotation and classification of bioassays within the ChEMBL database, emphasizing that standardized assay metadata makes the data more useful for cheminformatics and machine learning applications. The authors detail their efforts to improve bioassay annotations through a combination of manual curation and AI-driven techniques, including the introduction of a “perfect assay description” template. They employ natural language processing (NLP) and multi-class classification methods to automatically extract key assay parameters and categorize legacy data, and they report a spaCy-based Named Entity Recognition (NER) model that identifies experimental methods with an average precision of 0.93 and recall of 0.95 in cross-validation.

The study highlights significant improvements in metadata extraction for ADME endpoints, organism and protein variant annotations, and ontology linking, thereby enhancing the FAIRness (Findable, Accessible, Interoperable, and Reusable) of ChEMBL’s bioassay data. These enhancements facilitate more robust downstream analyses and precise compound-target activity modeling, ultimately benefiting the cheminformatics and drug discovery communities by promoting consistent and FAIR-compliant use of ChEMBL data. The authors conclude that their integration of expert knowledge with scalable AI methodologies provides a valuable framework for transforming unstructured experimental descriptions into richly annotated, interoperable datasets, significantly improving the interpretability and reusability of bioactivity data.

Introduction

The introduction highlights the increasing significance of bioassay data in open-access databases like ChEMBL, particularly in the context of large-scale analyses and machine learning (ML) applications. Emphasizing the FAIR (findable, accessible, interoperable, and reusable) principles, the section discusses the necessity for structured data repositories that capture essential assay parameters and metadata. It outlines the MIABE (Minimum Information About a Bioactive Entity) checklist, which promotes reproducibility and data integration in drug discovery by specifying required metadata such as chemical identity and biological assay properties.

The text also reviews various initiatives aimed at enhancing bioassay data interoperability, including the now-defunct BARD (BioAssay Research Database) and its contributions to the Bioassay Ontology (BAO). Additionally, it mentions the Data FAIRy project, which combines machine learning with expert curation to extract structured assay metadata, and the Drug Target Commons (DTC) initiative, which employs “micro-annotations” for detailed assay context. The MICHA project further underscores the need for standardized assay metadata as datasets grow in scale. The introduction concludes by noting ongoing efforts within the ChEMBL team to improve assay data representation and annotation, which will facilitate better-informed decision-making in cheminformatics and computational chemistry.

Methods

In this section, the authors detail the development and evaluation of a Named Entity Recognition (NER) model that identifies experimental methods within assay descriptions. The model was trained on a manually annotated dataset of 800 assays (500 binding assays and 300 functional assays), of which 460 contained annotated methods. Training involved tagging 272 unique methodology-related terms, and evaluation used repeated fivefold cross-validation, giving an average precision of 0.93, recall of 0.95, and F1-score of 0.94. The final model was applied to 1,169,293 assay descriptions from ChEMBL 35 and identified experimental methods in 662,675 of them (57% of the total); the most frequently recognized methods included “MTT assay” and “ELISA.”
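The F1-score and coverage figures quoted above can be checked from the other numbers in the same paragraph. The Python sketch below is only an arithmetic check: the function names are illustrative and not taken from the paper, and an F1 computed from averaged precision and recall need not match exactly an F1 averaged across folds.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def coverage_percent(annotated: int, total: int) -> float:
    """Share of items that received an annotation, in percent."""
    return 100 * annotated / total


# Figures quoted in the summary above.
f1 = f1_score(0.93, 0.95)
coverage = coverage_percent(662_675, 1_169_293)

print(f"F1 = {f1:.2f}")
print(f"coverage = {coverage:.1f}%")
```

Rounded to two decimals and to a whole percent respectively, both values agree with the reported 0.94 and 57%.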

Additionally, the authors linked the identified experimental methods to the Bioassay Ontology (BAO) using the same dataset, mapping 52% of identified methods to at least one BAO term. The study highlights the model’s strengths in precision and recall, reporting 100 true positives and only 2 false negatives, although some false positives and mismatches were noted. In future work, the authors plan to improve the ontology linking by exploring advanced NLP techniques and hybrid models to increase coverage of assay methods.
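This summary does not say how the method-to-ontology linking is implemented, so the following Python sketch uses a plain normalized dictionary lookup as a stand-in, purely to illustrate what "mapped to at least one BAO term" means as a coverage figure. The lookup table, the placeholder identifiers, and all function names are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical lookup table: the identifiers are placeholders, not real BAO IDs.
METHOD_TO_BAO: Dict[str, str] = {
    "elisa": "BAO:PLACEHOLDER_ELISA",
    "mtt assay": "BAO:PLACEHOLDER_MTT",
}


def normalize(method: str) -> str:
    """Lowercase and collapse whitespace so spelling variants share a key."""
    return " ".join(method.lower().split())


def map_to_bao(method: str) -> Optional[str]:
    """Return the placeholder ontology ID for a method, or None if unmapped."""
    return METHOD_TO_BAO.get(normalize(method))


def mapped_fraction(methods: List[str]) -> float:
    """Fraction of methods that map to at least one ontology term."""
    if not methods:
        return 0.0
    mapped = sum(1 for m in methods if map_to_bao(m) is not None)
    return mapped / len(methods)


# Invented list of extracted method strings.
extracted = ["ELISA", "MTT  assay", "patch clamp", "FRET"]
print(mapped_fraction(extracted))
```

An exact-match lookup like this misses synonyms and abbreviations, which is one plausible reason a real pipeline would look at more advanced NLP or hybrid approaches.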

Results

In this section, the authors present recent advancements in the annotation of bioassay parameters within the ChEMBL database, achieved through a blend of manual expert curation and artificial intelligence (AI) techniques. They introduce the notion of a “perfect assay description,” which serves as a benchmark for the enhanced curation process implemented in the latest ChEMBL releases.

Furthermore, the authors provide illustrative examples demonstrating the application of text mining and AI methodologies to extract additional relevant assay parameters from the assay descriptions. These enhancements aim to improve the comprehensiveness and accuracy of bioassay data, thereby facilitating better utilization of the ChEMBL database for research purposes.
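As a rough illustration of extracting an assay parameter from free text, the Python sketch below pulls a duration out of a description with a regular expression. It is a simplified stand-in and not the authors' text-mining pipeline; the pattern, the function name, and the example description are all assumptions made for illustration.

```python
import re
from typing import Optional, Tuple

# Matches phrases such as "after 48 hrs" or "for 30 mins".
DURATION = re.compile(
    r"\b(?:after|for)\s+(\d+(?:\.\d+)?)\s*(hrs?|hours?|mins?|minutes?)\b",
    re.IGNORECASE,
)


def extract_duration(description: str) -> Optional[Tuple[float, str]]:
    """Return (value, unit) for the first duration phrase, or None."""
    match = DURATION.search(description)
    if match is None:
        return None
    # Normalize the unit spelling to "h" or "min".
    unit = "h" if match.group(2).lower().startswith("h") else "min"
    return float(match.group(1)), unit


# Invented example text, written in the style of an assay description.
example = "Cytotoxicity against human A549 cells after 48 hrs by MTT assay"
print(extract_duration(example))
```

A hand-written pattern only covers the phrasings it anticipates, which is the usual motivation for moving to trained models when descriptions vary widely.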

Discussion

The discussion section elaborates on the structure and curation of bioassay data within the ChEMBL database, emphasizing the importance of detailed metadata for the usability and interpretability of bioactivity measurements. ChEMBL organizes bioassay information into several tables, including ASSAYS and ACTIVITIES, which hold details about tested compounds, targets, and assay conditions. The database’s curation efforts have led to the integration of fine-grained metadata, such as protein variants and assay parameters, which are essential for accurate assay-target mapping and comparison. Notably, metadata tables like ASSAY_PARAMETERS and ACTIVITY_PROPERTIES allow flexible annotation of assays and activities, giving a more nuanced picture of the experimental conditions.
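To make the relationship between an assay record and its parameter annotations concrete, the Python sketch below builds a toy in-memory SQLite database and joins the two tables. The schema is heavily simplified: the column names are assumptions modelled loosely on ChEMBL, and the rows and parameter types are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE assays (
        assay_id    INTEGER PRIMARY KEY,
        description TEXT,
        assay_type  TEXT
    );
    CREATE TABLE assay_parameters (
        assay_id       INTEGER REFERENCES assays(assay_id),
        standard_type  TEXT,
        standard_value REAL,
        standard_units TEXT
    );
    """
)

# Invented rows for illustration only.
conn.execute("INSERT INTO assays VALUES (1, 'Inhibition of enzyme X', 'B')")
conn.executemany(
    "INSERT INTO assay_parameters VALUES (?, ?, ?, ?)",
    [(1, "TEMPERATURE", 37.0, "degC"), (1, "TIME", 30.0, "min")],
)

# One assay can carry any number of parameter rows, hence the join.
rows = conn.execute(
    """
    SELECT a.assay_id, p.standard_type, p.standard_value, p.standard_units
    FROM assays AS a
    JOIN assay_parameters AS p ON p.assay_id = a.assay_id
    ORDER BY p.standard_type
    """
).fetchall()

for row in rows:
    print(row)
```

Keeping parameters in a separate key-value style table is what allows new kinds of annotation to be added without changing the main assay table.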

Furthermore, the paper discusses the classification of assays into categories (ASSAY_TYPE) and the systematic identification of in vivo assays, which are annotated with disease models and mapped to relevant ontologies. The mapping of bioassays to the Bioassay Ontology (BAO) has achieved significant coverage, with 79% of bioactivities in ChEMBL 35 linked to BAO_ENDPOINT terms. The authors highlight ongoing efforts to improve assay descriptions and metadata consistency, particularly for deposited data, which now constitutes a substantial portion of the database. The introduction of fields like ASSAY_GROUP and ASSAY_CATEGORY aims to enhance data comparability and facilitate better organization of assays, ultimately supporting researchers in their analyses and interpretations of bioactivity data.
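A coverage figure such as the 79% quoted above is simply the share of activity records that carry an endpoint term. The Python sketch below shows that calculation together with a count per assay type on a handful of invented records; the `Activity` type, its field names, and the placeholder term are hypothetical, and the single-letter codes are used only as example values.

```python
from collections import Counter
from typing import List, NamedTuple, Optional


class Activity(NamedTuple):
    assay_type: str              # single-letter code, e.g. "B" or "F"
    bao_endpoint: Optional[str]  # ontology term ID, or None if unmapped


def endpoint_coverage(activities: List[Activity]) -> float:
    """Fraction of activities that have an endpoint term assigned."""
    if not activities:
        return 0.0
    mapped = sum(1 for a in activities if a.bao_endpoint is not None)
    return mapped / len(activities)


def counts_by_assay_type(activities: List[Activity]) -> Counter:
    """Number of activities per assay type code."""
    return Counter(a.assay_type for a in activities)


# Invented records; "BAO_PLACEHOLDER" is not a real ontology ID.
records = [
    Activity("B", "BAO_PLACEHOLDER"),
    Activity("B", None),
    Activity("F", "BAO_PLACEHOLDER"),
    Activity("F", "BAO_PLACEHOLDER"),
]

print(endpoint_coverage(records))
print(counts_by_assay_type(records))
```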