DFAST_QC: quality assessment and taxonomic identification tool for prokaryotic Genomes

Journal: BMC Bioinformatics, Volume: 26, Issue: 1
DOI: https://doi.org/10.1186/s12859-024-06030-y
PMID: https://pubmed.ncbi.nlm.nih.gov/39773409
Publication Date: 2025-01-07
Author(s): Mohamed Elmanzalawi et al.
Primary Topic: Genomics and Phylogenetic Studies

Overview

This paper describes the development and capabilities of DFAST_QC, a tool for taxonomic classification and quality control of prokaryotic genomes. Accurate taxonomic classification is crucial for biological research, yet existing genome verification tools often demand extensive computational resources and time, which limits their use in large-scale projects. DFAST_QC addresses this with a user-friendly interface and efficient performance. It identifies species against both the NCBI and GTDB taxonomies, using MASH for genome-distance calculations and Skani for Average Nucleotide Identity (ANI) assessments.

The results show that DFAST_QC's assignments are consistent with established taxonomic standards: it identifies species across diverse datasets and reveals potential mislabeling in public databases. It runs on local machines with minimal computational demands, which makes it practical for large-scale genomic studies. DFAST_QC is also integrated into DDBJ's genome annotation and submission pipeline, where it supports validation of genomes before public submission and so helps improve the accuracy and reliability of public genomic data.

Introduction

The introduction highlights the critical role of public genome databases in biological research and the importance of accurate metadata and high-quality sequences for effective collaboration. It notes that taxonomically mislabeled genomes can lead to confusion and scientifically inaccurate results. To address this, the National Center for Biotechnology Information (NCBI) has used Average Nucleotide Identity (ANI) analysis since 2018 to verify prokaryotic genomes in GenBank, with a 95% ANI threshold commonly used to distinguish species. The NCBI Taxonomy system organizes taxonomic information to ensure consistency and interoperability, aligning with the authoritative nomenclature from the List of Prokaryotic Names with Standing in Nomenclature (LPSN).
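As a minimal illustration of the 95% ANI rule mentioned above, the sketch below assigns a query genome to the reference species with the highest ANI, provided that value reaches the cutoff. The function name, the data layout, and the example values are hypothetical. Only the 95% threshold comes from the text, and this is not DFAST_QC's actual code.

```python
from typing import Dict, Optional

SPECIES_ANI_THRESHOLD = 95.0  # percent; the commonly used species cutoff cited above


def assign_species(ani_by_reference: Dict[str, float],
                   threshold: float = SPECIES_ANI_THRESHOLD) -> Optional[str]:
    """Return the reference species with the highest ANI if it meets the threshold.

    ani_by_reference maps a species name to the ANI (in percent) between the
    query genome and that species' reference genome. Returns None when the
    mapping is empty or no reference reaches the threshold.
    """
    if not ani_by_reference:
        return None
    best_species, best_ani = max(ani_by_reference.items(), key=lambda item: item[1])
    return best_species if best_ani >= threshold else None


# Illustrative values only, not results from the paper.
example = {"Escherichia coli": 98.7, "Escherichia fergusonii": 91.2}
print(assign_species(example))
```

A query whose best ANI falls below the threshold returns `None`, which a caller could report as "no species assigned" rather than forcing a label.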

The introduction further discusses the challenges researchers face, such as sample mix-ups and species misidentification, which become more likely as the number of genomes in a project grows. Existing tools like the Type Strain Genome Server (TYGS) and the Microbial Genomes Atlas (MiGA) are noted for their extensive reference databases but are limited by long execution times and web-based constraints. In contrast, GTDB-Tk offers local phylogenetic classification but has high computational demands and may not align with established taxonomic names. To overcome these limitations, the authors introduce DFAST_QC, a genome verification tool designed for accurate taxonomic assignment and quality control of prokaryotic genomes. DFAST_QC is accessible through a web service and can also run on local machines with limited resources. It uses MASH for similarity estimation and Skani for ANI calculation, and assesses genome completeness and contamination with CheckM. The paper details the features of DFAST_QC, particularly its taxonomic identification capabilities.
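For readers unfamiliar with how MASH estimates similarity, the sketch below converts a Jaccard index between two k-mer sketches into a Mash distance, using the formula published with the Mash tool, D = -(1/k) ln(2j / (1 + j)). This background is not taken from the DFAST_QC summary itself. The function name is hypothetical, k = 21 is Mash's documented default k-mer size, and clamping the result into [0, 1] is a choice made for this example.

```python
import math


def mash_distance(jaccard: float, k: int = 21) -> float:
    """Convert a Jaccard index between two k-mer sketches into a Mash distance.

    jaccard: fraction of shared k-mers between the two sketches, in [0, 1].
    k:       k-mer length used to build the sketches.
    """
    if not 0.0 <= jaccard <= 1.0:
        raise ValueError("jaccard must be between 0 and 1")
    if jaccard == 0.0:
        return 1.0  # no shared k-mers: treat as maximally distant
    distance = -math.log(2.0 * jaccard / (1.0 + jaccard)) / k
    # Clamp so that very small Jaccard values never exceed 1 (and 1.0 maps to 0.0).
    return min(1.0, max(0.0, distance))


# Higher Jaccard similarity yields a smaller distance.
for j in (1.0, 0.9, 0.5, 0.1):
    print(j, round(mash_distance(j), 4))
```

Because the estimate needs only the compact sketches rather than full alignments, it is cheap enough to compare one query against a large reference collection.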

Results

The benchmarks show that DFAST_QC's species assignments agree closely with established taxonomy: 99.9% of assignments were correct with respect to NCBI taxonomy, and 99% were consistent with GTDB classifications. The evaluation covered diverse datasets and also revealed potentially mislabeled genomes in public databases.

The tool achieved these results with modest resources, using less than 2 GB of memory and completing tasks in under 10 seconds. Its main limitation is that species without a sequenced representative genome cannot be identified.

Discussion

DFAST_QC uses a two-step workflow for taxonomic checks. It first calculates genomic distances using MASH sketches of reference genomes, then calculates ANI to assign a species based on established thresholds. CheckM evaluates genome completeness and contamination, and species identification can be extended with the GTDB taxonomy. Benchmarked against datasets from NCBI and GTDB, the tool reached 99.9% accuracy in species assignment under NCBI taxonomy and 99% consistency with GTDB classifications. DFAST_QC is also well suited to environments with limited computational resources, requiring less than 2 GB of memory and completing tasks in under 10 seconds.
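The two-step workflow described above can be sketched roughly as follows. This is a simplified illustration, not DFAST_QC's implementation: the names `prescreen`, `assign`, and `Hit` are hypothetical, the candidate limit is an arbitrary assumption, and the `ani_lookup` callable stands in for an ANI tool such as Skani. The sketch also assumes that the distance step serves to narrow the ANI step to the closest references.

```python
from typing import Callable, Dict, List, NamedTuple, Optional


class Hit(NamedTuple):
    reference: str
    ani: float  # percent


def prescreen(mash_distances: Dict[str, float], max_candidates: int = 10) -> List[str]:
    """Step 1: keep the references closest to the query by Mash distance."""
    ranked = sorted(mash_distances, key=lambda ref: mash_distances[ref])
    return ranked[:max_candidates]


def assign(candidates: List[str],
           ani_lookup: Callable[[str], float],
           threshold: float = 95.0) -> Optional[Hit]:
    """Step 2: compute ANI for the prescreened candidates only.

    Returns the best hit at or above the threshold, or None if there is none.
    """
    hits = [Hit(ref, ani_lookup(ref)) for ref in candidates]
    accepted = [hit for hit in hits if hit.ani >= threshold]
    return max(accepted, key=lambda hit: hit.ani) if accepted else None


# Illustrative values only, not data from the paper.
mash = {"ref_A": 0.02, "ref_B": 0.15, "ref_C": 0.30}
ani_table = {"ref_A": 97.9, "ref_B": 84.0, "ref_C": 75.0}

candidates = prescreen(mash, max_candidates=2)
result = assign(candidates, ani_table.__getitem__)
print(candidates, result)
```

Separating the cheap distance ranking from the more expensive ANI calculation is one plausible reason such a workflow stays fast on modest hardware, since ANI is computed for only a handful of references per query.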

Because DFAST_QC relies on publicly available reference genomes, its accuracy depends on what has been sequenced. It identifies species effectively when a type genome is available, but challenges remain for species that lack a sequenced representative. The tool's lightweight design and user-friendly interface make it easy to integrate into broader analytical workflows, which is useful for researchers who want to validate genomic data before submitting it to public databases. Overall, DFAST_QC contributes to the integrity of genomic data and supports reliable taxonomic classification in microbial research.