Improving deep learning protein monomer and complex structure prediction using DeepMSA2 with huge metagenomics data

Journal: Nature Methods, Volume: 21, Issue: 2
DOI: https://doi.org/10.1038/s41592-023-02130-4
PMID: https://pubmed.ncbi.nlm.nih.gov/38167654
Publication Date: 2024-01-02
Author(s): Wei Zheng et al.
Primary Topic: Genomics and Phylogenetic Studies

Overview

DeepMSA2 is introduced as a new pipeline for uniformly constructing multiple-sequence alignments (MSAs) for both single-chain and multichain proteins, using iterative alignment searches across genomic and metagenomic databases. Extensive benchmarking indicates that MSAs generated by DeepMSA2 significantly improve the accuracy of protein tertiary and quaternary structure prediction compared to existing state-of-the-art methods. Notably, in the CASP15 experiment, an integrated pipeline built on DeepMSA2 produced protein complex models of higher quality than those generated by the AlphaFold2-Multimer server (v.2.2.0).

The primary advantage of DeepMSA2 is attributed to its balanced alignment search and effective model selection, together with its ability to leverage huge metagenomic databases. The study underscores the critical role of input quality in deep learning-based protein structure prediction, suggesting that optimizing MSA construction is as important as designing the predictive models themselves. The findings mark a significant advance in protein structure prediction, particularly because metagenomic sequence databases keep growing in size and complexity, which makes rapid and accurate MSA construction challenging.

Results

DeepMSA2 comprises two distinct pipelines for constructing multiple sequence alignments (MSAs), one for monomers and one for multimers. For monomer MSA construction, DeepMSA2 runs three blocks in parallel: dMSA, quadruple MSA (qMSA), and mMSA. These blocks use different search strategies to gather raw MSAs from extensive genomic and metagenomic databases. The searches are iterative, to ensure a sufficient number of effective sequences, and are followed by a deep learning-guided ranking that selects the optimal MSA. In contrast, multimer MSA construction links monomeric sequences from orthologous origins, generating hybrid MSAs that are evaluated based on depth and folding scores.
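The linking of monomeric sequences from orthologous origins can be illustrated with a minimal Python sketch. It is not the DeepMSA2 implementation: the function names, the `(species, aligned_sequence)` row format, and the rule of keeping one hit per species are all assumptions made for illustration, and the hybrids are ranked by depth only because a folding score would require a structure predictor.

```python
from itertools import product


def best_hit_per_species(msa):
    """Keep the first hit listed for each species in one chain's MSA.

    `msa` is a list of (species, aligned_sequence) tuples; rows are assumed
    (hypothetically) to be sorted from best to worst hit.
    """
    best = {}
    for species, aligned_seq in msa:
        best.setdefault(species, aligned_seq)
    return best


def pair_by_species(msa_a, msa_b):
    """Concatenate rows of two chain MSAs that come from the same species."""
    hits_a = best_hit_per_species(msa_a)
    hits_b = best_hit_per_species(msa_b)
    # Follow chain A's ordering so the query row (first entry) stays on top.
    return [(sp, hits_a[sp] + hits_b[sp]) for sp in hits_a if sp in hits_b]


def rank_hybrid_msas(candidates_a, candidates_b):
    """Pair every candidate MSA of chain A with every candidate of chain B
    and sort the hybrid MSAs by depth (number of paired rows), deepest first.
    """
    hybrids = []
    for (name_a, msa_a), (name_b, msa_b) in product(candidates_a, candidates_b):
        hybrids.append(((name_a, name_b), pair_by_species(msa_a, msa_b)))
    return sorted(hybrids, key=lambda item: len(item[1]), reverse=True)


# Toy data: species labels and sequences are made up.
chain_a = [("query", "MKT-A"), ("E_coli", "MRT-A"),
           ("B_subtilis", "MKS-A"), ("E_coli", "MQT-A")]
chain_b = [("query", "GLV"), ("B_subtilis", "GIV"), ("H_sapiens", "GLI")]

for species, row in pair_by_species(chain_a, chain_b):
    print(species, row)
```

Only species present in both chain MSAs contribute a paired row, so a hybrid MSA can be much shallower than either monomer MSA. That is one reason depth is a natural ranking criterion.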

The findings indicate that the enhanced sequence databases and algorithms in DeepMSA2 significantly improve both MSA quality and the resulting structural models. Notably, DMFold, which uses the full version of DeepMSA2, outperformed AlphaFold2 in both the number of effective sequences (Neff) and TM-score. In one case study, DMFold reached a TM-score of 0.73 with a pLDDT of 0.71, compared to a TM-score of 0.20 for AlphaFold2. The results underscore the importance of coevolutionary information in structure prediction: DeepMSA2 provided a deeper MSA with 42 homologous sequences, leading to improved modeling accuracy for previously challenging proteins. Overall, the study demonstrates DeepMSA2’s capability to enhance protein structure prediction through more informative MSAs.
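To make the Neff metric concrete, the sketch below computes one common definition of the number of effective sequences: each row is down-weighted by the number of rows that are at least 80% identical to it, and the summed weights are divided by the square root of the alignment length. The 0.8 cutoff, the length normalization, and the function names are assumptions of this sketch. The summary does not state the exact variant used in the paper.

```python
import math


def sequence_identity(a, b):
    """Fraction of alignment columns where two equal-length rows carry
    the same residue. Gap characters ('-') never count as a match."""
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return matches / len(a)


def neff(msa, identity_cutoff=0.8):
    """Number of effective sequences of an MSA given as equal-length strings."""
    length = len(msa[0])
    total = 0.0
    for i, row in enumerate(msa):
        # Count the row itself plus every other row above the identity cutoff.
        cluster_size = 1 + sum(
            1
            for j, other in enumerate(msa)
            if j != i and sequence_identity(row, other) >= identity_cutoff
        )
        total += 1.0 / cluster_size
    return total / math.sqrt(length)


# Two identical rows share one unit of weight; the divergent row counts fully.
toy_msa = ["ACDE", "ACDE", "WWWW"]
print(neff(toy_msa))
```

Under this definition, adding near-duplicate sequences barely changes Neff, whereas adding divergent homologs raises it. This is why Neff is a better proxy for coevolutionary signal than the raw number of rows.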

Discussion

For monomer structure prediction, the DeepMSA2 pipeline outperforms five other commonly used MSA tools (BLAST, HHblits, HMMER, MMseqs2, and PSI-BLAST) across several metrics, including TM-score (template modeling score), long-range contact precision, and mean absolute distance error. Notably, when integrated with a modified version of AlphaFold2, referred to as DMFold, DeepMSA2 demonstrated a statistically significant improvement in TM-scores for 63% of the tested monomer proteins from the CASP13-15 experiments. The average TM-score of the models generated by DMFold was 0.821, compared to 0.781 for AlphaFold2, and DMFold excelled particularly on challenging domains.
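For readers unfamiliar with the TM-score, the following sketch evaluates its standard formula for a single, fixed superposition of a model onto the reference structure. Real TM-score programs additionally search for the superposition that maximizes the score, which is omitted here. The function names and the 0.5 Å floor on the distance scale are assumptions of this sketch.

```python
def tm_d0(target_length):
    """Length-dependent distance scale d0 (in angstroms) of the TM-score."""
    if target_length > 15:
        d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    else:
        d0 = 0.5
    # Floor for very short chains (an assumption of this sketch).
    return max(d0, 0.5)


def tm_score(distances, target_length):
    """TM-score for one fixed superposition.

    distances: distances (angstroms) between aligned residue pairs,
               typically measured between C-alpha atoms
    target_length: number of residues in the reference structure
    """
    d0 = tm_d0(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


# A model that matches the reference exactly scores 1.0.
print(tm_score([0.0] * 100, 100))
# A model that covers only half of the reference perfectly scores 0.5.
print(tm_score([0.0] * 50, 100))
```

Because the sum is divided by the length of the reference rather than by the number of aligned residues, unaligned regions lower the score. The length-dependent scale makes values comparable across proteins of different sizes.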

Furthermore, applying the DeepMSA2/DMFold pipeline to the human proteome showed that DMFold produced higher-confidence models (as measured by pLDDT) for difficult sequences, with 94% of DMFold models surpassing their AlphaFold2 counterparts. The study also highlights the importance of effective MSAs for both monomer and multimer structure prediction, with DMFold-Multimer achieving higher TM-scores than AlphaFold2-Multimer on protein complex structures. The results indicate that integrating extensive genomic and metagenomic databases with a refined MSA generation strategy contributes significantly to improved protein structure modeling, particularly for the complex targets of the CASP15 experiment, where DMFold-Multimer outperformed all other methods.