Balancing speed and precision in protein folding: a comparison of AlphaFold2, ESMFold, and OmegaFold

Journal: Frontiers in Genetics, Volume: 16
DOI: https://doi.org/10.3389/fgene.2025.1715037
PMID: https://pubmed.ncbi.nlm.nih.gov/41608648
Publication Date: 2026-01-13
Author(s): Anna Hyskova et al.
Primary Topic: Protein Structure and Dynamics

Overview

The rapid advancement of protein structure prediction tools calls for systematic performance evaluations to guide method selection, particularly given the trade-off between computational speed and prediction accuracy. In this study, we benchmarked three prominent tools (AlphaFold2, ESMFold, and OmegaFold) on a dataset of 1,337 protein chains from the Protein Data Bank, selected so that none overlap with the tools' training data. We assessed the predictions with several metrics: root mean square deviation (RMSD), template modeling score (TM-score), global distance test total score (GDT-TS), and predicted local distance difference test (pLDDT).
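To make the three accuracy metrics concrete, the following is a minimal Python sketch, not the evaluation code used in the study. It assumes the model and reference Cα coordinates are already superposed and paired residue by residue. Standard TM-score and GDT-TS implementations search over superpositions, so the values here describe only the one superposition supplied. The function names and the toy coordinates are hypothetical.

```python
import math


def distances(model, reference):
    """Per-residue distances (in Å) between paired C-alpha coordinates."""
    if len(model) != len(reference):
        raise ValueError("model and reference must have the same number of residues")
    return [math.dist(m, r) for m, r in zip(model, reference)]


def rmsd(model, reference):
    """Root mean square deviation over all paired residues."""
    d = distances(model, reference)
    return math.sqrt(sum(x * x for x in d) / len(d))


def tm_score(model, reference):
    """TM-score for the given superposition, normalized by the reference length."""
    d = distances(model, reference)
    length = len(reference)
    # d0 is the length-dependent distance scale. Clamping it at 0.5 Å for
    # short chains is an assumption made here to keep the formula defined.
    d0 = max(0.5, 1.24 * (length - 15) ** (1 / 3) - 1.8) if length > 15 else 0.5
    return sum(1 / (1 + (x / d0) ** 2) for x in d) / length


def gdt_ts(model, reference, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """GDT-TS in percent: mean fraction of residues within 1, 2, 4 and 8 Å."""
    d = distances(model, reference)
    fractions = [sum(x <= c for x in d) / len(d) for c in cutoffs]
    return 100 * sum(fractions) / len(cutoffs)


# Toy coordinates (hypothetical): per-residue deviations of 0, 0.5, 3 and 9 Å.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
model = [(0.0, 0.0, 0.0), (1.0, 0.5, 0.0), (2.0, 3.0, 0.0), (3.0, 9.0, 0.0)]

print(rmsd(model, reference))    # 4.75
print(gdt_ts(model, reference))  # 62.5
```

The toy example shows why the metrics can disagree: a single residue that is 9 Å off dominates the RMSD, while GDT-TS and TM-score still credit the residues that are placed well.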

Our findings indicate that AlphaFold2 outperformed the other tools, achieving the highest median TM-score (0.96), the highest median GDT-TS (94%), and the lowest median RMSD (1.30 Å). In comparison, ESMFold and OmegaFold recorded median TM-scores of 0.95 and 0.93, respectively, with corresponding GDT-TS and RMSD values indicating lower accuracy. Notably, all methods performed worse on proteins without family annotations and on those with specific structural features. Interestingly, alignment-free methods performed better on de novo designed proteins. Because the performance differences are minimal for many proteins, we propose that alignment-free predictors, which run 10 to 30 times faster, may suffice for a variety of applications. To further assist practitioners, we developed LightGBM classifiers that use ProtBert embeddings and confidence scores to predict when the computational investment in AlphaFold2 is justified, supporting informed decisions between speed and accuracy in structure prediction workflows.
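The speed trade-off can be illustrated with simple arithmetic. The sketch below assumes a hypothetical triage workflow that the summary does not spell out: every chain is first predicted with a fast alignment-free tool, and a classifier then flags a fraction of chains for a rerun with AlphaFold2. The function name and the 20% rerun fraction are illustrative assumptions, not figures from the study. Only the 10 to 30 times speedup comes from the text.

```python
def relative_cost(speedup, rerun_fraction):
    """Compute cost of fast-first triage relative to running AlphaFold2 on every chain.

    speedup: how many times faster the alignment-free predictor is.
    rerun_fraction: share of chains the classifier sends to AlphaFold2 (hypothetical).
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    if not 0 <= rerun_fraction <= 1:
        raise ValueError("rerun_fraction must lie between 0 and 1")
    # Every chain pays the fast predictor's cost (1 / speedup);
    # flagged chains additionally pay the full AlphaFold2 cost.
    return 1.0 / speedup + rerun_fraction


for speedup in (10, 30):
    cost = relative_cost(speedup, rerun_fraction=0.2)
    print(f"speedup {speedup}x: {cost:.2f} of the all-AlphaFold2 cost")
```

Under these assumptions the total compute drops to roughly a quarter to a third of the all-AlphaFold2 baseline, and the rerun fraction quickly becomes the dominant term as the speedup grows.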

Introduction

The paper's introduction highlights the critical role of proteins in cellular processes across all living organisms and the importance of correct folding into a hierarchy of structures: primary, secondary, tertiary, and quaternary. Misfolding can lead to disease, which underscores the need for accurate protein structure determination. Traditional methods such as X-ray crystallography, cryo-electron microscopy (cryo-EM), and nuclear magnetic resonance (NMR) spectroscopy are reliable but often impractical because of the time and cost they demand. This has led to a growing reliance on computational prediction methods, especially as the gap between known protein sequences and experimentally determined structures widens.

The release of AlphaFold2 in 2020 marked a significant advance in the field, achieving near-experimental accuracy and prompting the development of alternative predictors such as ESMFold and OmegaFold, which use language models to predict structures faster without multiple sequence alignments. Despite these advances, comprehensive evaluations of these tools on novel proteins remain scarce, particularly for structures solved after the tools' training cutoff dates. This study systematically compares AlphaFold2, ESMFold, and OmegaFold on a dataset of more than 1,300 protein structures deposited in the Protein Data Bank (PDB) between 2022 and 2024. Using RMSD, TM-score, GDT-TS, and pLDDT as evaluation metrics, the analysis shows that AlphaFold2 has the highest average accuracy, while ESMFold and OmegaFold perform exceptionally well in specific scenarios, particularly for proteins with limited homology information. The findings also indicate a significant speed advantage for alignment-free methods, which helps researchers decide when to use these faster tools for large-scale structural analyses.

Results

Structure predictions were run for 1,337 protein chains with three computational tools: AlphaFold2, ESMFold, and OmegaFold. The AlphaFold2 pipeline initially failed to predict the structure of one chain (8B2M:A). Rerunning the pipeline produced a successful prediction, which suggests that the initial failure stemmed from a transient issue in the university's computing service rather than from an inherent problem with the protein itself.
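A transient failure of this kind is commonly handled by rerunning the job automatically. The following is a minimal sketch of such a retry wrapper. It is an illustration only and not the authors' pipeline: `run_with_retries` and `flaky_predict` are hypothetical names, and the stand-in predictor simply fails once to imitate a temporary cluster problem.

```python
import time


def run_with_retries(predict, chain_id, max_attempts=3, wait_seconds=0.0):
    """Call predict(chain_id), retrying when it raises; return (result, attempts used)."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return predict(chain_id), attempt
        except Exception as error:  # a real pipeline would catch narrower errors
            last_error = error
            time.sleep(wait_seconds)
    raise RuntimeError(f"{chain_id} failed after {max_attempts} attempts") from last_error


# Hypothetical stand-in for a prediction job that fails on its first call only.
calls = {"count": 0}


def flaky_predict(chain_id):
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("transient compute-service error")
    return f"predicted structure for {chain_id}"


result, attempts = run_with_retries(flaky_predict, "8B2M:A")
print(result, "after", attempts, "attempts")
```

Recording the number of attempts alongside the result makes it easy to tell chains that needed a rerun from chains that fail persistently, which would point to a problem with the input rather than the infrastructure.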

Ultimately, each of the three tools produced a prediction for every protein chain, and these predictions form the basis for the subsequent performance evaluation. Selected examples, aligned with their corresponding experimental structures, are shown in Figure 1. That every chain could be predicted underscores the reliability of the computational methods used.

Discussion

A benchmark dataset of 1,337 protein structures was compiled from the Protein Data Bank (PDB) to evaluate three protein structure prediction tools: AlphaFold2, ESMFold, and OmegaFold. The dataset, curated to avoid overlap with prior training data, includes single-chain monomers, small multi-chain complexes, and de novo designed proteins. AlphaFold2 consistently outperformed both ESMFold and OmegaFold across the accuracy metrics: root mean square deviation (RMSD), template modeling score (TM-score), and global distance test total score (GDT-TS). Notably, AlphaFold2 achieved the highest median TM-score (0.96) and the lowest median RMSD (1.30 Å), while ESMFold and OmegaFold followed with median TM-scores of 0.95 and 0.93, respectively.

The study also highlighted how experimental structure determination methods and protein family annotations affect prediction accuracy. All tools performed best on structures solved by X-ray crystallography, while NMR-determined structures posed significant challenges. Interestingly, AlphaFold2 struggled with proteins lacking family annotations, which suggests that it relies on evolutionary information, whereas ESMFold and OmegaFold maintained consistent performance regardless of family assignment. The findings indicate that although AlphaFold2 is the most accurate overall, ESMFold may be a more reliable choice for certain predictions, particularly for de novo designed proteins, where language model-based approaches show advantages. The research underscores the importance of choosing a prediction tool based on the specific characteristics of the protein being analyzed.