A comprehensive survey of scoring functions for protein docking models

Journal: BMC Bioinformatics, Volume: 26, Issue: 1
DOI: https://doi.org/10.1186/s12859-024-05991-4
PMID: https://pubmed.ncbi.nlm.nih.gov/39844036
Publication Date: 2025-01-22
Author(s): Azam Shirali et al.
Primary Topic: Machine Learning in Bioinformatics

Overview

The authors highlight the role of scoring functions in protein-protein docking, which is essential for understanding protein interactions. They emphasize that the effectiveness of a docking tool depends heavily on how accurately it distinguishes native from non-native binding complexes. Although many innovative scoring functions have been proposed, no universally effective scoring method has emerged. The authors suggest that deep learning models are a promising alternative to traditional empirical or mathematical scoring functions.

The study surveys current scoring methods, covering both classical and deep learning approaches, and evaluates their performance and runtime efficiency. The runtime comparison matters for large-scale docking applications, where computational cost is a limiting factor. The authors assess the strengths and weaknesses of these methods on seven widely recognized public datasets, aiming to give insight into advances in scoring functions for protein-protein complexes and to guide future research in this area.

Introduction

The introduction emphasizes the significance of protein-protein docking in understanding protein interactions, which are crucial for biological function and therapeutic development. It highlights the importance of accurate scoring functions that can distinguish native from non-native binding complexes, a challenge underscored by the CAPRI (Critical Assessment of PRedicted Interactions) contest. Despite advances in structural biology through experimental techniques such as NMR, X-ray crystallography, and cryogenic electron microscopy, efficient computational methods are still needed because experimental approaches are costly and time-consuming.

The paper aims to provide a comprehensive comparison of state-of-the-art scoring tools for protein-protein docking, categorizing them into four main types: physics-based, empirical, knowledge-based, and machine learning (ML) or deep learning (DL) based methods. While the classical methods (physics-based, empirical, and knowledge-based) are well established, ML and DL approaches have added both complexity and the potential for improved scoring functions. The authors note that previous studies have not adequately compared classical and modern methods side by side, nor have they consistently benchmarked scoring functions across multiple datasets. This study seeks to address these gaps by evaluating scoring functions on seven widely recognized public datasets, clarifying how they perform in different contexts.

Methods

In this study, the authors evaluate the performance of eight classical scoring methods and four deep learning (DL)-based methods in the context of protein docking. The classical methods are described briefly, while the DL methods are highlighted for their successful application across various domains, including protein science. The study emphasizes the importance of understanding misclassification patterns in docking models, focusing on the CAPRI Score v2022 dataset, which comprises 41,191 easy and 39,130 difficult models. Notably, 652 easy and 512 difficult models were misclassified by all methods; these form the AllWrongSets. The AllButOneWrongSets contain the models that were correctly classified by exactly one method.
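The AllWrongSets and AllButOneWrongSets described above can be derived from a table of per-method correctness. The following is a minimal sketch, assuming a hypothetical input format (a mapping from method name to the set of model IDs that method classified correctly). The function name, method names, and toy data are illustrative and do not come from the paper.

```python
from collections import Counter


def misclassification_sets(all_models, correct_by_method):
    """Split models by how many methods classified them correctly.

    all_models: iterable of model IDs.
    correct_by_method: dict mapping method name -> set of model IDs that
        the method classified correctly (hypothetical input format).
    Returns (all_wrong, all_but_one_wrong), where all_but_one_wrong maps
    each model ID to the single method that classified it correctly.
    """
    hits = Counter()
    last_correct = {}
    for method, correct in correct_by_method.items():
        for model in correct:
            hits[model] += 1
            last_correct[model] = method

    # No method was right: the model belongs to the "all wrong" set.
    all_wrong = {m for m in all_models if hits[m] == 0}
    # Exactly one method was right: record which one.
    all_but_one_wrong = {
        m: last_correct[m] for m in all_models if hits[m] == 1
    }
    return all_wrong, all_but_one_wrong


# Toy data: four models, three methods (illustrative only).
models = ["m1", "m2", "m3", "m4"]
correct = {
    "method_a": {"m1", "m2"},
    "method_b": {"m1"},
    "method_c": {"m1", "m4"},
}
all_wrong, all_but_one = misclassification_sets(models, correct)
print(sorted(all_wrong))
print(all_but_one)
```

Keeping the name of the single successful method alongside each model makes it possible to count, per method, how often it was the only one to classify a model correctly.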

The analysis reveals that the performance of DL methods is influenced by several factors, including the quality of the docking models, the training dataset’s characteristics, and the specific features utilized in model training. DeepRank-GNN emerged as the most effective DL method, likely due to its inclusion of amino acid types and Position-Specific Scoring Matrix (PSSM) features. The study also highlights the significance of the relationship between training and test datasets in determining model performance, suggesting that high-quality docking models and relevant features are critical for developing effective scoring functions. Further investigation into the generalizability of these scoring functions is warranted, as indicated in the subsequent section of the paper.

Results

The performance of twelve methods is evaluated across seven datasets, with a specific focus on the CAPRI Score v2022 dataset, which is divided into difficult and easy complexes. Because that dataset is reported as two subsets, the results cover eight datasets in total. The best pretrained models for the deep learning (DL)-based methods were used, as recommended by the original authors. For instance, GNN-DOVE used fold model 5, while DeepRank-GNN used the PSSMGen tool to derive Position-Specific Scoring Matrix (PSSM) features. The PIsToN method was configured with a patch size of 16 Å, the setting reported in the literature to give the best performance.

For dMaSIF, the average, minimum, and maximum scores over potential contacts were calculated; the average scores gave the highest Area Under the Curve (AUC) on all datasets and are therefore the ones reported. Classical methods were run through the CCharPPI server, which computes values without requiring pre-calculated features. The quality of docking models was assessed with the CAPRI-Q tool, which assigns each model one of four quality levels (incorrect, acceptable, medium, or high) based on metrics such as the fraction of native contacts recovered (fnat) and Root Mean Square Deviation (RMSD) values. Models classified as acceptable, medium, or high were treated as correct docking models.
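The four-level quality assignment and the binary "correct" label described above can be sketched as follows. The thresholds are the commonly cited CAPRI criteria on fnat, ligand RMSD, and interface RMSD. This is an assumption, since the summary does not list the cut-offs and the exact rules used by CAPRI-Q may differ. The function names are hypothetical.

```python
def capri_quality(fnat, l_rmsd, i_rmsd):
    """Return 'high', 'medium', 'acceptable', or 'incorrect'.

    fnat: fraction of native contacts recovered (0 to 1).
    l_rmsd, i_rmsd: ligand and interface RMSD in angstroms.
    Thresholds follow the commonly cited CAPRI criteria (assumed here,
    not taken from the summarized paper).
    """
    if fnat >= 0.5 and (l_rmsd <= 1.0 or i_rmsd <= 1.0):
        return "high"
    if fnat >= 0.3 and (l_rmsd <= 5.0 or i_rmsd <= 2.0):
        return "medium"
    if fnat >= 0.1 and (l_rmsd <= 10.0 or i_rmsd <= 4.0):
        return "acceptable"
    return "incorrect"


def is_correct(fnat, l_rmsd, i_rmsd):
    """Binary label: acceptable, medium, and high all count as correct."""
    return capri_quality(fnat, l_rmsd, i_rmsd) != "incorrect"


# Illustrative values only.
print(capri_quality(0.60, 0.8, 0.9))
print(capri_quality(0.60, 3.0, 1.5))
print(capri_quality(0.15, 9.0, 3.5))
print(is_correct(0.05, 20.0, 12.0))
```

The checks cascade from the strictest level downward, so a model with high fnat but RMSD values outside the "high" limits falls through to the next level it satisfies.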

Discussion

In the discussion section, the paper evaluates the scoring functions used for protein-protein docking models, highlighting their methodologies and performance across multiple datasets. The methods are categorized into empirical (e.g., FireDock, RosettaDock, ZRANK2), knowledge-based (e.g., AP-PISA, CP-PIE, SIPPER), and hybrid approaches (e.g., PyDock, HADDOCK). The study emphasizes the use of the CCharPPI server to isolate scoring functions from the docking process, allowing a focused comparison of their effectiveness. The results indicate that while classical methods such as AP-PISA and CP-PIE excelled on certain datasets, DL-based methods such as PIsToN and dMaSIF generally outperformed classical approaches on six of the eight datasets, particularly in terms of ROC AUC and success rate.
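The two metrics named above, ROC AUC and success rate, can be illustrated with a short sketch. It assumes that a higher score means a better model (energy-like scores, where lower is better, would need to be negated first) and that success rate means the fraction of targets with at least one correct model among the top N ranked models. The summary does not give the paper's exact definitions, so both are assumptions, and the function names and toy data are hypothetical.

```python
def roc_auc(scores, labels):
    """ROC AUC as the probability that a correct model outscores an
    incorrect one (pairwise form; ties count as half a win).

    This O(n_pos * n_neg) loop is only meant for small examples.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect model")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def top_n_success_rate(targets, n):
    """Fraction of targets with a correct model among the top n by score.

    targets: dict mapping target ID -> list of (score, is_correct) pairs.
    """
    successes = 0
    for models in targets.values():
        ranked = sorted(models, key=lambda m: m[0], reverse=True)
        if any(ok for _, ok in ranked[:n]):
            successes += 1
    return successes / len(targets)


# Toy data (illustrative only).
auc = roc_auc([0.9, 0.8, 0.4, 0.3], [True, False, True, False])
toy_targets = {
    "t1": [(0.9, False), (0.8, True), (0.1, False)],
    "t2": [(0.7, False), (0.6, False), (0.5, True)],
}
print(auc)
print(top_n_success_rate(toy_targets, 2))
```

For datasets with tens of thousands of models, a rank-based implementation such as the one in scikit-learn would be the practical choice; the pairwise loop is used here only because it makes the definition explicit.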

The paper further discusses the generalizability of these scoring functions, using Template Modeling scores (TM-scores) to assess how similar the test datasets are to the training sets. It finds that dMaSIF and PIsToN demonstrated superior performance despite having training datasets that contain complexes more similar to certain test datasets. The findings suggest that while some DL-based methods may exhibit training biases, they still achieve higher performance metrics, indicating their potential utility in practical applications. Overall, the study underscores the importance of evaluating both scoring functions and their generalizability in order to improve the accuracy of protein-protein interaction predictions.
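To make the TM-score comparison above concrete, the sketch below evaluates the standard TM-score formula for one fixed superposition and then takes, for each test complex, its highest TM-score against any training complex. Several points are assumptions: real tools search over superpositions to maximize the score, the 0.5 Å floor on the distance scale is modelled on common implementations, and the summary does not state how the paper aggregated TM-scores. The function names and toy values are hypothetical.

```python
def tm_d0(target_length):
    """Length-dependent distance scale d0 in angstroms.

    The 0.5 floor for short chains is an assumption modelled on
    common implementations.
    """
    if target_length <= 15:
        return 0.5
    return max(1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8, 0.5)


def tm_score(distances, target_length):
    """TM-score for one fixed superposition.

    distances: distances in angstroms between aligned residue pairs.
    target_length: number of residues in the reference structure.
    """
    d0 = tm_d0(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


def nearest_training_similarity(tm_rows):
    """For each test complex, the highest TM-score to any training complex.

    tm_rows: dict mapping test ID -> list of TM-scores against the
        training complexes (hypothetical precomputed input).
    """
    return {test_id: max(scores) for test_id, scores in tm_rows.items()}


# Toy values (illustrative only).
print(tm_score([0.0] * 100, 100))
print(nearest_training_similarity({"test_a": [0.31, 0.82], "test_b": [0.44]}))
```

A test set whose complexes all have a high nearest-training TM-score is structurally close to the training data, so strong results on it say less about generalization than results on a dissimilar test set.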