Deep-ProBind: binding protein prediction with transformer-based deep learning model

Journal: BMC Bioinformatics, Volume: 26, Issue: 1
DOI: https://doi.org/10.1186/s12859-025-06101-8
PMID: https://pubmed.ncbi.nlm.nih.gov/40121399
Publication Date: 2025-03-22
Author(s): Salman Khan et al.
Primary Topic: Machine Learning in Bioinformatics

Overview

This overview covers the significance of binding proteins in biological systems, emphasizing their selective interactions with molecules such as DNA, RNA, and peptides, which are crucial for many cellular functions. Traditional methods for identifying protein-binding peptides are often inefficient and inaccurate because they focus on proximal sequence features without considering structural data. To address these challenges, the study introduces Deep-ProBind, a prediction model that integrates sequence and structural information using a transformer architecture and an evolution-based attention mechanism. The model encodes peptides with Bidirectional Encoder Representations from Transformers (BERT) and a Pseudo Position-Specific Scoring Matrix with Discrete Wavelet Transform (PsePSSM-DWT) approach. The SHapley Additive exPlanations (SHAP) algorithm selects the optimal features, and a Deep Neural Network (DNN) performs the classification.

Deep-ProBind achieved 92.67% accuracy on benchmark datasets and 93.62% on independent samples, outperforming traditional machine learning algorithms and existing models by notable margins. The model mitigated overfitting through hyperparameter optimization and shows potential as a tool in pharmacological research, particularly in therapeutic development for diseases such as breast cancer. Future work aims to improve the model’s adaptability through transfer learning, refine its architecture, and address scalability, while acknowledging that the relatively small dataset may limit generalizability. The study highlights the importance of accurately predicting protein-binding peptides for advancing drug discovery and therapeutic strategies.

Introduction

In the introduction, the paper discusses the roles of peptides (short chains of amino acids) in biological functions including hormone regulation, cell signaling, and defense mechanisms. It highlights the potential of peptide design in therapeutic applications, such as modulating enzyme activity and addressing antibiotic resistance. The paper emphasizes the importance of binding proteins in biochemical pathways and the difficulty of identifying and characterizing these proteins, which stems from factors such as peptide size and low binding affinity.

To tackle these challenges, the authors propose Deep-ProBind, a deep learning framework for predicting protein-binding sites. The model uses a transformer-based attention mechanism (BERT) and incorporates evolutionary features through a PsePSSM-DWT approach. By combining word embeddings with evolutionary descriptors and applying the SHAP algorithm for feature selection, Deep-ProBind reaches accuracies of 92.67% on benchmark datasets and 93.62% on independent samples, outperforming existing models. The study positions Deep-ProBind as a tool for researchers in pharmaceutical development, improving the prediction of peptide-binding sites that matter for therapeutic advances.
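The fusion and selection step described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the input arrays are placeholders for real BERT embeddings and PsePSSM-DWT descriptors, and the SHAP values come from the closed form for a linear model with independent features, standing in for whatever explainer and model the paper actually used. Ranking features by mean absolute SHAP value is also an assumed selection criterion.

```python
import numpy as np

def fuse_features(bert_emb, evo_desc):
    """Concatenate per-peptide feature blocks column-wise (rows = peptides)."""
    return np.hstack([bert_emb, evo_desc])

def linear_shap_values(weights, X):
    """Closed-form SHAP values for a linear model with independent features.

    phi[i, j] = w[j] * (X[i, j] - mean_j). This is a stand-in (assumption):
    the paper's classifier is a DNN, which needs a different explainer.
    """
    X = np.asarray(X, dtype=float)
    return np.asarray(weights, dtype=float) * (X - X.mean(axis=0))

def select_top_k(shap_values, k):
    """Indices of the k features with the largest mean |SHAP|, highest first."""
    importance = np.abs(shap_values).mean(axis=0)
    return np.argsort(importance)[::-1][:k]
```

A typical use would be `X_sel = X[:, select_top_k(phi, k)]`, with the reduced matrix passed to the classifier. The value of `k` is a tuning choice that the summary does not state.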

Methods

In this section, the authors evaluate the proposed model with two kinds of validation: K-fold cross-validation and independent tests. K-fold cross-validation averages performance over several train/test splits so that every sample is used for testing exactly once, which makes the estimate less dependent on any single split. Specifically, tenfold cross-validation was run on the benchmark dataset to estimate the overall accuracy of the proposed model. This testing framework aims to validate the model’s reliability and effectiveness in practical applications.
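The tenfold procedure can be illustrated with a minimal NumPy sketch. Several pieces here are assumptions made for the example: stratifying the folds by class, the helper names, and the nearest-centroid classifier, which is a deliberately simple stand-in and not the paper's DNN.

```python
import numpy as np

def stratified_kfold_indices(labels, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs with class ratios preserved per fold."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        # Split this class's shuffled indices into k near-equal chunks, one per fold.
        for i, chunk in enumerate(np.array_split(idx, k)):
            folds[i].extend(chunk.tolist())
    all_idx = np.arange(len(labels))
    for fold in folds:
        test_idx = np.array(sorted(fold))
        yield np.setdiff1d(all_idx, test_idx), test_idx

def nearest_centroid_predict(X_train, y_train, X_test):
    """Stand-in classifier (NOT the paper's DNN): pick the closer class mean."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    d0 = np.linalg.norm(X_test - c0, axis=1)
    d1 = np.linalg.norm(X_test - c1, axis=1)
    return (d1 < d0).astype(int)

def cross_validated_accuracy(X, y, k=10, seed=0):
    """Return (mean accuracy, per-fold accuracies) over k stratified folds."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    accs = []
    for train_idx, test_idx in stratified_kfold_indices(y, k, seed):
        pred = nearest_centroid_predict(X[train_idx], y[train_idx], X[test_idx])
        accs.append(float(np.mean(pred == y[test_idx])))
    return float(np.mean(accs)), accs
```

With a balanced set of 800 positive and 800 negative samples, each of the ten folds holds out 160 samples (80 per class) and trains on the remaining 1440.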

Discussion

In the discussion section, the authors detail the design and components of their model for predicting protein-binding peptides, emphasizing the importance of a well-structured benchmark dataset. They used a dataset that categorizes peptides by binding affinity score, giving a balanced training set of 1600 samples (800 positive and 800 negative) and an imbalanced test set intended to reflect real-world conditions. The authors implemented several feature encoding schemes, including the Position-Specific Scoring Matrix (PSSM) and its extension, Pseudo-PSSM (PsePSSM), to capture evolutionary information and sequence order. They also applied the Discrete Wavelet Transform (DWT) for feature extraction and used the BERT architecture for contextual representation of peptide sequences.
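To make the evolutionary encodings concrete, the sketch below computes PsePSSM features in their commonly used form (20 column means plus 20 lagged squared-difference terms per lag) and a one-level Haar DWT summary of each PSSM column. It rests on assumptions the summary does not confirm: the row-wise z-score normalization, the maximum lag, the Haar wavelet, the single decomposition level, and the max/min/mean/std summary. All function names are hypothetical.

```python
import numpy as np

def normalize_pssm(pssm):
    """Row-wise z-score over the 20 amino-acid columns (one common convention)."""
    pssm = np.asarray(pssm, dtype=float)
    mean = pssm.mean(axis=1, keepdims=True)
    std = pssm.std(axis=1, keepdims=True)
    std[std == 0] = 1.0  # avoid division by zero for constant rows
    return (pssm - mean) / std

def pse_pssm(pssm, max_lag=2):
    """Return a 20 + 20*max_lag vector: column means plus lagged terms.

    For lag g, term j is the mean over i of (P[i, j] - P[i + g, j])**2,
    which retains some sequence-order information.
    """
    p = normalize_pssm(pssm)
    if max_lag >= p.shape[0]:
        raise ValueError("max_lag must be smaller than the sequence length")
    feats = [p.mean(axis=0)]
    for lag in range(1, max_lag + 1):
        diff = p[:-lag] - p[lag:]
        feats.append((diff ** 2).mean(axis=0))
    return np.concatenate(feats)

def haar_dwt(signal):
    """One Haar DWT level; odd-length input is padded by repeating the last value."""
    x = np.asarray(signal, dtype=float)
    if x.size % 2:
        x = np.append(x, x[-1])
    approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return approx, detail

def dwt_summary(pssm):
    """Summarise each PSSM column by max/min/mean/std of its Haar coefficients."""
    p = normalize_pssm(pssm)
    feats = []
    for col in p.T:
        for coeffs in haar_dwt(col):
            feats.extend([coeffs.max(), coeffs.min(), coeffs.mean(), coeffs.std()])
    return np.array(feats)
```

For an L x 20 PSSM, `pse_pssm` with `max_lag=2` returns 60 values and `dwt_summary` returns 160 (20 columns, 2 coefficient bands, 4 statistics), so peptides of different lengths map to vectors of the same size.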

The proposed Deep Neural Network (DNN) model was evaluated using multiple metrics, including accuracy, sensitivity, specificity, and the Matthews correlation coefficient (MCC). The model achieved an accuracy of 92.67% on the benchmark dataset, with feature selection improving performance further. The hybrid feature approach, which combines PsePSSM-DWT and BERT, outperformed the individual encodings, demonstrating the benefit of integrating diverse feature sets. The model’s predictive capabilities were validated on independent datasets, where it achieved high accuracy and AUC scores, underscoring its robustness. The authors also compared their model against various classifiers and report superior performance. Overall, the findings indicate that the proposed model, especially when optimized through feature selection, improves predictive performance in identifying protein-binding peptides.
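For reference, the four threshold-based metrics named above follow directly from the confusion-matrix counts. The helper below is an illustrative sketch with a hypothetical name. It assumes labels coded as 1 (binding) and 0 (non-binding), and it returns 0.0 when a denominator is zero, which is one convention among several.

```python
import math

def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity and MCC for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate

    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "mcc": mcc}
```

MCC is the most informative of the four on an imbalanced test set, because it uses all four confusion-matrix cells and stays near zero for a classifier that only predicts the majority class.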