XGBoost-enhanced ensemble model using discriminative hybrid features for the prediction of sumoylation sites

Journal: BioData Mining, Volume: 18, Issue: 1
DOI: https://doi.org/10.1186/s13040-024-00415-8
PMID: https://pubmed.ncbi.nlm.nih.gov/39901279
Publication Date: 2025-02-03
Author(s): Salman Khan et al.
Primary Topic: Machine Learning in Bioinformatics

Overview

This overview covers the role of posttranslational modifications (PTMs), particularly sumoylation, in regulating protein function, and their relevance to diseases such as Parkinson’s and Alzheimer’s. The study presents XGBoost-Sumo, a predictive model that identifies sumoylation sites by integrating protein structural and sequence information. For feature extraction, it combines embeddings from a transformer-based attention mechanism with evolutionary descriptors from the PsePSSM-DWT approach. SHapley Additive exPlanations (SHAP) is used to select the most informative features, and eXtreme Gradient Boosting (XGBoost) performs the classification. XGBoost-Sumo achieved an accuracy of 99.68% on benchmark datasets and 96.08% on independent samples, which the authors report as a notable improvement over existing models.

In conclusion, XGBoost-Sumo demonstrated high reliability and accuracy in identifying sumoylated sites, validated through rigorous 10-fold cross-validation. The model’s superior performance compared to traditional machine learning methods highlights its potential for practical applications in pharmaceutical development. Future work will focus on enhancing the model through transfer learning, hyperparameter optimization, and ensemble techniques, alongside implementing parallel programming methodologies to improve efficiency, speed, and scalability.
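
The 10-fold cross-validation mentioned above can be illustrated with a minimal sketch. It assumes a stratified split, so that each fold keeps the class ratio; the summary does not say whether the authors stratified. The names `stratified_kfold`, `cross_validate`, and the `fit_predict` callback are hypothetical and chosen for illustration only.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs that roughly preserve class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's shuffled indices round-robin into the k folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for f, fold in enumerate(folds) if f != i for idx in fold
        )
        yield train_idx, test_idx


def cross_validate(fit_predict, features, labels, k=10, seed=0):
    """Return the mean accuracy over k folds.

    fit_predict(train_X, train_y, test_X) is a hypothetical callback that
    trains any classifier and returns predicted labels for test_X.
    """
    accuracies = []
    for train_idx, test_idx in stratified_kfold(labels, k, seed):
        preds = fit_predict(
            [features[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [features[i] for i in test_idx],
        )
        correct = sum(p == labels[i] for p, i in zip(preds, test_idx))
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)
```

With a balanced set of 780 positive and 780 negative samples, each of the ten test folds in this sketch holds 78 samples of each class, and every sample is used for testing exactly once.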

Introduction

The introduction highlights the critical role of proteins in cellular functions across prokaryotic and eukaryotic organisms, emphasizing the importance of posttranslational modifications (PTMs) in protein regulation. Among these, sumoylation, in which SUMO proteins are covalently attached to lysine residues, plays a significant role in various cellular processes, including nuclear-cytoplasmic transport and gene expression regulation. The link between sumoylation and neurodegenerative diseases such as Alzheimer’s and Parkinson’s underscores its relevance to understanding protein misfolding and cellular homeostasis.

Recent advances in computational biology have produced numerous models for predicting sumoylation sites, and tools such as SUMOsp and Deep-Sumo report high accuracy. The present study introduces XGBoost-Sumo, a computational model that uses eXtreme Gradient Boosting to identify sumoylation sites. It extracts features with a Pseudo Position-Specific Scoring Matrix combined with Discrete Wavelet Transform (PsePSSM-DWT) and with Bidirectional Encoder Representations from Transformers (BERT), then selects features with SHapley Additive exPlanations (SHAP). With this pipeline, XGBoost-Sumo achieved an accuracy of 99.68% on benchmark datasets and 96.08% on independent samples. The authors present this as an improvement in prediction accuracy over existing methods and as a robust tool for future research on sumoylation site prediction.
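
To make the PsePSSM part of the feature extraction more concrete, the sketch below follows the commonly used pseudo-PSSM formulation: the per-column means of a normalized L x 20 PSSM, followed by lagged mean squared differences along the sequence. The logistic normalization, the default `max_lag`, and the function names are assumptions made for illustration; the summary does not give the authors' exact settings, and the wavelet (DWT) step is not shown.

```python
import math


def normalize_pssm(pssm):
    """Squash raw PSSM scores into (0, 1) with a logistic function.

    This is one common normalization choice, assumed here for illustration.
    """
    return [[1.0 / (1.0 + math.exp(-score)) for score in row] for row in pssm]


def pse_pssm(pssm, max_lag=2):
    """Return a fixed-length vector of n_cols * (1 + max_lag) values.

    pssm is a list of L rows (one per residue), each with n_cols scores
    (20 for a standard amino-acid PSSM).
    """
    length = len(pssm)
    n_cols = len(pssm[0])
    if max_lag >= length:
        raise ValueError("max_lag must be smaller than the sequence length")

    # Part 1: mean score of each column over the whole sequence.
    features = [sum(row[j] for row in pssm) / length for j in range(n_cols)]

    # Part 2: for each lag, the mean squared difference between residues
    # that are `lag` positions apart, computed per column.
    for lag in range(1, max_lag + 1):
        for j in range(n_cols):
            total = sum(
                (pssm[i][j] - pssm[i + lag][j]) ** 2
                for i in range(length - lag)
            )
            features.append(total / (length - lag))
    return features
```

The point of this encoding is that peptides of any length map to a vector of the same size, while the lag terms retain some sequence-order information that plain column averages would discard.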

Methods

The paper’s “Experimental Results and Analysis” section outlines how data relevant to the study’s hypotheses were gathered and interpreted. The experiments were designed to test the proposed models systematically, combining quantitative measurements with statistical analyses to ensure robust findings. Key metrics were established to evaluate model performance, and control conditions were used to isolate the effects of the independent variables.

The results indicate a significant correlation between the manipulated variables and the observed outcomes, with statistical significance confirmed through appropriate tests (e.g., p < 0.05). The analysis also includes graphical representations of the data that highlight trends and deviations from expected results. Overall, the findings support the initial hypotheses and provide a foundation for subsequent research.

Discussion

In the discussion, the authors present a comprehensive framework for predicting sumoylation sites and emphasize the importance of a well-curated benchmark dataset for training and validation. They drew sequences from the Compendium of Protein Lysine Modifications (CPLM), filtered them, and applied the Near Miss undersampling algorithm to address class imbalance, yielding a balanced dataset of 780 sumoylation and 780 non-sumoylation site sequences for model training. The authors also ran two independent tests, one on a balanced and one on an imbalanced independent set, to evaluate the model’s generalizability under varying conditions.
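
The class-balancing step can be sketched as follows. The summary names the Near Miss algorithm without specifying a variant, so this example assumes the NearMiss-1 rule: keep the majority-class samples whose mean distance to their k nearest minority-class samples is smallest. The function name, the Euclidean metric, and the default `k` are illustrative assumptions.

```python
import math


def near_miss_1(majority, minority, n_keep=None, k=3):
    """Undersample `majority` with the NearMiss-1 rule (assumed variant).

    majority, minority: lists of equal-length numeric feature vectors.
    n_keep: number of majority samples to retain; defaults to the size of
    the minority class, which yields a balanced dataset.
    """
    if n_keep is None:
        n_keep = len(minority)

    def mean_knn_distance(point):
        # Mean Euclidean distance to the k closest minority samples.
        dists = sorted(math.dist(point, m) for m in minority)
        nearest = dists[:k]
        return sum(nearest) / len(nearest)

    # Majority samples lying closest to the minority class come first.
    ranked = sorted(majority, key=mean_knn_distance)
    return ranked[:n_keep]
```

Called with the non-sumoylation feature vectors as `majority` and the 780 sumoylation vectors as `minority`, this sketch would return 780 negatives. It compares every majority sample with every minority sample, so it is meant as an illustration and not as an efficient implementation.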

Feature extraction combined Position-Specific Scoring Matrix (PSSM), Pseudo-PSSM, and Discrete Wavelet Transform (DWT) descriptors, which together capture evolutionary and structural characteristics of peptide sequences. BERT embeddings further enriched the feature set, producing a hybrid vector that combines PsePSSM-DWT and BERT features. The authors used SHAP for feature selection to identify the features with the greatest impact on prediction, which they report significantly improved model performance. The XGBoost classifier reached 99.68% accuracy and remained robust across the validation tests, outperforming existing models for sumoylation site prediction. The authors suggest that future work explore transfer learning and hyperparameter optimization to further improve performance and scalability.
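
One common way to turn SHAP values into a feature selection is to rank features by their mean absolute SHAP value and keep the top k. The sketch below assumes that ranking rule and assumes the SHAP matrix (one row per sample, one column per feature) has already been computed, for example by a tree explainer applied to a fitted XGBoost model. The summary does not state the authors' exact selection criterion or the number of features kept, and the function names are hypothetical.

```python
def rank_features_by_mean_abs_shap(shap_values):
    """Return feature indices ordered from most to least important.

    shap_values: list of rows, one per sample, each holding one SHAP value
    per feature. Importance is the mean absolute SHAP value of a feature.
    """
    n_samples = len(shap_values)
    n_features = len(shap_values[0])
    importance = [
        sum(abs(row[j]) for row in shap_values) / n_samples
        for j in range(n_features)
    ]
    return sorted(range(n_features), key=lambda j: importance[j], reverse=True)


def select_top_k(features, shap_values, k):
    """Keep only the k highest-ranked columns of the hybrid feature matrix.

    Returns the reduced matrix and the kept column indices in their
    original order, so the same columns can be selected at test time.
    """
    keep = sorted(rank_features_by_mean_abs_shap(shap_values)[:k])
    reduced = [[row[j] for j in keep] for row in features]
    return reduced, keep
```

Returning the kept indices alongside the reduced matrix matters in practice: the selection should be derived from training data only and then applied unchanged to the independent test samples.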