Deep-STP: a deep learning-based approach to predict snake toxin proteins by using word embeddings

Journal: Frontiers in Medicine, Volume: 10
DOI: https://doi.org/10.3389/fmed.2023.1291352
PMID: https://pubmed.ncbi.nlm.nih.gov/38298505
Publication Date: 2024-01-17
Author(s): Hasan Zulfiqar et al.
Primary Topic: Venomous Animal Envenomation and Studies

Overview

The study explores the potential of snake venom proteins, known for their toxic effects on the circulatory and nervous systems, in developing pharmacological treatments for related diseases. Because traditional biochemical methods for identifying these proteins are costly and time-consuming, the study introduces a sequence-based computational approach that uses artificial intelligence for large-scale screening. The authors encoded the protein sequences with three feature descriptors: g-gap dipeptide composition, natural vector (NV), and word2vec (W2V). They optimized these features using analysis of variance (ANOVA) and a gradient boosting decision tree (GBDT) algorithm combined with incremental feature selection (IFS), achieving an accuracy of 82.00% in 10-fold cross-validation and 81.14% on independent data.

The study concludes that the model classifies snake toxin proteins effectively and generalizes well to independent data. The best performance was achieved with a convolutional neural network (CNN)-based classifier. The authors plan to extend the work by building a web application for the model and by exploring more advanced feature selection techniques to improve classification efficiency. The dataset and code are publicly available at https://github.com/linDing-groups/Deep-STP.

Introduction

The introduction describes the complex composition of snake venom, which contains various toxin proteins that act on the circulatory, nervous, and locomotor systems of prey and so facilitate predation. The key toxins named include serine proteinases, metalloproteinases, and L-amino acid oxidases; the paper specifically notes that phospholipases A2 and L-amino acid oxidases from Pseudechis australis exhibit antibacterial properties. It also highlights the therapeutic potential of snake venom components, citing captopril, a drug derived from a snake venom peptide that is used to treat hypertension and reduce the risk of heart failure.

Given the complexity of snake venom proteins and their potential in drug development, the authors stress the need for efficient identification methods. Similarity-based bioinformatics tools such as FASTA, HAlign, and BLAST are of limited use when no homologous sequences are available. To address this gap, the authors propose a deep learning-based predictor named Deep-STP, designed to recognize snake toxin proteins from sequence alone. The methodology encodes protein sequences with several descriptors, optimizes the feature set through ANOVA and GBDT, and evaluates the model using 10-fold cross-validation and an independent dataset.

Methods

The authors emphasize that a predictive model is only as reliable as the data used to build it. They collected positive and negative samples from the open-access databases UniProt and RefSeq and removed redundant sequences at an 80% sequence identity cutoff. This filtering produced a final dataset of 270 positive and 339 negative sequences from notable protein families associated with snake toxins.
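The redundancy-removal step can be illustrated with a greedy filter. This is only a minimal sketch: the summary does not name the tool the authors used, and the `similarity` function below is a crude stand-in (Python's `difflib` ratio), not an alignment-based sequence identity. The function names and the toy sequences are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Crude similarity proxy in [0, 1]; NOT a true alignment-based identity."""
    # autojunk=False stops difflib from ignoring frequent residues in long sequences.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def filter_redundant(sequences, cutoff=0.8):
    """Greedily keep a sequence only if it is below the cutoff against all kept ones."""
    kept = []
    # Longest sequences are considered first so they act as representatives.
    for seq in sorted(sequences, key=len, reverse=True):
        if all(similarity(seq, rep) < cutoff for rep in kept):
            kept.append(seq)
    return kept


# Toy example: the second sequence differs from the first by one residue.
toy = ["ACDEFGHIKL", "ACDEFGHIKV", "WWWWYYYYPP"]
non_redundant = filter_redundant(toy, cutoff=0.8)
print(non_redundant)
```

In practice a dedicated sequence-clustering tool would replace `similarity`; the greedy loop is the part that carries over.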

The dataset was then divided into a training set (80%) and an independent test set (20%). Holding out the independent set gives an objective assessment of model performance on sequences not seen during training; the split is detailed in Supplementary Table S1.
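As a rough illustration of the 80/20 division, the sketch below splits the 270 positive and 339 negative labels per class. Stratifying by class, the rounding rule, and the random seed are assumptions made here; the authors' exact per-set counts are those in Supplementary Table S1 and may differ.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), holding out test_fraction of each class."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)  # assumed rounding rule
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# 1 = snake toxin protein (positive), 0 = non-toxin (negative)
labels = [1] * 270 + [0] * 339
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Splitting within each class keeps the positive-to-negative ratio roughly equal in both sets, which matters for a dataset of this size.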

Results

The results section reports the performance of the feature-selected model. With the optimal subset of 167 features, the model reached an accuracy of 82.00% in 10-fold cross-validation and 81.14% on the independent dataset. The CNN achieved an AUROC of 0.926 on the training dataset and 0.917 on the independent dataset, and it outperformed the other machine learning classifiers it was compared with.

The authors present these results as evidence that the model generalizes beyond its training data. As further work, they propose a web application for the model and more advanced feature selection techniques.

Discussion

In this section, the authors discuss how snake toxin protein sequences were encoded and then classified with machine learning techniques. Three feature descriptors were used: g-gap dipeptide composition, natural vector (NV), and word2vec (W2V). The g-gap dipeptide composition captures relationships between amino acid residues separated by a gap of g positions, while the NV scheme provides a 60-dimensional representation of each protein sequence, facilitating phylogenetic analysis. The W2V approach, specifically the continuous bag-of-words (CBOW) model, was adapted to train representations of protein sequences, yielding 200-dimensional embeddings.
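The first two descriptors can be sketched as follows. The summary gives no formulas, so this follows the commonly used definitions and is an assumption about the details: g-gap dipeptide composition as the frequency of residue pairs separated by g positions (400 values), and the natural vector as the count, mean position, and normalized second central moment of each of the 20 amino acids (60 values). Function names are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 dipeptides


def g_gap_dipeptide(seq, g=0):
    """Frequencies of residue pairs (seq[i], seq[i+g+1]); g=0 means adjacent."""
    counts = dict.fromkeys(PAIRS, 0)
    total = 0
    for i in range(len(seq) - g - 1):
        pair = seq[i] + seq[i + g + 1]
        if pair in counts:  # pairs with non-standard residues are skipped
            counts[pair] += 1
            total += 1
    return [counts[p] / total if total else 0.0 for p in PAIRS]


def natural_vector(seq):
    """60 values: counts, mean positions, and scaled second moments per residue."""
    n = len(seq)
    counts, means, moments = [], [], []
    for aa in AMINO_ACIDS:
        positions = [i + 1 for i, r in enumerate(seq) if r == aa]  # 1-based
        n_k = len(positions)
        if n_k == 0:  # absent residue contributes zeros
            counts.append(0.0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(positions) / n_k
        d2_k = sum((p - mu_k) ** 2 for p in positions) / (n_k * n)
        counts.append(float(n_k))
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments


gap0 = g_gap_dipeptide("ACAC", g=0)
gap1 = g_gap_dipeptide("ACAC", g=1)
nv = natural_vector("ACAC")
print(len(gap0), len(nv))
```

Both functions return fixed-length vectors regardless of sequence length, which is what allows sequences of different lengths to feed one classifier.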

The authors treat feature selection as a crucial step that improves model performance by eliminating redundant features. They used ANOVA and gradient boosting decision trees (GBDT) to identify an optimal subset of features, reaching a maximum accuracy of 82.00% with 167 features. The proposed convolutional neural network (CNN) model was evaluated with accuracy, precision, recall, and F1-score, and it outperformed the other machine learning classifiers tested. The CNN achieved an area under the receiver operating characteristic curve (AUROC) of 0.926 on the training dataset and 0.917 on the independent dataset, indicating that it generalizes well. The findings suggest that combining diverse feature descriptors with deep learning can improve the prediction of snake toxin proteins and support future applications in pharmacological research.
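The ranking-plus-IFS idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' pipeline: features are ranked by a one-way ANOVA F-score only (the GBDT ranking is omitted), and the evaluator is a hypothetical leave-one-out nearest-centroid classifier standing in for the CNN with 10-fold cross-validation. The toy data are invented for the example.

```python
from statistics import mean


def anova_f(values, labels):
    """One-way ANOVA F-score: between-class variance over within-class variance."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    grand = mean(values)
    k, n = len(groups), len(values)
    means = {y: mean(g) for y, g in groups.items()}
    between = sum(len(g) * (means[y] - grand) ** 2 for y, g in groups.items()) / (k - 1)
    within = sum((v - means[y]) ** 2 for y, g in groups.items() for v in g) / (n - k)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within


def rank_features(X, y):
    """Feature indices sorted from highest to lowest F-score."""
    scores = [anova_f([row[j] for row in X], y) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def loo_centroid_accuracy(X, y):
    """Stand-in evaluator: leave-one-out accuracy of a nearest-centroid rule."""
    correct = 0
    for i in range(len(X)):
        centroids = {}
        for cls in set(y):
            rows = [X[j] for j in range(len(X)) if j != i and y[j] == cls]
            centroids[cls] = [mean(col) for col in zip(*rows)]
        pred = min(centroids, key=lambda c: sum((a - b) ** 2 for a, b in zip(X[i], centroids[c])))
        correct += pred == y[i]
    return correct / len(X)


def incremental_feature_selection(X, y, evaluate):
    """Add ranked features one at a time; keep the smallest best-scoring prefix."""
    order = rank_features(X, y)
    best_k, best_score = 0, float("-inf")
    for k in range(1, len(order) + 1):
        X_k = [[row[j] for j in order[:k]] for row in X]
        score = evaluate(X_k, y)
        if score > best_score:  # strict '>' prefers fewer features on ties
            best_k, best_score = k, score
    return order[:best_k], best_score


# Toy data: feature 0 separates the classes, feature 1 is noise, feature 2 is constant.
X = [[0.10, 5.0, 1.0], [0.20, 1.0, 1.0], [0.15, 3.0, 1.0],
     [0.90, 4.0, 1.0], [1.00, 2.0, 1.0], [0.95, 3.5, 1.0]]
y = [0, 0, 0, 1, 1, 1]
selected, best = incremental_feature_selection(X, y, loo_centroid_accuracy)
print(selected, best)
```

In the study's setting the same loop would run over the combined descriptor features, with the 167-feature subset being the prefix that maximized cross-validated accuracy.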