DPFunc: accurately predicting protein function via deep learning with domain-guided structure information

Journal: Nature Communications, Volume: 16, Issue: 1
DOI: https://doi.org/10.1038/s41467-024-54816-8
PMID: https://pubmed.ncbi.nlm.nih.gov/39746897
Publication Date: 2025-01-02
Author(s): Wenkang Wang et al.
Primary Topic: Machine Learning in Bioinformatics

Overview

This section gives an overview of DPFunc, a deep learning approach to protein function prediction that also addresses the limited interpretability of existing computational methods. Accurate function prediction matters because it deepens our understanding of biological mechanisms and can aid the treatment of complex diseases. DPFunc uses domain-guided structure information to identify the critical regions within protein structures and bases its function predictions on them.

The findings indicate that DPFunc outperforms current state-of-the-art methods and improves markedly on existing structure-based approaches. Detailed analyses show that domain information is pivotal to this result, because it lets the model identify key residues or regions that correlate closely with protein function. Overall, DPFunc is presented as a powerful tool for large-scale protein function prediction that can advance protein biology and the understanding of complex biological systems.

Introduction

The authors highlight how much DPFunc gains from domain information. In an ablation, they replace the domain attention block with a mean pooling layer, the aggregation used by earlier models such as DeepFRI and GAT-GO, and compare the result with the full model. The domain-guided model scores higher on both Fmax and AUPR in all three categories: Molecular Function (MF), Cellular Component (CC), and Biological Process (BP). Specifically, domain guidance raises the median AUPR by 12.0%, 14.7%, and 16.3% for MF, CC, and BP, respectively.
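The difference between the two aggregation strategies in this ablation can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy vectors, and the use of a single `domain_query` vector scored by dot product are all assumptions made for illustration. Mean pooling gives every residue equal weight, whereas an attention-style pooling lets a domain-derived signal decide which residues dominate the protein-level representation.

```python
import math

def mean_pool(residue_feats):
    """Average all residue vectors with equal weight (the ablation baseline)."""
    n = len(residue_feats)
    dim = len(residue_feats[0])
    return [sum(vec[d] for vec in residue_feats) / n for d in range(dim)]

def attention_pool(residue_feats, domain_query):
    """Weight residues by softmax(dot(residue, domain_query)), then sum.

    `domain_query` is a hypothetical stand-in for whatever domain-derived
    signal guides the attention; it is an assumption of this sketch.
    """
    scores = [sum(x * q for x, q in zip(vec, domain_query))
              for vec in residue_feats]
    top = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(residue_feats[0])
    pooled = [sum(w * vec[d] for w, vec in zip(weights, residue_feats))
              for d in range(dim)]
    return pooled, weights

# Toy protein with three residues and two feature dimensions.
residues = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
query = [2.0, 0.0]                          # "domain" signal favouring dimension 0

mean_vec = mean_pool(residues)
attn_vec, attn_weights = attention_pool(residues, query)
```

In this toy case the first residue matches the query, so it receives the largest attention weight and the pooled vector shifts toward it, while mean pooling dilutes it among the other residues.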

Further analysis shows that DPFunc with domain annotations predicts more Gene Ontology (GO) terms perfectly (AUPR = 1) and performs better on high-confidence predictions. The mean F-measure improves by 1.6% to 3.1% for MF, 1.9% to 3.3% for CC, and 5.5% to 6.7% for BP, which supports the model's reliability and accuracy. Overall, the findings underline the importance of domain information for protein function prediction, with the most pronounced effects in the CC and BP categories.

Methods

This section evaluates DPFunc against several state-of-the-art protein function prediction methods, both sequence-based and structure-based. The evaluation uses a dataset of experimentally validated PDB structures and their corresponding functions, with Fmax and AUPR as metrics. DPFunc performs best in all three categories: molecular function (MF), cellular component (CC), and biological process (BP). Compared with GAT-GO, it improves Fmax by 8% to 27% and AUPR by 7% to 42%, particularly after a post-processing procedure is applied.
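For readers unfamiliar with Fmax, the following sketch shows the protein-centric definition commonly used in CAFA-style evaluations: at each score threshold, precision is averaged over proteins with at least one prediction, recall is averaged over all annotated proteins, and Fmax is the best resulting F-measure. The summary does not spell out the exact evaluation protocol, so the threshold grid, the function name, and the toy GO identifiers below are assumptions.

```python
def fmax(y_true, y_score, thresholds=None):
    """Protein-centric Fmax.

    y_true:  list of sets of true GO terms, one set per protein
    y_score: list of dicts mapping GO term -> predicted score in [0, 1]
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]   # assumed grid
    n_annotated = sum(1 for truth in y_true if truth)
    best = 0.0
    for t in thresholds:
        precisions = []
        recall_sum = 0.0
        for truth, scores in zip(y_true, y_score):
            predicted = {term for term, s in scores.items() if s >= t}
            hits = len(predicted & truth)
            if predicted:                   # precision only where something is predicted
                precisions.append(hits / len(predicted))
            if truth:
                recall_sum += hits / len(truth)
        if not precisions or n_annotated == 0:
            continue
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / n_annotated
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best

# Toy example with two proteins and placeholder term identifiers.
truth = [{"GO:1", "GO:2"}, {"GO:3"}]
scores = [
    {"GO:1": 0.9, "GO:2": 0.4, "GO:3": 0.2},
    {"GO:3": 0.8, "GO:1": 0.5},
]
result = fmax(truth, scores)
```

Because Fmax picks the best threshold after the fact, it measures the best achievable trade-off between precision and recall rather than performance at a fixed operating point, which is why it is usually reported alongside AUPR.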

DPFunc also consistently outperforms other methods, including various composite and web-server approaches, even on proteins with low sequence identity. Its architecture combines domain information derived from protein sequences, residue features from a pre-trained language model, and structure graphs, which together enable effective learning of protein features. DPFunc is especially strong at predicting informative Gene Ontology (GO) terms, meaning those with high information content (IC), and remains robust on terms deeper in the GO hierarchy. Overall, the findings point to clear advantages in handling unseen proteins and predicting complex functions, and position DPFunc as a leading method in the field.
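Information content quantifies how specific a GO term is: rare terms carry more information than terms annotated to almost every protein. The sketch below uses the simple frequency-based form, IC(t) = -log2 P(t), where P(t) is the fraction of proteins annotated with term t. The source does not say which IC variant was used (some evaluations condition on the parent terms instead), so this formula, the function name, and the placeholder term names are assumptions.

```python
import math
from collections import Counter

def information_content(annotations):
    """Frequency-based IC per term: -log2(fraction of proteins annotated).

    annotations: list of sets of GO terms, one set per protein, assumed to be
    already propagated to ancestor terms.
    """
    n = len(annotations)
    counts = Counter(term for terms in annotations for term in terms)
    return {term: -math.log2(count / n) for term, count in counts.items()}

# Toy annotation set with placeholder term names.
proteins = [
    {"root", "binding", "kinase"},
    {"root", "binding"},
    {"root"},
    {"root"},
]
ic = information_content(proteins)
```

Here the root term, present on every protein, has an IC of zero, while the rarest term has the highest IC, which matches the intuition that deeper, more specific terms are harder and more valuable to predict.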

Results

The “Results” section of the paper presents the key findings from the experiments and analyses, highlighting the significant data points and trends observed. The results are typically supported by statistical analyses, such as p-values and confidence intervals, that test the stated hypotheses.

Figures and tables are referenced to illustrate the findings. The section may also compare the results with previous studies, either emphasizing novel contributions or confirming existing theories. Overall, it provides a concise summary of the empirical evidence and sets the stage for the discussion and interpretation that follow.

Discussion

This section discusses the development and capabilities of DPFunc, a deep learning method for protein function prediction that uses domain-guided structural information. DPFunc comprises three main modules: a residue-level feature learning module, a protein-level feature learning module, and a protein function prediction module. The residue-level module uses a pre-trained protein language model to generate residue features and graph neural networks to propagate them among residues. The protein-level module integrates domain information to improve the representation of the whole protein structure. The function prediction module then annotates protein functions from these combined features.
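The idea of propagating features among residues over a structure graph can be sketched as follows. This is a toy illustration, not DPFunc's network: the 10 Å distance cutoff, the use of one coordinate per residue, and plain mean aggregation in place of learned graph neural network layers are all assumptions, as are the function names.

```python
import math

def contact_graph(coords, cutoff=10.0):
    """Adjacency lists linking residues whose coordinates lie within `cutoff`.

    coords: one (x, y, z) tuple per residue; the cutoff value is an assumption.
    """
    n = len(coords)
    neighbors = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                neighbors[i].append(j)
                neighbors[j].append(i)
    return neighbors

def propagate(features, neighbors):
    """One round of mean aggregation over each residue and its neighbors.

    A real graph neural network would apply learned weights and a
    non-linearity; this keeps only the message-passing pattern.
    """
    updated = []
    for i, feat in enumerate(features):
        group = [feat] + [features[j] for j in neighbors[i]]
        updated.append([sum(vals) / len(group) for vals in zip(*group)])
    return updated

# Three residues: the first two are in contact, the third is far away.
coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
features = [[1.0, 0.0], [0.0, 1.0], [4.0, 4.0]]

graph = contact_graph(coords)
smoothed = propagate(features, graph)
```

After one round, the two residues in contact share information and end up with the same averaged vector, while the isolated residue keeps its original features. Stacking several such rounds lets information travel between residues that are close in space even when they are far apart in sequence.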

DPFunc is particularly good at distinguishing structural motifs and their associated functions when proteins have low sequence similarity but similar structures. For instance, it identifies key residues and functional sites in enzymes, which shows that it can predict function accurately even for divergent sequences. DPFunc also shows promise in annotating bacterial proteins, where it outperforms existing methods, especially in difficult cases involving rare or specific functions. Overall, the findings underline DPFunc's potential to improve protein function prediction by combining structural insights with domain knowledge, and they point toward further advances in understanding protein function across biological contexts.