xBitterT5: an explainable transformer-based framework with multimodal inputs for identifying bitter-taste peptides

Journal: Journal of Cheminformatics, Volume: 17, Issue: 1
DOI: https://doi.org/10.1186/s13321-025-01078-1
PMID: https://pubmed.ncbi.nlm.nih.gov/40835950
Publication Date: 2025-08-20
Author(s): Nguyen Doan Hieu Nguyen et al.
Primary Topic: Biochemical Analysis and Sensing Techniques

Overview

The research introduces xBitterT5, a novel multimodal framework designed for the identification of bitter peptides (BPs) derived from protein hydrolysis. This framework integrates pretrained transformer-based embeddings from BioT5+ with both peptide sequences and their SELFIES molecular representations. By leveraging this dual-modality approach, xBitterT5 significantly enhances the accuracy of BP identification compared to traditional methods, which often struggle with the complexities of biological sequence data. The model not only achieves superior performance on benchmark datasets BTP640 and BTP720 but also offers residue-level interpretability, elucidating the specific molecular substructures that contribute to bitterness.

The study underscores the importance of accurately identifying BPs, as their presence can adversely affect food quality and consumption. xBitterT5 represents a significant advancement in the field, being the first application of BioT5+ for multimodal peptide-based prediction. The model’s interpretability provides mechanistic insights that extend beyond mere predictions, making it a valuable tool for both computational biologists and experimental researchers. The framework is accessible via a user-friendly web server and a standalone version, with additional resources available for further research applications in peptide-centric studies, including those related to anticancer and antimicrobial peptides.

Introduction

In vertebrates, taste perception is crucial for food selection and the identification of harmful substances, with bitterness serving as a key evolutionary defense mechanism. Bitter peptides (BPs), which arise from the enzymatic hydrolysis of proteins, activate bitter taste receptors and are significant in food science as indicators of quality. Beyond their sensory roles, BPs are biologically active and have been linked to various physiological functions and diseases, including cancer. The accurate identification of BPs is essential for applications in both flavor enhancement and therapeutic design. Recent advancements in bioinformatics, particularly through machine learning (ML) and deep learning (DL), have led to improved predictive models for BP identification, moving from traditional feature descriptors to more sophisticated approaches that capture complex biological data.

Despite the progress made with models like iBitter-Fuse and BERT4Bitter, existing methods often fail to fully leverage molecular representations of peptides and lack interpretability regarding the specific motifs responsible for bitterness. To address these limitations, the authors introduce xBitterT5, a novel multimodal framework that integrates peptide sequences with their molecular representations using a pretrained transformer model. This approach not only enhances predictive performance but also provides insights into the structure-activity relationships of BPs. Key contributions of xBitterT5 include its unique combination of sequence and molecular data, improved interpretability, and the availability of a user-friendly web server and open-source resources for researchers in the field.

Methods

In this section, the authors detail the performance evaluation of their proposed models, xBitterT5-640 and xBitterT5-720, against existing methods on the BTP640 and BTP720 datasets, respectively. The results indicate that xBitterT5-640 outperformed the existing methods it was compared with, including the best of them, iBitter-DRLF, with improvements in Matthews correlation coefficient (MCC) ranging from 1.70% to 21.80% and in accuracy (ACC) from 0.60% to 10.90% across those methods. Notably, xBitterT5-640 achieved balanced performance, with sensitivity (SN) and specificity (SP) both at 0.953, addressing a bias toward one class that is common in other methods. The authors attribute this gain to the multimodal input, specifically the combination of peptide sequences with their SELFIES representations.
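The metrics named above all follow from the four counts of a binary confusion matrix. The sketch below uses the standard definitions; the counts are hypothetical, chosen only so that SN and SP both come out near 0.953, and are not taken from the paper.

```python
import math


def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Return ACC, SN, SP, and MCC from binary confusion-matrix counts.

    Assumes at least one positive and one negative example.
    """
    total = tp + fn + tn + fp
    sn = tp / (tp + fn)  # sensitivity: share of bitter peptides recovered
    sp = tn / (tn + fp)  # specificity: share of non-bitter peptides rejected
    acc = (tp + tn) / total
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when a row or column of the matrix is empty.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"acc": acc, "sn": sn, "sp": sp, "mcc": mcc}


# Hypothetical counts: 64 bitter and 64 non-bitter peptides, 3 errors per class.
metrics = binary_metrics(tp=61, fn=3, tn=61, fp=3)
print({name: round(value, 3) for name, value in metrics.items()})
```

With equal error counts in both classes, SN, SP, and ACC coincide (61/64, about 0.953), which is the balanced behaviour the summary describes. MCC is lower (about 0.906) because it penalises both error types at once.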

For the BTP720 dataset, xBitterT5-720 was compared with CPM-BP, the only method trained on this dataset. The results showed that xBitterT5-720 consistently outperformed CPM-BP across all metrics, with significant improvements of 11.60% in MCC, 5.40% in ACC, 4.70% in F1, and 4.70% in AUC during training. On the independent dataset, xBitterT5-720 also demonstrated superior performance with improvements of 6.30% in MCC, 3.50% in ACC, 4.10% in F1, and 7.50% in AUC. These findings underscore the effectiveness of the BioT5+ model in extracting meaningful information from both peptide and molecule sequences, confirming the robust generalizability and accuracy of the xBitterT5 framework in bitter peptide identification.
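F1 and AUC, the two additional metrics in this comparison, can be sketched in a few lines. This is an illustration of the standard definitions, not the authors' evaluation code; the labels and scores are made up, and AUC is computed by direct pairwise comparison, which is only practical for small sets.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, written in terms of counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def roc_auc(labels: list[int], scores: list[float]) -> float:
    """Probability that a random positive scores above a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical predictions: 1 = bitter, 0 = non-bitter.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc(labels, scores)
f1 = f1_score(tp=2, fp=1, fn=1)  # counts at a 0.45 decision threshold
print(round(auc, 3), round(f1, 3))
```

AUC depends only on how the scores rank the peptides, whereas F1, like ACC and MCC, depends on the chosen decision threshold. That difference is one reason the two can improve by different amounts for the same model.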

Results

The reported results are benchmark comparisons between classifiers. On BTP640, xBitterT5-640 improved on the compared methods in both MCC and ACC while keeping SN and SP equal at 0.953. On BTP720, xBitterT5-720 outperformed CPM-BP on every reported metric, both on the training data and on the independent dataset. In the comparison of input configurations, combining peptide sequences with their SELFIES representations gave the best performance on both datasets.

Discussion

In this section, the authors discuss the methodology and findings related to the development and evaluation of a model for identifying bitter peptides (BPs) using two benchmark datasets: BTP640 and BTP720. The datasets were carefully curated, with BTP640 containing 320 BPs and 320 non-BPs, and BTP720 expanding upon this by including only peptides with unique bitter tastes. Both datasets were split into training and testing sets to facilitate fair comparisons. The authors employed the Simplified Molecular Input Line Entry System (SMILES) and SELFIES formats to represent each peptide as a molecule, enhancing the model's ability to learn from the relatively short peptide sequences.
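A balanced dataset such as BTP640 is usually split so that both classes keep the same ratio in the training and testing sets. The sketch below shows one way to do that; the 20% test fraction, the seed, and the placeholder peptide identifiers are assumptions for illustration and are not stated in the summary.

```python
import random


def stratified_split(items, labels, test_fraction=0.2, seed=0):
    """Split (item, label) pairs so each class keeps the same test fraction."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        members = [x for x, y in zip(items, labels) if y == cls]
        rng.shuffle(members)  # shuffle within the class only
        n_test = round(len(members) * test_fraction)
        test += [(x, cls) for x in members[:n_test]]
        train += [(x, cls) for x in members[n_test:]]
    return train, test


# Placeholder identifiers stand in for the 320 BPs and 320 non-BPs of BTP640.
items = [f"bp_{i}" for i in range(320)] + [f"non_bp_{i}" for i in range(320)]
labels = [1] * 320 + [0] * 320
train, test = stratified_split(items, labels)
print(len(train), len(test))
```

Under these assumed settings the split yields 512 training and 128 testing peptides, with 64 of each class in the testing set. A fixed seed keeps the split reproducible, which is what makes comparisons between methods on the same split fair.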

The architecture of the proposed model integrates pretrained language models, specifically variants of T5, to classify BPs. The T5 extractor processes the input and produces hidden states, which a dedicated classification head then maps to a prediction. The authors conducted extensive evaluations using various configurations, revealing that the multimodal approach, which combines peptide sequences with SELFIES representations, yielded the highest performance metrics across both datasets. Notably, the BioT5+ model with this multimodal input achieved significant improvements in classification metrics, underscoring the effectiveness of integrating diverse data modalities for BP prediction. The findings highlight the importance of model architecture and data representation in enhancing predictive performance in machine learning tasks related to biological data.
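The extractor-plus-head design can be illustrated without any deep learning library. The sketch below assumes mean pooling over non-padding tokens followed by a single linear layer and a sigmoid. The summary does not say how the authors pool the hidden states or how their head is built, so both choices, along with all the numbers, are hypothetical.

```python
import math


def mean_pool(hidden_states, attention_mask):
    """Average the token vectors whose attention-mask entry is 1."""
    kept = [h for h, m in zip(hidden_states, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(h[j] for h in kept) / len(kept) for j in range(dim)]


def classify(pooled, weights, bias):
    """Linear layer plus sigmoid: probability that the peptide is bitter."""
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy encoder output: three tokens of width 3; the last token is padding.
hidden_states = [[0.2, -0.1, 0.4], [0.6, 0.3, 0.0], [0.0, 0.0, 0.0]]
attention_mask = [1, 1, 0]
weights, bias = [1.0, -2.0, 0.5], 0.0  # hypothetical trained parameters

pooled = mean_pool(hidden_states, attention_mask)
probability = classify(pooled, weights, bias)
print([round(v, 3) for v in pooled], round(probability, 3))
```

In a real model the hidden states would come from the pretrained T5 encoder and the head parameters would be learned during fine-tuning. Masking the padding before pooling matters because peptide inputs vary in length.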

Limitations

The section on limitations highlights several areas for future enhancement of the xBitterT5 framework, which currently excels in identifying bitter peptides within the BTP640 and BTP720 datasets. One key limitation is its exclusive focus on bitter peptides; extending the model to encompass a wider range of flavor-associated peptides—such as those linked to sweet, sour, salty, and umami tastes—could significantly improve its capability to differentiate the molecular substructures tied to various flavors. This expansion necessitates the creation of a comprehensive dataset with accurately annotated labels to reduce ambiguities during training and enhance the model’s generalizability.

Additionally, the integration of pretrained models that utilize diverse molecular data representations, such as InChI, DeepSMILES, and Group SELFIES, is suggested as a fruitful avenue for exploration. This multimodal approach could yield a deeper understanding of the intricate relationship between peptide molecular structures and flavor perception. Furthermore, the methodology’s applicability extends beyond flavor peptides, presenting opportunities for the investigation of a wide array of peptide therapeutics, including anticancer, antimicrobial, tumor-homing, and cell-penetrating peptides. By employing chemically informed molecular representations, the proposed framework could facilitate systematic studies of the structural and functional characteristics of short peptide sequences, thereby advancing peptide-based drug discovery and design.