InstaNovo enables diffusion-powered de novo peptide sequencing in large-scale proteomics experiments

Journal: Nature Machine Intelligence, Volume: 7, Issue: 4
DOI: https://doi.org/10.1038/s42256-025-01019-5
PMID: https://pubmed.ncbi.nlm.nih.gov/41847501
Publication Date: 2025-03-31
Author(s): Kevin Eloff et al.
Primary Topic: Advanced Proteomics Techniques and Applications

Overview

This section discusses advances in mass spectrometry-based proteomics, focusing on peptide identification by tandem mass spectrometry. Traditional methods typically identify peptides by searching protein databases, which is limiting when the sequences of interest are not in the database. To address this limitation, the authors introduce InstaNovo, a transformer model that translates fragment ion peaks directly into peptide sequences without prior database information. The authors report that InstaNovo outperforms existing methods and is applicable across various biological contexts.

Additionally, the authors present InstaNovo+, a diffusion model that improves the accuracy of peptide sequence predictions through iterative refinement. Together, these models improve therapeutics sequencing coverage, facilitate the discovery of novel peptides, and enable the detection of previously unreported organisms in diverse datasets. The findings suggest that these models expand the capabilities of proteomics, particularly in direct protein sequencing, immunopeptidomics, and exploration of the dark proteome.
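To make "fragment ion peaks" concrete, the sketch below computes the theoretical singly charged b- and y-ion m/z values of an unmodified peptide from standard monoisotopic residue masses. This is the forward direction (sequence to peaks); a de novo model such as InstaNovo has to solve the inverse problem from noisy, incomplete spectra. The function name `fragment_mz` is illustrative and is not taken from the paper or the InstaNovo code base.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # mass of H2O (Da)
PROTON = 1.007276   # mass of a proton (Da)


def fragment_mz(peptide):
    """Return (b_ions, y_ions): singly charged m/z values for each cleavage site.

    Assumes an unmodified peptide made of the 20 standard residues.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        # b ion: N-terminal prefix of length i plus one proton.
        b_ions.append(sum(masses[:i]) + PROTON)
        # y ion: complementary C-terminal suffix plus water plus one proton.
        y_ions.append(sum(masses[i:]) + WATER + PROTON)
    return b_ions, y_ions


if __name__ == "__main__":
    b, y = fragment_mz("PEPTIDE")
    for i, (bi, yi) in enumerate(zip(b, y), start=1):
        print(f"b{i} = {bi:.4f}   y{len(b) + 1 - i} = {yi:.4f}")
```

Note that I and L have identical masses, which is one reason sequence recovery from peaks alone is ambiguous.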

Methods

The Methods section of the paper outlines the experimental design and analytical techniques used to investigate the research question. The study takes a quantitative approach, using statistical analyses to evaluate data collected from multiple experiments. The methodologies included controlled experiments in which variables were systematically varied to observe their effects on the outcomes of interest.

Data were collected with standardized instruments to ensure reliability and validity. The analysis used software tools for statistical modeling to assess relationships between variables. Key findings were derived from hypothesis testing with the significance level set at p < 0.05; results below this threshold were treated as statistically significant and taken to support the study's overall conclusions.

Results

The Results section presents findings from multiple datasets analyzed at a 1% false discovery rate (FDR); the noted exception is immunopeptidomics, for which protein-level FDR was not used. Extended Data Table 1 summarizes the database search results, while Extended Data Tables 2 and 3 provide evaluations of InstaNovo and InstaNovo+ across all datasets.
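As a rough illustration of what filtering at a 1% FDR means, the sketch below uses the common target-decoy estimate (decoy hits divided by target hits above a score cutoff). This summary does not state how the FDR was estimated in the study, so the target-decoy approach, the function name `fdr_threshold`, and the `(score, is_decoy)` input format are all assumptions made for the example.

```python
def fdr_threshold(psms, max_fdr=0.01):
    """Return the lowest score cutoff whose estimated FDR is <= max_fdr.

    psms: iterable of (score, is_decoy) pairs, higher score = more confident.
    FDR at a cutoff is estimated as (#decoys) / (#targets) among all PSMs
    scoring at or above that cutoff. Returns None if no cutoff qualifies.
    Score ties are not treated specially in this simplified sketch.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    cutoff = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Remember the deepest point in the ranking that still meets the FDR.
        if targets > 0 and decoys / targets <= max_fdr:
            cutoff = score
    return cutoff


if __name__ == "__main__":
    # Hypothetical scores: 50 confident targets, then one decoy, then 2 targets.
    psms = [(s, False) for s in range(100, 50, -1)]
    psms += [(50, True), (49, False), (48, False)]
    print("score cutoff at 1% FDR:", fdr_threshold(psms))
```

Production tools usually convert this raw estimate into monotone q-values; that step is omitted here for brevity.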

Confidence intervals for the evaluations are computed as the estimate ± 1.96 × ŝe_B, where ŝe_B is the bootstrap standard error obtained from 10,000 replicates; the factor 1.96 corresponds to an approximately 95% interval under a normal approximation. Bootstrap standard errors are not computed for the ProteomeTools datasets because their large size makes the calculation prohibitively expensive, although the authors note that the standard errors would likely be very small. The nine-species dataset is excluded from bootstrap standard error calculations for the same reason.
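The interval described above can be sketched as follows. This is a minimal example assuming the evaluated metric is a simple mean of per-spectrum 0/1 correctness values; the paper's actual metrics and resampling unit may differ, and the name `bootstrap_ci` is illustrative.

```python
import random
import statistics


def bootstrap_ci(outcomes, n_replicates=10_000, seed=0):
    """Return (estimate, lower, upper) using estimate +/- 1.96 * se_B.

    outcomes: list of per-spectrum values (e.g. 1 = correct, 0 = incorrect).
    se_B is the standard deviation of the metric over bootstrap replicates,
    each drawn by resampling the outcomes with replacement.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    estimate = sum(outcomes) / n
    replicate_means = []
    for _ in range(n_replicates):
        sample = rng.choices(outcomes, k=n)  # resample with replacement
        replicate_means.append(sum(sample) / n)
    se_b = statistics.stdev(replicate_means)
    return estimate, estimate - 1.96 * se_b, estimate + 1.96 * se_b


if __name__ == "__main__":
    # Hypothetical data: 70 correct and 30 incorrect predictions.
    outcomes = [1] * 70 + [0] * 30
    est, lo, hi = bootstrap_ci(outcomes, n_replicates=2_000)
    print(f"estimate = {est:.3f}, interval = [{lo:.3f}, {hi:.3f}]")
```

Each replicate touches every item once, so the cost grows with (dataset size × number of replicates). This is consistent with the stated reason for skipping the bootstrap on the largest datasets, where the standard error would also shrink roughly as one over the square root of the dataset size.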

Discussion

The Discussion section highlights the advances made by InstaNovo (IN) and its iterative refinement model, InstaNovo+ (IN+), in de novo peptide sequencing. The authors emphasize that iterative refinement substantially improves prediction accuracy and mirrors how humans approach sequencing: an initial prediction is progressively refined against the spectral data. The IN+ model, which integrates a diffusion decoder for iterative updates, showed substantial improvements in performance metrics, particularly recall and precision, compared with the state-of-the-art model, Casanovo. Notably, IN+ identified 52,633 peptide-spectrum matches (PSMs), a 32.71% increase over Casanovo, which illustrates its effectiveness in correcting initial errors and expanding the range of detectable peptides.
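The relative increase quoted above can be unpacked with a little arithmetic. The sketch assumes the 32.71% figure is measured relative to Casanovo's PSM count on the same data; the baseline it prints is implied by that assumption (roughly 39,660 PSMs) and is not a number reported in this summary.

```python
def percent_increase(new, baseline):
    """Relative increase of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100


def implied_baseline(new, increase_percent):
    """Baseline count implied by `new` being `increase_percent` % larger."""
    return new / (1 + increase_percent / 100)


if __name__ == "__main__":
    in_plus_psms = 52_633          # reported for IN+
    baseline = implied_baseline(in_plus_psms, 32.71)
    print(f"implied baseline PSMs: {baseline:.0f}")
    print(f"round-trip increase: {percent_increase(in_plus_psms, round(baseline)):.2f}%")
```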

The authors also discuss the broader implications of their findings, noting that de novo sequencing methods such as IN and IN+ can overcome limitations of traditional database searches, including high computational cost and the inability to identify novel sequences or post-translational modifications. They present evidence of the models' capabilities across several biological applications, including protein identification in complex samples, immunopeptidomics, and exploration of the dark proteome. The results indicate that these models not only raise peptide identification rates but also give access to biology that database searches could not reach, pointing to future applications in proteogenomics and single-cell proteomics. Overall, the paper positions IN and IN+ as significant advances toward accurate and efficient peptide sequencing.