PROTAC-Splitter: a machine learning framework for automated identification of PROTAC substructures

Journal: Journal of Cheminformatics, Volume: 18, Issue: 1
DOI: https://doi.org/10.1186/s13321-025-01135-9
PMID: https://pubmed.ncbi.nlm.nih.gov/41721433
Publication Date: 2026-02-20
Author(s): Stefano Ribes et al.
Primary Topic: Protein Degradation and Inhibitors

Overview

The research presents PROTAC-Splitter, a machine learning framework aimed at automating the annotation of proteolysis-targeting chimeras (PROTACs), which are complex molecules composed of an E3 ligase ligand, a linker, and a warhead. The challenge of accurately identifying these components is addressed by generating a synthetic dataset of approximately 1.3 million PROTAC structures with annotated substructures, which is made publicly available. Two complementary models are developed: a Transformer-based sequence-to-sequence model and a graph-based XGBoost model. The Transformer model demonstrates high exact-match accuracy (86%) on public data but struggles with structurally novel PROTACs, while the XGBoost model ensures chemical validity and reassembly accuracy, albeit with lower exact-match rates.

To enhance reliability, a wrapper function for the Transformer model is introduced, improving reassembly accuracy significantly. The hybrid approach that combines both models is proposed as a solution for reliable annotation across diverse chemical spaces. PROTAC-Splitter not only provides a scalable tool for automated PROTAC analysis but also lays the groundwork for advancing the design and optimization of targeted protein degraders. The framework and dataset are openly accessible, promoting further research in this domain.
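The hybrid idea can be sketched as a simple fallback rule. This is a minimal illustration, not the authors' implementation: the names `Split`, `hybrid_split`, `primary`, `fallback`, and `reassembles` are hypothetical, and the order in which the two models are tried is an assumption, since the summary does not state it.

```python
from typing import Callable, NamedTuple, Optional, Tuple


class Split(NamedTuple):
    """Three substructures of one PROTAC, each as a SMILES string."""
    e3_ligand: str
    linker: str
    warhead: str


def hybrid_split(
    protac: str,
    primary: Callable[[str], Optional[Split]],
    fallback: Callable[[str], Split],
    reassembles: Callable[[str, Split], bool],
) -> Tuple[Split, str]:
    """Return a split plus the name of the model that produced it.

    `primary` may return None (no usable output). Its result is accepted
    only if `reassembles` confirms that the three parts rebuild the input;
    otherwise the `fallback` model's result is returned unchecked.
    All four callables are placeholders supplied by the caller.
    """
    candidate = primary(protac)
    if candidate is not None and reassembles(protac, candidate):
        return candidate, "primary"
    return fallback(protac), "fallback"
```

In practice `reassembles` would be a chemistry-aware check (for example, joining the fragments at their attachment points and comparing canonical SMILES), which is outside the scope of this sketch.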

Introduction

The introduction of the research paper discusses the development and significance of proteolysis-targeting chimeras (PROTACs), which are innovative molecules designed to facilitate the targeted degradation of specific proteins through the recruitment of E3 ubiquitin ligases. Each PROTAC consists of three essential components: an E3 ligase ligand, a warhead that binds the protein of interest (POI), and a linker that connects the two. The optimization of these components is crucial, as even minor alterations can significantly impact the degradation efficiency and pharmacokinetics of PROTACs. However, the decomposition of PROTAC structures into their constituent parts is challenging due to the complexity of current workflows, which often rely on manual curation and rigid matching techniques that struggle with variations in chemical structures.

To address these challenges, the authors introduce PROTAC-Splitter, a machine learning framework that automatically extracts the E3 ligase ligand, linker, and warhead from PROTACs represented as SMILES strings. Because annotated PROTAC data are scarce, the authors also generated a synthetic dataset of 1.3 million PROTAC-like molecules with realistic chemical properties. PROTAC-Splitter was validated on both public datasets and proprietary collections using two complementary models: a Transformer-based sequence-to-sequence model that achieves 86% exact-match accuracy on public data, and a lightweight XGBoost model that ensures 100% chemical validity. The study emphasizes the potential of PROTAC-Splitter to support automated PROTAC analysis and targeted protein degrader design, and it provides the framework and dataset as resources for further research.

Methods

The study combines data curation, synthetic data generation, and two modeling approaches. The authors curated 5,670 PROTACs from the public PROTAC-DB and PROTAC-Pedia databases and from an internal AstraZeneca collection, and annotated their substructures with a substructure-inference algorithm. Because annotated data are scarce, they also generated a synthetic dataset of about 1.3 million PROTAC-like molecules for training.

Two models were trained to split a PROTAC, given as a SMILES string, into its E3 ligase ligand, linker, and warhead: a Transformer-based sequence-to-sequence model and a graph-based XGBoost model that relies on topological information. A deterministic fixing function post-processes the Transformer output to correct minor generation errors. The models were evaluated on validity, exact-match accuracy, and reassembly accuracy.

Results

The substructure-inference algorithm annotated 98.8% of the curated PROTACs. The Transformer model reached 86% exact-match accuracy on public data but was less accurate on structurally novel PROTACs. The XGBoost model produced chemically valid outputs in all cases, with exact-match accuracy of 42.20% on the synthetic test set and 22.96% on the internal test set.

The fixing function corrected 16.89% of Transformer errors on the synthetic held-out test set and 60.6% on internal data. Stereochemical mismatches at chiral centers accounted for approximately 1-2% of errors.

Discussion

In the discussion section, the authors detail the development and evaluation of the PROTAC-Splitter models, which automate the decomposition of PROTAC molecules into their constituent substructures: warheads, linkers, and E3 ligase ligands. They drew on two data sources, the public PROTAC-DB and PROTAC-Pedia databases and an internal dataset from AstraZeneca, to curate a collection of 5,670 PROTACs with systematic substructure annotations. The authors implemented a novel algorithm for substructure inference, which successfully annotated 98.8% of the PROTACs, and developed a synthetic dataset of over 1.3 million PROTACs to improve model training and generalization.

The performance of the models was rigorously evaluated using metrics such as validity, exact-match accuracy, and reassembly accuracy. The XGBoost-based model demonstrated high validity and reassembly accuracy but lower exact-match accuracy, while the Transformer-based model achieved significantly higher exact-match rates, particularly after applying a deterministic fixing function to correct minor generation errors. The authors noted that generalization to unseen substructures remains a challenge, particularly for proprietary data, and suggested that domain adaptation strategies could improve model performance in real-world applications. They propose a hybrid inference strategy that leverages the strengths of both models, ensuring robust predictions while minimizing computational demands. Future research directions include exploring the prediction of physicochemical properties from substructures and extending the methodology to other drug modalities.
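As a rough illustration of the exact-match metric, the sketch below counts a prediction as correct only when all three substructures agree with the reference. It assumes the SMILES strings have already been canonicalized with a cheminformatics toolkit so that plain string equality is meaningful; the function names and dictionary keys are hypothetical, and the paper's exact metric definition may differ.

```python
from typing import Dict, Sequence

COMPONENTS = ("e3_ligand", "linker", "warhead")


def exact_match_accuracy(
    preds: Sequence[Dict[str, str]],
    refs: Sequence[Dict[str, str]],
) -> float:
    """Fraction of PROTACs whose three predicted parts all equal the reference.

    Assumes every string is a pre-canonicalized SMILES; a missing
    predicted component counts as a mismatch.
    """
    if len(preds) != len(refs):
        raise ValueError("preds and refs must have the same length")
    if not refs:
        return 0.0
    hits = sum(
        all(pred.get(name) == ref[name] for name in COMPONENTS)
        for pred, ref in zip(preds, refs)
    )
    return hits / len(refs)


def component_accuracy(
    preds: Sequence[Dict[str, str]],
    refs: Sequence[Dict[str, str]],
    name: str,
) -> float:
    """Fraction of PROTACs for which a single component matches."""
    if len(preds) != len(refs):
        raise ValueError("preds and refs must have the same length")
    if not refs:
        return 0.0
    return sum(p.get(name) == r[name] for p, r in zip(preds, refs)) / len(refs)
```

Reporting per-component accuracy alongside the all-three score makes it easier to see which substructure drives the errors, for example the warhead.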

Limitations

The limitations section highlights several shortcomings of the PROTAC-Splitter models. First, the Transformer-based model is prone to small but chemically relevant errors, particularly in the warhead substructure, which contains on average 0.46 extra atoms in public data and 7.40 in internal data. Although the fixing function corrects 16.89% of errors in the synthetic held-out test set and 60.6% in internal data, complex cases, especially those involving both linker attachment points, remain difficult to resolve automatically.

Secondly, while the graph-based XGBoost model ensures chemically valid outputs, its exclusive reliance on topological data results in low exact-match accuracy (42.20% for synthetic and 22.96% for internal test sets) and systematic misidentifications, particularly with complex PROTACs like macrocycles. Additionally, stereochemical mismatches at chiral centers contribute to a small fraction of errors (approximately 1-2%). Future improvements could involve integrating 3D conformational awareness to address these issues. Lastly, both models struggle to differentiate non-PROTAC molecules, often misclassifying unrelated compounds, such as molecular glues, as PROTACs. This limitation emphasizes the necessity for preliminary checks to verify the presence of PROTAC-like substructures in future applications.
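One way to see why a purely topological splitter can guarantee valid, reassemblable output is to treat the molecule as a graph and choose two bonds to cut. The sketch below assumes that framing, which this summary does not confirm as the authors' exact method. Atoms are plain integers, and `split_by_cuts` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Bond = Tuple[int, int]


def split_by_cuts(bonds: Iterable[Bond], cuts: Tuple[Bond, Bond]) -> Dict[str, Set[int]]:
    """Remove two bonds and return the three resulting atom sets.

    The linker is the fragment touched by both cut bonds. Topology alone
    cannot tell which remaining fragment is the warhead and which is the
    E3 ligase ligand, so they are labeled end_a and end_b.
    """
    bond_set = {frozenset(b) for b in bonds}
    cut_set = {frozenset(c) for c in cuts}
    if len(cut_set) != 2 or not cut_set <= bond_set:
        raise ValueError("cuts must be two distinct existing bonds")

    atoms: Set[int] = set()
    adjacency: Dict[int, Set[int]] = defaultdict(set)
    for bond in bond_set:
        a, b = tuple(bond)
        atoms.update((a, b))
        if bond not in cut_set:
            adjacency[a].add(b)
            adjacency[b].add(a)

    # Depth-first flood fill to label connected components.
    component_of: Dict[int, int] = {}
    components: List[Set[int]] = []
    for start in sorted(atoms):
        if start in component_of:
            continue
        members: Set[int] = set()
        stack = [start]
        while stack:
            atom = stack.pop()
            if atom in members:
                continue
            members.add(atom)
            component_of[atom] = len(components)
            stack.extend(adjacency[atom] - members)
        components.append(members)

    if len(components) != 3:
        raise ValueError("the two cuts do not yield exactly three fragments")

    touched = [{component_of[a] for a in cut} for cut in cut_set]
    shared = touched[0] & touched[1]
    if len(shared) != 1:
        raise ValueError("no single fragment is attached to both cuts")
    linker = shared.pop()
    end_a, end_b = (i for i in range(3) if i != linker)
    return {"linker": components[linker], "end_a": components[end_a], "end_b": components[end_b]}
```

Because every atom lands in exactly one fragment, rejoining the fragments at the cut bonds always restores the input graph. In this toy, cutting two bonds of the same ring fails the three-fragment check, which is one illustration of why ring-containing topologies need extra care.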