N6-methyladenine identification using deep learning and discriminative feature integration

Journal: BMC Medical Genomics, Volume: 18, Issue: 1
DOI: https://doi.org/10.1186/s12920-025-02131-6
PMID: https://pubmed.ncbi.nlm.nih.gov/40158097
Publication Date: 2025-03-29
Author(s): Salman Khan et al.
Primary Topic: Machine Learning in Bioinformatics

Overview

The paper presents Deep-N6mA, a deep neural network (DNN) model for identifying N6-methyladenine (6mA) sites in DNA sequences. These sites play an important role in epigenetic regulation and gene expression. The model uses a hybrid feature extraction approach that combines several sequence-based descriptors: k-mer, Dinucleotide-based Cross Covariance (DCC), Trinucleotide-based Auto Covariance (TAC), and several pseudo-composition methods. Unsupervised Principal Component Analysis (PCA) is then applied to the combined features to improve computational efficiency and feature relevance. The model was validated with fivefold cross-validation on two benchmark datasets, achieving an average accuracy of 97.70% for F. vesca and 95.75% for R. chinensis, which the authors report as surpassing existing methods by 4.12% and 4.55%, respectively.

The findings indicate that Deep-N6mA is a reliable tool for early detection of 6mA sites, with higher sensitivity, specificity, and Matthews correlation coefficient (MCC) values than previous models. The result illustrates the potential of deep learning for complex biological problems and suggests that Deep-N6mA could contribute to epigenetic research. Future work aims to extend the model to a broader range of species and to further optimize its architecture.

Introduction

The introduction highlights the significance of N6-methyladenine (6mA) as an epigenetic marker influenced by environmental factors, particularly in stress responses across various organisms. Hypoxia-stressed human cells exhibit increased levels of mitochondrial 6mA, while in the mouse brain an inverse relationship between 6mA levels and stress-responsive neuronal genes suggests a role in stress adaptation. In Caenorhabditis elegans, mitochondrial stress elevates 6mA levels, facilitating intergenerational adaptive mechanisms. In rice cells, by contrast, 6mA levels correlate positively with salt and heat adaptation but inversely with cold resistance.

The paper emphasizes the need for efficient computational methods to identify 6mA sites, because traditional experimental approaches are often costly and time-consuming. Several machine learning and deep learning models have been developed to improve detection accuracy, including SNNRice6mA, DNA6mA-MINT, and Deep6mA, which have demonstrated high accuracy across various genomes. The proposed Deep-N6mA model aims to improve on these methods by combining a deep neural network with a hybrid feature extraction approach that uses techniques such as k-mer and Pseudo Dinucleotide Composition. Redundancy and noise in the combined features are addressed with Principal Component Analysis (PCA), which the paper treats as its feature selection step. Preliminary evaluations indicate that Deep-N6mA outperforms conventional classifiers and existing state-of-the-art models in predicting 6mA sites. The remainder of the paper covers methods, performance evaluation, experimental results, and conclusions.

Methods

In this section, the authors detail the methods and materials used to extract features from DNA sequences for machine learning. They employ six feature extraction techniques: Pseudo K-Tuple Nucleotide Composition (PseKNC) with \( K \) = 1, 2, and 3 (counted as three methods), the k-mer series, Dinucleotide-based Cross Covariance (DCC), and Trinucleotide-based Auto Covariance (TAC). PseKNC transforms a DNA sequence into a discrete feature vector while preserving sequence-order information, using empirically chosen values for the correlation ranks and weights. The k-mer method segments a sequence into overlapping substrings of length k. DCC and TAC measure correlations among physicochemical properties of dinucleotides and trinucleotides, respectively, with specific settings for the lag and the resulting feature vector dimensions.
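The k-mer and covariance descriptors described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and `toy_property` is a made-up stand-in (the fraction of chosen letters in each k-mer) for the real physicochemical index tables that DCC and TAC rely on, which this summary does not list. The sketch also assumes sequences contain only A, C, G, and T.

```python
from itertools import product

ALPHABET = "ACGT"

def kmer_frequencies(seq, k):
    """Return normalized counts of all 4**k overlapping k-mers (lexicographic order)."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # windows containing N or other symbols are skipped
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]

def toy_property(k, letters):
    """Hypothetical stand-in for a physicochemical index:
    the fraction of characters from `letters` in each k-mer."""
    return {"".join(p): sum(c in letters for c in p) / k
            for p in product(ALPHABET, repeat=k)}

def cross_covariance(seq, prop_a, prop_b, k, lag):
    """Covariance between prop_a at k-mer i and prop_b at k-mer i + lag.
    Passing the same table twice gives an auto covariance (as in TAC);
    two different tables give a cross covariance (as in DCC)."""
    a = [prop_a[seq[i:i + k]] for i in range(len(seq) - k + 1)]
    b = [prop_b[seq[i:i + k]] for i in range(len(seq) - k + 1)]
    n = len(a) - lag  # number of (i, i + lag) pairs
    if n <= 0:
        raise ValueError("sequence too short for this k and lag")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((a[i] - mean_a) * (b[i + lag] - mean_b) for i in range(n)) / n

# Illustrative hybrid vector: concatenate descriptors for one sequence.
seq = "ACGTTGCAAC"
gc2, purine2 = toy_property(2, "GC"), toy_property(2, "AG")
gc3 = toy_property(3, "GC")
features = (
    kmer_frequencies(seq, 1) + kmer_frequencies(seq, 2) + kmer_frequencies(seq, 3)
    + [cross_covariance(seq, gc2, purine2, 2, lag) for lag in (1, 2)]  # DCC-style
    + [cross_covariance(seq, gc3, gc3, 3, lag) for lag in (1, 2)]      # TAC-style
)
print(len(features))  # 4 + 16 + 64 + 2 + 2 = 88 values
```

The concatenation at the end only shows the idea of a hybrid vector; it omits PseKNC and does not reproduce the paper's lag settings or feature dimensions.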

Model evaluation uses fivefold cross-validation to assess the accuracy of the machine learning algorithms. The computational environment is built on Python 3.6 with libraries such as TensorFlow, PyTorch, NumPy, and Scikit-learn. The hardware configuration comprises an HP Core i7 processor, 8 GB RAM, and an NVIDIA GeForce GTX 1060 GPU, which the authors describe as sufficient for the data-intensive and deep learning workloads involved.
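To make the fivefold procedure concrete, the following is a small sketch of how sample indices could be divided into five folds. It assumes a stratified split (class proportions kept roughly equal in every fold), which is a common choice but is not stated in this summary; the function name and the seed are illustrative.

```python
import random

def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) for k folds.
    Stratification is an assumption here, not a detail taken from the paper."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin across the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test

# Example: 10 positive and 10 negative samples.
labels = [1] * 10 + [0] * 10
splits = list(stratified_kfold(labels, k=5))
for train, test in splits:
    print(len(train), len(test))
```

Each sample appears in exactly one test fold, so the accuracy reported for a dataset is the mean of the five per-fold accuracies.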

Discussion

The discussion section emphasizes the role of benchmark dataset preparation in developing a reliable computational model for identifying N6-methyladenine (6mA) sites in DNA sequences. The study took positive samples from the MDR database for *Fragaria vesca* and *Rubus chinensis* and applied the CD-HIT tool to filter sequences for diversity and minimize redundancy. After screening, the final datasets comprised 4,626 positive and 1,912 negative samples for *R. chinensis*, and 694 balanced samples for validation. Feature extraction combined six distinct methods into a hybrid feature vector, with the goal of capturing complementary sequence patterns and improving predictive performance. Principal Component Analysis (PCA) was then used for feature optimization, reducing dimensionality while preserving most of the variance in the data.
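The PCA step can be illustrated with a short NumPy sketch based on the singular value decomposition. This is an assumption-laden example rather than the authors' code: the summary does not say how many components were kept or which implementation was used, so `pca_reduce`, `n_components=2`, and the random input matrix are all hypothetical.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project rows of X onto the top principal components.
    Returns (projected data, component matrix, explained variance ratio)."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)  # PCA operates on mean-centered features
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    variances = singular_values ** 2 / (X.shape[0] - 1)
    ratio = variances[:n_components] / variances.sum()
    return X_centered @ components.T, components, ratio

# Example: 50 samples with 6 features reduced to 2 components.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
Z, components, ratio = pca_reduce(X, n_components=2)
print(Z.shape, ratio)
```

PCA is unsupervised, so class labels are not used. To avoid leaking information across folds, the projection would normally be fitted on the training portion of each fold and then applied to the held-out portion.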

The deep neural network (DNN) developed in this study identified 6mA sites with accuracy rates of 97.70% for *F. vesca* and 95.75% for *R. chinensis*. The model was evaluated with four performance metrics, Accuracy, Sensitivity, Specificity, and Matthews Correlation Coefficient (MCC), all of which showed improvements over existing predictors. Deep-N6mA outperformed traditional machine learning algorithms and previous models on both datasets, which the authors take as evidence of its robustness and generalizability. Future research aims to expand the datasets and optimize the model further, reinforcing the potential of Deep-N6mA as a tool for epigenetic studies.
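The four metrics named above have standard definitions in terms of the confusion matrix, sketched below. The function name and the example counts are illustrative and are not results from the paper.

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, and MCC from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # true positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0  # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "mcc": mcc}

# Made-up counts for illustration only.
result = binary_metrics(tp=50, tn=40, fp=10, fn=0)
print(result)
```

MCC ranges from -1 to 1 and accounts for all four cells of the confusion matrix, which makes it more informative than accuracy when the positive and negative classes differ in size.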