Evaluating the representational power of pre-trained DNA language models for regulatory genomics

Journal: Genome Biology, Volume: 26, Issue: 1
DOI: https://doi.org/10.1186/s13059-025-03674-8
PMID: https://pubmed.ncbi.nlm.nih.gov/40660356
Publication Date: 2025-07-14
Author(s): Ziqi Tang et al.
Primary Topic: Genomics and Chromatin Dynamics

Overview

Genomic language models (gLMs) offer an unsupervised way to learn cis-regulatory patterns in the non-coding genome, without requiring functional activity labels from experimental data. Prior studies have reported that pre-trained gLMs can improve predictive performance in regulatory genomics. This study tests that claim more directly by evaluating the representational power of pre-trained gLMs for predicting cell-type-specific functional genomics data across six tasks. The findings show that pre-trained gLM representations do not offer a substantial advantage over conventional machine learning approaches that use one-hot encoded sequences. Moreover, highly tuned supervised models trained from scratch perform competitively or better, suggesting that current gLMs may lack a foundational understanding of cis-regulatory biology.

The study highlights critical gaps in how gLMs are evaluated, particularly their limited ability to capture cell-type-specific regulatory features without fine-tuning. Although gLMs can be adapted to various tasks, their pre-training appears to encode mainly low-level sequence statistics rather than biologically meaningful motifs. Notably, GPN and HyenaDNA, which incorporate architectural inductive biases, show improved performance, which suggests that model design matters more than scaling alone. The authors argue for more principled pre-training strategies that integrate functional genomics data, since current approaches may not adequately address the complexity of the non-coding genome. Future work should develop interpretability tools and evaluation benchmarks that clarify which features gLMs learn and how applicable they are to regulatory genomics.

Methods

This section describes the methods used for zero-shot variant effect prediction and attribution analysis. For the Nucleotide Transformer, the zero-shot score was the cosine similarity between the embedding of the sequence carrying the reference allele and the embedding of the sequence carrying the alternative allele. Because a larger shift in the embedding gives a lower cosine similarity, this score is negatively correlated with effect size, and the Pearson correlation was computed against the absolute values of the effect sizes. GPN was evaluated in a similar way, with the zero-shot score defined as the log-likelihood ratio of the allele probabilities. For both embedding-based and one-hot-based models, the prediction score was the difference between the predictions for the alternative and reference alleles. For Enformer, profile predictions were summed to scalar values and effect sizes were averaged across DNase-seq tracks, which makes the score cell-type agnostic.
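The zero-shot scores described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, and the embeddings, allele probabilities, predictions, and effect sizes are made-up toy values.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def log_likelihood_ratio(probs, ref, alt):
    """Log ratio of predicted allele probabilities at the variant position."""
    return math.log(probs[alt] / probs[ref])

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Toy inputs (hypothetical): one embedding pair and one measured effect per variant.
ref_embeddings = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
alt_embeddings = [[1.0, 0.1, 0.0], [1.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
measured_effects = [0.1, -0.8, 1.5]

# Embedding-based zero-shot score: cosine similarity of reference vs alternative.
cos_scores = [cosine_similarity(r, a) for r, a in zip(ref_embeddings, alt_embeddings)]

# The score carries no sign, so it is compared with absolute effect sizes.
r_cosine = pearson(cos_scores, [abs(e) for e in measured_effects])

# Likelihood-based zero-shot score for a single variant (toy probabilities).
llr = log_likelihood_ratio({"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}, ref="A", alt="G")

# Supervised models: difference between alternative and reference predictions.
alt_prediction, ref_prediction = 0.35, 0.80
effect_score = alt_prediction - ref_prediction
```

In this toy data the variants with the largest absolute effects have the least similar embeddings, so `r_cosine` comes out negative, matching the direction of the relationship described in the text.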

For the CNN models, attribution used saliency maps computed with a gradient-times-input approach: the gradient of the model output with respect to the input sequence is multiplied by the input, which isolates the contribution of each nucleotide. For the genomic language models (gLMs), the authors masked each token of the input sequence in turn and calculated the entropy of the predicted probabilities at the masked position. The saliency score at each position was the maximum entropy minus the entropy at that position, so lower entropy (a more confident prediction) indicates higher information content. Sequence logos were drawn with Logomaker to make the results easier to interpret.
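Both attribution scores reduce to simple per-position arithmetic once the model outputs are available. The sketch below assumes the masked-token probabilities and the input gradients have already been obtained from a model; the values used here are made up, the function names are hypothetical, and a four-letter nucleotide vocabulary is assumed for the entropy example.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def entropy_saliency(masked_probs):
    """Maximum entropy minus observed entropy at each masked position.

    masked_probs[i] is the predicted distribution over the vocabulary
    at position i when that position is masked.
    """
    max_entropy = math.log2(len(masked_probs[0]))
    return [max_entropy - entropy_bits(p) for p in masked_probs]

def grad_times_input(one_hot, grads):
    """Per-position attribution: elementwise product summed over channels.

    For a one-hot input this keeps only the gradient of the observed
    nucleotide at each position.
    """
    return [sum(x * g for x, g in zip(x_row, g_row))
            for x_row, g_row in zip(one_hot, grads)]

# Toy masked-token predictions (hypothetical) over A, C, G, T.
masked_probs = [
    [0.25, 0.25, 0.25, 0.25],  # uninformative position
    [0.97, 0.01, 0.01, 0.01],  # confidently predicted position
    [0.50, 0.50, 0.00, 0.00],  # two equally likely nucleotides
]
saliency = entropy_saliency(masked_probs)

# Toy one-hot sequence "AG" and made-up gradients of the model output.
one_hot = [[1, 0, 0, 0], [0, 0, 1, 0]]
grads = [[0.5, -0.1, 0.0, 0.2], [0.3, 0.1, -0.4, 0.0]]
attribution = grad_times_input(one_hot, grads)
```

The uniform position receives a score of zero and the confidently predicted position receives the highest score, which is the behavior the entropy-based definition is meant to capture. In practice the gradients would come from automatic differentiation of the trained CNN instead of being supplied by hand.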

Results

Across the six tasks, pre-trained gLM representations did not offer a substantial advantage over one-hot encoded sequences as inputs to conventional machine learning models, and highly tuned supervised models trained from scratch were competitive or better. CNNs trained on full gLM embeddings generally outperformed models that used summarized gLM representations, and fine-tuning the gLMs on specific datasets improved their predictions. Among the gLMs, GPN and HyenaDNA, which incorporate architectural inductive biases, showed improved performance.

Discussion

In this section, the authors discuss how well various machine learning models predict cell-type-specific regulatory activity and transcription factor (TF) binding sites, using data from lentiMPRA and ChIP-seq. They evaluated different input representations of DNA sequences, including one-hot encoding and embeddings from genomic language models (gLMs), across multiple tasks. Convolutional neural networks (CNNs) trained on full embeddings generally outperformed models that used summarized gLM representations, suggesting that the summarized representations lack sufficient information for accurate prediction. Notably, although gLM embeddings did not substantially improve the prediction of regulatory activity, fine-tuning the gLMs on specific datasets improved their predictive performance.
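One way to see why a summarized representation can lose information is to compare it with the full per-position embedding. The sketch below assumes that summarization means mean pooling over sequence positions, which the text does not specify; the embeddings are made-up toy values and `mean_pool` is a hypothetical helper.

```python
def mean_pool(embedding):
    """Average a (length x dim) embedding over positions into one vector."""
    length = len(embedding)
    return [sum(column) / length for column in zip(*embedding)]

# Two toy full embeddings (3 positions x 2 dimensions) in which the same
# features appear at different positions.
full_a = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
full_b = [[0.0, 1.0], [0.0, 0.0], [1.0, 0.0]]

summary_a = mean_pool(full_a)
summary_b = mean_pool(full_b)

# The full embeddings differ, but their mean-pooled summaries are identical,
# so a model that sees only the summary cannot tell the sequences apart.
same_summary = summary_a == summary_b
different_full = full_a != full_b
```

A CNN applied to the full embedding can still use the positional arrangement that pooling discards, which is consistent with the performance gap described above.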

The authors also examined how well gLMs predict TF binding sites and found that CNNs trained on one-hot sequences outperformed CNNs trained on gLM embeddings. The same trend held in other tasks, including zero-shot variant effect prediction and RNA-binding protein binding prediction, where CNNs trained on one-hot sequences consistently performed better. The authors conclude that although gLMs may capture some genomic sequence information, they often do not encode the explicit motifs needed for downstream tasks, which makes the choice of pre-training strategy important for functional genomics applications.