GeneCompass: deciphering universal gene regulatory mechanisms with a knowledge-informed cross-species foundation model

Journal: Cell Research, Volume: 34, Issue: 12
DOI: https://doi.org/10.1038/s41422-024-01034-y
PMID: https://pubmed.ncbi.nlm.nih.gov/39375485
Publication Date: 2024-10-07
Author(s): Xiaodong Yang et al.
Primary Topic: Single-cell and spatial transcriptomics

Overview

This study presents GeneCompass, a cross-species foundation model built to improve our understanding of gene regulatory mechanisms by integrating more than 120 million single-cell transcriptomes from humans and mice. Traditional research often focuses on individual model organisms, which limits insight into universal regulatory mechanisms. Advances in single-cell sequencing and deep learning have made it possible to assemble a corpus at this scale and to integrate prior biological knowledge during GeneCompass’s pre-training phase.

After fine-tuning on various downstream tasks, GeneCompass outperformed existing models in single-species applications and enabled cross-species biological investigations. Notably, the model was used to identify key factors associated with cell fate transitions, and it successfully predicted candidate genes that induced the differentiation of human embryonic stem cells toward a gonadal fate. The study underscores the potential of artificial intelligence for deciphering gene regulatory mechanisms and for accelerating the identification of critical cell fate regulators and drug targets.

Introduction

The introduction highlights the complexity of vertebrate organisms, which consist of trillions of cells organized into diverse cell types that form distinct tissues and organs. Understanding the gene regulatory mechanisms that govern these systems is essential for explaining developmental patterns and improving clinical therapies. With advances in omics sequencing technologies, researchers have begun to use single-cell data to study the regulation of gene expression at multiple levels, including chromatin accessibility and post-transcriptional modifications. Traditional wet-lab experiments, however, are often labor-intensive, which motivates the development of computational approaches.

The paper introduces GeneCompass, a cross-species foundation model pre-trained on the scCompass-126M corpus, which includes over 120 million single-cell transcriptomes from humans and mice. The model integrates prior biological knowledge, such as promoter sequences and gene co-expression networks, to improve its learning of universal gene regulatory mechanisms. After fine-tuning on various downstream tasks, GeneCompass achieved performance superior or comparable to that of state-of-the-art models. The model aims to support the study of gene regulation across species and to accelerate the identification of key cell fate regulators and potential drug targets.

Methods

In this study, the authors developed a large-scale pre-training corpus named scCompass-126M, which consists of over 120 million single-cell transcriptomes from human and mouse sources. The dataset includes 53,568,337 human cells and 48,200,083 mouse cells, 101,768,420 in total, which matches the 101.76 million cells reported in the Results as retained after filtering. More than 90% of the data come from publicly available repositories such as the Gene Expression Omnibus. The raw sequencing data were processed with a unified bioinformatics pipeline that applied stringent filtering criteria to ensure data quality. These criteria included removing cells with fewer than 200 expressed genes, samples with fewer than four cells, and cells with excessive mitochondrial gene expression, among others.
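
The filtering criteria listed above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the 200-gene and four-cell thresholds come from the text, while the mitochondrial cutoff (`MAX_MITO_FRACTION`), the `MT-` prefix convention, the cell record layout, and the choice to count cells per sample after cell-level filtering are all assumptions made for the example.

```python
from collections import Counter

MIN_GENES = 200            # from the text: at least 200 expressed genes per cell
MIN_CELLS = 4              # from the text: at least four cells per sample
MAX_MITO_FRACTION = 0.15   # assumption: the summary gives no numeric cutoff


def passes_cell_qc(cell):
    """cell is a dict with 'sample' (str) and 'counts' (gene symbol -> count)."""
    counts = cell["counts"]
    n_expressed = sum(1 for c in counts.values() if c > 0)
    total = sum(counts.values())
    # Assumption: mitochondrial genes carry an 'MT-' (human) or 'mt-' (mouse) prefix.
    mito = sum(c for g, c in counts.items() if g.upper().startswith("MT-"))
    mito_fraction = mito / total if total else 1.0
    return n_expressed >= MIN_GENES and mito_fraction <= MAX_MITO_FRACTION


def filter_cells(cells):
    """Apply the cell-level checks, then drop samples left with too few cells."""
    kept = [c for c in cells if passes_cell_qc(c)]
    per_sample = Counter(c["sample"] for c in kept)
    return [c for c in kept if per_sample[c["sample"]] >= MIN_CELLS]


def make_cell(sample, n_genes, mito_count=0):
    """Build a synthetic cell for the demo (hypothetical gene names)."""
    counts = {f"GENE{i}": 1 for i in range(n_genes)}
    if mito_count:
        counts["MT-CO1"] = mito_count
    return {"sample": sample, "counts": counts}


cells = (
    [make_cell("s1", 300) for _ in range(5)]      # five clean cells
    + [make_cell("s1", 50)]                       # too few expressed genes
    + [make_cell("s1", 300, mito_count=300)]      # half of all counts are mitochondrial
    + [make_cell("s2", 300) for _ in range(3)]    # sample with only three cells
)
filtered = filter_cells(cells)
print(len(filtered), "cells kept out of", len(cells))
```

In this toy input only the five clean cells of sample `s1` survive: one cell fails the gene-count check, one fails the mitochondrial check, and sample `s2` is dropped for having fewer than four cells.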

The resulting gene count matrices were annotated with information from Ensembl, focusing on protein-coding genes, long non-coding RNAs (lncRNAs), and microRNAs (miRNAs), and excluding categories such as pseudogenes and non-capturable loci. The pre-training corpus covers a diverse range of cell types, including disease cells, cancer cells, and immortalized cell lines. Its metadata record attributes such as single-cell perturbation and cell differentiation time, which further characterize the diversity of the dataset.

Results

The Results section details the architecture and pre-training methodology of GeneCompass, a cross-species foundation model designed to analyze transcriptomic data from over 120 million human and mouse cells. The model integrates four types of prior biological knowledge into its self-supervised learning framework: gene regulatory networks (GRNs), promoter information, gene family annotations, and gene co-expression relationships. This integration allows GeneCompass to encode the complex relationships among genes and to capture cellular context through a self-attention mechanism.

The pre-training corpus, scCompass-126M, consists of 126 million single-cell transcriptomes, of which 101.76 million were retained after filtering out cells with outlier gene expression. The model uses a token dictionary of 17,465 homologous genes, which gives human and mouse genes a uniform representation. To represent each transcriptome, GeneCompass uses absolute gene expression values alongside normalized rankings of the top 2048 genes. It is trained with a masked language modeling strategy in which 15% of gene inputs are randomly masked; by learning to recover the masked inputs, the model captures gene relationships in a context-aware manner. After fine-tuning with task-specific data, the pre-trained model can be applied to a range of downstream biological tasks.
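
The input construction and masking described above can be illustrated with a short Python sketch. It is a simplified stand-in, not the GeneCompass implementation: the top-2048 cutoff and the 15% masking rate come from the text, while the `<mask>` token name, the tie-breaking rule, ranking directly on the values passed in (assumed to be already normalized), and masking an exact fraction of positions rather than sampling each position independently are assumptions.

```python
import random

TOP_K = 2048           # from the text: the top 2048 genes are kept
MASK_RATE = 0.15       # from the text: 15% of gene inputs are masked
MASK_TOKEN = "<mask>"  # hypothetical token name


def build_input(expression, top_k=TOP_K):
    """Keep the top_k expressed genes, highest value first.

    expression maps gene -> expression value (assumed already normalized).
    Returns a list of (gene, value) pairs, so the value travels with the rank.
    """
    expressed = [(g, v) for g, v in expression.items() if v > 0]
    expressed.sort(key=lambda gv: (-gv[1], gv[0]))  # ties broken by gene name
    return expressed[:top_k]


def mask_input(pairs, rate=MASK_RATE, rng=None):
    """Hide a random subset of positions; return the masked input and the targets."""
    if rng is None:
        rng = random.Random()
    n_mask = max(1, round(rate * len(pairs))) if pairs else 0
    positions = set(rng.sample(range(len(pairs)), n_mask))
    masked = [(MASK_TOKEN, None) if i in positions else p
              for i, p in enumerate(pairs)]
    targets = {i: pairs[i] for i in positions}  # what the model must recover
    return masked, targets


# Synthetic cell with 3000 hypothetical genes; G0 has zero expression.
expression = {f"G{i}": float(i) for i in range(3000)}
pairs = build_input(expression)
masked, targets = mask_input(pairs, rng=random.Random(0))
print(len(pairs), "input genes,", len(targets), "masked")
```

During pre-training, the model would see `masked` and be scored on how well it reconstructs the entries in `targets`; the sketch stops at preparing those two objects.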

Discussion

In the discussion, the authors present GeneCompass as a pre-trained model designed to capture gene features and relationships across species by integrating a vast corpus of single-cell transcriptomes with prior biological knowledge. The study validates the model’s ability to retain homology information by showing that the cosine similarity between gene embeddings is significantly higher for homologous genes than for non-homologous genes across various cell types, including B cells, hepatocytes, and macrophages. The results also indicate that both prior knowledge and self-supervised pre-training contribute to the model’s effectiveness, with pre-training playing the larger role.
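
The homology check described above rests on comparing cosine similarities between two groups of gene pairs. The following Python sketch shows that comparison on toy data. It assumes one embedding vector per gene and species (for example, averaged over cells of one type); the gene names and vectors are invented for illustration and are not results from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_similarity(pairs, human_emb, mouse_emb):
    """Average cosine similarity over (human gene, mouse gene) pairs."""
    sims = [cosine_similarity(human_emb[h], mouse_emb[m]) for h, m in pairs]
    return sum(sims) / len(sims)


# Toy embeddings with hypothetical gene names (not data from the paper).
human_emb = {"GENE_A": [1.0, 0.0, 0.2], "GENE_B": [0.0, 1.0, 0.1]}
mouse_emb = {"Gene_a": [0.9, 0.1, 0.2], "Gene_b": [0.1, 0.8, 0.0]}

homologous = [("GENE_A", "Gene_a"), ("GENE_B", "Gene_b")]
non_homologous = [("GENE_A", "Gene_b"), ("GENE_B", "Gene_a")]

homolog_mean = mean_similarity(homologous, human_emb, mouse_emb)
non_homolog_mean = mean_similarity(non_homologous, human_emb, mouse_emb)
print(f"homologous: {homolog_mean:.3f}  non-homologous: {non_homolog_mean:.3f}")
```

A real analysis would use many gene pairs per cell type and a statistical test on the two distributions rather than a comparison of two means.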

GeneCompass also improves cell-type annotation, outperforming existing methods in both single-species and cross-species settings. Its performance improves as the pre-training corpus grows, with higher macro-F1 scores and accuracy across various datasets. GeneCompass further demonstrates its adaptability on multiple biological tasks, including gene regulatory network (GRN) inference, drug response prediction, and gene dosage sensitivity prediction, where it consistently outperforms other pre-trained models. These findings underscore GeneCompass’s potential to refine biological analyses and to help identify key regulatory factors in cell fate determination, advancing our understanding of gene interactions and regulatory mechanisms across species.
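
Because cell-type annotation is scored here with both macro-F1 and accuracy, it helps to see how the two metrics differ. The Python sketch below computes both from scratch using their standard definitions; the labels are toy data, not results from the paper. Macro-F1 averages the per-class F1 scores with equal weight, so a rare cell type that is always missed lowers it sharply even when accuracy stays high.

```python
def accuracy(y_true, y_pred):
    """Fraction of cells whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy annotation task: a classifier that labels every cell as the majority type.
y_true = ["B cell"] * 8 + ["hepatocyte"] * 2
y_pred = ["B cell"] * 10

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy = {acc:.3f}, macro-F1 = {f1:.3f}")
```

In this example the classifier is right for 8 of 10 cells, yet its macro-F1 is roughly half its accuracy because the hepatocyte class contributes an F1 of zero.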