Multimodal learning enables chat-based exploration of single-cell data

Journal: Nature Biotechnology
DOI: https://doi.org/10.1038/s41587-025-02857-9
PMID: https://pubmed.ncbi.nlm.nih.gov/41219484
Publication Date: 2025-11-11
Author(s): Moritz Schaefer et al.
Primary Topic: Single-cell and spatial transcriptomics

Overview

The article presents CellWhisperer, an artificial intelligence (AI) model for interpreting single-cell RNA sequencing (scRNA-seq) data through natural-language interaction. CellWhisperer uses contrastive learning on a dataset of more than 1 million RNA sequencing profiles to build a multimodal embedding that places transcriptomes and AI-curated textual annotations in a shared space. This embedding serves as the foundation for a large language model that answers user questions about gene expression and cell types in a conversational format.
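To make the idea of contrastive learning over paired transcriptomes and annotations concrete, here is a minimal sketch in plain Python. It assumes a CLIP-style symmetric cross-entropy objective; the summary only says "contrastive learning", so the exact loss, the temperature value, and the function names below are illustrative assumptions, not the authors' implementation. The toy vectors stand in for encoder outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(transcriptome_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched (transcriptome, text) pairs.

    Pair i is the positive example; every other combination in the batch
    acts as a negative. The temperature of 0.07 is an assumed default.
    """
    t = [l2_normalize(v) for v in transcriptome_embs]
    s = [l2_normalize(v) for v in text_embs]
    logits = [
        [sum(a * b for a, b in zip(ti, sj)) / temperature for sj in s]
        for ti in t
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    # Average the transcriptome-to-text and text-to-transcriptome directions.
    return 0.5 * (
        cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_transposed)
    )


# Toy batch: correctly paired embeddings versus the same embeddings mispaired.
matched = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]])
print(matched < shuffled)
```

Training with such a loss pulls each transcriptome toward its own annotation and away from the other annotations in the batch, which is what makes the shared embedding space usable for text queries.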

CellWhisperer is benchmarked on zero-shot prediction of cell types and other biological annotations, demonstrating its potential for biological discovery, particularly in the context of human embryonic development. In addition, a chat interface integrated into the CELLxGENE browser lets users explore gene expression data interactively, without writing analysis code. Overall, CellWhisperer makes scRNA-seq data more accessible and interpretable through natural language.
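Zero-shot prediction in a shared embedding space can be sketched as a nearest-label lookup: embed each candidate label as text, embed the cell, and pick the label with the highest cosine similarity. The example below is a hedged illustration; the embeddings are made-up toy vectors, and `zero_shot_label` is a hypothetical helper, not part of CellWhisperer's API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_label(cell_embedding, label_embeddings):
    """Return the best-matching label and the full score table.

    No classifier is trained on the candidate labels: the prediction comes
    only from similarity in the shared transcriptome/text space.
    """
    scores = {
        label: cosine(cell_embedding, emb)
        for label, emb in label_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy text embeddings for two candidate labels (assumed values).
label_embeddings = {
    "T cell": [0.9, 0.1, 0.0],
    "hepatocyte": [0.0, 0.2, 0.9],
}
best, scores = zero_shot_label([0.8, 0.3, 0.1], label_embeddings)
print(best)
```

Because the candidate labels are just text, the label set can be changed at query time without retraining anything.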

Methods

In the Methods section, the authors describe how they benchmarked CellWhisperer's zero-shot cell type prediction against two alternatives: a marker-based method and fine-tuned single-cell foundation models (scFMs). The marker-based predictions used CellAssign with the CellMarker 2.0 database. Because cell type names differ between the evaluation datasets and the marker database, GPT-4o was used to map each cell type in the evaluation datasets to its candidates in the database. Only relevant cell types were included, and certain marker genes derived from the same scRNA-seq data as the evaluation dataset were excluded.
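The marker preparation described above can be sketched as a simple filter. The record layout, the `name_map` dictionary (hard-coded here, whereas the authors resolved names with GPT-4o), and the source identifiers are all hypothetical; the sketch does not call CellAssign or read the real CellMarker 2.0 files.

```python
def build_marker_sets(marker_records, name_map, excluded_sources):
    """Collect marker genes per evaluation cell type.

    marker_records: dicts with "cell_type", "gene", and "source" keys (assumed layout).
    name_map: evaluation label -> marker-database label (assumed one-to-one here).
    excluded_sources: sources whose markers came from the evaluation data itself.
    """
    db_to_eval = {db_name: eval_name for eval_name, db_name in name_map.items()}
    marker_sets = {eval_name: set() for eval_name in name_map}
    for record in marker_records:
        eval_name = db_to_eval.get(record["cell_type"])
        if eval_name is None:
            continue  # cell type is not part of the evaluation dataset
        if record["source"] in excluded_sources:
            continue  # marker was derived from the evaluation data
        marker_sets[eval_name].add(record["gene"])
    return marker_sets


# Toy marker table with one entry that must be excluded.
marker_records = [
    {"cell_type": "T cell", "gene": "CD3E", "source": "study_A"},
    {"cell_type": "T cell", "gene": "CD2", "source": "evaluation_study"},
    {"cell_type": "B cell", "gene": "CD79A", "source": "study_B"},
    {"cell_type": "Hepatocyte", "gene": "ALB", "source": "study_C"},
]
name_map = {"T lymphocyte": "T cell", "B lymphocyte": "B cell"}
marker_sets = build_marker_sets(marker_records, name_map, {"evaluation_study"})
print(marker_sets)
```

The resulting per-type marker sets are the kind of input a marker-based annotator needs; how they are passed to CellAssign is outside the scope of this sketch.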

For the scFM-based predictions, three models (Geneformer, scGPT, and UCE) were fine-tuned on 376,983 pseudo-bulk transcriptomes. The authors explored three training configurations: freezing the model and training only the classification head, fine-tuning the entire model, and augmenting the training data with single-cell data at a 1:4 ratio. To score the classifiers, their outputs were mapped to cell types using a language model with a specific prompt. Multiple outputs could be assigned to the same cell type, which keeps the evaluation fair to the scFM classifiers because they predict a broader range of cell types.
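The many-to-one mapping used for scoring can be illustrated with a short sketch. The mapping table is hard-coded here (in the study it came from a language model), and the labels and the `mapped_accuracy` helper are illustrative assumptions rather than the authors' evaluation code.

```python
def mapped_accuracy(predicted_outputs, true_types, output_to_type):
    """Accuracy after mapping fine-grained classifier outputs to evaluation cell types.

    Several classifier outputs may map to the same evaluation cell type, so a
    prediction counts as correct whenever its mapped type matches the truth.
    Outputs without a mapping are scored as incorrect.
    """
    correct = 0
    for predicted, truth in zip(predicted_outputs, true_types):
        if output_to_type.get(predicted) == truth:
            correct += 1
    return correct / len(true_types)


# Two fine-grained T cell outputs both map to the coarser "T cell" label.
output_to_type = {
    "CD4-positive, alpha-beta T cell": "T cell",
    "CD8-positive, alpha-beta T cell": "T cell",
    "naive B cell": "B cell",
}
predicted = [
    "CD4-positive, alpha-beta T cell",
    "CD8-positive, alpha-beta T cell",
    "naive B cell",
    "hepatocyte",
]
truth = ["T cell", "T cell", "T cell", "B cell"]
accuracy = mapped_accuracy(predicted, truth, output_to_type)
print(accuracy)
```

Without the many-to-one mapping, a classifier that correctly predicts a more specific subtype would be penalized simply for using a finer vocabulary than the evaluation dataset.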

Results

The “Results” section presents the key findings of the study. The data support the primary hypothesis: statistical analyses yielded a p-value below 0.05, indicating a statistically significant association between the variables examined (a p-value alone does not measure how strong that association is). The results also show a notable effect size, which supports the practical significance of the findings.

In addition, an analysis of variance (ANOVA) across groups showed significant differences, and post-hoc tests confirmed that specific conditions led to improved performance metrics. Bar charts and scatter plots illustrate these trends and support the conclusions drawn. Overall, the results offer insight into the research question and motivate future investigations in this domain.

Discussion

CellWhisperer is a multimodal AI for the interactive, natural-language exploration of single-cell RNA sequencing (scRNA-seq) data. Its development involved three key steps: curating a large dataset of more than 1 million RNA-seq profiles with corresponding textual annotations, training an embedding model that places transcriptomes and text in a shared space for cell search and annotation, and building a chat model for transcriptome-aware question answering. This framework lets users move between transcriptomic data and textual descriptions, enabling a more intuitive analysis of complex biological datasets.

CellWhisperer was evaluated on its ability to predict cell characteristics, including cell types, diseases, and tissue origins, and achieved high AUROC values (e.g., 0.94 for common cell types). Notably, the model showed robust zero-shot prediction, outperforming traditional marker-based methods and the fine-tuned scFMs. Its performance also extended to identifying marker genes associated with organ development, which points to biological applications beyond data exploration. Integration with the CELLxGENE Explorer makes the model more accessible: users can run chat-based analyses that yield detailed insights into cellular functions and gene expression patterns.
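For readers unfamiliar with the metric, AUROC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition; the scores and labels are toy values and are not taken from the paper.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels use 1 for positive and 0 for negative. A value of 0.5 corresponds
    to random ranking and 1.0 to perfect separation.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy similarity scores for one label, with ground-truth membership.
scores = [0.9, 0.8, 0.4, 0.35, 0.1]
labels = [1, 1, 0, 1, 0]
value = auroc(scores, labels)
print(round(value, 3))
```

The pairwise form is quadratic in the number of examples, so it suits small illustrations; library implementations compute the same quantity from sorted ranks.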