A unified acoustic-to-speech-to-language embedding space captures the neural basis of natural language processing in everyday conversations

Journal: Nature Human Behaviour, Volume: 9, Issue: 5
DOI: https://doi.org/10.1038/s41562-025-02105-9
PMID: https://pubmed.ncbi.nlm.nih.gov/40055549
Publication Date: 2025-03-07
Author(s): Ariel Goldstein et al.
Primary Topic: Speech and dialogue systems

Overview

This study presents a unified computational framework that integrates acoustic, speech, and word-level linguistic structure to investigate the neural mechanisms underlying everyday conversations. Using electrocorticography (ECoG), the researchers recorded neural signals during roughly 100 hours of speech production and comprehension in real-life interactions. They used Whisper, a multimodal speech-to-text model, to extract embeddings at three levels (low-level acoustic, mid-level speech, and contextual word embeddings) and built encoding models that map these embeddings onto brain activity. The encoding models predicted neural activity with high accuracy at each level of the language processing hierarchy, and the pattern of predictions aligned with the cortical hierarchy for speech and language processing.
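To make the idea of an encoding model concrete, here is a minimal sketch in plain Python. It assumes ridge-regularized linear regression and a Pearson correlation on held-out words as the score; the summary above does not say which regression or metric the authors used, and the data here are synthetic stand-ins for embeddings and a single electrode's activity.

```python
import math
import random


def ridge_fit(X, y, alpha=1.0):
    """Solve (X^T X + alpha*I) w = X^T y by Gaussian elimination."""
    d = len(X[0])
    # Build the normal-equation matrix A and right-hand side b.
    A = [[sum(row[i] * row[j] for row in X) + (alpha if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(row[i] * t for row, t in zip(X, y)) for i in range(d)]
    M = [A[i] + [b[i]] for i in range(d)]  # augmented system [A | b]
    for col in range(d):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, d), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, d):
            f = M[r][col] / M[col][col]
            for c in range(col, d + 1):
                M[r][c] -= f * M[col][c]
    # Back substitution.
    w = [0.0] * d
    for i in reversed(range(d)):
        w[i] = (M[i][d] - sum(M[i][j] * w[j] for j in range(i + 1, d))) / M[i][i]
    return w


def predict(X, w):
    return [sum(xi * wi for xi, wi in zip(row, w)) for row in X]


def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


# Synthetic stand-in data: 200 "words" with 5-dimensional "embeddings"
# and one simulated electrode response (illustrative values only).
rng = random.Random(0)
true_w = [0.5, -1.0, 0.25, 0.0, 2.0]
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(200)]
y = [sum(x * w for x, w in zip(row, true_w)) + rng.gauss(0, 0.5) for row in X]

# Fit on the first 150 words, score on the held-out 50.
w_hat = ridge_fit(X[:150], y[:150], alpha=1.0)
score = pearson(predict(X[150:], w_hat), y[150:])
print(f"held-out correlation: {score:.2f}")
```

In practice one such model would be fit per electrode and per embedding type, so that the held-out scores can be compared across regions.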

The findings indicate that sensory and motor regions correspond more closely with the model’s speech embeddings, while higher-level language areas align with language embeddings. The Whisper model captures the temporal dynamics of language-to-speech encoding prior to word articulation and speech-to-language encoding following articulation. This approach surpasses traditional symbolic models in representing neural activity associated with natural speech and language, suggesting a shift towards unified computational models that encompass the entire processing hierarchy in real-world conversational contexts. The research aims to enhance understanding of how the brain facilitates complex, context-dependent behaviors, particularly the unique human ability to communicate thoughts through language in open-ended conversations.

Methods

The study takes a quantitative, observational approach. The researchers recorded electrocorticography (ECoG) and speech continuously from four patients during spontaneous conversations in an epilepsy unit, then transcribed the conversations, aligned them with the ECoG signals, and labeled each segment as comprehension (listening) or production (speaking). For each word they extracted acoustic, speech, and language embeddings from Whisper and fit encoding models that map those embeddings to neural activity.

Statistical significance was assessed at p < 0.05. Overall, the methodological framework is designed to support rigorous hypothesis testing and reproducible findings.

Results

In this study, we recorded electrocorticography (ECoG) and speech signals continuously (24/7) from four patients during their spontaneous conversations in an epilepsy unit. A total of 676 intracranial electrodes were implanted, mostly in the left hemisphere, of which 644 remained usable after corrupted recordings were excluded. The recordings provided extensive coverage of critical language areas, including the inferior frontal gyrus (IFG) and superior temporal gyrus (STG), and yielded a rich dataset of 17 to 37 hours of conversation per patient, amounting to tens of thousands of words.

We developed a preprocessing pipeline to transcribe conversations and align them with ECoG signals, categorizing the data into comprehension (listening) and production (speaking) phases. Using the multimodal capabilities of the Whisper model, we extracted "acoustic embeddings," "speech embeddings," and "language embeddings" for each word. We extracted language embeddings in two ways: one isolated the decoder from speech input, and the other allowed both speech and text inputs to influence the embeddings. Comparing the two let us investigate how speech input modulates the language embeddings.
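The alignment and labeling step can be sketched as follows. This is an illustrative sketch only: the `Word` record, the speaker labels, the example transcript, and the 512 Hz sampling rate are all assumptions made for the example and are not given in the source.

```python
from dataclasses import dataclass

SAMPLING_RATE_HZ = 512  # assumed for illustration; not given in the source


@dataclass
class Word:
    text: str
    onset_s: float  # word onset in seconds from the start of the recording
    speaker: str    # "patient", or any other label for a conversation partner


def onset_sample(word, rate_hz=SAMPLING_RATE_HZ):
    """Convert a word onset in seconds to the nearest ECoG sample index."""
    return round(word.onset_s * rate_hz)


def split_by_phase(words):
    """Words spoken by the patient are production; all others are comprehension."""
    phases = {"production": [], "comprehension": []}
    for w in words:
        key = "production" if w.speaker == "patient" else "comprehension"
        phases[key].append((w.text, onset_sample(w)))
    return phases


# Hypothetical five-word exchange.
transcript = [
    Word("how", 0.50, "visitor"),
    Word("are", 0.75, "visitor"),
    Word("you", 1.00, "visitor"),
    Word("fine", 2.25, "patient"),
    Word("thanks", 2.50, "patient"),
]
phases = split_by_phase(transcript)
print(phases["production"])
print(phases["comprehension"])
```

Each word ends up paired with a sample index into the neural recording, which is what later per-word analyses need in order to pull out the matching stretch of ECoG signal.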

Discussion

In this study, we investigated the neural processes underlying natural speech production and comprehension using electrocorticography (ECoG) recordings from approximately 100 hours of spontaneous conversations. We employed a unified acoustic-to-speech-to-language model, Whisper, to extract embeddings at various levels and constructed encoding models to map these embeddings to neural activity associated with speech. Our findings revealed that speech embeddings, which capture contextual information, significantly outperformed acoustic embeddings in predicting neural activity across multiple cortical regions, particularly in the superior temporal gyrus (STG) and inferior frontal gyrus (IFG). The analysis demonstrated a distributed processing hierarchy, with sensory areas being better modeled by speech embeddings and higher-order language areas being better represented by language embeddings.
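One simple way to summarize such a hierarchy is to ask, for each electrode, which embedding type gives the best encoding score. The sketch below assumes that summary rule; the electrode names and correlation values are made up for illustration and are not results from the study.

```python
def preferred_embedding(scores):
    """Return, per electrode, the embedding type with the highest score.

    `scores` maps electrode name -> {embedding type -> encoding correlation}.
    """
    return {elec: max(by_type, key=by_type.get) for elec, by_type in scores.items()}


# Made-up correlations for three hypothetical electrodes.
scores = {
    "STG_01": {"acoustic": 0.12, "speech": 0.31, "language": 0.18},
    "IFG_07": {"acoustic": 0.05, "speech": 0.14, "language": 0.27},
    "SM_03": {"acoustic": 0.10, "speech": 0.22, "language": 0.09},
}
best = preferred_embedding(scores)
for elec, emb in best.items():
    print(elec, "->", emb)
```

A winner-takes-all label like this discards the size of the difference between embedding types, so a fuller analysis would also test whether that difference is statistically reliable.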

Additionally, we observed distinct temporal dynamics in neural encoding during speech production and comprehension. Encoding performance peaked approximately 500 ms before word onset during production and 300 ms after word onset during comprehension, indicating a complex interplay between speech and language processing across different brain regions. Our results suggest that speech and language information is integrated within multiple cortical areas rather than being processed in isolation, challenging traditional modular views of language processing. Overall, this comprehensive analysis highlights the efficacy of deep multimodal models in capturing the intricate relationships between speech and language representations in the brain during real-world conversations.
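A lag analysis of this kind can be sketched by scoring the encoding at a range of offsets relative to word onset and taking the offset with the highest score. The sketch below is a simplified illustration on synthetic data: it uses a single scalar feature per word instead of a full embedding, measures lags in samples instead of milliseconds, and plants a response after onset (as in comprehension). All numeric values are arbitrary assumptions.

```python
import math
import random


def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def encoding_by_lag(feature, onsets, signal, lags):
    """Correlate a per-word feature with the signal sampled at onset + lag."""
    return {lag: pearson(feature, [signal[t + lag] for t in onsets]) for lag in lags}


rng = random.Random(1)
n_samples, true_lag = 5000, 30       # lag in samples; both values are arbitrary
onsets = list(range(200, 4800, 50))  # one hypothetical word every 50 samples
feature = [rng.gauss(0, 1) for _ in onsets]

# Background noise plus a response to each word, placed `true_lag` samples after onset.
signal = [rng.gauss(0, 0.3) for _ in range(n_samples)]
for t, f in zip(onsets, feature):
    signal[t + true_lag] += f

lags = range(-100, 101, 10)
scores = encoding_by_lag(feature, onsets, signal, lags)
peak_lag = max(scores, key=scores.get)
print("peak lag (samples):", peak_lag)
```

For production data the same procedure would be expected to place the peak at a negative lag, since the relevant neural activity precedes articulation.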