Speech-driven sentence type classification in Chokri using traditional and transfer learning methods

Journal: International Journal of Speech Technology, Volume: 29, Issue: 1
DOI: https://doi.org/10.1007/s10772-025-10233-w
Publication Date: 2026-01-07
Author(s): Amalesh Gope et al.
Primary Topic: Phonetics and Phonology Research

Overview

This paper investigates whether acoustic features extracted from speech can classify sentence types in Chokri, an endangered tonal language that lacks sufficient textual resources. The study applies both traditional machine learning and transfer learning algorithms to assess speech features, particularly Mel Frequency Cepstral Coefficients (MFCCs), Chroma, and other spectral features. MFCCs alone, as well as MFCCs combined with other features, achieve an average F1 score of 86%; the highest score, 88%, is obtained when all features are combined. Converting the speech data into spectrogram and MFCC images and applying deep learning models such as ResNet101 and InceptionResNet raises accuracy substantially, to 97-98%.
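As an illustration of how an average F1 score like the ones reported above can be computed, here is a minimal Python sketch. It assumes macro averaging (an unweighted mean of per-class F1), which this summary does not specify, and the labels are hypothetical examples rather than the study's data.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for parallel lists of true and predicted labels."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging mode)."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Hypothetical labels, for illustration only.
y_true = ["statement", "question", "command", "statement", "question", "command"]
y_pred = ["statement", "question", "command", "question", "question", "command"]
print(round(macro_f1(y_true, y_pred), 4))
```

Macro averaging gives every sentence type equal weight regardless of how many utterances it has; a support-weighted average would instead favor the frequent classes.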

The conclusions underscore that speech features can support sentence classification without textual data, and that robust accuracy is achievable even with limited datasets in low-resource languages. The study highlights the superior performance of image-based MFCCs in classifying complex sentence types. By using speech as the primary input, the research shows that accurate sentence classification is feasible in resource-constrained contexts. Future work will test these methods on other languages and dialects to establish how well the speech features generalize across language processing tasks.
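One way a numeric feature matrix can become an image is min-max scaling to 8-bit grayscale. The sketch below shows only that general idea under stated assumptions: this summary does not describe the authors' rendering pipeline, and the function name and the toy matrix are hypothetical.

```python
def to_grayscale(matrix):
    """Min-max scale a 2D list of floats to integer pixel values in 0..255."""
    values = [v for row in matrix for v in row]
    if not values:
        raise ValueError("matrix must contain at least one value")
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant matrix carries no contrast; map it to black.
        return [[0 for _ in row] for row in matrix]
    span = hi - lo
    return [[round((v - lo) / span * 255) for v in row] for row in matrix]


# Hypothetical 2x2 "coefficient" matrix (rows = coefficients, columns = frames).
toy_features = [[-30.0, 0.0], [10.0, 20.0]]
pixels = to_grayscale(toy_features)
print(pixels)
```

Pretrained image networks generally expect fixed-size, multi-channel inputs, so a real pipeline would also need resizing and channel handling, which are omitted here.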

Introduction

The introduction highlights the significance of speech analysis in signal processing, particularly for improving automatic speech recognition (ASR) systems and enabling real-time communication across languages. Recent advances in translation technologies, such as ChatGPT and Google Translate, have improved accessibility and interaction for users, especially in critical situations where understanding sentence types (statements, questions, and commands) is essential for effective communication. The study explores the classification of sentence types from speech data in Chokri, a Tibeto-Burman language, using machine learning (ML) techniques, particularly deep learning models such as Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks, the latter being designed for sequential data.

The research addresses the challenges of classifying sentence types from speech signals in under-resourced languages such as Chokri, which has a complex tonal system with five lexical tones. It asks whether speech features alone can accurately classify sentence types without textual input, identifies which speech features are effective for classification, and explores both numerical and image data as inputs to ML models. Two classification tasks are defined: the first covers four basic sentence types, and the second expands to nine classes, including complex statements. The paper emphasizes the importance of prosodic features, such as pitch elevation and final lengthening, in distinguishing sentence types in Chokri, and employs both traditional ML methods and transfer learning approaches to improve classification accuracy.
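To make "pitch elevation" concrete, the following Python sketch estimates fundamental frequency by autocorrelation and compares two segments. Autocorrelation is one common pitch estimator and is an assumption here, since this summary does not say how pitch was measured; the synthetic tones, sample rate, and search range are hypothetical.

```python
import math


def estimate_f0(samples, sample_rate, fmin=75.0, fmax=400.0):
    """Estimate fundamental frequency (Hz) as the lag with the highest autocorrelation."""
    min_lag = max(1, int(sample_rate / fmax))
    max_lag = min(int(sample_rate / fmin), len(samples) - 1)
    best_lag, best_value = None, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation of the signal with itself shifted by `lag`.
        value = sum(samples[n] * samples[n + lag] for n in range(len(samples) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    if best_lag is None:
        raise ValueError("signal too short for the requested pitch range")
    return sample_rate / best_lag


def tone(freq_hz, sample_rate=8000, n_samples=400):
    """Synthetic sine tone standing in for a voiced speech segment."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate) for n in range(n_samples)]


# Hypothetical utterance: lower pitch at the start, higher pitch at the end.
start_f0 = estimate_f0(tone(160.0), 8000)
end_f0 = estimate_f0(tone(200.0), 8000)
rise_hz = end_f0 - start_f0
print(start_f0, end_f0, rise_hz)
```

Real speech is noisier than a pure tone, so practical pitch trackers add voicing detection and smoothing on top of this basic idea.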

Methods

The section on experimental methods and data analysis outlines the procedures and techniques used to gather and interpret the data. The researchers designed the experiments systematically so that variables were controlled and measured accurately. Sample selection, instrumentation, and data collection protocols are detailed to make the experimental framework clear.

Data analysis was conducted with statistical tools and software. It combined descriptive statistics, to summarize the data, with inferential statistics, to test hypotheses and assess relationships and effects within the data. The findings were validated through repeated trials and cross-validation, which strengthens the reliability of the conclusions drawn from the experimental results.
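The cross-validation mentioned above can be sketched as a k-fold index split. This is a minimal illustration in Python: the number of folds, whether the data were shuffled, and whether the split was stratified by sentence type are not stated in this summary, so the values below are assumptions.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs that partition range(n_samples) into k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Hypothetical example: 10 utterances, 3 folds.
folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)
```

Each utterance appears in exactly one test fold, so every sample contributes once to the held-out evaluation. With a small dataset, shuffling and stratifying by class before splitting usually gives more stable estimates.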

Results

The Results section presents the findings of the study and the outcomes of the experiments. It reports key metrics and statistical analyses and highlights the trends and correlations observed in the data. The results answer the study's central question affirmatively: speech features alone can classify sentence types, with the overview citing average F1 scores of 86-88% for the MFCC-based and combined feature sets and 97-98% accuracy for the image-based deep learning models.

The section also includes charts and tables that make the results easier to read. The analysis reveals patterns consistent with theoretical expectations and points to implications for future research and practical applications. Overall, the findings advance the current understanding of speech-based sentence type classification.

Discussion

The discussion section analyzes the prosodic and acoustic features of Chokri sentence types, both qualitatively and quantitatively. The qualitative analysis finds that Chokri declarative sentences lack boundary tones and that the language relies primarily on morphological markers and lexical tonal specifications to differentiate sentence types. For instance, imperatives are marked by the morphological marker /tē/, while alternative questions do not exhibit pitch raising, in contrast to wh-questions. The study also evaluates several acoustic features, including Mel Frequency Cepstral Coefficients (MFCCs) and Chroma, for how effectively they classify sentence types with machine learning algorithms.
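For readers unfamiliar with the two feature families named above, this Python sketch shows the frequency mappings they build on: the mel scale behind MFCCs and the twelve pitch classes behind Chroma. The mel formula used is one common variant (the HTK-style formula), and A4 = 440 Hz is a conventional reference; this summary does not say which conventions the authors used, so both are assumptions.

```python
import math

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def hz_to_mel(freq_hz):
    """Map frequency in Hz to the mel scale (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def pitch_class(freq_hz, a4_hz=440.0):
    """Name of the nearest equal-tempered pitch class, ignoring the octave."""
    midi_note = round(69 + 12 * math.log2(freq_hz / a4_hz))
    return PITCH_CLASSES[midi_note % 12]


print(round(hz_to_mel(1000.0), 1))            # close to 1000 mel by construction
print(pitch_class(440.0), pitch_class(880.0))  # same pitch class, different octave
```

MFCCs summarize the spectral envelope on this perceptually spaced mel axis, while Chroma folds all octaves together so that energy is tracked per pitch class.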

Quantitatively, the research uses a random forest classifier to evaluate the classification performance of different feature sets. MFCCs alone and the combined feature set yield the highest area under the curve (AUC) values, with the strongest classification performance on imperatives and lists. Questions and statements, however, show lower recall, indicating that these types are harder to distinguish. The study also highlights the effectiveness of transfer learning with InceptionResNet and ResNet101, which achieve high accuracy in classifying sentence types, particularly complex structures. Overall, the findings underscore the importance of careful feature selection and the potential of these methods to improve speech-to-text systems and support documentation of under-resourced languages.
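The two metrics discussed above, AUC and recall, can be computed as in the following minimal Python sketch. It treats one sentence type against all others (a one-vs-rest reading, which is an assumption, as the summary does not say how multi-class AUC was obtained), and all scores and labels are hypothetical.

```python
def binary_auc(scores, labels):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def recall(y_true, y_pred, label):
    """Fraction of utterances of class `label` that were predicted as `label`."""
    relevant = [pred for truth, pred in zip(y_true, y_pred) if truth == label]
    if not relevant:
        raise ValueError("no examples of the requested label")
    return sum(1 for pred in relevant if pred == label) / len(relevant)


# Hypothetical classifier scores for "imperative" vs. every other type.
imperative_scores = [0.9, 0.8, 0.4, 0.35, 0.1]
is_imperative = [1, 0, 1, 0, 0]
auc = binary_auc(imperative_scores, is_imperative)

# Hypothetical predictions illustrating a low recall for questions.
y_true = ["question", "question", "statement", "imperative"]
y_pred = ["question", "statement", "statement", "imperative"]
question_recall = recall(y_true, y_pred, "question")
print(round(auc, 4), question_recall)
```

A class can have a high AUC yet low recall at the chosen decision threshold, which is one reason both metrics are worth reporting side by side.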