PhonoMetric: a dual-metric engine for real-time English language accent evaluation and personalized speech training for Indian learners

Journal: Frontiers in Communication, Volume: 10
DOI: https://doi.org/10.3389/fcomm.2025.1704484
Publication Date: 2026-01-05
Author(s): Rajkumaran Soundarraj et al.
Primary Topic: Phonetics and Phonology Research

Overview

This study introduces a novel method for assessing and improving spoken English pronunciation relative to a target accent, using speech processing and information retrieval techniques. At the core of the system is an ECAPA-TDNN model fine-tuned on American-accented speech, which generates speaker embeddings from the user’s audio. These embeddings are compared with reference accent samples using cosine similarity to produce an Accent Similarity Score (ASS). The user’s speech is also transcribed and aligned at the phoneme level with a reference sentence, allowing proficiency classification (Beginner, Intermediate, Advanced) based on semantic and phonetic proximity, as well as comprehensible errors. An experimental evaluation with 30 undergraduate students reported classification accuracies of 91.3%, 88.6%, and 93.1% for the respective proficiency levels, a strong negative correlation (r = -0.82) between phoneme error rate (PER) and ASS, and a user satisfaction score of 4.6/5.
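As a rough illustration of how an ASS could be derived from embeddings, the sketch below computes cosine similarity in plain Python. The function names, the averaging over several references, and the mapping to a 0 to 100 scale are assumptions made for illustration; the summary does not specify them. In practice the vectors would come from the ECAPA-TDNN model rather than hand-written lists.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def accent_similarity_score(user_emb, reference_embs):
    """Mean cosine similarity to the references, mapped to 0-100."""
    if not reference_embs:
        raise ValueError("at least one reference embedding is required")
    sims = [cosine_similarity(user_emb, ref) for ref in reference_embs]
    mean_sim = sum(sims) / len(sims)
    # Assumption: negative similarity is clipped to 0, the rest scaled by 100.
    return 100.0 * max(0.0, mean_sim)
```

For example, with a user embedding `[1.0, 0.0]` and references `[1.0, 0.0]` and `[0.0, 1.0]`, the two similarities are 1 and 0, so the score is 50.0.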

In conclusion, this research presents a structured and scalable approach to accent assessment and pronunciation feedback through the integration of deep learning models and phoneme-level analysis. The system provides users with quantitative feedback, proficiency classification, and personalized resources, transforming passive assessment into active learning. This innovative feedback loop is designed to support non-native speakers in achieving fluency and could be adapted for use in educational settings, training centers, and language assessment environments. Future enhancements may include support for additional English dialects, real-time pronunciation correction, and features such as gamification and adaptive learning through reinforcement learning, ultimately aiming to develop a comprehensive digital pronunciation coach.

Introduction

The introduction of the research paper highlights the challenges faced by Tamil-speaking students learning English, particularly regarding pronunciation and accent. Despite having strong grammar and vocabulary skills, these learners often struggle with communication due to their accents, leading to misunderstandings and diminished self-confidence. Traditional language instruction has primarily focused on syntax and vocabulary, neglecting the critical aspect of pronunciation, especially for learners from diverse linguistic backgrounds. Existing Computer-Assisted Language Learning (CALL) systems have not adequately addressed these nuanced needs, often providing static drills or basic automatic speech recognition (ASR) feedback that lacks depth in accent-specific issues.

To address these gaps, the study introduces PhonoMetric, a dual-metric engine designed to improve spoken English pronunciation by focusing on accent, specifically American English. The system comprises two key components: the Accent Similarity Score (ASS) and the Phoneme Error Rate (PER). The ASS uses embeddings from an ECAPA-TDNN model trained on American-accented speech and measures accent similarity as the cosine similarity between the learner’s embedding and native reference embeddings. Concurrently, the Whisper ASR model transcribes the user’s speech and aligns it at the phoneme level, identifying articulation issues through the PER. The approach provides personalized content recommendations based on the learner’s ASS, PER, and first-language background, and is presented as a significant advance in pronunciation training through its integration of deep learning with intelligent content retrieval. This lets learners follow active, informed learning pathways and makes automated pronunciation coaching a more effective and tailored educational experience.
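A phoneme error rate is conventionally computed as the edit distance between the reference and spoken phoneme sequences, divided by the reference length. The sketch below follows that convention; whether PhonoMetric uses exactly this formula is an assumption. The function names are illustrative, and the phoneme sequences are assumed to be already available from an upstream transcription and phonemization step that is not shown.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference phoneme
                curr[j - 1] + 1,     # insertion of an extra phoneme
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """Edit distance as a percentage of the reference length."""
    if not ref:
        raise ValueError("reference sequence must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)
```

For instance, `phoneme_error_rate(["θ", "ɪ", "ŋ", "k"], ["t", "ɪ", "ŋ", "k"])` gives 25.0, because one of the four reference phonemes is substituted.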

Methods

The methodology section outlines the systematic approach employed in the research to investigate the specified hypotheses. It details the experimental design, including the selection criteria for participants, the tools and techniques used for data collection, and the statistical methods applied for analysis. The study utilized a combination of quantitative and qualitative methods to ensure a comprehensive understanding of the phenomena under investigation.

Data collection involved structured surveys and controlled experiments, with a focus on maintaining reliability and validity throughout the process. Statistical analyses were performed using software tools to evaluate the significance of the findings, employing tests such as ANOVA and regression analysis to interpret the relationships between variables. The methodology was designed to minimize bias and enhance the reproducibility of results, thereby contributing to the robustness of the study’s conclusions.

Results

In this study, the final version of the accent evaluation system was tested on 30 users from diverse linguistic backgrounds, specifically Tamil, Hindi, Malayalam, and Telugu. Each participant spoke five phonetically rich English sentences, and their accents were assessed using ECAPA-TDNN embeddings. The Accent Similarity Score (ASS) ranged from 42.6 to 91.3, with a mean of 68.2, while the phoneme error rate (PER) ranged from 8.2% to 38.7%, averaging 21.5%. Users were categorized into three proficiency levels: 10 as beginners, 12 as intermediate, and 8 as advanced. Notably, a negative correlation was observed between PER and ASS, indicating that higher phoneme error rates corresponded to lower accent scores.
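The relationship between PER and ASS is the kind of result a Pearson correlation coefficient captures. The following is a minimal sketch of that calculation in plain Python; the function name is illustrative, and the summary does not state which tool the authors used.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)
```

Passing per-user PER values as `xs` and the matching ASS values as `ys` returns a value between -1 and 1; a result near -1 means that users with more phoneme errors tend to receive lower accent scores.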

Furthermore, the system generated tailored YouTube pronunciation courses based on users’ native languages and performance levels, exemplified by a Tamil-speaking beginner being directed to “American English Accent for Tamil Speakers – Beginner Practice.” The generated content was relevant, engaging, and aligned with users’ proficiency levels. The system also monitored specific phoneme errors, with common substitutions identified, such as /θ/ → /t/ (e.g., “think” pronounced as “tink”), /v/ → /w/ (e.g., “very” pronounced as “wery”), and /dʒ/ → /z/ (e.g., “judge” pronounced as “juzz”). Visualizations, including histograms and scatter plots, were employed to analyze ASS and PER distributions, while bar charts highlighted frequently mispronounced phonemes.
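One way to surface recurring substitutions such as /θ/ → /t/ is to align each reference phoneme sequence with what the learner produced and count the mismatched pairs. The sketch below does this with Python's standard `difflib.SequenceMatcher`, which is a simplification and not necessarily the alignment method used in the study. The function name and the input format (pairs of phoneme lists) are assumptions.

```python
from collections import Counter
from difflib import SequenceMatcher


def tally_substitutions(pairs):
    """Count (expected, produced) phoneme substitutions.

    `pairs` is an iterable of (reference, spoken) phoneme lists.
    """
    counts = Counter()
    for ref, hyp in pairs:
        matcher = SequenceMatcher(None, ref, hyp, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            # Only one-to-one replacements are counted; insertions, deletions,
            # and uneven replacement spans are ignored in this simplified sketch.
            if tag == "replace" and (i2 - i1) == (j2 - j1):
                counts.update(zip(ref[i1:i2], hyp[j1:j2]))
    return counts
```

Calling `tally_substitutions(...).most_common()` then lists the most frequent substitutions first, which is the information a bar chart of mispronounced phonemes would display.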

Discussion

The discussion section of the research paper highlights advances in Computer-Assisted Language Learning (CALL) through the integration of Automatic Speech Recognition (ASR) and speaker verification technologies. It critiques existing CALL systems for their limited focus on pronunciation accuracy at the word or sentence level, often neglecting nuanced aspects such as accent and fluency. The study emphasizes the potential of speaker embeddings, particularly from models like ECAPA-TDNN, to provide a more detailed analysis of accent similarity alongside phoneme-level pronunciation errors. This dual-metric approach, which uses the Accent Similarity Score (ASS) and Phoneme Error Rate (PER), allows for tailored feedback that can adapt to individual learners’ backgrounds, thus acting as a virtual pronunciation coach.

Moreover, the findings reveal distinct patterns in phoneme errors among learners from similar linguistic backgrounds, underscoring the importance of personalized feedback. The proposed system not only assesses pronunciation but also recommends targeted resources, such as YouTube videos, based on the learner’s proficiency level and native language. While the study presents a robust framework for accent assessment, it acknowledges limitations, including the need for clean audio inputs and the absence of real-time corrective feedback. Future enhancements could involve expanding the system to accommodate various English dialects and incorporating adaptive learning mechanisms to further personalize the learning experience. Overall, this research contributes significantly to the field of CALL by offering a scalable, intelligent solution for improving non-native English pronunciation.
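The recommendation step described above can be pictured as mapping the two metrics to a proficiency level and then composing a search query from that level and the learner's native language. The sketch below is purely illustrative: the cut-off values, function names, and query wording are hypothetical and are not taken from the study, which does not report its thresholds in this summary.

```python
def proficiency_level(ass, per):
    """Map ASS (0-100) and PER (percent) to a proficiency level."""
    # Hypothetical thresholds chosen only for illustration, not from the study.
    if ass >= 80.0 and per <= 15.0:
        return "Advanced"
    if ass < 60.0 or per > 30.0:
        return "Beginner"
    return "Intermediate"


def build_search_query(native_language, level, target_accent="American English"):
    """Compose a video search query from the learner's profile."""
    return f"{target_accent} accent for {native_language} speakers {level} practice"
```

Under these assumed cut-offs, a learner with an ASS of 68.2 and a PER of 21.5 would be labelled Intermediate, and `build_search_query("Tamil", "Beginner")` returns "American English accent for Tamil speakers Beginner practice".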