Enriching the Korean learner corpus for grammatical error correction and writing assessment

Journal: Language Resources and Evaluation, Volume: 60, Issue: 1
DOI: https://doi.org/10.1007/s10579-025-09882-9
Publication Date: 2026-01-08
Author(s): Jayoung Song et al.
Primary Topic: Second Language Acquisition and Learning

Overview

This paper presents KoLLA v2.0, an enhanced Korean learner corpus designed to address the scarcity of resources for Korean L2 writing. The updated corpus adds multi-reference grammatical error correction (GEC) annotations and rubric-based essay scores, making it the first resource of its kind for Korean. By adding a second human correction for each sentence, the corpus captures the variability of valid corrections in a morphologically complex language and so supports a more nuanced evaluation of GEC systems. The rubric-based scores, built on criteria from the National Institute of Korean Language, provide standardized assessments of grammatical accuracy, coherence, and lexical diversity.

The study positions KoLLA v2.0 as both a benchmark for evaluating automated error correction and automated essay scoring (AES) systems and a roadmap for future research. It highlights the need for larger-scale data collection across genres and proficiency levels, as well as the integration of neural AES models for more adaptive feedback. The paper acknowledges the challenges posed by Korean’s agglutinative morphology, which complicates error spans and correction boundaries, and stresses the importance of high-quality, systematically annotated data for advancing Korean L2 writing research. The enriched KoLLA v2.0 thus serves as a foundational resource for both immediate evaluation and future developments in the field.

Introduction

The introduction highlights the significance of written and spoken data from second language (L2) learners for understanding L2 acquisition, particularly as collected in electronic learner corpora. While tools such as the UCLouvain Error Editor and ERRANT have advanced error annotation for widely studied languages, Korean remains underrepresented in this domain despite its growing popularity in language education. Existing Korean learner corpora, such as KoLLA and the KyungHee corpus, lack multi-reference grammatical error correction (GEC) data and rubric-based scoring, both of which are crucial for current GEC and automated essay scoring (AES) research.

To address these limitations, the paper introduces KoLLA v2.0, an enhanced version of the original KoLLA dataset. The new corpus includes two human corrections for each learner sentence, making it the first multi-reference GEC corpus for Korean and allowing it to capture the variability of valid corrections in a morphologically complex language. It also incorporates rubric-based scores aligned with the guidelines of the National Institute of Korean Language, allowing for a more nuanced evaluation of higher-order writing features. The enriched KoLLA v2.0 dataset, comprising 100 essays from beginner and intermediate Korean L2 learners, serves as both a benchmark for developing NLP systems and a pedagogically relevant resource for studying L2 learning, thereby helping to close a gap in Korean learner corpus research.
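The two annotation layers described above can be pictured as a simple record type. The sketch below is only an illustration under stated assumptions: the class names, field names, score values, and the example sentence are hypothetical and are not taken from the corpus or its release format. Only the facts that each sentence carries two human corrections, that learners are beginner or intermediate, and that the rubric covers lexis, grammar, organization, and content come from the paper summary.

```python
from dataclasses import dataclass

# Rubric criteria named in the paper summary; the score scale is not specified there.
RUBRIC_CRITERIA = ("lexis", "grammar", "organization", "content")
LEVELS = ("beginner", "intermediate")


@dataclass(frozen=True)
class LearnerSentence:
    """One learner sentence with its two human corrections (hypothetical layout)."""
    source: str
    references: tuple[str, str]


@dataclass(frozen=True)
class Essay:
    """One essay with sentence-level GEC data and essay-level rubric scores."""
    essay_id: str
    level: str
    sentences: tuple[LearnerSentence, ...]
    rubric_scores: dict[str, int]

    def __post_init__(self) -> None:
        # Reject records whose proficiency level or rubric layer is incomplete.
        if self.level not in LEVELS:
            raise ValueError(f"unknown level: {self.level!r}")
        missing = set(RUBRIC_CRITERIA) - set(self.rubric_scores)
        if missing:
            raise ValueError(f"missing rubric criteria: {sorted(missing)}")


# Invented example, not corpus data: both corrections repair the particle error.
sentence = LearnerSentence(
    source="저는 친구가 만났어요",
    references=("저는 친구를 만났어요", "저는 친구와 만났어요"),
)
essay = Essay(
    essay_id="demo-001",
    level="beginner",
    sentences=(sentence,),
    rubric_scores={"lexis": 3, "grammar": 2, "organization": 3, "content": 4},
)
```

Keeping the corrections at sentence level and the rubric scores at essay level mirrors the way the summary describes the two layers, but the actual distribution format may differ.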

Results

The multi-layered annotation in the augmented KoLLA corpus (KoLLA v2.0) gives a clearer picture of the linguistic characteristics of Korean L2 learner writing. This section highlights two main implications: first, multi-reference grammatical error correction (GEC) captures more of the variability among valid corrections and improves the evaluation of system performance; second, rubric-based scoring provides a standardized method for assessing overall writing quality and learner proficiency in a pedagogically relevant manner.

The integration of multi-reference GEC with rubric-based scoring in the enhanced KoLLA corpus facilitates a comprehensive evaluation of Korean L2 writing. Multi-reference GEC effectively addresses the linguistic variability and ambiguity present in the Korean language, while rubric-based scoring establishes a structured framework for assessing diverse aspects of writing proficiency. Together, these methodologies not only advance computational approaches but also enrich classroom practices, leading to more accurate, equitable, and informative evaluations of learner language.
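Why a second reference makes system evaluation fairer can be shown with a small sketch. It is a deliberately simplified stand-in, not the paper's evaluation procedure: edits are extracted with Python's difflib over whitespace-separated tokens, scored with F0.5 against each reference, and the best score is kept. Established tools such as ERRANT use more careful alignment, and for Korean a morpheme-level alignment would likely be more appropriate. The function names and the example sentences are invented for illustration.

```python
from difflib import SequenceMatcher


def extract_edits(source: str, target: str) -> set:
    """Return the token-level edits that turn source into target.

    Each edit is (start, end, replacement_tokens) over whitespace tokens.
    """
    src, tgt = source.split(), target.split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    return {
        (i1, i2, tuple(tgt[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    }


def f_beta(hyp_edits: set, ref_edits: set, beta: float = 0.5) -> float:
    """F-beta over edit sets; beta=0.5 weights precision above recall."""
    if not hyp_edits and not ref_edits:
        return 1.0  # nothing to correct and nothing changed
    true_pos = len(hyp_edits & ref_edits)
    precision = true_pos / len(hyp_edits) if hyp_edits else 1.0
    recall = true_pos / len(ref_edits) if ref_edits else 1.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def multi_reference_score(source: str, hypothesis: str, references: list) -> float:
    """Score a system output against every reference and keep the best match."""
    hyp_edits = extract_edits(source, hypothesis)
    return max(f_beta(hyp_edits, extract_edits(source, ref)) for ref in references)


# Invented example, not corpus data: two valid repairs of one particle error.
source = "저는 친구가 만났어요"
references = ["저는 친구를 만났어요", "저는 친구와 만났어요"]
system_output = "저는 친구와 만났어요"

single = multi_reference_score(source, system_output, references[:1])
multi = multi_reference_score(source, system_output, references)
print(single, multi)
```

With only the first reference, the system's valid correction counts as an error; with both references it is credited, which is the kind of penalty multi-reference annotation is meant to avoid.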

Discussion

The discussion outlines two additions to the KoLLA learner corpus: a multi-reference grammatical error correction (GEC) dataset and a rubric-based automated essay scoring (AES) dataset. The multi-reference GEC dataset enhances the original KoLLA corpus by providing two corrected versions of each sentence written by Korean L2 learners, thereby capturing linguistic variability and improving the evaluation of GEC systems. This approach is particularly beneficial for Korean, an agglutinative language with complex morphosyntactic features, because it allows for a more nuanced understanding of learner errors and promotes fairer assessments of system performance. The dataset’s annotation framework includes a detailed typology of grammatical errors, reflecting current practices in neural GEC models and facilitating cross-linguistic comparability.

In addition to the GEC enhancements, the rubric-based AES dataset aligns with the guidelines of the National Institute of Korean Language and offers a structured evaluation of L2 writing that covers not only grammatical accuracy but also higher-order writing skills such as coherence and content development. The rubric, developed through expert consensus, includes specific criteria for assessing lexis, grammar, organization, and content, ensuring clarity and objectivity in scoring. The inter-rater reliability achieved through rigorous training and calibration of annotators further supports the rubric’s effectiveness. Overall, the integration of these two datasets positions KoLLA v2.0 as a pioneering resource for advancing research in Korean L2 acquisition, instructional design, and automated assessment, while also providing insight into common learner errors and facilitating targeted pedagogical interventions. Future research directions include expanding the multi-reference GEC annotations and exploring the integration of discourse-level error assessments to enhance the dataset’s utility.
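The summary does not say which statistic was used to measure inter-rater reliability. As an illustration only, the sketch below computes quadratic weighted kappa, a measure commonly reported in essay scoring work, for two raters on one rubric criterion. The 1 to 5 score range and the ratings are assumptions made for the example, not values from the paper.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Agreement between two raters on an ordinal scale.

    1.0 means perfect agreement, 0.0 chance-level agreement; larger
    disagreements are penalized quadratically.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must score the same non-empty set of essays")
    n = max_score - min_score + 1
    if n < 2:
        raise ValueError("the scale needs at least two score points")
    total = len(rater_a)

    # Observed confusion matrix between the two raters.
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1

    # Marginal score histograms give the chance-expected matrix.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            expected = hist_a[i] * hist_b[j] / total
            numerator += weight * observed[i][j]
            denominator += weight * expected
    # If both raters gave every essay the same single score, treat it as full agreement.
    return 1.0 - numerator / denominator if denominator else 1.0


# Hypothetical grammar scores from two raters on an assumed 1-5 scale.
grammar_rater_1 = [3, 4, 2, 5, 3, 1]
grammar_rater_2 = [3, 4, 3, 5, 2, 1]
kappa = quadratic_weighted_kappa(grammar_rater_1, grammar_rater_2, 1, 5)
print(round(kappa, 3))
```

In practice the statistic would be computed separately for each rubric criterion, so that weak agreement on one criterion is not hidden by strong agreement on another.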