The dataset for extending EMNIST evaluation

Journal: Scientific Data, Volume: 13, Issue: 1
DOI: https://doi.org/10.1038/s41597-025-06291-z
PMID: https://pubmed.ncbi.nlm.nih.gov/41571670
Publication Date: 2026-01-22
Author(s): Julian Szymański et al.
Primary Topic: Handwritten Text Recognition Techniques

Overview

The paper presents a newly constructed dataset intended to improve the evaluation of machine learning models for handwritten character recognition. The dataset complements the existing NIST databases and supports a more thorough analysis of models built on those data sources. The authors survey publicly available machine learning models trained on the EMNIST-Letters dataset and discuss the difficulty of evaluating state-of-the-art results, particularly when accuracy comparisons are derived from cross-validation test sets.

The paper also advocates additional evaluation on new, independently constructed data that is not affiliated with the authors of the NIST database, with the aim of giving a more robust assessment of model performance. The dataset and the corresponding source code are available through Most Wiedzy, the repository of Gdańsk University of Technology, which supports transparency and reproducibility in this field.

Introduction

In the introduction, the authors highlight the significance of neural networks in machine learning, particularly for tasks such as Optical Character Recognition (OCR). They focus on the recognition of single handwritten letters of the Latin alphabet, using three top-performing models (WaveMixLite-112/16, VGG-5, and TextCaps) that were validated on the EMNIST-Letters dataset. The authors refactored the open-source code of these models and made it publicly available. They also introduce a new dataset, the “PG-Handwritten Letters Dataset” (PG-HWLD), to broaden the evaluation and to assess how well the models generalize.

The EMNIST dataset, derived from NIST Special Database 19, comprises over 814,000 images of handwritten characters and is designed to present a more challenging classification task than the well-established MNIST dataset. The authors note that previous evaluations on the EMNIST datasets reported high accuracy rates, with a peak accuracy of 85.15% on the EMNIST-Letters dataset. However, they argue that traditional evaluation methods may not adequately reflect the true generalization ability of the models, because the test sets are often drawn from the same data sources as the training sets. To address this limitation, the authors propose their dataset as a more robust benchmark for evaluating handwritten character recognition models.

Methods

In this study, a dataset of 17,160 handwritten samples of uppercase and lowercase letters was collected from 52 participants, primarily university students. Participants filled out forms with standardized grids using black markers, and the completed forms were scanned as RGB images. The scans then went through a preprocessing workflow: they were cut into 196 × 196 samples, converted to a format consistent with the EMNIST dataset, and organized into class-specific directories. The dataset, named PG-HWLD, is class-balanced, with each letter represented by 660 samples (26 letters × 660 = 17,160). Care was taken during retouching to preserve the character shapes, and samples that extended beyond the grid boundaries were trimmed in a way that kept them recognizable.
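
The size reduction implied by this workflow (a 196 × 196 cut-out ending up as a 28 × 28 sample, a factor of 7) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' pipeline: it swaps in plain block averaging and an intensity inversion, since the summary does not detail the actual conversion steps. The function name `downsample_and_invert` and the list-of-lists image representation are hypothetical, chosen to keep the example free of dependencies; a real pipeline would more likely use an imaging library.

```python
def downsample_and_invert(img, factor=7):
    """Block-average a square grayscale image, then invert its intensities.

    img: list of rows, each a list of ints in 0..255 (dark ink on light paper).
    Returns a smaller image (side // factor) with light ink on a dark background.
    NOTE: block averaging is an assumption made for this sketch only.
    """
    size = len(img)
    if size == 0 or size % factor != 0 or any(len(row) != size for row in img):
        raise ValueError("expected a square image whose side is a multiple of factor")

    out_size = size // factor
    area = factor * factor
    out = []
    for by in range(out_size):
        row = []
        for bx in range(out_size):
            # Sum all pixels of the factor x factor block at (by, bx).
            total = sum(
                img[by * factor + dy][bx * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            )
            mean = total // area      # integer mean stays within 0..255
            row.append(255 - mean)    # invert so that ink becomes white
        out.append(row)
    return out
```

Calling `downsample_and_invert(sample)` on a 196 × 196 cut-out returns a 28 × 28 grid in which the strokes are bright and the paper is dark, matching the white-on-black convention described for the published samples.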

Participants were recruited through Gdańsk University of Technology. Participation was voluntary, and no monetary compensation was offered. The participants, aged 19 to 25, gave informed consent for their handwriting samples to be processed and published, with the assurance that the data would be anonymized for machine learning applications. The study’s procedures were approved by the Ethics Commission of Gdańsk University of Technology, in accordance with the Declaration of Helsinki.

Discussion

The PG-HWLD dataset, designed as an extension of the EMNIST-Letters dataset, comprises 17,160 samples of handwritten letters stored in PNG format. Each image is an 8-bit grayscale representation of a single letter, down-sampled to 28 × 28 pixels, with the letter shown in white on a black background. The dataset is organized into class directories for both uppercase and lowercase letters and is also available in .mat format for compatibility with various machine learning models. Although the dataset aims to increase the diversity of handwriting samples, it primarily reflects the handwriting styles of a narrow demographic: university students aged 19 to 25. This may introduce demographic biases that affect model performance in fine-grained shape discrimination tasks.
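
The class balance described for the dataset can be checked with a short script. The sketch below is a minimal illustration, assuming that uppercase and lowercase forms of a letter count toward the same class (which is what 26 × 660 = 17,160 suggests) and that a label is a single character; the function name `check_class_balance` is hypothetical and is not part of the published tools.

```python
import string
from collections import Counter


def check_class_balance(labels, expected_per_class=660):
    """Return the letters whose sample count differs from the expected value.

    labels: iterable of single-character labels, e.g. "a" or "A".
    Uppercase and lowercase are merged per letter (an assumption of this sketch).
    An empty result means every one of the 26 letters is balanced.
    """
    counts = Counter(label.lower() for label in labels)
    return {
        letter: counts.get(letter, 0)
        for letter in string.ascii_lowercase
        if counts.get(letter, 0) != expected_per_class
    }
```

In practice the labels would come from the class directory names or from the label array in the .mat file; how those are named is not stated here, so that loading step is left out.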

Technical validation of the PG-HWLD dataset involved manual review by annotators and testing with three machine learning models: VGG-5, TextCaps, and WaveMixLite-112/16. The results indicated that the models performed comparably on the EMNIST-Letters dataset but faced greater difficulty on PG-HWLD, which highlights its role as a harder benchmark for handwritten letter recognition. The dataset is intended to support the study of model generalization across handwriting styles, addressing the common drop in performance when models encounter data from a new source. The repository includes tools for dataset preparation and model testing, so that others can further explore and validate the dataset’s usefulness in optical character recognition tasks.
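
The comparison described above, accuracy on a test set from the training source versus accuracy on an independently collected set, reduces to a simple calculation. The following sketch shows one way to express it; the function names `accuracy` and `generalization_gap` are hypothetical, and any labels passed to them in an example are made-up toy values, not results from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def generalization_gap(in_source, out_of_source):
    """Accuracy drop from an in-source test set to an independent one.

    Each argument is a (y_true, y_pred) pair for the same trained model.
    A positive value means the model does worse on the independent data.
    """
    return accuracy(*in_source) - accuracy(*out_of_source)
```

For a model evaluated on both datasets, `generalization_gap((emnist_true, emnist_pred), (pg_true, pg_pred))` would give the size of the performance drop, where the four label lists are placeholders for that model's outputs on each test set.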