Extremism Detection in the Iraqi Dialect Based on Machine Learning

Journal: Iraqi Journal of Science
DOI: https://doi.org/10.24996/ijs.2025.66.2.25
Publication Date: 2025-02-28
Author(s): Redhaa Fadhil Sabri et al.
Primary Topic: Language, Linguistics, Cultural Analysis

Overview

The paper addresses extremism detection in natural language processing (NLP) for the Iraqi dialect, which is underrepresented in existing studies. The authors highlight the scarcity of online resources for the dialect despite its prevalence on social media platforms. To address this gap, the study applies machine learning techniques to two datasets: the Iraqi Facebook Comments Dataset (IFCD) and the Iraqi Tweets Dataset (ITD). The data underwent extensive pre-processing, including the removal of suffixes, prefixes, repeated letters, and stop words. Word embedding methods, including Gensim Word2Vec and FastText, were combined with classifiers such as Support Vector Machine (SVM), Logistic Regression (LR), K-Nearest Neighbor (KNN), and Gaussian Naive Bayes (GNB). The best-performing model achieved an F1-score of 0.9521, with precision and recall of 0.955 and 0.95, respectively.
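The pre-processing steps named above (removing prefixes, suffixes, repeated letters, and stop words) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the stop-word list, the affix lists, the minimum stem length, and all function names are assumptions chosen only for demonstration.

```python
import re

# Illustrative resources only (assumptions, not the authors' lists).
STOP_WORDS = {"في", "من", "على", "هذا"}
PREFIXES = ("بال", "وال", "ال", "و", "ب")   # longer prefixes are tried first
SUFFIXES = ("ات", "ين", "ها")

# A character followed by two or more copies of itself (a run of 3+).
REPEATED = re.compile(r"(.)\1{2,}")


def collapse_repeats(token):
    """Reduce any run of three or more identical characters to one."""
    return REPEATED.sub(r"\1", token)


def strip_affixes(token, min_stem=3):
    """Remove at most one prefix and one suffix, keeping >= min_stem characters."""
    for prefix in PREFIXES:
        if token.startswith(prefix) and len(token) - len(prefix) >= min_stem:
            token = token[len(prefix):]
            break
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            token = token[:-len(suffix)]
            break
    return token


def preprocess(text):
    """Tokenize, collapse elongated letters, drop stop words, strip affixes."""
    # Keep runs of Arabic letters (U+0621..U+064A); everything else splits tokens.
    tokens = re.findall(r"[\u0621-\u064A]+", text)
    tokens = [collapse_repeats(t) for t in tokens]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [strip_affixes(t) for t in tokens]
```

Runs of exactly two identical letters are left alone in this sketch, since doubled letters can be legitimate spelling; only elongations of three or more are collapsed.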

In the conclusion, the authors emphasize several key findings: the Iraqi vector model outperformed pre-trained models, effective pre-processing enhanced classification accuracy, and manual labeling prior to embedding improved results. Notably, the Iraqi Word2Vec model yielded the highest accuracy when paired with the SVM classifier. The authors advocate for the development of larger pre-trained Iraqi vector models and a standardized Iraqi stemmer to facilitate NLP research and promote the use of the Iraqi dialect. They also call for additional datasets on extremism in the Iraqi dialect to support future research.
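To make the embedding-plus-classifier idea concrete, the sketch below turns each document into the average of its word vectors and classifies it with a small K-Nearest Neighbor vote. It is an illustration under stated assumptions: the two-dimensional vectors, the English toy tokens, and the labels "extremist" and "neutral" are invented, and KNN is used only because it fits in a few lines without external libraries. The paper reports its best result with SVM over Iraqi Word2Vec vectors, which this sketch does not reproduce.

```python
import math
from collections import Counter

# Toy 2-dimensional embeddings (invented values); a trained model such as
# Word2Vec would supply learned vectors with many more dimensions.
EMBEDDINGS = {
    "good": [0.9, 0.1],
    "peace": [0.8, 0.2],
    "hate": [0.1, 0.9],
    "attack": [0.2, 0.8],
}


def doc_vector(tokens, embeddings=EMBEDDINGS, dim=2):
    """Average the vectors of known tokens; zero vector if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def knn_predict(query, train_vectors, train_labels, k=3):
    """Majority vote among the k training vectors nearest in Euclidean distance."""
    order = sorted(range(len(train_vectors)),
                   key=lambda i: math.dist(query, train_vectors[i]))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical, manually labeled training documents (already tokenized).
train_docs = [["good", "peace"], ["peace"], ["hate", "attack"], ["attack"]]
train_labels = ["neutral", "neutral", "extremist", "extremist"]
train_vectors = [doc_vector(d) for d in train_docs]

# Out-of-vocabulary tokens ("unknown") are simply skipped when averaging.
prediction = knn_predict(doc_vector(["hate", "unknown"]),
                         train_vectors, train_labels)
```

Averaging is one common way to obtain a fixed-length feature vector from word embeddings; the source does not state which aggregation the authors used.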

Introduction

The introduction of the research paper highlights the challenges posed by the proliferation of user-generated content on social networking sites, particularly in the context of hate speech and extremism. It notes that the increasing diversity of users allows for a wide range of opinions to be expressed, but this also complicates the monitoring and regulation of harmful content, especially in regions like Iraq, which are affected by conflict and sectarianism. The paper emphasizes the lack of comprehensive research focused on hate speech in the Iraqi dialect, despite the prevalence of extremist views in online discourse.

To address these gaps, the research aims to develop tailored methodologies for processing and analyzing content in the Iraqi dialect, which is often overlooked in existing studies that predominantly focus on Modern Standard Arabic (MSA). Key contributions of the study include the creation of new preprocessing techniques for Iraqi dialect words, the development of Iraqi-specific embedding vectors, and the establishment of a machine learning model designed to classify extremist and hate speech effectively. The paper outlines its structure, indicating that subsequent sections will review related work, detail the research methodology, present and discuss results, and conclude with comparisons and future directions in the field of natural language processing (NLP) for the Iraqi dialect.

Methods

The research methodology outlined in this section addresses the scarcity of data on extremism and hate speech in the Iraqi dialect, particularly in comparison to other Arabic dialects. The authors emphasize the challenges associated with data collection from social networking sites, including the need for platform permissions and the time-consuming nature of using specialized data collection programs.

To overcome these challenges, the study utilizes three distinct datasets sourced from Iraqi social media platforms, specifically Facebook (Meta) and Twitter (X). The first dataset, termed the Iraqi Tweets Dataset (ITD), comprises 3,100 tweets featuring Iraqi dialect hashtags. The second dataset includes 12,000 comments from an Iraqi Facebook page, also in the Iraqi dialect. The third dataset, referred to as CIAD, consists of 1,170 tweets collected from Iraqi hashtags on Twitter. This comprehensive approach aims to provide a more robust analysis of extremism within the context of the Iraqi dialect.

Results

This section reports the results of the classification models, which were run in a Google Colab environment and evaluated on accuracy, macro-averaged precision, macro-averaged recall, and macro-averaged F1-score. The environment provided an Intel Xeon CPU, an NVIDIA Tesla K80 GPU, 12.7 GB of RAM, and 107.72 GB of disk space. The highest accuracy, 96%, was achieved by the Support Vector Machine (SVM) model combined with the Iraqi Word2Vec feature extraction method. Other models, such as Logistic Regression (LR), K-Nearest Neighbors (KNN), and Gaussian Naive Bayes (GNB), reached accuracies of 94%, 89%, and 80%, respectively.
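For readers unfamiliar with macro averaging, the sketch below computes accuracy and macro-averaged precision, recall, and F1 from lists of true and predicted labels. It uses the standard definitions; the function name and the toy labels are assumptions, and the numbers are not taken from the paper.

```python
def macro_scores(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": sum(precisions) / n,   # unweighted mean over classes
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,                 # mean of per-class F1 values
    }


# Toy labels: 1 = extremist, 0 = not extremist (hypothetical data).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
scores = macro_scores(y_true, y_pred)
```

Because macro F1 is the mean of per-class F1 values, it need not equal the harmonic mean of macro precision and macro recall. For example, the harmonic mean of a precision of 0.955 and a recall of 0.95 is about 0.9525, close to but not identical with the reported F1-score of 0.9521; per-class averaging or rounding could each account for a gap of that size.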

The results indicate that the SVM model consistently outperformed the other classifiers across the feature extraction techniques, as detailed in Tables 2 through 5. A comparison with related work shows that the proposed models substantially surpassed previous benchmarks: the highest accuracy reported in that earlier work, obtained with LibSVM, was 78.1%. The findings underscore the effectiveness of the SVM classifier on text data, as evidenced by its superior macro-averaged scores across all evaluated metrics.

Discussion

In the discussion section, the authors review studies on processing and analyzing the Iraqi dialect, particularly for sentiment analysis and extremism detection. They highlight the pioneering work of Sabbar et al. (2018), who developed a stemming process for the Iraqi dialect and achieved 93% accuracy using the Naive Bayes algorithm. However, they note the limitations of relying on a single dataset, which can introduce bias and restrict the diversity of linguistic representation. Subsequent studies, such as those by Alnawas et al. (2019) and Alghamdi et al. (2020), faced similar challenges, underscoring the need for more comprehensive datasets to strengthen the validity of such findings.

The authors further elaborate on the importance of preprocessing steps, including tokenization, stop word removal, and the development of a specialized Iraqi stemmer, which they argue is crucial for improving classification accuracy. They conclude that their proposed models, particularly the Iraqi Word2Vec model, outperformed pre-trained models, achieving notable accuracy rates. The authors advocate for future efforts to create larger, publicly available datasets and standardized stemming processes for the Iraqi dialect to facilitate further research and applications in natural language processing.