Data oversampling and imbalanced datasets: an investigation of performance for machine learning and feature engineering

Journal: Journal of Big Data, Volume: 11, Issue: 1
DOI: https://doi.org/10.1186/s40537-024-00943-4
Publication Date: 2024-06-17
Author(s): Muhammad Mujahid et al.
Primary Topic: Imbalanced Data Classification Techniques

Overview

The study addresses class imbalance in machine learning, particularly in text mining applications, where one class may greatly outnumber another. This imbalance often leads to overfitting and degraded performance. To mitigate these issues, the study compares several oversampling techniques: Synthetic Minority Oversampling Technique (SMOTE), SVM-SMOTE, Borderline SMOTE, K-means SMOTE, and Adaptive Synthetic (ADASYN) oversampling. The authors preprocess the data to improve its quality and make patterns easier to recognize. They use two imbalanced Twitter datasets and evaluate six machine learning models, namely Random Forest (RF), Support Vector Machine (SVM), K-Nearest Neighbor (KNN), AdaBoost (ADA), Logistic Regression (LR), and Decision Tree (DT), with features extracted by Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF).

The findings indicate that SMOTE and ADASYN outperform the other techniques. SVM achieved the highest accuracy (99.67%) and a recall of 1.00 on the ADASYN-oversampled datasets, and 99.57% accuracy on the SMOTE-oversampled datasets with TF-IDF features. Under 10-fold cross-validation, the SVM model reached a mean accuracy of 97.40% with a standard deviation of 0.008. The study concludes that oversampling techniques, particularly SMOTE and ADASYN, reduce the risk of overfitting and improve model performance on imbalanced datasets, and that TF-IDF features yield better results than BoW features. Overall, the research emphasizes the importance of balancing datasets to improve the predictive capabilities of machine learning models.
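
The 10-fold cross-validation figures above (a mean accuracy with a standard deviation) can be illustrated with a minimal sketch. It assumes contiguous, unshuffled folds and the population standard deviation; the summary does not say which variant the authors used, and the per-fold scores below are made-up placeholders, not results from the paper.

```python
import statistics

def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

def summarize(scores):
    """Return (mean, population standard deviation) of per-fold scores."""
    return statistics.mean(scores), statistics.pstdev(scores)

# Hypothetical per-fold accuracies, for illustration only.
fold_scores = [0.97, 0.98, 0.96, 0.98, 0.97, 0.99, 0.97, 0.96, 0.98, 0.98]
mean_acc, std_acc = summarize(fold_scores)
print(f"mean accuracy = {mean_acc:.4f}, std = {std_acc:.4f}")
```

Each sample lands in exactly one test fold, so every tweet is scored once by a model that did not see it during training.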

Introduction

The introduction highlights the growing interest in text mining and machine learning, and the challenges posed by imbalanced datasets, in which some classes have far fewer samples than others. This imbalance degrades the predictive performance of classification algorithms, which typically optimize overall accuracy, a metric that can mislead because a model can score well simply by favoring the majority class. As noted by Japkowicz and Stephen, the difficulty of the classification task grows with the severity of the class imbalance, particularly when the number of training examples is limited.
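
A small worked example shows why overall accuracy can mislead on imbalanced data. The labels are hypothetical (95 majority samples, 5 minority samples), and the "classifier" simply predicts the majority class every time.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn), treating `positive` as the minority class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn

def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)

def recall(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical data: 95 majority (0) and 5 minority (1) samples.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a model that always predicts the majority class

print(accuracy(y_true, y_pred))  # 0.95
print(recall(y_true, y_pred))    # 0.0
```

The model is right 95% of the time yet never finds a single minority sample, which is why minority-class recall is reported alongside accuracy.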

Several studies have explored strategies for addressing class imbalance, including synthetic data generation techniques such as the Synthetic Minority Oversampling Technique (SMOTE). For instance, research on sentiment analysis of tweets showed that SMOTE can improve model performance by improving the recognition of minority classes. Another study applied SMOTE to political tweets in different languages and reported substantial improvements in classification metrics. Despite these advances, the literature lacks a comprehensive comparison of the effectiveness of different oversampling methods, a gap that this research aims to address.
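
The core idea of SMOTE, creating a synthetic minority sample by interpolating between a minority point and one of its nearest minority neighbours, can be sketched in a few lines. This is a simplified illustration, not the implementation used in the study; the function name `smote_sketch`, its parameters, and the toy points are assumptions made for the example.

```python
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def smote_sketch(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points from a list of minority vectors.

    Requires at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point, excluding itself.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: euclidean(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Hypothetical 2-D minority samples.
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority_points, n_new=6)
print(len(new_points))
```

Because each synthetic point lies on a segment between two real minority points, the new samples stay inside the region the minority class already occupies.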

Methods

In this section, the authors outline the materials and methodology employed in their study, detailing the datasets, oversampling techniques, machine learning models, and evaluation metrics. The methodology flow, illustrated in Figure 1, begins with data acquisition, followed by preprocessing steps to prepare the data for analysis. Subsequently, various oversampling methods are applied to address class imbalance within the dataset. Finally, the data is divided into training and testing subsets to facilitate model training and evaluation. This structured approach aims to enhance the robustness and accuracy of the machine learning models utilized in the research.

Results

The study evaluated several machine learning models on two imbalanced tweet datasets, using a 75%/25% train-test split. Features were extracted with Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF). The Support Vector Machine (SVM) model achieved the highest accuracy, 99.67%, with ADASYN oversampling and TF-IDF features. SMOTE also outperformed the other techniques, particularly on the EndViolence dataset, and TF-IDF features consistently yielded better model performance than BoW features across all experiments.
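
A 75%/25% split can be sketched as follows. The version below is stratified, meaning it keeps the class ratio the same in both subsets; that is an assumption made for the illustration, since the summary does not say whether the authors stratified. The function name and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx), preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical labels: 80 majority (0) and 20 minority (1) samples.
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 75 25
```

Splitting per class guarantees that the rare class appears in the test subset, so minority-class recall can actually be measured.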

Precision and recall further highlighted the effectiveness of TF-IDF features: Random Forest (RF), SVM, and Logistic Regression (LR) achieved recall scores of 99.99% on the EndViolence dataset. The results showed that oversampling techniques, particularly SMOTE and ADASYN, improved model performance, especially when combined with TF-IDF features. Deep learning models were also assessed; the Convolutional Neural Network (CNN) outperformed the Long Short-Term Memory (LSTM) and Bidirectional LSTM (BiLSTM) models, achieving an accuracy of 96.11% on the datasets. Overall, the findings underscore the importance of feature extraction and oversampling techniques in improving the performance of machine learning models on imbalanced datasets.

Discussion

The discussion emphasizes the role of oversampling techniques in addressing class imbalance in machine learning, particularly in textual data analysis. Class imbalance can hinder model performance by depriving minority classes of adequate representation, which leads to skewed predictions. The authors advocate oversampling methods such as SMOTE, SVM-SMOTE, and ADASYN to strengthen the representation of minority classes. With these techniques, machine learning algorithms, including neural networks, random forests, and support vector machines, can achieve better predictive accuracy and generalize better to new data.

The research also highlights the importance of feature engineering when preprocessing textual data, using Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) to extract meaningful features from raw text. BoW is effective for some applications, but TF-IDF offers a more nuanced approach: it weights each term by how often it occurs in a document, discounted by how many documents in the corpus contain it. The paper's contributions include a comparative analysis of multiple oversampling techniques on two highly imbalanced Twitter datasets, showing their impact on the performance of various machine learning models. The findings suggest that balancing datasets through oversampling not only reduces the model's skew toward the majority class but also improves decision-making in real-world applications such as sentiment analysis and fraud detection.
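
The difference between BoW and TF-IDF can be seen on a toy corpus. The sketch below uses whitespace tokenisation and the textbook weighting tf × log(N / df); library implementations usually apply smoothing and normalisation, so their numbers differ. The three example documents are invented for the illustration.

```python
import math
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count matrix) for whitespace-tokenised documents."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        rows.append([counts[term] for term in vocab])
    return vocab, rows

def tf_idf(docs):
    """Return (vocabulary, matrix of tf * log(N / df) weights)."""
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / df) for df in doc_freq]
    weighted = [[tf * w for tf, w in zip(row, idf)] for row in counts]
    return vocab, weighted

# Hypothetical mini-corpus.
docs = ["stop violence now", "stop spam", "stop now"]
vocab, bow = bag_of_words(docs)
_, weights = tf_idf(docs)
print(vocab)       # ['now', 'spam', 'stop', 'violence']
print(bow[0])      # [1, 0, 1, 1]
print(weights[0])
```

In the BoW row, "stop" counts as much as "violence". Under TF-IDF, "stop" appears in every document and receives weight 0, while the rarer "violence" receives the largest weight in the first document.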

Limitations

The limitations section highlights several constraints of the Borderline Synthetic Minority Over-sampling Technique (Borderline SMOTE) compared with conventional SMOTE. First, Borderline SMOTE requires more computational resources because it concentrates on generating samples near the decision boundary, which can complicate the classification process. In addition, classification effectiveness may suffer from the difficulty of selecting and tuning the required parameters.
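
The boundary-focused step that distinguishes Borderline SMOTE can be sketched as a labelling pass over the minority samples. The rule below follows the commonly cited formulation: among a minority point's k nearest neighbours in the whole training set, all-majority neighbours mark it as noise, at least half (but not all) majority neighbours mark it as "danger", and otherwise it is safe; only danger points are then used as seeds for synthetic samples. The function name and the 1-D toy data are assumptions for the example.

```python
def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def label_minority(minority, majority, k=3):
    """Label each minority point as 'noise', 'danger', or 'safe'."""
    labelled = [(p, 1) for p in minority] + [(p, 0) for p in majority]
    result = []
    for i, point in enumerate(minority):
        # Neighbours are searched in the full dataset, excluding the point itself.
        others = [labelled[j] for j in range(len(labelled)) if j != i]
        nearest = sorted(others, key=lambda item: euclidean(point, item[0]))[:k]
        n_majority = sum(1 for _, label in nearest if label == 0)
        if n_majority == k:
            result.append("noise")
        elif n_majority >= k / 2:
            result.append("danger")
        else:
            result.append("safe")
    return result

# Hypothetical 1-D data: a minority cluster, one boundary point, one isolated point.
minority = [[0.0], [0.1], [0.2], [1.0], [5.0]]
majority = [[1.1], [1.2], [4.9], [5.1], [5.2]]
print(label_minority(minority, majority))
# ['safe', 'safe', 'safe', 'danger', 'noise']
```

This pass needs a neighbour search over both classes for every minority sample, whereas plain SMOTE only searches among minority samples, which is one source of the extra computational cost mentioned above. The value of k is also one of the parameters that has to be tuned.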

Moreover, while conventional SMOTE explores the entire feature space by generating synthetic samples throughout the minority class, Borderline SMOTE may not achieve the same coverage, particularly when decision boundaries are complex. This can leave parts of the feature space underrepresented. Conversely, in some regions Borderline SMOTE may produce an overabundance of synthetic samples, which can introduce bias into the model and affect its overall performance.