Automated data processing and feature engineering for deep learning and big data applications: A survey

Journal: Journal of Information and Intelligence, Volume: 3, Issue: 2
DOI: https://doi.org/10.1016/j.jiixd.2024.01.002
Publication Date: 2024-01-08
Author(s): Alhassan Mumuni et al.
Primary Topic: Machine Learning and Data Classification

Overview

This section gives an overview of advances in automating data processing tasks within deep learning pipelines and of why they matter for artificial intelligence (AI) and big data applications. Modern AI approaches focus on designing algorithms that learn from data, particularly through supervised deep learning, which has streamlined the creation of machine learning systems. However, traditional deep learning workflows still require manual intervention for data collection, preprocessing, and augmentation. Recent work in automated machine learning (AutoML) addresses these challenges by enabling end-to-end systems that transform raw data into useful features, automating all intermediate processing stages.

The review highlights various automated data processing techniques, including data cleaning, labeling, missing data imputation, categorical encoding, data augmentation, and feature engineering (such as feature extraction, construction, and selection). It also discusses the integration of these processes within a unified framework to optimize the entire machine learning pipeline. While the survey indicates that many tasks can be automated effectively, it acknowledges ongoing challenges in achieving full automation, particularly in developing search mechanisms that allow for context-dependent data collection and utilization. The findings underscore the growing importance of automated data processing in meeting the increasing demand for machine learning solutions across diverse fields.
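
Two of the preprocessing steps named above, missing data imputation and categorical encoding, can be sketched in a few lines. The following is a minimal illustration using only the Python standard library, not a method taken from the survey; the function names (`impute_mean`, `one_hot`, `preprocess`) and the toy table are assumptions made up for this example.

```python
from statistics import mean


def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed) if observed else 0.0
    return [fill if v is None else v for v in column]


def one_hot(column):
    """Encode a categorical column as one 0/1 column per category."""
    categories = sorted(set(column))
    return {f"is_{c}": [1 if v == c else 0 for v in column] for c in categories}


def preprocess(table, numeric, categorical):
    """Impute numeric columns, then one-hot encode categorical columns.

    `table` maps a column name to a list of values (hypothetical layout).
    """
    out = {}
    for name in numeric:
        out[name] = impute_mean(table[name])
    for name in categorical:
        for new_name, values in one_hot(table[name]).items():
            out[f"{name}_{new_name}"] = values
    return out


# Toy data, invented for illustration only.
raw = {
    "age": [20, None, 40],
    "color": ["red", "blue", "red"],
}
features = preprocess(raw, numeric=["age"], categorical=["color"])
```

Here `features` contains the imputed `age` column and one indicator column per colour. An automated system would additionally choose which imputation and encoding strategy to apply for each column; this sketch fixes those choices by hand.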

Introduction

The introduction of the paper outlines the increasing reliance on machine learning (ML) methods, particularly deep learning, for solving complex problems in the era of Big Data. While deep learning excels at feature extraction and data processing, it struggles with large volumes of heterogeneous data and with problems such as bias, anomalies, and missing data points. These issues call for additional processing steps to improve the predictive performance and reliability of ML models, especially in decision support systems, where insights from data are crucial.

The authors propose to explore various methods for automating data processing tasks in deep learning and Big Data applications, categorizing these tasks into three main subtasks: preprocessing, data augmentation, and feature engineering. They highlight the complexities involved in data preprocessing, including the need for domain expertise and the subjective nature of certain decisions, which complicates performance comparisons across studies. The paper also discusses the limitations of traditional approaches to feature engineering, emphasizing the need for more generalizable methods, such as the Feature Set Data-Driven Search (FETCH) approach, to address dataset sensitivity and improve model performance across diverse datasets.
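
The idea of treating feature engineering as a search over constructed features can be illustrated with a deliberately simple sketch. It is not FETCH and not any specific method from the survey: candidate features are built from squares and pairwise products, then ranked by absolute Pearson correlation with the target. All names and the toy data are assumptions for illustration.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def construct_candidates(table):
    """Feature construction: original columns, their squares, pairwise products."""
    candidates = dict(table)
    for name, col in table.items():
        candidates[f"{name}^2"] = [v * v for v in col]
    for a, b in combinations(sorted(table), 2):
        candidates[f"{a}*{b}"] = [x * y for x, y in zip(table[a], table[b])]
    return candidates


def select_top_k(candidates, target, k):
    """Feature selection: keep the k candidates most correlated with the target."""
    ranked = sorted(
        candidates,
        key=lambda name: abs(pearson(candidates[name], target)),
        reverse=True,
    )
    return ranked[:k]


# Toy data: the target is the product of the two inputs.
table = {"x1": [1, 2, 3, 4], "x2": [4, 1, 3, 2]}
target = [a * b for a, b in zip(table["x1"], table["x2"])]
best = select_top_k(construct_candidates(table), target, k=1)
```

On this toy data the constructed product feature ranks first, because it reproduces the target exactly. A correlation filter is only one possible scoring rule; scoring candidates by the validation performance of a downstream model is costlier but reflects the actual objective more directly.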

Methods

This section discusses the methodologies and performance of automated data preprocessing, augmentation, and feature engineering techniques in machine learning. It highlights the lack of standardized approaches across preprocessing methods, particularly for tasks such as CO2 emission modeling and genomics. A summary table categorizes key preprocessing methods, their functions (such as imputation, cleaning, labeling, and categorical encoding), and the datasets they apply to. The findings indicate that automated data augmentation techniques significantly outperform traditional methods: in comparisons on datasets such as CIFAR-10, CIFAR-100, and ImageNet, models trained with automated augmentation achieved higher classification accuracy.
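
Automated data augmentation is commonly framed as a search over augmentation policies. The sketch below shows that framing in its simplest form, under stated assumptions: a policy is a list of (operation, probability, magnitude) triples, the two image operations are toy functions on nested lists, and the search is plain random search. Random search is used here purely for simplicity and is not claimed to be the strategy of any method covered by the survey. The `evaluate` callback is hypothetical; in practice it would train a model with the policy and return validation accuracy.

```python
import random


# Toy operations on a grayscale image stored as a list of rows (values 0-255).
def flip_horizontal(image, magnitude):
    return [list(reversed(row)) for row in image]  # magnitude is unused for flips


def brighten(image, magnitude):
    return [[min(255, px + magnitude) for px in row] for row in image]


OPS = {"flip": flip_horizontal, "brighten": brighten}


def sample_policy(rng):
    """Draw one (operation name, probability, magnitude) triple per operation."""
    return [
        (name, rng.choice([0.0, 0.5, 1.0]), rng.choice([10, 30, 50]))
        for name in OPS
    ]


def apply_policy(image, policy, rng):
    """Apply each operation in turn with its configured probability."""
    for name, prob, magnitude in policy:
        if rng.random() < prob:
            image = OPS[name](image, magnitude)
    return image


def search(evaluate, trials, seed=0):
    """Random search: keep the sampled policy with the highest score."""
    rng = random.Random(seed)
    best_policy, best_score = None, float("-inf")
    for _ in range(trials):
        policy = sample_policy(rng)
        score = evaluate(policy)
        if score > best_score:
            best_policy, best_score = policy, score
    return best_policy, best_score


# Stand-in evaluator (an assumption): a real one would train and validate a model.
def toy_evaluate(policy):
    return sum(prob for _, prob, _ in policy)


best_policy, best_score = search(toy_evaluate, trials=20)
augmented = apply_policy([[0, 250]], [("flip", 1.0, 0), ("brighten", 1.0, 10)], random.Random(0))
```

Because every call to `evaluate` would normally involve a full training run, the number of trials dominates the cost of such a search, which is one reason automated augmentation can be computationally expensive.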

Additionally, the section reviews automated feature engineering methods, demonstrating their superiority over traditional techniques in terms of classification accuracy, although performance varies across datasets. The text also notes that while automated methods show promise, they require substantial computational resources and may inadvertently alter the semantic meaning of data. Lastly, it emphasizes the need for extending generic AutoML methods to accommodate more complex data types and tasks beyond tabular data, particularly in areas like time series and natural language processing.

Discussion

The discussion section of the paper highlights the growing interest in automating data acquisition and processing functions, particularly in fields like business, healthcare, and law enforcement. Automated data processing systems are crucial for machine learning applications, especially where privacy concerns limit human access to raw data. Despite the significant advancements in automated data processing, the literature remains sparse, with most surveys focusing on end-to-end automated machine learning (AutoML) tools rather than on the specific low-level tasks of data preprocessing and feature engineering. This paper aims to fill that gap by concentrating on the automation of these critical functions within deep learning pipelines.

The authors outline the structure of their survey, which includes an overview of automation concepts, detailed discussions on automated data preprocessing techniques, data augmentation, and feature engineering methods. They emphasize the importance of these tasks in enhancing model performance, particularly in complex applications where traditional methods may fall short. The paper also addresses the need for integrated workflows that allow for simultaneous execution of data processing tasks, thus optimizing the use of machine learning frameworks. The implications of these automated techniques for industry and future research directions are also discussed, underscoring the necessity for continued exploration in this domain to address the challenges posed by data quality, representation, and interpretability in machine learning models.