Advancing stock price prediction through the development of hybrid ensembles: a comprehensive comparative analysis of machine learning approaches

Journal: Journal Of Big Data, Volume: 12, Issue: 1
DOI: https://doi.org/10.1186/s40537-025-01185-8
Publication Date: 2025-10-15
Author(s): Akila Dabara Kayit et al.
Primary Topic: Stock Market Forecasting Methods

Overview

This research presents a novel ensemble learning approach for stock direction classification and prediction. It aims to improve accuracy through strategic selection of base learners, hyperparameter optimization via GridSearchCV, and hybrid ensemble techniques. The study employs a range of machine learning models, including Logistic Regression (LR), Random Forest (RF), Decision Trees (DT), Support Vector Machines (SVM), K-Nearest Neighbors (KNN), AdaBoost, and Lasso-LSTM, to compare single classifiers, hybrid ensembles, and a voting-based classification model. Using S&P 500 index data from June 8, 2016, to September 6, 2023, the research evaluates classification performance with accuracy, precision, recall, F1 score, AUC, and sensitivity (conventionally another name for recall). Notably, the ensemble that combines a Support Vector Classifier (SVC), LR, and RF through voting achieved the highest accuracy, 95.8%, with an AUC of 0.97, demonstrating the advantage of ensemble methods over individual classifiers.
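The combination of GridSearchCV tuning and a voting ensemble over SVC, LR, and RF can be sketched with scikit-learn as follows. This is a minimal illustration, not the authors' pipeline: the data is synthetic, and the soft-voting choice, the feature scaling, and the parameter grid are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for engineered stock features (NOT the study's data).
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           random_state=0)
# A shuffled split is acceptable for i.i.d. synthetic data; real price data
# would need a chronological split to avoid look-ahead leakage.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Base learners named in the text. Scaling for SVC and LR is an assumption.
voter = VotingClassifier(
    estimators=[
        ("svc", make_pipeline(StandardScaler(),
                              SVC(probability=True, random_state=0))),
        ("lr", make_pipeline(StandardScaler(),
                             LogisticRegression(max_iter=1000))),
        ("rf", RandomForestClassifier(random_state=0)),
    ],
    voting="soft",  # assumption: the summary does not say hard or soft voting
)

# Illustrative grid using scikit-learn's nested "<name>__<param>" convention.
param_grid = {
    "svc__svc__C": [0.1, 1.0],
    "lr__logisticregression__C": [0.1, 1.0],
    "rf__n_estimators": [50, 100],
}

search = GridSearchCV(voter, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```

Tuning the ensemble as a whole, rather than each base learner separately, selects hyperparameters for their effect on the combined vote. Whether the study tuned jointly or per learner is not stated in this summary.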

The findings underscore the importance of model selection, hyperparameter optimization, and the use of ensemble techniques in improving predictive accuracy and model transferability. The statistical analyses, including Kruskal-Wallis tests and Dunn’s post hoc comparisons, confirmed significant performance differences among models, reinforcing the reliability of the ensemble-based predictions. The hybrid model’s ability to capture both linear and non-linear stock price behaviors enhances its adaptability across various market conditions, with particular robustness observed in stable sectors like Healthcare and Utilities. The study advocates for the integration of multiple learning algorithms and emphasizes the necessity of cross-validation to prevent overfitting, thereby enhancing model generalization. Overall, this research positions ensemble learning as a powerful tool for financial forecasting, providing valuable insights for investors and financial institutions.
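Cross-validation over several metrics, as emphasized above, might look like the following sketch. The use of `TimeSeriesSplit` (each fold trains on the past and tests on the future) is an assumption suited to ordered market data; the summary does not state which splitting scheme the authors used, and the data here is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit, cross_validate

# Synthetic, time-ordered stand-in data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

# Each split trains on an initial segment and tests on the segment after it.
cv = TimeSeriesSplit(n_splits=5)
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]

scores = cross_validate(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X, y, cv=cv, scoring=scoring,
)

# Average each metric over the folds.
mean_scores = {m: float(scores[f"test_{m}"].mean()) for m in scoring}
for metric, value in mean_scores.items():
    print(f"{metric}: {value:.3f}")
```

Keeping the per-fold arrays in `scores`, not only their means, also provides the repeated measurements that a rank-based comparison between models needs.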

Introduction

The introduction of the research paper highlights the critical importance of stock market prediction in financial research, emphasizing that accurate forecasting models are essential for informed decision-making by investors and policymakers. Traditional methods, such as fundamental and technical analysis, have struggled to address the complexities of dynamic and volatile financial markets. Consequently, there is a growing interest in machine learning (ML) and deep learning (DL) techniques, which have demonstrated superior forecasting capabilities compared to conventional statistical models.

Recent advancements in ensemble learning, which combines multiple predictive models to enhance robustness and accuracy, have shown promise in improving financial forecasts. Studies indicate that ensemble methods outperform single-model approaches by leveraging the strengths of various classifiers and reducing prediction variance. However, challenges remain in optimizing hybrid ensemble models, particularly in integrating deep learning frameworks with traditional ML algorithms. Notably, while Long Short-Term Memory (LSTM) networks and Random Forests (RF) have proven effective in stock prediction, hybridization strategies that incorporate multiple ensemble layers and voting mechanisms remain underexplored. Furthermore, existing research has not fully addressed the hybridization of ensemble models or the impact of feature selection and hyperparameter tuning, nor has it provided a comprehensive comparison between hybrid models and independently trained single learners.

Methods

The methodology section outlines the comprehensive approach taken in the study, which encompasses data collection, preprocessing, feature engineering, model selection, training, and evaluation. The research employs a hybrid ensemble model to enhance predictive accuracy across different sectors, specifically focusing on stocks related to natural resources and real estate.
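The preprocessing and feature-engineering stages for a direction-classification task can be sketched as below. Everything here is an assumption for illustration: the price series is simulated, and the three features (`return_1d`, `sma_5_ratio`, `volatility_10d`) are hypothetical examples, not the feature set used in the paper.

```python
import numpy as np
import pandas as pd

# Simulated daily closing prices on business days (NOT real market data).
rng = np.random.default_rng(42)
dates = pd.bdate_range("2016-06-08", periods=250)
log_returns = rng.normal(0.0, 0.01, size=len(dates))
close = pd.Series(100 * np.exp(np.cumsum(log_returns)), index=dates,
                  name="close")

# Hypothetical features, each computed only from data up to the current day.
features = pd.DataFrame(index=close.index)
features["return_1d"] = close.pct_change()
features["sma_5_ratio"] = close / close.rolling(5).mean() - 1
features["volatility_10d"] = features["return_1d"].rolling(10).std()

# Target: 1 if the next day's close is above today's close, else 0.
target = (close.shift(-1) > close).astype(int).rename("direction")

# Drop the final row (its next-day close is unknown) and warm-up rows with NaNs.
dataset = features.join(target).iloc[:-1].dropna()

# Chronological split: train on the earlier 80%, test on the later 20%.
split = int(len(dataset) * 0.8)
train, test = dataset.iloc[:split], dataset.iloc[split:]
print(len(train), len(test))
```

Shifting the target by one day while keeping the features backward-looking is what prevents the label from leaking into the inputs.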

The results indicate that the hybrid model achieved a high recall of 0.95 for Freeport-McMoRan Inc., demonstrating its effectiveness in modeling stocks influenced by global commodity demand. In contrast, the model’s performance in the real estate sector, with an accuracy of 0.76, highlights the complexities of forecasting in this area, where single models failed to capture long-term property value changes. The hybrid model also exhibited moderate performance metrics, including an AUC of 0.93 and a recall of 0.95 in more stable sectors, underscoring the limitations of single models in addressing regulatory impacts that ensemble methods can effectively integrate.

Results

The results indicate that hybrid models significantly outperform both single learners and optimization-based models in financial forecasting. Although sectoral differences did not yield statistically significant results, the analysis revealed that company-level factors notably affected recall and sensitivity, underscoring the importance of customizing models for specific company forecasts.

Further statistical validation through the Kruskal-Wallis test and Dunn’s post-hoc test highlighted the efficacy of ensemble methods, particularly the combination of Support Vector Classifier (SVC), Logistic Regression (LR), Random Forest (RF), and a voting mechanism, in enhancing stock market prediction accuracy. These findings advocate for the adoption of tailored hybrid approaches in financial modeling to improve predictive performance.
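The two-stage procedure named above (an omnibus Kruskal-Wallis test followed by Dunn's pairwise comparisons) can be sketched as follows. The per-fold accuracy values are invented for illustration and are not results from the paper. Dunn's test is written out by hand using pooled ranks, without a tie correction, and with a Bonferroni adjustment chosen as an assumption.

```python
import itertools

import numpy as np
from scipy import stats

# Hypothetical per-fold accuracies for three models (illustrative numbers only).
scores = {
    "single_lr": np.array([0.71, 0.69, 0.73, 0.70, 0.72, 0.68]),
    "single_rf": np.array([0.78, 0.80, 0.77, 0.81, 0.79, 0.76]),
    "hybrid_voting": np.array([0.90, 0.92, 0.89, 0.93, 0.91, 0.94]),
}

# Omnibus test: do the groups come from the same distribution?
h_stat, p_value = stats.kruskal(*scores.values())
print(f"Kruskal-Wallis H={h_stat:.3f}, p={p_value:.5f}")

# Dunn's test: compare mean ranks taken from one pooled ranking.
names = list(scores)
pooled = np.concatenate([scores[n] for n in names])
ranks = stats.rankdata(pooled)
n_total = len(pooled)
bounds = np.cumsum([0] + [len(scores[n]) for n in names])
mean_rank = {n: ranks[bounds[i]:bounds[i + 1]].mean()
             for i, n in enumerate(names)}

pairs = list(itertools.combinations(names, 2))
dunn = {}
for a, b in pairs:
    # Standard error of the mean-rank difference (valid when there are no ties).
    se = np.sqrt(n_total * (n_total + 1) / 12
                 * (1 / len(scores[a]) + 1 / len(scores[b])))
    z = (mean_rank[a] - mean_rank[b]) / se
    p_raw = 2 * stats.norm.sf(abs(z))
    dunn[(a, b)] = min(1.0, p_raw * len(pairs))  # Bonferroni adjustment

for pair, p_adj in dunn.items():
    print(pair, f"adjusted p={p_adj:.4f}")
```

The omnibus test only says that at least one model differs; the pairwise step is what identifies which models differ from which.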

Discussion

The discussion section of the research paper outlines the objectives and contributions of the study, which focuses on enhancing stock market prediction through hybrid ensemble models that integrate traditional machine learning (ML) techniques with deep learning (DL) strategies. The study aims to assess the effectiveness of these hybrid models, explore the impact of feature selection and hyperparameter optimization on model accuracy, and compare the predictive performance of hybrid ensembles against single-model baselines. By employing multiple performance metrics such as RMSE, MAPE, and AUC, the research seeks to demonstrate the robustness and interpretability of these models in the financial domain, ultimately providing valuable insights for investors and financial analysts.
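For reference, the standard definitions of the two error metrics named above are given here, for $n$ observations with true values $y_i$ and predictions $\hat{y}_i$. Both are regression metrics, so they would apply to predicted price levels rather than to direction labels; the summary does not say which model outputs they were computed on.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|
```

AUC, by contrast, is the area under the ROC curve and summarizes a classifier's ranking quality across all decision thresholds.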

The paper highlights significant gaps in existing literature, particularly the limited exploration of hybridization in ensemble models, underutilization of advanced feature selection techniques, and the lack of comprehensive comparative analyses between hybrid and single models. To address these issues, the study introduces a novel hybrid ensemble model combining Support Vector Classifier (SVC), Logistic Regression (LR), and Random Forest (RF) with a voting mechanism. This approach aims to enhance predictive accuracy while mitigating overfitting and improving model interpretability. The research framework includes rigorous data preprocessing, feature engineering, and a structured methodology for model evaluation, contributing to the advancement of machine learning applications in financial forecasting.
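A voting mechanism over $M$ base classifiers is usually defined in one of two ways. The summary does not state which variant the study uses, so both standard forms are shown, with $h_m(x)$ the label predicted by classifier $m$ and $p_m(c \mid x)$ its predicted probability for class $c$.

```latex
\text{Hard voting:}\quad
\hat{y} = \arg\max_{c} \sum_{m=1}^{M} \mathbb{1}\left[h_m(x) = c\right]
\qquad
\text{Soft voting:}\quad
\hat{y} = \arg\max_{c} \frac{1}{M}\sum_{m=1}^{M} p_m(c \mid x)
```

With SVC, LR, and RF as base learners, $M = 3$, so hard voting on a binary up/down label can never tie.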

Limitations

The section on limitations addresses the constraints and challenges encountered in the research. It highlights potential sources of bias, such as sample size limitations and the generalizability of findings across different populations. Additionally, the authors acknowledge that certain methodological choices may have influenced the results, suggesting that alternative approaches could yield different outcomes.

Future work is proposed to address these limitations, including the need for larger, more diverse samples to enhance the robustness of the findings. The authors also recommend exploring additional variables that could impact the results, as well as employing longitudinal studies to assess changes over time. Overall, this section underscores the importance of continued research to validate and expand upon the current study’s conclusions.