A taxonomy of literature reviews and experimental study of deep reinforcement learning in portfolio management

Journal: Artificial Intelligence Review, Volume: 58, Issue: 3
DOI: https://doi.org/10.1007/s10462-024-11066-w
Publication Date: 2025-01-17
Author(s): Mohadese Rezaei et al.
Primary Topic: Stock Market Forecasting Methods

Overview

This section gives an overview of Deep Reinforcement Learning (DRL) applied to portfolio management and highlights its advantages over traditional methods such as mean-variance analysis. DRL can adjust an investment strategy dynamically in response to real-time market feedback, which suits the complexity of modern financial markets. The paper combines a literature review of DRL applications with an experimental study comparing five DRL algorithms, Advantage Actor-Critic (A2C), Deep Deterministic Policy Gradient (DDPG), Proximal Policy Optimization (PPO), Soft Actor-Critic (SAC), and Twin Delayed DDPG (TD3), on a portfolio of the 30 Dow Jones Industrial Average (DJIA) stocks. The results indicate that DRL strategies can improve portfolio performance: TD3 achieved the highest cumulative and annualized returns, while SAC achieved the best Sharpe ratio.
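The three performance measures named above can be computed from a series of periodic portfolio returns. The following is a minimal sketch, assuming daily simple returns, 252 trading days per year, and a zero risk-free rate by default; the function names and these conventions are illustrative assumptions, not the paper's evaluation code.

```python
import numpy as np

TRADING_DAYS = 252  # assumed number of trading days per year


def cumulative_return(daily_returns):
    """Total compounded return over the whole period."""
    r = np.asarray(daily_returns, dtype=float)
    return float(np.prod(1.0 + r) - 1.0)


def annualized_return(daily_returns, periods_per_year=TRADING_DAYS):
    """Geometric average return scaled to one year."""
    r = np.asarray(daily_returns, dtype=float)
    growth = np.prod(1.0 + r)
    return float(growth ** (periods_per_year / len(r)) - 1.0)


def sharpe_ratio(daily_returns, risk_free_daily=0.0, periods_per_year=TRADING_DAYS):
    """Annualized Sharpe ratio: mean excess return over its sample standard deviation."""
    excess = np.asarray(daily_returns, dtype=float) - risk_free_daily
    return float(np.sqrt(periods_per_year) * excess.mean() / excess.std(ddof=1))
```

Because the Sharpe ratio divides by volatility, a strategy with the highest raw return does not necessarily have the highest Sharpe ratio, which is consistent with different algorithms leading on different measures.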

In the conclusion, the authors classify existing DRL methods into three categories (critic-only, actor-only, and actor-critic), each with distinct advantages and challenges. They identify key directions for future research, including incorporating transaction costs into portfolio management strategies and enriching the input data with informative features such as technical indicators and market sentiment. The paper emphasizes the potential of SAC and TD3 for portfolio management and suggests that intelligent asset selection could further improve trading performance. Overall, the findings underscore the promise of DRL for optimizing portfolio management while acknowledging critical challenges in the field.

Introduction

The introduction discusses the complexities of portfolio management, framing it as a dynamic decision-making process that optimizes the allocation of financial assets while balancing risk and return. It highlights the foundational Markowitz mean-variance (MV) model, which, despite its significance in modern portfolio theory, is limited by its single-period framework and its inability to account for real-world constraints such as transaction costs. The paper notes that various enhancements to the MV model have been proposed, including heuristic methods and, more recently, deep learning approaches, which can be categorized as model-based or model-free.

Model-based methods rely on deep neural networks (DNNs) to predict future asset prices, while model-free approaches use reinforcement learning (RL) to derive portfolio weights directly, without explicit price predictions. The paper underscores the potential of deep reinforcement learning (DRL) in portfolio management, as it aligns closely with market dynamics and simplifies investment actions. The authors aim to fill a gap in the literature by providing a comprehensive taxonomy of DRL applications in portfolio management, detailing their characteristics, identifying existing research gaps, and conducting experiments that evaluate various DRL algorithms against traditional portfolio strategies. The findings are expected to advance the understanding and application of DRL in optimizing investment strategies.

Methods

This section outlines reinforcement learning (RL) methods for portfolio management, grouped into actor-critic, critic-only, and actor-only approaches. Actor-critic methods estimate both a policy and a value function: the actor selects an action from the state input, and the critic evaluates that action’s value, which in turn guides the actor’s policy updates. The actor is parameterized by $\theta_\mu$ and the critic by $\theta_Q$; both are updated using either the action-value function $Q(s, a; \theta_Q)$ or the state-value function $V^{\pi_{\theta_\mu}}(s)$.
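For reference, one common way to write these updates is the deterministic actor-critic form used by DDPG-style algorithms. The discount factor $\gamma$, the deterministic policy $\mu(s; \theta_\mu)$, the reward $r$, the next state $s'$, and the replay distribution $\mathcal{D}$ are notational assumptions that do not appear in the summary above, and target networks are omitted for brevity; the paper's own formulation may differ.

$$
\begin{aligned}
y &= r + \gamma\, Q\big(s', \mu(s'; \theta_\mu); \theta_Q\big), \\
L(\theta_Q) &= \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\Big[\big(y - Q(s, a; \theta_Q)\big)^2\Big], \\
\nabla_{\theta_\mu} J &\approx \mathbb{E}_{s \sim \mathcal{D}}\Big[\nabla_a Q(s, a; \theta_Q)\big|_{a = \mu(s; \theta_\mu)}\, \nabla_{\theta_\mu} \mu(s; \theta_\mu)\Big].
\end{aligned}
$$

The critic minimizes the squared temporal-difference error $L(\theta_Q)$, and the actor moves $\theta_\mu$ in the direction that increases the critic's estimate of the value of its own actions.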

Critic-only methods, such as DQN and SARSA, estimate only the value function and derive the policy from it indirectly. In contrast, actor-only methods optimize the policy parameters directly to maximize the expected cumulative reward without estimating a value function, using deterministic policy gradients, which are particularly effective in continuous action spaces. The section also discusses frameworks and algorithms developed specifically for portfolio management, including the Ensemble of Identical Independent Evaluators (EIIE) and cost-sensitive portfolio policy networks (PPN), which improve performance by accounting for transaction costs and risk. The experimental results indicate that actor-critic algorithms, particularly A2C, DDPG, and PPO, outperform traditional methods and other RL approaches in portfolio management.
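To illustrate how a critic-only method derives a policy indirectly, the sketch below shows tabular versions of the SARSA and Q-learning updates. It is a simplification under stated assumptions: DQN replaces the table with a deep network, and the table layout, learning rate `alpha`, and discount `gamma` are illustrative choices rather than settings from the paper.

```python
import numpy as np


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy TD update: bootstrap from the action actually taken next."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
    return Q


def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy TD update: bootstrap from the greedy next action."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q


def greedy_policy(Q):
    """Policy read off the value table: the best-valued action in each state."""
    return np.argmax(Q, axis=1)
```

Neither update touches policy parameters; the policy exists only as the `argmax` over the learned values, which is why such methods are most natural for discrete action sets.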

Discussion

In the discussion section, the paper elaborates on the foundations of Modern Portfolio Theory (MPT), introduced by Harry Markowitz, with emphasis on the mean-variance (MV) model. The model uses statistical measures such as variance and correlation to assess how each investment affects the portfolio’s overall risk-return profile. The core tenet of MPT is that investors, being risk-averse, prefer portfolios that maximize expected return for a given level of risk or minimize risk for a desired return. The MV model relies on diversification and asset allocation to reduce risk. The expected return of a portfolio is the weighted average of the expected returns of its constituent assets. Portfolio risk, however, is not simply the weighted average of the individual asset risks; it also depends on the covariances between asset returns.
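The difference between the two calculations can be seen in a small numerical sketch. The two-asset weights, expected returns, volatilities, and correlation below are hypothetical values chosen for illustration and do not come from the paper.

```python
import numpy as np


def portfolio_expected_return(weights, expected_returns):
    """Weighted average of the assets' expected returns: w^T mu."""
    return float(np.dot(weights, expected_returns))


def portfolio_variance(weights, cov):
    """Portfolio variance w^T Sigma w, which includes the covariance terms."""
    w = np.asarray(weights, dtype=float)
    return float(w @ np.asarray(cov, dtype=float) @ w)


# Hypothetical two-asset example (illustrative numbers, not from the paper).
w = np.array([0.5, 0.5])
mu = np.array([0.08, 0.12])
vol = np.array([0.20, 0.30])
rho = 0.2  # assumed correlation between the two assets
cov = np.array([
    [vol[0] ** 2, rho * vol[0] * vol[1]],
    [rho * vol[0] * vol[1], vol[1] ** 2],
])

port_ret = portfolio_expected_return(w, mu)
port_vol = np.sqrt(portfolio_variance(w, cov))
weighted_avg_vol = float(np.dot(w, vol))
```

Here the expected return is the plain weighted average (0.10), while the portfolio volatility (about 0.196) is below the weighted average of the individual volatilities (0.25) because the assumed correlation is less than 1. This gap is the diversification benefit.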

The discussion then turns to reinforcement learning (RL), contrasting it with supervised learning (SL) by highlighting RL’s temporal, goal-oriented nature. The Markov decision process (MDP) is introduced as the framework for RL, in which an agent interacts with its environment to maximize cumulative reward. The key components of an RL system (policy, reward signal, value function, and environment model) are defined, showing how agents learn optimal actions by balancing exploration and exploitation. The paper also outlines the main families of RL algorithms (critic-only, actor-only, and actor-critic) and their applications in portfolio management. Finally, deep reinforcement learning (DRL) is presented as an approach that combines deep neural networks with RL, enabling effective portfolio management by mapping market information to optimal asset allocation strategies.
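As a rough illustration of how portfolio management fits the MDP framing, the sketch below treats the previous weights as part of the state, the policy output as the action, and the log growth of portfolio value as the reward. The softmax mapping to long-only weights, the proportional turnover cost, and the names `step` and `discounted_return` are assumptions made for this example and are not the paper's environment.

```python
import numpy as np


def softmax(scores):
    """Map unconstrained policy outputs to long-only weights that sum to 1."""
    z = np.asarray(scores, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()


def step(prev_weights, action_scores, price_relatives, cost_rate=0.001):
    """One transition: action -> new weights -> reward.

    price_relatives[i] = next_price[i] / current_price[i].
    The proportional cost on turnover is a simplifying assumption.
    """
    weights = softmax(action_scores)
    turnover = np.abs(weights - np.asarray(prev_weights, dtype=float)).sum()
    gross_growth = float(np.dot(weights, price_relatives))
    net_growth = gross_growth * (1.0 - cost_rate * turnover)
    reward = float(np.log(net_growth))
    return weights, reward


def discounted_return(rewards, gamma=0.99):
    """Cumulative discounted reward that the agent aims to maximize."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

Using log growth as the reward makes per-step rewards add up to the log of total portfolio growth when `gamma` is 1, which ties the agent's objective directly to cumulative return.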