Risk-Adjusted Deep Reinforcement Learning for Portfolio Optimization: A Multi-reward Approach

Journal: International Journal of Computational Intelligence Systems, Volume: 18, Issue: 1
DOI: https://doi.org/10.1007/s44196-025-00875-8
Publication Date: 2025-05-26
Author(s): Himanshu Choudhary et al.
Primary Topic: Stock Market Forecasting Methods

Overview

The research paper presents a novel risk-adjusted deep reinforcement learning (RA-DRL) approach for portfolio optimization, addressing the complexities faced by risk-averse investors. The methodology employs three distinct deep reinforcement learning (DRL) agents, each trained with different reward functions—log returns, differential Sharpe ratio, and maximum drawdown—to create a unified policy. This policy is further refined using a convolutional neural network (CNN), enabling the integration of diverse investment objectives into a single risk-adjusted action. The approach is validated through empirical testing on daily data from four major stock markets: Sensex, Dow, TWSE, and IBEX, demonstrating superior performance in risk and return metrics compared to traditional DRL agents and benchmark methods.

In conclusion, the RA-DRL framework not only maximizes profitability while minimizing risk but also exhibits robustness across various market conditions. The study suggests that future research could explore the implementation of more sophisticated DRL agents and the development of a single dynamic reward function to streamline learning processes. Additionally, expanding the portfolio optimization framework to include a wider array of financial instruments, such as ETFs, bonds, and derivatives, could enhance its applicability and effectiveness in real-world investment scenarios, catering to diverse investor needs and adapting to dynamic market environments.

Introduction

The introduction of this research paper outlines the critical challenge of portfolio optimization, which aims to allocate funds across various assets to maximize returns while minimizing risk. The authors highlight the historical context of this problem, referencing Markowitz’s mean-variance optimization (MVO) model, which established the foundational trade-off between risk and return in investment strategies. However, due to the unpredictable nature of stock market returns, traditional methods may not yield efficient portfolios in the future. This has led to the exploration of machine learning (ML) and deep learning (DL) techniques, which can analyze historical data patterns and enhance decision-making in portfolio management.

The paper introduces a novel risk-adjusted deep reinforcement learning (RA-DRL) methodology that integrates three distinct reward functions—log return, Differential Sharpe ratio (DSR), and maximum drawdown (MDD)—to address the complexities of portfolio optimization. By employing a convolutional neural network (CNN) to combine the strengths of these reward functions, the proposed framework aims to achieve a balanced approach to maximizing returns while managing risk. The authors also emphasize the innovative integration of supervised learning with reinforcement learning, which enhances the model’s training efficiency and adaptability to varying market conditions. Empirical results across four global markets demonstrate the superiority of the RA-DRL approach compared to established benchmarks, marking significant contributions to the field of financial portfolio optimization.
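The three reward signals named above can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function and class names, the decay rate `eta`, and the sign convention for the drawdown reward (the negative of the drawdown, so that a smaller drawdown scores higher) are all assumptions. The differential Sharpe ratio uses the standard recursion of Moody and Saffell, which keeps exponential moving estimates of the first and second moments of the returns.

```python
import math

def log_return_reward(prev_value, value):
    """Log of the one-step growth in portfolio value."""
    return math.log(value / prev_value)

def max_drawdown(values):
    """Largest peak-to-trough fractional decline in a value series."""
    peak = values[0]
    worst = 0.0
    for v in values:
        peak = max(peak, v)
        worst = max(worst, (peak - v) / peak)
    return worst

def mdd_reward(values):
    """Assumed convention: penalize drawdown by returning its negative."""
    return -max_drawdown(values)

class DifferentialSharpe:
    """Online differential Sharpe ratio (Moody-Saffell recursion)."""

    def __init__(self, eta=0.01):
        self.eta = eta  # assumed decay rate of the moving averages
        self.a = 0.0    # moving estimate of the mean return
        self.b = 0.0    # moving estimate of the mean squared return

    def step(self, r):
        da = r - self.a
        db = r * r - self.b
        var = self.b - self.a ** 2
        # The ratio is undefined while the variance estimate is zero;
        # returning 0.0 there is a placeholder choice, not part of the formula.
        dsr = 0.0 if var <= 0 else (self.b * da - 0.5 * self.a * db) / var ** 1.5
        # Update the moving estimates after computing the reward.
        self.a += self.eta * da
        self.b += self.eta * db
        return dsr
```

Each agent would receive one of these values as its per-step reward; how the paper scales or windows them is not stated in this summary.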

Methods

The methodology section outlines an approach to portfolio optimization using a risk-adjusted deep reinforcement learning framework, termed RA-DRL. This framework addresses the portfolio optimization problem, which involves dynamically reallocating capital among various financial assets to minimize risk while maximizing long-term returns. The authors formulate the problem as a Markov decision process (MDP) characterized by a state space ($S$), action space ($A$), transition probabilities ($P$), reward function ($R$), and a discount factor ($\gamma$). The objective is to derive a policy ($\pi$) that maximizes the expected discounted cumulative reward over an infinite time horizon.
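In standard MDP notation, this objective can be written as follows. The symbols $s_t$, $a_t$, and $\pi^*$ are the usual textbook conventions and are not taken from the paper:

$$\pi^* = \arg\max_{\pi} \; \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t)\right], \qquad 0 \le \gamma < 1.$$

With $\gamma < 1$ and bounded rewards the infinite sum converges, which is what makes the infinite-horizon objective well defined.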

The proposed RA-DRL framework integrates three distinct DRL models, each trained with a different reward function to serve a different investment goal: maximizing returns (using log return), minimizing risk (using maximum drawdown, MDD), and maximizing risk-adjusted returns (using the differential Sharpe ratio, DSR). The actions from these models are aggregated through a 2D convolutional neural network (CNN) to extract meaningful representations, which are then processed by fully connected layers to determine the final portfolio weights. The weights are computed using the formula $w_{i,t} = \frac{e^{\rho_{i,t} \times c}}{\sum_{j} e^{\rho_{j,t} \times c}}$, where $c$ is a constant, $\rho_{i,t}$ represents the percentage change in the price of the $i^{th}$ stock at time $t$, and the sum in the denominator runs over all stocks in the portfolio so that the weights sum to one. The study employs Bayesian optimization for hyperparameter tuning when evaluating the proposed methodology.
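Reading the weight formula as a softmax across assets (an interpretation: the normalizing sum over assets is assumed so that the weights add up to one), a minimal Python sketch looks like this. The function name `softmax_weights`, the sample percentage changes, and the value of `c` are hypothetical.

```python
import math

def softmax_weights(pct_changes, c=1.0):
    """Map per-asset percentage changes rho_i to weights that sum to 1."""
    # Shifting by the max leaves the ratios unchanged and avoids
    # overflow in exp() when c is large and positive.
    m = max(pct_changes)
    exps = [math.exp((rho - m) * c) for rho in pct_changes]
    total = sum(exps)
    return [e / total for e in exps]

# Three hypothetical assets with daily changes of +2%, -1%, and +0.5%.
weights = softmax_weights([0.02, -0.01, 0.005], c=50.0)
print(weights)
```

Under this reading, a larger $c$ concentrates the allocation on the best-performing asset, while $c$ close to zero pushes the weights toward an equal split.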

Results

In this study, the proposed risk-adjusted deep reinforcement learning (RA-DRL) methodology was backtested over a trading period from January 1, 2021, to March 31, 2024, with an initial capital of ₹1,000,000. The performance of RA-DRL was evaluated against various base models and benchmarks, focusing on metrics such as cumulative returns, annualized returns, volatility, and risk-adjusted returns. Notably, RA-DRL achieved the highest cumulative return of 124.83% and an annualized return of 29.03% for the Sensex index, outperforming the market index by approximately 1.4 times. The model also exhibited a superior Sharpe ratio of 1.69 and an Omega ratio of 1.33, indicating its effectiveness in generating higher risk-adjusted returns.
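The headline metrics can be computed from a series of daily returns along these lines. This is a sketch under stated assumptions: 252 trading days per year, a zero risk-free rate, and an Omega threshold of zero. The paper's exact conventions may differ, and the function names are illustrative.

```python
import math

TRADING_DAYS = 252  # assumed annualization factor

def cumulative_return(daily):
    """Compound the daily returns into a total return."""
    growth = 1.0
    for r in daily:
        growth *= 1.0 + r
    return growth - 1.0

def annualized_return(daily):
    """Geometric average return per year of TRADING_DAYS days."""
    years = len(daily) / TRADING_DAYS
    return (1.0 + cumulative_return(daily)) ** (1.0 / years) - 1.0

def annualized_volatility(daily):
    """Sample standard deviation of daily returns, scaled to a year."""
    mean = sum(daily) / len(daily)
    var = sum((r - mean) ** 2 for r in daily) / (len(daily) - 1)
    return math.sqrt(var * TRADING_DAYS)

def sharpe_ratio(daily, risk_free_daily=0.0):
    """Annualized mean excess return over annualized volatility."""
    excess = [r - risk_free_daily for r in daily]
    mean = sum(excess) / len(excess)
    return mean * TRADING_DAYS / annualized_volatility(excess)

def omega_ratio(daily, threshold=0.0):
    """Sum of gains above the threshold over sum of losses below it."""
    gains = sum(max(r - threshold, 0.0) for r in daily)
    losses = sum(max(threshold - r, 0.0) for r in daily)
    return math.inf if losses == 0 else gains / losses
```

As a rough consistency check on the reported figures, a 124.83% cumulative return that compounds at 29.03% a year implies a horizon of roughly 3.2 years, which is in line with the stated backtest window.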

Throughout the trading period, RA-DRL demonstrated consistent profitability, particularly during market fluctuations, and maintained a strong performance across multiple indices, including the Dow, TWSE, and IBEX. For the Dow index, RA-DRL generated cumulative and annualized returns of 50.78% and 13.78%, respectively, surpassing the index returns by 1.6 times. While the MVO model exhibited the lowest volatility, RA-DRL maintained a balance between returns and risk, showcasing its robustness in various market conditions. The results affirm that RA-DRL significantly outperformed other models in five out of six performance metrics, underscoring its potential as a viable trading strategy in dynamic market environments.

Discussion

The discussion section of the paper outlines the organization and key findings related to the application of Deep Reinforcement Learning (DRL) in portfolio optimization. It begins with a review of existing literature, categorizing DRL applications into prediction-based and portfolio optimization-based solutions. The authors highlight the evolution of portfolio optimization methods, transitioning from traditional dynamic programming to DRL approaches, which have shown improved performance in managing asset allocation. Notable contributions include the formulation of portfolio management as a Markov decision process and the introduction of various reward functions that enhance risk-adjusted returns, such as the Differential Sharpe Ratio (DSR) and Maximum Drawdown (MDD).

The section further elaborates on the proposed RA-DRL methodology, which integrates multiple reward functions to balance return maximization and risk management. The experimental results demonstrate the superiority of the RA-DRL model over traditional benchmarks and other DRL models, as evidenced by statistically significant improvements in key performance indicators such as cumulative returns (CR) and the Sharpe ratio (SR). The authors conclude by emphasizing the potential for future research to expand the framework to include a broader range of financial instruments and to explore single dynamic reward functions for enhanced stability in learning. This work not only contributes to the field of finance but also offers insights applicable to various decision-making scenarios involving conflicting objectives.