Data-driven energy management for electric vehicles using offline reinforcement learning

Journal: Nature Communications, Volume: 16, Issue: 1
DOI: https://doi.org/10.1038/s41467-025-58192-9
PMID: https://pubmed.ncbi.nlm.nih.gov/40121205
Publication Date: 2025-03-22
Author(s): Yong Wang et al.
Primary Topic: Advanced Battery Technologies Research

Overview

This section discusses how energy management technologies can improve electric vehicle (EV) performance and contribute to global energy sustainability. Despite significant research effort, practical deployment of these technologies has been limited, primarily because most work relies on simulations that do not transfer well from theoretical models to real-world scenarios. This study presents a data-driven energy management framework based on offline reinforcement learning, which learns from actual EV operation data to optimize energy consumption without manually crafted rules or high-fidelity simulations.

The proposed framework was tested on fuel cell electric vehicles (FCEVs), where it reduced both energy consumption and system degradation. Validation on real-world data from an electric vehicle monitoring system in China shows that the method consistently outperforms existing strategies, with performance rising from 88% to 98.6% of the theoretical optimum as more data becomes available. Training on over 60 million kilometers of data allows the learning agent to generalize across varied driving conditions, including previously unseen scenarios. These findings underscore the potential of data-driven approaches to improve energy efficiency and vehicle longevity, particularly as the automotive industry shifts towards hybrid energy systems that integrate multiple energy sources.

Methods

In this section, the authors outline the baseline methods used to evaluate the proposed offline reinforcement learning (ORL) method for energy management systems (EMS). The evaluation includes several established techniques: Dynamic Programming (DP), Behavior Cloning (BC), Proximal Policy Optimization (PPO), and Twin Delayed Deep Deterministic Policy Gradient (TD3).

DP serves as the benchmark: it represents the global optimum, minimizing the cost function defined in Equation (10) through a backward-in-time shortest-path search, but it requires the full future driving profile in advance. BC, an imitation learning approach, learns directly from a dataset generated by expert policies, mapping states to actions via supervised learning. PPO is a robust online deep reinforcement learning (DRL) algorithm that uses on-policy learning and mini-batch updates, and it provides a near-optimal strategy for comparison. Lastly, TD3, an off-policy DRL algorithm, employs twin critic networks to improve action-value estimation and mitigate overestimation bias, and serves as a further baseline against which ORL's performance is assessed.
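
To make the DP baseline concrete, the following is a minimal sketch of backward induction on a toy problem. It is not the paper's formulation: the integer battery-energy grid, the quadratic fuel cost, and the charge-sustaining terminal condition are all assumptions chosen for illustration, and Equation (10) is not reproduced here. The sketch does show why DP needs future information: the whole `demand` sequence must be known before the backward pass can start.

```python
import math


def dp_optimal_cost(demand, levels, actions, start, cost):
    """Backward induction over a discretised battery-energy grid (toy model)."""
    # Terminal condition (assumed): finish at or above the starting level.
    value = [0.0 if e >= start else math.inf for e in range(levels)]
    policy = []
    for t in reversed(range(len(demand))):
        new_value = [math.inf] * levels
        best = [None] * levels
        for e in range(levels):
            for u in actions:
                # The battery absorbs the gap between fuel-cell output and demand.
                nxt = e + u - demand[t]
                if 0 <= nxt < levels:
                    total = cost(u) + value[nxt]
                    if total < new_value[e]:
                        new_value[e], best[e] = total, u
        value = new_value
        policy.insert(0, best)
    return value[start], policy


def rollout(policy, demand, start):
    """Follow the optimal policy forward from the starting energy level."""
    e, plan = start, []
    for t, d in enumerate(demand):
        u = policy[t][e]
        plan.append(u)
        e += u - d
    return plan


# Hypothetical inputs: four time steps, four energy levels, fuel-cell output 0..3.
toy_demand = [1, 3, 0, 2]
best_cost, best_policy = dp_optimal_cost(
    toy_demand, levels=4, actions=(0, 1, 2, 3), start=2, cost=lambda u: u ** 2
)
best_plan = rollout(best_policy, toy_demand, start=2)
```

Because the cost is convex in the fuel-cell output, the optimal plan spreads the load as evenly as the battery limits allow, which is the kind of global trade-off an online controller cannot compute without knowing the future demand.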

Results

The results section outlines the implementation and evaluation of a data-driven offline reinforcement learning (ORL) framework for energy management systems (EMS) in fuel cell electric vehicles (FCEVs). Unlike traditional simulation-based methods, the ORL approach learns from pre-generated datasets rather than from real-time interaction with an EV simulation environment. A specialized augmented-reality platform was developed to collect extensive operational data, integrating real-world driving conditions with a high-fidelity simulated FCEV powertrain model. This platform enables the generation of comprehensive EMS datasets, including over 60 million kilometers of driving data, which are processed into a standardized transition dataset for subsequent policy learning.

The study introduces the Actor-Critic with Behavior Policy Regularization (AC-BPR) method, which enhances traditional ORL techniques by combining Behavior Cloning (BC) and Discriminator-based Regularization (DR). This approach effectively balances policy alignment with expert behavior and exploration of high Q-value regions, allowing for improved performance even with suboptimal datasets. The training process involves sampling transitions from a replay buffer to update the Actor and Critic networks, with the trained ORL agent evaluated across various driving cycles and real-world scenarios. The iterative training and evaluation process continues until the EMS achieves satisfactory performance, at which point the agent is deemed ready for deployment in practical applications.
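
The trade-off between staying close to the dataset and moving toward high Q-values can be illustrated with a behavior-regularized actor loss. The sketch below follows the well-known TD3+BC form (a normalized Q term plus a squared-error behavior cloning term); it is an assumption used for illustration, not the paper's AC-BPR objective, and the discriminator-based term is omitted because its form is not given in this summary. The names `actor_loss`, `policy`, `q_value`, and `alpha` are hypothetical.

```python
def actor_loss(batch, policy, q_value, alpha=2.5):
    """TD3+BC-style actor loss on (state, dataset_action) pairs (illustrative)."""
    q_terms = [q_value(s, policy(s)) for s, _ in batch]
    # Normalise the Q term so both parts of the loss stay on a comparable scale.
    scale = alpha / (sum(abs(q) for q in q_terms) / len(q_terms) + 1e-8)
    # Behaviour cloning term: squared distance to the actions in the dataset.
    bc_term = sum((policy(s) - a) ** 2 for s, a in batch) / len(batch)
    q_term = sum(q_terms) / len(q_terms)
    # Minimising this loss raises Q while penalising departures from the data.
    return -scale * q_term + bc_term


# Hypothetical scalar example: the critic prefers action 1.0, as does the dataset.
toy_batch = [(0.0, 1.0), (1.0, 1.0)]
toy_q = lambda s, a: -(a - 1.0) ** 2
loss_far = actor_loss(toy_batch, lambda s: 0.5, toy_q)
loss_near = actor_loss(toy_batch, lambda s: 1.0, toy_q)
```

A larger `alpha` weights the Q term more heavily and lets the policy depart further from the dataset actions, while a smaller one pushes the result toward plain behavior cloning.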

Discussion

In this section, the authors discuss the development and evaluation of a data-driven energy management system (EMS) for hybrid electric vehicles (HEVs) using an offline reinforcement learning (ORL) agent. The study employs Proximal Policy Optimization (PPO) as the expert model to generate high-quality datasets, and also uses random action sampling to create suboptimal datasets. Four distinct dataset compositions (D1 to D4) are analyzed, revealing significant differences in action and state distributions, particularly in terms of power control and state of charge (SOC) constraints. The ORL agent learns effective EMS strategies from both expert and suboptimal datasets, achieving superior performance compared to traditional methods, including Behavior Cloning (BC) and other reinforcement learning approaches.
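
One way to picture how dataset compositions of differing quality might be assembled is to mix expert and randomly generated transitions at a chosen ratio. The sketch below is an assumption for illustration only: the actual proportions behind D1 to D4 are not stated in this summary, and `mix_datasets`, `expert_fraction`, and the placeholder transitions are hypothetical.

```python
import random


def mix_datasets(expert, suboptimal, expert_fraction, size, seed=0):
    """Draw a fixed-size dataset with a given share of expert transitions."""
    rng = random.Random(seed)  # seeded so the composition is reproducible
    n_expert = round(size * expert_fraction)
    mixed = rng.sample(expert, n_expert) + rng.sample(suboptimal, size - n_expert)
    rng.shuffle(mixed)  # avoid ordering the two sources back to back
    return mixed


# Placeholder transitions; real ones would hold (state, action, reward, next_state).
expert_data = [("expert", i) for i in range(100)]
random_data = [("random", i) for i in range(100)]
mostly_expert = mix_datasets(expert_data, random_data, expert_fraction=0.75, size=40)
```

Sweeping `expert_fraction` from 1.0 down to 0.0 would give a family of datasets ranging from pure expert data to purely random behavior, which is the kind of spread needed to test whether an offline agent can still improve on suboptimal data.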

The results indicate that the ORL agent can effectively balance exploration and imitation through the proposed AC-BPR algorithm, allowing it to refine its policy even when trained on noisy data. Notably, the ORL agent outperforms expert strategies across various driving cycles, achieving significant cost reductions while maintaining operational efficiency. Furthermore, the study highlights the potential for continuous learning, in which the ORL agent adapts to new data over time and improves its EMS capabilities. Overall, the findings suggest that the ORL framework can serve as a foundational model for data-driven EMS applications beyond electric vehicles, addressing challenges such as the sim-to-real gap and enabling robust performance across diverse operating conditions. However, the authors acknowledge that the ORL agent requires substantial data for effective learning, which may limit its applicability where data is scarce.