Geometric-Entropic Optimization: Integrating Optimal Transport with Riemannian Gradient Methods for Neural Network Training

Journal: Journal of Optimization Theory and Applications, Volume: 209, Issue: 1
DOI: https://doi.org/10.1007/s10957-026-02958-8
Publication Date: 2026-03-11
Author(s): Massimiliano Ferrara
Primary Topic: Stochastic Gradient Optimization Techniques

Overview

The paper presents Geometric-Entropic Optimization (GEO), an algorithm for neural network training that combines Riemannian gradient methods with entropy-regularized optimal transport. GEO operates on a parameter manifold equipped with a Fisher-Wasserstein metric and uses Sinkhorn-type projections to maintain distributional constraints on layer activations. The paper provides convergence guarantees, showing that GEO achieves a rate of \( O\left(\frac{1}{\sqrt{T}}\right) + O(\rho^{2K}) \), where the first term comes from Riemannian gradient descent and the second from the contraction of the Sinkhorn iterations. Empirical evaluations indicate that GEO consistently outperforms standard optimizers, with improvements of roughly 20% on benchmark tasks in continuous control and language modeling.
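To make the Sinkhorn-type projection more concrete, the following is a minimal sketch of the textbook Sinkhorn iteration for entropy-regularized optimal transport between two discrete distributions. It is not the paper's implementation: the function name `sinkhorn`, the regularization strength `eps`, the random cost matrix, and the uniform marginals are all assumptions chosen for illustration.

```python
import numpy as np

def sinkhorn(cost, r, c, eps=0.5, num_iters=200):
    """Return a coupling with row sums r and column sums c (illustrative sketch).

    Alternately rescales the rows and columns of the Gibbs kernel exp(-cost/eps).
    """
    K = np.exp(-cost / eps)            # Gibbs kernel from the cost matrix
    u = np.ones_like(r)
    v = np.ones_like(c)
    for _ in range(num_iters):
        u = r / (K @ v)                # rescale to match row marginals
        v = c / (K.T @ u)              # rescale to match column marginals
    return u[:, None] * K * v[None, :]

# Hypothetical toy problem: random costs, uniform marginals.
rng = np.random.default_rng(0)
cost = rng.random((4, 5))
r = np.full(4, 1 / 4)
c = np.full(5, 1 / 5)

P = sinkhorn(cost, r, c)
row_err = np.abs(P.sum(axis=1) - r).max()
col_err = np.abs(P.sum(axis=0) - c).max()
print(f"row marginal error: {row_err:.2e}, column marginal error: {col_err:.2e}")
```

Each pass of the loop is one Sinkhorn iteration. The marginal error shrinks geometrically with the number of passes, which is the kind of contraction the second term of the quoted rate refers to.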

The conclusion emphasizes the significance of GEO’s theoretical framework, which not only offers convergence guarantees but also integrates recent innovations in deep learning within a unified optimization-theoretic perspective. Key contributions include the introduction of the Fisher-Wasserstein metric for blending statistical and distributional geometry, the use of Sinkhorn projections for enforcing manifold constraints, and a multi-scale entropic regularization that facilitates an automatic exploration-exploitation tradeoff. The findings suggest that future advancements in neural network optimization may rely more on the principled application of geometric structures to shape learning dynamics, rather than solely on increasing model sizes. This work honors the legacy of Constantin Udrişte, whose foundational contributions were instrumental in this research.

Introduction

The introduction addresses the challenges of optimizing neural network parameters, in particular the limitations of standard gradient descent in navigating complex, non-convex loss landscapes characterized by saddle points and varying curvature. Adaptive methods such as Adam have become popular, but they overlook the intrinsic geometric structure of the optimization problem. To address this, the paper proposes an optimization framework called Geometric-Entropic Optimization (GEO), which integrates Riemannian gradient descent with entropy-regularized optimal transport constraints. The approach rests on two observations: the parameter space of a neural network carries a Riemannian structure defined by the Fisher information metric, and the flow of information through network layers can be analyzed with optimal transport theory.
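The first observation can be illustrated with a natural gradient step, in which the gradient is preconditioned by the inverse Fisher information matrix. The sketch below uses logistic regression as a stand-in model because its Fisher matrix has a closed form. It covers only the Fisher part of the geometry, not the Fisher-Wasserstein metric itself, and the names `fisher`, `natural_gradient_step`, the learning rate, the damping term, and the synthetic data are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grad(theta, X, y):
    """Mean negative log-likelihood of logistic regression and its gradient."""
    p = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    grad = X.T @ (p - y) / len(y)
    return loss, grad

def fisher(theta, X):
    """Fisher information matrix of the model, averaged over the inputs."""
    p = sigmoid(X @ theta)
    w = p * (1 - p)
    return (X * w[:, None]).T @ X / len(X)

def natural_gradient_step(theta, X, y, lr=0.5, damping=1e-3):
    """One step theta <- theta - lr * (F + damping*I)^{-1} grad."""
    _, g = loss_and_grad(theta, X, y)
    F = fisher(theta, X) + damping * np.eye(len(theta))
    return theta - lr * np.linalg.solve(F, g)

# Synthetic data, made up for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = (rng.random(200) < sigmoid(X @ true_theta)).astype(float)

theta = np.zeros(3)
initial_loss, _ = loss_and_grad(theta, X, y)
for _ in range(20):
    theta = natural_gradient_step(theta, X, y)
final_loss, _ = loss_and_grad(theta, X, y)
print(f"loss: {initial_loss:.4f} -> {final_loss:.4f}")
```

The damping term keeps the linear solve well conditioned. For a deep network the full Fisher matrix is usually too large to form, so practical methods rely on structured approximations.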

The paper outlines three main theoretical contributions: the introduction of the Fisher-Wasserstein metric on neural network parameter space together with its geometric properties; convergence guarantees for GEO under standard smoothness assumptions, with explicit rates that depend on Riemannian curvature and Sinkhorn contraction; and an interpretation of recent architectural innovations as specific instances of manifold-constrained optimization with entropic regularization. Building on the geometric dynamics tradition and on earlier work in differential geometry and optimization theory, the research aims to make neural network training more efficient in the high-dimensional, stochastic settings typical of modern machine learning.
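As a worked reading of the rate \( O\left(\frac{1}{\sqrt{T}}\right) + O(\rho^{2K}) \) quoted in the overview, suppose the bounded quantity is at most \( C_1/\sqrt{T} + C_2\,\rho^{2K} \) with \( 0 < \rho < 1 \). The constants \( C_1, C_2 \) and the reading of \( T \) as the number of gradient steps and \( K \) as the number of Sinkhorn iterations per step are assumptions made here for illustration; the summary does not state them. To reach a target accuracy \( \varepsilon \), it suffices to make each term at most \( \varepsilon/2 \):

\[
\frac{C_1}{\sqrt{T}} \le \frac{\varepsilon}{2} \iff T \ge \frac{4 C_1^2}{\varepsilon^2},
\qquad
C_2\,\rho^{2K} \le \frac{\varepsilon}{2} \iff K \ge \frac{\log(2 C_2/\varepsilon)}{2\,\log(1/\rho)}.
\]

Under these assumptions the number of gradient steps grows like \( 1/\varepsilon^2 \), while the number of Sinkhorn iterations grows only logarithmically in \( 1/\varepsilon \), so the projection term is the cheaper one to drive down.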

Methods

The methods section lays out the framework used to unify geometric deep learning methods, beginning with definitions of the key manifolds. In the experimental setup, GEO is integrated with the Soft Actor-Critic (SAC) algorithm and evaluated on three MuJoCo environments: HalfCheetah-v2, Humanoid-v2, and Ant-v2. The state and action dimensions are 17 and 6 for HalfCheetah-v2, 376 and 17 for Humanoid-v2, and 111 and 8 for Ant-v2.

The actor and critic networks each have two hidden layers of 256 units with ReLU activations. Training runs for \(10^6\) environment steps, with 5 random seeds per configuration so that results can be compared across seeds. This setup is used to evaluate the combined GEO and SAC method on continuous control tasks.
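The setup above can be written down as a small configuration, which also gives a sense of how many parameters the optimizer has to handle. In this sketch the environment dimensions and hidden sizes come from the text, while the output conventions are assumptions: the actor is taken to output a mean and a log standard deviation per action dimension, and the critic is taken to be a single Q-network on the concatenated state and action. The summary does not specify either, and SAC implementations often train two critics.

```python
# State and action dimensions as quoted in the text.
ENV_DIMS = {
    "HalfCheetah-v2": (17, 6),
    "Humanoid-v2": (376, 17),
    "Ant-v2": (111, 8),
}
HIDDEN = (256, 256)          # two hidden layers of 256 units
TOTAL_ENV_STEPS = 10**6      # environment steps per run
NUM_SEEDS = 5                # random seeds per configuration

def mlp_param_count(sizes):
    """Weights plus biases of a fully connected network with the given layer sizes."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))

def sac_param_counts(state_dim, action_dim):
    """Parameter counts under the assumed actor and critic output conventions."""
    actor = mlp_param_count((state_dim, *HIDDEN, 2 * action_dim))   # mean and log-std (assumed)
    critic = mlp_param_count((state_dim + action_dim, *HIDDEN, 1))  # single Q-value (assumed)
    return actor, critic

counts = {name: sac_param_counts(s, a) for name, (s, a) in ENV_DIMS.items()}
for name, (actor, critic) in counts.items():
    print(f"{name}: actor={actor:,} params, critic={critic:,} params")
```

Even for Humanoid-v2 these networks have on the order of 10^5 parameters each, which is small by the standards of the language modeling experiments.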

Results

The results in Table 2 indicate that GEO improves performance across the tested environments. In the challenging Humanoid environment, GEO surpasses the Adam optimizer by 21.7% and the Muon optimizer by 6.8%. In addition, the standard deviation across seeds on Humanoid drops from 412 to 289, which suggests that GEO is less sensitive to the random seed.
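For scale, the reduction in seed-to-seed spread on Humanoid can be expressed as a relative change. This is arithmetic on the two figures quoted above, not a number reported in the summary:

\[
\frac{412 - 289}{412} = \frac{123}{412} \approx 0.299,
\]

that is, a standard deviation roughly 30% smaller. Since the mean returns are not given here, this says nothing about the spread relative to the mean.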

For language modeling, as shown in Table 3, GEO achieves the lowest perplexity, 20.3 compared with 22.4 for Adam, and reaches the target perplexity of 25 in 61,000 steps, 28% fewer than Adam's 85,000 steps. Training stability, quantified as \(1 - \mathrm{CV}\) of the loss over the final 10,000 steps, where CV is the coefficient of variation, rises from 0.89 to 0.97, indicating a steadier training process.
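The two derived quantities in this paragraph are simple to compute. The sketch below assumes CV means the standard deviation of the loss divided by its mean over the final window, which is the usual definition of the coefficient of variation. The helper names `stability` and `step_reduction` are hypothetical, and the loss curve is synthetic, made up only to exercise the function.

```python
import numpy as np

def stability(losses, window=10_000):
    """1 - CV over the last `window` loss values, with CV = std / mean."""
    tail = np.asarray(losses, dtype=float)[-window:]
    cv = tail.std() / tail.mean()
    return 1.0 - cv

def step_reduction(baseline_steps, new_steps):
    """Fraction of training steps saved relative to the baseline."""
    return (baseline_steps - new_steps) / baseline_steps

# Step counts quoted in the text: 85,000 for Adam, 61,000 for GEO.
reduction = step_reduction(85_000, 61_000)
print(f"step reduction: {reduction:.1%}")

# Synthetic loss curve (not data from the paper).
rng = np.random.default_rng(0)
fake_losses = 3.0 + 0.1 * rng.standard_normal(20_000)
score = stability(fake_losses)
print(f"stability of synthetic curve: {score:.3f}")
```

Because CV is scale-free, the score compares runs whose losses sit at different levels, but it becomes unreliable when the mean loss is close to zero.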

Discussion

The discussion frames GEO as an approach to neural network optimization that integrates Riemannian gradient methods with entropy-regularized optimal transport. The training problem is reformulated as minimizing a loss function subject to constraints defined on a Riemannian manifold, which addresses limitations of standard gradient descent such as the need for careful learning-rate tuning and inefficient parameter updates. The Fisher-Wasserstein metric is introduced to balance statistical efficiency against distributional geometry, providing a framework for understanding how parameter perturbations affect output distributions.
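Minimizing a loss under a manifold constraint can be illustrated with the simplest case, Riemannian gradient descent on the unit sphere. The sketch below uses the ordinary Euclidean metric restricted to the sphere, not the Fisher-Wasserstein metric, and a quadratic toy loss; the function name, step size, and test matrix are assumptions chosen so that the answer is known in advance.

```python
import numpy as np

def riemannian_gd_sphere(A, x0, lr=0.1, num_steps=500):
    """Minimize f(x) = x^T A x subject to ||x|| = 1 (illustrative sketch)."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(num_steps):
        egrad = 2 * A @ x                   # Euclidean gradient of f
        rgrad = egrad - (x @ egrad) * x     # project onto the tangent space at x
        x = x - lr * rgrad                  # step along the negative Riemannian gradient
        x = x / np.linalg.norm(x)           # retraction: map back onto the sphere
    return x

# Toy problem with a known answer: the minimum of x^T A x on the sphere
# is the smallest eigenvalue of A, here 1.0.
A = np.diag([1.0, 2.0, 3.0, 4.0, 5.0])
x = riemannian_gd_sphere(A, np.ones(5))
rayleigh = float(x @ A @ x)
print(f"final loss: {rayleigh:.10f}, ||x|| = {np.linalg.norm(x):.10f}")
```

The two geometric ingredients are the tangent-space projection, which removes the component of the gradient that would leave the manifold, and the retraction, which restores the constraint after each step.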

The GEO algorithm has three key components: a Riemannian gradient step, Sinkhorn-type projections that enforce manifold constraints, and multi-scale entropic regularization that provides an automatic exploration-exploitation tradeoff during training. Empirical evaluations show that GEO consistently outperforms traditional optimizers on tasks including continuous control and language modeling, while also coming with theoretical convergence guarantees. The findings suggest that exploiting geometric structure in optimization can lead to more efficient training dynamics, and that principled approaches matter as much as simply increasing model size.
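The alternation between a gradient step and a Sinkhorn-type projection can be sketched on a toy problem. The example below is not the GEO algorithm: it minimizes a linear transport cost over couplings with fixed marginals, using a multiplicative (entropic mirror descent) update followed by row and column rescaling. The cost matrix, marginals, step size, and iteration counts are all assumptions.

```python
import numpy as np

def project_marginals(P, r, c, num_iters=200):
    """Sinkhorn-type rescaling of a positive matrix toward row sums r, column sums c."""
    for _ in range(num_iters):
        P = P * (r / P.sum(axis=1))[:, None]   # match row marginals
        P = P * (c / P.sum(axis=0))[None, :]   # match column marginals
    return P

# Hypothetical toy problem: random costs, uniform marginals.
rng = np.random.default_rng(0)
C = rng.random((4, 5))
r = np.full(4, 1 / 4)
c = np.full(5, 1 / 5)

P = np.outer(r, c)                  # start from the independent coupling
costs = [float((C * P).sum())]
lr = 1.0
for _ in range(10):
    P = P * np.exp(-lr * C)         # multiplicative gradient step on <C, P>
    P = project_marginals(P, r, c)  # restore the marginal constraints
    costs.append(float((C * P).sum()))

row_err = np.abs(P.sum(axis=1) - r).max()
col_err = np.abs(P.sum(axis=0) - c).max()
print(f"cost: {costs[0]:.4f} -> {costs[-1]:.4f}, row error: {row_err:.1e}")
```

In this toy setting each outer step moves the coupling toward cheaper transport plans while the projection keeps it feasible, which mirrors the step-then-project structure described above without reproducing the paper's metric or regularization schedule.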