BrownoutServe: SLO-Aware Inference Serving Under Bursty Workloads for MoE-Based LLMs

Journal: IEEE Transactions on Computers, Volume: 75, Issue: 4
DOI: https://doi.org/10.1109/tc.2026.3655019
Publication Date: 2026-01-20
Author(s): Zhenyun Du et al.
Primary Topic: IoT and Edge/Fog Computing

Overview

This section gives an overview of BrownoutServe, a serving framework designed to make inference more efficient for Mixture-of-Experts (MoE) based large language models (LLMs). The framework targets two sources of inefficiency: static model placement and the inability to adapt to dynamic workloads, both of which can lead to poor resource utilization and higher latency during peak demand. BrownoutServe introduces “united experts,” which consolidate knowledge from multiple experts so that fewer experts need to be accessed and inference latency drops. It also employs a dynamic brownout mechanism that adaptively controls how tokens are processed, improving performance while keeping service level objectives (SLOs) satisfied.

In the reported evaluations, BrownoutServe improves throughput by up to 2.07 times over existing systems such as vLLM and reduces SLO violations by 90.28%. These results indicate that the framework handles bursty traffic robustly while maintaining acceptable inference accuracy. The authors also outline future work: extending the framework to other MoE-based LLMs and integrating a prefill-decoding separation architecture to further improve SLO attainment.

Introduction

The introduction discusses the growing interest in Mixture-of-Experts (MoE) architectures for large language models (LLMs), highlighting their ability to significantly reduce training and inference costs by activating only a subset of model parameters during computation. Despite these advantages, efficient inference remains a challenge in both the prefill and decoding stages. The authors note that the prefill stage is often a performance bottleneck because nearly all experts are activated, while the decoding stage can also suffer delays under high load. Imbalanced expert utilization makes latency worse: a few “hot” experts handle the majority of tokens while many “cold” experts remain underutilized.
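The hot/cold imbalance described above can be made concrete with a small simulation. The following is an illustrative sketch only: the router scores are synthetic, the top-4 routing and the 25% "hot" threshold are assumptions, and the function names (`route_top_k`, `expert_load`, `split_hot_cold`) are hypothetical rather than taken from the paper or from any serving framework.

```python
import random
from collections import Counter

def route_top_k(scores, k):
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return ranked[:k]

def expert_load(token_scores, k):
    """Count how many token assignments each expert receives under top-k routing."""
    load = Counter()
    for scores in token_scores:
        load.update(route_top_k(scores, k))
    return load

def split_hot_cold(load, num_experts, hot_fraction=0.25):
    """Label the most-loaded fraction of experts 'hot' and the rest 'cold'.

    The 25% cut-off is an assumption made for this illustration.
    """
    ranked = sorted(range(num_experts), key=lambda e: load[e], reverse=True)
    n_hot = max(1, int(num_experts * hot_fraction))
    return ranked[:n_hot], ranked[n_hot:]

# Synthetic, deliberately skewed router: low-index experts get a score bonus.
rng = random.Random(0)
NUM_EXPERTS, TOP_K, NUM_TOKENS = 60, 4, 2000  # TOP_K is an assumed value
bias = [3.0 / (e + 1) for e in range(NUM_EXPERTS)]
token_scores = [
    [bias[e] + rng.gauss(0, 1) for e in range(NUM_EXPERTS)]
    for _ in range(NUM_TOKENS)
]

load = expert_load(token_scores, TOP_K)
hot, cold = split_hot_cold(load, NUM_EXPERTS)
hot_share = sum(load[e] for e in hot) / (NUM_TOKENS * TOP_K)
print(f"{len(hot)} hot experts receive {hot_share:.0%} of token assignments")
```

With perfectly balanced routing the hot group would receive exactly its proportional share (25% here); any excess above that share is a rough measure of the skew.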

To address these challenges, the authors propose BrownoutServe, a framework built on two key ideas: the united expert model and a dynamic brownout mechanism. The united expert model consolidates the knowledge of multiple experts into a single expert, which reduces the frequency of expert access and improves GPU parallelism. The brownout mechanism, inspired by power system strategies, dynamically routes tokens to either the original experts or the united experts depending on workload demands, balancing accuracy against efficiency. A proposed SLO-Aware Latency Control algorithm complements this by adjusting the configuration to minimize accuracy loss while keeping inference latency within service level objectives (SLOs). The authors report that BrownoutServe achieves up to 2.07 times higher throughput and reduces SLO violations by 90.28% compared to existing systems such as vLLM.
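The interplay between latency feedback and token routing can be sketched as a simple control loop. This is a minimal sketch under stated assumptions, not the paper's SLO-Aware Latency Control algorithm: it assumes a single "brownout ratio" in [0, 1], a fixed step size, and a fixed mapping from cold experts to united experts. The names `BrownoutController`, `choose_expert`, and `cold_to_united` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class BrownoutController:
    """Step controller: raise the brownout ratio when latency exceeds the SLO."""
    slo_ms: float           # latency target
    step: float = 0.1       # assumed adjustment step per control interval
    headroom: float = 0.8   # relax only when latency < headroom * slo_ms
    ratio: float = 0.0      # share of eligible tokens sent to united experts

    def update(self, observed_latency_ms: float) -> float:
        if observed_latency_ms > self.slo_ms:
            # SLO at risk: trade some accuracy for speed.
            self.ratio = min(1.0, self.ratio + self.step)
        elif observed_latency_ms < self.headroom * self.slo_ms:
            # Comfortable headroom: restore accuracy.
            self.ratio = max(0.0, self.ratio - self.step)
        return self.ratio

def choose_expert(expert_id, token_index, cold_to_united, ratio):
    """Pick the original expert or its united replacement for one token.

    Only experts listed in cold_to_united are eligible for brownout. Token
    positions (mod 100) below the ratio threshold are browned out, which keeps
    the choice deterministic for this illustration.
    """
    united = cold_to_united.get(expert_id)
    if united is None:
        return ("original", expert_id)
    if (token_index % 100) < round(ratio * 100):
        return ("united", united)
    return ("original", expert_id)

# A burst pushes latency over a 200 ms target, then load subsides.
controller = BrownoutController(slo_ms=200.0)
for latency in [150, 250, 260, 240, 190, 120, 110]:
    ratio = controller.update(latency)
    print(f"latency={latency} ms -> brownout ratio={ratio:.1f}")
```

The dead band between `headroom * slo_ms` and `slo_ms` is a design choice that keeps the ratio from oscillating when latency hovers near the target.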

Methods

In this section, the authors detail the experimental setup. The evaluated model is Qwen1.5-MoE-A2.7B-Chat, which has 14.3 billion parameters and uses a Mixture-of-Experts (MoE) architecture with 60 experts per transformer layer. Other MoE models, such as Mixtral-8x7B, have fewer experts; the larger number of experts here is presented as making the brownout approach more effective. The experiments run on a machine equipped with four NVIDIA A100 GPUs and an Intel Xeon Gold 6238 processor. The evaluation uses four few-shot tasks (PIQA, COPA, CEVAL, and OBQA) as well as the ShareGPT and Alpaca datasets for fine-grained assessment of inference performance.

The baseline is the vLLM inference engine, known for its high throughput and low latency, which integrates various optimization techniques. Two versions of vLLM are evaluated: the native version with fused MoE and a modified non-fused version. The study focuses on three metrics: throughput, accuracy loss caused by the brownout approach, and the SLO violation rate, which measures the reliability of the service. The reported accuracy loss varies with the brownout configuration, indicating how much accuracy is given up in exchange for the performance gains.
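The three metrics can be written down as short formulas. The definitions below are plausible working assumptions (tokens per second, fraction of requests over the latency target, and absolute accuracy difference); the paper's exact definitions may differ, and all function names and sample numbers are hypothetical.

```python
def throughput(num_tokens: int, elapsed_s: float) -> float:
    """Tokens processed per second over a measurement window."""
    return num_tokens / elapsed_s

def slo_violation_rate(latencies_ms, slo_ms: float) -> float:
    """Fraction of requests whose latency exceeds the SLO target."""
    violations = sum(1 for latency in latencies_ms if latency > slo_ms)
    return violations / len(latencies_ms)

def accuracy_loss(baseline_acc: float, brownout_acc: float) -> float:
    """Absolute accuracy given up by enabling brownout."""
    return baseline_acc - brownout_acc

def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction of a rate, e.g. 0.9028 corresponds to 90.28%."""
    return (baseline - improved) / baseline

# Made-up request latencies for illustration only.
baseline_latencies = [180, 240, 310, 190, 260, 400, 150, 220]
brownout_latencies = [150, 170, 210, 140, 180, 190, 130, 160]
SLO_MS = 200

base_rate = slo_violation_rate(baseline_latencies, SLO_MS)
new_rate = slo_violation_rate(brownout_latencies, SLO_MS)
print(f"baseline violation rate: {base_rate:.3f}")
print(f"brownout violation rate: {new_rate:.3f}")
print(f"relative reduction: {relative_reduction(base_rate, new_rate):.1%}")
```

Note that a figure such as "reduces SLO violations by 90.28%" is most naturally read as a relative reduction of the violation rate, not as a drop of 90.28 percentage points.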

Discussion

In the “Discussion” section, the paper reviews optimization strategies for LLM inference and groups existing work into two areas: optimization for MoE-based LLMs and optimization for generic LLMs. Notable MoE techniques include GShard, which improves computational efficiency by distributing experts, and the Switch Transformer, which activates only one expert per token. Other methods, such as Lina and MPMoE, introduce dynamic scheduling and adaptive pipeline strategies to improve resource utilization. However, these approaches struggle with bursty workloads, which leads to service level objective (SLO) violations. The authors position their own contributions, the “united experts” concept and the “brownout” approach, as a way to optimize performance dynamically and maintain reliability under fluctuating demand.

For generic LLMs, the paper discusses techniques such as vLLM’s PagedAttention for efficient key-value cache management and LoongServe’s Elastic Sequence Parallelism for better resource utilization. The authors note that these methods are complementary to BrownoutServe and can be integrated with it, so that existing optimizations are combined with the new strategies for handling bursty workloads. The section concludes by restating the challenges posed by MoE architectures, such as computational pressure from large batch sizes and SLO violations during peak demand, and outlines how the proposed mechanisms mitigate them to improve inference efficiency and overall system performance.
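The idea behind paged key-value cache management can be illustrated with a toy block table. This is a simplified sketch of the general paging concept, not vLLM's implementation: it only tracks which fixed-size physical block holds each token's slot, and the class name `PagedKVCache` and its methods are hypothetical.

```python
class PagedKVCache:
    """Toy block allocator: sequences grow in fixed-size blocks, not one big buffer."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical_block, offset)."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:
            # No block yet, or the last block is full: allocate a new one.
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    block, offset = cache.append_token("request-a")
    print(f"token stored in block {block}, offset {offset}")
cache.free("request-a")
```

Because blocks are allocated on demand and returned when a request finishes, memory is not reserved up front for the maximum sequence length, which is the property that matters when request volume is bursty.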