RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference

Journal: Proceedings of the VLDB Endowment, Volume: 19, Issue: 5
DOI: https://doi.org/10.14778/3796195.3796212
Publication Date: 2026-01-01
Author(s): Yaoqi Chen et al.
Primary Topic: Natural Language Processing Techniques

Overview

The authors address the challenges that large language models (LLMs) face as their context windows expand, in particular the limits on inference throughput imposed by growing GPU memory and bandwidth requirements. The key-value (KV) cache, which stores the key and value vectors of previous tokens, grows linearly with context length, and every decoding step must scan it in full to compute attention. To speed up long-context inference, the authors propose RetroInfer, a vector storage engine that exploits the inherent sparsity of attention by offloading the KV cache to CPU memory and retrieving only a small subset of relevant tokens.
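
The sparsity that this design relies on can be illustrated with a toy calculation. The sketch below is not taken from the paper: the scores are hypothetical, and `topk_mass` is an illustrative helper that reports how much of the softmax weight falls on the k highest-scoring tokens.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def topk_mass(scores, k):
    """Fraction of softmax attention weight held by the k highest-scoring tokens."""
    weights = sorted(softmax(scores), reverse=True)
    return sum(weights[:k])

# Hypothetical scores: 4 strongly relevant tokens among 996 weakly relevant ones.
scores = [8.0] * 4 + [0.0] * 996
print(f"top-4 of 1000 tokens hold {topk_mass(scores, 4):.1%} of the attention weight")
```

When a few tokens carry most of the weight, as in this contrived input, attending to them alone changes the output very little. Real attention distributions vary by layer, head, and query, which is why finding the important tokens is the hard part.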

RetroInfer introduces an Attention-aWare VEctor index (wave index) that balances attention accuracy against retrieval cost through three techniques: tripartite attention approximation, accuracy-bound attention estimation, and segmented clustering. Alongside it, the wave buffer acts as a GPU-CPU buffer manager that coordinates computation and data placement across heterogeneous hardware. In the evaluation, RetroInfer achieves up to 4.4× higher decoding throughput than full attention at a 120K-token context and up to 12.2× higher throughput than sparse attention baselines at 1 million tokens, while matching the accuracy of full attention. The authors conclude that RetroInfer improves inference speed for long-context LLMs while preserving accuracy.

Introduction

The introduction discusses the advances and challenges associated with transformer-based large language models (LLMs), which rely on attention mechanisms. Context windows have grown substantially and now reach up to 10 million tokens in leading models such as Gemini Pro and Llama 4. This growth strains GPU resources, particularly memory capacity and bandwidth during inference. The key-value (KV) cache, which is essential for accelerating attention, grows linearly with sequence length and can require up to 125GB for a single 1M-token request with Llama3-8B. The paper notes that sparsity-based KV caches can ease this pressure by focusing on the small subset of tokens that dominate the attention output, but stresses that identifying those important tokens accurately is difficult.
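
The linear growth can be checked with back-of-the-envelope arithmetic. The sketch below assumes the publicly documented Llama-3-8B configuration (32 layers, 8 KV heads under grouped-query attention, head dimension 128) and FP16 storage; none of these values are stated in the summary above, and `kv_cache_bytes` is an illustrative helper.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value):
    # Factor 2: one key vector and one value vector per token, layer, and KV head.
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_value

# Assumed configuration: 32 layers, 8 KV heads, head dim 128, 2 bytes per value (FP16).
per_token = kv_cache_bytes(1, 32, 8, 128, 2)
total = kv_cache_bytes(1_024_000, 32, 8, 128, 2)

print(per_token // 1024, "KiB per token")
print(total / 2**30, "GiB for 1,024,000 tokens")
```

Under these assumptions each token costs 128 KiB, and 1,024,000 tokens come to 125 GiB. That matches the quoted 125GB only if "1M" is read as 1,000 × 1,024 tokens and "GB" as GiB, which is a guess about the authors' convention; with exactly 1,000,000 tokens the same formula gives about 122 GiB.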

To address these challenges, the authors propose RetroInfer, a vector storage engine designed to improve the performance of sparsity-based KV cache systems. RetroInfer has two central components: the Attention-aWare VEctor index (wave index) and an accuracy-agnostic GPU-CPU buffer manager (wave buffer). The wave index uses techniques such as tripartite attention approximation and accuracy-bound attention estimation to balance attention accuracy against retrieval cost. The wave buffer manages data movement and computation across CPU and GPU resources, exploiting the temporal locality of token access to reduce PCIe bandwidth contention. The evaluation shows gains in inference efficiency with preserved accuracy across a range of models and tasks, with up to 4.4× the throughput of full attention and up to 12.2× that of other sparse-attention systems. The contributions of the work are the wave index, the wave buffer, and the end-to-end implementation of RetroInfer for long-context inference.
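
To see why temporal locality reduces PCIe traffic, consider a small GPU-side cache of KV blocks. The sketch below is a toy model rather than the wave buffer itself: the `ClusterCache` class, the LRU eviction policy, and the access trace are all assumptions made for illustration, since the summary does not state the replacement policy.

```python
from collections import OrderedDict

class ClusterCache:
    """Toy GPU-side cache of KV cluster blocks with LRU eviction (assumed policy)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # cluster_id -> placeholder for a KV block
        self.hits = 0
        self.misses = 0

    def access(self, cluster_id):
        if cluster_id in self.blocks:
            self.blocks.move_to_end(cluster_id)  # mark as most recently used
            self.hits += 1
            return
        self.misses += 1  # a miss stands in for one CPU-to-GPU transfer over PCIe
        self.blocks[cluster_id] = object()
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used block

# Hypothetical trace: consecutive decoding steps retrieve mostly the same clusters.
trace = [[1, 2, 3], [1, 2, 4], [1, 2, 4], [2, 4, 5]]
cache = ClusterCache(capacity=4)
for step in trace:
    for cluster_id in step:
        cache.access(cluster_id)

print(f"hits={cache.hits} misses={cache.misses}")
```

Only misses cross the PCIe link in this model, so the more the retrieved sets overlap between steps, the less data has to move.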

Methods

Experiments were conducted on a virtual machine (VM) server equipped with NVIDIA A100 GPUs, each with 80GB of memory, and an AMD EPYC 7V12 CPU with 1.7TB of CPU memory. The server has four Non-Uniform Memory Access (NUMA) nodes with 12 cores each. The GPU and CPU are connected via PCIe 4.0 (×16), which provides a unidirectional bandwidth of 32GB/s; this link bounds how quickly KV data offloaded to CPU memory can be brought back to the GPU.
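
A rough calculation shows why the 32GB/s link matters once the KV cache lives in CPU memory. The sketch below uses the bandwidth quoted above; the per-token footprint (an assumed Llama-3-8B FP16 layout) and the 20,000-token retrieval budget are hypothetical values chosen for illustration, and real transfers would not reach the full nominal bandwidth.

```python
def transfer_seconds(num_bytes, bandwidth_bytes_per_s):
    """Ideal transfer time, ignoring latency and protocol overhead."""
    return num_bytes / bandwidth_bytes_per_s

PCIE_BANDWIDTH = 32e9          # 32GB/s unidirectional, from the setup description
KV_BYTES_PER_TOKEN = 131_072   # assumed: 2 (K and V) x 32 layers x 8 KV heads x 128 dims x 2 bytes

context_tokens = 1_000_000
retrieved_tokens = 20_000      # hypothetical sparse retrieval budget (2% of the context)

full = transfer_seconds(context_tokens * KV_BYTES_PER_TOKEN, PCIE_BANDWIDTH)
sparse = transfer_seconds(retrieved_tokens * KV_BYTES_PER_TOKEN, PCIE_BANDWIDTH)

print(f"full KV cache per decoding step: {full:.3f} s")
print(f"retrieved subset per decoding step: {sparse:.3f} s")
```

Moving the whole cache for every generated token would take seconds per step under these assumptions, so an offloaded KV cache is only practical if each step transfers a small fraction of it.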

The server runs Ubuntu 22.04 with CUDA 12.4 and PyTorch 2.5.

Discussion

The discussion section highlights the challenges and innovations surrounding transformer-based large language models (LLMs), particularly for long-context inference. Transformer architectures stack multiple attention layers in which input tokens are projected into queries, keys, and values, allowing the model to relate tokens to one another through multiple attention heads. Decoding efficiency depends heavily on the key-value (KV) cache, which stores previously computed key and value vectors so that they are not recomputed at every step. The cache, however, imposes large memory and bandwidth demands that grow with context length and batch size, and the resulting limits on GPU memory capacity and bandwidth constrain the scalability of long-context inference.
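
The role of the KV cache during decoding can be sketched for a single attention head. The code below is a minimal illustration in plain Python, not the paper's implementation: `attend` and `KVCache` are illustrative names, and the projections that would produce the query, key, and value vectors are assumed to have been applied already.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for one query over cached keys and values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def decode_step(self, query, key, value):
        # Append only the new token's key and value; earlier ones are reused as stored.
        self.keys.append(key)
        self.values.append(value)
        # Full attention still reads every cached entry at every step.
        return attend(query, self.keys, self.values)

cache = KVCache()
out1 = cache.decode_step([1.0, 0.0], [1.0, 0.0], [1.0, 2.0])
out2 = cache.decode_step([0.0, 1.0], [0.0, 1.0], [3.0, 4.0])
print(out1, out2)
```

The cache removes recomputation, but both its size and the work done in `attend` grow with every token, which is the memory and bandwidth pressure described above.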

To address these limitations, the paper introduces RetroInfer, which employs a sparsity-based KV cache that exploits the inherent sparsity of attention. By accessing only the most relevant KV vectors, RetroInfer aims to relieve the memory and bandwidth bottlenecks. The design incorporates a wave index that organizes KV vectors into clusters based on similarity, so that relevant vectors can be retrieved efficiently and attention scores computed over them. The method raises throughput while maintaining accuracy through a tripartite attention approximation that divides tokens into steady, retrieval, and estimation zones. The paper emphasizes that careful system design is needed to balance accuracy against efficiency, particularly given the cost of GPU-CPU interactions and the way token importance shifts during inference.
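
One way to picture the three zones is the following simplified sketch. It rests on assumptions the summary does not spell out: steady-zone tokens are always attended exactly, the clusters whose centroids score highest against the query form the retrieval zone and are attended exactly, and each remaining cluster is estimated by treating all of its tokens as if their key were the cluster centroid. The function names and the toy data are illustrative, and this should not be read as the authors' algorithm.

```python
import math

def score(query, key):
    # Scaled dot product, as in standard attention.
    return sum(q * k for q, k in zip(query, key)) / math.sqrt(len(query))

def mean(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def tripartite_attention(query, steady, clusters, num_retrieved):
    """steady: list of (key, value) pairs; clusters: list of lists of (key, value) pairs."""
    centroids = [mean([k for k, _ in cluster]) for cluster in clusters]
    ranked = sorted(range(len(clusters)),
                    key=lambda i: score(query, centroids[i]), reverse=True)
    retrieved = set(ranked[:num_retrieved])

    # Each term is (score, token count, value vector) so one loop covers all zones.
    terms = [(score(query, k), 1, v) for k, v in steady]           # steady zone
    for i, cluster in enumerate(clusters):
        if i in retrieved:                                         # retrieval zone: exact
            terms += [(score(query, k), 1, v) for k, v in cluster]
        else:                                                      # estimation zone: centroid only
            terms.append((score(query, centroids[i]), len(cluster),
                          mean([v for _, v in cluster])))

    m = max(s for s, _, _ in terms)  # subtract the max for numerical stability
    dim = len(terms[0][2])
    denom = sum(n * math.exp(s - m) for s, n, _ in terms)
    return [sum(n * math.exp(s - m) * v[i] for s, n, v in terms) / denom
            for i in range(dim)]

query = [1.0, 1.0]
steady = [([1.0, 0.0], [1.0, 0.0])]
clusters = [
    [([0.0, 1.0], [0.0, 1.0]), ([0.0, 1.0], [0.0, 3.0])],  # identical keys
    [([2.0, 0.0], [5.0, 5.0])],
]
exact = tripartite_attention(query, steady, clusters, num_retrieved=2)
approx = tripartite_attention(query, steady, clusters, num_retrieved=0)
print(exact, approx)
```

In this toy input every cluster contains identical keys, so the centroid estimate reproduces exact attention. With real keys the estimate is only approximate, and the number of retrieved clusters controls the trade-off between accuracy and the amount of KV data that has to be fetched.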