Efficient self-attention with smart pruning for sustainable large language models

Journal: Scientific Reports, Volume: 15, Issue: 1
DOI: https://doi.org/10.1038/s41598-025-92586-5
PMID: https://pubmed.ncbi.nlm.nih.gov/40128247
Publication Date: 2025-03-24
Author(s): Samir Brahim Belhaouari et al.
Primary Topic: Topic Modeling

Overview

The paper presents a compression approach for large language models (LLMs) that aims to reduce their substantial computational demands and environmental impact. The authors focus on compressing the internal transformer layers, which contribute significantly to model complexity. The proposed method uses Forward Propagation Pruning (FPP) to compress the embedding and feed-forward layers by zeroing and freezing unused parameters, which reduces the number of trainable parameters and accelerates training. In addition, a Weight Matrix Folding technique prunes the self-attention matrices, using Identical Row Compression (IRC) and Diagonal Weight Compression (DWC) to optimize the Query, Key, and Value matrices. The combined approach is reported to achieve 99% compression of transformer layers and 70% of linear layers, for an overall model compression of approximately 70% while preserving accuracy.
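As a rough aid for reading these figures together, an overall compression rate can be written as a parameter-weighted average of per-group rates. The symbols below are introduced here for illustration and do not come from the paper: \( n_g \) is the number of parameters in layer group \( g \) and \( r_g \) is the fraction of that group removed.

\[
r_{\text{total}} = \frac{\sum_{g} n_g \, r_g}{\sum_{g} n_g}
\]

Under the simplifying assumption that the model consisted of only two groups, one compressed at \( r = 0.99 \) and one at \( r = 0.70 \), an overall rate close to 0.70 would mean that the group compressed at 0.70 holds most of the parameters.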

The findings indicate that moderate compression rates (20% to 40%) can improve model performance while also significantly reducing memory usage and computational requirements, which makes LLMs more resource-efficient. An evaluation across 11 publicly available datasets demonstrates the effectiveness of the compression method, although the authors note that the impact of a given compression rate varies by task. Future research will explore how well the approach applies to diverse pre-trained models and will further investigate the trade-offs associated with different compression rates. The study also emphasizes that high-quality training data is important for achieving the best efficiency and accuracy in model compression.

Introduction

The introduction outlines the foundational architecture of large language models (LLMs), which are predominantly built on transformers. Transformers use self-attention mechanisms to capture relationships between the words in a sequence. Because self-attention processes all positions of a sequence in parallel, transformers differ from traditional sequential models and handle long text sequences far more efficiently.

The key components of an LLM are its embedding, self-attention, and feed-forward layers, each of which contributes to the model's performance and functionality. The emphasis on the transformer architecture highlights its pivotal role in advancing natural language processing, enabling LLMs to achieve superior results in understanding and generating human language.

Methods

In this section, the authors describe a compression methodology for large language models (LLMs) that addresses the significant memory and computational demands of their deep neural network architectures. The methodology employs two primary strategies: Sequential compression and Recursive compression. The Sequential strategy compresses the model layer by layer, starting with the embedding layer and progressing through the self-attention and feed-forward layers. It uses Forward Propagation Pruning and WeightMatrixFold, the latter optimizing the weight matrices of the Query, Key, and Value components, so that each layer benefits from the optimizations already applied to the layers before it. The compression function \( C(W_i) \) is applied layer by layer, so that each layer's compressed weights \( \hat{W}_i \) are informed by the compressed weights of the preceding layers.
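The layer-by-layer idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: `compress_layer` is a hypothetical stand-in for \( C(W_i) \) that scores each weight by an activation-weighted magnitude, and the names `sequential_compress`, `calib`, and `rate` are assumptions made for the example. The point it shows is that the input used to compress layer \( i \) has already passed through the compressed versions of layers \( 1, \dots, i-1 \).

```python
import numpy as np

def compress_layer(w, x, rate):
    """Hypothetical C(W_i): zero the fraction `rate` of weights with the
    lowest activation-weighted magnitude |w[j, k]| * mean|x[:, j]|."""
    score = np.abs(w) * np.abs(x).mean(axis=0)[:, None]
    k = int(round(rate * w.size))
    w_hat = w.copy()
    if k > 0:
        idx = np.argsort(score, axis=None)[:k]  # flat indices of the k lowest scores
        w_hat.flat[idx] = 0.0
    return w_hat

def sequential_compress(layers, x, rate):
    """Compress layers in order; x is propagated through the compressed weights."""
    compressed = []
    for w in layers:
        w_hat = compress_layer(w, x, rate)
        compressed.append(w_hat)
        # Forward pass (ReLU) with the compressed weights, so the next
        # layer is compressed against the activations it will really see.
        x = np.maximum(x @ w_hat, 0.0)
    return compressed

rng = np.random.default_rng(0)
layers = [rng.normal(size=(8, 8)) for _ in range(3)]  # toy 3-layer stack
calib = rng.normal(size=(16, 8))                      # toy calibration batch
compressed = sequential_compress(layers, calib, rate=0.5)
sparsity = [float((w == 0).mean()) for w in compressed]
```

Each entry of `sparsity` reports the fraction of zeroed weights in one layer. Swapping `compress_layer` for another rule leaves the sequential structure unchanged.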

By contrast, the Recursive strategy compresses the weight matrices gradually, increasing the compression rate in small increments and evaluating model performance after each iteration. This allows fine-grained adjustment and helps maintain model accuracy, because performance is checked against a predefined threshold at every step. By combining the two strategies, the authors aim to balance model size against accuracy, improving computational efficiency and memory usage without significant degradation in performance. The effectiveness of these methods is illustrated in the accompanying figures, which show their impact on transformer architectures.
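A minimal sketch of such a rate-stepping loop is given below, assuming plain magnitude pruning as the compression rule and a toy output-fidelity score as the performance measure. Both are illustrative substitutes, and `recursive_compress`, `evaluate`, `step`, and `max_rate` are hypothetical names rather than anything defined in the paper.

```python
import numpy as np

def prune(w, rate):
    """Zero the fraction `rate` of smallest-magnitude entries (illustrative rule)."""
    k = int(round(rate * w.size))
    w_hat = w.copy()
    if k > 0:
        w_hat.flat[np.argsort(np.abs(w), axis=None)[:k]] = 0.0
    return w_hat

def recursive_compress(w, evaluate, threshold, step=0.1, max_rate=0.9):
    """Raise the rate step by step; keep the last candidate scoring >= threshold."""
    best_w, best_rate = w.copy(), 0.0
    for i in range(1, int(round(max_rate / step)) + 1):
        rate = i * step
        candidate = prune(w, rate)
        if evaluate(candidate) < threshold:
            break  # performance fell below the threshold: stop increasing the rate
        best_w, best_rate = candidate, rate
    return best_w, best_rate

rng = np.random.default_rng(0)
w = rng.normal(size=(10, 10))
x = rng.normal(size=(32, 10))
reference = x @ w

def evaluate(w_hat):
    """Toy score in place of task accuracy: 1 minus relative output error."""
    return 1.0 - np.linalg.norm(reference - x @ w_hat) / np.linalg.norm(reference)

best_w, best_rate = recursive_compress(w, evaluate, threshold=0.8)
```

In practice `evaluate` would be a validation metric on the task at hand, and the loop could fine-tune the model between steps before scoring it.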

Results

The results of the study demonstrate the effectiveness of the proposed compression methods on model weights, loss, accuracy, and overall performance. Notably, the Identical Row Compression (IRC) and Diagonal Weight Compression (DWC) techniques significantly alter the weight distributions of the Query, Key, and Value matrices, leading to reduced variability and increased sparsity. For instance, after applying IRC, the Query and Key matrices exhibit a flattened distribution, while DWC transforms the Value matrix into a diagonal structure, retaining only essential weights. This compression not only simplifies the model architecture but also enhances inference speed, with the model achieving peak accuracy of approximately 0.96 at 99% compression for the combined Query and Key matrices.
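To make the two structures concrete, the sketch below builds a matrix with identical rows and a diagonal matrix from random weights. It rests on assumptions that the summary does not settle: the shared row is taken to be the mean row, the matrices are square, and the function names `irc` and `dwc` are only labels for this example.

```python
import numpy as np

def irc(w):
    """Identical-row sketch: keep one row (the mean row, an assumption) and replicate it."""
    row = w.mean(axis=0)
    return np.tile(row, (w.shape[0], 1)), row

def dwc(w):
    """Diagonal sketch: keep only the main diagonal of a square matrix."""
    diag = np.diag(w).copy()
    return np.diag(diag), diag

d = 100  # toy dimension chosen for the example
rng = np.random.default_rng(0)
w_q = rng.normal(size=(d, d))
w_v = rng.normal(size=(d, d))

w_q_hat, q_row = irc(w_q)    # every row of w_q_hat equals q_row
w_v_hat, v_diag = dwc(w_v)   # off-diagonal entries of w_v_hat are zero

# Only d values need to be stored instead of d * d.
stored_fraction = q_row.size / w_q.size
```

With this toy dimension, either structure stores `d` values in place of `d * d`, so the stored fraction scales as `1 / d`. This illustrates why such structures compress aggressively; it is not a derivation of the figures reported in the paper.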

Further analysis shows that although compression generally improves performance, excessive compression reduces accuracy because critical information is lost. The study reports that moderate compression rates (up to 70%) give better efficiency and performance, with the model maintaining a high accuracy of around 0.94. The recursive compression algorithm adjusts the compression rate dynamically based on performance evaluations, which allows it to balance compression against accuracy. Overall, the proposed methods substantially improve model efficiency, achieving compression rates of up to 70.69% while preserving or improving performance across various NLP tasks, including language modeling and text classification.

Discussion

In the discussion section of the paper, the authors explore various methods for compressing large language models (LLMs), highlighting the significance of network pruning, knowledge distillation, and quantization. They note that while techniques such as LLM-Pruner and FLAP focus on reducing network width through structured pruning, others like Sheared-LLaMA adopt a more holistic approach by eliminating entire layers. The authors emphasize the need for comparative analyses of these methods to understand their impacts on inference efficiency, particularly given the unique challenges posed by LLMs’ extensive parameters and computational demands.

The paper introduces an approach that combines Identical Row Compression (IRC) and Diagonal Weight Compression (DWC) to make the self-attention mechanism in transformers more efficient. By simplifying the Query (Q) and Key (K) matrices through row replication, the authors significantly reduce computational complexity while maintaining model performance. DWC, in turn, reduces the memory footprint and computational load of the Value (V) matrix by retaining only its diagonal elements. The proposed Forward Propagation Pruning (FPP) method further optimizes the linear layers within transformers by iteratively pruning weights according to their significance, which improves computational efficiency and reduces memory usage. Overall, the authors present a comprehensive strategy that addresses the limitations of existing pruning methods while improving both model performance and hardware efficiency.
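The zero-and-freeze behavior attributed to FPP can be sketched with a masked linear layer. This is an illustration under stated assumptions, not the published algorithm: significance is approximated here by an activation-weighted magnitude computed on a forward batch, and the class `MaskedLinear` and its methods are hypothetical names.

```python
import numpy as np

class MaskedLinear:
    """Linear layer whose pruned weights are zeroed and excluded from updates."""

    def __init__(self, w):
        self.w = w.copy()
        self.mask = np.ones_like(w, dtype=bool)  # True = still trainable

    def forward(self, x):
        return x @ self.w

    def prune(self, x, rate):
        """Zero and freeze the fraction `rate` of active weights that
        contribute least on batch x (assumed significance score)."""
        score = np.abs(self.w) * np.abs(x).mean(axis=0)[:, None]
        score[~self.mask] = np.inf  # already-pruned weights are not candidates
        k = int(round(rate * int(self.mask.sum())))
        if k > 0:
            idx = np.argsort(score, axis=None)[:k]
            self.mask.flat[idx] = False
            self.w[~self.mask] = 0.0

    def sgd_step(self, grad, lr=0.1):
        """Update only unfrozen weights; pruned entries stay exactly zero."""
        self.w -= lr * grad * self.mask

rng = np.random.default_rng(0)
layer = MaskedLinear(rng.normal(size=(8, 8)))
x = rng.normal(size=(16, 8))

layer.prune(x, rate=0.5)            # first pruning round: 64 -> 32 active weights
layer.sgd_step(np.ones((8, 8)))     # a dummy gradient step
layer.prune(x, rate=0.5)            # second round: 32 -> 16 active weights
active = int(layer.mask.sum())
```

Because the mask is applied to the gradient, the number of trainable parameters shrinks with every pruning round, which is the effect the summary associates with faster training.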