Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

Journal: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
DOI: https://doi.org/10.18653/v1/2025.acl-long.127
Publication Date: 2025-01-01
Author(s): Benjamin C. Warner et al.
Primary Topic: Video Analysis and Summarization

Overview

The paper presents ModernBERT, an encoder-only transformer model that substantially improves on earlier encoders, particularly BERT, in retrieval and classification tasks. Trained on 2 trillion tokens with a maximum sequence length of 8192, ModernBERT achieves state-of-the-art results across a range of benchmarks. On the GLUE benchmark, ModernBERT-base surpasses DeBERTaV3-base, the first model to do so since DeBERTaV3's release in 2021. The model incorporates modern architectural improvements such as GeGLU layers, RoPE positional embeddings, and alternating local-global attention, which contribute to its efficiency and effectiveness.
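
As a rough illustration of the GeGLU layers mentioned above, the following sketch computes GeGLU(x) = GELU(W_gate x) * (W_value x) elementwise in plain Python. The function names, the nested-list weight format, and the absence of bias terms are assumptions made for illustration; this is not the paper's implementation.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def geglu(x, w_gate, w_value):
    """Gated linear unit with a GELU gate (hypothetical helper, no biases).

    One projection of the input is passed through GELU and used to gate,
    elementwise, a second linear projection of the same input.
    """
    gate = [gelu(g) for g in matvec(w_gate, x)]
    value = matvec(w_value, x)
    return [g * v for g, v in zip(gate, value)]


if __name__ == "__main__":
    x = [1.0, 2.0]
    w_gate = [[0.0, 0.0], [1.0, 0.0]]   # toy weights, chosen by hand
    w_value = [[1.0, 1.0], [1.0, 1.0]]
    print(geglu(x, w_gate, w_value))
```

In a transformer feed-forward block, the gated output would then pass through a further output projection; that step is omitted here to keep the sketch short.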

In addition to its accuracy, ModernBERT is designed for efficient inference on common GPUs, with high speed and low memory use. It leads in both short- and long-context retrieval, outperforming the nearest competitors by 6.85 and 9.1 percentage points, respectively. It processes short-context inputs twice as fast as DeBERTaV3 and long-context inputs twice as fast as the next-fastest model, setting a new reference point for encoder inference efficiency. Overall, ModernBERT represents a substantial advance in encoder models, particularly for long-context and code-related applications.

Introduction

The introduction of the paper discusses the evolution and current state of encoder-only transformer models in Natural Language Processing (NLP), particularly following the advent of BERT (Devlin et al., 2019). Despite the rise of Large Language Models (LLMs) like GPT and Llama, encoder-only models continue to be favored for various non-generative applications due to their efficient processing capabilities and favorable trade-off between quality and size. These models excel in Information Retrieval (IR) tasks, such as semantic search, and are often integrated into Retrieval-Augmented Generation (RAG) pipelines to enhance LLM performance by providing relevant context.

The authors highlight the limitations of existing encoder-only models, which often rely on older architectures like BERT, resulting in constraints such as a maximum sequence length of 512 tokens and designs poorly suited to modern applications. To address these issues, the paper introduces ModernBERT, a modernized encoder-only transformer model that improves downstream performance and efficiency, particularly for longer sequences. ModernBERT is trained on 2 trillion tokens, including code data, and comes in two variants, ModernBERT-base and ModernBERT-large. Both achieve state-of-the-art performance across various tasks with improved inference efficiency, processing sequences of 8192 tokens nearly twice as fast as prior models. Additionally, the authors provide FlexBERT, a modular architecture framework to facilitate further research on encoder-only models.

Methods

The Methods section describes how ModernBERT is designed, trained, and evaluated. On the architecture side, it covers rotary positional embeddings (RoPE), a pre-normalization block, GeGLU activations, and alternating global and local attention, together with efficiency measures such as unpadding and Flash Attention.

On the training side, it covers a modified Masked Language Modeling (MLM) objective, a trapezoidal learning rate schedule, and batch size warmup, applied over 2 trillion tokens of English text and code with sequences of up to 8192 tokens. The resulting base and large models are evaluated on natural language understanding (GLUE), short- and long-context retrieval in single-vector (DPR) and multi-vector (ColBERT) settings, code-related tasks, and inference efficiency.

Results

The results, summarized in Table 1, show ModernBERT performing strongly across the evaluation tasks. ModernBERT outperforms both the original BERT and RoBERTa, a Pareto improvement across all evaluated categories. In short-context retrieval on the BEIR benchmark, ModernBERT outperforms existing encoders, including GTE-en-MLM and NomicBERT, particularly in the DPR setting. ModernBERT-base only narrowly surpasses GTE-en-MLM-base, but ModernBERT-large outperforms its counterpart by a wider margin despite having fewer parameters. In long-context retrieval, ModernBERT does best in the multi-vector ColBERT setting, where it outperforms other long-context models by a substantial margin, suggesting that it processes long sequences effectively.
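
The two retrieval settings named above differ in how a query is scored against a document. The sketch below contrasts them under the standard formulations: a DPR-style score is a single dot product between one query vector and one document vector, while a ColBERT-style score sums, over the query's token vectors, each token's best match among the document's token vectors (MaxSim). The function names and toy vectors are illustrative assumptions, not taken from the paper's evaluation code.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def single_vector_score(query_vec, doc_vec):
    """DPR-style scoring: one vector per text, one dot product per pair."""
    return dot(query_vec, doc_vec)


def maxsim_score(query_token_vecs, doc_token_vecs):
    """ColBERT-style late interaction (MaxSim).

    For every query token, keep the similarity of its best-matching
    document token, then sum those maxima over the query tokens.
    """
    return sum(
        max(dot(q, d) for d in doc_token_vecs)
        for q in query_token_vecs
    )


if __name__ == "__main__":
    # Toy embeddings; real models produce hundreds of dimensions.
    print(single_vector_score([1.0, 1.0], [2.0, 3.0]))
    print(maxsim_score([[1.0, 0.0], [0.0, 1.0]],
                       [[1.0, 0.0], [0.5, 0.5]]))
```

Because the multi-vector score keeps one vector per token, it depends directly on how well the encoder represents every position of a long document, which is one plausible reading of why long-context quality shows up most clearly in that setting.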

On natural language understanding, ModernBERT achieves strong results on the GLUE benchmark: the base model surpasses all existing base models, including DeBERTaV3-base, and the large model closely follows DeBERTaV3-large while processing inputs faster. ModernBERT also leads on code-related tasks, outperforming all other models in both the code-to-text and hybrid settings, a result attributed to training data that includes code. Efficiency measurements indicate that ModernBERT is the most efficient model overall: it processes tokens at significantly higher rates than competitors, particularly on long-context inputs, and uses memory more efficiently. Together, these findings show ModernBERT advancing both performance and efficiency across a range of tasks.

Discussion

The discussion section outlines the architectural and efficiency improvements behind ModernBERT. Key modifications include rotary positional embeddings (RoPE) for better handling of long contexts, a pre-normalization block to stabilize training, and the GeGLU activation function to improve performance. The architecture is designed to be deep and narrow, balancing parameter allocation against efficient inference on a range of GPUs. Notably, ModernBERT alternates between global attention layers, in which every token attends to the whole sequence, and local attention layers, in which each token attends only to a window of nearby tokens, which reduces computational cost.
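
The alternating attention pattern can be sketched as a per-layer attention mask. The example below assumes a global layer every third layer and a symmetric 128-token sliding window on the remaining layers. Those two values, the choice of which layer index is global, and the function names are assumptions for illustration and should be checked against the paper or a released configuration.

```python
def is_global_layer(layer_idx: int, global_every: int = 3) -> bool:
    """Assumed rule: layers 0, 3, 6, ... use global attention."""
    return layer_idx % global_every == 0


def attention_mask(seq_len: int, layer_idx: int,
                   window: int = 128, global_every: int = 3):
    """Return mask[i][j] == True when token i may attend to token j.

    Global layers allow every pair. Local layers allow only pairs whose
    positions differ by at most window // 2 (bidirectional, since this
    is an encoder and not a causal decoder).
    """
    if is_global_layer(layer_idx, global_every):
        return [[True] * seq_len for _ in range(seq_len)]
    half = window // 2
    return [[abs(i - j) <= half for j in range(seq_len)]
            for i in range(seq_len)]


def allowed_pairs(mask) -> int:
    """Count attended (query, key) pairs, a proxy for attention cost."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    for layer in range(4):
        mask = attention_mask(seq_len=1024, layer_idx=layer)
        kind = "global" if is_global_layer(layer) else "local"
        print(layer, kind, allowed_pairs(mask))
```

The count of allowed pairs grows quadratically with sequence length on global layers but only linearly on local layers, which is where the efficiency gain on long inputs comes from.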

Additionally, the paper highlights unpadding, which removes padding tokens so that no computation is wasted on them during training and inference. The paper reports that this approach, combined with Flash Attention for memory-efficient attention, performs 10-20% better than previous unpadding methods. The training regimen is described in detail and includes a modified Masked Language Modeling (MLM) objective, a trapezoidal learning rate schedule (warmup, a constant plateau, then decay), and a batch size warmup to improve training efficiency. Overall, ModernBERT performs strongly across natural language understanding tasks, retrieval scenarios, and code-related benchmarks, on both short- and long-context inputs.
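
Two ideas from this paragraph lend themselves to a short sketch: unpadding and the trapezoidal learning rate schedule. The code below is a simplified illustration under stated assumptions. `unpad` flattens a padded batch into one sequence plus cumulative boundaries, and `trapezoidal_lr` uses linear warmup and linear decay. The function names, the pad id of 0, and the linear decay shape are assumptions; the paper's actual decay shape and implementation may differ.

```python
def unpad(batch, pad_id=0):
    """Concatenate the real tokens of a padded batch into one flat list.

    Returns (flat_tokens, cu_seqlens). cu_seqlens holds cumulative
    sequence boundaries, so sequence k occupies
    flat_tokens[cu_seqlens[k]:cu_seqlens[k + 1]]. Assumes pad_id never
    appears as a real token.
    """
    flat, cu_seqlens = [], [0]
    for seq in batch:
        flat.extend(t for t in seq if t != pad_id)
        cu_seqlens.append(len(flat))
    return flat, cu_seqlens


def trapezoidal_lr(step, total_steps, peak_lr, warmup_steps, decay_steps):
    """Warmup, constant plateau, then decay to zero (trapezoid shape).

    Linear warmup and linear decay are used here for simplicity.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return peak_lr
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / decay_steps


if __name__ == "__main__":
    batch = [[5, 6, 0, 0], [7, 0, 0, 0], [8, 9, 10, 0]]
    flat, cu = unpad(batch)
    padded_slots = sum(len(seq) for seq in batch)
    print(f"{len(flat)} real tokens instead of {padded_slots} padded slots")
    print(flat, cu)
    for step in (0, 9, 50, 90, 100):
        print(step, trapezoidal_lr(step, 100, 1.0, 10, 20))
```

In practice the boundaries in `cu_seqlens` are what a variable-length attention kernel would use to keep tokens from different sequences from attending to one another after they have been concatenated.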

Limitations

The main limitation of this study is its exclusive focus on English, despite the large volume of training data. Consequently, the findings may not transfer directly to other languages, particularly lower-resource ones. Future research could explore the modernization of encoder models in both multilingual (Zhang et al., 2024) and non-English monolingual contexts (Antoun et al., 2024), a promising direction for broadening the model's applicability.

Additionally, the model’s training on predominantly web-based data introduces inherent biases that may affect its representations. This reliance on web data necessitates caution in interpreting the results, as they may reflect the biases prevalent in the source material. Addressing these biases is crucial for enhancing the model’s robustness and fairness in diverse applications.