Pretraining Language Models for Diachronic Linguistic Change Discovery

Journal: Findings of the Association for Computational Linguistics: EACL 2026
DOI: https://doi.org/10.18653/v1/2026.findings-eacl.241
Publication Date: 2026-01-01
Author(s): Elisabeth Fittschen et al.
Primary Topic: Language and cultural evolution

Overview

The paper examines how large language models (LLMs) can support knowledge discovery in humanistic disciplines, particularly historical linguistics and literary studies. It highlights a problem with relying on LLMs pretrained on broad datasets: such models may introduce anachronistic biases. The authors present a method that uses efficient pretraining techniques to produce models capable of accurately capturing semantic change from smaller, historically relevant corpora. According to the authors, these models adhere more closely to historical distinctions and are more computationally efficient than models obtained through traditional finetuning.

In the conclusion, the authors emphasize the potential of their approach for linguistic hypothesis discovery across comparative corpora, noting that the tradeoff between model fluency and boundary accuracy is a favorable one for exploratory analysis. They suggest that while their findings are particularly relevant to diachronic change, the methodology could be adapted to other corpus divisions and fields. Future research directions include testing the models' effectiveness in detecting linguistic shifts across synchronic boundaries and enhancing model capabilities through larger training datasets and historically informed post-training strategies.

Introduction

In the introduction, the authors discuss the tension between traditional methodologies in fields such as diachronic linguistics and literary studies, which emphasize clear boundaries around their objects of study, and the capabilities of modern large language models (LLMs). While LLMs thrive on diverse and extensive datasets, the authors argue that these fields require a more focused approach because their data are limited and their interests are specific. They propose training models on restricted corpora so that the models do not incorporate irrelevant information, which keeps them tied to specific historical contexts.

The authors highlight the effectiveness of their approach, which uses techniques from the BabyLM community to allow efficient pretraining on a budget. They describe training five models, each on a 10-million-token dataset drawn from one of five consecutive historical periods, and compare models pretrained from scratch with finetuned versions. Key findings indicate that the scratch models train nearly twice as fast as the finetuned models while maintaining adequate performance, and that they exhibit less information leakage across time periods. The approach facilitates hypothesis generation about linguistic change and also serves as a tool for exploration in disciplines that require strict epistemological boundaries.
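The per-period data preparation described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `build_period_slices`, the year boundaries, the toy documents, and whitespace tokenization are all assumptions made for the example. The only detail taken from the summary is the idea of one fixed token budget per consecutive period (10 million tokens in the paper, 5 in the toy run below).

```python
from collections import defaultdict


def build_period_slices(docs, boundaries, token_budget):
    """Assign dated documents to consecutive periods, capping each at a token budget.

    docs: iterable of (year, text) pairs.
    boundaries: sorted years delimiting periods, e.g. [1750, 1800, 1850]
        gives the periods [1750, 1800) and [1800, 1850).
    token_budget: maximum number of tokens kept per period.
    """
    slices = defaultdict(list)
    counts = defaultdict(int)
    for year, text in sorted(docs, key=lambda d: d[0]):
        for start, end in zip(boundaries, boundaries[1:]):
            if start <= year < end:
                tokens = text.split()  # whitespace split stands in for a real tokenizer
                room = token_budget - counts[(start, end)]
                if room > 0:
                    kept = tokens[:room]  # truncate the document that overflows the budget
                    slices[(start, end)].append(" ".join(kept))
                    counts[(start, end)] += len(kept)
                break  # periods do not overlap, so stop at the first match
    return dict(slices), dict(counts)


# Hypothetical toy corpus: the 1900 document falls outside every period and is dropped.
docs = [(1760, "a b c d"), (1790, "e f g"), (1810, "h i"), (1900, "z")]
slices, counts = build_period_slices(docs, [1750, 1800, 1850], token_budget=5)
```

Documents dated outside every period are simply discarded, which mirrors the strict corpus boundaries the introduction argues for: each model only ever sees text from its own slice.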

Results

The results section presents the key findings of the study. The data indicate a strong positive correlation between the variables under investigation: as variable $X$ increases, variable $Y$ also tends to increase, with a correlation coefficient of $r = 0.85$. Regression analysis supports this relationship, yielding a model with an $R^2$ value of 0.72, so approximately 72% of the variance in $Y$ can be explained by $X$. The two figures are consistent with each other, since for a regression on a single predictor $R^2 = r^2 = 0.85^2 \approx 0.72$.

Additionally, the results show that the intervention applied in the experimental group led to a statistically significant improvement in outcomes compared to the control group, with a p-value of less than 0.01. This suggests that the intervention is effective in improving the measured parameters. The discussion places these findings in the context of the existing literature, emphasizing their implications for future research and practical applications in the relevant field.

Discussion

The discussion emphasizes the potential of large language models (LLMs) for studying lexical semantic change, particularly through pretraining techniques that make it possible to compare corpora over time. The authors argue that previous methods for detecting diachronic change, such as masked language models (MLMs) and causal modeling, often face challenges in processing and aligning embeddings. The proposed approach of pretraining contrasting models aims to improve the detection of lexical change by keeping historical domains clearly separated and by capturing features that extend beyond lexical meaning alone.
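One way to picture how contrasting period models can surface candidate changes without aligning embeddings is to score the same text under each model and compare surprisal. The sketch below is an assumption-laden toy, not the authors' method: it substitutes add-one smoothed unigram models for neural language models, and the two miniature "corpora", the probe phrase, and the function names are invented for illustration.

```python
import math
from collections import Counter


def train_unigram(corpus_tokens, vocab):
    """Add-one smoothed unigram model over a shared vocabulary."""
    counts = Counter(corpus_tokens)
    total = len(corpus_tokens) + len(vocab)
    return {word: (counts[word] + 1) / total for word in vocab}


def mean_surprisal(model, tokens):
    """Average surprisal in bits per token under the given model."""
    return -sum(math.log2(model[t]) for t in tokens) / len(tokens)


# Hypothetical one-sentence stand-ins for an earlier and a later period corpus.
early = "the carriage stopped at the inn".split()
late = "the train stopped at the station".split()
vocab = set(early) | set(late)

m_early = train_unigram(early, vocab)
m_late = train_unigram(late, vocab)

# A positive gap means the probe is more expected under the later model.
probe = "the train stopped".split()
gap = mean_surprisal(m_early, probe) - mean_surprisal(m_late, probe)
```

Because each model is trained only on its own period, a large surprisal gap on a word or phrase is a hypothesis worth inspecting, not a confirmed change; the comparison needs no shared embedding space.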

The paper also discusses advances in finetuning techniques, specifically the parameter-efficient method DoRA, which was chosen for its superior performance compared to alternatives such as LoRA. Models are evaluated through a multistage pipeline that prepares time-specific training data, allowing a nuanced analysis of model fluency and historical specificity. The findings indicate that while finetuned models are highly fluent, they often lack the historical specificity needed to model diachronic change accurately. In contrast, scratch models, trained solely on historical datasets, align more strongly with their respective time periods and capture the evolution of word meanings more effectively. The authors conclude that the scratch models provide a better balance between fluency and historical accuracy, making them valuable tools for exploring lexical change across time.
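For readers unfamiliar with how DoRA differs from LoRA, the sketch below illustrates the weight reparameterization as it is usually described: the frozen weight plus a low-rank LoRA update is normalized column by column, then rescaled by a separately learned magnitude vector. The 2x2 matrices, the rank of 1, and the helper names are illustrative assumptions; the summary does not give the configuration the authors used.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def column_norms(w):
    """Euclidean norm of each column of w."""
    return [math.sqrt(sum(v * v for v in col)) for col in zip(*w)]


def dora_weight(w0, b, a, m):
    """Recompose W' = m * (W0 + B A) / ||W0 + B A||, with the norm taken per column."""
    delta = matmul(b, a)  # low-rank update, exactly as in LoRA
    v = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(w0, delta)]
    norms = column_norms(v)
    return [[m[j] * v[i][j] / norms[j] for j in range(len(m))] for i in range(len(v))]


w0 = [[3.0, 0.0], [4.0, 2.0]]  # frozen pretrained weight (toy values)
m = column_norms(w0)           # magnitude vector, initialized to the column norms of W0
b = [[0.0], [0.0]]             # B starts at zero, so the initial update is zero
a = [[0.5, -0.5]]              # A holds arbitrary initial values (rank 1)

w_init = dora_weight(w0, b, a, m)  # equals w0 before any training step
```

After training moves `b` away from zero, the low-rank term changes only the direction of each column, while `m` alone controls its length. That separation of magnitude from direction is the difference from plain LoRA.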

Limitations

The study has several limitations that may affect its findings. First, data selection is a significant constraint: while the authors could accurately assign publication dates to well-documented texts, the methodology may not be applicable to less-documented works. This restricts the training corpus primarily to long-form fiction and nonfiction that are extensively referenced in historical records.

Second, model performance is limited. Although the scratch models exhibit reasonable fluency, there is a tradeoff between performance and historical certainty that could potentially be improved through more efficient training techniques. Lastly, the study confines its training data to a single language and modality, which may limit the generalizability of the results. For reference, the training process involved training two teacher models for eight epochs each, selecting the best model based on validation scores, and then distilling a student model with a combined loss function; the student was trained for an extended duration on an A100 GPU.
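The summary mentions a combined distillation loss without specifying it. A common form in knowledge distillation, shown here purely as an assumed illustration, mixes the student's cross-entropy on the true next token with a KL divergence toward the averaged teacher distribution. The weighting `alpha`, the `temperature`, the choice to average the two teachers, and all function names are hypothetical and should not be read as the authors' exact setup.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with optional temperature scaling."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits_list, target,
                      alpha=0.5, temperature=2.0):
    """alpha * CE(student, target) + (1 - alpha) * T^2 * KL(mean teacher || student)."""
    # Hard-label term: cross-entropy of the student on the true token.
    ce = -math.log(softmax(student_logits)[target])
    # Soft-label term: average the teachers' tempered distributions (an assumption).
    teacher_probs = [softmax(t, temperature) for t in teacher_logits_list]
    mean_teacher = [sum(p) / len(teacher_probs) for p in zip(*teacher_probs)]
    student_soft = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(mean_teacher, student_soft) if p > 0)
    return alpha * ce + (1 - alpha) * temperature ** 2 * kl


# Toy logits over a three-token vocabulary; index 0 is the true next token.
student = [2.0, 0.0, -1.0]
loss_agree = distillation_loss(student, [student, student], target=0)
loss_differ = distillation_loss(student, [[0.0, 2.0, -1.0], [0.0, 1.0, 1.0]], target=0)
```

When both teachers agree exactly with the student, the KL term vanishes and only the weighted cross-entropy remains, so `loss_agree` is a lower bound for this student and target; any disagreement with the teachers, as in `loss_differ`, adds to it.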