Rolling Memory: A New Approach to Annotation with Generative LLMs in Social and Political Research

Journal: Chinese Political Science Review
DOI: https://doi.org/10.1007/s41111-025-00327-w
Publication Date: 2026-01-07
Author(s): Joan C. Timoneda et al.
Primary Topic: Computational and Text Analysis Methods

Overview

The authors discuss the potential of generative large language models (LLMs) for text annotation, with particular emphasis on their relevance to the social sciences. They criticize the prevailing zero-shot and few-shot approaches for not fully exploiting the in-context learning capabilities of these models. Previous research indicates that letting an LLM retain memory of its earlier annotations during a task significantly improves its performance.

To address this gap, the authors propose a refinement of the memory approach previously outlined by Timoneda et al. The refinement targets how the model's memory is managed and used, with the aim of making LLMs more effective in annotation tasks. The article suggests that better memory management yields better text annotation, which could have broader implications for social science research methods.

Introduction

The introduction discusses the efficacy of memory-based approaches to text classification with generative LLMs, specifically OpenAI’s GPT-4o and Meta’s Llama-3.3-70B. The authors report a performance gain of up to 25% over few-shot learning with chain-of-thought (CoT) reasoning when these models retain memory of their previous classifications, whether or not those classifications were correct. This gain suggests that memory-based strategies can change how examples are supplied to a model and thereby improve classifier performance.

The authors attribute these improvements to in-context learning (ICL), in which an LLM adapts its behavior to examples provided in the prompt without any change to its parameters. They argue that as the model accumulates examples, its classification performance, measured by the F1-score, should rise over the course of the task and peak once the model has absorbed the nuances of the task. To improve performance further, the authors propose a two-run approach in which either the last 100 classified observations or the top-performing observations from the first run serve as initial memory for the second run. The aim is to raise both early performance and overall results in annotation tasks. The application studied is the classification of political nostalgia in party manifestos.
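The memory mechanism described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the prompt layout, the function names `build_prompt` and `annotate_with_memory`, the `memory_size` parameter, and the keyword-based `toy_classify` stand-in for an LLM call are all assumptions made for illustration.

```python
from collections import deque
from typing import Callable, Iterable, List, Optional, Sequence, Tuple

Annotation = Tuple[str, str]  # (text, label assigned by the model)


def build_prompt(instructions: str, memory: Iterable[Annotation], text: str) -> str:
    """Join task instructions, remembered annotations, and the new text."""
    remembered = "\n".join(f"Text: {t}\nLabel: {lab}" for t, lab in memory)
    return f"{instructions}\n\n{remembered}\n\nText: {text}\nLabel:"


def annotate_with_memory(
    texts: Sequence[str],
    classify: Callable[[str], str],
    instructions: str,
    seed_memory: Sequence[Annotation] = (),
    memory_size: Optional[int] = None,
) -> Tuple[List[str], List[Annotation]]:
    """Annotate texts in order, feeding earlier annotations back into each prompt.

    memory_size=None keeps every earlier annotation; an integer keeps only the
    most recent ones (a hypothetical knob, not taken from the paper).
    """
    memory = deque(seed_memory, maxlen=memory_size)
    labels: List[str] = []
    for text in texts:
        label = classify(build_prompt(instructions, memory, text))
        labels.append(label)
        # The label is stored whether or not it is correct.
        memory.append((text, label))
    return labels, list(memory)


def toy_classify(prompt: str) -> str:
    """Placeholder for an LLM call: labels the final text by keyword."""
    last_text = prompt.rsplit("Text: ", 1)[1]
    return "nostalgic" if "golden age" in last_text else "not_nostalgic"


texts = ["We will restore the golden age.", "We will build new roads."]
instructions = "Label each sentence as nostalgic or not_nostalgic."

# First run starts with empty memory.
first_labels, first_memory = annotate_with_memory(texts, toy_classify, instructions)

# Second run is seeded with (up to) the last 100 annotations of the first run.
second_labels, second_memory = annotate_with_memory(
    texts, toy_classify, instructions, seed_memory=first_memory[-100:]
)
```

In practice `toy_classify` would be replaced by a function that sends the prompt to a model such as GPT-4o or Llama-3.3-70B and returns the parsed label. The loop itself does not depend on which model is used.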

Results

The results section compares the memory approach with few-shot learning with chain-of-thought (CoT) reasoning on an LLM annotation task. Zero-shot learning yields a low average F1-score of 0.51, and few-shot CoT raises the average F1-score considerably, to 0.694. The memory approach in turn outperforms few-shot CoT by 14.15% (p ≤ 0.05) at the end of the run, reaching a peak F1-score of 0.8 compared to 0.723 for few-shot CoT. Notably, the memory approach shows a pattern of diminishing returns, and it lags behind few-shot CoT for the first 250 observations before surpassing it.

Further analysis of the two-run approach shows that memory carried over from the first run improves performance in the second run, particularly in its early stages. The last-100 memory approach yields a statistically significant improvement of 3.56% over the peak-100 approach and improves overall performance by 13.26% relative to few-shot learning with CoT. Importantly, a second run does not improve few-shot results, which supports the conclusion that the gains observed for the memory approach come from retaining informative memory from the first run rather than from repeating the task.
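To make the reported quantities concrete, the following Python sketch shows one way to track the F1-score over the course of a run and to express the difference between two scores as a percentage. The summary does not state whether the paper uses binary or averaged F1, nor which pair of scores each reported percentage is based on, so the binary F1 with a single positive class, the checkpoint `step`, and the function names are assumptions.

```python
from typing import List, Sequence, Tuple


def f1_score(y_true: Sequence, y_pred: Sequence, positive=1) -> float:
    """Binary F1 for one positive class; returns 0.0 when there are no true positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def cumulative_f1(
    y_true: Sequence, y_pred: Sequence, step: int = 50, positive=1
) -> List[Tuple[int, float]]:
    """F1 over the first n observations, evaluated every `step` observations."""
    return [
        (n, f1_score(y_true[:n], y_pred[:n], positive))
        for n in range(step, len(y_true) + 1, step)
    ]


def relative_gain(new: float, baseline: float) -> float:
    """Percentage improvement of `new` over `baseline`."""
    return (new - baseline) / baseline * 100
```

Plotting the output of `cumulative_f1` for two methods on the same observations is one way to see where one curve overtakes the other. `relative_gain(0.694, 0.51)` gives the percentage gain of the few-shot CoT average over the zero-shot average quoted above.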
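The two seeding strategies can be sketched in Python as follows. The summary does not define how the peak-100 set is chosen, so `peak_k` below rests on an assumption: it takes the window of consecutive first-run annotations that agrees most often with gold labels, using accuracy as a simple stand-in for the F1-score. Both function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Annotation = Tuple[str, str]  # (text, label assigned by the model)


def last_k(annotations: Sequence[Annotation], k: int = 100) -> List[Annotation]:
    """Seed memory with the final k annotations of the first run (k > 0)."""
    return list(annotations[-k:])


def peak_k(
    annotations: Sequence[Annotation], gold: Sequence[str], k: int = 100
) -> List[Annotation]:
    """Seed memory with the k consecutive annotations that best match gold labels.

    Assumption: "peak" is read as the best-scoring sliding window, and
    accuracy replaces F1 to keep the sketch short.
    """
    n = len(annotations)
    if n <= k:
        return list(annotations)
    correct = [int(label == g) for (_, label), g in zip(annotations, gold)]
    window = sum(correct[:k])
    best, best_start = window, 0
    for start in range(1, n - k + 1):
        # Slide the window one step: add the new right edge, drop the old left edge.
        window += correct[start + k - 1] - correct[start - 1]
        if window > best:
            best, best_start = window, start
    return list(annotations[best_start:best_start + k])
```

Either function returns a list of (text, label) pairs that can be placed in the prompt before the second run begins. Note that `peak_k` needs gold labels for the first run, whereas `last_k` does not.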

Discussion

The discussion section traces the evolution of supervised text classification from traditional bag-of-words models to Transformer-based architectures, particularly BERT and its successors. These models have demonstrated remarkable accuracy in text classification tasks, often matching or exceeding human performance. Generative LLMs, such as OpenAI’s GPT and Meta’s Llama, have changed the field further, showing promise for text annotation while also raising ethical considerations. Recent findings indicate that memory-based approaches, which retain previous annotations, significantly improve LLM performance, with gains of up to 25% over few-shot learning with chain-of-thought (CoT) reasoning.

The authors propose a two-run memory approach that builds on previous work by Timoneda and Vallejo Vera (2025b): the model retains a subset of high-information annotations from an initial run to inform a subsequent classification run. This method not only improves accuracy but also facilitates in-context learning (ICL), allowing the model to adapt to the task without parameter updates. The results indicate that performance improves as the model retains more examples, which is consistent with a learning process akin to ICL. The authors call for further research on the optimal number of retained examples and on whether memory approaches can improve consistency across tasks, which would reduce reliance on prompt engineering and improve replicability in future studies.