Scaling language model size yields diminishing returns for single-message political persuasion

Journal: Proceedings of the National Academy of Sciences, Volume: 122, Issue: 10
DOI: https://doi.org/10.1073/pnas.2413443122
PMID: https://pubmed.ncbi.nlm.nih.gov/40053360
Publication Date: 2025-03-07
Author(s): Kobi Hackenburg et al.
Primary Topic: Misinformation and Its Impacts

Overview

This research investigates how persuasive large language models (LLMs) are when generating political messages, addressing concerns about their potential influence on public opinion. The study generated 720 persuasive messages on 10 U.S. political issues using 24 LLMs of varying sizes, then ran a large-scale randomized survey experiment with 25,982 participants to measure how persuasive each model’s messages were. Two findings stand out. First, returns to model size diminish sharply: the most advanced models are only marginally more persuasive than models far smaller. Second, the association between model size and persuasiveness shrinks to insignificance once task completion metrics, such as coherence and topic relevance, are accounted for, suggesting that these factors may mediate the apparent advantage of larger models.

These findings matter given growing concern that LLMs could be used to manipulate public opinion during critical political events such as elections. Despite fears that future models will become increasingly persuasive, this research indicates that scaling model size alone may not substantially improve persuasive effectiveness. The study therefore underscores the importance of understanding the limits of LLMs in political communication and the need for frameworks to mitigate the risks of their use in influencing public attitudes.

Introduction

The introduction outlines a series of policy stances on social and political issues in the United States, spanning healthcare, employment, criminal justice, immigration, foreign policy, and electoral processes. These read as the issue stances that the study’s persuasive messages argue for, not as positions the authors endorse. They include requiring work from Medicaid recipients, permitting assisted suicide for terminally ill patients, and increasing the privatization of veterans’ healthcare, alongside opposition to affirmative action, to the disenfranchisement of convicted felons, and to a ban on solitary confinement for juveniles.

Further stances call for stricter border controls, for maintaining current levels of foreign aid spending, and for abolishing the electoral college. Together, the stances span multiple policy domains and reflect ongoing debates in American society.

Methods

This section details the study’s ethical approval, participant consent, and experimental design. The study was approved by the Oxford Internet Institute’s Departmental Research Ethics Committee and preregistered on the Open Science Framework, supporting transparency and reproducibility. All code and replication materials are available in a GitHub repository, with additional study materials in the supplementary information appendix.

The experimental design involved generating persuasive messages with 24 different models. Each model produced three messages for each of ten issue stances, for a total of 720 messages (24 × 10 × 3). A human baseline was established using ten persuasive messages previously identified by Tappin et al. (55). The issue stances were chosen to satisfy three criteria: measurable persuasion potential, moderate average attitudes among Democratic and Republican subgroups to avoid floor or ceiling effects, and a balanced representation of liberal and conservative topics. The stances were drawn from ISideWith.com and cover a range of domains, including foreign policy and immigration.
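The message count follows directly from the design grid. The sketch below enumerates that grid in Python, using only the counts stated in this summary; the model and stance labels are hypothetical placeholders, not the names used in the study.

```python
from itertools import product

# Counts taken from the summary; the labels below are made-up placeholders.
N_MODELS = 24
N_STANCES = 10
MESSAGES_PER_CELL = 3


def build_message_grid(n_models=N_MODELS, n_stances=N_STANCES,
                       per_cell=MESSAGES_PER_CELL):
    """Return one (model, stance, draw) tuple per generated message."""
    models = [f"model_{i:02d}" for i in range(n_models)]
    stances = [f"stance_{j:02d}" for j in range(n_stances)]
    return list(product(models, stances, range(per_cell)))


grid = build_message_grid()
print(len(grid))  # 720
```

Each tuple stands for one AI-generated message; the ten human baseline messages sit outside this grid.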

Results

In a preregistered survey experiment conducted from April to May 2024, researchers recruited 25,982 US adults and measured their agreement with a stance on one of ten contemporary US policy issues, spanning immigration, healthcare, employment, foreign policy, and criminal justice. Participants were randomly assigned to one of three groups: AI, human, or control. The control group reported opinions without seeing a persuasive message. Participants in the human and AI groups each read one persuasive message of 150 to 250 words, either human-authored or generated by one of 24 language models; the AI-generated pool comprised 720 unique messages.

The primary outcome was the level of agreement with the policy stance, measured with a four-item question battery. Persuasive impact was calculated as the difference in mean agreement between each treatment group (human or AI) and the control group, which allows the effectiveness of AI-generated and human-written messages to be compared.
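The difference-in-means calculation described above can be sketched as follows. This is a minimal illustration, not the study's analysis code: it assumes the four items are averaged into one composite score per participant and that agreement is recorded on a 0 to 100 scale, neither of which this summary states. All scores are invented for illustration.

```python
from statistics import mean


def composite_score(item_responses):
    """Average one participant's four item responses into one score."""
    if len(item_responses) != 4:
        raise ValueError("expected responses to exactly four items")
    return mean(item_responses)


def persuasive_effect(treated, control):
    """Mean composite agreement in a treatment group minus the control mean."""
    treated_mean = mean([composite_score(p) for p in treated])
    control_mean = mean([composite_score(p) for p in control])
    return treated_mean - control_mean


# Hypothetical responses (assumed 0-100 scale), two participants per group.
control = [[40.0, 45.0, 50.0, 45.0], [50.0, 55.0, 50.0, 45.0]]
treated = [[50.0, 55.0, 50.0, 45.0], [55.0, 60.0, 55.0, 50.0]]

effect = persuasive_effect(treated, control)
print(effect)  # 5.0
```

Under these assumptions the result reads as a shift in percentage points, the unit used for the effects reported in the Discussion.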

Discussion

The study examined the relationship between language model size and persuasiveness using a random-effects meta-analysis. On average, messages generated by language models were persuasive by 5.77 percentage points. Model size, measured as the natural logarithm of parameter count, was significantly and positively associated with persuasiveness, at 1.26 percentage points per unit increase in log size. However, the analysis reveals diminishing returns: the largest models, such as GPT-4 and Claude-3-Opus, do not significantly outperform much smaller models (e.g., those with 7 to 13 billion parameters) in persuasive impact.
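To see why a slope on log size already implies diminishing returns per added parameter, the reported coefficient can be applied to a few size changes. The sketch below uses only the 1.26 figure from this summary and assumes a simple linear-in-log relationship; the summary does not give the exact functional form fitted in the study, and the parameter counts are illustrative.

```python
import math

# Slope reported in the summary: percentage points per unit of ln(parameters).
SLOPE_PP_PER_LOG_UNIT = 1.26


def predicted_gain_pp(params_from, params_to, slope=SLOPE_PP_PER_LOG_UNIT):
    """Predicted change in persuasiveness under a linear-in-log model."""
    return slope * (math.log(params_to) - math.log(params_from))


# A tenfold increase adds the same amount wherever it starts.
gain_10x = predicted_gain_pp(7e9, 70e9)

# One extra billion parameters matters far less for a larger model.
gain_small = predicted_gain_pp(7e9, 8e9)
gain_large = predicted_gain_pp(70e9, 71e9)

print(f"{gain_10x:.2f}")        # 2.90
print(gain_small > gain_large)  # True
```

Under this assumed form, every tenfold increase in parameters buys roughly the same 2.9 percentage points, so the gain per additional parameter keeps shrinking as models grow.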

Further exploratory analyses identified task completion, defined as the coherence and relevance of the generated messages, as a potential mediator of the size-persuasiveness relationship. When task completion scores were controlled for, the association between model size and persuasiveness diminished, indicating that larger models’ advantage may stem primarily from their greater ability to produce coherent, on-topic content. This suggests that further increases in model size may yield limited additional persuasive benefit, challenging the assumption that larger models will keep becoming more persuasive. Overall, the results imply a potential ceiling on the effectiveness of language models in single-message political persuasion, and they point to the need for further research on model optimization and interaction dynamics.
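What "controlling for task completion" means can be illustrated with a small regression sketch. It is not the study's meta-regression: it uses ordinary least squares with the Frisch-Waugh-Lovell residualization trick, and the data are synthetic, constructed so that persuasiveness depends only on task completion. All variable names and values are hypothetical.

```python
from statistics import mean


def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def residualize(target, on):
    """Residuals of `target` after regressing it on `on`."""
    b = ols_slope(on, target)
    a = mean(target) - b * mean(on)
    return [t - (a + b * o) for t, o in zip(target, on)]


def partial_slope(x, y, control):
    """Slope of y on x once `control` is removed from both (FWL theorem)."""
    return ols_slope(residualize(x, control), residualize(y, control))


# Synthetic data: task completion rises with log size, and persuasiveness
# is built to depend on task completion alone.
log_size = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
task_completion = [0.2, 0.5, 0.6, 0.9, 0.8, 1.0]
persuasiveness = [6.0 * t for t in task_completion]

raw = ols_slope(log_size, persuasiveness)
adjusted = partial_slope(log_size, persuasiveness, task_completion)

print(round(raw, 2))
print(abs(adjusted) < 1e-9)
```

In this constructed example the size slope is clearly positive on its own and vanishes once task completion is held fixed, which is the pattern the exploratory analysis describes.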