Bad Actor, Good Advisor: Exploring the Role of Large Language Models in Fake News Detection


Journal: Proceedings of the AAAI Conference on Artificial Intelligence, Volume 38, Issue 20
DOI: https://doi.org/10.1609/aaai.v38i20.30214
Published: 2024-03-24

Beizhe Hu, Qiang Sheng, Juan Cao, Yuhui Shi, Yang Li, Danding Wang, Peng Qi
Key Lab of Intelligent Information Processing of Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences
University of Chinese Academy of Sciences
National University of Singapore
{hubeizhe21s, shengqiang18z, caojuan, shiyuhui22s}@ict.ac.cn, {liyang23s, wangdanding}@ict.ac.cn, pengqi.qp@gmail.com

Abstract

Detecting fake news requires both a delicate sense of diverse clues and a profound understanding of the real-world background, which remains challenging for detectors based on small language models (SLMs) due to their knowledge and capability limitations. Recent advances in large language models (LLMs) have shown remarkable performance in various tasks, but whether and how LLMs could help with fake news detection remains underexplored. In this paper, we investigate the potential of LLMs in fake news detection. First, we conduct an empirical study and find that a sophisticated LLM such as GPT-3.5 can generally expose fake news and provide desirable multi-perspective rationales but still underperforms the basic SLM, fine-tuned BERT. Our subsequent analysis attributes this gap to the LLM's inability to select and integrate rationales properly to reach a conclusion. Based on these findings, we propose that current LLMs may not substitute fine-tuned SLMs in fake news detection but can be a good advisor for SLMs by providing multi-perspective instructive rationales. To instantiate this proposal, we design an adaptive rationale guidance network for fake news detection (ARG), in which SLMs selectively acquire insights on news analysis from the LLMs' rationales. We further derive a rationale-free version of ARG by distillation, namely ARG-D, which serves cost-sensitive scenarios without querying LLMs. Experiments on two real-world datasets demonstrate that ARG and ARG-D outperform three types of baseline methods: SLM-based, LLM-based, and combinations of small and large language models.

1 Introduction

The wide and fast spread of fake news online has posed real-world threats in critical domains such as politics (Fisher et al., 2016), economy (CHEQ, 2019), and public health (Naeem and Bhatti, 2020). Among the countermeasures against this issue, automatic fake news detection, which aims at distinguishing inaccurate and intentionally misleading news items from others automatically, has been a promising solution in practice (Shu et al., 2017; Roth, 2022).

Figure 1: Illustration of the role of large language models (LLMs) in fake news detection. In this case, (a) the LLM fails to output a correct judgment of news veracity but (b) helps the small language model (SLM) judge correctly by providing informative rationales.

Though much progress has been made (Hu et al., 2022a), understanding and characterizing fake news is still challenging for current models. This is caused by the complexity of the news-faking process: fake news creators might manipulate any part of the news, using diverse writing strategies and being driven by inscrutable aims. Therefore, to maintain both effectiveness and universality, an ideal fake news detection method is required to have: 1) a delicate sense of diverse clues (e.g., style, facts, commonsense); and 2) a profound understanding of the real-world background.

Recent methods (Zhang et al., 2021; Kaliyar et al., 2021; Mosallanezhad et al., 2022; Hu et al., 2023) generally exploit pre-trained small language models (SLMs) such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) to understand news content and provide a fundamental representation, plus optional social contexts (Shu et al., 2019; Cui et al., 2022), knowledge bases (Popat et al., 2018; Hu et al., 2022b), or news environment (Sheng et al., 2022) as supplements. SLMs do bring improvements, but their knowledge and capability limitations also hinder further enhancement of fake news detectors. For example, BERT was pre-trained on text corpora such as Wikipedia (Devlin et al., 2019) and thus struggles with news items that require knowledge not covered by them (Sheng et al., 2021).

As a new alternative to SLMs, large language models (LLMs) (OpenAI, 2022; Anthropic, 2023; Touvron et al., 2023), which are usually trained on larger-scale corpora and aligned with human preferences, have shown impressive abilities on various tasks (Wei et al., 2022a) and are considered promising as general task solvers (Ma et al., 2023). However, the potential of LLMs in fake news detection remains underexplored:

Can LLMs help detect fake news with their internal knowledge and capability?
What solution should we adopt to obtain better performance using LLMs?

To answer these two questions, we first conduct a deep investigation of the effective role of LLMs in fake news detection and attempt to provide a practical LLM-involved solution. Unlike contemporary works (Pelrine et al., 2023; Caramancion, 2023) that simply prompt LLMs to provide predictions with the task instruction, we conduct a detailed empirical study to mine LLMs' potential. Specifically, we use four typical prompting approaches (zero-shot and few-shot, each with vanilla or chain-of-thought (CoT) prompting) to ask the LLM to make veracity judgments of given news items (Figure 1(a)) and find that even the best-performing LLM-based method still underperforms task-specific fine-tuned SLMs. We then analyze the LLM-generated explanatory rationales and find that the LLM can provide reasonable and informative rationales from several perspectives. By subsequently prompting the LLM with perspective-specific prompts and performing rule-based ensembles of judgments, we find that rationales indeed benefit fake news detection, and we attribute the unsatisfying performance to the LLM's inability to select and integrate rationales properly to reach a conclusion.

	Chinese			English
	Train	Val	Test	Train	Val	Test
Real	2,331	1,172	1,137	2,878	1,030	1,024
Fake	2,873	779	814	1,006	244	234
Total	5,204	1,951	1,951	3,884	1,274	1,258

Table 1: Statistics of the fake news detection datasets.

Based on these findings, we propose that the current LLM may not be a good substitute for the well-fine-tuned SLM but can serve as a good advisor by providing instructive rationales, as presented in Figure 1(b). To instantiate our proposal, we design the adaptive rationale guidance (ARG) network for fake news detection, which bridges the small and large LMs by selectively injecting new insights about news analysis from the large LM's rationales into the small LM. ARG further derives the rationale-free ARG-D via distillation for cost-sensitive scenarios with no need to query LLMs. Experiments on two real-world datasets show that ARG and ARG-D outperform existing SLM-only, LLM-only, and combination methods. Our contributions are as follows:

Detailed investigation: We investigate the effective role of LLMs in fake news detection and find that the LLM is bad at veracity judgment but good at analyzing contents;
Novel and practical solution: We design a novel ARG network and its distilled version ARG-D, which complement small and large LMs by letting SLMs selectively acquire insights from LLM-generated rationales, and which show superiority in extensive experiments;
Useful resource: We construct a rationale collection from GPT-3.5 for fake news detection in two languages (Chinese and English) and will make it publicly available to facilitate further research.

2 Is the LLM a Good Detector?

In this section, we evaluate the performance of a representative LLM, i.e., GPT-3.5, in fake news detection to reveal its judgment capability. We exploit four typical prompting approaches and compare against an SLM (here, BERT) fine-tuned on this task. Formally, given a news item $x$, the model aims to predict whether $x$ is fake or not.

Figure 2: Illustration of prompting approaches for LLMs.

2.1 Experimental Settings

Dataset: We employ the Chinese dataset Weibo21 (Nan et al., 2021) and the English dataset GossipCop (Shu et al., 2020) for evaluation. Following existing works (Zhu et al., 2022; Mu et al., 2023), we preprocess the datasets with deduplication and a temporal data split to avoid possible performance overrating caused by data leakage for the SLM. Table 1 presents the dataset statistics.
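The preprocessing described above (deduplication followed by a temporal split) can be illustrated with a minimal sketch. The source does not state the deduplication criterion or the cut-off dates, so the exact-match rule on normalised text, the field names `text`, `date`, `label`, and the dates below are all assumptions made for illustration.

```python
from datetime import date

def dedup_and_temporal_split(items, train_until, val_until):
    """Drop duplicate texts, then split by publication date.

    items: list of dicts with hypothetical keys 'text', 'date', 'label'.
    Older items go to train, newer ones to val/test, so no test item
    predates a training item.
    """
    seen = set()
    unique = []
    for item in sorted(items, key=lambda it: it["date"]):
        # Assumed criterion: identical text after case/whitespace normalisation.
        key = " ".join(item["text"].lower().split())
        if key in seen:
            continue  # keep only the earliest copy
        seen.add(key)
        unique.append(item)
    train = [it for it in unique if it["date"] <= train_until]
    val = [it for it in unique if train_until < it["date"] <= val_until]
    test = [it for it in unique if it["date"] > val_until]
    return train, val, test

news = [
    {"text": "Cherries are infested", "date": date(2020, 1, 5), "label": 1},
    {"text": "cherries  are infested", "date": date(2020, 3, 1), "label": 1},
    {"text": "City opens new library", "date": date(2020, 2, 10), "label": 0},
    {"text": "Star announces tour", "date": date(2020, 6, 20), "label": 0},
]
train, val, test = dedup_and_temporal_split(news, date(2020, 1, 31), date(2020, 3, 31))
```

Splitting by time rather than at random keeps a later repost of a training item from leaking into the test set, which is the kind of overrating the paragraph above refers to.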

Large Language Model: We evaluate GPT-3.5-turbo, the LLM developed by OpenAI that powers the popular chatbot ChatGPT (OpenAI, 2022), due to its representativeness and convenient calling. The large scale of parameters makes task-specific fine-tuning almost impossible for LLMs, so we use the prompt learning paradigm, where an LLM learns tasks from prompts containing instructions or few-shot demonstrations (Liu et al., 2023a). In detail, we utilize the following four typical prompting approaches to elicit the potential of the LLM in fake news detection (Figure 2):

Zero-Shot Prompting constructs the prompt from only the task description and the given news. To make the response more proficient and decrease the refusal ratio, we optionally adopt the role-playing technique when describing our task (Liu et al., 2023b; Ramlochan, 2023).
Zero-Shot CoT Prompting (Kojima et al., 2022) is a simple and straightforward chain-of-thought (CoT) prompting approach that encourages the LLM to reason. In addition to the elements in zero-shot prompting, it adds an eliciting sentence such as "Let's think step by step."
Few-Shot Prompting (Brown et al., 2020) provides task-specific prompts and several news-label examples as demonstrations. After preliminary tests of different shot settings, we choose 4-shot prompting, which includes two real and two fake samples.
Few-Shot CoT Prompting (Wei et al., 2022b) not only provides news-label examples but also demonstrates reasoning steps with previously written rationales. Here, we obtain the rationale demonstrations from the correct and reasonable outputs of zero-shot CoT prompting.

Model	Usage	Chinese	English
GPT-3.5-turbo	Zero-Shot	0.676	0.568
	Zero-Shot CoT	0.677	0.666
	Few-Shot	0.725	0.697
	Few-Shot CoT	0.681	0.702
BERT	Fine-tuning	0.753	0.765

Table 2: Performance in macro F1 of the large and small LMs. BERT achieves the best result on both datasets; the second-best results come from few-shot prompting in Chinese and few-shot CoT prompting in English.
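The four prompting approaches can be sketched as a single prompt builder. The task wording, the `News:`/`Reasoning:`/`Answer:` field labels, and the placeholder demonstrations are hypothetical; the source only specifies the ingredients (task description, optional role-play, the eliciting sentence, and four demonstrations with optional rationales).

```python
# Hypothetical task description with a role-playing opener.
TASK = ("You are a news veracity analyst. Decide whether the news below is "
        "real or fake. Answer with 'real' or 'fake'.")
COT_TRIGGER = "Let's think step by step."

def build_prompt(news, demos=(), use_cot=False):
    """Build one of the four prompt types.

    demos: sequence of (news_text, label, rationale) tuples; empty for
    zero-shot. The rationale is shown only when use_cot is True.
    """
    parts = [TASK]
    for demo_news, label, rationale in demos:
        parts.append(f"News: {demo_news}")
        if use_cot:
            parts.append(f"Reasoning: {rationale}")
        parts.append(f"Answer: {label}")
    parts.append(f"News: {news}")
    # CoT variants end with the eliciting sentence instead of an answer slot.
    parts.append(COT_TRIGGER if use_cot else "Answer:")
    return "\n".join(parts)

# 4-shot demonstrations: two real and two fake samples (placeholder texts).
demos = [
    ("demo news A", "real", "rationale A"),
    ("demo news B", "fake", "rationale B"),
    ("demo news C", "real", "rationale C"),
    ("demo news D", "fake", "rationale D"),
]
zero_shot = build_prompt("some news item")
zero_shot_cot = build_prompt("some news item", use_cot=True)
few_shot = build_prompt("some news item", demos)
few_shot_cot = build_prompt("some news item", demos, use_cot=True)
```

Keeping one builder with two switches makes the 2x2 design explicit: `demos` controls zero-shot versus few-shot, and `use_cot` controls vanilla versus CoT.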

Small Language Model: We adopt the pre-trained small language model BERT (Devlin et al., 2019) as the representative, given its wide use in this task (Kaliyar et al., 2021; Zhu et al., 2022; Sheng et al., 2022). Specifically, we limit the maximum length of the text to 170 tokens and use chinese-bert-wwm-ext and bert-base-uncased from the Transformers package (Wolf et al., 2020) for the Chinese and English evaluation, respectively. We use Adam (Kingma and Ba, 2014) as the optimizer and do a grid search for the optimal learning rate. We report the testing result on the best-validation checkpoint.
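Results in this paper are reported in macro F1. As a reminder of what that metric measures on the imbalanced label distributions in Table 1, the following is a minimal sketch of its standard definition (the unweighted mean of per-class F1); the toy labels are illustrative and not taken from the datasets.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 of a single class, treating `cls` as the positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:
        return 0.0  # no true positives: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of per-class F1 (0 = real, 1 = fake)."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# A detector that always answers "real" on a 3:1 real/fake toy set.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 0, 0]
score = macro_f1(y_true, y_pred)
```

In this toy case accuracy is 0.75 while macro F1 is 3/7 (about 0.43), which is why macro F1 is the more informative headline metric when fake items are the minority, as in the English dataset.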

2.2 Comparison between Small and Large LMs

Table 2 presents the performance of GPT-3.5-turbo with four prompting approaches and the fine-tuned BERT on the two datasets. We observe that: 1) Though the LLM is generally believed to be powerful, it underperforms the fine-tuned SLM under all four prompting approaches. The SLM has a relative increase of roughly 4% to 11% in Chinese and 9% to 35% in English over the LLM, indicating that the LLM lacks task-specific knowledge while the SLM learns it during fine-tuning.
2) Few-shot versions outperform zero-shot ones, suggesting the importance of task samples. However, introducing several samples only narrows the gap with the SLM and does not lead to surpassing it.
3) CoT prompting brings additional performance gains in general, especially under the zero-shot setting on the English dataset. However, we also observe some cases where CoT leads to a decrease. This indicates that effective use of rationales may require more careful design.

Overall, given the LLM's unsatisfying performance and higher inference costs than the SLM, the current LLM is not a "good enough" detector to substitute task-specific SLMs in fake news detection.

Perspective	Chinese Prop.	Chinese macF1	English Prop.	English macF1
Textual Description	65%	0.706	71%	0.653
Commonsense	71%	0.698	60%	0.680
Factuality	17%	0.629	24%	0.626
Others	4%	0.649	8%	0.704

Case (Textual Description). News: Everyone! Don't buy cherries anymore: Cherries of this year are infested with maggots, and nearly all are affected. LLM rationale: …The tone of the news is extremely urgent, seemingly trying to spread panic and anxiety. Prediction: Fake. Ground truth: Fake.
Case (Commonsense). News: Huang, the chief of Du'an Civil Affairs Bureau, gets subsistence allowances of 509 citizens, owns nine properties, and has six wives… LLM rationale: …The news content is highly sensational… Such a situation is incredibly rare in reality and could even be considered impossible. Prediction: Fake. Ground truth: Fake.
Case (Factuality). News: The 18th National Congress has approved that individuals who are 18 years old or above are now eligible to marry… LLM rationale: First, the claim that Chinese individuals who are 18 years old or above can register their marriage is real, as this is stipulated by Chinese law… Prediction: Real. Ground truth: Fake.

Table 3: Analysis of different perspectives of the LLM's rationales in the sample set, including the data ratio, the LLM's performance, and cases. Prop.: Proportion.

2.3 Analysis on the Rationales from the LLM

Though the LLM is bad at news veracity judgment, we also notice that the rationales generated through zero-shot CoT prompting exhibit a unique multi-perspective analytical capability that is challenging and rare for SLMs. For further exploration, we sample 500 samples from each of the two datasets and manually categorize them according to the perspectives from which the LLM analyzes the news. The statistical results by perspective and the cases are presented in Table 3.

Model	Usage	Chinese	English
GPT-3.5-turbo	Zero-Shot CoT	0.677	0.666
	from Perspective TD	0.667	0.611
	from Perspective CS	0.678	0.698
BERT	Fine-tuning	0.753	0.765
Ensemble	Majority Voting	0.735	0.724
Ensemble	Oracle Voting	0.908	0.878

Table 4: Performance of the LLM using zero-shot CoT with the perspective specified, and of other compared models. TD: Textual description; CS: Commonsense.

We see that: 1) The LLM is capable of generating human-like rationales on news content from various perspectives, such as textual description, commonsense, and factuality, which meets the requirement of a delicate sense of diverse clues and a profound understanding of the real-world background in fake news detection. 2) The detection performance on the subset using certain perspectives is higher than the zero-shot CoT result on the full test set. This indicates the potential of analysis by perspectives, though the coverage is moderate. 3) The analysis from the perspective of factuality leads to performance lower than average, indicating the unreliability of using the LLM for factuality analysis based on its internal memorization. We speculate this is caused by the hallucination issue (Ji et al., 2023; Zhang et al., 2023).

We further investigate the LLM's performance when asked to perform analysis from a specific perspective on the full test set (i.e., 100% coverage). From the first group in Table 4, we see that the LLM's judgment with single-perspective analysis is still promising. Compared with the comprehensive zero-shot CoT setting, the single-perspective-based LLM performs comparably on the Chinese dataset and is better on the English dataset (in the case of the commonsense perspective). The results show that the LLM's internal mechanism for integrating rationales from diverse perspectives is ineffective for fake news detection, limiting the full use of rationales. In this case, combining the small and large LMs to complement each other is a promising solution: the former could benefit from the analytical capability of the latter, while the latter could be enhanced by task-specific knowledge from the former.

To exhibit the advantages of this solution, we apply majority voting and oracle voting (assuming the most ideal situation where, for each sample, we trust a model that judged it correctly, if any) among the two single-perspective-based LLMs and BERT. Results show that we are likely to gain performance better than any LLM-only or SLM-only method mentioned before if we could adaptively combine their advantages, i.e., the flexible task-specific learning of the SLM and the informative rationales generated by the LLM. That is, the LLM could be a good advisor for the SLM by providing rationales, ultimately improving the performance of fake news detection.
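The two rule-based ensembles can be sketched as follows. This is a minimal illustration under the description above (three voters: the two single-perspective LLM runs and BERT); the toy predictions are made up and do not reproduce Table 4.

```python
def majority_vote(*model_preds):
    """Per-sample majority over 0/1 predictions from an odd number of models."""
    return [1 if sum(votes) * 2 > len(votes) else 0 for votes in zip(*model_preds)]

def oracle_vote(y_true, *model_preds):
    """Best case: a sample is judged correctly if any model judged it correctly.

    If no model is right, all binary votes agree on the wrong label, so the
    first vote is returned.
    """
    return [label if label in votes else votes[0]
            for label, votes in zip(y_true, zip(*model_preds))]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

y_true   = [1, 0, 1, 0, 1]
llm_td   = [1, 1, 0, 0, 0]   # LLM, textual-description perspective
llm_cs   = [0, 0, 1, 0, 0]   # LLM, commonsense perspective
bert     = [1, 0, 0, 1, 0]   # fine-tuned SLM

maj = majority_vote(llm_td, llm_cs, bert)
orc = oracle_vote(y_true, llm_td, llm_cs, bert)
```

Oracle voting needs the ground-truth label, so it is an upper bound rather than a usable detector; the gap between the two rules is the headroom that an adaptive selection mechanism could in principle recover.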

3 ARG: Adaptive Rationale Guidance Network for Fake News Detection

Based on the above findings and discussions, we propose the adaptive rationale guidance (ARG) network for fake news detection. Figure 3 overviews the ARG and its rationale-free version ARG-D for cost-sensitive scenarios. The objective of ARG is to empower small fake news detectors with the ability to adaptively select useful rationales as references for final judgments. Given a news item $x$ and its corresponding LLM-generated rationales $r_t$ (textual description) and $r_c$ (commonsense), the ARG first encodes the inputs using the SLM (Figure 3(a)). Subsequently, it builds news-rationale collaboration by predicting the LLM's judgment from the rationale, enriching news-rationale feature interaction, and evaluating rationale usefulness (Figure 3(b)). The interactive features are finally aggregated with the news feature $\mathbf{x}$ for the final judgment of $x$ being fake or not (Figure 3(c)). ARG-D is derived from the ARG via distillation for scenarios where the LLM is unavailable (Figure 3(d)).

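3.1 Representation

We employ two BERT models separately as the news and rationale encoders to obtain semantic representations. For the given news item $x$ and its two corresponding rationales $r_t$ and $r_c$, the representations are $\mathbf{X}$, $\mathbf{R}_t$, and $\mathbf{R}_c$, respectively.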

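3.2 News-Rationale Collaboration

The news-rationale collaboration step aims at providing a rich interaction between news and rationales and learning to adaptively select useful rationales as references, which is at the core of our design. To achieve this aim, ARG includes three modules, detailed and exemplified using the textual description rationale branch below:

3.2.1 News-Rationale Feature Interaction

To enable comprehensive information exchange between news and rationales, we introduce a news-rationale interactor with a dual cross-attention mechanism to encourage feature interactions. The cross-attention can be described as:

$$\mathrm{CA}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\left(\mathbf{Q}'\mathbf{K}'^{\top} / \sqrt{d}\right)\mathbf{V}',$$

where $\mathbf{Q}' = \mathbf{Q}\mathbf{W}_Q$, $\mathbf{K}' = \mathbf{K}\mathbf{W}_K$, $\mathbf{V}' = \mathbf{V}\mathbf{W}_V$, and $d$ is the dimensionality. Given the representations of the news $\mathbf{X}$ and the rationale $\mathbf{R}_t$, the process is:

$$\mathbf{f}_{t\to x} = \mathrm{AvgPool}\left(\mathrm{CA}(\mathbf{R}_t, \mathbf{X}, \mathbf{X})\right),$$
$$\mathbf{f}_{x\to t} = \mathrm{AvgPool}\left(\mathrm{CA}(\mathbf{X}, \mathbf{R}_t, \mathbf{R}_t)\right),$$

where $\mathrm{AvgPool}(\cdot)$ is the average pooling over the token representations output by cross-attention, yielding a one-vector text representation $\mathbf{f}$.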
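The dual cross-attention with average pooling can be sketched in plain Python. This is an illustrative single-head version under assumptions: token matrices are lists of row vectors, the projections are passed in explicitly (identity matrices in the example), and which direction is called "rationale-aware news" versus "news-aware rationale" follows the naming used in this section rather than any released implementation.

```python
import math

def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(q, k, v, w_q, w_k, w_v):
    """Scaled dot-product attention: queries from q, keys/values from k and v."""
    q_p, k_p, v_p = matmul(q, w_q), matmul(k, w_k), matmul(v, w_v)
    d = len(q_p[0])
    k_t = [list(col) for col in zip(*k_p)]
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q_p, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v_p)  # one output row per query token

def avg_pool(tokens):
    """Average token vectors into a single vector."""
    return [sum(col) / len(tokens) for col in zip(*tokens)]

identity = [[1.0, 0.0], [0.0, 1.0]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # news: 3 tokens, d = 2
R_t = [[1.0, 0.0], [0.0, 2.0]]             # rationale: 2 tokens, d = 2

# Rationale tokens attend over news tokens, and vice versa.
f_t_to_x = avg_pool(cross_attention(R_t, X, X, identity, identity, identity))
f_x_to_t = avg_pool(cross_attention(X, R_t, R_t, identity, identity, identity))
```

Each attention output row is a convex combination of the value rows, so `f_t_to_x` stays inside the range of the news token features and `f_x_to_t` inside the range of the rationale token features.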

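3.2.2 LLM Judgment Prediction

Understanding the judgment hinted at by the given rationales is a prerequisite for fully exploiting the information behind them. To this end, we construct the LLM judgment prediction task, which requires predicting the LLM's judgment of the news veracity according to the given rationale. We expect this to deepen the understanding of the rationale texts. For the textual description rationale branch, we feed its representation $\mathbf{R}_t$ into the LLM judgment predictor, which is parametrized using a multi-layer perceptron (MLP):

$$\hat{m}_t = \mathrm{sigmoid}(\mathrm{MLP}(\mathbf{R}_t)), \quad L_{pt} = \mathrm{CE}(\hat{m}_t, m_t),$$

where $m_t$ and $\hat{m}_t$ are respectively the LLM's actual judgment (extracted from the response) and its prediction. The loss $L_{pt}$ is a cross-entropy loss $\mathrm{CE}(\hat{y}, y) = -y\log\hat{y} - (1-y)\log(1-\hat{y})$. The case is similar for the commonsense rationale $\mathbf{R}_c$.

Figure 3: Overall architecture of the proposed adaptive rationale guidance (ARG) network and its rationale-free version ARG-D. In ARG, the news item and the LLM rationales are (a) respectively encoded into $\mathbf{X}$, $\mathbf{R}_t$, and $\mathbf{R}_c$. Then the small and large LMs collaborate with each other via news-rationale feature interaction, LLM judgment prediction, and rationale usefulness evaluation. The obtained interactive features are $\mathbf{f}'_{t\to x}$ and $\mathbf{f}'_{c\to x}$. These features are finally aggregated with the attentively pooled news feature $\mathbf{x}$ for the final judgment. In ARG-D, the news encoder and the attention module are preserved, and the output of the rationale-aware feature simulator is supervised by the aggregated feature $\mathbf{f}_{cls}$ for knowledge distillation.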
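The auxiliary task above amounts to a small binary classifier on the rationale representation. A minimal sketch follows, assuming a pooled rationale vector as input and a one-hidden-layer MLP with hand-picked weights; the layer sizes, the pooling, and the weights are illustrative assumptions, not the trained predictor.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mlp(vec, w1, b1, w2, b2):
    """One hidden ReLU layer followed by a scalar output (a logit)."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, vec)) + b)
              for row, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2

def binary_cross_entropy(pred, target, eps=1e-12):
    """CE(pred, target) = -target*log(pred) - (1-target)*log(1-pred)."""
    pred = min(max(pred, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(pred) + (1 - target) * math.log(1 - pred))

# Hypothetical pooled representation of a textual-description rationale.
rationale_vec = [0.5, -1.0]
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [2.0, 1.0], -0.5

m_hat = sigmoid(mlp(rationale_vec, w1, b1, w2, b2))  # predicted LLM judgment
m_t = 1                                              # the LLM said "fake"
loss_pt = binary_cross_entropy(m_hat, m_t)
```

The target here is the judgment the LLM stated in its response, not the ground-truth label, so the loss pushes the encoder to read what the rationale is arguing for.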

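3.2.3 Rationale Usefulness Evaluation

The usefulness of rationales from different perspectives varies across news items, and improper integration may lead to performance degradation. To enable the model to adaptively select appropriate rationales, we devise a rationale usefulness evaluation process, in which we assess the contributions of different rationales and adjust their weights for subsequent veracity prediction. The process comprises two phases: evaluation and reweighting. For evaluation, we input the news-aware rationale vector $\mathbf{f}_{x\to t}$ into the rationale usefulness evaluator (parameterized by an MLP) to predict its usefulness $u_t$. Following the assumption that rationales leading to correct judgments are more useful, we use the judgment correctness as the rationale usefulness labels:

$$\hat{u}_t = \mathrm{sigmoid}(\mathrm{MLP}(\mathbf{f}_{x\to t})), \quad L_{et} = \mathrm{CE}(\hat{u}_t, u_t).$$

In the reweighting phase, we input the vector $\mathbf{f}_{x\to t}$ into an MLP to obtain a weight number $w_t$, which is then used to reweight the rationale-aware news vector $\mathbf{f}_{t\to x}$. The procedure is as follows:

$$\mathbf{f}'_{t\to x} = w_t \cdot \mathbf{f}_{t\to x}.$$

We also use attentive pooling to transform the representation matrix $\mathbf{X}$ into a vector $\mathbf{x}$.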
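The two phases can be illustrated with a short sketch. The usefulness label follows the stated assumption (a rationale is useful when the LLM judgment it led to is correct). For the reweighting, a single linear layer with a sigmoid stands in for the MLP; bounding the weight to (0, 1) this way is an assumption of this sketch, and the vectors and gate parameters are made up.

```python
import math

def usefulness_label(llm_judgment, ground_truth):
    """1 if the LLM's judgment under this rationale matches the label, else 0."""
    return 1 if llm_judgment == ground_truth else 0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reweight(f_t_to_x, f_x_to_t, gate_weights, gate_bias):
    """Scale the rationale-aware news vector by a scalar computed from the
    news-aware rationale vector (linear layer + sigmoid as a stand-in MLP)."""
    w_t = sigmoid(sum(w * v for w, v in zip(gate_weights, f_x_to_t)) + gate_bias)
    return w_t, [w_t * v for v in f_t_to_x]

# Labels for the evaluator: LLM said fake (1); truth is fake (1) or real (0).
u_useful = usefulness_label(1, 1)
u_useless = usefulness_label(1, 0)

f_t_to_x = [0.4, 0.8]    # rationale-aware news vector (toy values)
f_x_to_t = [1.0, -1.0]   # news-aware rationale vector (toy values)
w_t, f_reweighted = reweight(f_t_to_x, f_x_to_t, [0.5, 0.5], 0.0)
```

With these toy values the gate input is zero, so `w_t` is 0.5 and the rationale-aware vector is halved; a rationale the evaluator finds unhelpful would be scaled closer to zero before aggregation.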

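3.3 Prediction

Based on the outputs from the last step, we now aggregate the news vector $\mathbf{x}$ and the rationale-aware news vectors $\mathbf{f}'_{t\to x}$ and $\mathbf{f}'_{c\to x}$ for the final judgment. For a news item $x$ with label $y \in \{0, 1\}$, we aggregate these vectors with different weights:

$$\mathbf{f}_{cls} = w^{cls}_x \cdot \mathbf{x} + w^{cls}_t \cdot \mathbf{f}'_{t\to x} + w^{cls}_c \cdot \mathbf{f}'_{c\to x},$$

where $w^{cls}_x$, $w^{cls}_t$, and $w^{cls}_c$ are learnable parameters ranging from 0 to 1. $\mathbf{f}_{cls}$ is the fusion vector, which is then fed into the MLP classifier for the final prediction of news veracity:

$$L_{ce} = \mathrm{CE}(\mathrm{MLP}(\mathbf{f}_{cls}), y).$$

The total loss function is the weighted sum of the loss terms mentioned above:

$$L = L_{ce} + \beta_1 L_{et} + \beta_2 L_{pt} + \beta_3 L_{ec} + \beta_4 L_{pc},$$

where $\beta_1, \ldots, \beta_4$ are hyperparameters.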
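The aggregation and the total objective reduce to two weighted sums, sketched below. The vectors, aggregation weights, loss values, and beta values are placeholders chosen for easy arithmetic; in the model the aggregation weights are learned and the betas are tuned.

```python
def fuse(news_vec, f_t, f_c, w_x, w_t, w_c):
    """Weighted sum of the news vector and the two reweighted
    rationale-aware news vectors; each weight is kept in [0, 1]."""
    for w in (w_x, w_t, w_c):
        assert 0.0 <= w <= 1.0, "aggregation weights range from 0 to 1"
    return [w_x * a + w_t * b + w_c * c for a, b, c in zip(news_vec, f_t, f_c)]

def total_loss(l_ce, aux_losses, betas):
    """Classification loss plus beta-weighted auxiliary losses
    (usefulness evaluator and judgment predictor, one pair per branch)."""
    assert len(aux_losses) == len(betas)
    return l_ce + sum(b * l for b, l in zip(betas, aux_losses))

f_cls = fuse([1.0, 2.0], [0.0, 4.0], [2.0, 0.0], w_x=0.5, w_t=0.25, w_c=0.25)

# Order of auxiliary terms: evaluator/predictor for the textual-description
# branch, then evaluator/predictor for the commonsense branch.
loss = total_loss(0.5, [0.2, 0.4, 0.1, 0.3], [1.0, 0.5, 1.0, 0.5])
```

Because the auxiliary terms only shape the shared representations, setting a beta to zero removes the corresponding module's supervision, which is how the ablations in Section 4.2 can be read.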

	Model	Chinese				English
		macF1	Acc.	F1_real	F1_fake	macF1	Acc.	F1_real	F1_fake
G1: LLM-Only	GPT-3.5-turbo	0.725	0.734	0.774	0.676	0.702	0.813	0.884	0.519
G2: SLM-Only	Baseline	0.753	0.754	0.769	0.737	0.765	0.862	0.916	0.615
	EANN	0.754	0.756	0.773	0.736	0.763	0.864	0.918	0.608
	Publisher-Emo	0.761	0.763	0.784	0.738	0.766	0.868	0.920	0.611
	ENDEF	0.765	0.766	0.779	0.751	0.768	0.865	0.918	0.618
G3: LLM+SLM	Baseline + Rationale	0.767	0.769	0.787	0.748	0.777	0.870	0.921	0.633
	SuperICL	0.757	0.759	0.779	0.734	0.736	0.864	0.920	0.551
	ARG	0.784	0.786	0.804	0.764	0.790		0.926	0.653
	(Relative improvement over Baseline)
	w/o LLM Judgment Predictor	0.773	0.774	0.789	0.756		0.880	0.928	0.645
	w/o Rationale Usefulness Evaluator		0.783	0.801	0.761	0.782	0.873	0.923	0.641
	w/o Predictor & Evaluator	0.769	0.770	0.782	0.756	0.780	0.874	0.923	0.637
	ARG-D

Table 5: Performance of the ARG and its variants and the LLM-only, SLM-only, and LLM+SLM methods. The best two results in macro F1 and accuracy are respectively bolded and underlined. For GPT-3.5-turbo, the best results in Table 2 are reported.

3.4 Distillation for Rationale-Free Model

The ARG requires sending requests to the LLM for every prediction, which might not be affordable in cost-sensitive scenarios. Therefore, we attempt to build a rationale-free model, namely ARG-D, based on the trained ARG model via knowledge distillation (Hinton et al., 2015). The basic idea is to simulate and internalize the knowledge from rationales into a parametric module. As shown in Figure 3(d), we initialize the news encoder and classifier with the corresponding modules in the ARG and train a rationale-aware feature simulator (implemented with a multi-head transformer block) and an attention module to internalize the knowledge. Besides the cross-entropy loss $L_{ce}$, we let the feature $\mathbf{f}^{d}_{cls}$ imitate $\mathbf{f}_{cls}$ in the ARG, using the mean squared estimation loss:

$$L_{kd} = \mathrm{MSE}(\mathbf{f}_{cls}, \mathbf{f}^{d}_{cls}).$$
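The distillation objective can be sketched as a classification loss plus a feature-imitation term. The weight `alpha` on the imitation term is a hypothetical name for the searched loss weight, and the feature values are toy numbers; the teacher feature would come from a trained ARG and the student feature from the rationale-aware feature simulator.

```python
def mse(student, teacher):
    """Mean squared error between two equal-length feature vectors."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(teacher)

def distillation_objective(l_ce, student_feat, teacher_feat, alpha=1.0):
    """Cross-entropy on the veracity label plus alpha times the MSE between
    the student's feature and the (frozen) teacher's fused feature."""
    return l_ce + alpha * mse(student_feat, teacher_feat)

teacher_f_cls = [1.0, 4.0]   # fused feature from ARG (uses LLM rationales)
student_f_cls = [1.0, 2.0]   # feature from ARG-D (news text only)

l_kd = mse(student_f_cls, teacher_f_cls)
objective = distillation_objective(0.3, student_f_cls, teacher_f_cls, alpha=0.5)
```

Matching the fused feature rather than the output probability lets the student reuse the teacher's classifier unchanged, which fits the initialization described above.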

4 Evaluation

4.1 Experimental Settings

Baselines: We compare three groups of methods:
G1 (LLM-Only): We list the performance of the best-performing setting on each dataset in Table 2, i.e., few-shot in Chinese and few-shot CoT in English.
G2 (SLM-Only): 1) Baseline: The vanilla BERT-base model, whose setting remains consistent with that in Section 2. 2) EANN (Wang et al., 2018): A model that learns effective signals using auxiliary adversarial training, aiming at removing event-related features as much as possible. We use the publication year as the label for the auxiliary task. 3) Publisher-Emo (Zhang et al., 2021): A model that fuses a series of emotional features with textual features for fake news detection. 4) ENDEF (Zhu et al., 2022): A model that removes entity bias via causal learning for better generalization on distribution-shifted fake news data. All methods in this group use the same BERT as the text encoder.
G3 (LLM+SLM): 1) Baseline + Rationale: It concatenates features from the news encoder and the rationale encoder and feeds them into an MLP for prediction. 2) SuperICL (Xu et al., 2023): It exploits the SLM as a plug-in for the in-context learning of the LLM by injecting the prediction and the confidence for each testing sample into the prompt.

Implementation Details: We use the same datasets introduced in Section 2 and keep the setting the same in terms of the pre-trained model, learning rate, and optimization method. For the ARG-D network, the parameters of the news encoder and classifier are derived from the ARG model. A four-head transformer block is implemented in the rationale-aware feature simulator. The weights of the loss functions $L_{et}$, $L_{pt}$, $L_{ec}$, and $L_{pc}$ in the ARG and $L_{kd}$ in the ARG-D are grid searched.

4.2 Performance Comparison and Ablation Study

Table 5 presents the performance of our proposed ARG, its variants, and the compared methods. From the results, we observe that: 1) The ARG outperforms all other compared methods in macro F1, demonstrating its effectiveness. 2) The rationale-free ARG-D still outperforms all compared methods except ARG and its variants, which shows the positive impact of the knowledge distilled from ARG. 3) The two compared LLM+SLM methods exhibit different performance. The simple combination of news and rationale features yields a performance improvement, showing the usefulness of our prompted rationales. SuperICL outperforms the LLM-only method but fails to consistently outperform the baseline SLM on the two datasets. We speculate that this is due to the complexity of our task, where injecting the prediction and confidence of an SLM does not bring sufficient information. 4) We evaluate three ablation groups to assess the effectiveness of different modules in the ARG network. From the results, we can see that removing either the LLM judgment predictor or the rationale usefulness evaluator leads to a significant decrease in ARG's performance, highlighting the significance of these two structures. Besides, we find that even the weakest of the ARG variants still outperforms all other methods, which shows the importance of the news-rationale interaction structure we designed.

4.3 Result Analysis

To investigate which part the additional gain of the ARG(-D) should be attributed to, we perform statistical analysis on the additional correctly judged samples of ARG(-D) compared with the vanilla BERT. From Figure 4, we observe that: 1) The proportions of the overlapping samples between ARG(-D) and the LLM exceed […], indicating that the ARG(-D) can exploit (and absorb) the valuable knowledge for judgments from the LLM, even though the LLM's performance is unsatisfying. 2) The samples correctly judged by the LLM from both perspectives contribute the most, suggesting that more diverse rationales may enhance the ARG(-D)'s training. 3) […] of the correct judgments should be attributed to the model itself. We speculate that it produces some kinds of "new knowledge" based on the wrong judgments of the given knowledge.

For the analysis of success and failure cases and additional analysis, please refer to the appendix.

Figure 4: Statistics of additional correctly judged samples of (a) ARG and (b) ARG-D over the BERT baseline. right(·) denotes samples correctly judged by the method (·). TD/CS: Textual description/commonsense perspective.

Figure 5: Performance as the shifting threshold changes.

4.4 Cost Analysis in Practice

We showcase a possible model-shifting strategy to balance performance and cost in practical systems. Inspired by Ma et al. (2023), we simulate the situation where we use the more economical ARG-D by default but query the more powerful ARG for part of the data. As presented in Figure 5, by sending only […] of the data (according to the confidence of ARG-D) to the ARG, we could achieve 0.784 in macro F1, which is the same as the performance of fully using the ARG.
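The model-shifting strategy can be sketched as a confidence-gated router. The confidence definition (the larger of the two class probabilities), the threshold value, and the stand-in `arg_predict` callable are assumptions of this sketch; the source states only that routing follows ARG-D's confidence.

```python
def shift_predictions(argd_probs, arg_predict, threshold):
    """Route low-confidence samples from ARG-D to ARG.

    argd_probs: ARG-D's predicted probability of "fake" per sample.
    arg_predict: callable mapping a sample index to ARG's 0/1 prediction
                 (stands in for the costlier model that queries the LLM).
    threshold: minimum ARG-D confidence needed to keep its own prediction.
    Returns the final predictions and the fraction of samples sent to ARG.
    """
    preds, sent = [], 0
    for i, p in enumerate(argd_probs):
        confidence = max(p, 1.0 - p)
        if confidence < threshold:
            preds.append(arg_predict(i))   # low confidence: ask ARG
            sent += 1
        else:
            preds.append(1 if p >= 0.5 else 0)
    return preds, sent / len(argd_probs)

argd_probs = [0.95, 0.55, 0.10, 0.48]
arg_outputs = [1, 0, 0, 1]                 # toy ARG predictions
final_preds, fraction_sent = shift_predictions(
    argd_probs, lambda i: arg_outputs[i], threshold=0.6)
```

Sweeping `threshold` traces the kind of curve shown in Figure 5: a higher threshold sends more samples to ARG, raising cost along with expected performance.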

5 Related Work

Fake News Detection: Fake news detection is generally formulated as a binary classification task between real and fake news items. Research on this task can be roughly categorized into two groups: social-context-based and content-based methods. Methods in the first group aim at differentiating fake and real news during the diffusion procedure by observing the propagation patterns (Zhou and Zafarani, 2019), user feedback (Min et al., 2022), and social networks (Nguyen et al., 2020). The second group focuses on finding hints based on the given content, including text (Przybyla, 2020) and images (Qi et al., 2021), and may require extra assistance from knowledge bases (Popat et al., 2018) and news environments (Sheng et al., 2022). Both groups of methods conventionally obtain textual representations from pre-trained models like BERT but rarely consider the potential of such models for fake news detection. We conducted an exploration in this paper by combining large and small LMs and obtained a good improvement using only textual content.

LLMs for Natural Language Understanding: LLMs, though mostly generative models, also have powerful natural language understanding (NLU) capabilities, especially in few-shot in-context learning scenarios (Brown et al., 2020). Recent works in this line focus on benchmarking the latest LLMs in NLU. Results show that LLMs may not have comprehensive superiority over a well-trained small model in some types of NLU tasks (Zhong et al., 2023; Kocoń et al., 2023). Our results provide empirical findings in fake news detection with only textual content as the input.

6 Conclusion and Discussion

We investigated whether large LMs help in fake news detection and how to properly utilize their advantages for improving performance. Results show that the large LM (GPT-3.5) underperforms the task-specific small LM (BERT) but can provide informative rationales and complement small LMs in news understanding. Based on these findings, we designed the ARG network to flexibly combine the respective advantages of small and large LMs and developed its rationale-free version ARG-D for cost-sensitive scenarios. Experiments demonstrated the superiority of the ARG and ARG-D.

Discussion: Our findings in fake news detection exemplify the current barrier for LLMs to be competent in applications closely related to the sophisticated real-world background. Though having superior analyzing capability, LLMs may struggle to make full use of their internal capability. This suggests that "mining" their potential may require novel prompting techniques and a deeper understanding of their internal mechanism. We then identified the possibility of combining small and large LMs to earn additional improvement and provided a solution especially suitable for situations where the better-performing models have to "select good to learn" from worse ones. We expect our solution to be extended to other tasks and to foster more effective and cost-friendly use of LLMs in the future.

Limitations: We identify the following limitations: 1) We do not examine other well-known LLMs (e.g., Claude and Ernie Bot) due to API unavailability for us when conducting this research; 2) We only consider the perspectives summarized from the LLM's responses, and there might be other prompting perspectives based on a conceptualization framework of fake news; 3) Our best results still fall behind the oracle voting integration of multi-perspective judgments in Table 4, indicating that room for improvement still exists.

Acknowledgments

The authors would like to thank the anonymous reviewers for their insightful comments. This work is supported by the National Natural Science Foundation of China (62203425), the Zhejiang Provincial Key Research and Development Program of China (2021C01164), the Project of Chinese Academy of Sciences (E141020), the Postdoctoral Fellowship Program of CPSF (GZC20232738), and the CIPSC-SMP-Zhipu.AI Large Model Cross-Disciplinary Fund.

References

Anthropic. 2023. Model card and evaluations for Claude models. https://www-files.anthropic.com/production/images/Model-Card-Claude-2.pdf. Accessed: 2023-08-13.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, pages 1877-1901. Curran Associates Inc.

Kevin Matthe Caramancion. 2023. News verifiers showdown: A comparative performance evaluation of ChatGPT 3.5, ChatGPT 4.0, Bing AI, and Bard in news fact-checking. arXiv preprint arXiv:2306.17176.

CHEQ. 2019. The economic cost of bad actors on the internet. https://info.cheq.ai/hubfs/Research/THE_ECONOMIC_COST_Fake_News_final.pdf. Accessed: 2023-08-13.

Jian Cui, Kwanwoo Kim, Seung Ho Na, and Seungwon Shin. 2022. Meta-path-based fake news detection leveraging multi-level social context information. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pages 325-334. ACM.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186. ACL.

Marc Fisher, John Woodrow Cox, and Peter Hermann. 2016. Pizzagate: From rumor, to hashtag, to gunfire in D.C. The Washington Post.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Beizhe Hu, Qiang Sheng, Juan Cao, Yongchun Zhu, Danding Wang, Zhengjia Wang, and Zhiwei Jin. 2023. Learn over past, evolve for future: Forecasting temporal trends for fake news detection. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track), pages 116-125. ACL.

Linmei Hu, Siqi Wei, Ziwang Zhao, and Bin Wu. 2022a. Deep learning for fake news detection: A comprehensive survey. AI Open, 3:133-155.

Xuming Hu, Zhijiang Guo, GuanYu Wu, Aiwei Liu, Lijie Wen, and Philip Yu. 2022b. CHEF: A pilot Chinese dataset for evidence-based fact-checking. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3362-3376. ACL.

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation. ACM Computing Surveys, 55:1-38.

Rohit Kumar Kaliyar, Anurag Goswami, and Pratik Narang. 2021. FakeBERT: Fake news detection in social media with a BERT-based deep learning approach. Multimedia Tools and Applications, 80(8):11765-11788.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In International Conference on Learning Representations.

Jan Kocoń, Igor Cichecki, Oliwier Kaszyca, Mateusz Kochanek, Dominika Szydło, Joanna Baran, Julita Bielaniewicz, Marcin Gruza, Arkadiusz Janz, Kamil Kanclerz, Anna Kocoń, Bartłomiej Koptyra, Wiktoria Mieleszczenko-Kowszewicz, Piotr Miłkowski, Marcin Oleksy, Maciej Piasecki, Łukasz Radliński, Konrad Wojtasik, Stanisław Woźniak, and Przemysław Kazienko. 2023. ChatGPT: Jack of all trades, master of none. Information Fusion, 99:101861.

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems, volume 35, pages 22199-22213. Curran Associates, Inc.

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023a. Pretrain, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1-35.

Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, and Yang Liu. 2023b. Jailbreaking ChatGPT via prompt engineering: An empirical study. arXiv preprint arXiv:2305.13860.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Yubo Ma, Yixin Cao, YongChing Hong, and Aixin Sun. 2023. Large language model is not a good few-shot information extractor, but a good reranker for hard samples! arXiv preprint arXiv:2303.08559.

Erxue Min, Yu Rong, Yatao Bian, Tingyang Xu, Peilin Zhao, Junzhou Huang, and Sophia Ananiadou. 2022. Divide-and-conquer: Post-user interaction network for fake news detection on social media. In Proceedings of the ACM Web Conference 2022, pages 1148-1158. ACM.

Ahmadreza Mosallanezhad, Mansooreh Karami, Kai Shu, Michelle V. Mancenido, and Huan Liu. 2022. Domain adaptive fake news detection via reinforcement learning. In Proceedings of the ACM Web Conference 2022, pages 3632-3640. ACM.

Yida Mu, Kalina Bontcheva, and Nikolaos Aletras. 2023. It’s about time: Rethinking evaluation on rumor detection benchmarks using chronological splits. In Findings of the Association for Computational Linguistics: EACL 2023, pages 736-743. ACL.

Salman Bin Naeem and Rubina Bhatti. 2020. The COVID-19 ‘infodemic’: a new front for information professionals. Health Information & Libraries Journal, 37(3):233-239.

Qiong Nan, Juan Cao, Yongchun Zhu, Yanyan Wang, and Jintao Li. 2021. MDFEND: Multi-domain fake news detection. In Proceedings of the 30th ACM International Conference on Information and Knowledge Management. ACM.

Van-Hoang Nguyen, Kazunari Sugiyama, Preslav Nakov, and Min-Yen Kan. 2020. FANG: Leveraging social context for fake news detection using graph representation. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management, pages 1165-1174. ACM.

OpenAI. 2022. ChatGPT: Optimizing language models for dialogue. https://openai.com/blog/chatgpt/. Accessed: 2023-08-13.

Kellin Pelrine, Meilina Reksoprodjo, Caleb Gupta, Joel Christoph, and Reihaneh Rabbany. 2023. Towards reliable misinformation mitigation: Generalization, uncertainty, and GPT-4. arXiv preprint arXiv:2305.14928v1.

Kashyap Popat, Subhabrata Mukherjee, Andrew Yates, and Gerhard Weikum. 2018. DeClarE: Debunking fake news and false claims using evidence-aware deep learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 22-32. ACL.

Piotr Przybyla. 2020. Capturing the style of fake news. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 490-497. AAAI Press.

Peng Qi, Juan Cao, Xirong Li, Huan Liu, Qiang Sheng, Xiaoyue Mi, Qin He, Yongbiao Lv, Chenyang Guo, and Yingchao Yu. 2021. Improving fake news detection by using an entity-enhanced framework to fuse diverse multimodal clues. In Proceedings of the 29th ACM International Conference on Multimedia, pages 1212-1220. ACM.

Sunil Ramlochan. 2023. Role-playing in large language models like ChatGPT. https://www.promptengineering.org/role-playing-in-large-language-models-like-chatgpt/. Accessed: 2023-08-13.

Yoel Roth. 2022. The vast majority of content we take action on for misinformation is identified proactively. https://twitter.com/yoyoel/status/1483094057471524867. Accessed: 2023-08-13.

Qiang Sheng, Juan Cao, Xueyao Zhang, Rundong Li, Danding Wang, and Yongchun Zhu. 2022. Zoom out and observe: News environment perception for fake news detection. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4543-4556. ACL.

Qiang Sheng, Xueyao Zhang, Juan Cao, and Lei Zhong. 2021. Integrating pattern- and fact-based fake news detection via model preference learning. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 1640-1650. ACM.

Kai Shu, Limeng Cui, Suhang Wang, Dongwon Lee, and Huan Liu. 2019. dEFEND: Explainable fake news detection. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 395-405. ACM.

Kai Shu, Deepak Mahudeswaran, Suhang Wang, Dongwon Lee, and Huan Liu. 2020. FakeNewsNet: A data repository with news content, social context and spatiotemporal information for studying fake news on social media. Big data, 8:171-188.

Kai Shu, Amy Sliva, Suhang Wang, Jiliang Tang, and Huan Liu. 2017. Fake news detection on social media: A data mining perspective. ACM SIGKDD Explorations Newsletter, 19:22-36.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Yaqing Wang, Fenglong Ma, Zhiwei Jin, Ye Yuan, Guangxu Xun, Kishlay Jha, Lu Su, and Jing Gao. 2018. EANN: Event adversarial neural networks for multi-modal fake news detection. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 849-857. ACM.

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. 2022a. Emergent abilities of large language models. Transactions on Machine Learning Research.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. 2022b. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824-24837. Curran Associates, Inc.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38-45, Online. ACL.

Canwen Xu, Yichong Xu, Shuohang Wang, Yang Liu, Chenguang Zhu, and Julian McAuley. 2023. Small models are valuable plug-ins for large language models. arXiv preprint arXiv:2305.08848.

Xueyao Zhang, Juan Cao, Xirong Li, Qiang Sheng, Lei Zhong, and Kai Shu. 2021. Mining dual emotion for fake news detection. In Proceedings of the web conference 2021, pages 3465-3476. ACM.

Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, and Shuming Shi. 2023. Siren’s song in the AI ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219.

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. 2023. A survey of large language models. arXiv preprint arXiv:2303.18223.

Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, and Dacheng Tao. 2023. Can ChatGPT understand too? a comparative study on ChatGPT and fine-tuned BERT. arXiv preprint arXiv:2302.10198.

Xinyi Zhou and Reza Zafarani. 2019. Network-based fake news detection: A pattern-driven approach. ACM SIGKDD Explorations Newsletter, 21(2):48-60.

Yongchun Zhu, Qiang Sheng, Juan Cao, Shuokai Li, Danding Wang, and Fuzhen Zhuang. 2022. Generalizing to the future: Mitigating entity bias in fake news detection. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2120-2125. ACM.

Case Analysis

Tables 6 and 7 present cases in the testing set. The former includes cases where at least one of the compared methods, the Baseline and the LLM, predicted correctly, while the latter includes those in the complementary set (i.e., neither of them predicted correctly).
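The partition behind Tables 6 and 7 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Case` record, its field names, and the 1 = fake / 0 = real encoding are hypothetical and not taken from the authors' code. The LLM counts as correct when either perspective-specific prediction (textual description or commonsense) matches the label, which mirrors how Cases 2 and 3 are grouped.

```python
from dataclasses import dataclass


@dataclass
class Case:
    # Hypothetical record; 1 = fake, 0 = real is an assumed encoding.
    label: int      # ground truth
    baseline: int   # Baseline prediction
    llm_td: int     # LLM prediction, textual description perspective
    llm_cs: int     # LLM prediction, commonsense perspective


def split_cases(cases):
    """Split cases into (at least one of Baseline/LLM correct, neither correct)."""
    at_least_one, neither = [], []
    for c in cases:
        baseline_ok = c.baseline == c.label
        llm_ok = c.llm_td == c.label or c.llm_cs == c.label
        if baseline_ok or llm_ok:
            at_least_one.append(c)
        else:
            neither.append(c)
    return at_least_one, neither
```

With the five cases reported in this appendix, the first group would hold Cases 1 to 3 and the second group Cases 4 and 5.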

In Table 6, Case 1 shows a situation where the Baseline predicted correctly while the LLM was wrong. The ARG remained unaffected by the LLM's potentially misleading rationales and kept the correct prediction. Cases 2 and 3 show the situation where the Baseline was incorrect while the LLM can be considered partially correct (i.e., only one of the two perspective-specific prompts yielded the correct prediction). In these cases, the ARG selected the correct rationale thanks to its adaptive selection capability, which led to the correct prediction.

In Case 4 in Table 7, the ARG correctly identified a case where both the Baseline and the LLM failed to give an accurate prediction, which exceeded our expectations. After a comparative analysis, we found that within the training set, phrases like “compact writing style” in textual description rationales often lead to incorrect judgments. We hypothesize that, guided by the signals from the rationale usefulness evaluator, the ARG recognized this pattern and was able to make a judgment contrary to the erroneous rationales and choose the correct prediction. In Case 5, by contrast, the ARG could not reverse the erroneous judgment as it did in Case 4. We believe this is because Case 5 lacks a clear pattern like the one in Case 4. Constrained by the natural language understanding performance of the BERT-base model, the ARG struggled to capture more complex logical relationships.

Prompting Examples

In this section, we showcase the prompting templates along with real examples so that readers can get familiar with the settings directly.

In Tables 8 and 9, we illustrate prompting examples for zero-shot prompting, zero-shot CoT prompting, zero-shot CoT prompting with perspective-specific prompts, few-shot prompting, and few-shot CoT prompting on the Chinese dataset.

For brevity, we show only one news pair, consisting of a real and a fake news item, as the demonstration for few-shot prompting and few-shot CoT prompting. In practice, we use multiple news pairs according to the required number of shots.

Because GPT-3.5-turbo is noticeably likely to refuse when prompts about fake news detection are used directly on the English dataset, we use the role-playing technique. Specifically, we build appropriate contexts in the prompt that place the LLM in a scene, which reduces the likelihood of refusal (Liu et al., 2023b; Ramlochan, 2023).

Case 1: Both the Baseline and the ARG predicted correctly, while the LLM did it wrongly.
News: Wow! A robot in Russia was caught heroically saving a playful little girl from being crushed by a shelf. This act went viral online, showing the power of the robot. Amazing!
LLM Rationale (Textual Description): This message uses exaggerated and emotional language, without providing any credible sources or evidence.
LLM Rationale (Commonsense): Based on common knowledge and experience, robots have not yet reached a level where they can act autonomously and save human lives.

Ground Truth: Real
Baseline Pred.: Real LLM TD Pred.: Fake LLM CS Pred.: Fake ARG Pred.: Real

Case 2: The Baseline predicted incorrectly, while the ARG did it correctly, and the LLM using the textual description perspective was also correct.
News: #NingboRevelation# [Ningbo Xiangshan fishermen catch a strange fish, nicknamed the ‘water panda’] #NingboFreshNews# At around 7:00 PM on March 2, the vessel ZXY02206, at coordinates [...] E, [...], caught a strange fish. After verification, the fish was identified as a young wild Chinese sturgeon, 64 centimeters long and weighing about 2 kilograms. At around 10:00 AM on March 3, this young fish was returned to the embrace of the sea.
LLM Rationale (Textual Description): This message uses relatively formal language and gives specific details about the time, the place, and the type of fish caught, including its size. It also mentions that the information has been verified.
LLM Rationale (Commonsense): Based on common knowledge and experience, the Chinese sturgeon is an endangered and rare species, which makes it unlikely to be commonly caught by ordinary fishermen.

Ground Truth: Real
Baseline Pred.: Fake LLM TD Pred.: Real LLM CS Pred.: Fake ARG Pred.: Real

Case 3: The Baseline predicted incorrectly, while the ARG did it correctly, and the LLM using the commonsense perspective was also correct.
News: After recovering, the British Prime Minister commutes to work by bicycle every day, even though he owns a second-hand Volkswagen. He is not allowed to use an official car for commuting, and since there is no fixed parking space at the Prime Minister's office, he chooses to cycle. Two bodyguards follow him in a car but often fall behind due to traffic congestion.
LLM Rationale (Textual Description): The style of this message is direct and clear, without excessive embellishment or exaggeration. The content described also appears realistic and credible.
LLM Rationale (Commonsense): Given the nature of the British Prime Minister's job and security considerations, the claim that he commutes to work by bicycle every day is not very credible.

Ground Truth: Fake
Baseline Pred.: Real LLM TD Pred.: Real LLM CS Pred.: Fake ARG Pred.: Fake

Table 6: Cases in the testing set where at least one of the Baseline and the LLM using perspective-specific prompting predicted correctly. TD/CS represents the textual description/commonsense perspective, respectively. Patterns in rationales are underlined and correct predictions are in bold italics.

Case 4: Both the Baseline and the LLM using perspective-specific prompting made incorrect predictions, while the ARG did it correctly.
News: According to Russian media reports, recently, Japan entrusted US military transport planes stationed in Japan to transport 2,000 test kits donated by our country to Japan. However, during the transportation process, the US suddenly informed Japan that these test kits had been transported to the US mainland. A US Air Force official later explained that due to a mistake, these materials were wrongly transported to the US mainland. The Japanese side does not accept the explanation of the US military official! Do you think the actions of the US military were really due to a mistake?
LLM Rationale (Textual Description): The style of the message is quite compact, with language that is concise and clear. There are no apparent grammatical or spelling errors, making it consistent with the style of news reporting.
LLM Rationale (Commonsense): Based on common experience and knowledge, it's not uncommon for mistakes to occur during the transportation of goods. Therefore, it's more likely that the actions of the U.S. military were due to an error rather than intentional.

Ground Truth: Fake
Baseline Pred.: Real LLM TD Pred.: Real LLM CS Pred.: Real ARG Pred.: Fake

Case 5: The Baseline, the LLM using perspective-specific prompting, and the ARG all made incorrect predictions.
News: #ShenzhenEpidemic# The Songgang area of Shenzhen has started restricting entry and exit. Everyone should pay attention to wearing masks properly! Protect yourselves! The epidemic is not over yet, and everyone must be very careful.
LLM Rationale (Textual Description): The message uses the hashtag “#ShenzhenEpidemic#”, which is in line with the common way of posting on social media. The tone is also quite formal, serving as a reminder for everyone to take protective measures and expressing concern and attention toward the epidemic situation.
LLM Rationale (Commonsense): Given the current global situation regarding the epidemic, various regions are implementing measures to control the spread of the virus. Therefore, Shenzhen may also adopt similar control measures. In addition, the message mentions wearing masks and protecting oneself, which is consistent with protective measures during the epidemic.

Ground Truth: Fake
Baseline Pred.: Real LLM TD Pred.: Real LLM CS Pred.: Real ARG Pred.: Real

Table 7: Cases in the testing set where both the Baseline and the LLM using perspective-specific prompting made incorrect predictions. TD/CS represents the textual description/commonsense perspective, respectively. Patterns in rationales are underlined and correct predictions are in bold italics.

(1) Zero-Shot Prompting

Q: Given the following message, predict its veracity. If it is more likely to be a real message, return 1; otherwise, return 0. Please refrain from providing ambiguous assessments such as undetermined: Detailed photos of Xiang Liu's tendon surgery exposed. Stop complaints and please show sympathy and blessings!
A: 1

(2) Zero-Shot CoT Prompting

Q: Given the following message, predict its veracity. If it is more likely to be a real message, return 1; otherwise, return 0. Please refrain from providing ambiguous assessments such as undetermined: Detailed photos of Xiang Liu's tendon surgery exposed. Stop complaints and please show sympathy and blessings!
A: Let's think step by step. Firstly, this message mentions that Xiang Liu underwent tendon surgery, which can be verified through other channels. Secondly, the message refers to the exposure of detailed photos, which can also be verified by searching related reports. Lastly, the message urges people to show sympathy and offer blessings, which is in line with common practice. Therefore, it can be considered that this message is more likely to be true information, and the response is 1. Therefore, the answer (Arabic numerals) is 1

(3) Zero-Shot CoT with Perspective-Specific Prompting (Textual Description)

Q: Given the following message, predict its veracity. If it is more likely to be a real message, return 1; otherwise, return 0. Please refrain from providing ambiguous assessments such as undetermined: Detailed photos of Xiang Liu's tendon surgery exposed. Stop complaints and please show sympathy and blessings!
A: Let's think from the perspective of textual description. This message uses emotional language to convey sympathy and blessings for Xiang Liu, while also mentioning the exposure of detailed photos of the surgery. Given that such a message is likely to attract public attention and engagement, it is more likely to be true information. Return 1.

(4) Zero-Shot CoT with Perspective-Specific Prompting (Commonsense)

Table 8: Zero-shot prompting, zero-shot CoT prompting, and zero-shot CoT with perspective-specific prompting templates, along with the news case and the responses from the LLM (i.e., GPT-3.5-turbo). Template prompts are in bold and the target news is in italics.

(5) Few-Shot Prompting

Q: Given the following message, predict its veracity. If it is more likely to be a real message, return 1; otherwise, return 0. Please refrain from providing ambiguous assessments such as undetermined: Detailed photos of Xiang Liu's tendon surgery exposed. Stop complaints and please show sympathy and blessings!

A: 1

(6) Few-Shot CoT Prompting

Q: Given the following message, predict its veracity. If it is more likely to be a real message, return 1; otherwise, return 0. Please refrain from providing ambiguous assessments such as undetermined: [Russian official reiterates denial: Transfer of land to China is fake news] On the 10th local time, the Russian Ministry of Foreign Affairs stated that there is no territorial dispute between Russia and China. The claim that Russia transferred 4.7 square kilometers of land to China is false. After the border between Russia and China was established, no changes to the demarcation have occurred. Previously, both the government of Hunchun, China, and the Russian Border Defense Ministry denied the news of the land transfer. http://t.cn/ruole6x
A: Let's think step by step. Firstly, the source of the message is a Weibo link, and we need to consider the credibility of this link. Secondly, the content of the message concerns Russia's denial of transferring land to China, and we need to assess whether this is consistent with reality. Lastly, we should check whether the message aligns with reports from other reliable media sources. Based on this analysis, the message is more likely to be true information. The Russian Ministry of Foreign Affairs is an official entity, which lends some credibility to its denial. In addition, the message is consistent with the denials from the government of Hunchun, China, and the Russian Border Defense Ministry. Therefore, return 1. Therefore, the answer (Arabic numeral) is: 1.
Q: Given the following message, predict its veracity. If it is more likely to be a real message, return 1; otherwise, return 0. Please refrain from providing ambiguous assessments such as undetermined: Breaking news: Tianyi Li's successful exoneration! News has just been received that the owner of the “rivuai160” bar in Sanlitun has been detained by the police. The authorities have launched an investigation into prostitution at the bar. Tianyi Li and five others are expected to change from defendants to witnesses and may be released as early as this week!
A: Let's think step by step. Firstly, this message claims that Tianyi Li's exoneration was successful, but it does not specify the case concerned and lacks any supporting evidence. Therefore, this part is unreliable. Secondly, the message states that the owner of the “rivuai160” bar in Sanlitun has been detained by the police and that the authorities have launched an investigation into prostitution at the bar. This part might be true, as police investigations into establishments suspected of involvement in prostitution are not uncommon. However, we cannot verify the source and reliability of this message, so it is not entirely reliable either. Lastly, the message states that Tianyi Li and five others are expected to change from defendants to witnesses and may be released as early as this week. This part also lacks specific evidence, making it unreliable. In conclusion, the credibility of this message is low, and it is more likely to be false. Therefore, it should return 0. Thus, the answer (Arabic numeral) is: 0.
Q: Given the following message, predict its veracity. If it is more likely to be a real message, return 1; otherwise, return 0. Please refrain from providing ambiguous assessments such as undetermined: Detailed photos of Xiang Liu's tendon surgery exposed. Stop complaints and please show sympathy and blessings!
A: Firstly, this message mentions that Xiang Liu underwent tendon surgery, which can be verified through other channels. Secondly, the message refers to the exposure of detailed photos, which can also be verified by searching related reports. Lastly, the message urges people to show sympathy and offer blessings, which is in line with common practice. Therefore, it can be considered that this message is more likely to be true information, and the response is 1.

Table 9: Few-shot prompting and few-shot CoT prompting templates, along with the news case and the responses from the LLM (i.e., GPT-3.5-turbo). Demonstrations are above the dotted line. Below the dotted line, template prompts are in bold and the target news is in italics.
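The responses in Tables 8 and 9 end with a 0/1 verdict embedded in free text, so some post-processing is needed to turn a response into a label. The paper does not describe that step here; the sketch below is one plausible approach, assuming the last standalone 0 or 1 in the response is the verdict. The function name `parse_label` and the regular expression are illustrative, not the authors' implementation.

```python
import re

# A standalone 0 or 1: not glued to a word character or a decimal point
# (so "3.5", "0.5" and "rivuai160" are ignored), but a sentence-final
# period right after the digit is allowed.
_VERDICT = re.compile(r"(?<![\w.])[01](?!\w|\.\d)")


def parse_label(response: str):
    """Return the last standalone 0/1 in the response, or None if there is none."""
    matches = _VERDICT.findall(response)
    return int(matches[-1]) if matches else None
```

A response with no standalone 0 or 1, such as a refusal, yields `None`, which lets the caller decide how to handle it (for example, retrying with a role-playing prompt).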

*Corresponding author.
The academic community currently lacks a consensus on the size boundary between small and large language models, but it is widely accepted that BERT (Devlin et al., 2019) and the GPT-3 family (Brown et al., 2020) are respectively small and large ones (Zhao et al., 2023).
https://github.com/ICTMCG/ARG
Note that a sample may be analyzed from multiple perspectives and thus the sum of proportions might be larger than 100%.
We exclude the factuality perspective to avoid the impacts of hallucination. The eliciting sentence is “Let's think from the perspective of [textual description/commonsense].”
For brevity, we omit the subscripts of all independently parametrized MLPs.
Since this paper focuses on text-based news, we use the text-only variant of the original EANN following (Sheng et al., 2021) and the publisher-emotion variant in (Zhang et al., 2021).
https://claude.ai/
https://yiyan.baidu.com/

Journal: Proceedings of the AAAI Conference on Artificial Intelligence, Volume: 38, Issue: 20
DOI: https://doi.org/10.1609/aaai.v38i20.30214
Publication Date: 2024-03-24

Bad Actor, Good Advisor: Exploring the Role of Large Language Models in Fake News Detection

Beizhe Hu, Qiang Sheng, Juan Cao, Yuhui Shi, Yang Li, Danding Wang, Peng Qi
Key Lab of Intelligent Information Processing of Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences
University of Chinese Academy of Sciences
National University of Singapore
{hubeizhe21s, shengqiang18z, caojuan, shiyuhui22s}@ict.ac.cn, {liyang23s, wangdanding}@ict.ac.cn, pengqi.qp@gmail.com

Abstract

Detecting fake news requires both a delicate sense of diverse clues and a profound understanding of the real-world background, which remains challenging for detectors based on small language models (SLMs) due to their knowledge and capability limitations. Recent advances in large language models (LLMs) have shown remarkable performance in various tasks, but whether and how LLMs could help with fake news detection remains underexplored. In this paper, we investigate the potential of LLMs in fake news detection. First, we conduct an empirical study and find that a sophisticated LLM such as GPT 3.5 could generally expose fake news and provide desirable multi-perspective rationales but still underperforms the basic SLM, fine-tuned BERT. Our subsequent analysis attributes such a gap to the LLM’s inability to select and integrate rationales properly to conclude. Based on these findings, we propose that current LLMs may not substitute fine-tuned SLMs in fake news detection but can be a good advisor for SLMs by providing multi-perspective instructive rationales. To instantiate this proposal, we design an adaptive rationale guidance network for fake news detection (ARG), in which SLMs selectively acquire insights on news analysis from the LLMs’ rationales. We further derive a rationale-free version of ARG by distillation, namely ARG-D, which serves cost-sensitive scenarios without querying LLMs. Experiments on two real-world datasets demonstrate that ARG and ARG-D outperform three types of baseline methods, including SLM-based, LLM-based, and combinations of small and large language models.

1 Introduction

The wide and fast spread of fake news online has posed real-world threats in critical domains like politics (Fisher et al., 2016), economy (CHEQ, 2019), and public health (Naeem and Bhatti, 2020). Among the countermeasures to combat this issue, automatic fake news detection, which aims at distinguishing inaccurate and intentionally misleading news items from others automatically, has been a promising solution in practice (Shu et al., 2017; Roth, 2022).

Figure 1: Illustration of the role of large language models (LLMs) in fake news detection. In this case, (a) the LLM fails to output correct judgment of news veracity but (b) helps the small language model (SLM) judge correctly by providing informative rationales.

Though much progress has been made (Hu et al., 2022a), understanding and characterizing fake news is still challenging for current models. This is caused by the complexity of the news-faking process: Fake news creators might manipulate any part of the news, using diverse writing strategies and being driven by inscrutable underlying aims. Therefore, to maintain both effectiveness and universality for fake news detection, an ideal method is required to have: 1) a delicate sense of diverse clues (e.g., style, facts, commonsense); and 2) a profound understanding of the real-world background.

Recent methods (Zhang et al., 2021; Kaliyar et al., 2021; Mosallanezhad et al., 2022; Hu et al., 2023) generally exploit pre-trained small language models (SLMs) like BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) to understand news content and provide fundamental representation, plus optional social contexts (Shu et al., 2019; Cui et al., 2022), knowledge bases (Popat et al., 2018; Hu et al., 2022b), or news environment (Sheng et al., 2022) as supplements. SLMs do bring improvements, but their knowledge and capability limitations also compromise further enhancement of fake news detectors. For example, BERT was pre-trained on text corpora like Wikipedia (Devlin et al., 2019) and thus struggles to handle news items that require knowledge not included (Sheng et al., 2021).

As a new alternative to SLMs, large language models (LLMs) (OpenAI, 2022; Anthropic, 2023; Touvron et al., 2023), which are usually trained on the larger-scale corpus and aligned with human preferences, have shown impressive emergent abilities on various tasks (Wei et al., 2022a) and are considered promising as general task solvers (Ma et al., 2023). However, the potential of LLMs in fake news detection remains underexplored:

Can LLMs help detect fake news with their internal knowledge and capability?
What solution should we adopt to obtain better performance using LLMs?
To answer these two questions, we first conduct a deep investigation of the effective role of LLMs in fake news detection and attempt to provide a practical LLM-involved solution. Unlike contemporary works (Pelrine et al., 2023; Caramancion, 2023) which simply prompt LLMs to provide predictions with the task instruction, we conduct a detailed empirical study to mine LLMs’ potential. Specifically, we use four typical prompting approaches (zero-shot/few-shot vanilla/chain-of-thought prompting) to ask the LLM to make veracity judgments of given news items (Figure 1(a)) and find that even the best-performing LLM-based method still underperforms task-specific fine-tuned SLMs. We then perform an analysis of the LLM-generated explanatory rationales and find that the LLM could provide reasonable and informative rationales from several perspectives. By subsequently inducing the LLM with perspective-specific prompts and performing rule-based ensembles of judgments, we find that rationales indeed benefit fake news detection, and attribute the unsatisfying performance to the LLM’s inability to select and integrate rationales properly to conclude.

	Chinese			English
	Train	Val	Test	Train	Val	Test
Real	2,331	1,172	1,137	2,878	1,030	1,024
Fake	2,873	779	814	1,006	244	234
Total	5,204	1,951	1,951	3,884	1,274	1,258

Table 1: Statistics of the fake news detection datasets.

Based on these findings, we propose that the current LLM may not be a good substitute for the well-fine-tuned SLM but could serve as a good advisor by providing instructive rationales, as presented in Figure 1(b). To instantiate our proposal, we design the adaptive rationale guidance (ARG) network for fake news detection, which bridges the small and large LMs by selectively injecting new insight about news analysis from the large LM’s rationales to the small LM. The ARG further derives the rationale-free ARG-D via distillation for cost-sensitive scenarios with no need to query LLMs. Experiments on two real-world datasets show that ARG and ARG-D outperform existing SLM/LLM-only and combination methods. Our contributions are as follows:

Detailed investigation: We investigate the effective role of LLMs in fake news detection and find the LLM is bad at veracity judgment but good at analyzing contents;
Novel and practical solution: We design a novel ARG network and its distilled version ARG-D that complements small and large LMs by selectively acquiring insights from LLM-generated rationales for SLMs, which has shown superiority based on extensive experiments;
Useful resource: We construct a rationale collection from GPT-3.5 for fake news detection in two languages (Chinese and English) and will make it publicly available to facilitate further research.

2 Is the LLM a Good Detector?

In this section, we evaluate the performance of the representative LLM, i.e., GPT-3.5, in fake news detection to reveal its judgment capability. We exploit four typical prompting approaches and perform a comparison with the SLM (here, BERT) fine-tuned on this task. Formally, given a news item $x$, the model aims to predict whether $x$ is fake or not.

Figure 2: Illustration of prompting approaches for LLMs.

2.1 Experimental Settings

Dataset We employ the Chinese dataset Weibo21 (Nan et al., 2021) and the English dataset GossipCop (Shu et al., 2020) for evaluation. Following existing works (Zhu et al., 2022; Mu et al., 2023), we preprocess the datasets with deduplication and temporal data split to avoid possible performance overrating led by data leakage for the SLM. Table 1 presents the dataset statistics.
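The preprocessing described above (deduplication followed by a temporal split) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the exact-match deduplication on case- and whitespace-normalized text, the dictionary keys `text` and `time`, and the cut-off dates passed in are all assumptions, since the paper only states that deduplication and a temporal split were applied.

```python
from datetime import date


def temporal_split(items, val_start, test_start):
    """Deduplicate news items, then split them chronologically.

    items: list of dicts with a 'text' string and a 'time' date (assumed schema).
    Items before val_start go to train, items from val_start up to (but not
    including) test_start go to validation, and the rest go to test.
    """
    seen, unique = set(), []
    for item in sorted(items, key=lambda it: it["time"]):
        key = " ".join(item["text"].lower().split())  # normalize case and whitespace
        if key in seen:
            continue  # keep only the earliest copy of a duplicated text
        seen.add(key)
        unique.append(item)

    train = [it for it in unique if it["time"] < val_start]
    val = [it for it in unique if val_start <= it["time"] < test_start]
    test = [it for it in unique if it["time"] >= test_start]
    return train, val, test
```

Splitting by time rather than at random keeps every test item later than every training item, which is what prevents the leakage mentioned above: the fine-tuned SLM cannot have seen near-copies of later events during training.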

Large Language Model We evaluate GPT-3.5-turbo, the LLM developed by OpenAI and supporting the popular chatbot ChatGPT (OpenAI, 2022), due to its representativeness and convenient calling. The large scale of parameters makes task-specific fine-tuning almost impossible for LLMs, so we use the prompt learning paradigm, where an LLM learns tasks given prompts containing instructions or few-shot demonstrations (Liu et al., 2023a). In detail, we utilize the following four typical prompting approaches to elicit the potential of the LLM in fake news detection (Figure 2):

Zero-Shot Prompting constructs a prompt containing only the task description and the given news. To make the response more proficient and decrease the refusal ratio, we optionally adopt the role-playing technique when describing our task (Liu et al., 2023b; Ramlochan, 2023).
Zero-Shot CoT Prompting (Kojima et al., 2022) is a simple and straightforward chain-of-thought (CoT) prompting approach to encourage the LLM to reason. In addition to the elements in zero-shot prompting, it adds an eliciting sentence such as “Let’s think step by step.”
Few-Shot Prompting (Brown et al., 2020) provides task-specific prompts and several news-label examples as demonstrations. After preliminary tests of different $k$-shot settings, we choose 4-shot prompting, which includes two real and two fake samples.
Few-Shot CoT Prompting (Wei et al., 2022b) not only provides news-label examples but also demonstrates reasoning steps with previously written rationales. Here, we obtain the provided rationale demonstrations from the correct and reasonable outputs of zero-shot CoT prompting.

Model	Usage	Chinese	English
GPT-3.5-turbo	Zero-Shot	0.676	0.568
	Zero-Shot CoT	0.677	0.666
	Few-Shot	0.725	0.697
	Few-Shot CoT	0.681	0.702
BERT	Fine-tuning	0.753	0.765

Table 2: Performance in macro F1 of the large and small LMs. The best two results are bolded and underlined, respectively. The relative increases over the second-best results are shown in the brackets.
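The four prompting approaches can be sketched as plain string templates. This is only an illustration: the task wording in `TASK` and the `News:`/`Reasoning:`/`Answer:` layout are hypothetical and are not the prompts used in the experiments; only the eliciting sentence is taken from the text.

```python
from typing import List, Tuple

# Hypothetical task wording; the paper's exact prompt text is not reproduced here.
TASK = "Decide whether the following news item is real or fake."
COT_TRIGGER = "Let's think step by step."  # eliciting sentence quoted in the text


def zero_shot(news: str) -> str:
    """Task description plus the given news."""
    return f"{TASK}\nNews: {news}\nAnswer:"


def zero_shot_cot(news: str) -> str:
    """Zero-shot elements followed by the reasoning trigger."""
    return f"{TASK}\nNews: {news}\n{COT_TRIGGER}"


def few_shot(news: str, demos: List[Tuple[str, str]]) -> str:
    """Prepend news-label demonstrations (e.g., two real and two fake)."""
    shots = "\n\n".join(f"News: {d}\nAnswer: {label}" for d, label in demos)
    return f"{TASK}\n\n{shots}\n\nNews: {news}\nAnswer:"


def few_shot_cot(news: str, demos: List[Tuple[str, str, str]]) -> str:
    """Demonstrations additionally carry a previously written rationale."""
    shots = "\n\n".join(
        f"News: {d}\nReasoning: {rationale}\nAnswer: {label}"
        for d, rationale, label in demos
    )
    return f"{TASK}\n\n{shots}\n\nNews: {news}\n{COT_TRIGGER}"
```

A role-playing variant would simply prepend a persona sentence to `TASK`; that choice is left out here because the text marks it as optional.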

Small Language Model We adopt the pre-trained small language model BERT (Devlin et al., 2019) as the representative, given its wide use in this task (Kaliyar et al., 2021; Zhu et al., 2022; Sheng et al., 2022). Specifically, we limit the maximum length of the text to 170 tokens and use chinese-bert-wwm-ext and bert-base-uncased from the Transformers package (Wolf et al., 2020) for the Chinese and English evaluation, respectively. We use Adam (Kingma and Ba, 2014) as the optimizer and do a grid search for the optimal learning rate. We report the testing result of the checkpoint that performs best on the validation set.

2.2 Comparison between Small and Large LMs

Table 2 presents the performance of GPT-3.5-turbo with four prompting approaches and the fine-tuned BERT on the two datasets. We observe that:
1) Though the LLM is generally believed to be powerful, it underperforms the fine-tuned SLM under all four prompting approaches. The SLM has a relative increase of roughly 4% to 11% in Chinese and 9% to 35% in English over the LLM, indicating that the LLM lacks the task-specific knowledge that the SLM learns during fine-tuning.
2) Few-shot versions outperform zero-shot ones, suggesting the importance of task samples. However, introducing several samples only narrows the gap with the SLM and does not close it.
3) CoT prompting brings additional performance gain in general, especially under the zero-shot setting on the English dataset (from 0.568 to 0.666 in macro F1). However, we also observe some cases where CoT leads to a decrease. This indicates that effective use of rationales may require more careful design.

Overall, given the LLM’s unsatisfying performance and higher inference costs than the SLM, the current LLM has not been a “good enough” detector to substitute task-specific SLMs in fake news detection.

Perspective	Chinese Prop.	Chinese macF1	English Prop.	English macF1
Textual Description	65%	0.706	71%	0.653
Commonsense	71%	0.698	60%	0.680
Factuality	17%	0.629	24%	0.626
Others		0.649		0.704

Case (Textual Description): News: Everyone! Don’t buy cherries anymore: Cherries of this year are infested with maggots, and nearly […] are affected. LLM Rationale: …The tone of the news is extremely urgent, seemingly trying to spread panic and anxiety. Prediction: Fake. Ground Truth: Fake.

Case (Commonsense): News: Huang, the chief of Du’an Civil Affairs Bureau, gets subsistence allowances of 509 citizens, owns nine properties, and has six wives… LLM Rationale: …The news content is extremely outrageous…Such a situation is incredibly rare in reality and even could be thought impossible. Prediction: Fake. Ground Truth: Fake.

Case (Factuality): News: The 18th National Congress has approved that individuals who are at least 18 years old are now eligible to marry… LLM Rationale: First, the claim that Chinese individuals at least 18 years old can register their marriage is real, as this is stipulated by Chinese law… Prediction: Real. Ground Truth: Fake.

Table 3: Analysis of different perspectives of LLM’s rationales in the sample set, including the data ratio, LLM’s performance, and cases. Prop.: Proportion.
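The comparison in Section 2.2 rests on two simple quantities: macro F1 (the unweighted mean of the per-class F1 scores) and the relative increase of one score over another. The sketch below computes both in plain Python; the function names are illustrative and not taken from the authors' code.

```python
from typing import List, Sequence


def f1_for_class(y_true: List[int], y_pred: List[int], cls: int) -> float:
    """F1 score when `cls` is treated as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true: List[int], y_pred: List[int], classes: Sequence[int] = (0, 1)) -> float:
    """Unweighted mean of per-class F1 (here: real = 0, fake = 1)."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


def relative_increase(new: float, old: float) -> float:
    """Relative gain of `new` over `old`, e.g. 0.10 means +10%."""
    return (new - old) / old
```

For instance, `relative_increase(0.666, 0.568)` gives the zero-shot CoT gain over zero-shot prompting on the English dataset, using the macro F1 values from Table 2.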

2.3 Analysis on the Rationales from the LLM

Though the LLM is bad at news veracity judgment, we also notice that the rationales generated through zero-shot CoT prompting exhibit a unique multi-perspective analytical capability that is challenging and rare for SLMs. For further exploration, we draw 500 samples from each of the two datasets and manually categorize them according to the perspectives from which the LLM performs the news analysis. Statistical results by perspectives and cases are presented in Table 3.

Model	Usage	Chinese	English
GPT-3.5-turbo	Zero-Shot CoT	0.677	0.666
	from Perspective TD	0.667	0.611
	from Perspective CS	0.678	0.698
BERT	Fine-tuning	0.753	0.765
Ensemble	Majority Voting	0.735	0.724
Ensemble	Oracle Voting	0.908	0.878

Table 4: Performance of the LLM using zero-shot CoT with perspective specified and other compared models. TD: Textual description; CS: Commonsense.

We see that: 1) The LLM is capable of generating human-like rationales on news content from various perspectives, such as textual description, commonsense, and factuality, which meets the requirement of a delicate sense of diverse clues and a profound understanding of the real-world background in fake news detection. 2) The detection performance on the subset using certain perspectives is higher than the zero-shot CoT result on the full testing set. This indicates the potential of analysis by perspectives, though the coverage is moderate. 3) The analysis from the perspective of factuality leads to lower-than-average performance, indicating the unreliability of using the LLM for factuality analysis based on its internal memorization. We speculate this is caused by the hallucination issue (Ji et al., 2023; Zhang et al., 2023).

We further investigate the LLM’s performance when asked to perform analysis from a specific perspective on the full testing set (i.e., 100% coverage).

From the first group in Table 4, we see that the LLM’s judgment with single-perspective analysis elicited is still promising. Compared with the comprehensive zero-shot CoT setting, the single-perspective-based LLM performs comparably on the Chinese dataset and is better on the English dataset (for the commonsense perspective case).

The results showcase that the internal mechanism of the LLM to integrate the rationales from diverse perspectives is ineffective for fake news detection, limiting the full use of rationales. In this case, combining the small and large LMs to complement each other is a promising solution: The former could benefit from the analytical capability of the latter, while the latter could be enhanced by task-specific knowledge from the former.

To exhibit the advantages of this solution, we apply majority voting and oracle voting (assuming the most ideal situation where we trust the correctly judged model for each sample, if any) among the two single-perspective-based LLMs and the BERT. Results show that we are likely to gain a performance better than any LLM-/SLM-only methods mentioned before if we could adaptively combine their advantages, i.e., the flexible task-specific learning of the SLM and the informative rationale generated by the LLM. That is, the LLM could possibly be a good advisor for the SLM by providing rationales, ultimately improving the performance of fake news detection.
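The two ensemble rules can be made concrete with a small sketch. It assumes binary labels (1 = fake, 0 = real) and three voters per sample (the two single-perspective LLM runs and BERT); the toy predictions are invented purely for illustration.

```python
from typing import List


def majority_vote(preds: List[int]) -> int:
    """Return the label (1 = fake, 0 = real) predicted by most models."""
    return int(2 * sum(preds) > len(preds))


def oracle_vote(preds: List[int], label: int) -> int:
    """Trust a correctly judging model if any exists; otherwise all are wrong."""
    return label if label in preds else preds[0]


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data (made up): each row holds [LLM-TD, LLM-CS, BERT] predictions.
labels = [1, 0, 1, 0]
model_preds = [[1, 0, 0], [0, 0, 1], [0, 0, 1], [1, 1, 1]]

majority = [majority_vote(p) for p in model_preds]
oracle = [oracle_vote(p, y) for p, y in zip(model_preds, labels)]
```

Oracle voting needs the ground-truth label, so it is an upper bound on what any adaptive combination of the three voters could reach rather than a usable detector.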

3 ARG: Adaptive Rationale Guidance Network for Fake News Detection

Based on the above findings and discussion, we propose the adaptive rationale guidance (ARG) network for fake news detection. Figure 3 overviews the ARG and its rationale-free version ARG-D, for cost-sensitive scenarios. The objective of ARG is to empower small fake news detectors with the ability to adaptively select useful rationales as references for final judgments. Given a news item $x$ and its corresponding LLM-generated rationales $r_t$ (textual description) and $r_c$ (commonsense), the ARG encodes the inputs using the SLM at first (Figure 3(a)). Subsequently, it builds news-rationale collaboration via predicting the LLM’s judgment through the rationale, enriching news-rationale feature interaction, and evaluating rationale usefulness (Figure 3(b)). The interactive features are finally aggregated with the news feature $\mathbf{x}$ for the final judgment of $x$ being fake or not (Figure 3(c)). ARG-D is derived from the ARG via distillation for scenarios where the LLM is unavailable (Figure 3(d)).

3.1 Representation

We employ two BERT models separately as the news and rationale encoders to obtain semantic representations. For the given news item $x$ and two corresponding rationales $r_t$ and $r_c$, the representations are $\mathbf{X}$, $\mathbf{R}_t$, and $\mathbf{R}_c$, respectively.

3.2 News-Rationale Collaboration

The step of news-rationale collaboration aims at providing a rich interaction between news and rationales and learning to adaptively select useful rationales as references, which is at the core of our design. To achieve such an aim, ARG includes three modules, as detailed and exemplified using the textual description rationale branch below:

3.2.1 News-Rationale Interaction

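To enable comprehensive information exchange between news and rationales, we introduce a news-rationale interactor with a dual cross-attention mechanism to encourage feature interactions. The cross-attention can be described as:

$$\mathrm{CA}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\left(\frac{\mathbf{Q}' \mathbf{K}'^{\top}}{\sqrt{d}}\right) \mathbf{V}',$$

where $\mathbf{Q}' = \mathbf{Q}\mathbf{W}_Q$, $\mathbf{K}' = \mathbf{K}\mathbf{W}_K$, $\mathbf{V}' = \mathbf{V}\mathbf{W}_V$, and $d$ is the dimensionality. Given representations of the news $\mathbf{X}$ and the rationale $\mathbf{R}_t$, the process is:

$$\mathbf{f}_{t \rightarrow x} = \mathrm{AvgPool}(\mathrm{CA}(\mathbf{R}_t, \mathbf{X}, \mathbf{X})),$$

$$\mathbf{f}_{x \rightarrow t} = \mathrm{AvgPool}(\mathrm{CA}(\mathbf{X}, \mathbf{R}_t, \mathbf{R}_t)),$$

where $\mathrm{AvgPool}(\cdot)$ is the average pooling over the token representations outputted by cross-attention to obtain a one-vector text representation $\mathbf{f}$.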
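The dual cross-attention followed by average pooling can be sketched in plain Python. This is a simplified illustration under stated assumptions: the learned projection matrices are omitted (treated as identity), there is a single head and no masking, and the tiny token matrices are invented.

```python
import math
from typing import List

Matrix = List[List[float]]  # one row per token


def softmax(row: List[float]) -> List[float]:
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(Q: Matrix, K: Matrix, V: Matrix) -> Matrix:
    """Scaled dot-product attention; learned projections omitted (assumption)."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out


def avg_pool(tokens: Matrix) -> List[float]:
    """Average the token vectors into a single text vector."""
    n = len(tokens)
    return [sum(tok[j] for tok in tokens) / n for j in range(len(tokens[0]))]


# Toy token representations (made up): 3 news tokens, 2 rationale tokens, d = 2.
news_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rationale_tokens = [[0.5, 0.5], [1.0, 0.0]]

# Rationale tokens query the news, and news tokens query the rationale.
pooled_rationale_queries = avg_pool(cross_attention(rationale_tokens, news_tokens, news_tokens))
pooled_news_queries = avg_pool(cross_attention(news_tokens, rationale_tokens, rationale_tokens))
```

Each direction returns one vector of dimensionality `d`, regardless of how many tokens the news or the rationale contains, which is what allows the later modules to work on fixed-size inputs.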

3.2.2 LLM Judgement Prediction

Understanding the judgment hinted by the given rationale is a prerequisite for fully exploiting the information behind the rationale. To this end, we construct the LLM judgment prediction task, whose requirement is to predict the LLM judgment of the news veracity according to the given rationale. We expect this to deepen the understanding of the rationale texts. For the textual description rationale branch, we feed its representation $\mathbf{R}_t$ into the LLM judgment predictor, which is parametrized using a multi-layer perceptron (MLP):

$$\hat{m}_t = \mathrm{sigmoid}(\mathrm{MLP}(\mathbf{R}_t)), \quad L_{pt} = \mathrm{CE}(\hat{m}_t, m_t),$$

where $m_t$ and $\hat{m}_t$ are respectively the LLM’s actual judgment (extracted from the response) and its prediction. The loss $L_{pt}$ is a cross-entropy loss $\mathrm{CE}(\hat{y}, y) = -y \log \hat{y} - (1 - y) \log (1 - \hat{y})$. The case is similar for the commonsense rationale $\mathbf{R}_c$.

Figure 3: Overall architecture of our proposed adaptive rationale guidance (ARG) network and its rationale-free version ARG-D. In the ARG, the news item and LLM rationales are (a) respectively encoded into $\mathbf{X}$ and $\mathbf{R}_t$, $\mathbf{R}_c$. Then the small and large LMs collaborate with each other via news-rationale feature interaction, LLM judgment prediction, and rationale usefulness evaluation, which yields the interactive features $\mathbf{f}'_{t \rightarrow x}$ and $\mathbf{f}'_{c \rightarrow x}$. These features are finally aggregated with the attentively pooled news feature $\mathbf{x}$ for the final judgment. In the ARG-D, the news encoder and the attention module are preserved and the output of the rationale-aware feature simulator is supervised by the aggregated feature $\mathbf{f}_{cls}$ for knowledge distillation.
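The auxiliary task can be illustrated with a short sketch: recover the LLM's judgment from its response, then score the predictor's output with binary cross-entropy. The keyword rule in `extract_judgment` is a naive, hypothetical stand-in for however the judgment is actually extracted, and the scalar `logit` stands in for the MLP output.

```python
import math


def extract_judgment(response: str) -> int:
    """Naive, hypothetical rule: 1 if the response calls the news fake, else 0."""
    return 1 if "fake" in response.lower() else 0


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(p: float, y: int, eps: float = 1e-12) -> float:
    """Cross-entropy between a predicted probability p and a binary target y."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def judgment_prediction_loss(logit: float, response: str) -> float:
    """Loss for predicting the LLM's own judgment from a rationale-derived logit."""
    target = extract_judgment(response)
    predicted = sigmoid(logit)
    return binary_cross_entropy(predicted, target)
```

Note that the target here is what the LLM concluded, not whether the news is actually fake; the ground-truth label only enters through the usefulness evaluation and the final classifier.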

3.2.3 Rationale Usefulness Evaluation

The usefulness of rationales from different perspectives varies across different news items, and improper integration may lead to performance degradation. To enable the model to adaptively select appropriate rationales, we devise a rationale usefulness evaluation process, in which we assess the contributions of different rationales and adjust their weights for subsequent veracity prediction. The process comprises two phases, i.e., evaluation and reweighting. For evaluation, we input the news-aware rationale vector $\mathbf{f}_{x \rightarrow t}$ into the rationale usefulness evaluator (parameterized by an MLP) to predict its usefulness $u_t$. Following the assumption that rationales leading to correct judgments are more useful, we use the judgment correctness as the rationale usefulness labels.

In the reweighting phase, we input vector $\mathbf{f}_{x \rightarrow t}$ into an MLP to obtain a weight number $w_t$, which is then used to reweight the rationale-aware news vector $\mathbf{f}_{t \rightarrow x}$. The procedure is as follows:

$$\hat{u}_t = \mathrm{sigmoid}(\mathrm{MLP}(\mathbf{f}_{x \rightarrow t})), \quad L_{et} = \mathrm{CE}(\hat{u}_t, u_t),$$

$$w_t = \mathrm{sigmoid}(\mathrm{MLP}(\mathbf{f}_{x \rightarrow t})), \quad \mathbf{f}'_{t \rightarrow x} = w_t \cdot \mathbf{f}_{t \rightarrow x}.$$

We also use attentive pooling to transform the representation matrix $\mathbf{X}$ into a vector $\mathbf{x}$.
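The two phases can be sketched as follows. The usefulness label follows the stated assumption (a rationale is useful when the LLM's judgment matches the ground truth); the scalar `score` is a hypothetical stand-in for the MLP output that produces the weight.

```python
import math
from typing import List


def usefulness_label(llm_judgment: int, ground_truth: int) -> int:
    """1 if the rationale led the LLM to the correct judgment, else 0."""
    return int(llm_judgment == ground_truth)


def rationale_weight(score: float) -> float:
    """Squash a raw score (stand-in for the MLP output) into a weight in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))


def reweight(vector: List[float], weight: float) -> List[float]:
    """Scale a feature vector by its rationale weight."""
    return [weight * v for v in vector]
```

A weight near zero effectively mutes a rationale for a given news item, so a misleading rationale can be down-weighted instead of being passed on to the classifier at full strength.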

3.3 Prediction

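Based on the outputs from the last step, we now aggregate the news vector $\mathbf{x}$ and the rationale-aware news vectors $\mathbf{f}'_{t \rightarrow x}$ and $\mathbf{f}'_{c \rightarrow x}$ for the final judgment. For news item $x$ with label $y \in \{0, 1\}$, we aggregate these vectors with different weights:

$$\mathbf{f}_{cls} = w_x^{cls} \cdot \mathbf{x} + w_t^{cls} \cdot \mathbf{f}'_{t \rightarrow x} + w_c^{cls} \cdot \mathbf{f}'_{c \rightarrow x},$$

where $w_x^{cls}$, $w_t^{cls}$, and $w_c^{cls}$ are learnable parameters ranging from 0 to 1. $\mathbf{f}_{cls}$ is the fusion vector, which is then fed into the MLP classifier for final prediction of news veracity:

$$L_{ce} = \mathrm{CE}(\mathrm{MLP}(\mathbf{f}_{cls}), y).$$

The total loss function is the weighted sum of the loss terms mentioned above:

$$L = L_{ce} + \beta_1 L_{et} + \beta_2 L_{pt},$$

where $\beta_1$ and $\beta_2$ are hyperparameters.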
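The aggregation and the weighted loss can be written out as a short sketch. The function names are illustrative, the weights are passed in as plain numbers rather than learned, and the auxiliary losses are given as a list so the same helper works for any number of auxiliary terms.

```python
from typing import List


def fuse(
    news_vec: List[float],
    td_vec: List[float],
    cs_vec: List[float],
    w_news: float,
    w_td: float,
    w_cs: float,
) -> List[float]:
    """Weighted sum of the news vector and the two reweighted rationale vectors."""
    return [
        w_news * a + w_td * b + w_cs * c
        for a, b, c in zip(news_vec, td_vec, cs_vec)
    ]


def total_loss(classification_loss: float, aux_losses: List[float], betas: List[float]) -> float:
    """Classification loss plus a weighted sum of auxiliary loss terms."""
    return classification_loss + sum(b * l for b, l in zip(betas, aux_losses))
```

Keeping the classification loss unweighted fixes its scale, so the hyperparameters only control how strongly the auxiliary tasks pull on the shared encoders.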

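	Model	Chinese				English
		macF1	Acc.	$\mathrm{F1}_{\mathrm{real}}$	$\mathrm{F1}_{\mathrm{fake}}$	macF1	Acc.	$\mathrm{F1}_{\mathrm{real}}$	$\mathrm{F1}_{\mathrm{fake}}$
G1: LLM-Only	GPT-3.5-turbo	0.725	0.734	0.774	0.676	0.702	0.813	0.884	0.519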
G2: SLM-Only	Baseline	0.753	0.754	0.769	0.737	0.765	0.862	0.916	0.615
	EANN	0.754	0.756	0.773	0.736	0.763	0.864	0.918	0.608
	Publisher-Emo	0.761	0.763	0.784	0.738	0.766	0.868	0.920	0.611
	ENDEF	0.765	0.766	0.779	0.751	0.768	0.865	0.918	0.618
G3: LLM+SLM	Baseline + Rationale	0.767	0.769	0.787	0.748	0.777	0.870	0.921	0.633
	SuperICL	0.757	0.759	0.779	0.734	0.736	0.864	0.920	0.551
	ARG	0.784	0.786	0.804	0.764	0.790		0.926	0.653
	(Relative Impr. over Baseline)
	w/o LLM Judgment Predictor	0.773	0.774	0.789	0.756		0.880	0.928	0.645
	w/o Rationale Usefulness Evaluator		0.783	0.801	0.761	0.782	0.873	0.923	0.641
	w/o Predictor & Evaluator	0.769	0.770	0.782	0.756	0.780	0.874	0.923	0.637
	ARG-D

Table 5: Performance of the ARG and its variants and the LLM-only, SLM-only, LLM+SLM methods. The best two results in macro F1 and accuracy are respectively bolded and underlined. For GPT-3.5-turbo, the best results in Table 2 are reported.

3.4 Distillation for Rationale-Free Model

The ARG requires sending requests to the LLM for every prediction, which might not be affordable for cost-sensitive scenarios. Therefore, we attempt to build a rationale-free model, namely ARG-D, based on the trained ARG model via knowledge distillation (Hinton et al., 2015). The basic idea is to simulate and internalize the knowledge from rationales into a parametric module. As shown in Figure 3(d), we initialize the news encoder and classifier with the corresponding modules in the ARG and train a rationale-aware feature simulator (implemented with a multi-head transformer block) and an attention module to internalize knowledge. Besides the cross-entropy loss $L_{ce}$, we let the feature $\mathbf{f}^{d}_{cls}$ imitate $\mathbf{f}_{cls}$ in the ARG, using the mean squared error loss:

$$L_{kd} = \mathrm{MSE}(\mathbf{f}_{cls}, \mathbf{f}^{d}_{cls}).$$
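The feature-imitation objective can be sketched in a few lines. The weighting factor `lam` is a hypothetical name for the distillation loss weight, and the vectors are plain Python lists standing in for the student and teacher fusion features.

```python
from typing import List


def mse(student: List[float], teacher: List[float]) -> float:
    """Mean squared error between the student and teacher feature vectors."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


def distillation_loss(
    classification_loss: float,
    student_feature: List[float],
    teacher_feature: List[float],
    lam: float,
) -> float:
    """Classification loss plus a weighted feature-imitation term."""
    return classification_loss + lam * mse(student_feature, teacher_feature)
```

The teacher feature comes from the trained rationale-based model and is treated as a fixed target, so at inference time the student needs only the news text and no LLM query.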

4 Evaluation

4.1 Experimental Settings

Baselines We compare three groups of methods:
G1 (LLM-Only): We list the performance of the best-performing setting on each dataset in Table 2, i.e., few-shot in Chinese and few-shot CoT in English.
G2 (SLM-Only): 1) Baseline: The vanilla BERT-base model whose setting remains consistent with that in Section 2. 2) EANN (Wang et al., 2018): A model that learns effective signals using auxiliary adversarial training, aiming at removing event-related features as much as possible. We used publication year as the label for the auxiliary task. 3) Publisher-Emo (Zhang et al., 2021): A model that fuses a series of emotional features with textual features for fake news detection. 4) ENDEF (Zhu et al., 2022): A model that removes entity bias via causal learning for better generalization on distribution-shifted fake news data. All methods in this group used the same BERT as the text encoder.
G3 (LLM+SLM): 1) Baseline+Rationale: It concatenates features from the news encoder and rationale encoder and feeds them into an MLP for prediction. 2) SuperICL (Xu et al., 2023): It exploits the SLM as a plug-in for the in-context learning of the LLM by injecting the prediction and the confidence for each testing sample into the prompt.

Implementation Details We use the same datasets introduced in Section 2 and keep the setting the same in terms of the pre-trained model, learning rate, and optimization method. For the ARG-D network, the parameters of the news encoder and classifier are derived from the ARG model. A four-head transformer block is implemented in the rationale-aware feature simulator. The weights of the loss functions $L_{et}$ and $L_{pt}$ in the ARG and $L_{kd}$ in the ARG-D are grid searched.

4.2 Performance Comparison and Ablation Study

Table 5 presents the performance of our proposed ARG and its variants and the compared methods. From the results, we observe that: 1) The ARG outperforms all other compared methods in macro F1, demonstrating its effectiveness. 2) The rationale-free ARG-D still outperforms all compared methods except ARG and its variants, which shows the positive impact of the distilled knowledge from ARG. 3) The two compared LLM+SLM methods exhibit different performance. The simple combination of features of news and rationale yields a performance improvement, showing the usefulness of our prompted rationales. SuperICL outperforms the LLM-only method but fails to consistently outperform the baseline SLM on the two datasets. We speculate that this is due to the complexity of our task, where injecting the prediction and confidence of an SLM does not bring sufficient information. 4) We run three ablation groups to assess the effectiveness of different modules in the ARG network. From the results, we can see that removing either the LLM Judgment Predictor or the Rationale Usefulness Evaluator brings a significant decrease in ARG performance, highlighting the significance of these two structures. Besides, we found that even the weakest one among the variants of ARG still outperforms all other methods, which shows the importance of the news-rationale interaction structure we designed.

4.3 Result Analysis

To investigate which part the additional gain of the ARG(-D) should be attributed to, we perform statistical analysis on the additional correctly judged samples of ARG(-D) compared with the vanilla BERT. From Figure 4, we observe that: 1) The proportions of the overlapping samples between ARG(-D) and the LLM are over […], indicating that the ARG(-D) can exploit (and absorb) the valuable knowledge for judgments from the LLM, even though the LLM’s own performance is unsatisfying. 2) The samples correctly judged by the LLM from both perspectives contribute the most, suggesting that more diverse rationales may enhance the ARG(-D)’s training. 3) […] and […] of correct judgments should be attributed to the model itself. We speculate that it produces some kinds of “new knowledge” based on the wrong judgments of the given knowledge.

For analysis of success and failure cases and additional analysis, please refer to the appendix.

Figure 4: Statistics of additional correctly judged samples of (a) ARG and (b) ARG-D over the BERT baseline. right$(\cdot)$ denotes samples correctly judged by the method $(\cdot)$. TD/CS: Textual description/commonsense perspective.

Figure 5: Performance as the shifting threshold changes.

4.4 Cost Analysis in Practice

We showcase a possible model-shifting strategy to balance the performance and cost in practical systems. Inspired by Ma et al. (2023), we simulate the situation where we use the more economical ARG-D by default but query the more powerful ARG for part of the data. As presented in Figure 5, by sending only […] of the data (according to the confidence of ARG-D) to the ARG, we could achieve 0.784 in macro F1, which is the same as the performance of fully using the ARG.
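A confidence-based shifting rule of this kind can be sketched as follows. The confidence definition (the larger of the two class probabilities) and the function name are assumptions for illustration; the text only states that routing follows the confidence of ARG-D.

```python
from typing import List, Tuple


def shift_models(
    student_probs: List[float], teacher_preds: List[int], threshold: float
) -> Tuple[List[int], float]:
    """Use the cheap model unless its confidence falls below `threshold`.

    student_probs: probability of "fake" from the rationale-free model.
    teacher_preds: labels from the rationale-based model, consulted only for
    low-confidence items (which is where the LLM cost would be incurred).
    Returns the final labels and the fraction of items sent to the teacher.
    """
    final: List[int] = []
    sent = 0
    for p, teacher_label in zip(student_probs, teacher_preds):
        confidence = max(p, 1.0 - p)
        if confidence < threshold:
            final.append(teacher_label)
            sent += 1
        else:
            final.append(int(p >= 0.5))
    return final, sent / len(student_probs)
```

Sweeping `threshold` from 0.5 to 1.0 traces a cost-performance curve of the kind shown in Figure 5: at 0.5 nothing is forwarded, and as the threshold approaches 1.0 nearly every item is handled by the more expensive model.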

5 Related Work

Fake News Detection Fake news detection is generally formulated as a binary classification task between real and fake news items. Research on this task can be roughly categorized into two groups: social-context-based and content-based methods. Methods in the first group aim at differentiating fake and real news during the diffusion procedure by observing the propagation patterns (Zhou and Zafarani, 2019), user feedback (Min et al., 2022), and social networks (Nguyen et al., 2020). The second group focuses on finding hints based on the given content, including text (Przybyla, 2020) and images (Qi et al., 2021), and may require extra assistance from knowledge bases (Popat et al., 2018) and news environments (Sheng et al., 2022). Both groups of methods conventionally obtain textual representations from pre-trained models like BERT but rarely consider the potential of such models for fake news detection itself. We conducted an exploration in this paper by combining large and small LMs and obtained good improvement using only textual content.

LLMs for Natural Language Understanding

LLMs, though mostly generative models, also have powerful natural language understanding (NLU) capabilities, especially in the few-shot in-context learning scenarios (Brown et al., 2020). Recent works in this line focus on benchmarking the latest LLM in NLU. Results show that LLMs may not have comprehensive superiority compared with a well-trained small model in some types of NLU tasks (Zhong et al., 2023; Kocoń et al., 2023). Our results provide empirical findings in fake news detection with only textual content as the input.

6 Conclusion and Discussion

We investigated whether large LMs help in fake news detection and how to properly utilize their advantages for improving performance. Results show that the large LM (GPT-3.5) underperforms the task-specific small LM (BERT), but could provide informative rationales and complement small LMs in news understanding. Based on these findings, we designed the ARG network to flexibly combine the respective advantages of small and large LMs and developed its rationale-free version ARG-D for cost-sensitive scenarios. Experiments showed the superiority of the ARG and ARG-D.

Discussion Our findings in fake news detection exemplify the current barrier for LLMs to be competent in applications closely related to the sophisticated real-world background. Though having superior analyzing capability, LLMs may struggle to make full use of their internal capability. This suggests that “mining” their potential may require novel prompting techniques and a deeper understanding of their internal mechanisms. We then identified the possibility of combining small and large LMs to earn additional improvement and provided a solution especially suitable for situations where the better-performing models have to “select good to learn” from worse ones. We expect our solution to be extended to other tasks and foster more effective and cost-friendly use of LLMs in the future.

Limitations We identify the following limitations: 1) We do not examine other well-known LLMs (e.g., Claude and Ernie Bot) because their APIs were unavailable to us when conducting this research; 2) We only consider the perspectives summarized from the LLM’s response, and there might be other prompting perspectives based on a conceptualization framework of fake news; 3) Our best results still fall behind the oracle voting integration of multi-perspective judgments in Table 4, indicating that room remains for further performance improvements along this line.

Acknowledgements

The authors would like to thank the anonymous reviewers for their insightful comments. This work is supported by the National Natural Science Foundation of China (62203425), the Zhejiang Provincial Key Research and Development Program of China (2021C01164), the Project of Chinese Academy of Sciences (E141020), the Postdoctoral Fellowship Program of CPSF (GZC20232738) and the CIPSC-SMP-Zhipu.AI Large Model Cross-Disciplinary Fund.