Deep learning and sentence embeddings for detection of clickbait news from online content

Journal: Scientific Reports, Volume: 15, Issue: 1
DOI: https://doi.org/10.1038/s41598-025-97576-1
PMID: https://pubmed.ncbi.nlm.nih.gov/40246954
Publication Date: 2025-04-17
Author(s): Amara Muqadas et al.
Primary Topic: Misinformation and Its Impacts

Overview

The study addresses the challenge of detecting clickbait in online news headlines, focusing on content in the Urdu language. With the proliferation of user-generated content, the authenticity of information is increasingly compromised by sensational headlines designed to attract clicks. Existing studies have predominantly examined clickbait detection in English using Natural Language Processing (NLP) techniques; this study is novel in applying it to Urdu. The authors propose a deep learning approach that feeds sentence embeddings into a Bi-LSTM architecture, achieving an accuracy of 88% in identifying clickbait. This surpasses the traditional machine learning baselines, which included decision trees, support vector machines, and ensemble methods: logistic regression reached 73% accuracy and XGBoost reached 78%.

The findings underscore the importance of advanced feature representation methods such as sentence embeddings, which capture semantic meaning effectively without extensive pretraining on large datasets. This is particularly relevant for low-resource languages like Urdu. The study also points to future research that incorporates transformer-based models such as BERT and GPT, with the aim of improving multilingual clickbait detection by analyzing linguistic features such as idioms and metaphors. Fine-tuning pre-trained models to capture complex syntactic and semantic patterns could significantly improve the robustness of clickbait detection across languages.

Methods

The proposed methodology for detecting clickbait in Urdu news headlines follows a structured four-stage approach. The first stage is data collection, in which relevant Urdu news headlines are gathered to form the dataset. The second stage is data preprocessing, which cleans and normalizes the data to improve its quality for analysis.

The third stage is feature extraction, in which key characteristics of the headlines are identified and quantified to support model training. In the fourth stage, various machine learning (ML) and deep learning (DL) models are applied to classify each headline as clickbait or non-clickbait. The framework is shown in Figure 3 of the paper, which illustrates how the stages connect.
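As a rough illustration of the feature-extraction stage, the sketch below computes TF-IDF weights, one of the feature types the study reports using, in plain Python. The whitespace tokenizer, the smoothed IDF formula, the function names, and the toy English headlines are all assumptions made for illustration; the paper's actual Urdu preprocessing and TF-IDF settings are not given in this summary.

```python
import math
from collections import Counter


def tokenize(headline):
    # Assumption: lowercase + whitespace split. Real Urdu text would need
    # a language-specific tokenizer and normalization step.
    return headline.lower().split()


def tfidf_vectors(headlines):
    """Return (vocabulary, TF-IDF vectors) for a list of headlines."""
    docs = [tokenize(h) for h in headlines]
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of headlines that contain each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed inverse document frequency (one common variant, assumed here).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty headlines
        # Term frequency (relative count) times IDF, one entry per vocab term.
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors


# Hypothetical toy headlines, not taken from the paper's dataset.
toy_headlines = [
    "you will not believe this trick",
    "parliament passes budget bill",
    "this trick will shock you",
]
vocab, vectors = tfidf_vectors(toy_headlines)
```

Each headline becomes a fixed-length numeric vector aligned with `vocab`, which is the form a classifier such as logistic regression or SVM expects as input.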

Results

The Results section analyzes the proposed framework for detecting clickbait in Urdu news headlines using various machine learning and deep learning models. The dataset comprises 1,000 headlines, evenly split between clickbait and non-clickbait categories. Initial exploratory data analysis with word clouds highlights the most frequent terms, which helps characterize the dataset's linguistic properties. The machine learning classifiers, including Support Vector Machine (SVM), Logistic Regression (LR), K-Nearest Neighbors (KNN), and Decision Tree (DT), are evaluated using TF-IDF and Part-of-Speech (POS) tagging features. SVM, LR, and KNN each achieve an accuracy of 73%, with LR and KNN showing the highest precision and F1 scores, while DT underperforms across all metrics.
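The metrics named above (accuracy, precision, recall, F1) can be computed from true and predicted labels as in the following minimal sketch. The function name and the label lists are hypothetical and do not reproduce any result from the paper.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 1 = clickbait, 0 = non-clickbait.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
scores = binary_metrics(y_true, y_pred)
```

On a balanced dataset such as the one described here, accuracy is a reasonable headline figure, while precision and recall show whether errors are skewed toward false alarms or missed clickbait.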

The study further explores ensemble learning models and finds that XGBoost outperforms the other machine learning classifiers in clickbait detection, making effective use of both TF-IDF and POS features. Deep learning models, specifically Long Short-Term Memory (LSTM) and Bidirectional LSTM (Bi-LSTM), are also assessed, with Bi-LSTM consistently outperforming LSTM because it captures context in both directions. Sentence embeddings yield the highest classification performance, with an accuracy of 88% and balanced precision and recall. Overall, the findings underscore the importance of feature representation for model performance: sentence embeddings prove superior to traditional word embeddings such as Word2Vec and GloVe, particularly for Urdu text classification.
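To make the idea of bidirectional context concrete, the sketch below runs a recurrence over a sequence in both directions and pairs the two hidden states at each position. It deliberately uses a simplified scalar tanh recurrence rather than a real LSTM cell, and the weights, function names, and toy input are assumptions for illustration only; it is not the authors' model.

```python
import math


def rnn_pass(inputs, w_in, w_rec):
    """Run a scalar tanh recurrence and return the hidden state per step."""
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states


def bidirectional_states(inputs, fwd_weights=(0.8, 0.5), bwd_weights=(0.6, 0.4)):
    """Pair forward and backward hidden states for every position."""
    forward = rnn_pass(inputs, *fwd_weights)
    # Process the reversed sequence, then flip the states back so that
    # index i of `backward` lines up with token i of the input.
    backward = rnn_pass(inputs[::-1], *bwd_weights)[::-1]
    return list(zip(forward, backward))


# Toy scalar "token features" (hypothetical).
sequence = [1.0, 0.0, 0.0, -1.0]
states = bidirectional_states(sequence)

# Changing only the last token leaves the forward state at position 0
# untouched but changes the backward state there.
altered = bidirectional_states([1.0, 0.0, 0.0, 1.0])
```

A unidirectional model sees only the left context at each position; the backward pass is what lets the representation of an early word reflect words that come later in the headline.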

Discussion

The discussion highlights the growing interest in clickbait detection, particularly for the Urdu language, which has been underrepresented in a literature focused primarily on English. Previous studies have applied various machine learning (ML) and deep learning (DL) approaches with varying degrees of success. For instance, Daoud et al. achieved 78% precision using linear SVM and logistic regression, while Ahmad et al. reported that BERT outperformed other models with the highest accuracy on diverse datasets. Other notable contributions include ensemble methods and feature extraction techniques, such as TF-IDF and part-of-speech tagging, which have been shown to enhance model performance significantly.

The paper also emphasizes the importance of data collection and preprocessing for effective clickbait detection. A dataset of 1,000 Urdu news headlines was curated and labeled as clickbait or non-clickbait, with a reported inter-annotator agreement of 100%. Preprocessing steps included lemmatization, stemming, and tokenization, tailored to the complexities of the Urdu language. Feature extraction techniques, particularly TF-IDF and various word embedding methods, were used to convert the text into numerical representations suitable for ML and DL models. The findings suggest that combining these methods can improve accuracy in identifying clickbait content, contributing useful insights to natural language processing for under-resourced languages.