Topic modelling through the bibliometrics lens and its technique

Journal: Artificial Intelligence Review, Volume: 58, Issue: 3
DOI: https://doi.org/10.1007/s10462-024-11011-x
Publication Date: 2025-01-06
Author(s): Bayode Ogunleye et al.
Primary Topic: Computational and Text Analysis Methods

Overview

The research paper provides a comprehensive bibliometric analysis of topic modeling (TM) within the context of natural language processing (NLP), utilizing a dataset of 16,941 documents from the Scopus database spanning from 2004 to 2023. The study reveals a significant increase in TM publications over the years, although the citation impact has declined. The analysis indicates that TM research is predominantly concentrated in China and the USA, with a notable prevalence of conference papers compared to journal articles, particularly since 2019. Key subject areas contributing to TM research include computer science, engineering, social sciences, and mathematics. The findings highlight several underexplored applications of TM, such as image segmentation and classification, and emphasize the need for multidisciplinary approaches to enhance TM’s deployment in various fields.

The paper concludes by summarizing its contributions, which include a detailed overview of TM research trends, identification of research gaps, and a proposed conceptual framework for stakeholders. Recommendations for future research emphasize the adoption of advanced TM algorithms, such as neural topic models, for diverse applications including medical record analysis and event detection. The authors advocate for the integration of TM with bibliometric analysis to better understand research themes and encourage collaboration across disciplines to optimize the use of TM in literature synthesis. Future studies are suggested to explore additional databases to validate findings and enhance the overall understanding of TM’s impact and potential.

Introduction

The introduction of the paper discusses the evolution and significance of topic modeling (TM) in the context of big data, particularly with advancements in technology such as the internet and social media. TM, a statistical learning approach for identifying coherent topics within documents, has its roots in early algorithms like latent semantic analysis and probabilistic latent semantic analysis. The limitations of these early models, particularly in handling short texts, led to the development of more sophisticated methods such as latent Dirichlet allocation (LDA) and the biterm topic model (BTM). BTM addresses the data sparsity of short texts by modeling co-occurring word pairs (biterms) rather than individual documents. Recent trends have seen the rise of neural topic models, exemplified by BERTopic, which combines transformer-based embeddings with clustering techniques.
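To make the word-pair idea behind BTM concrete, the following is a minimal sketch of biterm extraction only, not of BTM inference. It assumes each short text has already been tokenized and treats the whole text as a single context window; the function names and the toy documents are hypothetical.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered word pair (biterm) from one short text."""
    # Sort each pair so ("model", "topic") and ("topic", "model") are the same biterm.
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def count_biterms(documents):
    """Pool biterm counts over the whole corpus, as BTM does before inference."""
    counts = Counter()
    for tokens in documents:
        counts.update(extract_biterms(tokens))
    return counts


# Toy corpus (illustrative only, not data from the paper).
docs = [
    ["topic", "model", "short", "text"],
    ["short", "text", "sparsity"],
]
biterm_counts = count_biterms(docs)
```

Pooling pairs across the corpus is what eases sparsity: a single short text contributes few word occurrences, but the corpus-level biterm counts still capture which words tend to appear together.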

The paper highlights the diverse applications of TM across various domains, including sentiment analysis, recommender systems, and document summarization, as well as its utility in extracting coherent topics from specific datasets like customer complaints and legal documents. Despite the growing interest in TM, there remains a gap in bibliometric analyses of the field, particularly in light of recent developments in large language models. This study aims to fill this gap by conducting a bibliometric analysis of TM research, utilizing the Scopus database to identify trends, applications, and future research directions. The structure of the paper is outlined, with subsequent sections dedicated to methodology, results, and conclusions.

Methods

The methodology of this study involved the analysis of data sourced from the Scopus database, covering the period from 2004 to 2023. Scopus was selected for its extensive bibliometric coverage, which encompasses a wide range of subjects, countries, languages, and publishers, thus serving as an optimal resource for examining the global structure of scientific research (Bahroun et al. 2023; Vagelas and Leontopoulos 2023). The database is recognized for its comprehensive and curated representation of scientific publications, making it invaluable for researchers.
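As an illustration of how such a retrieval could be expressed, the sketch below assembles a Scopus advanced-search string from the keywords reported later in this summary and the 2004 to 2023 window. The summary does not give the authors' exact search string, so the query shape is an assumption; `build_scopus_query` is a hypothetical helper, while `TITLE-ABS-KEY` and `PUBYEAR` are standard Scopus field codes.

```python
def build_scopus_query(keywords, start_year, end_year):
    """Build a Scopus advanced-search string for quoted phrases in a year range."""
    # Quote each keyword so Scopus treats it as a phrase, then OR the phrases together.
    phrases = " OR ".join(f'"{kw}"' for kw in keywords)
    # PUBYEAR comparisons are strict, so widen the bounds by one year on each side.
    return (
        f"TITLE-ABS-KEY({phrases}) "
        f"AND PUBYEAR > {start_year - 1} AND PUBYEAR < {end_year + 1}"
    )


# Keywords and years as reported in this summary; the real query may differ.
query = build_scopus_query(
    ["topic model", "topic discovery", "topic extraction"], 2004, 2023
)
```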

The research employed two primary techniques for synthesizing the retrieved documents. First, bibliometric indicators were used to assess research impact and evolutionary trends within the field. Second, a topic modeling (TM) approach was applied to complement the bibliometric analysis by tracing the pathways of TM research and identifying emerging themes. The detailed data collection and analysis steps are described in subsequent sections.
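Two of the simplest bibliometric indicators are annual output and citations per paper. The sketch below shows one way to compute them, assuming each record is reduced to a `(year, citation_count)` pair; the records are made up for illustration and are not taken from the paper.

```python
from collections import defaultdict


def yearly_indicators(records):
    """Return papers per year and mean citations per paper for each year."""
    papers = defaultdict(int)
    citations = defaultdict(int)
    for year, cites in records:
        papers[year] += 1
        citations[year] += cites
    return {
        year: {
            "papers": papers[year],
            "citations_per_paper": citations[year] / papers[year],
        }
        for year in sorted(papers)
    }


# Hypothetical records: output grows each year while citations per paper fall.
records = [
    (2021, 10), (2021, 4),
    (2022, 3), (2022, 1), (2022, 2),
    (2023, 1), (2023, 0), (2023, 1), (2023, 0),
]
indicators = yearly_indicators(records)
```

Recent papers have had less time to accumulate citations, so a falling citations-per-paper series should be read alongside publication age rather than as a quality signal on its own.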

Results

In the Results section, the study presents a two-fold analysis covering bibliometric and topic modeling results. The topic modeling results, derived from a Latent Dirichlet Allocation (LDA) model, focus on determining the optimal number of topics (K). Coherence scores peaked at approximately 0.4 for between 9 and 12 topics, and a much larger setting of around 90 topics reached a similar coherence level. The authors ultimately selected 9 topics as the optimal configuration because it was more practical and produced distinct, interpretable topics.
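The choice between several near-equal coherence peaks can be written as a simple parsimony rule: among the values of K whose coherence is within a small tolerance of the best score, take the smallest. This is a sketch of one possible rule, not the authors' stated procedure; `select_num_topics`, the tolerance, and the coherence values are all hypothetical and only mimic the shape of the curve described above.

```python
def select_num_topics(coherence_by_k, tolerance=0.01):
    """Pick the smallest K whose coherence is within `tolerance` of the best score."""
    best = max(coherence_by_k.values())
    candidates = [
        k for k, score in coherence_by_k.items() if best - score <= tolerance
    ]
    # Prefer fewer topics when coherence is effectively tied.
    return min(candidates)


# Hypothetical coherence curve: about 0.4 at K = 9 to 12 and again near K = 90.
coherence_by_k = {3: 0.31, 6: 0.36, 9: 0.40, 12: 0.40, 30: 0.35, 60: 0.37, 90: 0.40}
chosen_k = select_num_topics(coherence_by_k)
```

A rule like this only narrows the candidates; a manual check that the resulting topics are distinct and interpretable is still needed.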

The findings underscore the importance of coherence in topic modeling, suggesting that a balance between the number of topics and their interpretability is crucial for effective analysis. This selection not only enhances the clarity of the topics generated but also facilitates deeper insights into the underlying themes present in the data. The integration of both bibliometric and topic modeling analyses aims to provide comprehensive insights and actionable recommendations based on the synthesized results.

Discussion

This section discusses the data collection, quality assessment, bibliometric analysis, and topic modeling of TM research published from 2004 to 2023. The search used keywords such as “topic model,” “topic discovery,” and “topic extraction,” which returned an initial pool of 17,580 documents; screening reduced this to 16,941 relevant publications. An inter-rater reliability assessment using Cohen’s kappa yielded a value of 0.84, indicating substantial agreement between raters on the relevance of the selected documents. The bibliometric analysis revealed a significant increase in TM publications over the years, with 2,249 documents published in 2023. Citation impact has nevertheless decreased, likely because publication volume has grown faster than citations.
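Cohen's kappa corrects the raw agreement rate for the agreement two raters would reach by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected from each rater's label frequencies. The sketch below computes it for two raters; the screening labels are invented for illustration and do not reproduce the paper's 0.84.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Compute Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # and returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical screening decisions for ten documents.
rater_a = ["relevant"] * 6 + ["irrelevant"] * 4
rater_b = ["relevant"] * 5 + ["irrelevant"] * 5
kappa = cohens_kappa(rater_a, rater_b)
```

In this toy case the raters agree on 9 of 10 documents, but half of that agreement is expected by chance, so kappa comes out well below the raw 0.9 agreement rate.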

The study employed Latent Dirichlet Allocation (LDA) for topic modeling, identifying nine distinct topics across the dataset. The evaluation metrics indicated that the LDA model produced high-quality, interpretable topics, with the first topic focusing on the development and application of algorithms, accounting for 54.39% of the documents. The findings highlight the evolution of TM research, the dominance of certain publication types, and the leading contributors in terms of both authorship and citation impact, particularly emphasizing the contributions of notable figures such as David Blei. The co-occurrence analysis further elucidated the relationships among keywords, revealing five thematic clusters that encapsulate the current trends and applications of TM in various domains, including natural language processing and public health.
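Keyword co-occurrence analysis starts from a count of how often two keywords appear on the same paper. The sketch below shows only that counting step, assuming each paper is represented by its list of author keywords; the keyword lists are illustrative, and the clustering that groups keywords into thematic clusters is a separate step not shown here.

```python
from collections import Counter
from itertools import combinations


def keyword_cooccurrence(keyword_lists):
    """Count how many papers list each unordered pair of keywords together."""
    counts = Counter()
    for keywords in keyword_lists:
        # Normalize case and whitespace, and drop repeats within one paper.
        unique = sorted({kw.strip().lower() for kw in keywords})
        counts.update(combinations(unique, 2))
    return counts


# Illustrative keyword lists (not taken from the paper's dataset).
papers = [
    ["Topic modelling", "LDA", "Natural language processing"],
    ["LDA", "Public health", "topic modelling"],
    ["Natural language processing", "LDA"],
]
cooc = keyword_cooccurrence(papers)
```

The resulting pair counts are the edge weights of a keyword network, which a community detection or clustering method can then partition into themes.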

Limitations

The study acknowledges several limitations that may affect the comprehensiveness of its findings. First, the reliance on the Scopus database, while beneficial, does not capture the entire literature on topic modeling (TM), so relevant contributions may have been omitted. Second, the focus on journal articles and conference papers excludes other valuable sources, which may leave gaps in the information presented. Finally, the bibliometric techniques employed give an overview of the most productive areas within the TM literature rather than a holistic view of the field.

Despite these constraints, the study serves as a significant foundation for understanding the current landscape of TM applications and identifies key areas for future research. It suggests that subsequent investigations could leverage the findings to explore TM across various sectors, organizations, or regions, or expand the scope by incorporating data from open databases such as OpenAlex or Google Scholar.