AI-powered topic modeling: comparing LDA and BERTopic in analyzing opioid-related cardiovascular risks in women

Journal: Experimental Biology and Medicine, Volume: 250
DOI: https://doi.org/10.3389/ebm.2025.10389
PMID: https://pubmed.ncbi.nlm.nih.gov/40093658
Publication Date: 2025-02-28
Author(s): Li Ma et al.
Primary Topic: Topic Modeling

Overview

The authors review advances in topic modeling within natural language processing (NLP), focusing on the limitations of traditional methods such as Latent Dirichlet Allocation (LDA) in capturing semantic relationships. They introduce BERTopic, a method released in 2022 that uses deep learning to model contextual relationships between words. The study integrates AI modules into both LDA and BERTopic to analyze prescription opioid-related cardiovascular risks in women, using a dataset of 1,837 abstracts retrieved from PubMed.

The authors used the Machine Learning for Language Toolkit (MALLET) to implement LDA and BioBERT for document embedding in BERTopic, selecting 18 and 21 as the optimal topic numbers, respectively. ChatGPT-4-Turbo was integrated to interpret and compare the results, and the topic descriptions it generated for the two methods were highly correlated. Expert reviews indicated similar accuracy for LDA and BERTopic, but t-SNE plots showed that BERTopic produced more compact and distinct clusters, suggesting better coherence in topic representation. The findings underscore the potential of AI algorithms to enhance both traditional and modern topic modeling techniques, with BERTopic offering automated interpretation through integration with large language models.

Introduction

The introduction addresses the escalating opioid epidemic in the United States, noting that in 2023 approximately 8.6 million individuals aged 12 and older misused prescription opioids, with over 5 million experiencing a prescription opioid use disorder. The paper reports a sharp increase in opioid-related deaths, particularly among women, who have seen a 642% rise in fatalities from prescription opioid overdoses since 1999. Women also face distinct cardiovascular risks associated with opioid use, underscoring the need for targeted research in this area.

The authors emphasize the role of natural language processing (NLP) and topic modeling as vital tools for extracting insights from extensive biomedical literature. Traditional methods like Latent Dirichlet Allocation (LDA) have been effective but struggle with the complexities of modern datasets, particularly in medical contexts. In contrast, BERTopic, developed in 2022, utilizes transformer-based models such as BERT to capture contextual relationships between words, offering a modular approach that enhances flexibility and interpretability. The study aims to compare AI-integrated LDA and BERTopic using a curated dataset of biomedical abstracts related to prescription opioids and cardiovascular issues in women, focusing on their effectiveness in uncovering themes and handling the complexities of biomedical texts.

Methods

The study is computational rather than laboratory-based. The authors assembled a corpus of 1,837 PubMed abstracts on opioids and cardiovascular risk in women, curated and preprocessed it so that only relevant English-language abstracts remained, and fitted two topic models to it: LDA implemented with MALLET, with the number of topics chosen by a Rate of Perplexity Change (RPC)-based method, and BERTopic using BioBERT document embeddings.

ChatGPT-4-Turbo was then used to interpret and compare the topics from the two models. Model quality was assessed through expert review of topic relevance, UMass coherence scores, and t-SNE plots of the topic clusters. The extracted text does not give further detail on the preprocessing steps or model hyperparameters.
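The LDA topic count in this study was chosen with a Rate of Perplexity Change (RPC)-based method. The summary does not state the exact rule, so the following is only a minimal sketch of one commonly described formulation: compute the absolute change in perplexity per added topic between consecutive candidate topic counts, then pick the first candidate whose RPC is lower than that of the next candidate. The function names and the perplexity values are assumptions made for illustration; they are not taken from the paper.

```python
def rate_of_perplexity_change(topic_counts, perplexities):
    """Return |delta perplexity / delta topics| for consecutive candidates.

    The k-th returned value belongs to topic_counts[k + 1].
    """
    if len(topic_counts) != len(perplexities):
        raise ValueError("topic_counts and perplexities must have equal length")
    rpc = []
    for i in range(1, len(topic_counts)):
        step = topic_counts[i] - topic_counts[i - 1]
        if step <= 0:
            raise ValueError("topic_counts must be strictly increasing")
        rpc.append(abs(perplexities[i] - perplexities[i - 1]) / step)
    return rpc


def select_topic_count(topic_counts, perplexities):
    """Pick the first candidate whose RPC is smaller than the next one's.

    Returns None when no such turning point exists in the candidates.
    """
    rpc = rate_of_perplexity_change(topic_counts, perplexities)
    for k in range(len(rpc) - 1):
        if rpc[k] < rpc[k + 1]:
            return topic_counts[k + 1]
    return None


# Invented perplexity values, for illustration only (not from the study).
candidate_counts = [5, 10, 15, 20, 25]
example_perplexities = [1200.0, 1000.0, 900.0, 880.0, 840.0]
print(select_topic_count(candidate_counts, example_perplexities))  # prints 20
```

In practice the perplexities would come from models fitted at each candidate topic count on held-out data; the selection rule itself is independent of which LDA implementation produced them.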

Results

In this study, the authors explored integrating AI, specifically ChatGPT-4-Turbo, with two NLP topic modeling techniques, MALLET-based Latent Dirichlet Allocation (LDA) and BERTopic with BioBERT embeddings, to enhance topic modeling. The dataset comprised 1,837 PubMed abstracts on opioid-related cardiovascular risks in women, and the goal was to evaluate how effectively these combined methods extract meaningful themes from specialized biomedical texts.

The findings indicate that the incorporation of AI algorithms significantly improves the context-aware analysis of text data compared to traditional rule-based and statistical models. The study highlights the transformative potential of AI in refining topic modeling techniques, suggesting that the proposed approach can be adapted for various topic modeling algorithms through manual implementation or automated programming interfaces. The workflow for this integration is visually represented in Figure 2 of the paper.
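As a rough illustration of how topic-model output might be handed to a large language model for interpretation, the sketch below turns a topic's top keywords into a text prompt. It stops short of calling any API, so the same prompt could be pasted into a chat interface manually or sent programmatically. The function name, the prompt wording, and the example keywords are all hypothetical; the summary does not reproduce the prompts used in the study.

```python
def build_topic_prompt(topic_id, keywords,
                       domain="opioid-related cardiovascular risks in women"):
    """Format a topic's top keywords as a prompt asking for a short description.

    The wording is an illustrative assumption, not the study's actual prompt.
    """
    if not keywords:
        raise ValueError("keywords must not be empty")
    keyword_list = ", ".join(keywords)
    return (
        f"You are assisting with a literature review on {domain}.\n"
        f"Topic {topic_id} from a topic model has these top keywords: "
        f"{keyword_list}.\n"
        "Write a one-sentence description of the theme these keywords suggest."
    )


# Hypothetical keywords, for illustration only.
prompt = build_topic_prompt(3, ["opioid", "arrhythmia", "methadone", "qt"])
print(prompt)
```

Keeping prompt construction separate from the model call makes it easy to apply the same wording to topics from either LDA or BERTopic, which is what allows the generated descriptions to be compared across methods.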

Discussion

In this study, the authors applied two AI-integrated topic modeling approaches, Latent Dirichlet Allocation (LDA) implemented with MALLET and BERTopic, to PubMed abstracts on opioid-related cardiovascular risks in women. The dataset was curated and preprocessed so that only relevant English-language abstracts were included. The LDA model, with its topic count selected by a Rate of Perplexity Change (RPC)-based method, generated 18 coherent topics, while BERTopic, using BioBERT for document embedding, produced 21 contextually rich topics. BERTopic yielded more specific and clinically relevant themes and showed better topic coherence and separation, as evidenced by higher UMass coherence scores and clearer visual separation in t-SNE plots.
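For readers unfamiliar with the UMass coherence score mentioned above, the following is a minimal sketch of the standard definition: for a topic's top words ordered by rank, sum log((D(w_i, w_j) + 1) / D(w_j)) over every pair in which w_j ranks above w_i, where D counts the documents containing the given word or words. Scores are at most slightly above zero in practice, and values closer to zero indicate words that co-occur more often. This is not the authors' code; the function name and the toy corpus are assumptions, and some libraries average over pairs rather than summing.

```python
import math


def umass_coherence(top_words, documents, smoothing=1.0):
    """Compute summed UMass coherence for one topic.

    top_words: topic words ordered from most to least probable.
    documents: iterable of tokenized documents (lists of words).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing every word in `words`.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            w_i, w_j = top_words[i], top_words[j]
            d_j = doc_freq(w_j)
            if d_j == 0:
                raise ValueError(f"word {w_j!r} does not occur in the corpus")
            score += math.log((doc_freq(w_i, w_j) + smoothing) / d_j)
    return score


# Toy corpus, invented for illustration only.
docs = [
    ["opioid", "arrhythmia"],
    ["opioid", "arrhythmia"],
    ["opioid", "pain"],
]
print(umass_coherence(["opioid", "arrhythmia"], docs))  # co-occur in 2 of 3 docs
print(umass_coherence(["opioid", "pain"], docs))        # co-occur in 1 of 3 docs
```

Because the score depends on the reference corpus and on how many top words are used, coherence values are only comparable between models when both are evaluated on the same documents with the same number of top words.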

Expert evaluations of topic relevance revealed comparable accuracy between the two models, with both effectively capturing themes pertinent to opioid-related cardiovascular issues. However, BERTopic’s streamlined integration with ChatGPT-4-Turbo facilitated a more user-friendly experience, reducing the need for extensive manual adjustments. The findings underscore the potential of combining traditional natural language processing techniques with advanced AI tools to enhance topic modeling performance, suggesting promising avenues for future research in biomedical applications. The study advocates for thoughtful integration of AI to improve efficiency and accuracy in text mining, while also highlighting the necessity of domain expertise in maximizing the benefits of such technologies.