“My AI is Lying to Me”: User-reported LLM hallucinations in AI mobile apps reviews

Journal: Scientific Reports, Volume: 15, Issue: 1
DOI: https://doi.org/10.1038/s41598-025-15416-8
PMID: https://pubmed.ncbi.nlm.nih.gov/40830185
Publication Date: 2025-08-19
Author(s): Rhodes Massenon et al.
Primary Topic: Misinformation and Its Impacts

Overview

This research paper investigates user-reported hallucinations generated by Large Language Models (LLMs) in AI-powered mobile applications. Drawing on 3 million user reviews from 90 diverse apps, the study takes a mixed-methods approach, including a heuristic-based User-Reported LLM Hallucination Detection algorithm, to identify and characterize these reports. Approximately 1.75% of the reviews flagged for AI errors indicate hallucinations, and the authors develop a taxonomy of seven hallucination types. The most frequently reported type is “Factual Incorrectness,” accounting for 38% of instances, followed by “Nonsensical/Irrelevant Output” at 25% and “Fabricated Information” at 15%. Linguistic analysis shows that reviews reporting hallucinations have significantly lower sentiment scores and star ratings, underscoring the damage to user experience and trust.
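As a quick check on the reported shares, the arithmetic below shows how much of the taxonomy the three named types cover. It is only a sketch: it assumes each hallucination instance is assigned exactly one of the seven types, so that the shares sum to 100%, which this summary does not state explicitly.

```python
# Shares reported in the overview (percent of hallucination instances).
reported_shares = {
    "Factual Incorrectness": 38,
    "Nonsensical/Irrelevant Output": 25,
    "Fabricated Information": 15,
}
TOTAL_TYPES = 7  # size of the taxonomy, as stated in the overview

top_three = sum(reported_shares.values())
# Assumption: one type per instance, so all seven shares sum to 100.
remaining_share = 100 - top_three
remaining_types = TOTAL_TYPES - len(reported_shares)

print(f"Top three types: {top_three}% of instances")
print(f"Other {remaining_types} types combined: {remaining_share}%")
```

Under that assumption, the four types not named in this summary together account for the remainder of the instances.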

The study concludes by emphasizing the importance of understanding user perceptions of LLM hallucinations for improving software quality assurance in AI applications. The developed taxonomy and identified linguistic patterns can inform the creation of more effective monitoring tools and quality assurance processes. Future research directions include developing automated detection methods for hallucinations using supervised machine learning, conducting larger cross-platform and cross-lingual studies, and exploring in-app feedback mechanisms for reporting AI errors. Addressing user-perceived hallucinations is crucial for enhancing the reliability and trustworthiness of AI in mobile applications.

Methods

The research employs an empirical, mixed-methods approach to investigate user-reported hallucinations in reviews of AI-powered mobile applications. The study systematically gathers user feedback in order to develop a taxonomy of hallucination types qualitatively and to analyze their prevalence and characteristics quantitatively. This methodology addresses four key research questions: the prevalence of user-reported LLM hallucinations (RQ1), the types of hallucinations reported (RQ2), the characteristics of these reports (RQ3), and the implications for software quality assurance in AI mobile applications (RQ4).

The research design begins with targeted data selection and collection, followed by an initial filtering stage that uses a heuristic-based algorithm to identify candidate reviews. Manual annotation then verifies the candidates and builds the taxonomy of hallucinations, and a quantitative analysis of the confirmed reports follows. This staged design is intended to capture both the nature of LLM hallucinations and their impact on user experiences with AI applications.
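The filtering stage can be illustrated with a minimal sketch. This summary does not reproduce the study's actual heuristics, so the cue phrases, the `Review` type, and the `is_candidate` and `filter_candidates` functions below are hypothetical stand-ins, not the authors' algorithm.

```python
import re
from dataclasses import dataclass

# Hypothetical cue patterns; the study's real rule set is not given here.
AI_TERMS = re.compile(r"\b(ai|bot|chatbot|assistant|gpt)\b", re.IGNORECASE)
ERROR_CUES = re.compile(
    r"\b(wrong|incorrect|made up|makes up|nonsense|lying|lies|fake|inaccurate)\b",
    re.IGNORECASE,
)


@dataclass
class Review:
    app: str
    stars: int
    text: str


def is_candidate(review: Review) -> bool:
    """Flag a review that mentions the AI feature together with an error cue."""
    has_ai_term = AI_TERMS.search(review.text) is not None
    has_error_cue = ERROR_CUES.search(review.text) is not None
    return has_ai_term and has_error_cue


def filter_candidates(reviews):
    """Return the reviews that would go on to manual annotation."""
    return [r for r in reviews if is_candidate(r)]


# Invented example reviews, for illustration only.
reviews = [
    Review("app_a", 1, "The AI gave me a wrong date and a made up source."),
    Review("app_b", 5, "Love the new dark theme."),
    Review("app_c", 2, "The chatbot keeps lying about my order."),
]
candidates = filter_candidates(reviews)
print([r.app for r in candidates])
```

A filter of this kind favors recall over precision: it only narrows the pool, and the manual annotation step decides which candidates actually describe a hallucination.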

Results

The results section details the empirical findings from the analysis of user reviews of AI-powered mobile applications, focusing on user-reported hallucinations associated with large language models (LLMs). The findings are organized around the first three research questions: the prevalence of user-reported LLM hallucinations (RQ1), the types of hallucinations identified (RQ2), and the characteristics of the reviews that include these reports (RQ3).

The analysis shows how often hallucinations are reported and what form they take, grouping them into recurring categories. The characteristics of the reviews, such as sentiment and star rating, provide context for how users experience and perceive these hallucinations, and so help explain the impact of LLMs in mobile applications.

Discussion

In this section, the authors discuss the methodology and findings related to user-reported hallucinations in AI-powered mobile applications. The research involved a systematic collection of user reviews from 90 selected apps across various categories, focusing on apps that integrate substantial large language model (LLM) functionality. Filtering for relevance yielded 20,000 candidate reviews, from which a final sample of 1,000 reviews was manually annotated to identify instances of hallucinations. The study found that approximately 1.75% of the reviews flagged for AI errors contained clear reports of hallucinations, pointing to an infrequent but critical issue that undermines user trust, particularly in “Generative AI Tools” and “AI Educational Apps.”

The authors developed a user-centric taxonomy that categorizes hallucinations into types such as “Factual Incorrectness,” “Nonsensical/Irrelevant Output,” and “Fabricated Information.” This classification helps software engineers create targeted test cases for specific user-reported issues. The analysis also revealed that reviews reporting hallucinations had significantly lower star ratings and strongly negative sentiment, indicating that perceived inaccuracies are major drivers of user dissatisfaction. The findings suggest actionable implications for software engineering practice, such as implementing self-correction mechanisms in LLMs to mitigate the most common failure modes identified in the study. Overall, the research emphasizes the importance of understanding user experiences and perceptions in improving the reliability of AI applications and user trust in them.
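The star-rating comparison described above can be sketched as a simple difference in group means. The ratings below are invented for illustration and are not data from the study, and the helper `mean_stars` is a hypothetical name. The sketch also omits the significance testing that a claim of "significantly lower" would require.

```python
from statistics import mean

# Toy annotated sample: (star rating, reports_hallucination).
# These values are invented for illustration only.
annotated = [
    (1, True), (2, True), (1, True),
    (4, False), (5, False), (3, False), (2, False),
]


def mean_stars(rows, flagged):
    """Average star rating of reviews whose hallucination flag equals `flagged`."""
    ratings = [stars for stars, is_hallucination in rows if is_hallucination == flagged]
    return mean(ratings)


halluc_mean = mean_stars(annotated, True)
other_mean = mean_stars(annotated, False)
gap = other_mean - halluc_mean

print(f"Hallucination reviews: {halluc_mean:.2f} stars")
print(f"Other reviews:         {other_mean:.2f} stars")
print(f"Gap:                   {gap:.2f} stars")
```

In a monitoring setting, a persistent gap of this kind between flagged and unflagged reviews could serve as a rough signal that hallucination reports deserve priority triage.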