Validating Digital Scribes: A Scoping Review of Evaluation Practices and Clinical Use

Journal: Journal of Medical Systems, Volume: 50, Issue: 1
DOI: https://doi.org/10.1007/s10916-026-02392-3
PMID: https://pubmed.ncbi.nlm.nih.gov/42026370
Publication Date: 2026-04-24
Author(s): Ekin Kerimoğlu et al.
Primary Topic: Art Therapy and Mental Health

Overview

This scoping review investigates the validation methods and clinical readiness of digital scribes that utilize automatic speech recognition (ASR) and large language models (LLMs) to generate clinical documentation from patient-provider interactions. The findings indicate that most digital scribe systems are in the early stages of development, typically categorized as Technology Readiness Levels (TRL) 3 and 4, with limited progression towards full workflow integration. The review highlights the heterogeneity in validation methods across studies, which predominantly rely on simulated or retrospective data, thereby limiting the ability to draw robust comparisons regarding clinical performance.

The review identifies three motivational frames (human-, performance-, and system-oriented) that shape evaluation practices and expectations about outcomes. Although digital scribes may improve documentation efficiency, the lack of standardized validation frameworks and prospective real-world studies hinders their effective integration into clinical practice. The authors emphasize the need for comprehensive evaluations covering technical accuracy, clinical outcomes, usability, and workflow integration, in order to bridge the gap between commercial deployment and scientific validation and to ensure safe and effective use in clinical settings.

Introduction

The introduction discusses the significant burden of clinical documentation on healthcare providers, which can consume up to half of their working hours and contribute to burnout and errors. Medical scribes have traditionally alleviated this burden, but recent advancements in generative artificial intelligence (AI), particularly large language models (LLMs), present a novel solution through the development of digital scribes. These digital scribes utilize automatic speech recognition (ASR) to capture patient-provider conversations and generate coherent clinical documentation, potentially enhancing communication and reducing the reliance on screens, which can detract from relational quality in clinical settings.

Despite the promise of digital scribes, the validation of these systems remains limited and inconsistent, with varying methodologies and a scarcity of real-world studies. Concerns about bias, misrepresentation, and the impact on clinical communication further complicate their integration into healthcare workflows. This scoping review aims to synthesize current evidence on the development, validation, and integration of digital scribes that combine ASR and LLMs, assessing their technical and clinical performance. By doing so, it seeks to identify opportunities and challenges in the adoption of these technologies, emphasizing the need for careful evaluation to ensure they enhance rather than hinder patient-provider interactions.

Methods

In this section, the authors describe the methodology employed in conducting a scoping review to evaluate the evidence surrounding digital scribes in clinical settings, adhering to the PRISMA-ScR guidelines. The review aims to summarize existing research, identify gaps in the literature, and suggest future research directions. Notably, the authors highlight the absence of a standardized validation framework across the studies reviewed, which limits the robustness of the findings. Various validation methods were utilized, including comparative evaluations of clinician-written versus AI-generated summaries, user experience questionnaires, and qualitative assessments through interviews and focus groups. However, none of the studies incorporated patient perspectives, which is a significant oversight.
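One way to make the comparison between clinician-written and AI-generated notes concrete is a simple token-overlap score. The summary does not say which metrics the included studies used, so the sketch below is only an illustration: the function names `tokenize` and `overlap_scores` and the two example notes are hypothetical, and unigram overlap is a stand-in for whatever measures a given study actually applied.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_scores(reference, candidate):
    """Unigram precision, recall, and F1 of a candidate note against a reference.

    Assumption: the reference is the clinician-written note and the
    candidate is the AI-generated note. This is an illustrative metric,
    not one reported by the review.
    """
    ref = Counter(tokenize(reference))
    cand = Counter(tokenize(candidate))
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref & cand).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example notes, invented for illustration only.
clinician_note = "Patient reports chest pain for two days"
ai_note = "Patient reports chest pain since yesterday"

scores = overlap_scores(clinician_note, ai_note)
print(round(scores["f1"], 3))  # 0.615 (four shared tokens: 8/13)
```

A surface-overlap score like this says nothing about clinical correctness or omissions that matter to care, which is one reason the review also points to questionnaires, interviews, and focus groups as complementary methods.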

A critical finding is the lack of real-world validation studies for digital scribes, with most evaluations conducted in controlled environments that do not reflect the complexities of clinical practice. The authors argue that while technical validation is essential for early development, it often neglects the diverse challenges encountered in real-world settings, such as patient variability and differing documentation styles. They emphasize the need for adaptable tools that can cater to various clinical contexts and highlight that perceived benefits, such as reduced documentation burden, may not be sustainable across different roles and specialties. The authors call for the adoption of established digital health validation frameworks, like CONSORT-AI and DECIDE-AI, to enhance the consistency and accountability of evaluations, ensuring that usability, accuracy, and clinical relevance are adequately addressed in future studies.

Results

The results section reports a systematic search that identified 3,181 articles, of which 176 remained after title and abstract screening and 16 were included after full-text review. These studies, published between 2020 and 2025, were predominantly conducted in the United States, Australia, and the United Kingdom. Methodologically, three studies employed qualitative methods, nine used quantitative approaches, and the remaining four adopted mixed or multi-method designs.

In terms of data sources, seven studies utilized synthetic data or simulated conversations, five relied on real clinical datasets without live deployment, and four assessed the technology in actual clinical environments. Most studies leveraged OpenAI’s GPT models for summarization, while BART-based models were less frequently used. Additionally, prompting strategies to direct the structure or content of AI-generated notes were described in four studies. Detailed study information can be found in Supplementary Table B.
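The counts reported in this section can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures stated above; the variable and function names are hypothetical, and the mixed or multi-method count is derived by subtraction rather than quoted from the paper.

```python
# Screening flow, as reported in the Results summary.
IDENTIFIED = 3181
AFTER_SCREENING = 176
INCLUDED = 16

# Study designs: the mixed/multi-method count is inferred as the remainder.
study_designs = {"qualitative": 3, "quantitative": 9}
study_designs["mixed_or_multi_method"] = INCLUDED - sum(study_designs.values())

# Data sources used for evaluation, as reported.
data_sources = {
    "synthetic_or_simulated": 7,
    "real_data_no_live_deployment": 5,
    "live_clinical_setting": 4,
}


def shares(counts):
    """Return each category's share of the total as a whole-number percentage."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


excluded_at_screening = IDENTIFIED - AFTER_SCREENING  # 3005
excluded_at_full_text = AFTER_SCREENING - INCLUDED    # 160

# Both breakdowns should account for all included studies.
assert sum(study_designs.values()) == INCLUDED
assert sum(data_sources.values()) == INCLUDED

print(shares(data_sources))
```

On these figures, only 4 of the 16 included studies (25%) evaluated a system in a live clinical setting, which is consistent with the review's point about the scarcity of real-world validation.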

Discussion

The discussion section revisits this scoping review of digital scribes that integrate automatic speech recognition (ASR) and large language models (LLMs) in clinical settings. It recaps the search strategy and selection criteria, emphasizing the inclusion of peer-reviewed studies that focus on patient-provider conversations and the generation of clinical notes. The authors applied the Technology Readiness Level (TRL) framework to assess the maturity of these technologies and found that most studies were at early TRL stages, primarily involving technical validation in controlled environments rather than real-world clinical applications.

Key findings indicate that validation methods varied significantly across studies and that evaluation metrics were not standardized, which hampers comparability. While many studies reported improvements in documentation efficiency and clinician engagement, challenges such as increased workload and adaptation difficulties were also noted. The review introduces three motivational frames (human-oriented, performance-oriented, and system-oriented) to categorize the rationales behind the implementation of digital scribes. These frames help interpret the varying outcomes and validation methods, underscoring the need for robust, standardized evaluation frameworks to ensure the safe and effective integration of digital scribes into clinical practice. The authors call for further research to address regulatory implications and the evolving landscape of AI in healthcare.

Limitations

The findings of this review may not fully capture the latest advancements in AI technologies, particularly in the context of clinical documentation. Additionally, the variability in study designs, sample sizes, and methodologies across the included research may introduce biases and limit the generalizability of the results.

Furthermore, the rapid pace of innovation in AI models means that the effectiveness and accuracy of these tools can change significantly over short periods, potentially rendering some conclusions obsolete. Researchers and practitioners should remain cautious in applying these findings, as ongoing developments in large language models (LLMs) like GPT-4 may lead to improved performance and new challenges in clinical documentation practices.