When Should Users Check? Modeling Confirmation Frequency in Multi-Step Agentic AI Tasks

Journal: Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems
DOI: https://doi.org/10.1145/3772318.3790655
Publication Date: 2026-04-13
Author(s): Zhenyun Du et al.
Primary Topic: Explainable Artificial Intelligence (XAI)

Overview

The paper presents a decision-theoretic model that optimizes user-agent interaction in long-horizon tasks by determining when the user should confirm the agent's actions. Central to the model is the Confirmation-Diagnosis-Correction-Redo (CDCR) pattern, which informs whether an agent should act autonomously or solicit user supervision, based on anticipated user interaction times.

A controlled experiment with 48 participants found that intermediate confirmation significantly decreased task completion time, by 13.54%, and was preferred by 81% of participants. This suggests a shift in perspective on user confirmation, toward a mixed-initiative approach that balances agent autonomy with user oversight. The authors propose future research directions focused on efficiency, trust, and interaction design, with the goal of more effective user-supervised agentic systems.

Introduction

The introduction discusses the emergence of agentic AI: AI agents that assist humans with complex tasks across domains such as multi-modal devices and web-based systems. It highlights a central debate in Human-Computer Interaction (HCI) over the balance between user control and agent autonomy. Current AI agents often operate with end-to-end autonomy, so users miss opportunities for timely intervention. Combined with high agent error rates (only 30% accuracy on multi-step benchmarks), this forces task re-execution, which incurs significant cost and a larger carbon footprint.

The authors aim to address the timing of user confirmations in long-horizon agentic tasks, proposing a decision-theoretic model to identify optimal points for user checks that minimize total expected task completion time while ensuring correctness. Through a formative study with eight participants, they identified dissatisfaction with the prevalent confirm-at-end strategy and a recurring Confirmation-Diagnosis-Correction-Redo (CDCR) pattern. Their empirical evaluation with 48 participants demonstrated a strong preference (81%) for an intermediate confirmation strategy, which reduced average task completion time by 13.54% compared to the confirm-at-end approach. The paper contributes a formative study, a decision-theoretic model for confirmation timing, empirical validation of the model, and a discussion on its implications for user-supervised systems.

Results

The formative study revealed a prevalent Confirmation-Diagnosis-Correction-Redo (CDCR) pattern in how participants handled errors made by AI agents. The pattern involved a comprehensive review of the execution history and held across various tasks and user interfaces. Notably, participants wanted confirmation strategies beyond the standard “confirm-at-end” approach, which pointed to intermediate checkpoints and informed the development of a confirmation frequency model.

Quantitative analyses confirmed the efficacy of the proposed model: task completion time fell by 13.54% (approximately 35.84 seconds) relative to the confirm-at-end baseline, a statistically significant difference ($t(143) = 5.52, p < 0.001$). The reduction varied by task domain: 17.44% for Shopping, 7.46% for Image Editing, and 15.64% for Overcooked. The benefit of intermediate confirmation was most pronounced for early errors, with a time reduction of approximately 29%, and diminished for mid-task and late errors. Overall, intermediate confirmation decreased the time spent on diagnosis and redo and reduced the mental load associated with verification. It also contributed to sustainability by avoiding unnecessary computation.
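
The reported figures can be cross-checked with a little arithmetic. The sketch below derives the mean completion times implied by the 13.54% and 35.84-second figures, and checks that the degrees of freedom in $t(143)$ match a paired comparison. Both derived times are inferences, not values reported in the summary. The pairing structure (each of the 48 participants contributing one pair per task domain) is an assumption that happens to match the reported degrees of freedom.

```python
# Figures reported in the summary.
relative_reduction = 0.1354    # 13.54% reduction in completion time
absolute_reduction_s = 35.84   # approximate reduction in seconds

# Implied mean completion times (inferred, not reported in the summary).
baseline_s = absolute_reduction_s / relative_reduction
intermediate_s = baseline_s - absolute_reduction_s

# A paired t-test has (number of pairs - 1) degrees of freedom.
# ASSUMPTION: one pair per participant per task domain.
participants = 48
task_domains = 3               # Shopping, Image Editing, Overcooked
paired_observations = participants * task_domains
degrees_of_freedom = paired_observations - 1

print(f"implied confirm-at-end mean: {baseline_s:.1f} s")
print(f"implied intermediate mean:   {intermediate_s:.1f} s")
print(f"paired-test df:              {degrees_of_freedom}")
```

Under these assumptions the implied confirm-at-end mean is roughly 265 seconds, and 48 × 3 − 1 gives the 143 degrees of freedom reported for the test.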

Discussion

In this section, the authors discuss the integration of human control and AI agent autonomy within the context of Human-Computer Interaction (HCI) and reliability engineering. They highlight the critical need for human oversight in long-horizon tasks due to the high error rates of current AI agents, which often exhibit only 30% accuracy in complex, multi-step tasks. The authors propose a mathematical framework to determine optimal moments for user intervention, addressing the timing of confirmations in agent-led workflows. This framework aims to balance the burdens of user confirmation against the potential costs of error recovery, thereby enhancing user experience and operational efficiency.
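
The trade-off between confirmation burden and error-recovery cost can be illustrated with a toy model. The sketch below is not the authors' formulation, which this summary does not reproduce. It assumes that each step fails independently with probability `p_err`, that every checkpoint costs a fixed `t_confirm`, and that an error found at a checkpoint triggers diagnosis of the segment since the last checkpoint plus one redo of that segment, which then succeeds. All function names, parameters, and numeric values are hypothetical.

```python
def segment_cost(k, p_err, t_step, t_confirm, t_diag):
    """Expected time for k steps followed by one confirmation (toy model)."""
    run = k * t_step                       # agent executes the segment
    p_fail = 1.0 - (1.0 - p_err) ** k      # at least one step went wrong
    recovery = k * t_diag + run            # diagnose the segment, then redo it
    return run + t_confirm + p_fail * recovery


def best_schedule(n_steps, p_err, t_step, t_confirm, t_diag):
    """Dynamic program over checkpoint positions; a final check is always made."""
    best = [0.0] + [float("inf")] * n_steps   # best[i]: min expected time to finish step i
    prev = [0] * (n_steps + 1)                # prev[i]: checkpoint preceding step i
    for i in range(1, n_steps + 1):
        for j in range(i):
            cost = best[j] + segment_cost(i - j, p_err, t_step, t_confirm, t_diag)
            if cost < best[i]:
                best[i], prev[i] = cost, j
    checkpoints, i = [], n_steps
    while i > 0:                              # walk back through chosen checkpoints
        checkpoints.append(i)
        i = prev[i]
    return best[n_steps], checkpoints[::-1]


# Hypothetical parameters: 12 steps, 10% per-step error rate, times in seconds.
params = dict(p_err=0.1, t_step=5.0, t_confirm=8.0, t_diag=3.0)
end_only = segment_cost(12, **params)
planned, checkpoints = best_schedule(12, **params)
print(f"confirm-at-end: {end_only:.1f} s")
print(f"scheduled:      {planned:.1f} s with checks after steps {checkpoints}")
```

In this toy setting, raising `t_confirm` pushes the schedule back toward a single confirmation at the end, while raising `p_err` or `t_diag` favors more frequent checks. That is the same qualitative balance the framework is described as striking.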

The authors also draw parallels with reliability engineering, where mathematical models optimize inspection intervals for complex systems. They argue that while existing HCI research has focused on how to empower users with control, it has largely overlooked when such control should be exercised during multi-step tasks. Their formative study reveals that users often prefer proactive confirmation mechanisms over the traditional “confirm-at-end” strategy, which leaves them in a passive monitoring state. The findings underscore the necessity for a scheduling model that prompts user confirmations at strategic points throughout task execution, thereby reducing the cognitive load on users and improving the overall effectiveness of AI systems.