A/B testing: A systematic literature review

Journal: Journal of Systems and Software, Volume: 211
DOI: https://doi.org/10.1016/j.jss.2024.112011
Publication Date: 2024-02-22
Author(s): Federico Quin et al.
Primary Topic: Statistical Methods in Clinical Trials

Overview

This paper provides an overview of A/B testing, a method for testing hypotheses in software development by comparing two variants from the end user’s perspective. It reports findings from a systematic literature review of 141 primary studies, covering the design and execution of A/B tests, stakeholder roles, and open challenges in the field. The analysis shows that A/B testing primarily targets algorithms, visual elements, and workflows and processes, and that single classic A/B tests are the most prevalent type. During the design phase, stakeholders take on three roles: concept designer, experiment architect, and setup technician. During execution, they act as experiment coordinators and experiment assessors.

The main types of data collected are product/system data, user-centric data, and spatio-temporal data, and test results are used primarily for feature selection and feature rollout. Reported open problems concern improving the usability of proposed approaches and enhancing the approaches themselves. The authors propose three directions for future research: strengthening the adoption of statistical methods, improving the A/B testing process, and enhancing automation within A/B testing. Threats to conclusion validity are mitigated by excluding lower-quality studies and assessing the quality of the selected papers, which supports the robustness of the reported insights and their applicability to the broader field of A/B testing.

Introduction

The introduction emphasizes the importance of iterative software development and short time-to-market for software companies, and highlights the role of A/B testing in driving innovation. A/B testing, also called online controlled experimentation, compares two software variants in real-world settings and evaluates them on metrics such as click-through rates and user conversions. Although the technique is widely applied at major technology firms such as Google and Meta, the authors note a lack of comprehensive empirical studies that bridge the gap between the state of the art and the state of practice in A/B testing. Previous studies have explored individual aspects of A/B testing, but none has provided a holistic overview that covers stakeholder roles and the broader software engineering context.

To address this gap, the authors conducted a systematic literature review that maps the current state of research on A/B testing, with the aim of informing both future academic work and practical improvements for practitioners. The paper first introduces A/B testing and related studies, then presents the research questions and methodology, the results, reflections on the findings, and concluding remarks. The later sections analyze the A/B testing process in detail, including design, execution, and evaluation, and map application domains to A/B testing targets to identify trends and areas for further exploration.
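To make the idea of comparing two variants on a conversion metric concrete, the following minimal Python sketch applies a two-proportion z-test. The reviewed paper does not prescribe this particular test; the function name `two_proportion_z_test` and the visitor and conversion counts are illustrative assumptions, not data from the study.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare the conversion rates of variants A and B.

    conv_a, conv_b: number of converting users in each variant.
    n_a, n_b: number of users exposed to each variant.
    Returns the z statistic (B minus A) and a two-sided p-value.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts: 1000 users per variant.
    z, p = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
    print(f"z = {z:.3f}, two-sided p = {p:.5f}")
```

The normal approximation used here is only reasonable for large samples, which is typically the case for online experiments but remains an assumption of this sketch.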

Methods

The study follows a systematic literature review methodology with three main phases: planning, execution, and synthesis. In the planning phase, the authors establish a detailed protocol that sets out the study’s motivation, research questions, literature sources, search strings, inclusion and exclusion criteria, data extraction items, and analysis methods. In the execution phase, they apply the search string and the criteria to identify relevant studies and extract data. In the synthesis phase, they analyze the extracted data to answer the research questions and derive insights. The review was conducted by a team of four researchers, and full details are available on the study’s website.
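The execution phase described above can be pictured as a filter over candidate studies. The Python sketch below is only an illustration of that screening step: the search terms, the two criteria, and the names `Study`, `matches_search`, `is_included`, and `screen` are hypothetical and do not reproduce the authors’ actual search string or inclusion and exclusion criteria.

```python
from dataclasses import dataclass

# Hypothetical search terms, chosen only for illustration.
SEARCH_TERMS = ("a/b test", "online controlled experiment")


@dataclass(frozen=True)
class Study:
    title: str
    abstract: str
    peer_reviewed: bool
    in_english: bool


def matches_search(study: Study) -> bool:
    """True if any search term occurs in the title or abstract."""
    text = f"{study.title} {study.abstract}".lower()
    return any(term in text for term in SEARCH_TERMS)


def is_included(study: Study) -> bool:
    """Hypothetical criteria: keep peer-reviewed studies written in English."""
    return study.peer_reviewed and study.in_english


def screen(candidates):
    """Apply the search match first, then the inclusion criteria."""
    return [s for s in candidates if matches_search(s) and is_included(s)]


if __name__ == "__main__":
    candidates = [
        Study("A/B testing at scale", "", True, True),
        Study("Unit testing with mocks", "", True, True),
        Study("Online controlled experiments", "", False, True),
    ]
    for study in screen(candidates):
        print(study.title)
```

In a real review, screening is usually done by several researchers with manual checks, so an automated filter like this would at most support the first pass.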

Regarding the statistical methods used in A/B testing, the review finds that hypothesis tests for equality are the most prevalent, with Student’s t-test as the primary method. Other methods include bootstrapping, which could strengthen the analysis but remains underutilized because of its computational demands. The review also highlights the need for better reporting of statistical methods in primary studies, since many do not specify the tests used or omit statistical analysis altogether. The authors call for statistical methods tailored to specific experimental contexts and stress the importance of transparent reporting of statistical techniques in research publications.
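The two techniques named above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration under stated assumptions, not the procedure of any reviewed study: the function names `student_t` and `bootstrap_ci`, the percentile method for the interval, and the sample values are all chosen here for illustration. The sketch returns only the t statistic and degrees of freedom; if SciPy is available, `scipy.stats.ttest_ind` would also supply a p-value.

```python
import math
import random
import statistics


def student_t(control, treatment):
    """Pooled-variance (Student's) t statistic and degrees of freedom."""
    n_c, n_t = len(control), len(treatment)
    df = n_c + n_t - 2
    pooled_var = (
        (n_c - 1) * statistics.variance(control)
        + (n_t - 1) * statistics.variance(treatment)
    ) / df
    se = math.sqrt(pooled_var * (1 / n_c + 1 / n_t))
    t = (statistics.fmean(treatment) - statistics.fmean(control)) / se
    return t, df


def bootstrap_ci(control, treatment, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the difference in means.

    Each resample draws with replacement from both groups, so the cost
    grows with n_resamples times the sample size.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        c = rng.choices(control, k=len(control))
        t = rng.choices(treatment, k=len(treatment))
        diffs.append(statistics.fmean(t) - statistics.fmean(c))
    diffs.sort()
    low = diffs[int((alpha / 2) * n_resamples)]
    high = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


if __name__ == "__main__":
    # Hypothetical per-user metric values for the two variants.
    control = [1.0, 2.0, 3.0, 4.0, 5.0]
    treatment = [2.0, 3.0, 4.0, 5.0, 6.0]
    print(student_t(control, treatment))
    print(bootstrap_ci(control, treatment))
```

The `n_resamples` parameter makes the computational cost of bootstrapping explicit: every additional resample requires another pass over both samples, which is the kind of overhead the review associates with its limited use.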

Results

In the Results section, the authors first present demographic information about the primary studies included in the analysis, which provides context for the research questions. They then address each research question in turn and report the corresponding findings. This structure keeps the results for each question clearly separated and makes the collected data easier to interpret.

Discussion

In the discussion section, the paper contrasts traditional controlled experiments with A/B testing and emphasizes the role of both in hypothesis testing across fields, including software engineering. Controlled experiments involve independent and dependent variables: the independent variables are manipulated to assess their effect on dependent variables such as fault density and productivity in design tasks. A/B testing, described as online controlled experimentation, is tailored to evaluating software features in real time with end users, with a focus on improving user experience and interface design.
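In an A/B test, the independent variable is the variant a user sees, and the dependent variable is the metric measured afterwards. One common way to manipulate the independent variable is deterministic, hash-based assignment of users to variants. The Python sketch below illustrates that idea; it is an assumption of this summary rather than a mechanism described in the paper, and the names `assign_variant`, `user_id`, and `experiment` are hypothetical.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, variants=("A", "B")) -> str:
    """Deterministically map a user to one variant of an experiment.

    Hashing the experiment name together with the user ID keeps a user in
    the same variant on every visit, while different experiments split
    users independently of each other.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]


if __name__ == "__main__":
    # Hypothetical user IDs and experiment name.
    for uid in ("user-1", "user-2", "user-3"):
        print(uid, assign_variant(uid, "checkout-button"))
```

Because the assignment depends only on the two strings, no per-user state has to be stored to keep the exposure consistent across sessions.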

The paper also highlights how A/B testing fits into DevOps practices, which streamline software development and deployment through automation. Continuous experimentation is presented as a vital component of DevOps that enables data-driven development through ongoing assessment of software variants. The authors summarize existing secondary studies on A/B testing and group them into technical, social, and domain-specific aspects. They identify gaps in the literature concerning the design and execution of A/B tests, the roles of stakeholders, and the use of A/B testing results. To address these gaps, the authors state the aim of their study using the Goal-Question-Metric (GQM) approach: to systematically analyze A/B testing in software systems from a research perspective.