A/B testing: A systematic literature review

Journal: Journal of Systems and Software, Volume 211
DOI: https://doi.org/10.1016/j.jss.2024.112011
Published: 2024-02-22

Federico Quin, Danny Weyns, Matthias Galster, Camila Costa Silva
Distrinet, KU Leuven, Celestijnenlaan 200A, Leuven, 3000, Belgium
Linnaeus University, Universitetsplatsen 1, Växjö, 351 06, Sweden
University of Canterbury, 69 Creyke Road, Christchurch, 8140, New Zealand

Abstract

A/B testing, also referred to as online controlled experimentation or continuous experimentation, is a form of hypothesis testing in which two variants of a piece of software are compared in the field from the end user's point of view. A/B testing is widely used in practice to enable data-driven decision making in software development. While a few studies have explored different facets of research on A/B testing, no comprehensive study has been conducted on the state-of-the-art in A/B testing. Such a study is crucial to provide a comprehensive and systematic overview of the field of A/B testing and to drive future research forward. To address this gap and provide an overview of the state-of-the-art in A/B testing, this paper reports the results of a systematic literature review that analyzed 141 primary studies. The research questions focused on the subject of A/B testing, how A/B tests are designed and executed, what roles stakeholders play in this process, and the open challenges in the area. Analysis of the extracted data shows that the main targets of A/B testing are algorithms, visual elements, and workflow and processes. Single classic A/B tests are the dominant type of test, and they rely primarily on hypothesis tests. Stakeholders have three main roles in the design of A/B tests: concept designer, experiment architect, and setup technician. The primary types of data collected during the execution of A/B tests are product/system data, user-centric data, and spatio-temporal data. The dominant uses of the test results are feature selection, feature rollout, continued feature development, and subsequent A/B test design. Stakeholders have two main roles during A/B test execution: experiment coordinator and experiment assessor. The main reported open problems relate to the enhancement of the proposed approaches and their usability. From our study we derived three interesting lines for future research: strengthening the adoption of statistical methods in A/B testing, improving the process of A/B testing, and enhancing the automation of A/B testing.

Keywords: A/B testing, Systematic literature review, A/B test engineering

1. Introduction

Iterative software development and time to market are crucial to the success of software companies. Central to this is innovation through exploring new software features or experimenting with software changes. To enable such innovation in practice, software companies often employ A/B testing. A/B testing, also referred to as online controlled experimentation or continuous experimentation, is a form of hypothesis testing in which two variants of a piece of software are evaluated in the field (ranging from variants with a slightly altered user interface layout to variants of the software with new features). In particular, the merit of the two variants is analyzed using metrics such as click rates of website visitors, lifetime values (LTV) of members of a subscription service, and user conversions in marketing. A/B testing is widely used in practice, including at large and popular tech companies such as Google, Meta, LinkedIn, and Microsoft.

Even though A/B testing is commonly used in practice, to the best of our knowledge no comprehensive, empirically grounded study has been conducted on the state-of-the-art (i.e., the state of research as opposed to the state of practice) in A/B testing. Such a study is crucial to provide a comprehensive and systematic overview of the field of A/B testing and to drive future research forward. Three earlier studies [13, 12, 133] explored a number of aspects of research on A/B testing, such as research topics, the type of experiments in A/B testing, and A/B testing tools and metrics. However, these studies do not provide a comprehensive overview of the state-of-the-art that offers deeper insights into the types of targets to which A/B testing is applied, the roles of stakeholders in the design of A/B tests, the execution of the tests, and the use of the test results. These insights are key to positioning and understanding A/B testing in the broader picture of software engineering. To tackle this issue, we performed a systematic literature review [84]. Our study aims to provide insights into the state of research in A/B testing as a basis to guide future research. Practitioners may also benefit from the study to identify potential improvements of A/B testing in their daily practice.

The remainder of this paper is structured as follows. Section 2 provides a brief introduction to A/B testing and discusses related secondary studies. In Section 3, we define the research questions and summarize the methodology we used. Section 4 then presents the results, answering each research question. In Section 5, we reflect on the results of the study, report insights, identify opportunities for future research, and outline threats to validity. Finally, Section 6 concludes the paper.

2.1. Background

A/B testing is a method in which two variants of a piece of software, referred to as variant A and variant B, are compared by evaluating the merits of the variants through exposure to the end users of the system [145]. To compare the variants, a hypothesis is formulated together with an experiment to test it, i.e., the actual A/B test. In contrast to regular software testing, A/B testing takes place in live systems. Figure 1 shows the general process of A/B testing with its three main stages.

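The first stage of A/B testing concerns the design of an A/B test. In the design of this experiment, a number of parameters are specified, such as the hypothesis, the sample of the population that the experiment should target, the duration of the experiment, and the A/B metrics that are collected during the experiment. The A/B metrics are used to determine the merit of each variant during the experiment. Examples of A/B metrics include the click-through rate (CTR), the number of clicks, and the number of sessions [47].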

The second stage of A/B testing consists of the execution of the A/B test in the running software system. Both variants are deployed in the live system, and the sample of the population is split between the two variants. During execution, the system keeps track of the data needed to evaluate the experiment after it finishes (according to the specified duration). The relevant data may correspond directly to the specified A/B metrics, or it may indirectly enable advanced analysis in the evaluation stage to obtain additional insights from the conducted A/B tests.
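The split of the sampled population between the two variants can be illustrated with a small sketch. The review does not prescribe any assignment mechanism; the hash-based bucketing below, the function name `assign_variant`, and the `exposure` parameter are assumptions made purely for illustration.

```python
import hashlib
from typing import Optional


def assign_variant(user_id: str, experiment_id: str, exposure: float = 0.1) -> Optional[str]:
    """Deterministically map a user to 'A', 'B', or None (outside the sample).

    Hypothetical helper: `exposure` is the fraction of the population
    that takes part in the experiment.
    """
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Turn the first 32 bits of the hash into a value in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    if bucket >= exposure:
        return None  # user is not part of the sampled population
    # Split the sampled population evenly between the two variants.
    return "A" if bucket < exposure / 2 else "B"
```

Because the assignment depends only on the user and experiment identifiers, a returning user keeps seeing the same variant for the whole duration of the test.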

The third stage of A/B testing comprises the evaluation of the experiment. After the A/B test has finished, the original hypothesis is tested, typically with a statistical test such as Student's t-test or Welch's t-test [75, 156]. Based on the outcome of the test, the designer can take follow-up actions, such as starting the rollout of a feature to the entire population or designing new variants to test in subsequent A/B tests.
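As an illustration of the evaluation stage, the sketch below computes Welch's t statistic for a metric collected under variants A and B. It is a minimal example, not the procedure of any reviewed study: the function name `welch_t_test` is hypothetical, and the p-value relies on a normal approximation (an assumption that is only reasonable for large samples) so that the code stays within the Python standard library.

```python
import math
from statistics import NormalDist, mean, variance


def welch_t_test(a, b):
    """Return (t, df, p_approx) for Welch's unequal-variance t-test.

    Both samples need at least two observations and non-zero variance.
    """
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)  # sample variances (n - 1 denominator)
    se2 = va / na + vb / nb            # squared standard error of the difference
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    # Two-sided p-value from a normal approximation (assumption: large samples).
    p_approx = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, df, p_approx
```

For small samples, the p-value should instead come from the t distribution with `df` degrees of freedom, which requires a statistics library beyond the standard one.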

2.1.1. Controlled experiments vs. A/B testing

Traditionally, a controlled experiment is regarded as an empirical method that enables a hypothesis to be tested systematically. Two types of variables are distinguished in controlled experiments: independent variables and dependent variables. Independent variables are the variables that are controlled during the experiment to test the hypothesis, for example, a state-of-the-art approach and a newly proposed approach for solving a particular design problem, used by the control group and the treatment group respectively. Dependent variables are the variables that are measured during the experiment to compare the results of the control group and the treatment group, for example, the fault density and productivity obtained in a design task. After the experiment has been conducted, the hypothesis is tested and conclusions are drawn from the results; for example, the newly proposed design approach has a significantly lower fault density than the state-of-the-art approach, but further research is required regarding productivity. Controlled experiments are widely used across all kinds of scientific fields, such as psychology [31], pharmacology [116], education [32], and nowadays also software engineering [142, 34, 68].

Figure 1: General A/B testing process.

While controlled experiments are typically conducted in a controlled offline setting, A/B testing uses controlled experiments to evaluate software features or variants on the end users of a running system. For this reason, A/B testing is often referred to as online controlled experimentation. The aim of A/B testing is to test hypotheses in live software systems, where the end users of these systems form the participants or population of the experiment. Hypotheses tested in A/B tests often relate to improving user experience (UX) [130], improving user interface (UI) design [158], improving user click rates [4], or evaluating non-functional requirements in distributed services [14].

2.1.2. DevOps and A/B testing

Development and operations (DevOps for short) has gained popularity in recent years. DevOps consists of a set of practices, tools, and guidelines to manage and carry out different tasks efficiently and effectively throughout software life cycles. This ranges from the software development process to deploying and managing the software at runtime. Automation of software processes plays a central role in DevOps to make life easier for developers and to ease the overall burden of software development.

Common practices that are part of the DevOps vocabulary are continuous integration and continuous deployment (CI/CD for short). CI/CD consists of automating software testing, software integration and building, and software deployment, effectively reducing the manual work required from developers and easing the burden of software deployment. In the same vein, continuous experimentation aims at continuously setting up experiments in software systems to test new software variants. Put differently, continuous experimentation enhances the software development process by enabling a data-driven development approach (e.g., by gauging user satisfaction with new software features early in development). To achieve this, A/B testing is used to set up and evaluate online controlled experiments in the software system. For example, Fabijan et al. [58] conducted a case study on the evolution of scaling up continuous experimentation at Microsoft, providing guidelines for other companies to conduct continuous experimentation.

We start with a summary of secondary studies related to the study presented in this paper. We then define the goal of the study, which is to provide a comprehensive and systematic overview of the state-of-the-art in A/B testing.

We grouped the related studies into three categories: studies focusing on technical aspects of A/B tests, studies focusing on social aspects of A/B tests, and studies concerning A/B testing in specific domains.

Technical aspects of A/B testing. Rodríguez et al. [132] conducted a systematic mapping study on continuous deployment of software-intensive services and products. The authors identify continuous and rapid experimentation as one of the factors that characterize continuous deployment, and explain it through the lens of deploying such experiments and the associated DevOps practices. Ros and Runeson [133] presented a mapping study on continuous experimentation and A/B testing. The authors explore research topics and the organizations that employ A/B testing, and take a deeper look at the type of experiments that are conducted. Auer and Felderer [12] conducted a systematic mapping study on continuous experimentation. The authors focused on research topics, contributions, research types, collaboration between industry and academia, trends in publications, popular venues for publications on A/B testing, and paper citations. Recently, Auer et al. presented a systematic literature review on A/B testing and continuous experimentation, building on the results of the earlier studies [133, 12]. The authors apply forward snowballing on a set of papers to compose the list of primary studies for the review. They then explore the core elements of a continuous experimentation framework, and the challenges and benefits associated with continuous experimentation. Closely related, Erthal et al. [53] conducted a literature review by applying an ad-hoc search, followed by snowballing on the initially identified set of papers. The study focuses on defining continuous experimentation and exploring its associated processes. While the authors acknowledge A/B testing as one of the strategies to achieve continuous experimentation, this literature review does not address the technical aspects of A/B testing.

Social aspects of A/B testing. An important social aspect of A/B testing is obtaining user feedback. A large part of A/B testing revolves around prioritizing and improving user experience. We identified two studies that focus on this social aspect. Fabijan et al. present a literature review on customer feedback and data collection techniques in the context of software research and development. The authors highlight the techniques found in the literature for obtaining customer feedback and organizing data collection, the software development stages in which the techniques are used, and the main challenges and limitations of these techniques. One of the techniques put forward by the authors is A/B testing, which can be a valuable tool for obtaining user feedback on prototypes. Fabijan et al. [62] discuss the challenges and implications of not sharing customer data within large organizations. A specific case put forward by the authors substantiates the critical issues that arise when qualitative customer feedback from the pre-development stage is not shared with the development stage, forcing developers to repeat the collection of user feedback or to develop products without this information.

A/B testing in specific domains. Beyond A/B testing at Internet-based companies, the use of A/B testing has been reported in a variety of other domains. One example is the domain of embedded systems. Mattos et al. explore challenges and strategies for continuous experimentation in embedded systems, providing both industrial and research perspectives. Another domain is cyber-physical systems (CPS). Giaimo et al. present a systematic literature review on the state-of-the-art of continuous experimentation in cyber-physical systems, concluding that the literature focuses more on presenting challenges than on proposing solutions to them.

Summary. Existing secondary studies have examined A/B testing with a focus on the realization of tests, the associated processes, and the types of experiments conducted. However, these studies either focus on a particular aspect or lack a rigorous search process to identify relevant studies. Existing studies do not provide sufficient insights into the target of A/B testing (i.e., "what" is the subject of testing), the roles of stakeholders in designing and executing A/B tests, and the use of A/B test results.

2.2.2. Study goal

To address the limitations of existing studies, we performed an in-depth literature study. We define the goal of this study using the Goal Question Metric (GQM) approach [17]:

Purpose: Study and analyze
Issue: the design and execution of A/B testing
Object: in software systems
Viewpoint: from the viewpoint of researchers.

Concretely, we aim to study the subject of A/B testing, how A/B tests are designed and executed, and what the role of stakeholders is in the different stages of A/B testing. Finally, we also aim to obtain insights into the research problems reported in the literature.

3. Methodology

This study uses the methodology of a systematic literature review as described in [84]. This methodology describes a rigorous process for reviewing the literature on a particular topic. The process ensures that the review identifies, evaluates, and interprets all relevant research papers in a reproducible manner. The literature review consists of three main phases: planning, execution, and synthesis. During the planning phase, a protocol is defined for the study [129], which includes the motivation for the study, the research questions to be answered, the sources to search for papers, the search string, the inclusion and exclusion criteria, the data items to be extracted from the primary studies, and the analysis methods to be used. During the execution phase, the search string is applied as specified in the protocol, the inclusion and exclusion criteria are applied to identify the primary studies, and all data items are extracted from these papers. Finally, during the synthesis phase, the extracted data is analyzed and interpreted to answer the research questions and to obtain useful insights from the study.

We conducted the systematic literature review with four researchers. Further details on the literature review process (such as the roles the researchers played in the review) are summarized in the following sections. A complete description of the protocol, all collected data, and the data analysis is available at the study website [129].

3.1. Research questions

To realize the goal of this study ("Study and analyze the design and execution of A/B testing in software systems from the viewpoint of researchers."), we put forward four research questions:

RQ1: What is the subject of A/B testing?
RQ2: How are A/B tests designed? What is the role of stakeholders in this process?
RQ3: How are A/B tests executed and evaluated in the system? What is the role of stakeholders in this process?
RQ4: What are the reported open research problems in the field of A/B testing?

With RQ1, we investigate the subject of A/B testing, i.e., the (part of the) system to which A/B testing is applied. Examples include A/B tests on program variables, application features, software components, subsystems, the system itself, and the infrastructure used by the system. We also investigate the domains in which A/B testing is used.

With RQ2, we investigate what is defined and specified in A/B tests before they are executed in the system. We look at the metrics that are used, whether statistical methods are used in the experiments and, if so, which ones, and the tools used to conduct the experiments. We also explore which stakeholders are involved in this process and what their role is (e.g., users of the system who influence which tests should be deployed, or architects who decide on which population A/B tests should be run).

With RQ3, we investigate how A/B tests are executed in the system and how the results are evaluated. More specifically, we look at the way data is collected for evaluation in the test, the evaluation of the A/B test itself (using the collected data and, if applicable, the outcome of a statistical test), and the use of the test results (e.g., a decision on target selection, input for maintenance, or a trigger for the next test in a pipeline of tests). We also explore the role of stakeholders during this process of A/B testing (e.g., operators who decide when to finish an experiment).

With RQ4, we identify open research problems in the field of A/B testing. The problems can be derived from descriptions of the limitations of the approaches proposed in the reviewed papers, from open challenges, or from outlined lines of future work on A/B testing.

3.2. Search query

We first identified a list of relevant terms for A/B testing from a number of well-known publications [87, 73, 93, 92, 85, 47]. We then defined and applied a gold standard [178] to tune the terms. For a detailed description of the relevant terms and the application of the gold standard, we refer to the research protocol [129]. Figure 2 (top) shows the final search query after applying the gold standard.

3.3. Search strategy

The search query was executed in October 2022. The query was applied to the title and abstract of each paper in the sources (case-insensitive). The automatic search resulted in 3,944 papers, as shown in Figure 2. After filtering out duplicate papers and keeping only the journal versions of extended conference papers, 2,379 papers remained for further processing.

3.4. Search process

After collecting the papers, we applied the following inclusion criteria:

IC1: Papers that either (1) have a main focus on A/B testing (or any of its known synonyms), or (2) describe and apply new (designs of) A/B tests, e.g., by presenting a proof-of-concept;
IC2: Papers that include an evaluation of the presented A/B tests, either through simulation with synthetic or field data, or by running one or more field experiments in a real system;
IC3: Papers written in English.

We defined IC1 such that we only include works that are relevant to the posed research questions, i.e., the works must focus on A/B tests or on their design and evaluation. Note that IC1 includes papers that address and present solutions for known challenges in A/B testing. IC2 ensures that we only include papers that contain data concerning the design and/or execution of A/B tests. Finally, with IC3 we only included papers written in English.

In addition to the inclusion criteria above, we also applied the following exclusion criteria:

EC1: Papers that present (systematic) literature reviews, surveys (using questionnaires), interviews, and roadmap papers;
EC2: Short papers (… pages), demos, extended abstracts, keynote talks, and tutorials;
EC3: Papers with a quality score … (explained in Section 3.5);
EC4: Papers that provide an insufficient or overly brief description of the A/B test design process or execution process.

EC1, EC2, and EC3 exclude papers that do not directly contribute new technical advances, preliminary works that have not been fully developed yet, and works that are not of sufficient quality. In this literature review, we focus on mature, state-of-the-art research in the field of A/B testing to answer the research questions. EC4 excludes works that lack information essential to answer the research questions.

Figure 2: Primary studies selected for the systematic literature review.

Papers that met all inclusion criteria and none of the exclusion criteria were included as primary studies in the literature study. Applying the inclusion and exclusion criteria to the titles and abstracts of the papers yielded 279 papers. A careful reading of these papers reduced the number to 137. In addition to the papers retrieved via the search string and filtered with the inclusion and exclusion criteria, we applied snowballing on the works cited in these papers to capture papers we might have missed. Through snowballing we discovered 4 additional papers, bringing the final number of primary studies to 141, as shown in Figure 2.
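The selection rule stated above (all inclusion criteria met, no exclusion criterion met) can be written as a simple predicate. The sketch below is illustrative only: the dictionary layout and the function name `is_primary_study` are assumptions, and the actual screening was a manual process carried out by the researchers.

```python
INCLUSION_CRITERIA = ("IC1", "IC2", "IC3")
EXCLUSION_CRITERIA = ("EC1", "EC2", "EC3", "EC4")


def is_primary_study(criteria_met: dict) -> bool:
    """Return True if a paper meets every IC and none of the ECs.

    `criteria_met` maps a criterion identifier to True/False
    (hypothetical record layout); missing entries count as False.
    """
    meets_all_inclusion = all(criteria_met.get(c, False) for c in INCLUSION_CRITERIA)
    meets_any_exclusion = any(criteria_met.get(c, False) for c in EXCLUSION_CRITERIA)
    return meets_all_inclusion and not meets_any_exclusion


# Counts reported in the text: 137 papers after full reading plus 4 from snowballing.
FINAL_PRIMARY_STUDIES = 137 + 4
```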

3.5. Data items

To be able to answer the research questions, we extract the data items listed in Table 1. For each data item we provide a detailed description.

D1-4: The authors, year, title, and venue, used for documentation purposes.

Table 1: Data items collected to answer the research questions.

ID	Data item	Purpose
D1	Authors	Documentation
D2	Year	Documentation
D3	Title	Documentation
D4	Venue	Documentation
D5	Paper type	Documentation
D6	Authors sector	Documentation
D7	Quality score	Documentation
D8	Application domain	RQ1
D9	A/B target	RQ1
D10	A/B test type	RQ2
D11	Used metrics	RQ2
D12	Statistical methods used	RQ2
D13	Role of stakeholders in experiment design	RQ2
D14	Additional data collected	RQ3
D15	Evaluation method	RQ3
D16	Use of test results	RQ3
D17	Role of stakeholders in experiment execution	RQ3
D18	Open problems	RQ4

D5: The paper type. The options are focus paper (focus on A/B testing itself, i.e., modifications, proposals, or improvements of the A/B testing process) and applied paper (application and evaluation of A/B testing in real software systems).

D6: The sector of the authors of the primary study, used for documentation (based on author affiliation). The options are fully academic, fully industrial, and mixed.

D7: A quality score for the reporting of the research [115]. The quality score is based on the following items: the definition of the problem studied, the problem context (relation to other work), the research design (organization of the study), the contributions and results of the study, the insights derived, and the limitations. Each item is rated on a three-level scale: explicit description (2 points), general description (1 point), or no description (0 points). The quality score is therefore defined on a scale from 0 to 12 [113].
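The scoring scheme for D7 is simple arithmetic: six items, each worth 0, 1, or 2 points. The following sketch shows the computation; the item keys and the function name `quality_score` are illustrative labels chosen here, not identifiers from the study protocol.

```python
# The six quality items named in the description of D7 (keys are illustrative).
QUALITY_ITEMS = (
    "problem_definition",
    "problem_context",
    "research_design",
    "contributions_and_results",
    "insights",
    "limitations",
)

# Three-level rating scale and the points assigned to each level.
POINTS = {"explicit": 2, "general": 1, "none": 0}


def quality_score(ratings: dict) -> int:
    """Sum the points of all six items; the result lies between 0 and 12."""
    missing = [item for item in QUALITY_ITEMS if item not in ratings]
    if missing:
        raise ValueError(f"missing ratings for: {missing}")
    return sum(POINTS[ratings[item]] for item in QUALITY_ITEMS)
```

A paper that describes every item explicitly scores 12, and one that describes none of them scores 0.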

D8: The application domain in which A/B testing is used in the primary study. Initial options include e-commerce, telecom, automotive, finance, and robotics. Further options are derived during data collection.

D9: The A/B target, i.e., the element that is the subject of A/B testing. Initial options include algorithm, user interface, and application configuration. Further options are derived during data collection.

D10: The A/B test type, which corresponds to the number of A/B variants and the way they are tested. Initial options include single classic A/B test, single multivariate A/B test, manual sequence of classic A/B tests, manual sequence of multivariate A/B tests, automated sequence of classic A/B tests, and automated sequence of multivariate A/B tests. Further options are derived during data collection.

D11: The metrics that are used in the A/B tests. Initial options include click rate, click-through rate, number of clicks, number of sessions, number of queries, absence time, time to click, and session time. Further options are derived during data collection.

Table 2: Paper types of the primary studies.

Type	Number of occurrences
Focus	
Applied	

D12: The statistical method used to evaluate the data obtained through the A/B test, if any. Initial options include Student's t-test, proportion test, and no statistical test. Further options are derived during data collection.

D13: The role of stakeholders in the design of the experiment. Initial options include defining the A/B test goal/hypotheses, defining the A/B test duration, and tuning the A/B test variants. Further options are derived during data collection.

D14: Additional data collected during the execution of the A/B test (in addition to the direct or indirect A/B metric data). Examples include the geo-location of the user, the browser type, and the timings of invocations or requests. Further options are derived during data collection.

D15: The evaluation method used in the primary study. Initial options include illustrative example, simulation, and empirical evaluation.

D16: The use of the test results gathered from A/B tests. Examples include subsequent A/B test execution, subsequent A/B test design, feature rollout, and feature development. Further options are derived during data collection.

D17: The role of stakeholders in the execution of A/B tests. Initial options include A/B test alteration (adjusting individual A/B tests), A/B test triggering (starting subsequent A/B tests manually), A/B test supervision (monitoring the execution of A/B tests), no involvement, and unspecified. Further options are derived during data collection.

D18: The reported open problems. Open problems are derived from reported challenges, limitations, and threats to validity. The options are derived during data collection.

4. Results

We start with demographic information about the primary studies. We then zoom in on each of the research questions.

4.1. Demographic information

The demographic information is extracted from the data items paper type (D5), authors sector (D6), and quality score (D7).

The 141 primary studies either focus on A/B testing itself or apply A/B testing or use it for evaluation purposes; see Table 2.

Just over half of the primary studies (72, or 51.1%) have authors from industry only; see Table 3. Forty-three studies (30.5%) have a mix of authors from industry and academia, and 26 studies (18.4%) have academic authors only.

Figure 3 shows the distribution of the quality scores together with their average. The scores indicate that the reporting of research in the primary studies is of good quality. Since all papers exceeded the threshold of 4, no paper had to be excluded from the data extraction used to answer the research questions.

Table 3: Author backgrounds of the primary studies.

Background	Number of occurrences
Academic	26
Industry	72
Mixed	43

Figure 3: Quality scores of the primary studies.

Table 4: Identified application domains of A/B testing.

Application domain	Number of occurrences
Web	38
Search engine	35
E-commerce	27
Interaction	22
Finance	16
Transportation	4
Other	
N/A	9

4.2. RQ1: What is the subject of A/B testing?

To answer this research question, we look at the following data items: application domain (D8) and A/B target (D9).

Application domain. Table 4 lists the application domains of the primary studies. The average number of domains is 1.13 (131 primary studies applied A/B testing in one domain, three studies in two domains, six studies in three domains, and one study in four domains). Nine studies do not mention any domain. We observe that the most popular application domain is the Web (38 occurrences). Typical examples are social media platforms, such as Facebook [109] or LinkedIn [170], news publishers [175, 60], and multimedia services, such as movie streaming at Netflix [9]. The second most popular domain is search engines (35 occurrences), with studies conducted at Yandex [46, 45], Bing [41, 112], and Yahoo [6, 150], among others. A/B testing is also actively applied in e-commerce (27 occurrences), with examples from retail giant Amazon [52], the fashion industry [26], and C2C (consumer-to-consumer) businesses, such as Etsy [83] and Facebook Marketplace [77]. Next, we observe the application of A/B testing in what we group under "interaction" (22 occurrences), with digital communication software, such as Snap [167] and Skype [60], user-operating system interaction [74, 56], and application software, such as an app store [33] and mobile games [173]. Finally, we note the domain of financial applications (16 occurrences), including studies at Yahoo Finance [179] and Alipay [24], and transportation (4 occurrences) at, for example, Didi Chuxing [66]. Other domains are education (3 occurrences) [131] and robotics (2 occurrences) [118], among others.
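Taking the per-study counts exactly as stated in the paragraph above, the reported average of 1.13 domains per primary study can be checked with a short calculation:

```latex
\[
\frac{131 \cdot 1 + 3 \cdot 2 + 6 \cdot 3 + 1 \cdot 4}{141}
  = \frac{159}{141}
  \approx 1.13
\]
```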

A/B target. The target of an A/B test denotes the element that is subject to testing and of which (at least) two variants are compared. Table 5 lists the A/B targets we identified from the primary studies, with a description and examples for each. The average number of A/B targets is 1.21 (120 primary studies applied A/B testing to one element, 26 studies to two elements, and 24 studies to three elements). Note that studies with more than one A/B target typically apply these in multiple experiments. The dominant targets of A/B testing are algorithms, visual elements, and workflow/process, which together account for the majority of all A/B targets reported in the primary studies. Notably, 32 primary studies did not specify the A/B target, for example because they used datasets from two earlier A/B tests in the evaluation of the paper without elaborating on the details of these tests [166].

Table 5: Identified A/B targets, with description.

A/B target	Description	Number of occurrences
Algorithm	An updated version of an algorithm, such as a recommendation algorithm [175], a search ranking algorithm [86], or an ad serving algorithm [16].	58
Visual elements	A change in visual components, such as updates to the design of a website [22] or a general update of the user interface [40].	33
Workflow / Process	A change in the workflow of an application, such as adding a feedback button to a dashboard [110], or a change in a user workflow, such as the process of a virtual assistant tool [96].	28
Back-end	An improvement of a software component that is not directly visible to the user, such as test server optimizations [127] or tuning application parameters for better performance [60].	10
New application functionality	Newly introduced functionality, such as a new widget on a web page [28] or additional content presented to the user after a search query [112].	6
Other	This category comprises three other A/B targets: different timing and content of sent e-mails [174], varying educational resources presented to the user [131], and the composition of a website page [157].	3
Unspecified	The A/B target was not specified in the study.	32

Table 6: Application domain vs. A/B target (rows: Web, Search engine, E-commerce, Interaction, Finance, Transportation, Other; columns: Algorithm, Visual elements, Workflow / Process, Back-end, New application functionality, Other).

Application domain vs. A/B target. We can now map the application domains against the targets of A/B testing. This analysis provides insights into the elements or components that are typically the subject of A/B testing in particular domains, or alternatively, into the A/B targets that remain unexplored in particular domains. Table 6 presents this mapping. We highlight a number of key observations:

A/B testing of algorithms is applied across all application domains, and in all main domains algorithms are the primary target of A/B testing. Commonly tested algorithms include feed ranking algorithms for social media websites, recommendation algorithms for news and multimedia websites, search ranking algorithms for search engines, and ad serving algorithms in the Web and search engine application domains.

A/B testing of visual elements is particularly common for search engines (16 studies) compared to other application domains such as the Web (with only 6 studies). Typical examples include changes in the font color of search engine results and changes in the position of ads on the results page [121].
Workflow and process elements are commonly applied as A/B targets across the main domains. This target is particularly common in the Web and e-commerce domains (with 8 and 7 studies, respectively). Typical examples are changes in the process by which the best-performing ads are determined on the advertising platform of JD, China's largest online retailer, and changes in the order allocation policy of on-demand meal delivery platforms.
For the Web and search engines, all types of A/B targets are applied. The main focus in the Web domain is on algorithms and workflow/processes, while the focus for search engines is on algorithms, visual elements, and the back-end. For the Web, we observe only one primary study with the back-end as A/B target. This study targets different configurations of microservices in an A/B test in order to tune individual microservices for performance. For search engines, on the other hand, we observed only three primary studies that target a workflow or process in A/B testing. One study evaluated a change in the wording of digital ads [18], another evaluated a change in advertising strategies [75], and the last one evaluated a payment option for "sponsored search" (to prioritize search results) [19].
For e-commerce, we observe that A/B testing is primarily used to test changes in ranking and recommendation algorithms, as well as in processes such as virtual assistants. Notably, we identified only one primary study that evaluated changes to the user interface [103].
A/B testing of back-end improvements is most common for search engines, while we did not identify any paper in the e-commerce and finance domains in which A/B testing was used for changes in the back-end.

RQ1: What is the subject of A/B testing? The main targets of A/B testing are algorithms, visual elements, workflow and processes, and back-end features. A/B testing is commonly applied in the domains of Web, search engines, e-commerce, interaction software, and finance. Algorithms are tested consistently across these domains. Visual elements are primarily evaluated in search engines and, counter-intuitively, not in e-commerce. Workflow and processes are common A/B targets in the Web and e-commerce domains. Back-end features such as server performance, on the other hand, are common targets for search engines.

Figure 4: Identified A/B test types.

4.3. RQ2: How are A/B tests designed? What is the role of stakeholders in this process?

To answer the second research question, we look at the following data items: A/B test type (D10), used metrics (D11), statistical methods used (D12), and role of stakeholders in experiment design (D13).

4.3.1. Design of A/B tests

To answer the first part of RQ2 (How are A/B tests designed?), we take a closer look at the design of A/B tests, focusing on the type of A/B tests and on the A/B metrics and statistical methods used in A/B tests.

A/B test type. The types of A/B tests comprise single classic A/B tests with two variants, A/B tests consisting of more than two variants (referred to as multi-armed A/B tests), multivariate A/B tests in which combinations of elements are tested in a single A/B test, and sequences of all these types. Figure 4 shows the frequencies of the different A/B test types extracted from the primary studies.

بشكل عام، حددنا 155 حالة من

أنواع الاختبارات، أي بمتوسط 1.13 حدوث لكل دراسة رئيسية (123 دراسة اعتبرت نوعًا واحدًا من)

اختبار، اعتبرت 17 دراسة نوعين، واعتبرت دراسة واحدة ثلاثة أنواع من الاختبارات). استخدمت الغالبية العظمى من الدراسات الأولية اختبارًا كلاسيكيًا واحدًا

اختبار مع متغير تحكم ومتغير علاج (95 حالة). يتم استخدام هذا الاختبار القياسي لاختبار مجموعة متنوعة من الأهداف. النوع الثاني الأكثر شيوعًا من

اختبار هو متعدد الأذرع

اختبار (30 حالة). يتكون هذا النوع من الاختبار من أكثر من متغيرين تحت الاختبار؛ على سبيل المثال، متغير تحكم كخط أساسي وثلاثة متغيرات علاجية مع نسخة مميزة لكل منها. تُستخدم هذه الاختبارات عادةً لتقييم إصدارات متعددة من خوارزمية التوصية، مثل 141، 149، ولتجربة خوارزميات تقديم الإعلانات المختلفة، مثل 155. النوع الثالث الأكثر شيوعًا من

الاختبار هو تسلسل كلاسيكي

اختبارات (24 حالة). تشمل الأمثلة هنا مقارنة عدة متغيرات بأسلوب متسلسل يتم تنفيذه يدويًا (على عكس أسلوب متعدد الأذرع

اختبار حيث يتم نشر جميع المتغيرات في وقت واحد) [63، اختبار يدوي لعدة تكرارات من خوارزميات التعلم الآلي بشكل متسلسل 105، وتنفيذ تلقائي لتسلسل من

اختبارات للتعامل مع إصدار ميزات مُتحكم فيه في

[139]. آخر تحديد

نوع الاختبار هو متعدد المتغيرات

اختبار (6 مرات). هذا النوع من الاختبار يقيم مجموعات متنوعة من المتعددة

الجدول 7: المحددات

المقاييس.

مقياس A/B

عدد

حالات

مقاييس التفاعل

225

مقاييس النقر

المؤشرات النقدية

مقاييس الأداء

المؤشرات السلبية

٣٤

عرض المقاييس

مقاييس التغذية الراجعة

ميزات. على عكس نظام متعدد الأذرع

اختبار، متعدد المتغيرات

اختبار يتيح اختبار متغيرات أكثر من ميزة واحدة في شكل واحد

اختبار. مثال على ذلك هو مقارنة تركيبات مختلفة من عناصر واجهة المستخدم المتنوعة 40.
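To make the distinction between multi-armed and multivariate A/B tests concrete, the following minimal Python sketch enumerates the variants under test for both types. The feature names and options (`button_color`, `headline`, the `ranker_*` labels) are hypothetical and are not taken from any primary study.

```python
from itertools import product

# Multi-armed A/B test: a single feature with a control and several treatments.
multi_armed_variants = ["control", "ranker_v2", "ranker_v3", "ranker_v4"]

# Multivariate A/B test: several features, every combination is one variant.
features = {
    "button_color": ["blue", "green"],
    "headline": ["short", "long", "question"],
}


def multivariate_variants(feature_options):
    """Return every combination of feature options, one dict per variant."""
    names = list(feature_options)
    return [
        dict(zip(names, combo))
        for combo in product(*(feature_options[name] for name in names))
    ]


variants = multivariate_variants(features)
print(len(multi_armed_variants))  # number of arms for the single feature
print(len(variants))              # number of feature combinations
```

The number of multivariate variants is the product of the number of options per feature, so it grows quickly as features are added.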

A/B metrics. Table 7 lists the A/B metrics that we extracted from the primary studies. In total, 493 occurrences of A/B metrics were reported in the primary studies. With a total of 198 experiments across the 141 studies, this gives an average of 2.12 metrics per experiment (ranging from 1 to 8 metrics per experiment). The most common group of A/B metrics are engagement metrics (225 occurrences), which refer to the number of conversions, the number of user sessions, the time users spend on a website, and metrics related to the usage of the application or website (e.g., number of posts rated, number of bookings made). The next biggest group are click metrics (82 occurrences). Examples include the number of clicks, clicks per query, and good click rate.

The third group of A/B metrics we identified are monetary metrics, i.e., revenue and costs (64 occurrences). Examples include the number of purchases, order value, revenue per e-mail opening, and cost of advertising. The next group are performance metrics (50 occurrences). Examples include the response time of an application, bandwidth used, end-to-end latency, or audio playback delay. The remaining groups are metrics that track undesired effects in A/B tests (34 occurrences, e.g., abandonment rate or number of unsubscriptions), views (21 occurrences, e.g., number of page views or number of product views), and user feedback (17 occurrences, e.g., number of customer complaints or textual feedback).

Statistical methods. Table 8 groups the types of statistical methods used for A/B tests in the primary studies. The most used statistical method is hypothesis testing for equality (94 occurrences in total). The main test used in this group is Student's t-test, e.g., [75, 71]. Other tests in this group are the Kolmogorov-Smirnov test, e.g., [140], the Mann-Whitney test, e.g., [137], and the Wilcoxon signed-rank test, e.g., [156]. Of the 94 occurrences of this type of hypothesis test, 37 primary studies did not report the concrete test used in the analysis of the results. The second most used method is bootstrapping (11 occurrences). This method builds multiple data sets by resampling the original data set [46]. The newly generated data sets are then typically used to test hypotheses of equality. The main benefit of this technique is the improved sensitivity obtained in the analysis of the results. However, a significant downside is that it is computationally expensive, especially for larger data sets [110]. The next most used statistical methods are hypothesis tests for inference and goodness-of-fit methods (both 8 occurrences). Examples of hypothesis tests for inference include a Bayesian analysis approach to ensure that concurrently running experiments do not interfere with each other [89], and a Bayesian approach to infer the causal effect of running ad campaigns [15]. Examples of goodness-of-fit methods include sequential testing methods that rely on likelihood ratio tests [83], and the Wald test [81]. The remaining groups are correction methods (7 occurrences), such as the Bonferroni correction [177]; custom estimators for observations in A/B tests (6 occurrences), e.g., an estimator that takes variances into account [109]; hypothesis tests for independence (5 occurrences) [150]; and regression methods (2 occurrences), such as CUPED [48].

Table 8: Statistical methods used during A/B testing.

Statistical method	Number of occurrences
Hypothesis - equality	57
Hypothesis - equality (concrete method unspecified)	37
Bootstrapping	11
Hypothesis - inference	8
Goodness of fit	8
Correction method	7
Estimator	6
Hypothesis - independence	5
Regression method	2
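As an illustration of the two most frequently reported methods in Table 8, the sketch below computes a t statistic for equality of means and a bootstrap confidence interval for the difference in means. It is a minimal example under stated assumptions, not the procedure of any primary study: it uses Welch's unequal-variance variant of the t-test, a percentile bootstrap, and hypothetical function names and parameters (`welch_t`, `bootstrap_diff_ci`, `n_resamples`).

```python
import math
import random
import statistics


def welch_t(control, treatment):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    mean_c, mean_t = statistics.fmean(control), statistics.fmean(treatment)
    var_c, var_t = statistics.variance(control), statistics.variance(treatment)
    n_c, n_t = len(control), len(treatment)
    se2 = var_c / n_c + var_t / n_t
    t = (mean_t - mean_c) / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = se2 ** 2 / (
        (var_c / n_c) ** 2 / (n_c - 1) + (var_t / n_t) ** 2 / (n_t - 1)
    )
    return t, df


def bootstrap_diff_ci(control, treatment, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the difference in means."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping its original size.
        c = rng.choices(control, k=len(control))
        t = rng.choices(treatment, k=len(treatment))
        diffs.append(statistics.fmean(t) - statistics.fmean(c))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

The resampling loop makes the computational cost noted above visible: every resample touches the full data set, so the cost scales with the number of resamples times the sample size.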

4.3.2. Role of stakeholders

To address the second part of RQ2 (What is the role of stakeholders in the design of A/B tests?), we analyze the role that stakeholders play in the design of A/B tests.

Stakeholder roles. Table 9 lists the different roles of stakeholders in the design of A/B tests that we extracted from the primary studies, together with their tasks, descriptions, and examples. We identified three main roles: concept designer (127 occurrences), experiment architect (111 occurrences), and setup technician (31 occurrences). The role of the concept designer consists of conceiving new ideas for A/B tests. The role of the experiment architect consists of calibrating the technical parameters of the experiment, such as its duration. The role of the setup technician consists of taking the steps necessary to enable the execution of the A/B test. The main task of the concept designer is designing and tuning the variants of A/B tests (67 occurrences). The main task of the experiment architect is determining the duration of A/B tests (60 occurrences). Finally, the main task of the setup technician is carrying out post-design activities for A/B tests (25 occurrences).

4.3.3. Cross analysis of A/B test design

We discuss two mappings of data items: the role stakeholders play in the design of A/B tests versus the A/B test type; and the A/B metrics used in the experiments versus the statistical methods adopted.

Stakeholder tasks versus A/B test type. The mapping of stakeholder tasks in the design of A/B tests across the types of A/B tests is shown in Table 10. We observe the following:

The core stakeholder tasks across all types of A/B tests are designing and tuning variants, and determining the duration, the population, and the goal or hypothesis of the experiments. These numbers confirm that these are essential design tasks for any A/B test.
The majority of studies that use multi-armed A/B tests and sequences of A/B tests report designing and tuning variants as an important stakeholder task (22 and 13 occurrences, respectively).

Table 9: Roles and tasks of stakeholders in the design of A/B tests (Occ. is short for number of occurrences).

Role	Task	Task description	Occ.
Concept designer (127)	Design and tune variants	Designing and tuning the variants of the A/B test. Examples include tuning the A/B variants [141], or designing variants for different populations (e.g., old versus new users) [21].	67
	Determine goal or hypothesis	Formulating the goal or hypothesis of the A/B test itself. Examples include setting a goal to find the best performing news selection algorithm [50] or defining a hypothesis upfront for the A/B test [5].	48
	Perform pre-design actions	Actions taken before designing the A/B test. Examples include providing the motivation for A/B tests [157] or conducting offline A/B tests before moving to online A/B testing [72].	12
Experiment architect (111)	Determine duration	Determining the duration of the A/B test. Examples include choosing a fixed experiment duration (e.g., one week) [5] or setting a fixed expiry date [106].	60
	Determine population assignment	Determining the population that should take part in the A/B test. Examples include a simple split of all users [163], an assignment in which the target population is determined over a two-week period [173], or an assignment in which network effects have to be taken into account [102].	51
Setup technician (31)	Perform post-design actions	Actions taken after completing the design of the A/B test. Examples include performing a preliminary test before running the A/B test [179, 33], validating the design of the A/B test [110], or scheduling the execution of the A/B test [157].	25
	Perform metric analysis and initialization	Analysis and possible initialization of the metrics for the A/B test. An example consists of creating a custom A/B utility metric with negative and positive weights associated with user actions during a search session [112].	6

Table 10: Stakeholder tasks versus A/B test type.

Task	Classic single A/B test (95)	Multi-armed A/B test (30)	Sequence of A/B tests (24)	Multivariate A/B test (6)
Design and tune variants	33	22	13	2
Duration	45	9	11	2
Population assignment	37	7	8	2
Goal/hypothesis	27	17	8	2
Post-design actions	12	1	5	0
Pre-design actions	6	4	2	1
Metric analysis/initialization	5	1	0	0

Table 11: Statistical methods versus A/B metrics (H is short for hypothesis).

Method	Engag.	Click	Monetary	Negative	Perf.	View	Feedback
H - equality	31	14	7	10	4	7	2
H - equality (unspecified)	24	12	8	8	11	5	5
Bootstrapping	9	2	2	3	3	1	1
H - inference	5	1	0	0	1	0	0
Goodness of fit	5	1	2	1	0	0	0
Correction method	4	1	1	1	2	0	1
Estimator	4	1	2	1	0	1	0
H - independence	2	2	3	0	0	1	0
Regression method	1	1	1	0	0	1	0

Since these types of tests involve multiple variants under test, studies often specify more details about the variants and the rationale behind the choice of variants to be tested.

Determining the goal or hypothesis of the A/B test is mentioned frequently for multi-armed A/B tests (17 occurrences). Compared to the classic two-variant A/B test, which typically involves a control variant and a modified variant that aims to improve on the control, multi-armed A/B tests include more than two variants, so practitioners often formulate hypotheses about the expected performance of each variant.
Post-design actions are reported more commonly for sequences of A/B tests (5 occurrences). For example, one primary study mentions modeling the sequence of A/B tests [139], another mentions defining the success condition of A/B tests before executing them [151], and another refers to presenting the range of outcomes for A/B tests [152].
Only a few primary studies report pre-design actions or metric analysis and initialization, regardless of the type of A/B test.
A/B metrics versus statistical methods used. The statistical methods used across the different types of A/B metrics are shown in Table 11.
Engagement metrics and click metrics are used across all types of statistical methods.
The concrete method used to test hypotheses of equality is often not specified, across all types of A/B metrics. For monetary and performance metrics in particular, the majority of studies do not mention the concrete hypothesis testing method (8 and 11 occurrences, respectively). This may be due to sensitivity in reporting results for these types of metrics.

Table 12: Data collected for A/B tests.

Data collected	Number of occurrences
Product/system data	48
User-centric data	26
Spatio-temporal data	20
Secondary data	6

Negative metrics are mainly used with hypothesis tests for equality (10 and 8 occurrences for hypothesis equality and for hypothesis equality with unspecified method, respectively).
The hypothesis test for independence is mostly used for monetary metrics; however, its use is rare (3 occurrences).
The use of feedback metrics is also uncommon, and when they are used, the concrete statistical method is mostly not reported (5 occurrences).

RQ2: How are A/B tests designed? What is the role of stakeholders in this process? The primary type of A/B test is the classic single A/B test, followed by multi-armed A/B tests and sequences of A/B tests. Engagement metrics are the dominant type of A/B metrics used in A/B testing. Other prominent A/B metrics include click, monetary, and performance metrics. Hypothesis testing for equality is by far the most used statistical method in A/B testing. Notably, 37 of the 94 occurrences of equality testing do not specify the concrete method used. Stakeholders have two main roles in the design of A/B tests: concept designer and experiment architect. A third role, setup technician, is reported less frequently.

4.4. RQ3: How are A/B tests executed? What is the role of stakeholders in this process?

4.4.1. Execution of A/B tests

To address the first part of RQ3 (How are A/B tests executed?), we analyze the data collected during A/B tests, the evaluation methods used, and the use of A/B test results.

Data collected. Table 12 lists the categories of data collected during the execution of A/B tests. We identified four types of data. Product or system data is the most commonly reported in the primary studies (48 occurrences). This category includes the type of browser used by the end-user, the operating system of the end-user, specific information about the hardware used to interact with the application, and general information related to the use of the system (e.g., information tracking the item categories of products in an e-commerce application, and the types of search queries processed during the A/B test). Next most common is user-centric data (26 occurrences). This category contains data on how the end-user interacts with the system as well as personal information of end-users. Examples include the scroll characteristics of users on a web application, the browsing history of end-users, user feedback, and the use of the age or current occupation of the end-user during analysis. The third most commonly reported category is spatio-temporal data (20 occurrences), which groups data related to geographic location and data related to time. Examples include timestamps of requests made to an application, the creation date of accounts taking part in the A/B test, and spatial information such as the country and region of end-users. Finally, some primary studies report the use of secondary data (6 occurrences). Data in this category corresponds to A/B metrics that are not used as primary evaluation metrics for the A/B tests. Examples include the number of clicks or page views that are used for additional analysis after the A/B tests have been conducted.

Table 13: Evaluation method used in the primary studies.

Evaluation method	Number of occurrences
Empirical evaluation	100
Simulation based on real empirical data	26
Simulation	15
Illustrative example	10
Case study	5
Theoretical	1

Evaluation method. Table 13 summarizes the identified evaluation methods. The vast majority of primary studies present results from an empirical evaluation (100 occurrences), i.e., the execution of A/B tests in live systems. A substantial number of studies use historical data from previously conducted A/B tests to simulate new A/B tests (26 occurrences), while a number of studies (15 occurrences) use simulation without historical data as evaluation method. Lastly, some studies use illustrative examples (10 occurrences) or case studies (5 occurrences), and one primary study provides a theoretical evaluation [121].

Use of test results. Table 14 lists the uses of test results extracted from the primary studies. The use of test results refers to what stakeholders do with the data and analyses obtained from A/B tests, e.g., using the results to design further A/B tests. As the table shows, the main uses of A/B test results are feature selection and feature rollout (71 and 24 occurrences, respectively). A number of studies aim to validate the effectiveness of the A/B testing process itself (12 occurrences). Using test results to trigger a subsequent A/B test appears not to be well explored (4 occurrences).

4.4.2. Role of stakeholders

To address the second part of RQ3 (What is the role of stakeholders in this process?), we analyze the role of stakeholders in the execution of A/B tests.

Stakeholder roles. Table 15 lists the different roles of stakeholders in the execution of A/B tests that we extracted from the primary studies, together with the associated tasks, descriptions, and examples. We identified two main roles: experiment coordinator (40 occurrences) and experiment assessor (37 occurrences). The role of the experiment coordinator consists of managing the execution of the A/B test. The role of the experiment assessor consists of assessing the A/B test results, which may require taking additional actions. The main task of the experiment coordinator is experiment supervision (19 occurrences). The main task of the experiment assessor is experiment post-analysis (17 occurrences).

4.4.3. Cross analysis of A/B test execution

We take a closer look at two mappings of data items related to the execution of A/B tests: the use of test results versus stakeholder tasks in the execution of A/B tests; and the evaluation method versus stakeholder tasks in the execution of A/B tests.

Use of test results versus stakeholder tasks in the execution of A/B tests. The first analysis concerns the use of test results and the tasks that stakeholders perform in the execution of A/B tests. The results are shown in Table 16. We highlight some key observations:

Experiment supervision is applied regardless of the use of the test results. For feature rollout as use of A/B test results, the experiment supervision task is mentioned most often. Supervision is a key task in this context to ensure that the rollout is carried out in a risk-free manner (i.e., without causing harm to users) [165, 28].

Table 14: Use of test results gathered from A/B test execution (Occ. is short for number of occurrences).

Use of test results	Description	Occ.
Feature selection	The results of the A/B tests are used to determine which variant presents an improvement to the application. Examples include selecting a new version of a ranking algorithm [125, 28] or recommendation algorithm [65], and selecting a different visual design [8].	71
Feature rollout	The results of the A/B tests are used to decide whether to continue or halt the rollout of a feature, as illustrated for instance by practitioners at Microsoft [165, 49].	24
Continued feature development	The results of the A/B tests are used as a driver for further feature development, e.g., refining newly proposed A/B metrics based on periodicity patterns after obtaining promising results [45], and developing personalization methods further [6].	17
Subsequent A/B test design	The results of the A/B tests are used for future A/B test design, e.g., suggesting alternative A/B variants to test in future A/B tests [96], and designing a new A/B test to further test the quality of a metric prediction model.	15
Validate effectiveness of A/B testing process	The results of the A/B tests are used to demonstrate the effectiveness of new or improved A/B testing approaches proposed by the authors. Examples include evaluating a newly proposed counterfactual framework for running seller-side A/B tests in two-sided marketplaces [77], and validating a new statistical methodology for continuous monitoring of A/B tests [82].	12
Validate research question	The A/B test is used to validate a research question posed by the authors. One example consists of investigating a hypothesis about the conditions under which firms should pay for search engine advertising [19].	10
Bug detection/fixing	The results of the A/B tests are used to detect potential bugs or to verify bug fixes, e.g., searching for data quality issues in A/B tests of machine learning models to uncover potential bugs [105].	5
Subsequent A/B test execution	The results of the A/B tests are used to execute subsequent A/B tests, e.g., using the results of A/B tests to automatically determine which A/B tests to execute next [151].	4
Unspecified	The use of test results is not specified in the study.	24

Table 15: Identified roles and concrete tasks of stakeholders during the execution of A/B tests (Occ. is short for number of occurrences).

Role	Task	Task description	Occ.
Experiment coordinator (40)	Experiment supervision	Closely monitoring and following up on the execution of A/B tests [151, 43].	19
	Experiment alteration	Altering aspects of the A/B test during its execution. Examples include adjusting the population assignment of the experiment [33], or adjusting the A/B variants themselves [152].	12
	Experiment termination	Stopping A/B tests when necessary. Examples include manually stopping A/B tests when enough data has been collected [96], or stopping the experiment early when harm is observed [89].	9
Experiment assessor (37)	Experiment post-analysis	Various steps taken after analyzing the results of the A/B test. Examples include double-checking the results obtained from executed A/B tests [71], conducting a deeper analysis of suspicious results [60], or applying bias reduction techniques to the data retrieved from A/B tests [110].	17
	Experiment triggering	Starting the execution of (subsequent) A/B tests [165, 22].	13
	Other	This category includes some specialized tasks, such as documenting the results and learnings from conducting the A/B test [144], rerunning A/B tests [112], or incorporating user feedback into the analysis of A/B tests [106].	7

Table 16: Use of test results versus stakeholder tasks in experiment execution ("Cont. feature dev." is short for "Continued feature development", "Val. effectiveness" is short for "Validate effectiveness", and "Val. RQ" is short for "Validate research question").

Use	Supervision	Post-analysis	Triggering	Alteration	Termination
Feature selection	8	11	6	8	4
Feature rollout	10	4	6	6	4
Cont. feature dev.	7	3	5	2	3
A/B test design	6	0	5	2	3
Val. effectiveness	1	2	1	1	1
Val. RQ	1	1	0	1	1
Bug detection/fixing	4	0	3	2	2
A/B test execution	1	0	1	0	0

The experiment post-analysis task is typically only reported for experiments that are fully completed (i.e., that do not go through additional rounds of iteration). In the primary studies where the results of the A/B tests are used for subsequent A/B test design, we identified no occurrences where stakeholders perform a post-analysis of the experiment results.
For subsequent A/B test design, the experiment triggering task is often mentioned. This is expected, since newly designed A/B tests also have to be executed. In addition, experiment termination is also mentioned often (e.g., terminating an experiment due to poor results [76]).
For bug detection and fixing, stakeholders typically supervise experiments (either to detect potential bugs in the code or to ensure the effectiveness of a bug fix) [57], and trigger experiments (i.e., explicitly launching an experiment to fix a known bug in the application) [105].

Evaluation method versus stakeholder tasks in the execution of A/B tests. In addition, we analyze the tasks that stakeholders perform during the execution of A/B tests across the evaluation methods. This mapping is shown in Table 17. We highlight a number of key points:

All stakeholder tasks in the execution of A/B tests are widely reported in the case of empirical evaluation.
For simulation based on real empirical data, the post-analysis task is reported more than any other task. An example is searching for outliers in the analysis of the results of A/B tests, and using historical experiments to confirm their effect [78].
Primary studies that use simulation as evaluation method rarely specify the tasks that stakeholders perform in the execution of A/B tests. We hypothesize that, since simulation allows for a more controlled way of conducting A/B tests, the tasks that stakeholders perform after the design of the A/B tests are not relevant.
The only stakeholder task reported for theoretical evaluation is experiment alteration (primary study [121]).

Table 17: Evaluation method versus stakeholder tasks in A/B test execution ("emp. sim." is short for "simulation based on real empirical data", "ill." is short for "illustrative").

Method	Supervision	Post-analysis	Triggering	Alteration	Termination	Other
Empirical	14	13	10	10	6	6
Emp. sim.	2	4	1	0	1	0
Simulation	1	1	1	0	0	0
Ill. example	2	0	1	2	2	1
Case study	0	0	0	0	0	0
Theoretical	0	0	0	1	0	0

Table 18: List of identified open problems.

Open problem category	Open problem sub-category	Number of occurrences
Evaluation-related	Extend the evaluation	21
	Provide thorough analysis of the approach	16
	Other evaluation-related	34
Process-related	Add process guidelines	9
	Automate the process	7
Quality-related	Enhance scalability	7
	Enhance applicability	6

RQ3: How are A/B tests executed in the system? What is the role of stakeholders in this process? The main types of data collected during A/B test execution relate to the product/system, the users, and spatio-temporal aspects. The dominant evaluation method used in A/B testing is empirical evaluation, but a substantial number of studies also use simulation. A/B test results are primarily used for feature selection, followed by feature rollout and continued feature development. (Automatic) subsequent A/B test execution is only marginally used. The main reported roles of stakeholders in A/B test execution are experiment coordinator (with experiment supervision as main task) and experiment assessor (with experiment post-analysis as main task).

4.5. RQ4: What are the reported open research problems in the field of A/B testing?

To answer RQ4, we analyze the results of the data item open problems (D18).
Table 18 presents a categorization of the open problems we identified in the primary studies. For each category, we devised concrete sub-categories of open problems. We discuss each type of open problem with illustrative examples.

First, we identified three sub-categories of open problems related to the evaluation of the proposed approach: (1) extensions to the evaluation of the approach presented in the primary study, (2) a more thorough analysis of the approach presented in the primary study, and (3) other evaluation-related open problems in the primary study.

Extend the evaluation. Drutsa et al. [45] explore periodicity patterns in user engagement metrics, and their influence on engagement metrics in A/B tests. Moreover, the authors propose new A/B metrics that take these periodicity patterns into account, resulting in a more sensitive A/B test analysis. The authors evaluated the proposed metrics on historical A/B test data from Yandex, though they mention that further evaluation of the approach could be done in different domains such as social networks, e-mail services, and video/image hosting services. From a slightly different angle, Barajas et al. developed a technique to determine the causal effects of marketing campaigns on users, focusing on the campaign itself rather than only on the design of the advertisement media. The authors provide specific guidelines on the randomization and assignment of users to ad campaigns, and present a technique to estimate the causal effect that the campaigns have on the users under test. As a point of future work, the authors pose a different evaluation question concerning what would have happened if the technique had been applied to the entire population.

Provide thorough analysis of the approach. An example of this category is mentioned by Peska and Vojtas [126]. The authors propose a method to evaluate recommendation algorithms in small e-commerce applications both offline and online via A/B testing. The approach compares the results of the offline evaluation of recommendation algorithms with the results of online A/B testing of the algorithms. Moreover, the authors use this data to build a prediction model that identifies promising recommendation algorithms more effectively, thanks to the knowledge gained from online A/B testing. As future work, the authors mention that more work is necessary to verify the causality of an effect observed in the analysis of the offline data and the online A/B testing data. In another primary study by Madlberger and Jizdny [114], the authors analyze the impact of social media marketing on click-through rates and customer engagement. To that end, they run multiple social media marketing campaigns using A/B testing and evaluate hypotheses about the impact of visual and content aspects of the advertisements on the click-through rates of end-users. As future research, the authors mention that a more thorough investigation is necessary to determine why some of the hypotheses in the study were rejected.

Other evaluation-related. An example of other evaluation-related open problems is outlined by Gruson et al. The authors propose a methodology based on counterfactual analysis to evaluate recommendation algorithms, leveraging both offline evaluation and online evaluation via A/B testing. The methodology involves A/B testing recommendations on a subset of the population, and using the results of these tests to reduce the bias in offline evaluations of the recommendation algorithm based on historical data. As open problems, the authors mention exploring additional metrics for the approach, as well as potential improvements to the estimators they use in the approach. Another example is put forward by Ju et al., who present an alternative to the standard A/B test with a fixed-horizon hypothesis test by introducing a sequential test. Traditionally in A/B testing, the hypothesis of the test is tested after a fixed period of time and conclusions are drawn based on the final result. The sequential test introduced by the authors does not have a predetermined number of observations; rather, at multiple points during the experiment, the test determines whether the hypothesis can be accepted or rejected, or whether more observations are needed. For future work, the authors want to support [...] experiments in their approach, as well as extend the procedure to data that follows a non-binary distribution. In a final example, Gui et al. [73] study ways to deal with network effects in the results of A/B tests. One of the core assumptions of A/B testing is that users are only affected by the A/B variant they are assigned to. However, network effects can undermine this assumption due to the interaction between users in the population. The authors demonstrate the presence of network effects at LinkedIn, and propose an estimator for the average treatment effect that also takes potential network effects into account. As a line of future research, the authors want to explore ways to strengthen the approach so that it can better handle real-life phenomena.
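The accept, reject, or continue logic of a sequential test described above can be sketched with Wald's classic sequential probability ratio test (SPRT) for binary outcomes. This is a generic textbook procedure swapped in for illustration, not the specific test proposed by Ju et al.; the function name `sprt_bernoulli` and the parameters `p0`, `p1`, `alpha`, and `beta` are assumptions made for this example.

```python
import math


def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.2):
    """Wald's SPRT for H0: p = p0 versus H1: p = p1 on 0/1 observations.

    Returns (decision, n_used), where decision is "accept_h1", "accept_h0",
    or "continue" if the data ran out before a boundary was crossed.
    """
    upper = math.log((1 - beta) / alpha)  # crossing above accepts H1
    lower = math.log(beta / (1 - alpha))  # crossing below accepts H0
    llr = 0.0  # running log-likelihood ratio of H1 against H0
    for n, x in enumerate(observations, start=1):
        if x:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept_h1", n
        if llr <= lower:
            return "accept_h0", n
    return "continue", len(observations)
```

Unlike a fixed-horizon test, the number of observations consumed here depends on the data: strong evidence in either direction stops the test early, and otherwise more observations are requested.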

Second, we established two sub-categories of process-related open problems: (1) guidelines for the A/B testing process, and (2) automation of aspects of the A/B testing process.

Add process guidelines. In an effort to provide more fine-grained A/B testing guidelines in the e-commerce domain, Goswami et al. [71] cover controlled experiments for decision-making in the context of e-commerce search. Considerations such as how to prioritize projects for A/B testing in smaller stores and how to run A/B tests during holiday periods are left as open questions. A different primary study covering the benefits of controlled experimentation at scale is presented by Fabijan et al. [57]. In this study, the authors provide several examples of executed A/B tests, and the corresponding lessons learned from these experiments. One of the open problems listed in the study relates to providing guidance on detecting patterns between leading and lagging metrics.

Automate the process. Mattos et al. [118] present a step towards automated continuous experimentation. The authors propose an architecture framework that accommodates the automated execution of A/B tests and the automatic generation of A/B variants. To validate the framework, an A/B test was conducted with a robot. One of the open challenges put forward in the study relates to the ability to automatically generate hypotheses for A/B tests based on the collected data. Duivesteijn et al. [49] present an approach that leverages exceptional model mining techniques to target A/B variants to subgroups in the population under test. Rather than deploying only the best performing variant of the A/B test, the authors propose to run both variants (if resources allow) and to target specific variants to individual users based on the inferred subgroups. One potential avenue for future research consists of developing a framework that enables automated personalization of websites backed by A/B testing.

Finally, we established two sub-categories of quality-related open problems: (1) enhancing the scalability of the proposed approach, and (2) enhancing the applicability of the approach.

Enhance scalability. One example is given by Zhao et al. [179]. In order to obtain a causal explanation behind the results of A/B tests, the authors propose to segment the population, and then analyze the results of the A/B test in the individual segments. For future work, the authors mention developing a more scalable solution that integrates the approach into their existing experimentation platform. To address online experimentation specifically for cloud applications, Toslali et al. [153] present Jackpot, a system for online experimentation in the cloud. Jackpot supports multivariate A/B testing and ensures proper management of interactions in the cloud application during the execution of A/B tests. As an avenue for future work, the authors mention ways to deal with the limited scalability of multivariate experimentation, since the number of potential experiments grows exponentially with the number of elements to be tested.

Enhance applicability. One such study explores A/B testing in the automotive industry [11]. The study addresses concerns about the limited sample size of A/B tests, which arises from the limited number of participants that can take part in A/B tests in this industry. To overcome this hurdle, the authors provide specific guidelines for performing A/B testing and for determining the assignment of users to either the control or the treatment group of the test. One limitation, however, relates to the necessity of pre-experiment data to ensure a balanced allocation of the population across both variants. In an effort to increase the sensitivity of A/B testing, Liou and Taylor [109] propose a new estimator for A/B testing that takes the variance of individual users into account. To do so, pre-experiment data of individual users is analyzed and their variances are computed. To validate the approach, a sample of 100 previously conducted A/B tests was collected and analyzed with the new approach. One significant limitation pointed out by the authors is that a stronger assumption about the homogeneity of the treatment effect is needed for the approach to remain unbiased.
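The idea of weighting users by their variance can be sketched with a generic inverse-variance weighted difference of means. This is an illustrative stand-in and not necessarily the exact estimator of [109]; the function names and the representation of each user as an `(outcome, pre_experiment_variance)` pair are assumptions made for this example.

```python
def weighted_mean(values, variances):
    """Inverse-variance weighted mean: noisier users get smaller weights."""
    weights = [1.0 / v for v in variances]
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)


def variance_weighted_effect(control, treatment):
    """Difference of inverse-variance weighted means between two groups.

    Each group is a list of (outcome, pre_experiment_variance) pairs,
    and every variance must be strictly positive.
    """
    c_vals, c_vars = zip(*control)
    t_vals, t_vars = zip(*treatment)
    return weighted_mean(t_vals, t_vars) - weighted_mean(c_vals, c_vars)
```

Because users no longer contribute equally, such an estimate matches the average treatment effect only if the effect does not differ systematically between low-variance and high-variance users, which is the kind of homogeneity assumption mentioned above.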

RQ4: What are the reported open research problems in the field of A/B testing? The most commonly reported open problems relate to the proposed approach, in particular improving the approach, extending it, and providing a thorough analysis. Other, less frequently reported open problems relate to the A/B testing process, in particular adding guidelines for the A/B testing process and automating the process. Finally, a number of studies point to open problems related to quality properties, in particular enhancing the scalability and applicability of the proposed approach.

Table 19: Research topics of the primary studies.

Topic	Number of occurrences	Primary studies
Application of A/B testing	51	[175, 123, 95, 102, 16, 15, 121, 171, 99, 148, 33, 70, 66, 52, 20, 174, 107, 63, 155, 150, 65, 163, 170, 27, 143, 2, 5, 135, 149, 7, 98, 147, 141, 173, 19, 26, 8, 114, 6, 122, 50, 97, 136, 125, 22, 124, 128, 159, 67, 3, 176]
Improving efficiency of A/B testing	20	[1, 28, 127, 164, 23, 85, 47, 39, 44, 86, 40, 45, 46, 109, 100, 83, 18, 78, 37, 64]
Beyond standard A/B testing	18	[..., 126, 29, 118, 75, 49, 117, 30, 151]
Concrete A/B testing problems	17	[138, 73, 168, 105, 146, 43, 162, 14, 103, 111, 71, 24, 153, 137, 96, 101, 25]
Pitfalls and challenges of A/B testing	13	[91, 88, 54, 60, 167, 42, 169, 120, 11, 41, 110, 140, 90]
Experimentation frameworks and platforms	13	[144, 154, 106, 156, 108, 9, 131, 74, 21, 36, 179, 177, 152]
A/B testing at scale	9	[89, 160, 58, 165, 81, 157, 76, 57, 56]

5. Discussion

In this section, we discuss a number of additional insights we obtained. We start with the research topics studied by the primary studies. Next, we look at the environments and tools used for A/B testing. Then, we report a number of opportunities for future research. We conclude with a discussion of threats to the validity of the study.

5.1. Research topics

During the data extraction of the 141 primary studies, we noted the general topics of the primary studies and classified the studies under 7 research topics. Table 19 summarizes these seven topics. Note that studies may touch on overlapping topics. We now briefly explain each category and give some examples from the primary studies.

5.1.1. Application of A/B testing

The main focus of the primary study is the use and application of A/B testing as an evaluation tool for the main subject of the study (e.g., evaluating a new recommendation algorithm, a redesign of an interface, etc.).

5.1.2. Improving efficiency of A/B testing

This topic concerns improving the process of A/B testing by exploring ways to improve sensitivity in A/B test data [44, 127, 164, 85], investigating sequential testing techniques to stop A/B tests as early as possible [83, 86, 1], proposing techniques to detect invalid A/B tests [28], and using additional data such as periodicity patterns in user behavior to improve A/B testing [45].
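One well-known way to improve sensitivity with pre-experiment data is CUPED, which this review lists among the regression methods [48]. The sketch below shows the basic adjustment under simple assumptions: a single pre-experiment covariate per user and a coefficient estimated from the data passed in (in practice, typically the pooled data of both variants). The function name `cuped_adjust` is hypothetical.

```python
import statistics


def cuped_adjust(outcomes, covariates):
    """Return CUPED-adjusted outcomes: y - theta * (x - mean(x)).

    `covariates` holds one pre-experiment measurement per user and must
    not be constant, otherwise theta is undefined.
    """
    mean_x = statistics.fmean(covariates)
    mean_y = statistics.fmean(outcomes)
    cov_xy = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(covariates, outcomes)
    )
    var_x = sum((x - mean_x) ** 2 for x in covariates)
    theta = cov_xy / var_x  # regression slope of the outcome on the covariate
    return [y - theta * (x - mean_x) for x, y in zip(covariates, outcomes)]
```

The adjustment leaves the mean of the outcomes unchanged but removes the part of their variance that the pre-experiment covariate explains, so the same difference between variants can be detected with fewer users.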

Table 20: Environments and tools used for A/B testing.

Environment	Number of occurrences
In-house experimentation system	20
Research tool or prototype	13
Commercial A/B testing tool	10
Commercial non-A/B testing tool	7
User survey	1

5.1.3. Beyond standard A/B testing

This topic concerns techniques that go beyond standard A/B testing, such as the use of new types of A/B metrics [166, 48, 112], the use of counterfactuals in the evaluation of A/B tests [77, 134], investigating ways to automate parts of the A/B testing process [139, 118, 117, 151], improving or changing the A/B testing process [79, 29], and investigating ways to combine offline and online A/B testing [72, 126].

5.1.4. Concrete A/B testing problems

This topic comprises studies concerned with A/B testing in specific domains and with specific types of A/B testing. Examples include A/B testing specifically in the e-commerce domain [96, 71], network A/B testing and A/B testing in marketplaces [103, 73, 24], A/B testing in the domain of cyber-physical systems with digital twins [43], or A/B testing of mobile applications [101, 168].

5.1.5. Pitfalls and challenges of A/B testing

This topic concerns pitfalls associated with conducting A/B testing [54, 88, 42, 41], or (domain-specific) challenges related to A/B testing [169, 120, 110].

5.1.6. Experimentation frameworks and platforms

This topic covers papers that present an A/B testing platform [106, 152, 144], or a framework for aspects related to the A/B testing process, such as a framework for detecting data loss in A/B tests [74], a framework for designing A/B tests [36], or a framework for personalizing A/B testing [154].

5.1.7. A/B testing at scale

The primary studies under this topic focus on conducting A/B testing at scale, such as considerations for running A/B tests at scale [81, 157, 76], process models or guidelines for A/B testing at scale [58, 165], or scalable solutions such as a scalable statistical method for measuring quantile treatment effects on performance metrics in A/B tests [160].

5.2. Environments and tools used for A/B testing

In addition to the research topics addressed in the primary studies, we also analyzed the environments and tools that were used to run A/B tests, see Table 20.

The most frequently reported type of environment is an in-house experimentation system for A/B testing (20 occurrences), e.g., the dedicated environments developed by companies such as Microsoft [105], Google [152], eBay [157], and Etsy [83]. These environments generally support the execution of A/B tests. Moreover, some primary studies describe concrete features of the experimentation system that help in designing A/B tests, such as controlling for bias when specifying A/B tests in Airbnb's experiment reporting framework [100]. Next, we note research tools and prototypes (13 occurrences). Examples include a tool for conducting online experiments in the cloud [153], a research prototype for A/B testing implemented in NodeJS [139], a tool for A/B testing with decision assistants [96], and a tool that enables the automatic execution of multiple A/B tests [151]. The remaining environments we identified are commercial A/B testing tools (10 occurrences), such as Optimizely [122] and Google Analytics [22]; commercial non-A/B testing tools (7 occurrences), such as Crazy Egg [22], a heat-mapping tool used to design A/B variants, and Yahoo Gemini (an advertising platform), used to test different ad strategies [114]; and a user survey (1 occurrence), used to determine which A/B variants to test by conducting a preliminary survey.

5.3. Research opportunities and future research directions

From our study, we propose a number of potential research directions in the field of A/B testing. Concretely, we present three research areas: research on further improving the general process of A/B testing, research on automating aspects of A/B testing, and research on the adoption of proposed statistical methods in A/B testing.

5.3.1. Improving the A/B testing process

One future direction concerns the considerations that arise when running many A/B tests simultaneously [152]. Several studies touch on this topic, for example by discussing lessons learned from unexpected A/B test results caused by other A/B tests that were running in parallel [54], or by manually checking for potential effects of running A/B tests through an analysis of the A/B tests executed in the system [157]. However, we did not come across a study that provides a systematic approach to tackle this problem.

Another area for future research is improving sensitivity in A/B tests by, for example, combining different sensitivity improvement techniques as pointed out by Drutsa et al. [44], enabling proactive prediction of user behavior in A/B tests based on historical data [45], and studying A/B test estimators in more depth to achieve better sensitivity as mentioned by Poyarkov et al. [127].
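As an illustration of what a sensitivity improvement technique can look like, the sketch below applies a covariate adjustment in the style of CUPED, a commonly used variance reduction technique. This section does not name CUPED, and the sketch is not claimed to be the method of [44], [45], or [127]. It assumes that a pre-experiment measurement of the same metric is available for each user; the function name and the synthetic data are hypothetical.

```python
import random
import statistics


def cuped_adjust(metric, covariate):
    """Return metric values adjusted by a pre-experiment covariate.

    theta is the regression coefficient cov(X, Y) / var(X); subtracting
    theta * (x - mean(X)) keeps the mean of the metric unchanged while
    removing the variance explained by the covariate.
    """
    mean_x = statistics.fmean(covariate)
    mean_y = statistics.fmean(metric)
    cov_xy = sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(covariate, metric)) / (len(metric) - 1)
    theta = cov_xy / statistics.variance(covariate)
    return [y - theta * (x - mean_x) for x, y in zip(covariate, metric)]


# Synthetic example (assumption): in-experiment engagement is strongly
# correlated with each user's pre-experiment engagement.
rng = random.Random(42)
pre = [rng.gauss(10.0, 3.0) for _ in range(2000)]
post = [x + rng.gauss(0.5, 1.0) for x in pre]

adjusted = cuped_adjust(post, pre)
raw_var = statistics.variance(post)
adj_var = statistics.variance(adjusted)

if __name__ == "__main__":
    print(f"variance before adjustment: {raw_var:.3f}")
    print(f"variance after adjustment:  {adj_var:.3f}")
```

A lower variance of the adjusted metric means that the same difference between variants can be detected with fewer users. In practice, the coefficient would typically be estimated on the pooled data of all variants, and the adjusted values would then be compared per variant.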

A final avenue for future research on improving the A/B testing process concerns providing additional guidelines and design principles for selecting and engineering A/B metrics. We highlight two primary studies that mention open problems related to this opportunity: Kharitonov et al. [85] put forward learning sensitive combinations of A/B metrics as a general open problem, and Duan et al. [48] raise the investigation of the dynamics between surrogate metrics and the actual underlying metric.

5.3.2. Automation

In efforts to establish continuous experimentation, multiple studies have presented steps that companies can take to develop an experimentation culture, e.g., [58, 169, 55]. In light of scaling this experimentation culture, (partial) automation of the A/B testing process is essential to enable and promote continuous experimentation [28, 71]. Initial research on automating steps in A/B tests has been conducted, as shown for example by Tamburrelli et al. [151] and Mattos et al. [118], see Sections 5.1.3 and 4.5.2. However, the current state of research on this topic indicates that further investigation and more in-depth solutions are needed to take full advantage of the automated design and execution of A/B tests. In addition, a number of open problems remain whose solution could facilitate and enable automated experimentation, such as determining which A/B tests should be prioritized during execution [71], and automatically generating insights into the rationale and cause of experiment results for experiment developers to guide product development [169].

5.3.3. Adoption and tailoring of statistical methods

Although a number of primary studies discuss the use of bootstrapping to evaluate the results of A/B tests [154, 2, 71], bootstrapping remains largely unexplored in A/B testing, even though this statistical method has the potential to improve the analysis of A/B test results [81, 14]. Moreover, bootstrapping can be an invaluable tool for providing statistical insights into A/B test results that might not be obtainable with, for instance, a standard equality test [51]. However, a significant drawback of bootstrapping is that it requires expensive computation [110]. Besides adopting well-known statistical methods, designing and tailoring new statistical methods to fit specific experimentation scenarios is an interesting research direction. One example is mentioned by Kharitonov [86], who puts forward the design of a tailored statistical test for non-binary A/B metrics. Another example concerns taking "effects from multiple treatments with various metrics of interest" into account to tailor the approach presented by Tu et al. [154] for optimal treatment assignments in A/B testing by leveraging causal effect estimates.
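The following minimal sketch illustrates how bootstrapping could be applied to the results of an A/B test with a non-binary metric, using a percentile confidence interval for the difference in means. It is an illustration under stated assumptions, not the procedure of any primary study: the function name, the number of resamples, and the synthetic data are hypothetical.

```python
import random
import statistics


def bootstrap_diff_ci(control, treatment, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for mean(treatment) - mean(control)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each variant with replacement, keeping its original size.
        c = rng.choices(control, k=len(control))
        t = rng.choices(treatment, k=len(treatment))
        diffs.append(statistics.fmean(t) - statistics.fmean(c))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Synthetic data (assumption): a continuous metric such as session length.
data_rng = random.Random(7)
control = [data_rng.gauss(10.0, 2.0) for _ in range(500)]
treatment = [data_rng.gauss(11.0, 2.0) for _ in range(500)]

observed = statistics.fmean(treatment) - statistics.fmean(control)
lo, hi = bootstrap_diff_ci(control, treatment)

if __name__ == "__main__":
    print(f"observed difference: {observed:.3f}")
    print(f"95% bootstrap interval: [{lo:.3f}, {hi:.3f}]")
```

The cost noted above is visible in the loop: the work grows with the number of resamples times the sample size, which becomes significant for A/B tests with millions of users.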

Besides the limited number of primary studies that use bootstrapping in the analysis of A/B tests, a large number of studies refer to statistically significant results or p-values in the analysis of the conducted A/B tests without specifying the statistical test that was used (37 occurrences). Furthermore, a large number of studies do not report anything about the statistical analysis (47 occurrences). We argue that this information is important to report in research publications, and we urge authors to specify the statistical methods used to obtain the results in their studies.
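As a small illustration of what specifying the statistical test could look like, the sketch below runs a two-sided two-proportion z-test on conversion counts and returns a result that names the test alongside the p-value. The choice of test, the function name, the report fields, and the example counts are assumptions made for illustration.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> dict:
    """Two-sided z-test for equality of two conversion rates (pooled variance)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1.0 - p_pool) * (1.0 / n_a + 1.0 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return {
        "test": "two-sided two-proportion z-test (pooled)",
        "rate_a": p_a,
        "rate_b": p_b,
        "z": z,
        "p_value": p_value,
    }


if __name__ == "__main__":
    # Hypothetical counts: 200/2000 conversions for A, 260/2000 for B.
    report = two_proportion_z_test(200, 2000, 260, 2000)
    for key, value in report.items():
        print(f"{key}: {value}")
```

Reporting the test name together with the statistic and the p-value lets readers judge whether the assumptions of the test fit the metric that was analyzed.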

5.4. Threats to validity

In this section, we list the potential threats to the validity of the systematic literature review [10].

5.4.1. Internal validity

Internal validity refers to the extent to which a causal conclusion based on a study is warranted. One threat to internal validity is the potential bias of the researchers who perform the systematic literature review, which may affect the data collected and the insights derived in the study. To mitigate this threat, we involved multiple researchers in the study, who were responsible for selecting papers, extracting data, and analyzing the results. In each step, cross-checking was applied to minimize bias. Additional researchers were involved if no consensus was reached. In addition, we defined a rigorous protocol for the systematic literature review.

5.4.2. External validity

External validity refers to the extent to which the findings of the study can be generalized to the general field of A/B testing. One threat to the validity of this systematic literature review is that not all relevant work is covered. To mitigate this threat, we first searched all the main digital libraries that publish work in computer science. Second, we defined the search string by including all commonly used terms for A/B testing to ensure that relevant work is retrieved. Lastly, we also applied snowballing to the papers selected from the automatic search query to uncover additional work that we might have missed.

5.4.3. Conclusion validity

Conclusion validity refers to the extent to which we obtained the right measure and whether we defined the right scope in relation to what is considered research in the field of A/B testing. One threat to conclusion validity is the quality of the selected studies; studies of low quality may produce insights that are not justified or not applicable to the general field of A/B testing. To mitigate this threat, we excluded short papers, demo papers, and roadmap papers from the study. Furthermore, we assessed a quality score for each selected paper. Papers whose quality score did not meet the required threshold were excluded from the study.

5.4.4. Reliability

Reliability refers to the extent to which this work can be reproduced if the study were conducted again. To mitigate this threat, we make all collected and processed data available online. We also defined a specific search string, a list of online sources, and other specifics in the research protocol to ensure reproducibility. Researcher bias is also a threat here, as it affects whether similar results would be obtained if the systematic literature review were conducted again by a different set of reviewers.

6. Conclusion

A/B testing supports data-driven decisions about the adoption of features. It is widely used across various industries and by key technology companies such as Google, Meta, and Microsoft. In this systematic literature review, we identified the subjects of A/B tests, how A/B tests are designed and executed, and the open research problems reported in the literature. We observed that algorithms, visual elements, and changes to a workflow or process are tested most often, with the web, search engines, and e-commerce being the most popular application domains for A/B testing. Concerning the design of A/B tests, classic A/B tests with two variants are commonly used, alongside engagement metrics such as conversion rate or number of impressions as the metric for gauging the potential of the A/B variants. Hypothesis tests for equality testing are broadly used to analyze A/B test results, and bootstrapping also garners interest in a few primary studies. We devised three roles that stakeholders take on in the design of A/B tests: concept designer, experiment architect, and setup technician. Concerning the execution of A/B tests, empirical evaluation is the leading evaluation method. Besides the main A/B metrics, data concerning the product or system and user-centric data are collected most often to enable a deeper analysis of the results of A/B tests.

A/B testing is most commonly used to determine and deploy the better performing variant, or to gradually roll out a feature. Lastly, we devised two roles that stakeholders take on in the execution of A/B tests: experiment contributor and experiment assessor.

We identified seven categories of open problems: improving the proposed approaches, extending the evaluation of the proposed approach, providing a thorough analysis of the proposed approach, adding A/B testing process guidelines, automating the A/B testing process, enhancing scalability, and enhancing applicability. Leveraging these categories and the observations made during the analysis, we put forward three main lines of research opportunities: developing more in-depth solutions to automate stages of the A/B testing process; improving the A/B testing process by examining promising avenues for sensitivity improvement, systematic solutions for dealing with the interference of many A/B tests running at once, and guidelines and design principles for selecting and engineering A/B metrics; and lastly adopting and tailoring more sophisticated statistical methods such as bootstrapping to further strengthen the analysis of A/B testing.

Acknowledgments

We thank Michiel Provoost for his support of this study.

References

[1] Vineet Abhishek and Shie Mannor. 2017. A Nonparametric Sequential Test for Online Randomized Experiments. In Proceedings of the 26th International Conference on World Wide Web Companion (Perth, Australia) (WWW’17 Companion). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 610-616. https://doi.org/10.1145/3041021.3054196
[2] Deepak Agarwal, Bo Long, Jonathan Traupman, Doris Xin, and Liang Zhang. 2014. LASER: A Scalable Response Prediction Platform for Online Advertising. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining (New York, New York, USA) (WSDM ’14). Association for Computing Machinery, New York, NY, USA, 173-182. https://doi.org/10.1145/2556195.2556252
[3] Michal Aharon, Yohay Kaplan, Rina Levy, Oren Somekh, Ayelet Blanc, Neetai Eshel, Avi Shahar, Assaf Singer, and Alex Zlotnik. 2019. Soft Frequency Capping for Improved Ad Click Prediction in Yahoo Gemini Native. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (Beijing, China) (CIKM ’19). Association for Computing Machinery, New York, NY, USA, 2793-2801. https://doi.org/10.1145/3357384.3357801
[4] Michal Aharon, Oren Somekh, Avi Shahar, Assaf Singer, Baruch Trayvas, Hadas Vogel, and Dobri Dobrev. 2019. Carousel Ads Optimization in Yahoo Gemini Native. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (Anchorage, AK, USA) (KDD ’19). Association for Computing Machinery, New York, NY, USA, 1993-2001. https://doi.org/10.1145/3292500.3330740
[5] Luca Aiello, Ioannis Arapakis, Ricardo Baeza-Yates, Xiao Bai, Nicola Barbieri, Amin Mantrach, and Fabrizio Silvestri. 2016. The Role of Relevance in Sponsored Search. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management (Indianapolis, Indiana, USA) (CIKM ’16). Association for Computing Machinery, New York, NY, USA, 185-194. https://doi.org/10.1145/2983323.2983840
[6] Ryuya Akase, Hiroto Kawabata, Akiomi Nishida, Yuki Tanaka, and Tamaki Kaminaga. 2021. Related Entity Expansion and Ranking Using Knowledge Graph. In Complex, Intelligent and Software Intensive Systems, Leonard Barolli, Kangbin Yim, and Tomoya Enokido (Eds.). Springer International Publishing, Cham, 172-184.
[7] Rafael Alfaro-Flores, José Salas-Bonilla, Loic Juillard, and Juan Esquivel-Rodríguez. 2021. Experiment-driven improvements in Human-in-the-loop Machine Learning Annotation via significance-based A/B testing. In 2021 XLVII Latin American Computing Conference (CLEI). 1-9. https://doi.org/10.1109/CLEI53233.2021.9639977
[8] Joana Almeida and Beatriz Casais. 2022. Subject Line Personalization Techniques and Their Influence in the E-Mail Marketing Open Rate. In Information Systems and Technologies, Alvaro Rocha, Hojjat Adeli, Gintautas Dzemyda, and Fernando Moreira (Eds.). Springer International Publishing, Cham, 532-540.
[9] Xavier Amatriain. 2013. Beyond Data: From User Information to Business Value through Personalized Recommendations and Consumer Science. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management (San Francisco, California, USA) (CIKM ’13). Association for Computing Machinery, New York, NY, USA, 2201-2208. https://doi.org/10.1145/2505515.2514701
[10] Apostolos Ampatzoglou, Stamatia Bibi, Paris Avgeriou, Marijn Verbeek, and Alexander Chatzigeorgiou. 2019. Identifying, categorizing and mitigating threats to validity in software engineering secondary studies. Information and Software Technology 106 (2019), 201-230. https://doi.org/10.1016/j.infsof.2018.10.006
[11] Nirupama Appiktala, Miao Chen, Michael Natkovich, and Joshua Walters. 2017. Demystifying dark matter for online experimentation. In 2017 IEEE International Conference on Big Data (Big Data). 1620-1626. https://doi.org/10.1109/BigData.2017.8258096
[12] F. Auer and M. Felderer. 2018. Current State of Research on Continuous Experimentation: A Systematic Mapping Study. In 2018 44th Euromicro Conference on Software Engineering and Advanced Applications (SEAA). IEEE Computer Society, Los Alamitos, CA, USA, 335-344. https://doi.org/10.1109/SEAA.2018.00062
[13] Florian Auer, Rasmus Ros, Lukas Kaltenbrunner, Per Runeson, and Michael Felderer. 2021. Controlled experimentation in continuous experimentation: Knowledge and challenges. Information and Software Technology 134 (2021), 106551. https://doi.org/10.1016/j.infsof.2021.106551
[14] Eytan Bakshy and Eitan Frachtenberg. 2015. Design and Analysis of Benchmarking Experiments for Distributed Internet Services. In Proceedings of the 24th International Conference on World Wide Web (Florence, Italy) (WWW ’15). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 108-118. https://doi.org/10.1145/2736277.2741082
[15] Joel Barajas, Jaimie Kwon, Ram Akella, Aaron Flores, Marius Holtan, and Victor Andrei. 2012. Marketing Campaign Evaluation in Targeted Display Advertising. In Proceedings of the Sixth International Workshop on Data Mining for Online Advertising and Internet Economy (Beijing, China) (ADKDD ’12). Association for Computing Machinery, New York, NY, USA, Article 5, 7 pages. https://doi.org/10.1145/2351356.2351361
[16] Joel Barajas, Jaimie Kwon, Ram Akella, Aaron Flores, Marius Holtan, and Victor Andrei. 2012. Measuring Dynamic Effects of Display Advertising in the Absence of User Tracking Information. In Proceedings of the Sixth International Workshop on Data Mining for Online Advertising and Internet Economy (Beijing, China) (ADKDD ’12). Association for Computing Machinery, New York, NY, USA, Article 8, 9 pages. https://doi.org/10.1145/2351356.2351364
[17] Victor R. Basili, Gianluigi Caldiera, and Dieter H. Rombach. 1994. The Goal Question Metric Approach. Vol. I. John Wiley & Sons.
[18] Tobias Blask. 2013. Applying Bayesian parameter estimation to A/B tests in e-business applications examining the impact of green marketing signals in sponsored search advertising. In 2013 International Conference on e-Business (ICE-B). 1-8.
[19] Tobias Blask, Burkhardt Funk, and Reinhard Schulte. 2011. Should companies bid on their own brand in sponsored search?. In Proceedings of the International Conference on e-Business. 1-8.
[20] Fedor Borisyuk, Siddarth Malreddy, Jun Mei, Yiqun Liu, Xiaoyi Liu, Piyush Maheshwari, Anthony Bell, and Kaushik Rangadurai. 2021. VisRel: Media Search at Scale. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining (Virtual Event, Singapore) (KDD ’21). Association for Computing Machinery, New York, NY, USA, 2584-2592. https://doi.org/10.1145/3447548.3467081
[21] Slava Borodovsky and Saharon Rosset. 2011. A/B Testing at SweetIM: The Importance of Proper Statistical Analysis. In 2011 IEEE 11th International Conference on Data Mining Workshops. 733-740. https://doi.org/10.1109/ICDMW.2011.19
[22] Alex Brown, Binky Lush, and Bernard J. Jansen. 2016. Pixel efficiency analysis: A quantitative web analytics approach. Proceedings of the Association for Information Science and Technology 53, 1 (2016), 1-10. https://doi.org/10.1002/pra2.2016.14505301040
[23] Roman Budylin, Alexey Drutsa, Ilya Katsev, and Valeriya Tsoy. 2018. Consistent Transformation of Ratio Metrics for Efficient Online Controlled Experiments. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (Marina Del Rey, CA, USA) (WSDM ’18). Association for Computing Machinery, New York, NY, USA, 55-63. https://doi.org/10.1145/3159652.3159699
[24] Tianchi Cai, Daxi Cheng, Chen Liang, Ziqi Liu, Lihong Gu, Huizhi Xie, Zhiqiang Zhang, Xiaodong Zeng, and Jinjie Gu. 2021. LinkLouvain: Link-Aware A/B Testing and Its Application on Online Marketing Campaign. In Database Systems for Advanced Applications, Christian S. Jensen, Ee-Peng Lim, De-Nian Yang, Wang-Chien Lee, Vincent S. Tseng, Vana Kalogeraki, Jen-Wei Huang, and Chih-Ya Shen (Eds.). Springer International Publishing, Cham, 499-510.
[25] Javier Cámara and Alfred Kobsa. 2009. Facilitating Controlled Tests of Website Design Changes: A Systematic Approach. In Web Engineering, Martin Gaedke, Michael Grossniklaus, and Oscar Díaz (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 370-378.
[26] Samit Chakraborty, Md. Saiful Hoque, Naimur Rahman Jeem, Manik Chandra Biswas, Deepayan Bardhan, and Edgar Lobaton. 2021. Fashion Recommendation Systems, Models and Methods: A Review. Informatics 8, 3 (2021). https://doi.org/10.3390/informatics8030049
[27] Guangde Chen, Bee-Chung Chen, and Deepak Agarwal. 2017. Social Incentive Optimization in Online Social Networks. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (Cambridge, United Kingdom) (WSDM ’17). Association for Computing Machinery, New York, NY, USA, 547-556. https://doi.org/10.1145/3018661.3018700
[28] Nanyu Chen, Min Liu, and Ya Xu. 2019. How A/B Tests Could Go Wrong: Automatic Diagnosis of Invalid Online Experiments. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining (Melbourne VIC, Australia) (WSDM ’19). Association for Computing Machinery, New York, NY, USA, 501-509. https://doi.org/10.1145/3289600.3291000
[29] Russell Chen, Miao Chen, Mahendrasinh Ramsinh Jadav, Joonsuk Bae, and Don Matheson. 2017. Faster online experimentation by eliminating traditional A/A validation. In 2017 IEEE International Conference on Big Data (Big Data). 1635-1641. https://doi.org/10.1109/BigData.2017.8258098
[30] Emmanuelle Claeys, Pierre Gançarski, Myriam Maumy-Bertrand, and Hubert Wassner. 2017. Regression Tree for Bandits Models in A/B Testing. In Advances in Intelligent Data Analysis XVI, Niall Adams, Allan Tucker, and David Weston (Eds.). Springer International Publishing, Cham, 52-62.
[31] Rafael Costa, Elie Cheniaux, Pedro Rosaes, Marcele Carvalho, Rafael Freire, Márcio Versiani, Bernard Range, and Antonio Nardi. 2011. The effectiveness of cognitive behavioral group therapy in treating bipolar disorder: A randomized controlled study. Revista brasileira de psiquiatria (São Paulo, Brazil : 1999) 33 (06 2011), 144-9. https://doi.org/10.1590/S1516-44462011000200009
[32] John Creswell and Timothy Guetterman. 2018. Educational Research: Planning, Conducting, and Evaluating Quantitative and Qualitative Research, 6th Edition. Pearson, New York, NY, USA.
[33] Xinyi Dai, Yunjia Xi, Weinan Zhang, Qing Liu, Ruiming Tang, Xiuqiang He, Jiawei Hou, Jun Wang, and Yong Yu. 2021. Beyond Relevance Ranking: A General Graph Matching Framework for Utility-Oriented Learning to Rank. ACM Trans. Inf. Syst. 40, 2, Article 25 (nov 2021), 29 pages. https://doi.org/10.1145/3464303
[34] Maya Daneva, Daniela Damian, Alessandro Marchetto, and Oscar Pastor. 2014. Empirical research methodologies and studies in Requirements Engineering: How far did we come? Journal of Systems and Software 95 (2014), 1-9. https://doi.org/10.1016/j.jss.2014.06.035
[35] Rico de Feijter, Rob van Vliet, Erik Jagroep, Sietse Overbeek, and Sjaak Brinkkemper. 2017. Towards the adoption of DevOps in software product organizations: A maturity model approach. Technical Report. Utrecht University.
[36] Wagner S. De Souza, Fernando O. Pereira, Vanessa G. Albuquerque, Jorge Melegati, and Eduardo Guerra. 2022. A Framework Model to Support A/B Tests at the Class and Component Level. In 2022 IEEE 46th Annual Computers, Software, and Applications Conference (COMPSAC). 860-865. https://doi.org/10.1109/COMPSAC54236.2022.00136
[37] Alex Deng. 2015. Objective Bayesian Two Sample Hypothesis Testing for Online Controlled Experiments. In Proceedings of the 24th International Conference on World Wide Web (Florence, Italy) (WWW ’15 Companion). Association for Computing Machinery, New York, NY, USA, 923-928. https://doi.org/10.1145/2740908.2742563
[38] Alex Deng, Tianxi Li, and Yu Guo. 2014. Statistical Inference in Two-Stage Online Controlled Experiments with Treatment Selection and Validation. In Proceedings of the 23rd International Conference on World Wide Web (Seoul, Korea) (WWW ’14). Association for Computing Machinery, New York, NY, USA, 609-618. https://doi.org/10.1145/2566486.2568028
[39] Alex Deng, Yicheng Li, Jiannan Lu, and Vivek Ramamurthy. 2021. On Post-Selection Inference in A/B Testing. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining (Virtual Event, Singapore) (KDD ’21). Association for Computing Machinery, New York, NY, USA, 2743-2752. https://doi.org/10.1145/3447548.3467129
[40] Drew Dimmery, Eytan Bakshy, and Jasjeet Sekhon. 2019. Shrinkage Estimators in Online Experiments. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (Anchorage, AK, USA) (KDD ’19). Association for Computing Machinery, New York, NY, USA, 2914-2922. https://doi.org/10.1145/3292500.3330771
[41] Pavel Dmitriev, Brian Frasca, Somit Gupta, Ron Kohavi, and Garnet Vaz. 2016. Pitfalls of long-term online controlled experiments. In 2016 IEEE International Conference on Big Data (Big Data). 1367-1376. https://doi.org/10.1109/BigData.2016.7840744
[42] Pavel Dmitriev, Somit Gupta, Dong Woo Kim, and Garnet Vaz. 2017. A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Halifax, NS, Canada) (KDD ’17). Association for Computing Machinery, New York, NY, USA, 1427-1436. https://doi.org/10.1145/3097983.3098024
[43] Jürgen Dobaj, Andreas Riel, Thomas Krug, Matthias Seidl, Georg Macher, and Markus Egretzberger. 2022. Towards Digital Twin-Enabled DevOps for CPS Providing Architecture-Based Service Adaptation & Verification at Runtime. In Proceedings of the 17th Symposium on Software Engineering for Adaptive and Self-Managing Systems (Pittsburgh, Pennsylvania) (SEAMS ’22). Association for Computing Machinery, New York, NY, USA, 132-143. https://doi.org/10.1145/3524844.3528057
[44] Alexey Drutsa, Gleb Gusev, and Pavel Serdyukov. 2015. Future User Engagement Prediction and Its Application to Improve the Sensitivity of Online Experiments. In Proceedings of the 24th International Conference on World Wide Web (Florence, Italy) (WWW ’15). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 256-266. https://doi.org/10.1145/2736277.2741116
[45] Alexey Drutsa, Gleb Gusev, and Pavel Serdyukov. 2017. Periodicity in User Engagement with a Search Engine and Its Application to Online Controlled Experiments. ACM Trans. Web 11, 2, Article 9 (apr 2017), 35 pages. https://doi.org/10.1145/2856822
[46] Alexey Drutsa, Gleb Gusev, and Pavel Serdyukov. 2017. Using the Delay in a Treatment Effect to Improve Sensitivity and Preserve Directionality of Engagement Metrics in A/B Experiments. In Proceedings of the 26th International Conference on World Wide Web (Perth, Australia) (WWW ’17). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 1301-1310. https://doi.org/10.1145/3038912.3052664
[47] Alexey Drutsa, Anna Ufliand, and Gleb Gusev. 2015. Practical Aspects of Sensitivity in Online Experimentation with User Engagement Metrics. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management (Melbourne, Australia) (CIKM ’15). Association for Computing Machinery, New York, NY, USA, 763-772. https://doi.org/10.1145/2806416.2806496
[48] Weitao Duan, Shan Ba, and Chunzhe Zhang. 2021. Online Experimentation with Surrogate Metrics: Guidelines and a Case Study. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining (Virtual Event, Israel) (WSDM ’21). Association for Computing Machinery, New York, NY, USA, 193-201. https://doi.org/10.1145/3437963.3441737
[49] Wouter Duivesteijn, Tara Farzami, Thijs Putman, Evertjan Peer, Hilde J. P. Weerts, Jasper N. Adegeest, Gerson Foks, and Mykola Pechenizkiy. 2017. Have It Both Ways-From A/B Testing to A&B Testing with Exceptional Model Mining. In Machine Learning and Knowledge Discovery in Databases, Yasemin Altun, Kamalika Das, Taneli Mielikäinen, Donato Malerba, Jerzy Stefanowski, Jesse Read, Marinka Žitnik, Michelangelo Ceci, and Sašo Džeroski (Eds.). Springer International Publishing, Cham, 114-126.
[50] Joshua Eckroth and Eric Schoen. 2019. A genetic algorithm for finding a small and diverse set of recent news stories on a given subject: How we generate aaai’s ai-alert. 33rd AAAI Conference on Artificial Intelligence, AAAI 2019, 31st Innovative Applications of Artificial Intelligence Conference, IAAI 2019 and the 9th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019 (2019), 9357-9364. https://www.scopus.com/inward/record.uri?eid=2-s2.0-85090801224&partnerID=40&md5=f3391d595e00df8a0cba7802c9043ebd
[51] B. Efron and R.J. Tibshirani. 1994. An Introduction to the Bootstrap. CRC Press. https://books.google.be/books?id=MWC1DwAAQBAJ
[52] Beyza Ermis, Patrick Ernst, Yannik Stein, and Giovanni Zappella. 2020. Learning to Rank in the Position Based Model with Bandit Feedback. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management (Virtual Event, Ireland) (CIKM ’20). Association for Computing Machinery, New York, NY, USA, 2405-2412. https://doi.org/10.1145/3340531.3412723
[53] Vladimir M. Erthal, Bruno P. de Souza, Paulo Sérgio M. dos Santos, and Guilherme H. Travassos. 2022. A Literature Study to Characterize Continuous Experimentation in Software Engineering. CIbSE 2022-XXV Ibero-American Conference on Software Engineering (2022). https://www.scopus.com/inward/record.uri?eid=2-s2.0-85137064966&partnerID=40&md5=04240b73ab90eb841083173be558b33f
[54] Maria Esteller-Cucala, Vicenc Fernandez, and Diego Villuendas. 2019. Experimentation Pitfalls to Avoid in A/B Testing for Online Personalization. In Adjunct Publication of the 27th Conference on User Modeling, Adaptation and Personalization (Larnaca, Cyprus) (UMAP’19 Adjunct). Association for Computing Machinery, New York, NY, USA, 153-159. https://doi.org/10.1145/3314183.3323853
[55] Aleksander Fabijan, Benjamin Arai, Pavel Dmitriev, and Lukas Vermeer. 2021. It takes a Flywheel to Fly: Kickstarting and Growing the A/B testing Momentum at Scale. In 2021 47th Euromicro Conference on Software Engineering and Advanced Applications (SEAA). 109-118. https://doi.org/10.1109/SEAA53835.2021.00023
[56] Aleksander Fabijan, Pavel Dmitriev, Colin McFarland, Lukas Vermeer, Helena Holmström Olsson, and Jan Bosch. 2018. Experimentation growth: Evolving trustworthy A/B testing capabilities in online software companies. Journal of Software: Evolution and Process 30, 12 (2018), e2113. https://doi.org/10.1002/smr.2113
[57] Aleksander Fabijan, Pavel Dmitriev, Helena Holmström Olsson, and Jan Bosch. 2017. The Benefits of Controlled Experimentation at Scale. In 2017 43rd Euromicro Conference on Software Engineering and Advanced Applications (SEAA). 18-26. https://doi.org/10.1109/SEAA.2017.47
[58] Aleksander Fabijan, Pavel Dmitriev, Helena Holmström Olsson, and Jan Bosch. 2017. The Evolution of Continuous Experimentation in Software Product Development: From Data to a Data-Driven Organization at Scale. In Proceedings of the 39th International Conference on Software Engineering (Buenos Aires, Argentina) (ICSE ’17). IEEE Press, Los Alamitos, CA, USA, 770-780. https://doi.org/10.1109/ICSE.2017.76
[59] A. Fabijan, P. Dmitriev, H. Holmstrom Olsson, and J. Bosch. 2020. The Online Controlled Experiment Lifecycle. IEEE Software 37, 02 (mar 2020), 60-67. https://doi.org/10.1109/MS.2018.2875842
[60] Aleksander Fabijan, Jayant Gupchup, Somit Gupta, Jeff Omhover, Wen Qin, Lukas Vermeer, and Pavel Dmitriev. 2019. Diagnosing Sample Ratio Mismatch in Online Controlled Experiments: A Taxonomy and Rules of Thumb for Practitioners. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (Anchorage, AK, USA) (KDD ’19). Association for Computing Machinery, New York, NY, USA, 2156-2164. https://doi.org/10.1145/3292500.3330722
[61] Aleksander Fabijan, Helena Holmström Olsson, and Jan Bosch. 2015. Customer Feedback and Data Collection Techniques in Software R&D: A Literature Review. In Software Business, João M. Fernandes, Ricardo J. Machado, and Krzysztof Wnuk (Eds.). Springer International Publishing, Cham, 139-153. https://doi.org/10.1007/978-3-319-19593-3_12
[62] Aleksander Fabijan, Helena Holmström Olsson, and Jan Bosch. 2016. The Lack of Sharing of Customer Data in Large Software Organizations: Challenges and Implications. In Agile Processes, in Software Engineering, and Extreme Programming, Helen Sharp and Tracy Hall (Eds.). Springer International Publishing, Cham, 39-52. https://doi.org/10.1007/978-3-319-33515-5_4
[63] Yaron Fairstein, Elad Haramaty, Arnon Lazerson, and Liane Lewin-Eytan. 2022. External Evaluation of Ranking Models under Extreme Position-Bias. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining (Virtual Event, AZ, USA) (WSDM ’22). Association for Computing Machinery, New York, NY, USA, 252-261. https://doi.org/10.1145/3488560.3498420
[64] Elea McDonnell Feit and Ron Berman. 2019. Test & Roll: Profit-Maximizing A/B Tests. Marketing Science 38, 6 (2019), 1038-1058. https://doi.org/10.1287/mksc.2019.1194
[65] Antonino Freno. 2017. Practical Lessons from Developing a Large-Scale Recommender System at Zalando. In Proceedings of the Eleventh ACM Conference on Recommender Systems (Como, Italy) (RecSys ’17). Association for Computing Machinery, New York, NY, USA, 251-259. https://doi.org/10.1145/3109859.3109897
[66] Kun Fu, Fanlin Meng, Jieping Ye, and Zheng Wang. 2020. CompactETA: A Fast Inference System for Travel Time Prediction. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (Virtual Event, CA, USA) (KDD ’20). Association for Computing Machinery, New York, NY, USA, 3337-3345. https://doi.org/10.1145/3394486.3403386
[67] Burkhardt Funk. 2009. Optimizing price levels in e-commerce applications: An empirical study. ICETE 2009 – International Joint Conference on e-Business and Telecommunications (2009), 37-43. https://www.scopus.com/inward/record.uri?eid=2-s2.0-74549181430&partnerID=40&md5=6dfdde67b807b3964c62fc8c1929dcf0 Cited by: 1.
[68] Matthias Galster and Danny Weyns. 2016. Empirical Research in Software Architecture: How Far have We Come?. In 2016 13th Working IEEE/IFIP Conference on Software Architecture (WICSA). IEEE Press, Los Alamitos, CA, USA, 11-20. https://doi.org/10.1109/WICSA.2016.10
[69] Federico Giaimo, Hugo Andrade, and Christian Berger. 2020. Continuous experimentation and the cyber-physical systems challenge: An overview of the literature and the industrial perspective. Journal of Systems and Software 170 (2020), 110781. https://doi.org/10.1016/j.jss.2020.110781
[70] Carlos A. Gomez-Uribe and Neil Hunt. 2016. The Netflix Recommender System: Algorithms, Business Value, and Innovation. ACM Trans. Manage. Inf. Syst. 6, 4, Article 13 (dec 2016), 19 pages. https://doi.org/10.1145/2843948
[71] Anjan Goswami, Wei Han, Zhenrui Wang, and Angela Jiang. 2015. Controlled experiments for decision-making in eCommerce search. In 2015 IEEE International Conference on Big Data (Big Data). IEEE Press, Los Alamitos, CA, USA, 1094-1102. https://doi.org/10.1109/BigData.2015.7363863
[72] Alois Gruson, Praveen Chandar, Christophe Charbuillet, James McInerney, Samantha Hansen, Damien Tardieu, and Ben Carterette. 2019. Offline Evaluation to Make Decisions About Playlist Recommendation Algorithms. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining (Melbourne VIC, Australia) (WSDM ’19). Association for Computing Machinery, New York, NY, USA, 420-428. https://doi.org/10.1145/3289600.3291027
[73] Huan Gui, Ya Xu, Anmol Bhasin, and Jiawei Han. 2015. Network A/B Testing: From Sampling to Estimation. In Proceedings of the 24th International Conference on World Wide Web (Florence, Italy) (WWW ’15). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 399-409. https://doi.org/10.1145/2736277.2741081
[74] Jayant Gupchup, Yasaman Hosseinkashi, Pavel Dmitriev, Daniel Schneider, Ross Cutler, Andrei Jefremov, and Martin Ellis. 2018. Trustworthy Experimentation Under Telemetry Loss. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (Torino, Italy) (CIKM ’18). Association for Computing Machinery, New York, NY, USA, 387-396. https://doi.org/10.1145/3269206.3271747
[75] Shubham Gupta and Sneha Chokshi. 2020. Digital Marketing Effectiveness Using Incrementality. In Advances in Computing and Data Sciences, Mayank Singh, P. K. Gupta, Vipin Tyagi, Jan Flusser, Tuncer Oren, and Gianluca Valentino (Eds.). Springer Singapore, Singapore, 66-75.
[76] Somit Gupta, Lucy Ulanova, Sumit Bhardwaj, Pavel Dmitriev, Paul Raff, and Aleksander Fabijan. 2018. The Anatomy of a Large-Scale Experimentation Platform. In 2018 IEEE International Conference on Software Architecture (ICSA). 1-109. https://doi.org/10.1109/ICSA.2018.00009
[77] Viet Ha-Thuc, Avishek Dutta, Ren Mao, Matthew Wood, and Yunli Liu. 2020. A Counterfactual Framework for Seller-Side A/B Testing on Marketplaces. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (Virtual Event, China) (SIGIR ’20). Association for Computing Machinery, New York, NY, USA, 2288-2296. https://doi.org/10.1145/3397271.3401434
[78] Yan He and Miao Chen. 2017. A Probabilistic, Mechanism-Indepedent Outlier Detection Method for Online Experimentation. In 2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA). 640-647. https://doi.org/10.1109/DSAA.2017.64
[79] Yan He, Lin Yu, Miao Chen, William Choi, and Don Matheson. 2022. A Cluster-Based Nearest Neighbor Matching Algorithm for Enhanced A/A Validation in Online Experimentation. In Companion Proceedings of the Web Conference 2022 (Virtual Event, Lyon, France) (WWW ’22). Association for Computing Machinery, New York, NY, USA, 136-140. https://doi.org/10.1145/3487553.3524220
[80] Jez Humble and David Farley. 2010. Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation (1st ed.). Addison-Wesley Professional, Illinois, IL, USA.
[81] Hao Jiang, Fan Yang, and Wutao Wei. 2020. Statistical Reasoning of Zero-Inflated Right-Skewed User-Generated Big Data A/B Testing. In 2020 IEEE International Conference on Big Data (Big Data). 1533-1544. https://doi.org/10.1109/BigData50022.2020.9377996
[82] Ramesh Johari, Pete Koomen, Leonid Pekelis, and David Walsh. 2017. Peeking at A/B Tests: Why It Matters, and What to Do about It. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Halifax, NS, Canada) (KDD ’17). Association for Computing Machinery, New York, NY, USA, 1517-1525. https://doi.org/10.1145/3097983.3097992
[83] Nianqiao Ju, Diane Hu, Adam Henderson, and Liangjie Hong. 2019. A Sequential Test for Selecting the Better Variant: Online A/B Testing, Adaptive Allocation, and Continuous Monitoring. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining (Melbourne VIC, Australia) (WSDM ’19). Association for Computing Machinery, New York, NY, USA, 492-500. https://doi.org/10.1145/3289600.3291025
[84] Staffs Keele et al. 2007. Guidelines for performing systematic literature reviews in software engineering. Technical Report. Technical report, Ver. 2.3 EBSE Technical Report. EBSE.
[85] Eugene Kharitonov, Alexey Drutsa, and Pavel Serdyukov. 2017. Learning Sensitive Combinations of A/B Test Metrics. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (Cambridge, United Kingdom) (WSDM ’17). Association for Computing Machinery, New York, NY, USA, 651-659. https://doi.org/10.1145/3018661.3018708
[86] Eugene Kharitonov, Aleksandr Vorobev, Craig Macdonald, Pavel Serdyukov, and Iadh Ounis. 2015. Sequential Testing for Early Stopping of Online Experiments. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (Santiago, Chile) (SIGIR ’15). Association for Computing Machinery, New York, NY, USA, 473-482. https://doi.org/10.1145/2766462.2767729
[87] Rochelle King, Elizabeth F. Churchill, and Caitlin Tan. 2017. Designing with Data: Improving the User Experience with A/B Testing. O’Reilly Media, Inc., Sebastopol, CA, USA.
[88] Ron Kohavi, Alex Deng, Brian Frasca, Roger Longbotham, Toby Walker, and Ya Xu. 2012. Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Beijing, China) (KDD ’12). Association for Computing Machinery, New York, NY, USA, 786-794. https://doi.org/10.1145/2339530.2339653
[89] Ron Kohavi, Alex Deng, Brian Frasca, Toby Walker, Ya Xu, and Nils Pohlmann. 2013. Online Controlled Experiments at Large Scale. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Chicago, Illinois, USA) (KDD ’13). Association for Computing Machinery, New York, NY, USA, 1168-1176. https://doi.org/10.1145/2487575.2488217
[90] Ron Kohavi, Alex Deng, Roger Longbotham, and Ya Xu. 2014. Seven Rules of Thumb for Web Site Experimenters. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (New York, New York, USA) (KDD ’14). Association for Computing Machinery, New York, NY, USA, 1857-1866. https://doi.org/10.1145/2623330.2623341
[91] Ron Kohavi and Roger Longbotham. 2011. Unexpected Results in Online Controlled Experiments. SIGKDD Explor. Newsl. 12, 2 (mar 2011), 31-35. https://doi.org/10.1145/1964897.1964905
[92] Ron Kohavi and Roger Longbotham. 2017. Online Controlled Experiments and A/B Testing. Encyclopedia of machine learning and data mining 7,8 (2017), 922-929.
[93] Ron Kohavi, Roger Longbotham, Dan Sommerfield, and Randal Henne. 2009. Controlled experiments on the web: Survey and practical guide. Data Mining and Knowledge Discovery 18 (02 2009), 140-181. https://doi.org/10.1007/s10618-008-0114-1
[94] Ron Kohavi, Diane Tang, and Ya Xu. 2020. Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press, Cambridge, United Kingdom. https://doi.org/10.1017/9781108653985
[95] Anastasiia Kornilova and Lucas Bernardi. 2021. Mining the Stars: Learning Quality Ratings with User-Facing Explanations for Vacation Rentals. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining (Virtual Event, Israel) (WSDM ’21). Association for Computing Machinery, New York, NY, USA, 976-983. https://doi.org/10.1145/3437963.3441812
[96] Kostantinos Koukouvis, Roberto Alcañiz Cubero, and Patrizio Pelliccione. 2016. A/B Testing in E-commerce Sales Processes. In Software Engineering for Resilient Systems, Ivica Crnkovic and Elena Troubitsyna (Eds.). Springer International Publishing, Cham, 133-148.
[97] Anuj Kumar and Kartik Hosanagar. 2017. Measuring the Value of Recommendation Links on Product Demand. SSRN Electronic Journal (01 2017). https://doi.org/10.2139/ssrn.2909971
[98] Ratnakar Kumar and Nitasha Hasteer. 2017. Evaluating usability of a web application: A comparative analysis of open-source tools. In 2017 2nd International Conference on Communication and Electronics Systems (ICCES). 350-354. https://doi.org/10.1109/CESYS.2017.8321296
[99] Mounia Lalmas, Janette Lehmann, Guy Shaked, Fabrizio Silvestri, and Gabriele Tolomei. 2015. Promoting Positive Post-Click Experience for In-Stream Yahoo Gemini Users. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Sydney, NSW, Australia) (KDD ’15). Association for Computing Machinery, New York, NY, USA, 1929-1938. https://doi.org/10.1145/2783258.2788581
[100] Minyong R. Lee and Milan Shen. 2018. Winner’s Curse: Bias Estimation for Total Effects of Features in Online Controlled Experiments. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (London, United Kingdom) (KDD ’18). Association for Computing Machinery, New York, NY, USA, 491-499. https://doi.org/10.1145/3219819.3219905
[101] Florian Lettner, Clemens Holzmann, and Patrick Hutflesz. 2013. Enabling A/B Testing of Native Mobile Applications by Remote User Interface Exchange. In Computer Aided Systems Theory – EUROCAST 2013, Roberto Moreno-Díaz, Franz Pichler, and Alexis Quesada-Arencibia (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 458-466.
[102] Chengbo Li, Lin Zhu, Guangyuan Fu, Longzhi Du, Canhua Zhao, Tianlun Ma, Chang Ye, and Pei Lee. 2021. Learning to Bundle Proactively for On-Demand Meal Delivery. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management (Virtual Event, Queensland, Australia) (CIKM ’21). Association for Computing Machinery, New York, NY, USA, 3898-3905. https://doi.org/10.1145/3459637.3481931
[103] Hannah Li, Geng Zhao, Ramesh Johari, and Gabriel Y. Weintraub. 2022. Interference, Bias, and Variance in Two-Sided Marketplace Experimentation: Guidance for Platforms. In Proceedings of the ACM Web Conference 2022 (Virtual Event, Lyon, France) (WWW ’22). Association for Computing Machinery, New York, NY, USA, 182-192. https://doi.org/10.1145/3485447.3512063
[104] Lihong Li, Jin Young Kim, and Imed Zitouni. 2015. Toward Predicting the Outcome of an A/B Experiment for Search Relevance. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (WSDM ’15). Association for Computing Machinery, New York, NY, USA, 37-46. https://doi.org/10.1145/2684822.2685311
[105] Paul Luo Li, Xiaoyu Chai, Frederick Campbell, Jilong Liao, Neeraja Abburu, Minsuk Kang, Irina Niculescu, Greg Brake, Siddharth Patil, James Dooley, and Brandon Paddock. 2021. Evolving Software to be ML-Driven Utilizing Real-World A/B Testing: Experiences, Insights, Challenges. In 2021 IEEE/ACM 43rd International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). 170-179. https://doi.org/10.1109/ICSE-SEIP52600.2021.00026
[106] Paul Luo Li, Pavel Dmitriev, Huibin Mary Hu, Xiaoyu Chai, Zoran Dimov, Brandon Paddock, Ying Li, Alex Kirshenbaum, Irina Niculescu, and Taj Thoresen. 2019. Experimentation in the Operating System: The Windows Experimentation Platform. In 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). 21-30. https://doi.org/10.1109/ICSE-SEIP.2019.00011
[107] Yiyang Li, Guanyu Tao, Weinan Zhang, Yong Yu, and Jun Wang. 2017. Content Recommendation by Noise Contrastive Transfer Learning of Feature Representation. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management (Singapore, Singapore) (CIKM ’17). Association for Computing Machinery, New York, NY, USA, 1657-1665. https://doi.org/10.1145/3132847.3132855
[108] Ye Li, Hong Xie, Yishi Lin, and John C.S. Lui. 2021. Unifying Offline Causal Inference and Online Bandit Learning for Data Driven Decision. In Proceedings of the Web Conference 2021 (Ljubljana, Slovenia) (WWW ’21). Association for Computing Machinery, New York, NY, USA, 2291-2303. https://doi.org/10.1145/3442381.3449982
[109] Kevin Liou and Sean J. Taylor. 2020. Variance-Weighted Estimators to Improve Sensitivity in Online Experiments. In Proceedings of the 21st ACM Conference on Economics and Computation (Virtual Event, Hungary) (EC ’20). Association for Computing Machinery, New York, NY, USA, 837-850. https://doi.org/10.1145/3391403.3399542
[110] Sophia Liu, Aleksander Fabijan, Michael Furchtgott, Somit Gupta, Pawel Janowski, Wen Qin, and Pavel Dmitriev. 2019. Enterprise-Level Controlled Experiments at Scale: Challenges and Solutions. In 2019 45th Euromicro Conference on Software Engineering and Advanced Applications (SEAA). 29-37. https://doi.org/10.1109/SEAA.2019.00013
[111] Yuchu Liu, David Issa Mattos, Jan Bosch, Helena Holmström Olsson, and Jonn Lantz. 2021. Size matters? Or not: A/B testing with limited sample in automotive embedded software. In 2021 47th Euromicro Conference on Software Engineering and Advanced Applications (SEAA). 300-307. https://doi.org/10.1109/SEAA53835.2021.00046
[112] Widad Machmouchi, Ahmed Hassan Awadallah, Imed Zitouni, and Georg Buscher. 2017. Beyond Success Rate: Utility as a Search Quality Metric for Online Experiments. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management (Singapore, Singapore) (CIKM ’17). Association for Computing Machinery, New York, NY, USA, 757-765. https://doi.org/10.1145/3132847.3132850
[113] Lech Madeyski, Wojciech Orzeszyna, Richard Torkar, and Mariusz Józala. 2014. Overcoming the Equivalent Mutant Problem: A Systematic Literature Review and a Comparative Experiment of Second Order Mutation. IEEE Transactions on Software Engineering 40, 1 (2014), 23-42. https://doi.org/10.1109/TSE.2013.44
[114] Maria Madlberger and Jiri Jizdny. 2021. Impact of promotional social media content on click-through rate – Evidence from a FMCG company. 20th International Conferences on WWW/Internet 2021 and Applied Computing 2021 (2021), 3-10. https://www.scopus.com/inward/record.uri?eid=2-s2.0-85124068035&partnerID=40&md5=c0b8f49a3b48b3d561fd0ed305eb1895 Cited by: 0.
[115] Sara Mahdavi-Hezavehi, Vinicius H.S. Durelli, Danny Weyns, and Paris Avgeriou. 2017. A systematic literature review on methods that handle multiple quality attributes in architecture-based self-adaptive systems. Information and Software Technology 90 (2017), 1-26. https://doi.org/10.1016/j.infsof.2017.03.013
[116] Taisei Masuda, Kyoko Murakami, Kenkichi Sugiura, Sho Sakui, Ron Philip Schuring, and Mitsuhiro Mori. 2022. A phase 1/2 randomised placebo-controlled study of the COVID-19 vaccine mRNA-1273 in healthy Japanese adults: An interim report. Vaccine 40, 13 (2022), 2044-2052. https://doi.org/10.1016/j.vaccine.2022.02.030
[117] David Issa Mattos, Jan Bosch, and Helena Holmström Olsson. 2017. More for Less: Automated Experimentation in Software-Intensive Systems. In Product-Focused Software Process Improvement, Michael Felderer, Daniel Méndez Fernández, Burak Turhan, Marcos Kalinowski, Federica Sarro, and Dietmar Winkler (Eds.). Springer International Publishing, Cham, 146-161.
[118] David Issa Mattos, Jan Bosch, and Helena Holmström Olsson. 2017. Your System Gets Better Every Day You Use It: Towards Automated Continuous Experimentation. In 2017 43rd Euromicro Conference on Software Engineering and Advanced Applications (SEAA). 256-265. https://doi.org/10.1109/SEAA.2017.15
[119] David Issa Mattos, Jan Bosch, and Helena Holmström Olsson. 2018. Challenges and Strategies for Undertaking Continuous Experimentation to Embedded Systems: Industry and Research Perspectives. In Agile Processes in Software Engineering and Extreme Programming, Juan Garbajosa, Xiaofeng Wang, and Ademar Aguiar (Eds.). Springer International Publishing, Cham, 277-292. https://doi.org/10.1007/978-3-319-91602-6_20
[120] David Issa Mattos, Jan Bosch, Helena Holmstrom Olsson, Aita Maryam Korshani, and Jonn Lantz. 2020. Automotive A/B testing: Challenges and Lessons Learned from Practice. In 2020 46th Euromicro Conference on Software Engineering and Advanced Applications (SEAA). 101-109. https://doi.org/10.1109/SEAA51224.2020.00026
[121] Pavel Metrikov, Fernando Diaz, Sebastien Lahaie, and Justin Rao. 2014. Whole Page Optimization: How Page Elements Interact with the Position Auction. In Proceedings of the Fifteenth ACM Conference on Economics and Computation (Palo Alto, California, USA) (EC ’14). Association for Computing Machinery, New York, NY, USA, 583-600. https://doi.org/10.1145/2600057.2602871
[122] Risto Miikulainen, Myles Brundage, Jonathan Epstein, Tyler Foster, Babak Hodjat, Neil Iscoe, Jingbo Jiang, Diego Legrand, Sam Nazari, Xin Qiu, Michael Scharff, Cory Schoolland, Robert Severn, and Aaron Shagrin. 2020. Ascend by Evolv: AI-Based Massively Multivariate Conversion Rate Optimization. AI Magazine 41, 1 (Apr. 2020), 44-60. https://doi.org/10.1609/aimag.v41i1.5256
[123] Tadashi Okoshi, Kota Tsubouchi, and Hideyuki Tokuda. 2019. Real-World Product Deployment of Adaptive Push Notification Scheduling on Smartphones. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (Anchorage, AK, USA) (KDD ’19). Association for Computing Machinery, New York, NY, USA, 2792-2800. https://doi.org/10.1145/3292500.3330732
[124] Takumi Ozawa, Akiyuki Sekiguchi, and Kazuhiko Tsuda. 2016. A Method for the Construction of User Targeting Knowledge for B2B Industry Website. Procedia Computer Science 96 (2016), 1147-1155. https://doi.org/10.1016/j.procs.2016.08.157 Knowledge-Based and Intelligent Information & Engineering Systems: Proceedings of the 20th International Conference KES-2016.
[125] Dan Pelleg, Oleg Rokhlenko, Idan Szpektor, Eugene Agichtein, and Ido Guy. 2016. When the Crowd is Not Enough: Improving User Experience with Social Media through Automatic Quality Analysis. In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing (San Francisco, California, USA) (CSCW ’16). Association for Computing Machinery, New York, NY, USA, 1080-1090. https://doi.org/10.1145/2818048.2820022
[126] Ladislav Peska and Peter Vojtas. 2020. Off-Line vs. On-Line Evaluation of Recommender Systems in Small E-Commerce. In Proceedings of the 31st ACM Conference on Hypertext and Social Media (Virtual Event, USA) (HT ’20). Association for Computing Machinery, New York, NY, USA, 291-300. https://doi.org/10.1145/3372923.3404781
[127] Alexey Poyarkov, Alexey Drutsa, Andrey Khalyavin, Gleb Gusev, and Pavel Serdyukov. 2016. Boosted Decision Tree Regression Adjustment for Variance Reduction in Online Controlled Experiments. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, California, USA) (KDD ’16). Association for Computing Machinery, New York, NY, USA, 235-244. https://doi.org/10.1145/2939672.2939688
[128] Jia Qu and Jing Zhang. 2016. Validating Mobile Designs with Agile Testing in China: Based on Baidu Map for Mobile. In Design, User Experience, and Usability: Design Thinking and Methods, Aaron Marcus (Ed.). Springer International Publishing, Cham, 491-498.
[129] Federico Quin, Danny Weyns, and Matthias Galster. 2023. Study Systematic Literature Review on A/B Testing. https://people.cs.kuleuven.be/danny.weyns/material/SLR_AB/
[130] Jan Renz, Daniel Hoffmann, Thomas Staubitz, and Christoph Meinel. 2016. Using A/B testing in MOOC environments. In Proceedings of the Sixth International Conference on Learning Analytics & Knowledge (LAK ’16). Association for Computing Machinery, New York, NY, USA, 304-313. https://doi.org/10.1145/2883851.2883876
[131] Mohi Reza, Juho Kim, Ananya Bhattacharjee, Anna N. Rafferty, and Joseph Jay Williams. 2021. The MOOClet Framework: Unifying Experimentation, Dynamic Improvement, and Personalization in Online Courses. In Proceedings of the Eighth ACM Conference on Learning @ Scale (Virtual Event, Germany) (L@S ’21). Association for Computing Machinery, New York, NY, USA, 15-26. https://doi.org/10.1145/3430895.3460128
[132] Pilar Rodríguez, Alireza Haghighatkhah, Lucy Ellen Lwakatare, Susanna Teppola, Tanja Suomalainen, Juho Eskeli, Teemu Karvonen, Pasi Kuvaja, June M. Verner, and Markku Oivo. 2017. Continuous deployment of software intensive products and services: A systematic mapping study. Journal of Systems and Software 123 (2017), 263-291. https://doi.org/10.1016/j.jss.2015.12.015
[133] Rasmus Ros and Per Runeson. 2018. Continuous Experimentation and A/B Testing: A Mapping Study. In Proceedings of the 4th International Workshop on Rapid Continuous Software Engineering (Gothenburg, Sweden) (RCoSE ’18). Association for Computing Machinery, New York, NY, USA, 35-41. https://doi.org/10.1145/3194760.3194766
[134] Nir Rosenfeld, Yishay Mansour, and Elad Yom-Tov. 2017. Predicting Counterfactuals from Large Historical Data and Small Randomized Trials. In Proceedings of the 26th International Conference on World Wide Web Companion (Perth, Australia) (WWW ’17 Companion). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 602-609. https://doi.org/10.1145/3041021.3054190
[135] Sandra Sajeev, Jade Huang, Nikos Karampatziakis, Matthew Hall, Sebastian Kochman, and Weizhu Chen. 2021. Contextual Bandit Applications in a Customer Support Bot. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining (Virtual Event, Singapore) (KDD ’21). Association for Computing Machinery, New York, NY, USA, 3522-3530. https://doi.org/10.1145/3447548.3467165
[136] Suhrid Satyal, Ingo Weber, Hye-young Paik, Claudio Di Ciccio, and Jan Mendling. 2017. AB-BPM: Performance-Driven Instance Routing for Business Process Improvement. In Business Process Management, Josep Carmona, Gregor Engels, and Akhil Kumar (Eds.). Springer International Publishing, Cham, 113-129.
[137] Suhrid Satyal, Ingo Weber, Hye young Paik, Claudio Di Ciccio, and Jan Mendling. 2019. Business process improvement with the AB-BPM methodology. Information Systems 84 (2019), 283-298. https://doi.org/10.1016/j.is.2018.06.007
[138] Martin Saveski, Jean Pouget-Abadie, Guillaume Saint-Jacques, Weitao Duan, Souvik Ghosh, Ya Xu, and Edoardo M. Airoldi. 2017. Detecting Network Effects: Randomizing Over Randomized Experiments. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Halifax, NS, Canada) (KDD ’17). Association for Computing Machinery, New York, NY, USA, 1027-1035. https://doi.org/10.1145/3097983.3098192
[139] Gerald Schermann, Dominik Schöni, Philipp Leitner, and Harald C. Gall. 2016. Bifrost: Supporting Continuous Deployment with Automated Enactment of Multi-Phase Live Testing Strategies. In Proceedings of the 17th International Middleware Conference (Trento, Italy) (Middleware ’16). Association for Computing Machinery, New York, NY, USA, Article 12, 14 pages. https://doi.org/10.1145/2988336.2988348
[140] Shahriar Shariat, Burkay Orten, and Ali Dasdan. 2017. Online Evaluation of Bid Prediction Models in a Large-Scale Computational Advertising Platform: Decision Making and Insights. Knowl. Inf. Syst. 51, 1 (apr 2017), 37-60. https://doi.org/10.1007/s10115-016-0972-6
[141] Fanjuan Shi, Chirine Ghedira, and Jean-Luc Marini. 2015. Context Adaptation for Smart Recommender Systems. IT Professional 17, 6 (2015), 18-26. https://doi.org/10.1109/MITP.2015.96
[142] Janet Siegmund, Norbert Siegmund, and Sven Apel. 2015. Views on Internal and External Validity in Empirical Software Engineering. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Vol. 1. IEEE Press, Los Alamitos, CA, USA, 9-19. https://doi.org/10.1109/ICSE.2015.24
[143] Natalia Silberstein, Oren Somekh, Yair Koren, Michal Aharon, Dror Porat, Avi Shahar, and Tingyi Wu. 2020. Ad Close Mitigation for Improved User Experience in Native Advertisements. In Proceedings of the 13th International Conference on Web Search and Data Mining (Houston, TX, USA) (WSDM ’20). Association for Computing Machinery, New York, NY, USA, 546-554. https://doi.org/10.1145/3336191.3371798
[144] Jorge Gabriel Siqueira and Melise M. V. de Paula. 2018. IPEAD A/B Test Execution Framework. In Proceedings of the XIV Brazilian Symposium on Information Systems (Caxias do Sul, Brazil) (SBSI’18). Association for Computing Machinery, New York, NY, USA, Article 14, 8 pages. https://doi.org/10.1145/3229345.3229360
[145] Dan Siroker and Pete Koomen. 2013. A/B Testing: The Most Powerful Way to Turn Clicks Into Customers (1st ed.). Wiley Publishing, Hoboken, NJ, USA.
[146] Bruce Spang, Veronica Hannan, Shravya Kunamalla, Te-Yuan Huang, Nick McKeown, and Ramesh Johari. 2021. Unbiased Experiments in Congested Networks. In Proceedings of the 21st ACM Internet Measurement Conference (Virtual Event) (IMC ’21). Association for Computing Machinery, New York, NY, USA, 80-95. https://doi.org/10.1145/3487552.3487851
[147] Akshitha Sriraman, Abhishek Dhanotia, and Thomas F. Wenisch. 2019. SoftSKU: Optimizing Server Architectures for Microservice Diversity @Scale. In 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA). 513-526.
[148] Fei Sun, Peng Jiang, Hanxiao Sun, Changhua Pei, Wenwu Ou, and Xiaobo Wang. 2018. Multi-Source Pointer Network for Product Title Summarization. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (Torino, Italy) (CIKM ’18). Association for Computing Machinery, New York, NY, USA, 7-16. https://doi.org/10.1145/3269206.3271722
[149] Idan Szpektor, Yoelle Maarek, and Dan Pelleg. 2013. When Relevance is Not Enough: Promoting Diversity and Freshness in Personalized Question Recommendation. In Proceedings of the 22nd International Conference on World Wide Web (Rio de Janeiro, Brazil) (WWW ’13). Association for Computing Machinery, New York, NY, USA, 1249-1260. https://doi.org/10.1145/2488388.2488497
[150] Yukihiro Tagami, Toru Hotta, Yusuke Tanaka, Shingo Ono, Koji Tsukamoto, and Akira Tajima. 2014. Filling Context-Ad Vocabulary Gaps with Click Logs. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (New York, New York, USA) (KDD ’14). Association for Computing Machinery, New York, NY, USA, 1955-1964. https://doi.org/10.1145/2623330.2623334
[151] Giordano Tamburrelli and Alessandro Margara. 2014. Towards Automated A/B Testing. In Search-Based Software Engineering, Claire Le Goues and Shin Yoo (Eds.). Springer International Publishing, Cham, 184-198.
[152] Diane Tang, Ashish Agarwal, Deirdre O’Brien, and Mike Meyer. 2010. Overlapping Experiment Infrastructure: More, Better, Faster Experimentation. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Washington, DC, USA) (KDD ’10). Association for Computing Machinery, New York, NY, USA, 17-26. https://doi.org/10.1145/1835804.1835810
[153] Mert Toslali, Srinivasan Parthasarathy, Fabio Oliveira, and Ayse K. Coskun. 2020. JACKPOT: Online experimentation of cloud microservices. HotCloud 2020-12th USENIX Workshop on Hot Topics in Cloud Computing, co-located with USENIX ATC 2020 (2020). https://www.scopus.com/inward/record.uri?eid=2-s2.0-85091892156&partnerID=40&md5=cae12fe24f34f2bb0818e448f8c07fbf Cited by: 1.
[154] Ye Tu, Kinjal Basu, Cyrus DiCiccio, Romil Bansal, Preetam Nandy, Padmini Jaikumar, and Shaunak Chatterjee. 2021. Personalized Treatment Selection Using Causal Heterogeneity. In Proceedings of the Web Conference 2021 (Ljubljana, Slovenia) (WWW ’21). Association for Computing Machinery, New York, NY, USA, 1574-1585. https://doi.org/10.1145/3442381.3450075
[155] Yutaro Ueoka, Kota Tsubouchi, and Nobuyuki Shimizu. 2020. Tackling Cannibalization Problems for Online Advertisement. In Proceedings of the 28th ACM Conference on User Modeling, Adaptation and Personalization (Genoa, Italy) (UMAP ’20). Association for Computing Machinery, New York, NY, USA, 358-362. https://doi.org/10.1145/3340631.3394875
[156] Jean Vanderdonckt, Mathieu Zen, and Radu-Daniel Vatavu. 2019. AB4Web: An On-Line A/B Tester for Comparing User Interface Design Alternatives. Proc. ACM Hum.-Comput. Interact. 3, EICS, Article 18 (jun 2019), 28 pages. https://doi.org/10.1145/3331160
[157] Deepak Kumar Vasthimal, Pavan Kumar Srirama, and Arun Kumar Akkinapalli. 2019. Scalable Data Reporting Platform for A/B Tests. In 2019 IEEE 5th Intl Conference on Big Data Security on Cloud (BigDataSecurity), IEEE Intl Conference on High Performance and Smart Computing, (HPSC) and IEEE Intl Conference on Intelligent Data and Security (IDS). 230-238. https://doi.org/10.1109/BigDataSecurity-HPSC-IDS.2019.00052
[158] Daniel Walper, Julia Kassau, Philipp Methfessel, Timo Pronold, and Wolfgang Einhauser. 2020. Optimizing user interfaces in food production: gaze tracking is more sensitive for A-B-testing than behavioral data alone. In ACM Symposium on Eye Tracking Research and Applications (ETRA ’20 Short Papers). Association for Computing Machinery, New York, NY, USA, 1-4. https://doi.org/10.1145/3379156.3391351
[159] Jian Wang and David Hardtke. 2015. User Latent Preference Model for Better Downside Management in Recommender Systems. In Proceedings of the 24th International Conference on World Wide Web (Florence, Italy) (WWW ’15). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 1209-1219. https://doi.org/10.1145/2736277.2741126
[160] Weinan Wang and Xi Zhang. 2021. CONQ: CONtinuous Quantile Treatment Effects for Large-Scale Online Controlled Experiments. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining (Virtual Event, Israel) (WSDM ’21). Association for Computing Machinery, New York, NY, USA, 202-210. https://doi.org/10.1145/3437963.3441779
[161] Yu Wang, Somit Gupta, Jiannan Lu, Ali Mahmoudzadeh, and Sophia Liu. 2019. On Heavy-user Bias in A/B Testing. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM ’19). Association for Computing Machinery, New York, NY, USA, 2425-2428. https://doi.org/10.1145/3357384.3358143
[162] Zenan Wang, Carlos Carrion, Xiliang Lin, Fuhua Ji, Yongjun Bao, and Weipeng Yan. 2022. Adaptive Experimentation with Delayed Binary Feedback. In Proceedings of the ACM Web Conference 2022 (Virtual Event, Lyon, France) (WWW ’22). Association for Computing Machinery, New York, NY, USA, 2247-2255. https://doi.org/10.1145/3485447.3512097
[163] Liang Wu and Mihajlo Grbovic. 2020. How Airbnb Tells You Will Enjoy Sunset Sailing in Barcelona? Recommendation in a Two-Sided Travel Marketplace. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (Virtual Event, China) (SIGIR ’20). Association for Computing Machinery, New York, NY, USA, 2387-2396. https://doi.org/10.1145/3397271.3401444
[164] Yuhang Wu, Zeyu Zheng, Guangyu Zhang, Zuohua Zhang, and Chu Wang. 2022. Non-Stationary A/B Tests. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (Washington DC, USA) (KDD ’22). Association for Computing Machinery, New York, NY, USA, 2079-2089. https://doi.org/10.1145/3534678.3539325
[165] Tong Xia, Sumit Bhardwaj, Pavel Dmitriev, and Aleksander Fabijan. 2019. Safe Velocity: A Practical Guide to Software Deployment at Scale using Controlled Rollout. In 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). 11-20. https://doi.org/10.1109/ICSE-SEIP.2019.00010
[166] Yuxiang Xie, Nanyu Chen, and Xiaolin Shi. 2018. False Discovery Rate Controlled Heterogeneous Treatment Effect Detection for Online Controlled Experiments. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (London, United Kingdom) (KDD ’18). Association for Computing Machinery, New York, NY, USA, 876-885. https://doi.org/10.1145/3219819.3219860
[167] Yuxiang Xie, Meng Xu, Evan Chow, and Xiaolin Shi. 2021. How to Measure Your App: A Couple of Pitfalls and Remedies in Measuring App Performance in Online Controlled Experiments. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining (Virtual Event, Israel) (WSDM ’21). Association for Computing Machinery, New York, NY, USA, 949-957. https://doi.org/10.1145/3437963.3441742
[168] Ya Xu and Nanyu Chen. 2016. Evaluating Mobile Apps with A/B and Quasi A/B Tests. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, California, USA) (KDD ’16). Association for Computing Machinery, New York, NY, USA, 313-322. https://doi.org/10.1145/2939672.2939703
[169] Ya Xu, Nanyu Chen, Addrian Fernandez, Omar Sinno, and Anmol Bhasin. 2015. From Infrastructure to Culture: A/B Testing Challenges in Large Scale Social Networks. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Sydney, NSW, Australia) (KDD ’15). Association for Computing Machinery, New York, NY, USA, 2227-2236. https://doi.org/10.1145/2783258.2788602
[170] Ye Xu, Zang Li, Abhishek Gupta, Ahmet Bugdayci, and Anmol Bhasin. 2014. Modeling Professional Similarity by Mining Professional Career Trajectories. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (New York, New York, USA) (KDD ’14). Association for Computing Machinery, New York, NY, USA, 1945-1954. https://doi.org/10.1145/2623330.2623368
[171] Yanbo Xu, Divyat Mahajan, Liz Manrao, Amit Sharma, and Emre Kıcıman. 2021. Split-Treatment Analysis to Rank Heterogeneous Causal Effects for Prospective Interventions. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining (Virtual Event, Israel) (WSDM ’21). Association for Computing Machinery, New York, NY, USA, 409-417. https://doi.org/10.1145/3437963.3441821
[172] Sezin Gizem Yaman, Myriam Munezero, Jürgen Münch, Fabian Fagerholm, Ossi Syd, Mika Aaltola, Christina Palmu, and Tomi Männistö. 2017. Introducing continuous experimentation in large software-intensive product and service organisations. Journal of Systems and Software 133 (2017), 195-211. https://doi.org/10.1016/j.jss.2017.07.009
[173] Wanshan Yang, Gemeng Yang, Ting Huang, Lijun Chen, and Youjian Eugene Liu. 2018. Whales, Dolphins, or Minnows? Towards the Player Clustering in Free Online Games Based on Purchasing Behavior via Data Mining Technique. In 2018 IEEE International Conference on Big Data (Big Data). 4101-4108. https://doi.org/10.1109/BigData.2018.8622067
[174] Runlong Ye, Pan Chen, Yini Mao, Angela Wang-Lin, Hammad Shaikh, Angela Zavaleta Bernuy, and Joseph Jay Williams. 2022. Behavioral Consequences of Reminder Emails on Students’ Academic Performance: A Real-World Deployment. In Proceedings of the 23rd Annual Conference on Information Technology Education (Chicago, IL, USA) (SIGITE ’22). Association for Computing Machinery, New York, NY, USA, 16-22. https://doi.org/10.1145/3537674.3554740
[175] Takeshi Yoneda, Shunsuke Kozawa, Keisuke Osone, Yukinori Koide, Yosuke Abe, and Yoshifumi Seki. 2019. Algorithms and System Architecture for Immediate Personalized News Recommendations. In IEEE/WIC/ACM International Conference on Web Intelligence (Thessaloniki, Greece) (WI ’19). Association for Computing Machinery, New York, NY, USA, 124-131. https://doi.org/10.1145/3350546.3352509
[176] Scott W. H. Young. 2014. Improving Library User Experience with A/B Testing: Principles and Process. Weave: Journal of Library User Experience 1 (08 2014). https://doi.org/10.3998/weave.12535642.0001.101
[177] Miao Yu, Wenbin Lu, and Rui Song. 2020. A new framework for online testing of heterogeneous treatment effect. AAAI 2020 - 34th AAAI Conference on Artificial Intelligence (2020), 10310-10317. https://www.scopus.com/inward/record.uri?eid=2-s2.0-85106588123&partnerID=40&md5=53544f162212be7cd129e1f196debcd8
[178] He Zhang and Muhammad Ali Babar. 2010. On Searching Relevant Studies in Software Engineering. In Proceedings of the 14th International Conference on Evaluation and Assessment in Software Engineering (UK) (EASE’10). BCS Learning & Development Ltd., Swindon, GBR, 111-120.
[179] Zhenyu Zhao, Yan He, and Miao Chen. 2017. Inform Product Change through Experimentation with Data-Driven Behavioral Segmentation. In 2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA). 69-78. https://doi.org/10.1109/DSAA.2017.65

Appendix A. List of primary studies

Table A.21: List of primary studies.

Paper ID	Reference	Title
1	[1]	A Nonparametric Sequential Test for Online Randomized Experiments
2	[138]	Detecting Network Effects: Randomizing Over Randomized Experiments
3	[91]	Unexpected Results in Online Controlled Experiments
4	[28]	How A/B Tests Could Go Wrong: Automatic Diagnosis of Invalid Online Experiments
5	[127]	Boosted Decision Tree Regression Adjustment for Variance Reduction in Online Controlled Experiments
6	[88]	Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained
7	[89]	Online Controlled Experiments at Large Scale
8	[164]	Non-Stationary A/B Tests
9	[73]	Network A/B Testing: From Sampling to Estimation
10	[166]	False Discovery Rate Controlled Heterogeneous Treatment Effect Detection for Online Controlled Experiments
11	[54]	Experimentation Pitfalls to Avoid in A/B Testing for Online Personalization
12	[38]	Statistical Inference in Two-Stage Online Controlled Experiments with Treatment Selection and Validation
13	[23]	Consistent Transformation of Ratio Metrics for Efficient Online Controlled Experiments
14	[160]	CONQ: CONtinuous Quantile Treatment Effects for Large-Scale Online Controlled Experiments
15	[60]	Diagnosing Sample Ratio Mismatch in Online Controlled Experiments: A Taxonomy and Rules of Thumb for Practitioners
16	[144]	IPEAD A/B Test Execution Framework
17	[82]	Peeking at A/B Tests: Why It Matters, and What to Do About It
18	[85]	Learning Sensitive Combinations of A/B Test Metrics
19	[47]	Practical Aspects of Sensitivity in Online Experimentation with User Engagement Metrics
20	[168]	Evaluating Mobile Apps with A/B and Quasi A/B Tests
21	[39]	On Post-Selection Inference in A/B Testing
22	[48]	Online Experimentation with Surrogate Metrics: Guidelines and a Case Study
23	[44]	Future User Engagement Prediction and Its Application to Improve the Sensitivity of Online Experiments
24	[58]	The Evolution of Continuous Experimentation in Software Product Development: From Data to a Data-Driven Organization at Scale
25	[167]	How to Measure Your App: A Couple of Pitfalls and Remedies in Measuring App Performance in Online Controlled Experiments
26	[86]	Sequential Testing for Early Stopping of Online Experiments
27	[40]	Shrinkage Estimators in Online Experiments
28	[77]	A Counterfactual Framework for Seller-Side A/B Testing on Marketplaces
29	[45]	Periodicity in User Engagement with a Search Engine and Its Application to Online Controlled Experiments
30	[105]	Evolving Software to Be ML-Driven Utilizing Real-World A/B Testing: Experiences, Insights, Challenges
31	[46]	Using the Delay in a Treatment Effect to Improve Sensitivity and Preserve Directionality of Engagement Metrics in A/B Experiments
32	[79]	A Cluster-Based Nearest Neighbor Matching Algorithm for Enhanced A/A Validation in Online Experimentation
33	[109]	Variance-Weighted Estimators to Improve Sensitivity in Online Experiments
34		A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments
35	[100]	Winner's Curse: Bias Estimation for Total Effects of Features in Online Controlled Experiments
36	[169]	From Infrastructure to Culture: A/B Testing Challenges in Large Scale Social Networks
37	[83]	A Sequential Test for Selecting the Better Variant: Online A/B Testing, Adaptive Allocation, and Continuous Monitoring
38	[146]	Unbiased Experiments in Congested Networks
39	[154]	Personalized Treatment Selection Using Causal Heterogeneity
40	[112]	Beyond Success Rate: Utility as a Search Quality Metric for Online Experiments
41	[175]	Algorithms and System Architecture for Immediate Personalized News Recommendations
42	[106]	Experimentation in the Operating System: The Windows Experimentation Platform
43		AB4Web: An On-Line A/B Tester for Comparing User Interface Design Alternatives
44	[123]	Real-World Product Deployment of Adaptive Push Notification Scheduling on Smartphones
45	[95]	Mining the Stars: Learning Quality Ratings with User-Facing Explanations for Vacation Rentals
46	[43]	Towards Digital Twin-Enabled DevOps for CPS Providing Architecture-Based Service Adaptation and Verification at Runtime
47	[162]	Adaptive Experimentation with Delayed Binary Feedback
48	[108]	Unifying Offline Causal Inference and Online Bandit Learning for Data-Driven Decision
49	[9]	Beyond Data: From User Information to Business Value through Personalized Recommendations and Consumer Science
50	[102]	Learning to Bundle Proactively for On-Demand Meal Delivery
51		Measuring Dynamic Effects of Display Advertising in the Absence of User Tracking Information
52	[15]	Marketing Campaign Evaluation in Targeted Display Advertising
53	[121]	Whole Page Optimization: How Page Elements Interact with the Position Auction
54	[131]	The MOOClet Framework: Unifying Experimentation, Dynamic Improvement, and Personalization in Online Courses
55	[171]	Split-Treatment Analysis to Rank Heterogeneous Causal Effects for Prospective Interventions
56	[99]	Promoting Positive Post-Click Experience for In-Stream Yahoo Gemini Users
57	[134]	Predicting Counterfactuals from Large Historical Data and Small Randomized Trials
58	[148]	Multi-Source Pointer Network for Product Title Summarization
59	[33]	Beyond Relevance Ranking: A General Graph Matching Framework for Utility-Oriented Learning to Rank
60	[72]	Offline Evaluation to Make Decisions About Playlist Recommendation Algorithms
61	[74]	Trustworthy Experimentation Under Telemetry Loss
62	[70]	The Netflix Recommender System: Algorithms, Business Value, and Innovation
63	[139]	Bifrost: Supporting Continuous Deployment with Automated Enactment of Multi-Phase Live Testing Strategies
64	[66]	CompactETA: A Fast Inference System for Travel Time Prediction
65	[14]	Design and Analysis of Benchmarking Experiments for Distributed Internet Services
66	[52]	Learning to Rank in the Position Based Model with Bandit Feedback
67	[20]	VisRel: Media Search at Scale
68	[174]	Behavioral Consequences of Reminder Emails on Students’ Academic Performance: A Real-World Deployment
69	[107]	Content Recommendation by Noise Contrastive Transfer Learning of Feature Representation
70	[63]	Offline Evaluation of Ranking Models under Extreme Position Bias
71	[155]	Tackling Cannibalization Problems for Online Advertisement
72	[150]	Filling Context-Ad Vocabulary Gaps with Click Logs
73	[65]	Practical Lessons from Developing a Large-Scale Recommender System at Zalando
74	[163]	How Airbnb Tells You Will Enjoy Sunset Sailing in Barcelona? Recommendation in a Two-Sided Travel Marketplace
75	[170]	Modeling Professional Similarity by Mining Professional Career Trajectories
76	[27]	Social Incentive Optimization in Online Social Networks
77	[143]	Ad Close Mitigation for Improved User Experience in Native Advertisements
78	[126]	Offline vs. Online Evaluation of Recommender Systems in Small E-Commerce
79	[2]	LASER: A Scalable Response Prediction Platform for Online Advertising
80	[5]	The Role of Relevance in Sponsored Search
81	[135]	Contextual Bandit Applications in a Customer Support Bot
82	[165]	Safe Velocity: A Practical Guide to Software Deployment at Scale using Controlled Rollout
83	[149]	When Relevance Is Not Enough: Promoting Diversity and Freshness in Personalized Question Recommendation
84	[103]	Interference, Bias, and Variance in Two-Sided Marketplace Experimentation: Guidance for Platforms
85	[120]	Automotive A/B Testing: Challenges and Lessons Learned from Practice
86	[21]	A/B Testing at SweetIM: The Importance of Proper Statistical Analysis
87	[36]	A Framework Model to Support A/B Tests at the Class and Component Level
88	[81]	Statistical Analysis of Zero-Inflated, Right-Skewed User-Generated Big Data in A/B Testing
89	[111]	Size Matters? Or Not: A/B Testing with Limited Sample in Automotive Embedded Software
90	[157]	Scalable Data Reporting Platform for A/B Tests
91	[18]	Applying Bayesian Parameter Estimation to A/B Tests in E-Business Applications: Examining the Impact of Green Marketing Signals in Sponsored Search Advertising
92	[7]	Experiment-Driven Improvements in Human-in-the-Loop Machine Learning Annotation via Significance-Based A/B Testing
93	[76]	The Anatomy of a Large-Scale Experimentation Platform
94	[11]	Demystifying Dark Matter for Online Experimentation
95	[71]	Controlled Experiments for Decision-Making in E-Commerce Search
96	[98]	Usability Evaluation of a Web Application: A Comparative Analysis of Open-Source Tools
97	[29]	Faster Online Experimentation by Eliminating Traditional A/A Validation
98	[41]	Pitfalls of Long-Term Online Controlled Experiments
99	[147]	SoftSKU: Optimizing Server Architectures for Microservice Diversity at Scale
100	[57]	The Benefits of Controlled Experimentation at Scale
101	[141]	Context Adaptation for Smart Recommender Systems
102	[173]	Whales, Dolphins, or Minnows? Towards the Player Clustering in Free Online Games Based on Purchasing Behavior via Data Mining Technique
103	[110]	Enterprise-Level Controlled Experiments at Scale: Challenges and Solutions
104	[19]	Should Companies Bid on Their Own Brand in Sponsored Search?
105	[78]	A Probabilistic, Mechanism-Independent Outlier Detection Method for Online Experimentation
106	[179]	Inform Product Change through Experimentation with Data-Driven Behavioral Segmentation
107	[118]	Your System Gets Better Every Day You Use It: Towards Automated Continuous Experimentation
108	[26]	Fashion Recommendation Systems, Models and Methods: A Review
109	[8]	Subject Line Personalization Techniques and Their Influence on the E-Mail Open Rate
110	[114]	The Impact of Promotional Social Media Content on Click-Through Rate – Evidence from an FMCG Company
111	[6]	Related Entity Expansion and Ranking Using Knowledge Graph
112	[24]	LinkLouvain: Link-Aware A/B Testing and Its Application on Online Marketing Campaign
113	[122]	Ascend by Evolv: AI-Based Massively Multivariate Conversion Rate Optimization
114	[177]	A New Framework for Online Testing of Heterogeneous Treatment Effect
115	[153]	Jackpot: Online Experimentation of Cloud Microservices
116	[75]	Digital Marketing Effectiveness Using Incrementality
117	[137]	Business Process Improvement with the AB-BPM Methodology
118	[50]	A Genetic Algorithm for Finding a Small and Diverse Set of Recent News Stories on a Given Subject: How We Generate AAAI's AI-Alert
119	[97]	Measuring the Value of Recommendation Links on Product Demand
120	[56]	Experimentation Growth: Evolving Trustworthy A/B Testing Capabilities in Online Software Companies
121	[140]	Online Evaluation of Bid Prediction Models in a Large-Scale Computational Advertising Platform: Decision Making and Insights
122	[136]	AB-BPM: Performance-Driven Instance Routing for Business Process Improvement
123	[49]	Have It Both Ways – From A/B Testing to A&B Testing with Exceptional Model Mining
124	[117]	More for Less: Automated Experimentation in Software-Intensive Systems
125	[30]	Regression Tree for Bandits Models in A/B Testing
126	[125]	When the Crowd Is Not Enough: Improving User Experience with Social Media through Automatic Quality Analysis
127	[22]	Pixel Efficiency Analysis: A Quantitative Web Analytics Approach
128	[96]	A/B Testing in E-Commerce Sales Processes
129	[124]	A Method for Building User-Targeting Knowledge for B2B Industry Websites
130	[128]	Validating Mobile Designs with Agile Testing in China: Based on Baidu Map for Mobile
131	[159]	User Latent Preference Model for Better Downside Management in Recommender Systems
132	[151]	Towards Automated A/B Testing
133	[90]	Seven Rules of Thumb for Web Site Experimenters
134	[101]	Enabling A/B Testing of Native Mobile Applications by Remote User Interface Exchange
135	[152]	Overlapping Experiment Infrastructure: More, Better, Faster Experimentation
136	[67]	Optimizing Price Levels in E-Commerce Applications: An Empirical Study
137	[25]	Facilitating Controlled Tests of Website Design Changes: A Systematic Approach
138	[3]	Soft Frequency Capping for Improved Ad Click Prediction in Yahoo Gemini Native
139	[37]	Objective Bayesian Two Sample Hypothesis Testing for Online Controlled Experiments
140	[64]	Test & Roll: Profit-Maximizing A/B Tests
141	[176]	Improving Library User Experience with A/B Testing: Principles and Process

*Corresponding author
Email addresses: federico.quin@kuleuven.be (Federico Quin), danny.weyns@kuleuven.be (Danny Weyns), matthias.galster@canterbury.ac.nz (Matthias Galster), camila.costasilva@pg.canterbury.ac.nz (Camila Costa Silva)
We use the term “paper” to refer to the papers we considered when applying the inclusion and exclusion criteria of the systematic review, and the term “primary study” for the papers we selected for data extraction.
Papers published in the Lecture Notes in Computer Science format with [...] pages are also considered short.
Academic refers to affiliations that are eligible to graduate master's and/or PhD students.
We distinguish between data extracted from an empirical evaluation in a live system and data extracted from a simulation or illustrative example, in order to provide targeted insights into the execution of A/B tests during the data analysis of the SLR.
We excluded the experiments and corresponding metrics of primary studies that analyzed a large number of previously conducted A/B tests.
A conversion is a desired action that is taken in an A/B test.
Note that some primary studies do not explicitly specify the metrics due to business sensitivity. Based on the information available in the study, we included these under general engagement metrics.
Good clicks are described as meaningful clicks during a search query session [20].
However, these studies did report p-values alongside the results, or explicitly referred to confidence intervals and statistically significant results of A/B tests.
See the data item A/B target in Section 4.2 for the specific references.
Invalid refers to poorly designed experiments or misinterpretation of the results obtained from an experiment.
Counterfactual analysis provides answers about cause and effect for the treatment group and its corresponding outcomes, compared to what would have happened had the treatment not been applied.
Or alternatively, an explicit mention that no statistical methods were used.

Journal: Journal of Systems and Software, Volume: 211
DOI: https://doi.org/10.1016/j.jss.2024.112011
Publication Date: 2024-02-22

A/B Testing: A Systematic Literature Review

Federico Quin, Danny Weyns, Matthias Galster, Camila Costa Silva
Distrinet, KU Leuven, Celestijnenlaan 200A, Leuven, 3000, Belgium
Linnaeus University, Universitetsplatsen 1, Växjö, 351 06, Sweden
University of Canterbury, 69 Creyke Road, Christchurch, 8140, New Zealand

Abstract

A/B testing, also referred to as online controlled experimentation or continuous experimentation, is a form of hypothesis testing where two variants of a piece of software are compared in the field from an end user's point of view. A/B testing is widely used in practice to enable data-driven decision making for software development. While a few studies have explored different facets of research on A/B testing, no comprehensive study has been conducted on the state-of-the-art in A/B testing. Such a study is crucial to provide a systematic overview of the field of A/B testing and to drive future research forward. To address this gap and provide an overview of the state-of-the-art in A/B testing, this paper reports the results of a systematic literature review that analyzed 141 primary studies. The research questions focused on the subject of A/B testing, how A/B tests are designed and executed, what roles stakeholders have in this process, and the open challenges in the area. Analysis of the extracted data shows that the main targets of A/B testing are algorithms, visual elements, and workflow and processes. Single classic A/B tests are the dominating type of tests, primarily based on hypothesis tests. Stakeholders have three main roles in the design of A/B tests: concept designer, experiment architect, and setup technician. The primary types of data collected during the execution of A/B tests are product/system data, user-centric data, and spatio-temporal data. The dominating uses of the test results are feature selection, feature rollout, continued feature development, and subsequent A/B test design. Stakeholders have two main roles during A/B test execution: experiment coordinator and experiment assessor. The main reported open problems are related to the enhancement of proposed approaches and their usability. From our study we derived three interesting lines for future research: strengthening the adoption of statistical methods in A/B testing, improving the process of A/B testing, and enhancing the automation of A/B testing.

Keywords: A/B testing, Systematic literature review, A/B test engineering

1. Introduction

Iterative software development and time to market are crucial to the success of software companies. Central to this is innovation by exploring new software features or experimenting with software changes. In order to enable such innovation in practice, software companies often employ A/B testing. A/B testing, also referred to as online controlled experimentation or continuous experimentation, is a form of hypothesis testing where two variants of a piece of software are evaluated in the field (ranging from variants with a slightly altered GUI layout to variants of software with new features). In particular, the merit of the two variants is analyzed using metrics such as click rates of visitors of websites, members' lifetime values (LTV) in a subscription service, and user conversions in marketing [82, 161, 48]. A/B testing is extensively used in practice, including by large and popular tech companies such as Google, Meta, LinkedIn, and Microsoft [77, 161, 104, 168].

Even though A/B testing is commonly used in practice, to the best of our knowledge, no comprehensive empirically grounded study has been conducted on the state-of-the-art (i.e., state-of-research in contrast to state-of-the-practice) in A/B testing. Such a study is crucial to provide a systematic overview of the field of A/B testing to drive future research forward. Three earlier studies [13, 12, 133] explored a number of aspects of research on A/B testing, such as research topics, types of experiments in A/B testing, and A/B tooling and metrics. Yet, these studies do not provide a comprehensive overview of the state-of-the-art that offers deeper insights into the types of targets to which A/B testing is applied, the roles of stakeholders in the design of A/B tests, the execution of the tests, and the usage of the test results. These insights are key to position and understand A/B testing in the broader picture of software engineering. To tackle this issue, we performed a systematic literature review [84]. Our study aims to provide insights on the state of research in A/B testing as a basis to guide future research. Practitioners may also benefit from the study to identify potential improvements of A/B testing in their daily practices.

The remainder of this paper is structured as follows. Section 2 provides a brief introduction to A/B testing and discusses related secondary studies. In Section 3, we outline the research questions and summarize the methodology we used. Section 4 then presents the results, providing an answer to each research question. In Section 5, we reflect on the results of the study, report insights, outline opportunities for future research, and outline threats to validity. Finally, Section 6 concludes the paper.

2.1. Background

A/B testing is a method where two software variants, denoted as variant A and variant B, are compared by evaluating the merit of the variants through exposure to the end-users of the system [145]. To compare the variants, a hypothesis is formulated together with an experiment to test it, i.e., the actual A/B test. As opposed to regular software testing, A/B testing takes place in live systems. Figure 1 shows the general process of A/B testing with three main phases.

The first phase of A/B testing concerns the design of an A/B test. In this experiment design, a range of parameters is specified, such as: the hypothesis, the sample of the population the experiment should be targeted to, the duration of the experiment, and the A/B metrics that are collected during the experiment. The A/B metrics are used to determine the merit of each variant during the experiment. Examples of A/B metrics include the click-through rate (CTR), number of clicks, and number of sessions [47].
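The design parameters named above can be captured in a small data structure. The following is a minimal sketch, not taken from any of the primary studies: the class name `ABTestDesign`, its field names, and the helper `click_through_rate` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ABTestDesign:
    """Hypothetical container for the design parameters of an A/B test."""
    hypothesis: str
    sample_fraction: float      # share of the population targeted by the test
    duration_days: int          # planned duration of the experiment
    metrics: tuple = ("ctr",)   # names of the A/B metrics to collect

    def __post_init__(self):
        # Reject designs that cannot be executed as specified.
        if not 0.0 < self.sample_fraction <= 1.0:
            raise ValueError("sample_fraction must be in (0, 1]")
        if self.duration_days <= 0:
            raise ValueError("duration_days must be positive")


def click_through_rate(clicks: int, impressions: int) -> float:
    """CTR = clicks / impressions; defined here as 0.0 without impressions."""
    if impressions == 0:
        return 0.0
    return clicks / impressions
```

A design could then be created with, for example, `ABTestDesign("New layout increases CTR", 0.1, 14)`. Validating the parameters at construction time is a design choice of this sketch, made so that an invalid design fails before the test is deployed.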

The second phase of A/B testing consists of the execution of the A/B test in the running software system. Both variants are deployed in a live system, and the sample of the population is split among both variants. During the execution, the system keeps track of relevant data to evaluate the experiment after it finishes (according to the specified duration). Relevant data may directly correspond to the specified A/B metrics, or it may indirectly enable advanced analysis in the evaluation stage to gain additional insights from the conducted A/B tests.
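One common way to split the sampled population among both variants is deterministic, hash-based bucketing, so that a returning user always sees the same variant. The sketch below illustrates that idea under stated assumptions: the function name `assign_variant`, the bucket count, and the even 50/50 split are hypothetical choices, not a method prescribed by the reviewed studies.

```python
import hashlib

BUCKETS = 10_000  # assumed bucket resolution for this sketch


def assign_variant(user_id: str, experiment: str, sample_fraction: float = 1.0):
    """Return "A", "B", or None when the user is outside the experiment sample."""
    # Hash the experiment name together with the user id so that different
    # experiments split the same users independently of each other.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % BUCKETS
    cutoff = round(sample_fraction * BUCKETS)
    if bucket >= cutoff:
        return None  # user is not part of the sampled population
    # Split the sampled users evenly between the two variants.
    return "A" if bucket < cutoff / 2 else "B"
```

Because the assignment depends only on the identifiers, it needs no stored state and can be recomputed during the evaluation phase.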

The third phase of A/B testing comprises the evaluation of the experiment. After the A/B test is finished, the original hypothesis is tested, typically with a statistical test, such as Student's t-test or Welch's t-test [75, 156]. Based on the outcome of the test, the designer can then take follow-up actions, for instance initiating a rollout of a feature to the entire population or designing new A/B variants to test in subsequent A/B tests.
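As an illustration of the evaluation phase, the sketch below computes the test statistic and the Welch-Satterthwaite degrees of freedom of Welch's t-test with the Python standard library only. The function name `welch_t_test` is hypothetical. The sketch stops before the p-value, which requires the cumulative distribution function of the t-distribution; in practice a statistics library would typically be used for that step, for example `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
from math import sqrt
from statistics import mean, variance


def welch_t_test(sample_a, sample_b):
    """Return (t, df) for Welch's unequal-variance t-test of B against A."""
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two observations")
    # Squared standard error of each variant's mean (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    if se2_a + se2_b == 0:
        raise ValueError("both samples have zero variance")
    # Difference of the means, scaled by the combined standard error.
    t = (mean(sample_b) - mean(sample_a)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df
```

For the samples `[1, 2, 3, 4, 5]` (variant A) and `[2, 3, 4, 5, 6]` (variant B), both variances are 2.5, so the function returns t = 1.0 with 8 degrees of freedom.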

2.1.1. Controlled experiments vs. A/B testing

Traditionally, a controlled experiment is an empirical method that makes it possible to systematically test a hypothesis [32]. Two types of variables are distinguished in controlled experiments: independent and dependent variables. Independent variables are variables that are controlled during the experiment to test the hypothesis, for instance, a state-of-the-art approach and a newly proposed approach to solve a particular design problem, applied by a control group and a treatment group respectively. Dependent variables are variables that are measured during the experiment to compare the results of both the control and treatment group, for instance, the fault density and productivity obtained in a design task. After conducting the experiment, the hypothesis is tested and conclusions are drawn based on the results; for instance, a newly proposed design approach has a significantly lower fault density compared to the state-of-the-art approach, but more research is required concerning the productivity. Controlled experimentation is widely used across all types of scientific fields, such as psychology [31], pharmaceutics [116], education [32], and nowadays also in software engineering [142, 34, 68].

Figure 1: General A/B testing process.

Whereas controlled experiments are typically performed offline in a controlled setting, A/B testing uses controlled experiments to evaluate software features or variants on the end-users of a running system. For this reason, A/B testing is often referred to as online controlled experimentation [94, 59]. The aim of A/B testing lies in testing hypotheses in live software systems where end-users of these systems form the participants or population of the experiment. Examples of hypotheses that are tested in A/B testing often relate to improving user experience (UX) [130], improving user interface (UI) design [158], improving user click rates [4], or evaluating non-functional requirements in distributed services [14].

2.1.2. DevOps and A/B testing

Development Operations (DevOps in short) has gained popularity in recent years [35]. DevOps consists of a set of practices, tools, and guidelines to efficiently and effectively manage and carry out different tasks during software life-cycles. This ranges from the process of software development to the deployment and management of software at runtime. Automation of software processes plays a central part in DevOps to make life easier for developers and ease the burden of software development in general.

Common practices that are part of the DevOps lexicon are continuous integration and continuous deployment (CI/CD in short) [80]. CI/CD consists of the automation of software testing, software integration and building, and deployment of software, effectively reducing manual labor required by developers and easing the burden of deploying software. In a similar vein, continuous experimentation [172] aims at continuously setting up experiments in software systems to test new software variants. Put differently, continuous experimentation enriches the software development process by enabling a data-driven development approach (e.g., by measuring user satisfaction of new software features early on in development). To achieve this, A/B testing is used to set up and evaluate online controlled experiments in the software system. Fabijan et al. [58], for example, perform a case study on the evolution of scaling up continuous experimentation at Microsoft, providing guidelines for other companies to conduct continuous experimentation.

We start with a summary of secondary studies related to the study presented in this paper. Then we pinpoint the aim of the study presented in this paper to provide a systematic overview of the state-of-the-art in A/B testing.

We grouped related studies into three classes: studies with a focus on technical aspects of A/B testing, studies focusing on social aspects of A/B testing, and studies concerned with A/B testing in specific domains.

Technical aspects of A/B testing. Rodriguez et al. [132] performed a systematic mapping study on continuous deployment of software-intensive services and products. The authors identify continuous and rapid experimentation as one of the factors that characterize continuous deployment, and elaborate on this through the lens of the deployment of these experiments and the DevOps practices associated with it. Ros and Runeson [133] put forward a mapping study on continuous experimentation and A/B testing. The authors explore research topics, organizations that employ A/B testing, and take a deeper look at the type of experimentation that is conducted. Auer and Felderer [12] conducted a systematic mapping study on continuous experimentation. The authors focus on the research topics, contributions, and research types, collaboration between industry and academia, trends in publications, popularity in publications on A/B testing, venues, and paper citations. Recently, Auer et al. [13] presented a systematic literature review on A/B testing and continuous experimentation, leveraging the results from previous mapping studies [133, 12]. The authors apply forward snowballing on a set of papers to compose the list of primary studies for the review. They then explore the core constituents of a continuous experimentation framework, and the challenges and benefits of continuous experimentation. Closely related, Erthal et al. [53] conducted a literature review by applying an ad-hoc search, followed by snowballing on the initial set of identified papers. The study places emphasis on defining continuous experimentation and exploring its associated processes. While the authors acknowledge A/B testing as one of the strategies for achieving continuous experimentation, this literature review does not delve into the technical aspects of A/B testing.

Social aspects of A/B testing. An important social aspect of A/B testing is obtaining user feedback. A significant portion of A/B tests revolves around prioritizing and optimizing the user experience. We identified two studies that focus on this social aspect. Fabijan et al. [61] present a literature review on customer feedback and data collection techniques in the context of software research and development. The authors highlight existing techniques in the literature to obtain customer feedback and organize data collection, in which software development stages the techniques are used, and what the main challenges and limitations are for the techniques. One of the techniques outlined by the authors is A/B testing, which can serve as a valuable tool to obtain user feedback on prototypes. Fabijan et al. [62] discuss challenges and implications of the lack of sharing customer data within large organizations. One specific case presented by the authors highlights critical issues that arise when qualitative customer feedback from the pre-development stage is not shared with the development stage, forcing developers to repeat the collection of user feedback or to develop products without this information.

A/B testing in specific domains. Beyond A/B testing at Internet-based companies, the use of A/B testing is reported in various other domains. An example is the domain of embedded systems. Mattos et al. [119] explore challenges and strategies for continuous experimentation in embedded systems, providing both industrial and research perspectives. Another domain is Cyber-Physical Systems (CPS). Giaimo et al. [69] present a systematic literature review on the state-of-the-art of continuous experimentation in CPS, concluding that the literature focuses more on presenting challenges than on proposing solutions to them.

Summary. Existing secondary studies examined A/B testing with a focus on realizing tests, associated processes, and the types of experimentation conducted. However, these studies have a particular focus, or they lack a rigorous search process to identify relevant studies. Existing studies fall short in providing insights into the target of A/B testing (i.e., “what” is the subject of testing), the roles of stakeholders in designing and executing A/B tests, and the utilization of A/B test results.

2.2.2. Aim of the study.

To tackle the limitations of existing studies, we performed an in-depth literature study. We define the aim of this study using the Goal Question Metric (GQM) approach [17]:

Purpose: Study and analyze
Issue: The design and execution of A/B testing
Object: In software systems
Viewpoint: From the view point of researchers.

Concretely, we aim to investigate the subject of A/B testing, how A/B tests are designed and executed, and what the role of stakeholders is in the different phases of A/B testing. Finally, we also aim to obtain insights into the research problems reported in the literature.

3. Methodology

This study uses the methodology of a systematic literature review as described in [84]. This methodology describes a rigorous process to review the literature on a topic of interest. The process ensures that the review identifies, evaluates, and interprets all relevant research papers in a reproducible manner. The literature review consists of three main phases: planning, execution, and synthesis. During planning, a protocol is defined for the study [129], which includes the motivation for the study, the research questions to be answered, sources to search for papers, the search string, inclusion and exclusion criteria, data items to be extracted from the primary studies, and analysis methods to be used. During execution, the search string is applied as specified in the protocol, the inclusion and exclusion criteria are applied to identify the primary studies, and all the data items are extracted from these papers. Lastly, during synthesis, the extracted data is analyzed and interpreted to answer the research questions and to obtain useful insights from the study.

We conducted the systematic literature review with four researchers. Further details on the process of the literature review (e.g., the roles the researchers play in the literature review) are summarized in the following sections. A complete description of the protocol, together with all collected data and the data analysis, is available at the study website [129].

3.1. Research questions

To realize the aim of this study (“Study and analyze the design and execution of A/B testing in software systems from the view point of researchers.”), we put forward four research questions:

RQ1: What is the subject of A/B testing?
RQ2: How are A/B tests designed? What is the role of stakeholders in this process?
RQ3: How are A/B tests executed and evaluated in the system? What is the role of stakeholders in this process?
RQ4: What are the reported open research problems in the field of A/B testing?

With RQ1, we investigate the subject of A/B testing, i.e., the (part of the) system to which an A/B test is applied. Examples include A/B tests on program variables, application features, software components, subsystems, the system itself, and infrastructure used by the system. We also investigate the domains in which A/B testing is used.

With RQ2, we investigate what is defined and specified in A/B tests before they are executed in the system. We look at the metrics used, whether statistical methods are used in the experiments and if so which methods, and the tools used to conduct the experiments. We also investigate which stakeholders are involved in this process and what their role is (e.g., users of the system influencing the tests that should be deployed, or architects deciding on which population the A/B tests should be run).

With RQ3, we investigate how A/B tests are executed in the system and how the results are evaluated. More specifically, we look at the way in which data is collected for evaluation in the test, the evaluation of the A/B test itself (using the collected data and, if applicable, the result of a statistical test), and the use of the test results (e.g., decision about selection of target, input for maintenance, trigger for next test in a pipeline). We also explore the role stakeholders have during this process of A/B testing (e.g., operators deciding when to finish an experiment).

With RQ4, we identify open research problems in the field of A/B testing. The problems can be derived from descriptions of limitations of proposed approaches in the reviewed papers, open challenges, or outlines of future work on A/B testing.

3.2. Search query

We first identified a list of relevant terms for A/B testing from a number of known publications [87, 73, 93, 92, 85, 47]. We then identified and applied a gold standard [178] to tune the terms. For a detailed description of the relevant terms and application of the gold standard, we refer to the research protocol [129]. Figure 2 (top) displays the final search query after applying the gold standard.

3.3. Search strategy

The search query was executed in October 2022. The search query was applied to the title and abstract of each paper in the sources (not case-sensitive). The automatic search resulted in 3,944 papers, as shown in Figure 2. After filtering duplicate papers and, where a journal paper extends a conference paper, keeping only the journal version, 2,379 research papers were left for further processing.
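The automatic search step can be pictured with a minimal sketch. The search terms below are placeholders only (the actual query is the one shown in Figure 2 and in the protocol), and the `Paper` record and the helpers `matches_query` and `deduplicate` are hypothetical names introduced for illustration; they are not taken from the study's tooling.

```python
from dataclasses import dataclass

# Placeholder terms; the real search query is given in Figure 2 of the paper.
SEARCH_TERMS = ("a/b test", "online controlled experiment", "continuous experimentation")


@dataclass(frozen=True)
class Paper:
    title: str
    abstract: str


def matches_query(paper, terms=SEARCH_TERMS):
    """Return True if any term occurs in the title or abstract (case-insensitive)."""
    text = f"{paper.title} {paper.abstract}".lower()
    return any(term in text for term in terms)


def deduplicate(papers):
    """Keep the first paper seen for each whitespace- and case-normalized title."""
    seen, unique = set(), []
    for paper in papers:
        key = " ".join(paper.title.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(paper)
    return unique


papers = [
    Paper("A/B Testing at Scale", "We report on experimentation."),
    Paper("a/b testing at  scale", "Duplicate entry from another source."),
    Paper("Compiler design", "No relevant terms here."),
]
hits = deduplicate([p for p in papers if matches_query(p)])
print(len(hits))
```

Matching on a lowercased concatenation of title and abstract mirrors the "not case-sensitive" setting described above; real digital libraries apply their own query syntax, which this sketch does not model.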

3.4. Search process

After collecting the papers, we applied the following inclusion criteria:
IC1: Papers that either (1) have a primary focus on A/B testing (or any of its known synonyms) or (2) describe and apply (new) design(s) of A/B tests, for example by introducing a proof-of-concept;
IC2: Papers that include an assessment of the presented A/B tests, either by providing an evaluation through simulation with artificial data or field data, or through running one or more field experiments in a real system;
IC3: Papers written in English.

We defined IC1 such that we only include works that are relevant to the posed research questions, i.e., it is essential that the work focuses on A/B testing or on the design and evaluation of A/B tests. Note that IC1 includes papers that address and present solutions to known challenges in A/B testing. IC2 ensured that only papers are included that contain data related to the design and/or running of A/B tests. Lastly, with IC3 we only included papers that are written in English.

Besides the inclusion criteria above, we also applied the following exclusion criteria:
EC1: Papers that report (systematic) literature reviews, surveys (using questionnaires), interviews, and roadmap papers;
EC2: Short papers, demos, extended abstracts, keynote talks, and tutorials;
EC3: Papers with a quality score that does not pass the threshold of 4 (explained in Section 3.5);
EC4: Papers that provide no or only a very brief description of the A/B testing design process or execution process.

Figure 2: Primary studies selected for the systematic literature review.

EC1, EC2, and EC3 excluded papers that do not directly contribute new technical advancements, preliminary works that have not been fully developed yet, or works that are not of sufficient quality. In this literature review we focus on mature, state-of-the-art research in the field of A/B testing to answer the research questions. EC4 excluded works that do not contain essential information to answer the research questions.

Papers that satisfied all inclusion criteria and none of the exclusion criteria were included as primary studies in the literature study. The application of inclusion and exclusion criteria to the titles and abstracts of the research papers resulted in 279 papers. A thorough reading of the papers further reduced the number of papers to 137. In addition to the research papers retrieved via the search string and filtered by applying inclusion/exclusion criteria, we applied snowballing on the cited works of these papers to capture potentially missed papers. With snowballing we discovered 4 additional papers, bringing the final number of primary studies to 141, as shown in Figure 2.
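The selection rule stated above (all inclusion criteria hold and no exclusion criterion applies) can be written as a small predicate. The sketch below is illustrative only: the `Screening` record and `is_primary_study` are hypothetical names, and the per-criterion judgments would in practice come from the researchers' manual assessment.

```python
from dataclasses import dataclass, field


@dataclass
class Screening:
    """Manual screening outcome for one paper (hypothetical record layout)."""
    inclusion: dict = field(default_factory=dict)  # e.g. {"IC1": True, "IC2": True, "IC3": True}
    exclusion: dict = field(default_factory=dict)  # e.g. {"EC1": False, ..., "EC4": False}


def is_primary_study(screening):
    """A paper is kept only if every IC holds and no EC applies."""
    return all(screening.inclusion.values()) and not any(screening.exclusion.values())


kept = Screening(
    inclusion={"IC1": True, "IC2": True, "IC3": True},
    exclusion={"EC1": False, "EC2": False, "EC3": False, "EC4": False},
)
dropped = Screening(
    inclusion={"IC1": True, "IC2": True, "IC3": True},
    exclusion={"EC1": True, "EC2": False, "EC3": False, "EC4": False},
)

# Counts reported in the text: 137 papers after full reading plus 4 from snowballing.
after_full_reading = 137
from_snowballing = 4
primary_studies = after_full_reading + from_snowballing
print(is_primary_study(kept), is_primary_study(dropped), primary_studies)
```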

3.5. Data items

To be able to answer the research questions, we extract the data items listed in Table 1. For each data item we provide a detailed description.

D1-4: Authors, year, title, and venue used for documentation purposes.

Table 1: Collected data items to answer the research questions

Identifier	Data item	Purpose
D1	Authors	Documentation
D2	Year	Documentation
D3	Title	Documentation
D4	Venue	Documentation
D5	Paper type	Documentation
D6	Authors sector	Documentation
D7	Quality score	Documentation
D8	Application domain	RQ1
D9	A/B target	RQ1
D10	A/B test type	RQ2
D11	Used metrics	RQ2
D12	Statistical methods employed	RQ2
D13	Role of stakeholders in the experiment design	RQ2
D14	Additional data collected	RQ3
D15	Evaluation method	RQ3
D16	Use of test results	RQ3
D17	Role of stakeholder in experiment execution	RQ3
D18	Open problems	RQ4

D5: The type of paper. Options include: focus paper (focus on A/B testing itself, i.e., modifications, suggestions, or enhancements to the A/B testing process), or applied paper (application and evaluation of A/B testing in real software systems).

D6: The sector of the authors of the primary study used for documentation (based on the author’s affiliation). Options include Fully academic, Fully industrial, and Mixed.

D7: A quality score for the reporting of the research [115]. The quality score is defined on the following items: Problem definition of the study, Problem context (relation to other work), Research design (study organization), Contributions and study results, Derived insights, Limitations. Each item is rated on a scale of three levels: explicit description (2 points), general description (1 point), or no description (0 points). Therefore, the quality score is defined on a scale of 0 to 12 [113].
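The scoring scheme of D7 amounts to summing six item ratings. A minimal sketch, assuming each item is recorded with one of three labels; the identifiers `QUALITY_ITEMS`, `POINTS`, and `quality_score` are hypothetical and chosen here for illustration.

```python
# The six items rated for D7, as listed in the text.
QUALITY_ITEMS = (
    "problem_definition",
    "problem_context",
    "research_design",
    "contributions_and_results",
    "derived_insights",
    "limitations",
)

# Three rating levels and their points, as described for D7.
POINTS = {"explicit": 2, "general": 1, "none": 0}


def quality_score(ratings):
    """Sum the points over all six items; the result lies between 0 and 12."""
    missing = set(QUALITY_ITEMS) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    return sum(POINTS[ratings[item]] for item in QUALITY_ITEMS)


example = {item: "explicit" for item in QUALITY_ITEMS}
example["limitations"] = "none"
print(quality_score(example))
```

With six items and at most 2 points each, the maximum of 12 follows directly.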

D8: The application domain that is used in relation to A/B testing in the primary study. Initial options include E-commerce, Telecom, Automotive, Finance, Robotics. Further options were derived during data collection.

D9: The target of A/B tests describes the element that is the subject of A/B testing. Initial options include an algorithm, a user interface, and application configurations. Further options were derived during data collection.

D10: The type of A/B test corresponding to the number of A/B variants and the way in which they are tested. Initial options include Single (classic) A/B test, Single multivariate A/B test, Manual sequence of classic A/B tests, Manual sequence of multivariate A/B tests, Automated sequence of classic A/B tests, Automated sequence of multivariate A/B tests. Additional options were derived during data collection.

D11: The metrics that are used in the A/B tests. Initial options include Click rate, Click-through rate, Number of clicks, Number of sessions, Number of queries, Absence time, Time to click, Session time. Additional options were derived during data collection.

Table 2: Paper types of the primary studies (columns: Type, Number of occurrences; rows: Focus, Applied).

D12: The statistical method that is employed to evaluate the data obtained through the A/B test, if any. Initial options include Student's t-test, Proportional test, No statistical test. Further options were derived during data collection.

D13: The role of stakeholders in the experiment design. Initial options include Determining A/B test goal/hypotheses, Determining A/B test duration, Tuning A/B test variants. Further options were derived during data collection.

D14: Additional data that is collected during the execution of an A/B test (in addition to direct or indirect A/B metric data). Examples include User geo-location, Browser type, Timestamps of invocations or requests. Further options were derived during data collection.

D15: The evaluation method used in the primary study. Initial options include Illustrative example, Simulation, Empirical evaluation.

D16: The use of the test results gathered from A/B tests. Examples include Subsequent A/B test execution, Subsequent A/B test design, Feature rollout, Feature development. Further options were derived during data collection.

D17: The role of stakeholders in the process of executing A/B tests. Initial options include A/B test alteration (adjusting individual A/B tests), A/B test triggering (starting subsequent A/B tests manually), A/B test supervision (monitoring A/B test execution), No involvement, Unspecified. Further options were derived during data collection.

D18: Reported open problems. Open problems are derived from the reported challenges, limitations, and threats to validity. Options are derived during data collection.

4. Results

We start with the demographic information about the primary studies. Then we zoom in on each of the research questions.

4.1. Demographic information

Demographic information is extracted from data items Paper type (D5), Authors sector (D6), and Quality score (D7).

The 141 primary studies comprise focus papers, which concentrate on A/B testing itself, and applied papers, which apply A/B testing or use it for evaluation purposes; see Table 2.

A majority of 72 primary studies (51.1%) have industrial authors, see Table 3. Forty-three studies (30.5%) have a mix of industry and academic authors, and 26 studies (18.4%) are from academic authors only.

Figure 3 shows the distribution of the quality scores. The distribution shows that the reporting of the research in the primary studies is of good quality. Since all papers passed the threshold of 4, none of the papers had to be excluded for the extraction of data to answer the research questions.

Table 3: Author backgrounds of the primary studies.

Background	Number of occurrences
Academic	26
Industry	72
Mixed	43

Figure 3: Quality scores of the primary studies.

Table 4: Identified application domains for A/B testing.

Application domain	Number of occurrences
Web	38
Search engine	35
E-commerce	27
Interaction	22
Finances	16
Transportation	4
Other	
N/A	9

4.2. RQ1: What is the subject of A/B testing?

To answer this research question, we look at the following data items: Application domain (D8), and A/B target (D9).

Application domain. Table 4 lists the application domains of the primary studies. The average number of domains is 1.13 (131 primary studies applied A/B testing in one domain, three studies in two domains, six studies in three domains, and one study in four domains). Nine studies do not mention any domain. We observe that the most popular application domain is the Web (38 occurrences). Typical examples are social media platforms, such as Facebook [109] or LinkedIn [170], news publishers [175, 60], and multimedia services, such as movie streaming at Netflix [9]. The second most popular domain is search engines (35 occurrences), with studies conducted at Yandex [46, 45], Bing [41, 112], and Yahoo [6, 150], among others. A/B testing is also actively applied in E-commerce (27 occurrences), with examples from retail giant Amazon [52], the fashion industry [26], and C2C (consumer-to-consumer) businesses, such as Etsy [83] and Facebook marketplace [77]. Next we observe the application of A/B testing in what we group under “interaction” (22 occurrences), with digital communication software, such as Snap [167] and Skype [60], user-operating system interaction [74, 56], and application software, such as an App store [33] and mobile games [173]. Lastly, we note the financial application domain (16 occurrences), including studies at Yahoo Finance [179] and Alipay [24], and transportation (4 occurrences), for instance at Didi Chuxing [66]. Other domains are education (3 occurrences) [131] and robotics (2 occurrences) [118], among others.

A/B target. The target of the A/B test denotes the element that is subject to testing and of which (at least) two variants are compared. Table 5 lists the A/B targets we identified from the primary studies, with a description and examples for each. The average number of A/B targets is 1.21 (120 primary studies applied A/B testing to one element, 26 studies to two elements, and 24 studies to three elements). Note that studies with more than one A/B target typically apply these in multiple experiments. The dominating targets of A/B testing are algorithms, visual elements, and workflow/process, which together make up the majority of all A/B targets reported in the primary studies. Notably, 32 primary studies did not specify a particular A/B target, for example using datasets from two prior A/B tests in the paper’s evaluation without clarifying the details of these tests [166].

Application domain vs A/B target. We can now map the application domains with the targets of A/B testing. This analysis provides insights into which elements or components are typically the subject of A/B testing in particular domains, or alternatively which A/B targets remain unexplored in particular domains. Table 6 presents this mapping. We highlight a number of key observations:

A/B testing of algorithms is applied across all application domains, and for all major domains it is the primary target of testing. Commonly tested algorithms include feed ranking algorithms for social media websites, recommendation algorithms for news/multimedia websites, search ranking algorithms for search engines, and advertisement serving algorithms both in the Web and search engine application domains.
A/B testing of visual elements is particularly popular for search engines (16 studies) compared to other application domains such as Web (with only 6 studies). Typical examples include changes to the font color of search engine results and changing the position of advertisements on the result page [121].
Workflow and process elements as target are commonly applied across the major domains. This target is particularly popular for the Web and E-commerce (with 8 and 7 studies, respectively). Typical examples are changes to the process in which best-performing advertisements are determined on the advertisement platform of JD, China’s largest online retailer [162], and changes to the order assignment policy for on-demand meal delivery platforms [102].
For the Web and search engines, all types of targets are applied. The main focus for the Web is on algorithms and workflow/processes, while the focus for search engines is on algorithms, visual elements, and back-end. For the Web, we notice only a single primary study with back-end as target. This study targets different microservice configurations in A/B testing in order to tune individual microservices for performance improvements [147]. On the other hand, for search engines, we only noted three primary studies that target a workflow or process in A/B testing. One study evaluated a change of wording in digital advertisements [18], one study evaluated a change in advertisement strategies [75], and the last study evaluated the option to pay for “sponsored search” (to prioritize search results) [19].
For e-commerce, we noticed that A/B testing is mainly used to test changes to ranking and recommendation algorithms, and to processes such as virtual assistants. Notably, we only identified a single primary study that evaluated changes to the user interface [103].
A/B testing for back-end optimizations was found to be most common for search engines, while we did not identify a paper in the e-commerce and finance domains where A/B testing was used for back-end changes.

Table 5: Identified A/B targets, with description.

A/B target	Description	Number of occurrences
Algorithm	Updated version of an algorithm such as a recommendation algorithm [175], a search ranking algorithm [86], or an ad serving algorithm [16].	58
Visual elements	Change to visual components such as updates to a website layout [22] or a general user interface update [40].	33
Workflow / process	Alteration to the workflow of an application, e.g., the addition of a feedback button to a dashboard [110], or a change in a user workflow, e.g., the process of a virtual assistant tool [96].	28
Back-end	Optimization of a software component that is not directly visible to the user, such as testing server optimizations [127] or adjusting application parameters for better performance [60].	10
New application functionality	Newly introduced functionality, such as a new widget on a web-page [28] or additional content that is presented to the user after performing a search query [112].	6
Other	This category comprises three other targets: different timing and content of emails sent [174], varying educational resources presented to the user [131], and the page configuration of a website [157].	3
Unspecified	The target of the test was not specified in the study.	32

Table 6: Application domain vs A/B target (rows: Web, Search engine, E-commerce, Interaction, Finances, Transportation, Other; columns: Algorithm, Visual elements, Workflow / process, Back-end, New app. func., Other).

Research question 1: What is the subject of A/B testing? The main targets of A/B testing are algorithms, visual elements, workflow and processes, and back-end features. A/B testing is commonly applied in the domains of Web, search engines, e-commerce, interaction software, and finances. Algorithms are consistently tested across these domains. Visual elements are predominantly evaluated in search engines, and counter-intuitively not in e-commerce. Workflow and processes are popular A/B targets in the Web and e-commerce domains. On the other hand, back-end features such as server performance are popular targets for search engines.

Figure 4: Identified A/B test types.

4.3. RQ2: How are A/B tests designed? What is the role of stakeholders in this process?

To answer the second research question, we look at the following data items: A/B test type (D10), Used metrics (D11), Statistical methods employed (D12), and Role of stakeholders in the experiment design (D13).

4.3.1. Design of A/B tests

To answer the first part of RQ2 (How are A/B tests designed?), we take a deeper look at the design of the A/B tests, focusing on the type of A/B tests, A/B metrics, and statistical methods used in the A/B tests.

A/B test type. The types of A/B tests include single classic A/B tests with two variants, A/B tests composed of more than two variants (denoted as multi-armed A/B tests), multivariate A/B tests where combinations of elements are tested in one A/B test, and sequences of all these types. Figure 4 shows the frequencies of these different A/B test types extracted from the primary studies.

Table 7: Identified A/B metrics.

A/B metric	Number of occurrences
Engagement metrics	225
Click metrics	82
Monetary metrics	64
Performance metrics	50
Negative metrics	34
View metrics	21
Feedback metrics	17

Overall, we identified 155 occurrences of A/B test types, i.e., an average of 1.13 occurrences per primary study (123 studies considered a single type of A/B test, 17 studies considered two types, and one study considered three test types). The majority of the primary studies employed single classic A/B testing with a control variant and a treatment variant (95 occurrences). This standard test is used to test a variety of targets. The second most common type of A/B test is a multi-armed A/B test (30 occurrences). This type of test is composed of more than two variants under test, for example one control variant as baseline and three treatment variants with a distinct version each. These tests are commonly used to evaluate multiple versions of a recommendation algorithm, e.g., [141, 149], and to test different advertisement serving algorithms, e.g., [155]. The third most common type of A/B test is a sequence of classic A/B tests (24 occurrences). Examples here include the comparison of multiple variants in a manually executed sequential style (as opposed to a multi-armed A/B test where all variants are deployed simultaneously) [63], manually testing multiple iterations of machine learning algorithms sequentially [105], and automatically executing a sequence of A/B tests to handle controlled feature release [139]. The last identified A/B test type is the multivariate A/B test (6 occurrences). This type of test evaluates various combinations of multiple features. As opposed to a multi-armed A/B test, a multivariate A/B test enables testing variants of more than a single feature in a single A/B test. An example is the comparison of different combinations of varying GUI elements [40].
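The difference between the A/B test types described above lies in how the set of variants is formed. The following sketch illustrates this under simple assumptions; the variant labels and the two GUI features (`button_color`, `layout`) are hypothetical and not taken from any primary study.

```python
from itertools import product

# Classic A/B test: one control and one treatment variant.
classic_variants = ["control", "treatment"]

# Multi-armed A/B test: one control and several treatments of the same feature.
multi_armed_variants = ["control", "treatment_1", "treatment_2", "treatment_3"]

# Multivariate A/B test: every combination of several features is a variant.
# Feature names and values are hypothetical.
features = {
    "button_color": ["blue", "green"],
    "layout": ["list", "grid"],
}
multivariate_variants = [
    dict(zip(features, combination))
    for combination in product(*features.values())
]

for variant in multivariate_variants:
    print(variant)
```

In a full-factorial multivariate design such as this one, the number of variants is the product of the number of options per feature, so it grows quickly as features are added.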

A/B metrics. Table 7 lists the A/B metrics that we extracted from the primary studies. In total, 493 occurrences of A/B metrics were reported in the primary studies. With a total of 198 experiments spread over 141 studies, this gives an average of 2.12 metrics per experiment (ranging from 1 to 8 metrics per experiment). The most common group of A/B metrics are engagement metrics (225 occurrences), which refer to the number of conversions, number of user sessions, time users are present on the website, and metrics related to the usage of the application or website (e.g., number of posts rated, number of bookings made). The second largest group are click metrics (82 occurrences). Examples include number of clicks, clicks per query, and good click rate. The third group of A/B metrics we identified are metrics related to monetization, i.e., revenue and cost (64 occurrences). Examples include number of purchases, order value, revenue per email opening, and advertisement cost. The next group are performance metrics (50 occurrences). Examples include a simple response time of an application, bandwidth used, end-to-end latency, or playback delay of audio. The remaining groups are metrics that track unwanted effects in the A/B tests (34 occurrences, e.g., abandonment rate or number of un-subscriptions), views (21 occurrences, e.g., number of page views or number of product views), and user feedback (17 occurrences, e.g., number of customer complaints or verbatim feedback).
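To make the metric groups above concrete, the following sketch derives a view metric, a click metric, and a purchase count per variant from a small event log. The log format, the event names, and the function `metrics_per_variant` are assumptions made for illustration; the primary studies use their own instrumentation.

```python
from collections import Counter

# Hypothetical event log: (user_id, variant, event_type).
events = [
    ("u1", "A", "view"), ("u1", "A", "click"),
    ("u2", "A", "view"),
    ("u3", "B", "view"), ("u3", "B", "click"), ("u3", "B", "purchase"),
    ("u4", "B", "view"), ("u4", "B", "click"),
]


def metrics_per_variant(log):
    """Aggregate simple view, click, and purchase metrics for each variant."""
    counts = {}
    for _user, variant, event in log:
        counts.setdefault(variant, Counter())[event] += 1
    result = {}
    for variant, c in counts.items():
        views = c["view"]
        result[variant] = {
            "views": views,                      # view metric
            "clicks": c["click"],                # click metric
            # Click-through rate: clicks divided by views (0.0 if no views).
            "click_through_rate": c["click"] / views if views else 0.0,
            "purchases": c["purchase"],          # monetary metric (count only)
        }
    return result


summary = metrics_per_variant(events)
print(summary["A"]["click_through_rate"], summary["B"]["click_through_rate"])
```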

Table 8: Statistical methods employed during A/B testing.

Statistical methods employed	Number of occurrences
Hypothesis – equality	57
Hypothesis – equality (concrete method unspecified)	37
Bootstrapping	11
Hypothesis – inference	8
Goodness of fit	8
Correction method	7
Estimator	6
Hypothesis – independence	5
Regression method	2

Statistical methods. Table 8 groups the types of statistical methods used for A/B tests in the primary studies. The most commonly used statistical methods are hypothesis tests that test for equality (94 occurrences in total). The main test used in this group is Student's t-test, e.g., [75, 71]. Other tests in this group are the Kolmogorov-Smirnov test, e.g., [140], the Mann-Whitney test, e.g., [137], and the Wilcoxon signed-rank test, e.g., [156]. Out of the 94 occurrences of this type of hypothesis test, 37 primary studies did not report the concrete test used in the analysis of the results. The second most commonly used method is bootstrapping (11 occurrences). This method constructs multiple datasets by resampling the original dataset [46]. The newly constructed datasets are then typically used for equality hypothesis testing. The key benefit of this technique is the improved sensitivity of the analysis of the results. However, a major drawback of the technique is that it is computationally expensive, especially for larger datasets [110]. The third most commonly used statistical methods are hypothesis tests for inference and goodness-of-fit methods (both 8 occurrences). Examples of inference hypothesis tests include a Bayesian analysis approach to ensure that multiple simultaneously running experiments do not interfere [89], and a Bayesian approach to infer the causal effect of running ad campaigns [15]. Examples of goodness-of-fit methods include sequential testing methods that are based on likelihood ratio tests [83], and a Wald test [81]. The remaining groups are correction methods (7 occurrences), e.g., Bonferroni correction [177]; custom estimators for observations in A/B testing (6 occurrences), e.g., an estimator that takes variances into account [109]; hypothesis tests for independence (5 occurrences), e.g., [150]; and regression methods (2 occurrences), e.g., CUPED [48].
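Two of the methods named above can be sketched briefly. The first function is a percentile bootstrap for the difference in means between two variants, one common way to use resampled datasets; it is not necessarily the variant used in the cited studies. The second applies a Bonferroni correction by dividing the significance level by the number of comparisons. Function names, default parameters, and the sample data are assumptions for illustration.

```python
import random
from statistics import mean


def bootstrap_diff_ci(control, treatment, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(treatment) - mean(control)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping the original sizes.
        c = rng.choices(control, k=len(control))
        t = rng.choices(treatment, k=len(treatment))
        diffs.append(mean(t) - mean(c))
    diffs.sort()
    lower = diffs[int(n_resamples * alpha / 2)]
    upper = diffs[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance level under a Bonferroni correction."""
    return alpha / n_comparisons


# Hypothetical metric values per user for two variants.
control = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.3, 1.1]
treatment = [2.0, 2.2, 1.9, 2.1, 2.3, 1.8, 2.0, 2.1]

lower, upper = bootstrap_diff_ci(control, treatment)
# If the interval excludes zero, the difference is unlikely to be due to chance.
print(lower > 0, bonferroni_alpha(0.05, 5))
```

Each resample recomputes the statistic over the full dataset, which is why the cost grows with both the number of resamples and the dataset size, in line with the drawback noted above.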

4.3.2. Role of stakeholders

To address the second part of RQ2 (What is the role of stakeholders in the design of A/B tests?), we analyze the role stakeholders play in the design of A/B tests.

Roles of stakeholders. Table 9 lists the different roles of stakeholders in the design of A/B tests that we extracted from the primary studies, together with their tasks, descriptions, and examples. We identified three main roles: concept designer (127 occurrences), experiment architect (111 occurrences), and setup technician (31 occurrences). The role of concept designer consists of conceptualizing new ideas for A/B testing. The role of experiment architect consists of calibrating technical parameters of the experiment such as the experiment duration. The role of setup technician consists of taking the necessary steps required to allow the execution of the A/B test. The top task of the concept designer is designing and tuning variants of A/B tests (67 occurrences). The top task of the experiment architect is determining the duration of A/B tests (60 occurrences). Finally, the main task of the setup technician is performing post-design activities of A/B tests (25 occurrences).
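One of the experiment architect's tasks listed in Table 9 is determining the population assignment, for which a simple split of all users is the most basic option. A minimal sketch of such a split, assuming users are identified by a string ID, is shown below. Hash-based bucketing is a widely used technique but is our illustration here, not a method attributed to a specific primary study; `assign_variant` and its parameters are hypothetical names.

```python
import hashlib


def assign_variant(user_id, experiment, variants=("control", "treatment")):
    """Deterministically map a user to a variant by hashing experiment and user ID."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]


assignments = [assign_variant(f"user-{i}", "new-ranking") for i in range(1000)]
print(assignments.count("control"), assignments.count("treatment"))
```

Including the experiment name in the hash keeps assignments stable within one experiment while letting the same user fall into different groups across experiments.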

4.3.3. Cross analysis A/B test design

We discuss two mappings of data items: the role stakeholders take in the design of A/B tests versus A/B test type, and the A/B metrics used in experiments versus the statistical methods employed.

Tasks of stakeholders vs

test type. The mapping of stakeholder’s tasks in the design of

tests across types of

tests is shown in Table 10. We observe the following:

The primary tasks of stakeholders across all types of tests are the design and tune of variants, determining the duration of experiments, the population, and the goal or hypothesis. These numbers confirm that these are essential design tasks for any test.
A majority of the studies that use multi-armed testing and sequence of tests report the design and tuning of variants as important stakeholder task ( 22 and 13 occurrences respectively).

Table 9: Roles and tasks of stakeholders in the design of A/B tests (Occ short for number of occurrences).

Role	Task	Task description	Occ.
	Design and tune variants	Designing and tuning the variants to test. Examples are tweaking the variants [141], or designing variants for different kinds of populations (e.g., old vs new users) [21].	67
	Determine goal or hypothesis	Formulating the goal or hypothesis of the test itself. Examples include the specification of a goal to find the better performing news selection algorithm [50] or the specification of a pre-determined hypothesis for the test [5].	48
	Perform pre-design actions	Actions that are taken before designing the test. Examples include providing motivation for A/B tests [157] or performing offline A/B tests before moving to online testing [72].	12
Setup technician (31)	Determine duration	Determining the duration of the test. Examples include choosing a fixed experiment duration (e.g., 1 week) [5] or via an explicit expiration date [106].	60
	Determine population assignment	Determining the population that should take part in the test. Examples include a simple split of all users [163], an assignment where the target population is determined over a two-week period [173], or an assignment where network effects have to be taken into account [102].	51
	Perform post-design actions	Actions that are taken after completing the design of the test. Examples include performing testing prior to running the test [179, 33], validation of the test design [110], or scheduling the execution of the A/B test [157].	25
	Perform metric analysis and initialization	Analyzing and potentially initializing metrics for the test. An example consists of instantiating a custom A/B utility metric with negative and positive weights tied to the user’s actions during a search session [112].	6
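
To make the task "Determine population assignment" in Table 9 more concrete, the following minimal Python sketch shows one common way to realize a simple split of all users: deterministic, hash-based bucketing. It is an illustration under our own assumptions and is not taken from any primary study; the function name `assign_variant`, the experiment key, and the variant labels are hypothetical.

```python
import hashlib


def assign_variant(user_id, experiment_key, variants=("control", "treatment")):
    """Deterministically map a user to one variant of an experiment (illustrative)."""
    # Hashing the experiment key together with the user id means the same user
    # can fall into different buckets in different experiments.
    raw = f"{experiment_key}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(raw).hexdigest()
    # Interpret the hash as an integer and reduce it to a bucket index.
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]
```

Because the assignment depends only on the user identifier and the experiment key, a returning user always sees the same variant without any stored state. Assignments that must account for network effects, as in [102], need a different scheme.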

Table 10: Tasks of stakeholders vs A/B test type.

Task	Single classic A/B test (95)	Multi-armed A/B test (30)	Sequence of A/B tests (24)	Multivariate A/B test (6)
Design and tune variants	33	22	13	2
Duration	45	9	11	2
Population assignment	37	7	8	2
Goal/hypothesis	27	17	8	2
Post-design actions	12	1	5	0
Pre-design actions	6	4	2	1
Metric analysis/init.	5	1	0	0

Table 11: Statistical methods vs A/B metrics (H short for hypothesis).

Method	Engag.	Click	Monetary	Negative	Perf.	View	Feedback
H – equality	31	14	7	10	4	7	2
H – equality (unsp.)	24	12	8	8	11	5	5
Bootstrapping	9	2	2	3	3	1	1
H – inference	5	1	0	0	1	0	0
Goodness of fit	5	1	2	1	0	0	0
Correction method	4	1	1	1	2	0	1
Estimator	4	1	2	1	0	1	0
H – independence	2	2	3	0	0	1	0
Regression method	1	1	1	0	0	1	0

Since these types of tests involve multiple variants under test, the studies often specify more details about the variants and the reasoning behind choosing which variants to test.

Determining the goal or hypothesis for testing is frequently mentioned for multi-armed tests (17 occurrences). In contrast to conventional two-variant testing, which typically involves a control variant and an altered variant aimed at improving on the control variant, multi-armed tests involve more than two variants, so practitioners often formulate hypotheses regarding the potential performance of each variant.
Post-design actions are more often reported for sequences of tests (5 instances). For instance, one primary study mentions modeling the sequence of tests [139], another study mentions determining the success condition of the tests before executing them [151], and another study refers to providing an outcome range of the tests [152].
Only a few primary studies report pre-design actions and metric analysis and initialization, independently of the type of test.
A/B metrics vs statistical methods used. The statistical methods used across different types of metrics are shown in Table 11.
Engagement metrics and click metrics are used across all types of statistical methods.
The concrete method used for hypothesis testing of equality is often not specified across all types of metrics. For monetary and performance metrics in particular, a majority of studies do not mention the concrete hypothesis testing method (8 and 11 occurrences, respectively). This might be due to the sensitivity in reporting results for these types of metrics.

Table 12: Data collected for the A/B tests.

Data collected	Number of occurrences
Product/system data	48
User-centric data	26
Spatial-temporal data	20
Secondary data	6

Negative metrics are primarily used for hypothesis equality tests (10 and 8 occurrences for hypothesis equality and hypothesis equality with no method specified, respectively).
Hypothesis testing for independence is most frequently used with monetary metrics, yet its use is uncommon (3 instances).
The use of feedback metrics is also uncommon and, if they are used, the specific statistical method is mostly not reported (5 occurrences).
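
As an illustration of what a hypothesis test for equality on a click metric can look like, the sketch below implements a pooled two-proportion z-test with the Python standard library. The choice of this particular test is our assumption for illustration; the primary studies use a range of concrete tests and, as noted above, often do not report which one. The function and parameter names are hypothetical.

```python
import math


def two_proportion_z_test(clicks_a, users_a, clicks_b, users_b):
    """Two-sided z-test for H0: both variants have the same click rate."""
    rate_a = clicks_a / users_a
    rate_b = clicks_b / users_b
    # Under H0 both variants share one rate, estimated from the pooled data.
    pooled = (clicks_a + clicks_b) / (users_a + users_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

For example, `two_proportion_z_test(100, 1000, 150, 1000)` compares a 10% click rate in the control variant with a 15% click rate in the treatment variant; a small p-value would lead to rejecting the hypothesis of equal click rates.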

Research question 2: How are A/B tests designed? What is the role of stakeholders in this process? The primary type of test is a single classic test, followed by multi-armed tests and sequences of tests. Engagement metrics are the dominating type of metrics used in A/B testing. Other prominent metrics include click, monetary, and performance metrics. Hypothesis testing for equality is by far the most commonly used statistical method in A/B testing. Remarkably, many of the studies that test for equality do not specify the concrete method they use. Stakeholders have two main roles in the design of tests: concept designer and experiment architect. Less frequently reported is a third role of setup technician.

4.4. RQ3: How are A/B tests executed? What is the role of stakeholders in this process?

4.4.1. Execution of A/B tests

To address the first part of RQ3 (How are A/B tests executed?), we analyze the data collected during A/B tests, the evaluation methods used, and the use of A/B tests.

Data collected. Table 12 lists the classes of data collected during the execution of A/B tests. We identified four types of data. Product or system data is most commonly reported in the primary studies (48 occurrences). This data class includes the type of browser used by the end-user, the operating system of the end-user, hardware-specific information of the device used to interact with the application, and general information related to usage of the system (e.g., tracking information about item categories of products in an e-commerce application, and types of search queries processed during the A/B test). Second most popular is user-centric data (26 occurrences). This class contains data related to how the end-user interacts with the system as well as personal information of end-users. Examples include scrolling characteristics of users on a web application, the navigation history of end-users, user feedback, and the age or current occupation of the end-user used during analysis. The third most commonly reported class is spatial-temporal data (20 occurrences), which groups data related to geographic location and time. Examples include timestamps of requests to an application, the creation date of accounts that take part in the A/B test, and spatial information such as the country and region of end-users. Lastly, a few primary studies report the use of secondary data (6 occurrences). Data in this class correspond to A/B metrics that do not serve as main evaluation metrics for A/B tests. Examples are the number of clicks or page views that are used for additional analysis after conducting the A/B tests.

Table 13: Evaluation method used in the primary studies.

Evaluation method	Number of occurrences
Empirical evaluation	100
Simulation based on real empirical data	26
Simulation	15
Illustrative example	10
Case study	5
Theoretical	1

Evaluation method. Table 13 summarizes the identified evaluation methods. The vast majority of primary studies provide results from an empirical evaluation (100 occurrences), i.e., executing A/B tests in live systems. A substantial number of studies use historical data from previously conducted A/B tests to simulate new A/B tests (26 occurrences), while a smaller number of studies (15 occurrences) use simulations without historical data as their evaluation method. Lastly, a few studies use illustrative examples (10 occurrences) or case studies (5 occurrences), and a single primary study provides a theoretical evaluation [121].

Use of test results. Table 14 lists the use of test results extracted from the primary studies. Use of test results refers to what stakeholders do with the obtained data and analyses of A/B tests, such as using the results to design additional A/B tests. As the table shows, the main usages of A/B test results are the selection and rollout of a feature (71 and 24 occurrences, respectively). A number of studies aim at validating the effectiveness of the A/B testing process itself (12 occurrences). The use of test results to trigger a subsequent A/B test seems not very well explored (4 occurrences).

4.4.2. Role of stakeholders

To address the second part of RQ3 (What is the role of stakeholders in this process?), we analyze the role of stakeholders in A/B test execution.

Roles of stakeholders. Table 15 lists the different roles of stakeholders in the A/B test execution that we extracted from the primary studies, with associated tasks, a description, and examples. We identified two main roles: experiment contributor (40 occurrences) and experiment assessor (37 occurrences). The role of experiment contributor consists of managing the A/B test execution. The role of experiment assessor consists of evaluating the A/B test results and potentially undertaking additional actions. The top task of the experiment contributor is experiment supervision (19 occurrences). The top task of the experiment assessor is experiment post-analysis (17 occurrences).

4.4.3. Cross analysis A/B test execution

We take a deeper look at two mappings of data items related to the execution of A/B tests: the use of test results with the tasks of stakeholders in the execution of A/B tests; and the evaluation method with the tasks of stakeholders in the execution of A/B tests.

Use of test results vs Tasks of stakeholders in the execution of A/B tests. The first mapping we analyze relates the use of test results to the tasks stakeholders undertake in the execution of A/B tests. The results are shown in Table 16. We highlight some key observations:

Experiment supervision is applied regardless of the use of test results. For feature rollout as a use of test results, the task of experiment supervision is often mentioned. Supervision is a key task in this context to ensure that the rollout happens in a hazard-free manner (i.e., no harm is caused to users) [165, 28].

Table 14: Use of test results gathered from test execution.
Use of test results	Description	Occur.
Feature selection	The results of the test are used to determine which variant presents an improvement to the application. Examples include selecting a new version of a ranking algorithm [125, 28] or a recommendation algorithm [65], and selecting a different visual design [8].	71
Feature rollout	The results of the test are used to determine if the rollout of a feature should be continued or halted, as for example outlined by practitioners at Microsoft [165, 49].	24
Continue feature development	The results of the test are used as a driving force for further feature development, e.g., fine-tuning newly proposed metrics based on periodicity patterns after obtaining promising results [45], and further developing personalization methods [6].	17
Subsequent A/B test design	The results of the test are used for future test design, for example suggesting alternative variants to test in future tests [96], and designing a new test to further test the quality of an A/B metric prediction model.	15
Validation effectiveness of testing process	The results of the test are used to demonstrate the effectiveness of the newly proposed or improved testing approach by the authors. Examples include evaluating a newly proposed counterfactual framework to run seller-side tests in two-sided marketplaces [77], and the validation of a new statistical methodology for continuous monitoring of tests [82].	12
Validation of a research question	A/B testing is used to validate a research question put forward by the authors. One example consists of investigating under which circumstances companies should pay for advertising in search engines [19].	10
Bug detection / fixing	The results of the test are used to detect potential bugs or validate bug fixes, e.g., probing for data quality issues in tests of ML models to uncover potential bugs [105].	5
Subsequent test execution	The results of the test are used to execute subsequent tests, e.g. using the results of tests to automatically determine which subsequent tests to execute [151].	4
Unspecified	The use of the test results was not specified in the study.	24

Table 15: Identified roles and concrete tasks of stakeholders in the execution of A/B tests.

Role	Task	Task description	Occ.
Experiment contributor (40)	Experiment supervision	Monitoring and closely following up on the execution of A/B tests [151, 43].	19
	Experiment alteration	Altering aspects of the test during its execution. Examples include adjusting the population assignment of the experiment [33], or adjusting the variants themselves [152].	12
	Experiment termination	Stopping tests when deemed necessary. Examples include manually stopping tests when sufficient data is collected [96], or stopping the experiment early when harm is observed [89].	9
Experiment assessor (37)	Experiment post-analysis	Various steps that are taken after analyzing the results of the test. Examples include double checking results from executed A/B tests [71], performing a deeper analysis of suspicious results [60], or performing bias reduction techniques on the retrieved data from the A/B tests [110].	17
	Experiment triggering	Starting the execution of (subsequent) tests [165, 22].	13
	Other	This category encompasses a few niche tasks, such as documenting the findings and learning from conducting the test [144], rerunning A/B tests [112], or incorporating user feedback in the analysis of the tests [106].	7

Table 16: Use of test results vs tasks of stakeholders in the experiment execution (“cont. feature dev.” is short for “continue feature development”, “val. eff.” is short for “validation of effectiveness”, and “val. of a RQ” is short for “validation of a research question”).

Use	Supervision	Post-analysis	Triggering	Alteration	Termination
Feature selection	8	11	6	8	4
Feature rollout	10	4	6	6	4
Cont. feature dev.	7	3	5	2	3
A/B test design	6	0	5	2	3
Val. eff. A/B testing	1	2	1	1	1
Val. of a RQ	1	1	0	1	1
Bug detection/fixing	4	0	3	2	2
A/B test execution	1	0	1	0	0

The task of experiment post-analysis is typically only reported for experiments that are fully complete (i.e., do not go through additional rounds of iteration). In the primary studies where the results of the tests are used for subsequent test design, no instances were identified where stakeholders take the task of performing post-analysis on the results of the experiments.
For subsequent test design, the task of experiment triggering is often mentioned. This is to be expected since the newly designed tests also need to be executed. Test termination is also mentioned often (e.g., terminating an experiment due to bad results [76]).
In the case of bug fixing and detection, stakeholders typically supervise experiments (either to detect possible bugs in the code or to ensure the bug fix is effective) [57], and trigger the experiments (i.e., launch an experiment explicitly to fix a known bug in the application) [105].

Evaluation method vs Tasks of stakeholders in the execution of A/B tests. In addition, we analyze the tasks stakeholders undertake during the execution of A/B tests across the evaluation methods. This mapping is shown in Table 17. We highlight a number of key takeaways:

All tasks that stakeholders undertake in the execution of tests are widely encountered in the case of empirical evaluation.
For the method of simulation based on real empirical data, the task of post-analysis is reported more often than any other task. An example is looking for outliers in the analysis of the results of tests, and using historical experiments to confirm its effectiveness [78].
Primary studies that use simulation as an evaluation method rarely specify the tasks stakeholders undertake in the execution of tests. We hypothesize that, since simulations allow for a more controlled way of conducting tests, the tasks stakeholders undertake after the design of tests are not pertinent.
The only stakeholder task reported for theoretical evaluation is experiment alteration (primary study [121]).

Table 17: Evaluation method vs tasks of stakeholders in the A/B test execution (“emp. sim.” short for “simulation based on real empirical data”, “ill.” short for “illustrative”).

Method	Supervision	Post-analysis	Triggering	Alteration	Termination	Other
Empirical	14	13	10	10	6	6
Emp. sim.	2	4	1	0	1	0
Simulation	1	1	1	0	0	0
Ill. example	2	0	1	2	2	1
Case study	0	0	0	0	0	0
Theoretical	0	0	0	1	0	0

Table 18: List of identified open problems.

Open problem category	Open problem sub-category	Number of occurrences
Evaluation-related	Extend the evaluation	21
Evaluation-related	Provide thorough analysis of approach	16
Evaluation-related	Other evaluation-related	34
Process-related	Add process guidelines	9
Process-related	Automate process	7
Quality-related	Enhance scalability	7
Quality-related	Enhance applicability	6

Research Question 3: How are A/B tests executed in the system? What is the role of stakeholders in this process? The main types of data collected during the test execution relate to the product/system, users, and spatial-temporal aspects. The dominating evaluation method used in A/B testing is empirical evaluation, but a relevant number of studies also use simulation. A/B test results are primarily used for feature selection, followed by feature rollout and continued feature development. (Automatic) subsequent test execution is only used marginally. The main reported roles of stakeholders in test execution are experiment contributor (with experiment supervision as main task) and experiment assessor (with experiment post-analysis as main task).

4.5. RQ4: What are the reported open research problems in the field of A/B testing?

To answer research question 4, we analyze the results of data item Open problems (D18).
Table 18 presents a categorization of open problems we have identified in the primary studies. For each category we devised concrete sub-categories of open problems. We elaborate on each type of open problem with illustrative examples.

First, we established three sub-categories of open problems that are related to the evaluation of the proposed approach: (1) extensions to the evaluation of the approach presented in the primary study, (2) a more thorough analysis of the approach presented in the primary study, and (3) other evaluation-related open problems in the primary study.

Extend the evaluation. Drutsa et al. [45] explore periodicity patterns in user engagement metrics, and their influence on engagement metrics in A/B tests. Moreover, the authors put forward new A/B metrics that take such periodicity patterns into account, resulting in more sensitive A/B test analysis. The authors evaluated the proposed metrics on historical A/B test data from Yandex, though they state that further evaluation of the approach could be carried out in different domains such as social networks, email services, and video/image hosting services. From a slightly different point of view, Barajas et al. [15] developed a technique to determine the causal effects of marketing campaigns on users, putting the focus on the campaign itself rather than only on the design of advertisement media. The authors put forward specific guidelines on randomizing and assigning users to advertising campaigns, and provide a technique to estimate the causal effect the campaigns have on the users under test. As a point of future work, the authors posit a different evaluation question concerning what would have happened if the technique had been applied to the whole population.

Provide thorough analysis of approach. An example of this category is mentioned by Peska and Vojtas [126]. The authors put forward a way of evaluating recommendation algorithms in small e-commerce applications both offline and online via A/B testing. The approach compares results of offline evaluation of recommendation algorithms with the results of online A/B testing of the algorithms. Moreover, the authors then used these data to build a prediction model that identifies promising recommendation algorithms more effectively thanks to the knowledge obtained from online A/B testing. As future work, the authors list that further work is necessary to verify the causality of an effect observed in the analysis of offline and online A/B testing data. In another primary study written by Madlberger and Jizdny [114], the authors perform an analysis of the impact of social media marketing on click-through rates and customer engagement. To accomplish this, they run multiple social media marketing campaigns using A/B testing, evaluating hypotheses related to the impact of visual and content aspects of advertisements on the click rates of end-users. As future research, the authors report that a more comprehensive investigation is necessary to ascertain why some hypotheses in the study have been rejected.

Other evaluation-related. An example of other evaluation-related open problems is laid out by Gruson et al. [72]. The authors propose a methodology based on counterfactual analysis to evaluate recommendation algorithms, leveraging both offline evaluation and online evaluation via A/B testing. The approach comprises A/B testing recommendations to a subset of the population, and using the results of these tests to de-bias offline evaluations of the recommendation algorithm based on historical data. With regard to open problems, the authors mention exploring additional metrics for the approach, as well as potential improvements that can be made to the estimators they use in the approach. Another example is specified by Ju et al. [83], who present an alternative to standard A/B testing with a static hypothesis test by putting forward a sequential test. Classically in A/B testing, the hypothesis of the test is tested after a fixed time and conclusions are made based on the final result. The sequential test put forward by the authors does not have a predetermined number of observations; rather, at multiple points during the experiment the test determines whether the hypothesis can be accepted, rejected, or if more observations are required. For future work, the authors wish to support

experiments in their approach, as well as to extend the procedure to data that follows a non-binomial distribution. In a final example, Gui et al. [73] study ways of dealing with interference of network effects in the results of A/B tests. One of the fundamental assumptions of A/B testing is that users are only affected by the A/B variant they are assigned to. However, network effects can undermine this assumption due to interaction between users in the population. The authors demonstrate the presence of network effects at LinkedIn, and propose an estimator for the average treatment effect that also takes potential network effects into account. As a line of future research, the authors want to investigate ways of enhancing the approach such that it can deal with more real-life phenomena.
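
To illustrate the idea of a sequential test that can accept, reject, or ask for more observations at each look at the data, the following minimal sketch implements Wald's sequential probability ratio test (SPRT) for a single stream of binary outcomes. This is a textbook procedure chosen by us for illustration and is not the procedure proposed in [83]; the hypothesized rates `p0` and `p1`, the error levels, and the function name are assumptions.

```python
import math


def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.20):
    """Wald's SPRT for H0: rate == p0 against H1: rate == p1 (binary outcomes)."""
    upper = math.log((1 - beta) / alpha)   # crossing it rejects H0
    lower = math.log(beta / (1 - alpha))   # crossing it accepts H0
    log_ratio = 0.0
    n = 0
    for n, outcome in enumerate(observations, start=1):
        # Add the log-likelihood ratio contribution of one observation.
        if outcome:
            log_ratio += math.log(p1 / p0)
        else:
            log_ratio += math.log((1 - p1) / (1 - p0))
        if log_ratio >= upper:
            return "reject H0", n
        if log_ratio <= lower:
            return "accept H0", n
    # Neither boundary was crossed: more observations are required.
    return "continue", n
```

The number of observations is not fixed in advance: the test stops as soon as the accumulated evidence crosses one of the two boundaries, and otherwise reports that more data is needed.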

Second, we established two sub-categories of open problems that are process-related: (1) guidelines for the A/B testing process, and (2) automation of aspects of the A/B testing process.

Add process guidelines. In an effort to provide more nuanced A/B testing guidelines in the e-commerce domain, Goswami et al. [71] discuss controlled experiments to make decisions in the context of e-commerce search. Considerations such as how to prioritize projects for A/B testing for smaller retailers and how to conduct A/B tests during holiday time are left as open questions. A different primary study covering the benefits of controlled experimentation at scale is presented by Fabijan et al. [57]. In this study, the authors present multiple examples of conducted A/B tests, and the corresponding lessons learned from these experiments. One of the listed open problems in the study relates to providing “guidance on detection of patterns between leading and lagging metrics”.

Automate process. Mattos et al. [118] present a step towards automated continuous experimentation. The authors put forward an architectural framework that accommodates the automated execution of A/B tests and automated generation of A/B variants. To validate the framework, an A/B test was conducted with a robot. One of the open challenges laid out in the study comprises the ability to automatically generate hypotheses for A/B tests based on the collected data. Duivesteijn et al. [49] present A&B testing, an approach that leverages exceptional model mining techniques to target A/B variants to subgroups in the population under test. As opposed to deploying the best-performing variant of the A/B test, the authors put forward running both variants (if ample resources are available) and targeting specific variants to individual users based on their inferred subgroups. One of the potential avenues for future research consists of the development of a framework that would enable automated personalization of websites supported by

testing.

Lastly, we established two sub-categories of open problems that are quality-related: (1) enhancing scalability of the proposed approach, and (2) enhancing the applicability of the approach.

Enhance scalability. One example of this is presented by Zhao et al. [179]. In order to obtain a causal explanation behind the results of A/B tests, the authors propose segmenting the population, and subsequently analyzing the results of the A/B test in individual segments. For future work, the authors mention developing a more scalable solution that integrates the approach into their existing experimentation platform. To address online experimentation specifically for cloud applications, Toslali et al. [153] introduce Jackpot, a system for online experimentation in the cloud. Jackpot supports multivariate A/B testing and ensures proper management of interactions in the cloud application during the execution of A/B tests. As an avenue for future work, the authors mention ways of dealing with the limited scalability of multivariate experimentation due to the number of potential experiments increasing exponentially with the number of elements to be tested.
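
The exponential growth that limits multivariate experimentation can be made concrete with a short sketch that enumerates every combination of element options. The code is only an illustration under our assumptions: it does not reflect the API of Jackpot, and the element names and options are hypothetical.

```python
from itertools import product


def enumerate_variants(elements):
    """Return every combination of options, one option per element under test."""
    names = list(elements)
    option_lists = [elements[name] for name in names]
    # The Cartesian product yields one variant per combination of options.
    return [dict(zip(names, combo)) for combo in product(*option_lists)]


# Hypothetical page elements, each with two options.
page_elements = {
    "button_color": ["blue", "green"],
    "headline": ["short", "long"],
    "layout": ["grid", "list"],
}
variants = enumerate_variants(page_elements)
```

With three elements of two options each, `variants` already contains 2 × 2 × 2 = 8 combinations; ten such elements would yield 1024, which illustrates why the available user population is quickly spread too thin.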

Enhance applicability. One such study explores A/B testing in the automotive industry [11]. The study addresses concerns relating to the limited sample sizes A/B tests obtain due to the limited number of participants that can take part in A/B tests in the industry. To overcome this hurdle, the authors provide specific guidelines for performing A/B testing and determining the assignment of users to either the control or treatment variant in the test. However, one limitation pertains to requiring pre-experimental data to ensure a balanced population assignment between both A/B variants. In an effort to increase sensitivity in A/B testing, Liou and Taylor [109] propose a new estimator for A/B testing that takes variance of individual users into account. To realize this, pre-experiment data of individual users is analyzed and variances are computed. In order to validate the approach, a sample of 100 previously conducted A/B tests was collected and analyzed using the new approach. A big limitation noted by the authors is that “a stronger assumption about the homogeneity of the treatment effect” is required in order for the approach to remain unbiased.
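
The intuition behind taking the variance of individual users into account can be sketched with plain inverse-variance weighting: users whose pre-experiment behavior is noisy contribute less to the estimate. The code below is a simplified illustration under our own assumptions and is not the estimator of [109]; the function names, the input format, and the small constant `eps` are hypothetical.

```python
def weighted_arm_mean(arm, eps=1e-9):
    """Mean outcome of one arm, weighting each user by 1 / pre-experiment variance.

    `arm` is a list of (outcome, pre_experiment_variance) pairs.
    """
    # eps avoids division by zero for users with zero pre-experiment variance.
    weights = [1.0 / (variance + eps) for _, variance in arm]
    weighted_sum = sum(w * outcome for w, (outcome, _) in zip(weights, arm))
    return weighted_sum / sum(weights)


def variance_weighted_effect(control, treatment):
    """Difference of inverse-variance weighted means (treatment minus control)."""
    return weighted_arm_mean(treatment) - weighted_arm_mean(control)
```

The sketch also shows why a homogeneity assumption is needed: if the treatment affects noisy users differently from stable users, down-weighting the noisy users shifts the estimate away from the population-wide average effect.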

Research Question 4: What are the reported open research problems in the field of A/B testing? The most commonly reported open problems are directly related to the proposed approach, in particular improving the approach, extending the approach, and providing a thorough analysis. Other less frequently reported open problems relate to the A/B testing process, in particular adding guidelines for the testing process and automating the process. Finally, a number of studies report open problems regarding quality properties, specifically enhancing scalability and applicability of the proposed approach.

Table 19: Research topics of the primary studies.

Topic	Number of occurrences	Primary studies
Application of A/B testing	51	[175, 123, 95, 102, 16, 15, 121, 171, 99, 148, 33, 70, 66, 52, 20, 174, 107, 63, 155, 150, 65, 163, 170, 27, 143, 2, 5, 135, 149, 7, 98, 147, 141, 173, 19, 26, 8, 114, 6, 122, 50, 97, 136, 125, 22, 124, 128, 159, 67, 3, 176]
Improving efficiency of A/B testing	20	[1, 28, 127, 164, 23, 85, 47, 39, 44, 86, 40, 45, 46, 109, 100, 83, 18, 78, 37, 64]
Beyond standard A/B testing	18	, 126, 29, 118, 75, 49, 117, 30, 151
Concrete A/B testing problems	17	[138, 73, 168, 105, 146, 43, 162, 14, 103, 111, 71, 24, 153, 137, 96, 101, 25]
Pitfalls and challenges of A/B testing	13	[91, 88, 54, 60, 167, 42, 169, 120, 11, 41, 110, 140, 90]
Experimentation frameworks and platforms	13	[144, 154, 106, 156, 108, 9, 131, 74, 21, 36, 179, 177, 152]
A/B testing at scale	9	[89, 160, 58, 165, 81, 157, 76, 57, 56]

5. Discussion

In this section, we discuss a number of additional insights we obtained. We start with the research topics studied by the primary studies. Next, we look at environments and tools used for A/B testing. Then we report a number of opportunities for future research. We conclude with a discussion of threats to validity of the study.

5.1. Research topics

During data extraction of the 141 primary studies, we noted the general subject matters of the primary studies and categorized the primary studies along 7 research topics. Table 19 summarizes these 7 topics. Note that studies share overlapping topics. We now briefly explain each category and provide a few examples from the primary studies.

5.1.1. Application of A/B testing

The main focus of the primary study is the use and application of A/B testing as an evaluation tool for the main subject matter of the study (e.g., evaluating a new recommendation algorithm, interface redesigns, etc.).

5.1.2. Improving the efficiency of A/B testing

This topic is about improving the process of A/B testing by exploring ways of improving sensitivity in A/B testing data [44, 127, 164, 85], investigating sequential testing techniques to stop A/B tests as soon as reasonable [83, 86, 1], proposing techniques to detect invalid A/B tests [28], and using extra data such as periodicity patterns in user behavior to improve A/B testing [45].

Table 20: Environments and tools used for A/B testing.

Environment	Number of occurrences
In-house experimentation system	20
Research tool or prototype	13
Commercial A/B testing tool	10
Commercial non A/B testing tool	7
User survey	1

5.1.3. Beyond standard A/B testing

This topic is about techniques that go beyond standard A/B testing, such as the use of new types of A/B metrics [166, 48, 112], the use of counterfactuals in the evaluation of A/B tests [77, 134], investigating ways of automating parts of the A/B testing process [139, 118, 117, 151], improving or altering the A/B testing process [79, 29], and investigating ways of combining offline and online A/B testing [72, 126].

5.1.4. Concrete A/B testing problems

This topic includes studies on A/B testing in specific domains and on specific types of A/B testing. Examples include A/B testing specifically in the e-commerce domain [96, 71], network A/B testing or A/B testing in marketplaces [103, 73, 24], A/B testing in the CPS domain with digital twins [43], or A/B testing for mobile applications [101, 168].

5.1.5. Pitfalls and challenges of A/B testing

This topic is about pitfalls related to conducting A/B testing [54, 88, 42, 41], or (particular domain-related) challenges related to A/B testing [169, 120, 110].

5.1.6. Experimentation frameworks and platforms

This topic covers papers that present an A/B testing platform [106, 152, 144], or a framework concerning aspects related to the A/B testing process, such as a framework for detecting data loss in A/B tests [74], a framework for the design of A/B tests [36], or a framework for personalization of A/B testing [154].

5.1.7. A/B testing at scale

Primary studies under this topic focus on conducting A/B testing at a large scale, e.g., considerations for conducting A/B testing at scale [81, 157, 76], process models or guidelines for A/B testing at scale [58, 165], or concrete scalable solutions such as a scalable statistical method for measuring quantile treatment effects for performance metrics in A/B tests [160].
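
To clarify what a quantile treatment effect on a performance metric measures, the sketch below computes the difference between the q-th quantiles of two variants and a naive percentile-bootstrap interval around it. It is a plain illustration that deliberately ignores scalability and is not the statistical method of [160]; the function names, the simple quantile definition, and the default parameters are our assumptions.

```python
import random


def quantile(values, q):
    """Empirical q-th quantile using a simple order-statistic definition."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]


def bootstrap_quantile_effect(control, treatment, q=0.9, n_resamples=1000, seed=0):
    """Quantile treatment effect with a 95% percentile-bootstrap interval."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    point = quantile(treatment, q) - quantile(control, q)
    diffs = []
    for _ in range(n_resamples):
        # Resample each variant with replacement and recompute the effect.
        c = rng.choices(control, k=len(control))
        t = rng.choices(treatment, k=len(treatment))
        diffs.append(quantile(t, q) - quantile(c, q))
    diffs.sort()
    cut = (n_resamples * 25) // 1000  # 2.5% of the resamples on each side
    return point, (diffs[cut], diffs[n_resamples - 1 - cut])
```

For a latency metric, a call with `q=0.9` estimates how much the treatment shifts the 90th percentile of latency, which can reveal regressions for the slowest requests that a comparison of means would hide.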

5.2. Environments and tools used for A/B testing

In addition to the research topics covered in the primary studies, we also analyze the environments and tools that were used to realize A/B testing, see Table 20.

The most commonly mentioned type of environment is an in-house experimentation system for A/B testing (20 occurrences), for instance dedicated environments developed by companies such as Microsoft [105], Google [152], eBay [157], and Etsy [83]. These environments broadly support executing A/B tests. Furthermore, some primary studies describe concrete features of the experimentation system that help design A/B tests, e.g., controlling for bias during the specification of A/B tests in Airbnb’s Experimentation Reporting Framework [100]. Next, we observe research tools and prototypes (13 occurrences). Examples include a tool to perform online cloud experimentation [153], a research prototype for A/B testing implemented in NodeJS [139], a tool for A/B testing with decision assistants [96], and a tool that enables automatic execution of multiple A/B tests [151]. The remaining environments we identified were commercial A/B testing tools (10 occurrences), e.g., Optimizely [122] and Google Analytics [22]; commercial tools not related to A/B testing (7 occurrences), e.g., Crazy Egg [22], a heatmapping tool used to design A/B variants, and Yahoo Gemini (an advertisement platform), used to test different advertising strategies [114]; and a user survey (1 occurrence) to determine which A/B variants to test by conducting a preliminary survey.

5.3. Research opportunities and future research directions

From our study, we propose a number of potential future research directions in the field of A/B testing. Concretely, we provide three lines of research: research on further improving the general process of A/B testing, research on automating aspects of A/B testing, and research on the adoption of proposed statistical methods in A/B testing.

5.3.1. Improving the A/B testing process

One future direction relates to the considerations to take into account when running many A/B tests at once [152]. Plenty of studies cover this topic by, e.g., discussing lessons learned from unexpected A/B test results that were caused by other A/B tests running in parallel [54], or manually checking for possible effects of running A/B tests by analyzing the deployed A/B tests in the system [157]. Yet, we did not encounter a study that puts forward a systematic approach to tackle this problem.
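To make the kind of manual check mentioned above concrete, the following is a minimal sketch of how one might estimate whether two concurrently running A/B tests interact. It is an illustration under stated assumptions, not a method taken from the primary studies: the function name `interaction_effect` and the record layout (two treatment flags plus one metric value per user) are hypothetical. The estimate is the difference between the effect of test A among users in the treatment arm of test B and the effect of test A among users in the control arm of test B; a value far from zero hints at interference.

```python
from statistics import mean


def interaction_effect(records):
    """Estimate the interaction between two concurrent A/B tests.

    records: iterable of (in_a, in_b, metric) tuples, where in_a and in_b
    are booleans marking treatment membership in test A and test B
    (hypothetical layout chosen for this sketch).
    """
    # Group metric values by the four combinations of arms.
    cells = {(a, b): [] for a in (False, True) for b in (False, True)}
    for in_a, in_b, metric in records:
        cells[(bool(in_a), bool(in_b))].append(metric)
    if any(not values for values in cells.values()):
        raise ValueError("every combination of arms needs at least one observation")

    cell_mean = {key: mean(values) for key, values in cells.items()}
    # Effect of test A when the user is in the control arm of test B.
    effect_a_without_b = cell_mean[(True, False)] - cell_mean[(False, False)]
    # Effect of test A when the user is in the treatment arm of test B.
    effect_a_with_b = cell_mean[(True, True)] - cell_mean[(False, True)]
    return effect_a_with_b - effect_a_without_b
```

A point estimate like this says nothing about statistical significance on its own; in practice it would have to be paired with an appropriate test or confidence interval.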

Another avenue for future research is about improving the sensitivity of A/B tests by, e.g., combining different sensitivity improvement techniques as pointed out by Drutsa et al. [44], enabling proactive prediction of user behavior in A/B tests based on historical data [45], and a deeper study of A/B test estimators to achieve better sensitivity as mentioned by Poyarkov et al. [127].
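As an illustration of what a sensitivity improvement based on historical data can look like, the sketch below applies a covariate adjustment in the style of CUPED, a widely used variance reduction technique. This technique is swapped in for illustration and is not claimed to be the approach of the studies cited above; the function name `cuped_adjust` and its inputs are hypothetical. Each user's metric is corrected by a pre-experiment measurement of the same user, which leaves the mean unchanged while reducing variance when the two are correlated.

```python
from statistics import mean, variance


def cuped_adjust(metric, pre_metric):
    """Return covariate-adjusted metric values (CUPED-style sketch).

    metric:     in-experiment values, one per user
    pre_metric: pre-experiment values for the same users, same order
    In practice theta is estimated on the pooled data of both variants
    and then applied to each variant; this sketch handles one list.
    """
    if len(metric) != len(pre_metric):
        raise ValueError("metric and pre_metric must have the same length")
    if len(metric) < 2:
        raise ValueError("at least two observations are required")

    x_bar, y_bar = mean(pre_metric), mean(metric)
    var_x = variance(pre_metric)
    if var_x == 0:
        # A constant covariate carries no information; nothing to adjust.
        return list(metric)

    # Sample covariance between the covariate and the metric.
    cov_xy = sum((x - x_bar) * (y - y_bar)
                 for x, y in zip(pre_metric, metric)) / (len(metric) - 1)
    theta = cov_xy / var_x
    # Subtract the part of the metric explained by the covariate.
    return [y - theta * (x - x_bar) for x, y in zip(pre_metric, metric)]
```

The lower variance of the adjusted values translates into narrower confidence intervals for the same sample size, which is what sensitivity improvement aims for.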

The last avenue for future research in improving the A/B testing process relates to providing further guidelines and design principles for choosing and engineering A/B metrics. We highlight two primary studies that mention open problems related to this opportunity: Kharitonov et al. [85] put forward learning sensitive combinations of A/B metrics as a general open problem, and Duan et al. [48] discuss investigating the dynamics between surrogate metrics and the actual underlying metric.

5.3.2. Automation

In an effort to establish continuous experimentation, multiple studies put forward steps companies can take to develop an experimentation culture, e.g., [58, 169, 55]. In light of expanding this experimentation culture, (partial) automation of the A/B testing process is essential to enable and empower continuous experimentation [28, 71]. Initial research on automating steps in the A/B testing process has been conducted, as for example presented by Tamburrelli et al. [151] and Mattos et al. [118], see Sections 5.1.3 and 4.5.2. Yet the present state of research on this topic suggests that further investigation and more in-depth solutions are necessary to fully exploit automated design and execution of A/B tests. Additionally, a number of open problems remain whose resolution could facilitate and enable automated experimentation, e.g., determining which A/B tests to prioritize at execution [71], and automatically generating insights into the rationale and cause of experiment results for experiment developers to guide product development [169].

5.3.3. Adoption and tailoring statistical methods

Even though a number of primary studies discuss bootstrapping as a technique to evaluate the results of A/B tests [154, 2, 71], bootstrapping remains largely unexplored in A/B testing, despite the fact that this statistical method has the potential to improve the analysis of A/B test results [81, 14]. Moreover, bootstrapping can be an invaluable tool to provide statistical insights into the results of the tests that could, e.g., not be obtained by a standard equality testing method [51]. However, one big downside of bootstrapping is that it is computationally expensive [110]. Alongside adoption of known statistical methods, designing and tailoring new statistical methods to accommodate particular experimentation scenarios presents an interesting research direction. One example is mentioned by Kharitonov et al. [86], who put forward designing a custom statistical test for non-binomial A/B metrics. Another example concerns taking into account “the effects from multiple treatments with various metrics of interest” to tailor the approach presented by Tu et al. [154] for optimal treatment assignments in A/B testing by leveraging causal effect estimations.
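For readers unfamiliar with bootstrapping, the following minimal sketch shows a percentile bootstrap confidence interval for the difference in means between two variants. It is a generic illustration, not the procedure of any of the cited studies; the function name `bootstrap_diff_ci` and its default parameters are assumptions made for this example. The loop also makes the computational cost visible: every resample touches all observations, so the work grows with the number of resamples times the sample size.

```python
import random
from statistics import mean


def bootstrap_diff_ci(control, treatment, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(treatment) - mean(control).

    control, treatment: non-empty lists of metric values per user.
    Returns (lower, upper) bounds of the (1 - alpha) interval.
    """
    if not control or not treatment:
        raise ValueError("both variants need at least one observation")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible

    diffs = []
    for _ in range(n_resamples):
        # Resample each variant with replacement, keeping its size.
        c = rng.choices(control, k=len(control))
        t = rng.choices(treatment, k=len(treatment))
        diffs.append(mean(t) - mean(c))
    diffs.sort()

    # Pick the empirical alpha/2 and 1 - alpha/2 quantiles, clamped to valid indices.
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return diffs[lo_idx], diffs[hi_idx]
```

If the resulting interval excludes zero, the observed difference is unlikely to be explained by sampling noise alone at the chosen level. Because the approach makes no distributional assumption, the same loop works for other statistics, such as medians or quantiles, by swapping out `mean`.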

Besides a limited number of primary studies employing bootstrapping in the analysis of A/B tests, a significant number of studies mention statistically significant results or p-values in the analysis of conducted A/B tests without specifying the concrete statistical test used (37 occurrences). Moreover, a considerable number of studies do not report anything related to statistical analysis (47 occurrences). We argue that this information is important to report in research publications, and urge authors to specify the concrete statistical methods used to obtain the results in the studies.
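As a small illustration of what naming the concrete statistical test can look like, the sketch below runs a pooled two-proportion z-test on conversion counts and returns both the test statistic and the two-sided p-value, so that a report can state the test by name next to the numbers. The choice of this particular test and the function name `two_proportion_z_test` are assumptions made for the example; other tests may suit other metrics better.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided pooled two-proportion z-test.

    conv_a, conv_b: number of conversions in variants A and B
    n_a, n_b:       number of users exposed to variants A and B
    Returns (z, p_value).
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("sample sizes must be positive")
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled conversion rate under the null hypothesis of equal rates.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; rates are degenerate")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


z, p = two_proportion_z_test(100, 1000, 150, 1000)
print(f"Two-proportion z-test (pooled, two-sided): z = {z:.2f}, p = {p:.4f}")
```

Reporting the test name, its sidedness, and the sample sizes together with the p-value is what allows readers to reproduce or challenge the analysis.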

5.4. Threats to validity

In this section we list potential threats to the validity of the systematic literature review [10].

5.4.1. Internal validity

Internal validity refers to the extent to which a causal conclusion based on a study is warranted. One threat to the internal validity is a potential bias of researchers that perform the SLR, which may have an effect on the data collection and the insights derived in the study. In order to mitigate this threat, we involved multiple researchers in the study. Multiple researchers were responsible for selecting papers, extracting data and analyzing results. In each step, cross-checking was applied to minimize bias. Extra researchers were involved if no consensus could be found. Additionally, we defined a rigid protocol for the systematic literature review.

5.4.2. External validity

External validity refers to the extent to which the findings of the study can be generalized to the general field of A/B testing. A threat to the external validity of this systematic literature review is that not all relevant works are covered. To mitigate this threat, we searched all main digital library sources that publish work in computer science. Secondly, we defined the search string by including all commonly used terms for A/B testing to ensure proper retrieval of relevant works. Lastly, we also applied snowballing on the selected papers from the automatic search query to uncover additional works that might have been missed.

5.4.3. Conclusion validity

Conclusion validity refers to the extent to which we obtained the right measure and whether we defined the right scope in relation to what is considered research in the field of A/B testing. One threat to the conclusion validity is the quality of the selected studies; studies of lower quality might produce insights that are not justified or applicable to the general field of A/B testing. In order to mitigate this threat, we excluded short papers, demo papers, and roadmap papers from the study. Furthermore, we evaluated a quality score for each selected paper. Papers with a quality score

were excluded from the study.

5.4.4. Reliability

Reliability refers to the extent to which this work is reproducible if the study were conducted again. To mitigate this threat, we make all the collected and processed data available online. We also defined a specific search string, a list of online sources, and other specific details in the research protocol to ensure reproducibility. Bias of researchers also poses a threat here, as it influences whether similar results would be obtained if the systematic literature review were conducted again with a different set of reviewers.

6. Conclusion

A/B testing supports data-driven decisions about the adoption of features. It is widely used across different industries and key technology companies such as Google, Meta, and Microsoft. In this systematic literature review, we identified the subjects of A/B tests, how A/B tests are designed and executed, and the reported open research problems in the literature. We observed that algorithms, visual elements, and changes to a workflow or process are most commonly tested, with web, search engine, and e-commerce being the most popular application domains for A/B testing. Concerning the design of A/B tests, classic A/B tests with two variants are most commonly used, alongside engagement metrics such as conversion rate or number of impressions as metrics to gauge the potential of the A/B variants. Hypothesis tests for equality testing are broadly utilized to analyze A/B test results, and bootstrapping also garners interest in a few primary studies. We devised three roles stakeholders take on in the design of A/B tests: Concept designer, Experiment architect, and Setup technician. Regarding the execution of A/B tests, empirical evaluation is the leading evaluation method. Besides the main A/B metrics, data concerning the product or system and user-centric data are collected the most to conduct deeper analysis of the results of the A/B tests. A/B testing is most commonly used to determine and deploy the better performing A/B variant, or to gradually roll out a feature. Lastly, we devised two roles stakeholders take on in the execution of A/B tests: Experiment contributor and Experiment assessor.

We identified seven categories of open problems: improving proposed approaches, extending the evaluation of the proposed approach, providing thorough analysis of the proposed approach, adding A/B testing process guidelines, automating the A/B testing process, enhancing scalability, and enhancing applicability. Leveraging these categories and observations made during the analysis, we provide three main lines of interesting research opportunities: developing more in-depth solutions to automate stages of the A/B testing process; presenting improvements to the A/B testing process by examining promising avenues for sensitivity improvement, systematic solutions to deal with interference of many A/B tests running at once, and providing guidelines and design principles to choose and engineer A/B metrics; and lastly the adoption and tailoring of more sophisticated statistical methods such as bootstrapping to strengthen the analysis of A/B testing further.

Acknowledgement

We thank Michiel Provoost for his support to this study.

References


Appendix A. List of primary studies

Table A.21: List of primary studies.

Paper ID	Reference	Title
1	[1]	A Nonparametric Sequential Test for Online Randomized Experiments
2	[138]	Detecting Network Effects: Randomizing Over Randomized Experiments
3	[91]	Unexpected Results in Online Controlled Experiments
4	[28]	How A/B Tests Could Go Wrong: Automatic Diagnosis of Invalid Online Experiments
5	[127]	Boosted Decision Tree Regression Adjustment for Variance Reduction in Online Controlled Experiments
6	[88]	Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained
7	[89]	Online Controlled Experiments at Large Scale
8	[164]	Non-Stationary A/B Tests
9	[73]	Network A/B Testing: From Sampling to Estimation
10	[166]	False Discovery Rate Controlled Heterogeneous Treatment Effect Detection for Online Controlled Experiments
11	[54]	Experimentation Pitfalls to Avoid in A/B Testing for Online Personalization
12	[38]	Statistical Inference in Two-Stage Online Controlled Experiments with Treatment Selection and Validation
13	[23]	Consistent Transformation of Ratio Metrics for Efficient Online Controlled Experiments
14	[160]	CONQ: CONtinuous Quantile Treatment Effects for Large-Scale Online Controlled Experiments
15	[60]	Diagnosing Sample Ratio Mismatch in Online Controlled Experiments: A Taxonomy and Rules of Thumb for Practitioners
16	[144]	IPEAD A/B Test Execution Framework
17	[82]	Peeking at A/B Tests: Why It Matters, and What to Do about It
18	[85]	Learning Sensitive Combinations of A/B Test Metrics
19	[47]	Practical Aspects of Sensitivity in Online Experimentation with User Engagement Metrics
20	[168]	Evaluating Mobile Apps with A/B and Quasi A/B Tests
21	[39]	On Post-Selection Inference in A/B Testing
22	[48]	Online Experimentation with Surrogate Metrics: Guidelines and a Case Study
23	[44]	Future User Engagement Prediction and Its Application to Improve the Sensitivity of Online Experiments
24	[58]	The Evolution of Continuous Experimentation in Software Product Development: From Data to a Data-Driven Organization at Scale
25	[167]	How to Measure Your App: A Couple of Pitfalls and Remedies in Measuring App Performance in Online Controlled Experiments
26	[86]	Sequential Testing for Early Stopping of Online Experiments
27	[40]	Shrinkage Estimators in Online Experiments
28	[77]	A Counterfactual Framework for Seller-Side A/B Testing on Marketplaces
29	[45]	Periodicity in User Engagement with a Search Engine and Its Application to Online Controlled Experiments
30	[105]	Evolving Software to be ML-Driven Utilizing Real-World A/B Testing: Experiences, Insights, Challenges
31	[46]	Using the Delay in a Treatment Effect to Improve Sensitivity and Preserve Directionality of Engagement Metrics in A/B Experiments
32	[79]	A Cluster-Based Nearest Neighbor Matching Algorithm for Enhanced A/A Validation in Online Experimentation
33	[109]	Variance-Weighted Estimators to Improve Sensitivity in Online Experiments
34		A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments
35	[100]	Winner’s Curse: Bias Estimation for Total Effects of Features in Online Controlled Experiments
36	[169]	From Infrastructure to Culture: A/B Testing Challenges in Large Scale Social Networks
37	[83]	A Sequential Test for Selecting the Better Variant: Online A/B Testing, Adaptive Allocation, and Continuous Monitoring
38	[146]	Unbiased Experiments in Congested Networks
39	[154]	Personalized Treatment Selection Using Causal Heterogeneity
40	[112]	Beyond Success Rate: Utility as a Search Quality Metric for Online Experiments
41	[175]	Algorithms and System Architecture for Immediate Personalized News Recommendations
42	[106]	Experimentation in the Operating System: The Windows Experimentation Platform
43		AB4Web: An On-Line A/B Tester for Comparing User Interface Design Alternatives
44	[123]	Real-World Product Deployment of Adaptive Push Notification Scheduling on Smartphones
45	[95]	Mining the Stars: Learning Quality Ratings with User-Facing Explanations for Vacation Rentals
46	[43]	Towards Digital Twin-Enabled DevOps for CPS Providing Architecture-Based Service Adaptation & Verification at Runtime
47		Adaptive Experimentation with Delayed Binary Feedback
48	[108]	Unifying Offline Causal Inference and Online Bandit Learning for Data Driven Decision
49	[9]	Beyond Data: From User Information to Business Value through Personalized Recommendations and Consumer Science
50	[102]	Learning to Bundle Proactively for On-Demand Meal Delivery
51		Measuring Dynamic Effects of Display Advertising in the Absence of User Tracking Information
52	[15]	Marketing Campaign Evaluation in Targeted Display Advertising
53	[121]	Whole Page Optimization: How Page Elements Interact with the Position Auction
54	[131]	The MOOClet Framework: Unifying Experimentation, Dynamic Improvement, and Personalization in Online Courses
55	[171]	Split-Treatment Analysis to Rank Heterogeneous Causal Effects for Prospective Interventions
56	[99]	Promoting Positive Post-Click Experience for In-Stream Yahoo Gemini Users
57	[134]	Predicting Counterfactuals from Large Historical Data and Small Randomized Trials
58	[148]	Multi-Source Pointer Network for Product Title Summarization
59	[33]	Beyond Relevance Ranking: A General Graph Matching Framework for Utility-Oriented Learning to Rank
60	[72]	Offline Evaluation to Make Decisions About Playlist Recommendation Algorithms
61	[74]	Trustworthy Experimentation Under Telemetry Loss
62	[70]	The Netflix Recommender System: Algorithms, Business Value, and Innovation
63	[139]	Bifrost: Supporting Continuous Deployment with Automated Enactment of Multi-Phase Live Testing Strategies
64	[66]	CompactETA: A Fast Inference System for Travel Time Prediction
65	[14]	Design and Analysis of Benchmarking Experiments for Distributed Internet Services
66	[52]	Learning to Rank in the Position Based Model with Bandit Feedback
67	[20]	VisRel: Media Search at Scale
68	[174]	Behavioral Consequences of Reminder Emails on Students’ Academic Performance: A Real-World Deployment
69	[107]	Content Recommendation by Noise Contrastive Transfer Learning of Feature Representation
70	[63]	External Evaluation of Ranking Models under Extreme Position-Bias
71	[155]	Tackling Cannibalization Problems for Online Advertisement
72	[150]	Filling Context-Ad Vocabulary Gaps with Click Logs
73	[65]	Practical Lessons from Developing a Large-Scale Recommender System at Zalando
74	[163]	How Airbnb Tells You Will Enjoy Sunset Sailing in Barcelona? Recommendation in a Two-Sided Travel Marketplace
75	[170]	Modeling Professional Similarity by Mining Professional Career Trajectories
76	[27]	Social Incentive Optimization in Online Social Networks
77	[143]	Ad Close Mitigation for Improved User Experience in Native Advertisements
78	[126]	Off-Line vs. On-Line Evaluation of Recommender Systems in Small E-Commerce
79	[2]	LASER: A Scalable Response Prediction Platform for Online Advertising
80	[5]	The Role of Relevance in Sponsored Search
81	[135]	Contextual Bandit Applications in a Customer Support Bot
82	[165]	Safe Velocity: A Practical Guide to Software Deployment at Scale using Controlled Rollout
83	[149]	When Relevance is Not Enough: Promoting Diversity and Freshness in Personalized Question Recommendation
84	[103]	Interference, Bias, and Variance in Two-Sided Marketplace Experimentation: Guidance for Platforms
85	[120]	Automotive A/B testing: Challenges and Lessons Learned from Practice
86	[21]	A/B Testing at SweetIM: The Importance of Proper Statistical Analysis
87	[36]	A Framework Model to Support A/B Tests at the Class and Component Level
88	[81]	Statistical Reasoning of Zero-Inflated Right-Skewed User-Generated Big Data A/B Testing
89	[111]	Size matters? Or not: A/B testing with limited sample in automotive embedded software
90	[157]	Scalable Data Reporting Platform for A/B Tests
91	[18]	Applying Bayesian parameter estimation to A/B tests in e-business applications examining the impact of green marketing signals in sponsored search advertising
92	[7]	Experiment-driven improvements in Human-in-the-loop Machine Learning Annotation via significance-based A/B testing
93	[76]	The Anatomy of a Large-Scale Experimentation Platform
94	[11]	Demystifying dark matter for online experimentation
95	[71]	Controlled experiments for decision-making in e-Commerce search
96	[98]	Evaluating usability of a web application: A comparative analysis of open-source tools
97	[29]	Faster online experimentation by eliminating traditional A/A validation
98	[41]	Pitfalls of long-term online controlled experiments
99	[147]	SoftSKU: Optimizing Server Architectures for Microservice Diversity @Scale
100	[57]	The Benefits of Controlled Experimentation at Scale
101	[141]	Context Adaptation for Smart Recommender Systems
102	[173]	Whales, Dolphins, or Minnows? Towards the Player Clustering in Free Online Games Based on Purchasing Behavior via Data Mining Technique
103	[110]	Enterprise-Level Controlled Experiments at Scale: Challenges and Solutions
104	[19]	Should companies bid on their own brand in sponsored search?
105	[78]	A Probabilistic, Mechanism-Indepedent Outlier Detection Method for Online Experimentation
106	[179]	Inform Product Change through Experimentation with Data-Driven Behavioral Segmentation
107	[118]	Your System Gets Better Every Day You Use It: Towards Automated Continuous Experimentation
108	[26]	Fashion Recommendation Systems, Models and Methods: A Review
109	[8]	Subject Line Personalization Techniques and Their Influence in the E-Mail Marketing Open Rate
110	[114]	Impact of promotional social media content on click-through rate – Evidence from a FMCG company
111	[6]	Related Entity Expansion and Ranking Using Knowledge Graph
112	[24]	LinkLouvain: Link-Aware A/B Testing and Its Application on Online Marketing Campaign
113	[122]	Ascend by Evolv: Artificial intelligence-based massively multivariate conversion rate optimization
114	[177]	A new framework for online testing of heterogeneous treatment effect
115	[153]	JACKPOT: Online experimentation of cloud microservices
116	[75]	Digital Marketing Effectiveness Using Incrementality
117	[137]	Business process improvement with the AB-BPM methodology
118	[50]	A genetic algorithm for finding a small and diverse set of recent news stories on a given subject: How we generate AAAI’s AI-Alert
119	[97]	Measuring the value of recommendation links on product demand
120	[56]	Experimentation growth: Evolving trustworthy A/B testing capabilities in online software companies
121	[140]	Online Evaluation of Bid Prediction Models in a Large-Scale Computational Advertising Platform: Decision Making and Insights
122	[136]	AB-BPM: Performance-driven instance routing for business process improvement
123	[49]	Have It Both Ways-From A/B Testing to A&B Testing with Exceptional Model Mining
124	[117]	More for Less: Automated Experimentation in Software-Intensive Systems
125	[30]	Regression Tree for Bandits Models in A/B Testing
126	[125]	When the Crowd is Not Enough: Improving User Experience with Social Media through Automatic Quality Analysis
127	[22]	Pixel efficiency analysis: A quantitative web analytics approach
128	[96]	A/B Testing in E-commerce Sales Processes
129	[124]	A Method for the Construction of User Targeting Knowledge for B2B Industry Website
130	[128]	Validating Mobile Designs with Agile Testing in China: Based on Baidu Map for Mobile
131	[159]	User Latent Preference Model for Better Downside Management in Recommender Systems
132	[151]	Towards Automated A/B Testing
133	[90]	Seven rules of thumb for web site experimenters
134	[101]	Enabling A/B Testing of Native Mobile Applications by Remote User Interface Exchange
135	[152]	Overlapping Experiment Infrastructure: More, Better, Faster Experimentation
136	[67]	Optimizing price levels in e-commerce applications: An empirical study
137	[25]	Facilitating Controlled Tests of Website Design Changes: A Systematic Approach
138	[3]	Soft Frequency Capping for Improved Ad Click Prediction in Yahoo Gemini Native
139	[37]	Objective Bayesian Two Sample Hypothesis Testing for Online Controlled Experiments
140	[64]	Test & Roll: Profit-Maximizing A/B Tests
141	[176]	Improving Library User Experience with A/B Testing: Principles and Process

*Corresponding author
Email addresses: federico.quin@kuleuven.be (Federico Quin), danny.weyns@kuleuven.be (Danny Weyns), matthias.galster@canterbury.ac.nz (Matthias Galster), camila.costasilva@pg.canterbury.ac.nz (Camila Costa Silva)
We use the term “research paper” to refer to papers that we considered for the application of inclusion and exclusion criteria in the SLR, and the term “primary study” for the research papers that we selected for data extraction.
Papers published in the Lecture Notes in Computer Science format with pages are also considered short.
Academic refers to affiliations that are eligible to graduate master and/or PhD students.
We distinguish data retrieved from empirical evaluation in a live system from data retrieved from simulation or an illustrative example to provide targeted insights into the execution of A/B tests during data analysis of the SLR.
We excluded experiments and corresponding metrics of primary studies that analyzed a large number of previously conducted A/B tests.
A conversion is a desired action taken in the A/B test.
Note that some of the primary studies do not explicitly specify the metrics due to business sensitivity. Based on the available information in the study, we have included these in general engagement metrics.
Good clicks are described as clicks that are meaningful during the search query session [20].
However, these studies did report p-values alongside the results, or explicitly refer to confidence intervals and statistically significant results of the A/B tests.
See data item target in Section 4.2 for specific references.
Invalid refers to badly designed experiments or misinterpretation of the results retrieved from the experiment.
Counterfactual analysis provides answers to the cause and effect of the treatment group and their corresponding outcomes, compared to what would have happened if the treatment would not have been applied.
Or alternatively an explicit mention of lack of statistical methods used.
