ABSTRACT
Online controlled experiments, or A/B tests, have become a standard framework adopted by most online product companies to measure the effect of any new change. Companies use statistical methods such as hypothesis testing and statistical inference to quantify the business impact of these changes and make product decisions. Modern experimentation platforms can run hundreds of experiments or more concurrently. When a group of experiments is conducted, usually the ones with significant, successful results are chosen to be launched into the product. We are interested in estimating the aggregated impact of the launched features. In this paper, we investigate a statistical selection bias in this process and propose a correction method that yields an unbiased estimator. Moreover, we give an implementation example on Airbnb's Experiment Reporting Framework (ERF) and discuss best practices for accounting for this bias.
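The selection bias described above is easy to reproduce in simulation. The sketch below (not the paper's correction method; all parameter values, such as the true lift, standard error, and significance threshold, are illustrative assumptions) shows that summing the estimated effects of only the significantly positive experiments systematically overstates the true aggregate impact of the launched features.

```python
# Minimal sketch of the winner's curse in aggregated launch decisions.
# Assumptions: every experiment has the same small true lift and the same
# standard error, and an experiment is "launched" when its estimate is
# significantly positive at the two-sided 5% level.
import numpy as np

rng = np.random.default_rng(0)
n_experiments = 1000
true_effect = 0.2   # assumed true lift per experiment
se = 0.5            # assumed standard error of each estimate
z_crit = 1.96       # two-sided 5% significance threshold

# Observed effect estimates: true effect plus sampling noise.
estimates = true_effect + rng.normal(0.0, se, size=n_experiments)

# Launch rule: keep experiments whose estimate is significantly positive.
launched = estimates / se > z_crit

naive_total = estimates[launched].sum()        # what gets reported
actual_total = true_effect * launched.sum()    # true impact of launched features

print(f"launched: {launched.sum()} of {n_experiments}")
print(f"naive estimated total effect: {naive_total:.2f}")
print(f"true total effect of launched features: {actual_total:.2f}")
# The naive total is systematically larger: conditioning on significance
# selects estimates whose sampling noise happens to be positive.
```

In this toy setting the gap between the naive total and the true total is exactly the bias a correction method must remove; the magnitude of the gap depends on the assumed effect sizes and standard errors.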