From Infrastructure to Culture: A/B Testing Challenges in Large Scale Social Networks

ABSTRACT
A/B testing, also known as bucket testing, split testing, or controlled experimentation, is a standard way to evaluate user engagement with or satisfaction from a new service, feature, or product. It is widely used by websites, including social networks such as Facebook, LinkedIn, and Twitter, to make data-driven decisions. At LinkedIn, we have seen tremendous growth of controlled experiments over time, now with over 400 concurrent experiments running per day. General A/B testing frameworks and methodologies, including challenges and pitfalls, have been discussed extensively in several previous KDD papers [7, 8, 9, 10]. In this paper, we describe in depth the experimentation platform we have built at LinkedIn and the challenges that arise particularly when running A/B tests at large scale in a social network setting. We start with an introduction to the experimentation platform and how it is built to handle each step of the A/B testing process at LinkedIn, from designing and deploying experiments to analyzing them. We then discuss several more sophisticated A/B testing scenarios, such as running offline experiments and addressing the network effect, where one user's action can influence that of another. Lastly, we talk about features and processes that are crucial for building a strong experimentation culture.
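As a concrete illustration of the analysis step described above (a sketch for exposition, not the paper's own implementation), the core of evaluating an A/B test is comparing a metric between the treatment and control buckets and deciding whether the observed lift is statistically significant. The following minimal example runs a Welch-style two-sample test with a normal approximation to the null distribution; the per-user click data is hypothetical.

```python
import math
from statistics import mean, variance

def ab_test(control, treatment):
    """Two-sample (Welch-style) test for a difference in mean metric value.

    Returns (lift, z, p): the absolute difference in means, the test
    statistic, and a two-sided p-value from a normal approximation,
    which is reasonable at the sample sizes typical of online tests.
    """
    n_c, n_t = len(control), len(treatment)
    lift = mean(treatment) - mean(control)
    # Standard error of the difference, using each bucket's sample variance.
    se = math.sqrt(variance(control) / n_c + variance(treatment) / n_t)
    z = lift / se
    # Two-sided p-value via the standard normal CDF, Phi(x) = (1 + erf(x/sqrt(2))) / 2.
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return lift, z, p

# Hypothetical per-user clicks (1 = clicked) in each bucket.
control = [0, 1, 0, 0, 1, 0, 1, 0, 0, 0]
treatment = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
lift, z, p = ab_test(control, treatment)
```

At the traffic volumes the paper describes (hundreds of concurrent experiments), a production platform would also handle variance reduction, multiple-testing corrections, and ramping, which this sketch omits.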
References
1. Rubin, Donald B. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688, 1974.
2. Ugander, Johan, Karrer, Brian, Backstrom, Lars and Kleinberg, Jon. Graph cluster randomization: network exposure to multiple universes. Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 329--337. ACM, 2013.
3. Katzir, Liran, Liberty, Edo and Somekh, Oren. Framework and algorithms for network bucket testing. Proceedings of the 21st international conference on World Wide Web, pages 1029--1036. ACM, 2012.
4. Toulis, Panos and Kao, Edward. Estimation of causal peer influence effects. Proceedings of the 30th International Conference on Machine Learning, pages 1489--1497, 2013.
5. Eckles, Dean, Karrer, Brian and Ugander, Johan. Design and analysis of experiments in networks: Reducing bias from interference. arXiv preprint arXiv:1404.7530, 2014.
6. Aronow, Peter M. and Samii, Cyrus. Estimating average causal effects under general interference. arXiv preprint arXiv:1305.6156, 2013.
7. Kohavi, Ron, et al. Trustworthy online controlled experiments: Five puzzling outcomes explained. Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, 2012. www.exp-platform.com/Pages/PuzzingOutcomesExplained.aspx.
8. Tang, Diane, et al. Overlapping experiment infrastructure: More, better, faster experimentation. Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, 2010.
9. Kohavi, Ron, et al. Online controlled experiments at large scale. Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, 2013. http://bit.ly/ExPScale.
10. Kohavi, Ron, et al. Seven rules of thumb for web site experimenters. Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 2014.
11. Yates, Frank. Sir Ronald Fisher and the design of experiments. Biometrics, 20(2):307--321, 1964.
12. Bakshy, Eytan, Eckles, Dean and Bernstein, Michael S. Designing and deploying online field experiments. Proceedings of the 23rd international conference on World Wide Web, pages 283--292. ACM, 2014.
13. Kohavi, Ron, et al. Controlled experiments on the web: survey and practical guide. Data Mining and Knowledge Discovery, 18(1):140--181, February 2009. http://www.exp-platform.com/Pages/hippo_long.aspx.
14. Crook, Thomas, et al. Seven pitfalls to avoid when running controlled experiments on the web. Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1105--1114, 2009. http://www.exp-platform.com/Pages/ExPpitfalls.aspx.
15. Ioannidis, John P. A. Why most published research findings are false. PLoS Medicine, 2(8):e124, 2005.
16. Wacholder, Sholom, et al. Assessing the probability that a positive report is false: an approach for molecular epidemiology studies. Journal of the National Cancer Institute, 96(6):434--442, 2004.
17. Benjamini, Yoav and Hochberg, Yosef. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological), pages 289--300, 1995.
18. Saaty, Thomas L. How to make a decision: the analytic hierarchy process. European Journal of Operational Research, 48(1):9--26, 1990.
19. Gui, Huan, Xu, Ya, Bhasin, Anmol and Han, Jiawei. Network A/B testing: From sampling to estimation. Proceedings of the 24th international conference on World Wide Web. ACM, 2015.
20. Box, George E. P., Hunter, J. Stuart and Hunter, William G. Statistics for Experimenters: Design, Innovation, and Discovery. Wiley, 2005.
21. Gerber, A. S. and Green, D. P. Field Experiments: Design, Analysis, and Interpretation. W. W. Norton, 2012.
22. Sumbaly, Roshan, et al. Serving large-scale batch computed data with Project Voldemort. Proceedings of the 10th USENIX conference on File and Storage Technologies. USENIX Association, 2012.
23. Tate, Ryan. The software revolution behind LinkedIn's gushing profits. [Online] http://www.wired.com/2013/04/linkedin-software-revolution
24. Auradkar, Aditya, et al. Data infrastructure at LinkedIn. Proceedings of the 28th IEEE International Conference on Data Engineering (ICDE). IEEE, 2012.
25. Kreps, Jay, Narkhede, Neha and Rao, Jun. Kafka: A distributed messaging system for log processing. Proceedings of the 6th International Workshop on Networking Meets Databases (NetDB), Athens, Greece, 2011.
26. Naga, Praveen Neppalli. Real-time analytics at massive scale with Pinot. [Online] September 29, 2014. http://engineering.linkedin.com/analytics/real-time-analytics-massive-scale-pinot
27. Fisher, Ronald A. Presidential address. Sankhya: The Indian Journal of Statistics, 4(1), 1938. http://www.jstor.org/stable/40383882.
28. Montgomery, Douglas C. Design and Analysis of Experiments. John Wiley & Sons, 2008.
29. Betz, Joe and Tagle, Moira. Rest.li: RESTful service architecture at scale. [Online] February 19, 2013. https://engineering.linkedin.com/architecture/restli-restful-service-architecture-scale
30. Romano, Joseph P., Shaikh, Azeem M. and Wolf, Michael. Multiple testing. The New Palgrave Dictionary of Economics, 2010. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.418.4975&rep=rep1&type=pdf
31. Wikipedia. Simpson's paradox. [Online] http://en.wikipedia.org/wiki/Simpson%27s_paradox
32. McFarland, Colin. Experiment!: Website Conversion Rate Optimization with A/B and Multivariate Testing. New Riders, 2012. ISBN 978-0321834607.
33. Eisenberg, Bryan. How to improve A/B testing. ClickZ Network. [Online] April 29, 2005. www.clickz.com/clickz/column/1717234/how-improvem-a-b-testing.
34. Vemuri, Srinivas, Varshney, Maneesh, Puttaswamy, Krishna and Liu, Rui. Execution primitives for scalable joins and aggregations in Map Reduce. Proceedings of the VLDB Endowment, 7(13).
35. Varshney, Maneesh and Vemuri, Srinivas. Open sourcing Cubert: A high performance computation engine for complex big data analytics. [Online] November 11, 2014. https://engineering.linkedin.com/big-data/open-sourcing-cubert-high-performance-computation-engine-complex-big-data-analytics