Benchmarking Data Flow Systems for Scalable Machine Learning

ABSTRACT
Distributed data flow systems such as Apache Spark and Apache Flink are popular choices for scaling machine learning algorithms in production. Industry applications of large-scale machine learning, such as click-through rate prediction, rely on models trained on billions of data points that are both highly sparse and high-dimensional. Existing benchmarks assess the performance of data flow systems such as Apache Flink, Spark, or Hadoop with non-representative workloads such as WordCount, Grep, or Sort: they evaluate scalability only with respect to data set size and fail to address the crucial requirement of handling high-dimensional data.
We introduce a representative set of distributed machine learning algorithms suitable for large-scale distributed settings that closely resemble industry-relevant applications and provide generalizable insights into system performance. We implement mathematically equivalent versions of these algorithms in Apache Flink and Apache Spark, tune relevant system parameters, and run a comprehensive set of experiments to assess their scalability with respect to both data set size and dimensionality of the data. We evaluate the systems on data sets of up to four billion data points and 100 million dimensions. Additionally, we compare the performance against single-node implementations to put the scalability results into perspective.
Our results indicate that while current state-of-the-art data flow systems scale robustly with increasing data set sizes, they are surprisingly inefficient at coping with high-dimensional data, which is a crucial requirement for large-scale machine learning.
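The implementations themselves are not part of this excerpt, but the dimensionality bottleneck the abstract describes can be made concrete with a short sketch. The Python program below is a hypothetical, simplified stand-in for a data flow job: it mimics the per-partition gradient aggregation pattern that systems like Spark and Flink use for batch logistic regression (e.g. via `treeAggregate` or a bulk iteration). Each worker emits a *dense* partial gradient of the full dimensionality `d`, so per-iteration aggregation and communication cost grows with `d` even when the input data is sparse. All names and the synthetic data here are illustrative assumptions, not the paper's actual code.

```python
import math

def logistic_partial_gradient(partition, weights, d):
    """One worker's partial gradient for logistic regression.

    Each data point is (sparse_features: dict[index, value], label in {0, 1}).
    Note that the result is a dense vector of length d even though the
    inputs are sparse -- this dense per-worker state is what the data flow
    system must ship and sum, and is why cost grows with dimensionality.
    """
    grad = [0.0] * d
    for features, label in partition:
        margin = sum(weights[i] * v for i, v in features.items())
        coeff = 1.0 / (1.0 + math.exp(-margin)) - label
        for i, v in features.items():
            grad[i] += coeff * v
    return grad

def aggregate(partials, d):
    """The reduce step: element-wise sum of all partial gradients."""
    total = [0.0] * d
    for g in partials:
        for i, v in enumerate(g):
            total[i] += v
    return total

# Illustrative driver: two "partitions" of sparse points, one gradient step.
if __name__ == "__main__":
    d = 4
    partitions = [
        [({0: 1.0, 3: 2.0}, 1), ({1: 1.0}, 0)],
        [({2: 3.0}, 1)],
    ]
    w = [0.0] * d
    partials = [logistic_partial_gradient(p, w, d) for p in partitions]
    full_grad = aggregate(partials, d)
    w = [wi - 0.1 * gi for wi, gi in zip(w, full_grad)]  # one GD step
```

In a real data flow program the map over partitions and the sum-reduce would be executed by the framework; the point of the sketch is only that the aggregated state is O(d) per worker per iteration, independent of sparsity.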