Benchmarking Data Flow Systems for Scalable Machine Learning

ABSTRACT
Distributed data flow systems such as Apache Spark and Apache Flink are popular choices for scaling machine learning algorithms in production. Industry applications of large-scale machine learning, such as click-through rate prediction, rely on models trained on billions of data points that are both highly sparse and high-dimensional. Existing benchmarks assess the performance of data flow systems such as Apache Flink, Spark, or Hadoop with non-representative workloads such as WordCount, Grep, or Sort: they evaluate scalability only with respect to data set size and fail to address the crucial requirement of handling high-dimensional data.
We introduce a representative set of distributed machine learning algorithms suitable for large-scale distributed settings that closely resemble industry-relevant applications and provide generalizable insights into system performance. We implement mathematically equivalent versions of these algorithms in Apache Flink and Apache Spark, tune relevant system parameters, and run a comprehensive set of experiments to assess their scalability with respect to both data set size and dimensionality of the data. We evaluate the systems on data sets of up to four billion data points and 100 million dimensions. Additionally, we compare the performance against single-node implementations to put the scalability results into perspective.
Our results indicate that while current state-of-the-art data flow systems scale robustly with increasing data set sizes, they are surprisingly inefficient at coping with high-dimensional data, which is a crucial requirement for large-scale machine learning.
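The implementations themselves are not part of this excerpt, but the dimensionality bottleneck the abstract describes can be made concrete with a short sketch. The Python program below is a hypothetical, simplified stand-in for a data flow job: it mimics the per-partition gradient aggregation pattern that systems like Spark and Flink use for batch logistic regression (e.g. via `treeAggregate` or a bulk iteration). Each worker emits a *dense* partial gradient of the full dimensionality `d`, so per-iteration aggregation and communication cost grows with `d` even when the input data is sparse. All names and the synthetic data here are illustrative assumptions, not the paper's actual code.

```python
import math

def logistic_partial_gradient(partition, weights, d):
    """One worker's partial gradient for logistic regression.

    Each data point is (sparse_features: dict[index, value], label in {0, 1}).
    Note that the result is a dense vector of length d even though the
    inputs are sparse -- this dense per-worker state is what the data flow
    system must ship and sum, and is why cost grows with dimensionality.
    """
    grad = [0.0] * d
    for features, label in partition:
        margin = sum(weights[i] * v for i, v in features.items())
        coeff = 1.0 / (1.0 + math.exp(-margin)) - label
        for i, v in features.items():
            grad[i] += coeff * v
    return grad

def aggregate(partials, d):
    """The reduce step: element-wise sum of all partial gradients."""
    total = [0.0] * d
    for g in partials:
        for i, v in enumerate(g):
            total[i] += v
    return total

# Illustrative driver: two "partitions" of sparse points, one gradient step.
if __name__ == "__main__":
    d = 4
    partitions = [
        [({0: 1.0, 3: 2.0}, 1), ({1: 1.0}, 0)],
        [({2: 3.0}, 1)],
    ]
    w = [0.0] * d
    partials = [logistic_partial_gradient(p, w, d) for p in partitions]
    full_grad = aggregate(partials, d)
    w = [wi - 0.1 * gi for wi, gi in zip(w, full_grad)]  # one GD step
```

In a real data flow program the map over partitions and the sum-reduce would be executed by the framework; the point of the sketch is only that the aggregated state is O(d) per worker per iteration, independent of sparsity.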