Clash of the titans: MapReduce vs. Spark for large scale data analytics

Authors:
Juwei Shi (Renmin University of China)
Yunjie Qiu (IBM Research, China)
Umar Farooq Minhas (IBM Almaden Research Center)
Limei Jiao (IBM Research, China)
Chen Wang (Tsinghua University)
Berthold Reinwald (IBM Almaden Research Center)
Fatma Özcan (IBM Almaden Research Center)

Proceedings of the VLDB Endowment, Volume 8, Issue 13, pp. 2110–2121. https://doi.org/10.14778/2831360.2831365

Published: 01 September 2015

Abstract

MapReduce and Spark are two very popular open source cluster computing frameworks for large scale data analytics. These frameworks hide the complexity of task parallelism and fault-tolerance, by exposing a simple programming API to users. In this paper, we evaluate the major architectural components in MapReduce and Spark frameworks including: shuffle, execution model, and caching, by using a set of important analytic workloads. To conduct a detailed analysis, we developed two profiling tools: (1) We correlate the task execution plan with the resource utilization for both MapReduce and Spark, and visually present this correlation; (2) We provide a break-down of the task execution time for in-depth analysis. Through detailed experiments, we quantify the performance differences between MapReduce and Spark. Furthermore, we attribute these performance differences to different components which are architected differently in the two frameworks. We further expose the source of these performance differences by using a set of micro-benchmark experiments. Overall, our experiments show that Spark is about 2.5x, 5x, and 5x faster than MapReduce, for Word Count, k-means, and PageRank, respectively. The main causes of these speedups are the efficiency of the hash-based aggregation component for combine, as well as reduced CPU and disk overheads due to RDD caching in Spark. An exception to this is the Sort workload, for which MapReduce is 2x faster than Spark. We show that MapReduce's execution model is more efficient for shuffling data than Spark, thus making Sort run faster on MapReduce.
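
The abstract attributes part of Spark's Word Count speedup to hash-based aggregation in the combine step. The sketch below is only an illustration of that idea on a small in-memory list, not code from Spark, MapReduce, or the paper. The function names (`map_words`, `hash_combine`, `sort_combine`) are hypothetical, and the sort-based variant is included purely as a contrast, on the assumption that a combiner can also be built by ordering keys before merging them.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def map_words(line):
    """Map step of Word Count: emit (word, 1) for each whitespace-separated token."""
    return [(word, 1) for word in line.split()]


def hash_combine(pairs):
    """Combine (word, count) pairs with a hash table: one pass, no sorting."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def sort_combine(pairs):
    """Combine (word, count) pairs by sorting on the key, then merging equal-key runs."""
    ordered = sorted(pairs, key=itemgetter(0))
    return {
        word: sum(count for _, count in group)
        for word, group in groupby(ordered, key=itemgetter(0))
    }


# Hypothetical toy input standing in for one map task's records.
lines = ["spark and mapreduce", "spark and hadoop"]
pairs = [pair for line in lines for pair in map_words(line)]

hash_result = hash_combine(pairs)
sort_result = sort_combine(pairs)
print(hash_result)
print(hash_result == sort_result)
```

Both functions return the same counts. The difference is in the work done: the hash-based version touches each pair once, while the sort-based version must order all keys first, which is extra work when the output order does not matter, as in Word Count.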

Index Terms

1. Information systems
  1. Data management systems

Reviews

Reviewer: Tope Omitola

Analyzing big data has never been more important in computer science, and the platforms of choice have been either MapReduce or Spark. Using selected workloads that characterize the majority of batch and iterative analytic operations (word count, sort, k-means, linear regression, and PageRank), this paper analyzes the performance differences between the two platforms. Although MapReduce is designed for batch jobs and Spark for iterative jobs, the authors note that in practice both are used for both job types. They find that Spark is 2.5 to 5 times faster than MapReduce on the majority of these workloads (the only exception is sort). These results are not so surprising given the key architectural decisions made by the two platforms.

The paper is replete with the configuration parameters of the experiments (hardware, software, and profilers). These parameters are useful for system administrators who want to understand a platform's behavior under different configurations. We also learn that, since the majority of big data analytic workloads are central processing unit (CPU)-bound, both platforms scale with the number of CPU cores available to them. System developers can use the knowledge gleaned from this paper to improve the architecture and implementation of Spark and MapReduce, and of the applications running on both platforms.

The explanations of the experimental results are very good: they further the understanding of how architecture and working assumptions affect system performance, and they also explain some of the inner workings of the platforms. For example, we learn that as the number of "reduce" tasks is increased, the execution time of the "map" stage increases. If you want to understand the pros and cons of MapReduce and Spark, and when and how to use them, this paper is a good place to start.

Online Computing Reviews Service

Published in

Proceedings of the VLDB Endowment, Volume 8, Issue 13
Proceedings of the 41st International Conference on Very Large Data Bases, Kohala Coast, Hawaii
September 2015
144 pages
ISSN: 2150-8097
Publisher: VLDB Endowment
Qualifiers: research-article

Article Metrics

Total Citations: 36
Total Downloads: 1,695
Downloads (Last 12 months): 108
Downloads (Last 6 weeks): 24
