
Clash of the titans: MapReduce vs. Spark for large scale data analytics

Published: 01 September 2015

Abstract

MapReduce and Spark are two very popular open-source cluster computing frameworks for large-scale data analytics. These frameworks hide the complexity of task parallelism and fault tolerance by exposing a simple programming API to users. In this paper, we evaluate the major architectural components in the MapReduce and Spark frameworks, including shuffle, execution model, and caching, using a set of important analytic workloads. To conduct a detailed analysis, we developed two profiling tools: (1) we correlate the task execution plan with the resource utilization for both MapReduce and Spark, and visually present this correlation; (2) we provide a breakdown of the task execution time for in-depth analysis. Through detailed experiments, we quantify the performance differences between MapReduce and Spark, and we attribute these differences to components that are architected differently in the two frameworks. We further expose the source of these performance differences through a set of micro-benchmark experiments. Overall, our experiments show that Spark is about 2.5x, 5x, and 5x faster than MapReduce for Word Count, k-means, and PageRank, respectively. The main causes of these speedups are the efficiency of the hash-based aggregation component for combine, as well as reduced CPU and disk overheads due to RDD caching in Spark. An exception is the Sort workload, for which MapReduce is 2x faster than Spark: MapReduce's execution model shuffles data more efficiently, making Sort run faster on MapReduce.
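
To make the abstract's two main findings concrete, here is a minimal sketch using Spark's Scala API. This is not code from the paper; the object name, app name, and input/output paths are placeholders. It shows reduceByKey, whose map-side combine uses Spark's hash-based aggregation, and RDD caching, which spares iterative jobs such as k-means and PageRank from re-reading their input on every iteration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch, not from the paper: illustrates hash-based map-side
// combining (reduceByKey) and RDD caching. Paths are placeholders.
object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("wordcount-sketch").setMaster("local[*]"))

    // cache() keeps the input in memory, so an iterative job (e.g.
    // k-means or PageRank) would not re-read it from disk on every pass.
    val lines = sc.textFile("input.txt").cache()

    // reduceByKey combines partial counts per partition in a hash map
    // before the shuffle -- the combine component the abstract credits
    // for Spark's Word Count speedup.
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("counts-out")
    sc.stop()
  }
}
```

Note that reduceByKey is chosen here precisely for that map-side combine; groupByKey would shuffle every individual (word, 1) pair across the network instead.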



    Reviews

    Tope Omitola

    Analyzing big data has never been more important in computer science, and the platforms of choice have been either MapReduce or Spark. Using selected workloads that characterize the majority of batch and iterative analytic operations (word count, sort, k-means, linear regression, and PageRank), this paper analyzes the performance differences between these two platforms. Although MapReduce is designed for batch jobs and Spark for iterative jobs, the authors note that, in the field, both are used for both job types. They find that Spark is 2.5 to 5 times faster than MapReduce on the majority of these workloads (the only exception is sort). These results are not so surprising given the key architectural decisions made by the two platforms. The paper thoroughly documents the experimental configuration (hardware, software, and profilers); these parameters are useful for system administrators who want to understand a platform's behavior under different configurations. We also learn that, since the majority of big data analytic workloads are central processing unit (CPU) bound, both platforms scale with the number of CPU cores available to them. System developers can use the knowledge gleaned from this paper to improve the architecture and implementation of Spark and MapReduce, and of the applications running on both platforms. The explanations of the experimental results are very good: they further our understanding of how architecture and working assumptions affect system performance, and they illuminate some of the inner workings of the platforms. For example, we learn that as the number of reduce tasks increases, the execution time of the map stage also increases (see the sketch following this review). If you want to understand the pros and cons of MapReduce and Spark, and when and how to use them, this paper is a good place to start.

    Online Computing Reviews Service
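
The reviewer's reduce-task observation has a simple mechanical reading: each map task partitions its output into one bucket per reduce task, so widening the reduce side adds map-side work. Below is a hedged Spark sketch of where that knob lives; numReducers and the input path are hypothetical values for illustration (in MapReduce, the analogous setting is the job's number of reduce tasks).

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hedged sketch, not from the paper: numReducers and the input path
// are hypothetical values chosen for illustration.
object ReducerCountSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("reducer-count-sketch").setMaster("local[*]"))

    // Each map task must partition and write its output into this many
    // shuffle buckets, which is why raising it can lengthen the map stage.
    val numReducers = 120

    val counts = sc.textFile("input.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _, numReducers)

    println(counts.partitions.length) // shuffle width: 120

    sc.stop()
  }
}
```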


    • Published in

      Proceedings of the VLDB Endowment, Volume 8, Issue 13
      Proceedings of the 41st International Conference on Very Large Data Bases, Kohala Coast, Hawaii
      September 2015, 144 pages

      Publisher: VLDB Endowment

      Publication History

      • Published: 1 September 2015, in PVLDB Volume 8, Issue 13

      Qualifiers

      • research-article
