Apache Spark: a unified engine for big data processing

Authors:
Matei Zaharia
Stanford University, Stanford, CA and Databricks, San Francisco, CA

Reynold S. Xin
Databricks, San Francisco, CA

Patrick Wendell
Databricks, San Francisco, CA

Tathagata Das
Databricks, San Francisco, CA

Michael Armbrust
Databricks, San Francisco, CA

Ankur Dave
University of California, Berkeley

Xiangrui Meng
Databricks, San Francisco, CA

Josh Rosen
Databricks, San Francisco, CA

Shivaram Venkataraman
University of California, Berkeley

Michael J. Franklin
University of Chicago and University of California, Berkeley

Ali Ghodsi
University of California, Berkeley

Joseph Gonzalez
University of California, Berkeley

Scott Shenker
University of California, Berkeley

Ion Stoica
University of California, Berkeley


Communications of the ACM, Volume 59, Issue 11, November 2016, pp. 56–65. https://doi.org/10.1145/2934664

Published: 28 October 2016


Abstract

This open source computing framework unifies streaming, batch, and interactive big data workloads to unlock new applications.





Published in
Communications of the ACM Volume 59, Issue 11
November 2016
118 pages
ISSN: 0001-0782
EISSN: 1557-7317
DOI: 10.1145/3013530
Editor: Moshe Y. Vardi
Publisher: Association for Computing Machinery, New York, NY
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Qualifiers
- research-article
- Popular
- Refereed
