Apache Spark: a unified engine for big data processing

Published: 28 October 2016

Abstract

This open source computing framework unifies streaming, batch, and interactive big data workloads to unlock new applications.

          • Published in

            Communications of the ACM, Volume 59, Issue 11
            November 2016
            118 pages
            ISSN: 0001-0782
            EISSN: 1557-7317
            DOI: 10.1145/3013530
            Editor: Moshe Y. Vardi

            Copyright © 2016 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Qualifiers

            • research-article
            • Popular
            • Refereed
