Abstract
Apache Spark, an open source computing framework, unifies streaming, batch, and interactive big data workloads to unlock new applications.
Index Terms
- Apache Spark: a unified engine for big data processing