ABSTRACT
The MapReduce framework is increasingly being used to analyze large volumes of data. One important type of data analysis done with MapReduce is log processing, in which a click-stream or an event log is filtered, aggregated, or mined for patterns. As part of this analysis, the log often needs to be joined with reference data such as information about users. Although there have been many studies examining join algorithms in parallel and distributed DBMSs, the MapReduce framework is cumbersome for joins. MapReduce programmers often use simple but inefficient algorithms to perform joins. In this paper, we describe crucial implementation details of a number of well-known join strategies in MapReduce, and present a comprehensive experimental comparison of these join techniques on a 100-node Hadoop cluster. Our results provide insights that are unique to the MapReduce platform and offer guidance on when to use a particular join algorithm on this platform.
- http://www.slideshare.net/cloudera/hw09-data-processing-in-the-enterprise.Google Scholar
- http://www.slideshare.net/cloudera/hw09-large-scale-transaction-analysis.Google Scholar
- http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun.Google Scholar
- http://www.slideshare.net/cloudera/hw09-hadoop-based-data-mining-platform-for-the-telecom-industry.Google Scholar
- http://wiki.apache.org/hadoop/PoweredBy.Google Scholar
- http://developer.yahoo.net/blogs/theater/archives/2009/06/hadoop summit hadoop and the enterprise.html.Google Scholar
- http://www.slideshare.net/prasadc/hive-percona-2009.Google Scholar
- http://hadoop.apache.org/.Google Scholar
- http://research.yahoo.com/files/facebook-hadoop-summit.pdf.Google Scholar
- http://hadoop.apache.org/hive/.Google Scholar
- http://www.jaql.org.Google Scholar
- Teradata: DBC/1012 data base computer concepts and facilities, Teradata Corp., Document No. C02-0001-00, 1984.Google Scholar
- P. A. Bernstein and N. Goodman. Full reducers for relational queries using multi-attribute semijoins. In Symp. On Comp. Network, 1979.Google Scholar
- P. A. Bernstein, N. Goodman, E. Wong, C. L. Reeve, and J. B. Rothnie Jr. Query processing in a system for distributed databases (SDD-1). ACM Transactions on Database Systems, 6(4):602--625, 1981. Google ScholarDigital Library
- R. Chaiken, B. Jenkins, P.-Å. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: Easy and efficient parallel processing of massive data sets. PVLDB, 1(2):1265--1276, 2008. Google ScholarDigital Library
- J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004. Google ScholarDigital Library
- D. J. DeWitt and J. Gray. Parallel database systems: The future of high performance database systems. Commun. ACM, 35(6), 1992. Google ScholarDigital Library
- D. J. DeWitt and M. Stonebraker. MapReduce: A major step backwards. Blog post at The Database Column, 17 January 2008.Google Scholar
- S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In SOSP, 2003. Google ScholarDigital Library
- G. Graefe. Query evaluation techniques for large databases. ACM Comput. Surv., 25(2), 1993. Google ScholarDigital Library
- J. Hammerbacher. Managing a large Hadoop cluster. Presentation, Facebook Inc., May 2008.Google Scholar
- P. Mishra and M. H. Eich. Join processing in relational databases. ACM Comput. Surv., 24(1), 1992. Google ScholarDigital Library
- C. Olston, B. Reed, A. Silberstein, and U. Srivastava. Automatic optimization of parallel dataflow programs. In USENIX Annual Technical Conference, pages 267--273, 2008. Google ScholarDigital Library
- C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: A not-so-foreign language for data processing. In SIGMOD, pages 1099--1110, 2008. Google ScholarDigital Library
- A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. Dewitt, S. Madden, and M. Stonebraker. A comparison of approaches to large-scale data analysis. In SIGMOD, 2009. Google ScholarDigital Library
- D. A. Schneider and D. J. DeWitt. A performance evaluation of four parallel join algorithms in a shared-nothing multiprocessor environment. In SIGMOD, 1989. Google ScholarDigital Library
- H.-C. Yang, A. Dasdan, R.-L. Hsiao, and D. S. Parker. Map-reduce-merge: simplified relational data processing on large clusters. In SIGMOD, pages 1029--1040, 2007. Google ScholarDigital Library
Index Terms
- A comparison of join algorithms for log processing in MaPreduce
Recommendations
MapReduce: Review and open challenges
The continuous increase in computational capacity over the past years has produced an overwhelming flow of data or big data, which exceeds the capabilities of conventional processing tools. Big data signify a new era in data exploration and utilization. ...
Join Algorithms under Apache Spark: Revisited
ICCTA '19: Proceedings of the 2019 5th International Conference on Computer and Technology ApplicationsCurrently, we are dealing with large scale applications, which in turn generate massive amount of data and information. Large amount of data often requires processing algorithms using massive parallelism, where the main performance metrics is the ...
High Performance and Fault Tolerant Distributed File System for Big Data Storage and Processing Using Hadoop
ICICA '14: Proceedings of the 2014 International Conference on Intelligent Computing ApplicationsHadoop is a quickly budding ecosystem of components based on Google's MapReduce algorithm and file system work for implementing MapReduce algorithms in a scalable fashion and distributed on commodity hardware. Hadoop enables users to store and process ...
Comments