DOI: 10.1145/1807167.1807273
research-article

A comparison of join algorithms for log processing in MapReduce

Published: 06 June 2010

ABSTRACT

The MapReduce framework is increasingly being used to analyze large volumes of data. One important type of data analysis done with MapReduce is log processing, in which a click-stream or an event log is filtered, aggregated, or mined for patterns. As part of this analysis, the log often needs to be joined with reference data such as information about users. Although there have been many studies examining join algorithms in parallel and distributed DBMSs, the MapReduce framework is cumbersome for joins. MapReduce programmers often use simple but inefficient algorithms to perform joins. In this paper, we describe crucial implementation details of a number of well-known join strategies in MapReduce, and present a comprehensive experimental comparison of these join techniques on a 100-node Hadoop cluster. Our results provide insights that are unique to the MapReduce platform and offer guidance on when to use a particular join algorithm on this platform.
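To make the log-to-reference-data join concrete, here is a minimal in-memory sketch of a generic reduce-side equi-join written in the map, shuffle, and reduce style. It is an illustration only and is not taken from the paper: the abstract does not say which join strategies are compared, and the function names, the `user_id` join key, and the sample records are all hypothetical. It also runs in a single process rather than on Hadoop.

```python
from collections import defaultdict
from itertools import product


def map_phase(log_records, user_records):
    """Emit (join_key, (tag, record)) pairs, tagging each record with its source."""
    for rec in log_records:
        yield rec["user_id"], ("L", rec)  # "L" marks a log record
    for rec in user_records:
        yield rec["user_id"], ("R", rec)  # "R" marks a reference record


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """For each key, pair every log record with every matching reference record."""
    for values in groups.values():
        logs = [rec for tag, rec in values if tag == "L"]
        users = [rec for tag, rec in values if tag == "R"]
        for log, user in product(logs, users):
            yield {**log, **user}


def reduce_side_join(log_records, user_records):
    """Inner equi-join of the log with the reference table on user_id."""
    return list(reduce_phase(shuffle(map_phase(log_records, user_records))))


# Hypothetical sample data: a small click log and a user reference table.
LOG = [
    {"user_id": 1, "url": "/home"},
    {"user_id": 2, "url": "/cart"},
    {"user_id": 1, "url": "/checkout"},
    {"user_id": 3, "url": "/home"},  # no matching user, dropped by the inner join
]
USERS = [
    {"user_id": 1, "country": "US"},
    {"user_id": 2, "country": "DE"},
]

JOINED = reduce_side_join(LOG, USERS)
```

In this sketch every record from both inputs passes through the shuffle step, which is one reason a straightforward join of this kind can be costly on a real cluster.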


Published in

SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
June 2010
1286 pages
ISBN: 9781450300322
DOI: 10.1145/1807167

Copyright © 2010 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 785 of 4,003 submissions, 20%
