skip to main content
10.1145/1755913.1755939acmconferencesArticle/Chapter ViewAbstractPublication PageseurosysConference Proceedingsconference-collections
research-article

HadoopToSQL: a mapReduce query optimizer

Published:13 April 2010Publication History

ABSTRACT

MapReduce is a cost-effective way to achieve scalable performance for many log-processing workloads. These workloads typically process their entire dataset. MapReduce can be inefficient, however, when handling business-oriented workloads, especially when these workloads access only a subset of the data.

HadoopToSQL seeks to improve MapReduce performance for the latter class of workloads by transforming MapReduce queries to use the indexing, aggregation and grouping features provided by SQL databases. It statically analyzes the computation performed by the MapReduce queries. The static analysis uses symbolic execution to derive preconditions and postconditions for the map and reduce functions. It then uses this information either to generate input restrictions, which avoid scanning the entire dataset, or to generate equivalent SQL queries, which take advantage of SQL grouping and aggregation features.

We demonstrate the performance of MapReduce queries, when optimized by HadoopToSQL, by both single-node and cluster experiments. HadoopToSQL always improves performance over MapReduce and approximates that of hand-written SQL.

References

  1. A. Abouzeid, K. Bajda-Pawlikowski, D. J. Abadi, A. Rasin, and A. Silberschatz. HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads. PVLDB, 2(1):922--933, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Apache Software Foundation. Hadoop. http://hadoop.apache.org/core/.Google ScholarGoogle Scholar
  3. F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: a distributed storage system for structured data. In OSDI '06: Proceedings of the 7th symposium on Operating systems design and implementation, pages 205--218, Berkeley, CA, USA, 2006. USENIX Association. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. S. Chen and S. W. Schlosser. Map-Reduce meets wider varieties of applications. Technical Report IRP-TR-08-05, Pittsburgh, USA, 2008. Intel Research Pittsburgh.Google ScholarGoogle Scholar
  5. J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI'04: Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation, pages 10--10, Berkeley, CA, USA, 2004. USENIX Association. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. J. Dean and S. Ghemawat. Mapreduce: a flexible data processing tool. Commun. ACM, 53(1):72--77, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. L. DeMichiel and M. Keith. JSR 220: Enterprise JavaBeans 3.0. http://www.jcp.org/en/jsr/detail?id=220, May 11 2006.Google ScholarGoogle Scholar
  8. D. J. DeWitt, S. Ghanderaizadeh, and D. Schneider. A performance analysis of the gamma database machine. In SIGMOD '88: Proceedings of the 1988 ACM SIGMOD international conference on Management of data, pages 350--360, New York, NY, USA, 1988. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. In EuroSys '07: Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, pages 59--72, New York, NY, USA, 2007. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. M.-Y. Iu and W. Zwaenepoel. Queryll: Java database queries through bytecode rewriting. In M. van Steen and M. Henning, editors, Middleware, volume 4290 of Lecture Notes in Computer Science, pages 201--218. Springer, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. M.-Y. Iu, E. Cecchet, and W. Zwaenepoel. JReq: Database queries in imperative languages. In CC '10: Proceedings of the 19th International Conference on Compiler Construction, Berlin, Heidelberg, 2010. Springer-Verlag. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. K. Kim, K. Jeon, H. Han, S. gyu Kim, H. Jung, and H. Y. Yeom. MRBench: A benchmark for MapReduce framework. International Conference on Parallel and Distributed Systems, 0:11--18, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. D. Maier, J. Stein, A. Otis, and A. Purdy. Development of an object-oriented DBMS. In OOPLSA '86: Conference proceedings on Object-oriented programming systems, languages and applications, pages 472--482, New York, NY, USA, 1986. ACM Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: a not-so-foreign language for data processing. In SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1099--1110, New York, NY, USA, 2008. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A comparison of approaches to large-scale data analysis. In SIGMOD '09: Proceedings of the 35th SIGMOD international conference on Management of data, pages 165--178, New York, NY, USA, 2009. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. J. Persyn. Database sharding at Netlog, with MySQL and PHP. http://www.jurriaanpersyn.com/archives/2009/02/12/database-sharding-at-netlog-with-mysql-and-php/.Google ScholarGoogle Scholar
  17. R. Pike, S. Dorward, R. Griesemer, and S. Quinlan. Interpreting the data: Parallel analysis with Sawzall. Sci. Program., 13: 277--298, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Spock Proxy. Spock proxy -- a proxy for MySQL horizontal partitioning. http://spockproxy.sourceforge.net/.Google ScholarGoogle Scholar
  19. ST Global. Spider storage engine. http://spiderformysql.com/.Google ScholarGoogle Scholar
  20. M. Stonebraker, D. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin. MapReduce and parallel DBMSs: friends or foes? Commun. ACM, 53(1):64--71, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive: a warehousing solution over a Map--Reduce framework. Proc. VLDB Endow., 2(2):1626--1629, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Transaction Processing Performance Council (TPC). TPC Benchmark H (Decision Support) Standard Specification Version 2.8.0. Transaction Processing Performance Council, 2008.Google ScholarGoogle Scholar
  23. R. Vallée-Rai, P. Co, E. Gagnon, L. Hendren, P. Lam, and V. Sundaresan. Soot -- a Java bytecode optimization framework. In CASCON '99: Proceedings of the 1999 conference of the Centre for Advanced Studies on Collaborative research, page 13. IBM Press, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. B. Wiedermann and W. R. Cook. Extracting queries by static analysis of transparent persistence. In POPL '07: Proceedings of the 34th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pages 199--210, New York, NY, USA, 2007. ACM Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. B.Wiedermann, A. Ibrahim, andW. R. Cook. Interprocedural query extraction for transparent persistence. In OOPSLA '08: Proceedings of the 23rd ACM SIGPLAN conference on Object oriented programming systems languages and applications, pages 19--36, New York, NY, USA, 2008. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In R. Draves and R. van Renesse, editors, OSDI, pages 1--14. USENIX Association, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. HadoopToSQL: a mapReduce query optimizer

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in
        • Published in

          cover image ACM Conferences
          EuroSys '10: Proceedings of the 5th European conference on Computer systems
          April 2010
          388 pages
          ISBN:9781605585772
          DOI:10.1145/1755913

          Copyright © 2010 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 13 April 2010

          Permissions

          Request permissions about this article.

          Request Permissions

          Check for updates

          Qualifiers

          • research-article

          Acceptance Rates

          Overall Acceptance Rate241of1,308submissions,18%

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader