research-article

HadoopToSQL: a mapReduce query optimizer

Authors:
Ming-Yee Iu

EPFL, Lausanne, Switzerland

EPFL, Lausanne, Switzerland
View Profile

,
Willy Zwaenepoel

EPFL, Lausanne, Switzerland

EPFL, Lausanne, Switzerland
View Profile

EuroSys '10: Proceedings of the 5th European conference on Computer systemsApril 2010Pages 251–264https://doi.org/10.1145/1755913.1755939

Published:13 April 2010Publication History

EuroSys '10: Proceedings of the 5th European conference on Computer systems

Pages 251–264

ABSTRACT

MapReduce is a cost-effective way to achieve scalable performance for many log-processing workloads. These workloads typically process their entire dataset. MapReduce can be inefficient, however, when handling business-oriented workloads, especially when these workloads access only a subset of the data.

HadoopToSQL seeks to improve MapReduce performance for the latter class of workloads by transforming MapReduce queries to use the indexing, aggregation and grouping features provided by SQL databases. It statically analyzes the computation performed by the MapReduce queries. The static analysis uses symbolic execution to derive preconditions and postconditions for the map and reduce functions. It then uses this information either to generate input restrictions, which avoid scanning the entire dataset, or to generate equivalent SQL queries, which take advantage of SQL grouping and aggregation features.

We demonstrate the performance of MapReduce queries, when optimized by HadoopToSQL, by both single-node and cluster experiments. HadoopToSQL always improves performance over MapReduce and approximates that of hand-written SQL.

References

A. Abouzeid, K. Bajda-Pawlikowski, D. J. Abadi, A. Rasin, and A. Silberschatz. HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads. PVLDB, 2(1):922--933, 2009. Google ScholarDigital Library
Apache Software Foundation. Hadoop. http://hadoop.apache.org/core/.Google Scholar
F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: a distributed storage system for structured data. In OSDI '06: Proceedings of the 7th symposium on Operating systems design and implementation, pages 205--218, Berkeley, CA, USA, 2006. USENIX Association. Google ScholarDigital Library
S. Chen and S. W. Schlosser. Map-Reduce meets wider varieties of applications. Technical Report IRP-TR-08-05, Pittsburgh, USA, 2008. Intel Research Pittsburgh.Google Scholar
J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI'04: Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation, pages 10--10, Berkeley, CA, USA, 2004. USENIX Association. Google ScholarDigital Library
J. Dean and S. Ghemawat. Mapreduce: a flexible data processing tool. Commun. ACM, 53(1):72--77, 2010. Google ScholarDigital Library
L. DeMichiel and M. Keith. JSR 220: Enterprise JavaBeans 3.0. http://www.jcp.org/en/jsr/detail?id=220, May 11 2006.Google Scholar
D. J. DeWitt, S. Ghanderaizadeh, and D. Schneider. A performance analysis of the gamma database machine. In SIGMOD '88: Proceedings of the 1988 ACM SIGMOD international conference on Management of data, pages 350--360, New York, NY, USA, 1988. ACM. Google ScholarDigital Library
M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. In EuroSys '07: Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, pages 59--72, New York, NY, USA, 2007. ACM. Google ScholarDigital Library
M.-Y. Iu and W. Zwaenepoel. Queryll: Java database queries through bytecode rewriting. In M. van Steen and M. Henning, editors, Middleware, volume 4290 of Lecture Notes in Computer Science, pages 201--218. Springer, 2006. Google ScholarDigital Library
M.-Y. Iu, E. Cecchet, and W. Zwaenepoel. JReq: Database queries in imperative languages. In CC '10: Proceedings of the 19th International Conference on Compiler Construction, Berlin, Heidelberg, 2010. Springer-Verlag. Google ScholarDigital Library
K. Kim, K. Jeon, H. Han, S. gyu Kim, H. Jung, and H. Y. Yeom. MRBench: A benchmark for MapReduce framework. International Conference on Parallel and Distributed Systems, 0:11--18, 2008. Google ScholarDigital Library
D. Maier, J. Stein, A. Otis, and A. Purdy. Development of an object-oriented DBMS. In OOPLSA '86: Conference proceedings on Object-oriented programming systems, languages and applications, pages 472--482, New York, NY, USA, 1986. ACM Press. Google ScholarDigital Library
C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: a not-so-foreign language for data processing. In SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1099--1110, New York, NY, USA, 2008. ACM. Google ScholarDigital Library
A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A comparison of approaches to large-scale data analysis. In SIGMOD '09: Proceedings of the 35th SIGMOD international conference on Management of data, pages 165--178, New York, NY, USA, 2009. ACM. Google ScholarDigital Library
J. Persyn. Database sharding at Netlog, with MySQL and PHP. http://www.jurriaanpersyn.com/archives/2009/02/12/database-sharding-at-netlog-with-mysql-and-php/.Google Scholar
R. Pike, S. Dorward, R. Griesemer, and S. Quinlan. Interpreting the data: Parallel analysis with Sawzall. Sci. Program., 13: 277--298, 2005. Google ScholarDigital Library
Spock Proxy. Spock proxy -- a proxy for MySQL horizontal partitioning. http://spockproxy.sourceforge.net/.Google Scholar
ST Global. Spider storage engine. http://spiderformysql.com/.Google Scholar
M. Stonebraker, D. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin. MapReduce and parallel DBMSs: friends or foes? Commun. ACM, 53(1):64--71, 2010. Google ScholarDigital Library
A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive: a warehousing solution over a Map--Reduce framework. Proc. VLDB Endow., 2(2):1626--1629, 2009. Google ScholarDigital Library
Transaction Processing Performance Council (TPC). TPC Benchmark H (Decision Support) Standard Specification Version 2.8.0. Transaction Processing Performance Council, 2008.Google Scholar
R. Vallée-Rai, P. Co, E. Gagnon, L. Hendren, P. Lam, and V. Sundaresan. Soot -- a Java bytecode optimization framework. In CASCON '99: Proceedings of the 1999 conference of the Centre for Advanced Studies on Collaborative research, page 13. IBM Press, 1999. Google ScholarDigital Library
B. Wiedermann and W. R. Cook. Extracting queries by static analysis of transparent persistence. In POPL '07: Proceedings of the 34th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pages 199--210, New York, NY, USA, 2007. ACM Press. Google ScholarDigital Library
B.Wiedermann, A. Ibrahim, andW. R. Cook. Interprocedural query extraction for transparent persistence. In OOPSLA '08: Proceedings of the 23rd ACM SIGPLAN conference on Object oriented programming systems languages and applications, pages 19--36, New York, NY, USA, 2008. ACM. Google ScholarDigital Library
Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In R. Draves and R. van Renesse, editors, OSDI, pages 1--14. USENIX Association, 2008. Google ScholarDigital Library

Index Terms

HadoopToSQL: a mapReduce query optimizer
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
      2. Parallel and distributed DBMSs
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Recommendations

Can we analyze big data inside a DBMS?
DOLAP '13: Proceedings of the sixteenth international workshop on Data warehousing and OLAP

Relational DBMSs remain the main data management technology, despite the big data analytics and no-SQL waves. On the other hand, for data analytics in a broad sense, there are plenty of non-DBMS tools including statistical languages, matrix packages, ...
Read More
Reduce, You Say: What NoSQL Can Do for Data Aggregation and BI in Large Repositories
DEXA '11: Proceedings of the 2011 22nd International Workshop on Database and Expert Systems Applications

Data aggregation is one of the key features used in databases, especially for Business Intelligence (e.g., ETL, OLAP) and analytics/data mining. When considering SQL databases, aggregation is used to prepare and visualize data for deeper analyses. ...
Read More
SQL2X: Learning SQL, NoSQL, and MapReduce via Translation
SIGCSE '21: Proceedings of the 52nd ACM Technical Symposium on Computer Science Education

A key challenge in designing a database course is how to introduce students to the great variety of data models, query languages, databases, and data processing systems available now. To address this challenge, we propose SQL2X, a novel SQL-centric ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
EuroSys '10: Proceedings of the 5th European conference on Computer systems
April 2010
388 pages
ISBN:9781605585772
DOI:10.1145/1755913
General Chair:
Christine Morin
INRIA Rennes, France
,
Program Chair:
Gilles Muller
INRIA/LIP6, France
Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 13 April 2010
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
mapreduce
query optimization
sql
symbolic execution
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate241of1,308submissions,18%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 29
  Total Citations
  View Citations
- 1,483
  Total Downloads
- Downloads (Last 12 months)14
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HadoopToSQL: a mapReduce query optimizer

EuroSys '10: Proceedings of the 5th European conference on Computer systems

ABSTRACT

References

Cited By

Index Terms

Recommendations

Can we analyze big data inside a DBMS?

Reduce, You Say: What NoSQL Can Do for Data Aggregation and BI in Large Repositories

SQL2X: Learning SQL, NoSQL, and MapReduce via Translation