ABSTRACT
MapReduce is a cost-effective way to achieve scalable performance for many log-processing workloads. These workloads typically process their entire dataset. MapReduce can be inefficient, however, when handling business-oriented workloads, especially when these workloads access only a subset of the data.
HadoopToSQL seeks to improve MapReduce performance for the latter class of workloads by transforming MapReduce queries to use the indexing, aggregation and grouping features provided by SQL databases. It statically analyzes the computation performed by the MapReduce queries. The static analysis uses symbolic execution to derive preconditions and postconditions for the map and reduce functions. It then uses this information either to generate input restrictions, which avoid scanning the entire dataset, or to generate equivalent SQL queries, which take advantage of SQL grouping and aggregation features.
We evaluate MapReduce queries optimized by HadoopToSQL in both single-node and cluster experiments. In these experiments, HadoopToSQL always improves performance over MapReduce and approximates that of hand-written SQL.
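The kind of equivalence that the transformation relies on can be illustrated with a small, hand-written sketch. It is not HadoopToSQL's analysis or its generated output: the `orders` table, the sample rows, and the function names are all hypothetical, and SQLite stands in for the SQL database. The map function filters on a year, which corresponds to an input restriction that an index could serve, and the reduce function sums values per key, which corresponds to SQL grouping and aggregation.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (order_id, customer, amount, year).
ORDERS = [
    (1, "alice", 30.0, 2008),
    (2, "bob", 20.0, 2009),
    (3, "alice", 50.0, 2009),
    (4, "carol", 10.0, 2009),
    (5, "bob", 5.0, 2009),
]


def map_fn(record):
    """Emit (customer, amount) only for orders placed in 2009.

    The condition on `year` is the sort of predicate that could be
    turned into an input restriction instead of a full scan.
    """
    _, customer, amount, year = record
    if year == 2009:
        yield customer, amount


def reduce_fn(key, values):
    """Sum the amounts for one customer (expressible as SUM ... GROUP BY)."""
    return key, sum(values)


def run_mapreduce(records):
    """Toy in-memory MapReduce: visits every record, then groups and reduces."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return sorted(reduce_fn(k, v) for k, v in groups.items())


# Hand-written SQL assumed to be equivalent to map_fn + reduce_fn above.
EQUIVALENT_SQL = """
    SELECT customer, SUM(amount)
    FROM orders
    WHERE year = 2009
    GROUP BY customer
    ORDER BY customer
"""


def run_sql(records):
    """Load the same records into SQLite and run the equivalent query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders "
        "(order_id INTEGER, customer TEXT, amount REAL, year INTEGER)"
    )
    # An index on the filtered column lets the database skip non-matching rows.
    conn.execute("CREATE INDEX idx_orders_year ON orders (year)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", records)
    rows = conn.execute(EQUIVALENT_SQL).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(run_mapreduce(ORDERS))
    print(run_sql(ORDERS))
```

Both functions return the same per-customer totals for the sample data. The toy MapReduce runner has to visit every record, whereas the SQL form leaves the filtering and grouping to the database.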