ABSTRACT
As the number of big data management systems continues to grow, users increasingly seek to use multiple systems within a single data analysis task. To support such hybrid analytics, we develop PipeGen, a tool for efficient data transfer between database management systems (DBMSs). PipeGen automatically generates data pipes between DBMSs by building on their existing functionality for transferring data via disk files in common formats such as CSV. It extends that functionality with efficient binary data transfer that avoids file system materialization, applies multiple format optimizations, and transfers data in parallel when possible. We evaluate our PipeGen prototype by automatically generating 20 data pipes between five different DBMSs. The results show that PipeGen speeds up data transfer by up to 3.8× compared with transfer via disk files.
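The idea of a data pipe that avoids file system materialization can be illustrated with a minimal Python sketch. This is not PipeGen's implementation: the function names (`export_rows`, `import_rows`, `transfer_via_pipe`) are hypothetical stand-ins for a DBMS's export and import routines, and the sketch sends plain CSV text over a local socket pair rather than the optimized binary format the abstract describes. It only shows that export code written against a file-like object can be pointed at a socket, so rows reach the importer without an intermediate disk file.

```python
import csv
import socket
import threading


def export_rows(rows, out):
    """Hypothetical stand-in for a DBMS export routine writing CSV to a file-like object."""
    csv.writer(out).writerows(rows)


def import_rows(src):
    """Hypothetical stand-in for a DBMS import routine reading CSV from a file-like object."""
    return [row for row in csv.reader(src)]


def transfer_via_pipe(rows):
    """Move rows from exporter to importer over a socket, with no intermediate disk file."""
    send_sock, recv_sock = socket.socketpair()
    received = []

    def consume():
        # The importer reads from the socket as if it were an open CSV file.
        with recv_sock, recv_sock.makefile("r", newline="") as src:
            received.extend(import_rows(src))

    consumer = threading.Thread(target=consume)
    consumer.start()

    # The exporter writes to the socket as if it were an open CSV file.
    # Leaving the with-block flushes and closes the stream, which signals
    # end-of-data to the importer.
    with send_sock, send_sock.makefile("w", newline="") as out:
        export_rows(rows, out)

    consumer.join()
    return received


if __name__ == "__main__":
    print(transfer_via_pipe([["1", "alice"], ["2", "bob"]]))
```

Because CSV carries no type information, values arrive as strings; the text encoding and decoding step in this sketch is the kind of overhead a binary transfer format is meant to reduce.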
Index Terms
- PipeGen: Data Pipe Generator for Hybrid Analytics