DOI: 10.1145/2987550.2987567

PipeGen: Data Pipe Generator for Hybrid Analytics

Published: 05 October 2016

ABSTRACT

As the number of big data management systems continues to grow, users increasingly seek to leverage multiple systems in the context of a single data analysis task. To efficiently support such hybrid analytics, we develop a tool called PipeGen for efficient data transfer between database management systems (DBMSs). PipeGen automatically generates data pipes between DBMSs by leveraging their functionality to transfer data via disk files using common data formats such as CSV. PipeGen creates data pipes by extending such functionality with efficient binary data transfer capabilities that avoid file system materialization, include multiple important format optimizations, and transfer data in parallel when possible. We evaluate our PipeGen prototype by generating 20 data pipes automatically between five different DBMSs. The results show that PipeGen speeds up data transfer by up to 3.8× as compared to transferring using disk files.
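The contrast the abstract draws, between text export through a file and a binary stream between systems, can be illustrated with a minimal sketch. This is not PipeGen's implementation: the fixed two-column schema, the function names, and the use of an in-process socket pair are all assumptions made for illustration, and an in-memory buffer stands in for the disk file.

```python
import csv
import io
import socket
import struct
import threading

# Hypothetical fixed schema for illustration: (int64 id, float64 value).
ROW = struct.Struct("!qd")


def export_csv(rows, out):
    """Baseline: text export, as a DBMS might write to a CSV file."""
    csv.writer(out).writerows(rows)


def import_csv(src):
    """Baseline: text import that parses every field back from a string."""
    return [(int(a), float(b)) for a, b in csv.reader(src)]


def send_binary(rows, sock):
    """Pipe-style export: pack rows in binary and stream them over a socket."""
    with sock:
        for row in rows:
            sock.sendall(ROW.pack(*row))


def recv_binary(sock):
    """Pipe-style import: read fixed-size binary rows until the peer closes."""
    rows = []
    buf = b""
    with sock:
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            buf += chunk
            while len(buf) >= ROW.size:
                rows.append(ROW.unpack(buf[:ROW.size]))
                buf = buf[ROW.size:]
    return rows


def transfer_via_pipe(rows):
    """Move rows between two endpoints with no intermediate file."""
    a, b = socket.socketpair()
    sender = threading.Thread(target=send_binary, args=(rows, a))
    sender.start()
    received = recv_binary(b)
    sender.join()
    return received


def transfer_via_csv(rows):
    """Text round trip; the StringIO buffer is a stand-in for a disk file."""
    buf = io.StringIO()
    export_csv(rows, buf)
    buf.seek(0)
    return import_csv(buf)
```

Both paths return the same rows. The binary path skips the string formatting and parsing of each field and never materializes an intermediate file, which are the two costs the abstract says the generated pipes avoid.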

            Published in

            SoCC '16: Proceedings of the Seventh ACM Symposium on Cloud Computing
            October 2016
            534 pages
            ISBN: 9781450345255
            DOI: 10.1145/2987550

            Copyright © 2016 ACM

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Qualifiers

            • research-article
            • Research
            • Refereed limited

            Acceptance Rates

            SoCC '16 Paper Acceptance Rate: 38 of 151 submissions, 25%
            Overall Acceptance Rate: 169 of 722 submissions, 23%
