DOI: 10.1145/2882903.2903742
research-article

VectorH: Taking SQL-on-Hadoop to the Next Level

Published: 14 June 2016

ABSTRACT

Actian Vector in Hadoop (VectorH for short) is a new SQL-on-Hadoop system built on top of the fast Vectorwise analytical database system. VectorH achieves fault tolerance and storage scalability by relying on HDFS, and extends the state of the art in SQL-on-Hadoop systems by instrumenting the HDFS replication policy to optimize read locality. VectorH integrates with YARN for workload management, achieving a high degree of elasticity. Even though HDFS is an append-only filesystem and VectorH supports (update-averse) ordered tables, trickle updates are possible thanks to Positional Delta Trees (PDTs), a differential update structure that can be queried efficiently. We describe the changes made to single-server Vectorwise to turn it into a Hadoop-based MPP system, encompassing workload management, parallel query optimization and execution, HDFS storage, transaction processing and Spark integration. We evaluate VectorH against HAWQ, Impala, SparkSQL and Hive, showing orders of magnitude better performance.


Published in: SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data, June 2016, 2300 pages
ISBN: 9781450335317
DOI: 10.1145/2882903

Copyright © 2016 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Overall acceptance rate: 785 of 4,003 submissions (20%)
