ABSTRACT
Actian Vector in Hadoop (VectorH for short) is a new SQL-on-Hadoop system built on top of the fast Vectorwise analytical database system. VectorH achieves fault tolerance and storage scalability by relying on HDFS, and extends the state-of-the-art in SQL-on-Hadoop systems by instrumenting the HDFS replication policy to optimize read locality. VectorH integrates with YARN for workload management, achieving a high degree of elasticity. Even though HDFS is an append-only filesystem, and VectorH supports (update-averse) ordered tables, trickle updates are possible thanks to Positional Delta Trees (PDTs), a differential update structure that can be queried efficiently. We describe the changes made to single-server Vectorwise to turn it into a Hadoop-based MPP system, encompassing workload management, parallel query optimization and execution, HDFS storage, transaction processing and Spark integration. We evaluate VectorH against HAWQ, Impala, SparkSQL and Hive, showing orders of magnitude better performance.
- A. Ailamaki, D. DeWitt, M. Hill, and M. Skounakis. Weaving relations for cache performance. In PVLDB, 2001. Google ScholarDigital Library
- K. Anikiej. Multi-core parallelization of vectorized query execution. MSc thesis, VU University, 2010.Google Scholar
- M. Armbrust, R. Xin, et al. Spark SQL: Relational data processing in Spark. In SIGMOD, 2015. Google ScholarDigital Library
- P. Boncz, M. Zukowski, and N. Nes. MonetDB/X100: hyper-pipelining query execution. In CIDR, volume 5, 2005.Google Scholar
- C. Bârcă. Dynamic Resource Management in Vectorwise on Hadoop. MSc thesis, VU University Amsterdam, 2014.Google Scholar
- L. Chang, Z. Wang, T. Ma, L. Jian, L. Ma, A. Goldshuv, L. Lonergan, et al. HAWQ: a massively parallel processing SQL engine in hadoop. In SIGMOD, 2014. Google ScholarDigital Library
- A. Costea and A. Ionescu. Query optimization and execution in Vectorwise MPP. MSc thesis, VU University, 2012.Google Scholar
- A. Floratou, U. F. Minhas, and F. Özcan. SQL-on-Hadoop: Full circle back to shared-nothing database architectures. PVLDB, 7(12), 2014. Google ScholarDigital Library
- A. Floratou, J. Patel, E. Shekita, and S. Tata. Column-oriented storage techniques for mapreduce. PVLDB, 4(7), 2011. Google ScholarDigital Library
- G. Graefe. Encapsulation of parallelism in the Volcano query processing system, volume 19. 1990. Google ScholarDigital Library
- S. Héman. Updating Compressed Column Stores. PhD thesis, VU University, 2015.Google Scholar
- S. Héman, M. Zukowski, N. J. Nes, L. Sidirourgos, and P. Boncz. Positional update handling in column stores. In SIGMOD, 2010. Google ScholarDigital Library
- Y. Huai, A. Chauhan, A. Gates, G. Hagleitner, E. Hanson, et al. Major technical advancements in Apache Hive. In SIGMOD, 2014. Google ScholarDigital Library
- Y. Huai, S. Ma, R. Lee, O. O'Malley, and X. Zhang... table placement methods in clusters. PVLDB, 6(14), 2013. Google ScholarDigital Library
- M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg. Quincy: fair scheduling for distributed computing clusters. In SOSP, 2009. Google ScholarDigital Library
- M. Kornacker et al. Impala: A modern, open-source sql engine for hadoop. In CIDR, 2015.Google Scholar
- P.-Å. Larson, C. Clinciu, E. Hanson, A. Oks, S. Price, S. Rangarajan, A. Surna, and Q. Zhou. SQL server column store indexes. In SIGMOD, 2011. Google ScholarDigital Library
- S. Melnik, A. Gubarev, J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: interactive analysis of web-scale datasets. PVLDB, 3(1--2), 2010. Google ScholarDigital Library
- T. Neumann. Efficiently compiling efficient query plans for modern hardware. PVLDB, 4(9), 2011. Google ScholarDigital Library
- V. Raman et al. DB2 with BLU acceleration: So much more than just a column store. PVLDB, 6(11), 2013. Google ScholarDigital Library
- W. Rödiger, T. Mühlbauer, A. Kemper, and T. Neumann. High-speed query processing over high-speed networks. PVLDB, 9(4), 2015. Google ScholarDigital Library
- M. A. Soliman et al. Orca: a modular query optimizer architecture for big data. In SIGMOD, 2014. Google ScholarDigital Library
- M. 'Switakowski, P. Boncz, and M. Zukowski. From cooperative scans to predictive buffer management. PVLDB, 5(12), 2012. Google ScholarDigital Library
- S. Wanderman-Milne and N. Li. Runtime code generation in cloudera impala. DEBULL, 37(1), 2014.Google Scholar
- S. Whoerl. Efficient relational main-memory query processing for Hadoop Parquet Nested Columnar storage with HyPer and Vectorwise. MSc thesis, CWI/LMU/TUM/U. Augsburg, 2014.Google Scholar
- M. Zaharia, M. Chowdhury, M. Franklin, S. Shenker, and I. Stoica. Spark: cluster computing with working sets. In USENIX, volume 10, 2010. Google ScholarDigital Library
- M. Zukowski. Balancing Vectorized Query Execution with Bandwidth-Optimized Storage. PhD thesis, 2009.Google Scholar
- M. Zukowski, S. Héman, N. Nes, and P. Boncz. Super-scalar RAM-CPU cache compression. In ICDE, 2006. Google ScholarDigital Library
Index Terms
- VectorH: Taking SQL-on-Hadoop to the Next Level
Recommendations
Evaluating SQL-on-Hadoop for Big Data Warehousing on Not-So-Good Hardware
IDEAS '17: Proceedings of the 21st International Database Engineering & Applications SymposiumBig Data is currently conceptualized as data whose volume, variety or velocity impose significant difficulties in traditional techniques and technologies. Big Data Warehousing is emerging as a new concept for Big Data analytics. In this context, SQL-on-...
Take me to SSD: a hybrid block-selection method on HDFS based on storage type
SAC '16: Proceedings of the 31st Annual ACM Symposium on Applied ComputingAs the era of Big-data has risen, the importance of big data technologies is also increasing day by day. Especially, Hadoop has become a critical part of the overall Big-data system because of its ability to store, process, and analyze thousands of ...
The Stratosphere platform for big data analytics
We present Stratosphere, an open-source software stack for parallel data analysis. Stratosphere brings together a unique set of features that allow the expressive, easy, and efficient programming of analytical applications at very large scale. ...
Comments