research-article

VectorH: Taking SQL-on-Hadoop to the Next Level

Authors:
Andrei Costea

Actian Corp., Amsterdam, Netherlands

Actian Corp., Amsterdam, Netherlands
View Profile

,
Adrian Ionescu

Actian Corp., Amsterdam, Netherlands

Actian Corp., Amsterdam, Netherlands
View Profile

,
Bogdan Răducanu

Actian Corp., Amsterdam, Netherlands

Actian Corp., Amsterdam, Netherlands
View Profile

,
Michał Switakowski

Actian Corp., Amsterdam, Netherlands

Actian Corp., Amsterdam, Netherlands
View Profile

,
Cristian Bârca

Actian Corp., Amterdam, Netherlands

Actian Corp., Amterdam, Netherlands
View Profile

,
Juliusz Sompolski

Actian Corp., Amsterdam, Netherlands

Actian Corp., Amsterdam, Netherlands
View Profile

,
Alicja Łuszczak

Actian Corp., Amsterdam, Netherlands

Actian Corp., Amsterdam, Netherlands
View Profile

,
Michał Szafrański

Actian Corp., Amsterdam, Netherlands

Actian Corp., Amsterdam, Netherlands
View Profile

,
Giel de Nijs

Actian Corp., Amsterdam, Netherlands

Actian Corp., Amsterdam, Netherlands
View Profile

,
Peter Boncz

CWI, Amsterdam, Netherlands

CWI, Amsterdam, Netherlands
View Profile

SIGMOD '16: Proceedings of the 2016 International Conference on Management of DataJune 2016Pages 1105–1117https://doi.org/10.1145/2882903.2903742

Published:14 June 2016Publication History

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

Pages 1105–1117

ABSTRACT

Actian Vector in Hadoop (VectorH for short) is a new SQL-on-Hadoop system built on top of the fast Vectorwise analytical database system. VectorH achieves fault tolerance and storage scalability by relying on HDFS, and extends the state-of-the-art in SQL-on-Hadoop systems by instrumenting the HDFS replication policy to optimize read locality. VectorH integrates with YARN for workload management, achieving a high degree of elasticity. Even though HDFS is an append-only filesystem, and VectorH supports (update-averse) ordered tables, trickle updates are possible thanks to Positional Delta Trees (PDTs), a differential update structure that can be queried efficiently. We describe the changes made to single-server Vectorwise to turn it into a Hadoop-based MPP system, encompassing workload management, parallel query optimization and execution, HDFS storage, transaction processing and Spark integration. We evaluate VectorH against HAWQ, Impala, SparkSQL and Hive, showing orders of magnitude better performance.

References

A. Ailamaki, D. DeWitt, M. Hill, and M. Skounakis. Weaving relations for cache performance. In PVLDB, 2001. Google ScholarDigital Library
K. Anikiej. Multi-core parallelization of vectorized query execution. MSc thesis, VU University, 2010.Google Scholar
M. Armbrust, R. Xin, et al. Spark SQL: Relational data processing in Spark. In SIGMOD, 2015. Google ScholarDigital Library
P. Boncz, M. Zukowski, and N. Nes. MonetDB/X100: hyper-pipelining query execution. In CIDR, volume 5, 2005.Google Scholar
C. Bârcă. Dynamic Resource Management in Vectorwise on Hadoop. MSc thesis, VU University Amsterdam, 2014.Google Scholar
L. Chang, Z. Wang, T. Ma, L. Jian, L. Ma, A. Goldshuv, L. Lonergan, et al. HAWQ: a massively parallel processing SQL engine in hadoop. In SIGMOD, 2014. Google ScholarDigital Library
A. Costea and A. Ionescu. Query optimization and execution in Vectorwise MPP. MSc thesis, VU University, 2012.Google Scholar
A. Floratou, U. F. Minhas, and F. Özcan. SQL-on-Hadoop: Full circle back to shared-nothing database architectures. PVLDB, 7(12), 2014. Google ScholarDigital Library
A. Floratou, J. Patel, E. Shekita, and S. Tata. Column-oriented storage techniques for mapreduce. PVLDB, 4(7), 2011. Google ScholarDigital Library
G. Graefe. Encapsulation of parallelism in the Volcano query processing system, volume 19. 1990. Google ScholarDigital Library
S. Héman. Updating Compressed Column Stores. PhD thesis, VU University, 2015.Google Scholar
S. Héman, M. Zukowski, N. J. Nes, L. Sidirourgos, and P. Boncz. Positional update handling in column stores. In SIGMOD, 2010. Google ScholarDigital Library
Y. Huai, A. Chauhan, A. Gates, G. Hagleitner, E. Hanson, et al. Major technical advancements in Apache Hive. In SIGMOD, 2014. Google ScholarDigital Library
Y. Huai, S. Ma, R. Lee, O. O'Malley, and X. Zhang... table placement methods in clusters. PVLDB, 6(14), 2013. Google ScholarDigital Library
M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg. Quincy: fair scheduling for distributed computing clusters. In SOSP, 2009. Google ScholarDigital Library
M. Kornacker et al. Impala: A modern, open-source sql engine for hadoop. In CIDR, 2015.Google Scholar
P.-Å. Larson, C. Clinciu, E. Hanson, A. Oks, S. Price, S. Rangarajan, A. Surna, and Q. Zhou. SQL server column store indexes. In SIGMOD, 2011. Google ScholarDigital Library
S. Melnik, A. Gubarev, J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: interactive analysis of web-scale datasets. PVLDB, 3(1--2), 2010. Google ScholarDigital Library
T. Neumann. Efficiently compiling efficient query plans for modern hardware. PVLDB, 4(9), 2011. Google ScholarDigital Library
V. Raman et al. DB2 with BLU acceleration: So much more than just a column store. PVLDB, 6(11), 2013. Google ScholarDigital Library
W. Rödiger, T. Mühlbauer, A. Kemper, and T. Neumann. High-speed query processing over high-speed networks. PVLDB, 9(4), 2015. Google ScholarDigital Library
M. A. Soliman et al. Orca: a modular query optimizer architecture for big data. In SIGMOD, 2014. Google ScholarDigital Library
M. 'Switakowski, P. Boncz, and M. Zukowski. From cooperative scans to predictive buffer management. PVLDB, 5(12), 2012. Google ScholarDigital Library
S. Wanderman-Milne and N. Li. Runtime code generation in cloudera impala. DEBULL, 37(1), 2014.Google Scholar
S. Whoerl. Efficient relational main-memory query processing for Hadoop Parquet Nested Columnar storage with HyPer and Vectorwise. MSc thesis, CWI/LMU/TUM/U. Augsburg, 2014.Google Scholar
M. Zaharia, M. Chowdhury, M. Franklin, S. Shenker, and I. Stoica. Spark: cluster computing with working sets. In USENIX, volume 10, 2010. Google ScholarDigital Library
M. Zukowski. Balancing Vectorized Query Execution with Bandwidth-Optimized Storage. PhD thesis, 2009.Google Scholar
M. Zukowski, S. Héman, N. Nes, and P. Boncz. Super-scalar RAM-CPU cache compression. In ICDE, 2006. Google ScholarDigital Library

Index Terms

VectorH: Taking SQL-on-Hadoop to the Next Level
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
      2. Parallel and distributed DBMSs

Recommendations

Evaluating SQL-on-Hadoop for Big Data Warehousing on Not-So-Good Hardware
IDEAS '17: Proceedings of the 21st International Database Engineering & Applications Symposium

Big Data is currently conceptualized as data whose volume, variety or velocity impose significant difficulties in traditional techniques and technologies. Big Data Warehousing is emerging as a new concept for Big Data analytics. In this context, SQL-on-...
Read More
Take me to SSD: a hybrid block-selection method on HDFS based on storage type
SAC '16: Proceedings of the 31st Annual ACM Symposium on Applied Computing

As the era of Big-data has risen, the importance of big data technologies is also increasing day by day. Especially, Hadoop has become a critical part of the overall Big-data system because of its ability to store, process, and analyze thousands of ...
Read More
The Stratosphere platform for big data analytics

We present Stratosphere, an open-source software stack for parallel data analysis. Stratosphere brings together a unique set of features that allow the expressive, easy, and efficient programming of analytical applications at very large scale. ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data
June 2016
2300 pages
ISBN:9781450335317
DOI:10.1145/2882903
General Chairs:
Fatma Özcan
IBM Research, USA
,
Georgia Koutrika
HP Labs, USA
,
Program Chair:
Sam Madden
Massachusetts Institute of Technology, USA
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 14 June 2016
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
SQL-on-Hadoop
column stores
distributed systems
parallel database systems
query optimization
query processing
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate785of4,003submissions,20%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 16
  Total Citations
  View Citations
- 782
  Total Downloads
- Downloads (Last 12 months)15
- Downloads (Last 6 weeks)2
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

VectorH: Taking SQL-on-Hadoop to the Next Level

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

ABSTRACT

References

Cited By

Index Terms

Recommendations

Evaluating SQL-on-Hadoop for Big Data Warehousing on Not-So-Good Hardware

Take me to SSD: a hybrid block-selection method on HDFS based on storage type

The Stratosphere platform for big data analytics

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

VectorH: Taking SQL-on-Hadoop to the Next Level

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

ABSTRACT

References

Cited By

Index Terms

Recommendations

Evaluating SQL-on-Hadoop for Big Data Warehousing on Not-So-Good Hardware

Take me to SSD: a hybrid block-selection method on HDFS based on storage type

The Stratosphere platform for big data analytics

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media