The Snowflake Elastic Data Warehouse

Authors:
Benoit Dageville, Thierry Cruanes, Marcin Zukowski, Vadim Antonov, Artin Avanes, Jon Bock, Jonathan Claybaugh, Daniel Engovatov, Martin Hentschel, Jiansheng Huang, Allison W. Lee, Ashish Motivala, Abdul Q. Munir, Steven Pelley, Peter Povinec, Greg Rahn, Spyridon Triantafyllis, Philipp Unterbrunner

Snowflake Computing, San Mateo, CA, USA

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data, June 2016, Pages 215–226
https://doi.org/10.1145/2882903.2903741

Published: 14 June 2016

ABSTRACT

We live in the golden age of distributed computing. Public cloud platforms now offer virtually unlimited compute and storage resources on demand. At the same time, the Software-as-a-Service (SaaS) model brings enterprise-class systems to users who previously could not afford such systems due to their cost and complexity. Alas, traditional data warehousing systems are struggling to fit into this new environment. For one thing, they have been designed for fixed resources and are thus unable to leverage the cloud's elasticity. For another thing, their dependence on complex ETL pipelines and physical tuning is at odds with the flexibility and freshness requirements of the cloud's new types of semi-structured data and rapidly evolving workloads. We decided a fundamental redesign was in order. Our mission was to build an enterprise-ready data warehousing solution for the cloud. The result is the Snowflake Elastic Data Warehouse, or "Snowflake" for short. Snowflake is a multi-tenant, transactional, secure, highly scalable and elastic system with full SQL support and built-in extensions for semi-structured and schema-less data. The system is offered as a pay-as-you-go service in the Amazon cloud. Users upload their data to the cloud and can immediately manage and query it using familiar tools and interfaces. Implementation began in late 2012 and Snowflake has been generally available since June 2015. Today, Snowflake is used in production by a growing number of small and large organizations alike. The system runs several million queries per day over multiple petabytes of data.

In this paper, we describe the design of Snowflake and its novel multi-cluster, shared-data architecture. The paper highlights some of the key features of Snowflake: extreme elasticity and availability, semi-structured and schema-less data, time travel, and end-to-end security. It concludes with lessons learned and an outlook on ongoing work.

Index Terms

The Snowflake Elastic Data Warehouse
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Online analytical processing engines
      2. Parallel and distributed DBMSs
        Relational parallel and distributed DBMSs
2. Networks
  1. Network services
    1. Cloud computing

Published in
SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data
June 2016
2300 pages
ISBN: 9781450335317
DOI: 10.1145/2882903
General Chairs:
Fatma Özcan (IBM Research, USA), Georgia Koutrika (HP Labs, USA)
Program Chair:
Sam Madden (Massachusetts Institute of Technology, USA)
Copyright © 2016 Owner/Author
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data warehousing
database as a service
multi-cluster shared data architecture
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 785 of 4,003 submissions, 20%