skip to main content
10.1145/2882903.2903741acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article
Open Access

The Snowflake Elastic Data Warehouse

Authors Info & Claims
Published:14 June 2016Publication History

ABSTRACT

We live in the golden age of distributed computing. Public cloud platforms now offer virtually unlimited compute and storage resources on demand. At the same time, the Software-as-a-Service (SaaS) model brings enterprise-class systems to users who previously could not afford such systems due to their cost and complexity. Alas, traditional data warehousing systems are struggling to fit into this new environment. For one thing, they have been designed for fixed resources and are thus unable to leverage the cloud's elasticity. For another thing, their dependence on complex ETL pipelines and physical tuning is at odds with the flexibility and freshness requirements of the cloud's new types of semi-structured data and rapidly evolving workloads. We decided a fundamental redesign was in order. Our mission was to build an enterprise-ready data warehousing solution for the cloud. The result is the Snowflake Elastic Data Warehouse, or "Snowflake" for short. Snowflake is a multi-tenant, transactional, secure, highly scalable and elastic system with full SQL support and built-in extensions for semi-structured and schema-less data. The system is offered as a pay-as-you-go service in the Amazon cloud. Users upload their data to the cloud and can immediately manage and query it using familiar tools and interfaces. Implementation began in late 2012 and Snowflake has been generally available since June 2015. Today, Snowflake is used in production by a growing number of small and large organizations alike. The system runs several million queries per day over multiple petabytes of data.

In this paper, we describe the design of Snowflake and its novel multi-cluster, shared-data architecture. The paper highlights some of the key features of Snowflake: extreme elasticity and availability, semi-structured and schema-less data, time travel, and end-to-end security. It concludes with lessons learned and an outlook on ongoing work.

References

  1. D. J. Abadi, S. R. Madden, and N. Hachem. Column-stores vs. row-stores: How different are they really? In Proc. SIGMOD, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. A. Ailamaki, D. J. DeWitt, M. D. Hill, and M. Skounakis. Weaving relations for cache performance. In Proc. VLDB, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. S. Alsubaiee et al. AsterixDB: A scalable, open source DBMS. PVLDB, 7(14):1905--1916, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Amazon Elastic Compute Cloud (EC2). burlaws.amazon.com/ec2/instance-types.Google ScholarGoogle Scholar
  5. Amazon Simple Storage Service (S3).burlaws.amazon.com/s3.Google ScholarGoogle Scholar
  6. Apache Cassandra. burlcassandra.apache.org.Google ScholarGoogle Scholar
  7. Apache Drill. burldrill.apache.org.Google ScholarGoogle Scholar
  8. Apache Hadoop. burlhadoop.apache.org.Google ScholarGoogle Scholar
  9. Apache Hive. burlhive.apache.org.Google ScholarGoogle Scholar
  10. Apache Parquet. burlparquet.apache.org.Google ScholarGoogle Scholar
  11. Apache Spark. burlspark.apache.org.Google ScholarGoogle Scholar
  12. AWS CloudHSM. burlaws.amazon.com/cloudhsm.Google ScholarGoogle Scholar
  13. E. Barker. NIST SP 800--57 -- Recommendation for Key Management -- Part 1: General (Revision 4), chapter 7. 2016.Google ScholarGoogle ScholarCross RefCross Ref
  14. J. Barr. AWS Import/Export Snowball -- Transfer 1 petabyte per week using Amazon-owned storage appliances. burlaws.amazon.com/blogs/aws/aws-importexport-snowball-transfer-1-petabyte%-per-week-using-amazon-owned-storage-appliances/, 2015.Google ScholarGoogle Scholar
  15. P. Boncz, M. Zukowski, and N. Nes. MonetDB/X100: Hyper-pipelining query execution. In Proc. CIDR, 2005.Google ScholarGoogle Scholar
  16. V. R. Borkar, M. J. Carey, and C. Li. Big data platforms: What's next? ACM Crossroads, 19(1):44--49, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. M. J. Cahill, U. Röhm, and A. D. Fekete. Serializable isolation for snapshot databases. In Proc. SIGMOD, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. B. Calder et al. Windows Azure Storage: A highly available storage service with strong consistency. In Proc. SOSP, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Cassandra Query Language (CQL). burlcassandra.apache.org/doc/cql3/CQL.html.Google ScholarGoogle Scholar
  20. Cloud Storage -- Google Cloud Platform. burlcloud.google.com/storage.Google ScholarGoogle Scholar
  21. Cloudera Impala. burlimpala.io.Google ScholarGoogle Scholar
  22. Couchbase N1QL. burlcouchbase.com/n1ql.Google ScholarGoogle Scholar
  23. Couchbase Server. burlcouchbase.com.Google ScholarGoogle Scholar
  24. D. J. DeWitt, A. Halverson, R. Nehme, S. Shankar, J. Aguilar-Saborit, A. Avanes, M. Flasza, and J. Gramling. Split query processing in Polybase. In Proc. SIGMOD, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. D. J. DeWitt, S. Madden, and M. Stonebraker. How to build a high-performance data warehouse. burldb.csail.mit.edu/madden/high_perf.pdf, 2006.Google ScholarGoogle Scholar
  26. D. Ferraiolo, D. R. Kuhn, and R. Chandramouli. Role-based access control. Artech House Publishers, 2003. Google ScholarGoogle ScholarCross RefCross Ref
  27. G. Graefe. Volcano: An extensible and parallel query evaluation system. IEEE TKDE, 6(1), 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. G. Graefe. The cascades framework for query optimization. Data Engineering Bulletin, 18, 1995.Google ScholarGoogle Scholar
  29. G. Graefe. Fast loads and fast queries. In Data Warehousing and Knowledge Discovery, volume 5691 of LNCS. Springer, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. A. Gupta et al. Amazon Redshift and the case for simpler data warehouses. In Proc. SIGMOD, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. D. Karger, E. Lehman, T. Leighton, R. Panigrahy, M. Levine, and D. Lewin. Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the world wide web. In Proc. STOC, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. J. Krueger, M. Grund, C. Tinnefeld, H. Plattner, A. Zeier, and F. Faerber. Optimizing write performance for read optimized databases. In Proc. DASFAA, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. S. Manegold, M. L. Kersten, and P. Boncz. Database architecture evolution: Mammals flourished long before dinosaurs became extinct. PVLDB, 2(2):1648--1653, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: Interactive analysis of web-scale datasets. PVLDB, 3(1--2):330--339, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Microsoft Analytics Platform System. burlwww.microsoft.com/en-us/server-cloud/products/analytics-platform-syste%m.Google ScholarGoogle Scholar
  36. Microsoft Azure Blob Storage. burlazure.microsoft.com/en-us/services/storage/blobs.Google ScholarGoogle Scholar
  37. Microsoft Azure SQL DW. burlazure.microsoft.com/en-us/services/sql-data-warehouse.Google ScholarGoogle Scholar
  38. G. Moerkotte. Small materialized aggregates: A light weight index structure for data warehousing. In Proc. VLDB, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. MongoDB. burlmongodb.com.Google ScholarGoogle Scholar
  40. J. K. Mullin. Optimal semijoins for distributed database systems. IEEE TSE, 16(5):558--560, 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. T. Neumann. Efficiently compiling efficient query plans for modern hardware. PVLDB, 4(9):539--550, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A comparison of approaches to large-scale data analysis. In Proc. SIGMOD, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Presto. burlprestodb.io.Google ScholarGoogle Scholar
  44. K. Sato. An inside look at Google BigQuery. burlcloud.google.com/files/BigQueryTechnicalWP.pdf, 2012.Google ScholarGoogle Scholar
  45. J. Schad, J. Dittrich, and J.-A. Quiané-Ruiz. Runtime measurements in the cloud: Observing, analyzing, and reducing variance. PVLDB, 3(1):460--471, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop distributed file system. In Proc. MSST, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. SQL DW Concurrency. burlazure.microsoft.com/en-us/documentation/articles/sql-data-warehouse-de%velop-concurrency.Google ScholarGoogle Scholar
  48. Stinger.next: Enterprise SQL at Hadoop scale. burlhortonworks.com/innovation/stinger.Google ScholarGoogle Scholar
  49. L. Sun, M. J. Franklin, S. Krishnan, and R. S. Xin. Fine-grained partitioning for aggressive data skipping. In Proc. SIGMOD, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. The Snowflake Elastic Data Warehouse

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in
        • Published in

          cover image ACM Conferences
          SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data
          June 2016
          2300 pages
          ISBN:9781450335317
          DOI:10.1145/2882903

          Copyright © 2016 Owner/Author

          Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 14 June 2016

          Check for updates

          Qualifiers

          • research-article

          Acceptance Rates

          Overall Acceptance Rate785of4,003submissions,20%

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader