Dremel: interactive analysis of web-scale datasets

Authors:
Sergey Melnik (Google, Inc.)
Andrey Gubarev (Google, Inc.)
Jing Jing Long (Google, Inc.)
Geoffrey Romer (Google, Inc.)
Shiva Shivakumar (Google, Inc.)
Matt Tolton (Google, Inc.)
Theo Vassilakis (Google, Inc.)

Proceedings of the VLDB Endowment, Volume 3, Issue 1-2, pp. 330–339
https://doi.org/10.14778/1920841.1920886

Published: 01 September 2010

Abstract

Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. In this paper, we describe the architecture and implementation of Dremel, and explain how it complements MapReduce-based computing. We present a novel columnar storage representation for nested records and discuss experiments on few-thousand node instances of the system.
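
The abstract names two ideas, a multi-level execution tree and a columnar data layout, without showing how they fit together. The toy sketch below is one way to picture the combination. It is not Dremel's implementation or API: the `Node` class, the `aggregate` method, the column names, and the three-leaf tree are all hypothetical and chosen only for illustration. Each leaf holds a shard stored column by column, reads only the requested column, and returns a partial (count, sum); inner nodes merge the partial results from their children on the way up to the root.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical names throughout; this is an illustration, not Dremel's API.
Columns = Dict[str, List[int]]  # column name -> values, one list per column


@dataclass
class Node:
    """A server in a toy execution tree; only leaves hold data."""
    children: List["Node"] = field(default_factory=list)
    shard: Columns = field(default_factory=dict)

    def aggregate(self, column: str) -> Tuple[int, int]:
        """Return the partial result (count, sum) for one column in this subtree."""
        if not self.children:
            # Columnar layout: touch only the requested column of the shard.
            values = self.shard.get(column, [])
            return len(values), sum(values)
        count = total = 0
        for child in self.children:
            # Inner nodes only merge partial aggregates from their children.
            child_count, child_total = child.aggregate(column)
            count += child_count
            total += child_total
        return count, total


def build_demo_tree() -> Node:
    """Build a root with two intermediate nodes over three leaf shards."""
    leaves = [
        Node(shard={"latency_ms": [10, 20], "bytes": [100, 200]}),
        Node(shard={"latency_ms": [30], "bytes": [300]}),
        Node(shard={"latency_ms": [40, 50, 60], "bytes": [400, 500, 600]}),
    ]
    return Node(children=[Node(children=leaves[:2]), Node(children=leaves[2:])])


if __name__ == "__main__":
    count, total = build_demo_tree().aggregate("latency_ms")
    print(count, total, total / count)
```

Because count and sum can be merged in any grouping, the average is computed once at the root from the merged pair, which is why each level of the tree only needs to pass a small partial result upward.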


Published in

Proceedings of the VLDB Endowment, Volume 3, Issue 1-2
September 2010
1658 pages
ISSN: 2150-8097
Publisher: VLDB Endowment