research-article

Dremel: interactive analysis of web-scale datasets

Published: 01 September 2010

Abstract

Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. In this paper, we describe the architecture and implementation of Dremel, and explain how it complements MapReduce-based computing. We present a novel columnar storage representation for nested records and discuss experiments on few-thousand-node instances of the system.

  • Published in

    Proceedings of the VLDB Endowment, Volume 3, Issue 1-2
    September 2010
    1658 pages

    Publisher

    VLDB Endowment
