Azure Data Lake Store: A Hyperscale Distributed File Service for Big Data Analytics

Authors:
Raghu Ramakrishnan (Microsoft, Redmond, WA, USA)
Baskar Sridharan (Microsoft, Redmond, WA, USA)
John R. Douceur (Microsoft, Redmond, WA, USA)
Pavan Kasturi (Microsoft, Redmond, WA, USA)
Balaji Krishnamachari-Sampath (Microsoft, Redmond, WA, USA)
Karthick Krishnamoorthy (Microsoft, Redmond, WA, USA)
Peng Li (Microsoft, Redmond, WA, USA)
Mitica Manu (Microsoft, Redmond, WA, USA)
Spiro Michaylov (Microsoft, Redmond, WA, USA)
Rogério Ramos (Microsoft, Redmond, WA, USA)
Neil Sharman (Microsoft, Redmond, WA, USA)
Zee Xu (Microsoft, Redmond, WA, USA)
Youssef Barakat (Microsoft, Redmond, WA, USA)
Chris Douglas (Microsoft, Redmond, WA, USA)
Richard Draves (Microsoft, Redmond, WA, USA)
Shrikant S. Naidu (Microsoft, Bengaluru, India)
Shankar Shastry (Microsoft, Bengaluru, India)
Atul Sikaria (Microsoft, Redmond, WA, USA)
Simon Sun (Microsoft, Redmond, WA, USA)
Ramarathnam Venkatesan (Microsoft, Redmond, WA, USA)

SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data, May 2017, Pages 51–63. https://doi.org/10.1145/3035918.3056100

Published: 09 May 2017

ABSTRACT

Azure Data Lake Store (ADLS) is a fully managed, elastic, scalable, and secure file system that supports Hadoop Distributed File System (HDFS) and Cosmos semantics. It is specifically designed and optimized for a broad spectrum of Big Data analytics that depend on a very high degree of parallel reads and writes, as well as collocation of compute and data for high-bandwidth, low-latency access. It brings together key components and features of HDFS and of Microsoft's Cosmos file system, long used by internal customers at Microsoft, and is a unified file storage solution for analytics on Azure. Internal and external workloads run on this unified platform. Distinguishing aspects of ADLS include its design for handling multiple storage tiers, exabyte scale, and comprehensive security and data sharing features. We present an overview of ADLS architecture, design points, and performance.

Index Terms

Azure Data Lake Store: A Hyperscale Distributed File Service for Big Data Analytics
1. Information systems
  1. Data management systems
    1. Information integration
2. Theory of computation
  1. Theory and algorithms for application domains

Published in
SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data
May 2017
1810 pages
ISBN:9781450341974
DOI:10.1145/3035918
General Chairs: Rada Chirkova (North Carolina State University, USA), Jun Yang (Duke University, USA)
Program Chair: Dan Suciu (University of Washington, USA)
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 9 May 2017
Author Tags
aws
azure
big data
cloud service
distributed file system
gce
hadoop
hdfs
map-reduce
storage
tiered storage
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 785 of 4,003 submissions, 20%