ABSTRACT
Azure Data Lake Store (ADLS) is a fully managed, elastic, scalable, and secure file system that supports Hadoop Distributed File System (HDFS) and Cosmos semantics. It is specifically designed and optimized for a broad spectrum of Big Data analytics that depend on a very high degree of parallel reads and writes, as well as collocation of compute and data for high-bandwidth, low-latency access. It brings together key components and features of HDFS and of Microsoft's Cosmos file system, long used by internal customers at Microsoft, and is a unified file storage solution for analytics on Azure. Internal and external workloads run on this unified platform. Distinguishing aspects of ADLS include its design for handling multiple storage tiers, exabyte scale, and comprehensive security and data sharing features. We present an overview of ADLS architecture, design points, and performance.
Index Terms
- Azure Data Lake Store: A Hyperscale Distributed File Service for Big Data Analytics