DOI: 10.1145/3035918.3056100
research-article
Open Access

Azure Data Lake Store: A Hyperscale Distributed File Service for Big Data Analytics

Published: 09 May 2017

ABSTRACT

Azure Data Lake Store (ADLS) is a fully managed, elastic, scalable, and secure file system that supports Hadoop Distributed File System (HDFS) and Cosmos semantics. It is specifically designed and optimized for a broad spectrum of Big Data analytics that depend on a very high degree of parallel reads and writes, as well as collocation of compute and data for high-bandwidth, low-latency access. It brings together key components and features of HDFS and of Microsoft's Cosmos file system, long used by internal customers at Microsoft, and is a unified file storage solution for analytics on Azure. Internal and external workloads run on this unified platform. Distinguishing aspects of ADLS include its design for handling multiple storage tiers, exabyte scale, and comprehensive security and data sharing features. We present an overview of ADLS architecture, design points, and performance.

        Published in

        SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data
        May 2017
        1810 pages
        ISBN: 9781450341974
        DOI: 10.1145/3035918

        Copyright © 2017 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        Overall Acceptance Rate: 785 of 4,003 submissions, 20%