Abstract
Conventional RAID solutions with fixed layouts partition large disk enclosures so that each RAID group uses its own disks exclusively. This achieves good performance isolation across the underlying disk groups, at the cost of disk under-utilization and slow RAID reconstruction after disk failures.
We propose RAID+, a new RAID construction mechanism that spreads both normal I/O and reconstruction workloads to a larger disk pool in a balanced manner. Unlike systems conducting randomized placement, RAID+ employs deterministic addressing enabled by the mathematical properties of mutually orthogonal Latin squares, based on which it constructs 3-D data templates mapping a logical data volume to uniformly distributed disk blocks across all disks. While the total read/write volume remains unchanged, with or without disk failures, many more disk drives participate in data service and disk reconstruction.
Our evaluation with a 60-drive disk enclosure using both synthetic and real-world workloads shows that RAID+ significantly speeds up data recovery while delivering better normal I/O performance and higher multi-tenant system throughput.
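To make the role of mutually orthogonal Latin squares (MOLS) more concrete, the following is a minimal sketch and not the RAID+ implementation. It uses the textbook construction L_k[i][j] = (k*i + j) mod n, which yields n-1 MOLS when n is prime. The function names (`mols`, `template_stripes`) and the way stripes are read off the stacked squares are assumptions made for illustration: the k values found at one cell of k stacked squares are treated as the disks holding one stripe, and row 0 is skipped because every square holds the same value there.

```python
from collections import Counter
from itertools import combinations, product

Square = list[list[int]]


def mols(n: int) -> list[Square]:
    """Return n-1 mutually orthogonal Latin squares of prime order n.

    Uses L_k[i][j] = (k*i + j) mod n for k = 1..n-1 (valid for prime n only).
    """
    if n < 2 or any(n % d == 0 for d in range(2, int(n ** 0.5) + 1)):
        raise ValueError("this simple construction assumes n is prime")
    return [
        [[(k * i + j) % n for j in range(n)] for i in range(n)]
        for k in range(1, n)
    ]


def is_latin(sq: Square) -> bool:
    """Every row and every column contains each symbol 0..n-1 exactly once."""
    full = set(range(len(sq)))
    return all(set(row) == full for row in sq) and all(
        set(col) == full for col in zip(*sq)
    )


def are_orthogonal(a: Square, b: Square) -> bool:
    """Superimposing a and b yields every ordered pair of symbols exactly once."""
    n = len(a)
    pairs = {(a[i][j], b[i][j]) for i, j in product(range(n), repeat=2)}
    return len(pairs) == n * n


def template_stripes(n: int, k: int) -> list[tuple[int, ...]]:
    """Hypothetical placement: stack the first k squares and read each cell
    (i, j) with i >= 1 as the k disks of one stripe. Row 0 is skipped because
    all squares agree there, which would put a stripe on a single disk."""
    squares = mols(n)[:k]
    return [
        tuple(sq[i][j] for sq in squares)
        for i in range(1, n)
        for j in range(n)
    ]


if __name__ == "__main__":
    n, k = 5, 3  # 5 disks, stripes of width 3 (illustrative values)
    stripes = template_stripes(n, k)
    load = Counter(disk for stripe in stripes for disk in stripe)
    print(stripes[:3])
    print(dict(load))  # number of blocks placed on each disk
```

Under these assumptions the mapping is fully deterministic, every stripe lands on k distinct disks, and every disk receives the same number of blocks, which is the kind of balance the abstract attributes to the 3-D data templates.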
Index Terms
- Determining Data Distribution for Large Disk Enclosures with 3-D Data Templates