Abstract
Distributed RAM storage aggregates the RAM of servers in data center networks (DCN) to provide extremely high I/O performance for large-scale cloud systems. For fast recovery from storage server failures, MemCube [53] exploits the proximity of the BCube network to limit recovery traffic to the recovery servers' 1-hop neighborhood. However, this design applies only to the symmetric BCube(n, k) network with n^(k+1) server nodes, and its recovery performance is suboptimal due to congestion and contention.
To address these problems, in this article we propose CubeX, which (i) generalizes the "1-hop" principle of MemCube to arbitrary cube-based networks and (ii) improves the throughput and recovery performance of RAM-based key-value (KV) storage via cross-layer optimizations. At the core of CubeX is leveraging the glocality (= globality + locality) of cube-based networks: CubeX scatters backup data across a large number of disks globally distributed throughout the cube, yet restricts all recovery traffic to the small local range of each server node. Our evaluation shows that CubeX not only efficiently supports RAM-based KV storage on cube-based networks but also significantly outperforms both MemCube and RAMCloud in throughput and recovery time.
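To make the "1-hop" principle concrete, the following is a minimal illustrative sketch (not CubeX's actual implementation): in BCube(n, k), each server carries a (k+1)-digit base-n address, and two servers are 1-hop neighbors exactly when their addresses differ in a single digit. The hypothetical `scatter_backups` helper below spreads a server's backup segments round-robin across that neighborhood, so backups are dispersed widely while recovery traffic stays within one hop.

```python
def one_hop_neighbors(addr, n):
    """All servers one switch hop away from `addr` in BCube(n, k).

    `addr` is a (k+1)-tuple of base-n digits; a 1-hop neighbor differs
    from `addr` in exactly one digit (reached via that level's switch).
    """
    neighbors = []
    for level in range(len(addr)):
        for digit in range(n):
            if digit != addr[level]:
                nb = list(addr)
                nb[level] = digit
                neighbors.append(tuple(nb))
    return neighbors


def scatter_backups(addr, n, num_segments):
    """Assign backup segments round-robin over the 1-hop neighborhood,
    so every neighbor receives a near-equal share of recovery work."""
    nbs = one_hop_neighbors(addr, n)
    return {seg: nbs[seg % len(nbs)] for seg in range(num_segments)}


# Example: in BCube(4, 1) there are 4^2 = 16 servers and each server
# has (4 - 1) * 2 = 6 one-hop neighbors.
placement = scatter_backups((0, 0), n=4, num_segments=12)
```

With 12 segments and 6 neighbors, each neighbor holds 2 segments, so after a failure all 6 can stream their shares in parallel within the failed server's 1-hop range, which is the intuition behind glocality.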
- AWS Team. Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region. Retrieved from http://aws.amazon.com/message/65648/.
- NiceX Lab. Ursa Block Store. Retrieved from http://nicexlab.com/ursa/.
- RedisLabs. Redis Official Website. Retrieved from http://redis.io/.
- Dhruba Borthakur. HDFS Architecture Guide. Retrieved from https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html.
- SOSP 2011 PC meeting. SOSP 2011 Reviews and Comments on RAMCloud. Retrieved from https://ramcloud.stanford.edu/wiki/pages/viewpage.action?pageId=8355860.
- Josh Norem. Samsung SSD 960 EVO (500GB). Retrieved from https://www.pcmag.com/review/358847/samsung-ssd-960-evo-500gb.
- Rich Miller. Failure Rates in Google Data Centers. Retrieved from http://www.datacenterknowledge.com/archives/2008/05/30/failure-rates-in-google-data-centers/.
- Dormando. Memcached Official Website. Retrieved from http://www.memcached.org/.
- Stephen Aiken, Dirk Grunwald, Andrew R. Pleszkun, and Jesse Willeke. 2003. A performance analysis of the iSCSI protocol. In Proceedings of the 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies (MSST'03). IEEE, 123--134.
- Ashok Anand, Chitra Muthukrishnan, Steven Kappes, Aditya Akella, and Suman Nath. 2010. Cheap and large CAMs for high performance data-intensive networked systems. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI'10). USENIX Association, 433--448. Retrieved from http://www.usenix.org/events/nsdi10/tech/full_papers/anand.pdf.
- David G. Andersen, Jason Franklin, Michael Kaminsky, Amar Phanishayee, Lawrence Tan, and Vijay Vasudevan. 2009. FAWN: A fast array of wimpy nodes. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP'09), Jeanna Neefe Matthews and Thomas E. Anderson (Eds.). ACM, 1--14.
- Antirez. [n.d.]. An update on the memcached/redis benchmark. Retrieved from http://antirez.com/post/update-on-memcached-redis-benchmark.html.
- Ed L. Cashin. 2005. Kernel korner: ATA over Ethernet: Putting hard drives on the LAN. Linux J. 2005, 134 (2005), 10.
- Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Michael Burrows, Tushar Chandra, Andrew Fikes, and Robert Gruber. 2006. Bigtable: A distributed storage system for structured data. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI'06). 205--218.
- Vijay Chidambaram, Thanumalayan Sankaranarayana Pillai, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2013. Optimistic crash consistency. In Proceedings of the 24th ACM Symposium on Operating Systems Principles. ACM, 228--243.
- Mosharaf Chowdhury, Srikanth Kandula, and Ion Stoica. 2013. Leveraging endpoint flexibility in data-intensive clusters. In Proceedings of the ACM SIGCOMM Conference (SIGCOMM'13), Dah Ming Chiu, Jia Wang, Paul Barford, and Srinivasan Seshan (Eds.). ACM, 231--242.
- Mosharaf Chowdhury, Matei Zaharia, Justin Ma, Michael I. Jordan, and Ion Stoica. 2011. Managing data transfers in computer clusters with orchestra. In ACM SIGCOMM Computer Communication Review, Vol. 41. ACM, 98--109.
- Biplob K. Debnath, Sudipta Sengupta, and Jin Li. 2010. FlashStore: High throughput persistent key-value store. Proc. VLDB Endow. 3, 2 (2010), 1414--1425.
- Biplob K. Debnath, Sudipta Sengupta, and Jin Li. 2011. SkimpyStash: RAM space skimpy key-value store on flash-based storage. In Proceedings of the SIGMOD Conference, Timos K. Sellis, Renée J. Miller, Anastasios Kementsietsidis, and Yannis Velegrakis (Eds.). ACM, 25--36.
- Aleksandar Dragojević, Dushyanth Narayanan, Miguel Castro, and Orion Hodson. 2014. FaRM: Fast remote memory. In Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI'14). 401--414.
- Bin Fan, David G. Andersen, and Michael Kaminsky. 2013. MemC3: Compact and concurrent memcache with dumber caching and smarter hashing. In Proceedings of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI'13). 371--384.
- Clayton S. Ferner and Kyungsook Y. Lee. 1992. Hyperbanyan networks: A new class of networks for distributed memory multiprocessors. IEEE Trans. Comput. 41, 3 (1992), 254--261.
- Armando Fox. 2002. Toward recovery-oriented computing. In Proceedings of the Conference on Very Large Data Bases (VLDB'02). 873--876.
- Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. 2003. The Google file system. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP'03). 29--43.
- Phillipa Gill, Navendu Jain, and Nachiappan Nagappan. 2011. Understanding network failures in data centers: Measurement, analysis, and implications. In Proceedings of the ACM SIGCOMM Conference (SIGCOMM'11), Srinivasan Keshav, Jörg Liebeherr, John W. Byers, and Jeffrey C. Mogul (Eds.). ACM, 350--361.
- Jim Gray and Gianfranco R. Putzolu. 1987. The 5 minute rule for trading memory for disk accesses and the 10 byte rule for trading memory for CPU time. In Proceedings of the ACM SIGMOD Conference, Umeshwar Dayal and Irving L. Traiger (Eds.). ACM Press, 395--398.
- Albert G. Greenberg, James R. Hamilton, Navendu Jain, Srikanth Kandula, Changhoon Kim, Parantap Lahiri, David A. Maltz, Parveen Patel, and Sudipta Sengupta. 2011. VL2: A scalable and flexible data center network. Commun. ACM 54, 3 (2011), 95--104.
- Chuanxiong Guo, Guohan Lu, Dan Li, Haitao Wu, Xuan Zhang, Yunfeng Shi, Chen Tian, Yongguang Zhang, and Songwu Lu. 2009. BCube: A high performance, server-centric network architecture for modular data centers. In Proceedings of the ACM SIGCOMM Conference (SIGCOMM'09). 63--74.
- Chuanxiong Guo, Lihua Yuan, Dong Xiang, Yingnong Dang, Ray Huang, Dave Maltz, Zhaoyi Liu, Vin Wang, Bin Pang, Hua Chen et al. 2015. Pingmesh: A large-scale system for data center network latency measurement and analysis. In ACM SIGCOMM Computer Communication Review, Vol. 45. ACM, 139--152.
- Tyler Harter, Dhruba Borthakur, Siying Dong, Amitanand Aiyer, Liyin Tang, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2014. Analysis of HDFS under HBase: A Facebook Messages case study. In Proceedings of the 12th USENIX Conference on File and Storage Technologies (FAST'14). 199--212.
- John H. Hartman and John K. Ousterhout. 1995. The Zebra striped network file system. ACM Trans. Comput. Syst. 13, 3 (1995), 274--310.
- Dean Hildebrand and Peter Honeyman. 2005. Exporting storage systems in a scalable manner with pNFS. In Proceedings of the 22nd IEEE/13th NASA Goddard Conference on Mass Storage Systems and Technologies (MSST'05). IEEE, 18--27.
- Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, and Benjamin Reed. 2010. ZooKeeper: Wait-free coordination for Internet-scale systems. In Proceedings of the USENIX Annual Technical Conference (ATC'10). 1--14.
- Edward K. Lee and Chandramohan A. Thekkath. 1996. Petal: Distributed virtual disks. In ACM SIGPLAN Notices, Vol. 31. ACM, 84--92.
- HuiBa Li, ShengYun Liu, YuXing Peng, DongSheng Li, HangJun Zhou, and XiCheng Lu. 2010. Superscalar communication: A runtime optimization for distributed applications. Sci. China Info. Sci. 53, 10 (2010), 1931--1946.
- Hyeontaek Lim, Dongsu Han, David G. Andersen, and Michael Kaminsky. 2014. MICA: A holistic approach to fast in-memory key-value storage. In Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI'14). 429--444.
- Guohan Lu, Chuanxiong Guo, Yulong Li, Zhiqiang Zhou, Tong Yuan, Haitao Wu, Yongqiang Xiong, Rui Gao, and Yongguang Zhang. 2011. ServerSwitch: A programmable and high performance platform for data center networks. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI'11).
- Xicheng Lu, Huaimin Wang, and Ji Wang. 2006. Internet-based virtual computing environment (iVCE): Concepts and architecture. Sci. China Ser. F: Info. Sci. 49, 6 (2006), 681--701.
- Xicheng Lu, Huaimin Wang, Ji Wang, and Jie Xu. 2013. Internet-based virtual computing environment: Beyond the data center as a computer. Future Gen. Comput. Syst. 29, 1 (2013), 309--322.
- Jeanna Neefe Matthews, Drew Roselli, Adam M. Costello, Randolph Y. Wang, and Thomas E. Anderson. 1997. Improving the performance of log-structured file systems with adaptive methods. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP'97). ACM.
- James Mickens, Edmund B. Nightingale, Jeremy Elson, Darren Gehring, Bin Fan, Asim Kadav, Vijay Chidambaram, Osama Khan, and Krishna Nareddy. 2014. Blizzard: Fast, cloud-scale block storage for cloud-oblivious applications. In Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI'14). 257--273.
- Subramanian Muralidhar, Wyatt Lloyd, Sabyasachi Roy, Cory Hill, Ernest Lin, Weiwen Liu, Satadru Pan, Shiva Shankar, Viswanath Sivakumar, Linpeng Tang et al. 2014. f4: Facebook's warm BLOB storage system. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI'14). 383--398.
- Edmund B. Nightingale, Jeremy Elson, Jinliang Fan, Owen Hofmann, Jon Howell, and Yutaka Suzue. 2012. Flat datacenter storage. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI'12).
- Diego Ongaro, Stephen M. Rumble, Ryan Stutsman, John K. Ousterhout, and Mendel Rosenblum. 2011. Fast crash recovery in RAMCloud. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP'11). 29--41.
- John K. Ousterhout, Parag Agrawal, David Erickson, Christos Kozyrakis, Jacob Leverich, David Mazières, Subhasish Mitra, Aravind Narayanan, Guru M. Parulkar, Mendel Rosenblum, Stephen M. Rumble, Eric Stratmann, and Ryan Stutsman. 2009. The case for RAMClouds: Scalable high-performance storage entirely in DRAM. Operat. Syst. Rev. 43, 4 (2009), 92--105.
- Mendel Rosenblum and John K. Ousterhout. 1992. The design and implementation of a log-structured file system. ACM Trans. Comput. Syst. 10, 1 (1992), 26--52.
- Stephen M. Rumble, Diego Ongaro, Ryan Stutsman, Mendel Rosenblum, and John K. Ousterhout. 2011. It's time for low latency. In Proceedings of the Workshop on Hot Topics in Operating Systems (HotOS'11).
- Bin Shao, Haixun Wang, and Yatao Li. 2013. Trinity: A distributed graph engine on a memory cloud. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM, 505--516.
- Ji-Yong Shin, Mahesh Balakrishnan, Tudor Marian, and Hakim Weatherspoon. 2013. Gecko: Contention-oblivious disk arrays for cloud storage. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST'13). 285--298.
- Michael Vrable, Stefan Savage, and Geoffrey M. Voelker. 2012. BlueSky: A cloud-backed file system for the enterprise. In Proceedings of the 10th USENIX Conference on File and Storage Technologies (FAST'12). USENIX Association.
- Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike Dahlin. 2013. Robustness in the Salus scalable block store. In Proceedings of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI'13). 357--370.
- Haitao Wu, Guohan Lu, Dan Li, Chuanxiong Guo, and Yongguang Zhang. 2009. MDCube: A high performance network structure for modular data center interconnection. In Proceedings of the International Conference on Emerging Networking Experiments and Technologies (CoNEXT'09), Jörg Liebeherr, Giorgio Ventre, Ernst W. Biersack, and Srinivasan Keshav (Eds.). ACM, 25--36.
- Yiming Zhang, Chuanxiong Guo, Dongsheng Li, Rui Chu, Haitao Wu, and Yongqiang Xiong. 2015. CubicRing: Enabling one-hop failure detection and recovery for distributed in-memory storage systems. In Proceedings of the 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI'15). 529--542.
Index Terms
- Leveraging Glocality for Fast Failure Recovery in Distributed RAM Storage