research-article

Read-Performance Optimization for Deduplication-Based Storage Systems in the Cloud

Published: 01 March 2014

Abstract

Data deduplication has been demonstrated to be an effective technique in reducing the total data transferred over the network and the storage space in cloud backup, archiving, and primary storage systems, such as VM (virtual machine) platforms. However, the performance of restore operations from a deduplicated backup can be significantly lower than that without deduplication. The main reason lies in the fact that a file or block is split into multiple small data chunks that are often located in different disks after deduplication, which can cause a subsequent read operation to invoke many disk IOs involving multiple disks and thus degrade the read performance significantly. While this problem has been by and large ignored in the literature thus far, we argue that the time is ripe for us to pay significant attention to it in light of the emerging cloud storage applications and the increasing popularity of the VM platform in the cloud. This is because, in a cloud storage or VM environment, a simple read request on the client side may translate into a restore operation if the data to be read or a VM suspended by the user was previously deduplicated when written to the cloud or the VM storage server, a likely scenario considering the network bandwidth and storage capacity concerns in such an environment.
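The read-degradation mechanism described above can be illustrated with a minimal sketch (not from the article; all names are hypothetical): after deduplication, the chunks of one logical file may reside on different disks, so a single restore fans out into I/Os on many disks instead of one sequential read.

```python
# Illustrative sketch (hypothetical, not the paper's code): count the
# distinct disks a restore must touch under two chunk layouts.

def disks_touched(chunk_map, file_chunks):
    """Number of distinct disks visited when reading `file_chunks`."""
    return len({chunk_map[c] for c in file_chunks})

# Without deduplication, the file is laid out contiguously on one disk:
sequential_map = {c: "disk0" for c in ["a", "b", "c", "d"]}

# With deduplication, shared chunks were first written by earlier backups
# and ended up scattered across disks:
dedup_map = {"a": "disk0", "b": "disk2", "c": "disk1", "d": "disk3"}

print(disks_touched(sequential_map, ["a", "b", "c", "d"]))  # 1
print(disks_touched(dedup_map, ["a", "b", "c", "d"]))       # 4
```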

To address this problem, in this article we propose SAR, an SSD (solid-state drive)-Assisted Read scheme that effectively exploits the high random-read performance of SSDs and the unique data-sharing characteristic of deduplication-based storage systems by storing in SSDs the unique data chunks that have high reference counts, small sizes, and nonsequential access patterns. In this way, many read requests to HDDs are replaced by read requests to SSDs, significantly improving the read performance of deduplication-based storage systems in the cloud. Extensive trace-driven and VM-restore evaluations of a prototype implementation show that SAR significantly outperforms traditional deduplication-based and flash-based cache schemes in terms of average response time.
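The placement policy described above can be sketched as follows. This is a hypothetical illustration of the idea, not the paper's implementation: the thresholds, field names, and helper functions are all assumptions. Chunks that are highly shared, small, and read nonsequentially are placed in the SSD, and reads for those chunks are redirected away from the HDD.

```python
# Hypothetical sketch of an SAR-style placement policy: keep in the SSD
# the unique chunks with high reference count, small size, and
# nonsequential access, so that random reads for them avoid HDD seeks.
# Threshold values below are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Chunk:
    id: str
    size: int          # bytes
    ref_count: int     # how many files / VM images share this chunk
    sequential: bool   # read as part of a long sequential run?

def belongs_on_ssd(c: Chunk, min_refs: int = 4, max_size: int = 8 * 1024) -> bool:
    # High sharing + small + nonsequential: an HDD would pay a seek per
    # read, while an SSD serves such random reads cheaply.
    return c.ref_count >= min_refs and c.size <= max_size and not c.sequential

def read(chunk: Chunk, ssd_cache: set) -> str:
    # Reads are redirected to the SSD when the chunk was placed there.
    return "ssd" if chunk.id in ssd_cache else "hdd"

chunks = [
    Chunk("hot", 4096, ref_count=12, sequential=False),
    Chunk("cold", 4096, ref_count=1, sequential=False),
    Chunk("stream", 64 * 1024, ref_count=9, sequential=True),
]
ssd = {c.id for c in chunks if belongs_on_ssd(c)}
print([read(c, ssd) for c in chunks])  # ['ssd', 'hdd', 'hdd']
```

The point of the selection criteria is that sequential or rarely shared chunks gain little from flash, so the limited SSD capacity is spent only where HDD random-read latency dominates.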



Published in

  ACM Transactions on Storage, Volume 10, Issue 2 (March 2014), 86 pages
  ISSN: 1553-3077, EISSN: 1553-3093
  DOI: 10.1145/2600090
  Editor: Darrell Long

              Copyright © 2014 ACM


Publisher

  Association for Computing Machinery, New York, NY, United States

Publication History

  • Received: 1 December 2012
  • Revised: 1 June 2013
  • Accepted: 1 July 2013
  • Published: 1 March 2014
