research-article

Read-Performance Optimization for Deduplication-Based Storage Systems in the Cloud

Authors:
Bo Mao (Xiamen University)
Hong Jiang (University of Nebraska-Lincoln)
Suzhen Wu (Xiamen University)
Yinjin Fu (National University of Defense Technology)
Lei Tian (University of Nebraska-Lincoln)

ACM Transactions on Storage, Volume 10, Issue 2, Article No. 6, pp. 1–22. https://doi.org/10.1145/2512348

Published: 01 March 2014


Abstract

Data deduplication has been shown to be an effective technique for reducing both the amount of data transferred over the network and the storage space consumed in cloud backup, archiving, and primary storage systems, such as virtual machine (VM) platforms. However, restore operations from a deduplicated backup can be significantly slower than restores from a backup without deduplication. The main reason is that, after deduplication, a file or block is split into multiple small data chunks that are often located on different disks, so a subsequent read operation can invoke many disk I/Os involving multiple disks, which significantly degrades read performance. While this problem has been largely ignored in the literature thus far, we argue that it now deserves significant attention in light of emerging cloud storage applications and the increasing popularity of VM platforms in the cloud. In a cloud storage or VM environment, a simple read request on the client side may translate into a restore operation if the data to be read, or a VM suspended by the user, was deduplicated when it was written to the cloud or the VM storage server. This is a likely scenario given the network bandwidth and storage capacity concerns in such an environment.
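The fragmentation effect described above can be illustrated with a toy model. The following is only a sketch under stated assumptions, not the system evaluated in the article: the `DedupStore` class, the fixed 4-byte chunk size, the SHA-1 fingerprints, and the single append-only chunk log are all hypothetical choices made for illustration. It counts how many separate contiguous runs of stored chunks must be fetched to read a file back, as a rough proxy for the number of positioning operations on disk.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; unrealistically small, chosen only for this toy example


class DedupStore:
    """Toy chunk store: each unique chunk is appended once to a log of slots."""

    def __init__(self):
        self.slot_of = {}   # fingerprint -> slot index in the append-only log
        self.recipes = {}   # file name -> ordered list of slot indices

    def write(self, name, data):
        recipe = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            fp = hashlib.sha1(chunk).hexdigest()
            if fp not in self.slot_of:           # new chunk: append it to the log
                self.slot_of[fp] = len(self.slot_of)
            recipe.append(self.slot_of[fp])      # duplicate: store a reference only
        self.recipes[name] = recipe

    def read_runs(self, name):
        """Count contiguous runs of slots needed to read the file back."""
        recipe = self.recipes[name]
        runs = 1 if recipe else 0
        for prev, cur in zip(recipe, recipe[1:]):
            if cur != prev + 1:                  # a gap forces a new positioning step
                runs += 1
        return runs


store = DedupStore()
store.write("a", b"AAAABBBBCCCCDDDD")   # all chunks are new and laid out in order
store.write("b", b"XXXXBBBBYYYYDDDD")   # shares two chunks with file "a"
print(store.read_runs("a"))  # 1
print(store.read_runs("b"))  # 4
```

In this model the first file is read in one sequential run, while the second file, which shares chunks with the first, needs four separate runs even though it is the same size.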

To address this problem, in this article we propose SAR, an SSD (solid-state drive)-Assisted Read scheme that exploits the high random-read performance of SSDs and the unique data-sharing characteristic of deduplication-based storage systems by storing in SSDs the unique data chunks that have a high reference count, a small size, and nonsequential characteristics. In this way, many read requests to HDDs are replaced by read requests to SSDs, which significantly improves the read performance of deduplication-based storage systems in the cloud. Extensive trace-driven and VM restore evaluations on a prototype implementation of SAR show that SAR significantly outperforms traditional deduplication-based and flash-based cache schemes in terms of average response time.
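The three selection criteria named in the abstract (high reference count, small size, nonsequential) could be expressed as a simple filter. The sketch below is a hypothetical illustration, not SAR's actual algorithm: the abstract gives no thresholds or ranking rule, so `MIN_REF_COUNT`, `MAX_SIZE`, the `Chunk` fields, and the greedy most-referenced-first fill are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    fingerprint: str
    size: int          # bytes
    ref_count: int     # number of files or blocks that reference this chunk
    sequential: bool   # True if the chunk is normally read together with its HDD neighbours


# Hypothetical thresholds; the abstract does not give concrete values.
MIN_REF_COUNT = 2
MAX_SIZE = 8 * 1024


def is_ssd_candidate(chunk):
    """Apply the three criteria: highly referenced, small, and nonsequential."""
    return (chunk.ref_count >= MIN_REF_COUNT
            and chunk.size <= MAX_SIZE
            and not chunk.sequential)


def select_for_ssd(chunks, ssd_capacity):
    """Greedily fill the SSD with candidates, most-referenced first (an assumed policy)."""
    chosen, used = [], 0
    candidates = sorted(filter(is_ssd_candidate, chunks),
                        key=lambda c: c.ref_count, reverse=True)
    for c in candidates:
        if used + c.size <= ssd_capacity:
            chosen.append(c.fingerprint)
            used += c.size
    return chosen


chunks = [
    Chunk("a", 4096, 10, False),   # shared, small, nonsequential: candidate
    Chunk("b", 4096, 1, False),    # referenced only once: stays on HDD
    Chunk("c", 65536, 20, False),  # too large: stays on HDD
    Chunk("d", 4096, 5, True),     # sequential: HDD already serves it well
    Chunk("e", 4096, 3, False),    # candidate
]
print(select_for_ssd(chunks, 8192))  # ['a', 'e']
```

The intuition behind such a filter is that small, scattered, heavily shared chunks are the ones that cost an HDD the most random I/O per byte, so moving them to an SSD gives the largest benefit per unit of SSD capacity.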



Published in
ACM Transactions on Storage, Volume 10, Issue 2
March 2014
86 pages
ISSN: 1553-3077
EISSN: 1553-3093
DOI: 10.1145/2600090
Editor: Darrell Long, University of California Santa Cruz, USA
Copyright © 2014 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 March 2014
- Accepted: 1 July 2013
- Revised: 1 June 2013
- Received: 1 December 2012
Author Tags
Storage systems
data deduplication
read performance
solid-state drive
virtual machine
Qualifiers
- research-article
- Research
- Refereed