kMemvisor: flexible system wide memory mirroring in virtual environments

Authors:
Bin Wang, School of Software, Shanghai Jiao Tong University, Shanghai, China
Zhengwei Qi, School of Software, Shanghai Jiao Tong University, Shanghai, China
Haibing Guan, School of Software, Shanghai Jiao Tong University, Shanghai, China
Haoliang Dong, School of Software, Shanghai Jiao Tong University, Shanghai, China
Wei Sun, School of Software, Shanghai Jiao Tong University, Shanghai, China
Yaozu Dong, Intel China Software Center, Shanghai, China
HPDC '13: Proceedings of the 22nd international symposium on High-performance parallel and distributed computing, June 2013, Pages 251–262
https://doi.org/10.1145/2493123.2462910

Published: 17 June 2013

ABSTRACT

Today's commercial cloud service providers require availability with an annual uptime of at least 99.95%. Meanwhile, memory errors are becoming the norm rather than the exception as memory density and capacity grow in cloud applications, so uncorrected DRAM errors can be a significant source of system downtime. To address this increasingly important concern, both hardware and software memory mirroring technologies are being studied as ways to provide high memory availability. However, hardware solutions such as mirrored memory, which doubles the number of memory chips, need dedicated and costly peripheral hardware. Existing software approaches, i.e., virtual machine checkpointing, reduce the expense but incur high overhead in practical use. In this paper, we present a novel system called kMemvisor that provides system-wide, high-availability memory mirroring. It is a software approach that achieves flexible multi-granularity memory mirroring via virtualization and binary translation. Specifically, kMemvisor first creates a backup space of the same size as the specified memory for applications or virtual machines. Memory areas can be flexibly set to be mirrored or not mirrored, from the application level to system-wide. Then, all memory write instructions in the native memory space are captured and instrumented with mirror memory write instructions that synchronize the data in the backup space. This instruction-level memory synchronization reduces backup overhead and lowers the probability of data loss compared with traditional software approaches. When memory failures happen, kMemvisor can therefore use data from the backup space to recover. The results show that kMemvisor incurs 55% overhead in the worst case of system-wide high availability and 30% on average for real-world applications, which outperforms state-of-the-art software approaches even in the worst case.
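
The write-mirroring idea described in the abstract can be illustrated with a small toy model. This is only a sketch under stated assumptions, not kMemvisor's implementation: the real system instruments machine-level write instructions through binary translation, whereas this example models memory as a Python bytearray. The class name MirroredMemory and its methods (mirror, write, recover) are hypothetical names chosen for illustration.

```python
class MirroredMemory:
    """Toy model: every store to a mirrored range is repeated in a backup buffer.

    All names here are hypothetical; they do not come from kMemvisor.
    """

    def __init__(self, size):
        self.native = bytearray(size)   # the "native" memory space
        self.backup = bytearray(size)   # backup space of the same size
        self.mirrored = []              # half-open (start, end) ranges to mirror

    def mirror(self, start, end):
        """Mark [start, end) as mirrored and seed the backup with current data."""
        self.mirrored.append((start, end))
        self.backup[start:end] = self.native[start:end]

    def is_mirrored(self, addr):
        return any(start <= addr < end for start, end in self.mirrored)

    def write(self, addr, value):
        """Perform the native write, then the mirror write if the area is mirrored."""
        self.native[addr] = value
        if self.is_mirrored(addr):
            self.backup[addr] = value

    def recover(self, addr):
        """Restore one byte of native memory from the backup space."""
        if not self.is_mirrored(addr):
            raise ValueError(f"address {addr} is not mirrored")
        self.native[addr] = self.backup[addr]
        return self.native[addr]


mem = MirroredMemory(16)
mem.mirror(0, 8)          # mirror only the first 8 bytes
mem.write(3, 0xAB)        # inside the mirrored range: written twice
mem.write(12, 0xCD)       # outside the mirrored range: written once
mem.native[3] = 0x00      # simulate a memory error that bypasses write()
restored = mem.recover(3)
```

Because the backup is updated on every write rather than at periodic checkpoints, the backup copy in this model never lags behind the native copy for mirrored areas, which mirrors the abstract's argument about a lower probability of data loss.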

Published in
HPDC '13: Proceedings of the 22nd international symposium on High-performance parallel and distributed computing
June 2013
276 pages
ISBN: 9781450319102
DOI: 10.1145/2493123
Copyright © 2013 ACM
Author Tags
flexible memory mirroring
system-wide high availability