ABSTRACT
Today's commercial cloud service providers require availability with an annual uptime of at least 99.95%. Meanwhile, as memory density and capacity grow in cloud applications, memory errors are becoming the norm rather than the exception, so uncorrected DRAM errors can be a significant source of system downtime. To address this increasingly important concern, both hardware and software memory mirroring technologies have been studied as ways to provide highly available memory. However, hardware solutions such as mirrored memory, which doubles the number of memory chips, need dedicated and costly peripheral hardware, while existing software approaches such as virtual machine checkpointing reduce the expense but incur high overhead in practical use. In this paper, we present a novel system called kMemvisor that provides system-wide, high-availability memory mirroring. It is a software approach that achieves flexible multi-granularity memory mirroring via virtualization and binary translation. Specifically, kMemvisor first creates a backup space of the same size as the specified memory for applications or virtual machines. Memory areas can be flexibly set as mirrored or not mirrored, at granularities ranging from application level to system-wide. Then, all memory write instructions in the native memory space are captured and instrumented with mirror write instructions that synchronize the data to the backup space. This instruction-level synchronization reduces backup overhead and lowers the probability of data loss compared with traditional software approaches. When a memory failure happens, kMemvisor can use data from the backup space to recover. The results show that kMemvisor incurs 55% overhead in the worst case of system-wide high availability and 30% on average for real-world applications, which outperforms state-of-the-art software approaches even in the worst case.
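The mirroring idea in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not kMemvisor's implementation: the class `MirroredMemory`, its methods, the per-byte mirroring flag, and the layout that places the backup space at a fixed offset directly after the native space are all hypothetical, and an explicit `write` method stands in for the instrumented store instructions that binary translation would produce.

```python
class MirroredMemory:
    """Toy model of write mirroring (hypothetical, for illustration only).

    Every store to a mirrored address in the native region is followed by
    an identical store to a same-sized backup region.
    """

    def __init__(self, size: int) -> None:
        self.size = size
        # Assumed layout: native space in [0, size), backup in [size, 2*size).
        self.mem = bytearray(2 * size)
        # Assumed per-byte flag; everything is mirrored by default.
        self.mirrored = [True] * size

    def set_mirrored(self, start: int, length: int, enabled: bool) -> None:
        """Mark a native address range as mirrored or not mirrored."""
        for addr in range(start, start + length):
            self.mirrored[addr] = enabled

    def write(self, addr: int, value: int) -> None:
        """Native store, plus the mirror store if the address is mirrored."""
        if not 0 <= addr < self.size:
            raise IndexError("address outside native space")
        self.mem[addr] = value
        if self.mirrored[addr]:
            self.mem[addr + self.size] = value

    def read(self, addr: int) -> int:
        """Load from the native region."""
        if not 0 <= addr < self.size:
            raise IndexError("address outside native space")
        return self.mem[addr]

    def recover(self, start: int, length: int) -> int:
        """Copy mirrored bytes back from backup; return how many were restored."""
        restored = 0
        for addr in range(start, start + length):
            if self.mirrored[addr]:
                self.mem[addr] = self.mem[addr + self.size]
                restored += 1
        return restored


ram = MirroredMemory(16)
ram.set_mirrored(8, 8, False)   # upper half is not mirrored
ram.write(2, 0x5A)              # mirrored write
ram.write(10, 0x7F)             # unmirrored write
ram.mem[2] ^= 0xFF              # simulate an uncorrected error in native space
ram.mem[10] ^= 0xFF
ram.recover(0, 16)
print(hex(ram.read(2)), hex(ram.read(10)))
```

In this model the mirrored byte at address 2 is restored from the backup copy, while the unmirrored byte at address 10 stays corrupted, which shows the trade-off behind choosing which areas to mirror.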
kMemvisor: flexible system wide memory mirroring in virtual environments
HPDC '13: Proceedings of the 22nd international symposium on High-performance parallel and distributed computing