DOI: 10.1145/2493123.2462910
research-article

kMemvisor: flexible system wide memory mirroring in virtual environments

Published: 17 June 2013

ABSTRACT

Today's commercial cloud service providers must guarantee availability with an annual uptime percentage of at least 99.95%. Meanwhile, memory errors are becoming the norm rather than the exception as memory density and capacity grow in cloud applications, so uncorrected DRAM errors can be a significant source of system downtime. To address this increasingly important concern, both hardware and software memory mirroring technologies have been studied as ways to provide high memory availability. However, hardware solutions such as mirrored memory, which doubles the number of chips, need dedicated and costly peripheral hardware, while existing software approaches, such as virtual machine checkpointing, reduce the expense but incur high overhead in practical use. In this paper, we present a novel system called kMemvisor that provides system-wide, high-availability memory mirroring. It is a software approach that achieves flexible multi-granularity memory mirroring via virtualization and binary translation. Specifically, kMemvisor first creates a backup space of the same size as the specified memory for applications or virtual machines. Memory areas can be flexibly set to be mirrored or not mirrored, at granularities ranging from application level to system-wide. Then, all memory write instructions in the native memory space are captured and instrumented with mirror memory write instructions that synchronize the data to the backup space. This instruction-level memory synchronization reduces backup overhead and lowers the probability of data loss compared with traditional software approaches. When a memory failure happens, kMemvisor can use data from the backup space to recover. The results show that kMemvisor incurs 55% overhead in the worst case of system-wide high availability and 30% on average for real-world applications, which outperforms state-of-the-art software approaches even in the worst case.
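
The write-mirroring idea described in the abstract can be illustrated with a toy model. The sketch below is not the kMemvisor implementation: it assumes, purely for illustration, that the backup space sits at a fixed offset equal to the size of the native space, and the class and method names (`MirroredMemory`, `write`, `corrupt`, `recover`) are hypothetical. It only shows how pairing every native write with a mirror write keeps a recoverable copy.

```python
class MirroredMemory:
    """Toy model of instruction-level mirroring (illustrative only).

    Assumption: native space is [0, size) and backup space is
    [size, 2 * size), so the mirror of address a is a + size.
    """

    def __init__(self, size: int) -> None:
        self.size = size
        self.cells = bytearray(2 * size)  # native half followed by backup half

    def _check(self, addr: int) -> None:
        if not 0 <= addr < self.size:
            raise IndexError("address outside native space")

    def write(self, addr: int, value: int) -> None:
        """Instrumented write: the native store is followed by a mirror store."""
        self._check(addr)
        self.cells[addr] = value              # original write
        self.cells[addr + self.size] = value  # mirror write added by instrumentation

    def read(self, addr: int) -> int:
        self._check(addr)
        return self.cells[addr]

    def corrupt(self, addr: int) -> None:
        """Simulate an uncorrected error by flipping every bit of one native byte."""
        self._check(addr)
        self.cells[addr] ^= 0xFF

    def recover(self, addr: int) -> int:
        """Restore a native byte from its mirror and return the restored value."""
        self._check(addr)
        self.cells[addr] = self.cells[addr + self.size]
        return self.cells[addr]


mem = MirroredMemory(16)
mem.write(3, 42)
mem.corrupt(3)           # native copy is now wrong
restored = mem.recover(3)  # backup copy is still intact
```

Because the mirror is updated on every write rather than at periodic checkpoints, the backup in this model never lags the native copy, which is the property the abstract credits with lowering the probability of data loss.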

          • Published in

            HPDC '13: Proceedings of the 22nd international symposium on High-performance parallel and distributed computing
            June 2013
            276 pages
            ISBN:9781450319102
            DOI:10.1145/2493123
            • General Chairs:
            • Manish Parashar,
            • Jon Weissman,
            • Program Chairs:
            • Dick Epema,
            • Renato Figueiredo

            Copyright © 2013 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States


            Acceptance Rates

            HPDC '13 paper acceptance rate: 20 of 131 submissions, 15%
            Overall acceptance rate: 166 of 966 submissions, 17%
