research-article

Selective replication: A lightweight technique for soft errors

Published: 01 January 2010

Abstract

Soft errors are an important challenge in contemporary microprocessors. Modern processors have caches and large memory arrays protected by parity or error detection and correction codes. However, today's failure rate is dominated by flip flops, latches, and the increasing sensitivity of combinational logic to particle strikes. Moreover, as Chip Multi-Processors (CMPs) become ubiquitous, meeting the FIT budget for new designs is becoming a major challenge.

Solutions based on replicating threads have been explored in depth; however, their high cost in performance and energy makes them unsuitable for current designs. Moreover, our studies based on a typical configuration for a modern processor show that focusing on the 5 most vulnerable structures can provide up to a 70% reduction in FIT rate. Therefore, full replication may overprotect the chip by reducing the FIT rate far below the budget.
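The overprotection argument can be illustrated with a small sketch. The per-structure FIT values, the structure names, and the budget below are made up for illustration and are not taken from the paper. The sketch also assumes, as an idealization, that protecting a structure removes its entire FIT contribution.

```python
def fit_reduction_top_k(fit_by_structure, k):
    """Fraction of total FIT removed by fully protecting the k largest contributors.

    Idealized assumption: protection removes a structure's whole contribution.
    """
    total = sum(fit_by_structure.values())
    top = sorted(fit_by_structure.values(), reverse=True)[:k]
    return sum(top) / total


def smallest_k_meeting_budget(fit_by_structure, budget):
    """Smallest number of top contributors to protect so the remaining FIT fits the budget."""
    contributions = sorted(fit_by_structure.values(), reverse=True)
    remaining = sum(contributions)
    if remaining <= budget:
        return 0  # already within budget, nothing to protect
    for k, fit in enumerate(contributions, start=1):
        remaining -= fit
        if remaining <= budget:
            return k
    return len(contributions)  # only reached if the budget is negative


# Hypothetical FIT contributions (arbitrary units), not measured data.
example_fit = {
    "issue_queue": 40,
    "reorder_buffer": 25,
    "register_file": 15,
    "load_store_queue": 10,
    "fetch_buffer": 5,
    "other": 5,
}
```

With these made-up numbers, protecting only the two largest contributors already brings the remaining FIT under a budget of 40 units, so protecting every structure would spend performance and energy on reliability the budget does not require.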

We propose Selective Replication, a lightweight, reconfigurable mechanism that achieves a high FIT reduction by protecting the most vulnerable instructions with minimal performance and energy impact. Low performance degradation is achieved by not requiring additional issue slots and by reissuing instructions only during the time window between when they become retirable and when they actually retire. Coverage can be reconfigured online by replicating only a subset of the instructions (the most vulnerable ones). An instruction's vulnerability is estimated from the area it occupies and the time it spends in the issue queue. By changing the vulnerability threshold, we can adjust the trade-off between coverage and performance loss.
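The threshold-based selection described above might look like the following minimal sketch. The abstract names the two inputs to the estimate (area and issue-queue time) but not how they are combined, so the product used here is an assumption. The `Instr` fields, the function names, and the sample values are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    area_bits: int  # hypothetical: storage bits the instruction occupies
    iq_cycles: int  # cycles the instruction spends in the issue queue


def vulnerability(instr):
    # Assumption: area multiplied by residency time. The source does not
    # state the exact formula, only that both quantities are used.
    return instr.area_bits * instr.iq_cycles


def select_for_replication(instrs, threshold):
    """Return the instructions whose estimated vulnerability meets the threshold."""
    return [i for i in instrs if vulnerability(i) >= threshold]


def coverage(instrs, threshold):
    """Share of total estimated vulnerability covered by the replicated subset."""
    total = sum(vulnerability(i) for i in instrs)
    covered = sum(vulnerability(i) for i in select_for_replication(instrs, threshold))
    return covered / total if total else 0.0


# Made-up instruction window for illustration only.
window = [Instr("add", 64, 2), Instr("load", 96, 30), Instr("mul", 64, 12)]
```

Raising the threshold shrinks the replicated subset, which lowers both coverage and the number of reissued instructions; lowering it to zero replicates everything. This is the knob the abstract refers to when it mentions adjusting the trade-off between coverage and performance loss.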

Results for an out-of-order processor configured similarly to the Intel® Core™ Micro-Architecture show that our scheme can achieve over 65% FIT reduction with less than 4% performance degradation and with small area and complexity overhead.

          Reviews

          Festus Gail Gray

          By selectively replicating only those instructions that have the highest probability of failing due to soft errors (caused by particle strikes, noise, electromagnetic interference, or electrostatic discharge), Vera et al. are able to reduce the failures in time (FIT) (number of errors in one billion device-hours of operation) by 65 percent, with less than four percent performance degradation and with one percent hardware overhead. Compared with the 100 percent reduction when using a competing method with performance degradation of 32 percent, the proposed technique provides a possible trade-off in situations involving noncritical applications, where infrequent errors are acceptable. The techniques employed are only applicable to the specific processor used in the study, but the principles, which are fairly self-evident, are likely to apply to other processors. The paper is also valuable as a tutorial in soft errors, including appropriate analysis techniques. In summary, Vera et al. validate an intuitively obvious anecdotal claim with real numbers, on a real processor.

          Online Computing Reviews Service
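As a worked illustration of the FIT definition used in the review, the relations below restate it as a formula and show what a 65 percent reduction would imply for mean time to failure. The MTTF relation assumes a constant failure rate, which is an assumption made here and not a claim from the paper; the subscripts "base" and "new" are labels introduced for this example.

```latex
\[
\mathrm{FIT} = \frac{\text{number of failures}}{\text{device-hours of operation}} \times 10^{9},
\qquad
\mathrm{MTTF} = \frac{10^{9}}{\mathrm{FIT}}\ \text{hours}
\]
\[
\mathrm{FIT}_{\text{new}} = (1 - 0.65)\,\mathrm{FIT}_{\text{base}} = 0.35\,\mathrm{FIT}_{\text{base}}
\quad\Longrightarrow\quad
\frac{\mathrm{MTTF}_{\text{new}}}{\mathrm{MTTF}_{\text{base}}} = \frac{1}{0.35} \approx 2.86
\]
```

Under that assumption, a 65 percent FIT reduction corresponds to roughly a 2.9 times longer expected time between soft-error failures.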


          • Published in

            ACM Transactions on Computer Systems  Volume 27, Issue 4
            December 2009
            69 pages
            ISSN:0734-2071
            EISSN:1557-7333
            DOI:10.1145/1658357

            Copyright © 2010 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 1 January 2010
            • Accepted: 1 October 2009
            • Revised: 1 September 2009
            • Received: 1 November 2008


            Qualifiers

            • research-article
            • Research
            • Refereed
