Abstract
Soft errors are an important challenge in contemporary microprocessors. Modern processors have caches and large memory arrays protected by parity or error detection and correction codes. However, today's failure rate is dominated by flip-flops, latches, and the increasing sensitivity of combinational logic to particle strikes. Moreover, as Chip Multi-Processors (CMPs) become ubiquitous, meeting the failures-in-time (FIT) budget for new designs is becoming a major challenge.
Solutions based on replicating threads have been explored extensively; however, their high cost in performance and energy makes them unsuitable for current designs. Moreover, our studies based on a typical configuration for a modern processor show that focusing on the five most vulnerable structures can provide up to a 70% reduction in FIT rate. Therefore, full replication may overprotect the chip by reducing the FIT rate far below budget.
We propose Selective Replication, a lightweight, reconfigurable mechanism that achieves a high FIT reduction by protecting the most vulnerable instructions with minimal performance and energy impact. Low performance degradation is achieved by not requiring additional issue slots and by reissuing instructions only during the time window between when they become retirable and when they actually retire. Coverage can be reconfigured online by replicating only a subset of the instructions (the most vulnerable ones). An instruction's vulnerability is estimated based on the area it occupies and the time it spends in the issue queue. By changing the vulnerability threshold, we can adjust the trade-off between coverage and performance loss.
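The threshold-based selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract says only that vulnerability depends on occupied area and issue-queue residency, so the area-times-time product used here is an assumption, as are the names `Instr`, `vulnerability`, `select_for_replication`, and `coverage`, and all the numbers in the sample trace.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    area_bits: int   # hypothetical: bits of issue-queue state the instruction occupies
    iq_cycles: int   # hypothetical: cycles the instruction waits in the issue queue


def vulnerability(instr: Instr) -> int:
    """Assumed proxy: area multiplied by issue-queue residency time."""
    return instr.area_bits * instr.iq_cycles


def select_for_replication(instrs, threshold):
    """Keep only instructions whose estimate meets or exceeds the threshold."""
    return [i for i in instrs if vulnerability(i) >= threshold]


def coverage(instrs, threshold):
    """Fraction of total estimated vulnerability covered by the selected subset."""
    total = sum(vulnerability(i) for i in instrs)
    covered = sum(vulnerability(i) for i in select_for_replication(instrs, threshold))
    return covered / total if total else 0.0


# Made-up trace, for illustration only.
trace = [
    Instr("add", 64, 2),
    Instr("load", 96, 30),
    Instr("mul", 64, 10),
    Instr("store", 96, 1),
]

for threshold in (0, 500, 1000):
    chosen = [i.name for i in select_for_replication(trace, threshold)]
    print(threshold, chosen, round(coverage(trace, threshold), 3))
```

Raising the threshold shrinks the replicated subset, and with it the replication overhead, at the cost of covering a smaller share of the estimated vulnerability. This is the coverage-versus-performance trade-off the abstract refers to.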
Results for an out-of-order processor configured similarly to the Intel® Core™ Micro-Architecture show that our scheme can achieve over 65% FIT reduction with less than 4% performance degradation and small area and complexity overhead.