This textbook serves as an introduction to fault tolerance, intended for upper-division undergraduate students, graduate students, and practicing engineers who need an overview of the field. Readers will develop skills in modeling and evaluating fault-tolerant architectures in terms of reliability, availability, and safety. They will gain a thorough understanding of fault-tolerant computers, including both the theory of how to design and evaluate them and the practical knowledge of achieving fault tolerance in electronic, communication, and software systems. Coverage includes fault-tolerance techniques based on hardware, software, information, and time redundancy. The content is designed to be highly accessible and includes numerous examples and exercises. Solutions and PowerPoint slides are available for instructors.
Index Terms
- Fault-Tolerant Design