There are many applications in which the reliability of the overall system must be far higher than the reliability of its individual components. In such cases, designers devise mechanisms and architectures that allow the system to either completely mask the effects of a component failure or recover from it so quickly that the application is not seriously affected. This is the work of fault-tolerant designers and their work is increasingly important and complex not only because of the...
There are many applications in which the reliability of the overall system must be far higher than the reliability of its individual components. In such cases, designers devise mechanisms and architectures that allow the system to either completely mask the effects of a component failure or recover from it so quickly that the application is not seriously affected. This is the work of fault-tolerant designers and their work is increasingly important and complex not only because of the increasing number of “mission critical” applications, but also because the diminishing reliability of hardware means that even systems for non-critical applications will need to be designed with fault-tolerance in mind. Reflecting the real-world challenges faced by designers of these systems, this book addresses fault tolerance design with a systems approach to both hardware and software. No other text on the market takes this approach, nor offers the comprehensive and up-to-date treatment Koren and Krishna provide. Students, designers and architects of high performance processors will value this comprehensive overview of the field.* The first book on fault tolerance design with a systems approach* Comprehensive coverage of both hardware and software fault tolerance, as well as information and time redundancy* Incorporated case studies highlight six different computer systems with fault-tolerance techniques implemented in their design* Available to lecturers is a complete ancillary package including online solutions manual for instructors and PowerPoint slides
Foreword xiPreface xiiiAcknowledgements xviiAbout the Authors xixPreliminaries 1Fault Classification 2Types of Redundancy 3Basic Measures of Fault Tolerance 4Traditional Measures 5Network Measures 6Outline of This Book 7Further Reading 9References 10Hardware Fault Tolerance 11The Rate of Hardware Failures 11Failure Rate, Reliability, and Mean Time to Failure 13Canonical and Resilient Structures 15Series and Parallel Systems 16Non-Series/Parallel Systems 17M-of-N Systems 20Voters 23Variations on N-Modular Redundancy 23Duplex Systems 27Other Reliability Evaluation Techniques 30Poisson Processes 30Markov Models 33Fault-Tolerance Processor-Level Techniques 36Watchdog Processor 37Simultaneous Multithreading for Fault Tolerance 39Byzantine Failures 41Byzantine Agreement with Message Authentication 46Further Reading 48Exercises 48References 53Information Redundancy 55Coding 56Parity Codes 57Checksum 64M-of-N Codes 65Berger Code 66Cyclic Codes 67Arithmetic Codes 74Resilient Disk Systems 79Raid Level 1 79Raid Level 2 81Raid Level 3 82Raid Level 4 83Raid Level 5 84Modeling Correlated Failures 84Data Replication 88Voting: Non-Hierarchical Organization 89Voting: Hierarchical Organization 95Primary-Backup Approach 96Algorithm-Based Fault Tolerance 99Further Reading 101Exercises 102References 106Fault-Tolerant Networks 109Measures of Resilience 110Graph-Theoretical Measures 110Computer Networks Measures 111Common Network Topologies and Their Resilience 112Multistage and Extra-Stage Networks 112Crossbar Networks 119Rectangular Mesh and Interstitial Mesh 121Hypercube Network 124Cube-Connected Cycles Networks 128Loop Networks 130Ad hoc Point-to-Point Networks 132Fault-Tolerant Routing 135Hypercube Fault-Tolerant Routing 136Origin-Based Routing in the Mesh 138Further Reading 141Exercises 142References 145Software Fault Tolerance 147Acceptance Tests 148Single-Version Fault Tolerance 149Wrappers 149Software Rejuvenation 152Data Diversity 155Software Implemented Hardware Fault Tolerance (SIHFT) 157N-Version Programming 160Consistent Comparison Problem 161Version Independence 162Recovery Block Approach 169Basic Principles 169Success Probability Calculation 169Distributed Recovery Blocks 171Preconditions, Postconditions, and Assertions 173Exception-Handling 173Requirements from Exception-Handlers 174Basics of Exceptions and Exception-Handling 175Language Support 177Software Reliability Models 178Jelinski-Moranda Model 178Littlewood-Verrall Model 179Musa-Okumoto Model 180Model Selection and Parameter Estimation 182Fault-Tolerant Remote Procedure Calls 182Primary-Backup Approach 182The Circus Approach 183Further Reading 184Exercises 186References 188Checkpointing 193What is Checkpointing? 195Why is Checkpointing Nontrivial? 197Checkpoint Level 197Optimal Checkpointing-An Analytical Model 198Time Between Checkpoints-A First-Order Approximation 200Optimal Checkpoint Placement 201Time Between Checkpoints-A More Accurate Model 202Reducing Overhead 204Reducing Latency 205Cache-Aided Rollback Error Recovery (CARER) 206Checkpointing in Distributed Systems 207The Domino Effect and Livelock 209A Coordinated Checkpointing Algorithm 210Time-Based Synchronization 211Diskless Checkpointing 212Message Logging 213Checkpointing in Shared-Memory Systems 217Bus-Based Coherence Protocol 218Directory-Based Protocol 219Checkpointing in Real-Time Systems 220Other Uses of Checkpointing 223Further Reading 223Exercises 224References 226Case Studies 229NonStop Systems 229Architecture 229Maintenance and Repair Aids 233Software 233Modifications to the NonStop Architecture 235Stratus Systems 236Cassini Command and Data Subsystem 238IBM G5 241IBM Sysplex 242Itanium 244Further Reading 246References 247Defect Tolerance in VLSI Circuits 249Manufacturing Defects and Circuit Faults 249Probability of Failure and Critical Area 251Basic Yield Models 253The Poisson and Compound Poisson Yield Models 254Variations on the Simple Yield Models 256Yield Enhancement Through Redundancy 258Yield Projection for Chips with Redundancy 259Memory Arrays with Redundancy 263Logic Integrated Circuits with Redundancy 270Modifying the Floorplan 272Further Reading 276Exercises 277References 281Fault Detection in Cryptographic Systems 285Overview of Ciphers 286Symmetric Key Ciphers 286Public Key Ciphers 295Security Attacks Through Fault Injection 296Fault Attacks on Symmetric Key Ciphers 297Fault Attacks on Public (Asymmetric) Key Ciphers 298Countermeasures 299Spatial and Temporal Duplication 300Error-Detecting Codes 300Are These Countermeasures Sufficient? 304Final Comment 307Further Reading 307Exercises 307References 308Simulation Techniques 311Writing a Simulation Program 311Parameter Estimation 315Point Versus Interval Estimation 315Method of Moments 316Method of Maximum Likelihood 318The Bayesian Approach to Parameter Estimation 322Confidence Intervals 324Variance Reduction Methods 328Antithetic Variables 328Using Control Variables 330Stratified Sampling 331Importance Sampling 333Random Number Generation 341Uniformly Distributed Random Number Generators 342Testing Uniform Random Number Generators 345Generating Other Distributions 349Fault Injection 355Types of Fault Injection Techniques 356Fault Injection Application and Tools 358Further Reading 358Exercises 359References 363Subject Index 365