Fault-Tolerant Systems

Hardcover
from $0.00

Author: Israel Koren

ISBN-10: 0120885255

ISBN-13: 9780120885251

Category: Systems Analysis and Design - Programming

There are many applications in which the reliability of the overall system must be far higher than the reliability of its individual components. In such cases, designers devise mechanisms and architectures that allow the system to either completely mask the effects of a component failure or recover from it so quickly that the application is not seriously affected. This is the work of fault-tolerant designers and their work is increasingly important and complex not only because of the...

Search in google:

There are many applications in which the reliability of the overall system must be far higher than the reliability of its individual components. In such cases, designers devise mechanisms and architectures that allow the system to either completely mask the effects of a component failure or recover from it so quickly that the application is not seriously affected. This is the work of fault-tolerant designers and their work is increasingly important and complex not only because of the increasing number of “mission critical” applications, but also because the diminishing reliability of hardware means that even systems for non-critical applications will need to be designed with fault-tolerance in mind. Reflecting the real-world challenges faced by designers of these systems, this book addresses fault tolerance design with a systems approach to both hardware and software. No other text on the market takes this approach, nor offers the comprehensive and up-to-date treatment Koren and Krishna provide. Students, designers and architects of high performance processors will value this comprehensive overview of the field.* The first book on fault tolerance design with a systems approach* Comprehensive coverage of both hardware and software fault tolerance, as well as information and time redundancy* Incorporated case studies highlight six different computer systems with fault-tolerance techniques implemented in their design* Available to lecturers is a complete ancillary package including online solutions manual for instructors and PowerPoint slides

Foreword     xiPreface     xiiiAcknowledgements     xviiAbout the Authors     xixPreliminaries     1Fault Classification     2Types of Redundancy     3Basic Measures of Fault Tolerance     4Traditional Measures     5Network Measures     6Outline of This Book     7Further Reading     9References     10Hardware Fault Tolerance     11The Rate of Hardware Failures     11Failure Rate, Reliability, and Mean Time to Failure     13Canonical and Resilient Structures     15Series and Parallel Systems     16Non-Series/Parallel Systems     17M-of-N Systems     20Voters     23Variations on N-Modular Redundancy     23Duplex Systems     27Other Reliability Evaluation Techniques     30Poisson Processes     30Markov Models     33Fault-Tolerance Processor-Level Techniques     36Watchdog Processor     37Simultaneous Multithreading for Fault Tolerance     39Byzantine Failures     41Byzantine Agreement with Message Authentication     46Further Reading     48Exercises     48References     53Information Redundancy     55Coding     56Parity Codes     57Checksum     64M-of-N Codes     65Berger Code     66Cyclic Codes     67Arithmetic Codes     74Resilient Disk Systems     79Raid Level 1     79Raid Level 2     81Raid Level 3     82Raid Level 4     83Raid Level 5     84Modeling Correlated Failures     84Data Replication     88Voting: Non-Hierarchical Organization     89Voting: Hierarchical Organization     95Primary-Backup Approach     96Algorithm-Based Fault Tolerance     99Further Reading     101Exercises     102References     106Fault-Tolerant Networks     109Measures of Resilience     110Graph-Theoretical Measures     110Computer Networks Measures     111Common Network Topologies and Their Resilience     112Multistage and Extra-Stage Networks     112Crossbar Networks     119Rectangular Mesh and Interstitial Mesh     121Hypercube Network     124Cube-Connected Cycles Networks     128Loop Networks     130Ad hoc Point-to-Point Networks     132Fault-Tolerant Routing     135Hypercube Fault-Tolerant Routing     136Origin-Based Routing in the Mesh     138Further Reading     141Exercises     142References     145Software Fault Tolerance     147Acceptance Tests     148Single-Version Fault Tolerance     149Wrappers     149Software Rejuvenation     152Data Diversity     155Software Implemented Hardware Fault Tolerance (SIHFT)     157N-Version Programming     160Consistent Comparison Problem     161Version Independence     162Recovery Block Approach     169Basic Principles     169Success Probability Calculation     169Distributed Recovery Blocks     171Preconditions, Postconditions, and Assertions     173Exception-Handling     173Requirements from Exception-Handlers     174Basics of Exceptions and Exception-Handling     175Language Support     177Software Reliability Models     178Jelinski-Moranda Model     178Littlewood-Verrall Model     179Musa-Okumoto Model     180Model Selection and Parameter Estimation     182Fault-Tolerant Remote Procedure Calls     182Primary-Backup Approach     182The Circus Approach     183Further Reading     184Exercises     186References     188Checkpointing     193What is Checkpointing?     195Why is Checkpointing Nontrivial?     197Checkpoint Level     197Optimal Checkpointing-An Analytical Model     198Time Between Checkpoints-A First-Order Approximation     200Optimal Checkpoint Placement     201Time Between Checkpoints-A More Accurate Model     202Reducing Overhead     204Reducing Latency     205Cache-Aided Rollback Error Recovery (CARER)     206Checkpointing in Distributed Systems     207The Domino Effect and Livelock     209A Coordinated Checkpointing Algorithm     210Time-Based Synchronization     211Diskless Checkpointing     212Message Logging     213Checkpointing in Shared-Memory Systems     217Bus-Based Coherence Protocol     218Directory-Based Protocol     219Checkpointing in Real-Time Systems     220Other Uses of Checkpointing     223Further Reading     223Exercises     224References     226Case Studies     229NonStop Systems     229Architecture     229Maintenance and Repair Aids     233Software     233Modifications to the NonStop Architecture     235Stratus Systems     236Cassini Command and Data Subsystem     238IBM G5     241IBM Sysplex     242Itanium     244Further Reading     246References     247Defect Tolerance in VLSI Circuits     249Manufacturing Defects and Circuit Faults     249Probability of Failure and Critical Area     251Basic Yield Models     253The Poisson and Compound Poisson Yield Models     254Variations on the Simple Yield Models     256Yield Enhancement Through Redundancy     258Yield Projection for Chips with Redundancy     259Memory Arrays with Redundancy     263Logic Integrated Circuits with Redundancy     270Modifying the Floorplan     272Further Reading     276Exercises     277References     281Fault Detection in Cryptographic Systems     285Overview of Ciphers     286Symmetric Key Ciphers     286Public Key Ciphers     295Security Attacks Through Fault Injection     296Fault Attacks on Symmetric Key Ciphers     297Fault Attacks on Public (Asymmetric) Key Ciphers     298Countermeasures     299Spatial and Temporal Duplication     300Error-Detecting Codes     300Are These Countermeasures Sufficient?     304Final Comment     307Further Reading     307Exercises     307References     308Simulation Techniques      311Writing a Simulation Program     311Parameter Estimation     315Point Versus Interval Estimation     315Method of Moments     316Method of Maximum Likelihood     318The Bayesian Approach to Parameter Estimation     322Confidence Intervals     324Variance Reduction Methods     328Antithetic Variables     328Using Control Variables     330Stratified Sampling     331Importance Sampling     333Random Number Generation     341Uniformly Distributed Random Number Generators     342Testing Uniform Random Number Generators     345Generating Other Distributions     349Fault Injection     355Types of Fault Injection Techniques     356Fault Injection Application and Tools     358Further Reading     358Exercises     359References     363Subject Index     365