Understanding System Reliability and Fault Tolerance: Concepts and Terminology

These notes introduce the concepts of system reliability and fault tolerance. They define reliability, availability, performability, safety, maintainability, and testability; explain the difference between faults, errors, and system failures; describe the characteristics of faults and the use of fault models in designing fault-tolerant systems; and introduce fault detection, masking redundancy, dynamic redundancy, and fault-tolerant systems.


Terminology and Concepts

Prof. Naga Kandasamy

1 Goals of Fault Tolerance

Dependability is an umbrella term encompassing the concepts of reliability, availability, performability, safety, maintainability, and testability. We will now define these terms in an intuitive fashion.

1.1 Reliability

The reliability R(t) of a system is a function of time, and is defined as the conditional probability that the system will perform correctly throughout the interval [t0, t], given that the system was performing correctly at time t0. So, reliability is the probability that the system will operate correctly throughout a complete interval of time.

Reliability is used to characterize systems in which even momentary periods of incorrect performance are unacceptable, or in which it is impossible to repair the system (e.g., a spacecraft where the time interval of concern may be years). In some other applications such as flight control, the time interval of concern may be a few hours.

Fault tolerance can improve a system’s reliability by keeping the system operational when hardware and software failures occur.

1.2 Availability

Availability A(t) is a function of time, and is defined as the probability that a system is operating correctly and is available to perform its functions at the instant of time t. Availability differs from reliability in that reliability depends on an interval of time, whereas availability is taken at an instant of time. So, a system can be highly available yet experience frequent periods of down time as long as the length of each down-time period is very short. The most common measure of availability is the expected fraction of time that a system is available to correctly perform its functions.

1.3 Performability

In many cases, it is possible to design systems that can continue to perform correctly after the occurrence of hardware/software failures, but at a diminished level of performance. So, the performability P(L, t) of a system is a function of time, and is defined as the probability that the system performance will be at, or above, some level L at the instant of time t. Performability differs from reliability in that reliability is a measure of the likelihood that all of the functions are performed correctly, whereas performability is a measure of the likelihood that some subset of the functions is performed correctly.
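To make this definition concrete, here is a minimal Monte Carlo sketch that estimates P(L, t); the system model (four identical units with exponentially distributed lifetimes, with performance defined as the fraction of units still working) and all parameter values are illustrative assumptions rather than anything specified in these notes.

```python
import random

def estimate_performability(level, t, n_units=4, lam=1e-3, trials=100_000):
    """Monte Carlo estimate of P(L, t): the probability that system
    performance is at or above `level` at time t.  Hypothetical model:
    n_units identical units with exponential lifetimes (rate lam), and
    performance equal to the fraction of units still working."""
    hits = 0
    for _ in range(trials):
        working = sum(1 for _ in range(n_units)
                      if random.expovariate(lam) > t)
        if working / n_units >= level:
            hits += 1
    return hits / trials

# Probability that at least half of the units are still working at t = 1000.
print(estimate_performability(level=0.5, t=1000.0))
```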

Graceful degradation is the ability of the system to automatically decrease its level of performance to compensate for hardware/software failures. Fault tolerance can provide graceful degradation and improve performability by eliminating failed hardware/software components, allowing performance at some reduced level.

∗These notes are adapted from: B. W. Johnson, Design and Analysis of Fault Tolerant Digital Systems, Addison Wesley, 1989.

Fig. 1: The full-adder circuit used to illustrate the distinction between faults and errors.

1.4 Safety

Safety S(t) is the probability that a system will either perform its functions correctly or will discontinue its functions in a manner that does not compromise the safety of any people associated with the system (fail-safe capability). Safety and reliability differ because reliability is the probability that a system will perform its functions correctly, whereas safety is the probability that a system will either perform its functions correctly or will discontinue the functions in a fail-safe manner.

1.5 Maintainability and Testability

Maintainability is a measure of the ease with which a system can be repaired, once it has failed. So, the maintainability M(t) is the probability that a failed system will be restored to an operational state within a specified period of time t. The restoration process includes locating and diagnosing the problem, repairing and reconfiguring the system, and bringing the system back to its operational state.
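As a concrete illustration (a standard closed form, though not derived in these notes): if the repair rate is assumed to be a constant μ, maintainability becomes

$$M(t) = 1 - e^{-\mu t}$$

so the probability that a failed system is restored within the allowed time t approaches 1 exponentially as t grows.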

2 Faults, Errors, and System Failures

We define the following basic terms:

  • A fault is a defect within the system.
  • An error is a deviation from the required operation of the (sub)system.
  • A system failure occurs when the system delivers a function that deviates from the specified one.

There is a cause-and-effect relationship between faults, errors, and failures. Faults result in errors, and errors can lead to system failures. In other words, errors are the effect of faults, and failures are the effect of errors.

The full-adder circuit shown in Fig. 1 can be used to illustrate the distinction between faults and errors. The inputs Ai, Bi, and Ci are the two operand bits and the carry bit, respectively. The truth table showing the correct performance for this circuit is shown in Fig. 2. If a short occurs between line L and the power supply line, resulting in line L becoming permanently fixed at a logic 1 value, then a fault (or defect) has occurred in the circuit. The fault is the actual short within the circuit.

Fig. 3 shows the truth table of the circuit that contains the physical fault. Comparing Figs. 2 and 3, we see that the circuit performs correctly for the input combinations 100, 101, 110, and 111, but not for 000, 001, 010, and 011. So, whenever an input pattern is supplied to the circuit that results in an incorrect output, an error has occurred; the error is the manifestation of the fault at the circuit outputs.
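This behavior is easy to reproduce in simulation. Because Fig. 1 is not reproduced here, the sketch below assumes that line L carries the input Ai; that assumption is consistent with the truth-table comparison above, since forcing Ai to 1 leaves the 1xx input patterns correct and corrupts the 0xx patterns.

```python
# A minimal sketch of the fault-vs-error distinction, assuming line L is
# the line carrying input Ai (the actual figure is not reproduced here).

def full_adder(a, b, cin, line_stuck_at_1=False):
    """Full adder returning (sum, carry-out).  If line_stuck_at_1 is True,
    the line carrying `a` is permanently fixed at logic 1 (the fault)."""
    if line_stuck_at_1:
        a = 1                                   # the defect: short to power
    s = a ^ b ^ cin                             # sum bit
    cout = (a & b) | (a & cin) | (b & cin)      # carry-out bit
    return s, cout

# The fault is always present; an error appears only for input patterns
# that propagate the fault to the outputs (here, the 0xx patterns).
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            good = full_adder(a, b, cin)
            bad = full_adder(a, b, cin, line_stuck_at_1=True)
            status = "error" if bad != good else "correct"
            print(f"A={a} B={b} C={cin}: fault-free={good} faulty={bad} -> {status}")
```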

Fig. 4: A taxonomy of possible failure-response strategies. (The figure divides system-reliability techniques into non-redundant systems, which rely on fault avoidance, and redundant systems, which employ fault detection, masking redundancy, or dynamic redundancy; fault-tolerant systems combine on-line detection/masking with reconfiguration, retry, and on-line repair.)

2.2 Fault Models

To design a fault-tolerant system, it is necessary to assume that the underlying faults behave according to some fault model. Even though, in practice, faults can be transient in nature and exhibit complex behavior, fault models make the problem of designing fault-tolerant systems more manageable and restrict our attention to a subset of all faults that can occur.

A commonly used fault model for capturing the behavior of faulty digital circuits is the logical stuck-at fault model, in which a line in the circuit is assumed to be permanently fixed at either logic 0 or logic 1.

2.3 Failure Response Strategies

A taxonomy of the primary techniques used to design systems to operate in a fault-prone environment is shown in Fig. 4. Broadly speaking, there are three primary methods: fault avoidance (e.g., shielding from electromagnetic interference (EMI)), fault masking (e.g., triple modular redundancy (TMR) systems), and fault tolerance.

Fault detection does not tolerate faults, but provides a warning that a fault has occurred. Masking redundancy (also called static redundancy) tolerates failures, but provides no warning of them. Dynamic redundancy covers those systems whose configuration can be dynamically changed in response to a fault, or in which masking redundancy is enhanced by on-line fault detection that allows on-line repair.
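As a concrete illustration of masking redundancy, here is a minimal software sketch of a TMR majority voter; the module outputs are hypothetical values, and a real TMR voter would typically be realized in hardware.

```python
def tmr_vote(a, b, c):
    """Bitwise majority vote over three module outputs; a single faulty
    module is masked, with no warning that a fault occurred."""
    return (a & b) | (a & c) | (b & c)

correct, faulty = 0b1011, 0b1111        # hypothetical module outputs
print(bin(tmr_vote(correct, correct, faulty)))   # 0b1011: fault masked
```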

3 Quantitative Evaluation of System Reliability

Reliability of a system R(t) is defined to be the probability of a component or system functioning correctly over a given time period [t0, t] under a given set of operating conditions. Consider a set of N identical components, all of which begin operating at the same time. At some time t, let the number of components operating correctly be No(t) and the number of failed components be Nf(t). The reliability of a component at time t is then given by

$$R(t) = \frac{N_o(t)}{N} = \frac{N_o(t)}{N_o(t) + N_f(t)}$$

which is simply the probability that a component has survived the interval [t0, t]. We can also define unreliability Q(t) as the probability that a system will not function correctly over a given period of time. This is also called the probability of failure. If the number of failed components at time t is given by Nf(t), then

$$Q(t) = \frac{N_f(t)}{N} = \frac{N_f(t)}{N_o(t) + N_f(t)}$$

From the definitions of reliability and unreliability, we obtain

$$Q(t) = 1 - R(t)$$
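These definitions can be checked with a small simulation; the sketch below assumes (purely hypothetically) exponentially distributed component lifetimes, computes the empirical R(t) = No(t)/N and Q(t) = Nf(t)/N, and confirms that they sum to one.

```python
import random

N, lam, t = 100_000, 1e-3, 500.0        # hypothetical population and rate
lifetimes = [random.expovariate(lam) for _ in range(N)]
N_o = sum(1 for life in lifetimes if life > t)   # operating correctly at t
N_f = N - N_o                                    # failed by time t
R, Q = N_o / N, N_f / N
print(f"R(t) = {R:.4f}, Q(t) = {Q:.4f}, R + Q = {R + Q}")
# R + Q sums to 1 (up to floating-point rounding), as required.
```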

If we write the reliability function as

$$R(t) = 1 - \frac{N_f(t)}{N}$$

and differentiate R(t) with respect to time, we obtain

$$\frac{dR(t)}{dt} = -\frac{1}{N}\frac{dN_f(t)}{dt}$$

which can be rewritten as

$$\frac{dN_f(t)}{dt} = -N\frac{dR(t)}{dt}$$

The derivative dNf(t)/dt is simply the instantaneous rate at which components are failing. At time t, there are still No(t) components operating correctly. Dividing dNf(t)/dt by No(t), we obtain

$$z(t) = \frac{1}{N_o(t)}\frac{dN_f(t)}{dt}$$

where z(t) is called the hazard function, hazard rate, or failure rate. The unit for the failure-rate function is failures per unit of time. Since No(t) = N R(t), the failure-rate function can also be written in terms of the reliability function R(t) as

$$z(t) = \frac{1}{N_o(t)}\frac{dN_f(t)}{dt} = \frac{1}{N_o(t)}\left[-N\frac{dR(t)}{dt}\right] = -\frac{1}{R(t)}\frac{dR(t)}{dt}$$

Rearranging, we obtain the following differential equation:

$$\frac{dR(t)}{dt} = -z(t)R(t)$$

The failure-rate function z(t) of electronic components exhibits a 'bathtub' curve, shown in Fig. 5, comprising three distinct regions: burn-in, useful life, and wear-out. It is typically assumed that the failure rate is constant during a component's useful life and is given by z(t) = λ.
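The bathtub shape can be sketched with a toy piecewise hazard function; the region boundaries and rates below are purely illustrative assumptions, chosen only to reproduce the qualitative shape of Fig. 5.

```python
def z(t, burn_in=100.0, wear_out=10_000.0, lam=1e-4):
    """Toy bathtub hazard: elevated, decreasing rate during burn-in; a
    constant rate lam during useful life; a rising rate during wear-out."""
    if t < burn_in:
        return lam * (1.0 + 9.0 * (1.0 - t / burn_in))   # 10*lam down to lam
    if t < wear_out:
        return lam                                        # useful life
    return lam * (1.0 + (t - wear_out) / 1_000.0)         # wear-out

for t in (0.0, 50.0, 100.0, 5_000.0, 12_000.0):
    print(f"z({t}) = {z(t):.2e}")
```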

So, with z(t) = λ, the differential equation becomes

$$\frac{dR(t)}{dt} = -\lambda R(t)$$

Solving the above equation gives us

$$R(t) = e^{-\lambda t} \quad (1)$$

The exponential relationship between reliability and time is known as the exponential failure law. Thus, the probability of a system working correctly throughout a given period of time decreases exponentially with the length of this time period. The exponential failure law is extremely valuable for the analysis of electronic components, and is by far the most commonly used relationship between reliability and time.
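A quick worked example of the exponential failure law, using an assumed useful-life failure rate of λ = 10⁻⁴ failures per hour (a hypothetical value):

```python
import math

lam = 1e-4                              # hypothetical failure rate, per hour
for hours in (100, 1000, 8760):         # 8760 hours = one year
    print(f"R({hours} h) = {math.exp(-lam * hours):.4f}")
# R(100 h) ≈ 0.9900, R(1000 h) ≈ 0.9048, R(8760 h) ≈ 0.4164: the chance of
# surviving a full year of continuous operation is well below one half.
```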

Mean time to repair. The mean time to repair (MTTR) is the average time taken to repair a failed system. Just as we describe the reliability of a system using its failure rate, we can quantify the 'repairability' of a system using its repair rate μ. The MTTR is 1/μ.

Mean time between failures. If a failed system can be repaired and made as good as new, then the mean time between failures (MTBF) is given by

$$MTBF = MTTF + MTTR \quad (3)$$

The availability of the system is the probability that the system will be functioning correctly at any given time. In other words, it is the fraction of time for which a system is operational.

$$\text{Availability} = \frac{\text{Time system is operational}}{\text{Total time}} = \frac{MTTF}{MTTF + MTTR} = \frac{MTTF}{MTBF}$$
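A quick worked example of these relationships, using hypothetical values MTTF = 2000 hours and MTTR = 1 hour:

```python
mttf, mttr = 2000.0, 1.0                # hypothetical values, in hours
mtbf = mttf + mttr                      # Eq. (3)
print(f"MTBF = {mtbf} h, availability = {mttf / mtbf:.6f}")
# availability ≈ 0.999500: the system is highly available even though it
# fails regularly, because each repair is short.
```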