Understanding System Reliability and Fault Tolerance: Concepts and Terminology

These notes introduce the concepts of system reliability and fault tolerance. They define reliability, availability, performability, safety, maintainability, and testability; explain the difference between faults, errors, and system failures; describe the characteristics of faults and the use of fault models in designing fault-tolerant systems; and introduce fault detection, masking redundancy, dynamic redundancy, and fault-tolerant systems.


Terminology and Concepts

Prof. Naga Kandasamy

1 Goals of Fault Tolerance

Dependability is an umbrella term encompassing the concepts of reliability, availability, performability, safety, maintainability, and testability. We will now define these terms in an intuitive fashion.

1.1 Reliability

The reliability R(t) of a system is a function of time, and is defined as the conditional probability that the system will perform correctly throughout the interval [t0, t], given that the system was performing correctly at time t0. So, reliability is the probability that the system will operate correctly throughout a complete interval of time.

Reliability is used to characterize systems in which even momentary periods of incorrect performance are unacceptable, or in which it is impossible to repair the system (e.g., a spacecraft where the time interval of concern may be years). In some other applications such as flight control, the time interval of concern may be a few hours.

Fault tolerance can improve a system’s reliability by keeping the system operational when hardware and software failures occur.

1.2 Availability

Availability A(t) is a function of time, and is defined as the probability that a system is operating correctly and is available to perform its functions at the instant of time t. Availability differs from reliability in that reliability depends on an interval of time, whereas availability is taken at an instant of time. So, a system can be highly available yet experience frequent periods of down time as long as the length of each down-time period is very short. The most common measure of availability is the expected fraction of time that a system is available to correctly perform its functions.

1.3 Performability

In many cases, it is possible to design systems that can continue to perform correctly after the occurrence of hardware/software failures, but at a diminished level of performance. So, the performability P(L, t) of a system is a function of time, and is defined as the probability that the system performance will be at, or above, some level L at the instant of time t. Performability differs from reliability in that reliability is a measure of the likelihood that all of the functions are performed correctly, whereas performability is a measure of the likelihood that some subset of the functions is performed correctly.
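To make this definition concrete, here is a minimal Monte Carlo sketch that estimates P(L, t); the system model (four identical units with exponentially distributed lifetimes, with performance defined as the fraction of units still working) and all parameter values are illustrative assumptions rather than anything specified in these notes.

```python
import random

def estimate_performability(level, t, n_units=4, lam=1e-3, trials=100_000):
    """Monte Carlo estimate of P(L, t): the probability that system
    performance is at or above `level` at time t.  Hypothetical model:
    n_units identical units with exponential lifetimes (rate lam), and
    performance equal to the fraction of units still working."""
    hits = 0
    for _ in range(trials):
        working = sum(1 for _ in range(n_units)
                      if random.expovariate(lam) > t)
        if working / n_units >= level:
            hits += 1
    return hits / trials

# Probability that at least half of the units are still working at t = 1000.
print(estimate_performability(level=0.5, t=1000.0))
```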

Graceful degradation is the ability of the system to automatically decrease its level of performance to compensate for hardware/software failures. Fault tolerance can provide graceful degradation and improve performability by eliminating failed hardware/software components, allowing performance at some reduced level.

∗These notes are adapted from: B. W. Johnson, Design and Analysis of Fault Tolerant Digital Systems, Addison Wesley, 1989.

Fig. 1: The full-adder circuit used to illustrate the distinction between faults and errors.

1.4 Safety

Safety S(t) is the probability that a system will either perform its functions correctly or will discontinue its functions in a manner that does not compromise the safety of any people associated with the system (fail-safe capability). Safety and reliability differ because reliability is the probability that a system will perform its functions correctly, whereas safety is the probability that a system will either perform its functions correctly or will discontinue the functions in a fail-safe manner.

1.5 Maintainability and Testability

Maintainability is a measure of the ease with which a system can be repaired, once it has failed. So, the maintainability M(t) is the probability that a failed system will be restored to an operational state within a specified period of time t. The restoration process includes locating and diagnosing the problem, repairing and reconfiguring the system, and bringing the system back to its operational state.
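As a concrete illustration (a standard closed form, though not derived in these notes): if the repair rate is assumed to be a constant μ, maintainability becomes

$$M(t) = 1 - e^{-\mu t}$$

so the probability that a failed system is restored within the allowed time t approaches 1 exponentially as t grows.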

2 Faults, Errors, and System Failures

We define the following basic terms:

  • A fault is a defect within the system.
  • An error is a deviation from the required operation of the (sub)system.
  • A system failure occurs when the system delivers a function that deviates from the specified one.

There is a cause-and-effect relationship between faults, errors, and failures. Faults result in errors, and errors can lead to system failures. In other words, errors are the effect of faults, and failures are the effect of errors.

The full-adder circuit shown in Fig. 1 can be used to illustrate the distinction between faults and errors. The inputs Ai, Bi, and Ci are the two operand bits and the carry bit, respectively. The truth table showing the correct performance for this circuit is shown in Fig. 2. If a short occurs between line L and the power supply line, resulting in line L becoming permanently fixed at a logic 1 value, then a fault (or defect) has occurred in the circuit. The fault is the actual short within the circuit.

Fig. 3 shows the truth table of the circuit that contains the physical fault. Comparing Figs. 2 and 3, we see that the circuit performs correctly for the input combinations 100, 101, 110, and 111, but not for 000, 001, 010, and 011. So, whenever an input pattern is supplied to the circuit that results in an incorrect output, an error has occurred; the error is the manifestation of the fault at the circuit outputs.
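This behavior is easy to reproduce in simulation. Because Fig. 1 is not reproduced here, the sketch below assumes that line L carries the input Ai; that assumption is consistent with the truth-table comparison above, since forcing Ai to 1 leaves the 1xx input patterns correct and corrupts the 0xx patterns.

```python
# A minimal sketch of the fault-vs-error distinction, assuming line L is
# the line carrying input Ai (the actual figure is not reproduced here).

def full_adder(a, b, cin, line_stuck_at_1=False):
    """Full adder returning (sum, carry-out).  If line_stuck_at_1 is True,
    the line carrying `a` is permanently fixed at logic 1 (the fault)."""
    if line_stuck_at_1:
        a = 1                                   # the defect: short to power
    s = a ^ b ^ cin                             # sum bit
    cout = (a & b) | (a & cin) | (b & cin)      # carry-out bit
    return s, cout

# The fault is always present; an error appears only for input patterns
# that propagate the fault to the outputs (here, the 0xx patterns).
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            good = full_adder(a, b, cin)
            bad = full_adder(a, b, cin, line_stuck_at_1=True)
            status = "error" if bad != good else "correct"
            print(f"A={a} B={b} C={cin}: fault-free={good} faulty={bad} -> {status}")
```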

Fig. 4: A taxonomy of possible failure-response strategies. (The figure divides system-reliability techniques into non-redundant systems, which rely on fault avoidance, and redundant systems, which employ fault detection, masking redundancy, or dynamic redundancy; fault-tolerant systems combine on-line detection/masking with reconfiguration, retry, and on-line repair.)

2.2 Fault Models

To design a fault-tolerant system, it is necessary to assume that the underlying faults behave according to some fault model. Even though, in practice, faults can be transient in nature and exhibit complex behavior, fault models make the problem of designing fault-tolerant systems more manageable and restrict our attention to a subset of all faults that can occur.

A commonly used fault model for capturing the behavior of faulty digital circuits is the logical stuck-at fault model, in which a line in the circuit is assumed to be permanently fixed at either logic 0 or logic 1.

2.3 Failure Response Strategies

A taxonomy of the primary techniques used to design systems to operate in a fault-prone environment is shown in Fig. 4. Broadly speaking, there are three primary methods: fault avoidance (e.g., shielding from electromagnetic interference (EMI)), fault masking (e.g., triple modular redundancy (TMR) systems), and fault tolerance.

Fault detection does not tolerate faults, but provides a warning that a fault has occurred. Masking redundancy (also called static redundancy) tolerates failures, but provides no warning of them. Dynamic redundancy covers those systems whose configuration can be dynamically changed in response to a fault, or in which masking redundancy is enhanced by on-line fault detection that allows on-line repair.
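As a concrete illustration of masking redundancy, here is a minimal software sketch of a TMR majority voter; the module outputs are hypothetical values, and a real TMR voter would typically be realized in hardware.

```python
def tmr_vote(a, b, c):
    """Bitwise majority vote over three module outputs; a single faulty
    module is masked, with no warning that a fault occurred."""
    return (a & b) | (a & c) | (b & c)

correct, faulty = 0b1011, 0b1111        # hypothetical module outputs
print(bin(tmr_vote(correct, correct, faulty)))   # 0b1011: fault masked
```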

3 Quantitative Evaluation of System Reliability

Reliability of a system R(t) is defined to be the probability of a component or system functioning correctly over a given time period [t0, t] under a given set of operating conditions. Consider a set of N identical components, all of which begin operating at the same time. At some time t, let the number of components operating correctly be No(t) and the number of failed components be Nf(t). The reliability of a component at time t is then given by

$$R(t) = \frac{N_o(t)}{N} = \frac{N_o(t)}{N_o(t) + N_f(t)}$$

which is simply the probability that a component has survived the interval [t0, t]. We can also define unreliability Q(t) as the probability that a system will not function correctly over a given period of time. This is also called the probability of failure. If the number of failed components at time t is given by Nf(t), then

$$Q(t) = \frac{N_f(t)}{N} = \frac{N_f(t)}{N_o(t) + N_f(t)}$$

From the definitions of reliability and unreliability, we obtain

$$Q(t) = 1 - R(t)$$
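These definitions can be checked with a small simulation; the sketch below assumes (purely hypothetically) exponentially distributed component lifetimes, computes the empirical R(t) = No(t)/N and Q(t) = Nf(t)/N, and confirms that they sum to one.

```python
import random

N, lam, t = 100_000, 1e-3, 500.0        # hypothetical population and rate
lifetimes = [random.expovariate(lam) for _ in range(N)]
N_o = sum(1 for life in lifetimes if life > t)   # operating correctly at t
N_f = N - N_o                                    # failed by time t
R, Q = N_o / N, N_f / N
print(f"R(t) = {R:.4f}, Q(t) = {Q:.4f}, R + Q = {R + Q}")
# R + Q sums to 1 (up to floating-point rounding), as required.
```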

If we write the reliability function as

$$R(t) = 1 - \frac{N_f(t)}{N}$$

and differentiate R(t) with respect to time, we obtain

$$\frac{dR(t)}{dt} = -\frac{1}{N}\frac{dN_f(t)}{dt}$$

which can be rewritten as

$$\frac{dN_f(t)}{dt} = -N\frac{dR(t)}{dt}$$

The derivative dNf(t)/dt is simply the instantaneous rate at which components are failing. At time t, there are still No(t) components operating correctly. Dividing dNf(t)/dt by No(t), we obtain

$$z(t) = \frac{1}{N_o(t)}\frac{dN_f(t)}{dt}$$

where z(t) is called the hazard function, hazard rate, or failure rate. The unit for the failure-rate function is failures per unit of time. Since No(t) = N R(t), the failure-rate function can also be written in terms of the reliability function R(t) as

$$z(t) = \frac{1}{N_o(t)}\frac{dN_f(t)}{dt} = \frac{1}{N_o(t)}\left[-N\frac{dR(t)}{dt}\right] = -\frac{1}{R(t)}\frac{dR(t)}{dt}$$

Rearranging, we obtain the following differential equation:

$$\frac{dR(t)}{dt} = -z(t)R(t)$$

The failure-rate function z(t) of electronic components exhibits a 'bathtub' curve, shown in Fig. 5, comprising three distinct regions: burn-in, useful life, and wear-out. It is typically assumed that the failure rate is constant during a component's useful life and is given by z(t) = λ.
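The bathtub shape can be sketched with a toy piecewise hazard function; the region boundaries and rates below are purely illustrative assumptions, chosen only to reproduce the qualitative shape of Fig. 5.

```python
def z(t, burn_in=100.0, wear_out=10_000.0, lam=1e-4):
    """Toy bathtub hazard: elevated, decreasing rate during burn-in; a
    constant rate lam during useful life; a rising rate during wear-out."""
    if t < burn_in:
        return lam * (1.0 + 9.0 * (1.0 - t / burn_in))   # 10*lam down to lam
    if t < wear_out:
        return lam                                        # useful life
    return lam * (1.0 + (t - wear_out) / 1_000.0)         # wear-out

for t in (0.0, 50.0, 100.0, 5_000.0, 12_000.0):
    print(f"z({t}) = {z(t):.2e}")
```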

So, with z(t) = λ, the differential equation becomes

$$\frac{dR(t)}{dt} = -\lambda R(t)$$

Solving the above equation gives us

$$R(t) = e^{-\lambda t} \quad (1)$$

The exponential relationship between reliability and time is known as the exponential failure law. Thus, the probability of a system working correctly throughout a given period of time decreases exponentially with the length of this time period. The exponential failure law is extremely valuable for the analysis of electronic components, and is by far the most commonly used relationship between reliability and time.
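A quick worked example of the exponential failure law, using an assumed useful-life failure rate of λ = 10⁻⁴ failures per hour (a hypothetical value):

```python
import math

lam = 1e-4                              # hypothetical failure rate, per hour
for hours in (100, 1000, 8760):         # 8760 hours = one year
    print(f"R({hours} h) = {math.exp(-lam * hours):.4f}")
# R(100 h) ≈ 0.9900, R(1000 h) ≈ 0.9048, R(8760 h) ≈ 0.4164: the chance of
# surviving a full year of continuous operation is well below one half.
```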

Mean time to repair. The mean time to repair (MTTR) is the average time taken to repair a failed system. Just as we describe the reliability of a system using its failure rate, we can quantify the 'repairability' of a system using its repair rate μ. The MTTR is 1/μ.

Mean time between failures. If a failed system can be repaired and made as good as new, then the mean time between failures (MTBF) is given by

$$MTBF = MTTF + MTTR \quad (3)$$

The availability of the system is the probability that the system will be functioning correctly at any given time. In other words, it is the fraction of time for which a system is operational.

$$\text{Availability} = \frac{\text{Time system is operational}}{\text{Total time}} = \frac{MTTF}{MTTF + MTTR} = \frac{MTTF}{MTBF}$$
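A quick worked example of these relationships, using hypothetical values MTTF = 2000 hours and MTTR = 1 hour:

```python
mttf, mttr = 2000.0, 1.0                # hypothetical values, in hours
mtbf = mttf + mttr                      # Eq. (3)
print(f"MTBF = {mtbf} h, availability = {mttf / mtbf:.6f}")
# availability ≈ 0.999500: the system is highly available even though it
# fails regularly, because each repair is short.
```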