



These notes give an introduction to the concepts of system reliability and fault tolerance: the definitions of reliability, availability, performability, safety, maintainability, and testability; the distinction between faults, errors, and system failures; the characteristics of faults and the use of fault models for designing fault-tolerant systems; and the concepts of fault detection, masking redundancy, dynamic redundancy, and fault-tolerant systems.
1 Dependability

Dependability is an umbrella term encompassing the concepts of reliability, availability, performability, safety, maintainability, and testability. We will now define these terms in an intuitive fashion.
The reliability R(t) of a system is a function of time, and is defined as the conditional probability that the system will perform correctly throughout the interval $[t_0, t]$, given that the system was performing correctly at time $t_0$. So, reliability is the probability that the system will operate correctly throughout a complete interval of time.
Reliability is used to characterize systems in which even momentary periods of incorrect performance are unacceptable, or in which it is impossible to repair the system (e.g., a spacecraft where the time interval of concern may be years). In some other applications such as flight control, the time interval of concern may be a few hours.
Fault tolerance can improve a system’s reliability by keeping the system operational when hardware and software failures occur.
Availability A(t) is a function of time, and is defined as the probability that a system is operating correctly and is available to perform its functions at the instant of time t. Availability differs from reliability in that reliability depends on an interval of time, whereas availability is taken at an instant of time. So, a system can be highly available yet experience frequent periods of down time as long as the length of each down-time period is very short. The most common measure of availability is the expected fraction of time that a system is available to correctly perform its functions.
In many cases, it is possible to design systems that continue to perform correctly after the occurrence of hardware/software failures, albeit at a diminished level of performance. The performability P(L, t) of a system is a function of time, and is defined as the probability that the system performance will be at, or above, some level L at the instant of time t. Performability differs from reliability in that reliability is a measure of the likelihood that all of the functions are performed correctly, whereas performability is a measure of the likelihood that some subset of the functions is performed correctly.
Graceful degradation is the ability of a system to automatically decrease its level of performance to compensate for hardware/software failures. Fault tolerance can provide graceful degradation and improve performability by eliminating failed hardware/software components, allowing performance at some reduced level.

∗These notes are adapted from: B. W. Johnson, Design and Analysis of Fault Tolerant Digital Systems, Addison Wesley,
Fig. 1: The full adder circuit used to illustrate the distinction between faults and errors.
Safety S(t) is the probability that a system will either perform its functions correctly or will discontinue its functions in a manner that does not compromise the safety of any people associated with the system (fail-safe capability). Safety and reliability differ because reliability is the probability that a system will perform its functions correctly, whereas safety is the probability that a system will either perform its functions correctly or will discontinue the functions in a fail-safe manner.
Maintainability is a measure of the ease with which a system can be repaired once it has failed. So, the maintainability M(t) is the probability that a failed system will be restored to an operational state within a specified period of time t. The restoration process includes locating and diagnosing the problem, repairing and reconfiguring the system, and bringing the system back to its operational state.
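As a hedged illustration (this closed form is not stated in the notes): if repair times are assumed to be exponentially distributed with a constant repair rate μ (the repair rate introduced in Section 3 below), the maintainability function takes the standard form
\[
M(t) = 1 - e^{-\mu t}
\]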
2 Faults, errors, and system failures
We define the following basic terms:

- A fault is a physical defect, imperfection, or flaw that occurs within some hardware or software component.
- An error is the manifestation of a fault: a deviation from accuracy or correctness within the system's internal state or outputs.
- A system failure occurs when the system fails to perform its specified function because of errors.

There is a cause-and-effect relationship between faults, errors, and failures: faults result in errors, and errors can lead to system failures. In other words, errors are the effect of faults, and failures are the effect of errors.
The full-adder circuit shown in Fig. 1 can be used to illustrate the distinction between faults and errors. The inputs Ai, Bi, and Ci are the two operand bits and the carry bit, respectively. The truth table showing the correct performance for this circuit is shown in Fig. 2. If a short occurs between line L and the power supply line, resulting in line L becoming permanently fixed at a logic 1 value, then a fault (or defect) has occurred in the circuit. The fault is the actual short within the circuit.
Fig. 3 shows the truth table of the circuit that contains the physical fault. Comparing Figs. 2 and 3, we see that the circuit performs correctly for the input combinations 100, 101, 110, and 111, but not for 000, 001, 010, and 011. So, whenever an input pattern is supplied to the circuit that results in an incorrect output, an error occurs: the fault is the physical short within the circuit, while the error is the resulting incorrect output value.
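The fault/error distinction can be made concrete with a small sketch. The following snippet is not from the original notes: it models the full adder at the gate level and injects a stuck-at-1 fault, assuming for illustration that line L carries the operand input Ai (an assumption that reproduces the behavior described above):

```python
# Sketch: inject a stuck-at-1 fault into a gate-level full adder and
# compare its truth table against the fault-free circuit.
# Assumption: line L carries the operand input Ai.

from itertools import product

def full_adder(a, b, c, l_stuck_at=None):
    """Gate-level full adder; if l_stuck_at is 0 or 1, line L (assumed
    here to carry the input a) is forced to that logic value."""
    l = a if l_stuck_at is None else l_stuck_at  # fault-injection point
    s1 = l ^ b                       # half-adder sum
    sum_bit = s1 ^ c                 # final sum output
    carry_bit = (l & b) | (s1 & c)   # carry output
    return sum_bit, carry_bit

for a, b, c in product([0, 1], repeat=3):
    good = full_adder(a, b, c)
    bad = full_adder(a, b, c, l_stuck_at=1)    # line L stuck at logic 1
    flag = "ok" if good == bad else "ERROR"    # error = effect of the fault
    print(f"{a}{b}{c}: correct={good} faulty={bad} {flag}")
```

Running the sketch prints ERROR for exactly the input patterns 000, 001, 010, and 011, matching the comparison of Figs. 2 and 3: the fault is the stuck line, while the errors are the incorrect outputs it produces.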
Fig. 4: A taxonomy of possible failure-response strategies. [Figure: a tree whose node labels are system reliability, non-redundant systems, redundant systems, fault avoidance, fault detection, masking redundancy, dynamic redundancy, fault-tolerant systems, on-line detection/masking, reconfiguration, retry, and on-line repair.]
To design a fault-tolerant system, it is necessary to assume that the underlying faults behave according to some fault model. Even though, in practice, faults can be transient and exhibit complex behavior, fault models make the problem of designing fault-tolerant systems more manageable and restrict our attention to a subset of all faults that can occur.
A commonly used fault model for capturing the behavior of faulty digital circuits is the logical stuck-at fault model, in which a signal line is assumed to be permanently fixed at a logic 0 or logic 1 value (as with line L in the full-adder example above).
A taxonomy of the primary techniques used to design systems to operate in a fault-prone environment is shown in Fig. 4. Broadly speaking, there are three primary methods: fault avoidance (e.g., shielding from EMI), fault masking (e.g., TMR systems), and fault tolerance.
Fault detection does not tolerate faults, but provides a warning that a fault has occurred. Masking redundancy (also called static redundancy) tolerates failures, but provides no warning of them. Dynamic redundancy covers those systems whose configuration can be dynamically changed in response to a fault, or in which masking redundancy is enhanced by on-line fault detection, which allows on-line repair.
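As a minimal sketch of masking redundancy, consider a triple modular redundancy (TMR) arrangement (TMR was mentioned above as an example of fault masking). The module functions here are hypothetical stand-ins:

```python
# Sketch of masking (static) redundancy: a TMR majority voter.
# A single faulty module is outvoted, and nothing records that a
# fault occurred -- masking tolerates the fault but gives no warning.

def majority_vote(a, b, c):
    """Bitwise majority of three module outputs."""
    return (a & b) | (a & c) | (b & c)

def module(x):          # hypothetical fault-free module
    return x + 1

def faulty_module(x):   # the same module with an injected fault
    return 0

x = 42
out = majority_vote(module(x), module(x), faulty_module(x))
print(out)  # 43 -- the faulty module's wrong output is silently masked
```

The voter masks the bad output without reporting it, which is precisely why masking redundancy "tolerates failures, but provides no warning of them"; dynamic redundancy adds the missing on-line detection.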
3 Quantitative Evaluation of System Reliability
The reliability R(t) of a system is defined as the probability that a component or system functions correctly over a given time period $[t_0, t]$ under a given set of operating conditions. Consider a set of N identical components, all of which begin operating at the same time, and let $N_o(t)$ be the number of components operating correctly at some time t and $N_f(t)$ the number of failed components. Then the reliability of a component at time t is given by
\[
R(t) = \frac{N_o(t)}{N} = \frac{N_o(t)}{N_o(t) + N_f(t)}
\]
which is simply the probability that a component has survived the interval $[t_0, t]$. We can also define the unreliability Q(t) as the probability that a system will not function correctly over a given period of time; this is also called the probability of failure. If the number of components that have failed by time t is $N_f(t)$, then
\[
Q(t) = \frac{N_f(t)}{N} = \frac{N_f(t)}{N_o(t) + N_f(t)}
\]
From the definitions of reliability and unreliability, we obtain
\[
Q(t) = 1 - R(t)
\]
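A quick numerical sketch of these definitions (the component counts below are made up for illustration):

```python
# Hypothetical life test: N identical components observed up to time t.
N = 1000          # components that started the test
N_f = 27          # components failed by time t
N_o = N - N_f     # components still operating correctly at time t

R = N_o / N       # reliability: fraction surviving [t0, t]
Q = N_f / N       # unreliability: fraction failed by time t
assert abs((R + Q) - 1.0) < 1e-12   # Q(t) = 1 - R(t)
print(f"R(t) = {R:.3f}, Q(t) = {Q:.3f}")   # R(t) = 0.973, Q(t) = 0.027
```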
If we write the reliability function as
\[
R(t) = 1 - \frac{N_f(t)}{N}
\]
and differentiate R(t) with respect to time, we obtain
\[
\frac{dR(t)}{dt} = -\frac{1}{N}\,\frac{dN_f(t)}{dt}
\]
which can be rewritten as
\[
\frac{dN_f(t)}{dt} = -N\,\frac{dR(t)}{dt}
\]
The derivative $dN_f(t)/dt$ is simply the instantaneous rate at which components are failing. At time t, there are still $N_o(t)$ components operating correctly. Dividing $dN_f(t)/dt$ by $N_o(t)$, we obtain
\[
z(t) = \frac{1}{N_o(t)}\,\frac{dN_f(t)}{dt}
\]
where z(t) is called the hazard function, hazard rate, or failure rate. The unit for the failure-rate function is failures per unit of time. The failure rate function can also be written in terms of the reliability function R(t) as
\[
z(t) = \frac{1}{N_o(t)}\,\frac{dN_f(t)}{dt} = -\frac{N}{N_o(t)}\,\frac{dR(t)}{dt} = -\frac{1}{R(t)}\,\frac{dR(t)}{dt}
\]
Rearranging, we obtain the following differential equation:
\[
\frac{dR(t)}{dt} = -z(t)\,R(t)
\]
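For reference (a standard step not spelled out in the notes), separating variables and using the initial condition R(0) = 1 gives the general solution
\[
R(t) = \exp\left(-\int_0^t z(\tau)\,d\tau\right)
\]
which specializes to the exponential failure law derived next when z(t) is constant.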
The failure-rate function z(t) of electronic components exhibits a 'bathtub' curve, shown in Fig. 5, comprising three distinct regions: burn-in, useful life, and wear-out. It is typically assumed that the failure rate is constant during a component's useful life and is given by z(t) = λ.
So, the differential equation is
\[
\frac{dR(t)}{dt} = -\lambda R(t)
\]
Solving the above equation gives us
\[
R(t) = e^{-\lambda t} \tag{1}
\]
This exponential relationship between reliability and time is known as the exponential failure law: the probability that a system works correctly throughout a given period of time decreases exponentially with the length of that period. The exponential failure law is extremely valuable for the analysis of electronic components, and is by far the most commonly used relationship between reliability and time.
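A small numerical sketch of the failure law (the failure rate and mission time are made-up values):

```python
import math

lam = 1e-4    # assumed constant failure rate: 1e-4 failures per hour
t = 1000.0    # mission time in hours

R = math.exp(-lam * t)    # exponential failure law, Eq. (1)
print(f"R({t:.0f} h) = {R:.4f}")       # ~0.9048
print(f"Q({t:.0f} h) = {1 - R:.4f}")   # ~0.0952
```

Note how doubling the mission time squares the reliability, since $e^{-\lambda \cdot 2t} = (e^{-\lambda t})^2$, a direct consequence of the exponential form.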
Mean time to failure. The mean time to failure (MTTF) is the average time a system operates before its first failure. Under the exponential failure law,
\[
MTTF = \int_0^\infty R(t)\,dt = \frac{1}{\lambda} \tag{2}
\]

Mean time to repair. The mean time to repair (MTTR) is the average time taken to repair a failed system. Just as we describe the reliability of a system using its failure rate λ, we can quantify the 'repairability' of a system using its repair rate μ. The MTTR is
\[
MTTR = \frac{1}{\mu}
\]
Mean time between failures. If a failed system can be repaired and made as good as new, then the mean time between failures (MTBF) is given by
\[
MTBF = MTTF + MTTR \tag{3}
\]
The availability of the system is the probability that the system will be functioning correctly at any given time. In other words, it is the fraction of time for which a system is operational.
\[
\text{Availability} = \frac{\text{Time system is operational}}{\text{Total time}}
\]
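Tying these quantities together in a sketch (the MTTF and MTTR values are hypothetical; the closed form A = MTTF/(MTTF + MTTR) restates the fraction-of-time definition above, since in the long run the system alternates between operating for MTTF and being repaired for MTTR on average):

```python
MTTF = 2000.0   # hypothetical mean time to failure, in hours
MTTR = 4.0      # hypothetical mean time to repair, in hours

MTBF = MTTF + MTTR    # Eq. (3)
A = MTTF / MTBF       # time operational / total time
print(f"MTBF = {MTBF} h, availability = {A:.5f}")   # ~0.99800
```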