[springer series in reliability engineering] reliability and safety engineering volume 0 || system...


Upload: durga-rao

Post on 09-Dec-2016


Page 1: [Springer Series in Reliability Engineering] Reliability and Safety Engineering Volume 0 || System Reliability Modeling

Chapter 3 System Reliability Modeling

This chapter presents various system reliability modeling techniques such as the reliability block diagram (RBD), fault tree analysis (FTA), Markov models, and Monte Carlo simulation. System reliability is evaluated as a function of the constituent components. Modeling in dynamic scenarios is also explained.

3.1 Reliability Block Diagram

An RBD is a graphical representation of a system’s success logic using modular or block structures. It is easy to understand, and system success paths can be visually verified. The RBD approach integrates various components using submodels/blocks. The RBD can be evaluated using analytical methods to obtain system reliability.

Reliability modeling by an RBD is primarily intended for non-repairable systems; for example, space systems (space shuttle, etc.) adopt RBD techniques for reliability prediction. In most electronic systems, though repair is possible, replacement is the practical resort, hence the RBD is widely used.

Nevertheless, the RBD approach has limitations in considering different failure modes, external events (like human error), and priority of events. In such scenarios FTA and Markov models are recommended for modeling.

3.1.1 Procedure for System Reliability Prediction Using Reliability Block Diagram

The procedure for constructing an RBD is shown in Figure 3.1 [1]. System familiarization is the prerequisite for reliability modeling. After system familiarization, one has to select a system success definition. If more than one definition is possible, a separate RBD may be required for each. The next step is to divide the system into blocks of equipment that reflect its logical behavior, so that each block is statistically independent and as large as possible. At the same time, each block should contain (where possible) no redundancy. For some numerical evaluations, each block should contain only those items that follow the same statistical distribution for times to failure.

Figure 3.1 Procedure for constructing RBD: system familiarization → system success/failure definition → divide the system into blocks → construct a diagram that connects blocks to form success paths → review of RBD with designer → quantitative evaluation of RBD


In practice it may be necessary to make repeated attempts at constructing the block diagram (each time bearing in mind the steps referred to above) before a suitable block diagram is finalized.

The next step is to refer to the system fault definition and construct a diagram that connects the blocks to form a “success path.” As indicated in the diagrams that follow, various paths exist between the input and output ports of blocks which must function in order that the system functions. If all the blocks are required to function for the system to function, then the corresponding block diagram is one in which all the blocks are joined in series, as illustrated in Figure 3.2. In this diagram I is the input port, O the output port, and R1, R2, R3,…, Rn are the blocks which together constitute the system. Diagrams of this type are known as series RBDs.

A different type of block diagram is needed when failure of one component, or “block,” does not affect system performance as far as the system fault definition is concerned. If in the above instance the entire link is duplicated (made redundant), then the block diagram is as illustrated by Figure 3.3. If, however, each block within the link is duplicated, the block diagram is as illustrated by Figure 3.4.

Diagrams of this type are known as parallel RBDs. Block diagrams used for modeling system reliability are often mixtures of series and parallel diagrams.

Figure 3.2 Series model: I → R1 → R2 → R3 → … → Rn → O

3.1.1.1 Important Points to be Considered while Constructing RBDs

• Sound understanding of the system to be modeled is a prerequisite for developing RBDs.

• Failure criteria shall be explicitly defined.

• Environmental and operating considerations: a description of the environmental conditions under which the system is designed to operate should be obtained. This may include a description of all the conditions to which the system will be subjected during transport, storage, and use. The same component of a system is often used in more than one environment; for example, in a space satellite system, on the ground, during flight, and in orbit. In such scenarios, reliability evaluation should be carried out using the same RBD each time but using the appropriate failure rates for each environment.

• It should be noted that an RBD may not be synonymous with the physical interconnections of the various constituent elements within a system.

• The relationship between calendar time, operating time, and on/off cycles should be established.

• Apart from operational failure rates, failures caused by the process of switching on and off may also need to be considered, depending on the application.


Figure 3.3 Series-parallel model

Figure 3.4 Parallel-series model

3.1.2 Different Types of Models

The reliability of a system, Rs(t), is the probability that the system can perform a required function under given conditions for a given time interval (0, t). In general it is defined by the relationship (Equation 2.22)

$$R_s(t) = \exp\left[-\int_0^t \lambda(u)\,du\right], \tag{3.1}$$

where λ(u) denotes the system failure rate at t = u, u being a dummy variable. In the following sections, Rs(t) will be written as Rs for simplicity. The unreliability of a system (probability of failure), Fs, is given by

Fs(t) = 1 – Rs(t). (3.2)


3.1.2.1 Series Model

For systems as illustrated by Figure 3.2, all the elements have to function for the success of the system. The system reliability Rs is the probability of success of all the elements, given by

$$R_s = P(A \cap B \cap C \cap \dots \cap Z).$$

Assuming the events to be independent,

$$R_s = R_A R_B R_C \cdots R_Z, \tag{3.3}$$

that is, the system reliability is obtained by multiplying together the reliabilities of all the blocks constituting the system.

Example 1 A personal computer consists of four basic subsystems: motherboard (MB), hard disk (HD), power supply (PS), and processor (CPU). The reliabilities of four subsystems are 0.98, 0.95, 0.91, and 0.99 respectively. What is the system reliability for a mission of 1000 h?

Solution: As all the subsystems need to be functioning for overall system success, the RBD is a series configuration, as shown in Figure 3.5.

Figure 3.5 RBD of typical computer: I → MB (0.98) → HD (0.95) → PS (0.91) → CPU (0.99) → O

The reliability of the system is

$$R_{sys} = R_{MB} \times R_{HD} \times R_{PS} \times R_{CPU} = 0.98 \times 0.95 \times 0.91 \times 0.99 = 0.8387.$$
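The series calculation of Example 1 can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `series_reliability` is an assumption introduced here, and the blocks are taken to be statistically independent, as Equation 3.3 requires.

```python
from math import prod


def series_reliability(block_reliabilities):
    """Reliability of independent blocks in series (product rule, Equation 3.3)."""
    return prod(block_reliabilities)


# Subsystem reliabilities from Example 1, in the order MB, HD, PS, CPU.
pc_blocks = [0.98, 0.95, 0.91, 0.99]
r_sys = series_reliability(pc_blocks)
print(f"R_sys = {r_sys:.4f}")  # prints R_sys = 0.8387
```

Because every factor is at most 1, adding a block to a series chain can never raise the system reliability.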

3.1.2.2 Parallel Model

For systems of the type illustrated by Figure 3.6, all the elements have to fail for system failure. The system unreliability Fs is the probability of failure of all the elements, given by

$$F_s = P(\bar{A} \cap \bar{B}),$$

where $\bar{A}$ and $\bar{B}$ denote the failure of elements A and B.


Assuming the events to be independent,

$$F_s = F_A F_B. \tag{3.4}$$

Hence system reliability (Rs) is given by

$$R_s = R_A + R_B - R_A R_B. \tag{3.5}$$

Figure 3.6 Two-unit parallel model

Equations 3.3 and 3.5 can be combined. Thus if we have a system as depicted by Figure 3.3, but with only three items in each branch, the system reliability is

$$R_s = R_{A1} R_{B1} R_{C1} + R_{A2} R_{B2} R_{C2} - R_{A1} R_{B1} R_{C1} R_{A2} R_{B2} R_{C2}. \tag{3.6}$$

Similarly, for Figure 3.4 we have

$$R_s = (R_{A1} + R_{A2} - R_{A1} R_{A2})(R_{B1} + R_{B2} - R_{B1} R_{B2})(R_{C1} + R_{C2} - R_{C1} R_{C2}). \tag{3.7}$$

Example 2 To ensure safe shutdown of nuclear power plants (NPPs) during normal or accident conditions there is a primary shutdown system (SDS), and a redundant secondary SDS is also present. The failure probability of the primary SDS is 0.01 and that of the secondary SDS is 0.035. Calculate the reliability of the overall SDS of an NPP.

Solution: As any SDS operation is sufficient for the success of the overall SDS of the NPP, the RBD is a parallel configuration, as shown in Figure 3.7.

Figure 3.7 Shutdown system (primary SDS and secondary SDS in parallel between input I and output O)

The system reliability is given by

$$R_{SYS} = R_p + R_s - R_p R_s,$$

$$R_p = 1 - F_p = 1 - 0.01 = 0.99,$$

$$R_s = 1 - F_s = 1 - 0.035 = 0.965.$$

Now,

$$R_{SYS} = 0.99 + 0.965 - (0.99 \times 0.965) = 0.9997.$$
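As a cross-check of Example 2, the parallel rule can be written in the equivalent form "one minus the product of the unreliabilities". The sketch below assumes independent units; the helper name `parallel_reliability` is introduced here purely for illustration.

```python
from math import prod


def parallel_reliability(block_reliabilities):
    """Reliability of independent blocks in active parallel.

    The system fails only if every block fails, so
    R = 1 - product of (1 - R_i); for two blocks this is Equation 3.5.
    """
    return 1.0 - prod(1.0 - r for r in block_reliabilities)


# Failure probabilities of the primary and secondary SDS (Example 2).
failure_probs = [0.01, 0.035]
r_sds = parallel_reliability([1.0 - f for f in failure_probs])
print(f"R_SYS = {r_sds:.5f}")  # approximately 0.9997 when rounded to four places
```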

Example 3 In designing a computer-based control system, two computers are being considered to obtain higher reliability. Designer A suggests redundancy at the system level, whereas designer B suggests redundancy at the subsystem level. Use the failure data from Example 1 and recommend the better design.

Solution: Designer A – reliability evaluation. Considering redundancy at the system level leads to the RBD shown in Figure 3.8, where

$$R_{SYS1} = R_{MB1} \times R_{HD1} \times R_{PS1} \times R_{CPU1} = 0.8387,$$

$$R_{SYS2} = R_{MB2} \times R_{HD2} \times R_{PS2} \times R_{CPU2} = 0.8387,$$

$$R_{SYS} = R_{SYS1} + R_{SYS2} - R_{SYS1} \times R_{SYS2} = 0.9740.$$

Figure 3.8 (a) Designer A’s RBD, and (b) simplified RBD

Designer B – reliability evaluation (Figure 3.9):

$$R_{MB} = R_{MB} + R_{MB} - R_{MB} \times R_{MB} = 0.9996,$$

$$R_{HD} = R_{HD} + R_{HD} - R_{HD} \times R_{HD} = 0.9975,$$

$$R_{PS} = R_{PS} + R_{PS} - R_{PS} \times R_{PS} = 0.9919,$$

$$R_{CPU} = R_{CPU} + R_{CPU} - R_{CPU} \times R_{CPU} = 0.9999.$$

Now,

$$R_{SYS} = R_{MB} \times R_{HD} \times R_{PS} \times R_{CPU} = 0.9889.$$

The proposed design of B (0.9889) is better than the design of A (0.9740).

Figure 3.9 (a) Designer B’s RBD, and (b) simplified RBD
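The comparison in Example 3 amounts to composing the series and parallel rules in two different orders. The following sketch, with hypothetical helper names `series` and `parallel` and assuming independent, identical duplicates, evaluates both designs from the Example 1 data.

```python
from math import prod


def series(reliabilities):
    """Independent blocks in series: product of reliabilities."""
    return prod(reliabilities)


def parallel(reliabilities):
    """Independent blocks in active parallel: 1 - product of unreliabilities."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


blocks = {"MB": 0.98, "HD": 0.95, "PS": 0.91, "CPU": 0.99}

# Designer A: two complete computers (series chains) in parallel.
r_design_a = parallel([series(blocks.values())] * 2)

# Designer B: every subsystem duplicated, and the duplicated pairs in series.
r_design_b = series(parallel([r, r]) for r in blocks.values())

print(f"Design A: {r_design_a:.4f}, Design B: {r_design_b:.4f}")
```

With these data, redundancy applied at the subsystem level yields the higher reliability, in line with the recommendation above.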

3.1.2.3 M-out-of-N Models (Identical Items)

A system having three subsystems A, B, and C fails only when more than one item has failed, as shown in Figure 3.10.

$$\begin{aligned}
R_{SYS} &= 1 - Q_A Q_B - Q_A Q_C - Q_B Q_C + 2 Q_A Q_B Q_C \\
&= 1 - (1 - R_A)(1 - R_B) - (1 - R_A)(1 - R_C) - (1 - R_B)(1 - R_C) + 2(1 - R_A)(1 - R_B)(1 - R_C) \\
&= 1 - 1 + R_A + R_B - R_A R_B - 1 + R_A + R_C - R_A R_C - 1 + R_B + R_C - R_B R_C \\
&\quad + 2 - 2R_A - 2R_B - 2R_C + 2R_A R_B + 2R_A R_C + 2R_B R_C - 2R_A R_B R_C \\
&= R_A R_B + R_A R_C + R_B R_C - 2R_A R_B R_C.
\end{aligned}$$

In general, if the reliability of a system can be represented by n identical items in parallel, where m out of n are required for system success, then the system reliability Rs is given by

$$R_s = \sum_{r=0}^{n-m} {}^{n}C_r\, R^{\,n-r} (1 - R)^r. \tag{3.8}$$


Figure 3.10 Model for 2-out-of-3 success

Thus the reliability of the system illustrated by Figure 3.10 is given by

$$R_s = R^3 + 3R^2(1 - R) = 3R^2 - 2R^3, \tag{3.9}$$

where R is the reliability of the individual items.
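Equation 3.8 translates directly into a short function. The sketch below is illustrative only: the name `m_out_of_n_reliability` and its argument names are assumptions made here, and the items are taken to be identical and independent.

```python
from math import comb


def m_out_of_n_reliability(m, n, item_reliability):
    """Equation 3.8: the system survives while at most n - m items have failed."""
    r = item_reliability
    return sum(
        comb(n, failed) * r ** (n - failed) * (1.0 - r) ** failed
        for failed in range(n - m + 1)
    )


# 2-out-of-3 with R = 0.9 should agree with Equation 3.9: 3R^2 - 2R^3.
r_2oo3 = m_out_of_n_reliability(2, 3, 0.9)
print(f"2-out-of-3: {r_2oo3:.3f}")  # prints 2-out-of-3: 0.972
```

Setting m = n recovers the series model and m = 1 recovers the simple parallel model.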

Example 4 The control and instrumentation (C&I) system is very important in the NPP as it monitors critical process parameters. Failure criteria are twofold: (i) failure of the C&I equipment when there is actual variation in parameters; (ii) failure due to spurious signals. Compare designs for 1-out-of-2 success and 2-out-of-3 success under both criteria. Assume failure probability q for each subsystem.

Solution:

1 out of 2. In scenario (i), where there is actual variation in the process parameters, successful operation of 1 out of 2 subsystems leads to system success. The reliability is given by

$$\begin{aligned}
R_1^{(i)} &= \sum_{r=0}^{1} {}^{2}C_r\,(1 - q)^{2-r} q^r \\
&= {}^{2}C_0\,(1 - q)^2 q^0 + {}^{2}C_1\,(1 - q)^1 q^1 \\
&= (1 - q)^2 + 2q(1 - q) \\
&= 1 - 2q + q^2 + 2q - 2q^2 \\
&= 1 - q^2.
\end{aligned}$$


In scenario (ii), where there are spurious signals, any subsystem failure will lead to system failure for a two-unit system, making it a 2-out-of-2-success system or simple series system.

The reliability is given by

$$R_1^{(ii)} = (1 - q)(1 - q) = (1 - q)^2.$$

2 out of 3. In both the scenarios, a 2-out-of-3 system will have the same reliability, given by

$$\begin{aligned}
R_2 &= \sum_{r=0}^{1} {}^{3}C_r\,(1 - q)^{3-r} q^r \\
&= {}^{3}C_0\,(1 - q)^3 q^0 + {}^{3}C_1\,(1 - q)^2 q^1 \\
&= (1 - q)^3 + 3q(1 - q)^2 \\
&= (1 - q)^2 (1 - q + 3q) \\
&= (1 - q)^2 (1 + 2q) \\
&= (1 - 2q + q^2)(1 + 2q) \\
&= 1 + 2q - 2q - 4q^2 + q^2 + 2q^3 \\
&= 1 - 3q^2 + 2q^3.
\end{aligned}$$

For very low q values (high-reliability systems), for failure criterion (i),

$$1 - q^2 > 1 - 3q^2 + 2q^3, \quad \text{i.e., } R_1^{(i)} > R_2,$$

and in the case of failure criterion (ii),

$$R_1^{(ii)} \ll R_2.$$

Thus the reliability differences are marginal in the non-spurious-signals case and significantly different for spurious signals. Hence the 2-out-of-3 system is better than the 1-out-of-2 system.

3.1.2.4 Standby Redundancy Models

Another frequently used form of redundancy is what is known as standby redundancy. In its most elementary form, the physical arrangement of items is represented in Figure 3.11.

In Figure 3.11, item A is the on-line active item, and item B is standing by, waiting to be switched on to replace A when the latter fails. The RBD formulae already established are not applicable for the reliability analysis of standby redundant systems.


Figure 3.11 Standby model

The expression for system reliability is

$$R_s(t) = e^{-\lambda t}(1 + \lambda t),$$

with the following assumptions:

1. When operating, both items have a constant failure rate λ and have zero failure rate in standby mode.
2. The switch is perfect.
3. Switchover time is negligible.
4. Standby unit does not fail while in standby mode.

If there are n items in standby, this expression becomes

$$R_s(t) = e^{-\lambda t}\left[1 + \lambda t + \frac{(\lambda t)^2}{2!} + \frac{(\lambda t)^3}{3!} + \dots + \frac{(\lambda t)^n}{n!}\right]. \tag{3.10}$$
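Equation 3.10 is a truncated Poisson sum and is easy to evaluate numerically. The sketch below uses a hypothetical function name, `standby_reliability`, and the illustrative values λ = 10⁻³ per hour and t = 1000 h, which are assumptions chosen here and not data from the text; the four assumptions listed above (perfect switch, no failures in standby, and so on) are taken to hold.

```python
from math import exp, factorial


def standby_reliability(failure_rate, t, n_standby):
    """Equation 3.10: one active item backed by n_standby cold-standby items."""
    lt = failure_rate * t
    return exp(-lt) * sum(lt ** k / factorial(k) for k in range(n_standby + 1))


lam, mission = 1e-3, 1000.0  # illustrative values, so that lambda * t = 1
single = standby_reliability(lam, mission, 0)       # no standby: exp(-lambda t)
one_standby = standby_reliability(lam, mission, 1)  # exp(-lambda t) (1 + lambda t)
active_parallel = 1.0 - (1.0 - single) ** 2         # two-unit active parallel, Equation 3.5

print(f"single {single:.4f}, standby {one_standby:.4f}, parallel {active_parallel:.4f}")
```

Under these idealized assumptions the standby arrangement outperforms active parallel redundancy, because the spare accumulates no operating time before it is switched in.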

It is to be noted that a practical block diagram should include blocks to represent the reliability of the switch plus sensing mechanism, which is often the “weak link” in standby systems. Further, unlike in the parallel models discussed earlier, the probability of survival of one item (item B) depends on the time at which the other item (item A) fails. In other words, items A and B cannot be regarded as failing independently. As a consequence, other procedures, such as Markov analysis, should be used to analyze the standby system.

Example 5 A typical uninterruptible power supply (UPS) circuit is shown in Figure 3.12. Given the failure and repair rates of the components (Table 3.1), calculate the UPS availability.


Figure 3.12 UPS

Solution: In normal conditions, when the AC power supply is present, the load draws current from the supply. When the AC supply is not there, the load draws current from the battery. However, an inverter is required in both scenarios. The RBD for the UPS can be represented as in Figure 3.13.

Figure 3.13 RBD of UPS (A1 and A2 in series, in parallel with A3; this combination in series with A4)

Table 3.1 Component failure and repair rates

Component    λ (failure rate)    μ (repair rate)
AC supply    2.28 × 10⁻⁴         2
Rectifier    10⁻⁴                0.25
Battery      10⁻⁶                0.125
Inverter     10⁻⁴                0.25

The RBD can be successively simplified as follows. As A1 and A2 are in series, they can be replaced by their equivalent availability,

A = A1.A2.

Now there is a simple parallel structure, represented with its equivalent expres-sion,

A1A2 + A3 – A1A2A3.

The final availability expression is given as

A4[A1A2 + A3 – A1A2A3].

Availability is calculated with the parameters given in Table 3.1:

$$A = \frac{\mu}{\mu + \lambda},$$

A1 = 0.9999, A2 = 0.9996, A3 = 0.9999, A4 = 0.9996.

Substitute the values in AUPS = A4[A1A2 + A3 – A1A2A3]:

AUPS = 0.9996.
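The substitution can be reproduced with the short sketch below. It assumes that A1, A2, A3, and A4 correspond to the AC supply, rectifier, battery, and inverter respectively (an assignment inferred here from the quoted availability values and the structure of the RBD), and that the rates in Table 3.1 share a common time unit.

```python
def availability(failure_rate, repair_rate):
    """Steady-state availability A = mu / (mu + lambda)."""
    return repair_rate / (repair_rate + failure_rate)


# (failure rate, repair rate) pairs from Table 3.1.
rates = {
    "ac_supply": (2.28e-4, 2.0),
    "rectifier": (1e-4, 0.25),
    "battery": (1e-6, 0.125),
    "inverter": (1e-4, 0.25),
}
a = {name: availability(lam, mu) for name, (lam, mu) in rates.items()}

a_series = a["ac_supply"] * a["rectifier"]                       # A1 A2
a_parallel = a_series + a["battery"] - a_series * a["battery"]   # A1A2 + A3 - A1A2A3
a_ups = a["inverter"] * a_parallel                               # A4 [ ... ]

print(f"A_UPS = {a_ups:.4f}")  # prints A_UPS = 0.9996
```

The result is dominated by the inverter, the only block without a redundant path.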


3.1.3 Solving the Reliability Block Diagram

Apart from the standard models discussed in the previous section, there can be non-series-parallel or complex configurations. There are general approaches for solving RBDs, such as the truth table method, the path-set/cut-set method, and the bounds method.

3.1.3.1 Truth Table Method

This method is also known as the event enumeration method. In this approach, all the combinations of events are enumerated and the system state for each combination is identified. For example, if there are n components in the system, considering success and failure for every component, there would be 2^n combinations. All such combinations are explored for system success. This method is computationally intensive. It is illustrated with a simple example here.

Example 6 One portion of a fluid system physically consists of a pump and two check valves in series. The series check valves provide redundancy against flow in the reverse direction when the pump is not operating and the downstream pressure exceeds the upstream pressure.

Solution: The system diagram is as in Figure 3.14.

Figure 3.14 A simple fluid system

The functional diagram is as in Figure 3.15.

Figure 3.15 RBD of fluid system

Considering the functional diagram the Boolean expression for this system is given by

S = C(A + B),


or one can make a truth table (Table 3.2) by assigning a “0” and “1” value to failure and success respectively:

Pump not working = 1,
Pump working = 0,
Check valve reverse blocking = 1,
Check valve not reverse blocking = 0,
System success (pump not working and valve reverse blocking) = 1,
System failure (pump working) = 0.

Table 3.2 Truth table

Serial no.   A   B   C   S   Event probability, P
1            0   0   0   0
2            0   0   1   0
3            0   1   0   0
4            0   1   1   1   (1 – PA)PBPC
5            1   0   0   0
6            1   0   1   1   PA(1 – PB)PC
7            1   1   0   0
8            1   1   1   1   PAPBPC

From this truth table add all the entries under P corresponding to 1 under S.

The reliability is obtained as

$$R = (1 - P_A) P_B P_C + P_A (1 - P_B) P_C + P_A P_B P_C = P_C (P_A + P_B - P_A P_B).$$
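The enumeration behind Table 3.2 can be automated. The sketch below is an illustration under stated assumptions: the function names and the numerical probabilities (0.9, 0.9, 0.95) are chosen here and do not come from the example; components are assumed independent.

```python
from itertools import product


def truth_table_reliability(structure, probs):
    """Sum the probabilities of all 2^n component-state rows for which the system succeeds."""
    names = list(probs)
    total = 0.0
    for states in product((0, 1), repeat=len(names)):
        state = dict(zip(names, states))
        if structure(state):
            row_probability = 1.0
            for name in names:
                row_probability *= probs[name] if state[name] else 1.0 - probs[name]
            total += row_probability
    return total


def fluid_system(s):
    """Boolean success expression S = C(A + B) of the fluid system."""
    return s["C"] and (s["A"] or s["B"])


probs = {"A": 0.9, "B": 0.9, "C": 0.95}  # illustrative success probabilities
r_table = truth_table_reliability(fluid_system, probs)
r_closed = probs["C"] * (probs["A"] + probs["B"] - probs["A"] * probs["B"])
print(f"truth table: {r_table:.4f}, closed form: {r_closed:.4f}")
```

The loop visits all 2^n rows, which is why the method becomes impractical for large n.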

If it is possible to write the Boolean expression of the system, it is not necessary to make the truth table. What is required is that the Boolean expression be reduced to its minimal form, after which the probability operation can be performed on it directly.

3.1.3.2 Cut-set and Tie-set Method

This is an efficient method to compute the reliability of a given system. Computer programs are also available.

• Cut set: a group of elements or units whose failure causes the system to fail. A minimal cut set is a cut set with the minimum number of such units, i.e., one that contains no smaller cut set.

• Tie set: a set of elements whose working makes the system work. A minimal tie set is a tie set with the minimum number of such elements that assures system success.


For reliability computation either the minimal cut set or the minimal tie set should be found.

Suppose T1, T2,…, Tn are the minimal tie sets of the system; then the system reliability is given by

$$P(T_1 \cup T_2 \cup T_3 \cup \dots \cup T_n),$$

and if the minimal cut sets are known to be C1, C2,…, Ck, then the system reliability is

$$1 - P(C_1 \cup C_2 \cup C_3 \cup \dots \cup C_k),$$

where P(T1), P(T2),…, P(Tn) denote the success probability attached to the tie sets T1, T2,…, Tn and P(C1), P(C2),…, P(Ck) are the failure probabilities attached to the cut sets C1, C2,…, Ck.

Example 7 Considering the bridge network shown in Figure 3.16, calculate the reliability of the system as a function of tie sets.

Figure 3.16 Bridge network

The minimal tie sets are

T1 = (A, B); T2 = (C, D); T3 = (A, E, D); T4 = (C, E, B).


The minimal cut sets are

C1 = (A, C); C2 = (B, D); C3 = (A, E, D); C4 = (C, E, B).

If success probabilities of tie sets are

P(T1) = PAPB , P(T2) = PCPD , P(T3) = PAPEPD, P(T4) = PCPEPB,

reliability is

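R = P(T1 ∪ T2 ∪ T3 ∪ T4) = P(T1) + P(T2) + P(T3) + P(T4) – [P(T1 ∩ T2) + P(T2 ∩ T3) + P(T3 ∩ T4) + P(T1 ∩ T4) + P(T1 ∩ T3) + P(T2 ∩ T4)] + [P(T1 ∩ T2 ∩ T3) + P(T2 ∩ T3 ∩ T4) + P(T3 ∩ T4 ∩ T1) + P(T1 ∩ T2 ∩ T4)] – P(T1 ∩ T2 ∩ T3 ∩ T4).

Because the tie sets share components, each joint probability is the product of the success probabilities of the distinct components involved; for example, P(T1 ∩ T3) = PAPBPDPE rather than P(T1)P(T3).

Similarly, the cut-set method can be used for reliability prediction.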
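The inclusion-exclusion expansion over tie sets can be checked numerically. The sketch below is illustrative: the function names and the common component reliability of 0.9 are assumptions made here. Each joint term is evaluated as the product over the distinct components in the union of the tie sets involved, and the result is compared with a brute-force enumeration of all component states.

```python
from itertools import combinations, product
from math import prod

# Minimal tie sets of the bridge network (Example 7).
TIE_SETS = [{"A", "B"}, {"C", "D"}, {"A", "E", "D"}, {"C", "E", "B"}]


def tie_set_reliability(tie_sets, p):
    """Inclusion-exclusion over tie sets; shared components are counted once per term."""
    total = 0.0
    for k in range(1, len(tie_sets) + 1):
        sign = 1.0 if k % 2 else -1.0
        for group in combinations(tie_sets, k):
            components = set().union(*group)
            total += sign * prod(p[c] for c in components)
    return total


def enumerated_reliability(tie_sets, p):
    """Brute-force check: sum the probability of every state containing a working tie set."""
    names = sorted(p)
    total = 0.0
    for states in product((False, True), repeat=len(names)):
        up = {n for n, s in zip(names, states) if s}
        if any(t <= up for t in tie_sets):
            total += prod(p[n] if n in up else 1.0 - p[n] for n in names)
    return total


p = {c: 0.9 for c in "ABCDE"}  # illustrative component reliabilities
r_bridge = tie_set_reliability(TIE_SETS, p)
print(f"bridge reliability: {r_bridge:.5f}")
```

For identical components with reliability p, both routes reduce to 2p² + 2p³ − 5p⁴ + 2p⁵.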

Example 8 A simplified emergency power supply system is shown in Figure 3.17. Availability of the power supply at bus A or bus B ensures the supply to loads. There is a transfer switch to connect the diesel generator DG1 to bus B or to connect DG2 to bus A. Develop the RBD and identify the combinations of failures leading to failure of the power supply.

Solution:

Figure 3.17 Simplified emergency power supply system

The RBD can be represented as shown in Figure 3.18.


Figure 3.18 RBD of simplified emergency power supply system

The following combinations of failure lead to system failure:

A ⋅ B, Grid ⋅ DG1 ⋅ DG2, Grid ⋅ DG1 ⋅ TS ⋅ B, Grid ⋅ DG2 ⋅ TS ⋅ A.

3.1.3.3 Bounds Method

When the system is large, Boolean techniques and the cut-set and tie-set method become tedious. However, if a computer program is available to generate the cut sets and tie sets, one can adopt the bounds method, which is a variation of the cut-set and tie-set method.

If T1, T2,…, Tn are minimal tie sets, then an upper bound Ru for the system reliability R is

R ≤ Ru = P(T1) + P(T2) + ... + P(Tn).

This becomes a good approximation in the low-reliability region.

If C1, C2,…, Ck are minimal cut sets, a lower bound Rl for the system reliability can be found as

R ≥ Rl = 1 – [P(C1) + P(C2) + ... + P(Ck)].


This becomes a good approximation in the high-reliability region. The reliability of the system is then approximately

$$R = 1 - \sqrt{(1 - R_u)(1 - R_l)}. \tag{3.11}$$

3.2 Markov Models

Combinatorial models (RBDs, fault trees, etc.) are often used for system reliability analysis owing to their simplicity and ease of application. They are well suited for systems with components following “independent” behavior. Dependencies cannot be handled efficiently by these models. Modeling flexibility for complex systems can be gained through the use of Markov analysis. Coverage and repair processes can be included and analyzed appropriately with this approach. It is primarily applicable when the failure and repair rates of all components are constant (exponentially distributed times to failure and repair).

Markov chains are sequences of random variables in which the future variables are determined by the present variable only and are independent of how the present state evolved from its antecedents. In effect, they are characterized by the “memoryless” property.

A stochastic process {X(t) | t ∈ T} is called a Markov process if, for any t0 < t1 < ... < tn < tn+1, the conditional distribution of X(tn+1) for given values of X(t0), X(t1),..., X(tn) depends only on X(tn) and not on the earlier values. The values that X(t) can assume are in general called “states,” all of which together form a “state space” Ω.

Markov processes are classified depending upon whether T and Ω are discrete (countable) or continuous (uncountable). Thus, there are four types of Markov processes depending upon the associated time and state spaces.

3.2.1 State Space Method – Principles

Consider a repairable system made up of n components where each component has a finite number of operating and failed states. Thus, the system has two types of states:

• Operating states: these are the states where the system function is performed although some of the components may have failed; a fully operational state is a state where no component has failed.

• Failed states: these are the states where the system function is no longer fulfilled due to the failure of one or more of the system components.


3.2.1.1 Steps

Step 1: listing and classifying all the system states as either operating states or failed states. If each component has two possible states (operating and failed) and if the system has n components, the maximum number of states is 2^n. During the life of the system, failed states can appear due to the occurrence of failures, or disappear following repairs.

Step 2: listing all possible transitions between the different states, and identifying the causes of all transitions. In general, the causes of the transitions are either failure of one or several subsystem components or a repair made to a component.

Step 3: calculating the probabilities of being in the different states during a certain period in the life of the system, or calculating the dependability characteristics (mean time to failure, MTTF; mean time to repair, MTTR; mean time between failures, MTBF). This representation is done by constructing a state graph wherein each node represents a state of the system, and each arc symbolizes a transition between the two nodes it links. A transition rate between two states is assigned to each arc.

There are several state space reduction techniques available to simplify the computational tasks involved, the most noted of them being the “merging of states” approach, where identical states are merged and equivalent transition rates and probabilities are found.

3.2.1.2 Basic Analysis

Consider any two states i and j and transitions from i to j. Let Pi(t) be the probability of the system being in state i at time t and Pj(t + Δt) be the probability of the system being in state j at time (t + Δt). Then:

$$P_j(t + \Delta t) = \lambda_{ij}\,\Delta t\,P_i(t),$$

where λij is the transition rate from state i to state j, and λij Δt is the probability of a transition (for example, a failure) from i to j in the interval Δt:

$$\lambda_{ij} = \lim_{\Delta t \to 0} \frac{1}{\Delta t}\,[\text{Probability of transition from } i \text{ to } j \text{ in } \Delta t].$$

It is assumed that the probability of two or more transitions occurring simultaneously is negligible. In general,

$$P_j(t + \Delta t) = \sum_{i \ne j} \lambda_{ij}\,\Delta t\,P_i(t) + \left(1 - \sum_{k \ne j} \lambda_{jk}\,\Delta t\right) P_j(t).$$


The first term in the summation pertains to the probability of entering a state and the second term pertains to the probability of staying in that state.

A simple one-component system (simplex system, Figure 3.19) is taken for illustrative purposes, and a detailed Markov analysis for its reliability evaluation is carried out.

Figure 3.19 Simplex system: state 1 (up) and state 2 (down), with transition rates λ12 (1 → 2) and λ21 (2 → 1)

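$$P_1(t + \Delta t) = \lambda_{21}\,\Delta t\,P_2(t) + (1 - \lambda_{12}\,\Delta t)\,P_1(t),$$

$$P_2(t + \Delta t) = \lambda_{12}\,\Delta t\,P_1(t) + (1 - \lambda_{21}\,\Delta t)\,P_2(t)$$

$$\Rightarrow \frac{P_1(t + \Delta t) - P_1(t)}{\Delta t} = \lambda_{21} P_2(t) - \lambda_{12} P_1(t),$$

and

$$\frac{P_2(t + \Delta t) - P_2(t)}{\Delta t} = \lambda_{12} P_1(t) - \lambda_{21} P_2(t).$$

As Δt → 0,

$$\frac{dP_1(t)}{dt} = \lambda_{21} P_2(t) - \lambda_{12} P_1(t)$$

and

$$\frac{dP_2(t)}{dt} = \lambda_{12} P_1(t) - \lambda_{21} P_2(t);$$

with λ12 = λ and λ21 = μ,

$$\left.\begin{aligned}
\frac{dP_1}{dt} &= \mu P_2(t) - \lambda P_1(t) \\
\frac{dP_2}{dt} &= \lambda P_1(t) - \mu P_2(t)
\end{aligned}\right\}. \tag{3.12}$$

Expressing the above equations in matrix form:

$$\begin{bmatrix} \dfrac{dP_1(t)}{dt} & \dfrac{dP_2(t)}{dt} \end{bmatrix} = \begin{bmatrix} P_1(t) & P_2(t) \end{bmatrix} \begin{bmatrix} -\lambda & \lambda \\ \mu & -\mu \end{bmatrix},$$

i.e.,

$$[P'(t)] = [P(t)]\,[A],$$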

where [A] is known as a “stochastic transition matrix.”


Properties of Stochastic Transition Matrix

• A is a square matrix of the order n × n where n is the number of states in the Markov model.

• Element aij is the transition rate from state i to state j (i ≠ j).

• Element aii is the negative of the summation of the transition rates from state i to all other states.

When the process is homogeneous, the elements of A are constant. Furthermore, the terms of any row of A sum to 0; thus, A is a singular matrix (its determinant is equal to zero).

Note: A Markov process is said to be homogeneous when the failure and repair rates are constant.

When the initial distribution, Pi(0) is known, i.e., P1(0) and P2(0) are known, true solutions to the Markov differential equations as shown in Equation 3.12 can be found, in particular, using Laplace transforms, discretization, or the eigenvalues of matrix A.

(For all computational purposes, it can be assumed that the system is initially in the “up” state, i.e., all components are working. In the example considered, this renders P1(0), the probability of system being in state 1 at time t = 0, as 1, and P2(0) as 0.)

Refresher of Laplace Transforms

$$L(1) = \frac{1}{s}, \qquad L(t^n) = \frac{n!}{s^{n+1}}, \qquad L(e^{at}) = \frac{1}{s-a}.$$

Transform of derivatives. If $L(f(t)) = F(s)$, then $L(f'(t)) = sF(s) - f(0)$.

Transform of integrals. If $L(f(t)) = F(s)$, then $L\left(\int_0^t f(\tau)\,d\tau\right) = \dfrac{F(s)}{s}$.

Taking Laplace transforms of Equation 3.12:

$$sP_1(s) - P_1(0) = -\lambda P_1(s) + \mu P_2(s),$$

i.e.,

$$sP_1(s) - 1 = -\lambda P_1(s) + \mu P_2(s), \tag{3.13}$$

and

$$sP_2(s) - P_2(0) = \lambda P_1(s) - \mu P_2(s),$$

i.e., $sP_2(s) - 0 = \lambda P_1(s) - \mu P_2(s)$

$$\Rightarrow sP_2(s) = \lambda P_1(s) - \mu P_2(s) \tag{3.14}$$

$$\Rightarrow P_2(s)[s + \mu] = \lambda P_1(s)$$

$$\Rightarrow P_2(s) = \frac{\lambda}{s+\mu}P_1(s). \tag{3.15}$$

Substitute Equation 3.15 into Equation 3.13:

$$sP_1(s) - 1 = -\lambda P_1(s) + \frac{\mu\lambda}{s+\mu}P_1(s)$$

$$\Rightarrow P_1(s)\left[s + \lambda - \frac{\mu\lambda}{s+\mu}\right] = 1, \tag{3.16}$$

$$P_1(s) = \frac{1}{s + \lambda - \dfrac{\mu\lambda}{s+\mu}} \;\Rightarrow\; P_1(s) = \frac{s+\mu}{s[s+\mu+\lambda]}.$$

The RHS is resolved into partial fractions:

$$\frac{s+\mu}{s[s+\mu+\lambda]} = \frac{A}{s} + \frac{B}{s+\mu+\lambda} \;\Rightarrow\; s + \mu = A[s+\mu+\lambda] + Bs.$$

Comparing the like coefficients on both sides:

$$A + B = 1; \quad A(\mu+\lambda) = \mu \;\Rightarrow\; A = \frac{\mu}{\mu+\lambda} \quad\text{and}\quad B = 1 - \frac{\mu}{\mu+\lambda} = \frac{\lambda}{\mu+\lambda}.$$

Hence

$$P_1(s) = \frac{\mu}{\mu+\lambda}\left[\frac{1}{s}\right] + \frac{\lambda}{\mu+\lambda}\left[\frac{1}{s+\mu+\lambda}\right].$$

Inverse Laplace transformation yields

$$P_1(t) = \frac{\mu}{\mu+\lambda} + \frac{\lambda}{\mu+\lambda}e^{-(\mu+\lambda)t}. \tag{3.17}$$

Substituting this expression for $P_1(s)$ into Equation 3.15 gives a simplified $P_2(s)$:

$$P_2(s) = \frac{\lambda}{s[s+\mu+\lambda]}.$$

The RHS is resolved into partial fractions as

$$\frac{\lambda}{\mu+\lambda}\left[\frac{1}{s}\right] - \frac{\lambda}{\mu+\lambda}\left[\frac{1}{s+\mu+\lambda}\right].$$

Inverse Laplace transformation yields

$$P_2(t) = \frac{\lambda}{\mu+\lambda} - \frac{\lambda}{\mu+\lambda}e^{-(\mu+\lambda)t}. \tag{3.18}$$
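The closed-form results in Equations 3.17 and 3.18 can be checked numerically. The following is a minimal sketch, not part of the original text: it evaluates the closed forms and compares them with a direct stepping of the difference equations for a small Δt. The function names and the rate values are illustrative assumptions.

```python
import math

def simplex_closed_form(lam, mu, t):
    """Equations 3.17 and 3.18, assuming P1(0) = 1 and P2(0) = 0."""
    total = lam + mu
    decay = math.exp(-total * t)
    p1 = mu / total + (lam / total) * decay
    p2 = lam / total - (lam / total) * decay
    return p1, p2

def simplex_stepped(lam, mu, t, steps=100_000):
    """Step P(t + dt) from P(t) using the difference equations with a small dt."""
    dt = t / steps
    p1, p2 = 1.0, 0.0
    for _ in range(steps):
        p1, p2 = (p1 + dt * (mu * p2 - lam * p1),
                  p2 + dt * (lam * p1 - mu * p2))
    return p1, p2

# Hypothetical rates (per hour) and mission time, chosen only for illustration.
print(simplex_closed_form(1e-3, 1e-1, 50.0))
print(simplex_stepped(1e-3, 1e-1, 50.0))
```

The two results should agree closely, and the two state probabilities should always sum to 1.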

Steady-state probabilities are obtained by letting time t tend to infinity in the time-dependent solutions, so that the exponential terms vanish. Thus,

$$P_1(\infty) = \frac{\mu}{\mu+\lambda}, \tag{3.19}$$

$$P_2(\infty) = \frac{\lambda}{\mu+\lambda}. \tag{3.20}$$

These steady-state probabilities can also be obtained directly, without solving the system of Markov differential equations.

As the probabilities tend to constant values as time tends to infinity, the vector of their time derivatives becomes a null vector:

$$[P'(t)] = [0].$$

In general:

$$[P'(t)] = [P(t)][A].$$

For steady-state probabilities:

$$[0] = [P(\infty)][A].$$

For the single component with repair considered, the steady-state equations can be written as follows:

$$0 = -\lambda P_1(\infty) + \mu P_2(\infty),$$

$$0 = \lambda P_1(\infty) - \mu P_2(\infty).$$

Also, the summation of probabilities of all states amounts to 1. Thus,

$$P_1(\infty) + P_2(\infty) = 1.$$

For an n-state Markov model, any (n − 1) of the steady-state linear equations are solved simultaneously with the equation for the summation of state probabilities. For the simplex system this gives

$$P_1(\infty) = \frac{\mu}{\mu+\lambda}, \qquad P_2(\infty) = \frac{\lambda}{\mu+\lambda}.$$
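The direct steady-state procedure can be sketched for a general n-state model. The following is an illustrative sketch under stated assumptions, not the book's own procedure: it takes a transition-rate matrix A (rows summing to zero), replaces one balance equation with the normalization condition, and solves the linear system by Gauss-Jordan elimination. The function name `steady_state` is hypothetical.

```python
def steady_state(A):
    """Solve [0] = [P][A] together with sum(P) = 1 for a rate matrix A."""
    n = len(A)
    # Row i of M is the balance equation sum_j P_j * A[j][i] = 0.
    M = [[A[j][i] for j in range(n)] + [0.0] for i in range(n)]
    # Replace the last balance equation with the normalization sum(P) = 1.
    M[-1] = [1.0] * n + [1.0]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

# Simplex system with hypothetical rates lam = 0.001 and mu = 0.1 per hour.
lam, mu = 0.001, 0.1
print(steady_state([[-lam, lam], [mu, -mu]]))
```

For the two-state model the result should match Equations 3.19 and 3.20.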

3.2.1.3 State Frequencies and Durations

The frequency of encountering state i, fi, is defined as the expected number of stays in (or arrivals into, or departures from) i per unit time, computed over a long period. The concept of frequency is associated with the long-term behavior of the process describing the system.

In order to relate the frequency, probability, and mean duration of a given system state, the history of the system is thought of as consisting of two alternating periods: the stays in state i and the stays outside state i. The state space diagram of the two-state process representation is shown in Figure 3.20.

Figure 3.20 State space diagram of the two-state process representation

Mean duration of stay in state i = Ti,
Mean duration of stay in states outside i = Ti′,
Mean cycle time, Tci = Ti + Ti′,
State frequency, fi = 1/(mean cycle time) = 1/Tci, i.e., fiTci = 1.

Multiplying both sides of fi = 1/Tci by Ti,

fiTi = Ti/Tci. But the ratio Ti/Tci can be proved to be Pi. Thus fi = Pi/Ti.

[Figure 3.20 diagram: state i and a second block labelled "All the other states".]

Frequency of transfer from state i to state j. The frequency of transfer from state i to state j, fij, is defined as the expected number of direct transfers from i to j per unit time.

$$f_{ij} = \lim_{\Delta t \to 0}\frac{1}{\Delta t}P\big[(X(t+\Delta t) = j) \cap (X(t) = i)\big]$$

$$= \lim_{\Delta t \to 0}\frac{1}{\Delta t}P\big[X(t+\Delta t) = j \mid X(t) = i\big]\,P\big[X(t) = i\big]$$

$$= \lambda_{ij}P_i.$$

Thus, the transition rate λij is essentially a conditional frequency, the condition being the residence of system in state i. From the definitions of fi and fij, we have

$$f_i = \sum_{j \neq i} f_{ij} \;\Rightarrow\; f_i = P_i\sum_{j \neq i}\lambda_{ij}.$$

Thus, since fi = Pi/Ti, Ti can be expressed as

$$T_i = \frac{1}{\sum_{j \neq i}\lambda_{ij}}.$$

In other words, the mean duration of the stays in any given state is the reciprocal of the total rate of departures from that state.
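The relations fi = Pi Σλij and Ti = 1/Σλij can be evaluated directly from a transition-rate matrix and a vector of steady-state probabilities. The following is a minimal sketch; the function name and the numerical rates are assumptions made for illustration.

```python
def frequency_and_duration(A, P):
    """Return a list of (f_i, T_i) given rate matrix A and steady-state probabilities P."""
    result = []
    for i, p_i in enumerate(P):
        # Total rate of departure from state i (off-diagonal entries of row i).
        departure = sum(rate for j, rate in enumerate(A[i]) if j != i)
        # An absorbing state has no departures, so its mean stay is unbounded.
        T_i = 1.0 / departure if departure > 0 else float("inf")
        f_i = p_i * departure
        result.append((f_i, T_i))
    return result

# Simplex system with hypothetical rates; P taken from Equations 3.19 and 3.20.
lam, mu = 0.001, 0.1
A = [[-lam, lam], [mu, -mu]]
P = [mu / (lam + mu), lam / (lam + mu)]
print(frequency_and_duration(A, P))
```

In a two-state model the frequencies of the up and down states come out equal, since every failure is followed by exactly one repair.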

3.2.1.4 Two-component System with Repair

A system consists of two components. State 0 is the up state where both components are working. When component 1 fails, a transition to state 1 takes place. If the component is repaired in this state, the system goes back to state 0. A failure of component 2 in state 1 leads the system to state 3, where both components are down. The transitions 0-2-3 can be explained along similar lines. The Markov model is shown in Figure 3.21.

Figure 3.21 Markov model for two-component system

[Figure 3.21 diagram: state 0 (no component down), state 1 (component 1 down), state 2 (component 2 down), and state 3 (components 1 and 2 down); failure transitions at rates λ1 and λ2, repair transitions at rates μ1 and μ2.]

In the case where the two components are in series in a system, the reliability of the system is the probability of that state where both the components are in the up state, i.e., state 0. For a parallel system, reliability is given by the summation of probabilities of all those states which have at least one working component:

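• Series system: A = P0.
• Parallel system: A = P0 + P1 + P2.

Steady-state probabilities are given by the following expressions:

$$P_0 = \frac{\mu_1\mu_2}{(\lambda_1+\mu_1)(\lambda_2+\mu_2)},$$

$$P_1 = \frac{\lambda_1\mu_2}{(\lambda_1+\mu_1)(\lambda_2+\mu_2)},$$

$$P_2 = \frac{\lambda_2\mu_1}{(\lambda_1+\mu_1)(\lambda_2+\mu_2)},$$

$$P_3 = \frac{\lambda_1\lambda_2}{(\lambda_1+\mu_1)(\lambda_2+\mu_2)},$$

$$T_0 = \frac{1}{\lambda_1+\lambda_2} \;\Rightarrow\; f_0 = \frac{P_0}{T_0} = \frac{\mu_1\mu_2(\lambda_1+\lambda_2)}{(\lambda_1+\mu_1)(\lambda_2+\mu_2)}.$$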
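The steady-state expressions above lend themselves to a short numerical check. The sketch below is illustrative only: the function names and the rate values are assumptions, and the balance check for state 0 uses the transitions described for Figure 3.21.

```python
def two_component_steady_state(l1, m1, l2, m2):
    """Steady-state probabilities P0..P3 for two repairable components."""
    d = (l1 + m1) * (l2 + m2)
    return {0: m1 * m2 / d, 1: l1 * m2 / d, 2: l2 * m1 / d, 3: l1 * l2 / d}

def series_availability(P):
    """Series system: up only when both components are up."""
    return P[0]

def parallel_availability(P):
    """Parallel system: up when at least one component is up."""
    return P[0] + P[1] + P[2]

# Hypothetical failure and repair rates (per hour).
l1, m1, l2, m2 = 0.01, 0.5, 0.02, 0.4
P = two_component_steady_state(l1, m1, l2, m2)
f0 = P[0] * (l1 + l2)          # frequency of encountering state 0
print(P, series_availability(P), parallel_availability(P), f0)
```

At steady state the rate of leaving state 0, P0(λ1 + λ2), equals the rate of entering it, μ1P1 + μ2P2.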

The frequency of encountering a state is

fi = (Prob. of being in that state) × (Sum of departure rates from that state),

or (Prob. of not being in that state) × (Sum of arrival rates to that state).

If two identical states, say x and y, are to be merged into a single state z, the following relationships are employed to obtain the equivalent transition rates between the merged state and any other state (say state i) of the state space:

$$P_z = P_x + P_y,$$

$$\lambda_{iz} = \lambda_{ix} + \lambda_{iy},$$

$$\lambda_{zi} = \frac{\lambda_{xi}P_x + \lambda_{yi}P_y}{P_x + P_y}.$$

The transitions between two merged states I and J can be obtained from the following relations:

$$\lambda_{IJ} = \frac{\sum_{i \in I} P_i \sum_{j \in J}\lambda_{ij}}{\sum_{i \in I} P_i},$$

$$\lambda_{JI} = \frac{\sum_{j \in J} P_j \sum_{i \in I}\lambda_{ji}}{\sum_{j \in J} P_j}.$$

This can be illustrated by considering a two-component system without repair facilities (Figure 3.22). States 1 and 2 are identical in that the basic characterization of these states is the same: states with only one component down. For the end result, it does not matter that it is component 1 which is down in state 1, and component 2 that is down in state 2. Using the concept of merging, we can alter the definitions of states accordingly and arrive at a simplified Markov model. The states can now be interpreted as:

• State 0: both components are up.
• State 2′: only one component is up.
• State 3: both components are down.

Figure 3.22 Markov model for two-component system without repair

Whether the individual component-wise interpretation or the identical component condition-wise interpretation is used, the end result, the reliability evaluation, is the same.

Merging states 1 and 2 to yield state 2′ would yield

$$\lambda_{02'} = \lambda_1 + \lambda_2, \qquad \lambda_{2'0} = 0, \qquad \lambda_{2'3} = \frac{\lambda_2 P_1 + \lambda_1 P_2}{P_1 + P_2}.$$

In the case where all failure rates are identical (Figure 3.23),

$$\lambda_{02'} = 2\lambda, \qquad \lambda_{2'3} = \frac{\lambda P_1 + \lambda P_2}{P_1 + P_2} = \lambda.$$

[Figure 3.22 diagram: state 0 (no component down), state 1 (component 1 down), state 2 (component 2 down), and state 3 (components 1 and 2 down); failure transitions at rates λ1 and λ2, with no repair transitions.]

Figure 3.23 Markov model with identical failure rates

Figure 3.24 Markov model for a two-component parallel load-sharing system

Now writing the Markov differential equations for time-dependent probabilities and solving them, the following expressions are obtained:

$$P_0(t) = e^{-2\lambda t},$$

$$P_{2'}(t) = 2e^{-\lambda t} - 2e^{-2\lambda t},$$

$$P_3(t) = 1 - 2e^{-\lambda t} + e^{-2\lambda t}.$$
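The merged-state results can be checked numerically. The sketch below is illustrative: it evaluates the three time-dependent probabilities of the merged model and, as an additional assumption not stated in the text, uses the independent-component probabilities of states 1 and 2 (no repair) to evaluate the merging rule for the rate from state 2′ to state 3. The function names are hypothetical.

```python
import math

def merged_probabilities(lam, t):
    """P0, P2' and P3 for two identical non-repairable components."""
    a = math.exp(-lam * t)
    return a * a, 2 * a - 2 * a * a, 1 - 2 * a + a * a

def merged_failure_rate(lam1, lam2, t):
    """Rate from merged state 2' to state 3 via the merging rule (t > 0).

    Assumes independent, non-repairable components, so that
    P1 = (1 - exp(-lam1 t)) exp(-lam2 t) and P2 = exp(-lam1 t) (1 - exp(-lam2 t)).
    """
    q1 = 1 - math.exp(-lam1 * t)
    q2 = 1 - math.exp(-lam2 * t)
    p1 = q1 * (1 - q2)      # state 1: component 1 down, component 2 up
    p2 = (1 - q1) * q2      # state 2: component 2 down, component 1 up
    return (lam2 * p1 + lam1 * p2) / (p1 + p2)

print(merged_probabilities(0.01, 100.0))
print(merged_failure_rate(0.01, 0.03, 100.0))
```

With identical failure rates the merged rate reduces to λ at every t, while for unequal rates it varies with time, which is why merging is simplest for identical states.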

In the case of load-sharing components, the assumption of independence is incorrect. When two components share the same load and one fails, the failure rate of the second component increases because of the stresses due to the additional load on the second component. The Markov model for a two-component parallel load-sharing system is shown in Figure 3.24. This model can be solved by using the methods described earlier.

[Figure 3.24 diagram: states 0, 1, 2, and 3 as in Figure 3.22; first failures occur at rates λ1 and λ2, and the surviving component then fails at the increased rate λ′1 or λ′2.]

[Figure 3.23 diagram: State 0 to State 2′ at rate 2λ, and State 2′ to State 3 at rate λ.]

3.2.2 Safety Modeling

Safety and reliability are both mathematically and philosophically related. A safety problem is created when a critical failure occurs, which is addressed in reliability analysis by, for example, failure mode and effects analysis, and failure mode, effects, and criticality analysis.

Safety-critical systems need adequate fault detection capabilities and, if necessary, subsequent reconfiguration. The probability that a failure will be correctly handled is called the fault coverage, denoted as C.

Inclusion of a coverage factor for a failure-critical component (as in NPPs) yields the Markov model shown in Figure 3.25. If a failure is detected, it leads to a fail-safe (FS) state, or else a fail-unsafe (FU) state. Figure 3.26 shows fault coverage vs. reliability.

Figure 3.25 Markov model for simplex system with coverage modeling

Analysis:

$$P_0(t+\Delta t) = (1 - \lambda\Delta t)\,P_0(t),$$

$$P_{FS}(t+\Delta t) = \lambda\Delta t\,C\,P_0(t) + P_{FS}(t),$$

$$P_{FU}(t+\Delta t) = \lambda\Delta t\,(1-C)\,P_0(t) + P_{FU}(t),$$

$$R(t) = P_0(t),$$

$$S(t) = R(t) + P_{FS}(t),$$

$$\frac{dP_0(t)}{dt} = -\lambda P_0(t),$$

$$\frac{dP_{FS}(t)}{dt} = \lambda C P_0(t),$$

$$\frac{dP_{FU}(t)}{dt} = \lambda(1-C)P_0(t).$$

[Figure 3.25 diagram: state 0 with probability 1 − λΔt of remaining in place; transition λΔtC to state FS and λΔt(1 − C) to state FU; FS and FU each retain the system with probability 1.0.]

Taking the Laplace transform results in the following:

$$P_0(s) = \frac{P_0(0)}{s+\lambda},$$

$$P_{FS}(s) = \frac{\lambda C P_0(0)}{s(s+\lambda)} + \frac{P_{FS}(0)}{s},$$

$$P_{FU}(s) = \frac{\lambda(1-C)P_0(0)}{s(s+\lambda)} + \frac{P_{FU}(0)}{s}.$$

P0(0), PFS(0), and PFU(0), are the initial values of the respective state probabilities. We assume P0(0) = 1; PFS(0) = 0; PFU(0) = 0:

$$P_0(s) = \frac{1}{s+\lambda},$$

$$P_{FS}(s) = \frac{\lambda C}{s(s+\lambda)} = \frac{C}{s} - \frac{C}{s+\lambda},$$

$$P_{FU}(s) = \frac{\lambda(1-C)}{s(s+\lambda)} = \frac{(1-C)}{s} - \frac{(1-C)}{s+\lambda}.$$

The time-domain solutions can now be written as

$$P_0(t) = e^{-\lambda t},$$

$$P_{FS}(t) = C - Ce^{-\lambda t},$$

$$P_{FU}(t) = (1-C) - (1-C)e^{-\lambda t},$$

$$S(t) = P_0(t) + P_{FS}(t) = C + (1-C)e^{-\lambda t}, \qquad S(\infty) = C.$$

Figure 3.26 Fault coverage vs. reliability [plot: reliability against fault coverage C, with C axis values 0.2, 0.5, 0.7, and 1.0, and reliability levels R and 1 − (1 − R)² marked]

For C = 1:

$$S(t) = 1 + (1-1)e^{-\lambda t} = 1.$$

For C = 0:

$$S(t) = 0 + (1-0)e^{-\lambda t} = R(t).$$

In general,

$$S(t) = C + (1-C)R(t) \geq R(t).$$

For example, with C = 0.2,

$$0.2 + 0.8e^{-\lambda t} > 0.2e^{-\lambda t} + 0.8e^{-\lambda t} = e^{-\lambda t} \quad (t > 0).$$
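The time-domain solutions of the coverage model are simple enough to tabulate. The following is a small sketch with a hypothetical function name and illustrative parameter values; it returns the reliability, the two failed-state probabilities, and the safety S(t).

```python
import math

def coverage_model(lam, C, t):
    """Return (R, P_FS, P_FU, S) for the simplex system with coverage C."""
    r = math.exp(-lam * t)          # R(t) = P0(t)
    p_fs = C * (1 - r)              # fail-safe probability
    p_fu = (1 - C) * (1 - r)        # fail-unsafe probability
    return r, p_fs, p_fu, r + p_fs  # S(t) = R(t) + P_FS(t)

# Hypothetical failure rate of 1e-4 per hour over a 1000-hour mission.
for C in (0.0, 0.2, 0.95, 1.0):
    print(C, coverage_model(1e-4, C, 1000.0))
```

The output illustrates that safety equals reliability when C = 0, equals 1 when C = 1, and otherwise lies between the two.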

3.2.2.1 Imperfect Coverage – Two-component Parallel System

The Markov model including coverage for a two-component system is shown in Figure 3.27.

Figure 3.27 Markov model for two-component system with coverage

Assume that the initial state 2 is such that

P2(0) = 1, P0(0) = P1(0) = 0.

The system differential equations are

$$\frac{dP_2(t)}{dt} = -2\lambda c\,P_2(t) - 2\lambda(1-c)\,P_2(t) + \mu P_1(t),$$

$$\frac{dP_1(t)}{dt} = 2\lambda c\,P_2(t) - (\lambda+\mu)\,P_1(t),$$

$$\frac{dP_0(t)}{dt} = 2\lambda(1-c)\,P_2(t) + \lambda P_1(t).$$

[Figure 3.27 diagram: state 2 to state 1 at rate 2Cλ, state 2 to state 0 at rate 2λ(1 − C), state 1 to state 0 at rate λ, and state 1 to state 2 at rate μ.]

The above equations are solved to get the expressions for probabilities of different states. Reliability is the sum of probabilities of states 1 and 2: R(t) = P2(t) + P1(t). From R(t), we can obtain the system MTTF as

$$MTTF = \int_0^\infty R(t)\,dt = \frac{\lambda(1+2c) + \mu}{2\lambda[\lambda + \mu(1-c)]}.$$

It should be clear that the system MTTF and system reliability are critically dependent on the coverage factor. As coverage increases, the MTTF also increases.
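The MTTF expression can be cross-checked without integrating R(t). The sketch below is an illustrative alternative, not the derivation used in the text: it writes first-step equations for the mean time to reach state 0 from states 2 and 1 of Figure 3.27 and compares the result with the closed form. The function names and rates are assumptions.

```python
def mttf_closed_form(lam, mu, c):
    """MTTF of the two-component parallel system with coverage c."""
    return (lam * (1 + 2 * c) + mu) / (2 * lam * (lam + mu * (1 - c)))

def mttf_first_step(lam, mu, c):
    """Mean time to absorption in state 0, starting from state 2.

    T2 = 1/(2 lam) + c * T1              (leave state 2; go to state 1 with prob. c)
    T1 = 1/(lam + mu) + mu/(lam + mu) * T2
    """
    k = mu / (lam + mu)
    return (1 / (2 * lam) + c / (lam + mu)) / (1 - c * k)

# Hypothetical rates (per hour), swept over several coverage values.
for c in (0.0, 0.5, 0.9, 1.0):
    print(c, mttf_closed_form(1e-3, 1e-1, c), mttf_first_step(1e-3, 1e-1, c))
```

Two limiting cases are useful sanity checks: c = 0 gives 1/(2λ), and c = 1 with μ = 0 gives 3/(2λ), the MTTF of a non-repairable two-unit parallel system.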

Example 9 A simplex computer system has failure rate λ and a fault detection coverage factor C. The fault detection capability is the result of the self-diagnostics that are run continuously. If the self-diagnostics detect a fault, the time required to repair the system is 24 hours because the faulty board is identified, obtained overnight, and easily replaced. If, however, the self-diagnostics do not detect the fault, the time required to repair the system is 72 hours because the repair person must visit the site, determine the problem, and perform the repair. The disadvantage is that the inclusion of self-diagnostics results in the failure rate becoming αλ. In other words, the failure rate is increased by a factor of α because of the self-diagnostics. Find the value of α for a coverage factor of 0.95 at which including the self-diagnostics begins to degrade the availability of the system.

Solution:

The Markov model is shown in Figure 3.28.

Figure 3.28 Markov model for a simplex computer system

[Figure 3.28 diagram: state 1 (Up), state 2 (FS), and state 3 (FU); Up to FS at rate αλC, Up to FU at rate αλ(1 − C), FS to Up at rate μS = 1/24, and FU to Up at rate μu = 1/72.]

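Availability = P1,

$$[\dot{P}(t)] = [P(t)][A],$$

$$\begin{bmatrix} \dot{P}_1(t) & \dot{P}_2(t) & \dot{P}_3(t) \end{bmatrix} = \begin{bmatrix} P_1(t) & P_2(t) & P_3(t) \end{bmatrix} \begin{pmatrix} -\alpha\lambda & \alpha\lambda C & \alpha\lambda(1-C) \\ \dfrac{1}{24} & -\dfrac{1}{24} & 0 \\ \dfrac{1}{72} & 0 & -\dfrac{1}{72} \end{pmatrix}.$$

For availability, $[\dot{P}(t)] = 0$

$$\Rightarrow -\alpha\lambda P_1 + \frac{1}{24}P_2 + \frac{1}{72}P_3 = 0,$$

$$\alpha\lambda C P_1 - \frac{1}{24}P_2 = 0,$$

$$\alpha\lambda(1-C)P_1 - \frac{1}{72}P_3 = 0.$$

Also $P_1 + P_2 + P_3 = 1$,

$$P_2 = 24\alpha\lambda C P_1, \qquad P_3 = 72\alpha\lambda(1-C)P_1$$

$$\Rightarrow P_1 + 24\alpha\lambda C P_1 + 72\alpha\lambda(1-C)P_1 = 1$$

$$\Rightarrow P_1 = \frac{1}{1 + 24\alpha\lambda[3 - 2C]}. \tag{3.21}$$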

When there is no coverage (self-diagnostics) the model reduces as shown in Figure 3.29.

$$\mu = \frac{1}{72},$$

$$\text{Availability} = \frac{\mu}{\mu+\lambda} = \frac{1}{1 + 72\lambda}. \tag{3.22}$$

Equating Equations 3.21 and 3.22 with C = 0.95:

$$\frac{1}{1 + 24\alpha\lambda[3 - 2C]} = \frac{1}{1 + 72\lambda}$$

$$\Rightarrow 26.4\,\alpha\lambda = 72\lambda \;\Rightarrow\; \alpha = \frac{72}{26.4} = 2.7272.$$

Figure 3.29 Markov model for a computer system without self-diagnostics

Generalized Expression

$$-\alpha\lambda P_1 + \mu_S P_2 + \mu_u P_3 = 0,$$

$$\alpha\lambda C P_1 - \mu_S P_2 = 0,$$

$$\alpha\lambda(1-C)P_1 - \mu_u P_3 = 0,$$

$$P_1 + P_2 + P_3 = 1,$$

$$\mu_S P_2 = \alpha\lambda C P_1 \;\Rightarrow\; P_2 = \frac{\alpha\lambda C P_1}{\mu_S},$$

$$\alpha\lambda(1-C)P_1 = \mu_u P_3 \;\Rightarrow\; P_3 = \frac{\alpha\lambda(1-C)P_1}{\mu_u}$$

$$\Rightarrow P_1\left[1 + \frac{\alpha\lambda C}{\mu_S} + \frac{\alpha\lambda(1-C)}{\mu_u}\right] = 1,$$

$$P_1\left[\frac{\mu_S\mu_u + \alpha\lambda C\mu_u + \alpha\lambda(1-C)\mu_S}{\mu_S\mu_u}\right] = 1$$

$$\Rightarrow P_1 = \frac{\mu_S\mu_u}{\mu_S\mu_u + \alpha\lambda C\mu_u + \alpha\lambda(1-C)\mu_S}.$$

[Figure 3.29 diagram: up state to down state at rate λ, and down state to up state at rate μ.]

For no coverage case,

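$$P_1 = \frac{\mu_u}{\mu_u + \lambda}$$

$$\Rightarrow \frac{\mu_S\mu_u}{\mu_S\mu_u + \alpha\lambda C\mu_u + \alpha\lambda(1-C)\mu_S} = \frac{\mu_u}{\mu_u + \lambda}$$

$$\Rightarrow (\mu_u + \lambda)\mu_S = \mu_S\mu_u + \alpha\lambda C\mu_u + \alpha\lambda(1-C)\mu_S$$

$$\Rightarrow \alpha\big[(1-C)\lambda\mu_S + C\lambda\mu_u\big] = \big[\mu_S\mu_u + \lambda\mu_S - \mu_S\mu_u\big]$$

$$\Rightarrow \alpha = \frac{\mu_S\lambda}{(1-C)\lambda\mu_S + C\lambda\mu_u}.$$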
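The generalized expression can be checked against the numbers of Example 9. In the sketch below, λ cancels from the break-even expression, so α depends only on C, μS, and μu; the function names are hypothetical and the value of λ used in the availability check is an arbitrary illustrative choice.

```python
def availability_with_diagnostics(lam, alpha, C, mu_s, mu_u):
    """P1 for the three-state model with self-diagnostics."""
    den = mu_s * mu_u + alpha * lam * C * mu_u + alpha * lam * (1 - C) * mu_s
    return mu_s * mu_u / den

def availability_without_diagnostics(lam, mu_u):
    """P1 for the two-state model without self-diagnostics."""
    return mu_u / (mu_u + lam)

def break_even_alpha(C, mu_s, mu_u):
    """Value of alpha at which both availabilities coincide (lambda cancels)."""
    return mu_s / ((1 - C) * mu_s + C * mu_u)

alpha = break_even_alpha(0.95, 1 / 24, 1 / 72)
print(alpha)
print(availability_with_diagnostics(1e-3, alpha, 0.95, 1 / 24, 1 / 72),
      availability_without_diagnostics(1e-3, 1 / 72))
```

For values of α above the break-even point, the added failure rate outweighs the benefit of faster repair and the self-diagnostics reduce availability.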

Example 10 Construct a Markov model of the four-component system governed by the RBD in Figure 3.30. P1, P2, and P3 are three copies of a processor. M is a memory chip. Assume that the components are independent and non-repairable. The failure rate of the processors 1, 2, and 3 is λp. The failure rate of the memory is λm.

Figure 3.30 RBD of four-component system

Solution: If a detailed state space diagram were drawn, there would be a total of 2^4 = 16 states. Using the concept of merging, a simplified diagram (Figure 3.31) can be obtained. Merging yields rules of thumb that can be applied to obtain simplified models directly by inspection.

State description:
1. All components work.
2. One of the processors failed.
3. Two of the processors failed.
4. Three processors failed, or memory failed, or both.

[Figure 3.30 diagram: processor blocks P1, P2, and P3 together with memory block M.]

Figure 3.31 Markov model for four-component system
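The merging step of Example 10 can be reproduced by enumeration. The following sketch is illustrative and rests on the state description given in the solution (the system is failed when all three processors or the memory have failed): it enumerates the 16 detailed states, maps each to merged states 1 to 4, and confirms that all detailed states within a merged class have the same outgoing rates. The function names are hypothetical.

```python
from itertools import product

def merged_state(state):
    """Map a detailed state (p1, p2, p3, m), 1 = working, to merged states 1-4."""
    *processors, memory = state
    failed = processors.count(0)
    if memory == 0 or failed == 3:
        return 4                      # system failed
    return failed + 1                 # zero, one, or two processors failed

def merged_rates(lam_p, lam_m):
    """Aggregate the 16-state model into rates between merged states."""
    rates = {}
    for state in product((0, 1), repeat=4):
        src = merged_state(state)
        if src == 4:
            continue                  # failed state is absorbing (non-repairable)
        out = {}
        for k, working in enumerate(state):
            if working:
                nxt = state[:k] + (0,) + state[k + 1:]
                dst = merged_state(nxt)
                out[dst] = out.get(dst, 0.0) + (lam_m if k == 3 else lam_p)
        # Merging is exact only if every detailed state in a class agrees.
        if rates.setdefault(src, out) != out:
            raise ValueError(f"states merged into {src} are not identical")
    return rates

print(merged_rates(1.0, 0.5))
```

With processor rate λp and memory rate λm, the merged rates come out as 3λp, 2λp, and λp + λm along the chain 1, 2, 3, 4, plus λm from states 1 and 2 directly to state 4.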

3.2.2.2 Modeling of Fault-tolerant Systems

The most famous of fault-tolerant systems is the triple modular redundant (TMR) system (Figures 3.32, 3.33). It masks faults by providing three modules and requiring that any two should agree (it is assumed that agreement implies correct operation). The voting arrangement returns, as its output, the majority view from the three modules running in parallel. The three redundant components all perform the same task and the voter selects the correct output from among the three redundant outputs. As long as at least two of the redundant components are operating correctly and the voter does not fail, the TMR configuration operates correctly. Passive fault tolerance is embedded into its design. It can be easily analyzed by Markov modeling.

Figure 3.32 TMR arrangement

Figure 3.33 RBD of a TMR system

[Figure 3.31 diagram: state 1 to state 2 at rate 3λp, state 2 to state 3 at rate 2λp, state 3 to state 4 at rate λp + λm, and states 1 and 2 each to state 4 at rate λm.]

[Figure 3.32 diagram: Module 1, Module 2, and Module 3 feeding a voter.]

[Figure 3.33 diagram: blocks A, B, and C followed by a perfect voter.]

A detailed state space diagram of a TMR is shown in Figure 3.34. The failure rates of all three components are assumed to be the same. Note: state 111 implies that components A, B, and C are all working. State 100 implies that component A is working while B and C are not working, and so on.

The state space can be reduced by combining identical states using the concept of merging, as explained earlier. A reduced Markov model is shown in Figure 3.35. Assume that all components have the same failure rate λ. In state 1, all components are operational. Failure of any one of the three components triggers a transition from the initial state to state 2, which represents a system configuration with two operational components and one failed component. Furthermore, any component failure in state 2 leads to state 3, the system failed state, according to the failure criteria of a TMR system. The voter is assumed to be perfectly reliable here.

Markov analysis yields:

$$R_{TMR}(t) = 3e^{-2\lambda t} - 2e^{-3\lambda t},$$

$$MTTF_{TMR} = \int_0^\infty \left(3e^{-2\lambda t} - 2e^{-3\lambda t}\right)dt = \frac{5}{6\lambda}.$$

Note that MTTF_TMR < MTTF_SIMPLEX, since MTTF_SIMPLEX = 1/λ.
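The comparison between TMR and simplex reliability can be made concrete. The sketch below is illustrative: the crossover time ln 2/λ is not stated in the text but follows from setting 3x² − 2x³ = x with x = e^{−λt}, which gives x = 1/2. Function names and the failure rate are assumptions.

```python
import math

def r_simplex(lam, t):
    """Reliability of a single module."""
    return math.exp(-lam * t)

def r_tmr(lam, t):
    """Reliability of the TMR system with a perfect voter."""
    return 3 * math.exp(-2 * lam * t) - 2 * math.exp(-3 * lam * t)

def mttf_tmr(lam):
    return 5 / (6 * lam)

def crossover_time(lam):
    """Time at which R_TMR(t) = R_SIMPLEX(t) = 0.5 (derived, see lead-in)."""
    return math.log(2) / lam

lam = 1e-3  # hypothetical failure rate per hour
for t in (100.0, crossover_time(lam), 2000.0):
    print(t, r_tmr(lam, t), r_simplex(lam, t))
print(mttf_tmr(lam), 1 / lam)
```

Before the crossover the TMR system is more reliable than a single module; afterwards it is less reliable, which is consistent with its lower MTTF.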

Figure 3.34 Markov model of TMR system

[Figure 3.34 diagram: perfect state 111; one module failed: 110, 101, 011; two modules failed: 100, 010, 001; three modules failed: 000. Each module failure is a transition at rate λ; the states with two or three failed modules are marked "System Failed".]

Figure 3.35 Simplified Markov model of TMR system

Though TMR is an important redundancy technique in the design of reliable digital systems, it has one major disadvantage in that the reliability improvement, compared to that of a single unit, is prevalent for a relatively short period of time (Figure 3.36), i.e., TMRs are highly suited for situations where relatively short mission times are intended.

Figure 3.36 TMR vs. simplex

3.3 Fault Tree Analysis

FTA is a failure-oriented, deductive, top-down approach, which considers an undesirable event associated with the system as the top event; the various possible combinations of fault events leading to the top event are represented with logic gates.

The fault tree is a qualitative model which provides useful information on the various causes of undesired top events. However, quantification of the fault tree provides the top-event occurrence probability and the critical contribution of the basic causes and events. The fault tree approach is widely used in probabilistic safety assessment.

The faults can be events that are associated with component hardware failure, software error, human errors, or any other relevant events which can lead to top events. The gates show the relationships of faults (or events) needed for the occurrence of a higher event. The gates thus serve to permit or inhibit the fault logic up the tree. The gate symbol denotes the type of relationship of the input (lower) events required for the output (higher) event.

3.3.1 Procedure for Carrying out Fault Tree Analysis

The procedure for carrying out FTA is shown as a flow chart in Figure 3.37 [2, 3].

3.3.1.1 System Awareness and Details

Thorough understanding of the system is the prerequisite for doing FTA. System awareness through discussion with designers, and with operating and maintenance engineers, is very important; plant or system visits will enhance it further. Input information such as the following should be collected and studied:

• design basis reports;
• safety analysis reports (deterministic);
• technical specification report (for testing and maintenance information);
• history cards, maintenance cards, and safety-related unusual occurrence reports for obtaining failure or repair data.

3.3.1.2 Defining Objectives, Top Event, and Scope of Fault Tree Analysis

Objectives are defined in consultation with the decision makers or managers who commissioned the FTA. Though the general objective may be evaluation of the current design or comparison of alternative designs, particular objectives should be explicitly defined in terms of system failure.

The top event of the fault tree is the event which is analyzed to find all credible ways in which it could be brought about. The failure probability is determined for the defined top event. The top event is defined based on the objectives of the analysis.

More than one top event may be required to meet the objectives. In such cases, separate top events are defined.

Lack of proper understanding of the objectives may lead to an incorrect definition of the top event, which will result in wrong decisions being made. Hence it is extremely important to define and understand the objectives of the analysis. After identifying the top event from the objectives, the scope of the analysis is defined. The scope of the FTA specifies which of the failures and contributors are to be included in the analysis. It mainly includes the boundary conditions for the analysis. The boundary conditions comprise the initial states of the subsystems and the assumed inputs to the system. Interfaces to the system such as power sources or water supplies are typically included in the analysis; their states need to be identified and mentioned in the assumptions.

3.3.1.3 Construction of the Fault Tree

The basic principle in constructing a fault tree is “consider shortsightedly.” The immediate events or causes are identified for the event that is analyzed. The analysis does not jump to the basic causes of the event. Instead, a small step is taken and the necessary and sufficient immediate events are identified. This taking of small steps backwards assures that all of the relationships and primary consequences will be revealed. This backward stepping ends when the basic events that constitute the resolution of the analysis have been identified. Fault trees are developed to a level of detail where the best failure probability data are available. The terminology and basic building blocks of fault trees are explained in the next section.

3.3.1.4 Qualitative Evaluation of the Fault Tree

The qualitative evaluation transforms the fault tree into logically equivalent forms that provide more focused information. It provides information on the minimal cut sets of the top event. A minimal cut set is a smallest combination of basic events that results in the occurrence of the top event. The basic events are the bottom events of the fault tree. Hence, the top event is represented by its set of minimal cut sets. Success sets that guarantee prevention of the top event may also be identified.

Methods of obtaining the minimal cut sets are explained in the subsequent sections.

3.3.1.5 Data Assessment and Parameter Estimation

This step aims at acquiring and generating all information necessary for the quantitative evaluation of the fault tree.

The tasks of this step include the following considerations:

• identification of the various models that describe the stochastic nature of certain phenomena related to the events of interest and the corresponding parameters that need to be estimated;
• determination of the nature and sources of relevant data;
• compilation and evaluation of the data to produce the necessary parameter estimations and associated uncertainties.

3.3.1.6 Quantitative Evaluation of the Fault Tree

Fault trees are quantified by first calculating the probability of each minimal cut set and then by summing all the cut-set probabilities. The quantitative evaluation produces the probability of the top event. This determines dominant cut sets and also identifies important basic events that contribute to the top event.

Sensitivity studies and uncertainty propagation provide further key information.

Identification of important basic events is very useful for decision making in resource allocation and trade-off studies. Better surveillance, maintenance, and replacement can be focused on the critical events for cost-effective management of reliability or risk.
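The quantification step described above, summing the minimal cut-set probabilities, can be sketched in a few lines. This is an illustrative sketch only: it assumes independent basic events, and the event names, probabilities, and cut sets are hypothetical data invented for the example. Summing cut-set probabilities is an approximation that works well when the basic-event probabilities are small.

```python
from math import prod

def cut_set_probability(cut_set, p):
    """Probability that every basic event of one minimal cut set occurs
    (assumes independent basic events)."""
    return prod(p[event] for event in cut_set)

def top_event_probability(cut_sets, p):
    """Top-event probability as the sum of the cut-set probabilities."""
    return sum(cut_set_probability(cs, p) for cs in cut_sets)

def rank_cut_sets(cut_sets, p):
    """Cut sets ordered from most to least probable (dominant first)."""
    return sorted(cut_sets, key=lambda cs: cut_set_probability(cs, p), reverse=True)

# Hypothetical basic-event probabilities and minimal cut sets.
basic_event_p = {"PUMP": 1e-2, "VALVE": 2e-2, "POWER": 1e-3}
minimal_cut_sets = [("PUMP", "VALVE"), ("POWER",)]

print(top_event_probability(minimal_cut_sets, basic_event_p))
print(rank_cut_sets(minimal_cut_sets, basic_event_p))
```

Ranking the cut sets in this way identifies the dominant contributors on which surveillance and maintenance effort can be focused.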

3.3.1.7 Interpretation and Presentation of the Results

It is very important to interpret the results of the analysis and present them to the decision makers in an effective manner. FTA should not be limited to documentation and sets of numerical values. The FTA results must be interpreted to provide tangible implications, especially concerning the potential impact upon the objectives.

3.3.1.8 Important Points to Be Considered while Constructing Fault Trees

The following issues should be considered carefully while carrying out FTA:

• To maintain consistency and traceability, all the assumptions and simplifications made during the analysis should be well documented.

• To ensure quality, consistency, and efficiency, standard computer codes should be used.

• To ensure the clarity and ease of identification of events, a standardized format needs to be adopted while giving the names in the fault tree for intermediate and basic events. The format should include specific component type and identification, the specific system in which the component is located, and component failure mode. However, the formatting should be compatible with the computer code adopted.

• To avoid double counting and/or complete omission of systems/interfaces/support systems, it is strongly recommended that explicit definitions of boundary conditions should be established and documented.

• It is important to see whether protective systems or testing practices may induce failures. If such failure causes are possible, they need to be considered in the analysis.

• The following aspects should also be considered:

– human reliability issues;
– operator recovery actions;
– dependent and common-cause failures;
– external environment impact (fire, flood, seismic, and missile attack).

Figure 3.37 Procedure for carrying out FTA

[Figure 3.37 flow chart, in order: system awareness and details; define objectives, top event, and scope of analysis; construct the fault tree; qualitative evaluation of fault tree; data assessment and parameter estimation; quantitative evaluation of fault tree; interpretation and presentation of the results.]

3.3.2 Elements of Fault Tree

A typical fault tree is shown in Figure 3.38. It depicts various combinations of events leading person X to arrive late at their office.

It is essential to understand some of the terms that are used in FTA:

• Basic event: the initiating fault event that requires no further development.
• Intermediate event: a failure resulting from the logical interaction of primary failures.
• Top event: an undesired event for the system under consideration which occurs as a result of the occurrence of several intermediate events. Several combinations of primary failures lead to the event.

[Figure 3.38 diagram, event labels: TOP "Late to office"; A "Oversleep"; T "Transport failure"; B "No wakeup pulse"; NA "Natural apathy"; P "Public transport fails"; PV "Personal vehicle fails"; C "Artificial wakeup fails"; BI "Biorhythm fails"; AL "Alarm clock fails"; HE "Forget to set"; ST "Public strike"; PVB "Public vehicle breakdown".]

Figure 3.38 Fault tree for the top event “arrive late at office”


The symbols used in fault trees for representing events and their relations have been more or less standardized. The symbols can be classified into three types, viz., event symbols (Table 3.3), gate symbols (Table 3.4), and transfer symbols (Table 3.5).

Table 3.3 Event symbols

Name | Symbol | Description
Basic event | [graphic] | A basic initiating fault requiring no further development
Undeveloped event | [graphic] | An event which is not further developed, either because it is of insufficient consequence or because information is unavailable
House event | [graphic] | An event which is normally expected to occur
Conditional event | [graphic] | Specific conditions or restrictions that apply to any logic gate

Table 3.4 Gate symbols

| Name | Description | Truth table (one row per entry) |
|---|---|---|
| AND gate | Output fault occurs if all of the input faults occur | A B o/p: 0 0 0; 0 1 0; 1 0 0; 1 1 1 |
| Priority gate | Output fault occurs if all the input faults occur in a specific sequence | A B o/p: 0 0 0; 0 1 0; 1 0 0; 1 (first) 1 (second) 1; 1 (second) 1 (first) 0 |
| OR gate | Output fault occurs if at least one of the input faults occurs | A B o/p: 0 0 0; 0 1 1; 1 0 1; 1 1 1 |
| Voting gate | Output fault occurs if at least k out of m input faults occur (truth table shown for k = 2, m = 3) | A B C o/p: 0 0 0 0; 0 0 1 0; 0 1 0 0; 0 1 1 1; 1 0 0 0; 1 0 1 1; 1 1 0 1; 1 1 1 1 |
| EXOR gate | Output fault occurs if exactly one of the input faults occurs | A B o/p: 0 0 0; 0 1 1; 1 0 1; 1 1 0 |
| Inhibit gate | Output fault occurs if the (single) input fault occurs in the presence of an enabling condition | A B o/p: 0 0 0; 0 1 0; 1 0 0; 1 1 1 |
| INV gate | Output event is true if and only if input event is false | A o/p: 0 1; 1 0 |
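The gate definitions of Table 3.4 can be written as small Boolean functions. The following is a minimal Python sketch; the function names and the use of occurrence times for the priority gate are illustrative assumptions, not notation from the text.

```python
from itertools import product

def and_gate(*inputs):
    """Output fault occurs only if all input faults occur."""
    return int(all(inputs))

def or_gate(*inputs):
    """Output fault occurs if at least one input fault occurs."""
    return int(any(inputs))

def voting_gate(k, *inputs):
    """Output fault occurs if at least k of the input faults occur."""
    return int(sum(inputs) >= k)

def exor_gate(a, b):
    """Output fault occurs if exactly one of the two input faults occurs."""
    return int(a != b)

def inhibit_gate(fault, condition):
    """Output fault occurs if the input fault occurs while the condition holds."""
    return int(fault and condition)

def inv_gate(a):
    """Output is true if and only if the input is false."""
    return int(not a)

def priority_gate(times):
    """times: occurrence time of each input fault, listed in the required order.
    None means the fault has not occurred (an assumed convention)."""
    if any(t is None for t in times):
        return 0
    return int(all(first < second for first, second in zip(times, times[1:])))

# Print the truth table of a 2-out-of-3 voting gate
for a, b, c in product((0, 1), repeat=3):
    print(a, b, c, voting_gate(2, a, b, c))
```

The priority gate needs the order of occurrence, which a plain 0/1 state cannot carry, so occurrence times are passed instead.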


Table 3.5 Transfer symbols

| Name | Description |
|---|---|
| Transfer-in | Indicates that the tree is developed further at the occurrence of the corresponding transfer-out (e.g., on another page) |
| Transfer-out | Indicates that this portion of the tree must be attached at the corresponding transfer-in |

3.3.3 Evaluation of Fault Tree

The evaluation of a fault tree includes both qualitative and quantitative evaluation. The top event is expressed as a function of the minimal cut sets with the help of Boolean algebra. Later, by applying probability over the Boolean expression and substituting the respective basic event probability values, the quantification is carried out. There is a one-to-one correspondence between the fault tree gate representation and Boolean operations. Boolean algebra is explained in Chapter 2.

In the development of any fault tree, the OR gate and the AND gate are often present. Both are explained here to obtain basic probability expressions.

3.3.3.1 AND Gate

This gate allows the output event to occur only if all the input events occur, representing the intersection of the input events. The AND gate is equivalent to the Boolean symbol “⋅”.

For example, an AND gate with two input events A and B and output event T can be represented by its equivalent Boolean expression, T = A ⋅ B. (The symbol “⋅” will be omitted subsequently.)

A realistic example is power supply failure to a personal computer due to the occurrence of two events: failure of main supply and UPS failure (Figure 3.39).


Figure 3.39 Example for AND gate

The probability formula for the top event T is given by

$$P(T) = P(A \cap B) = P(A)\,P(B/A) = P(B)\,P(A/B).$$

If A and B are independent events, then P(A/B) = P(A) and P(B/A) = P(B), so that

$$P(T) = P(A)\,P(B).$$

When A and B are completely dependent (if event A occurs, B will also occur),

$$P(T) = P(A).$$

In the case of any partial dependency, one can give the bounds for P(T) as

$$P(A)\,P(B) < P(T) < P(A) \ \text{or} \ P(B). \tag{3.23}$$

Generalizing for n input events, for the independent case,

$$P(T) = P(E_1)\,P(E_2) \cdots P(E_n), \tag{3.24}$$

where Ei, i = 1, 2, …, n are input events. When the Ei’s are not independent,

$$P(T) > P(E_1)\,P(E_2) \cdots P(E_n).$$
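As an illustration of the AND gate expressions, the following minimal Python sketch evaluates the independent, completely dependent, and partially dependent cases for the power supply example. The probability values and function names are hypothetical assumptions chosen only for the example.

```python
from math import prod

def and_gate_independent(probabilities):
    """P(T) for an AND gate whose input events are independent."""
    return prod(probabilities)

def and_gate_two_inputs(p_a, p_b_given_a):
    """P(T) = P(A) P(B/A) for two possibly dependent input events."""
    return p_a * p_b_given_a

# Hypothetical values: main supply failure (A) and UPS failure (B)
p_main, p_ups = 1e-3, 1e-2

independent = and_gate_independent([p_main, p_ups])   # P(A) P(B)
dependent = and_gate_two_inputs(p_main, 1.0)          # B certain once A occurs
partial = and_gate_two_inputs(p_main, 0.2)            # assumed P(B/A) = 0.2

print(independent, partial, dependent)
```

With these assumed numbers the partially dependent result lies between the independent product and P(A), in line with the bounds quoted for partial dependency.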

[Figure 3.39 labels: AND gate with output “Power supply failure to PC (T)” and inputs “Main supply failure (A)” and “UPS failure (B)”.]


3.3.3.2 OR Gate

This gate allows the output event to occur if one or more of the input events occur, representing the union of input events. The OR gate is equivalent to the Boolean symbol “+”.

For example, an OR gate with two input events A and B and the output event T can be represented by its equivalent Boolean expression, T = A + B.

A practical example for the OR gate is where a diesel generator (DG) did not start on demand due to actuation failure, or the DG was already in a failed condition prior to demand, or both (Figure 3.40).

The probability formula for the top event T is given by

$$\begin{aligned} P(T) &= P(A + B) \\ &= P(A) + P(B) - P(A \cap B), \end{aligned}$$

where P(A ∩ B) is equivalent to the output from an AND gate. This can be rearranged as

$$P(T) = 1 - P(\bar{A} \cap \bar{B}),$$

where Ā and B̄ denote the non-occurrence of events A and B respectively. If the input events are mutually exclusive, then

$$P(T) = P(A) + P(B).$$

If the event B is completely dependent on event A, then

$$P(T) = P(B).$$

In the case of any partial dependency, one can give bounds on P(T) as

$$P(A) + P(B) - P(A)\,P(B) < P(T) < P(B) + P(A). \tag{3.25}$$
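The equivalence of the two OR gate forms can be checked with a short worked example. It assumes independent events and the hypothetical values P(A) = 0.01 and P(B) = 0.02, neither of which comes from the text.

$$\begin{aligned}
P(T) &= 1 - P(\bar{A} \cap \bar{B}) = 1 - \bigl(1 - P(A)\bigr)\bigl(1 - P(B)\bigr) \\
     &= P(A) + P(B) - P(A)\,P(B), \\
P(T) &= 1 - (0.99)(0.98) = 1 - 0.9702 = 0.0298, \\
P(T) &= 0.01 + 0.02 - (0.01)(0.02) = 0.0298.
\end{aligned}$$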

Figure 3.40 Example for OR gate

[Figure 3.40 labels: OR gate with output “DG did not start” and inputs “Actuation failure” and “DG failed in standby”.]


Generalizing for n input events,

$$\begin{aligned} P(T) &= 1 - P(\bar{E}_1 \cap \bar{E}_2 \cap \bar{E}_3 \cap \cdots \cap \bar{E}_n) \\ &= \sum_{i=1}^{n} P(E_i) - \sum_{i<j} P(E_i \cap E_j) + \cdots + (-1)^{n-1} P(E_1 \cap E_2 \cap \cdots \cap E_n). \end{aligned} \tag{3.26}$$

If the event probabilities are low (say P(Ei) < 0.1) and the events are independent, then P(T) can be approximated by

$$P(T) \approx \sum_{i=1}^{n} P(E_i).$$

This is known as the rare-event approximation. When the Ei’s are not independent,

$$P(T) = 1 - P(\bar{E}_1)\,P(\bar{E}_2/\bar{E}_1) \cdots P(\bar{E}_n/\bar{E}_1 \cap \bar{E}_2 \cap \cdots \cap \bar{E}_{n-1}) \le \sum_{i=1}^{n} P(E_i). \tag{3.27}$$
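To see how close the rare-event approximation is to the exact OR gate result for independent events, the two can be compared numerically. This is a minimal Python sketch; the three input probabilities are hypothetical and the function names are illustrative.

```python
from math import prod

def or_gate_exact(probabilities):
    """P(T) = 1 - product of (1 - P(Ei)), valid for independent input events."""
    return 1.0 - prod(1.0 - p for p in probabilities)

def or_gate_rare_event(probabilities):
    """Rare-event approximation: P(T) is taken as the sum of the P(Ei)."""
    return sum(probabilities)

p_inputs = [0.01, 0.02, 0.005]   # hypothetical basic event probabilities
exact = or_gate_exact(p_inputs)
approx = or_gate_rare_event(p_inputs)
print(f"exact = {exact:.6f}, rare-event = {approx:.6f}")
```

For these assumed values the approximation overestimates the exact result by about one percent, which is the usual conservative behavior of the sum.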

Prior to obtaining the quantitative reliability parameter results for the fault tree, repetition of basic events and redundancies must be eliminated.

If the calculations are carried out directly on the fault tree without simplifying, the quantitative values will be incorrect. The simplification is achieved by obtaining minimal cut sets using Boolean algebra rules or the algorithms developed for them.

There are many methods available in the literature, for example, Vesely, Fussell, Kumamoto, Rasmuson. However, methods based on top-down or bottom-up successive substitution and Monte Carlo simulation are most often used. The latter is a numerical computer-based approach. The top-down successive substitution method can also be done by simple hand calculations. In this method, the equivalent Boolean representation of each gate in the fault tree is obtained such that only basic events remain. Various Boolean algebra rules are used to reduce the Boolean expression to its most compact form. The substitution process can proceed from the top of the tree to the bottom or vice versa. The distributive law, the laws of idempotence, and the law of absorption are extensively used in these calculations. The final expression thus obtained has minimal cut sets which are in the form of a sum of products, and can be written in the general form

$$T = \sum_{i=1}^{n} \prod_{j=1}^{m_i} E(i, j), \tag{3.28}$$

where n is the number of minimal cut sets, and mi is the number of basic events in the ith minimal cut set.
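The sum-of-products form (3.28) can be evaluated directly: the top event occurs when every basic event of at least one minimal cut set has occurred. The following minimal Python sketch shows this; the cut sets and event names are hypothetical.

```python
def top_event_occurs(minimal_cut_sets, failed_events):
    """Evaluate T as a sum (OR) of products (AND) of basic events."""
    failed = set(failed_events)
    return any(set(cut_set) <= failed for cut_set in minimal_cut_sets)

# Hypothetical example: T = E1 + E2 E3
cut_sets = [{"E1"}, {"E2", "E3"}]

print(top_event_occurs(cut_sets, {"E2"}))         # one event of a double cut set
print(top_event_occurs(cut_sets, {"E2", "E3"}))   # complete double cut set
```

A single-order cut set such as {"E1"} triggers the top event on its own, which is why such cut sets deserve particular attention.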


Any fault tree will consist of a finite number of minimal cut sets that are unique for that top event. If there are single-order cut sets, then those single failures will lead to the occurrence of the top event.

3.3.4 Case Study

The basic aspects of FTA can be explained through an example of a containment spray system which is used to scrub and cool the atmosphere around a nuclear reactor during an accident. It is shown in Figure 3.41.

Any one of the pumps and one of the two discharge valves (V1 and V2) is sufficient for its successful operation. To improve the reliability, an interconnecting valve (V3), which is normally closed, is provided. The system is simplified here; the actual system contains more valves.

3.3.4.1 Step 1 – Defining Top Event

The undesired top event is defined as “no water for cooling containment.”

3.3.4.2 Step 2 – Construction of the Fault Tree

The fault tree is developed deductively to identify possible events leading to the top event. These may be:

• No water from V1 branch and V2 branch.
• No supply to V1, or V1 itself failed. Since V1 failure is a basic event, it doesn’t need further analysis.
• The lack of supply to V1 is due to simultaneous failure of P1 branch and V3 branch.
• The lack of supply from V3 branch is due to failure of either V3 or P2.
• Similarly V2 branch is also developed.

Figure 3.41 Containment spray system of NPP

[Figure 3.41 labels: water tank, pumps P1 and P2, discharge valves V1 and V2, interconnecting valve V3.]


The resulting fault tree is shown in Figure 3.42.

[Figure 3.42 event labels: T, no water for cooling; A, no water from V1 branch; B, no water from V2 branch; C, no water to V1; D, no water to V2; E, no water from V3; F, no water from V2; basic events: V1 fails, V2 fails, V3 fails, P1 fails, P2 fails.]

Figure 3.42 Fault tree for containment spray system

3.3.4.3 Step 3 – Qualitative Evaluation

The qualitative evaluation of the fault tree determines minimal cut sets of the fault tree. One can write the logical relationship between various events of the fault tree as follows:

T = AB,
A = C + V1,
C = EP1,
E = V3 + P2,
B = D + V2,
D = FP2,
F = V3 + P1,


where:

• T is the top event;
• A, B, C, D, E, F are intermediate events;
• P1, P2, V1, V2, V3 are basic events.

First the top-down substitution will be performed, starting with the top-event equation and substituting and expanding until the minimal cut-set expression for the top event is obtained.

Substituting for A and B and expanding produces

$$\begin{aligned} T &= (C + V_1)(D + V_2) \\ &= CD + CV_2 + V_1D + V_1V_2. \end{aligned}$$

Substituting for C,

$$\begin{aligned} T &= (EP_1)D + (EP_1)V_2 + V_1D + V_1V_2 \\ &= EP_1D + EP_1V_2 + V_1D + V_1V_2. \end{aligned}$$

Substituting for D,

$$\begin{aligned} T &= EP_1(FP_2) + EP_1V_2 + V_1(FP_2) + V_1V_2 \\ &= P_1P_2EF + EP_1V_2 + P_2FV_1 + V_1V_2. \end{aligned}$$

Substituting for E,

$$T = P_1P_2(V_3 + P_2)F + (V_3 + P_2)P_1V_2 + P_2FV_1 + V_1V_2.$$

Using the distributive law X(Y + Z) = XY + XZ and the idempotent law XX = X,

$$T = P_1P_2V_3F + P_1P_2F + P_1V_2V_3 + P_1P_2V_2 + V_1P_2F + V_1V_2.$$

Substituting for F and using the distributive law X(Y + Z) = XY + XZ and the idempotent law XX = X,

$$\begin{aligned} T &= P_1P_2V_3(V_3 + P_1) + P_1P_2(V_3 + P_1) + P_1V_2V_3 + P_1P_2V_2 + V_1P_2(V_3 + P_1) + V_1V_2 \\ &= P_1P_2V_3 + P_1P_2V_3 + P_1P_2V_3 + P_1P_2 + P_1V_2V_3 + P_1P_2V_2 + P_2V_1V_3 + P_1P_2V_1 + V_1V_2. \end{aligned}$$

Using the idempotent law X + X = X,

$$\begin{aligned} T &= P_1P_2V_3 + P_1P_2 + P_1V_2V_3 + P_1P_2V_2 + P_2V_1V_3 + P_1P_2V_1 + V_1V_2 \\ &= P_1P_2[V_3 + 1 + V_2 + V_1] + P_1V_2V_3 + P_2V_1V_3 + V_1V_2. \end{aligned}$$

Using the absorption law A + AB = A,

$$T = P_1P_2 + P_1V_2V_3 + P_2V_1V_3 + V_1V_2.$$

Rearranging the terms,

$$T = P_1P_2 + V_1V_2 + P_1V_2V_3 + P_2V_1V_3,$$

which is the final Boolean expression. There are four minimal cut sets: two double-component minimal cut sets and two triple-component minimal cut sets.
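The successive substitution above can be mechanized. The following minimal Python sketch represents each cut set as a set of basic events, so that the idempotent law is applied automatically, and removes absorbed terms at every gate. The dictionary layout and function names are illustrative assumptions; the gate equations are those of the case study. The recursion resolves the inputs of a gate before the gate itself, so it corresponds to substitution carried out from the bottom of the tree upward.

```python
from itertools import product

# Gate equations of the containment spray fault tree
TREE = {
    "T": ("AND", ["A", "B"]),
    "A": ("OR", ["C", "V1"]),
    "C": ("AND", ["E", "P1"]),
    "E": ("OR", ["V3", "P2"]),
    "B": ("OR", ["D", "V2"]),
    "D": ("AND", ["F", "P2"]),
    "F": ("OR", ["V3", "P1"]),
}

def absorb(cut_sets):
    """Absorption law A + AB = A: drop every cut set that contains another one."""
    unique = set(cut_sets)
    return {c for c in unique if not any(other < c for other in unique)}

def cut_sets_of(event, tree):
    """Return the minimal cut sets of an event as a set of frozensets."""
    if event not in tree:                      # basic event
        return {frozenset([event])}
    gate, inputs = tree[event]
    child_sets = [cut_sets_of(child, tree) for child in inputs]
    if gate == "OR":                           # Boolean sum: collect all terms
        combined = set().union(*child_sets)
    else:                                      # Boolean product: distributive law
        combined = {frozenset().union(*combo) for combo in product(*child_sets)}
    return absorb(combined)

minimal = cut_sets_of("T", TREE)
for cut_set in sorted(minimal, key=lambda c: (len(c), sorted(c))):
    print(" ".join(sorted(cut_set)))
```

Running the sketch lists the same four minimal cut sets as the hand derivation: P1 P2, V1 V2, P1 V2 V3, and P2 V1 V3.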


3.3.4.4 Step 4 – Quantitative Evaluation

The quantitative evaluation of the fault tree determines the probability of the top event; the basic event probability information and the list of minimal cut sets are required for final quantification. The probability of the top event is the probability over the union of the minimal cut sets, mathematically expressed as

$$P(T) = P(P_1P_2 \cup V_1V_2 \cup P_1V_2V_3 \cup P_2V_1V_3).$$

Using the expression for the OR gate evaluation, P(T) can be derived as

$$\begin{aligned} P(T) ={}& P(P_1P_2) + P(V_1V_2) + P(P_1V_2V_3) + P(P_2V_1V_3) \\ &- [P(P_1P_2V_1V_2) + P(P_1P_2V_2V_3) + P(P_1P_2V_1V_3) + P(P_1V_1V_2V_3) + P(P_2V_1V_2V_3) + P(P_1P_2V_1V_2V_3)] \\ &+ [P(P_1P_2V_1V_2V_3) + P(P_1P_2V_1V_2V_3) + P(P_1P_2V_1V_2V_3) + P(P_1P_2V_1V_2V_3)] - P(P_1P_2V_1V_2V_3) \\ ={}& P(P_1P_2) + P(V_1V_2) + P(P_2V_1V_3) + P(P_1V_2V_3) - P(P_1P_2V_1V_2) - P(P_1V_1V_2V_3) - P(P_1P_2V_2V_3) \\ &- P(P_2V_1V_2V_3) - P(P_1P_2V_1V_3) + 2P(P_1P_2V_1V_2V_3). \end{aligned}$$
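The expansion over the four minimal cut sets can be checked numerically. The following minimal Python sketch applies inclusion-exclusion to the cut sets, assuming independent basic events; the basic event probabilities are hypothetical values chosen for illustration and the function names are not from the text.

```python
from itertools import combinations
from math import prod

CUT_SETS = [{"P1", "P2"}, {"V1", "V2"}, {"P1", "V2", "V3"}, {"P2", "V1", "V3"}]

def top_event_probability(cut_sets, p):
    """Exact P(T) by inclusion-exclusion, assuming independent basic events."""
    total = 0.0
    for r in range(1, len(cut_sets) + 1):
        sign = (-1) ** (r + 1)
        for combo in combinations(cut_sets, r):
            events = set().union(*combo)       # idempotence: each event counted once
            total += sign * prod(p[e] for e in events)
    return total

def rare_event_probability(cut_sets, p):
    """Rare-event approximation: sum of the cut set probabilities."""
    return sum(prod(p[e] for e in cut_set) for cut_set in cut_sets)

# Hypothetical basic event probabilities
p = {"P1": 0.01, "P2": 0.01, "V1": 0.02, "V2": 0.02, "V3": 0.05}

exact = top_event_probability(CUT_SETS, p)
approx = rare_event_probability(CUT_SETS, p)
print(f"exact = {exact:.6e}, rare-event = {approx:.6e}")
```

With these assumed values the sum of the cut set probabilities differs from the exact result by roughly 0.1%, which illustrates why the rare-event approximation is often acceptable for small basic event probabilities.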

Example 11 A main control power supply (MCPS) is a very important support system in NPPs which provides an uninterrupted AC power supply to safety-related loads such as reactor regulation systems and safety system loads such as shutdown systems. The schematic diagram of this system is shown in Figure 3.43.

Figure 3.43 Schematic diagram of MCPS

[Figure 3.43 labels: 415 V AC class III buses (DIV I and DIV II); 240 V AC class II buses F2, F4, and F6; UPS1, UPS2, UPS3, UPS4; legend: rectifier, battery, inverter, switch, circuit breaker.]


There are four UPSs, namely, UPS-1, UPS-2, UPS-3, and UPS-4. The input supply to UPS-1 and UPS-3, and UPS-2 and UPS-4 is taken from division I and division II of class III, respectively. The failure criterion is unavailability of the power supply at two out of three buses. The circuit breaker can be assumed to be part of the respective division supply and unavailability data can be assumed to be available for the UPS. Develop the fault tree with these assumptions and calculate the minimal cut sets of the MCPS.

Solution: As the failure criterion is failure of the power supply at two or more of the three buses, the top event is a 2-out-of-3 voting gate; the failure logic is shown in Figure 3.44. From the fault tree we have the following Boolean expressions for various gates:

T = F2F4 + F4F6 + F2F6,
(U1BR) = U1 DIV1,
(U2BR) = U2 DIV2,
(U3BR) = U3 DIV1,
(U4BR) = U4 DIV2,
F2 = (U1BR)(U4BR),
F4 = (U3BR)(U4BR),
F6 = (U2BR)(U4BR).

[Figure 3.44 event labels: MCPS, no supply from MCPS (2-out-of-3 voting gate); F2, F4, F6, no supply at buses F2, F4, and F6; U1BR, U2BR, U3BR, no supply from UPS1, UPS2, and UPS3; U4BR, no supply from UPS 4 branch; U1, U2, U3, U4, UPS failures; DIV1, DIV2, division failures.]

Figure 3.44 Fault tree of MCPS


Substituting successively in the top-event terms,

F2F4 = (U1BR)(U3BR)(U4BR) = U1 U3 U4 DIV1 DIV2,
F4F6 = (U2BR)(U3BR)(U4BR) = U2 U3 U4 DIV1 DIV2,
F2F6 = (U1BR)(U2BR)(U4BR) = U1 U2 U4 DIV1 DIV2.

The rare-event approximation can be used here, and the probability of the top event can be calculated by adding the probabilities of all the cut sets.
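The substitution for the MCPS example can be reproduced with set operations, where the Boolean product of two terms is the union of their basic events. In this minimal Python sketch the gate expressions are those quoted in the solution, while the basic event probabilities and all function and variable names are hypothetical assumptions.

```python
from itertools import combinations
from math import prod

def boolean_product(*terms):
    """Product of single-term expressions; the idempotent law XX = X is implicit."""
    return frozenset().union(*terms)

# Branch and bus expressions as given in the solution
BRANCH = {
    "U1BR": {"U1", "DIV1"},
    "U2BR": {"U2", "DIV2"},
    "U3BR": {"U3", "DIV1"},
    "U4BR": {"U4", "DIV2"},
}
BUS = {
    "F2": boolean_product(BRANCH["U1BR"], BRANCH["U4BR"]),
    "F4": boolean_product(BRANCH["U3BR"], BRANCH["U4BR"]),
    "F6": boolean_product(BRANCH["U2BR"], BRANCH["U4BR"]),
}

# 2-out-of-3 voting gate: one term for every pair of failed buses
cut_sets = {boolean_product(BUS[a], BUS[b]) for a, b in combinations(sorted(BUS), 2)}

# Hypothetical unavailabilities, used only to illustrate the rare-event sum
p = {"U1": 1e-2, "U2": 1e-2, "U3": 1e-2, "U4": 1e-2, "DIV1": 1e-3, "DIV2": 1e-3}
p_top = sum(prod(p[e] for e in cut_set) for cut_set in cut_sets)

for cut_set in sorted(cut_sets, key=sorted):
    print(" ".join(sorted(cut_set)))
print(f"P(T) by rare-event approximation = {p_top:.3e}")
```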

3.4 Monte Carlo Simulation

System reliability modeling with analytical approaches such as RBD, Markov model, and FTA was discussed in the previous sections. The same can be done by a simulation approach using Monte Carlo methods. This section presents various elements of Monte Carlo simulation-based system reliability modeling.

3.4.1 Analytical versus Simulation Approaches for System Reliability Modeling

Analytical techniques represent the system by a mathematical model and evaluate the reliability indices from this model using direct mathematical solutions. The disadvantage of the analytical approach is that the model used in the analysis is usually a simplification of the system, sometimes to such an extent that it becomes totally unrealistic. In addition, the output of the analytical methods is usually limited to expected values only. The complexity of modern engineering systems, besides the need for realistic considerations when modeling their availability/reliability, renders analytical methods very difficult to use. When considering only the failure characteristics of the components, the analytical approach is generally used. The models are only applicable with exponential failure/repair probability density functions (PDFs). They are difficult to apply for components having non-exponential failure/repair PDFs. Analyses that involve repairable systems with multiple additional events and/or other maintainability information are very difficult (if not impossible) to solve analytically. Modern engineering systems have complex environments, as depicted in Figure 3.45. In these cases, analysis through simulation becomes necessary.

Simulation techniques estimate the reliability indices by simulating the actual process and random behavior of the system in a computer model in order to create a realistic lifetime scenario of the system. This method treats the problem as a series of real experiments conducted in simulated time. It estimates the probability and other indices by counting the number of times an event occurs in simulated time. Simulation is a very valuable method which is widely used in the solution of real engineering problems. Lately the utilization of this method has been growing for the assessment of the availability of complex systems and the monetary value of plant operation and maintenance [4–7].


Figure 3.45 Complex environments for system modeling

The simulation approach overcomes the disadvantages of the analytical approach by incorporating and simulating any system characteristic that can be recognized. It can provide a wide range of output parameters including all moments and complete PDFs. It can handle very complex scenarios like non-constant transition rates, multistate systems, and time-dependent reliability problems. The uncertainties that arise due to simplification by the analytical mathematical models can be eliminated with simulation. However, the solution time is usually large and there is uncertainty from one simulation to another. Recent studies show that these demerits of simulation can be overcome with a few modifications in the simulation. It is to be noted that the experimentation required is different for different types of problems and it is not possible to precisely define a general procedure that is applicable in all circumstances. However, the simulation technique provides considerable flexibility in solving any type of complex problem. Table 3.6 gives a comparison of both the approaches on various issues.

Table 3.6 Comparison of analytical and simulation techniques

| Issue | Analytical techniques | Simulation techniques |
|---|---|---|
| Approach | Direct mathematical solutions | Numerical calculations over the simulated model |
| Methods | RBD, FTA, Markov model | Monte Carlo simulation |
| Complex scenarios | Adopt simplified mathematical models with questionable assumptions and approximations | Realistic modeling |
| Analysis results | Limited to point estimates only | Wide range of output parameters |
| Computational time | Once the algebraic analysis is over, the calculations are very simple | Large number of computer calculations |

[Figure 3.45 labels: complex environment, comprising active/standby redundancies, dormant demand systems, inspection policies, preventive maintenance policies, aging of components, interdependencies, deteriorating repairs, and non-exponential PDFs for failures and repairs.]


3.4.1.2 Benefits/Applications of Simulation-based Reliability Evaluation

• Realistic modeling of system behavior in complex environment.
• The number of assumptions can be reduced significantly.
• Handling of inconstant hazard rate models at component level.
• Wide range of output parameters at the system level (failure frequency, MTBF, MTTR, unavailability, failure rate, etc.).
• Dynamics in a sequence of operations and complex maintenance policies can be adopted in system modeling.
• The simulation model can be used for optimizing inspection interval or the replacement time of components in the system [8].
• Quantification of aleatory uncertainty associated with the random variable time to failure of the overall system.
• Importance measures can be obtained from the analysis, which is helpful in identifying the critical components and ranking them [9, 10].

3.4.2 Elements of Monte Carlo Simulation

In simulation, random failure/repair times from each component’s failure/repair distribution are generated. These failure/repair times are then combined in accordance with the way the components are arranged reliability-wise within the system. The overall results are analyzed in order to determine the behavior of the entire system. Sound understanding of the system behavior is the prerequisite for system success/failure logic. It is assumed that the reliability values for the components have been determined using standard (or accelerated) life data analysis techniques, so that the reliability function for each component is known. With this component-level reliability information available, simulation can then be performed to determine the reliability of the entire system. The random failure/repair times of components are obtained by generating uniform random numbers and converting these into the required density functions as per the component PDF.

Evaluation of Time to Failure (or Time to Repair) of a Component

Consider a random variable x that follows an exponential distribution with parameter λ. Its PDF f(x) and CDF F(x) are given by the following expressions:

$$f(x) = \lambda \exp(-\lambda x),$$

$$F(x) = \int_0^x f(x)\,dx = 1 - \exp(-\lambda x).$$

Now x is derived as a function of F(x),

$$x = G(F(x)) = \frac{1}{\lambda} \ln\left(\frac{1}{1 - F(x)}\right).$$


A uniform random number is generated using any of the standard random number generators. Let us assume 0.8 is generated by the random number generator; then the value of x is calculated by substituting 0.8 in place of F(x) and, say, 0.005/h (43.8/year) in place of λ in the above equation:

$$x = \frac{1}{0.005} \ln\left(\frac{1}{1 - 0.8}\right) \approx 321.89\ \text{h}.$$

This indicates that the time to failure of the component is about 321.89 h (see Figure 3.46). This procedure is applicable similarly for repair time also. If the shape of the PDF is different, one has to solve for G(F(x)) accordingly. Table 3.7 gives mathematical expressions for generating random samples for different distributions frequently used in reliability calculations. Here Ui represents a standard random number generated for the ith iteration.
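The inverse-transform step for the exponential case can be sketched in a few lines of Python. The function name is an illustrative assumption, and Python's built-in generator stands in for "any of the standard random number generators".

```python
import math
import random

def exponential_time(u, rate):
    """Invert F(x) = 1 - exp(-rate * x) for a uniform random number u in [0, 1)."""
    return math.log(1.0 / (1.0 - u)) / rate

# Worked example from the text: u = 0.8 and a failure rate of 0.005/h
print(round(exponential_time(0.8, 0.005), 1))   # about 321.9 h

# Drawing several times to failure with a seeded generator for repeatability
rng = random.Random(1)
samples = [exponential_time(rng.random(), 0.005) for _ in range(5)]
print([round(s, 1) for s in samples])
```

The same function gives a repair time when the repair rate is passed in place of the failure rate.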

[Figure 3.46 plots R(x) = exp(−0.005x) and F(x) = 1 − exp(−0.005x), on a scale of 0 to 1, against time from 0 to 1000 h.]

Figure 3.46 Exponential distribution

Table 3.7 Generation of random samples for different distributions

| Distribution | Random samples |
|---|---|
| Uniform (a, b) | $a + (b - a)U_i$ |
| Exponential (λ) | $-\frac{1}{\lambda}\ln(U_i)$ |
| Weibull (α, β) | $\alpha(-\ln U_i)^{1/\beta}$ |
| Normal (μ, σ) | $X = X_s\sigma + \mu$, with $X_s = (-2\ln U_i)^{1/2}\cos(2\pi U_{i+1})$ |
| Lognormal (μ, σ) | Generate Y = ln(X) as a normal variate with mean μ and standard deviation σ and then compute Xi = exp(Yi) |
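The sampling expressions of Table 3.7 translate directly into code. The following minimal Python sketch takes the uniform random numbers as arguments so that each expression can be checked by hand; the function names are illustrative assumptions, and the uniform numbers must lie in (0, 1] wherever a logarithm is taken.

```python
import math
import random

def sample_uniform(a, b, u):
    return a + (b - a) * u

def sample_exponential(rate, u):
    return -math.log(u) / rate

def sample_weibull(alpha, beta, u):
    """alpha is the scale parameter and beta the shape parameter."""
    return alpha * (-math.log(u)) ** (1.0 / beta)

def sample_normal(mu, sigma, u1, u2):
    """Box-Muller form: needs two uniform random numbers per sample."""
    xs = math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)
    return xs * sigma + mu

def sample_lognormal(mu, sigma, u1, u2):
    """mu and sigma are the mean and standard deviation of ln(X)."""
    return math.exp(sample_normal(mu, sigma, u1, u2))

# Usage with a seeded generator; 1 - random() lies in (0, 1], avoiding log(0)
rng = random.Random(42)
times = [sample_weibull(1000.0, 1.5, 1.0 - rng.random()) for _ in range(3)]
print([round(t, 1) for t in times])
```

Passing the uniform numbers explicitly is a design choice for testability; in a simulation they would be drawn inside the loop.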


3.4.3 Repairable Series and Parallel Systems

Components are simulated for a specified mission time to depict the duration of available (up) and unavailable (down) states. Up and down states will come alternately; as these states are changing with time, their representations are called state time diagrams. A down state can be due to unexpected failure and its recovery will depend upon the time taken for repair action. The duration of the state is random for both up and down states. It will depend upon the PDF of time to failure and time to repair respectively.

To first understand how component failures and simple repairs affect the system and to visualize the steps involved, the simulation procedure is explained here with the help of two examples. The first example is a repairable series system and the second example is a two-unit parallel system.

Example 12 – Series System A typical power supply system consists of grid supply, circuit breaker, transformer, and bus. The success criterion is the availability of the power supply at the bus, which demands the successful functioning of all the components. So, it is a simple four-component series system. The RBD (functional diagram) is shown in Figure 3.47.

Figure 3.47 Functional diagram of typical class IV supply system

In addition to success/failure logic, failure and repair PDFs of components are the input to the simulation. The PDFs of failure and repair for the components are given in Table 3.8. It is assumed that all the component failure/repair PDFs follow an exponential distribution. However, the procedure is the same even when component PDFs are non-exponential except that the random variates will be different as per the PDF. The simulation procedure is as follows:

Step 1. Generate a random number.
Step 2. Convert this number into a value of operating time using a conversion method on the appropriate times to failure distribution (exponential in the present case).
Step 3. Generate a new random number.
Step 4. Convert this number into a value of repair time using a conversion method on the appropriate times to repair distribution.
Step 5. Repeat steps 1–4 for the desired period of operating life.
Step 6. Repeat steps 1–5 for each component.
Step 7. Compare the sequences for each component and deduce system failure times, frequencies, and other parameters.
Step 8. Repeat steps 1–7 for the desired number of simulations.
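The eight steps can be sketched in Python for the four-component series system. This is a minimal sketch under stated assumptions: the rates are those listed in Table 3.8, each component follows its own failure/repair cycle independently of the system state, overlapping outages are counted once, and the mission time, number of runs, seed, and all names are illustrative choices rather than values from the text.

```python
import math
import random

# (failure rate, repair rate) in /h, as listed in Table 3.8
COMPONENTS = {
    "grid supply": (2e-4, 2.59),
    "circuit breaker": (1e-6, 0.166),
    "transformer": (2e-6, 0.0925926),
    "bus": (1e-7, 0.0925926),
}

def draw(rate, rng):
    """Steps 1-2 and 3-4: uniform random number converted to an exponential time."""
    return -math.log(1.0 - rng.random()) / rate

def down_intervals(failure_rate, repair_rate, mission_time, rng):
    """Step 5: alternate operating and repair times up to the mission time."""
    intervals, t = [], 0.0
    while True:
        t += draw(failure_rate, rng)
        if t >= mission_time:
            return intervals
        repair_end = min(t + draw(repair_rate, rng), mission_time)
        intervals.append((t, repair_end))
        t = repair_end

def merge(intervals):
    """Step 7 for a series system: union of all component down intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def simulate_series(mission_time, runs, seed=0):
    rng = random.Random(seed)
    total_down, total_failures = 0.0, 0
    for _ in range(runs):                                        # step 8
        all_down = []
        for failure_rate, repair_rate in COMPONENTS.values():    # step 6
            all_down += down_intervals(failure_rate, repair_rate, mission_time, rng)
        system_down = merge(all_down)
        total_down += sum(end - start for start, end in system_down)
        total_failures += len(system_down)
    exposure = runs * mission_time
    return total_down / exposure, total_failures / exposure

unavailability, failure_frequency = simulate_series(mission_time=1e6, runs=20)
print(f"unavailability = {unavailability:.3e}, frequency = {failure_frequency:.3e} /h")
```

The estimates fluctuate from one seed to another, as expected of a simulation; longer mission times or more runs reduce the scatter.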

[Figure 3.47 blocks in series: grid supply, circuit breaker, transformer, bus.]


Figure 3.48 Overview of up and down states for power supply system

A typical overview of up and down states for a class IV supply system is shown in Figure 3.48. System failure time is the sum of the component failure times if they are mutually exclusive. If there is any simultaneous failure of two or more components, the failure time of the component having the largest value is taken for system failure time.

Table 3.8 Failure and repair rate of components

| Serial no. | Component | Failure rate (/h) | Repair rate (/h) |
|---|---|---|---|
| 1 | Grid supply | 2E–4 | 2.59 |
| 2 | Circuit breaker | 1E–6 | 0.166 |
| 3 | Transformer | 2E–6 | 0.0925926 |
| 4 | Bus | 1E–7 | 0.0925926 |
| 5 | Pumps (1 and 2) | 3.7E–5 | 0.0925926 |

[Figure 3.48 shows state time diagrams (functional state and failure state against time) for the grid supply, circuit breaker, transformer, bus, and the class IV system.]


3.4.3.1 Reliability Evaluation with Analytical Approach

In the analytical (or algebraic analysis) approach, the system’s PDF/other reliability indices are obtained analytically from each component’s failure distribution using probability theory. In other words, the analytical approach involves the determination of a mathematical expression that describes the reliability of the system in terms of the reliabilities of its components.

Considering components to be independent, the availability expression for the power supply system (A) is given by the following expression:

$$A = A_1 A_2 A_3 A_4,$$

$$A_i = \frac{\mu_i}{\mu_i + \lambda_i},$$

$$A = \frac{\mu_1 \mu_2 \mu_3 \mu_4}{(\mu_1 + \lambda_1)(\mu_2 + \lambda_2)(\mu_3 + \lambda_3)(\mu_4 + \lambda_4)},$$

where Ai, λi, and μi are the availability, failure rate, and repair rate of the ith component (i = 1, 2, 3, and 4).
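As a numerical check of the series availability expression, the following minimal Python sketch evaluates it with the grid supply, circuit breaker, transformer, and bus rates of Table 3.8. The dictionary and function names are illustrative assumptions.

```python
from math import prod

# (failure rate, repair rate) in /h for the four series components (Table 3.8)
RATES = {
    "grid supply": (2e-4, 2.59),
    "circuit breaker": (1e-6, 0.166),
    "transformer": (2e-6, 0.0925926),
    "bus": (1e-7, 0.0925926),
}

def availability(failure_rate, repair_rate):
    """Steady-state availability of one repairable component."""
    return repair_rate / (repair_rate + failure_rate)

system_availability = prod(availability(lam, mu) for lam, mu in RATES.values())
system_unavailability = 1.0 - system_availability
print(f"A = {system_availability:.7f}, unavailability = {system_unavailability:.3e}")
```

The resulting unavailability is close to the analytical value of 1.059E–4 reported for the series system in Table 3.9.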

Example 13 – Parallel System A typical emergency core cooling system of an NPP consists of a two-unit injection pump active redundant system. One pump operation is sufficient for the successful operation of the system. The failure of the system occurs when both the pumps fail. The RBD (functional diagram) is shown in Figure 3.49.

A typical overview of up and down states for the emergency injection pumps branch having two pumps in parallel is shown in Figure 3.50. System failure time is the time when there is simultaneous failure of the two pumps.

Figure 3.49 Functional block diagram of two-unit pump active redundant system

[Figure 3.49 blocks in parallel: pump 1, pump 2.]


Considering components in the two-unit active redundant parallel pump system to be independent, the unavailability (Q) is given by the following expression:

$$Q = Q_1 Q_2,$$

$$Q_i = \frac{\lambda_i}{\mu_i + \lambda_i},$$

$$Q = \frac{\lambda_1 \lambda_2}{(\mu_1 + \lambda_1)(\mu_2 + \lambda_2)},$$

where Qi, λi, and μi are the unavailability, failure rate, and repair rate of the ith component (i = 1 and 2).
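The parallel unavailability expression can be evaluated in the same way. This minimal Python sketch uses the pump failure and repair rates of Table 3.8 for both units; the names are illustrative assumptions.

```python
# Failure and repair rates (/h) of pumps 1 and 2 from Table 3.8
PUMP_FAILURE_RATE = 3.7e-5
PUMP_REPAIR_RATE = 0.0925926

def unavailability(failure_rate, repair_rate):
    """Steady-state unavailability of one repairable component."""
    return failure_rate / (repair_rate + failure_rate)

q_pump = unavailability(PUMP_FAILURE_RATE, PUMP_REPAIR_RATE)
q_system = q_pump * q_pump        # both pumps must be down simultaneously
print(f"Q_pump = {q_pump:.4e}, Q_system = {q_system:.3e}")
```

The result is close to the analytical value of 1.59E–7 listed for the parallel system in Table 3.9.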

Figure 3.50 Overview of up and down states for emergency injection pumps branch

Table 3.9 gives the comparison of both the approaches, analytical and simulation, for the two problems. In addition to the parameters such as average unavailability, expected number of failures, failure frequency, MTBF, and MTTR, simulation can give the cumulative distribution function (CDF) of the random variable time between failures for the system under consideration. The CDFs of problems 1 and 2 are shown in Figures 3.51 and 3.52, respectively. Mission times of 10^7 h and 10^8 h are considered for problem 1 and problem 2, respectively. Simulation results are from 10^4 iterations in both the cases.

[Figure 3.50 shows state time diagrams (functional state and failure state against time) for pump 1, pump 2, and the system.]


[Figure 3.51 axes: cumulative probability (0 to 1) against time between failures (0 to 5.00E+04 h).]

Figure 3.51 CDF of series system

[Figure 3.52 axes: cumulative probability (0 to 1) against time between failures (0 to 1.00E+07 h).]

Figure 3.52 CDF of parallel system

Table 3.9 Summary of results

| Parameter | Series, analytical | Series, simulation | Parallel, analytical | Parallel, simulation |
|---|---|---|---|---|
| Average unavailability | 1.059E–4 | 1.059E–4 | 1.59E–7 | 1.61E–7 |
| Avg. no. of failures | 2030.78 | 2031.031 | 2.955 | 2.997 |
| Failure frequency (/h) | 2.031E–4 | 2.031E–4 | 2.95E–8 | 2.99E–8 |
| MTBF (h) | 4924.21 | 4923.51 | 33.84E6 | 33.36E6 |
| MTTR (h) | 0.5214 | 0.5214 | 5.39 | 5.37 |


3.4.4 Simulation Procedure for Complex Systems

The simulation procedure is explained below for systems having complex operating environments [11]:

1. System failure logic is obtained from qualitative FTA or RBD in the form of minimal cut sets (combinations of the minimum number of component failures leading to system failure).

2. PDFs for time to failure/repair of all basic components are obtained from past experience or lab testing. Maintenance policies of all components have to be collected from the system technical specifications record. Information such as the interval and duration of tests and preventive maintenance is obtained in this step.

3. Generation of component state profiles. Components are simulated for a specified mission time to depict the duration of available (up) and unavailable (down) states. If a component is repairable, as is the case for most practical systems, up and down states will come alternately. Down states can be due to failure or scheduled maintenance activity. The duration of the state is random for up states and also for down states if it is unscheduled repair, whereas scheduled maintenance activity may be a fixed value.
• Active components: active components are those which are in working condition during normal operation of the system. Active components can be either in success state or failure state. Based on the PDF of failure of the component, time to failure is obtained from the random variate calculations. The failure is followed by repair whose time depends on the PDF of repair time. This sequence is continued until it reaches the predetermined system mission time.
• Standby/dormant components: these components are required on demand due to the failure of active components. When there is no demand, they will be in standby state or may be in a failed state due to on-shelf failure. They can also be unavailable due to testing or maintenance state as per the scheduled activity when there is a demand for them. This makes the components have multiple states and such stochastic behavior needs to be modeled to exactly suit the practical scenario. Down times due to the scheduled testing and maintenance policies are first accommodated in the component state profiles. In certain cases test override probability has to be taken to account for its availability during testing. As failures occurring during a standby period cannot be revealed until testing, time from failure until identification has to be taken as down time. It is followed by imposing the standby down times obtained from the standby time to failure PDF and time to repair PDF. Apart from the availability on demand, it is also required to check whether the standby component is successfully meeting its mission. This is incorporated by obtaining the time to failure based on the operating failure PDF and is checked with the mission time, which is the down time of the active component.


4. Generation of system state profile. The system state profile is developed by integrating component state profiles with the system failure logic. Failure logic of complex systems is generally derived from FTA, which is the logical and graphical description of various combinations of failure events. FTA represents the failure logic of the system as the sum of minimal cut sets. In other words, system logic is denoted by a series configuration of parallel subsystems. Each minimal cut set represents one such subsystem, whose basic components are in parallel.

5. The state profile for each minimal cut set is generated based on component state profiles obtained from step 3. A down state is identified by calculating the duration for which all the components in the cut set under consideration are simultaneously unavailable, as the cut set is equivalent to a parallel configuration. The minimal cut set is in an up state for the remaining duration of the mission. Thus, the state profile for the cut set also alternates between up and down states throughout its mission.

6. System states are generated from the state profiles of the minimal cut sets, which are obtained from step 5. As the system is a series configuration of all minimal cut sets, the down state of every minimal cut set imposes the same down state on the system. Thus all down states of all minimal cut sets are reflected in the system state profile and the remainder of the mission is in the up state.

7. Steps 3 to 6 are repeated for a sufficient number of simulations and the required measures of reliability are obtained from the simulation results.
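The steps above can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions that are not part of the procedure itself: every component is active and repairable with exponential times to failure and repair, standby behavior, tests, and scheduled maintenance are left out, and all function names, rates, and the two-component cut set are hypothetical.

```python
import random


def component_down_intervals(fail_rate, repair_rate, mission_time, rng):
    """Step 3: alternate exponential up and down durations until the mission ends."""
    down, t = [], 0.0
    while True:
        t += rng.expovariate(fail_rate)            # up duration (time to failure)
        if t >= mission_time:
            return down
        end = min(t + rng.expovariate(repair_rate), mission_time)
        down.append((t, end))                      # down duration (time to repair)
        t = end


def intersect(a, b):
    """Step 5: a cut set is down only while all of its components are down."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo < hi:
            out.append((lo, hi))
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def union(interval_lists):
    """Step 6: the system is down whenever any minimal cut set is down."""
    merged = []
    for lo, hi in sorted(iv for lst in interval_lists for iv in lst):
        if merged and lo <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def average_unavailability(rates, cut_sets, mission_time, runs, seed=1):
    """Step 7: repeat the profile generation and average the system down time."""
    rng = random.Random(seed)
    total_down = 0.0
    for _ in range(runs):
        profiles = {name: component_down_intervals(lam, mu, mission_time, rng)
                    for name, (lam, mu) in rates.items()}
        cut_set_down = []
        for cut_set in cut_sets:
            down = profiles[cut_set[0]]
            for name in cut_set[1:]:
                down = intersect(down, profiles[name])
            cut_set_down.append(down)
        total_down += sum(hi - lo for lo, hi in union(cut_set_down))
    return total_down / (runs * mission_time)


# Hypothetical two-component parallel system (one minimal cut set {A, B}).
rates = {"A": (0.01, 0.1), "B": (0.01, 0.1)}       # (failure rate, repair rate) per hour
print(average_unavailability(rates, [("A", "B")], mission_time=10_000.0, runs=50))
```

For this hypothetical system the estimate should be close to the steady-state value (λ/(λ + μ))² of two independent components, which gives a quick sanity check of the interval logic.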

3.4.4.1 Case Study – AC Power Supply System of Indian Nuclear Power Plant

Reliability analysis adopting the above-discussed procedure for a practical system is presented here. An AC power supply system is chosen as the case of application as it is a very important system in the safe operation of an NPP. This system has redundant, multistate components with different maintenance policies. System-specific information is used in the modeling as far as possible.

Description of the System

An electrical power supply is essential in the operation of the process as well as safety systems of any NPP. To ensure high reliability of power supply systems, high redundancy and diversity are provided in the design. Loss of the off-site power supply coupled with loss of on-site AC power is called station blackout. In many probabilistic safety assessment (PSA) studies [12], severe accident sequences resulting from station blackout conditions have been recognized to be significant contributors to the risk of core damage. For this reason the reliability/availability modeling of AC power supply systems is of special interest in PSA of NPP.


Figure 3.53 Schematic diagram of AC electrical power supply

The electrical power supply system of an Indian NPP consists of four classes. Only the class IV and class III systems are relevant to the station blackout criteria. Class IV power is supplied from two sources: (i) grid supply and (ii) station alternator supply. Class III can be termed as redundant to the class IV supply. Whenever class IV is unavailable, two class III buses are fed from dedicated standby DGs of 100% capacity each. There are three standby DGs. These DGs start automatically on failure of the class IV power supply through an emergency transfer scheme. Two of the DGs supply power to the buses to which they are connected. In the case of failure/unavailability of either of these two DGs, the third DG can be connected automatically to either of the two class III buses. In the case where only one DG is available, the tie breaker between the buses closes automatically. The class III loads are connected to the buses in such a way that failure of any bus will not affect the performance of systems needed to ensure safety of the plant. Thus one DG is sufficient for all the emergency loads and this gives a redundancy of one out of three. The line diagram of the AC power supply system in the Indian NPP is shown in Figure 3.53.

System Modeling

Failure/success logic of the system can be obtained by developing RBDs or a qualitative FTA. The interaction between component failures and their impact on the system success state is depicted with RBDs or FTA. The latter method is suitable when the configuration is complex. However, both approaches are adopted here to give the list of minimal cut sets. The RBD for the system is shown in Figure 3.54. There can be dependency between the cut sets and this is properly accounted for in the analysis. Parameters of the distributions for all the components in the system are shown in Table 3.10. Time to failure and time to repair are observed to follow an exponential distribution from the operating experience [13]. However, by changing the random variate in the simulation, one can simulate any kind of PDF for time to failure or time to repair.

[Schematic labels (Figure 3.53): Grid; DG 1, DG 2, DG 3; CB351, CB353, CB357, CB361, CB364, CB368, CB370; BUS D, BUS DE, BUS E]

Figure 3.54 RBD of AC power supply system

Table 3.10 Failure rate and repair rate of components

| Serial no. | Component | Description | Failure rate (/h) (operating) | Failure rate (/h) (standby) | Repair rate (/h) |
|---|---|---|---|---|---|
| 1 | CLIV | Class IV supply | 2.34E–04 | – | 2.59 |
| 2 | DG1 | Diesel generator 1 | 9.00E–05 | 5.33E–04 | 8.69E–02 |
| 3 | CB351 | Circuit breaker 351 | 3.60E–07 | 2.14E–05 | 0.25 |
| 4 | CB353 | Circuit breaker 353 | 3.60E–07 | 2.14E–05 | 0.25 |
| 5 | BUSD | Bus D | 3.20E–07 | – | 0.125 |
| 6 | DG3 | Diesel generator 3 | 9.00E–05 | 5.33E–04 | 8.69E–02 |
| 7 | CB370 | Circuit breaker 370 | 3.60E–07 | 2.14E–05 | 0.25 |
| 8 | CB357 | Circuit breaker 357 | 3.60E–07 | 2.14E–05 | 0.25 |
| 9 | CB368 | Circuit breaker 368 | 3.60E–07 | 2.14E–05 | 0.25 |
| 10 | BUSE | Bus E | 3.20E–07 | – | 0.125 |
| 11 | DG2 | Diesel generator 2 | 9.00E–05 | 5.33E–04 | 8.69E–02 |
| 12 | CB361 | Circuit breaker 361 | 3.60E–07 | 2.14E–05 | 0.25 |
| 13 | CB364 | Circuit breaker 364 | 3.60E–07 | 2.14E–05 | 0.25 |
| 14 | DG-CCF | Common cause failure | 1.00E–05 | 5.92E–05 | 4.166E–02 |


System-specific testing and maintenance information is obtained from the operating experience. All DGs are tested with no load once a week and tested with a load once in two months. Scheduled maintenance is carried out once in three months on all the DGs. However, maintenance is not carried out simultaneously on more than one DG. During a no-load or a full-load test, DGs can take the demand, which makes the override probability one, and the test duration does not count as down time. Scheduled maintenance is carried out on all circuit breakers once a year during the reactor shutdown. The testing and maintenance policy for standby components of the system is given in Table 3.11.

Table 3.11 Testing and maintenance policy for standby components

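| Serial no. | Component | No-load test interval (h) | No-load test duration (h) | Load test interval (h) | Load test duration (h) | Preventive maintenance interval (h) | Preventive maintenance duration (h) |
|---|---|---|---|---|---|---|---|
| 1 | DG1 | 168 | 0.0833 | 1440 | 2 | 2160 | 8 |
| 2 | CB351 | 168 | 0.0833 | 1440 | 2 | 8760 | 2 |
| 3 | CB353 | 168 | 0.0833 | 1440 | 2 | 8760 | 2 |
| 4 | DG3 | 168 | 0.0833 | – | – | 2184 | 8 |
| 5 | CB370 | 168 | 0.0833 | – | – | 8760 | 2 |
| 6 | CB357 | – | – | – | – | 8760 | 2 |
| 7 | CB368 | – | – | – | – | 8760 | 2 |
| 8 | DG2 | 168 | 0.0833 | 1440 | 2 | 2208 | 8 |
| 9 | CB361 | 168 | 0.0833 | 1440 | 2 | 8760 | 2 |
| 10 | CB364 | 168 | 0.0833 | 1440 | 2 | 8760 | 2 |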

Results and Discussion

The FTA approach with suitable assumptions is often used for unavailability assessment as a part of level-1 PSA of NPP. It is assumed that the unavailability of a standby system can be reasonably approximated by the use of fault trees (or some other logic models) in which the component time-averaged unavailabilities are used as the probabilities of basic events [14]. To reduce the burden of calculations, the time-dependent unavailabilities of the components are substituted in some logic models by their average values over the period of analysis. In addition to these assumptions and approximations (rare event), actual processes (complex interaction and dependencies between components) and random behavior of the systems are depicted with simplified logic models. The output results from this approach are limited to point estimates only. Using this fault tree (cut set) approach, the unavailability obtained is 5.87E–7.

An alternative approach could be based on Markov models. These models can take into account a wide range of dependencies; however, they are restrictive in terms of number of components, preventive maintenance, and failure/repair time distributions. Furthermore, it is not possible to take into account any trends or seasonal effects. Another alternative could be the use of semi-Markov models. The scalability in terms of number of possible states of the system, and number of maintenance actions, is an important advantage of these models; however, they are also complex and therefore very difficult to handle when the number of possible system states increases.

Table 3.12 Summary of results

| Serial no. | Parameter | Value |
|---|---|---|
| 1 | Average unavailability | 7.14E–7 |
| 2 | Failure frequency (/h) | 2.77E–6 |
| 3 | MTBF (h) | 3.62E5 |
| 4 | MTTR (h) | 0.673 |

The subsystems of the AC power supply system have multiple states due to surveillance tests and scheduled maintenance activities. In addition, the operation of a DG involves starting and then running until the end of its mission time, which is a sequential (or conditional) event. Furthermore, the redundancies and dependencies add to the complexity. This complexity, or dynamic environment, of the chosen problem makes the Monte Carlo simulation approach the obvious choice, as the method allows consideration of various relevant aspects of system operation which cannot be easily captured by analytical methods.

[Plot: cumulative probability (0 to 1) versus time (0 to 1.00E+06 h)]

Figure 3.55 CDF for the time to failure


A fixed number of iterations is used as the stopping criterion for the simulation. A crude sampling approach is adopted in the present problem; however, variance reduction methods such as Latin hypercube sampling (LHS) or importance sampling can also be used to improve the performance of the simulation. Table 3.12 gives the summary of results obtained from a simulation of 10,000 iterations and a mission time of 10^6 hours of operation. Average unavailability calculated from the simulation approach is 7.14E–7, whereas from the analytical approach (fault tree cut-set approach) it is 5.87E–7. The underestimation of unavailability in the case of the analytical approach is due to its inability to incorporate down time due to scheduled maintenance and surveillance test activities in the model. The output results from the analytical approach are limited to point estimates of unavailability only. The simulation approach, in addition to parameters such as average unavailability, expected number of failures, failure frequency, MTBF, and MTTR, can give CDFs of the random variables time between failures and time to repair for the system under consideration (Figures 3.55 and 3.56). The generated failure times of the system can be used to see how the hazard rate varies with time. Furthermore, average unavailability is plotted against mission time (Figure 3.57). The results of the analysis are very important, as a severe accident resulting from loss of power supply is a significant contributor to the risk of core damage of the NPP. This simulation model can also be used for optimizing inspection intervals or the replacement time of components in the system; for example, the surveillance interval of the standby power supply can be optimized based on this model.

[Plot: cumulative probability (0 to 1) versus time (0 to 5 h)]

Figure 3.56 CDF for the time to repair


The Monte Carlo simulation approach has flexibility in solving any kind of complex reliability problem. It can handle dynamic problems involving sequences of occurrences, time dependence, and any kind of component PDF, and it can give the required system attribute. However, the solution time is usually large and the results vary from one simulation to another. It is to be noted that the experimentation required is different for different types of problems and it is not possible to precisely define a general procedure that is applicable in all circumstances. However, the simulation technique provides considerable flexibility in solving any type of complex problem.

The rapid development of computer technology, with data processing at unprecedented speeds, further encourages the use of simulation approaches to solve reliability problems. Use of the simulation approach eliminates many of the assumptions that are inevitable with analytical approaches. To reduce a complex reliability problem to a simple mathematical model, analytical approaches make a lot of assumptions. In contrast, the Monte Carlo simulation-based reliability approach, due to its inherent capability in simulating the actual process and random behavior of the system, can eliminate the uncertainty in system reliability modeling. One should not forget Einstein’s quotation in this regard, “A theory should be as simple as possible, but no simpler.”

[Plot: unavailability (0 to 1.00E-06) versus time (0 to 1.00E+06 h)]

Figure 3.57 Unavailability vs. time


3.4.5 Increasing Efficiency of Simulation

One advantage of using Monte Carlo sampling is that, with a sufficient sample size, it provides an excellent approximation of the output distribution. Also, since it is a random sampling technique, the resulting distribution of values can be analyzed using standard statistical methods [15], and statistical techniques can be used to draw conclusions about the results. One such useful technique is the determination of the sample size required to obtain a result that is within a pre-specified confidence interval. If the primary interest is in achieving a confidence interval of width w and confidence level α about the mean of the output, then the sample size n must be at least

$$n = \left( \frac{2 Z_\alpha S}{w} \right)^2,$$

where S is the standard deviation, and Zα is the deviate of the unit normal distribution enclosing probability α. In order to calculate this sample size, the simulation must first be run a small number of times to estimate the standard deviation, S. After the necessary sample size n is determined, the simulation is continued until n scenarios have been generated. For example, suppose a 95% confidence interval for the mean is desired that is less than five units wide (w = 5). The deviate Zα enclosing 95% of the unit normal distribution is 1.96, so the required sample size is n ≥ (2 × 1.96 × S/5)² ≈ 0.61 S², with S taken from the initial runs. A minimum appropriate sample size can also be estimated for a desired confidence interval about a given fractile. In this case, the necessary sample size n is given by

$$n = p(1 - p) \left( \frac{Z_\alpha}{\Delta p} \right)^2,$$

where p is the percentile of interest, and Δp is the precision of the estimate. Alternatively, a model can be run using n samples, at which point some measure of convergence, such as a confidence interval about a quantile, can be computed. If the degree of convergence is not sufficient, then sampling may be continued. This is an advantage that Monte Carlo sampling has over stratified sampling techniques that require knowledge of the sample size before sampling begins.
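The two sample size expressions can be evaluated directly. The sketch below is only an illustration; the function names are hypothetical, and the value S = 10 is an assumed pilot-run standard deviation that does not come from the text.

```python
import math


def sample_size_for_mean(s, w, z):
    """Smallest n giving a confidence interval of total width w about the mean."""
    return math.ceil((2.0 * z * s / w) ** 2)


def sample_size_for_fractile(p, delta_p, z):
    """Smallest n giving precision delta_p about the fractile p."""
    return math.ceil(p * (1.0 - p) * (z / delta_p) ** 2)


# 95% confidence (z = 1.96), interval five units wide, assumed S = 10
print(sample_size_for_mean(s=10.0, w=5.0, z=1.96))                # 62
# 95th percentile estimated to within +/- 0.01 at 95% confidence
print(sample_size_for_fractile(p=0.95, delta_p=0.01, z=1.96))     # 1825
```

Because n grows with the square of S/w, halving the interval width quadruples the number of runs.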

The primary disadvantage of the Monte Carlo sampling technique is that even the minimum necessary sample size is often undesirably large. This is because a large sample size may be necessary in order for a sufficient number of samples to be taken from low-probability events. This may require a lot of computational resources. Some of the following variance reduction techniques can be used to reduce the computational time.


3.4.5.1 Importance Sampling

Importance sampling is a general technique for estimating the properties of a particular distribution, while only having samples generated from a different distribution than the distribution of interest. Importance sampling helps us sample from the important regions of the sample space. Consider evaluation of the mean value of a function h(x). We can write this mean value as

$$\mu = \int_{-\infty}^{\infty} h(x) f(x)\,dx,$$

where f(x) is the PDF of the random variable X. h(x) is usually called the score function. How do we estimate µ? A Monte Carlo simulation is called an analog simulation if it does not employ any variance reduction devices. The simulation that employs variance reduction techniques we shall term biased simulation. In analog Monte Carlo, we sample randomly a sequence of xi: i = 1, 2,…, N, from the density f(x) and write

$$\bar{h}_N = \frac{1}{N} \sum_{i=1}^{N} h(x_i).$$

In the limit $N \to \infty$, $\bar{h}_N \to \mu$. Also, by the central limit theorem, in the limit $N \to \infty$ the probability density of $\bar{h}_N$ tends to a Gaussian with mean µ and variance σ²/N, where

$$\sigma^2 = \int_{-\infty}^{\infty} \left[ h(x) - \mu \right]^2 f(x)\,dx.$$

Thus we say that the analog Monte Carlo estimate of µ is given by $\bar{h}_N \pm \sigma/\sqrt{N}$, where $\pm\sigma/\sqrt{N}$ defines the one-sigma confidence interval. This means that, for sufficiently large N, $\bar{h}_N$ lies within $\pm\sigma/\sqrt{N}$ of µ with a probability p given by

$$p = \frac{\sqrt{N}}{\sigma\sqrt{2\pi}} \int_{\mu - \sigma/\sqrt{N}}^{\mu + \sigma/\sqrt{N}} \exp\left[ -\frac{N (x - \mu)^2}{2\sigma^2} \right] dx = \frac{1}{\sqrt{2\pi}} \int_{-1}^{+1} \exp\left[ -\frac{x^2}{2} \right] dx = 0.68268.$$

In practice σ is not known. Hence we approximate it by its Monte Carlo estimate $S_N$, given by

$$S_N^2 = \frac{1}{N} \sum_{i=1}^{N} h^2(x_i) - \left[ \frac{1}{N} \sum_{i=1}^{N} h(x_i) \right]^2.$$

The quantity ±SN/√N is called the statistical error. Notice that the sample size N must be large for the above estimate of the error to hold. Normality tests are useful in a biased Monte Carlo simulation. Since the statistical error falls only as 1/√N, reducing the variance is the effective way to improve the accuracy of a Monte Carlo computation for a given sample size. Importance sampling can be used to reduce the variance.
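The analog estimate and its statistical error can be computed in a single pass over the samples. The following sketch is illustrative only: the function name is hypothetical, and the score function h(x) = x² with x uniform on (0, 1), whose exact mean is 1/3, is chosen here simply because the answer is known.

```python
import math
import random


def analog_estimate(h, sampler, n, seed=0):
    """Return the analog Monte Carlo mean of h and its statistical error S_N / sqrt(N)."""
    rng = random.Random(seed)
    sum_h = sum_h2 = 0.0
    for _ in range(n):
        value = h(sampler(rng))          # x_i drawn from the density f(x)
        sum_h += value
        sum_h2 += value * value
    mean = sum_h / n
    variance = max(sum_h2 / n - mean * mean, 0.0)   # S_N squared
    return mean, math.sqrt(variance / n)


mean, error = analog_estimate(lambda x: x * x, lambda rng: rng.random(), n=100_000)
print(f"estimate = {mean:.5f} +/- {error:.5f} (exact value 1/3)")
```

Increasing n by a factor of 100 shrinks the reported error by only a factor of 10, which is the motivation for variance reduction.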

In order to preserve the mean we define a modified score function H(x) given by

$$H(x) = h(x)\,\frac{f(x)}{g(x)}.$$

The expectation value of H is evaluated over the importance density g(x); this is identically equal to the expectation value of the original score function h over the analog density f(x):

$$\mu(H) = \int_{-\infty}^{\infty} H(x)\,g(x)\,dx = \int_{-\infty}^{\infty} h(x)\,\frac{f(x)}{g(x)}\,g(x)\,dx = \int_{-\infty}^{\infty} h(x)\,f(x)\,dx = \mu(h).$$

Thus we sample xi, i = 1, 2,…, N, from g(x) and weight each score by the ratio f(xi)/g(xi). Since the mean is preserved under importance sampling, comparing variances only requires the second moment, which for H is ∫ h²(x) [f(x)/g(x)] f(x) dx. Choosing the importance function g(x) so that the ratio f(x)/g(x) is, on average, substantially less than unity where h(x) contributes therefore reduces the variance; this is in essence the basic principle of variance reduction by importance sampling. Sampling from a well-chosen importance density thus helps us estimate the mean with much better statistical accuracy for a given sample size.
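As an illustration of the modified score function H(x) = h(x) f(x)/g(x), the sketch below estimates a small tail probability. All choices here are assumptions made for the example: f is the standard normal density, h(x) is the indicator of x > a, and the importance density g is a unit-variance normal centred at a, so that f/g is far below unity wherever h is non-zero.

```python
import math
import random


def normal_pdf(x, mean=0.0):
    """Density of a unit-variance normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def tail_probability_is(a, n, seed=0):
    """Estimate P(X > a), X standard normal, by sampling from g = N(a, 1)."""
    rng = random.Random(seed)
    sum_h = sum_h2 = 0.0
    for _ in range(n):
        x = rng.gauss(a, 1.0)                         # sample from g(x)
        h = 1.0 if x > a else 0.0                     # original score function
        big_h = h * normal_pdf(x) / normal_pdf(x, a)  # modified score H(x)
        sum_h += big_h
        sum_h2 += big_h * big_h
    mean = sum_h / n
    error = math.sqrt(max(sum_h2 / n - mean * mean, 0.0) / n)
    return mean, error


estimate, error = tail_probability_is(a=4.0, n=20_000)
exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))
print(f"importance sampling: {estimate:.3e} +/- {error:.1e}, exact: {exact:.3e}")
```

An analog simulation with the same 20,000 samples would, on average, observe fewer than one event with x > 4, so it could not give a usable estimate.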

3.4.5.2 Latin Hypercube Sampling

As an alternative to random sampling, a stratified sampling technique can ensure that samples are taken from the entire range of the distribution. One such technique, developed by McKay, Beckman, and Conover [16], is LHS. In LHS, the range of each input distribution is divided into n intervals of equal marginal probability. One value of the random variable is selected from each interval. The sample taken from each interval may be selected at random from within the interval, or taken as the median of the interval. The former is referred to as random LHS, and the latter as median LHS. Median LHS requires less computing time than random LHS since the median value of an interval can be calculated rather than randomly selected. However, it has the disadvantage that in the rare situation of sampling from a periodic distribution, erroneous results are obtained if the interval width is on the order of the period of the distribution [12]. The specified sample size, and therefore the interval width, can be adjusted to avoid this type of problem. Another disadvantage of median LHS is that it may give less accurate estimates of the standard deviation than random Monte Carlo sampling. The stratification of the input distributions into n equal-probability intervals ensures that samples are taken from the entire range of the distributions even with a relatively small sample size compared to random Monte Carlo sampling. The primary disadvantage of LHS is that, because it is not a purely random sampling technique, the results are not subject to analysis by standard statistics. Therefore, one cannot determine in advance the sample size necessary for a desired degree of convergence, as is possible for random Monte Carlo sampling. However, the sample size necessary for LHS is typically less than that of Monte Carlo sampling. Also, since stratification requires a priori knowledge of the sample size, one cannot run LHS for n samples, compute a degree of convergence, and then continue sampling as described above for random Monte Carlo sampling.
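The stratification step can be sketched for a single input as follows. This is a minimal illustration, not the procedure of [16] in full: the function names are hypothetical, the exponential input distribution is an assumed example, and the shuffle stands in for the random pairing that is needed when several inputs are combined.

```python
import math
import random


def lhs_probabilities(n, rng, median=False):
    """One cumulative probability per equal-probability interval, in random order."""
    probs = [(i + (0.5 if median else rng.random())) / n for i in range(n)]
    rng.shuffle(probs)      # random order, so that inputs can be paired at random
    return probs


def exponential_quantile(u, rate):
    """Inverse CDF of the exponential distribution."""
    return -math.log(1.0 - u) / rate


rng = random.Random(0)
sample = [exponential_quantile(u, rate=1.0)
          for u in lhs_probabilities(1000, rng, median=True)]
print(sum(sample) / len(sample))    # close to the true mean 1/rate
```

Passing `median=False` gives random LHS; in both variants every one of the n intervals contributes exactly one value, which is what guarantees coverage of the whole range.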

3.5 Dynamic Reliability Analysis

Dynamic reliability methods focus on modeling the behavior of components of complex systems and their interactions, such as sequence- and functional-dependent failures, spares and dynamic redundancy management, and priority of failure events. As an example of sequence-dependent failure, consider a power supply system in an NPP where one active system (grid supply) and one standby system (DG supply) are connected with a switch controller. If the switch controller fails after the grid supply fails, then the system can continue operation with the DG supply. However, if the switch fails before the grid supply fails, then the DG supply cannot be switched into active operation and the power supply fails when the grid supply fails. Thus, the failure criterion depends on the sequence of events as well as on the combination of events. One of the most widely used approaches to address these sequential dependencies is dynamic FTA, which is the marriage of conventional FTA and Markov models or Monte Carlo simulation.
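The sequence dependence in this example can be made concrete with a few lines of simulation. The sketch below rests on assumptions that go beyond the description above: the DG does not fail while in standby, it always starts on demand, nothing is repaired, all times are exponential, and the function names and rates are hypothetical.

```python
import random


def supply_failure_time(t_grid, t_switch, dg_life):
    """Failure time of the grid + standby DG + switch system for one sampled history."""
    if t_switch < t_grid:
        return t_grid              # switch already failed: DG cannot take over
    return t_grid + dg_life        # DG takes over and runs until it fails


def failure_probability(rate_grid, rate_switch, rate_dg, mission_time, runs, seed=0):
    """Fraction of simulated histories in which the supply fails within the mission."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(runs):
        t_fail = supply_failure_time(rng.expovariate(rate_grid),
                                     rng.expovariate(rate_switch),
                                     rng.expovariate(rate_dg))
        failures += t_fail <= mission_time
    return failures / runs


# Same three events, different order, different outcome:
print(supply_failure_time(t_grid=100.0, t_switch=50.0, dg_life=30.0))    # 100.0
print(supply_failure_time(t_grid=100.0, t_switch=150.0, dg_life=30.0))   # 130.0
print(failure_probability(1e-3, 1e-4, 1e-2, mission_time=1000.0, runs=20_000))
```

A static AND/OR fault tree sees only which events have occurred and therefore cannot distinguish the first two cases.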

3.5.1 Dynamic Fault Tree Gates

The traditional static fault trees with AND, OR, and voting gates cannot capture the behavior of components of complex systems and their interactions. In order to overcome this difficulty, the concept of dynamic fault trees is introduced by adding a sequential notion to the traditional fault tree approach [17]. System failures can then depend on component failure order as well as combination. This is done by introducing dynamic gates into fault trees. With the help of dynamic gates, system sequence-dependent failure behavior can be specified using dynamic fault trees that are compact and easily understood. The modeling power of dynamic fault trees has gained the attention of many reliability engineers working on safety-critical systems.

Dynamic fault trees introduce four basic (dynamic) gates: the priority AND (PAND), the sequence enforcing (SEQ), the warm SPARE, and the functional dependency (FDEP) [13] (Figure 3.58). They are discussed here briefly.

[Figure 3.58 panels: PAND, SEQ, SPARE, and FDEP gate symbols, each with output G]

Figure 3.58 Dynamic gates

3.5.1.1 PAND Gate

The PAND gate reaches a failure state if all of its input components have failed in a pre-assigned order (from left to right in graphical notation). In Figure 3.58, a failure occurs if A fails before B, but B may fail before A without producing a failure in G. A truth table for a PAND gate is shown in Table 3.13; the occurrence of an event is represented as 1 and its non-occurrence as 0. In the second row, both A and B have failed but, because the order is not the pre-assigned one, the system does not fail.

Table 3.13 Truth table for PAND gate with two inputs

| A | B | Output |
|---|---|---|
| 1 (first) | 1 (second) | 1 |
| 1 (second) | 1 (first) | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 0 | 0 | 0 |


Example

The fire alarm in a chemical process plant gives a signal to fire-fighting personnel for further action if it detects any fire. If the fire alarm fails (gets burnt in the fire) after giving the alarm, then the plant will be in a safe state as fire-fighting is in place. However, if the alarm fails before the fire accident (an undetected failure in standby mode), then the extent of damage would be very high. This can be modeled only by a PAND gate, as the scenario exactly fits its definition.
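In a simulation, the PAND logic of Table 3.13 reduces to a comparison of failure times. The sketch below is one possible encoding, with hypothetical function names; an input that never fails is represented by an infinite failure time, and a tie is counted as the pre-assigned order.

```python
import math

NEVER = math.inf   # marks an input that does not fail


def pand_output_time(t_a, t_b):
    """Time at which a two-input PAND gate fails, or NEVER."""
    if t_a <= t_b < NEVER:
        return t_b          # A failed first, and B has failed as well
    return NEVER


def pand_output(t_a, t_b, t):
    """Truth value of the gate output at time t (1 = occurred)."""
    return int(pand_output_time(t_a, t_b) <= t)


# Rows of the truth table, evaluated at t = 10
for t_a, t_b in [(2, 5), (5, 2), (NEVER, 5), (2, NEVER), (NEVER, NEVER)]:
    print(t_a, t_b, pand_output(t_a, t_b, t=10))
```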

3.5.1.2 SEQ Gate

A SEQ gate forces its inputs to fail in a particular order: when a SEQ gate is found in a dynamic fault tree, it never happens that the failure sequence takes place in different orders (Table 3.14). While the SEQ gate allows the events to occur only in a pre-assigned order and states that a different failure sequence can never take place, the PAND gate does not force such a strong assumption: it simply detects the failure order and fails just in one case.

Table 3.14 Truth table for SEQ gate with three inputs

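| A | B | C | Output |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | Impossible |
| 0 | 1 | 0 | Impossible |
| 0 | 1 | 1 | Impossible |
| 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | Impossible |
| 1 | 1 | 0 | 0 |
| 1 | 1 | 1 | 1 |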

Example

Consider a scenario where a pipe in a pumping system fails in different stages. There is a minor welding defect at the joint of the pipe section, which can become a minor leak with time and subsequently lead to a rupture.
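One way to enforce the SEQ ordering in a simulation is to start the clock of each input only when the previous input has occurred, so that the event times are cumulative sums of stage durations. The sketch below applies this to the pipe example; the function names and the exponential stage rates are hypothetical.

```python
import itertools
import random


def seq_event_times(stage_durations):
    """Occurrence time of each SEQ input; the gate output occurs at the last one."""
    return list(itertools.accumulate(stage_durations))


def sample_pipe_history(stage_rates, rng):
    """Sample defect -> leak -> rupture times with exponential stage durations."""
    return seq_event_times([rng.expovariate(rate) for rate in stage_rates])


print(seq_event_times([5.0, 3.0, 2.0]))                       # [5.0, 8.0, 10.0]
print(sample_pipe_history([1e-3, 5e-3, 1e-2], random.Random(0)))
```

By construction the "Impossible" rows of Table 3.14 can never be generated, because a later stage has no failure time until the earlier one has occurred.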

3.5.1.3 SPARE Gate

SPARE gates are dynamic gates modeling one or more principal components that can be substituted by one or more backups (spares) with the same functionality (Figure 3.58; truth table in Table 3.15). The SPARE gate fails when the number of operational powered spares and/or principal components is less than the minimum required. Spares can fail even while they are dormant, but the failure rate of an unpowered spare is generally lower than the failure rate of the corresponding powered one. More precisely, λ being the failure rate of a powered spare, the failure rate of the unpowered spare is αλ, where 0 ≤ α ≤ 1 is the dormancy factor. Spares are more properly called “hot” if α = 1, “cold” if α = 0, and “warm” otherwise.

Table 3.15 Truth table for SPARE gate with two inputs

| A | B | Output |
|---|---|---|
| 1 | 1 | 1 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 0 | 0 | 0 |

Example

The reactor regulation system in an NPP consists of a dual-processor hot standby system. Both processors work continuously. Processor 1 normally performs the regulation; if it fails, processor 2 takes over.
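The effect of the dormancy factor α can be seen by sampling the failure time of a SPARE gate with one primary and one spare. The sketch below assumes exponential failure times, perfect switching, and no repair; the function names and rates are hypothetical.

```python
import math
import random


def spare_gate_failure_time(t_primary, t_dormant, t_active):
    """Failure time of a SPARE gate with one primary and one spare.

    t_primary: failure time of the primary component
    t_dormant: time at which the spare would fail while unpowered
    t_active:  operating life of the spare once it has been switched in
    """
    if t_dormant < t_primary:
        return t_primary                 # spare already failed: gate fails with the primary
    return t_primary + t_active          # spare takes over, then fails in operation


def sample_spare_gate(rate_primary, rate_spare, alpha, rng):
    """Sample one gate failure time; the dormant failure rate is alpha * rate_spare."""
    t_primary = rng.expovariate(rate_primary)
    t_dormant = rng.expovariate(alpha * rate_spare) if alpha > 0 else math.inf
    t_active = rng.expovariate(rate_spare)
    return spare_gate_failure_time(t_primary, t_dormant, t_active)


rng = random.Random(0)
for alpha in (1.0, 0.5, 0.0):            # hot, warm, cold
    mean = sum(sample_spare_gate(1.0, 1.0, alpha, rng) for _ in range(20_000)) / 20_000
    print(f"alpha = {alpha}: mean gate life = {mean:.3f}")
```

With unit rates the mean gate life should rise from about 1.5 for a hot spare (α = 1, as in the dual-processor example) to about 2.0 for a cold spare (α = 0).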

3.5.1.4 FDEP Gate

In the FDEP gate (Figure 3.58; truth table in Table 3.16), there will be one trigger input (either a basic event or the output of another gate in the tree) and one or more dependent events. The dependent events are functionally dependent on the trigger event. When the trigger event occurs, the dependent basic events are forced to occur. In the Markov chain generation, when a state is generated in which the trigger event is satisfied, all the associated dependent events are marked as having occurred. The separate occurrence of any of the dependent basic events has no effect on the trigger event.

Table 3.16 Truth table for FDEP gate with two inputs

| Trigger | Output | Dependent event 1 | Dependent event 2 |
|---|---|---|---|
| 1 | 1 | 1 | 1 |
| 0 | 0 | 0/1 | 0/1 |

Example

In the event of a power supply failure, all the dependent systems will be unavailable. The trigger event is the power supply failure, and the failures of the systems drawing power from it are the dependent events.
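In terms of failure times, the FDEP rule says that a dependent event occurs at its own failure time or at the trigger time, whichever comes first, while the trigger time itself is unaffected. A minimal sketch, with hypothetical names and times:

```python
def apply_fdep(trigger_time, dependent_times):
    """Effective occurrence times of dependent events under an FDEP gate."""
    return {name: min(own_time, trigger_time)
            for name, own_time in dependent_times.items()}


# Power supply fails at t = 40; pump A would have failed on its own at t = 25,
# pump B at t = 90.
print(apply_fdep(40.0, {"pump A": 25.0, "pump B": 90.0}))
# {'pump A': 25.0, 'pump B': 40.0}
```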


Figure 3.59 Markov models for (a) AND gate, (b) PAND gate, (c) SEQ gate, (d) SPARE gate, and (e) FDEP gate

[Figure 3.59 panels: (a) AND, (b) PAND, (c) SEQ, (d) SPARE, (e) FDEP. Each panel is a state transition diagram over the states of inputs A and B (plus trigger T for FDEP), with transition rates λA, λB, αλ (SPARE), and λT (FDEP).]


3.5.2 Modular Solution for Dynamic Fault Trees

Markov models can be used to solve dynamic fault trees. The order of occurrence of failure events can be easily modeled with the help of Markov models. Figure 3.59 shows the Markov models for various gates. However, the solution of a Markov model is much more time- and memory-consuming than the solution of a standard fault tree model. As the number of components increases in the system, the number of states and transition rates grows exponentially. Development of the state transition diagram can become very cumbersome and a mathematical solution may be infeasible.
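As a small illustration of the Markov approach, the sketch below integrates the state equations of a two-input PAND gate and compares the result with the closed-form probability of A failing before B within time t. The five-state chain used here (both up; A failed; B failed; A then B failed; B then A failed) is an assumed reading of the PAND model, with exponential failures and no repair, and the simple Euler integration and function names are choices made for this example.

```python
import math


def pand_markov(rate_a, rate_b, t, steps=20_000):
    """Euler integration of the five-state chain; returns the state probabilities."""
    # p[0]: both up, p[1]: A failed, p[2]: B failed,
    # p[3]: A then B failed (gate output), p[4]: B then A failed (no output)
    p = [1.0, 0.0, 0.0, 0.0, 0.0]
    dt = t / steps
    for _ in range(steps):
        d0 = -(rate_a + rate_b) * p[0]
        d1 = rate_a * p[0] - rate_b * p[1]
        d2 = rate_b * p[0] - rate_a * p[2]
        d3 = rate_b * p[1]
        d4 = rate_a * p[2]
        p = [p[0] + dt * d0, p[1] + dt * d1, p[2] + dt * d2,
             p[3] + dt * d3, p[4] + dt * d4]
    return p


def pand_closed_form(rate_a, rate_b, t):
    """Pr{T_A < T_B <= t} for independent exponential failure times."""
    total = rate_a + rate_b
    return (rate_a / total * (1.0 - math.exp(-total * t))
            - math.exp(-rate_b * t) * (1.0 - math.exp(-rate_a * t)))


probs = pand_markov(0.002, 0.001, 1000.0)
print(probs[3], pand_closed_form(0.002, 0.001, 1000.0))
```

Even this single gate with two inputs needs five states, which hints at how quickly the state space grows for larger trees.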

Dugan proposed a modular approach for solving dynamic fault trees. In this approach, the system-level fault tree is divided into independent modules, the modules are solved separately, and the separate results are then combined to achieve a complete analysis. The dynamic modules are solved with the help of Markov models, and the solution of the static modules is straightforward.

For example, consider the fault tree for dual processor failure; the dynamic module can be identified as shown in Figure 3.60. The remaining module has only static gates. Using the Markov model approach the dynamic module can be solved and plugged into the fault tree for further analysis.

Figure 3.60 Fault tree for dual-processor failure


3.5.3 Numerical Method

Amari [18] proposed a numerical integration technique for solving dynamic gates, explained below.

3.5.3.1 PAND Gate

A PAND gate has two inputs. The output occurs when the two inputs occur in a specified order (left one first and then right one). Let T1 and T2 be the random variables of the inputs (subtrees). Therefore,

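$$
\begin{aligned}
G(t) &= \Pr\{T_1 \le T_2 < t\} \\
     &= \int_{x_1=0}^{t} dG_1(x_1) \left[ \int_{x_2=x_1}^{t} dG_2(x_2) \right] \\
     &= \int_{x_1=0}^{t} dG_1(x_1) \left[ G_2(t) - G_2(x_1) \right].
\end{aligned}
$$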

Once we compute G1(t) and G2(t), we can easily find G(t) in the above equation using numerical integration methods. In order to illustrate this computation, the trapezoidal integral is used. Therefore,

$$
G(t) = \sum_{i=1}^{M} \left[ G_1(i \times h) - G_1((i-1) \times h) \right] \times \left[ G_2(t) - G_2(i \times h) \right],
$$

where M is the number of time steps (intervals) and h = t/M is the step size. The number of steps, M, in the above equation is almost equivalent to the number of steps (K) required to solve the differential equations of the corresponding Markov chain. Therefore, the gain in these computations can be of the order of $n3^n$, which shows that this method takes much less computational time than the Markov chain solution.
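The discretized sum above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in [18]: the function names and the exponential input distributions are assumptions chosen so that the numerical value can be checked against the closed form of the same integral for two independent exponential inputs.

```python
import math

def pand_probability(g1, g2, t, steps=2000):
    """Approximate G(t) = Pr{T1 <= T2 < t} with the discretized sum from the text."""
    h = t / steps
    g2_t = g2(t)
    total = 0.0
    for i in range(1, steps + 1):
        d_g1 = g1(i * h) - g1((i - 1) * h)   # input 1 occurs in ((i-1)h, ih]
        total += d_g1 * (g2_t - g2(i * h))   # input 2 then occurs in (ih, t]
    return total

def exp_cdf(rate):
    """CDF of an exponential time to occurrence (assumed input model)."""
    return lambda x: 1.0 - math.exp(-rate * x)

def pand_exact_exponential(l1, l2, t):
    """Closed form of the integral for two independent exponential inputs."""
    first = (l1 / (l1 + l2)) * (1.0 - math.exp(-(l1 + l2) * t))
    second = math.exp(-l2 * t) * (1.0 - math.exp(-l1 * t))
    return first - second

approx = pand_probability(exp_cdf(0.01), exp_cdf(0.02), t=100.0)
exact = pand_exact_exponential(0.01, 0.02, 100.0)
print(f"numerical: {approx:.5f}, closed form: {exact:.5f}")
```

Because each term pairs the increment of G1 with G2 evaluated at the right end of the cell, the sum slightly underestimates the integral; the gap shrinks in proportion to the step size h.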

Example 14 Consider a PAND gate with AND and OR gates as inputs (see Table 3.17 and Figure 3.61). For a mission time of 1000 h, calculate the top-event probability.

Solution: The problem is solved with the numerical integration technique, and the result is compared with the Markov model approach. For a mission time of 1000 h, the top-event probability is 0.362, and the overall computation time is less than 0.01 s. The state-space approach generated 162 states, and its computation time is 25 s.
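The calculation in Example 14 can be reproduced approximately with the sketch below. Several points are assumptions rather than statements from the source: the AND gate is taken as the first (left) input of the PAND gate and the OR gate as the second, all basic events are taken as independent and exponentially distributed with the rates of Table 3.17 in units of per hour, and a step size of 1 h is used.

```python
import math

# Failure rates from Table 3.17 (units assumed to be per hour)
AND_RATES = [0.011, 0.012, 0.013, 0.014, 0.015]
OR_RATES = [0.0011, 0.0012, 0.0013, 0.0014, 0.0015]

def and_cdf(x):
    """AND gate output occurs once every one of its basic events has occurred."""
    return math.prod(1.0 - math.exp(-r * x) for r in AND_RATES)

def or_cdf(x):
    """OR gate output occurs as soon as any one of its basic events occurs."""
    return 1.0 - math.exp(-sum(OR_RATES) * x)

def pand_probability(g1, g2, t, steps):
    """Discretized PAND sum: input 1 (g1) must occur before input 2 (g2)."""
    h = t / steps
    g2_t = g2(t)
    return sum((g1(i * h) - g1((i - 1) * h)) * (g2_t - g2(i * h))
               for i in range(1, steps + 1))

# Assumption: AND gate is the first input, OR gate the second
top_event = pand_probability(and_cdf, or_cdf, t=1000.0, steps=1000)
print(f"Top-event probability at 1000 h: {top_event:.3f}")
```

Under these assumptions the printed value can be compared with the 0.362 reported in the solution; swapping the input order gives a different result, which is one way to check the assumed gate orientation.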


Figure 3.61 Fault tree having dynamic gate (PAND)

Table 3.17 Failure data for the basic events

Gate   Failure rates of basic events
AND    0.011    0.012    0.013    0.014    0.015
OR     0.0011   0.0012   0.0013   0.0014   0.0015

3.5.3.2 SEQ Gate

A SEQ gate forces events to occur in a particular order. The first input of a SEQ gate can be a basic event or a gate, and all other inputs are basic events.

Let Gi be the distribution of the time to occurrence of input i. The probability of occurrence of the SEQ gate is then found from the distribution of the sum of these times:

$$
G(t) = \Pr\{T_1 + T_2 + \cdots + T_m < t\} = G_1 * G_2 * \cdots * G_m(t),
$$

where $*$ denotes convolution.
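As an illustration of how the distribution of the sum T1 + T2 + ... + Tm can be evaluated numerically, the following sketch performs a repeated discrete convolution on a uniform time grid. The grid-based scheme, the function names, and the exponential test inputs are assumptions made for this example; the two-input result is checked against the known closed form for the sum of two independent exponential times with different rates.

```python
import math

def seq_probability(cdfs, t, steps=400):
    """Approximate Pr{T1 + ... + Tm < t} by repeated discrete convolution."""
    h = t / steps
    grid = [k * h for k in range(steps + 1)]
    acc = [cdfs[0](x) for x in grid]          # CDF of T1 on the grid
    for cdf in cdfs[1:]:
        # Probability mass of the next input in each cell ((i-1)h, ih]
        mass = [cdf(grid[i]) - cdf(grid[i - 1]) for i in range(1, steps + 1)]
        nxt = [0.0] * (steps + 1)
        for k in range(1, steps + 1):
            # Earlier inputs finish by (k-i)h, next input falls in cell i
            nxt[k] = sum(acc[k - i] * mass[i - 1] for i in range(1, k + 1))
        acc = nxt
    return acc[-1]

def exp_cdf(rate):
    return lambda x: 1.0 - math.exp(-rate * x)

def two_exponential_sum_cdf(l1, l2, t):
    """Closed-form CDF of the sum of two independent exponentials (l1 != l2)."""
    return 1.0 - (l2 * math.exp(-l1 * t) - l1 * math.exp(-l2 * t)) / (l2 - l1)

approx = seq_probability([exp_cdf(0.01), exp_cdf(0.02)], t=100.0)
exact = two_exponential_sum_cdf(0.01, 0.02, 100.0)
print(f"numerical: {approx:.4f}, closed form: {exact:.4f}")
```

The cost grows with the square of the number of grid steps for each additional input, so the step count is a trade-off between accuracy and run time.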



3.5.3.3 SPARE Gate

A generic SPARE gate allows the modeling of heterogeneous spares including cold, hot, and warm spares. The output of the SPARE gate will be true when the number of powered spares/components is less than the minimum number required. The only inputs that are allowed for a SPARE gate are basic events (spare events). Therefore:

• If all the distributions are exponential, we can get the closed-form solutions for G(t).

• If the standby failure rates of all spares are constant (not time dependent), then G(t) can be solved using non-homogeneous Markov chains.

• Otherwise, we need to use conditional probabilities or simulation to solve this part of the fault tree.

Therefore, using the above method, we can calculate the occurrence probability of a dynamic gate without explicitly converting it into a Markov model (except for some cases of the SPARE gate).

3.5.4 Monte Carlo Simulation

Monte Carlo simulation is a valuable method that is widely used to solve real engineering problems in many fields. It is increasingly used to assess the availability of complex systems and the monetary value of plant operation and maintenance. The complexity of modern engineering systems, together with the need for realistic considerations when modeling their availability/reliability, makes analytical methods very difficult to use. Analyses that involve repairable systems with multiple additional events and/or other maintainability information are very difficult to solve analytically (dynamic fault trees through state space, numerical integration, and Bayesian network approaches). Dynamic fault trees solved through a simulation approach can incorporate these complexities and can provide a wide range of output parameters.

The four basic dynamic gates are solved here through a simulation approach [19].

3.5.4.1 PAND Gate

Consider a PAND gate having two active components. An active component is one that is in working condition during normal operation of the system; it can be either in a success state or in a failure state. The time to failure of a component is sampled from its failure PDF using the procedure mentioned above. Each failure is followed by a repair whose duration is sampled from the PDF of repair time. This sequence is continued until the predetermined system mission time is reached. A state time diagram is developed in the same way for the second component.

To generate the PAND gate state time diagram, the state time profiles of both components are compared. The PAND gate reaches a failure state if all of its input components have failed in a pre-assigned order (usually from left to right). As shown in Figure 3.62 (first and second scenarios), when the first component fails and is followed by the second component, the event is identified as a failure and the simultaneous down time is taken into account. In the third scenario of Figure 3.62, both components are down at the same time, but the second component failed first; hence it is not considered a failure.

Figure 3.62 PAND gate state time possibilities
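The comparison of the two state time profiles can be sketched as follows. This is a simplified illustration rather than the code of [19]: exponential failure and repair times, the function names, and the rule that overlap is counted only when the second component fails while the first is still down are assumptions of the sketch.

```python
import random

def down_intervals(rng, failure_rate, repair_rate, mission_time):
    """Sample one state time profile; return its (start, end) down intervals."""
    t, downs = 0.0, []
    while True:
        t += rng.expovariate(failure_rate)              # time to failure
        if t >= mission_time:
            return downs
        # Repair time, truncated at the end of the mission
        end = min(t + rng.expovariate(repair_rate), mission_time)
        downs.append((t, end))
        t = end

def pand_down_time(a_downs, b_downs):
    """Down time of a two-input PAND gate (A is the first, left input)."""
    total = 0.0
    for a_start, a_end in a_downs:
        for b_start, b_end in b_downs:
            # Count the overlap only if B failed while A was already down
            if a_start <= b_start < a_end:
                total += min(a_end, b_end) - b_start
    return total

def pand_unavailability(rate_a, repair_a, rate_b, repair_b,
                        mission_time, runs, seed=1):
    """Average fraction of mission time the PAND gate spends in the failed state."""
    rng = random.Random(seed)
    down = 0.0
    for _ in range(runs):
        a = down_intervals(rng, rate_a, repair_a, mission_time)
        b = down_intervals(rng, rate_b, repair_b, mission_time)
        down += pand_down_time(a, b)
    return down / (runs * mission_time)

# First scenario: A fails, then B -> overlap counts as gate down time
print(pand_down_time([(10.0, 30.0)], [(20.0, 40.0)]))
# Third scenario: B fails first -> not a failure
print(pand_down_time([(20.0, 40.0)], [(10.0, 30.0)]))
```

Non-exponential distributions can be substituted by replacing the two `rng.expovariate` calls with any other sampler.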

3.5.4.2 SPARE Gate

The SPARE gate has one active component, and the remaining components are spares. Component state time diagrams are generated in sequence, starting with the active component, followed by the spare components in order from left to right. The steps are as follows:

• Active components: times to failure and times to repair, based on their respective PDFs, are generated alternately until the mission time is reached.

• Spare components: when there is no demand, a spare is either in the standby state or in a failed state due to on-shelf failure. When a demand does arrive, it can also be unavailable because of scheduled testing or maintenance. The component therefore has multiple states, and this stochastic behavior needs to be modeled to represent the practical scenario. Down times due to the scheduled testing and maintenance policies are accommodated first in the component state time diagrams. In certain cases, a test override probability has to be taken into account for the availability of the component during testing. Because failures that occur during the standby period cannot be revealed until the next test, the time from failure to identification has to be counted as down time. Next, the standby down times obtained from the standby time-to-failure PDF and the time-to-repair PDF are imposed. Apart from availability on demand, it is also necessary to check whether the standby component successfully completes its mission. This is done by sampling a time to failure from the operating failure PDF and comparing it with the mission time, which is the down time of the active component. If the first standby component fails before the active component recovers, the demand is passed on to the next spare component.

Figure 3.63 SPARE gate state time possibilities

Various scenarios with the SPARE gate are shown in Figure 3.63. In the first scenario, the demand created by failure of the active component is met by the standby component, but the standby fails before the active component recovers. In the second scenario, the demand is met by the standby component; the standby failed twice while dormant, but this had no effect on the success of the system. In the third scenario, the standby component is already in a failed state when the demand arrives, but its subsequent recovery reduces the overall down time.

3.5.4.3 FDEP Gate

The FDEP gate’s output is a “dummy” output, as it is not taken into account during the calculation of the system’s failure probability. When the trigger event occurs, it leads to the occurrence of the dependent events associated with the gate. Failure and repair times of the trigger event are generated from its PDFs. During the down time of the trigger event, the dependent events are virtually in a failed state even though they are functioning. This scenario is depicted in Figure 3.64. In the second scenario, the individual occurrences of the dependent events do not affect the trigger event.

Figure 3.64 FDEP gate state time possibilities
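In terms of state time profiles, the FDEP behavior amounts to taking the union of the trigger's down intervals and the dependent component's own down intervals, while leaving the trigger's profile untouched. A minimal sketch, with hypothetical function names and down intervals represented as (start, end) pairs:

```python
def merge_intervals(intervals):
    """Union of (start, end) intervals, returned sorted and non-overlapping."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def fdep_effective_downs(trigger_downs, dependent_downs):
    """Dependent event is down during its own failures and whenever the trigger is down."""
    return merge_intervals(list(trigger_downs) + list(dependent_downs))

trigger = [(10.0, 20.0)]
dependent = [(15.0, 30.0), (50.0, 60.0)]
print(fdep_effective_downs(trigger, dependent))
```

The one-way dependency is reflected by the fact that only the dependent profile is modified; the trigger intervals are passed through unchanged.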

3.5.4.4 SEQ Gate

The SEQ gate is similar to the priority AND gate, but the events are forced to occur in a particular order. Failure of the first component forces the other components to follow; no component can fail before the first component. Consider a three-input SEQ gate having repairable components. The Monte Carlo simulation approach involves the following steps.

1. The component state time profile is generated for the first component based upon its failure and repair rate. The down time of the first component is the mission time for the second component. Similarly the down time of the second component is the mission time for the third component.

2. When the first component fails, operation of the second component starts. The failure instance of the first component is taken as t = 0 for the second component. The time to failure (TTF2) and time to repair/component down time (CD2) are generated for the second component.

3. When the second component fails, operation of the third component starts. The failure instance of the second component is taken as t = 0 for the third component. The time to failure (TTF3) and time to repair/component down time (CD3) are generated for the third component.

4. The common period in which all the components are down is considered as the down time of the SEQ gate.

5. The process is repeated for all the down states of the first component.
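For a single down state of the first component, steps 1 to 4 can be sketched as below. The function names and the use of exponential failure and repair times in the sampler are assumptions of this example; the arithmetic follows the TTFi and CDi definitions used in Figure 3.65.

```python
import random

def seq_down_time(ttf1, cd1, ttf2, cd2, ttf3, cd3):
    """Common down time of a three-input SEQ gate for one failure of component 1."""
    fail1 = ttf1
    fail2 = fail1 + ttf2     # clock of component 2 starts when component 1 fails
    fail3 = fail2 + ttf3     # clock of component 3 starts when component 2 fails
    # The gate is down from the last failure until the first repair completes
    first_recovery = min(fail1 + cd1, fail2 + cd2, fail3 + cd3)
    return max(0.0, first_recovery - fail3)

def sample_seq_down_time(rng, failure_rates, repair_rates):
    """Draw TTF and CD for each component (exponential times are an assumption)."""
    draws = []
    for lam, mu in zip(failure_rates, repair_rates):
        draws += [rng.expovariate(lam), rng.expovariate(mu)]
    return seq_down_time(*draws)

# Component 1 down on [100, 150], component 2 on [110, 140], component 3 on [115, 155]
print(seq_down_time(100.0, 50.0, 10.0, 30.0, 5.0, 40.0))
# Component 2 would fail only after component 1 has recovered: no gate down time
print(seq_down_time(100.0, 50.0, 60.0, 30.0, 5.0, 40.0))
```

Step 5 corresponds to calling the sampler once for every down state of the first component and accumulating the returned values as SYS_DOWN.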


Figure 3.65 SEQ gate state time possibilities. TTFi = time to failure for ith component; CDi = component down time for ith component; SYS_DOWN = system down time

3.5.4.5 Case Study 1 – Simplified Electrical (AC) Power Supply System of Nuclear Power Plant

An electrical power supply is essential for the operation of the process and safety systems of any NPP. The grid supply (off-site power supply), known as the class IV supply, feeds all these loads. To ensure high reliability of the power supply, redundancy is provided by DGs, known as the class III supply (or on-site emergency supply), which supply the loads when the class IV supply is absent. Sensing and control circuitry detects the failure of the class IV supply and triggers the redundant class III supply. Loss of the off-site power supply (class IV) coupled with loss of on-site AC power (class III) is called station blackout. In many PSA studies [12], severe accident sequences resulting from station blackout conditions have been recognized to be significant contributors to the risk of core damage. For this reason, the reliability/availability modeling of AC power supply systems is of special interest in the PSA of NPPs.

The RBD is shown in Figure 3.66. This system can be modeled with dynamic gates to calculate the unavailability of the overall AC power supply of an NPP.

The dynamic fault tree (Figure 3.67) has one PAND gate with two events, namely, sensor and class IV. If the sensor fails first, it will not be able to trigger class III, which leads to non-availability of the power supply. If the sensor fails after class IV has already failed and class III has been triggered, the power supply is not affected. As class III is a standby component to class IV, the pair is represented with a SPARE gate; their simultaneous unavailability leads to supply failure. There is also a functional dependency gate, with the sensor as the trigger signal and class III as the dependent event.

This system is solved with an analytical approach and Monte Carlo simulation.


Figure 3.66 RBD of electrical power supply system of an NPP (blocks: grid supply, diesel supply, sensing and control circuitry)

Figure 3.67 Dynamic fault tree for station blackout (top event: station blackout; gates labeled CSP and FDEP; basic events: class IV failure, class III failure, sensor failure)


Solution with Analytical Approach

Station blackout is the top event of the fault tree. Dynamic gates can be solved by developing state-space diagrams, whose solutions give the required measures of reliability. However, subsystems that are tested (surveillance), maintained, and repaired if a problem is identified during check-up cannot be modeled by state-space diagrams. There is a school of thought that initial state probabilities can be assigned according to the maintenance and demand information, but this is often debatable. A simplified time-averaged unavailability expression is suggested by IAEA P-4 [14] for standby subsystems having exponential failure/repair characteristics. The same expression is applied here to solve the standby gate. If Q is the unavailability of the standby component, it is expressed by the following equation:

$$
Q = \left[ 1 - \frac{1 - e^{-\lambda T}}{\lambda T} \right] + \left[ \frac{\tau}{T} \right] + \left[ f_m T_m \right] + \left[ \lambda T_r \right],
$$

where λ is the failure rate, T is the test interval, τ is the test duration, $f_m$ is the frequency of preventive maintenance, $T_m$ is the duration of maintenance, and $T_r$ is the repair time. Q is the sum of the contributions from failures, test outage, maintenance outage, and repair outage. To obtain the unavailability of the standby gate, the unavailability of class IV is multiplied by the unavailability of the standby component (Q).
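The four contributions can be evaluated directly. The sketch below uses the class III and class IV values of Table 3.18; interpreting the maintenance frequency as the reciprocal of the maintenance period, the repair time as the reciprocal of the repair rate, and the class IV unavailability as its steady-state value λ/(λ + μ) are assumptions of this example, as are the function and variable names.

```python
import math

def standby_unavailability(lam, test_interval, test_duration,
                           maint_freq, maint_duration, repair_time):
    """Time-averaged unavailability of a periodically tested standby component."""
    failures = 1.0 - (1.0 - math.exp(-lam * test_interval)) / (lam * test_interval)
    test_outage = test_duration / test_interval
    maint_outage = maint_freq * maint_duration
    repair_outage = lam * repair_time
    return failures + test_outage + maint_outage + repair_outage

# Class III (Table 3.18); f_m = 1 / maintenance period and
# T_r = 1 / repair rate are assumptions of this sketch
q_class3 = standby_unavailability(
    lam=5.33e-4, test_interval=168.0, test_duration=0.0833,
    maint_freq=1.0 / 2160.0, maint_duration=8.0, repair_time=1.0 / 0.08695)

# Class IV steady-state unavailability lambda / (lambda + mu) (assumption)
u_class4 = 2.34e-4 / (2.34e-4 + 2.59)

spare_gate = u_class4 * q_class3
print(f"Q (class III) = {q_class3:.4f}, SPARE gate unavailability = {spare_gate:.3e}")
```

Printing the four terms separately shows that, with these inputs, undetected standby failures between tests dominate Q.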

Figure 3.68 Markov (state space) diagram for PAND gate having sensor and class IV as inputs (states give the Up/Dn status of SENSOR (A) and CL IV (B); transitions are labeled with λA, λB, μA, and μB; one A-Dn/B-Dn state is marked as the failed state)


Table 3.18 Component failure and maintenance information

Component   Failure rate (/h)   Repair rate (/h)   Test period (h)   Test time (h)   Maint. period (h)   Maint. time (h)
Class IV    2.34E–4             2.59               –                 –               –                   –
Sensor      1E–4                0.25               –                 –               –                   –
Class III   5.33E–4             0.08695            168               0.0833          2160                8

The failure of the sensor and class IV is modeled by the PAND gate in the fault tree. This gate is solved with a state-space approach by developing the Markov model shown in Figure 3.68. The state shown in bold, in which both components have failed in the required order, is the unavailable state; the remaining states are all available states. ISOGRAPH software has been used to solve the state-space model. Input parameter values used in the analysis are shown in Table 3.18 [13]. The sum of the two values (PAND and SPARE) gives the unavailability for the station blackout scenario, which is obtained as 4.847E–6.
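A five-state model of this kind can be solved for its steady-state probabilities with a small linear solver, as sketched below. The state numbering and transition structure are inferred from the description of Figure 3.68 and are assumptions of this example; the sketch computes a steady-state solution and is not a reproduction of the ISOGRAPH calculation.

```python
def solve_linear(a, b):
    """Gaussian elimination with partial pivoting (pure Python)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def pand_steady_state(lam_a, mu_a, lam_b, mu_b):
    """Steady-state probabilities of the assumed five-state PAND model."""
    # States: 0 both up, 1 A down, 2 B down,
    #         3 both down with A first (failed state), 4 both down with B first
    rates = {(0, 1): lam_a, (0, 2): lam_b,
             (1, 0): mu_a, (1, 3): lam_b,
             (2, 0): mu_b, (2, 4): lam_a,
             (3, 2): mu_a, (3, 1): mu_b,
             (4, 2): mu_a, (4, 1): mu_b}
    n = 5
    a = [[0.0] * n for _ in range(n)]
    for (src, dst), rate in rates.items():
        a[dst][src] += rate          # probability flow into dst from src
        a[src][src] -= rate          # probability flow out of src
    a[n - 1] = [1.0] * n             # replace one balance equation by normalization
    b = [0.0] * (n - 1) + [1.0]
    return solve_linear(a, b)

# Sensor (A) and class IV (B) rates from Table 3.18
pi = pand_steady_state(lam_a=1e-4, mu_a=0.25, lam_b=2.34e-4, mu_b=2.59)
pand_unavailability = pi[3]
print(f"PAND steady-state unavailability: {pand_unavailability:.3e}")
```

A useful consistency check is that the two both-down states together must carry the product of the individual steady-state unavailabilities, since the components fail and are repaired independently; the PAND state receives only part of that probability.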

Solution with Monte Carlo simulation

As one can see, the Markov model for a two-component dynamic gate has five states with ten transitions; the state space thus becomes unmanageable as the number of components increases. In the case of standby components, the time-averaged analytical expression for unavailability is valid only for exponential cases. To address these limitations, Monte Carlo simulation is applied here to solve the problem.

In the simulation approach, random failure/repair times are generated from each component’s failure/repair distributions. These failure/repair times are then combined in accordance with the way the components are arranged reliability-wise within the system. As explained in the previous section, the PAND gate and SPARE gate can easily be implemented through a simulation approach. The difference between the normal AND gate and the PAND and SPARE gates is that the sequence of failures has to be taken into account, and standby behavior, including testing, maintenance, and dormant failures, has to be accommodated. The unique advantage of simulation is that it can incorporate non-exponential distributions and does not require the assumption of statistical independence.

Component state time diagrams are developed as shown in Figure 3.69 for all the components in the system. For active components that are independent, only two states exist: the functioning state (up) and the repair state due to failure (down). In the present problem, class IV and sensor are active components, whereas class III is a standby component. For class III, generation of the state time diagram involves more calculations than for the active components. It has six possible states, namely: testing, preventive maintenance, corrective maintenance, standby functioning, standby failure undetected, and normal functioning to meet the demand. As testing and preventive maintenance are scheduled activities, they are deterministic and are accommodated first in the component profile. Standby failure, demand failure, and repair are random, and their values are generated according to their PDFs. The demand functionality of class III depends on the functioning of the sensor and class IV. After the state time diagrams of the sensor and class IV are generated, the down states of class IV are identified, and sensor availability at the beginning of each down state is checked to trigger class III. The reliability of class III during the down state of class IV is then checked. A Monte Carlo simulation code has been developed for implementing the station blackout studies. The unavailability obtained is 4.8826E–6 for a mission time of 10,000 h with $10^6$ simulations, which is in agreement with the analytical solution. Failure time, repair time, and unavailability distributions are shown in Figures 3.70, 3.71, and 3.72, respectively.

Figure 3.69 State time diagrams for class IV, sensor, class III, and overall system (legend: standby (available), functioning, down state)

Figure 3.70 Failure time distribution (cumulative probability versus failure time in h)

Figure 3.71 Repair time distribution (cumulative probability versus repair time in h)

Figure 3.72 Unavailability with time (unavailability versus time in h)

3.5.4.6 Case Study 2 – Reactor Regulation System of Nuclear Power Plant

The reactor regulation system (RRS) regulates reactor power in an NPP. It is a computer-based feedback control system. The regulating system is intended to control the reactor power at a set demand from $10^{-7}$ FP (full power) to 100% FP by generating control signals for adjusting the position of adjuster rods and adding poison to the moderator in order to supplement the effectiveness of the adjuster rods [20, 21]. The simplified block diagram of the RRS is shown in Figure 3.73.


Figure 3.73 Simplified block diagram of RRS (blocks: input, system A, system B, CTU A, CTU B, field actuators)

Figure 3.74 Dynamic fault tree of RRS (top event: RRS failure; intermediate events: loss of total control with A, loss of total control with B, and systems A and B failing together (SYSAB); each loss of total control combines loss of control in auto mode with loss of manual control (MANUAL) through a SEQ gate; each loss of control in auto mode combines auto switching failure (AUTO) with channel A or channel B failure (SYSA or SYSB, DPHS) through a PAND gate)

The RRS has a dual-processor hot standby configuration with two systems, A and B. All inputs (analog and digital or contact) are fed to system A as well as system B. On failure of system A or B, the control transfer unit (CTU) automatically changes over the control from system A to system B, or vice versa, provided the system to which control is transferred is healthy. Control transfer is also possible through a manual command from an external switch. This command is ineffective if the system to which control is to be transferred has been declared unhealthy. The transfer logic is implemented through the CTU. To summarize, in this computer-based system, failures need to happen in a specific sequence to be declared a system failure. A dynamic fault tree is therefore constructed for realistic reliability assessment.

Dynamic Fault Tree Modeling

The important issue in modeling is the dynamic sequence of actions involved in assessing the system failure. For the top event of the RRS, “failure of reactor regulation,” the following sequence of failures must occur:

1. Computer system A or B fails.

2. Transfer of control to the hot standby system by automatic mode, through relay switching and the CTU, fails.

3. Transfer of control to the hot standby system by manual mode, through operator intervention and hand switches, fails after the failure of auto mode.

PAND and SEQ gates are used, as shown in Figure 3.74, to model these dynamic actions. The PAND gate has two inputs, namely, auto transfer and system A/B failure. Auto transfer failure after the failure of system A/B does not have any effect, as the switching action has already taken place. The sequence gate has two inputs, one from the PAND gate and another from manual action. The chance of manual failure arises only after the failure of auto and system A/B. Manual action has four events, of which three are hand-switch failures and one is operator error. Auto has only two events, failure of the CTU and failure of the relay. System A/B has many basic events, and failure of any of these basic events leads to its failure; this is represented with an OR gate.


Exercise Problems

1. Calculate the reliability of the pumping system shown in Figure 3.75.

Figure 3.75 Pumping system

2. A simplified line diagram of an emergency core cooling system of an NPP is shown in Figure 3.76. Calculate the availability of the system using a cut-set or path-set method.

Figure 3.76 Emergency core cooling system

3. A system has 11 components. Components 1 through 7 are different and have reliabilities 0.96, 0.98, 0.97, 0.96, 0.94, 0.98, 0.99, respectively. Components 8 through 11 are the same, with a reliability of 0.9. Components 4 and 5 are critical, and each must operate for the system to function. However, only one of components 1, 2, and 3 has to be working, and the same holds for 6 and 7. At least two of the four identical components must also work. The block diagram of the system is shown in Figure 3.77; find the probability that the system survives.

Figure 3.77 Eleven-component system

4. Construct the typical state time diagram for the dynamic fault tree shown in Figure 3.61. Typical component up and down states, each gate output, and the final top event should be reflected in the diagram.

5. Construct the Markov model for the RRS shown in Figure 3.73. Assume different transition rates between various states. Derive the unavailability expression for the failure of the RRS.

References

1. Bureau of Indian Standards (2002) Analysis technique for dependability – reliability block diagram method. IS 15037, Bureau of Indian Standards, New Delhi

2. Vesely W et al (1981) Fault tree handbook. NUREG-0492, Nuclear Regulatory Commission

3. NASA (2002) Fault tree handbook with aerospace applications. NASA, Washington, DC

4. Zio E, Podofillini L, Zille V (2006) A combination of Monte Carlo simulation and cellular automata for computing the availability of complex network systems. Reliability Engineering and System Safety 91:181–190

5. Yanez J, Ormeño T, Vitoriano B (1997) A simulation approach to reliability analysis of weapon systems. European Journal of Operational Research 100:216–224

6. Taylor NP, Knight PJ, Ward DJ (2000) A model of the availability of a fusion power plant. Fusion Engineering and Design 52:363–369

7. Marseguerra M, Zio E (2000) Optimizing maintenance and repair policies via combination of genetic algorithms and Monte Carlo simulation. Reliability Engineering and System Safety 68:69–83

8. Alfares H (1999) A simulation model for determining inspection frequency. Computers and Industrial Engineering 36:685–696

9. Marseguerra M, Zio E (2004) Monte Carlo estimation of the differential importance measure: application to the protection system of a nuclear reactor. Reliability Engineering and System Safety 86:11–24

10. Zio E, Podofillini L, Levitin G (2004) Estimation of the importance measures of multi-state elements by Monte Carlo simulation. Reliability Engineering and System Safety 86:191–204

11. Karanki DR, Kushwaha HS, Verma AK, Srividya A (2007) Simulation based reliability evaluation of AC power supply system of Indian nuclear power plant. International Journal of Quality and Reliability Management 24(6):628–642

12. IAEA-TECDOC-593 (1991) Case study on the use of PSA methods: Station blackout risk at Millstone unit 3. International Atomic Energy Agency, Vienna

13. IAEA-TECDOC-478 (1988) Component reliability data for use in probabilistic safety assessment. International Atomic Energy Agency, Vienna

14. IAEA (1992) Procedure for conducting probabilistic safety assessment of nuclear power plants (level 1). Safety series no. 50-P-4, International Atomic Energy Agency, Vienna

15. Morgan MG, Henrion M (1992) Uncertainty – a guide to dealing with uncertainty in quantitative risk and policy analysis. Cambridge University Press, New York

16. McKay MD, Beckman RJ, Conover WJ (1979) A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics (American Statistical Association) 21(2):239–245

17. Dugan JB, Bavuso SJ, Boyd MA (1992) Dynamic fault-tree for fault-tolerant computer systems. IEEE Transactions on Reliability 41(3):363–376

18. Amari S, Dill G, Howald E (2003) A new approach to solve dynamic fault trees. In: Annual IEEE reliability and maintainability symposium, pp 374–379

19. Karanki DR, Rao VVSS, Kushwaha HS, Verma AK, Srividya A (2009) Dynamic fault tree analysis using Monte Carlo simulation in probabilistic safety assessment. Reliability Engineering and System Safety 94:872–883

20. Gopika V, Santosh TV, Saraf RK, Ghosh AK (2008) Integrating safety critical software system in probabilistic safety assessment. Nuclear Engineering and Design 238(9):2392–2399

21. Khobare SK, Shrikhande SV, Chandra U, Govindarajan G (1998) Reliability analysis of microcomputer circuit modules and computer-based control systems important to safety of nuclear power plants. Reliability Engineering and System Safety 59:253–258