

[IEEE 2005 IEEE International Engineering Management Conference, 2005. - St. John's, Newfoundland & Labrador, Canada (Sept. 11-13, 2005)] Proceedings. 2005 IEEE International

Abstract: This paper introduces the usage of the Fault Tree Analysis (FTA) for the risk assessment of a project. It utilizes this technique, well known to engineers, to obtain quantitative measures of risks that can be encountered during project execution. In order to apply this technique, representation of a project as an engineering system is introduced in this paper. The method is generic and allows modeling of any type of risk but is illustrated with a delay of the project completion. The efficiency of the new method is analyzed using computational system efficiency theory based on functional – reliability measures approach.

I. INTRODUCTION

Uncertainty and risk analysis is not new; however, as a tool in business it has historically been of limited use. This is surprising considering that many business decisions are based on a figure that has been calculated from analysis of some kind.

FTA was introduced nearly 40 years ago and has been extensively developed since then. Its main field of application has always been complicated engineering systems, e.g. nuclear plants and airplanes [1, 2], whose failure could potentially lead to catastrophic consequences.

The purpose of this paper is to show how FTA can be used for project risk analysis in order to expand existing methods, offering new qualities and features. The method permits a simultaneous analysis of various risks associated with a project. This is in contrast with common practice, where the various risks are usually assessed separately and a judgment is made playing one risk against another. The FTA approach can also be used to assess a risk in a series of related projects. It is shown in the paper that the method is computationally efficient when several risk factors need to be evaluated simultaneously.

II. RISK ASSESSMENT TECHNIQUES

In order to assess various risks associated with a given project, a systematic approach is required to estimate the likelihood and the consequences of the threats facing the basic activities of the project. Once this is done, a method is required that would bring all the constituent risks together to provide an overall risk index. Several risk evaluation techniques have been promoted in the project management literature [3 – 7].

¹ Marcin Krysinski is with Alfa – Zeta Co. Ltd., Lodz, Poland.
² George Anders is with Kinectrics Inc., Toronto, Canada.

III. PROJECT AS A SYSTEM

A system is a set of interrelated elements, where all elements are interconnected. In engineering systems, one can find different types of components (e.g. valves, switches, connectors, actuators, computers, machines, etc.) connected together to perform a particular task. By analogy, a project can be seen as a system of interconnected tangible and/or intangible components.

When treating a project as a system one must be aware of the following limitations:

1. A project should be regarded as a system that is designed for a single use. This fact has the following consequences:
a. The design phase is based on experience and historical data. One may only perform simulations and hope that the risk management would include and predict the most probable risk factors.
b. A normal system can be stopped in case of failure, redesigned and run again. This cannot happen in a project, so risk analysis and management should create a contingency plan in case of such a failure.
c. A purely statistical approach towards project failure has to be applied with caution.

2. A project is a system that operates for a given, strictly defined period of time.

IV. FTA USAGE FOR PROJECTS

Application of this method allows assessing various risks facing a project in a systematic and accurate manner and leads to a proper management of these risks.

Every project is started with a promise of a successful completion. Project management aims at conducting all actions that support this promise, and this includes risk analysis to prevent project failure.

Every project can be described by means of three variables:

• Budget, or the amount of money spent for the completion of the project.
• Timeframe, or the amount of time used for completion of the project.
• Quality, or an abstract issue that allows the assessment of whether the project goals and targets have been reached.

The first two are easy to measure and are also interdependent. In many cases, the completion time is an inverse function of the amount of money spent.

A final success of a project must be measured according to a criterion that should be settled before the project is started.

Fault Tree Analysis in a Project Context

Marcin Krysinski¹, George Anders², Fellow IEEE

0-7803-9139-X/05/$20.00 ©2005 IEEE. 710


In order to judge objectively if the project has been completed properly, we should establish clear, measurable and global criteria.

The most difficult task is to define the quality of the final results and here, in the majority of cases, only approximate, descriptive methods can be used. In the case of projects that prepare a product or service, the client can define the criteria of success.

A Fault Tree is a network of logic gates connecting so-called ‘initiating events’. The aim of FTA is to predict how these events propagate into project failures. An important phase of this analysis is the estimation of the probability of occurrence of basic events. Since people perform the majority of the tasks in projects, quantification of the failure of project components is tightly connected with human activities.

There are several methods or models that, after some modifications, could be used for the evaluation of the probability of failure of human tasks. Two methods, the technique for human error rate prediction (THERP) and performance shaping factors (PSFs), are especially suitable for this task. Both, as applied to the project management field, are described in [8].

In addition to human error, several other probabilities may be required in a project risk analysis. Evaluation of probabilities associated with uncertain events is widely discussed in the literature and, for the brevity of this presentation, we will assume that they are known.

In order to assess project risk, we will need to discover dependencies between project tasks and how they may lead to the project failure.

The fault tree method traces the top event of project failure down to the basic events / basic tasks. The fault tree can then be analyzed using probabilistic methods. The result of this analysis will be the probability of occurrence of the top event and this should give us information useful in making modifications to the project to minimize the risks.

The FT analysis includes three steps:

• definition of top events to be analyzed,
• construction of a FT as detailed as possible – aiming at having single tasks or basic tasks at the bottom,
• calculation and analysis of the probability of occurrence of the top event and suggestions of changes to the project logic / structure which may improve its reliability.
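The third step can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes that basic events are statistically independent, and the function names `and_gate` and `or_gate` as well as all numerical probabilities are hypothetical values chosen only for the example.

```python
from math import prod


def and_gate(probabilities):
    """Probability that all input events occur (independence assumed)."""
    return prod(probabilities)


def or_gate(probabilities):
    """Probability that at least one input event occurs (independence assumed)."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical probabilities that each of four sequential tasks is delayed.
task_delay = [0.10, 0.05, 0.20, 0.02]

# Simplest tree: every task delay feeds one OR gate, so any delay fails the project.
p_chain = or_gate(task_delay)

# With compensation: a delay of the first task reaches the top event only if
# an attempt to recover the lost time also fails (hypothetical probability 0.30).
p_recovery_fails = 0.30
p_first_task_effective = and_gate([task_delay[0], p_recovery_fails])
p_compensated = or_gate([p_first_task_effective] + task_delay[1:])

print(f"top event, no compensation:   {p_chain:.4f}")
print(f"top event, with compensation: {p_compensated:.4f}")
```

Comparing the two printed values shows how a change to the project logic (here, one added AND gate) lowers the probability of the top event, which is the kind of feedback the third step is meant to provide.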

V. METHOD OF SEARCHING FOR TOP EVENTS

Top events are those which are directly derived from failures and accidents. In the case of projects, we may distinguish three categories of top events. They are presented in Table 1.

While the first two categories are fairly simple to define and general rules can easily be established, the last one can become very complex.

TABLE 1: Top Events for Projects.

Category of top events | Description
Time related | Delays with respect to the fixed schedule of a project or any of its milestones if their completion time is important. Reasons for delays can be found by means of a FT analysis.
Budget related | Changes in expenditures, which result in an increase of total project cost or cost of any of the project phases. Reasons for these changes may be found by means of a FT analysis.
Performance related | Abstract category, which includes all events not related to the previous two. Difficult to measure and difficult to define in a general way; includes everything which may result in the project failure as a result of bad performance or bad quality of deliverables.

There are two reasons why it may be difficult to find a performance related top event:

• It may be difficult to measure the performance. If incomplete documents are created that describe a given problem, give incomplete solutions or conclusions, it is difficult to judge the quality of the proposed solutions. This is often the case with financial audits, forecasting or any other business activity, which cannot be verified at the time it is presented.
• Performance related top events are specific to a given project. It is possible to define a specification of requirements at the beginning of the project, but they only rarely can be generalized.

Methods for identification of top events vary between industries and projects.

Two approaches can be used to identify top events:

(a) General evaluation, which assumes existence of historical data, thorough knowledge of the project and other similar projects. Information from several experts is compiled in order to create a list of possible top events. These evaluation results can be used to create checklists, which allow application of these results for future projects from the same industry or related to similar activities.

(b) Formal methods that may include:
- preliminary hazard analysis (PHA)
- failure mode and effect analysis (FMEA)
- hazard and operability study (HAZOPS)
- master logic diagrams (MLD)

It should be noted that project acceptance criteria must be very precisely defined. Any deviation of a project’s results is a very good starting point for finding top events. Additionally, these definitions can be used for the final assessment of the project's performance.

VI. EXAMPLE OF CONSTRUCTION OF A FAULT TREE

There are two basic mathematical operations that will be performed in this work using the fault trees. The first one represents calculation of the probability of a simultaneous occurrence of two random events and is represented by an AND gate as illustrated in Figure 1 for two basic events B1 and B2.

Figure 1. Gated AND fault tree.

The second case involves calculation of a probability of a sum of two random events and is represented by an OR gate as illustrated in Figure 2.

Figure 2. Gated OR fault tree.

In order to illustrate how a fault tree is constructed, we will limit our attention to the case where the failure is defined as a delay of the project and by this delay we understand its completion after the scheduled date.

The easiest and the most straightforward FT is the one that includes all events on a critical path “Task n is delayed beyond its slack” under one OR gate. If we assume that every such delay implies a delay of a complete project, we can compare the reasoning behind the construction of a fault tree in this case to a failure analysis of a chain without any additional safety features – the system fails when one or more links fail.

An example of such a FT for a sequence of four tasks is presented in Figure 3.

Figure 3. Fault Tree for the most basic approach to Project analysis.

Another phase would require identification of those events which may help to compensate for the delay in one of the tasks. An example would be providing a backup power supply, which can be used in case of failure of a primary one. It is impossible to find a general rule for finding such events, but some contingency techniques could be helpful.

In our example, we can assume that if Task 1 is delayed, it is possible to speed up the completion of the subsequent Task 2 so that the project is not delayed. The modified FT is presented in Figure 4.

We can also apply another method that can be used to compensate for the delay within a task. We can perform a more detailed analysis and try to represent the task as a sequence of sub-tasks, so that a delay inside the project can be solved by means of an internal reorganization, introduction of more resources or other methods.

In general, every action or event that can help to compensate for the lost time can be used in the analysis. It should be noted that it is easier to undertake some actions if the tasks are at the beginning of the project rather than at the end.

At the end, the probability of occurrence of the top event can be calculated as a function of basic events probabilities using common logic gates equations [1].

Figure 4. Fault Tree which includes a possibility of compensation of a delay in one of the project’s tasks.

VII. EFFICIENCY ANALYSIS OF A CALCULATION SYSTEM BASED ON FTA

The computing model defined in the previous part of this paper is different from other models used to analyze projects’ risk. It is possible to prove that it is more efficient than the others. The proof is based on appropriate functional measures of reliability. Using a computer–man system analysis proposed by Zamojski in [9] it is possible to define failures of a computing system resulting from errors in calculation algorithms and operator (human) interventions.

A functional-reliability model of a system considering human-computer interactions combines parameters of hardware, software, tasks realized by the system and human errors and their impact on functional and reliable properties of the system. A classification of the system faults (failures and malfunctions) and system renewals (technical – repairs - and information renewals) can be considered. Functional-reliability measures of the computer-man system can be defined. The model allows an analysis of the impact of software and human faults on the availability coefficient of the computer-man system.

The availability coefficient of the computation system can be defined using a standard availability measure

$$A_{CS} = \frac{MTBF_{CS}}{MTBF_{CS} + MTSR_{CS}} \qquad (1)$$

where:
$MTBF_{CS}$ - mean time between failures (failures and malfunctions) of the system,
$MTSR_{CS}$ - average restoration time.
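As a quick numerical illustration of equation (1), the sketch below evaluates the availability coefficient for one pair of inputs. The function name `availability` and the figures of 190 hours between failures and 10 hours of restoration are hypothetical and are not taken from the paper.

```python
def availability(mtbf, mtsr):
    """Availability coefficient of equation (1): MTBF / (MTBF + MTSR)."""
    if mtbf < 0 or mtsr < 0 or (mtbf + mtsr) == 0:
        raise ValueError("times must be non-negative and not both zero")
    return mtbf / (mtbf + mtsr)


# Hypothetical system: 190 h mean time between failures, 10 h mean restoration.
a_cs = availability(190.0, 10.0)
print(f"A_CS = {a_cs:.2f}")
```

Both inputs must be expressed in the same time unit; the result is dimensionless and lies between 0 and 1.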

An examination of efficiency properties of computing models requires replacement of traditional measures of reliability with functional measures of reliability, which also include a consideration of the influence of “autonomic” system reactions to upcoming and detected failures.

A reason for malfunction of a computing system can be permanent failures (failures in short) or non – permanent failures (malfunctions).

It is assumed that all computing system problems, both failures and malfunctions, are detected in one of two modes: on-line or off-line. Off-line mode is understood as detection of a failure during the testing phase, that is, during the verification performed before the system is used.

It is assumed that all system elements can be sources of failures. In particular these can be:

- algorithms – very often, because modern algorithms are very complicated and in many cases not sufficiently tested,

- person – playing a role of a system operator, software programmer or a decision maker (a person who is performing calculations or using the results of the analysis)

After occurrence of a failure or a malfunction, a calculation system is submitted to a process of information restoration understood as a complete restoration of the process (tasks or their parts) destroyed as a result of an occurrence of a failure.

In some special cases (e.g., for systems with on-line monitoring) information restoration may include the necessity of repeating a few or even all tasks performed before the occurrence of a failure.

With the above definitions and some additional assumptions not limiting the generality of the analysis [11], the computing system availability given by equation (1) can be written as [8]:

$$A_{CS} = \frac{1}{\alpha k_T + \beta k_A + \chi k_m + \delta} \qquad (2)$$

where
$k_T \in [0,1]$ - off-line failure detection coefficient (0 – all failures are detected, 1 – no failure is detected),
$k_A \in [0,1]$ - coefficient of on-line detection of failure due to an algorithm (0 – all failures are detected, 1 – no failure is detected),
$k_m \in [0,1]$ - coefficient of on-line detection of failure due to human operator errors (0 – all failures are detected, 1 – no failure is detected),
$\alpha, \beta, \chi, \delta$ - coefficients reflecting failure rate and the restoration time, which are specific to a project and independent of the analysis method.

The $A_{CS}$ coefficient can have values ranging from 0 to 1, where:
$A_{CS} = 1$ means that a system is always ready to perform its tasks,
$A_{CS} = 0$ means a system that is incapable of performing its tasks.

Equation (2) can be used to compare the efficiency of the calculations using two alternative systems; one based on the FTA and the other using a classical approach. If we assume that the results of the analysis coming from both systems are almost the same, the most important difference between these systems is the elimination in the FTA method of the manual interaction between various calculations needed for the risk evaluation in a project. Table 2 summarizes some of these actions.

TABLE 2: Computing system failures list eliminated by means of construction of a new FTA – based system.

Description of failures / operations requiring ‘manual’ entry
- Omission of important project risk factors.
- Neglect of influence of some factors on a Project.
- Incomplete analysis of consequences of some failure factors.
- Omission of reciprocal interactions between risk factors of different types.
- Lack of a complete relation analysis between initiating events and total Project failure risk.

As a result, not only is the probability of error while entering input data for the next computing module smaller, but the computing time is also significantly reduced.

In comparison to alternative analysis methods, the FTA – based method defines a very formal algorithm based on simple rules. Every step of the analysis can be very clearly represented in a graphical form. At the same time, the final project failure probability is fairly simple to obtain (although for a large project the calculation can be very time consuming). Thanks to its universality, the method allows the analysis of all aspects of project risk.

The level of complexity of the calculation method influences the value of the coefficient $k_A$ describing detection of distortions of the analysis resulting from algorithm errors. A simple algorithm allows avoiding most malfunctions. Thus we can say that:

$$k_A^{*} < k_A \qquad (3)$$

where $k_A^{*}$ is the value of the coefficient of on-line detection of failures caused by an algorithm for the FTA-based computing system and $k_A$ is the value of the coefficient of on-line detection of failure caused by an algorithm for alternative computation methods.

The presented algorithm is at the same time formalized and easy to record. As a result, a human operator conducting FTA-based analysis will make errors with smaller probabilities in comparison to traditional risk analysis methods. Therefore:

$$k_m^{*} < k_m \qquad (4)$$

where:
$k_m^{*}$ - value of the coefficient of on-line detection of failure caused by human operator errors following the FTA-based method of analysis,
$k_m$ - value of the coefficient of on-line detection of failure caused by human operator errors for alternative analysis methods.

Assuming that the average restoration times depend on the level of complexity of the system and, as such, are independent of the method of analysis, substituting (3) and (4) into (2) we can observe that the system availability coefficient for the FTA-based method will be larger than the same coefficient for systems based on alternative methods.
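The substitution step can be spelled out as follows. This is a sketch under stated assumptions: equation (2) is taken to have the form $A_{CS} = 1/(\alpha k_T + \beta k_A + \chi k_m + \delta)$, the coefficients $\beta$ and $\chi$ are positive, both denominators are positive, and $k_T$, $\alpha$, $\beta$, $\chi$, $\delta$ are the same for both systems. The symbols $A_{CS}^{*}$, $D$ and $D^{*}$ are introduced here only for illustration.

```latex
\begin{align*}
D^{*} &= \alpha k_T + \beta k_A^{*} + \chi k_m^{*} + \delta
        && \text{(FTA-based system)} \\
D     &= \alpha k_T + \beta k_A + \chi k_m + \delta
        && \text{(alternative system)} \\
D - D^{*} &= \beta\,(k_A - k_A^{*}) + \chi\,(k_m - k_m^{*}) > 0
        && \text{by (3) and (4)} \\
A_{CS}^{*} &= \frac{1}{D^{*}} > \frac{1}{D} = A_{CS}
        && \text{since } 0 < D^{*} < D
\end{align*}
```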

VIII. CONCLUSIONS

The method of project risk analysis based on FTA presented in this paper has several important features that may help to design a better project with a reduced risk. In comparison with other risk analysis and management methods, it offers a very interesting augmentation of solutions and encourages the investigation of several aspects of the project. These include:

1. Logical analysis of the project tasks and their interdependencies: FTA, in order to be performed correctly, requires an in-depth knowledge and understanding of the project under consideration. It helps to check not only the correctness of the logical structure of the tasks and their completeness but also to improve the clarity of the project documentation and description.

2. Quantification of the probabilities of the basic events: in order to model the project failure, it is necessary to estimate the probabilities of occurrence of the basic events. At the same time, such an exercise gives a possibility to consider factors that may reduce the probability of failure of basic events.

3. Mathematical analysis: allows calculation of a reliable value of the probability of project failure and this gives a good basis for the minimization of the project risk.

4. Contingency analysis: expansion of a simple AND–gate in a Fault Tree into a more complicated, multi-level AND/OR gated Fault Tree requires finding events or methods which may help to avoid failure of the project in case of failure of a single basic task. This adds a new perspective to project design and analysis since it helps to discover how project tasks are linked together.

5. Multi – perspective approach: can be applied to every aspect of project failure / success. In this paper, we limited our investigation to the time delay, budget and performance; however, the same method can be applied to simultaneous analysis of other risk factors.

6. Analytical approach: helps to present a project as a set of interconnected components which may depend on each other and may contribute in various ways to project success / failure. These methods help to find the links between the basic tasks / events and a global project failure.

Its efficiency has been proven by using a novel approach applying a functional measure of computational reliability.

IX. BIBLIOGRAPHY

[1] H. Kumamoto, E.J. Henley, Probabilistic Risk Assessment and Management for Engineers and Scientists, IEEE Press, Piscataway, NJ, 1996.

[2] H. Kerzner, Project Management: A Systems Approach to Planning, Scheduling and Control, J. Wiley and Sons, New York, 2001.

[3] A Guide to the Project Management Body of Knowledge, Project Management Institute, Inc., Newtown Square, Pennsylvania, 2000.

[4] F.T. Anbari, Quantitative Methods for Project Management, International Institute for Learning, Inc., New York, 1997.

[5] G. Anders, Probability Concepts in Electric Power Systems, J. Wiley and Sons, New York, 1990.

[6] C. Pritchard, Risk Management: Concepts and Guidance, ESI International, Arlington, Virginia, 2001.

[7] J. Schuyler, Risk and Decision Analysis in Projects, Project Management Institute, Inc., Newtown Square, Pennsylvania, 2001.

[8] M. Krysinski, Probabilistic Risk Assessment Method applied for Project Risk Analysis, Ph.D. thesis, Technical University of Lodz, Poland, 2005.

[9] W. Zamojski, Model funkcjonalno – niezawodnościowy systemu komputer – człowiek [in Polish], in Inżynieria komputerowa, ed. W. Zamojski, WKŁ, Warszawa, 2005.

[10] W. Zamojski, Zagadnienia eksploatacji maszyn [in Polish], 1985, vol. 20, z. 2, s. 317 - 327.

[11] A. Fox, D. Patterson, Self – repairing computers, Scientific American, May 2003.
