
Safety Performance vs. Cost Analysis of Redundant Architectures Used in Safety Systems

By: Dr. Lawrence Beckman - HIMA-Americas, Inc.

There are several system architectures available for use in process safety applications. These range from single channel systems to triplicated or higher redundancy configurations. The selected architecture should first satisfy the safety requirements of the process under control, but should likewise address production and operational issues which have a definite impact on safety, i.e., false tripping of the process. As such, high availability is also an important consideration for safety systems. If frequent internal failures of the safety system force the process through an excessive number of shutdown/start-up cycles, that process is operating in its most hazardous state for unnecessarily long periods of time. Operation under these conditions should be minimized in the interest of safety.

There is considerable work in progress to establish standards and implementation guidelines, both in the USA and internationally, which match the risk inherent in a given situation to the required integrity level of the safety system. Regrettably, they are not specific to a particular type of process, but deal only with a qualitative level of risk. This paper is intended to provide background and further economic insight into this subject, and to explore some architectural alternatives available to the control or safety engineer, given a state-of-the-art approach to safety system design.

Safety System Architecture

Most safety systems in the process environment are designed to shut the process down upon detecting a hazardous state or condition. These systems are called Emergency Shutdown (ESD) Systems, and many operate in a fail-safe mode. In this mode of operation, an internal failure of the safety system will result in a shutdown of the process, and all repairs to the system are performed in the non-operating state. Fail-safe systems can be redundant, but may lack sufficient redundancy to be considered fault tolerant. As such, they are subject to false trips and are not used in applications requiring a high level of availability. Availability is defined as the percentage of time over which a system is capable of performing its intended function during a given time period.

Efforts have been made to increase the availability of dual systems by operating them in a mode where both channels must fail for the system to shut down the process. This mode of operation is called the 2-out-of-2 (2oo2) mode, as opposed to the normal 1-out-of-2 (1oo2) fail-safe mode. As such, the system continues to operate on a single channel after sustaining the first internal failure. While this configuration is certainly more available, its integrity depends heavily on comprehensive internal diagnostics.

Fault tolerant (dual or higher) architectures are capable of sustaining reliable operation in the presence of a fault, by providing additional levels of redundancy. The 1oo2D configuration is an example of a dual implementation of fault tolerant architecture. It normally operates in the 2oo2 mode, but reverts to the 1oo2 mode upon the unlikely occurrence of an unresolved fault. Diagnostic watchdogs are provided in both channels as a secondary means of de-energizing outputs. Either channel is capable of switching off its outputs, as well as those of the other channel if required. As such, it is both very safe and available. Details of this Programmable Electronic System (PES) configuration are given in the literature (2,3). After sustaining a fault, the system will continue to operate properly on its remaining channel, thus avoiding a process shutdown, and allow repairs to be performed on-line. The safety integrity of the system in the presence of a fault is not compromised for the sake of increased availability, as all internal diagnostics are fully functional on the remaining channel. As such, fault tolerant systems found in industrial applications are at minimum dual redundant, providing two independent channels of redundancy.

Another inherent advantage of some redundant architectures is the ability to communicate data between channels in order to decide which of the channels of the system is malfunctioning. This capability significantly improves the system's ability to diagnose faults and subsequently increases the safety integrity of the system. A system's ability to diagnose faults is often referred to as its diagnostic coverage, where the Coverage Factor (C) is defined as the probability of detecting a fault, given that one has occurred. Perfect coverage would imply 100% effective self-diagnostics. This is of course impossible.

After sustaining a fault, a triplicated 2-out-of-3 (2oo3) system can operate in either of two modes. If a second fault occurs before repairs can be effected on the first malfunctioning channel, the system can shut itself down immediately upon the occurrence of the second fault; or it can revert to single channel operation. For safety applications of a 2oo3 system, two channel operation is restricted to a short time interval; and single channel operation is never allowed, due to the lack of comprehensive internal diagnostics. As such, the integrity of a triplicated (TMR) system depends heavily on its ability to vote, and consequently diagnostic coverage degrades consistent with the number of operating channels.

The Coverage Factor of the various architectures will vary based on the quality of the system's internal diagnostics. Triplicated architectures rely heavily on their voting capacity to implement diagnostic coverage, and as such an operational third channel is critical. After sustaining a fault, diagnostic capability, and consequently coverage, is substantially diminished and in some instances may be nonexistent. Dual architectures typically offer superior internal diagnostics, which are capable of diagnosing the operational state of the entire system every scan cycle. This differs dramatically from other architectures, which may require 30 seconds or longer to diagnose a problem, e.g., a memory failure. Diagnostic coverage is an important consideration in evaluating covert system unavailability ($U_C$).

Levels of redundancy beyond triplicated systems are rare in the industrial environment, and are very difficult to justify economically considering cost versus incremental improvement in safety integrity. Single channel systems are definitely not recommended for critical safety applications. Please refer to the following table of PES Architectures for further clarification.



Configuration            Operating Mode    Channels Needed    Channels Needed
                                           to Operate         to Trip

1oo1                     1-0               1                  1
1oo2                     2-0               2                  1
2oo2                     2-1-0             1                  2
2oo3                     3-2-0             2                  2
1oo2D (2oo2 → 1oo2)      2-1-0             1                  1
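The voting rules behind these configurations can be illustrated with a minimal Python sketch. It assumes the usual M-out-of-N reading, in which the system trips when at least M of its N channels vote to trip; the function name `system_trips` is hypothetical and is not taken from any vendor's implementation.

```python
from typing import Sequence


def system_trips(m: int, channel_votes: Sequence[bool]) -> bool:
    """Return True when at least m channels vote to trip (M-out-of-N voting)."""
    if not 1 <= m <= len(channel_votes):
        raise ValueError("m must be between 1 and the number of channels")
    return sum(channel_votes) >= m


# A single channel votes to trip, for example after an internal failure.
single_vote_dual = [True, False]
single_vote_triple = [True, False, False]

print("1oo2 trips:", system_trips(1, single_vote_dual))
print("2oo2 trips:", system_trips(2, single_vote_dual))
print("2oo3 trips:", system_trips(2, single_vote_triple))
```

With one channel voting to trip, only the 1oo2 arrangement shuts the process down, which is why it is the most prone to false trips while 2oo2 and 2oo3 ride through a single failure.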

Safety vs. Availability

Having discussed the redundant configuration options available, let us quantify their relative performance for both safety and availability operation. The criteria are necessarily different, and will be characterized as follows:

The Safety criterion will be the Hazard Rate ($H$), which is calculated as

$$H = D \cdot U_C$$

where $D$ = Demand Rate (demands/yr)
$U_C$ = Covert (safety) Unavailability

The Covert (safety) Unavailability (also referred to as fractional dead time or probability of dangerous failure) is the probability that the system is in a failed or non-functioning state because of a covert failure. It is this condition which represents the true hazard. Not all covert failures are dangerous failures, but all are potentially dangerous. Thus, a more conservative approach would require that covert and dangerous be considered synonymous. As such, the Covert Unavailability ($U_C$) is a function of the unrevealed system failure rate ($\lambda_C$) and the proof test interval ($T_P$).
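As a small worked example of the Hazard Rate relation, the Python sketch below multiplies a demand rate by an unavailability. The inputs (0.5 demands/yr and an unavailability of 3.48x10^-3) are the figures quoted later in the paper for the 1-out-of-2 configuration; the function name is hypothetical.

```python
def hazard_rate(demand_rate_per_year: float, unavailability: float) -> float:
    """Hazard rate H = D * U, in hazardous events per year."""
    return demand_rate_per_year * unavailability


h = hazard_rate(demand_rate_per_year=0.5, unavailability=3.48e-3)
print(f"H = {h:.2e} per year")  # H = 1.74e-03 per year
```

The result agrees with the hazard rate the paper reports for that configuration.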

The Availability criterion will be the False Trip Rate ($F$). It is a function of the revealed failure rate ($\lambda_R$) and the repair time ($T_R$). Whether a given failure is revealed or unrevealed (Covert) depends upon the level of coverage provided by the system's diagnostics. The repair process likewise is heavily dependent upon the system's ability to detect a fault, as the repair time is the sum of the time to detect the fault and the time to make the repair. In a system with a low level of diagnostic coverage, the repair time will be extended to equal the proof test interval in most instances. Programmable Electronic



Common Mode Failures

Common Mode Failure occurs when a single cause affects multiple channels of a redundant system, usually resulting in complete system failure. Sources of common mode failure are environmental conditions, design errors, manufacturing errors, and operational or maintenance failures. The higher the level of redundancy, the more likely the occurrence of this type of failure. For example, a dual system (consisting of channels A and B) has only a single common mode failure possibility; while a triple redundant system (Channels A, B and C) has multiple common mode failure possibilities (AB, AC, BC and ABC). This situation is exacerbated further when multiple channels share a common hardware platform; i.e., a common I/O module, etc.
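The counting argument above can be checked with a short Python sketch that enumerates every group of two or more channels a single cause could affect. The helper name `common_mode_groups` is hypothetical.

```python
from itertools import combinations


def common_mode_groups(channels: str) -> list:
    """List every group of two or more channels that one cause could take out."""
    groups = []
    for size in range(2, len(channels) + 1):
        groups.extend("".join(group) for group in combinations(channels, size))
    return groups


print(common_mode_groups("AB"))   # ['AB']
print(common_mode_groups("ABC"))  # ['AB', 'AC', 'BC', 'ABC']
```

A dual system yields one group and a triple system yields four, matching the possibilities listed in the text.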

Common Mode Failure is typically modeled using the "beta factor" method, where beta ($\beta$) represents the percentage of total failures attributable to common mode failure; i.e.,

$$\beta = \frac{\lambda_{CM}}{\lambda_{CM} + \lambda_{NM}}$$

where total failures include both common and normal mode failure. Usually this fraction is in the range of 5-15%, but can be smaller based on operational experience. It is necessarily only a reasonable estimate. However, depending upon the importance placed on this type of failure, the resulting system reliability will be significantly altered.

Considering two of the redundant architectures discussed, the occurrence of a covert common mode failure in either the 1-out-of-2/1-out-of-2D or the 2-out-of-3 system configuration results in a fail-to-function situation. This result is the same irrespective of the architecture. However, the susceptibility is lower by a factor of three for the dual architecture. Incorporating covert common mode failure into their corresponding reliability models yields the following modified equations for Covert Unavailability ($U_C$):

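System Configuration           Covert Unavailability ($U_C$): Common Mode + Normal Mode

1-out-of-2 or 1-out-of-2D      $U_C = \frac{\lambda_{CC} T_P}{3} + \frac{\lambda_{CN}^2 T_P^2}{3}$

2-out-of-3                     $U_C = \lambda_{CC} T_P + \lambda_{CN}^2 T_P^2$

where $U_C$ = $U_C$ (Common Mode Failure) + $U_C$ (Normal Mode Failure)

$\lambda_{CC} = \beta \lambda / 2$ (assuming 50% are covert)

$\lambda_{CN} = (1 - \beta) \cdot (1 - C) \cdot \lambda$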
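To give a feel for the two covert failure rates defined above, the following Python sketch evaluates them for one set of inputs. The beta factor of 10% is a hypothetical value chosen from within the 5-15% range mentioned earlier, and the total failure rate of 6 failures/yr is an assumption inferred from the C = 0.97 and covert rate of 0.18 failures/yr used in the economic model. The function name is hypothetical.

```python
def split_covert_rates(total_rate, beta, coverage, covert_fraction=0.5):
    """Return (common mode covert rate, normal mode covert rate) per year.

    common: beta * total_rate * covert_fraction   (50% assumed covert)
    normal: (1 - beta) * (1 - coverage) * total_rate
    """
    common = beta * total_rate * covert_fraction
    normal = (1 - beta) * (1 - coverage) * total_rate
    return common, normal


# Assumed inputs: 6 failures/yr in total, beta = 10%, coverage C = 0.97.
common_rate, normal_rate = split_covert_rates(total_rate=6.0, beta=0.10, coverage=0.97)
print(f"common mode covert rate: {common_rate:.3f} per year")
print(f"normal mode covert rate: {normal_rate:.3f} per year")
```

Under these assumed inputs the common mode rate exceeds the normal mode rate, which illustrates how the choice of beta can dominate the result.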

The net effect is equivalent to placing a simplex (non-redundant) element in series with the redundant architecture for both the 1-out-of-2 and 2-out-of-3 system configurations considered. Depending upon a reasonable estimate of the beta factor and the resulting common mode dangerous failure rate, the Common Mode term can completely dominate the computation, rendering Normal Mode failures insignificant for higher levels of redundancy. In practice, this is typically not the case; and care should be taken to keep common mode failure in perspective. However, it certainly should not be ignored in critical safety evaluations. In the economic analysis that follows, common mode failures have not been included, as they are outside the scope of this paper. A comprehensive discussion is provided in the literature (2,4,5).

System Integrity

In a hazardous process environment there are typically two types of systems in operation: the control system and the safety or protective system. The two systems should be totally independent of each other.

The purpose of the safety system is to protect against the process hazard, while preventing plant shutdowns due to false trips. The safety system is typically dormant for extended periods of time and susceptible to functional failures, which are generally unrevealed failures. Given less than perfect diagnostic coverage, the internal diagnostics of today's programmable safety systems will not detect 100% of all possible failure conditions. As such, it is necessary to conduct periodic proof testing to detect such undiagnosed failures. Proof testing is not, however, a substitute for comprehensive internal diagnostics. This testing should be made as quick and simple as possible in order to give the maximum system availability, while reducing the possibility of human error. It should be possible to perform repairs while the system is operating. No advantage is gained if the safety system or the process has to be stopped to rectify any faults found.

The time interval between proof testing is of great concern, as the potential for human error while conducting the proof test is significant. Considerations which affect the choice of proof test interval are as follows:

1) System redundancy and the coverage factor of the internal diagnostics.
2) Potential for human error due to complexity of the test/repair process.
3) The time required to perform the necessary testing and repair.

During the test period, the system (or some portion thereof) is under test and unavailable to perform its intended safety function. As such, proof testing too frequently increases the unavailability of the safety system and the probability of human error. On the other hand, infrequent testing increases the risk of developing undiagnosed faults, particularly in systems with a low level of diagnostic coverage.

As stated earlier, the purpose of the proof test is to improve the reliability of the safety system. The objective is to minimize the safety unavailability of the system while conducting the required periodic testing to maintain system integrity. The selection of the optimum proof test interval based on minimizing safety unavailability is critical. Consider the following equation for total system Unavailability ($U_{TOTAL}$), including field devices:

$$U_{TOTAL} = U_C + U_T + U_{FD} + U_E$$

where $U_C$ = covert (safety) unavailability due to unrevealed system failure
$U_T$ = unavailability resulting from proof testing
$U_{FD}$ = covert (safety) unavailability due to unrevealed field device failures
$U_E$ = unavailability resulting from human error (system isolation, i.e., bypass not restored)

It is desired to minimize $U_{TOTAL}$ with respect to $T_P$ for a given configuration of the system and field devices. The resulting optimum proof test interval is $T_{PMIN}$. A derivation of this methodology can be found in Beckman (1).
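The trade-off behind the optimum proof test interval can be sketched in Python for the simplest case. This is not the derivation from Beckman (1): it keeps only two terms and assumes the textbook forms of covert unavailability (covert rate times interval, divided by two) and testing unavailability (test time divided by interval) for a single channel, ignoring field devices and human error. The inputs are the covert failure rate (0.18 failures/yr) and test time (4 hrs) quoted in the economic model; the 8760 hours and 52 weeks per year are also assumptions.

```python
import math

HOURS_PER_YEAR = 8760   # assumed conversion
WEEKS_PER_YEAR = 52     # assumed conversion


def unavailability(interval_years, covert_rate, test_time_years):
    """Assumed simplex model: covert term grows with the interval, test term shrinks."""
    return covert_rate * interval_years / 2 + test_time_years / interval_years


def optimum_interval(covert_rate, test_time_years):
    """Setting the derivative of the model above to zero gives sqrt(2 * Td / rate)."""
    return math.sqrt(2 * test_time_years / covert_rate)


test_time = 4 / HOURS_PER_YEAR
t_min = optimum_interval(covert_rate=0.18, test_time_years=test_time)
print(round(t_min * WEEKS_PER_YEAR, 1), "weeks")  # 3.7 weeks
```

Under these assumptions the result happens to agree with the 3.7 weeks the paper reports for the 1-out-of-1 configuration, although the paper's own model includes further terms.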

Testing and Repair

The safety system design should facilitate maintenance of both the safety system and associated field devices. The system itself should give the maintenance technician a clear, visual indication of the fault, so that repair can proceed with absolute certainty, thereby reducing the possibility for human error. Repair procedures should be simple and straightforward to allow fast, easy repair and keep the repair time as short as possible (low MTTR). In addition, provisions should be made to simplify the bypassing of field devices for purposes of proof testing, calibration and maintenance.

Many redundant systems based on traditional PLCs are not fully integrated, and are consequently difficult to test and repair. Redundant implementations which require the maintenance technician to diagnose complex problems, perform difficult repair procedures, or reload the application program as part of the repair process are prone to human error, and will at the least contribute to false or nuisance trips of the process. At worst, incomplete or inadequate repair could result in a catastrophic failure of the safety system. Steps should be taken to minimize the occurrence of human error during testing and maintenance of the safety system.

In determining the optimum proof test interval, consideration should also be given to the potential for human error. Under ideal conditions, the human error rate is estimated to be 1 in 100. However, most process testing conditions are far from ideal, and as such this rate will be substantially higher. Measures can be taken in the safety system design to mitigate this situation, but the potential for human error, both while conducting the testing and during the required repair, must be considered. The "Human Error" failure rate far exceeds that of other safety system components such as sensors, actuators, etc. As such, it represents the largest potential cause for operational failure of the safety system.

Economic Model

Given the above, it is now possible to construct an economic model for the system configurations of interest. This analysis will focus on the safety system itself, and as such will not include the associated field devices.

The link between Safety and Availability is becoming significantly stronger, as industry recognizes that cycling processes up and down inherently has safety implications, in addition to the cost associated with lost production. Hence, availability is now considered a key factor in safety system design and operation. Considering the above, the model includes three terms as follows:

1) Hazardous failures
2) False or Nuisance Trips
3) Periodic Proof Testing

The focus of the model is on Total Safety Cost, and consequently does not include the initial cost of system hardware, integration, programming or maintenance. One could safely assume that these costs would be in proportion to the selected level of redundancy, with triplication being the most expensive. These costs, even when amortized over the life cycle of the system, differ from one configuration to another; but are mostly fixed compared to the Total Safety Cost. The model utilizes an operating period of one (1) year, and a proof testing interval that is optimized for each configuration considered. Given these conditions, the Safety Cost model is

$$C_{TOTAL}(\$) = H \cdot C_H + F \cdot C_F + C_T / T_{PMIN}$$

where $C_{TOTAL}(\$)$ = Total Annual Safety Cost
$H$ = Hazard Rate
$F$ = False Trip Rate
$C_H$ = Hazard Cost ($)
$C_F$ = Nuisance Trip Cost ($)
$C_T$ = Proof Testing Cost ($)
$T_{PMIN}$ = Optimum Proof Test Interval

Please note that $T_{PMIN}$ is also used in computing the Safety Unavailability and consequently the Hazard Rate.

Using the following values for Coverage Factor ($C$), Covert Failure Rate ($\lambda_C$), Demand Rate ($D$), Mean Time to Repair ($T_R$), and Test Time ($T_D$), we compute the following for the configurations of interest, using the equations for Covert Unavailability and False Trip Rate given in the listing:

$C$ = 0.97, $\lambda_C$ = 0.18 failures/yr, $D$ = 0.5 demands/yr, $T_R$ = 8 hrs, $T_D$ = 4 hrs

Configuration    T_PMIN (weeks)    U_TOTAL       H (per year)    F (per year)    C_TOTAL ($)

1-out-of-1       3.7               1.38x10^-2    6.91x10^-3      5.82            322,534
1-out-of-2       14.4              3.48x10^-3    1.74x10^-3      11.64           590,103
2-out-of-2       2.6               1.91x10^-2    9.56x10^-3      0.062           47,585
2-out-of-3       10.0              4.57x10^-3    2.28x10^-3      0.186           20,855
1-out-of-2D      14.4              3.48x10^-3    1.74x10^-3      0.062           11,196
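The False Trip Rate column can be approximated with a short Python sketch. The equations used here are common first-order approximations assumed for illustration, not the paper's own listing. The revealed failure rate is taken as C times a total failure rate of 6 failures/yr, a value inferred from C = 0.97 and a covert rate of 0.18 failures/yr on the assumption that the covert rate equals (1 - C) times the total rate.

```python
HOURS_PER_YEAR = 8760  # assumed conversion


def false_trip_rates(revealed_rate, repair_time_hours):
    """Assumed first-order false trip rates (per year) for four configurations."""
    t_r = repair_time_hours / HOURS_PER_YEAR
    return {
        "1-out-of-1": revealed_rate,                  # any revealed failure trips
        "1-out-of-2": 2 * revealed_rate,              # either channel can trip
        "2-out-of-2": 2 * revealed_rate**2 * t_r,     # second failure during repair
        "2-out-of-3": 6 * revealed_rate**2 * t_r,     # any two of three channels
    }


rates = false_trip_rates(revealed_rate=0.97 * 6.0, repair_time_hours=8)
for name, rate in rates.items():
    print(f"{name}: {rate:.3f} per year")
```

Under these assumptions the values round to the 5.82, 11.64, 0.062 and 0.186 shown in the table; the table reports the same value for 1-out-of-2D as for 2-out-of-2.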

$C_{TOTAL}(\$)$ was calculated using the following associated costs (per occurrence):

Hazard Cost ($C_H$) = $500,000
Nuisance Trip Cost ($C_F$) = $50,000
Proof Testing Cost ($C_T$) = $2,000
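As a cross-check of the cost model, the Python sketch below recomputes the Total Annual Safety Cost from the tabulated hazard rate, false trip rate and proof test interval of each configuration, using the per-occurrence costs above. It assumes 52 weeks per year when converting the test interval; the function and variable names are hypothetical.

```python
WEEKS_PER_YEAR = 52  # assumed conversion


def total_safety_cost(hazard_rate, false_trip_rate, test_interval_weeks,
                      hazard_cost=500_000, trip_cost=50_000, test_cost=2_000):
    """Annual cost: H * C_H + F * C_F + C_T per proof test performed in a year."""
    tests_per_year = WEEKS_PER_YEAR / test_interval_weeks
    return (hazard_rate * hazard_cost
            + false_trip_rate * trip_cost
            + test_cost * tests_per_year)


# (interval in weeks, H per year, F per year, cost reported in the paper)
table = {
    "1-out-of-1": (3.7, 6.91e-3, 5.82, 322_534),
    "1-out-of-2": (14.4, 1.74e-3, 11.64, 590_103),
    "2-out-of-2": (2.6, 9.56e-3, 0.062, 47_585),
    "2-out-of-3": (10.0, 2.28e-3, 0.186, 20_855),
    "1-out-of-2D": (14.4, 1.74e-3, 0.062, 11_196),
}

computed = {}
for name, (weeks, h, f, reported) in table.items():
    computed[name] = total_safety_cost(h, f, weeks)
    print(f"{name}: computed {computed[name]:,.0f}, reported {reported:,}")
```

The recomputed figures land within about one percent of the reported totals; the remaining differences presumably reflect rounding in the tabulated inputs.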

The lowest cost was achieved by the 1-out-of-2D configuration, where both hazardous failures and nuisance trips were virtually eliminated. The 2-out-of-3 configuration finished second, due to increased costs associated with proof testing and nuisance trips. Note that the 1-out-of-2/1-out-of-2D configuration also had the longest proof test interval, and that the 2-out-of-2 configuration was the least safe, actually less safe than the 1-out-of-1 (simplex) configuration.

In addition, no attempt was made to comprehend the following in the model:

1) Increase in Covert Unavailability ($U_C$) due to human error associated with more frequent proof testing.
2) Increase in Demand Rate ($D$) due to more frequent false trips and process start-ups.

Including these effects would have further biased the results in favor of the higher integrity configurations.

Effects of Coverage Factors

It would now be of interest to investigate the effect of the Coverage Factor ($C$) on the economic model for the configurations of interest. We will use the same model for $C_{TOTAL}(\$)$, keeping all parameters constant with the exception of the Coverage Factor, and the resulting covert and revealed system failure rates. Based on this analysis, we compute the following dollar values for $C_{TOTAL}(\$)$ on an annual basis.

C_TOTAL ($)

Configuration    C = 0.98           C = 0.90          C = 0.75
                 lambda_C = 0.12    lambda_C = 0.6    lambda_C = 1.5

1-out-of-1       319,793            327,366           315,559
1-out-of-2       594,243            557,772           482,527
2-out-of-2       39,531             83,678            129,815
2-out-of-3       18,365             33,510            52,349
1-out-of-2D      9,400              20,435            34,376
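The pairing of coverage factors and covert failure rates in the table headings can be reproduced with a brief Python sketch. It assumes that the total failure rate is split into a revealed part (C times the total) and a covert part ((1 - C) times the total), and that the total is 6 failures/yr; that figure is inferred from the headings and is not stated explicitly in the paper.

```python
TOTAL_FAILURE_RATE = 6.0  # failures/yr, inferred assumption


def split_failure_rate(total_rate, coverage):
    """Return (revealed rate, covert rate) for a given coverage factor."""
    revealed = coverage * total_rate
    covert = (1 - coverage) * total_rate
    return revealed, covert


covert_by_coverage = {}
for coverage in (0.98, 0.97, 0.90, 0.75):
    revealed, covert = split_failure_rate(TOTAL_FAILURE_RATE, coverage)
    covert_by_coverage[coverage] = covert
    print(f"C = {coverage:.2f}: revealed {revealed:.2f}/yr, covert {covert:.2f}/yr")
```

Dropping coverage from 0.98 to 0.75 raises the covert failure rate more than twelvefold under this assumption, which is the mechanism behind the rising Hazard Rate discussed next.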

The results are interesting in that two distinct effects are observed. $C_{TOTAL}(\$)$ actually decreases as $C$ decreases for the two configurations (1-out-of-X) which are prone to false trips. Correspondingly, $C_{TOTAL}(\$)$ increases as the coverage factor decreases for the 2-out-of-X configurations and the 1-out-of-2D configuration (which are less prone to false trips), because of a dramatic increase in the Hazard Rate. As these are the most likely configurations to be utilized, a decrease in the coverage factor represents a significant increase in the probability of a hazard, and its inherent financial consequences. Please refer to Figure 1 for a summary of these results.

It is also interesting to note that a small increase in the coverage factor (which implies a corresponding decrease in the covert failure rate) resulting from more comprehensive internal diagnostics, voting, etc. will substantially reduce the Hazard Rate in all cases, and consequently the Total Safety Cost for those configurations least affected by false trips. The optimum Proof Test Interval ($T_{PMIN}$) likewise increases as the overall integrity of the system improves. This effect could also be achieved by reducing the total failure rates of the individual modules which comprise the overall system configuration.

Conclusions

An economic analysis of the safety system must comprehend both process safety and availability. A model was constructed which included Hazardous Failures, False or Nuisance Trips and lastly Periodic Proof Testing required to maintain system integrity for the system configurations of interest. Hazardous failures were computed based on the Total Safety Unavailability of the system. Human error is a significant contributor to Safety Unavailability, and steps to minimize the probability of occurrence should be employed in the integration, testing, and repair of the system.

The analysis indicates that the use of either the 1-out-of-1 or 1-out-of-2 configuration is not economically feasible, although the 1-out-of-2 configuration is quite safe. It is best suited for fail-safe applications where loss of production is not a consideration. The 2-out-of-2 configuration is the least safe, and should be utilized only where safety is not the primary consideration. The 2-out-of-3 and 1-out-of-2D configurations are economically advantaged, in that they satisfy both safety and availability requirements, thus minimizing the Total Safety Cost on an annual basis. However, the 1-out-of-2D configuration is superior in both safety performance and cost. Including common mode failure in the economic model would further reinforce this result.

The importance of having comprehensive diagnostics and consequently a high coverage factor in the safety system cannot be overemphasized. Improving coverage has an exponential effect on increasing reliability and safety system integrity, and on reducing Total Safety Cost. Proof testing should be used to complement a system's internal diagnostics, and not as a substitute for inadequate diagnostics. Frequent proof testing and complex repair procedures increase the probability of human error, and should always be avoided.

Given the above, a safety analysis should be performed both prior to design and again after installation to determine if the System achieves the Safety Integrity Level (SIL) as required by the Process Hazard Analysis (PHA). This analysis can likewise establish the proper selection of the System architecture to satisfy economic criteria, and the tangible performance of the system as regards the mitigation of hazards which can lead to significant economic, safety, and environmental consequences.

References
