AEO Guide to Reliability, Availability and Maintainability

T MU AM 06002 GU

Guide

Version 1.0

Issued date: 27 July 2015

Important Warning This document is one of a set of standards developed solely and specifically for use on public transport assets which are vested in or owned, managed, controlled, commissioned or funded by the NSW Government, a NSW Government agency or a Transport Agency (as defined in the Asset Standards Authority Charter). It is not suitable for any other purpose. You must not use or adapt it or rely upon it in any way unless you are authorised in writing to do so by a relevant NSW Government agency. If this document forms part of a contract with, or is a condition of approval by a NSW Government agency, use of the document is subject to the terms of the contract or approval. This document may not be current. Current standards are available for download from the Asset Standards Authority website at www.asa.transport.nsw.gov.au. © State of NSW through Transport for NSW

Standard governance


Owner: Manager Systems Engineering Process, Asset Standards Authority

Authoriser: Principal Manager Authorisation and Audit, Asset Standards Authority

Approver: Director, Asset Standards Authority on behalf of the ASA Configuration Control Board

Document history

Version Summary of Changes

1.0 First issue.

For queries regarding this document, please email the ASA at [email protected] or visit www.asa.transport.nsw.gov.au

Preface

The Asset Standards Authority (ASA) is an independent unit within Transport for NSW (TfNSW)

and is the network design and standards authority for defined NSW transport assets.

The ASA is responsible for developing engineering governance frameworks to support industry

delivery in the assurance of design, safety, integrity, construction, and commissioning of

transport assets for the whole asset life cycle. In order to achieve this, the ASA effectively

discharges obligations as the authority for various technical, process, and planning matters

across the asset life cycle.

The ASA collaborates with industry using stakeholder engagement activities to assist in

achieving its mission. These activities help align the ASA to broader government expectations

of making it clearer, simpler, and more attractive to do business within the NSW transport

industry, allowing the supply chain to deliver safe, efficient, and competent transport services.

The ASA develops, maintains, controls, and publishes a suite of standards and other

documentation for transport assets of TfNSW. Further, the ASA ensures that these standards

are performance-based to create opportunities for innovation and improve access to a broader

competitive supply chain.

This AEO Guide to Reliability, Availability and Maintainability has been developed by the ASA,

based on the technical processes of ISO/IEC 15288:2008, reviewed by a consultative group

containing members from TfNSW stakeholder groups, and approved by the ASA.

This guide aims to provide supplier organisations with guidance in managing engineering

activities involving systems that are required to be reliable, available and maintainable.

This guide has been approved by the ASA Configuration Control Board and is the first issue.

Table of contents



1. Introduction
2. Purpose
2.1. Scope
2.2. Application
3. Reference documents
4. Terms and definitions
5. Reliability, availability, maintainability management
5.1. Plan reliability, availability and maintainability management activities
5.2. Definition of system boundaries and assumptions
5.3. Identification of reliability, availability and maintainability requirements
5.4. Allocation of reliability, availability and maintainability requirements
5.5. Development of reliability, availability and maintainability acceptance criteria
5.6. Reliability, availability and maintainability analysis and modelling
5.7. Validation of reliability, availability and maintainability requirements
5.8. Reliability, availability and maintainability deliverables
6. Reliability, availability and maintainability tools and techniques
6.1. Reliability block diagram analysis
6.2. Failure mode, effects and criticality analysis
6.3. Fault tree analysis
6.4. Human reliability analysis
6.5. Maintenance requirements analysis
6.6. Failure recording analysis and corrective action system
Appendix A Examples of reliability block diagrams
A.1. RBD - Station public announcement
A.2. RBD - Blue light emergency station
Appendix B Examples of FMECA table - Bogie assembly
Appendix C Examples of fault tree analysis
C.1. Fault tree - Failure of electrical interlocking system
C.2. Fault tree - Exceed safe speed (ETCS)
Appendix D Examples of human error analysis
D.1. Human error analysis - Ticketing system
D.2. Human error analysis - Doors release system
Appendix E Example of maintenance requirements analysis - Station escalator
Appendix F Example of FRACAS incident report - CPU motherboard




1. Introduction

An Authorised Engineering Organisation (AEO) engaged by TfNSW to undertake engineering

activities is required to have reliability, availability and maintainability (RAM) management

arrangements in place that are relevant to the engineering services or products that the AEO

provides to TfNSW. These arrangements should enable the planning, execution, and reporting

of all RAM management activities for a system, and should be documented in a RAM management plan

and related progress report(s).

The T MU MD 00009 ST AEO Authorisation Requirements and T MU AM 06006 ST Systems

Engineering standard state mandatory requirements for RAM. This document provides

guidance on complying with the requirements stated in these standards.

This guide further elaborates the guidance described in TS 10504 AEO Guide to Engineering

Management.

AEOs should ensure that RAM management documentation meets the level required for the

complexity of engineering services provided and incorporate RAMS requirements in the design

and development of systems they are contracted to deliver.

2. Purpose

This document is intended to provide guidance to AEOs applying RAM management during

engineering specification and asset life cycle stages and activities involving systems that are

required to operate dependably.

This ensures that AEOs are able to demonstrate sufficient control over RAM-related risks. This

guidance is of particular relevance to suppliers who provide reliability-critical or safety-critical

engineering specification and design, in addition to systems engineering, integration and

maintenance services.

2.1. Scope

This document provides guidance to AEOs for reliability, availability and maintainability

management related, in particular, to the system specification, design and maintenance

services. It also provides guidance on RAM management principles, methods, techniques and

processes used to analyse and deliver RAM requirements from stakeholders including

operational, maintenance and interfacing targets. AEOs are assumed to have business-level

policies addressing quality, performance and safety.

For this guide, the term reliability, availability, maintainability and safety (RAMS) is used to

define an integrated management approach. However, this guide is limited to RAM and not the

safety element of RAMS management, as safety assurance is addressed in TS 20001 System




Safety Standard for New or Altered Assets. Refer to this document for guidance on safety

management.

The specific evidence required to demonstrate RAM management processes will depend on the

scope and nature of the work. For that reason, this document does not outline the evidence

required to be an AEO; rather, it provides an outline of the processes that AEOs need to

demonstrate.

2.2. Application

This document applies to the member Transport agencies and AEOs, and applies specifically to

the management of system and element level reliability, availability and maintainability for new

or altered NSW transport assets.

The level of application of RAM management principles should be scalable and tailored

according to the degree to which novelty or complexity is employed, the use of unique or

non-standard configurations and the associated level of safety risk.

The application of RAM analysis in support of design may be negligible or zero for some

projects where type approved products are used in standard, repeatable system configurations.

This should be reflected in contractual requirements to avoid unnecessary and excessive effort,

resources, time and cost.

The need for, and application of, RAM management have different meanings for different disciplines.

The impact of RAM management on planning and acquisition of new or altered systems and the

specific disciplines that support the system design should be understood.

T MU AM 06006 GU Systems Engineering guide provides guidance on the level of RAM

management activities required for engineering disciplines and identifies a range of typical rail

engineering projects and the level of RAMS to be applied.

3. Reference documents

The following documents are either cited in the text or may provide further information. For

dated references, only the cited edition applies. For undated references, the latest edition of the

referenced document applies.

International standards

EN 50126:1999 Railway applications - The specification and demonstration of Reliability,

Availability, Maintainability and Safety (RAMS)

EN 50128:2011 Railway applications - Software for railway control and protection

EN 60706 Maintainability of Equipment, Testability and diagnostic testing

ISO/IEC 15288:2008 Systems and software engineering - System life cycle processes




ISO/IEC 26702:2007 - Application and Management of the Systems Engineering Process

(Formerly IEEE Std 1220-2005)

IEC/TR 62380:2004 Reliability data handbook - Universal model for reliability prediction of

electronics components, PCBs and equipment

Australian standards

AS ISO 55001:2014 Asset management - Management systems - Requirements

AS IEC 60300.2-2005 Dependability management - Part 2: Guidance for dependability

programme management

AS IEC 60812:2008 Analysis techniques for system reliability - Procedure for failure mode and

effects analysis (FMEA)

AS IEC 61025-2008 Fault tree analysis (FTA)

AS IEC 61078-2008 Analysis techniques for system reliability - Reliability block diagram and

Boolean methods

AS IEC 62508-2011 Guidance on human aspects of dependability

Transport for NSW standards

T MU MD 00008 GU AEO Guide to Authorisation

T MU MD 00009 ST AEO Authorisation Requirements

T MU AM 01003 ST Development of Technical Maintenance Plans

T MU AM 01003 F1 Blank FMECA Sheet

T MU AM 06001 GU AEO Guide to Systems Architectural Design

T MU AM 06007 GU Guide to Requirements Definition and Analysis

T MU AM 01010 ST Framework for developing an Asset Spares Assessment and Strategy

T MU HF 00001 GU AEO Guide to Human Factors Integration

TS 10504 AEO Guide to Engineering Management

TS 10506 AEO Guide to Verification and Validation

TS 20001 System Safety Standard for New or Altered Assets

T MU AM 01002 MA Maintenance Requirements Analysis Manual

T MU AM 06006 ST Systems Engineering (standard)

T MU AM 06006 GU Systems Engineering (guide)




Other references

MIL-HDBK-781A, Reliability Test Methods, Plans, and Environments for Engineering

Development, Qualification and Production

MIL-HDBK-217F, Notice 2, Reliability Prediction of Electronic Equipment

MIL-STD-2155(AS), Failure Reporting, Analysis and Corrective Action System

Williams, J.C., HEART – A proposed method for achieving high reliability in process operation

by means of human factors engineering technology in Proceedings of a Symposium on the

Achievement of Reliability in Operating Plant, Safety and Reliability Society, 1985, NEC,

Birmingham

Swain, A.D. and Guttmann, H.E., Handbook of Human Reliability Analysis with Emphasis on

Nuclear Power Plant Applications. 1983, NUREG/CR-1278, USNRC

Shappell, S.A. and Wiegmann, D.A., The Human Factors Analysis and Classification System—

HFACS, February 2000, DOT/FAA/AM-00/7

Stanton, N. A., Salmon P. M. et al, Human Factors Methods A practical guide for Engineering

and Design, 2nd Edition, 2013, Ashgate, Aldershot, ISBN 978-1-4094-5754-1

ETCS Application Level 1 - Safety Analysis Part 1 - Functional Fault Tree, SUBSET-088-1 Part

1, Issue 2.3.0

Railtrack EE &CS Report, Infrastructure Risk Modelling Geographical Interlocking,

RT/S&S/IRM_FTA/11 Issue 1 January 1998

4. Terms and definitions

AEO Authorised Engineering Organisation (as defined in ASA Charter) a legal entity (which

may include a Transport Agency as applicable) to whom the ASA has issued an ASA

Authorisation

ASA Asset Standards Authority

ASA Authorisation means an authorisation issued by the ASA to a legal entity (which may

include a Transport Agency as applicable) which verifies that it has the relevant systems in

place to carry out the class of Asset life cycle work specified in the authorisation, subject to any

conditions of the authorisation. The issue of ASA Authorisation confers the status of 'authorised

engineering organisation' or AEO on the entity.

assurance a positive declaration intended to give confidence

authorisation the conferring of authority, by means of an official instruction and supported by

assessment and audit




availability the measure of the percentage of time that an item or system is available to perform

its designated function

BRS business requirements specification

compliance the state or fact of according with, or meeting, rules or standards

ETCS European train control system

failure the inability of a system or asset to perform its intended function or satisfy some

predetermined conditional attribute (for example, rail head profile or gap size)

fault tree logic diagram showing the faults of sub items, external events, or combinations

thereof, which cause a predefined, undesired event

fault tree analysis deductive analysis using fault trees

FMECA failure mode, effects and criticality analysis

FRACAS failure recording analysis and corrective action system

HEART human error assessment and reduction technique

HFACS human factors analysis and classification system

HRA human reliability analysis

MRA maintenance requirements analysis

maintainability (as defined in IEC 60050-191) the probability that a given active maintenance

action, for an item under given conditions of use can be carried out within a stated time interval

when the maintenance is performed under stated conditions and using stated procedures and

resources

MTBF Mean Time Between Failures

MTTR Mean Time to Repair

RAMS reliability, availability, maintainability and safety

RBD reliability block diagram a diagrammatic method for demonstrating the contribution of

component reliability to the success or failure of a complex system

RCIL reliability critical items list

reliability the probability that a specified item will perform a specified function, within a defined

environment, for a specified length of time

responsible a duty or obligation to satisfactorily perform or complete a task (assigned by

someone, or created by one's own promise or circumstances) that one must fulfil, and which

has a consequent penalty for failure. Responsibility can be delegated




review a method to provide assurance by a competent person that an engineering output

complies with relevant standards and specific requirements, and is safe and fit for purpose

SME subject matter expert a person assessed or recognised as having the highest level of

competence (including knowledge, skills and practical experience) in a particular field or

discipline

SRS system requirement specification

supplier a supplier of services or products. Defined as an 'applicant' until such time as it has

been granted AEO status, after which it is referred to as an AEO.

system safety the concurrent application of a systems based approach to safety engineering

and of a risk management strategy covering the identification and analysis of hazards and the

elimination, control or management of those hazards throughout the life cycle of a system or

asset

Transport Agencies Transport for NSW (and each of its divisions), Rail Corporation NSW,

Sydney Trains and NSW Trains

TfNSW Transport for New South Wales

THERP technique for human error rate prediction

Transport Assets those assets listed in Schedule 1 (of ASA Charter) which are vested in or

owned, managed, controlled, commissioned or funded by the NSW Government, a NSW

Government agency or Transport agency

5. Reliability, availability, maintainability management

T MU MD 00009 ST states the following requirement:

"The Authorised Engineering Organisation shall demonstrate that it has reliability,

availability and maintainability (RAM) management arrangements in place, relevant to

the engineering services or products provided".

T MU AM 06006 ST states the following requirement:

"A project shall implement management arrangements that define the reliability,

availability, maintainability and safety (RAMS) process, responsibilities, structure, tools

and deliverables"

The introduction of new or altered assets results in complexity and RAM implications.

Implementation decisions should be made based on trade-offs between implementation costs

and the subsequent operation and maintenance.




Consideration should be given to the total impact on the existing network, existing maintenance

activities such as safety working and additional possessions.

The introduction of new assets that simplify the network should generate RAM improvements.

However, the introduction of new assets that do not simplify the network may not generate RAM

improvements.

Application of reliability, availability and maintainability (RAM) engineering is required to ensure

optimum system effectiveness, safety and availability. RAM engineering is a whole of system

life cycle philosophy that is applied during plan, acquire, operate/maintain, and dispose stages.

RAM management activities which include planning and producing deliverables should be

carried out by suitably qualified and experienced individuals. Deliverables for RAM management

should be appropriate and sufficient such as to provide assurance to stakeholders that the

system can satisfy the high level performance targets as required. TfNSW should provide the

performance targets, for example an availability performance target of 92% on-time running of

trains.

The RAM activities that should be undertaken include, but are not limited to, the following:

plan the RAM management activities

define system boundaries and assumptions for RAM analysis

identify the system RAM requirements

allocate the requirements to elements

develop the RAM acceptance criteria

undertake RAM analysis and modelling

validate the RAM requirements

System failure recording and analysis is undertaken using a range of tools and processes.

These include, but are not limited to the following:

failure mode, effects and criticality analysis (FMECA)

reliability block diagrams

fault tree analysis

failure recording analysis and corrective action system (FRACAS)




5.1. Plan reliability, availability and maintainability management activities

T MU AM 06006 ST states the following requirements:

“A project shall consider RAMS performance and how it relates to operational

performance for novel systems early in the system life cycle, starting with

development of the operational concept definition and maintenance concept definition”

“A project shall consider sustainable operation and maintenance of the new or altered

system over the full system life cycle”

At the beginning of the project, before undertaking any asset life cycle stages and activities

related work, AEOs should prepare a RAM management plan. Depending on the level of

complexity the plan may be combined with other asset related plans to demonstrate how the

system RAM requirements will be achieved.

The RAM management plan should focus on managing RAM across the asset life cycle stages

and the activities, rules and principles that are required to be adopted, including the following:

reliability

o use of proven systems and equipment (assurance figures should be obtained)

o use of systems that are applicable to the conditions (systems proven in other countries

may not be suitable to NSW)

o human factors

o fault tolerance and graceful degradation

o the levels of redundancy designed into the system

availability

o maintenance scheduling

o service recovery

maintainability

o condition monitoring and diagnostics

o condition inspections

o obsolescence

o human factors

o resources

o access arrangements for maintenance




o maintenance scheduling

o isolation for maintenance

o preventative maintenance

o corrective maintenance

o human factors considerations for maintenance

The RAM management plan should also include details on the roles and responsibilities

required within the organisation to achieve the RAM objectives.

Where there are proposed changes to an existing system the RAM management plan should

consider the resulting impact to the system RAM from these changes. The RAM management

plan should, where practical, include an assessment of the existing system RAM and the

changes to the RAM resulting from the new or altered assets.

An example of an impact to the reliability is the addition of a platform display to an existing light

rail system. The light rail operating contract specifies a maximum of three isolations of the line

per year. The platform display system needs to be reliable enough to operate within this limitation.

An example of an impact to the availability is the relocation of maintenance depots from

multiple existing locations to a new central location. The relocation of the maintenance depots

results in additional travelling distances from the central depot to faults and an increase to the

maintenance response time.

An example of an impact to the maintainability is the addition of two extra railway running lines to

an existing double running line system. These two additional running lines alter the

maintainability of the combined services route and sub-stations adjacent to the original two

lines. These assets transition from a safe place location to a danger zone location and

additional safety procedures will be required to maintain these assets.
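
The depot relocation example above can be illustrated with a minimal sketch of how a longer maintenance response time reduces steady-state availability. The sketch assumes the common approximation that availability equals MTBF divided by the sum of MTBF and mean down time, where mean down time is taken as response time plus repair time. All names and numerical values below are hypothetical and are not taken from this guide or from any TfNSW asset.

```python
def steady_state_availability(mtbf_hours: float, mean_down_time_hours: float) -> float:
    """Approximate steady-state availability as uptime / (uptime + downtime)."""
    return mtbf_hours / (mtbf_hours + mean_down_time_hours)


# Hypothetical values, for illustration only.
MTBF_HOURS = 2000.0           # mean time between failures of the asset
REPAIR_HOURS = 2.0            # hands-on repair time once the maintainer is on site
RESPONSE_BEFORE_HOURS = 1.0   # travel time from a nearby local depot
RESPONSE_AFTER_HOURS = 2.5    # travel time from the new central depot

availability_before = steady_state_availability(
    MTBF_HOURS, RESPONSE_BEFORE_HOURS + REPAIR_HOURS
)
availability_after = steady_state_availability(
    MTBF_HOURS, RESPONSE_AFTER_HOURS + REPAIR_HOURS
)

print(f"Availability before relocation: {availability_before:.5f}")
print(f"Availability after relocation:  {availability_after:.5f}")
```

In this sketch the repair time is unchanged, so the whole reduction in availability comes from the increased response time. A real assessment would use measured failure and response data for the assets concerned.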

5.2. Definition of system boundaries and assumptions

System and element boundaries should be defined clearly and by means of defined system

architecture, before starting any RAM activities.

Assumptions may be made as a result of incomplete information in instances where programs

are large or complex. As system definition progresses, these assumptions should be either confirmed as

statements of fact or eliminated within the system design process. Clarification of these

assumptions should be sought from the asset owner (client representative).

The system architecture may need to change, based on the inability to satisfy system RAM

requirements.

Refer to T MU AM 06001 GU AEO Guide to Systems Architectural Design for more information.


5.3. Identification of reliability, availability and maintainability requirements



The asset owner (client representative) should provide, early in the asset lifecycle, the high level

system RAM objectives.

RAM requirements captured from the stakeholders should be well-defined, demonstrable and

meaningful, and should include explicit targets, to allow efficient RAM activities to be conducted. RAM

requirements should be considered in the context of their implementation cost. If the RAM

targets are very exacting then the resulting implementation cost may be very high. The RAM

requirement capture should start with the business requirement specification (BRS) and be

further refined in the system requirement specification (SRS) development process. Any

requirements which fall outside these criteria should be challenged and clarified as necessary.

Refer to T MU AM 06007 GU Guide to Requirements Definition and Analysis for more

information.

5.4. Allocation of reliability, availability and maintainability requirements

RAM requirements allocation assures that the high level BRS RAM targets are allocated

appropriately at system and element levels. Models based on reliability block diagrams and

other modelling techniques should be employed in the allocation process for novel, highly

complex systems. The allocations should be used as an aid to achieving the RAM objectives.

These system and element level RAM targets should then be converted into RAM requirements.

To ensure realistic allocation, system and element RAM requirements should be compared to

empirical data for identical or similar systems whenever possible. The empirical data should be

validated for its relevance considering factors such as the modes of operation, the operating

environment and any fine-tuning or adjustments that have been used. If allocated values are not

achievable, design options analysis across systems and elements should be performed to

reallocate system RAM requirements. The process of allocation, comparison with empirical

data, trade-offs and iteration as required should result in system and element RAM

requirements being defined.
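
The allocation and comparison steps described above can be sketched as follows. This is a minimal illustration, assuming a series system with independent elements and constant failure rates; weighted apportionment is only one of several possible allocation methods and is not prescribed by this guide. The element names, weights and failure rates are hypothetical.

```python
def equal_apportionment(system_reliability: float, n_elements: int) -> float:
    """Reliability each of n series elements needs to meet the system target."""
    return system_reliability ** (1.0 / n_elements)


def weighted_failure_rate_allocation(
    system_failure_rate: float, weights: dict[str, float]
) -> dict[str, float]:
    """Split a system failure rate budget across elements in proportion to weights."""
    total_weight = sum(weights.values())
    return {
        name: system_failure_rate * weight / total_weight
        for name, weight in weights.items()
    }


def unachievable_elements(
    allocated: dict[str, float], empirical: dict[str, float]
) -> list[str]:
    """Elements whose empirical failure rate exceeds the allocated budget."""
    return [name for name in allocated if empirical[name] > allocated[name]]


# Hypothetical system budget: 50 failures per million hours.
SYSTEM_FAILURE_RATE = 50e-6
# Hypothetical weights reflecting relative complexity of each element.
weights = {"power_supply": 1.0, "controller": 2.0, "display": 2.0}
# Hypothetical in-service failure rates for similar equipment.
empirical = {"power_supply": 8e-6, "controller": 25e-6, "display": 15e-6}

allocated = weighted_failure_rate_allocation(SYSTEM_FAILURE_RATE, weights)
needs_reallocation = unachievable_elements(allocated, empirical)

print("Allocated failure rates:", allocated)
print("Elements needing trade-off or reallocation:", needs_reallocation)
print("Equal apportionment, R_sys = 0.95 over 4 elements:",
      round(equal_apportionment(0.95, 4), 5))
```

Any element flagged by the comparison would be a candidate for the design options analysis and reallocation described above, after which the allocation would be repeated.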

The allocation of a RAM target to each system and element should be specific, measurable and

attainable, taking into account the criticality and risks involved in the design, development and

installation.

Systems and elements that are critical to performance should have RAM targets set higher than

other non-critical systems, based on the system level reliability or redundancy employed. When

allocating RAM targets, the number and complexity of the system interfaces and the extent to




which the system will be affected by external factors including the operating environment need

to be considered.

5.5. Development of reliability, availability and maintainability acceptance criteria

The acceptance criteria for RAM requirements should be agreed between the stakeholders

including the asset owner (client representative) and system developer.

These stakeholders may include representatives from the transport agencies.

The acceptance criteria should include, but are not limited to, the RAM validation principles to be applied and the tests

and analysis to be carried out for the validation. Acceptance criteria should be agreed and

documented through the requirements allocation process starting with the BRS and then the

SRS. Consideration should be given to the cost of implementing the acceptance criteria.

5.6. Reliability, availability and maintainability analysis and modelling

T MU AM 06006 ST states the following requirement:


"A project shall use RAMS modelling to appropriately support option selection and

development and preliminary system design, to ensure that the new or altered system

will meet the stated operational capability and provide value for money over the

designed system lifetime"

During the plan and acquire stages of a project, reliability predictions should be used to assess

whether the allocated RAM requirements are achievable. An iterative process of comparing

predictions with allocations, combined with trade-off studies, eventually results in an

efficient design that achieves whole of life performance targets.

Predictions combine lower level component or unit level reliability data with the operating and

environmental conditions, through reliability modelling, to estimate the integrated system

reliability. The validity of the reliability predictions is highly dependent upon the quality of

reliability data and assumptions made.

Whenever possible, reliability predictions should be based on data from similar components or

equipment already in use in service, in similar operational environments. For electronic

equipment, parts count prediction methods based on MIL-HDBK-217F Notice 2 can be used to

obtain reliability predictions. Where this is not possible, reliability data may be extrapolated from

tests or trials conducted by the supplier or manufacturer. In all cases the sources of the data

should be cited to maintain an audit trail. Suppliers of original equipment and systems should

provide evidence that they satisfy all RAM requirements and that they are suitable for the

intended application.
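
The parts count approach mentioned above can be sketched as a sum, over each part type, of quantity multiplied by a generic failure rate and a quality factor. This is a minimal illustration under the assumption of constant failure rates. The part types and all numerical values are hypothetical placeholders and are not taken from MIL-HDBK-217F; real predictions should use values from the cited data source for the applicable environment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartType:
    name: str
    quantity: int
    generic_failure_rate: float  # failures per million hours (hypothetical value)
    quality_factor: float        # dimensionless multiplier (hypothetical value)


def parts_count_failure_rate(parts: list[PartType]) -> float:
    """Equipment failure rate (per million hours) as the sum over all part types."""
    return sum(p.quantity * p.generic_failure_rate * p.quality_factor for p in parts)


# Hypothetical bill of materials for an electronic unit.
bill_of_materials = [
    PartType("resistor", 40, 0.002, 1.0),
    PartType("capacitor", 25, 0.004, 1.0),
    PartType("integrated_circuit", 6, 0.05, 2.0),
    PartType("connector", 2, 0.01, 1.0),
]

equipment_failure_rate = parts_count_failure_rate(bill_of_materials)
predicted_mtbf_hours = 1e6 / equipment_failure_rate

print(f"Predicted failure rate: {equipment_failure_rate:.3f} per million hours")
print(f"Predicted MTBF: {predicted_mtbf_hours:,.0f} hours")
```

Recording the data source against each part type, as the text above recommends, keeps the audit trail for the prediction.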




For novel, high complexity systems, reliability prediction should use reliability modelling where

practicable, such as a reliability block diagram, fault tree or computerised simulation model, to

describe the reliability behaviour of the system and the reliability data of the constituent elements.

RAM predictions are performed predominantly for the following purposes:

reliability

o to evaluate reliability performance against target risk of failure

o to identify weaknesses in a design, including single point failures

o to provide basis for a testing program

o to predict maintenance effort and cost

availability

o to evaluate outage times and service disruptions against economic, community and

quality criteria

o to identify critical subsystems and components

o to determine the need for redundant or stand-by equipment

maintainability

o to determine the most effective maintenance strategy

o to optimise maintenance facilities, diagnostic and training tools, spares holdings and

manning levels

o to assess need for condition monitoring

Reliability block diagrams (RBD) and fault tree analysis (FTA) are systematic top-down reliability

modelling and analysis techniques, and are usually best applied when introducing novel, highly

complex new or altered systems.

In addition to RAM modelling, complementary analysis techniques should be used during design

to concentrate on areas which are critical to the system reliability, such as failure mode, effects,

and criticality analysis (FMECA).

5.7. Validation of reliability, availability and maintainability requirements

Validation should include details of the validation tasks and relevant results against the RAM

acceptance criteria. Any limitations and constraints applying to the system should also be noted.

There are numerous sources of international good practice in reliability and maintainability

validation. These include MIL-HDBK-781, EN 60300-3 and EN 60706.




A RAM report including results from the analysis and verification and validation activities should

be prepared and then issued to stakeholders. Refer to TS 10506 AEO Guide to Verification and

Validation for more information. The RAM report should clearly display all verification and

validation failures against RAM acceptance criteria. Corrective action should then be

undertaken to rectify these failures. Validation and verification activities should be repeated and

the RAM report re-issued.

5.8. Reliability, availability and maintainability deliverables

The following deliverables should be produced during the RAM process:

RAM management plan including the asset life cycle stages

BRS RAM requirements with their acceptance criteria

SRS RAM requirements with their acceptance criteria

element level RAM requirements with their acceptance criteria

RAM analysis and modelling with their data

RAM report including results from the analysis, modelling, verification and validation

activities

6. Reliability, availability and maintainability tools and techniques

Careful consideration should be given to the selection of the appropriate RAM tools and

techniques used to provide RAM results. This consideration involves a critical decision as to

whether a simple calculation or a comparison with an existing system is sufficient or whether

RAM tools and techniques are required.

These tools and techniques may provide different RAM results as the system definition

progresses. These progressive RAM results should be recorded in the RAM report during the

asset life cycle stages.

Different asset types may have different approaches and tools for RAM modelling and analysis.

Communications, signalling and electrical designers may use reliability block diagram (RBD)

analysis, failure mode, effects, and criticality analysis (FMECA) or fault tree analysis (FTA)

tools, whereas bridge and structural designers may use finite element analysis (FEA) tools.

The reliability, availability and maintainability tools and techniques are explained in Section 6.1

through to Section 6.6.


6.1. Reliability block diagram analysis



AS IEC 61078 describes reliability block diagrams (RBD) as a diagrammatic analysis method for

demonstrating the contribution of component reliability to the success or failure of a complex

system.

A reliability block diagram is drawn as a series of blocks connected in parallel or series

configuration with each block representing a component of the system with an associated failure

rate. Parallel paths are redundant, meaning that all of the parallel paths must fail for the parallel

network to fail. By contrast, any failure along a serial path causes the entire serial path to fail.

Reliability block diagrams are used to calculate the reliability of each element and the

contributory effect on the reliability of the system. This assists in the identification of single

points of failure in the system.
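
The series and parallel rules described above can be sketched as follows. This is a minimal illustration that assumes independent block failures and uses steady-state availability calculated as MTBF / (MTBF + MTTR); the function names and the block values are hypothetical and are not taken from the diagrams in Appendix A.

```python
import math


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability of a single block."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def series(*values):
    """Serial path: every block must work, so the block values are multiplied."""
    return math.prod(values)


def parallel(*values):
    """Redundant paths: the network fails only if every path fails."""
    return 1.0 - math.prod(1.0 - v for v in values)


# Hypothetical system: one controller in series with two redundant comms paths.
controller = availability(mtbf_hours=100_000, mttr_hours=4)
comms_path = availability(mtbf_hours=50_000, mttr_hours=1)
system = series(controller, parallel(comms_path, comms_path))
print(f"System availability: {system:.6f}")
```

In this arrangement the controller is the single point of failure, because it is the only block that is not duplicated on a parallel path.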

Examples of where a reliability block diagram would be used are for the development of a

station announcement system and a blue light emergency station provided in Appendix A.

6.2. Failure mode, effects and criticality analysis

AS IEC 60812 describes failure mode, effects and criticality analysis (FMECA) as a 'bottom up'

analysis method that is used to understand failure modes and their escalation effect, both at a

local and a system wide level. This method requires the system design to be well defined down

to unit level.

Each system is broken down into its elements, usually down to line replaceable unit level where

each element is then analysed uniquely to identify functional failures and relevant modes of

failure, and their escalated effect on the next higher level of the system.

This process is employed to identify those elements of a system which have a significant impact

on system reliability, availability and safety. This analysis is further used to promote mitigation

measures leading to improved system reliability and availability.

FMECA is typically used for high level analysis of system reliability through the following

process:

identification of failure modes and consequences, and facilitation of design modifications

assessment of failure causation, performance limits and vulnerability issues

classification of failure modes relative to the severity of their effects

An output of the FMECA should be a reliability critical items list (RCIL). This is a list of items

which have at least one failure mode classified as critical according to its criticality analysis.
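
As a rough sketch of how an RCIL could be derived from FMECA records, the example below flags an item when at least one of its failure modes is classified as critical. The severity and likelihood scales, the multiplication rule and the threshold are assumptions made for illustration; a project would use the criticality scheme agreed in its RAM management plan. The item and failure mode names echo Appendix B, but the rankings are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    item: str
    mode: str
    severity: int    # 1 (negligible) to 4 (catastrophic); hypothetical scale
    likelihood: int  # 1 (improbable) to 5 (frequent); hypothetical scale


def is_critical(failure_mode, threshold=12):
    """Assumed rule: critical when severity * likelihood reaches the threshold."""
    return failure_mode.severity * failure_mode.likelihood >= threshold


def reliability_critical_items(failure_modes, threshold=12):
    """Return the items that have at least one failure mode classified as critical."""
    return sorted({fm.item for fm in failure_modes if is_critical(fm, threshold)})


# Rankings are invented for illustration only.
modes = [
    FailureMode("Bogie frame", "Wheel bearing unsecured", severity=4, likelihood=3),
    FailureMode("Brake assembly", "Reduced braking force", severity=3, likelihood=3),
    FailureMode("Wheel bearing", "Bearing friction increase", severity=3, likelihood=4),
]

rcil = reliability_critical_items(modes)
print(rcil)
```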




Consideration should also be given to common-mode failure where an event causes multiple

systems to fail. For example an explosion in a room causes both transformers in the room to fail

at the same time.

An example of where a FMECA would be used is for the development of a bogie system for a

train provided in Appendix B.

Refer to T MU AM 01003 F1 Blank FMECA Sheet for further details.

6.3. Fault tree analysis

Fault tree analysis should be used for highly complex or safety or reliability critical systems.

Fault tree analysis should be done during the initial stage of the project and updated as more

details become available during subsequent stages of the project.

AS IEC 61025 describes fault tree analysis as a top down deductive failure analysis. An

undesired state of a system is analysed using Boolean logic to combine a series of lower-level

events. This analysis method is used to determine the probability of a safety accident or a

particular system level (functional) failure.

The basic symbols used in fault tree analysis are grouped as events, gates, and transfer

symbols. Event symbols are used for primary events and intermediate events. Primary events

are not further developed on the fault tree. Intermediate events are found at the output of a

gate. Events in a fault tree are associated with statistical probabilities. Gate symbols describe

the relationship between input and output events. The gate symbols are derived from Boolean

logic symbols. Transfer symbols are used to connect the inputs and outputs of related fault

trees, such as the fault tree of a subsystem to its system.

Fault tree analysis incorporates the following phases:

definition of the undesired system top event to analyse

obtaining an understanding of the system functional breakdown

construction of the Boolean fault tree from top event down to base events

assignment of failure rates to the base events

evaluation of the fault tree

control of the hazards identified

Examples of where fault tree analysis would be used are the risk of a rail vehicle collision and

exceed safe speed (ETCS). The contributory factors that lead to these system top events are

provided in Appendix C.
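
The evaluation phase can be illustrated with a minimal sketch of AND and OR gates. It assumes that the input events are statistically independent, which does not hold where common-mode failures are present, and the event names and probabilities are hypothetical rather than taken from the fault trees in Appendix C.

```python
import math


def and_gate(*probabilities):
    """Output event occurs only if all input events occur (independent inputs)."""
    return math.prod(probabilities)


def or_gate(*probabilities):
    """Output event occurs if at least one input event occurs (independent inputs)."""
    return 1.0 - math.prod(1.0 - p for p in probabilities)


# Hypothetical base event probabilities per demand.
driver_fails_to_brake = 1e-2
protection_equipment_fails = 1e-4
trackside_data_incorrect = 1e-5

# Hypothetical top event: the driver does not brake AND the protection
# function is lost for either of two reasons.
protection_lost = or_gate(protection_equipment_fails, trackside_data_incorrect)
top_event = and_gate(driver_fails_to_brake, protection_lost)
print(f"Top event probability: {top_event:.3e}")
```

Larger trees are evaluated in the same way, working from the base events up through each intermediate event to the top event.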




6.4. Human reliability analysis


"A project shall consider human reliability factors as part of the overall reliability of the

system"

The purpose of conducting human reliability analysis (HRA) is to ensure that the actual

performance of the system is in line with its designed requirements. Humans are an integral part

of designed systems, playing important roles in operation, accident prevention and

maintenance activities.

Operators and maintainers should be trained and competent; however ‘trained and competent

people’ is not a way of preventing human error. Human error is a normal part of human

performance, and should be appropriately assessed to create resilient systems. Early,

appropriate HRA is essential to ensure the exploration of the appropriate hierarchy of controls.

Delayed or ineffective assessments tend to create dependencies on administrative risk control

which can create latent system weaknesses.

Therefore, analysing and predicting the reliability of a system without assessing human

reliability may result in an over estimation of system performance.

Although there are many ways in which a human can positively impact on system performance,

the focus within a RAM assessment is usually to identify the following:

human errors that may impact on the RAM of the system

mitigation measures to reduce likelihood of human errors or to reduce impact of these

errors on the system

These measures can relate to the design of the equipment or the task, or may warrant

additional redundancy or diversity to be incorporated within the overall system design.

In order to be able to identify the errors that can be made and what their likely effect on the

performance of the system would be, it is necessary to identify at least the following:

the tasks that are required to be carried out by operators and maintainers

the likely conditions under which those tasks will be performed

the potential errors that could be made

As with many other aspects of the design in the early stages, information may be at a relatively

high level and should be used to identify those areas of the system where a more detailed

assessment is of most value.

There are a number of methods available for identifying human errors ranging from utilising past

experience through to the application of structured processes based on guidewords or




checklists. Human error should be built into existing analysis techniques such as FMECA or

FTA.

In those cases where a quantitative assessment is required, techniques for human error rate

prediction may be employed for evaluating the probability of a human error occurring and

impacting on system performance. This should then be incorporated into the system models to

assess the impact on the overall system performance.

Techniques to evaluate the probability of human errors fall into the following three general

categories:

the use of screening data

the use of historical or subjective data

the use of human error databases

Examples of where a human reliability analysis would be used are for a ticketing system and a

door release system provided in Appendix D.

Refer to T MU HF 00001 GU AEO Guide to Human Factors Integration for more information on

HRA.

6.4.1. Screening data

A single screening value for human error within a system model may be used in the early

stages of an assessment. This enables an organisation to identify where the system is

particularly vulnerable to human error and to review the design in terms of the level of

redundancy or diversity that is currently built in, or to identify whether a more detailed

assessment may be required.

6.4.2. Historical or subjective data

Actual performance data, if available, may be used as an estimate within reliability models. This

data is normally only available at the system level and does not specifically highlight the human

error contribution. However, it is estimated that approximately 70% - 90% of failures are due to

human error and so it is possible to factor the data in this way to obtain a more reliable

estimate.

Note: Manufacturer's data generally does not include human errors and so it will be

indicative of performance based on 100% reliability of people.

Alternatively, subjective data may be sought by consulting users for their opinions

and may be used to modify existing data.




6.4.3. Human error databases

A number of techniques are used for quantitative human error assessments where it is possible

to look up a generic human error probability and then modify it according to the specific task.

Commonly used examples include the following:

Human Error Assessment and Reduction Technique (HEART)

The HEART method is based upon the principle that every time a task is performed there is a

possibility of failure and that the probability of failure is affected by one or more error

producing conditions to varying degrees. Error producing conditions include topics such as

training, poor procedures, poor system feedback and so on.

Factors which have a significant effect on performance are of greatest interest. These

conditions are applied to a ‘best-case-scenario’ estimate of the failure probability under

ideal conditions to then obtain a final error probability. By forcing consideration of the error

producing conditions potentially affecting a given procedure, the application of HEART also

enables the user to identify a range of potential improvements to system performance.

An example of where HEART would be used is the assessment of a critical maintenance

task.

Technique for Human Error Rate Prediction (THERP)

THERP models human error probabilities using an event tree approach, in a similar way to

an engineering risk assessment, but also considers performance shaping factors that may

influence these probabilities. The probabilities for the human reliability analysis event tree,

which is the primary tool for assessment, are nominally calculated from historic databases,

local data including simulated data or from accident reports. The resultant tree portrays a

step by step account of the stages involved in a task in a logical order. The technique is

described as a total human reliability assessment methodology as it simultaneously

manages a number of different activities including task analysis, error identification and

human error quantification.
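
The HEART calculation described above can be sketched as follows. In the commonly published form of the method, the nominal error probability for the task is multiplied, for each error producing condition, by ((maximum multiplier - 1) x assessed proportion of affect + 1). The nominal probability, multipliers and proportions used below are hypothetical values chosen for illustration and are not taken from the HEART tables.

```python
def heart_error_probability(nominal_probability, conditions):
    """Apply error producing conditions to a nominal human error probability.

    conditions is an iterable of (max_multiplier, assessed_proportion) pairs,
    where assessed_proportion is the analyst's judgement between 0 and 1.
    """
    probability = nominal_probability
    for max_multiplier, proportion in conditions:
        if not 0.0 <= proportion <= 1.0:
            raise ValueError("assessed proportion must be between 0 and 1")
        probability *= (max_multiplier - 1.0) * proportion + 1.0
    # A probability cannot exceed 1, so the result is capped.
    return min(probability, 1.0)


# Hypothetical critical maintenance task with two error producing conditions.
assessed = heart_error_probability(
    nominal_probability=0.003,
    conditions=[(11.0, 0.4), (3.0, 0.5)],
)
print(f"Assessed human error probability: {assessed:.3f}")
```

Because each condition contributes its own factor, the conditions with the largest factors indicate where improvements to the task or equipment are likely to have the most effect.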

6.5. Maintenance requirements analysis

Maintenance requirements analysis (MRA) is the inclusion of reliability, availability and safety

integrity as a part of the maintenance requirements of the system.

Maintenance requirements analysis applies reliability theory principles within a structured

process designed to identify effective maintenance and inspection tasks that would detect or

delay failures of equipment. This ensures that maintenance requirements are incorporated into

the design activities.




The following elements are inherent in maintenance requirements analysis:

identify the maintenance item

Identify the items to be maintained at system, element, assembly, unit, component level as

part of the asset breakdown structure.

establish the function

Identify all functions associated with the maintenance item.

establish failure modes and effects

Identify and analyse all possible failures to or deviations from the specified functionality

associated with the maintenance item. Analyse their escalation effects from component

level to unit level to assembly level to subsystem level to system level.

recognise failure

Identify the means by which each failure is detected and communicated to the maintainer.

identify maintenance task options

Identify how each maintenance item should be repaired or replaced (both preventative and

corrective maintenance tasks).

establish maintenance task intervals

Identify a maintenance program which includes the schedule of inspection or replacement

for all maintenance items.

An example of where maintenance requirements analysis would be used is for the development

of a station escalator provided in Appendix E.

Refer to T MU AM 01003 ST Development of Technical Maintenance Plans and

T MU AM 01010 ST Framework for developing an Asset Spares Assessment and Strategy for

further details.
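
One way to keep the elements listed above together is a simple record per maintenance item and function, with a check for elements that have not yet been completed. The sketch below is an assumed structure for illustration only; the field names are hypothetical and the example values are taken from the tracks row in Appendix E.

```python
from dataclasses import dataclass, field, fields


@dataclass
class MaintenanceRequirement:
    item: str
    function: str
    failure_mode: str
    effect: str
    escalation_effect: str
    failure_recognition: str
    task_options: list = field(default_factory=list)
    task_interval: str = ""


def missing_elements(record):
    """Names of the analysis elements that are still empty for this record."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


tracks = MaintenanceRequirement(
    item="Tracks",
    function="Provides running surface for the steps",
    failure_mode="Unable to provide running surface for the steps",
    effect="Steps unable to move",
    escalation_effect="Passengers unable to ride on escalator",
    failure_recognition="Testing",
    task_options=["Lubricate tracks", "Replace tracks"],
    task_interval="Daily testing",
)

print(missing_elements(tracks))
```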

6.6. Failure recording analysis and corrective action system

A failure recording analysis and corrective action system (FRACAS) should be applied from that

point in the design cycle at which a version of the product or service approximating the final

operational version becomes available until the product or service is decommissioned.

The FRACAS is a closed loop process incorporating data reporting, collecting, recording,

analysing, investigating and timely corrective action for all failure incidents. The objective of the

system is to aid design, identify corrective action tasks and evaluate test results in order to

provide confidence in the results of the safety analysis activities in addition to the correct

operation of the safety features.




The effectiveness of FRACAS is dependent upon accurate input data in the form of reports

which should document all the conditions relating to the incident.

Incident reviews should be undertaken to ensure that the impact on the safety and reliability

characteristics of the product or service is quickly assessed, with any corrective actions

requiring design changes quickly approved.

The FRACAS process is outlined as follows and is illustrated in Figure 1:

an incident report is raised and recorded in a database

a data search is carried out for related events

the incident is reviewed - if the incident is a new hazard it is recorded as such in the hazard

log

information concerning the incident is communicated to those that need to know, in order to

control risk

corrective actions are recommended, as necessary

if no corrective action is required, the database is updated and the process ends

the corrective action is authorised and implemented then assessed for success

if the corrective action is unsuccessful, the incident is re-reviewed, corrective actions are

modified as required, details are updated in the database and the action returns for further

authorisation to proceed

if the corrective action is successful, the database is updated and the process ends

An example of where FRACAS would be used is the development of a CPU motherboard

provided in Appendix F.
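
The closed loop described in the steps above can be sketched as a simple function that records each step for one incident. This is an illustrative model only: the function name, its parameters and the use of a list of outcomes to stand in for the assessment of each corrective action are assumptions, not part of any FRACAS tool.

```python
def run_fracas(new_hazard, action_needed, attempt_outcomes):
    """Return the ordered list of steps taken for one incident.

    attempt_outcomes is a list of booleans, one per corrective action attempt,
    where True means the action was assessed as successful.
    """
    log = [
        "incident report raised and recorded in database",
        "data search carried out for related events",
        "incident reviewed",
    ]
    if new_hazard:
        log.append("new hazard recorded in hazard log")
    log.append("information communicated to control risk")
    if not action_needed:
        log.append("database updated, process ends")
        return log
    for attempt, successful in enumerate(attempt_outcomes, start=1):
        log.append(f"corrective action authorised and implemented (attempt {attempt})")
        if successful:
            log.append("database updated, process ends")
            return log
        log.append("incident re-reviewed, corrective action modified, database updated")
    # The loop stays open until a corrective action is assessed as successful.
    log.append("incident remains open pending further authorisation")
    return log


for step in run_fracas(new_hazard=True, action_needed=True, attempt_outcomes=[False, True]):
    print(step)
```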


(Flowchart: Incident raised and recorded; Search for related events; Review incident; Communicate information as necessary; Corrective action necessary? If no, Update database. If yes, Authorise, implement and assess plan; Corrective action successful? If yes, Update database. If no, return to Review incident.)

Figure 1 - FRACAS process






Appendix A Examples of reliability block diagrams


Figure 2 and Figure 3 provide examples of reliability block diagrams for a station public announcement system and a blue light emergency station, respectively.

A.1. RBD - Station public announcement

(Reliability block diagram with blocks for: MP50 Call Station, PCAS Workstation, VIPA HOST VAR 4, VIPET, P1 Ethernet Switch, Switch OS6450-24, Network Fibre, STM 64 Port, P.MUX AMD II, Matrix Enhanced, Service board, Amplifier Module, V400 Amplifier Mainframe and Loudspeakers. Each block is annotated with its MTBF and MTTR in hours.)

Figure 2 - RBD station public announcement (sample fragment)



A.2. RBD - Blue light emergency station


(Reliability block diagram with blocks for: Emergency Push Button with Key Reset, Input Relay, Comms Module, Output Relay, Alarm Module, Relays, Blue Light, 240V AC, Display and UPS. Each block is annotated with its MTBF and MTTR in hours.)

Figure 3 - RBD blue light emergency station (sample fragment)




Appendix B Examples of FMECA table - Bogie assembly


Table 1 provides an example of an FMECA table for a bogie assembly.

Table 1 - FMECA table (sample fragment)

General

Analysts: Fred Person, Joe Smith
Item/Assembly Name: Bogie Assembly
Drawing No: BOG-123
Part No: GOB-457-631
Functional Description: Provides the interfacing and suspension of the train body to the track
MTBF (hrs): 100,000
Task Type: P - Preventative, S - Surveillance, C - Corrective

Function

1 Secure wheel bearings and wheel set
2 Maintain wheels to track gauge
3 Provide 1kN braking force per wheel set
4 Permit low friction axle rotation

Function | Part | Failure Mode | Cause of Failure | Local Effect | % | Failure Rate | Tasks | Type | Period | Latitude | Insp | Comments
Secure wheel bearings and wheel set | Bogie frame | Wheel bearing unsecured | Bearing cradle cracked | Wheel bearing vibration | 100 | 1/1000 | Wheel bearing cradle inspection | S | 6 month | | |
Maintain wheels to track gauge | Wheel set | Wheel to track gauge lost | Axle cracks | Wheel gauge mismatch to track gauge | 100 | 1/5000 | Wheel set gauge inspection | S | 6 month | | |
Provide 1kN braking force per wheel set | Brake assembly | Reduced braking force | Brake pads worn beyond limits | Reduced braking on bogie set | 100 | 1/5000 | Brake pad thickness inspection | S | 6 month | | |
Permit low friction axle rotation | Wheel Bearing | Bearing friction increase | Metal fatigue & bearing seizure | Wheel bearing overheat | 100 | 1/10000 | Bearing inspection and replacement | C | 3 month | | |




Appendix C Examples of fault tree analysis


Figure 4 and Figure 5 provide examples of fault tree analysis for failure of electrical interlocking system and exceed safe speed, respectively.

C.1. Fault tree – Failure of electrical interlocking system

Figure 4 - Fault tree failure of electrical interlocking system (sample fragment from Railtrack EE&CS Report)





C.2. Fault tree - Exceed safe speed (ETCS)

Figure 5 - Fault tree for exceeding safe speed (sample fragment from ETCS Application Level 1 - Functional Fault Tree)




Appendix D Examples of human error analysis


Table 2 and Table 3 provide examples of human error analysis for ticketing system and doors release system, respectively.

D.1. Human error analysis - Ticketing system

Table 2 – Example of human error analysis for ticketing system

Task | Error | Mitigation
Select ticket type at ticket vending machine | Incorrect ticket type selected | Machine buttons labelled with the various ticket types; Machine visual display showing ticket type selected
Select destination at ticket vending machine | Incorrect destination selected | Machine buttons labelled with the various destinations; Machine visual display showing destinations
Enter coins into ticket vending machine | Coins inserted into the notes reader | Ticket vending machine has coin slot labelled
Enter notes into ticket vending machine | Notes inserted into the coins slot | Ticket vending machine has the notes reader labelled
Enter notes into ticket vending machine | Notes inserted upside down or back to front | Machine labelled with a diagram showing the correct note orientations
Transport the ticket | Ticket bent in transit | “Do not bend this ticket” marked on the ticket; Ticket made from flexible plastic to avoid damage
Transport the ticket | Ticket dropped or crushed | Ticket size allows ticket to be placed in a wallet or purse
Insert ticket into ticket reader | Ticket inserted upside down | “Travel Card” marked on upside of the ticket
Insert ticket into ticket reader | Ticket inserted back to front | Direction arrow marked on the upside of the ticket




D.2. Human error analysis – Doors release system


Table 3 – Example of human error analysis for doors release system

Task | Error | Mitigation
Locate doors release button | Button not located | Button labelled with doors release; Button illuminated with green lights
Press doors release button | Button not pressed | Button labelled with doors release; Button illuminated with green lights
Travel on train | Button pressed accidentally | Button recessed to avoid accidental presses; Button must be pressed for 3 seconds to activate
Travel on train | Button obscured by passengers | Door labelled requesting passengers to stand clear
Travel on train | Button damaged by passengers | Button recessed to avoid accidental contact; Button made from material that can withstand high impacts




Appendix E Example of maintenance requirements analysis - Station escalator


Table 4 provides an example of maintenance requirements analysis for station escalator.

Table 4 – Example of maintenance requirements analysis for station escalator

Maintenance item | Functions associated | Possible failure modes | Effect | Escalation effect | Failure recognition | Maintenance task options | Maintenance task intervals
Platforms | Entry and exit access | Unable to provide entry or exit access | Platform blocked to passengers | Passengers unable to use the escalator | Visual inspections for damage; Testing | Replace platform panels | Daily cleaning, inspection and testing; 6 monthly service inspection and testing
Steps | Support standing or walking passengers | Unable to support passengers | Steps not safe for passengers | Passengers unable to use the escalator | Testing | Replace steps | Daily cleaning, inspection and testing
Tracks | Provides running surface for the steps | Unable to provide running surface for the steps | Steps unable to move | Passengers unable to ride on escalator | Testing | Lubricate tracks; Replace tracks | Daily testing
Tracks | Provides running surface for the handrails | Unable to provide running surface for the handrails | Handrails unable to move | Passengers unable to use handrail for support | Testing | Lubricate tracks; Replace tracks | Daily testing
Drive gears | Provides coupling and speed conversion of the motor to the steps | Unable to provide coupling and speed conversion of the motor to the steps | No or slow movement of steps | | Testing | Lubricate gears; Replace gears | Daily testing
Drive gears | Provides coupling and speed conversion of the motor to handrails | Unable to provide coupling and speed conversion of the motor to handrails | No or slow movement of handrails | Passengers unable to use handrail for support | Testing | Lubricate gears; Replace gears | Daily testing
Hand rails | Provides support and stability to passengers | Handrails unable to support passengers | No handrails | Passengers unable to use handrail for support | Testing | Replace handrails | Daily cleaning, inspection and testing
Motors | Provides driving force for handrails and steps | Unable to drive the handrails or steps | No or slow movement of steps or handrails | | Testing | Replace motors | Daily testing
Control system | Regulates speed of steps and handrails | Unable to drive the handrails or steps | No or slow movement of steps or handrails | | Testing | Replace motors | Daily testing
Emergency stop system | Halts movement of steps and handrails in an emergency situation | Unable to halt steps and handrails | Steps and handrail movement | Passenger injuries | Testing | Replace components | Daily testing
Glass screens | Protects passengers from moving components | Unable to protect passengers from moving components | Exposed moving parts | Passenger injuries | | Replace components | Daily inspection
Glass screens | Protects passengers from falling | Unable to protect passengers from falling | Passengers fall off steps | Passenger injuries | Visual inspections for damage | Replace components | Daily inspection




Appendix F Example of FRACAS incident report – CPU motherboard

Table 5 provides an example of a FRACAS incident report for a CPU motherboard.

Table 5 - Example of FRACAS incident report for CPU motherboard

FRACAS No. | Company name | System name | Part description | Part number | Opened by | Opened date | Closed date
F-0001 | Cyberdyne Systems | T-101 | CPU Core | CPU-XXX-001 | John Connor | 13/05/2015 | 30/06/2015

Description of problem | Description of failure analysis | Description of corrective action taken | Remarks
Central Computer Motherboard failure | Visual inspection revealed solder bridging address bus tracks on central computer motherboard, leading to CPU core failure | Automate component placement and soldering process to reduce human error; Resin-coat PCB prior to component placement and soldering to reduce solder splash impact; Increase lot inspection frequency of motherboard PCB manufacturing process | Corrective actions to be implemented in next scheduled production run on 01 July 2015
