130604 reliability data analytics, adams

Reliability & Data Analytics-Viewpoints, Tips, Examples-

Tim C. Adams, NASA Kennedy Space Center

[email protected]


Career Notes

Tim Adams is a Senior Engineer in the Systems Engineering and Integration Division with the NASA John F. Kennedy Space Center’s (KSC) Engineering and Technology Directorate. Tim serves as a technical lead and resource in Reliability Engineering, the Technical Editor of the “KSC Reliability” web page, and the Manager of KSC’s Integrated Design and Assurance System.

At NASA, Tim has 19 years of “hands-on” experience in Reliability Engineering and Technical Risk Analysis involving both flight systems and ground systems for the Space Shuttle, International Space Station, and Constellation Programs. Prior to KSC, he was the Lead of the Office of Safety, Reliability, and Quality Assurance’s Analysis and Assessment Methodology Group at NASA Johnson Space Center.

Selected work products in Reliability Engineering and Technical Risk include reliability goals, trending problem history and measuring relative risk, predicting reliability (or availability) for a variety of components and subsystems, and developing a Center-wide resource for “learning and doing” engineering-assurance analyses.

Selected special assignments and roles in Reliability Engineering and Technical Risk include Reliability and Maintainability Technical Discipline Team Lead for the Agency for the NASA Safety Center, Reliability Engineering Consultant for the Centers for Disease Control and Prevention (CDC), and Team Lead and Principal Technical Risk Engineer for a multi-NASA Center effort that resolved the debate to disposition a degraded critical system.

Prior to NASA, Tim was a Product Manager/Market Analyst for an international manufacturer of oil tools and an Application/Industrial Engineer for an international manufacturer of computer numerical controlled (CNC) machine tools. In addition, Tim was a Director/Operations Manager for a municipality and utilities company that included roles as an Emergency Response Officer and Project Manager for the community’s first 911 public-emergency system.

Tim is a Certified Reliability Engineer (CRE) with continuous re-certifications since 1994 and a senior member with the American Society for Quality (ASQ). His formal education is in Mathematics, Education, and Management.


Table of Contents

Topic Page

Presentation Objectives 4

The Management Process 5

Reliability is a System Operating Outcome 6

Reliability is an Enabling Function 7

Reliability Definition and Related Concepts 8

Reliability Efforts and Methods 9 & 10

Basic Math for Probabilistic Reliability 11

Thinking Analytically: A Process 12

Involving Management with Analytics 13

Example 1 – Making a Risk Analysis 14 - 16

Reacting to Reactive Situations 17

Example 2 – Providing Reliability to Operations 18 - 21

Example 3 – Integrated and Accessible Analytical Tools 22 - 28

Example 4 – Providing Reliability to Design Engineering 29 - 31

Example 5 – Sometimes Statistics will not do the Job 32 - 37

Uncertainty – Describing the “Goodness” of a Point Estimate 38

Summary 39

NASA, Tim C Adams 3

Objectives

In the area of quantitative Reliability Engineering

and Risk Assessment, share:

Viewpoints

Tips, Techniques, and Tools

Examples from various NASA programs

Show how the Reliability discipline relates to and

has synergy with other disciplines such as

Management, Systems Engineering, Safety, Quality,

and Risk.

Note: The live presentation of this content adds storytelling (case

method) to encourage discussion and embrace expertise from

participants.


Viewpoint The management process


Tip Trace your organization’s Return on Net Assets (RONA) to your Reliability effort. Is it

more than maintenance? Reliability excellence uses all portions of an organization.

~~ Reliability Management, An Overview, EQE International, 2000

[Figure: the management process. Elements shown: Vision, Mission, Core Values, Goals, Objectives, Project Plans or Work Processes, Actions, Measures, Evaluations, and Feedback. Captions: “Say what we do” and “Do what we say.”]

Viewpoint Reliability is an operating outcome

A system operating outcome is a non-physical

characteristic (e.g., safe, dependable, low cost) and not a

physical characteristic (e.g., size, weight, flow rate).

Operating outcomes are inferred. An inference goes

beyond the known data.

Operating outcomes are the essence of Mission Assurance.

Though abstract (e.g., not observable on an engineering

drawing), the Reliability function needs to have and use its

own management process as well as be part of the

organization’s management process.

For more on mission assurance and operating outcomes,

see http://kscsma.ksc.nasa.gov/Reliability/Documents/Mission_Assurance.pdf


Viewpoint Reliability is an enabling function

Three types of jobs (NASA examples), namely

Doers (Astronaut, Flight Design Engineer, Subsystem Engineer, Flight

Controller, Vehicle and Ground Processing Specialist)

Enablers (Systems Engineer, Reliability Engineer, Safety Engineer, IT

Engineer, Contracting Officer, Attorney)

Managers (Project Manager, Director, Supervisor)

Reliability, Maintainability, Availability (RMA) Engineers are

enablers for Design Engineering and Production/Operations.

Why RMA (or just R&M) and not RAM? RMA is the sequence

used to design a system with this type of system operating

outcome. Thus, A is a function of R and M; denoted A = f(R,M).

RMA Engineers partner with systems engineers and safety

engineers.
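
As a minimal sketch of A = f(R,M), steady-state (inherent) availability is commonly computed from a mean time between failures (the R part) and a mean time to repair (the M part). The function name and the numbers below are illustrative assumptions, not KSC values.

```python
def inherent_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return steady-state availability = MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical item: fails on average every 500 h and takes 5 h to repair.
availability = inherent_availability(mtbf_hours=500.0, mttr_hours=5.0)
print(f"Availability: {availability:.4f}")
```

Improving either input raises A: a longer MTBF is a Reliability gain, and a shorter MTTR is a Maintainability gain.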


Viewpoint Reliability and related concepts

Classic definition: Reliability is the probability (likelihood) an item will

perform its intended or required functions (mission) with no downtime

(e.g., maintenance and repair activities) during a given period of time

(mission time) under specified operating conditions (environment).

Emerging definition: Reliability is the ability of an item to perform…etc.

The focus is on achievement rather than probability. (Ref. Reliability Physics)

Reliability + Unreliability = R + U = Prob(success) + Prob(failure) = 1.

Thus, Likelihood in Risk = 1 – Reliability measure.

It is Availability and not Reliability that addresses and characterizes

quantitatively both uptime and downtime.

For more on Availability, see

http://kscsma.ksc.nasa.gov/Reliability/Availability.html

If Quality is the degree of fulfillment of customer expectations, then

Reliability is Quality as a function of time.


Viewpoint Reliability efforts and methods

In a lead role (strong enabler) the Reliability effort is called

Reliability Engineering; partners with Design, Systems, and

Safety Engineering; and uses both

Qualitative (Big Q) techniques

Quantitative (Little q) techniques.

In a non-lead role (weak enabler), Reliability is subordinate

to Maintenance and Safety Engineering and uses selected

Big Q and Little q techniques.

For more on Reliability techniques, see
http://kscsma.ksc.nasa.gov/Reliability/Documents/Reliability_Discipline_Overview_of_Methods.pdf


Viewpoint Crossroads in reliability math

Probabilism is the doctrine that probability is an adequate basis for belief and action, since certainty in knowledge cannot be attained.

Determinism is the doctrine that every event, act, and decision is the inevitable consequence of antecedents (past events) that are independent of the human will.

Classical Reliability uses probabilistics; Reliability Physics (or Physics of Failure) uses deterministics.

The reliability measure from Reliability Physics is from materials, design, and environment and not from the statistical treatment of items that failed or did not fail during test, operation, or both.


Viewpoint Basic math for probabilistic reliability

For occurrences with a constant rate: Use the Cumulative Binomial distribution

(demand or event based) or the Cumulative Poisson distribution (time based).
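
A minimal sketch of the two cumulative distributions named above, using only the Python standard library. The function names and the input values are assumptions chosen for illustration.

```python
import math


def binomial_cdf(k: int, n: int, p: float) -> float:
    """P(at most k occurrences in n independent demands), per-demand probability p."""
    return sum(math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def poisson_cdf(k: int, rate: float, duration: float) -> float:
    """P(at most k occurrences in `duration`), given a constant occurrence rate."""
    mu = rate * duration  # expected number of occurrences
    return sum(math.exp(-mu) * mu**i / math.factorial(i) for i in range(k + 1))


# Hypothetical inputs: 1% failure probability per demand; 0.001 failures per hour.
p_demand = binomial_cdf(k=1, n=20, p=0.01)
p_time = poisson_cdf(k=0, rate=0.001, duration=100.0)
print(f"P(at most 1 failure in 20 demands) = {p_demand:.4f}")
print(f"P(no failure in 100 h) = {p_time:.4f}")
```

With k = 0, the Poisson case reduces to the familiar exponential reliability exp(-rate × time).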

For non-repairable items that fail: Use the Weibull distribution for each failure

mode. This distribution models data sets (containing both failure and non-failure

data) having a decreasing, constant, or increasing failure (hazard) rate over time.
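
The sketch below evaluates a two-parameter Weibull model to show how the shape parameter selects a decreasing, constant, or increasing hazard rate. It only evaluates the model; fitting the parameters to failure and non-failure (censored) data is not shown. The parameter values are hypothetical.

```python
import math


def weibull_reliability(t: float, beta: float, eta: float) -> float:
    """R(t) = exp(-(t/eta)**beta) for shape beta and scale (characteristic life) eta."""
    return math.exp(-((t / eta) ** beta))


def weibull_hazard(t: float, beta: float, eta: float) -> float:
    """h(t) = (beta/eta) * (t/eta)**(beta - 1), evaluated here for t > 0 only."""
    if t <= 0:
        raise ValueError("t must be positive")
    return (beta / eta) * (t / eta) ** (beta - 1.0)


ETA = 1000.0  # hypothetical characteristic life, hours
for beta in (0.5, 1.0, 2.0):  # decreasing, constant, increasing hazard rate
    early = weibull_hazard(100.0, beta, ETA)
    late = weibull_hazard(900.0, beta, ETA)
    r_500 = weibull_reliability(500.0, beta, ETA)
    print(f"beta={beta}: h(100)={early:.2e}, h(900)={late:.2e}, R(500)={r_500:.3f}")
```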

For repairable items that are down: Ideally, first, use the Laplace Test to measure

with statistical confidence the trend of failures over time. Second, if the Laplace

Test score is around zero, repairs can be assumed to be good as new. Thus,

reordering the data to fit a Weibull distribution is mathematically acceptable.
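
A minimal sketch of the Laplace Test score, assuming a time-truncated observation window from 0 to T and failure times measured from the start of that window. The function name and data are hypothetical. Under no trend the score is approximately standard normal, so values near zero support the good-as-new assumption, positive values suggest failures clustering late (deterioration), and negative values suggest improvement.

```python
import math


def laplace_score(failure_times: list[float], observation_end: float) -> float:
    """Laplace trend statistic for failure times observed over the window (0, T]."""
    n = len(failure_times)
    if n == 0:
        raise ValueError("at least one failure time is required")
    mean_time = sum(failure_times) / n
    return (mean_time - observation_end / 2.0) / (
        observation_end * math.sqrt(1.0 / (12.0 * n))
    )


# Hypothetical failure times (hours) over a 100-hour window.
no_trend = laplace_score([10.0, 50.0, 90.0], observation_end=100.0)
late_cluster = laplace_score([70.0, 80.0, 90.0], observation_end=100.0)
print(f"Evenly spread failures: {no_trend:.2f}")
print(f"Late-clustered failures: {late_cluster:.2f}")
```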

For more on the above, see the “Tools Sections” at the “KSC Reliability” web page.

Given stress (load) and strength (capacity) data: Use Stress-Strength Interference.

In this case, Reliability is probability the item survives the application of the load.
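
As an illustration of Stress-Strength Interference, the sketch below assumes that stress and strength are independent and normally distributed, which is only one of several possible distribution choices. The function name and values are hypothetical.

```python
import math
from statistics import NormalDist


def interference_reliability(
    strength_mean: float, strength_sd: float, stress_mean: float, stress_sd: float
) -> float:
    """P(strength > stress) for independent, normally distributed strength and stress."""
    margin_sd = math.sqrt(strength_sd**2 + stress_sd**2)
    return NormalDist().cdf((strength_mean - stress_mean) / margin_sd)


# Hypothetical capacity of 100 +/- 5 units against a load of 80 +/- 5 units.
reliability = interference_reliability(100.0, 5.0, 80.0, 5.0)
print(f"Reliability: {reliability:.5f}")
```

The result depends on both the gap between the means and the combined spread, so reducing variability can raise Reliability as effectively as adding margin.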

To quantify the dimensions (axes) of a risk scenario: In a risk matrix, a risk

scenario’s likelihood (probability) of occurrence axis can be one minus Reliability,

and the consequence (impact) axis can be (for example) a scaled Hazard Analysis.
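
The sketch below shows one way to place a scenario in a 5x5 risk matrix using likelihood = 1 - Reliability. The bin edges and the consequence scale are hypothetical assumptions; a real program would take them from its own risk management plan.

```python
import bisect

# Hypothetical upper edges for likelihood levels 1-4; above the last edge is level 5.
LIKELIHOOD_EDGES = [0.001, 0.01, 0.1, 0.5]


def likelihood_level(reliability: float) -> int:
    """Map Reliability to a 1-5 likelihood level using likelihood = 1 - Reliability."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return bisect.bisect_right(LIKELIHOOD_EDGES, 1.0 - reliability) + 1


def risk_cell(reliability: float, consequence_level: int) -> tuple[int, int]:
    """Return (likelihood level, consequence level) for a 5x5 risk matrix."""
    if consequence_level not in range(1, 6):
        raise ValueError("consequence_level must be 1 through 5")
    return likelihood_level(reliability), consequence_level


# Hypothetical scenario: Reliability of 0.95 with a scaled consequence of 4.
print(risk_cell(reliability=0.95, consequence_level=4))
```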


Viewpoint “Think about your thinking”

My analytical process uses COP (or POC), i.e., C→O→P.

Concept:   think   | aim     | purpose | before

Operation: do      | fire    | action  | during

Product:   produce | confirm | outcome | after

Why COP? It helps me to police myself by asking core

questions, divide and conquer, and focus. When answers

to the COP questions are “yes” or “yes, but…” as opposed

to “yes” or “no,” options and possibilities remain available.

“Wisdom begins in wonder.”~~ Socrates

“People do not resist change; they resist being changed.”~~ Peter M. Senge


Tip Practice COP with the decision maker

Why? With analytical work, there are advantages when you (are able to) communicate with the decision maker early in the analytical process.

Why? Action-oriented managers do not like the lengthy formal decision-making process. Use techniques to yield immediate results; later use more sophisticated methods.
~~ R.E.D. Woolsey & H. Swanson, Operations Research for Immediate Application: A Quick and Dirty Manual

Also, with regard to management support and user involvement, “…only 40% of projects suggested by quantitative analysts were ever implemented. But 70% of the quantitative projects initiated by users, and fully 98% of projects suggested by top managers, were implemented.”
~~ Barry Render & Ralph M. Stair Jr., Quantitative Analysis for Management, 6th edition


Example 1 System (Asset) Level

For ISS Crew Health Care Maintenance System

“GOAL or SCOPE: Construct a [quantitative/

analytical] model of the various exercise devices

needed for crew health maintenance onboard ISS

[International Space Station] and use this model to

characterize the relative risk to the crew and ISS

mission for durations (mission times) of 200, 180,

and on down to 60 or less days. Provide a means of

assessing the risk of no on-orbit spare parts for our

equipment and its potential impact on the mission.”


Example 1 Making a Risk Analysis-Actual process

1. Obtain customer requirements (what, when, and “fictional or notional” product, the “initial” P of COP).

2. Clarify jargon and “possible” strategy (how) for applying the Reliability discipline (C of COP for team members and customer).

3. Identify, sequence, and assign tasks to make a “draft” project plan (O of the COP).

4. Obtain system knowledge (e.g., requirements, behavior, structure, and parametrics).

5. Obtain and organize uptime and downtime data for each subsystem. Provide a template and sample, if needed.

6. Provide a “quick” product. In this example, for each subsystem and at the system level, the dates for downtime events were converted to a bar chart and Laplace Test score. Identify poke outs (significant problem areas), if any.

7. Identify scenarios or “central questions.” A scenario asks what is the likelihood (probability) a particular configuration of the system (and an associated consequence) will occur. Thus, a scenario becomes the rule to assign portions of the uptime and downtime data to a particular system model. (Critical C and “final” P of COP; not firm until pdf page 57 of the 83-page report)


Example 1 Making a Risk Analysis-Actual process (cont.)

8. Build the model for crew health benefit (i.e., the complement of the consequence axis in a

risk matrix) by building Reliability Block Diagrams (RBDs). Each RBD contains subsystems

that satisfy physiological objectives for flight crews (astronauts aboard the International

Space Station). Parallel paths in the RBD mean more than one subsystem (with a relatively

“high” health benefit as determined by flight medical experts) can satisfy a physiological

objective. (First critical O of COP)

9. Build the model for likelihood of success (i.e., the complement of the probability of failure

axis in a risk matrix) by: (1) Assigning applicable uptime and downtime data to each scenario

by subsystem and (2) Determining the reliability and maintainability probability density

functions and parameters (math models) for each scenario by subsystem. (Second critical O

of COP)

10. Use (run) the mentioned models to calculate reliability, conditional reliability, and

instantaneous availability for each subsystem and for each physiological objective at various

mission times.

11. Build risk matrices at various mission times. The likelihood axis in success space is either

reliability (no repairs allowed) or availability (repairs allowed). The consequence axis in

success space is a weighted average of all physiological objectives.

12. Organize and release findings. Tip: Provide the source for each finding to aid in a self check

and with an external audit.
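
The Reliability Block Diagram arithmetic behind steps 8 through 10 can be sketched as follows. The block layout and the reliability values are hypothetical and are not taken from the ISS analysis; in practice each block's value would come from the fitted models in step 9, evaluated at a given mission time.

```python
from math import prod


def series(reliabilities: list[float]) -> float:
    """All blocks must work: R = product of the block reliabilities."""
    return prod(reliabilities)


def parallel(reliabilities: list[float]) -> float:
    """At least one block must work: R = 1 - product of the block unreliabilities."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


# Hypothetical layout: two devices can each satisfy one physiological objective
# (parallel paths), and both depend on one shared subsystem (series).
redundant_devices = parallel([0.90, 0.80])
objective_reliability = series([redundant_devices, 0.95])
print(f"Parallel devices: {redundant_devices:.3f}")
print(f"Objective at this mission time: {objective_reliability:.3f}")
```

Repeating the calculation at several mission times gives the likelihood values that feed the risk matrices in step 11.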


Tip Reacting to reactive situations*

“Don’t bring me a perfect answer after launch.”
~~ A NASA Johnson Space Center Manager’s directive to his safety and mission assurance engineers

“Success comes in cans, not in cannots!”
~~ Joel H. Weldon, motivational speaker

“Plans are nothing; planning is everything.”
~~ Dwight D. Eisenhower

“If you torture the data long enough, it will confess to you.”
~~ Ronald H. Coase, a British-born, America-based economist

~~Update: Mark Hulbert’s Sept 26, 2006 Market Watch stated, “If you torture the data long enough, you

can get it to say just about anything.”

“Somebody is going to have to suffer, either the reader or the

writer.”~~Tom Murawski, writing consultant

*Analyst self talk or mantras. A mantra is “A sacred Hindu formula believed to…possess magical power.”


Example 2 Program Level: Space Shuttle Program

[Figure: two complementary strategies applied to Data. “Top-Down Strategy”: Probabilistic Risk Assessment (PRA). “Bottoms-Up Strategy” (The Triage Process): 1st - Trend Analyses; 2nd - Quick Look Analyses; 3rd - Reliability & Maintainability (R&M) Assessments.]

Notes:

• Data is the driver, the starting point, in the bottoms-up strategy.

• The system is the driver in the top-down strategy.

• Both strategies are used together to minimize the risk of relying on only one of the two.

• Quick Look Analyses can use the data pulled for the Trend Analysis.

• The next slide gives details on the Triage Process.


Analysis levels:

1 TREND ANALYSIS - an analysis that identifies candidates for review and possibly further analysis based on activity, trend, and/or a risk measure

2 QUICK LOOK - an analysis that provides the density of problem attributes relative to the element’s total and various subtotals with high-level explanations

3 R&M ASSESSMENT - a detailed analysis and assessment aimed at determining problem root cause and recommending corrective action

*Note: An element is one of the following: function, system, or line replaceable unit.

REQUIREMENT AREA

1.0 Element* Under Study
  Trend Analysis: Total system. All elements at various levels are scanned.
  Quick Look: One or more elements (broken down by part number and/or serial number)
  R&M Assessment: One or more elements (broken down by part number and/or serial number)

2.0 Data (“the raw materials”)
  Trend Analysis: Detect date, failure criticality, operation purpose, and recurrence control data from Problem Reporting And Corrective Action (PRACA).
  Quick Look: All coded data and element ID data from PRACA. Typically, this data is pulled at the same time data is pulled to make a Trend Analysis.
  R&M Assessment: Any data applicable to analyzing and assessing the reliability and maintainability (R&M) of the current design.

3.0 Data Sets/Scenarios (“cutting the data”)
  Trend Analysis: One-dimensional: each element is a row and uses 8 or 9 columns to describe problem type and problem disposition.
  Quick Look: Two-dimensional: an analyzed element uses columns from the Trend Analysis and rows of 4 types to isolate valid problems on flight hardware.
  R&M Assessment: Multi-dimensional: an analyzed element uses a variety of data to describe and evaluate conditions, element configuration, flight rules, performance, and management goals.

4.0 Quantitative Analysis (“using Probability and Statistics”)
  Trend Analysis: Count, trend, criticality, and risk. Various types of graphical treatments. Double Pareto at the part group level. Comparison to previous reports.
  Quick Look: A “frequency table” is made for sub-elements and for each PRACA coded field.
  R&M Assessment: Standard or custom made mathematical models measure and evaluate the R&M of the element under study.

5.0 Quantitative Findings (a.k.a. “little q”)
  Trend Analysis: Decision rules based on problem count, trend, criticality, and risk identify “significant findings.”
  Quick Look: A relatively large number of occurrences in a scenario (i.e., a cell or combination of cells) identifies an item that needs to be explained.
  R&M Assessment: “Fact-based” findings traceable to data and a quantitative analysis methodology. Use as metrics for related management goals.

6.0 Qualitative Analysis (“using Science and Engineering”)
  Trend Analysis: None
  Quick Look: High-level explanations from the subsystem engineer, especially on cells or combinations of cells that contain a large number of occurrences.
  R&M Assessment: A multidisciplinary team uses selected analysis techniques to identify the underlying reason or root cause for removals, inefficiency, failures, etc.

7.0 Qualitative Findings (a.k.a. “big Q”)
  Trend Analysis: None
  Quick Look: High-density scenarios are noted as understood or not understood and under control or not under control.
  R&M Assessment: “Fact-based” findings traceable to accepted principles and to a qualitative analysis methodology.

8.0 Conclusions (“using Logic”)
  Trend Analysis: None
  Quick Look: None
  R&M Assessment: Conclusions based on and traceable to the findings in a logical manner.

9.0 Recommendations (“proposed actions to make actual = plan”)
  Trend Analysis: None
  Quick Look: None
  R&M Assessment: Recommendations based on and traceable to the conclusions.

10.0 Benefit Analysis (“testing the recommendations”)
  Trend Analysis: None
  Quick Look: None
  R&M Assessment: An analysis that uses adjusted scenario(s) to measure the value of proposed recommendation(s).

11.0 Report Format & Packaging
  Trend Analysis: Formal report containing spreadsheets, graphs, methodology, and summary of significant findings.
  Quick Look: Informal comments hand written or typed below each frequency table.
  R&M Assessment: Formal report on the above 10 areas, a special executive summary, and visuals suitable for viewgraph use.


Example 2 The Triage Process in action at Level 3

The problem:

During launch, a system failed on the Space Shuttle Orbiter that caused major embarrassment as well as much expense. Should this system be replaced with new technology or upgraded?

If upgraded, identify the system elements (e.g., components) causing the problem and the required reliability.


Example 2 Problem details and resolution

A Fuel Cell on a Space Shuttle Orbiter caused a minimum duration flight (MDF) during STS-83.

In addition to the MDF, a previous launch delay and

numerous maintenance actions during “vehicle

turnaround” made this system a serious candidate

for improvement.

A detailed reliability and maintainability (R&M) analysis and assessment report on all fuel cell line replaceable units (LRUs) from the STS-26 to STS-85 time period was completed.

This R&M assessment was instrumental in the decision to change regulator material in all LRUs for $12M instead of replacing with a new design estimated at $50M.


Example 3 Center Level: An Enabling System

Around year 2006, NASA Kennedy Space Center’s (KSC)

Office of Chief Engineer led the effort called Integrated

Design and Assurance System (IDAS).

IDAS’s goal, which remains today, is to provide and support a set of integrated COTS modules for any KSC employee at any time (24/7) to “learn and do” engineering assurance analyses (e.g., safety, reliability, maintainability) over the life cycle of a system. In addition, IDAS had the purpose of demonstrating data file transfer with other tools.

The next six slides provide an overview of IDAS and

viewpoints on classifying and deploying various analytical

methods.


KSC’s Integrated Design and Assurance System (IDAS)

Tim Adams

NASA Kennedy Space Center

Engineering & Technology Directorate

Systems Engineering & Integration, NE-D1

February 2012


KSC’s Integrated Design and Assurance System (IDAS)

External Website for IDAS: http://kscsma.ksc.nasa.gov/Reliability/IDAS.html

• Origin and Purpose: About six years ago, KSC’s Office of Chief Engineer led the effort called IDAS. IDAS’s goal, which remains today, is to provide and support a set of integrated COTS modules for any KSC employee at any time (24/7) to “learn and do” engineering assurance analyses (e.g., reliability) over the life cycle of a system. In addition, IDAS had the purpose of demonstrating data file transfer with other COTS tools.

• Selected Features: IDAS allows an equipment list (bill of materials) to be imported or inputted and then used to populate various analysis modules. Some modules can be linked.

• IDAS is located on the KSC network and is totally electronic (e.g., provides access via NAMS, provides online training as well as by other methods, allows a user or group of users to build, run, and view analyses, and uploads to KSC’s Product Data Management system).

• Use: IDAS is used by all KSC programs. For example, with the Constellation Program, KSC Ground Systems’ Systems Engineers in conjunction with KSC Design Engineers used IDAS to identify and make over 100 design changes prior to build. This work is described in an AIAA paper that was selected as one of the top 30 papers.

• Summary: In today’s Model-Centric Engineering terms, IDAS is an integrated and highly supported suite of “model-based engineering” modules that perform various types of technical risk analyses, one important part of systems engineering, and is a step in the direction of closed-loop model-based systems engineering (MBSE).



IDAS Modules by Vendor & File Transfer Capability

[Figure: block diagram of the IDAS software modules grouped by vendor, with file-transfer links between tools. Each module box gives a Module Name, Module Purpose, and Module User. Legend entries: In Work; Not Started; File-Transfer Capability Completed; Colored Boxes = Software Modules; Shaded Boxes = Not Active at KSC.]

PTC’s Windchill Quality Solutions (formerly Relex) s/w, Enterprise Edition-All Modules (Many NASA Centers and contractors have this software in some form)

Module Name: Module Purpose

Prediction: Defines System Elements (BOM)1 & Element Failure Rates
RBD: Defines System Structure & Calculates Reliability & Availability2
FMEA3: Analyzes Element Failure Modes (Bottoms Up)
Event Tree, Fault Tree: Identifies Conditions and Factors that Cause an Undesired Event to Occur (Top-Down)
Markov: Evaluates Systems with Multiple States (e.g., up, down, degraded)
OpSim (now included in RBD): Analyzes Cost of Reliability, Availability2, Maintenance Intervals, & Spares
Maintainability: Calculates System Repair & Maintenance Measures
ALT: Analyzes Accelerated Life Testing Data to Predict Product Reliability
FRACAS4 with Audit Feature: Provides a Closed-loop Problem Reporting & Corrective Action System
Dashboard & Alert Feature: Provides Electronic Reporting & Notification
Life-Cycle Cost: Calculates Total Cost of Ownership
Weibull: Uses Field & Test Data to Calculate Probability of Failure

Module Users (as listed left to right): Design, Systems, Reliability, and Safety Engineers; Systems, Reliability, Maintainability, Supportability (Logistics), Cost, and Industrial Engineers; Reliability Engineers and Statisticians; Quality & Sustaining Engineers; Management; Cost Analyst-Engs.; Reliability Engineers and Statisticians

Other tools and data connected to IDAS:

Probabilistic Risk Assessment (PRA): INL’s SAPHIRE s/w and Item Software’s iQRAS s/w (Both sponsored by NASA HQ’s S&MA)
Operations Analysis: ARINC’s Raptor s/w
Text Mining: SearchTechnology’s TechOASIS s/w (Sponsored by US Army)
Fault Tree Files: EPRI’s CAFTA s/w
NASA’s Problem Reporting & Corrective Action (PRACA) Data: Multiple & Dissimilar PRACA Databases. NASA’s PRACA data can be imported via snap shots or live connectors.
Upload IDAS products to other systems (e.g., Product data management (PDM) or Product lifecycle management (PLM))

Notes: 1 – BOM = Bill of Materials; 2 – Availability applies to repairable systems and is a function of Reliability and Maintainability; 3 – FMEA = Failure Modes & Effects Analysis; 4 – FRACAS = Failure Reporting and Corrective Action System.

PTC WQS (formerly Relex) Modules from a Risk Perspective

Internal Website for KSC’s PTC WQS software: https://sp.ksc.nasa.gov/sites/sre/tools/relextool/default.aspx

[Figure: risk matrix with LIKELIHOOD and CONSEQUENCE axes on which the modules are placed and classified as Qualitative Tools or Quantitative Tools. Module groups shown: Prediction, Weibull, Accelerated Life Testing, Life-Cycle Cost; RBD, Event Tree, Fault Tree, FRACAS, Markov, Maintainability; FMEA, Dashboard.]

Selected IDAS Modules that Provide Engineering Assurance or Technical Risk Analyses during the Design Phase

[Figure: design-phase flowchart. Elements shown: Specify system requirements; System effectiveness; Life-cycle costs; Allocate and predict reliability (Reliability Block Diagram Analysis); Implement design methods; Failure analysis (Failure Modes & Effects Analysis); Conduct system safety analysis (Fault Tree Analysis); decision “Are safety goals achieved?” (Yes/No); decision “Are design goals achieved?” (Yes/No); Ready for build.]

Reference: An Introduction to Reliability and Maintainability Engineering, Figure 8.1, Charles Ebeling, 2005.


An Idealized Sequence for Producing Engineering Assurance Analyses

[Figure: timeline from Start to Finish (WHEN) showing the analytical products (WHAT) FFBD, RBDA, FMEA, FTA, and PRA, grouped into Success Space Analyses and Failure Space Analyses, with the share of EFFORT (0% to 100%) divided between Engineering and Safety & Mission Assurance (WHO).]

Analytical Products:
FFBD = Functional Flow Block Diagram
RBDA = Reliability Block Diagram Analysis
FMEA = Failure Modes & Effects Analysis
FTA = Fault Tree Analysis
PRA = Probabilistic Risk Assessment

Theme: This work sequence (WHEN) builds and uses analytical products (WHAT) in an optimum manner, especially during the Design Phase. The appropriate mix of experts (WHO and EFFORT) makes and delivers the right analytical product at the right time. In addition to serving the intended purpose at the desired time, each analytical product serves as an input that expands the technical fidelity of analytical products that follow.


Example 4 IDAS in action-during design

The Constellation Program’s (CxP) Ground Systems RMA Team consisted of a contracted two-person team.

RMA analyses were performed in cooperation with ~30 design teams.

For Maintainability:

Used Cut Set analyses combined with the RBD analyses. This provided the CxP with the target subsystems on which to focus via:

Procuring more reliable components.

Designing for Maintainability using data for components most likely to fail.

Providing operational workarounds for subsystems identified as most likely to fail.

RMA Team’s recommendations improved subsystem reliability by a factor of ~9.4 using a conservative method.

Applying the 9.4 improvement factor to the Space Shuttle Program increased the historical 88% probability of launch to 98.6%.
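
The slide does not state how the improvement factor was applied to the 88% figure. The sketch below compares two possible readings, both of which are assumptions: dividing a constant failure rate by the factor, or dividing the failure probability by the factor. The first reading reproduces the quoted 98.6%.

```python
import math

HISTORICAL_LAUNCH_PROBABILITY = 0.88  # from the slide
IMPROVEMENT_FACTOR = 9.4              # from the slide

# Reading A (assumption): the factor divides a constant failure rate,
# with R = exp(-rate * t), so R_new = R_old ** (1 / factor).
rate_old = -math.log(HISTORICAL_LAUNCH_PROBABILITY)
improved_by_rate = math.exp(-rate_old / IMPROVEMENT_FACTOR)

# Reading B (assumption): the factor divides the failure probability directly.
improved_by_probability = (
    1.0 - (1.0 - HISTORICAL_LAUNCH_PROBABILITY) / IMPROVEMENT_FACTOR
)

print(f"Reading A: {improved_by_rate:.1%}")         # 98.6%
print(f"Reading B: {improved_by_probability:.1%}")  # 98.7%
```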


Example 4 The RMA Team

The RMA Team was viewed as part of the Design Team; thus, an embedded team approach by virtue of support from KSC’s Design Chief Engineer.

The RMA Team had a low impact on the Design Team.

Normally, two to four meetings were required throughout the

process (total 3 to 8 hours)

The RMA Team provided feedback to the Design Team

throughout the design process.

RMA Analysis was a required product for design review.


Example 4 RMA Results (as of year 2011)

At a Preliminary Design Review (PDR), Ground Systems projected a 99.5% probability of success during the specified time period:

34 of 57 subsystems analyzed, combined with allocations for remaining subsystems.

90% of Ground Systems unreliability was attributed to less than 2% of the possible failure paths

Based on analysis of 15 subsystems

~435,000 Cut Sets with a probability greater than 1 × 10^-16

This type of analytical product provided focus for:

Design changes (“prior to cutting metal”)

Operational workarounds

Maintainability considerations.


Example 5 RMA is not just Statistics

The problem—sound familiar?

The test director wants to know if testing can stop after

receiving no failures in 360 tests on a life-critical item.

In particular, does this testing certify that the item is

safe?


Example 5 At NASA, this problem was…

The White Sands Test Facility (WSTF) conducted 360 tests

to determine if ignition would occur during the presence

of a small quantity of hydrocarbon oil in 100% oxygen

under adiabatic compression, the compression heating of

oxygen.

None of the WSTF tests produced ignition. These tests

were in response to a hydrocarbon oil contaminant found

in the Portable Life Support System (PLSS) used in the

Extravehicular Mobility Unit (EMU).


Example 5 The Extravehicular Mobility Unit

The Extravehicular Mobility Unit (EMU) is an independent system that provides environmental protection, mobility, life support, and communications for a Space Shuttle or International Space Station (ISS) crew member to perform extra-vehicular activity (EVA) in earth orbit.


Example 5 And it is in the news!

[Figure: news article]


Example 5 The Reliability response

Method 1 used Classical Test Statistics to determine the

maximum failure rate with a high degree of statistical

confidence. This failure rate did not meet the program’s

failure-rate goal. Thus, more testing would have been

required if only this analysis method was used for decision

making.

Method 2 used Ancillary Data (i.e., similar test data) to

identify a boundary for ignition and no ignition. This

method did not address heat loss and was not sufficient for

decision making.


Example 5 From Classical Reliability to Physics

Method 3 used the Arrhenius Reaction Rate Model. This

model adjusted the failure rate found in the first method since

all WSTF testing was done under stressed conditions (higher

pressure). The failure-rate goal was surpassed under certain

assumptions.

Method 4 used Combustion Physics (i.e., Semenov equations)

to address the heat loss not addressed in Method 2. It was

found that the reaction rate was not fast enough to cause

ignition in the PLSS. Thus qualitatively, the quantitative failure-

rate goal was believed to be satisfied with certainty.

Note: Regardless of the analytical method, uncertainty needs to

be described. Next, the types of uncertainty will be outlined.
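
A minimal sketch of the Arrhenius adjustment described in Method 3. It assumes the failure rate scales with the reaction rate and that gas temperatures at the stressed and use conditions are known; how the higher test pressure translates into temperature is not given in this document. The activation energy, temperatures, and stressed failure rate are hypothetical, not WSTF values.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(
    activation_energy_ev: float, use_temp_k: float, stress_temp_k: float
) -> float:
    """Ratio of the reaction rate at the stress temperature to that at the use temperature."""
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (
        1.0 / use_temp_k - 1.0 / stress_temp_k
    )
    return math.exp(exponent)


# Hypothetical inputs for illustration only.
acceleration = arrhenius_acceleration(
    activation_energy_ev=0.7, use_temp_k=400.0, stress_temp_k=450.0
)
stress_failure_rate = 8.3e-3  # hypothetical bound obtained at stressed conditions
use_failure_rate = stress_failure_rate / acceleration
print(f"Acceleration factor: {acceleration:.2f}")
print(f"Adjusted failure rate at use conditions: {use_failure_rate:.2e}")
```

The result is sensitive to the assumed activation energy, which is one reason the slide qualifies the outcome with "under certain assumptions."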


Tip When quantitative work is not certain

When the work is probabilistic (not deterministic), characterize point

estimates using uncertainty in order to provide the estimate’s

measure of the “goodness.” There are four types of uncertainty.

Inherent Uncertainty (Physical Variability)

Parameter (Statistical) Uncertainty

The uncertainty in the values of the parameters of a model. Assume the mathematical

form of that model has been agreed to be appropriate.

Model Uncertainty

Related to an issue for which no consensus approach or model exists, and the choice of

approach or model is known to have an effect on the risk model.

Completeness Uncertainty

Represents a type of uncertainty that cannot be quantified because it represents those

aspects of the system that are, either knowingly or unknowingly, not addressed in the

model.

For more on Uncertainty, see

http://kscsma.ksc.nasa.gov/Reliability/Documents/100128_Uncertainty_Concepts.pdf


Summary Reliability (the first word for each item below)…

Excellence in an organization is more than a support function for maintenance and safety engineering (pp. 5 & 9).

Is a “design for” operating outcome and is inferred from an engineering drawing since reliability is not a physical characteristic that appears on the drawing (p. 6).

As a job enables others (doers, management, and other enablers) to work a collection of planned activities to maximize system function and uptime (pp. 7 & 9).

Has four elements in its classic definition: likelihood, function, duration (demand or load), and environment (p. 8).

In this document focused on a quantitative view, being techniques based on Probability and Statistics (Probabilism), Physics (Determinism), or both (pp. 8 & 10).

Coupled with Maintainability is Availability. Availability is a function of Reliability and Maintainability (R&M) that addresses both uptime and downtime (pp. 7 & 8).

That is quantitative typically encounters three types of data: clock time or cycles (duration), events (demands), and stress-strength (load-capacity) relationships (p. 11).

Quantifies the likelihood axis in a risk matrix, with the other axis being consequence (p. 11).

As a meaningful product to the organization must be understood by, and must keep pace and collaborate with, the decision makers (p. 13).

And Availability in quantitative form are a forecast. As with any forecast, uncertainty provides the estimate’s measure of the “goodness.” Without such a measure, it is impossible for the decision maker to judge how closely the predictions relate to or represent reality (p. 38).