

Microelectronics Reliability 47 (2007) 1889–1901

A maintenance planning and business case development model for the application of prognostics and health management (PHM) to electronic systems

Peter A. Sandborn *, Chris Wilkinson

CALCE Electronic Products and Systems Center, Department of Mechanical Engineering, University of Maryland,

College Park, MD 20742, United States

Received 8 January 2007. Available online 5 April 2007.

Abstract

This paper presents a model that enables the optimal interpretation of prognostics and health management (PHM) results for electronic systems. In this context, optimal interpretation of PHM results means translating PHM information into maintenance policies and decisions that minimize life cycle costs, or maximize availability or some other utility function. The electronics PHM problem is characterized by imperfect and partial monitoring, and a random/overstress failure component must be considered in the decision process. Given that the forecasting ability of PHM is subject to uncertainties in the sensor data collected, the failure and damage accumulation models applied, and the material dimensions and properties used in the models, the decision model in this paper addresses how PHM results can best be interpreted to provide value to the system maintainer. The result of this model is a methodology for determining an optimal safety margin and prognostic distance for various PHM approaches in single and multiple socket systems, where the line replaceable units (LRUs) in the various sockets that make up a system can incorporate different PHM approaches (or have no PHM structures at all).

The discrete event simulation model described in this paper provides the information needed to construct a business case showing the application-specific usefulness of various PHM approaches, including health monitoring (HM) and life consumption monitoring (LCM), for electronic systems. An example business case analysis for a single socket system is provided.

© 2007 Elsevier Ltd. All rights reserved.

1. Introduction

Prognostics is the estimation of remaining useful life (RUL) in terms that are useful to the maintenance decision making process. The decision process can be tactical (real-time interpretation and feedback) or strategic (maintenance planning). All PHM approaches are essentially the extrapolation of trends based on recent observations to estimate RUL [1]. Unfortunately, the calculation of RUL alone does not provide sufficient information to form a decision or to determine corrective action. Without comprehending the corresponding measures of the uncertainty associated with the calculation, RUL projections have little practical value [1]. It is the comprehension of the corresponding uncertainties (decision making under uncertainty) that is at the heart of being able to develop a realistic business case that addresses prognostic requirements. The PHM approaches used to estimate RUL include: (1) life consumption monitoring (LCM) forecasts based on physics-of-failure (PoF) models [2,3], (2) health monitoring (HM) forecasts based on precursor variable monitoring [4,5], and (3) HM forecasts based on failure mechanism specific fuses [6]. Vichare et al. [3] have shown how the uncertainty of an RUL forecast derived from LCM can be estimated. Mishra et al. [6] have shown how the uncertainty of an RUL forecast derived from failure mechanism specific fuse structures can be estimated.

* Corresponding author. Tel.: +1 301 405 3167; fax: +1 301 314 9269. E-mail address: [email protected] (P.A. Sandborn).

0026-2714/$ - see front matter © 2007 Elsevier Ltd. All rights reserved. doi:10.1016/j.microrel.2007.02.016

Electronic systems have not traditionally been subject to PHM because their time to failure was assumed to be non-quantifiable and, in any case, much longer than the system support life or technology refresh period (non-life limited). Most approaches to PHM are focused on monitoring failure precursor indications (i.e., HM), which does not require system failures to be deterministic in nature, but does require that the precursor selected has a deterministic link to the actual system failure. While there is considerable existing work on precursors for mechanical systems [4,5], relatively few attempts have been made to apply HM techniques to electronics [6,7]. Alternatively, LCM depends on the deterministic nature of system failures expressed through failure models. In LCM, a history of environmental stresses (e.g., thermal, vibration) is used in conjunction with physics of failure (PoF) models to compute accumulated damage and thereby forecast RUL [2]. With the transition from military-specification parts to commercial-off-the-shelf (COTS) parts, many of which are now targeted by design for lifetimes in the 5–7 year range, wear-out of electronics may become a relevant concern for long field life systems [8]. Also, it has long been known that interconnects are subject to fatigue failure from temperature cycling. In addition, PoF approaches to modeling electronic system reliability have shown that time-to-failure (TTF) for electronic parts and interconnects can be predicted within quantifiable bounds of uncertainty [9].

Modeling to determine the optimum schedule for performing maintenance for systems is not a new concept. Examples of traditional applications of maintenance modeling include production equipment [10], and the hardware portions of engines and other propulsion systems [5]. However, maintenance modeling has not been widely applied to electronic systems, where presumed random electronics failure is usually modeled as an unscheduled maintenance activity, and wear-out is assumed to be beyond the end of the system's support life.

Although many applicable models for single and multi-unit maintenance planning have appeared [11,12], the majority of the models assume that monitoring information is perfect (without uncertainty) and complete (all units are monitored the same), i.e., maintenance planning can be performed with perfect knowledge as to the state of each unit. For many types of systems, and especially electronic systems, these are not good assumptions, and maintenance planning, if possible at all, becomes an exercise in decision making under uncertainty with sparse data. The perfect monitoring assumption is especially problematic when the PHM approach is LCM because LCM does not depend on precursors. Thus, for electronics, LCM processes do not deliver any measures that correspond exactly to the state of a specific instance of a system. Previous work that treats imperfect monitoring includes [4,13]. Perfect, but partial monitoring has been previously treated in [14].

This paper presents a new stochastic decision model that enables the optimal interpretation of LCM damage accumulation or HM precursor data, and applies to failure events that appear to be random or appear to be clearly caused by defects. Specifically, the model is targeted at addressing the following two questions:

• How do we determine on an application-specific basis when the reliability of electronics has become predictable enough to warrant the application of PHM-based scheduled maintenance concepts? Note, we do not mean to imply that predictability in isolation is the criterion for PHM vs. non-PHM solutions, e.g., if the system's reliability is predictable and the system is very reliable, it would not make sense to implement a PHM solution.

• Given that the forecasting ability of PHM is subject to uncertainties in the sensor data collected, the data reduction methods, the failure models applied, the material parameters assumed in the models, etc., how can PHM results be interpreted so as to provide value, i.e., how can a business case be constructed? This boils down to determining an optimal safety margin on LCM prediction and prognostic distance for HM.

2. Model formulation

The following maintenance planning model accommodates variable time-to-failure (TTF) of line replaceable units (LRUs) and variable RUL estimates associated with PHM approaches implemented within LRUs. The model considers both single and multiple sockets within a larger system. A "socket", in our terminology, is a unique instance of an installation location for an LRU. An example of an LRU could be an engine controller for a jet engine; one instance of a socket occupied by the engine controller is its location on a particular jet engine. This socket may be occupied by a single LRU during its lifetime (if the LRU never fails), or by multiple LRUs if one or more LRUs fail and need to be replaced.

Discrete event simulation is used to follow the life of individual socket instances from the start of their field life to the end of their operation and support. Discrete event simulation implies the modeling of a system as it evolves over time by representing the changes as separate events (as opposed to continuous simulation where the system evolves as a continuous function). The evolutionary unit need not be time; it could be thermal cycles, or some other unit relevant to the particular failure mechanisms addressed by the PHM approach. Discrete event simulation has the advantage of defining the problem in terms of something intuitive, i.e., a sequence of events, thus avoiding the need for formal specification. Discrete event simulation is widely used for maintenance and operations modeling, e.g., [15–17], and has also previously been used to model PHM activities [18].

The model used in this paper treats all inputs to the discrete event simulation as probability distributions, i.e., a stochastic analysis is used, implemented as a Monte Carlo simulation. Various maintenance interval and PHM approaches are distinguished by how sampled TTF values are used to model PHM RUL forecasting distributions. To assess PHM, relevant failure mechanisms are segregated into two types:


1. Failure mechanisms that are random from the viewpoint of the PHM methodology. These are failure mechanisms that the PHM methodology is not collecting any information about (non-detection events). These failure mechanisms may be predictable, but are outside the scope of the PHM methods applied.

2. Failure mechanisms that are predictable from the viewpoint of the PHM methodology, i.e., for which a probability distribution can be assigned.

For the purposes of cost model formulation, PHM approaches are categorized as: (a) a fixed schedule maintenance interval that is kept constant for all instances of the LRUs occupying all socket instances throughout the system life cycle; (b) a variable maintenance interval schedule for LRU instances that is based on inputs from a precursor to failure methodology (e.g., HM or LRU-dependent fuses); and (c) a variable maintenance interval schedule for LRU instances that is based on an LRU-independent methodology (e.g., an LCM methodology or LRU-independent fuses) (see footnote 1). Note, for simplicity, throughout this paper the model formulation is presented based on "time" to failure measured in operational hours; however, the relevant quantity could be a non-time measure such as thermal cycles.

The metrics computed are: life cycle cost, failures avoided, and operational availability. Appendix A provides a detailed description of the model implementation. The key features of the model's formulation are described in Sections 2.1–2.3. Example results generated using all the approaches discussed in this section are presented in Sections 3 and 4.

2.1. Fixed schedule maintenance interval

A fixed schedule maintenance interval is selected that is kept constant for all instances of the LRU that occupy a socket throughout the system life cycle. In this case the LRU is replaced on a fixed interval (measured in operational hours), i.e., time-based prognostics. This is analogous to mileage-based oil changes in automobiles.
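As a rough illustration of the fixed-interval policy, the following Python sketch follows one socket from the start of field life to the end of support. It is a minimal sketch under stated assumptions, not the implementation described in Appendix A: the function name `simulate_fixed_interval`, its parameters, and the 2000 h TTF width are hypothetical, the TTF mode (5000 h) and support life (25 years at 2500 h per year) are taken from Table 1, and no costs, downtime, or random failures are modeled.

```python
import random


def simulate_fixed_interval(interval_h, ttf_mode_h=5000.0, ttf_width_h=2000.0,
                            support_life_h=25 * 2500.0, rng=None):
    """Follow one socket under a fixed replacement interval (illustrative only).

    Returns (scheduled_events, unscheduled_events, discarded_life_h).
    """
    rng = rng or random.Random()
    half_width = ttf_width_h / 2.0
    t = 0.0
    scheduled = 0
    unscheduled = 0
    discarded_life_h = 0.0
    while True:
        # Sample the actual TTF of the LRU instance now occupying the socket
        # from a symmetric triangular distribution (low, high, mode).
        ttf = rng.triangular(ttf_mode_h - half_width,
                             ttf_mode_h + half_width, ttf_mode_h)
        failed_first = ttf <= interval_h
        step = ttf if failed_first else interval_h
        if t + step > support_life_h:
            break  # the next event would fall after the end of support
        t += step
        if failed_first:
            unscheduled += 1  # LRU failed before its scheduled replacement
        else:
            scheduled += 1    # LRU replaced on schedule
            discarded_life_h += ttf - interval_h  # remaining life thrown away
    return scheduled, unscheduled, discarded_life_h


if __name__ == "__main__":
    for interval in (1000.0, 4500.0, 10000.0):
        print(interval, simulate_fixed_interval(interval, rng=random.Random(1)))
```

With these assumed inputs, an interval shorter than the shortest possible TTF produces only scheduled replacements and a large amount of discarded life, while an interval longer than the longest possible TTF produces only unscheduled events.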

2.2. Precursor to failure monitoring

Precursor to failure monitoring approaches are defined as a fuse or other monitored structure that is manufactured with or within the LRUs, or as a monitored precursor variable that represents a non-reversible physical process, i.e., it is coupled to a particular LRU's manufacturing or material variations. Health monitoring (HM) and LRU-dependent fuses are examples of precursor to failure methods.

1 LRU-dependent fuses are fabricated concurrently with specific instances of LRUs, e.g., they would share LRU-specific variations in manufacturing and materials. LRU-independent fuses are fabricated separately from the LRUs and assembled into the LRUs, so they do not share any LRU-specific variations in manufacturing and materials.

The parameter to be determined (optimized) is the prognostic distance. The prognostic distance is a measure of how long before system failure the prognostic structure or prognostic cell is expected to indicate failure (in operational hours, for example). The precursor to failure monitoring methodology forecasts a unique TTF distribution for each instance of an LRU based on the instance's TTF (see footnote 2). For illustration purposes, the precursor to failure monitoring forecast is represented as a symmetric triangular distribution with a most likely value (mode) set to the TTF of the LRU instance minus the prognostic distance (Fig. 1). The precursor to failure monitoring distribution has a fixed width measured in the relevant environmental stress units (e.g., operational hours in our example) representing the probability of the prognostic structure correctly indicating the precursor to a failure. As a simple example, if the prognostic structure was an LRU-dependent fuse that is designed to fail at some prognostic distance earlier than the system it protects, then the distribution on the right side of Fig. 1 represents the distribution of fuse failures (the TTF distribution of the fuse).

The parameter to be optimized in this case is the prognostic distance assumed for the precursor to failure monitoring forecasted TTF. The model proceeds in the following way: for each LRU TTF distribution sample (t1) taken from the left side of Fig. 1, a precursor to failure monitoring TTF distribution is created that is centered on the LRU TTF minus the prognostic distance (t1 − d). The precursor to failure monitoring TTF distribution is then sampled, and if the precursor to failure monitoring TTF sample is less than the actual TTF of the LRU instance then precursor to failure monitoring was successful. If the precursor to failure monitoring distribution TTF sample is greater than the actual TTF of the LRU instance then precursor to failure monitoring was unsuccessful. If successful, a scheduled maintenance activity is performed and the timeline for the socket is incremented by the precursor to failure monitoring sampled TTF. If unsuccessful, an unscheduled maintenance activity is performed and the timeline for the socket is incremented by the actual TTF of the LRU instance. At each maintenance activity, the relevant costs given in Table 1 are accumulated.
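The sampling step described above can be sketched in Python as follows. This is a minimal sketch rather than the authors' code: the function names, the 2000 h LRU TTF width, and the 1000 h PHM width are assumptions chosen for illustration, and costs and random failures are left out. It only estimates the fraction of failures avoided for a given prognostic distance d.

```python
import random


def precursor_outcome(rng, prognostic_distance_h, ttf_mode_h=5000.0,
                      ttf_width_h=2000.0, phm_width_h=1000.0):
    """One LRU instance under precursor to failure monitoring (illustrative).

    Returns (success, timeline_increment_h).
    """
    half_ttf = ttf_width_h / 2.0
    half_phm = phm_width_h / 2.0
    # Actual TTF (t1) of this LRU instance, sampled from the LRU TTF pdf.
    t1 = rng.triangular(ttf_mode_h - half_ttf, ttf_mode_h + half_ttf, ttf_mode_h)
    # The PHM forecast distribution is centered on t1 - d, so it follows
    # the instance-specific TTF.
    center = t1 - prognostic_distance_h
    forecast = rng.triangular(center - half_phm, center + half_phm, center)
    if forecast < t1:
        return True, forecast   # scheduled maintenance at the forecast time
    return False, t1            # unscheduled maintenance at the actual failure


def failures_avoided_fraction(prognostic_distance_h, n_samples=20000, seed=0):
    """Monte Carlo estimate of the fraction of failures avoided."""
    rng = random.Random(seed)
    avoided = sum(precursor_outcome(rng, prognostic_distance_h)[0]
                  for _ in range(n_samples))
    return avoided / n_samples


if __name__ == "__main__":
    for d in (0.0, 250.0, 500.0):
        print(d, failures_avoided_fraction(d))
```

For a symmetric triangular forecast of half-width h = 500 h, the success probability for 0 ≤ d ≤ h is 1 − (h − d)²/(2h²), so d = 0 gives 0.5, d = 250 h gives 0.875, and d = 500 h places the whole forecast distribution before the actual failure. The simulated fractions should approach these values.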

2.3. LRU-independent methods

In LRU-independent PHM methods, the PHM structures (or sensors) are manufactured independent of the LRUs, i.e., the PHM structures are not coupled to a particular LRU's manufacturing or material variations. An example of an LRU-independent method is life consumption monitoring (LCM). LCM is the process by which a history of environmental stresses (e.g., thermal, vibration) is used in conjunction with physics of failure (PoF) models to compute damage accumulated and thereby forecast RUL.

2 In the present model, all failing LRUs are assumed to be maintained via replacement or good-as-new repair; therefore, time between failure and time to failure are the same.

[Figure 1: left panel, LRU time-to-failure (TTF) pdf (probability vs. time), representing variations in manufacturing and materials, with the nominal LRU and the sampled actual TTF of an LRU instance (t1) marked; right panel, PHM structure time-to-failure pdf (probability vs. time), with the precursor indicated replacement time, the prognostic distance (d), and the LRU instance actual TTF (t1) marked. The sampled TTF forecast gives the maintenance interval for the sample.]

Fig. 1. Precursor to failure monitoring modeling approach. Symmetric triangular distributions are chosen for illustration. Note, the LRU TTF pdf (left) and the precursor to failure TTF pdf (right) are not the same (they could have different shapes and sizes).

Table 1
Data assumptions for example cases presented in this paper

| Variable in the model | Value used for example analysis |
| --- | --- |
| Production cost (per unit) | $10,000 |
| Time to failure (TTF) | 5000 operational hours = the most likely value (symmetric triangular distribution with variable distribution width) |
| Operational hours per year | 2500 |
| Sustainment life | 25 years |

| Variable in the model | Unscheduled | Scheduled |
| --- | --- | --- |
| Value of each hour out of service | $10,000 | $500 |
| Time to repair | 6 h | 4 h |
| Time to replace | 1 h | 0.7 h |
| Cost of repair (materials cost) | $500 | $350 |
| Fraction of repairs requiring replacement of the LRU (as opposed to repair of the LRU) | 1.0 | 0.7 |

[Figure 2: left panel, LRU time-to-failure (TTF) pdf (probability vs. time), representing variations in manufacturing and materials, with the nominal LRU and the sampled actual TTF of an LRU instance (t1) marked; right panel, PHM structure time-to-failure pdf (probability vs. time), with the indicated replacement time, the safety margin, the nominal LRU, and the LRU instance actual TTF (t1) marked. The sampled TTF forecast gives the maintenance interval for the sample.]

Fig. 2. LRU-independent modeling approach. Symmetric triangular distributions are chosen for illustration. Note, the LRU TTF pdf (left) and the LRU-independent method TTF pdf (right) are not the same (they could have different shapes and sizes).

3 A preliminary version of this work [19] suggested a different model for life consumption monitoring (LCM), which is not applicable to LCM as defined in this paper. The model for LCM used in [19] is mathematically identical to the model presented for precursor to failure methods in Fig. 1.

The LRU-independent methodology forecasts a unique TTF distribution for each instance of an LRU based on its unique environmental stress history. For illustration purposes, the LRU-independent TTF forecast is represented as a symmetric triangular distribution with a most likely value (mode) set relative to the TTF of the nominal LRU and a fixed width measured in operational hours (Fig. 2) (see footnote 3). Other distributions may be chosen, and [3] has shown how this distribution may also be derived from recorded environment history. The shape and width of the LRU-independent method distribution depend on the uncertainties associated with the sensing technologies and uncertainties in the prediction of the damage accumulated (data and model uncertainty). The variable to be optimized in this case is the safety margin assumed on the LRU-independent method forecasted TTF, i.e., the length of time (e.g., in operational hours) before the LRU-independent method forecasted TTF at which the unit should be replaced.

The LRU-independent model proceeds in the following way: for each LRU TTF distribution sample (left side of Fig. 2), an LRU-independent method TTF distribution is created that is centered on the TTF of the nominal LRU minus the safety margin (right side of Fig. 2). Note that the LRU-independent methods only know about the nominal LRU, not about how a specific instance of an LRU varies from the nominal. The LRU-independent method TTF distribution is then sampled, and if the LRU-independent method TTF sample is less than the actual TTF of the LRU instance then the LRU-independent method was successful (failure avoided). If the LRU-independent method TTF distribution sample is greater than the actual TTF of the LRU instance then the LRU-independent method was unsuccessful. If successful, a scheduled maintenance activity is performed and the timeline for the socket is incremented by the LRU-independent method sampled TTF. If unsuccessful, an unscheduled maintenance activity is performed and the timeline for the socket is incremented by the actual TTF of the LRU instance.

In all the maintenance models discussed, a random failure component may also be superimposed (see Appendix A). The fixed scheduled maintenance, precursor to failure monitoring and LRU-independent method models are implemented as stochastic simulations, in which a statistically relevant number of sockets are considered in order to construct histograms of costs, availability, and failures avoided. At each maintenance activity, the relevant costs according to Table 1 are accumulated.

The fundamental difference between the precursor to failure and LRU-independent models is that in the precursor to failure models the TTF distribution associated with the PHM structure (or sensor) is unique to each LRU instance, whereas in the LRU-independent models the TTF distribution associated with the PHM structure (or sensor) is tied to the nominal LRU and knows nothing about manufacturing or material variations between LRU instances.

3. Single socket model results

The baseline data assumptions used to demonstrate the model in this paper are given in Table 1.

All of the variable inputs to the model can be treated as probability distributions or as fixed values; however, for example purposes, only the TTFs of the LRUs and the PHM structures have been characterized by probability distributions. Note, all the life cycle cost results provided in the remainder of this paper are the mean life cycle cost from a probability distribution of life cycle costs generated by the model.

Fig. 3 shows the fixed scheduled maintenance interval results. 10,000 sockets were simulated in a Monte Carlo analysis and the mean life cycle costs are plotted. The general characteristics in Fig. 3 are intuitive: for short scheduled maintenance intervals, virtually no expensive unscheduled maintenance occurs, but the life cycle cost per unit is high because large amounts of RUL in the LRUs are thrown away. For long scheduled maintenance intervals, virtually every LRU instance in a socket fails prior to the scheduled maintenance activity and the life cycle cost per unit becomes equivalent to unscheduled maintenance. For some scheduled maintenance interval between the extremes, the life cycle cost per unit is minimized. If the TTF distribution for the LRU had a width of zero, then the optimum fixed scheduled maintenance interval would be exactly equal to the forecasted TTF. As the forecasted TTF distribution for the LRU becomes wider (i.e., the forecast is less well defined), a practical fixed scheduled maintenance interval becomes more difficult to find and the best solution approaches an unscheduled maintenance model.

[Figure 3: effective life cycle cost (per socket) vs. fixed scheduled maintenance interval (operational hours), with one curve for each width of the LRU time to failure distribution (1000, 2000, 4000, 6000, 8000 and 10,000 hours). Annotations: "LRU TTF distribution is narrow", "LRU TTF distribution is wide", "~Unscheduled maintenance solution".]

Fig. 3. Variation of the effective life cycle cost per socket with the fixed scheduled maintenance interval (10,000-socket simulation). No random failures assumed.

[Figure 4: two panels, "Precursor to Failure" (effective life cycle cost per socket vs. prognostic distance in operating hours) and "LRU Independent" (effective life cycle cost per socket vs. safety margin in operating hours), each showing curves for variations in TTF distribution width; the curve for 1000 hr TTF width and 1000 hr PHM width is labeled in both panels.]

Fig. 4. Variation of the effective life cycle cost per socket with the safety margin and prognostic distance for various LRU TTF distribution widths and constant PHM structure TTF width (10,000 sockets simulated).

Fig. 4 shows example results for various widths of the LRU TTF distribution as a function of the safety margin and prognostic distances associated with the precursor to failure and LRU-independent models. Several general trends are apparent. First, the width of the LRU TTF distribution has little effect on the precursor to failure PHM method results. This result is intuitive since in the precursor to failure case the PHM structures are coupled to the LRU instances and track whatever manufacturing or material variation they have, so they also track the LRU TTF distribution (the degree to which the LRU-to-LRU variations are removed from the problem depends on the degree of coupling between the LRU manufacturing/materials and the PHM structure manufacturing/materials). Alternatively, the LRU-independent PHM method is sensitive to the LRU TTF distribution width since it is uncoupled from the specific LRU instance and can only base its forecast of failure on the performance of a nominal LRU. A second observation is that the optimum safety margin decreases as the width of the LRU TTF distribution decreases. This is also intuitive, since as the reliability becomes more predictable (i.e., narrower forecasted LRU TTF distribution width), the safety margin that needs to be applied to the PHM predictions also drops.

Fig. 5 shows example results for various widths of the PHM associated distribution (constant LRU TTF distribution width) as a function of the safety margin and prognostic distances associated with the precursor to failure and LRU-independent models. In this case, both PHM approaches are sensitive to the width of their distributions. General observations from Figs. 4 and 5 are that:

(1) The LRU-independent model is highly dependent on the LRU's TTF distribution.

(2) Precursor to failure methods are approximately independent of the LRU's TTF distribution.

(3) All things equal, optimum prognostic distances for precursor methods are always smaller than optimum safety margins for LRU-independent methods, and therefore, all things equal, precursor to failure PHM methods will always result in lower life cycle cost solutions than LRU-independent methods.

Here, "all things equal" means the same LRUs with the same shape and size distribution associated with the PHM approach. Any comparison between the precursor to failure approach and the LRU-independent approach assumes that you have a choice between the two, i.e., that there is a precursor to failure method that is applicable; there may not be (especially for application to electronic systems). Appendix B provides an example business case construction for the single socket case.

[Figure 5: two panels, "Precursor to Failure" and "LRU Independent", each plotting effective life cycle cost (per socket) for variations in PHM distribution width; an inset shows the PHM time-to-failure pdf (probability) with the PHM width marked.]

Fig. 5. Variation of the effective life cycle cost per socket with the safety margin and prognostic distance for various PHM structure TTF and constant LRU TTF distribution widths (10,000 sockets simulated).

[Figure 6: two panels, "LRU Independent" and "Precursor to Failure", each plotting effective life cycle cost (per socket, left axis) and failures avoided (%, right axis). Series in each panel: "Failures Avoided: No random failures", "Failures Avoided: 10% random failures per year", "Cost: No random failures", "Cost: 10% random failures per year".]

Fig. 6. Variation of the effective life cycle cost per socket and failures avoided, with the safety margin and prognostic distance for 2000 h LRU TTF distribution widths and 1000 h PHM distribution widths, with and without random failures included (10,000 sockets simulated).

Fig. 6 shows an example with 10% random failures included. Fig. 6 also includes the associated failures avoided. In all cases the failures avoided when random failures are included is lower than when random failures are not included; however, the change in the optimum safety margin or prognostic distance is small. As the safety margin or prognostic distance increases, the failures avoided limits to 100% in all cases (with and without random failures included). However, for the example data used in this paper, safety margins or prognostic distances must be increased substantially beyond the range plotted in Fig. 6 for the cases with random failures to approach 100%.

[Figure 7: Socket 1 timeline, Socket 2 timeline, and cumulative timeline, continuing to the end of support. The timelines carry LRU instance-specific "fix me" requests originating from failures, scheduled maintenance intervals, or PHM structures. Where requests from the two sockets fall within the coincident time, the LRU in socket 1 and the LRU in socket 2 are replaced together; where they are separated by more than the coincident time, only the requesting LRU (socket 2 in the example) is replaced.]

Fig. 7. Multi-socket timeline example.

4. Multiple socket model results

Real systems are composed of multiple sockets, where the sockets are occupied by a mixture of LRUs, some with no PHM structures or strategies, and others with fixed interval strategies, precursor to failure structures or LRU-independent structures. Maintenance, even when it is scheduled, is expensive. Therefore, when the system is removed from service to perform a maintenance activity for one socket, it may be desirable to address multiple sockets (even if some have not reached their most desirable individual maintenance point).

First we address, how to use the single socket modelsdeveloped in Section 2 to optimize a system composed ofmultiple sockets, where we are assuming that all the LRUsthat occupy a particular socket have the same PHMapproach (but approaches can vary from socket to socket).To address this problem we introduce the concept of acoincident time. The coincident time is the time intervalwithin which different sockets should be treated by thesame maintenance action. If,

Timecoincident > Timerequired maintenance action on LRU i

� Timecurrent maintenance action ð1Þ

then LRU i is addressed at the current maintenance action. A coincident time of 0 means that each socket is treated independently. A coincident time of infinity means that any time any LRU in any socket in the system demands to be maintained, all sockets are maintained no matter what remaining life expectancy they have. In the discrete event simulation, the time of the current maintenance and the future times for the required maintenance actions on other LRUs are known or forecasted, and application-specific optimum coincident times can be found.

Implementation of the above constraint in the discrete event simulation is identical to the single socket simulation except that we follow more than one socket at a time (see Appendix A). When the first LRU in the multiple socket system indicates that it needs to be maintained by RUL forecast or actually does fail, a maintenance activity is performed on all sockets in which the LRUs forecast the need for maintenance within a user-specified coincident time, e.g., Fig. 7. Our model assumes that LRUs replaced at a maintenance event are good-as-new and that the damage accumulated by portions of the system not addressed by a maintenance event is not affected by the maintenance event. Costs are accumulated for scheduled and unscheduled maintenance activities and a final total life cycle cost is computed. In practice, the future maintenance action times for LRUs, other than the one indicating the need for maintenance, need to be determined from reliability forecasting (note, however, that the uncertainty in these forecasts grows the further they extend from the present).

Fig. 8. Time to failure (TTF) distributions for LRUs used in multi-socket analysis examples. The plot on the right shows the cost of single socket systems made from these two LRUs as a function of time using a prognostic distance of 500 h for the LRU in socket #1 (note, the results for 10,000 instances of each socket are shown). All data other than the LRU TTF is given in Table 1.
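The coincident-time rule of Eq. (1) can be illustrated with a minimal sketch. The function name `sockets_to_maintain` and the example times are hypothetical and not taken from the paper. The sketch also assumes that the socket whose required maintenance time equals the current time (the one that triggered the action) is always included, which is needed for a coincident time of 0 to treat each socket independently.

```python
import math

def sockets_to_maintain(current_time, required_times, coincident_time):
    """Return the indices of sockets addressed at the current maintenance action.

    required_times[i] is the known or forecasted time of the next required
    maintenance action on the LRU in socket i (operational hours).
    """
    selected = []
    for i, t_required in enumerate(required_times):
        triggered = t_required <= current_time                     # socket that demanded maintenance
        coincident = coincident_time > t_required - current_time   # Eq. (1)
        if triggered or coincident:
            selected.append(i)
    return selected

# Hypothetical example: socket 0 demands maintenance at 1000 h.
required = [1000.0, 1400.0, 9000.0]
print(sockets_to_maintain(1000.0, required, 0.0))       # [0]
print(sockets_to_maintain(1000.0, required, 500.0))     # [0, 1]
print(sockets_to_maintain(1000.0, required, math.inf))  # [0, 1, 2]
```

The three calls correspond to the limiting cases described above: independent treatment, partial grouping, and maintaining every socket at every maintenance action.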

[Plot: mean cost (per system) versus coincident time (operational hours, logarithmic axis from 1 to 100,000). Series: 2 socket #1; 3 socket #1 (cost minus 35,000, shifted down for plotting). Annotations: all sockets maintained separately; all sockets maintained at the same time.]

Fig. 10. Mean life cycle cost per system of two or three similar sockets. All LRUs, location parameter = 19,900 h (health monitoring) (10,000 systems simulated).

[Plot fragment belonging to Fig. 11: cost (per system) axis. Series: 3 socket #1 and 2 socket #2. Annotations: all sockets maintained separately; all sockets maintained at the same time.]


Analysis of multi-socket systems demonstrates that different types of system responses are possible for three types of systems: dissimilar LRUs, similar LRUs, and optimizable mixed systems of LRUs. Consider systems built from the two different sockets shown in Fig. 8. For the examples in this section, with the exception of the LRU TTF distribution, all the data is given in Table 1. With LRU TTFs defined as shown in Fig. 8, a system composed of sockets #1 and #2 is considered to be dissimilar (LRUs with substantially different reliabilities and different PHM approaches). The first step in analyzing a multi-socket system is to determine what prognostic distance/safety margins to use for the individual sockets. We have observed no differences between the optimum prognostic distance/safety margins determined by analyzing individual sockets and those determined for the sockets within larger systems. For the case shown in Fig. 8, the optimum prognostic distance for the LRU in socket #1 was 500 h.

Figs. 9–11 plot the mean life cycle cost for a system of sockets. The mean life cycle cost is the mean of a distribution of life cycle costs computed for a population of 10,000 systems. Fig. 9 shows the most common life cycle cost characteristic for dissimilar systems. For small coincident times, both sockets are maintained separately; for large coincident times, LRUs in both sockets are replaced any time either socket requires maintenance. As intuition suggests, dissimilar systems prefer small coincident times.

Fig. 10 shows the cases of two and three similar LRUs in a system. In this case the multiple sockets that make up the system are all populated with LRU #1 in Fig. 8. Here the solution prefers to maintain the LRUs in all the sockets at the same time, i.e., when the LRU in one socket indicates that it needs to be maintained, the LRUs in all the sockets are maintained. Note that the height of the step depends on the number of hours required to perform scheduled maintenance and the cost of those hours.

[Plot: mean cost (per system) versus coincident time (logarithmic axis from 1 to 100,000). Annotations: both sockets maintained separately; both sockets maintained at the same time.]

Fig. 9. Mean life cycle cost per system of two dissimilar sockets. Socket #1 LRU, location parameter = 19,900 h (health monitoring); socket #2 LRU, FFOP = 9900 h (unscheduled maintenance) (10,000 systems simulated).

[Plot: mean cost (per system) versus coincident time (logarithmic axis from 1 to 100,000). Series: 2 socket #1 and 2 socket #2. Annotation: minimum life cycle costs are for coincident times = 2000 operational hours.]

Fig. 11. Mean life cycle cost per system of mixed sockets (10,000 systems simulated).

Fig. 11 shows the results for a mixed system that has a non-trivial optimum in the coincident time. In this case there is a clear minimum in the mean life cycle cost that is not at zero or infinity.

5. Discussion

Previous work has demonstrated that PHM approaches can be successfully applied to electronic systems [2,3,6]. However, the previous work has not addressed if (or exactly how) a specific application's life cycle cost can actually be reduced and/or operational availability increased by using PHM. Such an analysis becomes non-trivial when one considers that the RUL predictions are based on imperfect and partial monitoring conditions and thus are themselves subject to uncertainty. Schemes for interpreting and applying PHM results to maintenance decisions will have to balance the risk of unscheduled failure with the substantial uncertainties present in the PHM results.

[Flow chart, Monte Carlo analysis loop. (1) Initialize the cost of each socket: \( C_{\text{socket } i} = C_{\text{LRU } i} \). (2) Construct a uniform distribution representing the random failure rate (if any) for each socket. (3) Construct the environmental stress history distribution for each socket subject to fixed interval or unscheduled maintenance (\( E_{\text{socket } i} \)). (4) Sample the TTF distribution for the LRUs in each socket: \( \mathrm{TTF}_{\text{socket } i} = \mathrm{TTF}_{\text{LRU } i} \) for precursor to failure or LRU-independent methods; \( \mathrm{TTF}_{\text{socket } i} = \mathrm{TTF}_{\text{LRU } i} + E_{\text{socket } i} \) for fixed interval or unscheduled maintenance. (5) Use the sampled TTF distributions to synthesize a PHM distribution for each socket (defined in Figs. 1 and 2 for precursor to failure and LRU-independent methods; nothing to do for fixed interval or unscheduled maintenance). (6) Sample the PHM distribution for each socket to get each socket's maintenance interval (\( M_{\text{socket } i} \)); \( M_{\text{socket } i} = \) fixed interval for fixed interval maintenance and \( M_{\text{socket } i} = \infty \) for unscheduled maintenance. (7) Adjust the maintenance interval using the safety margin if LRU-independent: \( M_{\text{socket } i} = M_{\text{socket } i} - S_{\text{socket } i} \), where \( S_{\text{socket } i} \) = safety margin. (8) Determine the actual TTF for each socket: \( \mathrm{TTF}_{\text{socket } i} = \min(\mathrm{TTF}_{\text{socket } i}, \mathrm{TTF}_{\text{random sample}}) \). (9) Determine the actual maintenance interval for each socket. (10) Compute when the next maintenance event is: \( M = \min_{\text{all } i} (M_{\text{socket } i}) \). (11) Maintain all sockets with a maintenance interval within the coincident time of the next maintenance event: scheduled maintenance event for socket i if \( \mathrm{TTF}_{\text{socket } i} \ge M \), unscheduled maintenance event for socket i if \( \mathrm{TTF}_{\text{socket } i} < M \); cost is computed using (A.1). (12) Increment time: Time = Time + M. (13) Is Time > End of Support Date? If no, repeat; if yes, end.]

Fig. A.1. Model implementation detail.

The single and multi-socket models presented in this paper provide insight into the maintenance decision process when complex systems use various PHM approaches within their subsystems. To be complete, models like those presented in this paper will ultimately need to address additional issues including:

• What is the right shape and size of distributions associated with PHM approaches?

• How to treat redundancy and what exactly constitutes a failure?

• Second order uncertainty (uncertainty about uncertainty) may be a real issue in the treatment of this problem.

• The effects of repair, i.e., LRUs with a mixed age population of sub-assemblies.

• More detailed revenue models are needed to model the cost of maintenance, e.g., simply costing scheduled and unscheduled maintenance is oversimplified for modeling the maintenance planning of commercial enterprises.

• The present analysis has not addressed a calculation of the actual implementation costs of PHM, i.e., this analysis has only discussed the "return on" part of the return on investment problem.

Appendix A. Model implementation details

This appendix provides a detailed description of the general model implementation. In the model, the time-to-failure (TTF) distribution is assumed to represent manufacturing and material variations from LRU to LRU. The range of possible environmental stress histories that sockets may see is modeled using an environmental stress history distribution. Note, the environmental stress history distribution need not be used if the TTF distribution for the LRUs includes environmental stress variations. The environmental stress history distribution is not used with the precursor to failure or LRU-independent models. Random TTFs are characterized by a uniform distribution with a height equal to the average random failure rate per year and a width equal to the inverse of the average random failure rate.

[Diagram, part of Fig. B.1: LRU reliability (due to manufacturing and material variations) + range of environmental stress profiles seen by a population of sockets in the field = time to failure.]
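The sampling of a socket's actual time to failure described in this appendix can be sketched as follows. This is only an illustration under stated assumptions: the triangular shapes, the function name `sample_socket_ttf`, and the expression of the random failure rate in failures per operational hour (so that all times share one unit) are choices made for the sketch, not details confirmed by the paper.

```python
import random

def sample_socket_ttf(rng, ttf_low, ttf_mode, ttf_high,
                      env_half_width=0.0, random_failure_rate=0.0):
    """Sample one actual time to failure (operational hours) for a socket."""
    # LRU-to-LRU manufacturing and material variation (triangular shape assumed).
    ttf = rng.triangular(ttf_low, ttf_high, ttf_mode)
    # Environmental stress history: a symmetric shift about zero (assumed shape).
    # Skipped for precursor to failure or LRU-independent models.
    if env_half_width > 0.0:
        ttf += rng.triangular(-env_half_width, env_half_width, 0.0)
    # Random failures: uniform distribution whose width is the inverse of the
    # average random failure rate; the earlier of the two failure times wins.
    if random_failure_rate > 0.0:
        ttf = min(ttf, rng.uniform(0.0, 1.0 / random_failure_rate))
    return ttf

# Hypothetical usage: wear-out near 20,000 h with a +/- 5000 h environmental spread.
rng = random.Random(0)
samples = [sample_socket_ttf(rng, 19000.0, 20000.0, 21000.0, env_half_width=5000.0)
           for _ in range(5)]
```

Taking the minimum of the wear-out TTF and the random-failure sample mirrors the step labeled "Determine the actual TTF for each socket" in Fig. A.1.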

The model follows the history of a single socket or a group of sockets from time zero to the end of support life for the system. To generate meaningful results, a statistically relevant number of sockets (or systems of sockets) is modeled and the resulting cost and other metrics are presented in the form of histograms. Fig. A.1 provides a detailed flow chart for the simulation performed.
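A heavily simplified sketch of such a simulation loop for a multi-socket system is given below. It is not the authors' implementation. It assumes that every socket uses a forecast-based PHM approach whose planned maintenance time is the actual failure time minus a prognostic distance plus a symmetric triangular forecast error, that replaced LRUs are good-as-new, and that each maintenance action incurs one event cost plus one LRU cost per replaced socket. All names and cost parameters are hypothetical.

```python
import math
import random

def _tri(rng, low, mode, high):
    """Triangular sample; returns the mode for a zero-width distribution."""
    return mode if low == high else rng.triangular(low, high, mode)

def simulate_system(rng, sockets, coincident_time, support_life,
                    lru_cost, scheduled_event_cost, unscheduled_event_cost):
    """Return the life cycle cost of one system (one Monte Carlo sample).

    Each socket is a tuple (ttf_low, ttf_mode, ttf_high, distance, half_width).
    """
    def new_lru(socket, now):
        low, mode, high, distance, half_width = socket
        fail = now + _tri(rng, low, mode, high)             # actual failure time
        forecast = fail - distance + _tri(rng, -half_width, 0.0, half_width)
        return fail, max(forecast, now)                     # (failure, planned maintenance)

    state = [new_lru(s, 0.0) for s in sockets]
    cost = 0.0
    while True:
        # Each socket's next event is the earlier of its failure and its planned maintenance.
        events = [min(fail, maint) for fail, maint in state]
        now = min(events)                                   # next maintenance event
        if now > support_life:
            return cost
        # Eq. (1): also maintain sockets whose own event lies within the coincident time.
        replaced = [i for i, e in enumerate(events)
                    if e <= now or coincident_time > e - now]
        # Unscheduled if a triggering socket failed before its planned maintenance.
        failed = any(events[i] <= now and state[i][0] < state[i][1] for i in replaced)
        cost += lru_cost * len(replaced)
        cost += unscheduled_event_cost if failed else scheduled_event_cost
        for i in replaced:
            state[i] = new_lru(sockets[i], now)             # good-as-new replacement

def mean_life_cycle_cost(sockets, coincident_time, n_systems=1000, seed=1, **params):
    """Average simulate_system over a population of simulated systems."""
    rng = random.Random(seed)
    total = sum(simulate_system(rng, sockets, coincident_time, **params)
                for _ in range(n_systems))
    return total / n_systems
```

Calling `mean_life_cycle_cost` over a range of coincident times (including `0.0` and `math.inf`) produces a curve of the kind plotted in Figs. 9–11, although the values depend entirely on the assumed distributions and costs.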

The scheduled and unscheduled costs computed for the sockets are given by,

\[ C_{\text{socket } i} = f\,C_{\text{LRU } i} + (1 - f)\,C_{\text{LRU } i\ \text{repair}} + f\,T_{\text{replace } i}\,V + (1 - f)\,T_{\text{repair } i}\,V \qquad (\text{A.1}) \]

where \( C_{\text{socket } i} \) is the life cycle cost of socket i, \( C_{\text{LRU } i} \) is the cost of procuring a new LRU for socket i, \( C_{\text{LRU } i\ \text{repair}} \) is the cost of repairing an LRU in socket i, f is the fraction of maintenance events on socket i that require replacement of the LRU in socket i with a new LRU, \( T_{\text{replace } i} \) is the time to replace the LRU in socket i, \( T_{\text{repair } i} \) is the time to repair the LRU in socket i, and V is the value of time out of service.

Note, the values of f and V generally differ depending on whether the maintenance activity is scheduled or unscheduled. For simplicity, (A.1) is written assuming that the quantity of replaced LRUs in socket i is one; however, in general, the socket could receive many LRUs during its lifetime.
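As a small worked illustration of (A.1), the sketch below evaluates the cost of one maintenance event and sums it over a sequence of events with separate scheduled and unscheduled parameter sets. The function names and all numerical values are hypothetical and chosen only for the example.

```python
def socket_event_cost(f, c_lru, c_repair, t_replace, t_repair, v):
    """Cost of one maintenance event on a socket, following (A.1)."""
    return (f * c_lru + (1.0 - f) * c_repair
            + f * t_replace * v + (1.0 - f) * t_repair * v)

def socket_life_cycle_cost(event_kinds, parameters):
    """Sum (A.1) over a socket's maintenance events.

    event_kinds is a sequence of "scheduled" / "unscheduled" labels;
    parameters maps each label to its own f, costs, times and V.
    """
    return sum(socket_event_cost(**parameters[kind]) for kind in event_kinds)

# Hypothetical parameter sets: f and V differ between scheduled and unscheduled events.
parameters = {
    "scheduled": dict(f=0.5, c_lru=10000.0, c_repair=4000.0,
                      t_replace=2.0, t_repair=6.0, v=500.0),
    "unscheduled": dict(f=1.0, c_lru=10000.0, c_repair=4000.0,
                        t_replace=2.0, t_repair=6.0, v=5000.0),
}
total = socket_life_cycle_cost(["scheduled", "scheduled", "unscheduled"], parameters)
```

With these values each scheduled event costs 9000 and the unscheduled event costs 20,000, so the three events total 38,000.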

Appendix B. Example PHM business case construction

Commitments to implement and support PHM approaches cannot be made without the development of a supporting business case justifying them to management. One important attribute of most business cases is the development of an economic justification. The economic justification of PHM has been previously discussed by several authors, e.g. [20–24]. These previous business case discussions provide useful insight into the issues influencing the implementation, management, and return associated with PHM and present some application-specific results, but do not approach the problem from a simulation or stochastic view. This appendix presents an example of the use of the discrete event simulation model to contribute to business case development. The objective of this example is to determine what the cost of implementing a PHM structure has to be in order for it to be viable from a life cycle cost viewpoint. Consider a single socket containing instances of an LRU characterized by the TTF distribution shown in Fig. B.1. The sockets that are occupied by instances of this LRU see a range (distribution) of environmental stress profiles. Fig. B.1 shows a fixed maintenance interval analysis performed using the process described in Section 2.1. We will use the best fixed interval maintenance solution as the metric that the PHM approaches must achieve.

[Plot, fixed interval result (various environmental stress profile distributions): effective life cycle cost (per socket) versus maintenance interval (operational hours). Series: no environmental variation; ±1000 h triangular; ±5000 h triangular; ±10,000 h triangular; ±10,000 h uniform.]

Fig. B.1. Single socket timeline example. This result indicates that the mean unscheduled maintenance cost per socket is $61,696 (10,000 sockets simulated).

[Left plot: effective life cycle cost (per socket) versus safety margin/prognostic distance (operational hours). Legend: PHM structures = $0 and PHM structures = $1000/LRU for each approach; label: LRU independent. Right plot: effective life cycle cost (per socket) versus cost of PHM structures (per LRU). Labels: LRU independent; no cost PHM structures; $1000/LRU PHM structures; effective life cycle cost per socket using fixed interval maintenance.]

Fig. B.2. The effective life cycle cost associated with precursor to failure and LRU independent PHM approaches. The left plot shows PHM approaches costing either $0 or $1000 per LRU instance to implement. Note, from Table 1, $1000/LRU = 10% of the LRU's recurring cost. The right plot shows a generalization of the left plot where just the lowest life cycle cost solutions are plotted. The small arrows indicate points on the left plot that are mapped to (common with) points on the right plot (10,000 sockets simulated).

The second step in the business case construction is shown in Fig. B.2. In Fig. B.2 we show four example PHM approaches, two with a precursor to failure approach (costing either $0 or $1000 per LRU instance to implement) and two with an LRU independent approach (again, costing either $0 or $1000 per LRU instance to implement). The right side of Fig. B.2 also generalizes the result by considering a continuum of PHM implementation costs and plotting the minimum life cycle cost solution corresponding to each one. Fig. B.2 also shows the life cycle cost of the best fixed interval maintenance solutions for the ±5000 h triangularly distributed environmental stress distribution case from Fig. B.1. The intersection of the fixed interval maintenance line and the precursor to failure and LRU independent lines in the graph on the right side of Fig. B.2 tells us what we can spend on the PHM approaches. The precursor to failure approach is economically practical if it can be implemented for <$730/LRU (7.3% of recurring LRU cost) and LRU independent methods are practical if they can be implemented for <$400/LRU (4% of recurring LRU cost). It should be stressed that this is an application-specific result.
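The break-even reading of the right side of Fig. B.2 can be sketched as a simple interpolation: given simulated minimum life cycle costs at several PHM implementation costs, find the implementation cost at which the PHM curve reaches the best fixed interval cost. The function name and the data points below are hypothetical illustrations, not values from the paper, and the sketch assumes the life cycle cost does not decrease as the PHM implementation cost rises.

```python
def break_even_phm_cost(phm_costs, life_cycle_costs, fixed_interval_cost):
    """Return the PHM cost per LRU at which the PHM life cycle cost equals the
    best fixed interval life cycle cost, or None if the curves do not cross.

    phm_costs must be sorted in increasing order; life_cycle_costs[k] is the
    minimum life cycle cost found at phm_costs[k]. Linear interpolation is
    used between neighboring points.
    """
    points = list(zip(phm_costs, life_cycle_costs))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 <= fixed_interval_cost <= y1:
            if y1 == y0:
                return x0
            return x0 + (fixed_interval_cost - y0) * (x1 - x0) / (y1 - y0)
    return None

# Hypothetical data: the PHM curve rises from 32,000 to 36,000 as the PHM
# structure cost goes from $0 to $1000 per LRU; fixed interval costs 34,000.
limit = break_even_phm_cost([0.0, 1000.0], [32000.0, 36000.0], 34000.0)
```

For these made-up numbers the crossing is at $500/LRU; a PHM approach that can be implemented for less than that would beat the fixed interval solution in this illustration.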

References

[1] Engel S, Gilmartin B, Bongort K, Hess A. Prognostics, the real issues involved with predicting life remaining. In: Proceedings of the IEEE aerospace conference; 2000. p. 457–69.

[2] Ramakrishnan A, Pecht M. A life consumption monitoring methodology for electronic systems. IEEE Trans Compon Pack Technol 2003:625–34.

[3] Vichare N, Rodgers P, Pecht M. Methods for binning and density estimation of load parameters for prognostics and health management. Int J Performability Eng 2006;2(2).

[4] Wang W. A model to determine the optimal critical level and the monitoring intervals in condition-based maintenance. Int J Prod Res 2000;38(6):1425–36.

[5] Kacprzynski GJ, Gumina M, Roemer MJ, Caguiat DE, Galie TR, McGroarty JJ. A prognostic modeling approach for predicting recurring maintenance for shipboard propulsion systems. In: Proceedings of the ASME Turbo Expo, June; 2001.

[6] Mishra S, Pecht M, Goodman D. In situ sensors for product reliability monitoring. Proc SPIE 2002;4755:10–9.

[7] Kelkar N, Dasgupta A, Pecht M, Knowles I, Hawley M, Jennings D. 'Smart' electronic systems for condition-based health management. Qual Reliab Eng Int 1997;13:3–7.

[8] Humphrey D, Lorenson D, Shawlee W, Sandborn P. Aging aircraft usable life and wear-out issues. In: Proceedings of the World Aviation Congress (SAE Technical Paper: 2002-1-3013), Phoenix, AZ, November; 2002.

[9] Pecht M, editor. Product reliability, maintainability, and supportability handbook. New York, NY: CRC Press; 1995.

[10] Ben-Daya M, Duffuaa SO, Raouf A. Maintenance, modeling and optimization. Boston, MA: Kluwer Academic Publishers; 2000.

[11] Valdez-Flores C, Feldman R. A survey of preventative maintenance models for stochastically determining single-unit systems. Nav Res Log 1989;36:419–46.

[12] Cho D, Parlar M. A survey of preventative maintenance models for multi-unit systems. Eur J Oper Res 1991;51:1–23.

[13] Barros A, Berenguer C, Grall A. Optimization of replacement times using imperfect monitoring information. IEEE Trans Reliab 2003;52(4):523–33.

[14] Heinrich G, Jensen U. Bivariate lifetime distributions and optimal replacement. Math Method Oper Res 1996;44:31–47.

[15] Raivio T, Kuumola E, Mattila VA, Virtanen K, Hamalainen RP. A simulation model for military aircraft maintenance and availability. In: Proceedings of the European simulation multiconference, September; 2001. p. 190–4.

[16] Warrington L, Jones JA, Davis N. Modelling of maintenance, within discrete event simulation. In: Proceedings of the reliability and maintainability symposium (RAMS); 2002. p. 260–5.

[17] Bazargan M, McGrath RN. Discrete event simulation to improve aircraft availability and maintainability. In: Proceedings of the reliability and maintainability symposium (RAMS), Tampa, FL, January; 2003. p. 63–7.

[18] Lin Y, Hsu A, Rajamani R. A simulation model for field service with condition-based maintenance. In: Proceedings of the winter simulation conference; 2002. p. 1885–90.

[19] Sandborn P. A decision support model for determining the applicability of prognostic health management (PHM) approaches to electronic systems. In: Proceedings of the reliability and maintainability symposium (RAMS), Arlington, VA, January; 2005.

[20] Spare JH. Building the business case for condition-based maintenance. In: Proceedings of the IEEE/PES transmission and distribution conference and exposition, November; 2001. p. 954–6.

[21] Vachtsevanos G. Cost and complexity trade-offs in prognostics. In: Proceedings of the NDIA conference on intelligent vehicles; 2003.

[22] Goodman DL, Wood S, Turner A. Return-on-investment (ROI) for electronic prognostics in MIL/AERO systems. In: Proceedings of the AutoTestCon; 2005.

[23] Banks J, Reichard K, Crow E, Nickell K. How engineers can conduct cost-benefit analysis for PHM systems. In: Proceedings of the IEEE aerospace conference; 2005.

[24] Hecht H. Prognostics for electronic equipment: An economic perspective. In: Proceedings of the reliability and maintainability symposium (RAMS), Newport Beach, CA, January; 2006.
