The Handbook of Applied Bayesian Analysis, Eds: Tony O’Hagan & Mike West, Oxford UniversityPress
Bayesian analysis and decisions in nuclear power plant
maintenance
Elmira Popova, David Morton, Paul Damien
The University of Texas at Austin
Austin, TX 78712, USA
Tim Hanson
University of Minnesota
Minneapolis, MN 55455, USA
1 Introduction
It is somewhat true that in most mainstream statistical literature the transition from in-
ference to a formal decision model is seldom explicitly considered. Since the Markov chain
Monte Carlo (MCMC) revolution in Bayesian statistics, focus has generally been on the
development of novel algorithmic methods to enable comprehensive inference in a variety of
applications, or to tackle realistic problems which naturally fit into the Bayesian paradigm.
In this chapter, the primary focus is on the formal decision or optimization model. The
statistical input needed to solve the decision problem is tackled via Bayesian parametric
and semi-parametric models. The optimization/Bayesian models are then applied to solving
an important problem in a nuclear power plant system at the South Texas Project (STP)
Electric Generation Station.
STP is one of the newest and largest nuclear power plants in the US, and is an industry
leader in safety, reliability and efficiency. STP has two nuclear reactors that together can
produce 2,500 megawatts of electric power. The reactors went online in August 1988 and
June 1989, and are the sixth and fourth youngest, respectively, of more than 100 such
reactors operating nationwide. STP consistently leads all US nuclear plants in the amount
1
of electricity its reactors produce.
The STP Nuclear Operating Company (STPNOC) manages the plant for its owners,
who share its energy in proportion to their ownership interests, which as of July 2004 are:
Austin Energy, The City of Austin, 16%, City Public Service of San Antonio, 40%, and
Texas Genco LP, 44%. All decisions that the board of directors make are of finite time
since every nuclear reactor is given a license to operate. In the case of STP, 25 and 27 years
remain on the licenses for the two reactors, respectively.
Equipment used to support production in long-lived (more than a few years) installations
such as those at STP requires maintenance. While maintenance is being carried out, the
associated equipment is typically out of service and the system operator may be required to
reduce or completely curtail production. In general, maintenance costs include labor, parts
and overhead. In some cases, there are safety concerns associated with certain types of plant
disruptions. Overhead costs include hazardous environment monitoring and mitigation,
disposal fees, license fees, indirect costs associated with production loss such as wasted feed
stock, and so forth.
STPNOC works actively with the Electric Power Research Institute (EPRI) to develop
robust methods to plan maintenance activities and prioritize maintenance options. Ex-
isting nuclear industry guidelines, Bridges and Worledge (August 2002); Gann and Wells
(2004); Hudson and Richards (2004); INPO (2002), recommend estimating reliability and
safety performance based on evaluating changes taken one at a time, using risk importance
measures supplemented by heuristics to prioritize maintenance. In our view, the nuclear
industry can do better. For example, Liming et al. (2003) propose instead investigating
“packages” of changes, and in their study of a typical set of changes at STPNOC, the pro-
jected plant reliability and nuclear safety estimates were found to be significantly different
compared to changes evaluated one at a time. STPNOC is working with EPRI to improve
its preventive maintenance reliability database with more accurate probability models to
aid in better quantification of preventive maintenance.
Here, we consider a single-item maintenance problem. That said, this chapter’s model
forms the basis for higher fidelity multi-item maintenance models that we are currently
2
investigating. Furthermore, as we describe below, even though the model is mathematically
a single-item model, in practice it could easily be applied to multiple items within an
equipment class.
Equipment can undergo either preventive maintenance (PM) or corrective maintenance
(CM). PM can include condition-based, age-based, or calendar-based equipment replace-
ment or major overhaul. Also included in some strategies is equipment redesign. In all
these categories of PM, the equipment is assumed to be replaced or brought back to the as-
purchased condition. CM is performed when equipment has failed unexpectedly in service
at a more-or-less random time (i.e., the out-of-service time is not the operator’s choice as
in PM).
Because such systems are generally required to meet production-run or calendar-based
production goals, as well as safety goals, installation operators want assurance that the plant
equipment will support these goals. On the other hand, the operator is usually constrained
by a limited maintenance budget. As a consequence, operators are faced with the important
problem of balancing maintenance costs against production and safety goals.
Thus, the primary decision problem with which we are confronted is interlinked, and
could be stated in the form of a question: “How should one minimize the total cost of
maintaining a nuclear power plant while ensuring that its reliability meets the standards
defined by the United States Department of Energy?”
There is an enormous literature on optimal maintenance policies for a single item that
dates back to the early 1950s. The majority of the work covers maintenance optimization
over an infinite horizon, see Valdez-Florez and Feldman (1989) for an extensive review.
The problem that we address in this chapter is over a finite planning horizon, which comes
from the fact that every nuclear power plant has a license to operate that expires in a
finite predefined time. In addition the form of the policy is effectively predefined by the
industry as a combination of preventive and corrective maintenance, as we describe below.
Marquez and Heguedas (2002) present an excellent review of the more recent research on
maintenance policies and solve the problem of periodic replacement in the context of a semi-
Markov decision processes methodology. Su and Chang (2000) find the periodic maintenance
3
policies that minimize the life cycle cost over a predefined finite horizon.
A review of the Bayesian approaches to maintenance intervention is presented in Wilson
and Popova (1998). Chen and Popova (2000) propose two types of Bayesian policies that
learn from the failure history and adapt the next maintenance point accordingly. They
find that the optimal time to observe the system depends on the underlying failure dis-
tribution. A combination of Monte Carlo simulation and optimization methodologies is
used to obtain the problem’s solution. In Popova (2004), the optimal structure of Bayesian
group-replacement policies for a parallel system of n items with exponential failure times
and random failure parameter is presented. The paper shows that it is optimal to observe
the system only at failure times. For the case of two items operating in parallel the exact
form of the optimal policy is derived.
With respect to the form of the maintenance policy that we consider, commercial nuclear
power industry practice establishes a set PM schedule for many major equipment classes.
In an effort to avoid mistakenly removing from service sets of equipment that would cause
production loss or safety concerns, effectively all equipment that can be safely maintained
together are grouped and assigned a base calendar frequency. The grouping policy ensures
PM (as well as most CM) will occur at the same time for equipment within a group. The
base frequency is typically either set by the refueling schedule (equipment for which at-
power maintenance is either impossible or undesirable) or calendar-based. The current
thinking for PM is typified by, for example, the INPO guidance, INPO (2002), whereby a
balance is sought between maintenance cost and production loss. By properly taking into
account the probability of production loss, the cost of lost production, the CM cost and PM
cost, the simple model described in this chapter captures industry practice. Not accounted
for in the model is the probability and cost of production loss due to PM during at-power
operation as well as potential critical path extension for PM during scheduled outages (such
as refueling outage). This is partly justified by the fact that PM is only performed during
the equipment’s group assigned outage time (unlike CM when the equipment outage time
is not predetermined) and because outage planning will (generally) assure PM is done off
critical path. In practice, there is normally one or two major equipment activities (main
4
turbine and generator, reactor vessel inspection, steam generator inspection, main condenser
inspection) that, along with refueling, govern critical path during planned outages.
2 Maintenance Model
The model we develop has the following constructs. The problem has a finite horizon of
length L (e.g., 25 or 27 years). We consider the following (positive) costs: Cpm – preventive
maintenance cost, Ccm – corrective maintenance cost, and Cd – downtime cost, which
includes all lost production costs due to a disruption of power generation. Let N(t) be the
counting process for the number of failures in the interval (0, t).
Let the random time to failure of the item from its as-new state be governed by dis-
tribution F with density f . Further assume that each failure of the item causes a loss of
production (i.e., a plant trip) with probability p and in that case a downtime cost, Cd > Ccm,
is instead incurred (this cost can include Ccm, if appropriate).
We consider the following form of a maintenance policy, which we denote (P1):
(P1): Bring the item to the “as-good-as-new” state every T units of time (preventive main-
tenance) at a cost of Cpm. If it fails meanwhile then repair it to the “as-good-as-old” state
(corrective maintenance) for a cost of Ccm or Cd, depending on whether the failure induces
a production loss. The optimization model has a single decision variable T , which is the
time between PMs, i.e., we assume constant interval lengths between PMs. The goal is to
find T ∈ A ⊂ [0, L] that minimizes the total expected cost, i.e.,
z(T ) = minT∈A
[CpmdL/T e+ {pCd + (1− p)Ccm} bL/T cE {N(T )}
+Cpm + {pCd + (1− p)Ccm} bL/T cE{N(L− T bL
Tc)}]
, (1)
where A = {i − integer, i ∈ [0, L]}, b·c is the “floor” (round-down to nearest integer)
operator, d·e is the “ceiling” (round-up to nearest integer) operator, and E {N(T )} is the
expected number of failures, taken with respect to the failure distribution, in the interval
(0, T ). Barlow and Hunter (1960), showed that for the above defined policy, the number
of CMs in an interval of length t follows a nonhomogeneous Poisson process with expected
5
number of failures in the interval (0, t),
E{N(t)} =∫ t
0q(u)du (2)
where q(u) = f(u)1−F (u) is the associated failure rate function. First we will describe an
algorithm to find the optimal T when the failure distribution is IFR (increasing failure
rate). Then we model the failure rate nonparametrically.
3 Optimization Results
The development of successful optimization algorithms in the IFR context detailed in the
previous section requires a closer examination of the objective function z(T ). This exam-
ination will culminate in certain key conditions that will guarantee a solution to the cost
minimization problem posed in equation (1). To this end, consider the following three
propositions in which the form of the IFR function is left unspecified, that is, the theory
below will hold for any arbitrary choice of an IFR function; for example, one could choose
a Weibull failure rate.
The proofs of the following propositions are given in Appendix B. Also the optimization
algorithm corresponding to these propositions, and which was used in the decision analysis
aspect of this chapter appears in Appendix B.
Proposition 1
Assume that we follow maintenance policy (P), and that the time between failures is a
random variable with an IFR distribution, i.e., the failure rate function q(t) is increasing.
Then, the objective function z(T ) is:
(i) lower semicontinuous with discontinuities at T = L/n, n = 1, 2, . . ., and
(ii) increasing and convex on each interval ( Ln+1 ,
Ln ), n = 1, 2, . . ..
Proposition 2
Let D = {d : d = L/n, n = 1, 2, . . .} denote the set of discontinuities of z(T ) (cf. Proposi-
tion 1). Then,
(i) zc(T ) is quasiconvex on [0, L].
Furthermore if d ∈ D then
6
(ii) zc(d) = z(d), and
(iii) limT→d−
[z(T )− zc(T )] = Cpm.
Proposition 3
Let
T ∗c ∈ arg minT∈[0,L]
zc(T )
and let n∗ be the positive integer satisfying Ln∗+1 ≤ T
∗c ≤ L
n∗ . Then,
T ∗ ∈ arg minT∈{ L
n∗+1, Ln∗ }
zc(T )
solves minT∈[0,L]
z(T ).
4 Data and Bayesian Models
The assumption of constant failure rates with the associated exponential failure distribu-
tion and homogeneous Poisson process pervades analysis in today’s nuclear power industry.
Often times, the data gathered in industry are the number of failures in a given time pe-
riod, which is sufficient for exponential (or constant) failure times. In general, we cannot
expect to have a “complete data set” but rather a collection of failure times and “success”
times (a.k.a. right-censored times). Recognizing the limitations of the exponential failure
rate model, the Risk Management Department at STP has implemented the procedure now
recommended by the U.S. Department of Energy, see Blanchard (1993a). The constant
failure rate assumption is dropped in favor of a Bayesian scheme in which the failure rate is
assumed to be a gamma random variable, which is the conjugate prior for the exponential
distribution, and an updating procedure is defined. This methodology requires gathering,
at a local level, the number of failed items in a given month only. The mean of the posterior
distribution, combining the local and industry-wide history, of the failure rate is then being
used as a forecast of the frequency of failures for the upcoming month.
The hazard function associated with a gamma random variable is monotonically in-
creasing or decreasing to unity as time tends to infinity. However, a Weibull family allows
hazards to decay to zero or monotonically increase, and is therefore a common choice in
7
reliability studies when one is uncertain a priori that the instantaneous risk of compo-
nent failure becomes essentially constant after some time, as in the case of exponential and
gamma random variables. Yu et al. (2004a, 2005) apply Bayesian estimation procedures
when the failure rate distribution is assumed Weibull. Thus a Weibull failure rate model
is an improvement on the constant failure rate model. But parametric models such as the
exponential, gamma, and Weibull models all imply that the hazard rate function is uni-
modal and skewed right. This has serious consequences for the estimation of the expected
value in equation (1) if the true underlying failure rate is multi-modal with varying levels
of skewness. A semi-parametric approach that relaxes such stringent requirements on data
while retaining the flavor of these parametric families is therefore warranted.
We first describe the nuclear power plant data to be used in the Bayesian analysis.
4.1 Data
The failure data analyzed in this chapter are for 2 different systems, the Auxiliary Feedwater
System, part of the safety system, and the Electro-Hydraulic Control System, part of the
nuclear power generation reactor.
The primary function of the Auxiliary Feedwater System (AF) is to supply cooling water
during emergency operation. The cooling water supplied is boiled off in Steam Generators to
remove decay heat created by products of nuclear fission. There are four pumps provided in
the system, three electric motor driven and one turbine driven. Each of these pumps supplies
its own Steam Generator although, if required, cross connection capability is provided such
that any steam generator could be cooled by any pump. In addition, isolation valves, flow
control valves, and back flow prevention valves called check valves are installed to control
flow through the system and to provide capability to isolate the pumping system, if needed,
to control an accident. A simplified schematic diagram is shown in Figure 1.
We are investigating the performance of the following five groups:
• the electric motor driven pumps (six in total, three in each of two STP Units), labeled
GS1
8
• the flow control valves for all the pumps (eight in total, four in each of two STP
Units), labeled GS2
• the cross connect valves (eight in total, four in each of two STP Units), labeled GS3
• the accident isolation valves’ motor driver (eight in total, four in each of two STP
Units), labeled GS4
• the check valves (eight in total, four in each of two STP Units), labeled GS5.
We have failure time observations for the 5 different groups described above starting
from 1988 until May 2008. There are 56 exact observations for GS1, 2144 for GS2, 202 for
GS3, 232 for GS4, and 72 for GS5.
Figure 1: Diagram of the Auxiliary Feedwater System
The Electro-Hydraulic Control System (EHC) consists of the high pressure Electro-
Hydraulic fluid system and the electronic controllers, which are used for the control of the
main electrical generating steam turbine as well as the steam turbines that drive the normal
steam generator feed water supply pump. The high pressure steam used to turn the turbines
described above is throttled for speed control and shut off with steam valves using hydraulic
operators (pistons). The control functions include limiting the top speed of the turbines
up to and including shut down of speeds become excessive. EHC provides high pressure
hydraulic oil which is used as the motive force and the electrical signals that adjust the
9
speed control and steam shut-off valves for the turbines. Because only one main electrical
generating steam turbine is provided for each plant, production of electricity from the plant
is stopped or reduced for some failures in the EHC system.
We would like to study the performance of:
• Particular pressure switches that control starting and stopping of the hydraulic fluid
supply pumps; the turbine shutoff warning; and temperature control of steam to the
main turbine from a heat exchanger, labeled GN1.
• Oil cleaning filters located at the outlet of the hydraulic fluid supply pumps, labeled
GN2.
• Solenoid valves (referred to as servo valves) which control the operation of the steam
valves to the main electrical generating steam turbine, labeled GN3.
• The two hydraulic supply pumps (one is required for the system to operate), labeled
GN4.
• Solenoid valves which, when operating properly, respond to certain failures and stop
the main electrical generating steam turbine as necessary to prevent damage, labeled
GN5.
We have failure time observations for the five different groups described above starting
from 1988 until May 2008. There are 221 exact and 1 censored observations for GN1, 243
exact and 1 censored for GN2, 133 exact for GN3, 223 exact for GN4, and 91 exact for
GN5.
4.2 Bayesian Parametric Model
Here we present a Bayesian model for the failure times assuming that the lifetime distribu-
tion is Weibull with random parameters λ and β.
The sampling plan is as follows: If we have n items under observation,s of which have
failed at ordered times T1, T2, . . . , Ts, then n− s have operated without failing. If there are
10
no withdrawals then the total time on test is: ω = nT βs , which is also the sufficient statistic
for estimating λ (also known as the rescaled total time on test).
Assume that λ has an inverted gamma prior distribution with hyper-parameters ν0, µ0
and β has uniform prior distribution with hyper-parameters α0, β0.
We use the generic information (lognormal prior for λ) to assess the values of α0 and β0.
First we calculate the mean and variance for the lognormal distribution with parameters:
M = lnMn − S2/2 (3)
S2 =(
lnEF1.645
)2
(4)
where Mn is the mean and EF is the error factor as derived in Blanchard (1993b). Then
we compute the parameters of the Gamma prior distribution, α0, β0:
α0 = Mn/β0 (5)
β0 = e2MeS2(eS
2 − 1)/Mn. (6)
We will follow the development from Martz and Waller (1982). The posterior expecta-
tions of λ and β are:
E[λ|z] =J2
(s+ ν0 − 1)J1(7)
E[β|z] =J3
J1(8)
where z is a summary of the sample evidence, ν =∏si=1 Ti, ω1 = nTs + µ0, J1 =∫ β0
α0
[βsνβ
ωs+ν01
]dβ, J2 =
∫ β0
α0
[βsνβ
ωs+ν0−11
]dβ, and J3 =
∫ β0
α0
[βs+1νβ
ωs+ν01
]dβ.
We use numerical integration to solve the above integrals.
The choice of the hyper-parameters values is not random. We used the values given in
the DOE database (see Blanchard, 1993b) for the values of the inverted gamma parameters,
and empirically assessed the parameters of the uniform prior distribution, for details see Yu
et al. (2004b). Table 1 shows the Bayes estimates of the model parameters for each group
of data.
In the Appendix we offer screen shots that are used by the engineers at STP to implement
the above estimation model in combination with the optimization model described in §3 on
a routine basis.
11
Table 1: Bayes estimates of the parametric Bayes modelGroup λ β
GN1 0.000152 4.763118GN2 0.000448 1.994206GN3 0.000294 9.980280GN4 0.001594 1.006348GN5 0.000298 5.931251GS1 0.000313 9.910241GS2 0.001499 0.348347GS3 0.000932 0.655784GS4 0.001667 0.723488GS5 0.000308 11.109310
4.3 Bayesian Semi-parametric Model
The failure data is modeled using an accelerated failure time (AFT) specification, along
with the Cox proportional hazard regression model:
Sx(t) = S0(ex′βt),
where x is a p dimensional vector of covariates, Sx(·) the corresponding reliability function,
and S0 is the “baseline” reliability function.
The mixture of Polya trees models (MPTs) of Hanson (2006) are used to model the
reliability or survival data. These models are “centered” at the corresponding parametric
models but allow significant data-driven deviations from parametric assumptions while re-
taining predictive power associated with parametric modeling. This point will be elaborated
in Appendix A. This more flexible approach in essence assumes a nonparametric model for
S0; i.e., an MPT prior is placed on S0. For the power plant data considered in this chapter,
without loss of generality, we take the first group from each system to be represented by
S0. Then, the covariates are indicator variables corresponding to each of the remaining five
groups in the data. Note that, if needed, you could easily incorporate other continuous,
time-dependent covariates as well in the above formulation. Details of the MPT model are
provided in Appendix A.
The mixture of Polya trees (MPT) prior provides an intermediate choice between a
12
strictly parametric analysis and allowing S0 to be completely arbitrary. In some ways
it provides the best of both worlds. In areas where data are sparse, such as the tails,
the MPT prior places relatively more posterior mass on the underlying parametric family
{Gθ : θ ∈ Θ}. In areas where data are plentiful the posterior is more data driven; and
features not allowed in the strictly parametric model, such as left-skew and multimodality,
become apparent. The user-specified weight w controls how closely the posterior follows
{Gθ : θ ∈ Θ} with larger values of w yielding inference closer to that obtained from the
underlying parametric model. Hanson (2006) describes priors for w; we simply fix w to be
some small value, typically w = 1.
Assume standard, right-censored reliability data D = {(xi, ti, δi)}ni=1, and let Di denote
the ith triple (xi, ti, δi). Let Ti ∼ Sxi(·). As usual, δi = 0 indicates that ti is a censoring
time, Ti > ti, and δi = 1 denotes that ti is a survival time, Ti = ti.
Given S0 (through Y and θ) and β, the survival function for covariates x is
Sx(t|Y, θ, β) = S0(ex′βt|Y, θ), (9)
and the pdf is
fx(t|Y, θ, β) = ex′βf0(ex
′βt|Y, θ), (10)
where S0(t|Y, θ) and f0(t|Y, θ) are given by (16) and (17).
Assuming uninformative censoring, the likelihood for right censored data is then given
by
L(Y, θ, β) =n∏i=1
fxi(ti|Y, θ, β)δiSxi(ti|Y, θ, β)1−δi . (11)
The MCMC algorithm alternately samples [β, θ|Y,D] and [Y|β, θ,D]. The vector [β, θ|Y,D]
is efficiently sampled with a random walk Metropolis Hastings step (Tierney, 1994). A pro-
posal that has worked very well in practice is obtained from using the large sample estimated
covariance matrix from fitting the parametric log-logistic model via maximum likelihood.
A simple Metropolis-Hastings step for updating the components (Yε0, Yε1) one at a time
first samples a candidate (Y ∗ε0, Y∗ε1) from a Dirichlet(mYε0,mYε1) distribution, where m > 0,
13
typically m = 20. This candidate is accepted as the “new” (Yε0, Yε1) with probability
ρ = min
{1,
Γ(mYε0)Γ(mYε1)(Yε0)mY∗ε0−wj2(Yε1)mY
∗ε1−wj2L(Y∗, θ, β)
Γ(mY ∗ε0)Γ(mY ∗ε1)(Y ∗ε0)mYε0−wj2(Y ∗ε1)mYε1−wj2L(Y, θ, β)
},
where j is the number of digits in the binary number ε0 and Y∗ is the set Y with (Y ∗ε0, Y∗ε1)
replacing (Yε0, Yε1). The resulting Markov chain has mixed reasonably well for many data
sets unless w is set to be very close to zero.
Figure 2: Estimated hazard curves for the AF system using the MPT model
Based on Figures 2, 3, and 4 it is clear that the assumption of IFR is inappropriate.
The 10 groups depict different hazard patterns that is nicely captured by the MPT model.
Also, the survival probabilities are widely different. For example, for the EHC system, the
group GN3 is by far the most reliable, whereas the group GS1 performs best for the AF
system.
In contrast, the parametric Weibull model produces different results. Figures 5 and
14
Figure 3: Estimated hazard curves for the EHC system using the MPT model
6 show the corresponding hazard curves using the Bayes point estimators of the model
parameters.
15
Figure 4: Estimated survival curves for the AF and EHC systems using the MPT model
Figure 5: Estimated hazard curves for the EHC system using the parametric Bayes model
16
Figure 6: Estimated hazard curves for the AF system using the parametric Bayes model
17
Appendix A
In this chapter we have introduced Bayesian decision analysis to enable engineers reach
an optimal decision in the context of maintaining nuclear power plants. Typical Bayesian
analysis usually stops at inference. However, as noted by De Groot, Smith, and Berger,
for Bayesian methods to gain acceptance, one needs to extend posterior inference to actual
decisions.
In this chapter we have used Polya trees and mixtures of Polya trees to carry out
inference. We provide some details of this methodology.
Extending the work of Lavine (1992, 1994), mixtures of Polya trees have been considered
by Berger and Guglielmi (2001), Walker and Mallick (1999), Hanson and Johnson (2002),
Hanson (2006), Hanson et al. (2006), and Hanson (2007).
The last four references deal with survival or reliability data. This prior is attractive in
that it is easily centered at a parametric family such as the class of Weibull distributions.
When sample sizes are small, posterior inference has much more of a parametric flavor due to
the centering family. When sample sizes increase, the data take over the prior and features
such as multi-modality and left skew in the reliability distributions are better modeled.
Consider a mixture of Polya trees prior on S0,
S0|θ ∼ PT (c, ρ,Gθ), (12)
θ ∼ p(θ), (13)
where (12) is shorthand for a particular Polya tree prior (Hanson and Johnson, 2002). We
briefly describe the prior but leave details to the references above.
Let J be a fixed, positive integer and let Gθ denote the family of Weibull cumula-
tive distribution functions, Gθ(t) = 1 − exp{−(t/λ)α} for t ≥ 0, where θ = (α, λ)′. The
distribution Gθ serves to center the distribution of the Polya tree prior. A Polya tree
prior is constructed from a set of partitions Πθ and a family A of positive real num-
bers. Consider the family of partitions Πθ = {Bθ(ε) : ε ∈⋃Jl=1{0, 1}l}. If j is the
base-10 representation of the binary number ε = ε1 · · · εk at level k, then Bθ(ε1 · · · εk) is
18
defined to be the interval (G−1θ (j/2k), G−1
θ ((j + 1)/2k)). For example, with k = 3, and
ε = 000, then j = 0 and Bθ(000) = (0, G−1θ (1/8)), and with ε = 010, then j = 2 and
Bθ(010) = (G−1θ (2/8), G−1
θ (3/8)), etc.
Note then that at each level k, the class {Bθ(ε) : ε ∈ {0, 1}k} forms a partition of
the positive reals and furthermore Bθ(ε1 · · · εk) = Bθ(ε1 · · · εk0)⋃Bθ(ε1 · · · εk1) for k =
1, 2, . . . ,M − 1. We take the family A = {αε : ε ∈⋃Mj=1{0, 1}j} to be defined by αε1···εk =
wk2 for some w > 0 (Walker and Mallick, 1999; Hanson and Johnson, 2002). As w tends to
zero the posterior baseline is almost entirely data-driven. As w tends to infinity we obtain
a fully parametric analysis.
Given Πθ and A, the Polya tree prior is defined up to level J by the random vectors
Y = {(Yε0, Yε1) : ε ∈⋃M−1j=0 {0, 1}j} through the product
S0{Bθ(ε1 · · · εk)|Y, θ} =k∏j=1
Yε1···εj , (14)
for k = 1, 2, . . . ,M , where we define S0(A) to be the baseline measure of any set A. The
vectors (Yε0, Yε1) are independent Dirichlet:
(Yε0, Yε1) ∼ Dirichlet(αε0, αε1), ε ∈M−1⋃j=0
{0, 1}j . (15)
The Polya tree parameters “adjust” conditional probabilities, and hence the shape of
the survival density f0, relative to a parametric centering family of distributions. If the
data are truly distributed Gθ observations should be on average evenly distributed among
partition sets at any level j. Under the Polya tree posterior, if more observations fall into
interval Bθ(ε0) ⊂ R+ than its companion set Bθ(ε1), the conditional probability Yε0 of
Bθ(ε0) is accordingly stochastically “increased” relative to Yε1. This adaptability makes
the Polya tree attractive in its flexibility, but also anchors the random S0 firmly about the
family {Sθ : θ ∈ Θ}.
Beyond sets at the level J in Πθ we assume S0|Y, θ follows the baseline Gθ. Hanson
and Johnson (2002) show that this assumption yields predictive distributions that are the
same as from a fully specified (infinite) Polya tree for large enough J ; this assumption also
19
avoids a complication involving infinite probability in the tail of the density f0|Y, θ that
arises from taking f0|Y, θ to be flat on these sets.
Define the vector of probabilities p = p(Y) = (p1, p2, . . . , p2J )′ as pj+1 = S0{Bθ(ε1 · · · εJ)|Y, θ} =∏Ji=1 Yε1···εi where ε1 · · · εJ is the base-2 representation of j, j = 0, . . . , 2J − 1. After simpli-
fication, the baseline survival function is,
S0(t|Y, θ) = pN[N − 2JGθ(t)
]+
2J∑j=N+1
pj , (16)
where N denotes the integer part of 2JGθ(t)+1 and where gθ(·) is the density corresponding
to Gθ. The density associated with S0(t|Y, θ) is given by
f0(t|Y, θ) =2J∑j=1
2Jpjgθ(t)IBθ(εJ (j−1))(t) = 2JpNgθ(t), (17)
where εJ(i) is the binary representation ε1 · · · εJ of the integer i and N as is in (16). Note
that the number of elements of Y may be moderate. This number is∑J
j=1 2j = 2J+1 − 2.
For J = 5, a typical level, this is 62, of which half need to be sampled in an MCMC scheme.
Appendix B: Implementation in STP
Proposition 1 Assume that we follow maintenance policy (P), and that the time between
failures is a random variable with an IFR distribution, i.e., the failure rate function q(t) is
increasing. Then, the objective function z(T ) is:
(i) lower semicontinuous with discontinuities at T = L/n, n = 1, 2, . . ., and
(ii) increasing and convex on each interval ( Ln+1 ,
Ln ), n = 1, 2, . . ..
Proof
We first show (ii). Assume that T ∈ ( Ln+1 ,
Ln ) for some n. Within this interval
z′(T ) ≡ dz(T )dT
= nC̄cm {q(T )− q(L− nT )]} .
Here, we have used q(t) = ddtE[N(t)] and the fact that bL/T c = n is constant within
this n-th interval. That z increases on this interval follows from the fact that z′(T ) is
positive since q is increasing and L − nT < T for T ∈ ( Ln+1 ,
Ln ). To show convexity, we
20
show that z′(T ) is increasing. Let T + ∆T ∈ ( Ln+1 ,
Ln ). Then, q(T + ∆T ) ≥ q(T ) and
q(L− n(T + ∆T )) ≤ q(L− nT ), and hence z′(T + ∆T ) ≥ z′(T ). This completes the proof
of (ii).
The convexity result for z(T ) from (ii) shows that T = L,L/2, L/3, . . . are the only
possible points of discontinuity in z(T ). The first term in z(T ) (i.e., the PM term) is lower
semicontinuous because of the ceiling operator. The second term (i.e., z(T ) when Cpm = 0)
is continuous in T . To show this it suffices to verify that
bL/T cE [N(T )] + E
[N
(L− T
⌊L
T
⌋)]has limit nE[N(d)] for both T → d− and T → d+, where d = L/n for any n = 1, 2, . . .. This
follows in a straightforward manner using the bounded convergence theorem, e.g., Cinlar
(1975, pp. 33). Hence, we have (i).
Let
zc(T ) = Cpm(L/T ) + C̄cm(L/T )E [N(T )] ,
i.e., zc(T ) is z(T ) with bL/T c and dL/T e replaced by L/T . The following proposition
characterizes zc(T ) and its relationship with z(T ).
Proposition 2
Let D = {d : d = L/n, n = 1, 2, . . .} denote the set of discontinuities of z(T ) (cf. Proposi-
tion 1). Then,
(i) zc(T ) is quasiconvex on [0, L].
Furthermore if d ∈ D then
(ii) zc(d) = z(d), and
(iii) limT→d−
[z(T )− zc(T )] = Cpm.
Proof
(i): We have
zc(T ) =Cpm + C̄cmE [N(T )]
T/L. (18)
The numerator of (18) is convex in T because E [N(T )] has increasing derivative q(T ). As
a result, zc(T ) has convex level sets and is hence quasiconvex.
21
(ii): The result is immediate by evaluating z and zc at points of D.
(iii): Let d ∈ D. Then, there exists and positive integer n with d = L/n. Note that due to
the ceiling operator, we have the following result
limT→d−
{⌈L
T
⌉− L
T
}= 1. (19)
Based on (19) and the definitions of z and zc it suffices to show
bL/T cE [N(T )] + E
[N
(L− T
⌊L
T
⌋)]has limit nE[N(d)] as T → d−. This was established in the proof of part (i) in Proposition
1.
Proposition 2 shows that at its points of discontinuity, our objective function z drops by
magnitude Cpm to agree with zc. Before developing the associated algorithm, the following
proposition shows how to solve a simpler variant of our model in which the set A of feasible
replacement intervals is replaced by [0, L].
Proposition 3
Let
T ∗c ∈ arg minT∈[0,L]
zc(T )
and let n∗ be the positive integer satisfying Ln∗+1 ≤ T
∗c ≤ L
n∗ . Then,
T ∗ ∈ arg minT∈{ L
n∗+1, Ln∗ }
zc(T )
solves minT∈[0,L]
z(T ).
Proof
Proposition 1 says that z(T ) increases on each interval ( Ln+1 ,
Ln ), and is lower semicon-
tinuous. As a result, minT∈[0,L] z(T ) is equivalent to minT∈D z(T ), where D = {d : d =
L/n, n = 1, 2, . . .} is the set of discontinuities of z. This optimization model is, in turn,
equivalent to minT∈D zc(T ) since, part (ii) of Proposition 2 establishes that z(T ) = zc(T )
for T ∈ D. Finally, since zc(T ) is quasiconvex (see Proposition 2 part (i)), we know zc(T )
is nondecreasing on [T ∗c , L] and nonincreasing on [0, T ∗c ]. Proposition 3 follows.
22
The Optimization Algorithm
Proposition 3 shows how to solve minT∈[0,L] z(T ). The key idea is to first solve the
simpler problem minT∈[0,L] zc(T ), which can be accomplished efficiently, e.g., via a Fibonacci
search. Let T ∗c denote the optimal solution to the latter problem. Then, the optimal solution
to the former problem is either the first point in D to the right of T ∗c or the first point in
D to the left of T ∗c , whichever yields a smaller value of z(T ) (or zc(T ) since they are equal
on D).
As described above, our model has the optimization restricted over a finite grid of points
A ⊂ [0, L]. Unfortunately, it is not true that the optimal solution is given by either the
first point in A to the right of T ∗c or the first point in A to the left of T ∗c . However, the
results of Proposition 1, part (ii) establish that within the set A ∩ ( Ln+1 ,
Ln ) we can restrict
attention to the left-most point (or ignore the segment if the intersection is empty). To
simplify the discussion that follows we will assume that A has been redefined using this
restriction. Furthermore, given a feasible point TA ∈ A with cost z(TA) we can eliminate
from consideration all points in A to the right of min{T ∈ D : T ≥ T ∗c , zc(T ) ≥ z(TA)}
and all points in A to the left of max{T ∈ D : T ≤ T ∗c , zc(T ) ≥ z(TA)}. This simplification
follows from the fact that z increases on each of its intervals and is equal to the quasiconvex
zc on D.
Our algorithm therefore consists of:
• Step 0: Let T ∗c ∈ arg minT∈[0,L] zc(T ).
• Step 1: Let T ∗ ∈ A be the first point in A to the right of T ∗c and let z∗ = z(T ∗). Let
TA = T ∗.
• Step 2: Increment TA to be the next point in A to the right. If z(TA) < z∗ then
z∗ = z(TA) and T ∗ = TA. Let d be the next point to the right of TA in D. If
zc(d) ≥ z∗ then let TA to be the first point in A to the left of T ∗c and proceed to the
next step. Otherwise, repeat this step.
• Step 3: If z(TA) < z∗ then z∗ = z(TA) and T ∗ = TA. Let d be the next point to the
23
left of TA in D. If zc(d) ≥ z∗ then output T ∗ and z∗ and stop. Otherwise, increment
TA to be the next point in A to the left and repeat this step.
Step 0 solves the simpler continuous problem for T ∗c , as described above and Step 1
simply finds an initial candidate solution. Then, Steps 2 and 3, respectively, increment to
the right and left of T ∗c , moving from the point of A in, say, A ∩ ( Ln+1 ,
Ln ) to the point (if
any) in the next interval. Increments to the right and to the left stop when further points
in that direction are provably suboptimal. The algorithm is ensured to terminate with an
optimal solution.
Acknowledgements
The authors would like to thank the OR/IE graduate students Alexander Galenko and
Dmitriy Belyi for their help in proving the propositions and computational work. Thank
you to Ernie Kee for providing the data and helping with the nuclear engineering part of
the paper. This research has been partially supported by the Electric Power Research Insti-
tute under contract EP-P22134/C10834, STPNOC under grant B02857, and the National
Science Foundation under grants CMMI-0457558 and CMMI-0653916.
24
References
Barlow, R., Hunter, L., 1960. Optimum preventive maintenance policies. Operations Re-
search 8, 90–100.
Berger, J. O., Guglielmi, A., 2001. Bayesian testing of a parametric model versus nonpara-
metric alternatives. Journal of the American Statistical Association 96, 174–184.
Blanchard, A., 1993a. Savannah river site generic data base development. Tech. Rep. WSRC-
TR 93-262, National Technical Information Service, U.S. Department of Commerce, 5285
Port Royal Road, Springfield, VA 22161, Savannah River Site, Aiken, South Carolina
29808.
Blanchard, A., 1993b. Savannah river site generic data base development. Tech. Rep.
WSRC-TR 93-262, National Technical Information Service, U.S. Department of Com-
merce, 5285 Port Royal Road, Springfield, VA 22161, Savannah River Site, Aiken, South
Carolina 29808.
Bridges, M., Worledge, D., August 2002. Reliability and risk significance for maintenance
and reliability professionals at nuclear power plants. Tech. rep., EPRI Technical Report
1007079.
Chen, T.-M., Popova, E., 2000. Bayesian maintenance policies during a warranty period.
Communications in Statistics - Stochastic Models 16 (1), 121–142.
Cinlar, E., 1975. Introduction to stochastic processes. Prentice Hall.
Gann, C., Wells, J., 2004. Work Process Program. STPNOC plant procedure 0PGP03-ZA-
0090, Revision 28.
Hanson, T., 2006. Inferences on mixtures of finite Polya tree models. Journal of the Amer-
ican Statistical Association, 101, 1548–1565.
Hanson, T., 2007. Polya trees and their use in reliability and survival analysis. In: Ency-
clopedia of Statistics in Quality and Reliability. John Wiley & Sons, in press.
25
Hanson, T., Johnson, W., Laud, P., 2006. A semiparametric accelerated failure time model
for survival data with time dependent covariates. In: Upadhyay, S., Singh, U., Dey,
D. (Eds.), Bayesian Statistics and its Applications. New Delhi:Anamaya Publishers, pp.
254–269.
Hanson, T., Johnson, W. O., 2002. Modeling regression error with a mixture of Polya trees.
Journal of the American Statistical Association 97, 1020–1033.
Hudson, E., Richards, D., 2004. Configuration Risk Management Program. STPNOC plant
procedure 0PGP03-ZA-0091, Revision 6.
INPO, November 2002. Equipment Reliability Process Description. Institute of Nuclear
Power Operations process description, AP 913, Revision 1.
Lavine, M., 1992. Some aspects of Polya tree distributions for statistical modeling. Annals
of Statistics 20, 1222–1235.
Lavine, M., 1994. More aspects of Polya trees for statistical modeling. Annals of Statistics
22, 1161–1176.
Liming, J., Kee, E., Johnson, D., Sun, A., April 20-30 2003. Practical application of decision
support metrics for nuclear power plant risk-informed asset management. In: Proceedings
of the 11th International Conference on Nuclear Engineering. Tokyo, Japan.
Marquez, A., Heguedas, A., 2002. Models for maintenance optimization: a study for re-
pairable systems and finite time periods. Reliability Engineering and System Safety 75,
367–377.
Martz, H. F., Waller, R. A., 1982. Bayesian reliability analysis. John Wiley and Sons.
Popova, E., 2004. Basic optimality results for Bayesian group replacement policies. Opera-
tions Research Letters 32, 283–287.
Su, C.-T., Chang, C.-C., 2000. Minimization of the life cycle cost for a multistate system
under periodic maintenance. International Journal of Systems Science 31, 217–227.
26
Tierney, L., 1994. Markov chains for exploring posterior distributions. The Annals of Statis-
tics 22 (4), 1701–1762.
Valdez-Florez, C., Feldman, R., 1989. A survey of preventive maintenance models for
stochastically deteriorating single-unit systems. Naval Research Logistics, 419–446.
Walker, S. G., Mallick, B. K., 1999. Semiparametric accelerated life time model. Biometrics
55, 477–483.
Wilson, J. G., Popova, E., 1998. Bayesian approaches to maintenance intervention. In:
Proceedings of the section on Bayesian Science of the American Statistical Association.
pp. 278–284.
Yu, W., Popova, E., Kee, E., Sun, A., Grantom, R., May 16-20 2005. Basic factors to
forecast maintenance cost for nuclear power plants. In: Proceedings of 13th International
Conference on Nuclear Engineering. Beijing, China.
Yu, W., Popova, E., Kee, E., Sun, A., Richards, D., Grantom, R., July 2004a. Equipment
data development case study-Bayesian Weibull analysis. Tech. rep., STPNOC.
Yu, W., Popova, E., Kee, E., Sun, A., Richards, D., Grantom, R., July 2004b. Equipment
data development case study-Bayesian Weibull analysis. Tech. rep., STPNOC.
27