Nonparametric Bayesian Assessment of the Order of
Dependence for Binary Sequences
Fernando A. Quintana∗ Peter Muller†
December, 2002
Abstract
We discuss inference on the order of dependence in binary sequences. The
proposed approach is based on the notion of partial exchangeability of order
k. A partially exchangeable binary sequence of order k can be represented
as a mixture of Markov chains. The mixture is with respect to the unknown
transition probability matrix θ. We use this defining property to construct a
semiparametric model for binary sequences by assuming a nonparametric prior
on the transition matrix θ. This enables us to consider inference on the order of dependence without constraint to a particular parametric model. Implementing posterior simulation in the proposed model is complicated by the fact that the dimension of θ changes with the order of dependence k. We discuss appropriate posterior simulation schemes based on a pseudo prior approach (Carlin
and Chib 1995). We extend the model to include covariates by considering
an alternative parameterization as an autologistic regression which allows for a
straightforward introduction of covariates. The regression on covariates raises
∗Departamento de Estadıstica, Pontificia Universidad Catolica de Chile, Casilla 306, Santiago
22, CHILE. e-mail: [email protected]. Partially supported by grant FONDECYT 1990430.†Department of Biostatistics, Box 447, University of Texas M. D. Anderson Cancer Center,
Houston, TX 77030-4009, USA. e-mail: [email protected]. Partially supported by grants
NSF/INT 0104496 and NIH R01CA75981.
the additional inference problem of variable selection. We discuss appropriate
posterior simulation schemes, focusing on inference about the order of dependence. We discuss and develop the model with covariates only to the extent
needed for such inference.
1 Introduction
We consider inference for multiple binary sequences. Assume the data are n binary
sequences y_i = (y_{i1}, …, y_{i,n_i}), one for each of n experimental units i = 1, …, n. Additionally, the dataset may include a d-dimensional covariate vector x_{ij}, j = 1, …, n_i,
recorded for each subject at each time. This data structure arises in a wide variety
of applications. Typical examples are repeated binary measurements for patients in
a clinical trial, sequences of hits and outs for baseball players, or repeated rolls of a
thumbtack. In Section 4 we use the latter two examples to illustrate the proposed
models and methods.
To analyze such data we consider a flexible class of models for binary sequences,
defined through the notion of partial exchangeability of order k (Quintana and Newton
1998). Partially exchangeable probability models of order k can be defined as mixtures of order-k homogeneous Markov chains. The mixture is with respect to a probability measure on the space of transition matrices. Such partially exchangeable models
include as special cases independence, exchangeability and ordinary order k Markov
chains. Subject to some technical constraints, any probability distribution for binary
sequences that is invariant under all permutations that do not alter the initial state
and all transition counts up to order k can be represented by such models. This prop-
erty explains the label “partially exchangeable.” Theoretical properties of partially
exchangeable sequences and further applications can be found in Diaconis (1988),
Quintana and Newton (1999, 2000) and references therein.
The focus of this article is inference on the order of dependence k for partially
exchangeable models. Central to the proposed model is a nonparametric probability
model under order k, allowing for the desired inference without reference to any
particular parametric model. To define the nonparametric model we exploit the
mixture representation of partially exchangeable probability models and use mixtures
of Markov chains with a nonparametric mixing measure on transition probability
matrices. Unlike the earlier references, we concentrate on the case where the order of
dependence k is assumed unknown.
We first discuss the case without time dependent covariates. Under order k partial
exchangeability, the likelihood for experimental unit i is a mixture of order k Markov
chains. We will use θ_i to denote the transition probability matrix for experimental unit i, and F_k(θ_i) to denote the mixing measure. Thus the likelihood for experimental unit i is p(y_i | k, F_k) = ∫ p(y_i | θ_i) dF_k(θ_i). An important feature of the proposed
models is the nested nature of the parameter subspaces for the transition matrices θ_i under increasing order of dependence k. Consider, for example, a transition probability matrix θ for a Markov chain of order k = 1. If the rows of θ are chosen to be identical we lose one-step dependence, and the resulting sequence has order k = 0. Similarly, it follows that order k_1 is a sub-model of order k_2 for any 0 ≤ k_1 < k_2. In the context of this nested structure it is convenient to have all models defined on a common space. To achieve this, let K denote the maximum possible order of dependence, for example, K = min_{1≤i≤n} n_i − 1. Let Θ ≡ Θ_K denote the space of transition
matrices corresponding to order K dependence. An element θ ∈ Θ can be represented as a set of transition probabilities {p_{i_K i_{K−1} ··· i_1}}, for i_1, …, i_K ∈ {0, 1}, where p_{i_K ··· i_1} is the conditional probability that the next outcome is 1, given that the previous K values are i_K, …, i_1 in that order. Thus Θ = [0, 1]^{2^K}. In this setting, order k dependence is identified with a parameter subspace Θ_k of transition probabilities that do not depend on the indices i_K, …, i_{k+1}. In other words, θ = {p_{i_K ··· i_1}} ∈ Θ_k if and only if p_{i_K ··· i_{k+1} i_k ··· i_1} = p_{1 ··· 1 i_k ··· i_1} for all i_1, …, i_K ∈ {0, 1}. This corresponds to choosing some of the rows in the transition matrix to be identical. We will write p_{i_k ··· i_1} for the common value of p_{i_K ··· i_k ··· i_1} when θ ∈ Θ_k.
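As a concrete illustration of this nested parameterization, the following sketch embeds an order-k transition rule into Θ = [0, 1]^{2^K}. The bit-encoding of histories (most recent outcome in the least significant bit) is a hypothetical implementation choice, not part of the paper.

```python
import numpy as np

def embed_order_k(p_k, K, k):
    """Embed order-k transition probabilities into Theta = [0,1]^(2^K).

    p_k[h] is the probability that the next outcome is 1 given a
    length-k history h (bits of h, most recent outcome = least
    significant bit).  The embedded element of Theta_k assigns to every
    length-K history the probability indexed by its k most recent lags,
    so entries do not depend on lags k+1, ..., K.
    """
    theta = np.empty(2 ** K)
    for h in range(2 ** K):
        theta[h] = p_k[h % (2 ** k)]  # keep only the k most recent lags
    return theta
```

For k = 0 every entry of θ coincides, recovering the exchangeable (independence given θ) case.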
The order of dependence k, 0 ≤ k ≤ K, is treated as an unknown parameter and
included in the parameter vector, leading to model averaging over K + 1 competing
models corresponding to k = 0, . . . , K. To complete the model specification we need
a probability model for Fk, the prior for order k transition probabilities. The choice
of an appropriate model is subject to the following considerations. First, the focus of
inference is on the order of dependence k. Thus we want to avoid unduly restrictive
parametric models. Second, the possibly high dimensional nature of θ_i makes it imperative to choose a model that allows easy extension to high dimensions. Finally, the
model should allow computationally efficient posterior simulation. This leads us to
consider a nonparametric Dirichlet process (DP) prior for F_k(θ_i). DP priors were introduced in Ferguson (1973) and have been successfully used in many similar mixture model contexts, including, among many others, Bush and MacEachern (1996), Escobar and West (1995), Quintana (1998), and Muller and Rosner (1998). See, for example, MacEachern and Muller (2000) for a recent review. The DP prior model
for Fk(·) generalizes the model studied in Quintana and Newton (2000). Section 2
discusses details of this baseline model and the corresponding Markov chain Monte
Carlo (MCMC) scheme for selecting the order of dependence.
The modeling strategy for the case with covariates xij is developed next. To
facilitate the introduction of a regression on covariates we start with a one-to-one
transformation of the transition probabilities θi on the logit scale. With the extension
to covariates in mind we write the logistic transformation as a saturated autologistic
regression of yit on the lagged observations up to lag K:
logit P(y_l = 1 | y_{l−1} = i_1, …, y_{l−K} = i_K) = logit p_{i_K ··· i_{k+1}, i_k ··· i_1}
= α_0 + Σ_{1≤j_1≤K} α_{j_1} y_{l−j_1} + Σ_{1≤j_1<j_2≤K} α_{j_1 j_2} y_{l−j_1} y_{l−j_2} + ⋯
+ Σ_{1≤j_1<···<j_{K−1}≤K} α_{j_1···j_{K−1}} ∏_{m=1}^{K−1} y_{l−j_m} + α_{1···K} ∏_{m=1}^{K} y_{l−m}.   (1)
Equation (1) defines a one-to-one transformation between θ and the coefficients {α_{j_1···j_m}}. In this form the model is naturally extended in (10) to accommodate covariates by including a regression on x_{ij}. At the same time, model (1) is well suited to inference
about the order of dependence k. A lower order of dependence k is represented by setting the coefficients α_{j_1···j_m} to zero for k + 1 ≤ m ≤ K. For more discussion and
applications of autologistic regression models see Besag (1974), Hoeting, Leecaster
and Bowden (2000) and references therein. Beyond the context of inference for the
order of dependence, the probability model proposed in (10) could be of interest as
a general sampling model for repeated binary measurements. However, we have not
explored the performance of the model in these more general contexts. In particular, we have not considered comparisons with other alternative models.
Modeling and posterior simulation under model (1) are discussed in Section 3. Finally, Section 4 illustrates the proposed methods in several examples, and we conclude with final remarks in Section 5.
2 The Basic Model
For all indices i_0, i_1, …, i_k ∈ {0, 1}, define t^i_{i_k···i_1, i_0} as the number of times the string (i_k i_{k−1} ··· i_1) is followed by i_0 in the i-th sequence. We call these the k-th order transition counts. The likelihood factor for the i-th experimental unit is

f_i(y_i | k, θ_i) = ∏_{i_k···i_1 ∈ {0,1}} (p^i_{i_k···i_1})^{t^i_{i_k···i_1, 1}} (1 − p^i_{i_k···i_1})^{t^i_{i_k···i_1, 0}}.   (2)
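The transition counts and the likelihood factor (2) can be sketched as follows; the bit-encoding of histories (most recent outcome in the least significant bit) is a hypothetical implementation choice.

```python
import numpy as np

def transition_counts(y, k):
    """t[h, b] = number of times the length-k history h is followed by
    outcome b in the binary sequence y (bit j-1 of h holds lag j)."""
    t = np.zeros((2 ** k, 2), dtype=int)
    for l in range(k, len(y)):
        h = 0
        for j in range(1, k + 1):
            h |= y[l - j] << (j - 1)
        t[h, y[l]] += 1
    return t

def likelihood_factor(y, k, p):
    """Likelihood (2): product over histories h of
    p[h]^t[h,1] * (1 - p[h])^t[h,0]."""
    t = transition_counts(y, k)
    return float(np.prod(p ** t[:, 1] * (1.0 - p) ** t[:, 0]))
```

Only the counts enter the likelihood, which is what makes (2) invariant under the permutations that define partial exchangeability.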
The complete model consists of the sampling model

y_i | k, θ_i ∼ f_i(y_i | k, θ_i), independently for 1 ≤ i ≤ n,   (3)

a prior probability model for the transition matrices θ_1, …, θ_n ∈ Θ = [0, 1]^{2^K},

θ_i | k, F_k ∼ F_k, i.i.d.,   (4)

a nonparametric DP prior on F_k,

F_k | k ∼ D(c_k F_{0,k}),   (5)

and a hyperprior on the order of dependence,

k ∼ p_k.   (6)
Here D(cF0) denotes a Dirichlet process with total mass parameter c and centering
measure F0 (Ferguson 1973). The centering measure F0,k in (5) is a distribution
concentrated on the subspace Θk ⊂ Θ.
We define the centering probability measures F_{0,k} for k ∈ I = {0, 1, …, K} by first specifying a distribution F_0^k on [0, 1]^{2^k} and then embedding it in Θ. We give a constructive definition for F_0^k. A random variate generated from F_0^k is a collection {p_{i_k···i_1}} of 2^k independent Beta-distributed random variables with known, positive parameters a^k_{i_k···i_1} and b^k_{i_k···i_1}. The distribution F_0^k is a particular case of the Matrix-Beta distribution (Martin 1967). The associated density function is

f_0^k({p_{i_k···i_1}}) = ∏_{i_k···i_1 ∈ {0,1}} (p_{i_k···i_1})^{a^k_{i_k···i_1} − 1} (1 − p_{i_k···i_1})^{b^k_{i_k···i_1} − 1} / B(a^k_{i_k···i_1}, b^k_{i_k···i_1}),

where B(a, b) = Γ(a)Γ(b)/Γ(a + b) for positive real numbers a and b. Finally, F_0^k induces the distribution F_{0,k} on Θ by setting p_{i_K···i_k···i_1} ≡ p_{i_k···i_1} for all i_1, …, i_K ∈ {0, 1}.

The Polya urn representation of Blackwell and MacQueen (1973) yields a simple expression for the joint marginal prior distribution of θ_1, …, θ_n after integrating out the random measure F_k:

P(θ_1, …, θ_n | k) = ∏_{i=1}^n [ c_k F_{0,k}(θ_i) + Σ_{j=1}^{i−1} δ_{θ_j}(θ_i) ] / (c_k + i − 1),   (7)
where δx(·) indicates a point mass at {x}. Due to the almost sure discreteness of the
random measure Fk, the set θ1, . . . , θn may include ties, implying the formation of
clusters grouping those experimental units with identical transition matrices.
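Since F_0^k is just a product of independent Beta distributions, its density is simple to evaluate; a minimal sketch (the function name is hypothetical):

```python
import math

def log_matrix_beta_density(p, a, b):
    """Log density of the Matrix-Beta centering measure F_0^k:
    the product of independent Beta(a_h, b_h) densities over the 2^k
    histories h, with B(a, b) = Gamma(a)Gamma(b)/Gamma(a+b)."""
    logf = 0.0
    for p_h, a_h, b_h in zip(p, a, b):
        log_B = math.lgamma(a_h) + math.lgamma(b_h) - math.lgamma(a_h + b_h)
        logf += (a_h - 1) * math.log(p_h) + (b_h - 1) * math.log(1 - p_h) - log_B
    return logf
```

With all a_h = b_h = 1 (the default uniform choice used in Section 4) the density is identically 1.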
For convenience, we introduce the following notation. Let θ*_1, …, θ*_{|ρ|} be the set of unique values among θ_1, …, θ_n. Let ρ(θ_1, …, θ_n) = {S_1, …, S_{|ρ|}} denote the partition of {1, …, n} induced by θ_1, …, θ_n via i_1, i_2 ∈ S_j iff θ_{i_1} = θ_{i_2} = θ*_j. We define cluster memberships s = (s_1, …, s_n) by s_i = j iff i ∈ S_j, i.e., θ_i = θ*_{s_i}. We adopt the convention that s_1 = 1 and that clusters are assigned consecutive labels as they are formed. We write θ = (θ_1, …, θ_n) and θ* = (θ*_1, …, θ*_{|ρ|}).
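The Polya urn (7) also yields a direct forward simulation of θ_1, …, θ_n and of the induced clustering; a sketch with a generic base-measure sampler (names hypothetical):

```python
import numpy as np

def polya_urn_sample(n, c, sample_F0, rng):
    """Sequentially draw theta_1, ..., theta_n from (7): theta_i is a
    fresh draw from F_0 with probability c/(c + i - 1), and a copy of
    a uniformly chosen earlier theta_j otherwise.  Ties among the
    theta_i define the clusters S_1, ..., S_|rho|."""
    thetas = []
    for i in range(1, n + 1):
        if rng.random() < c / (c + i - 1):
            thetas.append(sample_F0(rng))               # open a new cluster
        else:
            thetas.append(thetas[rng.integers(i - 1)])  # join an old one
    return thetas
```

The first draw always opens a cluster, matching the convention s_1 = 1.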
To select the order of dependence we need to compute the posterior probabilities
p(k|y) for k ∈ I. We achieve this by Markov chain Monte Carlo (MCMC) simulation.
Using representation (7) and the fact that F_{0,k} and the likelihood (2) are of conjugate form, drawing samples from p(θ | k, y) can be accomplished by Bush and MacEachern's (1996) algorithm. The procedure works by updating the memberships s first, and then the locations θ* given s. We complete the MCMC scheme by including a step to update k given all other parameters. The choice of an appropriate updating step for k needs care. A simple Gibbs step, i.e., sampling from the complete conditional posterior distribution for k, would violate irreducibility in the following sense: once k = K is imputed, the complete conditional posterior p(k | θ, y) assigns probability zero to any k < K.
To avoid this problem, we propose a Metropolis-Hastings (Tierney 1994) strategy to generate a joint proposal (k′, θ′) for k and the transition probabilities θ. Denote the target posterior by

π(k, θ) ∝ [ ∏_{i=1}^n f_i(y_i | k, θ_i) ] p(θ | k) p_k.   (8)
Denote by (k, θ) the currently imputed state of the Markov chain. A new candidate (k′, θ′) is proposed in two steps. We first draw k′ ∼ q(k′ | k). As a probing distribution we use, for example,

q(k′ | k) = (1/3) I{k′ = k − 1} + (1/3) I{k′ = k} + (1/3) I{k′ = k + 1}, if 1 ≤ k ≤ K − 1;
q(k′ | k) = (1/2) I{k′ = 0} + (1/2) I{k′ = 1}, if k = 0;
q(k′ | k) = (1/2) I{k′ = K − 1} + (1/2) I{k′ = K}, if k = K.   (9)
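The probing distribution (9) is a simple random walk on {0, …, K} with adjusted mass at the endpoints; a sketch:

```python
import numpy as np

def propose_order(k, K, rng):
    """Draw k' from the probing distribution (9)."""
    if k == 0:
        return int(rng.choice([0, 1]))
    if k == K:
        return int(rng.choice([K - 1, K]))
    return int(rng.choice([k - 1, k, k + 1]))

def q_order(k_new, k, K):
    """Proposal probability q(k'|k); needed in the acceptance ratio
    because (9) is not symmetric at the boundaries."""
    if k == 0:
        return 0.5 if k_new in (0, 1) else 0.0
    if k == K:
        return 0.5 if k_new in (K - 1, K) else 0.0
    return 1.0 / 3.0 if abs(k_new - k) <= 1 else 0.0
```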
In a second step we generate a candidate θ′ for the transition probabilities. If k′ = k we update memberships and locations in one step of Bush and MacEachern's (1996) algorithm. When k′ ≠ k the memberships are retained, i.e. s′ = s, while the locations θ*′ are generated by drawing the θ*′_j independently from the posterior p(θ*′_j | k′, s, y) ∝ ∏_{i∈S_j} f_i(y_i | k′, θ*′_j) f_{0,k′}(θ*′_j), corresponding to the standard parametric model

y_i ∼ f_i(y_i | k′, θ_i = θ*_j), i ∈ S_j,

with prior θ*′_j ∼ f_0^{k′}. We have |ρ| such models, one for each cluster j = 1, …, |ρ|. In our conjugate setting, drawing from the posterior p(θ*′_j | k′, s, y) is a set of draws from Matrix-Beta distributions. Each of these draws amounts to sampling 2^{k′} independent Beta-distributed random variables. The move to the proposed state (k′, θ′) is accepted with probability

A((k, θ), (k′, θ′)) = min{ 1, [π(k′, θ′) q((k′, θ′), (k, θ))] / [π(k, θ) q((k, θ), (k′, θ′))] },

where

q((k, θ), (k′, θ′)) = q(k′ | k) { [ ∏_{i=1}^n p(θ′_i | θ′_1, …, θ′_{i−1}, θ_{i+1}, …, θ_n, k, y) ] I{k′ = k} + [ ∏_{j=1}^{|ρ|} p(θ*′_j | k′, s, y) ] I{k′ ≠ k} }.
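In the conjugate setting just described, the cluster-level draw reduces to independent Beta updates; a sketch, with transition counts aggregated over the units in cluster S_j (function name hypothetical):

```python
import numpy as np

def draw_cluster_theta(t, a, b, rng):
    """Draw theta*'_j from p(theta*'_j | k', s, y): with a Matrix-Beta
    prior and aggregated transition counts t[h, b] for cluster S_j,
    the posterior is again Matrix-Beta, i.e., independent
    Beta(a_h + t[h, 1], b_h + t[h, 0]) draws."""
    return rng.beta(a + t[:, 1], b + t[:, 0])
```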
When k′ = k we have A((k, θ), (k′, θ′)) = 1, i.e., a candidate that retains the current order of dependence is always accepted. Thus, the resulting Markov chain is unlikely to get trapped at a given point for too long, and the memberships are frequently updated. On the other hand, taking s′ = s when k′ ≠ k leads to an acceptance probability given by

min{ 1, [ p_{k′} ∏_{j=1}^{|ρ|} p_j(y | k′, s) q(k | k′) ] / [ p_k ∏_{j=1}^{|ρ|} p_j(y | k, s) q(k′ | k) ] },

where p_j(y | k, s) = ∫ ∏_{i∈S_j} f_i(y_i | k, θ*_j) f_{0,k}(θ*_j) dθ*_j is the marginal distribution for cluster j. The most influential term is the ratio of marginals under orders of dependence
k and k′, for a common clustering structure fixed at s. Candidates will be accepted if
the data are more likely to be observed under k′ than k for the partition considered
(and correcting for the prior and proposal distribution on k and k′).
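Because of conjugacy, the cluster marginal p_j(y | k, s) has a closed form: a ratio of Beta functions. A minimal sketch (function name hypothetical):

```python
import math

def log_marginal_cluster(t, a, b):
    """log p_j(y | k, s) = sum over histories h of
    log B(a_h + t[h,1], b_h + t[h,0]) - log B(a_h, b_h),
    obtained by integrating the likelihood (2), with counts aggregated
    over the units in S_j, against the Matrix-Beta prior."""
    def log_B(x, y):
        return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)
    return sum(log_B(a_h + n1, b_h + n0) - log_B(a_h, b_h)
               for (n0, n1), a_h, b_h in zip(t, a, b))
```

The acceptance ratio above then requires only these log marginals under k and k′, plus the prior p_k and the proposal probabilities q.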
The values of θ and k are updated according to the algorithm just described,
until the Markov chain is judged to have practically converged. Standard procedures
can be applied to produce estimates of all the quantities of interest, in particular the
posterior probabilities p(k|y).
3 Covariates in Partially Exchangeable Models
The invariance under the class of permutations characterizing the notion of partial
exchangeability is considered now as a baseline condition on which the regression on
covariates is built (Qu et al. 1987, Connolly and Liang 1988). By this we mean that,
in the absence of additional information, the binary sequence corresponding to each
experimental unit is assumed to be a mixture of Markov chains of a certain order.
Transition probabilities, however, will be modified as covariates become available.
The resulting sequences are no longer conditionally homogeneous Markov chains.
One possible approach is to refer to the saturated autologistic representation (1)
and assume that covariate effects are added linearly. However, this choice involves
excessively many parameters. Also, high order interactions are difficult to interpret.
Therefore, we propose a more parsimonious alternative that includes only main effects
and interactions of lagged terms, plus a linear regression on covariates. For order-k
models we propose
logit P(y_{il} = 1 | y_i[l − k, l − 1], α_i, β) = α_{i0} + Σ_{j_1=1}^k α_{i j_1} y_{i,l−j_1} + Σ_{1≤j_1<j_2≤k} α_{i j_1 j_2} y_{i,l−j_1} y_{i,l−j_2} + β^T x_{il}
≡ h_k(y_i[l − k, l − 1], α_i, β),   (10)

where y_i[l_1, l_2] = (y_{i l_1}, …, y_{i l_2}) for 1 ≤ l_1 < l_2 ≤ n_i, β is a d-dimensional parameter vector of regression coefficients on the logit scale, and α_i is the vector of autoregression coefficients. We introduce h_k(·) to denote the logit transition probability under order k. For this particular model specification, covariates modify the intercept term only.
Interactions with higher orders of dependence can be included, but for simplicity we
omit them in our discussion. The likelihood factor for experimental unit i under the
order-k model now becomes
f_i(y_i | k, α_i, β) = exp{ Σ_{l=K+1}^{n_i} y_{il} h_k(y_i[l − k, l − 1], α_i, β) − Σ_{l=K+1}^{n_i} log(1 + exp{h_k(y_i[l − k, l − 1], α_i, β)}) }.   (11)
The regression on covariates xij implies that the Markov chain transition probabilities
in (11) are inhomogeneous.
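A sketch of the logit h_k in (10) and the corresponding likelihood factor (11). The array layout is a hypothetical implementation choice: y_hist[j-1] holds lag j, alpha2 holds the pairwise interaction coefficients.

```python
import math
import numpy as np

def h_k(y_hist, alpha0, alpha1, alpha2, beta, x):
    """Logit transition probability (10): intercept, main effects of
    the k lagged outcomes, pairwise interactions, and the covariate
    term beta^T x."""
    k = len(y_hist)
    eta = alpha0 + float(np.dot(alpha1, y_hist))
    for j1 in range(k):
        for j2 in range(j1 + 1, k):
            eta += alpha2[j1, j2] * y_hist[j1] * y_hist[j2]
    return eta + float(np.dot(beta, x))

def log_lik_unit(y, K, etas):
    """Log of (11) for one unit, with etas[l] = h_k evaluated at time l
    (0-indexed); the sum over l = K, ..., n_i - 1 matches the paper's
    1-indexed l = K+1, ..., n_i."""
    return sum(y[l] * etas[l] - math.log1p(math.exp(etas[l]))
               for l in range(K, len(y)))
```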
In addition to model selection with respect to k, the regression on covariates gives rise to another model selection problem. Model (10) requires choosing the covariates that are included in the regression term β^T x_{il}. Similar variable selection
problems have been studied extensively in recent literature. See, for example Hoeting
et al. (1999) for a review and references. We will follow the pseudo-prior approach
proposed by Carlin and Chib (1995).
We replace the β^T x_{il} term in (10) by Σ_{u=1}^d β_u γ_u x_{ilu}, where γ = (γ_1, …, γ_d) is a vector of 0/1 indicators with γ_u = 1 if the u-th covariate is included in the model. The special cases γ = 0 = (0, …, 0) and γ = 1 = (1, …, 1) reduce to the model without regression and the full model with all covariates included, respectively. Let α = (α_1, …, α_n), β_γ = (β_u : γ_u = 1), β_{1−γ} = (β_u : γ_u = 0), and

θ_γ = (α, k, β_γ) if γ ≠ 0, and θ_γ = (α, k) if γ = 0,
i.e., θ_γ is the parameter vector under model γ. The idea of the pseudo prior approach is to augment the parameter vector to θ = (θ_γ, β_{1−γ}) by augmenting the probability model with an (artificial) prior distribution p(β_{1−γ} | θ_γ, γ). Including β_{1−γ} in the parameter vector under model γ removes the problem of varying dimension parameter spaces that otherwise complicates posterior simulation over competing models γ. The fact that β_{1−γ} does not appear in the likelihood under model γ does not hinder the augmentation of the prior probability model. The choice of the pseudo prior is theoretically arbitrary, but a good choice is important to achieve a fast mixing Markov chain simulation. The guiding principle is to choose the pseudo prior under model γ to mimic the corresponding conditional p(β_{1−γ} | θ_γ, γ = 1) in the full model. See Carlin and Chib (1995). The specific choice in our model is introduced below.
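The bookkeeping for γ, β_γ and β_{1−γ} amounts to masking and splitting the coefficient vector; a sketch (function names hypothetical):

```python
import numpy as np

def regression_term(beta, gamma, x):
    """Regression term sum_u beta_u * gamma_u * x_u: covariate u
    contributes only when its inclusion indicator gamma_u = 1."""
    return float(np.dot(beta * gamma, x))

def split_beta(beta, gamma):
    """Split beta into beta_gamma (included coefficients) and
    beta_{1-gamma} (excluded ones, kept in the augmented parameter
    vector under the pseudo prior)."""
    mask = np.asarray(gamma, dtype=bool)
    return beta[mask], beta[~mask]
```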
Using the pseudo-prior mechanism we now construct a prior to complete model (10):

p(k, α, β, γ) = p_k p(α | k) p(β_γ | γ) p(β_{1−γ} | k, α, β_γ, γ) p(γ).   (12)

Here p_k is the prior on the order of dependence. Similar to Section 2, the prior for α is defined as p(α | k) = ∫ ∏_{i=1}^n F_k(α_i) dp(F_k | k), with F_k ∼ D(c_k F_{0,k}). The base measure F_{0,k} in the DP prior is a zero-mean multivariate normal N(0, V_k) for the non-zero coefficients, and a point mass at zero for all the α coefficients that vanish under the order-k model in (10). For a vector x and a matrix A, let x_γ denote the sub-vector of x selected by the indicators γ_u, and A_γ the sub-matrix of A with rows and columns selected by γ_u. To define the prior p(β_γ | γ) on the regression coefficients we consider a multivariate normal model p(β | γ = 1) = N(λ, D^{−1}) for the full model and define

p(β_γ | γ) ≡ N(λ_γ, D_γ^{−1})   (13)

as the corresponding conditional distribution for β_γ given β_{1−γ} = 0.
To specify the pseudo-prior p(β_{1−γ} | γ, θ_γ) we approximate p(β_{1−γ} | γ = 1, θ_γ) as follows. Consider a normal approximation to the likelihood for observation i, f_i(y_i | k, α_i, β, γ = 1) ≈ N(μ_i, S_i), e.g., a second order Taylor expansion around the corresponding mode. Combining the normal likelihood approximation with the normal prior p(α, β | γ = 1) = N(λ, D^{−1}) we obtain a multivariate normal posterior approximation, p(α, β | γ = 1, y) ≈ N(m, C^{−1}), which we use to define

p(β_{1−γ} | θ_γ, γ, y, k) = N(m_{1−γ}, C_{1−γ}^{−1}),   (14)

where

C_{1−γ} = D_{1−γ} + Σ_{i=1}^n S_i^{−1} and m_{1−γ} = C_{1−γ}^{−1} ( Σ_{i=1}^n S_i^{−1} μ_i + D_{1−γ} λ_{1−γ} ).
Following a suggestion by Quintana, Liu and del Pino (1999) we use a factor κ
(e.g., κ = 2.0) to inflate the variance-covariance matrix Si in the above normal
approximation to the likelihood.
Finally, we complete the specification of the prior (12) by defining a uniform
prior p(γ) = const. Other alternative priors p(γ) could be used without altering the
posterior simulation scheme described below.
Our computational strategy is based on the Metropolis-Hastings algorithm of Sec-
tion 2, but with some necessary modifications to account for the larger model. Some
of the changes are based on the Metropolized Carlin-Chib algorithm proposed in Godsill (2001), which is also connected to the reversible jump algorithm of Green (1995) in the context of model composition.
The target distribution is

π(k, α, β, γ) ∝ [ ∏_{i=1}^n f_i(y_i | k, α_i, β) ] p(k, α, β, γ),   (15)

with p(k, α, β, γ) given in (12). Generating appropriate Metropolis-Hastings proposals is complicated by the fact that the centering measure F_{0,k} and the likelihood (11) are not in conjugate form. Computing the probabilities for resampling the configuration indicators s_i requires the evaluation of an integral for the marginal probability ∫ f_i(y_i | k, α_i, β) f_{0,k}(α_i) dα_i. A possible strategy to avoid this integral is the algorithm proposed in MacEachern and Muller (1998). Essentially, the approach amounts to carrying "empty" clusters, to be used when needed, in addition to the "full", already existing clusters. With this trick, the evaluation of integrals is completely avoided.
Our proposal starts with a draw k′ ∼ q(k′ | k) from (9) to determine the candidate order of dependence k′. If k′ = k, we perform one iteration of MacEachern and Muller's (1998) algorithm, which yields α′, followed by draws from the full conditional distributions for each of the remaining parameters, to get γ′ (one coordinate at a time), and β′_{γ′} and β′_{1−γ′}. Sampling β′_{γ′} and β′_{1−γ′} can be accomplished by adaptive rejection sampling for log-concave densities (Gilks and Wild 1992), or based on normal approximations centered around the corresponding mode.
When k′ ≠ k, we define the following proposal. Paralleling the definition of θ*_j in Section 2, we use {α*_1, …, α*_{|ρ|}} to denote the unique values among {α_1, …, α_n}, and s_i for the configuration indicators. Set s′ = s and γ′ = γ. To generate a proposal α*′ we construct a normal approximation to p(α*′_j | y, k′, β) ∝ ∏_{i∈S_j} f_i(y_i | k′, α*′_j, β) f_{0,k′}(α*′_j), for 1 ≤ j ≤ |ρ|. In the implementation for the examples reported in Section 4 we used a normal approximation based on maximizing ∏_{i∈S_j} f_i(y_i | k′, α*′_j, β) at each iteration of the Gibbs sampler. After the |ρ| independent draws α*′_j are generated, the same mechanism, i.e., a proposal using a normal approximation based on maximizing the likelihood, is implemented to draw the candidate β′_{γ′}. Finally, β′_{1−γ′} is taken as a draw from the pseudo-prior (14).
Given these choices, the Metropolis-Hastings acceptance ratio can be readily
computed. Further, some simplifications arise in the terms involving the pseudo-
prior (Godsill 2001, Chen et al. 2000).
4 Applications
The first example is the thumb tack dataset of Beckett and Diaconis (1994), consisting
of 320 sequences of 9 rolls of an ordinary thumb tack. The outcome was defined as
1 if the tack landed upwards, and 0 otherwise. The rolls were made by the authors
themselves, using different kinds of surfaces (carpet, tiles, etc.). The original data
are available in Beckett and Diaconis (1994). The same data have been analyzed
in Liu (1996) and MacEachern, Clyde and Liu (1999).
The second example consists of 127 sequences of hits and outs for players in the 1990 National League baseball season. These data were first described in Albright (1993), and a subset was later considered in Quintana, Liu and del Pino
(1999). The binary outcomes are defined as yij = 1 if a hit, walk or sacrifice occurred
the j-th time the i-th player was at bat, and yij = 0 if an out occurred. Each sequence
has a length of at least 500. There are 11 potential covariates, which are detailed in
Table 1. The full dataset consists of 75,337 records.
4.1 The Thumb Tack Dataset
For this application, no time dependent covariates are recorded. We use a flat prior on the order of dependence, total mass parameters c_k = 1 for all k up to a maximum order K, and independent uniform Beta parameters for each of the baseline measures F_0^k. We ran three independent Markov chains with K = 2, 5, and 7, using a burn-in period of length 1,000 and a Monte Carlo sample size of 20,000 in each case. Using these choices we proceeded with the
MCMC simulation as described in Section 2. The simulated chain satisfied standard
convergence criteria (Best et al. 1995, Cowles and Carlin 1996), e.g. as implemented
in the BOA package (Smith 2000). For the K = 2 case, the corrected scale reduction factor (CSRF) proposed in Brooks and Gelman (1998) across three realized chains was 1.0000844 for the order of dependence and 1.0002376 for p^1_{00000}, the first row of the transition matrix for the first binary sequence (this last variable is considered here only for the purpose of checking convergence). Thus, the multivariate potential scale
reduction factor (MPSRF) was 1.000232 for these two variables. We also computed
the convergence diagnostics by Geweke (1992), using fractions of 10% and 50% for the
first and last window. The corresponding Z-scores were, for the order of dependence,
1.0693, 0.7946 and 1.3998, with p-values 0.2849, 0.4268 and 0.1616. For p^1_{00000} we obtained 0.5858, 0.3495 and −0.0092, with p-values 0.5580, 0.7267 and 0.9926. In
addition, the stationarity and half-width tests of Heidelberger and Welch (1988) were
both passed for the three chains (data not shown). To assess performance of the
Metropolis-Hastings sampler we compute average acceptance rates. For K = 2 the
overall acceptance rates ranged in [61%, 63%]. For the case of proposed moves to a
different order of dependence, i.e. k′ 6= k, the range was [33%, 34%]. For K = 5
these values changed to [57%, 59%] and [31%, 32%], while for K = 7 we obtained
[57%, 58%], and [32%, 33%].
Table 2 displays some relevant summaries of the posterior distribution. The effect
of different choices for K is immediately seen. The maximum a posteriori (MAP)
estimate of the order of dependence k is 1 for K = 2, 5 for K = 5, and 6 for orders
K = 6 (data not shown) and K = 7. The reported sensitivity of the inference on
k as we change K is due to the relatively short lengths of the data sequences. We
have ni = 9 observations per sequence, leaving only between two (for K = 7) and
seven (for K = 2) transitions per sequence. This critical difference in the number
of data points under different K explains the reported inference. The same effect
is also clearly seen in the inference on p^1_{00000} reported in the last row of Figure 1. Notice how the posterior distribution becomes increasingly flat with higher K. The
middle row in Figure 1 further illustrates how the posterior distribution approaches
the prior Dirichlet process when K increases. For reference, the prior mean of the number of clusters |ρ| in our example is Σ_{i=1}^{320} 1/i = 6.347 and the variance is Σ_{i=1}^{319} i/(i+1)^2 = 4.705 (Liu 1996). In summary, our analysis supports the claim in Beckett and Diaconis
4.705 (Liu 1996). In summary, our analysis supports the claim in Beckett and Diaconis
(1994) that sequences exhibit stronger dependence than what is compatible with
conditional independence.
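The quoted prior moments follow from the sequential structure of the Polya urn with total mass c = 1: the i-th draw opens a new cluster independently with probability c/(c + i − 1), so |ρ| is a sum of independent Bernoulli indicators. A quick check:

```python
# Prior moments of the number of clusters |rho| for n = 320 sequences
# under a DP with total mass c = 1: |rho| is a sum of independent
# Bernoulli(c/(c+i-1)) indicators, i = 1, ..., n.
n, c = 320, 1.0
probs = [c / (c + i - 1) for i in range(1, n + 1)]
mean = sum(probs)                      # equals sum_{i=1}^{320} 1/i for c = 1
var = sum(p * (1 - p) for p in probs)  # equals sum_{i=1}^{319} i/(i+1)^2
print(round(mean, 3), round(var, 3))   # prints 6.347 4.705
```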
The example shows that the method should be used with caution for short sequences. For high order k the transition probabilities p_{i_k···i_1} are estimated from only a few observed transitions. In the extreme case of the maximum order K, no data are available, and the posterior distribution is trivially equal to the baseline measure of the prior. Consequently, for large K the posterior distribution of the order of dependence tends to mimic the prior.
Another issue of general interest is the sensitivity of the results to different specifications of the hyperparameters a^k_{i_k···i_1} and b^k_{i_k···i_1}. In the above results we chose all of these to be 1, i.e., we used a default uniform prior. Table 3 shows,
for the case K = 2, the posterior distribution of the order of dependence for different
choices of these parameters. The chosen values correspond to considerable variations
in the shape of the centering measure. The reported results suggest that posterior
inference is robust with respect to the choice of these hyperparameters. We suggest
1 as a default choice.
4.2 The Baseball Dataset
As a first approach to analyzing these data, we start by ignoring the covariates and
proceed as in the previous section, with the same prior parameter values and K = 5.
The acceptance rates are slightly higher than for the thumb tack dataset. Posterior
summary statistics can be found in Table 4. The MAP selected order of dependence
is 1, which remains true also for K = 2 and K = 7. Compared to the thumb tack
dataset, here the shape of the posterior distribution p(k|y) does not exhibit significant
sensitivity when moving K from 2 to 7. The mode of p(k|y) remains unchanged. The
comparison suggests that longer sequences result in more robust posterior inferences
than moderate or short ones. In applications with sufficiently large sequence lengths
ni one can proceed more confidently when setting a maximum order K.
We proceed now to incorporate the 11 covariates described in Table 1, and fit model (11)–(14) using K = 5, c_k = 1 for all k, uniform baseline measures in all cases, λ = 0, D^{−1} = 10 I, and V_k = 10 I, where I is the identity matrix of the appropriate dimension. These choices represent default vague priors. Running three chains of length 60,000 each, the overall acceptance rates for our algorithm ranged in [87%, 91%], while for proposed moves to a different order of dependence we obtained
[75%, 83%]. The MPSRF for the chains including β, α for the first player, and the
order of dependence is 1.0935, still well within the acceptable range. On the other
hand, the Geweke (1992) diagnostics for these variables ranged in [−0.9972, 1.6020],
with p-values in [0.1092, 0.8964], so convergence can be safely assumed.
The first panel in Figure 3 shows the frequency with which each covariate is
selected, i.e., the estimated posterior probability for including covariate u in the
model. Compare with Table 5. The remaining panels in Figure 3 plot the marginal
posterior distributions of the regression coefficients. We observe that covariates 2, 4, 5 and 10 are chosen nearly all the time, while variable 8 is chosen about 30% of the time, and the rest of the variables less than 10% of the time. Most of the posterior mass for β_2
is concentrated on β2 < 0. Variable 2 could be interpreted as a stress factor with
negative impact on the batter. In contrast, most of the posterior mass for β4, β5, β8
and β_10 is concentrated on positive values. This suggests that the batter performs better when any runner is on base, when his team has the home field advantage, or when the opposing pitcher has a high ERA (a measure of the quality of this opponent; better pitchers have lower ERA values). The bottom row of Table 5 reports the three models with
highest posterior probabilities. We see that the model including covariates 2, 4, 5 and
10 was imputed over 58% of the time, and an additional 24% of the time variable 8 was
added to the previous model. From this we conclude that the mentioned covariates
are the most relevant ones, as they appear in all top three models.
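The quantities in Figure 3 and Table 5 are simple functions of the saved draws of the inclusion vector γ. A sketch of how such summaries can be computed from MCMC output (illustrative only; the function and variable names are not from the paper's implementation):

```python
from collections import Counter

def selection_summary(gamma_draws, top=3):
    """Posterior inclusion probabilities P(gamma_j = 1 | y) and the
    most frequently visited model configurations, both estimated by
    Monte Carlo frequencies over saved draws of the binary inclusion
    vector gamma."""
    m = len(gamma_draws)
    p = len(gamma_draws[0])
    # Marginal inclusion probability of each covariate.
    incl = [sum(d[j] for d in gamma_draws) / m for j in range(p)]
    # Visit frequency of each full configuration (a "model").
    counts = Counter(tuple(d) for d in gamma_draws)
    best = [(model, c / m) for model, c in counts.most_common(top)]
    return incl, best
```

Applied to the saved γ draws, the first output gives the top left panel of Figure 3 and the second gives the bottom rows of Table 5.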
Figure 4 shows the posterior distribution for the order of dependence, number of
clusters, and the first order autoregressive coefficients in model (10) corresponding to
subject i = 1. The MAP estimate for order of dependence is k = 1, closely followed
by 2 and 3. This supports the claim that players tend to exhibit streaky behavior
when batting, a conclusion reached with or without covariates. Streakiness thus
appears not to be linked to external stimuli and could be seen as a feature inherent
to the players. The second panel of Figure 4 shows inference
on the number of clusters |ρ| for the model with covariates. The posterior is almost
entirely concentrated on 2 and 3 clusters. This suggests that, after correcting for the
effect of covariates, we find essentially two or three groups of players that present
similar patterns of streakiness. Ignoring the covariates results in the creation of some
extra clusters of players, as can be seen in Figure 2, middle row.
5 Discussion
We discussed a modeling framework to assess the order of dependence for multiple
binary sequences that are a priori assumed to be marginally partially exchangeable.
The semiparametric model specification allows inference on the order of dependence
without constraint to a specific parametric form. Despite its generality, posterior
simulation in the model is reasonably straightforward. We introduced an appropriate
MCMC strategy that alternates Metropolis-Hastings with Gibbs sampling steps. The
basic idea is to update the clustering structure only when the proposed order of
dependence coincides with the current one, which is done via a full Gibbs sampling
scan. Otherwise, we keep the cluster memberships and propose only new locations.
The partially exchangeable models are extended to accommodate covariates, using
autologistic regression models, with main effects and possibly interactions of lagged
binary outcomes. Model selection in this extended model has two aspects, one related
to the order of dependence, and the other related to the covariates. The MCMC al-
gorithm is adapted to this new scenario. We experimented with alternative updating
strategies, including schemes that update the cluster memberships at each proposal
step. However, the resulting Markov chain simulations were computationally
inefficient, with high rejection rates and slow mixing.
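In schematic form, the update that worked well can be written as follows. This is a structural sketch only, with hypothetical function names; the actual sampler operates on the full semiparametric model, and the pseudo-prior correction terms of Carlin and Chib (1995) are folded into `log_post` here for brevity.

```python
import math
import random

def update_order(state, K, log_post, gibbs_scan_memberships, propose_locations):
    """One iteration of the order-of-dependence update.  `state` holds
    the current order state['k'] plus cluster memberships and locations.
    If the proposed order equals the current one, refresh the
    memberships with a full Gibbs scan; otherwise keep memberships
    fixed, propose new locations for the new order, and accept or
    reject with a Metropolis-Hastings step."""
    k_new = random.randint(0, K)          # uniform proposal over orders
    if k_new == state['k']:
        # Same order: full Gibbs scan over cluster memberships.
        return gibbs_scan_memberships(state)
    # Different order: keep memberships, propose new locations only.
    candidate = propose_locations(state, k_new)
    log_ratio = log_post(candidate) - log_post(state)
    if math.log(random.random()) < log_ratio:
        return candidate                  # accept move to order k_new
    return state                          # reject: stay at current order
```

Restricting the Gibbs scan to same-order proposals is what avoids the high rejection rates observed when memberships are updated at every proposal step.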
The proposed framework is easily extended to multinomial sequences. The only
modification in the basic model is to allow y_{ij} ∈ {0, 1, . . . , C} in (2). For moderately
small C, say C = 3 or 4, the resulting posterior MCMC can still be implemented with
reasonable computational effort.
In the development of the discussed models and in the examples we have focused
on inference about the order of dependence and on variable selection. Another
interesting aspect of the model is the ability to semiparametrically estimate the joint
probability model for the binary (or multinomial) sequence. This is important, for
example, in decision problems built around binary (or multinomial) sequences, such
as planning the sampling times in clinical trials with binary or categorical outcomes.
The two applications discussed in Section 4 provide some useful guidelines for
assessing the order of dependence. For short to moderate sequences we advise against
choosing large K. A possible alternative for such sequences is to consider a penalty
term in the loss function that accounts for model complexity. A simpler alternative
that we explore here is to restrict K to values well below the average sequence length,
and examine how results are affected with increasing K.
Acknowledgments
We thank the Referees and the Associate Editor for the critical review and the
many constructive comments that helped to substantially improve the manuscript.
References
Albright, S. C. (1993). A statistical analysis of hitting streaks in baseball (with
discussion: pp. 1184–1196), Journal of the American Statistical Association 88: 1175–1183.
Beckett, L. and Diaconis, P. (1994). Spectral Analysis for Discrete Longitudinal Data,
Advances in Mathematics 103: 107–128.
Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems,
Journal of the Royal Statistical Society, Series B 36: 192–236.
Best, N. G., Cowles, M. K. and Vines, K. (1995). CODA: Convergence diagnosis
and output analysis software for Gibbs sampling output, Version 0.30, Technical
report, MRC Biostatistics Unit.
Blackwell, D. and MacQueen, J. B. (1973). Ferguson distributions via Polya urn
schemes, The Annals of Statistics 1: 353–355.
Brooks, S. and Gelman, A. (1998). General methods for monitoring convergence of it-
erative simulations, Journal of Computational and Graphical Statistics 7(4): 434–
455.
Bush, C. A. and MacEachern, S. N. (1996). A semiparametric Bayesian model for
randomised block designs, Biometrika 83(2): 275–285.
Carlin, B. P. and Chib, S. (1995). Bayesian Model Choice via Markov Chain Monte
Carlo Methods, Journal of the Royal Statistical Society B 57: 473–484.
Chen, M.-H., Shao, Q.-M. and Ibrahim, J. G. (2000). Monte Carlo Methods in
Bayesian Computation, New York: Springer-Verlag.
Connolly, M. A. and Liang, K.-Y. (1988). Conditional Logistic Regression Models for
Correlated Binary Data, Biometrika 75: 501–506.
Cowles, M. and Carlin, B. (1996). Markov chain Monte Carlo convergence diagnostics:
a comparative review, Journal of the American Statistical Association 91: 883–
904.
Diaconis, P. (1988). Recent progress on de Finetti’s notions of exchangeability, in
J. Bernardo, M. DeGroot, D. Lindley and A. Smith (eds), Bayesian Statistics
3, Oxford University Press, pp. 111–125.
Escobar, M. and West, M. (1995). Bayesian density estimation and inference using
mixtures, Journal of the American Statistical Association 90: 577–588.
Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems, The
Annals of Statistics 1: 209–230.
Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to calcu-
lating posterior moments, in J. Bernardo, J. Berger, A. Dawid and A. Smith
(eds), Bayesian Statistics 4, Oxford: Oxford University Press, pp. 169–193.
Gilks, W. and Wild, P. (1992). Adaptive rejection sampling for Gibbs sampling,
Applied Statistics 41: 337–348.
Godsill, S. J. (2001). On the relationship between MCMC model uncertainty methods,
Journal of Computational and Graphical Statistics 10: 230–248.
Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and
Bayesian model determination, Biometrika 82: 711–732.
Heidelberger, P. and Welch, P. (1988). Simulation run length control in the presence
of an initial transient, Operations Research 31: 1109–1144.
Hoeting, J. A., Leecaster, M. and Bowden, D. (2000). An improved model for spatially
correlated binary responses, Biological and Environmental Statistics 5(1): 102–
114.
Hoeting, J. A., Madigan, D., Raftery, A. E. and Volinsky, C. T. (1999). Bayesian
Model Averaging: A Tutorial, Statistical Science 14(4): 382–417.
Liu, J. S. (1996). Nonparametric Hierarchical Bayes via Sequential Imputations, The
Annals of Statistics 24(3): 911–930.
MacEachern, S. N. and Muller, P. (1998). Estimating Mixture of Dirichlet Process
Models, Journal of Computational and Graphical Statistics 7(2): 223–238.
MacEachern, S. N. and Muller, P. (2000). Efficient MCMC Schemes for Robust Model
Extensions using Encompassing Dirichlet Process Mixture Models, in F. Ruggeri
and D. Ríos-Insua (eds), Robust Bayesian Analysis, New York: Springer-Verlag,
pp. 295–316.
MacEachern, S. N., Clyde, M. and Liu, J. S. (1999). Sequential Importance Sampling
for Nonparametric Bayes Models: The Next Generation, Canadian Journal of
Statistics 27: 251–267.
Martin, J. (1967). Bayesian Decision Processes and Markov chains, New York: Wiley.
Muller, P. and Rosner, G. (1998). Semi-parametric PK/PD Models, in D. Dey,
P. Muller and D. Sinha (eds), Practical Nonparametric and Semiparametric
Bayesian Statistics, New York: Springer-Verlag, pp. 323–338.
Qu, Y., Williams, G. W., Beck, G. J. and Goormastic, M. (1987). A Generalized
Model of Logistic Regression for Clustered Data, Communications in Statistics,
Part A – Theory and Methods 16: 3447–3476.
Quintana, F. A. (1998). Nonparametric Bayesian analysis for assessing homogeneity in
k × l contingency tables with fixed right margin totals, Journal of the American
Statistical Association 93: 1140–1149.
Quintana, F. A. and Newton, M. A. (1998). Assessing the Order of Dependence
for Partially Exchangeable Binary Data, Journal of the American Statistical
Association 93: 194–202.
Quintana, F. A. and Newton, M. A. (1999). Parametric Partially Exchangeable
Models and Polya Urns for Multiple Binary Sequences, Brazilian Journal of
Probability and Statistics 13: 55–76.
Quintana, F. A. and Newton, M. A. (2000). Computational aspects of Nonparametric
Bayesian analysis with applications to the modeling of multiple binary sequences,
Journal of Computational and Graphical Statistics 9(4): 711–737.
Quintana, F. A., Liu, J. S. and del Pino, G. E. (1999). Monte Carlo EM with
Importance Reweighting and its Applications to Random Effects Models, Com-
putational Statistics and Data Analysis 29: 429–444.
Smith, B. J. (2000). Bayesian Output Analysis Program (BOA), Version 0.5.0 for
S-PLUS and R. Available at http://www.public-health.uiowa.edu/BOA.
Tierney, L. (1994). Markov chains for exploring posterior distributions, Annals of
Statistics 22(4): 1701–1762.
Name Number Variable Description
I7 1 1 if game is into the 7th inning or beyond; 0 otherwise
O2 2 1 if there are 2 outs; 0 otherwise
Score 3 number of runs batter’s team is ahead by (if positive) or behind by (if negative)
R123 4 1 if any runners are on base; 0 otherwise
R23 5 1 if any runners are on 2nd or 3rd; 0 otherwise
Game 6 a chronological index of the game number
DN 7 1 for a night game; 0 for a day game
HA 8 1 for a home game; 0 for an away game
T 9 1 if opposing pitcher is right-handed; 0 if left-handed
ERA 10 opposing pitcher’s earned run average for that season
Turf 11 1 if playing on grass; 0 if artificial turf
Table 1: Covariates for Baseball Dataset in Section 4. Note: to make the values of
all variables comparable Game was divided by 100.
Estimates
Quantity of Interest K = 2 K = 5 K = 7
P (k = 0|y) 0.1193 (0.00502) 0.0121 (0.00205) 0.0138 (0.00224)
P (k = 1|y) 0.5357 (0.00631) 0.0576 (0.00607) 0.0497 (0.00546)
P (k = 2|y) 0.3450 (0.00754) 0.0840 (0.00571) 0.0695 (0.00619)
P (k = 3|y) – 0.1722 (0.00717) 0.0967 (0.00682)
P (k = 4|y) – 0.3121 (0.00765) 0.1233 (0.00701)
P (k = 5|y) – 0.3620 (0.01141) 0.1692 (0.00666)
P (k = 6|y) – – 0.2550 (0.00866)
P (k = 7|y) – – 0.2228 (0.01020)
E(p^1_{0,0,0,0,0;0}|y) 0.496 (0.0010) 0.469 (0.0018) 0.536 (0.0023)
Var(p^1_{0,0,0,0,0;0}|y) 0.0054 (0.00018) 0.0233 (0.0006) 0.044 (0.0009)
E(|ρ||y) 4.11 (0.053) 5.67 (0.065) 5.99 (0.070)
Var(|ρ||y) 2.85 (0.107) 3.77 (0.156) 4.34 (0.163)
Table 2: Posterior quantities of interest for thumb tack dataset. Monte Carlo standard
errors are indicated between parentheses.
Parameter Values Estimates
a^k_{i_k···i_1} b^k_{i_k···i_1} P (k = 0|y) P (k = 1|y) P (k = 2|y)
1 1 0.1193 0.5357 0.3450
10 1 0.0923 0.5414 0.3663
1 10 0.1044 0.5402 0.3554
0.5 0.5 0.1205 0.5291 0.3504
5 5 0.0814 0.5606 0.3580
10 10 0.0578 0.5649 0.3773
Table 3: Sensitivity analysis of posterior distribution of order of dependence to choices
of parameters for the centering measure. The maximum order was K = 2.
Figure 1: Posterior summaries for thumb tack dataset. Columns represent results
for different values of the maximum order K. First and second rows show estimated
posterior probabilities for order of dependence and number of clusters. The last row
shows the posterior distribution of p^1_{0,0,0,0,0}, the first component of θ_1 in (2).
Estimates
Quantity of Interest K = 2 K = 5 K = 7
P (k = 0|y) 0.2403 (0.00749) 0.1321 (0.00630) 0.0983 (0.00550)
P (k = 1|y) 0.4804 (0.00626) 0.3110 (0.00786) 0.3456 (0.00999)
P (k = 2|y) 0.2793 (0.00736) 0.2124 (0.00646) 0.1857 (0.00619)
P (k = 3|y) – 0.1354 (0.00506) 0.1082 (0.00444)
P (k = 4|y) – 0.1021 (0.00516) 0.0631 (0.00324)
P (k = 5|y) – 0.1070 (0.00666) 0.0503 (0.00326)
P (k = 6|y) – – 0.0881 (0.00694)
P (k = 7|y) – – 0.0608 (0.00546)
E(p^1_{0,0,0,0,0;0}|y) 0.652 (0.0001) 0.651 (0.0002) 0.652 (0.0001)
Var(p^1_{0,0,0,0,0;0}|y) 0.0587 (0.06359) 0.0581 (0.06357) 0.0587 (0.06548)
E(|ρ||y) 3.28 (0.038) 2.63 (0.039) 1.99 (0.028)
Var(|ρ||y) 1.49 (0.0548) 1.36 (0.058) 0.83 (0.042)
Table 4: Posterior quantities of interest for baseball dataset (ignoring covariates).
Monte Carlo standard errors are indicated between parentheses.
Figure 2: Posterior summaries for the baseball dataset. Columns represent results
for different values of the maximum order K. First and second rows show estimated
posterior probabilities for order of dependence and number of clusters. The last row
shows the posterior distribution of p^1_{0,0,0,0,0}, the first component of θ_1 in (2).
Figure 3: Posterior distributions for all regression coefficients, including percentiles
2.5%, 50% and 97.5% (dotted lines). Upper left plot represents the percentage of times
each covariate was selected in the model.
Quantities of Interest Estimated Probability
P (γ1 = 1|y) 0.08217 (0.002573)
P (γ2 = 1|y) 0.99855 (0.000418)
P (γ3 = 1|y) 0.01067 (0.000825)
P (γ4 = 1|y) 0.99675 (0.000889)
P (γ5 = 1|y) 0.99950 (0.000244)
P (γ6 = 1|y) 0.04765 (0.002470)
P (γ7 = 1|y) 0.02592 (0.001607)
P (γ8 = 1|y) 0.30167 (0.006333)
P (γ9 = 1|y) 0.00983 (0.000852)
P (γ10 = 1|y) 0.99967 (0.000202)
P (γ11 = 1|y) 0.01107 (0.000874)
P (∑γi = 1|y) - (-)
P (∑γi = 2|y) - (-)
P (∑γi = 3|y) 0.00210 (0.000685)
P (∑γi = 4|y) 0.58110 (0.006210)
P (∑γi = 5|y) 0.35267 (0.005520)
P (∑γi = 6|y) 0.05968 (0.002355)
P (∑γi = 7|y) 0.00437 (0.000624)
P (∑γi = 8|y) 0.00003 (0.000024)
P (∑γi = 9|y) 0.00003 (0.000026)
P (∑γi = 10|y) 0.00002 (0.000002)
P (∑γi = 11|y) - (-)
P (γ = (0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0)|y) 0.58017 (0.006230)
P (γ = (0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0)|y) 0.24277 (0.005476)
P (γ = (1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0)|y) 0.04163 (0.001656)
Table 5: Posterior probabilities for quantities related to variable selection (baseball
dataset), including the three most likely models. Monte Carlo standard errors are
indicated between parentheses. Note: no models with 1, 2 or 11 variables were
visited.
Figure 4: Posterior distributions of the order of dependence, the number of clusters,
and the main-effect autologistic regression coefficients for subject 1, including
percentiles 2.5%, 50% and 97.5% (dotted lines).