Nonparametric Bayesian Assessment of the Order of
Dependence for Binary Sequences
Fernando A. Quintana∗ Peter Muller†
December, 2002
Abstract
We discuss inference on the order of dependence in binary sequences. The
proposed approach is based on the notion of partial exchangeability of order
k. A partially exchangeable binary sequence of order k can be represented
as a mixture of Markov chains. The mixture is with respect to the unknown
transition probability matrix θ. We use this defining property to construct a
semiparametric model for binary sequences by assuming a nonparametric prior
on the transition matrix θ. This enables us to consider inference on the order of dependence without constraint to a particular parametric model. Implementing posterior simulation in the proposed model is complicated by the fact that the dimension of θ changes with the order of dependence k. We discuss appropriate posterior simulation schemes based on a pseudo prior approach (Carlin
and Chib 1995). We extend the model to include covariates by considering
an alternative parameterization as an autologistic regression which allows for a
straightforward introduction of covariates. The regression on covariates raises
∗Departamento de Estadıstica, Pontificia Universidad Catolica de Chile, Casilla 306, Santiago
22, CHILE. e-mail: [email protected]. Partially supported by grant FONDECYT 1990430.†Department of Biostatistics, Box 447, University of Texas M. D. Anderson Cancer Center,
Houston, TX 77030-4009, USA. e-mail: [email protected]. Partially supported by grants
NSF/INT 0104496 and NIH R01CA75981.
the additional inference problem of variable selection. We discuss appropriate
posterior simulation schemes, focusing on inference about the order of dependence. We discuss and develop the model with covariates only to the extent
needed for such inference.
1 Introduction
We consider inference for multiple binary sequences. Assume the data are n binary
sequences y_i = (y_{i1}, …, y_{i,n_i}), one for each of n experimental units i = 1, …, n. Additionally, the dataset may include a d-dimensional covariate vector x_{ij}, j = 1, …, n_i,
recorded for each subject at each time. This data structure arises in a wide variety
of applications. Typical examples are repeated binary measurements for patients in
a clinical trial, sequences of hits and outs for baseball players, or repeated rolls of a
thumbtack. In Section 4 we use the latter two examples to illustrate the proposed
models and methods.
To analyze such data we consider a flexible class of models for binary sequences,
defined through the notion of partial exchangeability of order k (Quintana and Newton
1998). Partially exchangeable probability models of order k can be defined as mixtures of order-k homogeneous Markov chains. The mixture is with respect to a probability measure on the space of transition matrices. Such partially exchangeable models
include as special cases independence, exchangeability and ordinary order k Markov
chains. Subject to some technical constraints, any probability distribution for binary
sequences that is invariant under all permutations that do not alter the initial state
and all transition counts up to order k can be represented by such models. This prop-
erty explains the label “partially exchangeable.” Theoretical properties of partially
exchangeable sequences and further applications can be found in Diaconis (1988),
Quintana and Newton (1999, 2000) and references therein.
The focus of this article is inference on the order of dependence k for partially
exchangeable models. Central to the proposed model is a nonparametric probability
model under order k, allowing for the desired inference without reference to any
particular parametric model. To define the nonparametric model we exploit the
mixture representation of partially exchangeable probability models and use mixtures
of Markov chains with a nonparametric mixing measure on transition probability
matrices. Unlike the earlier references, we concentrate on the case where the order of
dependence k is assumed unknown.
We first discuss the case without time dependent covariates. Under order k partial
exchangeability, the likelihood for experimental unit i is a mixture of order k Markov
chains. We will use θ_i to denote the transition probability matrix for experimental unit i, and F_k(θ_i) to denote the mixing measure. Thus the likelihood for experimental unit i is p(y_i | k, F_k) = ∫ p(y_i | θ_i) dF_k(θ_i). An important feature of the proposed
models is the nested nature of the parameter subspaces for the transition matrices θ_i under increasing order of dependence k. Consider, for example, a transition probability matrix θ for a Markov chain of order k = 1. If the rows of θ are chosen to be identical we lose one-step dependence, and the resulting sequence has order k = 0. Similarly, it follows that order k_1 is a sub-model of order k_2 for any 0 ≤ k_1 < k_2. In the context of this nested structure it is convenient to have all models defined on a common space. To achieve this, let K denote the maximum possible order of dependence, for example, K = min_{1≤i≤n} n_i − 1. Let Θ ≡ Θ_K denote the space of transition
matrices corresponding to order K dependence. An element θ ∈ Θ can be represented as a set of transition probabilities {p_{i_K i_{K−1} ··· i_1}}, for i_1, …, i_K ∈ {0, 1}, where p_{i_K ··· i_1} is the conditional probability that the next outcome is 1, given that the previous K values are i_K, …, i_1 in that order. Thus Θ = [0, 1]^{2^K}. In this setting, order k dependence is identified with a parameter subspace Θ_k of transition probabilities that do not depend on the indices i_K, …, i_{k+1}. In other words, θ = {p_{i_K ··· i_1}} ∈ Θ_k if and only if p_{i_K ··· i_{k+1} i_k ··· i_1} = p_{1 ··· 1 i_k ··· i_1} for all i_1, …, i_K ∈ {0, 1}. This corresponds to choosing some of the rows in the transition matrix to be identical. We will write p_{i_k ··· i_1} for the common value of p_{i_K ··· i_k ··· i_1} when θ ∈ Θ_k.
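As a concrete illustration of this nested parameterization, the following sketch embeds an order-k transition rule into Θ = [0, 1]^{2^K}. The bit-encoding of histories (most recent outcome in the least significant bit) is a hypothetical implementation choice, not part of the paper.

```python
import numpy as np

def embed_order_k(p_k, K, k):
    """Embed order-k transition probabilities into Theta = [0,1]^(2^K).

    p_k[h] is the probability that the next outcome is 1 given a
    length-k history h (bits of h, most recent outcome = least
    significant bit).  The embedded element of Theta_k assigns to every
    length-K history the probability indexed by its k most recent lags,
    so entries do not depend on lags k+1, ..., K.
    """
    theta = np.empty(2 ** K)
    for h in range(2 ** K):
        theta[h] = p_k[h % (2 ** k)]  # keep only the k most recent lags
    return theta
```

For k = 0 every entry of θ coincides, recovering the exchangeable (independence given θ) case.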
The order of dependence k, 0 ≤ k ≤ K, is treated as an unknown parameter and
included in the parameter vector, leading to model averaging over K + 1 competing
models corresponding to k = 0, . . . , K. To complete the model specification we need
a probability model for Fk, the prior for order k transition probabilities. The choice
of an appropriate model is subject to the following considerations. First, the focus of
inference is on the order of dependence k. Thus we want to avoid unduly restrictive
parametric models. Second, the possibly high dimensional nature of θ_i makes it imperative to choose a model that allows easy extension to high dimensions. Finally, the
model should allow computationally efficient posterior simulation. This leads us to
consider a nonparametric Dirichlet process (DP) prior for F_k(θ_i). DP priors were introduced in Ferguson (1973) and have been successfully used in many similar mixture model contexts, including, among many others, Bush and MacEachern (1996), Escobar and West (1995), Quintana (1998), and Muller and Rosner (1998). See, for example, MacEachern and Muller (2000) for a recent review. The DP prior model
for Fk(·) generalizes the model studied in Quintana and Newton (2000). Section 2
discusses details of this baseline model and the corresponding Markov chain Monte
Carlo (MCMC) scheme for selecting the order of dependence.
The modeling strategy for the case with covariates xij is developed next. To
facilitate the introduction of a regression on covariates we start with a one-to-one
transformation of the transition probabilities θi on the logit scale. With the extension
to covariates in mind we write the logistic transformation as a saturated autologistic
regression of yit on the lagged observations up to lag K:
logit P(y_l = 1 | y_{l−1} = i_1, …, y_{l−K} = i_K) = logit p_{i_K ··· i_{k+1}, i_k ··· i_1}
= α_0 + Σ_{1≤j_1≤K} α_{j_1} y_{l−j_1} + Σ_{1≤j_1<j_2≤K} α_{j_1 j_2} y_{l−j_1} y_{l−j_2} + ⋯
+ Σ_{1≤j_1<···<j_{K−1}≤K} α_{j_1···j_{K−1}} ∏_{m=1}^{K−1} y_{l−j_m} + α_{1···K} ∏_{m=1}^{K} y_{l−m}.   (1)
Equation (1) defines a one-to-one transformation between θ and the coefficients {α_{j_1···j_m}}. In this form the model is naturally extended in (10) to accommodate covariates by including a regression on x_{ij}. At the same time, model (1) is well suited to inference
about the order of dependence k. A lower order of dependence k is represented by setting the coefficients α_{j_1···j_m} to zero for k + 1 ≤ m ≤ K. For more discussion and
applications of autologistic regression models see Besag (1974), Hoeting, Leecaster
and Bowden (2000) and references therein. Beyond the context of inference for the
order of dependence, the probability model proposed in (10) could be of interest as
a general sampling model for repeated binary measurements. However, we have not
explored the performance of the model in these more general contexts. In particular, we have not considered comparisons with other alternative models.
Modeling and posterior simulation under model (1) are discussed in Section 3. Finally, Section 4 illustrates the proposed methods in several examples, and we conclude with final remarks in Section 5.
2 The Basic Model
For all indices i_0, i_1, …, i_k ∈ {0, 1}, define t^i_{i_k···i_1, i_0} as the number of times the string (i_k i_{k−1} ··· i_1) is followed by i_0 in the i-th sequence. We call these the k-th order transition counts. The likelihood factor for the i-th experimental unit is

f_i(y_i | k, θ_i) = ∏_{i_k···i_1 ∈ {0,1}} (p^i_{i_k···i_1})^{t^i_{i_k···i_1, 1}} (1 − p^i_{i_k···i_1})^{t^i_{i_k···i_1, 0}}.   (2)
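The transition counts and the likelihood factor (2) can be sketched as follows; the bit-encoding of histories (most recent outcome in the least significant bit) is a hypothetical implementation choice.

```python
import numpy as np

def transition_counts(y, k):
    """t[h, b] = number of times the length-k history h is followed by
    outcome b in the binary sequence y (bit j-1 of h holds lag j)."""
    t = np.zeros((2 ** k, 2), dtype=int)
    for l in range(k, len(y)):
        h = 0
        for j in range(1, k + 1):
            h |= y[l - j] << (j - 1)
        t[h, y[l]] += 1
    return t

def likelihood_factor(y, k, p):
    """Likelihood (2): product over histories h of
    p[h]^t[h,1] * (1 - p[h])^t[h,0]."""
    t = transition_counts(y, k)
    return float(np.prod(p ** t[:, 1] * (1.0 - p) ** t[:, 0]))
```

Only the counts enter the likelihood, which is what makes (2) invariant under the permutations that define partial exchangeability.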
The complete model consists of the sampling model

y_i | k, θ_i ∼ f_i(y_i | k, θ_i), independently for 1 ≤ i ≤ n,   (3)

a prior probability model for the transition matrices θ_1, …, θ_n ∈ Θ = [0, 1]^{2^K},

θ_i | k, F_k ∼ F_k, i.i.d.,   (4)

a nonparametric DP prior on F_k,

F_k | k ∼ D(c_k F_{0,k}),   (5)

and a hyperprior on the order of dependence,

k ∼ p_k.   (6)
Here D(cF0) denotes a Dirichlet process with total mass parameter c and centering
measure F0 (Ferguson 1973). The centering measure F0,k in (5) is a distribution
concentrated on the subspace Θk ⊂ Θ.
We define the centering probability measures F_{0,k} for k ∈ I = {0, 1, …, K} by first specifying a distribution F_0^k on [0, 1]^{2^k} and then embedding it in Θ. We give a constructive definition for F_0^k. A random variate generated from F_0^k is a collection {p_{i_k···i_1}} of 2^k independent Beta-distributed random variables with known, positive parameters a^k_{i_k···i_1} and b^k_{i_k···i_1}. The distribution F_0^k is a particular case of the Matrix-Beta distribution (Martin 1967). The associated density function is

f_0^k({p_{i_k···i_1}}) = ∏_{i_k···i_1 ∈ {0,1}} (p_{i_k···i_1})^{a^k_{i_k···i_1} − 1} (1 − p_{i_k···i_1})^{b^k_{i_k···i_1} − 1} / B(a^k_{i_k···i_1}, b^k_{i_k···i_1}),

where B(a, b) = Γ(a)Γ(b)/Γ(a + b) for positive real numbers a and b. Finally, F_0^k induces the distribution F_{0,k} on Θ by setting p_{i_K···i_k···i_1} ≡ p_{i_k···i_1} for all i_1, …, i_K ∈ {0, 1}.

The Polya urn representation of Blackwell and MacQueen (1973) yields a simple expression for the joint marginal prior distribution of θ_1, …, θ_n after integrating out the random measure F_k:

P(θ_1, …, θ_n | k) = ∏_{i=1}^n [ c_k F_{0,k}(θ_i) + Σ_{j=1}^{i−1} δ_{θ_j}(θ_i) ] / (c_k + i − 1),   (7)
where δx(·) indicates a point mass at {x}. Due to the almost sure discreteness of the
random measure Fk, the set θ1, . . . , θn may include ties, implying the formation of
clusters grouping those experimental units with identical transition matrices.
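Since F_0^k is just a product of independent Beta distributions, its density is simple to evaluate; a minimal sketch (the function name is hypothetical):

```python
import math

def log_matrix_beta_density(p, a, b):
    """Log density of the Matrix-Beta centering measure F_0^k:
    the product of independent Beta(a_h, b_h) densities over the 2^k
    histories h, with B(a, b) = Gamma(a)Gamma(b)/Gamma(a+b)."""
    logf = 0.0
    for p_h, a_h, b_h in zip(p, a, b):
        log_B = math.lgamma(a_h) + math.lgamma(b_h) - math.lgamma(a_h + b_h)
        logf += (a_h - 1) * math.log(p_h) + (b_h - 1) * math.log(1 - p_h) - log_B
    return logf
```

With all a_h = b_h = 1 (the default uniform choice used in Section 4) the density is identically 1.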
For convenience, we introduce the following notation. Let θ*_1, …, θ*_{|ρ|} be the set of unique values among θ_1, …, θ_n. Let ρ(θ_1, …, θ_n) = {S_1, …, S_{|ρ|}} denote the partition of {1, …, n} induced by θ_1, …, θ_n via i_1, i_2 ∈ S_j iff θ_{i_1} = θ_{i_2} = θ*_j. We define cluster memberships s = (s_1, …, s_n) by s_i = j iff i ∈ S_j, i.e., θ_i = θ*_{s_i}. We adopt the convention that s_1 = 1 and that clusters are assigned consecutive labels as they are formed. We write θ = (θ_1, …, θ_n) and θ* = (θ*_1, …, θ*_{|ρ|}).
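The Polya urn (7) also yields a direct forward simulation of θ_1, …, θ_n and of the induced clustering; a sketch with a generic base-measure sampler (names hypothetical):

```python
import numpy as np

def polya_urn_sample(n, c, sample_F0, rng):
    """Sequentially draw theta_1, ..., theta_n from (7): theta_i is a
    fresh draw from F_0 with probability c/(c + i - 1), and a copy of
    a uniformly chosen earlier theta_j otherwise.  Ties among the
    theta_i define the clusters S_1, ..., S_|rho|."""
    thetas = []
    for i in range(1, n + 1):
        if rng.random() < c / (c + i - 1):
            thetas.append(sample_F0(rng))               # open a new cluster
        else:
            thetas.append(thetas[rng.integers(i - 1)])  # join an old one
    return thetas
```

The first draw always opens a cluster, matching the convention s_1 = 1.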
To select the order of dependence we need to compute the posterior probabilities
p(k|y) for k ∈ I. We achieve this by Markov chain Monte Carlo (MCMC) simulation.
Using representation (7) and the fact that F_{0,k} and the likelihood (2) are of conjugate form, drawing samples from p(θ | k, y) can be accomplished by Bush and MacEachern's (1996) algorithm. The procedure works by updating the memberships s first, and then the locations θ* given s. We complete the MCMC scheme by including a step to update k given all other parameters. The choice of an appropriate updating step for k needs care. A simple Gibbs step, i.e., sampling from the complete conditional posterior distribution for k, would violate irreducibility in the following sense: once k = K is imputed, the complete conditional posterior p(k | θ, y) assigns probability zero to any k < K.
To avoid this problem, we propose a Metropolis-Hastings (Tierney 1994) strategy to generate a joint proposal (k′, θ′) for k and the transition probabilities θ. Denote the target posterior by

π(k, θ) ∝ [ ∏_{i=1}^n f_i(y_i | k, θ_i) ] p(θ | k) p_k.   (8)
Denote by (k, θ) the currently imputed state of the Markov chain. A new candidate (k′, θ′) is proposed in two steps. We first draw k′ ∼ q(k′ | k). As a probing distribution we use, for example,

q(k′ | k) = (1/3) I{k′ = k − 1} + (1/3) I{k′ = k} + (1/3) I{k′ = k + 1}, if 1 ≤ k ≤ K − 1;
q(k′ | k) = (1/2) I{k′ = 0} + (1/2) I{k′ = 1}, if k = 0;
q(k′ | k) = (1/2) I{k′ = K − 1} + (1/2) I{k′ = K}, if k = K.   (9)
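The probing distribution (9) is a simple random walk on {0, …, K} with adjusted mass at the endpoints; a sketch:

```python
import numpy as np

def propose_order(k, K, rng):
    """Draw k' from the probing distribution (9)."""
    if k == 0:
        return int(rng.choice([0, 1]))
    if k == K:
        return int(rng.choice([K - 1, K]))
    return int(rng.choice([k - 1, k, k + 1]))

def q_order(k_new, k, K):
    """Proposal probability q(k'|k); needed in the acceptance ratio
    because (9) is not symmetric at the boundaries."""
    if k == 0:
        return 0.5 if k_new in (0, 1) else 0.0
    if k == K:
        return 0.5 if k_new in (K - 1, K) else 0.0
    return 1.0 / 3.0 if abs(k_new - k) <= 1 else 0.0
```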
In a second step we generate a candidate θ′ for the transition probabilities. If k′ = k we update memberships and locations in one step of Bush and MacEachern's (1996) algorithm. When k′ ≠ k the memberships are retained, i.e. s′ = s, while the locations θ*′ are generated by drawing the θ*′_j independently from the posterior p(θ*′_j | k′, s, y) ∝ ∏_{i∈S_j} f_i(y_i | k′, θ*′_j) f_{0,k′}(θ*′_j), corresponding to the standard parametric model

y_i ∼ f_i(y_i | k′, θ_i = θ*_j), i ∈ S_j,

with prior θ*′_j ∼ f_0^{k′}. We have |ρ| such models, one for each cluster j = 1, …, |ρ|. In our conjugate setting, drawing from the posterior p(θ*′_j | k′, s, y) is a set of draws from Matrix-Beta distributions. Each of these draws amounts to sampling 2^{k′} independent Beta-distributed random variables. The move to the proposed state (k′, θ′) is accepted with probability

A((k, θ), (k′, θ′)) = min{ 1, [π(k′, θ′) q((k′, θ′), (k, θ))] / [π(k, θ) q((k, θ), (k′, θ′))] },

where

q((k, θ), (k′, θ′)) = q(k′ | k) { [ ∏_{i=1}^n p(θ′_i | θ′_1, …, θ′_{i−1}, θ_{i+1}, …, θ_n, k, y) ] I{k′ = k} + [ ∏_{j=1}^{|ρ|} p(θ*′_j | k′, s, y) ] I{k′ ≠ k} }.
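In the conjugate setting just described, the cluster-level draw reduces to independent Beta updates; a sketch, with transition counts aggregated over the units in cluster S_j (function name hypothetical):

```python
import numpy as np

def draw_cluster_theta(t, a, b, rng):
    """Draw theta*'_j from p(theta*'_j | k', s, y): with a Matrix-Beta
    prior and aggregated transition counts t[h, b] for cluster S_j,
    the posterior is again Matrix-Beta, i.e., independent
    Beta(a_h + t[h, 1], b_h + t[h, 0]) draws."""
    return rng.beta(a + t[:, 1], b + t[:, 0])
```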
When k′ = k we have A((k, θ), (k′, θ′)) = 1, i.e., a candidate that retains the current order of dependence is always accepted. Thus, the resulting Markov chain is unlikely to get trapped at a given point for too long, and the memberships are frequently updated. On the other hand, taking s′ = s when k′ ≠ k leads to an acceptance probability given by

min{ 1, [ p_{k′} ∏_{j=1}^{|ρ|} p_j(y | k′, s) q(k | k′) ] / [ p_k ∏_{j=1}^{|ρ|} p_j(y | k, s) q(k′ | k) ] },

where p_j(y | k, s) = ∫ ∏_{i∈S_j} f_i(y_i | k, θ*_j) f_{0,k}(θ*_j) dθ*_j is the marginal distribution for cluster j. The most influential term is the ratio of marginals under orders of dependence
k and k′, for a common clustering structure fixed at s. Candidates will be accepted if
the data are more likely to be observed under k′ than k for the partition considered
(and correcting for the prior and proposal distribution on k and k′).
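Because of conjugacy, the cluster marginal p_j(y | k, s) has a closed form: a ratio of Beta functions. A minimal sketch (function name hypothetical):

```python
import math

def log_marginal_cluster(t, a, b):
    """log p_j(y | k, s) = sum over histories h of
    log B(a_h + t[h,1], b_h + t[h,0]) - log B(a_h, b_h),
    obtained by integrating the likelihood (2), with counts aggregated
    over the units in S_j, against the Matrix-Beta prior."""
    def log_B(x, y):
        return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)
    return sum(log_B(a_h + n1, b_h + n0) - log_B(a_h, b_h)
               for (n0, n1), a_h, b_h in zip(t, a, b))
```

The acceptance ratio above then requires only these log marginals under k and k′, plus the prior p_k and the proposal probabilities q.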
The values of θ and k are updated according to the algorithm just described,
until the Markov chain is judged to have practically converged. Standard procedures
can be applied to produce estimates of all the quantities of interest, in particular the
posterior probabilities p(k|y).
3 Covariates in Partially Exchangeable Models
The invariance under the class of permutations characterizing the notion of partial
exchangeability is considered now as a baseline condition on which the regression on
covariates is built (Qu et al. 1987, Connolly and Liang 1988). By this we mean that,
in the absence of additional information, the binary sequence corresponding to each
experimental unit is assumed to be a mixture of Markov chains of a certain order.
Transition probabilities, however, will be modified as covariates become available.
The resulting sequences are no longer conditionally homogeneous Markov chains.
One possible approach is to refer to the saturated autologistic representation (1)
and assume that covariate effects are added linearly. However, this choice involves
excessively many parameters. Also, high order interactions are difficult to interpret.
Therefore, we propose a more parsimonious alternative that includes only main effects
and interactions of lagged terms, plus a linear regression on covariates. For order-k
models we propose
logit P(y_{il} = 1 | y_i[l − k, l − 1], α_i, β) = α_{i0} + Σ_{j_1=1}^k α_{i j_1} y_{i,l−j_1} + Σ_{1≤j_1<j_2≤k} α_{i j_1 j_2} y_{i,l−j_1} y_{i,l−j_2} + β^T x_{il}
≡ h_k(y_i[l − k, l − 1], α_i, β),   (10)

where y_i[l_1, l_2] = (y_{i l_1}, …, y_{i l_2}) for 1 ≤ l_1 < l_2 ≤ n_i, β is a d-dimensional parameter vector of regression coefficients on the logit scale, and α_i is the vector of autoregression coefficients. We introduce h_k(·) to denote the logit transition probability under order k. For this particular model specification, covariates modify the intercept term only.
Interactions with higher orders of dependence can be included, but for simplicity we
omit them in our discussion. The likelihood factor for experimental unit i under the
order-k model now becomes
f_i(y_i | k, α_i, β) = exp{ Σ_{l=K+1}^{n_i} y_{il} h_k(y_i[l − k, l − 1], α_i, β) − Σ_{l=K+1}^{n_i} log(1 + exp{h_k(y_i[l − k, l − 1], α_i, β)}) }.   (11)
The regression on covariates xij implies that the Markov chain transition probabilities
in (11) are inhomogeneous.
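A sketch of the logit h_k in (10) and the corresponding likelihood factor (11). The array layout is a hypothetical implementation choice: y_hist[j-1] holds lag j, alpha2 holds the pairwise interaction coefficients.

```python
import math
import numpy as np

def h_k(y_hist, alpha0, alpha1, alpha2, beta, x):
    """Logit transition probability (10): intercept, main effects of
    the k lagged outcomes, pairwise interactions, and the covariate
    term beta^T x."""
    k = len(y_hist)
    eta = alpha0 + float(np.dot(alpha1, y_hist))
    for j1 in range(k):
        for j2 in range(j1 + 1, k):
            eta += alpha2[j1, j2] * y_hist[j1] * y_hist[j2]
    return eta + float(np.dot(beta, x))

def log_lik_unit(y, K, etas):
    """Log of (11) for one unit, with etas[l] = h_k evaluated at time l
    (0-indexed); the sum over l = K, ..., n_i - 1 matches the paper's
    1-indexed l = K+1, ..., n_i."""
    return sum(y[l] * etas[l] - math.log1p(math.exp(etas[l]))
               for l in range(K, len(y)))
```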
In addition to model selection with respect to k, the regression on covariates gives rise to another model selection problem. Model (10) requires choosing the covariates that are included in the regression term β^T x_{il}. Similar variable selection
problems have been studied extensively in recent literature. See, for example Hoeting
et al. (1999) for a review and references. We will follow the pseudo-prior approach
proposed by Carlin and Chib (1995).
We replace the β^T x_{il} term in (10) by Σ_{u=1}^d β_u γ_u x_{ilu}, where γ = (γ_1, …, γ_d) is a vector of 0/1 indicators with γ_u = 1 if the u-th covariate is included in the model. The special cases γ = 0 = (0, …, 0) and γ = 1 = (1, …, 1) reduce to the model without regression and the full model with all covariates included, respectively. Let α = (α_1, …, α_n), β_γ = (β_u : γ_u = 1), β_{1−γ} = (β_u : γ_u = 0), and

θ_γ = (α, k, β_γ) if γ ≠ 0, and θ_γ = (α, k) if γ = 0,
i.e., θ_γ is the parameter vector under model γ. The idea of the pseudo prior approach is to augment the parameter vector to θ = (θ_γ, β_{1−γ}) by augmenting the probability model with an (artificial) prior distribution p(β_{1−γ} | θ_γ, γ). Including β_{1−γ} in the parameter vector under model γ removes the problem of varying dimension parameter spaces that otherwise complicates posterior simulation over competing models γ. The fact that β_{1−γ} does not appear in the likelihood under model γ does not hinder the augmentation of the prior probability model. The choice of the pseudo prior is theoretically arbitrary, but a good choice is important to achieve a fast mixing Markov chain simulation. The guiding principle is to choose the pseudo prior under model γ to mimic the corresponding conditional p(β_{1−γ} | θ_γ, γ = 1) in the full model. See Carlin and Chib (1995). The specific choice in our model is introduced below.
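The bookkeeping for γ, β_γ and β_{1−γ} amounts to masking and splitting the coefficient vector; a sketch (function names hypothetical):

```python
import numpy as np

def regression_term(beta, gamma, x):
    """Regression term sum_u beta_u * gamma_u * x_u: covariate u
    contributes only when its inclusion indicator gamma_u = 1."""
    return float(np.dot(beta * gamma, x))

def split_beta(beta, gamma):
    """Split beta into beta_gamma (included coefficients) and
    beta_{1-gamma} (excluded ones, kept in the augmented parameter
    vector under the pseudo prior)."""
    mask = np.asarray(gamma, dtype=bool)
    return beta[mask], beta[~mask]
```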
Using the pseudo-prior mechanism we now construct a prior to complete model (10):

p(k, α, β, γ) = p_k p(α | k) p(β_γ | γ) p(β_{1−γ} | k, α, β_γ, γ) p(γ).   (12)

Here p_k is the prior on the order of dependence. Similar to Section 2, the prior for α is defined as p(α | k) = ∫ ∏_{i=1}^n F_k(α_i) dp(F_k | k), with F_k ∼ D(c_k F_{0,k}). The base measure F_{0,k} in the DP prior is a zero-mean multivariate normal N(0, V_k) for the non-zero coefficients, and a point mass at zero for all the α coefficients that vanish under the order-k model in (10). For a vector x and a matrix A, let x_γ denote the sub-vector of x selected by the indicators γ_u, and A_γ the sub-matrix of A with rows and columns selected by γ_u. To define the prior p(β_γ | γ) on the regression coefficients we consider a multivariate normal model p(β | γ = 1) = N(λ, D^{−1}) for the full model and define

p(β_γ | γ) ≡ N(λ_γ, D_γ^{−1})   (13)

as the corresponding conditional distribution for β_γ given β_{1−γ} = 0.
To specify the pseudo-prior p(β_{1−γ} | γ, θ_γ) we approximate p(β_{1−γ} | γ = 1, θ_γ) as follows. Consider a normal approximation to the likelihood for observation i, f_i(y_i | k, α_i, β, γ = 1) ≈ N(μ_i, S_i), e.g., a second order Taylor expansion around the corresponding mode. Combining the normal likelihood approximation with the normal prior p(α, β | γ = 1) = N(λ, D^{−1}) we obtain a multivariate normal posterior approximation, p(α, β | γ = 1, y) ≈ N(m, C^{−1}), which we use to define

p(β_{1−γ} | θ_γ, γ, y, k) = N(m_{1−γ}, C_{1−γ}^{−1}),   (14)

where

C_{1−γ} = D_{1−γ} + Σ_{i=1}^n S_i^{−1} and m_{1−γ} = C_{1−γ}^{−1} ( Σ_{i=1}^n S_i^{−1} μ_i + D_{1−γ} λ_{1−γ} ).
Following a suggestion by Quintana, Liu and del Pino (1999) we use a factor κ
(e.g., κ = 2.0) to inflate the variance-covariance matrix Si in the above normal
approximation to the likelihood.
Finally, we complete the specification of the prior (12) by defining a uniform
prior p(γ) = const. Other alternative priors p(γ) could be used without altering the
posterior simulation scheme described below.
Our computational strategy is based on the Metropolis-Hastings algorithm of Sec-
tion 2, but with some necessary modifications to account for the larger model. Some
of the changes are based on the Metropolized Carlin-Chib algorithm proposed in Godsill (2001), which is also connected to the reversible jump algorithm of Green (1995) in the context of model composition.
The target distribution is

π(k, α, β, γ) ∝ [ ∏_{i=1}^n f_i(y_i | k, α_i, β) ] p(k, α, β, γ),   (15)

with p(k, α, β, γ) given in (12). Generating appropriate Metropolis-Hastings proposals is complicated by the fact that the centering measure F_{0,k} and the likelihood (11) are not in conjugate form. Computing the probabilities for resampling the configuration indicators s_i requires the evaluation of an integral for the marginal probability ∫ f_i(y_i | k, α_i, β) f_{0,k}(α_i) dα_i. A possible strategy to avoid this integral is the algorithm proposed in MacEachern and Muller (1998). Essentially, the approach amounts to carrying "empty" clusters, to be used when needed, in addition to the "full", already existing clusters. With this trick, the evaluation of integrals is completely avoided.
Our proposal starts with a draw k′ ∼ q(k′ | k) from (9) to determine the candidate order of dependence k′. If k′ = k, we perform one iteration of MacEachern and Muller's (1998) algorithm, which yields α′, followed by draws from the full conditional distributions for each of the remaining parameters, to get γ′ (one coordinate at a time), and β′_{γ′} and β′_{1−γ′}. Sampling β′_{γ′} and β′_{1−γ′} can be accomplished by adaptive rejection sampling for log-concave densities (Gilks and Wild 1992), or based on normal approximations centered around the corresponding mode.
When k′ ≠ k, we define the following proposal. Paralleling the definition of θ*_j in Section 2, we use {α*_1, …, α*_{|ρ|}} to denote the unique values among {α_1, …, α_n}, and s_i for the configuration indicators. Set s′ = s and γ′ = γ. To generate a proposal α*′ we construct a normal approximation to p(α*′_j | y, k′, β) ∝ ∏_{i∈S_j} f_i(y_i | k′, α*′_j, β) f_{0,k′}(α*′_j), for 1 ≤ j ≤ |ρ|. In the implementation for the examples reported in Section 4 we used a normal approximation based on maximizing ∏_{i∈S_j} f_i(y_i | k′, α*′_j, β) at each iteration of the Gibbs sampler. After the |ρ| independent draws α*′_j are generated, the same mechanism, i.e., a proposal using a normal approximation based on maximizing the likelihood, is implemented to draw the candidate β′_{γ′}. Finally, β′_{1−γ′} is taken as a draw from the pseudo-prior (14).
Given these choices, the Metropolis-Hastings acceptance ratio can be readily
computed. Further, some simplifications arise in the terms involving the pseudo-
prior (Godsill 2001, Chen et al. 2000).
4 Applications
The first example is the thumb tack dataset of Beckett and Diaconis (1994), consisting
of 320 sequences of 9 rolls of an ordinary thumb tack. The outcome was defined as
1 if the tack landed upwards, and 0 otherwise. The rolls were made by the authors
themselves, using different kinds of surfaces (carpet, tiles, etc.). The original data
are available in Beckett and Diaconis (1994). The same data have been analyzed
in Liu (1996) and MacEachern, Clyde and Liu (1999).
The second example consists of 127 sequences of hits and outs for players in the 1990 National League baseball season. These data were first described in Albright (1993), and a subset was later considered in Quintana, Liu and del Pino
(1999). The binary outcomes are defined as yij = 1 if a hit, walk or sacrifice occurred
the j-th time the i-th player was at bat, and yij = 0 if an out occurred. Each sequence
has a length of at least 500. There are 11 potential covariates, which are detailed in
Table 1. The full dataset consists of 75,337 records.
4.1 The Thumb Tack Dataset
For this application, no time dependent covariates are recorded. We use a flat prior on the order of dependence, total mass parameters c_k = 1 for all k up to a maximum order K, and independent uniform Beta parameters for each of the baseline measures F_0^k. We ran three independent Markov chains with K = 2, 5, and 7, using a burn-in period of length 1,000 and a Monte Carlo sample size of 20,000 in each case. Using these choices we proceeded with the
MCMC simulation as described in Section 2. The simulated chain satisfied standard
convergence criteria (Best et al. 1995, Cowles and Carlin 1996), e.g. as implemented
in the BOA package (Smith 2000). For the K = 2 case, the corrected scale reduction factor (CSRF) proposed in Brooks and Gelman (1998) across three realized chains was 1.0000844 for the order of dependence and 1.0002376 for p^1_{00000}, the first row of the transition matrix for the first binary sequence (this last variable is considered here only for the purpose of checking convergence). Thus, the multivariate potential scale
reduction factor (MPSRF) was 1.000232 for these two variables. We also computed
the convergence diagnostics by Geweke (1992), using fractions of 10% and 50% for the
first and last window. The corresponding Z-scores were, for the order of dependence,
1.0693, 0.7946 and 1.3998, with p-values 0.2849, 0.4268 and 0.1616. For p^1_{00000} we obtained 0.5858, 0.3495 and −0.0092, with p-values 0.5580, 0.7267 and 0.9926. In
addition, the stationarity and half-width tests of Heidelberger and Welch (1988) were
both passed for the three chains (data not shown). To assess performance of the
Metropolis-Hastings sampler we compute average acceptance rates. For K = 2 the
overall acceptance rates ranged in [61%, 63%]. For the case of proposed moves to a
different order of dependence, i.e. k′ 6= k, the range was [33%, 34%]. For K = 5
these values changed to [57%, 59%] and [31%, 32%], while for K = 7 we obtained
[57%, 58%], and [32%, 33%].
Table 2 displays some relevant summaries of the posterior distribution. The effect
of different choices for K is immediately seen. The maximum a posteriori (MAP)
estimate of the order of dependence k is 1 for K = 2, 5 for K = 5, and 6 for orders
K = 6 (data not shown) and K = 7. The reported sensitivity of the inference on
k as we change K is due to the relatively short lengths of the data sequences. We
have ni = 9 observations per sequence, leaving only between two (for K = 7) and
seven (for K = 2) transitions per sequence. This critical difference in the number
of data points under different K explains the reported inference. The same effect
is also clearly seen in the inference on p^1_{00000} reported in the last row of Figure 1. Notice how the posterior distribution becomes increasingly flat with higher K. The
middle row in Figure 1 further illustrates how the posterior distribution approaches
the prior Dirichlet process when K increases. For reference, the prior mean of the number of clusters |ρ| in our example is Σ_{i=1}^{320} 1/i = 6.347 and the variance is Σ_{i=1}^{319} i/(i+1)^2 = 4.705 (Liu 1996). In summary, our analysis supports the claim in Beckett and Diaconis
4.705 (Liu 1996). In summary, our analysis supports the claim in Beckett and Diaconis
(1994) that sequences exhibit stronger dependence than what is compatible with
conditional independence.
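The quoted prior moments follow from the sequential structure of the Polya urn with total mass c = 1: the i-th draw opens a new cluster independently with probability c/(c + i − 1), so |ρ| is a sum of independent Bernoulli indicators. A quick check:

```python
# Prior moments of the number of clusters |rho| for n = 320 sequences
# under a DP with total mass c = 1: |rho| is a sum of independent
# Bernoulli(c/(c+i-1)) indicators, i = 1, ..., n.
n, c = 320, 1.0
probs = [c / (c + i - 1) for i in range(1, n + 1)]
mean = sum(probs)                      # equals sum_{i=1}^{320} 1/i for c = 1
var = sum(p * (1 - p) for p in probs)  # equals sum_{i=1}^{319} i/(i+1)^2
print(round(mean, 3), round(var, 3))   # prints 6.347 4.705
```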
The example shows that the method should be used with caution for short sequences. For high order k the transition probabilities p_{i_k···i_1} are estimated from only a few observed transitions. In the extreme case of the maximum order K, no data are available, and the posterior distribution is trivially equal to the baseline measure of the prior. Consequently, for large K the posterior distribution of the order of dependence tends to mimic the prior.
Another issue of general interest is the sensitivity of the results to different specifications of the hyperparameters a^k_{i_k···i_1} and b^k_{i_k···i_1}. In the above results we chose all of these to be 1, i.e., we used a default uniform prior. Table 3 shows,
for the case K = 2, the posterior distribution of the order of dependence for different
choices of these parameters. The chosen values correspond to considerable variations
in the shape of the centering measure. The reported results suggest that posterior
inference is robust with respect to the choice of these hyperparameters. We suggest
1 as a default choice.
4.2 The Baseball Dataset
As a first approach to analyzing these data, we start by ignoring the covariates and
proceed as in the previous section, with the same prior parameter values and K = 5.
The acceptance rates are slightly higher than for the thumb tack dataset. Posterior
summary statistics can be found in Table 4. The MAP selected order of dependence
is 1, which remains true also for K = 2 and K = 7. Compared to the thumb tack
dataset, here the shape of the posterior distribution p(k|y) does not exhibit significant
sensitivity when moving K from 2 to 7. The mode of p(k|y) remains unchanged. The
comparison suggests that longer sequences result in more robust posterior inferences
than moderate or short ones. In applications with sufficiently large sequence lengths
ni one can proceed more confidently when setting a maximum order K.
We proceed now to incorporate the 11 covariates described in Table 1, and fit model (11)–(14) using K = 5, c_k = 1 for all k, uniform baseline measures in all cases, λ = 0, D^{−1} = 10 I, and V_k = 10 I, where I is the identity matrix of the appropriate dimension. These choices represent default vague priors. Running three chains of length 60,000 each, the overall acceptance rates for our algorithm ranged in [87%, 91%], while for proposed moves to a different order of dependence we obtained
[75%, 83%]. The MPSRF for the chains including β, α for the first player, and the
order of dependence is 1.0935, still well within the acceptable range. On the other
hand, the Geweke (1992) diagnostics for these variables ranged in [−0.9972, 1.6020],
with p-values in [0.1092, 0.8964], so convergence can be safely assumed.
The first panel in Figure 3 shows the frequency with which each covariate is
selected, i.e., the estimated posterior probability for including covariate u in the
model. Compare with Table 5. The remaining panels in Figure 3 plot the marginal
posterior distributions of the regression coefficients. We observe that covariates 2, 4, 5 and 10 are chosen nearly all the time, while variable 8 is chosen about 30% of the time, and the rest of the variables less than 10% of the time. Most of the posterior mass for β_2
is concentrated on β2 < 0. Variable 2 could be interpreted as a stress factor with
negative impact on the batter. In contrast, most of the posterior mass for β4, β5, β8
and β_10 is concentrated on positive values. This suggests that the batter performs better when any runner is on base, when his team has the home field advantage, or when the opposing pitcher has a high ERA (a measure of the quality of this opponent; better pitchers have lower ERA values). The bottom row of Table 5 reports the three models with
highest posterior probabilities. We see that the model including covariates 2, 4, 5 and
10 was imputed over 58% of the time, and an additional 24% of the time variable 8 was
added to the previous model. From this we conclude that the mentioned covariates
are the most relevant ones, as they appear in all top three models.
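The quantities in Figure 3 and Table 5 are simple functions of the saved draws of the inclusion vector γ. A sketch of how such summaries can be computed from MCMC output (illustrative only; the function and variable names are not from the paper's implementation):

```python
from collections import Counter

def selection_summary(gamma_draws, top=3):
    """Posterior inclusion probabilities P(gamma_j = 1 | y) and the
    most frequently visited model configurations, both estimated by
    Monte Carlo frequencies over saved draws of the binary inclusion
    vector gamma."""
    m = len(gamma_draws)
    p = len(gamma_draws[0])
    # Marginal inclusion probability of each covariate.
    incl = [sum(d[j] for d in gamma_draws) / m for j in range(p)]
    # Visit frequency of each full configuration (a "model").
    counts = Counter(tuple(d) for d in gamma_draws)
    best = [(model, c / m) for model, c in counts.most_common(top)]
    return incl, best
```

Applied to the saved γ draws, the first output gives the top left panel of Figure 3 and the second gives the bottom rows of Table 5.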
Figure 4 shows the posterior distribution for the order of dependence, number of
clusters, and the first order autoregressive coefficients in model (10) corresponding to
subject i = 1. The MAP estimate for order of dependence is k = 1, closely followed
by 2 and 3. This supports the claim that players tend to exhibit streaky behavior
when batting, a conclusion reached with or without covariates. Streakiness thus
appears not to be linked to external stimuli and could be seen as a feature inherent
to the players. The second panel of Figure 4 shows inference
on the number of clusters |ρ| for the model with covariates. The posterior is almost
entirely concentrated on 2 and 3 clusters. This suggests that, after correcting for the
effect of covariates, we find essentially two or three groups of players that present
similar patterns of streakiness. Ignoring the covariates results in the creation of some
extra clusters of players, as can be seen in Figure 2, middle row.
5 Discussion
We discussed a modeling framework to assess the order of dependence for multiple
binary sequences that are a priori assumed to be marginally partially exchangeable.
The semiparametric model specification allows inference on the order of dependence
without constraint to a specific parametric form. Despite its generality, posterior
simulation in the model is reasonably straightforward. We introduced an appropriate
MCMC strategy that alternates Metropolis-Hastings with Gibbs sampling steps. The
basic idea is to update the clustering structure only when the proposed order of
dependence coincides with the current one, which is done via a full Gibbs sampling
scan. Otherwise, we keep the cluster memberships and propose only new locations.
The partially exchangeable models are extended to accommodate covariates, using
autologistic regression models, with main effects and possibly interactions of lagged
binary outcomes. Model selection in this extended model has two aspects, one related
to the order of dependence, and the other related to the covariates. The MCMC al-
gorithm is adapted to this new scenario. We experimented with alternative updating
strategies, including schemes that update the cluster memberships at each proposal
step. However, the resulting Markov chain simulations were computationally
inefficient, with high rejection rates and slow mixing.
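In schematic form, the update that worked well can be written as follows. This is a structural sketch only, with hypothetical function names; the actual sampler operates on the full semiparametric model, and the pseudo-prior correction terms of Carlin and Chib (1995) are folded into `log_post` here for brevity.

```python
import math
import random

def update_order(state, K, log_post, gibbs_scan_memberships, propose_locations):
    """One iteration of the order-of-dependence update.  `state` holds
    the current order state['k'] plus cluster memberships and locations.
    If the proposed order equals the current one, refresh the
    memberships with a full Gibbs scan; otherwise keep memberships
    fixed, propose new locations for the new order, and accept or
    reject with a Metropolis-Hastings step."""
    k_new = random.randint(0, K)          # uniform proposal over orders
    if k_new == state['k']:
        # Same order: full Gibbs scan over cluster memberships.
        return gibbs_scan_memberships(state)
    # Different order: keep memberships, propose new locations only.
    candidate = propose_locations(state, k_new)
    log_ratio = log_post(candidate) - log_post(state)
    if math.log(random.random()) < log_ratio:
        return candidate                  # accept move to order k_new
    return state                          # reject: stay at current order
```

Restricting the Gibbs scan to same-order proposals is what avoids the high rejection rates observed when memberships are updated at every proposal step.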
The proposed framework is easily extended to multinomial sequences. The only
modification in the basic model is to allow y_{ij} ∈ {0, 1, . . . , C} in (2). For moderately
small C, say C = 3 or 4, the resulting posterior MCMC can still be implemented with
reasonable computational effort.
In the development of the discussed models and in the examples we have focused
on inference about the order of dependence and on variable selection. Another
interesting aspect of the model is the ability to semiparametrically estimate the joint
probability model for the binary (or multinomial) sequence. This is important, for
example, in decision problems built around binary (or multinomial) sequences, such
as planning the sampling times in clinical trials with binary or categorical outcomes.
The two applications discussed in Section 4 provide some useful guidelines for
assessing the order of dependence. For short to moderate sequences we advise against
choosing large K. A possible alternative for such sequences is to consider a penalty
term in the loss function that accounts for model complexity. A simpler alternative
that we explore here is to restrict K to values well below the average sequence length,
and examine how results are affected with increasing K.
Acknowledgments
We thank the Referees and the Associate Editor for the critical review and the
many constructive comments that helped to substantially improve the manuscript.
References
Albright, S. C. (1993). A statistical analysis of hitting streaks in baseball (with
discussion: pp. 1184–1196), Journal of the American Statistical Association 88: 1175–1183.
Beckett, L. and Diaconis, P. (1994). Spectral Analysis for Discrete Longitudinal Data,
Advances in Mathematics 103: 107–128.
Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems,
Journal of the Royal Statistical Society, Series B 36: 192–236.
Best, N. G., Cowles, M. K. and Vines, K. (1995). CODA: Convergence diagnosis
and output analysis software for Gibbs sampling output, Version 0.30, Technical
report, MRC Biostatistics Unit.
Blackwell, D. and MacQueen, J. B. (1973). Ferguson distributions via Polya urn
schemes, The Annals of Statistics 1: 353–355.
Brooks, S. and Gelman, A. (1998). General methods for monitoring convergence of it-
erative simulations, Journal of Computational and Graphical Statistics 7(4): 434–
455.
Bush, C. A. and MacEachern, S. N. (1996). A semiparametric Bayesian model for
randomised block designs, Biometrika 83(2): 275–285.
Carlin, B. P. and Chib, S. (1995). Bayesian Model Choice via Markov Chain Monte
Carlo Methods, Journal of the Royal Statistical Society B 57: 473–484.
Chen, M.-H., Shao, Q.-M. and Ibrahim, J. G. (2000). Monte Carlo Methods in
Bayesian Computation, New York: Springer-Verlag.
Connolly, M. A. and Liang, K.-Y. (1988). Conditional Logistic Regression Models for
Correlated Binary Data, Biometrika 75: 501–506.
Cowles, M. and Carlin, B. (1996). Markov chain Monte Carlo convergence diagnostics:
a comparative review, Journal of the American Statistical Association 91: 883–
904.
Diaconis, P. (1988). Recent progress on de Finetti’s notions of exchangeability, in
J. Bernardo, M. DeGroot, D. Lindley and A. Smith (eds), Bayesian Statistics
3, Oxford University Press, pp. 111–125.
Escobar, M. and West, M. (1995). Bayesian density estimation and inference using
mixtures, Journal of the American Statistical Association 90: 577–588.
Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems, The
Annals of Statistics 1: 209–230.
Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to calcu-
lating posterior moments, in J. Bernardo, J. Berger, A. Dawid and A. Smith
(eds), Bayesian Statistics 4, Oxford: Oxford University Press, pp. 169–193.
Gilks, W. and Wild, P. (1992). Adaptive rejection sampling for Gibbs sampling,
Applied Statistics 41: 337–348.
Godsill, S. J. (2001). On the relationship between MCMC model uncertainty methods,
Journal of Computational and Graphical Statistics 10: 230–248.
Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and
Bayesian model determination, Biometrika 82: 711–732.
Heidelberger, P. and Welch, P. (1988). Simulation run length control in the presence
of an initial transient, Operations Research 31: 1109–1144.
Hoeting, J. A., Leecaster, M. and Bowden, D. (2000). An improved model for spatially
correlated binary responses, Biological and Environmental Statistics 5(1): 102–
114.
Hoeting, J. A., Madigan, D., Raftery, A. E. and Volinsky, C. T. (1999). Bayesian
Model Averaging: A Tutorial, Statistical Science 14(4): 382–417.
Liu, J. S. (1996). Nonparametric Hierarchical Bayes via Sequential Imputations, The
Annals of Statistics 24(3): 911–930.
MacEachern, S. N. and Muller, P. (1998). Estimating Mixture of Dirichlet Process
Models, Journal of Computational and Graphical Statistics 7(2): 223–238.
MacEachern, S. N. and Muller, P. (2000). Efficient MCMC Schemes for Robust Model
Extensions using Encompassing Dirichlet Process Mixture Models, in F. Ruggeri
and D. Ríos-Insua (eds), Robust Bayesian Analysis, New York: Springer-Verlag,
pp. 295–316.
MacEachern, S. N., Clyde, M. and Liu, J. S. (1999). Sequential Importance Sampling
for Nonparametric Bayes Models: The Next Generation, Canadian Journal of
Statistics 27: 251–267.
Martin, J. (1967). Bayesian Decision Processes and Markov chains, New York: Wiley.
Muller, P. and Rosner, G. (1998). Semi-parametric PK/PD Models, in D. Dey,
P. Muller and D. Sinha (eds), Practical Nonparametric and Semiparametric
Bayesian Statistics, New York: Springer-Verlag, pp. 323–338.
Qu, Y., Williams, G. W., Beck, G. J. and Goormastic, M. (1987). A Generalized
Model of Logistic Regression for Clustered Data, Communications in Statistics,
Part A – Theory and Methods 16: 3447–3476.
Quintana, F. A. (1998). Nonparametric Bayesian analysis for assessing homogeneity in
k × l contingency tables with fixed right margin totals, Journal of the American
Statistical Association 93: 1140–1149.
Quintana, F. A. and Newton, M. A. (1998). Assessing the Order of Dependence
for Partially Exchangeable Binary Data, Journal of the American Statistical
Association 93: 194–202.
Quintana, F. A. and Newton, M. A. (1999). Parametric Partially Exchangeable
Models and Polya Urns for Multiple Binary Sequences, Brazilian Journal of
Probability and Statistics 13: 55–76.
Quintana, F. A. and Newton, M. A. (2000). Computational aspects of Nonparametric
Bayesian analysis with applications to the modeling of multiple binary sequences,
Journal of Computational and Graphical Statistics 9(4): 711–737.
Quintana, F. A., Liu, J. S. and del Pino, G. E. (1999). Monte Carlo EM with
Importance Reweighting and its Applications to Random Effects Models, Com-
putational Statistics and Data Analysis 29: 429–444.
Smith, B. J. (2000). Bayesian Output Analysis Program (BOA), Version 0.5.0 for
S-PLUS and R. Available at http://www.public-health.uiowa.edu/BOA.
Tierney, L. (1994). Markov chains for exploring posterior distributions, Annals of
Statistics 22(4): 1701–1762.
Name Number Variable Description
I7 1 1 if game is into the 7th inning or beyond; 0 otherwise
O2 2 1 if there are 2 outs; 0 otherwise
Score 3 number of runs batter’s team is ahead by (if positive) or behind by (if negative)
R123 4 1 if any runners are on base; 0 otherwise
R23 5 1 if any runners are on 2nd or 3rd; 0 otherwise
Game 6 a chronological index of the game number
DN 7 1 for a night game; 0 for a day game
HA 8 1 for a home game; 0 for an away game
T 9 1 if opposing pitcher is right-handed; 0 if left-handed
ERA 10 opposing pitcher’s earned run average for that season
Turf 11 1 if playing on grass; 0 if artificial turf
Table 1: Covariates for Baseball Dataset in Section 4. Note: to make the values of
all variables comparable Game was divided by 100.
Estimates
Quantity of Interest K = 2 K = 5 K = 7
P (k = 0|y) 0.1193 (0.00502) 0.0121 (0.00205) 0.0138 (0.00224)
P (k = 1|y) 0.5357 (0.00631) 0.0576 (0.00607) 0.0497 (0.00546)
P (k = 2|y) 0.3450 (0.00754) 0.0840 (0.00571) 0.0695 (0.00619)
P (k = 3|y) – 0.1722 (0.00717) 0.0967 (0.00682)
P (k = 4|y) – 0.3121 (0.00765) 0.1233 (0.00701)
P (k = 5|y) – 0.3620 (0.01141) 0.1692 (0.00666)
P (k = 6|y) – – 0.2550 (0.00866)
P (k = 7|y) – – 0.2228 (0.01020)
E(p^1_{0,0,0,0,0;0}|y) 0.496 (0.0010) 0.469 (0.0018) 0.536 (0.0023)
Var(p^1_{0,0,0,0,0;0}|y) 0.0054 (0.00018) 0.0233 (0.0006) 0.044 (0.0009)
E(|ρ||y) 4.11 (0.053) 5.67 (0.065) 5.99 (0.070)
Var(|ρ||y) 2.85 (0.107) 3.77 (0.156) 4.34 (0.163)
Table 2: Posterior quantities of interest for thumb tack dataset. Monte Carlo standard
errors are indicated between parentheses.
Parameter Values Estimates
a^k_{i_k···i_1} b^k_{i_k···i_1} P (k = 0|y) P (k = 1|y) P (k = 2|y)
1 1 0.1193 0.5357 0.3450
10 1 0.0923 0.5414 0.3663
1 10 0.1044 0.5402 0.3554
0.5 0.5 0.1205 0.5291 0.3504
5 5 0.0814 0.5606 0.3580
10 10 0.0578 0.5649 0.3773
Table 3: Sensitivity analysis of posterior distribution of order of dependence to choices
of parameters for the centering measure. The maximum order was K = 2.
Figure 1: Posterior summaries for thumb tack dataset. Columns represent results
for different values of the maximum order K. First and second rows show estimated
posterior probabilities for order of dependence and number of clusters. The last row
shows the posterior distribution of p^1_{0,0,0,0,0}, the first component of θ_1 in (2).
Estimates
Quantity of Interest K = 2 K = 5 K = 7
P (k = 0|y) 0.2403 (0.00749) 0.1321 (0.00630) 0.0983 (0.00550)
P (k = 1|y) 0.4804 (0.00626) 0.3110 (0.00786) 0.3456 (0.00999)
P (k = 2|y) 0.2793 (0.00736) 0.2124 (0.00646) 0.1857 (0.00619)
P (k = 3|y) – 0.1354 (0.00506) 0.1082 (0.00444)
P (k = 4|y) – 0.1021 (0.00516) 0.0631 (0.00324)
P (k = 5|y) – 0.1070 (0.00666) 0.0503 (0.00326)
P (k = 6|y) – – 0.0881 (0.00694)
P (k = 7|y) – – 0.0608 (0.00546)
E(p^1_{0,0,0,0,0;0}|y) 0.652 (0.0001) 0.651 (0.0002) 0.652 (0.0001)
Var(p^1_{0,0,0,0,0;0}|y) 0.0587 (0.06359) 0.0581 (0.06357) 0.0587 (0.06548)
E(|ρ||y) 3.28 (0.038) 2.63 (0.039) 1.99 (0.028)
Var(|ρ||y) 1.49 (0.0548) 1.36 (0.058) 0.83 (0.042)
Table 4: Posterior quantities of interest for baseball dataset (ignoring covariates).
Monte Carlo standard errors are indicated between parentheses.
Figure 2: Posterior summaries for the baseball dataset. Columns represent results
for different values of the maximum order K. First and second rows show estimated
posterior probabilities for order of dependence and number of clusters. The last row
shows the posterior distribution of p^1_{0,0,0,0,0}, the first component of θ_1 in (2).
Figure 3: Posterior distributions for all regression coefficients, including percentiles
2.5%, 50% and 97.5% (dotted lines). Upper left plot represents the percentage of times
each covariate was selected in the model.
Quantities of Interest Estimated Probability
P (γ1 = 1|y) 0.08217 (0.002573)
P (γ2 = 1|y) 0.99855 (0.000418)
P (γ3 = 1|y) 0.01067 (0.000825)
P (γ4 = 1|y) 0.99675 (0.000889)
P (γ5 = 1|y) 0.99950 (0.000244)
P (γ6 = 1|y) 0.04765 (0.002470)
P (γ7 = 1|y) 0.02592 (0.001607)
P (γ8 = 1|y) 0.30167 (0.006333)
P (γ9 = 1|y) 0.00983 (0.000852)
P (γ10 = 1|y) 0.99967 (0.000202)
P (γ11 = 1|y) 0.01107 (0.000874)
P (∑γi = 1|y) - (-)
P (∑γi = 2|y) - (-)
P (∑γi = 3|y) 0.00210 (0.000685)
P (∑γi = 4|y) 0.58110 (0.006210)
P (∑γi = 5|y) 0.35267 (0.005520)
P (∑γi = 6|y) 0.05968 (0.002355)
P (∑γi = 7|y) 0.00437 (0.000624)
P (∑γi = 8|y) 0.00003 (0.000024)
P (∑γi = 9|y) 0.00003 (0.000026)
P (∑γi = 10|y) 0.00002 (0.000002)
P (∑γi = 11|y) - (-)
P (γ = (0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0)|y) 0.58017 (0.006230)
P (γ = (0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0)|y) 0.24277 (0.005476)
P (γ = (1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0)|y) 0.04163 (0.001656)
Table 5: Posterior probabilities for quantities related to variable selection (baseball
dataset), including the three most likely models. Monte Carlo standard errors are
indicated between parentheses. Note: no models with 1, 2 or 11 variables were
visited.
Figure 4: Posterior distributions of the order of dependence, the number of clusters,
and the main-effect autologistic regression coefficients for subject 1, including
percentiles 2.5%, 50% and 97.5% (dotted lines).