Behavior of the Gibbs Sampler When Conditional
Distributions Are Potentially Incompatible
Shyh-Huei Chen
Department of Biostatistical Sciences, Wake Forest University School of Medicine,
Winston-Salem, NC 27157
Edward H. Ip
Department of Biostatistical Sciences, Department of Social Sciences and Health Policy,
Wake Forest University School of Medicine, Winston-Salem, NC 27157
Shyh-Huei Chen (E-mail: [email protected]) is assistant professor at the Department of
Biostatistical Sciences, Division of Public Health Sciences, Wake Forest School of Medicine,
Wells Fargo Center 23rd floor, Medical Center Blvd, Winston-Salem, NC 27157. Edward H. Ip
(E-mail: [email protected]) is Professor at the Department of Biostatistical Sciences and the
Department of Social Sciences and Health Policy, Division of Public Health Sciences, Wake
Forest School of Medicine, Wells Fargo Center 23rd floor, Medical Center Blvd, Winston-Salem,
NC 27157. This work was partially supported by NIH grant 1R21AG042761-01 (PI: Ip).
ABSTRACT
The Gibbs sampler has been used extensively in the statistics literature. It relies on
iteratively sampling from a set of compatible conditional distributions, and the sampler is
known to converge to a unique invariant joint distribution. However, the Gibbs sampler
behaves rather differently when the conditional distributions are not compatible. Such
samplers have seen increasing use in areas such as multiple imputation. In this paper,
we demonstrate that what a Gibbs sampler converges to is a function of the order of the
sampling scheme. Besides providing empirical examples to illustrate this behavior, we
also explain how it happens through a thorough analysis of the examples.
KEY WORDS: Gibbs chain; Gibbs sampler; Potentially incompatible conditionally specified distribution.
1. INTRODUCTION
The Gibbs sampler is one of the most prominent Markov chain Monte Carlo
(MCMC)-based methods. Partly because of its conceptual simplicity and elegance of
implementation, the Gibbs sampler has been used across an increasingly broad range of
subject areas, including bioinformatics and spatial analysis. While its roots date back to
earlier work (e.g., Hastings 1970), the popularity of Gibbs sampling is commonly
credited to Geman and Geman (1984), in which the algorithm was used as a tool for
image processing. Its use in statistics, especially Bayesian analysis, has since grown very
rapidly (Gelfand and Smith 1990; Smith and Roberts 1993; Gilks, Richardson, and
Spiegelhalter 1996). For a quick introduction to the algorithm, see Casella and George
(1992).
One of the recent developments of the Gibbs sampler is its application to
potentially incompatible conditionally specified distributions (PICSD). When statistical
models involve high-dimensional data, it is often easier to specify conditional
distributions than the entire joint distribution. However, the approach of specifying
conditional distributions individually carries the risk of not forming a compatible joint
model. Consider a system of $d$ discrete random variables $X = \{x_1, x_2, \ldots, x_d\}$,
whose fully conditional model is specified by $\mathcal{F} = \{f_1, f_2, \ldots, f_d\}$, where
$f_k \equiv f(x_k \mid x_k^c)$ and $x_k^c$ is the relative complement of $x_k$ with respect to
$X$. If the conditional models are individually specified, then there may not exist a joint
distribution that gives rise to the specified set of conditional distributions. In such a case,
we call $\mathcal{F}$ incompatible.

The study of PICSD is closely related to the Gibbs sampler because the latter relies
on iteratively drawing samples from $\mathcal{F}$ to form a Markov chain. Under mild
conditions, the Markov chain converges to the desired joint distribution if $\mathcal{F}$ is
compatible. However, if $\mathcal{F}$ is not compatible, then the Gibbs sampler can exhibit
erratic behavior (e.g., Hobert and Casella 1998).
In this paper, our goal is to demonstrate the behavior of the Gibbs sampler (or the
pseudo Gibbs sampler, as it is not a true Gibbs sampler in the traditional sense of
presumed compatible conditional distributions) for PICSD. By using several simple
examples, we show mathematically that what a Gibbs sampler converges to is a function
of the order of the sampling scheme in the Gibbs sampler. Furthermore, we show that if
we follow a random order in sampling conditional distributions at each iteration—i.e.,
using a random-scan Gibbs sampler (Liu, Wong, and Kong 1995)—then Gibbs
sampling leads to a mixture of the joint distributions formed by the possible
fixed-order (or, more formally, fixed-scan) schemes when d = 2, but the result does not
hold when d > 2. This result is a refinement of a conjecture put forward in Liu (1996).
The demonstration in this paper is intended to provide readers unfamiliar with
incompatible conditional distributions with some basic background on the mechanism
driving the behavior of the Gibbs sampler for PICSD. Two recent developments in the
statistical and machine-learning literature underscore the importance of the current work.
The first is in the application of the Gibbs sampler to a dependency network, which is a
type of generalized graphical model specified by conditional probability distributions
(Heckerman et al. 2000). One approach to learning a dependency network is to first
specify individual conditional models and then apply a (pseudo) Gibbs sampler to
estimate the joint model. The authors acknowledged the possibility of incompatible
conditional models but argued that when the sample size is large, the degree of
incompatibility will not be substantial and the Gibbs sampler is still applicable. Yet
another example is the use of the fully conditional specification for multiple imputation
of missing data (van Buuren et al. 1999, 2006). The method, which is also called multiple
imputation by chained equations (MICE), makes use of a Gibbs sampler or other MCMC-
based methods that operate on a set of conditionally specified models. For each variable
with a missing value, an imputed value is created under an individual conditional-
regression model. This kind of procedure was viewed as combining the best features of
many currently available multiple imputation approaches (Rubin 2003). Due to its
flexibility over compatible multivariate-imputation models (Schafer 1997) and its ability to
handle different variable types (continuous, binary, and categorical), MICE has gained
acceptance as a practical treatment of missing data, especially in high-dimensional data
sets (Rässler, Rubin, and Zell 2008). Popular as it is, MICE has the limitation of
potentially encountering incompatible conditional-regression models, and it has been
shown that an incompatible imputation model can lead to biased estimates from imputed
data (Drechsler and Rässler 2008). So far, very little theory has been developed to
support the use of MICE (White, Royston, and Wood 2011). A better understanding of
the theoretical properties of applying the Gibbs sampler to PICSD could lead to important
refinements of these imputation methods in practice.
The article is organized as follows. First, we provide basic background on the Gibbs
chain and the Gibbs sampler and define the scan order of a Gibbs sampler. Section 3
describes a simple example to demonstrate the convergence behavior of a Gibbs sampler
as a function of scan order, both by applying matrix algebra to the transition kernel and
by using MCMC-based computation. In Section 4, we offer several analytic results
concerning the stationary distributions of the Gibbs sampler under different scan patterns,
and a counter-example to a surmise about the Gibbs sampler under a random scan order.
Finally, in Section 5, we provide a brief discussion.
2. GIBBS CHAIN AND GIBBS SAMPLER
Continuing the notation of the previous section, let $\mathbf{a} = (a_1, a_2, \ldots, a_d)$
denote a permutation of $\{1, 2, \ldots, d\}$, and let $\mathbf{x} = (x_1, x_2, \ldots, x_d)^T$
denote a realization of $X$ with $x_k \in \{1, 2, \ldots, C_k\}$, where $C_k$ is the number of
categories of the $k$th variable. Thus, $\mathbf{x}_\mathbf{a} \equiv (x_{a_1}, x_{a_2}, \ldots, x_{a_d})$
is a realization of $X$ defined in the order of $\mathbf{a}$. For a specified $\mathcal{F}$, the
associated fixed (systematic)-scan Gibbs chain governed by a scan pattern $\mathbf{a}$ can be
implemented as follows:

1. Pick an arbitrary starting vector $\mathbf{x}_\mathbf{a}^{(0)} = (x_{a_1}^{(0)}, x_{a_2}^{(0)}, \ldots, x_{a_d}^{(0)})$.
2. On the $t$th cycle, successively draw from the full conditional distributions
according to scan pattern $\mathbf{a}$, with all coordinates other than the one being drawn
carried over:
$$x_{a_k}^{(s)} \sim f_{a_k}\bigl(x_{a_k} \mid \mathbf{x}_{a_k^c}^{(s-1)}\bigr), \qquad s = (t-1)d + k, \quad k = 1, \ldots, d.$$

The series $\mathbf{x}^{(0)}, \mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(s)}, \ldots$ obtained one
draw (iteration) at a time is called a realization of the Gibbs chain defined by $\mathcal{F}$
with scan pattern $\mathbf{a}$; and the series $\mathbf{x}^{(0)}, \mathbf{x}^{(d)}, \mathbf{x}^{(2d)}, \ldots, \mathbf{x}^{(td)}, \ldots$
obtained one cycle at a time is a realization of the associated Gibbs sampler. For
example, let $X = (x_1, x_2, x_3, x_4)$ and $\mathbf{a} = (2, 4, 1, 3)$. Given initial value
$\mathbf{x}^{(0)} = (x_2^{(0)}, x_4^{(0)}, x_1^{(0)}, x_3^{(0)})$, the Gibbs chain in cycle 1 performs the
following draws and produces the corresponding states:

$x_2^{(1)} \sim f_2(x_2 \mid x_4 = x_4^{(0)}, x_1 = x_1^{(0)}, x_3 = x_3^{(0)})$, $\mathbf{x}^{(1)} = (x_2^{(1)}, x_4^{(0)}, x_1^{(0)}, x_3^{(0)})$;
$x_4^{(2)} \sim f_4(x_4 \mid x_2 = x_2^{(1)}, x_1 = x_1^{(0)}, x_3 = x_3^{(0)})$, $\mathbf{x}^{(2)} = (x_2^{(1)}, x_4^{(2)}, x_1^{(0)}, x_3^{(0)})$;
$x_1^{(3)} \sim f_1(x_1 \mid x_2 = x_2^{(1)}, x_4 = x_4^{(2)}, x_3 = x_3^{(0)})$, $\mathbf{x}^{(3)} = (x_2^{(1)}, x_4^{(2)}, x_1^{(3)}, x_3^{(0)})$; and
$x_3^{(4)} \sim f_3(x_3 \mid x_2 = x_2^{(1)}, x_4 = x_4^{(2)}, x_1 = x_1^{(3)})$, $\mathbf{x}^{(4)} = (x_2^{(1)}, x_4^{(2)}, x_1^{(3)}, x_3^{(4)})$.

In this example, the series $\mathbf{x}^{(0)}, \mathbf{x}^{(4)}, \mathbf{x}^{(8)}, \ldots$ is the realization of the
Gibbs sampler defined by $\mathcal{F}$ with scan pattern $\mathbf{a}$.

We can also express a Gibbs sampler of random scan order as a Gibbs chain. Let
$\mathbf{r} = (r_1, r_2, \ldots, r_d)$ be the set of selection probabilities, where $r_k > 0$ is the
probability of visiting the conditional $f_k$ and $\sum_{k=1}^{d} r_k = 1$. The random-scan
Gibbs sampler (Levine and Casella 2006) can be stated as follows:

1. Pick an arbitrary starting vector $\mathbf{x}^{(0)} = (x_1^{(0)}, x_2^{(0)}, \ldots, x_d^{(0)})$.
2. At the $s$th iteration, $s = 1, 2, \ldots$:
   a. Randomly choose $k \in \{1, 2, \ldots, d\}$ with probability $r_k$;
   b. Draw $x_k^{(s)} \sim f_k\bigl(x_k \mid \mathbf{x}_{k^c}^{(s-1)}\bigr)$.
3. Repeat step 2 until a convergence criterion is reached.
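To make the two schemes concrete, the following minimal Python sketch is our own
illustration and not the authors' implementation; the function and variable names
(fixed_scan_cycle, random_scan_step) are ours. Each conditional f_k is represented as a
callable that returns the probability vector of x_k given the rest of the current state.

import numpy as np

rng = np.random.default_rng(0)

def fixed_scan_cycle(x, conditionals, scan):
    # One cycle of a fixed-scan (pseudo) Gibbs sampler.
    # x: current state, a list of 0-based category indices.
    # conditionals[k](x): probability vector of x_k given the rest of x.
    # scan: scan pattern a = (a_1, ..., a_d), given as 0-based indices.
    for k in scan:
        probs = conditionals[k](x)
        x[k] = rng.choice(len(probs), p=probs)   # draw x_k given x_{k^c}
    return x

def random_scan_step(x, conditionals, r):
    # One iteration of a random-scan Gibbs sampler with selection
    # probabilities r = (r_1, ..., r_d), all r_k > 0 and sum(r) = 1.
    k = rng.choice(len(r), p=r)                  # step 2a: pick a coordinate
    probs = conditionals[k](x)
    x[k] = rng.choice(len(probs), p=probs)       # step 2b: draw x_k given x_{k^c}
    return x

For instance, calling fixed_scan_cycle repeatedly with scan = (1, 3, 0, 2) realizes the
scan pattern a = (2, 4, 1, 3) from the example above.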
3. ILLUSTRATIVE EXAMPLES
Example 1. (Compatible conditional distributions). Consider the following bivariate 2 × 2
joint distribution $\pi$ for $(X_1, X_2)$ defined on the domain {1, 2}, with its
corresponding conditional distributions $f_1(x_1 \mid x_2)$ and $f_2(x_2 \mid x_1)$ (Arnold, Castillo, and
Sarabia 2002, p. 242):

$$\pi = \begin{pmatrix} \tfrac{1}{10} & \tfrac{2}{10} \\ \tfrac{3}{10} & \tfrac{4}{10} \end{pmatrix}, \qquad f_1 = \begin{pmatrix} \tfrac{1}{4} & \tfrac{1}{3} \\ \tfrac{3}{4} & \tfrac{2}{3} \end{pmatrix}, \quad \text{and} \quad f_2 = \begin{pmatrix} \tfrac{1}{3} & \tfrac{2}{3} \\ \tfrac{3}{7} & \tfrac{4}{7} \end{pmatrix},$$

where rows are indexed by $x_1$ and columns by $x_2$; the $(i, j)$ entry of $f_1$ is
$f_1(x_1 = i \mid x_2 = j)$, and the $(i, j)$ entry of $f_2$ is $f_2(x_2 = j \mid x_1 = i)$.

There are 4 possible states, (1, 1), (1, 2), (2, 1), and (2, 2), for the Gibbs chain. The
transition from one state to another is governed by the conditional matrices $f_1$
and $f_2$. As a shorthand, we denote an entry in a matrix as $f_1(\cdot, \cdot)$; e.g., $f_1(1, 2) = 1/3$. In
order to keep track of the scan order, we denote the state in the Gibbs chain as
$(\mathbf{x}^{(t)} \mid f_{a_k})$ if the current state at time $t$ is the result of drawing from the conditional $f_{a_k}$. To fix
ideas, we use a fixed-scan Gibbs sampler with $\mathbf{a} = (1, 2)$ and the conditional distributions
$(f_1, f_2)$. The transition kernel for the Gibbs chain is diagrammatically represented in
Figure 1, where $P_1$ and $P_2$ indicate local transition probabilities. For example, the local
transition probability from $(\mathbf{x}^{(2t)} = (1,1) \mid f_2)$ to $(\mathbf{x}^{(2t+1)} = (1,1) \mid f_1)$ is $f_1(1,1) = 1/4$, and
from $(\mathbf{x}^{(2t)} = (1,1) \mid f_2)$ to $(\mathbf{x}^{(2t+1)} = (1,2) \mid f_1)$ it is 0 (indicated by disconnectedness).
Figure 1. Transition probabilities of the Gibbs chain in Example 1.
By arranging the states in lexicographic order such that the first index changes the
fastest and the last index the slowest, the transition probability matrices $T_1$ and $T_2$ that
correspond respectively to $P_1$ and $P_2$ are

$$T_1 = \begin{pmatrix} \tfrac14 & \tfrac34 & 0 & 0 \\ \tfrac14 & \tfrac34 & 0 & 0 \\ 0 & 0 & \tfrac13 & \tfrac23 \\ 0 & 0 & \tfrac13 & \tfrac23 \end{pmatrix} \quad \text{and} \quad T_2 = \begin{pmatrix} \tfrac13 & 0 & \tfrac23 & 0 \\ 0 & \tfrac37 & 0 & \tfrac47 \\ \tfrac13 & 0 & \tfrac23 & 0 \\ 0 & \tfrac37 & 0 & \tfrac47 \end{pmatrix}.$$

More generally, the local transition probability (Madras 2002, p. 77) for two
successive states of the Gibbs chain, $(\mathbf{x}^{(s-1)} \mid f_{a_{k-1}})$ and $(\mathbf{x}^{(s)} \mid f_{a_k})$, can be defined by

$$P_{a_k}\bigl(\mathbf{x}^{(s-1)}, \mathbf{x}^{(s)}\bigr) = \begin{cases} f_{a_k}\bigl(x_{a_k}^{(s)} \mid \mathbf{x}_{a_k^c}^{(s)}\bigr), & \text{if } \mathbf{x}_{a_k^c}^{(s)} = \mathbf{x}_{a_k^c}^{(s-1)}; \\ 0, & \text{otherwise.} \end{cases}$$

The matrices $T_1$ and $T_2$ in Example 1 have two pairs of identical rows and are
idempotent but not irreducible. As this example illustrates, a Gibbs chain is generally not
homogeneous, but if one defines a surrogate transition probability matrix
$T_\mathbf{a} = T_{a_1} T_{a_2} T_{a_3} \cdots T_{a_d}$, then a homogeneous chain with transition matrix $T_\mathbf{a}$ can be formed for
the scan pattern $\mathbf{a} = (a_1, \ldots, a_d)$. In other words, for a collection of full conditional
distributions $\mathcal{F}$ and a scan pattern $\mathbf{a}$, the fixed-scan Gibbs sampler is a homogeneous
Markov chain with transition matrix $T_\mathbf{a}$. Analogously, a random-scan Gibbs sampler with
selection probabilities $\mathbf{r} = (r_1, r_2, \ldots, r_d)$ can also be transformed into a homogeneous Markov
chain by defining $T_\mathbf{r} \equiv \sum_{k=1}^{d} r_k T_k$ as the surrogate transition probability matrix. The
corresponding stationary distributions $\pi_\mathbf{a}$ and $\pi_\mathbf{r}$ can be directly computed by evaluating
$\lim_{m \to \infty} T_\mathbf{a}^m = \mathbf{1}_C \pi_\mathbf{a}^T$ and $\lim_{m \to \infty} T_\mathbf{r}^m = \mathbf{1}_C \pi_\mathbf{r}^T$, where $C = \prod_{k=1}^{d} C_k$ and $\mathbf{1}_C$ is a $C$-dimensional
vector of 1's.
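To make this computation concrete, here is a short Python sketch (our own
illustration, not code from the paper; numpy is assumed) that builds T_1 and T_2 of
Example 1 from f_1 and f_2 and evaluates the matrix power for a = (1, 2); every row
converges to the stationary distribution (0.1, 0.3, 0.2, 0.4).

import numpy as np

# Entries: f1[i, j] = f1(x1 = i+1 | x2 = j+1); f2[i, j] = f2(x2 = j+1 | x1 = i+1).
f1 = np.array([[1/4, 1/3], [3/4, 2/3]])
f2 = np.array([[1/3, 2/3], [3/7, 4/7]])

# States in lexicographic order (1,1), (2,1), (1,2), (2,2):
# index = x1 + 2*x2 with x1, x2 in {0, 1}.
T1 = np.zeros((4, 4))
T2 = np.zeros((4, 4))
for x1 in range(2):
    for x2 in range(2):
        s = x1 + 2 * x2
        for y in range(2):
            T1[s, y + 2 * x2] = f1[y, x2]   # resample x1, keep x2
            T2[s, x1 + 2 * y] = f2[x1, y]   # resample x2, keep x1

Ta = T1 @ T2                                # surrogate kernel for a = (1, 2)
print(np.linalg.matrix_power(Ta, 32)[0])    # -> [0.1  0.3  0.2  0.4]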
In Example 1, the transition matrices for the fixed and random scans are respectively
$T_{\mathbf{a}_1} = T_1 T_2$ for $\mathbf{a}_1 = (1, 2)$, $T_{\mathbf{a}_2} = T_2 T_1$ for $\mathbf{a}_2 = (2, 1)$, and $T_\mathbf{r} = (T_1 + T_2)/2$ for $\mathbf{r}_0 = (\tfrac12, \tfrac12)$. Table 1
directly compares the joint distributions obtained from the following computations: (1)
direct MCMC Gibbs sampling for the only two possible fixed-scan patterns, $\mathbf{a}_1 = (1, 2)$
and $\mathbf{a}_2 = (2, 1)$; (2) direct MCMC Gibbs sampling for random-scan patterns with the
following selection probabilities: $\mathbf{r}_0 = (\tfrac12, \tfrac12)$, $\mathbf{r}_1 = (\tfrac13, \tfrac23)$, and $\mathbf{r}_2 = (\tfrac23, \tfrac13)$; (3) matrix
multiplication using $T_\mathbf{a}^m$ with a low power (m = 4) and a high power (m = 32); and (4) matrix
multiplication using $T_\mathbf{r}^m$, also with low and high powers. For both (1) and (2), we used the
first 5,000 cycles as burn-in and the subsequent 1,000,000 cycles for sampling.

As expected, both the fixed-scan Gibbs samplers, regardless of scan order, and the
random-scan Gibbs samplers numerically converge to the same joint distribution
(convergence is defined here as all cell-wise differences between estimates from two
consecutive iterations being less than $0.5 \times 10^{-4}$). Table 1 also demonstrates that direct
matrix multiplication of the transition probabilities produces rapid convergence, even for
a small $m$ and different values of $\mathbf{r}$. However, we also observed that if $\mathbf{r}$ was heavily
imbalanced, it took many more iterations to achieve numerical convergence (not shown).
For example, if $\mathbf{r} = (\tfrac{1}{10}, \tfrac{9}{10})$, it took $m > 120$ to achieve the same numerical
convergence (up to 4 decimal places).
Table 1. Joint distributions produced by various Gibbs samplers for Example 1.

                      (1,1)    (2,1)    (1,2)    (2,2)
π                     0.1      0.3      0.2      0.4
a1 = (1,2)            0.1002   0.3002   0.2000   0.3997
a2 = (2,1)            0.0998   0.3004   0.1999   0.4000
r0 = (1/2, 1/2)       0.1007   0.2998   0.2000   0.3995
π_a1 (m = 4)          0.1000   0.3000   0.2000   0.4000
π_a2 (m = 4)          0.1000   0.3000   0.2000   0.4000
π_r0 (m = 32)         0.1000   0.3000   0.2000   0.4000
π_r1 (m = 32)         0.1000   0.3000   0.2000   0.4000
π_r2 (m = 32)         0.1000   0.3000   0.2000   0.4000
Example 2. (Incompatible conditional distributions). Consider a pair of 2 × 2 conditional
distributions $f_1(x_1 \mid x_2)$ and $f_2(x_2 \mid x_1)$ defined on the domain {1, 2} as follows (Arnold,
Castillo, and Sarabia 2002, p. 242):

$$f_1 = \begin{pmatrix} \tfrac14 & \tfrac13 \\ \tfrac34 & \tfrac23 \end{pmatrix} \quad \text{and} \quad f_2 = \begin{pmatrix} \tfrac13 & \tfrac23 \\ \tfrac{1}{10} & \tfrac{9}{10} \end{pmatrix}, \qquad (1)$$

with the same indexing convention as in Example 1. These two conditional distributions
are not compatible. It is easy to show that the local transition probability matrices are
respectively

$$T_1 = \begin{pmatrix} \tfrac14 & \tfrac34 & 0 & 0 \\ \tfrac14 & \tfrac34 & 0 & 0 \\ 0 & 0 & \tfrac13 & \tfrac23 \\ 0 & 0 & \tfrac13 & \tfrac23 \end{pmatrix} \quad \text{and} \quad T_2 = \begin{pmatrix} \tfrac13 & 0 & \tfrac23 & 0 \\ 0 & \tfrac{1}{10} & 0 & \tfrac{9}{10} \\ \tfrac13 & 0 & \tfrac23 & 0 \\ 0 & \tfrac{1}{10} & 0 & \tfrac{9}{10} \end{pmatrix}.$$

Table 2 shows the results for the joint distributions derived from the simulated Gibbs
samplers and the matrix multiplications of Example 2, for conditions identical to those
presented in Table 1. Several important observations can be made here: (1) the Gibbs
samplers that use the fixed-scan patterns $\mathbf{a}_1$ and $\mathbf{a}_2$ respectively converge to two distinct
joint distributions; (2) each individual fixed-scan Gibbs sampler converges to the
corresponding solution computed from the matrix-multiplication method; and (3) the
random-scan Gibbs sampler converges to the mixture distribution of the individual fixed-
scan distributions—i.e., $\pi_\mathbf{r} = (1-r)\pi_{\mathbf{a}_1} + r\pi_{\mathbf{a}_2}$. However, the last observation, as we shall
see later, only holds true for $d = 2$.

Table 3 shows the conditional distributions of Example 2 derived from the matrix-
multiplication method (m = 32) for $\pi_{\mathbf{a}_1}$, $\pi_{\mathbf{a}_2}$, $\pi_{\mathbf{r}_0}$, $\pi_{\mathbf{r}_1}$, and $\pi_{\mathbf{r}_2}$. Interestingly, one of the
given conditional distributions is always identical to the conditional distribution derived
from the joint distribution of $\pi_{\mathbf{a}_k}$, $k = 1, 2$. For example, the given conditional
distribution $f_2$ in Eq. (1) is numerically identical to the conditional distribution $f(x_2 \mid x_1)$
directly derived from the fixed-scan Gibbs sampler $\pi_{\mathbf{a}_1}$. On the other hand, $f_1$ in Eq. (1)
is identical to the conditional distribution $f(x_1 \mid x_2)$ derived from $\pi_{\mathbf{a}_2}$. Indeed, as we shall
see later, for a given set of full conditionals $\mathcal{F} = \{f_{a_k} = f_{a_k}(x_{a_k} \mid x_{a_k}^c), k = 1, \ldots, d\}$ and a scan
pattern $\mathbf{a} = (a_1, a_2, \ldots, a_d)$, the fixed-scan Gibbs sampler $\pi_\mathbf{a}$ always has at least one $f_{a_k}$ as
one of its conditional distributions.

To illustrate the "error" of the joint distribution to which a Gibbs sampler
converges when the conditional distributions are incompatible, we prescribe a cell-wise
$\ell_2$-norm-based metric to quantify the distance between two distributions. The metric
computes the Euclidean distance between the given conditionally specified distributions
and the derived conditional distributions of the joint density. Thus, when the conditional
distributions are compatible, the distance metric, or error term, is identically zero.

Table 4 shows the error terms obtained for the various schemes for Example 2.
Based on the summary statistics, the random scan appears to have the least error.
However, we remark here that such a result could depend on how the distance metric is
defined. For example, when a cell-wise $\ell_1$-norm was used to measure distance, $\pi_{\mathbf{a}_2}$ is the
distribution that contained the smallest error (Table 4).
Table 2. Joint distributions produced by various Gibbs samplers for Example 2.

                      (1,1)    (2,1)    (1,2)    (2,2)
a1 = (1,2)            0.1062   0.0680   0.2128   0.6130
a2 = (2,1)            0.0435   0.1314   0.2753   0.5498
r0 = (1/2, 1/2)       0.0748   0.0990   0.2447   0.5815
π_a1 (m = 4)          0.1063   0.0681   0.2125   0.6131
π_a2 (m = 4)          0.0436   0.1308   0.2752   0.5504
π_r0 (m = 32)         0.0749   0.0995   0.2439   0.5817
π_r1 (m = 32)         0.0854   0.0890   0.2334   0.5922
π_r2 (m = 32)         0.0645   0.1099   0.2543   0.5713

Table 3. Conditional distributions derived from the computed joint distributions (by
matrix multiplication) in Table 2.

            f1: f(x1 | x2)                       f2: f(x2 | x1)
            (1,1)   (2,1)   (1,2)   (2,2)        (1,1)   (2,1)   (1,2)   (2,2)
Eq. (1)     0.2500  0.7500  0.3333  0.6667       0.3333  0.1000  0.6667  0.9000
π_a1        0.6094  0.3906  0.2574  0.7426       0.3333  0.1000  0.6667  0.9000
π_a2        0.2500  0.7500  0.3333  0.6667       0.1368  0.1920  0.8632  0.8080
π_r0        0.4297  0.5703  0.2954  0.7046       0.2350  0.1460  0.7650  0.8540
π_r1        0.4896  0.5104  0.2827  0.7173       0.2678  0.1307  0.7322  0.8693
π_r2        0.3698  0.6302  0.3080  0.6920       0.2023  0.1613  0.7977  0.8387
Table 4. ℓ2-norm and ℓ1-norm errors between the computed joint distributions and the
given conditional models.

            ℓ2                          ℓ1
            f1      f2      Total       f1      f2      Total
π_a1        0.5195  0.0000  0.5195      0.8706  0.0000  0.8706
π_a2        0.0000  0.3068  0.3068      0.0000  0.5770  0.5770
π_r0        0.2597  0.1535  0.3017      0.4352  0.2886  0.7238
π_r1        0.3463  0.1023  0.3611      0.5804  0.1924  0.7728
π_r2        0.1732  0.2045  0.2680      0.2902  0.3846  0.6748
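The joint distributions behind Tables 2-4 can be reproduced along the same lines as the
Example 1 sketch above (again our own illustration; only f_2 changes):

import numpy as np

# Example 2: f1 as in Example 1, f2 replaced by the incompatible version of Eq. (1).
f1 = np.array([[1/4, 1/3], [3/4, 2/3]])
f2 = np.array([[1/3, 2/3], [1/10, 9/10]])

T1 = np.zeros((4, 4))
T2 = np.zeros((4, 4))
for x1 in range(2):
    for x2 in range(2):
        s = x1 + 2 * x2
        for y in range(2):
            T1[s, y + 2 * x2] = f1[y, x2]   # resample x1, keep x2
            T2[s, x1 + 2 * y] = f2[x1, y]   # resample x2, keep x1

pi_a1 = np.linalg.matrix_power(T1 @ T2, 32)[0]        # scan a1 = (1, 2)
pi_a2 = np.linalg.matrix_power(T2 @ T1, 32)[0]        # scan a2 = (2, 1)
pi_r0 = np.linalg.matrix_power((T1 + T2) / 2, 32)[0]  # random scan, r0 = (1/2, 1/2)

print(pi_a1.round(4))   # [0.1063 0.0681 0.2125 0.6131], cf. Table 2
print(pi_a2.round(4))   # [0.0436 0.1308 0.2752 0.5504], cf. Table 2
print(np.abs(pi_r0 - (pi_a1 + pi_a2) / 2).max())      # ~0: the mixture holds for d = 2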
4. SOME ANALYTIC RESULTS
In this section, we offer several general results regarding the behaviors of the fixed-
scan and the random-scan Gibbs samplers for discrete variables, for which the transition
matrices are finite. For most of these results, it is not necessary to assume compatibility.
Besides providing some theoretical underpinning to the previous illustrative examples,
the results here allow a closer look at the mechanisms through which incompatibility
impacts the behaviors of the different Gibbs sampling schemes. Note that these results are
special cases that can be derived from more general theories for Markov chains, but for
our purpose focusing on the special case of discrete variables and scan patterns makes it
easier to examine the dynamics of convergence. General results regarding convergence of
Markov chains can be found elsewhere (e.g., see Tierney 1994; Gilks et al. 1996, and the
references therein). All of the proofs of the following results are included in the
Appendix.
Theorem 1: If $\mathcal{F}$ is positive, then the Gibbs sampler, either fixed-scan with a scan pattern
$\mathbf{a}$ or random-scan with selection probabilities $\mathbf{r} > 0$, converges to a unique
stationary distribution, $\pi_\mathbf{a}$ and $\pi_\mathbf{r}$, respectively.

Note that Theorem 1 does not require $\mathcal{F}$ to be compatible. The result assures that
when $\mathcal{F}$ is positive—a stronger condition than $\mathcal{F}$ being non-negative—any scan pattern
can have one and only one stationary distribution. Furthermore, the transition for any
fixed-scan pattern is governed by the following theorem:

Theorem 2: If $\mathcal{F}$ is positive, then each state set $f(x_1, x_2, \ldots, x_d \mid a_k)$, $k = 1, \ldots, d$, of the
Gibbs chain with scan pattern $\mathbf{a} = (a_1, a_2, \ldots, a_d)$ has exactly one stationary
distribution $\pi_{a_k}$. In particular, $\pi_{a_1}^T = \pi_{a_d}^T T_{a_1}$, $\pi_{a_k}^T = \pi_{a_{k-1}}^T T_{a_k}$ for $k = 2, \ldots, d$,
and $\pi_{a_d} = \pi_\mathbf{a}$.

A direct consequence of Theorem 2 is that for any fixed-scan pattern, one of the
specified conditional distributions in $\mathcal{F}$ can always be derived from its stationary
distribution. This is summarized in the following corollary:

Corollary 1: If $\mathcal{F}$ is positive, then the stationary distribution $\pi_\mathbf{a}$ of the Gibbs sampler has
$f_{a_d}$ as one of its conditional distributions for the scan pattern
$\mathbf{a} = (a_1, a_2, \ldots, a_d)$; i.e., $\pi_\mathbf{a}(x_{a_d} \mid x_{a_1}, \ldots, x_{a_{d-1}}) = f_{a_d}$.

When $\mathcal{F}$ is compatible, all scan patterns converge to the same joint distribution. The
following theorem provides a formal statement.

Theorem 3: Suppose $\mathcal{F}$ is positive. $\mathcal{F}$ is compatible if and only if there exists a joint
distribution $\pi$ with either $\pi_\mathbf{a} = \pi$ for all $\mathbf{a}$ or $\pi_\mathbf{r} = \pi$ for all $\mathbf{r}$. Furthermore, $\pi$ is
the joint distribution characterized by $\mathcal{F}$.

An interesting observation about the random scan is that it forms a mixture of the
fixed-scan patterns only for $d = 2$. We state the corollary for the case $d = 2$ and give a
counter-example for $d = 3$.

Corollary 2: If $\mathcal{F} > 0$ and $d = 2$, then $\pi_\mathbf{r}$, $\mathbf{r} = (r, 1-r)$, is formed by the convex
combination of $\pi_{\mathbf{a}_2 = (2,1)}$ and $\pi_{\mathbf{a}_1 = (1,2)}$; i.e., for all $r \in [0, 1]$,
$\pi_\mathbf{r} = (1-r)\pi_{\mathbf{a}_1} + r\pi_{\mathbf{a}_2}$.

A three-dimensional counter-example to Corollary 2 for the case $d = 3$ is presented
in Table 5. In this example, $\mathcal{F} = \{f_1, f_2, f_3\}$ is positive but not compatible. There are a
total of six scan patterns, and for each scan pattern the solution to which the individual
Gibbs sampler converges is shown as a row in Table 5. The average of all six fixed-scan
Gibbs samplers, $\bar{\pi} = \frac{1}{6}\sum_{i=1}^{6} \pi_{\mathbf{a}_i}$, is provided as well, as a reference. In order to solve for a
non-negative linear combination (mixture) of the fixed-scan distributions,

$$\pi_{\mathbf{r}_0} = \sum_{i=1}^{6} c_i\, \pi_{\mathbf{a}_i}, \qquad (2)$$

where $\mathbf{r}_0 = (\tfrac13, \tfrac13, \tfrac13)$, we treated equation (2) as a system of linear equations and solved
for $\mathbf{c} = (c_i)$, $i = 1, \ldots, 6$. As it turned out, our result indicated that there was no solution
that satisfied $\mathbf{c} = (c_1, c_2, \ldots, c_6) \geq 0$. This observation led us to believe that the surmise
(Liu 1996) that the stationary distribution for a random-scan Gibbs sampler is a mixture
of the stationary distributions of all systematic-scan Gibbs samplers is not true in general.
It only holds for $d = 2$.
Table 5. A three-dimensional counter-example to Corollary 2.

                        (1,1,1)  (2,1,1)  (1,2,1)  (2,2,1)  (1,1,2)  (2,1,2)  (1,2,2)  (2,2,2)
f1                      0.1      0.9      0.2      0.8      0.3      0.7      0.4      0.6
f2                      0.5      0.6      0.5      0.4      0.7      0.8      0.3      0.2
f3                      0.9      0.1      0.1      0.9      0.1      0.9      0.9      0.1
π_a1, a1 = (2,3,1)      0.0199   0.1795   0.0411   0.1646   0.1484   0.3462   0.0401   0.0602
π_a2, a2 = (3,1,2)      0.0305   0.2064   0.0305   0.1376   0.1319   0.3251   0.0565   0.0813
π_a3, a3 = (1,2,3)      0.1462   0.0532   0.0087   0.1970   0.0162   0.4784   0.0784   0.0219
π_a4, a4 = (3,2,1)      0.0228   0.2050   0.0355   0.1421   0.1399   0.3263   0.0513   0.0770
π_a5, a5 = (1,3,2)      0.0775   0.1502   0.0775   0.1002   0.0661   0.4001   0.0283   0.1000
π_a6, a6 = (2,1,3)      0.1464   0.0531   0.0087   0.1972   0.0163   0.4782   0.0782   0.0219
π̄ (average)             0.0739   0.1412   0.0337   0.1565   0.0865   0.3924   0.0555   0.0604
π_r0, r0 = (1/3,1/3,1/3) 0.0728  0.1406   0.0331   0.1532   0.0873   0.3944   0.0575   0.0613
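The counter-example can be checked numerically; the sketch below is our own addition
(it assumes numpy and scipy are available, and uses scipy.optimize.nnls, which solves
min ||Ac - b|| subject to c >= 0, rather than the authors' original computation):

import numpy as np
from itertools import permutations
from scipy.optimize import nnls

# Full-state tables from Table 5: entry at state s is f_k(value of x_k at s | rest of s).
# State index = x1 + 2*x2 + 4*x3 with each coordinate in {0, 1}.
t = np.array([[0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6],   # f1(x1 | x2, x3)
              [0.5, 0.6, 0.5, 0.4, 0.7, 0.8, 0.3, 0.2],   # f2(x2 | x1, x3)
              [0.9, 0.1, 0.1, 0.9, 0.1, 0.9, 0.9, 0.1]])  # f3(x3 | x1, x2)

def kernel(k):
    # Local transition matrix T_k: resample coordinate k, keep the others.
    T = np.zeros((8, 8))
    for s in range(8):
        for y in range(2):
            s2 = (s & ~(1 << k)) | (y << k)   # state s with coordinate k set to y
            T[s, s2] = t[k, s2]
    return T

T = [kernel(k) for k in range(3)]

def stationary(M, m=200):
    return np.linalg.matrix_power(M, m)[0]

scans = list(permutations(range(3)))                 # the six fixed scans
pis = np.array([stationary(T[a] @ T[b] @ T[c]) for a, b, c in scans])
pi_r = stationary((T[0] + T[1] + T[2]) / 3)          # random scan, r0 = (1/3, 1/3, 1/3)
print(pi_r.round(4))                                 # cf. the last row of Table 5

c, resid = nnls(pis.T, pi_r)                         # try to solve eq. (2) with c >= 0
print(resid)   # a strictly positive residual means eq. (2) has no exact
               # non-negative solution, consistent with the text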
5. DISCUSSION
This paper provides some simple examples to illustrate the behaviors of the Gibbs
sampler for a full set of conditionally specified distributions that may not be compatible.
We show that for a given scan pattern, the Gibbs sampling procedure forms a
homogeneous Markov chain, and under mild conditions the Gibbs sampler converges to a
unique stationary distribution. Unlike the compatible case, different scan patterns
lead to different stationary distributions for PICSD. The random-scan Gibbs sampler
generally converges to "something in between," but the exact mixture representation only
holds in the simplest case, i.e., when the dimension is two.
Our findings have several implications for the practical application of the Gibbs
sampler, especially when it operates on PICSD. For example, MICE often relies on
a single fixed-scan pattern. This implies that the imputed missing values could change
beyond expected statistical bounds when a seemingly innocuous change is made in the
ordering of the variables. Although in this paper we have not studied the issue of which
fixed-scan pattern produces the “best” joint distribution, some recent work has been done
in that direction. For example, Chen, Ip, and Wang (2011) proposed using an ensemble
approach to derive an optimal joint density. The authors also showed that the random-
scan procedure generally produces promising joint distributions. It is possible that in
some cases the gain from using multiple Gibbs chains, as in the case of random-scan, is
marginal. As argued by Heckerman et al. (2000), the single-chain fixed-scan (pseudo)
Gibbs sampler asymptotically works well when the extent to which the specified
conditional distributions are incompatible is minimal. This may be true for models that
are applied to one single data set with a large sample size. However, the extent of
incompatibility could be much higher when multiple data sets are used and when multiple
sets of conditional models are specified. While it is likely that even in more complex
applications a brute-force implementation of the (pseudo) Gibbs sampler will still provide
some kinds of solutions, the qualities and behaviors of such “solutions” will need to be
rigorously evaluated.
APPENDIX: PROOFS OF ANALYTIC RESULTS
Proof of Theorem 1. We need a lemma to prove Theorem 1, concerning irreducibility
(the ability to reach all interesting points of the state space) and aperiodicity (returning to
a given state at irregular times).

Lemma 1: If $\mathcal{F}$ is positive, then $T_\mathbf{a}$ and $T_\mathbf{r}$ are irreducible and aperiodic for any given
$\mathbf{a}$ and $\mathbf{r} > 0$.

Proof: Let $\mathbf{x}^1 = (x_1^1, x_2^1, x_3^1, \ldots, x_d^1)$ and $\mathbf{x}^2 = (x_1^2, x_2^2, x_3^2, \ldots, x_d^2)$ be two states of the
chain induced by $T_\mathbf{a}$ or $T_\mathbf{r}$. Without loss of generality, we also let $\mathbf{a} = (1, 2, 3, \ldots, d)$ and
$\mathbf{r} = (\tfrac1d, \tfrac1d, \ldots, \tfrac1d)$, so that $T_\mathbf{a} = T_1 T_2 T_3 \cdots T_d$ and $T_\mathbf{r} = \tfrac1d (T_1 + T_2 + \cdots + T_d)$.

To prove that $T_\mathbf{a}$ and $T_\mathbf{r}$ are aperiodic, we must show $(T_\mathbf{a})_{ii} > 0$ and $(T_\mathbf{r})_{ii} > 0$ for all $i$. By
the definition of the local transition probability, we have $(T_k)_{ii} > 0$ for all $k$ if $\mathcal{F}$ is positive.
Consequently, $(T_\mathbf{a})_{ii} \geq \prod_{k=1}^{d} (T_k)_{ii} > 0$ and $(T_\mathbf{r})_{ii} = \tfrac1d \sum_{k=1}^{d} (T_k)_{ii} > 0$ for all $i$.

To prove that $T_\mathbf{a}$ and $T_\mathbf{r}$ are irreducible is equivalent to proving that $\mathbf{x}^1$ and $\mathbf{x}^2$ communicate,
i.e., to showing that the transition probabilities satisfy $P(\mathbf{x}^1 \to \mathbf{x}^2) > 0$ and $P(\mathbf{x}^2 \to \mathbf{x}^1) > 0$. Given the
scan pattern $\mathbf{a}$ we have
$$P(\mathbf{x}^1 \to \mathbf{x}^2) = f_1(x_1^2, x_2^1, \ldots, x_{d-1}^1, x_d^1)\, f_2(x_1^2, x_2^2, x_3^1, \ldots, x_d^1) \cdots f_d(x_1^2, x_2^2, x_3^2, \ldots, x_d^2) > 0,$$
and
$$P(\mathbf{x}^2 \to \mathbf{x}^1) = f_1(x_1^1, x_2^2, \ldots, x_{d-1}^2, x_d^2)\, f_2(x_1^1, x_2^1, x_3^2, \ldots, x_d^2) \cdots f_d(x_1^1, x_2^1, x_3^1, \ldots, x_d^1) > 0.$$
Similarly, for the random-scan case we have
$$P(\mathbf{x}^1 \to \mathbf{x}^2) \geq \left(\tfrac1d\right)^{d} f_1(x_1^2, x_2^1, \ldots, x_{d-1}^1, x_d^1)\, f_2(x_1^2, x_2^2, x_3^1, \ldots, x_d^1) \cdots f_d(x_1^2, x_2^2, x_3^2, \ldots, x_d^2) > 0,$$
and
$$P(\mathbf{x}^2 \to \mathbf{x}^1) \geq \left(\tfrac1d\right)^{d} f_1(x_1^1, x_2^2, \ldots, x_{d-1}^2, x_d^2)\, f_2(x_1^1, x_2^1, x_3^2, \ldots, x_d^2) \cdots f_d(x_1^1, x_2^1, x_3^1, \ldots, x_d^1) > 0. \;\blacksquare$$
It is well known that if a Markov chain is irreducible and aperiodic, then it converges
to a unique stationary distribution (Norris 1997). Consequently, we have the uniqueness
and existence theorem (Theorem 1) for the Gibbs sampler and Gibbs chain.
Proof of Theorem 2. We need a lemma to prove Theorem 2.

Lemma 2: If $\mathcal{F}$ is positive, then the stationary distribution $\pi_\mathbf{a}$ of the Gibbs sampler has $f_{a_d}$
as one of its conditional distributions for the scan pattern $\mathbf{a} = (a_1, a_2, \ldots, a_d)$, i.e.,
$\pi_\mathbf{a}(x_{a_d} \mid x_{a_1}, x_{a_2}, \ldots, x_{a_{d-1}}) = f_{a_d}$.

Proof: Since $x_{a_d}^{(td)} \sim f_{a_d}(x_{a_d} \mid x_{a_1}^{(td)}, x_{a_2}^{(td)}, \ldots, x_{a_{d-1}}^{(td)})$, it follows that
$\pi_\mathbf{a}(x_{a_d} \mid x_{a_1}, x_{a_2}, \ldots, x_{a_{d-1}}) \propto f_{a_d}$. Consequently, $\pi_\mathbf{a}(x_{a_d} \mid x_{a_d}^c) = f_{a_d}(x_{a_d} \mid x_{a_d}^c)$. ■

Theorem 2 easily follows from Lemma 2.
Proof of Theorem 3. "If" part: Since $\mathcal{F}$ is positive and compatible, there exists a
positive joint distribution $\pi > 0$ characterized by $\mathcal{F}$. Under the positivity assumption on $\pi$,
it is well known that the Gibbs sampler governed by $\mathcal{F}$ determines $\pi$ (Besag 1994).

"Only if" part: Let $\mathbf{a}_i = (a_1, a_2, \ldots, a_d)$ be a scan pattern with $a_d = i$, $i = 1, \ldots, d$.
Assume that there exists a $\pi$ such that $\pi_\mathbf{a} = \pi$ for all $\mathbf{a}$. From Theorem 1 and Lemma 2, it
follows that $\pi(x_{a_d} \mid x_{a_d}^c) = f_{a_d}(x_{a_d} \mid x_{a_d}^c)$ for all $\mathbf{a}$. Thus, $\pi(x_i \mid x_i^c) = \pi_{\mathbf{a}_i}(x_i \mid x_i^c) = f_i(x_i \mid x_i^c)$ for all $i$.
Hence $\mathcal{F}$ is compatible and $\pi$ is the joint distribution of $\mathcal{F}$.

Now assume that there exists a $\pi$ such that $\pi_\mathbf{r} = \pi$ for all $\mathbf{r}$. We only need to prove that
$\pi_\mathbf{a} = \pi$ for all $\mathbf{a}$. From Theorem 1, we have $\pi_\mathbf{a} T_\mathbf{a} = \pi_\mathbf{a}$. By the definition of the random-scan
Gibbs sampler, $\pi T_\mathbf{r} = \pi$ for every $\mathbf{r} > 0$; varying $\mathbf{r}$ and using the linearity of
$T_\mathbf{r} = \sum_k r_k T_k$ in $\mathbf{r}$ gives $\pi T_k = \pi$ for all $k$. It follows that
$$\pi T_\mathbf{a} = \pi T_{a_1} T_{a_2} \cdots T_{a_d} = (\pi T_{a_1}) T_{a_2} \cdots T_{a_d} = \cdots = \pi T_{a_d} = \pi.$$
From Theorem 1, $\pi_\mathbf{a}$ is uniquely determined by $T_\mathbf{a}$. As a result, $\pi_\mathbf{r} = \pi = \pi_\mathbf{a}$ for all $\mathbf{r}$ and all $\mathbf{a}$. ■
Proof of Corollary 1. The proof follows directly from Theorems 2 and 3. ■
Proof of Corollary 2. Since $\mathcal{F}$ is positive, $\pi_{\mathbf{a}_1}$, $\pi_{\mathbf{a}_2}$, and $\pi_\mathbf{r}$ are stationary distributions
uniquely determined by $\mathbf{a}_1$, $\mathbf{a}_2$, and $\mathbf{r}$, respectively. By Lemma 2, $\pi_{\mathbf{a}_1} T_2 = \pi_{\mathbf{a}_1}$ and
$\pi_{\mathbf{a}_2} T_1 = \pi_{\mathbf{a}_2}$; moreover, $\pi_{\mathbf{a}_1} T_1$ is stationary with respect to $T_{\mathbf{a}_2} = T_2 T_1$, so by
uniqueness $\pi_{\mathbf{a}_1} T_1 = \pi_{\mathbf{a}_2}$, and similarly $\pi_{\mathbf{a}_2} T_2 = \pi_{\mathbf{a}_1}$. Therefore,
$$\begin{aligned}
\left[(1-r)\pi_{\mathbf{a}_1} + r\pi_{\mathbf{a}_2}\right]\left[r T_1 + (1-r) T_2\right]
&= r(1-r)\pi_{\mathbf{a}_1} T_1 + (1-r)^2 \pi_{\mathbf{a}_1} T_2 + r^2 \pi_{\mathbf{a}_2} T_1 + r(1-r)\pi_{\mathbf{a}_2} T_2 \\
&= r(1-r)\pi_{\mathbf{a}_2} + (1-r)^2 \pi_{\mathbf{a}_1} + r^2 \pi_{\mathbf{a}_2} + r(1-r)\pi_{\mathbf{a}_1} \\
&= (1-r)\pi_{\mathbf{a}_1} + r\pi_{\mathbf{a}_2}.
\end{aligned}$$
Because $T_\mathbf{r} = rT_1 + (1-r)T_2$ is the transition kernel for the random-scan Gibbs chain with
selection probabilities $\mathbf{r} = (r, 1-r)$, the uniquely determined $\pi_\mathbf{r}$ equals
$(1-r)\pi_{\mathbf{a}_1} + r\pi_{\mathbf{a}_2}$. ■
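The identities used in the display above can also be spot-checked numerically for
Example 2 (a check we add, not part of the original proof; numpy is assumed):

import numpy as np

# Example 2 kernels, built exactly as in the Section 3 sketch.
f1 = np.array([[1/4, 1/3], [3/4, 2/3]])
f2 = np.array([[1/3, 2/3], [1/10, 9/10]])
T1 = np.zeros((4, 4)); T2 = np.zeros((4, 4))
for x1 in range(2):
    for x2 in range(2):
        s = x1 + 2 * x2
        for y in range(2):
            T1[s, y + 2 * x2] = f1[y, x2]
            T2[s, x1 + 2 * y] = f2[x1, y]

pi_a1 = np.linalg.matrix_power(T1 @ T2, 64)[0]
pi_a2 = np.linalg.matrix_power(T2 @ T1, 64)[0]
print(np.abs(pi_a1 @ T1 - pi_a2).max())   # ~0: pi_a1 T1 = pi_a2
print(np.abs(pi_a2 @ T2 - pi_a1).max())   # ~0: pi_a2 T2 = pi_a1
for r in (0.25, 0.5, 0.75):
    pi_r = np.linalg.matrix_power(r * T1 + (1 - r) * T2, 256)[0]
    print(np.abs(pi_r - ((1 - r) * pi_a1 + r * pi_a2)).max())  # ~0 for each r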
REFERENCES
Arnold, B. C., Castillo, E., and Sarabia, J. M. (2002), “Exact and Near Compatibility of
Discrete Conditional Distributions,” Computational Statistics and Data Analysis,
40, 231–252.
Besag, J. E. (1994), “Discussion of Markov Chains for Exploring Posterior Distributions,”
The Annals of Statistics, 22, 1734–1741.
Casella, G. and George, E. (1992), “Explaining the Gibbs Sampler,” The American
Statistician, 46, 167–174.
Chen, S–H., Ip, E. H., and Wang, Y. (2011), “Gibbs Ensembles for Nearly Compatible
and Incompatible Conditional Models,” Computational Statistics and Data
Analysis, 55, 1760–1769.
Drechsler, J. and Rässler, S. (2008), “Does Convergence Really Matter?” in Recent
Advances in Linear Models and Related Areas, eds. Shalabh, and C. Heumann,
Heidelberg: Physica-Verlag, pp. 341–355.
Gelfand, A. E. and Smith, A. F. M. (1990), “Sampling-Based Approaches to Calculating
Marginal Densities,” Journal of the American Statistical Association, 85, 398–409.
Geman, S. and Geman, D. (1984), “Stochastic Relaxation, Gibbs Distributions, and the
Bayesian Restoration of Images,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 6, 721–741.
Gilks, W. R., Richardson, S., and Spiegelhalter, D. (1996), Markov Chain Monte Carlo
in Practice. London: Chapman & Hall.
Hastings, W. K. (1970), “Monte Carlo Sampling Methods Using Markov Chains and
Their Applications,” Biometrika, 57, 97–109.
Heckerman, D., Chickering, D. M., Meek, C., Rounthwaite, R., and Kadie, C. (2000),
“Dependency Networks for Inference, Collaborative Filtering, and Data
Visualization,” Journal of Machine Learning Research, 1, 49–75.
Hobert, J. P. and Casella, G. (1998), “Functional Compatibility, Markov Chains and
Gibbs Sampling with Improper Posteriors,” Journal of Computational and
Graphical Statistics, 7, 42–60.
Levine, R. and Casella, G. (2006), “Optimizing Random Scan Gibbs Samplers,” Journal
of Multivariate Analysis, 97, 2071–2100.
Liu, J. S. (1996), Discussion on “Statistical inference and Monte Carlo algorithms,” by G.
Casella, Test, 5, 305–310.
Liu, J. S., Wong, W. H., and Kong, A. (1995), “Covariance Structure and Convergence
Rate of the Gibbs Sampler with Various Scans,” Journal of the Royal Statistical
Society, Ser. B, 57, 157–169.
Madras, N. (2002), Lectures on Monte Carlo Methods, Providence, RI: American
Mathematical Society.

Norris, J. R. (1997), Markov Chains, Cambridge, UK: Cambridge University Press.
Rässler, S., Rubin, D.B., and Zell, E.R. (2008), “Incomplete Data in Epidemiology and
Medical Statistics,” in Handbook of Statistics 27: Epidemiology and Medical
Statistics, eds. C. R. Rao, J. P. Miller and D. C. Rao, The Netherlands: Elsevier,
pp. 569–601.
Rubin, D.B. (2003), “Nested Multiple Imputation of NMES via Partially Incompatible
MCMC,” Statistica Neerlandica, 57, 3–18.
Schafer, J. L. (1997), Analysis of Incomplete Multivariate Data, London: Chapman &
Hall.
Smith, A. F. M. and Roberts, G. O. (1993), “Bayesian Computation via the Gibbs
Sampler and Related Markov Chain Monte Carlo Methods,” Journal of the Royal
Statistical Society, Ser. B, 55, 3–23.
Tierney, L. (1994), “Markov Chains for Exploring Posterior Distributions,” The Annals of
Statistics, 22, 1701–1728.
van Buuren, S., Boshuizen, H. C., and Knook, D. L. (1999), “Multiple Imputation of
Missing Blood Pressure Covariates in Survival Analysis,” Statistics in Medicine,
18, 681–694.

van Buuren, S., Brand, J. P. L., Groothuis-Oudshoorn, C. G. M., and Rubin, D. B. (2006),
“Fully Conditional Specification in Multivariate Imputation,” Journal of Statistical
Computation and Simulation, 76, 1049–1064.
White, I. R., Royston, P., and Wood, A. M. (2011), “Multiple Imputation Using Chained
Equations: Issues and Guidance for Practice,” Statistics in Medicine, 30, 377–399.