Behavior of the Gibbs Sampler When Conditional
Distributions Are Potentially Incompatible
Shyh-Huei Chen
Department of Biostatistical Sciences, Wake Forest University School of Medicine,
Winston-Salem, NC 27157
Edward H. Ip
Department of Biostatistical Sciences, Department of Social Sciences and Health Policy,
Wake Forest University School of Medicine, Winston-Salem, NC 27157
Shyh-Huei Chen (E-mail: [email protected]) is assistant professor at the Department of
Biostatistical Sciences, Division of Public Health Sciences, Wake Forest School of Medicine,
Wells Fargo Center 23rd floor, Medical Center Blvd, Winston-Salem, NC 27157. Edward H. Ip
(E-mail: [email protected]) is Professor at the Department of Biostatistical Sciences and the
Department of Social Sciences and Health Policy, Division of Public Health Sciences, Wake
Forest School of Medicine, Wells Fargo Center 23rd floor, Medical Center Blvd, Winston-Salem,
NC 27157. This work was partially supported by NIH grant 1R21AG042761-01 (PI: Ip).
ABSTRACT
The Gibbs sampler has been used extensively in the statistics literature. It relies on
iteratively sampling from a set of compatible conditional distributions, and the sampler is
known to converge to a unique invariant joint distribution. However, the Gibbs sampler
behaves rather differently when the conditional distributions are not compatible. Such
samplers have seen increasing use in areas such as multiple imputation. In this paper,
we demonstrate that what a Gibbs sampler converges to is a function of the order of the
sampling scheme. Besides providing empirical examples to illustrate this behavior, we
also explain how it happens through a thorough analysis of the examples.
KEY WORDS: Gibbs chain; Gibbs sampler; Potentially incompatible conditionally specified distribution.
1. INTRODUCTION
The Gibbs sampler is one of the most prominent Markov chain Monte Carlo
(MCMC)-based methods. Partly because of its conceptual simplicity and elegance of
implementation, the Gibbs sampler has been used across an increasingly broad range of
subject areas, including bioinformatics and spatial analysis. While its roots date back to
earlier work (e.g., Hastings 1970), the popularity of Gibbs sampling is commonly
credited to Geman and Geman (1984), in which the algorithm was used as a tool for
image processing. Its use in statistics, especially Bayesian analysis, has since grown very
rapidly (Gelfand and Smith 1990; Smith and Roberts 1993; Gilks, Richardson, and
Spiegelhalter 1996). For a quick introduction to the algorithm, see Casella and George
(1992).
One of the recent developments of the Gibbs sampler is its application to
potentially incompatible conditionally specified distributions (PICSD). When statistical
models involve high-dimensional data, it is often easier to specify conditional
distributions than the entire joint distribution. However, the approach of specifying
conditional distributions individually carries the risk of not forming a compatible joint
model. Consider a system of $d$ discrete random variables $X = \{x_1, x_2, \ldots, x_d\}$,
whose fully conditional model is specified by $\mathcal{F} = \{f_1, f_2, \ldots, f_d\}$, where
$f_k \equiv f(x_k \mid x_k^c)$ and $x_k^c$ is the relative complement of $x_k$ with respect to
$X$. If the conditional models are individually specified, then there may not exist a joint
distribution that gives rise to the specified set of conditional distributions. In such a case,
we call $\mathcal{F}$ incompatible.

The study of PICSD is closely related to the Gibbs sampler because the latter relies
on iteratively drawing samples from $\mathcal{F}$ to form a Markov chain. Under mild
conditions, the Markov chain converges to the desired joint distribution if $\mathcal{F}$ is
compatible. However, if $\mathcal{F}$ is not compatible, then the Gibbs sampler can exhibit
erratic behavior (e.g., Hobert and Casella 1998).
In this paper, our goal is to demonstrate the behavior of the Gibbs sampler (or the
pseudo Gibbs sampler, as it is not a true Gibbs sampler in the traditional sense of
presumed compatible conditional distributions) for PICSD. By using several simple
examples, we show mathematically that what a Gibbs sampler converges to is a function
of the order of the sampling scheme in the Gibbs sampler. Furthermore, we show that if
we follow a random order in sampling conditional distributions at each iteration—i.e.,
using a random-scan Gibbs sampler (Liu, Wong, and Kong 1995)—then Gibbs
sampling leads to a mixture of the joint distributions formed by the possible
fixed-order (or, more formally, fixed-scan) schemes when d = 2, but the result does not
hold when d > 2. This result is a refinement of a conjecture put forward in Liu (1996).
The demonstration in this paper is intended to provide readers unfamiliar with
incompatible conditional distributions with some basic background on the mechanism
driving the behavior of the Gibbs sampler for PICSD. Two recent developments in the
statistical and machine-learning literature underscore the importance of the current work.
The first is in the application of the Gibbs sampler to a dependency network, which is a
type of generalized graphical model specified by conditional probability distributions
(Heckerman et al. 2000). One approach to learning a dependency network is to first
specify individual conditional models and then apply a (pseudo) Gibbs sampler to
estimate the joint model. The authors acknowledged the possibility of incompatible
conditional models but argued that when the sample size is large, the degree of
incompatibility will not be substantial and the Gibbs sampler is still applicable. Yet
another example is the use of the fully conditional specification for multiple imputation
of missing data (van Buuren et al. 1999, 2006). The method, which is also called multiple
imputation by chained equations (MICE), makes use of a Gibbs sampler or other MCMC-
based methods that operate on a set of conditionally specified models. For each variable
with a missing value, an imputed value is created under an individual conditional-
regression model. This kind of procedure was viewed as combining the best features of
many currently available multiple imputation approaches (Rubin 2003). Due to its
flexibility over compatible multivariate-imputation models (Schafer 1997) and its ability to
handle different variable types (continuous, binary, and categorical), MICE has gained
acceptance as a practical treatment of missing data, especially in high-dimensional data
sets (Rässler, Rubin, and Zell 2008). Popular as it is, MICE has the limitation of
potentially encountering incompatible conditional-regression models, and it has been
shown that an incompatible imputation model can lead to biased estimates from imputed
data (Drechsler and Rässler 2008). So far, very little theory has been developed to
support the use of MICE (White, Royston, and Wood 2011). A better understanding of
the theoretical properties of applying the Gibbs sampler to PICSD could lead to important
refinements of these imputation methods in practice.
The article is organized as follows. First, we provide basic background on the Gibbs
chain and the Gibbs sampler and define the scan order of a Gibbs sampler. Section 3
describes a simple example to demonstrate the convergence behavior of a Gibbs sampler
as a function of scan order, both by applying matrix algebra to the transition kernel and
by using MCMC-based computation. In Section 4, we offer several analytic results
concerning the stationary distributions of the Gibbs sampler under different scan patterns,
and a counter-example to a surmise about the Gibbs sampler under a random scan order.
Finally, in Section 5, we provide a brief discussion.
2. GIBBS CHAIN AND GIBBS SAMPLER
Continuing the notation of the previous section, let $\mathbf{a} = (a_1, a_2, \ldots, a_d)$
denote a permutation of $\{1, 2, \ldots, d\}$, and let $\mathbf{x} = (x_1, x_2, \ldots, x_d)^T$
denote a realization of $X$ with $x_k \in \{1, 2, \ldots, C_k\}$, where $C_k$ is the number of
categories of the $k$th variable. Thus, $\mathbf{x}_\mathbf{a} \equiv (x_{a_1}, x_{a_2}, \ldots, x_{a_d})$
is a realization of $X$ defined in the order of $\mathbf{a}$. For a specified $\mathcal{F}$, the
associated fixed (systematic)-scan Gibbs chain governed by a scan pattern $\mathbf{a}$ can be
implemented as follows:

1. Pick an arbitrary starting vector $\mathbf{x}_\mathbf{a}^{(0)} = (x_{a_1}^{(0)}, x_{a_2}^{(0)}, \ldots, x_{a_d}^{(0)})$.
2. On the $t$th cycle, successively draw from the full conditional distributions
according to scan pattern $\mathbf{a}$, with all coordinates other than the one being drawn
carried over:
$$x_{a_k}^{(s)} \sim f_{a_k}\bigl(x_{a_k} \mid \mathbf{x}_{a_k^c}^{(s-1)}\bigr), \qquad s = (t-1)d + k, \quad k = 1, \ldots, d.$$

The series $\mathbf{x}^{(0)}, \mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(s)}, \ldots$ obtained one
draw (iteration) at a time is called a realization of the Gibbs chain defined by $\mathcal{F}$
with scan pattern $\mathbf{a}$; and the series $\mathbf{x}^{(0)}, \mathbf{x}^{(d)}, \mathbf{x}^{(2d)}, \ldots, \mathbf{x}^{(td)}, \ldots$
obtained one cycle at a time is a realization of the associated Gibbs sampler. For
example, let $X = (x_1, x_2, x_3, x_4)$ and $\mathbf{a} = (2, 4, 1, 3)$. Given initial value
$\mathbf{x}^{(0)} = (x_2^{(0)}, x_4^{(0)}, x_1^{(0)}, x_3^{(0)})$, the Gibbs chain in cycle 1 performs the
following draws and produces the corresponding states:

$x_2^{(1)} \sim f_2(x_2 \mid x_4 = x_4^{(0)}, x_1 = x_1^{(0)}, x_3 = x_3^{(0)})$, $\mathbf{x}^{(1)} = (x_2^{(1)}, x_4^{(0)}, x_1^{(0)}, x_3^{(0)})$;
$x_4^{(2)} \sim f_4(x_4 \mid x_2 = x_2^{(1)}, x_1 = x_1^{(0)}, x_3 = x_3^{(0)})$, $\mathbf{x}^{(2)} = (x_2^{(1)}, x_4^{(2)}, x_1^{(0)}, x_3^{(0)})$;
$x_1^{(3)} \sim f_1(x_1 \mid x_2 = x_2^{(1)}, x_4 = x_4^{(2)}, x_3 = x_3^{(0)})$, $\mathbf{x}^{(3)} = (x_2^{(1)}, x_4^{(2)}, x_1^{(3)}, x_3^{(0)})$; and
$x_3^{(4)} \sim f_3(x_3 \mid x_2 = x_2^{(1)}, x_4 = x_4^{(2)}, x_1 = x_1^{(3)})$, $\mathbf{x}^{(4)} = (x_2^{(1)}, x_4^{(2)}, x_1^{(3)}, x_3^{(4)})$.

In this example, the series $\mathbf{x}^{(0)}, \mathbf{x}^{(4)}, \mathbf{x}^{(8)}, \ldots$ is the realization of the
Gibbs sampler defined by $\mathcal{F}$ with scan pattern $\mathbf{a}$.

We can also express a Gibbs sampler of random scan order as a Gibbs chain. Let
$\mathbf{r} = (r_1, r_2, \ldots, r_d)$ be the set of selection probabilities, where $r_k > 0$ is the
probability of visiting the conditional $f_k$ and $\sum_{k=1}^{d} r_k = 1$. The random-scan
Gibbs sampler (Levine and Casella 2006) can be stated as follows:

1. Pick an arbitrary starting vector $\mathbf{x}^{(0)} = (x_1^{(0)}, x_2^{(0)}, \ldots, x_d^{(0)})$.
2. At the $s$th iteration, $s = 1, 2, \ldots$:
   a. Randomly choose $k \in \{1, 2, \ldots, d\}$ with probability $r_k$;
   b. Draw $x_k^{(s)} \sim f_k\bigl(x_k \mid \mathbf{x}_{k^c}^{(s-1)}\bigr)$.
3. Repeat step 2 until a convergence criterion is reached.
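To make the two schemes concrete, the following minimal Python sketch is our own
illustration and not the authors' implementation; the function and variable names
(fixed_scan_cycle, random_scan_step) are ours. Each conditional f_k is represented as a
callable that returns the probability vector of x_k given the rest of the current state.

import numpy as np

rng = np.random.default_rng(0)

def fixed_scan_cycle(x, conditionals, scan):
    # One cycle of a fixed-scan (pseudo) Gibbs sampler.
    # x: current state, a list of 0-based category indices.
    # conditionals[k](x): probability vector of x_k given the rest of x.
    # scan: scan pattern a = (a_1, ..., a_d), given as 0-based indices.
    for k in scan:
        probs = conditionals[k](x)
        x[k] = rng.choice(len(probs), p=probs)   # draw x_k given x_{k^c}
    return x

def random_scan_step(x, conditionals, r):
    # One iteration of a random-scan Gibbs sampler with selection
    # probabilities r = (r_1, ..., r_d), all r_k > 0 and sum(r) = 1.
    k = rng.choice(len(r), p=r)                  # step 2a: pick a coordinate
    probs = conditionals[k](x)
    x[k] = rng.choice(len(probs), p=probs)       # step 2b: draw x_k given x_{k^c}
    return x

For instance, calling fixed_scan_cycle repeatedly with scan = (1, 3, 0, 2) realizes the
scan pattern a = (2, 4, 1, 3) from the example above.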
3. ILLUSTRATIVE EXAMPLES
Example 1. (Compatible conditional distributions). Consider the following bivariate 2 × 2
joint distribution $\pi$ for $(X_1, X_2)$ defined on the domain {1, 2}, with its
corresponding conditional distributions $f_1(x_1 \mid x_2)$ and $f_2(x_2 \mid x_1)$ (Arnold, Castillo, and
Sarabia 2002, p. 242):

$$\pi = \begin{pmatrix} \tfrac{1}{10} & \tfrac{2}{10} \\ \tfrac{3}{10} & \tfrac{4}{10} \end{pmatrix}, \qquad f_1 = \begin{pmatrix} \tfrac{1}{4} & \tfrac{1}{3} \\ \tfrac{3}{4} & \tfrac{2}{3} \end{pmatrix}, \quad \text{and} \quad f_2 = \begin{pmatrix} \tfrac{1}{3} & \tfrac{2}{3} \\ \tfrac{3}{7} & \tfrac{4}{7} \end{pmatrix},$$

where rows are indexed by $x_1$ and columns by $x_2$; the $(i, j)$ entry of $f_1$ is
$f_1(x_1 = i \mid x_2 = j)$, and the $(i, j)$ entry of $f_2$ is $f_2(x_2 = j \mid x_1 = i)$.

There are 4 possible states, (1, 1), (1, 2), (2, 1), and (2, 2), for the Gibbs chain. The
transition from one state to another is governed by the conditional matrices $f_1$
and $f_2$. As a shorthand, we denote an entry in a matrix as $f_1(\cdot, \cdot)$; e.g., $f_1(1, 2) = 1/3$. In
order to keep track of the scan order, we denote the state in the Gibbs chain as
$(\mathbf{x}^{(t)} \mid f_{a_k})$ if the current state at time $t$ is the result of drawing from the conditional $f_{a_k}$. To fix
ideas, we use a fixed-scan Gibbs sampler with $\mathbf{a} = (1, 2)$ and the conditional distributions
$(f_1, f_2)$. The transition kernel for the Gibbs chain is diagrammatically represented in
Figure 1, where $P_1$ and $P_2$ indicate local transition probabilities. For example, the local
transition probability from $(\mathbf{x}^{(2t)} = (1,1) \mid f_2)$ to $(\mathbf{x}^{(2t+1)} = (1,1) \mid f_1)$ is $f_1(1,1) = 1/4$, and
from $(\mathbf{x}^{(2t)} = (1,1) \mid f_2)$ to $(\mathbf{x}^{(2t+1)} = (1,2) \mid f_1)$ it is 0 (indicated by disconnectedness).
Figure 1. Transition probabilities of the Gibbs chain in Example 1.
By arranging the states in lexicographic order such that the first index changes the
fastest and the last index the slowest, the transition probability matrices $T_1$ and $T_2$ that
correspond respectively to $P_1$ and $P_2$ are

$$T_1 = \begin{pmatrix} \tfrac14 & \tfrac34 & 0 & 0 \\ \tfrac14 & \tfrac34 & 0 & 0 \\ 0 & 0 & \tfrac13 & \tfrac23 \\ 0 & 0 & \tfrac13 & \tfrac23 \end{pmatrix} \quad \text{and} \quad T_2 = \begin{pmatrix} \tfrac13 & 0 & \tfrac23 & 0 \\ 0 & \tfrac37 & 0 & \tfrac47 \\ \tfrac13 & 0 & \tfrac23 & 0 \\ 0 & \tfrac37 & 0 & \tfrac47 \end{pmatrix}.$$

More generally, the local transition probability (Madras 2002, p. 77) for two
successive states of the Gibbs chain, $(\mathbf{x}^{(s-1)} \mid f_{a_{k-1}})$ and $(\mathbf{x}^{(s)} \mid f_{a_k})$, can be defined by

$$P_{a_k}\bigl(\mathbf{x}^{(s-1)}, \mathbf{x}^{(s)}\bigr) = \begin{cases} f_{a_k}\bigl(x_{a_k}^{(s)} \mid \mathbf{x}_{a_k^c}^{(s)}\bigr), & \text{if } \mathbf{x}_{a_k^c}^{(s)} = \mathbf{x}_{a_k^c}^{(s-1)}; \\ 0, & \text{otherwise.} \end{cases}$$

The matrices $T_1$ and $T_2$ in Example 1 have two pairs of identical rows and are
idempotent but not irreducible. As this example illustrates, a Gibbs chain is generally not
homogeneous, but if one defines a surrogate transition probability matrix
$T_\mathbf{a} = T_{a_1} T_{a_2} T_{a_3} \cdots T_{a_d}$, then a homogeneous chain with transition matrix $T_\mathbf{a}$ can be formed for
the scan pattern $\mathbf{a} = (a_1, \ldots, a_d)$. In other words, for a collection of full conditional
distributions $\mathcal{F}$ and a scan pattern $\mathbf{a}$, the fixed-scan Gibbs sampler is a homogeneous
Markov chain with transition matrix $T_\mathbf{a}$. Analogously, a random-scan Gibbs sampler with
selection probabilities $\mathbf{r} = (r_1, r_2, \ldots, r_d)$ can also be transformed into a homogeneous Markov
chain by defining $T_\mathbf{r} \equiv \sum_{k=1}^{d} r_k T_k$ as the surrogate transition probability matrix. The
corresponding stationary distributions $\pi_\mathbf{a}$ and $\pi_\mathbf{r}$ can be directly computed by evaluating
$\lim_{m \to \infty} T_\mathbf{a}^m = \mathbf{1}_C \pi_\mathbf{a}^T$ and $\lim_{m \to \infty} T_\mathbf{r}^m = \mathbf{1}_C \pi_\mathbf{r}^T$, where $C = \prod_{k=1}^{d} C_k$ and $\mathbf{1}_C$ is a $C$-dimensional
vector of 1's.
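To make this computation concrete, here is a short Python sketch (our own
illustration, not code from the paper; numpy is assumed) that builds T_1 and T_2 of
Example 1 from f_1 and f_2 and evaluates the matrix power for a = (1, 2); every row
converges to the stationary distribution (0.1, 0.3, 0.2, 0.4).

import numpy as np

# Entries: f1[i, j] = f1(x1 = i+1 | x2 = j+1); f2[i, j] = f2(x2 = j+1 | x1 = i+1).
f1 = np.array([[1/4, 1/3], [3/4, 2/3]])
f2 = np.array([[1/3, 2/3], [3/7, 4/7]])

# States in lexicographic order (1,1), (2,1), (1,2), (2,2):
# index = x1 + 2*x2 with x1, x2 in {0, 1}.
T1 = np.zeros((4, 4))
T2 = np.zeros((4, 4))
for x1 in range(2):
    for x2 in range(2):
        s = x1 + 2 * x2
        for y in range(2):
            T1[s, y + 2 * x2] = f1[y, x2]   # resample x1, keep x2
            T2[s, x1 + 2 * y] = f2[x1, y]   # resample x2, keep x1

Ta = T1 @ T2                                # surrogate kernel for a = (1, 2)
print(np.linalg.matrix_power(Ta, 32)[0])    # -> [0.1  0.3  0.2  0.4]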
In Example 1, the transition matrices for the fixed and random scans are respectively
$T_{\mathbf{a}_1} = T_1 T_2$ for $\mathbf{a}_1 = (1, 2)$, $T_{\mathbf{a}_2} = T_2 T_1$ for $\mathbf{a}_2 = (2, 1)$, and $T_\mathbf{r} = (T_1 + T_2)/2$ for $\mathbf{r}_0 = (\tfrac12, \tfrac12)$. Table 1
directly compares the joint distributions obtained from the following computations: (1)
direct MCMC Gibbs sampling for the only two possible fixed-scan patterns, $\mathbf{a}_1 = (1, 2)$
and $\mathbf{a}_2 = (2, 1)$; (2) direct MCMC Gibbs sampling for random-scan patterns with the
following selection probabilities: $\mathbf{r}_0 = (\tfrac12, \tfrac12)$, $\mathbf{r}_1 = (\tfrac13, \tfrac23)$, and $\mathbf{r}_2 = (\tfrac23, \tfrac13)$; (3) matrix
multiplication using $T_\mathbf{a}^m$ with a low power (m = 4) and a high power (m = 32); and (4) matrix
multiplication using $T_\mathbf{r}^m$, also with low and high powers. For both (1) and (2), we used the
first 5,000 cycles as burn-in and the subsequent 1,000,000 cycles for sampling.

As expected, both the fixed-scan Gibbs samplers, regardless of scan order, and the
random-scan Gibbs samplers numerically converge to the same joint distribution
(convergence is defined here as all cell-wise differences between estimates from two
consecutive iterations being less than $0.5 \times 10^{-4}$). Table 1 also demonstrates that direct
matrix multiplication of the transition probabilities produces rapid convergence, even for
a small $m$ and different values of $\mathbf{r}$. However, we also observed that if $\mathbf{r}$ was heavily
imbalanced, it took many more iterations to achieve numerical convergence (not shown).
For example, if $\mathbf{r} = (\tfrac{1}{10}, \tfrac{9}{10})$, it took $m > 120$ to achieve the same numerical
convergence (up to 4 decimal places).
Table 1. Joint distributions produced by various Gibbs samplers for Example 1.

                      (1,1)    (2,1)    (1,2)    (2,2)
π                     0.1      0.3      0.2      0.4
a1 = (1,2)            0.1002   0.3002   0.2000   0.3997
a2 = (2,1)            0.0998   0.3004   0.1999   0.4000
r0 = (1/2, 1/2)       0.1007   0.2998   0.2000   0.3995
π_a1 (m = 4)          0.1000   0.3000   0.2000   0.4000
π_a2 (m = 4)          0.1000   0.3000   0.2000   0.4000
π_r0 (m = 32)         0.1000   0.3000   0.2000   0.4000
π_r1 (m = 32)         0.1000   0.3000   0.2000   0.4000
π_r2 (m = 32)         0.1000   0.3000   0.2000   0.4000
Example 2. (Incompatible conditional distributions). Consider a pair of 2 × 2 conditional
distributions $f_1(x_1 \mid x_2)$ and $f_2(x_2 \mid x_1)$ defined on the domain {1, 2} as follows (Arnold,
Castillo, and Sarabia 2002, p. 242):

$$f_1 = \begin{pmatrix} \tfrac14 & \tfrac13 \\ \tfrac34 & \tfrac23 \end{pmatrix} \quad \text{and} \quad f_2 = \begin{pmatrix} \tfrac13 & \tfrac23 \\ \tfrac{1}{10} & \tfrac{9}{10} \end{pmatrix}, \qquad (1)$$

with the same indexing convention as in Example 1. These two conditional distributions
are not compatible. It is easy to show that the local transition probability matrices are
respectively

$$T_1 = \begin{pmatrix} \tfrac14 & \tfrac34 & 0 & 0 \\ \tfrac14 & \tfrac34 & 0 & 0 \\ 0 & 0 & \tfrac13 & \tfrac23 \\ 0 & 0 & \tfrac13 & \tfrac23 \end{pmatrix} \quad \text{and} \quad T_2 = \begin{pmatrix} \tfrac13 & 0 & \tfrac23 & 0 \\ 0 & \tfrac{1}{10} & 0 & \tfrac{9}{10} \\ \tfrac13 & 0 & \tfrac23 & 0 \\ 0 & \tfrac{1}{10} & 0 & \tfrac{9}{10} \end{pmatrix}.$$

Table 2 shows the results for the joint distributions derived from the simulated Gibbs
samplers and the matrix multiplications of Example 2, for conditions identical to those
presented in Table 1. Several important observations can be made here: (1) the Gibbs
samplers that use the fixed-scan patterns $\mathbf{a}_1$ and $\mathbf{a}_2$ respectively converge to two distinct
joint distributions; (2) each individual fixed-scan Gibbs sampler converges to the
corresponding solution computed from the matrix-multiplication method; and (3) the
random-scan Gibbs sampler converges to the mixture distribution of the individual fixed-
scan distributions—i.e., $\pi_\mathbf{r} = (1-r)\pi_{\mathbf{a}_1} + r\pi_{\mathbf{a}_2}$. However, the last observation, as we shall
see later, only holds true for $d = 2$.

Table 3 shows the conditional distributions of Example 2 derived from the matrix-
multiplication method (m = 32) for $\pi_{\mathbf{a}_1}$, $\pi_{\mathbf{a}_2}$, $\pi_{\mathbf{r}_0}$, $\pi_{\mathbf{r}_1}$, and $\pi_{\mathbf{r}_2}$. Interestingly, one of the
given conditional distributions is always identical to the conditional distribution derived
from the joint distribution of $\pi_{\mathbf{a}_k}$, $k = 1, 2$. For example, the given conditional
distribution $f_2$ in Eq. (1) is numerically identical to the conditional distribution $f(x_2 \mid x_1)$
directly derived from the fixed-scan Gibbs sampler $\pi_{\mathbf{a}_1}$. On the other hand, $f_1$ in Eq. (1)
is identical to the conditional distribution $f(x_1 \mid x_2)$ derived from $\pi_{\mathbf{a}_2}$. Indeed, as we shall
see later, for a given set of full conditionals $\mathcal{F} = \{f_{a_k} = f_{a_k}(x_{a_k} \mid x_{a_k}^c), k = 1, \ldots, d\}$ and a scan
pattern $\mathbf{a} = (a_1, a_2, \ldots, a_d)$, the fixed-scan Gibbs sampler $\pi_\mathbf{a}$ always has at least one $f_{a_k}$ as
one of its conditional distributions.

To illustrate the "error" of the joint distribution to which a Gibbs sampler
converges when the conditional distributions are incompatible, we prescribe a cell-wise
$\ell_2$-norm-based metric to quantify the distance between two distributions. The metric
computes the Euclidean distance between the given conditionally specified distributions
and the derived conditional distributions of the joint density. Thus, when the conditional
distributions are compatible, the distance metric, or error term, is identically zero.

Table 4 shows the error terms obtained for the various schemes for Example 2.
Based on the summary statistics, the random scan appears to have the least error.
However, we remark here that such a result could depend on how the distance metric is
defined. For example, when a cell-wise $\ell_1$-norm was used to measure distance, $\pi_{\mathbf{a}_2}$ is the
distribution that contained the smallest error (Table 4).
Table 2. Joint distributions produced by various Gibbs samplers for Example 2.

                      (1,1)    (2,1)    (1,2)    (2,2)
a1 = (1,2)            0.1062   0.0680   0.2128   0.6130
a2 = (2,1)            0.0435   0.1314   0.2753   0.5498
r0 = (1/2, 1/2)       0.0748   0.0990   0.2447   0.5815
π_a1 (m = 4)          0.1063   0.0681   0.2125   0.6131
π_a2 (m = 4)          0.0436   0.1308   0.2752   0.5504
π_r0 (m = 32)         0.0749   0.0995   0.2439   0.5817
π_r1 (m = 32)         0.0854   0.0890   0.2334   0.5922
π_r2 (m = 32)         0.0645   0.1099   0.2543   0.5713

Table 3. Conditional distributions derived from the computed joint distributions (by
matrix multiplication) in Table 2.

            f1: f(x1 | x2)                       f2: f(x2 | x1)
            (1,1)   (2,1)   (1,2)   (2,2)        (1,1)   (2,1)   (1,2)   (2,2)
Eq. (1)     0.2500  0.7500  0.3333  0.6667       0.3333  0.1000  0.6667  0.9000
π_a1        0.6094  0.3906  0.2574  0.7426       0.3333  0.1000  0.6667  0.9000
π_a2        0.2500  0.7500  0.3333  0.6667       0.1368  0.1920  0.8632  0.8080
π_r0        0.4297  0.5703  0.2954  0.7046       0.2350  0.1460  0.7650  0.8540
π_r1        0.4896  0.5104  0.2827  0.7173       0.2678  0.1307  0.7322  0.8693
π_r2        0.3698  0.6302  0.3080  0.6920       0.2023  0.1613  0.7977  0.8387
Table 4. ℓ2-norm and ℓ1-norm errors between the computed joint distributions and the
given conditional models.

            ℓ2                          ℓ1
            f1      f2      Total       f1      f2      Total
π_a1        0.5195  0.0000  0.5195      0.8706  0.0000  0.8706
π_a2        0.0000  0.3068  0.3068      0.0000  0.5770  0.5770
π_r0        0.2597  0.1535  0.3017      0.4352  0.2886  0.7238
π_r1        0.3463  0.1023  0.3611      0.5804  0.1924  0.7728
π_r2        0.1732  0.2045  0.2680      0.2902  0.3846  0.6748
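The joint distributions behind Tables 2-4 can be reproduced along the same lines as the
Example 1 sketch above (again our own illustration; only f_2 changes):

import numpy as np

# Example 2: f1 as in Example 1, f2 replaced by the incompatible version of Eq. (1).
f1 = np.array([[1/4, 1/3], [3/4, 2/3]])
f2 = np.array([[1/3, 2/3], [1/10, 9/10]])

T1 = np.zeros((4, 4))
T2 = np.zeros((4, 4))
for x1 in range(2):
    for x2 in range(2):
        s = x1 + 2 * x2
        for y in range(2):
            T1[s, y + 2 * x2] = f1[y, x2]   # resample x1, keep x2
            T2[s, x1 + 2 * y] = f2[x1, y]   # resample x2, keep x1

pi_a1 = np.linalg.matrix_power(T1 @ T2, 32)[0]        # scan a1 = (1, 2)
pi_a2 = np.linalg.matrix_power(T2 @ T1, 32)[0]        # scan a2 = (2, 1)
pi_r0 = np.linalg.matrix_power((T1 + T2) / 2, 32)[0]  # random scan, r0 = (1/2, 1/2)

print(pi_a1.round(4))   # [0.1063 0.0681 0.2125 0.6131], cf. Table 2
print(pi_a2.round(4))   # [0.0436 0.1308 0.2752 0.5504], cf. Table 2
print(np.abs(pi_r0 - (pi_a1 + pi_a2) / 2).max())      # ~0: the mixture holds for d = 2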
4. SOME ANALYTIC RESULTS
In this section, we offer several general results regarding the behaviors of the fixed-
scan and the random-scan Gibbs samplers for discrete variables, for which the transition
matrices are finite. For most of these results, it is not necessary to assume compatibility.
Besides providing some theoretical underpinning to the previous illustrative examples,
the results here allow a closer look at the mechanisms through which incompatibility
impacts the behaviors of the different Gibbs sampling schemes. Note that these results are
special cases that can be derived from more general theories for Markov chains, but for
our purpose focusing on the special case of discrete variables and scan patterns makes it
easier to examine the dynamics of convergence. General results regarding convergence of
Markov chains can be found elsewhere (e.g., see Tierney 1994; Gilks et al. 1996, and the
references therein). All of the proofs of the following results are included in the
Appendix.
Theorem 1: If $\mathcal{F}$ is positive, then the Gibbs sampler, either fixed-scan with a scan pattern
$\mathbf{a}$ or random-scan with selection probabilities $\mathbf{r} > 0$, converges to a unique
stationary distribution, $\pi_\mathbf{a}$ and $\pi_\mathbf{r}$, respectively.

Note that Theorem 1 does not require $\mathcal{F}$ to be compatible. The result assures that
when $\mathcal{F}$ is positive—a stronger condition than $\mathcal{F}$ being non-negative—any scan pattern
can have one and only one stationary distribution. Furthermore, the transition for any
fixed-scan pattern is governed by the following theorem:

Theorem 2: If $\mathcal{F}$ is positive, then each state set $f(x_1, x_2, \ldots, x_d \mid a_k)$, $k = 1, \ldots, d$, of the
Gibbs chain with scan pattern $\mathbf{a} = (a_1, a_2, \ldots, a_d)$ has exactly one stationary
distribution $\pi_{a_k}$. In particular, $\pi_{a_1}^T = \pi_{a_d}^T T_{a_1}$, $\pi_{a_k}^T = \pi_{a_{k-1}}^T T_{a_k}$ for $k = 2, \ldots, d$,
and $\pi_{a_d} = \pi_\mathbf{a}$.

A direct consequence of Theorem 2 is that for any fixed-scan pattern, one of the
specified conditional distributions in $\mathcal{F}$ can always be derived from its stationary
distribution. This is summarized in the following corollary:

Corollary 1: If $\mathcal{F}$ is positive, then the stationary distribution $\pi_\mathbf{a}$ of the Gibbs sampler has
$f_{a_d}$ as one of its conditional distributions for the scan pattern
$\mathbf{a} = (a_1, a_2, \ldots, a_d)$; i.e., $\pi_\mathbf{a}(x_{a_d} \mid x_{a_1}, \ldots, x_{a_{d-1}}) = f_{a_d}$.

When $\mathcal{F}$ is compatible, all scan patterns converge to the same joint distribution. The
following theorem provides a formal statement.

Theorem 3: Suppose $\mathcal{F}$ is positive. $\mathcal{F}$ is compatible if and only if there exists a joint
distribution $\pi$ with either $\pi_\mathbf{a} = \pi$ for all $\mathbf{a}$ or $\pi_\mathbf{r} = \pi$ for all $\mathbf{r}$. Furthermore, $\pi$ is
the joint distribution characterized by $\mathcal{F}$.

An interesting observation about the random scan is that it forms a mixture of the
fixed-scan patterns only for $d = 2$. We state the corollary for the case $d = 2$ and give a
counter-example for $d = 3$.

Corollary 2: If $\mathcal{F} > 0$ and $d = 2$, then $\pi_\mathbf{r}$, $\mathbf{r} = (r, 1-r)$, is formed by the convex
combination of $\pi_{\mathbf{a}_2 = (2,1)}$ and $\pi_{\mathbf{a}_1 = (1,2)}$; i.e., for all $r \in [0, 1]$,
$\pi_\mathbf{r} = (1-r)\pi_{\mathbf{a}_1} + r\pi_{\mathbf{a}_2}$.

A three-dimensional counter-example to Corollary 2 for the case $d = 3$ is presented
in Table 5. In this example, $\mathcal{F} = \{f_1, f_2, f_3\}$ is positive but not compatible. There are a
total of six scan patterns, and for each scan pattern the solution to which the individual
Gibbs sampler converges is shown as a row in Table 5. The average of all six fixed-scan
Gibbs samplers, $\bar{\pi} = \frac{1}{6}\sum_{i=1}^{6} \pi_{\mathbf{a}_i}$, is provided as well, as a reference. In order to solve for a
non-negative linear combination (mixture) of the fixed-scan distributions,

$$\pi_{\mathbf{r}_0} = \sum_{i=1}^{6} c_i\, \pi_{\mathbf{a}_i}, \qquad (2)$$

where $\mathbf{r}_0 = (\tfrac13, \tfrac13, \tfrac13)$, we treated equation (2) as a system of linear equations and solved
for $\mathbf{c} = (c_i)$, $i = 1, \ldots, 6$. As it turned out, our result indicated that there was no solution
that satisfied $\mathbf{c} = (c_1, c_2, \ldots, c_6) \geq 0$. This observation led us to believe that the surmise
(Liu 1996) that the stationary distribution for a random-scan Gibbs sampler is a mixture
of the stationary distributions of all systematic-scan Gibbs samplers is not true in general.
It only holds for $d = 2$.
Table 5. A three-dimensional counter-example to Corollary 2.

                        (1,1,1)  (2,1,1)  (1,2,1)  (2,2,1)  (1,1,2)  (2,1,2)  (1,2,2)  (2,2,2)
f1                      0.1      0.9      0.2      0.8      0.3      0.7      0.4      0.6
f2                      0.5      0.6      0.5      0.4      0.7      0.8      0.3      0.2
f3                      0.9      0.1      0.1      0.9      0.1      0.9      0.9      0.1
π_a1, a1 = (2,3,1)      0.0199   0.1795   0.0411   0.1646   0.1484   0.3462   0.0401   0.0602
π_a2, a2 = (3,1,2)      0.0305   0.2064   0.0305   0.1376   0.1319   0.3251   0.0565   0.0813
π_a3, a3 = (1,2,3)      0.1462   0.0532   0.0087   0.1970   0.0162   0.4784   0.0784   0.0219
π_a4, a4 = (3,2,1)      0.0228   0.2050   0.0355   0.1421   0.1399   0.3263   0.0513   0.0770
π_a5, a5 = (1,3,2)      0.0775   0.1502   0.0775   0.1002   0.0661   0.4001   0.0283   0.1000
π_a6, a6 = (2,1,3)      0.1464   0.0531   0.0087   0.1972   0.0163   0.4782   0.0782   0.0219
π̄ (average)             0.0739   0.1412   0.0337   0.1565   0.0865   0.3924   0.0555   0.0604
π_r0, r0 = (1/3,1/3,1/3) 0.0728  0.1406   0.0331   0.1532   0.0873   0.3944   0.0575   0.0613
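The counter-example can be checked numerically; the sketch below is our own addition
(it assumes numpy and scipy are available, and uses scipy.optimize.nnls, which solves
min ||Ac - b|| subject to c >= 0, rather than the authors' original computation):

import numpy as np
from itertools import permutations
from scipy.optimize import nnls

# Full-state tables from Table 5: entry at state s is f_k(value of x_k at s | rest of s).
# State index = x1 + 2*x2 + 4*x3 with each coordinate in {0, 1}.
t = np.array([[0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6],   # f1(x1 | x2, x3)
              [0.5, 0.6, 0.5, 0.4, 0.7, 0.8, 0.3, 0.2],   # f2(x2 | x1, x3)
              [0.9, 0.1, 0.1, 0.9, 0.1, 0.9, 0.9, 0.1]])  # f3(x3 | x1, x2)

def kernel(k):
    # Local transition matrix T_k: resample coordinate k, keep the others.
    T = np.zeros((8, 8))
    for s in range(8):
        for y in range(2):
            s2 = (s & ~(1 << k)) | (y << k)   # state s with coordinate k set to y
            T[s, s2] = t[k, s2]
    return T

T = [kernel(k) for k in range(3)]

def stationary(M, m=200):
    return np.linalg.matrix_power(M, m)[0]

scans = list(permutations(range(3)))                 # the six fixed scans
pis = np.array([stationary(T[a] @ T[b] @ T[c]) for a, b, c in scans])
pi_r = stationary((T[0] + T[1] + T[2]) / 3)          # random scan, r0 = (1/3, 1/3, 1/3)
print(pi_r.round(4))                                 # cf. the last row of Table 5

c, resid = nnls(pis.T, pi_r)                         # try to solve eq. (2) with c >= 0
print(resid)   # a strictly positive residual means eq. (2) has no exact
               # non-negative solution, consistent with the text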
5. DISCUSSION
This paper provides some simple examples to illustrate the behaviors of the Gibbs
sampler for a full set of conditionally specified distributions that may not be compatible.
We show that for a given scan pattern, the Gibbs sampling procedure forms a
homogeneous Markov chain, and under mild conditions the Gibbs sampler converges to a
unique stationary distribution. Unlike the compatible case, different scan patterns
lead to different stationary distributions for PICSD. The random-scan Gibbs sampler
generally converges to "something in between," but the exact mixture representation only
holds in the simplest case, i.e., when the dimension is two.
Our findings have several implications for the practical application of the Gibbs
sampler, especially when it operates on PICSD. For example, MICE often relies on
a single fixed-scan pattern. This implies that the imputed missing values could change
beyond expected statistical bounds when a seemingly innocuous change is made in the
ordering of the variables. Although in this paper we have not studied the issue of which
fixed-scan pattern produces the “best” joint distribution, some recent work has been done
in that direction. For example, Chen, Ip, and Wang (2011) proposed using an ensemble
approach to derive an optimal joint density. The authors also showed that the random-
scan procedure generally produces promising joint distributions. It is possible that in
some cases the gain from using multiple Gibbs chains, as in the case of random-scan, is
marginal. As argued by Heckerman et al. (2000), the single-chain fixed-scan (pseudo)
Gibbs sampler asymptotically works well when the extent to which the specified
conditional distributions are incompatible is minimal. This may be true for models that
are applied to one single data set with a large sample size. However, the extent of
incompatibility could be much higher when multiple data sets are used and when multiple
sets of conditional models are specified. While it is likely that even in more complex
applications a brute-force implementation of the (pseudo) Gibbs sampler will still provide
some kinds of solutions, the qualities and behaviors of such “solutions” will need to be
rigorously evaluated.
APPENDIX: PROOFS OF ANALYTIC RESULTS
Proof of Theorem 1. We need a lemma to prove Theorem 1, concerning irreducibility
(the ability to reach all interesting points of the state space) and aperiodicity (returning to
a given state at irregular times).

Lemma 1: If $\mathcal{F}$ is positive, then $T_\mathbf{a}$ and $T_\mathbf{r}$ are irreducible and aperiodic for any given
$\mathbf{a}$ and $\mathbf{r} > 0$.

Proof: Let $\mathbf{x}^1 = (x_1^1, x_2^1, x_3^1, \ldots, x_d^1)$ and $\mathbf{x}^2 = (x_1^2, x_2^2, x_3^2, \ldots, x_d^2)$ be two states of the
chain induced by $T_\mathbf{a}$ or $T_\mathbf{r}$. Without loss of generality, we also let $\mathbf{a} = (1, 2, 3, \ldots, d)$ and
$\mathbf{r} = (\tfrac1d, \tfrac1d, \ldots, \tfrac1d)$, so that $T_\mathbf{a} = T_1 T_2 T_3 \cdots T_d$ and $T_\mathbf{r} = \tfrac1d (T_1 + T_2 + \cdots + T_d)$.

To prove that $T_\mathbf{a}$ and $T_\mathbf{r}$ are aperiodic, we must show $(T_\mathbf{a})_{ii} > 0$ and $(T_\mathbf{r})_{ii} > 0$ for all $i$. By
the definition of the local transition probability, we have $(T_k)_{ii} > 0$ for all $k$ if $\mathcal{F}$ is positive.
Consequently, $(T_\mathbf{a})_{ii} \geq \prod_{k=1}^{d} (T_k)_{ii} > 0$ and $(T_\mathbf{r})_{ii} = \tfrac1d \sum_{k=1}^{d} (T_k)_{ii} > 0$ for all $i$.

To prove that $T_\mathbf{a}$ and $T_\mathbf{r}$ are irreducible is equivalent to proving that $\mathbf{x}^1$ and $\mathbf{x}^2$ communicate,
i.e., to showing that the transition probabilities satisfy $P(\mathbf{x}^1 \to \mathbf{x}^2) > 0$ and $P(\mathbf{x}^2 \to \mathbf{x}^1) > 0$. Given the
scan pattern $\mathbf{a}$ we have
$$P(\mathbf{x}^1 \to \mathbf{x}^2) = f_1(x_1^2, x_2^1, \ldots, x_{d-1}^1, x_d^1)\, f_2(x_1^2, x_2^2, x_3^1, \ldots, x_d^1) \cdots f_d(x_1^2, x_2^2, x_3^2, \ldots, x_d^2) > 0,$$
and
$$P(\mathbf{x}^2 \to \mathbf{x}^1) = f_1(x_1^1, x_2^2, \ldots, x_{d-1}^2, x_d^2)\, f_2(x_1^1, x_2^1, x_3^2, \ldots, x_d^2) \cdots f_d(x_1^1, x_2^1, x_3^1, \ldots, x_d^1) > 0.$$
Similarly, for the random-scan case we have
$$P(\mathbf{x}^1 \to \mathbf{x}^2) \geq \left(\tfrac1d\right)^{d} f_1(x_1^2, x_2^1, \ldots, x_{d-1}^1, x_d^1)\, f_2(x_1^2, x_2^2, x_3^1, \ldots, x_d^1) \cdots f_d(x_1^2, x_2^2, x_3^2, \ldots, x_d^2) > 0,$$
and
$$P(\mathbf{x}^2 \to \mathbf{x}^1) \geq \left(\tfrac1d\right)^{d} f_1(x_1^1, x_2^2, \ldots, x_{d-1}^2, x_d^2)\, f_2(x_1^1, x_2^1, x_3^2, \ldots, x_d^2) \cdots f_d(x_1^1, x_2^1, x_3^1, \ldots, x_d^1) > 0. \;\blacksquare$$
It is well known that if a Markov chain is irreducible and aperiodic, then it converges
to a unique stationary distribution (Norris 1997). Consequently, we have the uniqueness
and existence theorem (Theorem 1) for the Gibbs sampler and Gibbs chain.
Proof of Theorem 2. We need a lemma to prove Theorem 2.

Lemma 2: If $\mathcal{F}$ is positive, then the stationary distribution $\pi_\mathbf{a}$ of the Gibbs sampler has $f_{a_d}$
as one of its conditional distributions for the scan pattern $\mathbf{a} = (a_1, a_2, \ldots, a_d)$, i.e.,
$\pi_\mathbf{a}(x_{a_d} \mid x_{a_1}, x_{a_2}, \ldots, x_{a_{d-1}}) = f_{a_d}$.

Proof: Since $x_{a_d}^{(td)} \sim f_{a_d}(x_{a_d} \mid x_{a_1}^{(td)}, x_{a_2}^{(td)}, \ldots, x_{a_{d-1}}^{(td)})$, it follows that
$\pi_\mathbf{a}(x_{a_d} \mid x_{a_1}, x_{a_2}, \ldots, x_{a_{d-1}}) \propto f_{a_d}$. Consequently, $\pi_\mathbf{a}(x_{a_d} \mid x_{a_d}^c) = f_{a_d}(x_{a_d} \mid x_{a_d}^c)$. ■

Theorem 2 easily follows from Lemma 2.
Proof of Theorem 3. "If" part: Since $\mathcal{F}$ is positive and compatible, there exists a
positive joint distribution $\pi > 0$ characterized by $\mathcal{F}$. Under the positivity assumption on $\pi$,
it is well known that the Gibbs sampler governed by $\mathcal{F}$ determines $\pi$ (Besag 1994).

"Only if" part: Let $\mathbf{a}_i = (a_1, a_2, \ldots, a_d)$ be a scan pattern with $a_d = i$, $i = 1, \ldots, d$.
Assume that there exists a $\pi$ such that $\pi_\mathbf{a} = \pi$ for all $\mathbf{a}$. From Theorem 1 and Lemma 2, it
follows that $\pi(x_{a_d} \mid x_{a_d}^c) = f_{a_d}(x_{a_d} \mid x_{a_d}^c)$ for all $\mathbf{a}$. Thus, $\pi(x_i \mid x_i^c) = \pi_{\mathbf{a}_i}(x_i \mid x_i^c) = f_i(x_i \mid x_i^c)$ for all $i$.
Hence $\mathcal{F}$ is compatible and $\pi$ is the joint distribution of $\mathcal{F}$.

Now assume that there exists a $\pi$ such that $\pi_\mathbf{r} = \pi$ for all $\mathbf{r}$. We only need to prove that
$\pi_\mathbf{a} = \pi$ for all $\mathbf{a}$. From Theorem 1, we have $\pi_\mathbf{a} T_\mathbf{a} = \pi_\mathbf{a}$. By the definition of the random-scan
Gibbs sampler, $\pi T_\mathbf{r} = \pi$ for every $\mathbf{r} > 0$; varying $\mathbf{r}$ and using the linearity of
$T_\mathbf{r} = \sum_k r_k T_k$ in $\mathbf{r}$ gives $\pi T_k = \pi$ for all $k$. It follows that
$$\pi T_\mathbf{a} = \pi T_{a_1} T_{a_2} \cdots T_{a_d} = (\pi T_{a_1}) T_{a_2} \cdots T_{a_d} = \cdots = \pi T_{a_d} = \pi.$$
From Theorem 1, $\pi_\mathbf{a}$ is uniquely determined by $T_\mathbf{a}$. As a result, $\pi_\mathbf{r} = \pi = \pi_\mathbf{a}$ for all $\mathbf{r}$ and all $\mathbf{a}$. ■
Proof of Corollary 1. The proof follows directly from Theorems 2 and 3. ■
Proof of Corollary 2. Since $\mathcal{F}$ is positive, $\pi_{\mathbf{a}_1}$, $\pi_{\mathbf{a}_2}$, and $\pi_\mathbf{r}$ are stationary distributions
uniquely determined by $\mathbf{a}_1$, $\mathbf{a}_2$, and $\mathbf{r}$, respectively. By Lemma 2, $\pi_{\mathbf{a}_1} T_2 = \pi_{\mathbf{a}_1}$ and
$\pi_{\mathbf{a}_2} T_1 = \pi_{\mathbf{a}_2}$; moreover, $\pi_{\mathbf{a}_1} T_1$ is stationary with respect to $T_{\mathbf{a}_2} = T_2 T_1$, so by
uniqueness $\pi_{\mathbf{a}_1} T_1 = \pi_{\mathbf{a}_2}$, and similarly $\pi_{\mathbf{a}_2} T_2 = \pi_{\mathbf{a}_1}$. Therefore,
$$\begin{aligned}
\left[(1-r)\pi_{\mathbf{a}_1} + r\pi_{\mathbf{a}_2}\right]\left[r T_1 + (1-r) T_2\right]
&= r(1-r)\pi_{\mathbf{a}_1} T_1 + (1-r)^2 \pi_{\mathbf{a}_1} T_2 + r^2 \pi_{\mathbf{a}_2} T_1 + r(1-r)\pi_{\mathbf{a}_2} T_2 \\
&= r(1-r)\pi_{\mathbf{a}_2} + (1-r)^2 \pi_{\mathbf{a}_1} + r^2 \pi_{\mathbf{a}_2} + r(1-r)\pi_{\mathbf{a}_1} \\
&= (1-r)\pi_{\mathbf{a}_1} + r\pi_{\mathbf{a}_2}.
\end{aligned}$$
Because $T_\mathbf{r} = rT_1 + (1-r)T_2$ is the transition kernel for the random-scan Gibbs chain with
selection probabilities $\mathbf{r} = (r, 1-r)$, the uniquely determined $\pi_\mathbf{r}$ equals
$(1-r)\pi_{\mathbf{a}_1} + r\pi_{\mathbf{a}_2}$. ■
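The identities used in the display above can also be spot-checked numerically for
Example 2 (a check we add, not part of the original proof; numpy is assumed):

import numpy as np

# Example 2 kernels, built exactly as in the Section 3 sketch.
f1 = np.array([[1/4, 1/3], [3/4, 2/3]])
f2 = np.array([[1/3, 2/3], [1/10, 9/10]])
T1 = np.zeros((4, 4)); T2 = np.zeros((4, 4))
for x1 in range(2):
    for x2 in range(2):
        s = x1 + 2 * x2
        for y in range(2):
            T1[s, y + 2 * x2] = f1[y, x2]
            T2[s, x1 + 2 * y] = f2[x1, y]

pi_a1 = np.linalg.matrix_power(T1 @ T2, 64)[0]
pi_a2 = np.linalg.matrix_power(T2 @ T1, 64)[0]
print(np.abs(pi_a1 @ T1 - pi_a2).max())   # ~0: pi_a1 T1 = pi_a2
print(np.abs(pi_a2 @ T2 - pi_a1).max())   # ~0: pi_a2 T2 = pi_a1
for r in (0.25, 0.5, 0.75):
    pi_r = np.linalg.matrix_power(r * T1 + (1 - r) * T2, 256)[0]
    print(np.abs(pi_r - ((1 - r) * pi_a1 + r * pi_a2)).max())  # ~0 for each r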
REFERENCES
Arnold, B. C., Castillo, E., and Sarabia, J. M. (2002), “Exact and Near Compatibility of
Discrete Conditional Distributions,” Computational Statistics and Data Analysis,
40, 231–252.
Besag, J. E. (1994), “Discussion of Markov Chains for Exploring Posterior Distributions,”
The Annals of Statistics, 22, 1734–1741.
Casella, G. and George, E. (1992), “Explaining the Gibbs Sampler,” The American
Statistician, 46, 167–174.
Chen, S–H., Ip, E. H., and Wang, Y. (2011), “Gibbs Ensembles for Nearly Compatible
and Incompatible Conditional Models,” Computational Statistics and Data
Analysis, 55, 1760–1769.
Drechsler, J. and Rässler, S. (2008), “Does Convergence Really Matter?” in Recent
Advances in Linear Models and Related Areas, eds. Shalabh, and C. Heumann,
Heidelberg: Physica-Verlag, pp. 341–355.
Gelfand, A. E. and Smith, A. F. M. (1990), “Sampling-Based Approaches to Calculating
Marginal Densities,” Journal of the American Statistical Association, 85, 398–409.
Geman, S. and Geman, D. (1984), “Stochastic Relaxation, Gibbs Distributions, and the
Bayesian Restoration of Images,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 6, 721–741.
Gilks, W. R., Richardson, S., and Spiegelhalter, D. (1996), Markov Chain Monte Carlo
in Practice. London: Chapman & Hall.
Hastings, W. K. (1970), “Monte Carlo Sampling Methods Using Markov Chains and
Their Applications,” Biometrika, 57, 97–109.
Heckerman, D., Chickering, D. M., Meek, C., Rounthwaite, R., and Kadie, C. (2000),
“Dependency Networks for Inference, Collaborative Filtering, and Data
Visualization,” Journal of Machine Learning Research, 1, 49–75.
Hobert, J. P. and Casella, G. (1998), “Functional Compatibility, Markov Chains and
Gibbs Sampling with Improper Posteriors,” Journal of Computational and
Graphical Statistics, 7, 42–60.
Levine, R. and Casella, G. (2006), “Optimizing Random Scan Gibbs Samplers,” Journal
of Multivariate Analysis, 97, 2071–2100.
Liu, J. S. (1996), Discussion on “Statistical inference and Monte Carlo algorithms,” by G.
Casella, Test, 5, 305–310.
Liu, J. S., Wong, W. H., and Kong, A. (1995), “Covariance Structure and Convergence
Rate of the Gibbs Sampler with Various Scans,” Journal of the Royal Statistical
Society, Ser. B, 57, 157–169.
Madras, N. (2002), Lectures on Monte Carlo Methods, Providence, RI: American
Mathematical Society.

Norris, J. R. (1997), Markov Chains, Cambridge, UK: Cambridge University Press.
Rässler, S., Rubin, D.B., and Zell, E.R. (2008), “Incomplete Data in Epidemiology and
Medical Statistics,” in Handbook of Statistics 27: Epidemiology and Medical
Statistics, eds. C. R. Rao, J. P. Miller and D. C. Rao, The Netherlands: Elsevier,
pp. 569–601.
Rubin, D.B. (2003), “Nested Multiple Imputation of NMES via Partially Incompatible
MCMC,” Statistica Neerlandica, 57, 3–18.
Schafer, J. L. (1997), Analysis of Incomplete Multivariate Data, London: Chapman &
Hall.
Smith, A. F. M. and Roberts, G. O. (1993), “Bayesian Computation via the Gibbs
Sampler and Related Markov Chain Monte Carlo Methods,” Journal of the Royal
Statistical Society, Ser. B, 55, 3–23.
Tierney, L. (1994), “Markov Chains for Exploring Posterior Distributions,” The Annals of
Statistics, 22, 1701–1728.
van Buuren, S., Boshuizen, H. C., and Knook, D. L. (1999), “Multiple Imputation of
Missing Blood Pressure Covariates in Survival Analysis,” Statistics in Medicine,
18, 681–694.

van Buuren, S., Brand, J. P. L., Groothuis-Oudshoorn, C. G. M., and Rubin, D. B. (2006),
“Fully Conditional Specification in Multivariate Imputation,” Journal of Statistical
Computation and Simulation, 76, 1049–1064.
White, I. R., Royston, P., and Wood, A. M. (2011), “Multiple Imputation Using Chained
Equations: Issues and Guidance for Practice,” Statistics in Medicine, 30, 377–399.