Aspects Of Composite Likelihood Estimation And Prediction

by

Ximing Xu

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy

Graduate Department of Statistics
University of Toronto

© Copyright by Ximing Xu 2012

Aspects Of Composite Likelihood Estimation And Prediction

Ximing Xu

Submitted for the Degree of Doctor of Philosophy
Department of Statistics, University of Toronto

June 2012

Abstract

A composite likelihood is usually constructed by multiplying a collection of lower dimensional marginal or conditional densities. In recent years, composite likelihood methods have received increasing interest for modeling complex data arising from various application areas, where the full likelihood function is analytically unknown or computationally prohibitive due to the structure of dependence, the dimension of the data or the presence of nuisance parameters.

In this thesis we investigate some theoretical properties of the maximum composite likelihood estimator (MCLE). In particular, we obtain the limit of the MCLE in a general setting, and set out a framework for understanding the notion of robustness in the context of composite likelihood inference. We also study the improvement of the efficiency of a composite likelihood by incorporating additional component likelihoods, or by using component likelihoods of higher dimension. We show through some illustrative examples that such strategies do not always work and may impair the efficiency. We also show that the MCLE of the parameter of interest can be less efficient when the nuisance parameters are known than when they are unknown.

In addition to the theoretical study of composite likelihood estimation, we also explore the possibility of using composite likelihood to make predictive inference in computer experiments. The Gaussian process model is widely used to build statistical emulators for computer experiments. However, when the number of trials is large, both estimation and prediction based on a Gaussian process can be computationally intractable due to the dimension of the covariance matrix. To address this problem, we propose prediction methods based on different composite likelihood functions, which do not require the evaluation of the large covariance matrix and hence alleviate the computational burden. Simulation studies show that the blockwise composite likelihood-based predictors perform well and are competitive with the optimal predictor based on the full likelihood.

Acknowledgements

First and foremost, I would like to express my deepest gratitude to my supervisor, Professor Nancy Reid, for her inspiration, guidance, encouragement and patience throughout my doctoral program.

I thank my thesis committee members, Professor Keith Knight and Professor Muni Srivastava, and the external examiner, Professor Naisyin Wang of the University of Michigan, for their constructive feedback and suggestions on my work. I would also like to thank Professor Derek Bingham at Simon Fraser University for his help on my work on computer experiments. Many thanks to all the faculty, staff and my fellow graduate students for making my study at Toronto such an enjoyable experience.

Finally, I dedicate this thesis to my parents, for always understanding, supporting and loving me.

Contents

1 Introduction
  1.1 Definition and Notation
  1.2 Asymptotics of maximum composite likelihood estimators
  1.3 Main results and Outline

2 On the robustness of maximum composite likelihood estimators
  2.1 The limit of the MCLE in a general setting
    2.1.1 Introduction and assumptions
    2.1.2 The main theorem
  2.2 Aspects of robustness for the MCLE
  2.3 Examples
    2.3.1 Estimation of association parameters
    2.3.2 Estimation of the correlation
    2.3.3 No compatible joint density exists
    2.3.4 A class of distributions with normal margins
  2.4 Concluding Remarks

3 On the efficiency of maximum composite likelihood estimators
  3.1 Efficiency of composite likelihood with more component likelihoods
    3.1.1 General setting
    3.1.2 Product of information-unbiased composite likelihoods
    3.1.3 Product of uncorrelated composite likelihoods
    3.1.4 Pairwise likelihood and independence likelihood
  3.2 Efficiency of composite likelihood with known nuisance parameters
    3.2.1 Equicorrelated multivariate normal model
    3.2.2 Discussion
  3.3 Theoretical results on multiparameter case
  3.4 Concluding Remarks

4 Prediction in computer experiments with composite likelihood
  4.1 Computer experiments
    4.1.1 Gaussian random function model
    4.1.2 Estimation for GRF model
    4.1.3 Prediction for GRF model
  4.2 Estimation using composite likelihood
  4.3 Prediction using composite likelihood
    4.3.1 Maximum pairwise likelihood predictors
    4.3.2 Maximum triplet-wise likelihood predictors
    4.3.3 Maximum blockwise likelihood predictors
  4.4 Simulation study
    4.4.1 Prediction for GRF model with 1-dimensional input
    4.4.2 Prediction for GRF model with 2-dimensional input
  4.5 Discussion and Future Work

Bibliography

List of Figures

2.1 The ratio of the simulated variances, Sρ̂/Sρ̂pl, as a function of ρ; the lines shown are for p = 3, 6, 8 (descending)

3.1 The asymptotic variances (multiplied by n) of the maximum composite likelihood estimators for the full likelihood (solid line), the independence likelihood (dashed line) and the pairwise likelihood (dotted line)

3.2 The ratio r(ρ) of the asymptotic variances of the pairwise likelihood estimators of ρ with σ² known and unknown, at p = 3; the vertical and horizontal dashed lines denote ρ = 0 and r(ρ) = 1, respectively

3.3 The ratio r(ρ) of the asymptotic variances of the composite likelihood estimators of ρ with nuisance parameters known and unknown, at p = 3; the horizontal dashed line denotes r(ρ) = 1

List of Tables

2.1 Model Specification

2.2 Performances of ρ̂, ρ̂pl and ρ̃ when n = 100, M = 10000, p = 3 and t = 1

4.1 EMSPE of the six predictors for different γ when α = 1.99

4.2 EMSPE of the six predictors for different α when γ = 100

4.3 EMSPE of the six predictors at different sample sizes n

4.4 EMSPE of the six predictors with 2-dimensional input

Chapter 1

Introduction

The likelihood function plays a critical role in statistical inference in both frequentist and Bayesian frameworks. However, with the current explosion in the size of data sets and the increase in complexity of the dependencies among variables in many realistic models, the full likelihood function is often impractical to construct or too cumbersome to evaluate. In these situations, composite likelihood, which is usually constructed by compounding some lower dimensional likelihood objects, appears to be an attractive alternative.

The idea of composite likelihood dates back at least to the pseudolikelihood, a product of conditional densities, suggested by Besag (1974, 1975) for modelling spatial data. Since then, various types of composite likelihood methods have been proposed in a range of complex applications. Examples include variants of Besag's pseudolikelihood using blocks of observations in both conditional and conditioned events (Vecchia, 1988; Stein et al., 2004), the pairwise likelihood for longitudinal data (Renard et al., 2004) and time series (Varin and Vidoni, 2006; Davis and Yau, 2011), and composite likelihood methods in Markov chain models (Hjort and Varin, 2007). A recent comprehensive review of the applications of composite likelihood methods is given in Varin et al. (2011).

1.1 Definition and Notation

Consider a p-dimensional random vector Y with probability density function f(y; θ) for some q-dimensional parameter vector θ ∈ Θ. Denote by A1, ..., AK a set of marginal or conditional events, for example determined by the joint or conditional distributions of sub-vectors of Y, with associated likelihood functions Lk(θ; y), k = 1, ..., K. Following Lindsay (1988), the composite likelihood function is defined as

$$CL(\theta; y) = \prod_{k=1}^{K} L_k(\theta; y)^{w_k}, \qquad (1.1)$$

where {wk} is a set of nonnegative weights. Note that Lk(θ; y) might depend only on a sub-vector of θ. The choice of the component likelihood functions Lk(θ; y) and the weights {wk} may be critical to improve the accuracy and efficiency of the resulting statistical inference (Lindsay, 1988; Joe and Lee, 2009; Varin et al., 2011). From the above definition it is easy to see that the full likelihood function is a special case of a composite likelihood function; however, a composite likelihood function will not usually be a genuine likelihood function, that is, it will not be proportional to the density function of any random vector.

Composite likelihood functions are usually distinguished into conditional and marginal versions. Two of the most commonly used composite conditional likelihood functions are the pairwise conditional likelihood function,

$$CL_{PC}(\theta; y) = \prod_{r=1}^{p} \prod_{s \neq r} f(y_r \mid y_s; \theta)^{w_{rs}}, \qquad (1.2)$$

and the full conditional likelihood function,

$$CL_{FC}(\theta; y) = \prod_{r=1}^{p} f(y_r \mid y_{(-r)}; \theta)^{w_r}, \qquad (1.3)$$

where y(−r) denotes the vector y omitting the rth component yr.

Two particularly useful composite marginal likelihood functions are the independence likelihood function,

$$CL_{ind}(\theta; y) = \prod_{r=1}^{p} f(y_r; \theta)^{w_r}, \qquad (1.4)$$

and the pairwise likelihood function,

$$CL_{pair}(\theta; y) = \prod_{r=1}^{p-1} \prod_{s=r+1}^{p} f(y_r, y_s; \theta)^{w_{rs}}. \qquad (1.5)$$

With a sample of independent observations y = (y^(1), ..., y^(n)), where each y^(i) is a p-dimensional vector, the composite log-likelihood function is

$$c\ell(\theta; y) = \sum_{i=1}^{n} c\ell(\theta; y^{(i)}) = \sum_{i=1}^{n} \log CL(\theta; y^{(i)}), \qquad (1.6)$$

and the composite score function uc(θ; y) is defined as

$$u_c(\theta; y) = \nabla_\theta\, c\ell(\theta; y), \qquad (1.7)$$

where ∇θ denotes differentiation with respect to the vector parameter θ. The notation u(θ; y) is reserved for the score function of the full likelihood.
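As a concrete illustration (not from the thesis), the following Python sketch evaluates the equal-weight pairwise log-likelihood (1.5) and a numerical version of the composite score (1.7) for an equicorrelated p-variate normal working model with zero means and unit variances; the model, sample size, seed and bracketing interval are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import multivariate_normal

def pairwise_loglik(rho, y):
    """Equal-weight pairwise log-likelihood (1.5): the sum over all pairs
    (r, s) of bivariate normal log-densities with correlation rho."""
    n, p = y.shape
    cov = [[1.0, rho], [rho, 1.0]]
    return sum(multivariate_normal.logpdf(y[:, [r, s]], cov=cov).sum()
               for r in range(p - 1) for s in range(r + 1, p))

def composite_score(rho, y, eps=1e-6):
    """Numerical composite score u_c(rho; y), i.e. d/d(rho) of (1.6)."""
    return (pairwise_loglik(rho + eps, y) - pairwise_loglik(rho - eps, y)) / (2 * eps)

# illustrative data: equicorrelated trivariate normal, true rho = 0.4
rng = np.random.default_rng(0)
p, n, rho0 = 3, 500, 0.4
Sigma = (1 - rho0) * np.eye(p) + rho0 * np.ones((p, p))
y = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
rho_cl = brentq(composite_score, -0.45, 0.95, args=(y,))  # MCLE: root of the score
print(rho_cl)  # close to rho0 = 0.4
```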

Because each component likelihood function Lk(θ; y) is based on a marginal or conditional density, it is easy to show that E{uc(θ; y)} = 0 under mild regularity conditions on the Lk(θ; y), such as differentiability of log Lk(θ; y) under the integral sign and a support that does not depend on θ (Godambe, 1960). The Kullback-Leibler inequality also applies to the composite likelihood function (1.1):

$$E_{\theta_0}\left\{\log \frac{CL(\theta_0; y)}{CL(\theta; y)}\right\} = \sum_{k=1}^{K} E_{\theta_0}\left\{\log \frac{L_k(\theta_0; y)}{L_k(\theta; y)}\right\} \ge 0, \qquad (1.8)$$

where θ0 denotes the true value of θ, and the expectation Eθ0(·) is determined by the true density functions with θ = θ0. The validity of composite likelihood inference is usually justified either by the theory of unbiased estimating equations or by the Kullback-Leibler criterion (Lindsay, 1988; Varin, 2008; Molenberghs and Verbeke, 2005).

Composite likelihood versions of the Wald, score, and likelihood ratio test statistics can easily be constructed, and have been proposed in a range of applications (Geys et al., 1999; Chandler and Bate, 2007; Varin, 2008; Pace et al., 2011; Molenberghs and Verbeke, 2005, Chapter 9). Some discussion of higher order asymptotic theory for the composite likelihood ratio statistic can be found in Jin (2009, Section 2.4). Model selection criteria have also been derived under the framework of composite likelihood; examples include the composite AIC criterion in Varin and Vidoni (2005) and the composite BIC criterion in Gao and Song (2010). In this thesis we focus on estimation and prediction using composite likelihood functions. Some theoretical results on the maximum composite likelihood estimator are summarized in the next section.

1.2 Asymptotics of maximum composite likelihood estimators

Given a sample y = (y^(1), ..., y^(n)) as in Section 1.1, the maximum composite likelihood estimator (MCLE) of θ in the family of models f(y; θ) is defined as the maximizer of the composite log-likelihood function (1.6), i.e.

$$\hat\theta_{CL} = \arg\max_\theta\, c\ell(\theta; y). \qquad (1.9)$$

In many applications, θ̂CL may be found by solving the composite score equation uc(θ; y) = ∇θ cℓ(θ; y) = 0. Under some regularity conditions, as n → ∞, θ̂CL is consistent and asymptotically normally distributed (Lindsay, 1988; Varin, 2008; Xu and Reid, 2011):

$$\sqrt{n}(\hat\theta_{CL} - \theta) \xrightarrow{d} N_q\{0,\, G^{-1}(\theta)\},$$

where Nq{µ, Σ} denotes the q-dimensional normal distribution with mean and variance as indicated, and G(θ) is the Godambe information matrix (Godambe, 1960):

$$G(\theta) = H(\theta)\, J^{-1}(\theta)\, H(\theta), \qquad (1.10)$$

where the sensitivity matrix H(θ) = E{−∇θ uc(θ; y)} and the variability matrix J(θ) = varθ{uc(θ; y)}. Throughout this thesis both H(θ) and J(θ) are assumed to be positive definite matrices. Note that H(θ) and J(θ) are by definition nonnegative definite matrices.

Because the component score functions uk(θ; y) = ∇θ log Lk(θ; y) can be correlated, the second Bartlett identity does not hold for composite likelihood functions in general, i.e. H(θ) ≠ J(θ). Following Lindsay (1982), we call a composite likelihood CL(θ; y) information-unbiased if H(θ) = J(θ) for all θ ∈ Θ, and information-biased otherwise. For the full likelihood function, we have H(θ) = J(θ) = I(θ), the expected Fisher information.

By differentiating the unbiasedness identity E{uc(θ; y)} = 0 with respect to θ, we have

$$E\{\nabla_\theta u_c(\theta; y)\} + E\{u(\theta; y)\, u_c^{T}(\theta; y)\} = 0, \qquad (1.11)$$

where u(θ; y) is the score function of the full likelihood. So H(θ) = cov{u(θ; y), uc(θ; y)}, and the Godambe information G(θ) can also be written as

$$G(\theta) = \mathrm{cov}\{u(\theta; y),\, u_c(\theta; y)\}\, \big[\mathrm{var}_\theta\{u_c(\theta; y)\}\big]^{-1}\, \mathrm{cov}\{u_c(\theta; y),\, u(\theta; y)\}. \qquad (1.12)$$

The multivariate version of the Cauchy-Schwarz inequality implies that

$$I(\theta) = \mathrm{var}_\theta\{u(\theta; y)\} \ge G(\theta),$$

i.e. the full likelihood function is more efficient than any other composite likelihood function (Lindsay, 1988, Lemma 4A).
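The sandwich form of G(θ) is easy to check by simulation. The sketch below (an illustration, not part of the thesis) assumes the independence likelihood for a bivariate normal with common mean µ, known unit variances and correlation ρ; analytically H(µ) = 2 and J(µ) = 2(1 + ρ), so H ≠ J whenever ρ ≠ 0.

```python
import numpy as np

# Working model: independence likelihood f(y1; mu) f(y2; mu) for a bivariate
# normal with common mean mu, unit variances and (ignored) correlation rho.
# Composite score: u_c(mu; y) = (y1 - mu) + (y2 - mu), with du_c/dmu = -2.
rng = np.random.default_rng(1)
n, rho, mu0 = 200_000, 0.5, 1.0
y = rng.multivariate_normal([mu0, mu0], [[1.0, rho], [rho, 1.0]], size=n)
u = (y[:, 0] - mu0) + (y[:, 1] - mu0)

H = 2.0            # sensitivity H(mu) = E{-du_c/dmu}
J = u.var()        # variability J(mu) = var{u_c}; analytically 2(1 + rho) = 3
G = H ** 2 / J     # Godambe information (1.10); analytically 2/(1 + rho) = 4/3
print(H, J, G)     # H != J: the second Bartlett identity fails here
```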

1.3 Main results and Outline

In modelling only lower dimensional marginal or conditional densities, composite likelihood inference is widely viewed as robust, in the sense that the inference is valid for a range of statistical models consistent with the component densities. Chapter 2 presents a rigorous proof of the limit of the maximum composite likelihood estimator, taking model misspecification into account. The notion of robustness in the context of composite likelihood inference is studied based on the consistency of the MCLE, and clarified through some illustrative examples. We also carry out a simulation study of the performance of the MCLE in a constructed model suggested by Arnold (2010) that is not multivariate normal, but has multivariate normal marginal distributions.

Intuitively, a more efficient composite likelihood function can be obtained by pooling more component likelihood functions, or by using higher dimensional component likelihood functions. However, such strategies may cause a loss of efficiency even when the component likelihood functions are independent, as illustrated through some simple examples in Chapter 3. Sufficient conditions that guarantee an improvement in efficiency are presented in different scenarios. Chapter 3 also investigates the efficiency of maximum composite likelihood estimation in the presence of nuisance parameters. In the equicorrelated multivariate normal model, we show that the maximum composite likelihood estimator of the correlation coefficient ρ can be less efficient when the common variance σ² is known than when σ² is unknown. Necessary and sufficient conditions for this paradox to occur are discussed.

Composite likelihood methods have been proposed to estimate the parameters in a variety of complex models (Varin et al., 2011). However, little work has been done on using composite likelihood for predictive inference (Grunenfelder, 2010). Chapter 4 aims to use composite likelihood to make predictions in computer experiments, where Gaussian processes are usually used to build the statistical emulators. When the number of trials n is large, both estimation and prediction based on a Gaussian process can be computationally intractable due to the dimension of the covariance matrix. To address this problem, we propose prediction methods based on maximizing different composite likelihood functions, which reduce the computational complexity from O(n³) to O(n²). Simulation studies show that the proposed predictors outperform the pairwise likelihood-based predictor suggested by Grunenfelder (2010), and are competitive with the optimal predictor based on the full likelihood function.

Chapter 2

On the robustness of maximum composite likelihood estimators

This chapter is a complement to the discussion of the robustness of composite likelihood inference in Varin (2008) and Varin et al. (2011). To formulate ideas about robustness we distinguish between the true data-generating model and the working model used for inference, following Kent (1982). We assume the random vector Y has distribution function G(y), with associated density function g(y); the marginal distribution function of the subvector Ysk ⊂ Y is Gk(ysk) and the corresponding density function is gk(ysk), k = 1, ..., K, with respect to some dominating measure µ. Now consider the family of modelled distributions for Ysk, with common support and density functions {fk(ysk; θ); θ ∈ Ω} with respect to the same dominating measure µ. We restrict attention to the unweighted composite marginal likelihood function:

$$CL(\theta; y) = \prod_{k=1}^{K} f_k(y_{s_k}; \theta). \qquad (2.1)$$

The family of densities {f(y; θ); θ ∈ Ω} is correctly specified if there exists θ0 ∈ Ω such that f(y; θ0) = g(y); if no such θ0 exists, the model is misspecified. The composite marginal likelihood function (2.1) is correctly specified if {fk(ysk; θ); θ ∈ Ω} are correctly specified for all k ∈ {1, ..., K}.

If the full model is misspecified, then as in Kent (1982) and White (1982), we define θ*ML as the parameter value which minimizes the Kullback-Leibler divergence between the specified full model and the true model g(·). Similarly, for a misspecified composite likelihood function CL(θ; y), θ* is defined as the parameter point which minimizes the composite Kullback-Leibler divergence (Varin and Vidoni, 2005):

$$\theta^* = \arg\min_\theta E_g\left\{\log \frac{\prod_{k=1}^{K} g_k(y_{s_k})}{CL(\theta; y)}\right\} = \arg\min_\theta \sum_{k=1}^{K} E_g\left\{\log \frac{g_k(y_{s_k})}{f_k(y_{s_k}; \theta)}\right\}. \qquad (2.2)$$

Consistency of the maximum composite likelihood estimator is claimed in several papers, although without detailed proof; see for example Lindsay (1988), Molenberghs and Verbeke (2005, Chapter 9) and Jin (2009). Asymptotic results on misspecified full likelihood functions, as in White (1982), cannot be applied directly to the case of composite likelihood, since the composite likelihood function will not usually be a genuine likelihood function, as mentioned in Section 1.1. In Section 2.1 we adapt Wald's classical approach (Wald, 1949) to establish the result that the MCLE converges almost surely to θ* defined in (2.2), taking model misspecification into account. The regularity conditions are analogous to those given in Wald's proof, but applied to each component density function fk(ysk; θ) without explicit assumptions on the full model.

2.1 The limit of the MCLE in a general setting

For analytical simplicity we treat only the composite marginal likelihood functions (2.1) with equal weights; however, the results obtained here should generalize readily to more general situations.

2.1.1 Introduction and assumptions

Following Wald (1949) we introduce some notation for the needed assumptions. For any θ and for ρ, r > 0, let f(y; θ, ρ) = sup{f(y; θ′) : ‖θ′ − θ‖ ≤ ρ}, where ‖·‖ denotes the Euclidean norm; φ(y, r) = sup{f(y; θ) : ‖θ‖ > r}; f*(y; θ, ρ) = max{f(y; θ, ρ), 1}; φ*(y, r) = max{φ(y, r), 1}.

For each k ∈ {1, 2, ..., K}, we make the following assumptions, analogous to Assumptions 1 to 8 in Wald (1949):

(A0): The parameter space Ω is a closed subset of q-dimensional Cartesian space.

(A1): fk(ysk; θ, ρ) is a measurable function of ysk for any θ and ρ.

(A2): The density function fk(ysk; θ) is distinct for different values of θ, i.e. if θ1 ≠ θ2, then µ[{ysk : fk(ysk; θ1) ≠ fk(ysk; θ2)}] > 0.

(A3): For sufficiently small ρ and sufficiently large r, the expected values ∫ log f*k(ysk; θ, ρ) gk(ysk) dµ(ysk) and ∫ log φ*k(ysk, r) gk(ysk) dµ(ysk) are finite.

(A4): For any θ ∈ Ω, there exists a set B^k_θ such that ∫_{B^k_θ} gk(ysk) dµ(ysk) = 0 and fk(ysk; θ′) → fk(ysk; θ) as θ′ → θ for all ysk outside B^k_θ.

(A5): The expectation of log gk(ysk) exists.

(A6): There exists a set Ak such that ∫_{Ak} gk(ysk) dµ(ysk) = 0 and lim_{‖θ‖→∞} fk(ysk; θ) = 0 for all ysk outside Ak.

(A7): There exists a unique point θ* ∈ Ω which minimizes the composite Kullback-Leibler divergence defined in (2.2).


2.1.2 The main theorem

Theorem 1. Assume that y^(1), ..., y^(n) are independently and identically distributed with distribution function G(y). Under the regularity conditions (A0)-(A7), the maximum composite likelihood estimator θ̂CL converges almost surely to θ* defined in (2.2).

Before we prove Theorem 1, we state the following lemmas. By the expected value Eg(·), we mean the expected value determined under the true distribution G(y).

Lemma 1. For any θ ≠ θ*, we have

$$E_g \sum_{k=1}^{K} \log f_k(y_{s_k}; \theta) < E_g \sum_{k=1}^{K} \log f_k(y_{s_k}; \theta^*) \le E_g \sum_{k=1}^{K} \log g_k(y_{s_k}).$$

Lemma 2.

$$\lim_{\rho \to 0} E_g \sum_{k=1}^{K} \log f_k(y_{s_k}; \theta, \rho) = E_g \sum_{k=1}^{K} \log f_k(y_{s_k}; \theta).$$

Lemma 3.

$$\lim_{r \to \infty} E_g \sum_{k=1}^{K} \log \varphi_k(y_{s_k}, r) = -\infty.$$

The three lemmas follow immediately from Assumption (A7) and Lemmas 1, 2 and 3 in Wald (1949).

Proof: First we shall prove that

$$\Pr\left\{\lim_{n\to\infty} \frac{\sup_{\theta \in \omega} \prod_{i=1}^{n}\prod_{k=1}^{K} f_k(y^{(i)}_{s_k}; \theta)}{\prod_{i=1}^{n}\prod_{k=1}^{K} f_k(y^{(i)}_{s_k}; \theta^*)} = 0\right\} = 1, \qquad (2.3)$$

for any closed subset ω of Ω which does not contain θ* defined in (A7).

From Lemma 3, for each i, we can choose r0 > 0 such that

$$E_g \sum_{k=1}^{K} \log \varphi_k(y^{(i)}_{s_k}, r_0) < E_g \sum_{k=1}^{K} \log f_k(y^{(i)}_{s_k}; \theta^*). \qquad (2.4)$$

Let ω0 = {θ : θ ∈ ω and ‖θ‖ ≤ r0} ⊆ ω. From Lemmas 1 and 2, for each θ ∈ ω0 we can find a ρθ such that

$$E_g \sum_{k=1}^{K} \log f_k(y^{(i)}_{s_k}; \theta, \rho_\theta) < E_g \sum_{k=1}^{K} \log f_k(y^{(i)}_{s_k}; \theta^*). \qquad (2.5)$$

Since ω0 is compact, by the finite-covering theorem there exists a finite number of points θ1, ..., θh in ω0 such that S(θ1, ρθ1) ∪ ... ∪ S(θh, ρθh) ⊇ ω0, where S(θ, ρ) denotes the sphere with center θ and radius ρ. Clearly, we have

$$\sup_{\theta \in \omega} \prod_{i=1}^{n}\prod_{k=1}^{K} f_k(y^{(i)}_{s_k}; \theta) \le \sum_{l=1}^{h} \prod_{i=1}^{n}\prod_{k=1}^{K} f_k(y^{(i)}_{s_k}; \theta_l, \rho_{\theta_l}) + \prod_{i=1}^{n}\prod_{k=1}^{K} \varphi_k(y^{(i)}_{s_k}, r_0).$$

Hence (2.3) is proved if we can show that

$$\Pr\left\{\lim_{n\to\infty} \frac{\prod_{i=1}^{n}\prod_{k=1}^{K} f_k(y^{(i)}_{s_k}; \theta_l, \rho_{\theta_l})}{\prod_{i=1}^{n}\prod_{k=1}^{K} f_k(y^{(i)}_{s_k}; \theta^*)} = 0\right\} = 1, \quad (l = 1, ..., h),$$

and

$$\Pr\left\{\lim_{n\to\infty} \frac{\prod_{i=1}^{n}\prod_{k=1}^{K} \varphi_k(y^{(i)}_{s_k}, r_0)}{\prod_{i=1}^{n}\prod_{k=1}^{K} f_k(y^{(i)}_{s_k}; \theta^*)} = 0\right\} = 1.$$

Proving the above two equations is equivalent to showing that, for l = 1, ..., h,

$$\Pr\left[\lim_{n\to\infty} \sum_{i=1}^{n} \left\{\log \prod_{k=1}^{K} f_k(y^{(i)}_{s_k}; \theta_l, \rho_{\theta_l}) - \log \prod_{k=1}^{K} f_k(y^{(i)}_{s_k}; \theta^*)\right\} = -\infty\right] = 1,$$

and

$$\Pr\left[\lim_{n\to\infty} \sum_{i=1}^{n} \left\{\log \prod_{k=1}^{K} \varphi_k(y^{(i)}_{s_k}, r_0) - \log \prod_{k=1}^{K} f_k(y^{(i)}_{s_k}; \theta^*)\right\} = -\infty\right] = 1.$$

These equations follow immediately from (2.4), (2.5) and the strong law of large numbers.

Let θn(y^(1), ..., y^(n)) be any function of the observations y^(1), ..., y^(n) such that

$$\frac{\prod_{i=1}^{n}\prod_{k=1}^{K} f_k(y^{(i)}_{s_k}; \theta_n)}{\prod_{i=1}^{n}\prod_{k=1}^{K} f_k(y^{(i)}_{s_k}; \theta^*)} \ge c > 0 \quad \text{for all } n \text{ and all } y^{(1)}, ..., y^{(n)}. \qquad (2.6)$$

If we can show that

$$\Pr\left\{\lim_{n\to\infty} \theta_n = \theta^*\right\} = 1, \qquad (2.7)$$

the proof of Theorem 1 is complete, since the maximum composite likelihood estimator θ̂CL satisfies (2.6). To prove (2.7) it is sufficient to show that for any ε > 0 the probability is one that all limit points θ of the sequence {θn} satisfy ‖θ − θ*‖ ≤ ε.

If there exists a limit point θ0 such that ‖θ0 − θ*‖ > ε, we have

$$\sup_{\|\theta - \theta^*\| \ge \varepsilon} \prod_{i=1}^{n}\prod_{k=1}^{K} f_k(y^{(i)}_{s_k}; \theta) \ge \prod_{i=1}^{n}\prod_{k=1}^{K} f_k(y^{(i)}_{s_k}; \theta_n) \quad \text{for infinitely many } n.$$

Then

$$\frac{\sup_{\|\theta - \theta^*\| \ge \varepsilon} \prod_{i=1}^{n}\prod_{k=1}^{K} f_k(y^{(i)}_{s_k}; \theta)}{\prod_{i=1}^{n}\prod_{k=1}^{K} f_k(y^{(i)}_{s_k}; \theta^*)} \ge c > 0 \quad \text{for infinitely many } n.$$

According to our previous result (2.3), this is an event with probability zero. Thus we have shown that the probability is one that all limit points θ of the sequence {θn} satisfy ‖θ − θ*‖ ≤ ε, and equation (2.7) is obtained.

Since the ordinary likelihood function is a special case of composite likelihood, the consistency of the maximum likelihood estimator under a misspecified model (White, 1982, Theorem 2.2) follows immediately from Theorem 1.

Corollary 1. If the composite likelihood function (2.1) is correctly specified, then under assumptions (A0)-(A6) the maximum composite likelihood estimator θ̂CL converges to the true parameter point θ0 almost surely.


2.2 Aspects of robustness for the MCLE

Model specifications under different mechanisms, and their impact on the convergence of the resulting maximum likelihood estimators, are illustrated schematically in Table 2.1. The first row illustrates the result that has been most studied: when the model and sub-models are correctly specified, the resulting MCLE and MLE are both consistent for the true parameter value under some regularity conditions, and the MCLE will be less efficient than the MLE, although a number of examples indicate that the loss of efficiency can be quite small.

Table 2.1: Model Specification

Model                  Full Likelihood                  Composite Likelihood
Correctly specified    f(y; θ0) = g(y)                  fk(ysk; θ0) = gk(ysk) for all k
                       θ̂ML → θ0                         θ̂CL → θ0
Misspecified           f(y; θ) ≠ g(y)                   fk(ysk; θ) ≠ gk(ysk) for some k
                       θ̂ML → θ*ML                       θ̂CL → θ*

The interesting case for robustness of consistency is when the components of the composite likelihood, such as lower dimensional marginal densities, are correctly specified, but the full likelihood is misspecified. In this case the MLE will not usually be consistent for the true parameter value. On the other hand, the MCLE, which is calculated from the composite likelihood function making use of the correctly specified lower dimensional margins only, still converges to the true parameter value without depending on the full model. However, the asymptotic variance of the MCLE may vary dramatically, depending on the true underlying full model.

Finally, if both the composite and the full likelihood functions are incorrectly specified, the MCLE and MLE converge not to the true parameter value, but to θ* and θ*ML, respectively.

2.3 Examples

In this section, we illustrate some of the points above with simple examples constructed to highlight aspects of robustness.

2.3.1 Estimation of association parameters

This example is due to Andrei and Kendziorski (2009). Suppose Y1 ∼ N(µ1, σ1²), Y2 ∼ N(µ2, σ2²) and ε ∼ N(0, 1) are independent random variables. Let Y3 = Y1 + Y2 + bY1Y2 + ε, b ≠ 0. We can show that all full conditional distributions, i.e. f(y1|y2, y3), f(y2|y1, y3) and f(y3|y1, y2), are normal, but the joint distribution is not multivariate normal due to the non-zero interaction term bY1Y2.

Given a random sample (y1^(i), y2^(i), y3^(i)), i = 1, ..., n, if we misspecify the joint model as multivariate normal, b will be forced to be 0 directly; if we use the full conditional distribution f(y3|y1, y2), the MCLE of b is

$$\hat{b}_{CL} = \frac{\sum_{i=1}^{n} y^{(i)}_1 y^{(i)}_2 \big(y^{(i)}_3 - y^{(i)}_1 - y^{(i)}_2\big)}{\sum_{i=1}^{n} \big(y^{(i)}_1 y^{(i)}_2\big)^2},$$

which is consistent for b. We can also use f(y1|y2, y3) or f(y2|y1, y3), but the resulting MCLE cannot be expressed in closed form and numerical methods are needed.
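A minimal simulation sketch of this closed-form MCLE (the parameter values and seed are illustrative, not from the thesis):

```python
import numpy as np

rng = np.random.default_rng(2)
n, b = 5000, 0.7                          # illustrative values
y1 = rng.normal(1.0, 1.0, n)              # Y1 ~ N(mu1, sigma1^2)
y2 = rng.normal(-0.5, 2.0, n)             # Y2 ~ N(mu2, sigma2^2)
y3 = y1 + y2 + b * y1 * y2 + rng.normal(size=n)

# closed-form MCLE of b based on the full conditional f(y3 | y1, y2)
b_cl = np.sum(y1 * y2 * (y3 - y1 - y2)) / np.sum((y1 * y2) ** 2)
print(b_cl)  # close to the true b = 0.7
```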

2.3.2 Estimation of the correlation

The random vector (Y1, Y2, Y3, Y4)^T follows a multivariate normal distribution with mean vector (0, 0, 0, 0)^T and covariance matrix

$$\Sigma = \begin{pmatrix} 1 & \rho_0 & 2\rho_0 & 2\rho_0 \\ \rho_0 & 1 & 2\rho_0 & 2\rho_0 \\ 2\rho_0 & 2\rho_0 & 1 & \rho_0 \\ 2\rho_0 & 2\rho_0 & \rho_0 & 1 \end{pmatrix}.$$

If we model the joint distribution of (Y1, Y2, Y3, Y4)^T as multivariate normal with zero mean vector and all correlations equal, the covariance matrix is then misspecified and the resulting MLE will not be consistent for ρ0. On the other hand, if we use only the correct information about the pairs (Y1, Y2) and (Y3, Y4), and construct the pairwise likelihood

$$CL(\rho; y) = f_{12}(y_1, y_2; \rho)\, f_{34}(y_3, y_4; \rho), \qquad (2.8)$$

where both f12 and f34 are density functions for a bivariate normal with mean vector (0, 0)^T and covariance matrix

$$\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix},$$

then by Corollary 1 the resulting MCLE is consistent for ρ0.
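A sketch of maximizing (2.8) numerically, assuming scipy's bounded scalar optimizer; the true value ρ0 = 0.2 is an illustrative choice inside the constraint −1/5 ≤ ρ0 ≤ 1/3 discussed next:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)
rho0, n = 0.2, 2000                 # true rho0, inside [-1/5, 1/3]
S = np.full((4, 4), 2 * rho0)       # off-diagonal blocks: 2 rho0
S[0, 1] = S[1, 0] = S[2, 3] = S[3, 2] = rho0
np.fill_diagonal(S, 1.0)
y = rng.multivariate_normal(np.zeros(4), S, size=n)

def neg_pairwise(rho):
    # minus log CL(rho; y) = -log f12 - log f34, equation (2.8)
    c = [[1.0, rho], [rho, 1.0]]
    return -(multivariate_normal.logpdf(y[:, :2], cov=c).sum()
             + multivariate_normal.logpdf(y[:, 2:], cov=c).sum())

rho_cl = minimize_scalar(neg_pairwise, bounds=(-0.99, 0.99), method="bounded").x
print(rho_cl)  # consistent for rho0 = 0.2
```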

It is of interest to note that the parameter constraint needed to ensure that the covariance matrix is nonnegative definite in the correct full likelihood is −1/5 ≤ ρ ≤ 1/3, whereas in the composite likelihood (2.8) the parameter constraint is −1 ≤ ρ ≤ 1. The composite likelihood (2.8) can also be thought of as the likelihood function for a multivariate normal distribution with a block diagonal covariance matrix, which is obviously different from the true full model.

From this example we can see that even if different parameter constraints are imposed, or the composite likelihood is compatible with a different full model, the MCLE will be consistent as long as all of the component likelihoods are correctly specified.

2.3.3 No compatible joint density exists

Suppose the true model for the random vector (Y1, Y2, Y3) is multivariate normal with mean vector µ0(1, 1, 1)^T, µ0 > 0, and covariance matrix equal to the identity matrix. Now consider the following pairwise likelihood function

$$CL(\mu; y) = f_{12}(y_1, y_2; \mu)\, f_{13}(y_1, y_3; \mu)\, f_{23}(y_2, y_3; \mu), \qquad (2.9)$$

where both f12 and f23 are density functions for a bivariate normal distribution with unknown mean vector µ(1, 1)^T and covariance matrix equal to the 2 × 2 identity matrix. However, f13(y1, y3; µ) is misspecified as

$$f_{13}(y_1, y_3; \mu) = \frac{1}{\mu} \exp\left(-\frac{y_1}{\mu}\right) \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(y_3 - \mu)^2}{2}\right).$$

It is easy to see that no compatible joint density exists for the composite likelihood function (2.9), since f12 and f13 imply different marginal densities for Y1.

Given a random sample (y1^(i), y2^(i), y3^(i)), i = 1, ..., n, the MCLE of µ based on the composite likelihood function (2.9), µ̂CL, can be obtained by solving the score equation

$$5n\mu^3 - S_n\mu^2 + n\mu - S_{1n} = 0, \qquad (2.10)$$

where Sn = Σ_{i=1}^{n} (y1^(i) + 2y2^(i) + 2y3^(i)) and S1n = Σ_{i=1}^{n} y1^(i).
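Equation (2.10) is a cubic in µ, so the MCLE can be computed directly with a polynomial root finder; a minimal sketch under illustrative values of µ0 and n:

```python
import numpy as np

rng = np.random.default_rng(4)
mu0, n = 2.0, 5000
y = rng.normal(mu0, 1.0, size=(n, 3))          # true model: N(mu0 * 1, I_3)

Sn = np.sum(y[:, 0] + 2 * y[:, 1] + 2 * y[:, 2])
S1n = np.sum(y[:, 0])
roots = np.roots([5 * n, -Sn, n, -S1n])        # score equation (2.10)
print(roots[np.abs(roots.imag) < 1e-8].real)   # unique real root, near mu0 = 2.0
```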

As n → ∞, a direct argument using the consistency of sample means for the population mean shows that the unique real root of (2.10) converges to µ0. The asymptotic variance of µ̂CL can be calculated using the Godambe information function G(θ), and the ratio of the asymptotic variance of the MLE of µ based on the true multivariate normal model, µ̂ML, to that of µ̂CL is

$$r = \frac{\{5 + (1/\mu^2)\}^2}{3\,[\,8 + \{1 + (1/\mu^2)\}^2\,]}.$$

It is easy to check that r ≤ 1, with equality only for µ = 1.

From this artificial example, we can see that although no compatible joint density exists, the limit of the MCLE may still be meaningful, even consistent for the true parameter value. In general the MCLE converges to θ*, which minimizes the composite Kullback-Leibler divergence, whether the specified sub-models are compatible or not. If the specified sub-models are very close to the corresponding true sub-models, we can expect θ* to be a good approximation to the true parameter value even if those specified sub-models are incompatible.

2.3.4 A class of distributions with normal margins

This example was suggested by Arnold (2010). Suppose the density function of the random vector Y = (Y1, Y2, ..., Yp) is

$$f(y) = \phi^{(p)}(y; \mu, \Sigma) + g(\mu, \Sigma)\left(\prod_{i=1}^{p} y_i\right) I_A(y), \qquad (2.11)$$

where φ^(p)(y; µ, Σ) is the density function of a p-dimensional multivariate normal with mean vector µ and covariance matrix Σ, g(·) is a function of the parameters chosen to guarantee that f(y) ≥ 0, A = {y : −t ≤ yi ≤ t, i = 1, 2, ..., p}, t is a threshold parameter, and IA(y) = 1 if y ∈ A and 0 otherwise. It is easy to show that all k(< p) dimensional sub-vectors of y follow k-dimensional multivariate normal distributions with corresponding mean vectors and covariance matrices. When t = 0, f(y) becomes φ^(p)(y; µ, Σ). This example also provides a general approach to constructing a model with the same margins as a pre-specified density. In model (2.11), depending on the complexity of the function g(·), the calculation of the MLE may be very difficult. In the simulation study, we let t = 1, µ = 0 and Σ = (1 − ρ)Ip + ρJp, where Ip is the identity matrix, Jp is a p × p matrix with all entries equal to 1, and ρ is the common correlation coefficient, for p ≥ 3. Since A ⊆ {y : y^T y ≤ p}, we can choose the function g as

$$g(\mu, \Sigma) = \inf_{y^T y \le p} \phi^{(p)}(y; \mu, \Sigma) \le \inf_{y \in A} \phi^{(p)}(y; \mu, \Sigma).$$

To calculate g(µ, Σ), we use the fact that

$$\sup_{y^T y \le p} y^T \Sigma^{-1} y = p\lambda_p,$$

where λp is the largest eigenvalue of Σ^{-1}. For Σ = (1 − ρ)Ip + ρJp, we can show that λp = 1/(1 − ρ) if 0 ≤ ρ < 1, and λp = 1/{1 + (p − 1)ρ} if 1/(1 − p) < ρ ≤ 0.
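A quick numerical check of this eigenvalue formula (the values of p and ρ are illustrative):

```python
import numpy as np

def lambda_p(p, rho):
    """Largest eigenvalue of Sigma^{-1} for Sigma = (1 - rho) I_p + rho J_p."""
    Sigma = (1 - rho) * np.eye(p) + rho * np.ones((p, p))
    return np.linalg.eigvalsh(np.linalg.inv(Sigma)).max()

for p, rho in [(3, 0.5), (3, -0.3), (6, 0.25)]:
    closed_form = 1 / (1 - rho) if rho >= 0 else 1 / (1 + (p - 1) * rho)
    print(p, rho, round(lambda_p(p, rho), 6), closed_form)  # the two agree
```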

We begin with p = 3 and consider three different estimators of ρ: the MLE ρ̂; the MCLE ρ̂pl, obtained by maximizing the pairwise likelihood function (1.5) with equal weights; and the unbiased estimator based on the method of moments,

$$\tilde\rho = \frac{2 S_2}{n p (p-1)}, \quad \text{where } S_2 = \sum_{i=1}^{n} \sum_{s > r} y^{(i)}_r y^{(i)}_s.$$

The last two estimators are free of the function g(·) and are computationally more convenient than the MLE.

The rejection sampling method is used to generate n sample points from the joint distribution (2.11), using the fact that

$$f(y) \le \phi^{(p)}(y; \mu, \Sigma)\left\{1 + I_A(y) \prod_{i=1}^{p} y_i\right\} \le 2\,\phi^{(p)}(y; \mu, \Sigma).$$
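A sketch of this rejection sampler for the case t = 1 (where |∏ yi| ≤ 1 on A, so the bound 2φ^(p) applies), combined with the moment estimator ρ̃ defined above; the sample size and seed are illustrative assumptions:

```python
import numpy as np
from scipy.stats import multivariate_normal

def rejection_sample(n, p, rho, seed=5):
    """Draw from (2.11) with t = 1: propose y ~ N(0, Sigma) and accept with
    probability f(y) / {2 phi(y)}, valid since f(y) <= 2 phi(y) when t <= 1."""
    Sigma = (1 - rho) * np.eye(p) + rho * np.ones((p, p))
    mvn = multivariate_normal(np.zeros(p), Sigma)
    lam = 1 / (1 - rho) if rho >= 0 else 1 / (1 + (p - 1) * rho)
    # g(mu, Sigma) = inf over {y'y <= p} of phi(y)
    #              = exp(-p * lam / 2) / sqrt((2 pi)^p |Sigma|)
    g = np.exp(-0.5 * p * lam) / np.sqrt((2 * np.pi) ** p * np.linalg.det(Sigma))
    rng, out = np.random.default_rng(seed), []
    while len(out) < n:
        y = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
        phi = mvn.pdf(y)
        f = phi + g * np.prod(y, axis=1) * np.all(np.abs(y) <= 1.0, axis=1)
        out.extend(y[rng.uniform(size=n) < f / (2 * phi)])
    return np.asarray(out[:n])

n, p = 1000, 3
y = rejection_sample(n, p, rho=0.25)
S2 = sum(y[:, r] @ y[:, s] for r in range(p) for s in range(r + 1, p))
print(2 * S2 / (n * p * (p - 1)))   # moment estimator, near rho = 0.25
```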

We used numerical methods to calculate ρ̂ and ρ̂pl by solving the relevant score equations, and calculated simulation means and variances of the three estimators ρ̂, ρ̂pl and ρ̃. In Table 2.2, Sρ̂, Sρ̂pl and Sρ̃ denote the simulation variances of the three estimators, respectively. The ratios Sρ̂/Sρ̂pl and Sρ̂pl/Sρ̃ are used to compare the efficiencies of the three estimators.

The results for sample size n = 100, simulation size M = 10000, threshold t = 1 and dimension p = 3 are presented in Table 2.2. All three methods produce accurate estimates. With the exception of ρ = −0.49, var(ρ̂pl) is very close to var(ρ̂), and var(ρ̂pl) seems smaller than var(ρ̃) for any value of ρ except ρ = 0. We also performed the simulation for t = 2, 4, 8, and observed the same phenomenon. Figure 2.1 illustrates the relative efficiency of ρ̂pl compared with the MLE ρ̂, with increasing p.

Table 2.2: Performances of ρ̂, ρ̂pl and ρ̃ when n = 100, M = 10000, p = 3 and t = 1.

true value of ρ         -0.49     -0.25     0         0.25      0.5       0.75      0.99
sim. mean of ρ̂pl        -0.4924   -0.2512   -0.0012   0.2515    0.4986    0.7487    0.9899
sim. mean of ρ̂          -0.4900   -0.2481   0.0019    0.2489    0.4983    0.7502    0.9900
sim. mean of ρ̃          -0.4908   -0.2479   -0.0015   0.2521    0.4998    0.7511    0.9874
sim. variance of ρ̂pl    0.0008    0.0013    0.0036    0.0037    0.0024    0.0006    10⁻⁶
sim. variance of ρ̂      10⁻⁶      0.0012    0.0036    0.0036    0.0023    0.0059    10⁻⁶
sim. variance of ρ̃      0.0025    0.0023    0.0035    0.0057    0.0092    0.0155    0.0215
Sρ̂/Sρ̂pl                 0.0025    0.9231    1.0000    0.9730    0.9583    0.9833    1.0000
Sρ̂pl/Sρ̃                 0.3334    0.5614    1.0252    0.6521    0.2599    0.0402    10⁻⁵

[Figure 2.1 here: line plot; x-axis "the true value of ρ" from 0 to 1, y-axis "the relative efficiency" from 0.85 to 1.]

Figure 2.1: The ratio of the simulated variances, Sρ̂/Sρ̂pl, as a function of ρ. The lines shown are for p = 3, 6, 8 (descending).


2.4 Concluding Remarks

This chapter sets out some issues in the study of robustness of composite likelihood inference, specifically emphasizing robustness of consistency. Robustness in inference usually means obtaining the same inferential result under a range of models. In point estimation the range of models is often considered to be small-probability perturbations of the assumed model, to reflect the sampling notion of occasional outliers.

In the framework of composite likelihood inference, the range of models is, loosely speaking, all models consistent with the specified set of sub-models fk(ysk; θ). For example, if pairwise likelihood is used, the range of models is those consistent with the assumed bivariate distributions. In many, or even most, applications of composite likelihood, it is not immediately clear what that range of models looks like, and indeed whether there is even a single model compatible with the assumed sub-models.

The Wald assumptions set out in Section 2.1.1 are sufficient to ensure consistency of the MCLE, although they may be stronger than necessary. The most restrictive of these assumptions is (A7): that there exists a unique point θ* ∈ Ω that minimizes the Kullback-Leibler divergence (2.2). For each component likelihood, the assumption that there is a unique θ*k ∈ Ωk would be more closely analogous to the usual Wald assumption for the MLE.

However, even in cases where both the MLE and the MCLE are inconsistent, the MCLE might still be more reliable than the MLE, since mis-specifying a high dimensional complex joint density may be much more likely than mis-specifying some simpler lower dimensional densities.

The MCLE also has a type of robustness of efficiency: in computing the asymptotic variance, the composite likelihood is always treated as a "misspecified" model, even if all component likelihoods are correctly specified. On the other hand, the inverse of the Fisher information matrix I(θ), which is used as the asymptotic variance of the MLE, is sensitive to model misspecification.

Composite likelihood also has a type of computational robustness, discussed in Varin et al. (2011); there is some evidence from applied work that the composite likelihood surface is smoother, and hence easier to maximize, than the full likelihood surface.

There is also some evidence that composite likelihood inference is robust to missing data, although there is still much work to be done in this area. Recent papers discussing this include Yi et al. (2011), Molenberghs et al. (2011), and He and Yi (2011).

Chapter 3

On the efficiency of maximum composite likelihood estimators

Given two composite likelihood functions CL1(θ; y) and CL2(θ; y), CL2(θ; y) is said to be more efficient than CL1(θ; y) if the Godambe information of CL2(θ; y) is greater than that of CL1(θ; y) in the sense of the matrix inequality. It is known that the full likelihood function is more efficient than any other composite likelihood function (Godambe, 1960; Lindsay, 1988).

In practice, we usually want to select a composite likelihood which achieves a good balance between computational savings and statistical efficiency. We might expect that either increasing the number of component likelihoods, or increasing the dimension of the component likelihoods, would improve the efficiency, although extra computing time may be needed. When θ is a scalar, basic results on the efficiency of the product of two or more composite likelihood functions are stated in Section 3.1; some examples are also presented to show that such strategies may impair the efficiency, even when the component composite likelihoods are independent. In Section 3.2, the equicorrelated multivariate normal model is used to illustrate that, in the presence of nuisance parameters, the maximum composite likelihood estimator of the parameter of interest can be less efficient when the nuisance parameters are known. Some theoretical results on the multiparameter case are presented in Section 3.3.

3.1 Efficiency of composite likelihood with more component likelihoods

3.1.1 General setting

Consider a composite likelihood function of the p-dimensional vector Y, CLc(θ; y), which can be expressed as a product of two "smaller" composite likelihood functions CL1(θ; y) and CL2(θ; y), i.e. CLc(θ; y) = CL1(θ; y) × CL2(θ; y). If both CL1(θ; y) and CL2(θ; y) are true likelihood functions with uncorrelated score functions, then the information in CLc(θ; y) is equal to the sum of the information in CL1(θ; y) and in CL2(θ; y). In fact, this information additivity also holds for the product of information-unbiased composite likelihood functions with uncorrelated composite scores.

Denote the Godambe information in CL1(θ; y) and CL2(θ; y) by G1(θ) = H1²(θ)J1⁻¹(θ) and G2(θ) = H2²(θ)J2⁻¹(θ), respectively. A lower bound on the efficiency of the compound composite likelihood function CLc(θ; y) is given by the following theorem.

Theorem 2. If G1(θ) ≤ G2(θ), then CLc(θ; y) is at least as efficient as CL1(θ; y).

Proof: The Godambe information in CLc(θ; y) can be calculated directly as

$$G_c(\theta) = \frac{(1+\gamma)^2}{1 + \lambda + 2\rho\sqrt{\lambda}}\, G_1(\theta), \qquad (3.1)$$

where γ = H2(θ)/H1(θ), λ = J2(θ)/J1(θ) and ρ = corr{u1(θ; y), u2(θ; y)}. When γ ≥ √λ, i.e. G2(θ) ≥ G1(θ), it is easy to check that (1 + γ)²/(1 + λ + 2ρ√λ) ≥ 1 for any ρ ∈ [−1, 1], with equality if and only if γ = √λ and ρ = 1.
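A numerical check of (3.1): for γ ≥ √λ the ratio Gc/G1 never falls below one, while for γ < √λ it can, as the final line shows (all draws are illustrative):

```python
import numpy as np

def ratio(gamma, lam, rho):
    """The ratio G_c / G_1 from equation (3.1)."""
    return (1 + gamma) ** 2 / (1 + lam + 2 * rho * np.sqrt(lam))

rng = np.random.default_rng(6)
for _ in range(1000):                       # gamma >= sqrt(lam): never below 1
    lam = rng.uniform(0.1, 5.0)
    gamma = np.sqrt(lam) * rng.uniform(1.0, 3.0)
    assert ratio(gamma, lam, rng.uniform(-1, 1)) >= 1.0

print(ratio(0.5, 4.0, 0.9))                 # gamma < sqrt(lam): ~0.26, a net loss
```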

Corollary 2. If G1(θ) ≤ G2(θ), the weighted compound composite likelihood function CLwc(θ; y) = CL1^{w1}(θ; y) × CL2^{w2}(θ; y) is at least as efficient as CL1(θ; y) for any w1 > 0 and w2 > 0.

Corollary 3. If the composite likelihood functions CL1(θ), CL2(θ), ..., CLm(θ) have the same Godambe information G(θ), the information in the compound composite likelihood function CLc(θ) = ∏_{i=1}^{m} CLi(θ) is greater than or equal to G(θ).

Corollary 2 follows immediately from Theorem 2 by noting that CL1^{w1}(θ; y) has the same information as CL1(θ; y), and CL2^{w2}(θ; y) has the same information as CL2(θ; y), for any positive w1 and w2. For information-unbiased composite likelihood functions, a version of Corollary 3 is given in Lindsay (1988, Lemma 4C).

In equation (3.1), we also note that (1 + γ)²/(1 + λ + 2ρ√λ) can be smaller than 1 when γ < √λ, i.e. the Godambe information can decrease as more component likelihoods are added; this is illustrated through the following example.

Example 3.1.1 (Correlated regression). The model for the random vector y = (y1, y2) is

$$y^{(i)}_1 = \alpha t_i + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma_1^2), \qquad (3.2)$$

$$y^{(i)}_2 = \alpha t_i + \varepsilon'_i, \quad \varepsilon'_i \sim N(0, \sigma_2^2), \qquad (3.3)$$

with cov(εi, ε′j) = ρσ1σ2 when i = j, and cov(εi, ε′j) = 0 when i ≠ j (i, j = 1, ..., n). Assume σ1 and σ2 are known, and α is the unknown parameter of interest.

Suppose we ignore the bivariate distribution of (y1, y2), and use either model (3.2) or model (3.3) separately to estimate α, without knowing the value of ρ. The maximum likelihood estimators based on (3.2) and (3.3) are α̂1 = Σ_{i=1}^{n} ti y1^(i) / Σ_{i=1}^{n} ti² and α̂2 = Σ_{i=1}^{n} ti y2^(i) / Σ_{i=1}^{n} ti², respectively. We also consider the weighted independence likelihood ignoring the correlation between y1 and y2:

$$CL_c(\alpha; y_1, y_2) = \prod_{i=1}^{n} f(y^{(i)}_1; \alpha)^{w_1}\, f(y^{(i)}_2; \alpha)^{w_2},$$

where the weights w1 and w2 are positive constants. The corresponding maximum composite likelihood estimator is α̂CL = w*1 α̂1 + w*2 α̂2, with w*1 = w1σ2²/(w1σ2² + w2σ1²) and w*2 = w2σ1²/(w1σ2² + w2σ1²). The variance of α̂CL is

$$\mathrm{var}(\hat\alpha_{CL}) = \frac{1}{(1+\gamma)^2}\frac{\sigma_1^2}{S} + \frac{\gamma^2}{(1+\gamma)^2}\frac{\sigma_2^2}{S} + \frac{2\gamma}{(1+\gamma)^2}\frac{\rho\sigma_1\sigma_2}{S},$$

where γ = w2σ1²/(w1σ2²) and S = Σ_{i=1}^{n} ti².

Whether var(α̂CL) ≤ var(α̂1) = σ1²/S depends on the values of ρ, σ2², σ1² and the ratio w2/w1. When σ2² ≤ σ1², we have var(α̂CL) ≤ var(α̂1) for all w1, w2 and ρ ∈ [−1, 1], which agrees with Corollary 2. On the other hand, when σ2²/σ1² = 4 and w1 = w2, var(α̂CL) < var(α̂1) for ρ < 5/16, and var(α̂CL) > var(α̂1) for ρ > 5/16.
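A numerical check of the ρ = 5/16 threshold in this example, assuming σ1 = 1, σ2 = 2, w1 = w2 and arbitrary design points ti:

```python
import numpy as np

n, s1, s2 = 400, 1.0, 2.0        # sigma2^2 / sigma1^2 = 4
rng = np.random.default_rng(7)
t = rng.uniform(0.5, 2.0, n)     # arbitrary design points t_i
S = np.sum(t ** 2)

def var_alpha_cl(rho, w1=1.0, w2=1.0):
    # var(alpha_CL) = {s1^2 + gamma^2 s2^2 + 2 gamma rho s1 s2} / {(1+gamma)^2 S}
    gamma = w2 * s1 ** 2 / (w1 * s2 ** 2)
    return (s1 ** 2 + gamma ** 2 * s2 ** 2
            + 2 * gamma * rho * s1 * s2) / ((1 + gamma) ** 2 * S)

var_alpha1 = s1 ** 2 / S
for rho in [0.0, 5 / 16, 0.6]:
    # ratio < 1 below 5/16, = 1 at 5/16, > 1 above
    print(rho, var_alpha_cl(rho) / var_alpha1)
```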

3.1.2 Product of information-unbiased composite likelihoods

Under the usual regularity conditions, the likelihood function is information-unbiased. More generally, any composite likelihood function constructed by multiplying likelihood functions with mutually uncorrelated scores is also information-unbiased; an example is the partial likelihood of Cox (1975). Given a p-dimensional vector Y = (Y1, ..., Yp)^T with density function f(y1, ..., yp; θ), and defining f(y1 | y0; θ) = f(y1; θ), it is easy to show that the covariance between the score function of f(yi | y1, ..., y_{i−1}; θ) and the score function of f(yj | y1, ..., y_{j−1}; θ) is zero for any i ≠ j ∈ {1, ..., p}. Without loss of generality assume j > i; from equation (1.11), the covariance between the two score functions is

$$\mathrm{cov}\{\nabla_\theta \log f(y_i \mid y_1, \ldots, y_{i-1}; \theta),\, \nabla_\theta \log f(y_1, \ldots, y_{j-1}, y_j; \theta)\} - \mathrm{cov}\{\nabla_\theta \log f(y_i \mid y_1, \ldots, y_{i-1}; \theta),\, \nabla_\theta \log f(y_1, \ldots, y_{j-1}; \theta)\} = H_i(\theta) - H_i(\theta) = 0,$$

where Hi(θ) is the sensitivity matrix of f(yi | y1, ..., y_{i−1}; θ). Thus, any composite likelihood function of the type CL(y; θ) = ∏_{i∈A} f(yi | y1, ..., y_{i−1}; θ), where A ⊆ {1, ..., p}, is information-unbiased.
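The uncorrelatedness of sequential conditional scores can be checked by simulation; a minimal sketch for a bivariate normal with common mean µ and known correlation ρ, where both scores have closed forms (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)
n, mu, rho = 500_000, 1.0, 0.6
y = rng.multivariate_normal([mu, mu], [[1.0, rho], [rho, 1.0]], size=n)

# scores for theta = mu in the decomposition f(y1; mu) f(y2 | y1; mu):
# f(y1; mu) = N(mu, 1) and f(y2 | y1; mu) = N(mu + rho (y1 - mu), 1 - rho^2)
u1 = y[:, 0] - mu                                        # score of f(y1; mu)
u2 = (y[:, 1] - mu - rho * (y[:, 0] - mu)) / (1 + rho)   # score of f(y2 | y1; mu)
print(np.cov(u1, u2)[0, 1])   # ~0: the conditional scores are uncorrelated
```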

From Section 3.1.1, we have that information additivity holds for uncorrelated information-unbiased composite likelihood functions. On the other hand, the product of correlated information-unbiased composite likelihood functions with the same information is at least as efficient as any of its components (Lindsay, 1988, Lemma 4C). In the latter case, a stronger improvement of efficiency can be achieved with an additional assumption on the covariance between any two composite likelihood functions.

Lemma 4. Assume CL1(θ), CL2(θ), ..., CLm(θ) are information-unbiased, each with the same Godambe information G(θ). If

$$\sum_{j=1,\, j \ne i}^{m} \mathrm{cov}\{\nabla_\theta \log CL_i(\theta),\, \nabla_\theta \log CL_j(\theta)\} = R(\theta)$$

for every i ∈ {1, ..., m}, then the compound composite likelihood function CL^m_c(θ) = ∏_{i=1}^{m} CLi(θ) is at least as efficient as CL^{m−1}_c(θ) = ∏_{i=1}^{m−1} CLi(θ) for any m ≥ 2.

The proof of Lemma 4 is obtained by showing that the Godambe information for CL^m_c(θ) is

$$m^2 G(\theta)\{m G(\theta) + m R(\theta)\}^{-1} G(\theta) = m\, G(\theta)\{G(\theta) + R(\theta)\}^{-1} G(\theta),$$

which is an increasing function of m. This proof also holds for the multiparameter case. The assumptions in Lemma 4 apply in a variety of applications of composite likelihood inference, including pairwise and conditional pairwise likelihood in the equicorrelated multivariate distribution (Cox and Reid, 2004; Mardia et al., 2009), pairwise likelihood in a Mantel-Haenszel procedure (Lindsay, 1988), and a version of composite conditional likelihood in Markov chain models (Hjort and Varin, 2007) and in lattice data ignoring boundary effects (Besag, 1975).

3.1.3 Product of uncorrelated composite likelihoods

Since uncorrelated information-unbiased composite likelihood functions satisfy information additivity, in this subsection we focus only on the product of uncorrelated information-biased composite likelihood functions. Denote by $Y_{s1}$ and $Y_{s2}$ two sub-vectors of the $p$-dimensional random vector $Y$. We consider a composite likelihood function for $Y_{s1}$, $CL(y_{s1}; \theta)$, with Godambe information $G_1(\theta) = H_1^2(\theta)J_1^{-1}(\theta)$, multiplied by the likelihood function of $Y_{s2}$, $f(y_{s2}; \theta)$, with Fisher information $I_2(\theta)$.

Lemma 5. If the score functions for $CL(y_{s1}; \theta)$ and $f(y_{s2}; \theta)$ are uncorrelated, the compound composite likelihood function $CL_c(\theta) = CL(y_{s1}; \theta) \times f(y_{s2}; \theta)$ is more efficient than $CL(y_{s1}; \theta)$ if and only if $I_2(\theta) > G_1(\theta) - 2H_1(\theta)$.


Proof: Denote by $G_c(\theta)$ the Godambe information of the compound composite likelihood function $CL_c(\theta)$. By definition,
$$G_c(\theta) - G_1(\theta) = \frac{\{I_2(\theta) + H_1(\theta)\}^2}{I_2(\theta) + J_1(\theta)} - \frac{H_1^2(\theta)}{J_1(\theta)} = \frac{\{I_2(\theta) + H_1(\theta)\}^2 J_1(\theta) - H_1^2(\theta)\{I_2(\theta) + J_1(\theta)\}}{\{J_1(\theta) + I_2(\theta)\} J_1(\theta)} = \frac{I_2(\theta)\{I_2(\theta)J_1(\theta) + 2H_1(\theta)J_1(\theta) - H_1^2(\theta)\}}{\{J_1(\theta) + I_2(\theta)\} J_1(\theta)}.$$
Since $I_2(\theta)$ and $J_1(\theta)$ are positive, $G_c(\theta) > G_1(\theta)$ if and only if $I_2(\theta)J_1(\theta) > H_1^2(\theta) - 2H_1(\theta)J_1(\theta)$. Noting that $H_1^2(\theta)/J_1(\theta) = G_1(\theta)$, Lemma 5 is therefore proved.

By Lemma 5, in order to improve the efficiency by incorporating an independent likelihood function, $G_1(\theta) - 2H_1(\theta)$ should not be too large. When $G_1(\theta) = H_1(\theta)$, i.e. when $CL(y_{s1}; \theta)$ is information-unbiased, we always have $I_2(\theta) > G_1(\theta) - 2H_1(\theta)$. As an illustration of Lemma 5, we consider the following example.

Example 3.1.2 (Product of independent normal models). The random vector $(Y_1, Y_2, Y_3)$ follows a normal distribution with mean vector $\mu \times (1, 1, 1)^{\mathrm T}$ and covariance matrix
$$\Sigma = \begin{pmatrix} 1 & \rho & 0 \\ \rho & 1 & 0 \\ 0 & 0 & \sigma^2 \end{pmatrix}.$$


Assume $\sigma^2$ is known, $\mu$ and $\rho$ are unknown, and $\mu$ is the only parameter of interest. Suppose we do not know the bivariate distribution of $Y_1$ and $Y_2$, and use the independence likelihood function $CL_{12}(\mu) = f(y_1; \mu) \times f(y_2; \mu)$, which is free of the nuisance parameter $\rho$, to estimate $\mu$. To incorporate the information contained in the independent variable $Y_3$, we also consider the composite likelihood function $CL_{123}(\mu) = CL_{12}(\mu) \times f(y_3; \mu)$.

Given a random sample of size $n$ from the model, the Fisher information in $f(y_3; \mu)$ is $n/\sigma^2$, and the Godambe information in $CL_{12}(\mu)$ is $G_{12}(\mu) = H_{12}^2(\mu)/J_{12}(\mu)$ with $H_{12}(\mu) = 2n$ and $J_{12}(\mu) = 2n(1 + \rho)$. By Lemma 5, $CL_{123}(\mu)$ is more efficient than $CL_{12}(\mu)$ if and only if $n/\sigma^2 > 2n/(1 + \rho) - 4n$. When $\sigma^2 = 2$, this inequality becomes $\rho > -5/9$. So if $\rho \in [-1, -5/9]$, $CL_{123}(\mu)$ is less efficient than $CL_{12}(\mu)$.

We can also compare the variances of the two maximum composite likelihood estimators directly. The maximum composite likelihood estimator is $\hat\mu_{12} = (\bar y_1 + \bar y_2)/2$ for $CL_{12}(\mu)$, and $\hat\mu_{123} = \sigma^2(\bar y_1 + \bar y_2)/(1 + 2\sigma^2) + \bar y_3/(1 + 2\sigma^2)$ for $CL_{123}(\mu)$. When $\sigma^2 = 2$, the variance of $\hat\mu_{12}$ is $(1 + \rho)/(2n)$ and the variance of $\hat\mu_{123}$ is $(10 + 8\rho)/(25n)$. It is easy to show that $(10 + 8\rho)/(25n)$ is smaller than $(1 + \rho)/(2n)$ if and only if $\rho > -5/9$.

Note that if $\rho = -1$ this result is expected, as $(Y_1, Y_2)$ determines $\mu$ exactly with $\mu \equiv (Y_1 + Y_2)/2$; but the dependence on $\sigma^2$ of the range of $\rho$ over which $Y_3$ degrades the inference is surprising; as $\sigma^2$ increases this range approaches $[-1, -1/2)$.
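A small Monte Carlo check of the $\rho = -5/9$ boundary (a sketch under the assumed values $\sigma^2 = 2$ and $n = 50$; the estimator formulas are those derived above):

```python
import numpy as np

rng = np.random.default_rng(1)

def mc_variances(rho, sigma2=2.0, n=50, reps=20000):
    """Monte Carlo variances of mu_12 and mu_123 at true mu = 0."""
    cov = np.array([[1.0, rho, 0.0],
                    [rho, 1.0, 0.0],
                    [0.0, 0.0, sigma2]])
    ybar = rng.multivariate_normal(np.zeros(3), cov, size=(reps, n)).mean(axis=1)
    mu12 = (ybar[:, 0] + ybar[:, 1]) / 2.0
    mu123 = (sigma2 * (ybar[:, 0] + ybar[:, 1]) + ybar[:, 2]) / (1.0 + 2.0 * sigma2)
    return mu12.var(), mu123.var()

for rho in (-0.8, -5.0 / 9.0, -0.3):
    print(rho, mc_variances(rho))  # mu_123 beats mu_12 only for rho > -5/9
```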


3.1.4 Pairwise likelihood and independence likelihood

Intuitively, a composite likelihood with higher dimensional component likelihoods should achieve higher efficiency, although it usually demands more computational cost. In this subsection we focus on comparing the independence likelihood $CL_{ind}(\theta; y) = \prod_{r=1}^{p} f(y_r; \theta)$ and the pairwise likelihood $CL_{pair}(\theta; y) = \prod_{r=1}^{p-1}\prod_{s=r+1}^{p} f(y_r, y_s; \theta)$. Under independence, $CL_{ind}(\theta; y)$ is identical to the full likelihood, and $CL_{pair}(\theta; y) = CL_{ind}(\theta; y)^{p-1}$, which is also fully efficient. For a multivariate normal model with continuous responses, Zhao and Joe (2005) proved that the maximum pairwise likelihood estimator of the regression parameter has a smaller asymptotic variance than the maximum independence likelihood estimator. On the other hand, the pairwise likelihood can be expressed as a product of the independence likelihood and the pairwise conditional likelihood, i.e.
$$CL_{pair}(\theta; y)^2 = CL_{ind}(\theta; y)^{p-1} \prod_{r=1}^{p}\prod_{s \ne r} f(y_r \mid y_s; \theta).$$

If we can show that the pairwise conditional likelihood dominates the independence likeli-

hood in terms of efficiency, then CLpair(θ; y) is more efficient than CLind(θ; y) by Corol-

lary 2. Arnold and Strauss (1991) showed that this may not be true in a bivariate binary

model. However, in their example, the pairwise likelihood is identical to the full likeli-

hood and hence still more efficient than the independence likelihood. In this subsection

we generalize Arnold and Strauss’s example to a four-dimensional binary model, which al-


lows us to compare the (asymptotic) variances of different composite likelihood estimators

analytically, and observe the reverse relationship.

Example 3.1.3 (A partial multivariate binary model). Suppose $(Y_1, Y_2, Y_3, Y_4)$ follows a Multinomial$(1; \theta, \theta, \theta/k, 1 - 2\theta - \theta/k)$ distribution, where $k$ is a positive constant and $0 \le \theta \le k/(2k+1)$. The parameter $\theta$ controls both the mean and covariance structures, and we can change the value of $k$ to adjust the strength of dependence. Given a random sample of size $n$ from this model, we estimate $\theta$ based only on the partial observations $(y_1^{(i)}, y_2^{(i)}, y_3^{(i)})$, $i = 1, \dots, n$.

The full likelihood for the model of $(Y_1, Y_2, Y_3)$ is
$$L(\theta) = \prod_{i=1}^{n} \theta^{y_1^{(i)} + y_2^{(i)}}\left(\frac{\theta}{k}\right)^{y_3^{(i)}}\left(1 - 2\theta - \frac{\theta}{k}\right)^{1 - y_1^{(i)} - y_2^{(i)} - y_3^{(i)}}. \quad (3.4)$$
Solving the score equation we get the maximum likelihood estimator of $\theta$, $\hat\theta = (\bar y_1 + \bar y_2 + \bar y_3)/(2 + 1/k)$. The exact variance of $\hat\theta$ is
$$\mathrm{var}\left(\frac{\bar y_1 + \bar y_2 + \bar y_3}{2 + 1/k}\right) = \frac{1}{n}\left(\frac{\theta}{2 + 1/k} - \theta^2\right).$$


The independence likelihood function for the model of $(Y_1, Y_2, Y_3)$ is
$$CL_{ind}(\theta) = \prod_{i=1}^{n} f(y_1^{(i)}; \theta) f(y_2^{(i)}; \theta) f(y_3^{(i)}; \theta) = \prod_{i=1}^{n} \theta^{y_1^{(i)}}(1-\theta)^{1-y_1^{(i)}}\, \theta^{y_2^{(i)}}(1-\theta)^{1-y_2^{(i)}} \left(\frac{\theta}{k}\right)^{y_3^{(i)}}\left(1 - \frac{\theta}{k}\right)^{1-y_3^{(i)}}, \quad (3.5)$$
and we can calculate its sensitivity matrix and variability matrix as
$$H_{ind}(\theta) = \frac{2}{\theta} + \frac{2}{1-\theta} + \frac{1}{k\theta} + \frac{1}{k(k-\theta)},$$
$$J_{ind}(\theta) = \frac{2}{\theta(1-\theta)} + \frac{1}{\theta(k-\theta)} - \frac{2}{(1-\theta)^2} - \frac{4}{(1-\theta)(k-\theta)}.$$

The pairwise likelihood function is
$$CL_{pair}(\theta) = \prod_{i=1}^{n} f(y_1^{(i)}, y_2^{(i)}; \theta) f(y_1^{(i)}, y_3^{(i)}; \theta) f(y_2^{(i)}, y_3^{(i)}; \theta) = \prod_{i=1}^{n} \theta^{2y_1^{(i)} + 2y_2^{(i)}}(1 - 2\theta)^{1 - y_1^{(i)} - y_2^{(i)}}\left(\frac{\theta}{k}\right)^{2y_3^{(i)}}\left(1 - \theta - \frac{\theta}{k}\right)^{2 - y_1^{(i)} - y_2^{(i)} - 2y_3^{(i)}}, \quad (3.6)$$
and we can calculate its sensitivity matrix and variability matrix as
$$H_{pair}(\theta) = \frac{4}{\theta} + \frac{2}{k\theta} + \frac{4}{1 - 2\theta} + \frac{2(1 + 1/k)^2}{1 - (1 + 1/k)\theta},$$
$$J_{pair}(\theta) = 2A^2\theta(1-\theta) + B^2\,\frac{\theta}{k}\left(1 - \frac{\theta}{k}\right) - 2A^2\theta^2 - 4AB\,\frac{\theta^2}{k},$$


where $A = 2/\theta + 2/(1 - 2\theta) + (1 + 1/k)/(1 - \theta - \theta/k)$ and $B = 2/\theta + 2(1 + 1/k)/(1 - \theta - \theta/k)$.

For k = 5, the asymptotic variances of the maximum composite likelihood estimators

for (3.4), (3.5) and (3.6) multiplied by n are plotted as a function of θ in Figure 3.1. We

can see that when θ < 0·3, the three estimators perform almost equally well; when θ >

0·3, the full likelihood becomes more efficient than the independence likelihood, and the

independence likelihood estimator is more efficient than the pairwise likelihood estimator.

We also carried out the comparisons for different values of k and found that: at k = 1, both

the independence likelihood and the pairwise likelihood are fully efficient; when k > 1, the

independence likelihood is more efficient than the pairwise likelihood and the difference

goes to zero when k → ∞; when k < 1, the pairwise likelihood is more efficient than the

independence likelihood and the difference goes to zero when k → 0.

This example suggests that in practical applications of composite likelihood inference,

where the models will usually have more complex dependence structure and incomplete

data (Yi et al., 2011), some care is required for the use of higher dimensional composite

likelihood to obtain more efficient estimators.

From the discussion above we know that a composite likelihood function with more com-

ponent likelihoods, or with higher dimensional component likelihoods, usually requires

more computing time but does not guarantee a more efficient estimator. The most direct


[Figure 3.1 appears here: asymptotic variance (vertical axis, 0.00–0.05) plotted against the value of $\theta$ (horizontal axis, 0.0–0.4).]

Figure 3.1: The asymptotic variances (multiplied by $n$) of the maximum composite likelihood estimators for the full likelihood (solid line), the independence likelihood (dashed line) and the pairwise likelihood (dotted line).

way to avoid the loss of efficiency is to adopt a weighting scheme: a careful choice of $w_1$ and $w_2$ will make $CL_{wc}(\theta; y)$ more efficient than both $CL_1(\theta; y)$ and $CL_2(\theta; y)$, and much research has been done in this direction within the framework of unbiased estimating equations (Lindsay, 1988; Zhao and Joe, 2005; Joe and Lee, 2009).

We can also consider multiplying component likelihoods of different dimensions, for ex-

ample the Hoeffding scores as suggested in Lindsay et al. (2011) and the second-order

log-likelihoods in Cox and Reid (2004); or using some hybrid composite likelihood meth-

ods which employ two or more different composite likelihood functions simultaneously

to make inference, for example the hybrid pairwise likelihood method proposed in Kuk

(2007).


3.2 Efficiency of composite likelihood with known nuisance parameters

In the presence of nuisance parameters, it is well known that the asymptotic variance of

the maximum likelihood estimator will become smaller when the nuisance parameters are

replaced by their true values. Meanwhile, the reverse relationship has been noted for semi-

parametric inference using estimating functions (Henmi and Eguchi, 2004). We are inter-

ested to know whether such a phenomenon could occur for the estimators based on a com-

posite score function, which is a generalized score function as well as a special unbiased

estimating function. It is easy to check that this paradox will not occur for information-

unbiased composite likelihood functions. Suppose the $q$-dimensional parameter vector $\theta$ is partitioned as $\theta = (\psi^{\mathrm T}, \lambda^{\mathrm T})$, where $\psi$ is a $q_1$-dimensional parameter vector of interest and $\lambda$ is a $q_2$-dimensional nuisance parameter vector, $q = q_1 + q_2$. The Godambe information matrix of an information-unbiased composite likelihood is $G(\theta) = H(\theta) = J(\theta)$, and
$$G(\theta) = \begin{pmatrix} G_{\psi\psi} & G_{\psi\lambda} \\ G_{\lambda\psi} & G_{\lambda\lambda} \end{pmatrix},$$
where $G_{\psi\psi}$ is the $q_1 \times q_1$ submatrix of $G(\theta)$ pertaining to $\psi$, and $G_{\lambda\lambda}$ the $q_2 \times q_2$ submatrix of $G(\theta)$ pertaining to $\lambda$. When $\lambda$ is unknown, the asymptotic variance of the MCLE of $\psi$ is given by $(G_{\psi\psi} - G_{\psi\lambda}G_{\lambda\lambda}^{-1}G_{\lambda\psi})^{-1}$; when $\lambda$ is known, the asymptotic variance of the


MCLE of $\psi$ can be shown to be $G_{\psi\psi}^{-1}$. Since $G_{\psi\lambda}G_{\lambda\lambda}^{-1}G_{\lambda\psi}$ is a nonnegative definite matrix, we have $(G_{\psi\psi} - G_{\psi\lambda}G_{\lambda\lambda}^{-1}G_{\lambda\psi})^{-1} \ge G_{\psi\psi}^{-1}$. In this section we focus on inference based on information-biased composite likelihood functions.

The example given by Henmi and Eguchi (2004) can be thought of as a hybrid composite likelihood approach to a missing data problem, where the nuisance parameters are estimated based on the marginal distribution of the indicator of missing status, and the parameters of interest are obtained by maximizing the weighted likelihood function of the response variable with weights depending on the nuisance parameters. In this section we will investigate the situation where only one composite likelihood $CL(\theta; y)$ is used and the estimators of the parameters are obtained by solving its composite score equation $u_c(\theta; y) = \nabla_\theta c\ell(\theta; y) = 0$.

3.2.1 Equicorrelated multivariate normal model

The equicorrelated multivariate normal model has been well studied to compare the effi-

ciency of pairwise likelihood and full likelihood in different settings (Arnold and Strauss,

1991; Cox and Reid, 2004; Mardia et al., 2009). As shown in Pace et al. (2011), the sensitivity matrix $H(\theta)$ of pairwise likelihood is not identical to its variability matrix $J(\theta)$ in this model. In this subsection we compare the pairwise likelihood estimator of the parameter of interest when the nuisance parameter is unknown and when it is known.


Suppose $y^{(1)}, \dots, y^{(n)}$ are $n$ independent observations from the $p$-dimensional multivariate normal distribution with zero mean and covariance matrix $\Sigma = \sigma^2\{(1 - \rho)I_p + \rho J_p\}$, where $I_p$ is the identity matrix and $J_p$ is a $p \times p$ matrix with all entries equal to 1. The common correlation coefficient $\rho$ is the parameter of interest. When $\sigma^2$ is unknown, the maximum pairwise likelihood estimator of $\rho$, denoted by $\tilde\rho_{pl}$, is the same as the maximum likelihood estimator of $\rho$; when $\sigma^2$ is known, the maximum pairwise likelihood estimator, denoted by $\hat\rho_{pl}$, is less efficient than the maximum likelihood estimator of $\rho$ (Cox and Reid, 2004; Mardia et al., 2009).

The asymptotic variance of $\hat\rho_{pl}$ is (Cox and Reid, 2004)
$$\mathrm{avar}(\hat\rho_{pl}) = \frac{2(1-\rho)^2}{np(p-1)}\,\frac{c(p, \rho)}{(1+\rho^2)^2}, \quad (3.7)$$
where $c(p, \rho) = (1-\rho)^2(3\rho^2 + p^2\rho^2 + 1) + p\rho(-3\rho^3 + 8\rho^2 - 3\rho + 2)$.

The asymptotic variance of $\tilde\rho_{pl}$ can be shown to be
$$\mathrm{avar}(\tilde\rho_{pl}) = \frac{2(1-\rho)^2}{np(p-1)}\{1 + (p-1)\rho\}^2. \quad (3.8)$$

Comparing equations (3.7) and (3.8), we find that as $\rho$ approaches its lower bound $-1/(p-1)$, $\mathrm{avar}(\tilde\rho_{pl})$ decreases to zero while $\mathrm{avar}(\hat\rho_{pl})$ does not. The ratio of the asymptotic variances, $\mathrm{avar}(\hat\rho_{pl})/\mathrm{avar}(\tilde\rho_{pl})$, as a function of $\rho$ is plotted in Figure 3.2 for $p = 3$.


[Figure 3.2 appears here: the ratio of asymptotic variances (vertical axis, 0.5–2.0) plotted against the value of $\rho$ (horizontal axis, −0.4–1.0).]

Figure 3.2: The plot of the ratio $r(\rho) = \mathrm{avar}(\hat\rho_{pl})/\mathrm{avar}(\tilde\rho_{pl})$ at $p = 3$. The vertical and horizontal dashed lines denote $\rho = 0$ and $r(\rho) = 1$ respectively.

We can see that when $\rho$ is positive, $\hat\rho_{pl}$ is more efficient than $\tilde\rho_{pl}$; when $\rho < 0$, the opposite phenomenon is observed, and when $\rho$ approaches the lower bound $-0{\cdot}5$, the ratio diverges to infinity. We performed the comparisons for different $p$ and observed the same phenomenon.
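A sketch reproducing the ratio curve of Figure 3.2 directly from (3.7) and (3.8) (our function names):

```python
import numpy as np

def avar_known(rho, p):
    """Eq. (3.7): pairwise estimator of rho with sigma^2 known (times n)."""
    c = ((1 - rho)**2 * (3*rho**2 + p**2 * rho**2 + 1)
         + p*rho*(-3*rho**3 + 8*rho**2 - 3*rho + 2))
    return 2*(1 - rho)**2 / (p*(p - 1)) * c / (1 + rho**2)**2

def avar_unknown(rho, p):
    """Eq. (3.8): pairwise estimator of rho with sigma^2 unknown (times n)."""
    return 2*(1 - rho)**2 / (p*(p - 1)) * (1 + (p - 1)*rho)**2

for rho in np.linspace(-0.45, 0.95, 8):  # p = 3; the lower bound is -0.5
    print(round(rho, 2), avar_known(rho, 3) / avar_unknown(rho, 3))
```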

3.2.2 Discussion

To see that information-biasedness is not a sufficient condition for the paradox to occur,

we consider another information-biased composite likelihood function, the full conditional


likelihood for the same model:
$$CL_{FC}(\theta; y) = \prod_{r=1}^{p} f(y_r \mid y_{(-r)}; \theta),$$

where $y_{(-r)}$ denotes the random vector excluding $y_r$. When $\sigma^2$ is unknown, the maximum full conditional likelihood estimator of $\rho$, $\tilde\rho_{cl}$, is identical to $\tilde\rho_{pl}$ and fully efficient (Mardia et al., 2009); when $\sigma^2$ is known, the maximum full conditional likelihood estimator, $\hat\rho_{cl}$, is less efficient than the maximum likelihood estimator for $p \ge 3$. Using the formula in Mardia et al. (2007), the ratio of the asymptotic variances, $\mathrm{avar}(\hat\rho_{cl})/\mathrm{avar}(\tilde\rho_{cl})$, as a function of $\rho$ is plotted in Figure 3.3 for $p = 3$. We can see that the ratio does not exceed 1 for all $\rho \in [-1/(p-1), 1]$.

Denote by $\hat\sigma^2_{pl}$ the maximum pairwise likelihood estimator of $\sigma^2$. As suggested in Henmi and Eguchi (2004, Proposition 1), a sufficient condition to observe the paradox in this example is that $\hat\rho_{pl}$ and $\hat\sigma^2_{pl}$ are asymptotically independent, while $\tilde\rho_{pl}$ and $\hat\sigma^2_{pl}$ are not. It can be shown that the asymptotic covariance between $\hat\rho_{pl}$ and $\hat\sigma^2_{pl}$ is $2\rho(1-\rho)\{1 + (p-1)\rho\}\sigma^2/(np)$, which goes to 0 as $\rho$ approaches $-1/(p-1)$; $\tilde\rho_{pl}$ and $\hat\sigma^2_{pl}$ are not asymptotically independent when $\rho = -1/(p-1)$. This may explain why the paradox occurs when $\rho$ is close to its lower bound $-1/(p-1)$.

One way to avoid the paradoxical phenomenon is to convert the composite score func-

tion uc(θ; y) to an unbiased estimating function by projecting (Henmi and Eguchi, 2004;


[Figure 3.3 appears here: the ratio of asymptotic variances (vertical axis, 0.4–1.0) plotted against the value of $\rho$ (horizontal axis, −0.5–1.0).]

Figure 3.3: The plot of the ratio $r(\rho) = \mathrm{avar}(\hat\rho_{cl})/\mathrm{avar}(\tilde\rho_{cl})$ at $p = 3$. The horizontal dashed line denotes $r(\rho) = 1$.

Lindsay et al., 2011):
$$u_c^*(\theta; y) = H(\theta)J^{-1}(\theta)u_c(\theta; y) = \underset{\nu = Au_c(\theta; y)}{\arg\min}\, E\{\|u(\theta; y) - \nu(\theta; y)\|^2\}, \quad (3.9)$$

where $u(\theta; y)$ is the score function of the full likelihood, $A$ ranges over all $q \times q$ matrices, and $H(\theta)$ and $J(\theta)$ are the sensitivity matrix and variability matrix. It is easy to check that $u_c^*(\theta; y)$ is information-unbiased. Since $H(\theta)$ and $J(\theta)$ are constant matrices, this projection does not change the point estimator of $\theta$, and $u_c^*(\theta; y)$ has the same Godambe information as $u_c(\theta; y)$. In the equicorrelated multivariate normal model with $\theta = (\rho, \sigma^2)$, Kenne Pagui (2009) showed that the score function of the pairwise likelihood is $u_{pl}(\theta; y) = J(\theta)H^{-1}(\theta)u(\theta; y)$.


From (3.9), the projected estimating function of $u_{pl}(\theta; y)$ is equal to the score function of the full likelihood, $u(\theta; y)$.

In complex models, the required computation for the projected estimating function

u∗c(θ; y) can be intractable; it may be a better idea to design a nuisance-parameter-free

composite likelihood carefully for practical use. As an example, a pairwise difference

likelihood that eliminates nuisance parameters in a Neyman–Scott problem is described in

Hjort and Varin (2007).

3.3 Theoretical results on the multiparameter case

In Section 3.1 we focused on composite likelihood inference for a scalar parameter. In this section we consider the multiparameter version of Theorem 2; the corresponding Corollary 2 and Corollary 3 follow immediately.

We start by assuming $H_1(\theta) = H_2(\theta)$. To show that the compound composite likelihood $CL_c(\theta; y)$ is more efficient than $CL_1(\theta; y)$, in the sense that $G_c(\theta) - G_1(\theta)$ is a nonnegative definite matrix, it is equivalent to show that $G_1^{-1}(\theta) \ge G_c^{-1}(\theta)$, i.e.
$$H_1^{-1}(\theta)J_1(\theta)H_1^{-1}(\theta) \ge \{H_1(\theta) + H_2(\theta)\}^{-1}\mathrm{var}\{u_{c1}(\theta; y) + u_{c2}(\theta; y)\}\{H_1(\theta) + H_2(\theta)\}^{-1}, \quad (3.10)$$


where $u_{ci}(\theta; y)$ is the composite score function for $CL_i(\theta; y)$, $i = 1, 2$.

Since $H_1(\theta) + H_2(\theta)$ is positive definite, to show (3.10) it is equivalent to show that
$$\mathrm{var}\{u_{c1}(\theta; y) + u_{c2}(\theta; y)\} \le \{H_1(\theta) + H_2(\theta)\}H_1^{-1}(\theta)J_1(\theta)H_1^{-1}(\theta)\{H_1(\theta) + H_2(\theta)\}. \quad (3.11)$$
Define $\mathrm{Cov}_{12}(\theta; y) = \mathrm{cov}\{u_{c1}(\theta; y), u_{c2}(\theta; y)\}$ and $B_{12}(\theta) = J_1(\theta)H_1^{-1}(\theta)H_2(\theta)$. The left hand side of (3.11) is equal to $J_1(\theta) + J_2(\theta) + \mathrm{Cov}_{12}(\theta; y) + \mathrm{Cov}_{12}^{\mathrm T}(\theta; y)$, and the right hand side is $J_1(\theta) + B_{12}(\theta) + B_{12}^{\mathrm T}(\theta) + H_2(\theta)H_1^{-1}(\theta)J_1(\theta)H_1^{-1}(\theta)H_2(\theta)$.

Because $G_2^{-1}(\theta) = H_2^{-1}(\theta)J_2(\theta)H_2^{-1}(\theta) \le H_1^{-1}(\theta)J_1(\theta)H_1^{-1}(\theta) = G_1^{-1}(\theta)$, we have
$$J_2(\theta) \le H_2(\theta)H_1^{-1}(\theta)J_1(\theta)H_1^{-1}(\theta)H_2(\theta).$$
Hence, to show (3.11) we only need to show
$$\mathrm{Cov}_{12}(\theta; y) + \mathrm{Cov}_{12}^{\mathrm T}(\theta; y) \le B_{12}(\theta) + B_{12}^{\mathrm T}(\theta). \quad (3.12)$$
The assumption that $H_1(\theta) = H_2(\theta)$ implies that $B_{12}(\theta) + B_{12}^{\mathrm T}(\theta) = J_1(\theta) + J_1(\theta) \ge J_1(\theta) + J_2(\theta) \ge \mathrm{Cov}_{12}(\theta; y) + \mathrm{Cov}_{12}^{\mathrm T}(\theta; y)$.

When $H_1(\theta) \ne H_2(\theta)$, additional assumptions may be needed for $G_c(\theta) - G_1(\theta)$ to be a nonnegative definite matrix. To see this, we consider a simple case where


$H_2(\theta)J_2^{-1}(\theta)H_2(\theta) = H_1(\theta)J_1^{-1}(\theta)H_1(\theta)$ and $\mathrm{cov}\{u_{c1}(\theta; y), u_{c2}(\theta; y)\}$ is a zero matrix. The inequality (3.11) is then simplified to
$$J_1(\theta) + J_2(\theta) \le J_1(\theta) + J_2(\theta) + B_{12}(\theta) + B_{12}^{\mathrm T}(\theta).$$
So $B_{12}(\theta) + B_{12}^{\mathrm T}(\theta)$ is required to be a nonnegative definite matrix.

In general, we can define $u_{c2}^*(\theta; y) = H_1(\theta)H_2^{-1}(\theta)u_{c2}(\theta; y)$, which has sensitivity matrix $H_1(\theta)$ and Godambe information matrix $G_2(\theta) = H_2(\theta)J_2^{-1}(\theta)H_2(\theta)$. Since $\mathrm{var}\{u_{c2}^*(\theta; y)\} = H_1(\theta)H_2^{-1}(\theta)J_2(\theta)H_2^{-1}(\theta)H_1(\theta) \le J_1(\theta)$, we can show that
$$\mathrm{cov}\{u_{c1}(\theta; y), u_{c2}^*(\theta; y)\} + \mathrm{cov}\{u_{c2}^*(\theta; y), u_{c1}(\theta; y)\} \le \mathrm{var}\{u_{c1}(\theta; y)\} + \mathrm{var}\{u_{c2}^*(\theta; y)\} \le J_1(\theta) + J_1(\theta).$$
Define $J_1^*(\theta) = J_1(\theta) - \mathrm{cov}\{u_{c2}^*(\theta; y), u_{c1}(\theta; y)\}$. From the inequality (3.12), a sufficient condition for $G_c(\theta) \ge G_1(\theta)$ is that $J_1^{*\mathrm T}(\theta)H_1^{-1}(\theta)H_2(\theta) + H_2(\theta)H_1^{-1}(\theta)J_1^*(\theta)$ is a nonnegative definite matrix, which is true if $H_1(\theta) = H_2(\theta)$, or if the composite likelihood functions $CL_1(\theta; y)$ and $CL_2(\theta; y)$ are information-unbiased, i.e. $H_1(\theta) = J_1(\theta)$ and $H_2(\theta) = J_2(\theta)$.


3.4 Concluding Remarks

An information-unbiased composite likelihood function can be thought of as a true likelihood function based on partial information about the full model, and its Godambe information plays a role similar to that of the Fisher information in full likelihood inference. However, it may be very inefficient. On the other hand, an information-biased composite likelihood not only uses partial correct information but also implicitly introduces extra incorrect information, leading to the undesirable paradoxes discussed in Sections 3.1 and 3.2.

In many applications (Varin et al., 2011), an optimally weighted estimating equation is constructed from the composite score function to achieve higher efficiency, within the framework of unbiased estimating equations. However, such an optimal estimating equation is usually very expensive to compute, and is unlikely to be the score function of any composite likelihood function. One direction for future research is to develop specific theory on the construction of an optimally weighted composite likelihood function which achieves higher efficiency while retaining the features of a likelihood-type objective function, such as the Kullback–Leibler inequality (1.8).


Chapter 4

Prediction in computer experiments

with composite likelihood

4.1 Computer experiments

Computer experiments have been successfully used in predicting weather and climate, modeling wildfire evolution, assessing the performance of integrated circuits, and in many other scientific and technological fields where physical experiments are impossible or too

expensive and time-consuming to conduct (Santner et al., 2003; Fang et al., 2006). In a

computer experiment, we usually run a computer code to solve a mathematical system

which is used to approximate some real physical process. We can vary the inputs to the


code and observe how the output is affected. Due to the complexity of the mathematical

system and the high dimensionality of the inputs, it may take a long time to obtain even

a single output. To address this problem, statistical models have been used as surrogates

for computer simulators. Unlike physical experiments, computer experiments are

typically deterministic, i.e. running the code twice with identical input will produce the

same output. Hence the principles of randomization, blocking and replication do not work

for computer experiments, and our statistical modeling scheme for a computer experiment

should be able to capture this deterministic feature. It is also desirable for the models

to allow for some smoothness assumption about the response surface. In addition, the

predictions are expected to have zero uncertainty at the observed inputs, small uncertainty

close to the observed inputs and larger uncertainty further away.

4.1.1 Gaussian random function model

Modeling the computer outputs as a sample path of a Gaussian process is a popular sta-

tistical approach due to its flexibility for fitting a large class of response surfaces and its

convenience for analytic work (Sacks et al., 1989a,b; Welch et al., 1992). The uncertainty

about the output at an untried input setting comes from the fact that there can be more than one

random path passing through all of the observed points. Now consider the d-dimensional

input vector x = (x1, x2, . . . , xd) and the scalar output Y . In a Gaussian random function


(GRF) model, the relationship between $Y$ and $x$ is modelled as
$$Y(x) = \sum_{j=1}^{k} \beta_j \phi_j(x) + Z(x) = \phi(x)^{\mathrm T}\beta + Z(x), \quad (4.1)$$
where $\phi(x) = (\phi_1(x), \dots, \phi_k(x))^{\mathrm T}$ is a $k \times 1$ vector of basis functions, $\beta = (\beta_1, \dots, \beta_k)^{\mathrm T}$ is a vector of coefficients, and $Z(\cdot)$ is a mean zero Gaussian process for $x \in \mathcal{X} \subseteq \mathbb{R}^d$. The covariance function of $Z(\cdot)$ at two inputs $x$ and $x^*$ is
$$\mathrm{cov}(Z(x), Z(x^*)) = \sigma^2 R(x, x^*; \theta), \quad (4.2)$$
where $\sigma^2$ is the marginal variance of $Z(\cdot)$, $R(\cdot, \cdot)$ is the correlation function and $\theta$ is the parameter vector governing the correlation structure.

Following convention, we require that Z(·) is stationary, which implies that R(x, x∗; θ)

depends only on the difference, x − x∗. In spatial statistics Z(·) is often assumed to be

isotropic, i.e. the correlation function depends only on ||x − x∗||, the Euclidean distance

between x and x∗. However, in the context of computer experiments, anisotropic corre-

lation functions are commonly used because the input variables are usually measured on

different scales and impact the output in very different ways.

The correlation function R(x, x∗; θ) is usually modelled as a product of the correlations


at each dimension of the input vector $x$, i.e.
$$R(x, x^*; \theta) = \prod_{i=1}^{d} R_i(|x_i - x_i^*|; \theta_i). \quad (4.3)$$
Two widely used families of correlation functions are the power exponential correlation function and the Matérn correlation function (Matérn, 1986). In this thesis we only consider the power exponential correlation function

$$R_i(|x_i - x_i^*|; \theta_i) = \exp\{-\gamma_i |x_i - x_i^*|^{\alpha_i}\}, \quad \alpha_i \in (0, 2],\ \gamma_i > 0, \quad (4.4)$$

where $\theta_i = (\gamma_i, \alpha_i)$. The exponent $\alpha_i$ can be interpreted as a smoothness parameter, which determines the differentiability of the sample paths: the sample paths in the $i$th dimension are infinitely differentiable when $\alpha_i = 2$, and nondifferentiable when $\alpha_i < 2$ (Santner et al., 2003, Chapter 2). The coefficient $\gamma_i$ can be interpreted as a dependence parameter: as $\gamma_i$ increases, the range of dependence decreases. The Gaussian correlation function ($\alpha_i = 2$) and the Ornstein–Uhlenbeck correlation function ($\alpha_i = 1$) are two special cases in this family.
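As an illustration, a minimal sketch of the anisotropic product correlation (4.3)–(4.4) (our naming):

```python
import numpy as np

def power_exp_corr(x, x_star, gamma, alpha):
    """Product power exponential correlation R(x, x*; theta) of (4.3)-(4.4).

    x, x_star, gamma, alpha are length-d arrays; theta_i = (gamma_i, alpha_i).
    """
    x, x_star = np.asarray(x, float), np.asarray(x_star, float)
    return float(np.exp(-np.sum(gamma * np.abs(x - x_star)**alpha)))

# Example: Gaussian correlation in x1 (alpha = 2), Ornstein-Uhlenbeck in x2 (alpha = 1).
print(power_exp_corr([0.1, 0.4], [0.3, 0.2],
                     gamma=np.array([100.0, 5.0]), alpha=np.array([2.0, 1.0])))
```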

4.1.2 Estimation for GRF model

Now assume a collection of outputs $y^{(1)}, \dots, y^{(n)}$ are observed at the inputs $x^{(1)}, \dots, x^{(n)}$ respectively, where each input $x^{(\cdot)}$ is a $d$-dimensional vector. Under the GRF model (4.1), the log-likelihood function for $y = (y^{(1)}, \dots, y^{(n)})^{\mathrm T} = (y(x^{(1)}), \dots, y(x^{(n)}))^{\mathrm T}$, up to an additive constant, is
$$-\frac{1}{2}\left\{n \log \sigma^2 + \log |R(\theta)| + \frac{1}{\sigma^2}(y - \Phi\beta)^{\mathrm T} R(\theta)^{-1}(y - \Phi\beta)\right\}, \quad (4.5)$$
where $\Phi$ is the $n \times k$ matrix of basis functions and $R(\theta)$ is the $n \times n$ matrix of correlations with $(l, m)$th entry $R(\theta)_{lm} = R(x^{(l)}, x^{(m)}; \theta)$, $l, m = 1, \dots, n$.
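A direct sketch of (4.5) (ours; the Cholesky factorization used for the determinant and the solve makes the $O(n^3)$ cost explicit):

```python
import numpy as np

def grf_loglik(y, Phi, beta, R, sigma2):
    """Log-likelihood (4.5), up to an additive constant; O(n^3) in n."""
    n = len(y)
    L = np.linalg.cholesky(R)               # R = L L^T
    z = np.linalg.solve(L, y - Phi @ beta)  # so z'z = resid' R^{-1} resid
    log_det_R = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (n * np.log(sigma2) + log_det_R + (z @ z) / sigma2)
```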

In the analysis of computer experiments, the most popular way to estimate the parame-

ters of the Gaussian process model is a likelihood-based method such as MLE or REML

(Santner et al., 2003), rather than the variogram method which is usually used in geostatis-

tics to estimate the correlation parameters (e.g. Cressie, 1993). The difference is partly

due to the incommensurability and higher dimensionality of the input vector in computer

experiments. As n gets larger, maximum likelihood estimation becomes computationally

infeasible because we need to evaluate the log-likelihood function (4.5) many times and at

each time we need to calculate $|R(\theta)|$ and $R^{-1}(\theta)$, both of which require $O(n^3)$ operations.

From a fully Bayesian point of view, the Gaussian random function (4.1) is a prior on

the response function space, and the parameters (β, θ, σ2) are treated as hyperparameters.

Point estimation of the parameters is then not required, but we still need to evaluate the

n-dimensional normal likelihood function to obtain the predictive distribution.


To address this “big $n$” problem, two classes of approaches have been proposed:

1. Simplifying the correlation function. $R(\theta)$ is modelled with much more easily manipulated structures such as a sparse matrix. This class includes low rank matrix approximations (Stein, 2008; Cressie and Johannesson, 2008), covariance tapering techniques (Kaufman et al., 2008; Furrer et al., 2006; Sang and Huang, 2011) and compactly supported correlation matrices (Kaufman et al., 2011).

2. Approximating the likelihood function. This class includes Besag's conditional composite likelihood and its variants (Besag, 1975; Vecchia, 1988; Stein et al., 2004), pairwise difference likelihood (Curriero and Lele, 1999), pairwise likelihood for image models (Nott and Ryden, 1999) and binary spatial data (Heagerty and Lele, 1998), blockwise composite likelihood (Caragea and Smith, 2007), and ensemble methods such as the Bayesian committee machine developed within the Bayesian framework (Tresp, 2000).

4.1.3 Prediction for GRF model

We would like a fast statistical surrogate for the computer code to predict the outputs at

untried input settings, with associated measures of uncertainties. For a new input x(0), we

want to predict its output y(0) = y(x(0)), given the observations y = (y(1), . . . , y(n)) =


$(y(x^{(1)}), \dots, y(x^{(n)}))$. The most popular method for prediction is the best linear unbiased predictor (BLUP), which minimizes the mean squared prediction error $E(C_0^{\mathrm T} y - y^{(0)})^2$ among all the unbiased linear predictors $\tilde y^{(0)} = C_0^{\mathrm T} y$, where $C_0$ is an $n \times 1$ vector of constants satisfying $E(\tilde y^{(0)}) = E(y^{(0)})$. This approach is also known as “Kriging” in the field of spatial statistics (Matheron, 1963; Stein, 1999). For the Gaussian random function model (4.1), the BLUP of $y^{(0)}$ is given by the conditional mean
$$E(y^{(0)} \mid y) = \phi_0^{\mathrm T}\beta + r_0^{\mathrm T} R(\theta)^{-1}(y - \Phi\beta), \quad (4.6)$$
where $\phi_0 = \phi(x^{(0)})$, and $r_0 = (R(x^{(0)}, x^{(1)}), \dots, R(x^{(0)}, x^{(n)}))^{\mathrm T}$ is the vector of correlations between $y^{(0)}$ and $y$. Here $r_0$ reflects the information about $y(x^{(0)})$ contained in each $y(x^{(l)})$ ($l = 1, \dots, n$), and $R(\theta)^{-1}$ accounts for the clustering effects among the observations. To see this, assume $\phi_0^{\mathrm T}\beta = 0$ and $n = 3$, and let $y^{(0)} = y(x^{(0)})$ be predicted using the three observed points $y(x^{(1)})$, $y(x^{(2)})$ and $y(x^{(3)})$. If $\|x^{(1)} - x^{(2)}\| \approx 0$ and $y(x^{(1)}) \approx y(x^{(2)})$, the BLUP of $y(x^{(0)})$ will be
$$E(y^{(0)} \mid y(x^{(1)}), y(x^{(2)}), y(x^{(3)})) \approx \frac{1}{1 - r_{13}^2}\{(r_{01} - r_{13}r_{03})y(x^{(1)}) + (r_{03} - r_{13}r_{01})y(x^{(3)})\} = E(y^{(0)} \mid y(x^{(1)}), y(x^{(3)})) \approx E(y^{(0)} \mid y(x^{(2)}), y(x^{(3)})),$$
where $r_{ij} = R(x^{(i)}, x^{(j)})$, $i, j \in \{0, 1, 2, 3\}$.


When the parameters $(\beta, \theta, \sigma^2)$ are all known, the unconditional prediction variance of the BLUP (4.6) is
$$\mathrm{var}\{y^{(0)} - E(y^{(0)} \mid y)\} = \mathrm{var}[E\{y^{(0)} - E(y^{(0)} \mid y) \mid y\}] + E[\mathrm{var}\{y^{(0)} - E(y^{(0)} \mid y) \mid y\}] = 0 + E\{\mathrm{var}(y^{(0)} \mid y)\}.$$
Since $\mathrm{var}(y^{(0)} \mid y) = \sigma^2(1 - r_0^{\mathrm T} R(\theta)^{-1} r_0)$ does not depend on $y$, we have
$$\mathrm{var}\{y^{(0)} - E(y^{(0)} \mid y)\} = \mathrm{var}(y^{(0)} \mid y) = \sigma^2(1 - r_0^{\mathrm T} R(\theta)^{-1} r_0). \quad (4.7)$$
It is easy to show that at any observed input, $x^{(0)} = x^{(l)}$ with $l \in \{1, \dots, n\}$, the BLUP of $y(x^{(0)})$ is $E(y(x^{(0)}) \mid y) = y(x^{(l)})$, which is a desirable feature for a deterministic computer code. It is also worth noting that the variance (4.7) is equal to zero when $x^{(0)} = x^{(l)}$ for $l \in \{1, \dots, n\}$, due to the fact that $r_l^{\mathrm T} R(\theta)^{-1} = e_l^{\mathrm T}$, where $e_l$ is the $l$th unit vector $(0, \dots, 0, 1, 0, \dots, 0)^{\mathrm T}$ (Santner et al., 2003, p. 90–93). To calculate the BLUP (4.6) and its variance (4.7), we still need to compute the inverse of the $n \times n$ covariance matrix.
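For reference, a sketch of the exact BLUP (4.6) and its variance (4.7) (our naming; this is the $O(n^3)$ computation the composite likelihood predictors below are designed to avoid):

```python
import numpy as np

def blup(phi0, r0, y, Phi, beta, R, sigma2):
    """BLUP (4.6) and prediction variance (4.7), all parameters known."""
    mean = phi0 @ beta + r0 @ np.linalg.solve(R, y - Phi @ beta)
    var = sigma2 * (1.0 - r0 @ np.linalg.solve(R, r0))
    return mean, var
```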

In this thesis we do not consider the uncertainty in the estimated parameters, which might be handled in the fully Bayesian framework mentioned in Section 4.1.2 (Santner et al., 2003). In particular, when $\sigma^2$ and $\theta$ are known and we put a non-informative prior on $\beta$,


$[\beta] \propto 1$, the predictive distribution of $y(x^{(0)})$ can be shown to be normal with mean
$$\phi_0^{\mathrm T}\hat\beta + r_0^{\mathrm T} R(\theta)^{-1}(y - \Phi\hat\beta) \quad (4.8)$$
and variance
$$\sigma^2\left\{1 - (\phi_0^{\mathrm T}, r_0^{\mathrm T})\begin{pmatrix} 0 & \Phi^{\mathrm T} \\ \Phi & R(\theta) \end{pmatrix}^{-1}\begin{pmatrix} \phi_0 \\ r_0 \end{pmatrix}\right\}, \quad (4.9)$$
where $\hat\beta = (\Phi^{\mathrm T}R(\theta)^{-1}\Phi)^{-1}\Phi^{\mathrm T}R(\theta)^{-1}y$ is equal to the MLE of $\beta$.

In the next section we consider the pairwise likelihood function (1.5) for estimating the

unknown parameters. In Section 4.3, we propose prediction methods based on different

composite likelihood functions, to approximate the BLUP (4.6). Composite likelihood-

based estimation and prediction do not involve the large n × n matrix, and reduce the

computational complexity from $O(n^3)$ to $O(n^2)$.

4.2 Estimation using composite likelihood

Pairwise likelihood is one of the most widely used versions of composite likelihood, and

has been used to fit Gaussian process models with isotropic or geometrical isotropic corre-

lation functions in spatial data analysis (Heagerty and Lele, 1998; Nott and Ryden, 1999).


The pairwise log-likelihood function for the full model (4.5), up to an additive constant, is
$$-\frac{1}{2}\sum_{l=1}^{n-1}\sum_{m=l+1}^{n}\left\{\log \sigma^4 + \log |R_{l,m}| + \frac{1}{\sigma^2}(y_{l,m} - \Phi_{l,m}\beta)^{\mathrm T} R_{l,m}^{-1}(y_{l,m} - \Phi_{l,m}\beta)\right\}, \quad (4.10)$$
where $y_{l,m}$ is the bivariate vector $(y^{(l)}, y^{(m)})^{\mathrm T}$, $\Phi_{l,m}$ is the $2 \times k$ matrix of basis functions for $y_{l,m}$ and $R_{l,m}$ is the $2 \times 2$ correlation matrix of $y_{l,m}$. The maximum pairwise likelihood estimator (MPLE) of $\beta$ is
$$\hat\beta_{pl} = \left(\sum_{l=1}^{n-1}\sum_{m=l+1}^{n} \Phi_{l,m}^{\mathrm T} R_{l,m}^{-1}\Phi_{l,m}\right)^{-1}\left(\sum_{l=1}^{n-1}\sum_{m=l+1}^{n} \Phi_{l,m}^{\mathrm T} R_{l,m}^{-1} y_{l,m}\right), \quad (4.11)$$
and the MPLE of $\sigma^2$ is
$$\hat\sigma^2_{pl} = \frac{1}{2\binom{n}{2}}\sum_{l=1}^{n-1}\sum_{m=l+1}^{n} (y_{l,m} - \Phi_{l,m}\hat\beta_{pl})^{\mathrm T} R_{l,m}^{-1}(y_{l,m} - \Phi_{l,m}\hat\beta_{pl}). \quad (4.12)$$

Substituting $\hat\beta_{pl}$ and $\hat\sigma^2_{pl}$ into the pairwise log-likelihood function (4.10), we get an objective function which depends only on $\theta$ and can be maximized to obtain the MPLE of $\theta$. The computational complexity of finding the maximum pairwise likelihood estimators is $O(n^2)$. The computational burden can be further reduced if we exclude the pairs formed by observations far apart, which may also improve the efficiency (Davis and Yau, 2011; Varin et al., 2005).
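A sketch of the closed-form step (4.11)–(4.12) for a fixed $\theta$ (our naming; `corr(x, x')` stands for any correlation function, and the explicit pair loop trades speed for clarity):

```python
import numpy as np
from itertools import combinations

def mple_beta_sigma2(y, Phi, X, corr):
    """MPLEs (4.11)-(4.12) given theta; corr returns the scalar correlation."""
    k = Phi.shape[1]
    A, b, cached = np.zeros((k, k)), np.zeros(k), []
    for l, m in combinations(range(len(y)), 2):
        r = corr(X[l], X[m])
        R_inv = np.array([[1.0, -r], [-r, 1.0]]) / (1.0 - r**2)  # 2x2 inverse
        A += Phi[[l, m]].T @ R_inv @ Phi[[l, m]]
        b += Phi[[l, m]].T @ R_inv @ y[[l, m]]
        cached.append((l, m, R_inv))
    beta = np.linalg.solve(A, b)
    quad = sum((y[[l, m]] - Phi[[l, m]] @ beta) @ R_inv
               @ (y[[l, m]] - Phi[[l, m]] @ beta) for l, m, R_inv in cached)
    return beta, quad / (2.0 * len(cached))
```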

Consistency and asymptotic normality of the maximum composite likelihood estimators


can be obtained under some regularity conditions within the framework of “increasing do-

main” asymptotics (Nott and Ryden, 1999; Caragea and Smith, 2007; Bevilacqua et al.,

2012).

4.3 Prediction using composite likelihood

In this section we assume the parameters are known and develop composite likelihood-

based techniques to approximate the BLUP (4.6). In practice we can replace the parameters

by their maximum composite likelihood estimators obtained in Section 4.2.

Under the model (4.1), the joint distribution of $y^{(0)}$ and $y = (y^{(1)}, \dots, y^{(n)})^{\mathrm T}$ is
$$\begin{pmatrix} y \\ y^{(0)} \end{pmatrix} \sim \mathrm{MVN}_{n+1}\left(\begin{pmatrix} \Phi \\ \phi_0^{\mathrm T} \end{pmatrix}\beta,\; \sigma^2\begin{pmatrix} R(\theta) & r_0 \\ r_0^{\mathrm T} & 1 \end{pmatrix}\right). \quad (4.13)$$

Treating $y^{(0)}$ as an unknown parameter, Jones et al. (1998, Appendix 1) showed that the MLE of $y^{(0)}$ is
$$\hat y^{(0)}_{mle} = \phi_0^{\mathrm T}\beta + r_0^{\mathrm T} R(\theta)^{-1}(y - \Phi\beta), \quad (4.14)$$
which is identical to the BLUP (4.6). Following this line, we consider maximizing a composite likelihood function, instead of the full likelihood function of (4.13), to get a maximum composite likelihood estimator of $y^{(0)}$ as an approximation to $\hat y^{(0)}_{mle}$.


4.3.1 Maximum pairwise likelihood predictors

The pairwise likelihood function of (4.13) is
$$CL_{pair}(y, y^{(0)}) = \prod_{l=1}^{n-1}\prod_{m=l+1}^{n} f(y^{(l)}, y^{(m)}; \theta, \beta, \sigma^2) \prod_{l=1}^{n} f(y^{(l)}, y^{(0)}; \theta, \beta, \sigma^2), \quad (4.15)$$
where each $f(y^{(l)}, y^{(m)}; \theta, \beta, \sigma^2)$ is a bivariate normal density function. The maximum pairwise likelihood predictor of $y^{(0)}$ is obtained by maximizing (4.15) with respect to $y^{(0)}$:
$$\hat y^{(0)}_{pl} = \phi_0^{\mathrm T}\beta + \sum_{l=1}^{n} \frac{r_{l,0}(y^{(l)} - \phi_l\beta)}{1 - r_{l,0}^2}\bigg/\sum_{l=1}^{n} \frac{1}{1 - r_{l,0}^2}, \quad (4.16)$$
where $\phi_l$ is the $1 \times k$ matrix of basis functions for $y^{(l)}$, and $r_{l,0}$ denotes the correlation between $y^{(l)}$ and $y^{(0)}$. The number of operations to calculate $\hat y^{(0)}_{pl}$ for each prediction is $O(n)$. This maximum pairwise likelihood predictor was also suggested by Grunenfelder (2010), and applied to the max-stable process for modeling spatial extremes, rather than the Gaussian process. The performance there and in our simulation study, reported in Section 4.4, was unsatisfactory, and we consider a better approach in the next section.
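A sketch of (4.16) (ours; `r0` holds the correlations $r_{l,0}$ between the observations and the new input):

```python
import numpy as np

def pairwise_predictor(phi0, Phi, beta, r0, y):
    """Maximum pairwise likelihood predictor (4.16); O(n) per prediction."""
    w = 1.0 / (1.0 - r0**2)        # weight of each pair (y_l, y_0)
    resid = y - Phi @ beta
    return phi0 @ beta + np.sum(w * r0 * resid) / np.sum(w)
```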


4.3.2 Maximum triplet-wise likelihood predictors

The maximum pairwise likelihood predictor (4.16) does not account for the clustering effect between any two observations, as illustrated in Section 4.1.3, so we consider the maximum triplet-wise likelihood predictor of $y^{(0)}$ to achieve an improvement. Similarly to the pairwise likelihood function (4.15), the triplet-wise likelihood function of the $(n+1)$-dimensional normal model (4.13) is constructed as a product of all the possible trivariate normal densities, and the maximum triplet-wise likelihood predictor is given by
$$\hat y^{(0)}_{tr} = \phi_0^{\mathrm T}\beta + \sum_{l=1}^{n-1}\sum_{m=l+1}^{n} \frac{r_{lm0}^{\mathrm T} R_{l,m}^{-1}(y_{l,m} - \Phi_{l,m}\beta)}{1 - r_{lm0}^{\mathrm T} R_{l,m}^{-1} r_{lm0}}\bigg/\sum_{l=1}^{n-1}\sum_{m=l+1}^{n} \frac{1}{1 - r_{lm0}^{\mathrm T} R_{l,m}^{-1} r_{lm0}}, \quad (4.17)$$
where $y_{l,m}$ is the paired vector $(y^{(l)}, y^{(m)})^{\mathrm T}$, $\Phi_{l,m}$ is the $2 \times k$ matrix of basis functions for $y_{l,m}$, $R_{l,m}$ is the $2 \times 2$ correlation matrix of $y_{l,m}$, and $r_{lm0}$ is the $2 \times 1$ vector of correlations between $y^{(0)}$ and $y_{l,m}$. The number of operations to calculate $\hat y^{(0)}_{tr}$ for each prediction is $O(n^2)$, and hence this approach may become infeasible when the number of requested predictions is very large. Moreover, the maximum triplet-wise likelihood predictor cannot account for higher order clustering effects, such as three or more observations close to each other. To address these problems, we generalize the definition of maximum pairwise likelihood predictors in a different direction in the next section.


4.3.3 Maximum blockwise likelihood predictors

We split the observed data $(y^{(1)}, \dots, y^{(n)})$ into $B$ blocks, $D_1, \dots, D_B$, according to their input settings. The sizes of the blocks are assumed to be $n_1, \dots, n_B$ respectively. In spatial statistics, each block may be chosen as a geographic neighborhood. However, the selection of the blocks in the context of computer experiments seems much more difficult due to the dimensionality of the input and the anisotropic correlation structures. Ideally, the inputs within a block should be more homogeneous than those in different blocks, and the correlation between any two blocks should be weak. The blockwise likelihood function of the joint model (4.13) is
$$CL_{block}(y, y^{(0)}) = \prod_{b=1}^{B-1}\prod_{c=b+1}^{B} f(D_b, D_c \mid \beta, \theta, \sigma^2) \prod_{b=1}^{B} f(D_b, y^{(0)} \mid \beta, \theta, \sigma^2), \quad (4.18)$$
which can be seen as a pairwise likelihood function treating each block as one single observation. When the size of each block is equal to 1, the blockwise likelihood function reduces to the pairwise likelihood function; when the number of blocks $B = 1$, the blockwise likelihood function becomes the full likelihood function.

Since $\prod_{b=1}^{B-1}\prod_{c=b+1}^{B} f(D_b, D_c \mid \beta, \theta, \sigma^2)$ is not a function of $y^{(0)}$, maximizing the blockwise likelihood function (4.18) is equivalent to maximizing $\prod_{b=1}^{B} f(D_b, y^{(0)} \mid \beta, \theta, \sigma^2)$.


The maximum blockwise likelihood predictor is given by
$$\hat y^{(0)}_{bl} = \frac{1}{\sum_{b=1}^{B} \mathrm{var}^{-1}(y^{(0)} \mid D_b)}\sum_{b=1}^{B} \mathrm{var}^{-1}(y^{(0)} \mid D_b)\,E(y^{(0)} \mid D_b) = \phi_0^{\mathrm T}\beta + \sum_{b=1}^{B} \frac{r_{D_b0}^{\mathrm T} R_{D_b}^{-1}(D_b - \Phi_{D_b}\beta)}{1 - r_{D_b0}^{\mathrm T} R_{D_b}^{-1} r_{D_b0}}\bigg/\sum_{b=1}^{B} \frac{1}{1 - r_{D_b0}^{\mathrm T} R_{D_b}^{-1} r_{D_b0}}, \quad (4.19)$$
where $\Phi_{D_b}$ is the $n_b \times k$ matrix of basis functions for $D_b$, $R_{D_b}$ is the $n_b \times n_b$ correlation matrix for $D_b$, and $r_{D_b0}$ is the $n_b \times 1$ vector of correlations between $y^{(0)}$ and $D_b$.

We also consider the weighted blockwise likelihood function
$$CL_{wblock}(y, y^{(0)}) = \prod_{b=1}^{B} f(D_b, y^{(0)} \mid \beta, \theta, \sigma^2)^{w_b}, \quad (4.20)$$
where the weight is $w_b = \mathrm{var}^{-1}(y^{(0)} \mid D_b)$. Maximizing the weighted blockwise likelihood function (4.20), we get the maximum weighted blockwise likelihood predictor
$$\hat y^{(0)}_{wbl} = \frac{1}{\sum_{b=1}^{B} \mathrm{var}^{-2}(y^{(0)} \mid D_b)}\sum_{b=1}^{B} \mathrm{var}^{-2}(y^{(0)} \mid D_b)\,E(y^{(0)} \mid D_b) = \phi_0^{\mathrm T}\beta + \sum_{b=1}^{B} \frac{r_{D_b0}^{\mathrm T} R_{D_b}^{-1}(D_b - \Phi_{D_b}\beta)}{(1 - r_{D_b0}^{\mathrm T} R_{D_b}^{-1} r_{D_b0})^2}\bigg/\sum_{b=1}^{B} \frac{1}{(1 - r_{D_b0}^{\mathrm T} R_{D_b}^{-1} r_{D_b0})^2}, \quad (4.21)$$
which places more weight on blocks with smaller $\mathrm{var}(y^{(0)} \mid D_b)$, compared with the unweighted blockwise likelihood predictor (4.19).
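A sketch covering both (4.19) and (4.21) (our naming: `blocks` is a list of index arrays defining $D_1, \dots, D_B$, `R` is the correlation matrix of the observations, and `r0` their correlations with the new input):

```python
import numpy as np

def blockwise_predictor(phi0, Phi, beta, R, r0, y, blocks, weighted=False):
    """Maximum (weighted) blockwise likelihood predictor, eqs. (4.19)/(4.21)."""
    num, den = 0.0, 0.0
    for idx in blocks:
        Rb = R[np.ix_(idx, idx)]
        rb = r0[idx]
        v = 1.0 - rb @ np.linalg.solve(Rb, rb)  # var(y0 | Db), up to sigma^2
        w = 1.0 / v**2 if weighted else 1.0 / v
        num += w * (rb @ np.linalg.solve(Rb, y[idx] - Phi[idx] @ beta))
        den += w
    return phi0 @ beta + num / den
```

Setting `weighted=True` gives $\hat y^{(0)}_{wbl}$; with blocks of size one, the unweighted version reduces to the pairwise predictor (4.16).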

In the joint multivariate model (4.13), if we maximize the full conditional likelihood


function $f(y^{(0)} \mid y^{(1)}, \dots, y^{(n)}; \beta, \theta, \sigma^2)$ with respect to $y^{(0)}$, the resulting predictor of $y^{(0)}$ is also identical to the BLUP (4.6). Following this line, we may approximate the full conditional likelihood function using a composite likelihood, which is then maximized to obtain a predictor of $y^{(0)}$.

Denote by $\mathcal{D}_i$ the collection of blocks $D_1, \dots, D_i$, $i = 1, \dots, B$. If we approximate the conditional likelihood function $f(D_i \mid \mathcal{D}_{i-1}, y^{(0)}; \beta, \theta, \sigma^2)$ by $f(D_i \mid y^{(0)}; \beta, \theta, \sigma^2)$, it can be shown by Bayes' rule that the full conditional likelihood function $f(y^{(0)} \mid y^{(1)}, \dots, y^{(n)}; \beta, \theta, \sigma^2)$ is proportional to
$$\frac{\prod_{b=1}^{B} f(y^{(0)} \mid D_b; \beta, \theta, \sigma^2)}{f(y^{(0)}; \beta, \theta, \sigma^2)^{B-1}}. \quad (4.22)$$

This approximation is good if the observations in different blocks are nearly independent. Maximizing the composite conditional likelihood function (4.22), we obtain an alternative to the blockwise likelihood-based predictor $\hat y^{(0)}_{bl}$:
$$\hat y^{(0)}_{abl} = \frac{1}{\sum_{b=1}^{B} \mathrm{var}^{-1}(y^{(0)} \mid D_b) - (B-1)/\sigma^2}\left\{\sum_{b=1}^{B} \mathrm{var}^{-1}(y^{(0)} \mid D_b)\,E(y^{(0)} \mid D_b) - \frac{B-1}{\sigma^2}E(y^{(0)})\right\} = \phi_0^{\mathrm T}\beta + \sum_{b=1}^{B} \frac{r_{D_b0}^{\mathrm T} R_{D_b}^{-1}(D_b - \Phi_{D_b}\beta)}{1 - r_{D_b0}^{\mathrm T} R_{D_b}^{-1} r_{D_b0}}\bigg/\left\{\sum_{b=1}^{B} \frac{1}{1 - r_{D_b0}^{\mathrm T} R_{D_b}^{-1} r_{D_b0}} - (B-1)\right\}, \quad (4.23)$$

which is identical to the predictor given by the Bayesian Committee Machine (Tresp, 2000)

when the number of query points is equal to 1, and all the (hyper)parameters are assumed


known. In practice, the parameters are usually unknown, and the Bayesian Committee

Machine will put prior distributions on the parameters, while our approach will replace the

parameters by their maximum composite likelihood estimators.

Now assume all of the blocks have the same size $n_B$; the number of operations to calculate the blockwise likelihood-based predictors defined above for each prediction is then $O(n_B^2 \times n)$. Because the expectation $E(D_b - \Phi_{D_b}\beta) = 0$, it is easy to see that the predictors $\hat y^{(0)}_{bl}$, $\hat y^{(0)}_{wbl}$ and $\hat y^{(0)}_{abl}$ are all unbiased, i.e. their expectations under the joint distribution of $(y^{(1)}, \dots, y^{(n)})$ are equal to $E(y^{(0)}) = \phi_0^{\mathrm T}\beta$. Moreover, if the block boundaries are fixed, the blockwise composite likelihood-based predictors will converge to the BLUP (4.6) as the density of the observations increases to infinity (Eidsvik et al., 2011). In the next section, we compare the performance of the proposed composite likelihood-based predictors in a simple model setting.

4.4 Simulation study

4.4.1 Prediction for GRF model with 1-dimensional input

With the univariate input $x$, the Gaussian random function is modelled with zero mean and power exponential correlation function, i.e. $\phi(x)^{\mathrm T}\beta \equiv 0$ and $R(|x - x^*|; \theta) = \exp\{-\gamma|x - x^*|^{\alpha}\}$. For $\alpha = 1.99$, we uniformly generate $n + 1 = 101$ input values from the unit interval $[0, 1]$;


a sample path is then drawn from the Gaussian process to give $n + 1$ outputs at the selected input values. We repeat the simulation $M = 1000$ times; each time we randomly pick one input value, denoted $x^{(0)}$, whose associated output $y(x^{(0)})$ is to be predicted from the remaining observations $y(x^{(1)}), \dots, y(x^{(n)})$. The composite likelihood-based predictors developed in the previous section, as well as the BLUP based on the full likelihood, are compared in terms of the empirical mean square prediction error:
$$\mathrm{EMSPE} = \frac{1}{M}\sum_{m=1}^{M}\{\hat y_m - y(x^{(0)})\}^2, \quad (4.24)$$

where $\hat y_m$ denotes the predicted value of $y(x^{(0)})$ in the $m$th simulation. To compute the blockwise likelihood-based predictors, we sort $y(x^{(1)}), \dots, y(x^{(n)})$ according to their input values, and then split the sorted data into $B$ blocks in order, each of size $n_B = 5$.
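A condensed sketch of one replicate of this experiment (ours; a tiny jitter is added to the correlation matrix purely for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma, alpha = 100, 100.0, 1.99

x = np.sort(rng.uniform(size=n + 1))                     # inputs in [0, 1]
R_all = np.exp(-gamma * np.abs(x[:, None] - x[None, :])**alpha)
R_all += 1e-10 * np.eye(n + 1)                           # jitter for stability
path = rng.multivariate_normal(np.zeros(n + 1), R_all)   # one path, sigma^2 = 1

i0 = rng.integers(n + 1)                                 # hold out one point
keep = np.delete(np.arange(n + 1), i0)
y, y0 = path[keep], path[i0]
R, r0 = R_all[np.ix_(keep, keep)], R_all[keep, i0]

blup = r0 @ np.linalg.solve(R, y)                        # zero-mean BLUP (4.6)

num = den = 0.0                                          # weighted blockwise (4.21)
for b in range(0, n, 5):                                 # sorted inputs -> ordered blocks
    idx = np.arange(b, min(b + 5, n))
    rb = r0[idx]
    v = 1.0 - rb @ np.linalg.solve(R[np.ix_(idx, idx)], rb)
    num += (rb @ np.linalg.solve(R[np.ix_(idx, idx)], y[idx])) / v**2
    den += 1.0 / v**2

print((blup - y0)**2, (num / den - y0)**2)               # squared prediction errors
```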

The results of the comparisons at different values of $\gamma$ are presented in Table 4.1. As mentioned in Section 4.1.1, when $\gamma$ increases, the strength of dependence decreases. From Table 4.1, the BLUP $\hat y^{(0)}_{mle}$ has the smallest EMSPE in all cases, as expected. The maximum triplet-wise likelihood predictor performs better than the maximum pairwise likelihood predictor, especially when $\gamma$ is not large. The blockwise likelihood-based predictors outperform the pairwise and triplet-wise likelihood-based predictors. When $\gamma = 1000$, $\hat y^{(0)}_{bl}$ and $\hat y^{(0)}_{abl}$ become inaccurate, while the maximum weighted blockwise likelihood predictor $\hat y^{(0)}_{wbl}$ is still competitive with the BLUP $\hat y^{(0)}_{mle}$. As $\gamma$ increases to $\infty$, the Gaussian process will behave like white


noise, and the EMSPEs will converge to the individual variance $\sigma^2$, which is equal to 1 in this simulation study.

Table 4.1: EMSPE of the six predictors for different γ when α = 1.99

              ŷ_mle          ŷ_pl     ŷ_tr           ŷ_bl           ŷ_wbl          ŷ_abl
γ = 1         5.15 × 10⁻⁶    0.0025   8.96 × 10⁻⁵    9.79 × 10⁻⁶    6.18 × 10⁻⁶    9.92 × 10⁻⁶
γ = 10        4.74 × 10⁻⁵    0.0218   0.0022         9.79 × 10⁻⁵    5.98 × 10⁻⁵    8.37 × 10⁻⁵
γ = 100       0.0012         0.2415   0.1079         0.0034         0.0012         0.0020
γ = 1000      0.0441         0.8788   0.7932         0.3964         0.0664         0.2826
γ = 10000     0.6294         0.8400   0.8285         0.8240         0.7914         0.8096

We also performed the comparisons at different values of $\alpha$, with $\gamma$ fixed at 100. The results are shown in Table 4.2. Recall that $\alpha$ controls the smoothness of the sample path. When $\alpha$ decreases, the sample path becomes less smooth, and the accuracy of prediction also decreases. The maximum weighted blockwise likelihood predictor outperforms $\hat y^{(0)}_{bl}$ and $\hat y^{(0)}_{abl}$, especially when $\alpha = 1.8$.

Table 4.2: EMSPE of the six predictors for different α when γ = 100

              ŷ_mle     ŷ_pl     ŷ_tr     ŷ_bl     ŷ_wbl     ŷ_abl
α = 1.99      0.0012    0.2415   0.1079   0.0034   0.0012    0.0020
α = 1.9       0.0123    0.3080   0.1807   0.0806   0.0124    0.0456
α = 1.8       0.0233    0.5217   0.3445   0.2627   0.0252    0.1668
α = 1.5       0.1381    0.9876   0.8556   0.8892   0.4447    0.7677
α = 1         0.6126    0.8285   0.8208   0.8214   0.8106    0.8118

To investigate the influence of the sample size and the density of the observations on the predictive accuracy of the proposed prediction methods, the EMSPEs are compared for


different values of $n$, when $\alpha = 1.8$ and $\gamma = 100$. The block size is still set equal to 5. The results for $M = 1000$ simulations are presented in Table 4.3. As $n$ increases, the density of the observations increases, and all of the predictors become more accurate. The maximum weighted blockwise likelihood predictor has the smallest EMSPE among all the composite likelihood-based predictors, and its EMSPE is comparable with that of the BLUP $\hat y^{(0)}_{mle}$.

Table 4.3: EMSPE of the six predictors at different sample sizes n

              ŷ_mle          ŷ_pl     ŷ_tr     ŷ_bl     ŷ_wbl          ŷ_abl
n = 100       0.0233         0.5217   0.3445   0.2627   0.0252         0.1668
n = 200       0.0064         0.4742   0.2691   0.1879   0.0069         0.1059
n = 400       0.0025         0.2028   0.0946   0.0650   0.0026         0.0315
n = 1000      2.93 × 10⁻⁴    0.0706   0.0230   0.0142   3.12 × 10⁻⁴    0.0056

4.4.2 Prediction for GRF model with 2-dimensional input

In the univariate input case, it is easy to identify the blocks formed by adjacent observations according to their input values. In this section we consider a Gaussian random function with a 2-dimensional input vector $x = (x_1, x_2)$. The mean function is still set equal to zero, and the correlation function is modelled as
$$R(x, x^*; \theta) = \prod_{i=1}^{2} \exp(-\gamma_i|x_i - x_i^*|^{\alpha}).$$

The smoothness parameter $\alpha$ is the same in both dimensions, while the dependence parameters $\gamma_1$ and $\gamma_2$ can differ. At $\alpha = 1.99$, we uniformly generate $n + 1 = 101$


input values from the unit square $[0, 1] \times [0, 1]$. $\gamma_1$ is fixed at 100, while the value of $\gamma_2$ is allowed to vary. The blocks, each of size 5, are identified in the same way as in the univariate input case, according to the value of $x_1$ only, the first dimension of the input vector $x$. At each parameter setting, we repeat the simulation $M = 1000$ times. The composite likelihood-based predictors and the BLUP are compared in terms of their empirical mean square prediction errors (4.24).

The results of the simulations are summarized in Table 4.4. As $\gamma_2$ decreases, the strength of dependence increases, and all of the predictors become more accurate. The blockwise likelihood-based predictors still outperform the pairwise and triplet-wise likelihood-based predictors. The maximum weighted blockwise likelihood predictor $\hat y^{(0)}_{wbl}$ performs better than all the other composite likelihood-based predictors, although its accuracy relative to the BLUP is not as good as in the univariate case.

Table 4.4: EMSPE of the six predictors with 2-dimensional input

              ŷ_mle     ŷ_pl     ŷ_tr     ŷ_bl     ŷ_wbl     ŷ_abl
γ2 = 0        0.0012    0.3614   0.1594   0.0052   0.0012    0.0028
γ2 = 5        0.0758    0.8688   0.7569   0.5839   0.2688    0.4742
γ2 = 10       0.1401    0.9173   0.8413   0.7459   0.4117    0.6514
γ2 = 50       0.3969    0.9537   0.9151   0.8857   0.7488    0.8348

4.5 Discussion and Future Work

The blockwise composite likelihood-based predictors developed in this chapter may be

thought of as a weighted average of the best linear predictors obtained from each block. The

weights are automatically determined by maximizing (weighted) blockwise composite like-

lihood functions. Some preliminary simulation studies in Section 4.4 show that the block-

wise composite likelihood-based predictors outperform the maximum pairwise and maxi-

mum triplet-wise likelihood predictors, while the maximum weighted blockwise composite

likelihood predictor y(0)wbl seems better than y(0)bl and y(0)abl, especially when the strength of

dependence is not too weak. The idea of maximizing a composite likelihood function to

develop prediction methods is not restricted to the multivariate normal distribution: it may

also be applied to other statistical models, such as the max-stable process for modeling

spatial extremes (Grunenfelder, 2010).
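
To illustrate the weighted-average interpretation, suppose for the moment that the blockwise likelihood is taken as a product over blocks of the Gaussian conditional densities of y(0); the maximizer is then the precision-weighted average of the per-block simple-kriging predictors. A minimal sketch under that assumption (zero mean, unit variance; the function name is ours):

    import numpy as np

    def blockwise_predictor(y, R, r0, blocks):
        # Maximizer of prod_k p(y0 | block k): a precision-weighted average
        # of the per-block best linear predictors (simple kriging).
        means, precisions = [], []
        for idx in blocks:
            Rk = R[np.ix_(idx, idx)]                 # within-block correlations
            rk = r0[idx]                             # block-to-target correlations
            w = np.linalg.solve(Rk, rk)
            means.append(w @ y[idx])                 # per-block kriging mean
            precisions.append(1.0 / (1.0 - rk @ w))  # inverse prediction variance
        means, precisions = np.array(means), np.array(precisions)
        return (precisions * means).sum() / precisions.sum()

Under this form, blocks that predict y(0) precisely receive large weights automatically, which is consistent with the behaviour of the blockwise predictors observed in Section 4.4.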

In the simulation studies we assume the parameters in the Gaussian random functions

are known. In practice, the parameters are usually unknown, and can be estimated using

the maximum pairwise likelihood estimator mentioned in Section 4.2. The influence of the

uncertainty in parameter estimation on the prediction will be investigated in future work.
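
As a pointer to that estimation step, the sketch below (ours, for illustration only) writes down a pairwise log-likelihood for the zero-mean, unit-variance powered-exponential model with univariate input; in practice one would typically restrict the sum to nearby pairs and maximize numerically.

    import numpy as np
    from itertools import combinations

    def pairwise_loglik(theta, x, y):
        # Pairwise log-likelihood for a zero-mean, unit-variance GRF with
        # rho_ij = exp(-gamma * |x_i - x_j|^alpha); additive constants dropped.
        alpha, gamma = theta
        ll = 0.0
        for i, j in combinations(range(len(x)), 2):
            rho = np.exp(-gamma * abs(x[i] - x[j])**alpha)
            det = 1.0 - rho**2                  # determinant of the 2x2 correlation
            quad = (y[i]**2 - 2*rho*y[i]*y[j] + y[j]**2) / det
            ll -= 0.5 * (np.log(det) + quad)    # bivariate normal log-density
        return ll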

The proposed blockwise likelihood-based predictors will be applied to some real problems

with high-dimensional input vectors, for example, the data set from a computer experiment

on photometric redshift (Kaufman et al., 2011), where the full likelihood approach does

not work due to the size of the data. Moreover, the variances, as well as the prediction

intervals, for the blockwise likelihood-based predictors will also be derived in future work.

Bibliography

A. Andrei and C. Kendziorski. An efficient method for identifying statistical interactors in

gene association networks. Biostatistics, 10:706–718, 2009.

B. C. Arnold. Example of a non-normal distribution with normal marginals. Personal

communication, 2010.

B. C. Arnold and D. Strauss. Pseudolikelihood estimation: some examples. Sankhya Ser.

B, 53:233–243, 1991.

J. E. Besag. Spatial interaction and the statistical analysis of lattice systems. J. Roy. Statist.

Soc. Ser. B, 36:192–236, 1974.

J. E. Besag. Statistical analysis of non-lattice data. The Statistician, 24:179–195, 1975.

M. Bevilacqua, C. Gaetan, J. Mateu, and E. Porcu. Estimating space and space-time covari-

ance functions for large data sets: a weighted composite likelihood approach. J. Amer.

Statist. Assoc., To appear, 2012.

P. Caragea and R. L. Smith. Asymptotic properties of computationally efficient alternative

estimators for a class of multivariate normal models. J. Multivariate Anal., 98:1417–

1440, 2007.

R. E. Chandler and S. Bate. Inference for clustered data using the independence loglikeli-

hood. Biometrika, 94:167–183, 2007.

D. R. Cox. Partial likelihood. Biometrika, 62:268–276, 1975.

D. R. Cox and N. Reid. A note on pseudolikelihood constructed from marginal densities.

Biometrika, 91:729–737, 2004.

N. Cressie and G. Johannesson. Fixed rank kriging for very large spatial data sets. J. Roy.

Statist. Soc. Ser. B, 70:209–226, 2008.

F. Curriero and S. Lele. A composite likelihood approach to semivariogram estimation. J.

Agric. Biol. Environ. Stat., 4:9–28, 1999.

R. A. Davis and C. Y. Yau. Comments on pairwise likelihood in time series models. Statist.

Sinica, 21:255–277, 2011.

J. Eidsvik, B. A. Shaby, B. J. Reich, M. Wheeler, and J. Niemi. Estimation and prediction in

spatial models with block composite likelihoods using parallel computing. Unpublished

manuscript, 2011.

K. Fang, R. Li, and A. Sudjianto. Design and Modeling for Computer Experiments. Chap-

man & Hall/CRC, 2006.

R. Furrer, M. G. Genton, and D. Nychka. Covariance tapering for interpolation of large

spatial datasets. J. Comput. Graph. Statist., 15:502–523, 2006.

X. Gao and P. X.-K. Song. Composite likelihood Bayesian information criteria for model

selection in high-dimensional data. J. Amer. Statist. Assoc., 105:1531–1540, 2010.

H. Geys, G. Molenberghs, and L. Ryan. Pseudolikelihood modeling of multivariate out-

comes in developmental toxicology. J. Amer. Statist. Assoc., 94:734–745, 1999.

V. P. Godambe. An optimum property of regular maximum likelihood estimation. Ann.

Math. Statist., 31:1208–1212, 1960.

C. Grunenfelder. Aspects of composite likelihood inference. Master’s thesis, Imperial

College London, 2010.

W. He and G. Y. Yi. A pairwise likelihood method for correlated binary data with/without

missing observations under generalized partially linear single-index models. Statist.

Sinica, 21:207–229, 2011.

P. Heagerty and S. Lele. A composite likelihood approach to binary spatial data. J. Amer.

Statist. Assoc., 93:1099–1111, 1998.

M. Henmi and S. Eguchi. A paradox concerning nuisance parameters and projected esti-

mating functions. Biometrika, 91:929–943, 2004.

N. L. Hjort and C. Varin. ML, PL, QL in Markov chain models. Scand. J. Statist., 35:64–82,

2008.

Z. Jin. Aspects of composite likelihood inference. PhD thesis, University of Toronto, 2009.

H. Joe and Y. Lee. On weighting of bivariate margins in pairwise likelihood. J. Multivariate

Anal., 100:670–685, 2009.

D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive

black-box functions. J. Global Optim., 13:455–492, 1998.

C. G. Kaufman, D. Bingham, S. Habib, K. Heitmann, and J. A. Frieman. Efficient emula-

tors of computer experiments using compactly supported correlation functions, with an

application to cosmology. Ann. Appl. Stat., 5:2470–2492, 2011.

C. G. Kaufman, M. J. Schervish, and D. W. Nychka. Covariance tapering for likelihood-

based estimation in large spatial data sets. J. Amer. Statist. Assoc., 103:1545–1555,

2008.

E. C. Kenne Pagui. Pairwise likelihood in multivariate normal models. Master’s thesis,

University of Padova, 2009.

J. T. Kent. Robust properties of likelihood ratio tests. Biometrika, 69:19–27, 1982.

A. Y. C. Kuk. A hybrid pairwise likelihood method. Biometrika, 94:939–952, 2007.

B. G. Lindsay. Conditional score functions: some optimality results. Biometrika, 69:503–

512, 1982.

B. G. Lindsay. Composite likelihood methods. Contemp. Math., 80:221–239, 1988.

B. G. Lindsay, G. Y. Yi, and J. Sun. Issues and strategies in the selection of composite

likelihoods. Statist. Sinica, 21:71–105, 2011.

K. V. Mardia, G. Hughes, and C. C. Taylor. Efficiency of the pseudolikelihood for multi-

variate normal and von Mises distributions. Technical report, Department of Statistics,

University of Leeds, 2007.

K. V. Mardia, J. T. Kent, G. Hughes, and C. C. Taylor. Maximum likelihood estimation

using composite likelihoods for closed exponential families. Biometrika, 96:975–982,

2009.

B. Matérn. Spatial Variation. New York: Springer, 1986.

G. Matheron. Principles of geostatistics. Economic Geology, 58:1246–1266, 1963.

G. Molenberghs and G. Verbeke. Models for Discrete Longitudinal Data. New York:

Springer, 2005.

G. Molenberghs, M. Kenward, G. Verbeke, and T. Berhanu. Pseudo-likelihood estimation

for incomplete data. Statist. Sinica, 21:187–206, 2011.

D. Nott and T. Rydén. Pairwise likelihood methods for inference in image models.

Biometrika, 86:661–676, 1999.

L. Pace, A. Salvan, and N. Sartori. Adjusting composite likelihood ratio statistics. Statist.

Sinica, 21:129–148, 2011.

D. Renard, G. Molenberghs, and H. Geys. A pairwise likelihood approach to estimation in

multilevel probit models. Comput. Statist. Data Anal., 44:649–667, 2004.

J. Sacks, S. B. Schiller, and W. J. Welch. Designs for computer experiments. Technometrics,

31:41–47, 1989a.

J. Sacks, W. J. Welch, T. J. Mitchell, and H. P. Wynn. Design and analysis of computer

experiments (with discussion). Statist. Sci., 4:409–435, 1989b.

H. Sang and J. Z. Huang. A full scale approximation of covariance functions for large

spatial data sets. J. Roy. Statist. Soc. Ser. B, 74:111–132, 2011.

T. J. Santner, B. J. Williams, and W. I. Notz. The Design and Analysis of Computer Exper-

iments. New York: Springer, 2003.

M. Stein. Interpolation of Spatial Data: Some Theory for Kriging. New York: Springer,

1999.

M. Stein. A modeling approach for large spatial datasets. J. Korean Statist. Soc., 37:3–10,

2008.

M. Stein, Z. Chi, and L. Welty. Approximating likelihoods for large spatial data sets. J.

Roy. Statist. Soc. Ser. B, 66:275–296, 2004.

V. Tresp. A Bayesian committee machine. Neural Computation, 12:2719–2741, 2000.

C. Varin. On composite marginal likelihoods. Adv. Statist. Anal., 92:1–28, 2008.

C. Varin and P. Vidoni. A note on composite likelihood inference and model selection.

Biometrika, 92:519–528, 2005.

C. Varin and P. Vidoni. Pairwise likelihood inference for ordinal categorical time series.

Comput. Statist. Data Anal., 51:2365–2373, 2006.

C. Varin, G. Høst, and Ø. Skare. Pairwise likelihood inference in spatial generalized linear

mixed models. Comput. Statist. Data Anal., 49:1173–1191, 2005.

C. Varin, N. Reid, and D. Firth. An overview of composite likelihood methods. Statist.

Sinica, 21:5–42, 2011.

A. V. Vecchia. Estimation and model identification for continuous spatial processes. J.

Roy. Statist. Soc. Ser. B, 50:297–312, 1988.

A. Wald. Note on the consistency of the maximum likelihood estimate. Ann. Math. Statist.,

20:595–601, 1949.

W. J. Welch, R. J. Buck, J. Sacks, H. P. Wynn, T. J. Mitchell, and M. D. Morris. Screening,

predicting, and computer experiments. Technometrics, 34:15–22, 1992.

H. White. Maximum likelihood estimation of misspecified models. Econometrica, 50:1–25,

1982.

X. Xu and N. Reid. On the robustness of maximum composite likelihood estimate. J.

Statist. Plan. Infer., 141:3047–3054, 2011.

G. Y. Yi, L. L. Zeng, and J. R. Cook. A robust pairwise likelihood method for incomplete

longitudinal binary data arising in clusters. Can. J. Statist., 39:34–51, 2011.

Y. Zhao and H. Joe. Composite likelihood estimation in multivariate data analysis. Can. J.

Statist., 33:335–356, 2005.