
Computer Intensive Methods in Statistics

Topics in EM Algorithm, SIMEX, Variable Selection¹

Silvelyn Zwanzig, Behrang Mahjani

¹ This is the fourth chapter of the in-progress textbook Computer Intensive Methods in Statistics.


Chapter 4

Simulation based Methods

In this chapter, we introduce methods that are based on the embedding principle. Assume we have a data set that belongs to a complicated model, for example with missing data, a hidden unobserved variable, or additional constraints. If we had more observations, unconstrained, without missing data or without hidden variables, we could find a big and "easier" model and apply "easier" methods. The trick of embedding-based methods is to find such a big model and to simulate additional observations. Then we combine the original data set with the simulated observations and apply the methods of the big model. It is just a change of perspective, as in Figures 4.1 and 4.2.

4.1 EM - Algorithm

The EM algorithm is the oldest and perhaps the most widely used method based on the embedding principle. It was introduced in 1977 by Dempster et al.

Figure 4.1: The observed environment.


Figure 4.2: Change the perspective!

in the Journal of the Royal Statistical Society, Series B. EM stands for expectation and maximization. As additional literature, Chapters 10 and 12 of the Springer book "Numerical Analysis for Statisticians" (1999) by Kenneth Lange are recommended. In the original formulation of the EM algorithm there is no simulation step; instead a projection method is proposed. The set-up is formulated as follows. Assume a data set includes the following observations:

Y = (Y1, . . . , Ym)

from the observed model with density g(y | θ). We want to calculate the maximum likelihood estimator of θ. The observations are interpreted as a transformation

Y = T (X), (4.1)

where

X = (X1, . . . , Xn)

belongs to a complete model with density f(x | θ). The transformation in (4.1) means that the observed data y are augmented to the complete data set x. The trick is now to use the projection (conditional expectation) of the log likelihood function ln f(x | θ) as estimation criterion. The conditional mean is a projection of the likelihood function from the complete model onto the observed model. In the original paper it was called the surrogate function and is defined by

Q(θ | θ′) = E(ln f(X | θ) | T (X), θ′).


where we have two parameters: θ is a free parameter in the likelihood function, and θ′ is the parameter of the underlying distribution. Here it is assumed that the transformation is not sufficient. The case of a sufficient statistic T is uninteresting, because in that case maximizing the surrogate function is the same as maximizing the likelihood function l(θ) = ln g(y | θ). For insufficient transformations the conditional expectation of X given T(X) depends on the true underlying parameter θtrue. The EM algorithm is an iterative procedure for solving

arg max_θ Q(θ | θtrue).

Algorithm 4.1 EM Algorithm

Given the current state θ(k).

1. E-step (Expectation): Calculate Q(θ | θ(k)).

2. M-step (Maximization):

θ(k+1) = arg max_θ Q(θ | θ(k)).

Why does EM work?

First we prove the following theorem, the information inequality. This inequality is the main argument why the maximum likelihood principle works.

Theorem 4.1 Let f > 0 a.s. and g > 0 a.s. be two densities. Then

Ef ln f ≥ Ef ln g.

Proof:

According to the Jensen inequality, if W is a random variable and h(w) is a convex function, then

Eh(W) ≥ h(E(W)),

provided both expectations exist. Note that the function −ln(w) is strictly convex. Thus

Ef ln f − Ef ln g = −Ef ln(g/f) ≥ −ln Ef(g/f).

Furthermore,

Ef(g/f) = ∫ g dx = 1.

We get Ef ln f − Ef ln g ≥ −ln(1) = 0. □

Now we compare the surrogate function with the log likelihood of the observed model

l(θ) = ln g(y | θ).

Theorem 4.2 It holds

Q (θ′ | θ′)− l(θ′) ≥ Q (θ | θ′)− l(θ). (4.2)

Proof: We have

Q(θ | θ′) − l(θ) = E(ln f(X | θ) | Y = y, θ′) − ln g(y | θ)
                = E( ln( f(X | θ) / g(y | θ) ) | Y = y, θ′).

Note that f(x | θ) / g(y | θ) is the density of the conditional distribution of X | (Y = y, θ). We apply the information inequality and obtain

Q(θ | θ′) − l(θ) ≤ E( ln( f(X | θ′) / g(y | θ′) ) | Y = y, θ′) = Q(θ′ | θ′) − l(θ′).   (4.3)

□


The inequality (4.2) says that maximizing the surrogate function implies an increase of the likelihood function l(θ), because for

θ(k+1) = arg max_θ Q(θ | θ(k))

it holds

l(θ(k+1)) ≥ Q(θ(k+1) | θ(k)) + l(θ(k)) − Q(θ(k) | θ(k)) ≥ l(θ(k)).

In the original paper of Dempster et al. the following example was discussed.

Example 4.1 (EM Algorithm: Genetic Linkage Model) Given data

y = (y1, y2, y3, y4) = (125, 18, 20, 34)

from a four-category multinomial distribution

Y ∼ M(n, p1, p2, p3, p4),

with

p1 = 1/2 + p/4,   p2 = (1 − p)/4,   p3 = (1 − p)/4,   p4 = p/4.

We represent the data y as incomplete data from a five-category multinomial population:

y1 = x1 + x2, y2 = x3, y3 = x4, y4 = x5   (4.4)

with

X ∼ M(n, p1, p2, p3, p4, p5)   (4.5)

where

p1 = 1/2,   p2 = p/4,   p3 = (1 − p)/4,   p4 = (1 − p)/4,   p5 = p/4.   (4.6)

The conditional distribution of (X1, X2) | Y1 = y1 is

(X1, X2) | Y1 = y1 ∼ M(y1, p1/(p1 + p2), p2/(p1 + p2)).

In particular,

X1 | Y1 = y1 ∼ Bin(y1, p1/(p1 + p2)),   X2 | Y1 = y1 ∼ Bin(y1, p2/(p1 + p2)).

We are interested in the maximum likelihood estimation of p. Under this set-up the EM algorithm is as follows.

EM Algorithm: Genetic Linkage Example

Given the current state p(k).

1. (E) Estimate the "missing" observation x2 by its conditional expectation:

   x2(k) = E(X2 | Y1 = y1, p(k)) = y1 (p(k)/4) / (1/2 + p(k)/4).

2. (M) Calculate the maximum likelihood estimate p(k+1) in model (4.5) with the estimated observations (x1(k), x2(k), y2, y3, y4):

   p(k+1) = (x2(k) + y4) / (x2(k) + y2 + y3 + y4).

Let us compare the algorithm above with the general EM algorithm. First we consider the more general set-up of a multinomial model with parameterized cell probabilities

X ∼M(n, p1 (θ) , . . . , pK (θ)).

In this model the log likelihood function up to an additive constant is

∑_{j=1}^K xj ln(pj(θ)) = ∑_{j=1}^{K−1} xj ln(pj(θ)) + ( n − ∑_{j=1}^{K−1} xj ) ln( 1 − ∑_{j=1}^{K−1} pj(θ) )

and the surrogate function up to an additive constant is

Q(θ | θ(k)) = E( ∑_{j=1}^K xj ln(pj(θ)) | Y = y, θ(k) ) = ∑_{j=1}^K ln(pj(θ)) E(xj | Y = y, θ(k)),

which is just the log likelihood function with the estimated data

xj(k) = E(xj | Y = y, θ(k)).


Consider now (4.6). In the complete model, l′(p) = 0 corresponds to

x2 (1/p) − x3 (1/(1 − p)) − x4 (1/(1 − p)) + x5 (1/p) = 0.

Note that x1 is not included, because the probability p1 = 1/2 does not depend on the parameter of interest p. We get

pML = (x2 + x5) / (x2 + x3 + x4 + x5).

In the observed model, the likelihood equation is more complicated to solve. We have

y1 (1/4) / (1/2 + p/4) − (y2 + y3) (1/(1 − p)) + y4 (1/p) = 0.

Thus, using the MLE of the complete model with the estimated observation x2 gives us an easier expression.
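For illustration, the iteration of this example is easy to code in R. A minimal sketch with the data of Example 4.1; the starting value 0.5 and the number of iterations are arbitrary illustrative choices:

y <- c(125, 18, 20, 34)   # observed counts (y1, y2, y3, y4)
p <- 0.5                  # starting value p(0)
for (k in 1:20) {
  # E step: conditional expectation of the "missing" count x2
  x2 <- y[1] * (p / 4) / (1/2 + p / 4)
  # M step: MLE in the complete five-category model
  p <- (x2 + y[4]) / (x2 + y[2] + y[3] + y[4])
}
p                         # converges to the MLE, approximately 0.627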

In this case of multinomial distributions with incomplete data, the calculations in the E step and in the M step are carried out explicitly. Another important example is mixture distributions, which belong to models with a latent, not observed variable. Also in this case some of the calculations in the E step and in the M step can be done explicitly.

Example 4.2 (EM Algorithm: Mixture Distribution) Assume an i.i.d. sample Y = (Y1, . . . , Yn) from Y ∼ F, a mixture distribution with density g(y | θ), θ = (π, ϑ1, ϑ2),

g(y | θ) = (1 − π) f1,ϑ1(y) + π f2,ϑ2(y),   π ∈ [0, 1].   (4.7)

Mixture distributed random variables have a stochastic decomposition:

Y = (1 − ∆) Z1 + ∆ Z2,

Z1 ∼ F1 with density f1 = f1,ϑ1, Z2 ∼ F2 with density f2 = f2,ϑ2, independent,

∆ ∼ Ber(π) independent of Z1 and Z2.

The Bernoulli distributed random variable ∆ ∼ Ber(π) is latent and unobservable. The complete model is the following. Let

X = (X1, . . . , Xn)

be an i.i.d. sample with

Xi = (Yi, ∆i),   ∆ ∼ Ber(π),   Y | ∆ = 0 ∼ F1,   Y | ∆ = 1 ∼ F2.


Then

f(x | θ) = f(y | ∆, θ) f(∆ | θ) = ( f1,ϑ1(y)(1 − π) )^(1−∆) ( f2,ϑ2(y) π )^∆.

Thus the log likelihood of the complete model is

ln f(x | θ) = ∑_{i=1}^n [ (1 − ∆i) ln( f1,ϑ1(yi)(1 − π) ) + ∆i ln( f2,ϑ2(yi) π ) ].

The observed model is T(X) = Y with density (4.7). The surrogate function is

Q(θ | θ(k)) = ∑_{i=1}^n [ (1 − γi,k) ln( f1,ϑ1(yi)(1 − π) ) + γi,k ln( f2,ϑ2(yi) π ) ]

with γi,k = γ(yi, θ(k)), where

γ(y, θ) = E(∆ | T(X) = y, θ) = P(∆ = 1 | T(X) = y, θ)
        = P(y | ∆ = 1, θ) P(∆ = 1) / P(y | θ)
        = f2,ϑ2(y) π / ( f1,ϑ1(y)(1 − π) + f2,ϑ2(y) π ).

The γi,k are called responsibilities and are interpreted as the k'th iteration of the estimates of the latent variable ∆i. For the calculation of

θ(k+1) = arg max_θ Q(θ | θ(k))

we solve the normal equations

d/dϑ1 Q(θ | θ(k)) = ∑_{i=1}^n (1 − γi,k) d/dϑ1 ln( f1,ϑ1(yi) ) = 0,

d/dϑ2 Q(θ | θ(k)) = ∑_{i=1}^n γi,k d/dϑ2 ln( f2,ϑ2(yi) ) = 0,

and

d/dπ Q(θ | θ(k)) = ∑_{i=1}^n [ (1 − γi,k) (−1)/(1 − π) + γi,k (1/π) ] = 0.

The last equation gives π(k+1) = (1/n) ∑_{i=1}^n γi,k.

Let us summarize the calculations and formulate the algorithm.


Algorithm 4.2 EM Algorithm: Mixture Distributions

Given the current state θ(k) = (π(k), ϑ1(k), ϑ2(k)).

1. Calculate the responsibilities for i = 1, . . . , n:

   γi,k = f2,ϑ2(k)(yi) π(k) / ( f1,ϑ1(k)(yi)(1 − π(k)) + f2,ϑ2(k)(yi) π(k) ).

2. Find ϑ1 such that

   ∑_{i=1}^n (1 − γi,k) d/dϑ1 ln( f1,ϑ1(yi) ) = 0

   and find ϑ2 such that

   ∑_{i=1}^n γi,k d/dϑ2 ln( f2,ϑ2(yi) ) = 0.

   Update θ(k+1) = (π(k+1), ϑ1(k+1), ϑ2(k+1)) with

   π(k+1) = (1/n) ∑_{i=1}^n γi,k,   ϑ1(k+1) = ϑ1,   ϑ2(k+1) = ϑ2.

Consider the example of the mixture of two normals, F1 = N(µ1, σ1²), F2 = N(µ2, σ2²). Then

µ1(k+1) = ∑_{i=1}^n (1 − γi,k) yi / ∑_{i=1}^n (1 − γi,k),   µ2(k+1) = ∑_{i=1}^n γi,k yi / ∑_{i=1}^n γi,k   (4.8)

and

σ1,k+1² = ∑_{i=1}^n (1 − γi,k)(yi − µ1(k+1))² / ∑_{i=1}^n (1 − γi,k),   σ2,k+1² = ∑_{i=1}^n γi,k (yi − µ2(k+1))² / ∑_{i=1}^n γi,k.   (4.9)

□

Figure 4.3: Example: EM algorithm for normal mixtures (left: data and true density; right: the estimates over the EM iterations).

R Code 4.1.1. R-code: EM Algorithm for normal mixture

In this example, for simplicity we assume that π is known and the component variances are equal to one.

# simulate data: 100 observations from 0.7*N(0,1) + 0.3*N(3,1) (illustrative choice)
set.seed(1)
Z <- ifelse(runif(100) < 0.3, rnorm(100, 3, 1), rnorm(100, 0, 1))
data <- Z        # data simulated from a mixture of two normals, sample size 100
pi <- 0.3        # known mixture weight (masks the built-in constant pi in this script)
r <- rep(NA, 100)   # responsibilities
Mu1 <- rep(0, 11)   # series of estimates of the first expectation
Mu2 <- rep(0, 11)   # series of estimates of the second expectation
Mu1[1] <- 0; Mu2[1] <- 1   # starting values
for (j in 1:10)
{
  for (i in 1:100)
  {
    # responsibility of component 2 for observation i, unit variances
    r[i] <- pi * dnorm(Z[i], Mu2[j], 1) /
      ((1 - pi) * dnorm(Z[i], Mu1[j], 1) + pi * dnorm(Z[i], Mu2[j], 1))
  }
  # weighted means, see (4.8)
  Mu1[j + 1] <- sum((1 - r) * Z) / sum(1 - r)
  Mu2[j + 1] <- sum(r * Z) / sum(r)
}
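If the variances are also unknown, the updates (4.9) can be added inside the same loop. A sketch of the additional lines, using the objects r, Z, Mu1, Mu2 from the code above (the vectors S1, S2 are illustrative names):

S1 <- rep(1, 11); S2 <- rep(1, 11)   # series of variance estimates, starting values 1
# inside the loop over j, after updating Mu1[j+1] and Mu2[j+1], per (4.9):
S1[j + 1] <- sum((1 - r) * (Z - Mu1[j + 1])^2) / sum(1 - r)
S2[j + 1] <- sum(r * (Z - Mu2[j + 1])^2) / sum(r)
# the responsibilities r[i] would then use dnorm(Z[i], Mu1[j], sqrt(S1[j])) etc.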

Note, mixture models deliver a good approximation, but the interpretation of the latent variable may be lost. The EM algorithm becomes a simulation method when the E step is carried out by a Monte Carlo method. This was proposed in 1990 by Wei and Tanner in JASA. They assume that the observed model of Y has latent variables Z. The complete model is related to X = (Y, Z). Their algorithm is called MCEM.

Algorithm 4.3 MCEM Algorithm

Given the current state θ(k).

1. E-step (Expectation): Calculate Q(θ | θ(k)) by a Monte Carlo simulation

(a) Generate z(1), . . . , z(M) independently from the conditional distribution of the latent variables given y under the current parameter θ(k).

(b) Approximate Q(θ | θ(k)) by

    Qk+1(θ | θ(k)) = (1/M) ∑_{j=1}^M ln f(y, z(j) | θ).

2. M-step (Maximization):

   θ(k+1) = arg max_θ Qk+1(θ | θ(k)).

The simulated latent variables z(1), . . . , z(M) are also called multiple imputations; for more details see Rubin (2004).
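As an illustration, the E step of the normal mixture of Example 4.2 can be carried out by simulation: instead of the exact responsibilities, the latent indicators are drawn. A minimal sketch, assuming the data Z and the known weight 0.3 from R Code 4.1.1 and unit variances; the object names are illustrative:

# MCEM sketch for the two-component normal mixture
p  <- 0.3
M  <- 50               # Monte Carlo sample size in the E step
mu <- c(0, 1)          # starting values (mu1, mu2)
for (k in 1:10) {
  # responsibilities under the current parameters
  g <- p * dnorm(Z, mu[2], 1) /
       ((1 - p) * dnorm(Z, mu[1], 1) + p * dnorm(Z, mu[2], 1))
  # Monte Carlo E step: simulate M copies of the latent indicators Delta_i
  D <- matrix(rbinom(length(Z) * M, size = 1, prob = rep(g, times = M)),
              nrow = length(Z))
  w <- rowMeans(D)     # averaged simulated indicators
  # M step: maximize the averaged complete-data log likelihood
  mu[1] <- sum((1 - w) * Z) / sum(1 - w)
  mu[2] <- sum(w * Z) / sum(w)
}
mu                      # MCEM estimates of the two means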

Page 15: Computer Intensive Methods in Statistics - kpi.ua

110 SIMULATION BASED METHODS

4.2 SIMEX

The SIMEX method was introduced in 1994 by Cook and Stefanski in JASA. SIMEX stands for simulation-extrapolation estimation. SIMEX is designed for measurement error models, a special case of latent variable models. In contrast to the EM setting, we observe the latent variable with an additional error (measurement error). SIMEX is a simulation method to get rid of the measurement error in the estimator. The main idea is to increase the measurement error by adding pseudo errors with stepwise increasing variances. The change of the estimator with respect to the increasing perturbation is modeled. The SIMEX estimator is the value extrapolated backwards to the theoretical point of no measurement error.

"Just make the bad thing worse, study how it behaves and guess the good situation."

Let us introduce the measurement error model and the general principle of SIMEX. After that we will study the simple linear model with measurement errors in more detail. First suppose a model without measurement errors:

(Y,V,U) = (Yi, Vi, Ui)i=1,...,n ∼ Fθ

with

E(Yi | Ui, Vi) = G(θ, Ui, Vi).

Example 4.3 (Blood Pressure) Consider a study on the influence of blood pressure on the risk of a stroke. The data set is given by

(Yi, Vi, Xi)_{i=1,...,n},

where Yi is Bernoulli distributed. The "success" probability is the probability that the i'th patient gets a stroke. The variable Vi consists of control variables like the sex, the weight, and the age of patient i. The variable Xi is the measurement of the blood pressure Ui of patient i. We are interested in estimating the parameter θ = (β0, β1ᵀ, β2) of

E(Yi | Ui, Vi) = exp(β0 + β1ᵀ Vi + β2 Ui) / ( 1 + exp(β0 + β1ᵀ Vi + β2 Ui) ).

□

Another example of a measurement error model is the problem of the calibration of two measurement methods.

Page 16: Computer Intensive Methods in Statistics - kpi.ua

SIMEX 111

Figure 4.4: Calibration of a balance with a digital instrument.

Example 4.4 (Calibration) Suppose we are interested in comparing two instruments. Each instrument produces a different type of measurement error. Our aim is to find a relation for transforming data sets from an old instrument in order to compare them with new data obtained from a new, advanced instrument. For that we have to apply both instruments to the same objects, as the children are doing in Figure 4.4. We observe for each apple i two weights xi and yi, and the relation of interest is between the expected values:

EYi = α + β EXi.

□

We are interested in estimating θ and already have a favorite estimator

θ̂ = T(Y, V, U).

In the measurement error model U is a latent variable which cannot be directly observed. We observe X = (X1, . . . , Xn) with

Xi = Ui + σZi i = 1, . . . , n,

Page 17: Computer Intensive Methods in Statistics - kpi.ua

112 SIMULATION BASED METHODS

where

Z1, . . . , Zn are i.i.d. standardized measurement errors with E(Zi) = 0 and Var(Zi) = 1, so that the measurement error σZi has variance σ².

In general, the naive use of the "old" estimation rule

θ̂naive = T(Y, V, X)

delivers an inconsistent procedure. SIMEX is a general method for correcting the naive estimator. We assume that σ² is known.

Algorithm 4.4 SIMEX

1. Simulation: For every λ ∈ {λ1, . . . , λK} generate new errors Z*_i, independent of the data, with expectation zero, and calculate new samples

   Xi(λ) = Xi + √λ σ Z*_i,  i = 1, . . . , n,   X(λ) = (X1(λ), . . . , Xn(λ)).

2. Calculate for every λ ∈ {λ1, . . . , λK}

   θ̂(λ) = T(Y, V, X(λ)).

3. Extrapolation: Fit a curve f(λ) such that ∑_{k=1}^K ( f(λk) − θ̂(λk) )² is minimal.

4. Define

   θ̂simex = f(−1).

Note that Var(Xi) = σ² and Var(Xi(λ)) = (1 + λ)σ², such that λ = −1 corresponds to the case of no measurement error. Obviously we can only generate new data for positive λ; that is why we need the backwards extrapolation step. The knowledge of the measurement error variance is necessary in the algorithm above.

Let us now study the simple linear model with and without measurement errors. Consider the simple linear relationship:

yi = α+ βξi + εi, xi = ξi + δi. (4.10)

The ξ1, . . . , ξn are unknown design points (variables). The first equation is the

Page 18: Computer Intensive Methods in Statistics - kpi.ua

SIMEX 113

●●

0 2 4 6

510

1520

25

Linear Regression

xx

daty

●●

0 2 4 6

510

1520

25

Errors−in−variables: EIV

datx

daty

Figure 4.5: The simple linear regression model without errors in variables and witherrors in variables.

usual simple linear regression. The second equation is the errors-in-variables equation. Regression models whose independent variables are observed with errors are also called errors-in-variables (EIV) models. In errors-in-variables models the situation is much more complicated. The observation (xi, yi) can deviate in all directions from the point on the regression line (ξi, α + βξi), compare Figure 4.5. The least squares estimator in the model with known design points is defined by

(α̂, β̂) = arg min_{α,β} ∑_{i=1}^n (yi − α − βξi)².

Introduce the usual notations mxx, mxy, mξξ, mξy by

mxx = (1/n) ∑_{i=1}^n (xi − x̄)²,   mxy = (1/n) ∑_{i=1}^n (xi − x̄)(yi − ȳ)

and so on. Then

α̂ = ȳ − β̂ ξ̄,   β̂ = mξy / mξξ.

The naive use of this estimator gives

α̂naive = ȳ − β̂naive x̄,   β̂naive = mxy / mxx.

By the law of large numbers, we know

mxy = β mξξ + oP(1)   and   mxx = mξξ + Var(δ) + oP(1).

We obtain for the naive estimator

β̂naive = mxy / mxx = β mξξ / ( mξξ + Var(δ) ) + oP(1).

Page 19: Computer Intensive Methods in Statistics - kpi.ua

114 SIMULATION BASED METHODS

In case that the measurement error variance Var(δ) = σ² is known, we can correct the estimator:

β̂corr = mxy / ( mxx − σ² ).

Assume that the errors (εi, δi) are i.i.d. from a two-dimensional normal distribution with expected values zero, uncorrelated, and Var(ε) = σ² and Var(δ) = κσ², where κ is known. Then the maximum likelihood estimator is the total least squares estimator defined by

(α̂tls, β̂tls) = arg min_{α,β,ξ1,...,ξn} ( ∑_{i=1}^n (yi − α − βξi)² + κ ∑_{i=1}^n (xi − ξi)² ).

In this simple linear EIV model the minimization problem can be solved explicitly. The nuisance parameters ξi can be eliminated by orthogonal projection onto the regression line,

ξ̂i = arg min_{ξi} ( (yi − α − βξi)² + κ(xi − ξi)² ),

that is

ξ̂i = ( β(yi − α) + κxi ) / ( κ + β² ).

Thus it remains to solve the minimization problem

min_{α,β} 1/(κ + β²) ∑_{i=1}^n (yi − α − βxi)².

We get

α̂tls = ȳ − β̂tls x̄

and

β̂tls = ( myy − κ mxx + √( (myy − κ mxx)² + 4κ mxy² ) ) / ( 2 mxy )   (4.11)

for mxy ≠ 0; otherwise β̂tls = 0.
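A small R function illustrating the three slope estimators of this section, the naive, the corrected and the total least squares estimator of (4.11); the function name eiv.slopes and its arguments are illustrative:

# slope estimators in the simple linear EIV model y = alpha + beta*xi + eps, x = xi + delta
# sigma2: measurement error variance Var(delta); kappa: known ratio as used in (4.11)
eiv.slopes <- function(x, y, sigma2, kappa = 1)
{
  mxx <- mean((x - mean(x))^2)
  myy <- mean((y - mean(y))^2)
  mxy <- mean((x - mean(x)) * (y - mean(y)))
  b.naive <- mxy / mxx
  b.corr  <- mxy / (mxx - sigma2)     # corrected estimator, sigma2 known
  b.tls   <- if (mxy != 0) {          # total least squares, see (4.11)
    (myy - kappa * mxx + sqrt((myy - kappa * mxx)^2 + 4 * kappa * mxy^2)) / (2 * mxy)
  } else 0
  c(naive = b.naive, corrected = b.corr, tls = b.tls)
}
# kappa = 1 corresponds to orthogonal regression

The choice kappa = 1, orthogonal regression, is a common default when the error variance ratio is unknown but believed to be close to one.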

Page 20: Computer Intensive Methods in Statistics - kpi.ua

SIMEX 115

0 2 4 6

01

23

45

67

Naive Method

x

y

0 2 4 6

01

23

45

67

Orthogonal Method

x

y

Figure 4.6: The unknown design points are estimated by the observation or estimatedby the projection on to the line.

Algorithm 4.5 SIMEX, σ2 known

1. Simulation: For every λ ∈ {λ1, . . . , λK} generate new errors Z*_i, independently of the data, with expectation zero, and calculate new samples

   Xi(λ) = Xi + √λ σ Z*_i,  i = 1, . . . , n,   X(λ) = (X1(λ), . . . , Xn(λ)).

2. Calculate for every λ ∈ {λ1, . . . , λK}

   β̂naive(λ) = mx(λ)y / mx(λ)x(λ).

3. Fit a curve

   (â, b̂) = arg min_{a,b} ∑_{k=1}^K ( a/(b + λk) − β̂naive(λk) )².

4. Define

   β̂simex = â / (b̂ − 1).

It can be shown that

β̂simex = β̂corr + oP*(1).

The remainder term oP*(1) converges to zero in probability, for K tending to infinity, with respect to the pseudo-error generating probability P*.

Page 21: Computer Intensive Methods in Statistics - kpi.ua

116 SIMULATION BASED METHODS

The model (4.10) is symmetric in both variables; set EYi = ηi, then

ηi = α + βξi,   ξi = −α/β + (1/β) ηi.

The first equation is related to a regression of Y on X, the second to an inverse regression of X on Y. In the errors-in-variables case both regressions describe the same underlying relation. Thus, we have the choice between two different naive estimators,

β̂1,naive = mxy / mxx,   β̂2,naive = myy / mxy.

It holds β̂1,naive ≤ β̂tls ≤ β̂2,naive.

The following SIMEX algorithm exploits the symmetry and is called SYMEX. It has the advantage that knowledge of the measurement error variance is not needed, but the quotient κ is supposed to be known. The idea is to apply the first SIMEX steps to each of the naive estimators. The SYMEX estimator is defined by the crossing point of the two extrapolation curves.

Page 22: Computer Intensive Methods in Statistics - kpi.ua

SIMEX 117

Algorithm 4.6 SYMEX

1. Regression Y on X

(a) Simulation: Generate new samples for every λ ∈ {λ1, . . . , λK},

    Xi(λ) = Xi + √(λκ) Z*_{1,i},  i = 1, . . . , n,   X(λ) = (X1(λ), . . . , Xn(λ)).

(b) Calculate for every λ ∈ {λ1, . . . , λK}

    β̂1,naive(λ) = mx(λ)y / mx(λ)x(λ).

(c) Fit a curve b1(λ) = a/(b + λ) with

    (â, b̂) = arg min_{a,b} ∑_{k=1}^K ( a/(b + λk) − β̂1,naive(λk) )².

2. Regression X on Y

(a) Simulation: Generate new samples for every λ ∈ {λ1, . . . , λK},

    Yi(λ) = Yi + √λ Z*_{2,i},  i = 1, . . . , n,   Y(λ) = (Y1(λ), . . . , Yn(λ)).

(b) Calculate for every λ ∈ {λ1, . . . , λK}

    β̂2,naive(λ) = my(λ)y(λ) / my(λ)x.

(c) Fit a line b2(λ) = a + bλ with

    (â, b̂) = arg min_{a,b} ∑_{k=1}^K ( a + bλk − β̂2,naive(λk) )².

3. Find λ* such that

   b1(λ*) = b2(λ*).

4. Define

   β̂symex = b2(λ*).

This estimator can also be generalized to a multivariate EIV model; for more details see Polzehl and Zwanzig (2003). It holds

β̂symex = β̂tls + oP*(1),

Page 23: Computer Intensive Methods in Statistics - kpi.ua

118 SIMULATION BASED METHODS

where the remainder term oP*(1) converges to zero in probability, for K tending to infinity, with respect to the pseudo-error generating probability P*.

The following R code is for a SIMEX algorithm with a linear extrapolation model.

R Code 4.2.2. R-code: SIMEX

# data = (daty, datx); measurement error variance assumed known, here Var = 0.5
# add pseudo errors for one fixed lamda
# and calculate the naive estimator
simest <- function(n, lamda, B)
{
  b1 <- rep(NA, B)
  for (j in 1:B)
  {
    # perturbed sample: the added pseudo error has variance lamda
    x <- datx + sqrt(lamda) * rnorm(n, 0, 1)
    M <- lm(daty ~ x)
    b1[j] <- coef(M)[2]   # naive slope estimator on the perturbed data
  }
  return(b1)
}
simest(11, 0.5, 10)
mean(simest(11, 0.5, 10))   # stabilizing over repeated pseudo errors
#### SIM - step
l <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6)   # grid of lamda values
K <- length(l)
b <- rep(NA, K)
B <- 1
n <- length(daty)
for (j in 1:K)
{
  b[j] <- mean(simest(n, l[j], B))
}
##### EX - step ######
M1 <- lm(b ~ l)   # linear extrapolation model
### backwards extrapolation to the point of no measurement error, Var = 0.5 #####
simb <- coef(M1)[1] - 0.5 * coef(M1)[2]
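The code assumes vectors daty and datx in the workspace. A minimal way to generate such data (true slope 3, measurement error variance 0.5; all values illustrative) and to compare the resulting estimates is:

set.seed(2)
xi   <- seq(0, 6, length.out = 11)        # unknown design points
daty <- 1 + 3 * xi + rnorm(11, 0, 1)      # responses
datx <- xi + rnorm(11, 0, sqrt(0.5))      # observed with measurement error
# after running R Code 4.2.2:
c(naive = unname(coef(lm(daty ~ datx))[2]), simex = unname(simb))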

Page 24: Computer Intensive Methods in Statistics - kpi.ua

SIMEX 119

●●

−0.6 −0.4 −0.2 0.0 0.2 0.4 0.6 0.8

23

45

SIMEX − Extrapolation

lamda

beta

true

naive

SIMEX

Figure 4.7: Illustration of the SIMEX method with a linear extrapolation model.

Page 25: Computer Intensive Methods in Statistics - kpi.ua

120 SIMULATION BASED METHODS

4.3 Variable Selection

In this section we consider the problem of variable selection based on simulation. The data set consists of p + 1 columns of dimension n. Each column includes the observations related to one variable. The first variable, which we call Y, is the response variable (dependent variable); the other columns contain the observations of the independent variables (features, covariates) X(1), . . . , X(p):

(Y, X(1), . . . , X(p)) =
  [ Y1  X(1),1  · · ·  X(p),1
     ⋮     ⋮              ⋮
    Yn  X(1),n  · · ·  X(p),n ]   (4.12)

The problem is to select the minimal set of covariates X(j1), . . . , X(jm) which are needed for predicting the values of Y:

E( Y | X(1), . . . , X(p) ) = E( Y | X(j1), . . . , X(jm) ).   (4.13)

We call the variables X(j1), . . . , X(jm) appearing on the right hand side of (4.13) important, and the other variables are named unimportant. The set of indices of the important variables, A = {j1, . . . , jm}, is called the active set.

Example 4.5 (Diabetes) This example is quoted from Efron et al. (2003). 442 diabetes patients were measured on 10 different variables (age, sex, bmi, map, tc, ldl, hdl, tch, ltg, glu) and a response Y, a measurement of disease progression. The problem is to find out which of the covariates are important factors for the disease progression. □

R Code 4.3.3. R-code: Example Diabetes

library(lars)

data(diabetes)

4.3.1 F-Backward and F-Forward procedures

As an introduction we briefly repeat classical selection methods in linear regression. Let us suppose that the data follow a linear regression model

Y = Xβ + ε,   X = (X(1), . . . , X(p)),   rank(X) = p,   ε ∼ Nn(0, σ²In).   (4.14)

Then the model (4.13) corresponds to the hypothesis

H0 : βj = 0 for all j ∉ A.   (4.15)

In linear models (4.14) the main tool for testing is the F -test. The F -statistic


measures the difference in the model fit of a "small" model with the variables X(j), j ∈ Jm, and a "big" model with the variables X(j), j ∈ Jq, where Jm ⊂ Jq ⊆ {1, . . . , p} and #(Jm) = m, #(Jq) = q, m < q:

F(Jm, Jq) = ( (RSS_Jm − RSS_Jq) / (q − m) ) / ( RSS_Jq / (n − q) ).

Here the model fit is measured by the residual sum of squares (RSS). Let RSS_J be the residual sum of squares in the linear model including all variables X(j), j ∈ J ⊆ {1, . . . , p}:

RSS_J = min_{β: βj = 0, j ∉ J} ‖Y − Xβ‖².

Note, the RSS_J criterion is also reasonable when the linear model assumptions are not fulfilled. Then for J = {j1, . . . , jq} the criterion RSS_J is a measure of the best linear approximation of E(Y | X(1), . . . , X(p)) by a linear function of X(j1), . . . , X(jq), and

RSS_J = min_{β: βj = 0, j ∉ J} ‖ E(Y | X(1), . . . , X(p)) − Xβ ‖² + tr( Cov(Y | X(1), . . . , X(p)) ) + n op(1).

In (4.14) under the null hypothesis the small model is true and the F-statistic is F-distributed with df1 = q − m and df2 = n − q degrees of freedom. Let q(α, df1, df2) be the respective α quantile. Selection algorithms perform combinations of F-tests. As examples the backward and the forward algorithms are presented. In practice, methods which combine backward and forward steps are in use.
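In R, such a partial F-statistic can be computed by comparing two nested least squares fits. A small illustration with the diabetes data of Example 4.5 (the choice of bmi and ltg is only for illustration):

library(lars)
data(diabetes)
small <- lm(y ~ x[, "bmi"], data = diabetes)               # "small" model Jm
big   <- lm(y ~ x[, "bmi"] + x[, "ltg"], data = diabetes)  # "big" model Jq, one more variable
anova(small, big)   # reports the partial F-statistic F(Jm, Jq) and its p-value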

The backward algorithm starts with the model including all variables of thedata set. The first step is a simultaneous test on the p null hypotheses that acomponent of β in (4.13) is zero. All possible partial F -statistics are compared.One performs the F-test based on the F -statistic with minimal value. Norejection of the null hypothesis means that the respective variable is deletedfrom the model, otherwise the whole model is accepted. In the next step themodel with the deleted variable is the big model and all small models withone additional deleted variable are simultaneously tested.

Page 27: Computer Intensive Methods in Statistics - kpi.ua

122 SIMULATION BASED METHODS

Algorithm 4.7 F-Backward

Choose α.

1. Start with the model (4.14) which includes all variables X(1), . . . , X(p).

   (a) For all j ∈ J = {1, . . . , p} consider H0,j : βj = 0 with the partial F-statistic Fj = F(J \ {j}, J).

   (b) Select l1 with Fl1 = min_{j∈J} Fj.

   (c) Perform the F-test for the hypothesis H0,l1:

       If Fl1 < q(α, 1, n − p), then delete the variable X(l1) and go to the next step.

       Otherwise stop and select the model with the variables X(1), . . . , X(p).

2. Assume the model with the variables X(j), j ∈ J1 = {1, . . . , p} \ {l1}.

   (a) For all j ∈ J1 calculate the F-statistics Fj = F(J1 \ {j}, J1).

   (b) Select l2 with Fl2 = min_{j∈J1} Fj.

   (c) If Fl2 < q(α, 1, n − p + 1), then delete the variable X(l2) and go to the next step.

       Otherwise stop and select the model with the variables X(j), j ∈ J1 = {1, . . . , p} \ {l1}.

3. ...

4. Last step: Assume the model with the single variable X(lp), {lp} = {1, . . . , p} \ {l1, . . . , lp−1}.

   (a) Consider H0,lp : βlp = 0 and Flp = F({}, {lp}).

   (b) If Flp < q(α, 1, n), then delete the variable X(lp); no variable is selected.

       Otherwise stop and select the model with the variable X(lp).

The F-forward selection algorithm is a stepwise regression starting with all possible simple linear regressions. In every step, new models with one added variable are compared by the partial F-statistic. For the model with the maximal F-statistic the partial F-test at level α is carried out. In case of rejection of the null hypothesis, this variable with the maximal F-statistic is added to the model. Otherwise, none of the null hypotheses is rejected, no more variables are added, and the procedure stops.

Note, forward algorithms have the advantage that they can be applied to big data sets, where p > n.


Algorithm 4.8 F-Forward

Choose α.

1. Suppose p different models, each with one variable X(j), j ∈ J = {1, . . . , p}.

   (a) In each model calculate the F-statistic Fj = F({}, {j}) for testing H0,j : βj = 0.

   (b) Select j1 with Fj1 = max_j Fj.

   (c) For Fj1 > q(α, 1, n − 1) select the variable X(j1) and go to Step 2.

       Otherwise stop; no variable is selected.

2. Suppose p − 1 different models with the variables X(j1), X(j), j ∈ J1 = {1, . . . , p} \ {j1}.

   (a) Calculate the F-statistics F(j1,j) = F({j1}, {j, j1}) for testing H0,j : βj = 0.

   (b) Select j2 with F(j1,j2) = max_j F(j1,j).

   (c) For F(j1,j2) > q(α, 1, n − 2) select the variable X(j2) and go to Step 3.

       Otherwise stop; the final model contains X(j1).

3. Suppose p − 2 different models with the variables X(j1), X(j2), X(j), j ∈ J2 = J1 \ {j2} ...

4. ...

The tuning parameter α is the size of every partial F-test. But the final model is based on several tests, and α is no longer the size of the whole procedure. The size of the whole procedure is called the family size and is defined by

αfamily = P(the true model is not selected).

In case of the backward algorithm, a bound for the real family size can be calculated. Suppose the chosen model contains q variables; then the decisions D1, . . . , D_{p−q+1} are made, where only the last decision is a rejection of the partial null hypothesis. Using

P0(D1 . . . D_{p−q+1}) = P0(D_{p−q+1} | D1 . . . D_{p−q}) P0(D_{p−q} | D1 . . . D_{p−q−1}) · · · P0(D1)

we have

P(model is selected | model is true)
  = P0( the hypotheses βj = 0, j ∉ A, are not rejected and the last tested partial null hypothesis is rejected )
  = (1 − α)^{p−q−1} α.

Figure 4.8: All steps of the F-forward algorithm in Example 4.5 (partial F-statistics of the candidate covariates in Steps 1 to 10). The broken line is the α = 0.5 quantile. The algorithm stops after Step 8. The selected variables are bmi, ltg, map, tc, sex, ldl, tch, glu.

The main problem is now to determine the tuning parameter α, which controls the threshold of the model fit criterion. The F-statistic as a measure of the influence of covariates is also interesting in the case of nonlinear models. Then it compares the best linear approximations. If the condition on the error distribution is violated, the F-statistics are not F-distributed. The comparison of the F-statistic with a bound q(α, df1, df2) controlled by the tuning parameter α is still reasonable. The bound is now simply a threshold and no longer the quantile of the F distribution F_{df1,df2}, and the tuning parameter α is not the size of the partial test.

In R, variable selection procedures are implemented where the models are compared by the Akaike information criterion AIC or the Bayesian information

[Figure: F-Backward Selection. Family error against the number of deleted variables, for α = 0.05 and α = 0.5.]

criterion BIC. For a linear model with normal errors, unknown error variance, and included variables X(j), j ∈ J = {j1, . . . , jq}, the AIC criterion is defined as

AIC_J = n ln(RSS_J) + 2q

and the BIC as

BIC_J = n ln(RSS_J) + ln(n) q.

Note, sometimes the criteria differ by an additive constant. Another alternative is Mallows' Cp,

C_{p,J} = RSS_J / ( RSS_{{1,...,p}} / (n − p) ) − n + 2q.

R Code 4.3.4. R-code: Variable selection

library(lars)
data(diabetes)
library(leaps)   # all-subset selection
all <- regsubsets(y ~ x, data = diabetes)
## stepwise adding of variables can be done as follows ##
attach(diabetes)   # makes y and the covariate matrix x directly available
# step 1
lm0 <- lm(y ~ 1)
A1 <- add1(lm0, ~ 1 + x[,1] + x[,2] + x[,3] + x[,4] + x[,5] +
             x[,6] + x[,7] + x[,8] + x[,9] + x[,10], test = "F")
F1 <- A1$"F value"[2:11]   # partial F-statistic for every newly added variable
AIC.1 <- A1$AIC[2:11]      # AIC for every newly added variable
A1$AIC[1]                  # AIC of the model of the step before
max(F1)
which.max(F1)              # here at bmi = x[,3]
# step 2
lm1 <- lm(y ~ 1 + x[,3])
A2 <- add1(lm1, ~ x[,1] + x[,2] + x[,3] + x[,4] + x[,5] +
             x[,6] + x[,7] + x[,8] + x[,9] + x[,10], test = "F")

The AIC-forward algorithm is a stepwise minimization algorithm for the AIC criterion. The model with the smallest AIC is taken for the next step. The procedure stops when no smaller AIC can be reached.

Algorithm 4.9 AIC-Forward

1. Let Jq = {j1, . . . , jq} be the current set of selected variables and AIC(q) = AIC_Jq.

2. For all j ∈ {1, . . . , p} \ Jq calculate the AIC-statistics

   AIC_{Jq∪{j}} = n ln(RSS_{Jq∪{j}}) + 2(q + 1).

3. Select j_{q+1} with

   AIC_{Jq∪{j_{q+1}}} = min_j AIC_{Jq∪{j}}

   and set AIC(q+1) = AIC_{Jq∪{j_{q+1}}}.

4. If AIC(q+1) < AIC(q), select the variable X(j_{q+1}) and update J_{q+1} = Jq ∪ {j_{q+1}}; otherwise stop.
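In base R, this AIC-driven forward search can be carried out with step(). A short sketch for the diabetes data; the object names d, lm.null, upper, aic.fwd are illustrative, and note that step() uses n log(RSS/n) + 2q, which differs from AIC_J above only by an additive constant and therefore selects the same variables:

library(lars)
data(diabetes)
d <- data.frame(y = diabetes$y, unclass(diabetes$x))   # columns age, sex, bmi, ...
lm.null <- lm(y ~ 1, data = d)
upper   <- reformulate(colnames(diabetes$x), response = "y")
aic.fwd <- step(lm.null, scope = upper, direction = "forward")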

Both forward algorithms are greedy algorithms, because in every step only the next step is optimized. In every step, we compare models with the same number of variables and

RSS1 < RSS2 ⇔ AIC1 < AIC2 ⇔ BIC1 < BIC2 ⇔ F2 < F1.

Thus, there is no difference in selecting the next variable between the F-forward, AIC-forward or BIC-forward algorithms. The procedures differ in their stopping rules. In case the number qtrue of variables in the true model is known, and it is only unknown which of the p variables in the data set should be included, all methods stop after selecting qtrue variables and deliver the same final model. The assumption that the active set A contains not more than qtrue elements is called the sparsity assumption. Note the equivalences work for all criteria which are monotone functions of RSS, especially for criteria which penalize the RSS by an additive constant depending on the number of model parameters. The equivalence does not hold for a criterion of the following type:

min_{β: βj = 0, j ∉ J} ( ‖Y − Xβ‖² + λ pen(β, q) ),

Figure 4.9: Example 4.5 (left: AIC-Forward Selection, minimal AIC per forward step; right: F-Forward Selection, maximal partial F-statistic per forward step). The AIC-forward algorithm stops after the 6th step. The selected variables are bmi, ltg, map, tc, sex, ldl. Using the FSR method with α* (here α* = 0.88), the selected variables are bmi, ltg, map, tc, sex, ldl. In this example both methods deliver the same final model.

where the penalty depends on the unknown parameter β.

4.3.2 FSR-Forward procedure

The next procedure applies data augmentation for determining the stopping condition. In his textbook, Miller (2002) proposed the generation of pseudo variables for calibrating forward selection procedures. Instead of the original data (4.12), extended data sets are introduced. Let

(Y, X(1), . . . , X(p), Z(1), . . . , Z(K)) =
  [ Y1  X(1),1  · · ·  X(p),1  Z(1),1  · · ·  Z(K),1
     ⋮     ⋮              ⋮       ⋮              ⋮
    Yn  X(1),n  · · ·  X(p),n  Z(1),n  · · ·  Z(K),n ],

where the observations Z(j),i, j = 1, . . . , K, i = 1, . . . , n, are generated such that the generated variables Z(1), . . . , Z(K) are unimportant by construction. For example, generate the Z(j),i i.i.d. standard normally distributed. In the literature they are called phony variables, control variables, or pseudo variables. The arguments are as follows.

A procedure which selects many of the phony variables will also select many unimportant variables. Conversely, a procedure which never accepts a phony variable runs a big risk of missing an important variable in (4.12).

Wu developed an algorithm in his dissertation and called it the FSR algorithm, see Wu et al. (2007). FSR stands for false selection rate. Running a selection procedure on the data (Y, X), let U(Y, X) be the number of unimportant selected variables and S(Y, X) the number of all selected variables; then

FSR = U(Y, X) / ( S(Y, X) + 1 ).

It is not possible to count U(Y, X), but the number of selected phony variables is known. Wu's main idea is to determine the size α of the partial F-tests in the F-forward procedure by checking how many of the phony variables are selected.

Algorithm 4.10 FSR-Forward

Given a bound γ0 on the rate of selected phony variables.

1. For every α ∈ {α1, . . . , αM}

   (a) For every b, b = 1, . . . , B

       i. Generate Zb = (Z(1), . . . , Z(K)) independent of (Y, X).

       ii. Run an F-forward selection with α on the data (Y, X, Zb); count U*_b(α) = #{k : Z(k) selected} and Sb(α), the number of all selected variables.

   (b) Calculate the averages

       U*(α) = (1/B) ∑_{b=1}^B U*_b(α),   S(α) = (1/B) ∑_{b=1}^B Sb(α)

       and, as an indicator for the FSR,

       γ(α) = U*(α) / ( S(α) + 1 ).

2. Determine the adjusted α level

   α* = sup_{α ∈ {α1,...,αM}} { α : γ(α) ≤ γ0 }.

3. Run an F-forward selection on the original data (Y, X) with α*.
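A compact R sketch of this idea; the helper functions f.forward and fsr.forward are illustrative, and the F-forward step uses residual degrees of freedom that account for the intercept, which differs slightly from the quantiles in Algorithm 4.8:

# simple F-forward selection by partial F-tests
f.forward <- function(X, y, alpha) {
  n <- nrow(X); p <- ncol(X)
  selected <- integer(0)
  repeat {
    candidates <- setdiff(seq_len(p), selected)
    if (length(candidates) == 0) break
    rss0 <- if (length(selected) == 0) sum(residuals(lm(y ~ 1))^2)
            else sum(residuals(lm(y ~ X[, selected]))^2)
    df2  <- n - length(selected) - 2    # residual df of the larger model
    Fstat <- sapply(candidates, function(j) {
      rss1 <- sum(residuals(lm(y ~ X[, c(selected, j)]))^2)
      (rss0 - rss1) / (rss1 / df2)      # partial F-statistic
    })
    best <- candidates[which.max(Fstat)]
    if (max(Fstat) > qf(1 - alpha, 1, df2)) selected <- c(selected, best) else break
  }
  selected
}

# FSR calibration of alpha with K phony variables and B repetitions
fsr.forward <- function(X, y, alphas, K = 10, B = 5, gamma0 = 0.05) {
  gamma <- sapply(alphas, function(a) {
    res <- replicate(B, {
      Z <- matrix(rnorm(nrow(X) * K), nrow(X), K)   # phony variables
      sel <- f.forward(cbind(X, Z), y, a)
      c(U = sum(sel > ncol(X)), S = length(sel))
    })
    mean(res["U", ]) / (mean(res["S", ]) + 1)
  })
  a.star <- max(alphas[gamma <= gamma0])   # adjusted level (grid should contain such an alpha)
  f.forward(X, y, a.star)                  # final run on the original data
}
# e.g. fsr.forward(diabetes$x, diabetes$y, alphas = seq(0.05, 0.5, by = 0.05))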

Figure 4.10: All possible steps of an F-forward algorithm with 10 pseudo variables (maximal partial F-statistic per forward step; the procedure stops with γ = 3/9).

Figure 4.11: Determining the α* of the F-forward algorithm by the FSR estimate from the extended data set (left: the two tuning parameters γ and α; right: quantiles of the F distribution with df1 = 1, df2 = 400).

4.3.3 SimSel

SimSel applies both the data augmentation of Wu's FSR algorithm and the successive data perturbation of SIMEX. The main idea behind SimSel is to determine the relevance of a variable X(j) by successively disturbing it and studying the effect on the residual sum of squares (RSS). In case the RSS remains unchanged, we conclude that the variable X(j) is not important.

The SimSel algorithm borrows from the SIMEX method the simulation step, where pseudo errors are added to the independent variables. However, the extrapolation step of SIMEX is not performed. Thus it is called SimSel, for simulation and selection. The variables X(1), . . . , X(p) are checked one after the other. Let us assume that we are interested in X(1). The original data set

(Y, X(1), . . . , X(p))

is embedded into

(Y, X(1) + √λ ε*, . . . , X(p), Z),

where Z = (Z1, . . . , Zn)ᵀ is an independent pseudo variable, generated independently of Y, X(1), . . . , X(p). The pseudo errors ε* = (ε*_1, . . . , ε*_n)ᵀ are generated such that the ε*_i are i.i.d. from P*, with E(ε*_i) = 0, Var(ε*_i) = 1, and E((ε*_i)⁴) = µ.

The phony variable Z serves as an untreated control group, as in a biological experiment. The influence of the pseudo errors is controlled by stepwise increasing λ. The main idea is: if λ "does not matter", then X(1) is unimportant. For an explanation see Figure 4.12. There, we consider the simple linear case Y = β1 X1 + ε and we fit

RSS1(λk) = min_{β1,β2} ‖ Y − β1 (X1 + √λk ε*) − β2 Z ‖²

and

RSS2(λk) = min_{β1,β2} ‖ Y − β1 X1 − β2 (Z + √λk ε*) ‖²

by a linear regression. Intuitively, "does not matter" corresponds to a constant trend of RSS(·).

The theoretical background of the SimSel algorithm is given by the following result:

Theorem 4.3 Under the assumption that (XᵀX)⁻¹ exists, it holds

(1/n) RSS(λ) = (1/n) RSS + ( λ / (1 + h11 λ) ) (β̂1)² + oP*(1),

where h11 is the (1,1)-element of ( (1/n) XᵀX )⁻¹ and β̂1 is the first component of the OLSE β̂ = (XᵀX)⁻¹ Xᵀ Y.

Line of the proof: It holds

(1/n) RSS(λ) = (1/n) YᵀY − (1/n) Yᵀ P(λ) Y   (4.16)

Figure 4.12: RSS against λ. The constant regression corresponds to the unimportance of the pseudo variable. The heteroscedastic regression is related to the worse model fit under the disturbed important variable.

with

P(λ) = X(λ) ( X(λ)ᵀ X(λ) )⁻¹ X(λ)ᵀ.   (4.17)

Further,

(1/n) X(λ)ᵀ Y = ( (1/n) X + (1/n) √λ ∆ )ᵀ Y,

where ∆ is the (n × p) matrix

∆ = [ ε*_1  0  · · ·  0
      ε*_2  0  · · ·  0
        ⋮   ⋮         ⋮
      ε*_n  0  · · ·  0 ].

The law of large numbers applied to the pseudo errors delivers

(1/n) X(λ)ᵀ Y = (1/n) Xᵀ Y + oP*(1).   (4.18)


Consider now X(λ)ᵀ X(λ):

(1/n) (X + √λ ∆)ᵀ (X + √λ ∆)   (4.19)
  = (1/n) XᵀX + (1/n) √λ Xᵀ∆ + (1/n) √λ ∆ᵀX + (1/n) λ ∆ᵀ∆.   (4.20)

Hence

( (1/n) X(λ)ᵀ X(λ) )⁻¹ = ( (1/n) XᵀX + λ e1 e1ᵀ )⁻¹ + oP*(1).

□

Note, in the procedure we approximate

λ / (1 + h11 λ) ≈ λ.

In linear errors-in-variables models the naive LSE is inconsistent. But if β1 is zero, then the naive LSE also converges to zero. This gives the motivation for the successful application of SimSel to errors-in-variables models.

The significance of the perturbation for the model fit is controlled by a Monte Carlo F-test for the regression of RSS(λ) on λ. That is the testing step in the SimSel procedure. The comparison of the model fit is repeated M times and a paired sample of F-statistics is obtained, where the sample F_{i,1}, . . . , F_{i,M} is related to the variable under control X(i), and the other sample F_{p+1,1}, . . . , F_{p+1,M} is related to the pseudo variable Z = X(p+1). Kernel estimates f̂_i, f̂_{p+1} for each sample are calculated and the overlap of the densities is compared with given tuning parameters α1, α2, see Figure 4.13.

Example 4.6 (Prostate) The prostate cancer dataset contains 97 observations of one dependent variable (the log of the level of prostate-specific antigen, lpsa) and eight independent variables: the logarithm of the cancer volume (lcavol), the logarithm of the prostate's weight (lweight), age, the logarithm of the amount of benign prostatic hyperplasia (lbph), seminal vesicle invasion (svi), the logarithm of the level of capsular penetration (lcp), Gleason score (gleason), and percentage Gleason scores 4 and 5 (pgg45). For more details see the R package ElemStatLearn, data(prostate). □

Figure 4.13: Significance for chosen levels α1, α2 (kernel densities f̂_i(F) and f̂_{p+1}(F) with the quantiles F_i⁻¹(α2) and F_{p+1}⁻¹(1 − α1)).

Figure 4.14: Graphical output of SimSel in Example 4.6: distributions of the F-statistic for each variable.

Algorithm 4.11 The SimSel Algorithm

Given 0 ≤ λ1 ≤ λ2 ≤ . . . ≤ λK, M, α1, α2.

For m from 1 to M
    Generate an unimportant pseudo variable Z = X(p+1).
    For i from 1 to p + 1
        For k from 1 to K
            Generate pseudo errors, add them to X(i), and compute RSSi(λk).
        Regression step: calculate the F-statistic Fi,m from the regression of RSSi(λ) on λ.
Plotting step: violin plot of the F-statistics.
Ranking step: according to the sample median of (Fi,m)_{m=1,...,M}.
Testing step.
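A minimal R sketch of one repetition of this scheme for a linear model; the helper simsel.F, the λ grid, and the use of fresh pseudo errors for every λ are illustrative simplifications:

# F-statistic of the regression of RSS_i(lambda) on lambda for one variable
# X: n x p matrix of covariates, y: response, i: index of the variable under control
simsel.F <- function(X, y, i, lambda = seq(0.02, 0.2, by = 0.02))
{
  Z  <- rnorm(nrow(X))    # unimportant pseudo variable X_(p+1)
  XZ <- cbind(X, Z)
  rss <- sapply(lambda, function(l) {
    Xd <- XZ
    Xd[, i] <- Xd[, i] + sqrt(l) * rnorm(nrow(X))   # perturb variable i
    sum(residuals(lm(y ~ Xd))^2)                    # RSS_i(lambda)
  })
  # regression step: F-statistic for the slope of RSS(lambda) on lambda
  summary(lm(rss ~ lambda))$fstatistic[1]
}
# one repetition for the diabetes data: i = 1, ..., 10 are the covariates,
# i = 11 is the pseudo variable; important variables give large values
# sapply(1:11, function(i) simsel.F(diabetes$x, diabetes$y, i))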

For more details regarding this chapter, see Dempster et al. (1977), Cook and Stefanski (1994), Polzehl and Zwanzig (2003), Eklund and Zwanzig (2011), Wei and Tanner (1990), Fuller (1987), and Wu et al. (2007).


Bibliography

J.R. Cook and L.A. Stefanski. Simulation-extrapolation estimation in parametric measurement error models. JASA, 89:1314–1327, 1994.

A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. Roy. Stat. Soc. B, 39:1–38, 1977.

Martin Eklund and Silvelyn Zwanzig. SimSel: a new simulation method for variable selection. J. Comp. and Sim., 85:515–527, 2011.

Wayne A. Fuller. Measurement Error Models. Wiley, 1987.

Alan Miller. Subset Selection in Regression. Chapman and Hall/CRC, 2002.

Jörg Polzehl and Silvelyn Zwanzig. On a symmetrized simulation extrapolation estimator in linear errors-in-variables models. Computational Statistics and Data Analysis, 47:675–688, 2003.

Donald B. Rubin. Multiple Imputation for Nonresponse in Surveys. Wiley-Interscience, 2004.

Greg C.G. Wei and Martin A. Tanner. A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms. Journal of the American Statistical Association, 85(411), 1990.

Yujun Wu, Dennis D. Boos, and Leonard A. Stefanski. Controlling variable selection by the addition of pseudovariables. Journal of the American Statistical Association, 102, 2007.
