Generalized Linear Models

STK3100, Universitetet i Oslo - 4 September 2011

Plan for lecture 3:

1. Definition of GLM

2. Link functions and canonical link

3. Estimation of GLM - Maximum Likelihood

4. Large sample results

5. Tests in GLM - Likelihood Ratio


Definition of GLM

A GLM is defined by

• Independent $Y_1, Y_2, \ldots, Y_n$ from the same distribution in the exponential family

• with pdf/pmf $f(y_i; \theta_i) = c(y_i; \phi) \exp\left((\theta_i y_i - a(\theta_i))/\phi\right)$

• and expectations $\mu_i = a'(\theta_i)$

• Linear predictors $\eta_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} = \beta' x_i$

• Link function $g()$: the expectation $\mu_i = E[Y_i]$ is coupled to the linear predictor through $g(\mu_i) = \eta_i$

Note that $\mu_i$ depends on the vector $\beta$ through $g(\mu_i) = \eta_i$, i.e. $\mu_i = g^{-1}(\eta_i)$.

Therefore $\theta_i$ also depends on $\beta$ through $\mu_i = a'(\theta_i)$.
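As a concrete instance of the definition, the Poisson distribution can be written in this exponential-family form with $\theta = \log\mu$, $a(\theta) = \exp(\theta)$, $\phi = 1$ and $c(y; \phi) = 1/y!$. These identifications are standard but are not spelled out on the slide, so the Python sketch below (function names are illustrative) should be read as a check under those assumptions.

```python
import math

def poisson_pmf_direct(y, mu):
    """Textbook Poisson pmf: mu^y * exp(-mu) / y!."""
    return mu ** y * math.exp(-mu) / math.factorial(y)

def poisson_pmf_expfam(y, theta, phi=1.0):
    """Same pmf written as c(y; phi) * exp((theta*y - a(theta)) / phi).

    Assumed identifications: a(theta) = exp(theta), c(y; phi) = 1/y!, phi = 1.
    """
    a = math.exp(theta)
    c = 1.0 / math.factorial(y)
    return c * math.exp((theta * y - a) / phi)

def a_prime(theta, h=1e-6):
    """Central-difference derivative of a(theta) = exp(theta)."""
    return (math.exp(theta + h) - math.exp(theta - h)) / (2 * h)

mu = 2.5                      # hypothetical mean, for illustration only
theta = math.log(mu)          # canonical parameter for the Poisson

# Largest difference between the two ways of writing the pmf, for y = 0..9
max_gap = max(abs(poisson_pmf_direct(y, mu) - poisson_pmf_expfam(y, theta))
              for y in range(10))

# The expectation should equal a'(theta)
mean_gap = abs(a_prime(theta) - mu)
```

Both gaps should be at the level of rounding error, which illustrates $\mu_i = a'(\theta_i)$ for this family.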

The linear regression model is a GLM

• Responses ($Y_i$-s) from normal distributions

• Linear predictors $\eta_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}$

• $E[Y_i] = \mu_i = \eta_i$, i.e. the link function $g(\mu_i) = \mu_i$ is the identity function

The R commands lm (for linear regression) and glm do essentially the same thing here, but with slightly different output.

Linear regression is the default specification in glm

Ex. 1: Birth weights

> lm(vekt~sex+svlengde)
lm(formula = vekt ~ sex + svlengde)
Coefficients:
(Intercept)          sex     svlengde
    -1447.2       -163.0        120.9

> glm(vekt~sex+svlengde)
Call: glm(formula = vekt ~ sex + svlengde)
Coefficients:
(Intercept)          sex     svlengde
    -1447.2       -163.0        120.9
Degrees of Freedom: 23 Total (i.e. Null); 21 Residual
Null Deviance: 1830000
Residual Deviance: 658800    AIC: 321.4

The logistic regression model is a GLM

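• Responses ($Y_i$-s) from binomial distributions $\mathrm{Bin}(n_i, \pi_i)$, with expectation $\mu_i = n_i \pi_i$

• Linear predictors $\eta_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}$

• $E[Y_i] = \mu_i = n_i \dfrac{\exp(\eta_i)}{1 + \exp(\eta_i)}$

• Gives link function $g(\mu_i) = \log\left(\dfrac{\mu_i}{n_i - \mu_i}\right)$

• Usually expressed as a function of $\pi = \mu/n$, i.e. $g(\pi_i) = \log\left(\dfrac{\pi_i}{1 - \pi_i}\right)$ (which is also the link function for the binomial proportions $Y_i/n_i$)

• $\log\left(\dfrac{\pi}{1 - \pi}\right) = \mathrm{logit}(\pi)$ is called the logit function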

Link functions for binomial data

The logit link function gives

$$\pi = g^{-1}(\eta) = \frac{\exp(\eta)}{1 + \exp(\eta)},$$

i.e. $g^{-1}(\eta)$ is a cumulative distribution function (CDF) for a continuous distribution

• Continuous and strictly increasing

• $g^{-1}(-\infty) = 0$ and $g^{-1}(\infty) = 1$

In general, link functions can be defined by $g(\pi) = F^{-1}(\pi)$ where $F()$ is a continuous CDF

Other link functions for binomial data

The most usual alternative to logit is the probit link,

$$g_2(\pi) = \Phi^{-1}(\pi)$$

where $\Phi(\eta) = \int_{-\infty}^{\eta} \frac{\exp(-x^2/2)}{\sqrt{2\pi}}\,dx$ is the CDF for N(0,1) (the $\pi$ inside the integral is the constant 3.14...)

Another alternative is the complementary log-log link

$$g_3(\pi) = \log(-\log(1 - \pi))$$

which is the inverse of $F(\eta) = 1 - \exp(-\exp(\eta))$ (the CDF of the so-called Gumbel distribution)

Ex: Link functions logit and probit

> glm(cbind(Dode,Ant-Dode)~Dose,family=binomial)
Call: glm(formula = cbind(Dode, Ant - Dode) ~ Dose, family = binomial)
Coefficients:
(Intercept)         Dose
     -60.72        34.27
Null Deviance: 284.2
Residual Deviance: 11.23    AIC: 41.43

> glm(cbind(Dode,Ant-Dode)~dose,family=binomial(link=probit))
Coefficients:
(Intercept)         dose
     -34.94        19.73

Ex: Link function complementary log-log

> glm(cbind(Dode,Ant-Dode)~dose,family=binomial(link=cloglog))
Coefficients:
(Intercept)         dose
     -39.57        22.04
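The three fits have very different coefficients because each link measures the linear predictor on its own scale. A minimal Python sketch, using only the coefficients printed above, converts them back to fitted probabilities $\pi = g^{-1}(\eta)$ so the models can be compared on a common scale. The dose value 1.8 is a hypothetical choice for illustration, since the slides do not list the data.

```python
import math
from statistics import NormalDist

def inv_logit(eta):
    """Inverse of the logit link: logistic CDF."""
    return math.exp(eta) / (1.0 + math.exp(eta))

def inv_probit(eta):
    """Inverse of the probit link: standard normal CDF."""
    return NormalDist().cdf(eta)

def inv_cloglog(eta):
    """Inverse of the complementary log-log link."""
    return 1.0 - math.exp(-math.exp(eta))

# (intercept, slope) as printed in the R output above
coef = {
    "logit": (-60.72, 34.27),
    "probit": (-34.94, 19.73),
    "cloglog": (-39.57, 22.04),
}
inverse_link = {"logit": inv_logit, "probit": inv_probit, "cloglog": inv_cloglog}

dose = 1.8  # hypothetical dose, chosen only for illustration
fitted = {}
for name, (b0, b1) in coef.items():
    eta = b0 + b1 * dose                  # linear predictor
    fitted[name] = inverse_link[name](eta)

for name, p in fitted.items():
    print(f"{name:8s} pi = {p:.3f}")
```

At this dose the fitted probabilities are much closer to each other than the raw coefficients suggest, which is why coefficients from different links should not be compared directly.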

Link functions for Poisson regression

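• Responses $Y_i \sim \mathrm{Po}(\mu_i)$

• Linear predictor $\eta_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}$

Usual link functions

• $\eta_i = g_0(\mu_i) = \log(\mu_i)$ which gives $\mu_i = \exp(\eta_i)$

• $\eta_i = g_{1/2}(\mu_i) = \sqrt{\mu_i}$

• $\eta_i = g_p(\mu_i) = \mu_i^p$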

Ex: Number of children among pregnant women

> glm(children~age,family=poisson)
Coefficients:
(Intercept)          age
    -4.0895       0.1129
Residual Deviance: 165    AIC: 290

> glm(children~age,data=births,family=poisson(link=sqrt))
Coefficients:
(Intercept)          age
   -0.61109      0.04477
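To read these two fits side by side, the linear predictor has to be mapped back through the inverse of each link: $\mu = \exp(\eta)$ for the log link and $\mu = \eta^2$ for the square-root link. The Python sketch below does this with the printed coefficients. The age of 30 is a hypothetical value picked for illustration.

```python
import math

# Coefficients as printed in the R output above
log_fit = (-4.0895, 0.1129)      # family=poisson (log link)
sqrt_fit = (-0.61109, 0.04477)   # family=poisson(link=sqrt)

def mean_log_link(age, coef=log_fit):
    """Fitted mean under the log link: mu = exp(eta)."""
    b0, b1 = coef
    return math.exp(b0 + b1 * age)

def mean_sqrt_link(age, coef=sqrt_fit):
    """Fitted mean under the square-root link: mu = eta^2."""
    b0, b1 = coef
    return (b0 + b1 * age) ** 2

age = 30  # hypothetical age, for illustration only
mu_log = mean_log_link(age)
mu_sqrt = mean_sqrt_link(age)
print(f"log link:  {mu_log:.3f}")
print(f"sqrt link: {mu_sqrt:.3f}")
```

The two links give similar expected counts at this age even though the coefficients are on different scales.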

Canonical link function

• $\mu_i$ depends on $\beta$ through $g(\mu_i) = \eta_i$

• Therefore $\theta_i$ depends on $\beta$ through $\mu_i = a'(\theta_i)$

• A GLM becomes mathematically simpler if we assume that the canonical parameter $\theta_i$ = the linear predictor $\eta_i$

• The link function $g(\mu_i)$ is then called canonical.

• Then $\mu_i = g^{-1}(\eta_i) = g^{-1}(\theta_i)$

• Since $\mu_i = a'(\theta_i)$ the canonical link function is found from $g^{-1}(\theta_i) = a'(\theta_i)$

Examples of canonical link functions

• Normal distribution: Ordinary linear-normal model
$a'(\theta) = \theta = g^{-1}(\theta)$ which gives $g(\mu) = \mu$

• Poisson distribution: Log-linear model
$a'(\theta) = \exp(\theta) = g^{-1}(\theta)$ which gives $g(\mu) = \log(\mu)$

• Binomial distribution with $n = 1$, i.e. $\mu = \pi$:
$a'(\theta) = \dfrac{\exp(\theta)}{1 + \exp(\theta)} = g^{-1}(\theta)$ which gives $g(\pi) = \log(\pi/(1 - \pi)) = \mathrm{logit}(\pi)$

• Canonical links make computations simpler, but this is less important with modern computers
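The defining relation $g^{-1}(\theta) = a'(\theta)$ can be checked numerically: differentiate $a(\theta)$, apply the claimed canonical link, and confirm that $\theta$ comes back. The Python sketch below assumes the usual forms $a(\theta) = \theta^2/2$ (normal), $a(\theta) = \exp(\theta)$ (Poisson) and $a(\theta) = \log(1 + \exp(\theta))$ (binomial with $n = 1$). The slides give only $a'(\theta)$, so these are assumptions consistent with it.

```python
import math

def derivative(f, x, h=1e-5):
    """Central-difference approximation to f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# For each family: (assumed a(theta), canonical link g(mu) from the slide)
families = {
    "normal":    (lambda t: 0.5 * t * t,
                  lambda mu: mu),
    "poisson":   (lambda t: math.exp(t),
                  lambda mu: math.log(mu)),
    "bernoulli": (lambda t: math.log(1.0 + math.exp(t)),
                  lambda mu: math.log(mu / (1.0 - mu))),
}

theta = 0.7   # arbitrary test value of the canonical parameter
worst_gap = 0.0
for name, (a, g) in families.items():
    mu = derivative(a, theta)        # mu = a'(theta)
    gap = abs(g(mu) - theta)         # canonical link should give g(mu) = theta
    worst_gap = max(worst_gap, gap)
    print(f"{name:10s} mu = {mu:.4f}  |g(mu) - theta| = {gap:.1e}")
```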

Re-numberingβ

• So far we have numbered the elements of the vector $\beta$ from 0 to $p$, i.e. $p + 1$ parameters $\beta_0, \beta_1, \ldots, \beta_p$

• In the following, we number them from 1 to $p$, i.e. $p$ parameters ($\beta_1$ is now the intercept if included)

• This is done to follow the book

Likelihood for GLM

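Since the $Y_i$-s are independent with pdf/pmf $f(y_i; \theta_i)$ the likelihood is

$$L(\beta, \phi) = \prod_{i=1}^{n} f(Y_i; \theta_i)$$

Note that this is a function of the regression coefficients $\beta$ since $\theta_i$ is a function of $\mu_i$ which further is a function of $\beta$.

The contribution to the log-likelihood from the $i$-th observation is $l_i(\beta, \phi) = \log(f(Y_i; \theta_i))$ and the log-likelihood is

$$l(\beta, \phi) = \sum_{i=1}^{n} l_i(\beta, \phi) = \sum_{i=1}^{n} \left[ \frac{\theta_i Y_i - a(\theta_i)}{\phi} + \log(c(Y_i; \phi)) \right]$$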

Estimation of β

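Score function = $\partial l(\beta)/\partial\beta_j$

Component $j$ in the score function $s(\beta) = (s_1(\beta), \ldots, s_p(\beta))'$ is (see next slide)

$$s_j(\beta) = \frac{\partial l(\beta)}{\partial\beta_j} = \sum_{i=1}^{n} s_{ij}(\beta) = \frac{1}{\phi} \sum_{i=1}^{n} x_{ij} \frac{Y_i - \mu_i}{g'(\mu_i) V(\mu_i)}$$

The MLE $\hat\beta$ is found by solving

$$s_j(\hat\beta) = \frac{1}{\phi} \sum_{i=1}^{n} x_{ij} \frac{Y_i - \hat\mu_i}{g'(\hat\mu_i) V(\hat\mu_i)} = 0$$

for $j = 1, \ldots, p$, where $\hat\mu_i$ is the estimated expectation with $\beta = \hat\beta$

Can drop $\phi$ here, so the MLE of $\beta$ does not depend on the value of $\phi$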

Score-contribution

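The score-contribution from observation $i$ is found by the chain rule and the rule for the derivative of an inverse function:

$$s_{ij}(\beta) = \frac{\partial l_i(\beta)}{\partial\beta_j} = \frac{\partial\eta_i}{\partial\beta_j} \frac{\partial\mu_i}{\partial\eta_i} \frac{\partial\theta_i}{\partial\mu_i} \frac{\partial l_i}{\partial\theta_i}$$

where

$$\frac{\partial\eta_i}{\partial\beta_j} = x_{ij}$$

$$\frac{\partial\mu_i}{\partial\eta_i} = \frac{1}{\partial\eta_i/\partial\mu_i} = \frac{1}{\partial g(\mu_i)/\partial\mu_i} = \frac{1}{g'(\mu_i)}$$

$$\frac{\partial\theta_i}{\partial\mu_i} = \frac{1}{\partial\mu_i/\partial\theta_i} = \frac{1}{\partial a'(\theta_i)/\partial\theta_i} = \frac{1}{a''(\theta_i)} = \frac{1}{V(\mu_i)}$$

$$\frac{\partial l_i}{\partial\theta_i} = \frac{\partial[(\theta_i Y_i - a(\theta_i))/\phi + \log(c(Y_i; \phi))]}{\partial\theta_i} = \frac{Y_i - a'(\theta_i)}{\phi} = \frac{Y_i - \mu_i}{\phi}$$

This gives

$$s_{ij}(\beta) = \frac{1}{\phi}\, x_{ij} \frac{Y_i - \mu_i}{g'(\mu_i) V(\mu_i)}$$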

Score function cont.

$$s_j(\beta) = \frac{\partial l(\beta)}{\partial\beta_j} = \sum_{i=1}^{n} s_{ij}(\beta) = \frac{1}{\phi} \sum_{i=1}^{n} x_{ij} \frac{Y_i - \mu_i}{g'(\mu_i) V(\mu_i)}$$

Note that $E[s_j(\beta)] = 0$ since $E[Y_i - \mu_i] = 0$

These estimating equations cannot be solved analytically except in ordinary linear regression with normal distribution and identity link

They are therefore usually solved by numerical optimisation

Numerical optimisation

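Newton-Raphson:

$$\beta^{(s+1)} = \beta^{(s)} + [J(\beta^{(s)})]^{-1} s(\beta^{(s)})$$

$$s(\beta) = \frac{\partial l(\beta)}{\partial\beta} \quad \text{(score function)}$$

$$J(\beta) = -\frac{\partial^2 l(\beta)}{\partial\beta\,\partial\beta^T} \quad \text{(observed information matrix)}$$

The Fisher scoring algorithm:

$$\beta^{(s+1)} = \beta^{(s)} + [I(\beta^{(s)})]^{-1} s(\beta^{(s)})$$

$$I(\beta) = E[J(\beta)] \quad \text{(expected information matrix)}$$

$I(\beta)$ is also called Fisher information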

Observed information matrix

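$J(\beta) = \{J_{j,k}(\beta)\}$ where $J_{j,k}(\beta) = -\dfrac{\partial^2 l}{\partial\beta_j\,\partial\beta_k} = -\dfrac{\partial s_j}{\partial\beta_k}$

$$\begin{aligned}
-\frac{\partial s_j}{\partial\beta_k}
&= -\frac{1}{\phi} \sum_{i=1}^{n} x_{ij} \frac{\partial[(Y_i - \mu_i)/(g'(\mu_i)V(\mu_i))]}{\partial\beta_k} \\
&= -\frac{1}{\phi} \sum_{i=1}^{n} x_{ij} \frac{\partial\eta_i}{\partial\beta_k} \frac{\partial\mu_i}{\partial\eta_i} \frac{\partial[(Y_i - \mu_i)/(g'(\mu_i)V(\mu_i))]}{\partial\mu_i} \\
&= -\frac{1}{\phi} \sum_{i=1}^{n} x_{ij} x_{ik} \frac{1}{g'(\mu_i)} \frac{\partial[(Y_i - \mu_i)/(g'(\mu_i)V(\mu_i))]}{\partial\mu_i}
\end{aligned}$$

where

$$\begin{aligned}
\frac{\partial[(Y_i - \mu_i)/(g'(\mu_i)V(\mu_i))]}{\partial\mu_i}
&= \frac{1}{g'(\mu_i)V(\mu_i)} \frac{\partial(Y_i - \mu_i)}{\partial\mu_i} + (Y_i - \mu_i) \frac{\partial[1/(g'(\mu_i)V(\mu_i))]}{\partial\mu_i} \\
&= \frac{-1}{g'(\mu_i)V(\mu_i)} + (Y_i - \mu_i) \frac{\partial[1/(g'(\mu_i)V(\mu_i))]}{\partial\mu_i}
\end{aligned}$$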

Expected information matrix

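Since $E[Y_i - \mu_i] = 0$, the term

$$(Y_i - \mu_i) \frac{\partial[1/(g'(\mu_i)V(\mu_i))]}{\partial\mu_i}$$

in the observed information has expectation zero.

The expected information matrix is $I(\beta) = E[J(\beta)] = \{I_{jk}(\beta)\}$ with

$$I_{jk}(\beta) = \frac{1}{\phi} \sum_{i=1}^{n} x_{ij} x_{ik} \frac{1}{g'(\mu_i)^2 V(\mu_i)}$$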
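The score and expected-information formulas combine directly into Fisher scoring. The Python sketch below applies them to a Poisson model with log link, where $g'(\mu) = 1/\mu$ and $V(\mu) = \mu$, so that $g'(\mu)V(\mu) = 1$ and $g'(\mu)^2 V(\mu) = 1/\mu$ (with $\phi = 1$). The five data points are made up for illustration and are not from the lecture. The function names and the starting value are likewise assumptions.

```python
import math

# Hypothetical toy data (not from the lecture): counts y at covariate values x
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1, 2, 4, 7, 12]

def score_and_info(b0, b1):
    """Score vector and expected information for eta = b0 + b1 * x.

    Poisson, log link: s_j = sum x_ij (y_i - mu_i),
    I_jk = sum x_ij x_ik mu_i  (the intercept column is x_i1 = 1).
    """
    s0 = s1 = i00 = i01 = i11 = 0.0
    for xi, yi in zip(x, y):
        mu = math.exp(b0 + b1 * xi)
        s0 += yi - mu
        s1 += xi * (yi - mu)
        i00 += mu
        i01 += xi * mu
        i11 += xi * xi * mu
    return (s0, s1), (i00, i01, i11)

def fisher_scoring(b0, b1, max_iter=50, tol=1e-10):
    """Iterate beta <- beta + I(beta)^{-1} s(beta) until the step is tiny."""
    for _ in range(max_iter):
        (s0, s1), (i00, i01, i11) = score_and_info(b0, b1)
        det = i00 * i11 - i01 * i01
        # Explicit 2x2 solve of I * step = s
        step0 = (i11 * s0 - i01 * s1) / det
        step1 = (-i01 * s0 + i00 * s1) / det
        b0, b1 = b0 + step0, b1 + step1
        if max(abs(step0), abs(step1)) < tol:
            break
    return b0, b1

start = math.log(sum(y) / len(y))     # intercept-only fit as starting value
beta0_hat, beta1_hat = fisher_scoring(start, 0.0)
final_score, _ = score_and_info(beta0_hat, beta1_hat)
print(f"beta0 = {beta0_hat:.4f}, beta1 = {beta1_hat:.4f}")
```

Because the log link is canonical for the Poisson, the observed and expected information coincide here, so this is also Newton-Raphson. For a non-canonical link only the two sums inside `score_and_info` would change.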

Estimation of φ

• Maximum likelihood

• The ML estimate of $\beta$ does not depend on the value of $\phi$ and can be found as described above.

• Plugging $\hat\beta$ into the likelihood gives a profile likelihood for $\phi$, and the ML estimate of $\phi$ is found by maximising

$$l(\phi) = l(\hat\beta, \phi) = \sum_{i=1}^{n} \left[ \frac{\hat\theta_i y_i - a(\hat\theta_i)}{\phi} + \log(c(y_i; \phi)) \right]$$

• Can be maximised numerically

• $\phi$ can also be estimated by the moment method

“Large sample” theory

We want

• properties of estimates, including standard errors and

confidence intervals

• to test hypotheses

Large sample results for MLE of β in a GLM

• When $\beta$ is $p$-dimensional and the number of observations is large, we have

$$\hat\beta \approx N_p(\beta, I^{-1}(\beta))$$

• This can be used to construct confidence intervals for each element in $\beta$, for the linear predictor $\eta_i$ for given $x$-values and for $\mu_i = g^{-1}(\eta_i)$.

• In practice, plug the MLE $\hat\beta$ into $I^{-1}(\beta)$

• This is also the basis for the Wald test, see later

Large sample results for the score function in a GLM

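$$s(\beta) \approx N_p(0, I(\beta))$$

The normal distribution comes from the central limit theorem

The $j$-th component of $s$: $s_j(\beta) = \dfrac{1}{\phi} \sum_{i=1}^{n} x_{ij} \dfrac{Y_i - \mu_i}{g'(\mu_i) V(\mu_i)}$

$E[s(\beta)] = 0$ since $E(Y_i - \mu_i) = 0$

Proof for covariance matrix: since the $Y_i$ are independent and $\mathrm{Var}(Y_i) = \phi V(\mu_i)$,

$$\begin{aligned}
\mathrm{Cov}(s_j, s_k) = E(s_j \cdot s_k)
&= \frac{1}{\phi^2} \sum_{i=1}^{n} \frac{x_{ij} x_{ik}}{g'(\mu_i)^2 V(\mu_i)^2} \mathrm{Var}(Y_i - \mu_i) \\
&= \frac{1}{\phi} \sum_{i=1}^{n} \frac{x_{ij} x_{ik}}{g'(\mu_i)^2 V(\mu_i)} = I_{jk}
\end{aligned}$$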

Multivariate normal distribution

A $p$-dimensional vector $Y = (Y_1, \ldots, Y_p)'$ is multivariate normally distributed if we can write

$$Y = AZ + \mu,$$

where $Z' = (Z_1, \ldots, Z_p)$ is a vector of $p$ independent N(0,1) variables $Z_i$, $\mu' = (\mu_1, \ldots, \mu_p)$ an arbitrary $p$-dimensional vector of numbers and $A$ a non-singular matrix

$$E[Y] = A E[Z] + \mu = \mu$$

$$\mathrm{Var}[Y] = V = \mathrm{Var}[AZ] = E[(AZ)(AZ)^T] = E[A Z Z^T A^T] = A E[Z Z^T] A^T = A A^T$$

Therefore $Y \sim N_p(\mu, V)$

Distribution of (Y − µ)TV−1(Y − µ)

$$(Y - \mu)^T V^{-1} (Y - \mu) = (AZ)^T (A A^T)^{-1} (AZ) = Z^T Z = \sum_{i=1}^{p} Z_i^2 \sim \chi^2_p$$

i.e. chi-square distributed with $p$ degrees of freedom since the $Z_i \sim N(0,1)$ are independent
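The algebraic step $(Y - \mu)^T V^{-1} (Y - \mu) = Z^T Z$ holds for every realisation of $Z$, not only in distribution. A small Python sketch with a hypothetical $2 \times 2$ matrix $A$, vector $\mu$ and a fixed vector standing in for $Z$ (all values chosen only for illustration) makes this concrete.

```python
# Hypothetical 2x2 example; A must be non-singular
A = [[2.0, 1.0],
     [0.5, 3.0]]
mu = [1.0, -2.0]
z = [0.3, -1.2]   # stands in for one realisation of Z

def matvec(M, v):
    """Product of a 2x2 matrix and a length-2 vector."""
    return [M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1]]

def inverse_2x2(M):
    """Inverse of a non-singular 2x2 matrix."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[ M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det,  M[0][0] / det]]

# Y = A Z + mu
Az = matvec(A, z)
y = [Az[0] + mu[0], Az[1] + mu[1]]

# V = A A^T
V = [[sum(A[i][k] * A[j][k] for k in range(2)) for j in range(2)]
     for i in range(2)]

# Quadratic form (Y - mu)^T V^{-1} (Y - mu)
d = [y[0] - mu[0], y[1] - mu[1]]
Vinv_d = matvec(inverse_2x2(V), d)
quad_form = d[0] * Vinv_d[0] + d[1] * Vinv_d[1]

sum_of_squares = z[0] ** 2 + z[1] ** 2   # Z^T Z
print(quad_form, sum_of_squares)
```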

Hypothesis testing

Various hypotheses:

• $H_0: \beta_j = \beta_j^*$ : one parameter equals a specified value, e.g. 0

• $H_0: \beta = \beta^*$ : all parameters equal specified values

• $H_0: \beta_{p-q+1} = \beta_{p-q+2} = \cdots = \beta_p = 0$ where $q < p$ : some parameters equal 0

• $H_0: \beta_j = \beta_i$ : two parameters are equal

Three possible tests

• Wald test

• Score test

• Likelihood ratio test

All tests are based on statistics that are approximately $\chi^2_q$ distributed under $H_0$

Hypothesis testing cont.

The hypotheses can be generalised as a restriction $C\beta = r$ where $C$ is a known $q \times p$ matrix and $r$ is a $q$-dimensional vector of known values

• $H_0: \beta_j = \beta_j^*$ : $C$ a row vector with 1 in column $j$ and 0 elsewhere, $r = \beta_j^*$

• $H_0: \beta = \beta^*$ : $C$ the $p \times p$ identity matrix, $r = \beta^*$

• $H_0: \beta_{p-q+1} = \beta_{p-q+2} = \cdots = \beta_p = 0$ where $q < p$ : $C$ a $q \times p$ matrix where all elements in the first $p - q$ columns are 0 and the last $q$ columns form the $q \times q$ identity matrix, $r$ a $q$-dimensional vector of 0-s

• $H_0: \beta_j = \beta_i$ : $C$ a row vector with 1 at element $j$ and -1 at element $i$ and 0 elsewhere, $r = 0$

Wald test

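In general, when $Y \sim N_p(\mu, V)$ then

$$(Y - \mu)' V^{-1} (Y - \mu) \sim \chi^2_p$$

Since $\hat\beta \approx N_p(\beta, I^{-1}(\beta))$ the Wald test statistic with $I = I(\hat\beta)$ (and also plugged in a consistent estimate of $\phi$ if needed) is

$$(\hat\beta - \beta)^T I (\hat\beta - \beta) \approx \chi^2_p$$

and, for the restriction $C\beta = r$,

$$(C\hat\beta - r)^T [C I^{-1} C^T]^{-1} (C\hat\beta - r) \approx \chi^2_q$$

The hypothesis is rejected if the test statistic is large

The MLE $\hat\beta$ of the full model is needed, and also a consistent estimate of $\phi$ if $\phi$ is unknown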

Likelihood ratio test

• Let $\hat\beta$ denote the MLE in the full (unrestricted) model and $\tilde\beta$ the MLE in the restricted model where $C\beta = r$

• Then the likelihood ratio statistic is

$$2 \log[L(\hat\beta)/L(\tilde\beta)] = 2[l(\hat\beta) - l(\tilde\beta)] \approx \chi^2_q$$

• Both $\hat\beta$ and $\tilde\beta$ are needed

• If $\phi$ is unknown, one has to use the same consistent estimate of $\phi$ in the two log-likelihoods

Score test

Since $s(\beta) \approx N_p(0, I(\beta))$

the score test statistic with $I = I(\tilde\beta)$ (and also plugged in a consistent estimate of $\phi$ if needed) is

$$s(\tilde\beta)^T I^{-1} s(\tilde\beta) \approx \chi^2_q$$

The MLE $\tilde\beta$ of the restricted model is needed, and a consistent estimate of $\phi$ if $\phi$ is unknown

Comparing test properties

• Wald-, score- and likelihood ratio (LR) tests are

asymptotically equivalent, but can give different results

when there are few observations (small sample)

• The score and LR tests have in general better small sample

properties than the Wald test

• The Wald test corresponds to the usual confidence intervals

for βj

• Since the LR test can be computed directly from the likelihood (without an estimate of the Fisher information), it is simple to use

• The score test may be better for one-sided tests

Ex: Deadly dose of poison for beetles

$Y_i$ = number of dead beetles out of $n_i$ treated with dose $x_i$

Model:

$Y_i \sim \mathrm{Bin}(n_i, \pi_i)$ with

$$\pi_i = \frac{\exp(\beta_0 + \beta_1 x_i)}{1 + \exp(\beta_0 + \beta_1 x_i)}$$

Want to test $H_0: \beta_1 = 0$

Beetles-ex cont: - Output from R

> glmfit0biller<-glm(cbind(Dode,Ant-Dode)~Dose,family=binomial)
> summary(glmfit0biller)
glm(formula = cbind(Dode, Ant - Dode) ~ Dose, family = binomial)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.5941  -0.3944   0.8329   1.2592   1.5940

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -60.717      5.181  -11.72   <2e-16 ***
Dose          34.270      2.912   11.77   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 284.202 on 7 degrees of freedom
Residual deviance: 11.232 on 6 degrees of freedom
AIC: 41.43

Beetles-ex cont: Wald test

• $\hat\beta_1 = 34.27$

• $se_1 = 2.912$

• $z = \hat\beta_1/se_1 = 11.77$

• $z^2 = 138.5$

• $P(Z^2 > 138.5)$ = 1-pchisq(138.5,1), which is 0 to machine precision
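The arithmetic of this Wald test can be reproduced from the two numbers in the summary() table. The Python sketch below does so, and adds the matching 95% Wald confidence interval $\hat\beta_1 \pm z_{0.975}\, se_1$ as an illustration of the correspondence between the Wald test and the usual confidence interval. For one degree of freedom it uses the identity $P(\chi^2_1 > w) = \mathrm{erfc}(\sqrt{w/2})$ in place of R's pchisq.

```python
import math
from statistics import NormalDist

beta1_hat = 34.270   # estimate from the summary() output
se1 = 2.912          # its standard error

z = beta1_hat / se1
wald = z ** 2

# For 1 degree of freedom: P(chi2_1 > w) = P(|Z| > sqrt(w)) = erfc(sqrt(w / 2))
p_value = math.erfc(math.sqrt(wald / 2.0))

# Matching 95% Wald interval: estimate +/- z_{0.975} * se
z975 = NormalDist().inv_cdf(0.975)
ci_low = beta1_hat - z975 * se1
ci_high = beta1_hat + z975 * se1

print(f"z = {z:.2f}, z^2 = {wald:.1f}, p = {p_value:.2e}")
print(f"95% CI for beta1: ({ci_low:.2f}, {ci_high:.2f})")
```

The interval lies far from 0, in line with the rejection of $H_0: \beta_1 = 0$.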

Beetles-ex cont. Likelihood ratio test

Estimate the two models and extract the log-likelihoods with the logLik function:

> fit0<-glm(cbind(Dode,Ant-Dode)~1,family=binomial)
> logLik(fit0)
'log Lik.' -155.2002 (df=1)
> fit1<-glm(cbind(Dode,Ant-Dode)~Dose,family=binomial)
> logLik(fit1)
'log Lik.' -18.71513 (df=2)
> 2*(logLik(fit1)-logLik(fit0))
[1] 272.9702
attr(,"df")
attr(,"class")
[1] "logLik"

Note: The test statistic value G = 272.97 is also given as the difference between the "Null Deviance" and "Residual Deviance" in the R output for the full model
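Both routes to the likelihood ratio statistic can be checked from the numbers printed above. The Python sketch below computes $G$ from the two log-likelihoods and from the two deviances, then compares it with a $\chi^2$ distribution on $q = 1$ degree of freedom (one restriction, $\beta_1 = 0$), again using $P(\chi^2_1 > G) = \mathrm{erfc}(\sqrt{G/2})$.

```python
import math

loglik_null = -155.2002   # logLik(fit0): intercept only
loglik_full = -18.71513   # logLik(fit1): intercept + Dose

# Likelihood ratio statistic 2 * [l(full) - l(restricted)]
G = 2.0 * (loglik_full - loglik_null)

# The same quantity from the deviances in the summary() output
null_deviance = 284.202
residual_deviance = 11.232
G_from_deviance = null_deviance - residual_deviance

# One restriction, so compare with chi-square on 1 df
p_value = math.erfc(math.sqrt(G / 2.0))

print(f"G = {G:.4f}, from deviances = {G_from_deviance:.3f}, p = {p_value:.2e}")
```

The LR statistic is roughly twice the Wald statistic for the same hypothesis, a reminder that the tests are only asymptotically equivalent. Both reject $H_0$ decisively here.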
