
Appendix B

Generalized Linear Model Theory

We describe the generalized linear model as formulated by Nelder and Wedderburn (1972), and discuss estimation of the parameters and tests of hypotheses.

    B.1 The Model

Let $y_1, \dots, y_n$ denote $n$ independent observations on a response. We treat $y_i$ as a realization of a random variable $Y_i$. In the general linear model we assume that $Y_i$ has a normal distribution with mean $\mu_i$ and variance $\sigma^2$

$$Y_i \sim N(\mu_i, \sigma^2),$$

and we further assume that the expected value $\mu_i$ is a linear function of $p$ predictors that take values $\mathbf{x}_i' = (x_{i1}, \dots, x_{ip})$ for the $i$-th case, so that

$$\mu_i = \mathbf{x}_i' \boldsymbol\beta,$$

where $\boldsymbol\beta$ is a vector of unknown parameters.

We will generalize this in two steps, dealing with the stochastic and systematic components of the model.

G. Rodríguez. Revised November 2001


    B.1.1 The Exponential Family

We will assume that the observations come from a distribution in the exponential family with probability density function

$$f(y_i) = \exp\left\{ \frac{y_i \theta_i - b(\theta_i)}{a_i(\phi)} + c(y_i, \phi) \right\}. \tag{B.1}$$

Here $\theta_i$ and $\phi$ are parameters and $a_i(\phi)$, $b(\theta_i)$ and $c(y_i, \phi)$ are known functions. In all models considered in these notes the function $a_i(\phi)$ has the form

$$a_i(\phi) = \phi / p_i,$$

where $p_i$ is a known prior weight, usually 1. The parameters $\theta_i$ and $\phi$ are essentially location and scale parameters.

It can be shown that if $Y_i$ has a distribution in the exponential family then it has mean and variance

$$E(Y_i) = \mu_i = b'(\theta_i), \tag{B.2}$$

$$\mathrm{var}(Y_i) = \sigma_i^2 = b''(\theta_i)\,a_i(\phi), \tag{B.3}$$

where $b'(\theta_i)$ and $b''(\theta_i)$ are the first and second derivatives of $b(\theta_i)$. When $a_i(\phi) = \phi/p_i$ the variance has the simpler form

$$\mathrm{var}(Y_i) = \sigma_i^2 = \phi\, b''(\theta_i)/p_i.$$

The exponential family just defined includes as special cases the normal, binomial, Poisson, exponential, gamma and inverse Gaussian distributions.

Example: The normal distribution has density

$$f(y_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2}\frac{(y_i - \mu_i)^2}{\sigma^2} \right\}.$$

Expanding the square in the exponent we get $(y_i - \mu_i)^2 = y_i^2 + \mu_i^2 - 2y_i\mu_i$, so the coefficient of $y_i$ is $\mu_i/\sigma^2$. This result identifies $\theta_i$ as $\mu_i$ and $\phi$ as $\sigma^2$, with $a_i(\phi) = \phi$. Now write

$$f(y_i) = \exp\left\{ \frac{y_i\mu_i - \frac{1}{2}\mu_i^2}{\sigma^2} - \frac{y_i^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2) \right\}.$$

This shows that $b(\theta_i) = \frac{1}{2}\theta_i^2$ (recall that $\theta_i = \mu_i$). Let us check the mean and variance:

$$E(Y_i) = b'(\theta_i) = \theta_i = \mu_i,$$

$$\mathrm{var}(Y_i) = b''(\theta_i)\,a_i(\phi) = \sigma^2.$$
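These derivatives are easy to verify by machine. A minimal sketch, assuming the sympy symbolic algebra package is available (the variable names are mine):

```python
import sympy as sp

theta, phi = sp.symbols('theta phi')
b = theta**2 / 2               # normal cumulant function b(theta)
a = phi                        # a_i(phi) = phi, with phi = sigma^2

print(sp.diff(b, theta))       # b'(theta) = theta = mu, as in (B.2)
print(sp.diff(b, theta, 2)*a)  # b''(theta) a(phi) = phi = sigma^2, as in (B.3)
```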


Try to generalize this result to the case where $Y_i$ has a normal distribution with mean $\mu_i$ and variance $\sigma^2/n_i$ for known constants $n_i$, as would be the case if the $Y_i$ represented sample means.

Example: In Problem Set 1 you will show that the exponential distribution with density

$$f(y_i) = \lambda_i \exp\{-\lambda_i y_i\}$$

belongs to the exponential family.

In Sections B.4 and B.5 we verify that the binomial and Poisson distributions also belong to this family.

    B.1.2 The Link Function

The second element of the generalization is that instead of modeling the mean, as before, we will introduce a one-to-one continuous differentiable transformation $g(\mu_i)$ and focus on

$$\eta_i = g(\mu_i). \tag{B.4}$$

The function $g(\mu_i)$ will be called the link function. Examples of link functions include the identity, log, reciprocal, logit and probit.

We further assume that the transformed mean follows a linear model, so that

$$\eta_i = \mathbf{x}_i' \boldsymbol\beta. \tag{B.5}$$

The quantity $\eta_i$ is called the linear predictor. Note that the model for $\eta_i$ is pleasantly simple. Since the link function is one-to-one we can invert it to obtain

$$\mu_i = g^{-1}(\mathbf{x}_i' \boldsymbol\beta).$$
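To make the link and its inverse concrete, here is a minimal numpy sketch of the logit pair used later for binomial data (the function names are mine, not from any particular library):

```python
import numpy as np

def logit(mu):
    """Link g: maps a probability in (0, 1) to the whole real line."""
    return np.log(mu / (1 - mu))

def inv_logit(eta):
    """Inverse link g^{-1}: maps a linear predictor back into (0, 1)."""
    return 1 / (1 + np.exp(-eta))

eta = np.array([-2.0, 0.0, 2.0])    # linear predictor x_i' beta
mu = inv_logit(eta)                 # mu_i = g^{-1}(x_i' beta)
print(np.allclose(logit(mu), eta))  # True: g is one-to-one
```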

The model for $\mu_i$ is usually more complicated than the model for $\eta_i$.

Note that we do not transform the response $y_i$, but rather its expected value $\mu_i$. A model where $\log y_i$ is linear on $\mathbf{x}_i$, for example, is not the same as a generalized linear model where $\log \mu_i$ is linear on $\mathbf{x}_i$.

Example: The standard linear model we have studied so far can be described as a generalized linear model with normal errors and identity link, so that

$$\eta_i = \mu_i.$$

It also happens that $\mu_i$, and therefore $\eta_i$, is the same as $\theta_i$, the parameter in the exponential family density.


When the link function makes the linear predictor $\eta_i$ the same as the canonical parameter $\theta_i$, we say that we have a canonical link. The identity is the canonical link for the normal distribution. In later sections we will see that the logit is the canonical link for the binomial distribution and the log is the canonical link for the Poisson distribution. This leads to some natural pairings:

    Error      Link
    Normal     Identity
    Binomial   Logit
    Poisson    Log

However, other combinations are also possible. An advantage of canonical links is that a minimal sufficient statistic for $\boldsymbol\beta$ exists, i.e. all the information about $\boldsymbol\beta$ is contained in a function of the data of the same dimensionality as $\boldsymbol\beta$.

    B.2 Maximum Likelihood Estimation

An important practical feature of generalized linear models is that they can all be fit to data using the same algorithm, a form of iteratively re-weighted least squares. In this section we describe the algorithm.

Given a trial estimate of the parameters $\hat{\boldsymbol\beta}$, we calculate the estimated linear predictor $\hat\eta_i = \mathbf{x}_i'\hat{\boldsymbol\beta}$ and use that to obtain the fitted values $\hat\mu_i = g^{-1}(\hat\eta_i)$. Using these quantities, we calculate the working dependent variable

$$z_i = \hat\eta_i + (y_i - \hat\mu_i)\frac{d\eta_i}{d\mu_i}, \tag{B.6}$$

where the rightmost term is the derivative of the link function evaluated at the trial estimate.

Next we calculate the iterative weights

$$w_i = \frac{p_i}{b''(\theta_i)\left(d\eta_i/d\mu_i\right)^2}, \tag{B.7}$$

where $b''(\theta_i)$ is the second derivative of $b(\theta_i)$ evaluated at the trial estimate and we have assumed that $a_i(\phi)$ has the usual form $\phi/p_i$. This weight is inversely proportional to the variance of the working dependent variable $z_i$ given the current estimates of the parameters, with proportionality factor $\phi$.


Finally, we obtain an improved estimate of $\boldsymbol\beta$ by regressing the working dependent variable $z_i$ on the predictors $\mathbf{x}_i$ using the weights $w_i$, i.e. we calculate the weighted least-squares estimate

$$\hat{\boldsymbol\beta} = (\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}\mathbf{z}, \tag{B.8}$$

where $\mathbf{X}$ is the model matrix, $\mathbf{W}$ is a diagonal matrix of weights with entries $w_i$ given by (B.7), and $\mathbf{z}$ is a response vector with entries $z_i$ given by (B.6).

The procedure is repeated until successive estimates change by less than a specified small amount. McCullagh and Nelder (1989) prove that this algorithm is equivalent to Fisher scoring and leads to maximum likelihood estimates. These authors consider the case of general $a_i(\phi)$ and include $\phi$ in their expression for the iterative weight. In other words, they use $w_i^* = w_i/\phi$, where $w_i$ is the weight used here. The proportionality factor cancels out when you calculate the weighted least-squares estimates using (B.8), so the estimator is exactly the same. I prefer to show $\phi$ explicitly rather than include it in $\mathbf{W}$.

Example: For normal data with identity link $\eta_i = \mu_i$, so the derivative is $d\eta_i/d\mu_i = 1$ and the working dependent variable is $y_i$ itself. Since in addition $b''(\theta_i) = 1$ and $p_i = 1$, the weights are constant and no iteration is required.

In Sections B.4 and B.5 we derive the working dependent variable and the iterative weights required for binomial data with link logit and for Poisson data with link log. In both cases iteration will usually be necessary.

Starting values may be obtained by applying the link to the data, i.e. we take $\hat\mu_i = y_i$ and $\hat\eta_i = g(\hat\mu_i)$. Sometimes this requires a few adjustments, for example to avoid taking the log of zero, and we will discuss these at the appropriate time.
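The algorithm fits in a few lines of numpy. The following is a sketch of my own, not code from these notes: it implements (B.6)-(B.8) for any family described by its inverse link, link derivative and variance function $b''(\theta_i)$, and for simplicity starts at $\boldsymbol\beta = \mathbf{0}$ rather than by applying the link to the data.

```python
import numpy as np

def irls(X, y, inv_link, deta_dmu, variance, p=None, tol=1e-8, max_iter=25):
    """Iteratively re-weighted least squares, equations (B.6)-(B.8).

    inv_link : g^{-1}, maps the linear predictor eta to the mean mu
    deta_dmu : derivative of the link function, d eta / d mu
    variance : b''(theta) expressed as a function of mu
    p        : prior weights p_i (defaults to 1)
    """
    p = np.ones(len(y)) if p is None else p
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta                    # estimated linear predictor
        mu = inv_link(eta)                # fitted values
        d = deta_dmu(mu)                  # link derivative at trial estimate
        z = eta + (y - mu) * d            # working dependent variable (B.6)
        w = p / (variance(mu) * d**2)     # iterative weights (B.7)
        XtW = X.T * w                     # X'W without forming diag(w)
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)  # WLS step (B.8)
        if np.max(np.abs(beta_new - beta)) < tol:
            break
        beta = beta_new
    return beta_new
```

For normal data with identity link all three ingredient functions are constant, so a single step reproduces ordinary least squares, in line with the example above.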

    B.3 Tests of Hypotheses

We consider Wald tests and likelihood ratio tests, introducing the deviance statistic.

    B.3.1 Wald Tests

The Wald test follows immediately from the fact that the information matrix for generalized linear models is given by

$$\mathbf{I}(\boldsymbol\beta) = \mathbf{X}'\mathbf{W}\mathbf{X}/\phi, \tag{B.9}$$


so the large sample distribution of the maximum likelihood estimator $\hat{\boldsymbol\beta}$ is multivariate normal

$$\hat{\boldsymbol\beta} \sim N_p(\boldsymbol\beta, (\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\phi), \tag{B.10}$$

with mean $\boldsymbol\beta$ and variance-covariance matrix $(\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\phi$.

Tests for subsets of $\boldsymbol\beta$ are based on the corresponding marginal normal distributions.

Example: In the case of normal errors with identity link we have $\mathbf{W} = \mathbf{I}$ (where $\mathbf{I}$ denotes the identity matrix), $\phi = \sigma^2$, and the exact distribution of $\hat{\boldsymbol\beta}$ is multivariate normal with mean $\boldsymbol\beta$ and variance-covariance matrix $(\mathbf{X}'\mathbf{X})^{-1}\sigma^2$.
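In code, Wald standard errors come straight from inverting (B.9) at the converged fit. A small sketch of my own, taking the model matrix `X`, final weights `w` and scale `phi` from an IRLS fit such as the one above:

```python
import numpy as np

def wald_se(X, w, phi=1.0):
    """Standard errors from the inverse information matrix, (B.9)-(B.10)."""
    info = (X.T * w) @ X / phi   # I(beta) = X'WX / phi
    cov = np.linalg.inv(info)    # var-cov matrix of beta-hat
    return np.sqrt(np.diag(cov))
```

A Wald $z$-statistic for a single coefficient is then the estimate divided by its standard error.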

    B.3.2 Likelihood Ratio Tests and The Deviance

We will show how the likelihood ratio criterion for comparing any two nested models, say $\omega_1 \subset \omega_2$, can be constructed in terms of a statistic called the deviance and an unknown scale parameter $\phi$.

Consider first comparing a model of interest $\omega$ with a saturated model $\Omega$ that provides a separate parameter for each observation.

Let $\hat\mu_i$ denote the fitted values under $\omega$ and let $\hat\theta_i$ denote the corresponding estimates of the canonical parameters. Similarly, let $\tilde\mu_i = y_i$ and $\tilde\theta_i$ denote the corresponding estimates under $\Omega$.

The likelihood ratio criterion to compare these two models in the exponential family has the form

$$-2\log\lambda = 2\sum_{i=1}^{n} \frac{y_i(\tilde\theta_i - \hat\theta_i) - b(\tilde\theta_i) + b(\hat\theta_i)}{a_i(\phi)}.$$

Assume as usual that $a_i(\phi) = \phi/p_i$ for known prior weights $p_i$. Then we can write the likelihood-ratio criterion as follows:

$$-2\log\lambda = \frac{D(\mathbf{y}, \hat{\boldsymbol\mu})}{\phi}. \tag{B.11}$$

The numerator of this expression does not depend on unknown parameters and is called the deviance:

$$D(\mathbf{y}, \hat{\boldsymbol\mu}) = 2\sum_{i=1}^{n} p_i\left[y_i(\tilde\theta_i - \hat\theta_i) - b(\tilde\theta_i) + b(\hat\theta_i)\right]. \tag{B.12}$$

The likelihood ratio criterion $-2\log\lambda$ is the deviance divided by the scale parameter $\phi$, and is called the scaled deviance.


Example: Recall that for the normal distribution we had $\theta_i = \mu_i$, $b(\theta_i) = \frac{1}{2}\theta_i^2$, and $a_i(\phi) = \sigma^2$, so the prior weights are $p_i = 1$. Thus, the deviance is

$$D(\mathbf{y}, \hat{\boldsymbol\mu}) = 2\sum\left\{y_i(y_i - \hat\mu_i) - \tfrac{1}{2}y_i^2 + \tfrac{1}{2}\hat\mu_i^2\right\} = 2\sum\left\{\tfrac{1}{2}y_i^2 - y_i\hat\mu_i + \tfrac{1}{2}\hat\mu_i^2\right\} = \sum(y_i - \hat\mu_i)^2,$$

our good old friend, the residual sum of squares.

Let us now return to the comparison of two nested models $\omega_1$, with $p_1$ parameters, and $\omega_2$, with $p_2$ parameters, such that $\omega_1 \subset \omega_2$ and $p_2 > p_1$.

The log of the ratio of maximized likelihoods under the two models can be written as a difference of deviances, since the maximized log-likelihood under the saturated model cancels out. Thus, we have

$$-2\log\lambda = \frac{D(\omega_1) - D(\omega_2)}{\phi}. \tag{B.13}$$

The scale parameter $\phi$ is either known or estimated using the larger model $\omega_2$.

Large sample theory tells us that the asymptotic distribution of this criterion under the usual regularity conditions is $\chi^2_\nu$ with $\nu = p_2 - p_1$ degrees of freedom.
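Computationally the test is one line of arithmetic. A sketch, assuming scipy is available and that the deviances and parameter counts come from two nested fits:

```python
from scipy import stats

def lr_test(dev1, dev2, p1, p2, phi=1.0):
    """Likelihood ratio test for nested models, equation (B.13).

    dev1, dev2 : deviances of the smaller and larger model
    p1, p2     : numbers of parameters, with p2 > p1
    phi        : scale parameter, known or estimated from the larger model
    """
    chi2 = (dev1 - dev2) / phi
    return chi2, stats.chi2.sf(chi2, p2 - p1)  # statistic and p-value
```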

Example: In the linear model with normal errors we estimate the unknown scale parameter $\phi$ using the residual sum of squares of the larger model, so the criterion becomes

$$-2\log\lambda = \frac{\mathrm{RSS}(\omega_1) - \mathrm{RSS}(\omega_2)}{\mathrm{RSS}(\omega_2)/(n - p_2)}.$$

In large samples the approximate distribution of this criterion is $\chi^2_\nu$ with $\nu = p_2 - p_1$ degrees of freedom. Under normality, however, we have an exact result: dividing the criterion by $p_2 - p_1$ we obtain an $F$ with $p_2 - p_1$ and $n - p_2$ degrees of freedom. Note that as $n \to \infty$ the degrees of freedom in the denominator approach $\infty$ and the $F$ converges to $\chi^2_{p_2-p_1}/(p_2 - p_1)$, so the asymptotic and exact criteria become equivalent.

In Sections B.4 and B.5 we will construct likelihood ratio tests for binomial and Poisson data. In those cases $\phi = 1$ (unless one allows over-dispersion and estimates $\phi$, but that's another story) and the deviance is the same as the scaled deviance. All our tests will be based on asymptotic $\chi^2$ statistics.


    B.4 Binomial Errors and Link Logit

We apply the theory of generalized linear models to the case of binary data, and in particular to logistic regression models.

    B.4.1 The Binomial Distribution

First we verify that the binomial distribution $B(n_i, \pi_i)$ belongs to the exponential family of Nelder and Wedderburn (1972). The binomial probability distribution function (p.d.f.) is

$$f_i(y_i) = \binom{n_i}{y_i}\pi_i^{y_i}(1-\pi_i)^{n_i-y_i}. \tag{B.14}$$

Taking logs we find that

$$\log f_i(y_i) = y_i\log(\pi_i) + (n_i - y_i)\log(1-\pi_i) + \log\binom{n_i}{y_i}.$$

Collecting terms on $y_i$ we can write

$$\log f_i(y_i) = y_i\log\left(\frac{\pi_i}{1-\pi_i}\right) + n_i\log(1-\pi_i) + \log\binom{n_i}{y_i}.$$

This expression has the general exponential form

$$\log f_i(y_i) = \frac{y_i\theta_i - b(\theta_i)}{a_i(\phi)} + c(y_i, \phi)$$

with the following equivalences. Looking first at the coefficient of $y_i$ we note that the canonical parameter is the logit of $\pi_i$:

$$\theta_i = \log\left(\frac{\pi_i}{1-\pi_i}\right). \tag{B.15}$$

Solving for $\pi_i$ we see that

$$\pi_i = \frac{e^{\theta_i}}{1+e^{\theta_i}}, \quad \text{so} \quad 1-\pi_i = \frac{1}{1+e^{\theta_i}}.$$

If we rewrite the second term in the p.d.f. as a function of $\theta_i$, so that $\log(1-\pi_i) = -\log(1+e^{\theta_i})$, we can identify the cumulant function $b(\theta_i)$ as

$$b(\theta_i) = n_i\log(1+e^{\theta_i}).$$


The remaining term in the p.d.f. is a function of $y_i$ but not $\pi_i$, leading to

$$c(y_i, \phi) = \log\binom{n_i}{y_i}.$$

Note finally that we may set $a_i(\phi) = \phi$ and $\phi = 1$.

Let us verify the mean and variance. Differentiating $b(\theta_i)$ with respect to $\theta_i$ we find that

$$\mu_i = b'(\theta_i) = n_i\frac{e^{\theta_i}}{1+e^{\theta_i}} = n_i\pi_i,$$

in agreement with what we knew from elementary statistics. Differentiating again using the quotient rule, we find that

$$v_i = a_i(\phi)\,b''(\theta_i) = n_i\frac{e^{\theta_i}}{(1+e^{\theta_i})^2} = n_i\pi_i(1-\pi_i),$$

again in agreement with what we knew before.

In this development I have worked with the binomial count $y_i$, which takes values $0(1)n_i$. McCullagh and Nelder (1989) work with the proportion $p_i = y_i/n_i$, which takes values $0(1/n_i)1$. This explains the differences between my results and their Table 2.1.
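Both derivatives can be checked symbolically. A minimal sketch, again assuming sympy:

```python
import sympy as sp

theta, n = sp.symbols('theta n', positive=True)
b = n * sp.log(1 + sp.exp(theta))         # binomial cumulant function
pi = sp.exp(theta) / (1 + sp.exp(theta))  # inverse of the logit (B.15)

mu = sp.diff(b, theta)                    # b'(theta)
v = sp.diff(b, theta, 2)                  # b''(theta)
print(sp.simplify(mu - n*pi))             # 0: the mean is n_i pi_i
print(sp.simplify(v - n*pi*(1 - pi)))     # 0: the variance is n_i pi_i (1 - pi_i)
```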

    B.4.2 Fisher Scoring in Logistic Regression

Let us now find the working dependent variable and the iterative weight used in the Fisher scoring algorithm for estimating the parameters in logistic regression, where we model

$$\eta_i = \mathrm{logit}(\pi_i). \tag{B.16}$$

It will be convenient to write the link function in terms of the mean $\mu_i = n_i\pi_i$, as:

$$\eta_i = \log\left(\frac{\pi_i}{1-\pi_i}\right) = \log\left(\frac{\mu_i}{n_i-\mu_i}\right),$$

which can also be written as $\eta_i = \log(\mu_i) - \log(n_i - \mu_i)$.

Differentiating with respect to $\mu_i$ we find that

$$\frac{d\eta_i}{d\mu_i} = \frac{1}{\mu_i} + \frac{1}{n_i-\mu_i} = \frac{n_i}{\mu_i(n_i-\mu_i)} = \frac{1}{n_i\pi_i(1-\pi_i)}.$$

The working dependent variable, which in general is

$$z_i = \eta_i + (y_i - \mu_i)\frac{d\eta_i}{d\mu_i},$$


turns out to be

$$z_i = \eta_i + \frac{y_i - n_i\pi_i}{n_i\pi_i(1-\pi_i)}. \tag{B.17}$$

The iterative weight turns out to be

$$w_i = 1\left/\left[b''(\theta_i)\left(\frac{d\eta_i}{d\mu_i}\right)^2\right]\right. = \frac{1}{n_i\pi_i(1-\pi_i)}\left[n_i\pi_i(1-\pi_i)\right]^2,$$

and simplifies to

$$w_i = n_i\pi_i(1-\pi_i). \tag{B.18}$$

Note that the weight is inversely proportional to the variance of the working dependent variable. The results here agree exactly with the results in Chapter 4 of McCullagh and Nelder (1989).
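Specializing the generic IRLS sketch from Section B.2 gives a working logistic regression fitter in a few lines. This is my own illustration with made-up data; `irls` is the hypothetical helper defined earlier:

```python
import numpy as np

# Grouped binomial data: y_i successes out of n_i trials.
ni = np.array([10., 15., 12., 20.])
y = np.array([2., 7., 9., 15.])
X = np.column_stack([np.ones(4), [0., 1., 2., 3.]])  # intercept + predictor

beta = irls(
    X, y,
    inv_link=lambda eta: ni / (1 + np.exp(-eta)),  # mu = n_i pi
    deta_dmu=lambda mu: ni / (mu * (ni - mu)),     # link derivative
    variance=lambda mu: mu * (ni - mu) / ni,       # b'' = n_i pi (1 - pi)
)
print(beta)  # maximum likelihood estimates
```

Here the weights computed inside `irls` reduce to $n_i\pi_i(1-\pi_i)$, exactly (B.18).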

Exercise: Obtain analogous results for probit analysis, where one models

$$\eta_i = \Phi^{-1}(\pi_i),$$

where $\Phi(\cdot)$ is the standard normal cdf. Hint: To calculate the derivative of the link function find $d\pi_i/d\eta_i$ and take reciprocals.

    B.4.3 The Binomial Deviance

Finally, let us figure out the binomial deviance. Let $\hat\mu_i$ denote the m.l.e. of $\mu_i$ under the model of interest, and let $\tilde\mu_i = y_i$ denote the m.l.e. under the saturated model. From first principles,

$$D = 2\sum\left[y_i\log\left(\frac{y_i}{n_i}\right) + (n_i-y_i)\log\left(\frac{n_i-y_i}{n_i}\right) - y_i\log\left(\frac{\hat\mu_i}{n_i}\right) - (n_i-y_i)\log\left(\frac{n_i-\hat\mu_i}{n_i}\right)\right].$$

Note that all terms involving $\log(n_i)$ cancel out. Collecting terms on $y_i$ and on $n_i - y_i$ we find that

$$D = 2\sum\left[y_i\log\left(\frac{y_i}{\hat\mu_i}\right) + (n_i-y_i)\log\left(\frac{n_i-y_i}{n_i-\hat\mu_i}\right)\right]. \tag{B.19}$$

Alternatively, you may obtain this result from the general form of the deviance given in Section B.3.
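A direct translation of (B.19), with the usual convention that a term with $y_i = 0$ or $y_i = n_i$ contributes zero (a sketch of my own):

```python
import numpy as np

def binomial_deviance(y, mu, n):
    """Binomial deviance, equation (B.19), taking 0*log(0) as 0."""
    with np.errstate(divide='ignore', invalid='ignore'):
        t1 = np.where(y > 0, y * np.log(y / mu), 0.0)
        t2 = np.where(n - y > 0, (n - y) * np.log((n - y) / (n - mu)), 0.0)
    return 2 * np.sum(t1 + t2)
```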


Note that the binomial deviance has the form

$$D = 2\sum o_i\log\left(\frac{o_i}{e_i}\right),$$

where $o_i$ denotes observed, $e_i$ denotes expected (under the model of interest) and the sum is over both successes and failures for each $i$ (i.e. we have a contribution from $y_i$ and one from $n_i - y_i$).

For grouped data the deviance has an asymptotic chi-squared distribution as $n_i \to \infty$ for all $i$, and can be used as a goodness of fit test.

More generally, the difference in deviances between nested models (i.e. the log of the likelihood ratio test criterion) has an asymptotic chi-squared distribution as the number of groups $k \to \infty$ or the size of each group $n_i \to \infty$, provided the number of parameters stays fixed.

As a general rule of thumb due to Cochran (1950), the asymptotic chi-squared distribution provides a reasonable approximation when all expected frequencies (both $\hat\mu_i$ and $n_i - \hat\mu_i$) under the larger model exceed one, and at least 80% exceed five.

    B.5 Poisson Errors and Link Log

Let us now apply the general theory to the Poisson case, with emphasis on the log link function.

    B.5.1 The Poisson Distribution

A Poisson random variable has probability distribution function

$$f_i(y_i) = \frac{e^{-\mu_i}\mu_i^{y_i}}{y_i!} \tag{B.20}$$

for $y_i = 0, 1, 2, \dots$. The moments are

$$E(Y_i) = \mathrm{var}(Y_i) = \mu_i.$$

Let us verify that this distribution belongs to the exponential family as defined by Nelder and Wedderburn (1972). Taking logs we find

$$\log f_i(y_i) = y_i\log(\mu_i) - \mu_i - \log(y_i!).$$

Looking at the coefficient of $y_i$ we see immediately that the canonical parameter is

$$\theta_i = \log(\mu_i), \tag{B.21}$$


and therefore that the canonical link is the log. Solving for $\mu_i$ we obtain the inverse link

$$\mu_i = e^{\theta_i},$$

and we see that we can write the second term in the p.d.f. as

$$b(\theta_i) = e^{\theta_i}.$$

The last remaining term is a function of $y_i$ only, so we identify

$$c(y_i, \phi) = -\log(y_i!).$$

Finally, note that we can take $a_i(\phi) = \phi$ and $\phi = 1$, just as we did in the binomial case.

Let us verify the mean and variance. Differentiating the cumulant function $b(\theta_i)$ we have

$$\mu_i = b'(\theta_i) = e^{\theta_i} = \mu_i,$$

and differentiating again we have

$$v_i = a_i(\phi)\,b''(\theta_i) = e^{\theta_i} = \mu_i.$$

Note that the mean equals the variance.

    B.5.2 Fisher Scoring in Log-linear Models

We now consider the Fisher scoring algorithm for Poisson regression models with canonical link, where we model

$$\eta_i = \log(\mu_i). \tag{B.22}$$

The derivative of the link is easily seen to be

$$\frac{d\eta_i}{d\mu_i} = \frac{1}{\mu_i}.$$

Thus, the working dependent variable has the form

$$z_i = \eta_i + \frac{y_i - \mu_i}{\mu_i}. \tag{B.23}$$

The iterative weight is

$$w_i = 1\left/\left[b''(\theta_i)\left(\frac{d\eta_i}{d\mu_i}\right)^2\right]\right. = 1\left/\left[\mu_i\frac{1}{\mu_i^2}\right]\right.,$$


and simplifies to

$$w_i = \mu_i. \tag{B.24}$$

Note again that the weight is inversely proportional to the variance of the working dependent variable.
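As with the logit case, these two ingredients plug straight into the generic IRLS sketch from Section B.2 (my own illustration with made-up counts; `irls` is the hypothetical helper defined there):

```python
import numpy as np

# Poisson counts with log link, as in (B.22)-(B.24).
y = np.array([1., 3., 4., 9.])
X = np.column_stack([np.ones(4), [0., 1., 2., 3.]])  # intercept + predictor

beta = irls(
    X, y,
    inv_link=np.exp,             # mu = e^eta
    deta_dmu=lambda mu: 1 / mu,  # d eta / d mu = 1/mu (B.23)
    variance=lambda mu: mu,      # b''(theta) = mu, so w_i = mu_i (B.24)
)
print(beta)  # maximum likelihood estimates
```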

    B.5.3 The Poisson Deviance

Let $\hat\mu_i$ denote the m.l.e. of $\mu_i$ under the model of interest and let $\tilde\mu_i = y_i$ denote the m.l.e. under the saturated model. From first principles, the deviance is

$$D = 2\sum\left[y_i\log(y_i) - y_i - \log(y_i!) - y_i\log(\hat\mu_i) + \hat\mu_i + \log(y_i!)\right].$$

Note that the terms on $y_i!$ cancel out. Collecting terms on $y_i$ we have

$$D = 2\sum\left[y_i\log\left(\frac{y_i}{\hat\mu_i}\right) - (y_i - \hat\mu_i)\right]. \tag{B.25}$$

The similarity of the Poisson and binomial deviances should not go unnoticed. Note that the first term in the Poisson deviance has the form

$$D = 2\sum o_i\log\left(\frac{o_i}{e_i}\right),$$

which is identical to the binomial deviance. The second term is usually zero. To see this point, note that for a canonical link the score is

$$\frac{\partial\log L}{\partial\boldsymbol\beta} = \mathbf{X}'(\mathbf{y} - \boldsymbol\mu),$$

and setting this to zero leads to the estimating equations

$$\mathbf{X}'\mathbf{y} = \mathbf{X}'\hat{\boldsymbol\mu}.$$

In words, maximum likelihood estimation for Poisson log-linear models, and more generally for any generalized linear model with canonical link, requires equating certain functions of the m.l.e.'s (namely $\mathbf{X}'\hat{\boldsymbol\mu}$) to the same functions of the data (namely $\mathbf{X}'\mathbf{y}$). If the model has a constant, one column of $\mathbf{X}$ will consist of ones and therefore one of the estimating equations will be

$$\sum y_i = \sum \hat\mu_i \quad\text{or}\quad \sum(y_i - \hat\mu_i) = 0,$$


so the last term in the Poisson deviance is zero. This result is the basis of an alternative algorithm for computing the m.l.e.'s known as iterative proportional fitting; see Bishop et al. (1975) for a description.

The Poisson deviance has an asymptotic chi-squared distribution as $n \to \infty$ with the number of parameters $p$ remaining fixed, and can be used as a goodness of fit test. Differences between Poisson deviances for nested models (i.e. the log of the likelihood ratio test criterion) have asymptotic chi-squared distributions under the usual regularity conditions.
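A direct translation of (B.25), in the same spirit as the binomial sketch above; for a model with a constant and canonical link the second term sums to zero, up to convergence tolerance:

```python
import numpy as np

def poisson_deviance(y, mu):
    """Poisson deviance, equation (B.25), taking 0*log(0) as 0."""
    with np.errstate(divide='ignore', invalid='ignore'):
        t = np.where(y > 0, y * np.log(y / mu), 0.0)
    return 2 * np.sum(t - (y - mu))
```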