
Appendix B

Generalized Linear Model Theory

We describe the generalized linear model as formulated by Nelder and Wedderburn (1972), and discuss estimation of the parameters and tests of hypotheses.

    B.1 The Model

Let $y_1, \dots, y_n$ denote $n$ independent observations on a response. We treat $y_i$ as a realization of a random variable $Y_i$. In the general linear model we assume that $Y_i$ has a normal distribution with mean $\mu_i$ and variance $\sigma^2$

$$Y_i \sim N(\mu_i, \sigma^2),$$

and we further assume that the expected value $\mu_i$ is a linear function of $p$ predictors that take values $\mathbf{x}_i' = (x_{i1}, \dots, x_{ip})$ for the $i$-th case, so that

$$\mu_i = \mathbf{x}_i' \boldsymbol\beta,$$

where $\boldsymbol\beta$ is a vector of unknown parameters.

We will generalize this in two steps, dealing with the stochastic and systematic components of the model.

G. Rodríguez. Revised November 2001


    B.1.1 The Exponential Family

We will assume that the observations come from a distribution in the exponential family with probability density function

$$f(y_i) = \exp\left\{ \frac{y_i \theta_i - b(\theta_i)}{a_i(\phi)} + c(y_i, \phi) \right\}. \tag{B.1}$$

Here $\theta_i$ and $\phi$ are parameters and $a_i(\phi)$, $b(\theta_i)$ and $c(y_i, \phi)$ are known functions. In all models considered in these notes the function $a_i(\phi)$ has the form

$$a_i(\phi) = \phi / p_i,$$

where $p_i$ is a known prior weight, usually 1. The parameters $\theta_i$ and $\phi$ are essentially location and scale parameters.

It can be shown that if $Y_i$ has a distribution in the exponential family then it has mean and variance

$$E(Y_i) = \mu_i = b'(\theta_i), \tag{B.2}$$

$$\mathrm{var}(Y_i) = \sigma_i^2 = b''(\theta_i)\,a_i(\phi), \tag{B.3}$$

where $b'(\theta_i)$ and $b''(\theta_i)$ are the first and second derivatives of $b(\theta_i)$. When $a_i(\phi) = \phi/p_i$ the variance has the simpler form

$$\mathrm{var}(Y_i) = \sigma_i^2 = \phi\, b''(\theta_i)/p_i.$$

The exponential family just defined includes as special cases the normal, binomial, Poisson, exponential, gamma and inverse Gaussian distributions.

Example: The normal distribution has density

$$f(y_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2}\frac{(y_i - \mu_i)^2}{\sigma^2} \right\}.$$

Expanding the square in the exponent we get $(y_i - \mu_i)^2 = y_i^2 + \mu_i^2 - 2y_i\mu_i$, so the coefficient of $y_i$ is $\mu_i/\sigma^2$. This result identifies $\theta_i$ as $\mu_i$ and $\phi$ as $\sigma^2$, with $a_i(\phi) = \phi$. Now write

$$f(y_i) = \exp\left\{ \frac{y_i\mu_i - \frac{1}{2}\mu_i^2}{\sigma^2} - \frac{y_i^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2) \right\}.$$

This shows that $b(\theta_i) = \frac{1}{2}\theta_i^2$ (recall that $\theta_i = \mu_i$). Let us check the mean and variance:

$$E(Y_i) = b'(\theta_i) = \theta_i = \mu_i,$$

$$\mathrm{var}(Y_i) = b''(\theta_i)\,a_i(\phi) = \sigma^2.$$
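These derivatives are easy to verify by machine. A minimal sketch, assuming the sympy symbolic algebra package is available (the variable names are mine):

```python
import sympy as sp

theta, phi = sp.symbols('theta phi')
b = theta**2 / 2               # normal cumulant function b(theta)
a = phi                        # a_i(phi) = phi, with phi = sigma^2

print(sp.diff(b, theta))       # b'(theta) = theta = mu, as in (B.2)
print(sp.diff(b, theta, 2)*a)  # b''(theta) a(phi) = phi = sigma^2, as in (B.3)
```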


Try to generalize this result to the case where $Y_i$ has a normal distribution with mean $\mu_i$ and variance $\sigma^2/n_i$ for known constants $n_i$, as would be the case if the $Y_i$ represented sample means.

Example: In Problem Set 1 you will show that the exponential distribution with density

$$f(y_i) = \lambda_i \exp\{-\lambda_i y_i\}$$

belongs to the exponential family.

In Sections B.4 and B.5 we verify that the binomial and Poisson distributions also belong to this family.

    B.1.2 The Link Function

The second element of the generalization is that instead of modeling the mean, as before, we will introduce a one-to-one continuous differentiable transformation $g(\mu_i)$ and focus on

$$\eta_i = g(\mu_i). \tag{B.4}$$

The function $g(\mu_i)$ will be called the link function. Examples of link functions include the identity, log, reciprocal, logit and probit.

We further assume that the transformed mean follows a linear model, so that

$$\eta_i = \mathbf{x}_i' \boldsymbol\beta. \tag{B.5}$$

The quantity $\eta_i$ is called the linear predictor. Note that the model for $\eta_i$ is pleasantly simple. Since the link function is one-to-one we can invert it to obtain

$$\mu_i = g^{-1}(\mathbf{x}_i' \boldsymbol\beta).$$
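To make the link and its inverse concrete, here is a minimal numpy sketch of the logit pair used later for binomial data (the function names are mine, not from any particular library):

```python
import numpy as np

def logit(mu):
    """Link g: maps a probability in (0, 1) to the whole real line."""
    return np.log(mu / (1 - mu))

def inv_logit(eta):
    """Inverse link g^{-1}: maps a linear predictor back into (0, 1)."""
    return 1 / (1 + np.exp(-eta))

eta = np.array([-2.0, 0.0, 2.0])    # linear predictor x_i' beta
mu = inv_logit(eta)                 # mu_i = g^{-1}(x_i' beta)
print(np.allclose(logit(mu), eta))  # True: g is one-to-one
```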

The model for $\mu_i$ is usually more complicated than the model for $\eta_i$.

Note that we do not transform the response $y_i$, but rather its expected value $\mu_i$. A model where $\log y_i$ is linear on $\mathbf{x}_i$, for example, is not the same as a generalized linear model where $\log \mu_i$ is linear on $\mathbf{x}_i$.

Example: The standard linear model we have studied so far can be described as a generalized linear model with normal errors and identity link, so that

$$\eta_i = \mu_i.$$

It also happens that $\mu_i$, and therefore $\eta_i$, is the same as $\theta_i$, the parameter in the exponential family density.


When the link function makes the linear predictor $\eta_i$ the same as the canonical parameter $\theta_i$, we say that we have a canonical link. The identity is the canonical link for the normal distribution. In later sections we will see that the logit is the canonical link for the binomial distribution and the log is the canonical link for the Poisson distribution. This leads to some natural pairings:

    Error      Link
    Normal     Identity
    Binomial   Logit
    Poisson    Log

However, other combinations are also possible. An advantage of canonical links is that a minimal sufficient statistic for $\boldsymbol\beta$ exists, i.e. all the information about $\boldsymbol\beta$ is contained in a function of the data of the same dimensionality as $\boldsymbol\beta$.

    B.2 Maximum Likelihood Estimation

An important practical feature of generalized linear models is that they can all be fit to data using the same algorithm, a form of iteratively re-weighted least squares. In this section we describe the algorithm.

Given a trial estimate of the parameters $\hat{\boldsymbol\beta}$, we calculate the estimated linear predictor $\hat\eta_i = \mathbf{x}_i'\hat{\boldsymbol\beta}$ and use that to obtain the fitted values $\hat\mu_i = g^{-1}(\hat\eta_i)$. Using these quantities, we calculate the working dependent variable

$$z_i = \hat\eta_i + (y_i - \hat\mu_i)\frac{d\eta_i}{d\mu_i}, \tag{B.6}$$

where the rightmost term is the derivative of the link function evaluated at the trial estimate.

Next we calculate the iterative weights

$$w_i = \frac{p_i}{b''(\theta_i)\left(d\eta_i/d\mu_i\right)^2}, \tag{B.7}$$

where $b''(\theta_i)$ is the second derivative of $b(\theta_i)$ evaluated at the trial estimate and we have assumed that $a_i(\phi)$ has the usual form $\phi/p_i$. This weight is inversely proportional to the variance of the working dependent variable $z_i$ given the current estimates of the parameters, with proportionality factor $\phi$.


Finally, we obtain an improved estimate of $\boldsymbol\beta$ by regressing the working dependent variable $z_i$ on the predictors $\mathbf{x}_i$ using the weights $w_i$, i.e. we calculate the weighted least-squares estimate

$$\hat{\boldsymbol\beta} = (\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}\mathbf{z}, \tag{B.8}$$

where $\mathbf{X}$ is the model matrix, $\mathbf{W}$ is a diagonal matrix of weights with entries $w_i$ given by (B.7), and $\mathbf{z}$ is a response vector with entries $z_i$ given by (B.6).

The procedure is repeated until successive estimates change by less than a specified small amount. McCullagh and Nelder (1989) prove that this algorithm is equivalent to Fisher scoring and leads to maximum likelihood estimates. These authors consider the case of general $a_i(\phi)$ and include $\phi$ in their expression for the iterative weight. In other words, they use $w_i^* = w_i/\phi$, where $w_i$ is the weight used here. The proportionality factor cancels out when you calculate the weighted least-squares estimates using (B.8), so the estimator is exactly the same. I prefer to show $\phi$ explicitly rather than include it in $\mathbf{W}$.

Example: For normal data with identity link $\eta_i = \mu_i$, so the derivative is $d\eta_i/d\mu_i = 1$ and the working dependent variable is $y_i$ itself. Since in addition $b''(\theta_i) = 1$ and $p_i = 1$, the weights are constant and no iteration is required.

In Sections B.4 and B.5 we derive the working dependent variable and the iterative weights required for binomial data with link logit and for Poisson data with link log. In both cases iteration will usually be necessary.

Starting values may be obtained by applying the link to the data, i.e. we take $\hat\mu_i = y_i$ and $\hat\eta_i = g(\hat\mu_i)$. Sometimes this requires a few adjustments, for example to avoid taking the log of zero, and we will discuss these at the appropriate time.
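The algorithm fits in a few lines of numpy. The following is a sketch of my own, not code from these notes: it implements (B.6)-(B.8) for any family described by its inverse link, link derivative and variance function $b''(\theta_i)$, and for simplicity starts at $\boldsymbol\beta = \mathbf{0}$ rather than by applying the link to the data.

```python
import numpy as np

def irls(X, y, inv_link, deta_dmu, variance, p=None, tol=1e-8, max_iter=25):
    """Iteratively re-weighted least squares, equations (B.6)-(B.8).

    inv_link : g^{-1}, maps the linear predictor eta to the mean mu
    deta_dmu : derivative of the link function, d eta / d mu
    variance : b''(theta) expressed as a function of mu
    p        : prior weights p_i (defaults to 1)
    """
    p = np.ones(len(y)) if p is None else p
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta                    # estimated linear predictor
        mu = inv_link(eta)                # fitted values
        d = deta_dmu(mu)                  # link derivative at trial estimate
        z = eta + (y - mu) * d            # working dependent variable (B.6)
        w = p / (variance(mu) * d**2)     # iterative weights (B.7)
        XtW = X.T * w                     # X'W without forming diag(w)
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)  # WLS step (B.8)
        if np.max(np.abs(beta_new - beta)) < tol:
            break
        beta = beta_new
    return beta_new
```

For normal data with identity link all three ingredient functions are constant, so a single step reproduces ordinary least squares, in line with the example above.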

    B.3 Tests of Hypotheses

We consider Wald tests and likelihood ratio tests, introducing the deviance statistic.

    B.3.1 Wald Tests

The Wald test follows immediately from the fact that the information matrix for generalized linear models is given by

$$\mathbf{I}(\boldsymbol\beta) = \mathbf{X}'\mathbf{W}\mathbf{X}/\phi, \tag{B.9}$$


so the large sample distribution of the maximum likelihood estimator $\hat{\boldsymbol\beta}$ is multivariate normal

$$\hat{\boldsymbol\beta} \sim N_p(\boldsymbol\beta, (\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\phi), \tag{B.10}$$

with mean $\boldsymbol\beta$ and variance-covariance matrix $(\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\phi$.

Tests for subsets of $\boldsymbol\beta$ are based on the corresponding marginal normal distributions.

Example: In the case of normal errors with identity link we have $\mathbf{W} = \mathbf{I}$ (where $\mathbf{I}$ denotes the identity matrix), $\phi = \sigma^2$, and the exact distribution of $\hat{\boldsymbol\beta}$ is multivariate normal with mean $\boldsymbol\beta$ and variance-covariance matrix $(\mathbf{X}'\mathbf{X})^{-1}\sigma^2$.
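In code, Wald standard errors come straight from inverting (B.9) at the converged fit. A small sketch of my own, taking the model matrix `X`, final weights `w` and scale `phi` from an IRLS fit such as the one above:

```python
import numpy as np

def wald_se(X, w, phi=1.0):
    """Standard errors from the inverse information matrix, (B.9)-(B.10)."""
    info = (X.T * w) @ X / phi   # I(beta) = X'WX / phi
    cov = np.linalg.inv(info)    # var-cov matrix of beta-hat
    return np.sqrt(np.diag(cov))
```

A Wald $z$-statistic for a single coefficient is then the estimate divided by its standard error.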

    B.3.2 Likelihood Ratio Tests and The Deviance

We will show how the likelihood ratio criterion for comparing any two nested models, say $\omega_1 \subset \omega_2$, can be constructed in terms of a statistic called the deviance and an unknown scale parameter $\phi$.

Consider first comparing a model of interest $\omega$ with a saturated model $\Omega$ that provides a separate parameter for each observation.

Let $\hat\mu_i$ denote the fitted values under $\omega$ and let $\hat\theta_i$ denote the corresponding estimates of the canonical parameters. Similarly, let $\tilde\mu_i = y_i$ and $\tilde\theta_i$ denote the corresponding estimates under $\Omega$.

The likelihood ratio criterion to compare these two models in the exponential family has the form

$$-2\log\lambda = 2\sum_{i=1}^{n} \frac{y_i(\tilde\theta_i - \hat\theta_i) - b(\tilde\theta_i) + b(\hat\theta_i)}{a_i(\phi)}.$$

Assume as usual that $a_i(\phi) = \phi/p_i$ for known prior weights $p_i$. Then we can write the likelihood-ratio criterion as follows:

$$-2\log\lambda = \frac{D(\mathbf{y}, \hat{\boldsymbol\mu})}{\phi}. \tag{B.11}$$

The numerator of this expression does not depend on unknown parameters and is called the deviance:

$$D(\mathbf{y}, \hat{\boldsymbol\mu}) = 2\sum_{i=1}^{n} p_i\left[y_i(\tilde\theta_i - \hat\theta_i) - b(\tilde\theta_i) + b(\hat\theta_i)\right]. \tag{B.12}$$

The likelihood ratio criterion $-2\log\lambda$ is the deviance divided by the scale parameter $\phi$, and is called the scaled deviance.


Example: Recall that for the normal distribution we had $\theta_i = \mu_i$, $b(\theta_i) = \frac{1}{2}\theta_i^2$, and $a_i(\phi) = \sigma^2$, so the prior weights are $p_i = 1$. Thus, the deviance is

$$D(\mathbf{y}, \hat{\boldsymbol\mu}) = 2\sum\left\{y_i(y_i - \hat\mu_i) - \tfrac{1}{2}y_i^2 + \tfrac{1}{2}\hat\mu_i^2\right\} = 2\sum\left\{\tfrac{1}{2}y_i^2 - y_i\hat\mu_i + \tfrac{1}{2}\hat\mu_i^2\right\} = \sum(y_i - \hat\mu_i)^2,$$

our good old friend, the residual sum of squares.

Let us now return to the comparison of two nested models $\omega_1$, with $p_1$ parameters, and $\omega_2$, with $p_2$ parameters, such that $\omega_1 \subset \omega_2$ and $p_2 > p_1$.

The log of the ratio of maximized likelihoods under the two models can be written as a difference of deviances, since the maximized log-likelihood under the saturated model cancels out. Thus, we have

$$-2\log\lambda = \frac{D(\omega_1) - D(\omega_2)}{\phi}. \tag{B.13}$$

The scale parameter $\phi$ is either known or estimated using the larger model $\omega_2$.

Large sample theory tells us that the asymptotic distribution of this criterion under the usual regularity conditions is $\chi^2_\nu$ with $\nu = p_2 - p_1$ degrees of freedom.
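Computationally the test is one line of arithmetic. A sketch, assuming scipy is available and that the deviances and parameter counts come from two nested fits:

```python
from scipy import stats

def lr_test(dev1, dev2, p1, p2, phi=1.0):
    """Likelihood ratio test for nested models, equation (B.13).

    dev1, dev2 : deviances of the smaller and larger model
    p1, p2     : numbers of parameters, with p2 > p1
    phi        : scale parameter, known or estimated from the larger model
    """
    chi2 = (dev1 - dev2) / phi
    return chi2, stats.chi2.sf(chi2, p2 - p1)  # statistic and p-value
```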

Example: In the linear model with normal errors we estimate the unknown scale parameter $\phi$ using the residual sum of squares of the larger model, so the criterion becomes

$$-2\log\lambda = \frac{\mathrm{RSS}(\omega_1) - \mathrm{RSS}(\omega_2)}{\mathrm{RSS}(\omega_2)/(n - p_2)}.$$

In large samples the approximate distribution of this criterion is $\chi^2_\nu$ with $\nu = p_2 - p_1$ degrees of freedom. Under normality, however, we have an exact result: dividing the criterion by $p_2 - p_1$ we obtain an $F$ with $p_2 - p_1$ and $n - p_2$ degrees of freedom. Note that as $n \to \infty$ the degrees of freedom in the denominator approach $\infty$ and the $F$ converges to $\chi^2_{p_2-p_1}/(p_2 - p_1)$, so the asymptotic and exact criteria become equivalent.

In Sections B.4 and B.5 we will construct likelihood ratio tests for binomial and Poisson data. In those cases $\phi = 1$ (unless one allows over-dispersion and estimates $\phi$, but that's another story) and the deviance is the same as the scaled deviance. All our tests will be based on asymptotic $\chi^2$ statistics.


    B.4 Binomial Errors and Link Logit

We apply the theory of generalized linear models to the case of binary data, and in particular to logistic regression models.

    B.4.1 The Binomial Distribution

First we verify that the binomial distribution $B(n_i, \pi_i)$ belongs to the exponential family of Nelder and Wedderburn (1972). The binomial probability distribution function (p.d.f.) is

$$f_i(y_i) = \binom{n_i}{y_i}\pi_i^{y_i}(1-\pi_i)^{n_i-y_i}. \tag{B.14}$$

Taking logs we find that

$$\log f_i(y_i) = y_i\log(\pi_i) + (n_i - y_i)\log(1-\pi_i) + \log\binom{n_i}{y_i}.$$

Collecting terms on $y_i$ we can write

$$\log f_i(y_i) = y_i\log\left(\frac{\pi_i}{1-\pi_i}\right) + n_i\log(1-\pi_i) + \log\binom{n_i}{y_i}.$$

This expression has the general exponential form

$$\log f_i(y_i) = \frac{y_i\theta_i - b(\theta_i)}{a_i(\phi)} + c(y_i, \phi)$$

with the following equivalences. Looking first at the coefficient of $y_i$ we note that the canonical parameter is the logit of $\pi_i$:

$$\theta_i = \log\left(\frac{\pi_i}{1-\pi_i}\right). \tag{B.15}$$

Solving for $\pi_i$ we see that

$$\pi_i = \frac{e^{\theta_i}}{1+e^{\theta_i}}, \quad \text{so} \quad 1-\pi_i = \frac{1}{1+e^{\theta_i}}.$$

If we rewrite the second term in the p.d.f. as a function of $\theta_i$, so that $\log(1-\pi_i) = -\log(1+e^{\theta_i})$, we can identify the cumulant function $b(\theta_i)$ as

$$b(\theta_i) = n_i\log(1+e^{\theta_i}).$$


The remaining term in the p.d.f. is a function of $y_i$ but not $\pi_i$, leading to

$$c(y_i, \phi) = \log\binom{n_i}{y_i}.$$

Note finally that we may set $a_i(\phi) = \phi$ and $\phi = 1$.

Let us verify the mean and variance. Differentiating $b(\theta_i)$ with respect to $\theta_i$ we find that

$$\mu_i = b'(\theta_i) = n_i\frac{e^{\theta_i}}{1+e^{\theta_i}} = n_i\pi_i,$$

in agreement with what we knew from elementary statistics. Differentiating again using the quotient rule, we find that

$$v_i = a_i(\phi)\,b''(\theta_i) = n_i\frac{e^{\theta_i}}{(1+e^{\theta_i})^2} = n_i\pi_i(1-\pi_i),$$

again in agreement with what we knew before.

In this development I have worked with the binomial count $y_i$, which takes values $0(1)n_i$. McCullagh and Nelder (1989) work with the proportion $p_i = y_i/n_i$, which takes values $0(1/n_i)1$. This explains the differences between my results and their Table 2.1.
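Both derivatives can be checked symbolically. A minimal sketch, again assuming sympy:

```python
import sympy as sp

theta, n = sp.symbols('theta n', positive=True)
b = n * sp.log(1 + sp.exp(theta))         # binomial cumulant function
pi = sp.exp(theta) / (1 + sp.exp(theta))  # inverse of the logit (B.15)

mu = sp.diff(b, theta)                    # b'(theta)
v = sp.diff(b, theta, 2)                  # b''(theta)
print(sp.simplify(mu - n*pi))             # 0: the mean is n_i pi_i
print(sp.simplify(v - n*pi*(1 - pi)))     # 0: the variance is n_i pi_i (1 - pi_i)
```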

    B.4.2 Fisher Scoring in Logistic Regression

Let us now find the working dependent variable and the iterative weight used in the Fisher scoring algorithm for estimating the parameters in logistic regression, where we model

$$\eta_i = \mathrm{logit}(\pi_i). \tag{B.16}$$

It will be convenient to write the link function in terms of the mean $\mu_i = n_i\pi_i$, as:

$$\eta_i = \log\left(\frac{\pi_i}{1-\pi_i}\right) = \log\left(\frac{\mu_i}{n_i-\mu_i}\right),$$

which can also be written as $\eta_i = \log(\mu_i) - \log(n_i - \mu_i)$.

Differentiating with respect to $\mu_i$ we find that

$$\frac{d\eta_i}{d\mu_i} = \frac{1}{\mu_i} + \frac{1}{n_i-\mu_i} = \frac{n_i}{\mu_i(n_i-\mu_i)} = \frac{1}{n_i\pi_i(1-\pi_i)}.$$

The working dependent variable, which in general is

$$z_i = \eta_i + (y_i - \mu_i)\frac{d\eta_i}{d\mu_i},$$


turns out to be

$$z_i = \eta_i + \frac{y_i - n_i\pi_i}{n_i\pi_i(1-\pi_i)}. \tag{B.17}$$

The iterative weight turns out to be

$$w_i = 1\left/\left[b''(\theta_i)\left(\frac{d\eta_i}{d\mu_i}\right)^2\right]\right. = \frac{1}{n_i\pi_i(1-\pi_i)}\left[n_i\pi_i(1-\pi_i)\right]^2,$$

and simplifies to

$$w_i = n_i\pi_i(1-\pi_i). \tag{B.18}$$

Note that the weight is inversely proportional to the variance of the working dependent variable. The results here agree exactly with the results in Chapter 4 of McCullagh and Nelder (1989).
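Specializing the generic IRLS sketch from Section B.2 gives a working logistic regression fitter in a few lines. This is my own illustration with made-up data; `irls` is the hypothetical helper defined earlier:

```python
import numpy as np

# Grouped binomial data: y_i successes out of n_i trials.
ni = np.array([10., 15., 12., 20.])
y = np.array([2., 7., 9., 15.])
X = np.column_stack([np.ones(4), [0., 1., 2., 3.]])  # intercept + predictor

beta = irls(
    X, y,
    inv_link=lambda eta: ni / (1 + np.exp(-eta)),  # mu = n_i pi
    deta_dmu=lambda mu: ni / (mu * (ni - mu)),     # link derivative
    variance=lambda mu: mu * (ni - mu) / ni,       # b'' = n_i pi (1 - pi)
)
print(beta)  # maximum likelihood estimates
```

Here the weights computed inside `irls` reduce to $n_i\pi_i(1-\pi_i)$, exactly (B.18).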

Exercise: Obtain analogous results for probit analysis, where one models

$$\eta_i = \Phi^{-1}(\pi_i),$$

where $\Phi(\cdot)$ is the standard normal cdf. Hint: To calculate the derivative of the link function find $d\pi_i/d\eta_i$ and take reciprocals.

    B.4.3 The Binomial Deviance

Finally, let us figure out the binomial deviance. Let $\hat\mu_i$ denote the m.l.e. of $\mu_i$ under the model of interest, and let $\tilde\mu_i = y_i$ denote the m.l.e. under the saturated model. From first principles,

$$D = 2\sum\left[y_i\log\left(\frac{y_i}{n_i}\right) + (n_i-y_i)\log\left(\frac{n_i-y_i}{n_i}\right) - y_i\log\left(\frac{\hat\mu_i}{n_i}\right) - (n_i-y_i)\log\left(\frac{n_i-\hat\mu_i}{n_i}\right)\right].$$

Note that all terms involving $\log(n_i)$ cancel out. Collecting terms on $y_i$ and on $n_i - y_i$ we find that

$$D = 2\sum\left[y_i\log\left(\frac{y_i}{\hat\mu_i}\right) + (n_i-y_i)\log\left(\frac{n_i-y_i}{n_i-\hat\mu_i}\right)\right]. \tag{B.19}$$

Alternatively, you may obtain this result from the general form of the deviance given in Section B.3.
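A direct translation of (B.19), with the usual convention that a term with $y_i = 0$ or $y_i = n_i$ contributes zero (a sketch of my own):

```python
import numpy as np

def binomial_deviance(y, mu, n):
    """Binomial deviance, equation (B.19), taking 0*log(0) as 0."""
    with np.errstate(divide='ignore', invalid='ignore'):
        t1 = np.where(y > 0, y * np.log(y / mu), 0.0)
        t2 = np.where(n - y > 0, (n - y) * np.log((n - y) / (n - mu)), 0.0)
    return 2 * np.sum(t1 + t2)
```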


Note that the binomial deviance has the form

$$D = 2\sum o_i\log\left(\frac{o_i}{e_i}\right),$$

where $o_i$ denotes observed, $e_i$ denotes expected (under the model of interest) and the sum is over both successes and failures for each $i$ (i.e. we have a contribution from $y_i$ and one from $n_i - y_i$).

For grouped data the deviance has an asymptotic chi-squared distribution as $n_i \to \infty$ for all $i$, and can be used as a goodness of fit test.

More generally, the difference in deviances between nested models (i.e. the log of the likelihood ratio test criterion) has an asymptotic chi-squared distribution as the number of groups $k \to \infty$ or the size of each group $n_i \to \infty$, provided the number of parameters stays fixed.

As a general rule of thumb due to Cochran (1950), the asymptotic chi-squared distribution provides a reasonable approximation when all expected frequencies (both $\hat\mu_i$ and $n_i - \hat\mu_i$) under the larger model exceed one, and at least 80% exceed five.

    B.5 Poisson Errors and Link Log

Let us now apply the general theory to the Poisson case, with emphasis on the log link function.

    B.5.1 The Poisson Distribution

A Poisson random variable has probability distribution function

$$f_i(y_i) = \frac{e^{-\mu_i}\mu_i^{y_i}}{y_i!} \tag{B.20}$$

for $y_i = 0, 1, 2, \dots$. The moments are

$$E(Y_i) = \mathrm{var}(Y_i) = \mu_i.$$

Let us verify that this distribution belongs to the exponential family as defined by Nelder and Wedderburn (1972). Taking logs we find

$$\log f_i(y_i) = y_i\log(\mu_i) - \mu_i - \log(y_i!).$$

Looking at the coefficient of $y_i$ we see immediately that the canonical parameter is

$$\theta_i = \log(\mu_i), \tag{B.21}$$


and therefore that the canonical link is the log. Solving for $\mu_i$ we obtain the inverse link

$$\mu_i = e^{\theta_i},$$

and we see that we can write the second term in the p.d.f. as

$$b(\theta_i) = e^{\theta_i}.$$

The last remaining term is a function of $y_i$ only, so we identify

$$c(y_i, \phi) = -\log(y_i!).$$

Finally, note that we can take $a_i(\phi) = \phi$ and $\phi = 1$, just as we did in the binomial case.

Let us verify the mean and variance. Differentiating the cumulant function $b(\theta_i)$ we have

$$\mu_i = b'(\theta_i) = e^{\theta_i} = \mu_i,$$

and differentiating again we have

$$v_i = a_i(\phi)\,b''(\theta_i) = e^{\theta_i} = \mu_i.$$

Note that the mean equals the variance.

    B.5.2 Fisher Scoring in Log-linear Models

We now consider the Fisher scoring algorithm for Poisson regression models with canonical link, where we model

$$\eta_i = \log(\mu_i). \tag{B.22}$$

The derivative of the link is easily seen to be

$$\frac{d\eta_i}{d\mu_i} = \frac{1}{\mu_i}.$$

Thus, the working dependent variable has the form

$$z_i = \eta_i + \frac{y_i - \mu_i}{\mu_i}. \tag{B.23}$$

The iterative weight is

$$w_i = 1\left/\left[b''(\theta_i)\left(\frac{d\eta_i}{d\mu_i}\right)^2\right]\right. = 1\left/\left[\mu_i\frac{1}{\mu_i^2}\right]\right.,$$


and simplifies to

$$w_i = \mu_i. \tag{B.24}$$

Note again that the weight is inversely proportional to the variance of the working dependent variable.
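As with the logit case, these two ingredients plug straight into the generic IRLS sketch from Section B.2 (my own illustration with made-up counts; `irls` is the hypothetical helper defined there):

```python
import numpy as np

# Poisson counts with log link, as in (B.22)-(B.24).
y = np.array([1., 3., 4., 9.])
X = np.column_stack([np.ones(4), [0., 1., 2., 3.]])  # intercept + predictor

beta = irls(
    X, y,
    inv_link=np.exp,             # mu = e^eta
    deta_dmu=lambda mu: 1 / mu,  # d eta / d mu = 1/mu (B.23)
    variance=lambda mu: mu,      # b''(theta) = mu, so w_i = mu_i (B.24)
)
print(beta)  # maximum likelihood estimates
```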

    B.5.3 The Poisson Deviance

Let $\hat\mu_i$ denote the m.l.e. of $\mu_i$ under the model of interest and let $\tilde\mu_i = y_i$ denote the m.l.e. under the saturated model. From first principles, the deviance is

$$D = 2\sum\left[y_i\log(y_i) - y_i - \log(y_i!) - y_i\log(\hat\mu_i) + \hat\mu_i + \log(y_i!)\right].$$

Note that the terms on $y_i!$ cancel out. Collecting terms on $y_i$ we have

$$D = 2\sum\left[y_i\log\left(\frac{y_i}{\hat\mu_i}\right) - (y_i - \hat\mu_i)\right]. \tag{B.25}$$

The similarity of the Poisson and binomial deviances should not go unnoticed. Note that the first term in the Poisson deviance has the form

$$D = 2\sum o_i\log\left(\frac{o_i}{e_i}\right),$$

which is identical to the binomial deviance. The second term is usually zero. To see this point, note that for a canonical link the score is

$$\frac{\partial\log L}{\partial\boldsymbol\beta} = \mathbf{X}'(\mathbf{y} - \boldsymbol\mu),$$

and setting this to zero leads to the estimating equations

$$\mathbf{X}'\mathbf{y} = \mathbf{X}'\hat{\boldsymbol\mu}.$$

In words, maximum likelihood estimation for Poisson log-linear models, and more generally for any generalized linear model with canonical link, requires equating certain functions of the m.l.e.'s (namely $\mathbf{X}'\hat{\boldsymbol\mu}$) to the same functions of the data (namely $\mathbf{X}'\mathbf{y}$). If the model has a constant, one column of $\mathbf{X}$ will consist of ones and therefore one of the estimating equations will be

$$\sum y_i = \sum \hat\mu_i \quad\text{or}\quad \sum(y_i - \hat\mu_i) = 0,$$


so the last term in the Poisson deviance is zero. This result is the basis of an alternative algorithm for computing the m.l.e.'s known as iterative proportional fitting; see Bishop et al. (1975) for a description.

The Poisson deviance has an asymptotic chi-squared distribution as $n \to \infty$ with the number of parameters $p$ remaining fixed, and can be used as a goodness of fit test. Differences between Poisson deviances for nested models (i.e. the log of the likelihood ratio test criterion) have asymptotic chi-squared distributions under the usual regularity conditions.
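A direct translation of (B.25), in the same spirit as the binomial sketch above; for a model with a constant and canonical link the second term sums to zero, up to convergence tolerance:

```python
import numpy as np

def poisson_deviance(y, mu):
    """Poisson deviance, equation (B.25), taking 0*log(0) as 0."""
    with np.errstate(divide='ignore', invalid='ignore'):
        t = np.where(y > 0, y * np.log(y / mu), 0.0)
    return 2 * np.sum(t - (y - mu))
```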