winonacourse1.winona.edu/bdeppa/biostatistics/handouts/ho… · web viewgeneralized linear models...

STAT 405 – Biostatistics Handout 16 – Poisson Regression

Recall the Poisson distribution is given by

$$P(Y = y) = \frac{e^{-\mu}\mu^{y}}{y!}, \qquad y = 0, 1, 2, \ldots \ \text{and} \ \mu > 0.$$

The response Y is a discrete random variable that represents the number of occurrences per time or space unit. In Poisson regression we seek a model for the mean of the response ($\mu$) as a function of terms based upon a set of predictors $x_1, x_2, \ldots, x_p$. For a Poisson random variable the mean and variance are both equal to $\mu$, so traditional OLS will not be adequate because the constant error variance assumption would be violated.
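As a quick numerical check of the statement that the mean and variance of a Poisson random variable are both equal to the same value, the following R sketch evaluates the probability formula directly and compares it with R's built-in dpois(). The value mu = 3.5 is an arbitrary choice for illustration and does not come from the handout.

```r
# Arbitrary illustrative mean (an assumption, not taken from the handout)
mu <- 3.5
y  <- 0:100                      # the tail beyond 100 is negligible for mu = 3.5

# Poisson probabilities from the formula e^(-mu) * mu^y / y!
p_formula <- exp(-mu) * mu^y / factorial(y)

# Agreement with the built-in Poisson probability function
all.equal(p_formula, dpois(y, lambda = mu))

# Mean and variance computed from the probabilities; both are (numerically) mu
pois_mean <- sum(y * p_formula)
pois_var  <- sum((y - pois_mean)^2 * p_formula)
c(mean = pois_mean, variance = pois_var)
```

Because the variance grows with the mean, any model in which the mean changes with the predictors necessarily has non-constant variance.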

The logistic regression model that we have been studying is one type of a broader class of models called Generalized Linear Models. Generalized linear models are an extension of regular linear models that allow: (i) the mean of a population to depend on a linear function of terms through a nonlinear link function and (ii) the response probability distribution to be any member of a special class of distributions referred to as the exponential family. The exponential family contains the normal distribution (OLS), the binomial distribution (logistic) and the Poisson distribution.

The link function is a function $g$ that relates the mean of the response, $\mu_i = E(Y_i)$, linearly to a set of terms based on the explanatory variables or predictors.

OLS Regression

For a normally distributed response the link function is the identity function, $g(\mu) = \mu$, thus

$$g(\mu) = \eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1}$$

or, as we typically write the model for the mean,

$$E(Y|X) = \eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1}$$

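Logistic Regression

For a binomial response the link function is the logit, so we know that

$$g(\mu) = \ln\left(\frac{\mu}{1-\mu}\right) = \eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1}$$

which we expressed as

$$\ln\left(\frac{\theta(\tilde{x})}{1-\theta(\tilde{x})}\right) = \eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1}$$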

Poisson Regression

For a Poisson distributed response the link function is $g(\mu) = \ln(\mu)$, so

$$\ln(\mu) = \eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1}$$

thus

$$\mu = \exp(\eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1})$$
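The three link functions above can be inspected directly in R, since the family objects in the base stats package store the link and its inverse. The sketch below only illustrates those definitions; the values of mu and p are arbitrary.

```r
# Family objects in base R's stats package carry the link g() and its inverse
mu <- 2.5                                   # arbitrary illustrative mean

gaussian()$link                             # "identity"
binomial()$link                             # "logit"
poisson()$link                              # "log"

# Poisson: g(mu) = ln(mu), and the inverse link recovers mu = exp(eta)
eta <- poisson()$linkfun(mu)
all.equal(eta, log(mu))
all.equal(poisson()$linkinv(eta), mu)

# Binomial: g(mu) = ln(mu / (1 - mu)) for a probability mu
p <- 0.2
all.equal(binomial()$linkfun(p), log(p / (1 - p)))
```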

Interpretation of Coefficients in the Poisson Regression Model

The interpretation of the coefficients in the Poisson regression model is as follows. Assume that we change one of the explanatory terms, for example, the first one, by one unit from u to u+1 while holding all other terms fixed. This change affects the mean of the Poisson response by

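$$100 \times \frac{\exp(\eta_0 + \eta_1 (u+1) + \cdots + \eta_{k-1} u_{k-1}) - \exp(\eta_0 + \eta_1 u + \cdots + \eta_{k-1} u_{k-1})}{\exp(\eta_0 + \eta_1 u + \cdots + \eta_{k-1} u_{k-1})} = 100\,[\exp(\eta_1) - 1]\,\%,$$

i.e. a percent increase or decrease in the mean response.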

Alternatively we can simply take the ratio

$$\frac{\exp(\eta_0 + \eta_1 (u+1) + \cdots + \eta_{k-1} u_{k-1})}{\exp(\eta_0 + \eta_1 u + \cdots + \eta_{k-1} u_{k-1})} = e^{\eta_1},$$

which says that the mean of the response is multiplied by $e^{\eta_1}$ for each one-unit increase in the term $u_1$.
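A small numerical illustration may help here: the ratio of means does not depend on the starting value u, only on the coefficient. The coefficients eta0 and eta1 and the function mean_response() below are hypothetical, chosen only for illustration.

```r
# Hypothetical coefficients, chosen only for illustration
eta0 <- 0.5
eta1 <- 0.2

# Mean of the response under the log link with a single term u
mean_response <- function(u) exp(eta0 + eta1 * u)

u <- c(0, 1, 5, 10)                         # several starting values of the term
ratio   <- mean_response(u + 1) / mean_response(u)
percent <- 100 * (ratio - 1)

# The ratio equals exp(eta1) whatever the starting value u
all.equal(ratio, rep(exp(eta1), length(u)))
percent                                     # each entry is 100 * (exp(eta1) - 1)
```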


Wald Intervals and Tests for Parameters

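95% CI for $\eta_i$: $\hat{\eta}_i \pm 1.96 \cdot SE(\hat{\eta}_i)$

Therefore a CI for the multiplicative increase in the response is:

95% CI for $e^{\eta_i}$: $\exp\left(\hat{\eta}_i \pm 1.96 \cdot SE(\hat{\eta}_i)\right)$

For testing:
$H_0: \eta_i = 0$
$H_a: \eta_i \neq 0$

Large sample test for significance of “slope” parameter ($\eta_i$)

$$z = \frac{\hat{\eta}_i}{SE(\hat{\eta}_i)} \approx N(0,1), \qquad \text{equivalently} \qquad z^2 \approx \chi^2_1$$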
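The Wald calculations above are easy to script. The helper wald_summary() below is a hypothetical convenience function, not part of the handout or of any package, and the estimate and standard error passed to it are made-up numbers used only to show the mechanics.

```r
# Hypothetical helper (not part of the handout): Wald summary for one coefficient
wald_summary <- function(estimate, se, level = 0.95) {
  zcrit <- qnorm(1 - (1 - level) / 2)       # about 1.96 when level = 0.95
  z     <- estimate / se
  ci    <- estimate + c(-1, 1) * zcrit * se
  list(z = z,
       p_value = 2 * pnorm(-abs(z)),        # two-sided p-value from N(0, 1)
       p_value_chisq = pchisq(z^2, df = 1, lower.tail = FALSE),  # same test via z^2
       ci_coef = ci,
       ci_mult = exp(ci))                   # CI for the multiplicative effect
}

# Made-up estimate and standard error, for illustration only
w <- wald_summary(estimate = 0.30, se = 0.10)
w$z
w$ci_mult
all.equal(w$p_value, w$p_value_chisq)       # the z and chi-square forms agree
```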

General Chi-Square Test

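Consider comparing two rival models, where the alternative hypothesis model contains the terms of the null hypothesis model plus the additional terms in $x_2$:

$H_0: \log(\mu) = \eta_1^T x_1$ (reduced model OK)
$H_1: \log(\mu) = \eta_1^T x_1 + \eta_2^T x_2$ (full model needed)

General Chi-Square Statistic

$$\chi^2 = (\text{residual deviance of reduced model}) - (\text{residual deviance of full model})$$

$$= D(\text{for model without the terms in } x_2) - D(\text{for model with the terms in } x_2) \sim \chi^2_{\Delta df}$$

where $\Delta df$ is the difference in residual degrees of freedom between the two models. If the full model is needed, $\chi^2$ is BIG and the associated p-value $= P(\chi^2_{\Delta df} > \chi^2)$ is small.

The residual deviance is

$$D = 2\sum_{i=1}^{n} y_i \ln\left(\frac{y_i}{\hat{\mu}_i}\right)$$

and can also be approximated by

$$D \cong \sum_{i=1}^{n} \frac{(y_i - \hat{\mu}_i)^2}{\hat{\mu}_i}$$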

Residuals

As with logistic regression there are two types of residuals, which are related to the two forms of model deviance.

Deviance residuals

$$d_i = \text{sign}(y_i - \hat{\mu}_i)\sqrt{2\left[y_i \ln\left(\frac{y_i}{\hat{\mu}_i}\right) - (y_i - \hat{\mu}_i)\right]}$$

Pearson residuals

$$r_i = \frac{y_i - \hat{\mu}_i}{\sqrt{\hat{\mu}_i}}$$
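The residual formulas can be checked against what glm() reports. The sketch below uses a small made-up data set (the vectors y and x are invented for illustration and are not from the handout) and compares hand-computed residuals with residuals() from base R.

```r
# Made-up counts and predictor, purely for illustration (not from the handout)
y <- c(2, 3, 6, 7, 8, 9, 10, 12, 15)
x <- 1:9
fit    <- glm(y ~ x, family = poisson)
mu_hat <- fitted(fit)

# Deviance residuals from the formula (all y > 0 here, so log(y / mu_hat) is defined)
d <- sign(y - mu_hat) * sqrt(2 * (y * log(y / mu_hat) - (y - mu_hat)))
# Pearson residuals from the formula
r <- (y - mu_hat) / sqrt(mu_hat)

all.equal(unname(d), unname(residuals(fit, type = "deviance")))
all.equal(unname(r), unname(residuals(fit, type = "pearson")))

# Squared deviance residuals sum to the residual deviance; squared Pearson
# residuals sum to the Pearson chi-square approximation of it
c(deviance   = sum(d^2),
  short_form = 2 * sum(y * log(y / mu_hat)),
  pearson    = sum(r^2))
```

The short form of the deviance agrees with the sum of squared deviance residuals here because the model includes an intercept, so the fitted means sum to the observed total and the extra terms cancel.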

Example 1: Mating Success of African Elephants

In this study, 41 male African elephants were followed over a period of 8 years. The age of the elephant at the beginning of the study and the number of successful matings during the 8 years was recorded. The objective was to learn whether older animals are more successful at mating or whether they have diminished success after reaching a certain age.

Y = Number of matings in the 8 year follow-up period

X = Age (yrs.) of elephant at the start of the study

OLS Regression

> ele.lm = lm(Matings~Age, data=Elephants)
> summary(ele.lm)

Call:
lm(formula = Matings ~ Age, data = Elephants)

Residuals:
    Min      1Q  Median      3Q     Max
-4.1158 -1.3087 -0.1082  0.8892  4.8842

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.50589    1.61899  -2.783  0.00826 **
Age          0.20050    0.04443   4.513 5.75e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.849 on 39 degrees of freedom
Multiple R-squared: 0.343, Adjusted R-squared: 0.3262
F-statistic: 20.36 on 1 and 39 DF, p-value: 5.749e-05

> Resplot(ele.lm)

Do these plots suggest any violations of the OLS regression assumptions?

> ncv.plot(ele.lm)


One approach to attempting to correct the problem is to transform the response, using a variance stabilizing transformation which is found using the delta method. The delta method says that if $Y \sim N(\mu, \sigma^2)$ then $g(Y)$ is approximately normally distributed with mean $g(\mu)$ and variance $[g'(\mu)]^2 \sigma^2$.
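To see why the square root is a natural choice for count data, the delta method can be applied with the Poisson variance $\sigma^2 = \mu$. The following is a sketch of that step, assuming the normal approximation is adequate:

```latex
g(Y) = \sqrt{Y}, \qquad g'(\mu) = \frac{1}{2\sqrt{\mu}}
\quad\Longrightarrow\quad
\operatorname{Var}\!\left(\sqrt{Y}\right) \approx [g'(\mu)]^2 \sigma^2
= \frac{1}{4\mu} \cdot \mu = \frac{1}{4}
```

On the square root scale the variance is therefore approximately constant (about 1/4) whatever the mean, which is what the constant error variance assumption of OLS requires.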

> elesq.lm = lm(sqrt(Matings)~Age,data=Elephants)
> summary(elesq.lm)

Call:
lm(formula = sqrt(Matings) ~ Age, data = Elephants)

Residuals:
     Min       1Q   Median       3Q      Max
-1.90532 -0.33654  0.07767  0.45871  1.09468

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.81220    0.56867  -1.428 0.161187
Age          0.06320    0.01561   4.049 0.000236 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6493 on 39 degrees of freedom
Multiple R-squared: 0.296, Adjusted R-squared: 0.2779
F-statistic: 16.4 on 1 and 39 DF, p-value: 0.0002362


While this may seem like a satisfactory model, interpretation of the model coefficients is difficult and the response is now in the square root scale. As the number of matings per 8 year period is likely to be well modeled using a Poisson distribution, we will now consider Poisson regression.

> ele.glm = glm(Matings~Age,family="poisson")
> summary(ele.glm)

Call:
glm(formula = Matings ~ Age, family = "poisson")

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.80798  -0.86137  -0.08629   0.60087   2.17777

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.58201    0.54462  -2.905  0.00368 **
Age          0.06869    0.01375   4.997 5.81e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 75.372 on 40 degrees of freedom
Residual deviance: 51.012 on 39 degrees of freedom
AIC: 156.46

Number of Fisher Scoring iterations: 5
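The General Chi-Square Test described earlier can be applied to this output by treating the intercept-only (null) model as the reduced model and the model with Age as the full model. The sketch below uses only the deviances, degrees of freedom, estimate and standard error printed above, so it does not need the Elephants data.

```r
# Deviances and degrees of freedom copied from the summary(ele.glm) output
null_dev  <- 75.372;  null_df  <- 40
resid_dev <- 51.012;  resid_df <- 39

chisq <- null_dev - resid_dev               # drop in deviance due to adding Age
ddf   <- null_df - resid_df
p_val <- pchisq(chisq, df = ddf, lower.tail = FALSE)

# Wald z for Age, from the rounded estimate and standard error in the output
z <- 0.06869 / 0.01375

c(chisq = chisq, df = ddf, p = p_val, wald_z_squared = z^2)
```

The drop in deviance and the squared Wald statistic are not identical, but they are both referred to a chi-square distribution with 1 degree of freedom and lead to the same conclusion about Age here.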

> par(mfrow=c(2,2))
> plot(ele.glm)
> par(mfrow=c(1,1))

> plot(Age,Matings,xlab="Age of Elephant",ylab="Num. of Matings")
> lines(Age,fitted(ele.glm))
> title(main="Plot of Matings vs. Age of Elephant w/ Poisson Fit")


Given the curvilinear appearance of the scatterplot perhaps adding a squared term for Age would improve the model.

> elesq.glm = glm(Matings~Age+Age2,family="poisson")
> summary(elesq.glm)

Call:
glm(formula = Matings ~ Age + Age2, family = "poisson")

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.8470  -0.8848  -0.1122   0.6580   2.1134

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.8574060  3.0356383  -0.941    0.347
Age          0.1359544  0.1580095   0.860    0.390
Age2        -0.0008595  0.0020124  -0.427    0.669



Number of Fisher Scoring iterations: 5

We can use the Wald test for Age2 (p = 0.669) to determine that adding Age2 to the model was not necessary, or we could use the General Chi-Square Test, which compares the residual deviances of the two models:

> 1 - pchisq(51.012-50.826,1)
[1] 0.6662668

Interpretation of the estimated coefficient for Age in the first model

The estimated coefficient for Age in the Poisson model is $\hat{\eta}_1 = 0.06869$; thus for each additional year of age at the start of the study we estimate a $100\,[e^{0.06869} - 1] = 7.11\%$ increase in the mean number of matings in the 8 year period. Expressed as a multiplicative increase this would be 1.0711.

For a 5 year difference in initial age we would expect a $100\,[e^{5 \times 0.06869} - 1] = 41.0\%$ increase in the number of matings in the following 8 year period. Expressed as a multiplicative increase this would be 1.410.

Find a 95% CI for the 5-year Age Effect
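One way to approach this exercise is to form the Wald interval for the Age coefficient, multiply its endpoints by 5, and exponentiate. The sketch below takes the estimate and standard error for Age from the summary(ele.glm) output; the variable names are chosen here for illustration.

```r
# Estimate and standard error for Age copied from the summary(ele.glm) output
est <- 0.06869
se  <- 0.01375
k   <- 5                                    # age difference in years

# Wald interval for the coefficient, scaled to a 5-year difference
ci_log <- k * (est + c(-1, 1) * 1.96 * se)

effect_5yr <- exp(k * est)                  # estimated multiplicative effect
ci_5yr     <- exp(ci_log)                   # CI obtained by exponentiating endpoints

round(c(estimate = effect_5yr, lower = ci_5yr[1], upper = ci_5yr[2]), 3)
```

Under these inputs the interval for the multiplicative effect runs from roughly 1.23 to 1.61, i.e. an estimated 23% to 61% increase in the mean number of matings for an elephant 5 years older at the start of the study.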


Example 2: Reproduction of Ceriodaphnia Organisms

In this study the number of Ceriodaphnia organisms is counted in a controlled environment in which reproduction occurs among the organisms. Two different strains of organisms are involved, and the environment is changed by adding varying amounts of a chemical component intended to impair reproduction. Initial population sizes are the same.

> head(Ceriodaph)
  Cerio Conc Strain
1    82  0.0      0
2   106  0.0      0
3    63  0.0      0
4    99  0.0      0
5   101  0.0      0
6    45  0.5      0
…     …    …      …

> cerio.glm = glm(Cerio~Conc+Strain,family="poisson",data=Ceriodaph)
> summary(cerio.glm)

Call:
glm(formula = Cerio ~ Conc + Strain, family = "poisson", data = Ceriodaph)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.6800  -0.6766   0.1528   0.6787   2.0774

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  4.45464    0.03914 113.819  < 2e-16 ***
Conc        -1.54308    0.04660 -33.111  < 2e-16 ***
Strain1     -0.27497    0.04837  -5.684 1.31e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1





Interpret the coefficients:
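As a starting point for this exercise, the multiplicative effects can be obtained by exponentiating the estimates in the summary(cerio.glm) output. This is a sketch of the arithmetic only; the units of Conc are not stated in this section, so the effect is described per one-unit increase.

```r
# Estimates copied from the summary(cerio.glm) output
b_conc   <- -1.54308
b_strain <- -0.27497

mult_conc   <- exp(b_conc)                  # multiplicative effect per unit of Conc
mult_strain <- exp(b_strain)                # strain 1 relative to strain 0

round(c(conc = mult_conc, strain1 = mult_strain), 3)
round(100 * (c(conc = mult_conc, strain1 = mult_strain) - 1), 1)   # percent change
```

Read this way, each one-unit increase in concentration multiplies the mean count by about 0.21 (roughly a 79% decrease) at a fixed strain, and strain 1 has a mean count about 0.76 times that of strain 0 (roughly 24% lower) at a fixed concentration.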

> par(mfrow=c(2,2))
> plot(cerio.glm)
> par(mfrow=c(1,1))

> plot(Conc[Strain=="0"],Cerio[Strain=="0"],col="blue",xlab="Concentration",ylab="Ceriodaphnia Count")
> lines(Conc[Strain=="0"],fitted(cerio.glm)[Strain=="0"],col="blue")
> points(Conc[Strain=="1"],Cerio[Strain=="1"],pch="X",col="red")
> lines(Conc[Strain=="1"],fitted(cerio.glm)[Strain=="1"],col="red",lty=2)
> legend(locator(),legend=c("Strain 0","Strain 1"),col=c("blue","red"),pch="oX",lty=1:2)


> mmps(cerio.glm)

Consider adding an interaction term, although there is no visual evidence to suggest it is needed.

> cerio2.glm = glm(Cerio~Conc*Strain,family="poisson",data=Ceriodaph)
> summary(cerio2.glm)

Call:
glm(formula = Cerio ~ Conc * Strain, family = "poisson", data = Ceriodaph)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.84251  -0.64872   0.01169   0.70636   1.82195

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)   4.48110    0.04350 103.008  < 2e-16 ***
Conc         -1.59787    0.06244 -25.592  < 2e-16 ***
Strain1      -0.33667    0.06704  -5.022 5.11e-07 ***
Conc:Strain1  0.12534    0.09385   1.336    0.182
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
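With the interaction in the model, the concentration effect differs by strain: the slope on the log scale is the Conc coefficient for strain 0 and the Conc coefficient plus the interaction coefficient for strain 1. The sketch below works this out from the estimates in the summary(cerio2.glm) output.

```r
# Estimates copied from the summary(cerio2.glm) output
b_conc <- -1.59787
b_int  <-  0.12534                          # Conc:Strain1 interaction

slope_strain0 <- b_conc                     # log-scale slope when Strain = 0
slope_strain1 <- b_conc + b_int             # log-scale slope when Strain = 1

# Multiplicative effect of a one-unit increase in Conc, by strain
round(exp(c(strain0 = slope_strain0, strain1 = slope_strain1)), 3)
```

The two multiplicative effects are close to each other, which is consistent with the Wald test for the interaction term not being significant (p = 0.182).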





Poisson Regression in JMP

Example 1: Age of Elephants and Number of Matings (Data File: Elephants.JMP)

Select Analyze > Fit Model then change options in the dialog box as shown below:


Comments:

Interpretation of Coefficient for Age


Example 2: Ceriodaphnia (Data File: Cerio.JMP)

A portion of the data table is shown below:

To fit a Poisson regression model for the number of ceriodaphnia as a function of the concentration and strain we again use Analyze > Fit Model to set up the model as shown below:

The results of the model fit are shown below:


We can clearly see from the regression plot that the ceriodaphnia count decreases with concentration and that strain 1 has higher count in general, with the largest difference at lower concentrations. Next we find the risk ratios associated with the concentration and strain.

Interpretation of Model Effects

Risk Ratio CI’s


Example 3: Caesarean Sections in Private vs. Public Hospitals

Births by caesarean section are said to be more frequent in private (fee paying) hospitals as compared to non-fee paying public hospitals. Data about total annual births and the number of caesarean sections carried out were obtained from the records of 4 private hospitals and 16 public hospitals. These are tabulated in the JMP Data Table shown below:

As the number of caesareans performed at a hospital is clearly a count of the # of occurrences a Poisson regression for these data is appropriate. Clearly the number of caesareans is going to be dependent on the number of births performed at the hospital, but what is of interest is what role the type of hospital plays. We will therefore fit a Poisson regression model for the number of caesarean births using the number of total births and hospital type as covariates.


To do this in JMP use Analyze > Fit Model to perform the Poisson regression as shown, by first selecting Generalized Linear Model and then choosing Poisson as the Distribution.

The resulting output for the fitted model is shown below:

The Regression Plot above clearly shows that public hospitals, not private hospitals, perform more caesareans when taking the number of births into account. However, as all private hospitals had fewer than 500 births total, extrapolation into total births in the thousands is problematic. The Effect Tests section of the output shows that both Hospital Type and Births are statistically significant (p < .0001 for each). Finally we consider the parameter estimates section and examine how we interpret the estimated coefficients.

As Birth is a continuous predictor we need to pick an incremental value (c) to use when interpreting the risk ratio. For example, if we use c = 100 births the estimated risk ratio is given by:

$$e^{0.003261 \times 100} = e^{0.3261} = 1.3855$$


Thus when comparing two hospitals, regardless of hospital type, a hospital with 100 more live births will have 1.3855 times as many caesarean sections. For a difference of 1000 live births the risk ratio would be $e^{3.261} = 26.08$, i.e. 26.08 times as many caesarean sections, etc.

Finally looking at the hospital type effect, adjusting for number of births, we find the risk ratio for public vs. private hospitals is given by:

$$e^{2 \times 0.5226} = e^{1.0452} = 2.844$$

Note: due to contrast coding we need to multiply by 2!

Thus when comparing public hospitals vs. private hospitals with the same number of births, we estimate that the number of caesareans performed at the public hospital will be 2.84 times larger than the number of caesareans performed at the private hospital. Thus the researchers' initial belief that private hospitals would perform more caesareans is not supported by these data; in fact the exact opposite appears to be true. A 95% CI for the risk ratio (public vs. private) requires first finding a Wald interval for the coefficient for hospital in the population of all hospitals and then exponentiating the endpoints to obtain a 95% CI for the risk ratio.

95% CI for the parameter

$$2 \times (0.5226 \pm 1.96 \times 0.1364) = (0.5105, 1.580)$$

Note: we again have multiplied by 2 due to coding.

95% CI for the risk ratio (by exponentiating endpoints of CI for parameter)

$$(e^{0.5105}, e^{1.580}) = (1.67, 4.85)$$

We estimate that when comparing hospitals with the same number of births that public hospitals will perform between 1.67 and 4.85 times as many caesarean sections as private hospitals.
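The risk ratio arithmetic for this example can be reproduced in R from the estimates quoted in the text. This is a sketch of the calculations only; it assumes, as the notes above indicate, that hospital type is contrast coded so that the two types differ by twice the reported coefficient, and the variable names are chosen here for illustration.

```r
# Estimates quoted in the text (from the JMP parameter estimates)
b_births <- 0.003261                        # coefficient per additional birth
b_type   <- 0.5226                          # hospital type coefficient (contrast coded)
se_type  <- 0.1364

rr_100  <- exp(100 * b_births)              # hospital with 100 more births
rr_1000 <- exp(1000 * b_births)             # hospital with 1000 more births

# Public vs. private: the contrast coding puts the two types 2 * b_type apart
rr_type <- exp(2 * b_type)
ci_type <- exp(2 * (b_type + c(-1, 1) * 1.96 * se_type))

round(c(rr_100 = rr_100, rr_1000 = rr_1000, rr_type = rr_type), 3)
round(ci_type, 2)
```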


Appendix: Code for some useful R functions for OLS Regression

# Score test for non-constant variance in an unweighted linear model
NCV.test = function (model, var.formula, data = NULL, subset, na.action) {
  if (!is.null(weights(model)))
    stop("requires unweighted linear model")
  if ((!is.null(class(model$na.action))) && class(model$na.action) == "exclude")
    model <- update(model, na.action = na.omit)
  sumry <- summary(model)
  residuals <- residuals(model)
  S.sq <- df.residual(model) * (sumry$sigma)^2/sum(!is.na(residuals))
  U <- (residuals^2)/S.sq          # squared residuals scaled by the variance estimate
  if (missing(var.formula)) {
    # Default: variance modeled as a function of the fitted values
    mod <- lm(U ~ fitted.values(model))
    varnames <- "fitted.values"
    var.formula <- ~fitted.values
    df <- 1
  } else {
    if (missing(na.action)) {
      na.action <- if (is.null(model$na.action))
        options()$na.action
      else parse(text = paste("na.", class(model$na.action), sep = ""))
    }
    # Build a model frame for the variance formula from the original call
    m <- match.call(expand.dots = FALSE)
    if (is.matrix(eval(m$data, sys.frame(sys.parent()))))
      m$data <- as.data.frame(data)
    m$formula <- var.formula
    m$var.formula <- m$model <- m$... <- NULL
    m[[1]] <- as.name("model.frame")
    mf <- eval(m, sys.frame(sys.parent()))
    response <- attr(attr(mf, "terms"), "response")
    if (response)
      stop(paste("Variance formula contains a response."))
    mf$U <- U
    .X <- model.matrix(as.formula(paste("U~", as.character(var.formula)[2], "-1")),
                       data = mf)
    mod <- lm(U ~ .X)
    df <- sum(!is.na(coefficients(mod))) - 1
  }
  # Score statistic: half the regression sum of squares of the auxiliary fit
  SS <- anova(mod)$"Sum Sq"
  RegSS <- sum(SS) - SS[length(SS)]
  Chisq <- RegSS/2
  result <- list(formula = var.formula, formula.name = "Variance",
                 ChiSquare = Chisq, Df = df, p = 1 - pchisq(Chisq, df),
                 test = "Non-constant Variance Score Test")
  class(result) <- "chisq.test"
  result
}


# Plot of sqrt(|residuals|) vs. fitted values, with the NCV test p-value in the title
ncv.plot = function (fit) {
  temp <- NCV.test(fit)
  p <- temp$p
  e <- sqrt(abs(resid(fit)))
  yhat <- fitted(fit)
  plot(yhat, e, xlab = "Fitted Values", ylab = "Sqrt. Abs. Residuals",
       main = paste("Non-Constant Variance Plot ~ NCV Test (p=", signif(p, 4), ")"))
  lines(lowess(yhat, e), lty = 1, col = "Blue")
}

# Residual diagnostic plots for an lm fit.
# Studresid() is used below but is not listed in this appendix.
Resplot = function (lm1, lms = summary(lm1)) {
  par(mfrow = c(2, 2), pty = "m")
  y <- resid(lm1)
  # Normal probability plot of the studentized residuals
  qqnorm(Studresid(lm1), main = "Normal Probability Plot", ylab = "Residuals")
  abline(0, sqrt(var(Studresid(lm1))))
  # Studentized residuals vs. fitted values with lowess trend and spread bands
  plot(fitted(lm1), Studresid(lm1), xlab = "Fitted Values",
       ylab = "Studentized Residuals",
       main = "Plot of Studentized Residuals vs. Fitted", cex = 0.65)
  x <- fitted(lm1)
  y <- Studresid(lm1)
  f <- 0.5
  xs <- sort(x, index = T)
  x <- xs$x
  ix <- xs$ix
  y <- y[ix]
  trend <- lowess(x, y, f)
  e2 <- (y - trend$y)^2
  scatter <- lowess(x, e2, f)
  uplim <- trend$y + sqrt(abs(scatter$y))
  lowlim <- trend$y - sqrt(abs(scatter$y))
  lines(trend$x, trend$y, col = "Blue")
  lines(scatter$x, uplim, col = "Red")
  lines(scatter$x, lowlim, col = "Red")
  abline(h = 0, lty = 2, col = 2)
  # Sqrt of absolute studentized residuals vs. fitted values
  plot(fitted(lm1), sqrt(abs(Studresid(lm1))), main = "Loess Fit of Residuals",
       ylab = "Absolute Stud. Residuals", xlab = "Fitted Values", cex = 0.7)
  lines(lowess(fitted(lm1), sqrt(abs(Studresid(lm1)))), lty = 1, col = 3)
  abline(h = mean(sqrt(abs(Studresid(lm1)))), col = "blue", lty = 3)
  # Side-by-side quantile plots of centered fitted values and residuals
  par(mfrow = c(1, 2))
  par(ask = T)
  yl <- c(min(resid(lm1), fitted(lm1) - mean(fitted(lm1))),
          max(resid(lm1), fitted(lm1) - mean(fitted(lm1))))
  fit <- fitted(lm1)
  p <- sort(fit - mean(fit))
  pp <- ppoints(p)
  res <- resid(lm1)
  pr <- sort(res)
  ppr <- ppoints(pr)
  plot(pp, p, pch = "o", xlim = c(0, 1), ylim = yl, xlab = "f-value",
       ylab = "", main = "Fitted values", cex = 0.7)
  plot(ppr, pr, pch = "o", xlim = c(0, 1), ylim = yl, xlab = "f-value",
       ylab = "", main = "Residuals", cex = 0.7)
  par(mfrow = c(1, 1))
  par(ask = F)
  invisible()
}
