
Introduction to Generalized Linear Models

STAC51: Categorical Data Analysis

Mahinda Samarakoon

March 23, 2016


Table of contents

1. Introduction to Generalized Linear Models


Introduction to Generalized Linear Models

In ordinary regression models, we model the means of Normal random variables as functions of some predictors (independent variables).

Recall that the ordinary regression model is given by

$Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i$

where the $\varepsilon_i$ are independent $N(0, \sigma^2)$.

This implies

$E(Y_i) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}.$

This model assumes that $Y_i$ has a Normal distribution.

What if this is not true? For example, $Y$ can be a nominal categorical variable, or $Y$ can be a Poisson random variable; there are many other possibilities.


Generalized linear models (GLMs) extend ordinary regression models to encompass non-normal response variables and modeling functions of the mean.

We can use these models to investigate relationships (associations) among categorical and continuous variables.

They have three components (illustrated in the sketch after this list):

- Random component: identifies the response variable Y and its probability distribution.
- Systematic component: identifies the explanatory variables used in a linear predictor function.
- Link function: specifies the function of E(Y) that the model equates to the linear predictor.
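As a quick illustration (my sketch, not from the slides; the data-frame and variable names are placeholders), the arguments of R's glm() correspond directly to the three components:

# Hypothetical sketch: how a glm() call encodes the three GLM components.
# mydata, x1, x2, y are placeholder names, not from the slides.
fit <- glm(y ~ x1 + x2,                       # systematic component: the linear predictor
           family = binomial(link = "logit"), # random component (binomial) plus link (logit)
           data = mydata)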


Introduction to Generalized Linear Models: Random component

The random component of a GLM consists of a response variable $Y$ with independent observations $(y_1, \ldots, y_N)$ from a distribution in the natural exponential family.

This family has probability density function or mass function of the form

$f(y_i; \theta_i) = a(\theta_i)\, b(y_i) \exp\{y_i Q(\theta_i)\}$   (1)

Some important distributions, including the Poisson and binomial, are in this family.

The value of the parameter $\theta_i$ varies with $i$.

The parameter $Q(\theta)$ is called the natural parameter.

Note that there is a more general formula defining the exponential family, but this is sufficient for the discrete data we discuss here.


Introduction to Generalized Linear Models: Systematic component

The systematic component of a GLM relates a vector $(\eta_1, \ldots, \eta_N)$ to the explanatory variables through a linear model. Let $x_{ij}$ denote the value of predictor $j$ ($j = 1, \ldots, p$) for subject $i$. Then

$\eta_i = \sum_j \beta_j x_{ij}, \quad i = 1, \ldots, N.$

This linear combination of explanatory variables is called the linear predictor.


Introduction to Generalized Linear Models: Link function

The link function connects the random and systematic components.

The model links $\mu_i = E(Y_i)$ to $\eta_i$ by $\eta_i = g(\mu_i)$, where the link function $g$ is a monotonic, differentiable function. Thus, $g$ links $E(Y_i)$ to the explanatory variables through the formula

$g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}, \quad i = 1, \ldots, N.$
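As a small check (my addition, using base R's make.link() from the stats package), we can see that such a $g$ and its inverse behave as described:

lg <- make.link("logit")  # g(mu) = log(mu / (1 - mu))
mu <- 0.7
eta <- lg$linkfun(mu)     # g maps the mean onto the linear-predictor scale
lg$linkinv(eta)           # g^{-1} maps it back; returns 0.7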


The link function $g(\mu) = \mu$, called the identity link, has $\eta_i = \mu_i$; this gives ordinary regression with normally distributed $Y$.

The link function that transforms the mean to the natural parameter is called the canonical link.

That is, $g(\mu_i) = Q(\theta_i)$ and

$Q(\theta_i) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}, \quad i = 1, \ldots, N.$

Use of the canonical link has advantages (but is not mandatory).


Example: Binomial Logit Models for Binary Data

For binary data, $P(Y = 1) = \pi$ and $P(Y = 0) = 1 - \pi$.

$Y$ has a Bernoulli distribution.

$\mu = E(Y) = \pi$.

We can express the probability mass function as

$f(y; \pi) = \pi^y (1-\pi)^{1-y} = (1-\pi)\,[\pi/(1-\pi)]^y$   (2)
$\qquad\quad\; = (1-\pi) \exp\!\left[y \log\frac{\pi}{1-\pi}\right]$   (3)

for $y = 0$ and $1$.

This is a natural exponential family, identifying $\theta$ with $\pi$: $a(\pi) = 1 - \pi$, $b(y) = 1$, and $Q(\pi) = \log\frac{\pi}{1-\pi}$.

The natural parameter $Q(\pi) = \log\frac{\pi}{1-\pi}$ is the log odds of response 1 (i.e., the log odds of $Y = 1$), the logit of $\pi$.

GLMs with this canonical link function are called logistic regression models, or sometimes simply logit models.


Question: Can we use the ordinary regression model for binary data? That regression model is

$E(Y_i) = \pi_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}, \quad i = 1, \ldots, N.$

The problem with this model is that $\pi_i$ is a probability (i.e., taking values between 0 and 1), but linear functions take values over the entire real line.

This model also doesn't satisfy the usual assumptions of the ordinary regression model: $Y$ does not have a Normal distribution, and $\mathrm{Var}(Y_i) = \pi_i(1 - \pi_i)$ depends on $i$, so $\mathrm{Var}(Y)$ is not constant.


Example: Loglinear Models for Count Data

The simplest distribution for count data is the Poisson distribution.

The probability mass function of the Poisson distribution is

$f(y; \mu) = \frac{e^{-\mu} \mu^y}{y!} = e^{-\mu} \left(\frac{1}{y!}\right) \exp[y \log(\mu)], \quad y = 0, 1, \ldots$

This is a natural exponential family with $\theta = \mu$, $a(\mu) = e^{-\mu}$, $b(y) = 1/y!$, and $Q(\mu) = \log(\mu)$.

The natural parameter is $\log(\mu)$, and so the canonical link is the log link.

The model using the log link is

$\log(\mu_i) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}, \quad i = 1, \ldots, N.$   (4)

This is called a Poisson loglinear model.


Logistic regression model

To simplify the discussion, let's use only one explanatory variable, $x$, for predicting the probability of success, $\pi(x)$.

The logistic regression model for this case is

$\log \frac{\pi(x)}{1 - \pi(x)} = \alpha + \beta x$   (5)

or

$\pi(x) = \frac{\exp(\alpha + \beta x)}{1 + \exp(\alpha + \beta x)}$   (6)

Note: $F(x) = \frac{e^x}{1 + e^x}$ is the c.d.f. of the standard logistic distribution, and so the logistic regression model can be written as $\pi(x) = F(\alpha + \beta x)$, where $F$ is the c.d.f. of the standard logistic distribution.
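A quick numerical check (my addition, using only base R): plogis() is the standard logistic c.d.f. F, so it reproduces equation (6):

alpha <- 1; beta <- 0.5; x <- 2
exp(alpha + beta*x) / (1 + exp(alpha + beta*x))  # pi(x) from (6): 0.8807971
plogis(alpha + beta*x)                           # F(alpha + beta*x): same value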


Graph of π(x) vs x for α = 1 and β = 0.5

#R code for plotting the graph of pi vs x
alpha <- 1
beta1 <- 0.5
curve(expr = exp(alpha+beta1*x)/(1+exp(alpha+beta1*x)),
      from = -15, to = 15, col = "red",
      main = expression(pi(x) == frac(e^{alpha+beta*x}, 1+e^{alpha+beta*x})),
      xlab = "x", ylab = expression(pi(x)),
      panel.first = grid(nx = NULL, ny = NULL, col = "gray", lty = "dotted"))


Graph of π(x) vs x for α = 1 and β = −0.5


Logistic regression model with more than one independent variable

This model can be generalized to more than one independent variable:

$\log \frac{\pi(\mathbf{x})}{1 - \pi(\mathbf{x})} = \alpha + \beta_1 x_1 + \cdots + \beta_p x_p$   (7)

or

$\pi(\mathbf{x}) = \frac{\exp(\alpha + \beta_1 x_1 + \cdots + \beta_p x_p)}{1 + \exp(\alpha + \beta_1 x_1 + \cdots + \beta_p x_p)}$   (8)


Interpretation of β’s

In the model with one independent variable, $\alpha$ represents the log-odds that $Y = 1$ when $x = 0$, and $\beta$ represents the increase in the log-odds that $Y = 1$ when $x$ increases by one unit.

In the model with more than one independent variable, $\alpha$ represents the log-odds that $Y = 1$ when $x_1 = \cdots = x_p = 0$, and $\beta_k$ represents the increase in the log-odds that $Y = 1$ when $x_k$ increases by one unit, holding the other $x$ variables fixed.


For example, in the aspirin study we found that the odds of a heart attack in the placebo group are 1.83 times those in the aspirin group. In this example we can take $x = 1$ as the placebo group and $x = 0$ as the aspirin group; $y = 1$ means the subject had a heart attack and $y = 0$ means they did not.

Substituting $x = 0$ in (5), we get the log odds for the aspirin group:

$\log(\mathrm{odds}(0)) = \log \frac{\pi(0)}{1 - \pi(0)} = \alpha$

Substituting $x = 1$, we get the log odds for the placebo group:

$\log(\mathrm{odds}(1)) = \log \frac{\pi(1)}{1 - \pi(1)} = \alpha + \beta$

and so

$\beta = \log\!\left(\frac{\mathrm{Odds}(1)}{\mathrm{Odds}(0)}\right),$

i.e., $e^\beta$ represents the odds ratio.
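As a numeric check (my addition), using the aspirin-study counts that appear in Example 2 below (189/10845 heart attacks in the placebo group, 104/10933 in the aspirin group):

odds_placebo <- 189 / 10845        # odds of a heart attack, placebo group (x = 1)
odds_aspirin <- 104 / 10933        # odds of a heart attack, aspirin group (x = 0)
odds_placebo / odds_aspirin        # odds ratio: about 1.83
log(odds_placebo / odds_aspirin)   # about 0.605; compare beta-hat in Example 2 below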


Parameter estimation

We use maximum likelihood methods to estimate the parameters (i.e., $\alpha$ and the $\beta$'s). This requires numerical methods.

We can use the R function glm() to estimate the parameters.


Logistic regression model: Example

The R code below fits a logistic regression model for data from an example in Kutner et al. (2004). The example is based on a study of the effect of computer programming experience on the ability to complete a complex programming task, including debugging, within a specified time. Twenty-five persons were selected for the study. They had varying amounts of programming experience (measured in months of experience).


> # Example p565 Kutner et al
> data=read.table("C:/Users/Mahinda/Desktop/CH14TA01.txt", header=F)
> experience <- data[,1]
> task <- data[,2]
> cbind(experience, task)
      experience task
 [1,]         14    0
 [2,]         29    0
 [3,]          6    0
 [4,]         25    1
 [5,]         18    1
 [6,]          4    0
 [7,]         18    0
 [8,]         12    0
 [9,]         22    1
[10,]          6    0
[11,]         30    1
[12,]         11    0
[13,]         30    1
[14,]          5    0
[15,]         20    1
[16,]         13    0
[17,]          9    0
[18,]         32    1
[19,]         24    0
[20,]         13    1
[21,]         19    0
[22,]          4    0
[23,]         28    1
[24,]         22    1
[25,]          8    1


> model1 = glm(task ~ experience, family=binomial)
> summary(model1)

Call:
glm(formula = task ~ experience, family = binomial)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.8992 -0.7509 -0.4140  0.7992  1.9624

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.05970    1.25935  -2.430   0.0151 *
experience   0.16149    0.06498   2.485   0.0129 *
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 34.296 on 24 degrees of freedom
Residual deviance: 25.425 on 23 degrees of freedom
AIC: 29.425

Number of Fisher Scoring iterations: 4

> # for every one-month increase in experience, the estimated odds of being
> # able to perform the task are multiplied by exp(coef(model1)[2]) = exp(0.16149), about 1.175


The estimated probability that a person will be able to perform the task is

$\pi(x_i) = \frac{\exp(\hat{\alpha} + \hat{\beta} x_i)}{1 + \exp(\hat{\alpha} + \hat{\beta} x_i)} = \frac{\exp(-3.05970 + 0.16149\, x_i)}{1 + \exp(-3.05970 + 0.16149\, x_i)}$

For example, the estimated probability that a person with 24 months of experience will be able to perform the task is

$\pi(24) = \frac{\exp(-3.05970 + 0.16149 \times 24)}{1 + \exp(-3.05970 + 0.16149 \times 24)} = 0.6934.$
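The same estimate can be obtained with predict(), which applies the inverse link for us (a sketch, assuming model1 is the fit shown above):

predict(model1, newdata = data.frame(experience = 24), type = "response")
# returns about 0.6934, matching the hand calculation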


> pihat = model1$fitted.values # estimated probabilities
> cbind(experience, task, pihat)
   experience task      pihat
1          14    0 0.31026237
2          29    0 0.83526292
3           6    0 0.10999616
4          25    1 0.72660237
5          18    1 0.46183704
6           4    0 0.08213002
7          18    0 0.46183704
8          12    0 0.24566554
9          22    1 0.62081158
10          6    0 0.10999616
11         30    1 0.85629862
12         11    0 0.21698039
13         30    1 0.85629862
14          5    0 0.09515416
15         20    1 0.54240353
16         13    0 0.27680234
17          9    0 0.16709980
18         32    1 0.89166416
19         24    0 0.69337941
20         13    1 0.27680234
21         19    0 0.50213414
22          4    0 0.08213002
23         28    1 0.81182461
24         22    1 0.62081158
25          8    1 0.14581508


> # Plot of estimated probability vs experience
> Estimated_prob <- function(experience) { exp(model1$coefficients[1] +
+     model1$coefficients[2]*experience) /
+     (1+exp(model1$coefficients[1]+model1$coefficients[2]*experience)) }
> curve(Estimated_prob, from=0, to=40, xlab="experience",
+     ylab="Estimated Probability")
> abline(h=(seq(0,1,by=0.02)), col="blue", lty="dotted")
> abline(v=(seq(0,40,1)), col="blue", lty="dotted")


Logistic regression model: Example 2

The R code below fits a logistic regression model for the (contingency-table) data from the aspirin example above:

> x = c(rep(1, 189+10845), rep(0, 104+10933))
> y = c(rep(1, 189), rep(0, 10845), rep(1, 104), rep(0, 10933))
> length(x)
[1] 22071
> length(y)
[1] 22071
> model1 = glm(y ~ x, family=binomial)
> summary(model1)


Call:
glm(formula = y ~ x, family = binomial)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-0.1859 -0.1859 -0.1376 -0.1376  3.0544

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.65515    0.09852 -47.250  < 2e-16 ***
x            0.60544    0.12284   4.929 8.28e-07 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3114.7 on 22070 degrees of freedom
Residual deviance: 3089.3 on 22069 degrees of freedom
AIC: 3093.3


Probit regression model

Another model for a Bernoulli random component $Y$ is the probit regression model. This model uses the inverse standard normal c.d.f. $\Phi^{-1}$ as the link function. That is, the model is

$\pi(x) = \Phi(\alpha + \beta x)$   (9)

or

$\Phi^{-1}(\pi(x)) = \alpha + \beta x.$   (10)

The curve has a similar appearance to the logistic regression curve.


The curves for $\beta = 0.5$ and $\beta = -0.5$, both with $\alpha = 1$, are shown below:



Which model to use?

This is not an easy question.

One way to decide is to try several models and see which one fits the data best.

The logit is easier to interpret, through the use of odds and odds ratios, and so it is used more often.


Probit regression model: Example

> # Example p565 Kutner et al (same data as in the logistic regression example above)
> data=read.table("C:/Users/Mahinda/Desktop/CH14TA01.txt", header=F)
> experience <- data[,1]
> task <- data[,2]


Probit regression model: Example

> model2 = glm(task ~ experience, family=binomial(link = probit))
> summary(model2)

Call:
glm(formula = task ~ experience, family = binomial(link = probit))

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.8959 -0.7579 -0.3907  0.8101  1.9691

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.83787    0.69012  -2.663  0.00774 **
experience   0.09686    0.03565   2.717  0.00659 **
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 34.296 on 24 degrees of freedom
Residual deviance: 25.380 on 23 degrees of freedom


Logistic and probit regression models: Example

> pihatlogit = model1$fitted.values # estimated probabilities (logit)
> pihatprobit = model2$fitted.values # estimated probabilities (probit)
> cbind(experience, task, pihatlogit, pihatprobit)
   experience task pihatlogit pihatprobit
1          14    0 0.31026237  0.31495382
2          29    0 0.83526292  0.83422869
3           6    0 0.10999616  0.10442754
4          25    1 0.72660237  0.72024848
5          18    1 0.46183704  0.46238565
6           4    0 0.08213002  0.07346854
7          18    0 0.46183704  0.46238565
8          12    0 0.24566554  0.24965602
9          22    1 0.62081158  0.61524129
10          6    0 0.10999616  0.10442754
11         30    1 0.85629862  0.85721025
12         11    0 0.21698039  0.21992975
13         30    1 0.85629862  0.85721025
14          5    0 0.09515416  0.08793556
15         20    1 0.54240353  0.53954616
16         13    0 0.27680234  0.28139084
17          9    0 0.16709980  0.16698550
18         32    1 0.89166416  0.89645092
19         24    0 0.69337941  0.68677231
20         13    1 0.27680234  0.28139084
21         19    0 0.50213414  0.50097045
22          4    0 0.08213002  0.07346854
23         28    1 0.81182461  0.80898266
24         22    1 0.62081158  0.61524129
25          8    1 0.14581508  0.14389004


Probit regression model: Example

The two estimated regression curves (logistic and probit) are shown below.

> # Plotting estimated regression curves
> # Plot of estimated probability vs experience
> Estimated_prob <- function(experience) { exp(model1$coefficients[1] +
+     model1$coefficients[2]*experience) / (1+exp(model1$coefficients[1]+
+     model1$coefficients[2]*experience)) }
> # Or we can use
> # Estimated_prob <- function(experience) { plogis(model1$coefficients[1] +
> #     model1$coefficients[2]*experience)}
> curve(Estimated_prob, from=0, to=40, col = "green", xlab="experience",
+     ylab="Estimated Probability")
> abline(h=(seq(0,1,by=0.02)), col="blue", lty="dotted")
> abline(v=(seq(0,40,1)), col="blue", lty="dotted")
> #---------------------------------------------------------------------------
> par(new = TRUE)
> # Plotting the estimated probability from the probit model
> Estimated_prob2 <- function(experience) { pnorm(model2$coefficients[1] +
+     model2$coefficients[2]*experience)}
> curve(Estimated_prob2, from=0, to=40, col = "red", axes = FALSE,
+     xlab = "", ylab = "")
> legend(locator(1), legend = c("Logit", "Probit"), lty = c(1,1),
+     col = c("green", "red"))
> # locator(1) places the legend at the place you click on the graph


Generalized linear models for count data

Counts of possible outcomes are non-negative integers.

These are often modeled as Poisson random variables.

A Poisson loglinear GLM assumes a Poisson distribution for the response and the log function as the link function. So the linear predictor is related to the mean by

$\log(\mu(x)) = \alpha + \beta x$   (11)

or

$\mu(x) = \exp(\alpha + \beta x) = e^{\alpha} (e^{\beta})^{x}$   (12)

Interpretation of $\beta$: a unit increase in $x$ has a multiplicative impact of $e^{\beta}$; i.e., the mean of $Y$ at $x + 1$ is equal to the mean at $x$ times $e^{\beta}$.
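A quick check of this multiplicative interpretation (my addition, using the estimates from the crab-data fit shown below):

alpha_hat <- -3.30476; beta_hat <- 0.16405      # estimates from the crab-data fit below
mu <- function(x) exp(alpha_hat + beta_hat * x) # fitted mean function
mu(26) / mu(25)   # ratio of means for a one-unit increase in width
exp(beta_hat)     # the same number, about 1.178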


Horseshoe Crab Mating Example (p. 123)

For each female $i$, assume the number of satellites, $Y_i$, has a Poisson distribution with mean $\mu_i$ depending on the female's shell width ($x_i$). We will model the expected number of satellites with the following model:

$\log(\mu_i) = \alpha + \beta x_i.$

The R code below fits the model for the crab data:

> # Log-linear model example
> # Example p 123
> crab=read.table("C:/Users/Mihinda/Desktop/crab.txt", header=T) #the data file
> model3 <- glm(formula = satellite ~ width, data = crab, family = poisson(link = log))
> summary(model3)

Call:
glm(formula = satellite ~ width, family = poisson(link = log),
    data = crab)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-2.8526 -1.9884 -0.4933  1.0970  4.9221

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.30476    0.54224  -6.095  1.1e-09 ***
width        0.16405    0.01997   8.216  < 2e-16 ***


> # Predicting the mean response at given value(s) of x
> # Predict for widths 25 and 30
> predict.data<-data.frame(width = c(25, 30))
> # Predicted values for mu
> pred1 <- predict(model3, newdata = predict.data, type = "response", se = TRUE)
> pred1
$fit
       1        2
2.217477 5.035916

$se.fit
        1         2
0.1345945 0.3703386

> pred2 <- predict(model3, newdata = predict.data, se = TRUE)
> # This gives predicted values for log(mu)
> pred2
$fit
        1         2
0.7963699 1.6165954

$se.fit
         1          2
0.06069713 0.07353947


> alpha<-0.05
> lower<-pred1$fit-qnorm(1-alpha/2)*pred1$se
> upper<-pred1$fit+qnorm(1-alpha/2)*pred1$se
> data.frame(predict.data, mu.hat = round(pred1$fit,3), lower = round(lower,3), upper = round(upper,3))
  width mu.hat lower upper
1    25  2.217 1.954 2.481
2    30  5.036 4.310 5.762


> # Plot of estimated mean count vs width
> # (the coefficients must come from the Poisson fit model3, not the logistic model1)
> Estimated_count <- function(width) { exp(model3$coefficients[1] +
+     model3$coefficients[2]*width) }
> curve(Estimated_count, from=20, to=35, xlab="Width",
+     ylab="Estimated Mean Count")
> abline(h=(seq(0,15,by=1)), col="blue", lty="dotted")
> abline(v=(seq(20,35,1)), col="blue", lty="dotted")


Overdispersion for Poisson GLMs

Count data often vary more than we would expect if the response distribution truly were Poisson.

The phenomenon of the data having greater variability than expected for a GLM is called overdispersion.


This might happen because the true distribution is a mixture of different Poisson distributions.

One remedy for this is to find more explanatory variables and add them to the model.

The negative binomial is a related distribution for count data that permits the variance to exceed the mean.

The probability mass function of the negative binomial distribution is given by

$f(y; k, \pi) = \binom{y + k - 1}{y} (1 - \pi)^{y} \pi^{k}, \quad y = 0, 1, 2, \ldots$   (13)

where $k > 0$ and $0 < \pi < 1$ are parameters.

A negative binomial random variable can be interpreted as the number of failures before the $k$th success.


The mean and the variance of this distribution are given by

$E(Y) = \mu = \frac{k(1 - \pi)}{\pi}$   (14)

and

$\mathrm{Var}(Y) = \frac{k(1 - \pi)}{\pi^2}.$

Note that $\frac{k}{\mu + k} = \pi$ and $\mu + \mu^2/k = \frac{k(1 - \pi)}{\pi^2} = \mathrm{Var}(Y)$.

Note that $E(Y) < \mathrm{Var}(Y)$.

$k$ is the (positive) dispersion parameter.

The smaller the dispersion parameter, the larger the variance compared to the mean. In R this parameter is denoted by $\theta$.

Note: Agresti uses $\gamma = 1/k$ as the dispersion parameter.
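A numerical check of (14) (my addition): R's dnbinom() uses the same failures-before-the-kth-success parameterization (size = k, prob = pi), so the mean and variance can be verified by direct summation, truncating where the tail mass is negligible:

k <- 0.9; p <- 0.4
y <- 0:10000                                          # truncation; tail mass is negligible here
m <- sum(y * dnbinom(y, size = k, prob = p))          # E(Y) by summation
v <- sum((y - m)^2 * dnbinom(y, size = k, prob = p))  # Var(Y) by summation
c(m, k*(1-p)/p)    # both 1.35
c(v, k*(1-p)/p^2)  # both 3.375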


This probability mass function can also be written as

$f(y; k, \mu) = \binom{y + k - 1}{y} \left(1 - \frac{k}{\mu + k}\right)^{y} \left(\frac{k}{\mu + k}\right)^{k}, \quad y = 0, 1, 2, \ldots$   (15)


Negative Binomial GLMs: Horseshoe Crab Mating Example

The glm function in R cannot fit negative binomial regression models. We can use the glm.nb function in the MASS package to estimate this model. The R code below uses glm.nb to estimate a negative binomial GLM for the crab data.

> # R code: negative binomial regression
> library(MASS)
> crab=read.table("C:/Users/Mihinda/Desktop/crab.txt", header=T) #the data file
> model3.nb<-glm.nb(formula = satellite ~ width, data = crab,
+     link = log)
> summary(model3.nb)

Call:
glm.nb(formula = satellite ~ width, data = crab, link = log,
    init.theta = 0.90456808)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.05251    1.17143  -3.459 0.000541 ***
width        0.19207    0.04406   4.360  1.3e-05 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

(Dispersion parameter for Negative Binomial(0.9046) family taken to be 1)


    Null deviance: 213.05 on 172 degrees of freedom
Residual deviance: 195.81 on 171 degrees of freedom
AIC: 757.29

Number of Fisher Scoring iterations: 1

              Theta: 0.905
          Std. Err.: 0.161


Statistical Inference and Model Checking for GLMs: Wald test

One test we are usually interested in is $H_0: \beta = 0$ against $H_a: \beta \neq 0$. For large $n$, MLEs are approximately Normal. In particular, $\hat{\beta} \sim N(\beta, \mathrm{AsVar}(\hat{\beta}))$ approximately, and so

$Z = \frac{\hat{\beta} - \beta}{SE} \xrightarrow{\text{approx}} N(0, 1),$

and this result can be used to calculate an approximate p-value (Wald test).

Consider the crab data. Test whether the number of satellites is independent of the width.

Solution: $z = 8.216$ and the p-value is $< 2 \times 10^{-16} < 0.05$, so we reject the null hypothesis. An approximate 95 percent confidence interval is $0.16405 \pm 1.96 \times 0.01997$.
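In R the Wald interval can be computed directly from the fitted model (a sketch, assuming model3 is the Poisson fit to the crab data from the earlier example):

b  <- coef(summary(model3))["width", "Estimate"]    # 0.16405
se <- coef(summary(model3))["width", "Std. Error"]  # 0.01997
b + c(-1, 1) * qnorm(0.975) * se                    # approximate 95% CI: (0.125, 0.203)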


Statistical Inference and Model Checking for GLMs: Likelihood Ratio test

We have discussed the LRT before; it can also be used here. Recall

$\Lambda = \frac{\text{maximum likelihood under the null hypothesis}}{\text{unrestricted maximum likelihood}}.$

For testing $H_0: \beta = 0$, the numerator is calculated assuming $\beta = 0$; thus the model fit to the data is only $g(\mu) = \alpha$, where $g(\mu)$ is the link function. The denominator is calculated without assuming $\beta = 0$, so the model fit to the data is $g(\mu) = \alpha + \beta x$. We know that for large $n$, $G^2 = -2\log(\Lambda)$ has an approximate chi-squared distribution. The degrees of freedom is the number of parameters in the unrestricted model minus the number of parameters in the model under the null hypothesis.


For example, for testing the null hypothesis $H_0: \beta = 0$, the degrees of freedom is 1.

The value of $G^2$ is not always given in software output.

Software often gives the "null deviance" and the "residual deviance".

These values are $G^2$ values for testing certain other hypotheses, but we can often use them to calculate the value of $G^2$ for our tests.

For example, the value of $G^2$ for testing $H_0: \beta = 0$ against $H_a: \beta \neq 0$ is simply (null deviance) − (residual deviance), as the sketch below shows.
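A minimal sketch in R (assuming model3 is the crab-data Poisson fit from earlier):

G2 <- model3$null.deviance - model3$deviance  # 632.79 - 567.88 = 64.91
df <- model3$df.null - model3$df.residual     # 172 - 171 = 1
pchisq(G2, df = df, lower.tail = FALSE)       # approximate p-value for H0: beta = 0
# anova(model3, test = "Chisq") reports the same comparison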


Null deviance ($G_1^2$) tests $H_0$: model with only $\alpha$ against $H_1$: saturated model.

Example, Poisson GLM: in a Poisson GLM, the saturated model assumes $Y_i \sim \text{Poisson}(\mu_i)$ for $i = 1, \ldots, n$, and the MLE of $\mu_i$ is $y_i$.

$G_1^2 = 2 \sum_{i=1}^{n} y_i \log\!\left(\frac{y_i}{\hat{\mu}_{0,i}}\right)$   (16)

where $\hat{\mu}_{0,i} = e^{\hat{\alpha}_0}$ and $\hat{\alpha}_0$ is the MLE of $\alpha$ in the model $\log \mu_i = \alpha$, $i = 1, \ldots, n$.

Residual deviance ($G_2^2$) tests $H_0$: model with only $\alpha$ and $\beta$ against $H_1$: saturated model.


Statistical Inference and Model Checking for GLMs: Likelihood Ratio test Example

> # Log-linear model example
> # Example p 123
> crab=read.table("C:/Users/Mihinda/Desktop/crab.txt", header=T) #the data file
> model3 <- glm(formula = satellite ~ width, data = crab, family = poisson(link = log))
> summary(model3)

Call:
glm(formula = satellite ~ width, family = poisson(link = log),
    data = crab)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-2.8526 -1.9884 -0.4933  1.0970  4.9221

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.30476    0.54224  -6.095  1.1e-09 ***
width        0.16405    0.01997   8.216  < 2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 632.79 on 172 degrees of freedom
Residual deviance: 567.88 on 171 degrees of freedom
AIC: 927.18


Crab data

$G^2 = 632.79 - 567.88 = 64.91$

Degrees of freedom = 172 − 171 = 1

The chi-square critical value at $\alpha = 0.05$ is 3.84.

We reject the null hypothesis $H_0: \beta = 0$.

The data show evidence that width has a significant effect on the number of satellites.


Residuals for GLMs (p. 140)

The Pearson residual for observation $i$ is defined by

$e_i = \frac{y_i - \hat{\mu}_i}{\sqrt{\hat{\mu}_i}}$   (18)

and the standardized residuals are defined by

$r_i = \frac{y_i - \hat{\mu}_i}{\sqrt{\hat{\mu}_i (1 - \hat{h}_i)}}$   (19)

where $\hat{h}_i$ is the $i$th diagonal element of the hat matrix

$H = W^{1/2} X (X^{T} W X)^{-1} X^{T} W^{1/2}.$   (20)

The $\hat{h}_i$'s are known as leverages.

The standardized residual has a distribution that is closer to a standard normal distribution than the Pearson residual.
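As an aside (my addition), R computes these standardized Pearson residuals directly via rstandard(), which divides by sqrt(1 − h_i) internally; this should match the manual computation in the example that follows (poissonReg is the Poisson fit defined there):

r_builtin <- rstandard(poissonReg, type = "pearson")  # standardized Pearson residuals
head(r_builtin)  # should match the r computed by hand in the example below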


Residuals for GLMs: Example

The R code below calculates the standardized residuals and creates residual plots.

> # R code: residuals for Poisson regression
> crab=read.table("C:/Users/Mihinda/Desktop/crab.txt", header=T) #the data file
> poissonReg <- glm(formula = satellite ~ width, data = crab,
+     family = poisson(link = log))
> e <-residuals(poissonReg, type="pearson")
> X<-model.matrix(poissonReg)
> muhat<-predict(poissonReg, type = "response")
> W <- diag(muhat)
> H<-(W^(1/2))%*%X%*%solve(t(X)%*%W%*%X)%*%t(X)%*%(W^(1/2))
> h <- diag(H)
> head(h)
[1] 0.009852370 0.006360719 0.006945761 0.019161622 0.014825698 0.008169498
> r <- e/sqrt(1-h)
> head(e)
        1         2         3         4         5         6
2.1463312 0.8582102 -1.5642375 -1.0726099 -1.5836582 0.5254940
> head(r)
        1         2         3         4         5         6
2.1569832 0.8609527 -1.5696984 -1.0830364 -1.5955298 0.5276537


The leverages, and hence the standardized residuals, can also be obtained using lm.influence:

> h<-lm.influence(poissonReg)$h
> head(h)
          1           2           3           4           5           6
0.009852370 0.006360719 0.006945761 0.019161622 0.014825698 0.008169498
> r <- e/sqrt(1-h)
> head(r)
        1         2         3         4         5         6
2.1569832 0.8609527 -1.5696984 -1.0830364 -1.5955298 0.5276537


The R code below creates the residual plots:

> #----------------------------------------------------------------
> # Standardized residual vs observation number
> plot(x = 1:length(r), y = r, xlab="Observation number",
+     ylab="Standardized residuals",
+     main = "Standardized residuals vs. observation number")
> abline(h = c(-3, 3), lty=3, col="red")
> #----------------------------------------------------------------
> par(mfrow = c(1,1))
> # Plot of residual vs width
> plot(x = crab$width, y = r, xlab="Width",
+     ylab="Standardized Pearson residuals",
+     main = "Standardized Pearson residuals vs. width")
> abline(h = c(-3, 3), lty=3, col="red")
> #----------------------------------------------------------------
> plot(x = crab$width, y = r, xlab="Width",
+     ylab="Standardized Pearson residuals",
+     main = "Standardized Pearson residuals vs. width", type = "n")
> text(x = crab$width, y = r,
+     labels = crab$satellite, cex=0.75)
> abline(h = c(-3,3), lty=3, col="red")


The R code below calculates the standardized residuals for the negative binomial model and creates residual plots:

> # R code: negative binomial regression
> library(MASS)
> crab=read.table("C:/Users/Mihinda/Desktop/crab.txt", header=T) #the data file
> model4.nb<-glm.nb(formula = satellite ~ width, data = crab, link = log)
> e.nb<-residuals(model4.nb, type="pearson")
> h.nb<-lm.influence(model4.nb)$h
> r.nb<-e.nb/sqrt(1-h.nb)
> par(mfrow = c(1,2))
> plot(x = 1:length(r.nb), y = r.nb, xlab="Obs. number",
+     ylab="Standardized residuals",
+     main = "Stand. residuals (Neg Bin model) vs. obs. number")
> abline(h = c(-3, 3), lty=3, col="red")
> plot(x = crab$width, y = r.nb,
+     xlab="Width", ylab="Standardized residuals",
+     main = "Stand. residuals (Neg Bin model) vs. width", type = "n")
> text(x = crab$width, y = r.nb, labels =
+     crab$satellite, cex=0.75)
> abline(h = c(-3, 3), lty=3, col="red")


Goodness of Fit: Pearson Chi-square

For Poisson regression, the statistic is

$\chi^2 = \sum_{i=1}^{n} \frac{(y_i - \hat{\mu}_i)^2}{\hat{\mu}_i}.$

The statistic has an approximate $\chi^2$ distribution with $n$ − (number of model parameters) $= n - 2$ degrees of freedom for large $n$.

In order for the $\chi^2$ approximation to work well, the $\hat{\mu}_i$ should not be small.

Rule of thumb: $\hat{\mu}_i \geq 5$.


Goodness of Fit: LRT

For Poisson regression, the LRT statistic comparing the model with the saturated model is

$G^2 = -2\log(\Lambda) = 2 \sum_{i=1}^{n} y_i \log\!\left(\frac{y_i}{\hat{\mu}_i}\right)$

where $\hat{\mu}_i = e^{\hat{\alpha} + \hat{\beta} x_i}$.

The statistic has an approximate $\chi^2$ distribution with $n - 2$ degrees of freedom for large $n$. In R this is called the residual deviance.
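In R, the corresponding p-value is one line (a sketch, using the crab-data Poisson fit poissonReg from the example on the next slide):

pchisq(deviance(poissonReg), df = df.residual(poissonReg), lower.tail = FALSE)
# for the crab fit, pchisq(567.88, df = 171) is essentially 0, indicating lack of fit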


Goodness of fit: Example

The R code and output for the crab data are given below. Use the Pearson chi-square test and the LRT to test the goodness of fit of the model.

> # R code: residuals for Poisson regression
> crab=read.table("C:/Users/Mihinda/Desktop/crab.txt",
+     header=T) #the data file
> poissonReg <- glm(formula = satellite ~ width, data = crab,
+     family = poisson(link = log))
> summary(poissonReg)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.30476    0.54224  -6.095  1.1e-09 ***
width        0.16405    0.01997   8.216  < 2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 632.79 on 172 degrees of freedom
Residual deviance: 567.88 on 171 degrees of freedom

> pear.res<-resid(poissonReg, type="pearson")
> pearsonChisq <- sum(pear.res^2)
> pearsonChisq
[1] 544.157
> p_pearson <- 1-pchisq(pearsonChisq, df = poissonReg$df.residual)
> p_pearson
[1] 0


Both tests indicate lack of fit of the model.

Some of the fitted means $\hat{\mu}_i$ (not printed) are less than 5, and so the test is not very reliable.
