north south university tutorial 4individual.utoronto.ca/ahmed_3/index_files/nsu/bio2_4.pdf ·...

NORTH SOUTH UNIVERSITY

TUTORIAL 4

AHMED HOSSAIN, PhD

Data Management and Analysis


Regression so far... REVIEW

SIMPLE LINEAR REGRESSION Relationship between a numerical response and a numerical or categorical predictor, for normally distributed errors.

MULTIPLE REGRESSION Relationship between a numerical response and multiple numerical and/or categorical predictors, for normally distributed errors.

LOGISTIC REGRESSION Relationship between a binary outcome and numerical and/or categorical predictors, for a binomially distributed outcome.

What we haven't seen is what to do when the response is count data.


Modelling counts: EXAMPLES

Number of traffic accidents per day

Mortality counts in a given neighborhood, per week

Number of coughs in a concert hall, per minute

Number of customers arriving at a doctor's office, daily

Number of people acquiring a particular disease in a given year.

Poisson regression will provide us with a framework to handle counts properly!


Distribution: POISSON

EXAMPLE The number of occurrences in a given time interval can be modeled by the Poisson distribution.

POISSON DISTRIBUTION Define a random variable Y by Y = total number of occurrences in a given time.

The number of trials is effectively infinite, and the mean rate of occurrence is λ.

NOTATION Y ∼ Poi(λ).

PMF $P(Y = y) = \dfrac{e^{-\lambda} \lambda^{y}}{y!}, \quad y = 0, 1, 2, \ldots$

MEAN E(Y ) = λ

VARIANCE Var(Y ) = λ
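The probability function, mean and variance above can be checked numerically. The following is a minimal sketch using only the Python standard library; the rate `lam = 3.0` and the truncation point of the support are illustrative assumptions, not values from the tutorial.

```python
import math

def poisson_pmf(y: int, lam: float) -> float:
    """P(Y = y) for Y ~ Poi(lam): exp(-lam) * lam**y / y!"""
    return math.exp(-lam) * lam ** y / math.factorial(y)

lam = 3.0               # hypothetical rate, chosen only for illustration
support = range(0, 60)  # truncated support; the tail beyond 60 is negligible for lam = 3

probs = [poisson_pmf(y, lam) for y in support]
total = sum(probs)
mean = sum(y * p for y, p in zip(support, probs))
variance = sum((y - mean) ** 2 * p for y, p in zip(support, probs))

# The probabilities sum to 1, and the mean and variance both equal lam.
print(f"total={total:.6f} mean={mean:.6f} variance={variance:.6f}")
```

The equality of mean and variance is the property to keep in mind when judging whether a Poisson model suits a given set of counts.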


Modeling counts: POISSON REGRESSION

If we take the count data y with mean rate λ and add a regression equation, we get a Poisson regression:

$\log(\lambda_t) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$

It is a special case of generalized linear models, so it is closely related to linear and logistic regression modelling.
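Because the regression equation is written on the log scale, the expected count is recovered by exponentiating the linear predictor, and a one-unit change in a covariate multiplies the expected count by a fixed factor. A minimal sketch of this, where the coefficients and covariate values are hypothetical and chosen only for illustration:

```python
import math

def expected_count(beta0, beta1, beta2, x1, x2):
    """Invert the log link: lambda = exp(beta0 + beta1*x1 + beta2*x2)."""
    return math.exp(beta0 + beta1 * x1 + beta2 * x2)

# Hypothetical coefficients (not estimates from any dataset in this tutorial).
b0, b1, b2 = 0.5, 0.2, -0.1

lam_a = expected_count(b0, b1, b2, x1=1.0, x2=2.0)
lam_b = expected_count(b0, b1, b2, x1=2.0, x2=2.0)  # x1 increased by one unit

# The ratio of expected counts equals exp(beta1), whatever the value of x2.
ratio = lam_b / lam_a
print(ratio, math.exp(b1))
```

The log link also guarantees that the modelled mean rate is positive for any covariate values, which a linear model for the raw count would not.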


Modeling counts: POISSON REGRESSION IN R


Example: SMOKING AND LUNG CANCER

This dataset has information on lung cancer deaths by age and smoking status.

AGE: in five-year age groups coded 1 to 9 for 40-44, 45-49, 50-54, 55-59, 60-64, 65-69, 70-74, 75-79, 80+.

SMOKING STATUS: coded 1 = doesn't smoke, 2 = smokes cigars or pipe only, 3 = smokes cigarettes and cigar or pipe, and 4 = smokes cigarettes only.

POPULATION: in hundreds of thousands, and

DEATHS: number of lung cancer deaths in a year.



INTERCEPT We interpret β̂0 = −3.96 as the baseline log expected count, or log rate, in the group with all covariates (age group, smoking status) set to zero. exp(−3.96) ≈ 0.019 is the expected count of lung cancer deaths in the baseline group. This baseline value does not quite make sense, because age group is coded 1 to 9 and smoking status 1 to 4, so no observed group has both covariates equal to zero. It may be helpful to center our covariates, but this is not a big deal if we don't care about the baseline, because our primary inference is about the increase with respect to the covariates.

SLOPE We interpret β̂1 = 0.33 as the increase in log expected count, or the log rate ratio, comparing lung cancer deaths between groups whose age group differs by one, adjusting for other covariates.
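Coefficients on the log scale are usually easier to read after exponentiating. The sketch below does this for the two estimates quoted above (−3.96 and 0.33); the variable names are illustrative, and the three-group comparison is an added example that assumes the effect of age group is linear on the log scale, as the model specifies.

```python
import math

beta0_hat = -3.96  # intercept estimate quoted in the text
beta1_hat = 0.33   # age-group slope estimate quoted in the text

baseline = math.exp(beta0_hat)    # expected count when all covariates are zero
rate_ratio = math.exp(beta1_hat)  # multiplicative change per one-step increase in age group
percent_increase = (rate_ratio - 1) * 100

# Rate ratio for age groups three steps apart, under the same model.
rate_ratio_3 = math.exp(3 * beta1_hat)

print(f"baseline expected count: {baseline:.3f}")
print(f"rate ratio per age group: {rate_ratio:.2f} ({percent_increase:.0f}% increase)")
print(f"rate ratio across three age groups: {rate_ratio_3:.2f}")
```

Read this way, each step up in age group is associated with roughly a 39% higher expected count of lung cancer deaths, holding the other covariates fixed.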


Final Notes: LINEAR, LOGISTIC AND POISSON REGRESSION


Stepwise regression: STEPWISE FOR LINEAR, LOGISTIC AND POISSON REGRESSION

Stepwise regression includes regression models in which the choice of predictive variables is carried out by an automatic procedure.

The main approaches are:

FORWARD SELECTION Start with no variables in the model, test the addition of each variable using a chosen model comparison criterion, add the variable (if any) that improves the model the most, and repeat this process until none improves the model.

BACKWARD ELIMINATION Start with all candidate variables, test the deletion of each variable using a chosen model comparison criterion, delete the variable (if any) whose deletion improves the model the most, and repeat this process until no further improvement is possible.
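The two procedures can be sketched as greedy loops around a model comparison criterion. In the sketch below, `criterion` is a hypothetical function that returns a score for a set of variables, with lower meaning better (as with AIC). The toy criterion and its `gains` values are invented purely so the example runs; in practice the criterion would refit the linear, logistic or Poisson model for each candidate set.

```python
def forward_selection(candidates, criterion):
    """Greedy forward selection; criterion(subset) returns a score, lower is better."""
    selected = []
    remaining = list(candidates)
    best_score = criterion(frozenset(selected))
    while remaining:
        # Score every model obtained by adding one more variable.
        scored = [(criterion(frozenset(selected + [v])), v) for v in remaining]
        score, var = min(scored)
        if score >= best_score:  # no addition improves the model: stop
            break
        selected.append(var)
        remaining.remove(var)
        best_score = score
    return selected

def backward_elimination(candidates, criterion):
    """Greedy backward elimination; criterion(subset) returns a score, lower is better."""
    selected = list(candidates)
    best_score = criterion(frozenset(selected))
    while selected:
        # Score every model obtained by deleting one variable.
        scored = [(criterion(frozenset(u for u in selected if u != v)), v)
                  for v in selected]
        score, var = min(scored)
        if score >= best_score:  # no deletion improves the model: stop
            break
        selected.remove(var)
        best_score = score
    return selected

# Toy criterion (hypothetical): a penalty of 2 per variable minus an invented
# "gain" for each variable. It stands in for a real fit-based criterion.
gains = {"age": 10.0, "smoke": 6.0, "noise": 0.5}

def toy_criterion(subset):
    return 2 * len(subset) - sum(gains[v] for v in subset)

forward_result = forward_selection(gains, toy_criterion)
backward_result = backward_elimination(gains, toy_criterion)
print(forward_result, backward_result)
```

With this toy criterion both procedures keep `age` and `smoke` and drop `noise`, but with a real criterion forward and backward searches need not agree, since each explores only a greedy path through the possible models.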

