North South University Tutorial 4 · individual.utoronto.ca/ahmed_3/index_files/nsu/bio2_4.pdf ·...
TRANSCRIPT
NORTH SOUTH UNIVERSITY
TUTORIAL 4
AHMED HOSSAIN,PhD
Data Management and Analysis
AHMED HOSSAIN,PhD - Data Management and Analysis 1
Regression so far... - REVIEW
SIMPLE LINEAR REGRESSION Relationship between a numerical response and a numerical or categorical predictor, for normally distributed errors.
MULTIPLE REGRESSION Relationship between a numerical response and multiple numerical and/or categorical predictors, for normally distributed errors.
LOGISTIC REGRESSION Relationship between a binary outcome and numerical and/or categorical predictors, for a binomially distributed response.
What we haven't seen is what to do when the response is count data.
Modelling counts - EXAMPLES
Number of traffic accidents per day
Mortality counts in a given neighborhood, per week
Number of coughs in a concert hall, per minute
Number of patients arriving at a doctor's office, daily
Number of people acquiring a particular disease in a given year.
Poisson regression will provide us with a framework to handle counts properly!
Distribution - POISSON
EXAMPLE The number of occurrences in a given time interval can be modeled by the Poisson distribution.
POISSON DISTRIBUTION Define a random variable Y by Y = total number of occurrences in a given time interval.
There are an infinite number of trials. The mean rate of occurrence is λ.
NOTATION Y ∼ Poi(λ).
PDF $P(Y = y) = \dfrac{e^{-\lambda}\,\lambda^{y}}{y!}, \quad y = 0, 1, 2, \ldots$
MEAN E(Y ) = λ
VARIANCE Var(Y ) = λ
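The mean and variance above can be checked numerically. The snippet below is a small sketch in Python (standard library only); the rate `lam = 3.0` and the truncation of the support at 100 are illustrative assumptions, not values from the tutorial.

```python
import math


def poisson_pmf(y: int, lam: float) -> float:
    """P(Y = y) for Y ~ Poi(lam): exp(-lam) * lam**y / y!"""
    return math.exp(-lam) * lam ** y / math.factorial(y)


lam = 3.0             # illustrative rate (an assumption, not from the slides)
support = range(100)  # truncated support; the tail beyond 100 is negligible here

probs = [poisson_pmf(y, lam) for y in support]
total = sum(probs)
mean = sum(y * p for y, p in zip(support, probs))
variance = sum((y - mean) ** 2 * p for y, p in zip(support, probs))

# The probabilities sum to 1, and mean and variance both equal lam
# (up to floating-point error).
print(total, mean, variance)
```

The equality of mean and variance is the property to keep in mind when judging whether a Poisson model suits a given set of counts.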
Modeling counts - POISSON REGRESSION
If we take the count data y with mean rate λ and add a regression equation, we get a Poisson regression:
$\log(\lambda_t) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
It is a special case of generalized linear models, so it is closely related to linear and logistic regression modelling.
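To make the log-link equation concrete, here is a minimal sketch of fitting such a model. It assumes a single predictor and a small set of made-up toy counts, neither of which comes from the tutorial. The fit uses Newton-Raphson on the Poisson log-likelihood and is written in Python with the standard library only; in R the usual route is `glm()` with `family = poisson`.

```python
import math


def fit_poisson(x, y, iterations=25):
    """Fit log(lambda_i) = b0 + b1 * x_i by Newton-Raphson (one predictor)."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start at the intercept-only fit
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector: X^T (y - mu)
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Information matrix: X^T diag(mu) X
        h00 = sum(mu)
        h01 = sum(xi * mi for xi, mi in zip(x, mu))
        h11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: beta += information^{-1} * score (explicit 2x2 inverse)
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Toy data, invented for illustration only.
x = [0, 1, 2, 3, 4, 5]
y = [1, 2, 2, 4, 6, 9]

b0, b1 = fit_poisson(x, y)
mu = [math.exp(b0 + b1 * xi) for xi in x]  # fitted mean counts
print(b0, b1)
```

At the solution the fitted means satisfy the score equations, so the observed and fitted counts have the same total.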
Modeling counts - POISSON REGRESSION IN R
Example - SMOKING AND LUNG CANCER
This dataset has information on lung cancer deaths by age and smoking status.
AGE: in five-year age groups coded 1 to 9 for 40-44, 45-49, 50-54, 55-59, 60-64, 65-69, 70-74, 75-79, 80+,
SMOKING STATUS: coded 1 = doesn't smoke, 2 = smokes cigars or pipe only, 3 = smokes cigarettes and cigar or pipe, and 4 = smokes cigarettes only,
POPULATION: in hundreds of thousands, and
DEATHS: number of lung cancer deaths in a year.
Example - SMOKING AND LUNG CANCER
INTERCEPT We interpret β̂0 = −3.96 as the baseline log expected count, or log rate, in the group with all covariates (age group, smoking status) set to zero. exp(−3.96) ≈ 0.019 is the expected count of lung cancer deaths in the baseline group. This baseline value does not quite make sense, because neither covariate is coded with a zero level (age group runs from 1 to 9 and smoking status from 1 to 4). It may be helpful to center our covariates, but this is not a big deal if we don't care about the baseline, because our primary inference is about the increase with respect to the covariates.
SLOPE We interpret β̂1 = 0.33 as the increase in log expected count, or the log rate ratio, comparing lung cancer deaths between groups whose age group differs by one, adjusting for other covariates.
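Because the coefficients live on the log scale, exponentiating them gives quantities that are easier to read. The short Python sketch below uses only the two estimates quoted above; the variable names and the two-group comparison are added for illustration.

```python
import math

beta0_hat = -3.96  # intercept reported in the tutorial
beta1_hat = 0.33   # age-group slope reported in the tutorial

baseline = math.exp(beta0_hat)            # expected count with all covariates at zero
rate_ratio = math.exp(beta1_hat)          # rate ratio for a one-step difference in age group
rate_ratio_two = math.exp(2 * beta1_hat)  # rate ratio for a two-step difference (illustrative)

print(round(baseline, 3), round(rate_ratio, 2), round(rate_ratio_two, 2))
```

A rate ratio of about 1.39 means the expected count is roughly 39% higher for each one-step increase in age group, holding smoking status fixed. Effects multiply on the original scale, so a two-step difference corresponds to about 1.39 × 1.39 ≈ 1.93.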
Final Notes - LINEAR, LOGISTIC AND POISSON REGRESSION
Stepwise regression - STEPWISE FOR LINEAR, LOGISTIC AND POISSON REGRESSION
Stepwise regression includes regression models in which the choice of predictive variables is carried out by an automatic procedure.
The main approaches are:
Forward selection, which involves starting with no variables in the model, testing the addition of each variable using a chosen model comparison criterion, adding the variable (if any) that improves the model the most, and repeating this process until none improves the model.
Backward elimination, which involves starting with all candidate variables, testing the deletion of each variable using a chosen model comparison criterion, deleting the variable (if any) that improves the model the most by being deleted, and repeating this process until no further improvement is possible.
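The two procedures can be sketched generically. The Python code below assumes a model comparison criterion where lower is better (AIC, for example), supplied as a function of the chosen variable set. The `toy_score` function and its numbers are hypothetical stand-ins for refitting a real model and are not taken from the tutorial.

```python
def forward_selection(candidates, score):
    """Start empty; repeatedly add the variable that lowers the score most."""
    selected = frozenset()
    best = score(selected)
    while True:
        trials = [(score(selected | {c}), c)
                  for c in sorted(set(candidates) - selected)]
        if not trials:
            break
        trial_score, choice = min(trials)
        if trial_score >= best:  # no addition improves the model
            break
        selected, best = selected | {choice}, trial_score
    return selected


def backward_elimination(candidates, score):
    """Start full; repeatedly drop the variable whose removal lowers the score most."""
    selected = frozenset(candidates)
    best = score(selected)
    while selected:
        trial_score, choice = min((score(selected - {c}), c)
                                  for c in sorted(selected))
        if trial_score >= best:  # no deletion improves the model
            break
        selected, best = selected - {choice}, trial_score
    return selected


# Hypothetical criterion: each variable buys some fit ("gain") and costs a
# penalty of 2, loosely mimicking AIC. The gains are invented for illustration.
GAINS = {"age": 10.0, "smoke": 6.0, "noise": 0.5}


def toy_score(variables):
    return 100.0 - sum(GAINS[v] for v in variables) + 2.0 * len(variables)


forward_result = forward_selection(GAINS, toy_score)
backward_result = backward_elimination(GAINS, toy_score)
print(sorted(forward_result), sorted(backward_result))
```

In this toy setup both procedures keep `age` and `smoke` and leave out `noise`. With real data the two approaches need not agree, since each only ever looks one step ahead.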