North South University Tutorial 4 · individual.utoronto.ca/ahmed_3/index_files/NSU/Bio2_4.pdf

NORTH SOUTH UNIVERSITY

TUTORIAL 4

AHMED HOSSAIN,PhD

Data Management and Analysis

AHMED HOSSAIN,PhD - Data Management and Analysis 1

Regression so far: REVIEW

SIMPLE LINEAR REGRESSION Relationship between a numerical response and a numerical or categorical predictor, for normally distributed errors.

MULTIPLE REGRESSION Relationship between a numerical response and multiple numerical and/or categorical predictors, for normally distributed errors.

LOGISTIC REGRESSION Relationship between a binary outcome and numerical and/or categorical predictors, for a binomially distributed response.

What we haven't seen is what to do when the response is count data.


Modelling counts: EXAMPLES

Number of traffic accidents per day

Mortality counts in a given neighborhood, per week

Number of coughs in a concert hall, per minute

Number of patients arriving at a doctor's office, daily

Number of people acquiring a particular disease in a given year.

Poisson regression will provide us with a framework to handle counts properly!


Distribution: POISSON

EXAMPLE The number of occurrences in a given time interval can be modeled by the Poisson distribution.

POISSON DISTRIBUTION Define a random variable Y by Y = total number of occurrences in a given time interval.

There is an infinite number of trials. The mean rate of occurrence is λ.

NOTATION Y ∼ Poi(λ).

PDF $P(Y = y) = \frac{e^{-\lambda} \lambda^{y}}{y!}, \quad y = 0, 1, 2, \ldots$

MEAN E(Y ) = λ

VARIANCE Var(Y ) = λ
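
The formula and the two moments above can be checked numerically. The following is a minimal sketch in standard-library Python (the tutorial itself works in R); the function name `poisson_pmf`, the value λ = 3 and the truncation of the support at 60 are illustrative assumptions, not part of the slides.

```python
import math


def poisson_pmf(y, lam):
    """P(Y = y) for Y ~ Poi(lam): exp(-lam) * lam^y / y!"""
    return math.exp(-lam) * lam ** y / math.factorial(y)


lam = 3.0            # assumed mean rate, chosen only for illustration
support = range(60)  # truncated support; the tail beyond 60 is negligible for lam = 3

probs = [poisson_pmf(y, lam) for y in support]
total = sum(probs)
mean = sum(y * p for y, p in zip(support, probs))
variance = sum((y - mean) ** 2 * p for y, p in zip(support, probs))

print(f"sum of probabilities = {total:.6f}")
print(f"mean = {mean:.6f}, variance = {variance:.6f}")
```

Both the mean and the variance should come out equal to λ (up to rounding), which is the defining property used later when counts are modelled.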



Modeling counts: POISSON REGRESSION

If we take the count data y with mean rate λ and add a regression equation, we get a Poisson regression:

$\log(\lambda_t) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$

It is a special case of generalized linear models, so it is closely related to linear and logistic regression modelling.
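
To make the regression equation concrete, here is a hedged sketch of how such a model can be fitted by maximum likelihood using Newton-Raphson steps. It is an independent illustration in plain Python, not the tutorial's R code; the helper names `solve_linear` and `fit_poisson` are hypothetical, and the twelve observations are invented solely so the example runs.

```python
import math


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_poisson(rows, y, iterations=25):
    """Newton-Raphson for log(lambda_t) = beta0 + beta1*x1 + beta2*x2 + ..."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    p = len(X[0])
    beta = [math.log(sum(y) / len(y))] + [0.0] * (p - 1)  # start at the null model
    for _ in range(iterations):
        mu = [math.exp(sum(b * v for b, v in zip(beta, xi))) for xi in X]
        # Score vector X'(y - mu) and information matrix X' diag(mu) X.
        score = [sum(xi[j] * (yi - mi) for xi, yi, mi in zip(X, y, mu))
                 for j in range(p)]
        info = [[sum(xi[j] * xi[k] * mi for xi, mi in zip(X, mu))
                 for k in range(p)] for j in range(p)]
        step = solve_linear(info, score)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < 1e-10:
            break
    return beta


# Invented data: x1 is an ordered group code, x2 a 0/1 indicator, y a count.
rows = [(x1, x2) for x2 in (0, 1) for x1 in range(1, 7)]
counts = [1, 2, 2, 4, 5, 8, 2, 3, 5, 6, 10, 14]

beta = fit_poisson(rows, counts)
print("estimated (beta0, beta1, beta2):", [round(b, 3) for b in beta])
```

At the maximum likelihood estimate the fitted means reproduce the observed total count, which gives a quick sanity check on any fit of this kind.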


Modeling counts: POISSON REGRESSION IN R


Example: SMOKING AND LUNG CANCER

This dataset has information on lung cancer deaths by age and smoking status.

AGE: in five-year age groups coded 1 to 9 for 40-44, 45-49, 50-54, 55-59, 60-64, 65-69, 70-74, 75-79, 80+.

SMOKING STATUS: coded 1 = doesn't smoke, 2 = smokes cigars or pipe only, 3 = smokes cigarettes and cigar or pipe, and 4 = smokes cigarettes only,

POPULATION: in hundreds of thousands, and

DEATHS: number of lung cancer deaths in a year.
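
The coding scheme above can be written down as lookup tables, which helps when reading model output back in terms of the original groups. This is a small Python sketch; the names `AGE_LABELS`, `SMOKE_LABELS` and `describe` are hypothetical, and the example record is invented rather than taken from the dataset.

```python
# Code-to-label tables, following the variable descriptions above.
AGE_LABELS = dict(enumerate(
    ["40-44", "45-49", "50-54", "55-59", "60-64",
     "65-69", "70-74", "75-79", "80+"], start=1))

SMOKE_LABELS = {
    1: "doesn't smoke",
    2: "smokes cigars or pipe only",
    3: "smokes cigarettes and cigar or pipe",
    4: "smokes cigarettes only",
}


def describe(age, smoke, population, deaths):
    """Return a readable summary of one row of the dataset."""
    # Population is in hundreds of thousands, so deaths / population
    # is the number of deaths per 100,000 people per year.
    rate = deaths / population
    return (f"age {AGE_LABELS[age]}, {SMOKE_LABELS[smoke]}: "
            f"{rate:.1f} deaths per 100,000")


# Invented record, for illustration only.
summary = describe(age=3, smoke=4, population=2.5, deaths=10)
print(summary)
```

Because the population differs between groups, comparing raw death counts across rows can mislead; the per-100,000 rate puts the groups on a common scale.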



Example: SMOKING AND LUNG CANCER

INTERCEPT We interpret β̂0 = −3.96 as the baseline log expected count, or log rate, in the group with all covariates (age group, smoking status) set to zero. exp(−3.96) ≈ 0.019 is the expected count of lung cancer deaths in the baseline group. This baseline value does not quite make sense, because age group is coded 1 to 9 and smoking status 1 to 4, so no observed group has both covariates equal to zero. It may be helpful to center our covariates, but this is not a big deal if we don't care about the baseline, because our primary inference is about the increase with respect to the covariates.

SLOPE We interpret β̂1 = 0.33 as the increase in log expected count, or the log rate ratio, comparing groups whose age group differs by one, adjusting for the other covariates.
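
Since the coefficients live on the log scale, exponentiating them gives quantities that are easier to read. The sketch below (Python, standard library only) uses the rounded estimates quoted above, so its results agree with the text only to about two significant figures; the variable names and the three-group comparison are illustrative assumptions.

```python
import math

beta0_hat = -3.96  # intercept, as quoted in the text (rounded)
beta1_hat = 0.33   # age-group slope, as quoted in the text (rounded)

# Expected count in the (hypothetical) group with all covariates equal to zero.
baseline = math.exp(beta0_hat)

# Rate ratio for a one-step difference in age group, other covariates fixed.
rate_ratio = math.exp(beta1_hat)

# Differences of several age groups multiply: k steps give exp(k * beta1).
rate_ratio_3 = math.exp(3 * beta1_hat)

print(f"baseline expected count: {baseline:.4f}")
print(f"rate ratio per age group: {rate_ratio:.2f}")
print(f"rate ratio across three age groups: {rate_ratio_3:.2f}")
```

The rate ratio of about 1.39 means the expected death rate is roughly 39% higher for each step up in age group, holding smoking status fixed.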


Final Notes: LINEAR, LOGISTIC AND POISSON REGRESSION


Stepwise regression: STEPWISE FOR LINEAR, LOGISTIC AND POISSON REGRESSION

Stepwise regression includes regression models in which the choice of predictive variables is carried out by an automatic procedure.

The main approaches are:

Forward selection, which involves starting with no variables in the model, testing the addition of each variable using a chosen model comparison criterion, adding the variable (if any) that improves the model the most, and repeating this process until none improves the model.

Backward elimination, which involves starting with all candidate variables, testing the deletion of each variable using a chosen model comparison criterion, deleting the variable (if any) that improves the model the most by being deleted, and repeating this process until no further improvement is possible.
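
The two procedures can be sketched as short loops over a scoring function. In the Python sketch below, `score` stands for any model comparison criterion where lower is better (AIC is a common choice); the function names are hypothetical, and `toy_score` is an invented stand-in that avoids refitting a real model, so the variable names and numbers in it carry no meaning beyond the illustration.

```python
def forward_selection(candidates, score):
    """Start empty; repeatedly add the variable that lowers the score most."""
    selected = []
    best = score(selected)
    while True:
        remaining = [v for v in candidates if v not in selected]
        trials = [(score(selected + [v]), v) for v in remaining]
        if not trials:
            break
        trial_score, var = min(trials)
        if trial_score >= best:  # no addition improves the model
            break
        selected.append(var)
        best = trial_score
    return selected, best


def backward_elimination(candidates, score):
    """Start full; repeatedly drop the variable whose removal lowers the score most."""
    selected = list(candidates)
    best = score(selected)
    while selected:
        trials = [(score([u for u in selected if u != v]), v) for v in selected]
        trial_score, var = min(trials)
        if trial_score >= best:  # no deletion improves the model
            break
        selected.remove(var)
        best = trial_score
    return selected, best


# Invented criterion: each variable reduces a made-up deviance by a fixed
# amount, and every included variable costs a penalty of 2 (AIC-like).
GAIN = {"age": 40.0, "smoke": 25.0, "noise": 0.5}


def toy_score(variables):
    return 100.0 - sum(GAIN[v] for v in variables) + 2.0 * len(variables)


forward_result = forward_selection(list(GAIN), toy_score)
backward_result = backward_elimination(list(GAIN), toy_score)
print("forward: ", forward_result)
print("backward:", backward_result)
```

With this toy criterion both directions keep `age` and `smoke` and leave out `noise`. With a real criterion computed from fitted models, the two directions are not guaranteed to agree.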
