25.11.2014 - categorical data
Research Methodology
Statistics Lecture 5
Categorical Data: The Chi-Squared Test, Odds Ratios, Relative Risk and Logistic Regression
Rifat Hamoudi, Senior Lecturer
Review
• Comparing one numerical outcome over 2 or more groups:
                 Independent Groups       Paired Groups
2 Groups         Independent t-test       Paired t-test
                 Mann Whitney U test      Wilcoxon's Signed Rank test
>2 Groups        One-way ANOVA
                 Kruskal Wallis test
Review
• Assessing the relationship between two numerical variables:
Correlation Analysis: quantifies the strength of the linear association between two numerical variables
Regression Analysis: simple linear regression fits a straight line to describe the relationship between two numerical variables where one variable depends on the other
The regression coefficient quantifies the amount the dependent variable changes as the explanatory variable increases by one unit
Review
Which statistical test to use?
Next.....
Methods for Analysing Categorical Data
Outline
• Comparing Two Proportions: - Chi-squared test - Fisher's Exact test
• Risk, Risk Difference and Risk Ratio
• Odds and Odds ratio
• Binary Logistic Regression Analysis
Categorical Data
• Categorical Data is data that can be placed into categories: Binary/Ordinal/Nominal
• The mean is useless for categorical data! → We cannot use methods for continuous data to analyse categorical data
• We analyze frequencies for categorical variables, that is the number of things that fall into each combination of categories
Obesity in Young Children
• Obesity in young life can pave the way for future musculoskeletal conditions
• A dietician conducted a survey of 510 children at a local primary school
• Objective: Are there more obese children under 5 or over 5?
Categorical Data: Comparing Groups
• Objective: Are there more obese children under 5 or over 5?
• Initially tabulate the observed frequencies in a 2 x 2 contingency table, for example:
                        Age Category
                        Under 5    Over 5
BMI under 30               92        323
Obese (BMI over 30)        19         76
Total                     111        399
Categorical Data: Comparing Groups
• We wish to formally compare the proportions of children with the obese characteristic
• The proportions of obese children in each age category are calculated as follows:
                                Age Category
                                Under 5            Over 5
BMI under 30                       92                323
Obese (BMI over 30)                19                 76
Total                             111                399
Proportion of Obese Children    19/111 = 0.17     76/399 = 0.19
• We often have two independent groups of individuals (under 5 / over 5)
• We want to know whether the proportions of individuals
with a particular characteristic are the same in the two groups (obese)
Categorical Data: Comparing Groups
Categorical Data: Independent Groups →χ2 Test
• The Chi-Squared (χ2) test allows us to formally compare proportions between two independent groups
• It allows us to determine whether the observed frequencies (counts) differ markedly from the frequencies that we would expect by chance
• Define the null and alternative hypothesis under study:
H0: The proportions of individuals with the characteristic are equal in the two groups in the population
HA: These population proportions are not equal
Categorical Data: Two Independent Groups →χ2 Test
• SPSS: Analyse → Descriptive Statistics→ Crosstabs
Categorical Data: Two Independent Groups → χ2 Test
• χ2 Test technical details:
- The expected numbers in each of the four cells of our 2 x 2 contingency table are calculated assuming H0 is true (equal proportions)
- The formula for each expected cell is: (row total x column total) / grand total, where the grand total equals the total number of individuals that make up the sample (N)
Categorical Data: Two Independent Groups → χ2 Test
- What was observed is compared to the calculated expected numbers which would indicate there were no differences between the groups (equal proportions)
- A large discrepancy between the observed and the corresponding expected frequencies is an indication that the proportions in the two groups differ (P <0.05)
Expected Numbers:
                        Under 5                    Over 5
BMI under 30            415 x 111/510 = 90.3       415 x 399/510 = 324.7
Obese (BMI over 30)     95 x 111/510 = 20.7        95 x 399/510 = 74.3
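The expected-count formula and the comparison of observed with expected counts can be sketched in a few lines of Python. This is an illustrative sketch only, not the SPSS procedure: the variable names are invented here, and the P-value uses the fact that a chi-squared variable with 1 degree of freedom is a squared standard normal.

```python
from math import erfc, sqrt

# Observed counts from the 2 x 2 table
# (rows: BMI under 30, obese; columns: under 5, over 5).
observed = [[92, 323],
            [19, 76]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under H0: (row total * column total) / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-squared statistic: sum over cells of (O - E)^2 / E.
chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))

# Upper-tail P-value for 1 degree of freedom.
p_value = erfc(sqrt(chi2 / 2))

print([[round(e, 1) for e in row] for row in expected])
print(f"chi-squared = {chi2:.3f}, P = {p_value:.3f}")
```

The rounded expected counts agree with the hand calculation (90.3, 324.7, 20.7, 74.3).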
χ2 Test Example
• Example:
H0: The proportion of children with the obese characteristic is equal in the two age groups
HA: The proportion of children with the obese characteristic is not equal in the two age groups
• To conduct the χ2 test:
SPSS: Analyse → Descriptive Statistics →Crosstabs
χ2 Test Example
• 2 x 2 Contingency table:
Overweight * Age_Cat Crosstabulation (Count)
                              Age_Cat
Overweight                    Under 5    Over 5    Total
BMI Under 30                     92        323      415
Obese (BMI Over 30)              19         76       95
Total                           111        399      510
χ2 Test Example
• Results of the Chi-squared test:
There is no evidence that the proportions of children with the obese characteristic differ between the two age groups (Under 5 = 0.17 or 17%, Over 5 = 0.19 or 19%; Pearson χ2 = 0.214, df = 1, P = 0.644)
Chi-Square Tests
                                Value     df   Asymp. Sig.   Exact Sig.   Exact Sig.   Point
                                               (2-sided)     (2-sided)    (1-sided)    Probability
Pearson Chi-Square              .214(b)   1    .644          .682         .379
Continuity Correction(a)        .105      1    .746
Likelihood Ratio                .217      1    .641          .682         .379
Fisher's Exact Test                                          .682         .379
Linear-by-Linear Association    .213(c)   1    .644          .682         .379         .101
N of Valid Cases                510
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 20.68.
c. The standardized statistic is .462.
• The χ2 test compares observed and expected cell counts - useful to compare proportions across two independent groups
• In the context of a randomised controlled trial our proportions will be risks
• Probably the most common scenario in medical research is to compare the outcome risk in two independent groups
• We can use the χ2 test to answer a common RCT question: Is the risk of failing in group A the same as the risk of failing in group B?
Categorical Data: Comparing Risks → χ2 Test
χ2 Test – Risk Example
• Risk of not healing in the drug group = 0.42, or 42%
• Risk of not healing in the placebo group = 0.72, or 72%
• Risk difference = 72% - 42% = 30%
• The Chi-squared test allows us to formally compare risks between groups, answering the question: Is the risk of not healing in the placebo group the same as the risk of not healing in the drug group?
                        Treatment
Outcome                 Drug                Placebo
Not Healed              152                 142
Healed                  212                  56
Total                   364                 198
Risk of Not Healing     152/364 = 0.42      142/198 = 0.72
χ2 Test – Risk Example
• Define the null and alternative hypothesis under study:
H0: The risk of not healing is equal in the two treatment groups
HA: The risk of not healing is not equal in the two treatment groups
• SPSS: Analyse → Descriptive Statistics →Crosstabs
Chi-Square Tests
                                Value       df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                                 (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square              46.140(b)   1    .000          .000         .000
Continuity Correction(a)        44.946      1    .000
Likelihood Ratio                47.359      1    .000          .000         .000
Fisher's Exact Test                                            .000         .000
N of Valid Cases                562
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 94.42.
χ2 Test – Risk Example
• There is evidence to reject the null hypothesis (P<0.001). The risk of not healing is not equal in the drug
and placebo groups (risk difference = 30%)
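As a cross-check on the SPSS value, the Pearson statistic for a 2 x 2 table can also be computed with the standard shortcut formula N(ad - bc)^2 / (product of the four marginal totals). The sketch below is illustrative; the function name is made up for this example and no continuity correction is applied.

```python
from math import erfc, sqrt

def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = erfc(sqrt(chi2 / 2))  # upper tail of chi-squared with 1 df
    return chi2, p_value

# Rows: not healed, healed; columns: drug, placebo.
chi2, p_value = chi_squared_2x2(152, 142, 212, 56)
print(f"chi-squared = {chi2:.3f}, P < 0.001: {p_value < 0.001}")
```

SPSS prints P as .000 because it shows three decimal places; this is conventionally reported as P < 0.001.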
Relative Risk (Risk Ratio)
• Typically the risk difference will be a sufficient way of presenting differences between groups with binary outcomes
• If the outcome is rare then ratios are more suitable
• Relative Risk (Risk Ratio)
Relative Risk (Risk Ratio)
                        Exposed to factor
Outcome of Interest     Yes        No         Total
Yes                     a          b          a + b
No                      c          d          c + d
Total                   a + c      b + d      n = a + b + c + d
Risk of Outcome in the exposed group = a / (a + c)
Risk of Outcome in the unexposed group = b / (b + d)
Relative Risk (Risk Ratio or RR) = Risk_exp / Risk_unexp = [a / (a + c)] / [b / (b + d)]
χ2 Test – Risk Example
• Risk difference = 72% - 42% = 30%
• Relative Risk (Risk Ratio): (152/364) / (142/198) = 0.58
                        Treatment
Outcome                 Drug                Placebo
Not Healed              152                 142
Healed                  212                  56
Total                   364                 198
Risk of Not Healing     152/364 = 42%       142/198 = 72%
A subject in the drug group is 0.58 times as likely not to heal as a subject in the placebo group
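The risk, risk difference and relative risk calculations can be sketched as follows, using the a, b, c, d cell labels from the Relative Risk table (exposed = drug, unexposed = placebo). The function name is an assumption made for this illustration.

```python
def relative_risk(a, b, c, d):
    """Return (risk in exposed, risk in unexposed, relative risk).

    a, b = outcome present in the exposed / unexposed group
    c, d = outcome absent in the exposed / unexposed group
    """
    risk_exposed = a / (a + c)
    risk_unexposed = b / (b + d)
    return risk_exposed, risk_unexposed, risk_exposed / risk_unexposed

# Outcome of interest: not healed. Exposed: drug. Unexposed: placebo.
risk_drug, risk_placebo, rr = relative_risk(a=152, b=142, c=212, d=56)
risk_difference = risk_placebo - risk_drug

print(f"risk (drug) = {risk_drug:.2f}, risk (placebo) = {risk_placebo:.2f}")
print(f"risk difference = {risk_difference:.2f}, relative risk = {rr:.2f}")
```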
Interpretation of the Relative Risk (Risk Ratio)
• A RR of 1 indicates that the risk is the same in the two groups
• A RR <1 indicates that there is a reduction in the risk of the outcome in the exposed group (drug group) compared with the unexposed group (placebo)
• A RR >1 indicates that there is an increased risk in the exposed group (drug group) compared with the unexposed group (placebo)
Categorical Data: Comparing Odds → χ2 Test
• An RCT is often not feasible if an outcome is rare, so instead known cases and suitable controls are selected for a case-control study
• In a case control study we do not interpret the proportions of cases/controls with specific characteristics
• Case-control studies only examine association – NOT causation
• We compare odds because patients are selected because of their disease status
• We don't interpret proportions as risks: you could obtain any risk value you wish simply by varying the number of cases and controls selected, and the proportion of cases in the study often does not reflect the true proportion of cases in the general population
Odds and Odds Ratios
• Relative Risk is not valid in such a scenario
• Rather we will be comparing odds therefore we must use the Odds Ratio (OR) to present the differences between groups
• Odds are different to risks!
• What are odds, what is the odds ratio and how does the odds ratio differ from the risk ratio?
Odds and Odds Ratios
             Exposed to Factor
             Yes        No         Total
Case         a          b          a + b
Control      c          d          c + d
Total        a + c      b + d      n = a + b + c + d
Odds of being a case in the exposed group = a / c
Odds of being a case in the unexposed group = b / d
Odds Ratio = (odds of being a case in the exposed group) / (odds of being a case in the unexposed group) = (a / c) / (b / d) = (a x d) / (b x c)
Odds and Odds Ratios Example
Lung Cancer Doll & Hill Example: 649 male lung cancer patients and 649 controls. Compare the distribution of lung cancer among smokers and non-smokers: 647 of 1269 smokers had lung cancer compared to 2 of 29 non-smokers.
Odds of lung cancer in smokers = 647/622 = 1.04
Odds of lung cancer in non-smokers = 2/27 = 0.07
Odds ratio = (647/622) / (2/27) = (647 x 27) / (2 x 622) = 14.04
                 Smoker    Non-smoker    Total
Lung Cancer       647          2          649
No Lung Ca        622         27          649
Total            1269         29         1298
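The odds and odds ratio arithmetic for the Doll & Hill table can be sketched as below, with a, b, c, d following the case-control table layout (cases in the top row, exposed in the first column). The function name is an assumption for this illustration.

```python
def odds_ratio(a, b, c, d):
    """Return (odds in exposed, odds in unexposed, odds ratio).

    a, b = cases among the exposed / unexposed
    c, d = controls among the exposed / unexposed
    """
    odds_exposed = a / c
    odds_unexposed = b / d
    return odds_exposed, odds_unexposed, (a * d) / (b * c)

# Exposed: smokers. Unexposed: non-smokers.
odds_smokers, odds_non_smokers, or_smoking = odds_ratio(a=647, b=2, c=622, d=27)

print(f"odds (smokers) = {odds_smokers:.2f}")
print(f"odds (non-smokers) = {odds_non_smokers:.2f}")
print(f"odds ratio = {or_smoking:.2f}")
```

The cross-product form (a x d) / (b x c) avoids dividing two rounded odds, which is why it gives 14.04 rather than 1.04 / 0.07.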
Interpretation of the Odds Ratios
• If the odds ratio = 1 then this implies equality
• The odds are equivalent in the exposed and unexposed groups
• An odds ratio >1 indicates that the odds of disease (outcome) is greater in the exposed group than in the unexposed group
• An odds ratio <1 indicates that the odds of disease (outcome) is lower in the exposed group than in the unexposed group
Categorical Data: Comparing Odds → χ2 Test
• The Chi squared test can be used within the context of the case control study to formally test:
H0: The odds of having lung cancer in smokers = the odds of having lung cancer in non-smokers (i.e. odds ratio = 1)
HA: The odds of having lung cancer are not equal (i.e. odds ratio ≠ 1)
χ2 Test – Odds Example
• The null and alternative hypothesis under study:
H0: The odds of having lung cancer in smokers = the odds of having lung cancer in non-smokers (i.e. odds ratio = 1)
HA: The odds of having lung cancer are not equal
• P < 0.001: the small P-value indicates there is evidence against the null hypothesis, so we reject the null hypothesis.
Chi-Square Tests
                                Value       df   Asymp. Sig.   Exact Sig.   Exact Sig.   Point
                                                 (2-sided)     (2-sided)    (1-sided)    Probability
Pearson Chi-Square              22.044(b)   1    .000          .000         .000
Continuity Correction(a)        20.316      1    .000
Likelihood Ratio                26.140      1    .000          .000         .000
Fisher's Exact Test                                            .000         .000
Linear-by-Linear Association    22.027(c)   1    .000          .000         .000         .000
N of Valid Cases                1298
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 14.50.
c. The standardized statistic is 4.693.
χ2 Test – Odds Example
• The study provides considerable evidence to suggest an association between lung cancer and smoking
• The odds of having lung cancer are significantly greater for smokers than non-smokers, odds ratio = 14.04 (P<0.001)
Assumptions of the χ2 Test
1. The expected frequency in each of the four cells is at least 5
The Chi squared test is hence only valid if all the expected frequencies are sufficient
2. The Chi squared test also assumes the groups are independent (e.g. treatment group and placebo group, under 5’s and over 5’s)
Chi-Square Tests
                                Value       df   Asymp. Sig.   Exact Sig.   Exact Sig.   Point
                                                 (2-sided)     (2-sided)    (1-sided)    Probability
Pearson Chi-Square              1.326(b)    1    .250          .534         .355
Continuity Correction(a)        .217        1    .641
Likelihood Ratio                2.126       1    .145          .534         .355
Fisher's Exact Test                                            .534         .355
Linear-by-Linear Association    1.273(c)    1    .259          .534         .355         .355
N of Valid Cases                25
a. Computed only for a 2x2 table
b. 2 cells (50.0%) have expected count less than 5. The minimum expected count is .84.
c. The standardized statistic is 1.128.
What if Expected Frequency < 5?
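• Use Fisher's exact test - given in the SPSS output of the Chi-squared test
• If any one of the expected cell counts is less than 5, interpret Fisher's exact test instead of the Pearson Chi-squared result (e.g. in the N = 25 output, where 2 cells have expected counts below 5, report Exact Sig. (2-sided) = .534)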
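For readers curious what Fisher's exact test computes, here is a minimal sketch of the two-sided version: with all margins fixed, it sums the hypergeometric probabilities of every table that is no more likely than the observed one. The 2 x 2 counts used below are an assumption: the slide does not show the underlying table, so they were chosen only to be consistent with N = 25 and a minimum expected count of .84.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P-value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def weight(k):
        # Number of ways to obtain k in the top-left cell with the margins fixed.
        return comb(row1, k) * comb(row2, col1 - k)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed_weight = weight(a)
    # Integer comparison avoids floating-point ties between equally likely tables.
    as_extreme = sum(weight(k) for k in range(lowest, highest + 1)
                     if weight(k) <= observed_weight)
    return as_extreme / comb(n, col1)

# Hypothetical counts (not shown on the slide), N = 25.
p_two_sided = fisher_exact_two_sided(0, 3, 7, 15)
print(f"Fisher's exact test, two-sided P = {p_two_sided:.3f}")
```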
What if the Groups are Not Independent?
• What if the two groups are related?
- Each individual may have had their outcome measured in 2 different circumstances
- Cross-over trial – each patient receives drug and placebo
- Matched Case-Control Study
• Use McNemar's Test
• SPSS: Analyze → Descriptive Statistics → Crosstabs, then select McNemar's Test in the Statistics option
• Same null and alternative hypothesis:
H0: The proportion of individuals with the characteristic is equal in the two groups in the population
HA: The proportion of individuals with the characteristic is not equal in the two groups in the population
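McNemar's test uses only the discordant pairs, i.e. the pairs whose outcome differs between the two circumstances. The sketch below shows both the chi-squared form, (b - c)^2 / (b + c) with 1 degree of freedom, and an exact binomial form. The counts are hypothetical (the slides give no paired data), and which form a given package reports is not assumed here.

```python
from math import comb, erfc, sqrt

def mcnemar(b, c):
    """McNemar's test from the two discordant-pair counts b and c.

    Returns (chi-squared statistic, asymptotic P, exact binomial P).
    """
    n = b + c
    chi2 = (b - c) ** 2 / n
    p_asymptotic = erfc(sqrt(chi2 / 2))  # chi-squared with 1 df
    # Exact: under H0 each discordant pair is equally likely to go either way.
    tail = sum(comb(n, k) for k in range(0, min(b, c) + 1)) / 2 ** n
    p_exact = min(1.0, 2 * tail)
    return chi2, p_asymptotic, p_exact

# Hypothetical cross-over trial: 15 patients healed on drug only,
# 5 healed on placebo only (concordant pairs do not enter the test).
chi2, p_asymptotic, p_exact = mcnemar(b=15, c=5)
print(f"chi-squared = {chi2:.2f}, asymptotic P = {p_asymptotic:.3f}, exact P = {p_exact:.3f}")
```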
Categorical Data: More than 2 Categories
• Suppose we wish to test for an association between two factors which may have more than two categories
• Example: Is there an association between blood group (4 group levels: A, B, O, AB) and disease severity (3 groups: mild, moderate, severe). Are individuals of a particular blood group likely to be more severely ill?
• We can still use the Chi-squared test on larger tables - data presented in an r x c contingency table (r rows and c columns)
• The null and alternative hypothesis under study:
H0: There is no association between the categories of one factor and the categories of the other factor in the population
HA: The two factors are associated in the population
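The same expected-count formula extends directly to an r x c table, with (r - 1) x (c - 1) degrees of freedom. The sketch below uses made-up counts for the blood group by severity example (the slides give none), and stops at the statistic and degrees of freedom rather than the P-value.

```python
def chi_squared_statistic(observed):
    """Pearson chi-squared statistic, degrees of freedom and expected counts
    for an r x c table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected

# Hypothetical counts. Rows: blood group A, B, O, AB.
# Columns: mild, moderate, severe.
severity_by_blood_group = [[30, 20, 10],
                           [15, 12, 8],
                           [40, 30, 20],
                           [10, 8, 7]]

chi2, df, expected = chi_squared_statistic(severity_by_blood_group)
smallest_expected = min(min(row) for row in expected)
print(f"chi-squared = {chi2:.3f} on {df} df, smallest expected count = {smallest_expected:.2f}")
```

Checking the smallest expected count mirrors the footnote SPSS prints under its Chi-Square Tests table.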
Binary Logistic Regression
• Up until now we have discussed regression with a numerical outcome/dependent variable
• Lecture 4 - Linear regression is a modelling technique used to explore the associations between one numerical dependent variable and one or more explanatory variables (be these numerical or categorical)
• We are often interested in examining binary outcomes, for example mortality (dead/alive), case/control, success/failure
• We can model a binary outcome using binary logistic regression
Binary Logistic Regression
• Useful when we wish to compare the proportion of people with a particular binary outcome by group, but adjusted for potential confounders
• Examples:
1. Is there an association between smoking and lung cancer after adjusting for sex?
2. Is a new treatment associated with mortality after adjustment for age?
Binary Logistic Regression
• The dependent variable and explanatory variable(s) are distinguished in the same way as linear regression
• In binary logistic regression the binary outcome of interest is the dependent variable. The other factors of interest which we believe may be related to the binary outcome are the explanatory/independent variables
• Logistic regression evaluates the odds that an individual with a particular combination of values for the explanatory variables will have the binary outcome of interest
Binary Logistic Regression
• When you fit a binary logistic regression model, for each explanatory variable you will get an odds ratio (OR), reported by SPSS as EXP(B)
• For binary/categorical explanatory variables, the OR is the ratio of the odds of the binary outcome in one group to the odds in the other/reference group
• For numerical explanatory variables, the OR is the factor by which the odds of the binary outcome are multiplied for a one unit increase in the numerical explanatory variable
Binary Logistic Regression
• If the odds ratio is greater than 1 then as the predictor increases, the odds of the outcome occurring increase
• Conversely an odds ratio value less than 1 indicates that as the predictor increases the odds of the outcome occurring decrease
• If the odds ratio = 1 then this implies equality
- For a binary/categorical predictor the odds are equivalent for one group compared to another
- For a numerical predictor the odds are equivalent for the different levels of the continuous variables
Binary Logistic Regression Example
• Example: Clinical trial for breast cancer, comparing mortality at 5 years between new vs standard drug
- Outcome is mortality at 5 years – either yes or no
- Difference in age between two treatment groups – need to adjust
• Binary logistic regression is an ideal method of analysis to employ to determine if treatment is associated with mortality at 5 years after adjustment for age
• We will fit a binary logistic regression model and get an odds ratio (OR) for Treatment (adjusted for age) and an OR for Age (adjusted for treatment)
Binary Logistic Regression Example
• Data: [data table shown on the slide, with the binary outcome of interest highlighted]
Binary Logistic Regression Example
• SPSS: Analyze → Regression → Binary Logistic
Binary Logistic Regression Example
• For each variable odds ratio (OR) = Exp(B)
- For the binary explanatory variable Treatment, the OR is the odds of mortality for Treatment = 1 relative to the reference category, Treatment = 0
• OR for Treatment = Exp(B) = 0.368
In comparison to Treatment = 0, the odds of mortality at 5 years for Treatment = 1 are 0.368 times as large, or equivalently (0.368 - 1) x 100 = -63.2%, i.e. 63.2% lower
Variables in the Equation
                                                                         95.0% C.I. for EXP(B)
                            B         S.E.     Wald     df   Sig.   Exp(B)    Lower     Upper
Step 1(a)   Age             .186      .100     3.439    1    .064   1.204     .989      1.466
            Treatment(1)    -1.001    .460     4.724    1    .030   .368      .149      .906
            Constant        -10.491   5.484    3.660    1    .056   .000
a. Variable(s) entered on step 1: Age, Treatment.
The Exp(B) column gives the odds ratios
Binary Logistic Regression Example
• For the continuous variable Age, the OR is the multiplicative change in the odds of mortality for a one unit increase in Age (1 year increase)
• OR for Age = Exp(B) = 1.204
As Age increases by one unit (1 year) the odds of mortality increase by a factor of 1.204, or equivalently increase by (1.204 - 1) x 100 = 20.4%
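The step from a logistic regression coefficient B to an odds ratio and a percentage change in odds can be sketched as below. The B values are the rounded coefficients from the SPSS output (Age = .186, Treatment(1) = -1.001); the dictionary and variable names are invented for this illustration.

```python
from math import exp

# Rounded coefficients (B) taken from the "Variables in the Equation" output.
coefficients = {"Age": 0.186, "Treatment(1)": -1.001}

odds_ratios = {}
percent_changes = {}
for name, b in coefficients.items():
    odds_ratios[name] = exp(b)                             # OR = Exp(B)
    percent_changes[name] = (odds_ratios[name] - 1) * 100  # % change in odds
    print(f"{name}: OR = {odds_ratios[name]:.3f}, "
          f"change in odds = {percent_changes[name]:+.1f}%")
```

Note the brackets: the percentage is (OR - 1) x 100, so an OR of 0.368 corresponds to a 63.2% reduction in the odds.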
Confidence Intervals
• The 95% confidence interval for an odds ratio gives the range within which we expect the true population odds ratio to lie
• We would expect the confidence interval of Exp(B) [OR] not to include 1 if the associated explanatory variable is significant
• If the 95% CI spans 1, we cannot be sure that the true odds ratio is not 1, i.e. the data are compatible with equal odds:
- For a binary/categorical predictor, the odds may be equivalent for one group compared to another
- For a continuous predictor, the odds may be equivalent across the different levels of the continuous variable
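A common way to obtain these intervals, shown here as a sketch under the usual large-sample (Wald) assumption, is to form the interval on the log-odds scale as B ± 1.96 x S.E. and then exponentiate both ends. The B and S.E. values are the rounded figures from the SPSS output, so the limits may differ from SPSS in the third decimal place.

```python
from math import exp

def odds_ratio_ci(b, se, z=1.96):
    """Wald 95% confidence interval for Exp(B): exponentiate B +/- z * S.E."""
    return exp(b - z * se), exp(b + z * se)

# (B, S.E.) pairs as printed in the "Variables in the Equation" output.
estimates = {"Age": (0.186, 0.100), "Treatment(1)": (-1.001, 0.460)}

intervals = {}
for name, (b, se) in estimates.items():
    lower, upper = odds_ratio_ci(b, se)
    intervals[name] = (lower, upper)
    spans_one = lower <= 1 <= upper
    print(f"{name}: 95% CI ({lower:.3f}, {upper:.3f}), spans 1: {spans_one}")
```

The interval for Age spans 1 while the interval for Treatment does not, matching the significance pattern in the Sig. column.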
Significance of Predictors
• We can also test the null hypothesis that the relevant binary logistic regression coefficient is zero, which is equivalent to testing the hypothesis that the odds ratio associated with this variable is 1
• Wald test: formally tests the null hypothesis that a regression coefficient B is zero:
H0: B = 0
HA: B ≠ 0
Or equivalently:
H0: Exp(B) = odds ratio = 1
HA: Exp(B) = odds ratio ≠ 1
Binary Logistic Regression Example
• SPSS conducts the Wald test for you!
H0: Exp(B) = odds ratio = 1
HA: Exp(B) = odds ratio ≠ 1
• Age – P = 0.064. Taking a strict 0.05 critical level, there is insufficient evidence to reject H0 for Age; we cannot conclude that the odds ratio differs from 1
• Treatment – P = 0.030. The small P-value indicates evidence against H0 for Treatment; the odds ratio differs from 1
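The P-values in the Sig. column can be reproduced from the Wald statistics, which are referred to a chi-squared distribution with 1 degree of freedom. The sketch below uses the Wald values as printed (3.439 and 4.724); recomputing them as (B / S.E.)^2 from the rounded B and S.E. gives slightly different numbers, so the printed values are used directly.

```python
from math import erfc, sqrt

def wald_p_value(wald):
    """Upper-tail P-value for a Wald statistic on 1 degree of freedom."""
    return erfc(sqrt(wald / 2))

# Wald statistics as printed in the "Variables in the Equation" output.
wald_statistics = {"Age": 3.439, "Treatment(1)": 4.724}

p_values = {name: wald_p_value(w) for name, w in wald_statistics.items()}
for name, p in p_values.items():
    decision = "reject H0" if p < 0.05 else "do not reject H0"
    print(f"{name}: P = {p:.3f} ({decision} at the 0.05 level)")
```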
Binary Logistic Regression Example
• Summary of Results:
- OR for Age = 1.204, 95% CI (0.989, 1.466), P = 0.064
- OR for new treatment = 0.368, 95% CI (0.149, 0.906), P = 0.030
• The estimated odds of mortality increase with age; however, Age is not a significant predictor of mortality at the 0.05 level
• The odds of mortality were 63.2% lower for patients on the new treatment, a statistically significant reduction
SPSS Practical 4 & 5
• Linear Regression
- Fits a straight line to describe the relationship between one or more explanatory/independent variables and one numerical dependent/outcome variable
- Regression coefficients quantify the amount the dependent variable changes as the explanatory variable increases by one unit (multiple linear regression - after adjustment for any other explanatory variables)
• Chi squared test
• Binary logistic regression
• Solutions will be available on moodle
Key Points
• When comparing proportions/risks/odds of a characteristic of a categorical variable over 2 groups, consider the structure of the data:
- Independent groups: all expected counts at least 5 → χ2 Test; any expected count less than 5 → Fisher's Exact Test
- Non-independent samples: McNemar's Test
• We can also use the χ2 test to test for association between 2 categorical factors which may have any number of groups
• Binary logistic regression is used for modelling binary outcomes; output is given in terms of odds ratios