25.11.2014 - categorical data

Research Methodology

Statistics Lecture 5

Categorical Data: The Chi-Squared Test, Odds Ratios, Relative Risk and Logistic Regression

Rifat Hamoudi, Senior Lecturer

Review

• Comparing one numerical outcome over 2 or more groups:

Independent Groups
  2 Groups:  Independent t-test; Mann Whitney U test
  >2 Groups: One-way ANOVA; Kruskal Wallis test

Review

• Comparing one numerical outcome over 2 or more groups:

Independent Groups
  2 Groups:  Independent t-test; Mann Whitney U test
  >2 Groups: One-way ANOVA; Kruskal Wallis test

Paired Groups
  2 Groups:  Paired t-test; Wilcoxon's Signed Rank test

Review

• Assessing the relationship between two numerical variables:

Correlation Analysis: quantifies the strength of the linear association between two numerical variables

Regression Analysis: simple linear regression fits a straight line to describe the relationship between two numerical variables where one variable depends on the other

The regression coefficient quantifies the amount the dependent variable changes as the explanatory variable increases by one unit

Review

Which statistical test to use?

Next.....

Methods for Analysing Categorical Data

Outline

• Comparing Two Proportions:
- Chi-squared test
- Fisher's Exact test

• Risk, Risk Difference and Risk Ratio

• Odds and Odds ratio

• Binary Logistic Regression Analysis

Categorical Data

• Categorical Data is data that can be placed into categories: Binary/Ordinal/Nominal

• The mean is useless for categorical data! → We cannot use methods for continuous data to analyse categorical data

• We analyze frequencies for categorical variables, that is the number of things that fall into each combination of categories

Obesity in Young Children

• Obesity in young life can pave the way for future musculoskeletal conditions

• A dietician conducted a survey of 510 children at a local primary school

• Objective: Are there more obese children under 5 or over 5?

Categorical Data: Comparing Groups


• Objective: Are there more obese children under 5 or over 5?

• Initially tabulate observed frequencies as below in a 2 x 2 contingency table, for example:

                            Age Category
                       Under 5    Over 5
BMI under 30           92         323
Obese (BMI over 30)    19         76
Total                  111        399


• The proportions of obese children in each age category are calculated as follows:

                                     Age Category
                                Under 5          Over 5
BMI under 30                    92               323
Obese (BMI over 30)             19               76
Total                           111              399
Proportion of Obese Children    19/111 = 0.17    76/399 = 0.19

• We wish to formally compare the proportions of children with the obese characteristic

• We often have two independent groups of individuals (under 5 / over 5)

• We want to know whether the proportions of individuals with a particular characteristic (obese) are the same in the two groups


Categorical Data: Independent Groups →χ2 Test

• The Chi-Squared (χ2) test allows us to formally compare proportions between two independent groups

• It allows us to determine whether the observed frequencies (counts) differ markedly from the frequencies that we would expect by chance

• Define the null and alternative hypotheses under study:

H0: The proportions of individuals with the characteristic are equal in the two groups in the population

HA: These population proportions are not equal

Categorical Data: Two Independent Groups →χ2 Test

• SPSS: Analyse → Descriptive Statistics→ Crosstabs

Categorical Data: Two Independent Groups → χ2 Test

• χ2 Test technical details:

- The expected numbers in each of the four cells of our 2 x 2 contingency table are calculated assuming H0 is true (equal proportions)

- The formula for each expected cell count is: (row total x column total) / grand total, where the grand total is the total number of individuals that make up the sample (N)


Categorical Data: Two Independent Groups → χ2 Test

- What was observed is compared to the calculated expected numbers which would indicate there were no differences between the groups (equal proportions)

- A large discrepancy between the observed and the corresponding expected frequencies is an indication that the proportions in the two groups differ (P <0.05)

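Expected Numbers:

                       Under 5                   Over 5
BMI under 30           415 x 111/510 = 90.3      415 x 399/510 = 324.7
Obese (BMI over 30)    95 x 111/510 = 20.7       95 x 399/510 = 74.3

Observed Numbers:

                                Under 5          Over 5
BMI under 30                    92               323
Obese (BMI over 30)             19               76
Total                           111              399
Proportion of Obese Children    19/111 = 0.17    76/399 = 0.19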
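The expected-count formula can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name `expected_counts` and the row/column ordering are choices made here for illustration, not part of the SPSS workflow.

```python
def expected_counts(observed):
    """Expected cell counts under H0: (row total * column total) / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Rows: BMI under 30, Obese (BMI over 30); columns: Under 5, Over 5
observed = [[92, 323], [19, 76]]
expected = expected_counts(observed)

for row in expected:
    print([round(x, 1) for x in row])
```

The printed values can be compared with the expected numbers on the slide (90.3, 324.7, 20.7, 74.3).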

χ2 Test Example

• Example:

H0: The proportion of children with the obese characteristic is equal in the two age groups

HA: The proportion of children with the obese characteristic is not equal in the two age groups

• To conduct the χ2 test:

SPSS: Analyse → Descriptive Statistics →Crosstabs

χ2 Test Example

• 2 x 2 Contingency table:

Overweight * Age_Cat Crosstabulation (Count)

                            Age_Cat
Overweight             Under 5    Over 5    Total
BMI Under 30           92         323       415
Obese (BMI Over 30)    19         76        95
Total                  111        399       510

χ2 Test Example

• Expected cell counts:

                       Under 5                   Over 5
BMI Under 30           415 x 111/510 = 90.3      415 x 399/510 = 324.7
Obese (BMI Over 30)    95 x 111/510 = 20.7       95 x 399/510 = 74.3

χ2 Test Example

• Results of the Chi-squared test:

There is no evidence that the proportions of children with the obese characteristic differ between the two age groups (Under 5 = 0.17 or 17%, Over 5 = 0.19 or 19%; P = 0.644), so we do not reject H0

Chi-Square Tests

                                Value      df    Asymp. Sig.    Exact Sig.    Exact Sig.    Point
                                                 (2-sided)      (2-sided)     (1-sided)     Probability
Pearson Chi-Square              .214(b)    1     .644           .682          .379
Continuity Correction(a)        .105       1     .746
Likelihood Ratio                .217       1     .641           .682          .379
Fisher's Exact Test                                             .682          .379
Linear-by-Linear Association    .213(c)    1     .644           .682          .379          .101
N of Valid Cases                510

a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 20.68.
c. The standardized statistic is .462.
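As a rough cross-check of the Pearson Chi-Square row, the statistic can be computed by hand as the sum of (observed - expected)^2 / expected over the four cells. The sketch below uses only the Python standard library; the helper names are illustrative, and the P-value shortcut via `math.erfc` is valid only for 1 degree of freedom (a 2 x 2 table).

```python
import math


def pearson_chi2(observed):
    """Pearson chi-squared statistic for a contingency table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count under H0
            chi2 += (o - e) ** 2 / e
    return chi2


def p_value_df1(chi2):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


# Rows: BMI under 30, Obese; columns: Under 5, Over 5
observed = [[92, 323], [19, 76]]
chi2 = pearson_chi2(observed)
p_value = p_value_df1(chi2)
print(round(chi2, 3), round(p_value, 3))
```

The result can be compared with the Value and Asymp. Sig. columns of the Pearson Chi-Square row in the SPSS output.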

Categorical Data: Comparing Risks → χ2 Test

• The χ2 test compares observed and expected cell counts, which is useful for comparing proportions across two independent groups

• In the context of a randomised controlled trial our proportions will be risks

• Probably the most common scenario in medical research is comparing the outcome risk in two independent groups

• We can use the χ2 test to answer a common RCT question: is the risk of failing in group A the same as the risk of failing in group B?

χ2 Test – Risk Example

• Risk of not healing in the drug group = 0.42, or 0.42 x 100 = 42%

• Risk of not healing in the placebo group = 0.72, or 0.72 x 100 = 72%

• Risk difference = 72% - 42% = 30%

• The Chi-squared test allows us to formally compare risks between groups, answering the question: is the risk of not healing in the placebo group the same as the risk of not healing in the drug group?

                            Treatment
Outcome                Drug              Placebo
Not Healed             152               142
Healed                 212               56
Total                  364               198
Risk of Not Healing    152/364 = 0.42    142/198 = 0.72


• Define the null and alternative hypothesis under study:

H0: The risk of not healing is equal in the two treatment groups

HA: The risk of not healing is not equal in the two treatment groups

• SPSS: Analyse → Descriptive Statistics →Crosstabs

Chi-Square Tests

                            Value        df    Asymp. Sig.    Exact Sig.    Exact Sig.
                                               (2-sided)      (2-sided)     (1-sided)
Pearson Chi-Square          46.140(b)    1     .000           .000          .000
Continuity Correction(a)    44.946       1     .000
Likelihood Ratio            47.359       1     .000           .000          .000
Fisher's Exact Test                                           .000          .000
N of Valid Cases            562

a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 94.42.


• There is evidence to reject the null hypothesis (P<0.001). The risk of not healing is not equal in the drug and placebo groups (risk difference = 30%)
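For a 2 x 2 table the Pearson statistic has a well-known shortcut, n(ad - bc)^2 divided by the product of the four margin totals, and the continuity-corrected version subtracts n/2 from |ad - bc| before squaring. The sketch below applies both to the drug/placebo table; the function name and cell labels a, b, c, d are illustrative choices, and SPSS's exact internal computation is not shown in the slides.

```python
def chi2_2x2(a, b, c, d, continuity=False):
    """Chi-squared statistic for the 2 x 2 table [[a, b], [c, d]].

    With continuity=True, n/2 is subtracted from |ad - bc| before squaring
    (the usual continuity correction for 2 x 2 tables).
    """
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if continuity:
        diff = max(0.0, diff - n / 2)
    return n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Rows: Not Healed, Healed; columns: Drug, Placebo
pearson = chi2_2x2(152, 142, 212, 56)
corrected = chi2_2x2(152, 142, 212, 56, continuity=True)
print(round(pearson, 3), round(corrected, 3))
```

These two numbers can be compared with the Pearson Chi-Square (46.140) and Continuity Correction (44.946) rows of the SPSS output.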


Relative Risk (Risk Ratio)

• Typically the risk difference will be a sufficient way of presenting differences between groups with binary outcomes

• If the outcome is rare then ratios are more suitable

• Relative Risk (Risk Ratio)

Relative Risk (Risk Ratio)

                            Exposed to factor
Outcome of Interest    Yes      No       Total
Yes                    a        b        a + b
No                     c        d        c + d
Total                  a + c    b + d    n = a + b + c + d

Risk of Outcome in the exposed group = a / (a + c)

Risk of Outcome in the unexposed group = b / (b + d)

Relative Risk (Risk Ratio or RR):

$$\mathrm{RR} = \frac{\mathrm{Risk}_{\mathrm{exp}}}{\mathrm{Risk}_{\mathrm{unexp}}} = \frac{a/(a+c)}{b/(b+d)}$$


• Risk difference = 72% - 42% = 30%

• Relative Risk (Risk Ratio): (152/364) / (142/198) = 0.58

Treatment

Outcome Drug Placebo

Not Healed 152 142

Healed 212 56

Total 364 198

Risk of Not healing 152/364 = 42% 142/198 =72%

A subject in the drug group is 0.58 times as likely to not heal as a subject in the placebo group
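The risk, risk difference and relative risk for the drug/placebo example can be reproduced directly from the table counts. This is a small illustrative sketch; the variable names are chosen here and are not from the slides.

```python
# Counts from the treatment table: events = "Not Healed"
not_healed_drug, total_drug = 152, 364
not_healed_placebo, total_placebo = 142, 198

risk_drug = not_healed_drug / total_drug           # risk in the exposed (drug) group
risk_placebo = not_healed_placebo / total_placebo  # risk in the unexposed (placebo) group

risk_difference = risk_placebo - risk_drug
relative_risk = risk_drug / risk_placebo

print(round(risk_drug, 2), round(risk_placebo, 2))
print(round(risk_difference, 2), round(relative_risk, 2))
```

Note that the drug group is in the numerator of the relative risk, so a value below 1 corresponds to a lower risk of not healing on the drug.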

Interpretation of the Relative Risk (Risk Ratio)

• A RR of 1 indicates that the risk is the same in the two groups

• A RR <1 indicates that there is a reduction in the risk of the outcome in the exposed group (drug group) compared with the unexposed group (placebo)

• A RR >1 indicates that there is an increased risk in the exposed group (drug group) compared with the unexposed group (placebo)

Categorical Data: Comparing Odds → χ2 Test

• An RCT is often not feasible if an outcome is rare, so instead known cases and suitable controls are selected for a case-control study

• In a case-control study we do not interpret the proportions of cases/controls with specific characteristics

• Case-control studies only examine association – NOT causation

• We compare odds because patients are selected on the basis of their disease status

• We don't interpret proportions as risks: you could get any risk value you wish simply by varying the number of cases and controls selected, and the proportion of cases in the sample often does not reflect the true proportion in the general population

Odds and Odds Ratios

• Relative Risk is not valid in such a scenario

• Rather we will be comparing odds therefore we must use the Odds Ratio (OR) to present the differences between groups

• Odds are different to risks!

• What are odds, what is the odds ratio and how does the odds ratio differ from the risk ratio?

Odds and Odds Ratios

                Exposed to Factor
           Yes      No       Total
Case       a        b        a + b
Control    c        d        c + d
Total      a + c    b + d    n = a + b + c + d

Odds of being a Case in the exposed group = a / c

Odds of being a Case in the unexposed group = b / d

$$\text{Odds Ratio} = \frac{\text{Odds of being a case in the exposed group}}{\text{Odds of being a case in the unexposed group}} = \frac{a/c}{b/d} = \frac{a \times d}{b \times c}$$

Odds and Odds Ratios Example

Lung Cancer Doll & Hill Example: 649 male lung cancer patients and 649 controls. Compare the distribution of lung cancer among smokers and non-smokers. 647 of 1269 smokers had lung cancer compared to 2 of 29 non-smokers.

Odds of lung cancer in smokers = 647/622 = 1.04

Odds of lung cancer in non-smokers = 2/27 = 0.07

Odds ratio = (647/622) / (2/27) = (647 x 27) / (2 x 622) = 14.04

               Smoker    Non-smoker    Total
Lung Cancer    647       2             649
No Lung Ca     622       27            649
Total          1269      29            1298
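The odds and odds ratio for the Doll & Hill table can be reproduced with the cross-product formula (a x d) / (b x c). A minimal sketch, with the cell labels a, b, c, d following the layout of the case-control table above:

```python
# a, b: cases (lung cancer) among smokers and non-smokers
# c, d: controls (no lung cancer) among smokers and non-smokers
a, b = 647, 2
c, d = 622, 27

odds_smokers = a / c       # odds of being a case in the exposed group
odds_non_smokers = b / d   # odds of being a case in the unexposed group

odds_ratio = (a * d) / (b * c)  # identical to odds_smokers / odds_non_smokers

print(round(odds_smokers, 2), round(odds_non_smokers, 2), round(odds_ratio, 2))
```

Using the cross-product form avoids dividing two rounded odds (1.04 / 0.07 would give a noticeably different answer).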

Interpretation of the Odds Ratios

• If the odds ratio = 1 then this implies equality

• The odds are equivalent in the exposed and unexposed groups

• An odds ratio >1 indicates that the odds of disease (outcome) are greater in the exposed group than in the unexposed group

• An odds ratio <1 indicates that the odds of disease (outcome) are lower in the exposed group than in the unexposed group

Categorical Data: Comparing Odds → χ2 Test

• The Chi squared test can be used within the context of the case-control study to formally test:

H0: The odds of having lung cancer in smokers = the odds of having lung cancer in non-smokers (i.e. odds ratio = 1)

HA: The odds of having lung cancer are not equal in smokers and non-smokers

Odds of lung cancer in smokers = 647/622 = 1.04

Odds of lung cancer in non-smokers = 2/27 = 0.07

Odds ratio = (647/622) / (2/27) = (647 x 27) / (2 x 622) = 14.04

               Smoker    Non-smoker    Total
Lung Cancer    647       2             649
No Lung Ca     622       27            649

χ2 Test – Odds Example

• The null and alternative hypotheses under study:

H0: The odds of having lung cancer in smokers = the odds of having lung cancer in non-smokers (i.e. odds ratio = 1)

HA: The odds of having lung cancer are not equal in smokers and non-smokers

• P < 0.001: the small P-value indicates evidence against the null hypothesis, so we reject the null hypothesis

Chi-Square Tests

                                Value        df    Asymp. Sig.    Exact Sig.    Exact Sig.    Point
                                                   (2-sided)      (2-sided)     (1-sided)     Probability
Pearson Chi-Square              22.044(b)    1     .000           .000          .000
Continuity Correction(a)        20.316       1     .000
Likelihood Ratio                26.140       1     .000           .000          .000
Fisher's Exact Test                                               .000          .000
Linear-by-Linear Association    22.027(c)    1     .000           .000          .000          .000
N of Valid Cases                1298

a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 14.50.
c. The standardized statistic is 4.693.

χ2 Test – Odds Example

• The study provides considerable evidence to suggest an association between lung cancer and smoking

• The odds of having lung cancer are significantly greater for smokers than non-smokers, odds ratio = 14.04 (P<0.001)

Assumptions of the χ2 Test

1. The expected frequency in each of the four cells is at least 5

The Chi squared test is hence only valid if all the expected frequencies are sufficient

2. The Chi squared test also assumes the groups are independent (e.g. treatment group and placebo group, under 5’s and over 5’s)

What if Expected Frequency < 5?

• Use Fisher's exact test, which is given in the SPSS output of the Chi-squared test

• If any one of the expected cell counts is less than 5, interpret Fisher's exact test rather than the Pearson Chi-Square row, for example:

Chi-Square Tests

                                Value       df    Asymp. Sig.    Exact Sig.    Exact Sig.    Point
                                                  (2-sided)      (2-sided)     (1-sided)     Probability
Pearson Chi-Square              1.326(b)    1     .250           .534          .355
Continuity Correction(a)        .217        1     .641
Likelihood Ratio                2.126       1     .145           .534          .355
Fisher's Exact Test                                              .534          .355
Linear-by-Linear Association    1.273(c)    1     .259           .534          .355          .355
N of Valid Cases                25

a. Computed only for a 2x2 table
b. 2 cells (50.0%) have expected count less than 5. The minimum expected count is .84.
c. The standardized statistic is 1.128.
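To make the idea behind Fisher's exact test concrete, the sketch below enumerates every 2 x 2 table with the same margins and sums the hypergeometric probabilities of tables at least as unlikely as the one observed. The slides do not show the cell counts behind the N = 25 output, so the table used here (0, 3, 7, 15) is a hypothetical one, chosen only because it has N = 25 and a minimum expected count of 0.84; the function name is also illustrative.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Fisher's exact test for the table [[a, b], [c, d]] with margins held fixed.

    Returns (one_sided, two_sided). The two-sided P-value sums the
    probabilities of all tables no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = {x: prob(x) for x in range(lo, hi + 1)}
    p_obs = probs[a]

    lower_tail = sum(p for x, p in probs.items() if x <= a)
    upper_tail = sum(p for x, p in probs.items() if x >= a)
    one_sided = min(lower_tail, upper_tail)
    two_sided = sum(p for p in probs.values() if p <= p_obs * (1 + 1e-9))
    return one_sided, two_sided


# Hypothetical counts (assumption): N = 25, minimum expected count 3 x 7 / 25 = 0.84
one_sided, two_sided = fisher_exact_2x2(0, 3, 7, 15)
print(round(one_sided, 3), round(two_sided, 3))
```

With these assumed counts the results can be compared with the Exact Sig. columns of the Fisher's Exact Test row above; no large-sample approximation is involved, which is why the test remains valid when expected counts are small.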

What if the Groups are Not Independent?

• What if the two groups are related?
- Each individual may have had their outcome measured in 2 different circumstances
- Cross-over trial: each patient receives drug and placebo
- Matched case-control study

• Use McNemar's Test

• SPSS: Analyze → Descriptive Statistics → Crosstabs, then select McNemar's Test in the Statistics option

• Same null and alternative hypotheses:

H0: The proportion of individuals with the characteristic is equal in the two groups in the population

HA: The proportion of individuals with the characteristic is not equal in the two groups in the population

Categorical Data: More than 2 Categories

• Suppose we wish to test for an association between two factors which may have more than two categories

• Example: Is there an association between blood group (4 group levels: A, B, O, AB) and disease severity (3 groups: mild, moderate, severe). Are individuals of a particular blood group likely to be more severely ill?

• We can still use the Chi-squared test on larger frequency tables, with the data presented in an r x c contingency table (r rows and c columns)

• The null and alternative hypotheses under study:

H0: There is no association between the categories of one factor and the categories of the other factor in the population

HA: The two factors are associated in the population

Binary Logistic Regression

• Up until now we have discussed regression with a numerical outcome/dependent variable

• Lecture 4: Linear regression is a modelling technique used to explore the associations between one numerical dependent variable and one or more explanatory variables (be these numerical or categorical)

• We are often interested in examining binary outcomes, for example mortality (dead/alive), case/control, success/failure

• We can model a binary outcome using binary logistic regression

Binary Logistic Regression

• Useful when we wish to compare the proportion of people with a particular binary outcome by group, but adjusted for potential confounders

• Examples:

1. Is there an association between smoking and lung cancer after adjusting for sex?

2. Is a new treatment associated with mortality after adjustment for age?


• The dependent variable and explanatory variable(s) are distinguished in the same way as linear regression

• In binary logistic regression the binary outcome of interest is the dependent variable. The other factors of interest which we believe may be related to the binary outcome are the explanatory/independent variables

• Logistic regression evaluates the odds that an individual with a particular combination of values for the explanatory variables will have the binary outcome of interest
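As a sketch of what "evaluates the odds" means, binary logistic regression is usually written as a linear model on the log-odds scale. The symbols below are notation assumed here rather than taken from the slides: p is the probability of the binary outcome, x_1 to x_k are the explanatory variables, B_0 is the constant and B_1 to B_k are the regression coefficients.

```latex
\ln\left(\frac{p}{1-p}\right) = B_0 + B_1 x_1 + \dots + B_k x_k
\qquad\Longrightarrow\qquad
\text{odds} = \frac{p}{1-p} = e^{B_0 + B_1 x_1 + \dots + B_k x_k}
```

Under this form, a one-unit increase in x_i multiplies the odds by e^{B_i}, which is why an exponentiated coefficient is read as an odds ratio.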


• When you fit a binary logistic regression model, for each explanatory variable you will get an odds ratio (OR), reported by SPSS as Exp(B)

• For binary/categorical explanatory variables, the OR is the factor by which the odds of the binary outcome are multiplied for one group compared to the other/reference group

• For numerical explanatory variables, the OR is the factor by which the odds of the binary outcome are multiplied for a one unit increase in the numerical explanatory variable


• If the odds ratio is greater than 1 then as the predictor increases, the odds of the outcome occurring increase

• Conversely an odds ratio value less than 1 indicates that as the predictor increases the odds of the outcome occurring decrease

• If the odds ratio = 1 then this implies equality

- For a binary/categorical predictor the odds are equivalent for one group compared to another

- For a numerical predictor the odds are equivalent for the different levels of the continuous variables

Binary Logistic Regression Example

• Example: Clinical trial for breast cancer, comparing mortality at 5 years between new vs standard drug

- Outcome is mortality at 5 years: either yes or no

- Difference in age between the two treatment groups: need to adjust

• Binary logistic regression is an ideal method of analysis to employ to determine if treatment is associated with mortality at 5 years after adjustment for age

• We will fit a binary logistic regression model and get an odds ratio (OR) for Treatment (adjusted for age) and an OR for Age (adjusted for treatment)


• Data: [screenshot of the data sheet, with the binary outcome of interest labelled]

Binary Logistic Regression Example

• SPSS: Analyze → Regression → Binary Logistic


• For each variable, odds ratio (OR) = Exp(B)

- For the binary explanatory variable Treatment, the OR compares the odds of mortality for Treatment = 1 with the odds for the reference category, Treatment = 0

• OR for Treatment = Exp(B) = 0.368. The odds of mortality at 5 years for Treatment=1 are 0.368 times the odds for Treatment=0, or equivalently (0.368 - 1) x 100 = -63.2%, i.e. 63.2% lower

Variables in the Equation

                                                                       95.0% C.I. for EXP(B)
                  B          S.E.     Wald     df    Sig.    Exp(B)    Lower    Upper
Step 1(a)
  Age             .186       .100     3.439    1     .064    1.204     .989     1.466
  Treatment(1)    -1.001     .460     4.724    1     .030    .368      .149     .906
  Constant        -10.491    5.484    3.660    1     .056    .000

a. Variable(s) entered on step 1: Age, Treatment.

The Exp(B) column gives the odds ratios.


• For the continuous variable Age, the OR is the factor by which the odds of mortality change for a one unit increase in Age (1 year increase)

• OR for Age = Exp(B) = 1.204. As Age increases by one unit (1 year) the odds of mortality increase by a factor of 1.204, or equivalently increase by (1.204 - 1) x 100 = 20.4%
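The step from a coefficient B to an odds ratio and a percentage change in odds is just exponentiation. The sketch below uses the B values as printed (rounded to three decimals) in the SPSS table; the dictionary and variable names are illustrative.

```python
import math

# Coefficients B as printed in "Variables in the Equation" (rounded values)
coefficients = {"Age": 0.186, "Treatment(1)": -1.001}

odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}
percent_change = {name: (oratio - 1) * 100 for name, oratio in odds_ratios.items()}

for name in coefficients:
    print(name, round(odds_ratios[name], 3), round(percent_change[name], 1))
```

A positive percentage means the odds of mortality rise (per year of age), and a negative one means they fall (new treatment versus the reference treatment).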



Confidence Intervals

• 95% confidence intervals for the odds ratios give the range within which we expect the true population odds ratio to lie

• We would expect the confidence interval of Exp(B) [OR] to not include 1 if the associated explanatory variable is significant

• If the 95% CI spans OR = 1, the data are consistent with equal odds: we cannot be sure that the true odds ratio is not 1

- For a binary/categorical predictor the odds are equivalent for one group compared to another

- For a continuous predictor the odds are equivalent for the different levels of the continuous variable

Significance of Predictors

• We can also test the null hypothesis that the relevant binary logistic regression coefficient is zero, which is equivalent to testing the hypothesis that the odds ratio associated with this variable is 1

• Wald test: formally test the null hypothesis that a regression coefficient B is zero:

H0: B = 0
HA: B ≠ 0

Or equivalently:

H0: Exp(B) = odds ratio = 1
HA: Exp(B) = odds ratio ≠ 1


• SPSS conducts the Wald test for you!

H0: Exp(B) = odds ratio = 1
HA: Exp(B) = odds ratio ≠ 1

• Age: P = 0.064. Taking a strict 0.05 critical level, we do not reject H0 for Age; there is insufficient evidence that the odds ratio differs from 1

• Treatment: P = 0.030. The small P-value indicates evidence against H0 for Treatment; the odds ratio differs from 1
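The Wald statistic, its P-value and the 95% confidence interval for Exp(B) can all be approximated from B and its standard error. This is a sketch under the usual large-sample assumptions (Wald = (B / S.E.)^2 on 1 degree of freedom, CI = exp(B ± 1.96 x S.E.)); because it starts from the rounded B and S.E. printed in the table, the results agree with SPSS only to about two decimal places. The function name is illustrative.

```python
import math


def wald_summary(b, se, z=1.96):
    """Wald statistic, two-sided P-value and 95% CI for Exp(B)."""
    wald = (b / se) ** 2
    # For 1 degree of freedom: P(chi-squared > w) = erfc(sqrt(w / 2))
    p_value = math.erfc(math.sqrt(wald / 2))
    ci = (math.exp(b - z * se), math.exp(b + z * se))
    return wald, p_value, ci


# B and S.E. as printed (rounded) in the SPSS output
age = wald_summary(0.186, 0.100)
treatment = wald_summary(-1.001, 0.460)

for name, (wald, p_value, ci) in {"Age": age, "Treatment(1)": treatment}.items():
    print(name, round(wald, 2), round(p_value, 3), tuple(round(x, 3) for x in ci))
```

The interval for Age includes 1 while the interval for Treatment does not, which matches the conclusions drawn from the two P-values.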



• Summary of Results:

- OR for Age = 1.204, 95% CI (0.989, 1.466), P = 0.064

- OR for new treatment = 0.368, 95% CI (0.149, 0.906), P = 0.030

• As age increases, so do the odds of mortality; however, Age is not a significant predictor of mortality

• The odds of mortality were 63.2% lower for patients on the new treatment, a statistically significant reduction (P = 0.030)

SPSS Practical 4 & 5

• Linear Regression

- Fits a straight line to describe the relationship between one or more explanatory/independent variables and one numerical dependent/outcome variable

- Regression coefficients quantify the amount the dependent variable changes as the explanatory variable increases by one unit (multiple linear regression: after adjustment for any other explanatory variables)

• Chi squared test

• Binary logistic regression

• Solutions will be available on moodle

Key Points

• When comparing proportions/risks/odds of a characteristic of a categorical variable over 2 groups, consider the structure of the data:

- Independent groups: if all expected counts are at least 5, use the χ2 Test; if any expected count is less than 5, use Fisher's Exact Test

- Non-independent samples: McNemar's Test

• We can also use the χ2 test to test for association between 2 categorical factors which may have any number of groups

• Binary logistic regression is used for modelling binary outcomes; output is given in terms of odds ratios
