Not in FPP
Exploratory data analysis with two qualitative variables
Exploratory data analysis with two qualitative/categorical variables
Main tools:
Contingency tables
Conditional, marginal, and joint frequencies
Motivating example
Surviving the Titanic
Was there class discrimination in survival of the wreck of the Titanic?
“It has been suggested before the Enquiry that the third-class passengers had been unfairly treated, that their access to the boat deck had been impeded; and that when they reached the deck the first and second-class passengers were given precedence in getting places in the boats.” Lord Mersey, 1912
Titanic: Class by survival

| | 1st Class | 2nd Class | 3rd Class | Crew | Total |
|---|---|---|---|---|---|
| Dead | 122 | 167 | 528 | 696 | 1513 |
| Alive | 203 | 118 | 178 | 212 | 711 |
| Total | 325 | 285 | 706 | 908 | 2224 |
Titanic: Marginal frequencies
% Dead = 1513/2224 = 0.68
% Alive = 711/2224 = 0.32
% in first class = 325/2224 = 0.15
% in second class = 285/2224 = 0.13
% in third class = 706/2224 = 0.32
% crew = 908/2224 = 0.41
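As a minimal sketch of how these marginal frequencies come out of the class-by-survival table, the Python below divides each row total and each column total by the grand total. The variable names (`counts`, `outcome_marginals`, `class_marginals`) are illustrative choices, not part of the slides.

```python
# Counts from the class-by-survival table (rows: outcome, columns: class).
classes = ["1st", "2nd", "3rd", "Crew"]
counts = {
    "Dead": [122, 167, 528, 696],
    "Alive": [203, 118, 178, 212],
}

grand_total = sum(sum(row) for row in counts.values())

# Marginal frequency of each outcome: row total / grand total.
outcome_marginals = {
    outcome: sum(row) / grand_total for outcome, row in counts.items()
}

# Marginal frequency of each class: column total / grand total.
class_marginals = {
    cls: sum(row[j] for row in counts.values()) / grand_total
    for j, cls in enumerate(classes)
}

for name, freq in {**outcome_marginals, **class_marginals}.items():
    print(f"{name}: {freq:.3f}")
```

Marginal frequencies ignore the other variable entirely: the survival marginals say nothing about class, and the class marginals say nothing about survival.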
Titanic: Conditional frequencies
% (Alive | 1st) = 203/325 = 0.625
% (Alive | 2nd) = 118/285 = 0.414
% (Alive | 3rd) = 178/706 = 0.252
% (Alive | Crew) = 212/908 = 0.233
Based on these frequencies does there appear to be class discrimination?
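The conditional frequencies can be reproduced with a short Python sketch: within each class, divide the number alive by the class total. The counts are those of the class-by-survival table; the names `alive`, `dead`, and `survival_given_class` are illustrative.

```python
# Counts per class from the class-by-survival table.
classes = ["1st", "2nd", "3rd", "Crew"]
alive = [203, 118, 178, 212]
dead = [122, 167, 528, 696]

# %(Alive | class) = alive in that class / everyone in that class.
survival_given_class = {
    cls: a / (a + d) for cls, a, d in zip(classes, alive, dead)
}

for cls, freq in survival_given_class.items():
    print(f"%(Alive | {cls}) = {freq:.3f}")
```

Unlike a marginal frequency, each conditional frequency uses its own denominator (the class total), which is what makes the classes comparable.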
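Titanic: Class by person type

| | 1st Class | 2nd Class | 3rd Class | Crew | Total |
|---|---|---|---|---|---|
| Children | 6 | 24 | 79 | 0 | 109 |
| Women | 144 | 93 | 165 | 23 | 425 |
| Men | 175 | 168 | 462 | 885 | 1690 |
| Total | 325 | 285 | 706 | 908 | 2224 |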
Titanic: percentage of men in each class
% (Man | 1st) = 175/325 = 0.54
% (Man | 2nd) = 168/285 = 0.59
% (Man | 3rd) = 462/706 = 0.65
% (Man | Crew) = 885/908 = 0.97
There are larger percentages of men in third class and crew
Surviving the Titanic
A reason for class differences in survival:
Larger percentages of men died.
3rd class consisted of mostly men.
Hence, a larger percentage of 3rd class passengers died.
Once again keep in mind possible lurking variables that could be driving the relationship seen between two measured variables
Relative risk and odds ratios
Motivating example
Physicians’ health study (1989): randomized experiment with 22071 male physicians at least 40 years old
Half the subjects assigned to take aspirin every other day
Other half assigned to take a placebo, a dummy pill that looked and tasted like aspirin
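Physicians’ health study
Here is the number of people in each cell:

| | Heart attack | No heart attack | Total |
|---|---|---|---|
| Aspirin | 104 | 10933 | 11037 |
| Placebo | 189 | 10845 | 11034 |
| Total | 293 | 21778 | 22071 |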
Relative risk

| | y1 | y2 | Total |
|---|---|---|---|
| x1 | a | b | a+b |
| x2 | c | d | c+d |
| Total | a+c | b+d | |

Risk of y1 for level x1 = a/(a+b)
Risk of y1 for level x2 = c/(c+d)

$$\text{Relative risk} = \frac{a/(a+b)}{c/(c+d)}$$
Relative risk for physicians’ health study
Relative risk of a heart attack when taking aspirin versus when taking a placebo equals

$$RR = \frac{104/(104+10933)}{189/(189+10845)} = 0.55$$

People who took aspirin are 0.55 times as likely to have a heart attack as people who took the placebo.
Or, people who took the placebo are 1/0.55 = 1.82 times as likely to have a heart attack as people who took aspirin.
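The calculation can be checked with a small Python sketch of the a, b, c, d formula. The function name `relative_risk` is an illustrative choice, and reading a and c as heart-attack counts (with b and d as the counts without a heart attack) is inferred from the formula above.

```python
def relative_risk(a, b, c, d):
    """Risk of y1 at level x1 divided by risk of y1 at level x2.

    a, b: counts of y1 and y2 at level x1
    c, d: counts of y1 and y2 at level x2
    """
    risk_x1 = a / (a + b)
    risk_x2 = c / (c + d)
    return risk_x1 / risk_x2


# Aspirin: 104 heart attacks, 10933 without; placebo: 189 and 10845.
rr = relative_risk(104, 10933, 189, 10845)
print(round(rr, 2))      # aspirin versus placebo
print(round(1 / rr, 2))  # placebo versus aspirin
```

Swapping the two rows inverts the relative risk, which is why the same data yield both 0.55 and 1.82.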
Odds ratios

| | y1 | y2 |
|---|---|---|
| x1 | a | b |
| x2 | c | d |

Odds of y1 for level x1 = a/b
Odds of y1 for level x2 = c/d

$$\text{Odds ratio} = \frac{a/b}{c/d}$$
Odds ratios for physicians’ health study
Relative risk of a heart attack when taking aspirin versus taking a placebo is

$$RR = \frac{104/(104+10933)}{189/(189+10845)} = 0.55$$

Odds of having a heart attack when taking aspirin over odds of a heart attack when taking a placebo (odds ratio):

$$OR = \frac{104/10933}{189/10845} = 0.546$$
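A minimal Python sketch of the odds-ratio formula, applied to the same four counts, is given here for checking the arithmetic. The function name `odds_ratio` and the variable `or_value` are illustrative.

```python
def odds_ratio(a, b, c, d):
    """Odds of y1 at level x1 divided by odds of y1 at level x2.

    a, b: counts of y1 and y2 at level x1
    c, d: counts of y1 and y2 at level x2
    """
    return (a / b) / (c / d)


# Aspirin: 104 heart attacks, 10933 without; placebo: 189 and 10845.
or_value = odds_ratio(104, 10933, 189, 10845)
print(round(or_value, 3))
```

The odds ratio and the relative risk are close here because heart attacks were rare in both groups, so a/b is nearly a/(a+b) and c/d is nearly c/(c+d).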
Interpreting odds ratios and relative risks
When the variables X and Y are independent:
odds ratio = 1, relative risk = 1
When subjects with level x1 are more likely to have y1 than subjects with level x2, then
odds ratio > 1, relative risk > 1
When subjects with level x1 are less likely to have y1 than subjects with level x2, then
odds ratio < 1, relative risk < 1
Which one should be used?
If the relative risk is available, then it should be used.
In a cohort study, the relative risk can be calculated directly.
In a case-control study, the relative risk cannot be calculated directly, so an odds ratio is used instead.
Case-control studies compare subjects who have a “condition” (cases) to otherwise similar subjects who don’t (controls).
In this type of study we know %(exposure|disease). But to compute the RR we need %(disease|exposure).
Recall that RR = %(disease|exposure)/%(disease|no exposure).
The relative risk is also not available in more complex modeling (logistic regression).
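One way to see why the odds ratio still works in a case-control study is the identity below, written with the a, b, c, d table from the odds-ratio slide. Reading x1/x2 as exposed/unexposed and y1/y2 as diseased/not diseased is an assumption made here for illustration. The odds ratio built from exposure odds within each disease group (which a case-control study can estimate) equals the one built from disease odds within each exposure group:

$$\frac{a/c}{b/d} = \frac{ad}{bc} = \frac{a/b}{c/d}$$

The relative risk has no such symmetry, since a/(a+b) depends on how many diseased and non-diseased subjects the investigator chose to sample.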
Odds ratio vs relative risk
When is the odds ratio a good approximation of the relative risk?
When cases are representative of the diseased population
When controls are representative of the population without disease
When the disease being studied occurs at low frequency
Of itself, an odds ratio is a useful measure of association
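The low-frequency condition can be made explicit with a short derivation. The symbols p_1 = a/(a+b) and p_2 = c/(c+d), the risks of y1 at levels x1 and x2, are introduced here for convenience. Since the odds at each level are p/(1-p):

$$OR = \frac{p_1/(1-p_1)}{p_2/(1-p_2)} = \frac{p_1}{p_2} \times \frac{1-p_2}{1-p_1} = RR \times \frac{1-p_2}{1-p_1}$$

When both risks are small, the factor (1-p_2)/(1-p_1) is close to 1 and OR is approximately RR. In the physicians’ health study, p_1 is about 0.0094 and p_2 about 0.0171, so the factor is about 0.992 and 0.550 × 0.992 gives roughly 0.546, matching the odds ratio computed earlier.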
Relative risk vs absolute risk
% smokers who get lung cancer: 8% (conservative guess here)
Relative risk of lung cancer for smokers: 800%
Getting lung cancer is not commonplace, even for smokers. But, smokers’ chances of getting lung cancer are much, much higher than non-smokers’ chances.
Simpson's paradox
When a third variable seemingly reverses the association between two other variables
Hot hand example
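The slides do not give the hot hand data, so the sketch below uses entirely hypothetical counts, invented only to show how a reversal can arise. Level x1 has the higher success rate within each stratum of a third variable, yet the lower rate once the strata are pooled, because x1 is concentrated in the low-success stratum. All names and numbers are assumptions for illustration.

```python
# Hypothetical (successes, trials) counts, invented purely for illustration.
counts = {
    "stratum A": {"x1": (90, 100), "x2": (255, 300)},
    "stratum B": {"x1": (90, 300), "x2": (25, 100)},
}


def rate(successes, trials):
    return successes / trials


# Success rate of each level within each stratum of the third variable.
within = {
    stratum: {level: rate(*cell) for level, cell in levels.items()}
    for stratum, levels in counts.items()
}

# Pool the strata by summing successes and trials for each level.
pooled = {}
for level in ("x1", "x2"):
    successes = sum(counts[s][level][0] for s in counts)
    trials = sum(counts[s][level][1] for s in counts)
    pooled[level] = rate(successes, trials)

for stratum, rates in within.items():
    print(stratum, rates)
print("pooled", pooled)
```

Comparing the stratum-level rates with the pooled rates is the same check suggested earlier for lurking variables: condition on the third variable before interpreting the two-way table.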