Chapter 3 - Categorical Data Analysis (cont’d)
Outline
1. Paired binary responses - McNemar’s Test
2. Odds Ratios
3. Relative Risk
4. Chi-Square test for Trend
5. Handling Dates
Paired Binary Responses - McNemar’s Test
• The paired t-test is used to test for a difference in the
mean response, but is appropriate only for continuous
data.
• McNemar’s test can be used when the paired responses
take on 2 possible values (Yes-No, 0-1, T-F, Success-Fail,
etc.)
• For each subject, 2 binary variables are measured.
• Example: Suppose a new teaching method has been
developed. The effectiveness of the teaching method
can be assessed by having a pre-test and post-test for
a random sample of subjects.
The pre-test is taken before being taught, and the post-test
is taken afterward. One is interested in knowing
whether there is a significant difference between the
proportion of successes before and after.
/* A sample of 20 individuals is tested on their ability
   to tie a type of knot. Then they are taught how to tie
   the knot and tested again. If successful, they
   score 1, else 0. */
DATA KNOT;
    INFILE 'knot.txt';
    INPUT ID BEFORE AFTER;
PROC FREQ;
    /* AGREE requests McNemar's test for the paired responses */
    TABLES BEFORE*AFTER / AGREE;
RUN; QUIT;
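As a cross-check outside SAS, the sketch below computes the usual uncorrected McNemar chi-square, (b − c)²/(b + c), from the two discordant counts. It is written in Python rather than SAS, the function name `mcnemar` is made up for illustration, and the scores are hypothetical (they are not the contents of knot.txt).

```python
import math


def mcnemar(before, after):
    """Return the McNemar chi-square statistic and p-value for paired 0/1 data."""
    # b: success before, failure after; c: failure before, success after
    b = sum(1 for x, y in zip(before, after) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(before, after) if x == 0 and y == 1)
    if b + c == 0:
        raise ValueError("no discordant pairs; the statistic is undefined")
    stat = (b - c) ** 2 / (b + c)
    # Upper tail of a chi-square with 1 df: P(X > s) = erfc(sqrt(s / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical scores for 20 subjects (assumed data, for illustration only)
before = [0] * 10 + [1] * 2 + [1] * 5 + [0] * 3
after = [1] * 10 + [0] * 2 + [1] * 5 + [0] * 3

stat, p = mcnemar(before, after)
print(f"McNemar chi-square = {stat:.3f}, p = {p:.4f}")
```

Only the subjects whose result changed between the two tests contribute to the statistic; the concordant pairs drop out.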
Summary – McNemar’s Test
• McNemar’s test is used for comparing two binary
populations where there is a 1-1 correspondence between
observations in both populations (i.e. the data
are paired).
• It can be performed using PROC FREQ and the / AGREE option.
Odds Ratios
• Example: Suppose that the proportion of moderate-speed
(MS) car accident fatalities in which the victim
was not wearing a seatbelt is p1, while the proportion of
MS car accident survivors in which the victim was not
wearing a seatbelt is p2.
• The odds that a person killed in such an accident
wasn’t wearing a seatbelt are then p1/(1 − p1), while
the odds that someone who survived such an accident
wasn’t wearing a seatbelt are p2/(1 − p2).
• The ratio of these odds gives a way of summarizing the
risk of dying associated with being in an MS car accident
without wearing a seatbelt. For example, if p1 = .8 and
p2 = .1, then the odds are 4 and 1/9, and the odds ratio is 36.
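The arithmetic in this example can be checked with a few lines of Python (a sketch only; the helper names `odds` and `odds_ratio` are made up for illustration).

```python
def odds(p):
    """Odds corresponding to a probability p, with 0 < p < 1."""
    return p / (1 - p)


def odds_ratio(p1, p2):
    """Ratio of the odds for p1 to the odds for p2."""
    return odds(p1) / odds(p2)


# The example values from the slide: p1 = .8 and p2 = .1
print(odds(0.8), odds(0.1), odds_ratio(0.8, 0.1))
```

Different pairs of proportions can give the same odds ratio: p1 = .9 with p2 = .2 also yields odds of 9 and 1/4, and hence a ratio of 36.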
• Suppose that in a particular jurisdiction, there were 110
moderate-speed car accidents (single-occupant vehicles)
in a single year which resulted in fatalities. 25 of
the drivers involved were wearing seatbelts.
There were also 190 similar accidents which did not
result in fatalities. 175 of these drivers were wearing
seatbelts.
1. Is there a relation between wearing a seatbelt and
surviving a moderate-speed accident?
2. Estimate the odds ratio and compute a 95% confidence
interval for it.
Odds Ratios – Example
DATA SEATBELT;
INPUT SEATBELT $ FATAL $ COUNT;
/* SEATBELT = Y if driver was wearing one */
/* FATAL = Y if driver was killed */
DATALINES;
N N 15
N Y 85
Y N 175
Y Y 25
;
PROC FREQ;
TABLES SEATBELT*FATAL / CHISQ CMH;
WEIGHT COUNT;
RUN; QUIT;
• This program computes the chi-square test statistic to
answer the first question, and it computes
Cochran-Mantel-Haenszel statistics in order to estimate the
odds ratio and an associated confidence interval.
• This is an example of a case-control study. (The accident
fatalities are the cases, and the accident nonfatalities
are the control group.) For a retrospective study such as
this, it is not appropriate to look at the Cohort output.
• PROC FREQ also writes out a confidence interval based on the
logit, which is simpler for us to study.
• The odds ratio is estimated using
$$\mathrm{OR} = \frac{n_{11}\, n_{22}}{n_{12}\, n_{21}}$$
where n_ij = the count in the ith row and jth column of
the table.
Odds Ratios – Confidence Intervals
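• The 1 − 2α confidence interval is then given by
$$\left(\mathrm{OR}\, e^{-z_\alpha \sqrt{v}},\; \mathrm{OR}\, e^{z_\alpha \sqrt{v}}\right)$$
where
$$v = \frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}$$
and
$$P(Z > z_\alpha) = \alpha.$$
(Z is standard normal. For a 95% interval, α = .025 and z_α = 1.96.)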
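To see the formula at work, here is a small Python sketch (not SAS) that applies it to the seatbelt counts given earlier, with rows SEATBELT = N, Y and columns FATAL = N, Y. The function name `odds_ratio_ci` is an assumption made for this illustration, and z = 1.96 corresponds to a 95% interval.

```python
import math


def odds_ratio_ci(n11, n12, n21, n22, z=1.96):
    """Estimated odds ratio with its logit-based confidence limits."""
    or_hat = (n11 * n22) / (n12 * n21)
    v = 1 / n11 + 1 / n12 + 1 / n21 + 1 / n22
    half_width = z * math.sqrt(v)  # half-width on the log scale
    return or_hat, or_hat * math.exp(-half_width), or_hat * math.exp(half_width)


# Seatbelt table: (N,N) = 15, (N,Y) = 85, (Y,N) = 175, (Y,Y) = 25
or_hat, lcl, ucl = odds_ratio_ci(15, 85, 175, 25)
print(f"OR = {or_hat:.4f}, 95% CI = ({lcl:.4f}, {ucl:.4f})")
```

The interval is symmetric on the log scale, not on the original scale, which is why the limits are obtained by multiplying and dividing the estimate by the same factor.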
Odds Ratios – A check on the Accuracy of the Confidence
Intervals
• Let us write a simulation program to see how accurate
these logit confidence bounds are.
/* Program to simulate numbers of car accident
   fatalities, with and without seatbelts. This
   is located in the file simacc.sas */
DATA _NULL_;
    FILE 'SIMACC.TXT';
    FATALITY = 100; /* NUMBER OF FATAL CAR ACCIDENTS */
    NONFATAL = 200; /* NUMBER OF NONFATAL CAR ACCIDENTS */
    /* We assume that the proportion of fatalities
       without a seatbelt is .9 */
    COUNT11 = RANBIN(0,FATALITY,.9);
    /* These are the fatalities with a seatbelt */
    COUNT12 = FATALITY - COUNT11;
    /* We assume that the proportion of nonfatalities
       without a seatbelt is .2 */
    COUNT21 = RANBIN(0,NONFATAL,.2);
    /* These are the nonfatalities with a seatbelt */
    COUNT22 = NONFATAL - COUNT21;
    PUT COUNT11 COUNT12 COUNT21 COUNT22;
RUN; QUIT;
• Now compute the confidence interval and check whether
it contains the true odds ratio (36):
/* This is in file simaccOR.sas */
DATA SEATBELT;
    INFILE 'SIMACC.TXT';
    INPUT N11 N12 N21 N22;
    OR = (N11*N22)/(N12*N21);
    V = 1/N11 + 1/N12 + 1/N21 + 1/N22;
    LCL = OR*EXP(-1.96*SQRT(V));
    UCL = OR*EXP(1.96*SQRT(V));
    /* The variable CORRECT indicates whether the
       confidence interval contains the true odds ratio. */
    IF LCL < 36 AND UCL > 36 THEN
        CORRECT = 1;
    ELSE CORRECT = 0;
PROC PRINT NOOBS;
    VAR OR LCL UCL CORRECT;
RUN; QUIT;
• Now, we would like to simulate a large number of data
sets in order to test whether close to 95% of such
confidence intervals contain the correct value of the
odds ratio.
/* This is located in the file simaccDO.sas */
DATA _NULL_;
    FILE 'SIMACC.TXT';
    FATALITY = 100; /* NUMBER OF FATAL CAR ACCIDENTS */
    NONFATAL = 200; /* NUMBER OF NONFATAL CAR ACCIDENTS */
    DO I = 1 TO 1000;
        /* We assume that the proportion of fatalities
           without a seatbelt is .9 */
        COUNT11 = RANBIN(0,FATALITY,.9);
        /* These are the fatalities with a seatbelt */
        COUNT12 = FATALITY - COUNT11;
        /* We assume that the proportion of nonfatalities
           without a seatbelt is .2 */
        COUNT21 = RANBIN(0,NONFATAL,.2);
        /* These are the nonfatalities with a seatbelt */
        COUNT22 = NONFATAL - COUNT21;
        PUT COUNT11 COUNT12 COUNT21 COUNT22;
    END;
RUN; QUIT;
• The same program as before can be used to compute
all of the statistics.
DATA SEATBELT;
    INFILE 'SIMACC.TXT';
    INPUT N11 N12 N21 N22;
    OR = (N11*N22)/(N12*N21);
    V = 1/N11 + 1/N12 + 1/N21 + 1/N22;
    LCL = OR*EXP(-1.96*SQRT(V));
    UCL = OR*EXP(1.96*SQRT(V));
    IF LCL < 36 AND UCL > 36 THEN
        CORRECT = 1;
    ELSE CORRECT = 0;
PROC PRINT NOOBS;
    VAR OR LCL UCL CORRECT;
/* Count up the number of correct
   confidence intervals */
PROC MEANS SUM;
    VAR CORRECT;
RUN; QUIT;
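The same coverage experiment can be sketched in Python as a cross-check on the SAS programs. This is an illustration under stated assumptions, not a translation of the course files: the function names are made up, a fixed seed is used so the run is repeatable (the SAS call RANBIN(0, ...) seeds from the clock instead), and replicates with an empty cell are skipped because the estimate is undefined there.

```python
import math
import random


def binomial(rng, n, p):
    """Draw one Binomial(n, p) count as a sum of n Bernoulli trials."""
    return sum(1 for _ in range(n) if rng.random() < p)


def coverage(reps=1000, n_fatal=100, n_nonfatal=200,
             p1=0.9, p2=0.2, z=1.96, seed=1):
    """Fraction of simulated logit intervals that contain the true odds ratio."""
    rng = random.Random(seed)  # fixed seed: an assumption for repeatability
    true_or = (p1 / (1 - p1)) / (p2 / (1 - p2))
    hits = used = 0
    for _ in range(reps):
        n11 = binomial(rng, n_fatal, p1)        # fatalities without a seatbelt
        n12 = n_fatal - n11                     # fatalities with a seatbelt
        n21 = binomial(rng, n_nonfatal, p2)     # nonfatalities without a seatbelt
        n22 = n_nonfatal - n21                  # nonfatalities with a seatbelt
        if 0 in (n11, n12, n21, n22):
            continue  # odds ratio or variance undefined with an empty cell
        or_hat = (n11 * n22) / (n12 * n21)
        v = 1 / n11 + 1 / n12 + 1 / n21 + 1 / n22
        lcl = or_hat * math.exp(-z * math.sqrt(v))
        ucl = or_hat * math.exp(z * math.sqrt(v))
        used += 1
        hits += lcl < true_or < ucl
    return hits / used


print(f"Estimated coverage: {coverage():.3f}")
```

With 1000 replicates the estimated coverage itself has a standard error of roughly 0.007, so values a point or so away from 0.95 are not evidence against the interval.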
Summary – Odds Ratios
• The odds ratio is defined as
$$\mathrm{OR} = \frac{p_1 (1 - p_2)}{p_2 (1 - p_1)}.$$
• It can be estimated using PROC FREQ and the / CMH option.
• It is an appropriate measure to consider for retrospective
studies, such as case-control studies.
Relative Risk
• Example: In a study of the effectiveness of a flu vaccine,
1250 individuals were randomly selected from a
screened population. The vaccine was given to 750 of
the individuals while 500 were given a placebo. During
the subsequent flu season, the number of individuals in
the vaccine group who had caught the flu was 120 while
the number catching the flu in the placebo group was
240.
This is an example of a prospective cohort study. We can
easily estimate the incidence of flu for each treatment
group: 240/500 for the placebo group, and 120/750
for the vaccine group.
The estimated risk of acquiring the flu for those
on placebo is
$$\frac{240/500}{120/750} = 3.0$$
times that for those in the vaccine group.
• In order to estimate a confidence interval for the true
relative risk, we use the same procedure as for the odds
ratio:
DATA VACCTEST;
    INPUT TREATMENT $ FLU $ COUNT;
    /* FLU = Y, for subjects who caught the flu
           = N, for subjects who did not catch flu */
    DATALINES;
Vaccine Y 120
Vaccine N 630
Placebo Y 240
Placebo N 260
;
PROC FREQ;
    /* We do not calculate the chi-square test
       statistics this time */
    TABLES TREATMENT*FLU / CMH;
    WEIGHT COUNT;
RUN; QUIT;
This time, we read the row of output corresponding
to the placebo cohort risk. This gives us a 95%
confidence interval for the relative risk.
Relative Risk – Confidence Intervals
• The relative risk is estimated using
$$\mathrm{RR} = \frac{n_{11}\, n_{2}}{n_{1}\, n_{21}}$$
where n_ij = the count in the ith row and jth column of
the table, and n_i is the total of the ith row.
• The 1 − 2α confidence interval is then given by
$$\left(\mathrm{RR}\, e^{-z_\alpha \sqrt{v}},\; \mathrm{RR}\, e^{z_\alpha \sqrt{v}}\right)$$
where
$$v = \frac{1 - n_{11}/n_1}{n_{11}} + \frac{1 - n_{21}/n_2}{n_{21}}$$
and
$$P(Z > z_\alpha) = \alpha.$$
(Z is standard normal.)
• Let us write a simulation program to see how accurate
these confidence bounds are.
/* Simulation of 1000 flu vaccine
prospective studies */
/* This program is in simfluDO.sas */
DATA _NULL_;
FILE 'SIMFLU.TXT';
N1 = 500; /* No. Placebo */
N2 = 750; /* No. Vaccinated */
P1 = 0.3; /* Prop. Plac. Flu */
P2 = 0.1; /* Prop. Vacc. Flu */
RR = P1/P2; /* TRUE REL. RISK */
DO I = 1 TO 1000;
FLU_P = RANBIN(0,N1,P1);
/* No. Placebo Flu */
FLU_V = RANBIN(0,N2,P2);
/* No. Vaccine Flu */
PUT FLU_P N1 FLU_V N2 RR;
END;
RUN; QUIT;
/* Counting number of correct
confidence intervals for RR */
/* This program is in simfluRR.sas */
DATA FLU;
INFILE 'SIMFLU.TXT';
INPUT N11 N1 N21 N2 RR_TRUE;
P1 = N11/N1;
P2 = N21/N2;
RR = P1/P2;
V = (1-P1)/N11 + (1-P2)/N21;
LCL = RR*EXP(-1.96*SQRT(V));
UCL = RR*EXP(1.96*SQRT(V));
IF LCL < RR_TRUE < UCL THEN
CORRECT = 1;
ELSE
CORRECT = 0;
PROC MEANS SUM;
VAR CORRECT;
RUN; QUIT;
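For comparison with the simulated intervals, the relative-risk interval can also be evaluated directly on the flu vaccine counts from the example (240 of 500 on placebo, 120 of 750 vaccinated). The Python sketch below is a cross-check only; the function name `relative_risk_ci` is an assumption, and z = 1.96 gives a 95% interval.

```python
import math


def relative_risk_ci(n11, n1, n21, n2, z=1.96):
    """Estimated relative risk (row 1 vs row 2) with log-scale confidence limits."""
    p1 = n11 / n1
    p2 = n21 / n2
    rr = p1 / p2
    v = (1 - p1) / n11 + (1 - p2) / n21
    half_width = z * math.sqrt(v)  # half-width on the log scale
    return rr, rr * math.exp(-half_width), rr * math.exp(half_width)


# Row 1 = placebo (240 of 500 caught the flu), row 2 = vaccine (120 of 750)
rr, lcl, ucl = relative_risk_ci(240, 500, 120, 750)
print(f"RR = {rr:.2f}, 95% CI = ({lcl:.2f}, {ucl:.2f})")
```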
Summary – Relative Risk
1. The relative risk is defined as $p_1/p_2$.
2. It can be estimated using PROC FREQ and the / CMH option.
3. It is an appropriate measure for assessing risk when data
come from randomized controlled experiments.
Testing for Trend
• This is only appropriate if the variables represent ordinal
data (i.e. the values have some sort of inherent
ordering).
• The test is only appropriate for 2×N tables.
Testing for Trend – Mantel-Haenszel Test
• Example (fake student survey, again): We would like to
know if there is a trend according to YEAR of study and
the response to the funding question.
DATA SURVEY;
INFILE 'Fakesur2.txt';
INPUT ID 1-3
AGE 4-5
GENDER $ 6
YEAR 7
FULLTIME $ 8
FUNDINC $ 9
OWEEK 10;
PROC FREQ;
TITLE '2-Way Frequency Tables';
TABLES FUNDINC*YEAR / CHISQ NOCUM NOPERCENT
NOROW NOCOL;
RUN;
QUIT;
This time, we look at the Mantel-Haenszel Chi-Square
row of output.
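The Mantel-Haenszel chi-square is commonly computed as (n − 1)r², where r is the Pearson correlation between the row scores and the column scores and n is the total count; it is referred to a chi-square distribution with 1 degree of freedom. The Python sketch below illustrates that formula on hypothetical counts (not the survey data). The function name and the default integer scores 1, 2, ... are assumptions made for this example.

```python
import math


def mantel_haenszel_chisq(table, row_scores=None, col_scores=None):
    """(n - 1) * r^2 for a two-way table of counts, with its 1-df p-value."""
    n_rows, n_cols = len(table), len(table[0])
    # Assumed default scores: 1, 2, ... for rows and for columns
    row_scores = row_scores or list(range(1, n_rows + 1))
    col_scores = col_scores or list(range(1, n_cols + 1))
    cells = [(row_scores[i], col_scores[j], table[i][j])
             for i in range(n_rows) for j in range(n_cols)]
    n = sum(count for _, _, count in cells)
    mean_x = sum(x * count for x, _, count in cells) / n
    mean_y = sum(y * count for _, y, count in cells) / n
    sxx = sum(count * (x - mean_x) ** 2 for x, _, count in cells)
    syy = sum(count * (y - mean_y) ** 2 for _, y, count in cells)
    sxy = sum(count * (x - mean_x) * (y - mean_y) for x, y, count in cells)
    stat = (n - 1) * sxy ** 2 / (sxx * syy)
    p_value = math.erfc(math.sqrt(stat / 2))  # chi-square upper tail, 1 df
    return stat, p_value


# Hypothetical 2 x 3 table: rows = binary response, columns = ordered levels
counts = [[10, 20, 30],
          [30, 20, 10]]
stat, p = mantel_haenszel_chisq(counts)
print(f"Mantel-Haenszel chi-square = {stat:.3f}, p = {p:.5f}")
```

Because the statistic depends on the scores only through their correlation, a table whose row proportions are the same in every column gives a value of zero.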
Summary – Mantel-Haenszel Test
• Sometimes there is a natural ordering in the values of a
categorical variable, and we are interested in knowing if
there is a relation between the ordered values and some
other binary variable.
• The Mantel-Haenszel Chi-Square test can be used to
perform such a test.
• It can be performed using PROC FREQ and the / CHISQ option.
Chapter 4 - Working with Dates
Processing Date Variables
Example:
DATA WINNIPEG;
INFILE 'wwpt6080.dat';
INPUT DATE YYMMDD6. MINWIND 7-10 MEANWIND 11-14 MAXWIND 15-18 MINTEMP 19-23 MEANTEMP 24-28
MAXTEMP 29-33 MINPRESS 34-38 MEANPRES 39-43 MAXPRESS 44-48
CHGPRESS $ 49;
YEAR = YEAR(DATE); /* Extracts the year from the DATE */
MTH = MONTH(DATE); /* Extracts the month from the DATE */
PROC SORT;
BY YEAR;
PROC MEANS MEAN;
VAR MEANTEMP;
BY YEAR;
RUN; QUIT;
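The same idea, reading a YYMMDD date, extracting the year, and averaging by year, can be sketched in Python. This is an illustration with made-up records, not the contents of wwpt6080.dat. Two-digit years need a century: SAS resolves them through its YEARCUTOFF system option, while this sketch simply assumes the 1900s, which is a guess based on the file name.

```python
from collections import defaultdict
from datetime import date


def parse_yymmdd(text, century=1900):
    """Parse a 6-character YYMMDD string; the century is an assumption."""
    yy, mm, dd = int(text[0:2]), int(text[2:4]), int(text[4:6])
    return date(century + yy, mm, dd)


def mean_by_year(records):
    """Average the second field of (yymmdd, value) pairs within each year."""
    totals = defaultdict(lambda: [0.0, 0])  # year -> [sum, count]
    for text, value in records:
        year = parse_yymmdd(text).year
        totals[year][0] += value
        totals[year][1] += 1
    return {year: s / k for year, (s, k) in sorted(totals.items())}


# Hypothetical (date, mean temperature) records, for illustration only
records = [("600101", -20.0), ("600702", 22.0), ("610101", -18.0)]
print(mean_by_year(records))
```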