Chapter 3 - Categorical Data Analysis (cont’d)

Outline

1. Paired binary responses - McNemar’s Test

2. Odds Ratios

3. Relative Risk

4. Chi-Square test for Trend

5. Handling Dates


Paired Binary Responses - McNemar’s Test

• The paired t-test is used to test for a difference in the mean response, but is appropriate only for continuous data.

• McNemar’s test can be used when the paired responses take on 2 possible values (Yes-No, 0-1, T-F, Success-Fail, etc.).

• For each subject, 2 binary variables are measured.



• Example: Suppose a new teaching method has been developed. The effectiveness of the teaching method can be assessed by having a pre-test and post-test for a random sample of subjects.

The pre-test is taken before being taught, and the post-test is taken afterward. One is interested in knowing whether there is a significant difference between the proportion of successes before and after.



/* A sample of 20 individuals is tested on their ability to tie a
   type of knot. Then they are taught how to tie the knot and
   tested again. If successful, they score 1, else 0. */
DATA KNOT;
    INFILE 'knot.txt';
    INPUT ID BEFORE AFTER;
PROC FREQ;
    /* AGREE requests McNemar's test for the paired 2x2 table */
    TABLES BEFORE*AFTER / AGREE;
RUN; QUIT;
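As a sketch of what the test computes for a paired 2×2 table: write b for the number of subjects who failed before and succeeded after, and c for the number who succeeded before and failed after (b and c are labels introduced here for illustration, not names from the SAS output). McNemar's statistic uses only these discordant pairs and is compared with a chi-square distribution with 1 degree of freedom:

```latex
$$ Q_M = \frac{(b - c)^2}{b + c} $$
```

Subjects whose result did not change between the two tests contribute nothing to the statistic.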


Summary – McNemar’s Test

• McNemar’s test is used for comparing two binary populations where there is a 1-1 correspondence between observations in both populations (i.e. the data are paired).

• It can be performed using PROC FREQ and the / AGREE option.


Odds Ratios

• Example: Suppose that the proportion of moderate-speed (MS) car accident fatalities in which the victim was not wearing a seatbelt is $p_1$, while the proportion of MS car accident survivors in which the victim was not wearing a seatbelt is $p_2$.

• The odds that a person killed in such an accident wasn’t wearing a seatbelt are then $p_1/(1-p_1)$, while the odds that someone who survived such an accident wasn’t wearing a seatbelt are $p_2/(1-p_2)$.

• The ratio of these odds gives a way of summarizing the risk of dying associated with being in an MS car accident without wearing a seatbelt. For example, if $p_1 = .8$ and $p_2 = .1$, then the odds ratio is 36.
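Written out, the arithmetic behind this value is:

```latex
$$ \frac{p_1/(1-p_1)}{p_2/(1-p_2)} = \frac{.8/.2}{.1/.9} = \frac{4}{1/9} = 36 $$
```

The same ratio results from the values .9 and .2, since (.9/.1)/(.2/.8) = 9/.25 = 36.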


Odds Ratios

• Suppose that in a particular jurisdiction, there were 110 moderate-speed car accidents (single-occupant vehicles) in a single year which resulted in fatalities. 25 of the drivers involved were wearing seatbelts.

There were also 190 similar accidents which did not result in fatalities. 175 of these drivers were wearing seatbelts.

1. Is there a relation between wearing a seatbelt and surviving a moderate-speed accident?

2. Estimate the odds ratio and compute a 95% confidence interval for it.


Odds Ratios – Example

DATA SEATBELT;
    INPUT SEATBELT $ FATAL $ COUNT;
    /* SEATBELT = Y if driver was wearing one */
    /* FATAL = Y if driver was killed */
    DATALINES;
N N 15
N Y 85
Y N 175
Y Y 25
;
PROC FREQ;
    TABLES SEATBELT*FATAL / CHISQ CMH;
    WEIGHT COUNT;
RUN; QUIT;

• This program computes the chi-square test statistic to answer the first question, and it computes Cochran-Mantel-Haenszel statistics in order to estimate the odds ratio and an associated confidence interval.


Odds Ratios

• For a retrospective study such as this, it is not appropriate to look at the Cohort output. This is an example of a case-control study. (The accident fatalities are the cases, and the accident nonfatalities are the control group.)

• PROC FREQ also writes out a confidence interval based on the logit, which is simpler for us to study.

• The odds ratio is estimated using

$$ OR = \frac{n_{11} n_{22}}{n_{12} n_{21}} $$

where $n_{ij}$ is the count in the $i$th row and $j$th column of the table.


Odds Ratios – Confidence Intervals

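• The $1 - 2\alpha$ confidence interval is then given by

$$ \left( OR\, e^{-z_\alpha \sqrt{v}},\; OR\, e^{z_\alpha \sqrt{v}} \right) $$

where

$$ v = \frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}} $$

and

$$ P(Z > z_\alpha) = \alpha. $$

($Z$ is standard normal. For a 95% interval, $\alpha = .025$ and $z_\alpha = 1.96$.)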
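As a rough illustration, the interval can be computed by hand in a DATA step for the seatbelt counts given earlier (rows SEATBELT = N, Y; columns FATAL = N, Y). The data set name ORCI and the use of PROBIT to obtain the normal quantile are choices made for this sketch, not part of the original programs.

```sas
/* Sketch only: ORCI is a hypothetical data set name. */
DATA ORCI;
    N11 = 15;  N12 = 85;    /* no seatbelt: survived, killed */
    N21 = 175; N22 = 25;    /* seatbelt:    survived, killed */
    OR = (N11*N22)/(N12*N21);
    V = 1/N11 + 1/N12 + 1/N21 + 1/N22;
    Z = PROBIT(0.975);      /* upper 2.5% point of the standard normal */
    LCL = OR*EXP(-Z*SQRT(V));
    UCL = OR*EXP(Z*SQRT(V));
PROC PRINT NOOBS;
    VAR OR LCL UCL;
RUN; QUIT;
```

With this table layout the estimate is about 0.025, with bounds near 0.013 and 0.050: the odds of surviving for unbelted drivers are estimated at about 0.025 times those for belted drivers. Taking reciprocals expresses the same result as odds of dying.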


Odds Ratios – A check on the Accuracy of the Confidence Intervals

• Let us write a simulation program to see how accurate these logit confidence bounds are.

/* Program to simulate numbers of car accident
   fatalities, with and without seatbelts. This
   is located in the file simacc.sas */
DATA _NULL_;
    FILE 'SIMACC.TXT';
    FATALITY = 100; /* NUMBER OF FATAL CAR ACCIDENTS */
    NONFATAL = 200; /* NUMBER OF NONFATAL CAR ACCIDENTS */
    COUNT11 = RANBIN(0,FATALITY,.9);
    /* We assume that the proportion of fatalities
       without a seatbelt is .9 */
    COUNT12 = FATALITY - COUNT11;
    /* These are the fatalities with a seatbelt */
    COUNT21 = RANBIN(0,NONFATAL,.2);
    /* We assume that the proportion of nonfatalities
       without a seatbelt is .2 */
    COUNT22 = NONFATAL - COUNT21;
    /* These are the nonfatalities with a seatbelt. */
    PUT COUNT11 COUNT12 COUNT21 COUNT22;
RUN; QUIT;


Odds Ratios – A check on the Accuracy of the Confidence Intervals

• Now compute the confidence interval and check whether it contains the true odds ratio (36):

/* This is in file simaccOR.sas */
DATA SEATBELT;
    INFILE 'SIMACC.TXT';
    INPUT N11 N12 N21 N22;
    OR = (N11*N22)/(N12*N21);
    V = 1/N11 + 1/N12 + 1/N21 + 1/N22;
    LCL = OR*EXP(-1.96*SQRT(V));
    UCL = OR*EXP(1.96*SQRT(V));
    IF LCL < 36 AND UCL > 36 THEN
        CORRECT = 1;
    ELSE CORRECT = 0;
    /* The variable CORRECT indicates whether the confidence
       interval contains the true odds ratio or not. */
PROC PRINT NOOBS;
    VAR OR LCL UCL CORRECT;
RUN; QUIT;


Odds Ratios – A check on the Accuracy of the Confidence Intervals

• Now, we would like to simulate a large number of data sets in order to test whether close to 95% of such confidence intervals contain the correct value of the odds ratio.

/* This is located in the file simaccDO.sas */
DATA _NULL_;
    FILE 'SIMACC.TXT';
    FATALITY = 100; /* NUMBER OF FATAL CAR ACCIDENTS */
    NONFATAL = 200; /* NUMBER OF NONFATAL CAR ACCIDENTS */
    DO I = 1 TO 1000;
        COUNT11 = RANBIN(0,FATALITY,.9);
        /* We assume that the proportion of fatalities
           without a seatbelt is .9 */
        COUNT12 = FATALITY - COUNT11;
        /* These are the fatalities with a seatbelt */
        COUNT21 = RANBIN(0,NONFATAL,.2);
        /* We assume that the proportion of nonfatalities
           without a seatbelt is .2 */
        COUNT22 = NONFATAL - COUNT21;
        /* These are the nonfatalities with a seatbelt. */
        PUT COUNT11 COUNT12 COUNT21 COUNT22;
    END;
RUN; QUIT;


• The same program as before can be used to compute all of the statistics.

DATA SEATBELT;
    INFILE 'SIMACC.TXT';
    INPUT N11 N12 N21 N22;
    OR = (N11*N22)/(N12*N21);
    V = 1/N11 + 1/N12 + 1/N21 + 1/N22;
    LCL = OR*EXP(-1.96*SQRT(V));
    UCL = OR*EXP(1.96*SQRT(V));
    IF LCL < 36 AND UCL > 36 THEN
        CORRECT = 1;
    ELSE CORRECT = 0;
PROC PRINT NOOBS;
    VAR OR LCL UCL CORRECT;
/* Count up the number of correct
   confidence intervals */
PROC MEANS SUM;
    VAR CORRECT;
RUN; QUIT;
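The two steps can also be combined into one DATA step that never writes the intermediate text file. The following is only a sketch under that assumption: the data set name SIMOR, the guard against empty cells, and the choice to report the mean of CORRECT (the proportion of intervals that cover 36) are additions for illustration.

```sas
/* Sketch only: SIMOR and the in-memory layout are assumptions,
   not part of the original simacc programs. */
DATA SIMOR;
    DO I = 1 TO 1000;
        N11 = RANBIN(0,100,.9);   /* fatalities without a seatbelt */
        N12 = 100 - N11;          /* fatalities with a seatbelt */
        N21 = RANBIN(0,200,.2);   /* nonfatalities without a seatbelt */
        N22 = 200 - N21;          /* nonfatalities with a seatbelt */
        /* Skip the rare tables with an empty cell, where OR and V are undefined */
        IF MIN(N11,N12,N21,N22) > 0 THEN DO;
            OR = (N11*N22)/(N12*N21);
            V = 1/N11 + 1/N12 + 1/N21 + 1/N22;
            LCL = OR*EXP(-1.96*SQRT(V));
            UCL = OR*EXP(1.96*SQRT(V));
            CORRECT = (LCL < 36 AND UCL > 36);  /* 1 if the interval covers 36 */
            OUTPUT;
        END;
    END;
    KEEP OR LCL UCL CORRECT;
PROC MEANS N MEAN;
    VAR CORRECT;
RUN; QUIT;
```

Because the seed 0 takes its starting value from the system clock, the proportion changes from run to run. A value near .95 would suggest the interval performs as advertised for these sample sizes.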


Summary – Odds Ratios

• The odds ratio is defined as

$$ \frac{p_1 (1 - p_2)}{p_2 (1 - p_1)}. $$

• It can be estimated using PROC FREQ and the / CMH option.

• It is an appropriate measure to consider for retrospective studies, such as case-control studies.


Relative Risk

• Example: In a study of the effectiveness of a flu vaccine, 1250 individuals were randomly selected from a screened population. The vaccine was given to 750 of the individuals while 500 were given a placebo. During the subsequent flu season, the number of individuals in the vaccine group who caught the flu was 120, while the number catching the flu in the placebo group was 240.

This is an example of a prospective cohort study. We can easily estimate the incidence of flu for each treatment group: 240/500 for the placebo group, and 120/750 for the vaccine group.

The estimated risk of acquiring the flu for those on placebo is

$$ \frac{240/500}{120/750} = 3.0 $$

times that for those in the vaccine group. This ratio is the estimated relative risk.


Relative Risk

• In order to estimate a confidence interval for the true relative risk, we use the same procedure as for the odds ratio:

DATA VACCTEST;
    INPUT TREATMENT $ FLU $ COUNT;
    /* FLU = Y, for subjects who caught the flu
           = N, for subjects who did not catch flu */
    DATALINES;
Vaccine Y 120
Vaccine N 630
Placebo Y 240
Placebo N 260
;
PROC FREQ;
    TABLES TREATMENT*FLU / CMH;
    /* We do not calculate the chisquare test
       statistics this time */
    WEIGHT COUNT;
RUN; QUIT;

but this time, we read the row of output corresponding to the placebo cohort risk. This gives us a 95% confidence interval for the relative risk.

Relative Risk – Confidence Intervals

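• The relative risk is estimated using

$$ RR = \frac{n_{11} n_2}{n_1 n_{21}} $$

where $n_{ij}$ is the count in the $i$th row and $j$th column of the table, and $n_i$ is the total of the $i$th row.

• The $1 - 2\alpha$ confidence interval is then given by

$$ \left( RR\, e^{-z_\alpha \sqrt{v}},\; RR\, e^{z_\alpha \sqrt{v}} \right) $$

where

$$ v = \frac{1 - n_{11}/n_1}{n_{11}} + \frac{1 - n_{21}/n_2}{n_{21}} $$

and

$$ P(Z > z_\alpha) = \alpha. $$

($Z$ is standard normal.)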
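As a worked illustration with the flu data, assume the placebo group is row 1 and "caught the flu" is column 1 (an ordering chosen here to match the formula; it is not necessarily the order in which PROC FREQ prints the table). Then n11 = 240, n1 = 500, n21 = 120, n2 = 750, and with z = 1.96 for a 95% interval:

```latex
$$ RR = \frac{240 \times 750}{500 \times 120} = 3.0 $$

$$ v = \frac{1 - 240/500}{240} + \frac{1 - 120/750}{120}
     = \frac{0.52}{240} + \frac{0.84}{120} \approx 0.00917,
   \qquad \sqrt{v} \approx 0.0957 $$

$$ \left( 3.0\, e^{-1.96 \times 0.0957},\; 3.0\, e^{1.96 \times 0.0957} \right)
   \approx (2.49,\; 3.62) $$
```

The interval lies well above 1, which is consistent with a higher flu risk in the placebo group.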

• Let us write a simulation program to see how accurate these logit confidence bounds are.


/* Simulation of 1000 flu vaccine
   prospective studies */
/* This program is in simfluDO.sas */
DATA _NULL_;
    FILE 'SIMFLU.TXT';
    N1 = 500;   /* No. Placebo */
    N2 = 750;   /* No. Vaccinated */
    P1 = 0.3;   /* Prop. Plac. Flu */
    P2 = 0.1;   /* Prop. Vacc. Flu */
    RR = P1/P2; /* TRUE REL. RISK */
    DO I = 1 TO 1000;
        FLU_P = RANBIN(0,N1,P1);
        /* No. Placebo Flu */
        FLU_V = RANBIN(0,N2,P2);
        /* No. Vaccine Flu */
        PUT FLU_P N1 FLU_V N2 RR;
    END;
RUN; QUIT;

Relative Risk – Confidence Intervals

/* Counting number of correct
   confidence intervals for RR */
/* This program is in simfluRR.sas */
DATA FLU;
    /* Same file name, in the same case, as written by simfluDO.sas */
    INFILE 'SIMFLU.TXT';
    INPUT N11 N1 N21 N2 RR_TRUE;
    P1 = N11/N1;
    P2 = N21/N2;
    RR = P1/P2;
    V = (1-P1)/N11 + (1-P2)/N21;
    LCL = RR*EXP(-1.96*SQRT(V));
    UCL = RR*EXP(1.96*SQRT(V));
    IF LCL < RR_TRUE < UCL THEN
        CORRECT = 1;
    ELSE
        CORRECT = 0;
PROC MEANS SUM;
    VAR CORRECT;
RUN; QUIT;


Summary – Relative Risk

1. The relative risk is defined as $p_1/p_2$.

2. It can be estimated using PROC FREQ and the / CMH option.

3. It is an appropriate measure for assessing risk when data come from randomized controlled experiments.


Testing for Trend

• This is only appropriate if the variables represent ordinal data (i.e. the values have some sort of inherent ordering).

• The test is only appropriate for 2×N tables.


Testing for Trend – Mantel-Haenszel Test

• Example (fake student survey, again): We would like to know whether the response to the funding question shows a trend across YEAR of study.

DATA SURVEY;
    INFILE 'Fakesur2.txt';
    INPUT ID 1-3
          AGE 4-5
          GENDER $ 6
          YEAR 7
          FULLTIME $ 8
          FUNDINC $ 9
          OWEEK 10;


Testing for Trend – Mantel-Haenszel Test

PROC FREQ;
    TITLE '2-Way Frequency Tables';
    TABLES FUNDINC*YEAR / CHISQ NOCUM NOPERCENT
                          NOROW NOCOL;
RUN;
QUIT;

This time, we look at the Mantel-Haenszel Chi-Square row of output.
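For reference, a sketch of how this statistic is defined, assuming the default scores that PROC FREQ assigns to the row and column levels: with n the total number of observations and r the Pearson correlation between the row scores and the column scores,

```latex
$$ Q_{MH} = (n - 1)\, r^2 $$
```

which is compared with a chi-square distribution with 1 degree of freedom. Because it depends on the data only through r, it is sensitive to a linear trend across the ordered categories rather than to arbitrary departures from independence.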


Summary – Mantel-Haenszel Test

• Sometimes there is a natural ordering in the values of a categorical variable, and we are interested in knowing if there is a relation between the ordered values and some other binary variable.

• The Mantel-Haenszel Chi-Square test can be used to perform such a test.

• It can be performed using PROC FREQ and the / CHISQ option.


Chapter 4 - Working with Dates

Processing Date Variables

Example:

DATA WINNIPEG;
    INFILE 'wwpt6080.dat';
    INPUT DATE YYMMDD6. MINWIND 7-10 MEANWIND 11-14 MAXWIND 15-18
          MINTEMP 19-23 MEANTEMP 24-28 MAXTEMP 29-33
          MINPRESS 34-38 MEANPRES 39-43 MAXPRESS 44-48
          CHGPRESS $ 49;
    YEAR = YEAR(DATE); /* Extracts the year from the DATE */
    MTH = MONTH(DATE); /* Extracts the month from the DATE */
PROC SORT;
    BY YEAR;
PROC MEANS MEAN;
    VAR MEANTEMP;
    BY YEAR;
RUN; QUIT;
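To see what YEAR and MONTH return without needing the weather file, a small sketch can be run on a single date literal. The date and the variable names D, Y and M are made up for illustration and are not taken from the Winnipeg data.

```sas
/* Sketch only: the date literal and the names D, Y, M are
   illustrative and not taken from the Winnipeg data. */
DATA _NULL_;
    D = '15MAR1975'D;      /* SAS date literal, stored as days since 01JAN1960 */
    Y = YEAR(D);           /* calendar year of D */
    M = MONTH(D);          /* month number of D, 1 to 12 */
    PUT D= DATE9. Y= M=;   /* write the values to the log */
RUN;
```

The log line should read D=15MAR1975 Y=1975 M=3. The DATE9. format affects only how D is displayed; the stored value remains a number.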
