categorical questions

Upload: oteng-richard-selasie

Post on 16-Oct-2015

DESCRIPTION

Sample questions to try.

TRANSCRIPT

  • 1.1

    Analysis of categorical response data

    Topics covered in lecture 1:

    What is categorical data

    Response and explanatory variables

    Measurement scales for categorical data

    Course coverage

    Tabulated count data and related questions

    Non-tabulated categorical data

    Sampling design for tables

    Links with other methods

  • 1.2

    What is categorical data? The measurement scale for the response consists of a number of categories.

    Variable        Measurement scale
    Farm system     Dairy, Beef, Tillage etc.
    Mortality       Dead, Alive
    Food texture    Very soft, Soft, Hard, Very hard
    Litter size     0, 1, 2, 3 and >3

    Types of data discussed in this course

    Response variable(s) is categorical

    Explanatory variable(s) may be categorical or continuous

    Example 1: Does post-operative survival (categorical response) depend on the explanatory variables?

    Sex (categorical)
    Age (continuous)

    Example 2: In a random sample of Irish farmers, is there a relationship between attitudes to the EU and farm system?

    Farm system (categorical)
    Attitude to EU (categorical/ordinal)

    (Two response variables, no explanatory variables.) Could one of these be regarded as explanatory?

  • 1.3

    Measurement scales for categorical data

    Nominal - no underlying order

    Variable       Measurement scale
    Farm system    Dairy, Beef, Tillage etc.
    Weed species   Stellaria media, Poa annua, etc.

    Ordinal - underlying order in the scale

    Variable            Measurement scale
    Food texture        Very soft, Soft, Hard, Very hard
    Disease diagnosis   Very likely, Likely, Unlikely
    Education           Primary, Secondary, Tertiary

    Interval - underlying numerical distance between scale points

    Variable      Measurement scale
    Litter size   0, 1, 2, 3 and >3
    Age class     5
    Education     years in education

  • 1.4

    Tabulated count data and questions

    Single level table

    Example 1: A geneticist carries out a crossing experiment between F1 hybrids of a wild type and a mutant genotype and obtains an F2 progeny of 90 offspring with the following characteristics.

    Wild type   Mutant   Total
    80          10       90

    Is there evidence that the wild type is dominant, giving on average a 3:1 offspring phenotype ratio in its favour?
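
    One way to examine the 3:1 question is a chi-squared goodness-of-fit test, which compares the observed counts with those expected under a 3:1 ratio. The sketch below is an illustration added here, not part of the original slides: the function name `goodness_of_fit` is hypothetical, and the p-value relies on the fact that a chi-squared variable on 1 degree of freedom is the square of a standard normal.

```python
import math

def goodness_of_fit(observed, ratios):
    """Pearson chi-squared statistic for observed counts against hypothesised ratios."""
    n = sum(observed)
    total_ratio = sum(ratios)
    expected = [n * r / total_ratio for r in ratios]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return expected, chi2

observed = [80, 10]                      # wild type, mutant (from the table above)
expected, chi2 = goodness_of_fit(observed, [3, 1])

# For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print("expected:", expected)
print("chi-squared:", round(chi2, 3))
print("p-value:", round(p_value, 4))
```

    A large statistic here signals a departure from 3:1 in either direction, so the direction of the discrepancy (more wild type than expected) still has to be read from the counts themselves.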

    Two-way table

    Example 1: A sample of 124 mice was divided into two groups, 84 receiving a standard dose of pathogenic bacteria followed by an antiserum and a control group of 40 not receiving the antiserum. After 3 weeks the numbers dead and alive in each group were counted.

                  Outcome
                  Dead  Alive  Total  % dead
    + antiserum     19     65     84      23
    - antiserum     18     22     40      45
    Total           37     87    124

    Is there an association between mortality and treatment?

    Is the mortality rate the same for both treatments?

  • 1.5

    Example 2 - Categorical response and categorical

    explanatory variable: The opinion poll after the Good

    Friday Agreement with respondents classified by

    religion (R - Catholic or Protestant)

                  Favour  Oppose  Undec.  Total  % Favour
    Catholic         258      32      62    352        73
    Protestant       149      91     208    448        33
    Total            407     123     270    800        51
    % Cath            63      26      23

    1. Evidence that a majority of decided voters (all

    voters) support the agreement?

    2. Support pattern the same for Protestants and

    Catholics?
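
    The percentages behind both questions can be recomputed directly from the counts. The following is a minimal descriptive sketch in Python, added for illustration only; the helper `pct` and the variable names are assumptions, and no significance test is carried out.

```python
# Counts from the opinion poll table: (Favour, Oppose, Undecided)
poll = {
    "Catholic": (258, 32, 62),
    "Protestant": (149, 91, 208),
}

def pct(part, whole):
    """Percentage of `whole` made up by `part`."""
    return 100.0 * part / whole

# Column totals over both religions
totals = tuple(sum(col) for col in zip(*poll.values()))
favour, oppose, undecided = totals

# Question 1: support among all voters and among decided voters only
support_all = pct(favour, sum(totals))
support_decided = pct(favour, favour + oppose)

# Question 2: % in favour within each religion
support_by_religion = {r: pct(c[0], sum(c)) for r, c in poll.items()}

print(round(support_all, 1), round(support_decided, 1))
print({r: round(p, 1) for r, p in support_by_religion.items()})
```

    Whether the undecided respondents are counted changes the answer to question 1 noticeably, which is why the slide distinguishes decided voters from all voters.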

  • 1.6

    Example 3 (Snedecor and Cochran): Categorical

    response and interval categorical explanatory variable.

    The table below shows the number of aphids alive and

    dead after spraying with four concentrations of solutions

    of sodium oleate. Has the higher concentration given a

    significantly different percentage kill? Is there a

    relationship between concentration and mortality?

             Concentration of sodium oleate (%)
             0.65   1.10    1.6    2.1   Total
    Dead       55     62    100     72     289
    Alive      22     13     12      5      52
    Total      77     75    112     77     341
    % Dead   71.4   82.7   89.3   93.5    84.8

    Is mortality related to sodium oleate concentration?

  • 1.7

    Example 4 Categorical response and interval

    categorical explanatory variable (Cornfield 1962):

    Blood pressure (BP) was measured on a sample of

    males aged 40-59, who were also classified by

    whether they developed coronary heart disease (CHD)

    in a 6-year follow-up period. The data were classified

    by BP (interval categorical variable in 8 classes) and

    CHD (CHD or No-CHD).

    BP     CHD   No CHD   Total   % CHD
    186      8       35      43    18.6
    Total   92     1237    1329

    1. Is the incidence of CHD independent of BP?

    2. Is there a simple relationship between the probability of CHD and the level of BP?

  • 1.8

    Multiway table - relationship between categorical

    responses or categorical response and several

    categorical explanatory variables:

    Example 1: The NI opinion poll with respondents further classified by where they lived in Northern Ireland (L) (ARL table)

    West - rural and strong nationalist/Catholic
    Belfast - mixed population
    North East - industrial and Unionist/Protestant

                              Favour  Oppose  Undecided
    West        Catholic          73      20         20
                Protestant        47      34         69
    Belfast     Catholic          90       9         21
                Protestant        54      23         66
    North East  Catholic          95       3         21
                Protestant        48      34         73
    Total                        407     123        270

    1. Evidence that a majority of decided voters (all voters)

    support the agreement?

    2. Difference in support pattern between Protestants

    and Catholics?

    3. Difference in support pattern between Protestants

    and Catholics consistent over region?

    4. Within the Catholic (Protestant) population, does the strength of support change with region? Etc.

  • 1.9

    Example 2: Grouped binomial data - patterns of psychotropic drug consumption in a sample from West London (Murray et al 1981, Psy Med 11, 551-60).

    Sex  Age group  Psych. case  On drugs  Total
    M    1          No                  9    531
    M    2          No                 16    500
    M    3          No                 38    644
    M    4          No                 26    275
    M    5          No                  9     90
    M    1          Yes                12    171
    M    2          Yes                16    125
    M    3          Yes                31    121
    M    4          Yes                16     56
    M    5          Yes                10     26
    F    1          No                 12    588
    F    2          No                 42    596
    F    3          No                 96    765
    F    4          No                 52    327
    F    5          No                 30    179
    F    1          Yes                33    210
    F    2          Yes                47    189
    F    3          Yes                71    242
    F    4          Yes                45     98
    F    5          Yes                21     60

    Is psychotropic drug use affected by gender, age or psychological state, and are there interactions among these effects?
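
    A first descriptive look at grouped binomial data like this is to pool the groups over one factor at a time and compare the proportions on drugs. The sketch below is an illustration added here, with a hypothetical helper `proportion_on_drugs`. Pooled (marginal) proportions ignore the other factors and any interactions, so they are only a starting point and not the analysis the question asks for.

```python
from collections import defaultdict

# (sex, age group, psychiatric case, number on drugs, group total) from the table above
rows = [
    ("M", 1, "No", 9, 531), ("M", 2, "No", 16, 500), ("M", 3, "No", 38, 644),
    ("M", 4, "No", 26, 275), ("M", 5, "No", 9, 90),
    ("M", 1, "Yes", 12, 171), ("M", 2, "Yes", 16, 125), ("M", 3, "Yes", 31, 121),
    ("M", 4, "Yes", 16, 56), ("M", 5, "Yes", 10, 26),
    ("F", 1, "No", 12, 588), ("F", 2, "No", 42, 596), ("F", 3, "No", 96, 765),
    ("F", 4, "No", 52, 327), ("F", 5, "No", 30, 179),
    ("F", 1, "Yes", 33, 210), ("F", 2, "Yes", 47, 189), ("F", 3, "Yes", 71, 242),
    ("F", 4, "Yes", 45, 98), ("F", 5, "Yes", 21, 60),
]

def proportion_on_drugs(rows, key):
    """Pool the groups by `key` and return {level: (on_drugs, total, proportion)}."""
    pooled = defaultdict(lambda: [0, 0])
    for row in rows:
        level = key(row)
        pooled[level][0] += row[3]   # number on drugs
        pooled[level][1] += row[4]   # group size
    return {lvl: (d, n, d / n) for lvl, (d, n) in pooled.items()}

by_sex = proportion_on_drugs(rows, key=lambda r: r[0])
by_age = proportion_on_drugs(rows, key=lambda r: r[1])
by_case = proportion_on_drugs(rows, key=lambda r: r[2])

for name, result in [("sex", by_sex), ("age group", by_age), ("psych. case", by_case)]:
    for level, (d, n, p) in sorted(result.items()):
        print(f"{name:12s} {level!s:4s} {d:4d}/{n:4d} = {100 * p:.1f}%")
```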

  • 1.10

    Non-tabulated data and questions

    Example 1: Individual plants of Legousia were monitored in an experiment to see whether they survived after 3 months. Survived-yes is scored 1 and Survived-no is scored 0. Also recorded were

    CO2 treatment 2 levels low and high

    Density of Legousia

    Density of companion species

    Height of the plant (mm) two weeks after planting.

    Most individuals will have a unique profile in these four additional variables and so tabulation of the data by them is not feasible. The individual data is presented.

                             Density
    Subject  Surv  CO2  Ht   Leg.  Comp
    1        0     L    35   20    30
    2        1     L    68   22    27
    3        1     H    43   16    33
    4        0     L    27    4    16

    1. Is survival related to the explanatory variables (CO2, Height, density-self, density-companions)?

    2. Can the probability of survival be predicted from the subject's profile?

  • 1.11

    Example 2: A sample of 62 patients who had angioplasty for coronary artery disease were followed to see if they reblocked (restenosed) after 6 months. RS-yes is scored 1 and RS-no is scored 0 (a binary categorical response variable). Also recorded were

    Age in years - continuous variate

    Blood pressure (BP) - continuous variate

    Sex - nominal categorical (?)

    Cholesterol - continuous

    Most individuals will have a unique profile in these four

    additional variables and so tabulation of the data by

    them is not feasible. The individual data is presented.

    Subject  RS  Age  BP   Sex  Cholest.
    1        0   35   117  m    1
    2        1   68   154  f    5
    3        1   43   123  f    2
    4        0   27   110  m    3

    1. Is RS related to the explanatory variables (Age, BP, Sex and Cholesterol)?

    2. Can the probability of RS be predicted from the subject's profile?

  • 1.12

    Sampling designs - two and multiway tables

    Single sample (no margin fixed) simultaneously

    classified by several categorical variables. Used in

    Cross-sectional studies.

    Example: A simple random sample of 200 students

    was classified by gender and attitude to EU

    integration.

              EU integration
              Favour  Oppose  Total
    Male          43      53     96
    Female        61      33    104
    Total        104      86

    This is a snapshot of opinion at a moment in time -

    hence Cross-sectional.

  • 1.13

    One margin fixed: Samples of fixed size are selected

    for one category and individuals are classified by the

    other category(s).

    Example 1 (Clinical trial - a prospective study): Of 400 HIV positive pregnant women, 200 are assigned at random to each of Breast feeding (BF) or Formula feeding (FF). Two years after birth the child's HIV status is determined.

          Child's status (???)
          HIV +   HIV -   Total
    BF      62     138     200
    FF      45     155     200

    Example 2 (Cohort study - a prospective study): 400 HIV positive pregnant women are asked to select either Breast feeding (BF) or Formula feeding (FF). Two years after birth the child's HIV status is determined. Here the sample totals are determined by the mothers' choices.

    Example 3 (Case-control or retrospective study): A sample of 200 HIV+ and another of 200 HIV- two-year-old children are selected and classified by whether they were BF or FF. Here the numbers in each HIV outcome group are fixed by the design, so the % HIV+ among BF and among FF children cannot be computed.

  • 1.14

    [Figure: timeline with axis Past, Present, Future and rows Cohort, Cases and controls, Cross-sectional]

  • 1.15

    Notes on sampling designs

    In more complex studies more than one margin may

    be fixed.

    Example 1: Any replicated factorial experiment

    where the response is binary

    Example 2: Physicians' Health Study. NEJM 1988, 262-264. Four treatments

    Treatment  Aspirin  Beta carotene
    A          No       No
    B          Yes      No
    C          No       Yes
    D          Yes      Yes

    Example 3: 2x2 table with both margins fixed?

    The statistical properties differ considerably between

    sampling schemes, nevertheless the methods to be

    discussed below apply, with some modifications, to

    data collected using any of these sampling schemes.

  • 1.16

    Relationships with regression methods.

    Traditionally categorical data analysis has been viewed

    as completely distinct from and unconnected with

    regression and ANOVA methods. We show that there

    are many strong links and that many concepts transfer

    naturally between the methods.

  • 1.17

    SAS Analysis of example 1

    A sample of 124 mice was divided into two groups, 84 receiving a standard dose of pathogenic bacteria followed by an antiserum and a control group of 40 not receiving the antiserum. After 3 weeks the numbers dead and alive in each group were counted.

                  Outcome
                  Dead  Alive  Total  % dead
    + antiserum     19     65     84      23
    - antiserum     18     22     40      45
    Total           37     87    124

    Is there an association between mortality and treatment?

    Is the mortality rate the same for both treatments?

  • 1.18

    SAS program for analysis of example 1 data

    OPTIONS LINESIZE=72 PAGESIZE=59 NOCENTER ;
    /* One record per cell of the 2x2 table: treatment, outcome, cell count */
    DATA ANTISER;
      INPUT ANTISER $ MORTALI $ COUNT ;
      CARDS ;
    A_plus Dead 19
    A_plus Alive 65
    A_minus Dead 18
    A_minus Alive 22
    ;
    /* WEIGHT COUNT makes each record stand for COUNT mice */
    PROC FREQ ;
      TABLES ANTISER*MORTALI / CHISQ EXPECTED DEVIATION CELLCHI2 NOROW NOCOL NOPERCENT NOCUM ;
      WEIGHT COUNT ;
    RUN ;

  • 1.19

    Table of ANTISER by MORTALI

    ANTISER          MORTALI
    Frequency
    Expected
    Deviation
    Cell Chi-Square     Alive      Dead     Total
    A_minus                22        18        40
                       28.065    11.935
                       -6.065    6.0645
                       1.3105    3.0814
    A_plus                 65        19        84
                       58.935    25.065
                       6.0645    -6.065
                        0.624    1.4673
    Total                  87        37       124

    Statistics for Table of ANTISER by MORTALI

    Statistic                     DF      Value      Prob
    Chi-Square                     1     6.4833    0.0109
    Likelihood Ratio Chi-Square    1     6.2846    0.0122

    Sample Size = 124

  • 1.20

    Observed counts

                  Outcome
                  Dead  Alive  Total  % Dead
    + antiserum     19     65     84      23
    - antiserum     18     22     40      45
    Total           37     87    124      30

    Expected counts if outcome is independent of treatment

                  Outcome
                  Dead             Alive            Total  % Dead
    + antiserum   .3*84 = 25.2     .7*84 = 58.8        84      23
    - antiserum   .3*40 = 12.0     .7*40 = 28.0        40      45
    Total         37               87                 124      30

    Is there a discrepancy between observed and expected? $\chi^2 = \sum (\text{Observed} - \text{Expected})^2 / \text{Expected}$, summed over the cells of the table.
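
    The chi-squared calculation for the mice data can be sketched in a few lines of Python. This is an illustration added here, not the course's own code, and the function name `pearson_chi2` is hypothetical. It uses the exact expected counts (row total x column total / grand total) rather than the proportions rounded to .3 and .7, so it gives the 25.065 and 6.4833 of the SAS output rather than 25.2.

```python
def pearson_chi2(table):
    """Expected counts under independence and the Pearson chi-squared statistic."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count for a cell = row total * column total / grand total
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, chi2, df

mice = [[19, 65],   # + antiserum: dead, alive
        [18, 22]]   # - antiserum: dead, alive
expected, chi2, df = pearson_chi2(mice)

for row in expected:
    print([round(e, 3) for e in row])
print("chi-squared =", round(chi2, 4), "on", df, "df")
```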

  • 1.21

    SAS Analysis of example 3

    The table below shows the number of aphids alive and

    dead after spraying with four concentrations of solutions

    of sodium oleate. Has the higher concentration given a

    significantly different percentage kill? Is there a

    relationship between concentration and mortality?

             Concentration of sodium oleate (%)
             0.65   1.10    1.6    2.1   Total
    Dead       55     62    100     72     289
    Alive      22     13     12      5      52
    Total      77     75    112     77     341

    Is mortality independent of sodium oleate concentration?

  • 1.22

    SAS program for analysis of Insecticide data

    OPTIONS LINESIZE=72 PAGESIZE=59 NOCENTER ;
    /* SODOL = sodium oleate concentration (%), D_AL = 1 (dead) or 2 (alive) */
    DATA INSECT;
      INPUT SODOL D_AL COUNT ;
      CARDS ;
    0.65 1 55
    1.10 1 62
    1.6 1 100
    2.1 1 72
    0.65 2 22
    1.10 2 13
    1.6 2 12
    2.1 2 5
    ;
    /* WEIGHT COUNT makes each record stand for COUNT aphids */
    PROC FREQ ;
      TABLES D_AL*SODOL / CHISQ EXPECTED DEVIATION CELLCHI2 NOROW NOCOL NOPERCENT NOCUM ;
      WEIGHT COUNT ;
    RUN ;

  • 1.23

    Output from SAS PROC FREQ.

    TABLE OF D_AL BY SODOL

    D_AL       SODOL
    FREQUENCY|
    EXPECTED |
    DEVIATION|
    CELL CHI2|    0.65|     1.1|     1.6|     2.1|  TOTAL
    ---------+--------+--------+--------+--------+
    1        |     55 |     62 |    100 |     72 |    289
             |   65.3 |   63.6 |   94.9 |   65.3 |
             |  -10.3 |   -1.6 |    5.1 |    6.7 |
             |1.61249 |.038436 |.271785 |.696522 |
    ---------+--------+--------+--------+--------+
    2        |     22 |     13 |     12 |      5 |     52
             |   11.7 |   11.4 |   17.1 |   11.7 |
             |   10.3 |    1.6 |   -5.1 |   -6.7 |
             |8.96172 |.213617 | 1.5105 |3.87106 |
    ---------+--------+--------+--------+--------+
    TOTAL          77       75      112       77      341

  • 1.24

    STATISTICS FOR TABLE OF D_AL BY SODOL

    STATISTIC                     DF     VALUE      PROB
    ------------------------------------------------------
    CHI-SQUARE                     3    17.176     0.001
    LIKELIHOOD RATIO CHI-SQUARE    3    16.633     0.001
    MANTEL-HAENSZEL CHI-SQUARE     1    16.157     0.000
    PHI                                  0.224
    CONTINGENCY COEFFICIENT              0.219
    CRAMER'S V                           0.224

    Conclusion: Insect mortality is not independent of dose. Mortality is not constant as dose changes.
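
    The first two statistics in the output can be reproduced from the cell counts. The sketch below is an illustration in Python, added here with a hypothetical function name `chi2_statistics`. It computes the Pearson statistic, the sum of (O - E)^2 / E, and the likelihood ratio statistic, taken here in its usual form 2 * sum of O * ln(O / E), with E the expected count under independence.

```python
import math

def chi2_statistics(table):
    """Return (Pearson X2, likelihood-ratio G2, df) for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    x2 = 0.0
    g2 = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected under independence
            x2 += (o - e) ** 2 / e
            if o > 0:                               # a zero cell contributes 0 to G2
                g2 += 2 * o * math.log(o / e)
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return x2, g2, df

aphids = [[55, 62, 100, 72],   # dead, by concentration 0.65, 1.10, 1.6, 2.1
          [22, 13, 12, 5]]     # alive
x2, g2, df = chi2_statistics(aphids)
print(f"Pearson X2 = {x2:.3f}, LR G2 = {g2:.3f}, df = {df}")
```

    Both statistics treat the four concentrations as unordered categories. The Mantel-Haenszel statistic on 1 df in the output is the one that uses the ordering of the doses, and it is not computed in this sketch.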

             Sodium oleate (%)
             0.65   1.10    1.6    2.1   Total
    Dead       55     62    100     72     289
    Alive      22     13     12      5      52
    Total      77     75    112     77     341
    % Dead   71.4   82.7   89.3   93.5    84.8

  • 1.25

    Group two lowest and two highest levels

  • 1.26

    Analysis of CHD data

    Blood pressure (BP) was measured on a sample of

    males aged 40-59, who were also classified by

    whether they developed coronary heart disease (CHD)

    in a 6-year follow-up period. The data were classified

    by BP (interval categorical variable in 8 classes) and

    CHD (CHD or No-CHD).

    BP     CHD   No CHD   Total   % CHD
    186      8       35      43    18.6
    Total   92     1237    1329

    1. Is the incidence of CHD independent of BP?

    2. Is there a simple relationship between the probability of CHD and the level of BP?