week 12 anova and contingency tables. two categorical variables joint probabilities p x,y = p(x=x...

Week 12

Anova and contingency tables

Two categorical variables

Joint probabilities: $p_{x,y} = P(X = x \text{ and } Y = y)$, the proportion of the population with values $(x, y)$

School performance and weight of children

Conditional probabilities

Proportion within row

$$p_{\text{blonde} \mid \text{blue eyes}} = \frac{n_{\text{blonde \& blue eyes}}}{n_{\text{blue eyes}}} = \frac{p_{\text{blonde \& blue eyes}}}{p_{\text{blue eyes}}}$$

Conditional probabilities

Weight & performance are independent

School performance and weight of children

Independence from sample?

214 child skiers classified by skiing ability and whether they got injured

              Injured  Uninjured  Total
Beginner           20         60     80
Intermediate        9         84     93
Advanced            2         39     41
Total              31        183    214

Are ability and injury independent in underlying population?

Independence from sample?

Is there independence in underlying population?

Conditional sample proportions
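The conditional sample proportions for the skiing data can be computed directly from the table of counts. The following is a minimal sketch; the function name `injury_proportions` and the dictionary layout are illustrative choices, not part of the lecture.

```python
# Observed counts from the skiing table: (injured, uninjured) per ability level.
counts = {
    "Beginner": (20, 60),
    "Intermediate": (9, 84),
    "Advanced": (2, 39),
}

def injury_proportions(table):
    """Return the proportion injured within each row (ability level)."""
    return {
        level: injured / (injured + uninjured)
        for level, (injured, uninjured) in table.items()
    }

proportions = injury_proportions(counts)
for level, p in proportions.items():
    print(f"P(injured | {level.lower()}) = {p:.3f}")
```

If ability and injury were independent in the population, these three proportions would be expected to be similar, differing only by sampling variation.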

Testing for independence

Can a relationship observed in the sample data be inferred to hold in the population represented by the data?

Could observed sample relationship have occurred by chance?

Expected counts under independence

31 out of 214 injured overall

Expect 31/214 of the 80 beginners to be injured

i.e. expect $\frac{31 \times 80}{214} = 11.59$ injured beginners

Expected counts under independence

General formula

$$e_{xy} = \frac{n_x n_y}{n}$$
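The general formula follows from the definition of independence, under which each joint probability is the product of the two marginal probabilities. A short derivation, writing $\hat{p}_x = n_x / n$ and $\hat{p}_y = n_y / n$ for the estimated marginal proportions (the hat notation is introduced here for illustration only):

$$e_{xy} = n \, \hat{p}_x \, \hat{p}_y = n \cdot \frac{n_x}{n} \cdot \frac{n_y}{n} = \frac{n_x n_y}{n}$$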

              Injured                 Uninjured                Total
Beginner      80 × 31 / 214 = 11.59   80 × 183 / 214 = 68.41      80
Intermediate  93 × 31 / 214 = 13.47   93 × 183 / 214 = 79.53      93
Advanced      41 × 31 / 214 = 5.94    41 × 183 / 214 = 35.06      41
Total         31                      183                        214
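The expected counts can be reproduced from the row and column totals alone. This is a small sketch assuming the totals are held in plain dictionaries; the name `expected_counts` is hypothetical.

```python
# Marginal totals from the skiing table.
row_totals = {"Beginner": 80, "Intermediate": 93, "Advanced": 41}
col_totals = {"Injured": 31, "Uninjured": 183}
n = sum(row_totals.values())  # overall sample size, 214

def expected_counts(rows, cols, total):
    """Expected count for each cell under independence: row total x column total / n."""
    return {(r, c): rows[r] * cols[c] / total for r in rows for c in cols}

expected = expected_counts(row_totals, col_totals, n)
for (r, c), e in expected.items():
    print(f"{r:<13}{c:<10}{e:6.2f}")
```

The expected counts in each row and column add up to the same totals as the observed counts, so they always sum to n overall.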

Observed and estimated counts

Are the differences more than would be expected by chance?

              Injured      Uninjured    Total
Beginner      20 (11.59)   60 (68.41)      80
Intermediate   9 (13.47)   84 (79.53)      93
Advanced       2 (5.94)    39 (35.06)      41
Total         31           183            214

Chi-squared test of independence

H0: independence of injury & experience

HA: association between injury & experience

or equivalently H0: P(injury|beginner) = P(injury|intermediate) = ...

HA: P(injury | experience) depends on experience

Test statistic:

$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$

Small values are consistent with independence.

Big values arise when the observed counts are very different from what would be expected under independence.

p-value = Prob($\chi^2$ as big as obtained) if indep

Tail area of chi-squared distribution

d.f. of chi-squared = (rows − 1)(cols − 1)
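The test statistic and its degrees of freedom can be computed from the observed counts alone, since the expected counts depend only on the margins. The sketch below uses the skiing table; the function name `chi_squared` is an illustrative choice.

```python
# Observed counts: rows are ability levels, columns are (injured, uninjured).
observed = [
    [20, 60],  # Beginner
    [9, 84],   # Intermediate
    [2, 39],   # Advanced
]

def chi_squared(table):
    """Return (chi-squared statistic, degrees of freedom) for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n  # expected under independence
            stat += (obs - exp) ** 2 / exp           # this cell's contribution
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

stat, df = chi_squared(observed)
print(f"chi-squared = {stat:.3f}, df = {df}")
```

The cells where observed and expected counts differ most contribute the largest terms to the sum.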

Chi-squared test of independence

$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$

Chi-squared distributions

• Skewed to the right distributions.
• Minimum value is 0.
• Indexed by the degrees of freedom.

Skiing injury and experience

p-value = 0.003

Chi-Square Test: Injured, Uninjured

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

       Injured  Uninjured  Total
    1       20         60     80
         11.59      68.41
         6.105      1.034

    2        9         84     93
         13.47      79.53
         1.484      0.251

    3        2         39     41
          5.94      35.06
         2.613      0.443

Total       31        183    214

Chi-Sq = 11.930, DF = 2, P-Value = 0.003

Strong evidence that the chance of injury is related to experience.
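The reported p-value can be checked by hand because the tail area of a chi-squared distribution has a simple closed form when the degrees of freedom are even. The sketch below relies on that special case only; it is not a general chi-squared routine, and the function name `chi2_tail_even_df` is hypothetical.

```python
import math

def chi2_tail_even_df(x, df):
    """Upper tail area P(chi-squared >= x), valid only for positive even df.

    For df = 2m the tail area equals exp(-x/2) * sum_{j<m} (x/2)^j / j!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch only handles positive even df")
    half = x / 2
    terms = sum(half ** j / math.factorial(j) for j in range(df // 2))
    return math.exp(-half) * terms

# Skiing data: Chi-Sq = 11.930 with DF = 2.
p_value = chi2_tail_even_df(11.930, 2)
print(f"p-value = {p_value:.3f}")
```

For odd degrees of freedom a statistical library or table would be needed instead.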

Ear Infections and Xylitol

Experiment: n = 533 children randomized to 3 groups

Group 1: Placebo Gum; Group 2: Xylitol Gum; Group 3: Xylitol Lozenge

Response = Did child have an ear infection?

Ear Infections and Xylitol

Moderately strong evidence of differences between probs of infection

Making friends

With whom do you find it easiest to make friends: opposite sex, same sex or no difference?

Making friends

Fairly strong evidence of a difference between Females (1) & Males (2)

Females more likely to choose opposite sex

H0: No difference in distribution of responses of men and women (no relationship between gender & response)

HA: Difference in distribution of responses of men and women (association between gender & response)

Chi-Square Test: Opposite sex, Same sex, No difference

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

       Opposite sex  Same sex  No difference  Total
    1            58        16             63    137
              48.79     19.38          68.83
              1.740     0.590          0.494

    2            15        13             40     68
              24.21      9.62          34.17
              3.507     1.188          0.996

Total            73        29            103    205

Chi-Sq = 8.515, DF = 2, P-Value = 0.014

Comparing means of 3+ groups

Do the best students sit at the front of a classroom?

Seat location and GPA for n = 384 students

Students sitting in the front generally have slightly higher GPAs than others.

Chance?

Seat location and GPA

p-value = 0.0001.

Such big differences between sample means unlikely if popn means were same

Extremely strong evidence that means are not all same.

H0: $\mu_1 = \mu_2 = \mu_3$

HA: The means are not all equal.

95% CIs for separate means:

Seat location and GPA

Main difference seems to be between front and others

Assumptions for F-test

• Independent random samples.
• Normal distribution within each population.
• Perhaps different population means.
• Same standard deviation in each group.

Can still proceed if n is big or assumptions approx hold

F ratio

More evidence of a real difference when:

• Group means are far apart
• Variability within groups is small

How do you measure:

• Variation between means?
• Variation within groups?

$$F = \frac{\text{Variation between sample means}}{\text{Natural variation within groups}}$$

Variation between means

Between-groups sum of squares

$$SS_{\text{Groups}} = \sum_{\text{groups}} n_i (\bar{x}_i - \bar{x})^2$$

Mean sum of squares for groups (k groups):

$$MS_{\text{Groups}} = \frac{SS_{\text{Groups}}}{k - 1}$$

Variation within groups

Within-groups sum of squares (also called residual sum of squares):

$$SS_{\text{Error}} = \sum_{\text{groups}} (n_i - 1) s_i^2$$

Mean residual sum of squares:

$$MS_{\text{Error}} = \frac{SS_{\text{Error}}}{N - k}$$

Best estimate of error st devn, $\sigma$:

$$s_p = \sqrt{MS_{\text{Error}}}$$

Total variation

Total sum of squares = SSTotal

$$SS_{\text{Total}} = \sum_{\text{values}} (x_{ij} - \bar{x})^2$$

$$SS_{\text{Total}} = SS_{\text{Groups}} + SS_{\text{Error}}$$
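The sums of squares, mean squares and F ratio can be computed in a few lines, which also makes it easy to confirm that the total sum of squares splits into the between-groups and within-groups parts. The sketch below uses small made-up GPA values, since the lecture's raw data are not given here; the data and the function name `anova` are assumptions for illustration.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical GPAs by seat location (illustrative only, not the lecture's data).
groups = {
    "Front": [3.4, 3.6, 3.1, 3.5],
    "Middle": [3.0, 3.2, 2.9, 3.3, 3.1],
    "Back": [2.8, 3.1, 2.7, 3.0],
}

def anova(samples):
    """Return the one-way anova quantities for a dict of group -> values."""
    values = [x for xs in samples.values() for x in xs]
    grand_mean = mean(values)
    n_total, k = len(values), len(samples)
    # Between-groups: group size times squared distance of group mean from overall mean.
    ss_groups = sum(len(xs) * (mean(xs) - grand_mean) ** 2 for xs in samples.values())
    # Within-groups: (n_i - 1) times the sample variance of each group.
    ss_error = sum((len(xs) - 1) * variance(xs) for xs in samples.values())
    ss_total = sum((x - grand_mean) ** 2 for x in values)
    ms_groups = ss_groups / (k - 1)
    ms_error = ss_error / (n_total - k)
    return {
        "SSGroups": ss_groups, "SSError": ss_error, "SSTotal": ss_total,
        "MSGroups": ms_groups, "MSError": ms_error,
        "F": ms_groups / ms_error, "s_p": sqrt(ms_error),
    }

result = anova(groups)
for name, value in result.items():
    print(f"{name:<9}{value:8.4f}")
```

Turning the F ratio into a p-value requires the tail area of an F distribution with k − 1 and N − k degrees of freedom, which is normally taken from statistical software.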

Analysis of variance table

Anova table

F test is based on F ratio

p-value = Prob of such a high F ratio if all means same

(p-value found from an ‘F distribution’)

Seat location and GPA (again)

p-value = P(F ≥ 6.69 under H0) = 0.0001.

Such a big F ratio unlikely if popn means were same

Extremely strong evidence that means are not all same.

H0: $\mu_1 = \mu_2 = \mu_3$

HA: The means are not all equal.
