Week 12
ANOVA and contingency tables
Two categorical variables
Joint probabilities: $p_{x,y} = P(X = x \text{ and } Y = y)$, the proportion of the population with values (x, y)
School performance and wt of children
Conditional probabilities
Proportion within row
$p_{\text{blonde} \mid \text{blue eyes}} = \frac{n_{\text{blonde \& blue eyes}}}{n_{\text{blue eyes}}} = \frac{p_{\text{blonde \& blue eyes}}}{p_{\text{blue eyes}}}$
Conditional probabilities
Weight & performance are independent
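For reference, independence of two categorical variables can be written in terms of joint and marginal probabilities. The marginal symbols $p_x$ and $p_y$ are our own notation, assumed here by analogy with the joint probabilities $p_{x,y}$ defined earlier:

```latex
p_{x,y} = p_x \, p_y \quad \text{for all } (x, y)
\qquad\Longleftrightarrow\qquad
P(Y = y \mid X = x) = \frac{p_{x,y}}{p_x} = p_y \quad \text{for all } x \text{ with } p_x > 0
```

In words: under independence, the conditional proportions within every row are the same as the overall column proportions.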
School performance and wt of children
Independence from sample?
214 child skiers classified by skiing ability and whether they got injured
              Injured  Uninjured  Total
Beginner           20         60     80
Intermediate        9         84     93
Advanced            2         39     41
Total              31        183    214
Are ability and injury independent in underlying population?
Independence from sample?
Is there independence in underlying population?
Conditional sample proportions
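As a rough illustration of conditional sample proportions, the sketch below divides each injured count in the skiing table by its row total. The variable names are ours and are not taken from the slides.

```python
# Observed counts from the skiing table: (injured, uninjured) per ability level.
counts = {
    "Beginner": (20, 60),
    "Intermediate": (9, 84),
    "Advanced": (2, 39),
}

# Conditional sample proportion of injury within each row (ability level).
injury_rate = {
    ability: injured / (injured + uninjured)
    for ability, (injured, uninjured) in counts.items()
}

# Overall (marginal) proportion injured, for comparison.
total_injured = sum(injured for injured, _ in counts.values())
total = sum(injured + uninjured for injured, uninjured in counts.values())
overall_rate = total_injured / total

for ability, rate in injury_rate.items():
    print(f"P(injured | {ability}) = {rate:.3f}")
print(f"P(injured) overall = {overall_rate:.3f}")
```

If ability and injury were independent in the population, the three row proportions would be expected to sit close to the overall proportion, apart from sampling variation.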
Testing for independence
Can a relationship observed in the sample data be inferred to hold in the population represented by the data?
Could observed sample relationship have occurred by chance?
Expected counts under independence
31 out of 214 injured overall
Expect 31/214 of the 80 beginners to be injured,
i.e. expect $\frac{31 \times 80}{214} = 11.59$ injured beginners
Expected counts under independence
General formula
$e_{xy} = \frac{n_x n_y}{n}$
              Injured                   Uninjured                  Total
Beginner      80 × 31 / 214 = 11.59     80 × 183 / 214 = 68.41        80
Intermediate  93 × 31 / 214 = 13.47     93 × 183 / 214 = 79.53        93
Advanced      41 × 31 / 214 = 5.94      41 × 183 / 214 = 35.06        41
Total         31                        183                          214
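The general formula can be sketched in a few lines of Python. This is a minimal illustration using the skiing counts; the list layout and variable names are our own choices.

```python
# Observed counts: rows = ability level, columns = (injured, uninjured).
observed = [
    [20, 60],   # Beginner
    [9, 84],    # Intermediate
    [2, 39],    # Advanced
]

row_totals = [sum(row) for row in observed]          # n_x
col_totals = [sum(col) for col in zip(*observed)]    # n_y
n = sum(row_totals)                                  # grand total

# Expected count under independence: e_xy = n_x * n_y / n
expected = [[nx * ny / n for ny in col_totals] for nx in row_totals]

for row in expected:
    print([round(e, 2) for e in row])
```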
Observed and estimated counts
Are the differences more than would be expected by chance?
Expected counts are shown in parentheses.
              Injured      Uninjured    Total
Beginner      20 (11.59)   60 (68.41)      80
Intermediate   9 (13.47)   84 (79.53)      93
Advanced       2 (5.94)    39 (35.06)      41
Total         31           183            214
Chi-squared test of independence
H0: independence of injury & experience
HA: association between injury & experience
or equivalently H0: P(injury|beginner) = P(injury|intermediate) = ...
HA: P(injury | experience) depends on experience
Test statistic:
$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$
Small values are consistent with independence.
Big values arise when the observed counts are very different from what would be expected under independence.
p-value = Prob($\chi^2$ as big as obtained) if independent: the tail area of the chi-squared distribution.
d.f. of chi-squared = (rows − 1)(cols − 1)
Chi-squared distributions
• Skewed to the right distributions.
• Minimum value is 0.
• Indexed by the degrees of freedom.
Skiing injury and experience
p-value = 0.003
Chi-Square Test: Injured, Uninjured
Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

       Injured  Uninjured  Total
    1       20         60     80
         11.59      68.41
         6.105      1.034

    2        9         84     93
         13.47      79.53
         1.484      0.251

    3        2         39     41
          5.94      35.06
         2.613      0.443

Total       31        183    214

Chi-Sq = 11.930, DF = 2, P-Value = 0.003
Strong evidence that the chance of injury is related to experience.
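The figures in the printed output above can be checked by hand. The sketch below is one way to do so, under one stated simplification: for 2 degrees of freedom the chi-squared tail area is exactly exp(-x/2), so no statistics library is needed. For other degrees of freedom a library routine (for example `scipy.stats.chi2.sf`) would be the usual choice.

```python
import math

observed = [[20, 60], [9, 84], [2, 39]]  # skiing table, rows = ability level

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Sum of (observed - expected)^2 / expected over all cells.
chi_sq = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / n
        chi_sq += (obs - exp) ** 2 / exp

df = (len(observed) - 1) * (len(observed[0]) - 1)  # (rows - 1)(cols - 1)

# Assumption: the closed form below is only valid for d.f. = 2.
if df != 2:
    raise ValueError("closed-form tail area used here requires d.f. = 2")
p_value = math.exp(-chi_sq / 2)

print(f"Chi-Sq = {chi_sq:.3f}, DF = {df}, P-Value = {p_value:.3f}")
```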
Ear Infections and Xylitol
Experiment: n = 533 children randomized to 3 groups
Group 1: Placebo Gum; Group 2: Xylitol Gum; Group 3: Xylitol Lozenge
Response = Did child have an ear infection?
Ear Infections and Xylitol
Moderately strong evidence of differences between probs of infection
Making friends
With whom do you find it easiest to make friends: opposite sex, same sex, or no difference?
Making friends
Fairly strong evidence of difference between Females(1) & Males(2)
Females more likely to choose opposite sex
H0: No difference in distribution of responses of men and women (no relationship between gender & response)
HA: Difference in distribution of responses of men and women (association between gender & response)
Chi-Square Test: Opposite sex, Same sex, No difference
Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

       Opposite sex  Same sex  No difference  Total
    1            58        16             63    137
              48.79     19.38          68.83
              1.740     0.590          0.494

    2            15        13             40     68
              24.21      9.62          34.17
              3.507     1.188          0.996

Total            73        29            103    205

Chi-Sq = 8.515, DF = 2, P-Value = 0.014
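To see which cells drive the result, the per-cell chi-square contributions can be recomputed. The helper `chi_square_contributions` below is a hypothetical name introduced for this sketch, and the p-value again relies on the exp(-x/2) closed form, which holds only for 2 degrees of freedom.

```python
import math

def chi_square_contributions(observed):
    """Return (expected, contributions) for a two-way table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    contributions = [
        [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(observed, expected)
    ]
    return expected, contributions

# Rows: females (1), males (2). Columns: opposite sex, same sex, no difference.
friends = [[58, 16, 63], [15, 13, 40]]

expected, contributions = chi_square_contributions(friends)
chi_sq = sum(sum(row) for row in contributions)
df = (len(friends) - 1) * (len(friends[0]) - 1)

# Assumption: closed-form tail area, valid only when d.f. = 2.
p_value = math.exp(-chi_sq / 2) if df == 2 else None

for row in contributions:
    print([round(c, 3) for c in row])
print(f"Chi-Sq = {chi_sq:.3f}, DF = {df}, P-Value = {p_value:.3f}")
```

The largest contributions come from the "opposite sex" column, which is where the observed counts for females and males differ most from the expected counts.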
Comparing means of 3+ groups
Do best students sit in the front of a classroom? Seat location and GPA for n = 384 students
Students sitting in the front generally have slightly higher GPAs than others.
Chance?
Seat location and GPA
p-value = 0.001.
Such big differences between sample means unlikely if popn means were same
Extremely strong evidence that means are not all same.
H0: $\mu_1 = \mu_2 = \mu_3$
HA: The means are not all equal.
95% CIs for separate means:
Seat location and GPA
Main difference seems to be between front and others
Assumptions for F-test
• Independent random samples.
• Normal distribution within each population.
• Perhaps different population means.
• Same standard deviation, $\sigma$, in each group.
Can still proceed if n is big or assumptions approx hold
F ratio
More evidence of a real difference when:
• Group means are far apart
• Variability within groups is small
How do you measure:
• Variation between means?
• Variation within groups?
$F = \frac{\text{Variation between sample means}}{\text{Natural variation within groups}}$
Variation between means
Between-groups sum of squares
$SS_{Groups} = \sum_{\text{groups}} n_i (\bar{x}_i - \bar{x})^2$
Mean sum of squares for groups (k groups):
$MS_{Groups} = \frac{SS_{Groups}}{k - 1}$
Variation within groups
Within-groups sum of squares (also called residual sum of squares):
$SS_{Error} = \sum_{\text{groups}} (n_i - 1) s_i^2$
Mean residual sum of squares:
$MS_{Error} = \frac{SS_{Error}}{N - k}$
Best estimate of error standard deviation, $\sigma$:
$s_p = \sqrt{MS_{Error}}$
Total variation
Total sum of squares = SSTotal
$SS_{Total} = \sum_{\text{values}} (x_{ij} - \bar{x})^2$
SSTotal = SSGroups + SSError
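The three sums of squares and the resulting F ratio can be sketched as follows. The data here are hypothetical values made up for illustration; they are not the seat-location GPA data, which the slides do not list.

```python
from statistics import mean, variance

# Hypothetical GPA-like values for three groups (NOT the seat-location data).
groups = {
    "front": [3.4, 3.1, 3.6, 3.3],
    "middle": [3.0, 3.2, 2.9, 3.1, 3.3],
    "back": [2.8, 3.0, 2.7],
}

values = [x for g in groups.values() for x in g]
grand_mean = mean(values)
N = len(values)   # total number of observations
k = len(groups)   # number of groups

# Between groups: sum over groups of n_i * (group mean - grand mean)^2
ss_groups = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
# Within groups: sum over groups of (n_i - 1) * s_i^2 (s_i^2 = sample variance)
ss_error = sum((len(g) - 1) * variance(g) for g in groups.values())
# Total: sum over all values of (x_ij - grand mean)^2
ss_total = sum((x - grand_mean) ** 2 for x in values)

ms_groups = ss_groups / (k - 1)
ms_error = ss_error / (N - k)
f_ratio = ms_groups / ms_error
s_p = ms_error ** 0.5   # pooled estimate of the error standard deviation

print(f"SSGroups = {ss_groups:.4f}, SSError = {ss_error:.4f}, SSTotal = {ss_total:.4f}")
print(f"MSGroups = {ms_groups:.4f}, MSError = {ms_error:.4f}")
print(f"F = {f_ratio:.2f}, s_p = {s_p:.3f}")
```

Up to floating-point rounding, the first two printed sums add up to the third, which is the decomposition SSTotal = SSGroups + SSError.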
Analysis of variance table
ANOVA table
F test is based on F ratio
p-value = Prob of such a high F ratio if all means same
(p-value found from an 'F distribution')
Seat location and GPA (again)
p-value = P(F ≥ 6.69) under H0 = 0.001.
Such a big F ratio unlikely if popn means were same
Extremely strong evidence that means are not all same.
H0: $\mu_1 = \mu_2 = \mu_3$
HA: The means are not all equal.
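The tail area P(F ≥ 6.69) can be sketched numerically from the values given in the slides (F ratio 6.69, three seat locations, n = 384 students). The degrees of freedom k − 1 and N − k follow from the mean-square formulas. As a simplification, the code relies on a closed form for the F tail area that applies only when the numerator d.f. is 2; in general a library routine such as `scipy.stats.f.sf` would be used instead.

```python
# Values taken from the slides: F ratio, number of groups, number of students.
f_ratio = 6.69
k = 3
N = 384

df_groups = k - 1   # numerator d.f.
df_error = N - k    # denominator d.f.

# Tail area P(F >= f_ratio) under H0.
# Assumption: with numerator d.f. = 2 the tail area has the closed form
# (1 + 2F/d2) ** (-d2/2); this shortcut does not apply to other d.f.
if df_groups != 2:
    raise ValueError("closed form below only covers numerator d.f. = 2")
p_value = (1 + df_groups * f_ratio / df_error) ** (-df_error / 2)

print(f"d.f. = ({df_groups}, {df_error}), P(F >= {f_ratio}) = {p_value:.4f}")
```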