ka-fu wong © 2004 econ1003: analysis of economic data lesson10-1 lesson 10: analysis of variance
Post on 22-Dec-2015
222 views
TRANSCRIPT
Lesson10-1 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
Lesson 10:
Analysis of VarianceAnalysis of Variance
Lesson10-2 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
Outline
Characteristics of F-Distribution
Test for Equal Variances
Analysis of Variance (ANOVA)
Two-Factor ANOVA
Sampling distribution of the sample means
Probability histograms and empirical histograms
Central Limit Theorem
Normal approximation to Binomial
Lesson10-3 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
Two Sample Tests
TEST FOR EQUAL VARIANCESTEST FOR EQUAL VARIANCES TEST FOR EQUAL MEANSTEST FOR EQUAL MEANS
HHo
HH1
Population 1
Population 2
Population 1
Population 2
HHo
HH1
Population 1
Population 2
Population 1Population 2
Lesson10-4 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
Characteristics of F-Distribution
There is a “family” of F Distributions.
Each member of the family is determined by two parameters: the numerator degrees of freedom and the denominator degrees of freedom.
F cannot be negative, and it is a continuous distribution.
The F distribution is positively skewed. Its values range from 0 to . As F the curve
approaches the X-axis.
Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
The F-Distribution, F(m,n)
0 1.0
Not symmetric (skewed to the right)
F
Nonnegative values only
Each member of the family is determined by two parameters: the numerator degrees of freedom (m) and the denominator degrees of freedom (n).
Lesson10-6 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
Test for Equal Variances
For the two tail test, the test statistic is given by:
where s12 and s2
2 are the sample variances for the two samples.
The null hypothesis is rejected at level of significance if the computed value of the test statistic is greater than the critical value with a confidence level /2 and numerator and denominator dfs.
),(
),( arg=
22
21
22
21SSofSmaller
SSoferLF ),(
),( arg=
22
21
22
21SSofSmaller
SSoferLF
Lesson10-7 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
Test for Equal Variances
For the one tail test, the test statistic is given by:
where s12 and s2
2 are the sample variances for the two samples.
The null hypothesis is rejected at level of significance if the computed value of the test statistic is greater than the critical value with a confidence level and numerator and denominator dfs.
22
2112
2
21 > :H if = σσS
SF 2
22112
2
21 > :H if = σσS
SF
Lesson10-8 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
EXAMPLE 1
Colin, a stockbroker at Critical Securities, reported that the mean rate of return on a sample of 10 internet stocks was 12.6 percent with a standard deviation of 3.9 percent. The mean rate of return on a sample of 8 utility stocks was 10.9 percent with a standard deviation of 3.5 percent. At the .05 significance level, can Colin conclude that there is more variation in the software stocks?
Lesson10-9 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
EXAMPLE 1 continued
Step 1: The hypotheses are:
Step 2: The significance level is .05.
Step 3: The test statistic is the F distribution.
Step 4: H0 is rejected if F>3.68. The degrees of freedom are 9 in the numerator and 7 in the denominator.
Step 5: The value of F is
221
220
:
:
UI
UI
H
H
2416.1)5.3(
)9.3(2
2
F
H0 is not rejected. There is insufficient evidence to show more variation in the internet stocks.
Lesson10-10 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
Analysis of Variance(ANOVA)
Lesson10-11 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
Underlying Assumptions for ANOVA
The F distribution is also used for testing whether two or more sample means came from the same or equal populations. if any group mean differs from the mean of all
groups combinedAnswers: “Are all groups equal or not?”
This technique is called analysis of variance or ANOVA.
ANOVA requires the following conditions: The sampled populations follow the normal
distribution. The populations have equal standard
deviations. The samples are randomly selected and are
independent.
Lesson10-12 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
The hypothesis
Suppose that we have independent samples of n1, n2, . . ., nK observations from K populations. If the population means are denoted by 1, 2, . . ., K, the one-way analysis of variance framework is designed to test the null hypothesis
ji1
210
, pair one least at For:
===:
μμμμH
μμμH
ji
K
≠
Lesson10-13 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
Sample Observations from Independent Random Samples of K Populations
Population1 2 . . . K
Mean1 2 . . . K
Variance2 2 . . . 2
Sample
observations from
the population
x11
x12
.
.
.x1n1
x21
x22
.
.
.x2n2
. . .. . .
. . .
xK1
xK2
.
.
.xKnK
Sample sizen1 n2 . . . nK
Same !!
unequal !!
Unequal number of observations in the K samples in general.nT=n1+…+nK
Lesson10-14 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
Sum of Squares Decomposition for one-way analysis of variance
Suppose that we have independent samples of n1, n2, . . ., nK observations from K populations.
Denote by the K group sample means and by the overall sample mean. We define the following sum of squaressum of squares:
Kxxx ,,, 21 x
∑∑ -in
jiij
K
i
xxSSE1
2
1
)( :Groups)-(Within Error Squares of Sum
∑ -K
iii xxnSST
1
2)( :Groups)-(Between Treatment Squares of Sum
∑∑ -in
jij
K
i
xxSSTotal1
2
1
)( :Total Squares of Sum
where xij denotes the jth sample observation in the ith group.
Lesson10-15 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
An Numerical Example of Sum of Squares Decomposition
Population 1 2 3
Mean 1 2 K
Variance 2 2 2
Sample obs from the
population(xij)
123
2345
135
Sample size (nj)
3 4 3
Sample mean
2 3.5 3
Grand mean
2.9
9.18
])9.25(...)9.21[(...])9.23(...)9.21[(
)(
2222
1
2
1
∑∑ -in
jij
K
i
xxSSTotal
9.3
)9.23(3)9.25.3(4)9.22(3)( 222
1
2
∑ -K
iii xxnSST
15
])35(...)31[(...])23(...)21[(
)(
2222
1
2
1
∑∑ -in
jiij
K
i
xxSSE
SSTotal = SST + SSESSTotal = SST + SSE
Lesson10-16 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
A proof of SSTotal = SST + SSE
Population
1 2 . . . K
Sample obs
x11
x12
.
.
.x1n1
x21
x22
.
.
.x2n2
. . .. . .
. . .
xK1
xK2
.
.
.xKnK
Sample size
n1 n2 . . . nK
SSTSSE
xxnxx
xxxxxxnxx
xxxxxxxx
xxxxxxxx
xxxx
xxxx
xx
SSTotal
K
iii
in
jiij
K
i
in
jiij
K
ii
K
iii
in
jiij
K
i
in
jiiij
K
i
in
ji
K
i
in
jiij
K
i
in
jiiijiiij
K
i
in
jiiij
K
i
in
jiiij
K
i
in
jij
K
i
∑∑∑
∑∑∑∑∑
∑∑∑∑∑∑
∑∑
∑∑
∑∑
∑ -∑
1
2
1
2
1
111
2
1
2
1
111
2
11
2
1
1
22
1
1
2
1
1
2
1
1
2
1
)()(
)()(2)()(
))((2)()(
)])((2)()[(
)]()[(
)(
)(
ii
i
i
ii
i xnxnxxxx ii
n
j
n
jij
n
jij
∑∑∑111
)(
Lesson10-17 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
Two Ways to estimate the population variance
Note that the variance is assumed to be identical across populations
If the population means are identical, we have two ways to estimate the population variance Based on the K sample variances. Based on the deviation of the K sample means
from the grand mean.
Lesson10-18 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
An estimate the population variance based on sample variances
Anyone of the K sample variances can be used to estimate the population.
)1/()(ˆ1
222
i
n
jiiji nxxs
i
∑ -
KnSSE
Knsn
nxx
K
ii
K
ii
K
iii
K
ii
n
jiij
K
i
i
)/(
)/()1(
)1(/)(ˆ
1
11
2
11
2
1
2
∑
∑∑
∑∑∑ -
We can get a more precise estimate if we use all the information from the K samples.
Lesson10-19 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
An estimate the population variance based on deviation of the K sample means from the grand sample mean.
If the sample sizes are the same for all samples, the Central Limit Theorem suggests that sample mean will be distributed normally with the population mean and the population variance divided by sample size.
)1/()(ˆ1
22
KxxnK
i
i∑
)1/(
)1/()(ˆ1
22
KSST
KxxnK
ii i∑
When sample sizes are different across samples, we will have to weight
???
Lesson10-20 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
Comparing the Variance Estimates: The F Test
If the null hypothesis is true and the ANOVA assumptions are valid, the sampling distribution of ratio of the two variance estimates follows F distribution with K - 1 and nT - K.
If the means of the K populations are not equal, the value of F-stat will be inflated because SST/(K-1) will overestimate2.
Hence, we will reject H0 if the resulting value of F-stat appears to be too large to have been selected at random from the appropriate F distribution.
)/(
)1/(
)/(
)1/(
1
KnSSE
KSST
KnSSE
KSSTstatF
TK
ii
∑
Lesson10-21 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
Test for the Equality of k Population Means
Hypotheses H0: 1=2=3=….=k
H1: Not all population means are equal
Test StatisticF = [SST/(K-1)] / [SSE/(nT-K)]
Rejection Rule Reject H0 if F > F
where the value of F is based on an F distribution with K - 1 numerator degrees of freedom and nT - K denominator degrees of freedom.
Lesson10-22 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
Sampling Distribution of MST/MSE
Do Not Reject Do Not Reject HH00Do Not Reject Do Not Reject HH00
Reject Reject HH00Reject Reject HH00
MST/MSE
Critical ValueFF
The figure below shows the rejection region associated with a level of significance equal to where F denotes the critical value.
Lesson10-23 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
The ANOVA Table
Source of Variation
Sum of Squares
Degree of Freedom
Mean Squares F
Treatment SST K-1 MST MST/MSE
Error SSE nT-K MSE
Total SSTotal nT-1
Lesson10-24 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
Does learning method affect student’s exam scores?
Consider 3 methods: standard osmosis shock therapy
Convince 15 students to take part. Assign 5 students randomly to each method.
Wait eight weeks. Then, test students to get exam scores.
Are the three learning methods equally effective? i.e., are their population means of exam scores
same?
Lesson10-25 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
“ Analysis of Variance” (Study #1)
The variation between the group means and the grand mean is larger than the variation within each of the groups.
Lesson10-26 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
ANOVA Table for Study #1
One-way Analysis of Variance
Source DF SS MS F PFactor 2 2510.5 1255.3 93.44 0.000Error 12 161.2 13.4Total 14 2671.7
“ Source” means “find the components of variation in this column”
“ DF” means “degrees of freedom”
“ SS” means “sums of squares”
“ F” means “F test statistic”
“ MS” means “mean squared”
P-Value
Lesson10-27 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
ANOVA Table for Study #1
One-way Analysis of Variance
Source DF SS MS F PFactor 2 2510.5 1255.3 93.44 0.000Error 12 161.2 13.4Total 14 2671.7
“ Factor” means “Variability between groups” or “Variability due to the factor of interest” “ Error” means “Variability within groups” or “unexplained random variation”
“ Total” means “Total variation from the grand mean”
Lesson10-28 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
ANOVA Table for Study #1
One-way Analysis of Variance
Source DF SS MS F PFactor 2 2510.5 1255.3 93.44 0.000Error 12 161.2 13.4Total 14 2671.7
14 = 2 + 12
2671.7 = 2510.5 + 161.2
1255.2 = 2510.5/2 13.4 = 161.2/12
93.44 = 1255.3/13.4
Lesson10-29 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
“ Analysis of Variance” (Study #2)
The variation between the group means and the grand mean is smaller than the variation within each of the groups.
Lesson10-30 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
ANOVA Table for Study #2
One-way Analysis of Variance
Source DF SS MS F PFactor 2 80.1 40.1 0.46 0.643Error 12 1050.8 87.6Total 14 1130.9
The P-value is pretty large so cannot reject the null hypothesis. There is insufficient evidence to conclude that the average exam scores differ for the three learning methods.
Lesson10-31 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
Do Holocaust survivors have more sleep problems than others?
Lesson10-32 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
ANOVA Table for Sleep Study
One-way Analysis of Variance
Source DF SS MS F PFactor 2 1723.8 861.9 61.69 0.000Error 117 1634.8 14.0Total 119 3358.6
The P-value is so small that we reject the null hypothesis of equal population means and favor the alternative hypothesis that at least one pair of population means are different.
Lesson10-33 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
Potential problem with the analysis
What is driving the rejection of null of equal population means? From the plot, the Healthy and Depress seem to have
different mean sleep quality. It looks like that the rejection is due to the difference between these two groups.
If we pooled Healthy and Depress, the distribution will look more like Survivor. That is, an acceptance of the null is more likely.
This example illustrate that we have to be careful about our analysis and interpretation of the result when we conduct a test of equal population means.
Lesson10-34 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
EXAMPLE 2
Rosenbaum Restaurants specialize in meals for senior citizens. Katy Polsby, President, recently developed a new meat loaf dinner. Before making it a part of the regular menu she decides to test it in several of her restaurants. She would like to know if there is a difference in the mean number of dinners sold per day at the Anyor, Loris, and Lander restaurants. Use the .05 significance level.
Lesson10-35 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
Example 2 continued
# of dinners sold per day
Obs Aynor Loris Lander
1 13 10 18
2 12 12 16
3 14 13 17
4 12 11 17
5 17
Lesson10-36 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
EXAMPLE 2 continued
Step 1: H0: 1 = 2 = 3
H1: Treatment means are not the same
Step 2: H0 is rejected if F>4.10. There are 2 df in the numerator and 10 df in the denominator.
Lesson10-37 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
Example 2 continued
To find the value of F:
Source SS df MS F p-value
Treatment 76.25 2 38.125 39.10 1.87E-05
Error 9.75 10 0.975
Total 86.00 12
The decision is to reject the null hypothesis. The treatment means are not the same. The mean number of meals sold at the three
locations is not the same.
Lesson10-38 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
Inferences About Treatment Means
When we reject the null hypothesis that the means are equal, we may want to know which treatment means differ.
One of the simplest procedures is through the use of confidence intervals.
Lesson10-39 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
Confidence Interval for the Difference Between Two Means
where t is obtained from the t table with degrees of freedom (nT - k).
(nT - k) degree of freedom because MSE = [SSE/(nT - k)]
21
2111
)(nn
MSEtXX
Lesson10-40 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
EXAMPLE 3
From EXAMPLE 2 develop a 95% confidence interval for the difference in the mean number of meat loaf dinners sold in Lander and Aynor. Can Katy conclude that there is a difference between the two restaurants?
)73.5,77.2(48.125.4
5
1
4
1975.228.2)75.1217(
Because zero is not in the interval, we conclude that this pair of means differs.
The mean number of meals sold in Aynor is different from Lander.
Lesson10-41 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
Two-Factor ANOVA
For the two-factor ANOVA we test whether there is a significant difference between the treatment effect and whether there is a difference in the blocking effect.
Lesson10-42 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
Sample Observations from Independent Random Samples of K Populations
TREATMENT
1 2 . . . K
BLOCK
1 x11 x21 . . . xK1
2 x12 x22 . . . xK2
.
.
.
.
.
.
.
.
.
. . . . . . . . .
.
.
.
B x1B x2B . . . xKB
Lesson10-43 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
Sum of Squares Decomposition for Two-Way Analysis of Variance
Suppose that we have a sample of observations with xij denoting the observation in the ith group and jth block. Suppose that there are K groups and B blocks, for a total of n = KH observations. Denote the group sample means by ,
the block sample means by and the overall sample mean by x.
B
1j
2j )xx(KSSB
),,2,1( Kixi
K
1i
2i )xx(BSST
B
1j
2ij
K
1i
)x(xSSTotal
),,2,1( Bjx j
B
1j
2jiij
K
1i
)xxx(xSSE
SSTotal = SSE+SST+SSBSSTotal = SSE+SST+SSB
Lesson10-44 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
General Format of Two-Way Analysis of Variance Table
Source of Variation
Sums of Squares
Degrees of
Freedom
Mean Squares F Ratios
Treatments SST K-1 MST=SST/K-1) MST/MSE
Blocks SSB B-1 MSB=SSB/(B-1) MSB/MSE
Error SSE (K-1)(B-1) MSE=SSE/[(K-1)(B-1)]
Total SSTotal nT-1
Lesson10-45 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
EXAMPLE 4
The Bieber Manufacturing Co. operates 24 hours a day, five days a week. The workers rotate shifts each week. Todd Bieber, the owner, is interested in whether there is a difference in the number of units produced when the employees work on various shifts. A sample of five workers is selected and their output recorded on each shift.
At the .05 significance level, can we conclude there is a difference in the mean production by shift and in the mean production by employee?
Lesson10-46 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
EXAMPLE 4 continued
Employee Day Output
Evening Output
Night Output
McCartney 31 25 35
Neary 33 26 33
Schoen 28 24 30
Thompson 30 29 28
Wagner 28 26 27
Lesson10-47 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
EXAMPLE 4 continued
TREATMENT EFFECT Step 1: H0: µ1= µ2= µ3 versus H1: Not all
means are equal.
Step 2: H0 is rejected if F>4.46, the degrees of freedom are 2 and 8.
Lesson10-48 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
Example 4 continued
Step 3: Compute the various sum of squares:
Source SS df MS F p-value
Treatments 62.53 2 31.267 5.75 .0283
Blocks 33.73 4 8.433 1.55 .2762
Error 43.47 8 5.433
Total 139.73 14
Step 4: H0 is rejected. There is a difference in the mean number of units produced for the different time periods.
Lesson10-49 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
EXAMPLE 4 continued
Block Effect: Step 1: H0: µ1= µ2= µ3 = µ4 = µ5 versus H1: Not all
means are equal.
Step 2: H0 is rejected if F>3.84, the degrees of freedom are 4 and 8.
Step 3: F=[33.73/4]/[43.47/8]=1.55
Step 4: H0 is not rejected since there is no significant difference in the average number of units produced for the different employees.
Lesson10-50 Ka-fu Wong © 2004 ECON1003: Analysis of Economic Data
- END -
Lesson 10:Lesson 10: Analysis of VarianceAnalysis of Variance