ams 394zhu/ams394/lab5.docx · web viewams 394lab # 5 prof. wei zhu aim: today we will learn how to...
TRANSCRIPT
AMS 394 Lab # 5 Prof. Wei Zhu
Aim: Today we will learn how to do inference on two population means in SAS.
1. Review of Theory and Introduction to SAS for Comparing Two Population Means
Overview: Inference on two population means:
1. The samples are paired
P aired samples t-test (if the paired differences are normal)
OR
Wilcoxon signed rank test (if the paired differences are not normal)
2. The samples are independent
I ndependent samples t-test (if both populations are normal)
(Then we use the F-test to check if )
a) pooled-variance t-test
b) unpooled-variance t-test
OR
Wilcoxon Rank Sum Test (if at least one population is NOT normal)
1
1. Inference on two population means: Paired
samples:
e.g. Compare the IQ between adult male and female
Pair IQ Male IQ Female Diff 1 200 199 12 150 120 30… … … …n 109 112 -3
*** After taking the paired differences – make sure you use
one variable to subtract the other, consistently – the two
population/sample problem is then reduced to a one
population/one sample problem: on the population/sample of
the paired differences
SAS code:data twins;input IQmale IQfemale;diff= IQmale-IQfemale;datalines;200 199150 120...109 112;run;
2
proc univariate data=twins normal;var diff;run;
Example 1. To study the effectiveness of wall insulation in saving energy for home heating, the energy consumption (in MWh) for 5 houses in Bristol, England, was recorded for two winters; the first winter was before insulation and the second winter was after insulation:
House 1 2 3 4 5Before 12.1 10.6 13.4 13.8 15.5After 12.0 11.0 14.1 11.2 15.3
(a) Please provide a 95% confidence interval for the difference between the mean energy consumption before and after the wall insulation is installed. What assumptions are necessary for your inference?(b) Can you conclude that there is a difference in mean energy consumption before and after the wall insulation is installed at the significance level 0.05? Please test it and evaluate the p-value of your test. What assumptions are necessary for your inference?(c) Please write the SAS program to perform the test and examine the necessary assumptions given in (b).
SOLUTION: This is inference on two population means, paired samples.
(a). d=0 .36 , Sd = 1.30
CI: 0 .36±2 .776⋅1. 30
√5=(−1 .25 ,1 .97 )
(b).H0 : μd=0 ,
(1)
Since is smaller than , we cannot reject H0. (2)
Assumptions for (a) and (b): the paired differences follow a normal distribution.
(c) The SAS program is:Data energy;
3
Input before after @@;Diff=before - after;Datalines; 12.1 12.0 10.6 11.0 13.4 14.1 13.8 11.2 15.5 15.3;Run;Proc univariate data = energy normal;Var Diff;Run;
Note 1: As you can see from the output below, the variable “Diff” is barely normal. In this situation, one can either use the t-test or the signed-rank test(*although not the be st practice, but α = 0.05 can be used as the significance level to judge the normality; although α = 0.1 will be more secure.)
Note 2: Although several SAS procedures can generate the t-test, I would recommend you to master the proc univariate first because it ha s everything you will nee d : normality test, t-test and non- parametric test s .
4
SAS Output (selected) : The UNIVARIATE Procedure Variable: Diff
Tests for Location: Mu0=0
Test -Statistic- -----p Value------
Student's t t 0.616851 Pr > |t| 0.5707 Sign M 0.5 Pr >= |M| 1.0000 Signed Rank S 0.5 Pr >= |S| 1.0000
Tests for Normality
Test --Statistic--- -----p Value------
Shapiro-Wilk W 0.80253 Pr < W 0.0850 Kolmogorov-Smirnov D 0.348791 Pr > D 0.0442 Cramer-von Mises W-Sq 0.10017 Pr > W-Sq 0.0859 nderson-Darling A-Sq 0.551915 Pr > A-Sq 0.0779
5
2._The samples are independent:
a) If both populations are normal and
Pooled-variance t-test
b) If both populations are normal and
Unspooled-variance t-test
c) If at least one population is not normal
Wilcoxon rank sum test
2a.Inference on 2 population means, when both
populations are normal. We have 2
independent samples, the population variances
are unknown but equal ( ) pooled-
variance t-test.
Data:
Goal: Compare and 1) Point estimator:
μ̂1−μ2=X−Y N (μ1−μ2 ,σ1
2
n1+
σ22
n2)=N (μ1−μ2 ,( 1
n1+ 1
n2 )σ 2)2) Pivotal quantity:
6
not the PQ, since we don’t know
, , and they are
independent ( & are independent because these two samples are independent to each other)
Definition: , where , then
.
Definition: t-distribution:
, , and Z & W are independent, then
.
where is the pooled variance.This is the PQ of the inference on the parameter of interest
3) Confidence Interval for 7
This is the C.I for 4) Test:
Test statistic:
8
a) (The most common situation
is )
At the significance level α, we reject in favor
of iff
If , reject
b)
At significance level α, reject in favor of
iff
If , reject
9
c)
At α=0.05, reject in favor of iff
If , reject
10
2b. Inference on 2 population means, when
both populations are normal and the
population variances are not equal ( ).
We have two independent samples.
unpooled-variance t-test. 1) P.Q:
Formula for the df:
Walch Satterthwaite method:
where
or another way to find df (less accurate and more conservative)
2) 100(1-α)% C.I. for ( )
3) Test:
Test statistic:
a)
11
At α=0.05, reject in favor of iff
If , reject
b)
At α=0.05, reject in favor of iff
If , reject
c)
At α=0.05, reject in favor of iff
If , reject
4) F-test:
Data:
① Point estimator:
② P.Q:
12
Definition: F-Distribution
Let , , and , are independent. Then
Pivotal Quantity
③ Conference Interval for
13
④ F-test: H 0 :σ12=σ2
2 (or equivalently: H 0 :σ1
2
σ22 =1)
Test Statistic:
a)
At the significance level , we will reject in favor of
iff or
b) Reject iff
14
c) Reject iff
5) A trick for the F distribution
If , then Some F-table only gives the upper bound. If we
know , how to find ?
15
Example 2. An experiment was conducted to compare the mean number of tapeworms in the stomachs of sheep that had been treated for worms against the mean number in those that were untreated. A sample of 14 worm-infected lambs was randomly divided into 2 groups. Seven were injected with the drug and the remainders were left untreated. After a 6-month period, the lambs were slaughtered and the following worm counts were recorded:
Drug-treated sheep 18 43 28 50 16 32 13Untreated sheep 40 54 26 63 21 37 39
(a). Test at α = 0.05 whether the treatment is effective or not. (b) What assumptions do you need for the inference in part (a)?(c). Please write up the entire SAS program necessary to answer questions raised in (a) and (b).
SOLUTION: Inference on two population means. Two small and independent samples.
Drug-treated sheep: X1=28 . 57 , s12=198 .62 , n1=7
Untreated sheep: X 2=40.0 , s22=215 .33 , n2=7
(a) Under the normality assumption, we first test if the two
population variances are equal. That is, versus
. The test statistic is
F0=
s12
s22 =
198. 62215. 33
≈0 . 92, F6,6,0.05 ,U=4 .28 and
F6,6,0. 05 , L=1/4 . 28≈0 .23 .Since F0 is between 0.23 and 4.28, we cannot reject H0 . Therefore
it is reasonable to assume that .Next we perform the pooled-variance t-test with hypotheses H0 : μ1−μ2=0 versus Ha : μ1−μ2<0
t 0=X1−X2−0
s p√ 1n+ 1
n2
=(28. 57−40 .0 )−0
14 . 39√ 17+1
7
≈−1 . 49
Since t0≈−1 .49 is greater than −t12 ,0 . 05=−1 .782 , we cannot reject H0. We have insufficient evidence to reject the hypothesis that there is no difference in the mean number of worms in treated and untreated lambs.
(b) (1) Both populations are normally distributed
16
(2) (c) /*Problem #2B*/data sheep;input group worms;datalines;1 181 431 281 501 161 321 132 402 542 262 632 212 372 39;run;
proc univariate data=sheep normal;class group;var worms;title 'Check for normality';run;
proc ttest data=sheep;class group;var worms;title 'Independent samples t-test';run;
proc npar1way data=sheep wilcoxon;class group;var worms;title 'Nonparametric test for two-mean comparisons';run;
17
Check for normality
The UNIVARIATE Procedure
Variable: worms
group = 1
Tests for Normality
Test Statistic p Value
Shapiro-Wilk W 0.925604 Pr < W 0.5142
Kolmogorov-Smirnov D 0.201976 Pr > D >0.1500
Cramer-von Mises W-Sq 0.039988 Pr > W-Sq >0.2500
Anderson-Darling A-Sq 0.271024 Pr > A-Sq >0.2500
Variable: worms
group = 2
Tests for Normality
Test Statistic p Value
Shapiro-Wilk W 0.952397 Pr < W 0.7515
Kolmogorov-Smirnov D 0.214286 Pr > D >0.1500
Cramer-von Mises W-Sq 0.041652 Pr > W-Sq >0.2500
Anderson-Darling A-Sq 0.240103 Pr > A-Sq >0.2500
Independent samples t-test
The TTEST Procedure
Variable: worms
Method Variances DF t Value Pr > |t|
Pooled Equal 12 -1.49 0.1630
Satterthwaite Unequal 11.98 -1.49 0.1631
18
Equality of Variances
Method Num DF Den DF F Value Pr > F
Folded F 6 6 1.08 0.9244
Nonparametric test for two-mean comparisons
The NPAR1WAY Procedure
Wilcoxon Two-Sample Test (Wilcoxon Rank Sum Test)
Statistic 42.0000
Normal Approximation
Z -1.2778
One-Sided Pr < Z 0.1007
Two-Sided Pr > |Z| 0.2013
t Approximation
One-Sided Pr < Z 0.1118
Two-Sided Pr > |Z| 0.2237
Z includes a continuity correctionof 0.5.
Kruskal-Wallis Test
Chi-Square 1.8000
DF 1
Pr > Chi-Square 0.1797
19
Example 3:
A new drug for reducing blood pressure (BP) is
compared to an old drug. 20 patients with comparable
high BP were recruited and randomized evenly to the 2
drugs. Reductions in BP after 1 month of taking the
drugs are as follows:
New drug: 0, 10, -3, 15, 2, 27, 19, 21, 18, 10
Old drug: 8, -4, 7, 5, 10, 11, 9, 12, 7, 8
① Assume both populations are normal, please test at
α = 0.05 whether the new drug is better than the old
one.
② Run the SAS program to do the above tests
(including the normality test)
Solution:
① =10, =11.9, =97.43, =10, =7.3,
=20.01
F test:
Test statistic:
At α=0.05, reject and use unpooled-
variance t-test
T-test:
20
Test statistic: =
At α =0.05, fail to reject and we cannot say
that the new drug is better than the old one.
②Data BP;input drug BPR;datalines;1 01 101 -31 151 21 271 191 211 181 102 82 -42 72 52 102 112 92 122 72 8;run;
PROC UNIVARIATE DATA=BP NORMAL;CLASS DRUG;VAR BPR;RUN;
PROC TTEST DATA=BP;CLASS DRUG;VAR BPR;RUN;
Selected result: Tests for Normality
Test (class=1) --Statistic--- -----p Value------
Shapiro-Wilk W 0.955526 Pr < W 0.7339
21
Kolmogorov-Smirnov D 0.142059 Pr > D >0.1500 Cramer-von Mises W-Sq 0.037679 Pr > W-Sq >0.2500 Anderson-Darling A-Sq 0.238555 Pr > A-Sq >0.2500
Test (class =2) --Statistic--- -----p Value------
Shapiro-Wilk W 0.805147 Pr < W 0.0167 Kolmogorov-Smirnov D 0.273266 Pr > D 0.0334 Cramer-von Mises W-Sq 0.124461 Pr > W-Sq 0.0447 Anderson-Darling A-Sq 0.776159 Pr > A-Sq 0.0294
*The data of old drug is not normal, use nonparametric test.
Method Variances DF t Value Pr > |t|
Pooled Equal 18 1.34 0.1962 Satterthwaite Unequal 12.547 1.34 0.2033
Equality of Variances
Method Num DF Den DF F Value Pr > F
Folded F 9 9 4.87 0.0274
If the populations are not normal, use nonparametric
test for 2 means from 2 independent samples
(Wilcoxon Rank Sum test)PROC NPAR1WAY DATA=BP;CLASS DRUG;VAR BPR;RUN;
Selected result:Wilcoxon Two-Sample Test
Statistic 123.0000
Normal Approximation Z 1.3259 One-Sided Pr > Z 0.0924 Two-Sided Pr > |Z| 0.1849
t Approximation One-Sided Pr > Z 0.1003 Two-Sided Pr > |Z| 0.2006
*In summary, at α =0.05, we can NOT conclude that
the new drug is better than the old one using the
nonparametric Wilcoxon Rank Sum test.
Note: As you can see from this example, the normality test is very important – indeed we should not have
22
used the unspooled variance t-test because the normality assumption is not satisfied in this problem.
2. Additional Exercise Open SAS: start →programs →Math Science & Statistics
→The SAS System →The SAS System for Windows V8 (click now!)
Now type in the following program in the SAS program editor. Note that comments can be included in the SAS program in the following fashion: /* comments */ (I always include more than one * to signify the comments).
Exercise 1. An agricultural experiment station was interested in comparing the yields for two new varieties of corn. Because the investigators thought that there might be a great deal of variability in yield from one field to another, each variety was randomly assigned to a different 1-acre plot on each of seven farms. The 1-acre plots were planted; the corn was harvested at maturity. The results of the experiment (in bushels of corn) are listed here. Use these data to test the null hypothesis that there is no difference in mean yields for the two varieties of corn. Use α = .05.
Farm 1 2 3 4 5 6 7Variety A 48.2 44.6 49.7 40.5 54.6 47.1 51.4Variety B 41.5 40.1 44.0 41.2 49.8 41.7 46.8
SAS program for solving Exercise 1.
data corn;input farm varietyA varietyB;diff=varietyA-varietyB;datalines;1 48.2 41.52 44.6 40.1 3 49.7 44.04 40.5 41.25 54.6 49.86 47.1 41.77 51.4 46.8;run;
23
proc univariate data=corn normal plot;var diff;title 'Paired samples t-test in SAS';run;
Now run your SAS program and examine the output; check the SAS log if the program is not running properly.
Exercise 2. A pollution-control inspector suspected that a riverside community was releasing semi-treated sewage into a river and this, as a consequence, was changing the level of dissolved oxygen of the river. To check this, he drew 5 randomly selected specimens of river water at a location above the town and another 5 specimens below. The dissolved oxygen readings, in parts per million, are given in the accompanying table. Do the data provide sufficient evidence to indicate a difference in mean oxygen content between locations above and below the town? Use α = .05.
Above town 4.8 5.2 5.0 4.9 5.1Below town 5.0 4.7 4.9 4.8 4.9
SAS program for solving Exercise 2. (We create a new variable called “location” to indicate whether the specimen is from “above town” → location = 1, or “below town” → location = 2. In general, we have a grouping variable such as “location” here, and the original variable of interest such as “oxygen” here, in two-mean comparison via independent samples tests.)
data oxygen;input location oxygen;datalines;1 4.81 5.21 5.01 4.91 5.12 5.02 4.72 4.92 4.82 4.9 ;run;
proc univariate data=oxygen normal plot;24
class location;var oxygen;title 'Check the normality';run;
/**if oxygen is normal at both locations**/proc ttest data=oxygen;class location;var oxygen;title 'Independent samples t-test in SAS';run;
/**if oxygen is not normal for at least one location**/proc npar1way data=oxygen wilcoxon;class location;var oxygen;title 'Nonparametric test for two-mean comparisons in SAS';run;
Now run your SAS program and examine the output; check the SAS log if the program is not running properly.
25
3. Home Work #3 (Due Thursday, September 17). Please do the following problems using SAS.
Problem (1). In an effort to link cold environments with hypertension in humans, a preliminary experiment was conducted to investigate the effect of cold on hypertension in rats. Two random samples of 6 rats each were exposed to different environments. One sample of rats were held in a normal environment at 26°C. The other sample was held in a cold 5°C environment. Blood pressures and heart rates were measured for rats for both groups. The blood pressure for the 12 rats are shown in the accompanying table. Do the data provide sufficient evidence to indicate that rats exposed to a 5°C environment have a higher mean blood pressure than rats exposed to a 26°C environment? Test by using α = .05.
26°C 5°CRat Blood Pressure Rat Blood Pressure1 152 7 3842 157 8 3693 179 9 3544 182 10 3755 176 11 3666 149 12 423
Problem (2). A company wanted to know if attending a course on “how to be a successful salesperson” can increase the average sales of its employees. The company sent six of its salesperson to attend this course. The following table gives the one-week sales of these salespersons before and after they attended the course.
Before 12 18 25 9 14 16After 18 24 24 14 19 20
Using the 1% significance level, can you conclude that the mean weekly sales for all salespersons increase as a result of attending this course?
26