
AMS 394 Lab # 5 Prof. Wei Zhu

Aim: Today we will learn how to do inference on two population means in SAS.

1. Review of Theory and Introduction to SAS for Comparing Two Population Means

Overview: Inference on two population means:

1. The samples are paired

Paired samples t-test (if the paired differences are normal)

OR

Wilcoxon signed rank test (if the paired differences are not normal)

2. The samples are independent

Independent samples t-test (if both populations are normal)

(Then we use the F-test to check whether $\sigma_1^2 = \sigma_2^2$)

a) pooled-variance t-test (if $\sigma_1^2 = \sigma_2^2$)

b) unpooled-variance t-test (if $\sigma_1^2 \neq \sigma_2^2$)

OR

Wilcoxon Rank Sum Test (if at least one population is NOT normal)


1. Inference on two population means: Paired samples

e.g. Compare the IQ of adult males and females

Pair    IQ Male    IQ Female    Diff
1       200        199          1
2       150        120          30
…       …          …            …
n       109        112          -3

*** After taking the paired differences (make sure you subtract one variable from the other in the same order for every pair), the two-population/two-sample problem is reduced to a one-population/one-sample problem on the population/sample of the paired differences.
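To make the reduction concrete, here is a sketch of the one-sample statistic computed on the differences. The symbols $d_i$, $\bar d$, $s_d$ and $\mu_d$ are notation assumed here for the paired differences, their sample mean, their sample standard deviation and their population mean; the t-distribution holds when the differences are normally distributed.

```latex
d_i = X_i - Y_i, \quad i = 1, \dots, n, \qquad
\bar d = \frac{1}{n}\sum_{i=1}^{n} d_i, \qquad
s_d^2 = \frac{1}{n-1}\sum_{i=1}^{n} (d_i - \bar d)^2,
\qquad
T = \frac{\bar d - \mu_d}{s_d/\sqrt{n}} \sim t_{n-1}.
```

Inference on $\mu_1 - \mu_2$ is then ordinary one-sample inference on $\mu_d$.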

SAS code:

data twins;
input IQmale IQfemale;
diff = IQmale - IQfemale;
datalines;
200 199
150 120
...
109 112
;
run;

proc univariate data=twins normal;
var diff;
run;

Example 1. To study the effectiveness of wall insulation in saving energy for home heating, the energy consumption (in MWh) for 5 houses in Bristol, England, was recorded for two winters; the first winter was before insulation and the second winter was after insulation:

House     1      2      3      4      5
Before    12.1   10.6   13.4   13.8   15.5
After     12.0   11.0   14.1   11.2   15.3

(a) Please provide a 95% confidence interval for the difference between the mean energy consumption before and after the wall insulation is installed. What assumptions are necessary for your inference?

(b) Can you conclude that there is a difference in mean energy consumption before and after the wall insulation is installed at the significance level 0.05? Please test it and evaluate the p-value of your test. What assumptions are necessary for your inference?

(c) Please write the SAS program to perform the test and examine the necessary assumptions given in (b).

SOLUTION: This is inference on two population means, paired samples.

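(a) $\bar d = 0.36$, $s_d = 1.30$

95% CI: $\bar d \pm t_{4,0.025} \cdot \frac{s_d}{\sqrt{n}} = 0.36 \pm 2.776 \cdot \frac{1.30}{\sqrt{5}} = (-1.25, 1.97)$

(b) $H_0: \mu_d = 0$ versus $H_a: \mu_d \neq 0$

Test statistic: $t_0 = \frac{\bar d - 0}{s_d/\sqrt{n}} = \frac{0.36}{1.30/\sqrt{5}} \approx 0.62$

Since $|t_0| \approx 0.62$ is smaller than $t_{4,0.025} = 2.776$, we cannot reject $H_0$. The p-value is $2 \cdot P(T_4 \geq 0.62) \approx 0.57$.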

Assumptions for (a) and (b): the paired differences follow a normal distribution.

(c) The SAS program is:

Data energy;
Input before after @@;
Diff = before - after;
Datalines;
12.1 12.0 10.6 11.0 13.4 14.1 13.8 11.2 15.5 15.3
;
Run;

Proc univariate data=energy normal;
Var Diff;
Run;

Note 1: As you can see from the output below, the variable "Diff" is only borderline normal (for example, the Shapiro-Wilk p-value is 0.0850). In this situation, one can use either the t-test or the signed-rank test. (Although it is not the best practice, α = 0.05 can be used as the significance level for judging normality; α = 0.1 is the safer choice.)

Note 2: Although several SAS procedures can generate the t-test, I recommend that you master proc univariate first because it has everything you will need: the normality tests, the t-test and the nonparametric tests.


SAS Output (selected):

The UNIVARIATE Procedure
Variable: Diff

Tests for Location: Mu0=0

Test           -Statistic-      -----p Value------
Student's t    t    0.616851    Pr > |t|     0.5707
Sign           M    0.5         Pr >= |M|    1.0000
Signed Rank    S    0.5         Pr >= |S|    1.0000

Tests for Normality

Test                  --Statistic---      -----p Value------
Shapiro-Wilk          W       0.80253     Pr < W       0.0850
Kolmogorov-Smirnov    D       0.348791    Pr > D       0.0442
Cramer-von Mises      W-Sq    0.10017     Pr > W-Sq    0.0859
Anderson-Darling      A-Sq    0.551915    Pr > A-Sq    0.0779
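As a cross-check of parts (a) and (b) of Example 1, the sketch below runs the same paired analysis through proc ttest and then repeats the hand formulas in a data step. It is only one possible way to verify the numbers: the names dbar, sd, tcrit, lower, upper, t0 and pval are illustrative, and the data step uses the rounded summary statistics from (a), so its results may differ slightly in the last digit from the proc output.

```sas
/* Sketch: cross-check of Example 1, parts (a) and (b) */
data energy;
  input before after @@;
  datalines;
12.1 12.0 10.6 11.0 13.4 14.1 13.8 11.2 15.5 15.3
;
run;

/* The PAIRED statement analyzes the differences before - after */
proc ttest data=energy alpha=0.05;
  paired before*after;
run;

/* Hand formulas, using the rounded summary statistics from part (a) */
data _null_;
  dbar = 0.36;  sd = 1.30;  n = 5;
  tcrit = tinv(0.975, n - 1);              /* t_{4, 0.025}        */
  lower = dbar - tcrit*sd/sqrt(n);         /* lower 95% CI limit  */
  upper = dbar + tcrit*sd/sqrt(n);         /* upper 95% CI limit  */
  t0    = dbar / (sd/sqrt(n));             /* test statistic      */
  pval  = 2*(1 - probt(abs(t0), n - 1));   /* two-sided p-value   */
  put tcrit= 8.3 lower= 8.2 upper= 8.2 t0= 8.2 pval= 8.2;
run;
```

The t statistic and p-value from proc ttest should agree with the Student's t row of the proc univariate output, since both procedures carry out the same one-sample t-test on Diff.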


2. The samples are independent:

a) If both populations are normal and $\sigma_1^2 = \sigma_2^2$

Pooled-variance t-test

b) If both populations are normal and $\sigma_1^2 \neq \sigma_2^2$

Unpooled-variance t-test

c) If at least one population is not normal

Wilcoxon rank sum test

2a. Inference on 2 population means, when both populations are normal. We have 2 independent samples; the population variances are unknown but equal ($\sigma_1^2 = \sigma_2^2 = \sigma^2$). Use the pooled-variance t-test.

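Data: $X_1, \dots, X_{n_1} \overset{iid}{\sim} N(\mu_1, \sigma_1^2)$ and $Y_1, \dots, Y_{n_2} \overset{iid}{\sim} N(\mu_2, \sigma_2^2)$, with $\sigma_1^2 = \sigma_2^2 = \sigma^2$.

Goal: Compare $\mu_1$ and $\mu_2$.

1) Point estimator:

$$\widehat{\mu_1 - \mu_2} = \bar X - \bar Y \sim N\left(\mu_1 - \mu_2,\ \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right) = N\left(\mu_1 - \mu_2,\ \left(\frac{1}{n_1} + \frac{1}{n_2}\right)\sigma^2\right)$$

2) Pivotal quantity:

$$Z = \frac{\bar X - \bar Y - (\mu_1 - \mu_2)}{\sigma\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim N(0, 1)$$

is not the PQ, since we don't know $\sigma^2$.

$\frac{(n_1 - 1)S_1^2}{\sigma^2} \sim \chi^2_{n_1 - 1}$, $\frac{(n_2 - 1)S_2^2}{\sigma^2} \sim \chi^2_{n_2 - 1}$, and they are independent ($S_1^2$ & $S_2^2$ are independent because the two samples are independent of each other).

Definition: $W = Z_1^2 + Z_2^2 + \dots + Z_k^2$, where $Z_i \overset{iid}{\sim} N(0, 1)$, then $W \sim \chi^2_k$.

Hence $W = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{\sigma^2} \sim \chi^2_{n_1 + n_2 - 2}$.

Definition: t-distribution:

$Z \sim N(0, 1)$, $W \sim \chi^2_k$, and Z & W are independent, then $T = \frac{Z}{\sqrt{W/k}} \sim t_k$.

$$T = \frac{Z}{\sqrt{W/(n_1 + n_2 - 2)}} = \frac{\bar X - \bar Y - (\mu_1 - \mu_2)}{S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t_{n_1 + n_2 - 2}$$

where $S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$ is the pooled variance. This is the PQ for inference on the parameter of interest, $\mu_1 - \mu_2$.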

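3) Confidence Interval for $\mu_1 - \mu_2$

With $T$ the pivotal quantity above and $S_p^2$ the pooled variance,

$$1 - \alpha = P\left(-t_{n_1 + n_2 - 2,\alpha/2} \leq T \leq t_{n_1 + n_2 - 2,\alpha/2}\right)$$

Solving the inequality for $\mu_1 - \mu_2$ gives

$$\bar X - \bar Y \pm t_{n_1 + n_2 - 2,\alpha/2} \cdot S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

This is the $100(1 - \alpha)\%$ C.I. for $\mu_1 - \mu_2$.

4) Test: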

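Test statistic: $T_0 = \frac{\bar X - \bar Y - c_0}{S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \overset{H_0}{\sim} t_{n_1 + n_2 - 2}$

Below, $t_0$ denotes the observed value of $T_0$ and $T \sim t_{n_1 + n_2 - 2}$.

a) $H_0: \mu_1 - \mu_2 = c_0$ versus $H_a: \mu_1 - \mu_2 > c_0$ (The most common situation is $c_0 = 0$)

At the significance level α, we reject $H_0$ in favor of $H_a$ iff $T_0 \geq t_{n_1 + n_2 - 2,\alpha}$

If p-value $= P(T \geq t_0) < \alpha$, reject $H_0$

b) $H_0: \mu_1 - \mu_2 = c_0$ versus $H_a: \mu_1 - \mu_2 < c_0$

At significance level α, reject $H_0$ in favor of $H_a$ iff $T_0 \leq -t_{n_1 + n_2 - 2,\alpha}$

If p-value $= P(T \leq t_0) < \alpha$, reject $H_0$

c) $H_0: \mu_1 - \mu_2 = c_0$ versus $H_a: \mu_1 - \mu_2 \neq c_0$

At significance level α, reject $H_0$ in favor of $H_a$ iff $|T_0| \geq t_{n_1 + n_2 - 2,\alpha/2}$

If p-value $= 2 \cdot P(T \geq |t_0|) < \alpha$, reject $H_0$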
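The three rejection rules can be evaluated numerically in a short data step. This is a sketch only: the values of t0, df and alpha below are illustrative placeholders rather than results from a particular data set, and the variable names are hypothetical.

```sas
/* Sketch: critical values and p-values for the three alternatives.
   t0, df and alpha are illustrative values chosen for this sketch. */
data _null_;
  t0 = -1.49;  df = 12;  alpha = 0.05;
  crit_one = tinv(1 - alpha, df);            /* t_{df, alpha}            */
  crit_two = tinv(1 - alpha/2, df);          /* t_{df, alpha/2}          */
  p_upper  = 1 - probt(t0, df);              /* Ha: mu1 - mu2 > c0       */
  p_lower  = probt(t0, df);                  /* Ha: mu1 - mu2 < c0       */
  p_two    = 2*(1 - probt(abs(t0), df));     /* Ha: mu1 - mu2 ne c0      */
  put crit_one= crit_two= p_upper= p_lower= p_two=;
run;
```

The results are written to the SAS log. In each case the critical-value rule and the p-value rule lead to the same decision.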


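2b. Inference on 2 population means, when both populations are normal and the population variances are not equal ($\sigma_1^2 \neq \sigma_2^2$). We have two independent samples. Use the unpooled-variance t-test.

1) P.Q:

$$T = \frac{\bar X - \bar Y - (\mu_1 - \mu_2)}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}} \sim t_{df} \text{ (approximately)}$$

Formula for the df:

Welch-Satterthwaite method:

$$df = \frac{(w_1 + w_2)^2}{\frac{w_1^2}{n_1 - 1} + \frac{w_2^2}{n_2 - 1}}$$

where $w_1 = \frac{S_1^2}{n_1}$ and $w_2 = \frac{S_2^2}{n_2}$,

or another way to find df (less accurate and more conservative): $df = \min(n_1 - 1, n_2 - 1)$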
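As an illustration of the two df formulas, the data step below evaluates them for the summary statistics quoted in Example 3 of this lab ($n_1 = n_2 = 10$, $s_1^2 = 97.43$, $s_2^2 = 20.01$). It is a sketch with hypothetical variable names (w1, w2, df_ws, df_min), not part of the original program.

```sas
/* Sketch: Welch-Satterthwaite df versus the conservative df */
data _null_;
  /* Summary statistics quoted from Example 3 of this lab */
  n1 = 10;  s1sq = 97.43;
  n2 = 10;  s2sq = 20.01;
  w1 = s1sq / n1;
  w2 = s2sq / n2;
  df_ws  = (w1 + w2)**2 / (w1**2/(n1 - 1) + w2**2/(n2 - 1));
  df_min = min(n1 - 1, n2 - 1);
  put df_ws= 8.3 df_min=;
run;
```

The Welch-Satterthwaite value comes out at about 12.55, in line with the Satterthwaite DF of 12.547 reported by proc ttest in Example 3, while the conservative choice is 9.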

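2) 100(1-α)% C.I. for ($\mu_1 - \mu_2$):

$$\bar X - \bar Y \pm t_{df,\alpha/2} \cdot \sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}$$

3) Test:

Test statistic: $T_0 = \frac{\bar X - \bar Y - c_0}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}} \overset{H_0}{\sim} t_{df}$ (approximately)

a) $H_0: \mu_1 - \mu_2 = c_0$ versus $H_a: \mu_1 - \mu_2 > c_0$

If $T_0 \geq t_{df,\alpha}$, reject $H_0$

b) $H_0: \mu_1 - \mu_2 = c_0$ versus $H_a: \mu_1 - \mu_2 < c_0$

If $T_0 \leq -t_{df,\alpha}$, reject $H_0$

c) $H_0: \mu_1 - \mu_2 = c_0$ versus $H_a: \mu_1 - \mu_2 \neq c_0$

If $|T_0| \geq t_{df,\alpha/2}$, reject $H_0$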

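4) F-test:

Data: $X_1, \dots, X_{n_1} \overset{iid}{\sim} N(\mu_1, \sigma_1^2)$ and $Y_1, \dots, Y_{n_2} \overset{iid}{\sim} N(\mu_2, \sigma_2^2)$, two independent samples.

① Point estimator: $\widehat{\sigma_1^2/\sigma_2^2} = S_1^2/S_2^2$

② P.Q:

Definition: F-Distribution

Let $W_1 \sim \chi^2_{k_1}$, $W_2 \sim \chi^2_{k_2}$, and $W_1$, $W_2$ are independent. Then $F = \frac{W_1/k_1}{W_2/k_2} \sim F_{k_1,k_2}$

Pivotal Quantity: $F = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \sim F_{n_1 - 1, n_2 - 1}$

③ Confidence Interval for $\sigma_1^2/\sigma_2^2$

Write $F_{k_1,k_2,\alpha,U}$ and $F_{k_1,k_2,\alpha,L}$ for the upper and lower α critical values, that is, $P(F \geq F_{k_1,k_2,\alpha,U}) = \alpha$ and $P(F \leq F_{k_1,k_2,\alpha,L}) = \alpha$. Then

$$1 - \alpha = P\left(F_{n_1 - 1, n_2 - 1, \alpha/2, L} \leq F \leq F_{n_1 - 1, n_2 - 1, \alpha/2, U}\right)$$

and the $100(1 - \alpha)\%$ C.I. for $\sigma_1^2/\sigma_2^2$ is

$$\left(\frac{S_1^2/S_2^2}{F_{n_1 - 1, n_2 - 1, \alpha/2, U}},\ \frac{S_1^2/S_2^2}{F_{n_1 - 1, n_2 - 1, \alpha/2, L}}\right)$$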

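④ F-test: $H_0: \sigma_1^2 = \sigma_2^2$ (or equivalently: $H_0: \sigma_1^2/\sigma_2^2 = 1$)

Test Statistic: $F_0 = \frac{S_1^2}{S_2^2} \overset{H_0}{\sim} F_{n_1 - 1, n_2 - 1}$

a) $H_a: \sigma_1^2 \neq \sigma_2^2$

At the significance level α, we will reject $H_0$ in favor of $H_a$ iff $F_0 \geq F_{n_1 - 1, n_2 - 1, \alpha/2, U}$ or $F_0 \leq F_{n_1 - 1, n_2 - 1, \alpha/2, L}$

b) $H_a: \sigma_1^2 > \sigma_2^2$. Reject $H_0$ iff $F_0 \geq F_{n_1 - 1, n_2 - 1, \alpha, U}$

c) $H_a: \sigma_1^2 < \sigma_2^2$. Reject $H_0$ iff $F_0 \leq F_{n_1 - 1, n_2 - 1, \alpha, L}$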

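5) A trick for the F distribution

If $F \sim F_{k_1,k_2}$, then $\frac{1}{F} \sim F_{k_2,k_1}$. Some F-tables give only the upper critical values. If we know $F_{k_2,k_1,\alpha,U}$, how do we find $F_{k_1,k_2,\alpha,L}$? Take the reciprocal:

$$F_{k_1,k_2,\alpha,L} = \frac{1}{F_{k_2,k_1,\alpha,U}}$$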
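The reciprocal relationship can be checked numerically with the FINV quantile function. In this sketch the degrees of freedom (6 and 6) and α = 0.05 are illustrative choices, and the variable names are hypothetical.

```sas
/* Sketch: lower F critical value obtained from the upper one.
   k1 = k2 = 6 and alpha = 0.05 are illustrative choices. */
data _null_;
  k1 = 6;  k2 = 6;  alpha = 0.05;
  upper    = finv(1 - alpha, k2, k1);   /* upper alpha point of F(k2, k1)        */
  lower    = 1 / upper;                 /* lower alpha point of F(k1, k2), trick */
  lower_ck = finv(alpha, k1, k2);       /* same quantile computed directly       */
  put upper= 8.3 lower= 8.3 lower_ck= 8.3;
run;
```

The two lower values should coincide, and for these inputs they should agree with the table values 4.28 and 0.23 used in Example 2.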


Example 2. An experiment was conducted to compare the mean number of tapeworms in the stomachs of sheep that had been treated for worms against the mean number in those that were untreated. A sample of 14 worm-infected lambs was randomly divided into 2 groups. Seven were injected with the drug and the remainder were left untreated. After a 6-month period, the lambs were slaughtered and the following worm counts were recorded:

Drug-treated sheep    18   43   28   50   16   32   13
Untreated sheep       40   54   26   63   21   37   39

(a) Test at α = 0.05 whether the treatment is effective or not.

(b) What assumptions do you need for the inference in part (a)?

(c) Please write up the entire SAS program necessary to answer the questions raised in (a) and (b).

SOLUTION: Inference on two population means. Two small and independent samples.

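Drug-treated sheep: $\bar X_1 = 28.57$, $s_1^2 = 198.62$, $n_1 = 7$

Untreated sheep: $\bar X_2 = 40.0$, $s_2^2 = 215.33$, $n_2 = 7$

(a) Under the normality assumption, we first test if the two population variances are equal. That is, $H_0: \sigma_1^2 = \sigma_2^2$ versus $H_a: \sigma_1^2 \neq \sigma_2^2$. The test statistic is

$$F_0 = \frac{s_1^2}{s_2^2} = \frac{198.62}{215.33} \approx 0.92,$$

$F_{6,6,0.05,U} = 4.28$ and $F_{6,6,0.05,L} = 1/4.28 \approx 0.23$. Since $F_0$ is between 0.23 and 4.28, we cannot reject $H_0$. (These are the upper and lower 0.05 critical values, so this two-sided F-test is carried out at level 0.10; the conclusion is the same at level 0.05, where the interval between the critical values is wider.) Therefore it is reasonable to assume that $\sigma_1^2 = \sigma_2^2$.

Next we perform the pooled-variance t-test with hypotheses $H_0: \mu_1 - \mu_2 = 0$ versus $H_a: \mu_1 - \mu_2 < 0$

$$t_0 = \frac{\bar X_1 - \bar X_2 - 0}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} = \frac{(28.57 - 40.0) - 0}{14.39\sqrt{\frac{1}{7} + \frac{1}{7}}} \approx -1.49$$

Since $t_0 \approx -1.49$ is greater than $-t_{12,0.05} = -1.782$, we cannot reject $H_0$. At α = 0.05 the data do not provide sufficient evidence that the treatment reduces the mean number of worms; we cannot reject the hypothesis that there is no difference in the mean number of worms in treated and untreated lambs.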
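The value $s_p = 14.39$ used in the test statistic is not derived in the solution. A short worked computation, using the pooled variance formula from section 2a and the sample variances quoted for this example, is:

```latex
s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}
      = \frac{6(198.62) + 6(215.33)}{12}
      = 206.975,
\qquad
s_p = \sqrt{206.975} \approx 14.39 .
```

Because $n_1 = n_2$, the pooled variance is simply the average of the two sample variances.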

(b) (1) Both populations are normally distributed

(2) The two population variances are equal: $\sigma_1^2 = \sigma_2^2$

(c)

/*Problem #2B*/
data sheep;
input group worms;
datalines;
1 18
1 43
1 28
1 50
1 16
1 32
1 13
2 40
2 54
2 26
2 63
2 21
2 37
2 39
;
run;

proc univariate data=sheep normal;
class group;
var worms;
title 'Check for normality';
run;

proc ttest data=sheep;
class group;
var worms;
title 'Independent samples t-test';
run;

proc npar1way data=sheep wilcoxon;
class group;
var worms;
title 'Nonparametric test for two-mean comparisons';
run;
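By default proc ttest reports two-sided p-values, whereas the alternative in part (a) is one-sided ($H_a: \mu_1 - \mu_2 < 0$). The sketch below requests the lower-tailed test directly. It assumes the data set sheep created above (group 1 = drug-treated, group 2 = untreated, so the difference is group 1 minus group 2) and a SAS release that supports the SIDES= option (SAS 9.2 or later, which is an assumption here since the lab mentions SAS V8).

```sas
/* Sketch: one-sided version of the independent samples t-test.
   Assumes the data set sheep from the program above and
   SAS 9.2 or later for the SIDES= option. */
proc ttest data=sheep sides=L h0=0 alpha=0.05;
  class group;
  var worms;
  title 'One-sided independent samples t-test (Ha: mu1 - mu2 < 0)';
run;
```

On older releases, the one-sided p-value can be obtained by halving the two-sided Pr > |t|, provided the sign of the t statistic agrees with the direction of the alternative, as it does here.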


Check for normality

The UNIVARIATE Procedure

Variable: worms

group = 1

Tests for Normality

Test Statistic p Value

Shapiro-Wilk W 0.925604 Pr < W 0.5142

Kolmogorov-Smirnov D 0.201976 Pr > D >0.1500

Cramer-von Mises W-Sq 0.039988 Pr > W-Sq >0.2500

Anderson-Darling A-Sq 0.271024 Pr > A-Sq >0.2500

Variable: worms

group = 2

Tests for Normality

Test Statistic p Value


Kolmogorov-Smirnov D 0.214286 Pr > D >0.1500

Cramer-von Mises W-Sq 0.041652 Pr > W-Sq >0.2500

Anderson-Darling A-Sq 0.240103 Pr > A-Sq >0.2500

Independent samples t-test

The TTEST Procedure

Variable: worms

Method Variances DF t Value Pr > |t|

Pooled Equal 12 -1.49 0.1630

Satterthwaite Unequal 11.98 -1.49 0.1631


Equality of Variances

Method Num DF Den DF F Value Pr > F

Folded F 6 6 1.08 0.9244

Nonparametric test for two-mean comparisons

The NPAR1WAY Procedure

Wilcoxon Two-Sample Test (Wilcoxon Rank Sum Test)

Statistic 42.0000

Normal Approximation

Z -1.2778

One-Sided Pr < Z 0.1007

Two-Sided Pr > |Z| 0.2013

t Approximation

One-Sided Pr < Z 0.1118

Two-Sided Pr > |Z| 0.2237

Z includes a continuity correction of 0.5.

Kruskal-Wallis Test

Chi-Square 1.8000

DF 1

Pr > Chi-Square 0.1797


Example 3:

A new drug for reducing blood pressure (BP) is compared to an old drug. 20 patients with comparable high BP were recruited and randomized evenly to the 2 drugs. Reductions in BP after 1 month of taking the drugs are as follows:

New drug: 0, 10, -3, 15, 2, 27, 19, 21, 18, 10

Old drug: 8, -4, 7, 5, 10, 11, 9, 12, 7, 8

① Assuming both populations are normal, please test at α = 0.05 whether the new drug is better than the old one.

② Run the SAS program to do the above tests (including the normality test).

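Solution:

① $n_1 = 10$, $\bar X_1 = 11.9$, $s_1^2 = 97.43$, $n_2 = 10$, $\bar X_2 = 7.3$, $s_2^2 = 20.01$

F test: $H_0: \sigma_1^2 = \sigma_2^2$ versus $H_a: \sigma_1^2 \neq \sigma_2^2$

Test statistic: $F_0 = \frac{s_1^2}{s_2^2} = \frac{97.43}{20.01} \approx 4.87$

Since $F_0 \approx 4.87$ exceeds the upper critical value $F_{9,9,0.025,U} = 4.03$, at α = 0.05 we reject $H_0$ and use the unpooled-variance t-test.

T-test: $H_0: \mu_1 - \mu_2 = 0$ versus $H_a: \mu_1 - \mu_2 > 0$

Test statistic: $t_0 = \frac{\bar X_1 - \bar X_2 - 0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} = \frac{11.9 - 7.3}{\sqrt{\frac{97.43}{10} + \frac{20.01}{10}}} \approx 1.34$, with $df \approx 12.5$ by the Welch-Satterthwaite method.

Since $t_0 \approx 1.34$ is smaller than $t_{12,0.05} = 1.782$, at α = 0.05 we fail to reject $H_0$ and we cannot say that the new drug is better than the old one.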

②

Data BP;
input drug BPR;
datalines;
1 0
1 10
1 -3
1 15
1 2
1 27
1 19
1 21
1 18
1 10
2 8
2 -4
2 7
2 5
2 10
2 11
2 9
2 12
2 7
2 8
;
run;

PROC UNIVARIATE DATA=BP NORMAL;
CLASS DRUG;
VAR BPR;
RUN;

PROC TTEST DATA=BP;
CLASS DRUG;
VAR BPR;
RUN;

Selected result:

Tests for Normality

Test (class=1)        --Statistic---      -----p Value------
Kolmogorov-Smirnov    D       0.142059    Pr > D       >0.1500
Cramer-von Mises      W-Sq    0.037679    Pr > W-Sq    >0.2500
Anderson-Darling      A-Sq    0.238555    Pr > A-Sq    >0.2500

Test (class=2)        --Statistic---      -----p Value------
Shapiro-Wilk          W       0.805147    Pr < W       0.0167
Kolmogorov-Smirnov    D       0.273266    Pr > D       0.0334
Cramer-von Mises      W-Sq    0.124461    Pr > W-Sq    0.0447
Anderson-Darling      A-Sq    0.776159    Pr > A-Sq    0.0294

*The old-drug data (class=2) are not normal, so a nonparametric test should be used.

Method           Variances    DF        t Value    Pr > |t|
Pooled           Equal        18        1.34       0.1962
Satterthwaite    Unequal      12.547    1.34       0.2033

Equality of Variances

Method      Num DF    Den DF    F Value    Pr > F
Folded F    9         9         4.87       0.0274

If the populations are not normal, use the nonparametric test for 2 means from 2 independent samples (Wilcoxon Rank Sum test):

PROC NPAR1WAY DATA=BP;
CLASS DRUG;
VAR BPR;
RUN;

Selected result:

Wilcoxon Two-Sample Test

Statistic    123.0000

Normal Approximation
Z                       1.3259
One-Sided Pr > Z        0.0924
Two-Sided Pr > |Z|      0.1849

t Approximation
One-Sided Pr > Z        0.1003
Two-Sided Pr > |Z|      0.2006

*In summary, at α = 0.05, we can NOT conclude that the new drug is better than the old one using the nonparametric Wilcoxon Rank Sum test.

Note: As you can see from this example, the normality test is very important. Indeed, we should not have used the unpooled-variance t-test because the normality assumption is not satisfied in this problem.
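The Wilcoxon p-values shown above rely on normal and t approximations. With only 10 observations per group, an exact p-value may also be of interest. The sketch below is one way to request it, assuming the data set BP created earlier in this example; the EXACT statement is an addition that the original program does not use.

```sas
/* Sketch: exact Wilcoxon rank sum test for the Example 3 data.
   Assumes the data set BP created earlier in this example. */
proc npar1way data=BP wilcoxon;
  class drug;
  var BPR;
  exact wilcoxon;   /* exact p-values in addition to the approximations */
run;
```

Exact computations can be slow for larger samples, so this option is best suited to small data sets like this one.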

2. Additional Exercise

Open SAS: start → programs → Math Science & Statistics → The SAS System → The SAS System for Windows V8 (click now!)

Now type in the following program in the SAS program editor. Note that comments can be included in the SAS program in the following fashion: /* comments */ (I always include more than one * to signify the comments).

Exercise 1. An agricultural experiment station was interested in comparing the yields for two new varieties of corn. Because the investigators thought that there might be a great deal of variability in yield from one field to another, each variety was randomly assigned to a different 1-acre plot on each of seven farms. The 1-acre plots were planted; the corn was harvested at maturity. The results of the experiment (in bushels of corn) are listed here. Use these data to test the null hypothesis that there is no difference in mean yields for the two varieties of corn. Use α = .05.

Farm         1      2      3      4      5      6      7
Variety A    48.2   44.6   49.7   40.5   54.6   47.1   51.4
Variety B    41.5   40.1   44.0   41.2   49.8   41.7   46.8

SAS program for solving Exercise 1.

data corn;
input farm varietyA varietyB;
diff = varietyA - varietyB;
datalines;
1 48.2 41.5
2 44.6 40.1
3 49.7 44.0
4 40.5 41.2
5 54.6 49.8
6 47.1 41.7
7 51.4 46.8
;
run;

proc univariate data=corn normal plot;
var diff;
title 'Paired samples t-test in SAS';
run;

Now run your SAS program and examine the output; check the SAS log if the program is not running properly.

Exercise 2. A pollution-control inspector suspected that a riverside community was releasing semi-treated sewage into a river and this, as a consequence, was changing the level of dissolved oxygen of the river. To check this, he drew 5 randomly selected specimens of river water at a location above the town and another 5 specimens below. The dissolved oxygen readings, in parts per million, are given in the accompanying table. Do the data provide sufficient evidence to indicate a difference in mean oxygen content between locations above and below the town? Use α = .05.

Above town    4.8   5.2   5.0   4.9   5.1
Below town    5.0   4.7   4.9   4.8   4.9

SAS program for solving Exercise 2. (We create a new variable called “location” to indicate whether the specimen is from “above town” → location = 1, or “below town” → location = 2. In general, we have a grouping variable such as “location” here, and the original variable of interest such as “oxygen” here, in two-mean comparison via independent samples tests.)

data oxygen;
input location oxygen;
datalines;
1 4.8
1 5.2
1 5.0
1 4.9
1 5.1
2 5.0
2 4.7
2 4.9
2 4.8
2 4.9
;
run;

proc univariate data=oxygen normal plot;
class location;
var oxygen;
title 'Check the normality';
run;

/**if oxygen is normal at both locations**/
proc ttest data=oxygen;
class location;
var oxygen;
title 'Independent samples t-test in SAS';
run;

/**if oxygen is not normal for at least one location**/
proc npar1way data=oxygen wilcoxon;
class location;
var oxygen;
title 'Nonparametric test for two-mean comparisons in SAS';
run;

Now run your SAS program and examine the output; check the SAS log if the program is not running properly.


3. Home Work #3 (Due Thursday, September 17). Please do the following problems using SAS.

Problem (1). In an effort to link cold environments with hypertension in humans, a preliminary experiment was conducted to investigate the effect of cold on hypertension in rats. Two random samples of 6 rats each were exposed to different environments. One sample of rats was held in a normal environment at 26°C. The other sample was held in a cold 5°C environment. Blood pressures and heart rates were measured for the rats in both groups. The blood pressures for the 12 rats are shown in the accompanying table. Do the data provide sufficient evidence to indicate that rats exposed to a 5°C environment have a higher mean blood pressure than rats exposed to a 26°C environment? Test by using α = .05.

        26°C                       5°C
Rat    Blood Pressure      Rat    Blood Pressure
1      152                 7      384
2      157                 8      369
3      179                 9      354
4      182                 10     375
5      176                 11     366
6      149                 12     423

Problem (2). A company wanted to know if attending a course on "how to be a successful salesperson" can increase the average sales of its employees. The company sent six of its salespersons to attend this course. The following table gives the one-week sales of these salespersons before and after they attended the course.

Before    12   18   25   9    14   16
After     18   24   24   14   19   20

Using the 1% significance level, can you conclude that the mean weekly sales for all salespersons increase as a result of attending this course?
