
UNIVERSITY OF TORONTO SCARBOROUGH

Department of Computer and Mathematical Sciences

Midterm Test, October 25, 2012

STAC32 Applications of Statistical Methods

Duration: 2 hours

Last name

First name

Student number

Aids allowed:

- My lecture overheads

- My long-form lecture notes

- Any notes that you have taken in this course

- Your marked assignments

- My R book

- The course SAS text

- Non-programmable, non-communicating calculator

Before you begin, complete the signature sheet, but sign it only when the invigilator collects it. The signature sheet shows that you were present at the exam.

This exam has 21 numbered pages, including this page. Please check to see that you have all the pages.

Answer each question in the space provided (under the question). If you need more space, use the backs of the pages, but be sure to draw the marker's attention to where the rest of the answer may be found.

The maximum marks available for each question are shown below.

Question   Max   Mark
1            5
2            5
3           12
4           10
5            7
6           10
7           15
8            7
9            7
10           6
11          10
12           6
Total      100


1. Here is a data file, called data1.txt:

x y name

10 12 Amy

8 13 Brian

11 15 Camille

12 16 David

Write a line of R code that will store these data into a data frame called people.

2. Here is a data file, called data2.txt:

10 12 Amy

8 13 Brian

11 15 Camille

12 16 David

Write a SAS data step that will read these values into a SAS data set called people, with variables named x, y and name. (There will probably need to be three lines of code altogether.)

3. Recent studies of the private practices of physicians who saw no Medicaid patients suggested that the median length of each patient visit was 22 minutes. It is suspected that the median visit length in practices with a large Medicaid load is shorter than 22 minutes. (Medicaid is the pre-Obamacare US government health program for low-income people. "A large Medicaid load" means that the health practice sees a lot of Medicaid patients.) A random sample of 20 visits in practices with a large Medicaid load yielded, in order, the following visit lengths:

9.4 13.4 15.6 16.2 16.4 16.8 18.1 18.7 18.9 19.1

19.3 20.1 20.4 21.6 21.9 23.4 23.5 24.8 24.9 26.8

The data are laid out this way in the data file.

Some analysis is shown below. Note that SAS's tests are two-sided. Ask yourself whether that is what you want.

SAS> data visits;

SAS> infile 'visit.txt';

SAS> input visit @@;

SAS>

SAS> proc univariate mu0=22;

SAS> var visit;

SAS> histogram;


The UNIVARIATE Procedure

Variable: visit

Moments

N 20 Sum Weights 20

Mean 19.465 Sum Observations 389.3

Std Deviation 4.22053564 Variance 17.8129211

Skewness -0.4101742 Kurtosis 0.38445229

Uncorrected SS 7916.17 Corrected SS 338.4455

Coeff Variation 21.6826901 Std Error Mean 0.94374046

Basic Statistical Measures

Location Variability

Mean 19.46500 Std Deviation 4.22054

Median 19.20000 Variance 17.81292

Mode . Range 17.40000

Interquartile Range 6.05000

Tests for Location: Mu0=22

Test -Statistic- -----p Value------

Student's t t -2.68612 Pr > |t| 0.0146

Sign M -5 Pr >= |M| 0.0414

Signed Rank S -66.5 Pr >= |S| 0.0110

Quantiles (Definition 5)

Quantile Estimate

100% Max 26.80

99% 26.80

95% 25.85

90% 24.85

75% Q3 22.65

50% Median 19.20

25% Q1 16.60

10% 14.50

5% 11.40

1% 9.40

0% Min 9.40

Extreme Observations

----Lowest---- ----Highest---

Value Obs Value Obs

9.4 1 23.4 16

13.4 2 23.5 17

15.6 3 24.8 18

16.2 4 24.9 19

16.4 5 26.8 20
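
The summary figures in the output above can be cross-checked directly from the 20 listed visit lengths. The following is a minimal R sketch, not part of the original exam; the vector name visit is borrowed from the SAS code, and the values are typed in by hand rather than read from visit.txt.

```r
# Visit lengths copied from the question (assumption: typed in by hand,
# not read from visit.txt).
visit <- c(9.4, 13.4, 15.6, 16.2, 16.4, 16.8, 18.1, 18.7, 18.9, 19.1,
           19.3, 20.1, 20.4, 21.6, 21.9, 23.4, 23.5, 24.8, 24.9, 26.8)

n <- length(visit)
visit_mean <- mean(visit)        # "Mean" in the Moments table
visit_sd <- sd(visit)            # "Std Deviation"
visit_median <- median(visit)    # "Median"
visit_se <- visit_sd / sqrt(n)   # "Std Error Mean"

# t statistic against the hypothesised value 22, as in the "Student's t" row
t_stat <- (visit_mean - 22) / visit_se

print(c(mean = visit_mean, sd = visit_sd, median = visit_median,
        se = visit_se, t = t_stat))
```

The printed values should agree with the Moments and Tests for Location tables up to rounding.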

[Histogram of visit: horizontal axis visit (ticks at 10, 14, 18, 22, 26), vertical axis Percent (0 to 40).]

(a) What happens if you leave out the @@ on the input line?

(b) What is the purpose of the mu0=22 on the proc univariate line?

(c) The preamble talks about testing the median visit length. What would be an appropriate test, P-value and conclusion for testing whether the median visit length is less than 22 minutes?

(d) Suppose that we had been testing whether the mean visit length was less than 22 minutes. Can you obtain a reasonable P-value and conclusion from the output?

(e) Take a look at the histogram. Do you think that there's a reason to use median visit length rather than mean visit length? Which of the tests in (c) or (d) would you therefore recommend? Explain briefly.


4. The Kentucky Derby is a horse race that has been run since 1875. I have a data file with some information about the winning horse each year, including the speed (in miles per hour) at which the winning horse completed the course. I want to draw a boxplot of the speeds (as one group), labelling outliers separately, using SAS.

(a) Below is my data step, but with one line missing. What might the missing line be?

SAS> data horses;

SAS> infile 'derby.csv' firstobs=2 dlm=',';

SAS> input year winner $ time speed yearid;

(b) What two lines of code will produce the boxplot of speeds that you want?

(c) Below is my boxplot. How would you describe the distribution of speeds?

[Boxplot of speed for a single group: speed axis with ticks at 30, 32, 34, 36, 38; group axis with the single value 1.]

(d) Last, a plot of speed against year. Would you say that the speed of the winning horse is getting faster over time? What trend do you see over the last 30 years? Explain briefly.

SAS> proc gplot;

SAS> plot speed*year;

[Scatterplot of speed (vertical axis, ticks from 31 to 38) against year (horizontal axis, ticks from 1860 to 2020).]

5. A study on the health risks of smoking measured the cholesterol levels of people that have smoked for at least 25 years and people of similar ages who smoked for no more than 5 years and then stopped (labelled exsmoker in the data file). The top line of the data file contains names for the variables. SAS code to read the data and perform some calculations is shown below, along with the output that it produces. There are some questions below that.

SAS> data cholest;

SAS> infile 'smok-chol.txt' firstobs=2;

SAS> input chol status $;

SAS>

SAS> proc means;

SAS> var chol;

SAS> class status;

The MEANS Procedure

Analysis Variable : chol

                N
status        Obs     N           Mean        Std Dev        Minimum        Maximum
-------------------------------------------------------------------------------------
exsmoker       33    33    233.0606061     47.6825043    134.0000000    328.0000000
smoker         43    43    237.9767442     38.5378530    155.0000000    351.0000000
-------------------------------------------------------------------------------------

(a) Would you expect these two groups of people to have statistically significantly different mean cholesterol levels, just judging by the numbers shown above? Explain briefly. You don't need to do any calculations; just eyeball the numbers.

(b) Write SAS code to carry out the appropriate two-sample t-test with a two-sided alternative hypothesis, and to calculate a 95% confidence interval for the difference in means. (You should be able to do all this in 3 lines of code.)


6. In the previous question, we used data about cholesterol levels of smokers and ex-smokers. Here we use the same data in R. Some code and output is shown below, with some questions below that. (Hint: if you can't do (a) or (c), tackle (b) or (d) by pretending that you could do (a) or (c) correctly, and describe what you think would need to be changed.)

R> cholest=read.table("smok-chol.txt",header=T)

R> head(cholest)

chol status

1 225 smoker

2 211 smoker

3 209 smoker

4 284 smoker

5 258 smoker

6 216 smoker

(a) Write a line of R code, using aggregate, that will calculate the mean cholesterol level separately for smokers and ex-smokers.

(b) How would you change the above code to calculate the standard deviation for each group?

(c) Write a line of R code to carry out a two-sample t-test to test the null that the mean cholesterol level is the same for smokers and ex-smokers, against the alternative that the mean cholesterol levels are different, without assuming that the two groups have the same (population) spread.

(d) How would you change the above code to use the assumption that the two groups have the same (population) SD, that is, to do a pooled t-test?


7. Ten students did poorly on their math exams in June, so they repeated the course in summer school and took another exam in August. (Think of these students as being a random sample of "all possible students who do poorly on the June math exams".) Here are the results:

june august

54 50

49 69

68 74

66 64

62 68

62 72

41 71

56 60

61 66

65 66

(a) Explain briefly why this is matched-pairs data rather than two separate samples.

(b) I read the data into R as shown below, and made a boxplot of the differences. Why would you have doubts about doing a matched-pairs t-test here?

R> exams=read.table("math.txt",header=T)

R> diff=exams$august-exams$june

R> boxplot(diff)

[Boxplot of diff: vertical axis with ticks from -5 to 30 in steps of 5.]

(c) I decided to do a randomization test, testing the null hypothesis that the differences have median 0. Below is the code I used. Explain briefly what the three statements in Block 1 do.

R> # block 1

R> pm=c(-1,1)

R> nsim=1000

R> rand.median=numeric(nsim)

R> # end of block 1

R> for (i in 1:nsim)

R> {

R> # block 2

R> random.pm=sample(pm,10,replace=T)

R> random.diff=diff*random.pm

R> rand.median[i]=median(random.diff)

R> # end of block 2

R> }

(d) Explain briefly what the three lines in block 2 (inside the loop) do.

(e) Below is a histogram of the randomized medians. The observed median of the differences is also shown. Where does the observed median difference fall on the randomization distribution? Do you guess that you would reject the null hypothesis that the median difference is zero, in favour of an alternative that the median difference is greater than zero? Explain briefly.

R> hist(rand.median)

R> median(diff)

[1] 5.5

[Histogram of rand.median: horizontal axis rand.median (ticks from -6 to 6), vertical axis Frequency (ticks from 0 to 150).]

(f) Based on your guessed decision to reject the null (or not), what do you conclude about the effectiveness of this summer school math course?


8. In the previous question, we looked at scores on June and August math tests for ten students. The data were these:

june august

54 50

49 69

68 74

66 64

62 68

62 72

41 71

56 60

61 66

65 66

(a) Explain briefly how you might construct a sign test here to judge whether the summer-school math course is having a positive effect.

(b) Below is a table of the cumulative binomial distribution for n = 10, p = 1/2. That is, the left column is the number of successes, and the right column is the probability of that many successes or less. Use this to obtain a P-value for your sign test.

R> cbind(0:10,pbinom(0:10,10,0.5))

[,1] [,2]

[1,] 0 0.0009765625

[2,] 1 0.0107421875

[3,] 2 0.0546875000

[4,] 3 0.1718750000

[5,] 4 0.3769531250

[6,] 5 0.6230468750

[7,] 6 0.8281250000

[8,] 7 0.9453125000

[9,] 8 0.9892578125

[10,] 9 0.9990234375

[11,] 10 1.0000000000
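
The relationship between the two columns of this table can be made concrete in R. The sketch below is an illustration only and not part of the original exam; it checks that each entry of pbinom is the running total of the individual binomial probabilities from dbinom. The object names point_prob and cum_prob are made up for the example.

```r
# Individual probabilities P(X = k) for k = 0, ..., 10 with n = 10, p = 1/2
point_prob <- dbinom(0:10, 10, 0.5)

# Cumulative probabilities P(X <= k), as in the right-hand column of the table
cum_prob <- pbinom(0:10, 10, 0.5)

# Each cumulative entry is the sum of the point probabilities up to that k
print(all.equal(cumsum(point_prob), cum_prob))

# Reading one row: the row for 2 successes gives P(X <= 2)
print(cum_prob[3])
```

Because R vectors are indexed from 1, the entry for k successes sits at position k + 1, which is why the row labels [1,] to [11,] run one ahead of the counts 0 to 10.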


9. A student has an MP3 player. It uses a lot of batteries. The student wants to know whether he can get away with using cheaper "generic" batteries rather than the more expensive "brandname" ones. In particular, is the mean battery lifetime the same for both kinds of battery, or are the means different? The student buys six sets of brandname batteries and six sets of generic batteries. Every time he needs new batteries in his MP3 player, he chooses a new set at random, and records the number of hours of playing time until the batteries die. He repeats this until all twelve sets of batteries have been used up.

The student knows how to use R, but he is not a good enough statistician to know what kind of analysis to do. So he does four analyses, and asks his friends which one they prefer. Here is what he did. First, he reads in the data and checks the values, finding that they are correct:

R> batteries=read.table("battery.txt",header=T)

R> batteries

R> attach(batteries)

brandname generic

1 194.0 190.7

2 205.5 203.5

3 199.2 203.5

4 172.4 206.5

5 184.0 222.5

6 169.5 209.4

This is his Analysis 1:

R> t.test(brandname,generic)

Welch Two Sample t-test

data: brandname and generic

t = -2.5462, df = 8.986, p-value = 0.03143

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-35.097420 -2.069246

sample estimates:

mean of x mean of y

187.4333 206.0167

Analysis 2:

R> t.test(brandname,generic,paired=T)

Paired t-test

data: brandname and generic

t = -2.1709, df = 5, p-value = 0.08205

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-40.58852 3.42185

sample estimates:

mean of the differences

-18.58333

Analysis 3:

R> t.test(brandname,generic,alternative="less")


Welch Two Sample t-test

data: brandname and generic

t = -2.5462, df = 8.986, p-value = 0.01571

alternative hypothesis: true difference in means is less than 0

95 percent confidence interval:

-Inf -5.202122

sample estimates:

mean of x mean of y

187.4333 206.0167

Analysis 4:

R> t.test(brandname,generic,paired=T,alternative="less")

Paired t-test

data: brandname and generic

t = -2.1709, df = 5, p-value = 0.04103

alternative hypothesis: true difference in means is less than 0

95 percent confidence interval:

-Inf -1.333733

sample estimates:

mean of the differences

-18.58333

(a) Which of these analyses is the best one? Explain briefly.

(b) Thus, what does the student conclude about the batteries?

(c) What kind of batteries do you think he should buy for his MP3 player in the future?


10. This question is about normal quantile plots:

(a) Take a look at the normal quantile plot shown below, of some data x. Do you think the data come from a normal distribution? Explain briefly.

R> qqnorm(x)

R> qqline(x)

[Normal Q-Q Plot of x: horizontal axis Theoretical Quantiles (ticks from -2 to 2), vertical axis Sample Quantiles (ticks from 4 to 16).]

(b) Take a look at the normal quantile plot shown below, of some data y. Do you think the data come from a normal distribution? Explain briefly. If not, what kind of distribution do you think they come from?

R> qqnorm(y)

R> qqline(y)

[Normal Q-Q Plot of y: horizontal axis Theoretical Quantiles (ticks from -2 to 2), vertical axis Sample Quantiles (ticks from 0.0 to 1.0).]

(c) Last, look at the one below. Are these data from a normal distribution? If not, describe as precisely as possible what shape of distribution it is.

R> qqnorm(z)

R> qqline(z)

[Normal Q-Q Plot of z: horizontal axis Theoretical Quantiles (ticks from -2 to 2), vertical axis Sample Quantiles (ticks from 10 to 15).]

11. A data set was collected containing various pieces of information about breakfast cereals. The data were read in as below. There were only a few cereals for which shelf was bigger than 3, and we want to remove those from the data set. (shelf is the supermarket shelf on which the brand of cereal is found, with 1 being at the top.)

SAS> data cereal;

SAS> infile 'cereals.csv' dlm=',' firstobs=2;

SAS> input name $ calories protein fat sodium fibre carbo sugars shelf;

SAS> if shelf>3 then delete;

and these were the first 10 lines of the data set:

SAS> proc print data=cereal (obs=10);

Obs name calories protein fat sodium fibre carbo sugars shelf

1 100%_Bra 70 4 1 130 10.0 5.0 6 3

2 100%_Nat 120 3 5 15 2.0 8.0 8 3

3 All-Bran 70 4 1 260 9.0 7.0 5 3

4 All-Bran 50 4 0 140 14.0 8.0 0 3

5 Almond_D 110 2 2 200 1.0 14.0 8 3

6 Apple_Ci 110 2 2 180 1.5 10.5 10 1

7 Apple_Ja 110 2 0 125 1.0 11.0 14 2

8 Basic_4 130 3 2 210 2.0 18.0 8 3

9 Bran_Che 90 2 1 200 4.0 15.0 6 1

10 Bran_Fla 90 3 0 210 5.0 13.0 5 3

(a) One question of interest is whether the sugary cereals are mostly found on one shelf. To assess this, an analysis of variance was carried out as below. The variable sugars is the amount of sugar per serving. What do you conclude from the table under The ANOVA Procedure, Dependent Variable: sugars? Express your answer in terms of sugars and shelves.

SAS> proc anova;

SAS> class shelf;

SAS> model sugars=shelf;

SAS> means shelf / tukey lines;

The ANOVA Procedure

Class Level Information

Class Levels Values

shelf 3 1 2 3

Number of Observations Read 74

Number of Observations Used 74

The ANOVA Procedure

Dependent Variable: sugars

Sum of

Source DF Squares Mean Square F Value Pr > F

Model 2 260.699789 130.349895 7.74 0.0009

Error 71 1196.394805 16.850631

Corrected Total 73 1457.094595

R-Square Coeff Var Root MSE sugars Mean

0.178918 60.63203 4.104952 6.770270

Source DF Anova SS Mean Square F Value Pr > F


shelf 2 260.6997894 130.3498947 7.74 0.0009
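
The entries in the ANOVA table above are linked by simple arithmetic. As a hedged illustration (an R sketch that is not part of the original exam, with made-up variable names and the sums of squares copied by hand from the SAS output), each mean square is a sum of squares divided by its degrees of freedom, the F value is the ratio of the two mean squares, and R-Square is the model's share of the corrected total sum of squares.

```r
# Values copied from the SAS output above (assumption: typed in by hand)
ss_model <- 260.699789;  df_model <- 2
ss_error <- 1196.394805; df_error <- 71
ss_total <- 1457.094595

ms_model <- ss_model / df_model    # "Mean Square" for Model
ms_error <- ss_error / df_error    # "Mean Square" for Error
f_value  <- ms_model / ms_error    # "F Value"

# Upper-tail area of the F distribution with (2, 71) degrees of freedom
p_value  <- pf(f_value, df_model, df_error, lower.tail = FALSE)

r_square <- ss_model / ss_total    # "R-Square"
root_mse <- sqrt(ms_error)         # "Root MSE"

print(round(c(F = f_value, p = p_value, R2 = r_square, RootMSE = root_mse), 4))
```

The printed values should match the table up to the rounding SAS applies.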

The ANOVA Procedure

Tukey's Studentized Range (HSD) Test for sugars

NOTE: This test controls the Type I experimentwise error rate, but it generally has a higher Type II error rate than REGWQ.

Alpha 0.05

Error Degrees of Freedom 71

Error Mean Square 16.85063

Critical Value of Studentized Range 3.38539

Minimum Significant Difference 2.8697

Harmonic Mean of Cell Sizes 23.45178

NOTE: Cell sizes are not equal.

Means with the same letter are not significantly different.

Tukey Grouping      Mean      N    shelf

             A     9.619     21    2

             B     6.152     33    3
             B
             B     4.800     20    1

(b) Explain briefly why it is worth performing Tukey's procedure here. (If you care, this is actually a variant of Tukey called "Tukey-Kramer" because the groups are of different sizes. This won't make a difference to your answer, though.)

(c) What do you conclude from Tukey's procedure? Is there one shelf where the highest-sugar cereals tend to be found? If so, which one?


(d) Boxplots were obtained for the sugar content of cereals on each of the three shelves, as shown below. Is there anything in the boxplot to make you doubt your conclusions from the analysis of variance? Explain briefly why or why not. (The median for group 2 is at the top of the box.)

SAS> proc sort;

SAS> by shelf;

SAS> proc boxplot;

SAS> plot sugars*shelf / boxtype=schematic;

[Side-by-side boxplots of sugars (vertical axis, ticks from -5 to 15) for shelf 1, 2 and 3 (horizontal axis).]

12. SAS's proc univariate has an annoying difference in syntax between getting a histogram and getting a normal quantile plot. Suppose you have a variable x for which you want both a histogram and a normal quantile plot. After the proc univariate line, what code would you use:

(a) to get a histogram?

(b) to get a normal quantile plot?
