Page 1: Last name First name Student number - University of Torontobutler/c32/oldexams/mid14.pdf · 2016-10-08 · These are subject ID, list ID and the percent of words correct. Subject

UNIVERSITY OF TORONTO SCARBOROUGH

Department of Computer and Mathematical Sciences

Midterm Test, October 25, 2014

STAC32 Applications of Statistical Methods

Duration: 2 hours

Last name

First name

Student number

Aids allowed:

- My lecture overheads

- My long-form lecture notes

- Any notes that you have taken in this course

- Your marked assignments and labs

- The course R text

- The course SAS text

- Non-programmable, non-communicating calculator

Before you begin, complete the signature sheet, but sign it only when the invigilator collects it. The signature sheet shows that you were present at the exam.

This exam has 25 numbered pages, including this page. Please check to see that you have all the pages.

Answer each question in the space provided (under the question). If you need more space, use the backs of the pages, but be sure to draw the marker's attention to where the rest of the answer may be found.

The maximum marks available for each part of each question are shown next to the question part. For example, question 1(a) is worth 3 marks. The exam as a whole has a maximum of 120 marks.

For marker's use only:

Page    Max    Mark
2       6
3       12
4       2
5       6
6       4
7       10
8       5
9       11
10      8
11      8
12      12
14      5
16      3
17      4
20      6
22      6
24      8
25      4
Total   120


1. Suppose we have the following data, stored in a file mydata.txt under my username on SAS Studio:

red 10 3.4

red 11 2.7

red 10 2.9

green 12 3.5

green 11 2.8

green 11 3.4

My username is ken. The three variables are called colour, x and y.

(a) (3 marks) Write SAS code to read these data in from the file into a SAS dataset named fred, and also to produce a listing of the data. (You will need a section beginning data and another one beginning proc.)

(b) (3 marks) How would you use SAS to calculate the mean values of x and y for each of the groups defined by colour? Give the code you would use.


2. The data shown below are stored in a file xy.txt, in R's current "working directory" (folder):

x y group

20 1.6 treatment

22 1.4 treatment

23 1.3 treatment

25 0.8 control

26 0.9 control

24 1.1 control

(a) (1 mark) One line of R code will read these data into a data frame called d. What one line?

(b) (2 marks) How would you use R to find the mean value of y for each group defined by group? How would your code change if you wanted to find the median instead of the mean?

(c) (2 marks) The R function quantile produces a five-number summary of whatever you feed it. How would you get a five-number summary of y for each group, and how would the result be displayed?

(d) (2 marks) How would you display the second value in the fourth row of d? The whole fourth row? Give the code you would use in each case.

(e) (3 marks) How would you select and display the rows of d where y is bigger than 1? (Use several lines of code if that's what you need. If it works, I'm good with it.)

(f) (2 marks) Give code to calculate the mean value of x for the rows of d where y is greater than 1.
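
As a hedged illustration of the data-frame operations that question 2 is about, the sketch below rebuilds the six rows shown above inline, so it does not depend on xy.txt being present (the exam itself reads the file from disk). It uses base R indexing and aggregate, and is one possible approach rather than a marking scheme. The names group_means, big_y and mean_x_big_y are made up for the example.

```r
# Rebuild the data shown in question 2 inline; the exam reads it from xy.txt.
d <- read.table(header = TRUE, text = "
x y group
20 1.6 treatment
22 1.4 treatment
23 1.3 treatment
25 0.8 control
26 0.9 control
24 1.1 control
")

# Mean of y for each group; swap mean for median (or quantile) as needed.
group_means <- aggregate(y ~ group, data = d, FUN = mean)
print(group_means)

# Second value in the fourth row, then the whole fourth row.
print(d[4, 2])
print(d[4, ])

# Rows where y is bigger than 1, and the mean of x over those rows.
big_y <- d[d$y > 1, ]
print(big_y)
mean_x_big_y <- mean(d$x[d$y > 1])
print(mean_x_big_y)
```

Logical indexing (d$y > 1) is used here because it handles both the row selection and the conditional mean with the same condition.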


3. A researcher investigated four different word lists for use in hearing assessment. She wanted to know whether the lists were equally difficult to understand in the presence of a noisy background. To find out, she tested 96 subjects with normal hearing, and randomly assigned 24 of them to each word list. She measured the number of words perceived correctly in the presence of background noise. I saved the data in hearing.txt in my home folder in SAS Studio. (My user name is ken.) The variables are called ID, list and hearing. These are subject ID, list ID and the percent of words correct. Subject ID and percent correct are numbers, while list ID is text. The data are separated by tabs. The first line of the data file contains the variable names.

(a) (2 marks) Give SAS code to read in the data and to make boxplots of the percent of words perceived correctly for each word list.

(b) (2 marks) The boxplots are as shown below.

[Figure: side-by-side boxplots of hearing (vertical axis, 10 to 50) for each value of list (List1, List2, List3, List4).]


Do any of the word lists seem to be easier or harder to hear correctly? Discuss briefly.

(c) (2 marks) An analysis of variance is shown below. What code would have been run to produce it, and what do you conclude from it?

The ANOVA Procedure

Class Level Information

Class Levels Values

list 4 List1 List2 List3 List4

Number of Observations Read 96

Number of Observations Used 96

The ANOVA Procedure

Dependent Variable: hearing

Sum of

Source DF Squares Mean Square F Value Pr > F

Model 3 920.458333 306.819444 4.92 0.0033

Error 92 5738.166667 62.371377

Corrected Total 95 6658.625000

R-Square Coeff Var Root MSE hearing Mean

0.138235 27.89423 7.897555 28.31250

Source DF Anova SS Mean Square F Value Pr > F

list 3 920.4583333 306.8194444 4.92 0.0033

(d) (2 marks) Explain briefly why running Tukey's method would be a good idea here, and give the code that you would add to your previous code to run it.


(e) (2 marks) Shown below is the output from running Tukey's method. (The ANOVA is repeated; just ignore that.) What do you conclude from this output? (For your answer, use the space beside the Tukey output.)

The ANOVA Procedure

Class Level Information

Class Levels Values

list 4 List1 List2 List3 List4

Number of Observations Read 96

Number of Observations Used 96

The ANOVA Procedure

Dependent Variable: hearing

Sum of

Source DF Squares Mean Square F Value Pr > F

Model 3 920.458333 306.819444 4.92 0.0033

Error 92 5738.166667 62.371377

Corrected Total 95 6658.625000

R-Square Coeff Var Root MSE hearing Mean

0.138235 27.89423 7.897555 28.31250

Source DF Anova SS Mean Square F Value Pr > F

list 3 920.4583333 306.8194444 4.92 0.0033

The ANOVA Procedure

Tukey's Studentized Range (HSD) Test for hearing

NOTE: This test controls the Type I experimentwise error rate, but it generally has a higher Type II error rate than REGWQ.

Alpha 0.05

Error Degrees of Freedom 92

Error Mean Square 62.37138

Critical Value of Studentized Range 3.70045

Minimum Significant Difference 5.9654

Means with the same letter are not significantly different.

Tukey Grouping      Mean     N    list

     A            32.750    24    List1
     A
B    A            29.667    24    List2
B
B                 25.583    24    List4
B
B                 25.250    24    List3

(f) (2 marks) Do the conclusions from Tukey's method make sense when you look back at the boxplots? Explain briefly.
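
The "Minimum Significant Difference" in the Tukey output can be reproduced from the other printed quantities. The sketch below is an R illustration only (the exam's analysis was run in SAS): it copies the error mean square, group size, critical value and group means from the output above, and assumes the usual equal-group-size formula, critical value times sqrt(MSE / n). The variable names are invented for the example.

```r
# Values copied from the SAS output in question 3.
mse <- 62.37138          # Error Mean Square
n_per_group <- 24        # subjects per word list
q_crit <- 3.70045        # Critical Value of Studentized Range (alpha = 0.05)

# Minimum significant difference for equal group sizes.
msd <- q_crit * sqrt(mse / n_per_group)
print(msd)

# The critical value itself can be looked up with qtukey (4 groups, 92 error df).
print(qtukey(0.95, nmeans = 4, df = 92))

# Compare every pair of list means against the minimum significant difference.
means <- c(List1 = 32.750, List2 = 29.667, List4 = 25.583, List3 = 25.250)
gaps <- abs(outer(means, means, "-"))
print(gaps > msd)
```

Pairs of lists whose means differ by more than msd are the ones that share no letter in the Tukey grouping column.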


4. A school district superintendent selects 8 students from each of the 15 schools in his school district, and gives each student an arithmetic test (see footnote 1). Some analysis is shown below.

R> arith=read.table("school-system.txt",header=T)

R> head(arith)

R> attach(arith)

R> arith.aov=aov(Scores~School)

R> summary(arith.aov)

R> detach(arith)

Scores School

1 18 A

2 23 A

3 21 A

4 19 A

5 22 A

6 23 A

Df Sum Sq Mean Sq F value Pr(>F)

School 14 108.8 7.771 1.074 0.39

Residuals 105 760.1 7.239

(a) (2 marks) What are the null and alternative hypotheses being tested in the analysis of variance? (State your hypotheses in terms of schools and scores.)

(b) (3 marks) What do you conclude from the analysis of variance?

(c) (2 marks) Would you run Tukey's method next, or not? Explain briefly.

(d) (3 marks) An intern ran t-tests comparing each school against each other school, and found that some of them were significant at the α = 0.05 level. Is this consistent with your conclusions from (b) and (c)? Explain briefly why or why not.
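
For part (d), it can help to count how many school-versus-school comparisons there are. The sketch below is a small arithmetic illustration, not part of the original exam: it rebuilds the F statistic and P-value from the mean squares printed above, then counts the pairwise tests among 15 schools and the number expected to come out significant at α = 0.05 if every school had the same mean. The variable names are invented for the example.

```r
# F statistic and P-value rebuilt from the ANOVA table in question 4.
f_stat <- 7.771 / 7.239
p_value <- pf(f_stat, df1 = 14, df2 = 105, lower.tail = FALSE)
print(c(F = f_stat, P = p_value))

# Number of distinct school-versus-school t-tests among 15 schools.
n_pairs <- choose(15, 2)

# Expected number of tests rejecting at alpha = 0.05 when every null is true.
# (Expectations add up even though the tests are not independent.)
expected_false_positives <- 0.05 * n_pairs
print(c(pairs = n_pairs, expected = expected_false_positives))
```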

Footnote 1: This is a preliminary to a study of a new method for teaching arithmetic.


5. Golden hamsters, in the wild, hibernate. How does a hamster know when to hibernate? It is believed that the amount of daylight affects the concentration of an enzyme called Na+K+ATP-ase (and this then triggers a hibernation impulse in the hamster). A student conducted an experiment using 8 hamsters. The student randomly divided the hamsters into 2 groups of 4. The first group was raised in 16 hours of light per day ("long days"), and the second group was raised in 8 hours of light per day ("short days"). The data are as shown:

R> hamsters=read.table("hamsters.txt",header=T)

R> hamsters

daylength enzyme

1 short 12.500

2 short 11.625

3 short 18.275

4 short 13.225

5 long 6.625

6 long 10.375

7 long 9.900

8 long 8.800

(a) (2 marks) The student carried out a two-sample t-test, as shown:

R> t.test(enzyme~daylength,data=hamsters)

Welch Two Sample t-test

data: enzyme by daylength

t = -2.913, df = 4.709, p-value = 0.03576

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-9.4599866 -0.5025134

sample estimates:

mean in group long mean in group short

8.92500 13.90625

Explain briefly why a two-sample t-test is more appropriate than a paired t-test here.

(b) (2 marks) What do you conclude from the two-sample t-test, at α = 0.05?

(c) (1 mark) Do you see anything in the data that might make you doubt the conclusions from the t-test above? Explain briefly.


(d) (3 marks) A randomization test was run, with the code and results shown below.

R> a=aggregate(enzyme~daylength,hamsters,median)

R> a

R> omd=a[2,2]-a[1,2]

daylength enzyme

1 long 9.3500

2 short 12.8625

R> attach(hamsters)

R> nsim=1000

R> diff=numeric(nsim)

R> for (i in 1:nsim)

R> {

R> sday=sample(daylength,8)

R> as=aggregate(enzyme~sday,hamsters,median)

R> diff[i]=as[2,2]-as[1,2]

R> }

R> table(diff>=omd)

R> detach(hamsters)

FALSE TRUE

970 30

What are the first three lines of code, and the table below them, telling you?

(e) (2 marks) There are three lines of code inside the for (i in 1:nsim) loop. What are each of those lines doing?

(f) (2 marks) What is the table line after the loop doing? What is the P-value of this randomization test?

(g) (2 marks) What do you conclude from the randomization test?

(h) (2 marks) Do you get consistent or inconsistent conclusions from the t-test and the randomization test? Why do you think that happened? Which test would you prefer to trust? Explain briefly.
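
With only 8 hamsters there are just choose(8, 4) = 70 ways to decide which four are labelled "short", so the randomization distribution can also be enumerated exactly instead of sampled. The sketch below is a variation on the exam's simulation, not the code that was actually run: it uses the enzyme values listed above, the same difference-in-medians statistic, and combn to list every relabelling. The names splits, diffs, n_extreme and p_exact are invented for the example.

```r
# Hamster data from question 5.
enzyme <- c(12.500, 11.625, 18.275, 13.225, 6.625, 10.375, 9.900, 8.800)
daylength <- rep(c("short", "long"), each = 4)

# Observed difference in medians (short minus long), as in the exam's omd.
omd <- median(enzyme[daylength == "short"]) - median(enzyme[daylength == "long"])

# Every way of choosing which 4 of the 8 hamsters are labelled "short".
splits <- combn(8, 4)

# Difference in medians for each relabelling.
diffs <- apply(splits, 2, function(idx) median(enzyme[idx]) - median(enzyme[-idx]))

# One-sided exact P-value; the small tolerance guards against floating-point ties.
n_extreme <- sum(diffs >= omd - 1e-9)
p_exact <- n_extreme / ncol(splits)
print(c(omd = omd, splits = ncol(splits), extreme = n_extreme, p = p_exact))
```

With these data the enumeration finds 2 of the 70 relabellings at least as extreme as the observed one, which is close to the 30 out of 1000 that the exam's simulation produced.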


6. This question will compare the power of a t-test and a corresponding sign test, when the actual population has a normal distribution. To set up some things, suppose µ is the mean of a population, and we're testing H0: µ = 50 against Ha: µ > 50, using α = 0.05, and the population SD σ is assumed to be 25. We'll use a sample of size 35. The true mean is 62.

(a) (2 marks) Give SAS code for computing the power of this t-test. The calculated power turns out to be 0.872.

(b) (2 marks) Suppose you also want to use the sign test to test that the median is 50, against the alternative that the median is greater than 50. What would be your test statistic? Would you expect this to be larger or smaller if the alternative is true?

(c) (2 marks) If the null hypothesis is true, the test statistic will have a binomial distribution with n = 35 and p = 0.5. Explain why this is.

(d) (2 marks) Below is part of a table of this binomial distribution. The probabilities shown are the probability of obtaining a number of successes less than or equal to the value of x shown.

R> x=c(12,22)

R> p=pbinom(x,35,0.5)

R> data.frame(x,p)

x p

1 12 0.04476554

2 22 0.95523446

Which values of your test statistic would lead to rejection of the null hypothesis (that the median is 50) in favour of the alternative (that the median is greater than 50) at approximately α = 0.05?


(e) (2 marks) If the true mean is 62 and the population standard deviation is 25, give R code to find the probability that a single sample value exceeds 50. (This is a one-liner.) The answer turns out to be 0.684. (A reminder that the population is normally distributed.)

(f) (2 marks) If the alternative hypothesis above is true and the population median is really 62, what is the exact distribution of your test statistic from (b)?

(g) (2 marks) If the alternative hypothesis is true, what is the power of the sign test: that is, the probability of correctly rejecting the null hypothesis, using your test statistic from (b) and rejection rule from (d)? You may use the tables below:

R> x

R> p=pbinom(x,35,0.684)

R> data.frame(x,p)

[1] 12 22

x p

1 12 3.505218e-05

2 22 2.949374e-01

R> x

R> p=pbinom(x,35,0.316)

R> data.frame(x,p)

[1] 12 22

x p

1 12 0.7050626

2 22 0.9999649

(h) (2 marks) Compare the power computed for the t-test and the sign test. Which is bigger? Does this surprise you? Explain briefly. (If you didn't figure out (g), what kind of answer would you have expected to see, and why?)
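
The chain of calculations in question 6 can be tied together in a few lines of R. The sketch below is one possible way to do it, under two assumptions that are not stated in the exam: the sign test statistic is taken to be the number of sample values above 50, and the rejection rule is "23 or more". It also uses R's power.t.test in place of the SAS power calculation the exam asks for. All variable names are invented for the example.

```r
n <- 35          # sample size
mu0 <- 50        # null mean (and median)
mu_true <- 62    # true mean
sigma <- 25      # population SD

# Under H0 the count of values above 50 is Binomial(35, 0.5).
# Rejecting when the count is 23 or more gives a size just under 0.05.
size <- 1 - pbinom(22, n, 0.5)

# Chance that one normal observation exceeds 50 when the true mean is 62.
p_above <- pnorm(mu0, mean = mu_true, sd = sigma, lower.tail = FALSE)

# Power of the sign test, using the rounded 0.684 as in the exam's table.
power_sign <- 1 - pbinom(22, n, 0.684)

# Power of the one-sided, one-sample t-test for comparison.
power_t <- power.t.test(n = n, delta = mu_true - mu0, sd = sigma,
                        sig.level = 0.05, type = "one.sample",
                        alternative = "one.sided")$power

print(c(size = size, p_above = p_above, power_sign = power_sign, power_t = power_t))
```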


7. This question is about R Markdown:

(a) How do you do the following in an R Markdown document?

i. (2 marks) Start a new section called Introduction?

ii. (2 marks) Make the text see this italic?

iii. (2 marks) Make the text important bold?

iv. (3 marks) Insert an image in the current folder called myimage.png?

(b) (3 marks) Describe precisely what Knitting HTML on the R Markdown code below does.

```{r}

x=1:5

y=c(10,11,13,14,16)

plot(y~x)

```


8. The Human Development Index is a measure, for each country, of whether it has the attributes of a developed country like Canada. It is a value between 0 and 1, with a higher value being better. Do more highly-developed countries have a higher usage of cellphones? For a recent data set of Human Development Index values and cell phone use by country, the code below was run and the scatterplot shown below was obtained:

R> hdi_data=read.table("hdi.txt",header=T)

R> hdi_data[50:55,]

country hdi gdppc cell

50 Cuba 0.826 NA 7

51 SaintKittsNevis 0.825 12702 213

52 Bahamas 0.825 17843 584

53 Mexico 0.821 9803 370

54 Bulgaria 0.816 8078 609

55 Tonga 0.815 7870 NA

R> v=complete.cases(hdi_data)

R> v[50:55]

[1] FALSE TRUE TRUE TRUE TRUE FALSE

R> hdi2=hdi_data[v,]

R> attach(hdi2)

R> plot(cell~hdi)

R> lines(lowess(cell~hdi))


[Figure: scatterplot of cell (vertical axis, 0 to 1200) against hdi (horizontal axis, 0.3 to 0.9), with a lowess curve added.]

(a) (2 marks) Some countries have missing values for some of the variables. What do you think the complete.cases function does, judging from the output you see above? (I am not expecting you to have seen this function before. Just guess what you think it does.)

(b) (1 mark) What, therefore, does the data frame hdi2 contain?
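
To experiment with complete.cases without the full hdi.txt file, the six rows printed above can be typed in directly. The sketch below is only an illustration on those six rows; hdi_rows and kept are names invented for the example.

```r
# The six rows (50 to 55) shown in the exam output, entered by hand.
hdi_rows <- data.frame(
  country = c("Cuba", "SaintKittsNevis", "Bahamas", "Mexico", "Bulgaria", "Tonga"),
  hdi = c(0.826, 0.825, 0.825, 0.821, 0.816, 0.815),
  gdppc = c(NA, 12702, 17843, 9803, 8078, 7870),
  cell = c(7, 213, 584, 370, 609, NA)
)

# One logical value per row, as in v[50:55] above.
v <- complete.cases(hdi_rows)
print(v)

# Row selection with the logical vector, as in hdi2 = hdi_data[v, ].
kept <- hdi_rows[v, ]
print(kept)
```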

(c) (2 marks) Describe the relationship that you see in the scatterplot. What is happening to the spread around the line as hdi increases?


(d) (3 marks) The plot below shows a scatterplot of the square root of cellphone use against hdi:

R> sqrtcell=sqrt(cell)

R> plot(sqrtcell~hdi)

R> lines(lowess(sqrtcell~hdi))

[Figure: scatterplot of sqrtcell (vertical axis, 5 to 35) against hdi (horizontal axis, 0.3 to 0.9), with a lowess curve added.]

and the plot below shows a scatterplot of the (natural) logarithm of cellphone use against hdi:

R> logcell=log(cell)

R> plot(logcell~hdi)

R> lines(lowess(logcell~hdi))


[Figure: scatterplot of logcell (vertical axis, 1 to 7) against hdi (horizontal axis, 0.3 to 0.9), with a lowess curve added.]

Which plot do you think does a better job of making the data suitable for linear regression? Explain briefly. (If you don't like either plot, explain briefly why not, and describe what you would try next.)

(e) (2 marks) Suppose we wanted to predict the cellphone use for a country whose hdi was 0.80. Which one of the two results below would you use? Why is the last step of the code (in the result you chose) necessary? (If you don't think either result is relevant, explain briefly why not.) Use any handy blank space on the right to write your answer.

Result 1:

R> cell.1=lm(sqrtcell~hdi)

R> summary(cell.1)


Call:

lm(formula = sqrtcell ~ hdi)

Residuals:

Min 1Q Median 3Q Max

-12.221 -3.365 0.810 3.389 11.375

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -15.732 1.604 -9.807 <2e-16 ***

hdi 45.858 2.163 21.196 <2e-16 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.726 on 148 degrees of freedom

Multiple R-squared: 0.7522, Adjusted R-squared: 0.7505

F-statistic: 449.3 on 1 and 148 DF, p-value: < 2.2e-16

R> p=-15.732+45.858*0.80

R> p

[1] 20.9544

R> p^2

[1] 439.0869

Result 2:

R> cell.2=lm(logcell~hdi)

R> summary(cell.2)

Call:

lm(formula = logcell ~ hdi)

Residuals:

Min 1Q Median 3Q Max

-3.6395 -0.2964 0.1193 0.4371 1.5082

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.3386 0.2497 1.356 0.177

hdi 6.8743 0.3367 20.414 <2e-16 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7356 on 148 degrees of freedom

Multiple R-squared: 0.7379, Adjusted R-squared: 0.7362

F-statistic: 416.7 on 1 and 148 DF, p-value: < 2.2e-16

R> p=0.3386+6.8743*0.80

R> p

[1] 5.83804

R> exp(p)

[1] 343.1062

(f) (2 marks) Suppose you had looked at a regression that you didn't trust (i.e. the log if you preferred square root, the square root if you preferred log, either if you didn't like either). What can you say about the plot of residuals against fitted values for that regression?


9. A manufacturer of laundry detergent was seeing how the height of the detergent suds in a standard washing machine depended on the amount of detergent added in the wash cycle. Here are the data:

R> soap=read.table("soap.txt",header=T)

R> soap

sud_height detergent

1 27.8 6

2 27.6 6

3 31.9 7

4 32.5 7

5 36.0 8

6 37.0 8

7 39.8 9

8 42.0 9

9 43.7 10

10 47.0 10

A scatterplot of the data is as shown:

R> attach(soap)

R> plot(sud_height~detergent)


[Figure: scatterplot of sud_height (vertical axis, 30 to 45) against detergent (horizontal axis, 6 to 10).]

A linear regression is fitted:

R> soap.1=lm(sud_height~detergent)

R> summary(soap.1)

Call:

lm(formula = sud_height ~ detergent)

Residuals:

Min 1Q Median 3Q Max

-1.630 -0.455 -0.030 0.445 1.670

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.3300 1.8814 0.707 0.5

detergent 4.4000 0.2316 19.000 6.1e-08 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Residual standard error: 1.036 on 8 degrees of freedom

Multiple R-squared: 0.9783, Adjusted R-squared: 0.9756

F-statistic: 361 on 1 and 8 DF, p-value: 6.095e-08

and finally the residuals against fitted values are plotted:

R> res=residuals(soap.1)

R> fit=fitted(soap.1)

R> plot(res~fit)

R> lines(lowess(res~fit))

[Figure: plot of res (vertical axis, -1.5 to 1.5) against fit (horizontal axis, 30 to 45), with a lowess curve added.]

(a) (4 marks) Does this regression appear satisfactory? Explain briefly why or why not.

(b) (2 marks) What, if anything, would you do to improve this regression? Explain briefly.
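
Because all ten observations are listed above, the fit and its residuals can be rebuilt without soap.txt. The sketch below enters the data by hand, refits the same model, and tabulates the residuals numerically; the per-level summary at the end is an addition for illustration and was not part of the exam's output. The names level_mean and level_range are invented for the example.

```r
# The ten observations printed in question 9, entered by hand.
soap <- data.frame(
  sud_height = c(27.8, 27.6, 31.9, 32.5, 36.0, 37.0, 39.8, 42.0, 43.7, 47.0),
  detergent = rep(6:10, each = 2)
)

# Same model as soap.1 in the exam.
soap.1 <- lm(sud_height ~ detergent, data = soap)
print(coef(soap.1))

# Fitted values and residuals side by side.
res <- residuals(soap.1)
fit <- fitted(soap.1)
print(round(cbind(fit, res), 3))

# Mean and range of the residuals at each detergent level.
level_mean <- tapply(res, soap$detergent, mean)
level_range <- tapply(res, soap$detergent, function(r) diff(range(r)))
print(round(rbind(level_mean, level_range), 3))
```

Calling plot(res ~ fit) followed by lines(lowess(res ~ fit)) on these objects redraws the residual plot used in the question.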


10. A scatter plot, and the SAS code to produce it, are shown below.

SAS> data mt;

SAS> infile '/home/ken/c4.txt';

SAS> input y x;

SAS>

SAS> proc gplot;

SAS> plot y*x;

[Figure: scatterplot of y (vertical axis, 0 to 9) against x (horizontal axis, 0 to 5).]

(a) (4 marks) What does the output from the Box-Cox procedure below tell you? (This is the numerical output from the lecture notes, not the graphical output that you had in the lab.) Your answer needs to include concrete recommendations for the data analyst to follow next.

SAS> proc transreg;

SAS> model boxcox(y / lambda=0.25 to 0.4 by 0.01)=identity(x);

The TRANSREG Procedure

Box-Cox Transformation Information for y

Lambda R-Square Log Like

0.25 1.00 23.64499


0.26 1.00 24.90797

0.27 1.00 26.34920

0.28 1.00 28.02837

0.29 1.00 30.04087

0.30 1.00 32.55294

0.31 1.00 35.89372

0.32 1.00 40.85854

0.33 1.00 49.87572 <

0.34 1.00 49.11745 *

0.35 1.00 40.47234

0.36 1.00 35.65545

0.37 1.00 32.38484

0.38 1.00 29.91387

0.39 1.00 27.92873

0.40 1.00 26.26940

< - Best Lambda

* - 95% Confidence Interval

+ - Convenient Lambda

(b) (2 marks) The data analyst is considering a log transformation. Would you recommend this, or not? Explain briefly.
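
An R analogue of the Box-Cox search can be sketched with boxcox from the MASS package. The file c4.txt is not reproduced in this exam, so the data below are simulated purely for illustration: y is built as the cube of a noisy straight line in x, which is an assumption chosen so that the best power should land near 1/3. The names bc, best_lambda and loglik_gap are invented for the example.

```r
library(MASS)  # provides boxcox(); a recommended package shipped with standard R installs

# Simulated stand-in for c4.txt (an assumption, not the exam's data):
# the cube root of y is a straight line in x plus a little noise.
set.seed(1)
x <- seq(0.1, 5, length.out = 50)
y <- (1 + 0.2 * x + rnorm(50, sd = 0.005))^3

# Profile log-likelihood over a grid of lambda values, without drawing the plot.
bc <- boxcox(y ~ x, lambda = seq(0, 0.6, by = 0.01), plotit = FALSE)

# Lambda with the largest log-likelihood.
best_lambda <- bc$x[which.max(bc$y)]
print(best_lambda)

# How far the log transformation (lambda = 0) falls below the best lambda.
loglik_gap <- max(bc$y) - bc$y[1]
print(loglik_gap)
```

In the Box-Cox family a lambda of 0 corresponds to the log transformation, which is why the first grid point is singled out for comparison.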


11. A random sample of 20 schools was taken from the US mid-Atlantic and New England states. For each school, the following variables were recorded:

- grade 6 student verbal mean test score

- staff salary per pupil

- percent white-collar fathers (of grade 6 students). "White collar father" means that the father has a professional occupation.

- Socioeconomic status, mean for families of grade 6 students. (This is a composite constructed from several other variables, and is scaled to have mean zero.)

- Mean teacher's verbal test score

- Mother's number of years of education (for students in grade 6).

A regression was run predicting student's verbal test score (the first variable given above) from the other variables (in the same order as given above). The results are shown below. Note that the first column of the data frame identifies the school.

R> schools=read.table("coleman.txt",header=T)

R> head(schools)

R> attach(schools)

R> schools.1=lm(verbal.test~staff.sal+wcf+ses.dev+teacher.verbal+mother.ed)

R> summary(schools.1)

school verbal.test staff.sal wcf ses.dev teacher.verbal mother.ed

1 1 3.83 28.87 7.20 26.6 6.19 37.01

2 2 2.89 20.10 -11.71 24.4 5.17 26.51

3 3 2.86 69.05 12.32 25.7 7.04 36.51

4 4 2.92 65.40 14.28 25.7 7.10 40.70

5 5 3.06 29.59 6.31 25.4 6.15 37.10

6 6 2.07 44.82 6.16 21.6 6.41 33.90

Call:

lm(formula = verbal.test ~ staff.sal + wcf + ses.dev + teacher.verbal +

mother.ed)

Residuals:

Min 1Q Median 3Q Max

-0.6109 -0.2528 -0.0764 0.2625 0.7773

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.287673 2.954551 0.097 0.9238

staff.sal 0.006136 0.010889 0.563 0.5820

wcf 0.042504 0.033518 1.268 0.2254

ses.dev 0.229713 0.086588 2.653 0.0189 *

teacher.verbal -0.181106 0.418194 -0.433 0.6716

mother.ed -0.073156 0.050314 -1.454 0.1680

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.419 on 14 degrees of freedom

Multiple R-squared: 0.3728, Adjusted R-squared: 0.1488

F-statistic: 1.664 on 5 and 14 DF, p-value: 0.2078


(a) (3 marks) The variables teacher.verbal and mother.ed have negative slopes. Given the nature of the data, explain briefly why we ought to find this surprising. The explanation of these surprising results is contained in the output. Where?

(b) (1 mark) Why would you consider removing teacher.verbal from this regression? Explain briefly.

(c) (2 marks) Why would it be dangerous to remove all the non-significant explanatory variables from the regression? Explain briefly.

(d) (2 marks) I removed teacher.verbal and staff.sal from the regression, and this remained:

R> schools.2=update(schools.1,.~.-teacher.verbal-staff.sal)

R> summary(schools.2)

Call:

lm(formula = verbal.test ~ wcf + ses.dev + mother.ed)

Residuals:

Min 1Q Median 3Q Max

-0.64248 -0.24343 -0.08803 0.23900 0.72209

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -0.42251 1.88811 -0.224 0.8258

wcf 0.04372 0.02699 1.620 0.1248

ses.dev 0.21597 0.07858 2.748 0.0143 *

mother.ed -0.06834 0.04657 -1.467 0.1616

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3963 on 16 degrees of freedom

Multiple R-squared: 0.3586, Adjusted R-squared: 0.2383

F-statistic: 2.981 on 3 and 16 DF, p-value: 0.06257

What would you do next? Explain briefly.
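
One way to compare the two summaries is through adjusted R-squared, which both outputs report. The sketch below recomputes it from the printed multiple R-squared values using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1), with n = 20 schools. The helper adj_r2 is a name invented for the example, not something from the course materials.

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p.
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# schools.1: five explanatory variables; schools.2: three.
adj_full <- adj_r2(0.3728, n = 20, p = 5)
adj_reduced <- adj_r2(0.3586, n = 20, p = 3)
print(round(c(full = adj_full, reduced = adj_reduced), 4))
```

The two results agree with the Adjusted R-squared lines in the summaries (0.1488 and 0.2383), so dropping the two variables lowered R-squared slightly while raising its adjusted version.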


(e) (2 marks) Here are the correlations between all of the variables:

R> cor(schools[,-1])

verbal.test staff.sal wcf ses.dev teacher.verbal

verbal.test 1.0000000 0.18113980 0.2296278 0.50266385 0.1967731

staff.sal 0.1811398 1.00000000 0.8271829 0.05105812 0.9271008

wcf 0.2296278 0.82718291 1.0000000 0.18332924 0.8190633

ses.dev 0.5026638 0.05105812 0.1833292 1.00000000 0.1238087

teacher.verbal 0.1967731 0.92710081 0.8190633 0.12380866 1.0000000

mother.ed 0.1922916 0.75340081 0.9271611 0.33364951 0.7329859

mother.ed

verbal.test 0.1922916

staff.sal 0.7534008

wcf 0.9271611

ses.dev 0.3336495

teacher.verbal 0.7329859

mother.ed 1.0000000

There are some large correlations in this table, for example between mother.ed and wcf. Why is neither of these variables significant in the regression?

(f) (2 marks) Explain briefly what you learn from the output below.

R> schools.3=lm(verbal.test~ses.dev)

R> anova(schools.3,schools.1)

Analysis of Variance Table

Model 1: verbal.test ~ ses.dev

Model 2: verbal.test ~ staff.sal + wcf + ses.dev + teacher.verbal + mother.ed

Res.Df RSS Df Sum of Sq F Pr(>F)

1 18 2.9279

2 14 2.4573 4 0.47062 0.6703 0.6233
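
The F statistic in this table can be rebuilt by hand from the two residual sums of squares and their degrees of freedom. The sketch below is a small arithmetic check using only the numbers printed above; the variable names are invented for the example.

```r
# Residual sums of squares and residual degrees of freedom from the anova table.
rss_small <- 2.9279   # Model 1: verbal.test ~ ses.dev
df_small <- 18
rss_full <- 2.4573    # Model 2: all five explanatory variables
df_full <- 14

# Drop in RSS per extra parameter, relative to the full model's error variance.
f_stat <- ((rss_small - rss_full) / (df_small - df_full)) / (rss_full / df_full)

# Upper-tail P-value on (4, 14) degrees of freedom.
p_value <- pf(f_stat, df_small - df_full, df_full, lower.tail = FALSE)
print(round(c(F = f_stat, P = p_value), 4))
```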
