last name first name student number - university of torontobutler/c32/oldexams/mid14.pdf ·...
![Page 1: Last name First name Student number - University of Torontobutler/c32/oldexams/mid14.pdf · 2016-10-08 · These are subject ID, list ID and the percent of words correct. Subject](https://reader033.vdocuments.us/reader033/viewer/2022050413/5f8a4d1f37fa3e3ae241c8f9/html5/thumbnails/1.jpg)
UNIVERSITY OF TORONTO SCARBOROUGH
Department of Computer and Mathematical Sciences
Midterm Test, October 25, 2014
STAC32 Applications of Statistical Methods
Duration: 2 hours
Last name
First name
Student number
Aids allowed:
- My lecture overheads
- My long-form lecture notes
- Any notes that you have taken in this course
- Your marked assignments and labs
- The course R text
- The course SAS text
- Non-programmable, non-communicating calculator
Before you begin, complete the signature sheet, but sign it only when the invigilator collects it. The signature sheet shows that you were present at the exam.
This exam has 25 numbered pages, including this page. Please check to see that you have all the pages.
Answer each question in the space provided (under the question). If you need more space, use the backs of the pages, but be sure to draw the marker's attention to where the rest of the answer may be found.
The maximum marks available for each part of each question are shown next to the question part. For example, question 1(a) is worth 3 marks. The exam as a whole has a maximum of 120 marks.
For marker's use only:
Page   Max   Mark
2      6
3      12
4      2
5      6
6      4
7      10
8      5
9      11
10     8
11     8
12     12
14     5
16     3
17     4
20     6
22     6
24     8
25     4
Total  120
1. Suppose we have the following data, stored in a file mydata.txt under my username on SAS Studio:
red 10 3.4
red 11 2.7
red 10 2.9
green 12 3.5
green 11 2.8
green 11 3.4
My username is ken. The three variables are called colour, x and y.
(a) (3 marks) Write SAS code to read these data in from the file into a SAS dataset named fred, and also to produce a listing of the data. (You will need a section beginning data and another one beginning proc.)
(b) (3 marks) How would you use SAS to calculate the mean values of x and y for each of the groups defined by colour? Give the code you would use.
2. The data shown below are stored in a file xy.txt, in R's current "working directory" (folder):
x y group
20 1.6 treatment
22 1.4 treatment
23 1.3 treatment
25 0.8 control
26 0.9 control
24 1.1 control
(a) (1 mark) One line of R code will read these data into a data frame called d. What one line?
(b) (2 marks) How would you use R to find the mean value of y for each group defined by group? How would your code change if you wanted to find the median instead of the mean?
(c) (2 marks) The R function quantile produces a five-number summary of whatever you feed it. How would you get a five-number summary of y for each group, and how would the result be displayed?
(d) (2 marks) How would you display the second value in the fourth row of d? The whole fourth row? Give the code you would use in each case.
(e) (3 marks) How would you select and display the rows of d where y is bigger than 1? (Use several lines of code if that's what you need. If it works, I'm good with it.)
(f) (2 marks) Give code to calculate the mean value of x for the rows of d where y is greater than 1.
3. A researcher investigated four different word lists for use in hearing assessment. She wanted to know whether the lists were equally difficult to understand in the presence of a noisy background. To find out, she tested 96 subjects with normal hearing, and randomly assigned 24 of them to each word list. She measured the number of words perceived correctly in the presence of background noise. I saved the data in hearing.txt in my home folder in SAS Studio. (My user name is ken.) The variables are called ID, list and hearing. These are subject ID, list ID and the percent of words correct. Subject ID and percent correct are numbers, while list ID is text. The data are separated by tabs. The first line of the data file contains the variable names.
(a) (2 marks) Give SAS code to read in the data and to make boxplots of the percent of words perceived correctly for each word list.
(b) (2 marks) The boxplots are as shown below.
[Figure: side-by-side boxplots of hearing (vertical axis, ticks from 10 to 50) for each list (horizontal axis: List1, List2, List3, List4).]
Question continues overleaf.
Do any of the word lists seem to be easier or harder to hear correctly? Discuss briefly.
(c) (2 marks) An analysis of variance is shown below. What code would have been run to produce it, and what do you conclude from it?
The ANOVA Procedure
Class Level Information
Class Levels Values
list 4 List1 List2 List3 List4
Number of Observations Read 96
Number of Observations Used 96
The ANOVA Procedure
Dependent Variable: hearing
Sum of
Source DF Squares Mean Square F Value Pr > F
Model 3 920.458333 306.819444 4.92 0.0033
Error 92 5738.166667 62.371377
Corrected Total 95 6658.625000
R-Square Coeff Var Root MSE hearing Mean
0.138235 27.89423 7.897555 28.31250
Source DF Anova SS Mean Square F Value Pr > F
list 3 920.4583333 306.8194444 4.92 0.0033
(d) (2 marks) Explain briefly why running Tukey's method would be a good idea here, and give the code that you would add to your previous code to run it.
Question continues overleaf.
(e) (2 marks) Shown below is the output from running Tukey's method. (The ANOVA is repeated; just ignore that.) What do you conclude from this output? (For your answer, use the space beside the Tukey output.)
The ANOVA Procedure
Class Level Information
Class Levels Values
list 4 List1 List2 List3 List4
Number of Observations Read 96
Number of Observations Used 96
The ANOVA Procedure
Dependent Variable: hearing
Sum of
Source DF Squares Mean Square F Value Pr > F
Model 3 920.458333 306.819444 4.92 0.0033
Error 92 5738.166667 62.371377
Corrected Total 95 6658.625000
R-Square Coeff Var Root MSE hearing Mean
0.138235 27.89423 7.897555 28.31250
Source DF Anova SS Mean Square F Value Pr > F
list 3 920.4583333 306.8194444 4.92 0.0033
The ANOVA Procedure
Tukey's Studentized Range (HSD) Test for hearing
NOTE: This test controls the Type I experimentwise error rate, but it generally has a higher Type
II error rate than REGWQ.
Alpha 0.05
Error Degrees of Freedom 92
Error Mean Square 62.37138
Critical Value of Studentized Range 3.70045
Minimum Significant Difference 5.9654
Means with the same letter are not significantly different.
Tukey Grouping Mean N list
A 32.750 24 List1
A
B A 29.667 24 List2
B
B 25.583 24 List4
B
B 25.250 24 List3
(f) (2 marks) Do the conclusions from Tukey's method make sense when you look back at the boxplots? Explain briefly.
4. A school district superintendent selects 8 students from each of the 15 schools in his school district, and gives each student an arithmetic test.[1] Some analysis is shown below.
R> arith=read.table("school-system.txt",header=T)
R> head(arith)
R> attach(arith)
R> arith.aov=aov(Scores~School)
R> summary(arith.aov)
R> detach(arith)
Scores School
1 18 A
2 23 A
3 21 A
4 19 A
5 22 A
6 23 A
Df Sum Sq Mean Sq F value Pr(>F)
School 14 108.8 7.771 1.074 0.39
Residuals 105 760.1 7.239
(a) (2 marks) What are the null and alternative hypotheses being tested in the analysis of variance? (State your hypotheses in terms of schools and scores.)
(b) (3 marks) What do you conclude from the analysis of variance?
(c) (2 marks) Would you run Tukey's method next, or not? Explain briefly.
(d) (3 marks) An intern ran t-tests comparing each school against each other school, and found that some of them were significant at the α = 0.05 level. Is this consistent with your conclusions from (b) and (c)? Explain briefly why or why not.
[1] This is a preliminary to a study of a new method for teaching arithmetic.
5. Golden hamsters, in the wild, hibernate. How does a hamster know when to hibernate? It is believed that the amount of daylight affects the concentration of an enzyme called Na+K+ATP-ase (and this then triggers a hibernation impulse in the hamster). A student conducted an experiment using 8 hamsters. The student randomly divided the hamsters into 2 groups of 4. The first group was raised in 16 hours of light per day ("long days"), and the second group was raised in 8 hours of light per day ("short days"). The data are as shown:
R> hamsters=read.table("hamsters.txt",header=T)
R> hamsters
daylength enzyme
1 short 12.500
2 short 11.625
3 short 18.275
4 short 13.225
5 long 6.625
6 long 10.375
7 long 9.900
8 long 8.800
(a) (2 marks) The student carried out a two-sample t-test, as shown:
R> t.test(enzyme~daylength,data=hamsters)
Welch Two Sample t-test
data: enzyme by daylength
t = -2.913, df = 4.709, p-value = 0.03576
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-9.4599866 -0.5025134
sample estimates:
mean in group long mean in group short
8.92500 13.90625
Explain briefly why a two-sample t-test is more appropriate than a paired t-test here.
(b) (2 marks) What do you conclude from the two-sample t-test, at α = 0.05?
(c) (1 mark) Do you see anything in the data that might make you doubt the conclusions from the t-test above? Explain briefly.
(d) (3 marks) A randomization test was run, with the code and results shown below.
R> a=aggregate(enzyme~daylength,hamsters,median)
R> a
R> omd=a[2,2]-a[1,2]
daylength enzyme
1 long 9.3500
2 short 12.8625
R> attach(hamsters)
R> nsim=1000
R> diff=numeric(nsim)
R> for (i in 1:nsim)
R> {
R> sday=sample(daylength,8)
R> as=aggregate(enzyme~sday,hamsters,median)
R> diff[i]=as[2,2]-as[1,2]
R> }
R> table(diff>=omd)
R> detach(hamsters)
FALSE TRUE
970 30
What are the first three lines of code, and the table below them, telling you?
(e) (2 marks) There are three lines of code inside the for (i in 1:nsim) loop. What are each of those lines doing?
(f) (2 marks) What is the table line after the loop doing? What is the P-value of this randomization test?
(g) (2 marks) What do you conclude from the randomization test?
(h) (2 marks) Do you get consistent or inconsistent conclusions from the t-test and the randomization test? Why do you think that happened? Which test would you prefer to trust? Explain briefly.
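The randomization test in parts (d) to (g) can also be written without attach(). The sketch below is one possible rewrite, not the code used on the exam: the helper med_diff, the seed, and the use of tapply and replicate are assumptions made for illustration. The data values are those listed at the start of the question.

```r
# Data as listed in the question (4 short-day and 4 long-day hamsters).
hamsters <- data.frame(
  daylength = rep(c("short", "long"), each = 4),
  enzyme = c(12.500, 11.625, 18.275, 13.225, 6.625, 10.375, 9.900, 8.800)
)

# Hypothetical helper: median of the "short" group minus median of the "long" group.
med_diff <- function(y, g) {
  m <- tapply(y, g, median)
  unname(m["short"] - m["long"])
}

# Observed difference in medians (the quantity called omd in the exam code).
obs <- med_diff(hamsters$enzyme, hamsters$daylength)

set.seed(1)   # arbitrary seed, only so the sketch is reproducible
nsim <- 1000

# Shuffle the group labels, recompute the difference, and repeat nsim times.
sim <- replicate(nsim, med_diff(hamsters$enzyme, sample(hamsters$daylength)))

# One-sided P-value: proportion of shuffled differences at least as large as observed.
p_value <- mean(sim >= obs)
p_value
```

Because the labels are shuffled at random, the P-value will vary a little from run to run and with the seed, just as the 30 out of 1000 in the exam's table would.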
6. This question will compare the power of a t-test and a corresponding sign test, when the actual population has a normal distribution. To set up some things, suppose µ is the mean of a population, and we're testing H0 : µ = 50 against Ha : µ > 50, using α = 0.05, and the population SD σ is assumed to be 25. We'll use a sample of size 35. The true mean is 62.
(a) (2 marks) Give SAS code for computing the power of this t-test. The calculated power turns out to be 0.872.
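For comparison only, the same power calculation can be sketched in R rather than SAS. This is an illustration and not the SAS answer that part (a) asks for. It uses base R's power.t.test, with the inputs (n = 35, difference 62 - 50 = 12, SD 25, one-sided α = 0.05) taken from the setup above; the variable name pw is an arbitrary choice.

```r
# Power of a one-sided, one-sample t-test of H0: mu = 50 when the true mean is 62.
# All numbers come from the question; power.t.test is in base R's stats package.
pw <- power.t.test(n = 35, delta = 62 - 50, sd = 25, sig.level = 0.05,
                   type = "one.sample", alternative = "one.sided")
pw$power
```

The result should be close to the 0.872 quoted in part (a).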
(b) (2 marks) Suppose you also want to use the sign test to test that the median is 50, against the alternative that the median is greater than 50. What would be your test statistic? Would you expect this to be larger or smaller if the alternative is true?
(c) (2 marks) If the null hypothesis is true, the test statistic will have a binomial distribution with n = 35 and p = 0.5. Explain why this is.
(d) (2 marks) Below is part of a table of this binomial distribution. The probabilities shown are the probability of obtaining a number of successes less than or equal to the value of x shown.
R> x=c(12,22)
R> p=pbinom(x,35,0.5)
R> data.frame(x,p)
x p
1 12 0.04476554
2 22 0.95523446
Which values of your test statistic would lead to rejection of the null hypothesis (that the median is 50) in favour of the alternative (that the median is greater than 50) at approximately α = 0.05?
Question continues overleaf.
(e) (2 marks) If the true mean is 62 and the population standard deviation is 25, give R code to find the probability that a single sample value exceeds 50. (This is a one-liner.) The answer turns out to be 0.684. (A reminder that the population is normally distributed.)
(f) (2 marks) If the alternative hypothesis above is true and the population median is really 62, what is the exact distribution of your test statistic from (b)?
(g) (2 marks) If the alternative hypothesis is true, what is the power of the sign test: that is, the probability of correctly rejecting the null hypothesis, using your test statistic from (b) and rejection rule from (d)? You may use the tables below:
R> x
R> p=pbinom(x,35,0.684)
R> data.frame(x,p)
[1] 12 22
x p
1 12 3.505218e-05
2 22 2.949374e-01
R> x
R> p=pbinom(x,35,0.316)
R> data.frame(x,p)
[1] 12 22
x p
1 12 0.7050626
2 22 0.9999649
(h) (2 marks) Compare the power computed for the t-test and the sign test. Which is bigger? Does this surprise you? Explain briefly. (If you didn't figure out (g), what kind of answer would you have expected to see, and why?)
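One way to see how the quantities in parts (d) to (g) fit together is the sketch below. It rests on an assumption that the exam does not state outright: the test statistic is taken to be the number of sample values above 50, and the null hypothesis is rejected when that count is 23 or more. The variable names are arbitrary; pnorm and pbinom are base R functions.

```r
# Setup from the question: normal population, mean 62, SD 25, sample size 35.
n <- 35

# Probability that a single value exceeds 50 under the alternative.
p_above <- pnorm(50, mean = 62, sd = 25, lower.tail = FALSE)

# Assumed rejection rule (an assumption of this sketch): reject H0 when the
# number of values above 50 is at least 23.
crit <- 23

# Actual size of the test: P(count >= 23) when p = 0.5.
size <- pbinom(crit - 1, n, 0.5, lower.tail = FALSE)

# Power: P(count >= 23) when each value exceeds 50 with probability p_above.
power_sign <- pbinom(crit - 1, n, p_above, lower.tail = FALSE)

round(c(p_above = p_above, size = size, power = power_sign), 3)
```

Under that rule, the rounded value 0.684 and the tables above give 1 - 0.2949 = 0.705 for the power; the sketch uses the unrounded probability, so its last digit may differ slightly.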
7. This question is about R Markdown:
(a) How do you do the following in an R Markdown document?
i. (2 marks) Start a new section called Introduction?
ii. (2 marks) Make the text see this italic?
iii. (2 marks) Make the text important bold?
iv. (3 marks) Insert an image in the current folder called myimage.png?
(b) (3 marks) Describe precisely what Knitting HTML on the R Markdown code below does.
```{r}
x=1:5
y=c(10,11,13,14,16)
plot(y~x)
```
8. The Human Development Index is a measure, for each country, of whether it has the attributes of a developed country like Canada. It is a value between 0 and 1, with a higher value being better. Do more highly-developed countries have a higher usage of cellphones? For a recent data set of Human Development Index values and cell phone use by country, the code below was run and the scatterplot shown below was obtained:
R> hdi_data=read.table("hdi.txt",header=T)
R> hdi_data[50:55,]
country hdi gdppc cell
50 Cuba 0.826 NA 7
51 SaintKittsNevis 0.825 12702 213
52 Bahamas 0.825 17843 584
53 Mexico 0.821 9803 370
54 Bulgaria 0.816 8078 609
55 Tonga 0.815 7870 NA
R> v=complete.cases(hdi_data)
R> v[50:55]
[1] FALSE TRUE TRUE TRUE TRUE FALSE
R> hdi2=hdi_data[v,]
R> attach(hdi2)
R> plot(cell~hdi)
R> lines(lowess(cell~hdi))
[Figure: scatterplot of cell (vertical axis, 0 to 1200) against hdi (horizontal axis, 0.3 to 0.9), with a lowess curve added.]
(a) (2 marks) Some countries have missing values for some of the variables. What do you think the complete.cases function does, judging from the output you see above? (I am not expecting you to have seen this function before. Just guess what you think it does.)
(b) (1 mark) What, therefore, does the data frame hdi2 contain?
(c) (2 marks) Describe the relationship that you see in the scatterplot. What is happening to the spread around the line as hdi increases?
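The complete.cases step used before plotting can be tried on a small made-up data frame. The sketch below is only an illustration: the data frame toy and its values are hypothetical and are not taken from hdi.txt, and the names keep and toy2 are arbitrary.

```r
# Hypothetical toy data frame (not the exam's data) with two missing values.
toy <- data.frame(country = c("A", "B", "C"),
                  gdppc = c(NA, 12702, 7870),
                  cell = c(7, 213, NA))

# One logical value per row: TRUE when the row has no NA in any column.
keep <- complete.cases(toy)
keep

# Use the logical vector to select only the fully observed rows.
toy2 <- toy[keep, ]
toy2
```

The same indexing pattern, hdi_data[v,], is what produces hdi2 in the code above.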
Question continues overleaf.
(d) (3 marks) The plot below shows a scatterplot of the square root of cellphone use against hdi:
R> sqrtcell=sqrt(cell)
R> plot(sqrtcell~hdi)
R> lines(lowess(sqrtcell~hdi))
[Figure: scatterplot of sqrtcell (vertical axis, 5 to 35) against hdi (horizontal axis, 0.3 to 0.9), with a lowess curve added.]
and the plot below shows a scatterplot of the (natural) logarithm of cellphone use against hdi:
R> logcell=log(cell)
R> plot(logcell~hdi)
R> lines(lowess(logcell~hdi))
[Figure: scatterplot of logcell (vertical axis, 1 to 7) against hdi (horizontal axis, 0.3 to 0.9), with a lowess curve added.]
Which plot do you think does a better job of making the data suitable for linear regression? Explain briefly. (If you don't like either plot, explain briefly why not, and describe what you would try next.)
(e) (2 marks) Suppose we wanted to predict the cellphone use for a country whose hdi was 0.80. Which one of the two results below would you use? Why is the last step of the code (in the result you chose) necessary? (If you don't think either result is relevant, explain briefly why not.) Use any handy blank space on the right to write your answer.
Result 1:
R> cell.1=lm(sqrtcell~hdi)
R> summary(cell.1)
Call:
lm(formula = sqrtcell ~ hdi)
Residuals:
Min 1Q Median 3Q Max
-12.221 -3.365 0.810 3.389 11.375
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -15.732 1.604 -9.807 <2e-16 ***
hdi 45.858 2.163 21.196 <2e-16 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 4.726 on 148 degrees of freedom
Multiple R-squared: 0.7522, Adjusted R-squared: 0.7505
F-statistic: 449.3 on 1 and 148 DF, p-value: < 2.2e-16
R> p=-15.732+45.858*0.80
R> p
[1] 20.9544
R> p^2
[1] 439.0869
Result 2:
R> cell.2=lm(logcell~hdi)
R> summary(cell.2)
Call:
lm(formula = logcell ~ hdi)
Residuals:
Min 1Q Median 3Q Max
-3.6395 -0.2964 0.1193 0.4371 1.5082
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.3386 0.2497 1.356 0.177
hdi 6.8743 0.3367 20.414 <2e-16 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 0.7356 on 148 degrees of freedom
Multiple R-squared: 0.7379, Adjusted R-squared: 0.7362
F-statistic: 416.7 on 1 and 148 DF, p-value: < 2.2e-16
R> p=0.3386+6.8743*0.80
R> p
[1] 5.83804
R> exp(p)
[1] 343.1062
(f) (2 marks) Suppose you had looked at a regression that you didn't trust (i.e. the log if you preferred square root, the square root if you preferred log, either if you didn't like either). What can you say about the plot of residuals against fitted values for that regression?
9. A manufacturer of laundry detergent was seeing how the height of the detergent suds in a standard washing machine depended on the amount of detergent added in the wash cycle. Here are the data:
R> soap=read.table("soap.txt",header=T)
R> soap
sud_height detergent
1 27.8 6
2 27.6 6
3 31.9 7
4 32.5 7
5 36.0 8
6 37.0 8
7 39.8 9
8 42.0 9
9 43.7 10
10 47.0 10
A scatterplot of the data is as shown:
R> attach(soap)
R> plot(sud_height~detergent)
[Figure: scatterplot of sud_height (vertical axis, 30 to 45) against detergent (horizontal axis, 6 to 10).]
A linear regression is fitted:
R> soap.1=lm(sud_height~detergent)
R> summary(soap.1)
Call:
lm(formula = sud_height ~ detergent)
Residuals:
Min 1Q Median 3Q Max
-1.630 -0.455 -0.030 0.445 1.670
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.3300 1.8814 0.707 0.5
detergent 4.4000 0.2316 19.000 6.1e-08 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 1.036 on 8 degrees of freedom
Multiple R-squared: 0.9783, Adjusted R-squared: 0.9756
F-statistic: 361 on 1 and 8 DF, p-value: 6.095e-08
and finally the residuals against fitted values are plotted:
R> res=residuals(soap.1)
R> fit=fitted(soap.1)
R> plot(res~fit)
R> lines(lowess(res~fit))
[Figure: plot of res (vertical axis, -1.5 to 1.5) against fit (horizontal axis, 30 to 45), with a lowess curve added.]
(a) (4 marks) Does this regression appear satisfactory? Explain briefly why or why not.
(b) (2 marks) What, if anything, would you do to improve this regression? Explain briefly.
10. A scatter plot, and the SAS code to produce it, are shown below.
SAS> data mt;
SAS> infile '/home/ken/c4.txt';
SAS> input y x;
SAS>
SAS> proc gplot;
SAS> plot y*x;
[Figure: scatterplot of y (vertical axis, 0 to 9) against x (horizontal axis, 0 to 5).]
(a) (4 marks) What does the output from the Box-Cox procedure below tell you? (This is the numerical output from the lecture notes, not the graphical output that you had in the lab.) Your answer needs to include concrete recommendations for the data analyst to follow next.
SAS> proc transreg;
SAS> model boxcox(y / lambda=0.25 to 0.4 by 0.01)=identity(x);
The TRANSREG Procedure
Box-Cox Transformation Information for y
Lambda R-Square Log Like
0.25 1.00 23.64499
0.26 1.00 24.90797
0.27 1.00 26.34920
0.28 1.00 28.02837
0.29 1.00 30.04087
0.30 1.00 32.55294
0.31 1.00 35.89372
0.32 1.00 40.85854
0.33 1.00 49.87572 <
0.34 1.00 49.11745 *
0.35 1.00 40.47234
0.36 1.00 35.65545
0.37 1.00 32.38484
0.38 1.00 29.91387
0.39 1.00 27.92873
0.40 1.00 26.26940
< - Best Lambda
* - 95% Confidence Interval
+ - Convenient Lambda
(b) (2 marks) The data analyst is considering a log transformation. Would you recommend this, or not? Explain briefly.
11. A random sample of 20 schools was taken from the US mid-Atlantic and New England states. For each school, the following variables were recorded:
- Grade 6 student verbal mean test score
- Staff salary per pupil
- Percent white-collar fathers (of grade 6 students). "White collar father" means that the father has a professional occupation.
- Socioeconomic status, mean for families of grade 6 students. (This is a composite constructed from several other variables, and is scaled to have mean zero.)
- Mean teacher's verbal test score
- Mother's number of years of education (for students in grade 6).
A regression was run predicting student's verbal test score (the first variable given above) from the other variables (in the same order as given above). The results are shown below. Note that the first column of the data frame identifies the school.
R> schools=read.table("coleman.txt",header=T)
R> head(schools)
R> attach(schools)
R> schools.1=lm(verbal.test~staff.sal+wcf+ses.dev+teacher.verbal+mother.ed)
R> summary(schools.1)
school verbal.test staff.sal wcf ses.dev teacher.verbal mother.ed
1 1 3.83 28.87 7.20 26.6 6.19 37.01
2 2 2.89 20.10 -11.71 24.4 5.17 26.51
3 3 2.86 69.05 12.32 25.7 7.04 36.51
4 4 2.92 65.40 14.28 25.7 7.10 40.70
5 5 3.06 29.59 6.31 25.4 6.15 37.10
6 6 2.07 44.82 6.16 21.6 6.41 33.90
Call:
lm(formula = verbal.test ~ staff.sal + wcf + ses.dev + teacher.verbal +
mother.ed)
Residuals:
Min 1Q Median 3Q Max
-0.6109 -0.2528 -0.0764 0.2625 0.7773
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.287673 2.954551 0.097 0.9238
staff.sal 0.006136 0.010889 0.563 0.5820
wcf 0.042504 0.033518 1.268 0.2254
ses.dev 0.229713 0.086588 2.653 0.0189 *
teacher.verbal -0.181106 0.418194 -0.433 0.6716
mother.ed -0.073156 0.050314 -1.454 0.1680
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 0.419 on 14 degrees of freedom
Multiple R-squared: 0.3728, Adjusted R-squared: 0.1488
F-statistic: 1.664 on 5 and 14 DF, p-value: 0.2078
(a) (3 marks) The variables teacher.verbal and mother.ed have negative slopes. Given the nature of the data, explain briefly why we ought to find this surprising. The explanation of these surprising results is contained in the output. Where?
(b) (1 mark) Why would you consider removing teacher.verbal from this regression? Explain briefly.
(c) (2 marks) Why would it be dangerous to remove all the non-significant explanatory variables from the regression? Explain briefly.
(d) (2 marks) I removed teacher.verbal and staff.sal from the regression, and this remained:
R> schools.2=update(schools.1,.~.-teacher.verbal-staff.sal)
R> summary(schools.2)
Call:
lm(formula = verbal.test ~ wcf + ses.dev + mother.ed)
Residuals:
Min 1Q Median 3Q Max
-0.64248 -0.24343 -0.08803 0.23900 0.72209
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.42251 1.88811 -0.224 0.8258
wcf 0.04372 0.02699 1.620 0.1248
ses.dev 0.21597 0.07858 2.748 0.0143 *
mother.ed -0.06834 0.04657 -1.467 0.1616
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 0.3963 on 16 degrees of freedom
Multiple R-squared: 0.3586, Adjusted R-squared: 0.2383
F-statistic: 2.981 on 3 and 16 DF, p-value: 0.06257
What would you do next? Explain briefly.
Question continues overleaf.
(e) (2 marks) Here are the correlations between all of the variables:
R> cor(schools[,-1])
verbal.test staff.sal wcf ses.dev teacher.verbal
verbal.test 1.0000000 0.18113980 0.2296278 0.50266385 0.1967731
staff.sal 0.1811398 1.00000000 0.8271829 0.05105812 0.9271008
wcf 0.2296278 0.82718291 1.0000000 0.18332924 0.8190633
ses.dev 0.5026638 0.05105812 0.1833292 1.00000000 0.1238087
teacher.verbal 0.1967731 0.92710081 0.8190633 0.12380866 1.0000000
mother.ed 0.1922916 0.75340081 0.9271611 0.33364951 0.7329859
mother.ed
verbal.test 0.1922916
staff.sal 0.7534008
wcf 0.9271611
ses.dev 0.3336495
teacher.verbal 0.7329859
mother.ed 1.0000000
There are some large correlations in this table, for example between mother.ed and wcf. Why are neither of these variables significant in the regression?
(f) (2 marks) Explain briefly what you learn from the output below.
R> schools.3=lm(verbal.test~ses.dev)
R> anova(schools.3,schools.1)
Analysis of Variance Table
Model 1: verbal.test ~ ses.dev
Model 2: verbal.test ~ staff.sal + wcf + ses.dev + teacher.verbal + mother.ed
Res.Df RSS Df Sum of Sq F Pr(>F)
1 18 2.9279
2 14 2.4573 4 0.47062 0.6703 0.6233