
Assignment 6

Due Thursday November 2 at 11:59pm on Blackboard

As before, the questions without solutions are an assignment: you need to do these questions yourself and hand them in.

The assignment is due on the date shown above. An assignment handed in after the deadline is late, and may or may not be accepted (see course outline). My solutions to the assignment questions will be available when everyone has handed in their assignment.

You are reminded that work handed in with your name on it must be entirely your own work.

Assignments are to be handed in on Blackboard. Instructions are at http://www.utsc.utoronto.ca/~butler/c32/blackboard-assgt-howto.html, in case you forgot since last week. Markers’ comments and grades will be available on Blackboard as well.

Begin with the usual:

library(tidyverse)

1. The data for this question are in http://www.utsc.utoronto.ca/~butler/c32/cereal-sugar.txt. The story here is whether breakfast cereals marketed to children have a lot of sugar in them; in particular, whether they have more sugar on average than cereals marketed to adults.

(a) Read in the data (to R) and display the data set. Do you have a variable that distinguishes the children’s cereals from the adults’ cereals, and another that contains the amount of sugar?

Solution:


my_url="http://www.utsc.utoronto.ca/~butler/c32/cereal-sugar.txt"

cereals=read_delim(my_url," ")

## Parsed with column specification:

## cols(

## who = col_character(),

## sugar = col_double()

## )

cereals

## # A tibble: 40 x 2

## who sugar

## <chr> <dbl>

## 1 children 40.3

## 2 children 55.0

## 3 children 45.7

## 4 children 43.3

## 5 children 50.3

## 6 children 45.9

## 7 children 53.5

## 8 children 43.0

## 9 children 44.2

## 10 children 44.0

## # ... with 30 more rows

The variable who is a categorical variable saying who the cereal is intended for, and the variable sugar says how much sugar each cereal has.

(b) Calculate the mean sugar content for each group of cereals (the adults’ ones and the children’s ones). Do they look similar or different?

Solution: group_by and summarize:

cereals %>% group_by(who) %>%

summarize(sugar_mean=mean(sugar))

## # A tibble: 2 x 2

## who sugar_mean

## <chr> <dbl>

## 1 adults 10.90000

## 2 children 46.61053

These means look very different, though it would be better to look at a boxplot (coming up in a moment).

(c) Make side-by-side boxplots of the sugar contents of the two types of cereal. What do you see that is out of the ordinary?

Solution: The usual:

ggplot(cereals,aes(x=who,y=sugar))+geom_boxplot()


[Figure: side-by-side boxplots of sugar (y) by who (x), for the adults’ and children’s cereals.]

I see outliers: two high ones on the adults’ cereals, and one high and one low on the children’s cereals.

(d) Explain briefly why you would not trust a two-sample t-test with these data. (That is, say what the problem is, and why it’s a problem.)

Solution: The problem is the outliers (which is rather a giveaway), but the reason it’s a problem is that the two-sample t-test assumes (approximately) normal data, and a normal distribution doesn’t have outliers.

Not only do you need to note the outliers, but you also need to say why the outliers cause a problem in this case. Anything less than that is not a complete answer.
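If you want more evidence, normal quantile plots for each group would show those outliers as points well off the line. A sketch (not part of the original solution), using the same facetting idea that comes up in Question 2:

ggplot(cereals,aes(sample=sugar))+stat_qq()+facet_wrap(~who,ncol=1)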

(e) Run a suitable test to see whether the “typical” amount of sugar differs between adults’ and children’s cereals. Justify the test that you run. (You can use the version of your test that lives in a package, if that is easier for you.) What do you conclude, in the context of the data?

Solution: Having ruled out the two-sample t-test, we are left with Mood’s median test. I didn’t need you to build it yourself, so you can use package smmr to run it with:

library(smmr)

median_test(cereals,sugar,who)

## $table

## above

## group above below

## adults 2 19

## children 18 1

##

## $test

## what value

## 1 statistic 2.897243e+01

## 2 df 1.000000e+00

## 3 P-value 7.341573e-08

We conclude that there is a difference between the median amounts of sugar between the two groups of cereals, the P-value of 0.00000007 being extremely small.

Why did it come out so small? Because the amount of sugar was smaller than the overall median for almost all the adult cereals, and larger than the overall median for almost all the children’s ones. That is, the children’s cereals really do have more sugar.
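You can reproduce that counting yourself; a sketch, comparing each cereal to the overall median:

cereals %>% mutate(above=(sugar>median(sugar))) %>% count(who,above)

This should give back the frequencies in the $table output above: almost all the adults’ cereals below, almost all the children’s above.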

My daughter likes chocolate Cheerios, but she also likes Shredded Wheat and Bran Flakes. Go figure. (Her current favourite is Raisin Bran, even though she doesn’t like raisins by themselves.)

Mood’s median test is the test we should trust, but you might be curious about how the t-test stacks up here:

t.test(sugar~who,data=cereals)

##

## Welch Two Sample t-test

##

## data: sugar by who

## t = -11.002, df = 37.968, p-value = 2.278e-13

## alternative hypothesis: true difference in means is not equal to 0

## 95 percent confidence interval:

## -42.28180 -29.13925

## sample estimates:

## mean in group adults mean in group children

## 10.90000 46.61053

The P-value is even smaller, and we have the advantage of getting a confidence interval for the difference in means: from about 30 to about 40 units less sugar in the adult cereals. Whatever the units were.

(f) Read the data into SAS, and do the same test that you did in the previous part. Do you get the same result?

Solution: Let’s see if we can do this in one shot:

filename myurl url 'http://www.utsc.utoronto.ca/~butler/c32/cereal-sugar.txt';


proc import

datafile=myurl

dbms=dlm

out=cereal

replace;

delimiter=' ';

getnames=yes;

proc npar1way median;

var sugar;

class who;

The NPAR1WAY Procedure

Median Scores (Number of Points Above Median) for Variable sugar

Classified by Variable who

Sum of Expected Std Dev Mean

who N Scores Under H0 Under H0 Score

children 19 18.0 9.50 1.599279 0.947368

adults 21 2.0 10.50 1.599279 0.095238

Average scores were used for ties.

Median Two-Sample Test

Statistic 18.0000

Z 5.3149

One-Sided Pr > Z <.0001

Two-Sided Pr > |Z| <.0001

Median One-Way Analysis

Chi-Square 28.2481

DF 1

Pr > Chi-Square <.0001

The P-value is entirely consistent, but, as we found before, the test statistic is slightly different: 28.24 here, vs. 28.97 from smmr. I still haven’t figured out why that is.

2. Two new short courses have been proposed for helping students who suffer from severe math phobia. The courses are labelled A and B. Ten students were randomly allocated to one of these two courses, and each student’s score on a math phobia test was recorded after they completed their course. The math phobia test produces whole-number scores between 0 and 10, with a higher score indicating a greater fear of mathematics. The data can be found in http://www.utsc.utoronto.ca/~butler/c32/mathphobia.txt. We start with R for this question.

(a) Read in the data and check, however you like, that you have 10 observations, 5 from each course.

Solution: This doesn’t need much comment:


my_url="http://www.utsc.utoronto.ca/~butler/c32/mathphobia.txt"

math=read_delim(my_url," ")

## Parsed with column specification:

## cols(

## course = col_character(),

## phobia = col_integer()

## )

math

## # A tibble: 10 x 2

## course phobia

## <chr> <int>

## 1 a 8

## 2 a 7

## 3 a 7

## 4 a 6

## 5 a 6

## 6 b 9

## 7 b 8

## 8 b 7

## 9 b 2

## 10 b 1

This will do, if you count up the a’s and b’s. Or, to save yourself that trouble:

math %>% count(course)

## # A tibble: 2 x 2

## course n

## <chr> <int>

## 1 a 5

## 2 b 5

Five each. The story is to get the computer to do the grunt work for you, if you can make it do so. Other ways:

math %>% group_by(course) %>% summarize(count=n())

## # A tibble: 2 x 2

## course count

## <chr> <int>

## 1 a 5

## 2 b 5

and this:

with(math,table(course))

## course

## a b

## 5 5

giving the same answer. Lots of ways.

(b) Do a two-sample t-test to assess whether there is a difference in mean phobia scores after the students have taken the two courses. What do you conclude? (You have no a priori1 reason to suppose that a particular one of the tests will produce a higher mean than the other, so do a two-sided test.)

Solution: A two-sided test is the default, so there is not much to do here:

t.test(phobia~course,data=math)

##

## Welch Two Sample t-test

##

## data: phobia by course

## t = 0.83666, df = 4.4199, p-value = 0.4456

## alternative hypothesis: true difference in means is not equal to 0

## 95 percent confidence interval:

## -3.076889 5.876889

## sample estimates:

## mean in group a mean in group b

## 6.8 5.4

The P-value of 0.4456 is nowhere near less than 0.05, so there is no evidence at all that the mean math phobia scores are different between the two courses.

(c) Draw boxplots of the math phobia scores for each group (one line of code). What is the most striking thing that you notice?

Solution:

ggplot(math,aes(x=course,y=phobia))+geom_boxplot()


[Figure: side-by-side boxplots of phobia (y) by course (x), for courses a and b.]

Boxplot a is just weird. The bar across the middle is actually at the top, and it has no bottom. (Noting something sensible like this is enough.) Boxplot b is hugely spread out.2

By way of explanation: the course a scores have a number of tied values, so that the third quartile and the median are the same, and also the first quartile and the minimum value are the same:


tmp=math %>% filter(course=="a")

tmp %>% count(phobia)

## # A tibble: 3 x 2

## phobia n

## <int> <int>

## 1 6 2

## 2 7 2

## 3 8 1

quantile(tmp$phobia)

## 0% 25% 50% 75% 100%

## 6 6 7 7 8

The phobia scores from course A are two 6’s, two 7’s and an 8. The median and third quartile are both 7, and the first quartile is the same as the lowest value, 6.

Technique note: I wanted to do two things with the phobia scores from course A: count up how many of each score, and show you what the five-number summary looks like. One pipe won’t do this (the pipe “branches”), so I saved what I needed to use, before it branched, into a data frame tmp and then used tmp twice. Pipes are powerful, but not all-powerful.

(d) Explain briefly why a t-test would not be good for these data. (There are two things that you need to say.)

Solution: The easiest way to structure this is to ask yourself first what the t-test needs, and second whether you have it.

The t-test assumes (approximately) normal data. The boxplot for group a doesn’t even look symmetric, and the one for group b has an oddly asymmetric box. So I think the normality is in question here, and therefore another test would be better. (This is perhaps a bit glib of an answer, since there are only 5 values in each group, and so they can certainly look non-normal even if they actually are normal, but these values are all integers, so it is perhaps wise to be cautious.)

We have the machinery to assess the normality for these, in one shot:

ggplot(math,aes(sample=phobia))+stat_qq()+facet_wrap(~course,ncol=1)


[Figure: normal quantile plots of phobia, facetted by course.]

I don’t know what you make of those, but they both look pretty straight to me (and there are only five observations, so it’s hard to judge). Course b maybe has a “hole” in it (three large values and two small ones). Maybe. I dunno. What I would really be worried about is outliers, and at least we don’t have those.

I mentioned in class that the t-tests are robust to non-normality. I ought to have expanded on that a bit: what really makes the t-test still behave itself with non-normality is when you have large samples, that is, when the Central Limit Theorem has had a chance to take hold. (That’s what drives the normality not really being necessary in most cases.) But, even with small samples, exact normality doesn’t matter so much. Here, we have two tiny samples, and so we have to insist a bit more, but only a bit more, on a more-or-less normal shape in each group. (It’s kind of a double jeopardy in that the situation where normality matters most, namely with small samples, is where it’s the hardest to judge, because samples of size 5 even from a normal distribution can look very non-normal.)

But, the biggest threats to the t-test are big-time skewness and outliers, and we are not suffering too badly from those.

(e) Run a suitable test to compare the “typical” scores for the two courses. (You can use the version from a package rather than building your own.) What do you conclude?

Solution: This is an invite to use smmr:

library(smmr)

median_test(math,phobia,course)

## $table

## above

## group above below

## a 1 2

## b 2 2

##

## $test

## what value

## 1 statistic 0.1944444

## 2 df 1.0000000

## 3 P-value 0.6592430

We are nowhere near rejecting equal medians; in fact, both courses are very close to 50–50 above and below the overall median.

If you look at the frequency table, you might be confused by something: there were 10 observations, but there are only 1 + 2 + 2 + 2 = 7 in the table. This is because three of the observations were equal to the overall median, and had to be thrown away:

math %>% summarize(med=median(phobia))

## # A tibble: 1 x 1

## med

## <dbl>

## 1 7

math %>% count(phobia)

## # A tibble: 6 x 2

## phobia n

## <int> <int>

## 1 1 1

## 2 2 1

## 3 6 2

## 4 7 3

## 5 8 2

## 6 9 1

The overall median was 7. Because the actual data were really discrete (the phobia scores could only be whole numbers), we risked losing a lot of our data when we did this test (and we didn’t have much to begin with). The other thing to say is that with small sample sizes, the frequencies in the table have to be very lopsided for you to have a chance of rejecting the null. Something like this is what you’d need:


x=c(1,1,2,6,6,6,7,8,9,10)

g=c(1,1,1,1,1,2,2,2,2,2)

d=tibble(x,g)

median_test(d,x,g)

## $table

## above

## group above below

## 1 0 3

## 2 4 0

##

## $test

## what value

## 1 statistic 7.000000000

## 2 df 1.000000000

## 3 P-value 0.008150972

I faked it up so that we had 10 observations, three of which were equal to the overall median. Of the rest, all the small ones were in group 1 and all the large ones were in group 2. This is lopsided enough to reject with, though, because of the small frequencies, there should actually have been a warning about “chi-squared approximation may be inaccurate”.3

(f) Repeat your chosen test in SAS. First, you will have to read the data in. Does your test give the same conclusion? An identical result? Explain briefly.

Solution:

filename myurl url "http://www.utsc.utoronto.ca/~butler/c32/mathphobia.txt";

proc import

datafile=myurl

dbms=dlm

out=math

replace;

delimiter=' ';

getnames=yes;

proc print;

proc npar1way median;

var phobia;

class course;

The NPAR1WAY Procedure

Median Scores (Number of Points Above Median) for Variable phobia

Classified by Variable course

Sum of Expected Std Dev Mean

course N Scores Under H0 Under H0 Score

a 5 2.333333 2.50 0.713624 0.466667

b 5 2.666667 2.50 0.713624 0.533333

Average scores were used for ties.


Median Two-Sample Test

Statistic 2.3333

Z -0.2335

One-Sided Pr < Z 0.4077

Two-Sided Pr > |Z| 0.8153

Median One-Way Analysis

Chi-Square 0.0545

DF 1

Pr > Chi-Square 0.8153


The P-value of 0.8153 leads to the same non-rejection as with R, though the P-value is different than we had before. This is actually for two reasons: one, the way SAS has of calculating the test statistic (that we ran into before), and two, how SAS deals with values exactly equal to the overall median. Instead of throwing them away, it counts two-thirds of an observation above the median. Let’s see if I can figure out why. I’m going to use something new in R to demonstrate what’s going on. There are three things that matter: whether an observation is strictly above, equal to or strictly below the grand median, which in this case is 7. Here goes:

math2=math %>% mutate(rel=case_when(

phobia==7 ~ "equal",

phobia<7 ~ "less",

phobia>7 ~ "greater",

TRUE ~ "wrong"))

math2

## # A tibble: 10 x 3

## course phobia rel

## <chr> <int> <chr>

## 1 a 8 greater

## 2 a 7 equal

## 3 a 7 equal

## 4 a 6 less

## 5 a 6 less

## 6 b 9 greater

## 7 b 8 greater

## 8 b 7 equal

## 9 b 2 less

## 10 b 1 less

I’ve made a new column called rel that is filled with text saying where the phobia value stands relative to 7. Because there are several possible answers (three, here) rather than only two, we can use this case_when construction. It has a bunch of logical conditions, here involving the phobia score, and it provides a recipe for obtaining a value of rel. R goes down the list, and the first condition that is true provides its value to rel. The last condition is always true, and is a kind of catch-all in case I missed anything.

In this case, all my conditions involved just phobia, but a condition could involve more than one variable, or different conditions could involve different variables. For example, in this case, I might have wanted to split greater into two parts according to whether I was looking at course A or B, which I could do like this:


math %>% mutate(rel=case_when(

phobia==7 ~ "equal",

phobia<7 ~ "less",

course=="a" ~ "greater-a",

course=="b" ~ "greater-b",

TRUE ~ "wrong"))

## # A tibble: 10 x 3

## course phobia rel

## <chr> <int> <chr>

## 1 a 8 greater-a

## 2 a 7 equal

## 3 a 7 equal

## 4 a 6 less

## 5 a 6 less

## 6 b 9 greater-b

## 7 b 8 greater-b

## 8 b 7 equal

## 9 b 2 less

## 10 b 1 less

By the time I get to testing whether course is A or B, I have already caught phobia being less than or equal to 7, so I know that phobia is strictly greater than 7. If that’s the case, I also want to test course. You could imagine drawing this diagrammatically as a tree; this is very like a “decision tree” where you take some action based on what conditions happen to be true:

[Figure: decision tree for rel — phobia < 7 gives “less”; phobia = 7 gives “equal”; phobia > 7 leads to a second test, with course a giving “greater-a” and course b giving “greater-b”.]

I stole this idea from http://texample.net/tikz/examples/decision-tree/. This picture was drawn in LaTeX with a thing called tikz.

Other languages also have a construction like case_when. Sometimes it is called switch. Python seems not to have it.

All right, let’s count how many values are above and below the median overall:


math2 %>% count(rel)

## # A tibble: 3 x 2

## rel n

## <chr> <int>

## 1 equal 3

## 2 greater 3

## 3 less 4

and for each course:

math2 %>% group_by(course) %>% count(rel)

## # A tibble: 6 x 3

## # Groups: course [2]

## course rel n

## <chr> <chr> <int>

## 1 a equal 2

## 2 a greater 1

## 3 a less 2

## 4 b equal 1

## 5 b greater 2

## 6 b less 2

If there were no values exactly equal to the median, there would be 5 above and 5 below. So those equal values would be, if they were not exactly equal to the overall median, 2 above and 1 below. So those equal values are each counted as 2/3 above and 1/3 below.

So now let’s think about how many values, counting this way, are “above” the median for each course.

Course A has one value genuinely above, and two values equal, which count 2/3 each, for a total of 1 + 2(2/3) = 7/3 = 2.33.

Course B has two values genuinely above and one equal (counting 2/3), for a total of 2 + 1(2/3) = 8/3 = 2.67.

This is how SAS got what it did.
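A quick check of that arithmetic, using the math2 data frame and its rel column from above (a sketch):

math2 %>% group_by(course) %>%

summarize(sas_score=sum(rel=="greater")+(2/3)*sum(rel=="equal"))

This reproduces SAS’s 2.33 for course a and 2.67 for course b.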

With decently large amounts of data (i.e., not here), the fact that SAS does something different should not make any material difference to the conclusion, or indeed (much) to the P-value.

3. Does the age of a sweet white wine affect its acidity? A winemaker took random samples of Riesling (a sweet white wine) from casks that are 15, 20 and 25 years old and measured the acidity of each one. The results are in http://www.utsc.utoronto.ca/~butler/c32/wine-acidity.txt. The two columns are the age of the wine and the acidity.

Note that the data are in the correct format, so we don’t need to do the tidyr stuff (or SAS equivalent thereof).

(a) Read the data into SAS and confirm, without listing all the data values, that all of the acidity values are between 0.7 and 0.85.

Solution: The trick here is to remember that proc means also produces the maximum and minimum, which is exactly what we need here:

filename myurl url "http://www.utsc.utoronto.ca/~butler/c32/wine-acidity.txt";


proc import

datafile=myurl

out=wine

dbms=dlm

replace;

getnames=yes;

delimiter=' ';

with this output:

proc means;

var acidity;


The MEANS Procedure

Analysis Variable : acidity

N Mean Std Dev Minimum Maximum

------------------------------------------------------------------

39 0.8058615 0.0151667 0.7735000 0.8298000

------------------------------------------------------------------


The smallest value is 0.77 and the largest 0.83, so all the acidity values must be between 0.7 and 0.85.

Or you can use a “custom” proc means, asking only for the extremes:

proc means min max;

var acidity;


The MEANS Procedure

Analysis Variable : acidity

Minimum Maximum

----------------------------

0.7735000 0.8298000

----------------------------

Another alternative here is to use proc univariate and look at the “extreme values” down near the bottom of the output. It works, but you need to make sure you only grab the table of extreme values (and not the rest of it), because the output from proc univariate is rather long:

proc univariate;

var acidity;

The UNIVARIATE Procedure

Variable: acidity

Extreme Observations

-----Lowest----- -----Highest----

Value Obs Value Obs

0.7735 27 0.8251 13

0.7789 34 0.8262 21

0.7801 38 0.8265 20

0.7813 28 0.8291 3

0.7843 39 0.8298 5

Again, the smallest value is bigger than 0.7 and the largest is smaller than 0.85, so all the values must be between 0.7 and 0.85.

Find a way to make it work.

Listing all the data values (with proc print) is smart until you have convinced yourself that the values are correct, but it is not smart to make the marker read it every time (unless the data set is small).

I have 39 data values altogether (which is getting a bit long to list all of), so I imagine there are 13 per group. You don’t need to, but I want to check:

proc freq;

table age;

The FREQ Procedure

Cumulative Cumulative

age Frequency Percent Frequency Percent

--------------------------------------------------------

15 13 33.33 13 33.33

20 13 33.33 26 66.67

25 13 33.33 39 100.00

I do. Or, of course, proc means with class age would have told me the same thing.

(b) Obtain side by side boxplots of acidity levels for each age of wine.


Solution: This is just like the one you did in the tutorial:

proc sgplot;

vbox acidity / category=age;

(c) What do you learn from the boxplots? (Two things, one about means/medians and one about spreads. For the second, why is what you learn important?)

Solution: The mean and median acidity is much lower for the oldest wines (age 25). The age-15 and age-20 wines have about the same median acidity.

The groups each have similar spreads, and there are no outliers and more or less symmetric shapes in all cases. (The age 20 wines are the most questionable in terms of symmetry.) This is important because an assumption of ANOVA is that you have (approximately) normal data with (approximately) equal SDs, and if we have that, as I think we do, we can trust the results of the ANOVA. As ever, you can ask for perfection, but you will generally be disappointed. If it’s not bad enough to cause problems, that’s all we need.

An alternative (and thus acceptable) answer is to say that the age-15 wines have a “clearly smaller” IQR, and therefore we should be cautious about the results of the ANOVA. I don’t like this as much, but if that’s your reasoning, that’s a valid conclusion.

(d) Obtain an analysis of variance. What, as precisely as possible, are you able to conclude?

Solution: It looks as if there ought to be a var line in there, but you don’t need one (what would be on the var line is on the left side of the equals in the model line):


proc anova;

class age;

model acidity=age;


The ANOVA Procedure

Class Level Information

Class Levels Values

age 3 15 20 25

Number of Observations Read 39

Number of Observations Used 39

The ANOVA Procedure

Dependent Variable: acidity

Sum of

Source DF Squares Mean Square F Value Pr > F

Model 2 0.00492012 0.00246006 23.18 <.0001

Error 36 0.00382102 0.00010614

Corrected Total 38 0.00874113

R-Square Coeff Var Root MSE acidity Mean

0.562870 1.278432 0.010302 0.805862

Source DF Anova SS Mean Square F Value Pr > F

age 2 0.00492012 0.00246006 23.18 <.0001


The P-value is less than 0.0001, so we reject our null that all three ages of wine have the same mean acidity. All we can conclude at this point is that the three means are not all the same, but beyond that we cannot go yet.

If you thought the three groups had unequal spreads above, you ought to be a bit cautious about this, but: the P-value is so small that even if our assumptions are slightly off, the conclusions are unlikely to change. (That is, our somewhat unequal spreads would have to change the P-value from less than 0.0001 to greater than 0.05 to affect our conclusions, and I don’t see how our assumptions are “off” by enough to make this happen.)

If you really insist on caution, you should run something like Mood’s median test, as per last assignment. We did this before in R. SAS can also do Mood’s median test. It lives in proc npar1way, but not under the name you think: there is a mood there, which is not what we need. We want median, which, according to the documentation, is the “Brown-Mood median test”, the right thing:

proc npar1way median;

var acidity;

class age;

with this output:

The NPAR1WAY Procedure

Median Scores (Number of Points Above Median) for Variable acidity

Classified by Variable age

Sum of Expected Std Dev Mean

age N Scores Under H0 Under H0 Score

15 13 10.0 6.333333 1.490712 0.769231

20 13 8.0 6.333333 1.490712 0.615385

25 13 1.0 6.333333 1.490712 0.076923

Median One-Way Analysis

Chi-Square 13.4000

DF 2

Pr > Chi-Square 0.0012

The P-value of 0.0012 is not quite as small as from the ANOVA, but is easily small enough to reject a null hypothesis that the population medians are all equal.

In the grand scheme of things, our concerns were not enough to invalidate the ANOVA F-test, since the conclusions were the same both ways.
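For comparison, the same test can be run in R with smmr, once the wine data have been read in there as well. A sketch (assuming the same space-delimited read as in the other questions); the statistic and P-value should come out close to, though, as we have seen before, not identical to, SAS’s:

my_url="http://www.utsc.utoronto.ca/~butler/c32/wine-acidity.txt"

wine=read_delim(my_url," ")

library(smmr)

median_test(wine,acidity,age)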

(e) Is it appropriate to run Tukey’s method? Explain briefly why or why not. If appropriate, run Tukey’s method. What do you conclude?

Solution: The F-test was significant, so there are some differences to find. In SAS, to get Tukey, we have to run the whole ANOVA again (so if I had been smart, I would have included the Tukey in the first run of proc anova and ignored it had I not needed it):

proc anova;

class age;

model acidity=age;

means age / tukey;


Here’s the extra bit of output:


The ANOVA Procedure

Tukey's Studentized Range (HSD) Test for acidity

NOTE: This test controls the Type I experimentwise error rate, but it generally

has a higher Type II error rate than REGWQ.

Alpha 0.05

Error Degrees of Freedom 36

Error Mean Square 0.000106

Critical Value of Studentized Range 3.45675

Minimum Significant Difference 0.0099

Means with the same letter are not significantly different.

Tukey Grouping Mean N age

A 0.814738 13 15

A

A 0.812831 13 20

B 0.790015 13 25


The last line of code says “get the mean value of acidity (implied) for each group defined by age, and compare using Tukey.”

The results say that the 15 and 20 means are not significantly different (they are in fact very close), but the age-25 mean is significantly different (less, in fact). (The minimum significant difference, almost exactly 0.01, says that any means differing by this much or more are significantly different. The age 25 group mean differs by more than 0.01 from the others, but the age 15 and 20 groups differ in mean by only 0.002.)
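You can reproduce SAS’s minimum significant difference in R, as a check on the arithmetic (a sketch, plugging in the error mean square 0.000106 and 13 observations per group from the output above):

qtukey(0.95,3,36)*sqrt(0.000106/13)

qtukey gives the critical value of the Studentized range (about 3.457 here, matching SAS), and the whole thing comes out to about 0.0099.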

This was exactly the same conclusion that we got (informally) from the boxplots, but now we have P-values to attach to it.

To do something like Tukey with Mood’s median test, what I’ve seen suggested is that you do Mood’s median test on all the pairs of groups (i.e., 15 vs. 20, 15 vs. 25, 20 vs. 25). Then you adjust for the fact that you’ve done three tests at once. A simple but “conservative” adjustment is to multiply all those P-values by 3.4

I think I could even do that, with a bit of cleverness:

proc npar1way median;

where age=15 | age=20;

var acidity;

class age;

proc npar1way median;

where age=15 | age=25;

var acidity;

class age;

proc npar1way median;

where age=20 | age=25;

var acidity;

class age;

That has three Mood’s median tests as output:

The NPAR1WAY Procedure

Median Scores (Number of Points Above Median) for Variable acidity

Classified by Variable age

Sum of Expected Std Dev Mean

age N Scores Under H0 Under H0 Score

15 13 7.0 6.50 1.30 0.538462

20 13 6.0 6.50 1.30 0.461538

Median Two-Sample Test

Statistic 7.0000

Z 0.3846

One-Sided Pr > Z 0.3503

Two-Sided Pr > |Z| 0.7005

Median One-Way Analysis

Chi-Square 0.1479

DF 1

Pr > Chi-Square 0.7005


The NPAR1WAY Procedure

Median Scores (Number of Points Above Median) for Variable acidity

Classified by Variable age

Sum of Expected Std Dev Mean

age N Scores Under H0 Under H0 Score

15 13 11.0 6.50 1.30 0.846154

25 13 2.0 6.50 1.30 0.153846

Median Two-Sample Test

Statistic 11.0000

Z 3.4615

One-Sided Pr > Z 0.0003

Two-Sided Pr > |Z| 0.0005

Median One-Way Analysis

Chi-Square 11.9822

DF 1

Pr > Chi-Square 0.0005

The NPAR1WAY Procedure

Median Scores (Number of Points Above Median) for Variable acidity

Classified by Variable age

Sum of Expected Std Dev Mean

age N Scores Under H0 Under H0 Score

20 13 11.0 6.50 1.30 0.846154

25 13 2.0 6.50 1.30 0.153846

Median Two-Sample Test

Statistic 11.0000

Z 3.4615

One-Sided Pr > Z 0.0003

Two-Sided Pr > |Z| 0.0005

Median One-Way Analysis

Chi-Square 11.9822

DF 1

Pr > Chi-Square 0.0005


For the first one, I only take the acidity measurements that have ages 15 or 20 and do a Mood’s median test on them: that is to say, comparing just ages 15 and 20. This has a very large P-value of 0.7005 (two-sided, since we are looking for any differences). Ages 15 and 20 are not significantly different.

The second one compares ages 15 and 25, and produces a very small P-value of 0.0005. Hang on to that one a moment.

The third one compares ages 20 and 25, and likewise produces a P-value of 0.0005.

Now we have just done three tests at once, which we have to account for. The Bonferroni procedure says that, with three tests, we multiply each P-value by 3:

Test       P-value  Adjusted P-value
15 vs. 20  0.7005   “large”
15 vs. 25  0.0005   0.0015
20 vs. 25  0.0005   0.0015

In this case, the adjustment hasn’t changed anything, since the decisions were very clear-cut: age 25 is significantly different in terms of acidity from each of the other ages, which are not significantly different from each other.
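The same pairwise idea works in R, filtering down to two ages at a time and using smmr again (a sketch, reusing the wine data frame read into R above):

median_test(wine %>% filter(age!=25),acidity,age)

median_test(wine %>% filter(age!=20),acidity,age)

median_test(wine %>% filter(age!=15),acidity,age)

and then p.adjust(c(0.7005,0.0005,0.0005),method="bonferroni") does the multiply-by-3 (capped at 1) in one step.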

Since we are comparing medians, we should look at the medians for each group. proc means can be cajoled into providing these:5

proc means median;

var acidity;

class age;

The MEANS Procedure

Analysis Variable : acidity

N

age Obs Median

-----------------------------------

15 13 0.8133000

20 13 0.8109000

25 13 0.7882000

-----------------------------------

We see, exactly as we did from ANOVA plus Tukey, that the acidity at age 25 is significantly lower than at the other two ages. In other words, the conclusions are identical from the two strategies. They won’t always be, but in this case they are.

4. Hand this one in. Do people understand medical instructions better at certain times of the day? In a study, students in a grade 12 class are randomly divided into two groups, A and B. All students see a video describing how to use an infant forehead thermometer. The students in Group A see the video at 8:30 am, while the students in Group B see the same video at 3:00 pm (on the same day). The next day, all the students are given a test on the material in the video (graded out of 100). The observed scores are in http://www.utsc.utoronto.ca/~butler/c32/forehead.txt (values separated by spaces).

(a) (2 marks) Read the data into R and display the (first ten) values.


Solution: Separated by spaces, so read_delim:

my_url="http://www.utsc.utoronto.ca/~butler/c32/forehead.txt"

instr=read_delim(my_url," ")

## Parsed with column specification:

## cols(

## group = col_character(),

## score = col_integer()

## )

instr

## # A tibble: 18 x 2

## group score

## <chr> <int>

## 1 A 88

## 2 A 89

## 3 A 79

## 4 A 100

## 5 A 98

## 6 A 89

## 7 A 65

## 8 A 94

## 9 A 95

## 10 A 91

## 11 B 87

## 12 B 69

## 13 B 78

## 14 B 79

## 15 B 83

## 16 B 90

## 17 B 85

## 18 B 58

(b) (2 marks) Obtain a suitable plot that will enable you to assess the assumptions for a two-sample t-test.

Solution: We need the values in each group to be approximately normally distributed. Side-by-side boxplots will do it:

ggplot(instr,aes(x=group,y=score))+geom_boxplot()


[Figure: side-by-side boxplots of score (y) by group (x), for groups A and B.]

or, if you like, separate (facetted) normal quantile plots, which I would do this way:

ggplot(instr,aes(sample=score))+stat_qq()+

facet_wrap(~group,ncol=1)


[Figure: normal quantile plots of score, facetted by group.]

(c) (2 marks) Why might you have doubts about using a two-sample t-test here?

Solution: We are looking for non-normality in at least one of the groups. Here, both groups have an outlier at the low end that would be expected to pull the mean downward. You might also see left-skewness (they are hard to tell apart). Both of the normal quantile plots are slightly bent as well as having low outliers. This direction of curve points to spread out at the bottom, bunched up at the top: that is, skewed to the left.

I’m expecting that you drew the boxplots, in which case the outliers probably caught your attention, but I’ll take either outliers at the bottom or skewed-left as an answer.

(d) (4 marks) Run Mood’s median test as in class (without using smmr). What do you conclude, in the context of the data? What recommendation would you make about the time of day to see the video? (You might get a warning about “chi-squared approximation being incorrect”, which you can ignore here.)

Solution: The overall median first:

instr %>% summarize(med=median(score))

## # A tibble: 1 x 1

## med

## <dbl>

## 1 87.5

87.5, which is not equal to any of the data values (they are all integers). This will avoid any issues with values-equal-to-median later.

Then, create and save a table of the value by group and above/below median. You can count either above or below (it comes out equivalently either way):

tab=with(instr,table(group,score>87.5))

tab

##

## group FALSE TRUE

## A 2 8

## B 7 1

Then, chi-squared test for independence (the null) or association of some kind (the alternative). The correct=F is saying not to do Yates’s correction, so that it would come out the same if you were doing it by hand (“observed minus expected, squared, divided by expected” and all that stuff).

chisq.test(tab,correct=F)

## Warning in chisq.test(tab, correct = F): Chi-squared approximation may be incorrect

##

## Pearson's Chi-squared test

##

## data: tab

## X-squared = 8.1, df = 1, p-value = 0.004427

The P-value is 0.0044, which is (much) smaller than 0.05, and therefore you can reject independence and conclude association: that is, whether a student scores above or below the median depends on which group they are in, or, that the median scores are different for the two groups.

As for which group is better, well, the easiest way is to go back to your boxplots and see that the median for group A (8:30 am) is substantially higher than for group B (3:00 pm). But you can also see it from your frequency table, if you displayed it:

tab

##

## group FALSE TRUE

## A 2 8

## B 7 1

Most of the people in the 8:30 am group scored above the median, and most of the people in the 3:00 pm group scored below the median. So the scores at 8:30 am were better overall.
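Or compute the two group medians directly, the same way we found the overall median (a sketch):

instr %>% group_by(group) %>% summarize(med=median(score))

The group A median should come out well above the group B one.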

As I write this, it is just after 3:00 pm and I am about to make myself a pot of tea!

About that correct=F thing. There was a point of view for a long time that when you are dealing with a 2 × 2 table, you can get better P-values by, before squaring “observed minus expected”, taking 0.5 away from the absolute value of the difference. This is called Yates’s correction. In about the 1960s, the usefulness of Yates’s correction was shot down, for general contingency tables. There is, however, one case where it is useful, and that is the case where the row totals and column totals are fixed.

What do I mean by that? Well, first let’s look at a case where the totals are not all fixed. Consider a survey in which you want to see whether males and females agree or disagree on some burning issue of the day. You collect random samples of, say, 500 males and 500 females, and you count how many of them say Yes or No to your statement. You might get results like this:

Yes No Total

Males 197 303 500

Females 343 157 500

Total 540 460 1000

In this table, the row totals must be 500, because you asked this many males and this many females, and each one must have answered something. The column totals, however, are not fixed: you didn’t know, ahead of time, that 540 people would answer “yes”. That was just the way the data turned out, and if you did another survey with the same design, you’d probably get a different number of people saying “yes”.

For another example, let’s go back to Fisher (yes, that Fisher). A “lady” of his acquaintance claimed to be able to tell, by drinking a cup of tea with milk and sugar in it, whether the milk or the sugar had been added first. Fisher, or, more likely, his housekeeper, prepared 8 cups of tea, 4 with milk first and 4 with sugar first. The lady knew that four of the cups had milk first, and her job was to say which four. The results might have been like this:

Actual

Milk first sugar first Total

Lady Milk first 3 1 4

says sugar first 1 3 4

Total 4 4 8

This time, all of the row totals and all of the column totals must be 4, regardless of what the lady thinks. Even if she thinks 5 of the cups of tea actually had milk first, she is going to pick 4 of them to say that they have milk first, since she knows there are only 4. In this case, all of the row and column totals are fixed at 4, and the right analysis is called Fisher’s Exact Test, based on the hypergeometric distribution. But, leaving that aside, the usual chi-squared analysis is a perfectly good approximation, especially if the frequencies are large, and especially if you use Yates’s correction.
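For the record, the hypergeometric calculation behind Fisher’s Exact Test is short enough to sketch in R for the table above. The chance that the lady gets 3 or more of the 4 milk-first cups right, if she is just guessing, is

dhyper(3,4,4,4)+dhyper(4,4,4,4)

which is 16/70 + 1/70, about 0.24: nowhere near lopsided enough to convince anyone that she can tell the difference.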

It is clear that Fisher must have been English, since he was able to get a publication out of drinking tea.

How does that apply to Mood’s median test? Well, let’s remind ourselves of the table we had:

tab

##

## group FALSE TRUE

## A 2 8

## B 7 1

We know how many students were in each group: 10 in group A and 8 in B. So the row totals are fixed. What about the columns? These are whether each observation was above or below the overall median. There were 18 observations altogether, so there must be 9 above and 9 below.6 So the column totals are fixed as well. All totals fixed, so we should be using Yates’s correction. I didn’t, because I wanted to keep things simple, but I should have done.


R’s chisq.test by default always uses Yates’s correction, and if you don’t want it, you have to say correct=F. Which is why I have been doing so all through.
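If you want to see exactly what the correction does, the chi-squared statistic can be computed by hand from tab, without and with Yates (a sketch):

e=outer(rowSums(tab),colSums(tab))/sum(tab)

sum((tab-e)^2/e)

sum((abs(tab-e)-0.5)^2/e)

The first sum reproduces the 8.1 from chisq.test with correct=F; the second shows what the correction turns it into (which we will meet again in part (g)).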

(e) (2 marks) Run Mood’s median test on these data using my smmr package, and verify that you get the same answer.

Solution: Not much to it, since the data is already read in:

library(smmr)

median_test(instr, score, group)

## $table

## above

## group above below

## A 8 2

## B 1 7

##

## $test

## what value

## 1 statistic 8.100000000

## 2 df 1.000000000

## 3 P-value 0.004426526

Identical: test statistic, degrees of freedom and P-value. The table of frequencies is also the same, just with columns rearranged. (In smmr I counted the number of values below the overall median, whereas in my build-it-yourself I counted the number of values above.)

(f) (2 marks) Read the data into SAS and list the values (there are only 18 of them).

Solution: Nothing too challenging here:

filename myurl url 'http://www.utsc.utoronto.ca/~butler/c32/forehead.txt';

proc import

datafile=myurl

dbms=dlm

out=forehead

replace;

delimiter=' ';

getnames=yes;

proc print;

Obs group score

1 A 88

2 A 89

3 A 79

4 A 100

5 A 98

6 A 89

7 A 65

8 A 94

9 A 95

10 A 91

11 B 87

12 B 69

13 B 78

14 B 79

15 B 83

16 B 90

17 B 85

18 B 58


Success.

(g) (3 marks) Obtain Mood’s median test in SAS for these data, and compare the results you get to those from R.

Solution: This is why I wanted you to read the data into SAS. Same proc npar1way idea as the previous question:

proc npar1way median;

var score;

class group;

The NPAR1WAY Procedure

Median Scores (Number of Points Above Median) for Variable score

Classified by Variable group

Sum of Expected Std Dev Mean

group N Scores Under H0 Under H0 Score

A 10 8.0 5.0 1.084652 0.8000

B 8 1.0 4.0 1.084652 0.1250

Average scores were used for ties.

Median Two-Sample Test

Statistic 1.0000

Z -2.7659

One-Sided Pr < Z 0.0028

Two-Sided Pr > |Z| 0.0057

Median One-Way Analysis

Chi-Square 7.6500

DF 1

Pr > Chi-Square 0.0057


The P-value, 0.0057, is a little bit bigger than R’s, and the test statistic is a little smaller, but the overall conclusion, that there is a difference in median test scores between the two groups, is still the same.

I would like you to note that the P-values are slightly different but the conclusions are the same.

Why are the results not exactly the same? Since the SAS test statistic is a little smaller, I’m suspecting that SAS is using Yates’s correction. So let’s re-run the R analysis without correct=F, that is to say, to use the correction:

chisq.test(tab)

## Warning in chisq.test(tab): Chi-squared approximation may be incorrect

##

## Pearson's Chi-squared test with Yates' continuity correction

##

## data: tab

## X-squared = 5.625, df = 1, p-value = 0.01771

That’s rather odd. Same frequencies above and below the median in each group, but Yates has over-corrected beyond what SAS said: the test statistic is smaller than SAS got, with a P-value that is bigger.

Did SAS perhaps do Fisher’s exact test?

fisher.test(tab)

##

## Fisher's Exact Test for Count Data

##

## data: tab

## p-value = 0.01522

## alternative hypothesis: true odds ratio is not equal to 1

## 95 percent confidence interval:

## 0.0007163236 0.6521664451

## sample estimates:

## odds ratio

## 0.04673025

Nope. Colour me mystified. There’s something odd about the test statistic that SAS is computing.
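One piece of arithmetic does check out, for what it’s worth (a conjecture on my part, not anything from the SAS documentation): multiplying SAS’s chi-square by N/(N − 1) recovers R’s uncorrected statistic, both here and in Question 1:

7.65*18/17

28.2481*40/39

These give 8.1 and 28.97 respectively, which would be consistent with SAS using a finite-population (hypergeometric) version of the variance.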

5. Hand this one in. Before a movie is shown in theatres, it receives a “rating” that says what kind of material it contains. https://en.wikipedia.org/wiki/Motion_Picture_Association_of_America_film_rating_system explains the categories, from G (suitable for children) to R (anyone under 17 must be accompanied by parent/guardian). In 2011, two students collected data on the length (in minutes) and the rating category, for 15 movies of each rating category, randomly chosen from all the movies released that year. The data are at http://www.utsc.utoronto.ca/~butler/c32/movie-lengths.csv.

(a) (3 marks) Read the data into SAS and summarize the movie lengths by rating category. Verify that you have the right ratings and the right number of each.

Solution: A .csv, so:

filename myurl url 'http://www.utsc.utoronto.ca/~butler/c32/movie-lengths.csv';

proc import


datafile=myurl

dbms=csv

out=movies

replace;

getnames=yes;

and then, having used proc print to check out what the columns are called:

proc means;

var length;

class rating;

The MEANS Procedure

Analysis Variable : length

N

rating Obs N Mean Std Dev Minimum Maximum

-----------------------------------------------------------------------------

G 15 15 80.6000000 21.2293597 25.0000000 117.0000000

PG 15 15 106.9333333 18.4021220 87.0000000 152.0000000

PG-13 15 15 123.4000000 23.6093202 95.0000000 178.0000000

R 15 15 111.2000000 17.8093074 93.0000000 148.0000000

-----------------------------------------------------------------------------

There are four different movie ratings, as listed on the website referred to in the question, and fifteen movies of each rating, so we appear to be good.

(b) (2 marks) Run an analysis of variance for testing the null hypothesis that the mean lengths of movies of all four rating types are equal. Include code to run Tukey’s method (in case we need it later).

Solution: Like this:

proc anova;

class rating;

model length=rating;

means rating / tukey;

The ANOVA Procedure

Class Level Information

Class Levels Values

rating 4 G PG PG-13 R

Number of Observations Read 60

Number of Observations Used 60


The ANOVA Procedure

Dependent Variable: length

Sum of

Source DF Squares Mean Square F Value Pr > F

Model 3 14624.40000 4874.80000 11.72 <.0001

Error 56 23294.53333 415.97381

Corrected Total 59 37918.93333

R-Square Coeff Var Root MSE length Mean

0.385675 19.32606 20.39544 105.5333

Source DF Anova SS Mean Square F Value Pr > F

rating 3 14624.40000 4874.80000 11.72 <.0001

The ANOVA Procedure

Tukey's Studentized Range (HSD) Test for length

NOTE: This test controls the Type I experimentwise error rate, but it generally

has a higher Type II error rate than REGWQ.

Alpha 0.05

Error Degrees of Freedom 56

Error Mean Square 415.9738

Critical Value of Studentized Range 3.74468

Minimum Significant Difference 19.72

Means with the same letter are not significantly different.

Tukey Grouping Mean N rating

A 123.400 15 PG-13

A

A 111.200 15 R

A

A 106.933 15 PG

B 80.600 15 G

(c) (3 marks) Give a complete conclusion from your analysis, in the context of the data.

Solution: First, check the ANOVA table and report on what you see there. Then, if the P-value from the ANOVA table is small, go on to the Tukey results and say what you see there.

The P-value in the ANOVA table is very small (less than 0.0001), so the mean lengths of movies of the four different ratings are not all equal (or, if you like, at least one of them is different).

Since there are differences to find, we need to go on to Tukey to find out where they lie. (You need to say why you are looking at the Tukey.) This says that the G-rated movies are shorter (have smaller mean length) than the others, which are not significantly different from each other.

(d) (4 marks) Read the data into R, carry out an ANOVA and a Tukey analysis (if warranted), and verify that the results in R and SAS are the same.


Solution: This seems like a lot of work, but just follow the steps through.

First, read csv:

my_url="http://www.utsc.utoronto.ca/~butler/c32/movie-lengths.csv"

movies=read_csv(my_url)

## Parsed with column specification:

## cols(

## length = col_integer(),

## rating = col_character()

## )

movies

## # A tibble: 60 x 2

## length rating

## <int> <chr>

## 1 25 G

## 2 75 G

## 3 88 G

## 4 63 G

## 5 76 G

## 6 97 G

## 7 68 G

## 8 82 G

## 9 98 G

## 10 74 G

## # ... with 50 more rows

For yourself (optional), you might want to verify that you have the right number of movies of each rating:

movies %>% count(rating)

## # A tibble: 4 x 2

## rating n

## <chr> <int>

## 1 G 15

## 2 PG 15

## 3 PG-13 15

## 4 R 15

I do. Now to the ANOVA and Tukey (that we do need):

length.1=aov(length~rating,data=movies)

summary(length.1)

## Df Sum Sq Mean Sq F value Pr(>F)

## rating 3 14624 4875 11.72 4.59e-06 ***

## Residuals 56 23295 416

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This P-value is 0.00000459, which is indeed less than 0.0001, so it’s consistent with SAS.

Having rejected the null (which said “all means equal”), we need to do Tukey, thus:


TukeyHSD(length.1)

## Tukey multiple comparisons of means

## 95% family-wise confidence level

##

## Fit: aov(formula = length ~ rating, data = movies)

##

## $rating

## diff lwr upr p adj

## PG-G 26.333333 6.613562 46.053104 0.0044541

## PG-13-G 42.800000 23.080229 62.519771 0.0000023

## R-G 30.600000 10.880229 50.319771 0.0007379

## PG-13-PG 16.466667 -3.253104 36.186438 0.1327466

## R-PG 4.266667 -15.453104 23.986438 0.9397550

## R-PG-13 -12.200000 -31.919771 7.519771 0.3660019

This looks different from SAS, since it gives a test for each pair of groups. The three significant tests all involve rating G (mean length less than that of the other ratings), and none of the other three ratings have significantly different mean lengths from each other. So, with a little work, we found exactly the same conclusion as from SAS.
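If you prefer a picture, the Tukey object can also be plotted (a sketch):

plot(TukeyHSD(length.1))

Any interval not crossing 0 corresponds to a significant difference; it should be the three intervals involving G.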

Notes

1. That is, before looking at the data. This is Latin. It’s also the place that the Bayesian “prior distribution” comes from. The “posterior distribution” comes from the Latin a posteriori, which means “afterwards”, that is, after you have looked at the data.

2. The two groups have very different spreads, but that is not a problem as long as we remember to do the Welch-Satterthwaite test that does not assume equal spreads. This is the default in R, so we are good, at least with that.

3. There was, in the chisq.test inside median_test, but in smmr I didn’t pass that warning back to the outside world.

4. “Conservative” in statistics means that it may fail to declare some actually significant results significant: “playing it safe”, you might say, thinking of rejecting the null as something you only want to do if you’re sure it’s the right thing. That is to say, the correct P-value is smaller than the one given, but we don’t know how much smaller. This particular adjustment is called a Bonferroni correction, and applies any time you are doing several tests at once. There are better but more complicated adjustments; indeed, one of the reasons Tukey developed his procedure is that he wanted to avoid using something like Bonferroni in an ANOVA context where it is possible to do better.

5. proc means knows about a number of statistics: see http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000146729.htm. Look for “statistic-keywords” near the bottom.

6. Except in the case of the previous problem, where there were multiple observations equal to the overall median. Which we ignore for the moment.
