
Stat2

Building Models for a World of Data

Minitab Companion

Ann R. Cannon, Cornell College

George W. Cobb, Mount Holyoke College

Bradley A. Hartlaub, Kenyon College

Julie M. Legler, St. Olaf College

Robin H. Lock, St. Lawrence University

Thomas L. Moore, Grinnell College

Allan J. Rossman, California Polytechnic State University

Jeffrey A. Witmer, Oberlin College

W. H. Freeman and Company, New York

© 2013 by W. H. Freeman and Company

ISBN-13: 978-1-4641-0269-1

ISBN-10: 1-4641-0269-4

All rights reserved

Printed in the United States of America

First printing

W. H. Freeman and Company, 41 Madison Avenue, New York, NY 10010

Houndmills, Basingstoke RG21 6XS, England

www.whfreeman.com

Contents

1 Introduction to Minitab
    1.1 Minitab Basics
    1.2 Graphs
    1.3 Calculations
    1.4 Data manipulation

2 Regression and Correlation
    2.1 Simple Linear Regression
    2.2 Inference for Simple Linear Regression
    2.3 Multiple Regression
    2.4 Additional Topics in Regression

3 ANOVA
    3.1 One-way ANOVA - Chapter 5
    3.2 Two-way ANOVA - Chapter 6
    3.3 Additional ANOVA Topics - Chapter 7

4 Logistic Regression
    4.1 Logistic Regression and Odds - Chapter 9
    4.2 Multiple Logistic Regression - Chapter 10
    4.3 Logistic Regression: Additional Topics - Chapter 11

5 Randomization, Bootstrapping, and Macros
    5.1 Running Macros in Minitab
    5.2 Structure of the Macros

Index


CHAPTER 1

Introduction to Minitab

1.1 Minitab Basics

Data Basics

When you open Minitab for the first time you will see a screen similar to that shown in Figure 1.1. You should see two main windows, the menus along the top, and an icon on the bottom left of the screen that says “project.” The top window is called the session window, and this is where the numerical results will be shown. The bottom window is called the worksheet. This looks like an empty spreadsheet and is where the data are stored. For the rest of this section we will concentrate on the worksheet window.

Figure 1.1: Screen shot of Minitab


Data storage

First, notice that each column has a heading, starting with “C1.” In Minitab, the columns are the variables and the rows are the individuals. Minitab gives all variables a default name of C with a number. Basically “C1” stands for column number 1, but you can give the variables any name that you want. Note that it is best to stick with letters and numbers (not other special characters) when naming your variables. Just type the appropriate name in the gray box directly beneath the column number (be sure to put it in the gray box, not any of the white boxes where data will eventually be entered). You can enter as many variables as you like (well, it can’t be more than 4000, but that is probably enough for anything you might do in STAT 2).

Now you are ready to enter data. This works just like any other spreadsheet. Note that if you enter something other than a number, the column heading will change from, say, “C1” to “C1-T.” This is Minitab’s way of telling you that the data are not numerical but rather text. Even if you add other values later (or earlier) in the column that are numerical, Minitab will read those numbers as if they are characters. If you have missing data for a numeric variable, either enter an asterisk (*) or skip the cell and Minitab will automatically enter the symbol. If you have missing data for a text variable, just skip the cell (leave it blank) where the data are missing.

In the next subsection we discuss all of the different data types (most often you only deal with either numerical or text data) and how to force Minitab to change data types.

Data types

There are three basic data types that Minitab recognizes: Numeric, Text, and Date/Time. In this course you will likely be using only the first two, so this is what we will concentrate on. As was stated earlier, Minitab does its best to decide which type of data a particular variable represents. If there are only numbers in the column, Minitab will treat the data as numeric. If, however, there are characters anywhere in the column, Minitab will automatically change the data type to text and will add “-T” to the heading of the column. You may also notice that Minitab right justifies numeric data and left justifies text data. Sometimes the change to text will happen by mistake. For instance, if you accidentally type the letter “O” instead of the number “0,” Minitab will now see a character in the column and change the data type to text. Now all numbers that appear in the column will be read by the computer as characters rather than numbers. It could also happen if you enter any symbol other than “*” for missing data (like “NA”).
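The rule described above can be sketched outside Minitab. The short Python example below is only an illustration of the logic, not Minitab’s actual implementation; the function names `is_numeric_cell` and `column_type` are our own and do not correspond to anything in Minitab.

```python
def is_numeric_cell(cell: str) -> bool:
    """Return True if the cell holds a number or the missing-value symbol '*'."""
    if cell == "*":
        return True
    try:
        float(cell)
        return True
    except ValueError:
        return False


def column_type(cells: list[str]) -> str:
    """A column counts as numeric only if every cell is a number or '*'."""
    return "numeric" if all(is_numeric_cell(c) for c in cells) else "text"


print(column_type(["1.2", "0", "*"]))    # numeric
print(column_type(["1.2", "O", "3"]))    # text (letter O instead of zero)
print(column_type(["1.2", "NA", "3"]))   # text (NA is not the '*' symbol)
```

A single stray character is enough to turn the whole column into text, which is why the column heading changes to “C1-T.”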

So what do you do if you accidentally enter a character for a numeric variable, or you would like to change a numeric variable to a text variable? Let’s say that you accidentally typed the letter “O” rather than the number “0.” Once you do that, Minitab changes the variable type to text. Even if you correct your mistake, and there are no more characters in the column, Minitab will stick with the text designation until you tell it otherwise. To illustrate this, we use the Diamonds data set. Open this data set. If you click on Data>Change Data Type and choose the appropriate type of change (which from our discussion above is Text to Numeric), the dialogue box in Figure 1.2 comes up. Obviously neither of the two variables listed is really numeric. But if you had one that should be, list it in both the “Change text columns” and “Store numeric columns in” boxes. If you still have characters in some of the cells in that column, they will automatically be changed to missing values.

Figure 1.2: Dialogue box for changing the variable type
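To see what the Text to Numeric change does to cells that still hold characters, here is a minimal Python sketch. It assumes that missing values are represented by `None`; the function name `text_to_numeric` is hypothetical and is not a Minitab command.

```python
def text_to_numeric(cells):
    """Convert text cells to numbers; cells that are not numbers become None (missing)."""
    converted = []
    for cell in cells:
        try:
            converted.append(float(cell))
        except ValueError:
            # A leftover character, such as the letter O, cannot be read as a number.
            converted.append(None)
    return converted


print(text_to_numeric(["0.5", "O", "1.25"]))  # [0.5, None, 1.25]
```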

Storing information about the data

It is often a good idea to store information about the data. For instance, variable names are not always descriptive enough to clearly explain what they measure. So we might like to store a sentence or two that describes the specifics about a variable. To illustrate how to store this information, we start with a very basic data set. We have two variables, Age and Height, and three observations. The worksheet for this data set is shown in Figure 1.3.

Figure 1.3: Example worksheet

Suppose we would like to remember that the heights are measured in inches. To store this information, start by opening the project manager window. This is accomplished by clicking on the icon on the lower-left portion of the screen labeled “Project” and clicking restore (or clicking on the “restore up” button on the icon). Now you should have the window shown in Figure 1.4. On the left-hand side of this window, click on the word “Columns” under “Worksheet 1.” The resulting window is shown in Figure 1.5(a). Now right-click on the variable name Height and choose “Set Description.” This brings up the window in Figure 1.5(b) where we have entered the sentence “Height is measured in inches.” General information about the whole worksheet can be added by clicking on “Worksheets” in the left-hand pane, right-clicking on “Worksheet 1” in the right-hand pane, and choosing “Set Description.”

Figure 1.4: Project Manager window

(a) Project Manager window showing column information

(b) Window where you can enter a variable description

Figure 1.5: Adding a description to a column of data


Now that you have added this information, you probably want to know how to see it again. If you look at the worksheet, you will notice that a red triangle has been added to the box with the variable name in it. If you hover your mouse over that red triangle, you will see the additional information. If you have added information about the worksheet in general, a red triangle will appear in the blank gray box in the left-most column above the box with the number 1 in it. This is illustrated in Figure 1.6.

Figure 1.6: Worksheet with triangle showing that extra information for the variable is available

File types

There are two types of files that we work with most often. The first is called a worksheet and carries the extension .mtw. This type of file saves only the information given in the worksheet window. That is, it saves the variable names, the data values, and any notes that might have been added to the worksheet (seen as red triangles in the worksheet). This type of file is best for sharing data with others. You will note that all of the data sets that come with this book are stored as .mtw files.

The second type of file is called a project and carries the extension .mpj. This type of file saves everything that is part of the Minitab desktop. Basically it saves a snapshot of your entire session. Any graphs that are open are saved (including those that have been minimized and are not currently visible), everything in the session window is saved, and all worksheets that are open (or minimized) will be saved. This type of file is most useful for someone who has started an analysis and will be coming back to it, or for someone who wants to share their analysis with others. Note that the file does not save graphs that have been deleted (or closed).


Data retrieval

Data files that have already been saved can also be used in Minitab. For instance, all of the data necessary for the exercises and examples in this book have been saved in Minitab format. To access a saved Minitab file, use the menus at the top of the program and click File>Open Project if the file is a Minitab project with the .mpj extension. If the file is a Minitab worksheet with an .mtw extension then use File>Open Worksheet. In either case, choose the appropriate file and click Open.

1.2 Graphs

Minitab will let you create almost any kind of statistical graph that you will encounter in a beginning or second statistics course. In this section we take you through creating the five most basic graphs found in an introductory statistics course. Later in this manual we will describe how to use Minitab to make graphs that are new in this second course in statistics.

We will use two data sets to illustrate these commands. The first data set is called Cereal. This data set gives the number of calories, the number of grams of sugar, and the number of grams of fiber, all per serving, for 36 different breakfast cereals. The second data set we will use is called FirstYearGPA. This data set gives several measurements on 219 first-year college students at a liberal arts college. Some of the quantitative variables of interest include first-year GPA, high school GPA, SAT verbal score, and SAT math score. Some of the categorical variables are whether they are male or not, whether they are in the first generation of their family to attend college or not, and whether they are white or not.

Most graphs are found on the Graph menu.

Histogram

To create a histogram, click Graph>Histogram. This brings up a dialogue box with several choices for types of histogram. We will stick with the choice in the upper-left-hand corner, “Simple,” which is the default (shown in Figure 1.7). This dialogue box is typical of the first dialogue box for most graph choices. We show it here, but from now on will only refer to such dialogue boxes as the “initial dialogue box” for a particular graph.

Once you have verified that “Simple” is chosen, click “OK.” This brings up a second dialogue box, which allows you to choose which variable to plot. Select the appropriate variable from the list of variables given and either double-click on it or click once on it and then click on the “Select” button. Either method will move that variable into the “Graph variables” box on the right side of the dialogue box. In this example we have chosen to graph the variable Sugar. Figure 1.8(a) shows what this dialogue box looks like. The resulting histogram is shown in Figure 1.8(b).


Figure 1.7: Dialogue box to choose which type of histogram to plot

(a) Dialogue box to choose which variable to graph

(b) Histogram of the number of grams of sugar per serving in 36 breakfast cereals

Figure 1.8: Histogram dialogue box and resulting graph

Once you have created a histogram, you may wish to modify it. Minitab has its own way of choosing how many bins to use, but you may wish to use a different number of bins. The easiest way to modify a graph is to right-click on the region of the graph that you wish to change. For example, if you wish to change the number of bins, put your mouse over one of the bars and right-click. This will bring up a menu with one option being “Edit bars” (the third option from the top). Click on the “Edit bars” option to bring up the dialogue box shown in Figure 1.9. Click on the last tab at the top marked “Binning.” There are two choices here. Either you can enter the number of intervals that you want the data divided into (in this case the computer automatically divided it into 9 bins), or you can specify all of the midpoints of the intervals (or the cutpoints of the intervals). Choose which way you want to change the binning structure, click on the appropriate button(s), and put in the appropriate numbers. For illustration we have chosen to reduce the number of bins to 6 (see Figure 1.10(a)). After clicking “OK,” the histogram shown in Figure 1.10(b) is created.


Figure 1.9: Edit bars for a histogram dialogue box

(a) Binning dialogue box for histogram using 6 bins

(b) Histogram of the number of grams of sugar per serving in 36 breakfast cereals, using 6 bins

Figure 1.10: Histogram dialogue box and resulting graph using 6 bins
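Choosing the number of intervals amounts to cutting the range of the data into that many pieces and counting the values in each. The Python sketch below shows one simple way to do this with equal-width bins. It is only an illustration: the sugar values are made up, the function name `bin_counts` is our own, and Minitab’s own choice of cutpoints may differ.

```python
def bin_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes the values are not all identical
    counts = [0] * n_bins
    for v in values:
        # The largest value is placed in the last bin rather than opening a new one.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts


sugar = [0, 1, 3, 3, 5, 6, 6, 7, 9, 10, 12, 12]  # hypothetical grams of sugar
print(bin_counts(sugar, 6))  # [2, 2, 1, 3, 1, 3]
```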

There are several other options for graph modification that come in handy. If you right-click with your mouse pointed at either the x- or y-scale, you can change the way the scale looks (where tick marks occur, etc.). Choose “Edit x-scale” or “Edit y-scale” from the menu that comes up when you right-click. If you right-click on either the graph label or the axis labels, you can also edit (or delete) what Minitab puts there by default. For example, you will note that in the textbook, we have deleted all of the titles that Minitab puts on graphs because we are using figure captions below the graphs anyway. These options are typical of most graphs that Minitab creates.


Dotplot

We next turn to the dotplot. Using the same Cereal data, we create a dotplot by clicking Graph>Dotplot. This brings up a dialogue box asking which type of dotplot we would like. To create a dotplot for one variable, we want a “Simple” dotplot for “One Y.” This is the default selection, so just click “OK” on the initial dialogue box. This brings up a new dialogue box, shown on the left in Figure 1.11. The resulting dotplot is shown on the right in Figure 1.11.

Figure 1.11: Dotplot dialogue box and resulting graph

As with histograms, if you right-click on areas of the dotplot, you can change the appearance. The most common parts of the graph to edit are the title, the x-axis label, and the scale of the x-axis.

If you have two variables that you want to represent on a dotplot, you need to think about the information those two variables give you. One possible situation is when one variable is the response variable of interest and the other variable is a categorical explanatory variable that breaks the responses into groups. In the FirstYearGPA data, for instance, we might wish to compare the distributions of the first-year college GPA for males and females. The variable GPA is the response in this case and Male is the explanatory variable (Male = 1 for men and Male = 0 for women). To create this dotplot, in the first dialogue box choose One Y>With Groups as shown in Figure 1.12.

Figure 1.12: Choose which type of dotplot to create: One Y and With Groups


This brings up the dialogue box shown in Figure 1.13(a). Notice that we have put the response variable in the box titled “Graph variables” and the explanatory variable (grouping variable) in the box marked “Categorical variables for grouping.” The resulting dotplot is shown in Figure 1.13(b).

(a) Choose which variables to graph

(b) Dotplots of first-year college GPA broken down by sex

Figure 1.13: Specifying dotplots for a quantitative response by categorical explanatory groups
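The “With Groups” layout relies on the data being stored with all responses in one column and a second column that says which group each response belongs to. The following Python sketch shows how such a pair of columns can be split into groups. The GPA values are hypothetical and the function name `split_by_group` is our own.

```python
from collections import defaultdict


def split_by_group(responses, groups):
    """Collect the response values separately for each value of the grouping variable."""
    by_group = defaultdict(list)
    for y, g in zip(responses, groups):
        by_group[g].append(y)
    return dict(by_group)


gpa = [3.1, 2.8, 3.6, 3.3, 2.9]   # hypothetical first-year GPAs
male = [0, 1, 0, 1, 1]            # 1 = male, 0 = female, as in FirstYearGPA
print(split_by_group(gpa, male))  # {0: [3.1, 3.6], 1: [2.8, 3.3, 2.9]}
```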

The other case is illustrated in the Cereal data. There we might wish to compare the distributions of the amount of sugar per serving and the amount of fiber per serving. In this case we have two variables that we want to compare in dotplots using the same scale (both are measured in grams per serving). To accomplish this, click on Graph>Dotplot as before. This time, choose Multiple Y’s>Simple (see Figure 1.14). This brings up the same dialogue box as seen on the left in Figure 1.11. Put both variables of interest into the “Graph variables” box as shown in Figure 1.15(a). The resulting dotplot is shown in Figure 1.15(b).

Figure 1.14: Choose which type of dotplot to create: Multiple Y’s and Simple


(a) Dialogue box to graph multiple dotplots

(b) Dotplots comparing amounts of Sugar and Fiber per serving

Figure 1.15: Dialogue box for dotplots of several variables and resulting graph

Boxplot

The steps required to create a boxplot are basically the same as those required to create a dotplot. We will illustrate creating boxplots using the FirstYearGPA data. In this case click Graph>Boxplot. This brings up the initial dialogue box. Again there are two basic choices: Do you have one response variable (“One Y”) or more than one response variable (“Multiple Y’s”)? We start with the simple situation of one response variable and no grouping variable. In this case click on One Y>Simple. This brings up the second dialogue box where you choose which variable to graph. We have chosen to create a boxplot of GPA, so Figure 1.16(a) shows this second dialogue box with GPA moved to the “Graph variables” box. The final boxplot is shown in Figure 1.16(b).

(a) Choose the variable

(b) Boxplot of first-year college GPA

Figure 1.16: Dialogue box for simple boxplot and resulting graph
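A boxplot is built from the five-number summary of the data. As a rough check of what goes into such a plot, the Python sketch below computes a five-number summary with the standard `statistics` module. The GPA values are hypothetical, and the quartile rule used by `statistics.quantiles` (its default “exclusive” method) is an assumption here; we do not claim that it matches Minitab’s quartiles in every case.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)


gpa = [2.4, 2.8, 2.9, 3.1, 3.3, 3.6, 3.9]  # hypothetical first-year GPAs
print(five_number_summary(gpa))  # (2.4, 2.8, 3.1, 3.6, 3.9)
```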


As with dotplots, we can also put several boxplots on the same set of axes. Again you need to think about the structure of the data. If you have one response variable and one explanatory variable that divides the observations into groups, then you will choose One Y>With Groups in the initial dialogue box. Now the second dialogue box that appears is the one shown in Figure 1.17(a). We have illustrated the use of this dialogue box by choosing GPA as our response variable (put into the “Graph variables” box) and Male as our grouping variable (put in the “Categorical variables for grouping” box). The resulting graph is shown in Figure 1.17(b).

(a) Choose the variable and groups

(b) Boxplots of GPA based on sex

Figure 1.17: Specifying boxplots for a quantitative response by categorical explanatory groups

If you have two separate response variables that you wish to plot together, then, in the initial dialogue box, choose Multiple Y’s>Simple. This brings up the dialogue box shown in Figure 1.18(a). We have entered the variables SATV and SATM to compare the distributions of the verbal and math SAT scores. The resulting boxplot is shown in Figure 1.18(b).

(a) Dialogue box for choosing several response variables to graph in boxplots

(b) Boxplots of the SAT verbal and math scores on the same set of axes

Figure 1.18: Dialogue box for boxplots of several response variables and resulting graph


Once again, if you right-click on specific areas of the boxplot, you can change the appearance. The most common parts of the graph to edit are the title, the x-axis and y-axis labels, and the scales of the x-axis and y-axis.

Normal Probability Plot

The next graph that we address is a normal probability plot. Perhaps we would like to see if the number of grams of sugar per serving in breakfast cereals can be modeled by a normal distribution. Open the data file Cereal. Click on Graph>Probability Plot. (Note that we did not say click on Graph>Probability Distribution Plot.) This brings up the initial dialogue box. Choose “Single” and click “OK.” This brings up the second dialogue box, which is shown in Figure 1.19(a). Note that we have selected the variable Sugar as the variable for the graph. The resulting normal probability plot is shown in Figure 1.19(b).

(a) Dialogue box to choose the variable to graph for a normal probability plot

(b) Normal probability plot of amount of sugar per serving for a sample of breakfast cereals

Figure 1.19: Dialogue box for normal probability plot and resulting graph
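The idea behind a normal probability plot is to pair each ordered data value with the value we would expect at that position if the data came from a normal distribution. The Python sketch below illustrates this pairing. The plotting position (i - 0.375)/(n + 0.25) is one common convention that we assume here for illustration, the data are made up, and the function name `normal_scores` is our own; Minitab’s plot may use different details and shows percentages on the vertical axis.

```python
from statistics import NormalDist


def normal_scores(values):
    """Pair each sorted value with a standard normal quantile for its rank."""
    n = len(values)
    ordered = sorted(values)
    # Assumed plotting position for the i-th smallest value: (i - 0.375) / (n + 0.25)
    scores = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    return list(zip(ordered, scores))


sugar = [3, 12, 6, 0, 9]  # hypothetical grams of sugar
for value, score in normal_scores(sugar):
    print(value, round(score, 2))
```

If the points (value, score) fall close to a straight line, a normal model is plausible for the data.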

The “95% CI” bands above and below the line in Minitab’s normal plot help judge when points stray from the expected linear pattern (see Exercise 2.43 on page 93 of the text for more on this).

You may notice a difference between this plot and the normal probability plots that we have shown in the text. For the text figures, we have deleted both the horizontal and vertical reference lines. To do this, right-click on one set of lines and click on “Delete.” Then do the same thing for the lines in the other direction.

Bar Chart

To compare the numbers of observations in various categories, we often use a bar chart. To illustrate this we use the data file TipJoke. In this example we are interested in people’s tipping behavior in a restaurant based on whether their waiter left them a card with a joke on it, a card with an advertisement on it, or left no card at all. The Card variable has three values: Joke, Ad, and None. To create a bar chart showing how many times each kind of card (or none) was left, click Graph>Bar Chart and choose “Counts of unique values” and “Simple.” This brings up the dialogue box in Figure 1.20(a). Move the name of the variable you are interested in (Card in this case) to the box marked “Categorical variables” and click “OK.” The resulting graph is shown in Figure 1.20(b).

(a) Dialogue box for a bar chart

(b) Bar chart of the tipping conditions

Figure 1.20: Dialogue box for bar chart and resulting graph
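The “Counts of unique values” option simply tallies how often each category occurs. A minimal Python sketch of the same tally is shown below; the list of card values is hypothetical and is not the TipJoke data.

```python
from collections import Counter

# Hypothetical values of the Card variable, one entry per table served.
card = ["Joke", "Ad", "None", "Joke", "None", "Ad", "Joke"]

counts = Counter(card)
for category in ["Joke", "Ad", "None"]:
    print(category, counts[category])
```

Each count becomes the height of one bar in the chart.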

We can also create a stacked bar chart (often referred to as a segmented bar chart). In the example above, we want to know how many people leave tips under the three different conditions. A stacked bar chart will help us see if there is a difference among the three conditions. In this case, click Graph>Bar Chart, again choose “Counts of unique values,” but this time choose “Stack” instead of “Simple.” In the corresponding dialogue box (see Figure 1.21(a)), put both variables of interest into the “Categorical variables” box with the variable you want represented on the x-axis first, and click “OK.” The resulting graph is shown in Figure 1.21(b).

(a) Dialogue box for a stacked bar chart

(b) Bar chart of conditions stacked by whether people tipped or not

Figure 1.21: Dialogue box for bar chart and resulting graph
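A stacked bar chart is a picture of a two-way table of counts. The Python sketch below builds such a table by counting pairs of values. Both lists are hypothetical, and the name `tipped` for the second variable is our own choice for illustration rather than a column name taken from the data file.

```python
from collections import Counter

# Hypothetical data: the card condition and whether a tip was left.
card = ["Joke", "Ad", "None", "Joke", "None", "Ad", "Joke"]
tipped = ["Yes", "No", "No", "Yes", "Yes", "No", "No"]

# Count each (card, tipped) combination; each bar stacks the counts for one card value.
stacked = Counter(zip(card, tipped))
for condition in ["Joke", "Ad", "None"]:
    print(condition, "Yes:", stacked[(condition, "Yes")], "No:", stacked[(condition, "No")])
```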


1.3 Calculations

Minitab will also perform almost any kind of statistical computation that you will encounter in a beginning statistics course. In this section we take you through three different kinds of computations that you are already familiar with. Later in this manual we will describe how to use Minitab to perform computations that are new in this second course in statistics. Note that all output discussed in this section will appear in the session window.

Again we will use the Cereal and FirstYearGPA data sets to illustrate the use of Minitab.

Most statistical computations are found on the Stat menu.

Descriptive Statistics

Nearly all of the basic descriptive statistics that one would wish to calculate (mean, median, standard deviation, quartiles, etc.) are located in one place in Minitab. Click on Stat>Basic Statistics>Display Descriptive Statistics to bring up the main dialogue box shown in Figure 1.22(a). We are illustrating the use of this with the Cereal data, having Minitab compute descriptive statistics for the Calories variable. In Figure 1.22(a) we have moved the variable Calories to the “Variables” box.

(a) Main dialogue box for computing descriptive statistics

(b) Dialogue box to choose which statistics to compute

Figure 1.22: Dialogue boxes for computing descriptive statistics


Now that we have told Minitab which variable to work with, we need to tell it which statistics to calculate. In the same dialogue box (Figure 1.22(a)) click on the “Statistics” button. This brings up a second dialogue box, which is shown in Figure 1.22(b). Typically Minitab has a certain number of the options checked by default, depending on how a user has it set up on their machine.¹ Figure 1.22(b) shows the default selections for a computer used by one of the authors. Check the statistics that you would like to calculate and uncheck those that you do not need. The results appear in the Session window of Minitab. For this example, the output that appears in the window is shown below (based on the selections shown in Figure 1.22(b)).

Descriptive Statistics: Calories

Variable   N  N*    Mean  SE Mean  StDev  Minimum     Q1  Median      Q3  Maximum
Calories  36   0  101.60     3.69  22.16    50.00  90.00  104.00  110.00   160.00
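The columns of this output are related to one another. In particular, the standard error of the mean is the standard deviation divided by the square root of the sample size. The short Python check below uses only the values of N and StDev reported above.

```python
import math

n = 36          # N from the output
stdev = 22.16   # StDev from the output

# Standard error of the mean: StDev / sqrt(N)
se_mean = stdev / math.sqrt(n)
print(round(se_mean, 2))  # 3.69
```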

One-Sample t-Test

We now turn to the FirstYearGPA example to illustrate the use of the one-sample t-test in Minitab. We have a sample of first-year GPA’s, and an administrator might wish to see if we have significant evidence that the population mean is bigger than 3.0. The sample size is 219, so a t-test is appropriate as long as we don’t have extreme outliers. The histogram of the GPA’s in Figure 1.23 shows that there are no extreme outliers, so we can proceed with the test.

Figure 1.23: Histogram of GPA’s for 219 first-year students

Click on Stat>Basic Statistics>1-Sample t to bring up the dialogue box shown in Figure 1.24(a). Since we are interested in GPA’s, move the variable GPA to the “Samples in columns” box. We want to perform a hypothesis test, so check the “perform hypothesis test” box and enter the null value in the “Hypothesized mean” box. At this point we have told the computer what variable we want to perform the test on and what the hypothesized value is. But we have not yet told it whether this is a one-sided or two-sided test. So, now click on the “Options” button to bring up a second dialogue box, shown in Figure 1.24(b). This procedure computes both the hypothesis test and the equivalent confidence interval (note that if you ask for a one-sided test you will also get a one-sided interval, something not typically covered in an introductory statistics course). In this case we want a one-sided test with the alternative being that the mean is greater than 3.0, so choose the “greater than” option from the drop-down menu for the “Alternative” box as shown in Figure 1.24(b).

¹ Use Tools>Options>Individual Commands>Display Descriptive Statistics to modify your defaults.

(a) Main dialogue box for 1-sample t-test

(b) Second dialogue box for 1-sample t-test

Figure 1.24: Dialogue boxes for 1-sample t-test

Now click “OK” in both dialogue boxes. The resulting output is given below.

One-Sample T: GPA

Test of mu = 3 vs > 3

                                        95% Lower
Variable    N    Mean   StDev  SE Mean      Bound     T      P
GPA       219  3.0962  0.4655   0.0315     3.0442  3.06  0.001
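The T value in this output can be reproduced from the summary statistics: it is the difference between the sample mean and the hypothesized mean, divided by the standard error. The Python sketch below does this calculation. For the P-value it uses a normal approximation, which is our simplification (Minitab uses the t distribution); with 219 observations the two are very close.

```python
import math
from statistics import NormalDist

n, mean, stdev = 219, 3.0962, 0.4655   # from the output above
mu0 = 3.0                              # hypothesized mean

se = stdev / math.sqrt(n)
t = (mean - mu0) / se

# One-sided (greater than) P-value, approximated with the standard normal distribution.
p_approx = 1 - NormalDist().cdf(t)

print(round(se, 4), round(t, 2))  # 0.0315 3.06
print(round(p_approx, 3))         # 0.001
```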

Paired t-Test and Two-Sample t-Test

In this section we deal with the two types of two-sample tests: independent samples and paired samples. We begin with paired samples.

Paired t-test

Do people generally score higher on one particular portion of the SAT (verbal or math) or are the scores, on average, the same? This is a question we can attempt to answer using the FirstYearGPA data. We have both verbal and math SAT scores for 219 first-year students. We need to use a paired t-test here because each person has both a verbal and a math SAT score. We will check the normality condition as a part of running the test in Minitab. Click on Stat>Basic Statistics>Paired t to bring up the dialogue box shown in Figure 1.25. We have arbitrarily chosen SATV as the first variable and SATM as the second. You can assign them in either order. Note that when Minitab computes the differences, it subtracts the second variable from the first. Now click on the “Graphs” button to bring up a second dialogue box, shown in Figure 1.26(a). Click on the “Histogram of differences” choice to get the required histogram and then click “OK” in this box. Now you are back to the main dialogue box. Click on the “Options” button to bring up one more dialogue box, shown in Figure 1.26(b). In this example we want the null hypothesis value to be 0 (the default value) and we want a two-sided test (also the default) so you can just click “OK.” Finally, click “OK” in the main dialogue box.

Figure 1.25: Main dialogue box for paired t-test

(a) Dialogue box to check conditions for paired t-test

(b) Dialogue box to set test parameters for paired t-test

Figure 1.26: Secondary dialogue boxes for paired t-test

The t-test output is given here and the resulting histogram is shown in Figure 1.27.


Paired T-Test and CI: SATV, SATM

Paired T for SATV - SATM

              N    Mean  StDev  SE Mean
SATV        219  605.07  83.39     5.64
SATM        219  634.29  75.24     5.08
Difference  219  -29.22  77.48     5.24

95% CI for mean difference: (-39.54, -18.91)
T-Test of mean difference = 0 (vs not = 0): T-Value = -5.58  P-Value = 0.000

Figure 1.27: Histogram of the differences SATV − SATM
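A paired t-test is a one-sample t-test applied to the differences. The Python sketch below reproduces the T-Value from the summary numbers in the output, using SATV minus SATM as the difference. It does not reproduce the confidence interval, which requires a critical value from the t distribution.

```python
import math

n = 219
mean_diff = 605.07 - 634.29   # mean of SATV minus mean of SATM
sd_diff = 77.48               # StDev of the differences from the output

se_diff = sd_diff / math.sqrt(n)
t = mean_diff / se_diff

print(round(mean_diff, 2), round(se_diff, 2), round(t, 2))  # -29.22 5.24 -5.58
```

Note that the standard deviation of the differences (77.48) cannot be obtained from the two individual standard deviations alone, which is why the paired analysis works with the column of differences.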

Two-sample t-test

Continuing with the FirstYearGPA data, we might wish to test to see if there is a difference between the mean GPA of men and women. Since these are two independent samples, we need to use the t-test designed for two independent samples, typically called the two-sample t-test. In Minitab, click on Stat>Basic Statistics>2-Sample t to bring up the main dialogue box shown in Figure 1.28. Our data are arranged so that all of the GPA’s are in one column marked GPA and the variable Male divides those GPA’s into two groups. If Male = 0 the GPA belongs to a female, and if Male = 1 the GPA belongs to a male. This means that we need to use the top options in the dialogue box shown in Figure 1.28. The “Samples” box is for the response variable (GPA here) and the “Subscripts” box is for the grouping variable (Male in this example). Generally we do not require the assumption of equal variances for the two groups, but if you wish to run the test under that assumption, there is a box on this main dialogue box that you can check.


Figure 1.28: Main dialogue box for 2-sample t-test

Now click on the “Graphs” button to bring up a second dialogue box, shown in Figure 1.29(a). Check the box marked “Boxplots of data” to check the conditions of the test and then click “OK.” Next, click on the “Options” button in the main dialogue box to bring up one more dialogue box, shown in Figure 1.29(b). In this example we want the null hypothesis value to be 0 (the default value) and we want a two-sided test (also the default) so you can just click “OK.” Finally, click “OK” in the main dialogue box.

(a) Dialogue box to check conditions for two-sample t-test

(b) Dialogue box to set test parameters for two-sample t-test

Figure 1.29: Secondary dialogue boxes for two-sample t-test

The two-sample t-test output is shown here and the resulting boxplots appear in Figure 1.30.


Two-Sample T-Test and CI: GPA, Male

Two-sample T for GPA

Male N Mean StDev SE Mean

0 117 3.073 0.466 0.043

1 102 3.122 0.465 0.046

Difference = mu (0) - mu (1)

Estimate for difference: -0.0492

95% CI for difference: (-0.1736, 0.0752)

T-Test of difference = 0 (vs not =): T-Value = -0.78 P-Value = 0.436 DF = 213

Figure 1.30: Boxplots of male and female GPA’s
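The test statistic in this output can be checked by hand from the summary statistics. The following is a minimal Python sketch, not part of Minitab, that assumes the unpooled (unequal variances) form of the test described above; the variable names are our own.

```python
import math

# Summary statistics copied from the Minitab output (Male = 0, then Male = 1).
n0, mean0, sd0 = 117, 3.073, 0.466
n1, mean1, sd1 = 102, 3.122, 0.465

# Unpooled standard error of the difference in sample means.
v0, v1 = sd0**2 / n0, sd1**2 / n1
se_diff = math.sqrt(v0 + v1)

# Test statistic for the null hypothesis mu(0) - mu(1) = 0.
t_value = (mean0 - mean1) / se_diff

# Welch-Satterthwaite approximation to the degrees of freedom.
df = (v0 + v1) ** 2 / (v0**2 / (n0 - 1) + v1**2 / (n1 - 1))

print(f"t = {t_value:.2f}, approximate df = {df:.2f}")
```

Because the inputs are rounded, the result agrees with the output only to about two decimal places. The approximate degrees of freedom come out slightly above 213, which is consistent with the reported DF = 213 if the fractional part is dropped.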

1.4 Data manipulation

Once you have either opened an existing file or entered your own data, you may find that you need to manipulate the data somehow. This can include working with only a subset of the data, splitting the data into two or more groups, or manipulating the variables themselves to create new variables. To illustrate these techniques we will use the Diamonds data. This data set has information on the number of carats, the color, the clarity, the depth, the total price, and the price per carat for 351 diamonds. Notice that the variables Carat, Depth, PricePerCt, and TotalPrice are quantitative, whereas Color and Clarity are categorical.

Several of the manipulations that we discuss below create new worksheets (while still keeping the original worksheet available for use). When you do any analysis of a worksheet, make sure that it is the worksheet highlighted (the window has a darker blue border rather than a lighter blue border). To switch to a different worksheet within the same project, click on it if a portion is visible or use the top Window menu to select it.

Subset Worksheet

Let’s suppose that for some reason we would like to limit our analysis to those diamonds that have a Clarity rating of “IF.” Click on Data>Subset Worksheet. This brings up the window shown in Figure 1.31(a). We want our subset to only include those diamonds for which the Clarity is listed as “IF.” In the middle box, make sure that “Specify which rows to include” is checked. In the bottom box, choose “Rows that match” and click on the “Condition” button. This brings up a second dialogue box. Notice that in this dialogue box, all of the available variables are listed in the left-hand box. This will be typical of many Minitab dialogue boxes. The variable that we want in this case is Clarity, so either double-click on it or single-click on it and click below the box on the word “Select.” This should move the variable name over to the box in the upper-right part of the dialogue box, under the word “condition.” We want the Clarity to be “IF”, so after the variable name either type the equal sign from your keyboard or click on the one available in the dialogue box. Finally type “IF” using double quotes around it. When you are finished the dialogue box should look like the one in Figure 1.31(b). Now click “OK” in both dialogue boxes. The result will be a new worksheet titled “Subset of Diamonds.MTW.” This worksheet contains only those diamonds for which Clarity is recorded as “IF.”

(a) Creating a subset of a worksheet (b) Setting the subset condition

Figure 1.31: Dialogue boxes for creating a subset of a data set

It should come as no surprise that instead of creating a subset by telling the computer which observations to include, you could instead tell it which ones to exclude. The process is the same as described above except that you choose “Specify which rows to exclude” instead of “Specify which rows to include.” You can also use quantitative variables to subset data. For instance you might use the condition Carat>1 or something more involved such as (Carat>1) and (Carat<1.5).
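For readers who want to see the logic of subsetting outside of the dialogue boxes, here is a small Python sketch of the same three ideas: include rows that match, exclude rows that match, and a compound condition on a quantitative variable. The rows are hypothetical stand-ins for the Diamonds worksheet, not actual values from the file.

```python
# Hypothetical rows standing in for a few lines of the Diamonds worksheet.
diamonds = [
    {"Carat": 1.08, "Color": "E", "Clarity": "VS1"},
    {"Carat": 0.31, "Color": "F", "Clarity": "IF"},
    {"Carat": 1.20, "Color": "D", "Clarity": "IF"},
    {"Carat": 1.60, "Color": "G", "Clarity": "VVS2"},
]

# "Specify which rows to include" with the condition Clarity = "IF".
if_only = [row for row in diamonds if row["Clarity"] == "IF"]

# "Specify which rows to exclude" with the same condition.
not_if = [row for row in diamonds if row["Clarity"] != "IF"]

# A compound condition on a quantitative variable: (Carat>1) and (Carat<1.5).
mid_size = [row for row in diamonds if row["Carat"] > 1 and row["Carat"] < 1.5]

print(len(if_only), len(not_if), len(mid_size))
```

As in Minitab, the original list is left untouched and each subset is a new object.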

Split Worksheet

Sometimes you will find that you want to do a similar analysis on several disjoint subsets of the data, with the subsets defined by categories of one particular variable. In our Diamonds data, we might want to do the same analysis for each of the different Color diamonds in our set. Click on Data>Split Worksheet to bring up the dialogue box shown in Figure 1.32. Move the appropriate variable from the list on the left to the box in the upper right (in this case Color). Now click “OK” in both dialogue boxes. Minitab will create a new worksheet for each of the colors and give these worksheets relevant names like “Diamonds.MTW(Color = D).” You can use each of the worksheets now for individual analyses. Just make sure to have the one you wish to use as the current window (darker blue border) when you click on relevant menu items (use the Window menu to switch).

Figure 1.32: Dialogue box for splitting a worksheet
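Splitting a worksheet amounts to grouping the rows by the categories of one variable and then repeating the same analysis on each group. The Python sketch below illustrates that idea with hypothetical rows; the grouping dictionary by_color is our own name, not a Minitab object.

```python
from collections import defaultdict

# Hypothetical rows standing in for the Diamonds worksheet.
rows = [
    {"Carat": 1.08, "Color": "E"},
    {"Carat": 0.31, "Color": "F"},
    {"Carat": 1.20, "Color": "D"},
    {"Carat": 0.45, "Color": "E"},
]

# One "worksheet" (list of rows) per distinct value of Color.
by_color = defaultdict(list)
for row in rows:
    by_color[row["Color"]].append(row)

# Repeat the same simple analysis on each piece.
for color in sorted(by_color):
    carats = [r["Carat"] for r in by_color[color]]
    mean_carat = sum(carats) / len(carats)
    print(f"Color = {color}: n = {len(carats)}, mean carat = {mean_carat:.3f}")
```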

Stacking and Unstacking Columns

Typically we enter data into a worksheet with each variable in its own column and each observation in its own row. If we have, say, a quantitative response variable and a categorical explanatory variable, the data in this format are called “stacked.” That is, all of the quantitative values are stacked in one column and the variable that divides the quantitative values into groups is in another column. But sometimes we have reason to want all of the quantitative responses for each group in their own column (or the data may come in that way). This format is called “unstacked.” Most Minitab routines expect stacked data, but some assume unstacked, so it pays to be able to go back and forth between the two formats, which we illustrate below.


Unstacking data

Since data usually comes in a stacked format, this is the more typical process that we do. We illustrate with the data file HawkTail. This data has the tail lengths for three different species of hawks. The file HawkTail is in stacked format with one column for the tail lengths and one column for the categorical variable Species. To unstack this data, click on Data>Unstack columns. In the dialogue box that pops up (see Figure 1.33(a)) enter the quantitative variable in the box marked “Unstack the data in” and put the categorical variable in the box marked “Using subscripts in.” A new worksheet will appear with the data unstacked. See Figure 1.33(b).

(a) Dialogue box for unstacking columns (b) Worksheet after unstacking columns

Figure 1.33: Unstacking columns

Stacking data

It is less likely that you will need to stack data, but just in case, we illustrate that procedure with HawkTail2. This is the result of having unstacked the data in HawkTail. To stack it again, click Data>Stack>Columns. This brings up the dialogue box in Figure 1.34(a). Put the columns you wish to stack in the box titled “Stack the following columns” and click “OK.” A new worksheet will be created as seen in Figure 1.34(b). Notice that Minitab uses the original column names as the grouping categories. It also does not give the stacked column a name. It is always best to give it a name so that you can keep track of what the column represents. We would suggest something like length or TailLength in this case.
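The relationship between the two formats can be sketched in a few lines of Python. The species codes and tail lengths below are hypothetical placeholders rather than values from HawkTail; the point is only that unstacking creates one column per group and stacking turns the column names back into grouping labels.

```python
# Stacked format: one (Species, Tail) pair per observation. Values are hypothetical.
stacked = [("RT", 219), ("SS", 150), ("RT", 221), ("CH", 200), ("SS", 155)]

# Unstack: one column (list) of tail lengths per species.
unstacked = {}
for species, tail in stacked:
    unstacked.setdefault(species, []).append(tail)

# Stack again: each column name becomes the grouping label for its values.
restacked = [(species, tail) for species, column in unstacked.items() for tail in column]

print(unstacked)
print(restacked)
```

Note that restacking groups the rows by column, so the original row order is not recovered, only the same set of observations.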

(a) Dialogue box for stacking columns (b) Worksheet after stacking

Figure 1.34: Stacking columns

Sorting Variables

If you would like to sort a variable from smallest to largest (or largest to smallest), click on Data>Sort. The simplest case is when you just have one variable you would like to sort. In this case, move the variable name into both the box labeled “Sort column(s)” and the box labeled “By column.” Minitab automatically sorts the data from smallest to largest. If you would like it the other way, simply put a check in the box marked “descending” next to the box marked “By column.”

If you have one variable that you would like to use for the sorting, but would like to carry along other variables, put all variables that you want re-ordered (both those being carried along and the one being used for sorting) into the “Sort column(s)” box and put the variable that you want to use to do the ordering into the “By column” box. Figure 1.35 shows the dialogue box we used with the MedGPA data to sort the variables Acceptance and GPA using GPA as the sorting variable.

Figure 1.35: Dialogue box for sorting columns
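The idea of sorting by one variable while carrying others along can be sketched in Python as follows. The Acceptance and GPA values are hypothetical, not taken from MedGPA.

```python
# Hypothetical values for two columns of a worksheet.
acceptance = [1, 0, 1, 0]
gpa = [3.62, 3.14, 3.87, 3.36]

# Row order that sorts GPA from smallest to largest (the "By column" variable).
order = sorted(range(len(gpa)), key=lambda i: gpa[i])

# Apply the same row order to every column listed in "Sort column(s)".
gpa_sorted = [gpa[i] for i in order]
acceptance_sorted = [acceptance[i] for i in order]

# The "descending" check box corresponds to reversing the order.
gpa_descending = sorted(gpa, reverse=True)

print(gpa_sorted, acceptance_sorted)
```

Each Acceptance value stays paired with its original GPA because both columns are rearranged with the same row order.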


Creating New Variables

Sometimes you will want to create new variables from old ones. This may mean creating a new quantitative variable that is a function of one or more existing quantitative variables. It could also mean that we want to take a categorical variable (or quantitative variable with just a few different values) and create indicator variables from it.

Calculator

We start with creating a new quantitative variable from an existing one. For instance, you might want height in cm but you have it in inches in the data set. We know that there are 2.54 cm to the inch, so you just need to multiply all the values that are in inches by 2.54. You could do this individually with a calculator, but it is much quicker and easier to ask Minitab to do it for the whole column at one time.

In the Diamonds data set, one of the variables listed is PricePerCt. In fact, this is also a combination of two other variables, TotalPrice and Carat. The relationship is PricePerCt = TotalPrice/Carat. We can verify the values in the worksheet by having Minitab recompute the price per carat. Click on Calc>Calculator. This brings up the dialogue box shown in Figure 1.36. Give the new variable a name by typing the chosen name into the upper-right box marked “Store result in variable.” Then put the relevant mathematical expression in the “Expression” box, selecting variables as appropriate from the left-hand list of variables. Note that we called the new variable NewPricePerCarat.

Figure 1.36: Dialogue box for calculating a new numeric variable with expression entered
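The Calculator applies one expression to an entire column at once. A rough Python equivalent, using hypothetical prices, carats, and heights rather than values from the data files, looks like this:

```python
import math

# Hypothetical values standing in for three rows of the Diamonds worksheet.
total_price = [5000.0, 1200.0, 9000.0]
carat = [1.0, 0.4, 1.5]
price_per_ct = [5000.0, 3000.0, 6000.0]  # the stored column we want to verify

# Equivalent of the expression TotalPrice / Carat applied to the whole column.
new_price_per_carat = [p / c for p, c in zip(total_price, carat)]

# Compare the recomputed column with the stored one, allowing for rounding.
matches = all(
    math.isclose(new, old, rel_tol=1e-9)
    for new, old in zip(new_price_per_carat, price_per_ct)
)
print(new_price_per_carat, matches)

# The unit conversion works the same way: inches to centimeters.
height_in = [60.0, 65.5, 72.0]  # hypothetical heights in inches
height_cm = [h * 2.54 for h in height_in]
print(height_cm)
```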


Indicator variables

When using a categorical variable as an explanatory variable in regression or logistic regression, we need to create several indicator variables out of the one categorical variable. For example, in the data set TipJoke we are interested in people’s tipping behavior in a restaurant based on whether their waiter left them a card with a joke on it, a card with an advertisement on it, or left no card at all. The Card variable has three values: Joke, Ad, and None. What we need to do is to create variables that are indicator variables of the three conditions. That is, we need one variable that takes the value 1 when a joke is left and is 0 for all other cases, one variable that takes the value 1 when an ad is left and is 0 for all other cases, and one variable that is 1 when no card is left and is 0 in all other cases. To do this in Minitab, click on Calc>Make Indicator Variables. The dialogue box shown in Figure 1.37 will appear. Put the categorical variable into the box labeled “Indicator variables for.” Notice that when you put the variable name into that top box, the other empty boxes will automatically fill with the names that Minitab assigns to the various indicator variables. You can leave those as given, or change the names if you would like. When you click “OK” you will have a new column for each of the categories in the original variable.

Figure 1.37: Dialogue box for creating indicator variables
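To make the definition of an indicator variable concrete, here is a short Python sketch that builds one 0/1 column per category. The five Card values are hypothetical, and the column names of the form Card_Joke are our own choice for illustration.

```python
# Hypothetical values of the Card variable for five tables.
card = ["Joke", "Ad", "None", "Joke", "None"]

# One indicator column per distinct category: 1 if the row is in that category, else 0.
levels = sorted(set(card))
indicators = {
    f"Card_{level}": [1 if value == level else 0 for value in card]
    for level in levels
}

for name, column in indicators.items():
    print(name, column)
```

Every row has exactly one indicator equal to 1, which is why a regression model with an intercept uses only two of the three indicators.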

CHAPTER 2

Regression and Correlation

In this chapter we introduce you to the steps necessary to perform a regression analysis, both simple and multiple, in Minitab.

2.1 Simple Linear Regression

We will use the data set relating price of used Porsche sports cars to the amount of mileage on the car. The data can be found in PorschePrice and is the basis for many of the examples in Chapter 1. We begin our analysis by creating a scatterplot. Like most other graphs, this can be found by clicking Graph>Scatterplot. This brings up a first dialogue box, shown in Figure 2.1(a). Select “Simple” and click on “OK.” This will bring up a second dialogue box shown in Figure 2.1(b). Move the response variable (Price) to the top box in the column marked “Y variables” and the explanatory variable (Mileage) to the top box in the column marked “X variables.” The resulting scatterplot is shown in Figure 2.2.

(a) Initial scatterplot dialogue box (b) Second scatterplot dialogue box

Figure 2.1: Creating a scatterplot


Figure 2.2: Scatterplot of Mileage versus Price for used Porsche sports cars

Notice that the scatterplot in Figure 2.2 does not display the regression line. You can create a scatterplot with the line by choosing “With regression” from the dialogue box shown in Figure 2.1(a). If you create this graph in Minitab and hover your mouse over the regression line, a window will pop up showing the fitted regression equation (see Figure 2.3).

Figure 2.3: Scatterplot with regression line and regression equation

The rest of the output seen in Chapter 1 can be created by clicking Stat>Regression>Regression. The dialogue box in Figure 2.4 is the main dialogue box for linear regression.

To simply compute the fitted line and the statistics that go with that, move the response variable (Price) into the box marked “Response” and the explanatory variable into the box marked “Predictors” and click “OK.” The output will appear in the Session window and is shown below.


Figure 2.4: Dialogue box for linear regression

Regression Analysis: Price versus Mileage

The regression equation is

Price = 71.1 - 0.589 Mileage

Predictor Coef SE Coef T P

Constant 71.090 2.370 30.00 0.000

Mileage -0.58940 0.05665 -10.40 0.000

S = 7.17029 R-Sq = 79.5% R-Sq(adj) = 78.7%

Analysis of Variance

Source DF SS MS F P

Regression 1 5565.7 5565.7 108.25 0.000

Residual Error 28 1439.6 51.4

Total 29 7005.2

Unusual Observations

Obs Mileage Price Fit SE Fit Residual St Resid

24 20.5 39.70 59.01 1.54 -19.31 -2.76R

27 89.6 23.90 18.28 3.37 5.62 0.89 X

R denotes an observation with a large standardized residual.

X denotes an observation whose X value gives it large leverage.
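Several entries in this output are simple functions of other entries, which makes them easy to check. The Python sketch below recomputes the slope t-statistic, S, R-Sq, the F-statistic, and the fit for observation 24 from numbers copied out of the output. It is only an arithmetic check under the usual simple linear regression formulas, and small differences are due to rounding in the printed values.

```python
import math

# Values copied from the Minitab output.
coef_constant = 71.090
coef_mileage, se_mileage = -0.58940, 0.05665
ss_reg, ss_err, ss_tot = 5565.7, 1439.6, 7005.2
df_reg, df_err = 1, 28

t_slope = coef_mileage / se_mileage   # T for Mileage
ms_reg = ss_reg / df_reg              # MS for Regression
ms_err = ss_err / df_err              # MS for Residual Error
s = math.sqrt(ms_err)                 # S, the regression standard error
r_sq = ss_reg / ss_tot                # R-Sq as a proportion
f_stat = ms_reg / ms_err              # F in the ANOVA table

# Fitted price for observation 24, which has Mileage = 20.5.
fit_24 = coef_constant + coef_mileage * 20.5

print(f"T = {t_slope:.2f}, S = {s:.3f}, R-Sq = {100 * r_sq:.1f}%")
print(f"F = {f_stat:.2f}, Fit(20.5) = {fit_24:.2f}")
```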


Of course we also want to create diagnostic graphs: histogram of residuals, normal plot of residuals, and a scatterplot of residuals versus fits. For all three of these graphs, click on the “Graphs” button in the dialogue box shown in Figure 2.4. This brings up a new dialogue box shown in Figure 2.5. Check the boxes of the graphs that you would like to produce. Typically these would be the first three choices of “Histogram of residuals,” “Normal plot of residuals,” and “Residuals versus fits.”

Figure 2.5: Dialogue box for graphs associated with linear regression conditions

Notice that you can, instead, choose to click on the choice called “Four in one,” which will produce the set of four graphs shown in Figure 2.6. The only new graph here is a time plot of the residuals. This is most useful in detecting whether there is some sort of pattern to the residuals across the data in the order in which it was collected (or entered into the computer).

Figure 2.6: Four residual graphs in one display


2.2 Inference for Simple Linear Regression

Much of the material covered in Chapter 2 uses output that we have already seen how to create in the last section. There are three new ideas that need to be covered here.

First, Section 2.1 discusses computing a confidence interval for the slope. To accomplish this computation, we need the value of t∗ for the appropriate level of confidence. Let’s say that we want to compute a 95% confidence interval for a slope using 20 degrees of freedom. To find t∗ click on Calc>Probability Distributions>t. This brings up the dialogue box shown in Figure 2.7.

Figure 2.7: Dialogue box used to find t∗

Make sure that “Inverse cumulative probability” is checked, enter the degrees of freedom, click on “Input constant” and type in the appropriate area (0.975 for a 95% interval). The resulting output shown below indicates that t∗ = 2.08596 for a 95% confidence interval with 20 degrees of freedom.

Inverse Cumulative Distribution Function

Student’s t distribution with 20 DF

P(X<=x) x

0.975 2.08596
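To show what the inverse cumulative probability is doing, here is a self-contained Python sketch that finds t∗ by numerically integrating the t density and then searching for the value with the requested area to its left. This is our own illustration using Simpson's rule and bisection; it is not how Minitab computes the value. The last few lines use the result to form a 95% confidence interval for the Porsche slope, assuming the 28 degrees of freedom shown in the Residual Error row of the earlier regression output.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0: one half plus Simpson's rule on [0, x]."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_star(confidence, df):
    """Critical value with the given confidence level between -t* and +t*."""
    target = 1 - (1 - confidence) / 2  # 0.975 for a 95% interval
    lo, hi = 0.0, 100.0
    for _ in range(60):  # bisection on the cumulative probability
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(f"t* for 95% confidence, 20 df: {t_star(0.95, 20):.5f}")

# Confidence interval for the Porsche slope: estimate +/- t* x SE, with 28 df.
slope, se_slope = -0.58940, 0.05665
t28 = t_star(0.95, 28)
slope_ci = (slope - t28 * se_slope, slope + t28 * se_slope)
print(f"95% CI for slope: ({slope_ci[0]:.4f}, {slope_ci[1]:.4f})")
```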

An alternate method to find a t∗ value (or a p-value for a test statistic) is Graph>Probability Distribution Plot>View Probability, which gives a graphical display. After clicking on “OK,” choose the t-distribution from the list, enter the degrees of freedom, and click on the “Shaded Area” tab. In the final dialogue box, choose the type of region and enter a probability for that region to find the endpoint(s). (To find an area or probability instead, check the “X-value” item, choose the region, and enter the endpoint(s).) This process is shown in Figure 2.8.


Figure 2.8: Getting a graphical t∗ value

The resulting graph, identifying t∗ = ±2.086, is shown in Figure 2.9.

Figure 2.9: Finding t∗ for a 95% CI with 20 degrees of freedom

Next, Section 2.1 introduces the correlation coefficient. To compute the correlation in Minitab, click on Stat>Basic Statistics>Correlation. Figure 2.10 shows the dialogue box that results.

Put the appropriate variable names into the box labeled “Variables,” check the box labeled “Display p-values” and click “OK.” The resulting output is shown below. Note that while Minitab does give a p-value for the test, it does not give the actual value of the test statistic. Also, if you give Minitab more than two variables and ask it to compute a correlation, it will compute a correlation between each pair of variables from the list that you give it. The value of R² is found on the original regression output.


Figure 2.10: Dialogue box for correlation calculation

Correlations: Price, Mileage

Pearson correlation of Price and Mileage = -0.891

P-Value = 0.000

Here we see a very strong negative correlation between price and mileage for used Porsche cars with a p-value that is essentially zero.
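Since Minitab does not print the test statistic for the correlation, it can be computed by hand with the usual formula t = r√(n − 2)/√(1 − r²). The Python sketch below does this using sums of squares from the regression output for these data, with n = 30 inferred from the 29 total degrees of freedom. In simple linear regression the correlation is the square root of R², with the sign of the slope.

```python
import math

n = 30                              # sample size (Total DF = 29 in the ANOVA table)
ss_reg, ss_tot = 5565.7, 7005.2     # sums of squares from the regression output
slope = -0.58940                    # only its sign is used here

# Correlation: square root of R-squared, carrying the sign of the slope.
r = math.copysign(math.sqrt(ss_reg / ss_tot), slope)

# Test statistic for the null hypothesis that the population correlation is 0, with n - 2 df.
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)

print(f"r = {r:.3f}, t = {t_stat:.2f}")
```

The resulting t-statistic agrees, up to rounding, with the t-statistic for the slope in the regression output, as it should in simple linear regression.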

Finally, Section 2.1 discusses confidence intervals for the mean response at a particular value of the explanatory variable and prediction intervals for new observations at a particular value of the explanatory variable. These are easily computed in Minitab. In the dialogue box shown in Figure 2.4, click on the “Options” button. This brings up the new dialogue box shown in Figure 2.11.

Put the value of the explanatory variable for which you want to compute these intervals into the box labeled “Prediction intervals for new observations.” In this example we have asked the computer for confidence and prediction intervals for the price of used Porsche sports cars with 30,000 miles (so Mileage = 30.0). Click “OK” twice and you will see the same regression output as before with additional information at the bottom for the intervals.


Figure 2.11: Dialogue box for confidence and prediction intervals

Predicted Values for New Observations

NewObs Fit SE Fit 95% CI 95% PI

1 53.41 1.34 (50.67, 56.15) (38.47, 68.35)

Values of Predictors for New Observations

NewObs Mileage

1 30.0

The predicted price (Fit) is $\widehat{Price} = 53.41$, or $53,410. The 95% CI gives the confidence interval for the mean price (in $1,000’s) for Porsches with 30,000 miles, and the 95% PI gives the prediction interval for the price of one such car.
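Both intervals can be reproduced from the Fit, the SE Fit, and the regression standard error S. The Python sketch below assumes the standard formulas (fit ± t∗ × SE Fit for the mean response, and fit ± t∗ × √(S² + SE Fit²) for a single new observation) and uses t∗ ≈ 2.0484, the 95% critical value for 28 degrees of freedom.

```python
import math

fit, se_fit = 53.41, 1.34   # from the Predicted Values output
s = 7.17029                 # regression standard error from the earlier output
t_crit = 2.0484             # t* for 95% confidence with 28 degrees of freedom

# Confidence interval for the mean price at Mileage = 30.0.
conf_int = (fit - t_crit * se_fit, fit + t_crit * se_fit)

# Prediction interval for one car: add the individual variability S^2.
se_pred = math.sqrt(s**2 + se_fit**2)
pred_int = (fit - t_crit * se_pred, fit + t_crit * se_pred)

print(f"95% CI: ({conf_int[0]:.2f}, {conf_int[1]:.2f})")
print(f"95% PI: ({pred_int[0]:.2f}, {pred_int[1]:.2f})")
```

The prediction interval is much wider than the confidence interval because it must allow for the variability of an individual car around the mean, not just the uncertainty in the estimated mean.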

2.3 Multiple Regression

This section will illustrate how to fit all of the multiple regression models that are presented in Chapters 3 and 4.

Comparing Two Regression Lines

We now turn to fitting two regression lines for the data in Kids198. The first step is to create an indicator variable IGirl. There are several ways to do this in Minitab. The easiest way is to use the calculator by selecting Calc>Calculator and then entering IGirl as the variable where the result should be stored and (“Sex”=1) as the expression (see Figure 2.12(a)). If the expression is true, then IGirl will be set to 1; if the expression is false, then IGirl will be set to zero.


(a) Using the calculator (b) Using a built in option

Figure 2.12: Two ways of making indicator variables

Another way to create indicator variables, one that will be more useful when the categorical variable has more than two different levels, is to click on Calc>Make Indicator Variables (as described on page 27 of this companion). Now, enter Sex as the variable for which to create indicator variables and you will see that two new indicator variables, one for each distinct value of Sex, will be created when you click “OK”. Figure 2.12(b) shows the completed dialog box for this method of creating indicator variables. In order to match up with the output in the text, you may want to rename Sex_1 to IGirl.

Before we fit the model for two separate lines, we need to create the interaction variable. Use the calculator by clicking on Calc>Calculator to create a new variable AgexIGirl that is the product of Age and IGirl. Now that all of our data management is finished, we use the same commands that we did to fit a simple linear regression model. That is, select Stat>Regression>Regression and enter the response and predictor variables. The only difference is that we must enter three variables into the list of predictor variables. Thus, the multiple regression model for predicting Weight from Age for each Sex separately is fit by completing the dialogue box, as shown in Figure 2.13.

The output now contains the estimated model and a row for each of the predictor variables. Each row contains the estimated coefficient, the standard error of the coefficient, the individual t statistic, and the corresponding p-value. To obtain confidence intervals for each parameter, we would need to obtain the critical value t∗, as described in Section 2.2. Note that the appropriate degrees of freedom, 194 in this case, can always be obtained from the Residual Error row in the Analysis of Variance table. The standard error of the multiple regression model, the coefficient of multiple determination, and the adjusted coefficient of determination are also included in the output. Finally, sequential sums of squares, which will be discussed later, are part of the standard output.


Figure 2.13: Dialogue boxes for fitting two lines


Weight = - 33.7 + 0.909 Age + 31.9 IGirl - 0.281 AgexIGirl

Predictor Coef SE Coef T P

Constant -33.69 10.01 -3.37 0.001

Age 0.90871 0.06106 14.88 0.000

IGirl 31.85 13.24 2.41 0.017

AgexIGirl -0.28122 0.08164 -3.44 0.001

S = 19.1862 R-Sq = 66.8% R-Sq(adj) = 66.3%


Source DF SS MS F P

Regression 3 143864 47955 130.27 0.000

Residual Error 194 71414 368

Total 197 215278

Source DF Seq SS

Age 1 131450

IGirl 1 8046

AgexIGirl 1 4368

To obtain residual plots, confidence intervals, and prediction intervals, you select the same options that you used for simple linear regression models. For example, in order to obtain residual plots against all of the predictors, just enter all of the variables in the dialogue box, as shown in Figure 2.14(a).


(a) Plots: Request residual plots (b) Options: Request prediction intervals

Figure 2.14: Dialogue boxes for residual plots and prediction intervals when fitting two lines.

When doing confidence or prediction intervals for multiple regression, take care to list the values of the explanatory variables in the same order as they appear in the model. For example, to predict the weight of a 12-year-old (144 months) girl using the model above, specify the values 144 1 144 in the “Prediction Intervals for new observations” section under the regression options (as in Figure 2.14(b)), while a 12-year-old boy would need 144 0 0.
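The three values entered for a new observation are simply Age, IGirl, and their product, in the order the predictors appear in the model. The Python sketch below builds that list and evaluates the fitted equation with the coefficients from the output above; the function names are our own.

```python
# Coefficients from the fitted model, in the order Constant, Age, IGirl, AgexIGirl.
b_constant, b_age, b_igirl, b_inter = -33.69, 0.90871, 31.85, -0.28122

def new_observation(age_months, igirl):
    """Predictor values in model order: Age, IGirl, AgexIGirl."""
    return [age_months, igirl, age_months * igirl]

def predicted_weight(age_months, igirl):
    """Evaluate the fitted two-line model at the given age and sex indicator."""
    age, girl, inter = new_observation(age_months, igirl)
    return b_constant + b_age * age + b_igirl * girl + b_inter * inter

print(new_observation(144, 1), round(predicted_weight(144, 1), 2))  # 12-year-old girl
print(new_observation(144, 0), round(predicted_weight(144, 0), 2))  # 12-year-old boy
```

For boys both IGirl and the interaction term are zero, so the prediction uses only the constant and the Age coefficient.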

Regression Model with Interaction

In order to fit multiple regression models with two predictors and an interaction term, use the calculator to create the interaction term and then click on the Minitab commands described in the previous section for fitting two lines. For example, to fit a multiple regression model for predicting Weight from Length and Width for the data in Perch, simply create the variable LengthxWidth and fill in the regression dialogue boxes associated with Stat>Regression>Regression.

Polynomial Regression

There are several ways to fit quadratic and cubic regression models. The first method is to use the calculator to create higher-order terms and then fill in the dialog boxes after clicking Stat>Regression>Regression as you have done for other multiple regression models. The other option is to click Stat>Regression>Fitted line plot and then select the appropriate higher-order option. For example, in order to predict TotalPrice with a quadratic model using Carat as the predictor variable for Diamonds, complete the dialogue box for a fitted line plot, as shown in Figure 2.15.


Figure 2.15: Dialogue box for quadratic regression

The output, including a scatterplot with a curve and the equation of the fitted model, is shown in Figure 2.16.

Figure 2.16: Output from fitted line plot for quadratic regression model

The fitted line plot option provides a convenient way to fit quadratic and cubic regression models, but higher-order models must be fit by creating the quadratic, cubic, and higher-order terms and then fitting the multiple regression model using Stat>Regression>Regression.

Correlated Predictors

To examine the possible relationships among a set of predictor variables, we compute the correlation matrix by clicking Stat>Basic Statistics>Correlation and entering the variables into the dialogue box. For example, Figure 2.17 illustrates how to get the correlation coefficients among all of the variables in Perch.


Figure 2.17: Dialogue box for creating a correlation matrix

In order to obtain variance inflation factors, click Stat>Regression>Regression, enter the response and predictor variables, and then select Options. Figure 2.18 shows the regression options dialogue box that will appear. Click the box in front of variance inflation factors and then “OK”. After clicking “OK” in the regression dialogue box, you will see that the output now has a column for VIF and the variance inflation factor is listed after the p-value for the individual t tests for each predictor.

Figure 2.18: Dialogue box for getting variance inflation factors

Partial output for predicting TotalPrice from Carat, CaratSq, and Depth for Diamonds is shown below. Notice that the VIFs are 10.942, 10.719, and 1.117.



TotalPrice = 6343 + 2950 Carat + 4430 CaratSq - 114 Depth

Predictor Coef SE Coef T P VIF

Constant 6343 1436 4.42 0.000

Carat 2950.0 736.1 4.01 0.000 10.942

CaratSq 4430.4 254.7 17.40 0.000 10.719

Depth -114.08 22.66 -5.03 0.000 1.117

S = 2056.07 R-Sq = 93.1% R-Sq(adj) = 93.0%
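A variance inflation factor is defined as VIF = 1/(1 − R²), where R² here comes from regressing that predictor on the other predictors in the model. The Python sketch below simply inverts this relationship for the three VIFs in the output to show how much of each predictor is explained by the others; it does not refit any models.

```python
# VIFs copied from the partial output above.
vifs = {"Carat": 10.942, "CaratSq": 10.719, "Depth": 1.117}

# Invert VIF = 1 / (1 - R^2) to recover R^2 for each predictor regressed on the others.
r_sq_on_others = {name: 1 - 1 / vif for name, vif in vifs.items()}

for name, r_sq in r_sq_on_others.items():
    print(f"{name}: VIF = {vifs[name]:.3f}, R-Sq on other predictors = {100 * r_sq:.1f}%")
```

The large values for Carat and CaratSq reflect the strong relationship between a variable and its own square, while Depth is nearly unrelated to the other two predictors.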

Testing Subsets of Predictors

Unfortunately, there is no option in the dialogue boxes for doing a nested F-test on subsets of predictors. One way to conduct the nested F-test is to fit both the full and reduced model and then compute the F-statistic. For example, suppose we want to find out if the quadratic terms are needed in the complete second-order model when predicting Weight from Length and Width for Perch, as in Example 3.16 on page 135 of the text. The first step is to fit the complete second-order model. The regression output for the full model, the complete second-order model, is shown below.


Weight = 156 - 25.0 Length + 21.0 Width - 9.78 LengthxWidth + 1.57 LengthSq + 34.4 WidthSq

Predictor Coef SE Coef T P VIF

Constant 156.35 61.42 2.55 0.014

Length -25.00 14.27 -1.75 0.086 547.034

Width 20.98 82.59 0.25 0.801 640.317

LengthxWidth -9.776 7.145 -1.37 0.177 16191.346

LengthSq 1.5719 0.7244 2.17 0.035 5478.427

WidthSq 34.41 18.75 1.84 0.072 3511.961

S = 43.1277 R-Sq = 98.6% R-Sq(adj) = 98.5%


Source DF SS MS F P

Regression 5 6553094 1310619 704.63 0.000


Total 55 6646094


Source DF Seq SS

Length 1 6118739

Width 1 110593

LengthxWidth 1 314997

LengthSq 1 2499

WidthSq 1 6266

The second step is to fit the multiple regression model with only the linear terms and interaction. The regression output for the reduced model is shown below.

Regression Analysis: Weight versus Length, Width, LengthxWidth


Weight = 114 - 3.48 Length - 94.6 Width + 5.24 LengthxWidth

Predictor Coef SE Coef T P VIF

Constant 113.93 58.78 1.94 0.058

Length -3.483 3.152 -1.10 0.274 25.358

Width -94.63 22.30 -4.24 0.000 44.352

LengthxWidth 5.2412 0.4131 12.69 0.000 51.439

S = 44.2381 R-Sq = 98.5% R-Sq(adj) = 98.4%


Source DF SS MS F P

Regression 3 6544330 2181443 1114.68 0.000


Total 55 6646094

Source DF Seq SS

Length 1 6118739

Width 1 110593

LengthxWidth 1 314997

We noticed in Example 3.16 (on page 135 in the text) that the nested F-statistic is based on the difference in the model sums of squares.

$$SSModel_{full} - SSModel_{reduced} = 6553094 - 6544330 = 8764$$


The sequential sums of squares can also be used to compute this difference. Recall that the sequential sums of squares depend on the order in which the predictors are entered into the model. Since the last two terms are the squared terms, we can simply add these values together to get the total variability that is explained by the quadratic terms. The sum is 2499 + 6266 = 8765, which is equal (up to roundoff error) to the difference in the model sums of squares.
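The remaining arithmetic for the nested F-statistic can be done by hand or in any calculator. As one possible sketch in Python, the code below divides the difference in model sums of squares by the number of extra terms and then by the mean square error of the full model. The error sum of squares and its degrees of freedom are obtained here by subtraction from the Total row, since the Residual Error row is not reproduced above.

```python
import math

# Sums of squares and degrees of freedom from the two regression outputs.
ss_model_full, ss_model_reduced = 6553094, 6544330
ss_total, df_total = 6646094, 55
k_full, k_reduced = 5, 3            # number of predictors in each model

# Error sum of squares and degrees of freedom for the full model, by subtraction.
sse_full = ss_total - ss_model_full
df_error_full = df_total - k_full
mse_full = sse_full / df_error_full

# Nested F-statistic: extra variability explained per added term, relative to MSE.
extra_ss = ss_model_full - ss_model_reduced
extra_terms = k_full - k_reduced
f_nested = (extra_ss / extra_terms) / mse_full

print(f"MSE(full) = {mse_full:.1f}, sqrt = {math.sqrt(mse_full):.4f}")
print(f"F = {f_nested:.3f} with {extra_terms} and {df_error_full} degrees of freedom")
```

The square root of the full-model MSE matches the value S = 43.1277 in the full-model output, which is a useful check on the subtraction. The p-value would then come from an F distribution with 2 and 50 degrees of freedom.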

2.4 Additional Topics in Regression

Added Variable Plots

The only new feature that you need to know to create added variable plots is how to save residuals from different regression models. After the residuals are saved, you can create the appropriate scatterplots using Graph>Scatterplot. To save the residuals for any regression model, select Storage in the regression dialogue box and then a regression storage dialogue box will appear. Figure 2.19 shows the completed regression storage dialogue box after checking the box next to Residuals. Notice that there are many other diagnostic measures that can be saved, and we will return to these options shortly.

Figure 2.19: Dialogue box for saving residuals

After clicking “OK” in the regression storage dialogue box and “OK” in the regression dialogue box, you will find a new variable Resi1 after the last variable in your Minitab worksheet. Resi1 contains the residuals from the regression model that you just fit. Every time that you run another regression model in your current Minitab session, the residuals will be saved and the names of the new variables will be identified sequentially, that is, as Resi2, Resi3, etc. You may find it helpful to rename these stored residuals (and other measures) after fitting each regression model.

Techniques for Choosing Predictors

There are a number of ways to pick the best predictors for a particular response variable, and we will use measurements on 219 college students to illustrate the Minitab commands. The data are in the file FirstYearGPA. We start by illustrating the method of best subsets. In order to find the best subsets for predicting GPA from the quantitative predictor variables HSGPA, SATV, SATM, HU, and SS, we select Stat>Regression>Best Subsets. Figure 2.20 shows the completed best subsets dialogue box.


Figure 2.20: Dialogue box for best subsets

The resulting output, shown below, contains the coefficient of determination, the adjusted coefficient of determination, Mallows’ Cp, and the multiple regression standard error for each model. Each row corresponds to a model with the variables included marked by an “X”. By default, Minitab shows the best two models for each number of predictors.

Best Subsets Regression: GPA versus HSGPA, SATV, SATM, HU, SS

Response is GPA

Vars R-Sq R-Sq(adj) Mallows Cp S HSGPA SATV SATM HU SS

1 20.0 19.6 35.1 0.41737 X

1 9.9 9.5 66.6 0.44285 X

2 27.0 26.3 15.2 0.39962 X X

2 24.6 23.9 22.7 0.40606 X X

3 30.8 29.8 5.3 0.38993 X X X

3 28.8 27.8 11.6 0.39552 X X X

4 31.6 30.3 4.8 0.38857 X X X X

4 31.0 29.7 6.8 0.39039 X X X X

5 31.9 30.3 6.0 0.38875 X X X X X
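The adjusted R-Sq column can be recovered from the R-Sq column with the usual formula, adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1), where n = 219 students and k is the number of predictors in the row. The Python sketch below checks a few rows of the table; because the printed R-Sq values are rounded to one decimal place, agreement is only expected to about 0.1.

```python
n = 219  # number of students in FirstYearGPA

def adjusted_r_sq(r_sq_percent, k):
    """Adjusted R-squared (in percent) for a model with k predictors."""
    r_sq = r_sq_percent / 100
    return 100 * (1 - (1 - r_sq) * (n - 1) / (n - k - 1))

# (number of predictors, R-Sq, reported R-Sq(adj)) for selected rows of the table.
rows = [(1, 20.0, 19.6), (2, 27.0, 26.3), (4, 31.6, 30.3), (5, 31.9, 30.3)]

for k, r_sq, reported in rows:
    print(f"{k} predictors: computed {adjusted_r_sq(r_sq, k):.1f}, reported {reported}")
```

Notice that going from the best four-predictor model to the five-predictor model raises R-Sq slightly but leaves the adjusted value essentially unchanged, which is the penalty for the extra predictor at work.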

Now, we turn to three different automated methods that are referred to as stepwise regression procedures. All three methods are carried out by selecting Stat>Regression>Stepwise. Figure 2.21(a) shows the stepwise dialog box that needs to be completed for all three methods. Simply enter your response variable and possible predictor variables into the appropriate sections of the box. (Notice that you have the option of forcing a variable to be in the model by entering it into a different location. Use this option with caution and only in situations where you have solid evidence, perhaps a theoretical justification, for forcing the variable to be in the model.) Next, click Method. You will get a method selection dialogue box that is shown in Figure 2.21(b). Depending on the method that you choose (backward, forward, or stepwise) you can specify various entry and exit values. At the top of the method selection dialogue box, you can decide whether you want to use alpha levels or F values. The default setting is to use alpha values.

(a) Stepwise dialogue box (b) Method selection dialogue box

Figure 2.21: Stepwise regression options

Since all three of these methods are implemented in a similar fashion, we will use backward elimination to identify a model for predicting GPA and then let you explore with the other methods. In backward elimination (as specified in the dialogue box above), the idea is to put all possible predictors, both quantitative and indicator variables, into a model and then remove terms one at a time.

The resulting output is shown below. Each column shows the coefficients and individual t-tests for the variables in the model at that step. Summary values for each model appear at the bottom of each column. We see in this case that the backward elimination proceeds through 6 steps, leaving a 4-predictor model based on HSGPA, SATV, HU, and White to help explain first-year college GPA. Each of the final explanatory variables is significant at a 5% level and the overall R² value for the model is 33.75% (down from 34.96% with all nine predictors).


Stepwise Regression: GPA versus HSGPA, SATV, ...

Backward elimination. Alpha-to-Remove: 0.1

Response is GPA on 9 predictors, with N = 219

Step 1 2 3 4 5 6

Constant 0.5269 0.5552 0.5825 0.5467 0.5685 0.6410

HSGPA 0.493 0.495 0.492 0.483 0.474 0.476

T-Value 6.62 6.70 6.81 6.76 6.68 6.70

P-Value 0.000 0.000 0.000 0.000 0.000 0.000

SATV 0.00059 0.00062 0.00063 0.00069 0.00075 0.00074

T-Value 1.50 1.76 1.79 2.01 2.19 2.16

P-Value 0.135 0.080 0.075 0.045 0.029 0.032

SATM 0.00008

T-Value 0.19

P-Value 0.849

Male 0.048 0.052 0.053 0.054

T-Value 0.85 0.99 1.00 1.03

P-Value 0.398 0.325 0.316 0.306

HU 0.0162 0.0161 0.0161 0.0168 0.0167 0.0151

T-Value 4.08 4.10 4.10 4.40 4.39 4.14

P-Value 0.000 0.000 0.000 0.000 0.000 0.000

SS 0.0073 0.0072 0.0071 0.0076 0.0077

T-Value 1.32 1.31 1.30 1.39 1.42

P-Value 0.189 0.192 0.194 0.166 0.156

FirstGen -0.074 -0.076 -0.077

T-Value -0.84 -0.86 -0.88

P-Value 0.403 0.393 0.380

White 0.196 0.197 0.196 0.205 0.206 0.212

T-Value 2.80 2.84 2.84 2.98 3.00 3.09

P-Value 0.006 0.005 0.005 0.003 0.003 0.002

CollegeBound 0.02 0.02

T-Value 0.21 0.21

P-Value 0.831 0.833

S 0.383 0.383 0.382 0.381 0.381 0.382

R-Sq 34.96 34.95 34.94 34.70 34.37 33.75

R-Sq(adj) 32.16 32.47 32.78 32.85 32.83 32.51

Mallows Cp 10.0 8.0 6.1 4.8 3.9 3.9
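The removal rule can be replayed from the output itself. The sketch below does not refit any models; it only reads the P-Values printed in each step's column and assumes the rule is "drop the predictor with the largest p-value, as long as that p-value exceeds Alpha-to-Remove." The variable names in the code are ours.

```python
ALPHA_TO_REMOVE = 0.1

# P-values copied from the columns for Steps 1 through 6 above.
p_values_by_step = [
    {"HSGPA": 0.000, "SATV": 0.135, "SATM": 0.849, "Male": 0.398, "HU": 0.000,
     "SS": 0.189, "FirstGen": 0.403, "White": 0.006, "CollegeBound": 0.831},
    {"HSGPA": 0.000, "SATV": 0.080, "Male": 0.325, "HU": 0.000,
     "SS": 0.192, "FirstGen": 0.393, "White": 0.005, "CollegeBound": 0.833},
    {"HSGPA": 0.000, "SATV": 0.075, "Male": 0.316, "HU": 0.000,
     "SS": 0.194, "FirstGen": 0.380, "White": 0.005},
    {"HSGPA": 0.000, "SATV": 0.045, "Male": 0.306, "HU": 0.000,
     "SS": 0.166, "White": 0.003},
    {"HSGPA": 0.000, "SATV": 0.029, "HU": 0.000, "SS": 0.156, "White": 0.003},
    {"HSGPA": 0.000, "SATV": 0.032, "HU": 0.000, "White": 0.002},
]

removed = []
for step, p_values in enumerate(p_values_by_step, start=1):
    worst = max(p_values, key=p_values.get)  # predictor with the largest p-value
    if p_values[worst] > ALPHA_TO_REMOVE:
        removed.append(worst)
        print(f"Step {step}: remove {worst} (p = {p_values[worst]:.3f})")
    else:
        print(f"Step {step}: stop, largest p-value is {p_values[worst]:.3f}")

final_model = sorted(p_values_by_step[-1])
print("Final model:", final_model)
```

The order of removal matches the way predictors disappear from the columns of the Minitab output.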


Identifying Unusual Points

Many different measures are used to identify influential points and unusual observations. The default regression output after fitting a model with Stat>Regression>Regression will include unusual observations. R is Minitab’s label to denote an observation with a large standardized residual, and X denotes an observation whose X value gives it large leverage. However, in order to apply the rules of thumb on page 187 in the textbook, we need to obtain the diagnostic measures. No new Minitab commands are needed to compute and save these statistics. When discussing added variable plots earlier in this chapter, we learned how to save residuals. Figure 2.19 shows that we can save leverage, standardized residuals, studentized residuals, and Cook’s distance in exactly the same way. Simply check the box next to the values you want to save, and these measures will be calculated and saved in your Minitab worksheet after you fit your model.
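For readers who want to see what these stored columns contain, here is a minimal sketch for a simple linear regression with one predictor. The data are made up for illustration (the last point is deliberately far from the others in x), and the formulas are the standard textbook ones rather than anything specific to Minitab.

```python
import math

# Hypothetical data: the last observation has an unusually large x value.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 12.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 30.0]

n = len(x)
p = 2  # number of coefficients: intercept and slope

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)

# Least squares fit.
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
intercept = y_bar - slope * x_bar
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - p))  # regression standard error

# Leverage, standardized residual, and Cook's distance for each point.
leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
std_resid = [e / (s * math.sqrt(1 - h)) for e, h in zip(residuals, leverage)]
cooks_d = [r ** 2 / p * h / (1 - h) for r, h in zip(std_resid, leverage)]

for i in range(n):
    print(f"x = {x[i]:5.1f}  leverage = {leverage[i]:.3f}  "
          f"std resid = {std_resid[i]:6.2f}  Cook's D = {cooks_d[i]:.3f}")
```

The leverages always add up to the number of coefficients, which is a handy check on the stored column.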

Coding Categorical Predictors

In order to illustrate the Minitab commands for categorical predictors, we return to the sales data for ThreeCars. We want to predict the prices (in thousands of dollars) of used cars (Porsches, Jaguars, and BMWs) offered for sale at an internet site, based on the mileages (in thousands of miles). To obtain a scatterplot with a different plotting symbol for each type of car, we click Graph>Scatterplot>With Groups. Figure 2.22 shows the completed dialogue box.

Figure 2.22: Dialogue box for scatterplot with groups

There are two ways in Minitab to fit a regression model with three lines, one for each type of car. The easiest option is to click Graph>Scatterplot>With Regression and Groups. Figure 2.23 shows the completed dialogue box.


Figure 2.23: Dialogue box for scatterplot with regression lines for each group

The resulting graph is shown in Figure 2.24.

Figure 2.24: Scatterplot with regression lines for each type of car

The other method is to use the indicator variables for Porsche and Jaguar (or any two of the three indicators) to fit a multiple regression model. Since we need to allow for the possibility of having different slopes for each type of car, we begin by creating interaction variables, say IPorschexMileage and IJaguarxMileage, by multiplying the indicator variables by Mileage.

The output from fitting a multiple regression model using Stat>Regression>Regression with the predictor variables Mileage, Porsche, Jaguar, IPorschexMileage, and IJaguarxMileage is shown below.



Price = 56.3 - 0.490 Mileage + 14.8 Porsche - 2.06 Jaguar - 0.0995 IPorschexMileage - 0.130 IJaguarxMileage

Predictor          Coef       SE Coef    T       P
Constant           56.290     4.155      13.55   0.000
Mileage            -0.48988   0.07227    -6.78   0.000
Porsche            14.800     5.041      2.94    0.004
Jaguar             -2.063     5.236      -0.39   0.695
IPorschexMileage   -0.09952   0.09940    -1.00   0.320
IJaguarxMileage    -0.1304    0.1057     -1.23   0.221

In order to obtain a multiple regression model with three parallel regression lines, we can drop the two insignificant interaction terms.
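To see how the interaction model produces three separate lines, the coefficients in the output above can be combined by hand. The sketch below assumes BMW is the reference category (it has no indicator in the model); the function name is ours.

```python
# Coefficients copied from the regression output above.
coef = {
    "Constant": 56.290,
    "Mileage": -0.48988,
    "Porsche": 14.800,
    "Jaguar": -2.063,
    "IPorschexMileage": -0.09952,
    "IJaguarxMileage": -0.1304,
}

def line_for(car):
    """Return (intercept, slope) of the fitted Price-vs-Mileage line for one car type."""
    intercept = coef["Constant"]
    slope = coef["Mileage"]
    if car in ("Porsche", "Jaguar"):
        intercept += coef[car]                # indicator shifts the intercept
        slope += coef[f"I{car}xMileage"]      # interaction shifts the slope
    return intercept, slope

for car in ("BMW", "Porsche", "Jaguar"):
    b0, b1 = line_for(car)
    print(f"{car:8s} Price = {b0:.3f} + ({b1:.5f}) Mileage")
```

Dropping the two interaction terms forces all three slopes to be equal, which is what makes the lines parallel.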

Randomization Test for a Relationship and Bootstrap for Regression Slope and Correlation

There are currently no menu options for randomization tests or bootstrapping in Minitab. Both of these topics require the use of macros, so we address these more advanced topics at a later point. See Chapter 5 of this Minitab companion for a complete description of the macros, how to use them, and examples.

CHAPTER 3

ANOVA

In this chapter we introduce you to the steps necessary to perform ANOVA to compare means, both one-way and two-way, in Minitab.

3.1 One-way ANOVA - Chapter 5

Basic Procedures from the Chapter

We start with a very basic analysis using the FruitFlies data (the main dataset used in examples in Chapter 5).

Example 5.3 (page 234): Fruit flies (continued)

To access the ANOVA commands in Minitab, click on Stat>ANOVA. For this simple example, we are interested in a one-way model. Note that there are two choices on this menu for one-way: One-Way and One-Way (Unstacked). The difference between these two commands has to do with how the data are entered in the worksheet. In this case, we have one column with all of the Longevity measurements and a different column that breaks these measurements into groups (Treatment). Minitab calls this way of organizing data “stacked.” If the data had been “unstacked,” then one column would have had the Longevity measurements for the first treatment, the next column would have had the Longevity measurements for the second treatment, and so on. Because our data are stacked, we use the One-Way command.

Once you have clicked Stat>ANOVA>One-Way, the main dialogue box will appear. To produce the basic ANOVA table shown below and on page 234 of the book, select Longevity for the response variable and Treatment for the factor as shown in Figure 3.1(a). In addition to the ANOVA table, Minitab shows the sample size, mean, and standard deviation for each group and makes a crude plot of a confidence interval for each group mean. The pooled standard deviation is √MSE.


(a) One-way ANOVA main dialogue box (b) One-way ANOVA graphs dialogue box

Figure 3.1: Dialogue boxes for a One-way ANOVA analysis

Source DF SS MS F P

Treatment 4 11939 2985 13.61 0.000

Error 120 26314 219

Total 124 38253

S = 14.81 R-Sq = 31.21% R-Sq(adj) = 28.92%

Individual 95% CIs For Mean Based on

Pooled StDev

Level N Mean StDev -------+---------+---------+---------+--

1 pregnant 25 64.80 15.65 (-----*-----)

1 virgin 25 56.76 14.93 (-----*-----)

8 pregnant 25 63.36 14.54 (-----*-----)

8 virgin 25 38.72 12.10 (-----*-----)

none 25 63.56 16.45 (-----*----)

-------+---------+---------+---------+--

40 50 60 70

Pooled StDev = 14.81
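The ANOVA table can be rebuilt, up to rounding, from the group summaries that Minitab prints. The sketch below uses only the N, Mean, and StDev columns above; because those values are rounded, the sums of squares differ slightly from the table. Variable names are ours.

```python
import math

# (N, Mean, StDev) for each treatment, copied from the output above.
groups = {
    "1 pregnant": (25, 64.80, 15.65),
    "1 virgin":   (25, 56.76, 14.93),
    "8 pregnant": (25, 63.36, 14.54),
    "8 virgin":   (25, 38.72, 12.10),
    "none":       (25, 63.56, 16.45),
}

n_total = sum(n for n, _, _ in groups.values())
grand_mean = sum(n * mean for n, mean, _ in groups.values()) / n_total

# Between-group and within-group sums of squares.
ss_groups = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups.values())
ss_error = sum((n - 1) * sd ** 2 for n, _, sd in groups.values())

df_groups = len(groups) - 1
df_error = n_total - len(groups)

ms_groups = ss_groups / df_groups
ms_error = ss_error / df_error
f_stat = ms_groups / ms_error
pooled_sd = math.sqrt(ms_error)  # the "Pooled StDev" line is the square root of MSE

print(f"Treatment  DF = {df_groups}  SS = {ss_groups:.0f}  MS = {ms_groups:.0f}  F = {f_stat:.2f}")
print(f"Error      DF = {df_error}  SS = {ss_error:.0f}  MS = {ms_error:.0f}")
print(f"Pooled StDev = {pooled_sd:.2f}")
```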

To check the conditions necessary for performing ANOVA, we need to produce a normal probability plot of the residuals and a residuals versus fits plot. Both of these plots are created by clicking on the “Graphs” button on the main dialogue box (see Figure 3.1(a)). This brings up the new dialogue box in Figure 3.1(b). Check the box marked “Normal plot of residuals” and/or “Residuals versus fits” as necessary.

If you prefer to do several residual plots in one window, instead of checking the boxes for the normal plot and the residual plot, click on the “Four in one” option. This produces the graph window displayed in Figure 3.2.

Figure 3.2: “Four in one” residual plots for the fruit fly data

Finally, to produce the output in Example 5.11: Fruit flies (one last time) (page 255), which includes Fisher’s LSD intervals, in the main ANOVA dialogue box click on the “Comparisons” button (see Figure 3.1(a)) to bring up a new dialogue box. In this new dialogue box, seen in Figure 3.3, click on “Fisher’s, individual error rate” and choose an error rate. We have chosen an error rate of 5% (or 95% intervals) for our example.

Figure 3.3: One-way ANOVA comparisons dialogue box
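As a rough check on what the Fisher’s LSD option computes, the least significant difference can be worked out from the one-way ANOVA output shown earlier. The sketch below assumes the usual formula, LSD = t* √(MSE (1/nᵢ + 1/nⱼ)), and takes t* ≈ 1.980 from a t-table for 120 error degrees of freedom; that critical value is typed in by hand, not computed.

```python
import math
from itertools import combinations

MSE = 219        # from the ANOVA table
N_PER_GROUP = 25
T_STAR = 1.980   # assumed: t critical value for 95% confidence with 120 df (from a table)

means = {
    "1 pregnant": 64.80,
    "1 virgin": 56.76,
    "8 pregnant": 63.36,
    "8 virgin": 38.72,
    "none": 63.56,
}

# Every group has the same size, so one LSD applies to all pairs.
lsd = T_STAR * math.sqrt(MSE * (1 / N_PER_GROUP + 1 / N_PER_GROUP))
print(f"LSD = {lsd:.2f}")

significant = []
for a, b in combinations(means, 2):
    diff = means[a] - means[b]
    flag = abs(diff) > lsd
    if flag:
        significant.append((a, b))
    print(f"{a:10s} vs {b:10s} diff = {diff:6.2f}  {'differs' if flag else ''}")
```

A pair of means is flagged when the two means differ by more than the LSD, which is equivalent to the pair's Fisher interval excluding zero.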


Beyond the Basics

In Figure 5.2 (page 229 of the text) we compared histograms of the residuals for both the null model (no group differences) and the ANOVA model. To create these histograms we need two new columns of data: one with the residuals for the null model and one with the residuals for the ANOVA model.

To create a column of the residuals for the null model:

a. Use Stat>Basic Statistics>Display Descriptive Statistics to compute the grand mean of the response variable.

b. Use Calc>Calculator to subtract the grand mean from the values of the response variable and store the result in a new column.

To create a column of the residuals for the ANOVA model, click on “Store residuals” on the main dialogue box (see Figure 3.1(a)) for ANOVA.

Now histograms can be created for both of these variables. Be careful to force the scales on both the x- and y-axes to be the same (double-click on either scale in the graph to bring up a dialogue box to adjust it).

3.2 Two-way ANOVA – Chapter 6

Basic Procedures from the Chapter

The basics for two-way ANOVA in Minitab are very similar to those for one-way ANOVA. If the data are balanced (the same number of observations for every treatment combination), then the easiest way to do a two-way ANOVA is to use Stat>ANOVA>Two-way. This brings up the dialogue box shown in Figure 3.4 using the RiverIron data of Example 6.2 (page 283 in the text). Note that Minitab automatically fits an additive model if there is only one observation per cell, and it automatically fits a model with interaction if there is more than one observation per cell.

Figure 3.4: Two-way ANOVA main dialogue box


Residual plots are created in exactly the same way as for the one-way ANOVA: Click on the “Graphs” button on the two-way main dialogue box and you will bring up exactly the same dialogue box as you saw with the one-way ANOVA (see Figure 3.1(b)). Make the appropriate choices and click “OK.”

The new kind of plot introduced in this chapter is an interaction plot. Figures 6.5 and 6.6 (page 289 of the text) give the two different interaction plots for the PigFeed data. These are created by clicking on Stat>ANOVA>Interactions Plot. The dialogue box that appears is given in Figure 3.5. Click on “Display full interaction plot matrix” to get both interaction plots; otherwise you will get only one. If you do only one plot, whichever factor is listed second will appear on the x-axis and the factor listed first will be the factor that defines the different lines in the plot.

Figure 3.5: Interactions plot dialogue box

Finally we note that Minitab does not compute Fisher’s LSD intervals for the two-way ANOVA model.

Beyond the Basics

The Two-Way command discussed above works for the situation where there are only two factors and the data are balanced. If you have a situation with more than two factors, or with unbalanced data, you need to use a different command. In that case, use Stat>ANOVA>General Linear Model. This results in the dialogue box shown in Figure 3.6. We have illustrated the use of this dialogue box using the PigFeed data, fitting a two-way ANOVA model with interaction. Note that we select Antibiotic, then B12, then select Antibiotic again, type an asterisk, and finally select B12 again. This last part is what specifies the interaction term. Alternatively you could use the Calc>Calculator command to multiply the two variables together, store the product in a new variable, and put that new variable as the interaction term in the model.

Once again, residual plots are created by clicking on the “Graphs” button, resulting in the familiar graphs dialogue box. Notice that, while there is a “Comparisons” button, Fisher’s LSD is not one of the choices here. As noted above, there is no way to compute Fisher’s LSD intervals in Minitab except in the one-way ANOVA situation.


Figure 3.6: ANOVA General Linear Model main dialogue box

3.3 Additional ANOVA Topics – Chapter 7

Topic: Levene’s Test

Levene’s test is also found under the ANOVA menu. Click Stat>ANOVA>Test for Equal Variances. This brings up the dialogue box in Figure 3.7. We have again used the FruitFlies data to illustrate using this dialogue box. This test does allow for both one-way and multiway ANOVA models.

Figure 3.7: Levene’s test main dialogue box


Topic: Multiple Tests

As discussed above in Section 3.1, Minitab computes Fisher’s LSD intervals if you click on “Comparisons” in the main one-way ANOVA dialogue box and then check the “Fisher’s, individual error rate” box on the subsequent dialogue box. The command to compute Tukey’s HSD intervals for a one-way ANOVA is in the same dialogue box. To have Minitab perform these computations, check the box marked “Tukey’s, family error rate” and choose an error rate. We have illustrated this dialogue box using an error rate of 5% in Figure 3.8.

Figure 3.8: One-way ANOVA comparisons dialogue box with Tukey’s checked

If you want to compute Bonferroni intervals for either a one-way or a multi-way ANOVA model, or you want Tukey’s HSD intervals for a multi-way model, then you need to use the General Linear Model command. Start by choosing Stat>ANOVA>General Linear Model. This brings up the dialogue box seen in Figure 3.6. Now click on the “Comparisons” button to bring up the comparisons dialogue box as seen in Figure 3.9. We have illustrated the use of this dialogue box with the FruitFlies data. In general, leave the option “Pairwise comparisons” checked at the top of the box to get intervals for each pair of levels. At the bottom of the box, the “Grouping information” option gives the table of which levels are significantly different from which other levels, and the “Intervals” option gives the actual confidence intervals.

Note that if you are doing a multi-way ANOVA model, you need to put all factors for which you want intervals into the box labeled “Terms” in the comparisons dialogue box (see Figure 3.9).


Figure 3.9: General linear model comparisons dialogue box with Tukey and Bonferroni checked

Topic: Comparisons and Contrasts

Minitab has no direct procedure for assessing contrasts after ANOVA. However, you can use the sample means and MSE from the ANOVA output to compute the standard error and test a contrast by hand (as outlined in Topic 7.3 starting on page 336 of the text).
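A by-hand contrast calculation might look like the sketch below. It uses the group means and MSE from the fruit fly ANOVA output in Section 3.1 and assumes the usual standard error for a contrast, √(MSE Σ cᵢ²/nᵢ). The particular contrast (the average of the three groups without virgin females versus the average of the two virgin groups) is chosen only for illustration.

```python
import math

MSE = 219       # from the one-way ANOVA table
DF_ERROR = 120  # error degrees of freedom, used to look up the t distribution

# (sample size, sample mean, contrast weight) for each group.
# The weights are an illustrative choice and must sum to zero.
groups = {
    "1 pregnant": (25, 64.80, 1 / 3),
    "8 pregnant": (25, 63.36, 1 / 3),
    "none":       (25, 63.56, 1 / 3),
    "1 virgin":   (25, 56.76, -1 / 2),
    "8 virgin":   (25, 38.72, -1 / 2),
}

weight_sum = sum(c for _, _, c in groups.values())
estimate = sum(c * mean for _, mean, c in groups.values())
std_error = math.sqrt(MSE * sum(c ** 2 / n for n, _, c in groups.values()))
t_stat = estimate / std_error

print(f"Contrast estimate = {estimate:.3f}")
print(f"Standard error    = {std_error:.3f}")
print(f"t = {t_stat:.2f} on {DF_ERROR} degrees of freedom")
```

The t statistic is then compared to a t distribution with the error degrees of freedom from the ANOVA table.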

Topic: Nonparametric Statistics

This topic covers two main testing procedures: Wilcoxon-Mann-Whitney (to compare two samples) and Kruskal-Wallis (a nonparametric version of ANOVA).

Wilcoxon-Mann-Whitney

We begin with the simpler case of comparing two samples and illustrate the Wilcoxon-Mann-Whitney test. As discussed in the text, there are actually two equivalent procedures: Wilcoxon and Mann-Whitney. We have chosen to refer to the general concept of the procedures with the title “Wilcoxon-Mann-Whitney” throughout the Topic. But, while both the Mann-Whitney and the Wilcoxon procedures will come to the same conclusion, how they get there is somewhat different. Since Minitab only does the Mann-Whitney for the two-sample case (there is a one-sample Wilcoxon available in Minitab), that is what we illustrate here.

We use the HawkTail2 data discussed in Example 7.11 (on page 348 of the text) to illustrate performing this test. As discussed in the text, some software packages require that the data be stacked (all response values in one column and the categorical variable dividing the responses into groups in another column). Other software packages, Minitab included, require that the data be unstacked. That is, the response values for each individual group must be in separate columns. The file HawkTail2 already has the data unstacked. You can, alternatively, work with the file HawkTail but you will first need to unstack the data (see page 24 in this manual).


Once you have data in an unstacked format, choose Stat>Nonparametrics>Mann-Whitney. This brings up the dialogue box seen in Figure 3.10. Enter one of the columns of data in the box marked “First Sample” and the other in the box marked “Second Sample.” Choose the level of confidence you wish (95% is the default) and the type of alternate hypothesis (not equal is the default) and click “OK.” The output is shown below and in Example 7.11 of the text.

Figure 3.10: Mann-Whitney dialogue box

Mann-Whitney Test and CI: Tail_RT, Tail_SS

N Median

Tail_RT 577 221.00

Tail_SS 261 150.00

Point estimate for ETA1-ETA2 is 76.00

95.0 Percent CI for ETA1-ETA2 is (74.00,78.00)

W = 316058.0

Test of ETA1 = ETA2 vs ETA1 not = ETA2 is significant at 0.0000

The test is significant at 0.0000 (adjusted for ties)
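The W statistic in this output is a rank sum. The sketch below shows the calculation on a small made-up data set, on the assumption that W is the sum of the ranks of the first sample after both samples are ranked together, with tied values sharing the average of their ranks. The data and function name are ours.

```python
def midranks(values):
    """Rank values from smallest to largest, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks

# Hypothetical tail lengths for two small samples (note the tie at 215).
first_sample = [221, 230, 215, 219]
second_sample = [150, 162, 215, 148, 155]

ranks = midranks(first_sample + second_sample)
w = sum(ranks[:len(first_sample)])  # rank sum for the first sample
print(f"W = {w}")
```

In the hawk output, W = 316058.0 is close to the largest rank sum possible for a first sample of 577 out of 838 observations, which fits the much larger median for Tail_RT.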


Kruskal-Wallis

The nonparametric version of the ANOVA procedure is the Kruskal-Wallis test. This test is used to compare two or more groups to each other for differences in location. Again we follow the example in the text and use the file CancerSurvival.

To perform the Kruskal-Wallis procedure, click on Stat>Nonparametrics>Kruskal-Wallis. This brings up the dialogue box seen in Figure 3.11. Notice that while the Mann-Whitney test required unstacked data, the Kruskal-Wallis test requires stacked data. The Minitab output is shown below.

Figure 3.11: Kruskal-Wallis dialogue box

Kruskal-Wallis Test: Survival versus Organ

Kruskal-Wallis Test on Survival

Organ N Median Ave Rank Z

Breast 11 1166.0 47.0 2.84

Bronchus 17 155.0 23.3 -2.37

Colon 17 372.0 35.9 0.88

Ovary 6 406.0 40.2 1.06

Stomach 13 124.0 24.2 -1.79

Overall 64 32.5

H = 14.95 DF = 4 P = 0.005

H = 14.95 DF = 4 P = 0.005 (adjusted for ties)
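The H statistic can be approximated from the N and Ave Rank columns above. The sketch below assumes the standard formula, H = 12/(N(N+1)) Σ nᵢ (R̄ᵢ − (N+1)/2)², without a tie correction. Because the average ranks are printed to one decimal place, the result agrees with Minitab's 14.95 only approximately.

```python
# (group size, average rank) copied from the Kruskal-Wallis output above.
groups = {
    "Breast":   (11, 47.0),
    "Bronchus": (17, 23.3),
    "Colon":    (17, 35.9),
    "Ovary":    (6, 40.2),
    "Stomach":  (13, 24.2),
}

n_total = sum(n for n, _ in groups.values())
overall_mean_rank = (n_total + 1) / 2  # matches the "Overall" row (32.5)

h = 12 / (n_total * (n_total + 1)) * sum(
    n * (avg_rank - overall_mean_rank) ** 2 for n, avg_rank in groups.values()
)
df = len(groups) - 1

print(f"N = {n_total}, overall average rank = {overall_mean_rank}")
print(f"H is approximately {h:.2f} with DF = {df}")
```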


Topic: ANOVA and Regression with Indicators

There are no new Minitab procedures in this topic.

Topic: Analysis of Covariance

Most of the procedures in Minitab necessary for Analysis of Covariance are covered in the Regression chapter of this manual and in earlier sections of this chapter. The only additional command necessary is the one to actually produce the ANCOVA table. To achieve this, select Stat>ANOVA>General Linear Model. Enter the response variable and the categorical factors in the main dialogue box (see Figure 3.6). Now click the “Covariates” button to bring up a new dialogue box (shown in Figure 3.12 using the Grocery data). Enter the covariate in this box and click “OK.”

Figure 3.12: Adding covariates to an ANCOVA model

CHAPTER 4

Logistic Regression

In this chapter we introduce you to the steps necessary to perform a logistic regression analysis, both simple and multiple, and the graphs that accompany such an analysis.

4.1 Logistic Regression and Odds – Chapter 9

Basic Calculation Procedures from the Chapter

The only new calculation procedure required for logistic regression is the one that produces the output that includes the estimated logistic regression model. Before we begin, we need to think about the format of our data. In many cases we will have one row of data in the spreadsheet for each observation, just as we did when we considered simple linear regression. In other cases we will have data that are already summarized in a two-way table. How we ask Minitab to compute the logistic regression equation will depend on which of these two ways the data are stored.

One row of data per observation

In Minitab, click Stat>Regression>Binary Logistic Regression to bring up the dialogue box shown in Figure 4.1. In this case we are using the MedGPA data and trying to predict acceptance to medical school from college GPA. As in the figure, select Acceptance as the “Response” variable and GPA as the “Model” variable.

The output produced from this command, seen below, is quite extensive, but includes the output shown in the examples in Chapter 9.


Figure 4.1: Logistic regression main dialogue box

Binary Logistic Regression: Acceptance versus GPA

Link Function: Logit

Response Information

Variable Value Count

Acceptance 1 30 (Event)

0 25

Total 55

**************************OUTPUT SHOWN IN CHAPTER 9*******************************

Logistic Regression Table

95% CI

Predictor Coef SE Coef Z P Odds Ratio Lower Upper

Constant -19.2065 5.62922 -3.41 0.001

GPA 5.45417 1.57931 3.45 0.001 233.73 10.58 5164.44

Log-Likelihood = -28.420

Test that all slopes are zero: G = 18.952, DF = 1, P-Value = 0.000

**************************OUTPUT SHOWN IN CHAPTER 9*******************************

Goodness-of-Fit Tests

Method Chi-Square DF P

Pearson 36.2463 38 0.551

Deviance 38.1105 38 0.464

Hosmer-Lemeshow 5.6721 8 0.684


Table of Observed and Expected Frequencies:

(See Hosmer-Lemeshow Test for the Pearson Chi-Square Statistic)

Group

Value 1 2 3 4 5 6 7 8 9 10 Total

1

Obs 1 1 1 3 3 3 3 4 7 4 30

Exp 0.3 1.3 1.5 2.4 2.7 3.8 4.3 4.0 6.1 3.6

0

Obs 4 5 4 3 2 3 3 1 0 0 25

Exp 4.7 4.7 3.5 3.6 2.3 2.2 1.7 1.0 0.9 0.4

Total 5 6 5 6 5 6 6 5 7 4 55

Measures of Association:

(Between the Response Variable and Predicted Probabilities)

Pairs Number Percent Summary Measures

Concordant 614 81.9 Somers’ D 0.65

Discordant 128 17.1 Goodman-Kruskal Gamma 0.65

Ties 8 1.1 Kendall’s Tau-a 0.33

Total 750 100.0
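Several entries in the Logistic Regression Table follow directly from the coefficient and its standard error. The sketch below recomputes the Z statistic, the odds ratio, and its 95% confidence interval for GPA, assuming the interval is formed by exponentiating Coef ± 1.96 × SE Coef. The normal multiplier 1.96 is rounded, so the upper limit differs slightly from Minitab's.

```python
import math

# Values copied from the GPA row of the Logistic Regression Table above.
coef = 5.45417
se_coef = 1.57931
Z_STAR = 1.96  # assumed rounded normal critical value for 95% confidence

z = coef / se_coef
odds_ratio = math.exp(coef)
ci_low = math.exp(coef - Z_STAR * se_coef)
ci_high = math.exp(coef + Z_STAR * se_coef)

print(f"Z = {z:.2f}")
print(f"Odds ratio = {odds_ratio:.2f}")
print(f"95% CI for the odds ratio: ({ci_low:.2f}, {ci_high:.2f})")
```

The interval is very wide because the odds ratio is for a full one-point change in GPA, which is a large change on this scale.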

Data summarized in a two-way table

Here we refer back to Example 9.17 (page 486 of the text) where we wondered whether a hand-held device that sends a magnetic pulse into the head might be an effective treatment for migraine headaches. The data given in that example are repeated here in the following two-way table:

                                 TMS   Placebo   Total
Pain-free two hours later         39        22      61
Not pain-free two hours later     61        78     139
Total                            100       100     200

In order to perform logistic regression with these data in Minitab we need to set up an appropriate worksheet. In general you should create three columns: one for the response variable, one for the explanatory variable, and one for the frequency. This is demonstrated in Figure 4.2 using the data above, where a response of being pain-free two hours later is coded as a 1 and otherwise 0, and receiving TMS is coded as a 1 and receiving the placebo is coded as 0.


Figure 4.2: Worksheet view of two-way table

Now click on Stat>Regression>Binary Logistic Regression as before. The response variable and the explanatory variable are filled in as before. The difference is that this time we fill in the box marked “Frequency (optional)” with the count variable. The completed dialogue box is shown in Figure 4.3.

Figure 4.3: Logistic regression dialogue box for data in a two-way table
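With a single 0/1 predictor, the fitted logistic model reproduces the observed proportions exactly, so the coefficients can be anticipated straight from the table. The sketch below lays out the four worksheet rows (response, treatment, frequency) as described above and computes the log odds by hand; the variable names are ours, not Minitab's.

```python
import math

# Worksheet rows: (response, TMS indicator, frequency).
# response = 1 means pain-free two hours later; TMS = 1 means the TMS group.
rows = [
    (1, 1, 39),
    (0, 1, 61),
    (1, 0, 22),
    (0, 0, 78),
]

def odds(tms):
    """Odds of being pain-free within one treatment group."""
    yes = sum(count for resp, group, count in rows if group == tms and resp == 1)
    no = sum(count for resp, group, count in rows if group == tms and resp == 0)
    return yes / no

odds_ratio = odds(1) / odds(0)
intercept = math.log(odds(0))   # log odds for the placebo group
slope = math.log(odds_ratio)    # change in log odds for TMS

# Fitted probability of being pain-free in the TMS group.
p_tms = 1 / (1 + math.exp(-(intercept + slope)))

print(f"Odds ratio = {odds_ratio:.3f}")
print(f"Intercept = {intercept:.4f}, slope = {slope:.4f}")
print(f"Fitted probability for TMS = {p_tms:.2f}")
```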

Basic Graphical Procedures

The new graph that we encountered in this chapter was the empirical logit plot. There is no single command in Minitab to create this plot, so we show you one possible step-by-step method. We will be loosely following the steps discussed in the box on page 474 of the text and will use the data in MedGPA to create the graph seen on the right-hand side of Figure 9.13 in the text.

The first thing that we need to do is to break the data into groups of roughly equal numbers of points. In this particular data set there are 55 observations. Example 9.12 discusses breaking this data set into 11 groups of 5 each or 5 groups of 11 each. We will break the data into 5 groups of 11 each. Here are the steps we take to create the plot.


(1) Sort the response and explanatory variables in order of smallest to largest for the explanatory variable. For this example, the explanatory variable is GPA. (See page 24 of this manual for how to sort and Figure 1.35 for what the dialogue box looks like for this particular data set, and be sure the response values stay with the sorted explanatory values.)

(2) Add a new grouping variable. Now that the data are sorted from smallest to largest, we need to create 5 groups of 11 each. Click on Calc>Make Patterned Data>Simple Set of Numbers. This brings up the dialogue box in Figure 4.4. Type a new name for your grouping variable into the box labeled “Store patterned data in.” We have chosen the name Group in our example. Then, because we want 5 groups, put the number 1 in the box labeled “From first value” and 5 in the box labeled “To last value” and leave the default value of 1 in the box labeled “In steps of.” Now, because we want 11 values in each group, put 11 in the box labeled “Number of times to list each value” and click “OK.” You will now have a new column in your worksheet labeled Group which has eleven 1’s, eleven 2’s and so on down to 5.

Figure 4.4: Dialogue box to create groups for empirical logit plot

(3) Compute the mean value of the explanatory variable (GPA in this case) for each of the groups. Click on Stat>Basic Statistics>Store Descriptive Statistics. This command, like the Display Descriptive Statistics command, calculates any statistics that you ask it to. The difference is that the values of the statistics will be saved in the worksheet rather than the session window so that we can use them later. The initial dialogue box is shown in Figure 4.5. Put the explanatory variable in the box labeled “Variables” and the grouping variable into the box labeled “By variables (optional).” Next, click the button marked “Statistics” and make sure that only the mean is selected. Finally, click “OK” twice. You should now see two new columns in your worksheet labeled ByVar1 and Mean1.

Figure 4.5: Dialogue box to store descriptive statistics

(4) Compute p̂ for each group. Click on Stat>Basic Statistics>Store Descriptive Statistics. This time, put the response variable (Acceptance) in the “Variables” box and when you click the button labeled “Statistics,” choose sum instead of mean. Now click “OK” twice. Again, you get two columns, ByVar2 and Sum2. The variable marked Sum2 gives the number of Yes responses. In this case, we note that in the last group, all responses are “Yes”. This means we will have to compute the adjusted p̂ so that when we compute the logit, we don’t get an undefined value. Calculate a new variable (we called it p-hat-adj) using the formula p-hat-adj = (1/2+Sum2)/12 (remember that there are 11 values in each group).

(5) Compute the logits. Now use the calculator tool to compute the logit = log(p-hat-adj/(1−p-hat-adj)).

(6) Create the plot. Finally, make a scatterplot of the logit against the mean value of the explanatory variable for each group. The Minitab graph for this example is shown in Figure 4.6.

Figure 4.6: Minitab empirical logit plot
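The six steps above can be sketched in a few lines of code. Since the MedGPA file is not reproduced here, the sketch generates 55 made-up (GPA, Acceptance) pairs; only the grouping and the adjusted proportion (1/2 + number of Yes responses)/(group size + 1) follow the steps in the text.

```python
import math
import random

random.seed(1)

# Hypothetical stand-in for MedGPA: 55 GPAs with 0/1 acceptance outcomes.
gpa = [round(random.uniform(2.7, 4.0), 2) for _ in range(55)]
accept = [
    1 if random.random() < 1 / (1 + math.exp(-(-19.21 + 5.454 * g))) else 0
    for g in gpa
]

# Step 1: sort by GPA, keeping each response with its GPA value.
pairs = sorted(zip(gpa, accept))

# Steps 2 to 5: form 5 groups of 11, then compute the mean GPA,
# the adjusted proportion, and the logit for each group.
GROUP_SIZE = 11
points = []
for start in range(0, len(pairs), GROUP_SIZE):
    group = pairs[start:start + GROUP_SIZE]
    mean_gpa = sum(g for g, _ in group) / len(group)
    yes = sum(a for _, a in group)
    p_hat_adj = (0.5 + yes) / (len(group) + 1)  # stays strictly between 0 and 1
    logit = math.log(p_hat_adj / (1 - p_hat_adj))
    points.append((mean_gpa, logit))

# Step 6: these are the coordinates to plot.
for mean_gpa, logit in points:
    print(f"mean GPA = {mean_gpa:.2f}, empirical logit = {logit:.2f}")
```

The adjustment matters for any group in which every response is the same, because the unadjusted logit would then be undefined.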


Beyond the Basics

It is often useful to produce a scatterplot of the data from which the logistic model is built. For the MedGPA data this is shown in Figure 4.7, where we are trying to predict acceptance into medical school based on college GPA. Since the only response values are zero and one, we add some random jitter in the y-direction to help see the individual points better (double-click on a dot in the scatterplot and choose the jitter tab).

Figure 4.7: Scatterplot of acceptance status by GPA

More useful yet would be adding the logistic regression curve to this scatterplot. Unfortunately this is not completely straightforward in Minitab. What we need to do is to calculate the value of the logistic regression equation at a large set of appropriate x values and then ask Minitab to draw a line through those coordinates. This process can be accomplished using the following steps.

(1) Create the scatterplot without the logistic regression line and take note of the scale on the x-axis. In this case, the scale goes from 2.6 to 4.0. We need to create a new column of numbers that spans this scale and has quite a few values. In this case we will create a column with the numbers from 2.6 to 4.0, increasing by 0.1.

(2) Click Calc>Make Patterned Data>Simple Set of Numbers. This brings up a dialogue box. Now fill in the boxes in the dialogue box. We suggest that you call the new variable something simple like x (this goes in the “Store patterned data in” box). Enter the beginning of the x-axis scale in the “From first value” box and the end of the x-axis scale in the “To last value” box, and choose an appropriate value for “In steps of.” Figure 4.8 shows how we filled in this box for the medical school data.


Figure 4.8: Simple set of numbers dialogue box filled in for the MedGPA data

(3) Next we need to have the computer compute the value of the estimated logistic regression equation for each value that we just put into the column of x-coordinates. Find the logistic regression equation from the output. For the medical school data we have

$$\log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = -19.21 + 5.454\,GPA \quad\text{or}\quad \hat{\pi} = \frac{e^{-19.21+5.454\,GPA}}{1+e^{-19.21+5.454\,GPA}}$$

This last equation gives us the probability of success at a given value of GPA, which is exactly what we want to plot. So go to Calc>Calculator, enter a new variable name (we suggest y) and the formula into the “Expression” box (see Figure 4.9).

Figure 4.9: Dialogue box to calculate the estimated probabilities of success for many values of x


Figure 4.10: Calculated line dialogue box

(4) Finally go back to the scatterplot from step 1. Right click on the plot and click on Add>Calculated Line. Enter the appropriate columns for the boxes marked “Y column” and “X column” and click “OK” (see Figure 4.10). This adds the estimated logistic regression curve onto the scatterplot and you finally have the graph shown in Figure 4.11.

Figure 4.11: Scatterplot with estimated logistic regression curve
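The two columns built in steps (1) through (3) can be previewed outside Minitab. The sketch below uses the rounded coefficients from the equation in step (3) and the same grid from 2.6 to 4.0 in steps of 0.1; the names x and y mirror the suggested column names.

```python
import math

B0, B1 = -19.21, 5.454  # rounded coefficients from the fitted equation

# Column x: 2.6, 2.7, ..., 4.0 (rounded to avoid floating-point drift).
x = [round(2.6 + 0.1 * i, 1) for i in range(15)]

# Column y: estimated probability of acceptance at each GPA value.
y = [math.exp(B0 + B1 * g) / (1 + math.exp(B0 + B1 * g)) for g in x]

for g, prob in zip(x, y):
    print(f"GPA = {g:.1f}  estimated probability = {prob:.3f}")
```

A smaller step size gives a smoother curve, since Minitab simply connects the calculated points.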


4.2 Multiple Logistic Regression - Chapter 10

There are no new Minitab procedures introduced in this section. The only change from the procedures listed above is that we now list two or more variables in the “Model” part of the dialogue box shown in Figure 4.1.

4.3 Logistic Regression: Additional Topics - Chapter 11

Topic: Fitting the Logistic Regression Model

There are no new Minitab procedures in this topic.

Topic: Assessing Logistic Regression Models

There are three new graphs discussed in this Topic, two of which can be created using Minitab. Minitab does not compute Cook’s distances for logistic regression models.

Delta Chi-Square graph

To create a Delta Chi-Square graph, bring up the usual binary logistic regression dialogue box and click on “Graphs.” In the new dialogue box, check the box marked “Delta chi-square vs. probability” (top box). Now click “OK” twice and you will get the Delta Chi-Square graph along with the usual output.

Plot of Pearson residuals

To create a plot of the Pearson residuals, bring up the usual binary logistic regression dialogue box and click on “Storage.” In the new dialogue box, check the box marked “Pearson residuals” (top box in the left-hand column). Now click “OK” twice and you will get a new column of data in your worksheet along with the usual output for binary logistic regression. The new column is titled PRES1 and has the Pearson residuals in it. To graph this, go to Graph>Time Series Plot and choose Simple. Put PRES1 into the “Series” box and click on the button marked “Data View.” In the new dialogue box, make sure that only “Symbols” is checked. Now click “OK” twice.

Overdispersion

The residual deviance discussed in Topic 11.2 of the text appears with the label Deviance in the Goodness-of-Fit portion of the Minitab logistic regression output. For example, using the putting data from Example 11.4 on page 570 of the text, we obtain the following Minitab output for fitting a logistic regression model to predict the success of making a putt based on its length.


Binary Logistic Regression: Made, Trials versus Length

Link Function: Logit

Response Information

Variable  Value      Count
Made      Event        338
          Non-event    249
Trials    Total        587

Logistic Regression Table

                                                  Odds     95% CI
Predictor       Coef    SE Coef      Z      P    Ratio  Lower  Upper
Constant     3.25684   0.368931   8.83  0.000
Length     -0.566142  0.0674707  -8.39  0.000     0.57   0.50   0.65

Log-Likelihood = -359.946
Test that all slopes are zero: G = 80.317, DF = 1, P-Value = 0.000

Goodness-of-Fit Tests

Method           Chi-Square  DF      P
Pearson             6.79414   3  0.079
Deviance            6.82565   3  0.078
Hosmer-Lemeshow     6.79414   3  0.079

The next to last line of this output shows the deviance statistic to be 6.82565 with 3 degrees of freedom. Looking back at Example 11.4 on page 570 in the text, you see that this matches the residual deviance of the R output for this model. When this deviance is much larger than the degrees of freedom, we might suspect there is a problem with overdispersion. Unfortunately, Minitab has no routine for fitting the overdispersion model as described in the text.
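As a rough cross-check of the output above, the Python sketch below recomputes the Z statistic, odds ratio, and 95% interval from the Length coefficient and its standard error, then compares the deviance to its degrees of freedom. The variable names are illustrative, and the closed-form tail probability used here applies only to a chi-square distribution with exactly 3 degrees of freedom.

```python
import math

# Values copied from the Minitab output above
coef, se = -0.566142, 0.0674707   # Length row of the logistic regression table
deviance, df = 6.82565, 3         # Deviance row of the goodness-of-fit table

z = coef / se                     # Wald Z statistic
odds_ratio = math.exp(coef)       # odds ratio for a one-unit increase in Length
z_crit = 1.959964                 # 97.5th percentile of the standard normal
ci = (math.exp(coef - z_crit * se), math.exp(coef + z_crit * se))

def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with exactly 3 df."""
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2))

p_deviance = chi2_sf_df3(deviance)
dispersion_ratio = deviance / df  # values well above 1 hint at overdispersion

print(round(z, 2), round(odds_ratio, 2), [round(v, 2) for v in ci])
print(round(p_deviance, 3), round(dispersion_ratio, 2))
```

The ratio of deviance to degrees of freedom is only an informal screen; the text describes the formal overdispersion model.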

Topic: Randomization Tests

See Chapter 5 in this Minitab Companion.


Topic: Analyzing Two-Way Tables with Logistic Regression

This section contains two procedures which have not yet been covered in this manual: the 2-sample z-test for proportions and the chi-square test for 2 × k tables. For a refresher on creating indicator variables and stacked bar charts, please see Chapter 1 of this manual.

Two-sample z-test for proportions

We use the data from Example 9.6 (page 464 of the text) to illustrate having Minitab perform a 2-sample z-test for proportions. In this case we want to know if those who receive the drug TMS have significantly fewer migraine headaches than those receiving the placebo. The data are repeated below:

                                TMS  Placebo  Total
Pain-free two hours later        39       22     61
Not pain-free two hours later    61       78    139
Total                           100      100    200

To test whether the groups experienced significantly different proportions of headaches, click on Stat>Basic Statistics>2 Proportions. In the dialogue box shown in Figure 4.12, click on the button marked “Summarized data” and fill in the boxes as shown in the figure. The resulting output is given below.

Figure 4.12: Dialogue box for 2-proportion z-test


Test and CI for Two Proportions

Sample   X    N  Sample p
1       39  100  0.390000
2       22  100  0.220000

Difference = p (1) - p (2)
Estimate for difference:  0.17
95% CI for difference:  (0.0445776, 0.295422)
Test for difference = 0 (vs not = 0):  Z = 2.66  P-Value = 0.008

Fisher’s exact test: P-Value = 0.014
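The Z statistic and confidence interval in this output can be reproduced by hand. The Python sketch below does so using the unpooled standard error, which is an assumption about the default Minitab setting that happens to match the printed values; the variable names are illustrative.

```python
import math
from statistics import NormalDist

# Summarized data from the migraine table
x1, n1 = 39, 100   # TMS: pain-free count, group size
x2, n2 = 22, 100   # Placebo: pain-free count, group size

p1, p2 = x1 / n1, x2 / n2
diff = p1 - p2

# Unpooled standard error of the difference in sample proportions
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

z = diff / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided p-value

z_crit = NormalDist().inv_cdf(0.975)           # about 1.96
ci = (diff - z_crit * se, diff + z_crit * se)

print(round(z, 2), round(p_value, 3))
print(round(ci[0], 7), round(ci[1], 6))
```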

Chi-square test for 2 × k tables

Unlike for the 2-sample z-test, when running a chi-square test we need to have the data in a worksheet. This can be accomplished in two ways. Either the two-way table of counts can be typed directly into the worksheet, or the individual observations can be stored in the worksheet as raw data. We will illustrate how to run the test both ways.

Two-way table in worksheet

For this example we use the migraine data again. We typed the table into an empty worksheet as shown in Figure 4.13(a).

(a) Worksheet with two-way table (b) Chi-square test using two-way table

Figure 4.13: Performing a chi-square test with a two-way table in the worksheet


To perform the test, click on Stat>Tables>Chi-Square Test (Two-Way Table in Worksheet). In the resulting dialogue box (see Figure 4.13(b)), move both relevant columns over to the box marked “Columns containing the table” and click “OK.”

Chi-Square Test: TMS, Placebo

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

         TMS  Placebo  Total
    1     39       22     61
       30.50    30.50
       2.369    2.369

    2     61       78    139
       69.50    69.50
       1.040    1.040

Total    100      100    200

Chi-Sq = 6.817, DF = 1, P-Value = 0.009
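To see where the expected counts and chi-square contributions in this output come from, here is a small Python sketch that rebuilds them from the observed table. The variable names are illustrative, and the p-value shortcut via `math.erfc` is valid only for 1 degree of freedom.

```python
import math

# Observed counts: rows are pain-free / not pain-free, columns are TMS / Placebo
table = [[39, 22],
         [61, 78]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
total = sum(row_totals)

# Expected count = (row total * column total) / grand total
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Contribution of each cell = (observed - expected)^2 / expected
contrib = [[(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
           for obs_row, exp_row in zip(table, expected)]

chi_sq = sum(sum(row) for row in contrib)
df = (len(table) - 1) * (len(table[0]) - 1)

# Upper-tail chi-square probability; this closed form holds for df = 1 only
p_value = math.erfc(math.sqrt(chi_sq / 2))

print(expected)
print([[round(c, 3) for c in row] for row in contrib])
print(round(chi_sq, 3), df, round(p_value, 3))
```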

Raw data in worksheet

To illustrate performing a chi-square test for a two-way table when the raw data are in the worksheet, we use the data in TipJoke. Recall that we want to know if there is a significant difference in tipping behavior based on what the waiter leaves at the table (joke, ad, nothing). To perform the chi-square test with this data set, click on Stat>Tables>Cross Tabulation and Chi-Square. This brings up the dialogue box shown in Figure 4.14.

Put one of the variables of interest in the “For rows” box and the other in the “For columns” box. Next, click on the “Chi-square” button. In the resulting dialogue box, click the first option: “Chi-square analysis.” You might also choose to show the expected counts, residuals, or contribution to the chi-square statistic for each cell. Finally, click “OK” twice to produce the output.


Figure 4.14: Dialogue box for cross tabulation and chi-square

Tabulated statistics: Card, Tip

Rows: Card   Columns: Tip

            0      1     All
Ad         60     14      74
        52.96  21.04   74.00

Joke       42     30      72
        51.53  20.47   72.00

None       49     16      65
        46.52  18.48   65.00

All       151     60     211
       151.00  60.00  211.00

Cell Contents:      Count
                    Expected count

Pearson Chi-Square = 9.953, DF = 2, P-Value = 0.007
Likelihood Ratio Chi-Square = 9.805, DF = 2, P-Value = 0.007
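The two statistics at the bottom of this output differ only in how each cell's discrepancy is measured. The Python sketch below computes both from the tabulated counts; the dictionary layout and names are illustrative, and the tail probability `exp(-x/2)` is exact only for 2 degrees of freedom.

```python
import math

# Observed counts from the cross tabulation: (Tip = 0, Tip = 1) for each card type
counts = {"Ad": (60, 14), "Joke": (42, 30), "None": (49, 16)}

col_totals = [sum(row[j] for row in counts.values()) for j in range(2)]
total = sum(col_totals)

pearson = 0.0
g_stat = 0.0
for row in counts.values():
    row_total = sum(row)
    for observed, col_total in zip(row, col_totals):
        expected = row_total * col_total / total
        pearson += (observed - expected) ** 2 / expected
        g_stat += 2 * observed * math.log(observed / expected)

df = (len(counts) - 1) * (2 - 1)

def chi2_sf_df2(x):
    """Upper-tail probability of a chi-square variable with exactly 2 df."""
    return math.exp(-x / 2)

print(round(pearson, 3), round(chi2_sf_df2(pearson), 3))
print(round(g_stat, 3), round(chi2_sf_df2(g_stat), 3))
```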

CHAPTER 5

Randomization, Bootstrapping, and Macros

Several sections of the text cover relatively modern, computer-intensive methods that involve generating a large number of randomized samples. These include:

• Topic 4.5: Randomization Test for Regression (page 198)

• Topic 4.6: Bootstrapping for Regression (page 202)

• Section 8.2: Randomization F-Test (page 407)

• Topic 11.3: Randomization Tests for Logistic Regression (page 577)

As of this writing, none of these procedures is implemented directly in the current version (16.0) of Minitab. However, Minitab allows the use of macros; a macro is essentially a small computer program that executes a series of Minitab commands. We have provided downloadable versions of several such macros to use with the bootstrap and randomization topics listed above. Each macro is just a plain text file with a series of typed Minitab commands. In what follows we first discuss how to run these macros from within Minitab. We then take a brief look at how the macros themselves are constructed in case you want to modify them or create your own.

5.1 Running Macros in Minitab

Minitab was originally a command-driven program where users typed in syntax for a command (such as regress ’Height’ on ’Weight’) and Minitab would read and then execute the request. Most of that functionality has now been replaced by the menu system and dialog boxes that let users specify requests for analysis. However, you can still use the old command-based system in place of (or along with) the menus.


To activate the Minitab commands: Click anywhere in the Session window to be sure it is active and choose Editor>Enable Commands from the main Minitab menus. This produces a

MTB>

prompt in the Session window. You will type commands at the Minitab prompt (and Minitab will show you the commands it is using, even if you enter them via the menus – try it!).

The macro files we use are all plain text files with names ending with .mac, for example:

SlopeRand.mac

BootRand.mac

ANOVAFRand.mac

LogisticRand.mac

One of the tricky aspects of running a macro is being sure that Minitab knows where to look for the file of commands to run. There is a directory within Minitab that can hold macros, but, depending on how Minitab is installed, you may not have access to that directory. The alternative is to store the macro in whatever folder is Minitab’s current directory. Use the command

MTB> dir

to see where Minitab is currently looking and which files are in that folder (often this is wherever you have most recently loaded data from). You can use commands (like cd to change the directory) to get to a different location or, what is often easier, just load a worksheet from within Minitab from the same folder that has the macro files (or keep the macros where you keep the data worksheets).

To run a Minitab macro: After the MTB> prompt, put a “%” symbol in front of the name of the macro file,¹ then list any additional information that is needed for the macro. For example,

MTB> %SlopeRand 'Weight' 'Height' c10

will execute SlopeRand.mac, specifying Weight as the response variable, Height as the predictor variable, and C10 as the Minitab column to hold the randomization statistics.

Many macros need to get information from the user (such as the number of randomizations you would like to run) and will issue prompts to let you enter the specifications. For example, if we load the worksheet with the SATGPA data (see Example 4.9 on page 198), we can do a randomization test for the correlation between GPA and VerbalSAT, using VerbalSAT as the predictor. To save some typing, we note that the response GPA is in column C3 and the predictor VerbalSAT is in C2. We choose C5 as the column to hold the newly generated randomization correlations.

WARNING! Take care that you do not overwrite important data when you choose the last column!

¹ You can also include a full path name, such as MTB> %'C:\MyStuff\Data\SlopeRand'

Here’s a typical macro session to create a randomization distribution of 1000 sample correlations, under a null hypothesis of no relationship between the variables in C2 and C3. The information after the MTB> and DATA> prompts is what is typed by the user.

MTB > %SlopeRand c3 c2 c5

Executing from file: SlopeRand.MAC

Which statistic?

1=corr 2=slope

DATA> 1

How many randomization samples?

DATA> 1000

MTB >

Note that the first prompt gives us a choice of statistics to include for each randomization sample:either the correlation (r) or the slope (β̂1).

The next prompt lets us specify the number of randomization samples (in this example, 1000). The speed of your machine will affect the number of randomizations that are feasible.

For each randomization, the values of the response variable (in this example, GPAs) are randomly scrambled so that any GPA is equally likely to be matched with any of the verbal SAT scores. This is consistent with a null hypothesis that the two variables are unrelated (either the population correlation ρ = 0 or the slope of the regression model β1 = 0).

If the input is properly specified, values (either randomization correlations or slopes) will begin appearing in the specified column (C5 in the example above). Be patient: the randomization process can take a while for a large number of samples. The MTB> prompt reappears when the randomizations are complete.
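The scrambling idea is not specific to Minitab. As a minimal sketch (not the macro itself), the following Python code permutes a response and recomputes the correlation each time. The function names and the small data set are hypothetical and are not the SATGPA data.

```python
import random

def pearson_r(x, y):
    """Sample correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def randomization_correlations(x, y, n_rand, seed=None):
    """Scramble y (sampling without replacement) n_rand times and
    return the correlation of each scrambled y with the original x."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_rand):
        scrambled = rng.sample(y, len(y))   # a random permutation of y
        stats.append(pearson_r(x, scrambled))
    return stats

# Hypothetical predictor and response values (not from the text)
x = [490, 520, 560, 580, 610, 640, 680, 700]
y = [2.8, 3.1, 2.9, 3.4, 3.0, 3.6, 3.3, 3.7]

rand_stats = randomization_correlations(x, y, 1000, seed=1)
print(len(rand_stats), round(min(rand_stats), 3), round(max(rand_stats), 3))
```

Fixing the seed makes a run repeatable; leaving it as `None` gives a fresh set of randomizations each time.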

What do we do with the new column of randomization statistics?

(1) Look at them. Try a histogram or dotplot of the randomization correlations (or slopes) to see what the distribution looks like when there is no association between the two variables. Figure 5.1 shows such a plot for the example given above. Compare this to Figure 4.15 on page 200 of the text.


Figure 5.1: Randomization distribution for 1000 correlations of GPA versus Verbal SAT

(2) See how many are more extreme than the original sample. The original 24 students in the SATGPA data show a correlation of r = 0.245 between GPA and VerbalSAT. Is that in an unusual place in the distribution of randomization correlations? We need to count how many of the randomization correlations are more extreme than r = 0.245. One easy way to do this in Minitab is to sort the new column of randomization statistics, using Data>Sort from the menus or commands like those below.²

MTB > sort c5 c5;

SUBC> Descending c5.

We can then look down the sorted column to identify how many are more extreme than the correlation for the observed sample (doubling if needed for a two-tail test). For the randomizations shown in Figure 5.1 we find that 123 of the 1000 samples give correlations of 0.245 or higher, so the estimated p-value is 2 × 123/1000 = 0.246.

As an alternative to sorting, you could create a new column (say C6) with values for the expression c5 >= 0.245 and sum the values in the new column. Use the Calc menu or type the commands directly.

MTB> let c6= (c5>=0.245)

MTB> sum c6
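The same count-and-double logic can be written in a few lines of Python. In this sketch the function name is hypothetical, and the list standing in for column C5 is constructed artificially so that exactly 123 of 1000 values are at or above 0.245, matching the p-value calculation 2 × 123/1000 above.

```python
def two_tail_p_value(rand_stats, observed):
    """Count randomization statistics at or above the observed value and
    double the upper-tail proportion for a two-tail test (capped at 1)."""
    count = sum(1 for r in rand_stats if r >= observed)
    return count, min(1.0, 2 * count / len(rand_stats))

# Artificial stand-in for column C5: 123 values above 0.245, 877 below
rand_stats = [0.30] * 123 + [0.00] * 877

count, p_value = two_tail_p_value(rand_stats, 0.245)
print(count, p_value)
```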

² The semicolon at the end of a command alerts Minitab to look for a following subcommand.

Syntax for the randomization and bootstrap macros

Randomization test for a slope or correlation (in a simple linear model):

MTB> %SlopeRand 'Y' 'X' 'Results'

where 'Y' is the response column, 'X' is the predictor column, and 'Results' is a column to hold the results.

Statistics to save for each sample: Correlation or slope

Randomization test for F-statistic in ANOVA for means:

MTB> %ANOVAFRand 'Y' 'Groups' 'Results'

where 'Y' is the quantitative response column, 'Groups' is the variable identifying the groups, and 'Results' is a column to hold the results.

Statistics to save for each sample: F-statistic for ANOVA

Randomization test for slope or odds ratio in logistic regression:

MTB> %LogisticRand 'Y' 'X' 'Results'

where 'Y' is the binary response column, 'X' is the predictor variable, and 'Results' is a column to hold the results.

Statistics to save for each sample: Slope or odds ratio

Bootstrap regression statistics (simple linear regression):

MTB> %Slope 'Y' 'X' 'Results'

where 'Y' is the response column, 'X' is the predictor column, and 'Results' is a column to hold the results.

Statistics to save for each sample: Correlation, slope, intercept, or regression standard error

The bootstrap differs from the randomization procedures in that the samples are generated with replacement from the original sample, rather than randomizing to make some null hypothesis true. We can find the standard deviation of the bootstrap statistics to estimate the standard error (SE) of the statistic (slope or correlation). We can also use Calc>Calculator to enter a command like Percentile('Results',0.025) to store a percentile from the bootstrap distribution needed to compute a confidence interval.
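To make the contrast with randomization concrete, here is a minimal Python sketch of a bootstrap for the regression slope: cases (x, y pairs) are resampled with replacement, and the standard deviation and percentiles of the resulting slopes give a standard error and a percentile interval. The function names and data are hypothetical, and the interpolation rule in `percentile` is an assumption that may differ slightly from the one Minitab uses.

```python
import random
import statistics

def slope(x, y):
    """Least-squares slope of y on x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

def bootstrap_slopes(x, y, n_boot, seed=None):
    """Resample (x, y) pairs with replacement and refit the slope each time."""
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    while len(slopes) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]   # indices drawn with replacement
        bx = [x[i] for i in idx]
        by = [y[i] for i in idx]
        if len(set(bx)) > 1:   # skip degenerate resamples with no spread in x
            slopes.append(slope(bx, by))
    return slopes

def percentile(values, q):
    """Percentile by linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

# Hypothetical predictor and response values (not from the text)
x = [490, 520, 560, 580, 610, 640, 680, 700]
y = [2.8, 3.1, 2.9, 3.4, 3.0, 3.6, 3.3, 3.7]

boot = bootstrap_slopes(x, y, 1000, seed=1)
se = statistics.stdev(boot)                            # bootstrap standard error
ci = (percentile(boot, 0.025), percentile(boot, 0.975))
print(round(se, 5), [round(v, 5) for v in ci])
```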


5.2 Structure of the Macros

While you don’t need to know the details of the macros to use them effectively to create randomization or bootstrap distributions for regression statistics, we give a brief overview of them here in case you would like to modify them for other situations. The complete macro is shown at the end of this chapter. We’ll first look at it one section at a time.

Getting started

The first macro statement indicates the start of the macro, and any text after the # symbol is a comment that isn’t processed in any way. So the main work starts with naming the macro sloperand, specifying that the input columns will be known generically as y and x within the macro, and specifying that the output column will be known generically as randstat. The remaining statements in this section define the variables to be used within the macro and set up a couple of other Minitab options.

macro

#####################################################################

# MACRO: SlopeRand.MAC #

# Purpose: Construct a distribution of randomization statistics #

# for regression or correlation by permuting the response variable #

#####################################################################

sloperand y x randstat

mcolumn y x tempy tempx randstat cint coeff1 ra rb

mconstant nrand stat n i tempmse

mmatrix corr1

mreset

brief 1

noecho

notitle

#clear out the column to contain randomization samples

erase randstat

Getting input

The next section of code is Minitab’s (very cumbersome) method for allowing the macro user to specify the statistic to be generated and the number of randomization samples to produce.

# Code below is to let user enter stat

note

note Which statistic?

note 1=corr 2=slope

set cint;

file "Terminal";

nobs 1.

copy cint stat

# Code below is to let user enter the number of randomizations

note

note How many randomization samples?

set cint;

file "Terminal";

nobs 1.

copy cint nrand

The big loop

After determining the required sample size, the next section of the macro does the main work when the statistic of interest is the sample slope. The do i=1:nrand statement signifies the start of a loop to be run nrand times to generate the randomization samples. The first statement in the loop, sample n y tempy, does the random scrambling of the response (y) variable and stores it in tempy. The next set of statements runs the regression model, producing no output but storing the coefficients and MSE. The final statement before ending the loop stores the slope (second coefficient) into the next slot of the randstat column that contains the randomization slopes.

let n=count(x) #sample size for original sample

if stat=2

#run regressions

do i=1:nrand

sample n y tempy

regress tempy 1 x;

coeff coeff1;

MSE tempmse;

brief 0.

let randstat(i)=coeff1(2)

enddo


The other big loop

The second loop handles the case when stat=1 and the user wants the correlation for each randomization sample. Again we scramble the response values and then compute the correlation between those scrambled values (tempy) and the original predictor (x) values. Finally, the correlation is saved into the randstat column.

elseif stat=1

#run correlations

do i=1:nrand

sample n y tempy

corr x tempy corr1

copy corr1 ra rb

let randstat(i)=ra(2)

enddo

Below we give the macro for randomization correlations and slopes in its entirety. The other macros have a similar structure and all are saved as plain text files so you can open, read, and edit them as needed. The main changes for the bootstrap macro (SlopeBoot.mac) are adding a replace. subcommand to the sample request, so that samples are taken with replacement, and offering more options for statistics to save.

macro

#####################################################################

# MACRO: SlopeRand.MAC #

# #

# Purpose: Construct a distribution of randomization statistics #

# for regression or correlation by permuting the response variable #

#####################################################################

sloperand y x randstat

mcolumn y x tempy tempx randstat cint coeff1 ra rb

mconstant nrand stat n i tempmse

mmatrix corr1

mreset

brief 1

noecho

notitle

#clear out the column to contain randomization samples

erase randstat

# Code below is to let user enter stat

note

note Which statistic?

note 1=corr 2=slope

set cint;

file "Terminal";

nobs 1.

copy cint stat

# Code below is to let user enter the number of randomizations

note

note How many randomization samples?

set cint;

file "Terminal";

nobs 1.

copy cint nrand

let n=count(x) #sample size for original sample


if stat=2

#run regressions

do i=1:nrand

sample n y tempy

regress tempy 1 x;

coeff coeff1;

MSE tempmse;

brief 0.

let randstat(i)=coeff1(2)

enddo

elseif stat=1

#run correlations

do i=1:nrand

sample n y tempy

corr x tempy corr1

copy corr1 ra rb

let randstat(i)=ra(2)

enddo

else

note

note ***Oops*** You need to specify 1 or 2 for the statistic

exit

endif

endmacro

