stat 145 (notes) - university of new mexicomath.unm.edu/~alvaro/stat145-p2.pdfin any graph of data,...

135
. . . . . . STAT 145 (Notes) Al Nosedal [email protected] Department of Mathematics and Statistics University of New Mexico Fall 2013

Upload: nguyenminh

Post on 21-Apr-2018

216 views

Category:

Documents


3 download

TRANSCRIPT

. . . . . .

STAT 145 (Notes)

Al [email protected]

Department of Mathematics and StatisticsUniversity of New Mexico

Fall 2013

. . . . . .

CHAPTER 4

SCATTERPLOTS AND CORRELATION.

. . . . . .

Definitions

I A response variable measures an outcome of a study.

I An explanatory variable may explain or influence changes in aresponse variable.

. . . . . .

Definitions

I A response variable measures an outcome of a study.

I An explanatory variable may explain or influence changes in aresponse variable.

. . . . . .

Coral reefs

How sensitive to changes in water temperature are coral reefs? Tofind out, scientists examined data on sea surface temperatures andcoral growth per year at locations in the Red Sea. What are theexplanatory and response variables? Are they categorical orquantitative?

. . . . . .

Solution

Sea-surface temperature is the explanatory variable; coral growth isthe response variable. Both are quantitative.

. . . . . .

Scatterplot

A scatterplot shows the relationship between two quantitativevariables measured on the same individuals. The values of onevariable appear on the horizontal axis, and the values of the othervariable appear on the vertical axis. Each individual in the dataappears as the point in the plot fixed by the values of bothvariables for that individual.Always plot the explanatory variable, if there is one, on thehorizontal axis (the x axis) of a scatterplot. As a reminder, weusually call the explanatory variable x and the response variable y .If there is no explanatory-response distinction, either variable cango on the horizontal axis.

. . . . . .

Do heavier people burn more energy?

Metabolic rate, the rate at which the body consumes energy, isimportant in studies of weight gain, dieting, and exercise. We havedata on the lean body mass and resting metabolic rate for 12women who are subjects in a study of dieting. Lean body mass,given in kilograms, is a person’s weight leaving out all fat.Metabolic rate is measured in calories burned per 24 hours.

. . . . . .

Data

Mass Rate Mass Rate

36.1 995 40.3 118954.6 1425 33.1 91348.5 1396 42.4 112442.0 1418 34.5 105250.6 1502 51.1 134742.0 1256 41.2 1204

The researchers believe that lean body mass is an importantinfluence on metabolic rate. Make a scatterplot to examine thisbelief.

. . . . . .

Scatterplot

35 40 45 50 55

900

1000

1100

1200

1300

1400

1500

Lean Body Mass (kg)

Metab

olic R

ate (c

alorie

s/day

)

. . . . . .

Examining a Scatterplot

In any graph of data, look for the overall pattern and for strikingdeviations from the pattern.You can describe the overall pattern of a scatterplot by thedirection, form, and strength of the relationship.An important kind of deviation is an outlier, an individual valuethat falls outside the overall pattern of the relationship.

. . . . . .

Positive Association, Negative Association

Two variables are positively associated when above-average valuesof one tend to accompany above-average values of the other, andbelow-average values also tend to occur together.Two variables are negatively associated when above-average valuesof one tend to accompany below-average values of the other, andvice versa.The strength of a relationship in a scatterplot is determined byhow closely the points follow a clear form.

. . . . . .

Positive Association (scatterplot)

2 4 6 8 10 12 14

1020

3040

5060

7080

x

y

. . . . . .

Negative Association (scatterplot)

2 4 6 8 10 12 14

2040

6080

100

x

y

. . . . . .

NO Association (scatterplot)

2 4 6 8 10 12 14

-10

-50

510

15

x

y

. . . . . .

Do heavier people burn more energy? (Again)

Describe the direction, form, and strength of the relationshipbetween lean body mass and metabolic rate, as displayed in yourplot.

. . . . . .

Solution

The scatterplot shows a positive direction, linear form, andmoderately strong association.

. . . . . .

Do heavier people burn more energy? (Again)

The study of dieting described earlier collected data on the leanbody mass (in kilograms) and metabolic rate (in calories) for bothfemale and male subjects.

Mass Rate Sex Mass Rate Sex

36.1 995 F 40.3 1189 F54.6 1425 F 33.1 913 F48.5 1396 F 42.4 1124 F42.0 1418 F 34.5 1052 F50.6 1502 F 51.1 1347 F42.0 1256 F 41.2 1204 F

. . . . . .

More data

Mass Rate Sex Mass Rate Sex

51.9 1867 M 47.4 1322 M46.9 1439 M 48.7 1614 M62.0 1792 M 51.9 1460 M62.9 1666 M

. . . . . .

a) Make a scatterplot of metabolic rate versus lean body mass forall 19 subjects. Use separate symbols to distinguish women andmen. (This is a common method to compare two groups ofindividuals in a scatterplot)b) Does the same overall pattern hold for both women and men?What is the most important difference between women and men?

. . . . . .

Scatterplot

35 40 45 50 55 60

1000

1200

1400

1600

1800

Lean Body Mass (kg)

Metab

olic R

ate (c

alorie

s/day

)

femalemale

. . . . . .

b) Solution

For both men and women, the association is linear and positive.The women’s points show a stronger association. As a group,males typically have larger values for both variables (they tend tohave more mass, and tend to burn more calories per day).

. . . . . .

Correlation

The correlation measures the direction and strength of the linearrelationship between two quantitative variables. Correlation isusually written as r .Suppose that we have data on variables x and y for n individuals.The values for the first individual are x1 and y1, the values for thesecond individual are x2 and y2, and so on.The means and standard deviations of the two variables are x andSx for the x-values, and y and Sy for the y -values. The correlationr between x and y is

r =1

n − 1

n∑i=1

(xi − x

Sx

)(yi − y

Sy

)

. . . . . .

Covariance

You can compute the covariance, Sxy using the following formula:

Sxy =

∑ni=1 xiyin − 1

− nx y

n − 1

. . . . . .

Correlation (alternative formula)

r =SxySxSy

wherer = correlation.Sxy = covariance.Sx = standard deviation of x .Sy = standard deviation of y .

. . . . . .

Example

Five observations taken for two variables follow

xi yi4 506 5011 403 6016 30

a) Develop a scatter diagram with x on the horizontal axis.b) Compute the sample covariance.c) Compute and interpret the sample correlation coefficient.

. . . . . .

a) Scatterplot

4 6 8 10 12 14 16

3035

4045

5055

60

x

y

. . . . . .

x and y

First, let’s find x and y

x =4 + 6 + 11 + 3 + 16

5= 8

y =50 + 50 + 40 + 60 + 30

5= 46

. . . . . .

Sx

Now, let’s find Sx and Sy

S2x =

(4− 8)2 + (6− 8)2 + (11− 8)2 + (3− 8)2 + (16− 8)2

4

S2x =

(−4)2 + (−2)2 + (3)2 + (−5)2 + (8)2

4=

118

4= 29.5

Sx = 5.4313

. . . . . .

Sy

S2y =

(50− 46)2 + (50− 46)2 + (40− 46)2 + (60− 46)2 + (30− 46)2

4

S2y =

(4)2 + (4)2 + (−6)2 + (14)2 + (−16)2

4=

520

4= 130

Sy = 11.4017

. . . . . .

Covariance and Correlation

Finally, we find Sxy and r

5∑i=1

xiyi = (4)(50)+(6)(50)+(11)(40)+(3)(60)+(16)(30) = 1600

Sxy =

∑ni=1 xiyin − 1

− nx y

n − 1=

1600

4− (5)(8)(46)

4= −60

r =SxySxSy

=−60

(5.4313)(11.4017)= −0.9688

. . . . . .

Coral reefs

Exercise 4.2 discusses a study in which scientists examined data onmean sea surface temperatures (in degrees Celsius) and mean coralgrowth (in millimeters per year) over a several-year period atlocations in the Red Sea. Here are the data:

Sea Surface Temperature Growth

29.68 2.6329.87 2.5830.16 2.6030.22 2.4830.48 2.2630.65 2.3830.90 2.26

. . . . . .

a) Make a scatterplot. Which is the explanatory variable?b) Find the correlation r step-by-step. Explain how your value for rmatches your graph in a).c) Enter these data into your calculator and use the correlationfunction to find r .

. . . . . .

ScatterplotTemperature is the explanatory variable.

2.3 2.4 2.5 2.6

29.8

30.0

30.2

30.4

30.6

30.8

Sea Surface Temperature

Cora

l Gro

wth (

mm)

. . . . . .

x and y

First, let’s find x and y

x =29.68 + 29.87 + 30.16 + 30.22 + 30.48 + 30.65 + 30.90

7= 30.28

y =2.63 + 2.58 + 2.60 + 2.48 + 2.26 + 2.38 + 2.26

7= 2.4557

. . . . . .

Sx

Now, let’s find Sx and Sy

S2x =

(29.68− 30.28)2 + ...+ (30.65− 30.28)2 + (30.90− 30.28)2

6

S2x = 0.1845

Sx = 0.4296

. . . . . .

Sy

S2y =

(2.63− 2.4557)2 + ...+ (2.38− 2.4557)2 + (2.26− 2.4557)2

6

S2y = 0.0249

Sy = 0.1578

. . . . . .

Covariance

Finally, we find Sxy and r

7∑i=1

xiyi = (29.68)(2.63) + ...+ (30.65)(2.38) + (30.90)(2.26)

= 520.1504

Sxy =

∑ni=1 xiyin − 1

− nx y

n − 1

=520.1504

6− (7)(30.28)(2.4557)

6= 86.6917− 86.7516 = −0.0599

. . . . . .

Correlation

r =SxySxSy

=−0.0599

(0.4296)(0.1578)= −0.8835

This is consistent with the strong, negative association depicted inthe scatterplot.c) Ti-84 will give a value of r = −0.8914.

. . . . . .

Example

Five observations taken for two variables follow

xi yi1 12 23 34 45 5

a) Develop a scatter diagram with x on the horizontal axis.b) Compute the sample covariance.c) Compute and interpret the sample correlation coefficient.

. . . . . .

a) Scatterplot

1 2 3 4 5

12

34

5

x

y

. . . . . .

Covariance and Correlation

x = 3 Sx = 1.58113883y = 3 Sy = 1.58113883∑5

i=1 xiyi = 55

Sxy =

∑xiyi

n − 1− nx y

n − 1

Sxy =55

4− (5)(3)(3)

4

Sxy = 13.75− 11.25 = 2.5

r =SxySxSy

=2.5

(1.58113883)(1.58113883)= 1

. . . . . .

Facts about correlation

1. Correlation makes no distinction between explanatory andresponse variables. It makes no difference which variable you call xand which you call y in calculating the correlation.2. Because r uses the standardized values of the observations, rdoes not change when we change the units of measurement of x, y,or both. The correlation r itself has no unit of measurement; it isjust a number.3. Positive r indicates positive association between the variables,and negative r indicates negative association.4. The correlation r is always a number between −1 and 1. Valuesof r near 0 indicate a very weak linear relationship. Perfectcorrelation, r = 1 or r = −1, occurs only when the points on ascatterplot lie exactly on a straight line.

. . . . . .

Strong association but no correlation

The gas mileage of an automobile first increases and thendecreases as the speed increases. Suppose that this relationship isvery regular, as shown by the following data on speed (miles perhour) and mileage (miles per gallon):

Speed Mileage

30 2440 2850 3060 2870 24

. . . . . .

Strong association but no correlation (cont.)

Make a scatterplot of mileage versus speed. Show that thecorrelation between speed and mileage is r = 0. Explain why thecorrelation is 0 even though there is a strong relationship betweenspeed and mileage.

. . . . . .

Scatterplot

30 40 50 60 70

2425

2627

2829

30

Speed

Mpg

Remember that correlation only measures the strength anddirection of a linear relationship between two variables.

. . . . . .

Correlation

x = 50 Sx = 15.8113y = 26.8 Sy = 2.6832∑5

i=1 xiyi = 6700

Sxy =

∑xiyi

n − 1− nx y

n − 1

Sxy =6700

4− (5)(50)(26.8)

4

Sxy = 1675− 1675 = 0

r =SxySxSy

=0

(15.8113)(2.6832)= 0

. . . . . .

More facts about correlation

1. Correlation requires that both variables be quantitative, so thatit makes sense to do the arithmetic indicated by the formula for r.2. Correlation measures the strength of only the linear relationshipbetween two variables. Correlation does not describe curvedrelationships between variables, no matter how strong they are.3. Like the mean and the standard deviation, the correlation is notresistant: r is strongly affected by a few outlying observations.4. Correlation is not a complete summary of two-variable data,even when the relationship between the variables is linear. Youshould give the means and standard deviations of both x and yalong with the correlation.

. . . . . .

CHAPTER 5

REGRESSION

. . . . . .

Regression Line

A regression line is a straight line that describes how a responsevariable y changes as an explanatory variable x changes. We oftenuse a regression line to predict the value of y for a given value of x .

. . . . . .

Review of Straight Lines

Suppose that y is a response variable (plotted on the vertical axis)and x is an explanatory variable (plotted on the horizontal axis). Astraight line relating y to x has an equation of the form

y = a+ bx

In this equation, b is the slope, the amount by which y changeswhen x increases by one unit. The number a is the intercept, thevalue of y when x = 0.

. . . . . .

City mileage, highway mileage

We expect a car’s highway gas mileage (mpg) to be related to itscity gas mileage. Data for all 1040 vehicles in the government’s2010 Fuel Economy Guide give the regression linehighway mpg = 6.554 + (1.016 x city mpg)for predicting highway mileage from city mileage.a) What is the slope of this line? Say in words what the numericalvalue of the slope tells you.b) What is the intercept? Explain why the value of the intercept isnot statistically meaningful.c) Find the predicted highway mileage for a car that gets 16 milesper gallon in the city. Do the same for a car with a city mileage of28 mpg.

. . . . . .

Solutions

a) The slope is 1.016. On average, highway mileage increases by1.016 mpg for each additional 1 mpg change in city mileage.b) The intercept is 6.554 mpg. This is the highway mileage for anonexistent car that gets 0 mpg in the city. Although thisinterpretation is valid, such prediction would be invalid because itinvolves considerable extrapolation.c) For a car that gets 16 mpg in the city, we predict highwaymileage to be:

6.554 + (1.016)(16) = 22.81 mpg.

For a car that gets 28 mpg in the city, we predict highway mileageto be:

6.554 + (1.016)(28) = 35.002 mpg.

. . . . . .

What’s the line?

You use the same bar of soap to shower each morning. The barweighs 80 grams when it is new. Its weight goes down by 5 gramsper day on the average. What is the equation of the regression linefor predicting weight from days of use?

. . . . . .

Solution

The equation is:

weight = 80− 5× days

The intercept is 80 grams (the initial weight), and the slope is −5grams/day.

. . . . . .

Least-Squares Regression Line

The least-squares regression line of y on x is the line that makesthe sum of the squares of the vertical distances of the data pointsfrom the line as small as possible.

. . . . . .

Equation of the Least-Squares Regression Line

We have data on an explanatory variable x and a response variabley for n individuals. From the data, calculate the means x and yand the standard deviations Sx and Sy of the two variables, andtheir correlation r . The least-squares regression line is the line

y = a+ bx

with slope

b = rSySx

and intercept

a = y − bx

. . . . . .

Coral reefs

Exercises 4.2 and 4.8 discuss a study in which scientists examineddata on mean sea surface temperatures (in degrees Celsius) andmean coral growth (in millimeters per year) over a several-yearperiod at locations in the Red Sea. Here are the data:

Sea Surface Temperature Growth

29.68 2.6329.87 2.5830.16 2.6030.22 2.4830.48 2.2630.65 2.3830.90 2.26

. . . . . .

a) Use your calculator to find the mean and standard deviation ofboth sea surface temperature x and growth y and the correlation rbetween x and y . Use these basic measures to find the equation ofthe least-squares line for predicting y from x .b) Enter the data into your software or calculator and use theregression function to find the least-squares line. The result shouldagree with your work in a) up to roundoff error.

. . . . . .

Solutions

a) x = 30.28 Sx = 0.4296y = 2.4557 Sy = 0.1578r = −0.8914.Hence,

b = rSySx

= (−0.8914)

(0.1578

0.4296

)= −0.3274

a = y − bx = 2.4557− (−0.3274)(30.28) = 12.3693

b) Ti-84 gives: slope = - 0.3276 and intercept = 12.3758.

. . . . . .

Do heavier people burn more energy?

We have data on the lean body mass and resting metabolic ratefor 12 women who are subjects in a study of dieting. Lean bodymass, given in kilograms, is a person’s weight leaving out all fat.Metabolic rate, in calories burned per 24 hours, is the rate atwhich the body consumes energy.

Mass Rate Mass Rate

36.1 995 40.3 118954.6 1425 33.1 91348.5 1396 42.4 112442.0 1418 34.5 105250.6 1502 51.1 134742.0 1256 41.2 1204

. . . . . .

a) Make a scatterplot that shows how metabolic rate depends onbody mass. There is a quite strong linear relationship, withcorrelation r = 0.876.b) Find the least-squares regression line for predicting metabolicrate from body mass. Add this line to your scatterplot.c) Explain in words what the slope of the regression line tells us.d) Another woman has a lean body mass of 45 kilograms. What isher predicted metabolic rate?

. . . . . .

Scatterplot

35 40 45 50 55

900

1000

1100

1200

1300

1400

1500

Lean Body Mass (kg)

Metab

olic R

ate (c

alorie

s/day

)

. . . . . .

Solutions

b) the regression equation is

y = 201.2 + 24.026x

where y= metabolic rate and x= body mass.c) The slope tells us that on the average, metabolic rate increasesby about 24 calories per day for each additional kilogram of bodymass.d) For x = 45 kg, the predicted metabolic rate isy = 1282.4 calories per day.

. . . . . .

Facts about Least-Squares Regression

1. The distinction between explanatory and response variables isessential in regression.2. The least-squares regression line always passes through thepoint (x , y) on the graph of y against x.3. The square of the correlation, r2, is the fraction of the variationin the values of y that is explained by the least-squares regressionof y on x.

. . . . . .

What’s my grade?

In Professor Krugman’s economics course the correlation betweenthe student’s total scores prior to the final examination and theirfinal-examination scores is r = 0.5. The pre-exam totals for allstudents in the course have mean 280 and standard deviation 40.The final-exam scores have mean 75 and standard deviation 8.Professor Krugman has lost Julie’s final exam but knows that hertotal before the exam was 300. He decides to predict herfinal-exam score from her pre-exam total.a) What is the slope of the least-squares regression line offinal-exam scores on pre-exam total scores in this course? What isthe intercept?b) Use the regression line to predict Julie’s final-exam score.c) Julie doesn’t think this method accurately predicts how well shedid on the final exam. Use r2 to argue that her actual score couldhave been much higher (or much lower) than the predicted value.

. . . . . .

Solutions

a) b = rSySy

= (0.5) 840 = 0.1

a = y − bx = 75− (0.1)(280) = 47.Hence, the regression equation is y = 47 + 0.1x .b) Julie’s pre-final exam total was 300, so we would predict a finalexam score of

y = 47 + (0.1)(300) = 77.

c) Julie is right ... with a correlation of r = 0.5, r2 = 0.25, so theregression line accounts for only 25% of the variability in studentfinal exam scores. That is, the regression line doesn’t predict finalexam scores very well. Julie’s score could, indeed, be much higheror lower than the predicted 77.

. . . . . .

Residuals

A residual is the difference between an observed value of theresponse variable and the value predicted by the regression line.That is, a residual is the prediction error that remains after wehave chosen the regression line:residual = observed y - predicted yresidual = y − y .

. . . . . .

Residual Plots

A residual plot is a scatterplot of the regression residuals againstthe explanatory variable. Residual plots help us assess how well aregression line fits the data.

. . . . . .

Residuals by hand

In Exercise 5.3 (page 115) you found the equation of theleast-squares line for predicting coral growth y from mean seasurface temperature x .a) Use the equation to obtain the 7 residuals step-by-step. That is,find the prediction y for each observation and then find theresidual y − y .b) Check that (up to roundoff error) the residuals add to 0.c) The residuals are the part of the response y left over after thestraight-line tie between y and x is removed. Show that thecorrelation between the residuals and x is 0 (up to roundoff error).That this correlation is always 0 is another special property ofleast-squares regression.

. . . . . .

Solutions

a) The residuals are computed in the table below usingy = −0.3276x + 12.3758.

xi yi yi yi − yi29.68 2.63 2.65632 - 0.02263229.87 2.58 2.59038 - 0.01038830.16 2.60 2.495384 0.10461630.22 2.48 2.475728 0.00427230.48 2.26 2.390552 - 0.13055230.65 2.38 2.33486 0.0451430.90 2.26 2.25296 0.00704

b)∑

(yi − yi ) = −0.002504 (they sum to zero, except for roundingerror.c) From software, the correlation between xi and yi − yi is−0.0000854, which is zero except for rounding.

. . . . . .

Do heavier people burn more energy?

Return to the data of Exercise 5.4 (page 117) on lean body massand metabolic rate. We will use these data to illustrate influence.a) Make a scatterplot of the data that is suitable for predictingmetabolic rate from body mass, with two new points added. PointA: mass 42 kilograms, metabolic rate 1500 calories. Point B: mass70 kilograms, metabolic rate 1400 calories. In which direction iseach of these points an outlier?b) Add three least-squares regression lines to your plot: for theoriginal 12 women, for the original women plus Point A, and forthe original women plus Point B. Which new point is moreinfluential for the regression line? Explain in simple language whyeach new point moves the line in the way your graph shows.

. . . . . .

Scatterplot

30 40 50 60 70

900

1000

1100

1200

1300

1400

1500

Lean Body Mass (kg)

Metab

olic R

ate (c

alorie

s/day

)

A

B

. . . . . .

Least-squares regression lines

30 40 50 60 70

900

1000

1100

1200

1300

1400

1500

Lean Body Mass (kg)

Metab

olic R

ate (c

alorie

s/day

)

A

B

originaloriginal + Aoriginal + B

. . . . . .

Solutions

a) Point A lies above the other points; that is, the metabolic rateis higher than we expect for the given body mass. Point B lies tothe right of the other points; that is, it is an outlier in the x (mass)direction, and the metabolic rate is lower than we would expect.b) In the plot, the dashed blue line is the regression line for theoriginal data. The dashed red line slightly that includes Point A; ithas a very similar slope to the original line, but a slightly higherintercept, because Point A pulls the line up. The third line includesPoint B, the more influential point; because Point B is an outlier inthe x direction, it ”pulls” the line down so that it is less steep.

. . . . . .

Influential observations

An observation is influential for a statistical calculation if removingit would markedly change the result of the calculation.The result of a statistical calculation may be of little practical useif it depends strongly on a few influential observations.Points that are outliers in either the x or the y direction of ascatterplot are often influential for the correlation. Points that areoutliers in the x direction are often influential for the least-squaresregression line.

. . . . . .

The endangered manatee

Table 4.1 gives 33 years of data on boats registered in Florida andmanatees killed by boats. If we made a scatterplot for this dataset, it would show a strong positive linear relationship. Thecorrelation is r = 0.951.a) Find the equation of the least-squares line for predictingmanatees killed from thousands of boats registered. Because thelinear pattern is so strong, we expect predictions from this line tobe quite accurate - but only if conditions in Florida remain similarto those of the past 33 years.b) In 2009, experts predicted that the number of boats registeredin Florida would be 975,000 in 2010. How many manatees do youpredict would be killed by boats if there are 975,000 boatsregistered? Explain why we can trust this prediction.c) Predict manatee deaths if there were no boats registered inFlorida. Explain why the predicted count of deaths is impossible.

. . . . . .

Table 4.1

Year Boats Manatees Year Boats Manatees

1977 447 13 1988 675 431978 460 21 1989 711 501979 481 24 1990 719 471980 498 16 1991 681 531981 513 24 1992 679 381982 512 20 1993 678 351983 526 15 1994 696 491984 559 34 1995 713 421985 585 33 1996 732 601986 614 33 1997 755 541987 645 39 1998 809 66

. . . . . .

Table 4.1 (cont.)

Year Boats Manatees

1999 830 822000 880 782001 944 812002 962 952003 978 732004 983 692005 1010 792006 1024 922007 1027 732008 1010 902009 982 97

. . . . . .

Solutions

a) The regression line is y = −43.172 + 0.129x .b) If 975, 000 boats are registered, then by our scale, x = 975, andy = −43.172 + (0.129)(975) = 82.6 manatees killed. Theprediction seems reasonable, as long as conditions remain the same,because ”975” is within the space of observed values of x on whichthe regression line was based. That is, this is not extrapolation.c) If x = 0 (corresponding to no registered boats), then we would”predict” −43.172 manatees to be killed by boats. This is absurd,because it is clearly impossible for fewer than 0 manatees to bekilled. Note that x = 0 is well outside the range of observed valuesof x on which the regression line was based.

. . . . . .

Extrapolation

Extrapolation is the use of a regression line for prediction faroutside the range of values of the explanatory variable x that youused to obtain the line. Such predictions are often not accurate.

. . . . . .

Is math the key to success in college?

A College Board study of 15,941 high school graduates found astrong correlation between how much math minority students tookin high school and their later success in college. New articlesquoted the head of the College Board as saying that ”Math is thegatekeeper for success in college.” Maybe so, but we should alsothink about lurking variables. What might lead minority studentsto take more or fewer high school math courses? Would thesesame factors influence success in college?

. . . . . .

Solution

A student’s intelligence may be a lurking variable: strongerstudents (who are more likely to succeed when they get to college)are more likely to choose to take these math courses, while weakerstudents may avoid them. Other possible answers may bevariations on this idea; for example, if we believe that success incollege depends on a student’s self-confidence, and perhapsconfident students are more likely to choose math courses.

. . . . . .

Lurking Variable

A lurking variable is a variable that is not among the explanatoryor response variables in a study and yet may influence theinterpretation of relationships among those variables.

. . . . . .

Chapter 8

PRODUCING DATA: SAMPLING.

. . . . . .

Population, Sample, Sampling Design

The population in a statistical study is the entire group ofindividuals about which we want information.A sample is a part of the population from which we actually collectinformation. We use a sample to draw conclusions about the entirepopulation.A sampling design describes exactly how to choose a sample fromthe population.The first step in planning a sample survey is to say exactly whatpopulation we want to describe. The second step is to say exactlywhat we want to measure, that is, to give exact definitions of ourvariables.

. . . . . .

Customer satisfaction

A department store mails a customer satisfaction survey to peoplewho make credit card purchases at the store. This month, 45,000people made credit card purchases. Surveys are mailed to 1000 ofthese people, chosen at random, and 137 people return the surveyform.a) What is the population of interest for this survey?b) What is the sample? From what group is information actuallyobtained?

. . . . . .

Solutions

a) The population is all 45,000 people who made credit cardpurchases.b) The sample is the 137 people who returned the survey form.

. . . . . .

How to sample badly

The final step in planning a sample survey is the sampling design.A sampling design is a specific method for choosing a sample fromthe population. The easiest- but not the best - design just choosesindividuals close at hand. A sample selected by taking the membersof the population that are easiest to reach is called a conveniencesample. Convenience samples often produce unrepresentative data.

. . . . . .

Bias

The design of a statistical study is biased if it systematically favorscertain outcomes.

. . . . . .

Voluntary response sample

A voluntary response sample consists of people who choosethemselves by responding to a broad appeal. Voluntary responsesamples are biased because people with strong opinions are mostlikely to respond.

. . . . . .

Sampling on campus

You see a woman student standing in front of the student center,now and then stopping other students to ask them questions. Shesays that she is collecting student opinions for a class assignment.Explain why this sampling method is almost certainly biased.

. . . . . .

Solution

It is a convenience sample; she is only getting opinions fromstudents who are at the student center at a certain time of day.This might underrepresent some group: commuters, graduatestudents, or nontraditional students, for example.

. . . . . .

Simple Random Sampling

A simple random sample (SRS) of size n consists of n individualsfrom the population chosen in such a way that every set of nindividuals has an equal chance to be the sample actually selected.

. . . . . .

Random digits

A table of random digits is a long string of the digits 0, 1, 2, 3, 4,5, 6, 7, 8, 9 with these two properties:1. Each entry in the table is equally likely to be any of the 10digits 0 through 9.2. The entries are independent of each other. That is, knowledgeof one part of the table gives no information about any other part.

. . . . . .

Using Table B to choose an SRS

Label: Give each member of the population a numerical label ofthe same length.Table: To choose a SRS, read from Table B successive groups ofdigits of the length you used as labels. Your sample contains theindividuals whose labels you find in the table.

. . . . . .

Apartment living

You are planning a report on apartment living in a college town.You decide to select three apartment complexes at random forin-depth interviews with residents. Use software or Table B toselect a simple random sample of 4 of the following apartmentcomplexes. If you use Table B, start at line 122.

. . . . . .

Ashley Oaks Country View Mayfair VillageBay Pointe Country Villa Nobb HillBeau Jardin Crestview Pemberly Courts

Bluffs Del-Lynn PeppermillBrandon Place Fairington Pheasant RunBriarwood Fairway Knolls River WalkBrownstone Fowler Sagamore Ridge

Burberry Place Franklin Park Salem CourthouseCambridge Georgetown Village Square

Chauncey Village Greenacres Waterford Court

. . . . . .

Solution

01 Ashley Oaks 11 Country View 21 Mayfair Village02 Bay Pointe 12 Country Villa 22 Nobb Hill03 Beau Jardin 13 Crestview 23 Pemberly Courts04 Bluffs 14 Del-Lynn 24 Peppermill05 Brandon Place 15 Fairington 25 Pheasant Run06 Briarwood 16 Fairway Knolls 26 River Walk07 Brownstone 17 Fowler 27 Sagamore Ridge08 Burberry Place 18 Franklin Park 28 Salem Courthouse09 Cambridge 19 Georgetown 29 Village Square10 Chauncey Village 20 Greenacres 30 Waterford Court

. . . . . .

Solution (cont.)

With Table B, enter at line 122 and choose 13 = Crestview, 15 =Fairington, 05 = Brandon Place, and 29 = Village Square.

. . . . . .

Inference about the Population

The purpose of a sample is to give us information about a largerpopulation. The process of drawing conclusions about a populationon the basis of sample data is called inference because we inferinformation about the population from what we know about thesample.Inference from convenience samples or voluntary response sampleswould be misleading because these methods of choosing a sampleare biased. We are almost certain that the sample does not fairlyrepresent the population. The first reason to rely on randomsampling is to eliminate bias in selecting samples from the list ofavailable individuals.

. . . . . .

Sampling frame

The list of individuals from which a sample is actually selected iscalled the sampling frame. Ideally, the frame should list everyindividual in the population, but in practice this is often difficult. Aframe that leaves out part of the population is a common source ofundercoverage.Suppose that a sample of households in a community is selected atrandom from the telephone directory. What households areomitted from this frame? What types of people do you think arelikely to live in these households? These people will probably beunderrepresented in the sample.

. . . . . .

Solution

This design would omit households without telephones or withunlisted numbers. Such households would likely be made up ofpoor individuals (who cannot afford a phone), those who choosenot to have phones, and those who do not wish to have theirphone number published.

. . . . . .

Cautions about sample surveys

Random selection eliminates bias in the choice of a sample from alist of the population. When the population consists of humanbeings, however, accurate information from a sample requires morethan a good sampling design.To begin, we need an accurate and complete list of the population.Because such a list is rarely available, most samples suffer fromsome degree of undercoverage. A sample survey of households, forexample, will miss not only homeless people but prison inmatesand students in dormitories. An opinion poll conducted by callinglandline telephone numbers will miss households that have only cellphones as well as households without a phone. The results ofnational sample surveys therefore have some bias if the people notcovered differ from the rest of the population.A more serious source of bias in most sample surveys isnonresponse, which occurs when a selected individual cannot becontacted or refuses to cooperate.

. . . . . .

Nonresponse

Academic sample surveys, unlike commercial polls, often discussnonresponse. A survey of drivers began by randomly sampling alllisted residential telephone numbers in the United States.Of 45,956 calls to these numbers, 5029 were completed. What wasthe rate of nonresponse for this sample? (Only one call was madeto each number. Nonresponse would be lower if more calls weremade.)

. . . . . .

Solution

The response rate was 502945956 = 0.1094, so the nonresponse rate

was 1− 0.1094 = 0.8906

. . . . . .

Undercoverage and nonresponse

Undercoverage occurs when some groups in the population are leftout of the process of choosing the sample.Nonresponse occurs when an individual chosen for the samplecan’t be contacted or refuses to participate.

. . . . . .

CHAPTER 9

Producing Data: Experiments

. . . . . .

Observation vs Experiment

An observational study observes individuals and measures variablesof interest but does not attempt to influence the responses. Thepurpose of an observational study is to describe some group orsituation.An experiment, on the other hand, deliberately imposes sometreatment on individuals in order to observe their responses. Thepurpose of an experiment is to study whether the treatment causesa change in the response.

. . . . . .

Subjects, Factors, Treatments

The individuals studied in an experiment are often called subjects,particularly when they are people.The explanatory variables in an experiment are often called factors.A treatment is any specific experimental condition applied to thesubjects. If an experiment has more than one factor, a treatment isa combination of specific values of each factor.

. . . . . .

Sealing food packages

A manufacturer of food products uses package liners that aresealed at the top by applying heated jaws after the package isfilled. The customer peels the sealed pieces apart to open thepackage. What effect does the temperature of the jaws have onthe force needed to peel the liner? To answer this question,engineers obtain 20 pairs of pieces of package liner. They seal fivepairs at each of 250◦ F, 275◦ F, 300◦ F, and 335◦ F. Then theymeasure the force needed to peel each seal.a) What are the individuals?b) There is one factor (explanatory variable). What is it, and whatare its values?c) What is the response variable?

. . . . . .

Solutions

a) The individuals are pairs of pieces of package liner.b) The explanatory variable is temperature of jaws, with fourvalues: 250◦ F, 275◦ F, 300◦ F, and 335◦ F.c) The response variable is the force needed to peel each seal.

. . . . . .

An industrial experiment

A chemical engineer is designing the production process for a newproduct. The chemical reaction that produces the product mayhave higher or lower yield, depending on the temperature andstirring rate in the vessel in which the reaction takes place. Theengineer decides to investigate the effects of combinations of twotemperatures (50◦ C and 60◦ C) and three stirring rates (60 rpm,90 rpm, and 120 rpm) on the yield of the process. She will processtwo batches of the product at each combination of temperatureand stirring rate.a) What are the individuals and the response variable in thisexperiment?b) How many factors are there? How many treatments?

. . . . . .

Solutions

a) The individuals are batches of the product; the response variableis the yield of a batch.b) There are two factors (temperature and stirring rate) and sixtreatments (temperature-stirring rate combinations).

50◦ 60◦

60 rpm Treatment 1 Treatment 290 rpm Treatment 3 Treatment 4120 rpm Treatment 5 Treatment 6

. . . . . .

Randomized Comparative Experiment

An experiment that uses both comparison of two or moretreatments and random assignment of subjects to treatments is arandomized comparative experiment.

. . . . . .

Completely Randomized Design

In a completely randomized experimental design, all the subjectsare allocated at random among all the treatments.

. . . . . .

Adolescent obesity

Adolescent obesity is a serious health risk affecting more than 5million young people in the United States alone. Laparoscopicadjustable gastric banding has the potential to provide a safe andeffective treatment. Fifty adolescents between 14 and 18 years oldwith a body mass index higher than 35 were recruited from theMelbourne, Australia, community for the study. Twenty-five wererandomly selected to undergo gastric banding, and the remaining25 were assigned to a supervised lifestyle intervention programinvolving diet, exercise, and behavior modification. All subjectswere followed for two years and their weight loss was recorded.a) Outline the design of this experiment. What is the responsevariable?b) Carry out the random assignment of 25 adolescents to thegastric-banding group using software or Table B, starting at line130.

. . . . . .

Solutions

a) The diagram is provided. The response variable is weight loss.b) Label the students from 01 to 50, and pick 25 to receiveTreatment 1 (the rest will receive Treatment 2). If using Table B,line 130, the sample is 05, 16, 48, 17, 40, 20, 19, 45, 32, 41, 04,25, 29, 43, 37, 39, 31, 50, 18, 07, 13, 33, 30, 21, 36.

. . . . . .

Diagram

RandomAssignment

Group 125 Subjects

Group 225 Subjects

Treatment 1Gastric banding

Treatment 2Lifestyle intervention

Compareweight loss

. . . . . .

The Logic of Randomized Comparative Experiments

Randomized comparative experiments are designed to give goodevidence that differences in the treatments actually cause thedifferences we see in the response. The logic is as follows:

I Random assignment of subjects forms groups that should besimilar in all respects before the treatments are applied.

I Comparative design ensures that influences other than theexperimental treatments operate equally on all groups.

I Therefore, differences in average response must be due eitherto the treatments or to the play of chance in the randomassignment of subjects to the treatments.

. . . . . .

The Logic of Randomized Comparative Experiments

Randomized comparative experiments are designed to give goodevidence that differences in the treatments actually cause thedifferences we see in the response. The logic is as follows:

I Random assignment of subjects forms groups that should besimilar in all respects before the treatments are applied.

I Comparative design ensures that influences other than theexperimental treatments operate equally on all groups.

I Therefore, differences in average response must be due eitherto the treatments or to the play of chance in the randomassignment of subjects to the treatments.

. . . . . .

The Logic of Randomized Comparative Experiments

Randomized comparative experiments are designed to give goodevidence that differences in the treatments actually cause thedifferences we see in the response. The logic is as follows:

I Random assignment of subjects forms groups that should besimilar in all respects before the treatments are applied.

I Comparative design ensures that influences other than theexperimental treatments operate equally on all groups.

I Therefore, differences in average response must be due eitherto the treatments or to the play of chance in the randomassignment of subjects to the treatments.

. . . . . .

Principles of Experimental Design

The basic principles of statistical design of experiments are:

I Control the effects of lurking variables on the response, mostsimply by comparing two or more treatments.

I Randomize - use chance to assign subjects to treatments.

I Use enough subjects in each group to reduce chance variationin the results.

. . . . . .

Principles of Experimental Design

The basic principles of statistical design of experiments are:

I Control the effects of lurking variables on the response, mostsimply by comparing two or more treatments.

I Randomize - use chance to assign subjects to treatments.

I Use enough subjects in each group to reduce chance variationin the results.

. . . . . .

Principles of Experimental Design

The basic principles of statistical design of experiments are:

I Control the effects of lurking variables on the response, mostsimply by comparing two or more treatments.

I Randomize - use chance to assign subjects to treatments.

I Use enough subjects in each group to reduce chance variationin the results.

. . . . . .

Statistical Significance

An observed effect so large that it would rarely occur by chance iscalled statistically significant.

. . . . . .

The Placebo and Double-Blind Methods

A placebo is a dummy treatment. Experiments in medicine andpsychology often give a placebo to a control group because justbeing in an experiment can affect responses.In a double-blind experiment, neither the subjects nor the peoplewho interact with them know which treatment each subject isreceiving.

. . . . . .

Testosterone for older men

As men age, their testosterone levels gradually decrease. This maycause a reduction in lean body mass, an increase in fat, and otherundesirable changes. Do testosterone supplements reverse some ofthese effects? A study in the Netherlands assigned 237 men aged60 to 80 with low or low-normal testosterone levels to either atestosterone supplement or a placebo. The report in the Journal ofthe American Medical Association described the study as a”double-blind, randomized, placebo-controlled trial”. Explain eachof these terms to someone who knows no statistics.

. . . . . .

Solution

”Double-blind” means that the treatment (testosterone orplacebo) assigned to a subject was unknown to both the subjectand those responsible for assessing the effectiveness of thattreatment. ”Randomized” means that the patients were randomlyassigned to receive either the testosterone supplement or aplacebo. ”Placebo-controlled” means that some of the subjectswere given placebos. Even though these possess no medicalproperties, some subjects may show improvement or benefits justas a result of participating in the experiment; the placebos allowthose doing the study to observe this effect.

. . . . . .

Matched Pairs Design

A matched pairs design compares two treatments. Choose pairs ofsubjects that are as closely matched as possible. Use chance todecide which subject in a pair gets the first treatment. The othersubject in that pair gets the other treatment.Sometimes each ”pair” in a matched pairs design consists of justone subject, who gets both treatments one after the other. Usechance to decide the order in which subjects receive thetreatments.

. . . . . .

Comparing breathing frequencies in swimming

Researchers from the United Kingdom studied the effect of twobreathing frequencies on performance times and on severalphysiological parameters in front crawl swimming. The breathingfrequencies were one breath every second stroke (B2) and onebreath every fourth stroke (B4). Subjects were 10 male collegiateswimmers. Each subject swam 200 meters, once with breathingfrequency B2 and once on a different day with breathing frequencyB4.Describe the design of this matched pairs experiment, including therandomization required by this design.

. . . . . .

Solution

Each swimmer swims one time using each breathing technique (B2and B4). A coin is tossed to determine the order in which thesetechniques are used.

. . . . . .

Alcohol and heart attacks

Many studies have found that people who drink alcohol inmoderation have lower risk of heart attacks than either nondrinkersor heavy drinkers. Does alcohol consumption also improve survivalafter a heart attack? One study followed 1913 people who werehospitalized after severe heart attacks. In the year before theirheart attacks, 47% of these people did not drink, 36% drankmoderately, and 17% drank heavily. After four years, fewer of themoderate drinkers had died.a) Is this an observational study or an experiment? Why? Whatare the explanatory and response variables?b) Suggest some lurking variables that may be confounded with thedrinking habits of the subjects. The possible confounding makes itdifficult to conclude that drinking habits explain death rates.

. . . . . .

Solution

a) This is an observational study; the subjects chose their own”treatments” (how much to drink). The explanatory variable isalcohol consumption, and the response variable is whether or not asubject dies.b) Many answers are possible. For example, some nondrinkersmight avoid drinking because of other health concerns. We do notknow what kind of alcohol (beer? wine? whiskey?) the subjectswere drinking.

. . . . . .

Confounding

Two variables (explanatory variables or lurking variables) areconfounded when their effects on a response variable cannot bedistinguished from each other.