
MAT 141 – Statistics Sections 4.1-4.2 (Sullivan 5e)

[email protected] kradermath.jimdo.com 05/2018

Learning Outcomes

After we cover Sections 4.1 and 4.2, you should be able to:

1. Describe what is meant by univariate and bivariate data.
2. Describe the objective of correlation and regression.
3. Construct/interpret a scatter diagram for a pair of variables.
   a. Identify the independent (explanatory) variable (on the x-axis) and the dependent (response) variable (on the y-axis).
4. Use your calculator to compute the sample linear correlation coefficient (r) for a pair of variables.
5. Use the scatter diagram and the linear correlation coefficient to determine whether there is a linear relation, nonlinear relation or no relation between the two variables.
   a. Describe the difference between “no relation” and “no linear relation.”
6. If there is a linear relation, determine:
   a. The strength of the linear relation (e.g., strong, moderate, weak).
   b. The direction of the linear relation (i.e., positive or negative).
7. Use Table II to determine whether the linear correlation coefficient is statistically significant (or whether the linear correlation between the two variables in the sample is due to random chance).
8. Describe the difference between correlation and causation.
   a. Remember: Correlation does not necessarily mean that there is a cause-and-effect relationship between the two variables.
9. Describe what is meant by the Least-Squares regression line.
10. Use your calculator to find the equation of the Least-Squares regression line.
   a. Only perform a regression if |r| is sufficiently close to 1 and r is statistically significant (based on Table II).
11. Interpret the slope and the y-intercept (where appropriate) of the Least-Squares line.
12. Use the equation of the Least-Squares line to predict the value of the dependent variable for a given value of the independent variable.
13. Describe some problems/caveats using the Least-Squares regression line to estimate or predict the dependent variable.

Univariate vs. Bivariate Data

Throughout this course, we have been studying univariate data:

Univariate data involves measuring a single variable for each individual in a sample or population.

Now we will study bivariate data:

Bivariate data involves measuring two variables for each individual in a sample or population.

We often want to see if there is a relationship between the two variables (“correlation”).

If there is a relationship between the two variables, we want to describe the relationship mathematically, so we can use one variable to estimate or predict the other (“regression”).



Scatter Diagrams (or Scatterplots)

EXAMPLE: Golf (p. 173, Example 1)

A golf pro wants to investigate the relationship between club-head speed (measured in mph) and the distance the golf ball travels (measured in yards). He realizes that other variables may impact distance (e.g., club type, ball type, golfer, weather, wind). To eliminate these possible lurking variables, the pro used a single golfer, the same club and the same type of ball, and conducted the experiment on a 70-degree day with no wind. A sample of 8 swings was measured.

What are the individuals?

What are the variables?



EXAMPLE: Golf (cont’d)

We can use a scatter diagram (or scatterplot) to graph the data.

Typically, the variable on the x-axis is used to estimate or predict the variable on the y-axis.

In the diagram above, it appears we want to use club-head speed to estimate or predict the distance.

If we wanted to use distance traveled to predict the club-head speed, we would switch the x and y-axes (i.e., distance would be the independent variable and speed would be the dependent variable).

NOTE:

The axes on a scatter diagram will often be truncated, to zoom in on the data.

Do not connect the points in a scatter diagram.

Scatter diagrams may be drawn manually (with graph paper) or with technology.

The “Technology Step-by-Step” box on pages 181-182 of the textbook explains how to draw scatter diagrams using the TI 83/84 graphing calculator, with Microsoft Excel, and with other technologies.

Figure (Slide 8): Golf Drive Distance vs. Club-Head Speed. Is there a relation between the speed of a golf swing and the distance the ball travels? (x-axis: Club-Head Speed (mph), 98 to 106; y-axis: Distance (yards), 250 to 280.)

Club-Head Speed (mph)    Distance (yards)
100                      257
102                      264
103                      274
101                      266
105                      277
100                      263
99                       258
105                      275

Independent (or explanatory) variable is plotted on the x-axis.

Dependent (or response) variable is plotted on the y-axis.

Each dot represents an individual from the sample (or population).

MAT 141 (Sullivan 3e) - 4.1-4.3 GHK 04/2012



Scatter Diagrams (cont’d)

Scatter diagrams are useful in determining whether there is a relation between two variables and, if so, what type of relation.

Linear relation: Dots form a pattern similar to a straight line.

Positive linear relation:
o The line has a positive slope, i.e.,
o As one variable increases, the other variable tends to increase.

Negative linear relation:
o The line has a negative slope, i.e.,
o As one variable increases, the other variable tends to decrease.

Why do we want to determine whether a linear relation exists?

If a straight line can be used to approximate the scatter diagram, then the equation of the line can be used to estimate or predict the y-variable for given values of the x-variable.

EXAMPLE: Golf (cont’d)

Use the scatter diagram to describe the relation (if one exists) between club-head speed and distance traveled.

Figure (Slide 9): Example scatter diagrams showing a linear relation (positive linear and negative linear), a nonlinear relation, and no relation. © Pearson Education




EXAMPLE: Cyclones

For a random sample of tropical cyclones, the table shows the lowest barometric pressure (in millibars) as the cyclone approaches, and the maximum wind speed (in mph) of the cyclone. Is there a relation between the barometric pressure and the cyclone speed?

What is the explanatory (or predictor) variable?

What is the response variable?

Use the scatter diagram to describe the relation (if one exists) between the lowest barometric pressure and the maximum wind speed.

Figure (Slide 13): Maximum wind speed vs. lowest barometric pressure. Is there a relation between barometric pressure and cyclone speed? (x-axis: Lowest barometric pressure (millibars), 920 to 1010; y-axis: Maximum wind speed (mph), 0 to 160.)

Lowest barometric pressure (millibars)    Maximum wind speed (mph)
1004                                      40
975                                       100
992                                       65
935                                       125
985                                       80
932                                       150




EXAMPLE: Home Prices

The sale price and the size (measured in square feet) of several homes in a single subdivision were recorded over a 6-month period.

What is the explanatory (or predictor) variable?

What is the response variable?

Use the scatter diagram to describe the relation (if one exists) between the number of square feet and the sale price of the home.

Figure (Slide 16): Sale Price vs. Area (Sq. Ft.). Is there a relation between the square feet and sale price of a single family home? (x-axis: Square Feet, 1000 to 3000; y-axis: Sale Price (x$1000), 200.0 to 500.0.)

Square Feet    Sale Price (x$1000)
2276           358.5
1739           360.0
2433           386.5
2196           394.0
1837           395.0
2619           414.0
2628           420.0
2696           474.9
2770           467.0
2468           470.0
2770           481.0
2497           482.5




Determining How Well a Straight Line Approximates the Scatter Diagram

How well does a straight line approximate each scatter diagram?

A straight line is a better estimate or predictor of the response variable when the scatter diagram more closely resembles a straight line.

Golf

Home Prices

Note that the scale used for the y-axis may influence your answer.

A numerical measure called the (Pearson) linear correlation coefficient will help us determine how well a straight line approximates the scatter diagram.

Figure (Slides 19-20): The Home Prices scatter diagram (Sale Price (x$1000) vs. Square Feet) and the Golf scatter diagram (Distance (yards) vs. Club-Head Speed) shown side by side. The Home Prices diagram is then redrawn with the y-axis running from 0.0 to 800.0 instead of 200.0 to 500.0, to show that the scale used for the y-axis may influence your answer.



Linear Correlation Coefficient

The (Pearson) linear correlation coefficient is a number between -1 and 1, inclusive, that describes how closely the scatter diagram resembles a straight line.

r = sample linear correlation coefficient
ρ = population linear correlation coefficient (Greek letter “rho”)

The sign of the linear correlation coefficient describes the direction of the linear relation.

Positive linear relation: r > 0.
Negative linear relation: r < 0.

The absolute value of the linear correlation coefficient measures the strength of the linear relation.

|r| = 1 means the dots form a perfectly straight line.
The closer |r| is to 1, the more the scatter diagram resembles a straight line.

CAUTION:

r describes the strength of the linear relationship, not the slope of the line.
|r| close to zero means there is no linear relation. However, there may be a nonlinear relation!
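The caution about |r| near zero can be checked numerically. The sketch below is only an illustration: the x-values 3 through 13 and the parabola y = (x - 8)^2 are assumed data chosen so that every point lies exactly on a curve, and the helper name pearson_r is hypothetical (it implements the standard formula for the sample linear correlation coefficient, which the calculator computes for you).

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample linear correlation coefficient r for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Assumed illustrative data: every point lies exactly on a parabola.
xs = list(range(3, 14))            # 3, 4, ..., 13 (symmetric about x = 8)
ys = [(x - 8) ** 2 for x in xs]    # a perfect nonlinear relation

r = pearson_r(xs, ys)
print(f"r = {r:.4f}")  # essentially zero, even though y is determined by x
```

A value of r near zero here says only that a straight line is a poor description of the data, not that the variables are unrelated.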

Figure (Slide 20): Scale of the linear correlation coefficient r, running from 0 to 1.0 and from 0 to -1.0 in steps of 0.1, with regions labeled Weak, Moderate, Strong and Perfect. © 2002 Addison-Wesley




Linear Correlation Coefficient (cont’d)

EXAMPLE: Nonlinear Relation

In both scatter diagrams below, r=0.1, although in the first example, there is clearly a nonlinear relation between the two variables.

EXAMPLE: Nonlinear Relation

The linear correlation coefficient of this scatter diagram is close to zero ($r = 2.74 \times 10^{-17}$), yet the diagram can be described precisely by a parabola with the nonlinear equation $y = (x - 8)^2$.

EXAMPLE: Calculate the Linear Correlation Coefficient

Use your calculator to find r for each of the following examples.

Golf: r=

Cyclones: r=

Home Prices: r=

In which example do the two variables have the strongest linear relation?
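As a cross-check on the calculator, the same r values can be computed in a few lines of Python. This is a minimal sketch, not the textbook's procedure: the function name pearson_r and the variable names are assumptions, and the data are the Golf, Cyclones and Home Prices samples from the tables in this handout.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample linear correlation coefficient r for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Golf: club-head speed (mph) vs. distance (yards)
golf_speed = [100, 102, 103, 101, 105, 100, 99, 105]
golf_distance = [257, 264, 274, 266, 277, 263, 258, 275]

# Cyclones: lowest barometric pressure (millibars) vs. maximum wind speed (mph)
cyclone_pressure = [1004, 975, 992, 935, 985, 932]
cyclone_wind = [40, 100, 65, 125, 80, 150]

# Home Prices: square feet vs. sale price (x$1000)
home_sqft = [2276, 1739, 2433, 2196, 1837, 2619,
             2628, 2696, 2770, 2468, 2770, 2497]
home_price = [358.5, 360.0, 386.5, 394.0, 395.0, 414.0,
              420.0, 474.9, 467.0, 470.0, 481.0, 482.5]

r_golf = pearson_r(golf_speed, golf_distance)
r_cyclones = pearson_r(cyclone_pressure, cyclone_wind)
r_homes = pearson_r(home_sqft, home_price)

for name, r in [("Golf", r_golf), ("Cyclones", r_cyclones), ("Home Prices", r_homes)]:
    print(f"{name}: r = {r:.3f}")
```

To answer the last question, compare the absolute values |r|, since a strong negative relation is just as strong as a positive one of the same magnitude.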

Figure: Scatter diagram with x-axis from 0 to 14 and y-axis from 0 to 30.



Significance Test of Sample Linear Correlation Coefficient

If there is a strong enough linear relation between the two variables, we can use a straight line to approximate the scatter diagram and we can use the equation of the straight line to estimate or predict the dependent (response) variable given a value of the independent (explanatory) variable.

If |r| is close to 1, then the individuals in the sample form a linear pattern. This may mean that the individuals in the population also form a linear pattern (see figure below, left). It is also possible that the individuals in the population do not form a linear pattern and that we were just “lucky” to have selected a sample that forms a linear pattern (see figure below, right).

Positive linear correlation of the population No linear correlation of the population

We need to determine if the linear relationship seen in the sample also applies to the population. In other words, we need to determine whether the difference between r and 0 is statistically significant. This will answer the following questions:

Is r close to 1 because there really is a linear relationship between the two variables (i.e., ρ is close to 1), or

Is r close to 1 due to random chance (i.e., if we plotted the entire population there would be no linear correlation)?

We use Table II to determine whether r is statistically significant:

If |r| is greater than the value in the table, then the sample correlation coefficient is far enough from 0 to be statistically significant (using α=0.05).

Even if |r| is statistically significant, it may not be close enough to 1 to indicate strong or moderate linear correlation.
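The table lookup can be sketched as a simple comparison. The critical values in the dictionary below are an assumption: they are the standard two-tailed α=0.05 critical values for the correlation coefficient with n - 2 degrees of freedom, which are expected to agree with Table II, so verify them against the textbook before relying on them. The names CRITICAL_R, pearson_r and is_significant are hypothetical.

```python
from math import sqrt

# ASSUMPTION: standard critical values for |r| at alpha = 0.05, keyed by
# sample size n. Check these against Table II in the textbook.
CRITICAL_R = {5: 0.878, 6: 0.811, 7: 0.754, 8: 0.707, 9: 0.666,
              10: 0.632, 11: 0.602, 12: 0.576}

def pearson_r(xs, ys):
    """Sample linear correlation coefficient r for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

def is_significant(r, n):
    """True if |r| exceeds the critical value for a sample of size n."""
    return abs(r) > CRITICAL_R[n]

# Home Prices sample from this handout: square feet vs. sale price (x$1000)
home_sqft = [2276, 1739, 2433, 2196, 1837, 2619,
             2628, 2696, 2770, 2468, 2770, 2497]
home_price = [358.5, 360.0, 386.5, 394.0, 395.0, 414.0,
              420.0, 474.9, 467.0, 470.0, 481.0, 482.5]

r_homes = pearson_r(home_sqft, home_price)
n_homes = len(home_sqft)
print(f"n = {n_homes}, |r| = {abs(r_homes):.3f}, "
      f"critical value = {CRITICAL_R[n_homes]}, "
      f"significant: {is_significant(r_homes, n_homes)}")
```

Note how the critical value shrinks as n grows: a small sample needs |r| much closer to 1 before the result counts as statistically significant.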

Figure (Slides 29-30): Sample data points drawn from a population with positive linear correlation (left) and from a population with no linear correlation (right), both plotted as Sale Price (x$1000) vs. Square Feet.




Significance Test of Sample Linear Correlation Coefficient (cont’d)

EXAMPLE: Home Prices (cont’d)

Use Table II to determine whether the value of r calculated on page 9 is statistically significant. Interpret the results.

NOTE: When the sample size n is small, the critical values in Table II are relatively large. Thus, for small samples, |r| must be close to 1 in order for its value to be statistically significant. This is because for small samples, it would not be unusual to choose a few data points that seem to describe a linear relation even though the population does not.



Practical Tips for Working with Linear Correlation Coefficients

1. |r|=0 means there is no linear correlation. There still may be a strong nonlinear correlation.

2. Correlation does not imply causation. In other words, a value of |r| close to 1 indicates a strong linear relation, but not necessarily a cause-and-effect relationship, between the two variables.

   Changes in the explanatory variable (x) may cause changes in the response variable (y).
   o Increasing the speed of a golf swing may cause the distance traveled by the ball to increase.

   Changes in the response variable (y) may cause changes in the explanatory variable (x).
   o If there is a positive relation between caffeine use and nervousness, can we conclude that increased caffeine use causes increased nervousness, or do nervous people tend to drink more caffeine, or neither?

   Changes in both variables are caused by a third “lurking” variable.
   o There is a positive relation between a child’s shoe size and the size of his/her vocabulary, because both are influenced by the child’s age.

   Relation may be coincidental.

3. Beware of outliers. Even one outlier can have a major impact on r.

Figure (Slide 34): Cars sold vs. TV spots (x-axis: No. of TV spots, 0 to 30; y-axis: No. of cars sold, 0 to 40). Beware of outliers: with the outlier, r = 0.67; without the outlier, r = 0.96.




Practical Tips for Working with Linear Correlation Coefficients (cont’d)

4. The linear correlation coefficient doesn’t tell the whole story. Using both the scatter diagram and r provides a more complete picture of the relation between the two variables.

Figure (Slide 39): Four scatter diagrams, each with r=0.7, in which the nature of the relationship varies considerably.







Least-Squares Regression

If |r| is close to 1 and is statistically significant, then the scatter diagram can be approximated by a straight line, and we can use the equation of the line to estimate or predict the value of the response variable given the value of the explanatory variable.

We want to use the line which “best fits” the scatter diagram, i.e., the line which is “closest” to all of the points in the diagram.

NOTE: The Least-Squares line will always pass through the point $(\bar{x}, \bar{y})$.

Figure (Slides 43-44): Linear regression, i.e., determining a straight line that is the “best fit” for the scatter diagram (axes x and y).

The Least-Squares line is the “best fit” for the scatter diagram. The “best fit” line is the one that minimizes the sum of the squares of the residuals, i.e., the vertical distances:

$d_1^2 + d_2^2 + d_3^2 + \dots + d_9^2$
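The “minimizes the sum of the squares” idea can be checked directly on the Golf sample. In this sketch the function names are hypothetical, the slope is computed with the standard least-squares formula (sum of cross-deviations divided by sum of squared x-deviations), and the rival line is an arbitrary line made up for comparison only.

```python
def sum_squared_residuals(xs, ys, slope, intercept):
    """Sum of squared vertical distances between the points and a line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

def least_squares(xs, ys):
    """Slope and intercept of the Least-Squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar

# Golf sample from this handout
speed = [100, 102, 103, 101, 105, 100, 99, 105]
distance = [257, 264, 274, 266, 277, 263, 258, 275]

a, b = least_squares(speed, distance)
sse_best = sum_squared_residuals(speed, distance, a, b)

# A made-up rival line, for comparison only
sse_rival = sum_squared_residuals(speed, distance, 3.0, -39.0)

print(f"Least-Squares line: sum of squared residuals = {sse_best:.2f}")
print(f"Rival line:         sum of squared residuals = {sse_rival:.2f}")

# The Least-Squares line passes through the point of means
x_bar = sum(speed) / len(speed)
y_bar = sum(distance) / len(distance)
print(f"Line at x-bar: {a * x_bar + b:.3f}, y-bar: {y_bar:.3f}")
```

Any other choice of slope and intercept gives a sum of squared residuals at least as large as the Least-Squares line's.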




Least-Squares Regression (cont’d)

In algebra, we learned that equations of non-vertical straight lines may be written in “slope-intercept form,” i.e., $y = mx + b$, where m is the slope of the line and b is the y-coordinate of the y-intercept. In statistics, we use the same concept but different notation.

Slope-intercept equation used in algebra: $y = mx + b$

Slope-intercept equation used by statisticians: $\hat{y} = ax + b$

Slope-intercept equation used in our textbook: $\hat{y} = b_1 x + b_0$

Calculating the slope (a) and the y-intercept (b):

$a = r \cdot \frac{s_y}{s_x}$, $b = \bar{y} - a\bar{x}$

The calculations are cumbersome, so we will use a calculator command:

STAT > CALC > 4:LinReg(ax+b) L1, L2

L1 contains the values of the x-variable. L2 contains the values of the y-variable.
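For readers without the calculator at hand, the slope and intercept formulas (slope = r times the ratio of the sample standard deviations; intercept = mean of y minus slope times mean of x) can be applied step by step in Python. This is a sketch using the Golf sample; the variable names are assumptions, and `statistics.stdev` is the sample standard deviation.

```python
from statistics import mean, stdev

# Golf sample from this handout (these play the roles of L1 and L2)
speed = [100, 102, 103, 101, 105, 100, 99, 105]          # x-variable
distance = [257, 264, 274, 266, 277, 263, 258, 275]      # y-variable

n = len(speed)
x_bar, y_bar = mean(speed), mean(distance)
s_x, s_y = stdev(speed), stdev(distance)   # sample standard deviations

# Sample linear correlation coefficient r
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(speed, distance))
r = s_xy / ((n - 1) * s_x * s_y)

# Slope and y-intercept of the Least-Squares line
a = r * s_y / s_x
b = y_bar - a * x_bar

print(f"r = {r:.4f}")
print(f"slope a = {a:.4f}, y-intercept b = {b:.4f}")
```

The printed slope and intercept should agree with what LinReg(ax+b) reports for the same lists, up to rounding.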

EXAMPLE: Golf (cont’d)

Use the calculator to find the equation of the Least-Squares Line.

Use the Least-Squares equation to predict the distance traveled for the two speeds shown below. Compute the residual, i.e., the difference between the actual value and the predicted value.

x, club-head speed (mph)    ŷ, predicted distance traveled (yds.)    y, actual distance traveled (yds.)    y − ŷ, residual (yds.)
100
103
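The prediction and residual columns can be filled in programmatically as a check on hand work. This sketch recomputes the Least-Squares line from the Golf sample and uses a hypothetical helper named predict. Note that a club-head speed of 100 mph appears twice in the sample (257 and 263 yards), so it has two residuals.

```python
# Golf sample from this handout
speed = [100, 102, 103, 101, 105, 100, 99, 105]
distance = [257, 264, 274, 266, 277, 263, 258, 275]

n = len(speed)
x_bar = sum(speed) / n
y_bar = sum(distance) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(speed, distance))
s_xx = sum((x - x_bar) ** 2 for x in speed)

a = s_xy / s_xx          # slope of the Least-Squares line
b = y_bar - a * x_bar    # y-intercept of the Least-Squares line

def predict(x):
    """Predicted distance (yards) for a club-head speed x (mph)."""
    return a * x + b

# Residual = actual value minus predicted value
for x, y in zip(speed, distance):
    if x in (100, 103):
        y_hat = predict(x)
        print(f"x = {x}: predicted = {y_hat:.1f}, actual = {y}, "
              f"residual = {y - y_hat:+.1f}")
```

A positive residual means the observed point lies above the line; a negative residual means it lies below.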



Interpreting the Least-Squares Line

Interpret the (slope and y-intercept of the) Least-Squares line in each of the examples:

Golf

Cyclones

Home Prices

Figure (Slides 49-51): The Least-Squares line drawn on the Golf, Cyclones and Home Prices scatter diagrams, with a residual marked on the Golf diagram. A regression line describes the linear relation between the two variables.

NOTE: When the x-axis is truncated, the point where the line meets the left edge of the graph is NOT the y-intercept.



Practical Tips for Doing Linear Regression

5. Beware of outliers. Even one outlier can impact the Least-Squares line.

6. Beware of using “out-of-range” values of x.
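One way to guard against out-of-range predictions is to compare x with the smallest and largest x-values in the sample before trusting the result. The sketch below uses the Golf sample; the helper names predict and in_observed_range and the test speeds are hypothetical.

```python
# Golf sample from this handout
speed = [100, 102, 103, 101, 105, 100, 99, 105]
distance = [257, 264, 274, 266, 277, 263, 258, 275]

n = len(speed)
x_bar = sum(speed) / n
y_bar = sum(distance) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(speed, distance))
s_xx = sum((x - x_bar) ** 2 for x in speed)
a = s_xy / s_xx
b = y_bar - a * x_bar

def predict(x):
    """Predicted distance (yards) for a club-head speed x (mph)."""
    return a * x + b

def in_observed_range(x):
    """True if x lies between the smallest and largest sampled speeds."""
    return min(speed) <= x <= max(speed)

for x in (102, 0, 130):
    note = "" if in_observed_range(x) else "  <-- outside the observed range"
    print(f"x = {x}: predicted distance = {predict(x):.1f} yards{note}")
```

At x = 0 the line predicts a negative distance, which is why the y-intercept has no sensible interpretation here: a speed of 0 mph is far outside the observed speeds of 99 to 105 mph.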

Figure (Slide 54): Cars sold vs. TV spots (x-axis: No. of TV spots, 0 to 30; y-axis: No. of cars sold, 0 to 40). Beware of outliers: even one outlier can impact the Least-Squares line.

Including outlier: r = 0.67, $\hat{y} = 0.6855x + 9.1956$
Excluding outlier: r = 0.96, $\hat{y} = 0.9311x + 7.3985$

Figure: Rate of Accidental Deaths in the United States. Caveat: Beware of predicting the future. © 2001 Addison-Wesley. Source: U.S. Department of Health and Human Services, National Center for Health Statistics.




Summary: Linear Correlation and Regression

The linear correlation coefficient r describes the strength and direction of the linear relationship between two variables.

$-1 \le r \le 1$

The correlation coefficient does not establish causation.

If r passes the significance test and |r| is close enough to 1, find the equation of the Least-Squares regression line to estimate or predict values of y given values of x.

$\hat{y} = ax + b$

The stronger the linear correlation (i.e., the closer |r| is to 1), the more accurate the estimate or prediction.

Using a scatter diagram together with numerical measures (r, a and b) provides a more complete picture of the relation between the variables.
