MAT 141 – Statistics Page 1 Sections 4.1-4.2 (Sullivan 5e)
[email protected] kradermath.jimdo.com 05/2018
Learning Outcomes
After we cover Sections 4.1 and 4.2, you should be able to:
1. Describe what is meant by univariate and bivariate data.
2. Describe the objective of correlation and regression.
3. Construct/interpret a scatter diagram for a pair of variables.
   a. Identify the independent (explanatory) variable (on the x-axis) and the dependent (response) variable (on the y-axis).
4. Use your calculator to compute the sample linear correlation coefficient (r) for a pair of variables.
5. Use the scatter diagram and the linear correlation coefficient to determine whether there is a linear relation, nonlinear relation or no relation between the two variables.
   a. Describe the difference between “no relation” and “no linear relation.”
6. If there is a linear relation, determine:
   a. The strength of the linear relation (e.g., strong, moderate, weak).
   b. The direction of the linear relation (i.e., positive or negative).
7. Use Table II to determine whether the linear correlation coefficient is statistically significant (or whether the linear correlation between the two variables in the sample is due to random chance).
8. Describe the difference between correlation and causation.
   a. Remember: Correlation does not necessarily mean that there is a cause-and-effect relationship between the two variables.
9. Describe what is meant by the Least-Squares regression line.
10. Use your calculator to find the equation of the Least-Squares regression line.
   a. Only perform a regression if |r| is sufficiently close to 1 and statistically significant (based on Table II).
11. Interpret the slope and the y-intercept (where appropriate) of the Least-Squares line.
12. Use the equation of the Least-Squares line to predict the value of the dependent variable for a given value of the independent variable.
13. Describe some problems/caveats using the Least-Squares regression line to estimate or predict the dependent variable.

Univariate vs. Bivariate Data
Throughout this course, we have been studying univariate data:
Univariate data involves measuring a single variable for each individual in a sample or population.
Now we will study bivariate data:
Bivariate data involves measuring two variables for each individual in a sample or population.
We often want to see if there is a relationship between the two variables (“correlation”).
If there is a relationship between the two variables, we want to describe the relationship mathematically, so we can use one variable to estimate or predict the other (“regression”).
Scatter Diagrams (or Scatterplots)
EXAMPLE: Golf (P. 173, Example 1)
A golf pro wants to investigate the relationship between club-head speed (measured in mph) and the distance the golf ball travels (measured in yards). He realizes that other variables may impact distance (e.g., club type, ball type, golfer, weather, wind). To eliminate these possible lurking variables, the pro used a single golfer, the same club and same type of ball, and the experiment was conducted on a 70-degree day with no wind. A sample of 8 swings was measured.
What are the individuals?
What are the variables?
EXAMPLE: Golf (cont’d)
We can use a scatter diagram (or scatterplot) to graph the data.
Typically, the variable on the x-axis is used to estimate or predict the variable on the y-axis.
In the golf scatter diagram, it appears we want to use club-head speed to estimate or predict the distance.
If we wanted to use distance traveled to predict the club-head speed, we would switch the x- and y-axes (i.e., distance would be the independent variable and speed would be the dependent variable).
NOTE:
The axes on a scatter diagram will often be truncated, to zoom in on the data.
Do not connect the points in a scatter diagram.
Scatter diagrams may be drawn manually (with graph paper) or with technology.
The “Technology Step-by-Step” box on pages 181-182 of the textbook explains how to draw scatter diagrams using the TI 83/84 graphing calculator, with Microsoft Excel, and with other technologies.
[Figure, Slide 8: Golf Drive Distance vs. Club-Head Speed. Is there a relation between the speed of a golf swing and the distance the ball travels? x-axis: Club-Head Speed (mph), 98 to 106; y-axis: Distance (yards), 250 to 280.]

Club-Head Speed (mph)    Distance (yards)
100                      257
102                      264
103                      274
101                      266
105                      277
100                      263
99                       258
105                      275
Independent (or explanatory) variable is plotted on the x-axis
Dependent (or response) variable is plotted on the y-axis
Each dot represents an individual from the sample (or population)
MAT 141 (Sullivan 3e) - 4.1-4.3 GHK 04/2012
Scatter Diagrams (cont’d)
Scatter diagrams are useful in determining whether there is a relation between two variables and, if so, what type of relation.
Linear relation: Dots form a pattern similar to a straight line.
Positive linear relation:
   o The line has a positive slope, i.e.,
   o As one variable increases, the other variable tends to increase.
Negative linear relation:
   o The line has a negative slope, i.e.,
   o As one variable increases, the other variable tends to decrease.
Why do we want to determine whether a linear relation exists?
If a straight line can be used to approximate the scatter diagram, then the equation of the line can be used to estimate or predict the y-variable for given values of the x-variable.

EXAMPLE: Golf (cont’d)
Use the scatter diagram to describe the relation (if one exists) between club-head speed and distance traveled.
[Figure, Slide 9: example scatter diagrams showing a linear relation (positive linear and negative linear), a nonlinear relation, and no relation. © Pearson Education]
EXAMPLE: Cyclones
For a random sample of tropical cyclones, the table shows the lowest barometric pressure (in millibars) as the cyclone approaches, and the maximum wind speed (in mph) of the cyclone. Is there a relation between the barometric pressure and the wind speed of the cyclone?
What is the explanatory (or predictor) variable?
What is the response variable?
Use the scatter diagram to describe the relation (if one exists) between the lowest barometric pressure and the maximum wind speed.
[Figure, Slide 13: Maximum wind speed vs. lowest barometric pressure. Is there a relation between barometric pressure and cyclone speed? x-axis: Lowest barometric pressure (millibars), 920 to 1010; y-axis: Maximum wind speed (mph), 0 to 160.]

Lowest barometric pressure (millibars)    Maximum wind speed (mph)
1004                                      40
975                                       100
992                                       65
935                                       125
985                                       80
932                                       150
EXAMPLE: Home Prices
The sale price and the size (measured in square feet) of several homes in a single subdivision were recorded over a 6-month period.
What is the explanatory (or predictor) variable?
What is the response variable?
Use the scatter diagram to describe the relation (if one exists) between the number of square feet and the sale price of the home.
[Figure, Slide 16: Sale Price vs. Area (Sq. Ft.). Is there a relation between the square feet and sale price of a single family home? x-axis: Square Feet, 1000 to 3000; y-axis: Sale Price (x$1000), 200.0 to 500.0.]

Sq ft    Sale Price (x$1000)
2276     358.5
1739     360.0
2433     386.5
2196     394.0
1837     395.0
2619     414.0
2628     420.0
2696     474.9
2770     467.0
2468     470.0
2770     481.0
2497     482.5
Determining How Well a Straight Line Approximates the Scatter Diagram
How well does a straight line approximate each scatter diagram?
A straight line is a better estimate or predictor of the response variable when the scatter diagram more closely resembles a straight line.
Golf
Home Prices
Note that the scale used for the y-axis may influence your answer.
A numerical measure called the (Pearson) linear correlation coefficient will help us determine how well a straight line approximates the scatter diagram.
[Figure, Slide 19: How well does a straight line approximate each scatter diagram? Panels: Sale Price vs. Area (Sq. Ft.) and Golf Drive Distance vs. Club-Head Speed.]
[Figure, Slide 20: Sale Price vs. Area (Sq. Ft.) drawn twice over 1000 to 3000 square feet, once with the y-axis running from 200.0 to 500.0 (x$1000) and once from 0.0 to 800.0 (x$1000). The scale used for the y-axis may influence your answer.]
Linear Correlation Coefficient
The (Pearson) linear correlation coefficient is a number between -1 and 1, inclusive, that describes how closely the scatter diagram resembles a straight line.
r = sample linear correlation coefficient
ρ = population linear correlation coefficient (Greek letter “rho”)

The sign of the linear correlation coefficient describes the direction of the linear relation.
Positive linear relation: r > 0.
Negative linear relation: r < 0.

The absolute value of the linear correlation coefficient measures the strength of the linear relation.
|r| = 1 means the dots form a perfectly straight line.
The closer |r| is to 1, the more the scatter diagram resembles a straight line.

CAUTION:
r describes the strength of the linear relationship, not the slope of the line.
|r| close to zero means there is no linear relation. However, there may be a nonlinear relation!
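The notes leave the computation of r to the calculator. As a rough sketch of what the calculator is doing, the block below uses the usual textbook definition (the sum of the products of the z-scores, divided by n - 1) on the golf data. The function names are illustrative and are not part of the course materials. It also recomputes r with distance converted to feet, to illustrate the caution that r measures strength, not slope.

```python
import math


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def sample_std(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(xs, ys):
    """Sample linear correlation coefficient r.

    Sum of the products of the x and y z-scores, divided by n - 1.
    """
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = sample_std(xs), sample_std(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)


# Golf data from the notes
speed_mph = [100, 102, 103, 101, 105, 100, 99, 105]
distance_yd = [257, 264, 274, 266, 277, 263, 258, 275]

r_yards = correlation(speed_mph, distance_yd)

# Converting yards to feet would make a fitted line three times as steep,
# but it leaves r unchanged.
distance_ft = [3 * d for d in distance_yd]
r_feet = correlation(speed_mph, distance_ft)

print(round(r_yards, 3), round(r_feet, 3))
```

The two printed values agree, which is one way to see that r says nothing about how steep the line is.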
[Figure, Slide 20: Linear correlation coefficient (r). Two scales, one running from 1.0 down to 0 and one from -1.0 up to 0 in steps of 0.1, labeled Perfect, Strong, Moderate and Weak as |r| decreases from 1 toward 0. © 2002 Addison-Wesley]
Linear Correlation Coefficient (cont’d)
EXAMPLE: Nonlinear Relation
In both scatter diagrams below, r=0.1, although in the first example, there is clearly a nonlinear relation between the two variables.

EXAMPLE: Nonlinear Relation
The linear correlation coefficient of this scatter diagram is close to zero (r = 2.74 × 10^(-17)), yet the diagram can be described precisely by a parabola with the nonlinear equation y = (x - 8)^2.
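A small check of the parabola example, as a sketch under two assumptions: that the parabola is y = (x - 8)^2, and that the x-values are spread symmetrically about the vertex at x = 8. The particular x-values below (3 through 13) are hypothetical; the notes do not list the points in the figure.

```python
import math


def correlation(xs, ys):
    """Sample linear correlation coefficient r (sums-of-squares form)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# HYPOTHETICAL x-values, symmetric about the vertex x = 8
xs = list(range(3, 14))          # 3, 4, ..., 13
ys = [(x - 8) ** 2 for x in xs]  # every point lies exactly on the parabola

r = correlation(xs, ys)
print(r)
```

Every point lies exactly on the curve, yet r is essentially zero because the rising and falling halves of the parabola cancel each other in the sum.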
EXAMPLE: Calculate the Linear Correlation Coefficient
Use your calculator to find r for each of the following examples.
Golf: r=
Cyclones: r=
Home Prices: r=
In which example do the two variables have the strongest linear relation?
Significance Test of Sample Linear Correlation Coefficient
If there is a strong enough linear relation between the two variables, we can use a straight line to approximate the scatter diagram and we can use the equation of the straight line to estimate or predict the dependent (response) variable given a value of the independent (explanatory) variable.

If |r| is close to 1, then the individuals in the sample form a linear pattern. This may mean that the individuals in the population also form a linear pattern (see figure below, left). It is also possible that the individuals in the population do not form a linear pattern and that we were just “lucky” to have selected a sample that forms a linear pattern (see figure below, right).

Left: positive linear correlation of the population. Right: no linear correlation of the population.

We need to determine if the linear relationship seen in the sample also applies to the population. In other words, we need to determine whether the difference between r and 0 is statistically significant. This will answer the following questions:
Is |r| close to 1 because there really is a linear relationship between the two variables (i.e., |ρ| is close to 1), or
Is |r| close to 1 due to random chance (i.e., if we plotted the entire population there would be no linear correlation)?

We use Table II to determine whether r is statistically significant:
If |r| is greater than the value in the table for the sample size n, then the sample correlation coefficient is far enough from 0 to be statistically significant (using α=0.05).
Even if r is statistically significant, |r| may not be close enough to 1 to indicate strong or moderate linear correlation.
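The Table II lookup can be sketched as a dictionary keyed by sample size. The critical values below are an assumption: they are the commonly tabulated two-tailed values for α=0.05 and should be checked against Table II in the textbook before being relied on. The name `is_significant` is illustrative only.

```python
# Critical values of |r| at alpha = 0.05, keyed by sample size n.
# ASSUMPTION: taken from a standard table of critical values for the
# correlation coefficient; confirm each entry against Table II.
CRITICAL_R = {
    3: 0.997, 4: 0.950, 5: 0.878, 6: 0.811, 7: 0.754,
    8: 0.707, 9: 0.666, 10: 0.632, 11: 0.602, 12: 0.576,
}


def is_significant(r, n, table=CRITICAL_R):
    """Return True when |r| exceeds the critical value for sample size n."""
    if n not in table:
        raise ValueError(f"no critical value stored for n={n}")
    return abs(r) > table[n]


# The definition of r gives about 0.939 for the 8 golf swings in the notes.
print(is_significant(0.939, 8))

# A weaker (here negative) correlation from the same number of points
# would not pass, because the test uses |r|.
print(is_significant(-0.70, 8))
```

Note that the comparison uses the absolute value, so a strong negative correlation is treated the same way as a strong positive one.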
[Figure, Slides 29 and 30: Sale Price vs. Area (Sq. Ft.), with Sale Price from 200.0 to 500.0 (x$1000) against 1000 to 3000 square feet, drawn twice with the same sample data points highlighted within each population.]
Significance Test of Sample Linear Correlation Coefficient (cont’d)
EXAMPLE: Home Prices (cont’d)
Use Table II to determine whether the value of r calculated on page 9 is statistically significant. Interpret the results.

NOTE: When the sample size n is small, the critical values in Table II are relatively large. Thus, for small samples, |r| must be close to 1 in order for its value to be statistically significant. This is because for small samples, it would not be unusual to choose a few data points that seem to describe a linear relation even though the population does not.
Practical Tips for Working with Linear Correlation Coefficients
1. |r|=0 means there is no linear correlation. There still may be a strong nonlinear correlation.
2. Correlation does not imply causation. In other words, a value of |r| close to 1 indicates a strong linear relation (but not necessarily a cause-and-effect relationship) between the two variables.
   Changes in the explanatory variable (x) may cause changes in the response variable (y).
      o Increasing the speed of a golf swing may cause the distance traveled by the ball to increase.
   Changes in the response variable (y) may cause changes in the explanatory variable (x).
      o If there is a positive relation between caffeine use and nervousness, can we conclude that increased caffeine use causes increased nervousness, or do nervous people tend to drink more caffeine, or neither?
   Changes in both variables are caused by a third “lurking” variable.
      o There is a positive relation between a child’s shoe size and the size of his/her vocabulary, because both are influenced by the child’s age.
   Relation may be coincidental.
3. Beware of outliers. Even one outlier can have a major impact on r.
[Figure, Slide 34: Caveat #2: Beware of outliers. Cars sold vs. TV spots (x-axis: No. of TV spots, 0 to 30; y-axis: No. of cars sold, 0 to 40). With outlier: r = 0.67. Without outlier: r = 0.96.]
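The cars/TV-spots data behind the slide is not reproduced in these notes, so the sketch below illustrates the same caveat with the golf data plus one made-up point. The extra swing (110 mph, 240 yards) is hypothetical and was chosen only because it falls far from the pattern of the other eight.

```python
import math


def correlation(xs, ys):
    """Sample linear correlation coefficient r (sums-of-squares form)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Golf data from the notes
speed_mph = [100, 102, 103, 101, 105, 100, 99, 105]
distance_yd = [257, 264, 274, 266, 277, 263, 258, 275]

r_without = correlation(speed_mph, distance_yd)

# HYPOTHETICAL outlier: a very fast swing that goes a short distance
r_with = correlation(speed_mph + [110], distance_yd + [240])

print(f"without outlier: r = {r_without:.2f}")
print(f"with outlier:    r = {r_with:.2f}")
```

Comparing the two printed values shows how much a single point can move r when the sample is small.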
Practical Tips for Working with Linear Correlation Coefficients (cont’d)
4. The linear correlation coefficient doesn’t tell the whole story. Using both the scatter diagram and r provides a more complete picture of the relation between the two variables.
[Figure, Slide 39: Linear correlation coefficient doesn’t tell the whole story! r=0.7 in all 4 graphs, but the nature of the relationship varies considerably. Use r and the scatter plot together to understand how the variables are related.]
Least-Squares Regression
If |r| is close to 1 and is statistically significant, then the scatter diagram can be approximated by a straight line, and we can use the equation of the line to estimate or predict the value of the response variable given the value of the explanatory variable. We want to use the line which “best fits” the scatter diagram, i.e., the line which is “closest” to all of the points in the diagram.
NOTE: The Least-Squares line will always pass through the point (x̄, ȳ).
[Figure, Slide 43: Linear Regression. Determining a straight line that is the “best fit” for the scatter diagram (two panels, each with axes x and y).]
[Figure, Slide 44: Linear Regression. The Least-Squares line is the “best fit” for the scatter diagram.]
The “best fit” line is the one that minimizes the sum of the squares of the residuals, i.e., the vertical distances:
d1^2 + d2^2 + d3^2 + ... + d9^2
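To make the “best fit” idea concrete, the sketch below computes the sum of squared vertical distances for the Least-Squares line through the golf data and for a few slightly nudged lines. The slope here uses the closed form Sxy/Sxx, which is algebraically the same as r times (s_y / s_x); the nudge sizes are arbitrary choices for illustration.

```python
def least_squares(xs, ys):
    """Slope and intercept of the Least-Squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def sum_squared_residuals(xs, ys, slope, intercept):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Golf data from the notes
speed_mph = [100, 102, 103, 101, 105, 100, 99, 105]
distance_yd = [257, 264, 274, 266, 277, 263, 258, 275]

a, b = least_squares(speed_mph, distance_yd)
best = sum_squared_residuals(speed_mph, distance_yd, a, b)

# Nudge the slope or the intercept a little in each direction (arbitrary amounts)
nudges = [(0.1, 0.0), (-0.1, 0.0), (0.0, 1.0), (0.0, -1.0)]
nudged = [
    sum_squared_residuals(speed_mph, distance_yd, a + da, b + db)
    for da, db in nudges
]

print(round(best, 2), [round(v, 2) for v in nudged])
```

Each nudged line gives a larger total than the Least-Squares line, which is what “minimizes the sum of the squares of the residuals” means.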
Least-Squares Regression (cont’d)
In algebra, we learned that equations of non-vertical straight lines may be written in “slope-intercept form,” i.e., y = mx + b, where m is the slope of the line and b is the y-coordinate of the y-intercept. In statistics, we use the same concept but different notation.

Slope-intercept equation used in algebra: y = mx + b
Slope-intercept equation used by statisticians: ŷ = ax + b
Slope-intercept equation used in our textbook: ŷ = b1·x + b0

Calculating the slope (a) and the y-intercept (b):
a = r · (s_y / s_x), b = ȳ - a·x̄

The calculations are cumbersome, so we will use a calculator command:
STAT > CALC > 4:LinReg(ax+b) L1, L2
L1 contains the values of the x-variable. L2 contains the values of the y-variable.
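For readers without the calculator at hand, here is a minimal sketch of the same computation, using slope a = r · (s_y / s_x) and intercept b = ȳ - a·x̄ on the golf data. The two lists play the roles of L1 and L2; the variable names are illustrative. The result should agree with the calculator's LinReg(ax+b) output up to rounding.

```python
import math
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample linear correlation coefficient r (sums-of-squares form)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# L1: x-variable (club-head speed, mph); L2: y-variable (distance, yards)
speed_mph = [100, 102, 103, 101, 105, 100, 99, 105]
distance_yd = [257, 264, 274, 266, 277, 263, 258, 275]

r = correlation(speed_mph, distance_yd)

# Slope and y-intercept from the formulas in the notes
a = r * stdev(distance_yd) / stdev(speed_mph)
b = mean(distance_yd) - a * mean(speed_mph)

print(f"y-hat = {a:.4f}x + ({b:.4f})")

# The line passes through the point (x-bar, y-bar)
y_at_x_bar = a * mean(speed_mph) + b
```

The last line evaluates the fitted line at x̄; the value equals ȳ, which matches the note that the Least-Squares line always passes through (x̄, ȳ).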
EXAMPLE: Golf (cont’d)
Use the calculator to find the equation of the Least-Squares Line.
Use the Least-Squares equation to predict the distance traveled for the two speeds shown below. Compute the residual, i.e., the difference between the actual value and the predicted value.

x: club-head speed (mph)    ŷ: predicted distance traveled (yds.)    y: actual distance traveled (yds.)    y - ŷ: residual (yds.)
100
103
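A sketch of the prediction and residual steps for the golf data, with residual = actual - predicted. The function names are illustrative. A speed of 100 mph appears twice in the sample (257 and 263 yards), so that speed has two residuals.

```python
def least_squares(xs, ys):
    """Slope and intercept of the Least-Squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar


# Golf data from the notes
speed_mph = [100, 102, 103, 101, 105, 100, 99, 105]
distance_yd = [257, 264, 274, 266, 277, 263, 258, 275]

a, b = least_squares(speed_mph, distance_yd)


def predict(x):
    """Predicted distance (yards) for a club-head speed x (mph)."""
    return a * x + b


# Residual = actual value minus predicted value, one per observation
residuals = [y - predict(x) for x, y in zip(speed_mph, distance_yd)]

for x, y, res in zip(speed_mph, distance_yd, residuals):
    if x in (100, 103):
        print(f"x={x}  predicted={predict(x):.1f}  actual={y}  residual={res:.1f}")
```

Points above the line have positive residuals and points below it have negative residuals; over the whole sample the residuals of a Least-Squares line sum to zero.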
Interpreting the Least-Squares Line
Interpret the (slope and y-intercept of the) least squares line in each of the examples:
Golf
Cyclone
Home Prices
[Figure, Slide 49: Golf Drive Distance vs. Club-Head Speed with the regression line drawn and one residual marked. A regression line describes the linear relation between the two variables.]
[Figure, Slide 50: The Least-Squares Line for Maximum wind speed vs. lowest barometric pressure. When the x-axis is truncated, the marked point is NOT the y-intercept.]
[Figure, Slide 51: The Least-Squares Line for Sale Price vs. Area (Sq. Ft.). When the x-axis is truncated, the marked point is NOT the y-intercept.]
Practical Tips for Doing Linear Regression
5. Beware of outliers. Even one outlier can impact the Least-Squares line.
6. Beware of using “out-of-range” values of x.
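One way to guard against tip 6 in code is to refuse predictions for x-values outside the range that was actually observed. This is only a sketch of that idea on the golf data; `make_predictor` is a hypothetical helper, and raising an error (rather than, say, printing a warning) is a design choice made here for illustration.

```python
def make_predictor(xs, ys):
    """Fit the Least-Squares line and return a range-checked predict function."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    x_min, x_max = min(xs), max(xs)

    def predict(x):
        # Only predict inside the range of x-values seen in the sample
        if not (x_min <= x <= x_max):
            raise ValueError(
                f"x={x} is outside the observed range [{x_min}, {x_max}]"
            )
        return slope * x + intercept

    return predict


# Golf data from the notes (speeds range from 99 to 105 mph)
speed_mph = [100, 102, 103, 101, 105, 100, 99, 105]
distance_yd = [257, 264, 274, 266, 277, 263, 258, 275]

predict_distance = make_predictor(speed_mph, distance_yd)

print(round(predict_distance(102), 1))  # inside the observed range

try:
    predict_distance(120)               # far outside the observed range
except ValueError as err:
    print("refused:", err)
```

The sample says nothing about how the ball behaves at 120 mph, so the line's value there is an extrapolation rather than an estimate supported by the data.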
[Figure, Slide 54: Beware of outliers. Even one outlier can impact the Least-Squares Line. Cars sold vs. TV spots (x-axis: No. of TV spots, 0 to 30; y-axis: No. of cars sold, 0 to 40). Including outlier: r = 0.67, ŷ = 0.6855x + 9.1956. Excluding outlier: r = 0.96, ŷ = 0.9311x + 7.3985.]
[Figure, Slide 51: Caveat #5: Beware of predicting the future. Rate of Accidental Deaths in the United States. Source: U.S. Department of Health and Human Services, National Center for Health Statistics. © 2001 Addison-Wesley]
Summary: Linear Correlation and Regression

The linear correlation coefficient r describes the strength and direction of the linear relationship between two variables.
-1 ≤ r ≤ 1

A correlation coefficient does not establish causation.

If r passes the significance test and |r| is close enough to 1, find the equation of the Least-Squares regression line to estimate or predict values of y given values of x.
ŷ = ax + b

The stronger the linear correlation (i.e., the closer |r| is to 1), the more accurate the estimate or prediction.

Using a scatter diagram together with numerical measures (r, a and b) provides a more complete picture of the relation between the variables.