Pat Hammett, University of Michigan 1
Two-Variable Analysis: Simple Linear Regression / Correlation
Topics
I. Scatter Plot (X-Y Graph)
II. Simple Linear Regression
III. Correlation, R
IV. Assessing Model Accuracy, R²
V. Regression Abuses / Misinterpreting Correlation
I. Scatter Plot
• Used to visualize the relationship between two variables.
• Common results:
  - Linear relationships
  - Non-linear relationships
  - No relationship (robustness)
Scatter Plot
• Shows the relationship between X (predictor) and Y (response) over a given range of X.
  - X: independent variable (predictor variable)
  - Y: dependent variable (response variable)
Example 1: Coating Thickness
(From: “SPC of a Phosphate Coating Line”, Wire, J. J. Intl, May 1997, pp. 78-81.)
• Suppose you measure the efficiency of a phosphate coating operation for steel versus coating tank temperature.
  - What is the response? What is the predictor?

Sample  Temp  Efficiency    Sample  Temp  Efficiency
1       170   0.84          13      180   2.15
2       172   1.31          14      181   0.84
3       173   1.42          15      181   1.43
4       174   1.03          16      182   0.90
5       174   1.07          17      182   1.81
6       175   1.08          18      182   1.94
7       176   1.04          19      182   2.68
8       177   1.80          20      184   1.49
9       180   1.45          21      184   2.52
10      180   1.60          22      185   3.00
11      180   1.61          23      186   1.87
12      180   2.13          24      188   3.08
Scatter Plot: Coating Efficiency
• Is there a relationship here?
[Figure: scatter plot, "Temperature Vs. Coating Efficiency"; X axis: Temperature (165-190), Y axis: Phosphate Coating Efficiency Ratio (0-3.5)]
• Yes: as temperature increases, efficiency increases.
Lecture Exercise 1: Changing Range of X
• Open the Excel file, tanktemp.xls, which contains this data set.
  - Compute the range of Y (efficiency) if you narrow the tank temperature range from 170-188 to 180-182.
  - Is the range of Y (efficiency) smaller, larger, or the same as over the full range of X?
  - Construct a scatter plot of this new data set. Do you still think a relationship exists?
Lecture Exercise 1: Effect on Y of Reducing Variation in X
• Note: if a strong relationship exists (positive or negative) between X and Y, then reducing variation in X should result in a variation reduction in Y.
• Coating example:

Temp (Range X)   Efficiency Ratio (Range Y)
170-188          2.24
180-182          1.84
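The two ranges in the table can be reproduced without Excel. The following is a minimal Python sketch, assuming the 24 samples from Example 1 are typed in by hand; the helper name `y_range` is illustrative and not part of the lecture.

```python
# Coating data from Example 1: tank temperature and efficiency ratio,
# listed in sample order (1 through 24).
temps = [170, 172, 173, 174, 174, 175, 176, 177, 180, 180, 180, 180,
         180, 181, 181, 182, 182, 182, 182, 184, 184, 185, 186, 188]
effs = [0.84, 1.31, 1.42, 1.03, 1.07, 1.08, 1.04, 1.80, 1.45, 1.60, 1.61, 2.13,
        2.15, 0.84, 1.43, 0.90, 1.81, 1.94, 2.68, 1.49, 2.52, 3.00, 1.87, 3.08]


def y_range(x_low, x_high):
    """Range (max - min) of efficiency for samples with x_low <= temp <= x_high."""
    ys = [y for x, y in zip(temps, effs) if x_low <= x <= x_high]
    return max(ys) - min(ys)


full_range = y_range(170, 188)    # all 24 samples
narrow_range = y_range(180, 182)  # only the 11 samples at 180-182

print(f"170-188: range of Y = {full_range:.2f}")    # 2.24
print(f"180-182: range of Y = {narrow_range:.2f}")  # 1.84
```

Narrowing X from 170-188 to 180-182 shrinks the range of Y only from 2.24 to 1.84, which is a first hint that temperature alone does not account for most of the variation.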
Efficiency Vs. Reduced Range in Temperature
• Over the smaller range of the input (temperature), this relationship weakens.
[Figure: scatter plot, "Temperature Vs. Coating Efficiency"; X axis: Temperature (179-183), Y axis: Phosphate Coating Efficiency Ratio (0-3)]
Lessons from Coating Example
• Relationships between Y and X variables may change depending on the range of X.
• Scatter plots provide good visualization of relationships between variables, but we need a metric to assess the strength of a relationship.
  - For two variables, we use simple linear regression to develop a model, and correlation to assess the strength of the relationship.
II. Simple Linear Regression
• Simple linear regression examines the relationship between two variables:
  - one response (Y), and
  - one predictor (X).
• If two variables are related, a regression equation may be used to predict a response value from a predictor value better than random chance would.
Simple Regression Equation
• Y = β0 + β1X1
  - Y = dependent variable (response)
  - X1 = independent variable (predictor)
  - β0 = intercept; the value of Y when X1 = 0.
  - β1 = slope; the predicted change in output Y per unit change of input X1.
• Alternatively, Y = mX + b (m is the slope, and b is the y-intercept).
Computing Slope and Intercept
• We typically use software to compute the slope and y-intercept. In Excel, we may use:
  - =slope(y-array,x-array)
  - =intercept(y-array,x-array)

Sample  Temp  Efficiency
1       170   0.84
2       172   1.31
3       173   1.42
4       174   1.03
5       174   1.07
…       …     …
24      188   3.08

slope      0.094     =slope(C2:C25,B2:B25)
intercept  -15.245   =intercept(C2:C25,B2:B25)
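For readers without Excel, the same two numbers can be obtained from the ordinary least-squares formulas (slope = Sxy / Sxx, intercept = ȳ - slope · x̄). The sketch below is one way to do it in plain Python; the function name `fit_line` is an illustrative choice, and the data are the 24 samples from Example 1.

```python
temps = [170, 172, 173, 174, 174, 175, 176, 177, 180, 180, 180, 180,
         180, 181, 181, 182, 182, 182, 182, 184, 184, 185, 186, 188]
effs = [0.84, 1.31, 1.42, 1.03, 1.07, 1.08, 1.04, 1.80, 1.45, 1.60, 1.61, 2.13,
        2.15, 0.84, 1.43, 0.90, 1.81, 1.94, 2.68, 1.49, 2.52, 3.00, 1.87, 3.08]


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


slope, intercept = fit_line(temps, effs)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
# slope = 0.094, intercept = -15.245
```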
Coating Example: Trend Line
• You may add this line to your scatter plot by selecting your chart and then using the Add Trendline command under the Chart menu.
[Figure: scatter plot with fitted trend line, "Temperature Vs. Coating Efficiency"; X axis: Temperature (165-190), Y axis: Phosphate Coating Efficiency Ratio (0-3.5)]
Slope Values and Trend Lines
• Positive slope values
  - Increasing trend lines on scatter plot.
• Negative slope values
  - Decreasing trend lines on scatter plot.
• No slope (~0)
  - Horizontal trend lines.
  - Comment: be careful with using absolute magnitudes. Depending on units, a very small slope deviation from 0 could be significant.
Coating Example: Slope Values and Trend Lines
• The slope is greater over the entire temperature range (170-188) of the study than over the reduced range (180-182).
• The slope is a positive value → increasing trend.

Temp (Range X)  Efficiency Ratio (Range Y)  Slope  Y-Intercept
170-188         2.24                        0.094  -15.245
180-182         1.84                        0.008  0.153
Model Predictions
• Slope and intercepts are mathematical calculations. We can always compute them.
• More important question: how effective are these terms at predicting any individual observation?
• One way to assess effectiveness of the prediction is to examine the residuals.
Residual Terms
• Residual (obs i) = Yactual (obs i) - Ypredicted (obs i)
  - Vertical bars are the residuals for each observation of Y.
[Figure: scatter plot with trend line and vertical residual bars, "Temperature Vs. Coating Efficiency"; X axis: Temperature (165-190), Y axis: Phosphate Coating Efficiency Ratio (0-3.5)]
Lecture Exercise 2: Computing Predicted Value and Residual
• Consider sample #22, where X = 185 and Y = 3.0.
• Using the regression equation (Y = 0.094X - 15.245), compute the following:
  - Ypredicted (obs 22) = ?
  - Yresidual (obs 22) = ?
• Answers:
  - Ypredicted (obs 22) = (0.094 x 185) - 15.245 = 2.145
  - Yresidual (obs 22) = Yactual - Ypredicted = 3.0 - 2.145 = 0.855
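The same arithmetic can be checked in a few lines of Python. This is only a sketch: it uses the rounded coefficients quoted in the exercise (0.094 and -15.245) rather than a fresh fit, and the variable names are illustrative.

```python
# Rounded regression coefficients from the coating example.
slope, intercept = 0.094, -15.245

# Observation 22: temperature and measured efficiency ratio.
x_obs, y_obs = 185, 3.0

y_pred = slope * x_obs + intercept  # predicted efficiency at 185 degrees
residual = y_obs - y_pred           # actual minus predicted

print(f"predicted = {y_pred:.3f}, residual = {residual:.3f}")
# predicted = 2.145, residual = 0.855
```

A positive residual means the observation lies above the fitted line; here the model under-predicts sample 22.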
Residuals
• Smaller residuals indicate a better prediction.
• Consider the following graphs: which has smaller residuals, A or B?
[Figure: two scatter plots with trend lines, "Temperature Vs. Coating Efficiency", labeled Group A and Group B; X axis: Temperature (165-190), Y axis: Phosphate Coating Efficiency Ratio (0-3)]
• Answer: A.
III. Correlation
• Correlation (R) provides a measure of how well the model predicts.
• Perfect correlation means that we can pass a line through every observation (all residuals = 0).
[Figure: X-Y plots in which every point lies on a straight line; R = 1.0]
Correlation
• In assessing relationships between variables, we often want to know the strength of the relationship.
• The Pearson correlation coefficient, R, measures the extent to which two variables are linearly related:

$$R = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y}$$

  - where i = 1..n pairs, and s_x and s_y are the sample standard deviations of X and Y
  - -1 ≤ R ≤ 1
  - Microsoft Excel function: =correl(array1,array2)
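To make the formula concrete, here is a short Python sketch that evaluates it term by term on the coating data from Example 1. The function name `pearson_r` is illustrative; Excel's =correl performs the equivalent calculation.

```python
import math

temps = [170, 172, 173, 174, 174, 175, 176, 177, 180, 180, 180, 180,
         180, 181, 181, 182, 182, 182, 182, 184, 184, 185, 186, 188]
effs = [0.84, 1.31, 1.42, 1.03, 1.07, 1.08, 1.04, 1.80, 1.45, 1.60, 1.61, 2.13,
        2.15, 0.84, 1.43, 0.90, 1.81, 1.94, 2.68, 1.49, 2.52, 3.00, 1.87, 3.08]


def pearson_r(xs, ys):
    """Pearson R = sum((x - x_bar)(y - y_bar)) / ((n - 1) * s_x * s_y)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (n - 1 in the denominator).
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * s_x * s_y)


r = pearson_r(temps, effs)
print(f"R = {r:.3f}")  # R = 0.672
```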
Correlation – Coating Example
• From Excel: R = correl(B2:B25,C2:C25) = 0.67
[Figure: scatter plot, "Temperature Vs. Coating Efficiency"; X axis: Temperature (165-190), Y axis: Phosphate Coating Efficiency Ratio (0-3.5)]
Correlation Patterns
[Figure: four scatter plot panels]
  - Perfect positive: R = 1.0
  - Strong positive: R = 0.7
  - Perfect negative: R = -1.0
  - Strong negative: R = -0.7
• Rule of thumb: |Correlation| > 0.7 → strong relationship
No Correlation
• If no correlation exists, R = 0.
[Figure: scatter plot with no visible trend; X axis: Predictor, X; Y axis: Response, Y]
IV. Assessing Model Accuracy, R²
• Another tool to assess model accuracy (or predictability) is R².
• R² is the coefficient of determination.
  - For simple linear regression, R² is computed by squaring the correlation, R.
  - 0 (no correlation) ≤ R² ≤ 1 (perfect correlation)
What does R² Measure?
• R² measures the % of the variation in Y explained by the variation in X over the range of X.
• Suppose R = 1 → R² = 1; thus, all of the variation in Y may be explained by X.
• R = 0.7 → R² = 0.49; thus, 49% of the variation in Y may be explained by X.
• R = 0.1 → R² = 0.01; thus, only 1% of the variation in Y may be explained by X.
Coating Example Revisited
• Recall our different equations based on the range of X for the coating example.
• Over the full range, we have a moderate correlation (R = 0.67, just under the 0.7 rule of thumb), where temperature explains ~45% of the variation in efficiency ratio.
• Over the tighter range, temperature explains little of the variation in efficiency ratio (~0%).

Temp (Range X)  Efficiency Ratio (Range Y)  Slope  Y-Intercept  R      R²
170-188         2.24                        0.094  -15.245      0.672  0.45
180-182         1.84                        0.008  0.153        0.015  0.00
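Both rows of the table can be regenerated by fitting the same model twice, once on all 24 samples and once on the 11 samples taken at 180-182. The sketch below assumes the Example 1 data; `fit_stats` is an illustrative helper name, not something from the lecture.

```python
import math

temps = [170, 172, 173, 174, 174, 175, 176, 177, 180, 180, 180, 180,
         180, 181, 181, 182, 182, 182, 182, 184, 184, 185, 186, 188]
effs = [0.84, 1.31, 1.42, 1.03, 1.07, 1.08, 1.04, 1.80, 1.45, 1.60, 1.61, 2.13,
        2.15, 0.84, 1.43, 0.90, 1.81, 1.94, 2.68, 1.49, 2.52, 3.00, 1.87, 3.08]


def fit_stats(xs, ys):
    """Return (slope, intercept, R) for a simple linear regression of ys on xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar, sxy / math.sqrt(sxx * syy)


# Full study range versus the reduced 180-182 range.
full = fit_stats(temps, effs)
pairs = [(x, y) for x, y in zip(temps, effs) if 180 <= x <= 182]
narrow = fit_stats([x for x, _ in pairs], [y for _, y in pairs])

for label, (slope, intercept, r) in (("170-188", full), ("180-182", narrow)):
    print(f"{label}: slope={slope:.3f} intercept={intercept:.3f} "
          f"R={r:.3f} R^2={r * r:.2f}")
```

The narrow-range fit uses fewer than half the samples, so its slope and R are also less stable than the full-range values.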
Lecture Exercise 3: Model Prediction and Correlation
• Suppose you are in charge of a Design for Six Sigma project to determine the appropriate pressure settings for bicycle tires.
  - Currently you produce 37 mm tires.
• One of your response variables is the coefficient of rolling friction (Cr).
• Note: the lower the Cr, the better the ride.
Lecture Exercise 3: Bicycle Tire Analysis Data
• Experiment:
  - Response: coefficient of rolling friction (Cr)
  - Predictor: tire pressure
  - Target: Cr < 0.006
• Perform the following: scatter plot (pressure vs. Cr), fitted regression line, correlation (R), and assess model accuracy with R².

Pressure (PSI)  Cr (Width = 37 mm)
20              0.0100
25              0.0095
30              0.0088
35              0.0081
40              0.0074
45              0.0067
50              0.0060
55              0.0058
60              0.0056
65              0.0054
70              0.0052
75              0.0050
Tire Example: Scatter Plot / R²
• Tire example: R = -0.9698; R² = 0.940
[Figure: scatter plot with fitted line, "Cr Vs. Tire Pressure"; trend line y = -9E-05x + 0.0115, R² = 0.9405; X axis: Tire Pressure (PSI) (0-80), Y axis: Coefficient Rolling Friction (0-0.0120)]
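The trend-line equation and R² shown on the chart can be reproduced from the twelve (pressure, Cr) pairs in the exercise data. The following Python sketch is one way to do so; `fit_stats` is an illustrative helper name.

```python
import math

# Tire data from Lecture Exercise 3: pressure (PSI) and rolling friction Cr.
pressure = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75]
cr = [0.0100, 0.0095, 0.0088, 0.0081, 0.0074, 0.0067,
      0.0060, 0.0058, 0.0056, 0.0054, 0.0052, 0.0050]


def fit_stats(xs, ys):
    """Return (slope, intercept, R) for a simple linear regression of ys on xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar, sxy / math.sqrt(sxx * syy)


slope, intercept, r = fit_stats(pressure, cr)
print(f"y = {slope:.2e}x + {intercept:.4f}")  # y = -9.48e-05x + 0.0115
print(f"R = {r:.4f}, R^2 = {r * r:.4f}")      # R = -0.9698, R^2 = 0.9405
```

Note that the chart label rounds the slope to -9E-05; the unrounded value is closer to -9.5E-05, which matters when the equation is later solved for X.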
Lecture Exercise 4: Interpreting Results
• Tire pressure clearly has a strong effect on the coefficient of rolling friction.
1. Suppose the specification is Cr < 0.006. How might we determine the appropriate tire pressure from our model?
2. What tire pressure would eliminate rolling friction (Cr = 0)?
Solve the Equation for X
• Equation: Y = -0.00009X + 0.01146
  - If Y = 0.006, X ≈ 60 psi
  - If Y = 0, X ≈ 127 psi
• Do these values make sense?

Pressure (PSI)  Cr (Width = 1.25")
20              0.0100
25              0.0095
30              0.0088
35              0.0081
40              0.0074
45              0.0067
50              0.0060
55              0.0058
60              0.0056
65              0.0054
70              0.0052
75              0.0050
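Solving Y = β1·X + β0 for X gives X = (Y - β0) / β1. The sketch below applies that to the rounded equation on the slide and also flags whether each answer falls inside the tested pressure range (20-75 psi). The function name `pressure_for` and the range check are illustrative additions, not part of the lecture.

```python
# Rounded coefficients from the slide: Y = -0.00009 X + 0.01146
slope, intercept = -0.00009, 0.01146

# Pressures actually tested in the experiment (PSI).
x_min, x_max = 20, 75


def pressure_for(cr_target):
    """Invert the fitted line: pressure at which the model predicts cr_target."""
    return (cr_target - intercept) / slope


x_spec = pressure_for(0.006)  # pressure for the Cr = 0.006 specification
x_zero = pressure_for(0.0)    # pressure at which the line reaches Cr = 0

for label, x in (("Cr = 0.006", x_spec), ("Cr = 0", x_zero)):
    inside = x_min <= x <= x_max
    print(f"{label}: X = {x:.1f} psi, inside tested range: {inside}")
```

With these rounded coefficients the answers come out near 60.7 and 127.3 psi. The first lies inside the tested range; the second is an extrapolation far beyond it, which is why it deserves suspicion.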
Re-Examine Scatter Plot
• Is this graph linear?
[Figure: scatter plot, "Cr Vs. Tire Pressure"; X axis: Tire Pressure (PSI) (0-80), Y axis: Coefficient Rolling Friction (0-0.0120)]
• No, it is non-linear.
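One simple numerical check of the non-linearity, offered here as a sketch rather than as the lecture's method, is to fit separate slopes to the lower and upper halves of the pressure range. If the relationship were linear, the two slopes would be about equal. The split at 50 psi and the helper name `slope_of` are assumptions made for illustration.

```python
pressure = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75]
cr = [0.0100, 0.0095, 0.0088, 0.0081, 0.0074, 0.0067,
      0.0060, 0.0058, 0.0056, 0.0054, 0.0052, 0.0050]


def slope_of(xs, ys):
    """Least-squares slope of ys on xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


def slope_between(lo, hi):
    """Slope using only the observations with lo <= pressure <= hi."""
    pairs = [(x, y) for x, y in zip(pressure, cr) if lo <= x <= hi]
    return slope_of([x for x, _ in pairs], [y for _, y in pairs])


slope_low = slope_between(20, 50)   # steep part of the curve
slope_high = slope_between(50, 75)  # flatter part of the curve

print(f"20-50 psi: slope = {slope_low:.2e}")   # about -1.36e-04
print(f"50-75 psi: slope = {slope_high:.2e}")  # about -4.00e-05
```

The slope below 50 psi is more than three times as steep as the slope above it, so a single straight line misrepresents both regions even though R² is 0.94.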
V. Regression Abuses / Misinterpreting Correlation
• Between the coating efficiency and tire examples, we have noted several potential abuses:
  - Check that the relationship is actually linear before applying linear regression.
  - Do not make inferences outside the region of study (example: tire pressure = 0, or tire pressure = 130 psi).
  - Relationships between X and Y may change depending on the range of observed X values.
Extreme Values
• Consider an experiment relating tonnage and draw depth.
• Based on these data, are they strongly related?

Tonnage  Drawdepth
946      60.22
940      60.24
935      60.25
939      60.29
944      60.30
936      60.36
946      60.37
912      60.92
939      60.02
940      60.08

Correlation: -0.79
Draw Depth Example
• With the tonnage = 912 reading → R = -0.79; without this reading → R ≈ 0.15.
• Lesson: graph before interpreting correlation!
[Figure: scatter plot, "Tonnage Vs. Drawdepth"; X axis: Tonnage (910-950), Y axis: Drawdepth (59.80-61.00)]
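The effect of the single extreme reading is easy to confirm numerically. The sketch below recomputes R from the ten (tonnage, draw depth) pairs listed in the table, first with all readings and then with the tonnage = 912 reading dropped; `pearson_r` is an illustrative helper name.

```python
import math

# Tonnage and draw depth readings from the Extreme Values table.
tonnage = [946, 940, 935, 939, 944, 936, 946, 912, 939, 940]
depth = [60.22, 60.24, 60.25, 60.29, 60.30, 60.36, 60.37, 60.92, 60.02, 60.08]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


r_all = pearson_r(tonnage, depth)

# Drop the single extreme reading (tonnage = 912) and recompute.
kept = [(x, y) for x, y in zip(tonnage, depth) if x != 912]
r_trim = pearson_r([x for x, _ in kept], [y for _, y in kept])

print(f"with 912 reading:    R = {r_all:.2f}")
print(f"without 912 reading: R = {r_trim:.2f}")
```

One point out of ten moves R from a seemingly strong negative value to a value close to zero, which is why the scatter plot should always be inspected first.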
Interpreting Correlation
• When drawing conclusions based on correlation, several issues must be considered:
  - The Pearson correlation coefficient (R) measures the linear relationship (a non-linear relationship may exist).
  - Correlation does not always indicate cause and effect!
  - The correlation coefficient is very sensitive to extreme values: ALWAYS GRAPH.
Correlation Vs. Causation
• Correlation does not necessarily imply causation.
  - Does your income increase because you are older, or because you have more experience or seniority at your company?
[Figure: scatter plot; X axis: Age, X; Y axis: Income, Y]
Verifying Causation
• To verify that correlation relates to causation, you need to conduct controlled experiments.
• Hold other process variables fixed and then test if Y changes in relation to X.
• Note: Design of Experiments (Black Belt Skill) provides more advanced verification approaches.
Regression / Correlation and Six Sigma Projects
• During the Analysis phase of a Six Sigma project, we try to understand relationships between our outputs (KPOVs) and our inputs (KPIVs).
• Regression and correlation provide tools to assess relationships.
• Remember, finding no correlation may be just as important as finding a strong correlation.