june 30, 2008stat 111 - lecture 16 - regression1 inference for relationships between variables...
TRANSCRIPT
![Page 1: June 30, 2008Stat 111 - Lecture 16 - Regression1 Inference for relationships between variables Statistics 111 - Lecture 16](https://reader035.vdocuments.us/reader035/viewer/2022062409/5697bff11a28abf838cbb6c9/html5/thumbnails/1.jpg)
June 30, 2008 Stat 111 - Lecture 16 - Regression
1
Inference for relationships between variables
Statistics 111 - Lecture 16
![Page 2: June 30, 2008Stat 111 - Lecture 16 - Regression1 Inference for relationships between variables Statistics 111 - Lecture 16](https://reader035.vdocuments.us/reader035/viewer/2022062409/5697bff11a28abf838cbb6c9/html5/thumbnails/2.jpg)
June 30, 2008 Stat 111 - Lecture 16 - Regression
2
Administrative Notes
• Homework 5 due on Wednesday, July 1st
![Page 3: June 30, 2008Stat 111 - Lecture 16 - Regression1 Inference for relationships between variables Statistics 111 - Lecture 16](https://reader035.vdocuments.us/reader035/viewer/2022062409/5697bff11a28abf838cbb6c9/html5/thumbnails/3.jpg)
• Final is this Thursday – Will start at exactly 10:40 – Will end at exactly 12:10– Get here a bit early or you will be wasting time on
your exam
• Final will focus on material from after midterm• There will be at least two questions that require
knowing material from before midterm
• Calculators are definitely needed!• Single 8.5 x 11 cheat sheet (two-sided) allowed
June 30, 2008 Stat 111 - Lecture 16 - Regression
3
Final Exam
![Page 4: June 30, 2008Stat 111 - Lecture 16 - Regression1 Inference for relationships between variables Statistics 111 - Lecture 16](https://reader035.vdocuments.us/reader035/viewer/2022062409/5697bff11a28abf838cbb6c9/html5/thumbnails/4.jpg)
June 30, 2008 Stat 111 - Lecture 16 - Regression
4
Inference Thus Far
• Tests and intervals for a single variable
• Tests and intervals to compare a single variable between two samples
• For the last couple of classes, we have looked at count data and inference for population proportions
• Before that, we looked at continuous data and inference for population means
• Next couple of classes: inference for a relationship between two continuous variables
![Page 5: June 30, 2008Stat 111 - Lecture 16 - Regression1 Inference for relationships between variables Statistics 111 - Lecture 16](https://reader035.vdocuments.us/reader035/viewer/2022062409/5697bff11a28abf838cbb6c9/html5/thumbnails/5.jpg)
June 30, 2008 Stat 111 - Lecture 16 - Regression
5
Two Continuous Variables
• Remember linear relationships between two continuous variables?
• Scatterplots
• Correlation
• Best Fit Lines
![Page 6: June 30, 2008Stat 111 - Lecture 16 - Regression1 Inference for relationships between variables Statistics 111 - Lecture 16](https://reader035.vdocuments.us/reader035/viewer/2022062409/5697bff11a28abf838cbb6c9/html5/thumbnails/6.jpg)
June 30, 2008 Stat 111 - Lecture 16 - Regression
6
Scatterplots and Correlation
• Visually summarize the relationship between two continuous variables with a scatterplot
• If our X and Y variables show a linear relationship, we can calculate a best fit line between Y and X
Education and Mortality: r = -0.51 Draft Order and Birthdayr = -0.22
![Page 7: June 30, 2008Stat 111 - Lecture 16 - Regression1 Inference for relationships between variables Statistics 111 - Lecture 16](https://reader035.vdocuments.us/reader035/viewer/2022062409/5697bff11a28abf838cbb6c9/html5/thumbnails/7.jpg)
June 30, 2008 Stat 111 - Lecture 16 - Regression
7
Linear Regression
• Best fit line is called Simple Linear Regression Model:
• Coefficients:is the intercept and is the slope
• Other common notation: 0 for intercept, 1 for slope
• Our Y variable is a linear function of the X variable but we allow for error (εi) in each prediction
• We approximate the error by using the residualObserved Yi
Predicted Yi = + Xi
![Page 8: June 30, 2008Stat 111 - Lecture 16 - Regression1 Inference for relationships between variables Statistics 111 - Lecture 16](https://reader035.vdocuments.us/reader035/viewer/2022062409/5697bff11a28abf838cbb6c9/html5/thumbnails/8.jpg)
June 30, 2008 Stat 111 - Lecture 16 - Regression
8
Residuals and Best Fit Line
0 and 1 that give the best fit line are the values that give smallest sum of squared residuals:
• Best fit lineis also called the least-squares line
ei = residuali
![Page 9: June 30, 2008Stat 111 - Lecture 16 - Regression1 Inference for relationships between variables Statistics 111 - Lecture 16](https://reader035.vdocuments.us/reader035/viewer/2022062409/5697bff11a28abf838cbb6c9/html5/thumbnails/9.jpg)
June 30, 2008 Stat 111 - Lecture 16 - Regression
9
Best values for Regression Parameters
• The best fit line has these values for the regression coefficients:
• Also can estimate the average squared residual:
Best estimate of slope
Best estimate of intercept
![Page 10: June 30, 2008Stat 111 - Lecture 16 - Regression1 Inference for relationships between variables Statistics 111 - Lecture 16](https://reader035.vdocuments.us/reader035/viewer/2022062409/5697bff11a28abf838cbb6c9/html5/thumbnails/10.jpg)
June 30, 2008 Stat 111 - Lecture 16 - Regression
10
Example: Education and Mortality
Mortality = 1353.16 - 37.62 · Education
• Negative association means negative slope b
![Page 11: June 30, 2008Stat 111 - Lecture 16 - Regression1 Inference for relationships between variables Statistics 111 - Lecture 16](https://reader035.vdocuments.us/reader035/viewer/2022062409/5697bff11a28abf838cbb6c9/html5/thumbnails/11.jpg)
June 30, 2008 Stat 111 - Lecture 16 - Regression
11
Interpreting the regression line
• The estimated slope b is the average change you get in the Y variable if you increased the X variable by one• Eg. increasing education by one year means an
average decrease of ≈ 38 deaths per 100,000
• The estimated intercept a is the average value of the Y variable when the X variable is equal to zero• Eg. an uneducated city (Education = 0) would have
an average mortality of 1353 deaths per 100,000
![Page 12: June 30, 2008Stat 111 - Lecture 16 - Regression1 Inference for relationships between variables Statistics 111 - Lecture 16](https://reader035.vdocuments.us/reader035/viewer/2022062409/5697bff11a28abf838cbb6c9/html5/thumbnails/12.jpg)
June 30, 2008 Stat 111 - Lecture 16 - Regression
12
Example: Vietnam Draft Order
Draft Order = 224.9 - 0.226 · Birthday
• Slightly negative slope means later birthdays have a lower draft order
![Page 13: June 30, 2008Stat 111 - Lecture 16 - Regression1 Inference for relationships between variables Statistics 111 - Lecture 16](https://reader035.vdocuments.us/reader035/viewer/2022062409/5697bff11a28abf838cbb6c9/html5/thumbnails/13.jpg)
June 30, 2008 Stat 111 - Lecture 16 - Regression
13
Significance of Regression Line
• Does the regression line show a significant linear relationship between the two variables?• If there is not a linear relationship, then we would
expect zero correlation (r = 0)• So the estimated slope b should also be close to
zero
• Therefore, our test for a significant relationship will focus on testing whether our true slope is significantly different from zero:
H0 : = 0 versus Ha : 0
• Our test statistic is based on the estimated slope b
![Page 14: June 30, 2008Stat 111 - Lecture 16 - Regression1 Inference for relationships between variables Statistics 111 - Lecture 16](https://reader035.vdocuments.us/reader035/viewer/2022062409/5697bff11a28abf838cbb6c9/html5/thumbnails/14.jpg)
June 30, 2008 Stat 111 - Lecture 16 - Regression
14
Test Statistic for Slope
• Our test statistic for the slope is similar in form to all the test statistics we have seen so far:
• The standard error of the slope SE(b) has a complicated formula that requires some matrix algebra to calculate• We will not be doing this calculation manually
because the JMP software does this calculation for us!
![Page 15: June 30, 2008Stat 111 - Lecture 16 - Regression1 Inference for relationships between variables Statistics 111 - Lecture 16](https://reader035.vdocuments.us/reader035/viewer/2022062409/5697bff11a28abf838cbb6c9/html5/thumbnails/15.jpg)
June 30, 2008 Stat 111 - Lecture 16 - Regression
15
Example: Education and Mortality
![Page 16: June 30, 2008Stat 111 - Lecture 16 - Regression1 Inference for relationships between variables Statistics 111 - Lecture 16](https://reader035.vdocuments.us/reader035/viewer/2022062409/5697bff11a28abf838cbb6c9/html5/thumbnails/16.jpg)
June 30, 2008 Stat 111 - Lecture 16 - Regression
16
p-value for Slope Test
• Is T = -4.53 significantly different from zero?• To calculate a p-value for our test statistic T, we use
the t distribution with n-2 degrees of freedom• For testing means, we used a t distribution as well,
but we had n-1 degrees of freedom before• For testing slopes, we use n-2 degrees of freedom
because we are estimating two parameters (intercept and slope) instead of one (a mean)
• For cities dataset, n = 60, so we have d.f. = 58• Looking at a t-table with 58 df, we discover that the
P(T < -4.53) < 0.0005
![Page 17: June 30, 2008Stat 111 - Lecture 16 - Regression1 Inference for relationships between variables Statistics 111 - Lecture 16](https://reader035.vdocuments.us/reader035/viewer/2022062409/5697bff11a28abf838cbb6c9/html5/thumbnails/17.jpg)
June 30, 2008 Stat 111 - Lecture 16 - Regression
17
Conclusion for Cities Example
• Two-sided alternative: p-value < 2 x 0.0005 = 0.001
• We could get the p-value directly from the JMP output, which is actually more accurate than t-table
• Since our p-value is far less than the usual -level of 0.05, we reject our null hypothesis
• We conclude that there is a statistically significant linear relationship between education and mortality
![Page 18: June 30, 2008Stat 111 - Lecture 16 - Regression1 Inference for relationships between variables Statistics 111 - Lecture 16](https://reader035.vdocuments.us/reader035/viewer/2022062409/5697bff11a28abf838cbb6c9/html5/thumbnails/18.jpg)
June 30, 2008 Stat 111 - Lecture 16 - Regression
18
Next Class - Lecture 17
• More Inference for Regression
• Moore, McCabe and Craig: 10.1, 2.4-2.5