TRANSCRIPT

Unit 6: Simple Linear Regression
Lecture 1: Introduction to SLR

Statistics 101
Thomas Leininger
June 17, 2013
Outline

1 Recap: Chi-square test of independence
  - Ball throwing
  - Expected counts in two-way tables
2 Modeling numerical variables
3 Correlation
4 Fitting a line by least squares regression
  - Residuals
  - Best line
  - The least squares line
  - Prediction & extrapolation
  - Conditions for the least squares line
  - R²
  - Categorical explanatory variables
Recap: Chi-square test of independence Ball throwing
Does ball-throwing ability vary by major?

Going back to our carnival game, should I be worried if a bus-load of public policy majors show up at my booth?

The hypotheses are:
H0: Ball-throwing ability and major are independent. Ball-throwing skills do not vary by major.
HA: Ball-throwing ability and major are dependent. Ball-throwing skills vary by major.

(Image: archery target, https://commons.wikimedia.org/wiki/File:Archery_Target_80cm.svg)

Major           Public Policy   Undeclared   Other   Total
Hit target           40             10         10      60
Missed target        20             30         30      80
Total                60             40         40     140

Note: I multiplied the numbers by 10 to meet our expected cell counts conditions.

Statistics 101 (Thomas Leininger) U6 - L1: Introduction to SLR June 17, 2013 2 / 35
Recap: Chi-square test of independence Ball throwing

Chi-square test of independence

The test statistic is calculated as

χ²_df = Σ_{i=1}^{k} (O − E)² / E,  where df = (R − 1) × (C − 1),

where k is the number of cells, R is the number of rows, and C is the number of columns.

Note: We calculate df differently for one-way and two-way tables.

Expected counts in two-way tables

Expected Count = (row total) × (column total) / (table total)
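The two formulas above can be sketched in plain Python; the table values come from the ball-throwing slide, and the variable names are illustrative:

```python
# Chi-square test of independence, computed directly from the formulas above.
observed = [[40, 10, 10],   # hit target: Public Policy, Undeclared, Other
            [20, 30, 30]]   # missed target

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
table_total = sum(row_totals)

# Expected count = (row total) x (column total) / (table total)
expected = [[r * c / table_total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (O - E)^2 / E over all k cells
chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))
df = (len(observed) - 1) * (len(observed[0]) - 1)

print(round(chi2, 3), df)   # 24.306 2
```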
Recap: Chi-square test of independence Expected counts in two-way tables

Expected counts in two-way tables

Major           Public Policy   Undeclared   Other   Total
Hit target           40             10         10      60
Missed target        20             30         30      80
Total                60             40         40     140

df = (R − 1) × (C − 1) = (2 − 1) × (3 − 1) = 2

χ²_df = Σ_{i=1}^{k} (O − E)² / E = (40 − 25.7)²/25.7 + ··· + (30 − 22.857)²/22.857 = 24.306

p-value: smaller than 0.001

Upper tail   0.3    0.2    0.1    0.05    0.02    0.01    0.005   0.001
df 1         1.07   1.64   2.71   3.84    5.41    6.63    7.88    10.83
   2         2.41   3.22   4.61   5.99    7.82    9.21    10.60   13.82
   3         3.66   4.64   6.25   7.81    9.84    11.34   12.84   16.27
   4         4.88   5.99   7.78   9.49    11.67   13.28   14.86   18.47
   5         6.06   7.29   9.24   11.07   13.39   15.09   16.75   20.52
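As a side check (not part of the lecture): for df = 2 the chi-square upper-tail probability has a simple closed form, P(χ² > x) = exp(−x/2), so the "smaller than 0.001" conclusion can be verified without the table:

```python
import math

# For df = 2 only, the chi-square survival function is P(X > x) = exp(-x / 2).
chi2_stat = 24.306
p_value = math.exp(-chi2_stat / 2)

print(p_value < 0.001)   # True: the p-value is far below 0.001
```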
Modeling numerical variables

So far we have worked with:
- 1 numerical variable (Z, T)
- 1 categorical variable (χ²)
- 1 numerical and 1 categorical variable (2-sample Z/T, ANOVA)
- 2 categorical variables (χ² test for independence)

Next up: relationships between two numerical variables, as well as modeling numerical response variables using a numerical or categorical explanatory variable.

Wed–Friday: modeling numerical variables using many explanatory variables at once.
Poverty vs. HS graduate rate

The scatterplot below shows the relationship between HS graduate rate in all 50 US states and DC and the % of residents who live below the poverty line (income below $23,050 for a family of 4 in 2012).

[Scatterplot: % HS grad (x-axis, 80–90) vs. % in poverty (y-axis, 6–18), one point per state]

Response? % in poverty
Explanatory? % HS grad
Relationship? linear, negative, moderately strong
Correlation

Quantifying the relationship

- Correlation describes the strength of the linear association between two variables.
- It takes values between −1 (perfect negative) and +1 (perfect positive).
- A value of 0 indicates no linear association.
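The correlation coefficient described above can be computed from paired data; a minimal sketch (the tiny data sets here are made up to illustrate the ±1 endpoints):

```python
# Pearson correlation: covariance scaled by the two standard deviations,
# which forces the result into [-1, 1].
def correlation(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0  (perfect positive)
print(correlation([1, 2, 3], [3, 2, 1]))         # -1.0 (perfect negative)
```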
Guessing the correlation

Question
Which of the following is the best guess for the correlation between % in poverty and % HS grad?

(a) 0.6
(b) −0.75
(c) −0.1
(d) 0.02
(e) −1.5

[Scatterplot: % HS grad (80–90) vs. % in poverty (6–18)]
Guessing the correlation

Question
Which of the following is the best guess for the correlation between % in poverty and % female householder (no husband present)?

(a) 0.1
(b) −0.6
(c) −0.4
(d) 0.9
(e) 0.5

[Scatterplot: % female householder, no husband present (8–18) vs. % in poverty (6–18)]
Assessing the correlation

Question
Which of the following has the strongest correlation, i.e. correlation coefficient closest to +1 or −1?

[Four scatterplots labeled (a)–(d)]
Answer: (b), since correlation measures linear association.
Fitting a line by least squares regression
Fitting a line by least squares regression Residuals
Residuals
Residuals are the leftovers from the model fit: Data = Fit + Residual
[Scatterplot: % HS grad (80–90) vs. % in poverty (6–18) with the model fit]
Fitting a line by least squares regression Residuals

Residuals (cont.)

Residual
The residual is the difference between the observed and predicted y:

e_i = y_i − ŷ_i

[Scatterplot with the regression line; DC sits 5.44 above the line, RI sits 4.16 below]

% living in poverty in DC is 5.44% more than predicted.
% living in poverty in RI is 4.16% less than predicted.
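A residual calculation can be sketched in a few lines. The slope and intercept are the ones this lecture fits later (b1 = −0.62, b0 = 64.68); the observation itself is a made-up example state, not DC or RI:

```python
# Residual = observed minus predicted: e_i = y_i - yhat_i.
b0, b1 = 64.68, -0.62   # intercept and slope fitted later in this lecture

x_i, y_i = 85.0, 13.0   # hypothetical state: 85% HS grad, 13% in poverty
y_hat = b0 + b1 * x_i   # predicted % in poverty
e_i = y_i - y_hat       # positive residual: the point sits above the line

print(round(y_hat, 2), round(e_i, 2))
```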
Fitting a line by least squares regression Best line
A measure for the best line

We want a line that has small residuals:

1 Option 1: Minimize the sum of magnitudes (absolute values) of residuals:
  |e₁| + |e₂| + ··· + |eₙ|
2 Option 2: Minimize the sum of squared residuals (least squares):
  e₁² + e₂² + ··· + eₙ²

Why least squares?
1 Most commonly used
2 Easier to compute by hand and using software
3 In many applications, a residual twice as large as another is more than twice as bad
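The two criteria can be compared on a toy set of residuals (made-up numbers); squaring penalizes the single large residual far more heavily than option 1 does:

```python
# Option 1 vs. option 2 on made-up residuals.
residuals = [1.0, -2.0, 0.5, -0.5]

sum_abs = sum(abs(e) for e in residuals)   # |e1| + ... + |en|
sum_sq = sum(e ** 2 for e in residuals)    # e1^2 + ... + en^2

print(sum_abs, sum_sq)   # 4.0 5.5: the -2.0 residual dominates the squared sum
```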
Fitting a line by least squares regression Best line
The least squares line

ŷ = β₀ + β₁x

where ŷ is the predicted y, β₀ is the intercept, β₁ is the slope, and x is the explanatory variable.

Notation:
Intercept: parameter β₀, point estimate b₀
Slope: parameter β₁, point estimate b₁
Fitting a line by least squares regression The least squares line
Given...
[Scatterplot: % HS grad (80–90) vs. % in poverty (6–18)]

              % HS grad (x)   % in poverty (y)
mean          x̄ = 86.01       ȳ = 11.35
sd            s_x = 3.73      s_y = 3.1
correlation   R = −0.75
Fitting a line by least squares regression The least squares line

Slope

The slope of the regression can be calculated as

b₁ = (s_y / s_x) × R

In context:

b₁ = (3.1 / 3.73) × (−0.75) = −0.62

Interpretation
For each % point increase in HS graduate rate, we would expect the % living in poverty to decrease on average by 0.62% points.
Intercept

The intercept is where the regression line intersects the y-axis. The calculation of the intercept uses the fact that a regression line always passes through (x̄, ȳ):

    b0 = ȳ − b1 x̄

[Scatterplot of % in poverty vs. % HS grad with the regression line extended to x = 0, where it crosses the y-axis at the intercept]

    b0 = 11.35 − (−0.62) × 86.01 = 64.68
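The intercept calculation can be checked the same way; a short sketch using the means and slope from the slides:

```python
# Intercept: the line passes through (x-bar, y-bar), so b0 = y_bar - b1 * x_bar
x_bar, y_bar = 86.01, 11.35   # means from the slides
b1 = -0.62                    # slope computed on the previous slide

b0 = y_bar - b1 * x_bar
print(round(b0, 2))           # 64.68
```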
Interpret b0

Question: How do we interpret the intercept? (b0 = 64.68)

[Scatterplot of % in poverty vs. % HS grad with the regression line extended to the y-axis]

States with no HS graduates are expected on average to have 64.68% of residents living below the poverty line.
Recap: Interpretation of slope and intercept

Intercept: When x = 0, y is expected to equal the value of the intercept.

Slope: For each unit increase in x, y is expected to increase/decrease on average by the value of the slope.
Regression line

    predicted % in poverty = 64.68 − 0.62 × % HS grad

[Scatterplot of % in poverty vs. % HS grad with the least squares line]
Fitting a line by least squares regression Prediction & extrapolation
Prediction
Using the linear model to predict the value of the responsevariable for a given value of the explanatory variable is calledprediction, simply by plugging in the value of x in the linear modelequation.There will be some uncertainty associated with the predictedvalue - we’ll talk about this next time.
●
●
●
●
●
●
●●
●
● ●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
80 85 90
6
8
10
12
14
16
18
% HS grad
% in
pov
erty
Statistics 101 (Thomas Leininger) U6 - L1: Introduction to SLR June 17, 2013 21 / 35
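Prediction is just plugging x into the fitted equation; a minimal sketch (the helper name predict_poverty is made up for illustration):

```python
# Prediction: plug x into the fitted line  y-hat = 64.68 - 0.62 x
def predict_poverty(hs_grad):
    """Predicted % in poverty for a given % HS grad (hypothetical helper)."""
    return 64.68 - 0.62 * hs_grad

# Predicted % in poverty for a state with an 82% HS graduation rate
print(round(predict_poverty(82), 2))  # 13.84
```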
Extrapolation

Applying a model estimate to values outside of the realm of the original data is called extrapolation.

Sometimes the intercept might be an extrapolation.

[Scatterplot of % in poverty vs. % HS grad with the regression line extended well beyond the observed range, down to x = 0]
Examples of extrapolation

1 http://www.colbertnation.com/the-colbert-report-videos/269929

2 Sprinting: [figure]
Fitting a line by least squares regression Conditions for the least squares line
Conditions for the least squares line
1 Linearity
2 Nearly normal residuals
3 Constant variability
Conditions: (1) Linearity

The relationship between the explanatory and the response variable should be linear.

Methods for fitting a model to non-linear relationships exist, but are beyond the scope of this class.

Check using a scatterplot of the data, or a residuals plot.

[Panels of scatterplots of y vs. x with fitted lines, each above a plot of summary(g)$residuals vs. x]
Anatomy of a residuals plot

[Scatterplot of % in poverty vs. % HS grad with the regression line, above the corresponding residuals plot]

∗ RI: % HS grad = 81, % in poverty = 10.3
    predicted % in poverty = 64.68 − 0.62 × 81 = 14.46
    e = % in poverty − predicted % in poverty = 10.3 − 14.46 = −4.16

□ DC: % HS grad = 86, % in poverty = 16.8
    predicted % in poverty = 64.68 − 0.62 × 86 = 11.36
    e = % in poverty − predicted % in poverty = 16.8 − 11.36 = 5.44
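The residual calculations for RI and DC can be reproduced directly; a short sketch (the helper name residual is made up for illustration):

```python
# Residual: e = observed y - predicted y, using the fitted line from the slides
def residual(hs_grad, observed_poverty):
    predicted = 64.68 - 0.62 * hs_grad
    return observed_poverty - predicted

print(round(residual(81, 10.3), 2))  # -4.16  (Rhode Island)
print(round(residual(86, 16.8), 2))  # 5.44   (DC)
```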
Conditions: (2) Nearly normal residuals

The residuals should be nearly normal.

This condition may not be satisfied when there are unusual observations that don't follow the trend of the rest of the data.

Check using a histogram or normal probability plot of residuals.

[Histogram of residuals and a normal Q–Q plot (sample quantiles vs. theoretical quantiles)]
Conditions: (3) Constant variability

The variability of points around the least squares line should be roughly constant.

This implies that the variability of residuals around the 0 line should be roughly constant as well.

Also called homoscedasticity.

Check using a residuals plot.

[Scatterplot of % in poverty vs. % HS grad with the least squares line, and the corresponding residuals plot]
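Eyeballing the residuals plot is the standard check, but a crude numeric version is to compare the residual spread over the lower and upper halves of x; a sketch with made-up residuals (not data from the slides):

```python
from statistics import stdev

# Hypothetical (x, residual) pairs -- illustrative numbers only
pairs = [(78, -1.2), (80, 1.0), (82, -0.8), (84, 1.1),
         (86, -0.9), (88, 1.3), (90, -1.1), (92, 0.9)]

# Split at the median x and compare residual spread in each half
mid = sorted(x for x, _ in pairs)[len(pairs) // 2]
low = [e for x, e in pairs if x < mid]
high = [e for x, e in pairs if x >= mid]

# Roughly equal spreads are consistent with constant variability
print(round(stdev(low), 2), round(stdev(high), 2))
```

With these made-up numbers the two spreads come out nearly equal, as constant variability would suggest.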
Checking conditions

Question: What condition is this linear model obviously violating?

(a) Constant variability
(b) Linear relationship
(c) Non-normal residuals
(d) No extreme outliers

[Scatterplot of y vs. x with a fitted line, above a plot of g$residuals vs. x; a second such question followed with a different data set]
Fitting a line by least squares regression R2
R²

The strength of the fit of a linear model is most commonly evaluated using R².

R² is calculated as the square of the correlation coefficient.

It tells us what percent of variability in the response variable is explained by the model.

The remainder of the variability is explained by variables not included in the model.

For the model we've been working with, R² = (−0.75)² = 0.56.
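Squaring the correlation reported with the summary statistics (R = −0.75) gives R²; a one-line check:

```python
# R^2 is the squared correlation coefficient
R = -0.75                   # correlation from the summary statistics slide
R_squared = R ** 2
print(round(R_squared, 2))  # 0.56
```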
Interpretation of R²

Question: Which of the below is the correct interpretation of R = −0.75, R² = 0.56?

(a) 56% of the variability in the % of HS graduates among the 51 states is explained by the model.

(b) 56% of the variability in the % of residents living in poverty among the 51 states is explained by the model.

(c) 56% of the time % HS graduates predict % living in poverty correctly.

(d) 44% of the variability in the % of residents living in poverty among the 51 states is explained by the model.

[Scatterplot of % in poverty vs. % HS grad]
Fitting a line by least squares regression Categorical explanatory variables
Poverty vs. region (east, west)

    predicted poverty = 11.17 + 0.38 × west

Explanatory variable: region, reference level: east

Intercept: The estimated average poverty percentage in eastern states is 11.17%.
    This is the value we get if we plug in 0 for the explanatory variable.

Slope: The estimated average poverty percentage in western states is 0.38% higher than in eastern states.
    Then, the estimated average poverty percentage in western states is 11.17 + 0.38 = 11.55%.
    This is the value we get if we plug in 1 for the explanatory variable.

This is called using a dummy variable.
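Plugging 0 or 1 into the dummy-variable model reproduces the two group means; a minimal sketch (the function name predicted_poverty is made up for illustration):

```python
# Dummy-variable regression: predicted poverty = 11.17 + 0.38 * west
def predicted_poverty(region):
    west = 1 if region == "west" else 0   # east is the reference level
    return 11.17 + 0.38 * west

print(round(predicted_poverty("east"), 2))  # 11.17
print(round(predicted_poverty("west"), 2))  # 11.55
```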