TRANSCRIPT

Unit 6: Simple Linear Regression
Lecture 1: Introduction to SLR

Statistics 101
Thomas Leininger
June 17, 2013
Outline

1 Recap: Chi-square test of independence
  - Ball throwing
  - Expected counts in two-way tables
2 Modeling numerical variables
3 Correlation
4 Fitting a line by least squares regression
  - Residuals
  - Best line
  - The least squares line
  - Prediction & extrapolation
  - Conditions for the least squares line
  - R²
  - Categorical explanatory variables
Recap: Chi-square test of independence Ball throwing
Does ball-throwing ability vary by major?

Going back to our carnival game, should I be worried if a bus-load of public policy majors show up at my booth?

The hypotheses are:
H0: Ball-throwing ability and major are independent. Ball-throwing skills do not vary by major.
HA: Ball-throwing ability and major are dependent. Ball-throwing skills vary by major.

(Image: archery target, https://commons.wikimedia.org/wiki/File:Archery_Target_80cm.svg)

Major           Public Policy   Undeclared   Other   Total
Hit target           40             10         10      60
Missed target        20             30         30      80
Total                60             40         40     140

Note: I multiplied the numbers by 10 to meet our expected cell counts conditions.

Statistics 101 (Thomas Leininger) U6 - L1: Introduction to SLR June 17, 2013 2 / 35
Recap: Chi-square test of independence Ball throwing

Chi-square test of independence

The test statistic is calculated as

χ²_df = Σ_{i=1}^{k} (O − E)² / E,  where df = (R − 1) × (C − 1),

where k is the number of cells, R is the number of rows, and C is the number of columns.

Note: We calculate df differently for one-way and two-way tables.

Expected counts in two-way tables

Expected Count = (row total) × (column total) / (table total)
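The two formulas above can be sketched in plain Python; the table values come from the ball-throwing slide, and the variable names are illustrative:

```python
# Chi-square test of independence, computed directly from the formulas above.
observed = [[40, 10, 10],   # hit target: Public Policy, Undeclared, Other
            [20, 30, 30]]   # missed target

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
table_total = sum(row_totals)

# Expected count = (row total) x (column total) / (table total)
expected = [[r * c / table_total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (O - E)^2 / E over all k cells
chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))
df = (len(observed) - 1) * (len(observed[0]) - 1)

print(round(chi2, 3), df)   # 24.306 2
```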
Recap: Chi-square test of independence Expected counts in two-way tables

Expected counts in two-way tables

Major           Public Policy   Undeclared   Other   Total
Hit target           40             10         10      60
Missed target        20             30         30      80
Total                60             40         40     140

df = (R − 1) × (C − 1) = (2 − 1) × (3 − 1) = 2

χ²_df = Σ_{i=1}^{k} (O − E)² / E = (40 − 25.7)²/25.7 + ··· + (30 − 22.857)²/22.857 = 24.306

p-value: smaller than 0.001

Upper tail   0.3    0.2    0.1    0.05    0.02    0.01    0.005   0.001
df 1         1.07   1.64   2.71   3.84    5.41    6.63    7.88    10.83
   2         2.41   3.22   4.61   5.99    7.82    9.21    10.60   13.82
   3         3.66   4.64   6.25   7.81    9.84    11.34   12.84   16.27
   4         4.88   5.99   7.78   9.49    11.67   13.28   14.86   18.47
   5         6.06   7.29   9.24   11.07   13.39   15.09   16.75   20.52
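As a side check (not part of the lecture): for df = 2 the chi-square upper-tail probability has a simple closed form, P(χ² > x) = exp(−x/2), so the "smaller than 0.001" conclusion can be verified without the table:

```python
import math

# For df = 2 only, the chi-square survival function is P(X > x) = exp(-x / 2).
chi2_stat = 24.306
p_value = math.exp(-chi2_stat / 2)

print(p_value < 0.001)   # True: the p-value is far below 0.001
```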
Modeling numerical variables

So far we have worked with:
- 1 numerical variable (Z, T)
- 1 categorical variable (χ²)
- 1 numerical and 1 categorical variable (2-sample Z/T, ANOVA)
- 2 categorical variables (χ² test for independence)

Next up: relationships between two numerical variables, as well as modeling numerical response variables using a numerical or categorical explanatory variable.

Wed–Friday: modeling numerical variables using many explanatory variables at once.
Poverty vs. HS graduate rate

The scatterplot below shows the relationship between HS graduate rate in all 50 US states and DC and the % of residents who live below the poverty line (income below $23,050 for a family of 4 in 2012).

[Scatterplot: % HS grad (x-axis, 80–90) vs. % in poverty (y-axis, 6–18), one point per state]

Response? % in poverty
Explanatory? % HS grad
Relationship? linear, negative, moderately strong
Correlation

Quantifying the relationship

- Correlation describes the strength of the linear association between two variables.
- It takes values between −1 (perfect negative) and +1 (perfect positive).
- A value of 0 indicates no linear association.
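The correlation coefficient described above can be computed from paired data; a minimal sketch (the tiny data sets here are made up to illustrate the ±1 endpoints):

```python
# Pearson correlation: covariance scaled by the two standard deviations,
# which forces the result into [-1, 1].
def correlation(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0  (perfect positive)
print(correlation([1, 2, 3], [3, 2, 1]))         # -1.0 (perfect negative)
```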
Guessing the correlation

Question
Which of the following is the best guess for the correlation between % in poverty and % HS grad?

(a) 0.6
(b) −0.75
(c) −0.1
(d) 0.02
(e) −1.5

[Scatterplot: % HS grad (80–90) vs. % in poverty (6–18)]
Guessing the correlation

Question
Which of the following is the best guess for the correlation between % in poverty and % female householder (no husband present)?

(a) 0.1
(b) −0.6
(c) −0.4
(d) 0.9
(e) 0.5

[Scatterplot: % female householder, no husband present (8–18) vs. % in poverty (6–18)]
Assessing the correlation

Question
Which of the following has the strongest correlation, i.e. correlation coefficient closest to +1 or −1?

[Four scatterplots labeled (a)–(d)]
Answer: (b), since correlation measures linear association.
Fitting a line by least squares regression
Fitting a line by least squares regression Residuals
Residuals
Residuals are the leftovers from the model fit: Data = Fit + Residual
[Scatterplot: % HS grad (80–90) vs. % in poverty (6–18) with the model fit]
Fitting a line by least squares regression Residuals

Residuals (cont.)

Residual
The residual is the difference between the observed and predicted y:

e_i = y_i − ŷ_i

[Scatterplot with the regression line; DC sits 5.44 above the line, RI sits 4.16 below]

% living in poverty in DC is 5.44% more than predicted.
% living in poverty in RI is 4.16% less than predicted.
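A residual calculation can be sketched in a few lines. The slope and intercept are the ones this lecture fits later (b1 = −0.62, b0 = 64.68); the observation itself is a made-up example state, not DC or RI:

```python
# Residual = observed minus predicted: e_i = y_i - yhat_i.
b0, b1 = 64.68, -0.62   # intercept and slope fitted later in this lecture

x_i, y_i = 85.0, 13.0   # hypothetical state: 85% HS grad, 13% in poverty
y_hat = b0 + b1 * x_i   # predicted % in poverty
e_i = y_i - y_hat       # positive residual: the point sits above the line

print(round(y_hat, 2), round(e_i, 2))
```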
Fitting a line by least squares regression Best line
A measure for the best line

We want a line that has small residuals:

1 Option 1: Minimize the sum of magnitudes (absolute values) of residuals:
  |e₁| + |e₂| + ··· + |eₙ|
2 Option 2: Minimize the sum of squared residuals (least squares):
  e₁² + e₂² + ··· + eₙ²

Why least squares?
1 Most commonly used
2 Easier to compute by hand and using software
3 In many applications, a residual twice as large as another is more than twice as bad
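The two criteria can be compared on a toy set of residuals (made-up numbers); squaring penalizes the single large residual far more heavily than option 1 does:

```python
# Option 1 vs. option 2 on made-up residuals.
residuals = [1.0, -2.0, 0.5, -0.5]

sum_abs = sum(abs(e) for e in residuals)   # |e1| + ... + |en|
sum_sq = sum(e ** 2 for e in residuals)    # e1^2 + ... + en^2

print(sum_abs, sum_sq)   # 4.0 5.5: the -2.0 residual dominates the squared sum
```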
Fitting a line by least squares regression Best line
The least squares line

ŷ = β₀ + β₁x

where ŷ is the predicted y, β₀ is the intercept, β₁ is the slope, and x is the explanatory variable.

Notation:
Intercept: parameter β₀, point estimate b₀
Slope: parameter β₁, point estimate b₁
Fitting a line by least squares regression The least squares line
Given...
[Scatterplot: % HS grad (80–90) vs. % in poverty (6–18)]

              % HS grad (x)   % in poverty (y)
mean          x̄ = 86.01       ȳ = 11.35
sd            s_x = 3.73      s_y = 3.1
correlation   R = −0.75
Fitting a line by least squares regression The least squares line

Slope

The slope of the regression can be calculated as

b₁ = (s_y / s_x) × R

In context:

b₁ = (3.1 / 3.73) × (−0.75) = −0.62

Interpretation
For each % point increase in HS graduate rate, we would expect the % living in poverty to decrease on average by 0.62% points.
Intercept

The intercept is where the regression line intersects the y-axis. The calculation of the intercept uses the fact that a regression line always passes through (x̄, ȳ):

    b0 = ȳ − b1 x̄

[Scatterplot of % in poverty vs. % HS grad with the regression line extended to x = 0, where it crosses the y-axis at the intercept]

    b0 = 11.35 − (−0.62) × 86.01 = 64.68
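The intercept calculation can be checked the same way; a short sketch using the means and slope from the slides:

```python
# Intercept: the line passes through (x-bar, y-bar), so b0 = y_bar - b1 * x_bar
x_bar, y_bar = 86.01, 11.35   # means from the slides
b1 = -0.62                    # slope computed on the previous slide

b0 = y_bar - b1 * x_bar
print(round(b0, 2))           # 64.68
```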
Interpret b0

Question: How do we interpret the intercept? (b0 = 64.68)

[Scatterplot of % in poverty vs. % HS grad with the regression line extended to the y-axis]

States with no HS graduates are expected on average to have 64.68% of residents living below the poverty line.
Recap: Interpretation of slope and intercept

Intercept: When x = 0, y is expected to equal the value of the intercept.

Slope: For each unit increase in x, y is expected to increase/decrease on average by the value of the slope.
Regression line

    predicted % in poverty = 64.68 − 0.62 × % HS grad

[Scatterplot of % in poverty vs. % HS grad with the least squares line]
Fitting a line by least squares regression Prediction & extrapolation
Prediction
Using the linear model to predict the value of the responsevariable for a given value of the explanatory variable is calledprediction, simply by plugging in the value of x in the linear modelequation.There will be some uncertainty associated with the predictedvalue - we’ll talk about this next time.
●
●
●
●
●
●
●●
●
● ●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
80 85 90
6
8
10
12
14
16
18
% HS grad
% in
pov
erty
Statistics 101 (Thomas Leininger) U6 - L1: Introduction to SLR June 17, 2013 21 / 35
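Prediction is just plugging x into the fitted equation; a minimal sketch (the helper name predict_poverty is made up for illustration):

```python
# Prediction: plug x into the fitted line  y-hat = 64.68 - 0.62 x
def predict_poverty(hs_grad):
    """Predicted % in poverty for a given % HS grad (hypothetical helper)."""
    return 64.68 - 0.62 * hs_grad

# Predicted % in poverty for a state with an 82% HS graduation rate
print(round(predict_poverty(82), 2))  # 13.84
```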
Extrapolation

Applying a model estimate to values outside of the realm of the original data is called extrapolation.

Sometimes the intercept might be an extrapolation.

[Scatterplot of % in poverty vs. % HS grad with the regression line extended well beyond the observed range, down to x = 0]
Examples of extrapolation

1 http://www.colbertnation.com/the-colbert-report-videos/269929

2 Sprinting: [figure]
Fitting a line by least squares regression Conditions for the least squares line
Conditions for the least squares line
1 Linearity
2 Nearly normal residuals
3 Constant variability
Conditions: (1) Linearity

The relationship between the explanatory and the response variable should be linear.

Methods for fitting a model to non-linear relationships exist, but are beyond the scope of this class.

Check using a scatterplot of the data, or a residuals plot.

[Panels of scatterplots of y vs. x with fitted lines, each above a plot of summary(g)$residuals vs. x]
Anatomy of a residuals plot

[Scatterplot of % in poverty vs. % HS grad with the regression line, above the corresponding residuals plot]

∗ RI: % HS grad = 81, % in poverty = 10.3
    predicted % in poverty = 64.68 − 0.62 × 81 = 14.46
    e = % in poverty − predicted % in poverty = 10.3 − 14.46 = −4.16

□ DC: % HS grad = 86, % in poverty = 16.8
    predicted % in poverty = 64.68 − 0.62 × 86 = 11.36
    e = % in poverty − predicted % in poverty = 16.8 − 11.36 = 5.44
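The residual calculations for RI and DC can be reproduced directly; a short sketch (the helper name residual is made up for illustration):

```python
# Residual: e = observed y - predicted y, using the fitted line from the slides
def residual(hs_grad, observed_poverty):
    predicted = 64.68 - 0.62 * hs_grad
    return observed_poverty - predicted

print(round(residual(81, 10.3), 2))  # -4.16  (Rhode Island)
print(round(residual(86, 16.8), 2))  # 5.44   (DC)
```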
Conditions: (2) Nearly normal residuals

The residuals should be nearly normal.

This condition may not be satisfied when there are unusual observations that don't follow the trend of the rest of the data.

Check using a histogram or normal probability plot of residuals.

[Histogram of residuals and a normal Q–Q plot (sample quantiles vs. theoretical quantiles)]
Conditions: (3) Constant variability

The variability of points around the least squares line should be roughly constant.

This implies that the variability of residuals around the 0 line should be roughly constant as well.

Also called homoscedasticity.

Check using a residuals plot.

[Scatterplot of % in poverty vs. % HS grad with the least squares line, and the corresponding residuals plot]
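Eyeballing the residuals plot is the standard check, but a crude numeric version is to compare the residual spread over the lower and upper halves of x; a sketch with made-up residuals (not data from the slides):

```python
from statistics import stdev

# Hypothetical (x, residual) pairs -- illustrative numbers only
pairs = [(78, -1.2), (80, 1.0), (82, -0.8), (84, 1.1),
         (86, -0.9), (88, 1.3), (90, -1.1), (92, 0.9)]

# Split at the median x and compare residual spread in each half
mid = sorted(x for x, _ in pairs)[len(pairs) // 2]
low = [e for x, e in pairs if x < mid]
high = [e for x, e in pairs if x >= mid]

# Roughly equal spreads are consistent with constant variability
print(round(stdev(low), 2), round(stdev(high), 2))
```

With these made-up numbers the two spreads come out nearly equal, as constant variability would suggest.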
Checking conditions

Question: What condition is this linear model obviously violating?

(a) Constant variability
(b) Linear relationship
(c) Non-normal residuals
(d) No extreme outliers

[Scatterplot of y vs. x with a fitted line, above a plot of g$residuals vs. x; a second such question followed with a different data set]
Fitting a line by least squares regression R2
R²

The strength of the fit of a linear model is most commonly evaluated using R².

R² is calculated as the square of the correlation coefficient.

It tells us what percent of variability in the response variable is explained by the model.

The remainder of the variability is explained by variables not included in the model.

For the model we've been working with, R² = (−0.75)² = 0.56.
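Squaring the correlation reported with the summary statistics (R = −0.75) gives R²; a one-line check:

```python
# R^2 is the squared correlation coefficient
R = -0.75                   # correlation from the summary statistics slide
R_squared = R ** 2
print(round(R_squared, 2))  # 0.56
```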
Interpretation of R²

Question: Which of the below is the correct interpretation of R = −0.75, R² = 0.56?

(a) 56% of the variability in the % of HS graduates among the 51 states is explained by the model.

(b) 56% of the variability in the % of residents living in poverty among the 51 states is explained by the model.

(c) 56% of the time % HS graduates predict % living in poverty correctly.

(d) 44% of the variability in the % of residents living in poverty among the 51 states is explained by the model.

[Scatterplot of % in poverty vs. % HS grad]
Fitting a line by least squares regression Categorical explanatory variables
Poverty vs. region (east, west)

    predicted poverty = 11.17 + 0.38 × west

Explanatory variable: region, reference level: east

Intercept: The estimated average poverty percentage in eastern states is 11.17%.
    This is the value we get if we plug in 0 for the explanatory variable.

Slope: The estimated average poverty percentage in western states is 0.38% higher than in eastern states.
    Then, the estimated average poverty percentage in western states is 11.17 + 0.38 = 11.55%.
    This is the value we get if we plug in 1 for the explanatory variable.

This is called using a dummy variable.
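Plugging 0 or 1 into the dummy-variable model reproduces the two group means; a minimal sketch (the function name predicted_poverty is made up for illustration):

```python
# Dummy-variable regression: predicted poverty = 11.17 + 0.38 * west
def predicted_poverty(region):
    west = 1 if region == "west" else 0   # east is the reference level
    return 11.17 + 0.38 * west

print(round(predicted_poverty("east"), 2))  # 11.17
print(round(predicted_poverty("west"), 2))  # 11.55
```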