
The Simple Regression Model

Interval Estimation, Section 15.3 (also read confidence and prediction intervals)

Correlation, Section 15.4

Estimation and Tests, Sections 15.5, 15.6, 15.7

PP 9 2

The Standard Error of the Estimate - Se or Sy.x

The least squares method minimizes the sum of squared distances between the predicted y and the observed y, the SSE.

We need a statistic that measures the variability of the observed y values from the predicted y.

It is a measure of the variability of the observed y values around the sample regression line.

It is also our estimate of the scatter of the y values in the population around the population regression line.

It is an estimate of σy|x.

PP 9 3

Standard Error of the Estimate

[Figure: conditional distributions of Y at X = 20, 50, and 80, centered at E(Y|X=20), E(Y|X=50), and E(Y|X=80), each with spread σy|x. Labels: A - Sample, B - Sample, C - Population.]

Se is an estimate of σy|x

PP 9 4

Standard Error of the Estimate

The standard error has the units of the dependent variable, y

The formula requires us to find first the predicted value for each observation in the data set and second, the error term for that observation.

We can calculate the error, or residual, for each observation.

$$ S_e = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n-2}} = \sqrt{\frac{SSE}{n-2}} $$

$$ \hat{y} = b_0 + b_1 x $$

PP 9 5

Calculating a Residual

$$ \hat{y} = 278.26 - 2.83x $$

xi    yi    ŷi      ei
40    165   165     0
54    85    125.2   -40.2
85    9     37.5    -28.5

To calculate the standard error for a sample, use

$$ S_e = \sqrt{\frac{\sum y_i^2 - b_0 \sum y_i - b_1 \sum x_i y_i}{n-2}} $$

PP 9 6

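Calculating the Standard Error of the Estimate

Substituting

$$ S_e = \sqrt{\frac{166692 - (278.26)(1244) - (-2.83)(73237)}{18}} = 39.386 $$

The coefficients are shown rounded; the result uses the unrounded values b0 = 278.2601 and b1 = -2.8317.

Units are deaths per 1000 live births.

Since we chose b0 and b1 to minimize the SSE, we were implicitly minimizing the standard error of the estimate.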
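As a cross-check on the substitution above, here is a minimal Python sketch that recomputes b0, b1, SSE and Se from the sample sums used on these slides (n = 20, Σx = 1526, Σy = 1244, Σx² = 124090, Σy² = 166692, Σxy = 73237). The variable names are illustrative assumptions, not part of the original slides.

```python
import math

# Sample sums taken from the slides
n = 20
sum_x, sum_y = 1526, 1244
sum_x2, sum_y2, sum_xy = 124090, 166692, 73237

# Corrected sums of squares and cross products
sxx = sum_x2 - sum_x ** 2 / n
sxy = sum_xy - sum_x * sum_y / n

# Least squares slope and intercept
b1 = sxy / sxx
b0 = sum_y / n - b1 * sum_x / n

# Computational formula for SSE, then the standard error of the estimate
sse = sum_y2 - b0 * sum_y - b1 * sum_xy
se = math.sqrt(sse / (n - 2))

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, Se = {se:.3f}")
```

Because the formula subtracts large numbers of similar size, rounding b0 and b1 to two decimals before substituting shifts the result noticeably, so it is safer to carry the unrounded coefficients.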

PP 9 7

The Coefficient of Determination

We want to develop a measure of how well the independent variable predicts the dependent variable.

We want to answer the following question: of the total variation among the y’s, how much can be attributed to the relationship between X and Y, and how much can be attributed to chance?

PP 9 8

The Coefficient of Determination

By total variation among the y’s, we mean the changes in Y from one sample observation to another.

Why do the values of Y differ from observation to observation?

The answer, according to our hypothesized regression model, is that the variation in Y is partly due to changes in X, which lead to changes in the expected value of Y, and partly due to chance, that is, the effect of the random error term.

PP 9 9

The Coefficient of Determination

Ask how much of the observed variation in Y can be attributed to the variation in X and how much is due to other factors (error).

Define “sample variation of Y”.

If there were no variation in Y, all the values of Y, when plotted against X, would lie on a horizontal straight line.

That line corresponds to the average value of Y.

PP 9 10

No Variation in Y

[Figure: Y plotted against X with no variation in Y.]

PP 9 11

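The Coefficient of Determination

Now in reality the observed values of Y are scattered around this line.

Variation in Y can be measured as the distance of the observed yi from the average Y, that is, yi - Ȳ.

[Figure: Y plotted against X, showing the point (xi, yi) and its distance yi - Ȳ from the line at Ȳ.]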

PP 9 12

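SST = SSR + SSE

Total variation can be decomposed into explained variation and unexplained variation:

$$ (y_i - \bar{Y}) = (\hat{y}_i - \bar{Y}) + (y_i - \hat{y}_i) $$

$$ \sum (y_i - \bar{Y})^2 = \sum (\hat{y}_i - \bar{Y})^2 + \sum (y_i - \hat{y}_i)^2 $$

SST = SSR + SSE

[Figure: the sample regression line ŷ = b0 + b1x and the line at Ȳ plotted against X, with the deviation yi - Ȳ at Xi split into the explained part ŷ - Ȳ and the unexplained part yi - ŷ.]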
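The decomposition can be checked numerically. The sketch below fits a least squares line to a small set of hypothetical toy data (invented purely for illustration, not taken from the slides) and confirms that SST equals SSR + SSE up to floating-point rounding.

```python
import math

# Hypothetical toy data, chosen only for illustration (not from the slides)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares slope and intercept
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

# Total, explained, and unexplained variation
sst = sum((yi - y_bar) ** 2 for yi in y)
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

print("SST:", sst)
print("SSR + SSE:", ssr + sse)
print("Decomposition holds:", math.isclose(sst, ssr + sse))
```

The identity depends on the line being the least squares fit with an intercept; for an arbitrary line the cross-product term does not vanish.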

PP 9 13

Coefficient of Determination or R2

R2 is the proportion of the variation of Y that can be attributed to the variation of X.

R2 = SSR/SST, or R2 = 1 - SSE/SST

SST = SSR + SSE
SST/SST = SSR/SST + SSE/SST
1 = R2 + SSE/SST

PP 9 14

Coefficient of Determination or R2

R2 describes how well the sample regression line fits the observed data

Tells us the proportion of the total variation in the dependent variable explained by variation in the explanatory variable

R2 is an index.

No units are associated with it.

0 ≤ R2 ≤ 1

PP 9 15

Interpreting R2

R2 = 1 indicates a perfect fit

An R2 close to zero indicates a very poor fit of the regression line to the data

[Figure: two panels showing the sample regression line ŷ = b0 + b1x against Y, one with R2 = 1 and one with R2 = 0.]

PP 9 16

Computational Formulas

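$$ SST = \sum (y_i - \bar{Y})^2 = \sum y_i^2 - \frac{(\sum y_i)^2}{n} $$

$$ SSE = \sum (y_i - \hat{y}_i)^2 = \sum y_i^2 - b_0 \sum y_i - b_1 \sum x_i y_i $$

$$ SST = 166692 - \frac{(1244)^2}{20} = 89315.2 $$

$$ SSE = 166692 - (278.26)(1244) - (-2.83)(73237) = 27922.99 $$

SSR = SST - SSE = 89315.20 - 27922.99 = 61392.21

R2 = 61392.21/89315.20 = 0.68737

Interpret the R2 value in terms of our problem: 68.74% of the variation in mortality rates is explained by variation in immunization rates.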
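The computational formulas can be reproduced from the sample sums alone. The following is a minimal sketch, assuming the sums shown on the slides (n = 20, Σx = 1526, Σy = 1244, Σx² = 124090, Σy² = 166692, Σxy = 73237); the variable names are illustrative.

```python
# Sample sums taken from the slides
n = 20
sum_x, sum_y = 1526, 1244
sum_x2, sum_y2, sum_xy = 124090, 166692, 73237

# Least squares coefficients (unrounded)
sxx = sum_x2 - sum_x ** 2 / n
sxy = sum_xy - sum_x * sum_y / n
b1 = sxy / sxx
b0 = sum_y / n - b1 * sum_x / n

# Computational formulas for the sums of squares
sst = sum_y2 - sum_y ** 2 / n
sse = sum_y2 - b0 * sum_y - b1 * sum_xy
ssr = sst - sse
r2 = ssr / sst

print(f"SST = {sst:.2f}, SSE = {sse:.2f}, SSR = {ssr:.2f}, R2 = {r2:.5f}")
```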

PP 9 17

Interpretation of R2 as a Descriptive Statistic

Suppose we find a very low R2 for a given sample.

This implies that the sample regression line fits the observations poorly.

A possible explanation is that X is a poor explanatory variable.

This is a statement about the population regression line: that is, the population regression line is horizontal.

We can test this with reference to the sample data. The null hypothesis is H0: β1 = 0.

If we do not reject this null hypothesis, we find that Y is influenced only by the random error term.

Another explanation of a low R2 is that X is a relevant explanatory variable but that its influence on Y is weak compared to the influence of the error term.

PP 9 18

Pearson’s Correlation Coefficient

Correlation is used to measure the strength of the linear association between two variables.

The correlation coefficient is an index.

It has no units of measurement.

A positive or negative sign is associated with the measure.

The boundaries for the correlation coefficient are -1 ≤ r ≤ 1.

The values r = 1 and r = -1 occur when there is an exact linear relationship between x and y.

PP 9 19

Pearson’s Correlation Coefficient

r = -1: X and Y are perfectly negatively correlated

r = 1: X and Y are perfectly positively correlated

r = 0: X and Y are uncorrelated

PP 9 20

Pearson’s Correlation Coefficient

As the relationship between x and y deviates from perfect linearity, r moves away from |1| toward 0.

With the data shown in the figure, the correlation model should not be applied.

[Figure: scatter plot of Y against X.]

PP 9 21

Computational Formula for r

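$$ r = \frac{\sum x_i y_i - \frac{\sum x_i \sum y_i}{n}}{\sqrt{\left(\sum x_i^2 - \frac{(\sum x_i)^2}{n}\right)\left(\sum y_i^2 - \frac{(\sum y_i)^2}{n}\right)}} $$

$$ r = \frac{73237 - \frac{(1526)(1244)}{20}}{\sqrt{\left(124090 - \frac{(1526)^2}{20}\right)\left(166692 - \frac{(1244)^2}{20}\right)}} = -0.829 $$

Based on this sample there appears to be a fairly strong linear relationship between the percentage of children immunized in a specified country and its under-5 mortality rate. The correlation coefficient is fairly close to -1, so the relationship is negative: mortality decreases as the percentage immunized increases.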
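The computational formula for r can be evaluated directly from the sample sums. This is a minimal sketch using the sums from the slides; the names sxx, syy, and sxy are illustrative shorthand for the corrected sums of squares and cross products.

```python
import math

# Sample sums taken from the slides
n = 20
sum_x, sum_y = 1526, 1244
sum_x2, sum_y2, sum_xy = 124090, 166692, 73237

# Corrected sums of squares and cross products
sxx = sum_x2 - sum_x ** 2 / n
syy = sum_y2 - sum_y ** 2 / n
sxy = sum_xy - sum_x * sum_y / n

# Pearson's correlation coefficient
r = sxy / math.sqrt(sxx * syy)

print(f"r = {r:.4f}")
```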

PP 9 22

Pearson’s Correlation Coefficient

Limitations of the Correlation Model

The correlation model does not specify the nature of the relationship. Do not infer causality. An effective immunization program might be the primary reason for the decrease in mortality, but it is possible that the immunization program is a small part of an overall health care system that is responsible for the decrease in mortality.

The model measures linear relationships.

The Y values for a given X are assumed to be normally distributed and the X values for a given Y are also assumed to be normally distributed. That is, we are sampling from a “bivariate normal distribution”.

The model is very sensitive to outliers. If there are pairs of data points way outside the range of the other data points, this can alter the value of the correlation coefficient and give misleading results.

Do not extrapolate the correlation coefficient outside the range of data points. The relationship between X and Y may change outside the range of sample points.

PP 9 23

Testing Hypotheses about the Population Correlation Coefficient

Test whether there is a significant correlation, ρ, in the population between X and Y.

H0: ρ = 0 There is no linear association
H1: ρ ≠ 0 There is a significant linear association

The sample correlation coefficient is an unbiased estimator of the population correlation coefficient, which we designate as ρ. That is, E(r) = ρ.

The sampling distribution of the statistic r is approximately normal.

PP 9 24

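Testing Hypotheses about the Population Correlation Coefficient

The standard error of the sample correlation coefficient is

$$ S_r = \sqrt{\frac{1 - r^2}{n-2}} $$

The test statistic is

$$ t_{n-2} = \frac{r - 0}{\sqrt{\frac{1 - r^2}{n-2}}} $$

$$ t_{18} = \frac{-0.8291}{\sqrt{\frac{1 - 0.68737}{18}}} = -6.291 $$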

PP 9 25

Testing Hypotheses about the Population Correlation Coefficient

Critical Value at ⍺ = 0.05: t18,.05/2 = t18,.025 = 2.101

Degrees of freedom = df = n - 2

Decision Rule: if -2.101 ≤ t ≤ 2.101, do not reject H0.

Since t = -6.291 < -2.101, reject H0.

Comparing the test statistic with the critical value, we reject the null hypothesis and conclude that there is a significant linear association between immunization rates and mortality rates.
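The test can be sketched in a few lines of Python. The values r = -0.8291, n = 20, and the critical value 2.101 are taken from the slides; the critical value is hard-coded from the t-table rather than computed, and the variable names are illustrative.

```python
import math

r = -0.8291      # sample correlation coefficient from the slides
n = 20           # sample size
t_crit = 2.101   # t(18, .025) from the t-table, as quoted on the slide

# Standard error of r and the test statistic under H0: rho = 0
s_r = math.sqrt((1 - r ** 2) / (n - 2))
t_stat = (r - 0) / s_r

# Two-tail decision rule: do not reject if -t_crit <= t_stat <= t_crit
reject_h0 = not (-t_crit <= t_stat <= t_crit)

print(f"S_r = {s_r:.4f}, t = {t_stat:.3f}, reject H0: {reject_h0}")
```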

PP 9 26

Relationship between Correlation, R, and Coefficient of Determination, R2

r = R = the square root of the coefficient of determination, R2, taking the sign of the slope b1.

R = -0.829 Correlation coefficient

R2 = 0.687 Coefficient of determination
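Because a square root is never negative, the sign of r has to come from the slope. A minimal sketch, assuming the R Square and slope values reported in the Excel output of this example:

```python
import math

r_squared = 0.687365806   # R Square from the regression output
b1 = -2.831718085         # estimated slope from the regression output

# r has the magnitude sqrt(R2) and the sign of the slope
r = math.copysign(math.sqrt(r_squared), b1)

print(f"r = {r:.6f}")
```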

PP 9 27

Computer Presentation of Correlation Matrix

            MORTRATE       IMMUNRATE
MORTRATE    1
IMMUNRATE   -0.829075272   1

PP 9 28

Inferences about the Population Parameters

We want to create a confidence interval for the slope (or intercept), or to test whether the population slope, β1, (or intercept) equals zero.

Saw before (OLS properties): E(b0) = β0 and E(b1) = β1.

[Figure: Sampling Distributions of b0 and b1, each normal, centered at E(b0) = β0 and E(b1) = β1.]

PP 9 29

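Inferences about the Population Parameters

Among all linear unbiased estimators, OLS estimators have the smallest variance.

The standard errors of b0 and b1 are

$$ \sigma_{b_0} = \sigma_{y|x} \sqrt{\frac{1}{n} + \frac{\bar{X}^2}{\sum (x_i - \bar{X})^2}} $$

$$ \sigma_{b_1} = \frac{\sigma_{y|x}}{\sqrt{\sum (x_i - \bar{X})^2}} $$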

PP 9 30

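Inferences about the Population Parameters

Since σy|x is unknown, we substitute the standard error of the estimate, Se, and use the t distribution.

$$ S_{b_0} = S_e \sqrt{\frac{1}{n} + \frac{\bar{X}^2}{\sum (x_i - \bar{X})^2}} $$

$$ S_{b_1} = \frac{S_e}{\sqrt{\sum (x_i - \bar{X})^2}} $$

$$ S_{b_1} = \frac{39.386}{\sqrt{124090 - \frac{(1526)^2}{20}}} = 0.45013 $$

$$ S_{b_0} = 35.46 $$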
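Both standard errors follow from Se and the sums of x. The sketch below recomputes them from the sample sums on the slides, using the identity Σ(xi - X̄)² = Σx² - (Σx)²/n; variable names are illustrative.

```python
import math

# Sample sums taken from the slides
n = 20
sum_x, sum_y = 1526, 1244
sum_x2, sum_y2, sum_xy = 124090, 166692, 73237

x_bar = sum_x / n
sxx = sum_x2 - sum_x ** 2 / n            # sum of (x_i - x_bar)^2
sxy = sum_xy - sum_x * sum_y / n

# Least squares fit and standard error of the estimate
b1 = sxy / sxx
b0 = sum_y / n - b1 * x_bar
sse = sum_y2 - b0 * sum_y - b1 * sum_xy
se = math.sqrt(sse / (n - 2))

# Estimated standard errors of the intercept and slope
s_b0 = se * math.sqrt(1 / n + x_bar ** 2 / sxx)
s_b1 = se / math.sqrt(sxx)

print(f"S_b0 = {s_b0:.2f}, S_b1 = {s_b1:.5f}")
```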

PP 9 31

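Confidence Intervals Population Slope and Intercept

Use information about the sampling distributions to construct confidence intervals for the population slope and intercept.

If the conditional probability distribution of Y|X follows a normal distribution, then with a degree of confidence of 1 - ⍺:

$$ b_0 - t_{n-2,\alpha/2} S_{b_0} \le \beta_0 \le b_0 + t_{n-2,\alpha/2} S_{b_0} $$

$$ b_1 - t_{n-2,\alpha/2} S_{b_1} \le \beta_1 \le b_1 + t_{n-2,\alpha/2} S_{b_1} $$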

PP 9 32

Confidence Intervals Population Slope and Intercept

t18,.05/2 = t18,.025 = 2.101

For the slope

$$ -2.83 - 2.101(0.4501) \le \beta_1 \le -2.83 + 2.101(0.4501) $$

-3.77 ≤ β1 ≤ -1.89 with a degree of confidence of .95

For the intercept

$$ 278.26 - 2.101(35.46) \le \beta_0 \le 278.26 + 2.101(35.46) $$

203.77 ≤ β0 ≤ 352.75 with a degree of confidence of .95
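The two intervals can be reproduced with a small helper. This is a sketch under stated assumptions: the coefficients and standard errors are the values from the regression output rounded to four decimals, the critical value 2.101 is hard-coded from the t-table, and `confidence_interval` is a hypothetical helper name introduced only for this example.

```python
def confidence_interval(estimate, std_error, t_crit):
    """Return (lower, upper) bounds: estimate -/+ t_crit * std_error."""
    margin = t_crit * std_error
    return estimate - margin, estimate + margin


t_crit = 2.101  # t(18, .025), as quoted on the slide

# Coefficients and standard errors from the regression output (rounded)
b0, s_b0 = 278.2601, 35.4561
b1, s_b1 = -2.8317, 0.4501

slope_lo, slope_hi = confidence_interval(b1, s_b1, t_crit)
int_lo, int_hi = confidence_interval(b0, s_b0, t_crit)

print(f"slope:     {slope_lo:.3f} <= beta1 <= {slope_hi:.3f}")
print(f"intercept: {int_lo:.2f} <= beta0 <= {int_hi:.2f}")
```

Neither interval contains zero, which agrees with the rejection of H0: β1 = 0 in the two-tail test at the same ⍺.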

PP 9 33

Confidence Intervals Population Slope and Intercept

The interval estimates appear wide. Reasons:

Small sample size

Large variation in mortality for given immunization rates

Se is large

PP 9 34

Tests of Hypotheses

The most common type of hypothesis tested with the regression model is that there is no relationship between the explanatory variable X and the dependent variable Y.

The relationship between X and Y is given by the linear dependence of the mean value of Y on X, that is, E(Y|X) = β0 + β1x.

To say there is no relationship means E(Y|X) is not linearly dependent on X, which is to say β1 equals zero.

H0: β1 = 0 There is no relationship between X and Y
H1: β1 ≠ 0 There is a significant relationship between X and Y

If we have a theory that suggests the direction of the relationship, then we will want a one-tail test.

H0: β1 = 0 There is no relationship between X and Y

[Figure: Relationship between Y and X. Scatter plot with fitted line y = 4.4 and R2 = 0; X axis from 0 to 10, Y axis from 0 to 9.]

PP 9 36

Tests of Hypotheses

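The test statistic is

$$ t_{n-2} = \frac{b_1 - \beta_1}{S_{b_1}} = \frac{b_1}{S_{b_1}} $$

Set the level of significance.

Find the critical value in the t-table, df = n - 2.

DR: if (-tcv ≤ t-test ≤ tcv), do not reject

[Figure: Sampling Distribution of b1 under the null hypothesis (normal, centered at 0), with the t(n-2) scale showing "reject" regions below -t and above t and "do not reject" between them.]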

PP 9 37

Tests of Hypotheses

For our problem

H0: β1 ≥ 0 No relationship between X and Y
H1: β1 < 0 An inverse relationship between X and Y

Test statistic

$$ t_{18} = \frac{-2.8317}{0.4501} = -6.291 $$

Let ⍺ = 0.05. Critical value: -t18,0.05 = -1.734

DR: if (-1.734 ≤ t-test), do not reject. Since -6.291 < -1.734, reject.

[Figure: Sampling Distribution of b1 under the null hypothesis, centered at 0, with the rejection region below -1.734 on the t(n-2) scale; the sample value b1 = -2.831 corresponds to t = -6.291, inside the rejection region.]
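The one-tail test on the slope can be sketched as follows. The slope and its standard error are taken from the regression output of this example, and the critical value -1.734 is hard-coded from the t-table as quoted on the slide; the variable names are illustrative.

```python
b1 = -2.831718085     # estimated slope from the regression output
s_b1 = 0.450130083    # standard error of the slope
t_crit = -1.734       # lower-tail critical value, -t(18, .05)

# Test statistic under the null hypothesis value beta1 = 0
t_stat = (b1 - 0) / s_b1

# Lower-tail decision rule: reject H0 when t_stat falls below the critical value
reject_h0 = t_stat < t_crit

print(f"t = {t_stat:.3f}, reject H0: {reject_h0}")
```

The statistic matches the one obtained for the correlation coefficient, since in simple regression the two tests are equivalent.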

PP 9 38

Tests of Hypotheses

Conclude that the immunization rate is significantly and inversely related to the mortality rate.

Remember: you want to reject the null. Rejecting it means you have found that your independent variable is related to the dependent variable.

PP 9 39

Computer Output of the Problem

                     MORTALITY, Y    IMMUNIZED, X
Mean                 62.2            76.3
Standard Error       15.33101432     4.488640634
Median               31              83
Mode                 9               83
Standard Deviation   68.56238036     20.07381117
Sample Variance      4700.8          402.9578947
Range                220             72
Minimum              6               26
Maximum              226             98
Sum                  1244            1526
Count                20              20

PP 9 40

Excel Output

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.829075272
R Square            0.687365806
Adjusted R Square   0.669997239
Standard Error      39.38625365
Observations        20

ANOVA
             df   SS            MS           Significance F
Regression   1    61392.21442   61392.2144   6.25E-06
Residual     18   27922.98558   1551.27698
Total        19   89315.2

             Coefficients    Standard Error   t Stat       P-value
Intercept    278.2600899     35.45613832      7.84800892   3.22E-07
IMMUNRATE    -2.831718085    0.450130083      -6.2908883   6.25E-06

Labels: b0 = Intercept coefficient; b1 = IMMUNRATE coefficient; Sb0 and Sb1 = their standard errors; Se = Standard Error under Regression Statistics.
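One way to sanity-check the output is the simple-regression identity F = MSR/MSE = t², where t is the t Stat of the slope. The F value itself is not printed in the output above, so the sketch below derives it from the two mean squares; the variable names are illustrative.

```python
# Mean squares and slope t statistic taken from the regression output
ms_regression = 61392.2144
ms_residual = 1551.27698
t_slope = -6.2908883

# F statistic of the ANOVA table and the squared slope t statistic
f_stat = ms_regression / ms_residual
t_squared = t_slope ** 2

print(f"F = {f_stat:.4f}, t^2 = {t_squared:.4f}")
```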

PP 9 41

Online Homework - Chapter 15

Overview Simple Regression

CengageNOW fourteenth assignment
