Post on 20-Dec-2015
The Simple Regression Model
Interval Estimation, Section 15.3 (also read Confidence and Prediction Intervals)
Correlation, Section 15.4
Estimation and Tests, Sections 15.5, 15.6, 15.7
PP 9 2
The Standard Error of the Estimate - Se or Sy.x
The least squares method minimizes the sum of squared distances between the predicted y and the observed y, the SSE.
We need a statistic that measures the variability of the observed y values around the predicted y.
It is a measure of the variability of the observed y values around the sample regression line.
It is also our estimate of the scatter of the y values in the population around the population regression line.
It is an estimate of σy|x.
PP 9 3
Standard Error of the Estimate
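[Figure: Y plotted against X, with the conditional means E(Y|X=20), E(Y|X=50), and E(Y|X=80) marked at X = 20, 50, and 80. The conditional distribution of Y at each X has the same spread, σy|x. Labels: A - Sample, B - Sample, C - Population.]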
Se is an estimate of σy|x
PP 9 4
Standard Error of the Estimate
The standard error has the units of the dependent variable, y.
The formula requires us to find first the predicted value for each observation in the data set and second, the error term for that observation.
We can calculate the error, or residual, for an observation as $e_i = y_i - \hat{y}_i$, where $\hat{y}_i = b_0 + b_1 x_i$.
$$S_e = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n-2}} = \sqrt{\frac{SSE}{n-2}}$$
PP 9 5
Calculating a Residual
$\hat{y} = 278.26 - 2.83x$
xi    yi    ŷi      ei
40    165   165     0
54    85    125.2   -40.2
85    9     37.5    -28.5
To calculate the standard error for a sample, use
$$S_e = \sqrt{\frac{\sum y_i^2 - b_0 \sum y_i - b_1 \sum x_i y_i}{n-2}}$$
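The residual calculation for the three observations in the table can be sketched in a few lines of Python. This is only an illustration: the function names `predict` and `residual` are made up here, and the coefficients are the rounded values from the slide, so the printed values may differ slightly from the table.

```python
# Fitted line from the slide (rounded coefficients): y-hat = 278.26 - 2.83 x
B0 = 278.26
B1 = -2.83

def predict(x):
    """Predicted mortality rate for immunization rate x."""
    return B0 + B1 * x

def residual(x, y):
    """Residual e = y - y-hat for one observation."""
    return y - predict(x)

# The three (x, y) observations shown on the slide.
observations = [(40, 165), (54, 85), (85, 9)]

for x, y in observations:
    print(f"x={x:3d}  y={y:4d}  y_hat={predict(x):7.2f}  e={residual(x, y):7.2f}")
```

A negative residual means the observed mortality rate lies below the fitted line.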
PP 9 6
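Calculating the Standard Error of the Estimate
Substituting
$$S_e = \sqrt{\frac{166692 - (278.26)(1244) - (-2.83)(73237)}{18}} = 39.386$$
(The value 39.386 is obtained with unrounded b0 and b1; the rounded coefficients shown give about 39.30.)
Units are deaths per 1000 live births.
Since we choose b0 and b1 to minimize the SSE, we were implicitly minimizing the standard error of the estimate.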
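As a rough check of this substitution, the standard error of the estimate can be recomputed from the column sums that appear in the slides (n = 20, Σx = 1526, Σy = 1244, Σxy = 73237, Σx² = 124090, Σy² = 166692). The variable names below are assumptions made for this sketch.

```python
import math

# Column sums for the n = 20 countries, as quoted in the slides.
n = 20
sum_x, sum_y = 1526, 1244
sum_xy, sum_x2, sum_y2 = 73237, 124090, 166692

# Least squares estimates from the sums.
sxx = sum_x2 - sum_x ** 2 / n        # sum of (x - x_bar)^2
sxy = sum_xy - sum_x * sum_y / n     # sum of (x - x_bar)(y - y_bar)
b1 = sxy / sxx
b0 = sum_y / n - b1 * sum_x / n

# Computational formula: SSE = sum(y^2) - b0*sum(y) - b1*sum(xy)
sse = sum_y2 - b0 * sum_y - b1 * sum_xy
se = math.sqrt(sse / (n - 2))

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, Se = {se:.3f}")
```

Because b0 and b1 are not rounded here, the result agrees with the Se reported in the regression output rather than with a hand calculation that uses 278.26 and -2.83.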
PP 9 7
The Coefficient of Determination
We want to develop a measure of how well the independent variable predicts the dependent variable.
We want to answer the following question: of the total variation among the y’s, how much can be attributed to the relationship between X and Y, and how much can be attributed to chance?
PP 9 8
The Coefficient of Determination
By total variation among the y’s, we mean the changes in Y from one sample observation to another.
Why do the values of Y differ from observation to observation?
The answer, according to our hypothesized regression model, is that the variation in Y is partly due to changes in X, which lead to changes in the expected value of Y, and partly due to chance, that is, the effect of the random error term.
PP 9 9
The Coefficient of Determination
Ask how much of the observed variation in Y can be attributed to the variation in X and how much is due to other factors (error).
Define “sample variation of Y”.
If there were no variation in Y, all the values of Y, when plotted against X, would lie on a horizontal straight line at the average value of Y.
PP 9 10
No Variation in Y
[Figure: Y plotted against X, illustrating the case of no variation in Y.]
PP 9 11
The Coefficient of Determination
In reality, the observed values of Y are scattered around this line.
Variation in Y can be measured as the distance of the observed $y_i$ from the average, $y_i - \bar{Y}$.
[Figure: Y plotted against X, showing the deviation $y_i - \bar{Y}$ for the point $(x_i, y_i)$.]
PP 9 12
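SST = SSR + SSE
Total variation can be decomposed into explained variation and unexplained variation.
$$y_i - \bar{Y} = (\hat{y}_i - \bar{Y}) + (y_i - \hat{y}_i)$$
$$\sum (y_i - \bar{Y})^2 = \sum (\hat{y}_i - \bar{Y})^2 + \sum (y_i - \hat{y}_i)^2$$
[Figure: Y plotted against X with the fitted line $\hat{y} = b_0 + b_1 x$; at $x_i$ the total deviation $y_i - \bar{Y}$ is split into the explained part $\hat{y}_i - \bar{Y}$ and the unexplained part $y_i - \hat{y}_i$.]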
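The decomposition can be checked numerically. The sketch below uses a small hypothetical data set (the five points are invented for illustration and are not from the slides), fits the least squares line, and compares SST with SSR + SSE.

```python
# Hypothetical toy data, not from the slides.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least squares slope and intercept.
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * x for x in xs]

sst = sum((y - y_bar) ** 2 for y in ys)                 # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained variation
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))    # unexplained variation

print(f"SST = {sst:.4f}, SSR + SSE = {ssr + sse:.4f}")
```

The identity holds for the least squares line specifically; for an arbitrary line through the data the cross-product term does not vanish.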
PP 9 13
Coefficient of Determination or R²
R² is the proportion of the variation of Y that can be attributed to the variation of X.
R² = SSR/SST or R² = 1 - SSE/SST
SST = SSR + SSE
SST/SST = SSR/SST + SSE/SST
1 = R² + SSE/SST
PP 9 14
Coefficient of Determination or R²
R² describes how well the sample regression line fits the observed data.
It tells us the proportion of the total variation in the dependent variable explained by variation in the explanatory variable.
R² is an index with no units associated: 0 ≤ R² ≤ 1.
PP 9 15
Interpreting R²
R² = 1 indicates a perfect fit.
An R² close to zero indicates a very poor fit of the regression line to the data.
[Figure: two panels of Y against X with the fitted line $\hat{y} = b_0 + b_1 x$, one with R² = 1 and one with R² = 0.]
PP 9 16
Computational Formulas
$$SST = \sum (y_i - \bar{Y})^2 = \sum y_i^2 - \frac{(\sum y_i)^2}{n}$$
$$SSE = \sum (y_i - \hat{y}_i)^2 = \sum y_i^2 - b_0 \sum y_i - b_1 \sum x_i y_i$$
$$SST = 166692 - \frac{(1244)^2}{20} = 89315.20$$
$$SSE = 166692 - 278.26(1244) - (-2.83)(73237) = 27922.99$$
SSR = 89315.20 – 27922.99 = 61392.21
R² = 61392.21/89315.20 = 0.68737
Interpret the R² value in terms of our problem: 68.74% of the variation in mortality rates is explained by variation in immunization rates.
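A minimal sketch of these computational formulas, again starting from the column sums quoted in the slides. The variable names are assumptions for this example, and the coefficients are left unrounded.

```python
# Column sums for the n = 20 countries, as quoted in the slides.
n = 20
sum_x, sum_y = 1526, 1244
sum_xy, sum_x2, sum_y2 = 73237, 124090, 166692

# Least squares estimates from the sums.
sxx = sum_x2 - sum_x ** 2 / n
sxy = sum_xy - sum_x * sum_y / n
b1 = sxy / sxx
b0 = sum_y / n - b1 * sum_x / n

sst = sum_y2 - sum_y ** 2 / n                 # total variation
sse = sum_y2 - b0 * sum_y - b1 * sum_xy       # unexplained variation
ssr = sst - sse                               # explained variation
r_squared = ssr / sst

print(f"SST = {sst:.2f}, SSE = {sse:.2f}, SSR = {ssr:.2f}, R^2 = {r_squared:.5f}")
```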
PP 9 17
Interpretation of R² as a Descriptive Statistic
Suppose we find a very low R² for a given sample. This implies that the sample regression line fits the observations poorly.
A possible explanation is that X is a poor explanatory variable. This is a statement about the population regression line: that is, the population regression line is horizontal.
We can test this with reference to the sample data. The null hypothesis is H0: β1 = 0. If we do not reject this null hypothesis, we have no evidence that Y depends on X; under the model, Y is then influenced only by the random error term.
Another explanation of a low R² is that X is a relevant explanatory variable, but that its influence on Y is weak compared to the influence of the error term.
PP 9 18
Pearson’s Correlation Coefficient
Correlation is used to measure the strength of the linear association between two variables.
The correlation coefficient is an index: it has no units of measurement, and a positive or negative sign is associated with the measure.
The boundaries for the correlation coefficient are -1 ≤ r ≤ 1.
The values r = 1 and r = -1 occur when there is an exact linear relationship between x and y.
PP 9 19
Pearson’s Correlation Coefficient
r = -1: X and Y are perfectly negatively correlated.
r = 1: X and Y are perfectly positively correlated.
r = 0: X and Y are uncorrelated.
PP 9 20
Pearson’s Correlation Coefficient
As the relationship between x and y deviates from perfect linearity, |r| moves away from 1 toward 0.
With the data shown in the figure, the correlation model should not be applied.
[Figure: scatter plot of Y against X for which the correlation model should not be applied.]
PP 9 21
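Computational Formula for r
$$r = \frac{\sum x_i y_i - \dfrac{\sum x_i \sum y_i}{n}}{\sqrt{\left(\sum x_i^2 - \dfrac{(\sum x_i)^2}{n}\right)\left(\sum y_i^2 - \dfrac{(\sum y_i)^2}{n}\right)}}$$
$$r = \frac{73237 - \dfrac{(1526)(1244)}{20}}{\sqrt{\left(124090 - \dfrac{(1526)^2}{20}\right)\left(166692 - \dfrac{(1244)^2}{20}\right)}} = -0.829$$
Based on this sample there appears to be a fairly strong linear relationship between the percentage of children immunized in a specified country and its under-5 mortality rate. The correlation coefficient is fairly close to -1, so the relationship is negative: mortality decreases as percent immunized increases.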
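The computational formula for r translates directly into code. The sketch below plugs in the column sums quoted in the slides; the variable names are assumptions made for this example.

```python
import math

# Column sums for the n = 20 countries, as quoted in the slides.
n = 20
sum_x, sum_y = 1526, 1244
sum_xy, sum_x2, sum_y2 = 73237, 124090, 166692

numerator = sum_xy - sum_x * sum_y / n
denominator = math.sqrt(
    (sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n)
)
r = numerator / denominator

print(f"r = {r:.4f}")
```

The sign of r comes entirely from the numerator, which is negative here because high immunization rates go with low mortality rates.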
PP 9 22
Pearson’s Correlation Coefficient
Limitations of the Correlation Model
The correlation model does not specify the nature of the relationship. Do not infer causality. An effective immunization program might be the primary reason for the decrease in mortality, but it is possible that the immunization program is a small part of an overall health care system that is responsible for the decrease in mortality.
The model measures linear relationships. The Y values for a given X are assumed to be normally distributed and the X values for a given Y are also assumed to be normally distributed, that is, sampling from a “bivariate normal distribution”.
The model is very sensitive to outliers. If there are pairs of data points far outside the range of the other data points, this can alter the value of the correlation coefficient and give misleading results.
Do not extrapolate the correlation coefficient outside the range of data points. The relationship between X and Y may change outside the range of sample points.
PP 9 23
Testing Hypotheses about the Population Correlation Coefficient
Test whether there is a significant correlation, ρ, in the population between X and Y.
H0: ρ = 0. There is no linear association.
H1: ρ ≠ 0. There is a significant linear association.
The sample correlation coefficient, r, is an approximately unbiased estimator of the population correlation coefficient, which we designate as ρ. That is, E(r) ≈ ρ (exactly so when ρ = 0).
The sampling distribution of the statistic r is approximately normally distributed.
PP 9 24
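Testing Hypotheses about the Population Correlation Coefficient
The standard error of the sample correlation coefficient is
$$S_r = \sqrt{\frac{1 - r^2}{n - 2}}$$
The test statistic is
$$t_{n-2} = \frac{r - 0}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$
$$t_{18} = \frac{-0.8291}{\sqrt{\dfrac{1 - 0.68737}{18}}} = -6.291$$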
PP 9 25
Testing Hypotheses about the Population Correlation Coefficient
Critical value at α = 0.05: t18,.05/2 = t18,.025 = 2.101
Degrees of freedom = df = n - 2
Decision rule: if (-2.101 ≤ t ≤ 2.101), do not reject. Here t = -6.291 falls outside this interval; therefore, reject.
Comparing the test statistic with the critical value, we reject the null hypothesis and conclude that there is a significant linear association between immunization rates and mortality rates.
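The two-tailed test can be sketched as follows. The critical value 2.101 is taken from the slide rather than computed, and the names `r`, `t_stat`, and `reject_h0` are assumptions made for this illustration.

```python
import math

# Values quoted in the slides.
r = -0.8291          # sample correlation coefficient
n = 20               # sample size
t_critical = 2.101   # t(18, 0.025), read from a t-table

# Standard error of r and the test statistic for H0: rho = 0.
s_r = math.sqrt((1 - r ** 2) / (n - 2))
t_stat = (r - 0) / s_r

# Two-tailed decision rule: reject when t falls outside [-t_cv, t_cv].
reject_h0 = abs(t_stat) > t_critical

print(f"S_r = {s_r:.4f}, t = {t_stat:.3f}, reject H0: {reject_h0}")
```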
PP 9 26
Relationship between Correlation, R, and Coefficient of Determination, R²
In simple regression, r = R = the square root of the coefficient of determination, R², taken with the sign of the slope b1.
R = -0.829 Correlation coefficient
R² = 0.687 Coefficient of determination
PP 9 27
Computer Presentation of Correlation Matrix
            MORTRATE       IMMUNRATE
MORTRATE    1
IMMUNRATE   -0.829075272   1
PP 9 28
Inferences about the Population Parameters
We want to create a confidence interval for the slope (or intercept), or to test whether the population slope, β1, (or intercept, β0) equals zero.
Saw before (OLS properties):
E(b0) = β0
E(b1) = β1
[Figure: Sampling Distributions of b0 and b1, each normal and centered at β0 and β1 respectively.]
PP 9 29
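Inferences about the Population Parameters
Among all linear unbiased estimators, OLS estimators have the smallest variance.
The standard errors of b0 and b1 are
$$\sigma_{b_0} = \sigma_{y|x}\sqrt{\frac{1}{n} + \frac{\bar{X}^2}{\sum (x_i - \bar{X})^2}}$$
$$\sigma_{b_1} = \frac{\sigma_{y|x}}{\sqrt{\sum (x_i - \bar{X})^2}}$$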
PP 9 30
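Inferences about the Population Parameters
Since σy|x is unknown, we substitute the standard error of the estimate, Se, and use the t distribution.
$$S_{b_0} = S_e\sqrt{\frac{1}{n} + \frac{\bar{X}^2}{\sum (x_i - \bar{X})^2}}$$
$$S_{b_1} = \frac{S_e}{\sqrt{\sum (x_i - \bar{X})^2}}$$
$$S_{b_1} = \frac{39.386}{\sqrt{124090 - \dfrac{(1526)^2}{20}}} = 0.45013$$
$$S_{b_0} = 35.46$$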
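Both standard errors can be reproduced from quantities already quoted in the slides: Se = 39.386, n = 20, Σx = 1526, and Σx² = 124090. The sketch below uses the identity Σ(x − X̄)² = Σx² − (Σx)²/n; the variable names are assumptions for this example.

```python
import math

# Values quoted in the slides.
n = 20
sum_x, sum_x2 = 1526, 124090
se = 39.386                      # standard error of the estimate

x_bar = sum_x / n
sxx = sum_x2 - sum_x ** 2 / n    # sum of (x - x_bar)^2

# Estimated standard errors of the slope and the intercept.
s_b1 = se / math.sqrt(sxx)
s_b0 = se * math.sqrt(1 / n + x_bar ** 2 / sxx)

print(f"S_b1 = {s_b1:.5f}, S_b0 = {s_b0:.2f}")
```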
PP 9 31
Confidence Intervals Population Slope and Intercept
Use information about the sampling distributions to construct confidence intervals for the population slope and intercept, provided the conditional probability distribution of Y|X follows a normal distribution.
$$b_0 - t_{n-2,\alpha/2}\, S_{b_0} \le \beta_0 \le b_0 + t_{n-2,\alpha/2}\, S_{b_0}$$
$$b_1 - t_{n-2,\alpha/2}\, S_{b_1} \le \beta_1 \le b_1 + t_{n-2,\alpha/2}\, S_{b_1}$$
PP 9 32
Confidence Intervals Population Slope and Intercept
t18,.05/2 = t18,.025 = 2.101
For the slope
$$-2.83 - 2.101(0.4501) \le \beta_1 \le -2.83 + 2.101(0.4501)$$
-3.78 ≤ β1 ≤ -1.89 with a degree of confidence of .95
For the intercept
$$278.26 - 2.101(35.46) \le \beta_0 \le 278.26 + 2.101(35.46)$$
203.77 ≤ β0 ≤ 352.75 with a degree of confidence of .95
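The interval arithmetic can be sketched with a small helper. The function name `confidence_interval` is hypothetical, the critical value 2.101 is taken from the slide, and the inputs are the rounded estimates shown there, so the endpoints can differ from the slide in the second decimal place.

```python
def confidence_interval(estimate, std_error, t_critical):
    """Return (lower, upper) = estimate -/+ t_critical * std_error."""
    margin = t_critical * std_error
    return estimate - margin, estimate + margin

T_CRITICAL = 2.101   # t(18, 0.025), read from a t-table

# Rounded estimates and standard errors from the slides.
slope_low, slope_high = confidence_interval(-2.83, 0.4501, T_CRITICAL)
icept_low, icept_high = confidence_interval(278.26, 35.46, T_CRITICAL)

print(f"slope:     {slope_low:.2f} <= beta1 <= {slope_high:.2f}")
print(f"intercept: {icept_low:.2f} <= beta0 <= {icept_high:.2f}")
```

Neither interval contains zero, which is consistent with rejecting the hypothesis that the slope or intercept is zero at the 5% level.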
PP 9 33
Confidence Intervals Population Slope and Intercept
The interval estimates appear wide, for two reasons:
Small sample size.
Large variation in mortality for given immunization rates: Se is large.
PP 9 34
Tests of Hypotheses
The most common type of hypothesis that is tested with the regression model is that there is no relationship between the explanatory variable X and the dependent variable Y.
The relationship between X and Y is given by the linear dependence of the mean value of Y on X, that is, E(Y|X) = β0 + β1x.
To say there is no relationship means E(Y|X) is not linearly dependent on X, which is to say β1 equals zero.
H0: β1 = 0. There is no relationship between X and Y.
H1: β1 ≠ 0. There is a significant relationship between X and Y.
If we have a theory that suggests the direction of the relationship, then we will want a one-tail test.
H0: β1 = 0. There is no relationship between X and Y.
[Figure: chart titled "Relationship between Y and X", Y (0 to 9) plotted against X (0 to 10), with the horizontal fitted line y = 4.4 and R² = 0.]
PP 9 36
Tests of Hypotheses
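The test statistic is
$$t_{n-2} = \frac{b_1 - \beta_1}{S_{b_1}} = \frac{b_1}{S_{b_1}}$$
(the second equality holds under H0, where β1 = 0)
Set the level of significance.
Find the critical value in the t-table, df = n - 2.
DR: if (-tcv ≤ t-test ≤ tcv), do not reject.
[Figure: Sampling Distribution under the null hypothesis. b1 is normal and the test statistic follows t n-2, centered at 0, with "reject" regions in the tails beyond -t and t and a "do not reject" region between them.]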
PP 9 37
Tests of Hypotheses
For our problem
H0: β1 ≥ 0. No relationship between X and Y.
H1: β1 < 0. An inverse relationship between X and Y.
Test statistic
$$t_{18} = \frac{-2.8317}{0.4501} = -6.291$$
Let α = 0.05. Critical value: t18,0.05 = -1.734
DR: if (-tcv ≤ t-test), do not reject. Since -1.734 > -6.291, reject.
[Figure: Sampling Distribution under the null hypothesis, t n-2 centered at 0, with the "reject" region to the left of -1.734 and "do not reject" to the right. The observed t = -6.291 (b1 = -2.831) lies in the rejection region.]
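A minimal sketch of this one-tailed test. The critical value -1.734 is taken from the slide rather than computed, and the variable names are assumptions made for this example.

```python
# Values quoted in the slides.
b1 = -2.8317          # estimated slope
s_b1 = 0.4501         # standard error of the slope
t_critical = -1.734   # lower-tail critical value t(18, 0.05)

# Test statistic under H0: beta1 >= 0 (tested at beta1 = 0).
t_stat = (b1 - 0) / s_b1

# Lower-tail decision rule: reject when t falls below the critical value.
reject_h0 = t_stat < t_critical

print(f"t = {t_stat:.3f}, reject H0: {reject_h0}")
```

Only the lower tail is used because the alternative hypothesis specifies an inverse relationship.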
PP 9 38
Tests of Hypotheses
Conclude that the immunization rate is significantly and inversely related to the mortality rate.
Remember: you want to reject the null. When you do, you have found that your independent variable is related to the dependent variable.
PP 9 39
Computer Output of the Problem
                     MORTALITY, Y    IMMUNIZED, X
Mean                 62.2            76.3
Standard Error       15.33101432     4.488640634
Median               31              83
Mode                 9               83
Standard Deviation   68.56238036     20.07381117
Sample Variance      4700.8          402.9578947
Range                220             72
Minimum              6               26
Maximum              226             98
Sum                  1244            1526
Count                20              20
PP 9 40
Excel Output
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.829075272
R Square            0.687365806
Adjusted R Square   0.669997239
Standard Error      39.38625365
Observations        20

ANOVA
             df   SS            MS           Significance F
Regression   1    61392.21442   61392.2144   6.25E-06
Residual     18   27922.98558   1551.27698
Total        19   89315.2

             Coefficients    Standard Error   t Stat       P-value
Intercept    278.2600899     35.45613832      7.84800892   3.22E-07
IMMUNRATE    -2.831718085    0.450130083      -6.2908883   6.25E-06

Callouts on the slide: b0 and b1 are the Intercept and IMMUNRATE entries in the Coefficients column, Sb0 and Sb1 are the corresponding entries in the Standard Error column, and Se is the Standard Error under Regression Statistics.
PP 9 41
Online Homework - Chapter 15
Overview: Simple Regression
CengageNOW fourteenth assignment