regression and sample correlation
TRANSCRIPT
-
8/10/2019 Regression and Sample Correlation
Lecture 10
REGRESSION AND SAMPLE CORRELATION
Predrag Spasojevic
DESCRIPTIVE AND INFERENTIAL STATISTICS LECTURES summer 2013/14
-
INTRODUCTION
Many engineering and scientific problems are concerned with determining a relationship between a set of variables.
For example, in a chemical process we might be interested in the relationship between:
the output of the process,
the temperature at which it occurs,
the amount of catalyst employed.
Knowledge of such a relationship would enable us to predict the output for various values of temperature and amount of catalyst.
-
LINEAR REGRESSION LINE
In many situations there is a single response variable $Y$, the dependent variable, which depends on the values of a set of inputs $x_1, \ldots, x_r$, called the independent variables.
The simplest type of relationship is a linear one. That is, for some constants $\beta_0, \beta_1, \ldots, \beta_r$ the following equation would hold:
$$Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_r x_r \qquad (1)$$
If this were the relationship between $Y$ and the $x_i$, $i = 1, \ldots, r$, then it would be possible (once the $\beta_i$ were learned) to predict exactly the response for any set of input values.
-
LINEAR REGRESSION LINE
In practice, such precision is almost never attainable. The most that one can expect is that Equation (1) is valid subject to random error, i.e., the explicit relationship is
$$Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_r x_r + e \qquad (2)$$
where $e$, representing the random error, is assumed to be a random variable (r.v.) having mean 0.
This relationship is called a linear regression equation.
-
LINEAR REGRESSION LINE
The linear regression equation describes the regression of $Y$ on the set of independent variables $x_1, \ldots, x_r$.
The quantities $\beta_0, \beta_1, \ldots, \beta_r$ are called the regression coefficients and must usually be estimated from a set of data.
A simple regression equation is a regression equation containing a single independent variable $x$ (the input level):
$$Y = \alpha + \beta x + e$$
Here $Y$ is the response and $e$, representing the random error, is a random variable having mean 0 and variance $\sigma^2$.
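To make the role of the error term concrete, here is a minimal Python sketch that draws responses from the simple regression equation. The function name `simulate_responses`, the parameter values, and the choice of a normal distribution for $e$ are assumptions made only for illustration; the slides require only that $e$ has mean 0 and variance $\sigma^2$.

```python
import random

def simulate_responses(xs, alpha, beta, sigma, seed=0):
    """Draw Y_i = alpha + beta * x_i + e_i for each input x_i.

    Assumption: e_i is normal with mean 0 and standard deviation sigma.
    """
    rng = random.Random(seed)
    return [alpha + beta * x + rng.gauss(0.0, sigma) for x in xs]

# Hypothetical parameter values, chosen only for illustration.
alpha, beta, sigma = -4.0, 0.5, 3.0
xs = [100, 110, 120, 130, 140]
ys = simulate_responses(xs, alpha, beta, sigma)

for x, y in zip(xs, ys):
    mean_response = alpha + beta * x
    print(f"x = {x:3d}   mean response = {mean_response:5.1f}   observed Y = {y:6.2f}")
```

Each observed response scatters around its mean $\alpha + \beta x$; with `sigma=0.0` the error vanishes and the responses fall exactly on the line.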
-
LINEAR REGRESSION LINE
EX. 1: Consider the following 10 data pairs $(x_i, y_i)$, $i = 1, \ldots, 10$, relating $y$, the percent yield of a laboratory experiment, to $x$, the temperature at which the experiment was run.
 i    x_i   y_i       i    x_i   y_i
 1    100    45       6    150    68
 2    110    52       7    160    75
 3    120    54       8    170    76
 4    130    63       9    180    92
 5    140    62      10    190    88
-
LINEAR REGRESSION LINE
A plot of $y_i$ versus $x_i$, called a scatter diagram, is given in Fig. 1. It suggests that a simple linear regression model would be appropriate.
-
LEAST SQUARES ESTIMATORS OF THE
REGRESSION PARAMETERS
Suppose the responses $Y_i$ corresponding to the input values $x_i$, $i = 1, \ldots, n$, are to be observed and used to estimate $\alpha$ and $\beta$ in a simple linear regression model.
If $A$ is the estimator of $\alpha$ and $B$ the estimator of $\beta$, then the estimator of the response corresponding to the input value $x_i$ is
$$A + B x_i.$$
The actual response is $Y_i$, so the squared difference is
$$(Y_i - A - B x_i)^2.$$
-
LEAST SQUARES ESTIMATORS OF THE
REGRESSION PARAMETERS
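The sum of the squared differences between the estimated responses and the actual response values, call it $SS$, is
$$SS = \sum_{i=1}^{n} (Y_i - A - B x_i)^2.$$
The method of least squares chooses as estimators of $\alpha$ and $\beta$ the values of $A$ and $B$ that minimize $SS$.
So, to determine these estimators, we differentiate $SS$ first with respect to $A$ and then with respect to $B$:
$$\frac{\partial SS}{\partial A} = -2 \sum_{i=1}^{n} (Y_i - A - B x_i)$$
$$\frac{\partial SS}{\partial B} = -2 \sum_{i=1}^{n} x_i (Y_i - A - B x_i)$$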
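As a sanity check on the differentiation step, the following Python sketch evaluates $SS$ and its two partial derivatives on the data of EX. 1 and compares the analytic derivatives with central finite differences. The function names, the trial point, and the step size `h` are arbitrary assumptions for illustration.

```python
# Data of EX. 1: temperature x_i and percent yield y_i.
xs = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
ys = [45, 52, 54, 63, 62, 68, 75, 76, 92, 88]

def ss(a, b):
    """Sum of squared differences between the responses and a + b * x_i."""
    return sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))

def dss_da(a, b):
    """Partial derivative of SS with respect to A."""
    return -2 * sum(y - a - b * x for x, y in zip(xs, ys))

def dss_db(a, b):
    """Partial derivative of SS with respect to B."""
    return -2 * sum(x * (y - a - b * x) for x, y in zip(xs, ys))

# Arbitrary trial point and step size (assumptions for this check only).
a0, b0, h = 1.0, 0.4, 1e-4
num_da = (ss(a0 + h, b0) - ss(a0 - h, b0)) / (2 * h)
num_db = (ss(a0, b0 + h) - ss(a0, b0 - h)) / (2 * h)

print("dSS/dA  analytic:", dss_da(a0, b0), " numeric:", num_da)
print("dSS/dB  analytic:", dss_db(a0, b0), " numeric:", num_db)
```

Because $SS$ is quadratic in $A$ and $B$, the central differences agree with the analytic derivatives up to rounding error.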
-
LEAST SQUARES ESTIMATORS OF THE
REGRESSION PARAMETERS
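Setting these partial derivatives equal to zero yields the normal equations for the minimizing values $A$ and $B$:
$$\sum_{i=1}^{n} Y_i = nA + B \sum_{i=1}^{n} x_i$$
$$\sum_{i=1}^{n} x_i Y_i = A \sum_{i=1}^{n} x_i + B \sum_{i=1}^{n} x_i^2$$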
-
LEAST SQUARES ESTIMATORS OF THE
REGRESSION PARAMETERS
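Let
$$\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i, \qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$
By the method of substitution, the first normal equation gives
$$A = \bar{Y} - B\bar{x}.$$
Substituting this into the second normal equation gives
$$\sum_{i=1}^{n} x_i Y_i = (\bar{Y} - B\bar{x})\, n\bar{x} + B \sum_{i=1}^{n} x_i^2.$$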
-
LEAST SQUARES ESTIMATORS OF THE
REGRESSION PARAMETERS
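The usual transformations of the second normal equation give
$$B\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right) = \sum_{i=1}^{n} x_i Y_i - n\bar{x}\bar{Y}.$$
Solving for $B$ and using the fact that
$$n\bar{Y} = \sum_{i=1}^{n} Y_i$$
gives the estimators in closed form.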
-
LEAST SQUARES ESTIMATORS OF THE
REGRESSION PARAMETERS
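So we get the following proposition:
The least squares estimators of $\beta$ and $\alpha$ corresponding to the data set $x_i, Y_i$, $i = 1, \ldots, n$, are, respectively,
$$B = \frac{\sum_{i=1}^{n} x_i Y_i - \bar{x}\sum_{i=1}^{n} Y_i}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}, \qquad A = \bar{Y} - B\bar{x}.$$
The straight line $A + Bx$ is called the estimated regression line.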
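The proposition translates directly into a few lines of code. The sketch below is a plain-Python illustration (the function name `least_squares` is an assumption, not something from the lecture) applied to the temperature and yield data of EX. 1.

```python
def least_squares(xs, ys):
    """Return (A, B) for the estimated regression line A + B * x.

    Uses the closed-form least squares estimators:
    B = (sum x_i Y_i - x_bar * sum Y_i) / (sum x_i^2 - n * x_bar^2)
    A = Y_bar - B * x_bar
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    numerator = sum(x * y for x, y in zip(xs, ys)) - x_bar * sum(ys)
    denominator = sum(x * x for x in xs) - n * x_bar ** 2
    b = numerator / denominator
    a = y_bar - b * x_bar
    return a, b

# Data of EX. 1: temperature x_i and percent yield y_i.
xs = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
ys = [45, 52, 54, 63, 62, 68, 75, 76, 92, 88]

a, b = least_squares(xs, ys)
print(f"estimated regression line: y = {a:.3f} + {b:.4f} x")
```

For these data the estimates are roughly $A \approx -4.47$ and $B \approx 0.496$, so each additional degree of temperature is associated with about half a percentage point of additional yield.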
-
LEAST SQUARES ESTIMATORS OF THE
REGRESSION PARAMETERS
EX. 2: The raw material used in the production of a certain synthetic fiber is stored in a location without humidity control.
Measurements of the relative humidity in the storage location and of the moisture content of a sample of the raw material were taken over 15 days, with the following data (in percentages) resulting.
-
LEAST SQUARES ESTIMATORS OF THE
REGRESSION PARAMETERS
Calculating the least squares estimators by the last proposition gives the estimated regression line of moisture content on relative humidity in the storage location. It is the line shown in the following figure.
-
LEAST SQUARES ESTIMATORS OF THE
REGRESSION PARAMETERS
-
THE COEFFICIENT OF DETERMINATION
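Notation: If we let
$$S_{xY} = \sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y}) = \sum_{i=1}^{n} x_i Y_i - n\bar{x}\bar{Y}$$
$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2$$
$$S_{YY} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2$$
the least squares estimators can be expressed as
$$B = \frac{S_{xY}}{S_{xx}}, \qquad A = \bar{Y} - B\bar{x}.$$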
-
THE COEFFICIENT OF DETERMINATION
Suppose we want to measure the amount of variation in the set of response values $Y_1, \ldots, Y_n$ corresponding to the set of input values $x_1, \ldots, x_n$.
A standard measure in statistics of the amount of variation in a set of values $Y_1, \ldots, Y_n$ is
$$S_{YY} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2.$$
For instance, if all the $Y_i$ are equal, and thus all equal to $\bar{Y}$, then $S_{YY}$ equals 0.
-
THE COEFFICIENT OF DETERMINATION
The variation in the values of the $Y_i$ arises from two factors:
First: the input values $x_i$ are different, so the response variables $Y_i$ all have different mean values.
Second: even when the differences in the input values are taken into account, each of the response variables $Y_i$ has variance $\sigma^2$ and thus will not exactly equal the predicted value at its input $x_i$.
-
THE COEFFICIENT OF DETERMINATION
How much of the variation in the values of the response
variables is due to the different input values?
How much is due to the inherent variance of the responses
even when the input values are taken into account?
Answer: note that the quantity
$$SS_R = \sum_{i=1}^{n} (Y_i - A - B x_i)^2$$
measures the remaining amount of variation in the response values after the different input values have been taken into account.
-
THE COEFFICIENT OF DETERMINATION
Thus, $S_{YY} - SS_R$ represents the amount of variation in the response variables that is explained by the different input values.
The quantity $R^2$ defined by
$$R^2 = \frac{S_{YY} - SS_R}{S_{YY}} = 1 - \frac{SS_R}{S_{YY}}$$
represents the proportion of the variation in the response variables that is explained by the different input values.
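A short numerical illustration may help here. The Python sketch below computes the total variation, the residual variation, and their ratio for the data of EX. 1; the variable names are assumptions, and the slope is computed in its centered form, which is algebraically the same as the least squares estimator $B$.

```python
# Data of EX. 1: temperature x_i and percent yield y_i.
xs = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
ys = [45, 52, 54, 63, 62, 68, 75, 76, 92, 88]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least squares estimates of the slope B and intercept A (centered form).
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
a = y_bar - b * x_bar

# Total variation of the responses about their mean.
s_yy = sum((y - y_bar) ** 2 for y in ys)
# Variation remaining after the inputs are taken into account.
ss_r = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
# Proportion of the variation explained by the different inputs.
r_squared = (s_yy - ss_r) / s_yy

print(f"S_YY = {s_yy:.2f}, SS_R = {ss_r:.2f}, R^2 = {r_squared:.4f}")
```

For these data the proportion comes out at about 0.955, i.e. roughly 95 percent of the variation in yield is explained by the different temperatures.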
-
THE COEFFICIENT OF DETERMINATION
$R^2$ is called the coefficient of determination.
$$0 \le R^2 \le 1.$$
A value of $R^2$ near 1: most of the variation of the response data is explained by the different input values.
A value of $R^2$ near 0: little of the variation is explained by the different input values.
The value of $R^2$ is an indicator of how well the regression model fits the data, with a value near 1 indicating a good fit and one near 0 indicating a poor fit.
-
THE SAMPLE CORRELATION COEFFICIENT
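Suppose a data set consists of the paired values $(x_i, y_i)$, $i = 1, \ldots, n$.
From such a data set we obtain a statistic that can be used to measure the association between the individual values of a set of paired data.
That statistic is called the sample correlation coefficient and is defined by
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
where $\bar{x}$ and $\bar{y}$ are the sample means of the $x_i$ and the $y_i$.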
-
THE SAMPLE CORRELATION COEFFICIENT
The sample correlation coefficient is always between $-1$ and $1$.
If the correlation coefficient is positive, the relationship is direct: large values of one variable tend to go with large values of the other.
If the correlation coefficient is negative, the relationship is inverse: large values of one variable tend to go with small values of the other.
If $|r| = 1$, then the linear relationship between the r.v.s $X$ and $Y$ is perfect.
So, the closer the absolute value of $r$ is to 1, the stronger the correlation.
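These properties are easy to see on small data sets. The following Python sketch implements the sample correlation coefficient from sums of deviations about the sample means and evaluates it on three made-up data sets; the function name and the data are hypothetical and serve only as illustration.

```python
from math import sqrt

def sample_correlation(xs, ys):
    """Sample correlation coefficient r of the paired values (x_i, y_i)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Hypothetical data sets.
xs = [1, 2, 3, 4, 5]
increasing_line = [2 * x + 1 for x in xs]   # points exactly on a rising line
decreasing_line = [10 - 3 * x for x in xs]  # points exactly on a falling line
scattered = [2, 1, 4, 3, 5]                 # rising trend with scatter

print("rising line: ", sample_correlation(xs, increasing_line))
print("falling line:", sample_correlation(xs, decreasing_line))
print("scattered:   ", sample_correlation(xs, scattered))
```

The two exact lines give $|r| = 1$ with the sign matching the direction of the line, while the scattered data give a positive value strictly below 1.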
-
THE SAMPLE CORRELATION COEFFICIENT
-
THE COEFFICIENT OF DETERMINATION AND THE
SAMPLE CORRELATION COEFFICIENT
Consider data pairs $(x_i, Y_i)$, $i = 1, \ldots, n$, of response values $Y_1, \ldots, Y_n$ corresponding to the set of input values $x_1, \ldots, x_n$.
The sample correlation coefficient $r$ of these data pairs, in the notation of slide 17, is
$$r = \frac{S_{xY}}{\sqrt{S_{xx} S_{YY}}}.$$
Upon using the identity
$$SS_R = \frac{S_{xx} S_{YY} - S_{xY}^2}{S_{xx}}$$
-
THE COEFFICIENT OF DETERMINATION AND THE
SAMPLE CORRELATION COEFFICIENT
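we see that:
$$r^2 = \frac{S_{xY}^2}{S_{xx} S_{YY}} = \frac{S_{xx} S_{YY} - SS_R\, S_{xx}}{S_{xx} S_{YY}} = 1 - \frac{SS_R}{S_{YY}} = R^2.$$
So,
$$|r| = \sqrt{R^2}.$$
The sign of $r$ is the same as that of $B$.
The above gives additional meaning to the sample correlation coefficient.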
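The relationship between $r$, $R^2$ and $B$ can be checked numerically. The sketch below (the helper `fit_summary` is a hypothetical name) computes all three for the data of EX. 1 and, as an additional made-up case, for the same yields in reversed order so that the trend is decreasing.

```python
from math import sqrt

def fit_summary(xs, ys):
    """Return (B, r, R^2) for a simple linear regression of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx                      # least squares slope
    a = y_bar - b * x_bar                # least squares intercept
    ss_r = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    r = s_xy / sqrt(s_xx * s_yy)         # sample correlation coefficient
    return b, r, 1 - ss_r / s_yy         # last entry is R^2

# Data of EX. 1, plus the yields reversed (a hypothetical decreasing case).
xs = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
ys = [45, 52, 54, 63, 62, 68, 75, 76, 92, 88]

for label, data in (("EX. 1 data", ys), ("reversed yields", ys[::-1])):
    b, r, r_squared = fit_summary(xs, data)
    print(f"{label}: B = {b:.4f}, r = {r:.4f}, r^2 = {r * r:.4f}, R^2 = {r_squared:.4f}")
```

In both cases $r^2$ and $R^2$ agree up to rounding, and $r$ carries the same sign as the slope $B$.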
-
THE COEFFICIENT OF DETERMINATION AND THE
SAMPLE CORRELATION COEFFICIENT
For instance, if a data set has its sample correlation coefficient $r$ equal to 0.9, then this implies that a simple linear regression model for these data explains 81 percent (since $R^2 = 0.9^2 = 0.81$) of the variation in the response values.
That is, 81 percent of the variation in the response values is explained by the different input values.