Topic 4: Simple Correlation and Regression Analysis (source: web.uvic.ca/~bettyj/246/topic4web.pdf)


  • Topic 4: Simple Correlation and Regression Analysis

    Question: “How do we determine how the changes in one variable are related to changes in another variable or variables?” Answer: REGRESSION ANALYSIS

    Which derives a description of the functional nature of the relationship between two or more variables. Examples:

    (i) Life-time earnings can be explained by:

    Educational level, job experience, occupation,

    gender, country of birth.

    (ii) Age of Death can be explained by

    Parents' age of death, weight, race, political situation, healthcare accessibility, diet.

    (iii) Attendance at hockey games can be explained by

    Rank of team, player salaries, history of penalties, violence during the game.

  • 2

    (II) Question: “How do we determine the strength of this relationship between two or more variables?”

    Answer: CORRELATION ANALYSIS (which determines the “strength” of such association between two or more variables).

    — This topic will cover the fundamental ideas of econometrics.

    — We will transform theoretical economic relationships into specific functional forms, by

    (i) gathering the data to estimate the parameters of these functions, (ii) testing hypotheses about these parameters, and (iii) making predictions.

    — We will bring together:

    economic theory, empirical data, and statistical methods.

    — With regression analysis we estimate the value of one variable (the dependent variable) on the basis of one or more other variables (independent or explanatory variables).

  • 3

    Example: Demand Function

    Suppose the demand for Good A can be expressed by the following: QA = f(PA, PB, M), a “multi-variate” relationship.

    The quantity of Good A demanded is a function of its own price (PA), the price of another good (PB), and disposable income (M).

    — According to the Law of Demand, there is an inverse relationship between the quantity of A demanded (QA) and its price (PA), given that everything else is constant (ceteris paribus).

    — This is a partial relationship.

    — QA is referred to as the dependent variable, while PA, PB, and M are independent or explanatory variables.

    *********

    — Economic theory dictates the variable or variables whose values determine the behaviour of a variable of interest.

    i.e. Economic theory makes claims about the variables that affect the value of the variable of interest.

    — For example, the price of a new car will determine the quantity of cars demanded.

    — Or, the interest rate on mortgages, the CPI and the GDP may partially determine the demand for new housing.

  • 4

    — But:

    (i) economic theory does not always provide the expected signs for all explanatory variables.

    Example: PB: its sign in the demand function depends on whether the other good is a complement or a substitute good.

    (ii) Economic theory does not always provide clear and precise information about the functional form of the function, f.

    — The functional form of these types of relationships must be specified before estimation can take place. Otherwise, estimation of economic theory can be misleading.

    Example: A linear demand function is specified:

    QA = β1 + β2·PA + β3·PB + β4·M

    where the βi's are the parameters of the model.

    — Parameters are usually unknown. We often wish to estimate them.

    Note: QA, PA, PB, and M may be transformations of non-linear data:

    For example: QC = 1/QA or QC = QA².

    — These non-linear relationships have been transformed into a linear format and hence expressed in a linear regression model.

  • 5

    For example: Q = (Sales)² or Q = 1/Sales.

    **********

    — Once we have determined the functional form of the regression, we can address questions such as:

    — If we change the income tax laws, will there be an effect on the quantity of new cars demanded (which works through disposable income)?

    — Are goods A and B substitutes or complements (which is determined by cross price elasticities)? If the price of B rises,

    does the quantity of A increase or decrease?

    — If PB increases and it is a Complement: QA demanded decreases.

    Example: Quantity of large cars demanded decreases when the price of gasoline increases.

    — If PB increases and it is a Substitute: QA demanded increases.

    Example: Quantity of apples demanded increases when the price of oranges increases. * Apples and oranges are substitutes.

  • 6

    — Above we had examples of exact relationships between the dependent variable and the explanatory variables.

    — The functional relationships were deterministic (i.e. no uncertainty involved).

    — However, in the real world, such relationships are not as simple or straightforward.

    For Illustration: If we look at two businesses that build the same product and face the same production costs and factor prices, their demands are not usually the same.

    — There are many other individual factors that have been ignored (firm's ideology, skill level of employees, marketing experience, etc.).

    — There will be some uncertainty, a “stochastic” function.

    For Illustration: Suppose we have several households with the same income, facing the same prices. The typical demands from each household will differ because other individual factors have been ignored by the model.

    Such as: tastes, preferences, family size.

    — These factors project uncertainty into the function.

  • 7

    — This stochastic element is incorporated into the function by including a “stochastic” error term in the model:

    QY = β1 + β2·PY + β3·PX + β4·M + ε

    — Otherwise, we could simply regard the deterministic relationship as explaining the average relationship between the dependent variable and a set of explanatory variables:

    E(QY) = β1 + β2·PY + β3·PX + β4·M

    **********

    — For this topic, we will restrict our models to the following:

    (1) Simple Regression: one dependent variable (Y) and one independent variable (X). (This can be extended to a multi-variate case.)

    (2) Linear Parameters: We assume the relationship between the dependent variable and

    the independent variable is linear in terms of parameters:

    E(Y) = α + βX

    — The population regression line is linear in terms of parameters α and β, which depict the functional relationship between X and Y.

  • 8

    (3) E(Y) = α + βX means that for a given value of X, the expected value of Y (the average) is given by α + βX.

    — We expect Y, [E(Y)], to change as X changes. Use subscripts to denote which observation on X and Y we are looking

    at: E(Yi) = α + βXi for observation i.

    Parameters: α is referred to as the (population) Y-intercept term. β denotes the (population) slope term, which is the

    derivative of Y with respect to (w.r.t.) X: ∂Y/∂X.

    [Figure: scatter of observations around the population regression line E(Yi) = α + βXi, with intercept α, slope β = ∂E(Yi)/∂Xi, and error εi the vertical gap between a point (Xi, Yi) and the line.]

  • 9

    Example: Let Y be the crime rate and X represent the unemployment rate. Then for the ith observation, the expected value of crime is:

    E(Crime) = α + β(Unemployment rate) for the ith observation.

    E(Yi) = α + βXi for the ith observation.

    For each city with the same X (unemployment rate), actual crime (Y) varies because of the “other” factors involved.

    Hence, for observation i:

    Yi = α + βXi + εi ⇐ Population Regression Model

    where εi is the random error for observation i, and hence:

    εi = Yi − E(Yi)

    which represents the difference between the actual observed Yi (crime rate) and the population regression line, E(Yi).

    — β – the slope – measures the marginal effect of a change in the unemployment rate on the crime rate.

  • 10

    Another example: Let Yi be the number of years of post secondary education for

    individual i.

    Let Xi be the number of years of post secondary education by one of the parents (either by the mother or the father, whichever is highest) of individual i.

    Suppose we know that α=1.3 and β=0.8. Then the population regression line is:

    Yi = 1.3 + 0.8Xi + εi

    or

    E(Yi) = 1.3 + 0.8Xi

    Notes on this example: (i) Suppose individual i is Jason. One of his parents has 4

    years of post secondary schooling. How many years of post secondary education do we expect Jason to have, ceteris paribus?

    Xi=4, so:

    E(Yi) = 1.3 + 0.8Xi

    E(Yi) = 1.3 + 0.8(4)

    E(Yi) = 4.5 years

    On average, we expect an individual with a parent that has 4 years of post secondary schooling to complete 4.5 years of post secondary schooling.

  • 11

    (ii) Other factors affect Yi for i=Jason.

    — If the number of years of schooling is in fact 4.5, the random error term, εi, equals zero: εi = 0.

    — But, if the actual number of years of school completed is 7 years, εi = 2.5 for Jason.

    Obviously, regression analysis is useful since it allows us to quantify the association or relationship among variables.

    (i.e. We can determine how variables affect each other numerically.)
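    The arithmetic of this example can be checked in a few lines (Python is used purely for illustration; the helper name `expected_years` is ours, not from the notes):

```python
# Population regression line from the notes: E(Y) = alpha + beta*X
alpha, beta = 1.3, 0.8

def expected_years(x_parent):
    # Expected years of post-secondary schooling given a parent's years
    return alpha + beta * x_parent

e_y = expected_years(4)   # Jason's parent has 4 years of schooling
error = 7 - e_y           # population error if Jason actually completed 7 years
print(e_y, error)         # 4.5 and 2.5, matching the notes
```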

    ************

    [Figure: individual's schooling (Yi) plotted against parent's schooling (Xi), with the line E(Yi) = α + βXi = 1.3 + 0.8Xi.]

  • 12

    The Sample Regression Model (α and β unknown)

    In the previous examples we assumed that we knew the value of α and β – the population parameters.

    • This is unrealistic, and we usually must estimate α and β using sample data. I.e. this is analogous to using X̄ as an estimator of µ, or s² as an estimator of σ².

    Questions: 1: How do we decide which estimator we can use for estimating α and β? 2: How can we use sample information?

    The sample regression line takes the form:

    Ŷi = a + bXi

    where: (i) the “hat” (ˆ) indicates a “fitted value” or a predicted value.

    (ii) ‘a’ is the estimator of α. (iii) ‘b’ is the estimator of β. (iv) the ‘i’ subscript indicates the ith observation and includes all observations in the sample from 1 to n (n = sample size).

    As with the population regression line, typically Ŷi ≠ Yi.

  • 13

    The difference between Ŷi and Yi, the sample error, is denoted by ei:

    ei = Yi − Ŷi.

    Substituting Ŷi = a + bXi ⇒ ei = Yi − a − bXi.

    Hence, we can express the sample regression model as:

    Yi = a + bXi + ei. Note:

    ei is the sample error. εi is the population error.

    ei and εi are not the same thing.

    ei is observable. εi is not observable.

    ei = Yi − a − bXi ⇒ known; a, b are sample estimates of α and β.

    εi = Yi − (α + βXi) ⇒ unknown; α and β are unknown population parameters.

  • 14

    In general a ≠ α and b ≠ β: the distinction between a sample and a population. (Recall X̄ ≠ µ for all samples. This is the same idea.)

    Notation:

    — Let ei (the sample error) be referred to as the residual term.

    — Let εi (the population error) be referred to as the disturbance term.

    — Let ‘a’ be our estimator of α derived from a sample.

    — Let ‘b’ be our estimator of β derived from a sample.

    — Since an estimator is a formula or rule, we must determine a and b from the sample information.

    Which method should be employed, since there aremany possible estimators?

  • 15

    Section 13.3 The Method of Least Squares

    — Recall we have restricted our discussion in this topic to a two-variable (bivariate) model which is linear in parameters:

    Yi = α + βXi + εi

    This model is linear in variables (Y and X) as well as parameters.

    — But, it is not necessary to restrict our models this way.

    Example: Regression analysis can handle

    (i) log-linear models:

    Log(Yi) = α + βXi + εi

    or (ii) log-log models:

    Log(Yi) = α + βLog(Xi) + εi

    These models are linear in parameters, but non-linear in variables.

  • 16

    By redefining the variables: Yi* = Log(Yi)

    and Xi* = Log(Xi),

    the models are linear in the new variables:

    Log(Yi) = α + βXi + εi becomes Yi* = α + βXi + εi

    and Log(Yi) = α + βLog(Xi) + εi becomes Yi* = α + βXi* + εi,

    respectively.

    — What is important for the following analysis is that our parameters are linear (α and β).
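    A minimal sketch of this idea: generate data from an exact log-log relation (the values α=0.5 and β=2 are made up for illustration), transform the variables, and recover the parameters by least squares on the transformed data:

```python
import math

# Hypothetical data satisfying Log(Y) = 0.5 + 2*Log(X) exactly
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [math.exp(0.5) * x**2 for x in X]

# Redefine the variables: X* = Log(X), Y* = Log(Y)
Xs = [math.log(x) for x in X]
Ys = [math.log(y) for y in Y]

# Least squares on the transformed (now linear) variables
n = len(Xs)
xbar, ybar = sum(Xs) / n, sum(Ys) / n
b = sum((x - xbar) * (y - ybar) for x, y in zip(Xs, Ys)) / \
    sum((x - xbar) ** 2 for x in Xs)
a = ybar - b * xbar
print(a, b)   # recovers alpha = 0.5 and beta = 2
```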

    Least Squares:

    Suppose we have a sample of size n, with (Xi,Yi) pair values:

    Plot the data points with Yi’s on the vertical axis and Xi’s on the horizontal axis.

    [Figure: scatter plot of the (Xi, Yi) sample points.]

  • 17

    We can derive a fitted line that passes through these values by estimating α and β with values a and b, such that the fitted line passes close to the observed data.

    [Figure: scatter plot with the fitted line Ŷi = a + bXi, slope b, drawn through the points.]

    How should we choose a and b? By minimizing the distance between the actual observations and the fitted line.

    To illustrate, isolate one point (Xi, Yi):

    [Figure: a point (Xi, Yi) above the fitted line Ŷi = a + bXi, with the vertical gap ei = Yi − Ŷi.]

  • 18

    We want the sum of the residuals to be as small as possible:

    [Figure: regression line with one point above it (residual ei > 0 at Xi) and one below it (residual ej < 0 at Xj).]

    The problem with choosing a and b so as to minimize Σi=1..n ei is offsetting errors. We could:

    (i) minimize Σi=1..n |ei|, or

    (ii) minimize Σi=1..n ei² ⇐ Least Squares Approach
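    The offsetting problem can be seen numerically; in this made-up sketch a deliberately bad horizontal line still has a raw residual sum of zero, while the squared criterion exposes it:

```python
X = [1, 2, 3, 4]
Y = [1, 2, 3, 4]                        # the perfect fit is Y = X

# A bad horizontal line through the mean of Y: yhat = 2.5 everywhere
resid_bad = [y - 2.5 for y in Y]
print(sum(resid_bad))                   # 0.0 -- raw sum cannot tell this fit is bad
print(sum(e ** 2 for e in resid_bad))   # 5.0 -- squared sum can

# The true line Y = X leaves no residual at all
resid_good = [y - x for x, y in zip(X, Y)]
print(sum(e ** 2 for e in resid_good))  # 0
```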

  • 19

    Why Use The Least Squares Approach?

    (I) All observations are given equal weight.

    (II) Our estimators of α and β, a and b respectively, have good statistical properties:

    (i) Unbiasedness: E(a)=α and E(b)=β. (ii) Efficiency: minimum variance estimators out of all linear

    unbiased estimators (BEST).

    (III) Computationally simple (closed form analytic solution to the minimization problem).

    (IV) Extends trivially to multi-variate and non-linear relationships.

    Thus, the least squares (LS) estimators of α and β are those values of a and b which minimize the sum of the squared residuals. Mathematically, the Least Squares problem is:

    Min{a,b} G(a, b) = Σi=1..n ei² = Σi=1..n (Yi − Ŷi)²

    where Ŷi = a + bXi,

    so substituting the above expression into G:

    G = Σi=1..n (Yi − Ŷi)² = Σi=1..n (Yi − a − bXi)².

    Solve this minimization problem using calculus: partially differentiate G with respect to a and b and equate these first order conditions (FOC) to zero. This results in two equations with two unknowns, and we solve for a and b:

  • 20

    Derivation of Least Squares Estimators:

    Min{a,b} G = Σi=1..n (Yi − a − bXi)²

    (i) ∂G/∂a = Σi=1..n 2(Yi − a − bXi)(−1) = 0   (FOC, using the chain rule)

    ⇒ −2 Σi=1..n (Yi − a − bXi) = 0

    ⇒ Σi=1..n (Yi − a − bXi) = 0   ⇐ Σi=1..n ei = 0

    which yields the normal equation (taking the summation sign through):

    Σi=1..n Yi − na − b Σi=1..n Xi = 0

    rearranging:

    na = Σi=1..n Yi − b Σi=1..n Xi

    dividing by n:

    a = (1/n) Σi=1..n Yi − b (1/n) Σi=1..n Xi

    ⇒ a = Ȳ − bX̄

    which is the L.S. estimator of α.

  • 21

    (ii) ∂G/∂b = Σi=1..n 2(Yi − a − bXi)(−Xi) = 0   (apply the chain rule)

    ⇒ −2 Σi=1..n Xi(Yi − a − bXi) = 0

    ⇒ Σi=1..n (XiYi − aXi − bXi²) = 0

    which gives the normal equation:

    Σi=1..n XiYi − a Σi=1..n Xi − b Σi=1..n Xi² = 0

    ⇒ Σi=1..n XiYi − anX̄ − b Σi=1..n Xi² = 0   (*)

    Since a = Ȳ − bX̄, we substitute that into (*):

    ⇒ Σi=1..n XiYi − (Ȳ − bX̄)nX̄ − b Σi=1..n Xi² = 0

    ⇒ Σi=1..n XiYi − nX̄Ȳ + bnX̄² − b Σi=1..n Xi² = 0

    ⇒ Σi=1..n XiYi − nX̄Ȳ = b(Σi=1..n Xi² − nX̄²)

  • 22

    Rearranging such that "b" is on the LHS:

    ⇒ b(Σi=1..n Xi² − nX̄²) = Σi=1..n XiYi − nX̄Ȳ

    ⇒ b = (Σi=1..n XiYi − nX̄Ȳ) / (Σi=1..n Xi² − nX̄²)   ⇐ Least Squares Estimator of β

    So, the least squares formulae for a and b for fitting our linear regression model to the data are:

    a = Ȳ − bX̄

    b = (Σi=1..n XiYi − nX̄Ȳ) / (Σi=1..n Xi² − nX̄²)
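    These two formulae translate directly into code. A minimal sketch (the helper name `least_squares` is ours; it is checked on data that lie exactly on the line Y = 1 + 2X):

```python
def least_squares(X, Y):
    # a = Ybar - b*Xbar,  b = (sum(XiYi) - n*Xbar*Ybar) / (sum(Xi^2) - n*Xbar^2)
    n = len(X)
    xbar, ybar = sum(X) / n, sum(Y) / n
    b = (sum(x * y for x, y in zip(X, Y)) - n * xbar * ybar) / \
        (sum(x * x for x in X) - n * xbar ** 2)
    a = ybar - b * xbar
    return a, b

a, b = least_squares([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)   # 1.0 and 2.0: the exact line is recovered
```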

  • 23

    Can also write b as:

    b = Σi=1..n (Xi − X̄)(Yi − Ȳ) / Σi=1..n (Xi − X̄)²

    or

    b = [1/(n−1)] Σi=1..n (Xi − X̄)(Yi − Ȳ) / { [1/(n−1)] Σi=1..n (Xi − X̄)² }

    Note: (i) The denominator is the sample variance of the Xi's:

    sX² = [1/(n−1)] Σi=1..n (Xi − X̄)²

    (ii) The numerator is the sample covariance between Xi and Yi:

    sXY = [1/(n−1)] Σi=1..n (Xi − X̄)(Yi − Ȳ)

    which reflects how X and Y are related to each other. Thus, we can also write:

    b = sXY / sX² = (sample covariance between the independent & dependent variable) / (sample variance of the independent variable)
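    The equivalence b = sXY / sX² can be confirmed numerically on made-up sample values; both routes give the same slope, since the 1/(n−1) factors cancel:

```python
def sample_cov(X, Y):
    # sXY = sum((Xi - Xbar)(Yi - Ybar)) / (n - 1); sample_cov(X, X) is sX^2
    n = len(X)
    xbar, ybar = sum(X) / n, sum(Y) / n
    return sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / (n - 1)

X = [1, 2, 3, 4, 5, 6]
Y = [2.0, 2.9, 4.2, 4.8, 6.1, 6.9]           # made-up sample
b_cov = sample_cov(X, Y) / sample_cov(X, X)  # b = sXY / sX^2

n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n
b_raw = (sum(x * y for x, y in zip(X, Y)) - n * xbar * ybar) / \
        (sum(x * x for x in X) - n * xbar ** 2)
print(b_cov, b_raw)   # identical up to floating-point rounding
```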

  • 24

    (iii) The sample regression line always passes through the point (X̄, Ȳ).

    Recall that the fitted regression model is:

    Ŷi = a + bXi

    and a = Ȳ − bX̄. So:

    Ŷi = Ȳ − bX̄ + bXi

    Ŷi = Ȳ + b(Xi − X̄)

    So if Xi = X̄, then:

    Ŷi = Ȳ + b(X̄ − X̄) = Ȳ.

    The sample regression line (Ŷ) always passes through the point (X̄, Ȳ).

    [Figure: fitted regression line passing through the point (X̄, Ȳ).]

  • 25

    (IV) With the Least Squares procedure, the sum of the errors (residuals) equals zero:

    Σi=1..n ei = Σi=1..n (Yi − a − bXi) = 0

    Recall the normal equation for determining a:

    Σi=1..n Yi = na + b Σi=1..n Xi

    Equate to zero:

    Σi=1..n Yi − na − b Σi=1..n Xi = 0

    Σi=1..n (Yi − a − bXi) = 0   ← ei = Yi − a − bXi

    Σi=1..n (Yi − Ŷi) = 0.

    The sum of the least squares residuals is always zero.

    The least squares regression line is derived such that the line will be situated amongst the data values such that the positive residuals (under-estimates of actual points) always exactly cancel out the negative residuals (over-estimates of actual points).
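    This property is easy to confirm on any sample (the data here are made up):

```python
X = [1, 2, 3, 4, 5]
Y = [2.1, 3.9, 6.2, 8.1, 9.8]   # made-up sample

# Closed-form least squares estimates
n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n
b = (sum(x * y for x, y in zip(X, Y)) - n * xbar * ybar) / \
    (sum(x * x for x in X) - n * xbar ** 2)
a = ybar - b * xbar

residuals = [y - a - b * x for x, y in zip(X, Y)]
print(sum(residuals))   # zero, up to floating-point error
```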

  • 26

    Example: Cellular Phone Sales (Y) over a number of years (X):

    Sales (Y) (in Millions), Time (X) (Years)

    n=6. Population Regression Line: Yi = α + βXi + εi. Estimate α and β by the least squares estimators a and b to give the sample regression model: Ŷi = a + bXi, or equivalently Yi = a + bXi + ei.

    Sales (Y) in Millions   Time (X)   Xi²   XiYi
           6.0                 1         1      6
           6.5                 2         4     13
           8.0                 3         9     24
          10.3                 4        16     41.2
          12.0                 5        25     60
           9.0                 6        36     54
    ΣYi = 51.8   ΣXi = 21   ΣXi² = 91   ΣXiYi = 198.2

  • 27

    Ȳ = 51.8/6 = 8.633,   X̄ = 21/6 = 3.5

    b = (Σi=1..n XiYi − nX̄Ȳ) / (Σi=1..n Xi² − nX̄²)

    = (198.2 − (6)(3.5)(8.633)) / (91 − (6)(3.5)²)

    = (198.2 − 181.293) / (91 − 73.5)

    = 16.907 / 17.5

    = 0.9661

    a = Ȳ − bX̄

    = 8.633 − (0.9661)(3.5)

    = 8.633 − 3.381 = 5.252

    Ŷi = 5.252 + 0.9661Xi

  • 28

    If you put a “hat” on the dependent variable (Ŷi) then you do not have to put the residual (ei) into the regression line; if you do not put the “hat” on Yi, then you must include ei.

    Forecasting:

    If time (X) = 10, then

    Ŷi = 5.252 + 0.9661Xi

    Ŷi = 5.252 + 0.9661(10)

    Ŷi = 5.252 + 9.661 = 14.913

    In ten years, the best estimate of cellular phone sales is 14.913 million.
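    The worked example, fit plus forecast, can be reproduced in a few lines (the Y values are the ones recovered from the table sums; since the notes round intermediate values, the digits agree to roughly three decimals):

```python
# Cellular phone sales (millions) against time (years), n = 6
X = [1, 2, 3, 4, 5, 6]
Y = [6.0, 6.5, 8.0, 10.3, 12.0, 9.0]   # sums: 51.8 and sum(XiYi) = 198.2

n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n
b = (sum(x * y for x, y in zip(X, Y)) - n * xbar * ybar) / \
    (sum(x * x for x in X) - n * xbar ** 2)
a = ybar - b * xbar

forecast_10 = a + b * 10   # predicted sales at X = 10
print(a, b, forecast_10)   # approx 5.25, 0.966, 14.91
```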

  • 29

    Graph of Sales Against Time

  • 30

    Descriptive Statistics For Both Series

  • 31

    Equation Specification

    Least Squares Regression Output

  • 32

    Summary: Regression Model – Terms and Symbols

    Term                   Population Symbol      Sample Symbol
    Model:                 Yi = α + βXi + εi      Yi = a + bXi + ei
    Error:                 εi                     ei
    Slope:                 β                      b
    Intercept:             α                      a
    Equation of the line:  E(Yi) = α + βXi        Ŷi = a + bXi

    Concluding Remarks:

    1) The least squares procedure is simply a “curve-fitting” technique. It makes no assumptions about the independent variable(s), X, the dependent variable, Y, or the error term.

    2) We will need to make assumptions about the independent variable(s), the dependent variable, and the error term if we wish to consider how well a and b estimate α, the population intercept term, and β, the population slope parameter.

    3) We will also need to make assumptions if we want to form interval estimates for predicted values of Y or interval estimates for α and β.

    4) We will also need assumptions if we want to test any hypotheses about population parameters.

  • 33

    Example: H0: β=β0 versus Ha: β>β0,

    where β0 is some known value.

    Special test: H0: β=β0=0 versus Ha: β≠0. This tests if X is significant in explaining Y.

    5) We can extend the least squares principle to non-linear models.

    6) We use least squares because it can be shown that under certain assumptions, a and b are unbiased estimators of α and β,

    respectively. They are also efficient and best (i.e. minimum variance).

    In other words, under certain assumptions, a and b are the best unbiased estimators of α and β, respectively.

  • 34

    Section 13.4 Assumptions and Estimator Properties

    The population model is: Yi = α + βXi + εi, where i=1, 2, 3,...,n.

    The Five Assumptions: Gauss-Markov Theorem: Given the five basic assumptions (below), the least squares estimators in the linear regression model are best linear unbiased estimators,

    “BLUE.” Assumption #1: The random variable ε is statistically independent of X. That is, E(εX)=0. Always holds if X is non-stochastic (fixed in repeated sampling).

    Assumption #2: εi is normally distributed ∀i.

    Assumption #3: E(εi)=0. That is, on average the disturbance term is zero ∀i.

    Assumption #4: Var(εi) = σε² ∀i and Xi. This is known as

    the assumption of homoskedasticity, or constant variance, across all observations. Otherwise heteroskedasticity.

  • 35

    Assumption #5: Any two errors εi and εj are statistically independent of each other for i≠j. I.e. E(εiεj) = E(εi)E(εj) = 0 for i≠j. That is, zero covariance (disturbances are uncorrelated). For example, the disturbance in observation 1 does not affect observation 2, etc.

    If disturbances are correlated across observations, then there exists autocorrelation or serial correlation.

    Remarks: 1) From assumptions 1 through 4 we have εi ~ N(0, σε²) ∀i.

    2) Yi is random only because εi is random, if X is non-random or non-stochastic.

    3) E(Yi) = E(α + βXi + εi) = α + βXi, since X is non-stochastic and E(εi) = 0.

    Var(Yi) = Var(α + βXi + εi) = Var(εi) = σε², since α, β, and X are non-stochastic (no variability).

    Y is a linear function of ε, and ε is normal.

    That is, from εi ~ N(0, σε²), it follows that Yi ~ N(α + βXi, σε²).

    4) As εi and εj for i≠j are independent, so too are Yi and Yj for i≠j.

  • 36

    Section 13.5: Measures of Goodness of Fit

    This section will explore two measures of goodness of fit:

    1) The Standard Error of the Estimate: the measure of the absolute fit of the sample points to the sample regression line.

    2) Coefficient of Determination, R²: a measure of the relative goodness of fit of a sample regression line.

    Notation: Total deviation of Y: the difference between the observed value of Yi and the mean of the y-values, Ȳ:

    (Yi − Ȳ). This total deviation can be expressed as the sum of two other

    deviations:

    (Yi − Ŷi) and (Ŷi − Ȳ).

    — The first deviation expression is the residual ei: (Yi − Ŷi) = ei. Since the ei are random, the term (Yi − Ŷi) = ei is referred to as the unexplained deviation.

    — The second deviation can be explained by the regression line.

    (Ŷi − Ȳ) is the explained deviation, because it is possible to predict that Ŷi differs from Ȳ because Xi differs from X̄.

  • 37

    Total deviation = Unexplained deviation + Explained deviation

    (Yi − Ȳ) = (Yi − Ŷi) + (Ŷi − Ȳ)

    [Figure: regression line with one sample point; the total deviation Yi − Ȳ is split into the unexplained part Yi − Ŷi and the explained part Ŷi − Ȳ.]

  • 38

    The two parts of the total deviation are independent.

    Hence, we can square each deviation and sum over all ‘n’ observations as follows:

    Σi=1..n (Yi − Ȳ)² = Σi=1..n (Yi − Ŷi)² + Σi=1..n (Ŷi − Ȳ)²

    Breaking into three parts:

    Σi=1..n (Yi − Ȳ)²   ⇐ the total variation ⇐ the total sum of squares ⇐ SST

    Σi=1..n (Yi − Ŷi)² = Σi=1..n ei²   ⇐ the unexplained variation ⇐ the sum of squares error ⇐ SSE

    Σi=1..n (Ŷi − Ȳ)²   ⇐ the explained variation due to the regression ⇐ the sum of squares regression ⇐ SSR

  • 39

    Hence the identity in regression analysis is:

    Total deviation = Unexplained deviation + Explained deviation

    Σ(Yi − Ȳ)² = Σ(Yi − Ŷi)² + Σ(Ŷi − Ȳ)²

    SST = SSE + SSR

    Why are we dissecting the total variation into two components?

    We can now determine the goodness of fit of a regression in terms of the size of SSE.

    — If the fit is perfect, SSE = 0.

    — If the fit is not perfect, SSE ≠ 0.

  • 40

    Calculation of SST, SSR and SSE

    The best way to illustrate the calculation of these measures is through an example:

    Example: Cellular Phone Sales (Y) over a number of years (X):

    Sales (Y) in Millions   Time (X)   Xi²   XiYi

    ΣYi = 51.8   ΣXi = 21   ΣXi² = 91   ΣXiYi = 198.2

    Ȳ = 51.8/6 = 8.633,   X̄ = 21/6 = 3.5

    b = 0.9661

    a = Ȳ − bX̄ = 5.252

    Ŷi = 5.252 + 0.9661Xi

  • 41

    Sales (Y) in Millions   (Yi − Ȳ)²   ei² = (Yi − Ŷi)²   (Ŷi − Ȳ)²
           6.0                6.933          0.048             5.832
           6.5                4.550          0.468             2.099
           8.0                0.401          0.023             0.233
          10.3                2.779          1.401             0.234
          12.0               11.337          3.677             2.101
           9.0                0.135          4.197             5.835
    ΣYi = 51.8   Σ = SST = 26.14   Σ = SSE = 9.81   Σ = SSR = 16.33

    We should find that SST = SSE + SSR.

    A more convenient way to determine SST is: total variation in the dependent variable Y:

    SST = Σi=1..n Yi² − (1/n)(Σi=1..n Yi)²

    The amount of variation explained by the regression equation can be determined without first calculating SSE. The method eliminates the calculation of each estimated value (Ŷi).

    Explained variation: SSR = b Σ(Xi − X̄)(Yi − Ȳ) = b · (covariation of X and Y)

    Once SST and SSR are found, the unexplained variation SSE is found by subtraction: SSE = SST − SSR.

  • 42

    How Good is the Fit In Regression Analysis?

    (I) Standard Error of the Estimate

    se = √[ (1/(n−2)) Σi=1..n (Yi − Ŷi)² ] = √[ SSE/(n−2) ]

    The number of degrees of freedom in this case is n−2 because two sample statistics (a and b) must be calculated before the value of Ŷi can be computed.

    The information from this measure provides an indication of the fit of the regression model. It also indicates the size of the residuals that result.

    The units of se are always the same as those for the variable Y, since ‘e’ is a leftover (residual) component of Y.

    To determine whether se is large or small, we compare it to the average size of Y: using a coefficient of variation (C.V.) type measure, we can define the typical percentage error in predicting Y using the sample data and the regression line.

  • 43

    Coefficient of Variation for regression residuals:

    C.V. residuals = (se/Ȳ) · 100

    Σ ei² = Σ (Yi − Ŷi)² = 9.817

    se = √[ (1/(n−2)) Σi=1..n (Yi − Ŷi)² ] = √(9.817/(6−2)) = √2.4525 = 1.566

    C.V. residuals = (se/Ȳ) · 100 = (1.566/8.633) · 100 = 0.1814 · 100 = 18.14

    For the cellular phone example, the C.V. residuals = 18.14. In predicting the number of cellular phone sales each year in this sample with the estimated model, the value will be in error by 18.14%, on the average.
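    The same two numbers can be recomputed from the raw data (the notes round intermediate values, so digits agree to about two decimals):

```python
import math

X = [1, 2, 3, 4, 5, 6]
Y = [6.0, 6.5, 8.0, 10.3, 12.0, 9.0]

n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n
b = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / \
    sum((x - xbar) ** 2 for x in X)
a = ybar - b * xbar

SSE = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))
s_e = math.sqrt(SSE / (n - 2))   # n-2 degrees of freedom: a and b were estimated
cv = s_e / ybar * 100            # typical percentage error in predicting Y
print(s_e, cv)                   # approx 1.566 and 18.14
```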

  • 44

    (II) Coefficient of Determination

    The second measure of goodness of fit is the coefficient of determination.

    It facilitates the interpretation of the relative amount of variation that has been explained by the sample regression line.

    Deriving R²: (in the text it is denoted r²)

    Recall: SST = SSE + SSR

    — If we divide this expression by SST:

    SST/SST = SSE/SST + SSR/SST

    1 = SSE/SST + SSR/SST

    These two ratios must sum to one and are M.E.

    Recall, SSE is the unexplained variation in Y.

    The ratio SSE/SST represents the proportion of total variation

    that is unexplained by the regression relation.

  • 45

    The ratio SSR/SST represents the relative measure of goodness of fit

    and is called the coefficient of determination.

    It measures the proportion of total variation that is explained by the regression line.

    R² = SSR/SST = (variation explained) / (total variation)

    If the regression line perfectly fits all the sample points, SSR = SST and the coefficient of determination would achieve its maximum value of 1: R² = 1.

    [Figure: perfect fit, R² = 1; every point lies on the regression line, Ŷi = Yi.]

  • 46

    As the degree of fit becomes ___ accurate and ____ of the variation in Y is explained by its relation with X, R² _________.

    [Figure: sample points scattered around the regression line]

    The lowest value of R² is ___, which will occur whenever SSR = 0 and SSE = ___. This occurs when Ŷ_i = Ȳ for all observations.

    [Figure: No systematic relation between Y and X]

    R² = _; Ŷ_i = Ȳ; b = _. The regression line is _________l with a ____ slope (b = _). Hence, X has no effect in explaining changes in Y.
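A small sketch with made-up data illustrates the zero-slope case: when the least-squares slope works out to zero, every fitted value equals Ȳ, so SSR = 0 and R² = 0. The data here are hypothetical.

```python
# Hypothetical data where X has no linear effect on Y.
x = [1, 2, 3, 4]
y = [3, 5, 5, 3]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope b = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
    sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]   # every fitted value equals y_bar when b = 0

sst = sum((yi - y_bar) ** 2 for yi in y)
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
print(b, ssr / sst)   # 0.0 0.0
```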


    Example: Determine the coefficient of determination for the cellular phone example.

    We know that SSR = _____ and SST = _____. Hence,

    R² = SSR/SST = 16.33/26.14 = 0.625

    The interpretation of this result is that 62.5% of the _____ sample variation in cellular phone sales is _________ by the linear relation with time in years.

    The remaining 37.5% of the variation in sales is still ___________.

    Most likely some other ________ has been _______ from the regression model that could explain some additional portion of the variation. (“Multiple regression model” is required.)
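The same result in Python, using the sums of squares from the notes:

```python
# Coefficient of determination for the cellular phone example.
ssr = 16.33
sst = 26.14

r_squared = ssr / sst
print(round(r_squared, 3))       # 0.625  (explained share)
print(round(1 - r_squared, 3))   # 0.375  (unexplained share)
```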


    Section 13.6 ___________ Analysis

    The measurement of how well two or more variables ____ together is called ___________ analysis.

    One measure of the population relationship between two random variables is the population covariance:

    Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)]

    Usually this measure is ___ a ____ indicator of the relative ______ of the relationship between two variables because its magnitude depends on the _____ used to measure the variables.

    Hence, it is necessary to standardize the _________ of two variables in order to have a ____ measure of fit. (____-____ measure.)

    Standardization is carried out by dividing the ___________ of X and Y by the standard deviation of X times the standard deviation of Y:

    Population Correlation Coefficient:

    ρ = Cov(X, Y) / (σ_X σ_Y) = Covariance of X and Y / [(Std. Dev. of X) * (Std. Dev. of Y)]


    Three Special Cases:

    Case #1: _______ ________ Correlation: All values of X and Y fall on a __________ sloped straight line: ρ = ___

    [Figure: points on a straight line]

    Case #2: _______ ________ Correlation: All values of X and Y fall on a __________ sloped straight line: ρ = ___

    [Figure: points on a straight line]

    Case #3: ____ Correlation: If X and Y are ___ linearly related, since they are ____________ random variables, then the value of the correlation coefficient will be ____, since C(X, Y) = 0. ρ = __.

    Thus ρ measures the strength of the ______ association between X and Y.

    Values of ρ close to ____ indicate a ____ relation;
    Values close to +1 indicate a ______ positive relation;
    Values close to -1 indicate a _______ negative correlation.


    Sample Correlation Coefficient

    We employ sample data to estimate the population parameter ρ.

    The ______ correlation coefficient is denoted by the letter ‘r’.

    We determine the value of r in the same manner as ρ, except that we substitute for each population parameter its best estimate based on the sample data.

    Sample Covariance, S_XY:

    S_XY = [1/(n − 1)] Σ_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ)

    Let S_X and S_Y represent the sample standard deviations.

    Hence, the Sample Correlation Coefficient:

    r = S_XY / (S_X * S_Y)

      = (Cov of sample values of X and Y) / [(Sample standard deviation of X) * (Sample standard deviation of Y)]

      = [ (1/(n − 1)) Σ_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ) ] / [ √( (1/(n − 1)) Σ_{i=1}^{n} (X_i − X̄)² ) * √( (1/(n − 1)) Σ_{i=1}^{n} (Y_i − Ȳ)² ) ]

    Cancelling the common 1/(n − 1) factors:

    r = SC_XY / √(SS_X * SS_Y) = Σ_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ) / √[ Σ_{i=1}^{n} (X_i − X̄)² * Σ_{i=1}^{n} (Y_i − Ȳ)² ] = Sample covariation / Sample variations
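A minimal sketch with a made-up sample shows the formula in action (the data are hypothetical, not from the notes):

```python
from math import sqrt

# Hypothetical sample to illustrate the sample correlation coefficient.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sample covariance S_XY = sum((xi - x_bar)(yi - y_bar)) / (n - 1)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

# Sample standard deviations (divisor n - 1)
s_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
s_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))

r = s_xy / (s_x * s_y)
print(round(r, 4))   # 0.7746
```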


    The Connection Between Correlation and Regression

    1) The correlation coefficient and the coefficient of determination measure the ________ of the ___________ between X and Y.

    For a model with only one ___________ variable, the measure R² is the ______ of the correlation measure r, as they measure the same relationship.

    When there is a weak relationship, both measures are close to ____.

    A ______ relation is indicated between the two variables X and Y when r approaches -1 or +1.

    Since the square of r is always ________, R² approaches +1 when there is a ______ relation.

    2) The correlation coefficient provides information about the _________ of the association between X and Y. I.e. either positively or negatively related.

    A positive correlation must always correspond to a regression line with a positive _____ (b), indicating a direct relation.

    A ________ correlation corresponds to an _______ relation.
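A made-up sample can illustrate points 1) and 2): for simple regression, R² computed from the regression decomposition equals r², and the slope b carries the same sign as r. The data are hypothetical.

```python
from math import sqrt

# Hypothetical sample (not from the notes).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

r = sxy / sqrt(sxx * syy)   # sample correlation coefficient
b = sxy / sxx               # least-squares slope

# R² from the regression: SSR/SST, with SSR = b^2 * sum((xi - x_bar)^2)
r_squared = b ** 2 * sxx / syy

print(round(r_squared, 4), round(r ** 2, 4))   # 0.6 0.6
print((b > 0) == (r > 0))                      # True
```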


    3) In correlation analysis the X and Y variables are assumed to have a probability distribution.

    In regression analysis, the independent variable X can be considered as a set of given values. When X is an independent random variable, correlation between X and Y is ____.

    4) In correlation analysis there is no specified “dependent” and “independent” variable designation.

    In regression analysis we impose a model, such that, Y is being _________ by X.

    If the dependent and independent variables are switched, the regression model will ____ be the same, although the sign on the _____ parameter will remain the same.

    The End