regression_3d


  • 8/7/2019 regression_3D

    1/50

    1

Simple Linear Regression and Correlation

    Chapter 17


17.2 The Model

The first-order linear model:  y = β0 + β1x + ε

y = dependent variable
x = independent variable
β0 = y-intercept
β1 = slope of the line (β1 = Rise/Run)
ε = error variable

β0 and β1 are unknown population parameters; therefore, they are estimated from the data.


17.3 Estimating the Coefficients

The estimates are determined by
drawing a sample from the population of interest,
calculating sample statistics, and
producing a straight line that cuts into the data.

The question is: which straight line fits best?

The best line is the one that minimizes the sum of squared vertical differences between the points and the line.

Let us compare two lines for the points (1,2), (2,4), (3,1.5), (4,3.2).

Line 1: sum of squared differences = (2 − 1)² + (4 − 2)² + (1.5 − 3)² + (3.2 − 4)² = 7.89

Line 2 (the horizontal line y = 2.5): sum of squared differences = (2 − 2.5)² + (4 − 2.5)² + (1.5 − 2.5)² + (3.2 − 2.5)² = 3.99

The smaller the sum of squared differences, the better the fit of the line to the data.
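The comparison above can be sketched in a few lines of Python. The points come from the slide; the equation of the first candidate line is not stated explicitly, so here it is assumed to be y = x, which reproduces the squared terms shown.

```python
points = [(1, 2), (2, 4), (3, 1.5), (4, 3.2)]

def sum_sq_diff(points, line):
    """Sum of squared vertical differences between the points and a line."""
    return sum((y - line(x)) ** 2 for x, y in points)

# Candidate 1 (assumed): the line y = x, matching the terms (2-1)^2 + ... above.
sse1 = sum_sq_diff(points, lambda x: x)
# Candidate 2: the horizontal line y = 2.5.
sse2 = sum_sq_diff(points, lambda x: 2.5)

print(sse1)  # ≈ 7.89
print(sse2)  # ≈ 3.99
```

The horizontal line wins this comparison (3.99 < 7.89); least squares simply extends the same comparison over every possible line.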


To calculate the estimates of the coefficients that minimize the differences between the data points and the line, use the formulas:

b1 = cov(X,Y) / s_x²
b0 = ȳ − b1·x̄

The regression equation that estimates the equation of the first-order linear model is:

ŷ = b0 + b1x


Example 17.1: Relationship between odometer reading and a used car's selling price.

A car dealer wants to find the relationship between the odometer reading and the selling price of used cars.

A random sample of 100 cars is selected, and the data recorded. Find the regression line.

Car   Odometer   Price
...   ...        ...
5     31705      5784
6     34010      5359
...   ...        ...

Independent variable: x (odometer). Dependent variable: y (price).


Solution: solving by hand

To calculate b0 and b1 we need to calculate several statistics first (where n = 100):

x̄ = 36,009.45
ȳ = 5,411.41
cov(X,Y) = Σ(xᵢ − x̄)(yᵢ − ȳ) / (n − 1) = −1,356,256
s_x² = Σ(xᵢ − x̄)² / (n − 1) = 43,528,688

b1 = cov(X,Y) / s_x² = −1,356,256 / 43,528,688 = −.0312
b0 = ȳ − b1·x̄ = 5,411.41 − (−.0312)(36,009.45) = 6,533

ŷ = b0 + b1x = 6,533 − .0312x
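The hand calculation can be checked with a short script built only from the summary statistics quoted above (the data file itself is not reproduced here):

```python
# Least squares coefficients from the summary statistics of Example 17.1.
x_bar = 36_009.45      # sample mean of odometer readings
y_bar = 5_411.41       # sample mean of prices
cov_xy = -1_356_256    # sample covariance; negative: price falls as mileage rises
s_x2 = 43_528_688      # sample variance of x

b1 = cov_xy / s_x2       # slope
b0 = y_bar - b1 * x_bar  # intercept

print(round(b1, 4))  # -0.0312
print(round(b0))     # 6533
```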

Using the computer (see file Xm17-01.xls):

SUMMARY OUTPUT

Regression Statistics
Multiple R         0.806308
R Square           0.650132
Adjusted R Square  0.646562
Standard Error     151.5688
Observations       100

ANOVA
            df    SS        MS        F         Significance F
Regression  1     4183528   4183528   182.1056  4.4435E-24
Residual    98    2251362   22973.09
Total       99    6434890

           Coefficients  Standard Error  t Stat    P-value
Intercept   6533.383     84.51232        77.30687  1.22E-89
Odometer   -0.03116      0.002309       -13.4947   4.44E-24

ŷ = 6,533 − .0312x

[Scatter plot of Price (4500 to 6000) against Odometer (19000 to 49000) with the fitted line.]

Excel: Tools > Data Analysis > Regression > [specify the y range and the x range] > OK.

This is the slope of the line: for each additional mile on the odometer, the price decreases by an average of $0.0312.

ŷ = 6,533 − .0312x

The intercept is b0 = 6533. Because there are no data in the region around x = 0, do not interpret the intercept as the price of cars that have not been driven.

[Scatter plot of Price against Odometer; the fitted line extrapolated back to the intercept at 6533.]

17.4 Error Variable: Required Conditions

The error ε is a critical part of the regression model. Four requirements involving the distribution of ε must be satisfied:

The probability distribution of ε is normal.
The mean of ε is zero: E(ε) = 0.
The standard deviation of ε is σ_ε for all values of x.
The errors associated with different values of y are all independent.

From the first three assumptions we have: y is normally distributed with mean E(y) = β0 + β1x and a constant standard deviation σ_ε.

[Figure: normal curves of y centered at E(y|x1) = β0 + β1x1, E(y|x2) = β0 + β1x2, E(y|x3) = β0 + β1x3. The standard deviation remains constant, but the mean value changes with x.]

17.5 Assessing the Model

The least squares method will produce a regression line whether or not there is a linear relationship between x and y. Consequently, it is important to assess how well the linear model fits the data.

Several methods are used to assess the model:
Testing and/or estimating the coefficients.
Using descriptive measurements.

Sum of squares for errors

This is the sum of squared differences between the points and the regression line. It can serve as a measure of how well the line fits the data. This statistic plays a role in every statistical technique we employ to assess the model.

SSE = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

Shortcut formula: SSE = (n − 1)(s_y² − [cov(X,Y)]² / s_x²)

Standard error of estimate

The mean error is equal to zero. If σ_ε is small the errors tend to be close to zero (close to the mean error), and the model fits the data well. Therefore, we can use σ_ε as a measure of the suitability of using a linear model.

An unbiased estimator of σ_ε² is given by s_ε².

Standard Error of Estimate:  s_ε = sqrt(SSE / (n − 2))

Example 17.2

Calculate the standard error of estimate for Example 17.1, and describe what it tells you about the model fit.

Solution

s_y² = Σ(yᵢ − ȳ)² / (n − 1) = 6,434,890 / 99 = 64,999   (calculated before)

SSE = (n − 1)(s_y² − [cov(X,Y)]² / s_x²) = 99(64,999 − (1,356,256)² / 43,528,688) = 2,252,363

Thus, s_ε = sqrt(SSE / (n − 2)) = sqrt(2,252,363 / 98) = 151.6

It is hard to assess the model based on s_ε even when compared with the mean value of y: s_ε = 151.6, ȳ = 5,411.4.

Testing the slope

When no linear relationship exists between two variables, the regression line should be horizontal.

Linear relationship: different inputs (x) yield different outputs (y); the slope is not equal to zero.
No linear relationship: different inputs (x) yield the same output (y); the slope is equal to zero.

We can draw inference about β1 from b1 by testing

H0: β1 = 0
H1: β1 ≠ 0 (or < 0, or > 0)

The test statistic is t = (b1 − β1) / s_b1, where s_b1 = s_ε / sqrt((n − 1)s_x²) is the standard error of b1.

If the error variable is normally distributed, the statistic has a Student t distribution with d.f. = n − 2.

Solution: solving by hand

To compute t we need the values of b1 and s_b1.

b1 = −.0312
s_b1 = s_ε / sqrt((n − 1)s_x²) = 151.6 / sqrt((99)(43,528,688)) = .00231
t = (b1 − β1) / s_b1 = −.0312 / .00231 = −13.49

Using the computer: the Excel coefficients table (shown earlier) gives t = −13.4947 for Odometer, with P-value 4.44E-24.

There is overwhelming evidence to infer that the odometer reading affects the auction selling price.
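The same t computation can be sketched directly, here using the Excel value b1 = -0.03116 (the hand value -.0312 is a rounding of it):

```python
import math

b1 = -0.03116          # slope estimate (from the Excel output)
s_eps = 151.6          # standard error of estimate
n = 100
s_x2 = 43_528_688      # sample variance of x

s_b1 = s_eps / math.sqrt((n - 1) * s_x2)  # standard error of b1
t = (b1 - 0) / s_b1                       # test statistic under H0: beta1 = 0

print(round(s_b1, 5))  # 0.00231
print(round(t, 2))     # -13.49
```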

Coefficient of determination

When we want to measure the strength of the linear relationship, we use the coefficient of determination:

R² = [cov(X,Y)]² / (s_x² s_y²)     or     R² = 1 − SSE / Σ(yᵢ − ȳ)²

To understand the significance of this coefficient, note that the overall variability in y is decomposed into two parts: the part accounted for by the regression model, and the error.

Two data points (x1,y1) and (x2,y2) of a certain sample are shown.

Total variation in y:                       (y1 − ȳ)² + (y2 − ȳ)²
Variation explained by the regression line: (ŷ1 − ȳ)² + (ŷ2 − ȳ)²
Unexplained variation (error):              (y1 − ŷ1)² + (y2 − ŷ2)²

Total variation in y = Variation explained by the regression line + Unexplained variation (error)

R² measures the proportion of the variation in y that is explained by the variation in x:

R² = 1 − SSE / Σ(yᵢ − ȳ)² = [Σ(yᵢ − ȳ)² − SSE] / Σ(yᵢ − ȳ)² = SSR / Σ(yᵢ − ȳ)²

Variation in y = SSR + SSE

R² takes on any value between zero and one.
R² = 1: perfect match between the line and the data points.
R² = 0: there is no linear relationship between x and y.
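For Example 17.1, both forms give the same value, which can be checked from the quantities already computed (s_y² = 64,999; SSE taken from the Excel ANOVA table):

```python
# R^2 two ways, from the Example 17.1 quantities.
cov_xy = -1_356_256
s_x2 = 43_528_688
s_y2 = 64_999          # = 6,434,890 / 99
sse = 2_251_362        # residual SS from the Excel ANOVA table
ss_total = 6_434_890   # total SS = sum of (y_i - y_bar)^2

r2_from_cov = cov_xy ** 2 / (s_x2 * s_y2)
r2_from_sse = 1 - sse / ss_total

print(round(r2_from_cov, 3))  # 0.65
print(round(r2_from_sse, 3))  # 0.65
```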


17.6 Finance Application: Market Model

One of the most important applications of linear regression is the market model.

It is assumed that the rate of return on a stock (R) is linearly related to the rate of return on the overall market (Rm):

R = β0 + β1Rm + ε

R = rate of return on a particular stock; Rm = rate of return on some major stock index.

The beta coefficient (β1) measures how sensitive the stock's rate of return is to changes in the level of the overall market.

Example 17.5: The market model

Estimate the market model for Nortel, a stock traded on the Toronto Stock Exchange (TSE). Data consisted of monthly percentage returns for Nortel and monthly percentage returns for all the stocks.

SUMMARY OUTPUT

Regression Statistics
Multiple R         0.560079
R Square           0.313688
Adjusted R Square  0.301855
Standard Error     0.063123
Observations       60

ANOVA
            df    SS        MS        F         Significance F
Regression  1     0.10563   0.10563   26.50969  3.27E-06
Residual    58    0.231105  0.003985
Total       59    0.336734

           Coefficients  Standard Error  t Stat    P-value
Intercept   0.012818     0.008223        1.558903  0.12446
TSE         0.887691     0.172409        5.148756  3.27E-06

The slope (beta) is a measure of the stock's market-related risk: in this sample, for each 1% increase in the TSE return, the average increase in Nortel's return is .8877%.

R² is a measure of the total risk embedded in the Nortel stock that is market-related: specifically, 31.37% of the variation in Nortel's returns is explained by the variation in the TSE's returns.

17.7 Using the Regression Equation

Before using the regression model, we need to assess how well it fits the data. If we are satisfied with how well the model fits the data, we can use it to make predictions for y.

Illustration: predict the selling price of a three-year-old Taurus with 40,000 miles on the odometer (Example 17.1).

ŷ = 6533 − .0312x = 6533 − .0312(40,000) = 5,285

Prediction interval and confidence interval

Two intervals can be used to discover how closely the predicted value will match the true value of y:

Prediction interval - for a particular value of y.
Confidence interval - for the expected value of y.

The confidence interval:  ŷ ± t_{α/2} s_ε sqrt(1/n + (x_g − x̄)² / Σ(xᵢ − x̄)²)

The prediction interval:  ŷ ± t_{α/2} s_ε sqrt(1 + 1/n + (x_g − x̄)² / Σ(xᵢ − x̄)²)

The prediction interval is wider than the confidence interval.
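The two intervals differ only in the extra 1 under the square root. A sketch for Example 17.1 at x_g = 40,000, using quantities quoted on the surrounding slides (t_.025,98 ≈ 1.984):

```python
import math

y_hat = 6533 - 0.0312 * 40_000   # point prediction
t_crit = 1.984                   # t_.025,98 (approximate)
s_eps = 151.6
n = 100
x_bar = 36_009
sxx = 4_309_340_160              # sum of (x_i - x_bar)^2
x_g = 40_000

stretch = 1 / n + (x_g - x_bar) ** 2 / sxx
half_ci = t_crit * s_eps * math.sqrt(stretch)       # confidence half-width
half_pi = t_crit * s_eps * math.sqrt(1 + stretch)   # prediction half-width

print(round(y_hat))    # 5285
print(round(half_ci))  # 35
print(round(half_pi))  # 303
```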

Example 17.6: Interval estimates for the car auction price

Provide an interval estimate for the bidding price on a Ford Taurus with 40,000 miles on the odometer.

Solution: the dealer would like to predict the price of a single car, so the prediction interval (95%) is appropriate:

ŷ ± t_{α/2} s_ε sqrt(1 + 1/n + (x_g − x̄)² / Σ(xᵢ − x̄)²)
= [6533 − .0312(40,000)] ± 1.984(151.6) sqrt(1 + 1/100 + (40,000 − 36,009)² / 4,309,340,160)
= 5,285 ± 303

(t_{.025,98} ≈ 1.984)

The car dealer wants to bid on a lot of 250 Ford Tauruses, where each car has been driven for about 40,000 miles.

Solution: the dealer needs to estimate the mean price per car, so the confidence interval (95%) is appropriate:

ŷ ± t_{α/2} s_ε sqrt(1/n + (x_g − x̄)² / Σ(xᵢ − x̄)²)
= [6533 − .0312(40,000)] ± 1.984(151.6) sqrt(1/100 + (40,000 − 36,009)² / 4,309,340,160)
= 5,285 ± 35

The effect of the given value of x on the interval

As x_g moves away from x̄, the interval becomes longer. That is, the shortest interval is found at x_g = x̄.

ŷ = b0 + b1x_g

[Figure: confidence intervals plotted at x_g = x̄, x_g = x̄ ± 1, and x_g = x̄ ± 2; the interval widens as x_g moves away from x̄.]

17.8 Coefficient of Correlation

The coefficient of correlation is used to measure the strength of association between two variables. The coefficient values range between -1 and 1.

If r = -1 (negative association) or r = +1 (positive association), every point falls on the regression line.
If r = 0 there is no linear pattern.

The coefficient can be used to test for a linear relationship between two variables.

Testing the coefficient of correlation

When there is no linear relationship between two variables, ρ = 0.

The hypotheses are:
H0: ρ = 0
H1: ρ ≠ 0

The test statistic is t = r sqrt((n − 2) / (1 − r²)), where r is the sample coefficient of correlation, calculated by r = cov(X,Y) / (s_x s_y).

The statistic is Student t distributed with d.f. = n − 2, provided the variables are bivariate normally distributed.

Example 17.7: Testing for linear relationship

Test the coefficient of correlation to determine if a linear relationship exists in the data of Example 17.1.

Solution: we test H0: ρ = 0 against H1: ρ ≠ 0.

Solving by hand: the rejection region is |t| > t_{α/2,n−2} = t_{.025,98} ≈ 1.984.

The sample coefficient of correlation: r = cov(X,Y) / (s_x s_y) = −.806

The value of the t statistic: t = r sqrt((n − 2) / (1 − r²)) = −13.49

Conclusion: there is sufficient evidence at α = 5% to infer that there is a linear relationship between the two variables.
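A minimal check of the statistic from r and n alone:

```python
import math

r = -0.806   # sample coefficient of correlation, from the example
n = 100

t = r * math.sqrt((n - 2) / (1 - r ** 2))
print(round(t, 2))  # -13.48 (the slide reports -13.49, computed from an unrounded r)
```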

Spearman rank correlation coefficient

The Spearman rank test is used to test whether a relationship exists between variables in cases where
at least one variable is ranked, or
both variables are quantitative but the normality requirement is not satisfied.

The hypotheses are:
H0: ρ_s = 0
H1: ρ_s ≠ 0

The test statistic is r_s = cov(a,b) / (s_a s_b), where a and b are the ranks of the data.

For a large sample (n > 30), r_s is approximately normally distributed and z = r_s sqrt(n − 1).

Example 17.8

A production manager wants to examine the relationship between
the aptitude test score given prior to hiring, and
the performance rating three months after starting work.

A random sample of 20 production workers was selected. The test scores as well as the performance ratings were recorded.

Employee   Aptitude test   Performance rating
1          59              3
2          47              2
3          58              4
4          66              3
5          77              2
.          .               .
.          .               .

Aptitude test scores range from 0 to 100; performance ratings range from 1 to 5.

Solution

The problem objective is to analyze the relationship between two variables. Performance rating is ranked.

The hypotheses are:
H0: ρ_s = 0
H1: ρ_s ≠ 0

The test statistic is r_s, and the rejection region is |r_s| > r_critical (taken from the Spearman rank correlation table).

Solving by hand

Rank each variable separately; ties are broken by averaging the ranks.

[Partial table of ranks: aptitude ranks 9, 3, 8, 14, 2, ...; performance ranks 10.5, 3.5, 17, 10.5, 3.5, ...]

Calculate s_a = 5.92, s_b = 5.50, cov(a,b) = 12.34.

Thus r_s = cov(a,b) / (s_a s_b) = .379.

The critical value for α = .05 and n = 20 is .450.

Conclusion: do not reject the null hypothesis. At the 5% significance level there is insufficient evidence to infer that the two variables are related to one another.

Residual Analysis

Examining the residuals (or standardized residuals), we can identify violations of the required conditions.

Example 17.1 - continued: Nonnormality.

Use Excel to obtain the standardized residual histogram. Examine the histogram and look for a bell-shaped histogram with mean close to zero.

For each residual we calculate the standard deviation as follows:

s_{r_i} = s_ε sqrt(1 − h_i),   where h_i = 1/n + (x_i − x̄)² / Σ(x_j − x̄)²

Standardized residual i = residual i / standard deviation of residual i

[Partial list of standardized residuals and the standardized-residual histogram.]

We can also apply the Lilliefors test or the χ² test of normality.

Heteroscedasticity

When the requirement of a constant variance is violated we have heteroscedasticity.

[Residual plot: residuals (y − ŷ) against ŷ; the spread increases with ŷ.]

When the requirement of a constant variance is not violated we have homoscedasticity.

[Residual plot against ŷ: the spread of the data points does not change much.]

When the requirement of a constant variance is not violated we have homoscedasticity.

[Residual plot against ŷ: as far as the even spread is concerned, this is a much better situation.]

Nonindependence of error variables

A time series is constituted if data were collected over time.

If the errors are independent, no pattern should be observed when the residuals are examined over time. When a pattern is detected, the errors are said to be autocorrelated.

Autocorrelation can be detected by graphing the residuals against time.

[Residual-vs-time plots: patterns in the appearance of the residuals over time indicate that autocorrelation exists. One plot shows runs of positive residuals replaced by runs of negative residuals; the other shows oscillating behavior of the residuals around zero.]

Outliers

An outlier is an observation that is unusually small or large.

Several possibilities need to be investigated when an outlier is observed:
There was an error in recording the value.
The point does not belong in the sample.
The observation is valid.

Identify outliers from the scatter diagram. It is customary to suspect that an observation is an outlier if its |standardized residual| > 2.

[Scatter plots: an outlier causes a shift in the regression line, and some outliers may be very influential. One panel shows an outlier; the other shows an influential observation.]