
Correlation And Regression

LOS a Sample covariance and sample correlation

Sample covariance

- Measures how two variables move together
- Captures the linear relationship between two variables
- $\text{Cov}(x,y) = \dfrac{\sum (X - \bar{X})(Y - \bar{Y})}{n - 1}$
- Unit = %²
- Range = −∞ to +∞
- +ve covariance = Variables tend to move together
- −ve covariance = Variables tend to move in opposite directions

Sample correlation

- Measures strength of linear relationship between two variables
- Standardized measure of covariance
- $r = \dfrac{\text{Cov}(x,y)}{S_x \times S_y}$, which is the same as $\text{Cov}(x,y) = r \times S_x \times S_y$
- Unit = No unit
- Range = −1 to +1
- r = 1 means perfectly +ve correlation
- r = 0 means no linear relationship
- r = −1 means perfectly −ve correlation
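The two formulas can be checked with a short sketch. The data below are made-up values chosen for illustration (they are not from the notes), and the function names are hypothetical.

```python
import math

def sample_covariance(xs, ys):
    """Sum of cross-deviations from the means, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def sample_std(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

def sample_correlation(xs, ys):
    """Covariance standardized by the two standard deviations."""
    return sample_covariance(xs, ys) / (sample_std(xs) * sample_std(ys))

# Hypothetical data: y is exactly 2x, so the correlation should be +1
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
cov_xy = sample_covariance(x, y)
r_xy = sample_correlation(x, y)
print(round(cov_xy, 4), round(r_xy, 4))
```

Because y moves in lockstep with x here, the covariance is positive and the correlation sits at the top of its range.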

LOS b Limitations to correlation analysis

Scatter plot: Graph that shows the relationship between values of two variables

- +ve covariance, +ve correlation, +ve slope
- −ve covariance, −ve correlation, −ve slope

Limitations:

- Outliers: Extremely large or small values may influence the estimate of correlation
- Spurious correlation: Appearance of causal linear relationship but no economic relationship exists
- Nonlinear relationship: Correlation measures only linear relationships, not nonlinear ones

LOS c Test of the hypothesis that the population correlation coefficient equals zero

Eg. r = 0.4, n = 62, confidence level = 95%. Perform a test of significance.

Step 1: Define hypothesis

$H_0: r = 0$, $H_a: r \neq 0$ (for the population ‘r’)

Step 2: Calculate test statistic

$t = \dfrac{r \times \sqrt{n - 2}}{\sqrt{1 - r^2}} = \dfrac{0.4 \times \sqrt{62 - 2}}{\sqrt{1 - 0.4^2}} \approx 3.38$

Step 3: Calculate critical values

t-distribution, DoF = n − 2 = 60: critical values are −2 and +2

Since calculated test statistic lies outside the range −2 to +2, conclusion is ‘Reject the null hypothesis’

‘r’ is statistically significant, which means that population ‘r’ is different from zero
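A minimal sketch of the three steps for r = 0.4 and n = 62. The critical value of 2 is taken from the notes as given rather than looked up in a t-table, and the function name is hypothetical.

```python
import math

def correlation_t_stat(r, n):
    """Test statistic for H0: population correlation = 0, with n - 2 DoF."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

t_stat = correlation_t_stat(0.4, 62)

# Approximate two-tailed critical value for DoF = 60, as used in the notes
critical_value = 2.0
reject_null = abs(t_stat) > critical_value

print(round(t_stat, 2), reject_null)
```

With these inputs the statistic evaluates to about 3.38, which lies outside the −2 to +2 range, so the null is rejected.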

https://www.fintreeindia.com/ © 2017 FinTree Education Pvt. Ltd.

FinTree

LOS d Dependent variable and independent variable

Dependent variable

- Variable you are seeking to explain
- Also referred to as explained variable/endogenous variable/predicted variable

Independent variable

- Variable you are using to explain changes in the dependent variable
- Also referred to as explanatory variable/exogenous variable/predicting variable

Eg. $R_p = RFR + \beta (R_m - RFR)$

- Dependent variable: $R_p$
- Independent variable: $(R_m - RFR)$
- Intercept: RFR
- Slope: β

[Figure: regression line plot with the dependent variable on the y-axis]

LOS e Assumptions underlying linear regression

1. Relationship between dependent and independent variable is linear
2. Independent variable is uncorrelated with the error term
3. Expected value of the error term is zero
4. Variance of the error term is constant (NOT ZERO). The economic relationship b/w variables is intact for the entire time period (eg. change in political regime)
5. Error term is uncorrelated with other observations (eg. seasonality)
6. Error term is normally distributed

LOS f Standard error of estimate, coefficient of determination and confidence interval for regression coefficient

Sum of squared errors (SSE): Sum of the squared vertical distances between the estimated and actual Y-values

Regression line: Line that minimizes the SSE

Slope coefficient (beta): Describes change in ‘y’ for one unit change in ‘x’

$b_1 = \dfrac{\text{Cov}(x,y)}{\text{Variance}(x)}$

Eg. Sum of squared errors (SSE)

‘x’               10      15      20      30
Actual ‘y’        17      19      35      45
Predicted ‘y’     15.81   23.36   30.91   46.01
Errors            1.19    −4.36   4.09    −1.01
Squared errors    1.416   19      16.73   1.02     Total (SSE) = 38.166

Standard error of estimate (SEE) = $\sqrt{\dfrac{SSE}{n - 2}} = \sqrt{\dfrac{38.166}{2}} \approx 4.37$

Coefficient of determination ($R^2$): Percentage of variation in the dependent variable explained by the independent variable

For simple linear regression, $R^2 = r^2$
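The SSE and SEE arithmetic from the example can be reproduced in a few lines. This sketch uses the actual and predicted values from the table; because it does not round each squared error first, the total comes out at about 38.17 rather than the 38.166 shown in the notes.

```python
import math

actual = [17, 19, 35, 45]
predicted = [15.81, 23.36, 30.91, 46.01]

# Error = actual - predicted; SSE = sum of squared errors
errors = [a - p for a, p in zip(actual, predicted)]
sse = sum(e ** 2 for e in errors)

# SEE divides by n - 2 because a simple regression estimates two coefficients
n = len(actual)
see = math.sqrt(sse / (n - 2))

print(round(sse, 3), round(see, 2))
```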


Confidence interval for regression coefficient

$\hat{b}_1 \pm (t_c \times SE)$

where $\hat{b}_1$ = slope, $t_c$ = critical value (t-value), SE = standard error

Eg. $\hat{b}_1$ = 0.48, SE = 0.35, n = 42. Calculate 90% confidence interval.

Confidence interval: 0.48 ± (1.684 × 0.35) = −0.109 to 1.069

LOS g Hypothesis testing for population value of a regression coefficient

Eg. $\hat{b}_1$ = 0.48, SE = 0.35, n = 42, confidence level = 90%. Perform a test of significance.

Step 1: Define hypothesis

$H_0: b_1 = 0$, $H_a: b_1 \neq 0$

Step 2: Calculate test statistic

$t = \dfrac{\text{Sample stat.} - \text{HV}}{\text{Std. error}} = \dfrac{0.48 - 0}{0.35} = 1.371$ (HV = hypothesized value)

Step 3: Calculate critical values

t-distribution, DoF = 40: critical values are −1.684 and 1.684

Since calculated test statistic lies inside the range, conclusion is ‘Failed to reject the null hypothesis’

Slope is not significantly different from zero

LOS h & i Predicted value of dependent variable and confidence interval for the predicted value

Predicted value: $\hat{Y} = \hat{b}_0 + \hat{b}_1 \times X_p$

where $\hat{Y}$ = predicted value (y), $\hat{b}_0$ = intercept, $\hat{b}_1$ = slope, $X_p$ = forecasted value (x)

Confidence interval: $\hat{Y} \pm (t_c \times SE)$

where $\hat{Y}$ = predicted value (y), SE = standard error

Eg. Forecasted return (x) = 12%, intercept = −4%, slope = 0.75, standard error = 2.68, n = 32. Calculate predicted value (y) and 95% confidence interval.

$\hat{Y} = -4 + 0.75 \times 12 = 5\%$

Confidence interval: 5 ± (2.042 × 2.68) = −0.472 to 10.472
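The confidence-interval and t-test examples on this page all follow the same "estimate ± critical value × standard error" pattern. A small sketch, with the critical values (1.684 and 2.042) taken from the notes as given and a hypothetical helper name:

```python
def confidence_interval(estimate, t_critical, std_error):
    """Return (lower, upper) = estimate -/+ t_critical * std_error."""
    margin = t_critical * std_error
    return estimate - margin, estimate + margin

# 90% confidence interval for the slope: b1 = 0.48, SE = 0.35
slope_ci = confidence_interval(0.48, 1.684, 0.35)

# Test statistic for H0: b1 = 0
t_stat = (0.48 - 0) / 0.35
reject_null = abs(t_stat) > 1.684

# Predicted value and its 95% confidence interval
predicted_y = -4 + 0.75 * 12
forecast_ci = confidence_interval(predicted_y, 2.042, 2.68)

print(slope_ci, round(t_stat, 3), reject_null, predicted_y, forecast_ci)
```

The slope interval contains zero, which agrees with the t-test failing to reject the null.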


LOS j Analysis of variance (ANOVA)

$\bar{Y}$: Mean, $Y_i$: Actual value, $\hat{Y}_i$: Predicted value

Sum of squared errors (SSE)

- Measures unexplained variation
- aka sum of squared residuals
- $SSE = \sum (Y_i - \hat{Y}_i)^2$

Regression sum of squares (RSS)

- Measures explained variation
- $RSS = \sum (\hat{Y}_i - \bar{Y})^2$

Total sum of squares (SST)

- Measures total variation
- $SST = \sum (Y_i - \bar{Y})^2$

- Higher the RSS, better the quality of regression
- $R^2$ = RSS/SST
- $R^2$ = Explained variation/Total variation

ANOVA Table

Source of variation        DoF          Sum of squares    Mean sum of squares
Regression (explained)     k            RSS               MSR = RSS/k
Error (unexplained)        n − k − 1    SSE               MSE = SSE/(n − k − 1)
Total                      n − 1        SST

F-statistic = MSR/MSE with ‘k’ and ‘n − k − 1’ DoF

When to use F-test and t-test

$Y = b_0 + b_1 x_1 + b_2 x_2 + \varepsilon$

- t-test: applied to each slope coefficient ($b_1$, $b_2$) individually
- F-test: applied to all slope coefficients together

LOS k Limitations of regression analysis

- Linear relationships can change over time (parameter instability)
- Public knowledge of a regression relationship may make it less useful in the future
- If the regression assumptions are violated, hypothesis tests will not be valid (heteroscedasticity and autocorrelation)


Multiple Regression And Issues In Regression Analysis

LOS a Multiple regression equation

$Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k + \varepsilon$

where Y = dependent variable, $b_0$ = intercept, $b_1 \dots b_k$ = slopes, ε = error term

LOS b Interpreting estimated regression coefficients

Intercept term: Value of dependent variable when all independent variables are equal to zero

Slope coefficient: Measures how much dependent variable changes when independent variable changes by one unit, holding other independent variables constant

LOS c & d Hypothesis testing for population value of a regression coefficient

Eg. $\hat{b}_1$ = 0.15, SE = 0.38; $\hat{b}_2$ = 0.28, SE = 0.043; n = 43, confidence level = 90%. Perform a test of significance.

Step 1: Define hypothesis

$H_0: b_1 = 0$, $H_a: b_1 \neq 0$
$H_0: b_2 = 0$, $H_a: b_2 \neq 0$

Step 2: Calculate test statistic

$t = \dfrac{\text{Sample stat.} - \text{HV}}{\text{Std. error}}$ (HV = hypothesized value)

For $b_1$: $\dfrac{0.15 - 0}{0.38} \approx 0.395$

For $b_2$: $\dfrac{0.28 - 0}{0.043} \approx 6.512$

Step 3: Calculate critical values

t-distribution, DoF = n − k − 1 = 43 − 2 − 1 = 40: critical values are −1.684 and 1.684

Since calculated test statistic for $b_1$ lies inside the range, conclusion is ‘Failed to reject the null hypothesis’

And test statistic for $b_2$ lies outside the range, conclusion is ‘Reject the null hypothesis’

Slope ‘$b_1$’ is not significantly different from zero and slope ‘$b_2$’ is significantly different from zero

Solution is to drop the variable with slope ‘$b_1$’


P-value

P-value is the lowest level of significance at which null hypothesis can be rejected

- P-value lower than significance level: Reject the null hypothesis
- P-value higher than significance level: Fail to reject (FTR) the null hypothesis

[Figure: height analogy comparing P-value with significance level, with each case labelled FTR or Reject]

LOS e Confidence interval for regression coefficient and predicted value of dependent variable

Confidence interval for regression coefficient: $\hat{b}_1 \pm (t_c \times SE)$

where $\hat{b}_1$ = slope, SE = standard error

Predicted value of dependent variable: $\hat{Y} = \hat{b}_0 + \hat{b}_1 \hat{X}_1 + \hat{b}_2 \hat{X}_2 + \dots + \hat{b}_k \hat{X}_k$

where $\hat{Y}$ = predicted value (y), $\hat{b}_0$ = intercept, $\hat{b}_1 \dots \hat{b}_k$ = slopes, $\hat{X}_1 \dots \hat{X}_k$ = forecasted values (x)

LOS f Assumptions of a multiple regression model

1. Relationship between dependent and independent variable is linear
2. Independent variables are uncorrelated with the error term and there is no exact linear relation between two or more independent variables
3. Expected value of the error term is zero
4. Variance of the error term is constant (NOT ZERO). The economic relationship b/w variables is intact for the entire time period (eg. change in political regime)
5. Error term is uncorrelated with other observations (eg. seasonality)
6. Error term is normally distributed

LOS g F-statistic

- F-statistic = MSR/MSE with ‘k’ and ‘n − k − 1’ DoF
- It is used to check the quality of entire regression model
- One-tailed test, rejection region is on right side
- If the result of F-test is significant, at least one of the independent variables is able to explain variation in dependent variable


Eg. n = 48, k = 6, SST = 430, SSE = 190, significance level = 2.5% and 5%. Perform an F-test.

RSS = SST − SSE = 430 − 190 = 240

MSR = RSS/k = 240/6 = 40

MSE = SSE/(n − k − 1) = 190/41 = 4.634

F-statistic = MSR/MSE = 40/4.634 ≈ 8.63

Critical value (F-table) at 2.5% significance level (DoF 6, 41) = 2.74

Calculated test statistic is on the right of critical value, therefore the conclusion is ‘Reject the null hypothesis’

Since the conclusion at 2.5% significance is ‘Reject’, the conclusion at 5% significance is also ‘Reject’

At least one of the slope coefficients is significantly different from zero

LOS h $R^2$ and adjusted $R^2$

$R^2$: Percentage of variation in the dependent variable explained by all the independent variables

$R^2$ = RSS/SST

$R^2$ = Explained variation/Total variation

Adjusted $R^2 = 1 - \left[\left(\dfrac{n - 1}{n - k - 1}\right) \times (1 - R^2)\right]$

Adjusted $R^2$ < $R^2$ in multiple regression

Eg. Model 1: n = 30, k = 6, $R^2$ = 73%. Model 2: n = 30, k = 8, $R^2$ = 75%.

Adjusted $R^2_1 = 1 - \left[\left(\dfrac{30 - 1}{30 - 6 - 1}\right) \times (1 - 0.73)\right] = 65.96\%$

Adjusted $R^2_2 = 1 - \left[\left(\dfrac{30 - 1}{30 - 8 - 1}\right) \times (1 - 0.75)\right] = 65.48\%$

Adding two more variables is not justified because adjusted $R^2_2$ < adjusted $R^2_1$

LOS i ANOVA table

Source of variation        DoF          Sum of squares    Mean sum of squares
Regression (explained)     k            RSS               MSR = RSS/k
Error (unexplained)        n − k − 1    SSE               MSE = SSE/(n − k − 1)
Total                      n − 1        SST

F-statistic = MSR/MSE with ‘k’ and ‘n − k − 1’ DoF
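The ANOVA quantities and the adjusted R² comparison can be sketched as two small functions. The inputs are the ones from the examples above; R² is entered directly as 0.73 and 0.75, and the function names are hypothetical.

```python
def f_statistic(sst, sse, n, k):
    """F = MSR / MSE, with RSS = SST - SSE, k and n - k - 1 DoF."""
    rss = sst - sse
    msr = rss / k
    mse = sse / (n - k - 1)
    return msr / mse

def adjusted_r2(r2, n, k):
    """Adjusted R^2 = 1 - [(n - 1) / (n - k - 1)] * (1 - R^2)."""
    return 1 - ((n - 1) / (n - k - 1)) * (1 - r2)

f_stat = f_statistic(sst=430, sse=190, n=48, k=6)
reject_null = f_stat > 2.74  # critical value from the notes (2.5%, DoF 6 and 41)

adj_1 = adjusted_r2(0.73, n=30, k=6)
adj_2 = adjusted_r2(0.75, n=30, k=8)

print(round(f_stat, 3), reject_null, round(adj_1, 4), round(adj_2, 4))
```

Although R² rises from 73% to 75% when two variables are added, the adjusted figure falls slightly, which is the basis for saying the extra variables are not justified.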


LOS j Multiple regression equation by using dummy variables

$Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k + \varepsilon$

where Y = dependent variable, $b_0$ = intercept, $b_1 \dots b_k$ = slopes, ε = error term

Dummy variables: Independent variables that are binary in nature (i.e. in the form of yes/no)

- They are qualitative variables
- Values: If true = 1, if false = 0
- Use n − 1 dummy variables in the model to represent n categories

LOS k & l Types of heteroskedasticity

Unconditional

- Occurs when heteroskedasticity of the error variance is not correlated with the independent variables
- Does not cause major problems for statistical inference

Conditional

- Occurs when heteroskedasticity of the error variance is correlated with the independent variables
- Causes problems for statistical inference

Violations

Conditional heteroskedasticity
- Meaning: Variance not constant
- Effect: Type I errors
- Detection: Examining scatter plots or Breusch-Pagan test
- Correction: White-corrected standard errors

Positive serial correlation
- Meaning: Errors are correlated
- Effect: Type I errors
- Detection: Durbin-Watson test
- Correction: Hansen method

Negative serial correlation
- Meaning: Errors are correlated
- Effect: Type II errors
- Detection: Durbin-Watson test
- Correction: Hansen method

Multicollinearity
- Meaning: Two or more independent variables are correlated
- Effect: Type II errors
- Detection: F significant, t not significant
- Correction: Drop one of the variables

- Breusch-Pagan test: n × $R^2$
- White-corrected standard errors are also known as robust standard errors
- Durbin-Watson test ≈ 2(1 − r)
- Multicollinearity: The question is never a yes or no, it is how much
- None of the assumption violations have any impact on slope coefficients. The impact is on standard errors and therefore on t-test
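To make the Durbin-Watson approximation concrete, here is a minimal sketch that computes the statistic from a list of residuals. The residual series are hypothetical, chosen only to show the two directions of serial correlation; the notes do not give the full formula, so the standard definition (sum of squared changes in residuals over sum of squared residuals) is assumed.

```python
def durbin_watson(residuals):
    """DW = sum of squared successive differences / sum of squared residuals."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residuals that flip sign every period (negative serial correlation)
dw_alternating = durbin_watson([1, -1, 1, -1])

# Hypothetical residuals that stay on one side for a while (positive serial correlation)
dw_persistent = durbin_watson([1, 1, -1, -1])

print(dw_alternating, dw_persistent)
```

Values well above 2 point to negative serial correlation and values well below 2 to positive serial correlation, in line with DW ≈ 2(1 − r).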


LOS m Model specifications and model misspecifications

Model specifications

- Model should have strong economic reasoning
- Functional form of the variables should be appropriate
- The model should be parsimonious (concise/brief)
- The model should be examined for violations of assumptions
- Model should be tested on out of sample data

Model misspecifications

- Omitting a variable
- Variable should be transformed
- Incorrectly pooling data
- Using lagged dependent variable as an independent variable
- Forecasting the past
- Measuring independent variables with error

Model misspecifications might have impact on both slope coefficient and error terms

LOS n Models with qualitative dependent variables

- Probit: Based on the normal distribution
- Logit: Based on the logistic distribution
- Discriminant: Similar to probit and logit but uses financial ratios as independent variables

LOS o Interpretation of multiple regression model

Values of slope coefficients suggest that there is economic relationship between the independent and dependent variables

But it may also be possible for a regression to have statistical significance even when there is no economic relationship

This statistical significance must also be factored into the analysis


Time-series Analysis

Time series: Set of observations on a variable’s outcomes in different time periods

Used to explain the past and make predictions about the future

LOS a Predicted trend value for a time series

Linear trend models

- Linear trend is a trend in which the dependent variable changes at a constant rate with time
- Has a straight line
- Upward-sloping line: +ve trend
- Downward-sloping line: −ve trend
- Equation: $y_t = b_0 + b_1 t + \varepsilon_t$

Log-linear trend models

- Log-linear trend is a trend in which the dependent variable changes at an exponential rate with time
- Used for financial time series
- Has a curve
- Convex curve: +ve trend
- Concave curve: −ve trend
- Equation: $\ln y_t = b_0 + b_1 t + \varepsilon_t$

Limitation of trend models is that they are not useful if the error terms are serially correlated

LOS b How to determine which model to use

Plot the data

[Figure: two plots of y against x, one labelled ‘Log-linear trend model’ and one labelled ‘Linear trend model’]
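A short sketch of how a predicted trend value is obtained from each equation. The coefficients below are hypothetical, picked only to show the mechanics; note that the log-linear model predicts ln y, so the result has to be exponentiated to get back to y.

```python
import math

def linear_trend(b0, b1, t):
    """Predicted y_t from y_t = b0 + b1 * t."""
    return b0 + b1 * t

def log_linear_trend(b0, b1, t):
    """Predicted y_t from ln(y_t) = b0 + b1 * t, i.e. y_t = exp(b0 + b1 * t)."""
    return math.exp(b0 + b1 * t)

# Hypothetical coefficients, for illustration only
linear_value = linear_trend(b0=10.0, b1=2.0, t=5)
log_linear_value = log_linear_trend(b0=2.0, b1=0.05, t=5)

print(linear_value, round(log_linear_value, 4))
```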


LOS c Requirement for a time series to be covariance stationary

A time series is covariance stationary if it satisfies the following three conditions:

- Constant and finite mean
- Constant and finite variance (same as homoskedasticity)
- Constant and finite covariance of time series with itself

Eg. $X_t = b_0 + b_1 X_{t-1}$, with $X_t = 5 + 0.5 X_{t-1}$

Starting below the mean          Starting above the mean
X(t−1) = 6,  X(t) = 8            X(t−1) = 20,   X(t) = 15
X(t−1) = 8,  X(t) = 9            X(t−1) = 15,   X(t) = 12.5
X(t−1) = 9,  X(t) = 9.5          X(t−1) = 12.5, X(t) = 11.25

X(t−1) = 10, X(t) = 10

If $X_{t-1} = 10$, then $X_t = 10$, $X_{t+1} = 10$, $X_{t+2} = 10$ and so on

This is called constant and finite mean

Mean of the time series = $\dfrac{b_0}{1 - b_1} = \dfrac{5}{1 - 0.5} = 10$

For a model to be valid, time series must be covariance stationary

Most economic and financial time series relationships are not stationary

The model can be used if the degree of nonstationarity is not significant

LOS d Autoregressive (AR) model

AR model: A time series regressed on its own past values

Equation AR(1): $X_t = b_0 + b_1 X_{t-1} + \varepsilon_t$

Equation AR(2): $X_t = b_0 + b_1 X_{t-1} + b_2 X_{t-2} + \varepsilon_t$

Chain rule of forecasting: Calculating successive forecasts
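The chain rule of forecasting and the mean of the series can be illustrated with the AR(1) example $X_t = 5 + 0.5 X_{t-1}$. This is a minimal sketch with hypothetical function names; each forecast is fed back in as the next period's lagged value.

```python
def ar1_forecasts(b0, b1, x_start, steps):
    """Successive forecasts from X_t = b0 + b1 * X_(t-1), starting at x_start."""
    forecasts = []
    x = x_start
    for _ in range(steps):
        x = b0 + b1 * x
        forecasts.append(x)
    return forecasts

def mean_reverting_level(b0, b1):
    """Level at which the forecast equals the previous value: b0 / (1 - b1)."""
    return b0 / (1 - b1)

from_below = ar1_forecasts(5, 0.5, x_start=6, steps=3)
from_above = ar1_forecasts(5, 0.5, x_start=20, steps=3)
level = mean_reverting_level(5, 0.5)

print(from_below, from_above, level)
```

Starting from 6 the forecasts climb toward 10, and starting from 20 they fall toward 10, matching the two columns of the example.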


LOS e Autocorrelations of the error terms

If the error terms have significant serial correlation (autocorrelation), the AR model used is not the best model to analyze the time series

Procedure to test if the AR model is correct:

Step 1: Calculate the intercept and slope using linear regression
Step 2: Calculate the predicted values
Step 3: Calculate the error terms
Step 4: Calculate the autocorrelations of the error terms
Step 5: Test whether the autocorrelations are significantly different from zero

Test used to know if the autocorrelations are significantly different from zero: t-test

t statistic = $\dfrac{\text{Autocorrelation}}{\text{Standard error}}$

If the autocorrelations are significantly different from zero (if the decision is reject): Model does not fit the time series

If the autocorrelations are not significantly different from zero (if the decision is FTR): Model fits the time series

LOS f Mean reversion

It means tendency of time series to move toward its mean

Mean reverting level = $\dfrac{b_0}{1 - b_1}$

LOS g In-sample and out-of-sample forecasts and RMSE criterion

Eg.

Sample value (X(t))    X(t−1)    Predicted value    Error    Squared errors
200                    -         -                  -        -
220                    200       216.5              3.5      12.25
215                    220       227.8              −12.8    163.84
205                    215       225                −20      400
235                    205       219.4              15.6     243.36
250                    235       236.4              13.6     184.96
                                                    Total (SSE) = 1004.41

In-sample root mean squared error (RMSE) = $\sqrt{\dfrac{SSE}{n}} = \sqrt{\dfrac{1004.41}{5}} = 14.17$


Eg.

Actual value    Predicted value    Error    Squared errors
215             -                  -        -
235             225                10       100
220             236.4              −16.4    268.96
240             227.9              12.1     146.41
250             239.2              10.8     116.64
                                   Total (SSE) = 632

Out-of-sample root mean squared error (RMSE) = $\sqrt{\dfrac{SSE}{n}} = \sqrt{\dfrac{632}{4}} = 12.57$
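Both RMSE figures can be reproduced from the actual and predicted columns of the two tables. A minimal sketch, with a hypothetical helper name:

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: sqrt(sum of squared errors / number of forecasts)."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# In-sample: actual values and the model's fitted values
in_sample = rmse([220, 215, 205, 235, 250], [216.5, 227.8, 225, 219.4, 236.4])

# Out-of-sample: actual values from outside the estimation period
out_of_sample = rmse([235, 220, 240, 250], [225, 236.4, 227.9, 239.2])

print(round(in_sample, 2), round(out_of_sample, 2))
```

When comparing competing models, the same function would be run on each model's out-of-sample forecasts and the smaller result preferred.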

Select the time series model with lowest out-of-sample RMSE

LOS h Instability of coefficients of time-series models

One of the important issues in time series is the sample period to use

- Shorter sample period → More stability but less statistical reliability
- Longer sample period → Less stability but more statistical reliability

Data must also be covariance stationary for model to be valid

LOS i Random walk and random walk with a drift

Random walk

A time series in which predicted value of a dependent variable in one period is equal to the value of dependent variable in previous period plus an error term

Equation: $X_t = X_{t-1} + \varepsilon_t$

Random walk with a drift

A time series in which predicted value of a dependent variable in one period is equal to the value of dependent variable in previous period plus or minus a constant amount and an error term

Equation: $X_t = b_0 + X_{t-1} + \varepsilon_t$

- Both of the above equations have a slope ($b_1$) of 1
- Such time series are said to have ‘unit root’
- They are not covariance stationary because they do not have a finite mean (the mean reverting level $b_0/(1 - b_1)$ is undefined when $b_1 = 1$)
- To use standard regression analysis, we must convert this data to covariance stationary. This conversion is called ‘first differencing’


LOS j & k Unit root test of nonstationarity

Autocorrelation approach

If autocorrelations do not exhibit these characteristics, it is said to be a nonstationary time series:

- Autocorrelations at all lags are not significantly different from zero

or

- As the no. of lags increases, the autocorrelations drop down to zero

Dickey-Fuller test

More definitive than autocorrelation approach

$X_t - X_{t-1} = b_0 + b_1 X_{t-1} - X_{t-1} + \varepsilon_t$

$X_t - X_{t-1} = b_0 + (b_1 - 1) X_{t-1} + \varepsilon_t$

If null ($g = b_1 - 1 = 0$) cannot be rejected, the time series has a unit root

First differencing

If time series is a random walk then we must convert this data to covariance stationary. This conversion is called first differencing

Eg.

Sales    Lag 1    ∆ sales (current year)    ∆ sales (previous year)
230      -        -                         -
270      230      40                        -
290      270      20                        40
310      290      20                        20
340      310      30                        20

Equation: $\hat{y} = 30 - 0.25x$

$\hat{y} = 30 - 0.25(340) = -55$

Forecasted sales: 340 − 55 = 285
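The first-differencing table and the forecast in the example can be reproduced as follows. This sketch simply follows the arithmetic in the notes, where the fitted equation $\hat{y} = 30 - 0.25x$ is applied to the latest sales level (340) to get the predicted change.

```python
sales = [230, 270, 290, 310, 340]

# First difference: change in sales from one year to the next
first_diff = [curr - prev for prev, curr in zip(sales, sales[1:])]

# Lagged first difference: the previous year's change, aligned with first_diff[1:]
lagged_diff = first_diff[:-1]

# Fitted equation from the notes, applied to the latest sales level
b0, b1 = 30, -0.25
predicted_change = b0 + b1 * sales[-1]

# Forecast = latest level + predicted change
forecast_sales = sales[-1] + predicted_change

print(first_diff, lagged_diff, predicted_change, forecast_sales)
```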

LOS l How to test and correct for seasonality

Seasonality can be detected by plotting the values on a graph or calculating autocorrelations

Seasonality is present if the autocorrelation of the error term at the seasonal lag is significantly different from zero

Correction: Adding a lag of dependent variable (corresponding to the same period in previous year) to the model as another independent variable


LOS m Autoregressive conditional heteroskedasticity (ARCH)

ARCH exists if the variance of error terms in one period is dependent on the variance of error terms in previous period

Testing: Squared errors from the model are regressed on the first lag of the squared residuals

Equation: $\hat{\varepsilon}_t^2 = a_0 + a_1 \hat{\varepsilon}_{t-1}^2 + \mu_t$

where $\hat{\varepsilon}_t^2$ = predicted error term of current period (squared), $a_0$ = intercept, $a_1$ = slope, $\hat{\varepsilon}_{t-1}^2$ = predicted error term of last period (squared), $\mu_t$ = error term

ARCH is present if the slope $a_1$ is significantly different from zero

LOS n How time-series variables should be analyzed for nonstationarity and/or cointegration

To test whether the two time series have unit roots, a Dickey-Fuller test is used

Possible scenarios:

1. Both time series are covariance stationary (linear regression can be used)
2. Only the dependent variable time series is covariance stationary (linear regression should not be used)
3. Only the independent variable time series is covariance stationary (linear regression should not be used)
4. Neither time series is covariance stationary and the two series are not cointegrated (linear regression should not be used)
5. Neither time series is covariance stationary and the two series are cointegrated (linear regression can be used)

Cointegration: Long term economic or financial relationship between two time series

LOS o Appropriate time-series model to analyze a given investment problem

- Understand the investment problem you have and make a choice of model
- If you have decided to use a time-series model, plot the values to see whether the time series looks covariance stationary
- Use a trend model, if there is no seasonality or structural shift
- If you find significant serial correlation in the error terms, use a complex model such as AR model
- If the data has serial correlation, reexamine the data for stationarity before running an AR model
- If you find significant serial correlation in the residuals, use an AR(2) model
- Check for seasonality
- Test whether error terms have ARCH
- Perform tests of model's out-of-sample forecasting performance (RMSE)


Probabilistic Approaches: Scenario Analysis, Decision Trees And Simulations

LOS a, b & c Steps in running a simulation

Step 1 Determine probabilistic variables:

- No constraint on number of input variables that can be allowed to vary
- Focus on a few variables that have significant impact on value

Step 2 Define probability distributions for these variables:

Three ways to define probability distribution

- Historical data: Useful when past data is available and reliable. Estimate the distribution based on past values.
- Cross-sectional data: Useful when past data is unavailable or unreliable. Estimate the distribution based on the values of similar variables.
- Statistical distribution and parameters: Useful when historical and cross sectional data is insufficient or unreliable. Estimate the distribution and its parameters.

Step 3 Check for correlation across variables:

- If the correlation is strong, either allow only one of the variables to vary (focus on the variable that has the highest impact on value) or build the correlation into the simulation

Step 4 Run the simulation:

- It means to draw an outcome from each distribution and compute the value based on these outcomes
- Number of probabilistic inputs: Higher the number of probabilistic inputs, greater the number of simulations required.
- Types of distributions: Greater the diversity of distributions, greater the number of simulations required.
- Range of outcomes: Greater the range of outcomes, greater the number of simulations required.
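The four steps can be sketched with Python's standard library. Everything specific here is an assumption made for illustration: the two inputs (revenue growth and operating margin), their distributions and parameters, the base revenue of 100, and the simplification that the two inputs are uncorrelated.

```python
import random
import statistics

def simulate_value(rng):
    """One trial: draw an outcome from each distribution and compute the value."""
    # Steps 1 and 2: two probabilistic variables with assumed distributions
    growth = rng.gauss(0.05, 0.02)      # hypothetical: normal, mean 5%, sd 2%
    margin = rng.uniform(0.10, 0.20)    # hypothetical: uniform between 10% and 20%
    # Step 3: the inputs are assumed uncorrelated, so they are drawn independently
    revenue = 100.0 * (1 + growth)      # hypothetical base revenue of 100
    return revenue * margin

def run_simulation(trials, seed=0):
    """Step 4: repeat the draw many times to build a distribution of values."""
    rng = random.Random(seed)
    return [simulate_value(rng) for _ in range(trials)]

values = run_simulation(10_000)
mean_value = statistics.mean(values)
std_value = statistics.stdev(values)

print(round(mean_value, 2), round(std_value, 2))
```

The output is a whole distribution of values (summarized here by its mean and standard deviation) rather than a single point estimate.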

LOS d Advantages of using simulations in decision making

- Better input estimation: An analyst will usually examine both historical and cross-sectional data to select a proper distribution and its parameters, instead of relying on single best estimates. This results in better quality of inputs
- Provides a distribution of expected value rather than a point estimate

Simulations provide distribution of expected value however they do not provide better estimates

Simulations do not always lead to better decisions


LOS e Common constraints introduced into simulations

Book value constraints

- Regulatory capital restrictions
- Negative equity

Earnings and CF constraints

- Imposed internally: Analyst’s expectations
- Imposed externally: Loan covenants

Market value constraints

- Likelihood of financial distress
- Indirect bankruptcy costs

LOS f Issues in using simulations in risk assessment

- Garbage in, garbage out: Inputs should be based on analysis and data, rather than guesswork
- Inappropriate probability distributions: Using probability distributions that have no resemblance to the true distribution of an input variable will provide misleading results
- Non-stationary distributions: Distributions may change over time due to change in market structure. There can be a change in the form of distribution or the parameters of the distribution
- Dynamic correlations: Correlation across input variables can be modeled into simulations only when they are stable. If they are not, it becomes far more difficult to model them

Risk-adjusted value

Cash flows from simulations are not risk-adjusted and should not be discounted at RFR

Eg.

Asset    Risk-adjusted discount rate    Expected value using simulation    σ from simulation
A        15%                            $100                               17%
B        18%                            $100                               21%

- We have already accounted for B’s greater risk by using a higher discount rate
- If we choose A over B on the basis of A’s lower standard deviation, we would be penalizing Asset B twice
- An investor should be indifferent between the two investments

LOS g Selecting appropriate probabilistic approach

Type of risk    Correlated?    Sequential?       Appropriate approach
Continuous      Yes            Doesn’t matter    Simulation
Discrete        Yes            No                Scenario analysis
Discrete        No             Yes               Decision tree

