Correlation And Regression
LOS a Sample covariance and sample correlation
Sample covariance
Measures how two variables move together
Captures the linear relationship between two variables
$Cov(x,y) = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$
$Cov(x,y) = r \times S_x \times S_y$
Unit = %²
Range = −∞ to +∞
+ve covariance = Variables tend to move together
−ve covariance = Variables tend to move in opposite directions
Sample correlation
Measures strength of linear relationship between two variables
Standardized measure of covariance
$r = \frac{Cov(x,y)}{S_x \times S_y}$
Unit = No unit
Range = −1 to +1
r = 1 means perfectly +ve correlation
r = 0 means no linear relationship
r = −1 means perfectly −ve correlation
LOS b Limitations to correlation analysis
Scatter plot: Graph that shows the relationship between values of two variables
−ve covariance, −ve correlation, −ve slope
+ve covariance, +ve correlation, +ve slope
Outliers: Extremely large or small values may influence the estimate of correlation
Spurious correlation: Appearance of causal linear relationship but no economic relationship exists
Nonlinear relationship: Correlation measures only linear relationships, not nonlinear ones
LOS c Test of the hypothesis that the population correlation coefficient equals zero
Eg. r = 0.4, n = 62, Confidence level = 95%. Perform a test of significance
Step 1: Define hypothesis
$H_0: r = 0$, $H_a: r \neq 0$ (here r denotes the population correlation)
Step 2: Calculate test statistic
$t = \frac{r \times \sqrt{n - 2}}{\sqrt{1 - r^2}} = \frac{0.4 \times \sqrt{62 - 2}}{\sqrt{1 - 0.4^2}} = 3.38$
Step 3: Calculate critical values
t-distribution, DoF = 60: −2 and +2
Since calculated test statistic lies outside the range, conclusion is ‘Reject the null hypothesis’
‘r’ is statistically significant, which means that the population ‘r’ is different from zero
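The significance test for the correlation coefficient can be checked with a few lines of Python. This is a minimal sketch, not part of the original notes: the function name `correlation_t_stat` is an illustrative assumption, and the critical value of 2.0 is taken from the example rather than computed.

```python
import math


def correlation_t_stat(r: float, n: int) -> float:
    """t-statistic for H0: population correlation = 0, with n - 2 DoF."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


# Inputs from the example: r = 0.4, n = 62, 95% confidence level.
t_stat = correlation_t_stat(r=0.4, n=62)

# Two-tailed critical value for DoF = 60, as given in the example.
critical_value = 2.0
reject_null = abs(t_stat) > critical_value

print(f"t = {t_stat:.2f}, reject H0: {reject_null}")
```

For these inputs the formula evaluates to roughly 3.38, which lies outside the −2 to +2 range, so the null hypothesis is rejected.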
https://www.fintreeindia.com/ © 2017 FinTree Education Pvt. Ltd.
FinTree
LOS d Dependent variable and independent variable
Dependent variable
Variable you are seeking to explain
Also referred to as explained variable/endogenous variable/predicted variable
Independent variable
Variable you are using to explain changes in the dependent variable
Also referred to as explanatory variable/exogenous variable/predicting variable
LOS e Assumptions underlying linear regression
$R_p = RFR + \beta (R_m - RFR)$
$R_p$: Dependent variable
$RFR$: Intercept
$\beta$: Slope
$(R_m - RFR)$: Independent variable
[Figure: regression line plotted with the independent variable on the x-axis and the dependent variable on the y-axis]
1. Relationship between dependent and independent variable is linear
2. Independent variable is uncorrelated with the error term
3. Expected value of the error term is zero
4. Variance of the error term is constant (NOT ZERO). The economic relationship between variables is intact for the entire time period (eg. change in political regime)
5. Error term is uncorrelated across observations (eg. seasonality)
6. Error term is normally distributed
Sum of squared errors (SSE): Sum of the squared vertical distances between the estimated and actual Y-values
Regression line: Line that minimizes the SSE
Slope coefficient (beta): Describes change in ‘y’ for one unit change in ‘x’
$\text{Slope} = \frac{Cov(x,y)}{Variance(x)}$
LOS f Standard error of estimate, coefficient of determination and confidence interval for regression coefficient
Eg. Sum of squared errors (SSE)
‘x’ | 10 | 15 | 20 | 30 | Total
Actual ‘y’ | 17 | 19 | 35 | 45 |
Predicted ‘y’ | 15.81 | 23.36 | 30.91 | 46.01 |
Errors | 1.19 | −4.36 | 4.09 | −1.01 |
Squared errors | 1.416 | 19 | 16.73 | 1.02 | 38.166
Standard error of estimate (SEE) = $\sqrt{\frac{SSE}{n - 2}} = \sqrt{\frac{38.166}{2}} = 4.37$
Coefficient of determination ($R^2$): % of variation of the dependent variable explained by variation of the independent variable
For simple linear regression, $R^2 = r^2$
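The slope, SSE and SEE calculations can be reproduced from the four (x, y) pairs in the example. The sketch below is an illustration only: the variable names are assumptions, and the intercept formula (mean of y minus slope times mean of x) is the standard least-squares result rather than something stated in the notes. Because it uses unrounded coefficients, SSE comes out near 38.17 instead of the table's 38.166.

```python
import math

# Data from the SSE example.
x = [10, 15, 20, 30]
y = [17, 19, 35, 45]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Slope = Cov(x, y) / Variance(x), both using the n - 1 divisor.
cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
var_x = sum((xi - mean_x) ** 2 for xi in x) / (n - 1)
slope = cov_xy / var_x

# Assumption: standard least-squares intercept, not given in the notes.
intercept = mean_y - slope * mean_x

predicted = [intercept + slope * xi for xi in x]

# SSE: sum of squared vertical distances between actual and predicted y.
sse = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))

# SEE = sqrt(SSE / (n - 2)) for simple linear regression.
see = math.sqrt(sse / (n - 2))

# R^2 = explained variation / total variation = 1 - SSE / SST.
sst = sum((yi - mean_y) ** 2 for yi in y)
r_squared = 1 - sse / sst

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"SSE = {sse:.2f}, SEE = {see:.2f}, R^2 = {r_squared:.3f}")
```

The fitted slope and intercept round to 1.51 and 0.71, which are the coefficients implied by the predicted values in the table.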
Confidence interval for regression coefficient
$\hat{b}_1 \pm (t_c \times SE)$
$\hat{b}_1$: Slope
$t_c$: Critical value (t-value)
$SE$: Standard error
Eg. $\hat{b}_1$ = 0.48, SE = 0.35, n = 42. Calculate 90% confidence interval
t-distribution, DoF = 40: $t_c$ = 1.684
Confidence interval: 0.48 ± (1.684 × 0.35) = −0.109 to 1.069
LOS g Hypothesis testing for population value of a regression coefficient
Eg. $\hat{b}_1$ = 0.48, SE = 0.35, n = 42, Confidence level = 90%. Perform a test of significance
Step 1: Define hypothesis
$H_0: b_1 = 0$, $H_a: b_1 \neq 0$
Step 2: Calculate test statistic
$t = \frac{\text{Sample stat.} - HV}{\text{Std. error}} = \frac{0.48 - 0}{0.35} = 1.371$
Step 3: Calculate critical values
t-distribution, DoF = 40: −1.684 and +1.684
Since calculated test statistic lies inside the range, conclusion is ‘Failed to reject the null hypothesis’
Slope is not significantly different from zero
LOS h & i Predicted value of dependent variable and confidence interval for the predicted value
Predicted value of dependent variable
$\hat{Y} = \hat{b}_0 + \hat{b}_1 \times X_p$
$\hat{Y}$: Predicted value (y)
$\hat{b}_0$: Intercept
$\hat{b}_1$: Slope
$X_p$: Forecasted value (x)
Confidence interval for the predicted value of dependent variable
$\hat{Y} \pm (t_c \times SE)$
$\hat{Y}$: Predicted value (y)
$t_c$: Critical value (t-value)
$SE$: Standard error
Eg. Forecasted return (x) = 12%, Intercept = −4%, Slope = 0.75, Standard error = 2.68, n = 32. Calculate predicted value (y) and 95% confidence interval
Predicted value: $\hat{Y} = -4 + 0.75 \times 12 = 5\%$
t-distribution, DoF = 30: $t_c$ = 2.042
Confidence interval: 5 ± (2.042 × 2.68) = −0.472 to 10.472
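The confidence-interval and slope-test examples all use the same "estimate ± (critical value × standard error)" pattern, which can be sketched in Python as follows. The helper name `confidence_interval` is an illustrative assumption, and the critical t-values (1.684 and 2.042) are copied from the examples rather than looked up programmatically.

```python
def confidence_interval(estimate, t_critical, std_error):
    """Return (lower, upper) bounds: estimate +/- t_critical * std_error."""
    margin = t_critical * std_error
    return estimate - margin, estimate + margin


# 90% confidence interval for the slope (b1 = 0.48, SE = 0.35, DoF = 40).
slope_ci = confidence_interval(0.48, 1.684, 0.35)

# Test of significance for the same slope: (sample stat - hypothesized) / SE.
t_stat = (0.48 - 0) / 0.35
reject_null = abs(t_stat) > 1.684

# Predicted value and its 95% confidence interval
# (intercept = -4, slope = 0.75, forecasted x = 12, SE = 2.68, DoF = 30).
predicted_y = -4 + 0.75 * 12
prediction_ci = confidence_interval(predicted_y, 2.042, 2.68)

print(f"Slope CI: {slope_ci[0]:.3f} to {slope_ci[1]:.3f}")
print(f"t = {t_stat:.3f}, reject H0: {reject_null}")
print(f"Predicted y = {predicted_y}, CI: {prediction_ci[0]:.3f} to {prediction_ci[1]:.3f}")
```

Note that the slope's confidence interval contains zero, which agrees with the ‘Failed to reject’ conclusion of the t-test.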
LOS j Analysis of variance (ANOVA)
$\bar{Y}$: Mean
$Y_i$: Actual value
$\hat{Y}_i$: Predicted value
Sum of squared errors (SSE)
Measures unexplained variation
aka sum of squared residuals
$SSE = \sum (Y_i - \hat{Y}_i)^2$
Regression sum of squares (RSS)
Measures explained variation
$RSS = \sum (\hat{Y}_i - \bar{Y})^2$
Total sum of squares (SST)
Measures total variation
$SST = \sum (Y_i - \bar{Y})^2$
• Higher the RSS, better the quality of regression
• $R^2$ = RSS/SST
• $R^2$ = Explained variation/Total variation
F-statistic = MSR/MSE with ‘k’ and ‘n − k − 1’ DoF
ANOVA Table
Source of variation | DoF | Sum of squares | Mean sum of squares
Regression (explained) | k | RSS | MSR = RSS/k
Error (unexplained) | n − k − 1 | SSE | MSE = SSE/(n − k − 1)
Total | n − 1 | SST |
When to use F-test and t-test
$Y = b_0 + b_1 x_1 + b_2 x_2 + \varepsilon$
t-test: Tests an individual slope coefficient ($b_1$ or $b_2$)
F-test: Tests all slope coefficients together ($b_1$ and $b_2$)
LOS k Limitations of regression analysis
Linear relationships can change over time (parameter instability)
Public knowledge of a regression relationship may reduce its future usefulness
If the regression assumptions are violated, hypothesis tests will not be valid (heteroscedasticity and autocorrelation)
Multiple Regression And Issues In Regression Analysis
LOS a Multiple regression equation
$Y = b_0 + b_1 X_1 + b_2 X_2 + \ldots + b_k X_k + \varepsilon$
$Y$: Dependent variable
$b_0$: Intercept
$b_1, \ldots, b_k$: Slope
$X_1, \ldots, X_k$: Independent variable
$\varepsilon$: Error term
LOS b Interpreting estimated regression coefficients
Intercept term: Value of dependent variable when all independent variables are equal to zero
Slope coefficient: Measures how much dependent variable changes when independent variable changes by one unit, holding other independent variables constant
LOS c & d Hypothesis testing for population value of a regression coefficient
Eg. $b_1$ = 0.15, $SE_1$ = 0.38, $b_2$ = 0.28, $SE_2$ = 0.043, n = 43
Confidence level = 90%. Perform a test of significance
Step 1: Define hypothesis
$H_0: b_1 = 0$, $H_a: b_1 \neq 0$
$H_0: b_2 = 0$, $H_a: b_2 \neq 0$
Step 2: Calculate test statistic
$t_1 = \frac{\text{Sample stat.} - HV}{\text{Std. error}} = \frac{0.15 - 0}{0.38} = 0.395$
$t_2 = \frac{\text{Sample stat.} - HV}{\text{Std. error}} = \frac{0.28 - 0}{0.043} = 6.512$
Step 3: Calculate critical values
DoF = n − k − 1 = 43 − 2 − 1 = 40
t-distribution, DoF = 40: −1.684 and +1.684
Since calculated test statistic ($b_1$) lies inside the range, conclusion is ‘Failed to reject the null hypothesis’
And test statistic ($b_2$) lies outside the range, conclusion is ‘Reject the null hypothesis’
Slope ‘$b_1$’ is not significantly different from zero and slope ‘$b_2$’ is significantly different from zero
Solution is to drop the variable with slope ‘$b_1$’
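P-value
P-value is the lowest level of significance at which null hypothesis is rejected
[Figure: P-value compared with the significance level; outcomes labelled Reject or FTR (fail to reject)]
P-value < Significance level: Reject the null hypothesis
P-value > Significance level: FTR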
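LOS e Confidence interval for regression coefficient and predicted value of dependent variable
Confidence interval for regression coefficient
$\hat{b}_1 \pm (t_c \times SE)$
$\hat{b}_1$: Slope
$t_c$: Critical value (t-value)
$SE$: Standard error
Predicted value of dependent variable
$\hat{Y} = \hat{b}_0 + \hat{b}_1 \hat{X}_1 + \hat{b}_2 \hat{X}_2 + \ldots + \hat{b}_k \hat{X}_k$
$\hat{Y}$: Predicted value (y)
$\hat{b}_0$: Intercept
$\hat{b}_1, \ldots, \hat{b}_k$: Slope
$\hat{X}_1, \ldots, \hat{X}_k$: Forecasted value (x)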
LOS f Assumptions of a multiple regression model
1. Relationship between dependent and independent variable is linear
2. Independent variables are uncorrelated with the error term and there is no exact linear relation between two or more independent variables
3. Expected value of the error term is zero
4. Variance of the error term is constant (NOT ZERO). The economic relationship between variables is intact for the entire time period (eg. change in political regime)
5. Error term is uncorrelated across observations (eg. seasonality)
6. Error term is normally distributed
LOS g F-statistic
• F-statistic = MSR/MSE with ‘k’ and ‘n − k − 1’ DoF
• It is used to check the quality of the entire regression model
• One-tailed test, rejection region is on the right side
• If the result of the F-test is significant, at least one of the independent variables is able to explain variation in the dependent variable
Eg. n = 48, k = 6, SST = 430, SSE = 190, Significance level = 2.5% and 5%. Perform an F-test
RSS = SST − SSE = 430 − 190 = 240
MSR = RSS/k = 240/6 = 40
MSE = SSE/(n − k − 1) = 190/41 = 4.634
F-statistic = MSR/MSE = 40/4.634 = 8.63
Critical value (F-table) at 2.5% significance level (DoF 6, 41) = 2.74
Calculated test statistic is on the right of critical value, therefore the conclusion is ‘Reject the null hypothesis’
Since the conclusion at 2.5% significance is ‘Reject’, the conclusion at 5% significance is also ‘Reject’
At least one of the slope coefficients is significantly different from zero
LOS h $R^2$ and adjusted $R^2$
$R^2$: % of variation of the dependent variable explained by all the independent variables
$R^2$ = RSS/SST
$R^2$ = Explained variation/Total variation
$\text{Adjusted } R^2 = 1 - \left[ \left( \frac{n - 1}{n - k - 1} \right) \times (1 - R^2) \right]$
Adjusted $R^2$ < $R^2$ in multiple regression
Eg. Model 1: n = 30, k = 6, $R^2$ = 73%. Model 2: n = 30, k = 8, $R^2$ = 75%
$\text{Adjusted } R^2_1 = 1 - \left[ \left( \frac{30 - 1}{30 - 6 - 1} \right) \times (1 - 0.73) \right] = 65.96\%$
$\text{Adjusted } R^2_2 = 1 - \left[ \left( \frac{30 - 1}{30 - 8 - 1} \right) \times (1 - 0.75) \right] = 65.48\%$
Adding two more variables is not justified because adjusted $R^2_2$ < adjusted $R^2_1$
LOS i ANOVA table
F-statistic = MSR/MSE with ‘k’ and ‘n − k − 1’ DoF
Source of variation | DoF | Sum of squares | Mean sum of squares
Regression (explained) | k | RSS | MSR = RSS/k
Error (unexplained) | n − k − 1 | SSE | MSE = SSE/(n − k − 1)
Total | n − 1 | SST |
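The F-test and adjusted R² examples follow directly from the ANOVA quantities, so they are easy to verify in code. The sketch below is illustrative only: the function names are assumptions, the critical value 2.74 is copied from the example, and R² is plugged into the adjusted R² formula exactly as given (0.73 and 0.75, with no further squaring).

```python
def f_statistic(sst, sse, n, k):
    """F = MSR / MSE, with k and n - k - 1 degrees of freedom."""
    rss = sst - sse            # explained variation
    msr = rss / k              # mean regression sum of squares
    mse = sse / (n - k - 1)    # mean squared error
    return msr / mse


def adjusted_r_squared(r_squared, n, k):
    """Adjusted R^2 = 1 - [(n - 1) / (n - k - 1)] * (1 - R^2)."""
    return 1 - ((n - 1) / (n - k - 1)) * (1 - r_squared)


# F-test example: n = 48, k = 6, SST = 430, SSE = 190.
f_stat = f_statistic(sst=430, sse=190, n=48, k=6)
reject_null = f_stat > 2.74  # critical value at 2.5%, DoF (6, 41), from the example

# Adjusted R^2 example: does adding two more variables help?
adj_r2_model_1 = adjusted_r_squared(0.73, n=30, k=6)
adj_r2_model_2 = adjusted_r_squared(0.75, n=30, k=8)

print(f"F = {f_stat:.2f}, reject H0: {reject_null}")
print(f"Adjusted R^2: model 1 = {adj_r2_model_1:.4f}, model 2 = {adj_r2_model_2:.4f}")
```

With these inputs the second model has the higher R² but the lower adjusted R², which is the reason the extra two variables are not justified.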
LOS j Multiple regression equation by using dummy variables
$Y = b_0 + b_1 X_1 + b_2 X_2 + \ldots + b_k X_k + \varepsilon$
$Y$: Dependent variable
$b_0$: Intercept
$b_1, \ldots, b_k$: Slope
$X_1, \ldots, X_k$: Independent variable
$\varepsilon$: Error term
Dummy variables: Independent variables that are binary in nature (i.e. in the form of yes/no)
They are qualitative variables
Values: If true = 1, if false = 0
Use n − 1 dummy variables in the model, where n is the number of categories
LOS k & l Types of heteroskedasticity
Unconditional
Occurs when heteroskedasticity of the error variance is not correlated with the independent variables
Does not cause major problems for statistical inference
Conditional
Occurs when heteroskedasticity of the error variance is correlated with the independent variables
Causes problems for statistical inference
Violations | Conditional heteroskedasticity | Positive serial correlation | Negative serial correlation | Multicollinearity
Meaning | Variance not constant | Errors are correlated | Errors are correlated | Two or more independent variables are correlated
Effect | Type I errors | Type I errors | Type II errors | Type II errors
Detection | Examining scatter plots or Breusch-Pagan test | Durbin-Watson test | Durbin-Watson test | F significant, t not significant
Correction | White-corrected standard errors | Hansen method | Hansen method | Drop one of the variables
• Breusch-Pagan test: $n \times R^2$
• White-corrected standard errors are also known as robust standard errors
• Durbin-Watson test ≈ 2(1 − r)
• Multicollinearity: The question is never a yes or no, it is how much
• None of the assumption violations have any impact on slope coefficients. The impact is on standard errors and therefore on the t-test
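The two test statistics mentioned in the notes above can be sketched in Python. This is an illustration under stated assumptions: the residuals are made-up numbers, the function names are hypothetical, and the Durbin-Watson statistic is computed from its usual definition (sum of squared changes in residuals divided by sum of squared residuals), which the notes do not spell out. In small samples the 2(1 − r) shortcut is only a rough approximation.

```python
def durbin_watson(residuals):
    """DW = sum of squared first differences of residuals / sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def lag1_autocorrelation(residuals):
    """Simple lag-1 autocorrelation estimate of the residuals (mean assumed zero)."""
    numerator = sum(
        residuals[t] * residuals[t - 1] for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def breusch_pagan_statistic(n, r_squared):
    """BP = n * R^2, where R^2 comes from regressing squared residuals on the X's."""
    return n * r_squared


# Hypothetical residuals showing positive serial correlation (not from the notes).
residuals = [0.5, 0.8, 0.3, -0.2, -0.6, -0.4, 0.1, 0.4]

dw = durbin_watson(residuals)
r = lag1_autocorrelation(residuals)
dw_approx = 2 * (1 - r)

print(f"DW = {dw:.3f}, r = {r:.3f}, 2(1 - r) = {dw_approx:.3f}")
print(f"BP statistic for n = 100, R^2 = 0.05: {breusch_pagan_statistic(100, 0.05):.2f}")
```

A DW value well below 2, as here, points to positive serial correlation; a value well above 2 points to negative serial correlation.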
LOS m Model specifications and misspecifications
Model specifications
• Model should have strong economic reasoning
• Functional form of the variables should be appropriate
• The model should be parsimonious (concise/brief)
• The model should be examined for violations of assumptions
• Model should be tested on out-of-sample data
Model misspecifications
• Omitting a variable
• Variable should be transformed
• Incorrectly pooling data
• Using lagged dependent variable as an independent variable
• Forecasting the past
• Measuring independent variables with error
Model misspecifications might have an impact on both slope coefficients and error terms
LOS n Models with qualitative dependent variables
Probit: Based on the normal distribution
Logit: Based on the logistic distribution
Discriminant: Similar to probit and logit but uses financial ratios as independent variables
LOS o Interpretation of multiple regression model
Values of slope coefficients suggest that there is an economic relationship between the independent and dependent variables
But it may also be possible for a regression to have statistical significance even when there is no economic relationship
This statistical significance must also be factored into the analysis
Time-series Analysis
Time series: Set of observations on a variable’s outcomes in different time periods
Used to explain the past and make predictions about the future
LOS a Predicted trend value for a time series
Linear trend models
Linear trend is a trend in which the dependent variable changes at a constant rate with time
Has a straight line
Upward-sloping line: +ve trend
Downward-sloping line: −ve trend
Equation: $y_t = b_0 + b_1 t + \varepsilon_t$
Log-linear trend models
Log-linear trend is a trend in which the dependent variable changes at an exponential rate with time
Used for financial time series
Has a curve
Convex curve: +ve trend
Concave curve: −ve trend
Equation: $\ln y_t = b_0 + b_1 t + \varepsilon_t$
LOS b How to determine which model to use
Plot the data
[Figure: two plots of y against x, a straight line for the linear trend model and a curve for the log-linear trend model]
Limitation of trend models is that they are not useful if the error terms are serially correlated
LOS c Requirement for a time series to be covariance stationary
A time series is covariance stationary if it satisfies the following three conditions:
• Constant and finite mean
• Constant and finite variance (same as homoskedasticity)
• Constant and finite covariance of time series with itself
For a model to be valid, time series must be covariance stationary
Most economic and financial time series relationships are not stationary
The model can be used if the degree of nonstationarity is not significant
LOS d Autoregressive (AR) model
AR model: A time series regressed on its own past values
Equation AR(1): $X_t = b_0 + b_1 X_{t-1} + \varepsilon_t$
Equation AR(2): $X_t = b_0 + b_1 X_{t-1} + b_2 X_{t-2} + \varepsilon_t$
Chain rule of forecasting: Calculating successive forecasts
Eg. $X_t = b_0 + b_1 X_{t-1}$
$X_t = 5 + 0.5 X_{t-1}$
$X_{t-1}$ | $X_t$
6 | 8
8 | 9
9 | 9.5
20 | 15
15 | 12.5
12.5 | 11.25
10 | 10
If $X_{t-1}$ = 10, then $X_t$ = 10, $X_{t+1}$ = 10, $X_{t+2}$ = 10 and so on
This is called constant and finite mean
Mean of the time series = $\frac{b_0}{1 - b_1} = \frac{5}{1 - 0.5} = 10$
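The chain rule of forecasting for the AR(1) example $X_t = 5 + 0.5 X_{t-1}$ can be traced in a short Python sketch. The function names are illustrative assumptions; the coefficients and starting values come from the example.

```python
def ar1_forecasts(b0, b1, start, steps):
    """Chain rule of forecasting: feed each forecast back in as the next input."""
    forecasts = []
    value = start
    for _ in range(steps):
        value = b0 + b1 * value
        forecasts.append(value)
    return forecasts


def mean_reverting_level(b0, b1):
    """Mean of a covariance stationary AR(1) series: b0 / (1 - b1)."""
    return b0 / (1 - b1)


# Starting below the mean, forecasts rise toward 10.
from_below = ar1_forecasts(b0=5, b1=0.5, start=6, steps=3)

# Starting above the mean, forecasts fall toward 10.
from_above = ar1_forecasts(b0=5, b1=0.5, start=20, steps=3)

print(from_below)   # [8.0, 9.0, 9.5]
print(from_above)   # [15.0, 12.5, 11.25]
print(mean_reverting_level(5, 0.5))  # 10.0
```

Whichever side of 10 the series starts on, successive forecasts move toward 10, which is why this level is described as the mean of the series.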
LOS e Autocorrelations of the error terms
If the error terms have significant serial correlation (autocorrelation), the AR model used is not the best model to analyze the time series
Procedure to test if the AR model is correct:
Step 1: Calculate the intercept and slope using linear regression
Step 2: Calculate the predicted values
Step 3: Calculate the error terms
Step 4: Calculate the autocorrelations of the error terms
Step 5: Test whether the autocorrelations are significantly different from zero
Test used to know if the autocorrelations are significantly different from zero: t-test
$t\text{-statistic} = \frac{\text{Autocorrelation}}{\text{Standard error}}$
If the autocorrelations are significantly different from zero (if the decision is reject): Model does not fit the time series
If the autocorrelations are not significantly different from zero (if the decision is FTR): Model fits the time series
LOS f Mean reversion
It means tendency of time series to move toward its mean
Mean reverting level = $\frac{b_0}{1 - b_1}$
LOS g In-sample and out-of-sample forecasts and RMSE criterion
Eg.
Sample value ($X_t$) | $X_{t-1}$ | Predicted value | Error | Squared errors
200 | - | - | - | -
220 | 200 | 216.5 | 3.5 | 12.25
215 | 220 | 227.8 | −12.8 | 163.84
205 | 215 | 225 | −20 | 400
235 | 205 | 219.4 | 15.6 | 243.36
250 | 235 | 236.4 | 13.6 | 184.96
Total | | | | 1004.41
In-sample root mean squared error (RMSE) = $\sqrt{\frac{SSE}{n}} = \sqrt{\frac{1004.41}{5}} = 14.17$
Eg.
Actual value | Predicted value | Error | Squared errors
215 | - | - | -
235 | 225 | 10 | 100
220 | 236.4 | −16.4 | 268.96
240 | 227.9 | 12.1 | 146.41
250 | 239.2 | 10.8 | 116.64
Total | | | 632
Out-of-sample root mean squared error (RMSE) = $\sqrt{\frac{SSE}{n}} = \sqrt{\frac{632}{4}} = 12.57$
Select the time series model with the lowest out-of-sample RMSE
LOS h Instability of coefficients of time-series models
One of the important issues in time series is the sample period to use
Shorter sample period → More stability but less statistical reliability
Longer sample period → Less stability but more statistical reliability
Data must also be covariance stationary for model to be valid
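The in-sample and out-of-sample RMSE figures from LOS g can be reproduced from the actual and predicted values in the two tables. The helper name `rmse` is an illustrative assumption; the data are taken from the examples.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: sqrt(sum of squared errors / number of forecasts)."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# In-sample example (the first observation has no prediction, so it is excluded).
in_sample_rmse = rmse(
    actual=[220, 215, 205, 235, 250],
    predicted=[216.5, 227.8, 225, 219.4, 236.4],
)

# Out-of-sample example.
out_of_sample_rmse = rmse(
    actual=[235, 220, 240, 250],
    predicted=[225, 236.4, 227.9, 239.2],
)

print(f"In-sample RMSE = {in_sample_rmse:.2f}")
print(f"Out-of-sample RMSE = {out_of_sample_rmse:.2f}")
```

When comparing competing models, the same function would be applied to each model's out-of-sample forecasts and the model with the lowest value preferred.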
LOS i Random walk and random walk with a drift
Random walk
A time series in which predicted value of a dependent variable in one period is equal to the value of dependent variable in previous period plus an error term
Equation: $X_t = X_{t-1} + \varepsilon_t$
Random walk with a drift
A time series in which predicted value of a dependent variable in one period is equal to the value of dependent variable in previous period plus or minus a constant amount and an error term
Equation: $X_t = b_0 + X_{t-1} + \varepsilon_t$
• Both of the above equations have a slope ($b_1$) of 1
• Such time series are said to have a ‘unit root’
• They are not covariance stationary because they do not have a finite mean
• To use standard regression analysis, we must convert the data to a covariance stationary series. This conversion is called ‘first differencing’
LOS j & k Unit root test of nonstationarity
Autocorrelation approach
If autocorrelations do not exhibit these characteristics, it is said to be a nonstationary time series:
• Autocorrelations at all lags are statistically insignificant from zero
or
• As the number of lags increases, the autocorrelations drop down to zero
Dickey-Fuller test
More definitive than autocorrelation approach
$X_t - X_{t-1} = b_0 + b_1 X_{t-1} - X_{t-1} + \varepsilon_t$
$X_t - X_{t-1} = b_0 + (b_1 - 1) X_{t-1} + \varepsilon_t$
If the null ($g = b_1 - 1 = 0$) cannot be rejected, the time series has a unit root
First differencing
If a time series is a random walk then we must convert the data to a covariance stationary series. This conversion is called first differencing
Eg.
Sales | Lag 1 | First difference: ∆ sales (current year) | ∆ sales (previous year)
230 | - | - | -
270 | 230 | 40 | -
290 | 270 | 20 | 40
310 | 290 | 20 | 20
340 | 310 | 30 | 20
Equation: $\hat{y} = 30 - 0.25x$
$\hat{y} = 30 - 0.25(340) = -55$
Forecasted sales: 340 − 55 = 285
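The first-difference columns in the sales example can be built mechanically, as the sketch below shows. It covers only the data-preparation step (differencing and lagging); the variable names are assumptions for illustration, and fitting the regression on the differenced data is left out.

```python
# Sales levels from the example.
sales = [230, 270, 290, 310, 340]

# First difference: change in sales from one year to the next.
first_diff = [curr - prev for prev, curr in zip(sales, sales[1:])]

# Pair each change with the previous year's change; these pairs are the
# (dependent, independent) observations for a regression on differenced data.
regression_pairs = list(zip(first_diff[1:], first_diff[:-1]))

print(first_diff)        # [40, 20, 20, 30]
print(regression_pairs)  # [(20, 40), (20, 20), (30, 20)]
```

Differencing costs one observation, and lagging the differences costs another, so five sales figures leave only three usable pairs.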
LOS l How to test and correct for seasonality
Seasonality can be detected by plotting the values on a graph or calculating autocorrelations
Seasonality is present if the autocorrelation of the error term at the seasonal lag is significantly different from zero
Correction: Adding a lag of dependent variable (corresponding to the same period in previous year) to the model as another independent variable
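LOS m Autoregressive conditional heteroskedasticity (ARCH)
ARCH exists if the variance of error terms in one period is dependent on the variance of error terms in previous period
Testing: Squared errors from the model are regressed on the first lag of the squared residuals
Equation: $\hat{\varepsilon}_t^2 = a_0 + a_1 \hat{\varepsilon}_{t-1}^2 + \mu_t$
$\hat{\varepsilon}_t^2$: Predicted error term of current period (squared)
$a_0$: Intercept
$a_1$: Slope
$\hat{\varepsilon}_{t-1}^2$: Predicted error term of last period (squared)
$\mu_t$: Error term
If the slope $a_1$ is significantly different from zero, the time series has ARCH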
LOS n How time-series variables should be analyzed for nonstationarity and/or cointegration
To test whether the two time series have unit roots, a Dickey-Fuller test is used
Possible scenarios:
1. Both time series are covariance stationary (linear regression can be used)
2. Only the dependent variable time series is covariance stationary (linear regression should not be used)
3. Only the independent variable time series is covariance stationary (linear regression should not be used)
4. Neither time series is covariance stationary and the two series are not cointegrated (linear regression should not be used)
5. Neither time series is covariance stationary and the two series are cointegrated (linear regression can be used)
Cointegration: Long term economic or financial relationship between two time series
LOS o Appropriate time-series model to analyze a given investment problem
• Understand the investment problem you have and make a choice of model
• If you have decided to use a time-series model, plot the values to see whether the time series looks covariance stationary
• Use a trend model, if there is no seasonality or structural shift
• If you find significant serial correlation in the error terms, use a complex model such as AR model
• If the data has serial correlation, reexamine the data for stationarity before running an AR model
• If you find significant serial correlation in the residuals, use an AR(2) model
• Check for seasonality
• Test whether error terms have ARCH
• Perform tests of model's out-of-sample forecasting performance (RMSE)
Probabilistic Approaches: Scenario Analysis, Decision Trees And Simulations
LOS a, b & c Steps in running a simulation
Step 1: Determine probabilistic variables
No constraint on number of input variables that can be allowed to vary.
Focus on a few variables that have significant impact on value.
Step 2: Define probability distributions for these variables
Three ways to define probability distribution
Historical data: Useful when past data is available and reliable. Estimate the distribution based on past values.
Cross-sectional data: Useful when past data is unavailable or unreliable. Estimate the distribution based on the values of similar variables.
Statistical distribution and parameters: Useful when historical and cross-sectional data is insufficient or unreliable. Estimate the distribution and its parameters.
Step 3: Check for correlation across variables
If the correlation is strong, either allow only one of the variables to vary (focus on the variable that has the highest impact on value) or build the correlation into the simulation
Step 4: Run the simulation
It means to draw an outcome from each distribution and compute the value based on these outcomes
Number of probabilistic inputs: Higher the number of probabilistic inputs, greater the number of simulations required.
Types of distributions: Greater the diversity of distributions, greater the number of simulations required.
Range of outcomes: Greater the range of outcomes, greater the number of simulations required.
LOS d Advantages of using simulations in decision making
Better input estimation
An analyst will usually examine both historical and cross-sectional data to select a proper distribution and its parameters, instead of relying on single best estimates. This results in better quality of inputs
Provides a distribution of expected value rather than a point estimate
Simulations provide a distribution of expected value; however, they do not provide better estimates
Simulations do not always lead to better decisions
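The four simulation steps can be illustrated with a toy Monte Carlo run in Python. Everything specific in this sketch is a hypothetical assumption made for illustration: the two inputs (revenue and margin), their distributions and parameters, the value formula, and the number of runs. The inputs are drawn independently, which corresponds to the case in Step 3 where no strong correlation needs to be modelled.

```python
import random
import statistics

# Fixed seed so the sketch gives repeatable output.
random.seed(42)


def simulate_values(n_runs):
    """Draw one outcome from each input distribution per run and compute value."""
    values = []
    for _ in range(n_runs):
        # Steps 1 and 2: two probabilistic inputs with assumed distributions.
        revenue = random.gauss(100.0, 10.0)      # hypothetical: normal, mean 100, sd 10
        margin = random.uniform(0.10, 0.20)      # hypothetical: uniform between 10% and 20%
        # Step 4: compute the value implied by this pair of draws.
        values.append(revenue * margin)
    return values


values = simulate_values(10_000)

# The output is a distribution of value, not a single point estimate.
mean_value = statistics.mean(values)
stdev_value = statistics.stdev(values)

print(f"Runs: {len(values)}")
print(f"Mean value: {mean_value:.2f}, standard deviation: {stdev_value:.2f}")
```

Adding more probabilistic inputs or more varied distributions to a sketch like this would call for a larger number of runs before the summary statistics settle down.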
LOS e Common constraints introduced into simulations
Book value constraints
Regulatory capital restrictions
Negative equity
Earnings and CF constraints
Imposed internally: Analyst’s expectations
Imposed externally: Loan covenants
Market value constraints
Likelihood of financial distress
Indirect bankruptcy costs
LOS f Issues in using simulations in risk assessment
• Garbage in, garbage out: Inputs should be based on analysis and data, rather than guesswork
• Inappropriate probability distributions: Using probability distributions that have no resemblance to the true distribution of an input variable will provide misleading results
• Non-stationary distributions: Distributions may change over time due to change in market structure. There can be a change in the form of the distribution or the parameters of the distribution
• Dynamic correlations: Correlations across input variables can be modeled into simulations only when they are stable. If they are not, it becomes far more difficult to model them
Risk-adjusted value
Cash flows from simulations are not risk-adjusted and should not be discounted at RFR
Eg.
Asset | Risk-adjusted discount rate | Expected value using simulation | σ from simulation
A | 15% | $100 | 17%
B | 18% | $100 | 21%
• We have already accounted for B’s greater risk by using a higher discount rate
• If we choose A over B on the basis of A’s lower standard deviation, we would be penalizing Asset B twice
• An investor should be indifferent between the two investments
LOS g Selecting appropriate probabilistic approach
Type of risk | Correlated? | Sequential? | Appropriate approach
Continuous | Yes | Doesn’t matter | Simulation
Discrete | Yes | No | Scenario analysis
Discrete | No | Yes | Decision tree