
Forecasting and Predictive Analytics

NATCOR

Forecasting with ARIMA models

Nikolaos Kourentzes

[email protected]


Outline

Forecasting with ARIMA

1. Autoregression, ACF and PACF

2. AR models

3. Differencing

4. MA models

5. ARMA and ARIMA models

6. Identification

7. Seasonal ARIMA

8. Transformations


Forecasting with ARIMA models

• AutoRegressive Integrated Moving Average (ARIMA) models take a different perspective to Exponential Smoothing to model time series.

• Exponential Smoothing splits a time series into its constituent components (level, trend and season) and models these.

• ARIMA captures statistical relationships that are emergent from the data, without being specific about the type of component; it can be seen as a data-driven model.

• ARIMA is made up of three parts:

• Autoregressive AR terms

• Integration (differencing) I operator

• Moving Average MA terms

• We will introduce each of these separately and then bring them together to build a complete ARIMA model.


Autoregression

• To understand autoregression let us think of habitual consumption, like drinking coffee: the amount of coffee consumed yesterday could be a predictor of today's sales.

• Autoregression: we regress something on itself (more accurately, its lags).

[Scatter plots: sales against sales yesterday and against sales 2-days ago, each with a fitted regression line]

$$\text{Sales}_t = 46.38 + 0.49\,\text{Sales}_{t-1}\ (+\ \text{error})$$
$$\text{Sales}_t = 75.52 + 0.18\,\text{Sales}_{t-2}\ (+\ \text{error})$$


Autocorrelation Function (ACF)

• A useful tool to summarise this type of information is the AutoCorrelation Function (ACF) and its plot (ACF plot or correlogram).

[ACF plot: lag 0 is the correlation of the series with itself (always 1); bars range from −1 to +1; bars outside the significance bounds are significant]


Autocorrelation Function (ACF)

• To calculate it we use:

$$r_k = \frac{\sum_{t=k+1}^{n} (Y_t - \bar{Y})(Y_{t-k} - \bar{Y})}{\sum_{t=1}^{n} (Y_t - \bar{Y})^2}$$

• $Y_t$ is the time series at time t; $\bar{Y}$ is the mean of the time series; n is the sample size and k is the lag order.

• It is simply the correlation between $Y_t$ and $Y_{t-k}$.

• The significance bounds are calculated as:

$$B = \pm z_{1-\alpha/2}\sqrt{\frac{1}{N}\left(1 + 2\sum_{i=1}^{k} r_i^2\right)}$$

• Most statistical software calculates this for you [function in R: acf].
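A minimal sketch of this in R (the deck's tool of choice), assuming a simulated stand-in for the coffee sales series:

```r
# Sketch: compute and plot the ACF with its significance bounds.
# 'sales' is a simulated stand-in, not the actual coffee data.
set.seed(1)
sales <- 100 + arima.sim(model = list(ar = 0.5), n = 200)
acf(sales, lag.max = 20)            # correlogram with significance bounds
r <- acf(sales, plot = FALSE)$acf   # the r_k values themselves
```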


Partial Autocorrelation Function (PACF)

• Something that is not immediately apparent from the ACF is that all lags k ≥ 2 are contaminated by preceding correlations.

• The correlation between $Y_{t-1}$ and $Y_{t-2}$ is our problem. If it were zero, then the correlation between $Y_t$ and $Y_{t-2}$ would be due only to their direct connection, and not partially due to the correlation that carries over through $Y_{t-1}$.

• The Partial AutoCorrelation Function (PACF) resolves this and provides just the effect of lag Yt-k on Yt.

• The principle to estimate PACF is the following. Estimate an autoregression with all lags up to k-1, keep the coefficients fixed and add Yt-k. Its coefficient is the value of PACF. There are fast ways to get this, such as the Durbin-Levinson recursion.

• We rely on software to do this for us. [function in R: pacf]

Correlations between the series and its lags:

r_k       Y      Y(t-1)   Y(t-2)
Y         1
Y(t-1)    0.49   1
Y(t-2)    0.18   0.5      1
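Continuing the hypothetical 'sales' series from the earlier sketch:

```r
# Sketch: partial autocorrelations; pacf() handles the recursions
# (e.g. Durbin-Levinson) internally.
pacf(sales, lag.max = 20)
p <- pacf(sales, plot = FALSE)$acf   # the PACF values
```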


ACF and PACF plots

• Let us compare the ACF and PACF plots for coffee sales

[ACF and PACF plots of the coffee sales series]

We can see that once the in-between correlations are removed, only $Y_{t-1}$ is connected to $Y_t$:

$$\text{Sales}_t = 46.38 + 0.49\,\text{Sales}_{t-1}$$


The Autoregressive (AR) model

• The main idea behind this model is that we can use lags (Yt-k) to model the behaviour of our target variable. Generally we can write:

• The order p of AR(p) can be as high as we want, but generally small values for p suffice (typically up to the seasonal frequency: quarterly 4, monthly 12, etc.).

• We will return to the choice of p later.

$$AR(p):\ Y_t = \delta + \sum_{i=1}^{p} \phi_i Y_{t-i} + \varepsilon_t, \qquad \delta = \mu\left(1 - \sum_{i=1}^{p} \phi_i\right)$$

where $\phi_i$ is the autoregressive coefficient of lag i and $\mu$ is the mean of the series (can be zero).


The Autoregressive (AR) model

• Let us take a very simple AR(1) model:

• Observe that μ = 0, so in the long term we expect the series to stay around zero. Let us understand why.

• At each period we get an εt and 0.5Yt-1 to produce Yt. So we always take half of what we had before, plus an error that is centred around zero. Visually that is:

$$Y_t = 0.5 Y_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim N(0, \sigma)$$

[Time series plot: occasionally a large $\varepsilon_t$ pushes the series away from zero, after which it reverts to the mean]
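A quick R sketch of this mean-reverting behaviour, simulating the AR(1) above:

```r
# Simulate Yt = 0.5*Y(t-1) + et and watch it revert to its mean of zero.
set.seed(42)
y <- arima.sim(model = list(ar = 0.5), n = 200)
plot.ts(y)
abline(h = 0, lty = 2)   # the long-run mean
```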


The Autoregressive (AR) model

• Let us see some more examples:

• Observe that in all cases |φ₁| < 1.

[Example AR(1) series: positive φ₁ gives a smooth series (positive correlation); negative φ₁ gives a series that keeps changing direction (negative correlation)]


The Autoregressive (AR) model

• Let us see some more examples with |φ1| ≥ 1

• In these cases we see that the series does not remain around the mean (0), but wanders off. This is a non-stationary process: the mean over time is not constant.

• The practical issue with non-stationarity is that unless we treat it, calculating the model variance is messy, and the model variance feeds into forecast uncertainty, which in turn feeds into decision making.

• This is an important issue in ARIMA modelling that we will return to. For now let us focus on stationary processes.


The Autoregressive (AR) model

• For an AR(1) to be stationary we need |φ₁| < 1.

• For the conditions for AR(p) to be stationary we need to introduce some new notation, the lag or backshift operator:

$$B Y_t = Y_{t-1}, \qquad B^k Y_t = Y_{t-k}$$

• Let us rewrite AR(p) using the backshift operator:

$$Y_t = \delta + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \dots + \phi_p Y_{t-p} + \varepsilon_t$$
$$Y_t = \delta + \phi_1 B Y_t + \phi_2 B^2 Y_t + \dots + \phi_p B^p Y_t + \varepsilon_t$$
$$Y_t = \delta + (\phi_1 B + \phi_2 B^2 + \dots + \phi_p B^p) Y_t + \varepsilon_t$$
$$(1 - \phi_1 B - \phi_2 B^2 - \dots - \phi_p B^p) Y_t = \delta + \varepsilon_t$$

The polynomial $(1 - \phi_1 B - \dots - \phi_p B^p)$ is the characteristic polynomial.

• AR(p) is stationary “if the roots of the characteristic polynomial lie outside the unit circle”. During the estimation of AR models we impose restrictions to ensure that this is true. This is done by the statistical software for us.

• For example, for AR(2) the conditions are: |φ₂| < 1, φ₁ + φ₂ < 1, φ₂ − φ₁ < 1.
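In R the condition can be checked directly; the coefficients below are illustrative, not from the slides:

```r
# Roots of the characteristic polynomial 1 - phi1*z - phi2*z^2;
# stationarity requires all roots outside the unit circle.
phi <- c(0.5, 0.3)              # hypothetical AR(2) coefficients
roots <- polyroot(c(1, -phi))   # coefficients in increasing powers of z
all(Mod(roots) > 1)             # TRUE => stationary
```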


Forecasting with AR models

• Suppose our model is AR(2):

$$Y_t = 0.4 Y_{t-1} + 0.2 Y_{t-2} + \varepsilon_t$$

• To forecast with that, i.e. produce values $Y_{t+h}$, we need to know all the values on the right-hand side of the equation.

• First we set $\varepsilon_{t+h} = 0$, as future errors are unknown and we expect them on average to be 0; that is, $E(\varepsilon_t) = 0$, since $\varepsilon_t \sim N(0, \sigma)$ and is independent and identically distributed.

• If h = 1, both lags are observed, so we use $Y_t$ and $Y_{t-1}$ to get $\hat{Y}_{t+1}$:

$$\hat{Y}_{t+1} = 0.4 Y_t + 0.2 Y_{t-1}$$

• If h = 2, the value of $Y_{t+1}$ is unknown, so we use the forecasted one, $\hat{Y}_{t+1}$:

$$\hat{Y}_{t+2} = 0.4 \hat{Y}_{t+1} + 0.2 Y_t$$

• For all h > 2, we use forecasted values for both lags:

$$\hat{Y}_{t+h} = 0.4 \hat{Y}_{t+h-1} + 0.2 \hat{Y}_{t+h-2}$$
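A sketch of this recursion in R, with simulated data standing in for the observed series:

```r
# Iterate the AR(2) forecast recursion with phi1 = 0.4, phi2 = 0.2;
# future errors are set to zero.
set.seed(1)
y <- as.numeric(arima.sim(model = list(ar = c(0.4, 0.2)), n = 100))
h <- 10
f <- c(tail(y, 2), rep(NA_real_, h))   # seed with Y(t-1) and Y(t)
for (i in seq_len(h) + 2) {
  f[i] <- 0.4 * f[i - 1] + 0.2 * f[i - 2]
}
fcst <- f[-(1:2)]                      # Yhat(t+1), ..., Yhat(t+h)
```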


Forecasting with AR models

• Visually the forecast will look like:

• Observe that as long as the AR(p) model is stationary, eventually the forecast will get closer and closer to the mean (here it is zero).


AR models and PACF

• In theory the PACF plot can help us specify (stationary) AR models

[PACF plots of simulated AR series]

From the PACF we get the order and approximate sizes of the coefficients (not generally exact).


Outline

Forecasting with ARIMA

1. Autoregression, ACF and PACF

2. AR models

3. Differencing

4. MA models

5. ARMA and ARIMA models

6. Identification

7. Seasonal ARIMA

8. Transformations


Differencing (Integration – I)

• Let us return to the issue of non-stationarity. In practical terms non-stationarity will be caused when a series exhibits trend and/or seasonality.

• We can deal with non-stationarity by using differencing

• Differencing can help stabilise the mean of the time series, by removing the trend/seasonality.

$$Y_t' = Y_t - Y_{t-1}, \qquad Y_t' = (1 - B) Y_t$$

[Plots: the original trended series and its differenced version; the trend has been removed and the series is now stationary]
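In R, first differences are a one-liner; the built-in AirPassengers series is used here purely as a trended example:

```r
y  <- AirPassengers   # built-in monthly series with a clear trend
dy <- diff(y)         # Y't = Yt - Y(t-1); one observation shorter than y
plot.ts(dy)
```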


The Random Walk I(1)

• An ARIMA model with only first-order differencing – written as I(1) – is the random walk (or naïve) model.

• The random walk says that the changes of Yt are explained solely by the random term εt. That means that when the random walk is the appropriate model, we have no information about the time series that we can exploit for forecasting.

• In the plot of a random walk it may seem as if there is a trend we can exploit. When we plot the differenced (stationary) version we soon realise that the way the trend changes has no structure, i.e. it is random, and therefore unforecastable. Hence the random walk is a fitting model when we have no information about how a series evolves.

$$Y_t = Y_{t-1} + \varepsilon_t \quad\Leftrightarrow\quad Y_t - Y_{t-1} = \varepsilon_t$$


The Random Walk I(1)

• An example of 100 random walk instances starting from the same point:

• The various RW paths are random and unpredictable.
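A sketch of how such a plot can be generated in R:

```r
# 100 random walk paths from the same origin: cumulated white noise.
set.seed(1)
paths <- replicate(100, cumsum(rnorm(100)))
matplot(paths, type = "l", lty = 1, col = grey(0.2, alpha = 0.3),
        xlab = "t", ylab = "Yt")
```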


Higher order differencing

• We can devise higher order differencing if needed:

• Modelling in first-order differences is as if we are modelling the rate of change of the observed variable. Modelling in second-order differences is as if we are modelling the rate of change of the rate of change of the observed variable.

• We can calculate any order of differencing I(d), but typically we do not need to consider more than second order, and even that is rare.

I(1) – first order difference:

$$Y_t' = Y_t - Y_{t-1}, \qquad Y_t' = (1 - B) Y_t$$

I(2) – second order difference:

$$Y_t'' = Y_t' - Y_{t-1}'$$
$$Y_t'' = (1 - B) Y_t - B(1 - B) Y_t$$
$$Y_t'' = (1 - B)(1 - B) Y_t$$
$$Y_t'' = (1 - B)^2 Y_t$$
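In R, continuing with the AirPassengers stand-in from above:

```r
d2 <- diff(y, differences = 2)   # (1 - B)^2 Yt
all.equal(d2, diff(diff(y)))     # TRUE: a difference of the differences
```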


AR(p) and I(d)

• We can combine AR(p) and I(d) to construct AR models for non-stationary time series.

• So for example we can write AR(1) in first differences:

First we apply differences:

$$Y_t' = Y_t - Y_{t-1}$$

Then we model on the differenced data:

$$Y_t' = \phi_1 Y_{t-1}' + \varepsilon_t$$

or we can write the differencing and the AR(1) directly in one expression:

$$(1 - \phi_1 B)(1 - B) Y_t = \varepsilon_t$$


ARI(p,d) – ACF and PACF

• Suppose we have ARI(1,1):

• The ACF and PACF plots of the original and differenced series look like:

$$(1 - 0.7 B)(1 - B) Y_t = \varepsilon_t$$

[ACF/PACF plots of the original and differenced series: the slow decaying ACF pattern of the original series shows that differencing is needed; observe that the magnitude of the correlations is wrong, but the PACF at lag 1 of the differenced series is about right]


ARI(p,d) – ACF and PACF

• Suppose we have ARI(2,1):

• The ACF and PACF plots of the original and differenced series look like:

$$(1 - 0.5 B - 0.4 B^2)(1 - B) Y_t = \varepsilon_t$$

[ACF/PACF plots of the original and differenced series: again the slow decaying ACF pattern shows that differencing is needed; the magnitude of the correlations is wrong, but the PACF at lags 1 and 2 of the differenced series is about right]


Single Exponential Smoothing and ARIMA

• Let us consider Single Exponential Smoothing (SES). It can be written in two forms:

Recursive form: $\hat{Y}_t = \alpha Y_{t-1} + (1 - \alpha)\hat{Y}_{t-1}$

Error correction form: $\hat{Y}_t = \hat{Y}_{t-1} + \alpha e_{t-1}$

• From the error correction form we can manipulate it to produce the following:

$$(1 - B)\hat{Y}_t = \alpha e_{t-1}$$

which implies the following model (observe that we change from $\hat{Y}_t$ to $Y_t$, with $e_t = Y_t - \hat{Y}_t$):

$$(1 - B) Y_t = -(1 - \alpha)\varepsilon_{t-1} + \varepsilon_t$$
$$(1 - B) Y_t = -\theta\varepsilon_{t-1} + \varepsilon_t$$

The term $-\theta\varepsilon_{t-1}$ acts as a regression on past errors. This is the MA term of the ARIMA.


The Moving Average (MA) model

• Similarly to AR(1) we can define MA(1):

$$Y_t = \mu + \varepsilon_t - \theta\varepsilon_{t-1}$$

• MA(1) will always be stationary, but θ needs to be bounded: |θ| < 1.

• This is necessary to make sure that errors from the distant past are not more important than recent errors. This condition is known as the property of invertibility.

• Two ways to understand why:

• One based on SES

• Remember that from SES we have (1 − α) = θ. Given that 0 < α < 1, then 0 < θ < 1. Otherwise SES would not exhibit the exponentially decaying weight pattern. However, this does not explain the restriction fully…

• In fact we will show that since |θ| < 1, the correct bounds for the smoothing parameter α of SES are 0 < α < 2.

• The other way, which will allow us to show this fully, requires transforming MA(1) into a regression…


Invertibility restriction for MA(1)

• Starting from the basic MA(1) equation:

$$Y_t = \mu + \varepsilon_t - \theta\varepsilon_{t-1} \quad\Rightarrow\quad Y_t - \mu = \varepsilon_t - \theta\varepsilon_{t-1}$$

We substitute $Y_t - \mu$ with $z_t$:

$$z_t = \varepsilon_t - \theta\varepsilon_{t-1}$$

and solve for $\varepsilon_t$:

$$\varepsilon_t = z_t + \theta\varepsilon_{t-1}$$

Based on this we can write the equation for $\varepsilon_{t-1}$:

$$\varepsilon_{t-1} = z_{t-1} + \theta\varepsilon_{t-2}$$

which in turn can be substituted above to give:

$$z_t = \varepsilon_t - \theta(z_{t-1} + \theta\varepsilon_{t-2}) = \varepsilon_t - \theta z_{t-1} - \theta^2\varepsilon_{t-2}$$

Continuing the substitutions we end up with an infinite order AR model:

$$z_t = \varepsilon_t - \theta z_{t-1} - \theta^2 z_{t-2} - \theta^3 z_{t-3} - \theta^4 z_{t-4} - \dots$$

For the weights on the past to be decreasing we need |θ| < 1.


MA(1) ACF and PACF

• Let us explore the ACF and PACF plots for some MA(1) series

For MA it is the ACF plot that reveals the model order (for AR it was the PACF).


MA(q) models

• We can extend MA(1) to higher order MA(q) models that contain q terms:

• The intuitive interpretation of MA(q) models is that they model shocks in εt with finite memory up to q. So suppose that something happens this period in the variable we want to model; an MA(q) model will remember and propagate this shock for the next q periods.

• The coefficients of MA(q) need to be restricted so that it is invertible. Again this is guaranteed by satisfying that “the roots of the characteristic polynomial lie outside the unit circle”. This is done by the statistical software for us.

$$MA(q):\ Y_t = \mu + \varepsilon_t - \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}$$


Outline

Forecasting with ARIMA

1. Autoregression, ACF and PACF

2. AR models

3. Differencing

4. MA models

5. ARMA and ARIMA models

6. Identification

7. Seasonal ARIMA

8. Transformations


ARMA(p,q) models

• We now have all the different components to construct an ARMA model:

• The AR and MA coefficients must meet the same requirements of the simpler models (stability and invertibility).

• In general, identifying the orders p and q using the ACF and PACF plots is almost impossible.

$$ARMA(p,q):\ Y_t = \delta + \sum_{i=1}^{p} \phi_i Y_{t-i} + \varepsilon_t - \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}$$


ARMA(p,q) – ACF & PACF

[ACF and PACF plots of simulated ARMA series]

• As you can see these are useless…


ARIMA(p,d,q) models

• We have all the different components to construct a full ARIMA model:

• The AR and MA coefficients must meet the same requirements of the simpler models (stability and invertibility).

• In general identifying the orders p, d, q using the ACF and PACF plots is almost impossible.

• There are several approaches in the literature on how to identify an ARIMA model. We will introduce a common approach to solve the problem.

$$ARIMA(p,d,q):\ (1 - B)^d Y_t = \delta + \sum_{i=1}^{p} \phi_i Y_{t-i} + \varepsilon_t - \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}$$


ARIMA(p,d,q) – ACF & PACF

[ACF and PACF plots of simulated ARIMA series]

• As you can see these are still useless…


Identifying ARIMA models

• Before we find a way to identify ARIMA we need to understand why it is difficult.

• Consider some sampled data.

• We proceed to build a model by implicitly assuming there is some underlying Data Generating Process (DGP).

• Implicitly we also suggest that there is one true DGP.

• The modelling exercise is about finding that true DGP, or getting as close to it as possible.

[Diagram: a model space containing candidate Models A–E scattered around the (unknown) true model]


True model

What makes a model closer to or further from the true DGP?

• What information it considers (variables)

• What parameters it has (estimation of coefficients)

Model selection is all about finding the model with the least distance from the true model. So for our case Models A–E could be alternative ARIMA models.

[Diagram: Models A–E in the model space at distances $d_A, \dots, d_E$ from the true model]


True model

Consider three fitted ARIMA models:

• Model A: $Y_t = \phi_1 Y_{t-1}$

• Model B: $Y_t = \phi_1' Y_{t-1}$ (different parameters from A)

• Model C: $Y_t = \phi_1'' Y_{t-1} + \theta_1 e_{t-1}$ (different form from A)

If we knew the true model, it would simply be a question of which of $d_A$, $d_B$ and $d_C$ is smaller; but in reality the true model is unknown.

[Diagram: as before, Models A–E at distances $d_A, \dots, d_E$ from the true model]


True model & model selection

All model selection metrics/criteria (like AIC, BIC, adj. R², etc.) are good/bad practical approximations of the unknown distances $d_i$ (under certain assumptions).

Here is the logic behind information criteria (IC) such as the Akaike IC (AIC):

[Diagram: Model A and Model B at distances $d_A$ and $d_B$ from the true model]

Let us assume that Model A is closer to the true process. The distance $d_A$ cannot be measured, so let us assume it is an unknown value C. Model B's distance is then C + ($d_B$ − $d_A$), i.e. what we really measure is the extra distance starting from a model – not from the true one. This also explains why the numbers we get from such criteria are meaningless on their own and should only be used comparatively.

Consider: what does an error mean on its own?

ALWAYS compare a model against benchmarks!


True model & estimation

The assumption of a “true model” has a very important implication for parameter estimation:

• Since C is unknown, we can never be sure that we know the true process of real data; any model we apply is an approximation.

• But how do we find model parameters? (For the sake of this example I will use a simple AR(1) model: $Y_t = \phi_1 Y_{t-1} + \varepsilon_t$.)

[Diagram: data points $y_1, \dots, y_4$; using the previous data point we produce the next prediction, giving the 1-step-ahead errors $e_2$, $e_3$, $e_4$]

So at every data point we measure the error between the current actual and the prediction from the previous point (1-step ahead). We summarise these with the sum of squared errors, $SSE = \sum_t e_t^2$. It is not clear in the formula, but by minimising the SSE we try to get the parameters that minimise the t+1 error.
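A sketch of this in R for the AR(1) example; the simulated series and the use of optimize() are illustrative, not how arima() estimates internally:

```r
# Estimate phi by minimising the 1-step-ahead sum of squared errors
# for Yt = phi * Y(t-1) + et.
set.seed(1)
y <- as.numeric(arima.sim(model = list(ar = 0.6), n = 200))
sse_ar1 <- function(phi, y) {
  e <- y[-1] - phi * y[-length(y)]   # 1-step-ahead errors e2, ..., en
  sum(e^2)
}
phi_hat <- optimize(sse_ar1, interval = c(-0.99, 0.99), y = y)$minimum
phi_hat   # should land near the true value of 0.6
```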


True model & estimation

The resulting coefficients provide a fit with the minimum in-sample error (variance) that is also unbiased (mean error = 0).

Let me rewrite this: since the errors are unbiased (mean error = 0), the variance of the errors is $\hat{\sigma}_e^2 = \frac{1}{n}\sum_t e_t^2$.

Assuming that sample size is not the issue, there are two cases here:

1. The model is true → your parameters are able to describe the process at any point in time (sample).

2. Your model is not true (what really happens) → you approximate what you have seen, but there will be some errors (more than the “natural randomness”) and your parameters provide the minimum variance t+1 errors.


What is the first problem?

For example, let us approximate a sine wave with a polynomial:

• In-sample, with the appropriate polynomial order, we can achieve a very good approximation.

• In a predictive context (out-of-sample) the quality of the approximation degrades fast as we produce multi-period ahead predictions, because we have only approximated the 1-step ahead well.

If the true process were known, the prediction would be perfect.
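A small R illustration of this point; the polynomial order and ranges are arbitrary:

```r
# Fit a degree-9 polynomial to a sine wave in-sample, then extrapolate.
x_in  <- seq(0, 4 * pi, length.out = 100)
fit   <- lm(y ~ poly(x, 9), data = data.frame(x = x_in, y = sin(x_in)))
x_out <- seq(4 * pi, 6 * pi, length.out = 50)
pred  <- predict(fit, newdata = data.frame(x = x_out))
# In-sample the fit is excellent; 'pred' degrades quickly as we move
# beyond the data used for fitting.
```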


What is the second problem?

In-sample with a sufficiently flexible model I can explain the process to any degree of accuracy.

Have I really explained the process? No!

So does a good in-sample fit tell us anything? Just because your model fits the data does not mean it explains the underlying process.

Important: A good fit tells you nothing about the predictive power of a model.


Information criteria

Information criteria typically have the following structure: a measure of model fit plus a penalty on the number of model parameters.

For example, AIC is:

$$AIC = -2\ln(\hat{L}) + 2k$$

where k is the number of model parameters and $\hat{L}$ is the likelihood; for several models the likelihood is approximated fine by the Mean Squared Error (MSE) for the purpose of the AIC calculation for ARIMA.

So what AIC does is balance the trade-off between over- and under-fitting. We are looking for a model that fits well and is not too complex (with the potential to overfit). It prefers parsimonious, simpler models that can achieve a good fit.


Information criteria

There are a few more well-established information criteria:

• The AICc (corrected for sample size) adjusts the penalty of AIC to account for the fact that we often have limited sample sizes: $AICc = AIC + \frac{2k(k+1)}{n - k - 1}$, where n is the sample size. When samples are large the two converge, so it is often recommended to just use AICc all the time.

• The Bayesian IC (BIC) has a different penalty, $BIC = -2\ln(\hat{L}) + k\ln(n)$, which is stronger than that of AIC for sample sizes of more than 7 observations (as $\ln(n) > 2$ once $n > e^2 \approx 7.4$). Typically it will result in more parsimonious models.
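In R, fitted ARIMA objects from the forecast package report all three criteria; the series below is simulated just to have something to fit:

```r
library(forecast)                  # provides Arima() with AICc and BIC
set.seed(1)
y   <- arima.sim(model = list(ar = 0.5, ma = 0.3), n = 200)
fit <- Arima(y, order = c(1, 0, 1))
c(AIC = fit$aic, AICc = fit$aicc, BIC = fit$bic)
```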


A note on information criteria

The likelihood, and therefore information criteria, are conditional on the available sample. This means:

• I cannot compare ICs calculated on different samples.

• Different samples can occur, for example, due to:
  - data differencing
  - data transformations

• We “summarise” this by often saying that ICs should be used between models of the same family. This is not 100% correct: we should simply observe that the sample remains the same (number of observations, transformations, scale). ICs do not require the models to be nested (i.e. belonging to the same supermodel).

• In practice that would mean that in most cases you can use ICs to select between different exponential smoothing models, but not between exponential smoothing models and ARIMA.


ARMA(p,q) – Identification

• For now let us ignore any potential need for differencing.

1. We can fit different ARMA(p,q) up to p = pmax and q = qmax (typically up to seasonal length) and calculate some IC, for example AICc;

2. Compare all AICc and pick the best model (lowest AICc).

• Here is an example with coffee sales and pmax = qmax = 2.

• It is apparent that this approach can become very computationally demanding quickly.

Model      AICc
ARMA(1,0)  308.577
ARMA(2,0)  308.577
ARMA(0,1)  301.865
ARMA(0,2)  301.865
ARMA(1,1)  305.600
ARMA(1,2)  305.600
ARMA(2,1)  308.577
ARMA(2,2)  305.600

ARMA(0,1) and ARMA(0,2) both have the lowest AICc, so we choose ARMA(0,1), which is the simplest.
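A sketch of this exhaustive search with the forecast package, assuming a series y (for example the simulated one from earlier):

```r
library(forecast)
best <- NULL
for (p in 0:2) for (q in 0:2) {
  fit <- Arima(y, order = c(p, 0, q))
  # Strict '<' keeps the earlier (simpler) model when AICc values tie.
  if (is.null(best) || fit$aicc < best$aicc) best <- fit
}
best   # the ARMA(p,q) with the lowest AICc
```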


ARMA(p,q) – Identification

• Instead of trying all p & q combinations we can do a smart search of the model space.

1. Set maximum orders pmax and qmax (typically up to the seasonal length) and choose an IC, for example AICc;

2. Fit four alternatives and pick the best: ARMA(2,2), ARMA(0,0), ARMA(1,0), ARMA(0,1);

3. For the best, try varying p/q by ± 1 and including/excluding the constant. Pick the best model on AICc;

4. Repeat step (3) from the new best model until no more improvement in AICc can be achieved. Do not exceed pmax and qmax.

• This identification routine is implemented in R in the function auto.arima in the forecast package.

• It is not the only way to do it!
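Usage is a one-liner, assuming some series y; stepwise = TRUE requests the smart search described above:

```r
library(forecast)
fit <- auto.arima(y, d = 0, max.p = 2, max.q = 2, stepwise = TRUE)
summary(fit)
```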


An example of ARMA identification

• Let us consider a different example. ARIMA models are very popular in many fields, like engineering and computing, because of their flexibility.

• We take a song and extract its wave-form. This is now a time series (sampled at the frequency rate at which it was recorded/stored), which we can model with ARIMA.

• Suppose we extract a 10 second sample (it is at 11,025 samples per second, so we have 110,250 observations). This is adequate to fit an ARMA.

• We find that an ARMA(5,4) is adequate according to AICc (auto.arima process).

• We produce fitted ARMA values and compare with actuals… or listen to the fitted values and the actuals. This is one way of doing “lossy compression”.

• We forecast the next 5 seconds and compare visually, or by listening to it, to evaluate how well the identified ARMA did.

• Let me warn you: the sample has sufficient “noise”!


An example of ARMA identification

• Plot of the last 1000 data points and the first 500 forecasted values.

• The coloured area shows the prediction intervals of the model, reflecting the error variance/forecast uncertainty.

[Plot legend: Sample, ARMA fit, ARMA forecast]


An example of ARMA identification

• Let us understand why there is such a difference between fit & forecast.

• During the in-sample period we only produce t+1 forecasts, i.e. we keep on getting new information and the model updates. If it has not captured the data well, it is not apparent, because it is only predicting t+1.

• Out-of-sample it is asked to forecast without any new data. To do this it produces multi-step forecasts that exhibit a mean-reversion to zero. Zero means no sound! The ARMA did not capture the song.

• All the information contained in the past instead went to the forecast errors. The prediction intervals are based on the error variance.

In-sample fit can be misleading! Out-of-sample evaluation is needed before you commit to a model.

We will return to this topic in forecast evaluation.

ARIMA(p,d,q) – Identification

• Identification of ARIMA is more complex because we need to first deal with the order of differencing.

• As differencing changes the sample, we can no longer rely on ICs, so we need different alternatives:

• Use ACF plots. When differencing is needed they demonstrate a very slow gradual decline.

• Predictive forecast evaluation

• Use statistical tests

• There is a large selection of statistical tests. None is perfect, but at least they are useful to automate the modelling process. Common tests are the KPSS (Kwiatkowski–Phillips–Schmidt–Shin) and the ADF (Augmented Dickey-Fuller).

• These are applied iteratively to identify the appropriate order of differencing.

• auto.arima uses KPSS.
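In R the iterated testing is wrapped for us; a sketch assuming some series y:

```r
library(forecast)
ndiffs(y, test = "kpss")   # suggested order of differencing d via KPSS
ndiffs(y, test = "adf")    # the same via the Augmented Dickey-Fuller test
```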



Outline

Forecasting with ARIMA

1. Autoregression, ACF and PACF

2. AR models

3. Differencing

4. MA models

5. ARMA and ARIMA models

6. Identification

7. Seasonal ARIMA

8. Transformations

Seasonal ARIMA(p,d,q)(P,D,Q)s

• The ARIMA models we dealt with so far cope with non-seasonal time series, but it is intuitively easy to extend them to SARIMA.

• Consider an ARIMA(1,1,1); we will extend it to a SARIMA(1,1,1)(1,1,1)₁₂ to highlight the differences:

$$ARIMA(1,1,1):\ (1 - \phi_1 B)(1 - B) Y_t = (1 + \theta_1 B)\varepsilon_t$$

$$SARIMA(1,1,1)(1,1,1)_{12}:\ (1 - \phi_1 B)(1 - \Phi_1 B^{12})(1 - B)(1 - B^{12}) Y_t = (1 + \theta_1 B)(1 + \Theta_1 B^{12})\varepsilon_t$$

The seasonal period here is 12 months: $(1 - \Phi_1 B^{12})$ is the seasonal AR term, $(1 - B^{12})$ the seasonal I and $(1 + \Theta_1 B^{12})$ the seasonal MA term. Observe that the seasonal differencing is done seasonally: $Y_t - Y_{t-12}$.

• SARIMA(p,d,q)(P,D,Q)s models in general have the following form:

$$\left(1 - \sum_{i=1}^{p} \phi_i B^i\right)\left(1 - \sum_{k=1}^{P} \Phi_k B^{sk}\right)(1 - B)^d (1 - B^s)^D Y_t = \left(1 - \sum_{j=1}^{q} \theta_j B^j\right)\left(1 - \sum_{l=1}^{Q} \Theta_l B^{sl}\right)\varepsilon_t$$

Identification of SARIMA

• As you can appreciate, identification of SARIMA is a beast…

• First we need to establish the order of seasonal differences (D) and level differences (d). For the seasonal differences auto.arima uses the OCSB test (Osborn, Chui, Smith, Birchenhall).

• It is good practice to first consider seasonal differences and then proceed if needed to level differences.

• Once the differencing is sorted then we proceed to identify orders p, q, P and Q using some information criterion.

• Typically the seasonal orders are low (P, Q ≤ 2, D ≤ 1).


SARIMA example

Let us forecast monthly wine bottle sales

• We use auto.arima to identify the model orders.

• The OCSB test indicates 1 seasonal difference and the KPSS test 1 level difference.

• Using AICc the best SARIMA is found to be: SARIMA(0,1,1)(0,1,2)12

• We use this to forecast the next 32 periods.
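A sketch of this workflow in R; the wine series is not supplied with the slides, so the built-in AirPassengers is used as a stand-in monthly series:

```r
library(forecast)
wine <- AirPassengers      # stand-in; replace with the monthly wine sales ts
nsdiffs(wine, test = "ocsb")                              # seasonal differences D
ndiffs(diff(wine, lag = frequency(wine)), test = "kpss")  # then level differences d
fit <- auto.arima(wine)    # AICc-based selection of p, q, P and Q
plot(forecast(fit, h = 32))                               # forecast 32 periods ahead
```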

Exponential Smoothing equivalences

• Some Exponential Smoothing models have equivalent ARIMA models, but in most of the cases there is no direct equivalence.

• An interesting observation is that all equivalences refer to additive forms of exponential smoothing.

• ARIMA cannot do multiplicative modelling, so the typical Holt-Winters exponential smoothing (additive trend – multiplicative seasonality) cannot be modelled by ARIMA.

• This is a major limitation as most seasonal business time series exhibit multiplicative form.

EXSM                                     ETS          ARIMA
SES                                      ETS(A,N,N)   ARIMA(0,1,1)
Holt (Linear Trend)                      ETS(A,A,N)   ARIMA(0,2,2)
Damped Holt                              ETS(A,Ad,N)  ARIMA(1,1,2)
Seasonal                                 ETS(A,N,A)   ARIMA(0,0,s)(0,1,0)s
Additive Holt-Winters (Trend-Seasonal)   ETS(A,A,A)   ARIMA(0,1,s+1)(0,1,0)s
Damped Additive Holt-Winters             ETS(A,Ad,A)  ARIMA(1,0,s+1)(0,1,0)s
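A sketch of the first equivalence in R. One convention detail: R's arima() writes the MA part with a plus sign, so the reported ma1 equals −θ in the slide notation and the implied SES parameter is α = 1 + ma1:

```r
library(forecast)
set.seed(1)
y   <- ts(cumsum(rnorm(200)))          # a level-only series
fit <- Arima(y, order = c(0, 1, 1))    # the ARIMA counterpart of SES
alpha <- 1 + unname(fit$coef["ma1"])   # implied SES smoothing parameter
alpha
```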

Logarithmic transformation

• Logarithms come to the partial rescue. They have the nice property of transforming multiplicative relations to additive ones.

• But we need to be careful with trends, as they will typically become nonlinear, which ARIMA does not handle.

[Plots: a multiplicative series becomes additive after the log transformation]

Logarithmic transformation

• The process is as follows (a sketch in R follows the list):

1. Transform the series using logarithms

2. Fit SARIMA and forecast

3. Return forecasts to original domain by using exponents

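A sketch assuming AirPassengers as a stand-in multiplicative series. One caveat glossed over here: exponentiating the mean of a log-space forecast returns the median, not the mean, in the original units:

```r
library(forecast)
fit  <- auto.arima(log(AirPassengers))   # steps 1-2: transform, fit and forecast
fc   <- forecast(fit, h = 24)
back <- exp(fc$mean)                     # step 3: back to the original domain
plot(back)
# Equivalently, auto.arima(AirPassengers, lambda = 0) applies the log
# transform and back-transforms the forecasts for us.
```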

ARIMA summary

• ARIMA models follow a different modelling philosophy to exponential smoothing.

• ARIMA sees the time series as AR and MA components.

• A series may need to be differenced first.

• ACF and PACF plots are useful for very simple ARIMA, but become useless once mixed forms appear (so almost always!).

• To identify the ARIMA orders we can use information criteria, and statistical tests for the order of differencing.

• Seasonal ARIMA models capture seasonality and can be built in a similar fashion.

• Consider data transformations (differences & logs) when needed.


Resources

Books:

• Box, George EP, et al. Time series analysis: forecasting and control. John Wiley & Sons, 2015. The book that started it all!

• Fildes, Robert, and Keith Ord. "Forecasting competitions–their role in improving forecasting practice and research." A companion to economic forecasting (2002): 322-253. Easy introduction to ARIMA, good intuition.

• Hyndman, Rob J., and George Athanasopoulos. Forecasting: principles and practice. OTexts, 2014. Easy introduction with R examples, available online for free.

• Hamilton, James Douglas. Time series analysis. Vol. 2. Princeton: Princeton University Press, 1994. All the details you would ever want to know about ARIMA.

• http://nikolaos.kourentzes.com/ My forecasting research blog.

Packages for R:

• forecast contains auto.arima

• smooth – a different implementation of ARIMA based on state-space models (available on GitHub)


Thank you for your attention!

Questions?

Nikolaos Kourentzes
Lancaster University Management School
Lancaster Centre for Forecasting – Lancaster, LA1 4YX
email: [email protected]

Forecasting blog: http://nikolaos.kourentzes.com

www.forecasting-centre.com/

Full or partial reproduction of the slides is not permitted without author’s consent. Please contact [email protected] for more information.