
Automatic Forecasting using Gaussian Processes

G. Corani¹ (giorgio@idsia.ch)   A. Benavoli²   J. Augusto¹   M. Zaffalon¹

¹ Ist. Dalle Molle Intelligenza Artificiale (IDSIA), USI-SUPSI, 6928 Manno, Switzerland

² Department of Computer Science and Information Systems (CSIS), University of Limerick, Ireland.

Abstract

Automatic forecasting is the task of receiving a time series and returning a forecast for the next time steps without any human intervention. We propose an approach for automatic forecasting based on Gaussian Processes (GPs). So far, the main limits of GPs on this task have been the lack of a criterion for the selection of the kernel and the long times required for training different competing kernels. We design a fixed additive kernel, which contains the components needed to model most time series. During training the unnecessary components are made irrelevant by automatic relevance determination. We assign priors to each hyperparameter. We design the priors by analyzing a separate set of time series through a hierarchical GP. The resulting model performs very well on different types of time series, being competitive with or outperforming the state-of-the-art approaches. Thanks to the priors, we reliably estimate the parameters with a single restart; this speedup makes the model efficient to train and suitable for processing a large number of time series.

1 Introduction

Automatic forecasting (Hyndman and Khandakar, 2008) is the task of receiving a time series and returning a forecast for the next time steps without any human intervention. In order to scale to a large number of time series, the algorithm should be both accurate and fast.


For monthly and quarterly time series, the state of the art is constituted by state-space exponential smoothing (ETS, Hyndman (2018a)) and automated arima procedures (auto.arima, Hyndman and Khandakar (2008)).

Yet ETS and auto.arima do not manage complex seasonalities, such as non-integer or multiple seasonalities. State-of-the-art approaches for these cases are TBATS (De Livera, Hyndman, and Snyder, 2011) and Prophet (Taylor and Letham, 2018).

Gaussian Processes (GPs) are well suited to time series modelling, but so far they have not excelled in automatic forecasting. The automatic statistician (Duvenaud et al., 2013) was an attempt in this direction, based on the idea of training and comparing several kernels of different complexity. It had however two main problems: the lack (Lloyd, 2014) of a criterion for kernel selection and the time required for training all the candidate models. Criteria for estimating the out-of-sample performance of a GP model have later been developed, but they are not suitable for choosing among a large number of models (Vehtari et al., 2016).

The spectral mixture (SM) kernel (Wilson and Adams, 2013) allows automatic pattern discovery and extrapolation beyond the training data. The SM kernel is indeed powerful if its hyperparameters are perfectly set, but estimating them is challenging. The marginal likelihood is highly multimodal and it is unknown how to initialize the hyperparameters. For instance, Wu et al. (2017) use Bayesian optimization to initialize the hyperparameters of the SM kernel at each restart. This is effective but time-consuming, as it requires a large number of restarts. Neither the automatic statistician nor the SM kernel has been used for automatic forecasting.

We address automatic forecasting with a fixed additive kernel, which contains a periodic kernel to model the seasonal behaviour; moreover, it contains linear, RBF and SM kernels to model different types of trends. We thus avoid testing different combinations of kernels. Our additive kernel is slightly overparameterized, but this is not a problem: the unnecessary components are turned off during training by automatic relevance determination (ARD, MacKay (1998)).

Yet this model is not competitive if it is simply trained by maximizing the marginal likelihood. Simple but specialized models such as ETS and auto.arima are very effective on time series that contain, for instance, only about a hundred data points. In this context they generally outperform recurrent neural networks (Hewamalage, Bergmeir, and Bandara, 2020; Alexandrov et al., 2019).

To improve, we need to introduce prior knowledge. In order to obtain distributional information about the hyperparameters, we analyze a separate set of time series through a hierarchical GP model. On this basis we define priors for the hyperparameters.

Thanks to the priors, during training the optimizer discards unlikely regions of the hyperparameter space and converges to a sensible estimate. Moreover, a single restart is enough for reliable training: we measured an equivalent forecasting accuracy using one or twenty restarts. All the results reported in the paper therefore refer to a single restart. This speedup makes our model suitable for processing a large number of time series.

We perform extensive experiments on the monthly and quarterly time series of the M3 competition (Makridakis and Hibon, 2000), where our model outperforms both ETS and auto.arima. We then apply our model to time series with complex seasonality, where it again shows excellent performance.


The paper is organized as follows: we first recall the definition of GPs and kernels; we then present the composite kernel and the hierarchical GP used to define the hyperparameter priors. Eventually, we present experimental results on different types of time series.

2 Gaussian processes

Consider the regression model

y = f(x) + v, (1)

where $x \in \mathbb{R}^p$, $f : \mathbb{R}^p \to \mathbb{R}$ and $v \sim N(0, s_v^2)$ is the noise. Assume that we observe the training data $D = \{(x_i, y_i),\ i = 1, \ldots, n\}$. In Gaussian Process (GP) regression, we place a GP prior on the unknown $f$, $f \sim GP(0, k_{\theta})$, that is, a GP prior with zero mean function and covariance function $k_{\theta} : \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}^+$ depending on a vector of hyperparameters $\boldsymbol{\theta}$. We then compute the posterior distribution of $f$ given the data $D$ and use this posterior to make inferences about $f$.

In particular, based on the training data $X^T = [x_1, \ldots, x_n]$, $\mathbf{y} = [y_1, \ldots, y_n]^T$, and given $m$ test inputs $(X^*)^T = [x^*_1, \ldots, x^*_m]$, we wish to find the posterior distribution of $\mathbf{f}^* = [f(x^*_1), \ldots, f(x^*_m)]^T$. From (1) and the properties of the Gaussian distribution (in this paper, we incorporate the additive noise $v$ into the kernel by adding a white noise kernel term), the posterior distribution of $\mathbf{f}^*$ is (Rasmussen and Williams, 2006, Sec. 2.2):

$p(\mathbf{f}^* \mid X^*, X, \mathbf{y}, \boldsymbol{\theta}) = N(\mathbf{f}^*;\ \mu_{\theta}(X^* \mid X, \mathbf{y}),\ K_{\theta}(X^*, X^* \mid X)), \qquad (2)$

with mean and covariance given by:

$\mu_{\theta}(\mathbf{f}^* \mid X, \mathbf{y}) = K_{\theta}(X^*, X)\,(K_{\theta}(X, X))^{-1}\mathbf{y},$

$K_{\theta}(X^*, X^* \mid X) = K_{\theta}(X^*, X^*) - K_{\theta}(X^*, X)\,(K_{\theta}(X, X))^{-1}K_{\theta}(X, X^*). \qquad (3)$
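The posterior (2)-(3) requires only standard linear algebra. Below is a minimal sketch (not the authors' implementation, and the names are ours): `kernel(X1, X2, theta)` is assumed to return the matrix $K_{\theta}(X_1, X_2)$, with the observation noise already included in the kernel through the WN term.

```python
import numpy as np

def gp_posterior(X, y, X_star, kernel, theta, jitter=1e-8):
    """Posterior mean and covariance of f* at the test inputs X_star (Eq. 3)."""
    K = kernel(X, X, theta) + jitter * np.eye(len(X))    # K_theta(X, X), noise in WN term
    K_s = kernel(X_star, X, theta)                       # K_theta(X*, X)
    K_ss = kernel(X_star, X_star, theta)                 # K_theta(X*, X*)
    L = np.linalg.cholesky(K)                            # stable alternative to inverting K
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} y
    mean = K_s @ alpha                                   # posterior mean
    V = np.linalg.solve(L, K_s.T)
    cov = K_ss - V.T @ V                                 # posterior covariance
    return mean, cov
```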

The kernel defines the covariance between any two function values: $\mathrm{Cov}(f(x), f(x^*)) = k_{\theta}(x, x^*)$. The kernel specifies which functions are likely under the GP prior. Common kernels are the white noise (WN), the squared exponential (RBF), the periodic (PER) and the linear (LIN). The following expressions are valid for $p = 1$, which is the case of time series; see (Rasmussen and Williams, 2006) for generalizations:

WN: $k_{\theta}(x_1, x_2) = s_v^2\,\delta_{x_1, x_2}$

LIN: $k_{\theta}(x_1, x_2) = s_l^2\, x_1 x_2$

RBF: $k_{\theta}(x_1, x_2) = s_r^2 \exp\left(-\dfrac{(x_1 - x_2)^2}{2\ell_r^2}\right)$

PER: $k_{\theta}(x_1, x_2) = s_p^2 \exp\left(-\dfrac{2\sin^2(\pi|x_1 - x_2|/p_e)}{\ell_p^2}\right)$


where $\delta_{x_1,x_2}$ is the Kronecker delta, which equals one when $x_1 = x_2$ and zero otherwise. The hyperparameters are the variances $s_v^2, s_l^2, s_r^2, s_p^2 > 0$, the lengthscales $\ell_r, \ell_p > 0$ and the period $p_e$.
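For concreteness, the four basic kernels can be written down directly; the following NumPy sketch (function and variable names are ours) follows the parametrization above for one-dimensional inputs.

```python
import numpy as np

def wn_kernel(x1, x2, s_v):
    d = x1[:, None] - x2[None, :]
    return s_v**2 * (d == 0).astype(float)               # Kronecker delta

def lin_kernel(x1, x2, s_l):
    return s_l**2 * np.outer(x1, x2)

def rbf_kernel(x1, x2, s_r, ell_r):
    d = x1[:, None] - x2[None, :]
    return s_r**2 * np.exp(-d**2 / (2.0 * ell_r**2))

def per_kernel(x1, x2, s_p, ell_p, period=1.0):
    d = np.abs(x1[:, None] - x2[None, :])
    return s_p**2 * np.exp(-2.0 * np.sin(np.pi * d / period)**2 / ell_p**2)
```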

Wilson and Adams (2013) proposed the spectral mixture kernel, which is the summation of different SM kernels. Each SM kernel is the product of an RBF kernel and a further kernel called the cosine kernel (Wilson and Adams, 2013). In our composite kernel, we introduce two SM kernels. The i-th SM kernel is defined as:

SM$_i$: $k_{\theta}(x_1, x_2) = s_{m_i}^2 \exp\left(-\dfrac{(x_1 - x_2)^2}{2\ell_{m_i}^2}\right)\cos\left(\dfrac{x_1 - x_2}{\tau_{m_i}}\right),$

with hyperparameters $s_{m_i}$, $\ell_{m_i}$ and $\tau_{m_i}$.
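Under the same conventions, a sketch of the i-th SM component, i.e. the RBF envelope multiplied by the cosine term:

```python
import numpy as np

def sm_kernel(x1, x2, s_m, ell_m, tau_m):
    d = x1[:, None] - x2[None, :]
    return s_m**2 * np.exp(-d**2 / (2.0 * ell_m**2)) * np.cos(d / tau_m)
```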

We denote by $\boldsymbol{\theta}$ the vector containing all the hyperparameters. Variances and lengthscales are non-negative hyperparameters, to which we assign log-normal priors (later we show how we define the priors).

We then compute the maximum a-posteriori (MAP) estimate of $\boldsymbol{\theta}$, thus approximating the marginal of $\mathbf{f}^*$ with (2). To this end, we maximize w.r.t. $\boldsymbol{\theta}$ the joint marginal probability of $\mathbf{y}, \boldsymbol{\theta}$, which is the product of the prior $p(\boldsymbol{\theta})$ and the marginal likelihood (Rasmussen and Williams, 2006, Ch. 2):

$p(\mathbf{y} \mid X, \boldsymbol{\theta}) = N(\mathbf{y};\ \mathbf{0},\ K_{\theta}(X, X)). \qquad (4)$
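The MAP step can be sketched as follows (this is not the authors' code, which is based on GPy): the objective is the log of (4) plus the sum of the log-normal log-priors, maximized over the log of the hyperparameters so that positivity is preserved. Here `kernel` is a covariance function as above and `priors` is an assumed list of $(\nu, \lambda)$ pairs, one per hyperparameter.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import lognorm

def log_marginal_likelihood(theta, X, y, kernel):
    """Log of Eq. (4) for a zero-mean GP whose noise is inside the kernel."""
    K = kernel(X, X, theta) + 1e-8 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * len(X) * np.log(2 * np.pi))

def neg_log_joint(log_theta, X, y, kernel, priors):
    theta = np.exp(log_theta)                            # back to positive hyperparameters
    log_prior = sum(lognorm(s=lam, scale=np.exp(nu)).logpdf(t)
                    for t, (nu, lam) in zip(theta, priors))
    return -(log_marginal_likelihood(theta, X, y, kernel) + log_prior)

# single restart, starting e.g. from the prior medians:
# theta0 = np.log([np.exp(nu) for nu, lam in priors])
# theta_map = np.exp(minimize(neg_log_joint, theta0,
#                             args=(X, y, kernel, priors), method="L-BFGS-B").x)
```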

2.1 Kernel design

Positive definite kernels (i.e., those which define valid covariance functions) are closed under addition and multiplication; in particular, an additive kernel models the data as the sum of independent functions. Also Prophet (Taylor and Letham, 2018) computes forecasts which are the sum of different non-linear functions of time; in their case: trend, seasonality and holiday effect.

We propose a kernel K given by:

K = PER + LIN + RBF + SM1 + SM2 + WN,

where the WN kernel represents the Gaussian noise of Eq. (1). We now quickly discuss the role of the different components. The periodic kernel (PER) captures the seasonality of the time series. Time series models (Hyndman and Khandakar, 2008) assume both quarterly and monthly time series to have a seasonal period of one year. Analogously, we fix the period $p_e$ of the PER kernel to one year, thus introducing prior knowledge and reducing the number of hyperparameters. Throughout this paper, we express the time in years; thus we fix $p_e = 1$.

Many time series are moreover characterized by a linear trend. For instance, auto.arima (Hyndman and Khandakar, 2008) applies first differences to about 40% of the monthly time series of the M3 competition. Fitting an arma model on the first differences of the time series is equivalent (Hyndman, 2018a, Chap. 8) to adding a linear trend to the forecasts. This is the reason for including the LIN kernel.

Long-term trends are generally smooth, and can be properly modelled by the RBF kernel. The two SM kernels are used to pick up the remaining signal.
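A sketch of the resulting additive kernel, reusing the component functions sketched in Section 2 (the parameter ordering is ours; the period of PER is fixed to $p_e = 1$, with time expressed in years):

```python
def composite_kernel(x1, x2, theta):
    (s_p, ell_p,                 # periodic component, period fixed to one year
     s_l,                        # linear component
     s_r, ell_r,                 # RBF component
     s_m1, ell_m1, tau_m1,       # first spectral-mixture component
     s_m2, ell_m2, tau_m2,       # second spectral-mixture component
     s_v) = theta                # white-noise component
    return (per_kernel(x1, x2, s_p, ell_p, period=1.0)
            + lin_kernel(x1, x2, s_l)
            + rbf_kernel(x1, x2, s_r, ell_r)
            + sm_kernel(x1, x2, s_m1, ell_m1, tau_m1)
            + sm_kernel(x1, x2, s_m2, ell_m2, tau_m2)
            + wn_kernel(x1, x2, s_v))
```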


During training, the GP detects the most important patterns of the time series; it is then able to generate forecasts of appropriate complexity, as shown in Fig. 1.

[Figure 1 here: three panels of normalized values against time (years 8 to 11), each showing the historical data and the forecast.]

Figure 1: Example of GP forecasts of different complexity. ARD turns off the seasonal component in the first two examples. The examples refer to monthly time series and the forecasts are computed for up to 18 months ahead.

2.2 Priors for the hyperparameters

We assign priors to the hyperparameters to improve the training of the GP. In particular, we assign log-normal priors to the variances and lengthscales:

$s_l^2, s_r^2, s_p^2, s_{m_1}^2, s_{m_2}^2, s_v^2 \sim \mathrm{LogN}(\nu_s, \lambda_s) \qquad (5)$

$\ell_r \sim \mathrm{LogN}(\nu_r, \lambda_r) \qquad (6)$

$\ell_p \sim \mathrm{LogN}(\nu_p, \lambda_p) \qquad (7)$

$\ell_{m_1} \sim \mathrm{LogN}(\nu_{m_1}, \lambda_{m_1}) \qquad (8)$

$\ell_{m_2} \sim \mathrm{LogN}(\nu_{m_2}, \lambda_{m_2}) \qquad (9)$

where $\mathrm{LogN}(\nu, \lambda)$ denotes the log-normal distribution whose parameters $\nu$ and $\lambda$ are the mean and the standard deviation of the log of the variable. Equation (5) shows that all variances have the same prior. The rationale is the following. The variance defines, for each kernel component, the distance between the sampled functions and the mean function. Within an additive composition of kernels, the components with larger variance have a larger impact on the final forecast. A priori all kernel components are equally relevant; their variances therefore have the same prior.

For simplicity, we moreover assign the same prior to the lengthscale of the cosine and RBF kernels within each SM component, that is, $\ell_{m_i} = \tau_{m_i}$ for $i = 1, 2$.


To quantify the different $\nu$ and $\lambda$ of the priors (5)-(9) we need distributional information about the hyperparameters. We obtain it by analysing a set of time series through a hierarchical GP; we discuss this step in the next section.

Hierarchical GP model. We assume the generative model of the j-th time series to be:

$s_l^{2(j)}, s_r^{2(j)}, s_p^{2(j)}, s_{m_1}^{2(j)}, s_{m_2}^{2(j)}, s_v^{2(j)} \sim \mathrm{LogN}(\nu_s, \lambda_s)$

$\ell_r^{(j)} \sim \mathrm{LogN}(\nu_r, \lambda_r)$

$\ell_p^{(j)} \sim \mathrm{LogN}(\nu_p, \lambda_p)$

$\ell_{m_1}^{(j)} \sim \mathrm{LogN}(\nu_{m_1}, \lambda_{m_1})$

$\ell_{m_2}^{(j)} \sim \mathrm{LogN}(\nu_{m_2}, \lambda_{m_2})$

$\mathbf{y}^{(j)} \sim N(\mathbf{0},\ K_{\theta}(X^{(j)}, X^{(j)})),$

where $K_{\theta}$ denotes our composite kernel. Let us recall that the various $\nu$ and $\lambda$ represent the mean and the standard deviation of the log of the hyperparameters. We assume them to be drawn from the following hyperprior:

$\nu_s, \nu_p, \nu_r, \nu_{m_2} \sim N(0, 5)$

$\nu_{m_1} \sim N(-1.5, 5)$

$\lambda_s, \lambda_r, \lambda_p, \lambda_{m_1}, \lambda_{m_2} \sim \mathrm{Gamma}(1, 1),$

where the lower prior mean for $\nu_{m_1}$ is designed to specialize SM1 towards short-term effects.

We fit this model on 350 monthly time series, randomly selected from the M3 competition. When we fit the hierarchical model, it jointly estimates the hyperparameters of the different time series, which in this way borrow statistical strength from each other.

The hierarchical model is a principled approach for jointly analysing multiple time series, but it is hardly usable because of its computational requirements. We thus fit it only once, and we extract the posterior means of the various $\nu$ and $\lambda$ parameters. Such values are reported in Table 1.

In the experiments shown later in the paper, we fit the GP as a univariate time series model, using as parameters of the log-normal priors (Eqs. (5)-(9)) the values of Table 1. The univariate GP equipped with such priors can thus be seen as a fast approximation to the hierarchical GP.

We implemented the model in PyMC3 (Salvatier, Wiecki, and Fonnesbeck, 2016). Before fitting the hierarchical model, we standardize each time series to have mean 0 and variance 1. We use automatic differentiation variational inference (ADVI) to approximate the posterior distribution of the $\nu_i, \lambda_i$.
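A partial PyMC3 sketch of this hierarchical prior structure is given below. It is not the authors' code; it omits the per-series GP marginal likelihood $\mathbf{y}^{(j)} \sim N(\mathbf{0}, K_{\theta}(X^{(j)}, X^{(j)}))$, which would be added with the composite kernel of Section 2.1, and it assumes that the value 5 in the hyperprior is a standard deviation.

```python
import pymc3 as pm

n_series = 350                    # monthly series used to fit the hierarchy
with pm.Model() as hierarchical_model:
    # hyperpriors shared across all time series
    nu_s = pm.Normal("nu_s", mu=0.0, sigma=5.0)       # log-mean of all variances
    nu_r = pm.Normal("nu_r", mu=0.0, sigma=5.0)       # log-mean of the RBF lengthscale
    nu_m1 = pm.Normal("nu_m1", mu=-1.5, sigma=5.0)    # shifted towards short-term effects
    lam_s = pm.Gamma("lam_s", alpha=1.0, beta=1.0)
    lam_r = pm.Gamma("lam_r", alpha=1.0, beta=1.0)
    lam_m1 = pm.Gamma("lam_m1", alpha=1.0, beta=1.0)
    # ... analogous hyperpriors for the periodic and SM2 lengthscales ...

    for j in range(n_series):
        s2_r = pm.Lognormal(f"s2_r_{j}", mu=nu_s, sigma=lam_s)
        ell_r = pm.Lognormal(f"ell_r_{j}", mu=nu_r, sigma=lam_r)
        ell_m1 = pm.Lognormal(f"ell_m1_{j}", mu=nu_m1, sigma=lam_m1)
        # ... remaining variances/lengthscales, then the GP likelihood of y_j ...

# with hierarchical_model:
#     approx = pm.fit(method="advi")   # ADVI approximation of the posterior of nu, lambda
```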

Priors and cross-information. As discussed by Hewamalage, Bergmeir, and Bandara (2020), in the context of time series, big data does not usually mean that the individual time series contain lots of data. Rather, it typically means that there are many related time series from the same domain. Such related time series can be jointly analyzed (Montero-Manso et al., 2020; Smyl, 2020), in order to improve over univariate approaches. Also our approach can be seen as a way to use cross-series information. Indeed, the estimates of the hyperparameters on a given time series depend, through the prior, also on information extracted from a set of previously analyzed time series.

The inferred priors. The log-normal priors are reported in Table 1.

                          log(x)             x          x
parameter                 ν         λ        median     95th perc.

variance                 -1.6       1.00     0.2        1.06

lengthscales
  std periodic            0.35      1.44     1.4        15.0
  rbf                     1.04      0.75     2.8        9.8
  SM1                    -0.71      0.84     0.5        2.0
  SM2                     0.97      0.7      2.6        8.8

Table 1: Characteristics of the log-normal priors.
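The entries of Table 1 can be reproduced from $(\nu, \lambda)$ with the standard log-normal quantiles: the median is $\exp(\nu)$ and the 95th percentile is $\exp(\nu + 1.645\,\lambda)$. A quick sanity check on the SM1 row:

```python
import numpy as np

nu, lam = -0.71, 0.84                 # SM1 lengthscale row of Table 1
median = np.exp(nu)                   # ~0.5, as in the table
p95 = np.exp(nu + 1.645 * lam)        # ~2.0, as in the table
```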

The above priors can be interpreted as follows. Recall that each time series has variance one. Our model is composed of the sum of six kernels, and the sum of their median variances is indeed close to 1 a priori.

The median lengthscale of the rbf kernel is about 2.8 years, which corresponds to a long-term trend. This implies a significant correlation between observations located up to two lengthscales apart, that is about 5.6 years. Trends with long lengthscales are likely to be smooth and thus well captured by the RBF kernel.

The SM1 and SM2 kernels are used to capture short-term and long-term trends respectively. All priors are characterized by long tails, as shown by the 95th percentiles.

On the quarterly time series, we remove from the kernel the short-term component SM1.

3 Results

3.1 Monthly and quarterly case study

The time series of the M3 competition are available from the Mcomp package (Hyndman, 2018c) for R. There are 1489 monthly time series; we used 350 of them to train the hierarchical GP, and we exclude them from the subsequent experiments.

We thus run experiments on the remaining 1079 monthly time series. The length of the training set varies between 49 and 126 months, while the test set is always 18 months long. We run experiments also on the 760 quarterly time series from the M3 competition. The training set contains between 16 and 64 observations, while the test set is always 2 years long (8 quarters).


Figure 2: Distribution of MAE and CRPS on the 1080 monthly time series.

Figure 3: Distribution of MAE and CRPS on the 756 quarterly time series.

We standardize each time series using the mean and the standard deviation of the training set. This is necessary both for having variances coherent with the priors and for obtaining numerically homogeneous results on different time series.
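A sketch of this standardization step (variable names are ours): the statistics are computed on the training portion only and then applied to both periods.

```python
import numpy as np

def standardize(y_train, y_test):
    mu, sd = y_train.mean(), y_train.std()
    return (y_train - mu) / sd, (y_test - mu) / sd, mu, sd

# forecasts on the standardized scale are mapped back with: y_hat = y_hat_std * sd + mu
```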

We denote by GP our model trained with priors and by GP0 our model trained without priors. We use a single restart when training both GP and GP0; the average training time is <1 sec. on the monthly time series (average number of points: about 100) and <0.5 sec. on the quarterly time series (average number of points: about 45), using a standard laptop.

We use the priors of Table 1 both on monthly and quarterly time series. We also tried to define specialized priors for the quarterly time series. We did so by first averaging the same 350 monthly time series over quarters, and then fitting again the hierarchical model. The posterior estimates of the hierarchical model are in this case shifted towards longer lengthscales, since quarterly time series are smoother than monthly ones. The estimates for the variance remain instead practically unchanged. When running experiments on quarterly time series, we tried both the priors of Table 1 and those inferred from quarterly time series. The empirical results are practically equivalent in the two cases. We conclude that the priors of Table 1 can be used with different types of time series since they are only weakly informative and have long tails.

Competitors. The auto.arima algorithm first checks whether it is necessary to differentiate (possibly multiple times) the time series in order to make it stationary. Once the time series is assumed stationary, it fits multiple ARMA models with different orders of the autoregressive and the moving average part; it eventually selects the best combination via the AICc.

The ETS algorithm fits several state-space exponential smoothing models (Hyndman, 2018a), characterized by different types of trend, seasonality and noise; the best combination is eventually chosen via the AICc. Both auto.arima and ETS are implemented within the forecast package for R, and their algorithms have been constantly updated after the first publication (Hyndman and Khandakar, 2008). Some details of both algorithms have been tweaked on the basis of the results on the data sets of the M3 competition.

The Prophet algorithm (Taylor and Letham, 2018) is a decomposable time series model, whose forecast is determined by the sum of different non-linear functions of time. It is able to detect change-points within the time series and it estimates seasonality using a partial Fourier sum; for this reason it can cope with complex seasonalities (although, as we will see, it is not very effective on time series with simple seasonality, like monthly and quarterly time series). All models represent the forecast uncertainty via a Gaussian distribution.

                          GP      ETS     arima    GP0     Prophet

monthly time series (1080)
MAE                       0.49    0.52    0.52     0.56    0.70
  prob. improv.                   (.99)   (.97)    (1.0)   (1.0)
CRPS                      0.35    0.37    0.37     0.39    0.51
  prob. improv.                   (.99)   (.96)    (1.0)   (1.0)
LL                       -0.99   -1.05   -1.07    -1.12   -1.42
  prob. improv.                   (.99)   (.59)    (1.0)   (1.0)

quarterly time series (756)
MAE                       0.40    0.43    0.39     0.48    0.64
  prob. improv.                   (.99)   (.45)    (1.0)   (1.0)
CRPS                      0.28    0.31    0.30     0.34    0.50
  prob. improv.                   (.99)   (.96)    (1.0)   (1.0)
LL                       -0.78   -0.93   -0.91    -1.02   -2.54
  prob. improv.                   (.99)   (.99)    (1.0)   (1.0)

Table 2: Median results on the monthly and quarterly M3 time series. In each row, the best result is the lowest MAE, the lowest CRPS or the highest LL. The rows labelled "prob. improv." report the posterior probability of the median of the GP being better than the median of each competitor.

We denote by $y_t$ and $\hat{y}_t$ the actual value and the expected value of the time series at time $t$. We moreover denote by $\sigma_t^2$ the variance of the forecast at time $t$ and by $T$ the length of the test set. We measure the mean absolute error (MAE) and the log-likelihood (LL) on the test set:

$\mathrm{MAE} = \dfrac{1}{T}\sum_{t=1}^{T} |y_t - \hat{y}_t|$

$\mathrm{LL} = \dfrac{1}{T}\left(-\dfrac{1}{2}\sum_{t=1}^{T}\log(2\pi\sigma_t^2) - \sum_{t=1}^{T}\dfrac{(y_t - \hat{y}_t)^2}{2\sigma_t^2}\right)$

We furthermore compute the continuous ranked probability score (CRPS) (Gneiting and Raftery, 2007), which generalizes the MAE to the case of probabilistic forecasts. It is a proper scoring rule for probabilistic forecasts, which corresponds to the integral of the Brier scores over the continuous predictive distribution. We denote by $F_t$ the cumulative predictive distribution at time $t$ and by $z$ the variable over which we integrate. The CRPS is defined as:

$\mathrm{CRPS}(F_t, y_t) = \int_{-\infty}^{\infty} \left(F_t(z) - \mathbb{1}\{z \geq y_t\}\right)^2 \, dz.$

For the Gaussian case, the integral can be computed in closed form (Gneiting and Raftery, 2007).
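A sketch of the three metrics for Gaussian predictive distributions, using the closed-form Gaussian CRPS of Gneiting and Raftery (2007); `y`, `mu` and `sigma2` denote arrays over the test horizon (names are ours).

```python
import numpy as np
from scipy.stats import norm

def mae(y, mu):
    return np.mean(np.abs(y - mu))

def avg_log_likelihood(y, mu, sigma2):
    return np.mean(norm.logpdf(y, loc=mu, scale=np.sqrt(sigma2)))

def crps_gaussian(y, mu, sigma2):
    sigma = np.sqrt(sigma2)
    z = (y - mu) / sigma
    crps_t = sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))
    return np.mean(crps_t)
```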

MAE and CRPS are loss functions, hence the lower the better. Instead, for LL, the higher the better.

The median results for both monthly and quarterly time series are given in Table 2. The GP obtains the best median performance on almost all indicators. We check the significance of the differences on the medians via the Bayesian signed-rank test (Benavoli et al., 2014). It returns the posterior probability of the GP having a better median than each competitor on a given indicator, given the results obtained on the collection of monthly or quarterly time series. Such posterior probability is equivalent (Benavoli et al., 2014) to 1 - pval, where pval denotes the p-value of the one-sided Wilcoxon signed-rank test. In most cases (Table 2) the posterior probabilities support the hypothesis of the GP significantly improving the median performance compared to each competitor. The improvement is not only on the medians, however: Figures 2 and 3 show a clear improvement on the distribution of both MAE and CRPS. The same type of result is found also on the LL distribution (boxplots not shown).

The large improvement of GP over GP0 shows that the priors are necessary to fully exploit the potential of the GP. In particular, the model trained without priors is clearly outperformed by both ETS and auto.arima. We also note the poor performance of Prophet in this context.

4 Non-integer seasonality

A challenging problem is how to forecast time series which have a non-integer seasonality (Hyndman, 2018a, Sec. 11.1). For instance, the number of weeks in a year is 365.25/7 ≈ 52.18. This makes it difficult to model the yearly seasonality of a time series constituted by weekly data.


Neither arima nor ETS can deal with non-integer seasonality. A model suitable to this end is harmonic regression, which is an arima model augmented with covariates constituted by Fourier terms. The number of Fourier terms is selected by minimizing the AICc (Hyndman, 2018a, Sec. 11.1).

A more sophisticated approach is TBATS (De Livera, Hyndman, and Snyder, 2011), which introduces Fourier terms within an exponential smoothing state space model.

Prophet models seasonality through Fourier terms, and for this reason it is able to fit such time series.

Also the GP model fits seamlessly to this kind of time series. From the theoretical viewpoint, a GP with a periodic kernel is equivalent to a state space model with infinitely many Fourier terms, so it can learn any periodic function (Solin and Sarkka, 2014; Benavoli and Zaffalon, 2016). From the practical viewpoint, we keep $p_e = 1$, since the period of the time series is one year. When we generate the inputs X, the time increases by $\Delta t = 1/52.18$ at every sample. We keep the same kernel structure and priors. We expect lengthscales on weekly time series to be shorter than on monthly time series; yet, as already discussed, the priors of Table 1 are only weakly informative and we do not change them. We do not include GP0 in this experiment.

We ran experiments on the gasoline time series, available from the fpp2 package for R (Hyndman, 2018b). It regards the US motor gasoline supplied every week over the period 1991-2017.

We perform experiments with training sets of different lengths, starting from 120 weekly observations and ending at 1240 observations; we increase the training set by about two years at every iteration. We consider a long-term forecast of 104 weeks (two years) ahead. An example of forecast is shown in Fig. 4.

The median results are given in Table 3. The best performing methods are TBATS and GP, which are practically equivalent. Thus the GP model performs very well both on time series with simple seasonality (monthly, quarterly) and with complex seasonality. As the results of the various experiments are correlated (they refer to training sets of increasing size), we do not perform statistical hypothesis testing (which assumes i.i.d. data).

The performance of Prophet is some 10% worse on every indicator, and harmonic regression is slightly worse than Prophet.

          GP      TBATS   Prophet   Harmonic regression

MAE       0.39    0.38    0.42      0.42
CRPS      0.27    0.27    0.30      0.32
LL       -0.71   -0.70   -0.87     -0.93

Table 3: Median results over the 15 experiments performed on the US gasoline data set (each experiment involves a training set of different length).

On the longest time series, the training time of the GP reaches up to 90 seconds, while Prophet fits in a few seconds. The training time of the GP can however be largely reduced by considering approximate techniques, as we discuss in more detail in the following case study.


[Figure 4 here: normalized values against time (years 2013 to 2017), showing the historical data and the forecast.]

Figure 4: Example of a forecast 2 years ahead (104 weeks) on the gasoline weekly time series. The forecast contains the seasonal pattern repeating twice and a slightly increasing trend.

5 Multiple seasonalities

We now consider a high-frequency time series characterized by two seasonalities. The call time series (Hyndman, 2018a, Sec. 11.1) counts the number of retail banking calls received every 5 minutes each weekday. The time series is irregularly spaced, as data are collected only during weekdays and only during certain hours (between 7:00am and 9:05pm); this is however not a problem, since all the considered methods are able to deal with irregularly sampled data. The time series is available from the fpp2 package (Hyndman, 2018b) for R.

The time series has two seasonalities: a daily and a weekly one. To model the double seasonality, we add a periodic kernel to our kernel, which becomes:

K = PERw + PERd + LIN + RBF + SM1 + SM2 + WN,

where PERw and PERd represent the weekly and the daily seasonality, respectively. In our setup, one year has length 1. We thus fix the period of PERw to 1/52.18 and the period of PERd to 1/(52.18 · 7).

Since the time series contains thousands of observations, we could not run the full GP. We thus use the stochastic variational GP (SVGP) available from GPflow (Matthews et al., 2017). As competitors, we consider again Prophet, TBATS and harmonic regression.
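With the same convention (time in years), the two periods can be sketched as below; the PER components can then be instantiated with the periodic kernel of Section 2.

```python
period_weekly = 1 / 52.18           # one week, expressed in years
period_daily = 1 / (52.18 * 7)      # one (week)day, expressed in years
# K = per_kernel(x1, x2, s_w, ell_w, period_weekly) \
#     + per_kernel(x1, x2, s_d, ell_d, period_daily) + ...   # remaining components
```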

We run experiments considering 15 training sets of different lengths, ranging from 1,700 samples to 24,250 samples. In each experiment we produce a forecast for the next two weeks. This implies forecasting about 1,700 time steps ahead (2 weeks · 5 weekdays · 169, where 169 is the number of 5-minute intervals during the working hours).


          GP (SVGP)   Prophet   Harmonic regression   TBATS

CRPS       0.17       0.18      0.31                  0.39
MAE        0.24       0.25      0.31                  0.36
LL        -0.26      -0.36     -1.04                 -1.30

Table 4: Median results on the call time series.

Also in this case we adopt the priors of Table 1. The median results are given in Table 4; once more the GP obtains the best results. This time it is followed closely by Prophet, while TBATS and harmonic regression are outperformed.

The training times are a few seconds for Prophet, and about 50 seconds for the SVGP. In future work we could test more recent techniques for exact inference of the full GP on large data sets (Wang et al., 2019) or approximate techniques specific to time series (Ambikasaran et al., 2015).

6 Code and replicability

We submit our code and the data which allow replicating the experiments on the monthly and quarterly M3 time series. The submitted code is based on GPy (GPy, since 2012).

7 Conclusions

We propose an additive kernel which makes the Gaussian process suitable for automatic forecasting. We train it using weakly informative priors on the hyperparameters, obtaining excellent results on different types of time series, characterized by simple and complex seasonalities, and by few or large amounts of data.

As far as we know, these are the best results obtained so far in automatic forecasting with Gaussian processes. Different future research directions are possible. One is the extension to time series characterized by non-Gaussian likelihoods, such as count time series or intermittent time series. Other possibilities include the adoption of a Student-t likelihood for time series characterized by outliers and the adoption of inference methods suitable for treating time series containing thousands of observations.

References

Alexandrov, A.; Benidis, K.; Bohlke-Schneider, M.; Flunkert, V.; Gasthaus, J.; Januschowski, T.; Maddix, D. C.; Rangapuram, S.; Salinas, D.; Schulz, J.; et al. 2019. GluonTS: Probabilistic time series models in Python. arXiv preprint arXiv:1906.05264.

Ambikasaran, S.; Foreman-Mackey, D.; Greengard, L.; Hogg, D. W.; and O'Neil, M. 2015. Fast direct methods for Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(2):252–265.

Benavoli, A., and Zaffalon, M. 2016. State space representation of non-stationary Gaussian processes. arXiv preprint arXiv:1601.01544.

Benavoli, A.; Corani, G.; Mangili, F.; Zaffalon, M.; and Ruggeri, F. 2014. A Bayesian Wilcoxon signed-rank test based on the Dirichlet process. In Proc. International Conference on Machine Learning, 1026–1034.

De Livera, A. M.; Hyndman, R. J.; and Snyder, R. D. 2011. Forecasting time series with complex seasonal patterns using exponential smoothing. Journal of the American Statistical Association 106(496):1513–1527.

Duvenaud, D.; Lloyd, J.; Grosse, R.; Tenenbaum, J.; and Zoubin, G. 2013. Structure discovery in nonparametric regression through compositional kernel search. In Proc. International Conference on Machine Learning, 1166–1174.

Gneiting, T., and Raftery, A. E. 2007. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102(477):359–378.

GPy. since 2012. GPy: A Gaussian process framework in Python. http://github.com/SheffieldML/GPy.

Hewamalage, H.; Bergmeir, C.; and Bandara, K. 2020. Recurrent neural networks for time series forecasting: Current status and future directions. International Journal of Forecasting.

Hyndman, R. J., and Khandakar, Y. 2008. Automatic time series forecasting: the forecast package for R. Journal of Statistical Software 26(3):1–22.

Hyndman, R. J., and Athanasopoulos, G. 2018a. Forecasting: Principles and Practice, 2nd edition. OTexts: Melbourne, Australia.

Hyndman, R. 2018b. fpp2: Data for "Forecasting: Principles and Practice" (2nd edition). R package version 2.3.

Hyndman, R. 2018c. Mcomp: Data from the M-Competitions. R package version 2.8.

Lloyd, J. R. 2014. GEFCom2012 hierarchical load forecasting: Gradient boosting machines and Gaussian processes. International Journal of Forecasting 30(2):369–374.

MacKay, D. J. 1998. Introduction to Gaussian processes. NATO ASI Series F Computer and Systems Sciences 168:133–166.

Makridakis, S., and Hibon, M. 2000. The M3-Competition: results, conclusions and implications. International Journal of Forecasting 16(4):451–476.

Matthews, A. G. d. G.; van der Wilk, M.; Nickson, T.; Fujii, K.; Boukouvalas, A.; Leon-Villagra, P.; Ghahramani, Z.; and Hensman, J. 2017. GPflow: A Gaussian process library using TensorFlow. Journal of Machine Learning Research 18(40):1–6.

Montero-Manso, P.; Athanasopoulos, G.; Hyndman, R. J.; and Talagala, T. S. 2020. FFORMA: Feature-based forecast model averaging. International Journal of Forecasting 36(1):86–92.

Rasmussen, C., and Williams, C. 2006. Gaussian Processes for Machine Learning. MIT Press.

Salvatier, J.; Wiecki, T. V.; and Fonnesbeck, C. 2016. Probabilistic programming in Python using PyMC3. PeerJ Computer Science 2:e55.

Smyl, S. 2020. A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting. International Journal of Forecasting 36(1):75–85.

Solin, A., and Sarkka, S. 2014. Explicit link between periodic covariance functions and state space models. In Artificial Intelligence and Statistics, 904–912.

Taylor, S. J., and Letham, B. 2018. Forecasting at scale. The American Statistician 72(1):37–45.

Vehtari, A.; Mononen, T.; Tolvanen, V.; Sivula, T.; and Winther, O. 2016. Bayesian leave-one-out cross-validation approximations for Gaussian latent variable models. The Journal of Machine Learning Research 17(1):3581–3618.

Wang, K.; Pleiss, G.; Gardner, J.; Tyree, S.; Weinberger, K. Q.; and Wilson, A. G. 2019. Exact Gaussian processes on a million data points. In Advances in Neural Information Processing Systems, 14648–14659.

Wilson, A., and Adams, R. 2013. Gaussian process kernels for pattern discovery and extrapolation. In Proc. International Conference on Machine Learning, 1067–1075.

Wu, J.; Poloczek, M.; Wilson, A. G.; and Frazier, P. 2017. Bayesian optimization with gradients. In Advances in Neural Information Processing Systems, 5267–5278.
