Neurocomputing 74 (2011) 2874–2885
doi:10.1016/j.neucom.2011.03.039

A SOM-based hybrid linear-neural model for short-term load forecasting

Vineet Yadav a, Dipti Srinivasan b

a National University of Singapore, #16-01, 27 Paya Lebar Road, Singapore 409042, Singapore
b Department of Electrical and Computer Engineering, National University of Singapore, Block E4-06-09, Engineering Drive 3, Singapore 117576, Singapore

Article info

Article history: Received 16 April 2010; received in revised form 2 February 2011; accepted 24 March 2011; available online 25 May 2011. Communicated by F. Rossi.

Keywords: Self-organizing map (SOM); Feedforward neural network; Ho–Kashyap algorithm; Short-term load forecasting


Abstract

In this paper, a short-term load forecasting method is considered, which is based upon a flexible smooth transition autoregressive (STAR) model. The described model is a linear model with time-varying coefficients, which are the outputs of a single hidden layer feedforward neural network. The hidden layer is responsible for partitioning the input space into multiple sub-spaces through multivariate thresholds and smooth transitions between the sub-spaces. In this paper, we propose a new method to smartly initialize the weights of the hidden layer of the neural network before its training. A self-organizing map (SOM) network is applied to split the historical data dynamics into clusters, and the Ho–Kashyap algorithm is then used to obtain the equations of the separating planes. Applied to the electricity markets, the proposed method is better able to model the smooth transitions between the different regimes, which are present in the load demand series because of market effects and season effects. We use data from three electricity markets to compare the prediction accuracy of the proposed method with traditional benchmarks and other recent models, and find our results to be competitive.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

Deregulation and free competition of the electric power industry have raised many challenging issues. Accurate electricity load forecasting has become an essential task for effectively managing power systems, because important operating decisions such as scheduling of power generation, scheduling of fuel purchasing, maintenance scheduling and planning for energy transactions depend upon the electricity load forecasts. In the deregulated power market, all the market players, including generation companies, transmission companies, independent system operators (ISOs) and regional transmission organizations (RTOs), perform load forecasting continuously, as they plan, negotiate and operate based upon the available forecasts. Furthermore, an accurate price forecast is not possible without an accurate load forecast, and a price forecast error can have serious implications for the profit and market share of a company. Though a comfortable state of performance has been achieved for electricity load forecasting, market players will always bring in new dynamic bidding strategies, which, coupled with price-dependent load, shall introduce new variability and non-stationarity into the electricity load series. Hence, the modern power system will always require more advanced and more accurate forecasting tools.

Electricity load forecasting methods and, more generally, time series prediction methods can be broadly divided into two categories: statistical methods and artificial neural network (ANN) methods.

Statistical methods follow a model-driven approach, which means that they attempt to build an exact model of the system. These methods include linear regression methods, moving average and exponential smoothing methods, autoregressive moving average (ARMA) models, Box–Jenkins methods and Kalman filtering methods [1–5]. A major limitation of these methods is that they require some prior knowledge about the relationship between the input and the output. They are also of limited appeal because they are based on linear analysis, whereas the load series that they try to model is a non-linear function of the independent variables.

To overcome the limitations of these linear statistical methods, various non-linear time series models have been developed. In econometrics, one class of models that has been studied extensively involves the switching regime models, which assume a finite number of linear regimes. In the threshold autoregressive (TAR) model proposed by Tong [6] and Tong and Lim [7], the movement between two linear autoregressive models is governed by an observable variable, called the threshold variable. In [8], a TAR model with multiple thresholds is developed for load forecasting. The optimum number of thresholds is the one that minimizes the sum of threshold variances, and the threshold variable is a lagged value of the load demand series. A generalization of the TAR model is the smooth transition autoregressive (STAR) model, which was initially proposed in [9] and further developed in [10] and [11]. This model allows the transition between the two regimes to be smooth, so there can be a continuum of states between the two extreme regimes. A modified STAR model for load forecasting is proposed in [12], where temperature plays the role of the threshold variable. Such switching regime models have also been used for electricity price forecasting [13,14].

In recent years, artificial neural networks have become a popular tool for time series forecasting [15–17]. With an artificial neural network, one does not attempt to create an explicit model of the underlying physical system. Rather, one learns the mapping between the inputs and the output using the historical data. However, a multilayer perceptron network is usually an opaque system, which is difficult to interpret. Similarly, it is difficult to interpret the significance of the hidden layer weights, and to decide how to initialize the different layers' weights before training.

In this paper, we consider the neuro-coefficient smooth transition autoregressive (NCSTAR) model, which was first proposed in [18], and further developed in [19] and [20]. This formulation combines ideas from threshold autoregressive models and from artificial neural networks. The coefficients of the linear model are the outputs of a single hidden layer feedforward neural network. The hidden layer neurons and their logistic activation functions are responsible for partitioning the input space into multiple sub-spaces while still allowing smooth transition between the sub-spaces. This NCSTAR model is particularly suitable for load forecasting because load demand depends heavily on season effects, and seasons do not change abruptly. Rather, the daily load profile changes gradually from one season to another.

The problem of the initialization of the weights and biases of a multilayer perceptron has been extensively studied in the literature [21,22]. Weight initialization influences the probability of successful convergence, the speed of convergence and the generalization capability of the network. In [19], a procedure to choose the initial parameters of the hidden layer is detailed, which initializes the hyperplanes between the sub-spaces as parallel to each other, and oriented in the direction perpendicular to the maximum variance of the input variables. There is no reason to believe that these initial weights lie close to a good solution. In this paper, a new method for initializing the weights is presented. First, a self-organizing map (SOM) network is applied to split the historical data dynamics into clusters. Then the Ho–Kashyap algorithm is used to obtain the equations of the hyperplanes separating the clusters. These equations can then be used to smartly initialize the weights and biases of the hidden layer of the network.

The paper is organized as follows. In Section 2, we discuss the stylized features exhibited by the electricity load demand series. Section 3 describes the hybrid linear-neural model as proposed in [19]. Section 4 describes the proposed weight initialization method and why it is required. Section 5 evaluates the forecasting performance on actual data sets and Section 6 concludes the article.

2. Electricity load demand series: stylized facts

In order to develop an appropriate load forecasting model, we need to examine the main features of the load demand series. Fig. 1 shows the half-hourly electricity demand of England and Wales from 1 July 2005 to 30 June 2006. The key intra-annual features to be noticed are the weekly seasonal cycle, the strong influence of the holiday period around Christmas and, most importantly, the weather-sensitive part of the load caused by the changing seasons. More relevant to this paper is the intuitive fact that the transition between different seasons is rather smooth, taking place over several days.

Fig. 1. Electricity demand in England and Wales from July 1, 2005, to June 30, 2006.

It is generally assumed that there are four seasons in Britain. These are spring (March–May), summer (June–August), autumn (September–November) and winter (December–February). Fig. 2 shows the average daily demand patterns by season. Not only is the average load demand different between the four seasons, but the shape of the load demand pattern also varies between seasons. For example, the evening peak hourly load between 5 and 7 p.m. is distinctly visible in autumn and winter, while it is clearly missing in spring and summer. Similarly, we can see a jump in load demand in the early morning hours from midnight to 5 a.m., which is more distinct in autumn and winter compared to spring and summer. This heating load jump is because of the "Economy 7" cheaper-rate units for off-peak overnight hours in Britain. To reiterate a previous point, while the daily load pattern has been shown here by four different curves representative of the four seasons, the transition from one curve to another is a gradual and smooth one.

Fig. 2. Average daily demand patterns by season for England and Wales. N.B. Winter = December–February, Spring = March–May, Summer = June–August, Autumn = September–November.

Fig. 3. Effect of a public holiday on the load demand.

Hence market effects and season effects lead to multiple regimes being present in the load time series. These multiple regimes can be thought of as occupying different sub-spaces in the input space. The transition between any two regimes is a smooth transition, though the degree of smoothness may vary between regimes. As explained in the next section, the weights of the hidden layer of the neural network represent the hyperplanes separating the sub-spaces, and the biases are responsible for the smoothness of the transition. Thus, highly accurate learning and prediction can be expected not only for test samples that clearly represent a particular regime, but also for test samples that represent the transition from one regime to another.

Fig. 3 shows the effects on the load series when a special day occurs, in this case the summer bank holiday in Britain on 29 August 2005, which was a Monday. It can be seen that while the shape of the daily load pattern is quite different between weekends and regular weekdays, there is not much difference in the daily load pattern and the average load demand between weekends and special days. Thus, in the absence of a model dedicated to the treatment of special days such as single-day holidays, it is not too unreasonable to replace their load prediction with that of the closest weekend load demand. For special days such as longer holiday periods, e.g. the Christmas period, the load can be replaced by the mean of the data from the same day of the week and the same hour, one week before and one week after. In this paper, both these methods have been used, depending upon the size of the prediction horizon and the special days involved.
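To make the two substitution rules concrete, the following minimal Python sketch shows both of them; the array layout, index conventions and function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Illustrative sketch of the two special-day substitutions described above;
# `load` is assumed to be a 1-D hourly series and all indices are hour offsets.

def weekend_substitute(load, weekend_start, hours_per_day=24):
    """Single-day holiday: reuse the profile of the closest weekend day."""
    return load[weekend_start:weekend_start + hours_per_day].copy()

def week_neighbour_mean(load, day_start, hours_per_day=24, week=168):
    """Longer holiday periods: mean of the same day of the week and the same
    hour, one week before and one week after."""
    before = load[day_start - week:day_start - week + hours_per_day]
    after = load[day_start + week:day_start + week + hours_per_day]
    return (before + after) / 2.0
```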

The features mentioned here are by no means unique to the England and Wales demand series. These features of multiple levels of seasonality, special days and non-linear effects of weather variables play out differently in the different power markets of the world, and the observed series ultimately display a great variety of profiles.

3. Hybrid linear-neural model and weight initialization

3.1. TAR model and STAR model

Before moving on to the NCSTAR model, a short mathematical formulation of the two models that it aims to generalize is given.

The threshold autoregressive (TAR) model, which allows for a locally linear approximation over a number of states, can be formulated as

$$
y_t=\begin{cases}
\varphi_0^{(1)}+\sum_{i=1}^{p}\varphi_i^{(1)}y_{t-i}+\varepsilon_t^{(1)} & \text{if } s_{t-k}<r\\
\varphi_0^{(2)}+\sum_{i=1}^{p}\varphi_i^{(2)}y_{t-i}+\varepsilon_t^{(2)} & \text{if } s_{t-k}\ge r
\end{cases}
\tag{1}
$$

where $\varepsilon_t$ is an iid error, the $\varphi_i$ are real coefficients, and $s_{t-k}$ is the state-determining threshold variable, whose value relative to the threshold $r$ determines which of the linear autoregressive models is activated. The integer $k$ determines with how many lags the threshold variable influences the regime at time $t$. If $s_{t-k}$ is replaced by a lagged value $y_{t-k}$ of the series, then the model is referred to as a self-exciting threshold autoregressive (SETAR) model.

The smooth transition autoregressive (STAR) model, which is a generalization of the TAR model, can be expressed as

$$
y_t=\Bigl(\varphi_0^{(1)}+\sum_{i=1}^{p}\varphi_i^{(1)}y_{t-i}\Bigr)\bigl(1-F[\Omega(s_{t-k}-c)]\bigr)
+\Bigl(\varphi_0^{(2)}+\sum_{i=1}^{p}\varphi_i^{(2)}y_{t-i}\Bigr)F[\Omega(s_{t-k}-c)]
\tag{2}
$$

where $F$ is a transition function, which is continuous and bounded between 0 and 1. The parameter $\Omega$ is responsible for the smoothness of this function. From this formulation, one can interpret the STAR model as a regime-switching model that allows for two regimes, associated with the extreme values $F=0$ and $F=1$ of the transition function, with a smooth transition between the two regimes.
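As an illustration of Eq. (2), a minimal Python sketch of a one-step prediction with a two-regime, self-exciting STAR model (using the logistic function for $F$) is given below; the function and parameter names are illustrative, not the authors' code.

```python
import numpy as np

def logistic(u):
    return 1.0 / (1.0 + np.exp(-u))

def star_predict(y, phi1, phi2, omega, c, k):
    """One-step STAR prediction following Eq. (2).

    y          : past observations, y[-1] is y_{t-1}
    phi1, phi2 : AR coefficients [phi_0, phi_1, ..., phi_p] of the two regimes
    omega      : smoothness parameter, c : threshold location
    k          : delay of the self-exciting threshold variable s_{t-k} = y_{t-k}
    """
    p = len(phi1) - 1
    lags = y[-p:][::-1]                   # [y_{t-1}, ..., y_{t-p}]
    ar1 = phi1[0] + phi1[1:] @ lags       # regime-1 AR model
    ar2 = phi2[0] + phi2[1:] @ lags       # regime-2 AR model
    F = logistic(omega * (y[-k] - c))     # transition weight in [0, 1]
    return (1.0 - F) * ar1 + F * ar2
```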

3.2. NCSTAR model

The neuro-coefficient smooth transition autoregressive (NCSTAR) model, as proposed in [19,20], is briefly described here.

The STAR model has certain limitations. As can be seen from (2), $y_t$ is a weighted average of two AR models, where the weights of the two models are determined by the transition function $F$. Hence the STAR model is unable to accommodate more than two regimes, given any form of the function $F$. Similarly, this model is unable to handle the case where the regimes are determined by a combination of different variables.

The NCSTAR model aims at partitioning the $n$-dimensional space and allowing smooth transitions between the regimes. It is a linear model whose coefficients are given by the output of a single hidden layer neural network. Hence

$$
y_t=\Phi_t' z_t+\varepsilon_t
\tag{3}
$$

where $\Phi_t$ is the parameter vector (the output of the neural network), $z_t=[1,\tilde z_t']^{T}$, and $\tilde z_t$ is the $p$-dimensional vector of explanatory variables formed by lagged values of the time series $y_t$.

The coefficients $\Phi_t^{(j)}$ are the output of a neural network whose architecture is shown in Fig. 4. So

$$
\Phi_t^{(j)}=\sum_{i=1}^{h}\lambda_{ji}\,f_{\omega_i,\beta_i}(x_t)+\gamma_j,\qquad j=0,1,\ldots,p
\tag{4}
$$

where

$$
f_{\omega,\beta}(x)=\frac{1}{1+\exp(-\omega' x+\beta)}
\tag{5}
$$

is the output of a neuron of the hidden layer. Here $x$ is the $n$-dimensional input vector, $\omega=[\omega_1,\ldots,\omega_n]$ is the vector of synaptic weights arriving at the hidden neuron in consideration, $\beta$ is the offset (bias) of the same neuron, $\lambda_{ji}$ is the synaptic weight between the $i$th hidden and $j$th output neurons, and $\gamma_j$ is the offset (bias) of the $j$th output neuron.

Fig. 4. Architectural graph of the neural network of the NCSTAR model. Note that the input $x$ is $n$-dimensional, there are $h$ hidden neurons and the output parameter vector $\Phi$ is $(p+1)$-dimensional.

Fig. 5. Sample autocorrelation function for New England hourly data.

Summarizing the role of the hidden layer: when $\omega' x=\beta$, the parameters $\omega$ and $\beta$ define a hyperplane in the $n$-dimensional space of $x$. The non-linear sigmoidal transformation is responsible for creating the smooth transition. With $h$ hidden neurons, $h$ hyperplanes are created, which split the $n$-dimensional space into several polyhedral regions.

Finally, (3) and (4) can be combined to give a more compact form:

$$
y_t=\gamma' z_t+\sum_{i=1}^{h}\lambda_i' z_t\,f_{\omega_i,\beta_i}(x_t)+\varepsilon_t
\tag{6}
$$

This form shows that $y_t$ can be considered as a weighted average of $h+1$ AR models, one AR model for each hidden neuron, where the weight of each AR model depends upon the location of $x_t$ with respect to the hyperplane defined by the weights and bias of that hidden neuron.

3.3. Model identification

The empirical modeling cycle, as first introduced by Box et al. [4], is followed. This includes the three steps of model identification, model estimation and finally model validation. Depending upon the outcome of model validation, the first two steps might need to be done again.

Consider again (6), the representative equation of the NCSTAR model. The output $y_t$ is a function of $z_t$ and $x_t$, which are $p$- and $n$-dimensional vectors of explanatory variables formed by lagged values of $y_t$. The aim of model identification is to identify the sizes of $p$ and $n$, determine which lags to consider for $z_t$ and $x_t$, and also determine $h$, the number of hidden neurons.

The input vector $x$ is formed by the lagged values of the time series $y_t$ for lags 1–24 for hourly load forecasting. It is important to include all the hours of the past day in $x$ because we are trying to group the historical data by their daily load patterns, where the patterns may form due to season effects or market effects.

What lags should be considered while forming $z_t$ is another important question. The procedure is the same as if we wanted to obtain the relevant lags in a regular autoregressive model. The autocorrelation function is plotted to see which lags are most correlated, and then the Akaike information criterion (AIC) or the Bayesian information criterion (BIC) is used to confirm. Consider the example of the New England hourly load data from 01/01/2005 to 31/12/2007. In Fig. 5, the autocorrelation function of this data is shown up to 200 lags, i.e. slightly above a week.

From the figure, it can be seen that the load at the hour of prediction, $L_h$, is most highly correlated with the loads at the previous two hours, $L_{h-1}$ and $L_{h-2}$. $L_h$ is also highly correlated with the loads $L_{h-24}$, $L_{h-48}$, $L_{h-72}$ and $L_{h-168}$. But amongst these peaks for the past seven days, the maximum correlation is with $L_{h-168}$, i.e. the load one week back. Now there is a choice to be made. A larger number of lags could be selected to form $z_t$, which might carry more information to the model, but this comes at the risk of a complex model and of over-fitting. The Akaike information criterion, when used to find the most adequate lags, suggests that lags 1, 24 and 168 can be chosen here for $z_t$.
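The lag-comparison step can be mimicked with a small OLS-based AIC computation. The sketch below is a rough stand-in for the procedure described above, assuming the common AIC form based on the residual sum of squares of an OLS autoregression; it is illustrative, not the authors' exact procedure.

```python
import numpy as np

def aic_for_lags(y, lags):
    """AIC of an OLS autoregression of y_t on the given set of lags."""
    y = np.asarray(y, dtype=float)
    max_lag = max(lags)
    T = len(y) - max_lag
    # Design matrix: intercept plus one column per lagged value y_{t-l}.
    X = np.column_stack([np.ones(T)] + [y[max_lag - l:len(y) - l] for l in lags])
    target = y[max_lag:]
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    rss = np.sum((target - X @ coef) ** 2)
    return T * np.log(rss / T) + 2 * (len(lags) + 1)

# Compare candidate lag sets, e.g. the ones suggested by the ACF peaks:
# aic_for_lags(load, [1, 24, 168]) vs. aic_for_lags(load, [1, 2, 24, 48, 168])
```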

Finally, $h$, the number of hidden neurons, needs to be determined. It can be considered the most important design parameter with respect to the approximation capabilities of a neural network [23], as it is closely related to the challenge of the bias/variance dilemma [24]. There is significant literature on determining this number, with [25–27] being recently proposed approaches, but for the purpose of this paper, the relatively popular approaches of cross-validation and early stopping [28,29] were chosen.

3.4. Model estimation

In this stage, the aim is to estimate the values of the parameters $\gamma$, $\lambda_i$, $\omega_i$ and $\beta_i$ using the training data. The cost function to be minimized is the sum of the squared errors over all the training samples, given by

$$
C=\frac{1}{2}\sum_{t=1}^{T}(y_t-\hat y_t)^2
\tag{7}
$$

where $T$ is the total number of training samples and $\hat y_t$ is the estimated value.

While the conventional back-propagation method can be used for training the network using the derivatives of the cost function $C$ with respect to $\gamma$, $\lambda_i$, $\omega_i$ and $\beta_i$, this paper uses a second method, the concentrated least squares (CLS) method, proposed by [30] and also used by [19]. The observation that the CLS method makes is that $y_t$ is linear with respect to $\gamma$ and $\lambda_i$, and non-linear with respect to $\omega_i$ and $\beta_i$. So an ordinary least squares (OLS) estimator can be used to train $\gamma$ and $\lambda_i$, and a non-linear approach similar to back-propagation is used to train $\omega_i$ and $\beta_i$. Using the CLS method reduces the dimensionality of the iterative estimation problem considerably, and thus reduces the computational burden too.
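A minimal sketch of the linear half of the CLS step follows: with $\omega_i$ and $\beta_i$ held fixed, model (6) is linear in $\gamma$ and the $\lambda_i$, so those can be obtained in a single least-squares solve. The matrix shapes are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def cls_linear_step(Z, X, omega, beta, y):
    """Concentrated least squares: solve for gamma and lambda by OLS,
    given fixed non-linear parameters (omega, beta).

    Z : (T, p+1) regressors z_t;  X : (T, n) transition vectors x_t;  y : (T,)
    omega : (h, n) hidden weights; beta : (h,) hidden biases
    """
    F = 1.0 / (1.0 + np.exp(-(X @ omega.T) + beta))   # (T, h) neuron outputs, Eq. (5)
    # Design matrix: [z_t, f_1(x_t) z_t, ..., f_h(x_t) z_t] per Eq. (6)
    blocks = [Z] + [F[:, i:i + 1] * Z for i in range(F.shape[1])]
    D = np.hstack(blocks)
    theta, *_ = np.linalg.lstsq(D, y, rcond=None)
    gamma = theta[:Z.shape[1]]
    lam = theta[Z.shape[1]:].reshape(F.shape[1], Z.shape[1])
    return gamma, lam
```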


3.5. Model validation

In this stage, the quality of the model that has been identified and estimated is assessed. Residual analysis is a popular way of performing model validation. If the model is correctly specified and the parameter estimates are reasonably close to the true values, then the residuals should show the properties of white noise. They should behave like independent, identically distributed normal variables with zero mean and a common standard deviation.

Two approaches are used in this paper for residual analysis. First, the histogram of the error residuals can be plotted and compared visually with the normal distribution. Secondly, to test for normality, the Kolmogorov–Smirnov test, the Lilliefors test and the Jarque–Bera test are popular tests with Matlab implementations available.

3.6. Initial conditions

In [19], the initial synaptic weights for the hidden neurons are chosen such that the hyperplanes created are all parallel to each other and oriented in the direction perpendicular to the maximum variance of the input variables. The offsets of the hidden layer neurons, which determine the position of the hyperplanes with respect to the origin, are chosen such that the hyperplanes are located at equal distances from each other on both sides of the mean of the data samples. No justification is provided as to why this might be a good choice for the weight vectors.

In [20], the authors have proposed a different method to initialize the network parameters. The slope parameter is searched over a grid of possible values, and random sampling is done over a uniform distribution to get initial values for the normalized synaptic weights. Next, the value of the log-likelihood is calculated. This whole process is repeated a certain number of times, and those initial values are chosen which maximize the log-likelihood. Essentially this is a random search of the log-likelihood landscape for a maximum possible value.

The most significant contribution of this paper is a method to initialize the weight vectors of the NCSTAR model. The neural network literature contains a huge amount of work dealing with initialization, ranging from simple methods such as random weight initialization to methods involving extensive statistical and/or geometrical analysis of the data before satisfactory results can be obtained. Ref. [31] is a recent comparison of popular weight initialization methods. Generally, it is advisable to fully understand the statistical or geometrical features of the function in question before choosing an approach. In this paper, we chose the geometrical analysis method, as the NCSTAR model description lends itself easily to a geometrical explanation. The role of the hidden layer as a separating hyperplane has already been detailed in the previous section. What this paper proposes is to gather information about the structure in the electricity load data using SOM-based clustering, and then pass this information on to the NCSTAR model as a starting point through the weight vector initialization. If the initial weights are close to a good solution, the training will be much faster and the possibility of obtaining adequate convergence increases.

4. SOM-based approach to weight vector initialization

4.1. Motivation

Much work has been done on the importance of weight initialization for obtaining a properly trained network. Weight initialization influences the probability of successful convergence, the speed of convergence and the generalization capability of the network [22]. But the optimal values of the weights are generally not known a priori, as they are mainly dependent on the data set used. In [32], it was shown that performing a global search for the optimal values of the weights might not be feasible, because small changes in the initial weights might change the convergence behavior of the network significantly due to chaos in the network dynamics. Hence it is important to start the training only with a good approximation of the optimal initial values of the parameters, or, as [32] puts it, to start the learning process in the "eye of the storm". This is exactly what our proposed initialization method sets out to do.

Before training the NCSTAR model, a clustering algorithm is used to form the regimes, and this information is then passed on to the NCSTAR model through its weight initialization. The self-organizing map is a popular neural network method for clustering. Once the clusters are formed, a method is required which gives the separating hyperplanes if the clusters are linearly separable, and converges to a decent solution if the clusters are linearly non-separable. The Ho–Kashyap algorithm is chosen for this task.

This new method of weight initialization makes the NCSTAR model more robust to initial conditions. This is because a SOM network, which is the first step of the proposed weight initialization, is very robust with respect to its own weight vector initialization [33].

4.2. Recent SOM-based approaches to forecasting

Traditionally, supervised learning methods such as feedforward multilayer perceptrons and radial basis function networks have been more popular for time series forecasting than unsupervised learning methods. A plausible explanation is that while time series prediction is considered a function approximation problem, the SOM has usually been seen as an architecture suitable for vector quantization, clustering and visualization [34]. Some noteworthy SOM-based approaches to forecasting are discussed next.

In [35,36], the Double Vector Quantization method is proposed, which is specifically aimed at obtaining long-term trends. Two SOM networks are trained: one to cluster the regressors $x_{in}(t)$, and a second one to cluster the associated deformations $\Delta x_{in}(t)=x_{in}(t+1)-x_{in}(t)$. This double quantization of regressors and deformations only gives a static characterization of the past dynamics of the time series. To perform the forecasting, more information is needed than the two sets of prototypes. This is obtained from the associations between the deformations $\Delta x_{in}(t)$ and the corresponding regressors $x_{in}(t)$, which provide useful dynamical information about how the series might evolve between a regressor and the next point. This information is then stochastically modeled in a transition probability matrix. Once the two SOM networks have been trained and the transition matrix obtained, a one-step-ahead prediction can be obtained by finding the closest prototype to the current regressor vector, and then using conditional probability to determine the deformation vector. The one-step-ahead prediction is simply the sum of the two. For a multi-step forecast, the same step is repeated after inserting each new one-step-ahead estimate as the regressor vector for the next iteration of the algorithm.

The above Double Vector Quantization approach is not designed to determine a precise estimate for time $t+1$, but is more specifically devoted to the problem of long-term evolution, which can only be obtained in terms of trends. When dealing with short-term load forecasting (STLF), we are also interested in a precise estimate for time $t+1$. Using Double Vector Quantization to reduce the short-term error would require a large number of neurons in the SOM networks of regressors and deformations. This approach does not provide any kind of interpolation method to alleviate the limitation. The proposed NCSTAR approach, on the other hand, has an implicit interpolation method, because it allows for smooth transition between different clusters through the use of the logistic activation function.

Ref. [37] is a relatively new approach to handle forecasting using SOMs. Instead of using $q$ time-invariant local linear models (an autoregressive model for each cluster of data), the proposed KSOM algorithm works with a single time-variant linear AR model. The coefficient vector is recomputed at every time step $t$, directly from a subset of $K$ ($K\ll q$) winning weight vectors, i.e. the $K$ neurons whose weight vectors are the closest to the input vector. According to the authors, the proposed approach handles non-stationary time series better, because only a few neurons are used to build the predictor for each input vector.

Though KSOM is an interesting approach, it is essentially a vector quantization algorithm, similar to Double Vector Quantization. Any vector quantization algorithm would suffer from high prediction errors when approximating continuous mappings unless a huge number of neurons is used. No attempt at interpolation is made here either.

4.3. First stage: clustering via SOM

The SOM algorithm, first introduced by Kohonen in [33], is one of the most popular ANN models based on the unsupervised competitive learning paradigm. Learning from past examples, the SOM creates a mapping from a continuous high-dimensional input space $\mathcal{X}$ onto a discretized low-dimensional output space $\mathcal{A}$. This discrete output space $\mathcal{A}$ consists of $q$ neurons, which are arranged according to some fixed topology, e.g. a two-dimensional rectangular or hexagonal grid. The mapping $c(x):\mathcal{X}\to\mathcal{A}$ is defined by the weight vectors $W=(w_1,w_2,\ldots,w_q)$, and it assigns to an input vector $x(t)$ a neuron index

$$
i^*(t)=\arg\min_{i}\,\lVert x(t)-w_i(t)\rVert
\tag{8}
$$

where $\lVert\cdot\rVert$ refers to the Euclidean distance and $t$ is the discrete current iteration. It is important to note that the weight vectors have the same dimensionality as the input patterns.

A competitive-cooperative learning rule is used to train the weight vectors. When an input vector is presented to the network, the weight vectors of the winning neuron and its neighbors are updated as

$$
w_i(t+1)=w_i(t)+\alpha(t)\,h(i^*,i;t)\,[x(t)-w_i(t)]
\tag{9}
$$

so the weight vectors of the adapted neurons are moved a little towards the input vector. The amount of movement is controlled by the learning rate $\alpha$, which decreases exponentially with time. The number of neurons affected by this adaptation is determined by a neighborhood function $h$. Typically the neighborhood function is unimodal, symmetric and monotonically decreasing with increasing distance to the winner. A popular choice is the Gaussian function

$$
h(i^*,i;t)=\exp\!\left(-\frac{\lVert r_i(t)-r_{i^*}(t)\rVert^{2}}{2\sigma^{2}(t)}\right)
\tag{10}
$$

where $\lVert r_i(t)-r_{i^*}(t)\rVert$ is the distance between neurons $i$ and $i^*$ in the discrete output space $\mathcal{A}$, and $\sigma(t)$ is the radius of the neighborhood function at time $t$, which decreases exponentially to ensure the reduction of the neighborhood size during training.

The neighborhood function is selected to cover a large area of the output space $\mathcal{A}$ at the beginning of the learning, and it is gradually reduced such that towards the end of the process, only the winner is adapted. The map is said to have converged when the global ordering of the weight vectors achieves a steady state. An important feature of the resulting map is the preservation of neighborhood relations, i.e. nearby data vectors in the input space are mapped onto neighboring neurons in the output space. Due to this topology-preserving property, the low-dimensional output space is able to show the structure hidden in the high-dimensional data, such as clusters and spatial relationships [38].

For the hourly load forecasting problem, the input to the SOM network is the 24-dimensional vector $x_t$ from (4) and (5). The neurons are arranged in a one-dimensional lattice. The training is carried out in two phases: an ordering phase and a convergence phase. The initial values of the learning rate and the size of the neighborhood function, and their rates of decay, are set up as proposed in [39].

Being an unsupervised network, the SOM is able to split the historical data set into several subsets, where each subset has its own unique characteristics, such as a particular season effect or market effect. Next, the hyperplanes separating these subsets have to be obtained.
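A minimal one-dimensional SOM implementing Eqs. (8)–(10) might look as follows. The default parameters and the exponential decay schedules here are illustrative placeholders; the paper sets them up as proposed in [39].

```python
import numpy as np

def train_som_1d(data, q=10, iters=5000, a0=0.5, s0=None, seed=0):
    """Minimal 1-D SOM with exponentially decaying learning rate and
    neighborhood radius, per Eqs. (8)-(10).

    data : (T, n) normalized input vectors (here n = 24 hourly loads)
    q    : number of neurons on the 1-D lattice
    """
    rng = np.random.default_rng(seed)
    T, n = data.shape
    W = rng.random((q, n))                       # prototype (weight) vectors
    s0 = s0 if s0 is not None else q / 2.0
    grid = np.arange(q, dtype=float)             # 1-D lattice positions r_i
    for t in range(iters):
        x = data[rng.integers(T)]
        winner = np.argmin(np.linalg.norm(W - x, axis=1))   # Eq. (8)
        a = a0 * np.exp(-t / iters)                          # learning rate alpha(t)
        s = s0 * np.exp(-t / iters)                          # radius sigma(t)
        h = np.exp(-(grid - winner) ** 2 / (2 * s ** 2))     # Eq. (10)
        W += a * h[:, None] * (x - W)                        # Eq. (9)
    return W
```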

4.4. Second stage: Ho–Kashyap procedure

The purpose of this stage is to find the separating hyperplanes, i.e. linear classifiers, which separate the clusters obtained in the previous stage. As explained earlier, these hyperplanes are then used to initialize the weights and biases between the input layer and the hidden layer. The normal of each plane and its distance from the origin are used to initialize the weight vector $\omega$ and the bias $\beta$, respectively, in (5). The values of $\omega$ and $\beta$ change further during the neural network training stage to accommodate the smooth transition, i.e. the sigmoidal non-linearity.

Consider a binary classification problem first, as multi-class classification is a straightforward and natural generalization from the 2-class to the $c$-class ($c>2$) case. Consider a set of training samples $\{x_i\}_{1\le i\le k}$, where $x_i\in\mathbb{R}^n$ belongs to either one of two classes $c_1$ or $c_2$. The goal is to determine the hyperplane defined by $g(x)=0$ such that all the training samples are correctly separated, i.e.

$$
g(x)=w^{T}x+w_0
\begin{cases}
>0 & \text{if } x\in c_1\\
<0 & \text{if } x\in c_2
\end{cases}
\tag{11}
$$

Making some notational changes for convenience:

$$
a=\begin{bmatrix}w_0\\ w\end{bmatrix}
\quad\text{and}\quad
y_i=\begin{bmatrix}1\\ x_i\end{bmatrix}
\tag{12}
$$

So now (11) can be rewritten as

$$
a^{T}y
\begin{cases}
>0 & \text{if } y\in c_1\\
<0 & \text{if } y\in c_2
\end{cases}
\tag{13}
$$

This can be further simplified by replacing each $y_i\in c_2$ by $-y_i$, so that we have a simplified set of constraints:

$$
a^{T}y_i>0,\qquad i=1,2,\ldots,k
\tag{14}
$$

Two popular procedures to obtain linear classifiers are the perceptron rule and the minimum squared error (MSE) rule. The perceptron rule is basically a gradient descent method applied to the perceptron criterion function, which is the sum of the distances from the misclassified samples to the decision boundary. While it can be shown that it converges if the classes are linearly separable, the method produces an infinite sequence of vectors in the non-separable case. The second method, the MSE rule, attempts to minimize the sum-of-squared-error criterion function $J_s(a)=\lVert Ya-b\rVert^2$, where

$$
Y=\begin{bmatrix}y_1^{T}\\ \vdots\\ y_k^{T}\end{bmatrix}
\quad\text{and}\quad
b=\begin{bmatrix}b_1\\ \vdots\\ b_k\end{bmatrix}
$$

and the $b_i$ are some arbitrarily specified positive constants, referred to as margins. The MSE solution will depend upon the margin vector $b$. If $b$ is fixed arbitrarily, then the MSE solution might not yield a separating vector even in the linearly separable case. But hopefully we will get a useful discriminant in both the separable and the non-separable case.

The Ho–Kashyap procedure is a linear classifier which combines perceptron and MSE classification. This method ensures that in the separable case a separating hyperplane is computed, and in the non-separable case an MSE-optimal solution is found. The basic idea behind the method is that if the samples are linearly separable, and if both $a$ and $b$ are allowed to vary in the criterion function $J_s(a,b)=\lVert Ya-b\rVert^2$, subject to the constraint $b>0$, then the minimum value of $J_s(a,b)$ is zero. The separating vector is simply the vector $a$ which achieves this minimum.

So we need to alternate between two steps until convergence.

First, fix $b$ and minimize $J_s(a,b)$ with respect to $a$. This is done using the pseudo-inverse approach:

$$
\nabla_a J_s=2Y^{T}(Ya-b)=0,\quad\text{or}\quad a=(Y^{T}Y)^{-1}Y^{T}b
\tag{15}
$$

Then, fix $a$ and minimize $J_s(a,b)$ with respect to $b$:

$$
b_{k+1}=b_k-\rho\,\nabla_b J_s=b_k+2\rho\,(Ya_k-b_k),\qquad \rho>0
\tag{16}
$$

From (16), if some elements of $Ya_k-b_k$ are less than zero, then the corresponding elements of $b_{k+1}$ will be reduced compared to $b_k$. To ensure that this does not happen, define

$$
e_k^{+}=\tfrac{1}{2}\bigl(e_k+\lvert e_k\rvert\bigr)\quad\text{where}\quad e_k=Ya_k-b_k
\tag{17}
$$

So finally the Ho–Kashyap algorithm can be summarized as

$$
b_1>0\ \text{arbitrary},\qquad a_k=(Y^{T}Y)^{-1}Y^{T}b_k,\qquad b_{k+1}=b_k+2\rho\,e_k^{+}
\tag{18}
$$

Examining the convergence properties of (16), it can be shown that:

- a separating vector can be found in a finite number of steps if the problem at hand is separable, and
- evidence of non-separability can be found if the problem is non-separable.
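A direct transcription of the iteration (15)–(18) into Python is straightforward; the stopping rule below is a simple illustrative choice, not the authors' exact criterion.

```python
import numpy as np

def ho_kashyap(Y, rho=0.5, iters=1000, tol=1e-8):
    """Ho-Kashyap iteration as summarized in Eq. (18).

    Y : (k, n+1) matrix of sign-adjusted augmented samples y_i (rows).
    Returns the weight vector a and the margin vector b.
    """
    k = Y.shape[0]
    b = np.ones(k)                            # b_1 > 0, arbitrary
    Y_pinv = np.linalg.pinv(Y)                # equals (Y'Y)^{-1} Y' for full rank
    a = Y_pinv @ b
    for _ in range(iters):
        a = Y_pinv @ b                        # MSE step, Eq. (15)
        e = Y @ a - b                         # error vector e_k
        e_plus = 0.5 * (e + np.abs(e))        # positive part, Eq. (17)
        if np.all(e <= tol):
            # Either e is (numerically) zero, so a separating vector was found,
            # or e has only non-positive components: evidence of non-separability.
            break
        b = b + 2.0 * rho * e_plus            # margin update, Eq. (18)
    return a, b
```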

Since we are concerned with a multi-class classification problem, the above two-class classification approach needs to be modified accordingly. The binary perceptron algorithm can be replaced by its multi-class classification extension using Kessler's construction, as described in [40]; refer to Appendix A for more details. Once the equations of the hyperplanes, i.e. the $a_i$, have been obtained, $w$ and $w_0$ from (12) can be used to initialize the weight vectors and the biases of the hidden layer neurons.

5. Forecast results and comparisons

The proposed method is applied to one-day-ahead forecasting of the hourly load profile of the electricity demand of the Alberta, New South Wales and New England markets, and finally the results are summarized.

These markets have been specifically chosen because the testing period lasts at least one year, and because the seasonality effect is strong for these markets. The strength of the proposed method lies in being better able to model smooth transitions between multiple regimes. Seasonality is a prominent source of multiple regimes in the load demand time series, and in order to accommodate all the seasons at least once, it is necessary to have at least one year's data for testing.

As a main goal of this paper is to demonstrate that the SOM-based initialization of the NCSTAR model is able to provide better accuracy than the original NCSTAR model, it is necessary to carry out the same simulation twice, once using the proposed initialization scheme, and once using the original initialization scheme as proposed in [20]. These will be referred to as the NCSTAR-SOM model and the NCSTAR model, respectively, in this section.

5.1. Alberta market

Five years of publicly available hourly load demand data for the Alberta market [41] in Canada is used. Data from January 1, 2000 to December 31, 2002 is used to train the network, data from January 1, 2003 to December 31, 2003 is used as a validation set, whereas data from January 1, 2004 to December 31, 2004 is used to calculate the out-of-sample accuracy.

Before the training process starts, the input vector $x_{in}(t)$ needs to be normalized between 0 and 1. This is done through

$$
x_{in}^{n}(t)=\frac{x_{in}(t)-\min(x_{in}(t))}{\max(x_{in}(t))-\min(x_{in}(t))}
\tag{19}
$$

This normalization is important for two reasons. First, the SOMs, as we are using them, are based on the Euclidean metric. So without normalization, even if the load profile of the first week of 2001 is similar to that of the first week of 2004, the SOM network might not club them together in the same cluster if the electricity demand has generally increased in 2004. Second, the neural network uses a sigmoid function as the activation function in the hidden layer, whose output has a range of 0–1. The back-propagation training works faster and more accurately if the network input also lies in the same range of 0–1.

The criterion used to compare the performance of the different methods is the mean absolute percentage error (MAPE), which indicates the accuracy of prediction. MAPE is defined as follows:

$$
\text{MAPE}=\frac{100}{n}\sum_{i=1}^{n}\frac{\lvert y_{ai}-y_{fi}\rvert}{y_{ai}}
\tag{20}
$$

where $y_{ai}$ is the actual value, $y_{fi}$ is the forecasted value and $n$ is the sample size of the predicted values.
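For completeness, Eq. (20) translates to a short NumPy function:

```python
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error, Eq. (20)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return 100.0 * np.mean(np.abs(actual - forecast) / actual)
```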

During the training procedure of the neural network, several parameters influence the performance of the proposed model. For example, the number of hidden neurons in the network is an important parameter. Using too many hidden neurons increases the risk of over-fitting, and using too few hidden neurons leads to under-fitting. Both these scenarios are harmful for the generalization capability of the trained network. Similarly, the SOM network, which is used to come up with the initial network weights, depends upon various parameters such as the learning rate $\alpha$ and the neighborhood radius $\sigma$. Several values of these parameters were tried and tested on a separate validation set as described in [4]. As mentioned earlier, three years of data are used for training, and one year of data for validation. Finally, the values of $\alpha$ and $\sigma$ are chosen which minimize the mean squared error over the validation set.

The forecast load published by the Alberta Electric System Operator (AESO) is a reasonable benchmark against which to compare the proposed NCSTAR-SOM model. For the test period of 2004, the MAPEs of AESO, NCSTAR and NCSTAR-SOM are 1.26%, 1.12% and 1.07%, respectively. Tables 1 and 2 show the breakdown of the MAPE results over the days of the week and the months of the year for 2004, respectively. Clearly the NCSTAR-SOM model has been able to significantly improve the forecasting results over both AESO and NCSTAR.

Looking at the monthly MAPE breakdown for the three approaches, we notice that the MAPE is high for two periods: the winter months of December and January and the summer months of June and July. Both these periods correspond to a relatively higher demand. These periods also often include some extremely high demand days because of a sudden heat wave or cold wave on those particular days. As the NCSTAR-SOM model does not incorporate future weather or temperature information, the prediction accuracy deteriorates during these periods. But the lack of meteorological variables in the NCSTAR-SOM model is still justified because, for short lead times like one day ahead, the meteorological variables usually evolve in a very smooth fashion, and the load series can sufficiently capture the change.

We also compare our work against a recent model, PREDICT2-ES, proposed in [42]. Comparisons are made for four weeks of the year 2004, each representing a different season: winter, spring, summer and fall. The results are shown in Table 3.

Table 1
Comparison of prediction results for Alberta for each day of the week (MAPE, %).

Day         AESO   NCSTAR   NCSTAR-SOM
Monday      1.44   1.28     1.21
Tuesday     1.19   1.04     1.12
Wednesday   1.09   0.94     1.06
Thursday    1.03   0.98     0.92
Friday      1.18   1.26     1.01
Saturday    1.29   1.14     1.03
Sunday      1.36   1.23     1.14
Total       1.26   1.12     1.07

Table 2
Comparison of prediction results for Alberta for individual months (MAPE, %).

Month       AESO   NCSTAR   NCSTAR-SOM
January     1.26   1.16     1.23
February    0.90   1.09     1.03
March       1.41   1.32     0.92
April       1.11   1.08     0.94
May         1.20   1.03     0.87
June        1.29   1.08     1.02
July        1.61   1.27     1.36
August      1.12   1.17     1.25
September   1.28   1.09     0.97
October     1.02   0.98     0.99
November    1.13   1.01     0.91
December    1.36   1.11     1.36

Table 3
Comparison of prediction results for Alberta (MAPE, %).

Test period             ARIMA   ANN     PREDICT2-ES   AESO    NCSTAR   NCSTAR-SOM
2/16/2004–2/22/2004     1.440   2.130   0.945         0.877   0.894    0.869
5/11/2004–5/17/2004     1.070   1.100   0.812         0.832   0.846    0.834
8/16/2004–8/22/2004     2.540   2.130   1.272         0.986   1.203    1.175
10/25/2004–10/31/2004   1.500   0.820   0.745         0.727   0.831    0.663

- ARIMA refers to the autoregressive integrated moving average methodology developed by Box et al. [4]. The autocorrelation and partial autocorrelation functions are used as tools to determine an ARIMA model of the form ARIMA(p,d,q)(P,D,Q)S, where d and D are the orders of the non-seasonal and seasonal differences, respectively, p and P are the orders of the non-seasonal and seasonal autoregressive terms, and q and Q are the orders of the non-seasonal and seasonal moving average terms. The seasonal difference is represented by the lag S. For the four seasons, the models obtained were (1,1,1)(3,0,0), (1,1,1)(1,0,1), (1,1,0)(2,0,0) and (1,1,0)(3,0,0). The models and results are from [42].
- ANN refers to artificial neural networks. Multilayer feedforward and radial basis function (RBF) ANNs were tested. The stopping criterion was minimal error and error against diversity. In order to optimize the number of layers and neurons, a grid search approach based on a genetic algorithm was used. For the four seasons, the models chosen were MLP 168-8-1, RBF 168-25-1, MLP 168-10-1 and RBF 168-24-1. The models and results are from [42].
- PREDICT2-ES is the model proposed in [42]. It is a non-linear chaotic local dynamics model. The accuracy of such a model is claimed to depend upon five important parameters: the dimensionality of the embedding space, the size of the local neighborhood, the time delay, the Euclidean distance metric of the nearest-trajectory algorithms and the type of regression functions of the local constant models [43]. In the PREDICT2-ES model, these five parameters are searched using an optimization model based on an evolutionary strategy.
- AESO refers to the error made by the forecast load published by AESO for the same period.

It can be seen that ARIMA and ANN have worse accuracy compared to the other approaches. AESO and NCSTAR-SOM are a slight improvement over the PREDICT2-ES model. The accuracies of AESO and NCSTAR-SOM are comparable, but NCSTAR has a worse performance compared to NCSTAR-SOM.

5.2. New South Wales market

Here the results for NCSTAR-SOM are compared against the results for a Bayesian Markov chain scheme proposed in [44], using six months of data from the New South Wales market [45] in Australia. This model presents a Bayesian approach for estimating multi-equation models for intra-day load forecasting, where a first-order vector autoregression is used for the errors.

The training period is from January 1, 1998 to December 31, 2000. The test period is February 1, 2001 to July 31, 2001. The results for the six months of data and the monthly breakdown are shown in Tables 4 and 5, respectively. Clearly the NCSTAR-SOM model improves the prediction accuracy not only over the complete six months of data, but also over the individual months.

Table 4
Comparison of prediction results for NSW for six months using MAPE (%).

                      Weekdays   Weekends
Mean (Bayesian)       3.10       3.43
Mean (NCSTAR)         2.58       2.73
Mean (NCSTAR-SOM)     2.17       2.31
Median (Bayesian)     3.10       3.33
Median (NCSTAR)       2.39       2.26
Median (NCSTAR-SOM)   1.82       2.01

Table 5
Comparison of prediction results for NSW for the individual months using MAPE (%).

            Weekdays                         Weekends
            Bayesian   NCSTAR   NCSTAR-SOM   Bayesian   NCSTAR   NCSTAR-SOM
February    3.97       2.32     2.25         4.00       2.89     2.63
March       3.17       3.06     2.79         3.44       3.17     2.78
April       2.31       2.16     2.05         3.06       2.38     1.46
May         3.14       2.84     2.00         3.74       2.65     2.51
June        4.45       2.98     2.20         4.53       2.97     2.32
July        2.17       2.09     1.81         2.33       2.26     2.02

5.3. New England market

Five years of publicly available hourly load demand data from January 1, 2005 to December 31, 2009 for the New England market [46] in the USA is used. The first three years of data are used to train the network, the fourth year is used as a cross-validation set, while the fifth year is used as a test prediction set.

Table 6
Comparison of prediction results for New England for each day of the week (MAPE, %).

Day         ST-NN   NCSTAR   NCSTAR-SOM
Monday      4.01    4.12     2.71
Tuesday     3.42    3.56     2.74
Wednesday   3.24    3.35     2.89
Thursday    3.24    3.39     2.91
Friday      3.50    3.46     3.28
Saturday    4.61    4.48     3.33
Sunday      4.26    4.55     3.37
Total       3.75    3.83     3.03

Table 7
Comparison of prediction results for New England for individual months (MAPE, %).

Month       ST-NN   NCSTAR   NCSTAR-SOM
January     2.91    3.05     2.78
February    2.70    2.63     2.39
March       2.79    2.95     2.91
April       3.20    3.29     3.27
May         3.10    3.19     3.14
June        4.52    4.26     2.44
July        3.97    3.95     3.69
August      5.28    5.17     4.95
September   6.10    5.75     3.24
October     2.85    3.47     2.41
November    3.19    3.54     2.23
December    4.40    4.55     3.28

Fig. 6. Histogram of the residual error for New England hourly data.

Table 8
Percentage of hours with a certain MAPE range for New England data.

MAPE range (%)   % of hours
<1               24.2
1–2              20.7
2–3              17.4
3–4              11.7
4–5               8.7
5–6               6.7
6–7               4.0
>7                6.6

The benchmark used is the semigroup-based system-type neural network (ST-NN) architecture proposed in [47,48]. In this method, the network is decomposed into two channels: a semigroup channel and a function channel. The semigroup channel models the dependency of the load on temperature, whereas the function channel represents the fundamental characteristics of the daily load cycles.

In Table 6, results are shown which compare the models for the seven days of the week. Clearly NCSTAR-SOM outperforms the ST-NN model on each day, and the improvements in accuracy are largest for the weekend days. NCSTAR-SOM is also able to outperform NCSTAR. Looking at the MAPE values for the combined data, the MAPEs are 3.75%, 3.83% and 3.03% for the ST-NN, NCSTAR and NCSTAR-SOM methods, respectively. NCSTAR-SOM has been able to improve the results by 0.72 percentage points over ST-NN, which is a significant improvement. In Table 7, results are compared for the twelve months of the year. While ST-NN and NCSTAR-SOM perform rather similarly in the easy-to-predict months of March, April and May, it can be seen that NCSTAR-SOM is able to provide a much better result than ST-NN and NCSTAR in the months that form the transition between seasons, such as June and September.

5.4. Residual error results and analysis

The model validation step involves analysis of the residual error. Due to space constraints, only the results for the New England data will be presented. The error residuals (signed percentage errors) are obtained, and their histogram is plotted in Fig. 6. On top of the histogram, a best-fit normal distribution is superimposed. In Table 8, the range of forecasting errors is presented. It can be seen that roughly 60% of the hours have a MAPE below 3%, and roughly 10% of the hours have a MAPE above 6.5%.

At the 95% confidence level, the null hypothesis of a normal distribution is rejected for the obtained residuals using all three tests: the Kolmogorov–Smirnov test, the Lilliefors test and the Jarque–Bera test. Looking at the histogram of the residuals in Fig. 6 and comparing it with the histogram of a normal distribution, we can see that the residuals have fat tails. This means that from time to time there are rather large values in the error, which are hard to reconcile with the standard distributional assumption of normality.
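For reference, Python equivalents of the three Matlab tests mentioned above are available in SciPy and statsmodels; a sketch (assuming `resid` holds the signed percentage errors) is:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.diagnostic import lilliefors

def residual_normality_tests(resid):
    """Run the three normality tests on the residual vector."""
    resid = np.asarray(resid, dtype=float)
    z = (resid - resid.mean()) / resid.std()    # standardize for the KS test
    return {
        "Kolmogorov-Smirnov": stats.kstest(z, "norm"),  # (statistic, p-value)
        "Lilliefors": lilliefors(resid),                # (statistic, p-value)
        "Jarque-Bera": stats.jarque_bera(resid),        # (statistic, p-value)
    }

# The null hypothesis of normality is rejected when the p-value is small.
```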

Fig. 7. Hourly change $\Delta=L_{k+1}/L_k$ for New England hourly data.

Fig. 8. Comparison of training performance of the SOM-based initialization (solid line) and the originally proposed initialization (dotted line) for the NCSTAR model.

In a fat-tailed distribution, the probability of large and small values is much higher than would be implied by a normal distribution. This can be the reason why the tests for normality are rejected for the residuals.

Consider why the fat tails are present in the error residuals. This paper hypothesizes that sudden changes in weather due to summer and winter effects are responsible for the fat tails. The winter months of December and January and the summer months of June and July often include some extremely high demand days because of a sudden cold wave or a sudden heat wave on those particular days. The NCSTAR-SOM model does not consider exogenous weather forecasts in its inputs, because it assumes that weather variables evolve in a smooth fashion and that the load series can sufficiently capture the change. This assumption leads to bigger errors for the days when the weather changes suddenly, and hence the fat tails.

To support this hypothesis, consider Fig. 7. Let $L_k$ and $L_{k+1}$ denote the load demand at consecutive hours $k$ and $k+1$, respectively, so that $\Delta = L_{k+1}/L_k$ denotes the change over the two consecutive hours. Fig. 7 plots the histogram of $\Delta$ for the four-year data. The histogram appears to be the superposition of three bell-shaped segments. The biggest segment is in the middle, around $\Delta = 1$; this corresponds to normal conditions, when the weather changes smoothly. The two segments below and above the central one are created by more-than-normal or less-than-normal changes in demand over two hours due to weather effects. Hence, weather effects lead to fat tails, which in turn lead to the rejection of the hypothesis of normality of the error residuals.
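The construction behind Fig. 7 is simple to replicate; a sketch, assuming `load` is a one-dimensional array of hourly demand values (a synthetic placeholder is used here):

```python
import numpy as np
import matplotlib.pyplot as plt

# load: hourly demand series (placeholder; four years of data in the paper).
rng = np.random.default_rng(1)
hours = 4 * 8760
load = (15000 + 3000 * np.sin(np.arange(hours) * 2 * np.pi / 24)
        + 500 * rng.standard_normal(hours))

# Hourly change Delta = L_{k+1} / L_k over consecutive hours.
delta = load[1:] / load[:-1]

# Histogram of Delta (cf. Fig. 7); with real data, three bell-shaped
# segments appear: a central one around Delta = 1 for smoothly changing
# weather, and two side lobes from abrupt weather-driven demand changes.
plt.hist(delta, bins=200)
plt.xlabel(r'$\Delta = L_{k+1}/L_k$')
plt.show()
```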

5.5. Comparison of training speed and accuracy

The proposed NCSTAR-SOM model with SOM-based initialization trains faster and more accurately than the original NCSTAR model with the original weight initialization involving parallel hyperplanes, as described in [19]. The reason for the faster training, as discussed earlier, is that the SOM-based initialization lets the training start from a point close to the global optimum, so the training has less chance of getting stuck in a local minimum, which is a major concern for the MLP. This is shown in Fig. 8. For the Alberta data (from Section 5.1 above), the results from the NCSTAR-SOM and NCSTAR models are reconsidered here; in both scenarios, the training uses the regular back-propagation algorithm [39]. Fig. 8 shows the fall of the MAPE over 1000 iterations. The continuous line and the dotted line represent the results obtained for the NCSTAR-SOM and NCSTAR models, respectively. The MAPE falls faster, and to a deeper level, for the NCSTAR-SOM method than for the NCSTAR method, and a similar pattern is observed for the other days of the week. Hence the training is both more accurate and faster.

An interesting observation in this figure is that, for the first few iterations, the MAPE is higher for NCSTAR-SOM than for NCSTAR. This runs contrary to our argument that the SOM-based initialization starts the training closer to the global minimum. For this particular time series, the approach of [19], which places parallel hyperplanes perpendicular to the direction of maximum variance of the input variables, provided a better initial estimate than NCSTAR-SOM. However, the benefits of NCSTAR-SOM show up as the training progresses, and its MAPE falls faster than that of NCSTAR.
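The comparison in Fig. 8 can be emulated for any gradient-trained model by recording the MAPE after every back-propagation iteration under the two initializations. The schematic sketch below does this for a toy one-hidden-layer sigmoid network; the data, network size, and the "structured" initialization are illustrative placeholders, not the NCSTAR implementation or the SOM-based initialization itself.

```python
import numpy as np

def mape(y_true, y_pred):
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

def train(X, y, W0, v0, lr=0.01, iters=1000):
    """Plain batch back-propagation for a one-hidden-layer sigmoid net,
    recording the MAPE after every iteration (cf. Fig. 8)."""
    W, v = W0.copy(), v0.copy()
    curve = []
    for _ in range(iters):
        h = 1.0 / (1.0 + np.exp(-X @ W))          # hidden activations
        y_hat = h @ v                             # network output
        e = y_hat - y
        v -= lr * h.T @ e / len(y)                # output-layer update
        W -= lr * X.T @ ((e[:, None] * v) * h * (1 - h)) / len(y)
        curve.append(mape(y, y_hat))
    return np.array(curve)

# Two initializations of the same network: random, versus a structured
# one standing in for the SOM/Ho-Kashyap initialization of the paper.
rng = np.random.default_rng(2)
X = rng.uniform(0.5, 1.5, size=(500, 6))
y = X @ rng.uniform(0.5, 1.0, 6) + 0.1 * rng.standard_normal(500)
W_rand = rng.standard_normal((6, 4))
v_rand = rng.standard_normal(4)
W_init = 0.1 * W_rand + 1.0          # hypothetical "smart" starting point
v_init = np.ones(4)

curve_rand = train(X, y, W_rand, v_rand)
curve_init = train(X, y, W_init, v_init)
# Plot both curves against the iteration count to mimic Fig. 8.
```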

6. Discussion and conclusion

In this paper, we first explain why the NCSTAR model, with its multivariate thresholds and smooth transitions between regimes, is suitable for short-term load forecasting: the load demand is highly dependent on seasonal factors, and seasons tend to change gradually. Next, we highlight the inadequacies of the current methods of initializing the weights of the NCSTAR neural network, and explain the importance of having good initial network weights. Finally, we propose a two-step method to obtain fairly good initial weights. In the first step, unsupervised learning is used to cluster the historical data into separate regimes. The second step uses the Ho–Kashyap algorithm to find the equations of the separating planes, which are then used to initialize the weights of the hidden-layer neurons. Experiments on three prominent energy markets show that the proposed method gives competitive results in terms of prediction accuracy for hourly load forecasting as well as daily load forecasting.
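As a concrete illustration of the two steps, the sketch below uses the third-party MiniSom package for the SOM clustering and a plain numpy implementation of the Ho–Kashyap recursion for the separating plane. The data, map size, and the mapping of the plane coefficients onto a hidden neuron are simplified assumptions; this is a minimal sketch, not the exact procedure of the paper.

```python
import numpy as np
from minisom import MiniSom  # third-party package, assumed available

def ho_kashyap(Y, lr=0.5, iters=500, eps=1e-6):
    """Ho-Kashyap: find a with Y a > 0 (second-class samples pre-negated).
    b is the margin vector, updated only where the error is positive."""
    b = np.ones(len(Y))
    Y_pinv = np.linalg.pinv(Y)
    a = Y_pinv @ b
    for _ in range(iters):
        e = Y @ a - b
        b += lr * (e + np.abs(e))     # increase b only where e > 0
        a = Y_pinv @ b
        if np.all(np.abs(e) < eps):
            break
    return a

# --- Step 1: SOM clustering of the historical load patterns -----------
rng = np.random.default_rng(3)
patterns = rng.standard_normal((1000, 5))   # placeholder load-window vectors
som = MiniSom(1, 2, 5, sigma=0.5, learning_rate=0.5)  # tiny 1x2 map: 2 regimes
som.train_random(patterns, 2000)
labels = np.array([som.winner(p)[1] for p in patterns])

# --- Step 2: Ho-Kashyap separating plane between the two regimes ------
# Augment with a bias term and negate class-1 samples (2-class convention).
Y = np.hstack([patterns, np.ones((len(patterns), 1))])
Y[labels == 1] *= -1.0
a = ho_kashyap(Y)

# a[:-1] and a[-1] would initialize one hidden neuron's weights and bias.
w_init, b_init = a[:-1], a[-1]
```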

A notable advantage of the proposed method is that it can easily handle the non-stationarity in the electricity load data, which might occur due to seasonal or market effects. This is because the NCSTAR model works with a weighted sum of several local AR models instead of a single global model for the whole series. This handling of non-stationarity is a desirable property because, in a deregulated power market, players will continue to bring in new dynamic bidding strategies, which will introduce more non-stationarity into the price-dependent load demand series.
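To make the "weighted sum of local AR models" concrete, a schematic forward pass is sketched below: the hidden layer produces smooth regime weights from the lagged inputs, and the prediction is the correspondingly weighted combination of the local linear (AR) outputs. The shapes and the gating nonlinearity are illustrative assumptions, not the exact NCSTAR parameterization.

```python
import numpy as np

def ncstar_like_forward(x, W, b, Phi):
    """x    : lagged load vector, shape (p,)
       W, b : hidden-layer weights/biases, shapes (h, p) and (h,)
       Phi  : AR coefficients of the h local models, shape (h, p)
       Returns the regime-weighted one-step-ahead prediction."""
    g = 1.0 / (1.0 + np.exp(-(W @ x + b)))   # smooth regime memberships
    local = Phi @ x                          # each local AR model's output
    return g @ local                         # weighted sum over regimes

# Toy usage: 2 regimes, 3 lags.
x = np.array([0.9, 1.0, 1.1])
W = np.array([[ 2.0, 0.0, 0.0],
              [-2.0, 0.0, 0.0]])
b = np.zeros(2)
Phi = np.array([[0.5, 0.3, 0.2],
                [0.2, 0.3, 0.5]])
print(ncstar_like_forward(x, W, b, Phi))
```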

The proposed method of weight initialization for the NCSTAR model also makes it more robust to initial conditions, because the first step of the initialization method involves an SOM, which is generally robust against bad initializations.

In the present model, we do not include exogenous variables such as weather factors like temperature or humidity. This can be justified by the observation that for short lead times, such as one day ahead, the weather variables evolve in a smooth fashion. However, we also note that the prediction accuracy is worst for the peak winter and peak summer months, which are most associated with sudden cold waves and sudden heat waves, respectively. Assuming that good weather forecasts are available, it would be interesting to incorporate weather factors into our model in such a way that they influence the predicted load demand only if the predicted weather is significantly different from the historically observed normal weather for the next day. Furthermore, which weather variable to consider will depend upon the characteristics of the energy market being studied: while humidity might be an important factor for load forecasting in tropical markets, it matters little in temperate markets. Future work in this field will have to investigate these characteristics in depth before weather variables can be incorporated into the model to further improve the prediction accuracy.

Appendix A

Kessler's construction: This is an approach to generalize a 2-class problem to a $c$-class problem. Continuing with the notation of Section 5.4, for the linearly separable case there exists a set of weight vectors $a_1, \ldots, a_c$ such that if $y \in \omega_i$, then

$$a_i^T y > a_j^T y, \quad \forall i \neq j.$$

This definition allows the $c$-class case to be reduced to the 2-class case by suitable manipulation. Suppose $y \in \omega_1$; then

$$a_1^T y - a_j^T y > 0, \quad \forall j = 2, 3, \ldots, c.$$

If we construct the $(c \times \hat{n})$-dimensional vector $a = [a_1^T, a_2^T, \ldots, a_c^T]^T$ (where $\hat{n} = n + 1$ is the dimension of $y$), then the previous inequality can equivalently be thought of as requiring $a$ to correctly classify all $(c-1)$ of the $(c \times \hat{n})$-dimensional vectors

$$\eta_{12} = \begin{bmatrix} +y \\ -y \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad \eta_{13} = \begin{bmatrix} +y \\ 0 \\ -y \\ \vdots \\ 0 \end{bmatrix}, \quad \ldots, \quad \eta_{1c} = \begin{bmatrix} +y \\ 0 \\ \vdots \\ 0 \\ -y \end{bmatrix},$$

i.e. $a^T \eta_{1j} > 0$, $j = 2, 3, \ldots, c$.

More generally, if $y \in \omega_i$, we can construct the $(c-1)$ vectors $\eta_{ij}$ of dimension $(c \times \hat{n})$, with the $i$th subvector equal to $y$, the $j$th subvector equal to $-y$, and all other subvectors equal to $0$. Then

$$a^T \eta_{ij} > 0, \quad \forall i \neq j$$

looks like a 2-class problem. The Ho–Kashyap procedure, as detailed in Section IV-D, can now be used to find a separating hyperplane in the linearly separable case, and an MSE-optimal solution in the non-separable case.
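To connect the construction to code, the sketch below builds the $\eta_{ij}$ vectors for labelled samples and hands them to a 2-class solver such as the Ho–Kashyap routine sketched earlier in the discussion. The toy data and dimensions are placeholders.

```python
import numpy as np

def kessler_vectors(Y_aug, labels, c):
    """Build the (c-1) vectors eta_ij of dimension c*n_hat per sample.
       Y_aug : augmented samples y (bias term included), shape (N, n_hat)
       labels: class index in {0, ..., c-1} for each sample."""
    n_hat = Y_aug.shape[1]
    rows = []
    for y, i in zip(Y_aug, labels):
        for j in range(c):
            if j == i:
                continue
            eta = np.zeros(c * n_hat)
            eta[i * n_hat:(i + 1) * n_hat] = y    # i-th subvector = +y
            eta[j * n_hat:(j + 1) * n_hat] = -y   # j-th subvector = -y
            rows.append(eta)
    return np.array(rows)

# Toy 3-class usage.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(m, 0.3, size=(30, 2)) for m in (-2.0, 0.0, 2.0)])
labels = np.repeat([0, 1, 2], 30)
Y_aug = np.hstack([X, np.ones((90, 1))])    # n_hat = n + 1
Z = kessler_vectors(Y_aug, labels, c=3)     # want a^T eta > 0 for all rows
# a = ho_kashyap(Z)   # stacked [a_1; a_2; a_3]; classify by argmax_i a_i^T y
```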

References

[1] A.D. Papalexopoulos, T.C. Hesterberg, A regression-based approach to short-term load forecasting, IEEE Transactions on Power Systems 5 (4) (1990) 1535–1550.
[2] W. Charytoniuk, M.S. Chen, P. Van Olinda, Nonparametric regression based short-term load forecasting, IEEE Transactions on Power Systems 13 (3) (1998) 725–730.
[3] S. Huang, K. Shih, Short-term load forecasting via ARMA model identification including non-Gaussian process considerations, IEEE Transactions on Power Systems 18 (2) (2003) 673–679.
[4] G.E.P. Box, G.M. Jenkins, G. Reinsel, Time Series Analysis: Forecasting and Control, 3rd ed., Prentice Hall, 1994.
[5] P. Zarchan, H. Musoff, Fundamentals of Kalman Filtering: A Practical Approach, AIAA Publications, 2005.
[6] H. Tong, On a threshold model, in: C.H. Chen (Ed.), Pattern Recognition and Signal Processing, Sijthoff and Noordhoff, Amsterdam, The Netherlands, 1978.
[7] H. Tong, K.S. Lim, Threshold autoregression, limit cycles and cyclical data (with discussion), Journal of the Royal Statistical Society, Series B 42 (1980) 245–292.
[8] S.R. Huang, Short-term load forecasting using threshold autoregressive models, IEE Proceedings: Generation, Transmission and Distribution 144 (5) (1997) 477–481.
[9] K.S. Chan, H. Tong, On estimating thresholds in autoregressive models, Journal of Time Series Analysis 7 (1986) 179–190.
[10] R. Luukkonen, P. Saikkonen, T. Terasvirta, Testing linearity against smooth transition autoregressive models, Biometrika 75 (1988) 491–499.
[11] T. Terasvirta, Specification, estimation, and evaluation of smooth transition autoregressive models, Journal of the American Statistical Association 89 (425) (1994) 208–218.
[12] L.F. Amaral, R.C. Souza, M. Stevenson, A smooth transition periodic autoregressive model for short-term load forecasting, International Journal of Forecasting 24 (4) (2008) 603–615.
[13] A.T. Robinson, Electricity pool prices: a case study in nonlinear time series, Applied Economics 32 (5) (2000) 527–532.
[14] M. Stevenson, Filtering and forecasting electricity prices in the increasingly deregulated Australian electricity market, International Institute of Forecasters Conference (2001) 1–31.
[15] A.G. Bakirtzis, V. Petridis, S.J. Kiartzis, M.C. Alexiadis, A.H. Maissis, A neural network short term load forecasting model for the Greek power system, IEEE Transactions on Power Systems 11 (2) (1996) 858–863.
[16] H. Yoo, R.L. Pimmel, Short term load forecasting using a self-supervised adaptive neural network, IEEE Transactions on Power Systems 14 (2) (1999) 779–784.
[17] T. Senjyu, H. Takara, K. Uezato, T. Funabashi, One-hour-ahead load forecasting using neural network, IEEE Transactions on Power Systems 17 (1) (2002) 113–118.
[18] A. Veiga, M. Medeiros, A hybrid linear-neural model for time series forecasting, Proceedings of NEURAP (1998) 377–384.
[19] A. Veiga, M. Medeiros, A hybrid linear-neural model for time series forecasting, IEEE Transactions on Neural Networks 11 (6) (2000) 1402–1412.
[20] A. Veiga, M. Medeiros, A flexible coefficient smooth transition time series model, IEEE Transactions on Neural Networks 16 (1) (2005) 97–113.
[21] M.F. Redondo, C.H. Espinosa, Weight initialization methods for multilayer feedforward, Proceedings of the ESANN (2001) 119–124.
[22] G. Thimm, E. Fiesler, High-order and multilayer perceptron initialization, IEEE Transactions on Neural Networks 8 (2) (1997) 349–359.
[23] V. Kecman, Learning and Soft Computing: Support Vector Machines, Neural Networks, and Fuzzy Logic Models, The MIT Press, Cambridge, MA, 2001.
[24] S. Geman, E. Bienenstock, R. Doursat, Neural networks and the bias/variance dilemma, Neural Computation 4 (1) (1992) 1–58.
[25] I. Gomez, L. Franco, J.M. Jerez, Neural network architecture selection: can function complexity help? Neural Processing Letters 30 (2009) 71–87.
[26] R. Delogu, A. Fanni, A. Montisci, Geometrical synthesis of MLP neural networks, Neurocomputing 71 (2008) 919–930.
[27] S. Trenn, Multilayer perceptrons: approximation order and necessary number of hidden units, IEEE Transactions on Neural Networks 19 (5) (2008) 836–844.
[28] L. Prechelt, Early stopping: but when?, in: Neural Networks: Tricks of the Trade, Lecture Notes in Computer Science, vol. 1524, Springer-Verlag, Heidelberg, 1998.
[29] R. Setiono, Feedforward neural network construction using cross validation, Neural Computation 13 (12) (2001) 2865–2877.
[30] S. Leybourne, P. Newbold, D. Vougas, Unit roots and smooth transitions, Journal of Time Series Analysis 19 (1998) 83–97.


[31] M.F. Redondo, C.H. Espinosa, A comparison among weight initialization methods for multilayer feedforward networks, IJCNN 4 (2000) 543–548.
[32] J.F. Kolen, J.B. Pollack, Backpropagation is sensitive to initial conditions, Laboratory of Artificial Intelligence Research, Computer Information Science Department, Technical Report TR 90-JK-BPSIC, 1990.
[33] T. Kohonen, Self-Organizing Maps, Springer-Verlag, Berlin, Germany, 1997.
[34] R. Xu, D. Wunsch, Survey of clustering algorithms, IEEE Transactions on Neural Networks 16 (3) (2005) 645–678.
[35] G. Simon, A. Lendasse, M. Cottrell, J.C. Fort, M. Verleysen, Time series forecasting: obtaining long term trends with self-organizing maps, Pattern Recognition Letters 26 (12) (2005) 1795–1808.
[36] G. Simon, A. Lendasse, M. Cottrell, J.C. Fort, M. Verleysen, Double quantization of the regressor space for long-term time series prediction: method and proof of stability, Neural Networks 17 (8–9) (2004) 1169–1181.
[37] G. Barreto, J. Mota, L. Souza, R. Frota, Non-stationary time series prediction using local models based on competitive neural networks, Lecture Notes in Computer Science 3029 (2004) 1146–1155.
[38] J. Vesanto, E. Alhoniemi, Clustering of the self-organizing map, IEEE Transactions on Neural Networks 11 (3) (2000) 586–600.
[39] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed., Prentice-Hall, New Jersey, 1999.
[40] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, 2nd ed., Wiley-Interscience, 2000.
[41] The Alberta Electric System Operator. Available at ⟨www.aeso.ca⟩.
[42] C.U. Vila, A.Z. de Souza, J.W. Lima, P.P. Balestrassi, Electricity demand and spot price forecasting using evolutionary computation combined with chaotic nonlinear dynamic model, Electrical Power and Energy Systems 21 (2) (2010) 108–116.
[43] H. Kantz, T. Schreiber, Nonlinear Time Series Analysis, Cambridge University Press, 2002.
[44] R. Cottet, M. Smith, Bayesian modelling and forecasting of intraday electricity load, Journal of the American Statistical Association 98 (464) (2003) 839–849.
[45] Australian Energy Market Operator. Available at ⟨www.aemo.com.au⟩.
[46] New England ISO. Available at ⟨www.iso-ne.com⟩.
[47] K.Y. Lee, Shu Du, Short term load forecasting using semigroup based system-type neural network, Proceedings of the ISAP (2005) 291–296.
[48] Shu Du, Short term load forecasting using system-type neural network architecture, Master's Thesis, Baylor University, 2009.

Dipti Srinivasan obtained her ME and Ph.D. degrees in Electrical Engineering from the National University of Singapore (NUS) in 1991 and 1994, respectively. She worked at the University of California at Berkeley's Computer Science Division as a postdoctoral researcher from 1994 to 1995. In June 1995, she joined the faculty of the Electrical & Computer Engineering department at the National University of Singapore, where she is an Associate Professor. From 1998 to 1999 she was a Visiting Faculty in the Department of Electrical & Computer Engineering at the Indian Institute of Science, Bangalore, India.

Her main areas of interest are neural networks, evolutionary computation, intelligent multi-agent systems and the application of computational intelligence techniques to engineering optimization, planning and control problems. Her research has focused on the development of hybrid neural network architectures, learning methods and their practical applications for large complex engineered systems, such as the electric power system and urban transportation systems. These systems are examined in various projects by applying multidisciplinary methods that are able to cope with the problems of imprecision, learning, uncertainty and optimization, when concrete models are constructed.

Vineet Yadav obtained his ME degree in Electrical Engineering from the National University of Singapore (NUS) in 2010. His main area of interest is applications of neural networks for pattern recognition in time series.