
UPTEC IT 18004

Examensarbete 30 hp, Maj 2018

Machine Learning for Restaurant Sales Forecast

Mikael Holmberg
Pontus Halldén

Institutionen för informationsteknologi
Department of Information Technology


Teknisk-naturvetenskaplig fakultet, UTH-enheten
Besöksadress: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0
Postadress: Box 536, 751 21 Uppsala
Telefon: 018 – 471 30 03
Telefax: 018 – 471 30 00
Hemsida: http://www.teknat.uu.se/student

Abstract

Machine Learning for Restaurant Sales Forecast

Mikael Holmberg and Pontus Halldén

Many restaurants do not have a solid forecast of their daily sales. Often, they have neither the training nor the energy to make a calculated estimate of the sales. At best, some restaurants look at last year's sales on the equivalent day and the circumstances of the current date.

Caspeco is a company that provides various services to the restaurant business. Until recently, they have tried different forecasting solutions, which usually include trends over some time interval.

In this thesis, we investigate whether it is possible to create a forecasting solution based on supervised learning. Two different methods are tested: Extreme Gradient Boosted Trees and Long Short-Term Memory Neural Networks. The two methods are evaluated against each other and compared to the current uplift model used by Caspeco.

The data used for training the supervised learning methods is a combination of data provided by Caspeco and data collected from the Swedish Meteorological and Hydrological Institute (SMHI), such as temperature, minutes of sunshine and rainfall, all of which are known to have an impact on the sales of a restaurant.

The results show that the models are dependent on the setting of each restaurant: the size of the restaurant, the type of the restaurant, whether it has an outdoor seating area, and so on. Both models improve on Caspeco's current uplift model by approximately 10-15 percentage points with regard to its current unit of measurement. Although the models surpass their comparison metric, there is room for even better results. Unexpected sales in the form of events are known to influence the results. If different types of event data were provided, we conjecture that the supervised learning models could give a much higher prediction accuracy.

Tryckt av: Reprocentralen ITC
UPTEC IT 18004
Examinator: Lars-Åke Nordén
Ämnesgranskare: Johannes Borgström
Handledare: Jonas Mattsson


Popular Science Summary (Populärvetenskaplig Sammanfattning)

Bar and restaurant owners rarely prioritize making a well-founded estimate of upcoming sales. Often they settle for using the same weekday from the previous year's sales and making an occasional adjustment based on factors such as weather and events.

In this thesis we have, in consultation with sales analysts, developed two machine learning models, applying the two algorithms Extreme Gradient Boosted Trees and Long Short Term Memory Neural Network. The results from these two methods show an increased precision in the sales forecast compared to the current method. The data that the methods are applied to contains various time and weather data. Good weather should as a rule lead to increased sales, but this has proven difficult to demonstrate with our models. The results show an improvement when the models are trained with weather data, albeit a modest one. We discuss which factors have an influence on a restaurant's sales and which improvements could be made.


Acknowledgements

We would like to thank our mentor Jonas Mattson at Caspeco for his help and support throughout this thesis, as well as Caspeco for their help with logistics and much more.

Lastly, a special thank you to our mentor Johannes Borgström. Without your valuable input and feedback this thesis would not have been completed.


Contents

1 Introduction
  1.1 Caspeco
  1.2 Motivation
  1.3 Research Question
2 Background
  2.1 Supervised vs. Unsupervised learning
  2.2 Python
  2.3 Protocol Buffer
  2.4 Swedish Meteorological and Hydrological Institute
    2.4.1 SMHI Open API
  2.5 Decision Trees
    2.5.1 Basic Definitions
    2.5.2 Randomly Trained Decision Trees
    2.5.3 Random Forest Model
  2.6 Regression trees
    2.6.1 Gradient Boosted Trees and XGBoost
  2.7 Feed Forward Neural Network
    2.7.1 Recurrent Neural Network
    2.7.2 Long Short Term Memory Networks
  2.8 Related work
3 Methodology and Implementation
  3.1 Data
    3.1.1 Data extraction
    3.1.2 Feature Creation
    3.1.3 Data pre-processing
    3.1.4 Correlation of features
    3.1.5 Final Features
    3.1.6 Standardization and Normalization
  3.2 Root Mean Square Error
  3.3 Extreme Gradient Boosting
    3.3.1 Model Setup
    3.3.2 Final XGBoost model
  3.4 LSTM Neural Network
    3.4.1 Model setup
    3.4.2 Final LSTM model
4 Results
  4.1 Dataset 1
  4.2 Dataset 2
  4.3 Dataset 3
  4.4 The Saturday Dataset
  4.5 The Summer Dataset
  4.6 Benchmarks
5 Discussion
  5.1 Date Features
  5.2 Weather Features
  5.3 Feature importance and selection
  5.4 Summer and weekday datasets
  5.5 Result
    5.5.1 Restaurant 1
    5.5.2 Restaurant 2
    5.5.3 Restaurant 3
  5.6 Outliers
  5.7 Multiple restaurants prediction
  5.8 Choice of algorithms
  5.9 Future work
6 Conclusion
A Appendix A


1 Introduction

Any business's goal is to make a profit. This is accomplished by having more sales than costs. A restaurant's profit comes from the manager's ability to estimate how much food and drink will be sold in the upcoming days. If sales in the restaurant are high, there must be enough ingredients in the inventory as well as an appropriate number of people working.

Being able to estimate the upcoming sales takes experience. The impact of the weather on the sales of a restaurant is apparent: if there is sunshine on a warm day, the sales of drinks will probably increase, but are people considering the heavy wind of that day when deciding to leave the house?

These questions are hard to answer, but with appropriate data mining and machine learning algorithms, the relationship between the weather and the sales can be utilized to make good predictions of the sales.

1.1 Caspeco

Caspeco is an Uppsala based company that provides tools and services for salary handling, resource scheduling, financial analysis and budgets [1]. At the moment Caspeco's customers do their own sales forecast, and it is based on last year's sales on an equivalent day, with an uplift according to recent sales or wishful thinking. Without an accurate sales forecast it is hard to predict the workload. Having too many people working on a quiet day, or the opposite, both increase the costs of the restaurant.

1.2 Motivation

The purpose of this project is to evaluate whether a system can create a more accurate sales forecast than what can be done by comparing last year's sales on an equivalent day adjusted with an uplift. The goals of this project are to evaluate how different data sources can be used to forecast sales, to implement a practical integration with those data sources, and to create a sales forecast solution using historical data and new data sources.

1.3 Research Question

This thesis will investigate the following problems:

1. Can supervised learning models be used to predict a higher number of sales that lie within 15% of the actual sale than the simple uplift heuristic can predict?


2. Which of Extreme Gradient Boosted Trees and Long Short Term Memory Neural Network is better in terms of:

(a) the number of predicted sales that lie within 15% of the actual sale?

(b) the number of predicted sales that lie within 15% of the actual sale when a model is trained on data from a single restaurant versus when a model is trained on several restaurants?

(c) training time?

3. What features are the most important when predicting sales?

4. What is the impact of the weather on the sales on a particular weekday?

5. What is the impact of the weather on the sales during a specific period of time?

The final model must predict more sales within 15% of the actual sales value than the simple uplift heuristic does. However, what counts as the best solution depends on the context it is applied in. For a company, an emphasis on the forecast of the upcoming 2-5 days might be more important than a forecast of the coming 2-3 weeks. Furthermore, a model that is complex to implement and has a long training time but yields a better result than a model that is easier to implement and trains faster might not be the most usable one. For this thesis, the number of predictions within 15% of the actual sale is the most important criterion, regardless of the usability and training time of the model.
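To make the evaluation criterion concrete, the following sketch computes the share of predictions that fall within 15% of the actual sales. This is our own illustration of the metric, not code from the project, and the example numbers are made up:

```python
import numpy as np

def share_within_tolerance(actual, predicted, tolerance=0.15):
    """Fraction of predictions whose relative error is at most `tolerance`."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    relative_error = np.abs(predicted - actual) / actual
    return np.mean(relative_error <= tolerance)

# Hypothetical daily sales in SEK and the corresponding forecasts.
actual_sales = np.array([114000, 68000, 95000, 120000])
forecast = np.array([105000, 80000, 99000, 118000])
print(share_within_tolerance(actual_sales, forecast))  # 0.75: 3 of 4 days within 15%
```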

During the course of the project, two further research questions became apparent. What is the impact of the weather on the sales on a particular weekday? and What is the impact of the weather on the sales for a specific period of time? are two research questions that were investigated out of the curiosity of the project members. These questions are not as thoroughly investigated as the rest of the research questions.

2 Background

The sales of a restaurant can vary greatly depending on several different factors, and the weather is one of those factors. The weather changes rapidly and so do the needs of the restaurant. An accurate sales forecast is therefore a good tool for a restaurant and will be realized by implementing Extreme Gradient Boosted Trees (XGBoost) and a Long Short-Term Memory Neural Network (LSTM). A brief discussion of these choices can be read in Section 5.8.

It is important to realize that not all restaurants are weather-dependent. There is a big difference between a small bistro with a cozy outdoor seating area and a well-known pub or restaurant that is more dependent on events and good food. These pubs/restaurants are more dependent on the time features in the dataset: is it the day after the 25th, or is it a Tuesday versus a Saturday? This will be further discussed in Section 5.

2.1 Supervised vs. Unsupervised learning

There are several types of problems that machine learning solves. Depending on the problem and data at hand, different approaches can be taken.

A supervised learning approach is often taken when the data consists of both input and output values. By calculating the error from the difference between the model's predicted value and the actual output value, it is possible to change the model's weights and biases to minimize this error [2].

Supervised learning solves problems of both a classification and a regression nature. By using classification it is possible to categorize data into a class, for example "red" or "car". By using regression it is possible to forecast a numerical output value, for example the number of persons that are going to attend an event.

Unsupervised learning is commonly used when there is no output data. Instead the learning algorithm tries to find patterns in the data on its own. Unsupervised learning is often used to find clusters and put unseen data in a suitable neighborhood [2]. This can be used for example on sales patterns: if a customer buys a top hat he is likely to buy a walking cane.

With the input dataset given in this project and its numerical output value, it is clear that this is a regression problem. Given the data available, a supervised learning method is preferred over an unsupervised one.

2.2 Python

Python is a well documented, easy to use programming language with an enormous variety of packages. Pandas [3] is a package which provides data structures and data analysis tools. Keras [4] is a high-level API built on top of Tensorflow [5], an open source software library for computation using data flow graphs. Tensorflow was developed by Google's Machine Intelligence research organization, with the purpose of creating a platform for machine learning research. XGBoost [6] is a library that provides an API for boosted trees. These are the most important packages used in this project.
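As a rough illustration of how these packages fit together (our own sketch; the file and column names are hypothetical), Pandas can join the sales and weather sources into one dataset before it is handed to the learning libraries:

```python
import pandas as pd

# Hypothetical exports of the two data sources; column names are made up.
sales = pd.read_csv("sales.csv", parse_dates=["Date"])
weather = pd.read_csv("weather.csv", parse_dates=["Date"])

# Join the two sources on date so that each row holds one day's sales
# together with that day's weather observations.
dataset = sales.merge(weather, on="Date", how="inner")
print(dataset.describe())
```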

2.3 Protocol Buffer

Google has developed its own software for serializing data, called Protobuf, which is both language-neutral and platform-neutral [7]. In comparison to XML it is [8]:

• 3 to 10 times smaller

• 20 to 100 times faster

• less ambiguous


• able to generate data access classes that are easier to use programmatically

This motivates why Caspeco, with years of sales data, has chosen to serialize it with Protobuf.
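As a hedged sketch of what this could look like on the Python side, assume a hypothetical SalesRecord message compiled with protoc; the actual schema used by Caspeco is not shown in this thesis, but SerializeToString and ParseFromString are the standard methods on generated message classes:

```python
# Hypothetical schema, compiled with: protoc --python_out=. sales.proto
#
#   message SalesRecord {
#     string date = 1;          // yyyy-mm-dd
#     double sales_ex_vat = 2;  // daily sales excluding VAT
#   }
import sales_pb2  # module generated by protoc from the schema above

record = sales_pb2.SalesRecord(date="2017-06-02", sales_ex_vat=114000.0)
payload = record.SerializeToString()      # compact binary representation

restored = sales_pb2.SalesRecord()
restored.ParseFromString(payload)         # round-trips without loss
print(restored.date, restored.sales_ex_vat)
```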

2.4 Swedish Meteorological and Hydrological Institute

The Swedish Meteorological and Hydrological Institute, SMHI, went from doing manual measurements to using automated ones from special measurement stations in 1990. This transition made it possible to get measurements more often, and from even more locations. However, manual measurements are still being done even today. It is important to understand the data and whether there is possible noise in it. With this in mind, a brief overview of how the measurements of some important features of this project are made follows.

Temperature is measured both automatically and manually. The automated measurements are done every hour, and the manual ones three times a day. The measurements are dependent on the environment of the station. If the surroundings of the station are, for example, dry and consist of sand, this will yield a higher temperature reading than one from a station surrounded by wet grass. Stations on a slope or surrounded by buildings also give inaccurate measurements, and therefore all of SMHI's stations that measure temperature are located approximately 2 meters above an open grass field. In this project, a daily average of the automatic measurements is used.

Rainfall is measured by collecting all rain that has fallen in a cylinder, and then measuring it by hand. This is how SMHI collects this information in a majority of the cases.

Measuring the amount of rainfall is not easy and has some known sources of error. This is important and will be expanded upon, since the amount of rain is an important feature in our training phase.

One of the biggest sources of error is the effect of the wind on the amount of water that is collected in the cylinder. With heavy wind, less water is collected in the cylinder, and more falls directly onto the ground. One solution was to place the cylinder closer to the ground, which would eliminate the influence of the wind, but this caused a larger problem due to litter getting into the cylinder. SMHI abandoned this approach and instead constructed artificial windbreakers, which lessen the effect of the wind but do not eliminate it.

The placement of the cylinder is also a known source of error. It has to be placed at least twice the diameter of the cylinder away from any nearby objects. However, finding a perfect placement with regard to surrounding objects and the wind is not always possible, and there will always be some percentage of error in the final readings.

Snow depth is measured by hand with a ruler on even ground. There are no stations which measure snow depth automatically. When the temperature is approximately 0 degrees, and the snow is melting or the rain is freezing, the snow depth can be approximated using the rule that 1 cm of fresh snow corresponds to approximately 1 mm of water.


In this project, the readings are made in cm, and therefore the values around a temperature of 0 degrees can be a source of error.

Wind, cloud and sun measurements are all made automatically by modern instruments and do not have any major error sources.

2.4.1 SMHI Open API

The weather dataset was extracted from SMHI's database using the SMHI Open Data API [9]. The API is developed to easily retrieve forecast data, which is different from the weather history that is needed for this project. The data returned from an API request is returned as a JSON object. A request for forecast data from Norrköping looks like the following [9]:

https://opendata-download-metfcst.smhi.se/api/category/pmp3g/version/2/geotype/point/lon/16.158/lat/58.5812/data.json

Finding the weather history records is not as straightforward as getting forecast data. The API request looks a bit different:

https://opendata-download-metobs.smhi.se/api/version/latest/parameter/x.json

The last part, /parameter/x, represents what kind of information will be requested from parameter x. SMHI currently has 24 different parameters, which are summarized and explained in Table 1. When this parameter is specified, a JSON object containing all the stations that use this parameter is returned. How the API was used to extract history data is explained in Section 3.1.1.
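A minimal sketch of such a request in Python (our own illustration, assuming the requests package; the "station" key under which the station list appears is our reading of the API's JSON structure, not something stated in this thesis):

```python
import requests

BASE = "https://opendata-download-metobs.smhi.se/api/version/latest"

# Parameter 2 is the daily average air temperature (see Table 1).
response = requests.get(f"{BASE}/parameter/2.json")
response.raise_for_status()
payload = response.json()

# The response is expected to list the stations measuring this parameter.
stations = payload.get("station", [])
print(f"{len(stations)} stations provide daily average temperature")
```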


Table 1: The different SMHI parameters and their explanation

| Parameter | Explanation | Type | Measured | Unit |
| 1 | Air Temperature | Instantaneous value | 1/hour | Degree Celsius |
| 2 | Air Temperature | Average | 1/day | Degree Celsius |
| 3 | Wind direction | Average for 10 min | 1/hour | Degree True |
| 4 | Wind speed | Average for 10 min | 1/hour | Meter/second |
| 5 | Amount of Rainfall | Sum for a day | 1/day | Millimeter |
| 6 | Relative Air Humidity | Instantaneous value | 1/hour | Percent |
| 7 | Amount of Rainfall | Sum of latest hour | 1/hour | Millimeter |
| 8 | Snow Depth | Instantaneous value | 1/day | Meter |
| 9 | Air Pressure at Sea Level | Instantaneous value | 1/hour | Hectopascal |
| 10 | Amount of Sunshine | Sum | 1/hour | Second |
| 11 | Global Ir-radians | Average latest hour | 1/hour | Watt/m2 |
| 12 | Visibility | Instantaneous value | 1/hour | Meter |
| 13 | Present Weather | Instantaneous value | 1/hour, 8/day | Code |
| 14 | Amount of Rainfall | Duration of 15 minutes | 4/hour | Millimeter |
| 15 | Rain Intensity | Duration of 15 minutes | 4/hour | Millimeter/Second |
| 16 | Total Cloud Coverage | Instantaneous value | 1/hour | Percent |
| 17 | Amount of Rainfall | Instantaneous value | 2/day | Code |
| 18 | Amount of Rainfall | Instantaneous value | 1/day | Code |
| 19 | Air Temperature | Minimum | 1/day | Degree Celsius |
| 20 | Air Temperature | Maximum | 1/day | Degree Celsius |
| 21 | Gust | Maximum | 1/day | Meter/Second |
| 22 | Air Temperature | Sum Monthly | 1/month | Degree Celsius |
| 23 | Amount of Rainfall | Sum Monthly | 1/month | Millimeter |
| 24 | Long wave Ir-radians | Average latest hour | 1/hour | Watt/m2 |

2.5 Decision Trees

The usage of decision trees in machine learning is not new, but their rise in popularity in recent years is due to the discovery of their previously unknown generalization capabilities if multiple trees are used together [10, 11].

A rooted tree is a kind of graph where one node is denoted as the root node. It consists of nodes and edges, where a node can either be an internal/split node, represented by a circle, or a leaf node, represented by a square. An edge is a straight line connecting two nodes. Every node except the root node has one incoming edge. In Figure 1 below a binary tree structure can be seen.


Figure 1: A directed rooted tree organized with an orientation away from the root. The gray circle represents the root node, the circles below represent internal/split nodes and the squares represent leaf nodes.

In a decision tree, each internal node can be thought of as a test or a function, and the leaves store a predictor or estimator that yields the final result. The input data is passed from the root node through the tree, and depending on the input data, one split node in each layer will activate and dispatch it to the next level until a leaf is reached. An example of this can be seen in Figure 2.


Figure 2: A decision tree deciding whether an image is in a winter or summer setting. Each split node has a test function which is applied to the input data and guides it until a leaf node has been reached and an answer is decided.

A decision tree is dependent on the tests in each split node and on the predictors, or results, in the leaves. In a simple problem like the one seen above, intuition is good enough to construct the tests, but in a more complex problem these tests should be learned from training data.
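The traversal itself is simple. The sketch below is our own illustration (not the implementation used in this project), representing split nodes as test functions and leaves as stored predictions, mirroring the winter/summer example in Figure 2:

```python
# Minimal binary decision tree: internal nodes hold a test function and
# two children, leaves hold the prediction that is returned.
class Leaf:
    def __init__(self, prediction):
        self.prediction = prediction

class Split:
    def __init__(self, test, left, right):
        self.test = test          # function: data point -> True/False
        self.left, self.right = left, right

def predict(node, x):
    """Pass a data point from the root down to a leaf."""
    while isinstance(node, Split):
        node = node.right if node.test(x) else node.left
    return node.prediction

# Toy tree for the winter/summer example: x is a dict of features.
tree = Split(lambda x: x["temperature"] > 10,
             left=Leaf("winter"),
             right=Split(lambda x: x["minutes_sunshine"] > 300,
                         left=Leaf("winter"),
                         right=Leaf("summer")))

print(predict(tree, {"temperature": 22, "minutes_sunshine": 540}))  # summer
```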

2.5.1 Basic Definitions

The input data, or a data point, is a vector $\mathbf{v} = (x_1, x_2, \ldots, x_d) \in \mathbb{R}^d$, where $x_1, \ldots, x_d$ represent different attributes of the data, known as features. The dimensionality $d$ of the feature space can be large or even infinite.

In the literature the test mentioned in Section 2.5 is called a split function, test function or weak learner and is formulated for split node $j$ as:

\[ h(\mathbf{v}, \theta_j) : \mathbb{R}^d \times \mathcal{T} \rightarrow \{0, 1\} \quad (1) \]

where 0 and 1 can represent "true" or "false", $\theta_j \in \mathcal{T}$ are the parameters associated with the $j$:th node and $\mathcal{T}$ is the space of all the parameters of the node. The data is sent to one of the node's child nodes depending on the result, where 0 or "false" is interpreted as send to the left child, and 1 or "true" is interpreted as send to the right child.

The definition of a data point is a very general one, and to make this more specialized and applicable to decision trees, the training set or training point is defined. A training point is used to compute the tree parameters. A training set, $\mathcal{S}_0$, is a collection of training points.

In decision trees, subsets of training points belong to different tree branches. The subset of training points reaching node 1 is $\mathcal{S}_1$, and the subsets reaching the children of node 1 would be $\mathcal{S}_1^L$ and $\mathcal{S}_1^R$. The following properties apply for split node $j$ in binary trees [11]:

\[ \mathcal{S}_j = \mathcal{S}_j^L \cup \mathcal{S}_j^R, \quad \mathcal{S}_j^L \cap \mathcal{S}_j^R = \emptyset, \quad \mathcal{S}_j^L = \mathcal{S}_{2j+1}, \quad \mathcal{S}_j^R = \mathcal{S}_{2j+2} \quad (2) \]

To find the optimal parameters for a model, an objective function is defined:

\[ \mathrm{Obj}(\Theta) = L(\Theta) + \Omega(\Theta) \quad (3) \]

where $L$ represents the training loss function and $\Omega(\Theta)$ represents the regularization term. $L$ defines how good the prediction is, and $\Omega(\Theta)$ influences the complexity of the model (and aids in preventing overfitting).

2.5.2 Randomly Trained Decision Trees

The behavior of a decision tree can be split into two phases, the training phase (offline) and the testing phase (online).

The testing phase consists of traversing the decision tree with previously unseen data while applying the tests in the split nodes. The tests are trained in the training phase and remain fixed throughout the entire testing phase. The traversal ends when a leaf node containing a predictor or an estimator is reached, which maps an input to an output.

The training phase is the core of the functioning of the tree. This phase selects the type and parameters of the test function $h(\mathbf{v}, \theta_j)$ of each split node by optimizing an objective function defined on the training data set. Basically, in each split node we learn the function that best splits the training subset $\mathcal{S}_j$ into $\mathcal{S}_j^L$ and $\mathcal{S}_j^R$. The maximization of the objective function $I$ for the $j$:th split node is defined as:

\[ \theta_j = \arg\max_{\theta \in \mathcal{T}} I(\mathcal{S}_j, \theta) \quad (4) \]

Given $\theta$ and $\mathcal{S}_j$, the left and right sets are defined as:

\[ \mathcal{S}_j^L(\mathcal{S}_j, \theta) = \{ (\mathbf{v}, \cdot) \in \mathcal{S}_j \mid h(\mathbf{v}, \theta) = 0 \}, \quad \mathcal{S}_j^R(\mathcal{S}_j, \theta) = \{ (\mathbf{v}, \cdot) \in \mathcal{S}_j \mid h(\mathbf{v}, \theta) = 1 \} \quad (5) \]

To compute $I$ these three sets are used as input. The sets for the children are functions of the parent's set $\mathcal{S}_j$ and the splitting parameters $\theta$. Note that the objective function $I(\mathcal{S}_j, \theta)$ is of a general form, and what is defined as "best" is dependent on the problem.

The size and shape of the tree are of high importance during the training phase. The root node is initialized in the beginning of the training, where the best parameters are found as described above. Subsequently two child nodes are created, each obtaining a disjoint subset of the training set. This continues recursively until a stop criterion is met and the training phase is over.

There are a few common practices for deciding the stop criterion. The most intuitive one, and the one used here, is to stop growing at a maximum level X. By avoiding growing full trees the chance of specialization decreases [10].

After a training phase the following has been calculated:

1. The optimum split function of each split node

2. A tree structure

3. In each leaf a different set of training points

The parameters of a split function are $\theta = (\phi, \psi, \tau)$. The selector function $\phi = \phi(\mathbf{v})$ selects the chosen features of the vector $\mathbf{v}$, $\psi$ decides the geometric figure used as a discriminator and $\tau$ is used to capture the thresholds of the inequalities of the binary test.

To improve generalization and decrease the chance of overfitting, randomness is injected into the training phase. There are two popular ways of doing this. The first is bagging [12, 10], where the idea is to train each tree in a forest on different training data, or subsets, chosen randomly from a labeled database. This yields faster training and improved generalization, but the entire training dataset is not used for all trees.

In random node optimization [13, 14], small random subsets, $\mathcal{T}_j \subset \mathcal{T}$, of the parameter values are used for the $j$:th node. Under randomly trained decision trees the training is therefore to optimize each split node $j$ as:

\[ \theta_j = \arg\max_{\theta \in \mathcal{T}_j} I(\mathcal{S}_j, \theta) \quad (6) \]

A parameter $\rho = |\mathcal{T}_j|$ is used to control the degree of randomness in a tree. When $\rho$ is at its maximum, there is no randomness at all, since all information will be used for all split nodes; conversely, when $\rho$ is at its minimum, a split node only chooses one random set of parameter values for $\theta$. The parameters $\phi, \psi, \tau$ can be randomized individually, together or in any combination [11].
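A sketch of how one split node could be trained with random node optimization is shown below (our own illustration, not the project's implementation; the candidate tests are axis-aligned thresholds and the objective I is taken to be the reduction in variance, a common choice for regression):

```python
import numpy as np

def train_split_node(X, y, rho, rng=np.random.default_rng(0)):
    """Pick the best test from a random subset of rho candidate tests.

    Each candidate test is a (feature index, threshold) pair; the
    objective I is the decrease in variance of the targets y.
    """
    n_samples, n_features = X.shape
    best, best_gain = None, -np.inf
    for _ in range(rho):                        # rho = |T_j|, degree of randomness
        f = rng.integers(n_features)            # random feature ...
        t = rng.choice(X[:, f])                 # ... and random threshold
        go_right = X[:, f] > t
        if go_right.all() or (~go_right).all():
            continue                            # degenerate split, skip
        gain = np.var(y) - (np.var(y[go_right]) * go_right.mean()
                            + np.var(y[~go_right]) * (~go_right).mean())
        if gain > best_gain:
            best, best_gain = (f, t), gain
    return best, best_gain
```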

2.5.3 Random Forest Model

An ensemble of randomly trained decision trees makes up a random forest. The most important concept of a random forest is that each randomly trained decision tree is randomly different from the others. This is what leads to its great generalization abilities [10].


In Section 2.5.2 a randomness parameter was defined as $\rho = |\mathcal{T}_j|$. In contrast to randomly trained decision trees, this parameter in random forests not only controls the randomness of each tree, but also the correlation between the trees in the forest. If $\rho = |\mathcal{T}|$ is at its maximum, all the trees in the forest are identical, and vice versa if $\rho = 1$.

All trees are trained individually, and in the testing phase each test data point is passed through all the trees until it reaches their leaves. To combine all the different trees' predictions into a single one, an averaging operation is applied:

\[ p(c \mid \mathbf{v}) = \frac{1}{T} \sum_{t=1}^{T} p_t(c \mid \mathbf{v}) \quad (7) \]

where $T$ is the number of trees, $\mathbf{v}$ is a test data point and $p_t(c \mid \mathbf{v})$ is the posterior distribution of the $t$:th tree [11].

In summary, the parameters that matter in a random forest are:

• The choice of features used

• The size of the forest

• The maximum allowed depth of the forest

• The kind of weak learner/test function used in the split nodes

• The training objective function

This explanation of decision trees, randomly trained decision trees and the random forest model follows the book Decision Forests for Computer Vision and Medical Image Analysis [11].
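These parameters map directly onto the arguments of off-the-shelf implementations. The sketch below uses scikit-learn's RandomForestRegressor purely as an illustration; scikit-learn is not one of the libraries used in this project, and the toy data is made up:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy regression data: 200 points, 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] * 3.0 + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(
    n_estimators=100,   # size of the forest
    max_depth=5,        # maximum allowed depth of each tree
    max_features=2,     # features considered per split (injects randomness)
    random_state=0,
)
forest.fit(X, y)

# The forest prediction is the average of the individual trees' outputs.
print(forest.predict(X[:3]))
```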

2.6 Regression trees

Regression trees share many of the same properties as the decision trees described in Section 2.5.2, and as with any kind of regression algorithm they take continuous values as input and associate them with a continuous output value [15]. The main change that has to be made to the standard decision tree is to change the objective function to reflect the change from discrete values to continuous ones, as further described in Section 2.6.1.

2.6.1 Gradient Boosted Trees and XGBoost

XGBoost is a scalable machine learning framework that is recognized to work well for sales prediction [16]. It is based on Jerome H. Friedman's article [17], in which he developed a general gradient descent "boosting" paradigm for additive expansions based on any fitting criterion.

The model of XGBoost is tree ensembles. The tree ensemble model consists of classification and regression trees (CART). A difference between a decision tree and a CART is that in a CART, a real score is associated with each leaf, whereas in a decision tree the leaf contains a decision value. A single CART is not powerful enough to be used by itself, and for this reason multiple CARTs are used collectively. This model can be written as:

\[ \hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{Z} \quad (8) \]

where $K$ is the number of trees, $f_k$ is a function in the functional space $\mathcal{Z}$, and $\mathcal{Z}$ is the set of all possible CARTs.

Given an objective function as seen in Equation (9), the additive expansions can soon be defined. The objective function always has two parts: the loss function $l(y_i, \hat{y}_i^{(t)})$, where $y_i$ is the expected value and $\hat{y}_i^{(t)}$ the predicted value at timestep $t$, and the regularization term $\Omega(f_i)$, where $f_i$ is a function that contains the structure of the tree and the leaf scores [6].

\[ \mathrm{obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^{t} \Omega(f_i) \quad (9) \]

The additive tree expansion can be seen in Equation (10) [6]. This algorithm does not train all trees at the same time. Instead it starts out with one tree, adjusts what has been learned, and adds a new tree in the next step. The prediction value at step $t$ is denoted $\hat{y}_i^{(t)}$.

\[ \begin{aligned} \hat{y}_i^{(0)} &= 0 \\ \hat{y}_i^{(1)} &= f_1(x_i) = \hat{y}_i^{(0)} + f_1(x_i) \\ \hat{y}_i^{(2)} &= f_1(x_i) + f_2(x_i) = \hat{y}_i^{(1)} + f_2(x_i) \\ &\cdots \\ \hat{y}_i^{(t)} &= \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i) \end{aligned} \quad (10) \]

The additive tree expansion does not add a random tree; it adds the tree that optimizes the objective, see Equation (11).

\[ \begin{aligned} \mathrm{obj}^{(t)} &= \sum_{i=1}^{n} l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^{t} \Omega(f_i) \\ &= \sum_{i=1}^{n} l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t) + \mathrm{constant} \end{aligned} \quad (11) \]

If Mean Square Error (MSE) is used as the loss function, the equation changes to Equation (12).

\[ \begin{aligned} \mathrm{obj}^{(t)} &= \sum_{i=1}^{n} \left( y_i - (\hat{y}_i^{(t-1)} + f_t(x_i)) \right)^2 + \sum_{i=1}^{t} \Omega(f_i) \\ &= \sum_{i=1}^{n} \left[ 2(\hat{y}_i^{(t-1)} - y_i) f_t(x_i) + f_t(x_i)^2 \right] + \Omega(f_t) + \mathrm{constant} \end{aligned} \quad (12) \]

The form of this is friendly due to the nature of the mean square error. To make it more general, the Taylor expansion of the loss function up to the second order is used, see Equation (13).

\[ \mathrm{obj}^{(t)} = \sum_{i=1}^{n} \left[ l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) + \mathrm{constant} \quad (13) \]

\[ g_i = \partial_{\hat{y}_i^{(t-1)}} \, l(y_i, \hat{y}_i^{(t-1)}), \qquad h_i = \partial^2_{\hat{y}_i^{(t-1)}} \, l(y_i, \hat{y}_i^{(t-1)}) \]

After removing the constants, a new definition of the specific objective at step $t$ that only relies on $g_i$ and $h_i$ is formed, Equation (14). With this definition, every loss function can be optimized by utilizing the same solver that takes $g_i$ and $h_i$ as inputs.

\[ \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) \quad (14) \]

To decide the complexity of the tree, the regularization term $\Omega(f)$ has to be determined. By redefining the definition of a tree $f(x)$ as in Equation (15) it is possible to determine the complexity $\Omega(f)$, see Equation (16).

\[ f_t(x) = w_{q(x)}, \quad w \in \mathbb{R}^T, \quad q : \mathbb{R}^d \rightarrow \{1, 2, \ldots, T\} \quad (15) \]

$T$ is the number of leaves, $w$ is a vector of scores on the leaves and $q$ is a function assigning each data point to a leaf.

\[ \Omega(f) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 \quad (16) \]

By combining Equation (14) and Equation (16) the objective value with the $t$:th tree can be written as seen in Equation (17):

\[ \begin{aligned} \mathrm{obj}^{(t)} &\approx \sum_{i=1}^{n} \left[ g_i w_{q(x_i)} + \tfrac{1}{2} h_i w_{q(x_i)}^2 \right] + \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 \\ &= \sum_{j=1}^{T} \left[ \Big( \sum_{i \in I_j} g_i \Big) w_j + \tfrac{1}{2} \Big( \sum_{i \in I_j} h_i + \lambda \Big) w_j^2 \right] + \gamma T \end{aligned} \quad (17) \]


$I_j = \{ i \mid q(x_i) = j \}$, that is, the set of indices of data points assigned to the $j$:th leaf. All the data points on the same leaf get the same score, which is why the index was changed on the second line. To compress Equation (17) further, $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$ can be defined. This yields Equation (18):

\[ \mathrm{obj}^{(t)} = \sum_{j=1}^{T} \left[ G_j w_j + \tfrac{1}{2} (H_j + \lambda) w_j^2 \right] + \gamma T \quad (18) \]

In this equation, the $w_j$ are independent of each other. The best vector of scores $w_j$ for a given tree structure $q(x)$ can be seen in Equation (19):

\[ w_j^* = -\frac{G_j}{H_j + \lambda}, \qquad \mathrm{obj}^* = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T \quad (19) \]

where $\mathrm{obj}^*$ measures how good a tree structure $q(x)$ is.

In practice it is not possible to enumerate all possible trees and pick the best one; instead one level of a tree is optimized at a time. That is the additive expansion mentioned earlier. Each optimization step attempts to split a leaf into two leaves. To calculate this split, Equation (20) is used:

\[ \mathrm{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma \quad (20) \]

where the first term is the score of the new left leaf, the second term is the score of the new right leaf, the third term is the score of the original leaf and the fourth term is the regularization on the new leaf. If the gain is smaller than $\gamma$, the branch is not added.

The explanation of how Extreme Gradient Boosted Trees work follows the official documentation on their website and the original paper [6].
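In practice this derivation is hidden behind the library interface. The sketch below is our own illustration (not the final model of this project, whose setup is described in Section 3.3); it shows how $\gamma$ and $\lambda$ from Equations (16)-(20) surface as the gamma and reg_lambda arguments of the Python API, using made-up data:

```python
import numpy as np
import xgboost as xgb

# Toy regression data standing in for the restaurant datasets.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = 50_000 + 10_000 * X[:, 0] + rng.normal(scale=2_000, size=500)

model = xgb.XGBRegressor(
    n_estimators=200,      # number of additive trees (K)
    max_depth=4,           # stop criterion: maximum tree depth
    learning_rate=0.1,     # shrinkage applied to each f_t
    gamma=1.0,             # minimum gain required to add a branch (gamma above)
    reg_lambda=1.0,        # L2 penalty on leaf weights (lambda above)
    objective="reg:squarederror",
)
model.fit(X, y)
print(model.predict(X[:3]))
```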

2.7 Feed Forward Neural Network

A Feed Forward Neural Network (FFNN) is a kind of neural network, a biologically inspired programming paradigm where the computer learns from observation data; the feed forward phase can be explained from Figure 3. The FFNN computes an output from the observation/input data that is fed to the network. This data goes through different layers, from input to output, and is processed in each layer before it is passed to the next one. Each layer has a number of neurons. These neurons consist of weights, biases and activation functions. These properties are a combination that describes the neuron and when it should be activated for a set of input data. Depending on the activation function of the neurons, the FFNN can work with data that is not linearly separable [2].

Figure 3: A feed forward neural network with one hidden layer [18]

An FFNN computes its output values and compares them to the target data. It estimates how close the computed output is to the target data by an error function, usually the mean squared error, and uses this with a learning algorithm to tweak the weights of the neurons' inputs and thereby indirectly optimize the network.

There are different learning algorithms to use, where back-propagation is the most common one. Back-propagation propagates the error from the output layer all the way back to the input layer, to change the weights of the neurons to minimize the MSE, which in turn leads to a better final result [2].

A neural network with multiple layers is often referred to as a deep learning neural network. Deep learning, with all its layers, is often used on data with high levels of abstraction. It has had great success in, for example, image and speech recognition, while another variant of neural network, called the recurrent neural network, has proved to work better on sequential data [19].
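A minimal Keras sketch of such a network follows (our own illustration; the layer sizes are arbitrary and the project's actual network is described in Section 3.4):

```python
from keras.models import Sequential
from keras.layers import Dense

# Feed forward network: one hidden layer, trained with back-propagation
# to minimize the mean squared error of its output.
model = Sequential([
    Dense(16, activation="relu", input_dim=10),  # hidden layer, 10 input features
    Dense(1),                                    # single regression output
])
model.compile(optimizer="adam", loss="mean_squared_error")

# model.fit(X_train, y_train, epochs=50, batch_size=32)  # hypothetical arrays
```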


2.7.1 Recurrent Neural Network

A recurrent neural network (RNN) is a special version of the standard neural network which includes feedback connections, enabling the network to use not only current inputs but also inputs perceived previously in time [2], see Figure 4.

Figure 4: A simple recurrent neural network contains loops. The input signal X_t goes through the neural net A, which outputs H_t.

To understand why this is the preferred architecture for time-series and sequential problems, see Figure 5. When the RNN is unfolded it can be viewed as copies of the same neural network, which dispatch a message sequentially starting from the first network, and it can therefore learn from previously seen inputs.

Figure 5: An RNN unfolded. It can be viewed as multiple copies of the same network that send a message sequentially through the network.

A drawback of RNNs is that they lack the ability to handle long-term contextual dependencies. To make this more concrete, imagine a language model predictor which tries to predict the last word in the sentence "A circle is round". It is obvious that the next word is going to be "round", and not much contextual information is needed. Imagine the same model trying to predict the last word in "I did exchange studies in Spain, and now I speak fluent Spanish". Predicting this last word is much harder because more contextual information is needed. Recent information hints that the last word is a language, but to decide on which language, more information from the context of Spain from further back is needed. If the gap between recent information and the point where it is needed is too large, the problem will be too complex for the RNN [20, 21]. The solution to this is an extension of the RNN, called the Long Short Term Memory Network (LSTM).

2.7.2 Long Short Term Memory Networks

The Long Short Term Memory Network (LSTM) is an extension of the RNN proposed by Hochreiter and Schmidhuber in 1997 [22]. It was introduced to address the vanishing gradient problem [23] and the long-term dependencies described above.

The vanishing gradient problem emerges when neural networks are trained with gradient-based learning methods and back-propagation. To update the network, the weights receive a small change in proportion to the gradient of the error signal, as described in Section 2.7. However, activation functions such as softsign or tanh have a small range, (-1, 1), and the gradients are computed by back-propagation through the chain rule. The effect of this is that the error signal shrinks as it is propagated backwards, so the weights of the early layers receive only very small updates and those layers are trained very slowly [23].

The report Long Short-Term Memory by S. Hochreiter and J. Schmidhuber [22] explains a major LSTM feature as "it enforces constant, non-exploding, non-vanishing error flow". LSTMs introduce a new element called a memory cell. At its core, it contains a recurrently self-connected linear unit called the Constant Error Carousel (CEC). The CEC solves the vanishing gradient problem explained above. If there are no new input or error signals to the cell, it keeps its CEC's local error constant. The cell has two gates, the input gate and the output gate, which protect the cell from forward flowing activation and backward flowing error. When the gates are closed, they do not let in irrelevant input and noise, and therefore the cell does not disturb the rest of the network [24].

This model has a risk of growing indefinitely in terms of its state, causing the network to break down. To solve this problem a new gate, called the forget gate, was introduced [24]. This gate gives the cell the opportunity to reset the value of the state and release internal resources. Another addition to the memory cell was the "peephole connection" [25], which allows the gates to take a look at the current state of the memory cell. This aids in the activation of the gates. These additions to the model give us a memory cell as in Figure 6.


Figure 6: Long Short-Term Memory block as used in the hidden layers of a recurrent neural network [26]
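As a rough sketch of how such memory cells are used in practice (our own illustration using Keras; the project's actual LSTM setup is given in Section 3.4, and the data here is random), the input is a sequence of the last few days' feature vectors rather than a single day:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

timesteps, n_features = 7, 12   # e.g. one week of daily feature vectors

model = Sequential([
    LSTM(32, input_shape=(timesteps, n_features)),  # layer of LSTM memory cells
    Dense(1),                                       # next day's sales
])
model.compile(optimizer="adam", loss="mean_squared_error")

# Hypothetical training tensors: (samples, timesteps, features) and (samples,).
X = np.random.rand(100, timesteps, n_features)
y = np.random.rand(100)
model.fit(X, y, epochs=2, verbose=0)
```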

2.8 Related work

During the research phase of this project, similar projects were found. In an article in the Journal of Foodservice Business Research [27] the authors investigated the effect of weather on restaurant sales. They tested the effect of different weather features on specific restaurant items. The results were somewhat varied: some weather features had different effects on different menu items, and the sales of some items were more affected by weather while others were not. This provides some further evidence for what is suspected by Caspeco's analysts, that restaurant sales change with the weather.

In the article Evaluation of computational intelligence techniques for daily product sales forecasting [28] the authors investigate the following questions: what input set is most informative for daily sales time series forecasting; do weather input features improve forecasting performance; and what computational intelligence model is most appropriate for daily sales forecasting. The dataset consisted of weather data and 89 real-life product sales time series from several stores. They concluded that the most important features came from the time series itself, that the weather features did not improve the result, and that the best forecasting method was a support vector regression model. Even though the context of this forecasting problem is different from the problem of this project, it provides some valuable information regarding what data might be important.

Kaggle [29], the learning and data science competition website, has posted the competition Walmart Recruiting II: Sales in Stormy Weather, where private persons can enter and create their own solution [30].

The competition's scope is that Walmart, an American grocery store, wants to know the sales patterns for 111 weather-sensitive products, such as umbrellas, milk and bread, over a 3-day period around stormy weather days. The competitors are provided with sales data for the 111 products at 45 different Walmart stores. These stores are covered by 20 weather stations that contain data on temperature, rain, snowfall, wind, labeled weather type and more.

Submissions from the competitors were evaluated against a root mean squared logarithmic error. The winner had an error of 9.3% and found that the most important feature was the weekday, and for some stores the month periodicity. Weather features had almost no impact; people went shopping regardless of whether it rained or not. However, he reflects on whether the weather data was valid, whether the pairing of store and station was correct, or whether the data came from a station further away.

In the book Data Mining and Big Data: Second International Conference [31] a case study on food sales prediction with meteorological data was made in Japan. Some interesting notes on the impact of weather were taken: not only the weather on the specific day was taken into account, but also that of the past few days. A large change in weather can have a big impact on the sales; for example, when the temperature jumps sharply from one day to the next, the sales of sodas and water would increase. Instead of treating the prediction of sales as a regression problem, it was made into a classification problem with $\mathrm{Label}_n^s$, where $s$ is the store and $n$ the day, based on $\mathrm{Rate}_n^s = \mathrm{Sale}_n^s / \mathrm{Sale}_{(n-7)}^s$, where the denominator is the sale for the same store a week earlier, to take the weather into account. The problem was solved with an LSTM to manage sequences of data and a stacked denoising auto-encoder network to reduce the dimensions of the features learned from the LSTM.

3 Methodology and Implementation

For this project a system that can create a sales forecast will be developed. It will be trained using supervised learning, as the actual sales and the features for those sales can be used as target data and input data [32]. This will be tweaked by utilizing even more data to try to extract more features, as mentioned in Section 2.

3.1 Data

The main part of this project was to understand the data available. As described in the book Feature Selection for Data and Pattern Recognition [33]: "..and in cases when this information is incomplete or uncertain the resulting predictive accuracies of constructed systems, whether they induce knowledge from available data in supervised or unsupervised manner, relying on statistics-oriented calculations or heuristic algorithms, could be unsatisfactory or falsified, making observations and conclusions unreliable.". Even if a good and well-tested algorithm has been found, it is no guarantee that it will work if the data is not reliable.

Three different datasets from restaurants in three different cities were used for this project. The restaurants were all located in cities with a similar number of inhabitants in south-central Sweden. The characteristics of the datasets are summarized in Table 2. The sales data from these restaurants, combined with the weather data from SMHI for these cities, make up the three datasets used for our models.


Table 2: A summary of the characteristics of the different datasets

| Name | Outdoor seating area | Expected seasonality | Size of dataset | Approximate average daily sales (SEK) |
| Dataset 1 | Yes | Yes | 1268 rows | 114 000 |
| Dataset 2 | Yes | Yes | 1494 rows | 68 000 |
| Dataset 3 | Yes | No | 959 rows | 144 00 |

3.1.1 Data extraction

SMHI has different parameters that can be used depending on what data you want to extract, as mentioned in Section 2.4.1. Not all parameters seemed relevant for this project, and therefore an intuitive importance ranking of the parameters was done. Many of the parameters were similar to each other, differing only in the time interval of the measurement. Others did not feel relevant to the problem, for example features such as ir-radians and visibility, which should not have an effect on why a customer visits a restaurant. The ranking can be seen in the rightmost column in Table 3, where a rank of 1 signifies a high importance and a rank of 0 signifies a low importance. Parameters with a rank of 0 were not considered for this project, but parameters with a rank of 1 were used.


Table 3: The different parameters of SMHI's API, their explanation and an importance factor

| Explanation | Parameter | Type | Measured | Importance |
| Air Temperature | 1 | Instantaneous value | 1/hour | 0 |
| Air Temperature | 2 | Average | 1/day | 1 |
| Air Temperature | 19 | Minimum | 1/day | 0 |
| Air Temperature | 20 | Maximum | 1/day | 0 |
| Air Temperature | 22 | Sum Monthly | 1/month | 0 |
| Wind direction | 3 | Average for 10 min | 1/hour | 0 |
| Wind speed | 4 | Average for 10 min | 1/hour | 1 |
| Gust | 21 | Maximum | 1/day | 0 |
| Amount of Rainfall | 5 | Sum for a day | 1/day | 1 |
| Amount of Rainfall | 7 | Sum of latest hour | 1/hour | 1 |
| Amount of Rainfall | 17 | Instantaneous value | 2/day | 0 |
| Amount of Rainfall | 18 | Instantaneous value | 1/day | 1 |
| Amount of Rainfall | 23 | Sum Monthly | 1/month | 0 |
| Amount of Rainfall | 14 | Duration of 15 minutes | 4/hour | 0 |
| Rain Intensity | 15 | Duration of 15 minutes | 4/hour | 0 |
| Relative Air Humidity | 6 | Instantaneous value | 1/hour | 0 |
| Air Pressure at Sea Level | 9 | Instantaneous value | 1/hour | 0 |
| Present Weather | 13 | Instantaneous value | 1/hour, 8/day | 0 |
| Amount of Sunshine | 10 | Sum | 1/hour | 1 |
| Visibility | 12 | Instantaneous value | 1/hour | 0 |
| Total Cloud Coverage | 16 | Instantaneous value | 1/hour | 0 |
| Snow Depth | 8 | Instantaneous value | 1/day | 1 |
| Global Ir-radians | 11 | Average latest hour | 1/hour | 0 |
| Long wave Ir-radians | 24 | Average latest hour | 1/hour | 0 |

A general approach of how to extract the data from SMHI's API can be seen in Figure 7.


Figure 7: A flow graph of the algorithm extracting data using SMHI’s API

SMHI's database is suboptimal when it comes to handling weather history records. The main problems are summarized as:

• Not all stations have all the parameters

• Not all stations save history records

• Stations which save history records are not guaranteed to have the latest history records

The general approach seen in Figure 7 could therefore not be used, and it was extended to handle these flaws, as can be seen in Figure 8. To decide which station to extract data from, in SelectStation, the Euclidean distance between the restaurant and the closest weather station is calculated. If the closest weather station does not have the latest history records, the next closest weather station is used.


Figure 8: A flow graph of the specialized algorithm extracting weather history records using SMHI's API
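A sketch of the SelectStation step as we read it (our own illustration; the station names and coordinates are hypothetical, and plain Euclidean distance over latitude/longitude is used, as the text describes, rather than a great-circle distance):

```python
import numpy as np

def select_station(restaurant, stations):
    """Return stations sorted by Euclidean distance to the restaurant.

    `restaurant` is a (lat, lon) pair and `stations` a list of dicts; the
    caller can then walk the list until a station with complete history
    records is found.
    """
    r = np.array(restaurant)
    dist = lambda s: np.linalg.norm(np.array([s["lat"], s["lon"]]) - r)
    return sorted(stations, key=dist)

# Hypothetical station list for a restaurant in Uppsala.
stations = [
    {"name": "Station A", "lat": 59.85, "lon": 17.63},
    {"name": "Station B", "lat": 60.23, "lon": 17.90},
]
print(select_station((59.86, 17.64), stations)[0]["name"])  # Station A
```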


The sales patterns from 50 years ago are not the same as today; it was therefore important to use as recent data as possible from both Caspeco and SMHI, or the result would most likely be inadequate.

3.1.2 Feature Creation

The features used for the training phase are important to the final result, but choosing good features is not trivial. Feature engineering consists of both creating features and selecting the good ones. The first step, creating features, requires experience, expert knowledge and creativity. Expert knowledge can be gained from examining the data and the problem. In this case, expert knowledge means understanding the core of what can affect the sales of a restaurant, while other helpful features can be created to boost the learning system.

The features created for this project can be seen in Table 4, and the reasoning behind choosing these features will be expanded on in this section.

Table 4: Features and their descriptions used for the training phase

| Feature | Range | Description |
| SalesExVAT | [0, ∞] | A restaurant's sales excluding VAT per day |
| Avg sale of month | [0, ∞] | The average sale of the month |
| Avg sales last 7 days | [0, ∞] | The average sale of the last 7 days, current day not included |
| Avg sale of day in month | [0, ∞] | The average sale of the day in month |
| Avg sale of weekday | [0, ∞] | The average sale of the weekday |
| Sale last year on day | [0, ∞] | The sale of last year on the equivalent day |
| Date | [start, end] | yyyy-mm-dd, only used for indexing, not in training |
| Day in dataset | [1, n] | A 1:n value |
| Weekday nr | [1, 7] | Numerical representation of days |
| Day in month | [1, 31] | Numerical representation of day in month |
| Month nr | [1, 12] | Numerical representation of months |
| Day in year | [1, 365] | Numerical representation of a day in a year |
| Week nr | [1, 52] | Numerical representation of weeks |
| Year | [start, end] | The year |
| Is holiday | [0, 1] | 1 if holiday, 0 otherwise |
| Is evening | [0, 1] | 1 if day before holiday, 0 otherwise |
| Temperature | [−∞, ∞] | Average temperature per day |
| Avg temperature last 7 days | [−∞, ∞] | Average temperature of the last 7 days, current day not included |
| Rainfall | [0, ∞] | Average rainfall per day |
| Minutes sunshine | [0, 1440] | Average minutes of sunshine per day |
| Wind speed | [0, ∞] | Average wind speed per day |
| Cloud cover | [0, 100] | Average cloud cover per day |
| Snow Depth | [0, ∞] | Average snow depth per day |

SalesExVAT is the sales for one day. The features Avg sale of month, Avg sales last 7 days, Avg sale of day in month and Avg sale of weekday are experimental and might help detect seasonality and shorter trends. Sale last year on day is the feature on which Caspeco is currently basing their uplift method, and it was therefore included in our models as well.

A date entry was split into separate features: Weekday nr, Day in month and Day in year, to be able to investigate if there were any daily, monthly or annual behaviors in the data; see Section 5.1 for a discussion. A purely experimental feature, Day in dataset, was created since it was mentioned in the documentation of similar projects as a feature which might help. The original date entry was kept because it was needed in the LSTM implementation described in Section 3.4.

The two features Is holiday and Is evening (the day before a holiday) were two experimental features whose purpose was to help the models understand that sales might rise on these special days.

There are several different weather features present in the dataset, of which Temperature and Avg temperature last 7 days are considered the most important. As mentioned in Section 1, it is common knowledge that sales increase when the temperature increases. Other features believed to be of importance were Cloud cover and Rainfall. The overall importance of the features will be discussed in Section 5.2.
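A minimal Pandas sketch of how a few of the features in Table 4 could be derived from the raw date and sales columns (our own illustration; the file name and column names are assumptions, and this is not the implementation used in the project):

```python
import pandas as pd

df = pd.read_csv("dataset1.csv", parse_dates=["Date"])  # hypothetical file
df = df.sort_values("Date")

# Calendar features derived from the date entry.
df["Weekday_nr"] = df["Date"].dt.dayofweek + 1    # 1 = Monday ... 7 = Sunday
df["Day_in_month"] = df["Date"].dt.day
df["Day_in_year"] = df["Date"].dt.dayofyear
df["Week_nr"] = df["Date"].dt.isocalendar().week
df["Year"] = df["Date"].dt.year

# Rolling average of the previous 7 days, current day excluded.
df["Avg_sales_last_7_days"] = df["SalesExVAT"].shift(1).rolling(window=7).mean()

# Average sale per weekday, assigned back to every row of that weekday.
df["Avg_sale_of_weekday"] = df.groupby("Weekday_nr")["SalesExVAT"].transform("mean")
```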

3.1.3 Data pre-processing

After extracting the data from Caspeco and SMHI, the data was further pre-processed. The first step taken was to check the distribution of the weekdays in the datasets. This can be seen in Table 5 for Dataset 1, Table 6 for Dataset 2 and Table 7 for Dataset 3. Restaurants are often closed on Sundays and/or Mondays, or there are sales with the wrong sales date, which might explain this distribution. This uneven distribution of weekdays in the dataset can affect the final models, and therefore all Sundays from Dataset 1, all Mondays from Dataset 2 and all Mondays and Sundays from Dataset 3 were excluded from the final datasets.

Table 5: The number of occurrences of each weekday in dataset 1

Weekday Number

Monday 205

Tuesday 213

Wednesday 211

Thursday 216

Friday 203

Saturday 215

Sunday 5


Table 6: The number of occurrences of each weekday in dataset 2

Weekday Number

Monday 95

Tuesday 231

Wednesday 237

Thursday 237

Friday 232

Saturday 232

Sunday 230

Table 7: The different number of weekdays in dataset 3

Weekday Number

Monday 33

Tuesday 172

Wednesday 189

Thursday 187

Friday 185

Saturday 190

Sunday 3
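A minimal sketch of how this weekday exclusion could be done with pandas, assuming the datasets are loaded into DataFrames with a Weekday nr column (file and column names are illustrative, not the exact ones used in the project):

import pandas as pd

# Hypothetical file and column names; 'weekday_nr' uses 1 = Monday, ..., 7 = Sunday.
df1 = pd.read_csv('dataset1.csv')
df2 = pd.read_csv('dataset2.csv')
df3 = pd.read_csv('dataset3.csv')

df1 = df1[df1['weekday_nr'] != 7]            # drop Sundays from Dataset 1
df2 = df2[df2['weekday_nr'] != 1]            # drop Mondays from Dataset 2
df3 = df3[~df3['weekday_nr'].isin([1, 7])]   # drop Mondays and Sundays from Dataset 3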

The next step was to look at the SalesExVAT feature to see if there were any unusually large or small transactions. In Figure 9 a box plot for Dataset 2 can be seen. The bottom of the box represents the first quartile and the top the third quartile. The line in the middle of the box represents the median, and outliers are represented as individual points. Several outliers can be seen: the rightmost sales are clear outliers, while the ones closer to the maximum should be examined more closely.


Figure 9: A box plot of the sales of restaurant 2. There are several outliers, with at least four sales being unusually large.

The density plot in Figure 10 shows the distribution of the sales. A large portion of the sales lies in approximately the 20,000-200,000 range, and therefore the sales outside of this range should be examined.


Figure 10: A density plot of the sales of restaurant 2. The largest portion of the sales lies within the 20,000-200,000 range.

The next step was to check how many of the sales lie outside this range. There were approximately 1400 sales in Dataset 2; 43 of these were below 20,000 and 25 exceeded 200,000. That is, roughly 3% of the data is below the lower limit and approximately 2% is above 200,000. There is no hard, known limit for how much data can be discarded, but about 5% seemed appropriate. The same procedure of identifying limits and discarding outliers was applied to the other two datasets. This was a crude way of filtering for outliers, but it was very effective.
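The crude range-based filtering described above could look roughly like the sketch below, assuming the daily sales are stored in a SalesExVAT column and that the limits have already been chosen from the box and density plots (the limits are the ones mentioned for Dataset 2):

# Keep only rows within the chosen sales range for Dataset 2.
lower, upper = 20_000, 200_000
mask = df2['SalesExVAT'].between(lower, upper)

# Sanity check: the share of discarded rows should stay around the 5% guideline.
discarded_share = 1 - mask.mean()
print(f'Discarding {discarded_share:.1%} of the rows')

df2 = df2[mask]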

Defining outliers is very subjective and therefore a discussion of this can be read in Section 5.6.

3.1.4 Correlation of features

The correlation of features is important in the task of selecting good features for the learning model. As stated in the article Models of Incremental Concept Formation [34]: "Features are relevant if their values vary systematically with category membership". This means that a feature is useful if it is correlated with the target feature, otherwise it is not.


As a measurement of correlation between features the Pearson correlation coefficient [35] was used. The range of the Pearson correlation coefficient is [-1,1], where -1 or +1 means a perfect linear relationship. In Table 8, Table 10 and Table 12 the correlation between the target feature and the rest of the features is illustrated.

A good subset of features consists of features relevant to the target but also features independent of each other. Gennari et al. [34, 36] state that features that are highly correlated with each other are redundant information and should be eliminated from the dataset, as they are essentially describing the same thing. This is illustrated for Dataset 1 in Table 9, Dataset 2 in Table 11 and Dataset 3 in Table 13.

There is no exact correlation coefficient value that determines whether a feature should be included or not, but for this project, in feature pairs with a correlation of 1 one of the features was discarded, and in feature pairs with a correlation above approximately 0.8 either one or neither of the features was dropped.

By comparing Table 9 to Table 8 it is possible to decide which feature to drop from each highly correlated feature pair, by selecting and discarding the feature with the smallest correlation to the target feature. Ultimately the features Year, Month nr, Week nr and Weekday nr were discarded from Dataset 1. Even though Temperature and Avg temperature last 7 days were highly correlated they were kept, since they were mandatory to the project.

In the same manner as Dataset 1, Dataset 2 and Dataset 3 were examined. From Dataset 2 the features Year, Month nr and Week nr were discarded, and from Dataset 3 Year, Month nr, Week nr and Weekday nr were discarded. The weather feature pairs with high correlation with each other were kept.
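A sketch of how the two kinds of correlation tables could be produced with pandas, which uses the Pearson coefficient by default; the 0.8 threshold matches the rule of thumb above, and it is assumed that the DataFrame contains only numeric feature columns plus the target SalesExVAT:

import numpy as np

corr = df1.corr()                                   # Pearson by default
target_corr = corr['SalesExVAT'].drop('SalesExVAT')
print(target_corr.sort_values(ascending=False))     # cf. Tables 8, 10 and 12

# Highly correlated non-target feature pairs (cf. Tables 9, 11 and 13).
features = corr.drop(index='SalesExVAT', columns='SalesExVAT')
upper = features.where(np.triu(np.ones(features.shape, dtype=bool), k=1))
pairs = upper.stack()
print(pairs[pairs.abs() > 0.8].sort_values(ascending=False))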


Table 8: The correlation between the target feature SalesExVAT and the remaining features in Dataset 1. A value close to -1 or +1 means high correlation, and 0 no correlation.

Feature Correlation

Avg sale of weekday 0.93

Weekday nr 0.87

Sale last year on day 0.37

Avg sale of month 0.13

Avg sales last 7 days 0.12

Temperature 0.10

Avg sale of day in month 0.09

Avg temperature last 7 days 0.09

Day nr 0.08

Month nr 0.08

Week nr 0.08

Snow depth -0.07

Minutes sunshine 0.06

Cloud cover 0.04

Day in dataset 0.04

Avg sale of year 0.03

Is evening 0.03

Year 0.01

Day in month 0.01


Table 9: The correlation between non-target feature pairs of Dataset 1. A value close to -1 or 1 means high correlation.

Feature pair Correlation

Avg sale of year and Year -1.00

Day nr and Month nr 1.00

Day nr and Week nr 0.98

Month nr and Week nr 0.98

Day in dataset and Year -0.97

Avg sale of weekday and Weekday nr 0.94

Temperature and Avg temperature last 7 days 0.90

Avg sale of year and Day in dataset 0.83


Table 10: The correlation between the target feature SalesExVAT and the remaining features in Dataset 2. A value close to -1 or +1 means high correlation, and 0 no correlation.

Feature Correlation

Avg sale of month 0.55

Avg sales last 7 days 0.54

Avg sale of weekday 0.52

Temperature 0.47

Avg temperature last 7 days 0.44

Minutes sunshine 0.37

Sale last year on day 0.36

Cloud cover -0.26

Month nr 0.19

Day nr 0.19

Snow depth -0.19

Week nr 0.18

Weekday nr 0.17

Day in dataset -0.16

Year 0.11

Avg sale of day in month 0.09

Avg sale of year -0.06

Rainfall mm -0.05

Wind speed -0.03

Day in month -0.03

Is holiday -0.01

Is evening -0.00


Table 11: The correlation between non-target feature pairs of Dataset 2. A value close to -1 or 1 means high correlation.

Feature pair Correlation

Avg sale of year and Year -1.00

Day nr and Month nr 1.00

Day in dataset and Year -0.98

Day nr and Week nr 0.98

Month nr and Week nr 0.98

Temperature and Avg temperature last 7 days 0.92

Avg sale of year and Day in dataset 0.85

Avg sale of month and Avg sales last 7 days 0.84


Table 12: The correlation between the target feature SalesExVAT and the remaining features in Dataset 3. A value close to -1 or +1 means high correlation, and 0 no correlation.

Feature Correlation

Avg sale of weekday 0.91

Weekday nr 0.82

Sale last year on day 0.43

Avg sales last 7 days -0.16

Avg sale of day in month 0.13

Avg sale of month 0.11

Snow depth -0.09

Avg sale of year -0.09

Day nr 0.08

Month nr 0.07

Week nr 0.07

Cloud cover -0.06

Day in month 0.04

Year -0.04

Wind speed 0.03

Avg temperature last 7 days -0.03

Day in dataset 0.02

Is evening 0.02

Minutes sunshine -0.02

Is holiday -0.01

Rainfall mm -0.01

Temperature -0.01


Table 13: The correlation between non-target feature pairs of Dataset 3. A value close to -1 or 1 means high correlation.

Feature pair Correlation

Avg sale of year and Year 1.00

Day nr and Month nr 1.00

Day nr and Week nr 0.99

Month nr and Week nr 0.98

Day in dataset and Year -0.97

Avg sale of weekday and Weekday nr 0.90

Temperature and Avg temperature last 7 days 0.89

Avg sale of year and Day in dataset -0.85

Cloud cover and Minutes sunshine -0.82

A more thorough discussion on the chosen features and their importance for the project can be read in Section 5.

3.1.5 Final Features

Many experimental features were created beforehand to help catch certain patterns. Some of these were discovered to be redundant, see Section 3.1.4. The final features used for training on the three original datasets can be seen in Table 14.


Table 14: Final features remaining after feature analysis and used for training

Dataset 1 Dataset 2 Dataset 3

Avg sale of year Avg sale of year Avg sale of year

Avg sale of month Avg sale of month Avg sale of month

Avg sales last 7 days Avg sales last 7 days Avg sales last 7 days

Avg sale of day in month Avg sale of day in month Avg sale of day in month

Sale last year on day Sale last year on day Sale last year on day

Day in dataset Day in dataset Day in dataset

Day in month Day in month Day in month

Day nr Day nr Day nr

Temperature Temperature Temperature

Avg temperature last 7 days Avg temperature last 7 days Avg temperature last 7 days

Rainfall mm Rainfall mm Rainfall mm

Snow depth Snow depth Snow depth

Cloud cover Cloud cover Cloud cover

Minutes sunshine Minutes sunshine Minutes sunshine

Wind speed Wind speed Wind speed

Weekday nr Weekday nr

Avg sale of weekday Avg sale of weekday

3.1.6 Standardization and Normalization

The data extracted and selected was in various ranges and, depending on the learning algorithm used, the data needed to be normalized. Neural networks are dependent on normalized data but decision trees, for instance, are not.

For a neural network, the data needs to be within a certain range if the neurons' activation function is to work correctly. Different activation functions have different ranges and domains where rescaling of the data will improve performance. The sigmoid function has a limited active domain (see Figure 11), a range in which input values have a larger impact on the output. Values near the activation function's asymptotic ends have a small influence on the weights' changes and thus a small influence on the output [2].


Figure 11: The sigmoid activation function

The scaling of the data needs to be done within the range of the activation function, for example (0, 1) for the sigmoid function and (−1, 1) for the hyperbolic tangent. There are three frequently used scaling methods: amplitude scaling (often called normalization), see Equation (21), mean centering and variance scaling. The last two can be combined into the Z-score normalization (or standardization), see Equation (22) [2].

$$X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}} \qquad (21)$$

$$z = \frac{x - \mu}{\sigma} \qquad (22)$$

Activation functions are part of a neural network, but for decision trees the split function is not scale-dependent [37]. There is no harm in scaling the data for decision trees, and it is recommended to ease the comparison between tree models and neural networks.
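As a minimal sketch, the two scaling methods in Equations (21) and (22) correspond to scikit-learn's MinMaxScaler and StandardScaler; the variable names below are illustrative:

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Amplitude scaling / normalization, Equation (21): rescales each feature to [0, 1].
X_norm = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_train)

# Z-score normalization / standardization, Equation (22): zero mean, unit variance.
X_std = StandardScaler().fit_transform(X_train)

# In practice the scaler is fit on the training set only and then reused on the test set.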

3.2 Root Mean Square Error

Both machine learning models need a loss function to minimize. Mean absolute error (MAE) and root mean square error (RMSE) are two common loss functions which were considered. Mean absolute error is a measure between two continuous values, in our case between the predicted value and the observed one. It is the absolute difference between the pair, averaged over the test data, and is defined in Equation 23.

$$MAE = \frac{1}{n}\sum_{i=1}^{n} |y_i - x_i| \qquad (23)$$

Note that MAE does not take the direction of the errors into account.


Root mean square error is the square root of the average of the squared differences between the predicted and observed values. It is defined in Equation 24.

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - x_i)^2} \qquad (24)$$

Both metrics express the averaged prediction error in the same unit as the data used, and in both cases the lower the score the better.

The main difference between the two is how they penalize errors. RMSE penalizes large errors more than MAE, which means that it will prioritize mitigating large errors. The choice of loss function depends on the problem. In this project, RMSE seemed more appropriate and was therefore used as the loss function for both the XGBoost and LSTM implementations.
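Equations (23) and (24) translate directly into a few lines of NumPy; a small sketch with made-up example values:

import numpy as np

y_true = np.array([100_000.0, 80_000.0, 120_000.0])   # observed sales (example values)
y_pred = np.array([ 90_000.0, 85_000.0, 130_000.0])   # predicted sales (example values)

mae = np.mean(np.abs(y_true - y_pred))            # Equation (23)
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))   # Equation (24)
print(mae, rmse)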

3.3 Extreme Gradient Boosting

The extreme gradient boosting algorithm was implemented using the framework XGBoost, as mentioned in Section 2.6.1. The easiest approach to implementing the algorithm can be seen in Figure 12. There were three functions involved in the process: xgboost.DMatrix(), which transformed the dataset to a format optimized for speed and memory efficiency used by XGBoost, xgboost.train(), which trained a model, and predict() on the trained model, which is the final call used to get the result.
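In code, the three-step approach of Figure 12 amounts to something like the following sketch; the train and test variables are assumptions based on the split described in Section 3.3.2, not the exact names used:

import xgboost as xgb

# 1. Wrap the training and test data in DMatrix objects.
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test)

# 2. Train a boosted-tree model with the chosen booster parameters.
params = {'objective': 'reg:linear', 'eta': 0.05, 'max_depth': 3}
booster = xgb.train(params, dtrain, num_boost_round=169)

# 3. Predict the daily sales for the unseen test days.
y_pred = booster.predict(dtest)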


Figure 12: A simple approach to an XGBoost implementation

The official documentation [38] for xgboost.train() can be found in Appendix A, in which its numerous parameters are listed and explained. However, the first argument, params, is of great importance and will be more thoroughly explained and discussed in Section 3.3.1.


3.3.1 Model Setup

XGBoost has three kinds of parameters: general parameters, booster parameters and learning parameters [39]. The general parameters decide the overall functionality of XGBoost. The booster parameters, denoted params in the xgboost.train() function, decide the behavior of the selected booster. The learning parameters are used to define the optimization metric used in each step.

The booster parameters used for this implementation were: eta, which shrinks the weights at each step to help prevent overfitting; max_depth, which controls the maximum depth of a tree and, if set too high, may make the tree too specialized; min_child_weight, the minimum sum of weights of all observations in a child, which if set too high can lead to underfitting; subsample, the fraction of observations to be randomly sampled for each tree, where a value that is too low might lead to underfitting; and colsample_bytree, which subsamples the columns used in each tree.

The values of these parameters were all found by using xgboost.cv(), which utilized cross-validation to find the optimal parameters in given ranges. The ranges used were partly experimental and partly found by studying the literature, and are summarized in Table 15.

Table 15: The cross-validated XGBoost parameters

Parameter Default value Selected range

eta 0.3 [0.3, 0.2, 0.1, 0.05, 0.01, 0.005]

max depth 6 [3,12]

min child weight 1 [1,8]

subsample 1 [4,11]

colsample bytree 1 [4,11]

The optimal value for the parameter num_boost_round in xgboost.train() was found using early stopping, where the training stops if the validation loss does not decrease for a set number of boosting rounds.
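A sketch of how such a search could be run with xgboost.cv() and early stopping, reusing the dtrain matrix from the pipeline sketch above; the grid below only illustrates the idea, the actual ranges are those in Table 15:

import itertools

best_score, best_params, best_rounds = float('inf'), None, None
grid = {
    'eta': [0.3, 0.1, 0.05],
    'max_depth': [3, 6, 9],
    'min_child_weight': [1, 4, 8],
}
for eta, depth, child in itertools.product(*grid.values()):
    params = {'objective': 'reg:linear', 'eta': eta,
              'max_depth': depth, 'min_child_weight': child}
    cv = xgb.cv(params, dtrain, num_boost_round=1000, nfold=5,
                metrics='rmse', early_stopping_rounds=20, seed=0)
    score = cv['test-rmse-mean'].min()
    if score < best_score:
        best_score, best_params = score, params
        best_rounds = cv['test-rmse-mean'].idxmin() + 1   # cf. num_boost_round = 169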

3.3.2 Final XGBoost model

The dataset was split 70/30, with 70% of the data used for training and 30% for testing. All the features discussed in Section 3.1.5, excluding the Date feature, were used. The number of rows was 799 for training and 343 for testing in Dataset 1.

The final model after parameter tuning for Dataset 1 looked like the following:


params = {
    'max_depth': 3,
    'min_child_weight': 2,
    'eta': 0.05,
    'subsample': 1.0,
    'colsample_bytree': 1.0,
    # Other parameters
    'objective': 'reg:linear'
}

num_boost_round = 169
gbt = xgb.train(
    params,
    dtrain,
    num_boost_round=num_boost_round
)

where dtrain is the training dataset in a sparse matrix. This model was used for the final predictions for Dataset 1 presented in Section 4. The only change to the models for the other datasets was the parameter values, which were also decided by cross-validation and will not be presented.

3.4 LSTM Neural Network

The purpose of a long short term memory network is the use of sequential data. The data is extended to keep features from the last t days, where t can be anything from 0 days (only the day to be predicted) to, in theory, an infinite number of days. The idea is to figure out if a trend of weather and sales has an impact, so the value of t should be no more than 30 days. When these extra features are created, the set is split into a training set, a validation set and a test set. These are further divided into input and output, where the input keeps all features from the time sequences and the output is only the sales for the day to be predicted.
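A small sketch of how the sequential input could be built for t = 2 (the current day and the day before), assuming the feature matrix is ordered by date; feature_matrix and sales_vector are hypothetical names, and this is only one way to arrange the data, not necessarily the exact code used:

import numpy as np

def make_sequences(features, sales, t=2):
    """Stack the features of the last t days as one LSTM input sample."""
    X_seq, y_seq = [], []
    for i in range(t - 1, len(features)):
        X_seq.append(features[i - t + 1:i + 1])   # shape (t, n_features)
        y_seq.append(sales[i])                    # sales of the day to predict
    return np.array(X_seq), np.array(y_seq)

# X then has shape (samples, t, n_features), which matches the
# input_shape=(X.shape[1], X.shape[2]) used in the Keras model in Section 3.4.2.
X, y = make_sequences(feature_matrix, sales_vector, t=2)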


Figure 13: A simple approach to an LSTM implementation

The implementation of the LSTM was done through the framework Keras. The model was built in two parts, an architecture part and a learning part.


3.4.1 Model setup

There must be a decision on the number of hidden layers, in addition to the input and output layers, where more layers implies deeper learning of the data. In each of these layers the number of neurons is set. More neurons give a more specialized model for the given dataset, and fewer neurons give a more general model. There are more architecture decisions that can be made in each layer. The weights of the neurons are often randomized at initialization to break symmetry in the search for the global minimum of the loss function. With a fixed initial value there is a higher risk of getting stuck in a local minimum, see Figure 14.

Figure 14: 3D-graph of the gradient descent and initialized weights

An activation function should be set for each layer, and they often differ depending on whether it is an input, a hidden or an output layer. For the output layer, a linear function is often used if it is a regression problem. High or low input values often lead to very small weight changes for neural networks with some activation functions, but there are alternatives that are more forgiving. Compared to other feed-forward neural networks, the LSTM does not have the vanishing gradient problem mentioned in Section 2.7.2. This gives a bigger range of possible activation functions for the hidden and input layers.

When the architecture is defined, the learning phase needs to be set. This includes which optimizer the model should use and which loss function it applies, as well as how many epochs it should use and how large the batch size in each epoch should be. The number of epochs needed is hard to predict. To overcome this, a high initial number is set together with the possibility of early stopping. The batch size is often dependent on the performance of the computer: a higher batch size means fewer iterations per epoch but requires more memory. When all parameters are set, the model is trained via cross-validation: a well-known validation technique that tests the trained model on an unseen validation dataset.

3.4.2 Final LSTM model

The difficulty with neural networks is that there is no complete manual on which settings fit which data and problem, only guidelines. The number of layers, neurons, epochs and the batch size are hard to decide. Keras models are compatible with the scikit-learn library, which provides a parameter optimization tool called grid search. With this tool, an interval for each parameter can be given, and grid search will create and test a model for each combination of these parameters. With five different parameters and an interval of ten different values for each one, it will yield 100 000 different models. The scikit-learn grid search saves all models in memory, which makes it impossible to compute on a regular computer.
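A reduced grid search of this kind could look like the sketch below, using scikit-learn's GridSearchCV on top of the KerasRegressor wrapper; build_model is an assumed helper that builds and compiles a Keras model (in the same spirit as baseline_model() in Section 3.4.2) and the parameter values are examples only:

from sklearn.model_selection import GridSearchCV
from keras.wrappers.scikit_learn import KerasRegressor

# build_model(neurons) is assumed to accept the number of neurons as an argument.
estimator = KerasRegressor(build_fn=build_model, verbose=0)
param_grid = {
    'neurons': [10, 20, 40],
    'epochs': [100, 200],
    'batch_size': [10, 20],
}
search = GridSearchCV(estimator, param_grid, scoring='neg_mean_squared_error', cv=3)
search.fit(X, y)
print(search.best_params_)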

Figure 15: The ReLU (Rectified Linear Unit) activation function

The restriction of grid search led to a combination of trial and error together with a smaller parameter range for grid search. This led to a final model with a single input layer containing 20 neurons, random initial weights and a ReLU activation function, see Figure 15, and an output layer with a linear activation function. The training parameters were 200 epochs, a batch size of 20 and stochastic gradient descent (SGD) as the optimization function with root mean square error as the loss function and a learning rate of 0.1. The number of t sequences was 2: the current day and the day before.

The model was trained on a dataset including the features found in Section 3.1.5.

# Imports added for completeness
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras import optimizers
from keras.wrappers.scikit_learn import KerasRegressor

def baseline_model():
    model = Sequential()
    model.add(LSTM(20, activation='relu',
                   kernel_initializer='random_uniform',
                   input_shape=(X.shape[1], X.shape[2])))
    model.add(Dense(1))
    optimizer = optimizers.SGD(lr=0.1)
    model.compile(loss='mae', optimizer=optimizer)
    return model  # return added so KerasRegressor can build the model

estimator = KerasRegressor(build_fn=baseline_model,
                           epochs=200, batch_size=20)


4 Results

The chosen models as well as the uplift model were applied to the three different datasets described in Section 3.1. The results are presented in the following manner: the first metric is the result with regard to how Caspeco differentiates between good and bad predictions, that is, if the prediction is within 15% of the actual sales value it is classified as a good prediction. The second metric is the root mean square error, which is a measurement of the average deviation of the predictions from the observed values. The third metric is a normalized error, which is the ratio between the RMSE of the implemented model and the RMSE of the uplift model. This metric is sensitive to outliers and high values. The fourth metric is the geometric mean of relative average error [40], see Equation (25), where the numerator is the error of the chosen algorithm (LSTM or XGBoost) and the denominator is the error of the uplift algorithm, which is the chosen benchmark.

$$GMRAE = \sqrt[m]{\prod_{t=1}^{m} \left| \frac{y_t - f_t}{y_t - f^*_t} \right|} \qquad (25)$$

The predictions from the uplift model are a 3% increase from last year's sale on the equivalent day. This is how Caspeco's customers make their predictions today, although the percentage increase varies slightly between customers depending on their optimism.
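Put together, the four evaluation metrics and the uplift baseline can be expressed in a few lines of NumPy; a sketch under the assumption that the actual sales, the model predictions and last year's sales for the test days are available as arrays (the function name is hypothetical):

import numpy as np

def evaluate(y_true, y_model, sale_last_year):
    y_uplift = 1.03 * sale_last_year                             # the uplift baseline
    within_15 = np.mean(np.abs(y_model - y_true) / y_true <= 0.15)
    rmse_model = np.sqrt(np.mean((y_model - y_true) ** 2))
    rmse_uplift = np.sqrt(np.mean((y_uplift - y_true) ** 2))
    normalized_error = rmse_model / rmse_uplift
    # Equation (25); assumes the uplift forecast never hits the actual sale exactly.
    ratios = np.abs((y_true - y_model) / (y_true - y_uplift))
    gmrae = np.exp(np.mean(np.log(ratios)))                      # geometric mean
    return within_15, rmse_model, normalized_error, gmrae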

4.1 Dataset 1

Dataset 1 is a restaurant located in a city in south-central Sweden. It has a big outdoor seating area and approximate daily sales of 114 000 SEK. The results for the chosen algorithms can be seen in Table 16.

Table 16: The chosen algorithms and their results in different metrics on Dataset1

Error measurement XGBoost LSTM Uplift

Within 15% 63 % 65% 51%

RMSE 24377 29995 30774

Normalized error 0.79 1.86 1

GMRAE 0.74 1.13 1

By examining Table 16 it can be seen that XGBoost and LSTM score roughly the same. XGBoost has 63% of its predictions within 15% of the actual sale whereas LSTM has 65%, and the current uplift method has 51% correct.


Figure 16: The different features used in XGBoost for Dataset 1 on the y-axis and their F-score on the x-axis. The F-score represents how many times each feature is split on.

Figure 16 shows how XGBoost uses the features. The importance of the features seems to correspond well to the feature correlation described in Section 3.1.4, but surprisingly Day nr seems to be more important than expected.
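The F-scores plotted in Figure 16 can be obtained directly from a trained booster; a minimal sketch using the gbt model from Section 3.3.2:

# 'weight' counts how many times a feature is used to split, i.e. the F-score.
scores = gbt.get_score(importance_type='weight')
for feature, f_score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(feature, f_score)

# xgboost can also draw the same plot directly:
# xgb.plot_importance(gbt)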

4.2 Dataset 2

Dataset 2 is a restaurant located in a city in south-central Sweden. It is a smaller restaurant than the one of Dataset 1, and it has a big outdoor seating area. The approximate daily sales of the restaurant in Dataset 2 are 68 000 SEK. The results for the chosen algorithms can be seen in Table 17.

Table 17: The chosen algorithms and their results in different metrics on Dataset2

Error measurement XGBoost LSTM Uplift

Within 15% 47% 42% 33%

RMSE 21655 23427 30845

Normalized error 0.70 1.31 1

GMRAE 0.68 0.82 1

The result from Dataset 2 can be seen in Table 17, where XGBoost once again performed better than the other algorithms. However, with only 47% of its predictions within 15% of the actual value, and LSTM's result of 42%, it can be seen as a weak result and will be discussed in Section 5. The current uplift method gets 33% of its predictions within 15%.

Figure 17: The different features used in XGBoost for Dataset 2 on the y-axis and their F-score on the x-axis.

In Figure 17 it can be seen how XGBoost uses the features for Dataset 2. The importance of the features does not seem to correspond well to the feature correlation described in Section 3.1.4. This will be discussed in Section 5.

4.3 Dataset 3

Dataset 3 is also a restaurant located in south-central Sweden. However, it is the largest of the restaurants, with an insignificant outdoor seating area and approximate daily sales of 144 000 SEK. The results for the chosen algorithms can be seen in Table 18.

Table 18: The chosen algorithms and their results in different metrics on Dataset3

Error measurement XGBoost LSTM Uplift

Within 15% 47% 36% 36%

RMSE 49114 72773 78109

Normalized error 0.63 1.21 1

GMRAE 0.74 2.12 1

The results for Dataset 3 can be seen in Table 18, where XGBoost performed better than the other models with 47% within the 15% limit. LSTM has a result of 36%, and performed the same as the uplift method's 36%. The result of XGBoost was good given the characteristics of the restaurant, and it will be further discussed in Section 5.

Figure 18: The different features used in XGBoost for Dataset 3 on the y-axis and their F-score on the x-axis.

In Figure 18 it can be seen how XGBoost uses the features for Dataset 3. The importance of the features seems to correspond well to the feature correlation described in Section 3.1.4.

4.4 The Saturday Dataset

The final question to be investigated was how well the models would perform if only data from a particular weekday was chosen. In this case only Saturdays were used, and only the features Temperature, Avg temperature last 7 days, Rainfall mm, Snow depth, Cloud cover, Minutes sunshine and Wind speed were extracted and used for training. The result can be seen in Table 19 and is only evaluated with the same metric as Caspeco uses today. The idea behind choosing only a specific day and the interpretation of the result will be discussed in Section 5.


Table 19: The result of the XGBoost, LSTM and Uplift models on the Saturday dataset, in percent of how many of their predictions are within 15% of the actual sale.

Dataset XGBoost LSTM Uplift

1 78% 83% 68%

2 44% 47% 41%

3 65% 65% 60%

In Table 19 it can be seen that the performance of XGBoost for Dataset 1 and Dataset 3 increased significantly, but for Dataset 2 there was a small decrease. The results for the LSTM model increased significantly for all datasets, and the uplift model performed better for all datasets.

4.5 The Summer Dataset

As the project progressed it became apparent that it would be interesting to investigate if the result could be improved even further by choosing only a selected period of the year. To investigate this, only data from the months June-August was selected, including all the original features from the three datasets. This is only evaluated with the same metric as Caspeco uses today. In Table 20 it can be seen how well the algorithms perform on the summer datasets. The idea behind choosing only a specific period of time and the interpretation of the result will be discussed in Section 5.

Table 20: The result of the XGBoost, LSTM and Uplift algorithms on the summer dataset, in percent of how many of their predictions are within 15% of the actual sale.

Dataset XGBoost LSTM Uplift

1 61% 52% 49%

2 63% 49% 34%

3 45% 25% 32%

By examining Table 20 it can be seen that for XGBoost the only significant change in performance is on Dataset 2, with an increase of 16 percentage points compared to the original Dataset 2 seen in Table 17. The result of the LSTM model did not change significantly. An important note is that the parameters of the LSTM model were not altered and were still optimized for the standard dataset. Comparing the uplift model's performance on the original dataset with the summer dataset, there is no significant change between them.


4.6 Benchmarks

To answer the question of which model has the shortest training time, the different algorithms were timed during training. The training time for XGBoost was less than a minute for all datasets. For the LSTM the training time was around one to two minutes depending on the dataset size, approximately 0.25 seconds per item set. The hyperparameter search differs a lot in time between the two algorithms. The Tensorflow library, mentioned in Section 2.2, that was used for the LSTM implementation used a copy of the model for each hyperparameter combination. This means that the parameter search takes up a lot of memory on the machine it is running on, and it is not possible with too many parameter combinations.

5 Discussion

This section will discuss the importance of the date features and the weather features, as well as the results obtained in Section 4. In addition to the results, the reasoning behind creating the two new datasets and their importance to the project, as well as thoughts and ideas on what could be improved, will be discussed.

5.1 Date Features

In Section 3.1.2 it is stated that a dd/mm/yy date had been split into separate features to investigate if there are any daily, monthly or yearly patterns in the sales of a restaurant. By examining Figure 19, Figure 20 and Figure 21 some expected behaviors appear.

Figure 19 shows that the sales increase during the weekend. Restaurant 3 sees a big increase in sales, but Restaurant 1 sees only a small increase. This is most likely due to the characteristics of the restaurants, which will be further expanded upon in this section.


Figure 19: A line plot with the average sale for each day in a week, weekday nr 1 = Monday, ...

Figure 20 shows that the effect of seasonality depends on what sort of restaurant it is. Restaurant 1, which is famous for its food, does not suffer from the effects of seasonality. Restaurant 2's main trait is its outdoor seating area, and it therefore blossoms during the summer months. Restaurant 3 is famous for its variety of offers, but might lose some customers to a restaurant with a bigger outdoor seating area during the summer months.

Figure 20: A line plot with the average sale for each month in a year

When people get their salary, the sales on the following days will increase, see Figure 21. This differs a bit between the two restaurants though. The sales of Restaurant 1, which can be characterized as a "fine dining" type of restaurant, are more fixed since there is a set three-course menu to choose from. This is in contrast to Restaurant 3, which can be characterized as more of a pub type of restaurant, where the sales fluctuate quite a bit. For example, the days after the 25th in Restaurant 1 might bring more customers, but the customers themselves will spend a similar amount of money whether it is the 25th or the 12th. In Restaurant 3 it is more likely that people buy more drinks after the 25th than in the middle of the month, which leads to more fluctuating sales.

Figure 21: A line plot with the average sale for a day in a month

5.2 Weather Features

There are several different weather features present in the datasets, of which Temperature is the most significant one. As mentioned in Section 1, it is common to believe that the sales of a restaurant increase when the temperature increases. By looking at Figure 22 this behavior can be observed for Restaurant 2.


Figure 22: A scatter plot with temperature on the x-axis and SalesExVAT on the y-axis for restaurant 2. The sales increase with the temperature.

Comparing Figure 22 to Figure 23, a plot that shows the sales for Restaurant 3 relative to the temperature, the same correlation does not appear. This can be explained by the settings of the restaurants. Restaurant 3 is a large restaurant with a small outdoor seating area, and Restaurant 2 is a medium-sized restaurant with a large outdoor seating area. Restaurant 3 disproves the idea that all restaurants thrive when there is good weather. Restaurants similar to Restaurant 3 do not make fewer sales on a sunny day, but Figure 23 implies that there are other matters which impact the sales.


Figure 23: A scatter plot with temperature on the x-axis and SalesExVAT on the y-axis for restaurant 3. The sales are not affected by the temperature.

It is important to realize that temperature can be relative. A temperature of 10-15 degrees in August could mean normal sales, but a temperature of 10-15 degrees in May could mean higher sales compared to the rest of the month, see Figure 24. The sales follow the normality of the temperature: 15 degrees in May could mean the first sunny, warm day and hence bring more people to the restaurants than it would in August, where it would be considered a chilly day. September could be different from May, as the summer heat is then fresh in mind, and therefore 15 degrees would not have the same impact as in May. It is important to take the temperature relative to the season of the sale, and this is one of the reasons the Month nr feature as well as the summer datasets were created.



Figure 24: (a) Sales in May with temperature on the x-axis and SalesExVAT on the y-axis. (b) Sales in August with temperature on the x-axis and SalesExVAT on the y-axis.

5.3 Feature importance and selection

Our initial assumption of which features were important for accurate sales predictions turned out to be partly wrong. Restaurant 1 and Restaurant 2 were chosen specifically by our supervisor at Caspeco because he knew they were weather dependent. This led us to believe the weather features from SMHI were going to be among the top ranked features used by the models, as well as the features with the highest correlation to the sale. However, both of these assumptions turned out to be false, as can be read in Section 3.1.4 and in the results in Section 4. Restaurant 3 was chosen because it should not be affected by the weather, and therefore it would be interesting to see if any relationships to the weather features could be found at all.

As the weather features were not important, it raises the question of what the result would be if they were not included at all. This was easily tested, and for Dataset 1 the accuracy dropped by 2 percentage points, to 61%. For Dataset 2 the accuracy dropped by 4 percentage points, to 43%, and for Dataset 3 the accuracy dropped by 2 percentage points, to 46%.

The two features Is holiday and Is evening were two experimental features whose purpose was to capture the irregularity of sales on holidays and their eves. It was quickly discovered that these features had no effect whatsoever, which led to them being dropped.

5.4 Summer and weekday datasets

As the project progressed, two other research questions became apparent due to the results not being satisfactory: What is the impact of the weather on the sales on a particular weekday?, and What is the impact of the weather on the sales for a specific period of time?

The intention behind the first question was that the date features were more important than the weather features, as can be seen in Section 3.1.4. An example of this is that the sales of a Tuesday with perfect weather conditions, i.e. warm, sunny and no wind, do not surpass the sales of a generic Saturday. This is not a hard truth, more of a general one, since there are many aspects to take into consideration, of which the size of the outdoor seating area is the most important. To investigate the sales depending only on the weather, new datasets for the three restaurants were created in which only Saturdays with their weather features are considered, since the sales on Saturdays ought to be the most constant.

In Section 4.4 the results of the XGBoost and LSTM models can be read. The results showed that the idea behind these new datasets seemed promising, with an increase in almost all datasets for both models. However, it should be taken into consideration that the sizes of the datasets were reduced drastically, as only 1/7 of the actual data was used. This makes the results more uncertain, but they strongly suggest that for predicting the sales on Saturdays it is better to isolate and train a model on that specific weekday. This was not investigated for the other weekdays, but the sales of Fridays should behave similarly. No conclusion can be drawn for the remaining weekdays as more investigation is needed, see Section 5.9.

The intention behind the second question was to mitigate the effects of seasonality. It was briefly mentioned in the previous section that the weather, and foremost the temperature, is relative to its season. This means that a temperature of 15 degrees Celsius in April, which is a really good day in April, should not be treated the same as 15 degrees Celsius in July, which is at most a mediocre summer day. Therefore the summer months of June, July and August were extracted from the datasets of all restaurants and new models were trained on these.

This reasoning should be covered by a time feature, either in days or months. A time feature should give the models a chance to learn the difference in importance of temperature at different times, as in the example above: a temperature of 15 degrees should have a bigger impact in April than in June. The XGBoost model should create new branches for different time intervals where weather has a different impact, and for the LSTM the temperature should be weighted differently depending on the time. However, it is not with absolute certainty that you can describe how a neural network or a decision tree works. The filtering of months decreases the variance and standard deviation for most of the restaurants and can therefore improve the outcome. These two reasons made us try our theory.

In Section 4.5 the results of the XGBoost and LSTM models are seen. The results show that it can be a good idea to divide the data into seasons. This is true for at least the XGBoost implementation, with an increase of 16 percentage points in the accuracy for Dataset 2, and only small decreases for Dataset 1 and Dataset 3. An increase for Dataset 2 was the goal, seeing as this restaurant was known to be weather dependent. That it would yield such a big increase was unexpected. Nonetheless, this result is more satisfactory as it is more in agreement with the characteristics of the restaurant: it should be weather dependent, and therefore mitigating the seasonality should have helped in improving the model.

The LSTM implementation did not achieve the same improvements as the XGBoost implementation. It had a decrease for 2 out of the 3 datasets: a 13 percentage point decrease for Restaurant 1 and a 9 percentage point decrease for Restaurant 3. Restaurant 2 had an increase of 2 percentage points, which correlates with the feature importance. The large decreases and the small increase could be explained by the neural network's optimization function. No matter which function is chosen, they are all trained on the same data. The summer dataset is only 1/4 of the actual data, and this reduction in size impacts the performance. If there is a smaller amount of training patterns, the optimization function will have a harder time finding good values for the neurons' weights [2]. In comparison, the original dataset, even though it does not provide more data with summer temperatures, still gives more patterns for different output data, patterns that are used for better settings of the weights. Compared to the Saturday dataset, the summer dataset does not show the same improvement; the Saturday dataset has a smaller variance in the output, and the same complexity is not required.

It should be taken into consideration that less data was used, which makes the results more uncertain. It is believed that for certain restaurants it is better to divide the year into seasons and train models thereafter. Only the summer season was investigated in this project, as it was believed to have the most impact. The rest is left for future work, see Section 5.9.

5.5 Result

There was an expected weather dependency of the sales for two of the three chosen restaurants. This turned out to be true not only for the expected restaurants, but also for the restaurant for which it was not expected. The weather data improved the results of the machine learning models, although not as much as hoped. Both models performed better than the current uplift model. This makes the results a success with regard to the goal of the project, even though the members of this project expected a better outcome.

The most important features when predicting the sales were clearly the features regarding the date: Weekday nr, Day nr and Avg sale of weekday. The weather features were, as mentioned earlier, usually among the bottom ones. As the date features were the most important, they will be thoroughly discussed in the following sections.

The result on the Saturday dataset is very satisfactory, and is more in agreement with what was actually hoped to be achieved for the original dataset.

The summer dataset was interesting for comparing the XGBoost model and the LSTM model. The XGBoost model performed a lot better on Dataset 2, and not much worse on the other two datasets. The LSTM model performed worse in all cases. A neural network is sensitive if there is a small amount of training patterns and at the same time a high variance in the output patterns.

5.5.1 Restaurant 1

Dataset 1 is a restaurant in south-central Sweden with a medium-sized outdoor seating area. This was the restaurant with the best result, 65%, with regard to Caspeco's within-15% limit. The results of the machine learning algorithms were much better than the current uplift method, but the weather features only improved the result by a small amount, as mentioned earlier. To try to understand why only 65% of the predictions were within the 15% limit, an analysis of the data and the remaining 35% follows.

Relative Standard Deviation (RSD) is a great way to get a sense of the variation in the data. Compared to the standard deviation, it gives a variance relative to its own data. For example, the standard deviation of a Saturday's sales might be twice as large as a Wednesday's, but with RSD it is normalized to the average. This is a better way to visualize the sales per day, since the sales differ greatly between weekdays. In Figure 25 it can be seen that there is a high RSD for weekdays compared to the weekends. This restaurant is solidly dependent on its dining, and when these types of restaurants perform well they often have fully booked weekends. Therefore, the Saturdays keep a low RSD and should be easier to predict, as supported by the result in Section 4.4.
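The RSD per weekday shown in Figure 25 can be computed directly from the dataset; a sketch assuming the same hypothetical SalesExVAT and weekday_nr columns used in the earlier sketches:

# RSD = standard deviation divided by the mean, per weekday.
grouped = df1.groupby('weekday_nr')['SalesExVAT']
rsd = grouped.std() / grouped.mean()
print(rsd)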


Figure 25: Relative Standard Deviation of the sales of weekdays for Dataset 1

In Table 21 the distribution of the weekdays in the result can be seen. The number of weekdays in the training dataset was approximately the same, as can be read in Section 3.1.3, but there were roughly half as many occurrences of Fridays and Saturdays as of the rest of the weekdays. This indicates that Fridays and Saturdays were easier to predict, since it implies that more of them were within the 15% limit. A reason why Mondays-Thursdays can be harder to predict is that these days had low average sales, and therefore not much was needed to exceed the 15% limit. For example, a difference of 20 customers on a Monday gives a bigger risk of a high prediction error than a difference of 20 customers on a Saturday.

Table 21: The distribution of weekdays with higher than 15% error for Dataset 1

Weekday Amount higher than 15% Error Average Sales ∼15% of Average Sales

Monday 33 27589 4138

Tuesday 26 67841 10176

Wednesday 24 78848 11827

Thursday 26 97100 14565

Friday 11 218084 32712

Saturday 12 209478 31421

As mentioned earlier, this restaurant, or type of restaurant, which was more of a fine dining experience, relied more on the date features than the weather features. Even though the result was improved by the weather features, it was ultimately the kind of weekday and the date that were important for the predictions.

5.5.2 Restaurant 2

Dataset 2 is a restaurant with outdoor seating and was described as heavily weather dependent. The results were therefore somewhat surprising, with a modest 47% as the best result. With the weather data at hand, it was thought of as the restaurant that would have the smallest error rate. Compared to the other datasets it does show a higher weather dependency, as can be seen in Figure 17 compared to Figure 16 and Figure 18. The weather still does not reach the level of importance of the time features, as explained in Section 5.5, but it is of more significance than in Dataset 1 and Dataset 3. That is why, with an exclusion of time, the results on the Saturday dataset for Dataset 2 are a bit surprising: they do not show the same decrease of error as the other two, but instead almost an increase of error. From Figure 26 it looks like it should be easy to predict a Saturday due to its low variance. A low variance is not necessarily good, since the model does not get the chance to learn the outliers that occur.

Figure 26: Relative standard deviation of the sales of weekdays for Dataset 2

In Table 22 the distribution of weekdays in the result can be seen. Compared to the corresponding table for Dataset 1, this shows an even distribution of weekdays in the result. This strengthens the belief that the sales did not depend on which weekday it was and that something else outweighed this factor.


Table 22: The distribution of weekdays with higher than 15% error for Dataset 2

Weekday Amount higher than 15% Error Average Sales ∼15% of Average Sales

Tuesday 36 46819 7022

Wednesday 43 55007 8251

Thursday 43 52943 7941

Friday 36 119284 17892

Saturday 24 120621 18093

Sunday 44 38536 5780

A discovery and a possible explanation of the weak performance on this dataset was found in the analysis of the result. This restaurant has a bookable floor, which is used for companies and big gatherings. Depending on how many of these bookings are made, the effect on the result can be big. With a high number of bookings, the sales in the outdoor seating and the rest of the restaurant are of little importance. If there is a booking for a large party in this section, the sale is not representative of the restaurant. Consequently, it also means that the weather dependency is not representative of this day. Even though it is a rainy June day, the sale could be high due to the sales of the bookable floor.

These private bookings are somewhat similar to the events explained in Section 5.9, but differ in that they do not influence the whole restaurant. These bookings could also possibly explain why the Saturday dataset did not improve the 15% error rate. If they influenced the rest of the restaurant, some learning pattern could be found. Dataset 2 is, like the other two datasets, influenced by the date, but the additional problem explained above was not disregarded when filtering out the time aspect. The bookings of this floor often occur on Saturdays, which implies that an unexplainable variance in sales still occurs for the model when training on the Saturday dataset.

There are at least two possible solutions to this problem: add a booking as an event, or exclude the sales from this area, both explained in Section 5.9.

5.5.3 Restaurant 3

This restaurant is located in a medium-sized city in south-central Sweden. It is well visited and provides a restaurant, a bar, an outdoor seating area and a dance floor section. The outdoor seating is only one of the areas which influences the restaurant, and compared to Dataset 2 it is not as dependent on this section. The results from Section 4.3 and Section 4.5 showed that the restaurant was time dependent in the same manner as Dataset 1. In Figure 27 the RSD for the weekdays of Restaurant 3 can be seen. The RSD was highest for Tuesdays and Wednesdays, and lowest on Saturdays and Fridays. This gives a notion that weekends could be easier to predict than the rest of the week.


Figure 27: Relative Standard Deviation of the sales of weekdays for Dataset 3

In Table 23 it can be seen that there are fewer Fridays and Saturdays with an error exceeding 15% compared to the other days. This gives further evidence that Fridays and Saturdays are less dependent on external circumstances, whereas for the other days these circumstances might have a big impact.

Table 23: The distribution of weekdays with higher than 15% error for Dataset 3

Weekday Amount higher than 15% Error Average Sales ∼15% of Average Sales

Tuesday 31 21476 3221

Wednesday 44 52176 7826

Thursday 37 55207 8281

Friday 19 210730 31609

Saturday 15 500640 75096

Further analysis of the result and the restaurant shows that the main reason for the deviations and the difficulty of predicting the sales was the many events the restaurant holds. See Section 5.9 for possible solutions.

Events can be both direct and indirect for the restaurant. There can be events happening in the city which might influence the sales of the restaurant, but it is more likely the events created by the restaurant itself that have the most influence. These events can be, for example, pub quizzes, live artists or dining events. The variety of the events makes it possible to believe that they are not solely hosted on Fridays or Saturdays. This is a possible explanation of why there are many Tuesdays and Wednesdays with a high error, see Table 23. The restaurant could have hosted events on these days to try to attract more customers, and given the small average sales of these days, not many customers are needed to break the limit.

The results from the Saturday dataset in Section 4.4 show a low error. A reason for this is that Saturdays are not dependent on aspects other than the fact that it is a Saturday. It is not dependent on events and weather; it is the weekend and people will visit the restaurant.

5.6 Outliers

A great part of the time spent on this project went into the pre-processing phase, trying to understand the data and identify outliers, see Section 3.1.3. This process was partially experimental, as outliers are very dependent on the data itself. A first and crude filtering was done by removing surprisingly low or high sales from each dataset. To decide if a sale was surprisingly low or high, it was not only the sales amount that was important but also the number of sales of that approximate amount. In this project an outlier was defined as an unusually small or large transaction that only happened on a limited number of occasions. If the number of these low or high transactions was less than 5% of the entire dataset, they were removed. This filtering allowed us to remove sales that most likely corresponded to holidays and the days before holidays, as well as events that could have been going on in the city for a limited time. Sales in these periods are not representative of the restaurant in general, and were therefore removed to be able to predict the sales of the restaurant more accurately. A few problems with filtering this way were deciding the lower and upper limits of the sales, as well as the limit for how much of the entire dataset these outliers could constitute. The lower and upper limits were based on the corresponding density and box plots of the datasets, but could be adjusted by trial and error. The 5% limit is very experimental but was good enough for this filtering.

5.7 Multiple restaurants prediction

Our first intention was to explore the possibility of having a general model that could be used to predict for several restaurants. It became clear early on that this would not be possible. The differences between restaurants' sales would be too wide and give a variation that would be impossible for the model to handle. With that said, it is not impossible to add a feature that would somehow index a restaurant, for example restaurant=1 for Dataset 1, restaurant=2 and so on. This would simplify the learning for the model, e.g. there is no way the sale could be SEK 300 000 for Restaurant 2, but it would be possible for Restaurant 3. Although this seems like a possible solution, it would lead to a model that would be far too complex when the number of restaurants increases.

A more robust solution would be to categorize restaurants. Such a categorization could possibly be done by clustering on features such as latitude/longitude, inhabitants, size of restaurants, opening hours etc. With this approach the number of index values would drastically decrease compared to the unique indexation explained above. For example, it could create a class feature that puts similar restaurants in the same class. With a smaller number of values the model would be less complex, and a model to predict multiple restaurants could be possible.

Even though we have an idea of how to handle this problem, as explained above, implementing a solution would be too time consuming and could be considered a whole other project. Instead we used this time to research and test excluding the time dependency, as explained in Section 5.4.

5.8 Choice of algorithms

To choose algorithms for a problem is not a trivial choice. There is not a perfectalgorithm that works for every problem, yet some algorithms are known to performbetter on certain problems than others. In this case, the XGboost algorithm wasexpected to perform well on this problem, given that it had great success in similarproblems, see Section 2.8.

The LSTM algorithm has only limited support for this kind of problem in the literature, see Section 2.8. However, we thought it would be interesting to see how it would perform given the characteristics of the algorithm, i.e. that it supports long-term dependencies, which in our case was interpreted as taking earlier days into account when predicting the sales of the coming day, see Section 3.4.
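
As an illustration of what taking earlier days into account means in practice, the sketch below arranges daily data into the fixed-length windows an LSTM consumes; the NumPy stand-in arrays and the seven-day window length are assumptions for the example, not the exact setup of Section 3.4.

    import numpy as np

    def make_windows(X, y, window=7):
        """Turn daily rows into (samples, window, features) sequences where
        each sample uses the previous `window` days to predict the next day."""
        seqs, targets = [], []
        for t in range(window, len(X)):
            seqs.append(X[t - window:t])   # the earlier days
            targets.append(y[t])           # the sales of the coming day
        return np.array(seqs), np.array(targets)

    # Example with random stand-in data: 100 days, 5 features per day.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = rng.normal(size=100)
    X_seq, y_seq = make_windows(X, y, window=7)
    print(X_seq.shape, y_seq.shape)   # (93, 7, 5) (93,)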

5.9 Future work

An idea that arose when the data from Caspeco was first examined was that it provided detailed information about the articles sold, that is, their name and type. This would make it possible not only to predict the overall sales of the restaurant, but also what kinds of articles would be sold. The data could be split and classified into two parts, food and beverages, and it could be possible to predict the sales of each of these groups. This could be important for example for bars, which rely more on the drinks sold than on the food. It would be possible to go even deeper into the data and divide the beverages into groups of non-alcoholic and alcoholic drinks, or, perhaps even more interesting, into beer, red wine, white wine, spirits and non-alcoholic drinks, and try to predict the sales and quantity of each group. These ideas were all very interesting, but would only be investigated if there was any time remaining. Instead, it was decided to investigate the effects of a summer and a Saturday dataset on the overall daily sales.
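
A minimal sketch of how such a split could begin from article-level transactions, assuming a pandas DataFrame with the hypothetical columns date, article_type and amount; the type labels and the grouping are illustrative and are not the actual labels in the Caspeco data.

    import pandas as pd

    # Hypothetical article-level transactions; all values are made up.
    transactions = pd.DataFrame({
        "date": pd.to_datetime(["2017-06-01"] * 4 + ["2017-06-02"] * 3),
        "article_type": ["food", "beer", "white wine", "food",
                         "food", "spirits", "non-alcoholic"],
        "amount": [250, 60, 90, 180, 300, 110, 40],
    })

    # Map fine-grained article types onto the two groups discussed above.
    beverage_types = {"beer", "red wine", "white wine", "spirits", "non-alcoholic"}
    transactions["group"] = transactions["article_type"].apply(
        lambda t: "beverage" if t in beverage_types else "food")

    # One daily sales series per group, each of which could be forecast separately.
    daily_per_group = (transactions
                       .groupby(["date", "group"])["amount"]
                       .sum()
                       .unstack(fill_value=0))
    print(daily_per_group)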

In the research stage of the project it was noticed that there are features regarding events that could possibly have an impact on the result. However, these features were either unavailable or too time consuming to gather. If the features had been available, they would probably have yielded better predictions.

Events are hard to categorize. Even if it is possible to scrape the web for event data, it would be almost impossible to weight it against a particular restaurant. Some restaurants could thrive on e.g. music events and some on sports. A possible solution could be that every restaurant keeps a calendar with its own take on events. This would make it possible to rank the events, where e.g. a sports restaurant gives a big sports game the rank 5 out of 5, but does not care about the local artist at the concert hall two streets away and gives that event a 1 out of 5. This event/calendar data is not provided by the restaurants at the moment and hence could not be used in the solution for this project. Another solution could be categorization, as explained in Section 5.7. With categorized restaurants, it would be possible to match them against events scraped from the web and make a more automated solution compared to the one explained above. These categories could for example be "sports restaurant" or "music restaurant".

For Restaurant 2, see Section 5.5.2, there was a problem with a bookable area of the restaurant, which made it difficult to predict the sales of the entire restaurant. A first solution could be to add another class to the event feature explained above, since a private booking should behave similarly to these events. Another solution could be to exclude the sales from the bookable area altogether, which would mean that only the weather-dependent part of the restaurant is predicted. This workaround should be acceptable, since the staff should roughly be able to predict the sales of a private booking beforehand, given that the number of guests is known.

The result of isolating specific weekdays seems promising. It would be interesting to see whether the same improvement as for Saturdays could be seen for all weekdays, and to compare this to the full dataset in order to draw some conclusions about which approach is superior. The same applies to the season approach. It would be interesting to see what conclusions could be drawn if the remaining nine months were divided into seasons and used to train new models.
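
A minimal sketch of how such weekday and season subsets could be extracted, assuming a pandas DataFrame of daily sales indexed by date; the division of the remaining nine months into seasons is an illustrative choice.

    import pandas as pd

    # Hypothetical daily sales frame indexed by date.
    dates = pd.date_range("2016-01-01", "2017-12-31", freq="D")
    df = pd.DataFrame({"sales": range(len(dates))}, index=dates)

    # One subset per weekday (0 = Monday, ..., 5 = Saturday, 6 = Sunday).
    saturdays = df[df.index.dayofweek == 5]

    # The summer subset (June, July, August) and one possible split of the
    # remaining nine months into seasons.
    summer = df[df.index.month.isin([6, 7, 8])]
    autumn = df[df.index.month.isin([9, 10, 11])]
    winter = df[df.index.month.isin([12, 1, 2])]
    spring = df[df.index.month.isin([3, 4, 5])]
    print(len(saturdays), len(summer), len(autumn), len(winter), len(spring))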

6 Conclusion

The sales of two of the three chosen restaurants were expected to show a weather dependency. This turned out to be true not only for those two restaurants, but also for the restaurant where it was not expected. The weather data improved the results of the machine learning models, although not as much as hoped. Both machine learning models performed better than the current uplift model, which makes the results a success with regard to the goal of the project.

The most important features when predicting the sales are clearly the features regarding the date, whereas the weather features have the least impact. This is generally true for all restaurants, but of course there are restaurants in which the weather features have a larger impact than in others.

The evaluation and research of the different restaurants gave an idea of the variety between the types of restaurants and of the features lacking in our models. All restaurants are believed to be event dependent, but to different degrees. Event data, external or internal, is expected to have a bigger impact than the weather data in general.

The date feature was stripped away in the creation of the Saturday dataset in order to be able to investigate the effect of the weather on the sales. The result on the Saturday dataset was very satisfactory, and was more in agreement with what was actually hoped to be achieved for the original dataset.

Another approach to investigating the effect of the weather on the sales was to extract data from the months June, July and August. On this dataset, XGBoost saw a big increase in accuracy for one of the three restaurants, and a small decrease for the other two. LSTM saw an increase in accuracy for the same restaurant but a big decrease for the other two.

A summary of the accuracies with regard to the within-15% limit of the different implementations can be seen in Table 24.

Table 24: The chosen algorithms' results, in percent of predictions within 15% of the actual sales value, on the different datasets

Dataset       XGBoost   LSTM   Uplift
Dataset 1        63       65      51
Saturday 1       78       83      68
Summer 1         61       52      49
Dataset 2        47       42      33
Saturday 2       44       47      41
Summer 2         63       49      34
Dataset 3        47       36      36
Saturday 3       65       65      60
Summer 3         45       25      32

Given the results presented in Table 24 and the GMRAE metrics for the different datasets, shown in the tables of Section 4, we conclude that XGBoost is the preferred choice.
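
For reference, a minimal sketch of the two evaluation quantities referred to above, assuming plain NumPy arrays of actual sales, model predictions and benchmark predictions; using the uplift forecast as the GMRAE benchmark is an assumption made here, and the snippet is not the exact implementation used in Section 4.

    import numpy as np

    def within_tolerance_accuracy(actual, predicted, tol=0.15):
        """Share of days where the prediction is within tol of the actual sales."""
        actual = np.asarray(actual, dtype=float)
        predicted = np.asarray(predicted, dtype=float)
        return float(np.mean(np.abs(predicted - actual) <= tol * np.abs(actual)))

    def gmrae(actual, predicted, benchmark, eps=1e-9):
        """Geometric mean of the absolute errors relative to a benchmark forecast."""
        actual = np.asarray(actual, dtype=float)
        rel_abs_err = (np.abs(np.asarray(predicted, dtype=float) - actual) + eps) / \
                      (np.abs(np.asarray(benchmark, dtype=float) - actual) + eps)
        return float(np.exp(np.mean(np.log(rel_abs_err))))

    # Toy example with made-up numbers.
    actual   = np.array([100_000, 120_000,  90_000, 110_000])
    xgb_pred = np.array([ 95_000, 145_000,  92_000, 100_000])
    uplift   = np.array([ 80_000, 140_000,  70_000, 130_000])
    print(within_tolerance_accuracy(actual, xgb_pred))   # 0.75
    print(gmrae(actual, xgb_pred, benchmark=uplift))     # below 1: better than the benchmark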




A Appendix A

xgboost

xgboost.train(params, dtrain, num_boost_round=10, evals=(), obj=None, feval=None, maximize=False, early_stopping_rounds=None, evals_result=None, verbose_eval=True, xgb_model=None, callbacks=None, learning_rates=None)

A short usage sketch is given after the parameter list below.

• params (dict) – Booster params.

• dtrain (DMatrix) – Data to be trained.

• num_boost_round (int) – Number of boosting iterations.

• evals (list of pairs (DMatrix, string)) – List of items to be evaluated during training; this allows the user to watch performance on the validation set.

• obj (function) – Customized objective function.

• feval (function) – Customized evaluation function.

• maximize (bool) – Whether to maximize feval.

• early_stopping_rounds (int) – Activates early stopping. Validation error needs to decrease at least every early_stopping_rounds round(s) to continue training. Requires at least one item in evals. If there's more than one, will use the last. Returns the model from the last iteration (not the best one). If early stopping occurs, the model will have three additional fields: bst.best_score, bst.best_iteration and bst.best_ntree_limit. (Use bst.best_ntree_limit to get the correct value if num_parallel_tree and/or num_class appears in the parameters.)

• evals_result (dict) – This dictionary stores the evaluation results of all the items in the watchlist.

• verbose_eval (bool or int) – Requires at least one item in evals. If verbose_eval is True then the evaluation metric on the validation set is printed at each boosting stage. If verbose_eval is an integer then the evaluation metric on the validation set is printed at every given verbose_eval boosting stage. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed.

• learning_rates (list or function (deprecated - use callback API instead)) – List of learning rates for each boosting round, or a customized function that calculates eta in terms of the current round number and the total number of boosting rounds (e.g. yields learning rate decay).


• xgb_model (file name of stored xgb model or 'Booster' instance) – Xgb model to be loaded before training (allows training continuation).

• callbacks (list of callback functions) – List of callback functions that are applied at the end of each iteration. It is possible to use predefined callbacks by using the xgb.callback module. Example: [xgb.callback.reset_learning_rate(custom_rates)]
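
A minimal usage sketch of xgboost.train with a few of the parameters documented above, run on random stand-in data; the parameter values are illustrative only and are not the settings used in the thesis.

    import numpy as np
    import xgboost as xgb

    # Random regression data standing in for the restaurant features and sales.
    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(200, 10)), rng.normal(size=200)
    dtrain = xgb.DMatrix(X[:150], label=y[:150])
    dvalid = xgb.DMatrix(X[150:], label=y[150:])

    params = {
        "objective": "reg:squarederror",  # called "reg:linear" in older versions
        "eta": 0.1,
        "max_depth": 4,
    }
    evals_result = {}

    booster = xgb.train(
        params, dtrain,
        num_boost_round=100,
        evals=[(dtrain, "train"), (dvalid, "valid")],
        early_stopping_rounds=10,   # stop when the validation error stops improving
        evals_result=evals_result,
        verbose_eval=False,
    )
    predictions = booster.predict(dvalid)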
