

DEGREE PROJECT IN THE FIELD OF TECHNOLOGY: INFORMATION AND COMMUNICATION TECHNOLOGY,
AND THE MAIN FIELD OF STUDY: COMPUTER SCIENCE AND ENGINEERING
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2018

Forecasting future delivery orders to support vehicle routing and selection

GUSTAF ENGELBREKTSSON

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Forecasting future delivery orders to support vehicle routing and selection

GUSTAF ENGELBREKTSSON

Degree Programme in Information and Communication Technology, 300 ECTS
Date: October 4, 2018
Industrial supervisor: Johan Frisk, Fleet 101 AB
Supervisor: Somayeh Aghanavesi
Examiner: Elena Troubitsyna
Swedish title: Förutsägelse av framtida leveransorder för att stödja val av fordon samt deras ruttplanering
School of Electrical Engineering and Computer Science


Abstract

Courier companies receive delivery orders at different times in advance. Some orders are known long beforehand, some arise with very short notice. Currently the order delegation, deciding which car is going to drive which order, is performed completely manually by a transport leader (TL), where the TL uses their experience to guess upcoming orders. If delivery orders could be predicted beforehand, algorithms could create suggestions for vehicle routing and vehicle selection.

This thesis used the data set from a Stockholm-based courier company. The Stockholm area was divided into zones using agglomerative clustering and K-Means, where the zones were used to group deliveries into time-sliced Origin Destination (OD) matrices. One cell in one OD-matrix contained the number of deliveries from one zone to another during one hour. Long Short-Term Memory (LSTM) Recurrent Neural Networks were used for the prediction. The training features consisted of prior OD-matrices, week day, hour of day, month, precipitation, and the air temperature.

The LSTM-based approach performed better than the baseline: the Mean Squared Error was reduced from 1.1092 to 0.07705 and the F1 score increased from 41% to 52%. All features except for the precipitation and air temperature contributed noticeably to the prediction power. The result indicates that it is possible to predict some future delivery orders, but that many are random and independent of prior deliveries. Letting the model train on data as it is observed would likely boost the predictive power.


Sammanfattning

Courier companies receive delivery orders varying amounts of time in advance. Some orders are known long in advance, while others arise with short notice. Today the order delegation, i.e., deciding which car drives which order, is performed manually by a transport leader (TL), where the TL uses their experience to guess future orders. If delivery orders could be predicted in advance, vehicle routes and vehicle selection could be suggested by algorithms.

This thesis used a data set from a Stockholm-based courier company. The Stockholm area was divided into zones using agglomerative clustering and K-Means, where the zones were used to group deliveries into time-sliced Origin Destination (OD) matrices. One cell in an OD-matrix contains the number of deliveries from one zone to another during one hour. Long Short-Term Memory (LSTM) neural networks were used for the prediction. The model was trained on prior OD-matrices, day of the week, hour, month, precipitation, and air temperature.

The LSTM-based approach performed better than the baseline: the mean squared error decreased from 1.1092 to 0.07705 and the F1 score increased from 41% to 52%. Precipitation and air temperature did not contribute noticeably to the prediction performance. The result indicates that it is possible to predict some delivery orders, but that a large share are random and independent of prior deliveries. Letting the model train on new data as it is observed would likely increase the forecasting ability.


Contents

1 Introduction
  1.1 Background
  1.2 Problem
  1.3 Goal & purpose
  1.4 Data
  1.5 Contribution to research
  1.6 Ethics and societal issues
  1.7 Delimitations
  1.8 Thesis outline

2 Theory
  2.1 Clustering
  2.2 Artificial Neural Networks
  2.3 Routing Problems
  2.4 Problem specific terms
  2.5 Performance metrics
  2.6 Related research

3 Method
  3.1 Overview
  3.2 Data description
  3.3 Zoning
  3.4 Data pre-processing
  3.5 Measuring the prediction performance
  3.6 Baseline prediction: The calendar model
  3.7 Long Short-Term Memory based prediction

4 Results
  4.1 Hyperparameter optimisation
  4.2 Final model

5 Discussion
  5.1 Analysis
  5.2 Future work
  5.3 The Long Short-Term Memory based approach
  5.4 Contribution to research
  5.5 Conclusions

Bibliography


Chapter 1

Introduction

Goods transportation is vital for our society to work. The total revenue for the freight market in Sweden 2015 was 275 billion SEK [56]. The road freight market operates with slim profit margins [21]. Many steps in the transportation chain are performed manually by people, such as driving vehicles and managing vehicle fleets [15]. Automation of steps in the transport chain is becoming feasible with modern research and hardware [14]. Optimising the transport chain, by for example minimising driving distances, can lead to economical and environmental improvements [15].

1.1 Background

Fleet 101 is a company developing a software product named K2 that is used for transport management in courier and goods transport companies. The software is sold to various customers in the freight industry, mostly located in the Nordic countries. Each customer runs their own instance of the software. The software is used to keep track of the whole transport chain, from incoming delivery orders to sending out invoices to customers [16].

One large and important part of the transport chain is to delegate delivery jobs among available vehicles, something currently performed manually by people. K2 is used to keep track of vehicle attributes such as position, load capacity, etc., as well as for delegating transport jobs [16].

DHL [14] describes the research field of anticipatory logistics, which aims to make supply chains more efficient. The research field has been identified as a high-impact trend for the logistics business area. Anticipatory logistics includes, for example, anticipatory shipping, which can be used to predict future shipments. Taylor [52] interpreted DHL's predictions and split them into three parts:

• Autonomous logistics, which includes self-driving vehicles.

• Internet of Things (IoT), which refers to, for example, delivery vehicles being connected to the internet, including sensors on vehicles.

• Artificial intelligence and logistics, which refers to the many possibilities of using AI and machine learning to optimize logistics, the area connected to this thesis.

Many areas belonging to anticipatory shipping are used today, such as the Internet of Things. However, the research is far from finished [9].

The vision of the company is to automate manual transport management using state-of-the-art approaches, such as modern machine learning methods. One part of the automation consists of anticipating future incoming transport orders.

1.2 Problem

Future delivery orders are not always known in advance. Occasionally the orders are known far in advance, but more frequently they come in during the day. This makes it less straightforward to automate the task of delegating orders among vehicles using well-established and well-studied approaches such as solvers for the Vehicle Routing Problem (VRP), since not all delivery orders are deterministic and known in advance [43].

The number of transport orders at different times is not uniform. Peak periods exist and transport managers have experience of them [27, 26]. Possible factors that affect the number of transport orders are, for example, the type of day, the month, and the nation-wide economic situation. As a simple example, there are usually more transport orders before Christmas in December than in July in Stockholm, something displayed in the data set used in this thesis.

Currently known orders can be input into a route planning software used by the company. The route planning software accepts a list of delivery orders from point A to B together with the available vehicles as input. The software outputs near-optimal driving routes for the vehicles. The near-optimal solutions, however, assume that future delivery orders are static, which is not the case. If predicted future orders can be inserted together with known future orders, it is likely that the output will become more usable. The software does not accept any sort of stochastic information about pick-up hotspots or similar.

The data set used is from a large Swedish courier company. The company has different service types, ranging from different types of business-to-business (B2B) deliveries during working hours to pre-planned home deliveries during the evening, mixed with priority rush transports at any time. For these jobs the company has different vehicles, ranging from small vans to lorries. This thesis will target B2B deliveries during working hours.

1.3 Goal & purpose

Figure 1.1: The purpose: predict deliveries. (Diagram: previous delivery data feeds a delivery forecast (this thesis), which in turn feeds route optimisation (the future, using a proprietary solver).)

As displayed in Figure 1.1, the purpose of this degree project is to assist transport planning by presenting predictions about future transport orders, based on previous delivery data. The research question is formulated as follows: How can transport management be assisted using predictions based on historical data?

The goal is to predict future transport orders from location A to B short-term, e.g., during the rest of the day, the next day, or the next hour. The goal is not to predict transport orders long-term, i.e., for future years. The prediction should either be readable by a human or be in a format that can be used by a computer for route optimization, as displayed in Figure 1.1. The goal is visualised in Figure 1.2.


Figure 1.2: The goal: predict deliveries between zones in a region. (Diagram: a region containing example zones such as Arlanda, Bromma, Årsta, and Sigtuna.)

1.4 Data

The data set resides in a database. The relevant data consists of about 15 years of historical delivery orders containing different forms of delivery deadlines, pick-up addresses, and delivery addresses. An address usually has a city, zip-code, street address, and street number. Most addresses saved in the database during the last four to five years also have coordinates with varying degrees of accuracy.

1.5 Contribution to research

In the literature study (Section 2.6) the newest research focused on passenger transport when neural network approaches were used. Older research that focused on freight transportation usually used statistical methods. No research was found where recurrent neural networks generate predictions in the form of Origin-Destination (OD) pairs, rather than only origin hot spots, in a freight management context.

1.6 Ethics and societal issues

There is an increasing need for responsible and sustainable transports [14]. Due to the large and global scale of transports, even minor reductions in driving distances can have a large impact.

One ethical issue concerning this thesis is the automation of manual labour. If the transport leader role is fully or partially automated, the work burden of transport leaders decreases, possibly leading to less need for them and worse job security. Another ethical and legal issue is the usage of historical data. This thesis tries to mitigate these issues by, for example, adding random noise to the data. Using data with random noise makes, for example, coordinates less sensitive, while the scientific prediction performance measurements are assumed to be unaffected.

1.7 Delimitations

This thesis focuses on making predictions for Business-to-Business (B2B) deliveries. The objective is not to try to predict home deliveries to private persons or occasional random events requiring specialized transports. The reason for predicting B2B deliveries is that they are dynamic: the optimal routing and selection of delivery vehicles depend on future, not yet known delivery jobs. This contrasts with home deliveries to individuals, where all orders are known at route planning time.

1.8 Thesis outline

This chapter introduces the thesis. Chapter 2: Theory provides a theoretical background for concepts used in the thesis and provides an overview of previous research in the area. Chapter 3: Method describes how the prediction was performed. Chapter 4: Results presents the performance of the prediction model. Finally, Chapter 5: Discussion analyses the prediction performance and discusses the work.


Chapter 2

Theory

This chapter aims to give the reader the necessary background to understand the work performed. Related research is also presented.

2.1 Clustering

Two forms of clustering using unsupervised machine learning were used for the thesis. They are presented here.

2.1.1 K-Means Clustering

The K-Means clustering algorithm works as described below.

1. Decide the number of centroids, k, to create.

2. Select k random initial centroids.

3. For each of the k centroids, create a cluster of all points closest to that centroid.

4. Create k new centroids by calculating the center of mass of all points in each cluster.

5. Repeat the previous two steps until the centroids no longer change.

Due to the nature of K-Means clustering, it only supports Euclidean distances in theory [4].
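As an illustration only (the thesis implementation used Scikit-Learn, see Section 3.3.2), a minimal NumPy sketch of the steps above could look as follows:

import numpy as np

def k_means(points, k, n_iter=100, seed=0):
    """Minimal K-Means sketch. points: (n, 2) array of coordinates."""
    rng = np.random.default_rng(seed)
    # Step 2: select k random initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign every point to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the center of mass of its cluster.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels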


2.1.2 Agglomerative Clustering

Agglomerative clustering is a type of hierarchical clustering. The idea is that each point is at first its own cluster; clusters are then merged by combining close clusters. One advantage of agglomerative clustering is that the distance metric does not have to be Euclidean [33].

2.2 Artificial Neural Networks

An Artificial Neural Network (ANN), often simply called a neural network, is a machine learning method inspired by the human brain. ANNs have many benefits; for example they are non-linear, which allows them to capture non-linear inputs. An ANN consists of a set of information-processing neurons, where a single neuron has three elements: connecting links, an adder, and an activation function [23].

Figure 2.1: A neuron: input signals x1, ..., xm with weights w1, ..., wm enter a summing junction Σ with bias bk, followed by an activation function φ(·) producing the output yk. Adapted from Figure 1.5 in [23].

In Figure 2.1 a neuron is displayed. The connecting links (input signals) each have their own weight, and the weighted inputs are summed in the adder (summing junction). The activation function defines the output yk. The activation function can, for example, be a simple threshold function that returns 1 or -1 depending on the adder's output [23].
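Written out with the symbols from Figure 2.1, the output of a single neuron k is the weighted sum passed through the activation function (a standard textbook formulation, consistent with but not copied from [23]):

\[
y_k = \varphi\Bigl( \sum_{j=1}^{m} w_j x_j + b_k \Bigr)
\]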


Figure 2.2: Artificial Neural Network with an input layer, a hidden layer, and an output layer.

Typically, when neurons are combined to form an ANN, the neurons are arranged in different layers. In a feed forward neural network (FFNN), displayed in Figure 2.2, neurons are combined into one input layer, one output layer and an arbitrary number of hidden layers. A single neuron can be linear, but when neurons are combined the whole net becomes non-linear [23].

Activation function

Each hidden neuron in a neural network computes a regressive output, usually ranging from 0 to 1. The function used for the calculation is called the activation function. A common activation function in RNNs is the hyperbolic tangent, which ranges from -1 to 1. To rescale the output to a classification format after the last layer, softmax can be used. With softmax, scores are given to the different classes, where the scores for all possible classes sum to one [37, 62].
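For reference, the standard softmax over K class scores z_1, ..., z_K is:

\[
\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K,
\]

so every rescaled score is positive and the scores sum to one.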

Loss function

A loss function is a function measuring the cost or performance of a prediction. It is used during training to measure the cost when the network's weights are updated. The mean squared error can be used as a loss function. Another loss function, used for categorical output, is cross-entropy [37].

2.2.1 Long Short-Term Memory Recurrent Neural Networks

A Recurrent Neural Network (RNN) distinguishes itself from a FFNN in that it has feedback loops [23]. FFNNs as well as plain RNNs are bad at handling prior dependencies, i.e., historical data, something that is addressed by Long Short-Term Memory (LSTM) networks. LSTMs solve this by having a memory cell. A typical LSTM unit consists of a cell, an input gate, an output gate, and a forget gate. The cell is the memory itself and is able to remember data for a long term. The input gate controls the input activations. The output gate controls the output flow from the cell. The forget gate controls the LSTM's memory through a self-recurrent connection, i.e., it controls the cell (the memory). Each gate works as a simple ANN with no hidden layers and has an activation function. Different variations of the LSTM unit exist [6].
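One common formulation of the gates described above, with input x_t, previous output h_{t-1}, previous cell state c_{t-1}, the logistic sigmoid σ, and element-wise multiplication ⊙ (variations exist, as noted; this is the standard textbook version rather than the exact one used in [6]):

\[
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) &&\text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) &&\text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) &&\text{(output gate)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c) &&\text{(cell state / memory)} \\
h_t &= o_t \odot \tanh(c_t) &&\text{(unit output)}
\end{aligned}
\]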

Training

Neural networks need to be trained. Back-propagation through time is a common way to train LSTM RNNs [22, 6]. Back-propagation computes the gradient of the cost function and works by calculating the errors backwards: the algorithm starts at the last layer and works its way towards the first layer. Gradient descent is usually used as an optimisation algorithm to decide to which extent the weights should be updated [37]. ADAM is an optimisation method combining two older optimisation methods and has built-in learning rate decay [28].

Overfitting

When a machine learning model starts to learn too much detail from the training data, it will generalise less and perform worse on the test data. This condition is called overfitting. Two popular ways to combat overfitting in neural networks are dropout and regularisation. Dropout works by randomly dropping hidden neurons and then re-adding them. Regularisation works by adding an extra term to the loss function, which results in the network preferring to learn smaller weights [37].
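As a hedged Keras sketch of both techniques (the layer sizes, dropout rate, and L2 coefficient are illustrative only and not the configuration used in this thesis):

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras import regularizers

model = Sequential([
    # L2 regularisation adds a penalty on large weights to the loss function.
    Dense(64, activation="tanh", input_shape=(100,),
          kernel_regularizer=regularizers.l2(0.01)),
    # Dropout randomly drops a fraction (here 20%) of the hidden neurons during training.
    Dropout(0.2),
    Dense(10, activation="softmax"),
])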


2.2.2 Convolutional Neural Networks

Deep Convolutional Neural Networks (CNNs) have good performance for image classification. Residual Networks (ResNets) introduce residual connections in a deep CNN [24].

2.3 Routing Problems

The travelling salesman problem (TSP) describes in what order a salesperson should visit a given set of cities exactly once, in order to minimize the total distance and time spent. TSP is NP-hard and has several variations and forks [45].

2.3.1 The Vehicle Routing Problem

Figure 2.3: VRP visualised. C1 to C6 represent different customers and the depot the common start location. The blue line is the routing for one vehicle, the red line the routing for another vehicle.

One generalization of the TSP is the Vehicle Routing Problem (VRP). It was first described in 1959, and the VRP describes in what order one or more trucks should visit service stations from a main station [13]. The VRP is NP-hard, like the TSP [30]. Polynomial approximation algorithms exist for solving the VRP [10]. The problem is visualised in Figure 2.3, where the two colours represent different routes.


2.3.2 The Pick-up and Delivery Problem

Figure 2.4: PDP visualised for customers C1 to C8. Boxes are pick-up points, circles are drop-off points.

The pick-up and delivery problem (PDP) is related to the VRP. It describes how one or more vehicles should perform deliveries between different locations [36]. The problem is illustrated in Figure 2.4. The definition of the PDP can vary. One definition is that one vehicle can only serve one delivery at a time, while another definition is that one vehicle can serve multiple transport orders at the same time [8].

A problem closely related to the PDP is the dial-a-ride problem, which in short describes how taxis should be routed to pick up and drop off passengers when ride-sharing can be used [12].

2.3.3 VRP & PDP classes

The VRP & PDP have several variations. Relevant ones are presented here; nonetheless, the problems have a wide range of applications. As an example, the VRP has even been applied to military aircraft mission planning [46].


Time Windowed

In the time windowed flavour, time constraints are introduced. Typically each transport order has an earliest allowable pick-up time and a latest allowable delivery time [36].

Dynamic & Stochastic

There are times when not all orders are known during the planning stage. The dynamic (or online, real-time) flavour defines the case when additional transport orders occur after the planning stage, during the operation. The stochastic flavour extends the dynamic one by describing that prior knowledge about future, unknown transport orders can be taken into consideration in the planning step [47].

Different strategies to satisfy dynamic demand exist. One strategy, called double horizon, describes how route distance should be minimized in the short term while favouring empty vehicles in the long term. Different waiting approaches describe when and where vehicles should wait when time allows; the reason for waiting is to be able to satisfy new delivery jobs faster. Fruitful regions define vehicle re-routing to areas where the probability of future requests is high [27].

2.4 Problem specific terms

This section gives a short introduction to some terms and methods used in the reviewed literature.

2.4.1 Origin Destination Matrix

An Origin Destination (OD) matrix describes the flow of, for example, passengers, goods, or data between different zones. Table 2.1 displays a sample OD-matrix giving information about the flow between the zones a, b, and c. The flow a → a is 1, the flow c → b is 8, etc.

Table 2.1: Sample OD-matrix.

      a  b  c
  a   1  2  3
  b   4  5  6
  c   7  8  9


One application of OD-matrices is that they assist traffic planning, usually on a large scale. Typically the set of all zones, i.e. the complete matrix, forms a region or city, while the size of the individual zones varies. However, in this thesis an OD-matrix will give information about the number of delivery orders between zones. It is also possible to add a time dimension to the OD-matrix, by adding discrete time intervals as a new dimension in the matrix [42].

2.4.2 Simple calendar method to create OD-matrices

Some literature compares new methods of predicting OD-matrices with a simple calendar model. The calendar model varies a bit between different studies, but the basic principle is usually the same: for each origin-destination pair, week days are split into different groups (working day, school holiday, etc.) and historical data is mapped to the groups. The time can be split into discrete intervals if it is desirable to have a time dimension [55].

2.4.3 Autoregressive models

Autoregressive Integrated Moving Average (ARIMA) is a statistical and regressive model of a random process. It is assumed that the future is a linear function of historical data. To apply an ARIMA model, a model is constructed and its parameters are then found [63].

Vector Autoregression (VAR) is a generalization of the basic autoregressive model that allows for multiple dependent variables as input. Usually VAR performs better than simpler univariate autoregressive models [59].

2.5 Performance metrics

The Mean Squared Error (MSE) is defined as:

\[
\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - t_i)^2
\]

where N is the number of predictions, y_i a prediction, and t_i an expected value.


According to [33], precision and recall are defined as:

\[
\mathrm{precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}},
\qquad
\mathrm{recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}
\]

and the F1 score as:

\[
F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
\]
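As a sketch of how these metrics can be computed for a predicted OD-matrix, assuming the predicted and true matrices are first binarised into "delivery / no delivery" cells (the 0.5 threshold is an illustrative assumption, not necessarily the thesis procedure):

import numpy as np

def mse(y_pred, y_true):
    """Mean squared error over all OD-matrix cells."""
    return float(np.mean((y_pred - y_true) ** 2))

def f1_score(y_pred, y_true, threshold=0.5):
    # Binarise: a cell counts as positive if at least one delivery is predicted/observed.
    p = y_pred >= threshold
    t = y_true >= threshold
    tp = np.sum(p & t)
    fp = np.sum(p & ~t)
    fn = np.sum(~p & t)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0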

2.6 Related research

This section overviews the previous research on making predictions about future transportation demand, both in a passenger and a freight context. There are related research fields, such as forecasting demand in electrical networks [39], but this section will focus on more closely related fields, including traffic flow forecasting such as [54].

In general, the first step of making a prediction is to divide the whole area to be predicted into smaller zones [57]. There are two classes of approaches: either only the origin is predicted, or both the origin and the destination are predicted. Section 2.6.1 looks at research that mostly predicts the origin, while Section 2.6.2 presents research that predicts both the origin and destination.

2.6.1 Forecasting origin and/or destination as separate entities for VRP & PDP solvers

Ichoua, Gendreau, and Potvin [27] stated in 2007 that solution approaches for the VRP that anticipate future demand are not yet mature, but that research interest exists. However, Ritzinger, Puchinger, and Hartl [47] claimed in 2016 that the research interest has increased during the last few years for the Dynamic and Stochastic VRP and that it is, for example, now possible to process knowledge about demand using modern statistical approaches. For example, in 2018 van Engelen et al. [58] incorporated historical demand with empty vehicle re-routing in the dial-a-ride problem.

An example of how future demand predictions can be generated and taken into consideration in the planning step is presented in an article by Schilde, Doerner, and Hartl [49]. Patients were transported from their homes to a hospital or from a hospital to home. Around half of the requests were known in the morning, while the other half were dynamic and appeared during the day. The inter-arrival times of the dynamic requests were found to have an exponential distribution. The return transports resulting from the first dynamic request were found to have a gamma distribution.

Swihart and Papastavrou [51] created and analysed a model for the PDP. It was discovered that dynamic requests arrived according to Poisson processes in geographical zones. In another report, by Garrido and Mahmassani [20], it was assumed that dynamic requests arrive according to a Poisson distribution. The Poisson distribution assumption was used in conjunction with an autoregressive model to predict future orders short-term. In 2000 Garrido and Mahmassani [19] continued their research by modelling demand with an econometric model; however, they discovered that their model's prediction did not correspond fully to a real sample. Newer research from 2016 by Vonolfen and Affenzeller [61] confirms the assumption that the arrival rates of transport orders can be seen as a Poisson process.

2.6.2 Forecasting demand as origin-destination pairs

Tsekeris and Tsekeris [57] write about the traditional four-stage transport planning process for passenger transport, consisting of trip generation, trip distribution, mode choice, and traffic assignment. Trip generation refers to forecasting passenger transport using econometric models. Trip distribution refers to allocating the demand from the previous step into an origin destination matrix. Mode choice refers to splitting the OD-matrix into different modes of transport, for example private car or public transport. Traffic assignment maps the OD-matrix onto a transport network, i.e., which routes will be used. A Ph.D. thesis by Peterson [42] also states that the generation of OD-matrices on a large scale is a well-studied problem.

Tsekeris and Tsekeris [57] write that using, for example, seasonal exponential smoothing can predict the medium-term or long-term transport demand. Tsekeris and Tsekeris present an overview of the methods used for forecasting and state that modern approaches occasionally combine steps from the classical methods, such as averaging. Relevant methods include Kalman filtering, autoregressive models such as ARIMA, genetic algorithms, and artificial neural networks. In general the paper is aimed at a macro-level scale, i.e., predicting the OD-graph for commuters in a region.

Toqué et al. [55] predicted public transport demand city-wide as OD-matrices using Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNN). Their RNN model had one LSTM layer. For training of the model they used gradient-based optimization, where they experimented with the hidden state size. Their input to the model was prior OD-matrices at 300 time stamps; the output was a predicted OD-matrix at the time stamp to be predicted. The RNN method was compared with two more conventional methods, a calendar model and a Vector Autoregressive (VAR) model. The calendar method consisted of putting historical rides into 15-minute slots for different day types, where one day type was for example "Monday to Wednesday" and another was "school holiday". The VAR & LSTM methods outperformed the calendar method.

In an article from 2018, Li et al. [31] developed an algorithm to predict OD-matrices for taxi trips in a large city. It was done by combining non-negative matrix factorization (NMF) with an autoregressive model. Li et al. stated that predicting OD-matrices using statistical models, including for example maximum likelihood and Bayesian inference, is unsuitable for short-term predictions. Li et al. write that the reason is that those statistical models assume that all transports must end in identical time windows, an assumption that cannot be made since transport orders have different time lengths.

The reason why Li et al. [31] chose to use NMF over regression or neural networks was that they claimed it would not be possible to detect the purpose of the travel (i.e. commute, leisure, etc.), as described by Peng et al. [41]. For this thesis it is not required to factor in the purpose of a trip, since the main purpose is always the same, i.e., to deliver a package. Li et al. also ruled out Kalman filtering, since Ming-jun and Shi-ru [34] found that its predictions are delayed.

Zhang, Zheng, and Qi [64] developed a deep learning method for crowd flow prediction and compared it with some autoregressive models. Their method combined convolutional neural networks that looked at different time intervals. The deep learning method performed better than autoregressive models such as ARIMA & Vector Autoregression (VAR). Crowd flow prediction is a bit different from predicting OD-matrices, since the problem is about predicting flows between neighbouring grids. Nevertheless the result is interesting, as it shows that using neural networks works better than autoregressive models.

Alonso-Mora, Wallar, and Rus [3] looked at the dial-a-ride problem in New York City. They divided the city into a grid system with equal areas and divided the historical data into 15-minute intervals for each week day. A clustering algorithm was used to merge grids into larger grids, where a probability distribution was then found for each OD-pair. The prediction was then used in their routing algorithm.

Azzouni and Pujolle [6] used a Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN) to predict Origin-Destination pairs, in an approach reminiscent of Toqué et al. [55]. As input to the LSTM, Azzouni and Pujolle mapped a prior NxN OD-matrix into an N²-long vector. The output was then an N²-long vector that could be mapped back into an NxN OD-matrix. They also presented a method for continuous prediction over time.

Tian and Pan [54] also predicted OD-matrices using LSTM RNNs. However, they also compared the performance with some other approaches, including Support Vector Machines and Feed Forward Neural Networks. The LSTM RNN had the best performance. They state that LSTM RNNs have good performance for short-term predictions, due to LSTMs' ability to remember long-term data.

In a Ph.D. thesis by Larsen [29], parcel pick-up points for mail trucks were analysed. It was determined that pick-up locations were highly dynamic, since only a small subset of the locations were known in advance.

2.6.3 Summary

We can conclude that for OD-matrices, autoregressive and machine learning approaches outperform the simple calendar method [55]. Recent research has shown that approaches using neural networks are feasible and can work better than autoregressive methods [64, 55, 6, 54]. LSTM RNNs are used over simple RNNs since training simple RNNs using back-propagation is difficult when modelling long-range dependencies [6]. Some research, for example [31, 3, 48], also focuses on finding clusters for pick-up hot spots.

In general there is more research about incorporating dynamic and stochastic information into the VRP than the PDP [8]. A lot of research about predicting OD-matrices is on a macro-scale, for example transportation city-wide, while less research has been performed on a micro-scale.

The overview of the related work shows that there is no previous research that aims at predicting deliveries at a smaller scale. Another difference is that this thesis used classification output, instead of output in a regression format, which previous research used.


Chapter 3

Method

This chapter presents the method used to predict future deliveries. The data, the zoning approach, and the prediction method are described. The experimental setup is also presented.

3.1 Overview

As shown in Section 2.6, a lot of research focused on how a relatively simple forecast could be used in a VRP or PDP solver, where previous knowledge about only the origin was used. For example, the previous knowledge might be modelled as a Poisson process. One advantage of knowing where future pick-up hot spots are is that free vehicles can be routed to those positions. For example, it is common that taxi drivers drive to an airport when they have no passengers, since the drivers anticipate future orders from the airport [66].

An interview with a transport leader was conducted, where the purpose was to further understand how transport leaders work, how they could be helped by a prognosis, and what factors they base their experience about upcoming deliveries on. The interviewee explained that they never have the problem of too few orders and empty vehicles. They may choose to let some vehicles be empty on standby for important incoming jobs with a short deadline, but they did not currently have the need to route empty vehicles to future locations where it is believed that future orders will occur.

Since the route optimization software used by Fleet 101 does not accept stochastic information about pick-up hotspots, inserting only future pick-up points, instead of data derived from OD-matrices, would require a solution where the destination point would need to be inferred from the pick-up point. An example of how to solve this problem would be to place the predicted pick-up hotspots as delivery jobs with some arbitrary destination, where the predicted jobs have the constraint that they must be delivered after the real jobs, which would make the end destination less important.

3.1.1 Predicting OD-matrices

The decision was made to predict the OD-matrices using RNNs with LSTMs. The reason was twofold. First of all, recent research presented in Section 2.6 showed that approaches using RNNs with LSTMs performed better than autoregressive methods. Secondly, it is easier to add additional features into neural networks than into autoregressive models, since, for example, an LSTM RNN can capture dependencies between the features [55]. An additional feature is, for example, the weather.

To save time and resources in the implementation, the high-level machine learning API Keras [2] was used with the machine learning framework TensorFlow [1] as back-end. Recent research papers found in the literature study that use LSTM RNN approaches to predict OD-matrices used Keras and/or TensorFlow [55, 6].
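For illustration, a minimal Keras sketch of the kind of LSTM model this enables is shown below. The layer size, sequence length, feature widths, output activation, and loss are placeholder assumptions and not the hyperparameters reported in Chapter 4.

from keras.models import Sequential
from keras.layers import LSTM, Dense

n_zones = 29    # zones, so each OD-matrix has n_zones ** 2 cells
n_extra = 33    # assumed width of the one-hot extra features (hour, day, month, weather)
seq_len = 12    # assumed number of prior time slices fed to the model

model = Sequential([
    # Input: a sequence of flattened OD-matrices concatenated with extra features.
    LSTM(128, input_shape=(seq_len, n_zones ** 2 + n_extra)),
    # Output: one value per OD-cell of the next time slice.
    Dense(n_zones ** 2, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")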

Figure 3.1: Method overview: delivery times and addresses are read from the database, addresses are mapped to zones, time-sliced OD-matrices are created, and a neural network produces the prediction.

The method is summarised in Figure 3.1. First, data is retrieved from a database, time-sliced OD-matrices are created and inserted into the neural network, and finally predictions are made. The time slices are one hour long, i.e., one time-sliced OD-matrix contains all deliveries for one hour.


3.2 Data description

The data resided in a Microsoft SQL database; the total size was about 170 GB, including data not relevant for this thesis. Available data ranged from 2003 to the early spring of 2018, where less data was available for the first years and more data existed during later years. The relevant parts of the data set were related to transport orders, where relevant features of the transport orders are presented below.

• Pick-up & delivery addresses. Some address fields are:

– Address lines. The content can be, for example, a company name or a street address. The address lines are not clean; occasionally a field can contain a company name, but more frequently a street address.

– Zip-code. It exists in nearly all addresses, independently of whether the address lines describe a company or a street address. The zip-code is in nearly all cases clean, i.e., the field is not used to describe something else.

– City. Is usually clean, however at times the field is used for other things. Frequently the names of cities or areas are shortened, for example "Ö-malm" instead of "Östermalm".

– Coordinates. Since around 2014-2015 coordinates are generally available with different accuracies, and since 2015 most addresses have coordinates with high accuracy. However, not all addresses since 2015 are guaranteed to have coordinates.

• Date.

• Earliest allowable pick-up time.

• Latest allowable delivery time (deadline). Together with the earliest allowable pick-up time, a time window is formed.

In addition to the provided data set described above, external data in the form of a calendar was available. The calendar could, for example, map dates to weekdays and tell whether a day is a public holiday in Sweden or not. Weather data for a weather station in central Stockholm was also retrieved from the Swedish Meteorological and Hydrological Institute (SMHI) [50]. The weather data retrieved contained the daily precipitation in millimetres and the air temperature in degrees Celsius at 06:00, 12:00 & 18:00.

Figure 3.2: Weekly hourly distribution of all available delivery deadlines, including home deliveries. Hours 24-48 are Tuesday, 48-72 Wednesday, etc. The y-axis has been normalised.

Figure 3.2 displays the delivery deadline distribution for an average week during a three-month period. The two highest peaks are at 16:00 and 17:00, typical deadlines for deliveries, since non-urgent deliveries can typically be delivered any time during working hours the same day. The peak at 22:00 represents home deliveries. It can be observed that almost no deliveries happen on Saturdays and Sundays.

3.3 Zoning

Since an Origin-Destination (OD) matrix was predicted, the size of the matrix needed to be decided, i.e., a good level of detail had to be found. Letting each possible street be its own zone would not be feasible, since the matrix would be too sparse and too large. The goal with the zoning was to divide the Stockholm area into a set of zones, where the zones had a similar number of deliveries in them. The purpose was to use the zones to get a prediction between two points that could be used in a route optimisation software. In the end the clustering approach was deemed more feasible. This section describes two different approaches to create zones: the first uses zip-codes as zones and the second uses clustering to create larger zones.

3.3.1 Using zip-codes as zones

It was assumed that the sizes of zip-code areas in Sweden are correlated with either the population or the number of packages. That means that the area of a zip-code in central Stockholm can be tiny (a single block), while a zip-code in the countryside can cover a large area (a small town). The numbers in Swedish zip-codes are structured according to geographical area; for example, all zip-codes beginning with 10 or 11 are located in Stockholm [44].

One drawback of using zip-codes for dividing zones is that they may change [44]. Zip-code data in Sweden is not open and freely available. Since no resource was found that lists all of these changes with their dates, it is not feasible to update zip-codes retroactively. It was assumed that the changing zip-codes problem is minor and would not have a noticeable impact on the result.

Using zip-codes as zones was performed by letting the first three digits form a zone, e.g., an address with the zip aaabb belongs to a zone named aaa. The problem with using the zip-code approach for zoning is that the OD-matrices become either too sparse or too small; see Figure 3.3 for an example of the sparsity. In the figure it can be seen that most cells are black, meaning that no deliveries occur between those two zones. Due to the sparsity and the results of basic experiments performed, it was decided that using zip-codes would be infeasible. If the detail level were lowered by using only two digits, nearly all addresses in the Stockholm city area (Kungsholmen, Södermalm, Vasastaden, etc.) would belong to the same zone.

Figure 3.3: OD-matrix for deliveries during one Wednesday between 3-digit zip-codes. To see differences more easily the OD-matrix has been plotted on a logarithmic scale and normalised. Black means no deliveries between two zones; lighter colours indicate more deliveries.

3.3.2 Creating zones by clustering

As an alternative to zip-code zoning, a clustering approach using unsupervised machine learning was implemented. Most addresses in the data set have had coordinates since around 2015, allowing addresses from 2015 onward to use this approach. Algorithms from the machine learning library Scikit-Learn [40] were used for the implementation.

Distance metric

A distance metric between coordinates is required for clustering. A basic distance metric between two coordinates is the Euclidean distance. Approximating distances in a grid-based city may work well using Euclidean distances; however, since Stockholm is a city consisting of many islands, using Euclidean distances leads to undesirable distances between points. An example of the undesirable behaviour is displayed in Figure 3.4, where points with water in between are close according to the distance metric but far away for a car.

Figure 3.4: Zones created with a Euclidean distance metric; note how the pink points in the top left belong to the same cluster as the pink points in the bottom left. Map from OpenStreetMap contributors [38]. For privacy reasons noise has been added to the coordinates, hence the coordinates in the sea.

To solve the problem with Euclidean distances, driving times between points were used instead as a distance metric. To calculate the driving times, the Open Source Routing Machine (OSRM) [32] was used. OSRM was used to retrieve a distance matrix containing driving times between all points. One drawback of using driving times is the calculation time required to compute the distance matrix: calculating a distance matrix for 1000 points is near instant, but computing a distance matrix for 2000 points takes significantly longer. The matrix size grows with O(points²), meaning the time increase is quadratic. Calculating a distance matrix for more than 2000-3000 arbitrary points was deemed infeasible.
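A hedged sketch of retrieving such a driving-time matrix over HTTP from OSRM's table service, assuming a locally running OSRM instance with a car profile at localhost:5000 (the endpoint shape follows the public OSRM HTTP API; the host and port are assumptions):

import requests

def driving_time_matrix(coords, host="http://localhost:5000"):
    """coords: list of (longitude, latitude) tuples.

    Returns an NxN list of driving times in seconds between all points,
    as reported by the OSRM /table service.
    """
    coord_str = ";".join(f"{lon},{lat}" for lon, lat in coords)
    response = requests.get(f"{host}/table/v1/driving/{coord_str}",
                            params={"annotations": "duration"})
    response.raise_for_status()
    return response.json()["durations"]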

Handling outliers

Some addresses may have wrong coordinates, causing them to be in, for example, another city. Other addresses may be isolated from other addresses, due to occasional deliveries to the countryside. In short, outliers need to be handled so they do not affect the rest of the clustering negatively, either by removing them or by taking them into account in the clustering algorithm, which requires a clustering algorithm able to handle outliers. K-Means does not handle outliers well in its basic form [18].

Figure 3.5: Outlier detected using agglomerative clustering.

Compared with, for example, K-Means, agglomerative clustering does not try to create zones with an equal number of points and often places outlying points into small clusters, as displayed in Figures 3.5 and 3.6. This behaviour was exploited to find outlying points and then remove them: each cluster containing fewer than four points had its points removed from the set of all points before the final clustering was performed.

Figure 3.6: Clustered coordinates using agglomerative clustering; centroids are marked with crosshairs. In total 30 clusters, some of which are outside of the map borders. Notice the differences in cluster sizes. Map from OpenStreetMap contributors [38].
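A minimal Scikit-Learn sketch of the outlier-filtering step described above, assuming a precomputed driving-time matrix dist; the cluster count and linkage are illustrative, and the affinity argument is named metric in newer Scikit-Learn versions:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def remove_outliers(points, dist, n_clusters=30, min_cluster_size=4):
    """points: (n, 2) coordinate array; dist: (n, n) driving-time matrix."""
    labels = AgglomerativeClustering(
        n_clusters=n_clusters,
        affinity="precomputed",  # use the driving-time matrix directly
        linkage="average",       # 'ward' would require Euclidean distances
    ).fit_predict(dist)
    # Keep only points whose cluster has at least min_cluster_size members.
    sizes = np.bincount(labels)
    return points[sizes[labels] >= min_cluster_size]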


Clustering using K-Means

After removing outliers, different clustering algorithms and their parameters were experimented with. Figure 3.7 displays the result of first removing outliers with the method described and then clustering 30 zones with K-Means, while Figure 3.6 displays clustering using agglomerative clustering. Notice how K-Means created clusters of more equal sizes than agglomerative clustering. A common problem with many tested algorithms was that they grouped all points not too far away from Stockholm city together into a single large cluster, while creating small clusters from points further away from the city centre.

K-Means was deemed to perform the most desirably, despite theoretically not supporting non-Euclidean distances. K-Means performed best since the number of points in each zone turned out to be roughly equal and most zones seemed to have reasonable areas.

Figure 3.7: Clustered coordinates using K-Means; centroids are marked with crosshairs. In total 30 clusters, some of which are outside of the map borders. Map from OpenStreetMap contributors [38].


Classifying unobserved points

The centroids are marked with crosshairs in Figure 3.7. The coordinates of a centroid were calculated by taking the mean of all coordinates in a cluster. The calculation of the latitude is displayed in Equation 3.1, where C is the set of all points c in a cluster. The longitude is calculated using the same equation.

\[
\text{Centroid latitude} = \frac{\sum_{i=1}^{|C|} c_{i,\mathrm{latitude}}}{|C|}
\tag{3.1}
\]

Only the centroids were used for classifying points. This means that the size of the distance matrix needed grows linearly with the number of points to be classified, O(|clusters| * |points|), where the number of clusters is constant. This allowed for fast distance matrix retrieval from OSRM. The classification was simply performed by finding the closest centroid to an unknown point, where the distance was measured as the driving time. Note that since all points were classified using the centroids, some points in Figure 3.7 belong to other clusters in the final run.
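A sketch of this classification step; travel_time is an assumed helper (for example wrapping the OSRM table service shown earlier) that returns driving times from every new point to every centroid:

import numpy as np

def classify(points, centroids, travel_time):
    """Assign each point the index of the centroid with the shortest driving time.

    travel_time(points, centroids) is assumed to return a matrix of shape
    (len(points), len(centroids)) containing driving times.
    """
    times = np.asarray(travel_time(points, centroids))
    return times.argmin(axis=1)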

Method summary

The method described above can be summarised with the steps below.

1. Decide the number of clusters desired.

2. Retrieve a distance matrix. Distance metric: driving time.

3. Cluster using agglomerative clustering and remove outlying points from the data set. Distance metric: driving time.

4. Discard the clustering result.

5. Cluster using K-Means. Distance metric: driving time.

6. Calculate centroids using the Euclidean distance.

7. Discard the points used for the construction; only keep the centroids.

8. Classify new points by finding the closest centroid. Distance metric: driving time.


3.4 Data pre-processing

The data needs to be pre- and post-processed. This section describes how the data was processed to be usable in the baseline prediction in Section 3.6 and the machine learning prediction approach in Section 3.7.

3.4.1 Data split

The data set described in Section 3.2 was split into training, validation, and test data. Data containing coordinates was preferred, to avoid having to geocode address lines into coordinates, a step requiring cleaning of address lines and paid services; therefore older data was not used. The data split is presented below.

• Training set. The data set contained deliveries from 2015-01-01 to 2016-12-31. Note that for most experiments presented in Chapter 4 the training set only used data from 2016. By having at least a full year represented in the training data, all seasonal variations and public holidays are represented.

• Validation set. Data ranging from 2017-01-01 to 2017-06-30. This period does not cover all parts of the year, however some holidays such as Easter and Midsummer are included.

• Test set. Data ranging from 2017-07-01 to 2018-02-28. This period covers different parts of the year and some public holidays such as the Christmas period.

Splitting the data set into one year each for the validation and test sets would probably capture all seasonal varieties better. However, the data was not split that way for two reasons. Firstly, it was assumed that the training data would then be too far back in time, meaning that deliveries that usually occurred in 2015 would not occur two years later. The data split above does not solve this problem completely, but it brings the training set and test set closer in time. Secondly, it allowed different training periods to be used in the experiments without changing the validation and test sets.


3.4.2 Grouping addresses into OD-matrices

Delivery addresses with their corresponding times needed to be grouped into time-sliced Origin Destination matrices. The time of a delivery is the latest allowable delivery time, i.e., its deadline. A cell in an OD-matrix represents the number of deliveries from one zone to another. The cells on the diagonal represent the number of deliveries within a zone. Each OD-matrix represents one time-slice, i.e., an OD-matrix contains all deliveries between two times. An example of time-sliced OD-matrices is displayed in Figure 3.8, where OD-matrices for a single week have been created with each time-slice being one day long.

Figure 3.8: OD-matrices plotted as heat maps for a sample week, one panel per day from Monday to Sunday. To see differences more easily the OD-matrices have been plotted on a logarithmic scale. 29 zones were used, resulting in 841 cells in each OD-matrix.

For the resulting model presented in Section 3.7 each time-slice was one hour long and only OD-matrices in the range 7:00 to 18:00 were included. For instance, the first OD-matrix of a day contained all deliveries with deadlines from 7:00 to 7:59, the next 8:00 to 8:59, etc.


Deliveries with times outside of this range were assumed to be other types of deliveries, such as home deliveries, which were not to be predicted. Note that, for example, a delivery with deadline 13:30 was only included in the 13:00-13:59 OD-matrix and not in any prior ones.
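A minimal sketch of this grouping step, assuming a pandas DataFrame df with the columns deadline (a datetime), origin_zone and dest_zone (integer zone ids in 0..n_zones-1); the exact handling of the 18:00 boundary is an assumption.

import numpy as np
import pandas as pd

def build_od_matrices(df, n_zones):
    # Keep only deliveries with deadlines inside the studied daily window.
    df = df[(df["deadline"].dt.hour >= 7) & (df["deadline"].dt.hour <= 18)]
    matrices = {}
    # One OD-matrix per hour-long time slice, keyed by the slice start time.
    for slot, group in df.groupby(df["deadline"].dt.floor("H")):
        od = np.zeros((n_zones, n_zones), dtype=int)
        np.add.at(od, (group["origin_zone"].to_numpy(),
                       group["dest_zone"].to_numpy()), 1)
        matrices[slot] = od
    return matrices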

3.4.3 Additional features

In addition to the prior OD-matrices themselves, some additional features were added. The additional features were

• hour {7, 8, 9, . . . , 18},

• day of week or public holiday with eight possible values,

• month with twelve possible values,

• precipitation,

• and air temperature.

The weather features are described in more detail in Section 3.2.

3.4.4 Input encoding

The input to a neural network can either be continuous, in a regression format, or in a categorical or ordinal format. If the input is continuous it is scaled to, for example, −1 to 1 if the hyperbolic tangent is used as activation function. If the input is in a categorical or ordinal format it is one-hot encoded. All inputs were one-hot encoded, since basic experiments indicated that an ordinal format performed better. The ordinal encoding is described next.

Encoding OD-matrices

Each cell was transformed into a discrete value in the set {0, 1, 2}: 0 means zero deliveries, 1 one delivery, and 2 two or more deliveries. The reason for treating everything with two or more deliveries the same is that the exact number of deliveries between zones is harder to predict and less interesting to know. The most important thing to know in the output is whether or not a transport will happen between two zones, since a delivery vehicle can usually fit more than one package. Since an ordinal input was desirable, each matrix was one-hot encoded, resulting in each matrix becoming three times as large.
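A minimal sketch of this cell encoding, assuming the OD-matrix is a NumPy integer array:

import numpy as np

def encode_od_matrix(od):
    capped = np.minimum(od.flatten(), 2)            # values in {0, 1, 2}
    one_hot = np.eye(3, dtype=np.float32)[capped]   # shape (cells, 3)
    return one_hot.reshape(-1)                      # flattened, three times as long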

Encoding additional features

The hour, day of week, and month features were one-hot encoded. The weather features, precipitation and air temperature, were split into discrete intervals (bins) and one-hot encoded. The bins for the precipitation were

{0, 0.01, 1.0, 3.0, 5.0, 7.0, 10.0, 20.0}

meaning that 2 mm of rain belongs to the 1.0 bin and everything of 20 mm and above belongs to the 20 bin. The bins for the temperatures were

{−20,−15,−10,−5, 0, 5, 10, 20, 25}

i.e. 5-degree intervals.
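One way to implement the binning is sketched below with np.digitize; the bin edges are taken from the text above, while the handling of values below the lowest edge is an assumption.

import numpy as np

PRECIP_BINS = [0, 0.01, 1.0, 3.0, 5.0, 7.0, 10.0, 20.0]
TEMP_BINS = [-20, -15, -10, -5, 0, 5, 10, 20, 25]

def encode_weather(precip_mm, temp_c):
    # Index of the bin each observation falls into (capped at the last bin).
    p = np.clip(np.digitize(precip_mm, PRECIP_BINS) - 1, 0, len(PRECIP_BINS) - 1)
    t = np.clip(np.digitize(temp_c, TEMP_BINS) - 1, 0, len(TEMP_BINS) - 1)
    # One-hot encode both bins and concatenate into 8 + 9 = 17 features.
    return np.concatenate([np.eye(len(PRECIP_BINS))[p], np.eye(len(TEMP_BINS))[t]])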

3.4.5 Constructing the LSTM input

The input to an LSTM unit needs to be in a specific shape. First of all, a target is required when training a Neural Network, requiring the problem to be transformed into a supervised learning problem. Secondly, the input needs to have multiple samples in the correct dimensions.

To transform the problem into a supervised learning problem, a sliding window technique was used. Let $x_i$ denote the OD-matrix at time $i$. If $x_i$ is the target, then all OD-matrices at $i-1$ and earlier are prior matrices, i.e., available features for training. For the next OD-matrix, $x_{i+1}$, the prior OD-matrices are instead the matrices at $x_i$ and earlier.

The tensors, the inputs to an LSTM unit, were constructed by first flattening each OD-matrix, i.e., reshaping the 2D matrix into a 1D vector. Then samples were constructed from the OD-matrices and, if required, additional features (month etc.) were appended. To construct a single sample, let the target be the flattened OD-matrix $x_i$ and let the time steps, the training vectors, be

$x_{i-10}, x_{i-9}, \ldots, x_{i-1}$

where the vectors are flattened OD-matrices with optional additional features appended.


A sample is now

$\{x_{i-10}, x_{i-9}, \ldots, x_{i-1} \mid x_i\}$

with $x_i$ being a flattened OD-matrix target free from any additional features, and $x_{i-1}$ and earlier being flattened OD-matrices for training with optional additional features appended. The next sample is the next window in the sliding window principle, beginning at $x_{i-9}$ and ending at $x_{i+1}$. The first four samples are therefore in the format

$$\{x_{i-10}, x_{i-9}, \ldots, x_{i-1} \mid x_i\}$$
$$\{x_{i-9}, x_{i-8}, \ldots, x_{i} \mid x_{i+1}\}$$
$$\{x_{i-8}, x_{i-7}, \ldots, x_{i+1} \mid x_{i+2}\}$$
$$\{x_{i-7}, x_{i-6}, \ldots, x_{i+2} \mid x_{i+3}\}$$

where the target is to the right of the $\mid$ sign. Together, a set of consecutive samples forms a batch.
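A minimal sketch of the window construction, assuming features is a 2-D array of shape (timesteps, N) where each row is a flattened one-hot encoded OD-matrix with any additional features appended, and targets holds the corresponding flattened target matrices:

import numpy as np

def make_samples(features, targets, window=10):
    X, y = [], []
    for i in range(window, len(features)):
        X.append(features[i - window:i])   # x_{i-10} ... x_{i-1}
        y.append(targets[i])               # target x_i
    # Shapes: (samples, window, N) and (samples, target size).
    return np.stack(X), np.stack(y)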

3.5 Measuring the prediction performance

To measure the performance of the prediction, the mean square error (MSE) was used, since it is a common error metric used by, for example, both Toqué et al. [55] and Azzouni and Pujolle [6].

$$\text{accuracy} = \frac{\text{correctly classified cells}}{\text{total cells}} \qquad (3.2)$$

Since it can be hard to get an intuition of how good the performance is, the F1 score was also measured. Using only the accuracy in Equation 3.2 to measure the performance would lead to a high accuracy even if the model predicted no deliveries at all, since the sparsity of the matrix means that many cells with no deliveries would be correctly classified. The F1 score considers both the precision and recall of a classification, which avoids this bias. MSE and F1 score are defined in Section 2.5.
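A sketch of how the two metrics could be computed per OD-matrix with scikit-learn, assuming y_true and y_pred are flattened categorical matrices with values in {0, 1, 2}; whether micro- or macro-averaged F1 was used is not stated in the text, so the averaging mode is an assumption.

from sklearn.metrics import f1_score, mean_squared_error

def score_matrix(y_true, y_pred):
    mse = mean_squared_error(y_true, y_pred)
    f1 = f1_score(y_true, y_pred, average="macro")
    return mse, f1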

3.6 Baseline prediction: The calendar model

To be able to compare the machine learning approach for predicting OD-matrices against something, a baseline was needed. A simple model was implemented, named the calendar model. The calendar model split the OD-matrices into slots based on hour and day type and took the average value over the training set for each slot. If there are 18 − 7 = 11 available hours in one day and eight day types (Mon-Sun & public holiday), there are in total 11 · 8 = 88 slots. To predict a day, the slot at the corresponding time and day type is simply returned. A sample result with continuous output for a full day (not an hour slot) is displayed in Figure 3.9. To get a classification prediction instead of a continuous one, all cells were transformed in the same way as described in Section 3.4.4.
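A minimal sketch of the calendar model, assuming the training OD-matrices have already been grouped into a dict keyed by (hour, day_type):

import numpy as np

def fit_calendar_model(slotted_matrices):
    # slotted_matrices: {(hour, day_type): [od_matrix, od_matrix, ...]}
    return {slot: np.mean(np.stack(ods), axis=0)
            for slot, ods in slotted_matrices.items()}

def predict_calendar(model, hour, day_type):
    return model[(hour, day_type)]   # the averaged OD-matrix for that slot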

Figure 3.9: Comparison of a real day (a, Real) and the calendar prediction (b, Calendar) for the same day.

3.7 Long-Short Term Memory based prediction

The input to the LSTM was, as described in Section 3.4.5, prior OD-matrices with optional additional features. The output was a predicted OD-matrix. A few different models were implemented; they are presented in this section.

3.7.1 Network output

Since the output from an LSTM is hard to interpret directly, a densely-connected Neural Network layer was placed as the last step in all models. The output was therefore in the same format in all models. The output is a one-hot encoded flattened OD-matrix, where each one-hot encoded cell group contains the softmax scores, i.e., the probabilities, for the different bins. The output can therefore be transformed into an OD-matrix by picking, for each cell group, the bin with the highest softmax score.

Figure 3.10: Output processing. (a) Part of the Neural Network output: 0.1 0.7 0.2 | 0.8 0.1 0.1. (b) Transformed output: 1 0.

Figure 3.10a displays a part of the output. Recall that there are three bins to choose from, which means that the six values in the figure are transformed into two cells when processing the output, as displayed in Figure 3.10b.
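A minimal sketch of this decoding step, assuming the network output is a NumPy array of softmax scores:

import numpy as np

def decode_output(softmax_output, n_zones):
    # Reshape the flattened output into one row of three scores per cell.
    scores = softmax_output.reshape(n_zones * n_zones, 3)
    cells = scores.argmax(axis=1)          # 0, 1 or "2 or more" per cell
    return cells.reshape(n_zones, n_zones)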

3.7.2 Implemented models

Three different models were implemented: one sequential and two different parallel models. This section describes the models and the different input variations that are evaluated in Chapter 4. In total there were five variations of models and inputs, described below.

SMNoAF: Sequential Model No Additional Features

The first and simplest model only took prior OD-matrices as input and no additional features such as day type. The input is described in Section 3.4.5; in short, each input sample is ten prior flattened OD-matrices.

SMNoW: Sequential Model No Weather

This model was identical to SMNoAF, except that the input also consisted of additional features (excluding the weather) appended to the OD-matrix. The model is visualised in Figure 3.11. Since the parallel models were expected to perform better, no experiment combining the weather and the sequential model was performed.


Figure 3.11: SMNoW. The prior OD-matrices and the additional features are concatenated and passed through LSTM layer(s) followed by a dense layer.

PMNoW: Parallel Model No Weather

The parallel model separates the prior OD-matrices and the additional features: one LSTM for the prior OD-matrices and one dense layer for the additional features. The input to the LSTM is ten prior OD-matrices, as in the sequential model, while the input to the dense layer is the additional features for the given time directly. The model is displayed in Figure 3.12a. This variation has no weather input. Dropout is added after the dense layer for the additional features to prevent overfitting.

PMWD: Parallel Model Weather Dense

This model was exactly the same as PMNoW, except that the weather is fed, together with the additional features, into the dense layer.

PMWLSTM: Parallel Model Weather LSTM

This model was created with the hypothesis that the weather on previous days can affect, for example, production in a factory. The idea was to feed the weather for the ten previous days into its own LSTM. The model is displayed in Figure 3.12b.


Figure 3.12: Two parallel models. (a) PMNoW: prior OD-matrices pass through LSTM layer(s) while the additional features pass through a dense layer with dropout; the two branches are concatenated into a final dense layer. (b) PMWLSTM: as in (a), but with an additional LSTM branch for the weather.
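To make the parallel architecture concrete, a minimal sketch in the spirit of PMNoW using the Keras functional API is given below (the thesis used Keras [2]). The width and activation of the dense branch, and the per-cell softmax grouping via a reshape, are assumptions; the exact layer configuration used in the thesis may differ.

from tensorflow.keras import Model, layers

def build_parallel_model(window=10, n_zones=29, n_extra=32, lstm_units=500):
    n_cells = n_zones * n_zones
    od_in = layers.Input(shape=(window, n_cells * 3), name="prior_od")
    extra_in = layers.Input(shape=(n_extra,), name="additional_features")

    od_branch = layers.LSTM(lstm_units, activation="tanh")(od_in)
    extra_branch = layers.Dropout(0.4)(layers.Dense(64, activation="relu")(extra_in))

    merged = layers.Concatenate()([od_branch, extra_branch])
    logits = layers.Dense(n_cells * 3)(merged)
    # Softmax per three-way cell group, so each cell gets its own probabilities.
    out = layers.Softmax(axis=-1)(layers.Reshape((n_cells, 3))(logits))
    return Model(inputs=[od_in, extra_in], outputs=out)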

3.7.3 Hyperparameter selection

The models have many possible hyperparameters that can be tuned and experimented with. The parameters that were experimented with are presented in Section 3.7.4, while this section states the hyperparameters and configuration used for all experiments.

Gradient-based optimisation with the ADAM optimiser was used for training. Some of the related research, for example [55, 5], also used the ADAM optimiser. The activation function used for the LSTMs was the hyperbolic tangent, since it was used by [55, 7, 25, 62, 65] among others.

Categorical cross entropy was used as the loss function for the network instead of MSE, since softmax was used as the activation function on the last dense layer to allow categorical output. The batch size used was 32, meaning the network was trained on 32 samples in each iteration.

3.7.4 Experiment setup

The models in Section 3.7.2 were evaluated and compared. In addition, hyperparameter optimisation was performed by trying different hyperparameters on the PMNoW model. A final model was also trained with the best model and hyperparameters found. The metrics used when presenting the results were the test data MSE, the F1 score on the test data, and the standard deviation of both metrics. Since the MSE and F1 score were calculated on each OD-matrix separately, the standard deviation partially served as an indication of whether the models actually managed to fit the data or simply predicted some sort of average. Due to the time needed to train all models, each experiment was run only once.

Common default parameters

Unless anything else is stated, all experiments used the configuration and hyperparameters stated below by default. The data was not shuffled, since it was treated as a time series. A code sketch of this configuration follows the list.

• 500 neurons for the LSTM layer(s).

• Training time 100 epochs.

• Early stopping: when the validation MSE has not improved for five epochs, the training is stopped. I.e., a model was trained for 100 epochs or until the validation MSE stopped improving, whichever came first.

• Regularisation with the l1 & l2 norm on all LSTM layer(s), toprevent overfitting.

• Dropout with rate 0.4.

• Train data range: 2016-01-01 to 2016-12-31

• Validation data range: 2017-01-01 to 2017-06-30

• Test data range: 2017-07-01 to 2018-02-28

• Learning rate of 0.0001

• Batch size 32 and window size 10 (10 prior OD-matrices), resulting in input tensors having the shape $32 \times 10 \times N$, where $N$ is the number of features. With 29 zones, 32 one-hot encoded hour, month, and day type features, and 17 one-hot encoded weather features, $N = 29^2 \cdot 3 + 32 + 17 = 2572$.
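A hedged sketch of how these defaults could be expressed with tf.keras; model stands for any of the models above (for example the parallel sketch in Section 3.7.2) and the data arguments are placeholder names for the preprocessed sample and target arrays. The l1 & l2 regularisation factors and the dropout are assumed to be configured on the layers themselves.

from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam

def train_with_defaults(model, X_train, y_train, X_val, y_val):
    model.compile(optimizer=Adam(learning_rate=0.0001),
                  loss="categorical_crossentropy",
                  metrics=["mse"])
    # Stop when the validation MSE has not improved for five epochs.
    early_stop = EarlyStopping(monitor="val_mse", patience=5)
    return model.fit(X_train, y_train,
                     validation_data=(X_val, y_val),
                     epochs=100, batch_size=32, shuffle=False,
                     callbacks=[early_stop])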


Training period

The viability of using both of the available years for training, versus only one year, was explored by training the model with the common parameters and changing the time range of the training data to also include 2015.

Number of epochs

Different numbers of epochs were compared; naturally, early stopping was not used. The reason for this experiment was to see what happens with the loss, validation MSE, and test MSE as the time spent training increases. For example, it could be possible that early stopping would terminate a training run because the validation MSE temporarily worsened before suddenly improving again. Another reason was to see whether overfitting occurs. The following numbers of epochs were evaluated:

• 10

• 30

• 50

• 100

• 200

Number of neurons

The number of neurons to use in the LSTM layer(s) was evaluated. The hypothesis was that too few would not work at all, while more neurons would improve the result given enough training time. The following numbers of neurons were tested:

• 50

• 100

• 250

• 500

• 750

• 1000

Number of hidden layers

In the PMNoW model the LSTM part ("LSTM layer(s)" in Figure 3.12a) had hidden layers. Two to four hidden layers were tested. Hidden layers were only tested with this model, due to time constraints and since it was assumed that the parallel models would perform better than the sequential ones.


Learning rates

Different learning rates were evaluated. The evaluated learning rates were:

• 0.001

• 0.0005

• 0.0001

• 0.00005

• 0.00001

These rates were selected because basic experiments showed that learning rates in this range seemed interesting, i.e., differences could be noted. No early stopping was used when testing learning rates, since the training loss and validation MSE were analysed over the epochs.

An experiment was also performed where the learning rate was reduced by a factor of 10 when the training loss plateaued. Plateauing in this context is defined as the training loss not improving for three epochs.
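In Keras this kind of schedule can be expressed with the ReduceLROnPlateau callback; the sketch below mirrors the factor-of-10 reduction after three epochs without improvement described above.

from tensorflow.keras.callbacks import ReduceLROnPlateau

reduce_lr = ReduceLROnPlateau(monitor="loss", factor=0.1, patience=3)
# Passed to model.fit(..., callbacks=[reduce_lr]) alongside the other callbacks.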


Chapter 4

Results

This chapter describes the results of the hyperparameter optimisation and presents the final model and hyperparameters used.

4.1 Hyperparameter optimisation

This section presents the results of the different experiments described in Section 3.7.4. MSEs have been rounded to four significant figures, other numbers to three significant figures.

4.1.1 Training period

Table 4.1: Score comparison between two different training periods.

            2015 and 2016    2016 only
MSE         0.0811           0.07791
MSE Std     0.0785           0.0747
F1          50.1%            51.7%
F1 Std      0.0987           0.0977

Table 4.1 indicates that training the model only on 2016 data performs better. Early stopping terminated the training after approximately 27 epochs for both training ranges. Basic experiments where the training and test sets were shorter and closer in time gave better scores.


4.1.2 Model comparison

Table 4.2: Score comparison between the different models. SMNoAF = Sequential Model No Additional Features. SMNoW = Sequential Model No Weather. PMNoW = Parallel Model No Weather. PMWD = Parallel Model Weather Dense. PMWLSTM = Parallel Model Weather LSTM. Cal = Calendar baseline.

          SMNoAF    SMNoW     PMNoW     PMWD      PMWLSTM   Cal
MSE       0.08749   0.08243   0.07788   0.07826   0.07705   0.1092
MSE Std   0.0822    0.0779    0.0748    0.0754    0.0734    0.103
F1        46.8%     49.0%     51.8%     51.6%     52.1%     40.8%
F1 Std    0.0964    0.0988    0.0976    0.0981    0.0984    0.108

The result is displayed in Table 4.2. All models performed better than the calendar baseline and had a lower standard deviation. Training only on prior OD-matrices already worked better than the calendar model, indicating that it is possible to learn from prior OD-matrices. PMWLSTM performed slightly better than PMNoW but PMWD performed slightly worse, which makes it hard to say whether using the weather features gives a better prediction or not.

4.1.3 Number of epochs

Figure 4.1: Comparison between the numbers of training epochs. (a) Performance on the test data (F1 score and MSE). (b) Training loss and validation MSE.


Figure 4.1 presents the results of training for different numbers of epochs. From Figure 4.1a it can be seen that the performance at 10 epochs is the worst and that the best performance is reached at 100 epochs. Figure 4.1b shows that the validation MSE is at its lowest at approximately 25 epochs; after that the validation MSE increases, which may indicate that the network starts to overfit. At 200 epochs the neural network is probably overfitted, since both the test and validation MSE have increased compared with previous epochs.

4.1.4 Number of neurons

Figure 4.2: Comparison of the performance (F1 score and MSE) when different numbers of neurons were used.

Figure 4.2 displays the result of using different numbers of neurons. Early stopping terminated the learning at 100 neurons after about 80 epochs; the training was then stopped earlier and earlier as the number of neurons increased. At 1000 neurons the training was stopped after around 23 epochs.

4.1.5 Number of hidden layers

The test scores for two to four hidden layers are displayed in Figure 4.3. In the figure, x = 1 is one LSTM layer and no hidden layers, x = 2 is two LSTM layers out of which one is hidden, and so on.


Figure 4.3: Test score (F1 score and MSE) for different numbers of added hidden LSTM layers.

The prediction performance decreases with more layers, indicating that adding more hidden layers worsens the result.

4.1.6 Learning rates

In Figure 4.4 it can be seen that the learning rate of 0.0001 gives the best prediction performance. To further see what happens with the highest learning rate (0.001), the training loss and validation MSE have been plotted in Figure 4.5a. In the figure it can be seen that the training loss quickly decreases and then slowly increases, while the validation MSE jumps up and down. Compare the two curves with the curves in Figure 4.5b, where the training loss slowly decreases and the validation MSE curve is smooth. Note that no early stopping was used.

Experiments performed both with and without early stopping did not seem to affect any score noticeably. The learning rate used for these tests was 0.0001, as specified in Section 3.7.4.


Figure 4.4: Test score (F1 score and MSE) for different learning rates.

Figure 4.5: Comparison of training loss and validation MSE for two learning rates: (a) 0.001 and (b) 0.0001.


4.2 Final model

As a final experiment the best hyperparameters found were tried. The best parameters and hyperparameters were:

• 2016 data for training.

• The PMWLSTM model.

• Training for 100 epochs or until early stopping terminates the training.

• 100 neurons.

• No hidden layers

• Learning rate of 0.0001.

• No learning rate reduction on plateauing.

As a simple test the final model was also tried with 500 neurons instead of 100. The result is displayed in Table 4.3, where it can be seen that the MSE for 100 and 500 neurons is about the same. The training for 100 neurons was terminated after about 60 epochs, for 500 neurons after about 25 epochs.

Table 4.3: Performance of the final model.

          100 neurons    500 neurons
MSE       0.07706        0.07705
MSE Std   0.073          0.0736
F1        52.0%          52.1%
F1 Std    0.1            0.0991

Figure 4.6 displays the difference in MSE between the final model and the calendar baseline for each predicted OD-matrix in the test set. Positive values mean that the baseline has a higher MSE; negative values mean that the final LSTM model performed worse than the baseline. From the figure it can be seen that only a handful of the LSTM predictions performed worse than the baseline prediction, and in those cases the MSE difference was small.


Figure 4.6: MSE difference between the final model with 500 neurons and the baseline, per future OD-matrix prediction. Positive values mean that the baseline performed worse.


Chapter 5

Discussion

This chapter analyses the hyperparameter optimisation, the model and feature selection, and the clustering. It also discusses the work, gives suggestions for future work, and finally gives some concluding remarks.

5.1 Analysis

This section analyses and discusses the results of the (hyper)parameter optimisation and model selection.

5.1.1 Parameter selection

Training period

It is a bit surprising that the model performs better when trained on only one year instead of two. It could be argued that more training data should make the model generalise better and reduce overfitting, but that does not seem to happen here. Since early stopping terminated the training after about the same number of epochs, overfitting seems to have started to occur at the same time on the 2015-2016 training set. An explanation could be that the training data is similar between the years. It is also possible that the prior OD-matrices do not contain much data that can be used for training, since they are too random.


Number of epochs

Looking at Figure 4.1 it is interesting that the test MSE is lowest at 100 epochs, despite the validation MSE seeming to indicate overfitting at 100 epochs. The effect of the overfitting does not seem to be noticeable until 200 epochs. Toqué et al. [55] experimented with over 3000 epochs of training; their plot of validation loss did not increase and indicate overfitting. The difference can maybe be explained by Toqué et al. using regression instead of classification, i.e., their input and output were continuous. It is very likely that if continuous input had been used instead, the training would require more epochs and possibly contain more trainable information.

Number of neurons

As with the epochs, it is a bit surprising that the test MSE did not decrease with more neurons. When Azzouni, Boutaba, and Pujolle [5] predicted OD-matrices, their MSE decreased when they increased the number of neurons. A simple explanation for the decreased predictive performance is that more neurons mean a more complex model requiring more training data. However, the training period experiment performed with 500 neurons indicated that adding more training data did not improve the result. Another explanation is instead that the prior OD-matrices simply do not contain enough information. This could possibly be solved by letting them be continuous input, as in most research found, or alternatively by adding more bins to the categorisation.

Number of hidden layers

Azzouni and Pujolle [7] saw increased predictive performance when adding hidden LSTM layers; the ineffectiveness here probably has the same explanation as the decreased performance when more neurons were added.

Learning rates

With the highest tested learning rate, the learning rate is too high; the reason for the uneven MSE curve in Figure 4.5a is probably that the gradient descent optimisation takes too large steps and cannot optimise. Compare the plot with Figure 4.5b, where the training loss slowly decreases and the validation MSE is smooth as the number of epochs increases. The lowest tested learning rate has a lower score than 0.0001, probably because the slower learning does not let the model learn as much within the given time limit; recall that everything was trained for 100 epochs. Note that while the learning rate of 0.0001 seems to overfit in Figure 4.5b since no early stopping was used, the lower tested learning rates are probably a bit too slow, as 100 epochs were not enough for them to perform better than 0.0001.

Regarding reduction of the learning rate on plateauing, the reason that the prediction performance did not seem to increase is most likely that the ADAM algorithm has a built-in learning rate decay [28]. It is possible that reducing the learning rate on plateauing would give better results if other optimisers were used instead of ADAM. It can however not be said that other optimisers with learning rate reduction would perform better than the ADAM optimiser.

5.1.2 Model & feature selection

It is clear that the prior OD-matrices are not completely random; they contain some sort of trainable information, since the MSE for SMNoAF was lower than the calendar baseline. Since its MSE standard deviation is closer to the other Neural Network models than to the baseline, it also indicates that SMNoAF did not simply find a single good matrix that it could always predict for a good enough score, i.e., there are variations in the predictions based on the input data.

From the experiments performed it is clear that both the prior OD-matrices and the additional features such as day type are relevant for the learning. However, it is harder to draw any conclusions about the effect of the weather features. Since the MSE for PMNoW is very similar to the MSEs for PMWD and PMWLSTM in Table 4.2, the weather cannot be said to improve the result noticeably. The reason for the MSE of PMWLSTM being the lowest could simply be randomness, or that additional noise was introduced in the model, leading to the model overfitting less on the training data and generalising better.

Mukai and Yoden [35] forecasted taxi demand in Tokyo using Neural Networks. Their results showed that adding precipitation as a binary feature (raining or not raining) did not improve the prediction. Since the precipitation did not affect the prediction performance in either Table 4.2 or in the report by Mukai and Yoden, it is assumed that neither the precipitation nor the air temperature are relevant features. It can also be assumed that more experimentation with, for example, the categorical encoding of the precipitation (Section 3.4.4) would lead to better predictions.

A problem with using the weather as a feature is the uncertainty of weather forecasts. Since the tests in Chapter 4 were performed using known weather, they did not take incorrect weather prognoses into account. Even if the weather were a feasible feature, using it in a real-world OD-matrix prediction could skew the result if the weather forecast itself were wrong.

A problem with treating the problem as a time series of OD-matrices, which was ignored in this thesis, is that OD-matrices on a Monday morning will not have good prior OD-matrices: the prior OD-matrices for the Monday morning will be from Sunday, a day probably very independent of the Monday. It is possible that it would have been better to let the prior OD-matrices for a Monday be prior Mondays, the prior OD-matrices for a Tuesday be prior Tuesdays, and so on.

5.1.3 Clustering

Even though no clustering experiments were presented in Chapter 4, there are still aspects of the zoning and clustering that need to be analysed and discussed.

The zones used

The zones used, displayed in Figure 3.7, could probably be improved if they were to be used for predictions in a real setting. Some possible issues with the zones exist. One problem is smaller neighbouring clusters outside of the more densely populated areas having fewer deliveries; it would likely be better to merge these small clusters. Some clusters north of Stockholm city also look a bit strange, for example points in Danderyd and Lovön belong to the same cluster.

Zoning approach: machine learning vs static

Since it was decided in Chapter 3 that clustering would be used, no experiments using zip codes as zones were presented in the results. As described in Section 3.3.1, using 3-digit zip codes was assumed to create too sparse matrices. However, even though the zip code approach was ruled out, it does not mean that using a non-machine-learning approach to create zones is bad. A good approach could be to combine clustering with manual post-processing of the clusters, where too many clusters are first created and then merged manually. Another alternative is to create zones completely manually, by letting, for example, a transport leader guess what would be good zones. Municipalities and city areas could also be converted into zones, perhaps with some manual processing.

It is perhaps a bit surprising that K-Means still seems to have worked well enough for the clustering, despite K-Means not having theoretical support for non-Euclidean distance metrics.

Driving times as a distance metric

The Open Source Routing Machine (OSRM) was used with OpenStreetMap (OSM) data to extract driving times. A good aspect of OSM maps is that, since the maps are community made, they are usually very up to date in Stockholm, with new roads etc. added almost instantly. The drawback of OSRM and OSM data is that they do not incorporate any information about traffic in their estimates, leading to much shorter driving times than realistic during, for example, rush hour. This may lead to less optimal clusters, since in the real world it may be faster for a delivery vehicle to drive on small roads between locations inside a city area than to go between areas on larger main roads with queues.

A small error source is that OSRM may return incorrect driving times; however, this should only affect stray coordinates, assuming that the created centroids are not placed in strange locations.

Performance metrics

One issue this report has not answered is how good the zones are. Letting a human subjectively decide whether zones are good is not the same as having a calculated score. One approach to creating a performance metric for the zones would be to deploy the whole prediction system with a routing engine and let the sum of all real driving times be the metric. Note that this metric would also measure the performance of the predictions.


5.2 Future work

Naturally, all research can always be improved. This section aims to present the most important improvements that are feasible to do. More hyperparameter tuning can always be performed, but this section will not focus on that.

5.2.1 Zone creation

As described in Section 5.1.3, the zones could be improved. The next step to create better clusters would be to create even more clusters using either agglomerative clustering or K-Means, and then manually merge neighbouring clusters. It would also be possible to run the agglomerative clustering recursively on large clusters to split them into smaller ones.

5.2.2 Model enhancements

To be able to train a more advanced model (more neurons, more hidden layers, etc.), the input could be changed from splitting each cell into three bins to using more bins, or to letting the cells be scaled continuous input. To make more hidden layers work better, ResNet-style skip connections could perhaps improve the performance: the original input would be combined with the output from the last hidden layer, which would lead to more information about the prior OD-matrices being taken into account when giving a prediction.

To increase the prediction performance from a more long-term perspective, the simplest solution would be to retrain the model with newly observed data at regular intervals, for example once a day. Always trying to predict the next day would likely lead to better results on average than predicting transports months in advance. Another solution for better long-term performance would be to somehow capture seasonal variety better, for example by extracting any seasonal trends in the data pre-processing and then adding the trend back in the prediction.

More features could perhaps be identified and experimented with. Long-term features such as the economic climate would probably not be able to predict short-term transports; however, features such as when goods trains arrive could perhaps give better short-term prediction performance. It is however unclear how these features would be retrieved.

Another model enhancement would be to combine long-term data (prior year(s)) with short-term data (prior months/weeks). Long-term data could capture seasonal varieties and special days, while short-term data could contain more details about where deliveries are currently being delivered.

5.2.3 Measuring the prediction confidence

Since categorical cross entropy has been used, the softmax scores could be used as a confidence metric for a model's prediction. If softmax gives a very high score for one cell value and low scores for the others, the confidence is high. If the softmax scores are more uniformly distributed, the confidence is lower. The problem is however that simply using softmax scores as a confidence metric would not take the uncertainty of the model itself into account. Gal [17] describes how dropout etc. can be used, even with regression, to get a confidence score, which would be a good start if a better confidence metric is desired.
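As a rough illustration of the idea, the per-cell maximum softmax score could be averaged into a single confidence number; this particular aggregation is an assumption and, as noted above, does not capture model uncertainty.

import numpy as np

def prediction_confidence(softmax_scores):
    # softmax_scores: array of shape (cells, 3) for one predicted OD-matrix.
    # 1/3 means completely uniform scores, 1.0 means fully confident cells.
    return float(softmax_scores.max(axis=1).mean())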

5.2.4 Evaluate a realistic scenario

The next step for this thesis would be to evaluate whether or not the complete approach to predicting OD-matrices would work in a real setting. By using the route planning software used by the company, with known deliveries and future prognoses combined, a vehicle suggestion could be retrieved. For a basic test the test set could be used, where the suggested cars were used for deliveries; however, measuring the result would not be trivial.

Comparing the algorithm-suggested vehicles with the vehicles selected by the transport leaders would perhaps not be the best metric, since it would assume that the transport leaders' vehicle selections are the best selections. Measuring the total driving distance would solve the problem of the transport leaders not always selecting the best vehicles, however it would not take into account that different vehicles are required for different types of jobs.

Another evaluation that could be performed would instead be to present the vehicle suggestions to a transport leader, who would then either use a suggestion or possibly give a short explanation of why a presented suggestion was bad.

5.3 The Long-Short Term Memory based approach

It is naturally possible that completely different approaches would perform better than the LSTM-based approach. The most common other approaches for predicting OD-matrices found in the literature study, such as ARIMA, would probably not perform better, since no other research found indicates that. Since the prediction was modelled as a classification problem, it is possible that other machine learning approaches would perform satisfactorily. Some research found in the literature study, such as [11], indicated that non-LSTM based neural networks performed very poorly; however, in those papers regressive output was used.

5.4 Contribution to research

Compared with the research found in Section 2.6 where OD-matrices were predicted, this report has some differences. First of all, a custom clustering approach with outlier handling was used for the zoning. Secondly, this report evaluated the concept of OD-matrices on a smaller scale; other research, e.g. Azzouni, Boutaba, and Pujolle [5], Vinayakumar, Soman, and Poornachandran [60], and Hua et al. [25] among others, used the GÉANT data set [53], which consists of OD-matrices for data network traffic. Compared with the GÉANT data set, the data set used in this thesis was smaller and more sparse. The prior OD-matrices describing deliveries usually contained none or only a couple of deliveries between zones, while the GÉANT data set always has higher traffic volumes between zones.

Finally, this thesis did not use input and output in a regression format as in all other research found, since output in a categorical format was deemed easier to predict and not less interesting than output in a continuous format. In other research areas it is usually more interesting to know approximately how much demand will occur between zones, for example in the research by Toqué et al. [55] where public transport demand was forecasted. In this thesis it is more important to know whether a transport will happen or not, rather than whether two or four transports will happen, as delivery vehicles can usually carry several packages at the same time.

5.5 Conclusions

This thesis tried to predict future delivery orders in the Stockholm area using Long-Short Term Memory Recurrent Neural Networks. The predictions created with the LSTM model performed better than the baseline in the evaluation framework used. A major part of the work presented was parameter optimisation.

The research question

However, to completely answer the research question "How can transport management be assisted using predictions based on historical data?", the predictions would need to be evaluated in a more realistic scenario. The evaluation is possible to conduct, but due to time constraints it was not performed. One reason was that a user interface for a transport leader inside an existing system would have needed to be implemented, which would have taken too much time from other parts of the thesis work. This thesis instead focused on the prediction model.

Deploying the prediction system

For deploying the prediction system it would probably be best to create clusters with some manual post-processing, or alternatively to create clusters completely by hand. Since clusters only need to be created once, it is perhaps not infeasible to use manual labour in their creation. The LSTM approach should not care about the clusters, as long as there are not too few and not too many and the clusters are of preferably equal sizes. Note that clusters of equal sizes leading to better results is an assumption and has not been proven in this report.

Data conclusions

Since all LSTM models gave better predictions than the baseline, using LSTM RNNs to predict OD-matrices can be said to be feasible. It is however hard to say how much predictive information the prior OD-matrices actually contain; obviously they contain some information, because SMNoAF worked better than the baseline, but the results could have been better.

It could be argued that future deliveries are mostly independent of prior deliveries. However, due to the long time difference between the training and test sets, in combination with possibly better prediction methods existing, it cannot be said for certain that future deliveries are independent. Recall that some basic experiments performed with the training and test data closer in time increased the predictive performance.

Nonetheless, the results indicate that future deliveries seem to be quite random and it is unclear whether additional features would boost the score noticeably. Even if, for example, goods train arrivals were added as a feature, a train arriving at the same time every week is something that even the baseline should be able to predict.

Weather features

Using the weather as a feature may give slightly more predictive power, but since weather prognoses are not always correct and the increase in predictive power from the weather is too small to draw any conclusions from, it might be better to skip the weather in predictions.


Bibliography

[1] Martín Abadi et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. 2015. URL: https://www.tensorflow.org/.

[2] François Chollet et al. Keras. 2015. URL: https://keras.io/.

[3] Javier Alonso-Mora, Alex Wallar, and Daniela Rus. "Predictive Routing for Autonomous Mobility-on-Demand Systems with Ride-Sharing". In: IEEE, 2017-09, pp. 3583–3590. ISBN: 978-1-5386-2682-5. DOI: 10.1109/IROS.2017.8206203. URL: http://ieeexplore.ieee.org/document/8206203/ (visited on 2018-04-04).

[4] David Arthur and Sergei Vassilvitskii. "K-Means++: The Advantages of Careful Seeding". In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. SODA '07. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 2007, pp. 1027–1035. ISBN: 978-0-89871-624-5. URL: http://dl.acm.org/citation.cfm?id=1283383.1283494 (visited on 2018-07-02).

[5] Abdelhadi Azzouni, Raouf Boutaba, and Guy Pujolle. "NeuRoute: Predictive Dynamic Routing for Software-Defined Networks". In: 2017 13th International Conference on Network and Service Management (CNSM). 2017-11, pp. 1–6. DOI: 10.23919/CNSM.2017.8256059.

[6] Abdelhadi Azzouni and Guy Pujolle. "A Long Short-Term Memory Recurrent Neural Network Framework for Network Traffic Matrix Prediction". In: (2017-05-16). arXiv: 1705.05690 [cs]. URL: http://arxiv.org/abs/1705.05690 (visited on 2018-04-22).

[7] Abdelhadi Azzouni and Guy Pujolle. "NeuTM: A Neural Network-Based Framework for Traffic Matrix Prediction in SDN". In: (2017-10-17). arXiv: 1710.06799 [cs]. URL: http://arxiv.org/abs/1710.06799 (visited on 2018-05-01).


[8] Gerardo Berbeglia, Jean-François Cordeau, and Gilbert Laporte. "Dynamic Pickup and Delivery Problems". In: European Journal of Operational Research 202.1 (2010-04), pp. 8–15. ISSN: 03772217. DOI: 10.1016/j.ejor.2009.04.024. URL: http://linkinghub.elsevier.com/retrieve/pii/S0377221709002999 (visited on 2018-04-05).

[9] Felix Björklund. Uppkopplingen låter lastbilen maximeras. 2018-01-18. URL: http://telekomidag.se/uppkopplingen-ar-en-otrolig-styrka-oss/ (visited on 2018-06-05).

[10] Olli Bräysy and Michel Gendreau. "Vehicle Routing Problem with Time Windows, Part I: Route Construction and Local Search Algorithms". In: Transportation Science 39.1 (2005), pp. 104–118. ISSN: 0041-1655. JSTOR: 25769233.

[11] Qixiu Cheng et al. Analysis and Forecasting of the Day-to-Day Travel Demand Variations for Large-Scale Transportation Networks: A Deep Learning Approach. 2016-12-16. DOI: 10.13140/RG.2.2.12753.53604.

[12] Jean-François Cordeau and Gilbert Laporte. "The Dial-a-Ride Problem: Models and Algorithms". In: Annals of Operations Research 153.1 (2007-09-01), pp. 29–46. ISSN: 0254-5330, 1572-9338. DOI: 10.1007/s10479-007-0170-8. URL: http://link.springer.com/article/10.1007/s10479-007-0170-8 (visited on 2018-04-25).

[13] G. B. Dantzig and J. H. Ramser. "The Truck Dispatching Problem". In: Management Science 6.1 (1959), pp. 80–91. ISSN: 0025-1909. JSTOR: 2627477.

[14] DHL. Logistics Trend Radar. 2016. URL: http://www.dhl.com/content/dam/downloads/g0/about_us/logistics_insights/dhl_logistics_trend_radar_2016.pdf.

[15] Truls Flatberg et al. "Dynamic And Stochastic Vehicle Routing In Practice". In: Dynamic Fleet Management. Operations Research/Computer Science Interfaces Series. Springer, Boston, MA, 2007, pp. 41–63. ISBN: 978-0-387-71721-0 978-0-387-71722-7. DOI: 10.1007/978-0-387-71722-7_3. URL: http://link.springer.com/chapter/10.1007/978-0-387-71722-7_3 (visited on 2018-04-06).

[16] Fleet101 | Solutions to deliver. 2018. URL: http://www.fleet101.se/ (visited on 2018-04-04).

[17] Yarin Gal. What My Deep Model Doesn't Know... | Yarin Gal - Blog | Cambridge Machine Learning Group. 2015-06-03. URL: http://mlg.eng.cam.ac.uk/yarin/blog_3d801aa532c1ce.html (visited on 2018-04-20).


[18] Guojun Gan and Michael Kwok-Po Ng. "K-Means Clustering with Outlier Removal". In: Pattern Recognition Letters 90 (2017-04-15), pp. 8–14. ISSN: 0167-8655. DOI: 10.1016/j.patrec.2017.03.008. URL: http://www.sciencedirect.com/science/article/pii/S0167865517300740 (visited on 2018-06-26).

[19] Rodrigo Garrido and Hani Mahmassani. "Forecasting Freight Transportation Demand with the Space–Time Multinomial Probit Model". In: Transportation Research Part B: Methodological 34.5 (2000-06-01), pp. 403–418. ISSN: 0191-2615. DOI: 10.1016/S0191-2615(99)00032-6. URL: http://www.sciencedirect.com/science/article/pii/S0191261599000326 (visited on 2018-04-06).

[20] Rodrigo Garrido and Hani Mahmassani. "Forecasting Short-Term Freight Transportation Demand: Poisson STARMA Model". In: Transportation Research Record: Journal of the Transportation Research Board 1645 (1998-01), pp. 8–16. ISSN: 0361-1981. DOI: 10.3141/1645-02. URL: http://trrjournalonline.trb.org/doi/10.3141/1645-02 (visited on 2018-04-25).

[21] Greater Than AB. Analysis of the European Road Freight Market. 2011-06. URL: https://ec.europa.eu/clima/sites/clima/files/docs/0012/registered/greater_than_analysis_road_freight_market_en.pdf.

[22] Klaus Greff et al. "LSTM: A Search Space Odyssey". In: IEEE Transactions on Neural Networks and Learning Systems 28.10 (2017-10), pp. 2222–2232. ISSN: 2162-237X, 2162-2388. DOI: 10.1109/TNNLS.2016.2582924. arXiv: 1503.04069. URL: http://arxiv.org/abs/1503.04069 (visited on 2018-05-02).

[23] Simon S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, 1999. 842 pp. ISBN: 978-0-13-908385-3.

[24] Kaiming He et al. "Deep Residual Learning for Image Recognition". In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE, 2016-06, pp. 770–778. ISBN: 978-1-4673-8851-1. DOI: 10.1109/CVPR.2016.90. URL: http://ieeexplore.ieee.org/document/7780459/ (visited on 2018-09-20).

[25] Yuxiu Hua et al. "Traffic Prediction Based on Random Connectivity in Deep Learning with Long Short-Term Memory". In: (2017-11-08). arXiv: 1711.02833 [cs]. URL: http://arxiv.org/abs/1711.02833 (visited on 2018-06-12).


[26] Soumia Ichoua, Michel Gendreau, and Jean-Yves Potvin. “ExploitingKnowledge About Future Demands for Real-Time Vehicle Dispatch-ing”. In: Transportation Science 40.2 (2006-05), pp. 211–225. ISSN: 0041-1655, 1526-5447. DOI: 10.1287/trsc.1050.0114. URL: http://pubsonline.informs.org/doi/abs/10.1287/trsc.1050.0114 (visited on 2018-04-06).

[27] Soumia Ichoua, Michel Gendreau, and Jean-Yves Potvin. “Planned RouteOptimization For Real-Time Vehicle Routing”. In: Dynamic Fleet Man-agement. Operations Research/Computer Science Interfaces Series. Springer,Boston, MA, 2007, pp. 1–18. ISBN: 978-0-387-71721-0 978-0-387-71722-7.DOI: 10.1007/978-0-387-71722-7_1. URL: http://link.springer.com/chapter/10.1007/978-0-387-71722-7_1(visited on 2018-04-06).

[28] Diederik P. Kingma and Jimmy Ba. “Adam: A Method for StochasticOptimization”. In: (2014-12-22). arXiv: 1412.6980 [cs]. URL: http://arxiv.org/abs/1412.6980 (visited on 2018-06-27).

[29] Allan Larsen. The Dynamic Vehicle Routing Problem. Red. by Oli B.G.Madsen. Technical University of Denmark (DTU), 2000-12.

[30] Jan Karel Lenstra and Alexander Rinnooy Kan. “Complexity of Vehi-cle Routing and Scheduling Problems”. In: Networks 11.2 (1981-06-01),pp. 221–227. ISSN: 1097-0037. DOI: 10.1002/net.3230110211. URL:http://onlinelibrary.wiley.com/doi/abs/10.1002/net.3230110211 (visited on 2016-04-29).

[31] Xianghua Li et al. “A Hybrid Algorithm for Estimating Origin-DestinationFlows”. In: IEEE Access 6 (2018), pp. 677–687. DOI: 10.1109/ACCESS.2017.2774449.

[32] Dennis Luxen and Christian Vetter. “Real-Time Routing with Open-StreetMap Data”. In: Proceedings of the 19th ACM SIGSPATIAL Interna-tional Conference on Advances in Geographic Information Systems. GIS ’11.New York, NY, USA: ACM, 2011, pp. 513–516. ISBN: 978-1-4503-1031-4.DOI: 10.1145/2093973.2094062. URL: http://doi.acm.org/10.1145/2093973.2094062 (visited on 2018-06-06).

[33] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze.Introduction to Information Retrieval. New York, NY, USA: CambridgeUniversity Press, 2008. ISBN: 978-0-521-86571-5.

[34] Deng Ming-jun and Qu Shi-ru. Fuzzy State Transition and Kalman Filter Applied in Short-Term Traffic Flow Forecasting. 2015. URL: https://www.hindawi.com/journals/cin/2015/875243/ (visited on 2018-04-25).

[35] Naoto Mukai and Naoto Yoden. “Taxi Demand Forecasting Based on Taxi Probe Data by Neural Network”. In: Intelligent Interactive Multimedia: Systems and Services. Smart Innovation, Systems and Technologies. Springer, Berlin, Heidelberg, 2012, pp. 589–597. ISBN: 978-3-642-29933-9, 978-3-642-29934-6. DOI: 10.1007/978-3-642-29934-6_57. URL: http://link.springer.com/chapter/10.1007/978-3-642-29934-6_57 (visited on 2018-06-06).

[36] William P. Nanry and J. Wesley Barnes. “Solving the Pickup and Delivery Problem with Time Windows Using Reactive Tabu Search”. In: Transportation Research Part B: Methodological 34.2 (2000-02-01), pp. 107–121. ISSN: 0191-2615. DOI: 10.1016/S0191-2615(99)00016-8. URL: http://www.sciencedirect.com/science/article/pii/S0191261599000168 (visited on 2018-04-20).

[37] Michael A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015. URL: http://neuralnetworksanddeeplearning.com (visited on 2018-04-26).

[38] OpenStreetMap contributors. OpenStreetMap. URL: https://www.openstreetmap.org/copyright (visited on 2018-05-25).

[39] D. C. Park et al. “Electric Load Forecasting Using an Artificial Neural Network”. In: IEEE Transactions on Power Systems 6.2 (1991-05), pp. 442–449. ISSN: 0885-8950. DOI: 10.1109/59.76685.

[40] Fabian Pedregosa et al. “Scikit-Learn: Machine Learning in Python”. In: Journal of Machine Learning Research 12 (2011-10), pp. 2825–2830. ISSN: 1533-7928. URL: http://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html (visited on 2018-06-06).

[41] Chengbin Peng et al. “Collective Human Mobility Pattern from Taxi Trips in Urban Area”. In: PLoS ONE 7.4 (2012-04-18). ISSN: 1932-6203. DOI: 10.1371/journal.pone.0034487. PMID: 22529917. URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3329492/ (visited on 2018-04-25).

[42] Anders Peterson. “The Origin-Destination Matrix Estimation Problem: Analysis and Computations”. Linköping University, Department of Science and Technology, 2007. 40 pp. URL: http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-8859 (visited on 2018-04-26).

[43] Victor Pillac et al. “A Review of Dynamic Vehicle Routing Problems”. In: European Journal of Operational Research 225.1 (2013), pp. 1–11. DOI: 10.1016/j.ejor.2012.08.015. URL: https://hal.archives-ouvertes.fr/hal-00739779 (visited on 2018-04-06).

[44] Postnummer. URL: http://www.postnord.se/information/om-postnord/samhalle/postnummer (visited on 2018-04-26).

[45] Abraham P. Punnen. “The Traveling Salesman Problem: Applications, Formulations and Variations”. In: The Traveling Salesman Problem and Its Variations. Combinatorial Optimization. Springer, Boston, MA, 2007, pp. 1–28. ISBN: 978-0-387-44459-8, 978-0-306-48213-7. DOI: 10.1007/0-306-48213-4_1. URL: http://link.springer.com/chapter/10.1007/0-306-48213-4_1 (visited on 2018-04-23).

[46] Nils-Hassan Quttineh et al. “Military Aircraft Mission Planning: A Generalized Vehicle Routing Model with Synchronization and Precedence”. In: EURO Journal on Transportation and Logistics 2.1-2 (2013-05-01), pp. 109–127. ISSN: 2192-4376, 2192-4384. DOI: 10.1007/s13676-013-0023-3. URL: http://link.springer.com/article/10.1007/s13676-013-0023-3 (visited on 2018-04-23).

[47] Ulrike Ritzinger, Jakob Puchinger, and Richard F. Hartl. “A Survey on Dynamic and Stochastic Vehicle Routing Problems”. In: International Journal of Production Research 54.1 (2016-01-02), pp. 215–231. ISSN: 0020-7543, 1366-588X. DOI: 10.1080/00207543.2015.1043403. URL: http://www.tandfonline.com/doi/full/10.1080/00207543.2015.1043403 (visited on 2018-03-29).

[48] Doris Sáez, Cristián E. Cortés, and Alfredo Núñez. “Hybrid Adaptive Predictive Control for the Multi-Vehicle Dynamic Pick-up and Delivery Problem Based on Genetic Algorithms and Fuzzy Clustering”. In: Computers & Operations Research 35.11 (2008-11), pp. 3412–3438. ISSN: 0305-0548. DOI: 10.1016/j.cor.2007.01.025. URL: http://linkinghub.elsevier.com/retrieve/pii/S0305054807000287 (visited on 2018-04-06).

[49] Michael Schilde, Karl F. Doerner, and Richard F. Hartl. “Metaheuristics for the Dynamic Stochastic Dial-a-Ride Problem with Expected Return Transports”. In: Computers & Operations Research 38.12 (2011-12-01), pp. 1719–1730. ISSN: 0305-0548. DOI: 10.1016/j.cor.2011.02.006. URL: http://www.sciencedirect.com/science/article/pii/S0305054811000475 (visited on 2018-04-24).

[50] SMHI Opendata. URL: https://opendata-download-metobs.smhi.se/ (visited on 2018-06-10).

[51] Michael R. Swihart and Jason D. Papastavrou. “A Stochastic and Dynamic Model for the Single-Vehicle Pick-up and Delivery Problem”. In: European Journal of Operational Research 114.3 (1999-05-01), pp. 447–464. ISSN: 0377-2217. DOI: 10.1016/S0377-2217(98)00260-4. URL: http://www.sciencedirect.com/science/article/pii/S0377221798002604 (visited on 2018-04-05).

[52] Christine Taylor. Artificial Intelligence and Logistics Is Transforming Business. 2017-10-25. URL: https://www.datamation.com/big-data/artificial-intelligence-and-logistics-is-transforming-business.html (visited on 2018-03-28).

[53] The TOTEM Project. URL: https://totem.info.ucl.ac.be/dataset.html (visited on 2018-06-08).

[54] Yongxue Tian and Li Pan. “Predicting Short-Term Traffic Flow by Long Short-Term Memory Recurrent Neural Network”. In: 2015 IEEE International Conference on Smart City/SocialCom/SustainCom (SmartCity). 2015-12, pp. 153–158. DOI: 10.1109/SmartCity.2015.63.

[55] Florian Toqué et al. “Forecasting Dynamic Public Transport Origin-Destination Matrices with Long-Short Term Memory Recurrent Neural Networks”. In: 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC). 2016-11, pp. 1071–1076. DOI: 10.1109/ITSC.2016.7795689.

[56] Trafikanalys. Transportbranschens Ekonomi [The Economy of the Transport Industry]. 2016-12-19. URL: https://www.trafa.se/transportforetag/transportbranschens-ekonomi/ (visited on 2018-06-06).

[57] Theodore Tsekeris and Charalambos Tsekeris. “Demand Forecasting in Transport: Overview and Modeling Advances”. In: Economic Research - Ekonomska istraživanja 24.1 (2011-03-01), pp. 82–94. ISSN: 1331-677X. URL: https://hrcak.srce.hr/index.php?show=clanak&id_clanak_jezik=101118 (visited on 2018-03-28).

[58] Matti van Engelen et al. “Enhancing Flexible Transport Services with Demand-Anticipatory Insertion Heuristics”. In: Transportation Research Part E: Logistics and Transportation Review 110 (2018-02-01), pp. 110–121. ISSN: 1366-5545. DOI: 10.1016/j.tre.2017.12.015. URL: http://www.sciencedirect.com/science/article/pii/S1366554517307810 (visited on 2018-03-29).

[59] “Vector Autoregressive Models for Multivariate Time Series”. In: Modeling Financial Time Series with S-PLUS®. Springer, New York, NY, 2006, pp. 385–429. ISBN: 978-0-387-27965-7, 978-0-387-32348-0. DOI: 10.1007/978-0-387-32348-0_11. URL: http://link.springer.com/chapter/10.1007/978-0-387-32348-0_11 (visited on 2018-04-30).

[60] R. Vinayakumar, K. P. Soman, and P. Poornachandran. “Applying Deep Learning Approaches for Network Traffic Prediction”. In: 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI). 2017-09, pp. 2353–2358. DOI: 10.1109/ICACCI.2017.8126198.

[61] Stefan Vonolfen and Michael Affenzeller. “Distribution of Waiting Time for Dynamic Pickup and Delivery Problems”. In: Annals of Operations Research 236.2 (2016-01-01), pp. 359–382. ISSN: 0254-5330, 1572-9338. DOI: 10.1007/s10479-014-1683-6. URL: http://link.springer.com/article/10.1007/s10479-014-1683-6 (visited on 2018-04-06).

[62] Jun Xu et al. “Real-Time Prediction of Taxi Demand Using Recurrent Neural Networks”. In: (), p. 10.

[63] G. Peter Zhang. “Time Series Forecasting Using a Hybrid ARIMA and Neural Network Model”. In: Neurocomputing 50 (2003-01-01), pp. 159–175. ISSN: 0925-2312. DOI: 10.1016/S0925-2312(01)00702-0. URL: http://www.sciencedirect.com/science/article/pii/S0925231201007020 (visited on 2018-04-27).

[64] Junbo Zhang, Yu Zheng, and Dekang Qi. “Deep Spatio-Temporal Residual Networks for Citywide Crowd Flows Prediction”. In: Microsoft Research (2016-11-27). URL: https://www.microsoft.com/en-us/research/publication/deep-spatio-temporal-residual-networks-for-citywide-crowd-flows-prediction/ (visited on 2018-04-25).

[65] Jianlong Zhao et al. “Towards Traffic Matrix Prediction with LSTM Recurrent Neural Networks”. In: Electronics Letters 54.9 (2018), pp. 566–568. ISSN: 0013-5194. DOI: 10.1049/el.2018.0336.

[66] Zhong Zheng, Soora Rasouli, and Harry Timmermans. “Modeling Taxi Driver Anticipatory Behavior”. In: Computers, Environment and Urban Systems 69 (2018-05-01), pp. 133–141. ISSN: 0198-9715. DOI: 10.1016/j.compenvurbsys.2018.01.008. URL: http://www.sciencedirect.com/science/article/pii/S0198971517303897 (visited on 2018-04-25).
