
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2017

Implementation and evaluation of selected Machine Learning algorithms on a resource constrained telecom hardware platform

SEBASTIAN LEBORG

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION


Sebastian Leborg

September 18, 2017

Master’s thesis in Computer Science done at CSC
Industrial supervisor: Andreas Ermedahl
Academic supervisor: Pawel Herman
Examiner: Erwin Laure

KEYWORDS: Machine Learning, Resource constraints, Parallelism, Telecom, 5G


Abstract

The vast majority of computing hardware platforms available today are not desktop PCs. They are embedded systems, sensors and small specialized pieces of hardware present in almost every digital product available today. Due to the massive amount of information available through these devices we can find new and exciting ways to apply and benefit from machine learning. Many of these computing devices have specialized, resource-constrained architectures and it might be problematic to perform complicated computations. If such a system is under heavy load or has restricted performance, computational power is a valuable resource and costly algorithms must be avoided.

This master thesis will present an in-depth study investigating the trade-offs between precision, latency and memory consumption of a selected set of machine learning algorithms implemented on a resource-constrained multi-core telecom hardware platform. This report includes motivations for the selected algorithms, discusses the results of the algorithms’ execution on the hardware platform and offers conclusions relevant to further developments.

Implementation and evaluation of selected machine learning algorithms on a resource-constrained telecom hardware platform

Summary

The majority of the computing platforms available today are not stationary desktop computers. They are embedded systems, sensors and small specialized pieces of hardware found in almost all digital products available today. Because of the enormous amount of information available via these devices, we can find new and exciting ways to benefit from machine learning. Many of these computers have specialized, resource-constrained architectures, and it can be problematic to perform the complicated computations that are needed. If such a system is heavily loaded or has limited performance, computational power is a valuable resource and costly algorithms must be avoided.

This master’s project will present an in-depth study that investigates the trade-offs between precision, latency and memory consumption of a selected set of machine learning algorithms implemented on a resource-constrained multi-core telecom hardware platform. This report contains motivations for the chosen algorithms, discusses the results of the algorithms on the hardware platform and presents conclusions that are relevant for further development.


Contents

1 Introduction
    1.1 Problem statement
    1.2 Project scope
    1.3 Thesis outline

2 Background
    2.1 Target platform
    2.2 Flake
    2.3 Beamforming
    2.4 Machine Learning
    2.5 Learning and prediction
    2.6 Neural Networks
        2.6.1 Multi layer perceptron
        2.6.2 Backpropagation of error
        2.6.3 Overfitting
    2.7 Support Vector Machines
        2.7.1 Linear classification
        2.7.2 Kernels
        2.7.3 Sequential Minimal Optimization
        2.7.4 Multiclass classification
    2.8 Random Forest
        2.8.1 Decision trees
        2.8.2 Gini index
        2.8.3 Mean squared error
    2.9 k-fold cross validation

3 Related work

4 Method
    4.1 Experiments and Data sets
        4.1.1 Iris
        4.1.2 Wine quality
        4.1.3 Beamforming
        4.1.4 Experiment setup
    4.2 Parallelism
    4.3 The memory hierarchy and the selected algorithms
        4.3.1 Random Forest
        4.3.2 Neural Network
        4.3.3 Support vector machine

5 Results
    5.1 Classification
    5.2 Regression
        5.2.1 Wine Quality
        5.2.2 Beamforming
    5.3 Variability

6 Discussion
    6.1 Neural Network
    6.2 Support Vector Machine
    6.3 Random Forest
    6.4 Limitations
    6.5 Thesis implications
        6.5.1 Ethics and Sustainability

7 Conclusions and Future work
    7.1 Conclusions
    7.2 Future work
        7.2.1 Training off-target
        7.2.2 Local and Common memory in tandem
        7.2.3 Parallelism

Chapter 1

Introduction

We currently live in the so-called computer age. More and more people get access to smart handheld devices that allow both technology and people to connect and share information. This has led to a large increase in interconnected devices communicating over the mobile network. According to forecasts from Cisco [18], global mobile data traffic will increase from 7 exabytes (EB) per month in 2016 to 49 EB per month in 2021. In order to deal with this massive quantity of data, network providers construct advanced systems of interconnected components that work in tandem to provide the mobile network ecosystem. An example of such a component is the Radio Base Station (RBS) that Ericsson develops. The RBS is a system that uses radio signals to receive, handle and transmit the data going through these networks. One critical step in handling this data is referred to as signal processing, which concerns the analysis, synthesis and modification of data transmitted over the radio medium.

Ericsson’s RBSes are built to process large amounts of data. Signal processing in an RBS requires numerous mathematical operations, including encoding and modulation/demodulation, to be performed repeatedly on a series of data samples under short and hard deadlines. The RBSes often perform these computations for many devices in parallel. These requirements lead to hardware platforms with very specialized architectures including many parallel compute units, hardware accelerators and fast but often limited memory [33]. Any algorithm running in parallel to the primary signal processing algorithms has to be carefully designed to avoid any excessive consumption of valuable resources.


Figure 1.1: Illustration of the different radio access network components used in 5G networks. Taken from [10] which is under submission.

The arrival of 5G will increase the load and requirements on the current systems that deal with telecommunication [18]. In order to handle these requirements the Radio Access Network (RAN), as seen in Figure 1.1, has to be very efficient with a high level of automation. However, as the load in the RBSes increases, more and more data becomes available. This data can be used to obtain insights about user patterns, the performance of the RBS components and many other things that can help us create a more efficient RAN. Machine learning is the practice of creating algorithms that learn from data. It has the potential to identify patterns and correlations in data and can provide efficient and data-driven solutions to many problems that the RBSes face, such as beamforming or radio resource scheduling. Machine learning technologies are, however, often used on systems with a great amount of computational resources. Any algorithm operating on the specialized RBS has to be carefully designed with resource limitations in mind due to the large amount of radio traffic that requires processing.

The Ericsson Many-core Architecture (EMCA) is the primary compute unit and a key component in Ericsson’s RBS. It appears in 3G (WCDMA), 4G (LTE) and 5G RBSes. The 5G RBS is illustrated in Figure 1.1. This thesis provides Ericsson with an investigation regarding the applicability of the selected machine learning algorithms on the EMCA. This is done by measuring performance metrics such as the scalability and resource consumption of the algorithms when operating on the EMCA. In the future this will serve as input regarding the utilization of machine learning algorithms on the EMCA and resource-constrained hardware platforms with characteristics similar to the EMCA. This will provide input to Ericsson in their future RAN hardware and software design decisions.

1.1 Problem statement

In order to deal with the increasing demand for radio and internet traffic, more efficient RAN solutions need to be developed. This thesis intends to investigate the performance trade-off between prediction accuracy, memory footprint and latency for a selected set of machine learning algorithms. The algorithms are to be adapted to and operate on the EMCA.

The research questions considered in this thesis concern the scalability and applicability of the selected machine learning algorithms when applied to the EMCA. Can the selected machine learning algorithms run on embedded platforms with characteristics similar to the EMCA? How do these algorithms need to be adapted in order to do so? And how do they scale in terms of latency, memory requirements and precision when adapted to the EMCA?

The objective is to evaluate and compare the selected algorithms based on the performance and required resources during both training and prediction. The first task is to find algorithms that operate with a sufficiently low memory footprint and can deliver predictions within short deadlines. Short deadlines are important because the EMCA is designed for signal processing operations and relies on a steady throughput of data. It continuously processes data in fixed time intervals, which means that any algorithm running on the system needs to be reliable and fast in a changing environment. This is a serious issue because in many cases a large amount of the platform’s resources is required for other computations.


1.2 Project scope

The goal of this thesis is not to find resource efficient implementations of the selected algorithms but to investigate the scalability of the selected machine learning algorithms. The focus of this thesis is limited to state-of-the-practice implementations of machine learning algorithms adapted to the EMCA.

The implementation objective of this thesis is to create a library of machine learning algorithms operating on the EMCA. The research objective is to examine the trade-offs between accuracy and computational quantities such as latency and memory in the selected set of machine learning algorithms. The evaluation has been performed on the EMCA using toy data sets and a test case provided by Ericsson. The test case constitutes a task that the EMCA can perform using machine learning methods.

1.3 Thesis outline

The rest of this thesis is organized as follows: Section 2 presents the relevant background information for the selected algorithms along with relevant literature that provides more thorough mathematical proofs for the algorithms. Section 3 presents other academic works related to the field of resource efficient machine learning. Section 4 provides information about the selected test cases, the hyperparameter configurations for the algorithms and other algorithmic design choices. It also presents the worst case complexity of the algorithms. Section 5 presents and explains the results obtained in the experiments. Section 6 further elaborates upon these results and discusses the impact and limitations of the thesis. Section 7 presents the conclusions we can draw from the results and provides examples of work suitable to extend this thesis work.


Chapter 2

Background

2.1 Target platform

The EMCA is an asymmetric many-core system-on-chip used in Ericsson’s RBS. Each EMCA chip contains many Digital Signal Processors (DSPs), allowing for a high degree of task parallelism. The EMCA has access to three different memories as seen in Figure 2.1. The first memory is a small and fast memory located on each DSP called the local memory (LM). There is also a larger common memory (CM) situated on the chip that is used for storage and communication between DSPs and HW accelerators. The CM comes at a higher access cost than the LM. The final memory is called the external memory (EM); it is larger than the CM and has even higher access times. The EM is commonly used to dump logs and to store data for non-performance critical tasks. The LM on the EMCA is designed using a Harvard architecture, which means that data and program memory are separated. The EMCA is programmed using DSP-C, i.e. C extended with some specific DSP keywords/instructions. Ericsson has made the hardware design, compilers, debuggers, profilers and the operating system for developing EMCA based systems. The EMCA is used as a compute unit in a larger chip array called a Digital Unit Standard (DUS) as seen in Figure 2.2. Each DUS contains a set of EMCAs as well as some additional processors. The DUS is used in Ericsson’s test beds and in Ericsson’s RBS products.

Figure 2.1: Architectural overview of the EMCA and its memory hierarchy

Figure 2.2: Architectural overview of the DUS and its connected EMCAs

It is expected that machine learning can provide more functionality and more cost-effective solutions compared to existing algorithms currently running on the EMCA. Due to the specialized architecture and many other competing time-critical processes, conventional machine learning approaches will be hard to utilize. This means that the algorithms need to be designed with parallelization and careful selection of hyperparameters in mind. The selected ML algorithms will have to be adapted to the particularities of the EMCA. For example, they need to be adapted to handle the memory hierarchy and the limited amount of memory available both during training and prediction.

2.2 Flake

Flake is an execution environment running on the EMCA. It is developed by Ericsson Software & Technology for use in their testbeds. It is a real-time operating system (RTOS) that contains definitions of project structure, execution models and memory hierarchies for running DSP-C applications on the target platform. Flake is capable of executing programs both on a Linux x86 platform, where the EMCA environment is emulated, and on the EMCA core (Trinity).

The main task of Flake is to provide programming abstractions which allow the algorithm designer to utilize the inherent parallelism of the baseband application. The programming abstractions allow for dividing larger programs into smaller tasks that can be executed in parallel. The programming model employed in Flake is to express the program as a connected graph of run-to-completion tasks that are synchronized by barriers. Each run-to-completion task has a begin barrier and an end barrier, which may serve as a begin barrier for subsequent tasks. The execution of these latter tasks will not start until all the previous tasks have reached their end barrier. In summary, a Flake program is simply a graph of run-to-completion tasks as edges connected by synchronizing barriers as nodes.
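The barrier-synchronized execution model can be illustrated with a small simulation. The sketch below is not Flake's API (Flake programs are written in DSP-C); the function name `run_task_graph`, the barrier labels and the task tuples are hypothetical and only serve to show tasks as edges between barrier nodes, where a barrier is released once every task ending in it has completed.

```python
from collections import defaultdict, deque


def run_task_graph(tasks, start_barrier):
    """Simulate run-to-completion tasks synchronized by barriers.

    tasks: list of (name, begin_barrier, end_barrier) tuples, i.e. edges
    between barrier nodes. Returns the order in which tasks were executed.
    """
    outgoing = defaultdict(list)   # barrier -> tasks that begin at it
    pending = defaultdict(int)     # barrier -> unfinished tasks ending at it
    for name, begin, end in tasks:
        outgoing[begin].append((name, end))
        pending[end] += 1

    order = []
    released = deque([start_barrier])  # barriers whose tasks may start
    while released:
        barrier = released.popleft()
        # On real hardware these tasks could run in parallel on different
        # DSPs; in this sketch they are executed one after another.
        for name, end in outgoing[barrier]:
            order.append(name)
            pending[end] -= 1
            if pending[end] == 0:      # every task reached this end barrier
                released.append(end)
    return order


# Hypothetical example: two tasks fan out from b0 and join at b1.
example_tasks = [
    ("load_a", "b0", "b1"),
    ("load_b", "b0", "b1"),
    ("combine", "b1", "b2"),
]
print(run_task_graph(example_tasks, "b0"))
```

In this example `combine` cannot start before both `load_a` and `load_b` have reached barrier `b1`, which mirrors the rule that subsequent tasks wait for all previous tasks to reach their end barrier.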

2.3 Beamforming

A radio transmitter needs to transmit signals on a certain frequency. Increasing or decreasing the transmission frequency has different effects on the transmission. At high transmission frequencies, the signal propagation is more hostile and the free-space propagation loss is higher since the receive antennas get smaller. The diffraction losses are higher because the radio signals do not “bend” around corners to the same extent at high transmission frequencies. Finally, the wall penetration losses are higher. Higher frequencies, however, also offer significant opportunities. As the transmission frequency gets higher, the antenna elements get smaller. With this, it becomes possible to pack more elements into a smaller antenna. A state-of-the-art antenna used to transmit on the frequency 2.6 GHz is roughly one meter tall and contains 20 transmission elements. At 15 GHz, it is possible to design an antenna with 200 elements that is only 5 cm wide and 20 cm tall. With more antenna elements, it becomes possible to steer the transmission towards the intended receiver. Steering this signal is referred to as beamforming. Since we are concentrating the transmission in a certain direction, coverage is significantly improved. With more antenna elements, the beams get narrower. It then becomes vital to transmit the signal in the appropriate direction in order to maximize the received signal energy at the user [6].

Beamforming is a signal processing technique used in the radio transmitters in 5G. The primary objective of beamforming is to direct a transmission so as to align it with its users, which provides them with a better connection.


(a) Static beamforming (b) Dynamic beamforming

Figure 2.3: Illustration of beamforming. Figure 2.3a illustrates a transmitter, its corresponding beams and different user equipments (UEs). Figure 2.3b illustrates a transmitter with two beams that have different phase and relative amplitude values. We can see that the shape of the beams is affected by these settings and that modifying them will change the range and width of the beam. Taken from Ericsson internal material.

There are two forms of beamforming. In static beamforming all transmission beams are placed statically. In dynamic beamforming we manipulate the direction of the beams digitally as users move around so as to provide the best signal to the users in a changing environment. The left image in Figure 2.3 describes an example of static beamforming where all beams attempt to cover as much area as possible. The right image illustrates an example of a beam that has steered its transmission so as to provide a moving user with a good signal. The test case described in Section 4.1.3 concerns dynamic beamforming. More specifically, the test case concerns a problem called beam selection, which is the task of estimating which beam will provide the best signal to a user.

2.4 Machine Learning

A machine learning algorithm is an algorithm that is able to learn from data. Tom M. Mitchell gives the following definition of what a learning machine is: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E” [43]. Machine learning tasks are usually described in terms of how the machine learning system should process a collection of data consisting of features that have been quantitatively measured from some object or event.

The most popular machine learning techniques are classification, regression and clustering. Classification aims to provide a function $f : \mathbb{R}^N \to \{1, \dots, k\}$ that maps an input of dimension N to a certain class. An example of such classes in the case of image recognition could be “apple, orange, human”. Regression aims to approximate a continuous valued function $f : \mathbb{R}^N \to \mathbb{R}$, such as total annual revenue or how long a person might live given a certain set of attributes. Clustering aims to create groupings in an unlabeled data set by finding similarities based on the closeness of their different features. This report will focus on regression and classification algorithms.

2.5 Learning and prediction

Machine learning algorithms generally consist of two primary phases: the training phase, where an algorithm attempts to learn a desired behavior, and the prediction phase, where the algorithm performs its intended function. In order to learn a desired behavior our algorithm has to train on training data using a training algorithm. In supervised learning the training algorithm is designed to find parameters that map a certain input to a corresponding output label. This process is roughly described in Figure 2.4.


Figure 2.4: Visualization of the different Machine Learning phases taken from [14]

The domain object is the object which we want to learn something about. We describe this object using a set of features, a select set of attributes that aim to describe different instances of the object. In the Iris dataset [13], the domain objects are flowers. The features used to describe these flowers are measurements such as petal and sepal width and length from different flowers belonging to three different species. Our training data set contains these selected measurements for many different flowers along with a label that correctly identifies which species the flower measurements belong to. This training data is then processed by a learning algorithm. In supervised learning the learning algorithm is an optimization procedure that adjusts the parameters of the model so that when the model performs a prediction on the training data, it outputs a prediction that (hopefully) corresponds to the data entry’s correct label. Once our algorithm has converged to a desirable behavior we can use our model to identify flowers using only measurements of the features we previously selected. How well our model performs depends on the problem, the model, how much data is available, which features have been selected and many other factors. We assess the quality of our model by its ability to generalize. The learning algorithm attempts to learn a behavior by changing the model’s parameters so that it reduces the prediction error on the training set. The goal of supervised learning is however not to identify the labels of the training data set (since we already know them) but to correctly identify the labels of new, unseen data. A model that has this quality generalizes well. This means that the model has learned to approximate the mathematical distribution from which the data originates.
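The split between a training phase and a prediction phase, and the idea of judging a model on unseen data, can be shown with a deliberately simple classifier. The sketch below is an illustration only: it uses synthetic two-class data instead of the Iris dataset, and the nearest-centroid model as well as all function names are assumptions that do not appear in this thesis.

```python
import random


def make_data(rng, n_per_class):
    """Draw 2-D points from two well-separated Gaussian blobs (synthetic)."""
    data = []
    for label, (cx, cy) in enumerate([(0.0, 0.0), (3.0, 3.0)]):
        for _ in range(n_per_class):
            data.append(((rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)), label))
    return data


def train(samples):
    """Training phase: estimate one centroid per class from labelled data."""
    sums = {}
    for (x, y), label in samples:
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


def predict(model, point):
    """Prediction phase: return the label of the closest centroid."""
    x, y = point
    return min(model, key=lambda c: (x - model[c][0]) ** 2 + (y - model[c][1]) ** 2)


def accuracy(model, samples):
    """Fraction of samples whose predicted label matches the true label."""
    hits = sum(predict(model, p) == label for p, label in samples)
    return hits / len(samples)


rng = random.Random(0)
data = make_data(rng, 100)
rng.shuffle(data)
train_set, test_set = data[:150], data[150:]  # held-out data is never trained on

model = train(train_set)
print("training accuracy:", accuracy(model, train_set))
print("held-out accuracy:", accuracy(model, test_set))
```

The held-out accuracy is the number that says something about generalization; the training accuracy only shows how well the parameters fit data the model has already seen.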

2.6 Neural Networks

Neural Networks (NNs) originate from an attempt to imitate information processing principles adopted by the brain. From a practical perspective biological realism would be difficult to emulate. The focus has instead shifted towards creating models that mimic how biological systems process information but which have also been proven to have a great amount of practical value [2]. A downside of NNs from the resource-constrained perspective is that they require a significant amount of resources if the network architecture is too large. This offers an interesting subject for evaluation as the size of the network will deeply affect the prediction accuracy, latency and memory footprint. The following sections give an overview of the equations and mathematical theorems that are used when implementing the NN. Further derivations of the formulas used can be found in Bishop [2] and Yoshua [43].

2.6.1 Multi layer perceptron

A Multi layer perceptron (MLP) is a class of NNs. The simplest form of an MLP consists of three layers as seen in Figure 2.5 [25]. The first layer is the input layer, whose number of nodes corresponds to the dimensionality of the input data. These nodes are connected to what is referred to as a ‘hidden layer’. These connections have an associated weight w which is multiplied with the input variable x as it is fed through the network’s layers. Each node in the hidden layer accumulates the weighted sum of its inputs and applies an activation function σ to the result [43]. The purpose of the activation functions is to introduce a non-linearity to each layer of the network. The activation functions used in this thesis work are the Leaky Rectified Linear Unit (Leaky ReLU) and the Sigmoid function.

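Leaky ReLU:

$$\sigma(x) = \begin{cases} x, & \text{if } x \geq 0 \\ 0.01x, & \text{otherwise} \end{cases}$$

Sigmoid:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

$$y_k = \sigma\left(\sum_{j=0}^{d} x_j w_{kj}\right) \tag{2.1}$$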

Here $y_k$ is the output of node k, and d is the dimensionality of the input. The result $y_k$ is then fed to the following layer, which in the case of a network with a single hidden layer is the output layer. The output layer repeats the process of the hidden layer and provides the prediction of the NN, which corresponds to the result of the classification or regression task [43].
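As a minimal sketch of equation 2.1, the forward pass of a network with one hidden layer can be written as follows. The weights, the choice of Leaky ReLU in the hidden layer and Sigmoid in the output layer, and the handling of index j = 0 as a constant bias input are assumptions made for illustration; the function names are hypothetical.

```python
import math


def leaky_relu(x):
    """Leaky ReLU: identity for x >= 0, slope 0.01 otherwise."""
    return x if x >= 0 else 0.01 * x


def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(inputs, weights, activation):
    """Equation 2.1 for every node k: y_k = activation(sum_j x_j * w_kj).

    weights[k][j] connects input j to node k. A constant 1.0 is prepended
    to the inputs so that column 0 of the weights acts as a bias (assumption).
    """
    x = [1.0] + list(inputs)
    return [activation(sum(xj * wkj for xj, wkj in zip(x, row))) for row in weights]


def mlp_forward(inputs, hidden_weights, output_weights):
    """Feed the input through the hidden layer and then the output layer."""
    hidden = layer_forward(inputs, hidden_weights, leaky_relu)
    return layer_forward(hidden, output_weights, sigmoid)


# Hypothetical weights: 2 inputs -> 2 hidden nodes -> 1 output node.
hidden_w = [[0.0, 1.0, -1.0],
            [0.5, -1.0, 1.0]]
output_w = [[0.0, 1.0, 1.0]]
print(mlp_forward([2.0, 1.0], hidden_w, output_w))
```

Both the number of multiplications and the number of stored weights grow with the product of the sizes of adjacent layers, which is why the network size has a direct effect on latency and memory footprint.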

Figure 2.5: Neural Network with one hidden layer

Adding more hidden layers to our model increases its complexity. This means that it will be slower in performing predictions, slower to train and that it will have a larger memory footprint. The motivation behind using several layers comes from the fact that each layer is capable of transforming the input. The layer-by-layer transformation can be seen as a way for the NN to automatically change the representation of our data to a new representation. These new representations can drastically change how the following layers perceive the data, and the network can often learn to represent the data better than hand-crafted features [23]. This allows the network to handle complex or high dimensional data by breaking a large problem down into several small problems [43]. An example of how this layer-by-layer transformation can be beneficial is facial detection in images. The first layer uses the pixel information in order to identify lines and edges in the image. The second layer uses the output from the first layer and, because of the data’s new representation, the network can combine these primitive shapes and identify more complex shapes such as mouths, chins and eyes. The third layer can use those complex shapes in order to find a complete face inside a picture. NNs are powerful and scalable tools.

Figure 2.6: Illustration of NN architecture taken from Gauran Chakravorty [4]

According to the universal approximation theorem, a feed-forward NN with a single hidden layer can approximate continuous functions on compact subsets of $\mathbb{R}^N$ [8]. This means that a shallow NN with only a single hidden layer can approximate any such continuous function, given some constraints on the activation function. According to Cybenko [17] the theorem has been proved for the Sigmoid activation function and according to Yoshua [43] it has been proven for the ReLU function. The theorem does not, however, touch upon the learnability of the parameters associated with the model, which means that it can be hard, or impossible, to learn the parameters that would allow the network to converge to said function.

2.6.2 Backpropagation of error

When an NN is created the weights of the network are commonly randomly initialized from a Gaussian distribution $\mathcal{N}(0, 1)$. In order for the network to perform any sort of meaningful prediction the network needs to adjust these weights such that a certain input provides us with a predefined output. This is called training, or learning.

Backpropagation of error (Backprop) is a method for training NNs that uses an optimization technique called gradient descent. Backprop works by feeding an input through a network and comparing the results with the correct label. This comparison is done using a loss function and provides us with an error which is propagated backwards through each layer. Backprop then uses these error values to calculate the gradient of the loss function with respect to the weights. The optimization function then uses these gradients to update the different weights with the objective of minimizing the loss function [32]. If the output $y_k$ of a neuron is given by equation 2.1 we can define an error function E that measures the distance of our prediction from the target.

$$E_n = \frac{1}{2} \sum_k (y_{nk} - t_{nk})^2 \tag{2.2}$$

Where n indexes the training sample, k indexes the output units and t is the target. The gradient of this error function with respect to the weight is then:

$$\frac{\partial E_n}{\partial w_{ji}} = (y_{nj} - t_{nj}) x_{ni} \tag{2.3}$$

This expression assumes a linear activation function. Linear activation functions are however only capable of forming linear combinations of weights and inputs, which means that we cannot do nonlinear separation of data. If we choose an activation function such as the Sigmoid or ReLU functions we need to find the derivative of the activation function in order to obtain the gradient of the error. The gradient $\nabla w$ is then defined by:

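$$
\begin{aligned}
z_{ji} &= \text{activation of unit } i \text{ sent to unit } j \\
a_j &= \sum_i w_{ji} z_{ji} \\
\delta_j &= \sigma'(a_j) \sum_k w_{kj} \delta_k \\
\nabla w_{ji} &= \frac{\partial E_n}{\partial w_{ji}} = \delta_j z_{ji}
\end{aligned}
\tag{2.4}
$$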


Where $\sigma'(a_j)$ is the derivative of the activation function. The gradient $\nabla w$ is used to update the weights in order to compensate for the error. We iteratively apply small updates until our model converges to a desirable behavior. There are however several numerical problems that arise from this strategy. All formulas used in this chapter are taken from Bishop [2].
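The recursion in equation 2.4 can be checked numerically on a tiny network. The sketch below is an illustration under stated assumptions: it uses Sigmoid units in both layers, omits bias terms, and uses hypothetical weights and a single made-up training sample. The analytic gradient from backpropagation is compared against a central finite difference of the error in equation 2.2.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, w_hidden, w_out):
    """Return hidden activations z and outputs y (sigmoid units, no bias)."""
    z = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    y = [sigmoid(sum(w * zj for w, zj in zip(row, z))) for row in w_out]
    return z, y


def error(x, t, w_hidden, w_out):
    """Equation 2.2 for one sample: E = 1/2 * sum_k (y_k - t_k)^2."""
    _, y = forward(x, w_hidden, w_out)
    return 0.5 * sum((yk - tk) ** 2 for yk, tk in zip(y, t))


def backprop(x, t, w_hidden, w_out):
    """Gradients of E w.r.t. both weight matrices, following equation 2.4."""
    z, y = forward(x, w_hidden, w_out)
    # Output units: delta_k = sigma'(a_k) * (y_k - t_k), with sigma' = y * (1 - y).
    delta_out = [yk * (1 - yk) * (yk - tk) for yk, tk in zip(y, t)]
    # Hidden units: delta_j = sigma'(a_j) * sum_k w_kj * delta_k.
    delta_hidden = [
        z[j] * (1 - z[j]) * sum(w_out[k][j] * delta_out[k] for k in range(len(w_out)))
        for j in range(len(z))
    ]
    # Gradient for each weight: delta of the receiving unit times its input.
    grad_out = [[dk * zj for zj in z] for dk in delta_out]
    grad_hidden = [[dj * xi for xi in x] for dj in delta_hidden]
    return grad_hidden, grad_out


# Hypothetical 2-2-1 network and a single training sample.
w_hidden = [[0.1, -0.2], [0.4, 0.3]]
w_out = [[0.5, -0.6]]
x, t = [1.0, 2.0], [1.0]

grad_hidden, grad_out = backprop(x, t, w_hidden, w_out)

# Central finite-difference check of one hidden weight.
eps = 1e-6
w_plus = [row[:] for row in w_hidden]
w_minus = [row[:] for row in w_hidden]
w_plus[0][1] += eps
w_minus[0][1] -= eps
numeric = (error(x, t, w_plus, w_out) - error(x, t, w_minus, w_out)) / (2 * eps)
print("analytic:", grad_hidden[0][1], "numeric:", numeric)
```

A gradient check of this kind is a cheap way to catch indexing or sign mistakes before a hand-written backpropagation routine is ported to a constrained target.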

2.6.3 Overfitting

From the machine learning algorithm’s perspective, the goal during the training phase is to minimize the model’s error on the training set. This goal is however problematic due to a problem called overfitting. Overfitting occurs when our model learns the noise in the data instead of capturing the underlying function that created the data. An overfitted model will provide good results on the training set, as seen in the rightmost part of Figure 2.7, because it has learned the peculiarities, or the noise, of our training set. The model will however perform poorly on unseen data. This means that the model is bad at generalizing [43]. NNs using the Backpropagation algorithm are sensitive to overfitting and we must apply different techniques in order to combat this.

Figure 2.7: Examples of underfitting, good fit, and overfitting taken from SKLearn [36].

Regularization

The real objective of training is to create a model that is good at generalizing. We want to create a model that is able to correctly classify new, unseen data.


L2 regularization is a way to prevent overfitting by limiting the size of the NN’s weights [43]. L2 regularization penalizes large weights by introducing a term that adds the squared weights of the NN to the loss function. This means that at every weight update $\nabla w_i$ a fraction of the weight, $2\lambda w_i$, is subtracted from the weight update. This penalizes large weights and causes the network to avoid over-inflating its weights. The weight update $\nabla w$ at time $t+1$ with L2 regularization then becomes:

$$\nabla w_i(t+1) = \eta \frac{\partial E}{\partial w_i(t)} - 2\lambda w_i(t) \tag{2.5}$$

Where $\lambda$ is a constant controlling the magnitude of the regularization and $\eta$ is the learning rate, which controls the magnitude of the weight change.

Momentum

Figure 2.8: Overlapping classes in the Iris dataset

When dealing with data from overlapping classes, as seen in Figure 2.8, our NN will have a hard time finding a hyperplane that can separate the classes. The weight updates ∇w_{i,j} will not converge to zero unless we massively overfit the function, which means that the model will not generalize well. The momentum method by Polyak [31] is a Stochastic Gradient Descent (SGD) trick that accumulates a velocity vector of weight change. The idea is that when we calculate a new weight update, we preserve a part of the previous iteration's calculated weight update. Polyak showed that momentum can considerably accelerate convergence to a local minimum.

The weight update with regularization and momentum can be formulated as:

$$\nabla w(t+1) = \eta \frac{\delta E}{\delta w(t)} - 2\lambda w(t) + \alpha \nabla w(t) \tag{2.6}$$

Where η is the learning rate, and α is the momentum constant.
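The update rule can be illustrated with a minimal Python sketch. It is not the implementation used in this thesis: the function names are hypothetical, and it uses the common convention in which the accumulated update is added to the weights and points down the gradient, so the signs may be arranged differently than in Equation 2.6. The default values are the hyperparameters reported in Section 4.3.2.

```python
def sgd_momentum_step(weights, grads, velocity, lr=0.1, l2=1e-6, momentum=0.95):
    """One SGD update with L2 regularization and momentum (hypothetical helper).

    weights, grads and velocity are equally long lists of floats.
    Returns the new (weights, velocity).
    """
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        # Gradient of the regularized loss E + l2 * w^2 with respect to w.
        reg_grad = g + 2.0 * l2 * w
        # Keep a fraction `momentum` of the previous update and add the new step.
        v_next = momentum * v - lr * reg_grad
        new_velocity.append(v_next)
        new_weights.append(w + v_next)
    return new_weights, new_velocity


def minimize_quadratic(target, steps=1000):
    """Toy usage: minimize E(w) = 0.5 * sum((w - target)^2) without regularization."""
    weights = [0.0] * len(target)
    velocity = [0.0] * len(target)
    for _ in range(steps):
        grads = [w - t for w, t in zip(weights, target)]
        weights, velocity = sgd_momentum_step(weights, grads, velocity, l2=0.0)
    return weights
```

On the toy quadratic, the weights oscillate around the target at first because of the preserved velocity and then settle on it.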

2.7 Support Vector Machines

Support Vector Machines (SVM) offer one of the most robust and accurate methods among all well-known machine-learning algorithms. Training an SVM is theoretically well-founded, requires only a few training samples, and is insensitive to the dimensionality of the data. It also has a small memory footprint and predictions can be done by simply performing a linear combination of a few variables [38]. SVMs are also insensitive to overfitting because the learning problem is formulated as a convex quadratic optimization problem. This is a very attractive property because our algorithm will likely be competing for resources with other processes. The following sections give an overview of the equations and mathematical theorems used when implementing the SVMs and the optimization algorithm used for training. Further derivations of the formulas used can be found in Platt's paper [30].

2.7.1 Linear classification

Suppose we have N training data points {(x_1, y_1), ..., (x_N, y_N)} where x_i ∈ R^N and y_i ∈ {−1, +1} for which we would like to learn a separating hyperplane classifier. In linear classification, the hypothesis h used for classification is defined through an affine linear function f on the input space as:

$$\begin{aligned}
h(x) &= \operatorname{sgn}(f(x)) \\
f(x) &= w \cdot x + b \\
w &\in \mathbb{R}^N, \quad b \in \mathbb{R}
\end{aligned}$$

Where w is a weight vector and b is the bias. This defines a hyperplane H : y = w · x + b = 0 that splits R^N into two parts corresponding to the two classes. From this hyperplane we want to find two parallel planes H1 and H2 at equal distance from H, under the condition that there are no data points between the two planes and that the distance between them is maximized.

$$\begin{aligned}
H_2 &: y = w \cdot x + b = +1 \\
H_1 &: y = w \cdot x + b = -1
\end{aligned} \tag{2.7}$$

Figure 2.9: H, H1, H2 and their corresponding support vectors. Taken from OpenCV [28].

In order to maximize the distance under the above conditions, some data points will be located on the planes H1 and H2. These data points are known as Support Vectors. The distance from a point on H1 to H is |w · x + b| / ||w||. Using Equation 2.7 we obtain |w · x + b| / ||w|| = 1 / ||w|| [30]. The distance between H1 and H2 is therefore 2 / ||w||, so maximizing it is equivalent to minimizing ||w||, or equivalently ||w||^2 = w^T w. The goal is then to minimize:

$$\min_{w,\, b} \; w^T w \quad \text{subject to} \quad y_i (w \cdot x_i + b) \geq 1$$

In order to find a solution to the above problem we need to solve a convex quadratic optimization problem. Convex optimization can be done by introducing Lagrange multipliers, which is a strategy for finding local maxima or minima of a function subject to equality constraints. The optimization problem formulated above then becomes [30]:

$$L(w, b, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^{N} \alpha_i y_i (w \cdot x_i + b) + \sum_{i=1}^{N} \alpha_i \tag{2.8}$$

Where α is the vector of Lagrange multipliers. In order to find a solution to the above expression we can solve the Wolfe dual [41]. We have found the solution to our convex optimization problem when the gradient of L with respect to w and b equals zero. The problem thus becomes: maximize L(w, b, α) with respect to α subject to the constraints δL/δw = 0 and δL/δb = 0, which give:

$$w = \sum_{i=1}^{N} \alpha_i y_i x_i, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0 \tag{2.9}$$

Substituting these into L(w, b, α) eliminates the primal variables w and b. We then obtain the following objective function:

$$L_D \equiv \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \tag{2.10}$$

We can then perform binary classification of a data point x as follows:

$$f(x) = \operatorname{sgn}(w \cdot x + b) = \operatorname{sgn}\left(\sum_{i=1}^{N} \alpha_i y_i (x_i \cdot x) + b\right) \tag{2.11}$$

2.7.2 Kernels

Support vector machines are typically implemented using either a linear or a radial basis function kernel. A linear SVM (LSVM), as seen in Figure 2.9, is only capable of linearly separating data. An SVM using a radial basis function kernel (RSVM) transforms the input data into a nonlinear space, which makes nonlinear separation of data possible.

This nonlinear transformation of the data θ(x_i) · θ(x_j) is called a kernel k. Kernels can be any function satisfying the Mercer condition [30], but the implementation in this thesis uses the linear and the radial basis function kernel.

$$\begin{aligned}
\theta(x_i) \cdot \theta(x_j) &= k(x_i, x_j) \\
k(x_i, x_j) &= e^{-\|x_i - x_j\|^2 / 2\sigma^2}
\end{aligned} \tag{2.12}$$

The complete dual form formulation for any given kernel function:

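$$\begin{aligned}
L_D \equiv{} & \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j k(x_i, x_j) \\
\text{subject to: } & 0 \leq \alpha_i \leq C, \quad \text{for } i = 1, 2, \ldots, N \\
& \sum_{i=1}^{N} y_i \alpha_i = 0
\end{aligned} \tag{2.13}$$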

Where C is a hyperparameter of the soft margin cost function that bounds the influence of each individual support vector and thereby controls the penalty for errors.

As stated by Keerthi et al., "The analysis also indicates that if complete model selection using the Gaussian kernel has been conducted, there is no need to consider linear SVM." [20]. This means that in terms of accuracy radial basis functions perform at least on par with linear functions given a proper model selection. The downside of radial basis functions is however that they come at a much higher cost of prediction and training compared to the linear kernel. Both kernels have been implemented in order to investigate the performance trade-offs.
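As an illustration of Equations 2.11 and 2.12, the following Python sketch evaluates the dual-form decision function with a pluggable kernel. The function names and the toy values are assumptions made for this example and do not reflect the thesis implementation.

```python
import math


def linear_kernel(a, b):
    """Plain dot product, as used by the LSVM."""
    return sum(ai * bi for ai, bi in zip(a, b))


def rbf_kernel(a, b, sigma=1.0):
    """Radial basis function kernel exp(-||a - b||^2 / (2 * sigma^2))."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def svm_predict(x, support_vectors, labels, alphas, bias, kernel=linear_kernel):
    """Dual-form prediction: sign of sum_i alpha_i * y_i * k(x_i, x) + b."""
    f = sum(a * y * kernel(sv, x)
            for sv, y, a in zip(support_vectors, labels, alphas))
    return 1 if f + bias >= 0 else -1


# Toy model with two support vectors (values chosen only for illustration).
support_vectors = [[1.0, 0.0], [-1.0, 0.0]]
labels = [1, -1]
alphas = [0.5, 0.5]
prediction = svm_predict([2.0, 1.0], support_vectors, labels, alphas, 0.0, rbf_kernel)
```

The cost difference between the two kernels is visible here: the linear kernel can be collapsed into a single weight vector w, whereas the RBF kernel has to be evaluated against every support vector at prediction time.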

2.7.3 Sequential Minimal Optimization

The process of solving quadratic optimization problems is called quadratic programming (QP). Sequential Minimal Optimization (SMO) is an optimization technique created by Platt from Microsoft [30]. SMO solves the QP problem that the SVM's training phase poses by breaking it down into a series of smallest possible QP problems, using Osuna's theorem to ensure convergence. The amount of memory required for SMO is linear in the training set size, which means that SMO is one of the most memory efficient alternatives.

The SMO algorithm searches through the feasible region of the dual problem (Equation 2.13) and maximizes the objective function (Equation 2.10). It works by using heuristic functions to select two Lagrange multipliers α_i to optimize at a time, which corresponds to the smallest possible optimization problem. At every step, SMO chooses two Lagrange multipliers to jointly optimize, finds the optimal values for these multipliers and updates the SVM to reflect the new optimal values. The advantage of this strategy is that solving for two Lagrange multipliers can be done analytically, which means that we avoid using time-consuming numerical QP optimization techniques. The two main components of SMO are (1) an analytical method for solving for the two Lagrange multipliers, and (2) a heuristic for choosing which multipliers to optimize.

The algorithm proceeds as follows:

Algorithm 1 Sequential Minimal Optimization

1: procedure Optimize
2:     Find a Lagrange multiplier α1 that violates the Karush-Kuhn-Tucker conditions
3:     Find a second Lagrange multiplier α2
4:     Optimize the pair (α1, α2) with respect to the constraints
5:     Repeat 2, 3 and 4 until convergence

Convergence of the algorithm depends on a user defined tolerance. Predictions for the RSVM come at complexity O(n) where n is the number of samples. Training is an optimization problem, which means that it is hard to state a complexity, but according to Platt the complexity is somewhere between O(n) and O(n^3) depending on the problem [30]. However, convergence also depends on the hyperparameters of the SVM such as the regularization term C, the slack variable coefficient or the RBF function's γ, which describes the variance of the data. If the SVM cannot find a hyperplane that separates the data given the hyperparameters, the optimization algorithm will not converge.

2.7.4 Multiclass classification

SVMs are by design binary classifiers. The hyperplane that SVMs use to separate the data can only split the space R^N into two parts corresponding to two classes. This is problematic because many real world problems are not binary problems. There are however two popular techniques for dealing with this limitation.

Figure 2.10: One against all multi class classification

One against all is a technique designed to perform multi-class classification by creating one SVM per class. It is illustrated in Figure 2.10. Each SVM is trained to identify one class as the positive samples and all other classes as negative samples. Predictions will then require us to run the data samples through c SVMs where c is the number of classes. The SVM that provides a positive output with the maximum margin will be the prediction of the model.

Another technique is One against one. This technique separates classes pairwise and thus requires us to create one SVM per pair of classes. This means that we need c(c − 1)/2 different SVMs in order to perform predictions. This technique can provide faster model training convergence due to smaller training sets, but due to the limited memory of the EMCA, One against all is chosen as the multiclass classification technique.
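The two strategies can be compared with a short Python sketch. The function names are hypothetical, and the sketch assumes that each per-class SVM returns its raw decision value f(x); picking the largest value also when no SVM gives a positive output is an assumption of this example.

```python
def one_against_all_predict(decision_values):
    """Return the index of the class whose SVM reports the largest f(x)."""
    return max(range(len(decision_values)), key=lambda c: decision_values[c])


def one_against_all_count(num_classes):
    """Number of SVMs needed with one SVM per class."""
    return num_classes


def one_against_one_count(num_classes):
    """Number of SVMs needed with one SVM per pair of classes."""
    return num_classes * (num_classes - 1) // 2


# Three classes: class 1 has the largest decision value.
predicted_class = one_against_all_predict([-0.3, 1.2, 0.4])
```

For three classes both strategies need three SVMs, but the pairwise count grows quadratically with the number of classes, which is what makes One against one memory-hungry.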

2.8 Random Forest

Random Forest (RF) is an ensemble of decision trees created by using bootstrap samples of the training data and random feature selection in tree induction [37]. It has achieved good performance in machine learning competitions and is backed as a strong algorithm by previous Ericsson research for both classification and regression tasks.

2.8.1 Decision trees

A decision tree can be seen as a collection of interlinked if statements that aims to perform classification or regression on a data sample, as can be seen in Figure 2.11. Each node in the tree has 0, 1 or 2 children. When a data sample is passed through the different nodes, a certain feature in the data is compared to the node's decision value. The features that the decision tree should use are decided by the training algorithm. If the data sample has a value lower than or equal to the node's decision value, the sample proceeds to the node's left child. If the value is higher it proceeds to the right child. If the node has no children it is called a leaf. Leaves determine what the decision tree will predict based on the learning algorithm [25].
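The traversal described above can be sketched in Python as follows. The `Node` class and its field names are hypothetical, and the sketch assumes for simplicity that every internal node has both a left and a right child.

```python
class Node:
    """A decision tree node; a node with a prediction set is treated as a leaf."""

    def __init__(self, feature=None, threshold=None, left=None, right=None,
                 prediction=None):
        self.feature = feature        # index of the feature compared at this node
        self.threshold = threshold    # the node's decision value
        self.left = left              # child for samples with value <= threshold
        self.right = right            # child for samples with value > threshold
        self.prediction = prediction  # class or value stored in a leaf


def tree_predict(node, sample):
    """Walk from the root to a leaf and return the leaf's prediction."""
    while node.prediction is None:
        if sample[node.feature] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.prediction


# A tiny tree of depth 2 over two features (values chosen only for illustration).
tree = Node(feature=0, threshold=2.5,
            left=Node(prediction=0),
            right=Node(feature=1, threshold=1.0,
                       left=Node(prediction=1),
                       right=Node(prediction=2)))
```

A prediction touches at most one node per level, which is why the cost of a prediction is bounded by the depth of the tree.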

Figure 2.11: Decision tree

The training algorithm works by repeatedly splitting a data set into two new data sets. The training algorithm assesses the quality of the split by using a goodness rating that attempts to estimate the similarity of the samples in the two new data sets. In this thesis the Gini index, further explained in Section 2.8.2, is used for classification, and mean squared error is used for regression. The split is represented by a node and the node value that created the best split is saved in the node. The two new data sets that appear are the children of the node. This splitting process continues until some stopping condition is met. Once a stopping condition is met, the node turns into a leaf node and calculates which class or value the leaf node corresponds to. The first stopping condition limits how deep our tree can become. If we are at a certain depth the training algorithm will automatically turn the node into a leaf. The second condition limits the size of the data set. If a split results in a data set that is of a small enough size, the algorithm will automatically turn it into a leaf. The third condition limits how picky we should be with the splits. If every single sample in a split belongs to the same class it is a good idea to turn that node into a leaf. The same logic can be applied in regression by turning the node into a leaf if the mean squared error is lower than some value.

The following sections give an overview of the equations used when implementing the RF. Further derivations of the formulas used can be found in Bishop [2] and [25].

2.8.2 Gini index

In order to train a decision tree belonging to an RF ensemble we need to determine how to perform the splits that construct the tree. When performing classification we commonly use a measurement called the Gini index. The Gini index provides us with a way to measure the class purity of a set of data. It obtains a low value if most entries in a data set belong to the same class, and a high value if the data set contains a mix of classes. The Gini index is defined as:

$$\text{Gini index} = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} |x_i - x_j|}{2n \sum_{j=1}^{n} x_j} \tag{2.14}$$

Where n is the number of samples and x is the label of a sample. The Gini index is a measure of statistical dispersion which measures inequality. It was invented by Corrado Gini in 1912 in an attempt to represent the wealth distribution in a country. The Gini index obtains a low value if a data set contains similar values (classes, or continuous values) and obtains a high value if the data set has high variability. In order to decide which feature and what value to use, we simply calculate the Gini index for all combinations of features and values. The split¹ that has the lowest Gini index represents the optimal split given the training data. We calculate the optimal split using the Gini index for every node. This is a costly operation of complexity O(2^d × f × n) where d is the depth of the decision tree, f is the number of features and n is the number of data samples. Performing a prediction has complexity O(d), which is cheap.
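Equation 2.14 can be evaluated directly, as in the following Python sketch. The function name is hypothetical, the labels are assumed to be non-negative numbers, and returning zero for an empty set or an all-zero set is a choice made for this example to avoid dividing by zero.

```python
def gini_index(values):
    """Gini index of Equation 2.14 for a list of non-negative labels."""
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        # Assumption of this sketch: an empty or all-zero set counts as pure.
        return 0.0
    # Sum of absolute differences over all ordered pairs (i, j).
    abs_diff = sum(abs(a - b) for a in values for b in values)
    return abs_diff / (2.0 * n * total)


pure = gini_index([1, 1, 1, 1])    # identical labels
mixed = gini_index([0, 0, 0, 4])   # one label differs strongly from the rest
```

The double sum makes a single evaluation quadratic in the number of samples, which adds to the training cost discussed above.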

2.8.3 Mean squared error

The Gini index measures the class purity of the data set. A class can be described by an integer such as 0, 1 or 2. There is no locality between the class values in classification. The value given to a class is only an identifier, which means that class 1 is no closer to class 0 than class 1000 is. When performing regression we have a continuous output with locality. The output value 0 is closer to 1 than to 2, which means we can't use the Gini index. The mean squared error (MSE) measures the squared difference between each sample in the data set and the mean of the data set. A low MSE means that the samples are close to the average value of the data set. A high MSE means that the data set contains many different values. MSE is commonly used to measure the error in regression tasks and provides the output locality that is needed.

¹ The feature and value pair that divides a data set into two different sets of data depending on whether the samples are larger or smaller than the value of the feature.

$$\text{MSE} = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n} \tag{2.15}$$
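A minimal Python sketch of how Equation 2.15 can rate candidate splits on a single feature is given below. The function names are hypothetical, and combining the two child sets by a size-weighted average is an assumption of this example, since the text does not state how the two ratings are combined.

```python
def mse(values):
    """Mean squared difference between each value and the mean (Equation 2.15)."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def best_threshold(feature_values, targets):
    """Try every observed feature value as a threshold; return (threshold, score)."""
    best = (None, float("inf"))
    for threshold in sorted(set(feature_values)):
        left = [t for v, t in zip(feature_values, targets) if v <= threshold]
        right = [t for v, t in zip(feature_values, targets) if v > threshold]
        if not left or not right:
            continue  # a split must produce two non-empty data sets
        # Size-weighted average of the children's MSE (assumption of this sketch).
        score = (len(left) * mse(left) + len(right) * mse(right)) / len(targets)
        if score < best[1]:
            best = (threshold, score)
    return best


threshold, score = best_threshold([1, 2, 3, 4], [10.0, 10.0, 20.0, 20.0])
```

In the toy data the targets form two groups, so the split between the second and third sample leaves two sets with no variation at all.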

2.9 k-fold cross validation

In order to build a good model as much data as possible should be used. However, if our testing set gets too small we cannot trust that the provided accuracy corresponds to the model's ability to generalize. k-fold cross validation can solve this problem by splitting the data into k different subsets. The training algorithm uses (k − 1)/k of the samples for training and 1/k of the samples for validation. This procedure is then repeated for all k possible choices of the validation group, meaning that we use all samples for both training and validation. The performance scores of the different models are then averaged [2]. This means that our model's performance will not be affected by randomly sampling from the training set in order to create a test set, since all samples are used for validation exactly once. The different scores obtained by the models can also provide us with both an average of the model's performance and information about how much the model's score deviates from its mean. This measurement is called standard deviation and is calculated as:

$$\text{Standard deviation} = \sqrt{\frac{\sum_{i=1}^{N} (p_i - \bar{p})^2}{N}} \tag{2.16}$$

Where p_i is the performance of model i, \bar{p} is the average performance of all models and N is the number of models. Cross validation helps in controlling both bias and variance of the measurements, which is desirable for a statistical learning machine that relies on maximum likelihood optimization.
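The procedure can be sketched in Python as follows. The function names are hypothetical, and assigning every k-th sample to the same fold without shuffling is a simplification made for this example.

```python
import math


def k_fold_indices(num_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(num_samples))
    for fold in range(k):
        validation = indices[fold::k]  # every k-th sample, offset by the fold number
        held_out = set(validation)
        train = [i for i in indices if i not in held_out]
        yield train, validation


def mean_and_std(scores):
    """Average score and standard deviation as in Equation 2.16."""
    n = len(scores)
    mean = sum(scores) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / n)
    return mean, std


folds = list(k_fold_indices(10, 5))
summary = mean_and_std([1.0, 3.0])
```

Each sample index appears in exactly one validation set, which matches the property that all samples are used for validation exactly once.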

Chapter 3

Related work

Most papers on resource constrained machine learning seek to either provide a more resource efficient architecture of an existing algorithm [22][26], or to create a model that attempts to apply costs to things such as model complexity or availability of data [19]. This work has a novel point of view in some sense, as we investigate the performance trade-offs between a set of machine learning algorithms instead of attempting to improve the models or to find a general cost estimation.

The selected algorithms are required to operate within short deadlines in order to provide valuable predictions for the target platform. There are numerous different strategies for achieving this and many of them differ from algorithm to algorithm. Neural networks need to be of sufficiently small size in order to make predictions that meet deadlines. The training can however easily be parallelized to perform distributed training on separate cores, as described by George et al. [15]. More advanced strategies for distributing computations for neural networks can be found in popular papers such as the original TensorFlow paper by Abadi et al. [1], or in Google's paper Large scale distributed deep networks by Dean et al. [9]. Dean et al. describe methods for creating training algorithms distributed over thousands of CPUs and GPUs, which can be used as inspiration, or future work, for distributing computations on the target platform.

Ensemble learning systems provide a very interesting quality described by Yang et al. [42]. The authors describe an ensemble of classifiers working together by performing predictions one by one. Once a deadline is approaching, the algorithm simply bases its prediction on a majority vote over all finished predictions. This allows us to always have a prediction ready given that at least one predictor has finished. A popular algorithm used for ensemble learning is Random Forest, described by Svetnik et al. [37]. Random Forests can, like neural networks, rapidly grow in size, which will affect their memory footprint. There are many tricks to reduce the size of the decision trees in random forests, such as Feature-budgeted random forests as proposed by Nan et al. [26] or Optimally Pruning Decision Tree Ensembles With Feature Cost, also by Nan et al. [27].
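The deadline-driven vote can be illustrated with a few lines of Python. This is only a sketch of the idea as summarized above, not the method of Yang et al. [42]; the function name is hypothetical and the list is assumed to hold the class predictions of the ensemble members that finished before the deadline.

```python
from collections import Counter


def anytime_majority_vote(finished_predictions):
    """Return the most common prediction among the finished ensemble members.

    Returns None when no member has finished before the deadline.
    """
    if not finished_predictions:
        return None
    return Counter(finished_predictions).most_common(1)[0][0]


# Three of the ensemble members finished in time; two of them voted for class 1.
decision = anytime_majority_vote([1, 2, 1])
```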

Training a Support Vector Machine relies on solving a convex optimization problem. An alternative to using expensive numerical solvers is to use the analytic solver proposed by Platt [30]. A problem with this approach is that Sequential Minimal Optimization is not inherently adapted for online training. Adapting the Support Vector Machine for multi-class classification and online training can be done as proposed by Trinh et al. [38], including LaSVM [3], or with other techniques such as the online passive-aggressive algorithm [7]. It is however possible to distribute classification and training to get a self-improving system based on batch training.

Wearable devices in the form of smart devices or health monitoring devices have become increasingly popular due to the advancement of the Internet of Things (IoT). Many of these devices can be considered resource constrained due to the small size they require for the wearer's comfort. Shoeb et al. [35] present a wearable device using machine learning that attempts to identify epileptic seizures in the wearer. Another example of a wearable health monitoring device using machine learning techniques is presented by Ozdemir et al. [29]. Venture et al. [39] discuss energy optimizations using embedded devices and machine learning. Vehicle automation can also be considered to be related to resource constrained machine learning due to the hard deadlines and limited computing power. Chenyi et al. [5] present a deep convolutional neural network for autonomous vehicle control using computer vision.

Chapter 4

Method

4.1 Experiments and Data sets

The experiments have been performed by measuring the resource requirements of the training and prediction phases on the EMCA platform running in Ericsson's lab. All information was collected using function traces and all experiments were averaged over several runs depending on the variance of the results. The initial experiments involved evaluation of the algorithms working with the Iris [13] and Wine Quality [wqref] data sets. The Iris data set presents a classification problem and Wine Quality presents a regression problem. Both data sets were selected from the UCI machine learning repository [24] and are popular and well known data sets. After performing the initial experiments, a third test case provided by Ericsson was evaluated. The data sets are presented in more detail in the following sections.

Table 4.1: Information about the different data sets used in the experiments

Data set       Samples   Features   Task             Target range (y)
Iris           150       4          Classification   [0, 1, 2]
Wine Quality   4898      12         Regression       [0, 1, ..., 10]
Beamforming    70000     2          Regression       [-120.0, ..., -60.0]

4.1.1 Iris

The Iris data set consists of 150 entries of measurements collected from three different species of Iris flowers. The data available is the sepal and petal length and width and a corresponding class describing what flower the data describes, as Figure 4.1 shows. The Iris ML task is an example of supervised classification and is regarded as a simple domain due to the restricted features and the few classes. One of the three plant classes is linearly separable from the other two whilst the remaining two are not. This means that the data set is not linearly separable.

Sepal length   Sepal width   Petal length   Petal width   Class
5.1            3.5           1.4            0.2           Iris Setosa

Figure 4.1: Example Iris data

4.1.2 Wine quality

The Wine quality data set entries consist of 11 different measurements collected from various types of wine. The data set contains 1599 entries of data collected from red wine and 4898 entries of white wine. This thesis will focus on the white wine data set. The goal of the data set is to predict the quality ∈ [0, 10] of the wine. This data set is interesting because (1) it has many features and (2) the labels are subjective. The labels (quality of the wine) have been determined as the median value of a quality rating given by at least 3 wine experts. This means that the labels are not directly derived from the features and some mathematical function, but are determined by the noisy judgment of human beings.

Fixed acidity   Volatile acidity   Citric acid   Residual sugar   ...   Quality
7.5             0.53               0.06          2.6              ...   10

Figure 4.2: Example Wine Quality data

4.1.3 Beamforming

Ericsson has provided a test case using beamforming data collected from a simulator. The test case corresponds to one of the problems machine learning could potentially solve if successfully implemented on the target platform. The original problem was that of beam selection as described in Section 2.3. The problem is however that there are 48 different beams in the simulator, which makes for a difficult classification problem. The NN would require 48 different output nodes and 48 separate SVMs would have to be trained for this problem. The test case was therefore reformulated as a regression problem where we attempt to estimate the best signal available to a user device from any given beam.

The resulting test case aims to estimate the highest quality Reference Signal Received Power (RSRP) available for a user. Radio transmitters use beamforming to direct signals so that users obtain the best possible RSRP. Depending on the number of beams, users, and topology, different users receive different signal quality. Data were collected from an emulated environment that contained four users who sent and received packets on a regular basis whilst moving around in the environment, as can be seen in Figure 4.3. Beamforming algorithms are used in the transmitters placed in the environment in order to provide the agents with a connection, or RSRP signal. Positional (x, y) coordinates as seen in Figure 4.3, along with the best transmitter's signal strength, have been collected from agents as they walk around pseudo-randomly. Each user reports their RSRP for every single beam as they send or receive packets. Figure 4.4 displays the best RSRP signal that the blue agent in Figure 4.3 received as it walked around.

Figure 4.3: The collected (x, y) coordinates of the four different agents as they walk around in the simulator

Each data entry in this test case consists of X and Y coordinates and the highest signal quality as the correct label.

Figure 4.4: Best RSRP signal quality (the label) for the agent corresponding to the blue path

4.1.4 Experiment setup

Because the hardware platform is designed with multiple storage units, two different versions of each ML algorithm were developed. The classification library was developed to utilize the DSP's LM due to the small size of the data set. The regression library was developed to work on the CM or the EM of the chip. This was done in order to highlight the performance trade-off of using the different memories. Using the LM in tandem with the CM would require a more advanced data handling architecture, which was infeasible to implement for all algorithms given the time constraints. This will be discussed in the future work section. The EM usage was only included as a final experiment on the beamforming test case as the two other data sets did not require the additional memory.

All algorithms were run individually as a part of a test suite on an EMCA situated on a DUS chip in Ericsson's lab. The code was compiled on a local machine and then uploaded to Ericsson's testbed. Before uploading and running the code on the EMCA it was debugged and tested in Flake's Linux environment. Trace functions collected the results and reported them back once execution had finished. The EMCA was not under any sort of additional computational load during the experiments. The motivation behind this decision was that if multiple cores are in use our algorithms might have to wait for computational resources, which adds noise and randomness to the measurements.

4.2 Parallelism

Due to time limitations the only algorithm that was parallelized was RF. This algorithm is referred to as PRF. Both PRF and RF are included in the evaluation. Each decision tree contained in the PRF ensemble is built on an individual DSP. Major changes to the algorithms' architecture would have been required in order to efficiently parallelize the SVM and the NN. This was omitted in the evaluation but is discussed in more detail in the future work section.

4.3 The memory hierarchy and the selected algorithms

The largest difference between working towards CM and LM is that LM doesn't allow dynamic or variable memory allocation because there is no heap allocation in the LM. This affects the algorithms in several ways.

4.3.1 Random Forest

The decision trees that make up the RF ensemble are constructed by splitting a parent's data set into two smaller sets of data which are given to its children. As mentioned in Section 2.8 a split occurs at every node until a stopping criterion is met. This means that for every split we need to allocate space for the two new data sets. If we were to run the algorithm in the LM, which doesn't allow dynamic or variable memory, we would require O(2^d × D) bytes of storage where d is the depth of the tree and D is the size of our data set in bytes. This becomes even worse in the case where we build our trees in parallel. The memory required then becomes O(t × 2^d × D) where t is the number of trees we are constructing in parallel. For a relatively small ensemble of 5 trees with depth 7, a 1 MB data set then requires 5 × 2^7 × 1 = 640 MB of memory. This is well beyond the size limitations of both the LM and the CM. A non-parallel version of RF still uses 2^7 × D, which is infeasible for practically any data set when working towards the LM. Because of this, a compromise between the CM and the LM has been implemented for the RF algorithm. The algorithm will store its data in the CM and read chunks of data from the CM to the LM for processing. The EMCA's memory architecture allows us to efficiently read large blocks from the CM into the LM, and then save the results back to the buffer in the CM. This compromise will result in a model that is slower than one operating exclusively on the LM, but the alternative is that the RF algorithm is restricted to a small fraction of the available data set or that the EMCA will run out of memory.

If we can make use of dynamic memory we can however reduce the required memory from 2^d × D to d × D by using Depth First Search (DFS). Consider the decision tree with the architecture presented in Figure 4.5a. As we construct our tree using DFS we will visit all the leftmost nodes until we reach a leaf. The decision tree memory allocation will then only need to allocate memory for the visited nodes in Figure 4.5b. When we have reached a leaf by fulfilling some stopping criterion, we can free the memory of the node because it has no children that depend on its data set, as seen in Figure 4.5c. As we visit the last child node in the branch we can recursively free all memory as seen in Figure 4.5d. As we can see, we never have to allocate more memory than the size of the data set times the depth of the tree.
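The two memory bounds can be compared with a small Python sketch. The function names are hypothetical and the numbers are the upper bounds given in the text, not measurements from the EMCA.

```python
def static_allocation_mb(dataset_mb, depth, trees=1):
    """Upper bound t * 2^d * D when every node's data set is allocated up front."""
    return trees * (2 ** depth) * dataset_mb


def dfs_allocation_mb(dataset_mb, depth, trees=1):
    """Upper bound t * d * D when only the current DFS branch is kept in memory."""
    return trees * depth * dataset_mb


# The example from the text: 5 trees of depth 7 and a 1 MB data set.
static_mb = static_allocation_mb(1, 7, trees=5)
dfs_mb = dfs_allocation_mb(1, 7, trees=5)
```

For this example the bound drops from 640 MB to 35 MB, which shows why dynamic allocation in the CM was preferred over static allocation in the LM.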

Figure 4.5: Visualization of the training algorithm's memory requirements while constructing a decision tree, shown in four panels (a) to (d). Figure 4.5a corresponds to the initial skeleton of the graph. Figure 4.5b corresponds to the nodes' allocated memory once the algorithm searches for leaf nodes using DFS. Figure 4.5c corresponds to memory being freed recursively since there are no more children.

The RF ensembles included in the results consist of 10 decision trees. Each tree was built with a maximum depth of 6. If a data set split consisted of fewer than 10 samples, the node was turned into a leaf. Each decision tree obtained a random subset containing one third of the available data. This was done in order to prevent the CM from filling up due to the large memory consumption of the parallel RF algorithm.

RF has the prediction complexity O(d × t) where d is the depth of the tree and t is the number of decision trees in the ensemble. Training an unpruned tree has complexity O(f × D × log(D)) where f is the number of features in the data, and D is the size of the data set [40].

4.3.2 Neural Network

The NN uses a weight for every connection between nodes. These weights are represented by several float matrices which are statically allocated at the beginning of the algorithm's execution. The dimensionality of the NN's different components is not changed during execution time, which means that the components can be statically allocated and will always have the same size. This in turn means that an NN can operate well both towards the CM and the LM. The architecture used for Iris and Wine Quality was a network with 3 hidden layers that consisted of [8, 12, 16] nodes. The last test case provided by Ericsson required a more complex network and consisted of [32, 64, 64] nodes in the hidden layers. The network architectures were decided by exploratory search where a good compromise between training time and accuracy was the goal. All layers used the Leaky ReLU activation function. The Sigmoid function was also a candidate for an activation function but was omitted because it slowed down the convergence of the network and did not improve the accuracy.

The NN was trained for 50 epochs in the regression case, and 35 epochs in the classification case. The learning rate was set to 0.1 and was halved every 5 epochs. The momentum coefficient α was set to 0.95 and the L2 regularization constant λ was set to 1e−6.
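The learning rate schedule can be written as a one-line function, sketched here in Python. The function name is hypothetical, and counting epochs from zero, so that the first halving happens at the sixth epoch, is an assumption of this example.

```python
def learning_rate(epoch, initial=0.1, decay_every=5, factor=2.0):
    """Learning rate for a zero-based epoch, divided by `factor` every `decay_every` epochs."""
    return initial / (factor ** (epoch // decay_every))


# Learning rates for the first 15 epochs of training.
schedule = [learning_rate(epoch) for epoch in range(15)]
```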

Performing a prediction using a neural network requires us to pass the data sample through the layers of the neural network, multiplying the sample by each layer’s weights, adding the bias and applying the activation function. Therefore the NN performs a prediction with complexity equal to the number of connections in the network. The number of connections in a k-layered network corresponds to the number of weights (plus bias) and equals $f \times L_1 + \sum_{i=1}^{k-1} (L_i \times L_{i+1}) + L_k \times L_{out}$, where f is the number of features, $L_n$ is the number of nodes in layer n and $L_{out}$ is the number of nodes in the output layer. The training complexity is O(e × D × p) where e is the number of epochs the network is trained for, D is the size of the data set and p is the prediction complexity [11].
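As a worked example of the connection count, the sketch below evaluates the expression for the [8, 12, 16] architecture. The input size of 4 and output size of 3 are assumptions based on the standard Iris data set (4 features, 3 classes); the thesis does not state the output layer size here, and the function name is hypothetical.

```python
def count_connections(n_features, hidden_layers, n_outputs, include_bias=True):
    """Count weights (and optionally one bias per non-input node)."""
    sizes = [n_features] + list(hidden_layers) + [n_outputs]
    # One weight per pair of nodes in consecutive layers.
    weights = sum(a * b for a, b in zip(sizes, sizes[1:]))
    biases = sum(sizes[1:]) if include_bias else 0
    return weights + biases


# 4*8 + 8*12 + 12*16 + 16*3 = 368 weights, plus 8 + 12 + 16 + 3 = 39 biases.
print(count_connections(4, [8, 12, 16], 3, include_bias=False))
print(count_connections(4, [8, 12, 16], 3))
```

The count depends only on the architecture and not on the number of training samples, which is why the prediction cost of the NN is fixed once the network is chosen.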

4.3.3 Support vector machine

Like the NN, the SVM only uses statically allocated memory. The Lagrange multiplier matrix α and the error cache, however, scale with the cardinality of the data set. This means that as the size of the data set increases, the memory required for the SVM will also increase. The time required for SMO will also increase as we are performing pairwise optimization between the data points. As the data set increases in size the optimization will take longer and longer to converge. SVMs are sensitive to parameter selection, which means that we have to tune our hyperparameters carefully. One of the hyperparameters that greatly affects the convergence time is the C-value. The C-value controls our tolerance for misclassification during training. A small C means that we accept misclassifications in order to provide a large margin to the remaining samples. A large C means that the optimization will not converge as long as there are misclassifications.

The RSVM used a C-value of 5.0, an error tolerance level of 0.01, and a slack variable coefficient of 0.1. The LSVM used the same values but required a lower C-value of 0.5.

The LSVM has prediction complexity O(f + 1) where f is the number of features. The RSVM has prediction complexity O(D) where D is the number of samples in the training data set. Both SVMs use SMO for training. SMO has a complexity of around O(D) to O(D^3) according to Platt [30] but is defined as an optimization procedure that only exits once certain conditions are met.
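The difference between the two prediction costs can be illustrated with the two decision functions. This is a sketch under assumptions: the RSVM is taken to use a radial basis function kernel, `coeffs` stands for the products of Lagrange multipliers and labels, and all function names and the `gamma` parameter are hypothetical.

```python
import math


def predict_linear(weights, bias, x):
    """Linear SVM: one multiply-add per feature plus the bias, O(f + 1)."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def rbf_kernel(a, b, gamma):
    """Radial basis function kernel between two samples."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def predict_rbf(support_vectors, coeffs, bias, x, gamma=1.0):
    """Kernel SVM: one kernel evaluation per support vector.

    In the worst case every training sample is a support vector, giving O(D).
    """
    return sum(c * rbf_kernel(sv, x, gamma)
               for sv, c in zip(support_vectors, coeffs)) + bias


print(predict_linear([1.0, 2.0], 0.5, [3.0, 4.0]))
print(predict_rbf([[0.0, 0.0]], [2.0], 0.0, [0.0, 0.0]))
```

The linear model collapses all support vectors into a single weight vector, whereas the kernel model has to keep them, so its prediction time and memory grow with the number of support vectors.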

Chapter 5

Results

The purpose of running the performed experiments was to evaluate how the different algorithms scale in terms of classification accuracy/MSE, memory consumption and time requirements for the training and testing phases. Algorithms that require less than 40.96 seconds to execute were timed by Flake’s internal trace API that provides nanosecond precision. Experiments that require more than 40.96 seconds were measured by another timer. The latter timer measures the time it takes to compile and upload the code and to receive the results, and can only provide the time as integer seconds. The reason for this is that Flake’s trace API clips the nanosecond timer at 40.96 × 10^9. Flake is designed for signal processing where most tasks are finished within nano- or microseconds. Flake is simply not designed for computations of such length. All results except the timing measurements that took longer than 40.96 seconds are collected as the 10-fold cross validation average. In 10-fold cross validation the test set contains one tenth of the number of available samples. The training set contains the remaining samples. The timing measurements for the algorithms that required more time are collected as averages of at least 3 runs.
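The 10-fold split described above can be sketched as follows. The function name is hypothetical and the sketch partitions the indices in order; whether and how the thesis implementation shuffles the data before splitting is not stated, so shuffling is left out.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any remainder.
        stop = start + fold_size if fold < k - 1 else n_samples
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test


# Iris: 150 samples give 135 training and 15 testing samples per fold.
for train, test in k_fold_indices(150):
    print(len(train), len(test))
```

Every sample appears in exactly one test fold, so the reported averages use each sample for testing once.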

The training phase is the phase where the algorithm trains the model. The testing phase is the phase where the algorithm performs prediction on the test data set and evaluates its performance. The reason why the execution times of the training and testing phases differ between the algorithms is that they have different computational complexities. This chapter contains measurements and results of the performed experiments. Worst case complexity is discussed further in Section 4.3.

Any discussion of results related to the RF and PRF algorithms, excluding training time and memory usage, applies to both PRF and RF. The algorithms are implemented using the same code with the exception that PRF trains the trees that constitute the ensemble in parallel over many cores whilst RF trains the trees sequentially.

5.1 Classification

The following section contains the experiment results collected when performing classification experiments on the Iris data set. The data size for the Iris data set is 3 kB. The initial data set is omitted from the memory requirement measurements:

Table 5.1: Iris data set classification results

                NN      RF      PRF     LSVM    RSVM
Accuracy %      95.3    94.0    94.0    92.7    96.7
Memory (kB)     2.1     13.0    57.1    1.8     1.8
Training (ms)   129.8   28.4    4.67    4.8     3595.3
Testing (ms)    0.112   0.204   0.204   0.004   4.636

Table 5.1 shows the results of the experiments run on Iris for all algorithms. As we can see, all algorithms have a high accuracy whilst other measurements such as training and testing time differ more. This is because the Iris problem provides a relatively simple problem domain with few samples, a low number of features and a relatively simple decision surface.

In order to ensure that the models provide reliable results, the models were also replicated and tested using machine learning libraries in Python and evaluated on a desktop. NN was implemented using the Python library Keras and achieved 95.23% test accuracy on the Iris data set. RF was implemented using R and achieved 93.5% accuracy. RSVM was implemented in the Python library scikit-learn and achieved 97.7% accuracy. LSVM was also implemented with scikit-learn and achieved 93.3% accuracy. The largest difference between results was for RSVM, where the scikit-learn implementation performed 1 percentage point better than the EMCA implementation.

Table 5.2: Iris 10-fold cross validation standard deviation of accuracy

150 samples                  NN     RF     PRF    LSVM   RSVM
Standard deviation EMCA      4.27   3.59   3.59   8.14   4.47
Standard deviation Desktop   5.05   4.01   4.01   6.15   4.08

Figure 5.1: 10-fold cross validation accuracy and accuracy standard deviation of the algorithms implemented on the EMCA and the desktop reference algorithms.

Table 5.2 shows the standard deviation of the accuracy obtained from the 10-fold cross validation experiments. Figure 5.1 shows the accuracy and the standard deviation for the algorithms implemented on the EMCA and the models built using libraries on the desktop. As expected, the results are similar, which validates the EMCA implementations. Variability is estimated as error bars using standard deviation and is further explored in Section 5.3.

(a) Training time

(b) Testing time

Figure 5.2: Figure 5.2a shows the time spent on the training phase for the selected machine learning algorithms on the Iris data set. The training set contains 135 samples. Figure 5.2b shows the time spent on the testing phase for the selected machine learning algorithms on the Iris data set. The testing set contains 15 samples.

As we can see in Figures 5.2a and 5.2b, the algorithms’ training and testing phases require different amounts of time. RSVM was omitted from both figures as it required a lot more time than the other algorithms in order to finish its execution. NN also takes a relatively long time to finish its training phase, as can be seen in Figure 5.2a, but is second only to LSVM in the time it takes to finish the testing phase, as seen in Figure 5.2b. RF and PRF perform well on the training phase as seen in Figure 5.2a but fall behind in the testing phase as seen in Figure 5.2b.

Figure 5.3: PRF and RF execution time. Suite RF and Suite PRF describe the time required to finish the entire test suite, which involves training the ensembles and performing the tests. Train RF describes the sum of the time required for the training phase. Since RF is sequential this is equal to the time it takes to complete the test suite minus the time it takes to complete the test phase. Train PRF describes the sum of the time elapsed over all cores in order to train the PRF ensemble. 150 samples were used in this experiment.

RF builds every decision tree on the same core whilst PRF distributes the trees on multiple cores. Suite PRF finishes execution within 3.1 ms whilst Suite RF requires 18.8 ms. This is because the 10 decision trees that constitute the ensemble are built in parallel on 10 different cores. In Figure 5.3 we can see that the sum of the time spent on all cores is comparable but PRF finishes quicker due to parallelism.
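The two suite times quoted above translate into a speedup and a parallel efficiency as shown in the short calculation below. The variable names are hypothetical. Note that the suite times include the sequential testing phase, so the figures are a lower bound on the speedup of the training phase alone.

```python
# Suite times reported for the Iris experiment.
suite_rf_ms = 18.8
suite_prf_ms = 3.1
n_cores = 10  # one core per decision tree

speedup = suite_rf_ms / suite_prf_ms
efficiency = speedup / n_cores

print(f"speedup: {speedup:.2f}x, parallel efficiency: {efficiency:.2f}")
```

A speedup of roughly 6 on 10 cores is consistent with the explanation in the text: the tree building is parallel, while the testing phase and the synchronization between cores are not.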

5.2 Regression

When performing regression, classification accuracy cannot be used as a performance measure for our model. The performance of the model is quantified in terms of MSE. As stated in Section 4.1.4, all data except certain hyperparameters were stored on the CM whilst performing the following experiments.

5.2.1 Wine Quality

The following section presents benchmarks on varying subsample sizes from the Wine Quality data set. This was done in order to highlight how the different performance metrics were affected by the availability of data. As some algorithms were unable to finish execution on some tasks, a 300 second time limit was given to each algorithm in order to provide an upper limit for execution. If the algorithm exceeds the limit, the results are marked by a "-". Memory requirements for the RF are measured as the highest observed peak using Flake’s dynamic memory diagnostic tool. As a reference for the algorithms’ performance, an algorithm that always guesses the average label value gives an MSE of 0.88.
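The reference value quoted above comes from a predictor that ignores the features and always returns the mean label. A minimal sketch of how such a baseline MSE is computed is given below; the function name and the toy labels are illustrative and are not taken from the Wine Quality data.

```python
def mean_baseline_mse(train_labels, test_labels):
    """MSE of a predictor that always outputs the mean training label."""
    mean = sum(train_labels) / len(train_labels)
    return sum((y - mean) ** 2 for y in test_labels) / len(test_labels)


# Toy quality scores, for illustration only.
print(mean_baseline_mse([5, 6, 7], [5, 6, 7]))
```

When the training and test labels follow the same distribution this baseline is approximately the variance of the labels, so a model only adds value if its MSE falls clearly below that number.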

Figure 5.4: 10-fold cross validation MSE and MSE standard deviation of the algorithms implemented on the EMCA and the desktop reference algorithms.

Figure 5.4 shows the MSE and the standard deviation for the algorithms implemented on the EMCA and the models built using libraries on the desktop. The results are, like in Figure 5.1, similar, which validates the EMCA implementations. The performance of the RSVM and LSVM proved to be worse than the performance achieved by NN and RF/PRF. The reason for this is discussed further in Section 6.2.

Table 5.3: Wine Quality evaluation results, 500 samples

500 samples    NN       RF       PRF      LSVM     RSVM
MSE            0.61     0.59     0.58     0.83     0.77
Memory (kB)    2.1      300      459      12.0     12.0
Training (s)   2.60     4.97     0.58     0.21     5.04
Testing (s)    0.0076   0.0031   0.0031   0.0003   0.0056




As we can see in Table 5.3, RSVM manages to obtain results slightly better than predicting the average value. As the number of samples increases the algorithms take more and more time to converge and, as we can see in Tables 5.4, 5.5 and 5.6 of the Wine Quality experiments, LSVM and RSVM either converge to guessing at the average or do not converge at all.




Table 5.6: Wine Quality evaluation results, 4898 samples

4898 samples               NN      RF       PRF     LSVM    RSVM
MSE                        0.53    0.56     0.55    0.83    -
Standard deviation (MSE)   0.010   0.011    0.011   0.000   -
Memory (MB)                0.002   2.79     14.61   0.06    -
Training (s)               25.40   208.96   20.90   22.50   -
Testing (s)                0.075   0.043    0.046   0.003   -

As we can see in Tables 5.3, 5.4, 5.5 and 5.6, the MSE is only marginally improved as we increase the number of samples in the Wine Quality test case. Most other metrics such as training/testing time and memory increase as we increase the number of samples. If we look at Tables 5.3 and 5.6 we can see that the memory required by PRF increases from 459 kB to 14.61 MB whilst the MSE only improves by 0.03. The reason for this is discussed in detail in Section 4.1.2.

(a) Training time

(b) Testing time

Figure 5.5: Figure 5.5a shows the time spent on the training phase for the selected machine learning algorithms for the Wine Quality task. The training set uses 450 samples. Figure 5.5b shows the time spent on the testing phase for the selected machine learning algorithms for the Wine Quality task. The testing set uses 50 samples.

Figures 5.5a and 5.5b visualize the training and testing times of the experiment when we use a data set containing 500 samples. If we compare the results with the Iris experiment results presented in Figures 5.2a and 5.2b we can see that the execution time of the algorithms has changed. This is because we are now using the CM, which is slower than the LM. NN performed relatively well on the testing phase in Iris and relatively poorly in the training phase, whereas RF performed poorly on the testing phase and well on the training phase. We can clearly see from Figures 5.5a and 5.5b that the exact opposite is now the case for NN and RF. Despite the fact that RSVM converges relatively more quickly than it did in Figure 5.2a, it is unable to provide good results as seen in Table 5.3.

Figure 5.6: PRF and RF execution time on the Wine Quality data set. Suite RF and Suite PRF describe the time required to finish the entire test suite, which involves training the ensembles and performing the tests. Train RF and Train PRF describe the sum of the time required to train the ensemble elapsed over all cores, and we can clearly see that Train RF and Train PRF are roughly the same with some variance due to the randomly sampled data sets. 500 samples were used in this experiment.

Figure 5.7: PRF and RF DSP workload on the Wine Quality experiment using 500 samples. The horizontally distributed blue bars in the beginning of the execution describe the PRF building 10 decision trees in parallel on dsp[1, ..., 10] whilst the main thread running on dsp0 yields. After the parallel algorithm finishes, the sequential RF runs and builds 10 decision trees sequentially on dsp0.

Figure 5.7 provides a comparison between RF and PRF. Given that each decision tree can utilize its own DSP, the total runtime for PRF is equal to the time of the longest executing subtask (dsp4) plus the synchronization cost, whilst the runtime for RF equals the sum of all subtasks.
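The runtime model in the paragraph above can be written down directly. The sketch below uses made-up per-tree build times and a made-up synchronization cost purely for illustration; the function names are hypothetical and none of the numbers are measurements from the EMCA.

```python
def sequential_time(tree_times):
    """RF: the trees are built one after another on a single DSP."""
    return sum(tree_times)


def parallel_time(tree_times, sync_cost=0.0):
    """PRF: one DSP per tree, so the slowest tree determines the runtime."""
    return max(tree_times) + sync_cost


# Illustrative build times in seconds for three trees.
times = [0.4, 0.5, 0.6]
print(sequential_time(times), parallel_time(times, sync_cost=0.05))
```

Because the parallel runtime is governed by the slowest subtask, uneven tree sizes caused by the random subsampling directly limit the achievable speedup.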

Figure 5.8: Graph visualizing the increase in training time as the number of samples increases for the Wine Quality experiment.

Figure 5.8 illustrates the dependence of the measured training time on the sample size of the Wine Quality experiments. As we can see, NN is the only algorithm whose training time appears to increase linearly with the number of samples.

5.2.2 Beamforming

RSVM and LSVM were restricted to a random subset of 20% of the available samples as the training time increased rapidly with the number of samples with no positive effect on the results. Each of the 10 trees in the RF ensemble was given a random subset of 1/3 of the available samples. Any time measurements in this section that took longer than 1000 seconds to finish execution are marked by "-".

Table 5.7: Beamforming evaluation results, 2000 samples

2000 samples   NN      RF      PRF     LSVM    RSVM
Memory (kB)    10.6    159.0   760.5   24.0    24.0
Training (s)   39.00   6.188   0.720   0.5     43.00
Testing (s)    0.175   0.214   0.215   0.02    0.201

The performance of RSVM and LSVM was, like in the Wine Quality experiments, poor. The training algorithm did not converge unless the C value was low. LSVM was only capable of predicting the average value of the training data set’s label regardless of hyperparameter tuning. RSVM obtained slightly better results than LSVM when C was set to 0.5.


Table 5.8: Beamforming evaluation results, 10000 samples

10000 samples   NN       RF       PRF     LSVM   RSVM
MSE             13.65    9.83     9.82    -      -
Memory (kB)     10.6     1940.0   8530    -      -
Training (s)    192.00   86.00    11.00   -      -
Testing (s)     0.84     0.53     0.54    -      -


Table 5.9: Beamforming evaluation results, 20000 samples

20000 samples   NN       RF       PRF     LSVM   RSVM
MSE             12.11    9.66     9.67    -      -
Memory (MB)     0.01     2.75     14.58   -      -
Training (s)    387.00   331.00   33.35   -      -
Testing (s)     1.74     1.07     1.08    -      -

The following experiment utilizes the EM instead of the CM as larger sample sizes require more memory than the CM can supply.

Table 5.10: Beamforming evaluation results when using the EM, 70000 samples

70000 samples              NN        RF        PRF      LSVM   RSVM
MSE                        10.03     9.55      9.55     -      -
Memory (MB)                0.01      7.0       26.2     -      -
Training (s)               1313.00   6654.00   711.00   -      -
Testing (s)                6.20      4.07      4.06     -      -

10000 samples
Standard deviation (MSE)   0.651     1.696     1.696    -      -

Table 5.10 shows the results of the experiment when operating with a sample size of 70000 towards the EM. It also shows the standard deviation of the MSE for the experiment with the sample size of 10000. As we can see, the NN has a lower standard deviation than RF and PRF.

Figure 5.9: The MSE of NN and RF with the different sample sizes used in Tables 5.7-5.10.

As can be clearly seen in Figure 5.9, the MSE is reduced when increasing the number of samples. The performance of the RF seems to improve only marginally after 20000 samples. An ensemble with more trees or with deeper decision trees might be required in order to provide better results for the RF algorithms. PRF was omitted from the graph since the only difference between RF and PRF is that PRF employs a parallel implementation, which reduces the compute time. Any observed MSE difference is due to variance and the random sampling of the data sets.

(a) Training time

(b) Testing time

Figure 5.10: Figure 5.10a corresponds to the time required for the training phase with different training data sizes illustrated in a line graph. Figure 5.10b corresponds to the time required for the testing phase with different testing data sizes illustrated in a line graph.

Figure 5.10a shows the time the training algorithm required for all the different subsamples of the beamforming experiment. As we can see, the NN’s training time still appears to increase linearly with the training data when using the more complex architecture. As the number of samples increases, the RF and PRF training time increases non-linearly. The parallel nature of the PRF still makes it faster than the NN.

Figure 5.10b shows the testing time required. Both the NN and RF appear to follow a linear trend with an increasing number of samples.

Figure 5.11: Memory required by the algorithms as the number of samples used increases.

Figure 5.11 shows the memory utilization of the algorithms as the number of samples increases. As expected and shown in previous experiments, the NN only requires constant memory for the network architecture. The memory required by RF and PRF appears to increase logarithmically with the number of samples.

5.3 Variability

Real time systems such as Ericsson’s RBS systems need to be predictable in order to guarantee the real time performance. Critical tasks need to have preallocated resources so that deadlines can be met. If an algorithm running on a real time system has a stochastic component, the system needs to allocate time and resources for the worst possible case. This is a serious issue on resource constrained systems as resources are scarce. Using 10-fold cross validation provides information about the variance in the context of generalisation properties of the algorithm. All results presented in this section are obtained from 10-fold cross validation experiments. As described in Section 2.9, 10-fold cross validation splits the data set so that 10% of the available samples are used as the test set whilst the remaining data is used for training.
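The per-fold results are summarized by their mean and standard deviation. A minimal sketch using the Python standard library is shown below. The fold accuracies are invented for illustration, the function name is hypothetical, and the use of the population standard deviation (`statistics.pstdev`) rather than the sample standard deviation is an assumption, since the thesis does not state which estimator was used.

```python
import statistics


def summarize_folds(scores):
    """Return (mean, population standard deviation) of per-fold scores."""
    return statistics.mean(scores), statistics.pstdev(scores)


# Invented per-fold accuracies (percent) for a 10-fold run.
fold_accuracies = [93.3, 100.0, 86.7, 93.3, 100.0, 93.3, 86.7, 100.0, 93.3, 93.3]
mean, std = summarize_folds(fold_accuracies)
print(f"mean = {mean:.1f}, std = {std:.2f}")
```

With only 15 test samples per fold on Iris, a single misclassified sample shifts a fold's accuracy by almost 7 percentage points, which is why the standard deviations on small data sets are comparatively large.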

Figure 5.12: The average accuracy and the accuracy’s standard deviation calculated from the 10-fold cross validation experiments on the Iris data set. The training set contained 135 samples and the testing set contained 15 samples.

As described in Section 2.9, k-fold cross validation runs the algorithm k times and divides the training and testing sets so that all samples are used as testing samples at some point during execution. As we can see in Figure 5.12, the LSVM has a relatively high standard deviation. The reason for this is that LSVM is a linear classifier whilst two of the Iris classes are not linearly separable. If the validation set happens to draw many samples from the classes that are linearly separable, the LSVM will most likely perform well. If samples are drawn from the classes that are not linearly separable, the algorithm will perform worse. Such a small validation set has a low chance of estimating the algorithm’s true ability to generalize, which will result in a validation accuracy with a high standard deviation.

Figure 5.13: The average MSE and the MSE standard deviation calculated from the 10-fold cross validation experiments on the Wine Quality data set. The training set contained 4408 samples and the testing set contained 490 samples.

As we can see in Figure 5.13, the standard deviation is lower than the standard deviation in the Iris experiment (cf. Figure 5.12). One reason for this is that the data set used in the Wine Quality experiment is larger than the Iris data set. This means that the test set has a higher chance of representing the underlying distribution of the data, which will result in a more accurate estimate of the generalization performance.

Figure 5.14: The average MSE and the MSE standard deviation calculated from the 10-fold cross validation experiments on the Wine Quality data set. The training set contained 1080 samples and the testing set contained 120 samples.

Figure 5.14 uses 1200 samples instead of 4898 samples like in Figure 5.13. We can clearly see that both the standard deviation and the MSE increase for the experiment with the lower number of samples.

Figure 5.15: The average MSE and the standard deviation of the 10-fold cross validation experiments on the beamforming data set. The training set contained 9000 samples and the testing set contained 1000 samples.

Table 5.11: Iris data set training time measurement and standard deviation

150 samples               NN      RF     PRF    LSVM   RSVM
Training (ms)             129.8   28.4   4.67   4.8    3595.3
Standard deviation (ms)   0.00    0.91   0.91   5.43   1278.10

Both LSVM and RSVM obtained a large standard deviation in their training times. The reason for this is that the optimization algorithm SMO only stops once certain prerequisites are met, which means that it does not have a deterministic execution time. It is interesting to see that the training time of the NN provided a standard deviation close to zero. The reason for this is that there is little randomness involved in the training of the implemented NN. The only random component that was involved in the training of the NN is the random initialization of the weights. The weight initialization can affect the convergence time of the NN if the training phase involves a stopping condition that terminates training when a certain performance is obtained. No such condition was however implemented and the NN only exits training when the maximum number of training epochs has been reached. The standard deviation of the RF algorithm’s training time was above zero, which was to be expected due to the large amount of randomness involved in building the trees that constitute the ensemble.

Table 5.12: Iris data set testing time measurement and standard deviation

150 samples               NN        RF        PRF       LSVM      RSVM
Testing (ms)              0.112     0.201     0.201     0.004     4.636
Standard deviation (ms)   9.76e-5   1.10e-2   1.10e-2   4.02e-3   3.10

The RSVM had the largest standard deviation during the testing time measurements as well. This is due to the fact that any of the samples in the training data set can become a support vector. If many training samples become support vectors, the prediction takes more time and vice versa. The LSVM also obtains a standard deviation, albeit a low one. The reason for this is that LSVM classification does not use support vectors or any random components. LSVM classification is described in further detail in Section 2.7. The NN had a standard deviation close to zero, which was also observed in the training phase, as seen in Table 5.11. RF and PRF had the second largest standard deviation, which is caused by the inherent randomness in the algorithm.

Flake’s trace function API was used in order to obtain the measurements for Tables 5.11 and 5.12. As described in the beginning of Chapter 5, Flake’s trace functionality can only trace functions that take less than 40.96 seconds to run. The 10-fold cross validation requires running the algorithms ten times with ten different compositions of training/test sets. This means that the experiments using the Iris data set are the only experiments that can be properly measured with nanosecond precision using cross validation. In order to collect the other measurements that took longer than 40.96 seconds to run, another timer was used that measured the time it took to upload the code, execute it and receive the response. This measurement is noisy as the timer does not exclusively measure the execution of a certain component in the algorithm such as the training or testing phase, but instead accounts for the time the entire test suite takes from start to completion. The only way to collect detailed cross validation time measurements for tasks longer than 40.96 seconds is to manually partition the data sets and to collect all the k-fold start-to-completion measurements individually. This would be extremely limited, time consuming and imprecise. Calculating the standard deviation of time measurements from tasks taking over 40.96 seconds is therefore omitted because of limitations in Flake.

Chapter 6

Discussion

6.1 Neural Network

In the range of the experiments the NN training time appeared to follow a linear trend as seen in Figures 5.8 and 5.10a. More simulations are however required in order to confirm the linear characteristic. All resources that the NN used were statically allocated, which means that we do not risk exceeding the platform’s memory limit as we train the model. The NN also obtained the lowest (best) MSE score on the Wine Quality data set and a high score on the Iris data set, which implies that it is a strong candidate for both regression and classification. As we can see in the classification experiments in Section 5.1, the NN training time was comparatively slow when tested with a small data set and the fast local memory. The testing time was however comparatively fast in this setting as seen in Figure 5.2b. The training phase was relatively fast when the data set was large and we operated towards a slower memory, but the testing phase became comparatively slow. As we can see in Section 5.2.2, the training time increased by a large factor when we increased the size of the network, which was expected. As the number of samples increases the training time becomes similar to the other algorithms, see Figure 5.9, and the performance improves, see Figure 5.9, which supports the claim that the NN scales well with an increasing number of samples.

6.2 Support Vector Machine

As the data set size increased, the training time required for the SVM’s SMO algorithm to converge increased very rapidly. The RSVM obtained the best accuracy on the Iris classification data set as can be seen in Table 5.1. The training and testing times for the RSVM were however much higher than for the other algorithms. The LSVM obtained the worst results on the Iris experiment and was only capable of guessing at the average, or worse than the average, on the regression experiments. One reason for this is that none of the data sets used in the experiments were linearly separable.

SVMs are not guaranteed to converge as their training phase consists of an optimization procedure that does not exit until certain conditions defined by hyperparameters are met. As we can see in the Wine Quality and Beamforming experiments, both the LSVM and RSVM struggle to converge as the data set size increases. This is because the optimization algorithm could not find a hyperplane that separates the data under the conditions defined by the hyperparameters. Relaxing the constraints of the optimization by decreasing the cost variable C did cause the optimization to converge but resulted in a high mean squared error of 3-4. This roughly corresponds to predicting the average target value of the training set. The SMO algorithm has been shown to be effective on classification tasks, but there are few reports of the SMO algorithm being successfully used on regression problems. The reason is that the SMO algorithm originally presented by Platt et al. can be extremely slow on non-sparse data sets and on problems that have many support vectors [16]. It also performs poorly on regression tasks [34, 21]. Both the Wine Quality and the Beamforming experiments present non-sparse regression tasks and roughly half of the samples become support vectors.

In order to investigate state-of-the-art performance in SVMs, RSVM and LSVM implementations from SKlearn were also tested on the Wine Quality experiment. The SKlearn implementation’s optimization algorithm uses SMO and Working set selection (WSS) presented by Fan et al. [12]. WSS uses second-order information in order to provide faster convergence. Theoretical properties such as linear convergence are established using WSS [12] whilst the SMO algorithm presented by Platt has complexity O(D^3) where D is the size of the data set [30]. This allows the SKlearn implementation to converge within the time limit given more strict optimization conditions, which improves performance. The SKlearn algorithm was able to reduce its MSE to 0.80 for RSVM and 0.71 for LSVM when tweaking hyperparameters. This performance is better than the RSVM/LSVM results presented in Figure 5.4 but worse than NN and RF. This implies that even if we improve the optimization algorithm, SVR is not well suited to the Wine Quality problem. The goal of the thesis was not to implement the best possible algorithms in terms of complexity or performance but to evaluate the selected algorithms adapted to the EMCA. The reason why WSS was not implemented on the EMCA was that performance was not the only focus of the experiments. Implementation and adaptation to a very specialized piece of hardware requires a great deal of time, which requires us to limit the implementations. The SMO algorithm presented by Platt was considered a sufficiently good algorithm for the purpose of the experiments.

6.3 Random Forest

The speed of the Random Forest’s training phase depends heavily on the speed of the memory read and write access times. As we can see in Figure 5.2a, the Random Forest algorithm is second only to the linear SVM in training time when performing classification even though the Random Forest algorithm used a LM/CM compromise as described in Section 4.3.1. As we can see in Figure 5.2a, the PRF algorithm obtained the best training time of all the algorithms even though all data were stored in the CM. The reason behind this is that each tree’s data set is saved and loaded onto the CM in batches as the algorithm needs them. The EMCA is capable of saving and loading large blocks of memory in only one load or write operation. This architecture fits the Random Forest algorithm very well since the data set belonging to a node is only loaded once from the CM (given that it fits in the LM). The CM also has access to dynamic memory, which means that building the ensemble will have memory complexity O(d × D) instead of O(2^d × D) as described in Section 4.3.1. As seen in Figure 5.8, the Random Forest algorithm’s training time increases exponentially with the training data. The memory required for training the Random Forest becomes much larger than for the other algorithms when we are dealing with larger data sets (see Figure 5.6).

Figure 5.11 shows interesting results as the RF algorithm’s memory requirements appear to scale logarithmically with an increasing number of samples. The reason for this is that even though the upper limit of the memory complexity is O(d × D), the algorithm behaves differently. The RF algorithm stops building a branch and creates a leaf node once a certain stopping condition is met. An example of a stopping condition is a maximum recursion depth in the decision tree, or that the number of data samples belonging to a node has to be larger than some predefined limit, or that the MSE/Gini index is deemed small enough. All of these stopping conditions were implemented in RF and PRF. As the number of samples increases and the data set becomes increasingly dense, or non-sparse, these conditions force the decision trees to create leaf nodes even when there are a lot of samples left in a node. This behavior massively reduces the amount of memory that the ensemble would otherwise use. Proper tuning of these parameters is important so that the accuracy/error impact is as small as possible.
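The three stopping conditions listed above can be combined into a single leaf test, sketched below. The depth limit of 6 and the minimum of 10 samples are the values reported in Section 4.3.1; the impurity threshold, the function name and the parameter names are hypothetical, since the thesis does not give the threshold used for the MSE/Gini condition.

```python
def should_become_leaf(depth, n_samples, impurity,
                       max_depth=6, min_samples=10, min_impurity=0.0):
    """Return True if the node should stop splitting and become a leaf.

    impurity is the node's MSE (regression) or Gini index (classification).
    min_impurity is a placeholder threshold, not a value from the thesis.
    """
    return (depth >= max_depth
            or n_samples < min_samples
            or impurity <= min_impurity)


print(should_become_leaf(depth=6, n_samples=5000, impurity=0.4))  # depth limit
print(should_become_leaf(depth=3, n_samples=9, impurity=0.4))     # too few samples
print(should_become_leaf(depth=3, n_samples=5000, impurity=0.4))  # keeps splitting
```

The first call illustrates the behavior discussed in the text: with a large data set the depth limit is reached while a node still holds thousands of samples, so the tree stops growing and its memory footprint no longer follows the data set size.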

6.4 Limitations

The key limitations of this thesis are presented in this section.

One significant limitation of this thesis is that Flake’s trace API was not able to provide reliable timing for algorithms that executed longer than 40.96 seconds. 10-fold cross validation involves running the algorithms 10 times and the total execution time for all 10 runs needs to be under 40.96 seconds in order to measure it. This made it infeasible to collect precise time measurements for cross validation for all experiments except the Iris experiments. Flake is designed for signal processing where most tasks are finished within milli- or microseconds. Machine learning introduces new requirements to the hardware platform, and neither Flake nor the EMCA is designed for such time consuming tasks.

Another limitation is that both regression data sets (Wine Quality, Beamforming) were non-sparse. The SMO algorithm that was used as the optimization algorithm for the SVMs performs poorly on non-sparse regression data sets [16]. The result was that the training algorithm either did not converge, or had to be tuned in such a relaxed way that it was only capable of guessing the average value of the label. There are other generalized versions of Platt's SMO algorithm that can better deal with non-sparsity and regression, which are discussed in Section 6.2. This thesis limits its scope to Platt's originally presented SMO algorithm and leaves improvements to future work.

In terms of statistical learning, variance, or standard deviation, is a key component of the evaluation of the generalization capabilities of the algorithms. The results chapter presents the obtained results without standard deviation. The reason for this is that the initial 10-fold cross validation experiments were run by calculating only the mean of the obtained results and not the variance. Section 5.3 explores variance in terms of standard deviation but limits the experiments to one sample configuration per algorithm, per test case. This was done because of time limitations. Rerunning the 10-fold cross validation on the RF algorithm using the largest sample size in the beamforming experiment would take more than 18 hours, and there are several sample size configurations and several algorithms. This means that the largest portion of our obtained results does not explore the variability of the results.
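For reference, the mean and the standard deviation can be accumulated together in a single pass over the fold results, so collecting both does not require rerunning an experiment or storing every result. The sketch below uses Welford's online algorithm; the class name and the per-fold accuracies are hypothetical and are not values from the thesis experiments.

```python
import math


class RunningStats:
    """Welford's online algorithm: mean and variance in a single pass."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    def std(self):
        """Sample standard deviation (n - 1 in the denominator)."""
        if self.n < 2:
            return 0.0
        return math.sqrt(self._m2 / (self.n - 1))


# Hypothetical per-fold accuracies from one 10-fold cross validation run.
fold_accuracies = [0.93, 0.95, 0.91, 0.94, 0.96, 0.92, 0.95, 0.93, 0.94, 0.97]
stats = RunningStats()
for acc in fold_accuracies:
    stats.add(acc)
print(f"mean={stats.mean:.4f} std={stats.std():.4f}")
```

The accumulator holds only three numbers, which makes it suitable for a memory constrained target where the per-fold results cannot be kept around.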

The EMCA hardware is also limited in the sense that it only supports the single-precision floating point format and not double precision. The magnitude of this limitation is however negligible, as the experiment that compared the performance of the algorithms to the performance of similar implementations on a desktop, where double precision was used, showed very little difference in accuracy, as seen in Figure 5.1.
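To give a feeling for the size of single-precision rounding error, the sketch below computes the same dot product twice: once in Python's native double precision and once with every intermediate result rounded to single precision via the `struct` module. The vectors are hypothetical, and the emulation only approximates what dedicated single-precision hardware does, so it should be read as an illustration rather than a reproduction of the comparison in Figure 5.1.

```python
import struct


def to_f32(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot_f64(a, b):
    """Dot product in double precision."""
    return sum(x * y for x, y in zip(a, b))


def dot_f32(a, b):
    """Dot product where every intermediate result is rounded to single precision."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_f32(acc + to_f32(to_f32(x) * to_f32(y)))
    return acc


# Hypothetical weight and feature vectors (all positive, so the sum is far from zero).
weights = [0.1 * (i % 7) + 0.05 for i in range(1000)]
features = [0.01 * (i % 13) + 0.5 for i in range(1000)]

exact = dot_f64(weights, features)
single = dot_f32(weights, features)
rel_err = abs(single - exact) / abs(exact)
print(f"double={exact:.9f} single={single:.9f} relative error={rel_err:.2e}")
```

For well-scaled inputs such as these, the relative error stays several orders of magnitude below typical differences in classification accuracy between algorithms.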

6.5 Thesis implications

This thesis has provided information about the scalability of machine learning algorithms on the EMCA. The results can be applicable to resource constrained hardware platforms with characteristics similar to those of the EMCA. The selected algorithms were implemented based on the original presentations of the algorithms, with some minor modifications, and adapted to operate on the EMCA. No specialized resource efficient implementations were chosen, so it is difficult to relate the obtained results to state-of-the-art resource constrained machine learning algorithms. Most work on resource constrained machine learning, e.g. [22, 26], revolves around finding specialized representations or architectures of already known algorithms. These specialized architectures are designed to improve the performance by reducing memory requirements, execution times or other performance metrics, as discussed in Chapter 3. This work is therefore novel because it touches upon the scalability of selected state-of-the-practice algorithms for a telecom hardware platform.

The main contributions of this work are knowledge about resource constrained systems and Ericsson's future RAN. The thesis work provides Ericsson with implementations of machine learning algorithms on the EMCA and presents how the different algorithms can be adapted so as to operate on the EMCA. Performance related metrics were also collected from the experiments. The results can provide some information about how the algorithms might perform in environments with characteristics similar to the EMCA, i.e. many parallel compute units and limited memory.

6.5.1 Ethics and Sustainability

The implementation of machine learning algorithms on resource constrained platforms offers more computation- and energy-efficient systems. A more energy-efficient system decreases the cost and the negative environmental effect that these systems have. However, the performance of machine learning algorithms tends to improve with an increasing amount of data and computational power. This means that the more data and the more computational power a system has, the better it will perform its intended task. This quality makes it very attractive to design large systems with plenty of resources when performing machine learning. The arrival of 5G will undoubtedly increase the data load of the radio networks, which means that the systems that handle data traffic, such as Ericsson's RBSes, will require more resources. If machine learning is intended to replace some of the algorithms currently running in the RBSes, it is likely that the performance requirements for the hardware will increase and that future systems must be designed with more resources available. This thesis has shown that the selected machine learning algorithms can be adapted to and operate on resource constrained hardware platforms. Not all machine learning applications require a great deal of resources, but rather an implementation that can efficiently utilize the available resources. The problem boils down to how many resources are available, what problem needs to be solved and how precise the algorithm needs to be. In my view, machine learning can provide solutions to an RBS that will provide a more autonomous, data driven and adaptive system, which will be better suited to handle the increasing amount of data traffic.

Large companies such as Google, Facebook and Microsoft collect user data on a worldwide scale, which they use internally to improve their systems. Ericsson's RBSes handle a great deal of sensitive data. Ericsson is however only the developer of the RBS. Ericsson sells its RBSes to network providers, which means that Ericsson does not have access to the data unless it is provided by its customers. The ethical issue for Ericsson in this scenario is what sensitive information is stored in the RBSes and how this information is used. Standards such as those from 3GPP are however designed to regulate this. Relating more specifically to machine learning on the EMCA, there are no real ethical conflicts to be reflected upon.

Chapter 7

Conclusions and Future work

7.1 Conclusions

This thesis has evaluated some selected machine learning algorithms by benchmarking their performance on the resource constrained EMCA. The experiments cast light on the question of how the selected machine learning algorithms scale in terms of memory consumption, latency and accuracy.

The NNs were predictable in their resource requirements as the algorithm only used statically allocated memory. The standard deviations of the time required for the training and testing phases were both close to zero, which means that the time required for training and testing has low variability. The NN can be costly in terms of prediction and training time but scales well with an increasing number of training samples. Both prediction time and training time appeared to increase linearly with the amount of data for the NN, but more experiments are required in order to confirm this. The memory required for the algorithm does not change during either training or testing.

The RSVM performed well in the classification experiment. As none of the selected data sets were linearly separable, the LSVM did not perform as well as the other algorithms in terms of accuracy and MSE. It did however provide the fastest prediction time for all experiments, which is an attractive quality. The selected quadratic optimization algorithm, SMO, used by both LSVM and RSVM, had trouble converging in the regression experiments, which limits how much can be said generally about SVMs.

RF proved to be a powerful algorithm well suited for parallel systems. RF obtained the lowest MSE on the beamforming data set and obtained a competitive score on the other two data sets. Training the RF ensembles became very expensive in terms of memory, or time, when the number of data samples increased. The training phase of RF and PRF does not appear to scale linearly with the amount of data, but the testing phase does appear to scale linearly. PRF used a great deal of memory but offered a comparatively fast training time in return. The RF algorithm used a lot of memory when compared to the SVM and NN but used only a fraction of the memory used by PRF. RF required much more time in order to finish its training phase. Using RF on the EMCA requires a software architecture that utilizes the different memories in conjunction.

The general conclusion we can draw about machine learning on the EMCA is that the training phase is the most problematic part. Any machine learning algorithm designed to perform the training phase on the EMCA needs to be carefully designed not to consume too many resources. Deciding which algorithm should perform the intended tasks on the EMCA, such as beamforming, depends on many factors. Factors such as how accurate we need to be, how much memory is available, how quickly we need answers and how much data we have are key to deciding the final implementation.

The research question considered in this thesis regarded the scalability of the ML adaptations implemented on the EMCA. This thesis has shown that the selected ML algorithms were applicable to the EMCA. The experiments have evaluated the scalability of the algorithms in terms of memory, latency, and accuracy when operating in the resource constrained environment. The research question has been answered and the goals of the project have been fulfilled.

7.2 Future work

7.2.1 Training off-target

The results show us that the training phase is by far the most problematic part of performing machine learning on the target platform. The training phase requires large amounts of memory, which quickly becomes computationally expensive. This is due to the fact that we have limited local memory available and long access times to the common and external memory. Increasing the number of samples is important because the accuracy of the model increases the more data we have available, as we can see in Figure 5.9.

One solution to this problem is to perform the training on an external machine with more resources. The trained model can then be transferred to the target platform via FTP or some other protocol. This would require support in Flake that allows us to package and load a model efficiently. If such functionality were present, we could train our model on a separate machine with more resources. This means that we could create more complex models that could not have been trained on the target platform due to the resource limitations.
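As a rough illustration of what packaging and loading a model could involve, the sketch below serializes a network's layer sizes and a flat list of weights into a compact binary blob of single-precision values and reads it back. The format, the `MAGIC` tag and the function names are hypothetical; nothing here reflects an existing Flake interface.

```python
import struct

MAGIC = b"MLM1"  # hypothetical 4-byte format tag


def pack_model(layer_sizes, weights):
    """Serialize layer sizes and a flat weight list as little-endian float32."""
    header = MAGIC + struct.pack("<I", len(layer_sizes))
    header += struct.pack(f"<{len(layer_sizes)}I", *layer_sizes)
    body = struct.pack("<I", len(weights))
    body += struct.pack(f"<{len(weights)}f", *weights)
    return header + body


def unpack_model(blob):
    """Inverse of pack_model; raises ValueError on an unknown format tag."""
    if blob[:4] != MAGIC:
        raise ValueError("not a model blob")
    offset = 4
    (n_layers,) = struct.unpack_from("<I", blob, offset)
    offset += 4
    layer_sizes = list(struct.unpack_from(f"<{n_layers}I", blob, offset))
    offset += 4 * n_layers
    (n_weights,) = struct.unpack_from("<I", blob, offset)
    offset += 4
    weights = list(struct.unpack_from(f"<{n_weights}f", blob, offset))
    return layer_sizes, weights


# Train off-target, pack the result, transfer the bytes, then unpack on the target.
blob = pack_model([4, 8, 3], [0.5, -1.25, 2.0])
sizes, weights = unpack_model(blob)
print(len(blob), sizes, weights)
```

Storing the weights as float32 matches the single-precision format of the target and halves the transfer size compared to double precision, at the cost of rounding any weight that is not exactly representable.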

7.2.2 Local and Common memory in tandem

The implemented machine learning library was initially designed to work exclusively on the LM or the CM. When the initial tests were run, the results showed that the RF algorithm consumed too much memory to run exclusively on the LM. In order to make the RF algorithm feasible on the LM, a software architecture that utilized both the CM and the LM was designed and implemented. All data were stored in the CM and loaded into the LM for processing. This architecture had very few drawbacks and gave good results, as seen in Figure 5.2.

This architecture can be extended to the other algorithms as well. The current regression library is "naive" in the sense that every single variable access is done towards the CM, with the exception of a few hyperparameters. Since access to the CM is much slower than access to the LM, this is a very inefficient design. The data set can instead be placed in the CM or EM, and buffers that the algorithm reads from can be allocated in the LM. This method takes advantage of both the speed of the LM and the size of the larger memories.
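The buffering scheme described above can be sketched in a few lines. In this simulation a Python list stands in for the data set in the CM and a small fixed-size list stands in for a buffer in the LM; the function name and the sum-of-squares workload are hypothetical and only serve to show the access pattern.

```python
def blockwise_sum_of_squares(common_memory, block_size):
    """Process a large array in fixed-size blocks staged through a small buffer."""
    local_buffer = [0.0] * block_size  # stands in for a buffer allocated in the LM
    total = 0.0
    block_loads = 0
    for start in range(0, len(common_memory), block_size):
        block = common_memory[start:start + block_size]
        local_buffer[:len(block)] = block  # one block transfer from the CM to the LM
        block_loads += 1
        for i in range(len(block)):        # every per-element read hits the LM buffer
            total += local_buffer[i] * local_buffer[i]
    return total, block_loads


data = [1.0, 2.0, 3.0, 4.0, 5.0]
total, loads = blockwise_sum_of_squares(data, block_size=2)
print(total, loads)
```

With this pattern the number of slow-memory transactions is the number of blocks rather than the number of element accesses, which is the gain the naive design gives up.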

7.2.3 Parallelism

Generally speaking, any machine learning algorithm that is to operate on a resource constrained platform needs to be carefully designed. Using a parallel training algorithm may decrease the execution time by a large factor but might increase the memory requirement of the algorithm. The training and testing phases of the different algorithms also behave differently depending on the components of the platform, due to their individual complexity. When adapting any algorithm to a resource constrained hardware platform, many factors have to be considered. Factors such as parallelism, worst case complexity, deadlines and prediction accuracy are key in deciding the feasibility of adapting any algorithm to a resource constrained system.

The only algorithm that was parallelized in this thesis was the training phase of the PRF algorithm. As we can see in Figure 5.8, the parallel algorithm achieved a very large reduction in execution time from parallelization. The sequential algorithm requires the sum of the time spent constructing all trees, whilst the parallel version requires the time used by the most time-consuming tree plus a small synchronization cost. All decision trees in an ensemble obtain a random subset of the available training samples, which means that different trees will require different amounts of time to construct.
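The timing relationship stated above can be written down as a small model. It assumes that every tree is built on its own core, which is what the max-plus-synchronization description implies; the per-tree build times and the synchronization cost below are hypothetical numbers, not measurements from the thesis.

```python
def sequential_time(tree_times):
    """Sequential training: the trees are built one after another."""
    return sum(tree_times)


def parallel_time(tree_times, sync_cost):
    """Parallel training with one core per tree: the slowest tree dominates."""
    return max(tree_times) + sync_cost


# Hypothetical build times in seconds for a four-tree ensemble.
tree_times = [4.0, 5.5, 3.0, 6.0]
seq = sequential_time(tree_times)
par = parallel_time(tree_times, sync_cost=0.2)
print(f"sequential={seq:.1f}s parallel={par:.1f}s speedup={seq / par:.2f}x")
```

The model also shows why the speedup falls short of the number of cores: the cores that finish their trees early sit idle until the slowest tree is done.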

Parallelizing an NN can be done by performing batch methods [43]. Several training samples can be independently processed on different cores, and the obtained gradients $\frac{\delta E}{\delta w}$ can be summed and averaged over the samples. Another, more robust technique for parallelizing NNs is described by George et al. [15], who suggest a design where the core architecture of the network is designed to be parallelized over different cores. SVM parallelization can be done as suggested by Zhu et al. [44], who suggest a training algorithm that distributes the training samples over m machines and performs training using row-based Incomplete Cholesky Factorization (ICF). Since multi-class classification involves training several SVMs, parallelization can also be done by distributing the training of the individual SVMs to several cores.
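The batch approach for NNs can be sketched as follows. To keep the example short it uses a one-weight linear model with squared error instead of a full network, and Python threads stand in for the cores; both choices, along with all function names, are assumptions made for illustration. The structure is the part that carries over: each worker computes a partial gradient on its shard of the batch, and the partial gradients are summed and averaged.

```python
from concurrent.futures import ThreadPoolExecutor


def shard_gradient(w, b, shard):
    """Summed gradient of E = 0.5 * (w*x + b - y)^2 over one shard of samples."""
    gw = gb = 0.0
    for x, y in shard:
        err = (w * x + b) - y
        gw += err * x  # dE/dw for this sample
        gb += err      # dE/db for this sample
    return gw, gb, len(shard)


def parallel_gradient(w, b, samples, n_workers=4):
    """Split a non-empty batch over workers, then sum and average the partial gradients."""
    shards = [samples[i::n_workers] for i in range(n_workers)]
    shards = [s for s in shards if s]  # drop empty shards for small batches
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(lambda s: shard_gradient(w, b, s), shards))
    n = sum(p[2] for p in parts)
    return sum(p[0] for p in parts) / n, sum(p[1] for p in parts) / n


samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
gw, gb = parallel_gradient(0.0, 0.0, samples, n_workers=2)
print(gw, gb)
```

Since the averaged gradient is the same regardless of how the batch is split, the number of workers affects only the execution time and not the resulting weight update, apart from floating point summation order.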

Acknowledgements

This thesis was performed at Ericsson BNEP Systems Technology. Thanks to my industrial supervisor, Andreas Ermedahl, and other personnel at Ericsson BNEP Systems Technology including Thomas Magnusson, Loghman Andimeh, and Dag Lindbo. Thanks to KTH and my academic supervisor, Pawel Herman.

Bibliography

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, and Matthieu Devin. “Tensorflow: Large-scale machine learning on heterogeneous distributed systems”. In: arXiv preprint arXiv:1603.04467 (2016).

[2] Christopher M Bishop. “Pattern recognition”. In: Machine Learning 128 (2006), pp. 225–280.

[3] Antoine Bordes, Seyda Ertekin, Jason Weston, and Leon Bottou. “Fast kernel classifiers with online and active learning”. In: Journal of Machine Learning Research 6.Sep (2005), pp. 1579–1619.

[4] Gauran Chakravorty. Neural Network design. 2017. url: https://www.quora.com/What-is-the-difference-between-deep-and-shallow-neural-networks.

[5] Chenyi Chen, Ari Seff, Alain Kornhauser, and Jianxiong Xiao. “Deepdriving: Learning affordance for direct perception in autonomous driving”. In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 2722–2730.

[6] Tidestav Claes. Massive beamforming in 5G radio access. 2015. url: https://www.ericsson.com/research-blog/5g/massive-beamforming-in-5g-radio-access/.

[7] Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. “Online passive-aggressive algorithms”. In: Journal of Machine Learning Research 7.Mar (2006), pp. 551–585.

[8] Balazs Csanad Csaji. “Approximation with artificial neural networks”. In: Faculty of Sciences, Eötvös Loránd University, Hungary 24 (2001), p. 48.

[9] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, and Quoc V Le. “Large scale distributed deep networks”. In: Advances in neural information processing systems. 2012, pp. 1223–1231.

[10] Diarmuid Corcoran, Loghman Andimeh, Andreas Ermedahl, Per Kreuger, and Christian Schulte. Data Driven Selection of DRX for Energy Efficient 5G RAN. Under submission.

[11] Andreas Engel. “Complexity of learning in artificial neural networks”. In: Theoretical computer science 265.1 (2001), pp. 285–306.

[12] Rong-En Fan, Pai-Hsuen Chen, and Chih-Jen Lin. “Working set selection using second order information for training support vector machines”. In: Journal of machine learning research 6.Dec (2005), pp. 1889–1918.

[13] Ronald A. Fisher. UCI Machine Learning Repository: Iris Data Set. 2011. url: http://archive.ics.uci.edu/ml/datasets/Iris.

[14] Peter Flach. Machine learning: the art and science of algorithms that make sense of data. Cambridge University Press, 2012.

[15] D George, M Alan, and N Tia. “Parallelizing neural network training for cluster systems”. In: Book Parallelizing neural network training for cluster systems, Series Parallelizing neural network training for cluster systems (2008).

[16] Jun Guo, Norikazu Takahashi, and Tetsuo Nishi. “Convergence proof of a sequential minimal optimization algorithm for support vector regression”. In: Neural Networks, 2006. IJCNN’06. International Joint Conference on. IEEE. 2006, pp. 355–362.

[17] G Cybenko. “Approximation by superposition of sigmoidal functions”. In: Mathematics of Control, Signals and Systems 2.4 (1989), pp. 303–314.

[18] Cisco Visual Networking Index. Global mobile data traffic forecast update, 2016-2021. 2016.

[19] Aloak Kapoor and Russell Greiner. “Learning and classifying under hard budgets”. In: European Conference on Machine Learning. Springer. 2005, pp. 170–181.

[20] S Sathiya Keerthi and Chih-Jen Lin. “Asymptotic behaviors of support vector machines with Gaussian kernel”. In: Neural computation 15.7 (2003), pp. 1667–1689.

[21] S. Sathiya Keerthi, Shirish Krishnaj Shevade, Chiranjib Bhattacharyya, and Karuturi Radha Krishna Murthy. “Improvements to Platt’s SMO algorithm for SVM classifier design”. In: Neural computation 13.3 (2001), pp. 637–649.

[22] Minje Kim and Paris Smaragdis. “Bitwise neural networks”. In: arXiv preprint arXiv:1601.06071 (2016).

[23] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “Imagenet classification with deep convolutional neural networks”. In: Advances in neural information processing systems. 2012, pp. 1097–1105.

[24] M. Lichman. UCI Machine Learning Repository. 2013. url: http://archive.ics.uci.edu/ml.

[25] Stephen Marsland. Machine learning: an algorithmic perspective. CRC press, 2015.

[26] Feng Nan, Joseph Wang, and Venkatesh Saligrama. “Feature-budgeted random forest”. In: arXiv preprint arXiv:1502.05925 (2015).

[27] Feng Nan, Joseph Wang, and Venkatesh Saligrama. “Optimally Pruning Decision Tree Ensembles With Feature Cost”. In: arXiv preprint arXiv:1601.00955 (2016).

[28] OpenCV. Introduction to SVM. 2017. url: http://docs.opencv.org/2.4/doc/tutorials/ml/introduction_to_svm/introduction_to_svm.html.

[29] Ahmet Turan Ozdemir and Billur Barshan. “Detecting falls with wearable sensors using machine learning techniques”. In: Sensors 14.6 (2014), pp. 10691–10708.

[30] John Platt. “Sequential minimal optimization: A fast algorithm for training support vector machines”. In: (1998).

[31] Boris T Polyak. “Some methods of speeding up the convergence of iteration methods”. In: USSR Computational Mathematics and Mathematical Physics 4.5 (1964), pp. 1–17.

[32] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. “Learning representations by back-propagating errors”. In: Cognitive modeling 5.3 (1988), p. 1.

[33] Martin Sauter. From GSM to LTE-advanced: An Introduction to Mobile Networks and Mobile Broadband. John Wiley & Sons, 2014.

[34] Shirish Krishnaj Shevade, S Sathiya Keerthi, Chiranjib Bhattacharyya, and Karaturi Radha Krishna Murthy. “Improvements to the SMO algorithm for SVM regression”. In: IEEE transactions on neural networks 11.5 (2000), pp. 1188–1193.

[35] Ali H Shoeb and John V Guttag. “Application of machine learning to epileptic seizure detection”. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10). 2010, pp. 975–982.

[36] SKLearn. Scikit Learn overfitting and underfitting. 2017. url: http://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html.

[37] Vladimir Svetnik, Andy Liaw, Christopher Tong, J Christopher Culberson, Robert P Sheridan, and Bradley P Feuston. “Random forest: a classification and regression tool for compound classification and QSAR modeling”. In: Journal of chemical information and computer sciences 43.6 (2003), pp. 1947–1958.

[38] Xuan Tuan Trinh. Online learning of multi-class Support Vector Machines. 2012.

[39] Daniela Ventura, Diego Casado-Mansilla, Juan Lopez-de-Armentia, Pablo Garaizar, Diego Lopez-de-Ipina, and Vincenzo Catania. “ARIIMA: a real IoT implementation of a machine-learning architecture for reducing energy consumption”. In: International Conference on Ubiquitous Computing and Ambient Intelligence. Springer. 2014, pp. 444–451.

[40] Ian H Witten, Eibe Frank, Mark A Hall, and Christopher J Pal. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, 2016.

[41] Philip Wolfe. “A duality theorem for non-linear programming”. In: Quarterly of applied mathematics 19.3 (1961), pp. 239–244.

[42] Ying Yang, Geoff Webb, Kevin Korb, and Kai Ming Ting. “Classifying under computational resource constraints: anytime classification using probabilistic estimators”. In: Machine Learning 69.1 (2007), pp. 35–53.

[43] Y Bengio, IJ Goodfellow, and A Courville. Deep learning. 2016.

[44] Kaihua Zhu, Hao Wang, Hongjie Bai, Jian Li, Zhihuan Qiu, Hang Cui, and Edward Y Chang. “Parallelizing support vector machines on distributed computers”. In: Advances in Neural Information Processing Systems. 2008, pp. 257–264.
