
Secondary Carrier Prediction in Cellular Networks using Compressive Variational Methods

Public version

HILDING WOLLBO

Degree Project in Information and Communication Technology, Second Cycle, 30 Credits
Stockholm, Sweden 2020

KTH Royal Institute of Technology
School of Electrical Engineering and Computer Science


Secondary Carrier Prediction in Cellular Networks Using Compressive Variational Methods

HILDING WOLLBO

Master in Information and Network Technology
Date: September 17, 2020
Industrial supervisors: Rickard Cöster, Martin Isaksson
Academic supervisor: Mats Bengtsson
Examiner: Ragnar Thobaben
School of Electrical Engineering and Computer Science
Host company: Ericsson AB, Stockholm, Sweden
Swedish title: Sekundärfrekvensprediktion i cellulära nätverk genom komprimerande variationsmetoder


Abstract

A new method for predicting coverage on secondary frequencies in cellular networks is proposed and evaluated against previous studies in the field. Data is aggregated from several frequencies and cells, greatly reducing the number of machine learning models employed in the network. An implementation of a variational approximation of the Information Bottleneck method is studied for different data and compression levels. The model is evaluated using several metrics in relation to a baseline decision tree model. The new model is shown to have improved performance for simulated network data, while also being capable of achieving similar performance as models trained on data from a single source, using only a fraction of the total data. The results and conclusions are also validated for real network data.


Sammanfattning

En ny metod för att prediktera täckning på sekundärfrekvenser i cellulära nätverk framställs och utvärderas mot tidigare studier inom området. Data aggregeras från flertalet frekvenser och celler, vilket avsevärt minskar mängden använda maskininlärningsmodeller i nätverket. En implementation av en variationsmetodsapproximation av Information Bottleneck-metoden studeras för olika data och kompressionsnivåer. Modellen utvärderas med hjälp av flera olika mätetal och jämförs med en beslutsträdsriktmodell. Den nya modellen påvisas uppnå förbättrad prestanda för simulerad nätverksdata, samt genom att använda endast en bråkdel av den totala datan vara kapabel till att uppnå liknande prestanda som modeller som tränats på data från en enda källa. Resultaten och slutsatserna styrks även utifrån experiment på verklig nätverksdata.


Acknowledgements

I would like to express a great thank you to my supervisors at Ericsson, Rickard Cöster & Martin Isaksson, who have been very patient and with their great experience have helped me grasp the fundamental problems treated in this thesis. I would especially like to thank Martin Isaksson for his help with formulating the practical frameworks necessary for creating the thesis, as well as for his methodological and linguistic insights. I would also like to thank Henrik Rydén for his help with producing the simulated dataset used in this thesis. I would like to thank my academic supervisor Mats Bengtsson as well as my examiner Ragnar Thobaben for their feedback and comments on the thesis. I would also like to thank my opponent Einar Bremer. I would like to express my gratitude to Ericsson as a company for the provision of tools and equipment necessary to perform this thesis. In this strange time of the Covid-19 crisis, we were able to work as usual, if not even more efficiently, thanks to the structural clarity and infrastructure provided by the company. I would also like to thank my friend and colleague Torsten Molitor for the discussions and insights we have been able to exchange during the work with our respective theses, which have been invaluable for understanding and solving the problems I encountered. Finally, I would like to thank my coworkers at Ericsson for their feedback and suggestions during the course of this project.


Contents

List of Figures
List of Tables
Acronyms
Concepts
Symbols

1 Introduction
    1.1 Background
    1.2 Impact
    1.3 Scope

2 Heterogeneous Cellular Networks

3 Secondary Carrier Prediction
    3.1 Features
    3.2 Machine Learning Model
    3.3 Local vs. Central models

4 The Information Bottleneck
    4.1 Initial formulation
        4.1.1 Rate distortion theory
        4.1.2 Expansion to Information Bottleneck
    4.2 Information Bottleneck for ANN
        4.2.1 Variational Approximation
        4.2.2 The Reparametrization Trick
        4.2.3 Estimating Mutual Information

5 Method
    5.1 Dataset
        5.1.1 Simulation
        5.1.2 Structure
        5.1.3 Aggregation
        5.1.4 Augmentation
    5.2 Predictors
        5.2.1 Decision Trees
        5.2.2 Neural Networks
    5.3 Methodology
        5.3.1 Evaluating Performance
        5.3.2 Metrics
        5.3.3 Non-parametric metrics
        5.3.4 Threshold selection policy
    5.4 Implementation
        5.4.1 Hyperparameters
        5.4.2 Evaluation Algorithm

6 Results
    6.1 Dataset
    6.2 Compression and Mutual Information
    6.3 Prediction
        6.3.1 Metrics
        6.3.2 Models
        6.3.3 Evaluation
        6.3.4 Results
        6.3.5 Constrained Training Data

7 Conclusion
    7.1 Summary
    7.2 Future work

Bibliography

A Prediction Results

B Evaluation Algorithm


List of Figures

4.1 The structure of the implemented variational Gaussian IB-model.
5.1 A typical measurement scenario. The base station is operating on the red frequency, collecting measurement series from three different cells and four secondary frequencies.
5.2 Example ROC and PR curves. The green square is the same value mapped to different positions on the respective curves.
5.3 Threshold selection using λ-informedness. The point on the curve with λ-optimal informedness is associated with a specific threshold δλ.
6.1 Overlap distributions across secondary frequencies, approximated using a Gaussian kernel.
6.2 Bottleneck map for different compression levels β.
6.3 Compression/performance tradeoff.
6.4 Performance CDFs, secondary frequency f0.
6.5 Performance CDFs, secondary frequency f2.
6.6 Performance CDFs, secondary frequency f4.
A.1 Performance CDFs, secondary frequency f1.
A.2 Performance CDFs, secondary frequency f3.


List of Tables

5.1 Simulation parameters for the dataset.
5.2 Example raw feature vectors aggregated from different cells and secondary frequencies.
5.3 RSRP measurement report mapping.
5.4 Confusion Matrix.
5.5 LGBM reference model hyperparameters.
5.6 IB model hyperparameters.
6.1 Average overlap, absolute frequency and number of datapoints for the different secondary frequencies.
6.2 Distribution of detectable cells per measurement for the different primary frequencies.
6.3 Secondary frequency f0.
6.4 Secondary frequency f2.
6.5 Secondary frequency f4.
6.6 ROC AUC across frequencies for different amounts of data.
6.7 PR AUC across frequencies for different amounts of data.
A.1 Secondary frequency f1.
A.2 Secondary frequency f3.


Acronyms

3GPP 3rd Generation Partnership Project

ANN Artificial Neural Network

AP Average Precision Score

AUC Area Under Curve

CDF Cumulative Distribution Function

dBm Decibel-milliwatts

FNR False Negative Rate

FPR False Positive Rate

GBDT Gradient Boosted Decision Trees

GPU Graphics Processing Unit

IB Information Bottleneck

LGBM LightGBM

LTE Long Term Evolution

ML Machine Learning

PMI Precoder Matrix Indicator

PPV Positive Predictive Value

PR Precision-Recall


QoS Quality of Service

ROC Receiver Operating Characteristics

RSRP Reference Signal Received Power

RSRQ Reference Signal Received Quality

RSSI Received Signal Strength Indicator

SCP Secondary Carrier Prediction

TA Timing Advance

TNR True Negative Rate

TPR True Positive Rate

UE User Equipment


Concepts

coverage Whether or not a UE has sufficient signal strength, used to denote (the proportion of) positive samples in a radio measurement dataset. Also referred to as overlap.

Data Processing Inequality Information-theoretic concept. Fundamentally, no post-processing of data can increase its information content, only decrease it.

KL-divergence Kullback-Leibler divergence, an information-theoretic measure of the difference between two probability distributions, defined as

$$D_{\mathrm{KL}}\big[p(x) \,\|\, q(x)\big] = -\int \mathrm{d}x\; p(x) \log \frac{q(x)}{p(x)}.$$

Also known as the relative entropy, the KL-divergence has applications in fields as diverse as applied statistics, fluid mechanics and machine learning.
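For intuition, the discrete analogue of this definition can be computed directly. The sketch below (the helper name `kl_divergence` is ours, not from the thesis) uses NumPy and illustrates that the divergence is zero only for identical distributions and is not symmetric in its arguments:

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL-divergence D_KL[p || q] in nats.

    Note the sign convention above: -sum p log(q/p) equals sum p log(p/q).
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
print(kl_divergence(p, p))  # 0.0 -- identical distributions
print(kl_divergence(p, q))  # > 0, and != kl_divergence(q, p): not symmetric
```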

Markov chain Stochastic model, describing sequences of states or events where the probability of each event depends only on the previous state.


Symbols

B Information Bottleneck dimensionality

J Informedness

R(D) Rate distortion function

β Multiplicative constant determining the importance of the compressing term in the IB-objective and Blahut-Arimoto algorithm

δ Threshold for binary predictor

ε Reparametrization noise

$D_{\mathrm{KL}}[\,\cdot \,\|\, \cdot\,]$ KL-divergence between two probability distributions

X Total set of input data in a finite set

Y Total set of labels/classes

$\hat{\mathcal{X}}$ Total set of codewords in a discrete mapping

θ Neural network model parameters

$d(x, \hat{x})$ Distortion function, measures the dissimilarity between true data $x$ and its compressed representation $\hat{x}$

fs Secondary frequency

nc Number of detectable neighbouring cells

x Input data

y True data label/class

z Information Bottleneck representation layer variable


Chapter 1

Introduction

Radio networks are a vital infrastructure of our modern world and will continue to grow in importance as the number of connected devices increases. The radio networks of today are cellular, in which a set of base transceiver stations together form an overlapping grid of cells providing radio coverage for portable radio units, commonly known as user equipment (UE), within the cell [1].

1.1 Background

Measurements between different frequencies are necessary to facilitate several important network functions. Such measurements are, however, resource intensive with regard to the energy spent by the UE, interference with other units, and the time spent waiting for handover to another frequency. This has prompted the development of approximative machine learning (ML) methods to avoid excessive measuring and reporting between UE and base stations. The process of training ML models with data collected on the serving frequency to predict coverage on another frequency is called Secondary Carrier Prediction (SCP) and has been proven to be a useful technique for utilizing network data to increase rate of service and network efficiency [2], [3].

Previous studies have generally been confined to the case where a single ML model is trained on data from a single cell operating on a specific primary frequency, with measurements collected from UE on a single secondary frequency. However, the nodes in cellular networks generally carry a multitude of cells and primary and secondary frequencies, and we therefore postulate that we could train fewer models using shared data from several cells and secondary frequencies on the node, leveraging an increased amount of data together with a reduction in the number of models. This is especially relevant since the


amount of available data and the number of frequencies utilized in the network are expected to increase in the near future [4], as would the number of models. In this context, we would posit our target model to be trained elsewhere in the network using data from several sources. A centrally trained model can then be distributed to all base stations.

The increased amount of data associated with aggregation from different sources introduces more complexity to the problem, however, since the underlying data distributions can be expected to vary across secondary frequencies and cells. The data is also generally very sparse, meaning that only a few dimensions of the data carry relevant information for predicting the coverage on a secondary frequency.

1.2 Impact

Cellular radio networks are one of the most important infrastructures of today, with almost every facet of our society in some way dependent on the possibility of transmitting information from one party to another. Global mobile data traffic continues to grow at a very high rate, and mobile traffic is expected to increase by a factor of 5 in the coming years as the number of devices grows and more 5G networks are deployed into production [4]. This, in combination with the expected increase in IoT devices, implies that we need to maximize network efficiency and capacity in order to facilitate next-generation applications, value propositions and ways of living. This is also especially important given the current state of the world and the accelerating effects on digital technology posed by the consequences of the Covid-19 crisis, during which radical transformations of several parts of society will be necessary in order to maintain our standard of living while continuing the process of making our societies more sustainable [5].

One should also view telecommunications as an enabling technology for a large set of environmental solutions, and not focus on minimizing power consumption alone. According to [6], ICT solutions could directly reduce carbon emissions in other industries by up to 15% by 2030, while also indirectly supporting a further reduction of 35% by influencing consumer patterns, business decisions and the transformation of legacy systems. In this context, it is imperative that we explore and develop new ways to maximize network performance and capacity if we are to reach these goals, and as engineers we have an important role to play in shaping this future through the development of network infrastructure.


1.3 Scope

This thesis is focused on the study and development of new methods able to model a large amount of heterogeneous data while managing the complexity associated with data aggregation, using deep variational methods together with neural networks. By introducing an Information Bottleneck objective function, we can study how data from different sources are mapped relative to one another, as well as how much information is necessary to accurately perform SCP. The importance of the amount of data used for training, and its effect on classification performance relative to the characteristics of the dataset, is also studied. A large part of the work is concerned with data preparation and augmentation, encoding information about the operating cell and frequency to improve performance. The related topic of threshold selection policies for binary classifiers, which are necessary for determining what constitutes a good classifier with regard to network performance and associated costs, is also briefly studied.


Chapter 2

Heterogeneous Cellular Networks

The increase in network capacity has over the previous decade partly been facilitated by implementing so-called heterogeneous networks, where a network of traditional macro cells for wide-area coverage has been overlaid with a layer of base stations with lower transmit power, often referred to as small cells (or micro cells). These small cells serve as high-capacity hotspots and relays for UE while having a smaller coverage footprint, which means that they do not suffer from high propagation loss in the same way as macro cells do [7]. Furthermore, these smaller cells can operate in multiple high-frequency layers such as 3.5 GHz, 5 GHz and 10 GHz, and many new spectra are expected to be added as the deployment of 5G progresses. Spectral interference can be avoided by having the macro and small cells operate in separate frequency bands, with the small cells often using much higher frequencies than the macro cells [8].

These small cells may be used for off-loading traffic from the macro cell while simultaneously being useful for local areas requiring very high data rates, such as urban centers. However, when the macro and small cells are operating in different frequency bands, the UEs connected to the macro cells need to periodically be able to detect the presence of available and suitable small cells with coverage for off-loading from the macro cell to be possible. This process of inter-frequency measurement is, however, costly for the UE with regard to time spent waiting for handover and energy expended, and is considered one of the biggest challenges in exploiting the network resources as efficiently as possible [9]. The trade-off between exploiting high data-rate small cells and performing as few inter-frequency measurements as possible is complicated by the fact that too few measurements may cause failure in off-loading traffic from the macro cell to the small cells, which results in decreased spectral efficiency and overloading of the macro


cells, as well as a lower Quality of Service (QoS). This problem becomes even more complex as the number of available frequencies increases: in order to perform optimal off-loading, the UE would have to perform inter-frequency measurements and report the quality on all available secondary frequencies to the network. The assignment of UEs will also need to be re-evaluated over time. As conditions change, handover between different frequency bands is necessary to maintain a high QoS. A solution to minimize the number of inter-frequency measurements is to predict the signal quality on a different frequency using only measurement data from the frequency layer currently serving the UE. In this way, after an initial period of measuring and training, a model can help reduce the number of inter-frequency measurements needed for assignment and handover of UE, which leads to both lower energy expenditure and reduced waiting time. This has previously been studied in [2] and [3].


Chapter 3

Secondary Carrier Prediction

In order to perform cell selection, reselection and handover, UE in cellular networks need to perform several measurements and report them to the network. The amount of reports sent between UE and base stations can be reduced by a significant amount if the inter-frequency signal quality can be predicted instead of being measured directly. This can be achieved by using measurements available on the current frequency band as features for training a machine learning model.

3.1 Features

There are a number of possible network and measurement features that can be used for predicting signal strength. Two prominent signal quality measurement metrics are the 3GPP LTE Reference Signal Received Power (RSRP) and the Reference Signal Received Quality (RSRQ). These metrics are measured and reported by the UE depending on handover event (discovery, resynchronization etc.), frequency band and configuration [10]. These measurements have previously been studied as features for Secondary Carrier Prediction in [2], [3].

RSRP is defined as the linear average of the reference signal power (measured in watts) across resource elements within a considered measurement frequency bandwidth [10]. Depending on the measurement event, the UE may report RSRP measurements of the serving cell as well as of detectable neighbouring cells on the same frequency, but also RSRP measurements to other frequency bands. The RSRQ depends on the RSRP, the number of resource blocks $R_b$ and the total received radio signal energy from all signals, called the Received Signal Strength Indicator (RSSI), as

$$\mathrm{RSRQ} = \frac{R_b \cdot \mathrm{RSRP}}{\mathrm{RSSI}}.$$

According to [3] this gives a better measure of the actual network performance compared to the RSRP; the RSRQ is, however, more difficult to estimate than the RSRP.
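To make the formula concrete, here is a small sketch of the computation (the helper names and all numeric values are our own, purely illustrative). Since RSRP and RSSI are typically reported in dBm, they are converted to the linear scale before the ratio is taken:

```python
import math

def dbm_to_watt(dbm):
    """Convert a power in dBm to watts."""
    return 10 ** (dbm / 10) / 1000

def rsrq_linear(rsrp_w, rssi_w, n_rb):
    """RSRQ = R_b * RSRP / RSSI, with all powers in linear scale."""
    return n_rb * rsrp_w / rssi_w

rsrp = dbm_to_watt(-95)   # illustrative per-resource-element reference power
rssi = dbm_to_watt(-70)   # illustrative total power over the measured band
rsrq_db = 10 * math.log10(rsrq_linear(rsrp, rssi, n_rb=50))
print(round(rsrq_db, 1))  # -8.0
```

Note that in the linear-scale formula the unit conversions cancel, so the dB result is simply 10·log10(R_b) + RSRP[dBm] − RSSI[dBm].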

Two other network-based measurements that have also been studied as features for predicting signal quality are the Precoder Matrix Indicator (PMI) and the Timing Advance (TA). The PMI is a feature relating to the direction of energy transmission from the macro cell to the UE, which indicates the angular position of the UE relative to the base station. This metric may be less useful in environments with a lot of scattering and reflection, such as urban areas. The TA is used to synchronize UE to their serving base station, enforcing that the uplink transmission from the UE arrives at the base station within a set subframe. The TA can thus be seen as a distance measure of the propagation length between UE and base station, but it is also affected by scattering in the propagation path. The TA is quantized into LTE basic time units $T_s = 1/(15000 \cdot 2048)$ s [2]. Using all these features as input to a machine learning model, one could create a regression model and define a threshold on the estimated RSRP which indicates whether the UE will have coverage on the considered secondary frequency or not, or simply output a probability for binary prediction.

In the case where the ground truth inter-RSRP measurements are performed on only one secondary frequency $f_s$ in a set of frequencies $F$, one could augment the feature vector with the absolute frequency used, or with a one-hot encoding indicating which secondary frequency the measurements are measured against. This becomes relevant when aggregating RSRP measurements performed on the same primary carrier frequency but where the secondary frequency differs.
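This augmentation can be sketched as follows (the helper `augment_with_frequency` and the example feature values are hypothetical, not code from the thesis):

```python
import numpy as np

def augment_with_frequency(x, freq_idx, n_freqs):
    """Append a one-hot encoding of the secondary-frequency index to a raw
    feature vector, so that measurements against different secondary
    frequencies can be aggregated into a single model."""
    one_hot = np.zeros(n_freqs)
    one_hot[freq_idx] = 1.0
    return np.concatenate([x, one_hot])

x = np.array([-97.0, -101.5, -110.0])  # e.g. intra-frequency RSRP features
x_aug = augment_with_frequency(x, freq_idx=2, n_freqs=5)
print(x_aug.shape)  # (8,) -- 3 raw features + 5 one-hot dimensions
```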

3.2 Machine Learning Model

There are several ways to define the secondary carrier prediction problem. The ground truth values $y$ are collected by performing inter-frequency RSRP measurements between UE and cells on the secondary frequencies. One way to create labels is to simply define a threshold on the RSRP which labels the features as corresponding to the UE having coverage or not, a binary classification task. This could also be interpreted as a regression problem with regard to the RSRP, which can easily be turned into a classifier. One could also create a model for predicting which cell index on another frequency band is strongest, as in [2]. Furthermore, since there are several secondary frequencies, one could measure the ground truth RSRP values to cells on a set of secondary frequencies. Such a dataset could be used in a multi-class binary classification problem, or even a ranking problem of which frequencies are most likely to provide coverage for the UE.

The input features $x$ can be a combination of:

• Intra-RSRP measurements of detectable neighbouring cells

• Intra-RSRQ measurements calculated from RSRP, RSSI and Rb

• Precoder Matrix Indicator values (angle)

• Timing Advance values (distance)

Depending on machine learning problem, the corresponding targets y may be

• Measured inter-RSRP on either one or more secondary frequencies

• RSRQ on one or more secondary frequencies calculated from the measured inter-RSRP

• Thresholded inter-RSRP binary values from one or more secondary frequencies

• Cell index of the strongest cell on a given secondary frequency

These features and targets then constitute the dataset

$$\mathcal{D}_N = \{(x_i, y_i),\; i = 1 \ldots N\}$$

which is used to train the chosen machine learning algorithm. In the case of binary classification, by applying a sigmoid function in the last step, the output can be viewed as a conditional probability $p(C|x)$. By defining a threshold $\delta$, this probability can then be mapped to either class as

$$y = \begin{cases} 1, & \text{if } p(C|x) \ge \delta \\ 0, & \text{otherwise.} \end{cases} \tag{3.1}$$


Searching for the optimal threshold $\delta$ is in itself a non-trivial optimization problem. This binary classifier can be augmented further by introducing a third class which flags if the prediction is too uncertain, indicating that the UE should perform an inter-frequency measurement to infer the optimal secondary frequency for handover, as in [3]. The classifier then becomes:

$$y = \begin{cases} 0, & \text{if } p(C|x) < \delta_l \\ 1, & \text{if } p(C|x) > \delta_u \\ 2, & \text{otherwise} \end{cases}$$

where $y = 2$ indicates that the prediction is too uncertain. By varying the upper and lower thresholds $\delta_u, \delta_l$, one can then adjust for the costs associated with erroneously classified positives and negatives.
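The decision rule above can be sketched directly (a toy illustration; the threshold values are arbitrary, not taken from the thesis):

```python
def classify(p, delta_l, delta_u):
    """Three-class decision rule on a predicted coverage probability p(C|x):
    0 = predict no coverage, 1 = predict coverage,
    2 = too uncertain, fall back to an inter-frequency measurement.
    With delta_l == delta_u the uncertain region vanishes (except exactly
    at the threshold) and the rule reduces to the binary classifier."""
    if p < delta_l:
        return 0
    if p > delta_u:
        return 1
    return 2

for p in (0.1, 0.5, 0.9):
    print(p, classify(p, delta_l=0.3, delta_u=0.7))
# 0.1 -> 0 (no coverage), 0.5 -> 2 (measure), 0.9 -> 1 (coverage)
```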

3.3 Local vs. Central models

Previous studies of SCP have generally been confined to local models, where each node trains a unique model for each cell and secondary frequency pair. Data is gathered from a single cell by performing measurements through connected UE. The model is then trained for SCP from the given primary frequency to the specified secondary frequency. However, since the UE are connected on the same primary frequency for a given base station, the data from all secondary frequencies and cells will in general be i.i.d., despite the ground truth labels being calculated for different secondary frequencies. In theory this facilitates data aggregation into a single model, but it is necessary to be able to distinguish which secondary frequency the data is gathered from. This can, for example, be achieved by appending an encoding to the feature vector which indicates the secondary frequency used to gather the data.

On a system level, the computational power in a base station can be a limiting factor, and there is a maintenance cost associated with keeping a multitude of different models on the same node. The number of secondary frequencies used for offloading is also expected to increase in the near term, which motivates data aggregation into a larger model which may be trained in another location. It would be desirable to obtain a model able to perform secondary carrier predictions across the whole set of secondary frequencies for a given base station. Such a model should be able to leverage the larger amount of data associated with aggregating data across cells and secondary frequencies, while still being specific enough to reach as good performance


as a smaller baseline model trained on data from a single cell and secondary frequency. By increasing the total number of available datapoints through aggregation into a larger model, it might also be possible to reduce the amount of data that needs to be collected per cell and secondary frequency.

Aggregation of data from different sources, while seemingly simple, increases the complexity of the prediction task. To perform SCP, the machine learning model needs to capture the elements of the data that are relevant, while also filtering out the irrelevant parts. Even though the characteristics of data from different secondary frequencies and cells differ, they contain some information that is shared and relevant for the task of predicting coverage. One way to directly implement this tradeoff is the so-called Information Bottleneck (IB) method, which optimizes the representation of data for predicting a variable of interest, while limiting the complexity of the representation [11].


Chapter 4

The Information Bottleneck

The Information Bottleneck (IB) is an information-theoretic framework for quantifying the tradeoff between the compression of input data $X \in \mathcal{X}$ into a latent representation and the preservation of the relevant information about some side knowledge $Y \in \mathcal{Y}$, first introduced by Naftali Tishby et al. in 2000 [11].

4.1 Initial formulation

The initial formulation of the theory was from a primarily signal processing and source coding point of view, and the compressed representation between the input $X$ and output $Y$ was viewed as a codebook representation $\hat{X}$. This constrained optimization problem was argued to be a generalization of rate distortion theory for lossy source compression, where the distortion measure $d(x, \hat{x})$ is derived from the joint statistics of $X$ and $Y$.

4.1.1 Rate distortion theory

Lossy source compression is traditionally studied through rate distortion theory, which is concerned with the tradeoff between the rate, or signal representation complexity, and the average distortion of the reconstructed signal. This tradeoff is characterised by a rate distortion function R(D), which gives the theoretical minimal achievable rate R for a specified expected distortion D [12]. The fundamental problem of lossy source coding is, for each possible value X ∈ 𝒳, to find an approximate representation X̂ ∈ 𝒳̂ characterised by the probability p(x̂|x). Assuming for simplicity that both 𝒳 and 𝒳̂ are finite sets, this mapping p(x̂|x) partitions blocks from the larger set 𝒳 into corresponding codebook elements x̂ with probability

    p(\hat{x}) = \sum_{x \in \mathcal{X}} p(x)\, p(\hat{x} \mid x).

One factor describing the efficiency of such a partitioning is the rate, equivalent to the average number of bits needed to uniquely specify an element in the codebook. The rate is lower bounded by the mutual information I(X; X̂) by the asymptotic equipartition property [12]. This rate is in itself, however, not enough to determine what constitutes a good quantization or compression, since we could achieve an arbitrarily low rate by removing more information about the original signal X.

It is, however, possible to constrain the rate and identify the relevant features by introducing a distortion function which, for good representations X̂, is expected to be small [13]. We can quantify the expected distortion introduced by the mapping of the input to a quantized codeword as

    \langle d(x,\hat{x}) \rangle_{p(x,\hat{x})} = \sum_{x \in \mathcal{X}} \sum_{\hat{x} \in \hat{\mathcal{X}}} p(x,\hat{x})\, d(x,\hat{x})    (4.1)

The relation between the quantization (or compression) and expected distortion is described by the rate distortion theorem, which identifies the minimal achievable rate R for a given constraint on the expected distortion as

    R(D) \triangleq \min_{p(\hat{x}|x) \,:\, \langle d(x,\hat{x}) \rangle \le D} I(X;\hat{X}).    (4.2)

The solution is then found by solving the variational problem in p(x̂|x), formulated by introducing a Lagrange multiplier for the distortion constraint and minimizing

    F[p(\hat{x}|x)] = I(X;\hat{X}) + \beta \langle d(x,\hat{x}) \rangle_{p(x,\hat{x})}.    (4.3)

This solution is given by the exponential form

    p(\hat{x}|x) = \frac{p(\hat{x})}{Z(x,\beta)}\, e^{-\beta d(x,\hat{x})}    (4.4)

where Z(x, β) is a normalization factor. This gives rise to the iterative algorithm known as the Blahut-Arimoto algorithm [14], which provides self-consistent alternating update rules for the distributions p(x̂|x) and p(x̂), defined as

    p_{n+1}(\hat{x}) = \sum_{x} p(x)\, p_n(\hat{x}|x)    (4.5)

    p_n(\hat{x}|x) = \frac{p_n(\hat{x})}{Z_n(x,\beta)}\, e^{-\beta d(x,\hat{x})}    (4.6)

This algorithm thus provides a way to achieve an optimal assignment of X into representations X̂ given the set of representations, but does not solve the problem of finding the optimal set of representations.
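As a concrete illustration, the alternating updates (4.5)-(4.6) can be iterated directly for a finite alphabet. The following minimal Python sketch is not from the thesis; the function name and the binary Hamming-distortion example are illustrative only:

```python
import math

def blahut_arimoto(p_x, d, beta, n_iter=200):
    """Alternate the updates (4.5)-(4.6) for a finite alphabet.

    p_x:  source distribution p(x) as a list.
    d:    distortion matrix, d[x][xhat].
    beta: tradeoff parameter; a larger beta penalizes distortion more.
    Returns the assignment p(xhat|x) and the codebook marginal p(xhat)."""
    n_x, n_xhat = len(p_x), len(d[0])
    p_xhat = [1.0 / n_xhat] * n_xhat            # initial codebook marginal
    cond = []
    for _ in range(n_iter):
        # (4.6): p(xhat|x) is proportional to p(xhat) * exp(-beta * d(x, xhat))
        cond = []
        for x in range(n_x):
            row = [p_xhat[xh] * math.exp(-beta * d[x][xh]) for xh in range(n_xhat)]
            z = sum(row)                        # normalization factor Z(x, beta)
            cond.append([v / z for v in row])
        # (4.5): p(xhat) = sum_x p(x) * p(xhat|x)
        p_xhat = [sum(p_x[x] * cond[x][xh] for x in range(n_x))
                  for xh in range(n_xhat)]
    return cond, p_xhat

# Binary source with Hamming distortion: for a large beta the assignment
# concentrates on the zero-distortion codeword.
cond, marg = blahut_arimoto([0.5, 0.5], [[0.0, 1.0], [1.0, 0.0]], beta=10.0)
```

As the comment notes, with β = 10 the conditional rows concentrate almost all mass on the matching codebook element, while the marginal stays uniform by symmetry.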

4.1.2 Expansion to Information Bottleneck

Using rate distortion theory to find the relevant features is not straightforward, however, since selecting the distortion function implicitly determines the relevant features of the signal. Furthermore, just identifying the relevant features does not implicitly yield a distortion function, which seems to limit the usefulness of the rate distortion perspective. One possible way to avoid this conundrum is to introduce an additional, related variable that might give some information about which parts of the signal are relevant. That is, instead of just trying to find the relevant features of a signal, one can condition on the related variable and constrain the problem to finding the parts of the signal relevant for predicting this additional variable. The choice of additional variable then determines which features are important for the specific problem [15].

The related variable Y needs to carry at least some information about the input X, which means that the mutual information I(X;Y) must be positive. Given the joint distribution p(x, y) of the input X and output Y, the relevant information is defined as the mutual information

    I(X;Y) = \int p(x,y) \log \frac{p(x,y)}{p(x)p(y)} \, dx \, dy = \mathbb{E}_{p(x,y)}\left[ \log \frac{p(x,y)}{p(x)p(y)} \right]

where p(x) and p(y) are the respective marginal distributions [12]. Here, we assume that Y carries some information about X, i.e. that the two are statistically dependent. As such, Y in itself can be viewed as implicitly determining which features of X are relevant and irrelevant. This problem can be likened to the problem of finding the minimal sufficient statistics in a supervised prediction task, that is, to capture and represent the minimal information relevant about the output label found in the input pattern. Using information theory to interpret the minimal sufficient statistics, the task is then to find an optimal compressed representation Z of X where all the components


that are irrelevant for predicting Y have been filtered out. This Z is then the simplest achievable mapping of X able to describe the mutual information I(X;Y). This sequence of variables implies the Markov chain relationship Y → X → Z, where we minimize the mutual information I(X;Z) under a constraint on the preserved information I(Z;Y). We thus view the true labels Y as giving rise to the observations X, and the prediction chain becomes X → Z → Ŷ. Following the data processing inequality we have the relationship I(X;Y) ≥ I(Z;Y) ≥ I(Ŷ;Y), with equality if and only if Z constitutes a minimal sufficient statistic for predicting Y (assuming that such a minimal sufficient statistic exists in the first place). We then formulate the search for this optimal representation Z* ∈ 𝒵 as the Lagrangian

    \mathcal{L}[p(z|x)] = I(X;Z) - \beta I(Z;Y)    (4.7)

subject to the Markov chain Y → X → Z. Here, β operates as a strictly non-negative factor parametrizing the tradeoff between the information about the input X retained in the representation Z and the preserved information relevant for the prediction of Y. For finite sets, this variational problem has a closed-form solution for a given β in a set of self-consistent equations [11], akin to the Blahut-Arimoto algorithm:

    p(z|x) = \frac{p(z)}{Z(x,\beta)} \exp\left( -\beta D_{\mathrm{KL}}\left[ p(y|x) \,\|\, p(y|z) \right] \right)    (4.8)

    p(y|z) = \sum_{x} p(y|x)\, p(x|z)    (4.9)

    p(z) = \sum_{x} p(x)\, p(z|x)    (4.10)

Here, Z(·;·) is a normalization factor, and the distance measure D_KL[·‖·] is the KL-divergence. In a way, the IB can be viewed, in relation to rate distortion theory, as a problem whose distortion function depends on the optimal map [13], formally

    d_{\mathrm{IB}}(x,z) = D_{\mathrm{KL}}\left[ p(y|x) \,\|\, p(y|z) \right]    (4.11)

Iterating over these equations we find the optimal representation Z, and the expected distortion becomes

    D_{\mathrm{IB}} = \mathbb{E}\left[ d_{\mathrm{IB}}(X,Z) \right] = I(X;Y|Z)    (4.12)


which is the relevant information for predicting Y that is not captured by Z. Since I(X;Y|Z) = I(X;Y) − I(Z;Y) under the Markov assumption, this expected distortion is equivalent to the relevant-information term of the IB Lagrangian, short of a sign and the additive constant I(X;Y).

4.2 Information Bottleneck for ANN

The information bottleneck framework has in recent years been adapted for the study of deep neural networks by Tishby and others [15]. The claim is that each layer h in the network can be viewed as part of a Markov chain satisfying the Data Processing Inequality (DPI), and the goal of the optimization algorithm is to optimize each of these layers to find the best intermediate representations, which can be formulated in the IB-framework as

    I(X;Y) \ge I(h_1;Y) \ge I(h_2;Y) \ge \dots \ge I(\hat{Y};Y)

This suggests that one can view each layer as an encoder towards the next layers, as well as a decoder from the previous layers. As one moves deeper into the network from the input, the encoder becomes less complex and the decoder more complex. The goal of each layer then implicitly becomes to maximize I(Y; h_i) while minimizing I(h_{i−1}; h_i). In the ideal case, one would then design networks able to extract the most important and efficient features, equivalent to the approximate minimal sufficient statistics, using as few layers and units as possible.

Studying the change in mutual information between layers is just a way to characterize the evolution of the network during SGD-training, and not an explicit training method in itself. There are, however, ways to include the IB-objective on an intermediate layer in the objective function of the network.

4.2.1 Variational Approximation

The IB method is an interesting framework for quantifying relevant information, but its analytical solution is only computationally tractable for finite sets and exponential families of distributions, limiting its usefulness. Computation of the mutual information for general distributions is computationally costly and requires quantization, and it is also assumed that the joint distribution between X and Y is known. There are, however, ways to incorporate the IB method into the training procedure despite these limits.

Recently, methods utilizing variational approximations of the marginal and decoder distributions have been proposed, which greatly reduce the computational complexity of calculating the mutual information terms [16]. These distributions can then readily be approximated by a neural network. Assuming that the joint probability of data, representation and label factors as

    p(X, Y, Z) = p(Z|X, Y)\, p(X|Y)\, p(Y)

by the Markov chain relationship Y → X → Z, we can formulate the IB-functional as

    \mathcal{L}_{\mathrm{IB}}\left( p(z|x) \right) = I(Z;Y) - \beta I(X;Z)    (4.13)
        = \mathbb{E}_{y,z}\left[ \log \frac{p(y,z)}{p(y)p(z)} \right] - \beta\, \mathbb{E}_{x,z}\left[ \log \frac{p(x,z)}{p(x)p(z)} \right]
        = \mathbb{E}_{y,z}\left[ \log \frac{p(y|z)}{p(y)} \right] - \beta\, \mathbb{E}_{x,z}\left[ \log \frac{p(z|x)}{p(z)} \right]    (4.14)

Utilizing our Markov chain assumption, the posterior probability p(y|z) can be expressed as

    p(y|z) = \int p(x,y|z) \, dx = \int p(y|x)\, p(x|z) \, dx    (4.15)
           = \int p(y|x)\, \frac{p(z|x)\, p(x)}{p(z)} \, dx    (4.16)

This calculation is intractable in our case, so we formulate a variational approximation q(y|z), forming our decoder, constituted by a neural network. For the first mutual information term in (4.13), using the Markov chain assumption, this approximation implies the lower bound

    I(Z;Y) \ge \int p(y,z) \log \frac{q(y|z)}{p(y)} \, dy \, dz    (4.17)
           = \int p(y,z) \log q(y|z) \, dy \, dz - \int p(y) \log p(y) \, dy    (4.18)
           = \int p(y,z) \log q(y|z) \, dy \, dz + H(Y)    (4.19)
           = \int p(x)\, p(y|x)\, p(z|x) \log q(y|z) \, dx \, dy \, dz + H(Y)    (4.20)


since

    D_{\mathrm{KL}}\left[ p(Y|Z) \,\|\, q(Y|Z) \right] \ge 0.

The label entropy H(Y) is constant during optimization and can therefore be removed from the functional.

For the second mutual information term in (4.13), we have that

    I(Z;X) = \int p(x,z) \log \frac{p(z|x)}{p(z)} \, dx \, dz    (4.22)
           = \int p(x,z) \log p(z|x) \, dx \, dz - \int p(z) \log p(z) \, dz    (4.23)

While possible, calculation of the marginal distribution p(z) through p(x,z) = p(z|x)p(x) is generally difficult. Introducing another variational approximation r(z) for p(z) yields the upper bound

    I(Z;X) \le \int p(x)\, p(z|x) \log \frac{p(z|x)}{r(z)} \, dz \, dx    (4.24)

since D_{\mathrm{KL}}[ p(z) \,\|\, r(z) ] \ge 0.

Together, these mutual information bounds form a lower bound L on the IB-functional,

    I(Z;Y) - \beta I(Z;X) \ge \int p(x)\, p(y|x)\, p(z|x) \log q(y|z) \, dx \, dy \, dz    (4.25)
        - \beta \int p(x)\, p(z|x) \log \frac{p(z|x)}{r(z)} \, dx \, dz = L    (4.26)

forming our objective function. However, implementing this lower bound in practice is not entirely straightforward, since we assumed during the derivation that the joint distribution p(x, y, z) was known.

4.2.2 The Reparametrization Trick

Computing this lower bound in practice can be made possible by introducing a few more approximations. Leveraging our Markov chain assumption on the factoring of the joint distribution as p(x, y, z) = p(x)p(z|x)p(y|x), we can estimate the joint data-label distribution empirically as

    p(x)\, p(y|x) = p(x,y) \approx \frac{1}{N} \sum_{n=1}^{N} \delta_{x_n}(x)\, \delta_{y_n}(y)    (4.27)

This yields the first approximation of the objective,

    L \approx \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}_{z \sim p(z|x_n)}\left[ \log q(y_n|z) - \beta \log \frac{p(z|x_n)}{r(z)} \right].    (4.28)

Now, the critical step is to find a way to calculate the expectation over the bottleneck variable z. The posterior distribution p(z|x_n) can be estimated by a neural network encoder f_θ parameterized by θ, which calculates the stochastic mapping p(z|x_n, θ) for a given distribution family.

For the simplest, spherical Gaussian case with a B-dimensional bottleneck variable z ∈ ℝ^B, this neural network outputs the B-dimensional mean and B-dimensional standard deviation, such that p(z|x_n, θ) = 𝒩(z | f_θ^μ(x_n), f_θ^Σ(x_n)). This means that the bottleneck variable z is, given the input, not a single value but characterized by a parametric distribution with the parameters given by the network. These parameters are then learned by maximizing the likelihood of the true labels y given the inputs x.

Practically, however, we cannot simply use error backpropagation through the random node z to train the network, since we may not take gradients through stochastic variables. The reparametrization trick, introduced in 2013 by Kingma and Welling [17], allows us to take gradients by moving the stochastic part of z into a separate node, through which we take no derivative. This allows us to view the mapping from the encoder to the stochastic variable as a deterministic function, with the randomness introduced by a new, independent and separately sampled variable ε. That is, we have

    z = f_\theta(x, \varepsilon) = f_\theta^\mu(x) + \varepsilon \cdot f_\theta^\Sigma(x)    (4.29)

where ε ∼ 𝒩(0, 1). By selecting the variational marginal and encoder distributions as Gaussians, we attain an analytical expression of the KL-divergence, which leads to the objective function


[Figure: the input x is passed through the encoder f_θ(x), which outputs the mean f_θ^μ and standard deviation f_θ^Σ; the standard deviation is multiplied by the noise ε and added to the mean to form the bottleneck variable z(β), which the decoder f_φ(z) maps to the prediction y.]

Figure 4.1: The structure of the implemented variational Gaussian IB-model.

    L \approx \frac{1}{N} \sum_{n=1}^{N} \left\{ \mathbb{E}_{\varepsilon \sim p(\varepsilon)}\left[ -\log q(y_n \mid f(x_n, \varepsilon)) \right] + \beta\, D_{\mathrm{KL}}\left[ p(z|x_n) \,\|\, r(z) \right] \right\}    (4.30)

For computational purposes, it is possible to construct an estimator of the functional from a minibatch of the full dataset. Given that the minibatch is sufficiently large, one can achieve an unbiased estimate of the true gradient with only a single sample of ε per datapoint [18]. A schematic of the implemented IB-model is shown in Figure 4.1.

It is also possible to utilize distributions from the exponential family other than the Gaussian, as long as the variable can be reparametrized and an analytical expression of an upper bound on the KL-divergence can be formulated.
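The reparametrization (4.29) and the per-datapoint objective (4.30) can be sketched in a few lines of Python. This is an illustrative sketch, not the thesis implementation: it assumes a diagonal Gaussian encoder and a standard normal variational marginal r(z), and the function names are invented here:

```python
import math
import random

def reparametrize(mu, sigma, rng):
    """(4.29): z = f_mu(x) + eps * f_sigma(x), with eps ~ N(0, I) sampled
    in a separate node so that gradients can flow through mu and sigma."""
    return [m + rng.gauss(0.0, 1.0) * s for m, s in zip(mu, sigma)]

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ): the beta-weighted
    regularizer in (4.30) for a Gaussian encoder and marginal r(z)."""
    return 0.5 * sum(s * s + m * m - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))

def vib_loss(mu, sigma, log_q, beta, rng, n_samples=1):
    """Monte Carlo estimate of one summand of (4.30) for a single datapoint:
    E_eps[-log q(y | f(x, eps))] + beta * KL(p(z|x) || r(z)).
    log_q is a callable mapping a sampled z to the decoder's log-likelihood
    of the true label."""
    nll = sum(-log_q(reparametrize(mu, sigma, rng)) for _ in range(n_samples))
    return nll / n_samples + beta * kl_to_standard_normal(mu, sigma)
```

In an automatic-differentiation framework the same structure applies unchanged; only `reparametrize` touches the random variable, and it does so through the externally sampled ε.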

4.2.3 Estimating Mutual Information

Despite the approximations performed in order to practically implement the mutual information terms of the IB-objective, it can be of interest to study how the mutual information between input and compressed representation varies for different levels of compression β. By adding noise in the representation layer z, we guarantee that the mutual information I(X;Z) is finite, and we can get an estimate of how much information about the input is retained in the bottleneck layer [15]. The addition of this noise has been shown not to meaningfully affect the performance and representations of the network for moderate variances [19]. We can calculate the mutual information between the input distribution and the representation layer as the expectation of the relative entropy with regard to the input:

    I(X;Z) = D_{\mathrm{KL}}\left[ p(x,z) \,\|\, p(x)p(z) \right]    (4.31)
           = -\iint p(x,z) \log \frac{p(x)p(z)}{p(x,z)} \, dx \, dz    (4.32)
           = -\iint p(z|x)\, p(x) \log \frac{p(z)\, p(x)}{p(z|x)\, p(x)} \, dx \, dz    (4.33)
           = -\int p(x) \int p(z|x) \log \frac{p(z)}{p(z|x)} \, dz \, dx    (4.34)
           = \mathbb{E}_{x \sim p(x)}\left[ D_{\mathrm{KL}}\left[ p(z|x) \,\|\, p(z) \right] \right]    (4.35)

With the Gaussian variational approximations introduced for the encoder and marginal distributions, we can calculate an empirical upper bound on the mutual information as

    I(X;Z) \le \frac{1}{N} \sum_{n=1}^{N} D_{\mathrm{KL}}\left[ p(z|x_n) \,\|\, r(z) \right]    (4.36)
           = \frac{1}{2N} \sum_{n=1}^{N} \sum_{i=1}^{B} \left( f_\theta^{\sigma_i}(x_n)^2 + f_\theta^{\mu_i}(x_n)^2 - 1 - \log f_\theta^{\sigma_i}(x_n)^2 \right)    (4.37)

where B is the dimensionality of the distribution, and f_θ^{μ_i}(x_n), f_θ^{σ_i}(x_n) are the i-th respective outputs of mean and standard deviation of the encoder.
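The bound (4.36)-(4.37) is cheap to evaluate from the encoder outputs alone. A minimal sketch, assuming a standard normal variational marginal r(z) (the function name is illustrative):

```python
import math

def mi_upper_bound(mus, sigmas):
    """Empirical bound (4.36)-(4.37) on I(X;Z): the average over N
    datapoints of KL( N(mu_n, diag(sigma_n^2)) || N(0, I) ).

    mus, sigmas: per-datapoint lists of the B-dimensional encoder outputs."""
    total = 0.0
    for mu, sigma in zip(mus, sigmas):
        for m, s in zip(mu, sigma):        # sum over the B dimensions
            total += s * s + m * m - 1.0 - math.log(s * s)
    return total / (2.0 * len(mus))
```

An encoder that always reproduces the prior (zero mean, unit standard deviation) gives a bound of zero, i.e. the bottleneck retains no information about the input.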


Chapter 5

Method

5.1 Dataset

The datasets used for SCP arise naturally from reports used for network services such as manual handover between UE and cells. The analysis was performed on both simulated datasets and real operator data. The simulations were carried out so as to closely mirror conditions in real-world handover reports, albeit very structured and consistent in size and number of files across frequencies and cells.

5.1.1 Simulation

The raw data was produced by simulating several moving UE in an urban environment. The simulated area consists of a 2 km x 2 km slice of a simulated part of a typical large European city, containing several buildings of varying heights causing scattering effects. The simulation is performed with identical UE movement in each dataset. The datasets were collected for different macro- and micro cells on a set of nodes and frequencies, where each frequency serves as primary frequency once and the rest as secondary frequencies used to collect inter- and intra-RSRP measurements. The simulation parameters used to produce the data are shown in Table 5.1. The raw features are collected as RSRP measurements in dBm, sent from UEs to the serving cell and neighbouring measurable cells. The serving cell is assumed to be the macro cell with the highest measured signal power. The labels are determined by thresholding the corresponding inter-frequency RSRP measurement to the secondary frequency: inter-RSRP measurements larger than the threshold of -100 dBm are assigned to be in coverage, with the label 1. The raw feature vector also contains fields


Table 5.1: Simulation parameters for the dataset.

    Macro cell Tx Power     40 W      Number of UE          60000
    Micro cell Tx Power     10 W      UE dynamic range      6-9 dB
    Number of Macro cells   57        Timeout probability   0.01
    Number of Micro cells   38        Additive noise        N(0, 9)

Table 5.2: Example raw feature vectors aggregated from different cells and secondary frequencies.

    RSRP_P [dBm]   RSRP_1 [dBm]   RSRP_2 [dBm]   ...   RSRP_150 [dBm]   f_s   cID
    -65.6          -70.7          -140           ...   -140             0     1
    -72.5          -140           -74.5          ...   -140             2     0
    -96.9          -102.7         -140           ...   -140             1     2

with information about the global cell ID from which the measurements were made, as well as the secondary frequency used. A typical raw feature vector is shown in Table 5.2. The indexation designating the neighbours of a certain serving cell is local to the node and primary frequency, meaning that common indices can at most be expected across cells and target frequencies within the same node, which thus forms the basis of our aggregation methods.
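Assembling one (feature, label) pair of the kind shown in Table 5.2 can be sketched as below. The helper and constant names are illustrative, with the -100 dBm coverage threshold and -140 dBm out-of-range value taken from the simulation setup described above:

```python
UNDETECTABLE_DBM = -140.0        # neighbouring cell out of range
COVERAGE_THRESHOLD_DBM = -100.0  # inter-RSRP above this counts as in coverage

def make_example(serving_rsrp, neighbour_rsrps, inter_rsrp, f_s, cell_id):
    """Assemble one training example: the serving and neighbour RSRP
    measurements plus the secondary-frequency and cell-ID fields, with the
    binary coverage label obtained by thresholding the ground-truth
    inter-frequency RSRP measurement."""
    features = [serving_rsrp] + list(neighbour_rsrps) + [f_s, cell_id]
    label = 1 if inter_rsrp > COVERAGE_THRESHOLD_DBM else 0
    return features, label
```

For the first row of Table 5.2, an inter-RSRP of, say, -95 dBm would yield label 1, while -110 dBm would yield label 0.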

5.1.2 Structure

The simulated datasets contain measurements from cells and different primary and secondary frequencies in the network. There are five different frequencies used, where each frequency serves as primary frequency once and the remaining frequencies are used to perform measurements. Thus, each such measurement series is saved as a file, uniquely defined by its cell, primary frequency and secondary frequency. Here we define a dataset as the collection of measurement files on the same primary frequency, meaning that we have five different datasets in total. Each of these datasets is comprised of groups of measurements from different nodes, where each node contains measurements on a number of cells and secondary frequencies. An example of the structure of the collection process is shown in Figure 5.1.


Figure 5.1: A typical measurement scenario. The base station is operating on the red frequency, collecting measurement series from three different cells and four secondary frequencies.

5.1.3 Aggregation

With the limitations in indexation, we can at most expect to meaningfully aggregate measurements across secondary frequencies and cells. On average, a node has, for a specific primary frequency, measurements from three separate cells. The simulations were performed on four different secondary frequencies for a single primary frequency.

An aggregation is in this case defined as the collection of data from several measurement series into a single, larger data structure. There are three canonical aggregations: aggregating data from the same cell, aggregating data on the same secondary frequency, and aggregating all the available data on the node. These three modes lead to a potential reduction in the number of models by a factor of 3, 4 and 12, respectively. Of course, there is reason to believe that variations of these aggregations might be more suitable in cases where the data distribution is more homogeneous across one specific dimension (e.g. the same overlap on a specific secondary frequency). This could imply aggregating data across cells from a subset of secondary frequencies with similar overlap and a single aggregation on the other secondary frequencies. Other aspects, such as the existence of sufficient data to create a single model for the problem, might also influence the optimal aggregation choice. For the purpose of this thesis, the study is primarily focused on comparing aggregations of as much data as possible with a baseline single model which is only trained and evaluated on data


from a single cell and secondary frequency.

5.1.4 Augmentation

While the raw data itself can be sufficient for training an ML-model, there are some augmentations of the data that help utilize as much information as possible from the measurements, as well as make the simulated data more realistic.

Cell/Frequency encoding

In order to specify the same ML-problem (binary classification) when using data aggregated from several secondary frequencies, it is necessary to somehow incorporate the measured secondary frequency in the model. The multilabel problem equivalent to binary secondary carrier prediction is in this case impossible, since only the ground-truth inter-RSRP to one specific secondary frequency is measured when constructing the datasets. Instead, we can move the complexity of multi-class SCP to the feature vector by encoding the secondary frequency in the feature vector and still retain a single ML-model. Thus, instead of approximating the multilabel probability P(Y_{f_0} = 1, Y_{f_1} = 1, ..., Y_{f_N} = 1 | X), our ML-model will calculate the single probability P(Y = 1 | X, f_i). This frequency information is implemented as a one-hot encoding appended during the data preparation stage when aggregating data from more than one secondary frequency.

There is also information about the serving cell used for collecting the data in the global cell ID. While not critical for specifying the ML-problem, this information can still be utilized to improve the performance of the model when aggregating data from different cells. As with the secondary frequency information, this is appended as a one-hot encoding to the feature vector.
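The one-hot augmentation described above can be sketched as follows; the function names and the sizes passed in the example are illustrative:

```python
def one_hot(index, size):
    """Return a one-hot vector of the given size with a 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def augment_features(rsrp_features, f_s, cell_id, n_freqs, n_cells):
    """Append one-hot encodings of the secondary frequency and serving cell
    to the RSRP features, so that a single aggregated model can compute
    P(Y = 1 | X, f_i) instead of one model per frequency and cell."""
    return (list(rsrp_features)
            + one_hot(f_s, n_freqs)
            + one_hot(cell_id, n_cells))
```

With four secondary frequencies and three cells on a node, each feature vector grows by seven entries, of which exactly two are non-zero.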

Noise

In the simulated data, every measurement is completely unaffected by noise and interference. Therefore, to make the data more realistic, a small amount of white Gaussian noise is added to the RSRP measurements during the data processing stage. The default noise level used is σ = 3. For normal operating conditions, intra-frequency RSRP measurements from UE to serving cell are allowed to contain up to 4.5 dB error, which under this channel noise amounts to 1.5σ [20]. For extreme operating conditions, the error must not exceed 9 dB, equivalent to 3σ. Another way to make the data more faithful to real


world applications is to add a probability of time-out to each measurement. This is implemented through a Bernoulli distribution with a timeout probability of 1/100 for all positive samples. The affected positive samples are then labeled as out of coverage with the label 0.

Scaling and Thresholding

The primary features from the simulations are measurements of RSRP from UE to neighbouring cells. These are measured in dBm, where the lowest value of -140 dBm corresponds to a neighbouring cell out of range, and a value of -44 dBm signifies a very strong connection. To further simulate a realistic scenario, a dynamic range for each UE's receiver is applied as part of the measurement process. This dynamic range is generated for each UE from a uniform distribution between 6 dB and 9 dB, such that columns with RSRP lower than the dynamic range below the maximum value are set to undetectable, corresponding to a range between normal and extreme conditions by the standards formulated in 3GPP 36.133 [20]. To comply with 3GPP reporting standards, a maximum of 8 detectable cells may be reported, with the rest being set to the undetectable value of -140 dBm. Following the 3GPP 36.133 standard, all RSRP values are floored during the data preparation stage and transformed to a non-negative number between 0 and 97 according to Table 5.3.

While these raw reports are possible to use as features, there are stability issues related to having large feature vectors consisting of big numbers in the training of some ML-algorithms. To alleviate this, the complete feature vector is min-max scaled according to

    \tilde{x} = \frac{x - \min\{x\}}{\max\{x\} - \min\{x\}}    (5.1)

such that every measurement is in the span [0, 1].
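The Table 5.3 report mapping and the min-max scaling (5.1) can be sketched as below; the function names are illustrative:

```python
import math

def quantize_rsrp(rsrp_dbm):
    """Floor a dBm measurement to the reported value of Table 5.3
    (3GPP 36.133): RSRP_00 for values below -140 dBm, RSRP_01 for
    [-140, -139), ..., up to RSRP_97 for -44 dBm and above."""
    return max(0, min(97, math.floor(rsrp_dbm) + 141))

def min_max_scale(features):
    """(5.1): scale the feature vector into the span [0, 1]."""
    lo, hi = min(features), max(features)
    if hi == lo:
        return [0.0 for _ in features]   # degenerate constant vector
    return [(x - lo) / (hi - lo) for x in features]
```

Note that `math.floor` (rather than integer truncation) is what matches the half-open intervals of Table 5.3 for negative dBm values.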

Oversampling

One way to counter the class imbalance problem in skewed datasets is to randomly oversample the training set to produce a dataset with a more equal class proportion. This can be performed using several methods. One simple method is to oversample the minority class of the training set so as to achieve a similar number of positive and negative examples. This can be problematic, though, since very little new information is added to the system. Oversampling is also technically equivalent to weighting the samples of the minority class so as to have a greater impact on the training.

Table 5.3: RSRP measurement report mapping

    Reported value    Measured value [dBm]
    RSRP_00           RSRP < -140
    RSRP_01           -140 ≤ RSRP < -139
    RSRP_02           -139 ≤ RSRP < -138
    ...               ...
    RSRP_95           -46 ≤ RSRP < -45
    RSRP_96           -45 ≤ RSRP < -44
    RSRP_97           -44 ≤ RSRP
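The simple random oversampling scheme can be sketched as follows (an illustrative helper, assuming binary 0/1 labels and at least one example of each class):

```python
import random

def oversample_minority(features, labels, rng):
    """Randomly duplicate minority-class examples until both classes are
    equally represented in the training set."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    # Draw duplicates uniformly at random from the minority class.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(labels))) + extra
    return [features[i] for i in idx], [labels[i] for i in idx]
```

Since duplicated rows only re-weight existing examples, this is equivalent to per-sample class weighting, as noted above; it should only ever be applied to the training folds, never to validation or test data.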

5.2 Predictors

There are several factors to consider when selecting an appropriate ML-model for a secondary carrier prediction task. These include:

• available computational power

• classification vs. regression task

• single/aggregated data source

• number of data points in the dataset

• presence of unlabeled data

A number of different ML-models have been studied for performing secondary carrier prediction; two of the most prominent are Decision Trees and Neural Networks.

5.2.1 Decision Trees

LightGBM (LGBM) is an ML-framework for efficiently constructing ensemble decision tree predictors [21]. The full name of the framework is Light Gradient Boosted Machine, where the term boosting signifies the combination of a sequence of weaker learners in series into a stronger unit. By sequentially


adding weaker learners, each trained on the residual errors of the previous iteration, one gradually obtains a final classifier by aggregating the individual weaker models. The first renowned successful application of boosting was AdaBoost, which used shallow trees (stumps) as the weak predictors. The general method of using decision trees and boosting has been further developed and is known as Gradient Boosted Decision Trees (GBDT). LGBM is an efficient implementation of GBDT with a scikit-learn-compatible interface, first published in 2017 [22].

The Random Forest ML-model is a related ensemble model consisting of a set of decision tree predictors, where each tree in the forest is constructed by sampling an i.i.d. random vector [23]. That is, for a given distribution p(θ), one samples a set of N independent random vectors θ_1, ..., θ_N. The k-th tree is generated by using the training set and θ_k to produce a classifier h(x; θ_k) for an input vector x. The ensemble decision is the mode of the set of tree classifiers, forming the random forest. The Random Forest is generally a very good predictor, but its predictions are slower than competing methods. This might have implications for real-time applications, especially if the number and depth of trees is increased to handle a large amount of training data.

By letting the number of trees in the ensemble grow, the generalization error of the Random Forest converges to a limit. For Gradient Boosted Decision Trees, however, increasing the number of trees makes the model prone to overfitting. The LGBM implementation of Gradient Boosted Decision Trees is particularly useful in cases with limited computational power and a need for quick inference.

5.2.2 Neural Networks

An Artificial Neural Network is a feedforward architecture for function approximation, consisting of an input layer, several hidden layers and an output layer [24]. Each layer in the network consists of a set of nodes, each performing a linear weighting of its inputs, which is then fed to a non-linear activation function and passed on to the next layer. That is, for each node j and input vector x ∈ ℝ^M, a layer performs the calculations

    a_j = \sum_{i=1}^{M} w_{ji} x_i

    z_j = h(a_j)


where the output vector z is fed to the next layer. The weights w of each layer are optimized through error backpropagation. Neural networks have in recent years been used with great success in diverse tasks such as spam detection, image and speech recognition, and data clustering. Due to advances in training algorithms and hardware such as GPUs, the feasible number of hidden layers and nodes in neural networks has risen, giving birth to the term Deep Neural Networks. One distinct feature of Deep Neural Networks is their ability to form hierarchies of features: the capability to discover latent structures and relationships in seemingly unstructured data. This can be viewed as automatic feature extraction.
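The layer computation above can be sketched as a plain-Python forward pass (an illustrative sketch; per-node bias terms are omitted for brevity, and the function names are invented here):

```python
import math

def sigmoid(a):
    """A common choice of the non-linear activation h."""
    return 1.0 / (1.0 + math.exp(-a))

def dense_layer(x, weights, activation=sigmoid):
    """One feedforward layer: node j computes a_j = sum_i w_ji * x_i and
    emits z_j = h(a_j). `weights` holds one weight row per node."""
    return [activation(sum(w * xi for w, xi in zip(row, x))) for row in weights]

def forward(x, layers):
    """Feed the input through a stack of layers; the output vector of each
    layer becomes the input of the next."""
    for weights in layers:
        x = dense_layer(x, weights)
    return x
```

Training would adjust each weight row via error backpropagation; here only the inference direction is shown.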

5.3 Methodology

There are several factors to consider when performing an experiment involving machine learning with regard to scientific rigour and proper methodology. Important factors include metric selection for evaluating performance, an experimental framework for comparing outcomes from different models, aggregation of results, and hyperparameter selection.

5.3.1 Evaluating Performance

In order to assess the performance of different algorithms on a given dataset, one has to select an evaluation scheme. Conventionally, well-studied datasets already come with a partitioning into training and test data, and one trains on the training data and evaluates the performance on the test data. This is, however, not the case in most real-world applications. Instead, one should perform some form of cross-validation in order to select the hyperparameters of the model, and finally test the performance on a held-out set of test data.

Cross-validation

There are many cross-validation strategies available, some more suitable than others for the type of data studied in this thesis. Common cross-validation schemes include k-fold, repeated k-fold and stratified k-fold, each with their benefits and drawbacks. Using k-fold cross-validation is generally better for estimating the absolute error of a machine learning algorithm, while repeated k-fold can be helpful when comparing the performance of two different models.


The data used for SCP can be very imbalanced; that is, the minority class is very small in relation to the majority class. One therefore needs to make sure that each cross-validation partition contains a sufficient number of examples from the minority class for the performance on the partition to be representative of the performance on the complete dataset. This motivates the use of a stratified method, i.e. a method that preserves the proportion of class labels in all folds. Furthermore, since some of the measurement series contain only a handful of samples from the minority class, it is preferable to use fewer folds and instead repeat the cross-validation with different random states. Therefore, a repeated, stratified k-fold cross-validation scheme was selected. More specifically, a 2-fold cross-validation is repeated 5 times using different partitionings. The folds are used both as training and validation sets, which ensures that each data point is used for both training and validation 5 times. This scheme produces 10 different model realizations and cross-validation performance metrics, which reduces the variance in the estimated mean performance. Furthermore, a portion of each measurement series must be set aside ahead of cross-validation to be used as a held-out test set. After selecting model parameters by studying the performance on the cross-validation set, we may then report the model performance on this test set.
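The scheme above (2 stratified folds, repeated 5 times, yielding 10 splits) maps directly onto scikit-learn's RepeatedStratifiedKFold; the labels below are a toy imbalanced example, not thesis data:

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Toy imbalanced labels: 16 negatives, 4 positives (illustrative only).
y = np.array([0] * 16 + [1] * 4)
X = np.arange(len(y)).reshape(-1, 1)

# 2-fold stratified cross-validation repeated 5 times -> 10 splits.
cv = RepeatedStratifiedKFold(n_splits=2, n_repeats=5, random_state=0)
splits = list(cv.split(X, y))
assert len(splits) == 10
for train_idx, val_idx in splits:
    # Stratification preserves the minority-class count in every fold.
    assert y[val_idx].sum() == 2
```

Each data point appears in a validation fold exactly once per repeat, i.e. 5 times in total, as described in the text.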

Comparing Aggregations

In order to fairly compare the different aggregations on a dataset, we must guarantee that the aggregations use exactly the same data for evaluating performance across the different cross-validation splits. This is achieved by reading the file lengths of all the different measurement series used to create the aggregated data for each node, and selecting training and validation data for each subfile according to a common specified random state, as shown in the algorithm in Appendix B. In practice, this means that the union of the validation and test sets used for the single models is exactly the same as that of the aggregated models. The combined training sets used for the single models are likewise equivalent to the training sets used by the aggregated model.

Reporting Results

The total number of measurement series comprising the datasets is large. We are first and foremost interested in comparing the general performance of models regardless of which specific node and cell was used to collect the data. Therefore, it is reasonable to compare the performance of a set of single models with the performance of the corresponding aggregated model on the same


test sets.

Furthermore, we may have some prior criteria for even considering a model good enough to be evaluated for usage. Criteria for considering a model valid could include minimal achievable cross-validation metrics for a given set of predictions, such as the existence of a threshold satisfying a joint requirement on minimal precision and recall, or a strictly positive informedness. Thus, we first calculate the performance of the model on the individual validation sets of each measurement series comprising the dataset, to determine model validity and operating threshold. We then average test set metrics from valid models with the same cell and secondary frequency. With these sets of test metrics for each cell, we may form confidence intervals across either primary or secondary frequency. However, when representing the data as CDFs, averaging data from different frequencies with different coverage characteristics may lead to multimodal distributions in the metrics; it is then better to construct and report CDFs from metrics pertaining to valid models in frequency dimensions with constant overlap.

5.3.2 Metrics

Beyond the cross-validation scheme, one has to choose performance metrics that reflect the structure of the data as well as the classification task and intended use. In the case of SCP, the data is commonly imbalanced, meaning that some metrics become biased and prone to give a false picture of classifier performance. For example, on a dataset with 99 % of the datapoints in the (positive) majority class, a classifier could achieve a precision close to unity simply by always outputting 1. Other metrics commonly used in machine learning, such as the F1 score, suffer from the same problem [25]. Therefore, it is desirable to define and study metrics which are applicable regardless of class prevalence.

Confusion Matrix

The confusion matrix is the basis of concepts such as sensitivity, specificity and accuracy. Given the true number of positive samples P and negative samples N in the data, one can sort the thresholded predictions of the machine learning algorithm into one of four categories: true positive (TP), true negative (TN), false positive (FP) and false negative (FN). The relation between the predicted class and the true class is presented in the confusion matrix in Table 5.4.


Table 5.4: Confusion Matrix.

                      Prediction
                  positive   negative   Σ
Truth  positive      TP         FN      P
       negative      FP         TN      N

From these base metrics one can derive several other metrics such as:

• True Positive Rate (recall), tpr = TP / (TP + FN)

• True Negative Rate (specificity), tnr = TN / (TN + FP)

• False Positive Rate (fall-out), fpr = FP / (TN + FP)

• False Negative Rate (miss rate), fnr = FN / (TP + FN)

• Positive Predictive Value (precision), ppv = TP / (TP + FP)
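These rates can be computed directly from the confusion matrix entries; a small sketch with illustrative labels and predictions (scikit-learn assumed available):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative true labels and thresholded predictions (P = 3, N = 5).
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0])

# For binary labels, ravel() yields the entries in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)  # recall / sensitivity
tnr = tn / (tn + fp)  # specificity
fpr = fp / (tn + fp)  # fall-out
fnr = fn / (tp + fn)  # miss rate
ppv = tp / (tp + fp)  # precision
```

For these labels, two of the three positives are found (tpr = 2/3) and one of the five negatives is misclassified (fpr = 0.2).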

It should be noted that the rate metrics are directly affected by the prevalence of the data, meaning that in a setting with a large skew, one can achieve a very high true positive (or negative) rate by assigning every prediction to the majority class. Several metrics can be derived by combining the different rates; one such metric is the so-called informedness J, defined as

J = TP / (TP + FN) + TN / (TN + FP) − 1 = tpr + tnr − 1 = tpr − fpr,   J ∈ [−1, 1]

A classifier with an informedness greater than 0 is in this regard strictly better than random classification, regardless of class distribution. This index is capable of capturing the complete performance of the classifier in a single metric [26].
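A minimal sketch of computing the informedness from thresholded predictions; for binary labels it coincides with scikit-learn's chance-adjusted balanced accuracy, which gives a convenient cross-check (toy labels, not thesis data):

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, confusion_matrix

# Toy labels and thresholded predictions (illustrative only).
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
J = tp / (tp + fn) + tn / (tn + fp) - 1  # informedness, tpr + tnr - 1

# In the binary case this equals the chance-adjusted balanced accuracy.
assert np.isclose(J, balanced_accuracy_score(y_true, y_pred, adjusted=True))
```

A positive J (here J ≈ 0.47) indicates the classifier is better than random guessing regardless of the class distribution.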

5.3.3 Non-parametric metrics

While informedness is independent of class balance, it is still dependent on the selected threshold. Moving away from threshold-dependent metrics removes a layer of complexity and methodological consideration from the evaluation process, which can be desirable when evaluating many datasets with varying class distributions. Secondary carrier prediction is performed for handover from a


macro cell to a smaller cell on another frequency, and it is often most important that the prediction of the UE having coverage is very reliable and that the False Positive Rate is minimized: on a false positive, the UE would have to reconfigure its receiver chain, which is costly and leads to a decreased Quality of Service. This means that the "best" predictor might be overly biased towards misclassifying some features as being out of coverage, a false negative. This behaviour is however acceptable at the system level, since the resulting action is simply to either perform a new inter-frequency measurement from the UE or wait for more data before performing the prediction. Thus, for system-level analysis, it is especially interesting to investigate the performance of the ML algorithm with regard to the False Positive Rate in relation to the True Positive Rate. Two methods describing this relation are the so-called Receiver Operating Characteristics (ROC) and the related Area Under Curve (AUC) metric.

Receiver Operating Characteristics (ROC)

The Receiver Operating Characteristics is a metric defined from the relation between the True Positive Rate and the False Positive Rate of a classifier in a binary classification task [26]. Given that the raw output probabilities p(C|x) of the ML model lie in the range [0, 1], the classifier performs a thresholding δ such that the raw output probability is mapped to either class, as in (3.1). By parametrizing the predictions by this threshold δ ∈ [0, 1], one can for each δ plot the True Positive Rate against the False Positive Rate, forming the ROC curve. The baseline performance of a random binary classifier is a straight line where the propensities for false positives and true positives are equal, and a classifier with a ROC curve strictly above this line is better than random prediction. Furthermore, the ideal classifier would only produce true positives and no false positives: for a certain threshold δ it correctly classifies all positives, and as the threshold decreases, it starts to produce false positives as well. The ROC curve of this ideal classifier is a step function passing through the points (0, 0), (0, 1) and (1, 1). An example ROC curve is shown in Figure 5.2a.
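Tracing the ROC curve by sweeping the threshold over the raw scores is a one-liner with scikit-learn; the scores below are illustrative placeholders:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Illustrative raw output probabilities p(C|x) and true labels.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

# roc_curve sweeps the threshold delta over the distinct scores.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# The curve runs from (0, 0) at the highest threshold to (1, 1) at the lowest.
assert fpr[0] == 0.0 and tpr[0] == 0.0
assert fpr[-1] == 1.0 and tpr[-1] == 1.0
```

Each returned (fpr, tpr) pair corresponds to one threshold δ, so plotting tpr against fpr yields the ROC curve described above.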

Area Under Curve (ROC AUC)

Although the ROC curve is a good performance indicator for a secondary carrier predictor, it can be useful to incorporate a more direct measure of performance as well. One such measure is calculated from the so-called Area


[Figure: (a) Receiver Operating Characteristics — tpr vs. fpr with the random baseline and informedness marked; (b) Precision-Recall curve — ppv vs. tpr.]

Figure 5.2: Example ROC and PR curves. The green square is the same value mapped to different positions on the respective curves.

Under Curve (AUC), which is defined as the integral of the ROC curve, i.e. the area under the curve. In a cellular network, there are many approaches to measuring performance depending on the prediction task, the data sources and the modes of data aggregation. In the case of secondary carrier prediction across several base stations in a cellular network, the performance of a specific ML model might vary across data from different cells and secondary frequencies. One way to create a comprehensive performance metric is, for a given data partitioning and set of cells, to calculate the AUC for each cell and data pair and map the results into a CDF. That is, for a sufficient number of cells and datasets, one plots the probability that the AUC for a random cell in the dataset is greater than or equal to a given AUC value. Here it is desirable to have an ML model which achieves a high mean AUC, and also an AUC CDF with as low variance as possible, as a guarantee of consistently high Quality of Service for the UE. In this way, one obtains a metric not only for the performance on a specific data instance, but for a whole dataset and across a cellular network encompassing different propagation environments.
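The per-cell AUC-to-CDF mapping described above can be sketched as follows; the AUC values are hypothetical placeholders standing in for per-cell results, not figures from this thesis:

```python
import numpy as np

# Hypothetical per-cell ROC AUC values (placeholders, not thesis results).
aucs = np.array([0.91, 0.85, 0.97, 0.78, 0.88, 0.93])

# Empirical CDF: fraction of cells with AUC <= a given value.
x = np.sort(aucs)
cdf = np.arange(1, len(x) + 1) / len(x)

# Probability that a random cell reaches at least AUC 0.85
# (the complement of the CDF at that point).
p_ge = np.mean(aucs >= 0.85)
assert np.isclose(p_ge, 5 / 6)
```

Plotting `cdf` against `x` gives the AUC CDF; a good model shifts the curve to the right (high mean) and makes it steep (low variance).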

Thus, ROC graphs and the ROC AUC metric are very useful for evaluating classifier performance. However, in instances with a large class imbalance,


the ROC metric becomes unreliable, as the resolution of the curve is entirely dependent on the number of predicted positives in relation to the true positives and true negatives. This can be understood by viewing the True and False Positive Rates as conditional probabilities:

tpr = P(Ŷ = 1 | Y = 1)

fpr = P(Ŷ = 1 | Y = 0)

In the case of class imbalance in the form of high overlap, there are many instances on which to estimate P(Ŷ = 1 | Y = 1), but very few on which to estimate P(Ŷ = 1 | Y = 0), resulting in a ROC graph with very limited resolution along the x-axis. Vice versa, in datasets with low overlap there are many examples contributing to P(Ŷ = 1 | Y = 0) but few contributing to P(Ŷ = 1 | Y = 1), leading to low resolution along the y-axis. One could argue, however, that given a number of imbalanced ROC graphs drawn from the same distribution, their mean AUC would still be true to the real AUC.

Recall is not affected by imbalance in the same way, because it only depends on the positive group. Recall does not consider the number of negative samples that are misclassified as positive, which can be problematic for class-imbalanced data with many negative samples.

Precision-Recall curve

Similarly to the ROC curve, there exists a parametrized curve relating the Positive Predictive Value (precision) to the True Positive Rate (recall), called the Precision-Recall curve. The ideal point is in this case (1, 1), where one achieves perfect positive predictions while correctly classifying every positive sample. The PR curve is complementary to the ROC curve: in cases where the number of negative examples is much larger than the number of positives, a small change in the number of false positives may greatly influence the False Positive Rate [27]. The precision (PPV), however, measures the proportion of correct positive predictions among all positive predictions, meaning that it avoids the skew problem of the ROC curve and is able to capture algorithm performance on a dataset with a large proportion of negative examples. From a probabilistic point of view, we can understand the precision as

ppv = P(Y = 1 | Ŷ = 1)

The precision is conditioned on the prediction, not the true label, which means that the curve maintains high resolution along the y-axis even for cases of low


overlap. It should be noted that the PR curve, unlike the ROC curve, is not necessarily concave. An example Precision-Recall curve is shown in Figure 5.2b.

Precision-Recall AUC

For the Precision-Recall curve one can also define an Area Under Curve (AUC) metric, able to summarize the performance of the algorithm predictions in a single number. One can also map a set of Precision-Recall AUC metrics from a dataset into a CDF representation, as with the ROC AUC. The perfect Precision-Recall curve would have an AUC of 1.
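A sketch of tracing the PR curve and computing its trapezoidal AUC with scikit-learn (illustrative scores; `sklearn.metrics.auc` accepts the decreasing recall ordering that `precision_recall_curve` returns):

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve

# Illustrative raw output probabilities and true labels.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# Trapezoidal PR AUC; auc() handles the monotonically decreasing recall.
pr_auc = auc(recall, precision)
assert 0.0 < pr_auc <= 1.0
```

A perfect predictor would give pr_auc = 1, matching the ideal point (1, 1) discussed above.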

Average Precision Score

An alternative metric for summarizing a Precision-Recall curve is the so-called Average Precision Score (AP). The AP is defined as

AP = Σ_n (R_n − R_{n−1}) P_n   (5.2)

where R_n and P_n are the recall and precision at the n:th threshold of the curve. The AP can be thought of as a weighted sum of precisions along the curve, where each precision is weighted by the increase in recall from the previous point. This metric is a slightly less optimistic performance estimate than the PR AUC, since the PR AUC is calculated using the trapezoidal rule and linear interpolation [22].
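Equation (5.2) can be evaluated directly from the output of `precision_recall_curve` and checked against scikit-learn's `average_precision_score` (illustrative scores):

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Illustrative raw output probabilities and true labels.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

precision, recall, _ = precision_recall_curve(y_true, y_score)

# AP = sum_n (R_n - R_{n-1}) * P_n; recall is returned in decreasing
# order, hence the sign flip on the differences.
ap_manual = -np.sum(np.diff(recall) * precision[:-1])
assert np.isclose(ap_manual, average_precision_score(y_true, y_score))
```

Unlike the trapezoidal PR AUC, this sum uses no interpolation between points, which is why AP is the slightly more pessimistic of the two.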

Duality of ROC analysis

The ROC and PR metrics are, as one might suspect, intimately linked. A point on the ROC curve can be mapped directly to a point on the PR curve (Figure 5.2), which means that there are meaningful conclusions to be drawn from studying both metrics in tandem. We can, for instance, have greater confidence in the relative performance of our models if both the ROC and PR metrics of two models differ congruently. Conversely, if they disagree, we can infer meaningful operational qualities of the different models in relation to the nature of the dataset and the class balance.

5.3.4 Threshold selection policy

Although threshold selection is not strictly necessary for evaluating the performance of a machine learning algorithm, it is still needed when deploying the


algorithm in practice, and it is an area worthy of study in itself. Every binary classification task requires the selection of a threshold that determines whether a prediction of, for example, 0.68 should be mapped to 0 or 1. The best policy for selecting the threshold is not obvious and is entirely dependent on the task and scenario. In some instances, the cost of false positives might be much higher than that of false negatives, meaning that the task-optimal threshold policy might be an overly conservative one. In the case of SCP, this optimal threshold will depend on the class balance of the dataset studied for prediction, which means that one has to define which metrics and traits to optimize in the threshold search. One way to intelligently find the best threshold is to use the ROC curve previously discussed. The optimal ROC curve is the one passing through (0, 1). Therefore, one could argue that the optimal threshold is the one corresponding to the point on the ROC curve with the smallest distance to (0, 1). Selecting the Euclidean distance as the cost function reduces the problem to minimizing the distance

||(−fpr, 1 − tpr)||₂

This, of course, assumes equal weighting of fpr and 1 − tpr = fnr, which might not be optimal, both due to the costs of the respective errors and due to the class balance in the dataset.

Optimizing Informedness

Since the class balance of SCP tasks varies drastically across frequencies, it would be desirable to have a threshold selection mechanism that is not affected by class balance and can be applied to any arbitrary dataset. Extending the discussion of ROC curve analysis and informedness, one could argue that the optimal threshold is the one maximizing the probability of an informed decision, i.e. the point on the ROC curve corresponding to maximal informedness. Recalling that the informedness can be written as J = tpr − fpr implies that it can be mapped directly as a vertical line between the random line tpr = fpr and the ROC curve. The informedness metric is invariant to class prevalence, but assigns equal cost to true positives and false positives. As previously discussed, when selecting a threshold for an SCP task, the cost associated with a false positive is generally much higher than the loss of a false negative or the gain of a true positive. Incorporating this into our threshold selection policy, we can add weights to the terms in the informedness


equation as

J* = α · tpr − β · fpr

or equivalently

J_λ = tpr − λ · fpr

This policy amounts to finding the threshold δ corresponding to the point on the ROC curve with the largest distance to the line tpr − λ · fpr = 0 for a given λ, that is

δ* = argmax_δ (tpr − λ · fpr).   (5.3)

This criterion leads to different optimal thresholds depending on selected λ.An example of threshold selection using maximal λ-informedness is shown inFigure 5.3.
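The search in (5.3) reduces to a maximization over the finite set of thresholds returned by `roc_curve`; a sketch with illustrative scores:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Illustrative raw output probabilities and true labels.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_score)

def lambda_informed_threshold(fpr, tpr, thresholds, lam=1.0):
    """Pick the delta maximizing J_lambda = tpr - lam * fpr, as in (5.3)."""
    j = tpr - lam * fpr
    return thresholds[np.argmax(j)]

# lam > 1 penalizes false positives harder; on this toy data the
# selected threshold stays at least as high (more conservative).
d1 = lambda_informed_threshold(fpr, tpr, thresholds, lam=1.0)
d2 = lambda_informed_threshold(fpr, tpr, thresholds, lam=4.0)
assert d2 >= d1
```

The function name and toy data are illustrative; in the thesis setting the search would be run on the validation-set predictions of each model.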

[Figure: ROC curve (tpr vs. fpr) with the points maximizing J_{λ=1} and J_{λ=2} marked.]

Figure 5.3: Threshold selection using λ-informedness. The point on the curve with λ-optimal informedness is associated with a specific threshold δ_λ.

5.4 Implementation

The evaluation framework for the SCP task was implemented in Python using SciKit-Learn [22] and TensorFlow [28], as well as Imblearn and other miscellaneous packages. The evaluation algorithm can be separated into experiments on smaller units (nodes, cells, target frequencies, single models) for a given primary frequency and set of secondary frequencies, yielding unthresholded predictions. The raw predictions and true labels, together with the corresponding training files, test cells and frequencies, are saved to file in data structures for subsequent analysis. The raw predictions are then evaluated


using both parametric and non-parametric metrics, with a specified threshold search algorithm applied to both the validation and test sets. The metrics are then used either to construct a CDF representation or to construct confidence intervals across models with the same cell and secondary frequency.

5.4.1 Hyperparameters

Hyperparameter selection is an important consideration when designing a Machine Learning model. Two different methods were implemented and studied: a previously studied LGBM decision tree model and a deep neural network using a variational Information Bottleneck objective.

LGBM-reference model

The decision tree model was implemented using the LGBM package in Python, with previously studied hyperparameters. The model is generally an order of magnitude faster to train than the neural network model without GPU acceleration. The selected hyperparameters are shown in Table 5.5.

IB neural network model

The neural network with variational IB objective was implemented using TensorFlow [28] and ReLU [29] activation functions, in alignment with the schematic presented in Figure 4.1. The structure is loosely based on [16], using two hidden layers in the encoder and one hidden layer in the decoder. However, [16] studies image datasets which are much more dense than the relatively sparse RSRP data, meaning that there is a lot of room for reducing the number of neurons in the hidden layers. In [16], the network is structured as 784 − 1024 − 1024 − (2B) − 10, where 784 corresponds to the input dimension, B to the bottleneck size and the output 10 to the number of classes in the dataset, with two hidden layers in the encoder and one softmax output layer. In the implementation of this thesis, the network was structured as D − D/2 − (2B) − 2B − O, where D is the number of features of the data (varying from 50 to 150 depending on aggregation), B is the bottleneck size and O is the output sigmoid layer of size 1. The bottleneck layer, indicated in brackets (·), outputs the B-dimensional mean and the B-dimensional standard deviations, the latter after a softplus transform. The proposed network is relatively deep and wide in relation to the binary prediction task of SCP, but the dimensions were chosen to be similar to the structure in the original paper. The depth of the network


Table 5.5: LGBM-reference model hyperparameters.

Parameter                    Value
'task'                       'train'
'boosting_type'              'gbdt'
'objective'                  'binary'
'metric'                     'binary_logloss'
'metric_freq'                1
'is_training_metric'         'true'
'max_bin'                    255
'num_trees'                  100
'learning_rate'              0.1
'num_leaves'                 63
'tree_learner'               'serial'
'feature_fraction'           0.8
'bagging_freq'               5
'bagging_fraction'           0.8
'min_data_in_leaf'           50
'min_sum_hessian_in_leaf'    5.0
'num_iterations'             100
'is_enable_sparse'           'true'
'use_two_round_loading'      'false'
'verbose'                    -1
'scale_pos_weight'           2.0


Table 5.6: IB model hyperparameters.

Parameter                  Value
β                          5 · 10⁻³
Encoder                    D − D/2 − 2B
Bottleneck (B)             8
Decoder                    2B − 1
Softplus bias              −5
Batch size                 100
Epochs                     10
Optimization algorithm     Adam
β₁                         0.5
β₂                         0.999
Initial learning rate      10⁻³
Exponential decay          50 steps / epoch
Decay rate                 0.97
Staircase                  True

means that each epoch of training takes longer compared to that of a one-layer network. However, the depth also implies that only a handful of epochs are needed for the model to converge. The most important design decision for the network is that the first layer is sufficiently wide; subsequent layers are less important but affect the training convergence of the network. Note that no regularization was used except for the implicit regularization posed by the Information Bottleneck objective, as in [16]. A full list of hyperparameters, together with the learning scheme, is shown in Table 5.6.
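A minimal numpy sketch of a single forward pass through the D − D/2 − (2B) − 2B − O structure described above, with the bottleneck emitting B means and B softplus-transformed standard deviations and the reparameterization trick producing the latent sample; the weights are random placeholders, not trained values from the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)
D, B = 50, 8  # number of input features and bottleneck size

def relu(x):
    return np.maximum(x, 0.0)

def softplus(x, bias=-5.0):
    # Softplus with the bias from Table 5.6 keeps the initial std small.
    return np.log1p(np.exp(x + bias))

# Encoder: D -> D/2 -> 2B (B means and B std pre-activations).
W1 = rng.normal(size=(D, D // 2)) * 0.1
W2 = rng.normal(size=(D // 2, 2 * B)) * 0.1

x = rng.normal(size=(D,))          # placeholder feature vector
h = relu(x @ W1)
stats = h @ W2
mu, sigma = stats[:B], softplus(stats[B:])

# Reparameterization: z = mu + sigma * eps, eps ~ N(0, I).
z = mu + sigma * rng.normal(size=B)

# Decoder: B -> 2B -> 1 with a sigmoid output (O = 1).
W3 = rng.normal(size=(B, 2 * B)) * 0.1
w4 = rng.normal(size=(2 * B,)) * 0.1
p = 1.0 / (1.0 + np.exp(-(relu(z @ W3) @ w4)))
assert z.shape == (B,) and 0.0 < p < 1.0
```

The actual thesis model is trained in TensorFlow with the variational IB loss; this sketch only illustrates the layer dimensions and the stochastic bottleneck.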

5.4.2 Evaluation Algorithm

At a high level, the evaluation can be divided into a main evaluation algorithm for training the models and producing the raw predictions for all measurement series in the dataset, a function for selecting a threshold by optimizing a specific cost function on the validation data, a function for averaging metrics over valid models within the same cross-validation, and finally a function for constructing confidence intervals or CDF representations over either secondary or primary frequencies. A high-level implementation of the evaluation algorithm is presented in Appendix B.


Chapter 6

Results

6.1 Dataset

The simulated dataset consisted of measurements performed on five different frequencies f0, f1, f2, f3, f4, with an average file length of around 1900 samples per cell and secondary frequency. The simulation was carried out over 16 different nodes, although only four of them had the required number of 3 cells per node to qualify for evaluation. Thus, the experiments were carried out on the data from these four nodes. The class balance between positive and negative samples varied greatly among the measurements depending on secondary frequency, due to the direct relationship between wavelength and propagation distance. The relatively low frequency measurements towards secondary frequency f0 = 778 MHz had an average coverage level of ∼90 %, while the very high 5G frequency of f4 = 28 GHz had a much smaller coverage level of ∼15 %. The intermediate frequencies f1, f2 and f3 had relatively balanced class distributions. The exact average coverage levels, absolute frequencies and file lengths are shown in Table 6.1. A Gaussian kernel was used to approximate the density functions across secondary frequencies; using a bandwidth of B = 0.05 produced the approximate probability density functions shown in Figure 6.1. The number of detectable cells per measurement varied from 1 to 8 for the simulated data. The distribution of detectable cells nc per primary frequency is shown in Table 6.2. In the simulation, every UE can detect all cells to some degree, but only the 8 strongest are reported after the dynamic range has been applied. It can be seen that the number of detectable cells is closely related to the primary frequency, with the lower frequencies often having many measurements with only one or two detectable cells. It should be noted that the high 5G frequency f4 had a very high


Table 6.1: Average overlap, absolute frequency and number of datapoints for the different secondary frequencies.

             f0     f1     f2     f3      f4
f [MHz]     778   1812   2640   3500   28000
overlap    0.93   0.77   0.66   0.58    0.16
n          1862   1888   1888   1888    1895

Table 6.2: Distribution of detectable cells per measurement for the different primary frequencies.

nc [%]     1     2     3    4    5    6    7     8
f0      47.1  29.9  13.7  5.6  2.3  1.0  0.3   0.1
f1      48.3  29.2  13.1  5.5  2.3  1.0  0.4   0.2
f2      48.3  28.7  13.0  5.5  2.3  1.2  0.6   0.5
f3      47.7  27.9  12.4  5.3  2.7  1.5  1.2   1.2
f4      30.5  16.5   7.5  4.0  3.9  5.0  6.8  25.7

proportion of measurements with at least 8 detectable cells (after the dynamic range transform) compared to the other frequencies. This could imply that the variation from the strongest to the weakest cell is comparatively smaller for this primary frequency.
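The density approximation described above can be reproduced in outline with a Gaussian kernel of bandwidth B = 0.05; the overlap samples below are synthetic placeholders, not the thesis data:

```python
import numpy as np

def gaussian_kde(samples, grid, bandwidth=0.05):
    """Gaussian-kernel density estimate evaluated on a grid of points."""
    diffs = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * diffs**2) / np.sqrt(2.0 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Synthetic overlap samples (placeholder for per-cell coverage levels).
rng = np.random.default_rng(1)
samples = np.clip(rng.normal(0.7, 0.05, size=500), 0.0, 1.0)
grid = np.linspace(0.0, 1.0, 201)
pdf = gaussian_kde(samples, grid)

# The estimate is non-negative and integrates to ~1 over the support.
assert pdf.min() >= 0.0
assert abs(pdf.sum() * (grid[1] - grid[0]) - 1.0) < 0.02
```

Repeating this per secondary frequency and plotting the resulting curves gives a figure of the same kind as Figure 6.1.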

[Figure: kernel density estimates p_X(x) of overlap, one curve per secondary frequency f0–f4.]

Figure 6.1: Overlap distributions across secondary frequencies, approximated using a Gaussian kernel.


6.2 Compression and Mutual Information

To understand how the compression parameter β affects the performance and behaviour of the IB algorithm, a few tests using varying levels of β were conducted. First, selecting the dimension of the bottleneck layer as B = 2 allows a direct mapping between the latent representation and 2D space. An IB model was trained using all data with primary frequency f0 on a single node. After training, a fraction of the test data was fed to the model and the latent representation output was mapped to a scatter plot. The points were identified through the test data as either having coverage on one of the secondary frequencies or belonging to the zero class. In the nearly deterministic case of almost no compression at β = 10⁻⁸ in Figure 6.2a, the separation between positive and negative classes is quite strict, with relatively little overlap. Furthermore, the different secondary frequencies are mapped to slightly different regions in the latent space. The lower frequencies f1, f2 and f3 are mapped closer together, while the higher frequency f4 is mapped a bit further away, with greater overlap with the zero class than the other frequencies. It can also be noted that the distribution across frequencies is ordered, so that the frequencies form a continuum in the latent space, with measurements typical of higher secondary frequencies mapped further to the left, and lower frequencies mapped further to the right. In the intermediate case of β = 5 · 10⁻³ in Figure 6.2b, the separation between zero and positive classes is still clear. However, the separation between the adjacent frequencies f1, f2 and f3 is almost completely gone, while the highest frequency f4 is still mapped to a separate cluster in the latent space. Increasing the compression further by setting β = 1 results in a latent representation where all positive classes, regardless of secondary frequency, are mapped to the same distribution, as in Figure 6.2c. The separability between positive and negative points has also deteriorated slightly, but there is still a very clear distinction between the positive and negative distributions.

In the final test, with a very high compression level of β = 1000 in Figure 6.2d, all points are forced to the marginal distribution N(0, 1), and separation between coverage and non-coverage is completely impossible. Thus, it can be conjectured that the optimal regime is one where the amount of compression applied is enough to move all positive points to the same general area regardless of secondary frequency, while still allowing for retained separation between positive and negative samples.

Another way to evaluate the effect of compression on the behaviour of the model is to calculate the amount of retained information about the input in the


[Figure: 2D bottleneck scatter plots with points labelled f1, f2, f3, f4 or zero class: (a) Deterministic Map, β = 10⁻⁸; (b) Balanced Map, β = 5 · 10⁻³; (c) Balanced Map, β = 1; (d) Compressed Map, β = 10³.]

Figure 6.2: Bottleneck map for different compression levels β.

bottleneck layer through the mutual information I(X;Z). At the same time, it can be interesting to study how the amount of compression affects the general predictive performance of the model, which can be indicative of how much information about the input really is necessary to successfully perform SCP. For this purpose, the Average Precision Score and the mutual information were studied in relation to the compression level β, for three different bottleneck sizes B = 2, B = 8 and B = 32, shown in Figure 6.3. For all setups, the mutual information was relatively constant in the deterministic regime from β = 10⁻⁸

to β = 10⁻⁵. The performance as measured by the AP was about the same across the three setups, with a slight advantage for the intermediate bottleneck size of B = 8. The classification performance only started deteriorating around β = 10⁻², and for β = 10² the classification was more or less random. Thus, the optimal points with regard to compression while maintaining performance lie in the span between β = 10⁻⁴ and β = 10⁻². Therefore, the point β = 5 · 10⁻³ was selected as the operating point for subsequent experiments. This point maintains the classification performance of a deterministic network, while reducing the mutual information between feature data and latent representation from approximately 44 bits down to 6 bits.
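In the variational IB setting, I(X;Z) is typically upper-bounded by E_x[KL(p(z|x) ‖ r(z))] with the marginal fixed to r(z) = N(0, I); for a diagonal-Gaussian encoder this KL term has a closed form (in nats; divide by ln 2 for bits). A sketch, with hypothetical encoder outputs:

```python
import numpy as np

def kl_diag_gauss_to_std_normal(mu, sigma):
    """KL(N(mu, diag(sigma^2)) || N(0, I)) in nats, closed form."""
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma))

# The KL is zero exactly when the encoder matches the marginal N(0, I),
# i.e. the fully compressed regime observed at very large beta.
assert np.isclose(kl_diag_gauss_to_std_normal(np.zeros(8), np.ones(8)), 0.0)

# A hypothetical encoder output: nonzero mean, small std -> positive KL.
kl_nats = kl_diag_gauss_to_std_normal(np.full(8, 0.5), np.full(8, 0.8))
bits = kl_nats / np.log(2)
assert bits > 0
```

Averaging this quantity over the dataset gives an estimate of the kind of I(X;Z) curve reported in Figure 6.3a.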

[Figure: (a) estimated mutual information I(X;Z) vs. β; (b) Average Precision Score vs. β; both for bottleneck sizes B = 2, 8 and 32, with β swept from 10⁻⁸ to 10².]

Figure 6.3: Compression/performance tradeoff.

6.3 Prediction

With the selected architecture from the previous section, we now evaluate the prediction performance of the different models on the network dataset.

6.3.1 Metrics

The two main metrics reported in this section are the Area Under Curve metrics for the Receiver Operating Characteristic and Precision-Recall curves. Together, they form a basis of evaluation by which we can fairly compare the performance of two models on the same dataset without selecting an operating threshold. These metrics are complementary, since they are affected by class balance in the data in different ways, and their joint statistics make it easier to determine the difference in performance between two models than studying either value alone.
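Both metrics can be computed with scikit-learn; a small sketch on toy labels and scores (illustrative only, not the thesis data):

```python
from sklearn.metrics import average_precision_score, roc_auc_score

# Toy coverage labels and predicted scores for a single cell/frequency pair.
y_true = [1, 1, 1, 0, 1, 0, 0, 1]
y_score = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]

# Threshold-free ranking quality over all operating points.
roc_auc = roc_auc_score(y_true, y_score)
# Average precision summarizes the precision-recall curve (PRAUC).
pr_auc = average_precision_score(y_true, y_score)
```

Because average precision weights each recall step by its precision, it reacts to class imbalance where ROCAUC does not, which is exactly why the two are reported jointly.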


Informedness is reported for the optimal threshold on the validation set. The reported informedness is thus the maximal achievable performance for a given model and prediction, and should not be treated as a basis for choosing the practical operating point of the model, since the costs associated with false positive and false negative errors are not equal. It is possible to adjust this balance by selecting a λ relative to the costs associated with each error for the prediction task, but there is no intuitive meaning to a λ-cost-balanced informedness. The original informedness is therefore reported to give an idea of the potential difference between models, rather than as an objective metric for optimizing performance.
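Informedness at the optimal threshold equals the maximum of Youden's J statistic (TPR − FPR) over the ROC curve. A small sketch of how this threshold selection can be done with scikit-learn (toy data, illustrative only):

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.2, 0.7, 0.6])

# Sweep all candidate thresholds along the ROC curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
j = tpr - fpr                       # Youden's J = informedness per threshold
best = np.argmax(j)
informedness = j[best]              # maximal achievable informedness
threshold = thresholds[best]        # the corresponding operating point
```

In practice the threshold would be picked on the validation split and the resulting informedness reported on the test split, as described above.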

6.3.2 Models

Four main models were studied: the reference single LGBM-model, the single IB-model, the node-aggregated IB-model, and the IB-model aggregated over frequencies with similar overlap. The single models were trained and evaluated on data from a single cell and secondary frequency. The node-aggregated IB-model was trained and evaluated on data from twelve different cell and frequency pairs. The coverage-aggregated IB-model was trained and evaluated by partitioning the data into three sets: the high overlap set f0, the intermediate overlap set f1, f2, f3, and the low overlap set f4. Note that while the models were trained on different partitionings of the data, all models were evaluated on exactly the same test data, to make the performance between models comparable. Furthermore, exactly the same amount of data is used to train the models when accounting for every measurement on a node; the only difference is that the single configurations train twelve smaller models, while the node-aggregated IB-model trains one model per node using all the available data.

6.3.3 Evaluation

The models were evaluated on data from all secondary and primary frequencies satisfying the evaluation requirements. A major determining factor for characterizing classifier performance is the overlap of the dataset; it only makes sense to aggregate graphical representations of datasets with similar class balance. Aggregating test data from datasets with different overlap generally leads to multi-modal distributions in the results, which is undesirable and more difficult to interpret. In the simulated dataset, the determining factor for coverage is the secondary frequency of evaluation. Therefore, to


Table 6.3: Secondary Frequency f0

Model            Informedness     ROCAUC           PRAUC
LGBM-reference   [0.633, 0.724]   [0.848, 0.905]   [0.985, 0.990]
IB-single        [0.572, 0.684]   [0.829, 0.893]   [0.983, 0.989]
IB-coverage      [0.617, 0.704]   [0.866, 0.908]   [0.986, 0.991]
IB-node          [0.666, 0.752]   [0.871, 0.920]   [0.987, 0.992]

Table 6.4: Secondary Frequency f2

Model            Informedness     ROCAUC           PRAUC
LGBM-reference   [0.803, 0.836]   [0.957, 0.971]   [0.975, 0.981]
IB-single        [0.744, 0.781]   [0.927, 0.943]   [0.957, 0.968]
IB-coverage      [0.809, 0.833]   [0.960, 0.970]   [0.975, 0.981]
IB-node          [0.811, 0.837]   [0.962, 0.972]   [0.977, 0.982]

construct the CDF-representations and their corresponding tables, results wereaggregated over all primary frequencies for a given secondary frequency.

For the tabular results, training data from each cell and frequency pair is used to train a set of models through a stratified 5x2 repeated cross-validation. Using the validation set for each model, we determine model validity and threshold. The valid models on the same cell and secondary frequency are then evaluated on the test set, and their performance metrics are averaged and reported for this specific cell and frequency. For the simulated data, all evaluated models fulfilled the validity criterion of a strictly positive informedness. All metrics from cells evaluated on the same secondary frequency and all primary frequencies are then used to form a 95% confidence interval. Each secondary frequency is evaluated from four different primary frequencies and three different cells, and the evaluation is carried out for four different nodes. This amounts to 48 unique sets of metrics used to form the confidence intervals reported for each secondary frequency.
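The 5x2 repeated stratified cross-validation and a normal-approximation 95% confidence interval over the collected metrics could be sketched as follows (the data and the decision-tree stand-in model are toy placeholders, not the thesis pipeline):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, weights=[0.3, 0.7], random_state=0)

# 2 folds repeated 5 times = the "5x2" scheme; stratification keeps the
# class balance of each fold close to the full dataset's.
cv = RepeatedStratifiedKFold(n_splits=2, n_repeats=5, random_state=0)
scores = []
for train_idx, test_idx in cv.split(X, y):
    model = DecisionTreeClassifier(max_depth=4, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    scores.append(roc_auc_score(y[test_idx],
                                model.predict_proba(X[test_idx])[:, 1]))

scores = np.array(scores)
# Normal-approximation 95% CI over the collected per-fold metrics.
half = 1.96 * scores.std(ddof=1) / np.sqrt(len(scores))
ci = (scores.mean() - half, scores.mean() + half)
```

In the thesis setting, the per-fold scores would instead be the 48 per-cell, per-frequency metrics described above.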


Table 6.5: Secondary Frequency f4

Model            Informedness     ROCAUC           PRAUC
LGBM-reference   [0.792, 0.827]   [0.958, 0.969]   [0.782, 0.835]
IB-single        [0.473, 0.587]   [0.721, 0.812]   [0.500, 0.617]
IB-coverage      [0.716, 0.754]   [0.912, 0.930]   [0.733, 0.793]
IB-node          [0.776, 0.809]   [0.956, 0.965]   [0.792, 0.838]

6.3.4 Results

The prediction results are recorded for each secondary frequency, aggregated over the different primary frequencies. The performance characteristics with regard to coverage vary, and the CDF representation of each secondary frequency is reported for each of the three metrics. The corresponding average results for three frequencies with differing overlap are tabulated in Tables 6.3, 6.4 and 6.5. The node-aggregated IB-model tends to have the highest span of performance, while the single IB-model is generally the lowest. The IB-models tend to perform better relative to the baseline LGBM-model for datasets with high and medium coverage, while the baseline LGBM-model performs better for low coverage datasets. The IB-coverage model generally performs similarly to the baseline LGBM-model for the high and intermediate coverage datasets, and somewhat worse for the low coverage datasets. The IB-node model had the highest average span of performance measured by the PRAUC metric, while being slightly lower in informedness and ROCAUC than the LGBM-baseline model for the low overlap data on secondary frequency f4. Relative to the IB-models, the LGBM-baseline model had a higher maximal informedness.

The distributions of performance metrics for the different models are shown for the same secondary frequencies in Figures 6.4, 6.5 and 6.6. For the high overlap dataset on secondary frequency f0, all four models perform quite similarly. The single IB-model is slightly worse than the other models, and the coverage-aggregated IB-model and the LGBM-model perform similarly to one another. For the intermediate coverage frequency f2, the gap between the single IB-model and the other models widens; this could be attributed to the smaller number of positive samples in the training set of the single IB-model. For the low overlap dataset f4, the gap between the smaller IB-models and the LGBM-model widens further, while the node-aggregated IB-model still performs similarly to the LGBM-model. This behaviour supports the dependence of the IB-models on the number of positive samples in the dataset.

[Figure 6.4: Performance CDFs, secondary frequency f0. Panels: (a) distribution of ROCAUC; (b) distribution of PRAUC; (c) distribution of Informedness; each showing P(X ≤ x) for LGBM-baseline, IB-single, IB-coverage and IB-node.]

6.3.5 Constrained Training Data

There are several considerations beyond simply finding the optimal algorithm for a large amount of training data. In many applications, training on a very limited amount of data is a common scenario, which poses additional challenges for achieving decent prediction performance. To study this case for the node-aggregated IB-model, the amount of data available for training was varied from 100% of the training data down to as little as 10%. The performance was measured by the AUC metrics for the ROC and PR curves evaluated on the test set. The results are shown in Tables 6.6 and 6.7.
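Varying the available training data while preserving class balance can be sketched with stratified subsampling; the fractions match the experiment, but the dataset and decision-tree stand-in model are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, weights=[0.2, 0.8], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

results = {}
for frac in (1.0, 0.5, 0.25, 0.1):
    if frac < 1.0:
        # Stratified subsample keeps the positive/negative ratio intact.
        X_sub, _, y_sub, _ = train_test_split(
            X_train, y_train, train_size=frac, stratify=y_train,
            random_state=0)
    else:
        X_sub, y_sub = X_train, y_train
    model = DecisionTreeClassifier(max_depth=4, random_state=0)
    model.fit(X_sub, y_sub)
    # Always evaluate on the same held-out test set, as in the thesis setup.
    results[frac] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
```

Evaluating every fraction on the same test split is what makes the columns of Tables 6.6 and 6.7 directly comparable.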


[Figure 6.5: Performance CDFs, secondary frequency f2. Panels: (a) distribution of ROCAUC; (b) distribution of PRAUC; (c) distribution of Informedness; each showing P(X ≤ x) for LGBM-baseline, IB-single, IB-coverage and IB-node.]

We see that the performance deterioration when data is removed depends directly on the coverage of the studied secondary frequency, i.e. the number of positive samples in the resulting dataset: the lower the coverage, the greater the deterioration. This indicates that a determining factor for SCP performance for the node-aggregated IB-model is the total number of positive samples in a dataset. This is especially true for the data with low overlap, where we can see a greater discrepancy in performance between IB-models aggregating data from different sources and the single IB-model, as in Table 6.5. In such a setting, aggregating data from different frequencies might be necessary in order to reach an acceptable level of performance when using an IB-model. It also implies that for high coverage datasets, we can generally use fewer total samples on the node to train the models. This can be exemplified by comparing the performance of the aggregated model on the high coverage frequency f0 in Tables 6.6 and 6.7, which, using only 10% of the total data, achieves performance equivalent to the baseline LGBM-model, as in Table 6.3. Similarly, for the intermediate coverage frequency f2, the node-aggregated IB-model using only 25% of the total data performs similarly to the single IB-model. Thus, aggregation of data from different sources allows us to use the same total number of data samples to reach improved performance, as well as to use a smaller amount of total data to reach performance equivalent to that of a single model.

[Figure 6.6: Performance CDFs, secondary frequency f4. Panels: (a) distribution of ROCAUC; (b) distribution of PRAUC; (c) distribution of Informedness; each showing P(X ≤ x) for LGBM-baseline, IB-single, IB-coverage and IB-node.]


Table 6.6: ROCAUC across frequencies for different amounts of data.

Secondary   100%             50%              25%              10%
f0          [0.871, 0.920]   [0.869, 0.916]   [0.859, 0.909]   [0.860, 0.900]
f1          [0.942, 0.958]   [0.938, 0.952]   [0.921, 0.936]   [0.893, 0.910]
f2          [0.962, 0.972]   [0.953, 0.964]   [0.930, 0.945]   [0.896, 0.920]
f3          [0.968, 0.975]   [0.959, 0.968]   [0.935, 0.949]   [0.900, 0.924]
f4          [0.956, 0.965]   [0.940, 0.954]   [0.913, 0.928]   [0.871, 0.894]

Table 6.7: PRAUC across frequencies for different amounts of data.

Secondary   100%             50%              25%              10%
f0          [0.987, 0.992]   [0.987, 0.991]   [0.986, 0.991]   [0.984, 0.989]
f1          [0.975, 0.982]   [0.974, 0.981]   [0.969, 0.977]   [0.959, 0.968]
f2          [0.977, 0.982]   [0.971, 0.978]   [0.960, 0.970]   [0.943, 0.958]
f3          [0.971, 0.978]   [0.966, 0.973]   [0.946, 0.961]   [0.922, 0.945]
f4          [0.792, 0.838]   [0.760, 0.810]   [0.672, 0.730]   [0.555, 0.625]


Chapter 7

Conclusion

The overall summary and conclusions drawn from the project are presented in this chapter, together with ideas for future work.

7.1 Summary

Overall, the general method of aggregating data from different sources on the same node shows great promise, both with regard to performance and to the potential reduction in the total number of models and in the amount of data collection necessary in the network. By employing an IB-model parametrized by a compression term, we can study how compression affects the intermediate representations of the network data and how these representations vary with class balance. We can also note that, of the information available in the features, only a small number of bits is necessary for predicting coverage. While it is not obvious that the simple variational IB model using neural networks employed here is optimal for secondary carrier prediction, it has been shown that it is possible to aggregate data from heterogeneous distributions into a single binary classifier, achieving better performance than an equivalent, smaller model trained on homogeneous data from a single source.

7.2 Future work

Future work could expand the binary prediction task beyond the aggregated secondary prediction of coverage to the multilabel scenario, and could also incorporate a confidence score for the different classes. This would add another dimension for load balancing and prediction confidence, compared to the studied binary setup where only one secondary frequency is considered at a time. There is also room for using more diverse features for SCP than the sparse RSRP features studied in this thesis, which might provide additional performance benefits.

Furthermore, it is far from conclusive that the studied architecture is optimal for this prediction task. Within the realm of IB-objectives alone, one could formulate latent distributions other than the spherical Gaussian. The results also suggest that regular neural network structures, equivalent to a deterministic bottleneck, work very well for SCP. It would also be interesting to study machine learning problems other than binary prediction; the data collected from the cellular network also facilitates the study of regression problems, which could be used to improve prediction performance further.

There is also another interesting topic related to the binary prediction task. Generally, the training procedure for binary prediction uses the binary cross entropy as the cost function. In the context of this thesis, it could be interesting to study SCP through another cost function, one that assigns different weights to true and false classifications in the confusion matrix. Such a training procedure could also be related to the briefly treated threshold optimization problem, which might yield fruitful results for general binary classification tasks with asymmetrical costs.


Appendix A

Prediction Results

The full prediction results for the remaining secondary frequencies are presented here, along with their corresponding CDF-representations. The prediction results for secondary frequencies f1 and f3 are very similar to those of secondary frequency f2, due to their similar class balances.

Table A.1: Secondary Frequency f1

Model            Informedness     ROCAUC           PRAUC
LGBM-reference   [0.781, 0.815]   [0.938, 0.955]   [0.976, 0.982]
IB-single        [0.723, 0.759]   [0.915, 0.932]   [0.967, 0.975]
IB-coverage      [0.785, 0.815]   [0.941, 0.957]   [0.975, 0.982]
IB-node          [0.787, 0.818]   [0.942, 0.958]   [0.975, 0.982]


Table A.2: Secondary Frequency f3

Model            Informedness     ROCAUC           PRAUC
LGBM-reference   [0.803, 0.826]   [0.962, 0.972]   [0.970, 0.976]
IB-single        [0.748, 0.779]   [0.932, 0.947]   [0.946, 0.960]
IB-coverage      [0.805, 0.827]   [0.966, 0.972]   [0.969, 0.977]
IB-node          [0.813, 0.834]   [0.968, 0.975]   [0.971, 0.978]

[Figure A.1: Performance CDFs, secondary frequency f1. Panels: distributions of ROCAUC, PRAUC and Informedness, each showing P(X ≤ x) for LGBM-baseline, IB-single, IB-coverage and IB-node.]


[Figure A.2: Performance CDFs, secondary frequency f3. Panels: distributions of ROCAUC, PRAUC and Informedness, each showing P(X ≤ x) for LGBM-baseline, IB-single, IB-coverage and IB-node.]


Appendix B

Evaluation Algorithm

Algorithm 1 Get Validation Results

Input: dataset, nodes, model, frequency, load criteria
result ← []
for node in nodes do
    files, fp, fs, c, ... ← read(dataset, node, frequency, load criteria)
    for partition in node do
        x, y, l ← load(files, fp, fs, partition)
        x, y, xtest, ytest, l ← SplitTest(x, y, ptest, l)
        Itrain, Ivalid ← RepeatStratifySplit(x, y, l)
        for (itrain, ivalid) ∈ (Itrain, Ivalid) do
            xtrain, ytrain ← x[itrain], y[itrain]
            xvalid ← x[i] for i ∈ ivalid
            model.train(xtrain, ytrain)            ▷ model: lgbm, ib
            prediction ← model.predict(xvalid)
            predictiontest ← model.predict(xtest)
            for (idx, p) ∈ prediction do
                m ← AssemblePredictions(y[ivalid[idx]], p, ct, ft)
                m ← AssemblePredictions(ytest[idx], predictiontest[idx], m)
                metrics["cv_results"].append(m)
            end for
        end for
        metrics["file_info"] ← AssembleMetrics(x, y, fp, fs, c, ...)
        result.append(metrics)
    end for
end for
save(result)


Algorithm 2 Evaluate Threshold

Input: data, *args, func
y ← data["labels"]
predictions ← data["prediction"]
threshold, data["valid"] ← func(y, predictions, *args)
data ← validation(y, predictions, data, threshold)
return data

Algorithm 3 CV Metrics Mean

Input: data, *args, nvalid
C ← data["file_info"]["x_cell"]
F ← data["file_info"]["x_freq"]
for c in C do
    for f in F do
        metrics_dict ← {ctest: c, ftest: f}
        for m in data do
            if m[ctest], m[ftest] = (c, f) and m["valid"] then
                metrics.append(m)
            end if
        end for
        metrics_dict["valid"] ← length(metrics) ≥ nvalid
        mean_dict ← {}
        for metric in args do
            for m in metrics do
                mean_dict["metric"] ← mean(m["cv_metrics"]["metric"])
            end for
        end for
        metrics_list.append(metrics_dict)
    end for
end for
data["cv_results"] ← metrics_list
return data


www.kth.se
TRITA-EECS-EX-2020:710