
IT 17 089

Degree project 30 credits, January 2018

Classification of offensive game-emblem drawings using CNN (convolutional neural networks) and transfer learning

John Tunell

Department of Information Technology


Abstract

Classification of offensive game-emblem drawings using CNN (convolutional neural networks) and transfer learning

John Tunell

Convolutional neural networks (CNNs) have become an important tool for solving many of today's computer vision tasks. The technique is, however, costly, and training a network from scratch requires both a large dataset and adequate hardware. A solution to these shortcomings is to instead use a pre-trained network, an approach called transfer learning. Several studies have shown promising results applying transfer learning, but the technique requires further study. This thesis explores the capabilities of transfer learning when applied to the task of filtering out offensive cartoon drawings in the game Battlefield 1. GoogLeNet was pre-trained on ImageNet, and then the last layers were fine-tuned towards the target task and domain. The model achieved an accuracy of 96.71% when evaluated on the binary classification task of predicting non-offensive or swastika/penis content in Battlefield "emblems". The results indicate that a CNN trained on ImageNet is applicable even when the target domain is very different from the pre-trained network's domain.

Printed by: Reprocentralen ITC
IT 17 089
Examiner: Mats Daniels
Subject reviewer: Anders Brun
Supervisor: Håkan Rosenborn


Acknowledgement

I would like to thank my supervisor, Håkan Rosenborn, for all the advice and guidance given throughout the project. He openly shared his experience from a career as a software developer, which I am very grateful for. The feedback and teaching sessions have given me valuable preparation for a career as a software developer. Our lunch-break jogs both improved my fitness and gave me insights into the software development process. I will also miss working with all the other friendly co-workers at Uprise. I also want to thank my reviewer, Anders Brun. The meetings we had helped me stay focused on the research task and made sure I was going in the right direction.


Contents

1 Introduction
  1.1 Thesis Structure

2 Background
  2.1 Definitions and terminology
  2.2 Feedforward Networks
  2.3 Datasets
    2.3.1 Training set
    2.3.2 Validation set
    2.3.3 Test set
    2.3.4 Test set distribution
  2.4 Capacity, Overfitting and Underfitting
  2.5 Convolutional Neural Networks
    2.5.1 Convolutional layer
    2.5.2 Pooling layer
    2.5.3 Fully connected layer

3 Related work
  3.1 Transfer learning
  3.2 Research exploring transfer learning
  3.3 Research applying transfer learning
  3.4 GoogLeNet
    3.4.1 The inception module

4 Method
  4.1 Emblems in Battlefield
    4.1.1 How players create and use emblems in Battlefield 1 and Battlefield 4
    4.1.2 How offensive emblems are handled in the Battlefield games
  4.2 Method to approach the problem
    4.2.1 Step 1 - Determine goals and measurements
    4.2.2 Step 2 - Establish working end-to-end baseline model
    4.2.3 Step 3 - Determine bottlenecks in performance
    4.2.4 Step 4 - Repeatedly make incremental changes
  4.3 Additional guidelines when applying machine learning
    4.3.1 The process of knowing what to do next
    4.3.2 Create a common data warehouse
    4.3.3 Determine human-level performance on the task
    4.3.4 Plot performance on increasing dataset size and visualize worst errors

5 Experimental setup
  5.1 Software and hardware used during experiments
  5.2 Preprocessing
    5.2.1 Dataset augmentation
    5.2.2 Contrast normalization
  5.3 Dataset generation
  5.4 Feature extraction
  5.5 Machine learning framework - TensorFlow

6 Results
  6.1 Results iteration 1
    6.1.1 Step 1 - Determine goals and measurements
    6.1.2 Dataset extraction
    6.1.3 Step 2 - Establish working end-to-end pipeline and baseline model
    6.1.4 Performance benchmarks for first model
    6.1.5 Step 3 - Determine bottlenecks in performance
  6.2 Results iteration 2
    6.2.1 Step 4 - Repeatedly make incremental changes
    6.2.2 Performance benchmarks for the second model
  6.3 Results iteration 3
    6.3.1 Data augmentation experiments
    6.3.2 Final performance comparison between all models
    6.3.3 Performance on production test set

7 Discussion
  7.1 Future work

8 Conclusion

Bibliography


1 Introduction

Computer vision and object classification have in the last couple of years been dramatically improved by advances in deep learning and convolutional neural networks (CNNs) [18]. In the 2012 ImageNet competition, the most reputable competition within computer vision, a group of researchers from the University of Toronto entered with a deep CNN algorithm called SuperVision. The team won the competition with an error rate of 16.4 percent, while the second best entry had an error rate of 26.2 percent [34]. The results were revolutionary, and the advances in computer vision driven by CNNs were acknowledged as one of the top 10 breakthroughs of 2013 [26].

A CNN's main power lies in its deep structure, which allows the network to create discriminating features that increase in level of abstraction with each layer [33, 36, 9, 32]. Advances in hardware, larger datasets and more complex models are key factors behind the recent success of CNNs. Further advances in the field are, however, not driven solely by increasing complexity. GoogLeNet, Google's winning submission to ImageNet 2014, used 12 times fewer parameters and achieved significantly more accurate results than previous winners [17]. Recent research has started to investigate not only ways to improve the performance of CNNs measured in error rate, but also their performance measured in cost-effectiveness. For models to be put to real-world use, metrics like computational budget, memory consumption and dataset size requirements need to be considered [33].

Training a deep CNN from scratch can be both costly and complicated [10]. First, a large labeled dataset is required for the training. In many domains, the amount of labeled data is limited, and collecting such a dataset might require experts to annotate images. Second, deep CNN training requires extensive memory and computational resources; lacking adequate hardware makes the training process extremely time-consuming. Lastly, to avoid overfitting and ensure convergence, the training process needs to be repeated iteratively, trying out different parameters in the model [34]. This requires experience and makes the process even more time-consuming.

To lower the cost of training CNNs, a promising alternative has emerged through research. Instead of training a CNN from scratch, a CNN is used that has already been trained on an existing large dataset from another domain. The CNN is then fine-tuned towards the target domain or task. This concept is called transfer learning. Using transfer learning, computer vision researchers have been able to significantly improve upon state-of-the-art performance on computer vision tasks within a large set of domains [30, 3, 24]. Yosinski et al. [35] emphasize the importance of further studies on the exact nature and extent to which transfer learning can be applied.

This thesis project is part of a computer vision internship at the game studio Uprise. Uprise is a sister studio to DICE, and owned by the global video game company Electronic Arts (EA). In the game Battlefield 1, players can draw what Uprise calls "emblems". Emblems are images that users can bind to their profile and also display on their weapons and vehicles inside the game. Emblems are not allowed to contain offensive material. If an emblem does, players can report it, and EA must handle reported emblems in due time, often manually. Uprise would like to improve this process. During this thesis project, deep learning methods are evaluated on the task of filtering out these offensive emblems.

Problem formulation

This report sets out to answer the following problem formulation:

How well does a CNN perform on the task of classifying offensive drawings, created by players of the game Battlefield 1, when pre-trained on ImageNet and fine-tuned on the target dataset?


1.1 Thesis Structure

Chapter 2

In the background chapter, essential machine learning concepts are introduced. Dataset partitioning strategies and the multilayer perceptron topology are explained. Furthermore, the chapter gives an introduction to convolutional neural networks.

Chapter 3

The related work chapter summarizes the body of research that has been done on transfer learning and CNNs. The chapter ends by describing the GoogLeNet architecture and its inception module.

Chapter 4

The Battlefield emblem system is explained in the method chapter, along with the guidelines used to approach the machine learning problem.

Chapter 5

The dataset preprocessing and augmentation techniques are described in the experimental setup chapter. The chapter also explains how the CNN was used as a feature extractor, and ends with a short introduction to the machine learning framework TensorFlow.

Chapter 6

The thesis work was divided into three iterations, each given its own section in the results chapter, along with performance benchmarks as the thesis work proceeded. The results are analyzed and discussed alongside the presentation of the benchmarks, to make the reasoning easier to follow.

Chapter 7

The results are further discussed in the discussion chapter, which also covers future work.

Chapter 8

The final conclusion is given in the conclusion chapter.


2 Background

The thesis work relies heavily on research within the fields of machine learning, deep learning, and convolutional neural networks. This chapter presents an introduction to the terminology and concepts used during the study.

2.1 Definitions and terminology

This section introduces terminology that will be used in the thesis. The task of detecting spam and non-spam emails will be used to illustrate the definitions. This part is based on the definitions presented by Mohri et al. [20].

• Examples - The instances in the dataset, usually the rows in a matrix or database. In our spam detection problem, an email would correspond to an example in our dataset. Examples are used to train and evaluate the model [20].

• Features - The set of attributes associated with an example [20]. The attributes are often represented as a vector, which corresponds to the columns in a matrix or database. The name of the sender, the presence of certain keywords in the message, the message length etc. would be considered features in the email example.

• Labels - The category or class value assigned to examples. An example email would have either the label spam or non-spam. When predicting a discrete value, the task is called classification. If the target value is continuous, the task is instead called regression.

• Hyperparameters - A model's configuration parameters, for example the number of iterations the model trains on the dataset, or the learning rate. Hyperparameters are not to be confused with the parameters of a neural network, also called weights, which are learned through backpropagation.


2.2 Feedforward Networks

The most essential parts of a deep learning model are the feedforward neural networks, also called multilayer perceptrons (MLPs) or artificial neural networks. Neural networks are inspired by the brain's information processing network, built up of neurons connected to each other in a large signaling network. Every neuron has multiple incoming connections. When a neuron receives incoming inputs, it sums them up, and if the value exceeds a given threshold, it fires. The signal is then passed on through connections to other neurons. Neural networks try to model this behavior. Figure 2.1 illustrates a single perceptron.

Fig. 2.1: Perceptron topology, illustration modified from Danilo Bargen [4]

The first layer is called the input layer and is often a vector of values, called a feature vector. Each input value x is multiplied with a weight w. The weight is also called a parameter and is often depicted with the symbol θ. A bias term is often introduced as x0 and w0, and acts as a threshold value for the activation. The multiplied inputs are summed into a single value, which is then passed through an activation function f that produces an output. There are many types of activation functions. One of the simplest is the step function, which outputs a 1 if the input is higher than a given threshold, and 0 if it is below.
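As a concrete illustration of the forward pass just described, the following minimal NumPy sketch implements a single perceptron with a step activation; the input and weight values are made up purely for illustration.

```python
import numpy as np

def perceptron(x, w, bias, threshold=0.0):
    """Weighted sum of the inputs plus a bias, followed by a step activation."""
    z = np.dot(w, x) + bias
    return 1 if z > threshold else 0

# Hypothetical values: two inputs with hand-picked weights.
x = np.array([0.5, -1.0])
w = np.array([0.8, 0.2])
print(perceptron(x, w, bias=0.1))  # 0.5*0.8 - 1.0*0.2 + 0.1 = 0.3 > 0, so output 1
```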

To be able to produce functions more complex than linear ones, the model needs to be applied not only to x, but to a non-linear transformation of x. This can be seen as creating a new representation of x, made up by the network. This is done by adding hidden layers, which produce the new feature representation that helps the model find mappings that achieve the desired output. Figure 2.2 illustrates the topology of a multilayer perceptron (MLP). In this figure, each input node is connected to every neuron, each with its own weight. The layers between the input layer and the final output layer are called hidden layers. Just as in the perceptron example, the output layer takes a feature vector as input, but in this case the input values have been transformed by the hidden layer rather than being raw data. The network then outputs a value based on the threshold and activation function.

Fig. 2.2: Multilayer perceptron topology, illustration modified from Satvik Beri [5]

The objective of a feedforward network or multilayer perceptron is to find a hypothesis function h for a function f [14]. When solving a classification problem, we produce a hypothesis function that, given an example with a feature vector x, outputs a class label prediction ŷ. The predicted label ŷ should be as close as possible to the ground truth class label y. A feedforward network finds the values for the parameters θ that result in the best approximation of the function. The network iteratively tunes the parameters to make the hypothesis function h, parameterized by θ, as similar as possible to the target function f [14]:

$\hat{y} = h_\theta(x)$

The flow of information goes from an input x, through intermediate computations that are used to define hθ, ending up in an output y. There are no backward connections between neurons; the features found by intermediate layers are strictly passed forward. This is why these models are called feedforward networks. In a feedforward network, a chain of functions is composed together, often represented by a directed acyclic graph (DAG), as has been shown in the previous figures. This is why we call these models networks [14]. A simple feedforward network example would be a network composed of three functions $f^{(1)}$, $f^{(2)}$ and $f^{(3)}$, connected together in a chain forming the complete function

$h_\theta(x) = f^{(3)}_\theta\left(f^{(2)}_\theta\left(f^{(1)}_\theta(x)\right)\right)$ [14].

In this example we would call $f^{(1)}$ the first layer, $f^{(2)}$ the second layer and $f^{(3)}$ the final layer or the output layer. The length of the chain is called the network's depth, making this a three-layer-deep network. This is also where the term "deep learning" comes from, as the networks are composed of many layers, creating a deep network.

When we say that we "train" the network, we try to drive hθ(x) to match the target function f(x) [14]. The data in our dataset provides us with noisy and approximate examples of f(x), evaluated at different training points [14]. Every example x is associated with a ground truth label y. The dataset specifies what the last output layer needs to produce given the input x. What the layers in between should output is what the learning algorithm will learn. The learning algorithm tunes these layers by changing the weights θ to best implement an approximation of f(x). This is done through a technique called backpropagation. By propagating the mistakes backwards, tuning the weights to accomplish a better fit to the target function, we improve and learn a better function approximation hθ.

By comparing the network's output from the hypothesis function, ŷ, to the correct value y, we can estimate a distance between the guess and the correct answer. This comparison is done through a cost function J, also called a loss function. The goal of the algorithm is then to minimize the function J(θ) by tuning the weights of the network to produce the desired output. Mean squared error is one of many cost functions, illustrated in the equation below, where m is the number of training examples in the dataset:

$\mathrm{MSE} = J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x_i) - y_i\right)^2$

By calculating the partial derivative for each weight in the network, we can find the direction in which each weight needs to be adjusted in order to produce an output closer to the target function output. Gradient descent is one of the most common algorithms used to accomplish this within neural networks; Figure 2.3 illustrates the concept. The step size of the weight change is determined by the gradient and a hyperparameter α called the learning rate.

Fig. 2.3: Gradient descent, illustration modified from Sebastian Raschka [25]

Setting a high learning rate makes the gradient descent algorithm take larger steps, and a low learning rate makes the steps smaller. The algorithm is repeated until convergence. In the equation below, the symbol := means that the left-hand side is updated with the value calculated on the right-hand side:

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$
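The update rule above can be made concrete with a short sketch. The snippet below runs batch gradient descent on the MSE cost for a linear hypothesis; a linear model is assumed here purely to keep the gradient computation to one line, and the data matrix X and labels y are placeholders.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iterations=1000):
    """Minimize J(theta) = 1/(2m) * sum((X @ theta - y)^2) by gradient descent."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        error = X @ theta - y             # h_theta(x_i) - y_i for every example
        gradient = (X.T @ error) / m      # partial derivative of J w.r.t. each theta_j
        theta -= alpha * gradient         # step scaled by the learning rate alpha
    return theta
```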


2.3 Datasets

In most machine learning algorithms, the dataset is divided into three subsets: a training set, a validation set and a test set. For classification tasks, every example in the dataset contains a number of features, x, and a target value y. Figure 2.4 gives an overview of how datasets are normally split.

Fig. 2.4: Dataset partitioning

2.3.1 Training set

The training set is used during training to tune the parameters, or weights θ, of the model. In most applications, this is the largest of the three subsets.

2.3.2 Validation set

The validation set is used during training to find the best hyperparameters for the model. The set is used to make an intermediate estimate of how well the model would perform on data it has not trained on, to avoid overfitting to the training set. The performance on the training data is called the model's training error.

2.3.3 Test set

To evaluate the model's performance on completely unseen data, a test set is used. This evaluation is done when the model has finished training and has its hyperparameters and weights set. The performance on the test set shows how well the model generalizes, often measured as the test error or generalization error. All performance benchmarks are generated on the test set.
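A minimal sketch of the partitioning described in this section, assuming an 80/10/10 split (the exact ratios are a choice, not a requirement); shuffling before splitting matches the i.i.d. assumptions discussed next.

```python
import numpy as np

def split_dataset(X, y, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle the examples, then cut them into training, validation and test sets."""
    idx = np.random.RandomState(seed).permutation(len(X))
    n_train = int(train_frac * len(X))
    n_val = int(val_frac * len(X))
    train, val, test = np.split(idx, [n_train, n_train + n_val])
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])
```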


2.3.4 Test set distribution

When the dataset is divided into a training and a test set, a couple of assumptions need to be made. Firstly, the test and training sets are assumed to have identical distributions, as they are drawn at random from the same distribution. Secondly, we assume that all examples in our dataset are independent of each other. These are called the i.i.d. assumptions (independent and identically distributed).

2.4 Capacity, Overfitting and Underfitting

The main challenge when constructing a machine learning algorithm is to create a model that performs well not only on the training data, but also on new, unseen data [14]. The following section describes key concepts regarding this challenge.

A model's ability to fit the training set is called the model's capacity. Conceptually, it is the amount of freedom the model is given to calibrate itself towards the data presented. This could for example be the number of iterations the model gets to train on the data, or the number of parameters in the model.

A model with low capacity might struggle to fit the training data; this is called underfitting. On the other hand, a model with high capacity might become too specialized on the training set, essentially memorizing the output given a certain input, and will then struggle on unseen data; this is called overfitting. Figure 2.5 illustrates the difference by showing a model that tries to fit a line to an example dataset.

Fig. 2.5: Illustrative example of overfitting, underfitting and optimal capacity. Illustration modified from Amar Gondaliya [13]


2.5 Convolutional Neural Networks

2.5.1 Convolutional layer

One of the most important layers in a convolutional network is the convolutional layer. The layer takes two arguments: input data and a kernel. The input data can be either the original image or the feature map of a previous layer. The output of a convolutional layer is called a feature map. The kernel is usually a square matrix, which slides over the image and "filters" it for features.

At each position, the kernel's weights are multiplied element-wise with the pixel values within the kernel's receptive field. The values are then summed into a single value, outputting the activation at that spatial location. Figure 2.6 illustrates the process. If a specific feature is present in the input, the activation will be high. In the first layer of a CNN, the weights of the kernels often come to act as edge detectors, finding the presence of vertical and horizontal lines in the image. By convolving the image with a set of filters, a stack of filtered images is sent to the next layer.

Fig. 2.6: Illustration displaying the convolution operation [14]

How far the kernel is moved at every step is called the kernel's stride. A stride of one corresponds to moving the kernel one pixel at each step. The region within the focus of the kernel is called the kernel's receptive field. The weights in the kernel are the same at every position in the image, which is called weight sharing or weight tying. The stride of the kernel affects the size of the output feature map; a high stride shrinks it. Examples are presented in Figure 2.7 and Figure 2.8 to illustrate the effect of the stride hyperparameter. Figure 2.7 displays an input image of size 7 × 7, with a colored square showing the 3 × 3 kernel. With a stride of one, the kernel is moved until it hits or would move past a border, resulting in an output feature map of 5 × 5. Figure 2.9 shows an input image of the same image and kernel size, but with a stride of two. With a stride of two, the kernel can only be moved three times along a row before it has to be moved down. Figure 2.10 shows the resulting 3 × 3 feature map, with a sample of three activation outputs.

Fig. 2.7: A 7 × 7 image with a 3 × 3 kernel and a stride of one [7]
Fig. 2.8: The 5 × 5 output feature map [7]
Fig. 2.9: A 7 × 7 image with a 3 × 3 kernel and a stride of two [7]
Fig. 2.10: The 3 × 3 output feature map [7]

The final parameter that can be set in a convolutional layer is the number of zeroes added to all borders of the image, called zero padding or simply padding. Figure 2.11 shows a padding of two applied to a 32 × 32 × 3 image, resulting in a 36 × 36 × 3 image. Padding is used to preserve the size of the image during convolutions: without padding, there is no input for the kernel outside the edges, so the output shrinks. This dimensionality reduction can be avoided with padding.

Fig. 2.11: A 32 × 32 image with a padding of two [7]
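A naive NumPy sketch tying the stride and padding hyperparameters together; it reproduces the feature map sizes from Figures 2.7-2.10, and the output size follows the usual formula (W − K + 2P)/S + 1.

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Slide the kernel over the image: element-wise multiply, then sum."""
    if padding:
        image = np.pad(image, padding)        # zero padding on all borders
    k = kernel.shape[0]
    out = (image.shape[0] - k) // stride + 1  # output size: (W - K + 2P) / S + 1
    fmap = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            patch = image[i*stride:i*stride+k, j*stride:j*stride+k]
            fmap[i, j] = np.sum(patch * kernel)
    return fmap

img = np.arange(49, dtype=float).reshape(7, 7)   # a 7x7 "image"
kern = np.ones((3, 3))                           # a 3x3 kernel
print(conv2d(img, kern).shape)                   # (5, 5), as in Figure 2.8
print(conv2d(img, kern, stride=2).shape)         # (3, 3), as in Figure 2.10
```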


2.5.2 Pooling layer

After the convolutional layer, a pooling layer [14] is often applied. The layer is sometimes called a downsampling layer, emphasizing its objective of decreasing the size of the image or feature map. A common pooling layer is the max pooling layer, in which the highest value in the kernel's receptive field becomes the output of the operation. Figure 2.12 illustrates the pooling process of a 2 × 2 max pooling kernel with a stride of two, slid across a 4 × 4 feature map. By sliding the kernel over the feature map, we both reduce the size of the feature map by summarizing "boxes" of it, and at the same time become less sensitive to the exact spatial location of a feature, while its location relative to other features is still retained. In the max pooling operation, it does not matter where in the receptive field the highest value is positioned.

Fig. 2.12: Image displaying the output of a 2 × 2 maxpool kernel, with a stride of two [7]
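The max pooling operation is simple enough to sketch directly; the 4 × 4 feature map below is a made-up example matching the 2 × 2 kernel and stride of two from Figure 2.12.

```python
import numpy as np

def maxpool(fmap, size=2, stride=2):
    """Keep only the highest activation in each receptive field."""
    out = (fmap.shape[0] - size) // stride + 1
    pooled = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            pooled[i, j] = fmap[i*stride:i*stride+size,
                                j*stride:j*stride+size].max()
    return pooled

fmap = np.array([[1., 3., 2., 1.],
                 [4., 6., 5., 0.],
                 [3., 1., 9., 8.],
                 [2., 0., 7., 4.]])
print(maxpool(fmap))   # [[6. 5.] [3. 9.]] -- the 4x4 map reduced to 2x2
```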

2.5.3 Fully connected layer

The last layer of a convolutional neural network has the role of finding the connections between features and classes, and is called a fully connected layer (FCL). All the neurons in this layer are connected to all the neurons in the previous layer, much like a hidden layer in a multilayer perceptron. By the end of the network, the features generated by the previous convolutions have reached a level of abstraction where the representations can take the form of hand detectors, feet detectors, cat detectors etc. The fully connected layer has the same number of neurons as there are classes, and outputs a vector representing the activations for each class. The role of the FCL is to find mappings between the activations and a certain class. This mapping is learned through forward passes and backpropagation, described in the previous feedforward network section. The layer before the final output layer produces the final feature map that will be used for classification. This layer is sometimes called the bottleneck, and the feature maps used as input to the final output layer are called bottlenecks. A common activation function for the final output layer is the softmax activation function. The function is used for multi-class classification and is a generalization of logistic regression. A vector is produced as output, where each element represents a class and the probability that an example belongs to that class. The softmax output vector always sums to one.
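A short sketch of the softmax function described above; subtracting the maximum logit before exponentiating is a standard numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(logits):
    """Turn a vector of class activations into probabilities that sum to one."""
    exps = np.exp(logits - np.max(logits))  # shift for numerical stability
    return exps / np.sum(exps)

print(softmax(np.array([2.0, 1.0, 0.1])))  # ~[0.66 0.24 0.10], sums to 1
```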


3 Related work

The chapter begins with an introduction to the concept of transfer learning, followed by research exploring transfer learning as a theoretical concept. The second part summarizes research on transfer learning applied to real-world problems. In the last section, the GoogLeNet architecture is introduced.

3.1 Transfer learning

Transfer learning has in the last couple of years become a viable and common solution when applying machine learning to real-world problems. As stated in the introduction, the motivation is often that training a network from scratch is too costly.

An assumption that often has to be made in machine learning is that the training dataset and future data have the same distribution and feature space [23]. When dealing with real-world problems, this assumption is not always true. The dataset available might be small, and labeling more data can prove expensive. On the other hand, we might have sufficient training data in another domain, with a different feature space and distribution. To avoid the expensive operation of growing the target dataset, we want to transfer the knowledge learned in the other domain to the domain of our dataset. This is called transfer learning, and it has proven highly successful in recent years [23].
in recent years[23].

3.2 Research exploring transfer learning

Donahue et al. [8] explored how well a pre-trained state-of-the-art CNN generalizes to classification on images drawn from other domains. They took a state-of-the-art model, trained it on ImageNet, and then retrained the last layers on new datasets and tasks. The researchers investigated three datasets: the SUN-397 dataset, containing scenes like a dinner or a mosque; an office dataset containing office-product images; and the Caltech-UCSD bird dataset. Their results showed that the generality and semantic knowledge learned in the pre-trained network tend to cluster images into semantic categories that the network was never explicitly trained on. Their results were among the best ever attained on the used datasets. The model had been trained on the task of object recognition, but was also tested on scene recognition, a completely different task. The model performed surprisingly well, and beat state-of-the-art accuracy by 2.9%.

Girshick et al. [12] propose an object detection algorithm that significantly improves on previous results on PASCAL VOC 2012. Their research builds on two insights. The first is to localize and segment objects into regions using bottom-up region proposals, and then apply state-of-the-art convolutional networks to these regions. The second is that it is highly effective to pre-train a CNN on an auxiliary task with large quantities of data and then fine-tune the network for the target task. They conclude that transfer learning is likely to be highly effective for a wide variety of computer vision problems where data is scarce.

Tajbakhsh et al. [34] set out to answer the following research question: Can the use of pre-trained deep CNNs, with sufficient fine-tuning, eliminate the need for training a deep CNN from scratch? Their experiments consistently demonstrated the following properties: 1) a pre-trained CNN with enough fine-tuning seems to outperform, or in the worst case perform on par with, CNNs trained from scratch; 2) a CNN trained using fine-tuning proves more robust to different training set sizes than a CNN trained from scratch; 3) neither tuning all layers, called deep tuning, nor tuning just the last layer, called shallow tuning, gave the best results; 4) the best performance was achieved by layer-wise fine-tuning, iteratively finding the optimal number of layers to have their weights fine-tuned during training.

Sinno Jialin Pan and Qiang Yang [23] conducted a survey study on transfer learning, in which they categorize and review the current progress in the field. The survey also focuses on defining the relationship between transfer learning and other related machine learning techniques. It concludes that most research shows that transferability is, to a large degree, related to how similar the source and target domains or tasks are. We still lack a similarity measure that defines the distance between domains or tasks, which is suggested for future research. The survey also covers what is called "negative transfer", when the transferred knowledge actually decreases model performance, which is also tightly coupled to the similarity between the source and target domains.

In the paper "How transferable are features in deep neural networks?", Yosinski et al. [35] experimentally try to quantify the generality versus specificity of the neurons in each layer of a CNN. A phenomenon observed across many CNNs is that the first layer often learns features for edge detection. This suggests that these features are somewhat general, in that they are not only useful for the current dataset and task. With each layer, the network needs to become more specialized towards the domain of the dataset and task, transitioning from general to specific. Yosinski et al. found two distinct issues with a negative impact on transferability. The first was that performance on the target task was negatively affected by the higher-level neurons' specialization towards their original task, which could be expected. The second was that splitting networks between co-adapted neurons created optimization difficulties. Either of these issues may dominate, depending on how many of the layers are "frozen" during retraining and fine-tuning towards the target domain and task. In line with previous results, the paper also shows that the transferability of features decreases with the distance between the base and target tasks.

3.3 Research applying transfer learning

Saito and Matsui [28] highlight, in their paper on semantic vector representations of illustrations, the fact that many studies have examined CNN performance on natural images, while research focusing on illustrations is lacking. According to the authors, this is because of two technical issues. The first is the difficulty of recognizing illustrations, owing to their diversity of visual elements: eye sizes, shapes of faces and bodies etc. vary a lot, not only between different artists but also between drawings by the same artist. The second issue is the lack of large open-source datasets of illustrations. Large-scale annotated datasets like ImageNet are one of the driving factors behind the rapid development within image recognition; such a dataset for illustrations does not exist.

Esteva et al. [11] researched, in their paper "Dermatologist-level classification of skin cancer with deep neural networks", the use of transfer learning in the context of dermatology. The study was very well received, and the researchers were able to produce a model that could classify skin lesion images with the accuracy of a board-certified dermatologist. They used GoogLeNet, pre-trained on ImageNet, and simply retrained it on their target dataset. An important note is that Esteva et al. had a large dataset of 129,450 clinical images. They used an interesting method of building a topological tree structure, where they summarized the probabilities of each root node's children to produce the classification. The classifier matched the performance of professional dermatologists across critical diagnostic tasks for skin cancer, and is deployable on mobile devices. Several research papers within the field of medicine describe the effectiveness of feature extraction using pre-trained CNNs [31, 27, 15].


Al-Shabi et al. [29] propose an adult-image recognition system that uses a mixture of CNNs. The most popular method of blocking access to websites presenting adult content is to search the site for restricted words, and more traditional image-based methods have focused on handcrafting features of adult images, like different positions and shapes. In contrast to these more traditional methods of adult-content detection, their system is an end-to-end machine learning model. The researchers manually collected 41,154 adult images off the internet, and used the ILSVRC-2013 dataset as non-adult images. An ensemble of CNN classifiers was used, with each classifier's prediction weighted by its performance on the test set. The final model yielded an impressive accuracy of over 96%.

Moustafa [21] also explored the use of deep learning for classifying pornographic images. One difference from Al-Shabi et al. is that Moustafa used AlexNet and GoogLeNet as feature extractors, using the output from the last convolutional layer. This allows the last-layer classifier to be replaced with any kind of classifier, e.g. a Support Vector Machine (SVM). The effect is a model that requires much less data to train, because it has fewer parameters that need to be adapted. By combining the predictions from both AlexNet and GoogLeNet into an ensemble with different last-layer classifiers, the author observed a significant increase in test set performance. The predictions from each classifier were weighted by the classifier's performance during testing. In a study made by Zhou et al. [37], results showed that an ensemble of CNNs can produce state-of-the-art results on pornographic image classification. According to Zhou et al., a common technique for categorizing images as pornographic is based on image retrieval technology: a large image database with vast amounts of pornographic and normal content is first created, the image to be classified is then used as a query and compared with images in the database, and the classes of the retrieved results determine the class of the input image. The problem with this method is that, due to the high variety of adult images, it has proven difficult to build a database that covers a large enough set of images.
a large enough set of images.

Several studies in the last year have shown the effectiveness of applying CNN ensemble classifiers and transfer learning to real-world problems. Huynh et al. [16] achieved state-of-the-art results on digital mammographic tumor classification by using transfer learning combined with an ensemble of classifiers. Akcay et al. [1] applied transfer learning to the problem of x-ray baggage image classification; their model achieved 98.92% detection accuracy, outperforming previous work in the field. In the study "Transfer Learning with Convolutional Neural Networks for Classification of Abdominal Ultrasound Images", Cheng and Malhi [6] evaluated the use of CNNs and transfer learning in the field of abdominal ultrasound images. Their results show that their CNN model achieved a classification accuracy slightly surpassing that of human radiologists.


3.4 GoogLeNet

In the ILSVRC14 competition, Google competed and won with a CNN model called GoogLeNet [33]. The model had a top-five error rate of 6.7%, pushing the state of the art. The revolutionary part of the algorithm was its architecture: having only 22 layers, GoogLeNet uses twelve times fewer parameters than AlexNet, breaking the trend of ever larger CNN architectures. The size of a model has a huge impact on memory consumption, which engineers at Google realized would become a bottleneck when applying CNNs to real-world applications. Very large models might produce better results measured in accuracy, but can never be deployed on, for example, a mobile device. Figure 3.1 shows the complete network architecture.

Fig. 3.1: GoogLeNet CNN architecture. Illustration taken from the research paper "Going Deeper with Convolutions" [33]

3.4.1 The inception module

To achieve this more memory-efficient model, researchers at Google came up with a module they call "Inception". The module architecture is shown in Figure 3.2. The architecture makes use of a technique called Network-in-Network (NIN), presented in the paper "Network In Network" by Lin et al. [19]. Instead of applying a linear operation in the convolutional layer, a multilayer perceptron is used to capture the feature concepts in the input. The use of an MLP has been shown to do a better job of extracting features at each spatial location [19]. Figure 3.3 illustrates the Network-in-Network concept.

The 1×1 layers can be used to reduce a feature map of size 512×512×80 to a map of 512×512×40 by applying 40 filters in the 1×1 convolution. The 1×1 convolutional layers displayed in the Inception module illustration are NIN layers, placed before the more computationally expensive 3×3 and 5×5 convolutions to reduce the dimensionality of the input.
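The saving can be illustrated with rough multiplication counts on the 512 × 512 × 80 feature map from the example above; the 64 output channels of the 5 × 5 convolution are an assumed figure, chosen only to make the comparison concrete.

```python
# Multiplications for a 5x5 convolution producing 64 channels on a
# 512x512x80 feature map, with and without a 1x1 reduction to 40 channels.
H = W = 512
direct = H * W * 64 * (5 * 5 * 80)        # 5x5 conv straight on 80 channels
reduce_step = H * W * 40 * (1 * 1 * 80)   # 1x1 conv: 80 -> 40 channels
conv_step = H * W * 64 * (5 * 5 * 40)     # 5x5 conv on the reduced map
print(direct / (reduce_step + conv_step)) # ~1.9x fewer multiplications
```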


Fig. 3.2: Inception module illustration [33]

Instead of having only a single convolution, the Inception module has a composition of convolutions of differing sizes. This effectively lets the model "choose" whether to use a 5×5, a 3×3 or a 1×1 convolution etc. at multiple layers. This keeps down the total number of parameters in the model while performing better than if the layer had just a single convolution.

Fig. 3.3: Figure illustrating the difference between a normal linear convolution layer and an MLPconv layer. Illustration taken from the paper "Network In Network" [19]


4 Method

This chapter begins by presenting what Battlefield emblems are, how they are created, and how they are reported. The next section introduces a workflow for machine learning problem solving suggested by researchers within the field. The guidelines introduced were applied during the process of producing the thesis results.

4.1 Emblems in Battlefield

4.1.1 How players create and use emblems in Battlefield 1 and Battlefield 4

Uprise is an Electronic Arts studio located in Uppsala. The studio's main responsibility is the online platform and user interface surrounding the games, where players socialize, join games, buy merchandise etc.

One of the features provided on this platform is the possibility to create your own "emblem". An emblem is an image that is associated with a player profile and also displayed in the game, on weapons and vehicles. A player can either choose to import an already existing emblem from another player, or create their own. In the platform, players are presented with a web editor where they can draw their own emblem. Unlike common painting tools like Paint, players are not given a brush, but instead a list of 105 symbols from which to compose their emblem. The size, color and orientation of each symbol can be adjusted by the player. The editor also has a layer structure; a symbol can be put behind or in front of another symbol. When a player submits their emblem, it is stored as a PNG, and no check is made whether the emblem already exists in the database. The PNG is then exposed through a unique URL. Figure 4.1 shows a screenshot of the web editor in Battlefield companion.

Fig. 4.1: Screenshot captured from the Battlefield companion emblem editor

Fan-based web pages like http://emblemsbf.com/ provide galleries where players can share their emblem creations. This results in significant reuse of certain well-crafted emblems. Note that Battlefield companion does not allow players to import images that have not been created through the Battlefield web editor.

4.1.2 How offensive emblems are handled in the Battlefield games

Players can report another player's emblem if they find it offensive. Reported emblems are sent to the customer service department at Electronic Arts, where employees decide whether a reported emblem should be banned or not. If the reported emblem is banned by customer service, it is flagged as "hidden" in Battlefield's emblem database. No additional metadata is currently stored except the date of the change.


4.2 Method to approach the problem

The problem was approached according to guidelines presented in the book "Deep Learning" [14], written by Goodfellow, Bengio and Courville, and by the machine learning researcher Andrew Ng [22].

4.2.1 Step 1 - Determine goals and measurements

The first step in applying machine learning to a problem is to determine the goals of the project, what metrics to use, and the target values the project should satisfy [14].

4.2.2 Step 2 - Establish working end-to-end baseline model

The next step is to establish a working end-to-end pipeline for the machine learning task and measure the performance of a first baseline model.

4.2.3 Step 3 - Determine bottlenecks in performance

According to Goodfellow et al. [14], the following questions are of great importance when trying to determine bottlenecks in performance:

• Is the model overfitting?
• Is the model underfitting?
• Are there defects in the dataset?
• Are there defects in the software?

4.2.4 Step 4 - Repeatedly make incremental changes

The last step when applying machine learning is to iteratively make changes to improve performance. The following tasks are often applied at this stage:

• Gather new data
• Adjust hyperparameters
• Change the algorithm if necessary


4.3 Additional guidelines when applying machine learning

4.3.1 The process of knowing what to do next

Andrew Ng argues that the process of applying deep learning in practice is still being researched, but presents a few guidelines from his experience. When deep learning is applied in practice, Ng argues, engineers often struggle to know what should be done next [22]. Ng presents a flow-chart approach to how resources are best spent in many situations, depending on performance benchmarks during training and testing.

Fig. 4.2: Flow chart displaying the process of applying deep learning. Illustration taken from "Nuts and Bolts of Applying Deep Learning" [22]

If the training error is high, called underfitting, the first thing to do is to make the model bigger. In this situation, the model is not able to capture the structure of the data and needs more freedom to adjust and fit. Training the model longer on the dataset should also be evaluated. If the previous approaches do not work, the model architecture might have to be changed [22]. If nothing works, the quality of the data might be the problem: the data could be too noisy, or not include the features needed to predict the output. The solution to this problem is to start over and collect cleaner data, or a dataset with a richer set of features.

If the error on the training set is low but the validation error is high, called overfitting, then the model is not generalizing. In most situations, the best option is to put effort into obtaining more data. Adding or increasing regularization measures, for example by decreasing the number of training epochs, can also improve performance during testing. If these measures do not increase the performance on the test set, a different model architecture might be the last option.


The development test set (dev test set) is used to produce intermediate performance results once a classifier has been trained using the training set and fine-tuned using the validation set. When the validation error is low but the error on the dev test set is still high, the best option is again to extract more data and to make sure that the data trained on is similar to the data the model is being tested on. Synthesizing data, for example by creating new rotated images or adding random noise, can be an option to increase the dataset size.

The production test set (prod test set) should be extracted from the target application domain and have a data distribution identical to the domain where the model will be run. The work is done when the performance on the final production test set is satisfactory.

4.3.2 Create a common data warehouse

Ng suggests that creating a common data warehouse for the project speeds up development, making sure that the latest dataset is always reachable by the engineers in the project.

4.3.3 Determine human-level performance on the task

Determining human-level performance on the task, measured in accuracy, gives an idea of where the theoretical limit of performance lies, often called the optimal error rate. A dataset containing images often has some examples that are so blurry or misleading that they simply cannot be labeled into a category with high confidence. Humans perform well on many of the tasks that are normally targeted with deep learning, leaving the gap between the optimal error and human-level performance relatively small. When iterating and improving the algorithm, it is easier to make progress while model performance is still below human level.

4.3.4 Plot performance on increasing dataset size and visualize worst errors

Running experiments using 1/8, 1/4, 1/2 etc. of the dataset gives insight into the expected performance gains if more data were extracted. A final tip presented by Ng is to visualize the model's worst errors. Looking at the incorrect classifications with the highest confidence can often reveal data that is incorrectly labeled, and will give a better understanding of which examples the model struggles with.
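A minimal sketch of this inspection step, assuming `probs` is an (N, classes) array of softmax outputs from a trained classifier and `labels` holds the true class indices (both names are illustrative):

```python
import numpy as np

def worst_errors(probs: np.ndarray, labels: np.ndarray, k: int = 20) -> np.ndarray:
    """Return indices of the k most confidently misclassified examples."""
    preds = probs.argmax(axis=1)
    wrong = np.flatnonzero(preds != labels)       # all misclassified examples
    confidence = probs[wrong, preds[wrong]]       # confidence in the wrong guess
    return wrong[np.argsort(-confidence)][:k]     # most confident mistakes first
```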


5 Experimental setup

The material and setup used to produce the experimental results are explained in this chapter. The preprocessing techniques and dataset augmentation procedures are described, together with the dataset generation method.

5.1 Software and hardware used during experiments

All experiments were conducted on the following hardware:

• Intel Xeon CPU E5-1650 v3 3.5GHz 12 vCPUs

• NVIDIA GeForce GTX 980, 2048 CUDA cores.

• 32GB RAM

The following software versions were used during classifier testing/training:

• Python 3.4

• TensorFlow 1.1.2

• Ubuntu 16.04


5.2 Preprocessing

Images need to have a standardized pixel range, for example the range [0,1] or

[0,255]. This is the only preprocessing that is strictly required when running images

through a CNN.
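As a minimal sketch of this step (assuming 8-bit input images):

```python
import numpy as np

def standardize(image_uint8: np.ndarray) -> np.ndarray:
    """Map raw 8-bit pixel values into the standardized range [0, 1]."""
    return image_uint8.astype(np.float32) / 255.0
```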

5.2.1 Dataset augmentation

More data can be produced by augmenting existing images, synthesizing additional data. Adding "noise" to images by rotating them, adding random brightness etc. are examples of augmentation techniques. The following table describes the distortions applied and experimented with during the thesis work. Figure 5.1 illustrates the rotation technique applied to some emblems.

Tab. 5.1: Data augmentations

Distortion type     Description
Random scale        Randomly scale the image by x%
Random crop         Randomly crop the image by x%
Random brightness   Randomly adjust the image brightness by x%
Rotation            Rotate in 20-degree steps up to 340 degrees, synthesizing 17 images

Fig. 5.1: Rotation augmentation example
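A sketch of the rotation scheme from Table 5.1, using Pillow as an assumed image library: rotating in 20-degree steps up to 340 degrees yields the 17 synthesized variants per emblem.

```python
from PIL import Image

def rotations(emblem: Image.Image) -> list:
    """Synthesize 17 rotated copies of an emblem (20, 40, ..., 340 degrees)."""
    return [emblem.rotate(angle) for angle in range(20, 360, 20)]
```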

5.2.2 Contrast normalization

The magnitude of the difference between the bright and the dark pixels in an image is called the image contrast. The amount of contrast in an image can often safely be removed, to reduce variance and remove the need for the model to learn how to handle multiple contrast scales. One way to achieve this is global contrast normalization, which normalizes the contrast in every image.
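A minimal sketch of global contrast normalization, assuming a single-channel or flattened image array; the scale `s` and stabilizer `eps` are illustrative defaults.

```python
import numpy as np

def global_contrast_normalize(image: np.ndarray, s: float = 1.0, eps: float = 1e-8) -> np.ndarray:
    """Subtract the image mean and rescale by the image's overall contrast."""
    x = image.astype(np.float32)
    x -= x.mean()                              # center the pixel values
    contrast = float(np.sqrt((x ** 2).mean())) # global standard deviation
    return s * x / max(contrast, eps)          # eps guards against flat images
```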


5.3 Dataset generation

The complete dataset was iteratively divided up into the following parts. An increasing number of emblems became available as the thesis work progressed. In an effort to keep the distribution between the classes within the complete dataset consistent, the labeling process aimed at a division of 45% non-offensive emblems, 30% swastika emblems and 25% penis emblems. A more detailed explanation of the emblem categorization and extraction process is presented in the results chapter. The dataset during iteration one had a size of 5000 emblems, iteration two 10 000 emblems and the third iteration 17 377 emblems. The class distribution between the training, validation and dev test sets was close to identical.

• Training set 80% - At the end of each thesis iteration, 80% of the images were drawn at random from the dataset and put into a separate training set. This set was used to tune the weights/parameters of the model.

• Validation set 10% - The validation set was used solely to graph the model's estimated generalization error for each epoch during training. 10% of the dataset was set apart for this. The validation set is normally used to tune the model's hyperparameters.

• Development test set 10% - Henceforth, the development set is called the dev test set. When the model had been fully trained, performance benchmarks were run on this set, kept separate from the training process.

• Production test set - Henceforth, the production test set is called the prod test set. At the end of the third iteration, 3650 emblems were drawn at random from the emblem database containing 8 032 703 emblems. The MD5 hash of these images was then compared to the 17 377 emblems in the already labeled dataset to make sure the model had never seen the emblems before. 523 emblems were matched and removed in this process, yielding a final production test set of 3127 emblems. A sketch of the split and hash check is shown after this list.
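The following is a minimal sketch of the 80/10/10 split and the MD5-based deduplication, under the assumption that emblems are available as raw image bytes; the function and variable names are illustrative.

```python
import hashlib
import random

def md5_of(image_bytes: bytes) -> str:
    return hashlib.md5(image_bytes).hexdigest()

def split_and_dedup(labeled, candidates, seed=0):
    """80/10/10 split of labeled emblems, plus an MD5 check that keeps the
    production test set disjoint from everything already labeled."""
    rng = random.Random(seed)
    labeled = list(labeled)
    rng.shuffle(labeled)
    n = len(labeled)
    train = labeled[:int(0.8 * n)]
    val = labeled[int(0.8 * n):int(0.9 * n)]
    dev_test = labeled[int(0.9 * n):]
    seen = {md5_of(img) for img in labeled}
    prod_test = [img for img in candidates if md5_of(img) not in seen]
    return train, val, dev_test, prod_test
```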


5.4 Feature extraction

The CNN architecture GoogLeNet was used as a feature extractor during the thesis work, pretrained on the image dataset ImageNet. Figure 5.2 shows a sample from the ImageNet dataset. Emblem images were fed through the CNN, and the output feature map (called bottlenecks) of the last convolution layer was then used to train an MLP to classify the Battlefield 1 emblems. The feature map produced is a vector with length 2480, each element being a feature represented by a real number between zero and two.
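A sketch of the bottleneck extraction, written against the TensorFlow 1.x graph API used in the project. The graph file and tensor names below are assumptions for illustration; the actual frozen GoogLeNet graph may expose different node names.

```python
import numpy as np
import tensorflow as tf

GRAPH_PATH = 'googlenet_frozen.pb'      # hypothetical frozen GoogLeNet graph
INPUT_TENSOR = 'input:0'                # assumed input placeholder name
BOTTLENECK_TENSOR = 'pool5/_reshape:0'  # assumed last-layer feature-map name

graph = tf.Graph()
with graph.as_default():
    with tf.gfile.GFile(GRAPH_PATH, 'rb') as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())
        tf.import_graph_def(graph_def, name='')

def extract_bottleneck(image: np.ndarray) -> np.ndarray:
    """Run one preprocessed image through the CNN, return its feature vector."""
    with tf.Session(graph=graph) as sess:
        bottleneck = sess.run(BOTTLENECK_TENSOR, {INPUT_TENSOR: image[np.newaxis]})
    return np.squeeze(bottleneck)
```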

Fig. 5.2: ImageNet sample. Image taken from Stanford Vision Lab [2]

5.5 Machine learning framework - TensorFlow

TensorFlow is an open-source software library for machine learning and numerical computation. The framework was developed by the Google Brain Team, within Google's Machine Intelligence research organization. In TensorFlow, a data flow graph is defined, where each node represents a mathematical operation. The edges between the nodes represent multidimensional data arrays, called tensors, that are passed between the nodes. The abstraction of a computational graph makes it possible to deploy the computation to multiple CPUs or GPUs, and to different devices with the same API.
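A minimal example of the graph-and-session model, written against the TensorFlow 1.x API listed in section 5.1 (the node names are arbitrary):

```python
import tensorflow as tf

# Build a tiny data flow graph: two placeholder nodes feeding a multiply node.
graph = tf.Graph()
with graph.as_default():
    a = tf.placeholder(tf.float32, name='a')
    b = tf.placeholder(tf.float32, name='b')
    c = tf.multiply(a, b, name='c')

# Execute the graph in a session; tensors flow along the edges a->c and b->c.
with tf.Session(graph=graph) as sess:
    print(sess.run(c, feed_dict={a: 3.0, b: 4.0}))  # prints 12.0
```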


6 Results

6.1 Results iteration 1

6.1.1 Step 1 - Determine goals and measurements

The first step was to determine the goals for the project. In discussion with Uprise, the following objectives were decided for the thesis:

• The project should produce a categorizing service that, when presented with an emblem, will flag the emblem as offensive or not.

• The project should provide a good overview of the model's strengths and weaknesses.

No explicit key metric or target values were added to the objectives. The task of filtering out offensive content is similar to the task of spam detection in some ways. They are both binary classification tasks, and in both tasks the cost of incorrectly classifying an example as offensive/spam is higher than the cost of permitting an offensive/spam example. The dataset also has a heavily skewed distribution between the classes: about 99% of the emblems are non-offensive, when examining a sample of 1000 emblems drawn at random from the eight-million-emblem dataset.

The uneven class distribution renders metrics like accuracy and error rate less useful when evaluating the model on real-world samples. On a set randomly picked from the real-world dataset, a model that classified all examples as non-offensive would on average get an accuracy around 99%, which is misleading. We are not that interested in the examples that the model correctly flags as non-offensive, so we want to use metrics that don't take true negatives into account.

The primary focus is to minimize the number of examples that the classifier incorrectly flags as offensive, covered by the precision metric. A secondary goal is to catch as many offensive emblems as possible in the filter, covered by the recall metric. The F-measure takes both precision and recall into account.


TP = True Positive = Correctly predicting an offensive emblem as offensive
TN = True Negative = Correctly predicting a non-offensive emblem as non-offensive
FP = False Positive = Incorrectly predicting a non-offensive emblem as offensive
FN = False Negative = Incorrectly predicting an offensive emblem as non-offensive

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F-measure = 2 · (precision · recall) / (precision + recall)
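The metrics can be computed directly from confusion-matrix counts; a minimal sketch:

```python
def precision_recall_f(tp: int, fp: int, fn: int):
    """Compute precision, recall and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# With the first-iteration counts TP=245, FP=16, FN=13 (Table 6.9), this gives
# precision ~0.939, recall ~0.950 and F-measure ~0.944, close to Table 6.10.
```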

An important note is that the class distribution will be relatively even during training, rendering accuracy a useful measurement for model evaluation; accuracy will be considered the key metric. The dev test set has the same class distribution as the training set.

6.1.2 Dataset extraction

Dataset labeling process

There are 8 032 703 uploaded emblems in Battlefield 1, as of April 2017. The emblems that have been reported and marked as offensive by customer service constitute the offensive dataset. This dataset consists of 4730 images for the game Battlefield 1. No additional data was stored regarding the images. In order to get a sense of the distribution within the offensive dataset, it was sorted into offensive categories. The following categories were decided:

Tab. 6.1: Categories within the offensive dataset

Nazi symbol Penis Nude Text Miscellaneous

Emblems were labeled miscellaneous when none of the other labels applied. Figure 6.1 illustrates samples from each category.

Fig. 6.1: Sample emblems. From left to right: nude, miscellaneous, nazi symbol, penis and text.


Dataset Distribution

Tab. 6.2: Emblems hidden by customer service at Dice, categorized

Nazi symbol  Penis  Nude  Text  Miscellaneous  Total
2942         1146   265   110   267            4730

Fig. 6.2: Distribution among hidden emblems in BF1

The distribution between the classes is shown above. Most of the offensive emblems are Nazi symbols, followed by penis illustrations. To get a further understanding of the kinds of emblems that are common in Battlefield, the 10 000 most used emblems were extracted. This was done by computing an MD5 hash of every emblem, grouping all emblems with the same hash, and then sorting the groups by number of occurrences; a sketch of this step is shown below. The top 10 000 emblems are reused by players 1 557 720 times. Figure 6.3 shows the distribution among the top 1000 emblems, after manually sorting the set, and Figure 6.4 displays the distribution between the offensive categories found in the top 1000 emblem dataset. The most common offensive classes are nude and miscellaneous. The reused drawings mostly consist of advanced illustrations, having multiple layers and being more artistic than the average emblem. One plausible explanation for why nude images are reused the most could be that they are too hard for the average player to draw themselves; in contrast, most people are capable of drawing a swastika or a penis.
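A minimal sketch of the extraction of the most reused emblems, assuming the emblems are available as an iterable of raw image bytes (the parameter name is illustrative):

```python
import hashlib
from collections import Counter

def most_reused(all_emblem_bytes, k: int = 10_000):
    """Group emblems by MD5 hash and return the k most reused drawings."""
    counts = Counter(hashlib.md5(img).hexdigest() for img in all_emblem_bytes)
    return counts.most_common(k)
```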

Tab. 6.3: Distribution among top 1000 emblems after manual categorization

Non-offensive  Nazi symbol  Penis  Nude  Text  Miscellaneous  Total
907            3            8      50    4     28             1000

Fig. 6.3: Distribution among all top 1000 emblems in BF1

Fig. 6.4: Distribution between offensive emblems in top 1000


6.1.3 Step 2 - Establish working end-to-end pipeline and baseline model

A database was set up to store emblems and their labels. To make sure that the dataset did not include any duplicate emblems, an MD5 hash was computed for each emblem and used as the key in the database.

To streamline the process of collecting experimental results, a database was set up to automatically store classifier hyperparameters, which labels were included as offensive, and the classifier's performance on the test set. The end-to-end pipeline was set up in Google's open-source machine learning library TensorFlow, using the Python API. To avoid dependency issues and ensure deployment stability, the project made heavy use of containerization using Docker.

The pipeline looked as follows:

1. Choose which labels should be considered offensive (so that future models can include new categories as offensive).

2. Define a hyperparameter configuration file that will be used for the run, containing parameters like the number of epochs, learning rate, training batch size etc. (a sketch of such a configuration follows the list).

3. The pipeline would then fetch the latest dataset from the database and spawn a Docker container, performing the classifier training and outputting the trained classifier as a graph file. Bottlenecks were produced once and reused via a cache folder, making repeated runs significantly faster. The classifier's performance is then automatically stored in the database.
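A hypothetical configuration mirroring the parameters listed in step 2. The keys are illustrative rather than the actual file format used in the project; the values are taken from the training tables in this chapter.

```python
config = {
    "offensive_labels": ["swastika", "penis"],  # step 1: labels treated as offensive
    "training_epochs": 4000,                    # as in Tab. 6.5 / Tab. 6.7
    "learning_rate": 0.01,
    "train_batch_size": 100,
}
```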


The dataset distribution used to train the first baseline model is shown below. The training batch size is the number of images used each epoch for the forward pass and backpropagation. The data was not augmented in any way during iteration one or two. An accuracy of 91.6% on the test set was recorded for the baseline model.

The dataset used for the baseline model was created without applying any fine-grained separation within the offensive dataset, including images from all categories as offensive.

Tab. 6.4: Dataset baseline model

Non-offensive Offensive Total

1000 4000 5000

Tab. 6.5: Training parameters baseline model

Number of training epochs Learning rate Batch size during training

4000 0.01 100

Tab. 6.6: Performance on test set

Model Accuracy F-measure Precision Recall Test set size

base-line 0.9162 0.7843 0.7018 0.8889 542


6.1.4 Performance benchmarks for the first model

During the manual process of labeling images into more fine-grained categories, it was concluded that the amount of variety between images within some classes was very high. A sample from the miscellaneous labeled emblems is shown in Figure 6.5.

Fig. 6.5: Sample from the miscellaneous category

The decision was made to limit the scope of the thesis to focus on filtering the emblems containing swastikas and penises. This was due to the fact that swastikas were seen as the most offensive category; these were also the largest offensive categories in the hidden dataset.

Most of the images in the Nazi symbol category are swastikas, so it was also decided that only swastikas should be considered at first, cleaning the Nazi symbol category to only contain swastikas. Other symbols like the blood drop cross, confederate flags etc. were put into a separate category. The dataset was swept through a second time, resulting in a few images being found in the wrong category. After the dataset clean-up, the model was trained again. By collecting more non-offensive labeled emblems, both from the top 10 000 dataset and at random from the eight million set, the dataset was changed to contain a more even distribution between offensive and non-offensive emblems.


Hyperparameters and training set

Tab. 6.7: Training parameters

Training epochs Learning rate Training batch size

4000 0.01 100

Tab. 6.8: Dataset used during training in iteration one

Non-offensive Swastikas Penis Total

2248 1539 1211 5000

Samples from the labeled categories are shown below as thumbnails; these were the emblems used during training. The categories non-offensive, swastika and penis were used to train the model, resulting in a multi-class classification problem. During testing, performance is measured on the binary classification task of determining if an emblem is offensive or non-offensive: if the classifier guesses penis or swastika, the guess is coded as an offensive guess.

(a) Non-offensive (b) Swastikas (c) Penises

Fig. 6.6: Emblem thumbnails from each of the categories


Performance during training and test

The model took four minutes to generate bottlenecks (feature maps from the pretrained GoogLeNet CNN) and 15 minutes to fine-tune the fully connected layer. For comparison, this procedure took four hours when run on the CPU instead of the GPU.

Fig. 6.7: Accuracy plot during training. Performance on training batch in orange, validation performance in turquoise. The x-axis shows the number of epochs.

Fig. 6.8: Cross-entropy plot during training. Performance on training batch in orange, validation performance in turquoise. The x-axis shows the number of epochs.

Figure 6.7 plots the accuracy for each epoch on the training batch and the validation set. The faded line is the actual performance at each epoch, and the solid line displays the smoothed performance across epochs, to more easily show the trend. Note that the training set performance is evaluated on the last 100 images, which gives rise to the large jitter in performance across epochs. The validation performance is evaluated on the complete validation set, 542 images, every ten epochs, making it much less prone to jitter.

The accuracy on both the training and validation sets increases drastically during the first 200 iterations. Performance on the validation set seems to stop increasing after around 2000 iterations, while performance on the training set continues to improve, reaching 100% accuracy on some training batches between epochs 3500 and 4000. The gap in accuracy between training and validation is, by the end of epoch 4000, close to 5%. Figure 6.8 plots the cross-entropy during each epoch, confirming the performance trend shown in the accuracy plot.

Tab. 6.9: Confusion matrix for the first iteration model

             Predicted Pos  Predicted Neg  Total
Actual Pos   TP 245         FN 13          258
Actual Neg   FP 16          TN 268         284
Total        261            281            542

Performance across all measurements is shown in Table 6.10. The improvement in accuracy is largely dependent on changing the rules for which emblems are considered offensive in the dataset. Only considering swastikas and penises as offensive emblems gives the classifier a more well-defined concept, which proves easier to separate from non-offensive emblems.

Tab. 6.10: Performance on dev test set

Model            Accuracy  F-measure  Precision  Recall  Dev test set size
base-line        0.9162    0.7843     0.7018     0.8889  542
first iteration  0.9465    0.9441     0.9387     0.9491  542


Misclassified images

Fig. 6.9: Penises misclassified as non-offensive

Fig. 6.10: Non-offensive misclassified as penis

Fig. 6.11: Swastikas misclassified as non-offensive

Fig. 6.12: Non-offensive misclassified as swastika

In Figure 6.9, emblems 2, 3, 4 and 6 are penis illustrations that are in line with how most penis emblems look. Emblem 7 has an incorrect ground-truth label: the emblem is a bandanna, one of the web editor's drawing symbols. Emblem 5 could be considered correctly labeled as a penis illustration; it depicts an armed soldier with a bullet starting at the crotch.

In Figure 6.10, the first and the last emblems have been incorrectly classified as penises. Both are characters from the cartoon show "SpongeBob SquarePants". The rightmost one is the character "Patrick", who is common in emblems. A sample of the "Patrick" drawings from the non-offensive category is shown in Figure 6.13, with the incorrectly classified emblem furthest to the right. The pink color, combined with Patrick's pointed head and eye-balls, proves hard for the classifier to separate from a drawing of a penis.


Fig. 6.13: Emblems from the non-offensive category containing the SpongeBob character Patrick

The pattern found in some non-offensive emblems classified as swastikas could be the presence of an eagle in the center of the image. This is a common pattern for swastikas, as shown in Figure 6.14.

Fig. 6.14: A small sample of the emblems in the swastika category containing eagles

6.1.5 Step 3 - Determine bottlenecks in performance

The accuracy reaches above 99% on the training batches when the model is given enough training epochs. The error on the validation set is considerably larger, indicating a problem due to overfitting, or high variance. Monitoring the validation performance across epochs indicates that the problem is not due to excessive training: the performance on validation shows no indication of either dropping or improving after 4000 epochs.

As presented in the method chapter, the options in this kind of situation are typically to gather more data, add or increase regularization, or try a new model architecture. Gathering more data is often the best alternative to start with, according to Ng [14], and was chosen as the goal for the second iteration.


6.2 Results iteration 2

6.2.1 Step 4 - Repeatedly make incremental changes

Common data warehouse and web-labeling service

To reduce the gap in accuracy between training and testing, the goal of the second sprint was to increase the size of the labeled dataset. After researching databases for previous Battlefield games, another 25 000 emblems that had been marked as offensive were extracted. After discussions with Uprise, employees volunteered to help out with the fine-grained labeling of the dataset. For this labeling to be done, the database set up for the thesis needed to be exposed for labeling by others than myself; previously, the labeling was done solely on my local workstation.

The next step in the thesis project was therefore to expose the database through a web user interface, where employees could click and label the dataset. A UI presenting each classifier experiment together with its hyperparameters and performance was also implemented. Figure 6.15 displays a screenshot of the labeling UI. Selecting an emblem marks it with a blue background, and the label can be submitted to the database by clicking the button at the top.

The dataset was increased from 5000 emblems to about 10 000 emblems in the following weeks. After the second iteration's data extraction and labeling phase, new experiments were run.

Fig. 6.15: Web-labeling service user-interface


6.2.2 Performance benchmarks for the second model

Data quality issues

Fig. 6.16: Accuracy plot during training

Fig. 6.17: Cross-entropy plot during training

The performance plots for the second model display alarming results. The model no longer seems to learn as well as it did during iteration one. In iteration one, the training set accuracy reached close to 100%; now the model only reaches 95% accuracy at best. The cross-entropy is also considerably higher. After a closer look at the misclassified images, the root of the problem was found.


Fig. 6.18: Emblems marked as misclassified during testing

In Figure 6.18, we can see that the first image from the left is an eagle that was somehow labeled as a swastika. The following five images had been labeled as penises. The first three penis illustrations are definitely considered offensive and display an important challenge: determining which emblems should be considered penis illustrations and which should be considered miscellaneous or pornographic content is a slippery slope. The vast majority of the emblems labeled as penises are images where two balls and a penis are drawn. To make sure that the model has enough examples to recognize this type of drawing, these are the emblems that are considered penis illustrations. The kind of pornographic emblems depicted in Figure 6.18 are therefore considered pornographic drawings and labeled as miscellaneous offensive emblems.

The last two emblems were marked as swastikas. These emblems depict the runic insignia of the Schutzstaffel, also known as the SS bolts. As was presented in iteration one, the scope of the thesis was limited to only include swastikas and penises, no other hate symbols. It became obvious that emblems added after introducing the labeling service had quality issues, and all the newly added labeled emblems were examined.

Fig. 6.19: Emblems incorrectly given the label penis in the dataset

Fig. 6.20: Emblems incorrectly given the label swastika in the dataset

Figure 6.19 displays a sample of emblems incorrectly labeled as penises. The third and the last emblems, from the left, are not penis illustrations but depict bandannas. Figure 6.20 shows some emblems that were found incorrectly labeled as swastikas. The benchmarks were run again after the dataset had been cleaned up, and the performance is shown in the next section.


Performance after data cleaning

Tab. 6.11: Training parameters

Number of training epochs Learning rate Batch size during training

4000 0.01 100

Tab. 6.12: Dataset used during training in iteration two

Non-offensive Swastikas Penis Total

4497 3079 2422 10 000

Fig. 6.21: Accuracy plot during training

Fig. 6.22: Cross-entropy plot during training

The performance benchmarks after the data-cleaning process are more similar to

the results found during iteration one. The cross-entropy is again decreasing to low

levels on the training set.


Tab. 6.13: Confusion matrix for the second iteration model

             Predicted p  Predicted n  Total
Actual p     TP 463       FN 16        479
Actual n     FP 34        TN 519       553
Total        497          535          1032

Table 6.14 shows the performance results for each classifier so far. The new model, trained on a dataset twice the size compared to the model in iteration one, performs slightly better. The increase in size yielded an increase of 0.51% in accuracy and 0.47% in F-measure.

Tab. 6.14: Performance on dev test set

Model                  Accuracy  F-measure  Precision  Recall  Dev test set size
base-line              0.9162    0.7843     0.7018     0.8889  542
first iter.            0.9465    0.9441     0.9387     0.9491  542
second iter. no clean  0.9388    0.9308     0.9308     0.9318  1032
second iter. clean     0.9516    0.9488     0.9316     0.9666  1032


6.3 Results iteration 3

Another 7000 emblems were added to the dataset. Data cleaning was performed before running additional experiments.

Tab. 6.15: Training parameters

Number of training epochs Learning rate Batch size during training

6000 0.01 100

Tab. 6.16: Dataset used during training in iteration three

Non-offensive Swastikas Penis Total

7815 5351 4210 17377

Fig. 6.23: Accuracy plot during training

Fig. 6.24: Cross-entropy plot during training

The performance plots show results similar to the previous iterations. Performance on the validation set reaches beyond 95% after epoch 4000. The cross-entropy for the validation set is slightly lower than in the previous iterations.


Tab. 6.17: Confusion matrix for the third iteration model

             Predicted p  Predicted n  Total
Actual p     TP 768       FN 30        798
Actual n     FP 45        TN 922       967
Total        813          952          1765

Tab. 6.18: Performance on dev test set

Model                  Accuracy  F-measure  Precision  Recall  Dev test set size
base-line              0.9162    0.7843     0.7018     0.8889  542
first iter.            0.9465    0.9441     0.9387     0.9491  542
second iter. no clean  0.9388    0.9308     0.9308     0.9318  1032
second iter. clean     0.9516    0.9488     0.9316     0.9666  1032
third iter.            0.9575    0.9535     0.9446     0.9624  1765

The third model achieves the highest performance, increasing accuracy by an additional 0.59% and F-measure by 0.47%. The increase in dataset size also means that the dev test set is larger for the last model, testing on 1765 emblems compared to only 542 images in the first iteration, which makes the test results more reliable in the third iteration. The third model classified 45 emblems as offensive even though they did not contain any offensive content, and 30 emblems were incorrectly labeled as non-offensive. All the emblems that were mistaken are shown in the next section.


Misclassified images

Fig. 6.25: Non-offensive emblems misclassified as penises

Fig. 6.26: Non-offensive emblems misclassified as swastikas

Fig. 6.27: Penis emblems misclassified as non-offensive

Fig. 6.28: Swastika emblems misclassified as non-offensive

Out of 1765 emblems, 75 were classified incorrectly. All the incorrect predictions are shown above. The incorrect classification of some swastikas is unsatisfactory and hard to explain. It can be concluded that the model achieves high performance results on the dev test set, but still has several blind spots.


6.3.1 Data augmentation experiments

To increase the dataset size further, experiments were run using different augmentation techniques. In the previous experiments, the bottleneck generation was run once and cached for consecutive runs. The augmentation operation is run on every training batch, so new bottlenecks have to be generated every epoch. Running the experiment on the complete dataset took between 8 and 15 hours on the workstation GPU, compared to the previous experiments that took between 15 and 60 minutes. Several runs were made with different augmentation settings; the most successful configuration is shown below. The only strategy that proved useful was the rotation technique. The images that were suitable for rotation were handpicked from the dataset. To ensure the reliability of the performance benchmarks, only the training set was augmented; the validation set and the test set were left untouched by any augmentation strategies.

Tab. 6.19: Training parameters

Number of training epochs Learning rate Batch size during training

8000 0.01 100

Tab. 6.20: The dataset used during training, including both rotated and not rotated images

Non-offensive Swastikas Penis Total

32 380 22 638 17 450 72 468


Fig. 6.29: Accuracy plot during training

Fig. 6.30: Cross-entropy plot during training

Tab. 6.21: Confusion matrix for the third iteration model with augmentation

             Predicted p  Predicted n  Total
Actual p     TP 783       FN 15        798
Actual n     FP 43        TN 924       967
Total        826          939          1765

The performance during training follows the same pattern as the counterparts without augmentation, but the errors on validation and training are closer to each other. Performance on the validation set is also higher.


6.3.2 Final performance comparison between all models

Tab. 6.22: Performance on dev test set

Model                  Accuracy  F-measure  Precision  Recall  Dev test set size
base-line              0.9162    0.7843     0.7018     0.8889  542
first iter.            0.9465    0.9441     0.9387     0.9491  542
second iter. no clean  0.9388    0.9308     0.9308     0.9318  1032
second iter. clean     0.9516    0.9488     0.9316     0.9666  1032
third iter.            0.9575    0.9535     0.9446     0.9624  1765
third iter. aug.       0.9671    0.9643     0.9480     0.9812  1765

The model run with augmentation has the highest performance recorded, with an

accuracy of 96.71%.

6.3.3 Performance on production test set

The best model found during the project was then run on the production set. The application only wants the classifier to make a guess if it is more than 85% sure about the prediction, so the classifier was restricted from giving a prediction when the highest softmax output was beneath 85% (a sketch of this rule follows below). Compared to the dev test sets run during development, which had a distribution that resembled the training set distribution, the production test set has 99% non-offensive emblems and only 1% offensive.
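A minimal sketch of the abstention rule, assuming `softmax_probs` is the model's output vector for one emblem:

```python
import numpy as np

CONFIDENCE_FLOOR = 0.85  # the 85% threshold described above

def predict_or_abstain(softmax_probs: np.ndarray):
    """Return the predicted class index, or None when the classifier is not
    allowed to guess because its top softmax output is below the floor."""
    top = int(np.argmax(softmax_probs))
    return top if softmax_probs[top] >= CONFIDENCE_FLOOR else None
```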

The production test set was created by randomly selecting 3650 emblems out of the 8 032 703 emblem dataset. The MD5 sums of the 3650 emblems were then queried against the emblem database, and 523 emblems were already present, reducing the dataset to 3127 emblems. By manually labeling all the emblems in the downloaded dataset, 17 swastikas, 14 penises and 3096 non-offensive emblems were found. These 3127 emblems were then used as the production test set.

Tab. 6.23: Production test set distribution, before relabeling

Non-offensive Swastikas Penis Total

3096 17 14 3127


When letting the model classify the images in the prod test set and displaying the misclassified emblems, five swastika emblems were found to be incorrectly labeled as non-offensive, and ten penises were incorrectly labeled as non-offensive. This demonstrates how easy it is to miss some of the offensive emblems, and also showcases the model's capability of finding the offensive classes in emblems.

Fig. 6.31: Penis emblems incorrectly labeled as non-offensive, but found by model

Fig. 6.32: Swastika emblems incorrectly labeled as non-offensive, but found by model

After correcting the test set, the corrected distribution is shown in Table 6.24. The emblems that were given an incorrect label are shown in Figure 6.31 and Figure 6.32.

Tab. 6.24: Production test set distribution, after relabeling

Non-offensive  Swastikas  Penis  Total
3081           22         24     3127


Tab. 6.25: Confusion matrix for the best classifier run on the production test set

             Predicted p  Predicted n  Total
Actual p     TP 37        FN 1         38
Actual n     FP 70        TN 1772      1842
Total        107          1773         1880

Out of 3127 emblems, the classifier was confident enough to give a prediction on 1880 emblems, which corresponds to about 60% of the emblems. The prediction results are shown in Table 6.25.

Tab. 6.26: Performance on production test set

Model Accuracy F-measure Precision Recall Test set size

best model 0.9622 0.5103 0.3458 0.9737 1880

The performance on the prod test set is worse than the performance on the dev test set. There are several explanations for this.

First, how a non-offensive emblem is illustrated varies far more than how a swastika or penis is illustrated. If emblems were graded on how hard it is to recognize what they depict, non-offensive emblems would be several levels harder. Adding more non-offensive emblems to the dataset therefore corresponds to adding more difficult examples.

Secondly, the model has been trained on a distribution of about 45% non-offensive, 30% swastikas and 25% penises. As has been mentioned, the distribution within the prod test set is 99% non-offensive, 0.5% swastikas and 0.5% penises. During training, the model will try to find the best fit on the training set and its distribution. If the model were trained on a distribution similar to the production test set, however, it would only be trained on about 390 swastikas and 390 penises, which is not enough to learn the different kinds of drawing styles.


7 Discussion

An important part of the work was to delimit the project to only focus on filtering out swastikas and penises. This both had an impact on the model's performance and made it easier to evaluate the produced model's limitations. Defining the boundaries within the computer vision problem proved to be hard. Deciding which images should be labeled as swastikas was straightforward, but the task of determining which emblems should be considered penis images or pornographic/miscellaneous images was challenging. During the second iteration, the loosely defined boundaries between categories proved to have negative effects on the dataset quality when people from outside the project contributed to the labeling process. Excluding the emblems that are hard, by categorizing them into a miscellaneous category, was also problematic. The model was neither trained nor tested on this category, resulting in an overoptimistic picture of how well the model would perform on a completely unlabeled dataset, where no emblems are excluded into miscellaneous categories.

Increasing the dataset size improved performance, as expected. The quality of the

added data was shown to have a severe impact on performance. The results also

show that the dataset size can be leveraged even further by the use of adequate

augmentation strategies for the dataset.

To make sure that the performance measurements are reliable on completely unseen data, the best model was run once on a production test set, generating the final performance results. An issue with the emblem dataset is the presence of "near duplicate" images. The way to make sure that an emblem was not present in both the training set and the test set was to generate an MD5 hash of the emblem image and then check that the hash is only present in one of the sets. The problem is that changing only a single pixel in the image would produce a different MD5 hash. This opens up the possibility that some images with very small variation might exist in both sets, making the results less reliable. How alike one emblem is to another varies significantly. No systematic method was implemented to counter this problem, but the test and training sets were manually checked to get an understanding of the severity of the issue. The conclusion was that the number of images that are close to identical between the sets is hard to determine without a systematic approach, but seems to be limited, judging by the manual comparison of the sets. It should be noted, though, that this is a problem with the whole eight million emblem dataset and not a problem due to any of the choices made in the thesis project. People tend to reuse popular emblems and make minor changes to them.

The step-by-step approach to machine learning problems presented by Goodfellow et al. [14] and Andrew Ng was of great use. The debugging strategies presented worked well during the project, and by monitoring performance on the training set and test set, the correct measures were taken.

Unlike Tajbakhsh et al. [34], who did fine-tuning across several layers, this project only focused on fine-tuning the last layer. The performance shows that the CNN features learned during training on ImageNet can be used to classify images in the target emblem dataset. These findings are in line with Donahue et al. [8] and several of the studies done on applying transfer learning to real-world problems. Using a pretrained CNN as a black box made it possible to focus more on dataset extraction and dataset augmentation. Even though only the last layer was fine-tuned, the experiments required significant time and effort.

One of the project goals was to investigate if a CNN model could be used as a tool for filtering out offensive emblems in the game Battlefield 1. The performance on the production test set shows that it is possible to produce such a model, but it has both strengths and weaknesses. The model has a high recall and is highly capable of finding offensive emblems in a dataset, but the model's performance measured in precision is low: about two-thirds of the emblems flagged as offensive are not offensive. Using the model as the single decision-maker on whether an emblem should be accepted into or declined from the game would not be a good idea. The model still has several blind spots and does make severe mistakes. One can imagine that incorrectly accusing players of uploading offensive emblems could have significant negative consequences. The model is probably more suited as a customer service AI, predicting on already reported emblems. Determining the suitability and consequences of applying a machine learning filtering service to a game is outside the scope of this thesis, and could be the subject of another project. The MD5 database that has been constructed during the thesis work could, though, be used directly on emblem upload, to check whether the emblem is already flagged as offensive.


7.1 Future work

Several approaches were considered but never tried during the thesis work, in order to limit the scope of the project. Some of them are presented in this section.

Only the last CNN layer was fine-tuned during this thesis. A future extension would be to retrain more layers in the CNN. Related research has shown that, given enough data, fine-tuning more layers than just the final layer often produces better results.

The model could have been visualized through the technique of t-Distributed Stochastic Neighbor Embedding (t-SNE), a method to visualize high-dimensional datasets. It would also have been interesting to further debug the CNN feature map and look at the activations for different images. Several debugging methods exist that would have been interesting to try; this could have given more insight into what the feature representation looks like for different labels.

Only the GoogLeNet architecture was used during the project. Given more time, it would have been interesting to try different CNN models for producing the bottlenecks/feature extraction, like AlexNet or VGG. Related research has also shown that an ensemble classifier in many cases outperforms a single classifier. This could have been investigated by training an SVM or random forest classifier on the generated feature maps and then letting an ensemble of classifiers do the predictions.


8 Conclusion

The goal of the thesis was to evaluate the use of convolutional neural networks on the task of filtering out penises and swastikas from emblems drawn by players in the game Battlefield 1. A CNN with the GoogLeNet architecture was pretrained on ImageNet and then used as a black box for feature extraction, an approach called transfer learning. A multi-layer perceptron was then trained on the generated feature maps from 17 220 emblems. The produced model achieved an accuracy of 96.22%, a precision of 34.58% and a recall of 97.37% on a sample drawn from the game at random. It can be concluded that the model is successful at finding swastikas and penises, but among the emblems flagged as swastikas and penises, a large portion are non-offensive.


Bibliography

[1] Samet Akçay, Mikolaj E Kundegorski, Michael Devereux, and Toby P Breckon. "Transfer learning using convolutional neural networks for object classification within x-ray baggage security imagery". In: Image Processing (ICIP), 2016 IEEE International Conference on. IEEE. 2016, pp. 1057–1061 (cit. on p. 18).

[3] Hossein Azizpour, Ali Sharif Razavian, Josephine Sullivan, Atsuto Maki, and Stefan Carlsson. "From generic to specific deep representations for visual recognition". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2015, pp. 36–45 (cit. on p. 2).

[6] Phillip M Cheng and Harshawn S Malhi. "Transfer Learning with Convolutional Neural Networks for Classification of Abdominal Ultrasound Images". In: Journal of Digital Imaging (2016), pp. 1–10 (cit. on p. 18).

[8] Jeff Donahue, Yangqing Jia, Oriol Vinyals, et al. "DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition". In: ICML. Vol. 32. 2014, pp. 647–655 (cit. on p. 15).

[9] David Eigen, Jason Rolfe, Rob Fergus, and Yann LeCun. "Understanding deep architectures using a recursive convolutional network". In: arXiv preprint arXiv:1312.1847 (2013) (cit. on p. 1).

[10] Dumitru Erhan, Pierre-Antoine Manzagol, Yoshua Bengio, Samy Bengio, and Pascal Vincent. "The Difficulty of Training Deep Architectures and the Effect of Unsupervised Pre-Training". In: AISTATS. Vol. 5. 2009, pp. 153–160 (cit. on p. 1).

[11] Andre Esteva, Brett Kuprel, Roberto A Novoa, et al. "Dermatologist-level classification of skin cancer with deep neural networks". In: Nature 542.7639 (2017), pp. 115–118 (cit. on p. 17).

[12] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. "Region-based convolutional networks for accurate object detection and segmentation". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 38.1 (2016), pp. 142–158 (cit. on p. 16).

[14] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016 (cit. on pp. 7, 10, 11, 13, 23, 41, 56).

[15] Mohammad Havaei, Axel Davy, David Warde-Farley, et al. "Brain tumor segmentation with deep neural networks". In: Medical Image Analysis 35 (2017), pp. 18–31 (cit. on p. 17).

[16] Benjamin Q Huynh, Hui Li, and Maryellen L Giger. "Digital mammographic tumor classification using transfer learning from deep convolutional neural networks". In: Journal of Medical Imaging 3.3 (2016), p. 034501 (cit. on p. 18).


[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. "ImageNet classification with deep convolutional neural networks". In: Advances in Neural Information Processing Systems. 2012, pp. 1097–1105 (cit. on p. 1).

[18] Yann LeCun, Bernhard Boser, John S Denker, et al. "Backpropagation applied to handwritten zip code recognition". In: Neural Computation 1.4 (1989), pp. 541–551 (cit. on p. 1).

[19] Min Lin, Qiang Chen, and Shuicheng Yan. "Network In Network". In: CoRR abs/1312.4400 (2013) (cit. on p. 19).

[20] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2012 (cit. on p. 5).

[21] Mohamed Moustafa. "Applying deep learning to classify pornographic images and videos". In: arXiv preprint arXiv:1511.08899 (2015) (cit. on p. 18).

[22] Andrew Ng. Nuts and Bolts of Applying Deep Learning. 2016 (cit. on p. 24).

[23] Sinno Jialin Pan and Qiang Yang. "A survey on transfer learning". In: IEEE Transactions on Knowledge and Data Engineering 22.10 (2010), pp. 1345–1359 (cit. on pp. 15, 16).

[24] Otávio AB Penatti, Keiller Nogueira, and Jefersson A dos Santos. "Do deep features generalize from everyday objects to remote sensing and aerial scenes domains?" In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2015, pp. 44–51 (cit. on p. 2).

[27] Holger R Roth, Amal Farag, Le Lu, Evrim B Turkbey, and Ronald M Summers. "Deep convolutional networks for pancreas segmentation in CT imaging". In: SPIE Medical Imaging. International Society for Optics and Photonics. 2015, 94131G (cit. on p. 17).

[28] Masaki Saito and Yusuke Matsui. "Illustration2vec: a semantic vector representation of illustrations". In: SIGGRAPH Asia 2015 Technical Briefs. ACM. 2015, p. 5 (cit. on p. 17).

[29] Mundher Al-Shabi, Tee Connie, and Andrew Beng Jin Teoh. "Adult Content Recognition from Images Using a Mixture of Convolutional Neural Networks". In: arXiv preprint arXiv:1612.09506 (2016) (cit. on p. 18).

[30] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. "CNN features off-the-shelf: an astounding baseline for recognition". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2014, pp. 806–813 (cit. on p. 2).

[31] Jae Shin, Nima Tajbakhsh, R Todd Hurst, Christopher B Kendall, and Jianming Liang. "Automating carotid intima-media thickness video interpretation with convolutional neural networks". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 2526–2535 (cit. on p. 17).

[32] Karen Simonyan and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition". In: arXiv preprint arXiv:1409.1556 (2014) (cit. on p. 1).

[33] Christian Szegedy, Wei Liu, Yangqing Jia, et al. "Going deeper with convolutions". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 1–9 (cit. on pp. 1, 19, 20).


[34] Nima Tajbakhsh, Jae Y Shin, Suryakanth R Gurudu, et al. "Convolutional neural networks for medical image analysis: full training or fine tuning?" In: IEEE Transactions on Medical Imaging 35.5 (2016), pp. 1299–1312 (cit. on pp. 1, 16).

[35] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. "How transferable are features in deep neural networks?" In: Advances in Neural Information Processing Systems. 2014, pp. 3320–3328 (cit. on pp. 2, 16).

[36] Matthew D Zeiler and Rob Fergus. "Visualizing and understanding convolutional networks". In: European Conference on Computer Vision. Springer. 2014, pp. 818–833 (cit. on p. 1).

[37] Kailong Zhou, Li Zhuo, Zhen Geng, Jing Zhang, and Xiao Guang Li. "Convolutional Neural Networks Based Pornographic Image Classification". In: Multimedia Big Data (BigMM), 2016 IEEE Second International Conference on. IEEE. 2016, pp. 206–209 (cit. on p. 18).

Websites

[2] Stanford Vision Lab. Resources and links. 2014. URL: http://vision.stanford.edu/resources_links.html (visited on June 8, 2017) (cit. on p. 30).

[4] Danilo Bargen. Programming a Perceptron in Python. 2013. URL: https://blog.dbrgn.ch/2013/3/26/perceptrons-in-python/ (visited on June 8, 2017) (cit. on p. 6).

[5] Satvik Beri. Could someone explain how to create an artificial neural network in a simple and concise way that doesn’t require a PhD in mathematics? 2013. URL: https://www.quora.com/Could-someone-explain-how-to-create-an-artificial-neural-network-in-a-simple-and-concise-way-that-doesnt-require-a-PhD-in-mathematics (visited on June 8, 2017) (cit. on p. 7).

[7] Adit Deshpande. A Beginner’s Guide To Understanding Convolutional Neural Networks. 2016. URL: https://adeshpande3.github.io/adeshpande3.github.io/A-Beginner’s-Guide-To-Understanding-Convolutional-Neural-Networks-Part-2/ (visited on May 22, 2017) (cit. on pp. 12, 13).

[13] Amar Gondaliya. Regularization implementation in R: Bias and Variance diagnosis. 2014. URL: http://pingax.com/regularization-implementation-r/ (visited on June 8, 2017) (cit. on p. 10).

[25] Sebastian Raschka. Machine Learning FAQ. 2016. URL: https://sebastianraschka.com/faq/docs/closed-form-vs-gd.html (visited on May 31, 2017) (cit. on p. 8).

[26] Jeff Dean and Ray Kurzweil. 10 breakthrough technologies. 2013. URL: https://www.technologyreview.com/s/513696/deep-learning/ (visited on Apr. 22, 2017) (cit. on p. 1).


List of Figures

2.1 Perceptron topology, illustration modified from Danilo Bargen [4]
2.2 Multi-layer perceptron topology, illustration modified from Satvik Beri [5]
2.3 Gradient descent, illustration modified from Sebastian Raschka [25]
2.4 Dataset partitioning
2.5 Illustrative example of overfitting, underfitting and optimal capacity. Illustration modified from Amar Gondaliya [13]
2.6 Illustration displaying the convolution operation [14]
2.7 A 7 × 7 image with a 3 × 3 kernel and a stride of one [7]
2.8 The 5 × 5 output feature map [7]
2.9 A 7 × 7 image with a 3 × 3 kernel and a stride of two [7]
2.10 The 3 × 3 output feature map [7]
2.11 A 32 × 32 image with a padding of two [7]
2.12 Image displaying the output of a 2 × 2 maxpool kernel, with a stride of two [7]
3.1 GoogLeNet CNN architecture. Illustration taken from the research paper "Going Deeper with Convolutions" [33]
3.2 Inception module illustration [33]
3.3 Figure illustrating the difference between a normal linear convolution layer and an MLPconv layer. Illustration taken from the paper "Network-In-Network" [33]
4.1 Screenshot capture from the Battlefield companion emblem editor
4.2 Flow chart displaying the process of applying deep learning. Illustration taken from "Nuts and Bolts of Applying Deep Learning" [22]
5.1 Rotation augmentation example
5.2 ImageNet sample. Image taken from Stanford Vision Lab [2]
6.1 Sample emblems. From left to right: nude, miscellaneous, Nazi symbol, penis and text
6.2 Distribution among hidden emblems in BF1
6.3 Distribution among all top 1000 emblems in BF1
6.4 Distribution between offensive emblems in the top 1000
6.5 Sample from the miscellaneous category
6.6 Emblem thumbnails from each of the categories


6.7 Accuracy plot during training. Performance on the training batch in orange, validation performance in turquoise. The x-axis shows the number of epochs
6.8 Cross-entropy plot during training. Performance on the training batch in orange, validation performance in turquoise. The x-axis shows the number of epochs
6.9 Penises misclassified as non-offensive
6.10 Non-offensive misclassified as penis
6.11 Swastikas misclassified as non-offensive
6.12 Non-offensive misclassified as swastika
6.13 Emblems from the non-offensive category containing the SpongeBob character Patrick
6.14 A small sample of the emblems in the swastika category containing eagles
6.15 Web-labeling service user interface
6.16 Accuracy plot during training
6.17 Cross-entropy plot during training
6.18 Emblems marked as misclassified during testing
6.19 Emblems incorrectly given the label penis in the dataset
6.20 Emblems incorrectly given the label swastika in the dataset
6.21 Accuracy plot during training
6.22 Cross-entropy plot during training
6.23 Accuracy plot during training
6.24 Cross-entropy plot during training
6.25 Non-offensive emblems misclassified as penises
6.26 Non-offensive emblems misclassified as swastikas
6.27 Penis emblems misclassified as non-offensive
6.28 Swastika emblems misclassified as non-offensive
6.29 Accuracy plot during training
6.30 Cross-entropy plot during training
6.31 Penis emblems incorrectly labeled as non-offensive, but found by the model
6.32 Swastika emblems incorrectly labeled as non-offensive, but found by the model


List of Tables

5.1 Data augmentations
6.1 Categories within the offensive dataset
6.2 Emblems hidden by customer service at Dice, categorized
6.3 Distribution among top 1000 emblems after manual categorization
6.4 Dataset for the baseline model
6.5 Training parameters for the baseline model
6.6 Performance on test set
6.7 Training parameters
6.8 Dataset used during training in iteration one
6.9 Confusion matrix for the first iteration model
6.10 Performance on dev test set
6.11 Training parameters
6.12 Dataset used during training in iteration two
6.13 Confusion matrix for the second iteration model
6.14 Performance on dev test set
6.15 Training parameters
6.16 Dataset used during training in iteration three
6.17 Confusion matrix for the third iteration model
6.18 Performance on dev test set
6.19 Training parameters
6.20 The dataset used during training, including both rotated and non-rotated images

6.21 Confusion matrix for the fourth iteration model

6.22 Performance on dev test set
6.23 Production test set distribution, before relabeling
6.24 Production test set distribution, after relabeling
6.25 Confusion matrix for the best classifier, run on the production test set
6.26 Performance on production test set


Colophon

This thesis was typeset with LaTeX 2ε. It uses the Clean Thesis style developed by Ricardo Langner. The design of the Clean Thesis style is inspired by user guide documents from Apple Inc.

Download the Clean Thesis style at http://cleanthesis.der-ric.de/.