IT 17 089
Degree project 30 credits, January 2018
Classification of offensive game-emblem drawings using CNN (convolutional neural networks) and transfer learning
John Tunell
Department of Information Technology
Abstract
Convolutional neural networks (CNNs) have become an important tool for solving many of today's computer vision tasks. The technique is costly, however, and training a network from scratch requires both a large dataset and adequate hardware. A solution to these shortcomings is to instead use a pre-trained network, an approach called transfer learning. Several studies have shown promising results applying transfer learning, but the technique requires further study. This thesis explores the capabilities of transfer learning when applied to the task of filtering out offensive cartoon drawings in the game Battlefield 1. GoogLeNet was pre-trained on ImageNet, and then the last layers were fine-tuned towards the target task and domain. The model achieved an accuracy of 96.71% when evaluated on the binary classification task of predicting non-offensive or swastika/penis content in Battlefield "emblems". The results indicate that a CNN trained on ImageNet is applicable even when the target domain is very different from the pre-trained network's domain.
Printed by: Reprocentralen ITC
IT 17 089
Examiner: Mats Daniels
Subject reviewer: Anders Brun
Supervisor: Håkan Rosenborn
Acknowledgement
I would like to thank my supervisor, Håkan Rosenborn, for all the advice and
guidance given throughout the project. He openly shared his experience from a
career as a software developer, which I’m very grateful for. The feedback and
teaching sessions have given me valuable preparation for a career as a software
developer. Our lunch break jogs both improved my fitness and gave me insights
regarding the software development process. I will also miss working with all the
other friendly co-workers at Uprise. I also want to thank my reviewer, Anders Brun.
The meetings we had helped me stay focused on the research task and made sure I
was going in the right direction.
Contents

1 Introduction
  1.1 Thesis Structure
2 Background
  2.1 Definitions and terminology
  2.2 Feedforward Networks
  2.3 Datasets
    2.3.1 Training set
    2.3.2 Validation set
    2.3.3 Test set
    2.3.4 Test set distribution
  2.4 Capacity, Overfitting and Underfitting
  2.5 Convolutional Neural Networks
    2.5.1 Convolutional layer
    2.5.2 Pooling layer
    2.5.3 Fully connected layer
3 Related work
  3.1 Transfer learning
  3.2 Research exploring transfer learning
  3.3 Research applying transfer learning
  3.4 GoogLeNet
    3.4.1 The inception module
4 Method
  4.1 Emblems in Battlefield
    4.1.1 How players create and use emblems in Battlefield 1 and Battlefield 4
    4.1.2 How offensive emblems are handled in the Battlefield games
  4.2 Method to approach the problem
    4.2.1 Step 1 - Determine goals and measurements
    4.2.2 Step 2 - Establish working end-to-end baseline model
    4.2.3 Step 3 - Determine bottlenecks in performance
    4.2.4 Step 4 - Repeatedly make incremental changes
  4.3 Additional guidelines when applying machine learning
    4.3.1 The process of knowing what to do next
    4.3.2 Create a common data warehouse
    4.3.3 Determine human-level performance on the task
    4.3.4 Plot performance on increasing dataset size and visualize worst errors
5 Experimental setup
  5.1 Software and hardware used during experiments
  5.2 Preprocessing
    5.2.1 Dataset augmentation
    5.2.2 Contrast normalization
  5.3 Dataset generation
  5.4 Feature extraction
  5.5 Machine learning framework - Tensorflow
6 Results
  6.1 Results iteration 1
    6.1.1 Step 1 - Determine goals and measurements
    6.1.2 Dataset extraction
    6.1.3 Step 2 - Establish working end-to-end pipeline and baseline model
    6.1.4 Performance benchmarks for first model
    6.1.5 Step 3 - Determine bottlenecks in performance
  6.2 Results iteration 2
    6.2.1 Step 4 - Repeatedly make incremental changes
    6.2.2 Performance benchmarks for the second model
  6.3 Results iteration 3
    6.3.1 Data augmentation experiments
    6.3.2 Final performance comparison between all models
    6.3.3 Performance on production test set
7 Discussion
  7.1 Future work
8 Conclusion
Bibliography
1 Introduction
Computer vision and object classification have in the last couple of years been dramatically improved by advances in deep learning and convolutional neural networks (CNNs) [18]. In the 2012 ImageNet competition, the most reputable competition within computer vision, a group of researchers from the University of Toronto entered with a deep CNN algorithm called SuperVision. The team won the competition with an error rate of 16.4 percent, while the second-best entry had an error rate of 26.2 percent [34]. The results were revolutionary, and the advances in computer vision driven by CNNs have been acknowledged as one of the top 10 breakthroughs of 2013 [26].
A CNN's main power lies in its deep structure, which allows the network to create discriminating features whose level of abstraction increases with each layer [33, 36, 9, 32]. Advances in hardware, larger datasets and more complex models are key factors behind the recent success of CNNs. Further advances in the field are, however, not driven only by increasing complexity. GoogLeNet, Google's winning submission to ImageNet 2014, used 12 times fewer parameters and achieved significantly more accurate results than previous winners [17]. Recent research has started to investigate not only ways to improve the performance of CNNs measured in error rate, but also their performance measured in cost-effectiveness. For models to be put to real-world use, metrics like computational budget, memory consumption and dataset size requirements need to be considered [33].
Training a deep CNN from scratch can be both costly and complicated [10]. First, a large labeled dataset is required for training. In many domains, the amount of labeled data is limited, and collecting such a dataset might require experts to annotate images. Second, deep CNN training requires extensive memory and computational resources; lacking adequate hardware makes the training process extremely time-consuming. Lastly, to avoid overfitting and ensure convergence, the training process needs to be repeated iteratively, trying out different parameters in the model [34]. This requires experience and makes the process even more time-consuming.
To lower the cost of training CNNs, a promising alternative has emerged through research. Instead of training a CNN from scratch, an already trained CNN is used that has been trained on an existing large dataset from another domain. The CNN is then fine-tuned towards the target domain or task. This concept is called transfer learning. Using transfer learning, computer vision researchers have been able to significantly improve upon state-of-the-art performance on tasks within a large set of domains [30, 3, 24]. Yosinski et al. [35] emphasize the importance of further studies on the exact nature and extent to which transfer learning can be applied.
This thesis project is part of a computer vision internship at the game studio Uprise. Uprise is a sister studio to Dice, owned by the global video game company Electronic Arts (EA). In the game Battlefield 1, players can draw what Uprise calls "emblems". Emblems are images that a user can bind to their profile and also display on their weapons and vehicles inside the game. Emblems are not allowed to contain offensive material. If they do, players are able to report them, and EA must handle reported emblems in due time, often manually. Uprise would like to improve this process. During this thesis project, deep learning methods are evaluated on the task of filtering out these offensive emblems.
Problem formulation
This report sets out to answer the following problem formulation:
How well does a CNN perform on the task of classifying offensive drawings, created by players of the game Battlefield 1, when pre-trained on ImageNet and fine-tuned on the target dataset?
1.1 Thesis Structure
Chapter 2
The background chapter introduces essential machine learning concepts. Dataset partitioning strategies and the multi layer perceptron topology are explained. Furthermore, the chapter gives an introduction to convolutional neural networks.

Chapter 3
The related work chapter summarizes the body of research that has been done on transfer learning and CNNs. The chapter ends by describing the GoogLeNet architecture and its inception module.

Chapter 4
The Battlefield emblem system is explained in the method chapter, along with the guidelines used to approach the machine learning problem.

Chapter 5
The experimental setup chapter describes how the dataset was preprocessed and which augmentation techniques were used. The chapter also explains how the CNN was used as a feature extractor, and ends with a short introduction to the machine learning framework Tensorflow.

Chapter 6
The thesis work was divided into three iterations, each given its own section in the results chapter, along with performance benchmarks as the work proceeded. The results are analyzed and discussed alongside the presentation of the benchmarks, to make the reasoning easier to follow.

Chapter 7
The results are further discussed in the discussion chapter, which also covers future work.

Chapter 8
The final conclusion is given in the conclusion chapter.
2 Background
The thesis work relies heavily on research within the fields of machine learning, deep learning, and convolutional neural networks. This chapter presents an introduction to the terminology and concepts used throughout the study.
2.1 Definitions and terminology
This section introduces terminology that will be used in the thesis. The task of detecting spam or non-spam in emails will be used to illustrate the definitions, and a short code sketch after the list makes the terms concrete. The section is based on the definitions presented by Mohri et al. [20].
• Examples - The instances in the dataset, usually the rows of a matrix or database. In our spam detection problem, an email would correspond to an example in our dataset. Examples are used to train and evaluate the model [20].

• Features - The set of attributes associated with an example [20]. The attributes are often represented as a vector, which corresponds to the columns of a matrix or database. The name of the sender, the presence of certain keywords in the message, the message length etc. would be considered features in the email example.

• Labels - The category or class value assigned to examples. An example email would have a label of either spam or non-spam. When predicting a discrete value, the task is called classification; when the target value is continuous, the task is called regression.

• Hyperparameters - A model's configuration parameters are called hyperparameters, for example the number of iterations the model trains on the dataset, or the learning rate. Hyperparameters are not to be confused with the parameters of a neural network, also called weights, which are learned through backpropagation.
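As a minimal illustration of the terms above, the sketch below encodes the spam example as NumPy arrays. The feature names and values are invented for illustration.

```python
import numpy as np

# Each row is an example (an email); each column is a feature:
# [sender_known (0/1), contains_keyword (0/1), message_length]
X = np.array([
    [1, 0, 120],
    [0, 1, 45],
    [0, 1, 300],
])

# One label per example: 1 = spam, 0 = non-spam (a classification task,
# since the target value is discrete).
y = np.array([0, 1, 1])

# A hyperparameter is chosen by us rather than learned, e.g. a learning rate.
learning_rate = 0.01
```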
2.2 Feedforward Networks
The most essential part of a deep learning model is the feedforward neural network, also called a multilayer perceptron (MLP) or artificial neural network. Neural networks are inspired by the brain's information processing network, which is built up of neurons connected to each other in a large signaling network. Every neuron has multiple incoming connections. When a neuron receives incoming inputs, it sums them up, and if the value exceeds a given threshold, it fires. The signal is then passed on through connections to other neurons. Neural networks try to model this behavior. Figure 2.1 illustrates a single perceptron.
Fig. 2.1: Perceptron topology, illustration modified from Danilo Bargen [4]
The first layer is called the input layer and is often a vector of values, called a feature vector. Each input value x is multiplied with a weight w. The weights are also called parameters and are often denoted by the symbol θ. A bias term is often introduced as x0 and w0, and acts as a threshold value for the activation. The multiplied inputs are summed into a single value, which is then passed through an activation function f that produces an output. There are many types of activation functions. One of the simplest is the step function, which outputs a 1 if the input is higher than a given threshold, and 0 if it is below.
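The behavior just described can be sketched in a few lines of Python. This is a minimal illustration, with the bias folded in as x0 = 1 and w0 acting as the threshold term; the numbers are arbitrary.

```python
import numpy as np

def perceptron(x, w):
    """Single perceptron: weighted sum of inputs followed by a step activation."""
    z = np.dot(w, x)             # multiply inputs with weights and sum
    return 1 if z > 0 else 0     # step function

x = np.array([1.0, 0.5, -0.2])   # x0 = 1 (bias input), x1, x2
w = np.array([-0.1, 0.8, 0.4])   # w0 (bias weight), w1, w2
print(perceptron(x, w))          # 1, since z = -0.1 + 0.4 - 0.08 = 0.22 > 0
```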
To be able to produce functions more complex than linear ones, the model needs to be applied not only to x, but to a non-linear transformation of x. This can be seen as creating a new representation of x, made up by the network. This is done by adding hidden layers, which produce the new feature representation that helps the model find mappings that achieve the desired output. Figure 2.2 illustrates the topology of a multi layer perceptron (MLP). In this figure, each input node is connected to every neuron, each connection with its own weight. The layers between the input layer and the final output layer are called hidden layers. Just as in the perceptron example, the output layer takes a feature vector as input, but in this case the inputs are the values transformed by the hidden layers, not the raw data. The network then outputs a value based on the threshold and activation function.

Fig. 2.2: Multi layer perceptron topology, illustration modified from Satvik Beri [5]

The objective of a feedforward network, or multi layer perceptron, is to find a hypothesis function h for a function f [14].
When solving a classification problem, we want a hypothesis function that, given an example with feature vector x, outputs a class label prediction ŷ. The predicted label ŷ should be as close as possible to the ground truth class label y. A feedforward network finds the values for the parameters θ that result in the best approximation of the function: the network iteratively tunes the parameters to make the hypothesis function h, parameterized by θ, as similar as possible to the target function f [14].

$\hat{y} = h_\theta(x)$
The flow of information goes from an input x, through the intermediate computations that define hθ, and ends up in an output ŷ. There are no backward connections between neurons; the features found by intermediate layers are strictly passed forward. This is why these models are called feedforward networks. In a feedforward network, a chain of functions is composed together, often represented by a directed acyclic graph (DAG), as shown in the previous figures. This is why we call these models networks [14]. A simple example would be a network composed of three functions f(1), f(2) and f(3), connected in a chain to form the complete function [14]

$h_\theta(x) = f^{(3)}_\theta \left( f^{(2)}_\theta \left( f^{(1)}_\theta(x) \right) \right)$

In this example we call f(1) the first layer, f(2) the second layer and f(3) the final layer or output layer. The length of the chain is called the network's depth, making this a three-layer-deep network. This is also where the term "deep learning" comes from: the networks are composed of many layers, creating a deep network.
When we say that we "train" the network, we try to drive hθ(x) to match the target function f(x) [14]. The data in our dataset provides noisy and approximate examples of f(x), evaluated at different training points [14]. Every example x is associated with a ground truth label y. The dataset specifies what the last output layer needs to produce given the input x; what the layers in between should output is what the learning algorithm will learn. The learning algorithm tunes these layers by changing the weights θ to best implement an approximation of f(x). This is done through a technique called backpropagation: by propagating the mistakes backwards and tuning the weights to accomplish a better fit to the target function, we improve and learn a better function approximation hθ.
By comparing the network's output ŷ from the hypothesis function to the correct value y, we can estimate a distance between the guess and the correct answer. This comparison is done through a cost function J, also called a loss function. The goal of the algorithm is then to minimize J(θ) by tuning the weights of the network to produce the desired output. Mean squared error (MSE) is one of many cost functions, shown below, where m is the number of training examples in the dataset:

$\mathrm{MSE} = J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x_i) - y_i \right)^2$
By calculating the partial derivatives for each weight in the network, we can find in which direction each weight needs to be adjusted in order to produce an output that is closer to the target function output. Gradient descent is one of the most common algorithms used to accomplish this within neural networks. Figure 2.3 illustrates the concept.

Fig. 2.3: Gradient descent, illustration modified from Sebastian Raschka [25]

The amount of change applied to a weight, the step size, is determined by the gradient and a hyperparameter α called the learning rate. Setting a high learning rate makes the gradient descent algorithm take larger steps, and a low learning rate makes the steps smaller. The algorithm is repeated until convergence. In the equation below, the symbol := means that the left-hand side is updated with the value calculated on the right-hand side:

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$
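The update rule can be illustrated with a short sketch: batch gradient descent on the MSE cost defined above, using a linear hypothesis hθ(x) = θᵀx. The toy data is invented; it is generated by y = 2x + 1.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=500):
    """Minimize J(theta) = 1/(2m) * sum((h(x_i) - y_i)^2) for h(x) = theta^T x."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        errors = X @ theta - y        # h_theta(x_i) - y_i for every example
        grad = (X.T @ errors) / m     # partial derivative dJ/dtheta_j for each j
        theta = theta - alpha * grad  # theta_j := theta_j - alpha * dJ/dtheta_j
    return theta

# Bias handled as a column of ones, so theta = [w0, w1].
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
print(gradient_descent(X, y))  # converges towards [1.0, 2.0]
```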
2.3 Datasets
In most machine learning algorithms, the dataset is divided into three subsets: a training set, a validation set and a test set. For classification tasks, every example in the dataset contains a number of features x and a target value y. Figure 2.4 gives an overview of how datasets are normally split.
Fig. 2.4: Dataset partitioning
2.3.1 Training set
The training set is used during training to tune the parameters or weights θ of the model. In most applications, this is the largest of the three subsets.
2.3.2 Validation set
The validation set is used during training to find the best hyperparameters for the model. The set is used to make an intermediate estimate of how well the model would perform on data it has not trained on, to avoid overfitting to the training set. The performance during training is called the model's training error.
2.3.3 Test set
To evaluate the model's performance on completely unseen data, a test set is used. This evaluation is done when the model has finished training and has its hyperparameters and weights set. The performance on the test set shows how well the model generalizes, often measured as the test error or generalization error. All performance benchmarks are generated on the test set.
2.3.4 Test set distribution
When the dataset is divided into training and test sets, a couple of assumptions need to be made. Firstly, the training and test sets are assumed to have identical distributions, since they are drawn at random from the same distribution. Secondly, we assume that all examples in the dataset are independent of each other. These are called the i.i.d. assumptions (independent and identically distributed).
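A minimal partitioning sketch, assuming the dataset is held in NumPy arrays. Shuffling before splitting is what justifies treating the three subsets as random draws from the same distribution; the 70/15/15 proportions are an assumption, not taken from the thesis.

```python
import numpy as np

def split_dataset(X, y, train=0.70, val=0.15, seed=0):
    """Shuffle, then cut into training, validation and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))         # random order supports the i.i.d. view
    n_train = int(train * len(X))
    n_val = int(val * len(X))
    tr, va, te = np.split(idx, [n_train, n_train + n_val])
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])
```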
2.4 Capacity, Overfitting and Underfitting
The main challenge when constructing a machine learning algorithm is to create a model that performs well not only on the training data, but also on new unseen data [14]. The following section describes key concepts regarding this challenge.
The model's ability to fit the training set is called the model's capacity. Conceptually, it is the amount of freedom the model is given to calibrate itself towards the presented data. This could for example be the number of iterations the model gets to train on the data, or the number of parameters in the model.
A model with low capacity might struggle to fit the training data; this is called underfitting. On the other hand, a model with high capacity might become too specialized on the training set, essentially memorizing the output given a certain input, and will then struggle on unseen data; this is called overfitting. Figure 2.5 illustrates the difference by showing a model that tries to fit a line to an example dataset.
Fig. 2.5: Illustrative example of overfitting, underfitting and optimal capacity. Illustration modified from Amar Gondaliya [13]
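The effect in Figure 2.5 can be reproduced with a small experiment, sketched below with invented data. Capacity is controlled by the polynomial degree; note that the training error alone keeps falling as capacity grows, which is why a validation set is needed to detect overfitting.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)  # noisy target function

# Degree 1 underfits, degree 9 can overfit; something in between is closer
# to the optimal capacity for this data.
for degree in (1, 4, 9):
    coeffs = np.polyfit(x, y, degree)
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(degree, round(train_mse, 4))  # training error falls as capacity grows
```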
2.5 Convolutional Neural Networks
2.5.1 Convolutional layer
One of the most important layers in a convolutional network is the convolutional layer. The layer takes two arguments: input data and a kernel. The input data can be either the original image or the feature map of a previous layer. The output of a convolutional layer is called a feature map. The kernel is usually a square matrix, which slides over the image and "filters" it for features.

At each position, the kernel's weights are multiplied with the pixel values within the kernel, performing element-wise multiplication. The values are then summed into a single value, giving the activation at that spatial location. Figure 2.6 illustrates the process. If a specific feature is present in the input, the activation will be high. In the first layer of a CNN, the kernel weights often come to act as edge detectors, finding the presence of vertical and horizontal lines in the image. By convolving the image with a set of filters, a stack of filtered images is sent to the next layer.
Fig. 2.6: Illustration displaying the convolution operation [14]
How far the kernel is moved at each step is called the kernel's stride. A stride of one corresponds to moving the kernel one pixel at each step. The region within the focus of the kernel is called the kernel's receptive field. The weights in the kernel are the same for every position in the image, which is called weight sharing or weight tying. The stride of the kernel affects the size of the output feature map: a high stride shrinks the output. Figure 2.7 and Figure 2.8 illustrate the effect of the stride hyperparameter; the stride in this example is set to one. Figure 2.7 displays an input image of size 7 × 7 with a colored square showing the 3 × 3 kernel. The kernel is moved until it hits or would move past a border, resulting in an output feature map of 5 × 5. Figure 2.9 shows an input of the same image and kernel size, but with a stride of two. With a stride of two, the kernel can only be moved three times along a row before it has to be moved down. Figure 2.10 shows the resulting 3 × 3 feature map, with a sample of three activation outputs.

Fig. 2.7: A 7 × 7 image with a 3 × 3 kernel and a stride of one [7]

Fig. 2.8: The 5 × 5 output feature map [7]

Fig. 2.9: A 7 × 7 image with a 3 × 3 kernel and a stride of two [7]

Fig. 2.10: The 3 × 3 output feature map [7]
The final parameter that can be set in a convolutional layer is the number of zeroes added to all borders of the image, called zero padding or padding. Figure 2.11 shows a padding of two applied to a 32 × 32 × 3 image, resulting in a 36 × 36 × 3 image. Padding is used to preserve the size of the image during convolutions. Without padding, there is no input for the kernel outside the edges, so the kernel skips those positions; this results in a dimensionality reduction, which can be avoided with padding.
Fig. 2.11: A 32 × 32 image with a padding of two [7]
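The mechanics above can be condensed into a naive reference implementation. This is a sketch for clarity, not how CNN frameworks actually compute convolutions, and like most deep learning libraries it implements cross-correlation (no kernel flip). It reproduces the output sizes from Figures 2.7-2.10.

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Slide the kernel over the image; at each position, multiply
    element-wise with the receptive field and sum to one activation."""
    if padding:
        image = np.pad(image, padding)            # zeroes on all borders
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1   # output size formula
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)
    return out

img = np.arange(49, dtype=float).reshape(7, 7)
k = np.ones((3, 3))
print(conv2d(img, k, stride=1).shape)             # (5, 5), as in Figure 2.8
print(conv2d(img, k, stride=2).shape)             # (3, 3), as in Figure 2.10
print(conv2d(img, k, stride=1, padding=1).shape)  # (7, 7): padding preserves size
```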
2.5.2 Pooling layer
After the convolutional layer, a pooling layer [14] is often applied. The layer is sometimes called a downsampling layer, emphasizing its objective of decreasing the size of the image or feature map. A common pooling layer is the maxpooling layer, in which the highest value in the kernel's receptive field becomes the output of the operation. Figure 2.12 illustrates the pooling process of a 2 × 2 maxpooling kernel with a stride of two, slid across a 4 × 4 feature map. By sliding the kernel over the feature map, we both reduce the size of the feature map by summarizing "boxes" of it, and at the same time become less sensitive to the exact spatial location of a feature, while the location relative to other features is retained. In the maxpooling operation, it doesn't matter where in the receptive field the highest value is positioned.
Fig. 2.12: Image displaying the output of a 2 × 2 maxpool kernel, with a stride of two [7]
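A corresponding sketch of the maxpooling operation, using the same 2 × 2 kernel and stride of two as in Figure 2.12; the feature map values are invented.

```python
import numpy as np

def maxpool2d(fmap, size=2, stride=2):
    """Keep only the highest activation in each receptive field."""
    out_h = (fmap.shape[0] - size) // stride + 1
    out_w = (fmap.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = fmap[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

fmap = np.array([[1., 3., 2., 1.],
                 [4., 6., 5., 0.],
                 [1., 2., 8., 7.],
                 [0., 1., 3., 4.]])
print(maxpool2d(fmap))  # [[6. 5.]
                        #  [2. 8.]]
```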
2.5.3 Fully connected layer
The last layer of a convolutional neural network has the role of finding the connections between features and classes, and is called a fully connected layer (FCL). All the neurons in this layer are connected to all the neurons in the previous layer, much like a hidden layer in a multi layer perceptron. By the end of the network, the features generated by the previous convolutions have reached a level of abstraction where the representations can take the form of hand detectors, feet detectors, cat detectors etc. The fully connected layer has the same number of neurons as there are classes, and outputs a vector representing the activations for each class. The role of the FCL is to find mappings between the activations and a certain class. This mapping is learned through forward passes and backpropagation, as described in the feedforward network section. The layer before the final output layer produces the final feature map that will be used for classification. This layer is sometimes called the bottleneck layer, and the feature maps used as input to the final output layer are called bottlenecks. A common activation function for the final output layer is the softmax activation function. The function is used for multi-class classification and is a generalization of logistic regression. It produces a vector where each element represents a class and the probability that an example belongs to that class. The softmax output vector always sums to one.
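A short sketch of the softmax function described above; the class scores are invented.

```python
import numpy as np

def softmax(logits):
    """Exponentiate and normalize, turning raw class activations into a
    probability distribution that sums to one."""
    z = logits - np.max(logits)   # subtract the max for numerical stability
    exp = np.exp(z)
    return exp / exp.sum()

scores = np.array([2.0, 1.0, 0.1])   # one activation per class
probs = softmax(scores)
print(probs, probs.sum())            # ~[0.659 0.242 0.099] 1.0
```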
3 Related work
The chapter begins with an introduction to the concept of transfer learning, followed by research exploring transfer learning as a theoretical concept. The second part summarizes research applying transfer learning to real world problems. In the last section, the GoogLeNet architecture is introduced.
3.1 Transfer learning
Transfer learning has over the last couple of years become a viable and common solution when applying machine learning to real world problems. As stated in the introduction, the motivation is often that the process of training a network from scratch is too costly.

An assumption that often has to be made in machine learning is that the training dataset and future data have the same distribution and feature space [23]. When dealing with real world problems, this assumption is not always true. The dataset available might be small, and the task of labeling more data can prove expensive. On the other hand, we might have sufficient training data in another domain, but with a different feature space and distribution. To avoid the expensive operation of growing the target dataset, we want to transfer the knowledge learned in the other domain to the domain of our dataset. This is called transfer learning, and it has proven highly successful in recent years [23].
3.2 Research exploring transfer learning
Donahue et al. [8] explored how well a pre-trained state-of-the-art CNN generalizes to classification of images drawn from other domains. They took a state-of-the-art model, trained it on ImageNet, and then retrained the last layers on new datasets and tasks. The researchers examined three datasets: the SUN-397 dataset, containing scenes like a dinner or a mosque; an office dataset containing images of office products; and the Caltech-UCSD bird dataset. Their results showed that the generality and semantic knowledge learned in the pre-trained network tend to cluster images into semantic categories that the network was never explicitly trained on. Their results were among the best ever attained on these datasets. The model had been trained on the task of object recognition, but was also tested on scene recognition, a completely different task. The model performed surprisingly well, and was able to beat state-of-the-art accuracy by 2.9%.
Girshick et al. [12] propose an object detection algorithm that significantly improves on previous results on PASCAL VOC 2012. Their research builds on two insights. The first is to localize and segment objects into regions using bottom-up region proposals, and then apply state-of-the-art convolutional networks to these regions. The second is that it is highly effective to pre-train a CNN on an auxiliary task with large quantities of data and then fine-tune the network for the target task. They conclude that transfer learning is likely to be highly effective for a wide variety of computer vision problems where data is scarce.
Tajbakhsh et al. [34] set out to answer the following research question: Can the use of pre-trained deep CNNs, with sufficient fine-tuning, eliminate the need for training a deep CNN from scratch? Their experiments consistently demonstrated the following properties: 1) a pre-trained CNN with enough fine-tuning seems to outperform, or in the worst case perform on par with, CNNs trained from scratch; 2) a CNN trained using fine-tuning proves more robust to different training set sizes than a CNN trained from scratch; 3) neither tuning all layers, called deep tuning, nor tuning just the last layer, called shallow tuning, gave the best results; 4) the best performance was achieved by layer-wise fine-tuning, iteratively finding the optimal number of layers whose weights are fine-tuned during training.
Sinno Jialin Pan and Qiang Yang [23] conducted a survey study on transfer learning. In the survey, the authors categorize and review the current progress on transfer learning, and focus on defining the relationship between transfer learning and other related machine learning techniques. They conclude that most research shows that transferability is to a large degree related to how similar the source and target domains or tasks are. We still lack a similarity measure that defines the distance between domains or tasks, which is suggested for future research. The survey also covers what is called "negative transfer", when the transferred knowledge actually decreases model performance, which is likewise tightly coupled to source and target domain similarity.
In the paper "How transferable are features in deep neural networks?", Yosinski et al. [35] experimentally try to quantify the generality versus specificity of neurons in each layer of a CNN. A phenomenon observed across many CNNs is that the first layer often learns features for edge detection. This suggests that these features are somewhat general, in that they are useful beyond the current dataset and task. Layer by layer, the network needs to become more specialized towards the domain of the dataset and task, transitioning from general to specific. Yosinski et al. found two distinct issues with a negative impact on transferability. The first was that performance on the target task was negatively affected by the higher-level neurons' specialization towards their original task, which could be expected. The second was that splitting networks between co-adapted neurons created optimization difficulties. Either of these issues may dominate, depending on how many layers are "frozen" during retraining and fine-tuning towards the target domain and task. In line with previous results, the paper also shows that the transferability of features decreases with the similarity distance between the base and target tasks.
3.3 Research applying transfer learning
Saito and Matsui [28], in their paper on semantic vector representation of illustrations, highlight the fact that many studies have examined CNN performance on natural images, but that there is a lack of research focusing on illustrations. According to the authors, this is because of two technical issues. The first is the difficulty of recognizing illustrations, owing to their diversity of visual elements: eye sizes, shapes of faces and bodies etc. vary a lot, not only between different artists, but also between drawings by the same artist. The second issue is the lack of large open source datasets of illustrations. Large-scale annotated datasets like ImageNet are one of the driving factors behind the rapid development within image recognition; such a dataset for illustrations does not exist.
Esteva et al. [11], in their paper "Dermatologist-level classification of skin cancer with deep neural networks", researched the use of transfer learning in the context of dermatology. The study was very well received, and the researchers were able to produce a classification model that could classify skin lesion images with the accuracy of a board-certified dermatologist. They used the GoogLeNet model, pre-trained on ImageNet, and simply retrained it on their target dataset. An important note is that Esteva et al. had a large dataset of 129,450 clinical images. They used an interesting method of building a topological tree structure, summarizing the probabilities of each root node's children to produce the classification. The classifier matched the performance of professional dermatologists across critical diagnostic tasks for skin cancer, and is deployable on mobile devices. Several research papers within the field of medicine describe the effectiveness of feature extraction using pre-trained CNNs [31] [27] [15].
Al-Shabi et al. [29] propose an adult image recognition system that uses a mixture of CNNs. The most popular method to block access to websites presenting adult content is to search the site for restricted words, and more traditional methods have focused on handcrafting the features of adult images, like different positions and shapes. In contrast to these traditional methods of adult-content detection, their system is an end-to-end machine learning model. The researchers manually collected 41,154 adult images off the internet, and then used the ILSVRC-2013 dataset as non-adult images. An ensemble of CNN classifiers was used, with each classifier's prediction weighted by its performance on the test set. The final model yielded an impressive accuracy of over 96%.
Moustafa [21] also explored the use of deep learning for classifying pornographic images. One difference from Al-Shabi et al. was that Moustafa used AlexNet and GoogLeNet as feature extractors, using the output from the last convnet (convolutional neural network) layer. This allows the last-layer classifier to be replaced with any kind of classifier, e.g. a Support Vector Machine (SVM). The effect is a model that requires much less data to be trained, because it has fewer parameters that need to be adapted. By combining the predictions from both AlexNet and GoogLeNet into an ensemble convnet with different last-layer classifiers, the author observed a significant increase in test set performance. The predictions from each classifier were weighted by the classifier's performance during testing. In a study by Zhou et al. [37], results showed that an ensemble of CNNs can produce state-of-the-art results on pornographic image classification. According to Zhou et al., a common technique for categorizing images as pornographic is based on image retrieval technology: a large image database with vast amounts of pornographic and normal content is first created, the image to be classified is then used as query input and compared with the images in the database, and the classes of the retrieval results determine the class of the input image. The problem with this method is that, due to the high variety of adult images, it has proven difficult to build a database that covers a large enough set of images.
Several studies in the last year have shown the effectiveness of applying CNN ensemble classifiers and transfer learning to real world problems. Huynh et al. [16] achieved state-of-the-art results on digital mammographic tumor classification by using transfer learning combined with an ensemble of classifiers. Akcay et al. [1] applied transfer learning to the problem of x-ray baggage image classification; their model achieved 98.92% detection accuracy, outperforming previous work in the field. In the study "Transfer Learning with Convolutional Neural Networks for Classification of Abdominal Ultrasound Images", Cheng and Malhi [6] evaluated the use of CNNs and transfer learning within the field of abdominal ultrasound imaging. Their results show that their CNN model achieved a classification accuracy slightly surpassing that of human radiologists.
3.4 GoogLeNet
In the ILSVRC14 competition, Google competed and won with a CNN model called GoogLeNet [33]. The model had a top-five error rate of 6.7%, pushing the state of the art. The revolutionary part of the algorithm was its architecture: having only 22 layers, GoogLeNet uses twelve times fewer parameters than AlexNet, breaking the trend of ever deeper CNN architectures. The depth of a model has a huge impact on memory consumption, which engineers at Google realized would become a bottleneck when applying CNNs to real world applications. Very deep models might produce better results measured in accuracy, but can never be deployed on, for example, a mobile device. Figure 3.1 shows the complete network architecture.
Fig. 3.1: GoogLeNet CNN architecture. Illustration taken from the research paper "Going Deeper with Convolutions" [33]
3.4.1 The inception module
To achieve this more memory- and cost-efficient model, researchers at Google came up with a module they call "Inception". The module architecture is shown in Figure 3.2. The architecture makes use of a technique called "Network-In-Network" (NIN), presented in the paper of the same name by Min et al. [19]. Instead of applying a linear operation in the convolutional layer, a multi layer perceptron is used to capture the feature concepts in the input. The use of an MLP has been shown to do a better job of extracting features at each spatial location [19]. Figure 3.3 illustrates the Network-in-Network concept.

A 1x1 layer can, for example, be used to reduce a feature map of size 512x512x80 to one of size 512x512x40 by applying 40 filters in the 1x1 convolution. The 1x1 convolutional layers displayed in the Inception module illustration are NIN layers. They are placed before the computationally more expensive 3x3 and 5x5 convolutions to reduce the dimensionality of the input.
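The saving can be shown with back-of-the-envelope arithmetic. The layer sizes below are invented for illustration (they are not taken from the GoogLeNet paper); the point is the order-of-magnitude reduction in multiplications.

```python
# 5x5 convolution on a 28x28 feature map with 192 channels, producing 32
# output channels: multiplies = positions * out_channels * kernel volume.
direct = 28 * 28 * 32 * (5 * 5 * 192)        # ~120M multiplies

# Same output, but first reduce 192 -> 16 channels with a 1x1 NIN layer.
reduce_1x1 = 28 * 28 * 16 * (1 * 1 * 192)    # ~2.4M multiplies
then_5x5 = 28 * 28 * 32 * (5 * 5 * 16)       # ~10M multiplies

print(direct, reduce_1x1 + then_5x5)         # roughly a 10x saving
```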
Fig. 3.2: Inception module illustration [33]
Instead of having only a single convolution, the inception module has a composition of convolutions of differing sizes. This effectively lets the model "choose" whether it should use a 5x5, a 3x3 or a 1x1 convolution etc. at multiple layers. This keeps down the total number of parameters in the model while performing better than if the layer had just a single convolution.
Fig. 3.3: Figure illustrating the difference between a normal linear convolution layer and an MLPconv layer. Illustration taken from the paper "Network-In-Network" [19]
4 Method
This chapter begins by presenting what Battlefield emblems are, how they are created, and how they are reported. The next section introduces a workflow for machine learning problem solving that has been suggested by researchers within the field. The guidelines introduced are applied throughout the process of producing the thesis results.
4.1 Emblems in Battlefield
4.1.1 How players create and use emblems in Battlefield 1 and Battlefield 4
Uprise is an Electronic Arts studio located in Uppsala. The studio's main responsibility is the online platform and user interface surrounding the games, where players socialize, join games, buy merchandise etc.

One of the features provided on this platform is the possibility to create your own "emblem". An emblem is an image that is associated with a player profile and also displayed in the game, on weapons and vehicles. A player can either choose to import an already existing emblem from another player, or create their own. The platform presents players with a web editor where they can draw their own emblem. Unlike common painting tools like Paint, players are not given a brush, but instead a list of 105 symbols that can be composed into an emblem. The size, color and orientation of each symbol can be adjusted by the player. The editor also has a layer structure: a symbol can be put behind or in front of another symbol. When a player submits their emblem, it is stored as a PNG, and no check is made as to whether the emblem already exists in the database. The PNG is then exposed through a unique URL. Figure 4.1 shows a screenshot of the web editor in Battlefield companion.
Fig. 4.1: Screenshot capture from the Battlefield companion emblem editor

Fan-based web pages like http://emblemsbf.com/ provide galleries where players can share their emblem creations. This results in significant reuse of certain well-crafted emblems. Note that Battlefield companion does not allow players to import images that have not been created through the Battlefield web editor.
4.1.2 How offensive emblems are handled in the Battlefield games

Players can report other players' emblems if they find them offensive. Reported emblems are sent to the customer service department at Electronic Arts, where employees decide whether a reported emblem should be banned. If an emblem is banned by customer service, it is flagged as "hidden" in Battlefield's emblem database. No additional metadata is currently stored except the date of the change.
4.2 Method to approach the problem
The problem was approached according to guidelines presented in the book "Deep Learning" by Goodfellow et al. [14].
4.2.1 Step 1 - Determine goals and measurements
The first step in applying machine learning to a problem is to determine the goals of the project, what metrics to use, and which target values the project should satisfy [14].
4.2.2 Step 2 - Establish working end-to-end baseline model
The next step is to establish a working end-to-end pipeline for the machine learning
task and measure performance on a first baseline model.
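A hedged sketch of what such a baseline could look like in TensorFlow, the framework used in this thesis. tf.keras does not ship GoogLeNet (Inception v1), so InceptionV3 stands in for it here; the two output classes, frozen base and input size are assumptions for illustration, not the thesis configuration.

```python
import tensorflow as tf

# Pre-trained ImageNet feature extractor with the classification head removed.
base = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False,
    input_shape=(299, 299, 3), pooling="avg")
base.trainable = False  # freeze pre-trained layers; train only the new head

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(2, activation="softmax"),  # offensive / non-offensive
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```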
4.2.3 Step 3 - Determine bottlenecks in performance
According to Goodfellow et al. [14], the following questions are of great importance when trying to determine bottlenecks in performance:
• Is the model overfitting?
• Is the model underfitting?
• Are there defects in the dataset?
• Are there defects in software?
4.2.4 Step 4 - Repeatedly make incremental changes
The last step when applying machine learning is to iteratively make changes that improve performance. The following tasks are often applied at this stage:
• Gather new data
• Adjust hyperparameters
• Change algorithm if necessary
4.3 Additional guidelines when applying machine learning
4.3.1 The process of knowing what to do next
Andrew Ng argues that the process of applying deep learning in practice is still being researched, but presents a few guidelines from his experience. When deep learning is applied in practice, Ng argues that engineers often struggle to know what should be done next [22]. Ng presents a flow-chart approach for how resources are best spent in many situations, depending on performance benchmarks during training and testing.

Fig. 4.2: Flow-chart displaying the process of applying deep learning. Illustration taken from "Nuts and Bolts of Applying Deep Learning" [22]
If the training error is high, called underfitting, the first thing to do is to make the
model bigger. In this situation, the model is not able to capture the structure of the
data and needs more freedom to adjust and fit. Training the model longer on the
dataset should also be evaluated. If the previous approaches don't work, the model
architecture might have to be changed [22]. If nothing works, the quality of the
data might be the problem. The data could be too noisy, or it might not include
features that make it possible to predict the output. The solution to this problem is
to start over and collect cleaner data, or a dataset with a richer set of features.
If the error on the training set is low but the validation error is high, called overfitting,
then the model is not generalizing. In most situations, the best option is to put
effort into obtaining more data. Adding or increasing regularization, for example
by decreasing the number of training epochs, can also improve performance
during testing. If these measures don't increase the performance on the test set, a
different model architecture might be the last option.
The development test set (dev test set) is used to produce intermediate performance
results once a classifier has been trained using the training set and fine-tuned using
the validation set. When the validation error is low but the error on the dev test
set is still high, the best option is again to extract more data and make sure that
the data trained on is similar to the data the model is tested on. Synthesizing
data, for example by creating new rotated images or adding random noise, can be
an option to increase the dataset size.
The production test set (prod test set) should be extracted from the target application
domain and have a data distribution that is identical to the domain where the model
will be run. The work is done when the performance on the final production test set
is satisfactory.
4.3.2 Create a common data warehouse
Ng suggests that creating a common data warehouse for the project speeds up
development, making sure that the latest dataset are always reachable by the
engineers in the project.
4.3.3 Determine human-level performance on the task
Determining human-level performance on the task, measured in accuracy, gives an
idea of where the theoretical limit of performance lies, often called the optimal
error rate. A dataset containing images often has some examples that are so
blurry or misleading that they simply cannot be labeled into a category with
high confidence. Humans perform well on many of the tasks that are normally
targeted with deep learning, leaving the gap between the optimal error rate and
human-level performance relatively small. When iterating and improving the
algorithm, progress is easier while model performance is still below human-level
performance.
4.3.4 Plot performance on increasing dataset size and
visualize worst errors
Running experiments using 1/8, 1/4, 1/2 etc. of the dataset gives insight into the
expected performance gains if more data were extracted. A final tip presented
by Ng is to visualize the model's worst errors. Looking at the incorrect classifications
with the highest confidence can often reveal data that is incorrectly labeled, and will
give a better understanding of which examples the model struggles with.
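A minimal sketch of such a dataset-fraction experiment is shown below. The callable train_and_evaluate is a hypothetical stand-in for the project's training pipeline; it is not a function from the thesis.

```python
import random

def learning_curve(dataset, train_and_evaluate,
                   fractions=(1/8, 1/4, 1/2, 1.0)):
    """Train on growing random subsets to estimate the expected gain
    from collecting more data. train_and_evaluate(subset) is assumed
    to return a scalar performance score such as accuracy."""
    results = {}
    for frac in fractions:
        subset = random.sample(dataset, max(1, int(len(dataset) * frac)))
        results[frac] = train_and_evaluate(subset)
    return results  # e.g. {0.125: 0.88, 0.25: 0.91, ...}
```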
5 Experimental setup

The material and setup used to produce the experimental results are explained in this
chapter. The preprocessing techniques and dataset augmentation procedures
are described, together with the dataset generation method.
5.1 Software and hardware used during
experiments
All experiments were conducted on the following hardware:
• Intel Xeon CPU E5-1650 v3 3.5GHz 12 vCPUs
• NVIDIA GeForce GTX 980, 2048 CUDA cores.
• 32GB RAM
The following software versions were used during classifier testing/training:
• Python 3.4
• Tensorflow 1.1.2
• Ubuntu 16.04
5.2 Preprocessing
Images need to have a standardized pixel range, for example the range [0,1] or
[0,255]. This is the only preprocessing that is strictly required when running images
through a CNN.
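As a small sketch, the rescaling of an 8-bit image to the range [0,1] can be done in a single NumPy operation:

```python
import numpy as np

def to_unit_range(image):
    """Rescale an 8-bit image (pixel values 0-255) to the range [0, 1]."""
    return image.astype(np.float32) / 255.0
```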
5.2.1 Dataset augmentation
More data can be produced by augmenting existing images, synthesizing additional
data. Adding "noise" to images by rotating them, randomly adjusting brightness etc.
are examples of augmentation techniques. The following table describes the distortions
applied and experimented with during the thesis work. Figure 5.1 illustrates the
rotation technique applied to some emblems.
Tab. 5.1: Data augmentations

Distortion type      Description
Random scale         Randomly scale the image by x%
Random crop          Randomly crop the image by x%
Random brightness    Randomly adjust the image brightness by x%
Rotation             Rotate in 20-degree steps up to 340 degrees, synthesizing 17 images
Fig. 5.1: Rotation augmentation example
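The rotation row of Tab. 5.1 can be sketched as below, using Pillow; the exact implementation used in the thesis is not specified, so this is only an illustration of the 20-degree-step scheme.

```python
from PIL import Image

def rotation_augment(path, step=20, stop=340):
    """Rotate an emblem in 20-degree steps up to 340 degrees,
    synthesizing 17 extra images per emblem (340 / 20 = 17)."""
    original = Image.open(path)
    return [original.rotate(angle, expand=False)
            for angle in range(step, stop + 1, step)]
```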
5.2.2 Contrast normalization
The magnitude of the difference between the bright and the dark pixels in an image is
called the image contrast. The variation in contrast between images can often safely be
removed, to reduce variance and remove the need for the model to learn how to handle
multiple contrast scales. One way to achieve this is global contrast normalization,
which normalizes the contrast in every image.
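A minimal sketch of global contrast normalization is shown below; the thesis does not specify the exact variant used, so the scale factor and epsilon are illustrative assumptions.

```python
import numpy as np

def global_contrast_normalize(image, s=1.0, eps=1e-8):
    """Subtract the mean pixel value and divide by the standard
    deviation of the pixels, putting every image on a comparable
    contrast scale."""
    x = image.astype(np.float32)
    x -= x.mean()
    return s * x / max(np.sqrt((x ** 2).mean()), eps)
```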
5.3 Dataset generation
The complete dataset was iteratively divided into the following parts. An increasing
number of emblems became available as the thesis work progressed. In an effort to
keep the class distribution within the complete dataset consistent, the labeling
process aimed at a division of 45% non-offensive emblems, 30% swastika emblems
and 25% penis emblems. A more detailed explanation of the emblem categorization
and extraction process is presented in the results chapter. The dataset during
iteration one had a size of 5000 emblems, iteration two 10 000 emblems and the
third iteration 17 377 emblems. The class distribution between the training,
validation and dev test sets was close to identical.
• Training set 80% - At the end of each thesis iteration, 80% of the images were
drawn at random from the dataset and put into a separate training set (see the
split sketch after this list). This set was used to tune the weights/parameters of
the model.
• Validation set 10% - The validation set was used solely to graph the model's
estimated generalization error for each epoch during training. 10% of the
dataset was set apart for this. The validation set is normally used to tune the
model's hyperparameters.
• Development test set 10% - Henceforth, the development test set is called the dev
test set. When the model had been fully trained, performance benchmarks were
run on this set, kept separate from the training process.
• Production test set - Henceforth, the production test set is called the prod test
set. At the end of the third iteration, 3650 emblems were drawn at random
from the emblem database containing 8 032 703 emblems. The MD5 hashes
of these images were then compared to the 17 377 emblems in the already
labeled dataset to make sure the model had never seen the emblems before.
523 emblems were matched and removed in this process, yielding a final
production test set of 3127 emblems.
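The split and the MD5-based duplicate check can be sketched as below; the function names and the bytes-keyed representation are assumptions for illustration, not the thesis's actual code.

```python
import hashlib
import random

def split_dataset(images):
    """80/10/10 split as described above. Each image (raw bytes) is
    keyed by its MD5 hash so exact duplicates cannot end up in
    different sets."""
    unique = {hashlib.md5(img).hexdigest(): img for img in images}
    items = list(unique.values())
    random.shuffle(items)
    n = len(items)
    train = items[: int(0.8 * n)]
    val = items[int(0.8 * n): int(0.9 * n)]
    dev_test = items[int(0.9 * n):]
    return train, val, dev_test
```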
5.4 Feature extraction
The CNN architecture GoogLeNet, pretrained on the image dataset ImageNet, was used
as a feature extractor during the thesis work. Figure 5.2 shows a sample from
the ImageNet dataset. Emblem images were fed through the CNN, and the output
feature map (called bottlenecks) of the last convolution layer was then used to
train an MLP to classify the Battlefield 1 emblems (a sketch follows Figure 5.2). The
feature map produced is a vector of length 2480, each element being a feature
represented by a real number between zero and two.
Fig. 5.2: ImageNet sample. Image taken from Stanford Vision Lab [2]
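A minimal sketch of bottleneck extraction with a frozen pretrained graph is shown below. The graph file path and the input/bottleneck tensor names are assumptions passed in by the caller; the thesis used a GoogLeNet graph pretrained on ImageNet, but its exact tensor names are not stated.

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x, as used in the thesis

def extract_bottleneck(graph_path, image, input_name, bottleneck_name):
    """Run one preprocessed image through a frozen CNN graph and return
    the bottleneck feature vector used to train the MLP classifier."""
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(graph_path, "rb") as f:
        graph_def.ParseFromString(f.read())
    with tf.Graph().as_default() as graph:
        tf.import_graph_def(graph_def, name="")
        with tf.Session(graph=graph) as sess:
            bottleneck = sess.run(
                graph.get_tensor_by_name(bottleneck_name),
                feed_dict={graph.get_tensor_by_name(input_name): image})
    return np.squeeze(bottleneck)  # 1-D feature vector
```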
5.5 Machine learning framework - Tensorflow
TensorFlow is an open-source software library for machine learning and numerical
computation. The framework was developed by the Google Brain team within
Google's Machine Intelligence research organization. In TensorFlow, a data flow
graph is defined, where each node represents a mathematical operation. The edges
between the nodes represent multidimensional data arrays, called tensors, that
are passed between the nodes. The abstraction of a computational graph makes
it possible to deploy the computation to multiple CPUs or GPUs, and to different
devices, with the same API.
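The graph-then-run model can be illustrated with a minimal TensorFlow 1.x example: operations are declared as graph nodes, and nothing is computed until the graph is executed in a session.

```python
import tensorflow as tf  # TensorFlow 1.x API

# Declare graph nodes; edges between them carry tensors.
a = tf.constant([[1.0, 2.0]])      # 1x2 tensor
b = tf.constant([[3.0], [4.0]])    # 2x1 tensor
product = tf.matmul(a, b)          # matrix-multiplication node

# The computation only runs when the graph is executed in a session.
with tf.Session() as sess:
    print(sess.run(product))       # [[11.]]
```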
6 Results
6.1 Results iteration 1
6.1.1 Step 1 - Determine goals and measurements
The first step was to determine the goals of the project. In discussion with Uprise,
the following objectives were decided for the thesis:
• The project should produce a categorizing service that, when presented with
an emblem, will flag the emblem as offensive or not.
• The project should provide a good overview of the model's strengths and
weaknesses.
No explicit key metric or target values were added to the objectives. The task of
filtering out offensive content is similar to the task of spam detection in some ways.
Both are binary classification tasks, and in both tasks the cost of incorrectly
classifying an example as offensive/spam is higher than the cost of permitting
an offensive/spam example. The dataset also has a heavily skewed class
distribution: about 99% of the emblems are non-offensive, judging from a sample
of 1000 emblems drawn at random from the eight-million-emblem dataset.
The uneven class distribution renders metrics like accuracy and error rate less useful
when evaluating the model on real-world samples. On a set randomly picked from
the real-world dataset, a model that classified all examples as non-offensive
would on average get an accuracy around 99%, which is misleading. We are not
that interested in the examples that the model correctly flags as non-offensive, so we
want to use metrics that don't take true negatives into account.

The primary focus is to minimize the number of examples that the classifier incorrectly
flags as offensive, covered by the precision metric. A secondary goal is to catch
as many offensive emblems as possible in the filter, covered by the recall metric.
The F-measure takes both precision and recall into account.
TP = True Positive = Correctly predicting an offensive emblem as offensive
TN = True Negative = Correctly predicting a non-offensive emblem as non-offensive
FP = False Positive = Incorrectly predicting a non-offensive emblem as offensive
FN = False Negative = Incorrectly predicting an offensive emblem as non-offensive
$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{F-measure} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
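These definitions translate directly into code; the sketch below computes all three metrics from confusion-matrix counts.

```python
def metrics(tp, fp, fn):
    """Compute precision, recall and F-measure from the
    confusion-matrix counts defined above."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Example call with the counts from Tab. 6.9: metrics(245, 16, 13)
```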
An important note is that the class distribution will be fairly even during
training, rendering accuracy a useful measurement for model evaluation; accuracy
will be considered the key metric. The dev test set has the same class distribution
as the training set.
6.1.2 Dataset extraction
Dataset labeling process
There are 8 032 703 uploaded emblems in Battlefield 1 as of April 2017. The
emblems that have been reported and marked as offensive by customer service
constitute the offensive dataset, which consists of 4730 images for the game
Battlefield 1. No additional data was stored regarding the images. In order to get a
sense of the distribution within the offensive dataset, it was sorted into
offensive categories. The following categories were decided:
Tab. 6.1: Categories within the offensive dataset
Nazi symbol Penis Nude Text Miscellaneous
Emblems were labeled miscellaneous when none of the other labels applied. Fig-
ure 6.1 illustrates samples from each category.
Fig. 6.1: Sample emblems. From left to right: nude, miscellaneous, nazi symbol, penis andtext.
Dataset Distribution
Tab. 6.2: Emblems hidden by customer service at Dice, categorized

Nazi symbol  Penis  Nude  Text  Miscellaneous  Total
2942         1146   265   110   267            4730
Fig. 6.2: Distribution among hidden emblems in BF1
The distribution between the classes is shown above. Most of the offensive emblems
are Nazi symbols, followed by penis illustrations. To get a further understanding
of the kinds of emblems that are common in Battlefield, the 10 000 most
used emblems were extracted. This was done by computing an MD5 hash of every
emblem, grouping all emblems with the same hash, and then sorting the groups by
number of occurrences (a sketch is shown after the figures below). The top 10 000
emblems are reused by players 1 557 720 times. Figure 6.3 shows the distribution
among the top 1000 emblems, after manually sorting the set. Figure 6.4 displays the
distribution between the offensive categories found in the top 1000 emblem dataset.
The most common offensive classes are nude and miscellaneous. The reused drawings
mostly consist of advanced illustrations, having multiple layers and being more artistic
than the average emblem. One plausible explanation for why nude images are reused
the most could be that they are too hard for the average player to draw themselves.
In contrast, most people are capable of drawing a swastika or a penis.
Tab. 6.3: Distribution among top 1000 emblems after manual categorization

Non-offensive  Nazi symbol  Penis  Nude  Text  Miscellaneous  Total
907            3            8      50    4     28             1000
Fig. 6.3: Distribution among all top 1000 emblems in BF1
Fig. 6.4: Distribution between offensive emblems in the top 1000
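The hash-group-sort extraction described above can be sketched as below; the function name and the bytes-based representation are illustrative assumptions.

```python
import hashlib
from collections import Counter

def top_emblems(images, n=10000):
    """Hash every emblem (raw bytes), group identical images by their
    MD5 hash, and sort the groups by number of occurrences."""
    counts = Counter(hashlib.md5(img).hexdigest() for img in images)
    return counts.most_common(n)  # [(md5_hash, occurrences), ...]
```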
6.1.3 Step 2 - Establish working end-to-end pipeline and
baseline model
A database was set up to store emblems and their labels. To make sure that the
dataset did not include any duplicate emblems, an MD5 hash was computed for
each emblem and used as the key in the database.

To streamline the process of collecting experimental results, a database was set up
to automatically store classifier hyperparameters, which labels were included
as offensive, and the classifier's performance on the test set. The end-to-end pipeline
was set up in Google's open-source machine learning library TensorFlow, using the
Python API. To avoid dependency issues and ensure deployment stability, the project
made heavy use of containerization using Docker.
The pipeline looked as follows:

1. Choose which labels should be considered offensive (so that future
models can include new categories as offensive).
2. Define a hyperparameter configuration file for the run, which includes
parameters like the number of epochs, learning rate, training batch size
etc. (a hypothetical example is shown after this list).
3. The pipeline then fetches the latest dataset from the database and spawns a
Docker container, which performs the classifier training and outputs the trained
classifier as a graph file. Bottlenecks were produced once and reused via a cache
folder, making repeated runs significantly faster. The classifier's performance
is then automatically stored in the database.
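A hypothetical configuration for one run might look as follows; the key names are illustrative, but the values mirror the training parameters reported later in this chapter (Tab. 6.5).

```python
# Hypothetical hyperparameter configuration for one pipeline run.
# Key names are assumptions; values match Tab. 6.5 in this chapter.
run_config = {
    "offensive_labels": ["swastika", "penis"],
    "training_epochs": 4000,
    "learning_rate": 0.01,
    "train_batch_size": 100,
}
```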
The dataset distribution used to train the first baseline model is shown below. The
training batch size is the number of images used in each epoch for the forward
pass and backpropagation. The data was not augmented in any way during iteration
one or two. An accuracy of 91.6% on the test set was recorded for the baseline
model.

The dataset used for the baseline model was created without applying any fine-
grained separation within the offensive dataset, including images from all categories
as offensive.
Tab. 6.4: Dataset baseline model
Non-offensive Offensive Total
1000 4000 5000
Tab. 6.5: Training parameters baseline model
Number of training epochs Learning rate Batch size during training
4000 0.01 100
Tab. 6.6: Performance on test set

Model      Accuracy  F-measure  Precision  Recall  Test set size
base-line  0.9162    0.7843     0.7018     0.8889  542
6.1.4 Performance benchmarks for first model
During the manual process of labeling images into more fine-grained categories,
it was concluded that the amount of variety between images within some classes
was very high. A sample from the miscellaneous-labeled emblems is shown in
Figure 6.5.

Fig. 6.5: Sample from the miscellaneous category

The decision was made to limit the scope of the thesis to filtering the
emblems containing swastikas and penises. This was because swastikas
were seen as the most offensive category, and these were also the largest offensive
categories in the hidden dataset.
Most of the images in the Nazi symbol category are swastikas, so it was decided
to first consider swastikas, cleaning the Nazi symbol category so that it only
contains swastikas. Other symbols, like the blood drop cross, confederate flags etc.,
were put into a separate category. The dataset was swept through a second time,
resulting in a few images being found in the wrong category. After the dataset
clean-up, the model was trained again. By collecting more non-offensive labeled
emblems, both from the top 10 000 dataset and drawn at random from the eight-
million set, the dataset was changed to contain a more even distribution between
offensive and non-offensive emblems.
Hyperparameters and training set
Tab. 6.7: Training parameters
Training epochs Learning rate Training batch size
4000 0.01 100
Tab. 6.8: Dataset used during training in iteration one
Non-offensive Swastikas Penis Total
2248 1539 1211 5000
Samples from the labeled categories are shown below as thumbnails. These were
the emblems used during training. The categories non-offensive, swastika and penis
were used to train the model, resulting in a multi-class classification problem. During
testing, performance is measured on the binary classification task of determining
whether an emblem is offensive or non-offensive. If the classifier guesses penis or
swastika, the guess is coded as an offensive guess (see the sketch after Figure 6.6).
(a) Non-offensive (b) Swastikas (c) Penises
Fig. 6.6: Emblem thumbnails from each of the categories
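The coding of a multi-class prediction into the binary task can be written as a one-line rule; the label strings below are illustrative.

```python
def to_binary(prediction):
    """Code a multi-class prediction into the binary offensive /
    non-offensive task used during testing."""
    if prediction in ("swastika", "penis"):
        return "offensive"
    return "non-offensive"
```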
Performance during training and test
The model took four minutes to generate bottlenecks (feature maps from the pre-
trained GoogLeNet CNN) and 15 minutes to fine-tune the fully connected layer. For
comparison, this procedure took four hours when run on the CPU instead of the
GPU.

Fig. 6.7: Accuracy plot during training. Performance on the training batch in orange, validation performance in turquoise. The x-axis shows the number of epochs.

Fig. 6.8: Cross-entropy plot during training. Performance on the training batch in orange, validation performance in turquoise. The x-axis shows the number of epochs.
Figure 6.7 plots the accuracy for each epoch on the training batch and the validation
set. The faded line is the actual performance at each epoch, and the solid line displays
the smoothed-out performance across epochs, to more easily show the trend.
Note that the training set performance is evaluated on the last 100 images, which
gives rise to the large jitter in performance across epochs. The validation performance
is evaluated on the complete validation set, 542 images, every ten epochs, making it
much less prone to jitter.

The accuracy on both the training and validation sets increases drastically during
the first 200 iterations. Performance on the validation set seems to stop increasing after
around 2000 iterations, while performance on the training set continues to improve,
reaching 100% accuracy on some training batches between epochs 3500 and 4000.
The gap in accuracy between training and validation is close to 5% by the end of
epoch 4000. Figure 6.8 plots the cross-entropy at each epoch, confirming the
performance improvement shown in the accuracy plot.
              Predicted Pos   Predicted Neg   Total
Actual Pos    TP = 245        FN = 13         258
Actual Neg    FP = 16         TN = 268        284
Total         261             281             542

Tab. 6.9: Confusion matrix for the first iteration model
Performance across all measurements is shown in Table 6.10. The improvement
in accuracy largely depends on changing the rules for which emblems are
considered offensive in the dataset. Only considering swastikas and penises as
offensive gives the classifier a more well-defined concept, which proves easier
to separate from non-offensive emblems.
Tab. 6.10: Performance on dev test set

Model            Accuracy  F-measure  Precision  Recall  Dev test set size
base-line        0.9162    0.7843     0.7018     0.8889  542
first iteration  0.9465    0.9441     0.9387     0.9491  542
Misclassified images
Fig. 6.9: Penises misclassified as non-offensive
Fig. 6.10: Non-offensive misclassified as penis
Fig. 6.11: Swastikas misclassified as non-offensive
Fig. 6.12: Non-offensive misclassified as swastika
In Figure 6.9, emblems 2, 3, 4 and 6 are penis illustrations that are in line with how
most penis emblems look. Emblem 7 has the wrong ground truth label; the
emblem is a bandanna and is one of the web-editor drawing symbols. Emblem 5
could be considered correctly labeled as a penis illustration: the illustration depicts
an armed soldier with a bullet starting at the crotch.

In Figure 6.10, the first and the last emblems have been incorrectly classified as
penises. Both are characters from the cartoon show "SpongeBob SquarePants". The
character furthest to the right is "Patrick", a common character
in emblems. A sample of the "Patrick" drawings from the non-offensive category is
shown in Figure 6.13, with the incorrectly classified emblem furthest to the right.
The pink color, combined with Patrick's pointed head and eyeballs, proves hard for
the classifier to separate from a drawing of a penis.
Fig. 6.13: Emblems from the non-offensive category containing the SpongeBob character Patrick
A pattern that can be found in some non-offensive emblems classified as swastikas
is the presence of an eagle in the center of the image. This is a common
pattern for swastikas, as shown in Figure 6.14.

Fig. 6.14: A small sample of the emblems in the swastika category containing eagles
6.1.5 Step 3 - Determine bottlenecks in performance
The accuracy reaches above 99% on the training batches when the model is given
enough training epochs. The error on the validation set is considerably larger, indicating
a problem due to overfitting or high variance. Monitoring the validation performance
across epochs indicates that the problem is not due to excessive training; the
performance on the validation set shows no indication of either dropping or
improving after 4000 epochs.

As presented in the method chapter, the options in this kind of situation
are typically to gather more data, add or increase regularization, or try a new
model architecture. Gathering more data is often the best alternative to start with,
according to Ng [14], and was chosen as the goal for the second iteration.
6.2 Results iteration 2
6.2.1 Step 4 - Repeatedly make incremental changes
Common data warehouse and web-labeling service
To reduce the gap in accuracy between training and testing, the goal of the second
sprint was to increase the size of the labeled dataset. After researching databases
for previous Battlefield games, another 25 000 emblems that had been marked as
offensive were extracted. After discussions with Uprise, employees volunteered to
help out with the fine-grained labeling of the dataset. In order for this labeling to be
done, the database set up for the thesis needed to be exposed for labeling by others
than myself. Previously, the labeling was done solely on my local workstation.

The next step in the thesis project was therefore to expose the database through a web
user interface, where employees could click and label the dataset. A UI presenting each
classifier experiment together with its hyperparameters and performance was also
implemented. Figure 6.15 displays a screenshot of the labeling UI. Selecting
an emblem marks it with a blue background, and the label can be submitted to the
database by clicking the button at the top.

The dataset was increased from 5000 emblems to about 10 000 emblems over the
following weeks. After the second iteration's data extraction and labeling phase,
new experiments were run.
Fig. 6.15: Web-labeling service user-interface
6.2.2 Performance benchmarks for the second model
Data quality issues
Fig. 6.16: Accuracy plot during training
Fig. 6.17: Cross-entropy plot during training
The performance plots for the second model display alarming results. The model
no longer seems to learn as well as it did during iteration one. In iteration one,
the training set accuracy reached close to 100%; now the model only reaches 95%
accuracy at best. The cross-entropy is also considerably higher. After a closer look at
the misclassified images, the root of the problem was found.
Fig. 6.18: Emblems marked as misclassified during testing
In Figure 6.18, we can see that the first image from the left is an eagle that was
somehow labeled as a swastika. The following five images had been labeled as
penises. The first three penis illustrations are definitely considered offensive and
display an important challenge: determining which emblems should be considered
penis illustrations and which should be considered miscellaneous or pornographic
content is a difficult line to draw. The vast majority of the emblems labeled
as penises are images where two balls and a penis are drawn. To make sure that
the model has enough examples to recognize this type of drawing, these are the
emblems that are considered penis illustrations. The kind of pornographic emblems
depicted in Figure 6.18 are therefore considered pornographic drawings and labeled
as miscellaneous offensive emblems.

The last two emblems were marked as swastikas. These emblems depict the runic
insignia of the Schutzstaffel, also known as the SS bolts. As presented in iteration
one, the scope of the thesis was limited to only include swastikas and penises, no
other hate symbols. It became obvious that emblems added after introducing the
labeling service had quality issues, and all the newly added labeled emblems were
examined.
Fig. 6.19: Emblems incorrectly given the label penis in the dataset
Fig. 6.20: Emblems incorrectly given the label swastika in the dataset
Figure 6.19 displays a sample of emblems incorrectly labeled as penises. The third
and the last emblem, from the left, are not penis illustrations but depict bandannas.
Figure 6.20 shows some emblems that were found incorrectly labeled as swastikas.
The benchmarks were run again after the dataset had been cleaned up, and the
performance is shown in the next section.
Performance after data cleaning
Tab. 6.11: Training parameters
Number of training epochs Learning rate Batch size during training
4000 0.01 100
Tab. 6.12: Dataset used during training in iteration two
Non-offensive Swastikas Penis Total
4497 3079 2422 10 000
Fig. 6.21: Accuracy plot during training
Fig. 6.22: Cross-entropy plot during training
The performance benchmarks after the data-cleaning process are more similar to
the results found during iteration one. The cross-entropy is again decreasing to low
levels on the training set.
              Predicted Pos   Predicted Neg   Total
Actual Pos    TP = 463        FN = 16         479
Actual Neg    FP = 34         TN = 519        553
Total         497             535             1032

Tab. 6.13: Confusion matrix for the second iteration model
Table 6.14 shows the performance results for each classifier so far. The new model,
trained on a dataset twice the size compared to the model in iteration one, performs
slightly better. The increase in size yielded an increase of 0.51% in accuracy and
0.47% in F-measure.
Tab. 6.14: Performance on dev test set

Model                  Accuracy  F-measure  Precision  Recall  Dev test set size
base-line              0.9162    0.7843     0.7018     0.8889  542
first iter.            0.9465    0.9441     0.9387     0.9491  542
second iter. no clean  0.9388    0.9308     0.9308     0.9318  1032
second iter. clean     0.9516    0.9488     0.9316     0.9666  1032
6.3 Results iteration 3
Another 7000 emblems were added to the dataset. Data cleaning was performed before
running additional experiments.
Tab. 6.15: Training parameters
Number of training epochs Learning rate Batch size during training
6000 0.01 100
Tab. 6.16: Dataset used during training in iteration three
Non-offensive Swastikas Penis Total
7815 5351 4210 17377
Fig. 6.23: Accuracy plot during training
Fig. 6.24: Cross-entropy plot during training
The performance plots show results similar to the previous iterations. Performance
on the validation set reaches beyond 95% after epoch 4000. The cross-entropy for the
validation set is slightly lower than in the previous iterations.
              Predicted Pos   Predicted Neg   Total
Actual Pos    TP = 768        FN = 30         798
Actual Neg    FP = 45         TN = 922        967
Total         813             952             1765

Tab. 6.17: Confusion matrix for the third iteration model
Tab. 6.18: Performance on dev test set

Model                  Accuracy  F-measure  Precision  Recall  Dev test set size
base-line              0.9162    0.7843     0.7018     0.8889  542
first iter.            0.9465    0.9441     0.9387     0.9491  542
second iter. no clean  0.9388    0.9308     0.9308     0.9318  1032
second iter. clean     0.9516    0.9488     0.9316     0.9666  1032
third iter.            0.9575    0.9535     0.9446     0.9624  1765
The third model achieves the highest performance, increasing accuracy by an
additional 0.59% and the F-measure by 0.47%. The increase in dataset size also
means that the dev test set is larger for the last model, testing on 1765 emblems
compared to only 542 images in the first iteration, which makes the test results more
reliable in the third iteration. The third model classified 45 emblems as offensive
even though they did not contain any offensive content, and 30 emblems were
incorrectly classified as non-offensive. All the emblems that were mistaken are shown
in the next section.
Misclassified images
Fig. 6.25: Non-offensive emblems misclassified as penises
Fig. 6.26: Non-offensive emblems misclassified as swastikas
Fig. 6.27: Penis emblems misclassified as non-offensive
Fig. 6.28: Swastika emblems misclassified as non-offensive
Out of 1765 emblems, 75 were classified incorrectly. All the incorrect predictions
are shown above. The misclassification of some swastikas is unsatisfactory and
hard to explain. It can be concluded that the model achieves high performance
on the dev test set, but still has several blind spots.
6.3.1 Data augmentation experiments
To increase the dataset size further, experiments were run using different augmen-
tation techniques. In the previous experiments, the bottleneck generation was run
once and cached for consecutive runs. The augmentation operation is run on every
training batch, so new bottlenecks have to be generated every epoch. Running
the experiment on the complete dataset took between 8 and 15 hours on the worksta-
tion GPU, compared to the previous experiments that took between 15 and 60 minutes.
Several runs were made with different augmentation settings; the most successful
configuration is shown below. The only strategy that proved useful was the rotation
technique. The images that were suitable for rotation were handpicked from the
dataset. To ensure the reliability of the performance benchmarks, only the training
set was augmented. The validation set and the test set were left untouched by any
augmentation strategies.
Tab. 6.19: Training parameters
Number of training epochs Learning rate Batch size during training
8000 0.01 100
Tab. 6.20: The dataset used during training, including both rotated and not rotated images
Non-offensive Swastikas Penis Total
32 380 22 638 17 450 72 468
Fig. 6.29: Accuracy plot during training
Fig. 6.30: Cross-entropy plot during training
              Predicted Pos   Predicted Neg   Total
Actual Pos    TP = 783        FN = 15         798
Actual Neg    FP = 43         TN = 924        967
Total         826             939             1765

Tab. 6.21: Confusion matrix for the third iteration model with augmentation
The performance during training follows the same pattern as the counterpart
without augmentation, but the errors on validation and training are closer to each
other. Performance on the validation set is also higher.
6.3.2 Final performance comparison between all models
Tab. 6.22: Performance on dev test set

Model                  Accuracy  F-measure  Precision  Recall  Dev test set size
base-line              0.9162    0.7843     0.7018     0.8889  542
first iter.            0.9465    0.9441     0.9387     0.9491  542
second iter. no clean  0.9388    0.9308     0.9308     0.9318  1032
second iter. clean     0.9516    0.9488     0.9316     0.9666  1032
third iter.            0.9575    0.9535     0.9446     0.9624  1765
third iter. aug.       0.9671    0.9643     0.9480     0.9812  1765
The model run with augmentation has the highest performance recorded, with an
accuracy of 96.71%.
6.3.3 Performance on production test set
The best model found during the project was then run on the production test set. The
application only wants the classifier to make a prediction when it is more than 85%
sure, so the classifier was restricted from giving a prediction when the highest
softmax output was below 85% (a sketch of this rule is shown below). Compared to the
dev test sets run during development, which had a distribution resembling the
training set distribution, the production test set has 99% non-offensive emblems and
only 1% offensive.
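A minimal sketch of the 85% confidence rule, assuming the softmax probabilities and class names are available; the function name and the abstain convention are illustrative.

```python
import numpy as np

def guarded_predict(softmax_probs, classes, threshold=0.85):
    """Only return a prediction when the highest softmax output
    reaches the confidence threshold; otherwise abstain."""
    probs = np.asarray(softmax_probs)
    best = int(np.argmax(probs))
    if probs[best] >= threshold:
        return classes[best]
    return None  # None signals that the classifier abstains
```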
The production test set was created by randomly selecting 3650 emblems out of the
8 032 703 emblem dataset. The MD5 sums of the 3650 emblems were then queried
against the emblem database, and 523 emblems were already present, reducing the
dataset to 3127 emblems. By manually labeling all the emblems in the downloaded
dataset, 17 swastikas, 14 penises and 3096 non-offensive emblems were found. These
3127 emblems were then used as the production test set.
Tab. 6.23: Production test set distribution, before relabeling
Non-offensive Swastikas Penis Total
3096 17 14 3127
When letting the model classify the images in the prod test set and displaying the
misclassified emblems, five swastika emblems were found to be incorrectly labeled as
non-offensive, and ten penises were incorrectly labeled as non-offensive. This
demonstrates how easy it is to miss some of the offensive emblems, and also
showcases the model's capability of finding the offensive classes in emblems.

Fig. 6.31: Penis emblems incorrectly labeled as non-offensive, but found by the model

Fig. 6.32: Swastika emblems incorrectly labeled as non-offensive, but found by the model

After correcting the test set, the corrected distribution is shown in Table 6.24. The em-
blems that were given an incorrect label are shown in Figure 6.31 and Figure 6.32.
Tab. 6.24: Production test set distribution, after relabeling

Non-offensive  Swastikas  Penis  Total
3081           22         24     3127
              Predicted Pos   Predicted Neg   Total
Actual Pos    TP = 37         FN = 1          38
Actual Neg    FP = 70         TN = 1772       1842
Total         107             1773            1880

Tab. 6.25: Confusion matrix for the best classifier run on the production test set
Out of 3127 emblems, the classifier was confident enough to give a prediction on
1880 emblems, which corresponds to about 60% of the emblems. The prediction
results are shown in Table 6.25.
Tab. 6.26: Performance on production test set
Model Accuracy F-measure Precision Recall Test set size
best model 0.9622 0.5103 0.3458 0.9737 1880
The performance on the prod test set is worse than the performance on the dev test
set. There are several explanations for this.

First, non-offensive emblems vary far more in how they are illustrated than
swastikas or penises do. If emblems were graded on how hard it is to recognize
what they depict, the non-offensive emblems would be several levels harder.
Adding more non-offensive emblems to the dataset therefore corresponds to adding
more difficult examples.

Secondly, the model has been trained on a distribution of about 45% non-offensive,
30% swastikas and 25% penises. As has been mentioned, the distribution within the
prod test set is 99% non-offensive, 0.5% swastikas and 0.5% penises. During training,
the model tries to find the best fit on the training set and its distribution. If the
model were instead trained on a distribution similar to the production test set,
however, it would only be trained on about 390 swastikas and 390 penises, which
is not enough to learn the different kinds of drawing styles.
7 Discussion
An important part of the work was to delimit the project to only focus on filtering out
swastikas and penises. This both had an impact on the model's performance and made
it easier to evaluate the produced model's limitations. Defining the boundaries within
the computer vision problem proved to be hard. Deciding which images should
be labeled as swastikas was straightforward, but determining which
emblems should be considered penis images rather than pornographic/miscellaneous
images was challenging. During the second iteration, the loosely defined boundaries
between categories proved to have negative effects on the dataset quality when
people from outside the project contributed to the labeling process. Excluding the
emblems that are hard, by categorizing them into a miscellaneous category, was also
problematic. The model was neither trained nor tested on this category, resulting
in an overoptimistic picture of how well the model would perform on a completely
unlabeled dataset, where no emblems are excluded into miscellaneous
categories.
Increasing the dataset size improved performance, as expected. The quality of the
added data was shown to have a severe impact on performance. The results also
show that the dataset size can be leveraged even further by the use of adequate
augmentation strategies for the dataset.
To make sure that the performance measurements are reliable on completely unseen
data, the best model was run once on a production test set, generating the final
performance results. An issue with the emblem dataset is the presence of "near
duplicate" images. The way to make sure that an emblem was not present in both
the training set and the test set was to generate an MD5 hash of the emblem image
and check that the hash was present in only one of the sets. The problem
is that changing a single pixel in the image produces a different MD5
hash. This opens up the possibility that some images with very small variation
exist in both sets, making the results less reliable. How similar one emblem is to
another varies significantly. No systematic method was
implemented to counter this problem, but the test and training sets were manually
checked to get an understanding of the severity of this issue. The conclusion was
that the number of images that are close to identical between the sets is hard
to determine without a systematic approach, but seems limited after manually
comparing the sets. It should be noted, though, that this is a problem with the whole
eight-million-emblem dataset and not a problem due to any of the choices made in
the thesis project. People tend to reuse popular emblems and make minor changes
to them.
The step-by-step approach to machine learning problems, presented by Goodfellow
et al. [14] and Andrew Ng, was of great use. The debugging strategies presented
worked well during the project, and by monitoring performance on the training set
and test set, the correct measures were taken.
Unlike Tajbakhsh et al., who fine-tuned across several layers, this project only
focused on fine-tuning the last layer. The performance shows that the CNN features
learned during training on ImageNet can be used to classify images in the target
emblem dataset. These findings are in line with Donahue et al. [8] and several of the
studies on applying transfer learning to real-world problems. Using a pretrained
CNN as a black box made it possible to focus more on dataset extraction and dataset
augmentation. Even though only the last layer was fine-tuned, the experiments required
significant time and effort.
One of the project goals was to investigate whether a CNN model could be used as a tool
for filtering out offensive emblems in the game Battlefield 1. The performance on
the production test set shows that it is possible to produce such a model, but it has
both strengths and weaknesses. The model has high recall and is highly capable of
finding offensive emblems in a dataset, but its performance measured in
precision is low: about two-thirds of the emblems flagged as offensive are not. Using
the model as the single decision-maker for whether an emblem should be accepted
into the game would not be a good idea. The model still has several
blind spots and does make severe mistakes. One can imagine that incorrectly accusing
players of uploading offensive emblems could have significant negative consequences.
The model is probably more suited as a customer service aid, predicting on already
reported emblems. Determining the suitability and consequences of applying a
machine learning filtering service to a game is outside the scope of this thesis and
could be the subject of another project. The MD5 database that was constructed
during the thesis work could, however, be used directly at emblem upload, to check
whether the emblem is already flagged as offensive.
7.1 Future work
Several approaches were considered but never tried, in order to limit
the scope of the project. Some of them are presented in this section.
Only the last CNN layer was fine-tuned during this thesis. A future extension would
be to retrain more layers of the CNN. Related research has shown that, given enough
data, fine-tuning more layers than just the final layer often produces better results.
The model could have been visualized using t-Distributed Stochastic
Neighbor Embedding (t-SNE), a method for visualizing high-dimensional datasets.
It would also have been interesting to further debug the CNN feature map and look
at the activations for different images; several debugging methods exist that would
have been interesting to try. This could have given more insight into what the feature
representation looks like for different labels.
Only the GoogLeNet architecture was used during the project. Given more time,
it would have been interesting to try different CNN models, like AlexNet or VGG,
for producing the bottlenecks/feature extraction. Related research has also shown that
an ensemble classifier in many cases outperforms a single classifier. This could have
been investigated by training an SVM or random forest classifier on the generated
feature maps and then letting an ensemble of classifiers make the predictions.
8 Conclusion

The goal of the thesis was to evaluate the use of convolutional neural networks on
the task of filtering out penises and swastikas from emblems drawn by players in
the game Battlefield 1. A CNN with the GoogLeNet architecture, pretrained on
ImageNet, was used as a black box for feature extraction, an approach called transfer
learning. A multi-layer perceptron was then trained on the feature maps generated
from 17 220 emblems. The produced model achieved an accuracy of 96.22%, a
precision of 34.58% and a recall of 97.37% on a sample drawn from the game at
random. It can be concluded that the model is successful at finding swastikas and
penises, but among the emblems flagged as swastikas and penises, a large portion
are non-offensive.
Bibliography
[1] Samet Akçay, Mikolaj E Kundegorski, Michael Devereux, and Toby P Breckon. „Transfer learning using convolutional neural networks for object classification within X-ray baggage security imagery". In: Image Processing (ICIP), 2016 IEEE International Conference on. IEEE. 2016, pp. 1057–1061 (cit. on p. 18).
[3] Hossein Azizpour, Ali Sharif Razavian, Josephine Sullivan, Atsuto Maki, and Stefan Carlsson. „From generic to specific deep representations for visual recognition". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2015, pp. 36–45 (cit. on p. 2).
[6] Phillip M Cheng and Harshawn S Malhi. „Transfer Learning with Convolutional Neural Networks for Classification of Abdominal Ultrasound Images". In: Journal of Digital Imaging (2016), pp. 1–10 (cit. on p. 18).
[8] Jeff Donahue, Yangqing Jia, Oriol Vinyals, et al. „DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition". In: ICML. Vol. 32. 2014, pp. 647–655 (cit. on p. 15).
[9] David Eigen, Jason Rolfe, Rob Fergus, and Yann LeCun. „Understanding deep architectures using a recursive convolutional network". In: arXiv preprint arXiv:1312.1847 (2013) (cit. on p. 1).
[10] Dumitru Erhan, Pierre-Antoine Manzagol, Yoshua Bengio, Samy Bengio, and Pascal Vincent. „The Difficulty of Training Deep Architectures and the Effect of Unsupervised Pre-Training". In: AISTATS. Vol. 5. 2009, pp. 153–160 (cit. on p. 1).
[11] Andre Esteva, Brett Kuprel, Roberto A Novoa, et al. „Dermatologist-level classification of skin cancer with deep neural networks". In: Nature 542.7639 (2017), pp. 115–118 (cit. on p. 17).
[12] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. „Region-based convolutional networks for accurate object detection and segmentation". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 38.1 (2016), pp. 142–158 (cit. on p. 16).
[14] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016 (cit. on pp. 7, 10, 11, 13, 23, 41, 56).
[15] Mohammad Havaei, Axel Davy, David Warde-Farley, et al. „Brain tumor segmentation with deep neural networks". In: Medical Image Analysis 35 (2017), pp. 18–31 (cit. on p. 17).
[16] Benjamin Q Huynh, Hui Li, and Maryellen L Giger. „Digital mammographic tumor classification using transfer learning from deep convolutional neural networks". In: Journal of Medical Imaging 3.3 (2016), p. 034501 (cit. on p. 18).
[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. „ImageNet classification with deep convolutional neural networks". In: Advances in Neural Information Processing Systems. 2012, pp. 1097–1105 (cit. on p. 1).
[18] Yann LeCun, Bernhard Boser, John S Denker, et al. „Backpropagation applied to handwritten zip code recognition". In: Neural Computation 1.4 (1989), pp. 541–551 (cit. on p. 1).
[19] Min Lin, Qiang Chen, and Shuicheng Yan. „Network In Network". In: CoRR abs/1312.4400 (2013) (cit. on p. 19).
[20] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2012 (cit. on p. 5).
[21] Mohamed Moustafa. „Applying deep learning to classify pornographic images and videos". In: arXiv preprint arXiv:1511.08899 (2015) (cit. on p. 18).
[22] Andrew Ng. Nuts and Bolts of Applying Deep Learning. 2016 (cit. on p. 24).
[23] Sinno Jialin Pan and Qiang Yang. „A survey on transfer learning". In: IEEE Transactions on Knowledge and Data Engineering 22.10 (2010), pp. 1345–1359 (cit. on pp. 15, 16).
[24] Otávio AB Penatti, Keiller Nogueira, and Jefersson A dos Santos. „Do deep features generalize from everyday objects to remote sensing and aerial scenes domains?" In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2015, pp. 44–51 (cit. on p. 2).
[27] Holger R Roth, Amal Farag, Le Lu, Evrim B Turkbey, and Ronald M Summers. „Deep convolutional networks for pancreas segmentation in CT imaging". In: SPIE Medical Imaging. International Society for Optics and Photonics. 2015, 94131G (cit. on p. 17).
[28] Masaki Saito and Yusuke Matsui. „Illustration2Vec: a semantic vector representation of illustrations". In: SIGGRAPH Asia 2015 Technical Briefs. ACM. 2015, p. 5 (cit. on p. 17).
[29] Mundher Al-Shabi, Tee Connie, and Andrew Beng Jin Teoh. „Adult Content Recognition from Images Using a Mixture of Convolutional Neural Networks". In: arXiv preprint arXiv:1612.09506 (2016) (cit. on p. 18).
[30] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. „CNN features off-the-shelf: an astounding baseline for recognition". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2014, pp. 806–813 (cit. on p. 2).
[31] Jae Shin, Nima Tajbakhsh, R Todd Hurst, Christopher B Kendall, and Jianming Liang. „Automating carotid intima-media thickness video interpretation with convolutional neural networks". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 2526–2535 (cit. on p. 17).
[32] Karen Simonyan and Andrew Zisserman. „Very deep convolutional networks for large-scale image recognition". In: arXiv preprint arXiv:1409.1556 (2014) (cit. on p. 1).
[33] Christian Szegedy, Wei Liu, Yangqing Jia, et al. „Going deeper with convolutions". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 1–9 (cit. on pp. 1, 19, 20).
[34] Nima Tajbakhsh, Jae Y Shin, Suryakanth R Gurudu, et al. „Convolutional neural networks for medical image analysis: full training or fine tuning?" In: IEEE Transactions on Medical Imaging 35.5 (2016), pp. 1299–1312 (cit. on pp. 1, 16).
[35] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. „How transferable are features in deep neural networks?" In: Advances in Neural Information Processing Systems. 2014, pp. 3320–3328 (cit. on pp. 2, 16).
[36] Matthew D Zeiler and Rob Fergus. „Visualizing and understanding convolutional networks". In: European Conference on Computer Vision. Springer. 2014, pp. 818–833 (cit. on p. 1).
[37] Kailong Zhou, Li Zhuo, Zhen Geng, Jing Zhang, and Xiao Guang Li. „Convolutional Neural Networks Based Pornographic Image Classification". In: Multimedia Big Data (BigMM), 2016 IEEE Second International Conference on. IEEE. 2016, pp. 206–209 (cit. on p. 18).
Websites
[2] Stanford Vision Lab. Resources and Links. 2014. URL: http://vision.stanford.edu/resources_links.html (visited on June 8, 2017) (cit. on p. 30).
[4] Danilo Bargen. Programming a Perceptron in Python. 2013. URL: https://blog.dbrgn.ch/2013/3/26/perceptrons-in-python/ (visited on June 8, 2017) (cit. on p. 6).
[5] Satvik Beri. Could someone explain how to create an artificial neural network in a simple and concise way that doesn't require a PhD in mathematics? 2013. URL: https://www.quora.com/Could-someone-explain-how-to-create-an-artificial-neural-network-in-a-simple-and-concise-way-that-doesnt-require-a-PhD-in-mathematics (visited on June 8, 2017) (cit. on p. 7).
[7] Adit Deshpande. A Beginner's Guide To Understanding Convolutional Neural Networks (Part 2). 2016. URL: https://adeshpande3.github.io/adeshpande3.github.io/A-Beginner's-Guide-To-Understanding-Convolutional-Neural-Networks-Part-2/ (visited on May 22, 2017) (cit. on pp. 12, 13).
[13] Amar Gondaliya. Regularization implementation in R: Bias and Variance diagnosis. 2014. URL: http://pingax.com/regularization-implementation-r/ (visited on June 8, 2017) (cit. on p. 10).
[25] Sebastian Raschka. Machine Learning FAQ. 2016. URL: https://sebastianraschka.com/faq/docs/closed-form-vs-gd.html (visited on May 31, 2017) (cit. on p. 8).
[26] Robert D. Hof. Deep Learning. 10 Breakthrough Technologies, MIT Technology Review. 2013. URL: https://www.technologyreview.com/s/513696/deep-learning/ (visited on Apr. 22, 2017) (cit. on p. 1).
List of Figures
2.1 Perceptron topology, illustration modified from Danilo Bargen [4] . . . 6
2.2 Multi-layer perceptron topology, illustration modified from Satvik Beri [5] 7
2.3 Gradient descent, illustration modified from Sebastian Raschka [25] . 8
2.4 Dataset partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.5 Illustrative example of overfitting, underfitting and optimal capacity.
Illustration modified from Amar Gondaliya [13] . . . . . . . . . . . . . 10
2.6 Illustration displaying the convolution operation [14] . . . . . . . . . . 11
2.7 A 7 × 7 image with a 3 × 3 kernel and a stride of one [7] . . . . . . . . 12
2.8 The 5 × 5 output feature map [7] . . . . . . . . . . . . . . . . . . . . . 12
2.9 A 7 × 7 image with a 3 × 3 kernel and a stride of two [7] . . . . . . . . 12
2.10 The 3 × 3 output feature map [7] . . . . . . . . . . . . . . . . . . . . . 12
2.11 A 32 × 32 image with a padding of two [7] . . . . . . . . . . . . . . . . 12
2.12 Image displaying the output of a 2 × 2 maxpool kernel, with a stride of
two [7] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1 GoogLeNet CNN architecture. Illustration taken from the research paper
"Going Deeper with Convolutions" [33] . . . . . . . . . . . . . . . . . . 19
3.2 Inception module illustration [33] . . . . . . . . . . . . . . . . . . . . 20
3.3 Figure illustrating the difference between a normal linear convolution
layer and an MLPconv layer. Illustration taken from the paper "Network
In Network" [19] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.1 Screenshot from the Battlefield Companion emblem editor . . . . . . . . 22
4.2 Flow-chart displaying the process of applying deep learning. Illustration
taken from "Nuts and Bolts of Applying Deep Learning" [22] . . . . . . 24
5.1 Rotation augmentation example . . . . . . . . . . . . . . . . . . . . . . 28
5.2 ImageNet sample. Image taken from Stanford Vision Lab [2] . . . . . . 30
6.1 Sample emblems. From left to right: nude, miscellaneous, nazi symbol,
penis and text. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
6.2 Distribution among hidden emblems in BF1 . . . . . . . . . . . . . . . 33
6.3 Distribution among all top 1000 emblems in BF1 . . . . . . . . . . . . 33
6.4 Distribution among offensive emblems in the top 1000 . . . . . . . . . 33
6.5 Sample from the miscellaneous category . . . . . . . . . . . . . . . . . 36
6.6 Emblem thumbnails from each of the categories . . . . . . . . . . . . . 37
6.7 Accuracy plot during training. Performance on training batch in orange,
validation performance in turquoise. The x-axis shows the number of
epochs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6.8 Cross-entropy plot during training. Performance on training batch
in orange, validation performance in turquoise. The x-axis shows the
number of epochs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6.9 Penises misclassified as non-offensive . . . . . . . . . . . . . . . . . . . 40
6.10 Non-offensive misclassified as penis . . . . . . . . . . . . . . . . . . . . 40
6.11 Swastikas misclassified as non-offensive . . . . . . . . . . . . . . . . . 40
6.12 Non-offensive misclassified as swastika . . . . . . . . . . . . . . . . . . 40
6.13 Emblems from the non-offensive category containing the SpongeBob
character Patrick . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
6.14 A small sample of the emblems in the swastika category containing eagles 41
6.15 Web-labeling service user-interface . . . . . . . . . . . . . . . . . . . . 42
6.16 Accuracy plot during training . . . . . . . . . . . . . . . . . . . . . . . 43
6.17 Cross-entropy plot during training . . . . . . . . . . . . . . . . . . . . . 43
6.18 Emblems marked as misclassified during testing . . . . . . . . . . . . . 44
6.19 Emblems incorrectly given the label penis in the dataset . . . . . . . . 44
6.20 Emblems incorrectly given the label swastika in the dataset . . . . . . . 44
6.21 Accuracy plot during training . . . . . . . . . . . . . . . . . . . . . . . 45
6.22 Cross-entropy plot during training . . . . . . . . . . . . . . . . . . . . . 45
6.23 Accuracy plot during training . . . . . . . . . . . . . . . . . . . . . . . 47
6.24 Cross-entropy plot during training . . . . . . . . . . . . . . . . . . . . . 47
6.25 Non-offensive emblems misclassified as penises . . . . . . . . . . . . . 49
6.26 Non-offensive emblems misclassified as swastikas . . . . . . . . . . . . 49
6.27 Penis emblems misclassified as non-offensive . . . . . . . . . . . . . . . 49
6.28 Swastika emblems misclassified as non-offensive . . . . . . . . . . . . 49
6.29 Accuracy plot during training . . . . . . . . . . . . . . . . . . . . . . . 51
6.30 Cross-entropy plot during training . . . . . . . . . . . . . . . . . . . . . 51
6.31 Penis emblems incorrectly labeled as non-offensive, but found by model 53
6.32 Swastika emblems incorrectly labeled as non-offensive, but found by
model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
List of Tables
5.1 Data augmentations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
6.1 Categories within the offensive dataset . . . . . . . . . . . . . . . . . . 32
6.2 Emblems hidden by customer service at Dice, categorized . . . . . . . 33
6.3 Distribution among top 1000 emblems after manual categorization . . 33
6.4 Dataset baseline model . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
6.5 Training parameters baseline model . . . . . . . . . . . . . . . . . . . . 35
6.6 Performance on test set . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
6.7 Training parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
6.8 Dataset used during training in iteration one . . . . . . . . . . . . . . . 37
6.9 Confusion matrix for the first iteration model . . . . . . . . . . . . . . 39
6.10 Performance on dev test set . . . . . . . . . . . . . . . . . . . . . . . . 39
6.11 Training parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.12 Dataset used during training in iteration two . . . . . . . . . . . . . . . 45
6.13 Confusion matrix for the second iteration model . . . . . . . . . . . . 46
6.14 Performance on dev test set . . . . . . . . . . . . . . . . . . . . . . . . 46
6.15 Training parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.16 Dataset used during training in iteration three . . . . . . . . . . . . . . 47
6.17 Confusion matrix for the third iteration model . . . . . . . . . . . . . 48
6.18 Performance on dev test set . . . . . . . . . . . . . . . . . . . . . . . . 48
6.19 Training parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6.20 The dataset used during training, including both rotated and non-rotated
images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6.21 Confusion matrix for the fourth iteration model . . . . . . . . . . . . . 51
6.22 Performance on dev test set . . . . . . . . . . . . . . . . . . . . . . . . 52
6.23 Production test set distribution, before relabeling . . . . . . . . . . . . 52
6.24 Production test set distribution, after relabeling . . . . . . . . . . . . . 53
6.25 Confusion matrix best classifier run on production test set . . . . . . . 54
6.26 Performance on production test set . . . . . . . . . . . . . . . . . . . . 54
Colophon
This thesis was typeset with LaTeX 2ε. It uses the Clean Thesis style developed by
Ricardo Langner. The design of the Clean Thesis style is inspired by user guide
documents from Apple Inc.
Download the Clean Thesis style at http://cleanthesis.der-ric.de/.