applications of deep learning in text classification …1323153/... · 2019-06-11 · cnns by using...
APPLICATIONS OF DEEP LEARNING IN TEXT CLASSIFICATION FOR HIGHLY MULTICLASS DATA
Submitted by
Adam Grünwald
A thesis submitted to the Department of Statistics in partial fulfillment of the requirements for
the Master's degree in Statistics in the Faculty of Social Sciences
Supervisor
Rauf Ahmad
Spring, 2019
ABSTRACT
Text classification using deep learning is rarely applied to tasks with more than ten target
classes. This thesis investigates if deep learning can be successfully applied to a task with over
1000 target classes. A pretrained Long Short-Term Memory language model is fine-tuned and
used as a base for the classifier. After five days of training, the deep learning model achieves
80.5% accuracy on a publicly available dataset, 9.3% higher than Naive Bayes. With five
guesses, the model predicts the correct class 92.2% of the time.
Contents
1 Introduction
2 Related Work
2.1 Machine Learning
2.2 Deep learning
3 Data
4 Method
4.1 Neural Networks
4.2 LSTM Recurrent Neural Network
4.3 Training a Neural Network
4.4 Optimization and Regularization
4.4.1 Dropout
4.4.2 Batch Normalization
4.4.3 Weight decay
4.5 ULMFiT
4.6 Evaluation metric
5 Experiments
5.1 Benchmark
5.2 ULMFiT implementation
6 Results
7 Conclusion
1 Introduction
Labelling different types of text documents is both important and desirable. There are plenty
of different situations where this is useful. Automating these tasks is therefore something of
great value if the automated system performs on a par with, or better than, humans. Labelling
essays with grades is an example of a task that is time consuming but also important, which
is why a lot of research has been put into Automated Essay Scoring (for research on Swedish
essays, see Östling 2013). Removing posts from social media platforms which are against the
terms of use or illegal (e.g. hate speech or threats of physical violence) is another case where
automation, if done right, would be beneficial.
Automated labelling of texts can also be useful in other cases. Classifying e-mails as spam,
classifying reviews as either positive or negative and assigning topics to Wikipedia articles (see
Zhang, Zhao, and Lecun 2016) are more examples of useful applications. When assigning
topics to Wikipedia articles, the number of target classes is larger. The literature makes a
distinction between the case when the target classes are binary (e.g. "Spam"/"Not Spam",
"Positive"/"Negative") and when there are several possible target classes (e.g. different topics),
where the latter problem is more complex.
A different text classification task is when a document can be labelled with several labels.
This is referred to as a multi-label classification task. If the label-space is very big, the task
becomes an extreme multi-label classification task (XMTC). An example of such a problem is
the challenge proposed by BioASQ in 2013 (BioASQ 2013). The objective of this task was
to assign several Medical Subject Headings, also known as MeSH (Wikipedia 2019a), to new
PubMed (large database of medical articles) documents.
Text classification tasks can be summarized as four different types: one out of two classes,
one out of multiple classes, several labels out of a limited number of labels and several labels
out of extremely many labels (see Gargiulo, Silvestri, and Ciampi 2018 for a similar summary).
To our knowledge, there have not previously been any papers which deal with more than about 50
classes in a deep learning framework. The data used in this thesis has over 1000 target classes,
which makes this type of classification problem uncharted territory.
The objective of this thesis is to investigate the performance of a neural network transfer
learning technique, known as ULMFiT (Howard and Ruder 2018), on a task similar to the
second type: determining which one of multiple classes a text belongs to. This
can be thought of as a fifth category of text classification task, a highly multiclass
classification task.
In the next section there will be an overview of related work. Then, the data and method will
be introduced. A benchmark classifier will be constructed for comparison purposes. Lastly, an
implementation of ULMFiT on a highly multiclass classification task will be presented along
with results and a conclusion.
2 Related Work
2.1 Machine Learning
In text classification, a simple approach is to consider the text as a bag-of-words. In this
approach, a sentence or a document is an observation and the variables are all the words which
occur across the observations. The value of each variable is then the number of times it occurs in
that particular observation. It is common to put an upper bound on how large the vocabulary
can be, in which case the least common (and often also the most common) words are omitted. It is
also common to use bigrams, i.e. two-word sequences in a text. The sentence "I love you" has
unigrams "I", "love" and "you" and bigrams "I love" and "love you".
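The unigram and bigram counts described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions (whitespace tokenization, no vocabulary limit); the helper names `ngrams` and `bag_of_ngrams` are hypothetical and not taken from the thesis code.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_ngrams(text, max_n=2):
    """Count every 1..max_n-gram of a whitespace-tokenized text."""
    tokens = text.split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts

# The example sentence from the text: three unigrams and two bigrams.
features = bag_of_ngrams("I love you")
```

Each key of `features` corresponds to one variable of the bag-of-words representation and each value to its raw count in the observation.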
In bag-of-words representations, raw counts of words are usually not the best option. It is
often better to use TF-IDF (Term Frequency-Inverse Document Frequency) instead. Explained
in simple terms, TF-IDF is a combination of how common a term is in a given document and
how common the term is across all documents in the data. These features can then be used to
train classifiers like Naive Bayes. The Naive Bayes will be used as a benchmark for comparison
in this thesis.
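As a rough illustration of the idea, the sketch below computes one common TF-IDF variant: relative term frequency multiplied by the logarithm of the inverse document frequency. Many variants exist (some smooth or rescale the weights), and the thesis does not state which one was used, so the exact formula and the function name `tf_idf` are assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [["good", "movie"], ["bad", "movie"], ["good", "game"]]
w = tf_idf(docs)
```

In the second document, "bad" occurs in only one of the three documents and therefore receives a larger weight than "movie", which occurs in two. A term that occurs in every document gets weight zero under this variant.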
Other popular representations include embedding words into vectors, done in word2vec
(Mikolov et al. 2013) and fastText (Joulin et al. 2016). These embeddings are used to capture
similarities between words and can be used to train a classifier that achieves good performance
in a very short time.
2.2 Deep learning
A popular way to approach different tasks in NLP is to use a Long Short-Term Memory Recurrent
Neural Network (LSTM RNN) (Hochreiter and Schmidhuber 1997). The strength of the
LSTM is that it can capture information in any part of the document. It also allows the model
to account for the specific order of words, as shown by Sutskever et al. (2014).
Another interesting approach is to use a Convolutional Neural Network (CNN) on a character
level (see Zhang and Lecun 2016 and Zhang, Zhao, and Lecun 2016). Since CNNs are
the current state-of-the-art in image recognition, it has been suggested and shown that they can
be successful in various NLP tasks. Conneau et al. (2017) extended the idea of character level
CNNs by using up to 29 convolutional layers with promising results. CNNs have also been used
for the multi-label problem by Liu et al. (2017) and by Kim (2014). A combination of RNN
and CNN has been tried and shown to work well for text classification with few classes (Lai
et al. 2015).
The idea of using inductive transfer learning in NLP was introduced by Dai and Le (2015)
and later improved by Howard and Ruder (2018). The latter use a pre-trained language model from
Merity, Keskar and Socher (2017) and show that it can be fine-tuned with small amounts
of data to perform well on a range of different tasks. This is the technique that will be used in
this thesis since it has been shown to be very successful on other text classification tasks.
3 Data
Training and validation data The data used for experiments is publicly available on Kaggle,
an online community for data scientists, and a link to the data can be found in the references
(Kaggle 2018). It consists of forum posts made on Reddit, "a social news aggregation, web
content rating and discussion website" (Wikipedia 2019b). The purpose of such posts is usually
to start a discussion. Fig. 1 shows an example of such a post which was posted in the subreddit
Movies.
Figure 1: Self-post from the subreddit /r/Movies
The data has 1013 classes and 1000 posts per class resulting in over a million observations.
All observations are labelled with their respective class automatically when a user decides to
make their post in a certain subreddit (in Fig. 1, the user has decided to make the post in the
subreddit /r/Movies and the post is thus labelled as the class Movies). There are many more
than 1013 classes on Reddit but the creators of the dataset have tried to clean the classes such
that the overlap between them is as small as possible. The creators also mention that they
believe the highest possible accuracy on this dataset is around 96% because some texts do not
contain any useful information at all. One can refer to Fig. 2 for a small subset of the data. We
concatenate the title and selftext and use it to predict the subreddit.
Figure 2: A subsample illustrating the structure of the data
Test data The creators of the Kaggle dataset kindly provided us with their code for downloading
and cleaning the data. We used this to download some data of our own in order to test the
model's performance on new data. However, the new test data contain some classes with only
one observation. In the validation data, no class has fewer than 70 observations. This might affect
the performance. See table 1 for differences in observations per class between validation and
test data.
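| Data set   | Minimum | 25th percentile | 50th percentile | 75th percentile | Maximum |
|------------|---------|-----------------|-----------------|-----------------|---------|
| Test       | 1       | 83              | 141             | 212             | 260     |
| Validation | 70      | 94              | 100             | 106             | 136     |

Table 1: Minimum values, maximum values and percentiles of observations per class for the
test and validation data.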
4 Method
4.1 Neural Networks
A simplified structure of the two-layer, feed-forward network can be seen in Fig. 3. It takes
$x = [x_1\ x_2\ \dots\ x_p]^T$ as input vector and produces $z = [z_1\ z_2\ \dots\ z_K]^T$ as
output vector, where $p$ is the number of variables and $K$ the number of classes. In this
network, the input $x$ is transformed into $z$ through one layer of hidden units and activation
functions.
[Figure 3 diagram: input layer ($x_0, x_1, x_2, x_3$), hidden layer 1
($h^{(1)}_0, h^{(1)}_1, \dots, h^{(1)}_4$) and output layer ($z_1, z_2$).]
Figure 3: A feed-forward neural network with three input variables, four hidden units and two
output variables. The intercept is represented by $x_0 = h^{(1)}_0 = 1$. The arrows between the
nodes are the weights and there are also activation functions between the layers but they are not
visible in this simplified illustration.
In the following equations, the Rectified Linear Unit (ReLU) (Nair and Hinton 2010) activation
function is used. It is defined as $\sigma(x) = \max(0, x)$ and introduces non-linearity into the
network. The hidden units are then defined as
$$h_i = \sigma\left(\beta^{(1)}_{i0} + \beta^{(1)}_{i1} x_1 + \beta^{(1)}_{i2} x_2 + \dots + \beta^{(1)}_{ip} x_p\right), \quad i = 1, 2, \dots, M, \tag{1}$$
which is a linear function of the input variables put through the ReLU activation function. $M$
is the number of hidden units in this layer. They can also be written in matrix notation
$$h = \sigma\left(b^{(1)} + \beta^{(1)} x\right) \tag{2}$$
where $h = [h_1\ h_2\ \dots\ h_M]^T$ is the vector of hidden units and
$$\beta^{(1)} =
\begin{bmatrix}
\beta^{(1)}_{11} & \beta^{(1)}_{12} & \dots & \beta^{(1)}_{1p} \\
\beta^{(1)}_{21} & \beta^{(1)}_{22} & \dots & \beta^{(1)}_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
\beta^{(1)}_{M1} & \beta^{(1)}_{M2} & \dots & \beta^{(1)}_{Mp}
\end{bmatrix},
\quad
b^{(1)} =
\begin{bmatrix}
\beta^{(1)}_{10} \\ \beta^{(1)}_{20} \\ \vdots \\ \beta^{(1)}_{M0}
\end{bmatrix}$$
are the weight matrix and vector of intercepts used to transform the input into hidden units. The
superscript indicates that this weight matrix and intercept vector correspond to the first layer in
the network.
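In the two-layer example, the output is expressed as a function of the hidden units as
$$z_i = \beta^{(2)}_{i0} + \beta^{(2)}_{i1} h_1 + \beta^{(2)}_{i2} h_2 + \dots + \beta^{(2)}_{iM} h_M, \quad i = 1, 2, \dots, K. \tag{3}$$
In matrix notation
$$z = b^{(2)} + \beta^{(2)} h \tag{4}$$
where
$$\beta^{(2)} =
\begin{bmatrix}
\beta^{(2)}_{11} & \beta^{(2)}_{12} & \dots & \beta^{(2)}_{1M} \\
\beta^{(2)}_{21} & \beta^{(2)}_{22} & \dots & \beta^{(2)}_{2M} \\
\vdots & \vdots & \ddots & \vdots \\
\beta^{(2)}_{K1} & \beta^{(2)}_{K2} & \dots & \beta^{(2)}_{KM}
\end{bmatrix},
\quad
b^{(2)} =
\begin{bmatrix}
\beta^{(2)}_{10} \\ \beta^{(2)}_{20} \\ \vdots \\ \beta^{(2)}_{K0}
\end{bmatrix}$$
and $M$ and $K$ are the number of hidden units and output classes respectively.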
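To extend this two-layer network into a deep neural network with $L$ layers it can be
represented in matrix notation as
$$\begin{aligned}
h^{(1)} &= \sigma\left(b^{(1)} + \beta^{(1)} x\right) \\
h^{(2)} &= \sigma\left(b^{(2)} + \beta^{(2)} h^{(1)}\right) \\
&\ \ \vdots \\
h^{(L-1)} &= \sigma\left(b^{(L-1)} + \beta^{(L-1)} h^{(L-2)}\right) \\
z &= b^{(L)} + \beta^{(L)} h^{(L-1)}
\end{aligned} \tag{5}$$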
Note that the ReLU is not used when generating $z$. Instead, if the network is trained for
classification, $z$ is put through a softmax activation to convert the values into probabilities.
The softmax is defined as
$$\mathrm{softmax}(z) = \frac{1}{\sum_{j=1}^{K} e^{z_j}} \left[e^{z_1}\ e^{z_2}\ \dots\ e^{z_K}\right]^T \tag{6}$$
When the largest $z_i$ exceeds the others by a wide margin, the softmax function assigns it a
value close to one and values close to zero to all others. When the largest and second largest
are close, the probability mass is shared between them.
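The forward pass of Eqn. (5) followed by the softmax of Eqn. (6) can be sketched with NumPy as below. This is a minimal illustration with randomly drawn weights and the layer sizes of Fig. 3; the function names and the random initialization are assumptions, not the thesis implementation.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    # Subtracting the maximum does not change the result but avoids overflow.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def forward(x, weights, biases):
    """ReLU on every hidden layer, no activation before the softmax (Eqn. (5))."""
    h = x
    for beta, b in zip(weights[:-1], biases[:-1]):
        h = relu(b + beta @ h)
    z = biases[-1] + weights[-1] @ h
    return softmax(z)

rng = np.random.default_rng(0)
p, M, K = 3, 4, 2                      # sizes taken from Fig. 3
weights = [rng.normal(size=(M, p)), rng.normal(size=(K, M))]
biases = [np.zeros(M), np.zeros(K)]
probs = forward(rng.normal(size=p), weights, biases)
```

`probs` is a vector of $K$ non-negative values that sum to one, one probability per class.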
4.2 LSTM Recurrent Neural Network
Another type of neural network, suited for sequential data, is the Recurrent Neural Network
(RNN). An RNN takes an input vector and a vector of hidden states to produce a new vector
of hidden states and an output. Let $x_t = [x_1\ x_2\ \dots\ x_t]^T$ be the ordered sequence of
input words and let $h_t = [h_1\ h_2\ \dots\ h_t]^T$ and $h_{t-1} = [h_1\ h_2\ \dots\ h_{t-1}]^T$
be the hidden states corresponding to the inputs. Also let $\beta_{hx}$, $\beta_{hh}$ and
$\beta_{oh}$ be weight matrices and let $o_t$ be the output. Then the state of the network at
time $t$ is
$$\begin{aligned}
h_t &= \sigma\left(\beta_{hx} x_t + \beta_{hh} h_{t-1} + b_h\right) \\
o_t &= \mathrm{softmax}\left(\beta_{oh} h_t + b_o\right)
\end{aligned} \tag{7}$$
where $\sigma$ is usually the tanh or ReLU nonlinear function and $b_h$, $b_o$ are the
respective bias terms.
The LSTM (briefly discussed in section 2.2) was introduced to improve the RNN's ability
to remember more than the last parts of a document (Hochreiter and Schmidhuber 1997). The
LSTM cell takes as input the feature vector, $x_t$, the previous cell state vector (defined in
Eqn. (8)), $c_{t-1}$, and the previous hidden state vector, $h_{t-1}$. This input then flows
through three different "gates", generally referred to as the forget gate, input gate and output
gate. They consist of different activation functions deciding what to forget from previous
states, what to use as input and what to output. The LSTM cell then outputs a new hidden state,
$h_t$, and a new cell state, $c_t$, to be used in the next cell. The flow through the LSTM cell
can be seen in Fig. 4 and is defined in equation form as
$$\begin{aligned}
f_t &= \sigma\left(\beta_{fx} x_t + \beta_{fh} h_{t-1} + b_f\right) \\
i_t &= \sigma\left(\beta_{ix} x_t + \beta_{ih} h_{t-1} + b_i\right) \\
o_t &= \sigma\left(\beta_{ox} x_t + \beta_{oh} h_{t-1} + b_o\right) \\
\tilde{c}_t &= \tanh\left(\beta_{cx} x_t + \beta_{ch} h_{t-1} + b_c\right) \\
c_t &= f_t \times c_{t-1} + i_t \times \tilde{c}_t \\
h_t &= o_t \times \tanh(c_t)
\end{aligned} \tag{8}$$
where $\times$ denotes the Hadamard product, the gate function is
$\sigma(x) = \frac{1}{1 + e^{-x}}$ and $f_t$, $i_t$, $o_t$ are the outputs of the forget gate,
input gate and output gate respectively. The different $\beta$ matrices contain the weights for
the gates and the new candidate state. The bias/intercept terms are represented by the different
$b$ vectors. The candidate state, $\tilde{c}_t$, is combined with the previous cell state to form
the new cell state, $c_t$. Lastly, the new hidden state, $h_t$, is formed as a function of the
output gate $o_t$ and the cell state.
[Figure 4 diagram: the LSTM cell takes the previous cell state $c_{t-1}$, the previous hidden
state $h_{t-1}$ and the input $x_t$, and produces the new cell state $c_t$ and the new hidden
state $h_t$.]
Figure 4: Visualization of the LSTM cell structure. $\sigma$ represents the logistic function,
$\times$ is the Hadamard product and the + is just regular addition of the incoming terms.
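To make the flow through the cell concrete, Eqn. (8) can be written as a single step function. The sketch below is an illustration with random weights, not the AWD-LSTM used in the experiments. The dictionaries `W`, `U` and `b` are hypothetical containers: `W[g]` plays the role of $\beta_{gx}$, `U[g]` the role of $\beta_{gh}$ and `b[g]` the role of $b_g$ for each gate `g`.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step following Eqn. (8); * is the element-wise (Hadamard) product."""
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])       # forget gate
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])       # input gate
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])       # output gate
    c_cand = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate state
    c_t = f * c_prev + i * c_cand                              # new cell state
    h_t = o * np.tanh(c_t)                                     # new hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hid = 5, 3
W = {g: rng.normal(size=(n_hid, n_in)) for g in "fioc"}
U = {g: rng.normal(size=(n_hid, n_hid)) for g in "fioc"}
b = {g: np.zeros(n_hid) for g in "fioc"}

# Run the cell over a sequence of four input vectors, carrying h and c forward.
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(4, n_in)):
    h, c = lstm_step(x_t, h, c, W, U, b)
```

The same weights are reused at every time step; only the hidden state and the cell state change as the sequence is read.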
The LSTM cells (shown in Fig. 4) are the core part of a language model. A language
model would first use a layer that learns the relation between different words, known as an
embedding layer. Then some layers of LSTM cells are applied (three in our case), followed by a
linear layer that takes the hidden state from the LSTM as input and propagates it forward
through a softmax activation (see Eqn. (6)) in order to predict the next word in a sequence of
words. The prediction is defined as
$$\hat{y}_t = \mathrm{softmax}\left(V h_t + b_y\right) \tag{9}$$
where $V$ is a weight matrix (with dimensions vocabulary size $\times$ number of hidden states)
and $b_y$ are the bias/intercept terms.
4.3 Training a Neural Network
When training a machine learning model one wants to find the parameter values which minimize
a loss function. Given the model
$$y = X\beta + \varepsilon \tag{10}$$
a closed form solution that minimizes the MSE loss function can be directly calculated as
$$\hat{\beta} = (X^T X)^{-1} X^T y \tag{11}$$
If the number of parameters is extremely large, one could instead use an algorithm called
Gradient Descent. The gradient descent method, when applied to linear regression, tries to
minimize an appropriate loss function, e.g. MSE. To make the partial derivatives look nicer we
define a slightly modified MSE as
$$L(\beta_0, \beta_1, \dots, \beta_p) = \frac{1}{2n} \sum_{i=1}^{n} \left(\hat{y}^{(i)} - y^{(i)}\right)^2 \tag{12}$$
where $\hat{y}^{(i)}$ is the model's prediction for observation $i$. Gradient descent then takes
the partial derivatives in each iteration with regard to all $\beta_j$ as
$$\frac{\partial L(\beta_0, \beta_1, \dots, \beta_p)}{\partial \beta_j} = \frac{1}{n} \sum_{i=1}^{n} \left(\hat{y}^{(i)} - y^{(i)}\right) x^{(i)}_j, \quad j = 1, 2, \dots, p. \tag{13}$$
Then it will update all parameters simultaneously with learning rate $\gamma > 0$ such that
$$\beta_j^{(new)} = \beta_j^{(old)} - \gamma \frac{1}{n} \sum_{i=1}^{n} \left(\hat{y}^{(i)} - y^{(i)}\right) x^{(i)}_j, \quad j = 1, 2, \dots, p. \tag{14}$$
This update scheme is repeated until the change in the loss between iterations
is sufficiently small. Then we can say that the loss function, which in this case is convex, has
converged to its global minimum (see Goodfellow, Bengio, and Courville 2016 for gradient
based optimization).
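The following sketch compares the closed form solution of Eqn. (11) with the gradient descent updates of Eqn. (14) on simulated data. The data, the learning rate and the stopping tolerance are assumed values chosen for illustration. The intercept is handled by adding a column of ones to $X$, in line with the convention $x_0 = 1$ of Fig. 3.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # first column: intercept
beta_true = np.array([1.0, 2.0, -1.0, 0.5])                 # simulated coefficients
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Closed form least-squares solution, Eqn. (11).
beta_closed = np.linalg.solve(X.T @ X, X.T @ y)

def loss(beta):
    return np.sum((X @ beta - y) ** 2) / (2 * n)   # Eqn. (12)

beta = np.zeros(p + 1)
gamma = 0.1                                        # learning rate (assumed value)
prev = loss(beta)
for _ in range(10_000):
    grad = X.T @ (X @ beta - y) / n                # Eqn. (13) for all j at once
    beta = beta - gamma * grad                     # Eqn. (14)
    cur = loss(beta)
    if prev - cur < 1e-12:                         # stop when the loss stops improving
        break
    prev = cur
```

Because the loss is convex, the iterates in `beta` approach `beta_closed`; the two agree to several decimals in this example.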
Deep neural networks often have millions of parameters and sometimes billions. The gradient
can therefore not be calculated for all parameters and observations every time. Luckily,
gradients between subsets of the data are often similar (Bottou 2018). Therefore, it is
possible to split the dataset into mini-batches and then calculate the gradient on each
mini-batch. The size of the mini-batches is limited by the memory in the computer's GPU and is
often somewhere between 32 and 256 (Goodfellow, Bengio, and Courville 2016). The calculation of
gradients is done with back-propagation (Rumelhart, Hinton, and Williams 1986).
Optimizing the parameter values through calculating gradients on mini-batches is known as
Stochastic Gradient Descent (SGD). The parameters are updated in a similar manner to Eqn.
(14) but $n$ is replaced by the mini-batch size. When applying SGD with an adaptive learning
rate scheme and adaptive momentum (Hinton 1977) the optimization algorithm must be able
to handle this. The Adam optimizer (Kingma and Ba 2014) is a common choice under these
circumstances since it is fast and can handle the adaptive learning rate and momentum scheme.
Therefore, the Adam optimizer will be used in this thesis.
When training a neural network for classification, the MSE loss function, described in Eqn.
(12), is replaced by another loss function known as cross-entropy loss. It is defined as
$$L(x_i, y_i, \theta) = -\sum_{k=1}^{K} y_{ik} \log p(k \mid x_i; \theta) = -y_i^T \log\left(\mathrm{softmax}(z_i)\right) \tag{15}$$
where $x_i$ are the predictors of observation $i$, $y_i$ is a one-hot encoded vector in which
the correct label of observation $i$ is coded as 1 and the rest as 0, and $\theta$ are the
current parameters of the model. It reduces to the negative logarithm of the probability
assigned to the correct class by the softmax function. Thus, it penalizes the model for
assigning high probabilities to incorrect classes. Correct guesses, especially when assigned
probabilities close to 1, will yield a low loss. The task of optimizing the classifier can then
be described in equation form as
$$\hat{\theta} = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} L(x_i, y_i, \theta) \tag{16}$$
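Since $y_i$ is one-hot, Eqn. (15) only needs the log-probability of the correct class. The sketch below computes the mean cross-entropy over a small batch. Labels are given as integer class indices instead of one-hot vectors, which is an implementation choice made here for brevity, and the function names are hypothetical.

```python
import numpy as np

def log_softmax(z):
    # Row-wise log of Eqn. (6), shifted by the row maximum for numerical stability.
    z = z - np.max(z, axis=1, keepdims=True)
    return z - np.log(np.sum(np.exp(z), axis=1, keepdims=True))

def cross_entropy(z, labels):
    """Mean of Eqn. (15) over a batch.

    z: (n, K) array of scores, labels: (n,) integer indices of the correct classes.
    """
    log_p = log_softmax(z)
    return -np.mean(log_p[np.arange(len(labels)), labels])

z = np.array([[4.0, 0.0, 0.0],    # confident in class 0
              [0.0, 0.0, 0.0]])   # uniform over the three classes
labels = np.array([0, 2])
batch_loss = cross_entropy(z, labels)
```

A uniform prediction over $K$ classes gives a loss of $\log K$ for that observation, while a confident and correct prediction gives a loss close to zero.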
4.4 Optimization and Regularization
4.4.1 Dropout
A common way to prevent overfitting of a neural network is to implement dropout (Srivastava
et al. 2014). With dropout, each time the gradient is calculated each unit and its connections
will have a probability of being excluded in this particular calculation and updating of weights.
This has the effect that units in the network will not co-adapt to any greater degree, meaning
that a unit cannot rely exclusively on the input of any other unit since there is a chance that this
unit will not be present during training. At test time, all units will be included and weighted
based on their probability of inclusion during training.
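The two phases described above can be sketched as follows. This follows the description in the text: units are dropped at random during training and every unit is scaled by its inclusion probability at test time. Many libraries instead rescale during training ("inverted" dropout), so this should be read as an illustration of the idea and not of a specific implementation. The function names are hypothetical.

```python
import numpy as np

def dropout_train(h, p_drop, rng):
    """Zero each unit independently with probability p_drop."""
    mask = rng.random(h.shape) >= p_drop
    return h * mask

def dropout_test(h, p_drop):
    """Keep every unit but weight it by its probability of inclusion."""
    return h * (1.0 - p_drop)

rng = np.random.default_rng(0)
h = np.ones(10_000)                      # activations of a (large) layer
dropped = dropout_train(h, p_drop=0.3, rng=rng)
```

With a dropout probability of 0.3, roughly 70% of the units survive in each training pass, which matches the scaling factor applied at test time.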
This method of preventing overfitting works well in practice. It is recommended that the
dropout probability should be high if the amount of training data is small and vice versa. It
makes intuitive sense that it is easier for a model to memorize a small training dataset and thus
overfit which makes the need for regularization greater. In the implementation of ULMFiT in
this thesis, different dropout probabilities will be used for different layers in the model. The
relative size of the dropout probabilities is difficult to motivate theoretically.
4.4.2 Batch Normalization
When training a neural network with some kind of gradient descent algorithm one will face a
problem which is referred to as internal covariate shift. It is defined as a change in distribution
of the activations of a layer in the network due to changes in parameters from earlier layers.
A change in distribution during training will slow the training significantly since it increases
the risk that the optimizer gets stuck due to vanishing gradients. Batch Normalization (Ioffe
and Szegedy 2015) remedies this problem by normalizing activations while still allowing the
normalized values to take on the same value as the original ones if this would be the optimal
solution. Two layers of Batch Normalization, one before each linear layer, are used in the
classifier part of the model in this thesis.
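A minimal sketch of the normalization step for one mini-batch is given below. It covers the training-time computation only (the running statistics used at test time are omitted), and the parameter names `gamma` and `beta` for the learnable scale and shift are conventional choices, not taken from the thesis.

```python
import numpy as np

def batch_norm(a, gamma, beta, eps=1e-5):
    """Normalize each unit (column) of a mini-batch of activations `a`,
    then rescale and shift with the learnable parameters gamma and beta."""
    mean = a.mean(axis=0)
    var = a.var(axis=0)
    a_hat = (a - mean) / np.sqrt(var + eps)
    return gamma * a_hat + beta

rng = np.random.default_rng(0)
a = rng.normal(loc=5.0, scale=3.0, size=(32, 4))   # batch of 32 observations, 4 units
out = batch_norm(a, gamma=np.ones(4), beta=np.zeros(4))

# Setting gamma and beta to the batch statistics recovers the original activations.
restored = batch_norm(a, gamma=np.sqrt(a.var(axis=0) + 1e-5), beta=a.mean(axis=0))
```

The last line illustrates why the layer does not restrict the network: if keeping the original activations is optimal, the scale and shift parameters can learn to undo the normalization.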
4.4.3 Weight decay
Weight decay is another way to reduce overfitting. It is commonly done in the form of L2
regularization (Ridge regression) which adds a penalty for big weights to the cost function of
the network. Smith (2018) shows in his paper that the weight decay should be chosen to be a
larger value for smaller learning rate values and vice versa. He also suggests trying
weight decay values anywhere between $10^{-2}$ and $10^{-6}$ depending on the dataset size and
how other regularization techniques are implemented.
4.5 ULMFiT
The Universal Language Model Fine-tuning (ULMFiT) is a transfer learning technique
in the Natural Language Processing domain introduced by Howard and Ruder (2018). Training
a classifier with this technique includes three different steps. First, a language model is trained
on a preferably very large corpus of documents. The more variety in the language of this corpus
the better. Training the language model on only medical documents, for example, would result
in a model which understands medical terms very well but it would not generalize as well to
other domains. The current state-of-the-art language model seems to be the GPT-2 presented by
OpenAI in a very recent paper (Radford et al. 2019). They have not released their pre-trained
model to the public so the AWD-LSTM (Merity, Keskar, and Socher 2017) is used instead in
our implementation of ULMFiT.
The second step is to fine tune the language model to the corpus which is specific to the
task. This is done using something called discriminative fine tuning and slanted triangular
learning rates (STLR). Discriminative fine tuning means that when updating the weights of the
model, different learning rates are used for different layers. The reasoning behind this kind of
fine-tuning is that the first layers are found to contain more general information and the last
layers contain more specific information (Yosinski et al. 2014).
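As an illustration of discriminative fine tuning, the sketch below assigns one learning rate per layer group, starting from the rate of the last group and dividing by a constant factor for each earlier group. The factor 2.6 is the value suggested by Howard and Ruder (2018); the number of layer groups (five) and the function name are assumptions made for this example.

```python
def layer_learning_rates(lr_last, n_groups, factor=2.6):
    """Learning rate for each layer group, ordered from first to last group.

    The last group uses lr_last; each earlier group uses the rate of the
    following group divided by `factor`.
    """
    return [lr_last / factor ** (n_groups - 1 - i) for i in range(n_groups)]

lrs = layer_learning_rates(lr_last=2e-2, n_groups=5)
```

With five groups the first group is trained with a rate of `2e-2 / 2.6**4`, so the general-purpose early layers change much more slowly than the task-specific last layers.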
STLR builds on the idea of the triangular learning rate schedule, proposed by Smith (2017).
The learning rate is triangular when it linearly increases and then decreases between a minimum
and maximum value cyclically over a certain number of training iterations, called a cycle length.
The motivation for such a learning rate structure is to more rapidly escape saddle points (where
the gradient is close to zero but far from the global minimum) in the loss function and also to
speed up training (Smith 2017). The STLR, which Howard and Ruder (2018) propose, is a
slightly modified version of Smith's triangular learning rate. The increase to its maximum value
happens in fewer iterations and the decreasing period is longer. They suggest that this works
better in practice.
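The shape of the schedule can be sketched as below, following the formula given by Howard and Ruder (2018). The default values (`cut_frac=0.1`, `ratio=32`, `lr_max=0.01`) are the ones reported in their paper and are not necessarily the values used in this thesis.

```python
import math

def stlr(t, total_iters, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted triangular learning rate at iteration t (counted from 0).

    cut_frac: fraction of iterations spent increasing the learning rate.
    ratio:    how much smaller the lowest learning rate is than lr_max.
    """
    cut = math.floor(total_iters * cut_frac)            # iteration of the peak
    if t < cut:
        p = t / cut                                      # short linear increase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))   # long linear decrease
    return lr_max * (1 + p * (ratio - 1)) / ratio

schedule = [stlr(t, total_iters=1000) for t in range(1000)]
```

In this example the learning rate rises from `lr_max / 32` to `lr_max` during the first 100 iterations and then decays linearly over the remaining 900.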
Momentum (Hinton 1977) is another way to speed up training and quickly escape saddle
points. Smith (2018) shows that the learning rate and momentum go hand in hand: if you
change one then you must change the other if you want optimal performance. He introduces
cyclical momentum which Howard and Ruder then use in combination with STLR in order to
achieve good performance.
The third step, after the language model is fine tuned to a specific corpus, is to add the
classifier on top of the language model. It takes a concatenated pooling of the last hidden states
from the language model as input. The concatenated pooling, $h_c$, is defined as:
$$h_c = [h_T, \mathrm{maxpool}(H), \mathrm{meanpool}(H)] \tag{17}$$
where $[\cdot]$ is a concatenation and $H = [h_1, h_2, \dots, h_T]$. The maxpool operation takes
the largest value of each feature across the hidden states in $H$ (the most important features)
and the meanpool operation takes the average of each feature across the hidden states in $H$.
This input is then fed into two linear layers which use dropout and batch normalization, as
described in section 4.4. The first of these layers uses the ReLU activation function described
in section 4.1 and the second layer is propagated forward into the softmax function (Eqn. (6))
in order to assign the class probabilities.
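Eqn. (17) amounts to three simple array operations. The sketch below stores the hidden states as the rows of a matrix and pools over the time dimension; the toy values and the function name `concat_pool` are made up for illustration.

```python
import numpy as np

def concat_pool(H):
    """H: (T, d) array whose rows are the hidden states h_1, ..., h_T.

    Returns [h_T, maxpool(H), meanpool(H)], a vector of length 3d.
    """
    return np.concatenate([H[-1], H.max(axis=0), H.mean(axis=0)])

H = np.array([[1.0, -2.0],
              [3.0,  0.0],
              [2.0,  4.0]])    # T = 3 time steps, d = 2 hidden units
h_c = concat_pool(H)
```

The pooled vector is three times as long as a single hidden state. With the 400-unit output of the language model this gives 1200 values, which appears to correspond to the 1200 units shown in Fig. 8.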
4.6 Evaluation metric
The metric used to evaluate performance of the classifier is Accuracy@K. If K = 3, it means
that the classifier gets three guesses at each document. The guesses are the three classes
assigned the highest probabilities by the softmax function. Accuracy@K is then defined as:
$$\text{Accuracy@}K = \frac{\text{Number of correct guesses}}{\text{Number of documents}}$$
Since only one guess can be correct per document, the metric is bounded
between 0 and 1. In our experiments we will use K = 1, 3 and 5.
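The metric can be computed directly from the matrix of predicted class probabilities. The sketch below uses a toy example with three documents and three classes; the function name and the data are hypothetical.

```python
import numpy as np

def accuracy_at_k(probs, labels, k):
    """probs: (n, n_classes) predicted probabilities, labels: (n,) true class indices."""
    top_k = np.argsort(probs, axis=1)[:, -k:]        # the k most probable classes per row
    hits = (top_k == labels[:, None]).any(axis=1)    # is the true class among the guesses?
    return hits.mean()

probs = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3],
                  [0.1, 0.2, 0.7]])
labels = np.array([0, 2, 1])

acc1 = accuracy_at_k(probs, labels, k=1)
acc2 = accuracy_at_k(probs, labels, k=2)
```

Here only the first document is classified correctly with one guess, while all three true classes are among the two most probable guesses, so the metric can only increase with K.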
5 Experiments
5.1 Benchmark
First we trained a Naive Bayes classifier to use as a reference point for our neural network model
since it can be trained relatively quickly. Unigrams and bigrams with TF-IDF representation
were used as features. We removed words which appeared in more than half of the documents.
Using chi-squared feature selection (Manning, Raghavan, and Schütze 2008), we only included
the top 60000 features. The performance of this benchmark is shown in table 2.
| Model       | Accuracy@1 | Accuracy@3 | Accuracy@5 |
|-------------|------------|------------|------------|
| Naive Bayes | 73.63%     | 85.65%     | 88.97%     |

Table 2: Naive Bayes performance given one, three and five guesses.
5.2 ULMFiT implementation
Language model The first thing we did when creating the language model was to build a
vocabulary. In order to do so, the text is pre-processed so that, for example, words like don't
are divided into two tokens: do and n't. Special tokens indicating important things happening in
the text were used. For example, a token indicating that the following word is in all upper case
letters was used since the semantic meaning of, for instance, STOP and stop might be very
different and thus carry a lot of information. There is also a token indicating where a new text
starts, a token indicating a capitalized letter and a token for words that are not in the
vocabulary. In these experiments a vocabulary of the 60000 most common tokens in the dataset has
been used. The tokenized data are then quantified: the most common token gets the value 0, the
second most common gets 1, and so on. One could limit the size of the vocabulary to a lower
number to save computation time but since the data contains so many different classes we believe
that the language can be quite diverse and that a large vocabulary is needed to successfully
distinguish between classes.
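The mapping from tokens to integers can be sketched as below. The token name `xxunk` for out-of-vocabulary words and the choice to give it the last index are assumptions made for this illustration; the actual pre-processing also includes the tokenization and the other special tokens described above, which are left out here.

```python
from collections import Counter

UNK = "xxunk"   # hypothetical name for the out-of-vocabulary token

def build_vocab(tokenized_docs, max_size):
    """Map the max_size most common tokens to integers, most common first."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    itos = [tok for tok, _ in counts.most_common(max_size)]
    itos.append(UNK)                      # the last index is reserved for unknown tokens
    return {tok: i for i, tok in enumerate(itos)}

def numericalize(doc, stoi):
    """Replace every token by its integer, falling back to the UNK index."""
    return [stoi.get(tok, stoi[UNK]) for tok in doc]

docs = [["do", "n't", "stop"],
        ["do", "n't", "stop", "now"],
        ["do", "stop"],
        ["do"]]
stoi = build_vocab(docs, max_size=3)
ids = numericalize(["do", "n't", "sing"], stoi)
```

In this toy corpus "do" is the most common token and gets the value 0, while "sing" is not in the vocabulary and is mapped to the index of the unknown token.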
The structure of the neural network used in the experiments is a version of the ULMFiT
model, tuned in different ways to be more suited for the highly multiclass classification task.
First, a language model is trained with the purpose that the model should get familiar with the
English language in general and the specific language of the data in particular. The language
model (seen in a down-scaled version in Fig. 5) consists of an embedding layer with 400 units
and three LSTM layers with 1150 hidden units in each layer, where the output of the last layer
is 400 units, the same as the number of embedding units.
[Figure 5 diagram: Input (60000 units) → Embedding (400 units) → LSTM (1150 units) →
LSTM (1150 units) → LSTM (1150 units) → Output (400 units)]
Figure 5: Language model structure. It takes the quantified vocabulary as input which is then
fed into an embedding layer, used to learn the relations between words. This is then propagated
forward through three layers of LSTM cells which then output hidden states of the same size
as the embedding layer.
This is quite a large network which takes a long time to train. A model with pre-trained
weights is used to initialize the weights in our training as suggested by Howard and Ruder
(2018). Starting with these weights, the language model is trained on the Reddit data in order
to learn the language used in this realm. For the purpose of tracking the model's performance
across training, 10% of the data is kept for validation. With more data, the language model
can learn more, which is why the validation set is somewhat small. It is also worth noting that
the language model does not get to know the class labels associated with each observation, which
means that a validation set for the classifier does not need to be set aside in this part of
training.
Dropout values seen in table 3 are used during training. The values are set low since the
training data is large.
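|                     | Input layer | Embedding layer | Hidden layers | Weights |
|---------------------|-------------|-----------------|---------------|---------|
| Dropout probability | 0.072       | 0.012           | 0.024         | 0.060   |

Table 3: Dropout probabilities for different layers and the weight matrices in the language
model.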
A short test is run for a few iterations where the learning rate versus loss is plotted, seen in
Fig. 6. A rule of thumb is to use a learning rate somewhere in the steepest descent in the plot.
We want to find a point in the plot where the learning rate is high while loss is low. This leads
us to initialize training with a learning rate of 0.04.
Figure 6: A smoothed plot of learning rate versus loss for the language model run on a few
minibatches until the loss started increasing.
We also plotted the learning rate versus loss (see Fig. 7) after every second epoch during
training in order to make reasonable adjustments to learning rate during training.
[Figure 7 panels: (a) Plot after second epoch, (b) Plot after fourth epoch]
Figure 7: Smoothed plots of learning rate versus loss constructed after second and fourth epoch.
| Epoch | Train loss | Valid loss | Accuracy | Time (h:mm:ss) | LR    |
|-------|------------|------------|----------|----------------|-------|
| 1     | 4.83       | 4.78       | 0.231    | 9:36:57        | 0.04  |
| 2     | 4.23       | 4.14       | 0.277    | 9:37:02        | 0.04  |
| 3     | 3.78       | 3.74       | 0.321    | 9:26:58        | 3e-04 |
| 4     | 3.74       | 3.69       | 0.328    | 9:27:29        | 3e-04 |
| 5     | 3.71       | 3.67       | 0.329    | 10:10:43       | 1e-04 |
| 6     | 3.7        | 3.66       | 0.331    | 10:11:01       | 1e-04 |

Table 4: Training schedule for the first language model. Cyclical momentum was also used and
set to vary between 0.8 and 0.7 for all epochs. The Adam optimizer was used with
$\beta_1 = 0.9$ and $\beta_2 = 0.99$. Weight decay was set to 0.01.
In table 4 the training progression for the language model fine-tuned on the Reddit data is
detailed. The total training was almost 60 hours of GPU time (on an Nvidia Tesla K80 GPU) and
after the sixth epoch, the model is able to successfully predict the next word in the validation
part of the data with 33.1% accuracy. In hindsight, an inspection of Fig. 7a and 7b leads us to
believe that the learning rates of epochs three, four, five and six could have been set a little
more aggressively, possibly leading to faster convergence and improvement in performance.
Classifier. After the language model had been trained, we trained the classifier. It sits on top
of the language model, as explained in section 4.5. Fig. 8 shows a simplified structure of the
classifier. The number of units in the output layer corresponds to the number of classes in the
data.
[Figure 8 diagram: LM output -> concatenate pooling -> linear layer (1200 units) -> ReLU
activation -> linear layer (50 units) -> output layer (1013 units)]
Figure 8: Classifier structure. The leftmost layer is the output of the language model, which
is fed into a pooling layer. The pooled output then goes into a large linear layer followed by
a ReLU activation. The activations are then fed into another, smaller linear layer, whose units
are used to calculate the output layer.
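The pooling step in Fig. 8 can be sketched as follows, assuming it follows the concat pooling of Howard and Ruder (2018), where the last hidden state is concatenated with the element-wise maximum and mean over all time steps. The function name concat_pool and the tiny input are illustrative assumptions, and plain Python lists stand in for tensors.

```python
def concat_pool(hidden_states):
    """Concatenate the last hidden state with the max- and mean-pooled states.

    hidden_states: list of T vectors (one per time step), each of length H.
    Returns a single vector of length 3 * H.
    """
    last = list(hidden_states[-1])
    columns = list(zip(*hidden_states))  # H tuples, each holding T values
    max_pool = [max(col) for col in columns]
    mean_pool = [sum(col) / len(col) for col in columns]
    return last + max_pool + mean_pool


# Toy input (assumption): T = 3 time steps, hidden size H = 2.
states = [[1.0, -2.0], [3.0, 0.0], [2.0, 4.0]]
pooled = concat_pool(states)  # length 3 * H = 6
```

The pooled vector is three times as wide as a single hidden state. With a hidden size of 400, which is an assumption here and not stated in this section, it would have 1200 entries, consistent with the 1200 units shown in Fig. 8.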
Batch normalization is used between layers and the dropout probability is set to 0.048 for
the first layer and 0.1 for the second layer of the classifier. 15% of the data is used for
validation and a batch size of 32 is used during training (the largest that fit in memory).
Training was initialized with a learning rate of 0.04.
The classifier training is detailed in table 5. The one cycle policy (Smith 2018) and
discriminative fine-tuning (Howard and Ruder 2018) are used to train the model. The values in
the last eight rows of the LR column in table 5 refer to the minimum and maximum learning rates
used during the cycle.
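The shape of a single cycle can be sketched as below. This is a simplified illustration of the one cycle policy, not the schedule used for the thesis: the linear ramps, the starting rate of lr_max / 25 and the 30% warm-up share are assumptions chosen for the example, while the momentum range of 0.8 to 0.7 and the maximum rate of 0.04 are taken from the text.

```python
def one_cycle(step, total_steps, lr_max, div=25.0, mom_max=0.8, mom_min=0.7, pct_up=0.3):
    """Return (learning rate, momentum) at a given step of a single cycle.

    The learning rate rises linearly from lr_max / div to lr_max and then
    falls back again; momentum moves in the opposite direction.
    """
    up_steps = max(1, int(total_steps * pct_up))
    if step < up_steps:
        frac = step / up_steps  # warm-up phase
    else:
        frac = 1.0 - (step - up_steps) / max(1, total_steps - up_steps)
    lr = lr_max / div + frac * (lr_max - lr_max / div)
    mom = mom_max - frac * (mom_max - mom_min)
    return lr, mom


total = 100
schedule = [one_cycle(s, total, lr_max=0.04) for s in range(total + 1)]
lr_vals = [lr for lr, _ in schedule]
mom_vals = [mom for _, mom in schedule]
```

Momentum is lowest exactly when the learning rate peaks, which is the pairing suggested by Smith (2018).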
Gradual unfreezing was also used during training of the classifier in order to remedy the
problem of catastrophic forgetting (that the model forgets the general language contained in the
language model). The last column of table 5 specifies which layers were trained when.
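Gradual unfreezing amounts to choosing, per epoch, which layer groups receive gradient updates. A minimal sketch follows, assuming the model is split into five layer groups ordered from the embedding to the classifier head. The group count and the function name trainable_mask are assumptions, while the epoch schedule is read from the Layers column of table 5.

```python
def trainable_mask(n_groups, n_unfrozen=None):
    """Return one flag per layer group; True means the group is updated.

    Groups are ordered from the embedding (index 0) to the classifier head
    (last index). n_unfrozen=None means that every group is trained.
    """
    if n_unfrozen is None or n_unfrozen >= n_groups:
        return [True] * n_groups
    return [i >= n_groups - n_unfrozen for i in range(n_groups)]


# Epoch -> number of trailing groups trained (None = all), as in table 5.
schedule = {1: None, 2: None, 3: 2, 4: 2, 5: 3, 6: 3, 7: 3, 8: 3, 9: 3, 10: None}
N_GROUPS = 5  # assumption: the number of layer groups is not given in the text
masks = {epoch: trainable_mask(N_GROUPS, k) for epoch, k in schedule.items()}
```

In a deep learning framework, the mask would typically be applied by switching gradient computation on or off for the parameters of each group.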
Epoch  Train loss  Valid loss  Accuracy  Time     LR                  Layers
1      3.70        3.23        0.323     4:22:30  0.04                All
2      2.69        2.21        0.538     4:03:48  0.04                All
3      2.92        2.43        0.470     5:13:17  2e-2/2.6^4 to 2e-2  Last two
4      1.84        1.50        0.679     5:07:30  2e-2/2.6^4 to 2e-2  Last two
5      1.64        1.28        0.724     6:58:24  3e-3/2.6^4 to 3e-3  Last three
6      1.22        0.98        0.790     7:36:26  3e-3/2.6^4 to 3e-3  Last three
7      1.23        1.05        0.773     7:17:33  1e-3/2.6^4 to 1e-3  Last three
8      1.10        0.93        0.800     7:12:42  1e-3/2.6^4 to 1e-3  Last three
9      1.02        0.92        0.801     7:28:19  1e-3/2.6^4 to 1e-3  Last three
10     1.00        0.91        0.805     8:45:06  1e-4/2.6^4 to 1e-4  All
Table 5: Training schedule for the classifier. Cyclical momentum was also used and set to
vary between 0.8 and 0.7 for all epochs. The Adam optimizer was used with β1 = 0.9 and
β2 = 0.99. Weight decay was set to 0.01.
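Discriminative fine-tuning gives each layer group its own learning rate. Howard and Ruder (2018) suggest dividing the rate by 2.6 for every step down the layer stack, so with five groups the lowest rate is the highest divided by 2.6 to the power of four, which appears to be what the lower bounds in the LR column of table 5 denote. A minimal sketch under that assumption follows; the function name discriminative_lrs and the group count of five are illustrative.

```python
def discriminative_lrs(lr_max, n_groups, factor=2.6):
    """Learning rate per layer group, lowest for the earliest group.

    Each group uses the rate of the group above it divided by `factor`.
    """
    return [lr_max / factor ** (n_groups - 1 - i) for i in range(n_groups)]


# Rates for epochs 3 and 4 of table 5, assuming five layer groups.
group_lrs = discriminative_lrs(2e-2, n_groups=5)
# group_lrs[0] equals 2e-2 / 2.6**4 (about 4.4e-4); group_lrs[-1] equals 2e-2
```

Earlier groups hold more general language features and are therefore updated more cautiously than the task-specific head.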
The learning rate becomes smaller as training progresses and is chosen by constructing
plots like the one in Fig. 6. Progress was quite slow in the beginning, which could indicate
that the learning rate should have been initialized at a larger value. Training started by
fine-tuning the whole model for two epochs and then used gradual unfreezing, where we fine-tuned
just the last two and then the last three layers for seven epochs. On the last epoch, the whole
model was fine-tuned. Total training time was 64 hours. After five epochs, the accuracy was
almost on par with the Naive Bayes benchmark, and after ten epochs the classifier could classify
a text correctly with 80.5% accuracy, given one guess.
6 Results
Table 6 shows the Accuracy@K metric for our classifier compared to the benchmark classifier
on the validation and test data. We see that our neural network based model outperforms the
benchmark, Naive Bayes, by a wide margin on all metrics and both datasets. The largest
difference is observed when the model is only given one guess.
Fig. 9 shows the distribution of error rates across classes for the validation data and
test data respectively. As seen in both figures, very few classes have an error rate higher
than 40%.
Model        Data        Accuracy@1  Accuracy@3  Accuracy@5
Naive Bayes  Validation  73.63%      85.65%      88.97%
ULMFiT       Validation  80.49%      89.78%      92.2%
ULMFiT       Test        78.57%      88.64%      91.32%
Table 6: Table of performance measured as Accuracy@K of the benchmark model and our
classifier on both validation and test data.
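Accuracy@K counts a prediction as correct when the true class is among the K classes with the highest predicted scores. A minimal sketch follows; the function name accuracy_at_k and the four toy observations are invented for illustration and are not taken from the thesis data.

```python
def accuracy_at_k(scores, labels, k):
    """Share of observations whose true class is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Toy scores (assumption): four observations, three classes.
scores = [
    [0.7, 0.2, 0.1],  # true class 0: correct with one guess
    [0.5, 0.3, 0.2],  # true class 1: correct with two guesses
    [0.6, 0.3, 0.1],  # true class 2: needs all three guesses
    [0.1, 0.1, 0.8],  # true class 2: correct with one guess
]
labels = [0, 1, 2, 2]
acc1 = accuracy_at_k(scores, labels, 1)
acc2 = accuracy_at_k(scores, labels, 2)
acc3 = accuracy_at_k(scores, labels, 3)
```

Accuracy@K can never decrease as K grows, which is why the columns of table 6 increase from left to right.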
(a) Validation (b) Test
Figure 9: Plot of error distribution between classes for both validation and test data
Table 7 shows the classes with the lowest and highest error rates in the validation data,
together with the number of observations in each class.
Table 8 shows the same for the test data. As mentioned, this data is more unbalanced than the
validation data, with some classes having fewer than ten observations.
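Tables of this kind can be derived from the true and predicted labels alone. A minimal sketch follows, using a hypothetical helper error_rate_per_class and invented toy labels rather than the thesis data.

```python
from collections import Counter


def error_rate_per_class(y_true, y_pred):
    """Return {class: (observations, error rate)} from true and predicted labels."""
    totals = Counter(y_true)
    errors = Counter(t for t, p in zip(y_true, y_pred) if t != p)
    return {c: (n, errors[c] / n) for c, n in totals.items()}


# Toy labels (assumption): class "a" has 2 observations, class "b" has 4.
y_true = ["a", "a", "b", "b", "b", "b"]
y_pred = ["c", "a", "b", "b", "b", "c"]
rates = error_rate_per_class(y_true, y_pred)
ranked = sorted(rates.items(), key=lambda item: item[1][1])
# ranked[0] is the class with the lowest error rate, ranked[-1] the highest
```

Keeping the number of observations next to each error rate matters for the test data, where an error rate computed from only a handful of observations is very noisy.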
It can be seen in tables 7 and 8 that most of the classes that the model fails to predict
accurately seem very general. For example, topics such as canada, united kingdom and networking
seem very broad and could contain almost any type of discussion. Only around 20 classes have an
error rate higher than 50% on both validation and test data, and there are many classes with
almost no errors.
Fig. 10 shows some examples that the model failed to classify correctly (with one guess).
The first example seems like a decent guess. The second example could plausibly fit either
the predicted or the true class. In the third example, the text seems very hard to categorize
without context. However, if one were to know that the label Invisalign refers to a kind of
dental treatment similar to braces, maybe it would be possible to make that prediction. The
Lowest errors Highest errors
Class Observations Error rate Class Observations Error rate
KeybaseProofs 101 0 Construction 91 0.495
incest 110 0 seduction 95 0.495
ACL 105 0.01 southafrica 107 0.495
Stormlight_Archive 96 0.01 cscareerquestions 98 0.5
SkincareAddiction 91 0.011 Psychic 88 0.5
Kava 84 0.012 privacy 96 0.5
snapchat 95 0.021 bladeandsoul 97 0.505
ShingekiNoKyojin 117 0.026 linuxquestions 105 0.514
Mattress 116 0.026 AvPD 87 0.517
mead 114 0.026 socialism 110 0.518
reloading 108 0.028 dndnext 113 0.522
swoleacceptance 104 0.029 asktrp 105 0.524
WritingPrompts 101 0.03 canada 99 0.525
asmr 100 0.03 personalfinance 91 0.538
vikingstv 99 0.03 networking 103 0.544
sharditkeepit 97 0.031 Anarchism 104 0.567
Snus 93 0.032 hacking 108 0.574
Chromecast 92 0.033 actuallesbians 89 0.618
Geosim 112 0.036 techsupport 102 0.647
hookah 83 0.036 unitedkingdom 87 0.655
Table 7: Table of classes with lowest and highest error rates, number of observations in each
class and error rate of each class for the validation data.
predicted label, wls, refers to discussions about "weight loss surgery", which seems like a good
guess with the given information. In the last example, the model predicts russian with the true
label being russia. Perhaps one of these classes should not have been in the data to begin with
since there is very likely to be a big overlap between them.
Lowest errors Highest errors
Class Observations Error rate Class Observations Error rate
garlicoin 5 0 PoloniexForum 6 0.5
FidgetSpinners 4 0 hitmobile 2 0.5
netneutrality 12 0 seduction 234 0.504
vergecurrency 4 0 datascience 196 0.51
lightsabers 54 0 hacking 78 0.513
KeybaseProofs 217 0.005 Lineage2Revolution 31 0.516
DestructiveReaders 90 0.011 DFO 207 0.517
SkincareAddiction 209 0.019 privacy 206 0.519
snapchat 187 0.027 networking 234 0.526
TOR 74 0.027 FORTnITE 236 0.542
Porsche 72 0.028 personalfinance 203 0.552
incest 143 0.028 Psychic 215 0.558
malehairadvice 208 0.029 canada 224 0.562
emojipasta 133 0.03 schizophrenia 221 0.57
OneNote 59 0.034 actuallesbians 223 0.583
WritingPrompts 176 0.034 techsupport 213 0.601
puppy101 228 0.035 asktrp 206 0.607
tarantulas 83 0.036 StateOfDecay 64 0.609
sharditkeepit 52 0.038 bladeandsoul 216 0.62
SampleSize 225 0.04 unitedkingdom 206 0.743
Table 8: Table of classes with lowest and highest error rates, number of observations in each
class and error rate of each class for the test data.
Figure 10: Some examples of texts that the model could not correctly predict. The text is
displayed in tokenized form along with predicted and true labels.
7 Conclusion
This thesis aimed to investigate if a neural network could perform well on a multi-class
classification task. A publicly available dataset was used, consisting of 1013 classes with 1000
observations per class. We found that an LSTM-based model which used a transfer learning
technique known as ULMFiT could successfully be trained to perform well on this data. This
model beat our benchmark model by a wide margin and could predict the right class
with 80.5% accuracy (given one guess), 89.8% accuracy (given three guesses) and 92.2%
accuracy (given five guesses).
The model's ability to correctly classify a text seemed to depend on how broad or
narrow a class was. Classes which could contain almost any type of discussion were harder
to classify, whereas narrower classes were easier. There was also some overlap between
some of the classes, which contributed to the model's inability to classify some texts
(especially when given only one guess). The creators of the dataset believed that the maximum
possible accuracy on this data was approximately 96% because some texts seemed to contain no
useful information at all. In light of this, 92.2% Accuracy@5 must be considered a good result.
The accuracy could likely be increased even more with better choices of learning rate in
each epoch. Increased training time would also be a way of boosting performance, although
expensive. Another expensive way to boost performance would be to train several different
models and ensemble their predictions.
Interesting topics for further research include investigating whether CNN-based
language models and classifiers could achieve good performance in a similar setting. Another
question is how the performance of the language model affects the performance of the
classifier. Would using GPT-2 instead of AWD-LSTM as the pre-trained model
result in a large boost in performance? Roughly one thousand observations per class were used
in this thesis and it would be of interest to find out how much training data is required to achieve
good results.
References
BioASQ (2013). The Challenge. [Accessed 2019-03-26]. URL:
http://bioasq.org/participate/challenges_year_1.
Bottou, L. (2018). “Online Learning and Stochastic Approximations (revised 5/2018)”. Online
Learning in Neural Networks, 1–35.
Conneau, A. et al. (2017). “Very Deep Convolutional Networks for Text Classification”. arXiv:
1606.01781.
Dai, A. M. and Q. V. Le (2015). “Semi-supervised Sequence Learning”, 1–10. arXiv: 1511.01432.
Gargiulo, F., S. Silvestri, and M. Ciampi (2018). “Deep Convolution Neural Network for Extreme
Multi-label Text Classification”. HEALTHINF, 641–650.
Goodfellow, I., Y. Bengio, and A. Courville (2016). Deep Learning.
http://www.deeplearningbook.org. MIT Press.
Hinton, G. E. (1977). “Relaxation and its role in vision”. Ph.D Thesis, University of Edinburgh.
Hochreiter, S. and J. Schmidhuber (1997). “Long Short-Term Memory”.
Neural Computation 9.8, 1735–1780.
Howard, J. and S. Ruder (2018). “Universal Language Model Fine-tuning for Text
Classification”. arXiv: 1801.06146.
Ioffe, S. and C. Szegedy (2015). “Batch Normalization: Accelerating Deep Network Training
by Reducing Internal Covariate Shift”. arXiv: 1502.03167.
Joulin, A. et al. (2016). “Bag of Tricks for Efficient Text Classification”. arXiv: 1607.01759.
Kaggle (2018). The reddit self-post classification task. [Accessed 2019-03-27]. URL:
https://www.kaggle.com/mswarbrickjones/reddit-selfposts.
Kim, Y. (2014). “Convolutional Neural Networks for Sentence Classification”. Proceedings of
the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).
1746–1751.
Kingma, D. P. and J. Ba (2014). “Adam: A Method for Stochastic Optimization”. arXiv: 1412.6980.
Lai, S. et al. (2015). “Recurrent Convolutional Neural Networks for Text Classification”.
AAAI’15, 2267–2273.
Liu, J. et al. (2017). “Deep Learning for Extreme Multi-label Text Classification”.
Proceedings of the 40th International ACM SIGIR Conference on Research and Development in
Information Retrieval, 115–124.
Manning, C. D., P. Raghavan, and H. Schütze (2008). Introduction to Information Retrieval.
New York, NY, USA: Cambridge University Press. ISBN: 0521865719, 9780521865715.
Merity, S., N. S. Keskar, and R. Socher (2017). “Regularizing and optimizing LSTM language
models”. arXiv: 1708.02182v1.
Mikolov, T. et al. (2013). “Efficient Estimation of Word Representations in Vector Space”.
arXiv: 1301.3781.
Nair, V. and G. E. Hinton (2010). “Rectified Linear Units Improve Restricted Boltzmann
Machines”. ICML’10: Proceedings of the 27th International Conference on Machine Learning,
807–814.
Östling, R. (2013). “Automated Essay Scoring for Swedish”. Proceedings of the Eighth
Workshop on Innovative Use of NLP for Building Educational Applications, 42–47.
Radford, A. et al. (2019). “Language Models are Unsupervised Multitask Learners”. URL:
https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
Rumelhart, D. E., G. E. Hinton, and R. J. Williams (1986). “Learning representations by back-
propagating errors.” Nature 323, 533–536.
Smith, L. N. (2017). “Cyclical learning rates for training neural networks”. Proceedings of the
2017 IEEE Winter Conference on Applications of Computer Vision (WACV), 464–472.
arXiv: 1506.01186.
— (2018). “A disciplined approach to neural network hyper-parameters: Part 1 – learning rate,
batch size, momentum, and weight decay”, 1–21. arXiv: 1803.09820.
Srivastava, N. et al. (2014). “Dropout: A Simple Way to Prevent Neural Networks from
Overfitting”. Journal of Machine Learning Research 15, 1929–1958.
Sutskever, I., O. Vinyals, and Q. V. Le (2014). “Sequence to Sequence Learning with Neural
Networks”. arXiv: 1409.3215.
Wikipedia (2019a). Medical Subject Headings. [Accessed 2019-03-26]. URL:
https://en.wikipedia.org/wiki/Medical_Subject_Headings.
Wikipedia (2019b). Reddit. [Accessed 2019-03-27]. URL: https://en.wikipedia.org/wiki/Reddit.
Yosinski, J. et al. (2014). “How transferable are features in deep neural networks?” arXiv:
1411.1792.
Zhang, X. and Y. LeCun (2016). “Text Understanding from Scratch”. arXiv: 1502.01710v5.
Zhang, X., J. Zhao, and Y. LeCun (2016). “Character-level Convolutional Networks for Text
Classification”. arXiv: 1509.01626v3.