APPLICATIONS OF DEEP LEARNING IN TEXT CLASSIFICATION FOR HIGHLY MULTICLASS DATA

Submitted by

Adam Grünwald

A thesis submitted to the Department of Statistics in partial fulfillment of the requirements for Master

degree in Statistics in the Faculty of Social Sciences

Supervisor

Rauf Ahmad

Spring, 2019

ABSTRACT

Text classification using deep learning is rarely applied to tasks with more than ten target

classes. This thesis investigates if deep learning can be successfully applied to a task with over

1000 target classes. A pretrained Long Short-Term Memory language model is fine-tuned and

used as a base for the classifier. After five days of training, the deep learning model achieves

80.5% accuracy on a publicly available dataset, 9.3% higher than Naive Bayes. With five

guesses, the model predicts the correct class 92.2% of the time.

Contents

1 Introduction
2 Related Work
    2.1 Machine Learning
    2.2 Deep learning
3 Data
4 Method
    4.1 Neural Networks
    4.2 LSTM Recurrent Neural Network
    4.3 Training a Neural Network
    4.4 Optimization and Regularization
        4.4.1 Dropout
        4.4.2 Batch Normalization
        4.4.3 Weight decay
    4.5 ULMFiT
    4.6 Evaluation metric
5 Experiments
    5.1 Benchmark
    5.2 ULMFiT implementation
6 Results
7 Conclusion

1 Introduction

Labelling different types of text documents is both important and desirable. There are plenty

of different situations where this is useful. Automating these tasks is therefore something of

great value if the automated system performs on a par with, or better than, humans. Labelling

essays with grades is an example of a task that is time consuming but also important, which

is why a lot of research has been put into Automated Essay Scoring (for research on Swedish

essays, see Östling 2013). Removing posts from social media platforms which are against the

terms of use or illegal (e.g. hate speech or threats of physical violence) is another case where

automation, if done right, would be beneficial.

Automated labelling of texts can also be useful in other cases. Classifying e-mails as spam,

classifying reviews as either positive or negative and assigning topics to Wikipedia articles (see

Zhang, Zhao, and Lecun 2016) are more examples of useful applications. When assigning

topics to Wikipedia articles, the number of target classes is larger. The literature makes a

distinction between the case when the target classes are binary (e.g. "Spam"/"Not Spam",

"Positive"/"Negative") and when there are several possible target classes (e.g. different topics),

where the latter problem is more complex.

A different text classification task is when a document can be labelled with several labels.

This is referred to as a multi-label classification task. If the label-space is very big, the task

becomes an extreme multi-label classification task (XMTC). An example of such a problem is

the challenge proposed by BioASQ in 2013 (BioASQ 2013). The objective of this task was

to assign several Medical Subject Headings, also known as MeSH (Wikipedia 2019a), to new

PubMed (large database of medical articles) documents.

Text classification tasks can be summarized as four different types: one out of two classes,

one out of multiple classes, several labels out of a limited number of labels and several labels

out of extremely many labels (see Gargiulo, Silvestri, and Ciampi 2018 for a similar summary).

Previously, there have not been any papers which deal with more than about 50 classes in a deep

learning framework. The data used in this thesis has over 1000 target classes which makes this

type of classification problem uncharted territory.

The objective of this thesis is to investigate the performance of a neural network transfer

learning technique, known as ULMFiT (Howard and Ruder 2018), on a task similar to the

second type: to determine which one class, out of multiple classes, a text belongs to. This

can be thought of as a fifth category of text classification task: a highly multiclass classification

task.

In the next section there will be an overview of related work. Then, the data and method will

be introduced. A benchmark classifier will be constructed for comparison purposes. Lastly, an

implementation of ULMFiT on a highly multiclass classification task will be presented along

with results and a conclusion.

2 Related Work

2.1 Machine Learning

In text classification, a simple approach is to consider the text as a bag-of-words. In this

approach, a sentence or a document is an observation and the variables are all the words which

occur in any of the observations. The value of each variable is then the number of times it occurs in

that particular observation. It is common to put an upper bound on how large the vocabulary

can be, in which case the least common (and sometimes also the most common) words are omitted. It is also

common to use bigrams, two-word sequences in a text. The sentence "I love you" has the unigrams

"I", "love" and "you" and the bigrams "I love" and "love you".
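The unigram and bigram counts described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions (whitespace tokenization, no vocabulary limit); the function names are hypothetical and not taken from the thesis code.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-word sequences in `tokens`, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(text, max_n=2):
    """Count every n-gram of length 1..max_n in a whitespace-tokenized text."""
    tokens = text.split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


# Three unigrams and two bigrams, each occurring once.
features = bag_of_words("I love you")
```

Each key of `features` corresponds to one variable of the bag-of-words representation and each value to the count for this particular observation.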

In bag-of-words representations, raw counts of words are usually not the best option. It is

often better to use TF-IDF (Term Frequency-Inverse Document Frequency) instead. Explained

in simple terms, TF-IDF weights a term by how common it is in a given document, discounted by

how common the term is across all documents in the data. These features can then be used to

train classifiers like Naive Bayes. Naive Bayes will be used as a benchmark for comparison

in this thesis.
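To make the weighting concrete, the following sketch computes one common TF-IDF variant, the raw term count multiplied by log(N / df), where N is the number of documents and df the number of documents containing the term. Several other variants exist, and the thesis does not state which one was used, so the exact formula here is an assumption.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


docs = ["the movie was great", "the game was long", "great game"]
w = tf_idf(docs)
```

In this toy corpus "movie" occurs in only one document and therefore receives a larger weight than "the", which occurs in two.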

Other popular representations include embedding words into vectors, done in word2vec

(Mikolov et al. 2013) and fastText (Joulin et al. 2016). These embeddings are used to capture

similarities between words and can be used to train a classifier that achieves good performance

in a very short time.

2.2 Deep learning

A popular way to approach different tasks in NLP is to use a Long Short-Term Memory Recurrent

Neural Network (LSTM RNN) (Hochreiter and Schmidhuber 1997). The strength of the

LSTM is that it can capture information in any part of the document. It also allows the model

to account for the specific order of words, as shown by Sutskever et al. (2014).

Another interesting approach is to use a Convolutional Neural Network (CNN) on a character

level (see Zhang and Lecun 2016 and Zhang, Zhao, and Lecun 2016). Since CNNs are

the current state-of-the-art in image recognition, it has been suggested and shown that they can

be successful in various NLP tasks. Conneau et al. (2017) extended the idea of character level

CNNs by using up to 29 convolutional layers with promising results. CNNs have also been used

for the multi-label problem by Liu et al. (2017) and by Kim (2014). A combination of RNN

and CNN has been tried and shown to work well for text classification with few classes (Lai

et al. 2015).

The idea to use inductive transfer learning in NLP was introduced by Dai and Le (2015)

and later improved by Howard and Ruder (2018). They use a pre-trained language model from

Merity, Keskar and Socher (2017) and then show that it can be fine-tuned with small amounts

of data to perform well on a range of different tasks. This is the technique that will be used in

this thesis since it has been shown to be very successful on other text classification tasks.

3 Data

Training and validation data The data used for experiments is publicly available on Kaggle,

an online community for data scientists, and a link to the data can be found in the references

(Kaggle 2018). It consists of forum posts made on Reddit, "a social news aggregation, web

content rating and discussion website" (Wikipedia 2019b). The purpose of such posts is usually

to start a discussion. Fig. 1 shows an example of such a post which was posted in the subreddit

Movies.

Figure 1: Self-post from the subreddit /r/Movies

The data has 1013 classes and 1000 posts per class resulting in over a million observations.

All observations are labelled with their respective class automatically when a user decides to

make their post in a certain subreddit (in Fig. 1, the user has decided to make the post in the

subreddit /r/Movies and the post is thus labelled as the class Movies). There are many more

than 1013 classes on Reddit but the creators of the dataset have tried to clean the classes such

that the overlap between them is as small as possible. The creators also mention that they

believe the highest possible accuracy on this dataset is around 96% because some texts do not

contain any useful information at all. One can refer to Fig. 2 for a small subset of the data. We

concatenate the title and selftext and use it to predict the subreddit.

Figure 2: A subsample illustrating the structure of the data

Test data The creators of the Kaggle dataset kindly provided us with their code for downloading

and cleaning the data. We used this to download some data of our own in order to test the

model's performance on new data. However, the new test data contain some classes with only

one observation. In the validation data, no class has fewer than 70 observations. This might affect

the performance. See table 1 for differences in observations per class between validation and

test data.

             Minimum   25th percentile   50th percentile   75th percentile   Maximum
Test               1                83               141               212       260
Validation        70                94               100               106       136

Table 1: Minimum values, maximum values and percentiles of observations per class for the test and validation data.

4 Method

4.1 Neural Networks

A simplified structure of the two-layer, feed-forward network can be seen in Fig. 3. It takes $x = [x_1\ x_2\ \dots\ x_p]^T$ as input vector and produces $z = [z_1\ z_2\ \dots\ z_K]^T$ as output vector, where $p$ is the number of variables and $K$ the number of classes. In this network, the input $x$ is transformed into $z$ through one layer of hidden units and activation functions.

[Diagram: input layer $x_0, \dots, x_3$; hidden layer 1 $h^{(1)}_0, \dots, h^{(1)}_4$; output layer $z_1, z_2$.]

Figure 3: A feed-forward neural network with three input variables, four hidden units and two output variables. The intercept is represented by $x_0 = h^{(1)}_0 = 1$. The arrows between the nodes are the weights and there are also activation functions between the layers but they are not visible in this simplified illustration.

In the following equations, the Rectified Linear Unit (ReLU) (Nair and Hinton 2010) activation function is used. It is defined as $\sigma(x) = \max(0, x)$ and introduces non-linearity into the network. The hidden units are then defined as

\[
h_i = \sigma\left(\beta^{(1)}_{i0} + \beta^{(1)}_{i1} x_1 + \beta^{(1)}_{i2} x_2 + \cdots + \beta^{(1)}_{ip} x_p\right), \quad i = 1, 2, \dots, M, \tag{1}
\]

which is a linear function of the input variables put through the ReLU activation function. M

is the number of hidden units in this layer. They can also be written in matrix notation

\[
h = \sigma\left(b^{(1)} + \beta^{(1)} x\right) \tag{2}
\]

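where $h = [h_1\ h_2\ \dots\ h_M]^T$ is the vector of hidden units and

\[
\beta^{(1)} =
\begin{bmatrix}
\beta^{(1)}_{11} & \beta^{(1)}_{12} & \dots & \beta^{(1)}_{1p} \\
\beta^{(1)}_{21} & \beta^{(1)}_{22} & \dots & \beta^{(1)}_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
\beta^{(1)}_{M1} & \beta^{(1)}_{M2} & \dots & \beta^{(1)}_{Mp}
\end{bmatrix},
\quad
b^{(1)} =
\begin{bmatrix}
\beta^{(1)}_{10} \\ \beta^{(1)}_{20} \\ \vdots \\ \beta^{(1)}_{M0}
\end{bmatrix}
\]

are the weight matrix and vector of intercepts used to transform the input into hidden units. The superscript indicates that this weight matrix and intercept vector correspond to the first layer in the network.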

In the two-layer example, the output is expressed as a function of the hidden units as

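\[
z_i = \beta^{(2)}_{i0} + \beta^{(2)}_{i1} h_1 + \beta^{(2)}_{i2} h_2 + \cdots + \beta^{(2)}_{iM} h_M, \quad i = 1, 2, \dots, K. \tag{3}
\]

In matrix notation

\[
z = b^{(2)} + \beta^{(2)} h \tag{4}
\]

where

\[
\beta^{(2)} =
\begin{bmatrix}
\beta^{(2)}_{11} & \beta^{(2)}_{12} & \dots & \beta^{(2)}_{1M} \\
\beta^{(2)}_{21} & \beta^{(2)}_{22} & \dots & \beta^{(2)}_{2M} \\
\vdots & \vdots & \ddots & \vdots \\
\beta^{(2)}_{K1} & \beta^{(2)}_{K2} & \dots & \beta^{(2)}_{KM}
\end{bmatrix},
\quad
b^{(2)} =
\begin{bmatrix}
\beta^{(2)}_{10} \\ \beta^{(2)}_{20} \\ \vdots \\ \beta^{(2)}_{K0}
\end{bmatrix}
\]

and $M$ and $K$ are the number of hidden units and output classes respectively.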
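As an illustration of Eqn. (2) and Eqn. (4), the forward pass of the two-layer network can be sketched in plain Python. The dimensions match Fig. 3 (p = 3, M = 4, K = 2), but the weight values are arbitrary numbers chosen for this example, not parameters from the thesis.

```python
def relu(v):
    """Apply max(0, a) to every element of the vector v."""
    return [max(0.0, a) for a in v]


def affine(weights, bias, v):
    """Compute bias + weights @ v for a weight matrix given as a list of rows."""
    return [b + sum(w * a for w, a in zip(row, v)) for row, b in zip(weights, bias)]


def forward(x, beta1, b1, beta2, b2):
    h = relu(affine(beta1, b1, x))  # hidden units, Eqn. (2)
    return affine(beta2, b2, h)     # output before the softmax, Eqn. (4)


# Arbitrary illustrative weights: a 4 x 3 matrix, then a 2 x 4 matrix.
beta1 = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5], [-1.0, -1.0, -1.0], [0.0, 2.0, 0.0]]
b1 = [0.0, 0.0, 0.0, -1.0]
beta2 = [[1.0, 1.0, 1.0, 1.0], [0.0, 2.0, 0.0, -1.0]]
b2 = [0.5, 0.0]

z = forward([1.0, 2.0, 3.0], beta1, b1, beta2, b2)
```

With these weights the first and third hidden units are negative before the activation and are therefore set to zero by the ReLU, which gives h = [0, 3, 0, 3] and z = [6.5, 3.0].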

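To extend this two-layer network into a deep neural network with $L$ layers, it can be represented in matrix notation as

\[
\begin{aligned}
h^{(1)} &= \sigma\left(b^{(1)} + \beta^{(1)} x\right) \\
h^{(2)} &= \sigma\left(b^{(2)} + \beta^{(2)} h^{(1)}\right) \\
&\ \ \vdots \\
h^{(L-1)} &= \sigma\left(b^{(L-1)} + \beta^{(L-1)} h^{(L-2)}\right) \\
z &= b^{(L)} + \beta^{(L)} h^{(L-1)}
\end{aligned}
\tag{5}
\]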

Note that the ReLU is not used when generating $z$; instead, if the network is trained for classification, $z$ is put through a softmax activation to convert the values into probabilities. The softmax is defined as

\[
\mathrm{softmax}(z) = \frac{1}{\sum_{j=1}^{K} e^{z_j}} \left[e^{z_1}\ e^{z_2}\ \dots\ e^{z_K}\right]^T \tag{6}
\]

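The softmax function preserves the ordering of the $z_i$ and its outputs sum to one. It assigns a value close to one to the largest $z_i$ and values close to zero to all others when the largest $z_i$ exceeds the rest by a wide margin; if the largest and second largest are close, their probabilities are close as well.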
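A small numerical sketch of Eqn. (6) shows this behaviour. Subtracting the maximum before exponentiating is a standard trick for numerical stability that does not change the result; the input values are made up for the illustration.

```python
import math


def softmax(z):
    """Softmax of Eqn. (6); subtracting max(z) avoids overflow in exp."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


far = softmax([6.5, 3.0])   # clear winner: first probability is about 0.97
near = softmax([3.1, 3.0])  # close values: first probability is only slightly above 0.5
```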

4.2 LSTM Recurrent Neural Network

Another type of neural network suited for sequential data is a Recurrent Neural Network

(RNN). An RNN takes an input vector and a vector of hidden states to produce a new vector of hidden states and an output. Let $x_t = [x_1\ x_2\ \dots\ x_t]^T$ be the ordered sequence of input words and let $h_t = [h_1\ h_2\ \dots\ h_t]^T$ and $h_{t-1} = [h_1\ h_2\ \dots\ h_{t-1}]^T$ be the hidden states corresponding to the inputs. Also let $\beta_{hx}$, $\beta_{hh}$ and $\beta_{oh}$ be weight matrices and let $o_t$ be the output. Then the state of the network at time $t$ is

\[
\begin{aligned}
h_t &= \sigma\left(\beta_{hx} x_t + \beta_{hh} h_{t-1} + b_h\right) \\
o_t &= \mathrm{softmax}\left(\beta_{oh} h_t + b_o\right)
\end{aligned}
\tag{7}
\]

where $\sigma(x)$ is usually the tanh or ReLU nonlinear function and $b_h$, $b_o$ are the respective bias terms.

The LSTM (briefly discussed in section 2.2) was introduced to improve the RNN’s ability

to remember more than the last parts of a document (Hochreiter and Schmidhuber 1997). The

LSTM cell takes as input the feature vector, $x_t$, the previous cell state vector (defined in Eqn.

(8)), $c_{t-1}$, and the previous hidden state vector, $h_{t-1}$. This input then flows through three

different "gates", generally referred to as the forget gate, input gate and output gate. They

consist of different activation functions deciding what to forget from previous states, what to

use as input and what to output. The LSTM cell then outputs a new hidden state, ht, and a new

cell state, ct, to be used in the next cell. The flow through the LSTM cell can be seen in Fig. 4

and is defined in equation form as

\[
\begin{aligned}
f_t &= \sigma\left(\beta_{fx} x_t + \beta_{fh} h_{t-1} + b_f\right) \\
i_t &= \sigma\left(\beta_{ix} x_t + \beta_{ih} h_{t-1} + b_i\right) \\
o_t &= \sigma\left(\beta_{ox} x_t + \beta_{oh} h_{t-1} + b_o\right) \\
\tilde{c}_t &= \tanh\left(\beta_{cx} x_t + \beta_{ch} h_{t-1} + b_c\right) \\
c_t &= f_t \times c_{t-1} + i_t \times \tilde{c}_t \\
h_t &= o_t \times \tanh(c_t)
\end{aligned}
\tag{8}
\]

where $\times$ denotes the Hadamard product, the gate function is $\sigma(x) = \frac{1}{1 + e^{-x}}$ and $f_t$, $i_t$, $o_t$ are the outputs of the forget gate, input gate and output gate respectively. The different $\beta$ matrices contain the different weights for the gates and the new candidate state. The bias/intercept terms are represented by the different $b$ vectors. The candidate state, $\tilde{c}_t$, is combined with the previous cell state to form the new cell state, $c_t$. Lastly, the new hidden state, $h_t$, is formed as a function of the output gate $o_t$ and the cell state.

[Diagram: LSTM cell with inputs $c_{t-1}$ (previous cell), $h_{t-1}$ (previous hidden) and $x_t$ (input), and outputs $c_t$ (new cell) and $h_t$ (new hidden).]

Figure 4: Visualization of the LSTM cell structure. $\sigma$ represents the logistic function, $\times$ is the Hadamard product and the $+$ is just regular addition of the incoming terms.
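The flow through the cell in Eqn. (8) can be written out as a single step function. The sketch below uses plain Python lists; the `params` dictionary layout and the all-zero toy weights are assumptions made for the illustration and do not correspond to the trained model.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def affine(w_x, w_h, b, x, h):
    """Compute w_x @ x + w_h @ h + b, one row at a time."""
    return [
        sum(wx * xi for wx, xi in zip(rx, x))
        + sum(wh * hi for wh, hi in zip(rh, h))
        + bi
        for rx, rh, bi in zip(w_x, w_h, b)
    ]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step following Eqn. (8); params maps gate name -> (w_x, w_h, b)."""
    f = [sigmoid(a) for a in affine(*params["f"], x, h_prev)]          # forget gate
    i = [sigmoid(a) for a in affine(*params["i"], x, h_prev)]          # input gate
    o = [sigmoid(a) for a in affine(*params["o"], x, h_prev)]          # output gate
    c_tilde = [math.tanh(a) for a in affine(*params["c"], x, h_prev)]  # candidate state
    # Element-wise (Hadamard) products combine the old cell state and the candidate.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, c_tilde)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


# Toy cell with two inputs and one hidden unit; all weights and biases are zero,
# so every gate equals 0.5 and the candidate state equals 0.
zero = ([[0.0, 0.0]], [[0.0]], [0.0])
params = {name: zero for name in "fioc"}
h, c = lstm_step([1.0, -1.0], [0.3], [2.0], params)
```

In this toy case the forget gate keeps half of the previous cell state, so the new cell state is 1.0 and the new hidden state is 0.5 tanh(1.0).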

The LSTM cells (shown in Fig. 4) are the core part of a language model. A language

model would first use a layer that learns the relation between different words, known as an

embedding layer. Then, some layers of LSTM-cells are applied (three in our case) and a linear

layer that takes the hidden state from the LSTM as input and propagates it forward through a

softmax activation (see Eqn. (6)) in order to predict the next word in a sequence of words. The

prediction is defined as

\[
y_t = \mathrm{softmax}\left(V h_t + b_y\right) \tag{9}
\]

where $V$ is a weight matrix (with dimensions vocabulary size $\times$ number of hidden states) and $b_y$ are the bias/intercept terms.

4.3 Training a Neural Network

When training a machine learning model one wants to find the parameter values which mini-

mize a loss function. Given the model

\[
y = X\beta + \varepsilon \tag{10}
\]

a closed form solution that minimizes the MSE loss function can be directly calculated as

\[
\hat{\beta} = (X^T X)^{-1} X^T y \tag{11}
\]

If the number of parameters is extremely large, one could instead use an algorithm called Gradient Descent. The gradient descent method, when applied to linear regression, tries to minimize an appropriate loss function, e.g. MSE. To make the partial derivatives look nicer we define a slightly modified MSE as

\[
L(\beta_0, \beta_1, \dots, \beta_p) = \frac{1}{2n} \sum_{i=1}^{n} \left(\hat{y}^{(i)} - y^{(i)}\right)^2 \tag{12}
\]

Gradient descent then takes the partial derivatives in each iteration with regard to all βj as

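\[
\frac{\partial L(\beta_0, \beta_1, \dots, \beta_p)}{\partial \beta_j} = \frac{1}{n} \sum_{i=1}^{n} \left(\hat{y}^{(i)} - y^{(i)}\right) x^{(i)}_j, \quad j = 1, 2, \dots, p. \tag{13}
\]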

Then it will update all parameters simultaneously with learning rate γ > 0 such that

\[
\beta_j^{(\mathrm{new})} = \beta_j^{(\mathrm{old})} - \gamma \frac{1}{n} \sum_{i=1}^{n} \left(\hat{y}^{(i)} - y^{(i)}\right) x^{(i)}_j, \quad j = 1, 2, \dots, p. \tag{14}
\]

This update scheme is repeated until the change in the value of the loss function between iterations

is sufficiently small. Then we can say that the loss function, which in this case is convex, has

converged to its global minimum (see Goodfellow, Bengio, and Courville 2016 for gradient

based optimization).
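The update scheme can be illustrated on a tiny linear regression with one predictor. This is a minimal sketch: the data, learning rate and tolerance are made up, and the intercept is updated in the same way as the slope by treating it as the coefficient of a constant predictor equal to 1.

```python
def gradient_descent(xs, ys, lr=0.1, tol=1e-12, max_iter=100_000):
    """Fit y = b0 + b1 * x by minimizing the modified MSE of Eqn. (12)."""
    n = len(xs)
    b0, b1 = 0.0, 0.0
    prev_loss = float("inf")
    for _ in range(max_iter):
        residuals = [(b0 + b1 * x) - y for x, y in zip(xs, ys)]  # y_hat - y
        loss = sum(r * r for r in residuals) / (2 * n)
        # Stop when the loss changes by less than `tol` between iterations.
        if abs(prev_loss - loss) < tol:
            break
        prev_loss = loss
        grad0 = sum(residuals) / n                               # intercept
        grad1 = sum(r * x for r, x in zip(residuals, xs)) / n    # Eqn. (13)
        # Simultaneous update of all parameters, Eqn. (14).
        b0, b1 = b0 - lr * grad0, b1 - lr * grad1
    return b0, b1


# Noise-free data generated from y = 1 + 2x.
b0, b1 = gradient_descent([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

Because the loss is convex, the estimates approach the true intercept 1 and slope 2 regardless of the starting point, provided the learning rate is small enough.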

Deep neural networks often have millions of parameters and sometimes billions. Calculating the gradient over all observations for every parameter update is therefore too expensive. Luckily, gradients between subsets of the data are often similar (Bottou 2018). Therefore, it is possible to split the dataset into mini-batches and then calculate the gradient on each mini-batch. The size of the mini-batches is limited by the memory of the computer’s GPU and is often somewhere between 32 and 256 (Goodfellow, Bengio, and Courville 2016). The calculation of

gradients is done with back-propagation (Rumelhart, Hinton, and Williams 1986).

Optimizing the parameter values through calculating gradients on mini-batches is known as

Stochastic Gradient Descent (SGD). The parameters are updated in a similar manner to Eqn.

(14) but n is replaced by the mini-batch size. When applying SGD with an adaptive learning

rate scheme and adaptive momentum (Hinton 1977) the optimization algorithm must be able

to handle this. The Adam optimizer (Kingma and Ba 2014) is a common choice under these

circumstances since it is fast and can handle the adaptive learning rate and momentum scheme.

Therefore, the Adam optimizer will be used in this thesis.

When training a neural network for classification, the MSE loss function, described in Eqn. (12), is replaced by another loss function known as cross-entropy loss. It is defined as

\[
L(x_i, y_i, \theta) = -\sum_{k=1}^{K} y_{ik} \log\left(p(k \mid x_i; \theta)\right) = -y_i^T \log\left(\mathrm{softmax}(z_i)\right) \tag{15}
\]

where $x_i$ are the predictors of observation $i$, $y_i$ is a one-hot encoded vector where the correct label of observation $i$ is coded as 1 and the rest are 0, and $\theta$ are the current parameters of the model. It reduces to the negative logarithm of the probability assigned to the correct

class by the softmax function. Thus, it penalizes the model for assigning high probabilities to

incorrect classes. Correct guesses, especially when assigned probabilities close to 1, will yield

a low loss. The task of optimizing the classifier can then be described in equation form as

\[
\hat{\theta} = \operatorname*{arg\,min}_{\theta} \frac{1}{n} \sum_{i=1}^{n} L(x_i, y_i, \theta) \tag{16}
\]
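Since the label vector is one-hot, the loss in Eqn. (15) for a single observation is just the negative log-probability of the correct class. The sketch below computes it directly from the network output z using the log-sum-exp form, which is mathematically the same as taking the logarithm of the softmax; the example values are made up.

```python
import math


def cross_entropy(z, correct_class):
    """Eqn. (15) for one observation: -log softmax(z)[correct_class]."""
    m = max(z)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in z))
    return log_sum_exp - z[correct_class]


confident_right = cross_entropy([8.0, 0.0, 0.0], 0)  # very small loss
confident_wrong = cross_entropy([8.0, 0.0, 0.0], 1)  # loss slightly above 8
uniform = cross_entropy([0.0, 0.0, 0.0], 2)          # log(3), no class preferred
```

A confident correct prediction gives a loss close to zero, while the same confidence placed on an incorrect class is penalized heavily.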

4.4 Optimization and Regularization

4.4.1 Dropout

A common way to prevent overfitting of a neural network is to implement dropout (Srivastava

et al. 2014). With dropout, each time the gradient is calculated each unit and its connections

will have a probability of being excluded in this particular calculation and updating of weights.

This has the effect that units in the network will not co-adapt to any greater degree, meaning

that a unit cannot rely exclusively on the input of any other unit since there is a chance that this

unit will not be present during training. At test time, all units will be included and weighted

based on their probability of inclusion during training.
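The two phases can be sketched as follows, using the formulation described above in which units are dropped at training time and scaled by their inclusion probability at test time. Many libraries instead rescale during training ("inverted dropout"); the function names and the random seed here are illustrative choices.

```python
import random


def dropout_train(activations, drop_prob, rng):
    """Training time: zero each unit independently with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else a for a in activations]


def dropout_test(activations, drop_prob):
    """Test time: keep every unit but scale it by the inclusion probability."""
    keep_prob = 1.0 - drop_prob
    return [a * keep_prob for a in activations]


rng = random.Random(0)
h = [1.0] * 10_000
train_mean = sum(dropout_train(h, 0.3, rng)) / len(h)
test_mean = sum(dropout_test(h, 0.3)) / len(h)
```

With a dropout probability of 0.3, both averages are close to 0.7, which shows that the test-time scaling keeps the expected input to the next layer the same as during training.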

This method of preventing overfitting works well in practice. It is recommended that the

dropout probability should be high if the amount of training data is small and vice versa. It

makes intuitive sense that it is easier for a model to memorize a small training dataset and thus

overfit which makes the need for regularization greater. In the implementation of ULMFiT in

this thesis, different dropout probabilities will be used for different layers in the model. The

relative size of the dropout probabilities is difficult to motivate theoretically.

4.4.2 Batch Normalization

When training a neural network with some kind of gradient descent algorithm one will face a

problem which is referred to as internal covariate shift. It is defined as a change in distribution

of the activations of a layer in the network due to changes in parameters from earlier layers.

A change in distribution during training will slow the training significantly since it increases

the risk that the optimizer gets stuck due to vanishing gradients. Batch Normalization (Ioffe

and Szegedy 2015) remedies this problem by normalizing activations while still allowing the

normalized values to take on the same value as the original ones if this would be the optimal

solution. Two layers of Batch Normalization, one before each linear layer, are used in the classifier

part of the model in this thesis.
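For a single unit, the normalization step can be sketched as below. The scale and shift parameters are called `gamma` and `beta` here, following the usual notation for Batch Normalization rather than anything defined in this thesis, and the running statistics used at test time are left out.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one unit's activations over a mini-batch, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((a - mean) ** 2 for a in batch) / n
    return [gamma * (a - mean) / math.sqrt(var + eps) + beta for a in batch]


batch = [2.0, 4.0, 6.0, 8.0]  # mean 5, variance 5
out = batch_norm(batch)
# Choosing gamma and beta to undo the normalization recovers the original values.
restored = batch_norm(batch, gamma=math.sqrt(5.0 + 1e-5), beta=5.0)
```

The second call illustrates why the normalized values can still take on the original values: the learnable scale and shift can reverse the normalization if that is what minimizes the loss.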

4.4.3 Weight decay

Weight decay is another way to reduce overfitting. It is commonly done in the form of L2

regularization (Ridge regression) which adds a penalty for big weights to the cost function of

the network. Smith (2018) shows in his paper that the weight decay should be chosen to be a

larger value for smaller learning rate values and vice versa. He also suggests trying out weight

decay values of anything between $10^{-2}$ and $10^{-6}$ depending on the dataset size and how other

regularization techniques are implemented.

4.5 ULMFiT

The Universal Language Model Fine Tuning (ULMFiT) is a type of transfer learning technique

in the Natural Language Processing domain introduced by Howard and Ruder (2018). Training

a classifier with this technique includes three different steps. First, a language model is trained

on a preferably very large corpus of documents. The more variety in the language of this corpus

the better. Training the language model on only medical documents for example would result

in a model which understands medical terms very well but it would not generalize as well to

other domains. The current state-of-the-art language model seems to be the GPT-2 presented by

OpenAI in a very recent paper (Radford et al. 2019). They have not released their pre-trained

model to the public so the AWD-LSTM (Merity, Keskar, and Socher 2017) is used instead in

our implementation of ULMFiT.

The second step is to fine tune the language model to the corpus which is specific to the

task. This is done using something called discriminative fine tuning and slanted triangular

learning rates (STLR). Discriminative fine tuning means that when updating the weights of the

model, different learning rates are used for different layers. The reasoning behind this kind of

fine-tuning is that the first layers are found to contain more general information and the last

layers contain more specific information (Yosinski et al. 2014).
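Discriminative fine tuning amounts to giving each layer its own learning rate. In the sketch below the last layer receives the highest rate and each earlier layer receives the rate of the layer above it divided by a constant factor. The factor 2.6 is the value we recall being suggested by Howard and Ruder (2018); it is an assumption here and is not stated in this thesis.

```python
def discriminative_learning_rates(top_lr, n_layers, factor=2.6):
    """Per-layer learning rates, first layer first; each earlier layer is divided by factor."""
    return [top_lr / factor ** (n_layers - 1 - layer) for layer in range(n_layers)]


# Four layer groups, e.g. an embedding layer followed by three LSTM layers.
rates = discriminative_learning_rates(0.01, 4)
```

The earlier, more general layers are thereby changed more cautiously than the later, more task-specific ones.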

STLR builds on the idea of the triangular learning rate schedule, proposed by Smith (2017).

The learning rate is triangular when it linearly increases and then decreases between a minimum

and maximum value cyclically over a certain number of training iterations, called a cycle length.

The motivation for such a learning rate structure is to more rapidly escape saddle points (where

the gradient is close to zero but far from global minimum) in the loss function and also to

speed up training (Smith 2017). The STLR, which Howard and Ruder (2018) propose, is a

slightly modified version of Smith’s triangular learning rate. The increase to its maximum value

happens in fewer iterations and the decreasing period is longer. They suggest that this works

better in practice.
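The shape of the schedule can be sketched as follows. The formula and the parameter names (`cut_frac` for the fraction of iterations spent increasing, `ratio` for how much smaller the lowest rate is than the highest) follow our reading of Howard and Ruder (2018); the default values are illustrative assumptions and not the settings used in this thesis.

```python
import math


def stlr(t, total_iters, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at iteration t (0-based) of a slanted triangular schedule."""
    cut = math.floor(total_iters * cut_frac)  # iteration where the rate peaks
    if t < cut:
        p = t / cut                                      # short linear increase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))   # long linear decrease
    return lr_max * (1 + p * (ratio - 1)) / ratio


schedule = [stlr(t, 1000) for t in range(1000)]
```

With 1000 iterations the rate rises from `lr_max / ratio` to `lr_max` during the first 100 iterations and then decays over the remaining 900. The function assumes that `total_iters * cut_frac` is at least 1, since otherwise `cut` would be zero.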

Momentum (Hinton 1977) is another way to speed up training and quickly escape saddle

points. Smith (2018) shows that the learning rate and momentum go hand in hand: if you

change one then you must change the other if you want optimal performance. He introduces

cyclical momentum which Howard and Ruder then use in combination with STLR in order to

achieve good performance.

The third step, after the language model is fine tuned to a specific corpus, is to add the

classifier on top of the language model. It takes a concatenated pooling of the last hidden states

from the language model as input. The concatenated pooling, hc, is defined as:

\[
h_c = [h_T, \mathrm{maxpool}(H), \mathrm{meanpool}(H)] \tag{17}
\]

where $[\cdot]$ is a concatenation and $H = [h_1, h_2, \dots, h_T]$. The maxpool-operation takes the largest value of each feature across the hidden states in $H$ (the most important features) and the meanpool-operation takes the average of each feature across the hidden states in $H$. This input is then fed into two linear layers which use dropout and batch normalization described in section 4.4. The first of these layers uses the ReLU activation function described in section 4.1 and the second layer is propagated forward into the softmax function (Eqn. (6)) in order to assign the class probabilities.
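The concatenated pooling of Eqn. (17) can be sketched with plain lists, where each hidden state is a vector and the pooling is done per feature across the time steps. The toy hidden states below are made up for the illustration.

```python
def concat_pooling(hidden_states):
    """Build h_c = [h_T, maxpool(H), meanpool(H)] from a list of T hidden-state vectors."""
    last = hidden_states[-1]
    per_unit = list(zip(*hidden_states))  # one tuple of T values per hidden unit
    max_pool = [max(values) for values in per_unit]
    mean_pool = [sum(values) / len(values) for values in per_unit]
    return list(last) + max_pool + mean_pool


# T = 3 time steps, two hidden units per state.
H = [[1.0, -2.0], [3.0, 0.0], [2.0, 4.0]]
h_c = concat_pooling(H)
```

The result is three times as long as a single hidden state, so hidden states with 400 units would give a pooled vector with 1200 values.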

4.6 Evaluation metric

The metric used to evaluate performance of the classifier is Accuracy@K. If K = 3, it means

that the classifier gets three guesses at each document. The guesses are the three classes

assigned the highest probabilities by the softmax-function. Accuracy@K is then defined as

\[
\text{Accuracy@}K = \frac{\text{Number of correct guesses}}{\text{Number of documents}}.
\]

Since only one guess can be correct per document, the metric is bounded between 0 and 1. In our experiments we will use $K = 1$, 3 and 5.
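The metric can be computed from the predicted class probabilities as in the sketch below. The probabilities and labels are toy values with three classes; the function name is a hypothetical choice.

```python
def accuracy_at_k(probabilities, labels, k):
    """Share of documents whose true label is among the k most probable classes."""
    hits = 0
    for probs, label in zip(probabilities, labels):
        # Class indices sorted by descending probability, truncated to k guesses.
        top_k = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]]
labels = [0, 2, 1]
acc1 = accuracy_at_k(probs, labels, 1)
acc2 = accuracy_at_k(probs, labels, 2)
```

Only the first document is classified correctly with one guess, while all three true labels are found within two guesses.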

5 Experiments

5.1 Benchmark

First we trained a Naive Bayes classifier to use as a reference point for our neural network model

since it can be trained relatively quickly. Unigrams and bigrams with TF-IDF representation

were used as features. We removed words which appeared in more than half of the documents.

Using chi-squared feature selection (Manning, Raghavan, and Schütze 2008), we only included

the top 60000 features. The performance of this benchmark is shown in table 2.

Model         Accuracy@1   Accuracy@3   Accuracy@5
Naive Bayes       73.63%       85.65%       88.97%

Table 2: Naive Bayes performance given one, three and five guesses.

5.2 ULMFiT implementation

Language model The first thing we did when creating the language model was to build a vocabulary. In order to do so, the text is pre-processed where, for example, words like "don’t" are divided into two tokens: "do" and "n’t". Special tokens indicating important things happening in the text were used. For example, a token indicating that the following word is all upper case letters was used since the semantic meaning of, for instance, STOP and stop might be very different and thus carry a lot of information. There is also a token indicating where a new text starts, a token indicating a capitalized first letter and a token for words that are not in the vocabulary. In these experiments a vocabulary size of the 60000 tokens most common in the dataset has been used. The tokenized data are then quantified: the most common token gets the value 0, the second most common gets 1, and so on. One could limit the size of the vocabulary to a lower number to

save computation time but since the data contains so many different classes we believe that the

language can be quite diverse and that a large vocabulary is needed to successfully distinguish

between classes.
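The vocabulary construction and the mapping from tokens to integers can be sketched as follows. The documents are assumed to be tokenized already, and the handling of unknown words (one extra index after the vocabulary) is a simplifying assumption rather than a description of the actual implementation.

```python
from collections import Counter


def build_vocab(tokenized_docs, max_size):
    """Map the max_size most common tokens to 0, 1, 2, ... by descending frequency."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    most_common = [tok for tok, _ in counts.most_common(max_size)]
    return {tok: idx for idx, tok in enumerate(most_common)}


def numericalize(doc, vocab, unk_id):
    """Replace each token by its index, or by unk_id if it is not in the vocabulary."""
    return [vocab.get(tok, unk_id) for tok in doc]


docs = [["do", "n't", "stop"], ["do", "stop", "do"]]
vocab = build_vocab(docs, max_size=2)  # "do" -> 0, "stop" -> 1
unk_id = len(vocab)                    # index reserved for unknown tokens
ids = numericalize(["do", "n't", "go", "stop"], vocab, unk_id)
```

Here "n't" falls outside the two-token vocabulary and "go" never occurred in the data, so both are mapped to the index for unknown tokens.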

The structure of the neural network used in the experiments is a version of the ULMFiT-

model, tuned in different ways to be more suited for the highly multiclass classification task.

First, a language model is trained with the purpose that the model should get familiar with the

English language in general and the specific language of the data in particular. The language

model (seen in a down-scaled version in Fig. 5) consists of an embedding layer with 400 units,

three LSTM layers with 1150 hidden units in each layer, where the output of the last layer is

400 units, the same as the number of embedding units.

[Diagram: Input (60000 units) -> Embedding (400 units) -> LSTM (1150 units) -> LSTM (1150 units) -> LSTM (1150 units) -> Output (400 units).]

Figure 5: Language model structure. It takes the quantified vocabulary as input which is then fed into an embedding layer, used to learn the relations between words. This is then propagated forward through three layers of LSTM cells which then output hidden states of the same size as the embedding layer.

This is quite a large network which takes a long time to train. A model with pre-trained

weights is used to initialize the weights in our training as suggested by Howard and Ruder

(2018). Starting with these weights, the language model is trained on the Reddit data in order

to learn the language used in this realm. For the purpose of tracking the model’s performance

across training, 10% of the data is kept for validation. With more data, the language model

can learn more which is why the validation set is somewhat small. It is also worth noting that

the language model does not get to know the class labels associated with each observation which

means that a validation set for the classifier does not need to be set aside in this part of training.

Dropout values seen in table 3 are used during training. The values are set low since the

training data is large.

                      Input layer   Embedding layer   Hidden layers   Weights
Dropout probability         0.072             0.012           0.024     0.060

Table 3: Dropout probabilities for different layers and the weight matrices in the language model.

A short test is run for a few iterations where the learning rate versus loss is plotted, seen in

Fig. 6. A rule of thumb is to use a learning rate somewhere in the steepest descent in the plot.

We want to find a point in the plot where the learning rate is high while loss is low. This leads

us to initialize training with a learning rate of 0.04.

Figure 6: A smoothed plot of learning rate versus loss for the language model run on a few

minibatches until the loss started increasing.

We also plotted the learning rate versus loss (see Fig. 7) after every second epoch during

training in order to make reasonable adjustments to learning rate during training.

(a) Plot after second epoch (b) Plot after fourth epoch

Figure 7: Smoothed plots of learning rate versus loss constructed after second and fourth epoch.

Epoch   Train loss   Valid loss   Accuracy       Time      LR
    1         4.83         4.78      0.231    9:36:57    0.04
    2         4.23         4.14      0.277    9:37:02    0.04
    3         3.78         3.74      0.321    9:26:58   3e-04
    4         3.74         3.69      0.328    9:27:29   3e-04
    5         3.71         3.67      0.329   10:10:43   1e-04
    6         3.7          3.66      0.331   10:11:01   1e-04

Table 4: Training schedule for the first language model. Cyclical momentum was also used and set to vary between 0.8 and 0.7 for all epochs. The Adam optimizer was used with $\beta_1 = 0.9$ and $\beta_2 = 0.99$. Weight decay was set to 0.01.

Table 4 details the training progression for the language model fine-tuned on the Reddit data. Total training took almost 60 hours of GPU time (on an Nvidia Tesla K80 GPU), and after the sixth epoch the model predicts the next word in the validation part of the data with 33.1% accuracy. In hindsight, an inspection of Fig. 7a and 7b leads us to believe that the learning rates of epochs three, four, five and six could have been set a little more aggressively, possibly leading to faster convergence and improved performance.
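
The total GPU time reported above can be checked by summing the per-epoch times in Table 4. A small sketch (the helper name is hypothetical):

```python
from datetime import timedelta


def parse_hms(text):
    """Convert an 'H:MM:SS' string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


# Per-epoch wall-clock times copied from Table 4.
epoch_times = ["9:36:57", "9:37:02", "9:26:58", "9:27:29", "10:10:43", "10:11:01"]

total = sum((parse_hms(t) for t in epoch_times), timedelta())
total_hours = total.total_seconds() / 3600
print(f"{total_hours:.1f} hours")  # prints "58.5 hours"
```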

Classifier. After the language model had been trained, we trained the classifier. It sits on top of the language model, as explained in section 4.5. Fig. 8 shows a simplified structure of the classifier. The number of units in the output layer corresponds to the number of classes in the data.


[Figure 8 diagram: LM output -> concatenate pooling -> linear (1200 units) -> ReLU activation -> linear (50 units) -> output (1013 units)]

Figure 8: Classifier structure. The leftmost layer is the output of the language model, which is fed into a pooling layer. The pooled output then goes into a large linear layer, whose output is propagated forward through a ReLU activation. The activations are then fed into another, smaller linear layer, and the units of this layer are used to calculate the output layer.
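
The pooling step can be illustrated with a small sketch. It assumes the concat pooling described by Howard and Ruder (2018), in which the last hidden state is concatenated with the element-wise maximum and mean of all hidden states, so that the pooled vector is three times as long as a hidden state. The function name and the toy dimensions are hypothetical.

```python
def concat_pool(hidden_states):
    """Concatenate the last hidden state with max- and mean-pooled states.

    hidden_states: list of T vectors (one per time step), each of length d.
    Returns a single vector of length 3 * d.
    """
    last = list(hidden_states[-1])
    # zip(*hidden_states) iterates over feature dimensions across time steps.
    max_pooled = [max(values) for values in zip(*hidden_states)]
    mean_pooled = [sum(values) / len(values) for values in zip(*hidden_states)]
    return last + max_pooled + mean_pooled


# Toy example: three time steps, hidden size two.
states = [[1.0, -2.0], [3.0, 0.0], [2.0, 4.0]]
pooled = concat_pool(states)
print(pooled)  # [2.0, 4.0, 3.0, 4.0, 2.0, 0.6666666666666666]
```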

Batch normalization is used between layers, and the dropout probability is set to 0.048 for the first layer and 0.1 for the second layer of the classifier. 15% of the data is used for validation and a batch size of 32 is used during training (the largest that fit in memory). Training was initialized with a learning rate of 0.04.

The classifier training is detailed in Table 5. The one cycle policy (Smith 2018) and discriminative fine-tuning (Howard and Ruder 2018) are used to train the model. The values in the last eight rows of the LR column in Table 5 refer to the minimum and maximum learning rate used during the cycle.
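
The shape of such a cycle can be sketched as follows. This is an illustration under assumptions, not the schedule implementation used in this thesis: it uses linear warm-up and linear decay, and the warm-up fraction (`pct_start`) and the ratio between the maximum and the starting learning rate (`div_factor`) are hypothetical choices. Momentum moves in the opposite direction to the learning rate, here between 0.8 and 0.7.

```python
def one_cycle(step, total_steps, lr_max, div_factor=25.0, pct_start=0.3,
              mom_high=0.8, mom_low=0.7):
    """Return (learning_rate, momentum) for one step of a single cycle."""
    lr_min = lr_max / div_factor
    warmup_steps = max(1, int(pct_start * total_steps))
    if step < warmup_steps:
        frac = step / warmup_steps  # rises from 0 towards 1 during warm-up
    else:
        # falls from 1 to 0 during the decay phase
        decay_steps = max(1, total_steps - 1 - warmup_steps)
        frac = 1.0 - (step - warmup_steps) / decay_steps
    lr = lr_min + frac * (lr_max - lr_min)
    momentum = mom_high - frac * (mom_high - mom_low)
    return lr, momentum


schedule = [one_cycle(s, total_steps=10, lr_max=1e-3) for s in range(10)]
for step, (lr, mom) in enumerate(schedule):
    print(f"step {step}: lr={lr:.2e}, momentum={mom:.3f}")
```

The learning rate peaks exactly where the momentum is lowest, which is the defining property of cyclical momentum in this policy.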

Gradual unfreezing was also used during training of the classifier in order to mitigate the problem of catastrophic forgetting (the model forgetting the general language knowledge contained in the language model). The last column of Table 5 specifies which layers were trained in each epoch.

Epoch  Train loss  Valid loss  Accuracy  Time     LR                  Layers
1      3.70        3.23        0.323     4:22:30  0.04                All
2      2.69        2.21        0.538     4:03:48  0.04                All
3      2.92        2.43        0.470     5:13:17  2e-2/2.6^4 to 2e-2  Last two
4      1.84        1.50        0.679     5:07:30  2e-2/2.6^4 to 2e-2  Last two
5      1.64        1.28        0.724     6:58:24  3e-3/2.6^4 to 3e-3  Last three
6      1.22        0.98        0.790     7:36:26  3e-3/2.6^4 to 3e-3  Last three
7      1.23        1.05        0.773     7:17:33  1e-3/2.6^4 to 1e-3  Last three
8      1.10        0.93        0.800     7:12:42  1e-3/2.6^4 to 1e-3  Last three
9      1.02        0.92        0.801     7:28:19  1e-3/2.6^4 to 1e-3  Last three
10     1.00        0.91        0.805     8:45:06  1e-4/2.6^4 to 1e-4  All

Table 5: Training schedule for the classifier. Cyclical momentum was also used and set to

vary between 0.8 and 0.7 for all epochs. The Adam optimizer was used with β1 = 0.9 and

β2 = 0.99. Weight decay was set to 0.01.
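
Discriminative fine-tuning gives each layer group its own learning rate. The sketch below assumes the rule of Howard and Ruder (2018) that each lower group uses the rate of the group above divided by 2.6, and it assumes five layer groups, so that the lowest rate is the highest divided by 2.6^4. The function name is hypothetical.

```python
def discriminative_lrs(lr_top, num_groups, factor=2.6):
    """Return per-group learning rates, ordered from lowest to highest layer."""
    return [lr_top / factor ** (num_groups - 1 - i) for i in range(num_groups)]


rates = discriminative_lrs(lr_top=2e-2, num_groups=5)
for group, rate in enumerate(rates):
    print(f"layer group {group}: lr = {rate:.2e}")
# The lowest group is trained with 2e-2 / 2.6**4, roughly 4.4e-4.
```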

The learning rate becomes smaller as training progresses and is chosen by constructing plots like the one in Fig. 6. Training progressed quite slowly in the beginning, which could indicate that the learning rate should have been initialized at a larger value. Training started by fine-tuning the whole model for two epochs, followed by gradual unfreezing, where we fine-tuned just the last two and then the last three layers for a total of seven epochs. In the last epoch, the whole model was fine-tuned. Training time was 64 hours. After five epochs, the accuracy was almost on par with the Naive Bayes benchmark (72.4% versus 73.6%), and after ten epochs the classifier could correctly classify a text with 80.5% accuracy, given one guess.
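
The unfreezing schedule can be written down as a small sketch. Only the number of trainable layers per epoch is taken from Table 5; the division of the model into five layer groups, the group names and the helper function are assumptions made for illustration.

```python
# Hypothetical layer groups, ordered from the input side to the output side.
LAYER_GROUPS = ["embedding", "lstm_1", "lstm_2", "lstm_3", "classifier_head"]

# Number of trainable groups per epoch, following the "Layers" column of
# Table 5 (None means that the whole model is trained).
SCHEDULE = {1: None, 2: None, 3: 2, 4: 2, 5: 3, 6: 3, 7: 3, 8: 3, 9: 3, 10: None}


def trainable_groups(epoch):
    """Return the names of the layer groups that are updated in an epoch."""
    last_n = SCHEDULE[epoch]
    if last_n is None:
        return list(LAYER_GROUPS)
    return LAYER_GROUPS[-last_n:]


for epoch in sorted(SCHEDULE):
    active = trainable_groups(epoch)
    frozen = [g for g in LAYER_GROUPS if g not in active]
    print(epoch, "trainable:", active, "frozen:", frozen)
```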

6 Results

Table 6 shows the Accuracy@K metric for our classifier compared to the benchmark classifier on the validation and test data. We see that our neural network based model outperforms the benchmark, Naive Bayes, by a wide margin on all metrics and both datasets. The largest difference is observed when the model is given only one guess.
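
Accuracy@K counts a prediction as correct if the true class is among the K classes with the highest predicted scores. A minimal sketch of the computation on toy data (the function name and the scores are hypothetical):

```python
def accuracy_at_k(scores, true_labels, k):
    """Fraction of observations whose true class is among the k top-scored classes.

    scores: one list of class scores per observation.
    true_labels: the index of the true class for each observation.
    """
    hits = 0
    for row, label in zip(scores, true_labels):
        # Indices of the k classes with the highest scores.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return hits / len(true_labels)


# Toy example with three observations and four classes.
scores = [
    [0.1, 0.6, 0.2, 0.1],  # true class 1: correct with one guess
    [0.4, 0.3, 0.2, 0.1],  # true class 1: correct with two guesses
    [0.1, 0.2, 0.3, 0.4],  # true class 0: needs all four guesses
]
labels = [1, 1, 0]
print(accuracy_at_k(scores, labels, k=1))  # 0.3333333333333333
print(accuracy_at_k(scores, labels, k=3))  # 0.6666666666666666
```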

Fig. 9 shows the distribution of per-class error rates for the validation data and test data respectively. As seen in both figures, very few classes have an error rate higher than 40%.


Model        Data        Accuracy@1  Accuracy@3  Accuracy@5
Naive Bayes  Validation  73.63%      85.65%      88.97%
ULMFiT       Validation  80.49%      89.78%      92.2%
ULMFiT       Test        78.57%      88.64%      91.32%

Table 6: Table of performance measured as Accuracy@K of the benchmark model and our

classifier on both validation and test data.

(a) Validation (b) Test

Figure 9: Plot of error distribution between classes for both validation and test data

Table 7 shows the classes with the lowest and highest error rates in the validation data. The table also shows the number of observations from each class and its error rate.

Table 8 is the same as Table 7 but for the test data. As mentioned, this data is more unbalanced than the validation data, with some classes having fewer than ten observations.

It can be seen in Tables 7 and 8 that most of the classes that the model fails to predict accurately seem very general. For example, topics such as canada, unitedkingdom and networking seem very broad and could contain almost any type of discussion. Only around 20 classes have an error rate higher than 50% on both validation and test data, and there are many classes with almost no errors.
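
Per-class error rates of this kind can be computed from the true and predicted labels. A minimal sketch on toy labels (the function name is hypothetical and the class names are only illustrative):

```python
from collections import Counter


def per_class_error_rates(true_labels, predicted_labels):
    """Return {class: (observations, error_rate)} for every true class."""
    totals = Counter(true_labels)
    errors = Counter(t for t, p in zip(true_labels, predicted_labels) if t != p)
    return {c: (n, errors[c] / n) for c, n in totals.items()}


y_true = ["canada", "canada", "mead", "mead", "mead", "hacking"]
y_pred = ["unitedkingdom", "canada", "mead", "mead", "mead", "techsupport"]

rates = per_class_error_rates(y_true, y_pred)
# Sort from lowest to highest error rate.
for name, (n, rate) in sorted(rates.items(), key=lambda item: item[1][1]):
    print(f"{name:10s} {n:3d} {rate:.3f}")
```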

Fig. 10 shows some examples which the model failed to classify correctly (with one guess). The first example seems like a decent guess. The second example could probably fit into either the predicted or the true class. In the third example, the text seems very hard to categorize without context. However, if one knows that the label Invisalign refers to a kind of dental treatment similar to braces, it might be possible to make that prediction. The predicted label, wls, refers to discussions about "weight loss surgery", which seems like a good guess given the available information. In the last example, the model predicts russian while the true label is russia. Perhaps one of these classes should not have been in the data to begin with, since there is very likely a large overlap between them.

Lowest errors                                 Highest errors
Class               Observations  Error rate  Class               Observations  Error rate
KeybaseProofs       101           0           Construction        91            0.495
incest              110           0           seduction           95            0.495
ACL                 105           0.01        southafrica         107           0.495
Stormlight_Archive  96            0.01        cscareerquestions   98            0.5
SkincareAddiction   91            0.011       Psychic             88            0.5
Kava                84            0.012       privacy             96            0.5
snapchat            95            0.021       bladeandsoul        97            0.505
ShingekiNoKyojin    117           0.026       linuxquestions      105           0.514
Mattress            116           0.026       AvPD                87            0.517
mead                114           0.026       socialism           110           0.518
reloading           108           0.028       dndnext             113           0.522
swoleacceptance     104           0.029       asktrp              105           0.524
WritingPrompts      101           0.03        canada              99            0.525
asmr                100           0.03        personalfinance     91            0.538
vikingstv           99            0.03        networking          103           0.544
sharditkeepit       97            0.031       Anarchism           104           0.567
Snus                93            0.032       hacking             108           0.574
Chromecast          92            0.033       actuallesbians      89            0.618
Geosim              112           0.036       techsupport         102           0.647
hookah              83            0.036       unitedkingdom       87            0.655

Table 7: Table of classes with lowest and highest error rates, number of observations in each class and error rate of each class for the validation data.


Lowest errors                                 Highest errors
Class               Observations  Error rate  Class               Observations  Error rate
garlicoin           5             0           PoloniexForum       6             0.5
FidgetSpinners      4             0           hitmobile           2             0.5
netneutrality       12            0           seduction           234           0.504
vergecurrency       4             0           datascience         196           0.51
lightsabers         54            0           hacking             78            0.513
KeybaseProofs       217           0.005       Lineage2Revolution  31            0.516
DestructiveReaders  90            0.011       DFO                 207           0.517
SkincareAddiction   209           0.019       privacy             206           0.519
snapchat            187           0.027       networking          234           0.526
TOR                 74            0.027       FORTnITE            236           0.542
Porsche             72            0.028       personalfinance     203           0.552
incest              143           0.028       Psychic             215           0.558
malehairadvice      208           0.029       canada              224           0.562
emojipasta          133           0.03        schizophrenia       221           0.57
OneNote             59            0.034       actuallesbians      223           0.583
WritingPrompts      176           0.034       techsupport         213           0.601
puppy101            228           0.035       asktrp              206           0.607
tarantulas          83            0.036       StateOfDecay        64            0.609
sharditkeepit       52            0.038       bladeandsoul        216           0.62
SampleSize          225           0.04        unitedkingdom       206           0.743

Table 8: Table of classes with lowest and highest error rates, number of observations in each

class and error rate of each class for the test data.

Figure 10: Some examples of texts that the model could not correctly predict. The text is

displayed in tokenized form along with predicted and true labels.


7 Conclusion

This thesis aimed to investigate whether a neural network could perform well on a multi-class classification task. A publicly available dataset was used, consisting of 1013 classes with 1000 observations per class. We found that an LSTM-based model using a transfer learning technique known as ULMFiT could successfully be trained to perform well on this data. This model beat our benchmark model by a wide margin and predicted the correct class with 80.5% accuracy (given one guess), 89.8% accuracy (given three guesses) and 92.2% accuracy (given five guesses).

The model's ability to correctly classify a text seemed to depend on how broad or narrow a class was. Classes which could contain almost any type of discussion were harder to classify, whereas narrower classes were easier. There was also some overlap between some of the classes, which contributed to the model's inability to classify some texts (especially when given only one guess). The creators of the dataset believed that the maximum possible accuracy on this data was approximately 96% because some texts seemed to contain no useful information at all. In light of this, 92.2% Accuracy@5 must be considered a good result.

The accuracy could likely be increased even more with better choices of learning rate in

each epoch. Increased training time would also be a way of boosting performance, although

expensive. Another expensive way to boost performance would be to train several different

models and ensemble their predictions.
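
One simple way to ensemble predictions is to average the class probabilities of the individual models and pick the class with the highest average. The sketch below illustrates this on made-up probabilities; the function name and the two models are hypothetical, and other ensembling schemes are possible.

```python
def ensemble_predict(prob_sets):
    """Average class probabilities from several models and pick the best class.

    prob_sets: one list per model, each holding a probability vector per text.
    Returns the predicted class index for each text.
    """
    num_models = len(prob_sets)
    predictions = []
    for per_text in zip(*prob_sets):  # probabilities of all models for one text
        averaged = [sum(p) / num_models for p in zip(*per_text)]
        predictions.append(max(range(len(averaged)), key=averaged.__getitem__))
    return predictions


# Two hypothetical models, two texts, three classes.
model_a = [[0.5, 0.4, 0.1], [0.2, 0.5, 0.3]]
model_b = [[0.2, 0.7, 0.1], [0.1, 0.3, 0.6]]
print(ensemble_predict([model_a, model_b]))  # [1, 2]
```

In this toy example the first model alone would predict classes 0 and 1, so the averaged prediction differs from that of a single model.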

Some interesting topics for further research would be, for example, to investigate whether CNN-based language models and classifiers could achieve good performance in a similar setting. Another thing to investigate would be how the performance of the language model affects the performance of the classifier. Would using GPT-2 instead of AWD-LSTM as the pre-trained model result in a large boost in performance? Roughly one thousand observations per class were used in this thesis, and it would be of interest to find out how much training data is required to achieve good results.


References

BioASQ (2013). The Challenge. [Accessed 2019-03-26]. URL: http://bioasq.org/participate/challenges_year_1.

Bottou, L. (2018). “Online Learning and Stochastic Approximations (revised 5/2018)”. Online Learning in Neural Networks, 1–35.

Conneau, A. et al. (2017). “Very Deep Convolutional Networks for Text Classification”. arXiv: 1606.01781.

Dai, A. M. and Q. V. Le (2015). “Semi-supervised Sequence Learning”, 1–10. arXiv: 1511.01432.

Gargiulo, F., S. Silvestri, and M. Ciampi (2018). “Deep Convolution Neural Network for Extreme Multi-label Text Classification”. HEALTHINF, 641–650.

Goodfellow, I., Y. Bengio, and A. Courville (2016). Deep Learning. http://www.deeplearningbook.org. MIT Press.

Hinton, G. E. (1977). “Relaxation and its role in vision”. Ph.D. Thesis, University of Edinburgh.

Hochreiter, S. and J. Schmidhuber (1997). “Long Short-Term Memory”. Neural Computation 9.8, 1735–1780.

Howard, J. and S. Ruder (2018). “Universal Language Model Fine-tuning for Text Classification”. arXiv: 1801.06146.

Ioffe, S. and C. Szegedy (2015). “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”. arXiv: 1502.03167.

Joulin, A. et al. (2016). “Bag of Tricks for Efficient Text Classification”. arXiv: 1607.01759.

Kaggle (2018). The reddit self-post classification task. [Accessed 2019-03-27]. URL: https://www.kaggle.com/mswarbrickjones/reddit-selfposts.

Kim, Y. (2014). “Convolutional Neural Networks for Sentence Classification”. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1746–1751.

Kingma, D. P. and J. Ba (2014). “Adam: A Method for Stochastic Optimization”. arXiv: 1412.6980.

Lai, S. et al. (2015). “Recurrent Convolutional Neural Networks for Text Classification”. AAAI’15, 2267–2273.


Liu, J. et al. (2017). “Deep Learning for Extreme Multi-label Text Classification”. Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 115–124.

Manning, C. D., P. Raghavan, and H. Schütze (2008). Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press. ISBN: 0521865719, 9780521865715.

Merity, S., N. S. Keskar, and R. Socher (2017). “Regularizing and optimizing LSTM language models”. arXiv: 1708.02182v1.

Mikolov, T. et al. (2013). “Efficient Estimation of Word Representations in Vector Space”. arXiv: 1301.3781.

Nair, V. and G. E. Hinton (2010). “Rectified Linear Units Improve Restricted Boltzmann Machines”. ICML’10 Proceedings of the 27th International Conference on Machine Learning, 807–814.

Östling, R. (2013). “Automated Essay Scoring for Swedish”. Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, 42–47.

Radford, A. et al. (2019). “Language Models are Unsupervised Multitask Learners”. URL: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.

Rumelhart, D. E., G. E. Hinton, and R. J. Williams (1986). “Learning representations by back-propagating errors”. Nature 323, 533–536.

Smith, L. N. (2017). “Cyclical learning rates for training neural networks”. Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV 2017), 464–472. arXiv: 1506.01186.

Smith, L. N. (2018). “A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay”, 1–21. arXiv: 1803.09820.

Srivastava, N. et al. (2014). “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”. Journal of Machine Learning Research 15, 1929–1958.

Sutskever, I., O. Vinyals, and Q. V. Le (2014). “Sequence to Sequence Learning with Neural Networks”. arXiv: 1409.3215.

Wikipedia (2019a). Medical Subject Headings. [Accessed 2019-03-26]. URL: https://en.wikipedia.org/wiki/Medical_Subject_Headings.

Wikipedia (2019b). Reddit. [Accessed 2019-03-27]. URL: https://en.wikipedia.org/wiki/Reddit.


Yosinski, J. et al. (2014). “How transferable are features in deep neural networks?” arXiv: 1411.1792.

Zhang, X. and Y. Lecun (2016). “Text Understanding from Scratch”. arXiv: 1502.01710v5.

Zhang, X., J. Zhao, and Y. Lecun (2016). “Character-level Convolutional Networks for Text Classification”. arXiv: 1509.01626v3.
