Information Sciences 415–416 (2017) 100–109
Drug–drug interaction extraction from biomedical literature
using support vector machine and long short term memory
networks
Degen Huang a, Zhenchao Jiang b, Li Zou c,∗, Lishuang Li a,∗
a School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning, China
b Sangfor Technologies, Shenzhen, Guangdong, China
c School of Computer Science and Information Technology, Liaoning Normal University, Dalian, Liaoning, China
Article info
Article history:
Received 7 January 2017
Revised 28 April 2017
Accepted 17 June 2017
Available online 19 June 2017
Keywords:
Drug-Drug interaction extraction
Long short term memory
Two-stage approach
Abstract
Drug–drug interactions (DDIs) can cause adverse effects when patients take two or more drugs and thereby increase health care costs, so the extraction of DDIs is an important research area in patient safety. To improve the performance of drug–drug interaction extraction (DDIE), we present a novel two-stage method in this paper. It first identifies the positive instances using a feature based binary classifier, and then a Long Short-Term Memory (LSTM) based classifier classifies the positive instances into specific categories. The experimental results show that the two-stage method has many advantages over one-stage ones, and among the factors related to LSTM, we find that the two-layer bidirectional LSTM with word, distance and part-of-speech embeddings obtains the highest F-score of 69.0%, which is state-of-the-art.
© 2017 Elsevier Inc. All rights reserved.
1. Introduction
Drug–drug interaction extraction (DDIE) is an important task in Biomedical Natural Language Processing (BioNLP) do-
main. The DDIExtraction 2013 challenge [1] is the second edition of the DDIExtraction Shared Task series, a community-wide
effort to promote the implementation and comparative assessment of natural language processing (NLP) techniques in the pharmacovigilance domain. In the challenge the DDIs need to be classified into four predefined DDI types (“advise”, “effect”, “mechanism” and “int”). “Advise” is assigned when a recommendation or advice regarding concomitant use
of two drugs involved is described; “Effect” is assigned when the effect of the DDI is described; “Mechanism” is assigned
when a DDI is described by its pharmacokinetic mechanism; “Int” is assigned when a DDI appears in the text without any
additional information provided. The corpus was manually annotated and consists of 792 texts selected from the DrugBank database and another 233 Medline abstracts. This fine-grained corpus has been annotated with a total of 18,502 pharmacological substances and 5028 DDIs. The full DDIExtraction 2013 corpus consists of 1017 documents (784 DrugBank documents and 233 MedLine documents, with 27,792 instances for training and 5716 instances for testing) and was manually annotated with a total of 18,491 drug names and 5021 positive DDIs (4673 in DrugBank and 327 in MedLine).
∗ Corresponding authors.
E-mail addresses: [email protected] (D. Huang), [email protected] (Z. Jiang), [email protected] (L. Zou), [email protected] (L. Li).
http://dx.doi.org/10.1016/j.ins.2017.06.021
0020-0255/© 2017 Elsevier Inc. All rights reserved.
Fig. 1. The two-stage drug–drug interaction extraction model.
Support Vector Machine (SVM) based methods have achieved promising results over the past five years. For example, Chowdhury et al. [2] applied a two-stage hybrid-kernel relation extraction approach, taking advantage of different SVM kernels, and Thomas et al. [3] also combined several kernel methods. These two methods were the top two ranked teams in DDIExtraction 2013, and other teams such as [4,5] also used SVM as the classifier. Afterwards, Kim et al. [6] used a feature based method, also a two-stage system, achieving an F-score of 0.670. It is clear that SVM is effective on this task; however, one-stage methods generally cannot perform better than two-stage ones [7]. On the other hand, one limitation of SVM is its inability to deal with text of arbitrary length.
In recent years, the machine learning community has witnessed significant advances of deep learning, and deep learning
based methods have been applied to related tasks. For example, Zeng et al. [8] exploited a convolutional deep neural network to extract lexical and sentence-level features, which were then fed into a softmax classifier to predict the relationship between two marked nouns. Sahu et al. [9] proposed a Joint AB-LSTM model that utilized word and position embeddings as latent features on DDIExtraction 2013 and achieved competitive results against traditional feature based methods.
Compared to Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs) can deal with input of arbitrary
length, which is a good property for text data. Previous successes in real world applications with RNNs were limited due
to practical problems when long time lags between relevant events make learning difficult, i.e., gradient vanishing [10] .
The reason for this failure is the rapid decay of back-propagated error. The LSTM algorithm, an improved version of RNN,
overcomes this by enforcing constant error flow. Using gradient descent, LSTM explicitly learns when to store informa-
tion and when to access it. For many tasks, LSTMs are better than the standard RNNs. Almost all results based on RNNs
are achieved with LSTMs, and many studies have attempted to solve text mining problems using LSTM, for example, Lim-
sopatham et al. [11] investigated an approach for named entity recognition by enabling bidirectional LSTM to automatically
learn orthographic features, and Sutskever et al. [12] used a multilayered LSTM to map the input sequence to a vector of fixed dimensionality and then another deep LSTM to decode the target sequence from that vector (sequence-to-sequence learning). Given the effectiveness of LSTM on other related text mining tasks, LSTM is expected to help improve DDIE performance.
Considering that on the one hand, LSTM is a suitable machine learning model for text mining that can deal with input
of arbitrary length and have enough layers to overcome gradient vanishing, and on the other hand, the distribution of
DDIExtraction is highly skewed (the numbers of instances of “advise”, “effect”, “mechanism”, “int” and “negative” are 1047,
2047, 1621, 285 and 28,508, respectively), in this work, to improve the performance of DDIE, we present a novel two-stage
method. In the first stage, we identify the four kinds of positive DDI instances from the negative ones using feature based
binary classifier, and in the second stage, an LSTM based classifier is used to classify the positive instances into each of the
four drug interaction types. We study many factors that possibly influence the LSTM model, such as part-of-speech (POS) tag
embedding, distance information, dropout, etc. By conducting experiments, we show that word embedding, part-of-speech
(POS) tag embedding, distance information and multi-layer bidirectional LSTM help to improve the performance of DDIE.
2. Method
A two-stage classifier offers a distinct advantage over a one-stage classifier for DDIE, not only because the data is highly skewed towards one class (the negative class) but also because this majority class is clearly semantically distinct from the positive classes. The two-stage classifier comprises two classifiers in separate stages. In the first stage, a binary classifier is trained to classify drug pairs into positive and negative classes. Then, in the second stage, only the instances classified as positive by the first classifier are considered, and they are classified into one of the four types within the positive class (“advise”, “effect”, “mechanism” and “int”) using a multi-class classifier.
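The cascade above can be sketched as follows. This is a minimal illustration with hypothetical stand-in classifiers, not the paper's actual SVM and LSTM models:

```python
def two_stage_predict(instances, binary_clf, multi_clf):
    """Stage 1 filters out negatives; stage 2 assigns a DDI type to the rest."""
    labels = []
    for x in instances:
        if binary_clf(x) == "negative":
            labels.append("negative")
        else:
            labels.append(multi_clf(x))  # "advise"/"effect"/"mechanism"/"int"
    return labels

# Toy stand-ins for the two classifiers (invented for the example).
binary = lambda x: "negative" if x["score"] < 0.5 else "positive"
multi = lambda x: x["type_hint"]

preds = two_stage_predict(
    [{"score": 0.2, "type_hint": "effect"},
     {"score": 0.9, "type_hint": "advise"}],
    binary, multi)
# → ["negative", "advise"]
```

Only instances surviving the first stage ever reach the multi-class model, which is what shields the second stage from the dominant negative class.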
Our two-stage DDIE model is shown in Fig. 1. Before the first stage, the training set, test set and background texts downloaded from MEDLINE are first tokenized, then analyzed by the GDEP parser to obtain stem, POS tag, syntactic chunk and biomedical entity annotations. We then obtain word embedding, stem embedding, POS embedding, chunk embedding and entity embedding using the word representation model introduced in [13]; these embeddings are used in the LSTM based second-stage classifier.
2.1. First stage: feature based binary classification
In the first stage, inspired by Bui et al. [14] , we use a feature based binary classifier to recognize positive DDI instances.
The features include:
Context word feature, which means the three words to the left and the three words to the right of the two drugs of the DDI. Left and right words are distinguished by adding _L and _R suffixes, and if a word is a drug, it is replaced by DRUG. This feature may reveal the syntactic role of the drug within the phrase containing it, such as whether the drug is part of a coordination or is an abbreviation of another drug.
Pattern feature, which means certain patterns of the DDI pair, where three patterns are considered:
Trigger [prep] ∗ DRUG1 [prep] ∗ DRUG2 (case 1)
DRUG1 [prep] ∗ trigger [prep] ∗ DRUG2 (case 2)
DRUG1 [prep] ∗ DRUG2 [prep] ∗ trigger (case 3)
where prep is a preposition connecting the chunks that contain the trigger and the DDI pair, DRUG1 and DRUG2 are the target drugs of the DDI instance, and the symbol ∗ indicates zero or more occurrences. For example, the instance “concurrent use of DRUG1 and DRUG2 has the effect of trigger” fits case 3. Pattern features reveal the patterns of the drug relation.
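The distinction among the three cases can be illustrated with a simplified sketch. The real system matches preposition-linked chunks; here only the relative order of the trigger and the two blinded drug mentions is tested, and all names are illustrative:

```python
def pattern_case(tokens):
    """Return which of the three orderings holds for a blinded sentence:
    case1: TRIGGER ... DRUG1 ... DRUG2
    case2: DRUG1 ... TRIGGER ... DRUG2
    case3: DRUG1 ... DRUG2 ... TRIGGER
    """
    pos = {t: i for i, t in enumerate(tokens)
           if t in ("TRIGGER", "DRUG1", "DRUG2")}
    if not {"TRIGGER", "DRUG1", "DRUG2"} <= pos.keys():
        return None  # not all three markers present
    t, d1, d2 = pos["TRIGGER"], pos["DRUG1"], pos["DRUG2"]
    if t < d1 < d2:
        return "case1"
    if d1 < t < d2:
        return "case2"
    if d1 < d2 < t:
        return "case3"
    return None

case = pattern_case(
    "concurrent use of DRUG1 and DRUG2 has the effect of TRIGGER".split())
# → "case3", matching the paper's example
```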
Verb feature, which means bag-of-words (unigrams and bigrams) generated from the verb chunk of the clause to which
the DDI instance belongs. The verb features indicate how the drug in the left phrase (subject) and the drug in the right
phrase (object) are related.
Syntactic feature, which means the syntactic structure of the phrase that contains the DDI instance. It includes the types
of chunks (noun chunks or preposition chunks) within the phrase and the patterns of drugs (note that the pattern feature
is the pattern of a DDI instance, while pattern of drug is the pattern within the phrase).
Auxiliary features, of which the first is whether the drug names of the pair are real names or pronouns (e.g. “these drugs”, “this drug”), the second is whether the drugs have the same name, and the third is whether the target drugs are in the same chunk.
After extracting features, an SVM classifier is employed to perform the binary classification since SVM has been widely
used on this task and has been proved to be effective. The DDI instances that are classified as positive are then input into
the second stage.
2.2. Second stage: LSTM based multi-class classification
In this stage, we use an LSTM classifier to categorize the positive DDI instances into each of the four DDI types: “advise”,
“effect”, “int” and “mechanism”. The LSTM based multi-class classifier is shown in Fig. 2 .
The LSTM based model first applies dropout on the input to randomly drop some units. In Fig. 2, ‘A’ stands for a neural network chunk, which consists of three components: a maxout layer, a highway layer and a BLSTM layer. There are three layers of chunk ‘A’ in total.
At the top of the LSTM based model, we use a Softmax layer for multi-class classification. Since the length of the word sequence is arbitrary, the length of the output vectors of the top ‘A’ layer is also arbitrary. Therefore, the Softmax layer unifies the output by mean pooling, and then performs a standard softmax.
The LSTM based classifier not only has many hyperparameters to choose from, such as the number of layers, batch size and word embedding dimension, but also involves many related factors, such as bidirectionality, dropout and maxout, that highly influence performance. We consider several of the most widely used, theoretically or empirically effective factors for improving LSTM, and later we show a strategy for combining these factors to tune the LSTM.
2.2.1. Embeddings
Word embeddings have been widely used in deep learning based text mining applications. Incorporating word vectors as
input or additional features could improve the performance of biomedical text mining systems. Besides word vectors, other
information such as stem, POS tag, syntactic chunk and biomedical entity can also be vectorized by a biomedical domain
oriented embedding model [13] . The model first analyzes the text using GDEP [15] , which is a dependency parser trained
on the GENIA Treebank for parsing biomedical text, and then jointly trains stem embeddings, POS tag embeddings, chunk embeddings and entity embeddings. Since stems, POS tags, syntactic chunks and entities are often used as features to improve performance in traditional text mining systems, they contain important semantic information; therefore, it is reasonable to integrate these embeddings in our LSTM based DDIE model.
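As a toy illustration of how per-token inputs can be assembled, the sketch below concatenates a word vector with a POS vector. The lookup tables are invented for the example; the real vectors come from the jointly trained biomedical embedding model of [13]:

```python
# Hypothetical toy lookup tables (4-dim word vectors, 2-dim POS vectors).
word_emb = {"DRUG1": [0.1, 0.2, 0.3, 0.4],
            "inhibits": [0.5, 0.1, 0.0, 0.2],
            "DRUG2": [0.3, 0.3, 0.1, 0.0]}
pos_emb = {"NN": [1.0, 0.0], "VBZ": [0.0, 1.0]}

def embed(tokens, tags):
    # Each per-token input is the word vector concatenated with its POS vector.
    return [word_emb[w] + pos_emb[t] for w, t in zip(tokens, tags)]

X = embed(["DRUG1", "inhibits", "DRUG2"], ["NN", "VBZ", "NN"])
# 3 tokens, each of dimension 4 + 2 = 6
```

Stem, chunk and entity vectors would be appended in the same way, which is why adding an embedding type grows the input dimension (and hence the cost) proportionally.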
2.2.2. Distance vector
A DDI instance is a tuple (drug 1, drug 2, sentence). Distance information is essential to distinguish DDI instances from the same sentence. For example, assume that there are three DDI instances in one sentence; the inputs to the LSTM are identical if
Fig. 2. The LSTM based multi-class classifier used in the second stage.
only the “sentence” element of the tuple is considered. Adding the distance between each drug and each word to the input allows each of the three DDI instances to be differentiated.
Each word in the context has two distances, one from each of the two drugs. First, for each word in the sentence, we calculate the number of tokens between the word and each drug, obtaining distances d 1 and d 2 for drug 1 and drug 2 respectively. However, the influence of the values of d 1 and d 2 on the DDI instance is not linear. Concretely, the larger the values of d 1 and d 2 are, the less they influence the representation of the DDI instance. For example, assume two pairs of d 1 and d 2, where the values of the first pair are 23 and 24 and the values of the second pair are 2 and 3. Although the gap between 23 and 24 is equal to the gap between 2 and 3, 23 and 24 are almost the same for the DDI instance because the distance is too long, while 2 and 3 need to be distinguished since they are highly related to the understanding of the instance. Therefore, we introduce a non-linear transformation of the distance through tanh(·), as shown in Eq. (1):

s(d) = tanh(d / A)    (1)
where d is the number of tokens between the current token and the target drug entity (drug 1 or drug 2), and A is the average number of tokens per sentence. This formula differentiates small distance values from each other, while making large distance values similar.
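A small numeric check of Eq. (1), using a hypothetical average sentence length A, shows the intended saturation effect:

```python
import math

def scaled_distance(d, avg_sentence_len):
    """Eq. (1): s(d) = tanh(d / A). Small distances stay distinguishable;
    large distances saturate toward 1."""
    return math.tanh(d / avg_sentence_len)

A = 20  # hypothetical average sentence length in tokens
near = scaled_distance(3, A) - scaled_distance(2, A)   # gap close to the drug
far = scaled_distance(24, A) - scaled_distance(23, A)  # same raw gap, far away
# near > far: tanh compresses distant positions, as the text argues
```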
2.2.3. Dropout
Dropout [16] is a way to prevent overfitting on the training set and thereby improve performance on the test set. The key idea is to randomly drop units (along with their connections) from the neural network during training. Dropout can be applied in any neural network, usually on the input layers. The unit activations of a dropout layer H^{l+1}_t are set inactive with probability (1 − p). The activations of the (l + 1)th layer are multiplied by an appropriate mask m, whose elements m_j are generated from a Bernoulli distribution, m_j ∼ Bernoulli(p):

H^{l+1}_t = m ∘ σ(W^l H^l_t + b^l)    (2)
where ∘ in Eq. (2) denotes the Hadamard (entrywise) product.
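A minimal training-time sketch of the dropout mask: each unit is kept with probability p and zeroed otherwise (the usual inverted rescaling at inference time is omitted for clarity):

```python
import random

def dropout(activations, p, rng):
    """Keep each unit with probability p (mask m_j ~ Bernoulli(p)),
    zeroing it otherwise. Training-time behaviour only."""
    return [a if rng.random() < p else 0.0 for a in activations]

rng = random.Random(42)  # fixed seed so the sketch is reproducible
h = [0.5] * 10
h_dropped = dropout(h, p=0.8, rng=rng)
# roughly 80% of units survive; the rest are zeroed
```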
2.2.4. Bidirectional LSTM (BLSTM)
In this section, we first briefly introduce LSTM, and then introduce BLSTM. The key to LSTM [17] is the cell state C , which
runs straight down the entire sequence. LSTM can remove or add information to the cell state, regulated by gates. Taking
the t-th word (t ∈ [1, N], where N is the length of the sequence) as an example, as Eq. (3) shows, it first concatenates the hidden layer of the previous word, H_{t−1}, and the input layer of the current word, x_t:

z_t = [H_{t−1}, x_t]    (3)

then obtains the forget gate F_t by applying a sigmoid function. Each element in F_t is a fraction between 0 and 1, where 1 represents “completely keep this” while 0 represents “completely forget it”:

F_t = σ(W_F z_t + b_F)    (4)

Next, a sigmoid layer called the input gate I_t decides which values to update, and a tanh layer creates a vector of new candidate values, C̃_t, that could be added to the state C_t:

I_t = σ(W_I z_t + b_I)    (5)

C̃_t = tanh(W_C̃ z_t + b_C̃)    (6)

Next, in Eq. (7), F_t, C_{t−1}, I_t and C̃_t are combined to create an update to the state:

C_t = F_t ∘ C_{t−1} + I_t ∘ C̃_t    (7)

Finally, in Eq. (9) the output of the LSTM, H_t, is based on the cell state C_t, but is a filtered version of it. First, a sigmoid layer, shown in Eq. (8), decides which parts of the cell state to output as O_t. Then the cell state C_t is put through tanh to push the values to between −1 and 1 and multiplied by the output of the sigmoid gate:

O_t = σ(W_O z_t + b_O)    (8)

H_t = O_t ∘ tanh(C_t)    (9)

where F_t, I_t, O_t and H_t respectively indicate the forget gate, input gate, output gate and hidden layer, x_t is the input to the memory cell layer, W_F, W_I, W_C̃ and W_O are weight matrices, b_F, b_I, b_C̃ and b_O are bias vectors, and σ(·) is the sigmoid function. The LSTM layer is followed by mean pooling over time, H = mean([H_1, H_2, ..., H_N]), and softmax regression to deal with the multi-class classification. Let W^S be a weight matrix and b^S a bias vector; the probability that the input is a member of class i can be written as Eq. (10):

P(Y = i | H, W^S, b^S) = exp(−W^S_i H − b^S_i) / Σ_j exp(−W^S_j H − b^S_j)    (10)
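Eqs. (3)–(9) can be traced in a minimal scalar sketch with one-dimensional states and hypothetical weights (real layers use matrices and vectors):

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def lstm_step(h_prev, c_prev, x, W, b):
    """One LSTM step following Eqs. (3)-(9), with scalar states for
    readability. W[g] holds gate g's weights over z_t = [h_prev, x]."""
    gate = lambda g: sigmoid(W[g][0] * h_prev + W[g][1] * x + b[g])
    f = gate("F")                                                   # Eq. (4)
    i = gate("I")                                                   # Eq. (5)
    c_tilde = math.tanh(W["C"][0] * h_prev + W["C"][1] * x + b["C"])  # Eq. (6)
    c = f * c_prev + i * c_tilde                                    # Eq. (7)
    o = gate("O")                                                   # Eq. (8)
    h = o * math.tanh(c)                                            # Eq. (9)
    return h, c

# Hypothetical weights; run the cell over a short input sequence.
W = {g: (0.5, 0.5) for g in ("F", "I", "C", "O")}
b = {g: 0.0 for g in ("F", "I", "C", "O")}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(h, c, x, W, b)
```

The cell state c is only ever scaled by the forget gate and incremented by the gated candidate, which is the constant-error-flow property mentioned earlier.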
Sentences are ordered sequences of words, and are structured by grammar. Therefore, to understand the meaning of
a given word, the words before and after it should both be considered. An LSTM can model the sentence in the forward direction, and another, backward LSTM is added to construct a bidirectional model [18]. As the name implies, the advantage of BLSTM is that it considers both directions of the input sequence. First, two LSTM networks are built, one for each direction, yielding two hidden layers after mean pooling, H→ and H←, and the conditional probability of the classification is:

P(Y = i | H→, H←, W^S, b^S) = exp(−W^S_i (H→ ∘ H←) − b^S_i) / Σ_j exp(−W^S_j (H→ ∘ H←) − b^S_j)    (11)
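A sketch of Eq. (11): the pooled forward and backward hidden vectors (hypothetical values below) are merged by elementwise product before the softmax:

```python
import math

def blstm_softmax(h_fwd, h_bwd, W_S, b_S):
    """Eq. (11) sketch: elementwise product of the two pooled directions,
    then the same (negated-score) softmax used in Eq. (10)."""
    merged = [f * bk for f, bk in zip(h_fwd, h_bwd)]
    scores = [-sum(w * m for w, m in zip(row, merged)) - bias
              for row, bias in zip(W_S, b_S)]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 2-dim pooled states and a 2-class weight matrix.
probs = blstm_softmax([0.2, 0.4], [0.1, 0.3],
                      W_S=[[1.0, 0.0], [0.0, 1.0]], b_S=[0.0, 0.0])
# probs is a valid distribution over the classes
```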
2.2.5. Highway
There is evidence that the depth of neural networks is a crucial ingredient for their success. Although LSTM overcomes the gradient vanishing problem of RNNs to some extent, as the depth of the LSTM network increases, the gradients of the early layers decrease. In this paper, the depth of the LSTM model arises from four aspects: the depth of the LSTM unit itself, the multilayer LSTM, the length of the sequence and the bidirectional LSTM. To add more layers, Highway networks [19] are designed to ease the gradient vanishing problem. A highway layer uses a standard hidden layer Ĥ_t and a transform gate R_t:

Ĥ_t = σ(W_Ĥ [H_{t−1}, x_t] + b_Ĥ)    (12)

R_t = σ(W_R [H_{t−1}, x_t] + b_R)    (13)

where W_Ĥ and W_R are weight matrices, and b_Ĥ and b_R are bias vectors. By adding the highway mechanism, the input of the LSTM becomes, as shown in Eq. (14):

z_t = Ĥ_t ∘ R_t + [H_{t−1}, x_t] ∘ (1 − R_t)    (14)
Highway networks with hundreds of layers can be trained directly using stochastic gradient descent and with a variety
of activation functions, opening up the possibility of studying extremely deep and efficient architectures.
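A scalar sketch of Eqs. (12)–(14) with hypothetical weights; the transform gate mixes the candidate value with the carried input, giving gradients a near-identity path through deep stacks:

```python
import math

def highway_step(h_prev, x, W, b):
    """Scalar sketch of Eqs. (12)-(14). W and b hold per-gate weights
    over the concatenated input [h_prev, x]; all values hypothetical."""
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    h_hat = sigmoid(W["H"][0] * h_prev + W["H"][1] * x + b["H"])  # Eq. (12)
    r = sigmoid(W["R"][0] * h_prev + W["R"][1] * x + b["R"])      # Eq. (13)
    # Eq. (14): gated mixture of the transformed value and the carried
    # input, applied elementwise over the concatenation (h_prev, x).
    return (h_hat * r + h_prev * (1.0 - r),
            h_hat * r + x * (1.0 - r))

z = highway_step(0.0, 1.0,
                 W={"H": (0.5, 0.5), "R": (0.5, 0.5)},
                 b={"H": 0.0, "R": 0.0})
```

When R_t is near 0, the layer simply carries its input through unchanged, which is why highway stacks of many layers remain trainable.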
2.2.6. Data balancing
When the data is highly unbalanced, learning algorithms tend to degenerate by assigning all instances to the most common outcome, and F-scores tend to decline. In the training part of DDIExtraction 2013, there are 23,755 negative instances and only 189 “int” instances. Therefore, we use an over-sampling strategy, i.e., random replication of minority class instances, to balance the class distribution.
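The balancing strategy, random replication of minority-class instances, can be sketched as:

```python
import random

def oversample(instances, labels, rng):
    """Randomly replicate minority-class instances until every class
    matches the majority-class count."""
    by_class = {}
    for x, y in zip(instances, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(v) for v in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        balanced.extend((x, y) for x in xs + extra)
    return balanced

rng = random.Random(0)
data = oversample(["a", "b", "c", "d"], ["neg", "neg", "neg", "int"], rng)
# → 6 pairs: each class now has 3 instances
```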
2.3. Tuning strategy
To show the advantages of the two-stage approach over the one-stage one, in the experiments we first use a one-stage LSTM based multi-class classifier to perform DDIE directly, and then the two-stage method. The computational cost of the LSTM is high; for example, a run of the full model requires about 80 h on a Titan X, yet this cost does not necessarily translate into high performance. LSTM has many optional tuning factors, as stated before, and it is certain that not all of them benefit LSTM on the DDIExtraction task, which raises the question of which ones work and which do not. Note that it is normal for a factor to work well on some data sets and not so well on others; each of these factors has previously been proved theoretically and empirically useful. What concerns us here is the tuning of LSTM for DDIExtraction 2013.
To tune the LSTM based DDIE model while keeping the computational cost relatively low, we consider three aspects of each of the tuning factors: importance, implementation cost and computational cost. As stated before, a one-layer LSTM network with word embedding and distance embedding is the basic model; once any one of these is removed, the data cannot be represented correctly. The combination of 1-layer LSTM + word embedding + distance embedding is the baseline system, so its importance, implementation cost and computational cost are not rated.
The importance of a tuning factor indicates the likelihood of improving the performance of DDIE, and the value is based on the experience of the researchers. For example, dropout has been proved to be an effective way of improving the performance of image processing systems; however, in the experiments we have conducted, dropout did not perform well on biomedical text mining tasks, so dropout gets 1 star. Another example is two-stage classification: since many works [2,6] show that two-stage classification can improve the performance of DDIE, it gets two stars.
The implementation cost means the cost of adding the tuning factor to the model, and it is also based on the experience of the researchers. For example, it is easy to add POS embedding to the 1-layer LSTM + word embedding + distance baseline system, so the implementation cost of POS embedding is 1 star, while the two-stage system changes the baseline system considerably, so its implementation cost is 2 stars.
The computational cost is the increase in computational complexity from adding a tuning factor to the baseline system; non-linear transformations and matrix multiplications are counted. A standard LSTM chunk requires five non-linear transformations and three matrix multiplications, so the computational cost of the baseline system is 8. When adding POS embedding to the baseline system, the length of the input vectors increases by half, and the computational cost increases by a factor of 3/2. When adding data balancing, the number of DDI instances in each category grows until the total is four times the number of negative instances, which is about four times the size of the original training set.
The objective of Table 1 is not to evaluate each tuning factor. On the one hand, the importance of the tuning factors differs across tasks and domains; for example, on image processing tasks the importance of dropout should be increased. On the other hand, the scores given by the researchers are subjective and depend highly on their experience. The significance of Table 1 is a rough ranking of all tuning factors, based on which we can plan experiments to find the best combination of tuning factors while keeping the computational cost as low as possible.
3. Experiments and discussion
3.1. DDIExtraction 2013
The second shared task challenge on drug-drug interactions, DDIExtraction-2013 [1] , was offered as part of the 2013
International Workshop on Semantic Evaluation (SemEval-2013). The training corpus provided in the challenge contains 142
Medline abstracts on the subject of drug-drug interactions, and 572 documents describing drug-drug interactions from the
DrugBank database.
We use the standard metrics of Precision, Recall, and F-score to evaluate the performance. F-score is calculated with
the macro-averaged method, i.e., by taking precision to be the average of the precision calculated for each type (“advise”,
“effect”, “int”, “mechanism”), and similarly for recall.
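The macro-averaged scoring described above can be sketched as follows (the per-type counts are invented for illustration):

```python
def macro_prf(per_type_counts):
    """Macro-averaged P/R/F: average precision and recall across the
    DDI types, then combine. Counts are (tp, fp, fn) per type."""
    precisions, recalls = [], []
    for tp, fp, fn in per_type_counts.values():
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Hypothetical counts for the four types.
p, r, f = macro_prf({"advise": (8, 2, 2), "effect": (6, 4, 4),
                     "int": (1, 1, 3), "mechanism": (5, 5, 5)})
```

Because each type contributes equally to the average, a rare type like “int” pulls the macro-F down as much as a frequent one, which motivates the later discussion of the “int” category.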
Table 1
A list of tuning factors. ICost stands for implementation cost and
CCost stands for Computational cost. The ICost increases when
the deep learning model is modified, and the CCost increases
when the complexity of the model increases, e.g., the CCost dou-
bles when adding an extra LSTM layer on a 1 layer LSTM.
Factor                     Importance  ICost  CCost
1 layer LSTM               –           –      – (8)
2 layer LSTM               **          *      8*2
3 layer LSTM               **          *      8*3
1 layer BLSTM              **          *      8*2
2 layer BLSTM              **          *      8*4
3 layer BLSTM              **          *      8*6
Highway                    **          **     8+3
Dropout                    *           **     8+1
Maxout                     **          **     8+k
Word Embedding (WE)        –           –      – (8)
Distance Embedding (DE)    –           –      – (8)
POS Embedding (PE)         **          *      8*(3/2)
Chunk Embedding (CE)       *           *      8*(3/2)
Stem Embedding (SE)        *           *      8*(3/2)
Entity Embedding (EE)      *           *      8*(3/2)
Data balancing             **          **     8*4
Two stage                  **          **     8/7
Table 2
Baseline system. INT, MEC, EFF, ADV and MAVG indicate “int”, “mechanism”, “effect”, “advise” and macro-average respectively.
INT MEC EFF ADV MAVG
LSTM1 + Word P 43.3 46.1 46.2 64.9 50.0
R 3.3 45.2 46.4 26.9 30.3
F 6.1 45.6 46.3 38.0 37.7
LSTM1 + Word+Distance P 62.2 54.4 65.4 65.8 61.8
R 30.3 57.5 69.6 57.8 53.5
F 40.7 55.9 67.4 61.5 57.4
Table 3
Comparison of Multilayer LSTM and POS embedding.
INT MEC EFF ADV MAVG
LSTM2 + Word+Distance P 76.5 76.6 54.8 70.7 69.3
R 30.5 49.0 75.5 60.0 53.5
F 43.6 59.8 63.5 64.9 60.4
LSTM2 + POS+Word+Distance P 72.9 70.1 63.0 67.9 68.0
R 33.0 59.6 71.1 74.6 59.5
F 45.4 64.4 66.8 71.1 63.5
LSTM3 + POS+Word+Distance P 75.0 69.4 63.1 73.8 70.3
R 35.9 58.4 68.9 66.6 57.0
F 48.6 63.4 65.9 70.0 63.0
3.2. Baseline system
To begin with, we use a one-stage LSTM based classifier; Section 2.2.2 argues that distance is essential for distinguishing DDI instances from the same sentence. To verify this assumption, Table 2 compares the baseline system with and without distance embedding. When distance is not added, the F-score is 37.7%, while the F-score of the baseline system is 57.4%. The improvement in the “int” category is the most significant, as the recall increases from 3% to 30% and the F-score increases from 6% to 41%, which strongly verifies that distance is essential for the LSTM based DDIE model.
The macro-F score is an average over the four DDI categories; however, the number of “int” instances is only 189, while the number of “effect” instances is 1687. Therefore, the “int” category, which always obtains the lowest F-score among the four categories, strongly influences the macro-F score. In future work, “int” instances need special treatment.
Table 4
Comparison of bidirectional LSTM and other embeddings.
INT MEC EFF ADV MAVG
BLSTM1 + POS+Word+Distance P 87.2 73.0 64.3 69.1 73.3
R 34.4 60.3 71.7 68.0 58.3
F 49.3 66.0 67.8 68.5 64.9
BLSTM1 + POS+Word+Distance+Highway P 72.2 65.0 65.0 55.8 64.0
R 35.8 47.9 67.0 71.3 55.5
F 47.9 55.2 66.0 62.6 59.4
BLSTM1 + POS+Word+Distance+Maxout P 86.2 67.1 71.3 70.0 73.5
R 33.2 66.6 62.1 63.3 56.3
F 47.9 66.8 66.4 66.5 63.8
BLSTM1 + POS+Word+Distance+Stem+Chunk+Entity P 56.2 73.7 65.7 72.9 66.8
R 34.0 61.9 73.5 56.6 56.3
F 42.4 67.3 69.4 63.7 61.1
Table 5
Comparison of data balance and two-stage.
INT MEC EFF ADV MAVG
BLSTM1 + POS+Word+Distance+Data balance P 68.0 75.0 68.8 56.8 66.8
R 45.7 58.9 65.8 74.2 60.8
F 54.7 66.0 67.3 64.3 63.6
BLSTM1 + POS+Word+Distance+Two-stage P 76.4 78.1 66.0 76.6 74.3
R 40.3 68.2 74.4 66.8 62.3
F 52.8 72.8 69.9 71.4 67.7
3.3. Multilayer LSTM and POS embedding
According to Table 1, we first choose multilayer LSTM and POS embedding. As shown in Table 3, the two-layer LSTM obtains an F-score of 60.4%, higher than the one-layer LSTM (57.4%). Although both recalls are 53.5%, the precision increases from 61.8% to 69.3%. Therefore, two layers are better than one.
POS is a category of words with similar grammatical properties. Table 3 shows that POS embedding improves the F-score by 2.9 points. In traditional machine learning based DDIE systems, POS tags are often used as features, and Table 3 shows that POS embedding is also important for the LSTM model.
The three-layer LSTM model slightly decreases the F-score by 0.6%. Since F = 2PR/(P + R), it is better for the precision to be close to the recall. However, the gap between precision and recall is 9.5 on the two-layer LSTM and increases to 13.3 on the three-layer LSTM. We can see that more layers do not bring better F-scores.
3.4. BLSTM and embeddings
Table 4 shows that the one-layer BLSTM (64.9%) is better than the two-layer LSTM (63.5%). However, after adding highway, maxout, stem embedding, chunk embedding and entity embedding, the F-scores are lower. We think that highway and maxout are both good techniques and have been proved useful on other tasks; it is the distribution of the DDIE dataset that is not suited to these tuning factors.
3.5. Data balance and two-stage
The above experiments test the performance of tuning factors that modify the LSTM based model, while data balancing changes the training set. From Table 5 we find that the F-score decreases from 64.9% to 63.6% after applying data balancing; however, the gap between precision and recall becomes 6, the smallest among all the experiments.
After applying two-stage classification, the F-score increases from 64.9% to 67.7%, which again shows that adding the first-stage binary classification makes the two-stage classifier effective for the multi-class classification problem of DDIE. Although the two-stage model does not directly address the problem of data skew, the bias between categories is eased across the two stages of classification.
3.6. Multilayer BLSTM and dropout
Table 1 shows that the computational cost of multilayer BLSTM is the highest, so to reduce the computational cost as much as possible, we conduct the BLSTM experiments last. Table 6 shows that the two-layer BLSTM achieves the highest F-score (69.0%) among all the experiments, while three layers achieve an F-score of 68.1%. Again, the results illustrate that more computational cost and more layers do not yield higher F-scores. We also find that dropout decreases the F-score (to 68.1%), as expected; we think dropout is good for image processing tasks but not applicable to DDIE.
Table 6
Comparison of multilayer BLSTM and dropout.
INT MEC EFF ADV MAVG
BLSTM2 + POS+Word+Distance+Two-stage P 78.7 75.2 70.1 77.2 75.3
R 42.1 72.4 74.0 66.6 63.7
F 54.9 73.8 72.0 71.5 69.0
BLSTM2 + POS+Word+Distance+Two-stage+dropout P 78.6 77.5 68.4 75.4 74.8
R 38.4 70.2 75.4 66.7 62.5
F 51.6 73.7 71.7 70.8 68.1
BLSTM3 + POS+Word+Distance+Two-stage P 75.6 77.9 68.0 76.5 74.5
R 40.0 69.6 76.1 65.4 62.7
F 52.3 73.5 71.8 70.5 68.1
Table 7
Comparison with other works.

Method                      Classifier       F-score
Rastegar (2013) [20]        SVM              47.2
Björne (2013) [5]           SVM              58.7
Bokharaeian (2013) [4]      SVM              53.5
Chowdhury (2013) [2]        SVM              64.8
Thomas (2013) [3]           SVM etc.         59.7
Kim (2015) [6]              SVM              67.0
Bobic (2013) [21]           SVM etc.         44.8
Sanchez (2013) [22]         SVM              53.4
Hailu (2013) [23]           SVM              40.7
Zhao (2016) [7]             two-stage CNN    68.6
Our method                  SVM + LSTM       69.0
3.7. Comparison with other works
Table 7 lists all the works on DDIExtraction 2013 of which we are aware. We can see that deep-learning-based methods achieve higher F-scores. Most of the previous works are based on features and kernels with an SVM classifier. Zhao et al. proposed a two-stage method using a CNN which achieved an F-score of 68.6%, and our LSTM-based model achieves 69.0%; both are higher than the previous methods. Therefore, deep learning techniques can help improve the performance of DDIE.
3.8. Discussion
In this section, we have conducted a group of experiments to evaluate the performance of the proposed method. From Tables 2 to 7, we conclude that the two-stage strategy generally performs better than the one-stage strategy, and that the two-layer bidirectional LSTM achieves the highest F-score when combined with word embedding, distance embedding and POS embedding. The reasons behind these phenomena are probably as follows. First, the DDIE data suffer from class bias, and the two-stage methods can deal with the class bias better than the one-stage ones. Second, taking only the word sequence as input is not enough to represent a DDI instance, while the distance embedding and POS embedding enrich the input to obtain better representations of DDI instances. Third, although humans usually read text from beginning to end, for an LSTM, reading backward also helps to comprehend the meaning, and therefore BLSTM can extract DDIs more accurately.
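The enriched input representation just described amounts to concatenating, for each token, its word embedding with two distance embeddings (offsets to the two candidate drugs) and a POS embedding. A minimal sketch with tiny hypothetical lookup tables (not the trained embeddings from the paper):

```python
# Schematic of the per-token input to the LSTM. All lookup tables here
# are hypothetical toy values chosen only to show the concatenation.
word_emb = {"aspirin": [0.1, 0.2], "inhibits": [0.3, 0.1], "warfarin": [0.2, 0.4]}
dist_emb = {-1: [0.5], 0: [0.0], 1: [0.9]}   # indexed by relative distance to a drug
pos_emb  = {"NN": [1.0, 0.0], "VBZ": [0.0, 1.0]}

def token_vector(word, d1, d2, pos):
    """Concatenate word, two distance, and POS embeddings into one input vector."""
    return word_emb[word] + dist_emb[d1] + dist_emb[d2] + pos_emb[pos]

v = token_vector("inhibits", -1, 1, "VBZ")
print(v)       # -> [0.3, 0.1, 0.5, 0.9, 0.0, 1.0]
print(len(v))  # 2 + 1 + 1 + 2 = 6 dimensions per token
```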
4. Conclusion
In this paper, we propose a two-stage method to extract DDIs from the literature, obtaining an F-score of 69.0%. In the first stage, we use a feature-based classifier to identify positive DDIs, and in the second stage we use an LSTM-based classifier to assign each positive DDI to one of the four DDI categories. Considering their importance, implementation cost and computational cost, we compare 10 different tuning factors that possibly influence the performance of the LSTM in the experiments, and we find that the best combination of tuning factors is: two-layer bidirectional LSTM, word embedding, distance embedding, POS embedding and two-stage classification. In our opinion, there is still room for further improvement of DDIE, such as special treatment of the "int" category and reducing the gap between precision and recall, which we will focus on in future work.
Acknowledgements
The authors gratefully acknowledge the financial support provided by the National Natural Science Foundation of China under Grants No. 61672126 and 61672127, and by the Natural Science Foundation of Liaoning Province under Grant No. 2015020059. The Titan X GPU used for this research was donated by the NVIDIA Corporation.
References
[1] I. Segura-Bedmar, P. Martínez, M. Herrero-Zazo, SemEval-2013 Task 9: extraction of drug-drug interactions from biomedical texts, in: Proceedings of the 7th International Workshop on Semantic Evaluation, 2013.
[2] M.F.M. Chowdhury, A. Lavelli, FBK-irst: a multi-phase kernel based approach for drug-drug interaction detection and classification that exploits linguistic information, in: Seventh International Workshop on Semantic Evaluation, 2013, pp. 351–355.
[3] P. Thomas, M. Neves, T. Rocktäschel, U. Leser, WBI-DDI: drug-drug interaction extraction using majority voting, in: Seventh International Workshop on Semantic Evaluation, 2013, pp. 628–635.
[4] B. Bokharaeian, A. Diaz, NIL UCM: extracting drug-drug interactions from text through combination of sequence and tree kernels, in: Seventh International Workshop on Semantic Evaluation, 2013, pp. 644–650.
[5] J. Björne, S. Kaewphan, T. Salakoski, UTurku: drug named entity recognition and drug-drug interaction extraction using SVM classification and domain knowledge, in: Seventh International Workshop on Semantic Evaluation, 2013, pp. 651–659.
[6] S. Kim, H. Liu, L. Yeganova, W.J. Wilbur, Extracting drug-drug interactions from literature using a rich feature-based linear kernel approach, J. Biomed. Inform. 55 (2015) 23–30.
[7] Z. Zhao, Z. Yang, L. Luo, H. Lin, J. Wang, Drug drug interaction extraction from biomedical literature using syntax convolutional neural network, Bioinformatics 32 (22) (2016) 3444–3453.
[8] D. Zeng, K. Liu, S. Lai, G. Zhou, J. Zhao, Relation classification via convolutional deep neural network, in: COLING, 2014, pp. 2335–2344.
[9] S.K. Sahu, A. Anand, Drug-drug interaction extraction from biomedical text using long short term memory network, arXiv:1701.08303 (2017).
[10] Y. Bengio, P.Y. Simard, P. Frasconi, Learning long-term dependencies with gradient descent is difficult, IEEE Trans. Neural Networks 5 (2) (1994) 157–166.
[11] J.P.C. Chiu, E. Nichols, Named entity recognition with bidirectional LSTM-CNNs, arXiv:1511.08308 (2015).
[12] I. Sutskever, O. Vinyals, Q.V. Le, Sequence to sequence learning with neural networks, Adv. Neural Inf. Process. Syst. (2014) 3104–3112.
[13] Z. Jiang, L. Li, D. Huang, L. Jin, Training word embeddings for deep learning in biomedical text mining tasks, in: 2015 IEEE International Conference on Bioinformatics and Biomedicine, 2015, pp. 625–628.
[14] Q.C. Bui, P.M.A. Sloot, E.M. van Mulligen, J.A. Kors, A novel feature-based approach to extract drug-drug interactions from biomedical text, Bioinformatics 30 (23) (2014) 3365–3371.
[15] K. Sagae, J. Tsujii, Dependency parsing and domain adaptation with LR models and parser ensembles, in: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), Prague, Czech Republic, 2007, pp. 1044–1050.
[16] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, J. Mach. Learn. Res. 15 (1) (2014) 1929–1958.
[17] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (8) (1997) 1735–1780.
[18] M. Schuster, K.K. Paliwal, Bidirectional recurrent neural networks, IEEE Trans. Signal Process. 45 (11) (1997) 2673–2681.
[19] R.K. Srivastava, K. Greff, J. Schmidhuber, Highway networks, arXiv:1505.00387 (2015).
[20] M. Rastegar-Mojarad, R.D. Boyce, R. Prasad, UWM-TRIADS: classifying drug-drug interactions with two-stage SVM and post-processing, in: Seventh International Workshop on Semantic Evaluation, 2013, pp. 667–674.
[21] T. Bobic, J. Fluck, M. Hofmann-Apitius, SCAI: extracting drug-drug interactions using a rich feature vector, in: Seventh International Workshop on Semantic Evaluation, 2013, pp. 675–683.
[22] D. Sánchez-Cisneros, F. Aparicio Gali, UEM-UC3M: an ontology-based named entity recognition system for biomedical texts, in: Seventh International Workshop on Semantic Evaluation, 2013, pp. 622–627.
[23] N.D. Hailu, L.E. Hunter, K.B. Cohen, UColorado SOM: extraction of drug-drug interactions from biomedical text using knowledge-rich and knowledge-poor features, in: Seventh International Workshop on Semantic Evaluation, 2013, pp. 684–688.