Information Sciences 415–416 (2017) 100–109
Drug–drug interaction extraction from biomedical literature
using support vector machine and long short term memory
networks
Degen Huang a, Zhenchao Jiang b, Li Zou c,∗, Lishuang Li a,∗
a School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning, China
b Sangfor Technologies, Shenzhen, Guangdong, China
c School of Computer Science and Information Technology, Liaoning Normal University, Dalian, Liaoning, China
Article info
Article history:
Received 7 January 2017
Revised 28 April 2017
Accepted 17 June 2017
Available online 19 June 2017
Keywords:
Drug-Drug interaction extraction
Long short term memory
Two-stage approach
Abstract
Drug–drug interactions (DDIs) can cause adverse effects when patients take two or more drugs and thereby increase health care costs, so the extraction of DDIs is an important research area in patient safety. To improve the performance of drug–drug interaction extraction (DDIE), we present a novel two-stage method in this paper. It first identifies the positive instances using a feature based binary classifier, and then a Long Short-Term Memory (LSTM) based classifier classifies the positive instances into specific categories. The experimental results show that the two-stage method has many advantages over one-stage ones, and among the factors related to LSTM, we find that the two-layer bidirectional LSTM with word, distance and part-of-speech embeddings obtains the highest F-score of 69.0%, which is state-of-the-art.
© 2017 Elsevier Inc. All rights reserved.
1. Introduction
Drug–drug interaction extraction (DDIE) is an important task in Biomedical Natural Language Processing (BioNLP) do-
main. The DDIExtraction 2013 challenge [1] is the second edition of the DDIExtraction Shared Task series, a community-wide
effort to promote the implementation and comparative assessment of natural language processing (NLP) techniques in the pharmacovigilance domain. In the challenge the DDIs need to be classified into four predefined DDI types (“advise”, “effect”, “mechanism” and “int”). “Advise” is assigned when a recommendation or advice regarding concomitant use
of two drugs involved is described; “Effect” is assigned when the effect of the DDI is described; “Mechanism” is assigned
when a DDI is described by its pharmacokinetic mechanism; “Int” is assigned when a DDI appears in the text without any
additional information provided. The corpus was manually annotated and consists of 792 texts selected from the DrugBank database and another 233 Medline abstracts. This fine-grained corpus has been annotated with a total of 18,502 pharmacological substances and 5028 DDIs. The full DDIExtraction 2013 corpus consists of 1017 documents (784 DrugBank documents and 233 MedLine documents, with 27,792 instances for training and 5716 instances for testing) and was manually annotated with a total of 18,491 drug names and 5021 positive DDIs (4673 in DrugBank and 327 in MedLine).
∗ Corresponding authors.
E-mail addresses: [email protected] (D. Huang), [email protected] (Z. Jiang), [email protected] (L. Zou), [email protected] (L. Li).
http://dx.doi.org/10.1016/j.ins.2017.06.021
0020-0255/© 2017 Elsevier Inc. All rights reserved.
Fig. 1. The two-stage drug–drug interaction extraction model.
Support Vector Machine (SVM) based methods have achieved promising results over the past five years. For example, Chowdhury et al. [2] applied a two-stage hybrid-kernel relation extraction approach, taking advantage of different SVM kernels, and Thomas et al. [3] also combined several kernel methods. These two methods were the top two ranked teams in DDIExtraction 2013, and other teams such as [4,5] also used SVM as the classifier. Afterwards, Kim et al. [6] used a feature based method, also a two-stage system, achieving an F-score of 0.670. It is clear that SVM is effective on this task; however, one-stage methods generally cannot perform better than two-stage ones [7]. On the other hand, one limitation of SVM is its inability to deal with text of arbitrary length.
In recent years, the machine learning community has witnessed significant advances of deep learning, and deep learning
based methods have been applied to related tasks. For example, Zeng et al. [8] exploited a convolutional deep neural network to extract lexical and sentence-level features, which were then fed into a softmax classifier to predict the relationship between two marked nouns. Sahu et al. [9] proposed a Joint AB-LSTM model that utilized word and position embeddings as latent features on DDIExtraction 2013 and achieved competitive results against traditional feature based methods.
Compared to Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs) can deal with input of arbitrary
length, which is a good property for text data. Previous successes in real world applications with RNNs were limited due
to practical problems when long time lags between relevant events make learning difficult, i.e., gradient vanishing [10] .
The reason for this failure is the rapid decay of back-propagated error. The LSTM algorithm, an improved version of RNN,
overcomes this by enforcing constant error flow. Using gradient descent, LSTM explicitly learns when to store informa-
tion and when to access it. For many tasks, LSTMs are better than the standard RNNs. Almost all results based on RNNs
are achieved with LSTMs, and many studies have attempted to solve text mining problems using LSTM, for example, Lim-
sopatham et al. [11] investigated an approach for named entity recognition by enabling bidirectional LSTM to automatically
learn orthographic features, and Sutskever et al. [12] used a multilayered LSTM to map the input sequence to a vector of fixed dimensionality and then another deep LSTM to decode the target sequence from that vector (sequence-to-sequence learning). Given the effectiveness of LSTM on other related text mining tasks, LSTM is expected to help improve DDIE performance.
Considering that on the one hand, LSTM is a suitable machine learning model for text mining that can deal with input
of arbitrary length and have enough layers to overcome gradient vanishing, and on the other hand, the distribution of
DDIExtraction is highly skewed (the numbers of instances of “advise”, “effect”, “mechanism”, “int” and “negative” are 1047,
2047, 1621, 285 and 28,508, respectively), in this work, to improve the performance of DDIE, we present a novel two-stage
method. In the first stage, we identify the four kinds of positive DDI instances from the negative ones using feature based
binary classifier, and in the second stage, an LSTM based classifier is used to classify the positive instances into each of the
four drug interaction types. We study many factors that possibly influence the LSTM model, such as part-of-speech (POS) tag
embedding, distance information, dropout, etc. By conducting experiments, we show that word embedding, part-of-speech
(POS) tag embedding, distance information and multi-layer bidirectional LSTM help to improve the performance of DDIE.
2. Method
A two-stage classifier offers a distinct advantage over a one-stage classifier for DDIE, not only because the data is highly skewed towards one class (the negative class) but also because this majority class is clearly semantically distinct from the positive classes. The two-stage classifier comprises two classifiers in separate stages. In the first stage, a binary classifier is trained to classify drug pairs into positive and negative classes. Then, in the second stage, only the instances classified as positive by the first classifier are considered, and they are classified into one of the four types within the positive class (“advise”, “effect”, “mechanism” and “int”) using a multi-class classifier.
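The cascade above can be sketched as follows. This is a minimal illustration with hypothetical stand-in classifiers, not the paper's actual SVM and LSTM models:

```python
def two_stage_predict(instances, binary_clf, multi_clf):
    """Stage 1 filters out negatives; stage 2 assigns a DDI type to the rest."""
    labels = []
    for x in instances:
        if binary_clf(x) == "negative":
            labels.append("negative")
        else:
            labels.append(multi_clf(x))  # "advise"/"effect"/"mechanism"/"int"
    return labels

# Toy stand-ins for the two classifiers (invented for the example).
binary = lambda x: "negative" if x["score"] < 0.5 else "positive"
multi = lambda x: x["type_hint"]

preds = two_stage_predict(
    [{"score": 0.2, "type_hint": "effect"},
     {"score": 0.9, "type_hint": "advise"}],
    binary, multi)
# → ["negative", "advise"]
```

Only instances surviving the first stage ever reach the multi-class model, which is what shields the second stage from the dominant negative class.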
Our two-stage DDIE model is shown in Fig. 1. Before the first stage, the training set, test set and background texts downloaded from MEDLINE are first tokenized, then analyzed by the GDEP parser to obtain stem, POS tag, syntactic chunk and biomedical entity annotations. We then obtain word embedding, stem embedding, POS embedding, chunk embedding and entity embedding using the word representation model introduced in [13]; these embeddings are used in the LSTM based second-stage classifier.
2.1. First stage: feature based binary classification
In the first stage, inspired by Bui et al. [14] , we use a feature based binary classifier to recognize positive DDI instances.
The features include:
Context word feature, which means the three words to the left and the three words to the right of the two drugs of the DDI. Left and right words are distinguished by adding _L and _R suffixes, and if a word is a drug, it is replaced by DRUG. This feature may reveal the syntactic role of the drug within the phrase containing it, such as whether the drug is part of a coordination or is an abbreviation of another drug.
Pattern feature, which means certain patterns of the DDI pair, where three patterns are considered:
Trigger [prep] ∗ DRUG1 [prep] ∗ DRUG2 (case 1)
DRUG1 [prep] ∗ trigger [prep] ∗ DRUG2 (case 2)
DRUG1 [prep] ∗ DRUG2 [prep] ∗ trigger (case 3)
where prep is a preposition connecting the chunks that contain the trigger and the DDI pair, DRUG1 and DRUG2 are the target drugs of the DDI instance, and the symbol ∗ indicates zero or more occurrences. For example, the instance “concurrent use of DRUG1 and DRUG2 has the effect of trigger” fits case 3. Pattern features reveal the patterns of the drug relation.
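The distinction among the three cases can be illustrated with a simplified sketch. The real system matches preposition-linked chunks; here only the relative order of the trigger and the two blinded drug mentions is tested, and all names are illustrative:

```python
def pattern_case(tokens):
    """Return which of the three orderings holds for a blinded sentence:
    case1: TRIGGER ... DRUG1 ... DRUG2
    case2: DRUG1 ... TRIGGER ... DRUG2
    case3: DRUG1 ... DRUG2 ... TRIGGER
    """
    pos = {t: i for i, t in enumerate(tokens)
           if t in ("TRIGGER", "DRUG1", "DRUG2")}
    if not {"TRIGGER", "DRUG1", "DRUG2"} <= pos.keys():
        return None  # not all three markers present
    t, d1, d2 = pos["TRIGGER"], pos["DRUG1"], pos["DRUG2"]
    if t < d1 < d2:
        return "case1"
    if d1 < t < d2:
        return "case2"
    if d1 < d2 < t:
        return "case3"
    return None

case = pattern_case(
    "concurrent use of DRUG1 and DRUG2 has the effect of TRIGGER".split())
# → "case3", matching the paper's example
```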
Verb feature, which means bag-of-words (unigrams and bigrams) generated from the verb chunk of the clause to which
the DDI instance belongs. The verb features indicate how the drug in the left phrase (subject) and the drug in the right
phrase (object) are related.
Syntactic feature, which means the syntactic structure of the phrase that contains the DDI instance. It includes the types
of chunks (noun chunks or preposition chunks) within the phrase and the patterns of drugs (note that the pattern feature
is the pattern of a DDI instance, while pattern of drug is the pattern within the phrase).
Auxiliary features, of which the first is whether the drug names of the pair are real names or pronouns (e.g. “these drugs”, “this drug”), the second is whether the drugs have the same name, and the third is whether the target drugs are in the same chunk.
After extracting features, an SVM classifier is employed to perform the binary classification since SVM has been widely
used on this task and has been proved to be effective. The DDI instances that are classified as positive are then input into
the second stage.
2.2. Second stage: LSTM based multi-class classification
In this stage, we use an LSTM classifier to categorize the positive DDI instances into each of the four DDI types: “advise”,
“effect”, “int” and “mechanism”. The LSTM based multi-class classifier is shown in Fig. 2 .
The LSTM based model first applies dropout on the input to randomly drop some units. In Fig. 2, ‘A’ stands for a neural network chunk, which consists of three components: a maxout layer, a highway layer and a BLSTM layer. There are three layers of chunk ‘A’ in total.
At the top of the LSTM based model, we use a Softmax layer for multi-class classification. Since the length of the word sequence is arbitrary, the length of the output vectors of the top ‘A’ layer is also arbitrary. Therefore, the Softmax layer unifies the output by mean pooling, and then performs a standard softmax.
The LSTM based classifier not only has many hyperparameters to choose from, such as the number of layers, batch size and word embedding dimension, but also involves many related factors, such as bidirectionality, dropout and maxout, that highly influence performance. We consider several of the most widely used, theoretically or empirically effective factors for improving LSTM, and later we show a strategy for combining these factors to tune the LSTM.
2.2.1. Embeddings
Word embeddings have been widely used in deep learning based text mining applications. Incorporating word vectors as
input or additional features could improve the performance of biomedical text mining systems. Besides word vectors, other
information such as stem, POS tag, syntactic chunk and biomedical entity can also be vectorized by a biomedical domain
oriented embedding model [13] . The model first analyzes the text using GDEP [15] , which is a dependency parser trained
on the GENIA Treebank for parsing biomedical text, and then jointly trains stem embeddings, POS tag embeddings, chunk embeddings and entity embeddings. Since stems, POS tags, syntactic chunks and entities are often used as features to improve performance in traditional text mining systems, they contain important semantic information; therefore, it is reasonable to integrate these embeddings in our LSTM based DDIE model.
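As a toy illustration of how per-token inputs can be assembled, the sketch below concatenates a word vector with a POS vector. The lookup tables are invented for the example; the real vectors come from the jointly trained biomedical embedding model of [13]:

```python
# Hypothetical toy lookup tables (4-dim word vectors, 2-dim POS vectors).
word_emb = {"DRUG1": [0.1, 0.2, 0.3, 0.4],
            "inhibits": [0.5, 0.1, 0.0, 0.2],
            "DRUG2": [0.3, 0.3, 0.1, 0.0]}
pos_emb = {"NN": [1.0, 0.0], "VBZ": [0.0, 1.0]}

def embed(tokens, tags):
    # Each per-token input is the word vector concatenated with its POS vector.
    return [word_emb[w] + pos_emb[t] for w, t in zip(tokens, tags)]

X = embed(["DRUG1", "inhibits", "DRUG2"], ["NN", "VBZ", "NN"])
# 3 tokens, each of dimension 4 + 2 = 6
```

Stem, chunk and entity vectors would be appended in the same way, which is why adding an embedding type grows the input dimension (and hence the cost) proportionally.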
2.2.2. Distance vector
A DDI instance is a tuple (drug 1, drug 2, sentence). Distance information is essential to distinguish DDI instances from the same sentence. For example, assume that there are three DDI instances in one sentence; the inputs to the LSTM are identical if
Fig. 2. The LSTM based multi-class classifier used in the second stage.
only the “sentence” element of the tuple is considered. Adding the distance between each drug and each word to the input allows each of the three DDI instances to be differentiated.
Each word in the context has two distances, one from each of the two drugs. First, for each word in the sentence, we calculate the number of tokens between the word and each drug, obtaining distances d 1 and d 2 for drug 1 and drug 2 respectively. However, the influence of the values of d 1 and d 2 on the DDI instance is not linear. Concretely, the larger the values of d 1 and d 2 are, the less they influence the representation of the DDI instance. For example, assume two pairs of d 1 and d 2, where the values of the first pair are 23 and 24 and the values of the second pair are 2 and 3. Although the gap between 23 and 24 is equal to the gap between 2 and 3, 23 and 24 are almost the same for the DDI instance because the distance is too long, while 2 and 3 need to be distinguished since they are highly related to the understanding of the instance. Therefore, we introduce a non-linear transformation of the distance through tanh(·), as shown in Eq. (1):

s(d) = tanh(d / A)    (1)
where d is the number of tokens between the current token and the target drug entity (drug 1 or drug 2), and A is the average number of tokens per sentence. This formula differentiates small distance values from each other, while making large distance values similar.
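A small numeric check of Eq. (1), using a hypothetical average sentence length A, shows the intended saturation effect:

```python
import math

def scaled_distance(d, avg_sentence_len):
    """Eq. (1): s(d) = tanh(d / A). Small distances stay distinguishable;
    large distances saturate toward 1."""
    return math.tanh(d / avg_sentence_len)

A = 20  # hypothetical average sentence length in tokens
near = scaled_distance(3, A) - scaled_distance(2, A)   # gap close to the drug
far = scaled_distance(24, A) - scaled_distance(23, A)  # same raw gap, far away
# near > far: tanh compresses distant positions, as the text argues
```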
2.2.3. Dropout
Dropout [16] is a way to prevent overfitting on the training set and thereby improve performance on the test set. The key idea is to randomly drop units (along with their connections) from the neural network during training. Dropout can be applied in any neural network, usually on the input layers. The unit activations of a dropout layer H^{l+1}_t are set inactive with probability (1 − p). The activations of the (l + 1)th layer are multiplied by an appropriate mask m, whose elements m_j are generated from a Bernoulli distribution, m_j ∼ Bernoulli(p):

H^{l+1}_t = m ∘ σ(W^l H^l_t + b^l)    (2)
where ∘ in Eq. (2) denotes the Hadamard (entrywise) product.
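A minimal training-time sketch of the dropout mask: each unit is kept with probability p and zeroed otherwise (the usual inverted rescaling at inference time is omitted for clarity):

```python
import random

def dropout(activations, p, rng):
    """Keep each unit with probability p (mask m_j ~ Bernoulli(p)),
    zeroing it otherwise. Training-time behaviour only."""
    return [a if rng.random() < p else 0.0 for a in activations]

rng = random.Random(42)  # fixed seed so the sketch is reproducible
h = [0.5] * 10
h_dropped = dropout(h, p=0.8, rng=rng)
# roughly 80% of units survive; the rest are zeroed
```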
2.2.4. Bidirectional LSTM (BLSTM)
In this section, we first briefly introduce LSTM, and then introduce BLSTM. The key to LSTM [17] is the cell state C , which
runs straight down the entire sequence. LSTM can remove or add information to the cell state, regulated by gates. Taking
the t-th word (t ∈ [1, N], where N is the length of the sequence) as an example, as Eq. (3) shows, it first concatenates the hidden layer of the previous word, H_{t−1}, and the input layer of the current word, x_t:

z_t = [H_{t−1}, x_t]    (3)

then obtains the forget gate F_t by applying a sigmoid function. Each element in F_t is a fraction between 0 and 1, where 1 represents “completely keep this” while 0 represents “completely forget it”:

F_t = σ(W_F z_t + b_F)    (4)

Next, a sigmoid layer called the input gate I_t decides which values to update, and a tanh layer creates a vector of new candidate values, C̃_t, that could be added to the state C_t:

I_t = σ(W_I z_t + b_I)    (5)

C̃_t = tanh(W_C̃ z_t + b_C̃)    (6)

Next, in Eq. (7), F_t, C_{t−1}, I_t and C̃_t are combined to create an update to the state:

C_t = F_t ∘ C_{t−1} + I_t ∘ C̃_t    (7)

Finally, in Eq. (9) the output of the LSTM, H_t, is based on the cell state C_t, but is a filtered version of it. First, a sigmoid layer, shown in Eq. (8), decides which parts of the cell state to output as O_t. Then the cell state C_t is put through tanh to push the values to between −1 and 1 and multiplied by the output of the sigmoid gate:

O_t = σ(W_O z_t + b_O)    (8)

H_t = O_t ∘ tanh(C_t)    (9)

where F_t, I_t, O_t and H_t respectively indicate the forget gate, input gate, output gate and hidden layer, x_t is the input to the memory cell layer, W_F, W_I, W_C̃ and W_O are weight matrices, b_F, b_I, b_C̃ and b_O are bias vectors, and σ(·) is the sigmoid function. The LSTM layer is followed by mean pooling over time, H = mean([H_1, H_2, ..., H_N]), and softmax regression to deal with the multi-class classification. Let W^S be a weight matrix and b^S a bias vector; the probability that the input is a member of class i can be written as Eq. (10):

P(Y = i | H, W^S, b^S) = exp(−W^S_i H − b^S_i) / Σ_j exp(−W^S_j H − b^S_j)    (10)
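Eqs. (3)–(9) can be traced in a minimal scalar sketch with one-dimensional states and hypothetical weights (real layers use matrices and vectors):

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def lstm_step(h_prev, c_prev, x, W, b):
    """One LSTM step following Eqs. (3)-(9), with scalar states for
    readability. W[g] holds gate g's weights over z_t = [h_prev, x]."""
    gate = lambda g: sigmoid(W[g][0] * h_prev + W[g][1] * x + b[g])
    f = gate("F")                                                   # Eq. (4)
    i = gate("I")                                                   # Eq. (5)
    c_tilde = math.tanh(W["C"][0] * h_prev + W["C"][1] * x + b["C"])  # Eq. (6)
    c = f * c_prev + i * c_tilde                                    # Eq. (7)
    o = gate("O")                                                   # Eq. (8)
    h = o * math.tanh(c)                                            # Eq. (9)
    return h, c

# Hypothetical weights; run the cell over a short input sequence.
W = {g: (0.5, 0.5) for g in ("F", "I", "C", "O")}
b = {g: 0.0 for g in ("F", "I", "C", "O")}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(h, c, x, W, b)
```

The cell state c is only ever scaled by the forget gate and incremented by the gated candidate, which is the constant-error-flow property mentioned earlier.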
Sentences are ordered sequences of words, and are structured by grammar. Therefore, to understand the meaning of
a given word, the words before and after it should both be considered. An LSTM can model the sentence in the forward direction, and another, backward LSTM is added to construct a bidirectional model [18]. As the name implies, the advantage of BLSTM is that it considers both directions of the input sequence. First, two LSTM networks are built, one for each direction, yielding two hidden layers after mean pooling, H→ and H←, and the conditional probability of the classification is:

P(Y = i | H→, H←, W^S, b^S) = exp(−W^S_i (H→ ∘ H←) − b^S_i) / Σ_j exp(−W^S_j (H→ ∘ H←) − b^S_j)    (11)
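A sketch of Eq. (11): the pooled forward and backward hidden vectors (hypothetical values below) are merged by elementwise product before the softmax:

```python
import math

def blstm_softmax(h_fwd, h_bwd, W_S, b_S):
    """Eq. (11) sketch: elementwise product of the two pooled directions,
    then the same (negated-score) softmax used in Eq. (10)."""
    merged = [f * bk for f, bk in zip(h_fwd, h_bwd)]
    scores = [-sum(w * m for w, m in zip(row, merged)) - bias
              for row, bias in zip(W_S, b_S)]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 2-dim pooled states and a 2-class weight matrix.
probs = blstm_softmax([0.2, 0.4], [0.1, 0.3],
                      W_S=[[1.0, 0.0], [0.0, 1.0]], b_S=[0.0, 0.0])
# probs is a valid distribution over the classes
```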
2.2.5. Highway
There is evidence that the depth of neural networks is a crucial ingredient for their success. Although LSTM overcomes the gradient vanishing problem of RNNs to some extent, as the depth of the LSTM network increases, the gradients of the early layers decrease. In this paper, the depth of the LSTM model arises from four aspects: the depth of the LSTM unit itself, the multilayer LSTM, the length of the sequence and the bidirectional LSTM. To add more layers, Highway networks [19] are designed to ease the gradient vanishing problem. A highway layer uses a standard hidden layer Ĥ_t and a transform gate R_t:

Ĥ_t = σ(W_Ĥ [H_{t−1}, x_t] + b_Ĥ)    (12)

R_t = σ(W_R [H_{t−1}, x_t] + b_R)    (13)

where W_Ĥ and W_R are weight matrices, and b_Ĥ and b_R are bias vectors. By adding the highway mechanism, the input of the LSTM becomes, as shown in Eq. (14):

z_t = Ĥ_t ∘ R_t + [H_{t−1}, x_t] ∘ (1 − R_t)    (14)
Highway networks with hundreds of layers can be trained directly using stochastic gradient descent and with a variety
of activation functions, opening up the possibility of studying extremely deep and efficient architectures.
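A scalar sketch of Eqs. (12)–(14) with hypothetical weights; the transform gate mixes the candidate value with the carried input, giving gradients a near-identity path through deep stacks:

```python
import math

def highway_step(h_prev, x, W, b):
    """Scalar sketch of Eqs. (12)-(14). W and b hold per-gate weights
    over the concatenated input [h_prev, x]; all values hypothetical."""
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    h_hat = sigmoid(W["H"][0] * h_prev + W["H"][1] * x + b["H"])  # Eq. (12)
    r = sigmoid(W["R"][0] * h_prev + W["R"][1] * x + b["R"])      # Eq. (13)
    # Eq. (14): gated mixture of the transformed value and the carried
    # input, applied elementwise over the concatenation (h_prev, x).
    return (h_hat * r + h_prev * (1.0 - r),
            h_hat * r + x * (1.0 - r))

z = highway_step(0.0, 1.0,
                 W={"H": (0.5, 0.5), "R": (0.5, 0.5)},
                 b={"H": 0.0, "R": 0.0})
```

When R_t is near 0, the layer simply carries its input through unchanged, which is why highway stacks of many layers remain trainable.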
2.2.6. Data balancing
When the data is highly unbalanced, learning algorithms tend to degenerate by assigning all instances to the most common outcome, and F-scores tend to decline. In the training part of DDIExtraction 2013, there are 23,755 negative instances and only 189 “int” instances. Therefore, we use an over-sampling strategy, i.e., random replication of minority class instances, to balance the class distribution.
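The balancing strategy, random replication of minority-class instances, can be sketched as:

```python
import random

def oversample(instances, labels, rng):
    """Randomly replicate minority-class instances until every class
    matches the majority-class count."""
    by_class = {}
    for x, y in zip(instances, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(v) for v in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        balanced.extend((x, y) for x in xs + extra)
    return balanced

rng = random.Random(0)
data = oversample(["a", "b", "c", "d"], ["neg", "neg", "neg", "int"], rng)
# → 6 pairs: each class now has 3 instances
```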
2.3. Tuning strategy
To show the advantages of the two-stage approach over the one-stage one, in the experiments we first use a one-stage LSTM based multi-class classifier to perform DDIE directly, and then the two-stage method. The computational cost of the LSTM is high; for example, a run of the full model requires about 80 h on a Titan X, yet this cost does not necessarily translate into high performance. LSTM has many optional tuning factors, as stated before, and it is certain that not all of them benefit LSTM on the DDIExtraction task, which raises the question of which ones work and which do not. Note that it is normal for a factor to work well on some data sets and not so well on others; each of these factors has previously been proved theoretically and empirically useful. What concerns us here is the tuning of LSTM for DDIExtraction 2013.
To tune the LSTM based DDIE model while keeping the computational cost relatively low, we consider three aspects of each of the tuning factors: importance, implementation cost and computational cost. As stated before, a one-layer LSTM network with word embedding and distance embedding is the basic model; once any one of these is removed, the data cannot be represented correctly. The combination of 1-layer LSTM + word embedding + distance embedding is the baseline system, so its importance, implementation cost and computational cost are not rated.
The importance of a tuning factor indicates the likelihood of improving the performance of DDIE, and the value is based on the experience of the researchers. For example, dropout has been proved to be an effective way of improving the performance of image processing systems; however, in the experiments we have conducted, dropout did not perform well on biomedical text mining tasks, so dropout gets 1 star. Another example is two-stage classification: since many works [2,6] show that two-stage classification can improve the performance of DDIE, it gets two stars.
The implementation cost means the cost of adding the tuning factor to the model, and it is also based on the experience of the researchers. For example, it is easy to add POS embedding to the 1-layer LSTM + word embedding + distance baseline system, so the implementation cost of POS embedding is 1 star, while the two-stage system changes the baseline system considerably, so its implementation cost is 2 stars.
The computational cost is the increase in computational complexity from adding a tuning factor to the baseline system; non-linear transformations and matrix multiplications are counted. A standard LSTM chunk requires five non-linear transformations and three matrix multiplications, so the computational cost of the baseline system is 8. When adding POS embedding to the baseline system, the length of the input vectors increases by half, and the computational cost increases by a factor of 3/2. When adding data balancing, the number of DDI instances in each category grows until the total is four times the number of negative instances, which is about four times the size of the original training set.
The objective of Table 1 is not to evaluate each tuning factor. On the one hand, the importance of the tuning factors differs across tasks and domains; for example, on image processing tasks the importance of dropout should be increased. On the other hand, the scores given by the researchers are subjective and depend highly on their experience. The significance of Table 1 is a rough ranking of all tuning factors, based on which we can plan experiments to find the best combination of tuning factors while keeping the computational cost as low as possible.
3. Experiments and discussion
3.1. DDIExtraction 2013
The second shared task challenge on drug-drug interactions, DDIExtraction-2013 [1] , was offered as part of the 2013
International Workshop on Semantic Evaluation (SemEval-2013). The training corpus provided in the challenge contains 142
Medline abstracts on the subject of drug-drug interactions, and 572 documents describing drug-drug interactions from the
DrugBank database.
We use the standard metrics of Precision, Recall, and F-score to evaluate the performance. F-score is calculated with
the macro-averaged method, i.e., by taking precision to be the average of the precision calculated for each type (“advise”,
“effect”, “int”, “mechanism”), and similarly for recall.
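The macro-averaged scoring described above can be sketched as follows (the per-type counts are invented for illustration):

```python
def macro_prf(per_type_counts):
    """Macro-averaged P/R/F: average precision and recall across the
    DDI types, then combine. Counts are (tp, fp, fn) per type."""
    precisions, recalls = [], []
    for tp, fp, fn in per_type_counts.values():
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Hypothetical counts for the four types.
p, r, f = macro_prf({"advise": (8, 2, 2), "effect": (6, 4, 4),
                     "int": (1, 1, 3), "mechanism": (5, 5, 5)})
```

Because each type contributes equally to the average, a rare type like “int” pulls the macro-F down as much as a frequent one, which motivates the later discussion of the “int” category.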
Table 1
A list of tuning factors. ICost stands for implementation cost and
CCost stands for Computational cost. The ICost increases when
the deep learning model is modified, and the CCost increases
when the complexity of the model increases, e.g., the CCost dou-
bles when adding an extra LSTM layer on a 1 layer LSTM.
Factor                     Importance  ICost  CCost
1 layer LSTM               –           –      – (8)
2 layer LSTM               **          *      8*2
3 layer LSTM               **          *      8*3
1 layer BLSTM              **          *      8*2
2 layer BLSTM              **          *      8*4
3 layer BLSTM              **          *      8*6
Highway                    **          **     8+3
Dropout                    *           **     8+1
Maxout                     **          **     8+k
Word Embedding (WE)        –           –      – (8)
Distance Embedding (DE)    –           –      – (8)
POS Embedding (PE)         **          *      8*(3/2)
Chunk Embedding (CE)       *           *      8*(3/2)
Stem Embedding (SE)        *           *      8*(3/2)
Entity Embedding (EE)      *           *      8*(3/2)
Data balancing             **          **     8*4
Two stage                  **          **     8/7
Table 2
Baseline system. INT, MEC, EFF, ADV and MAVG indicate “int”, “mechanism”, “effect”, “advise” and macro-average respectively.
INT MEC EFF ADV MAVG
LSTM1 + Word P 43.3 46.1 46.2 64.9 50.0
R 3.3 45.2 46.4 26.9 30.3
F 6.1 45.6 46.3 38.0 37.7
LSTM1 + Word+Distance P 62.2 54.4 65.4 65.8 61.8
R 30.3 57.5 69.6 57.8 53.5
F 40.7 55.9 67.4 61.5 57.4
Table 3
Comparison of Multilayer LSTM and POS embedding.
INT MEC EFF ADV MAVG
LSTM2 + Word+Distance P 76.5 76.6 54.8 70.7 69.3
R 30.5 49.0 75.5 60.0 53.5
F 43.6 59.8 63.5 64.9 60.4
LSTM2 + POS+Word+Distance P 72.9 70.1 63.0 67.9 68.0
R 33.0 59.6 71.1 74.6 59.5
F 45.4 64.4 66.8 71.1 63.5
LSTM3 + POS+Word+Distance P 75.0 69.4 63.1 73.8 70.3
R 35.9 58.4 68.9 66.6 57.0
F 48.6 63.4 65.9 70.0 63.0
3.2. Baseline system
To begin with, we use a one-stage LSTM based classifier; Section 2.2.2 argues that distance is essential for distinguishing DDI instances from the same sentence. To verify this assumption, Table 2 compares the baseline system with and without distance embedding. When distance is not added, the F-score is 37.7%, while the F-score of the baseline system is 57.4%. The improvement in the “int” category is the most significant, as the recall increases from 3% to 30% and the F-score increases from 6% to 41%, which strongly verifies that distance is essential for the LSTM based DDIE model.
The macro-F score is an average over the four DDI categories; however, the number of “int” instances is only 189, while the number of “effect” instances is 1687. Therefore, the “int” category, which always obtains the lowest F-score among the four categories, strongly influences the macro-F score. In future work, “int” instances need special treatment.
Table 4
Comparison of bidirectional LSTM and other embeddings.
INT MEC EFF ADV MAVG
BLSTM1 + POS+Word+Distance P 87.2 73.0 64.3 69.1 73.3
R 34.4 60.3 71.7 68.0 58.3
F 49.3 66.0 67.8 68.5 64.9
BLSTM1 + POS+Word+Distance+Highway P 72.2 65.0 65.0 55.8 64.0
R 35.8 47.9 67.0 71.3 55.5
F 47.9 55.2 66.0 62.6 59.4
BLSTM1 + POS+Word+Distance+Maxout P 86.2 67.1 71.3 70.0 73.5
R 33.2 66.6 62.1 63.3 56.3
F 47.9 66.8 66.4 66.5 63.8
BLSTM1 + POS+Word+Distance+Stem+Chunk+Entity P 56.2 73.7 65.7 72.9 66.8
R 34.0 61.9 73.5 56.6 56.3
F 42.4 67.3 69.4 63.7 61.1
Table 5
Comparison of data balance and two-stage.
INT MEC EFF ADV MAVG
BLSTM1 + POS+Word+Distance+Data balance P 68.0 75.0 68.8 56.8 66.8
R 45.7 58.9 65.8 74.2 60.8
F 54.7 66.0 67.3 64.3 63.6
BLSTM1 + POS+Word+Distance+Two-stage P 76.4 78.1 66.0 76.6 74.3
R 40.3 68.2 74.4 66.8 62.3
F 52.8 72.8 69.9 71.4 67.7
3.3. Multilayer LSTM and POS embedding
According to Table 1, we first choose multilayer LSTM and POS embedding. As shown in Table 3, the two-layer LSTM obtains an F-score of 60.4%, higher than the one-layer LSTM (57.4%). Although both recalls are 53.5%, the precision increases from 61.8% to 69.3%. Therefore, two layers are better than one.
POS is a category of words with similar grammatical properties. Table 3 shows that POS embedding improves the F-score by 2.9 points. In traditional machine learning based DDIE systems, POS tags are often used as features, and Table 3 shows that POS embedding is also important for the LSTM model.
The three-layer LSTM model slightly decreases the F-score by 0.6%. Since F = 2PR/(P + R), it is better for the precision to be close to the recall. However, the gap between precision and recall is 9.5 on the two-layer LSTM and increases to 13.3 on the three-layer LSTM. We can see that more layers do not bring better F-scores.
3.4. BLSTM and embeddings
Table 4 shows that the one-layer BLSTM (64.9%) is better than the two-layer LSTM (63.5%). However, after adding highway, maxout, stem embedding, chunk embedding and entity embedding, the F-scores are lower. We think that highway and maxout are both good techniques and have been proved useful on other tasks; it is the distribution of the DDIE dataset that is not suited to these tuning factors.
3.5. Data balance and two-stage
The above experiments test the performance of tuning factors that modify the LSTM based model, while data balancing changes the training set. From Table 5 we find that the F-score decreases from 64.9% to 63.6% after applying data balancing; however, the gap between precision and recall becomes 6, the smallest among all the experiments.
After applying two-stage classification, the F-score increases from 64.9% to 67.7%, which again shows that adding the first-stage binary classification makes the two-stage classifier effective for the multi-class classification problem of DDIE. Although the two-stage model does not directly address the problem of data skew, the bias between categories is eased across the two stages of classification.
3.6. Multilayer BLSTM and dropout
Table 1 shows that the computational cost of multilayer BLSTM is the highest, so to reduce the computational cost as much as possible, we conduct the BLSTM experiments last. Table 6 shows that the two-layer BLSTM achieves the highest F-score (69.0%) among all the experiments, while three layers achieve an F-score of 68.1%. Again, the results illustrate that more computational cost and more layers do not yield higher F-scores. We also find that dropout decreases the F-score (to 68.1%), as expected; we think dropout is good for image processing tasks but not applicable to DDIE.
Table 6
Comparison of multilayer BLSTM and dropout.
INT MEC EFF ADV MAVG
BLSTM2 + POS+Word+Distance+Two-stage P 78.7 75.2 70.1 77.2 75.3
R 42.1 72.4 74.0 66.6 63.7
F 54.9 73.8 72.0 71.5 69.0
BLSTM2 + POS+Word+Distance+Two-stage+dropout P 78.6 77.5 68.4 75.4 74.8
R 38.4 70.2 75.4 66.7 62.5
F 51.6 73.7 71.7 70.8 68.1
BLSTM3 + POS+Word+Distance+Two-stage P 75.6 77.9 68.0 76.5 74.5
R 40.0 69.6 76.1 65.4 62.7
F 52.3 73.5 71.8 70.5 68.1
Table 7
Comparison with other works.

Method                      Classifier       F-score
Rastegar (2013) [20]        SVM              47.2
Björne (2013) [5]           SVM              58.7
Bokharaeian (2013) [4]      SVM              53.5
Chowdhury (2013) [2]        SVM              64.8
Thomas (2013) [3]           SVM etc.         59.7
Kim (2015) [6]              SVM              67.0
Bobic (2013) [21]           SVM etc.         44.8
Sanchez (2013) [22]         SVM              53.4
Hailu (2013) [23]           SVM              40.7
Zhao (2016) [7]             two-stage CNN    68.6
Our method                  SVM + LSTM       69.0
3.7. Comparison with other works
Table 7 lists all the works on DDIExtraction 2013 of which we are aware. We can see that deep-learning-based methods achieve higher F-scores. Most of the previous works are based on features and kernels with an SVM classifier. Zhao et al. proposed a two-stage method using a CNN which achieved an F-score of 68.6%, and our LSTM-based model achieves 69.0%; both are higher than the previous methods. Therefore, deep learning techniques can help improve the performance of DDIE.
3.8. Discussion
In this section, we have conducted a group of experiments to evaluate the performance of the proposed method. From Tables 2 to 7, we conclude that the two-stage strategy generally performs better than the one-stage strategy, and that the two-layer bidirectional LSTM achieves the highest F-score when combined with word embedding, distance embedding and POS embedding. The reasons behind these phenomena are probably as follows. First, the DDIE data suffer from class bias, and the two-stage methods can deal with the class bias better than the one-stage ones. Second, taking only the word sequence as input is not enough to represent a DDI instance, while the distance embedding and POS embedding enrich the input to obtain better representations of DDI instances. Third, although humans usually read text from beginning to end, for an LSTM, reading backward also helps to comprehend the meaning, and therefore BLSTM can extract DDIs more accurately.
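The enriched input representation just described amounts to concatenating, for each token, its word embedding with two distance embeddings (offsets to the two candidate drugs) and a POS embedding. A minimal sketch with tiny hypothetical lookup tables (not the trained embeddings from the paper):

```python
# Schematic of the per-token input to the LSTM. All lookup tables here
# are hypothetical toy values chosen only to show the concatenation.
word_emb = {"aspirin": [0.1, 0.2], "inhibits": [0.3, 0.1], "warfarin": [0.2, 0.4]}
dist_emb = {-1: [0.5], 0: [0.0], 1: [0.9]}   # indexed by relative distance to a drug
pos_emb  = {"NN": [1.0, 0.0], "VBZ": [0.0, 1.0]}

def token_vector(word, d1, d2, pos):
    """Concatenate word, two distance, and POS embeddings into one input vector."""
    return word_emb[word] + dist_emb[d1] + dist_emb[d2] + pos_emb[pos]

v = token_vector("inhibits", -1, 1, "VBZ")
print(v)       # -> [0.3, 0.1, 0.5, 0.9, 0.0, 1.0]
print(len(v))  # 2 + 1 + 1 + 2 = 6 dimensions per token
```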
4. Conclusion
In this paper, we propose a two-stage method to extract DDIs from the literature, obtaining an F-score of 69.0%. In the first stage, we use a feature-based classifier to identify positive DDIs, and in the second stage we use an LSTM-based classifier to assign each positive DDI to one of the four DDI categories. Considering their importance, implementation cost and computational cost, we compare 10 different tuning factors that possibly influence the performance of the LSTM in the experiments, and we find that the best combination of tuning factors is: two-layer bidirectional LSTM, word embedding, distance embedding, POS embedding and two-stage classification. In our opinion, there is still room for further improvement of DDIE, such as special treatment of the "int" category and reducing the gap between precision and recall, which we will focus on in future work.
Acknowledgements
The authors gratefully acknowledge the financial support provided by the National Natural Science Foundation of China under Grants No. 61672126 and 61672127, and by the Natural Science Foundation of Liaoning Province under Grant No. 2015020059. The Titan X GPU used for this research was donated by the NVIDIA Corporation.
References
[1] I. Segura-Bedmar, P. Martínez, M. Herrero-Zazo, SemEval-2013 Task 9: extraction of drug-drug interactions from biomedical texts, in: Proceedings of the 7th International Workshop on Semantic Evaluation, 2013.
[2] M.F.M. Chowdhury, A. Lavelli, FBK-irst: a multi-phase kernel based approach for drug-drug interaction detection and classification that exploits linguistic information, in: Seventh International Workshop on Semantic Evaluation, 2013, pp. 351–355.
[3] P. Thomas, M. Neves, T. Rocktäschel, U. Leser, WBI-DDI: drug-drug interaction extraction using majority voting, in: Seventh International Workshop on Semantic Evaluation, 2013, pp. 628–635.
[4] B. Bokharaeian, A. Diaz, NIL UCM: extracting drug-drug interactions from text through combination of sequence and tree kernels, in: Seventh International Workshop on Semantic Evaluation, 2013, pp. 644–650.
[5] J. Björne, S. Kaewphan, T. Salakoski, UTurku: drug named entity recognition and drug-drug interaction extraction using SVM classification and domain knowledge, in: Seventh International Workshop on Semantic Evaluation, 2013, pp. 651–659.
[6] S. Kim, H. Liu, L. Yeganova, W.J. Wilbur, Extracting drug-drug interactions from literature using a rich feature-based linear kernel approach, J. Biomed. Inform. 55 (2015) 23–30.
[7] Z. Zhao, Z. Yang, L. Luo, H. Lin, J. Wang, Drug drug interaction extraction from biomedical literature using syntax convolutional neural network, Bioinformatics 32 (22) (2016) 3444–3453.
[8] D. Zeng, K. Liu, S. Lai, G. Zhou, J. Zhao, Relation classification via convolutional deep neural network, in: COLING, 2014, pp. 2335–2344.
[9] S.K. Sahu, A. Anand, Drug-drug interaction extraction from biomedical text using long short term memory network, arXiv:1701.08303 (2017).
[10] Y. Bengio, P.Y. Simard, P. Frasconi, Learning long-term dependencies with gradient descent is difficult, IEEE Trans. Neural Networks 5 (2) (1994) 157–166.
[11] J.P.C. Chiu, E. Nichols, Named entity recognition with bidirectional LSTM-CNNs, arXiv:1511.08308 (2015).
[12] I. Sutskever, O. Vinyals, Q.V. Le, Sequence to sequence learning with neural networks, Adv. Neural Inf. Process. Syst. (2014) 3104–3112.
[13] Z. Jiang, L. Li, D. Huang, L. Jin, Training word embeddings for deep learning in biomedical text mining tasks, in: 2015 IEEE International Conference on Bioinformatics and Biomedicine, 2015, pp. 625–628.
[14] Q.C. Bui, P.M.A. Sloot, E.M. van Mulligen, J.A. Kors, A novel feature-based approach to extract drug-drug interactions from biomedical text, Bioinformatics 30 (23) (2014) 3365–3371.
[15] K. Sagae, J. Tsujii, Dependency parsing and domain adaptation with LR models and parser ensembles, in: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), Prague, Czech Republic, 2007, pp. 1044–1050.
[16] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, J. Mach. Learn. Res. 15 (1) (2014) 1929–1958.
[17] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (8) (1997) 1735–1780.
[18] M. Schuster, K.K. Paliwal, Bidirectional recurrent neural networks, IEEE Trans. Signal Process. 45 (11) (1997) 2673–2681.
[19] R.K. Srivastava, K. Greff, J. Schmidhuber, Highway networks, arXiv:1505.00387 (2015).
[20] M. Rastegar-Mojarad, R.D. Boyce, R. Prasad, UWM-TRIADS: classifying drug-drug interactions with two-stage SVM and post-processing, in: Seventh International Workshop on Semantic Evaluation, 2013, pp. 667–674.
[21] T. Bobic, J. Fluck, M. Hofmann-Apitius, SCAI: extracting drug-drug interactions using a rich feature vector, in: Seventh International Workshop on Semantic Evaluation, 2013, pp. 675–683.
[22] D. Sánchez-Cisneros, F. Aparicio Gali, UEM-UC3M: an ontology-based named entity recognition system for biomedical texts, in: Seventh International Workshop on Semantic Evaluation, 2013, pp. 622–627.
[23] N.D. Hailu, L.E. Hunter, K.B. Cohen, UColorado SOM: extraction of drug-drug interactions from biomedical text using knowledge-rich and knowledge-poor features, in: Seventh International Workshop on Semantic Evaluation, 2013, pp. 684–688.