
Incorporating dictionaries into deep neural networks for the Chinese clinical named entity recognition

Qi Wang a, Yangming Zhou a,∗, Tong Ruan a, Daqi Gao a, Yuhang Xia a, Ping He b

a School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China

b Shanghai Hospital Development Center, Shanghai 200040, China

Abstract

Clinical named entity recognition aims to identify and classify clinical terms such as diseases, symptoms, treatments, exams, and body parts in electronic health records, which is a fundamental and crucial task for clinical and translational research. In recent years, deep neural networks have achieved significant success in named entity recognition and many other natural language processing tasks. Most of these algorithms are trained end to end, and can automatically learn features from large-scale labeled datasets. However, these data-driven methods typically lack the capability of processing rare or unseen entities. Previous statistical methods and feature engineering practice have demonstrated that human knowledge can provide valuable information for handling rare and unseen cases. In this paper, we propose a new model which combines data-driven deep learning approaches and knowledge-driven dictionary approaches. Specifically, we incorporate dictionaries into deep neural networks. In addition, two different architectures that extend the bi-directional long short-term memory neural network and five different feature representation schemes are proposed to handle the task. Computational results on the CCKS-2017 Task 2 benchmark dataset show that the proposed method achieves highly competitive performance compared with state-of-the-art deep learning methods.

Key words: Clinical named entity recognition; electronic health records; deep neural network; dictionary features.

∗ Corresponding author.

Preprint submitted to Elsevier 22 February 2019


1 Introduction

Clinical named entity recognition (CNER) is a critical task for extracting patient information from electronic health records (EHRs) to support clinical and translational research. The main aim of CNER is to identify and classify clinical terms in EHRs, such as diseases, symptoms, treatments, exams, and body parts. Identifying these important clinical concepts is a fundamental step for many clinical Natural Language Processing (NLP) systems. Compared to the generic domain, NER in the medical domain is more difficult. On the one hand, non-standard abbreviations or acronyms and multiple variations of the same entities are commonly used in clinical texts. On the other hand, clinical notes are noisier, more grammatically error-prone and contain less context due to shorter and incomplete sentences [1]. Moreover, building such a CNER system is not an easy task because of the richness of EHRs, and CNER in Chinese texts is more difficult than in Romance languages due to the lack of word boundaries in Chinese and the complexity of Chinese composition forms [2].

CNER has attracted considerable research effort in the last decade and a number of methods have been proposed. These existing methods can be roughly divided into two main categories: knowledge-driven approaches and data-driven approaches. Knowledge-driven approaches rely on handcrafted rules and/or man-made dictionaries, which usually leads to high system engineering cost and low recall, so the state-of-the-art CNER approaches are data-driven deep learning approaches, i.e., Bi-LSTM-CRF methods [3,4]. Despite the great success achieved by deep learning methods, some issues have still not been well solved. One significant drawback is that such methods rarely take the integration of human knowledge into account. Deep neural networks usually employ an end-to-end approach and try to directly learn features from large-scale labeled data. However, a huge number of entities occur rarely or not at all in the training set, for several reasons, including the use of non-standard abbreviations or acronyms, multiple variations of the same entities, etc. Thus, data-driven deep learning methods usually cannot handle such cases well, especially for the CNER task. This is because data-driven approaches aim to learn patterns from the characters/words themselves and their contexts, such that phrases ending with “术” (operation) indicate operations and phrases after “服用” (take) indicate medicines. However, different from generic named entity recognition, there are also cases in the medical domain where no pattern can be derived from the characters/words themselves and their contexts, because the definitions of named entities require additional knowledge, such as the differences between diseases and symptoms (both belong to clinical findings) and the differences between traditional Chinese medicines and modern Western medicines (both belong to medicines). In contrast, dictionaries contain both commonly used entities and rare entities. If we can incorporate a dictionary into a deep neural network, rare and unseen clinical named entities can be better processed.

In this paper, we propose a novel method for Chinese CNER. We extend the Bi-directional Long Short-Term Memory and Conditional Random Field (Bi-LSTM-CRF) model [5] to treat the CNER task as a character-level sequence labeling problem. To integrate dictionaries, we design five different schemes to construct feature vectors for each Chinese character based on dictionaries and contexts. Also, two different architectures are introduced to integrate the feature vectors with character embeddings to perform the task. Finally, our proposed approach is extensively evaluated on the CCKS-2017 1 benchmark dataset.

The main contributions of this paper can be summarized as follows:

• To the best of our knowledge, this is the first time that knowledge-driven dictionary methods and data-driven deep learning methods are combined for named entity recognition tasks. We design two architectures and five feature representation schemes to integrate information extracted from dictionaries into deep neural networks.

• We assess the performance of the proposed approaches on the CCKS-2017 Task 2 benchmark dataset. The computational results indicate that our proposed approach performs remarkably well compared to state-of-the-art methods. By incorporating dictionaries into our model, it outperforms the basic Bi-LSTM-CRF model for processing rare and unseen entities.

The rest of this paper is organized as follows. In the next section, we briefly review related work on clinical named entity recognition. Then we introduce the basic Bi-LSTM-CRF model in Section 3. In Section 4, we present our proposed approaches. Section 5 is dedicated to experimental studies. We experimentally investigate several important issues of the proposed model in Section 6. Finally, conclusions are provided in Section 7.

1 CCKS: China Conference on Knowledge Graph and Semantic Computing, 2017; website: http://www.ccks2017.com/


2 Related work

Due to its practical significance, Clinical Named Entity Recognition (CNER) has attracted considerable research effort, and a great number of solution approaches have been proposed in the literature. Generally, the existing approaches fall into four categories: rule-based approaches, dictionary-based approaches, statistical machine learning approaches, and deep learning approaches, with the last category recently receiving the most attention in the CNER community.

Rule-based approaches rely on heuristics and handcrafted rules to identify entities. They were the dominant approaches in early CNER systems [6,7] and some recent work [8,9]. However, it is difficult to list all the rules needed to model the structure of clinical named entities, and this kind of handcrafted approach always leads to a relatively high system engineering cost.

Dictionary-based approaches rely on existing clinical vocabularies to identify entities [10–12]. They have been widely used because of their simplicity and their performance. A dictionary-based CNER system can extract all the matched entities defined in a dictionary from a given clinical text. However, it cannot deal with entities which are not included in the dictionary, which usually causes low recall.

Statistical machine learning approaches consider CNER as a sequence labeling problem where the goal is to find the best label sequence for a given input sentence [13,14]. Typical methods are Hidden Markov Models (HMMs) [15,12], Maximum Entropy Markov Models (MEMMs) [16,17], Conditional Random Fields (CRFs) [18–20], and Support Vector Machines (SVMs) [21,22]. However, these statistical methods rely on pre-defined features, which makes their development costly. What is more, feature engineering, i.e., finding the best set of features to discern entities of a specific type from others, is more of an art than a science, incurring extensive trial-and-error experiments.

Recently, deep learning approaches [23], especially methods based on a Bidirectional Recurrent Neural Network (RNN) with a CRF as the output interface (Bi-RNN-CRF) [5], have achieved state-of-the-art performance in CNER tasks and outperform traditional statistical models [3,4,24]. RNNs with gated recurrent cells, such as Long Short-Term Memory (LSTM) [25] and Gated Recurrent Units (GRU) [26], are capable of capturing long dependencies and retrieving rich global information. The sequential CRF on top of the recurrent layers ensures that the optimal sequence of tags over the entire sentence is obtained.


3 Bi-LSTM-CRF model

The Chinese clinical named entity recognition task is usually regarded as a sequence labeling task. Due to the ambiguity in the boundaries of Chinese words, following our previous work [27], we label the sequence at the character level to avoid introducing noise caused by segmentation errors. Thus, given a clinical sentence X = <x1, ..., xn>, our goal is to label each character xi in the sentence X with the BIEOS (Begin, Inside, End, Outside, Single) tag scheme, i.e., to obtain a tag sequence Y = <y1, ..., yn>. An example of the tag sequence for “腹平坦,未见腹壁静脉曲张。” (The abdomen is flat and no varicose veins can be seen on the abdominal wall) can be found in the first two lines of Table 1. Through the tag sequence, the clinical named entities can be decoded, as seen in the last line of Table 1.

Table 1
An illustrative example of the tag sequence and features.

Character sequence: 腹 平 坦 , 未 见 腹 壁 静 脉 曲 张 。
Tag sequence:       S-b O O O O O B-b E-b B-s I-s I-s E-s O
PIET features:      b None None None None None b b s s s s None
PDET features:      S-b None None None None None B-b E-b B-s I-s I-s E-s None
Entity type:        body (腹), body (腹壁), symptom (静脉曲张)

Note: The B-tag indicates the beginning of an entity, the I-tag the inside of an entity, the E-tag the end of an entity, the O-tag a character outside any entity, and the S-tag a single-character entity. As for entity types, the b-tag indicates that the entity is a body part, and the s-tag indicates that the entity is a symptom.
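To make the BIEOS scheme concrete, the following minimal Python sketch (not the authors' code) decodes entities from a character sequence and its tag sequence, reproducing the last row of Table 1. Tags are assumed to look like "B-b", "I-s", "E-s", "S-b" or "O".

def decode_bieos(chars, tags):
    """Return a list of (entity_text, entity_type, start, end) tuples."""
    entities, start, etype = [], None, None
    for i, tag in enumerate(tags):
        if tag == "O":
            start, etype = None, None
        elif tag.startswith("S-"):                    # single-character entity
            entities.append((chars[i], tag[2:], i, i))
            start, etype = None, None
        elif tag.startswith("B-"):                    # entity begins
            start, etype = i, tag[2:]
        elif tag.startswith("E-") and start is not None and tag[2:] == etype:
            entities.append(("".join(chars[start:i + 1]), etype, start, i))
            start, etype = None, None
        # "I-" tags simply extend the current entity
    return entities

chars = list("腹平坦,未见腹壁静脉曲张。")
tags = ["S-b", "O", "O", "O", "O", "O", "B-b", "E-b",
        "B-s", "I-s", "I-s", "E-s", "O"]
print(decode_bieos(chars, tags))
# [('腹', 'b', 0, 0), ('腹壁', 'b', 6, 7), ('静脉曲张', 's', 8, 11)]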

In this section, we give a brief description of the general Bi-LSTM-CRF architecture for Chinese CNER. The Bi-LSTM-CRF model was originally proposed by Huang et al. [5]; its main architecture is illustrated in Fig. 1. Unlike Huang et al. [5], we employ character embeddings rather than word embeddings to deal with the ambiguity in the boundaries of Chinese words.


Fig. 1. Main architecture of the Bi-LSTM-CRF model.


3.1 Embedding layer

Given a clinical sentence X = <x1, ..., xn>, which is a sequence of n characters, the first step is to map the discrete language symbols to distributed embedding vectors. Formally, we look up the embedding vector e_i ∈ R^{d_e} from the embedding matrix for each character x_i, where i ∈ {1, 2, ..., n} indicates that x_i is the i-th character in X, and d_e is a hyper-parameter indicating the size of the character embeddings.
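A minimal PyTorch sketch of this lookup step follows (illustrative only; the toy vocabulary and hyper-parameters are assumptions, not the authors' artifacts).

import torch
import torch.nn as nn

char2id = {"<pad>": 0, "<unk>": 1, "腹": 2, "平": 3, "坦": 4}   # toy vocabulary
d_e = 128                                                      # character embedding size
embedding = nn.Embedding(num_embeddings=len(char2id), embedding_dim=d_e)

sentence = ["腹", "平", "坦"]
ids = torch.tensor([[char2id.get(c, char2id["<unk>"]) for c in sentence]])
e = embedding(ids)          # shape: (batch=1, n=3, d_e=128), one e_i per character
print(e.shape)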

3.2 Bi-LSTM layer

The Long Short-Term Memory (LSTM) network [25] is a variant of the Recurrent Neural Network (RNN), which incorporates a gated memory cell to capture long-range dependencies within the data and is able to avoid the gradient vanishing/exploding problems of standard RNNs.

For each position t, the LSTM computes h_t from the input e_t and the previous state h_{t-1} as:

i_t = σ(W_i e_t + U_i h_{t-1} + b_i)    (1)
f_t = σ(W_f e_t + U_f h_{t-1} + b_f)    (2)
c̃_t = tanh(W_c e_t + U_c h_{t-1} + b_c)    (3)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t    (4)
o_t = σ(W_o e_t + U_o h_{t-1} + b_o)    (5)
h_t = o_t ⊙ tanh(c_t)    (6)

where h_t, i_t, f_t, o_t ∈ R^{d_h} are the d_h-dimensional hidden state (also called the output vector), input gate, forget gate and output gate, respectively; W_i, W_f, W_c, W_o ∈ R^{d_h×d_e}, U_i, U_f, U_c, U_o ∈ R^{d_h×d_h} and b_i, b_f, b_c, b_o ∈ R^{d_h} are the parameters of the LSTM; σ is the sigmoid function, and ⊙ denotes the element-wise product.

However, the hidden state h_t of the LSTM only takes information from the past, not considering future information. One solution is to use a Bidirectional LSTM (Bi-LSTM) [28], which incorporates information from both past and future. Formally, for any given sequence, the network computes both a left representation →h_t and a right representation ←h_t of the sequence context at every input e_t. The final representation is created by concatenating them:

h_t = →h_t ⊕ ←h_t    (7)

The Bi-LSTM, along with the embedding layer, is the main machinery responsible for learning a good feature representation of the data.
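In PyTorch, the forward/backward concatenation of Eq. (7) is handled by a single bidirectional LSTM module, as in the short sketch below (sizes follow Table 4 but are otherwise assumptions).

import torch
import torch.nn as nn

d_e, d_h = 128, 256
bilstm = nn.LSTM(input_size=d_e, hidden_size=d_h, num_layers=1,
                 batch_first=True, bidirectional=True)

e = torch.randn(1, 13, d_e)        # a batch with one 13-character sentence
h, _ = bilstm(e)                   # h[:, t, :] = forward h_t ⊕ backward h_t
print(h.shape)                     # (1, 13, 2 * d_h)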

3.3 CRF layer

For the character-based Chinese CNER task, it is beneficial to consider the dependencies of adjacent tags. For example, a B (begin) tag should be followed by an I (inside) tag or an E (end) tag, and an I tag cannot be followed by a B tag or an S (single) tag. Therefore, instead of making tagging decisions using h_i independently, we employ a Conditional Random Field (CRF) [29] to model the tag sequence jointly.

Generally, the CRF layer is represented by lines which connect consecutive output layers, and has a state transition matrix as its parameters. With such a layer, we can efficiently use past and future tags to predict the current tag, which is similar to the use of past and future input features via a Bi-LSTM network. We consider the matrix of scores f_θ([x]_1^T) as the output of the Bi-LSTM network. The element [f_θ]_{i,t} of the matrix is the score output by the network with parameters θ, for the sentence [x]_1^T and for the i-th tag, at the t-th character. We introduce a transition score [A]_{i,j} to model the transition from the i-th state to the j-th state for a pair of consecutive time steps. Note that this transition matrix is position independent. We now denote the new parameters of the whole network as θ̃ = θ ∪ {[A]_{i,j} ∀i, j}. The score of [x]_1^T along a path of tags [i]_1^T is then given by the sum of transition scores and Bi-LSTM network scores:

S([x]_1^T, [i]_1^T, θ̃) = Σ_{t=1}^{T} ( [A]_{[i]_{t-1},[i]_t} + [f_θ]_{[i]_t,t} )    (8)

The conditional probability p([y]_1^T | [x]_1^T, θ̃) is calculated with a softmax function:

p([y]_1^T | [x]_1^T, θ̃) = exp( S([x]_1^T, [y]_1^T, θ̃) ) / Σ_{[j]_1^T} exp( S([x]_1^T, [j]_1^T, θ̃) )    (9)

where [y]_1^T is the true tag sequence and [j]_1^T ranges over all possible output tag sequences.

We use maximum conditional likelihood estimation to train the model:

log p([y]_1^T | [x]_1^T, θ̃) = S([x]_1^T, [y]_1^T, θ̃) − log Σ_{[j]_1^T} exp( S([x]_1^T, [j]_1^T, θ̃) )    (10)


The Viterbi algorithm [30], which is a dynamic programming method, can be used to efficiently compute the optimal tag sequence for inference.
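The following minimal NumPy sketch (not the authors' code) shows Viterbi decoding over the scores defined in Eq. (8): per-character emission scores from the Bi-LSTM plus the position-independent transition matrix A. The tag-set size of 21 is an assumption for illustration.

import numpy as np

def viterbi_decode(emissions, transitions):
    """emissions: (T, K) scores; transitions: (K, K) with A[i, j] = score of i -> j."""
    T, K = emissions.shape
    score = emissions[0].copy()                # best score of a path ending in each tag
    backpointers = []
    for t in range(1, T):
        # candidate[i, j] = best path ending in tag i at t-1, then moving to tag j at t
        candidate = score[:, None] + transitions + emissions[t][None, :]
        backpointers.append(candidate.argmax(axis=0))
        score = candidate.max(axis=0)
    best_tag = int(score.argmax())
    path = [best_tag]
    for bp in reversed(backpointers):          # follow back-pointers to recover the path
        best_tag = int(bp[best_tag])
        path.append(best_tag)
    return list(reversed(path)), float(score.max())

emissions = np.random.randn(13, 21)            # 13 characters, 21 BIEOS tags (assumed)
transitions = np.random.randn(21, 21)
tags, best_score = viterbi_decode(emissions, transitions)
print(tags, best_score)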

4 Incorporating dictionaries for Chinese CNER

From the brief description given above, we can observe that the Bi-LSTM-CRF model can learn information from large-scale labeled data. However, it cannot process rare and unseen entities very well. Hence, in this work, inspired by the success of integrating dictionaries into CRF models for CNER [31,32], we consider integrating dictionaries into deep neural networks.

For a given sentence X = <x1, ..., xn>, we first construct a feature vector d_i for each character x_i based on the dictionary D and the context. We propose five feature representation schemes for d_i to represent whether the character segments that consist of character x_i and its surroundings are clinical named entities or not. After that, we propose two architectures to integrate the feature vector d_i into the Bi-LSTM-CRF model. We detail the feature vector construction and the integration architectures in the following subsections.

4.1 Feature vector construction

The dictionary D is constructed at the entity level, but the sequence is labeled at the character level. Since an entity may have several characters, how to represent the entity features at the character level is challenging. In this section, we propose five different schemes to represent dictionary features. These schemes can be further classified into three categories: n-gram features, Position-Independent Entity Type (PIET) features and Position-Dependent Entity Type (PDET) features.

4.1.1 N-gram feature

Given a sentence X and an external dictionary D, we construct text segments based on the context of x_i using pre-defined n-gram feature templates. The templates used in our work are listed in Table 2.

Table 2
N-gram feature templates for the i-th character, which are used to generate the feature vector d_i.

Type      Template
2-gram    x_{i-1} x_i,  x_i x_{i+1}
3-gram    x_{i-2} x_{i-1} x_i,  x_i x_{i+1} x_{i+2}
4-gram    x_{i-3} x_{i-2} x_{i-1} x_i,  x_i x_{i+1} x_{i+2} x_{i+3}
5-gram    x_{i-4} x_{i-3} x_{i-2} x_{i-1} x_i,  x_i x_{i+1} x_{i+2} x_{i+3} x_{i+4}

For a text segment that appears in an n-gram feature template, we generate a binary value indicating whether the text segment is a clinical named entity in D or not. Here we use t_{i,j,k} to represent the binary value corresponding to the k-th entity type in the j-th n-gram template for x_i. For a dictionary D with five types of clinical named entities, we finally generate a 40-dimensional feature vector containing entity type and boundary information for x_i. Fig. 2 illustrates an example of n-gram feature vector construction.

Fig. 2. Example of n-gram feature vector construction for the sentence “腹平坦,未见腹壁静脉曲张。” (The abdomen is flat and no varicose veins can be seen on the abdominal wall). The character with the red shadow is the character x_i. The character segment with the solid rectangle is a body part in the dictionary D. Here d, s, t, e, b denote disease, symptom, treatment, exam, and body part, respectively.
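Under our reading of Section 4.1.1, the 40-dimensional feature is 8 templates (Table 2) × 5 entity types. The sketch below illustrates this construction; the toy dictionary is an assumption for illustration only.

ENTITY_TYPES = ["d", "s", "t", "e", "b"]          # disease, symptom, treatment, exam, body part
TEMPLATES = [(-1, 0), (0, 1),                     # 2-gram
             (-2, 0), (0, 2),                     # 3-gram
             (-3, 0), (0, 3),                     # 4-gram
             (-4, 0), (0, 4)]                     # 5-gram  (character offsets around position i)

def ngram_features(sentence, dictionary):
    """dictionary maps an entity string to its type code; returns an n x 40 list."""
    feats = []
    for i in range(len(sentence)):
        vec = []
        for left, right in TEMPLATES:
            lo, hi = i + left, i + right + 1
            segment = sentence[lo:hi] if lo >= 0 and hi <= len(sentence) else ""
            seg_type = dictionary.get(segment)
            vec.extend(1 if seg_type == t else 0 for t in ENTITY_TYPES)
        feats.append(vec)                         # len(vec) == 8 * 5 == 40
    return feats

toy_dict = {"腹壁": "b", "瞳孔": "b", "静脉曲张": "s"}
f = ngram_features("腹平坦,未见腹壁静脉曲张。", toy_dict)
print(len(f), len(f[0]))                          # 13 40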

4.1.2 Position-Independent Entity Type (PIET) feature

Given a sentence X and an external dictionary D, we first use the classic Bi-Directional Maximum Matching (BDMM) algorithm [33] to segment X. The pseudo-code of the BDMM algorithm is provided in Algorithm 1. Then each character x_i is labeled with the type of the entity to which x_i belongs, as shown in the third line of Table 1. The feature can further be represented in the format of one-hot encoding or feature embedding.


Algorithm 1: The pseudo-code of the BDMM segmentation method
Input: A clinical sentence X, a dictionary D with different types of clinical named entities, and the maximum length maxLen of the entities in D
Output: The entity list Y after segmentation
1  begin
2      Y1 ← ∅
3      // Direction 1: from the beginning to the end
4      while X is not empty do
5          cut a string S of size maxLen from X
6          make a match between S and each entity in D
7          if there is an entity that matches S then
8              split out S from X
9              add S with its type into Y1
10         else
11             remove a character from the tail of S
12             add the character back to X
13             make a match between S and each entity in D
14             if a match is found then
15                 go to line 7
16             if |S| = 1 then
17                 split out S from X
18                 add S with type “None” into Y1
19             else
20                 repeat lines 11–17
21     Y2 ← ∅
22     // Direction 2: from the end to the beginning
23     build Y2 in the same way as Y1
24     if |Y1| < |Y2| then
25         Y ← Y1
26     else
27         Y ← Y2
28     return the entity list Y after segmentation
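A runnable Python sketch in the spirit of Algorithm 1 follows (not the authors' implementation). The dictionary maps entity strings to entity types; non-entity characters are emitted with type None, and the direction producing fewer segments is preferred.

def max_match(text, dictionary, max_len, reverse=False):
    segments, s, i = [], text[::-1] if reverse else text, 0
    while i < len(s):
        for length in range(min(max_len, len(s) - i), 0, -1):
            piece = s[i:i + length]
            key = piece[::-1] if reverse else piece      # restore forward orientation
            if key in dictionary or length == 1:
                segments.append((key, dictionary.get(key)))
                i += length
                break
    return segments[::-1] if reverse else segments

def bdmm_segment(text, dictionary):
    max_len = max(map(len, dictionary)) if dictionary else 1
    forward = max_match(text, dictionary, max_len)
    backward = max_match(text, dictionary, max_len, reverse=True)
    return forward if len(forward) < len(backward) else backward

toy_dict = {"腹壁": "b", "静脉曲张": "s", "瞳孔": "b", "腹": "b"}
print(bdmm_segment("腹平坦,未见腹壁静脉曲张。", toy_dict))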

4.1.3 Position-Dependent Entity Type (PDET) feature

The PIET feature only considers the type of the entity to which a character belongs. Different from the PIET feature, the PDET feature also takes the position of a character in an entity into account: if the character is a single-character entity, we add a flag “S” before the PIET feature. Otherwise, for the first character of an entity, we add a flag “B” before the PIET feature; for the last character of an entity, we add a flag “E” before the PIET feature; and for the middle character(s) of an entity, we add a flag “I” before the PIET feature.


An example is shown in the fourth line of Table 1. Similar to the PIET feature, the PDET feature can also be represented in the format of one-hot encoding or feature embedding.

To some degree, the feature vector, whether an n-gram feature or an entity type feature, represents the candidate labels of a character based on the given dictionary. The values in the feature vector depend only on the context and the dictionary; they are not affected by other sentences or by statistical information. Hence, feature vectors can provide information quite different from that captured by data-driven methods.
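A minimal sketch of turning BDMM segmentation output into character-level PIET and PDET labels (third and fourth rows of Table 1) is shown below; `segments` is assumed to be a list of (text, entity_type_or_None) pairs covering the sentence.

def piet_pdet_labels(segments):
    piet, pdet = [], []
    for text, etype in segments:
        if etype is None:
            piet.extend(["None"] * len(text))
            pdet.extend(["None"] * len(text))
        elif len(text) == 1:
            piet.append(etype)
            pdet.append("S-" + etype)
        else:
            piet.extend([etype] * len(text))
            pdet.extend(["B-" + etype]
                        + ["I-" + etype] * (len(text) - 2)
                        + ["E-" + etype])
    return piet, pdet

segments = [("腹", "b"), ("平", None), ("坦", None), (",", None), ("未", None),
            ("见", None), ("腹壁", "b"), ("静脉曲张", "s"), ("。", None)]
piet, pdet = piet_pdet_labels(segments)
print(piet)   # ['b', 'None', ..., 'b', 'b', 's', 's', 's', 's', 'None']
print(pdet)   # ['S-b', 'None', ..., 'B-b', 'E-b', 'B-s', 'I-s', 'I-s', 'E-s', 'None']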

4.2 Integration architecture

Through the construction steps of the dictionary feature vectors, given a sentence X, we obtain both a character embedding e_i and a feature vector d_i for each character x_i. As described above, the original Bi-LSTM-CRF model only takes e_i as input. Since dictionary features can provide valuable information for CNER, we try to integrate them with the original Bi-LSTM-CRF model. In this section, we introduce two different architectures to integrate feature vectors with character embeddings.

4.2.1 Model-I

The general architecture of the proposed model is illustrated in Fig. 3. The character embedding e_i and the feature vector d_i are first concatenated and then fed into a Bi-LSTM layer:

m_i = e_i ⊕ d_i    (11)

h_i = Bi-LSTM(→h_{i-1}, ←h_{i+1}, m_i)    (12)

where e_i denotes the embedding vector of x_i, and d_i represents the feature vector of x_i.

The other part of this model uses the same operation as the basic Bi-LSTM-CRF model.
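A condensed PyTorch sketch of our reading of Model-I (Fig. 3) follows; it is not the authors' code, and the CRF layer is abbreviated to an emission-score projection. Vocabulary sizes are assumptions (21 PDET labels = {S, B, I, E} × 5 types + None).

import torch
import torch.nn as nn

class ModelI(nn.Module):
    def __init__(self, n_chars, n_feats, n_tags, d_e=128, d_d=128, d_h=256):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, d_e)
        self.feat_emb = nn.Embedding(n_feats, d_d)       # PDET feature embedding
        self.bilstm = nn.LSTM(d_e + d_d, d_h, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * d_h, n_tags)           # emission scores for the CRF

    def forward(self, char_ids, feat_ids):
        m = torch.cat([self.char_emb(char_ids), self.feat_emb(feat_ids)], dim=-1)
        h, _ = self.bilstm(m)                            # Eq. (11)-(12)
        return self.proj(h)                              # (batch, seq_len, n_tags)

model = ModelI(n_chars=3000, n_feats=21, n_tags=21)
scores = model(torch.randint(0, 3000, (2, 13)), torch.randint(0, 21, (2, 13)))
print(scores.shape)                                      # torch.Size([2, 13, 21])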

Fig. 3. Main architecture of Model-I.

4.2.2 Model-II

The general architecture of the proposed model is illustrated in Fig. 4. The two parallel Bi-LSTMs extract context information and potential entity information, respectively. For a sentence X, the hidden states of the two parallel Bi-LSTMs at position i are defined as:

h_i^x = Bi-LSTM(→h_{i-1}^x, ←h_{i+1}^x, e_i)    (13)

h_i^d = Bi-LSTM(→h_{i-1}^d, ←h_{i+1}^d, d_i)    (14)

where e_i denotes the embedding vector of x_i, and d_i represents the feature vector of x_i. Note that in our formulation, the two parallel Bi-LSTMs are independent, without any shared parameters.

Finally, we concatenate the two hidden states of the parallel Bi-LSTMs as the input of the CRF layer:

h_i = h_i^x ⊕ h_i^d    (15)

The other part of this model is the same as the basic Bi-LSTM-CRF model.
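For comparison with the Model-I sketch above, a condensed PyTorch reading of Model-II (Fig. 4) is given below; again the CRF is abbreviated to an emission projection and sizes are assumptions.

import torch
import torch.nn as nn

class ModelII(nn.Module):
    def __init__(self, n_chars, n_feats, n_tags, d_e=128, d_d=128, d_h=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, d_e)
        self.feat_emb = nn.Embedding(n_feats, d_d)
        self.char_lstm = nn.LSTM(d_e, d_h, batch_first=True, bidirectional=True)
        self.feat_lstm = nn.LSTM(d_d, d_h, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(4 * d_h, n_tags)           # 2*d_h from each Bi-LSTM

    def forward(self, char_ids, feat_ids):
        hx, _ = self.char_lstm(self.char_emb(char_ids))  # context information, Eq. (13)
        hd, _ = self.feat_lstm(self.feat_emb(feat_ids))  # potential entity information, Eq. (14)
        return self.proj(torch.cat([hx, hd], dim=-1))    # Eq. (15), then emission scores

model = ModelII(n_chars=3000, n_feats=21, n_tags=21)
print(model(torch.randint(0, 3000, (2, 13)), torch.randint(0, 21, (2, 13))).shape)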

Fig. 4. Main architecture of Model-II.


5 Computational studies

The dictionary we exploit in the experiments is constructed according to the lists of charging items and drug information in Shanghai Shuguang Hospital as well as some medical literature such as 《人体解剖学名词(第二版)》 (Chinese Terms in Human Anatomy [Second Edition]).

We use the CCKS-2017 Task 2 benchmark dataset 2 to conduct our experiments. The 2017 China Conference on Knowledge Graph and Semantic Computing was held by the Chinese Information Processing Society of China (CIPS), and it is the best-known conference on knowledge graph and semantic computing in China. The dataset contains 1,596 annotated instances (10,024 sentences) with five types of clinical named entities, including diseases, symptoms, exams, treatments, and body parts. The annotated instances are already partitioned into 1,198 training instances (7,906 sentences) and 398 test instances (2,118 sentences). Each instance has one or several sentences. We further split these sentences into clauses by commas. The statistics of the different types of entities are listed in Table 3.

Following the standard practice for evaluating CNER tasks [3,4,34] and sequence labeling tasks [5,35,36], we use the widely-used measures of precision, recall, and F1-Measure [37] to evaluate the methods in the following experiments.
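A minimal sketch of the standard entity-level (exact-match) computation of these measures is shown below; the sets of gold and predicted spans are made-up placeholders.

def prf1(gold_entities, pred_entities):
    """Each argument is a set of (start, end, entity_type) tuples."""
    tp = len(gold_entities & pred_entities)
    precision = tp / len(pred_entities) if pred_entities else 0.0
    recall = tp / len(gold_entities) if gold_entities else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

gold = {(0, 0, "b"), (6, 7, "b"), (8, 11, "s")}
pred = {(0, 0, "b"), (8, 11, "s"), (4, 5, "e")}
print(prf1(gold, pred))      # (0.666..., 0.666..., 0.666...)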

Table 3
Statistics of different types of entities.

Type          training set    test set
disease       722             553
symptom       7,831           2,311
exam          9,546           3,143
treatment     1,048           465
body part     10,719          3,021
sum           29,866          9,493

2 It is publicly available at http://www.ccks2017.com/en/index.php/sharedtask/


5.1 Experimental settings

Parameter configuration may influence the performance of a deep neural network. The parameter configurations of the proposed approach are shown in Table 4. Note that, in order to avoid material impacts on the experimental results, the number of hidden units of each LSTM in Model-II is set to half of that in Model-I, so that the input size of the CRF layer at each time step in Model-II is the same as that in Model-I.

We initialize character embeddings and dictionary feature embeddings via word2vec [38] on both the training data and the test data. The dropout technique [39] is applied to the outputs of the Bi-LSTM layers in order to reduce overfitting. The models are trained with the Adam optimization algorithm [40], whose parameters are kept at the default settings.

Table 4
Parameter configurations of the proposed approach.

Parameter                                       Value
character embedding size                        d_e = 128
feature embedding size                          d_d = 128
number of LSTM hidden units in Model-I          d_h = 256
number of LSTM hidden units in Model-II         d_h^x = d_h^d = 128
dropout rate                                    p = 0.2
batch size                                      b = 128
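A hedged sketch of this setup follows: character embeddings pre-trained with word2vec (gensim 4.x API assumed), dropout on the Bi-LSTM outputs, and Adam with default hyper-parameters. The toy corpus is a placeholder, not the authors' data.

import torch
import torch.nn as nn
from gensim.models import Word2Vec

corpus = [list("腹平坦,未见腹壁静脉曲张。"), list("双侧瞳孔等大等圆。")]
w2v = Word2Vec(sentences=corpus, vector_size=128, window=5, min_count=1)

vocab = ["<pad>", "<unk>"] + list(w2v.wv.index_to_key)
weights = torch.zeros(len(vocab), 128)
for i, ch in enumerate(vocab[2:], start=2):
    weights[i] = torch.tensor(w2v.wv[ch])                        # copy pre-trained vectors

char_emb = nn.Embedding.from_pretrained(weights, freeze=False)   # fine-tuned during training
bilstm = nn.LSTM(128, 256, batch_first=True, bidirectional=True)
dropout = nn.Dropout(p=0.2)                                      # applied to Bi-LSTM outputs
params = list(char_emb.parameters()) + list(bilstm.parameters())
optimizer = torch.optim.Adam(params)                             # default Adam settings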

5.2 Comparative results of different combinations between feature representations and integration architectures

In this section, we compare the different combinations between the three feature representations and the two integration architectures. The comparative results are listed in Table 5.

Table 5
Comparative results of combinations between different feature representations and integration architectures.

                                       Model-I                            Model-II
                                       Precision  Recall  F1-Measure      Precision  Recall  F1-Measure
N-gram feature                         88.39      88.46   88.43           88.72      88.71   88.71
PIET feature (one-hot encoding)        89.53      90.58   90.05           89.38      90.49   89.93
PIET feature (feature embedding)       90.11      90.01   90.56           90.00      90.60   90.30
PDET feature (one-hot encoding)        90.51      91.04   90.77           90.22      90.64   90.43
PDET feature (feature embedding)       90.83      91.64   91.24           90.36      91.35   90.85

Table 5 displays the computational results of each combination. First of all, Model-I with PDET features achieves the best performance among all ten models, with 90.83% in Precision, 91.64% in Recall, and 91.24% in F1-Measure. Second, as for feature representations, n-gram features perform the worst compared to PIET features and PDET features, because n-gram features only consider the boundary information of potential entities, ignoring characters in the middle of entities. PDET features achieve the best results, because this type of feature indicates not only the potential type but also the potential boundary of clinical named entities in the dictionary. Furthermore, for both PIET features and PDET features, feature embedding gives better results than one-hot encoding, because the dense vector representation carries more information than one-hot encoding. Third, except when using n-gram features, Model-I performs better than Model-II in F1-Measure, with an improvement of 0.28% on average. This indicates that considering characters and their dictionary features together is better than considering them separately.

5.3 Compared with two base models

According to the comparative results in Section 5.2, the model combining Model-I and PDET features achieves the best performance. In this section, we compare this best model (i.e., Model-I with PDET features) with two base models, i.e., the BDMM algorithm (see Algorithm 1) and the basic Bi-LSTM-CRF model (see Section 3). The comparative results are summarized in Table 6.

Table 6
Comparative results between our best model and two base models.

Method               Precision    Recall    F1-Measure
BDMM algorithm       70.29        84.44     76.72
basic Bi-LSTM-CRF    88.22        88.53     88.38
our best model       90.83        91.64     91.24

From Table 6, we clearly observe that our best model performs better than the two base models in terms of all three metrics, which indicates the benefit of incorporating a dictionary into a Bi-LSTM-CRF model. The BDMM algorithm with dictionaries performs the worst among the three models, and its Precision is far below its Recall. One reason is that applying the BDMM algorithm directly may annotate clinical named entities with wrong boundaries. For example, “双侧瞳孔” (both pupils) is a body part in the clinical text, but it is not in the dictionary; the dictionary only contains the body part “瞳孔” (pupil), so the entity is falsely recognized as “瞳孔”. Another reason is type errors: the same entity can correspond to different entity types in different contexts. For example, “维生素 C” (vitamin C) is a drug name in the clause “维生素C注射2g” (inject with 2 grams of vitamin C), while it is an exam index in the clause “缺乏维生素 C” (lack of vitamin C). However, the BDMM algorithm can only map an entity to a single entity type.

To further analyze these results, we use a two-step statistical method to compare the three algorithms. First, we conduct an ANOVA test under the null hypothesis that all algorithms are equivalent. Since the null hypothesis is rejected, we proceed with a post-hoc Tukey HSD test to compare the methods in a pairwise way. At a significance level of 0.05, we observe that our proposed best model is significantly better than the BDMM algorithm in terms of Precision, and that it also significantly outperforms the basic Bi-LSTM-CRF model in terms of Precision. The same conclusions hold for the other two measures, i.e., Recall and F1-Measure.
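A hedged sketch of this two-step test using SciPy and statsmodels is given below; the per-run scores are made-up placeholders, not the paper's measurements, and the grouping of runs is an assumption.

import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

bdmm      = np.array([76.5, 76.8, 76.9, 76.6, 76.8])     # placeholder per-run F1 scores
bilstmcrf = np.array([88.2, 88.5, 88.3, 88.4, 88.5])
ours      = np.array([91.1, 91.3, 91.2, 91.3, 91.2])

# Step 1: one-way ANOVA, null hypothesis: all methods are equivalent.
f_stat, p_value = f_oneway(bdmm, bilstmcrf, ours)
print(f"ANOVA: F={f_stat:.2f}, p={p_value:.4g}")

# Step 2: post-hoc Tukey HSD pairwise comparison at significance level 0.05.
scores = np.concatenate([bdmm, bilstmcrf, ours])
groups = ["BDMM"] * 5 + ["Bi-LSTM-CRF"] * 5 + ["Ours"] * 5
print(pairwise_tukeyhsd(endog=scores, groups=groups, alpha=0.05).summary())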

5.4 Compared with state-of-the-art deep models

In this section, we compare our best model, i.e., Model-I with PDET features, with several state-of-the-art deep models from the literature. Li et al. [41] treat the Chinese CNER task as a word-level sequence labeling problem and exploit a Bi-LSTM-CRF model to solve it. To improve recognition, they also use health-domain datasets to create richer, more specialized word embeddings, and utilize external health-domain lexicons to help word segmentation. Ouyang et al. [42] adopt a bidirectional RNN-CRF architecture with concatenated n-gram character representations to recognize Chinese clinical named entities. They also incorporate word segmentation results, part-of-speech (POS) tagging and medical vocabulary as features into their model. Hu et al. [43] develop a hybrid system based on rules, CRF and LSTM methods for the CNER task. They also utilize a self-training algorithm on extra unlabeled clinical texts to improve recognition performance. Note that, except for Li et al. [41], these systems all regard the CNER task as a character-level sequence labeling problem.

The comparative results are shown in Table 7. Some entries are left blank because the corresponding reference methods utilize external resources that are not supplied and did not report precision and recall in their papers. From the table, we can see that our best model achieves the best results in terms of Recall and F1-Measure among all the models. Li et al. [41] obtain the worst performance because their word-level approach inevitably introduces segmentation errors, leading to boundary errors during recognition; moreover, the word set is much larger than the character set, which means the corpus may not be big enough to learn word embeddings effectively. Furthermore, Hu et al. [43] use an ensemble method consisting of three separate models to handle the task and finally reach 91.02% in F1-Measure, the best result among the previous models, while we exploit only one model and achieve a better F1-Measure of 91.24%. In addition, our model achieves the best Recall (91.64%) compared to these reference algorithms.

Table 7
Comparative results between our best model and state-of-the-art deep models.

Methods               Precision    Recall    F1-Measure
Li et al. [41]        -            -         87.95
Ouyang et al. [42]    -            -         88.85
Hu et al. [43]        94.49        87.79     91.02
Hu et al. [43] *      92.99        89.25     91.08
Our best model        90.83        91.64     91.24

* The results are obtained by allowing the use of external resources for self-training.

6 Experiment analysis

In this section, we experimentally investigate several key issues of the proposed model. We first study the benefit of our model for handling instances with rare and unseen entities. Afterwards, we conduct a sensitivity analysis to investigate the impact of the dictionary size and the number of LSTM hidden units on the performance of our proposed approach. In the following experiments, all results are obtained with our approach using Model-I and PDET features.

6.1 Benefit of our model for handling rare and unseen entities

To verify the benefit of our model for handling rare and unseen entities, we experimentally compare the performance of our best model with the basic Bi-LSTM-CRF model in terms of recall. Rare and unseen entities are those that occur no more than three times in the training set, i.e., occurrence number ∈ {0, 1, 2, 3}.


Table 8
Comparative performance (i.e., recall) of our best model and the basic Bi-LSTM-CRF for handling rare and unseen entities.

Occurrence number    Basic Bi-LSTM-CRF    Our best model    ∆
0                    50.41                68.92             18.51
1                    72.87                81.39             8.52
2                    72.91                79.80             6.89
3                    82.81                88.61             5.80

The comparative results are shown in Table 8. From the table, we can see that our best model achieves an improvement of 18.51% in terms of recall for unseen entities (i.e., occurrence number 0) compared with the basic Bi-LSTM-CRF model. As for rare entities (i.e., occurrence number ∈ {1, 2, 3}), the improvement is 7.07% on average. This demonstrates the benefit brought by dictionary features for rare and unseen entities. In addition, the improvement from dictionary features decreases as the occurrence number of entities increases, because the more frequently entities appear in the training set, the better the basic model can recognize them.

6.2 The effect of the dictionary size

We first conduct a sensitivity analysis on the dictionary size by investigating its effect on the algorithm performance. We randomly select 80%, 85%, 90%, and 95% of the entities from the original dictionary to construct new dictionaries of different sizes. Fig. 5 summarizes the computational results.

From Fig. 5, we can see that the performance of our proposed approach improves gradually as the dictionary size increases, in all three evaluation measures. In other words, if we build a dictionary containing more entities, we can obtain better results.

6.3 The effect of the hidden unit number

We further explore the influence of the number of hidden units d_h of the LSTM, which takes the concatenation of character embeddings and PDET feature vectors as its input. In our experiments, d_h is set to 128, 192, 256, 320, and 384, respectively. The results are shown in Fig. 6.

Fig. 5. The impact of the dictionary size on the performance of our proposed approach in terms of Precision, Recall and F1-Measure.

Fig. 6. The impact of the number of hidden units on the performance of our proposed approach in terms of Precision, Recall and F1-Measure.

From Fig. 6, we can clearly observe that as the value of d_h increases, the performance of our proposed approach (in terms of F1-Measure) first grows slightly, then remains stable, and finally drops. The same trend can be observed on the Precision and Recall curves. These observations seem reasonable because a more complex model has greater expressive capacity; however, when the model becomes too complex, it may overfit.


7 Conclusion

Previous methods do not take human knowledge into consideration when recognizing clinical named entities with deep learning. In this paper, we therefore propose several effective approaches for Chinese CNER that integrate dictionaries into neural networks. Five different feature representation schemes are designed to extract information from given dictionaries for clinical texts, and two different architectures are developed to use the information extracted from the dictionaries. Because dictionaries contain rare and unseen entities, the proposed approaches can process such entities better than previous methods.

Experimental results on the CCKS-2017 Task 2 benchmark dataset show that, with PDET features in the format of feature embedding, first concatenating characters and dictionary features and then sending them into a Bi-LSTM layer significantly enhances the performance of the deep neural network for Chinese CNER, with an improvement of nearly 3% in terms of precision compared with the basic Bi-LSTM-CRF method, and an improvement of about 3% in terms of recall compared with the state-of-the-art methods. This is reasonable because our PDET features indicate not only the potential type but also the potential boundary of clinical named entities, and the dense vector representation carries more information than traditional one-hot encoding.

Acknowledgment

We would like to thank the reviewers for their useful comments and suggestions. This work was supported by the National Key R&D Program of China for “Precision Medical Research” (No. 2018YFC0910500), the National Major Scientific and Technological Special Project for “Significant New Drugs Development” (No. 2018ZX09201008), and the National Natural Science Foundation of China (No. 61772201).

References

[1] Ö. Uzuner, B. R. South, S. Shen, S. L. DuVall, 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text, Journal of the American Medical Informatics Association 18 (5) (2011) 552.

[2] H. Duan, Y. Zheng, A study on features of the CRFs-based Chinese named entity recognition, International Journal of Advanced Intelligence Paradigms (2011) 3.


[3] M. Gridach, Character-level neural network for biomedical named entity recognition, Journal of Biomedical Informatics 70 (2017) 85–91.

[4] M. Habibi, L. Weber, M. Neves, D. L. Wiegandt, U. Leser, Deep learning with word embeddings improves biomedical named entity recognition, Bioinformatics 33 (14) (2017) i37–i48.

[5] Z. Huang, W. Xu, K. Yu, Bidirectional LSTM-CRF models for sequence tagging, arXiv preprint arXiv:1508.01991.

[6] C. Friedman, P. O. Alderson, J. H. M. Austin, J. J. Cimino, S. B. Johnson, A general natural-language text processor for clinical radiology, Journal of the American Medical Informatics Association 1 (2) (1994) 161–174.

[7] K. Fukuda, A. Tamura, T. Tsunoda, T. Takagi, Toward information extraction: identifying protein names from biological papers, in: Pacific Symposium on Biocomputing, 1998, pp. 707–718.

[8] Q. T. Zeng, S. Goryachev, S. Weiss, M. Sordo, S. N. Murphy, R. Lazarus, Extracting principal diagnosis, co-morbidity and smoking status for asthma research: evaluation of a natural language processing system, BMC Medical Informatics and Decision Making 6 (1) (2006) 1–9.

[9] G. K. Savova, J. J. Masanz, P. V. Ogren, J. Zheng, S. Sohn, K. C. Kipper-Schuler, C. G. Chute, Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications, Journal of the American Medical Informatics Association 17 (5) (2010) 507.

[10] T. C. Rindflesch, L. Tanabe, J. N. Weinstein, L. Hunter, EDGAR: extraction of drugs, genes and relations from the biomedical literature, in: Biocomputing 2000, World Scientific, 1999, pp. 517–528.

[11] R. Gaizauskas, G. Demetriou, K. Humphreys, Term recognition and classification in biological science journal articles, in: Computational Terminology for Medical & Biological Applications Workshop of the 2nd International Conference on NLP, 2000, pp. 37–44.

[12] M. Song, H. Yu, W. Han, Developing a hybrid dictionary-based bio-entity recognition technique, BMC Medical Informatics and Decision Making 15 (S-1) (2015) S9.

[13] J. Lei, B. Tang, X. Lu, K. Gao, M. Jiang, H. Xu, A comprehensive study of named entity recognition in Chinese clinical text, Journal of the American Medical Informatics Association 21 (5) (2014) 808–814.

[14] J. Lei, Named entity recognition in Chinese clinical text, UT SBMI Dissertations (Open Access).

[15] G. D. Zhou, J. Su, Named entity recognition using an HMM-based chunk tagger, in: Meeting of the Association for Computational Linguistics, 2002, pp. 473–480.

[16] A. McCallum, D. Freitag, F. Pereira, Maximum entropy Markov models for information extraction and segmentation, in: Proceedings of the 17th International Conference on Machine Learning, 2000, pp. 591–598.


[17] J. Finkel, S. Dingare, H. Nguyen, M. Nissim, C. Manning, G. Sinclair, Exploiting context for biomedical entity recognition: from syntax to the web, in: International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, 2004, pp. 88–91.

[18] A. McCallum, W. Li, Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons, in: Conference on Natural Language Learning at HLT-NAACL, 2003, pp. 188–191.

[19] B. Settles, Biomedical named entity recognition using conditional random fields and rich feature sets, in: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA), COLING 2004, 2004, pp. 104–107.

[20] M. Skeppstedt, M. Kvist, G. H. Nilsson, H. Dalianis, Automatic recognition of disorders, findings, pharmaceuticals and body structures from clinical text, Journal of Biomedical Informatics 49 (2014) 148–158.

[21] Y. C. Wu, T. K. Fan, Y. S. Lee, S. J. Yen, Extracting named entities using support vector machines, in: International Workshop on Knowledge Discovery in Life Science Literature, 2006, pp. 91–103.

[22] Z. Ju, J. Wang, F. Zhu, Named entity recognition from biomedical text using SVM, in: International Conference on Bioinformatics and Biomedical Engineering, 2011, pp. 1–4.

[23] Y. Wu, M. Jiang, J. Lei, H. Xu, Named entity recognition in Chinese clinical text using deep neural network, Studies in Health Technology and Informatics 216 (2015) 624–628.

[24] D. Zeng, C. Sun, L. Lin, B. Liu, LSTM-CRF for drug-named entity recognition, Entropy 19 (6) (2017) 283.

[25] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (8) (1997) 1735–1780.

[26] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, arXiv preprint arXiv:1406.1078.

[27] Y. Xia, Q. Wang, Clinical named entity recognition: ECUST in the CCKS-2017 shared task 2, in: CEUR Workshop Proceedings, Vol. 1976, Chengdu, China, 2017, pp. 43–48.

[28] A. Graves, J. Schmidhuber, Framewise phoneme classification with bidirectional LSTM networks, in: IEEE International Joint Conference on Neural Networks (IJCNN), 2005, pp. 2047–2052.

[29] J. D. Lafferty, A. McCallum, F. C. N. Pereira, Conditional random fields: probabilistic models for segmenting and labeling sequence data, in: Eighteenth International Conference on Machine Learning, 2001, pp. 282–289.

[30] L. R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Readings in Speech Recognition 77 (2) (1990) 267–296.


[31] H. Lin, Y. Li, Z. Yang, Incorporating dictionary features into conditional random fields for gene/protein named entity recognition, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, 2007, pp. 162–173.

[32] D. Li, K. Kipper-Schuler, G. Savova, Conditional random fields and support vector machines for disorder named entity recognition in clinical texts, in: Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing (BioNLP 2008), Columbus, Ohio, 2008, pp. 94–95.

[33] R. L. Gai, F. Gao, L. M. Duan, X. H. Sun, H. Z. Li, Bidirectional maximal matching word segmentation algorithm with rules, in: Progress in Applied Sciences, Engineering and Technology, Vol. 926 of Advanced Materials Research, Trans Tech Publications, 2014, pp. 3368–3372.

[34] L. Ratinov, D. Roth, Design challenges and misconceptions in named entity recognition, in: Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009), Boulder, Colorado, USA, 2009, pp. 147–155.

[35] X. Ma, E. H. Hovy, End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), Berlin, Germany, 2016.

[36] J. Yang, S. Liang, Y. Zhang, Design challenges and misconceptions in neural sequence labeling, in: Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018), Santa Fe, New Mexico, USA, 2018, pp. 3879–3889.

[37] Y. Liu, Y. Zhou, S. Wen, C. Tang, A strategy on selecting performance metrics for classifier evaluation, International Journal of Mobile Computing & Multimedia Communications 6 (4) (2014) 20–35.

[38] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781.

[39] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, Journal of Machine Learning Research 15 (1) (2014) 1929–1958.

[40] D. P. Kingma, J. Ba, Adam: a method for stochastic optimization, arXiv preprint arXiv:1412.6980.

[41] Z. Li, Q. Zhang, Y. Liu, D. Feng, Z. Huang, Recurrent neural networks with specialized word embedding for Chinese clinical named entity recognition, in: CEUR Workshop Proceedings, Vol. 1976, Chengdu, China, 2017, pp. 55–60.

[42] E. Ouyang, Y. Li, L. Jin, Z. Li, X. Zhang, Exploring n-gram character presentation in bidirectional RNN-CRF for Chinese clinical named entity recognition, in: CEUR Workshop Proceedings, Vol. 1976, Chengdu, China, 2017, pp. 37–42.


[43] J. Hu, X. Shi, Z. Liu, X. Wang, Q. Chen, B. Tang, HITSZ CNER: a hybrid system for entity recognition from Chinese clinical text, in: CEUR Workshop Proceedings, Vol. 1976, Chengdu, China, 2017, pp. 25–30.
