
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 5164–5174, Hong Kong, China, November 3–7, 2019. © 2019 Association for Computational Linguistics


A Little Annotation does a Lot of Good: A Study in Bootstrapping Low-resource Named Entity Recognizers

Aditi Chaudhary, Jiateng Xie, Zaid Sheikh, Graham Neubig, Jaime G. Carbonell
{aschaudh, jiatengx, zsheikh, gneubig, jgc}@cs.cmu.edu
Language Technologies Institute
Carnegie Mellon University

Abstract

Most state-of-the-art models for named entity recognition (NER) rely on the availability of large amounts of labeled data, making them challenging to extend to new, lower-resourced languages. However, there are now several proposed approaches involving either cross-lingual transfer learning, which learns from other highly resourced languages, or active learning, which efficiently selects effective training data based on model predictions. This paper poses the question: given this recent progress, and limited human annotation, what is the most effective method for efficiently creating high-quality entity recognizers in under-resourced languages? Based on extensive experimentation using both simulated and real human annotation, we find a dual-strategy approach best, starting with a cross-lingual transferred model, then performing targeted annotation of only uncertain entity spans in the target language, minimizing annotator effort. Results demonstrate that cross-lingual transfer is a powerful tool when very little data can be annotated, but an entity-targeted annotation strategy can achieve competitive accuracy quickly, with just one-tenth of training data. The code is publicly available here.1

1 Introduction

Named entity recognition (NER) is the task of detecting and classifying named entities in text into a fixed set of pre-defined categories (person, location, etc.) with several downstream applications including machine reading (Chen et al., 2017), entity and event co-reference (Yang and Mitchell, 2016), and text mining (Han and Sun, 2012). Recent advances in deep learning have yielded state-of-the-art performance on many sequence labeling tasks, including NER (Collobert et al., 2011; Ma and Hovy, 2016; Lample et al., 2016; Peters et al., 2018).

1 https://github.com/Aditi138/EntityTargetedActiveLearning

However, the performance of these models is highly dependent on the availability of large amounts of annotated data, and as a result their accuracy is significantly lower on languages that have fewer resources than English. In this work, we ask the question "how can we efficiently bootstrap a high-quality named entity recognizer for a low-resource language with only a small amount of human effort?" Specifically, we leverage recent advances in data-efficient learning for low-resource languages, proposing the following "recipe" for bootstrapping low-resource entity recognizers: First, we use cross-lingual transfer learning (Yarowsky et al., 2001; Ammar et al., 2016), which applies a model trained on another language to low-resource languages, to provide a good preliminary model to start the bootstrapping process. Specifically, we use the model of Xie et al. (2018), which reports strong results on a number of language pairs. Next, on top of this transferred model we further employ active learning (Settles and Craven, 2008; Marcheggiani and Artieres, 2014), which helps improve annotation efficiency by using model predictions to select informative, rather than random, data for human annotators. Finally, the model is fine-tuned on the data obtained using active learning to improve accuracy in the target language.

Within this recipe, the choice of specific method for choosing and annotating data within active learning is highly important to minimize human effort. One relatively standard method used in previous work on NER is to select full sequences based on a criterion for the uncertainty of the entities recognized therein (Culotta and McCallum, 2005). However, as it is often the case that only a single entity within the sentence may be of interest, it can still be tedious and wasteful to annotate full sequences when only a small portion of the sentence is of interest (Neubig et al., 2011; Sperber et al., 2014).




Figure 1: Our proposed recipe: cross-lingual transfer is used for projecting annotations from an English labeled dataset to the target language. Entity-targeted active learning is then used to select informative sub-spans which are likely entities for humans to annotate. Finally, the NER model is fine-tuned on this partially-labeled dataset.

Inspired by this finding and considering the fact that named entities are both important and sparse, we propose an entity-targeted strategy to save annotator effort. Specifically, we select uncertain subspans of tokens within a sequence that are most likely named entities. This way, the annotators only need to assign types to the chosen subspans without having to read and annotate the full sequence. To cope with the resulting partial annotation of sequences, we apply a constrained version of conditional random fields (CRFs), partial CRFs, during training that learns only from the annotated subspans (Tsuboi et al., 2008; Wanvarie et al., 2011).

To evaluate our proposed methods, we conducted simulated active learning experiments on 5 languages: Spanish, Dutch, German, Hindi and Indonesian. Additionally, to study our method in a more practical setting, we conduct human annotation experiments on two low-resource languages, Indonesian and Hindi, and one simulated low-resource language, Spanish. In sum, this paper makes the following contributions:

1. We present a bootstrapping recipe for improving low-resource NER. With just one-tenth of tokens annotated, our proposed entity-targeted active learning method provides the best results among all active learning baselines, with an average improvement of 9.9 F1.

2. Through simulated experiments, we show that cross-lingual transfer is a powerful tool, outperforming the un-transferred systems by an average of 8.6 F1 with only one-tenth of tokens annotated.

3. Human annotation experiments show that annotators are more accurate in annotating entities when using the entity-targeted strategy as opposed to full-sequence annotation. Moreover, this strategy minimizes annotator effort by requiring them to label fewer tokens than full-sequence annotation.

2 Approach

As noted in the introduction, our bootstrapping recipe consists of three components: (1) cross-lingual transfer learning, (2) active learning to select relevant parts of the data to annotate, and (3) fine-tuning of the model on these annotated segments. Steps (2) and (3) are continued until the model has achieved an acceptable level of accuracy, or until we have exhausted our annotation budget. The system overview can be seen in Figure 1. In the following sections, we describe each of these three steps in detail.
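For concreteness, the overall recipe can be written as a simple loop. The sketch below only illustrates this control flow under stated assumptions; the helper functions (train_on_transferred_data, select_uncertain_spans, annotate, finetune) are hypothetical placeholders, not the released implementation.

```python
# Minimal sketch of the bootstrapping recipe (hypothetical helper functions).
def bootstrap_ner(source_data, unlabeled_target, budget, batch_tokens=200):
    # (1) Cross-lingual transfer: train a preliminary model on data
    #     projected from the source (English) into the target language.
    model = train_on_transferred_data(source_data)

    annotated = []          # partially labeled target-language data
    spent = 0
    while spent < budget:
        # (2) Active learning: pick the most uncertain likely-entity spans.
        spans = select_uncertain_spans(model, unlabeled_target, batch_tokens)
        annotated.extend(annotate(spans))   # human assigns entity types
        spent += batch_tokens

        # (3) Fine-tune the transferred model on everything annotated so far.
        model = finetune(model, annotated)
    return model
```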

2.1 Cross-lingual Transfer Learning

The goal of cross-lingual learning is to take a recognizer trained in a source language and transfer it to a target language. Our approach to doing so for NER follows that of Xie et al. (2018), and we provide a brief review in this section.

To begin with, we assume access to two sets of pre-trained monolingual word embeddings in the source and target languages, X and Y, one small bilingual lexicon, either provided or obtained in an unsupervised manner (Artetxe et al., 2017; Conneau et al., 2017a), and labeled training data in the source language. Using these resources, we train bilingual word embeddings (BWE) to create a word-to-word translation dictionary, and finally use this dictionary to translate the source training data into the target language, which we use to train an NER model.




To learn BWE, we first obtain a linear mapping W by solving the following objective:

$$W^* = \arg\min_{W} \| W X_D - Y_D \|_F \quad \text{s.t.} \quad W W^\top = I,$$

where X_D and Y_D correspond to the aligned word embeddings from the bilingual lexicon and ‖·‖_F denotes the Frobenius norm. We can first compute the singular value decomposition Y_D^⊤ X_D = UΣV^⊤, and solve the objective by taking W* = UV^⊤. We obtain BWE by linearly transforming the source and target monolingual word embeddings with U and V, namely XU and YV.
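For readers who want to see this step concretely, here is a minimal numpy sketch of the orthogonal Procrustes alignment, written in the row-vector convention (each row of X_D/Y_D is a lexicon-aligned word vector); it illustrates the standard solution under these assumptions rather than the authors' exact code, so the notation differs slightly from the equations above.

```python
import numpy as np

def procrustes_mapping(X_D, Y_D):
    """Orthogonal W minimizing ||X_D W - Y_D||_F (rows are word vectors)."""
    # SVD of the cross-covariance between the lexicon-aligned embeddings.
    U, _, Vt = np.linalg.svd(X_D.T @ Y_D)
    return U @ Vt            # optimal orthogonal mapping

# Usage: map every source-language embedding into the target space.
# W = procrustes_mapping(X_D, Y_D); X_mapped = X @ W
```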

After obtaining the BWE, we find the nearest neighbor target word for every source word in the BWE space using the cross-domain similarity local scaling (CSLS) metric (Conneau et al., 2017b), which produces a word-to-word translation dictionary. We use this dictionary to translate the source training data into the target language, and simply copy the label for each word, which yields transferred training data in the target language. We train an NER model on this transferred data as our preliminary model. Going forward, we refer to the use of cross-lingual transferred data as CT.
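The word-by-word translation with label copying can be sketched as follows. For brevity this sketch uses plain cosine nearest neighbours instead of CSLS, and the variable names are illustrative assumptions.

```python
import numpy as np

def build_dictionary(src_emb, tgt_emb, src_words, tgt_words):
    """Nearest-neighbour translation dictionary in the shared BWE space.

    src_emb/tgt_emb: (n, d) arrays already mapped into the common space.
    NOTE: the paper uses CSLS; plain cosine similarity is shown for brevity.
    """
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    nn = (src @ tgt.T).argmax(axis=1)
    return {src_words[i]: tgt_words[j] for i, j in enumerate(nn)}

def translate_training_data(sentences, dictionary):
    """Translate source NER data word by word, copying each word's label."""
    transferred = []
    for tokens, labels in sentences:      # e.g. (["John", "lives"], ["B-PER", "O"])
        translated = [dictionary.get(tok, tok) for tok in tokens]
        transferred.append((translated, labels))   # labels are copied as-is
    return transferred
```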

2.2 Entity-Targeted Active Learning

After training a model using cross-lingual transfer learning, we start the active learning process based on this model's outputs. We begin by training an NER model Θ using the above model's outputs as training data. Using this trained model, our proposed entity-targeted active learning strategy, referred to as ETAL, then selects the most informative spans from a corpus D of unlabeled sequences. Given an unlabeled sequence s, ETAL first selects a span of tokens s_i^j = s_i ⋯ s_j such that s_i^j is a likely named entity, where i, j ∈ [0, |s|]. Then, in order to obtain highly informative spans across D, ETAL computes the entropy H for each occurrence of the span s_i^j in D and then aggregates them over the entire corpus D, given by:

$$H_{\mathrm{aggregate}}(s_i^j) = \sum_{x_i^j \in D} H(x_i^j)\,\mathbb{1}(x_i^j = s_i^j),$$

where x is an unlabeled sequence in D. Finally, the spans s_i^j with the highest aggregate uncertainty H_aggregate are selected for manual annotation.

We now describe the procedure for calculating H(x_i^j), which is the entropy of a span x_i^j being a likely entity. Given an unlabeled sequence x, the trained NER model Θ is used for computing the marginal probabilities p_θ(y_i|x) for each token x_i across all possible labels y_i ∈ Y using the forward-backward algorithm (Rabiner, 1989), where Y is the set of all labels. Using these marginals we calculate the entropy of a given span x_i^j being an entity as shown in Algorithm 1.

Algorithm 1 Entity-Targeted Active Learning
1:  B ← label set denoting the beginning of an entity
2:  I ← label set denoting the inside of an entity
3:  O ← label denoting outside of an entity span
4:  p_θ(y_i|x) ← marginal probability of label y_i for token x_i
5:
6:  for i ← 1 … len(x), j = i do
7:      p_span^{ij} ← Σ_{y∈B} p_θ(y_i|x)
8:      for j ← i+1 … len(x) do
9:          p_entity ← p_span^{ij} · p_θ(O_j|x)
10:         H ← entropy(p_entity)
11:         if H > H_threshold then
12:             H_aggregate(x_i^j) += H
13:         p_span^{ij} ← p_span^{ij} · Σ_{y∈I} p_θ(y_j|x)

Let B denote the set of labels indicating the beginning of an entity, I the set of labels indicating the inside of an entity, and O the label denoting outside of an entity. First, we compute the probability of a span x_i^j being an entity, starting with the token i, by marginalizing p_θ(y_i|x) over all labels in B, denoted as p_span^{ij}. Since an entity can span multiple tokens, for each subsequent token j being part of that entity, we marginalize p_θ(y_j|x) over all labels in I and combine it with p_span^{ij}. Finally, we compute p_entity = p_span^{ij} · p_θ(O_j|x), which denotes the end of a likely entity. Since we use the marginal probability for computing p_entity, it already factors in the transition probability between tags; thus, any invalid sequences such as B-PER I-ORG have low scores. Since contiguous spans have overlapping tokens, using dynamic programming (DP) to compute p_span^{ij} avoids an exponential computation when considering all possible spans in a sequence. Using p_entity, we compute the entropy H and only consider the spans having H higher than a pre-defined threshold H_threshold. The reason for this thresholding is purely computational, as it allows us to discard all spans that have a very low probability of being an entity, keeping the number of spans actually stored in memory low.



As mentioned above, we aggregate the entropy of spans H_aggregate over the entire unlabeled set, thus combining uncertainty sampling with a bias towards high-frequency entities.
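To make Algorithm 1 concrete, below is a small Python sketch of the span-scoring dynamic program, assuming the per-token marginals have already been computed with forward-backward. The helper names, the binary-entropy choice, and the max_len cap are illustrative assumptions rather than the released implementation.

```python
import math
from collections import defaultdict

def span_entropy_scores(tokens, marginals, B_ids, I_ids, O_id,
                        h_threshold=1e-8, max_len=5):
    """Score candidate entity spans of one sequence (sketch of Algorithm 1).

    tokens:    list of surface tokens for the sequence x
    marginals: marginals[t][y] = p_theta(y_t = y | x), from forward-backward
    Returns a dict mapping a span's surface form to its summed entropy.
    """
    def entropy(p):
        # binary entropy of "this span is an entity" (illustrative choice)
        if p <= 0.0 or p >= 1.0:
            return 0.0
        return -(p * math.log(p) + (1 - p) * math.log(1 - p))

    scores = defaultdict(float)
    T = len(tokens)
    for i in range(T):
        # probability the span starts at i: marginal mass on B-* labels
        p_span = sum(marginals[i][y] for y in B_ids)
        for j in range(i + 1, min(T, i + max_len + 1)):
            # entity covering tokens i..j-1, followed by an O at position j
            p_entity = p_span * marginals[j][O_id]
            h = entropy(p_entity)
            if h > h_threshold:
                scores[tuple(tokens[i:j])] += h
            # extend the span to include token j via the I-* labels
            p_span *= sum(marginals[j][y] for y in I_ids)
    return scores

# Corpus-level aggregation H_aggregate: sum the scores of identical spans
# across all sequences, then pick the highest-scoring spans for annotation.
def aggregate_over_corpus(corpus, marginals_fn, **kw):
    agg = defaultdict(float)
    for tokens in corpus:
        for span, h in span_entropy_scores(tokens, marginals_fn(tokens), **kw).items():
            agg[span] += h
    return sorted(agg.items(), key=lambda kv: -kv[1])
```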

Using this strategy, we select subspans in each sequence for annotation. The annotator only needs to assign named entity types to the chosen subspans, adjust the span boundary if needed, and ignore the rest of the sequence, saving much effort.

2.3 Training the NER model

With the newly obtained training data from active learning, we attempt to improve the original transferred model. In this section, we first describe our model architecture, and then address two questions: 1) how to train the NER model effectively with partially annotated sequences, and 2) what training scheme is best suited to improve the transferred model.

2.3.1 Model Architecture

Our NER model is a BiLSTM-CNN-CRF model based on Ma and Hovy (2016), consisting of: a character-level CNN, which allows the model to capture subword information; a word-level BiLSTM, which consumes word embeddings and produces context-sensitive hidden representations; and a linear-chain CRF layer that models the dependency between labels for inference. We use the above model for training the initial NER model on the transferred data as well as for re-training the model on the data acquired from active learning.
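For readers unfamiliar with this architecture, the following condensed PyTorch sketch shows the encoder (character-level CNN plus word-level BiLSTM) producing per-label emission scores; the CRF layer is omitted and all sizes are illustrative, so this should not be read as the authors' released code.

```python
import torch
import torch.nn as nn

class BiLSTMCNNEncoder(nn.Module):
    """Character-CNN + word-BiLSTM encoder producing per-label emission scores.

    A linear-chain CRF (not shown) would be placed on top of the emissions.
    Sizes are illustrative, not the paper's hyper-parameters.
    """
    def __init__(self, word_vocab, char_vocab, num_labels,
                 word_dim=100, char_dim=30, char_filters=30, hidden=200):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, word_dim)
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        # character-level CNN: captures subword information per word
        self.char_cnn = nn.Conv1d(char_dim, char_filters, kernel_size=3, padding=1)
        # word-level BiLSTM over [word embedding; char-CNN features]
        self.lstm = nn.LSTM(word_dim + char_filters, hidden,
                            batch_first=True, bidirectional=True)
        self.emissions = nn.Linear(2 * hidden, num_labels)

    def forward(self, words, chars):
        # words: (batch, seq); chars: (batch, seq, max_word_len)
        b, t, c = chars.shape
        ch = self.char_emb(chars).view(b * t, c, -1).transpose(1, 2)   # (b*t, dim, c)
        ch = torch.max(torch.relu(self.char_cnn(ch)), dim=2).values    # max over chars
        ch = ch.view(b, t, -1)
        x = torch.cat([self.word_emb(words), ch], dim=-1)
        h, _ = self.lstm(x)
        return self.emissions(h)        # (batch, seq, num_labels)
```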

2.3.2 PARTIAL-CRF

Active learning with span-based strategies such as ETAL produces a training dataset of partially labeled sequences. To train the NER model on these partially labeled sequences, we take inspiration from Bellare and McCallum (2007) and Tsuboi et al. (2008) and use a constrained CRF decoder. Normally, a CRF computes the likelihood of a label sequence y given a sequence x as follows:

$$p_\theta(\mathbf{y} \mid \mathbf{x}) = \frac{\prod_{t=1}^{T} \psi(y_{t-1}, y_t, \mathbf{x}, t)}{Z(\mathbf{x})}, \qquad Z(\mathbf{x}) = \sum_{\mathbf{y} \in \mathcal{Y}(T)} \prod_{t=1}^{T} \psi(y_{t-1}, y_t, \mathbf{x}, t),$$

where T is the length of the sequence, 𝒴(T) denotes the set of all possible label sequences with length T, and ψ(y_{t−1}, y_t, x, t) = exp(W_{y_{t−1}, y_t}^⊤ x_t + b_{y_{t−1}, y_t}) is the energy function. To compute the likelihood of a sequence where some labels are unknown, we use a constrained CRF which marginalizes out the un-annotated tokens. Specifically, let 𝒴_L denote the set of all possible label sequences that include the partial annotations (for un-annotated tokens, all labels are possible), and we compute the likelihood as p_θ(𝒴_L|x) = Σ_{y ∈ 𝒴_L} p_θ(y|x), referred to as PARTIAL-CRF.
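To illustrate how such a constrained likelihood can be computed, here is a small numpy sketch of a forward pass that sums only over label sequences consistent with the partial annotation (annotated positions are restricted to their gold label, un-annotated positions allow every label). It assumes given emission and transition scores and is a simplified stand-in, not the authors' implementation.

```python
import numpy as np
from scipy.special import logsumexp

def constrained_log_Z(emissions, transitions, allowed):
    """Log-sum of exp scores over label sequences respecting `allowed`.

    emissions:   (T, L) per-token label scores (log-space)
    transitions: (L, L) transition scores, transitions[a, b] = score(a -> b)
    allowed:     list of label-index lists; the full label set for unlabeled
                 tokens, a single gold index for annotated tokens
    """
    T, L = emissions.shape
    alpha = np.full(L, -np.inf)
    alpha[allowed[0]] = emissions[0, allowed[0]]
    for t in range(1, T):
        new = np.full(L, -np.inf)
        for b in allowed[t]:
            new[b] = logsumexp(alpha + transitions[:, b]) + emissions[t, b]
        alpha = new
    return logsumexp(alpha)

def partial_crf_nll(emissions, transitions, partial_labels, num_labels):
    """-log p(Y_L | x): numerator restricted to the partial annotation,
    denominator is the ordinary partition function over all sequences."""
    everything = [list(range(num_labels))] * len(partial_labels)
    constrained = [[y] if y is not None else list(range(num_labels))
                   for y in partial_labels]
    return constrained_log_Z(emissions, transitions, everything) \
         - constrained_log_Z(emissions, transitions, constrained)
```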

2.3.3 Training Scheme

To improve our model with the newly labeled data, we directly fine-tune the initial model, trained on the transferred data, on the data acquired through active learning, referred to as FINETUNE. Each active learning run produces more labeled data, for which this training procedure is repeated again. We also compare the NER performance using two other training schemes: CORPUSAUG, where we train the model on the concatenated corpus of transferred data and the newly acquired data, and CORPUSAUG+FINETUNE, where we additionally fine-tune the model trained using CORPUSAUG on just the newly acquired data.
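The three schemes differ only in what data each training stage sees; a minimal sketch, assuming hypothetical train and finetune helpers:

```python
# Hypothetical helpers: train(data) trains a model from scratch,
# finetune(model, data) continues training an existing model.

def scheme_finetune(transferred, acquired):
    # FINETUNE: fine-tune the transferred model on the newly acquired data only.
    return finetune(train(transferred), acquired)

def scheme_corpusaug(transferred, acquired):
    # CORPUSAUG: retrain on transferred data concatenated with acquired data.
    return train(transferred + acquired)

def scheme_corpusaug_finetune(transferred, acquired):
    # CORPUSAUG+FINETUNE: additionally fine-tune the CORPUSAUG model on acquired data.
    return finetune(train(transferred + acquired), acquired)
```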

3 Experiments

In this section, we evaluate the effectiveness of our proposed strategy in both simulated (§3.2) and human-annotation experiments (§3.3).

3.1 Experimental Settings

Datasets: The first evaluation set includes the benchmark CoNLL 2002 and 2003 NER datasets (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003) for Spanish (from the Romance family), Dutch and German (like English, from the Germanic family). We use the standard corpus splits for train/dev/test. The second evaluation set is for the low-resource setting, where we use the Indonesian (from the Austronesian family), Hindi (from the Indo-Aryan family) and Spanish datasets released by the Linguistic Data Consortium (LDC).2 We generate the train/dev/test split by random sampling. Details of the corpus statistics are in the Appendix.

English-transferred Data: We use the same experimental settings and resources as described in Xie et al. (2018) to get the translations of the English training data for each target language.

2 LDC2017E62, LDC2016E97, LDC2017E66



Active Learning Setup: As described in §2.2, a DP-based algorithm is employed to select the uncertain entity spans; it runs over all n-grams of length ≤ 5. This length was approximated by computing the 90th percentile of the length of entities in the English training data. H_threshold is a hyper-parameter set to 1e-8. The details of the NER model hyper-parameters can be found in the Appendix.
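The maximum span length is derived from the source-language annotations as described; a tiny sketch of this estimate, where entity_lengths is assumed to hold the token length of every gold entity in the English training data:

```python
import numpy as np

def max_span_length(entity_lengths, pct=90):
    """Percentile of gold entity lengths (in tokens) from the source data."""
    return int(np.percentile(entity_lengths, pct))

# e.g. max_span_length(lengths) evaluates to roughly 5 in the paper's setup
```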

3.2 Simulation Experiments

Setup: We use cross-lingual transfer (§2.1) to train our initial NER model and test on the target language. This is the same setting as Xie et al. (2018) and serves as our baseline. Then we use several active learning strategies to select data for manual annotation using this trained NER model. We compare our proposed ETAL strategy with the following baseline strategies:

SAL: Select whole sequences for which the model has the least confidence in the most likely labeling (Culotta and McCallum, 2005).

CFEAL: Select the least confident spans within a sequence using the confidence field estimation method (Culotta and McCallum, 2004).

RAND: Select spans randomly from the unlabeled set for annotation.

In this experimental setting, we simulate manual annotation by using gold labels for the data selected by active learning. At each subsequent run, we annotate 200 tokens and fine-tune the NER model on all the data acquired so far, which is then used to select data for the next run of annotation.

3.2.1 Results

Figure 2 summarizes the results for all datasets across the different experimental settings. Each data point on the x-axis corresponds to the NER performance after annotating 200 additional tokens. CT denotes using cross-lingual transferred data to train the initial NER model, both for kick-starting the active learning process and for fine-tuning the NER model on the newly acquired data. PARTIAL-CRF/FULL-CRF denote the type of CRF decoder used in the NER model. Throughout this paper, we report results averaged across all active learning runs unless otherwise noted. Individual scores are reported in the Appendix.

As can be seen in the figure, our proposed recipe ETAL+PARTIAL-CRF+CT outperforms the previous active learning baselines for all the datasets.

Dataset          Tokens   ETAL         SAL          RAND         CFEAL
Hindi LDC        200      54.8 ± 2.6   49.6 ± 2.8   50.2 ± 0.4   50.4 ± 2.8
                 600      64.7 ± 2.6   51.5 ± 2.9   56.1 ± 2.6   54.3 ± 2.5
                 1200     69.9 ± 2.5   56.6 ± 2.7   56.8 ± 2.6   64.4 ± 2.7
Indonesian LDC   200      47.4 ± 2.6   47.9 ± 2.4   46.8 ± 2.3   48.5 ± 2.3
                 600      54.5 ± 2.4   44.5 ± 2.2   47.2 ± 2.2   46.0 ± 2.3
                 1200     60.5 ± 2.3   44.7 ± 2.3   51.9 ± 2.3   49.5 ± 2.3
Spanish LDC      200      66.3 ± 3.8   62.0 ± 3.6   61.2 ± 1.2   62.5 ± 3.7
                 600      65.8 ± 4.1   62.5 ± 3.7   61.9 ± 2.0   63.8 ± 3.7
                 1200     78.9 ± 3.5   62.3 ± 3.6   64.6 ± 4.0   68.6 ± 3.9

Table 1: Variance analysis for significance testing of different active learning systems using paired bootstrap resampling. ± denotes the 95% confidence intervals. Systems that are not statistically significantly different from the best system, ETAL, are in bold. The CoNLL datasets reflect the same observation, as can be seen in the Appendix.

Holding the other two components, CT and PARTIAL-CRF, constant, we conduct experiments to compare the different active learning strategies, which are denoted by the solid lines in Figure 2. We see that ETAL outperforms the other strategies by a significant margin for both the CoNLL datasets: German (+6.1 F1), Spanish (+5.3 F1), Dutch (+6.3 F1), and the LDC datasets: Hindi (+9.3 F1), Indonesian (+9.0 F1), Spanish (+7.5 F1), at the end of all runs. Furthermore, even with just one-tenth of tokens annotated, the proposed recipe is only (avg.) -5.2 F1 behind the model trained using all labeled data, denoted by SUPERVISED ALL. Although CFEAL also selects informative spans, ETAL outperforms it because ETAL is optimized to select likely entities, causing more entities to be annotated for Hindi (+43), Indonesian (+207), Spanish-CoNLL (+1579), German (+906), Dutch (+836), except for Spanish-LDC (-184). Despite fully labeled data being added in SAL, ETAL outperforms it because SAL selects longer sentences with fewer entities: Hindi (-934), Indonesian (-1290), Spanish-LDC (-527), Spanish-CoNLL (-2395), German (-2086), Dutch (-2213).

From Figure 2 we see that ETAL performs better than the baselines across multiple runs. To verify that this is not an artifact of randomness in the test data, we use a paired bootstrap resampling method, as illustrated in Koehn (2004), to compare SAL, CFEAL, and RAND with ETAL. For each system, we compute the F1 score on a randomly sampled 50% of the data and perform 10k bootstrapping steps at three active learning runs. From Table 1 we see that the baselines are significantly worse than ETAL at 600 and 1200 annotated tokens.
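For reference, paired bootstrap resampling of this kind can be sketched as below; f1 is a hypothetical corpus-level scorer over (gold, predicted) pairs, and the 50% sample size and 10k iterations mirror the setup described above.

```python
import random

def paired_bootstrap(gold, pred_a, pred_b, f1, iters=10000, frac=0.5):
    """Fraction of bootstrap samples on which system A beats system B.

    gold/pred_a/pred_b are parallel lists of sentence-level annotations;
    f1(gold_subset, pred_subset) is a hypothetical corpus-level F1 scorer.
    """
    n, k = len(gold), int(len(gold) * frac)
    wins_a = 0
    for _ in range(iters):
        idx = [random.randrange(n) for _ in range(k)]   # sample with replacement
        g = [gold[i] for i in idx]
        if f1(g, [pred_a[i] for i in idx]) > f1(g, [pred_b[i] for i in idx]):
            wins_a += 1
    return wins_a / iters   # e.g. > 0.95 -> A significantly better at p < 0.05
```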



[Figure 2 plots: one panel per dataset (Hindi LDC, Indonesian LDC, Spanish LDC, Spanish CoNLL, German CoNLL, Dutch CoNLL); x-axis from the CT baseline to 4000 annotated tokens; curves for ETAL+PARTIAL-CRF+CT, CFEAL+PARTIAL-CRF+CT, ETAL+PARTIAL-CRF, RAND+PARTIAL-CRF+CT, SAL+FULL-CRF+CT, ETAL+FULL-CRF+CT, and the SUPERVISED ALL reference.]

Figure 2: Comparison of the NER performance trained with the FINETUNE scheme, across six datasets. Solid lines compare the different active learning strategies. Dashed lines show the ablation experiments. The x-axis denotes the total number of tokens annotated and the y-axis denotes the F1 score.

3.2.2 Ablation Study

In order to study the contribution of CT and PARTIAL-CRF in improving the NER performance, we conduct the following ablations, denoted by dashed lines in Figure 2.

CT: We observe that the transferred data from English provides a good start to the NER model: 69.4 (Dutch), 63.0 (Spanish-LDC), 65.7 (Spanish-CoNLL), 54.7 (German), 45.4 (Indonesian), 45.0 (Hindi) F1. As expected, cross-lingual transfer helps more for the languages closely related to English, namely Dutch, German and Spanish. For this ablation, we train an ETAL+PARTIAL-CRF system in which no transferred data is used. Therefore, to create the seed data, we randomly annotate 200 tokens in the target language and thereafter use ETAL. We observe that as more in-domain data is acquired, the un-transferred setting soon approaches the transferred setting ETAL+PARTIAL-CRF+CT, suggesting that an efficient annotation strategy can help close the gap between these two systems with as few as ~1000 tokens (avg.).

PARTIAL-CRF: We study the effect of using the original CRF (FULL-CRF) instead of the PARTIAL-CRF for training with partially labeled data. Since the former requires fully labeled sequences, the un-annotated tokens in a sequence are labeled with the model predictions. We see from Figure 2 that ETAL+FULL-CRF+CT performs worse (avg. -4.1 F1) than ETAL+PARTIAL-CRF+CT. This is because the FULL-CRF significantly hurts recall, by as much as an average of -11.0 points for Hindi, -1.4 for Indonesian, -7.4 for Spanish-LDC, -3.3 for German, -3.7 for Dutch, and -4.8 for Spanish-CoNLL.

3.2.3 Comparison of Training Schemes

We experiment with different NER training regimes (described in §2.3.3) for ETAL. We observe that fine-tuning generally not only speeds up the training but also gives better performance than CORPUSAUG. For brevity, we compare results for two languages in Figure 3:3 Dutch, a relative of English, and Hindi, a distant language. We see that FINETUNE performs better for Hindi whereas CORPUSAUG+FINETUNE performs better for Dutch. This is because Dutch is closely related to English and benefits the most from the transferred data being explicitly augmented, whereas for Hindi, which is typologically distant from English, the transferred data is noisy and thus the model does not gain much from it. Xie et al. (2018) make a similar observation in their experiments with German.

3 The results for the other datasets can be found in the Appendix.



[Figure 3 plots: two panels, Hindi and Dutch; x-axis from the baseline to 4000 annotated tokens, y-axis from 0 to 100 F1.]

Figure 3: Comparison of the NER performance trained with different schemes for the ETAL strategy. The x-axis denotes the total number of tokens annotated and the y-axis denotes the F1 score.


3.3 Human Annotation Experiments

Setup: We conduct human annotation experiments for Hindi, Indonesian and Spanish to understand whether ETAL helps reduce the annotation effort and improve annotation quality in practical settings. We compare ETAL with the full-sequence strategy (SAL). We use six native speakers, two for each language, with different levels of familiarity with the NER task. Each annotator was provided with practice sessions to gain familiarity with the annotation guidelines and the user interface. Each annotator annotated for 20 minutes per strategy. For ETAL, the annotator was required to annotate single spans, i.e., each sequence contained one span of tokens. This involved assigning the correct label and adjusting the span boundary if required. For SAL, the annotator was required to annotate all possible entities in the sequence. We randomized the order in which the annotators annotated using the ETAL and SAL strategies. Figure 5 illustrates the human annotation process for the ETAL strategy in the annotation interface.4

3.3.1 Results

Table 2 records the results of the human annotation experiments. We first compare each annotator's annotation quality with respect to the oracle under both ETAL and SAL, denoted by Annotator Performance. We see that both Hindi and Spanish annotators have higher annotation quality using ETAL. This is because by selecting possible entity spans, ETAL not only saves effort in searching for the entities in a sequence but also allows the annotators to read less overall and concentrate more on the things that they do read, as seen in Figure 4(a).

4 The code for the annotation interface can be found at https://gitlab.com/cmu_ariel/ariel-annotation-interface.

        Annotator Performance    Test Performance (# annotated tokens)
        ETAL    SAL              ETAL          SAL           SAL-FULL
HI-1    78.8    63.7             50.4 (326)    44.2 (326)    53.3 (1894)
HI-2    82.7    72.2             49.1 (234)    45.9 (234)    55.6 (2242)
ID-1    66.1    77.8             50.4 (425)    45.8 (425)    51.3 (3232)
ID-2    73.0    79.5             51.2 (251)    46.5 (251)    54.0 (2874)
ES-1    79.7    75.0             63.7 (204)    62.2 (204)    64.6 (2134)
ES-2    83.1    70.4             63.8 (199)    62.2 (199)    62.6 (2134)

Table 2: Annotator Performance measures the F1 of each annotator with respect to the oracle. Test Performance measures the NER F1 scores using the annotations as training data. The numbers in brackets denote the number of annotated tokens used for training the NER model. ES: Spanish, HI: Hindi, ID: Indonesian.

However, for SAL we see that the annotator missed a likely entity because they focused on the other, more salient entities in the sequence. For Indonesian, we see an opposite trend due to several inconsistencies in the gold labels. The most common inconsistency is when a common noun is part of an entity. For instance, the gold standard annotates the span Kabupaten Bogor as an entity, where Kabupaten means "district", whereas for Kabupaten Aceh tengah, the gold standard does not include Kabupaten. Similarly, the same span gunung krakatau is annotated inconsistently across different mentions, where sometimes the gunung (mountain) token is excluded. Since the annotators referred to these examples during their practice session, their annotations had similar inconsistencies. This issue affects ETAL more than SAL because ETAL selects more entities for annotation.

The Test Performance compares the performance of the NER models trained with these annotations. The number in brackets denotes the total number of annotated tokens used for training the NER model.



[Figure 4(a): Hindi annotation examples for ETAL and SAL, comparing gold and human labels; the Hindi text is not recoverable from the PDF extraction. Figure 4(b): bar chart of the number of entities (y-axis) in the data selected by ETAL versus SAL, as annotated by the oracle, for annotators ES-1, ES-2, HI-1, HI-2, ID-1 and ID-2.]

Figure 4: (a) Examples from the Hindi human annotation experiments for both ETAL and SAL. Square brackets denote the spans (for ETAL) or the entire sequence (for SAL) selected by the respective active learning strategy. (b) Comparison of the number of entities in the data selected by ETAL and SAL, as annotated by the oracle.

We observe that SAL has a larger number of annotated tokens than ETAL. This is because most sequences selected by SAL did not have any entities. Since "not-an-entity" is the default label in the annotation interface, no operation is required for annotating these, allowing more tokens to be annotated per unit time. When we count the number of entities present in the data selected by the two strategies, we see in Figure 4(b) that the data selected by ETAL has a significantly larger number of entities than SAL, across all six annotation experiments. Therefore, we first compare the NER performance on the same number of annotated tokens. From Table 2 we see that under this setting ETAL outperforms SAL, similar to the simulation results. We note that when all annotated tokens are considered, SAL (denoted by SAL-FULL) has slightly better results. However, despite ETAL having 6 times fewer annotated tokens, the difference between ETAL and SAL-FULL is only 2.1 F1 (avg.). This suggests that ETAL can achieve competitive performance with fewer annotations.

From both the simulation and human experiments, we can conclude that a targeted annotation strategy such as ETAL achieves competitive performance with less manual effort while maintaining high annotation quality. Given that ETAL can help find twice as many entities as SAL, a potential application of ETAL is also creating a high-quality entity gazetteer under a short time budget. Since the naive strategy of SAL allows more labelled data to be acquired in the same amount of time, in the future we plan to explore mixed-mode annotation, where we choose either full sequences or spans for annotation.

4 Related Work

Cross-Lingual Transfer: Transferring knowledge from high-resource languages has been extensively used for improving low-resource NER. The most common approaches rely on annotation projection methods, where annotations in the source language are projected to the target language using parallel corpora (Zitouni and Florian, 2008; Ehrmann et al., 2011) or bilingual dictionaries (Xie et al., 2018; Mayhew et al., 2017). Cross-lingual word embeddings (Bharadwaj et al., 2016; Chaudhary et al., 2018) also provide a way to leverage annotations from related languages.

Active Learning (AL): AL has been widely explored for many NLP tasks. For NER, Shen et al. (2017) explore token-level annotation strategies and Chen et al. (2015) present a study on AL for clinical NER; Baldridge and Palmer (2009) evaluate how well AL works with annotator expertise and label suggestions, and Garrette and Baldridge (2013) study type- and token-based strategies for low-resource languages. Settles and Craven (2008) present a survey of the different AL strategies for sequence labeling tasks, whereas Marcheggiani and Artieres (2014) discuss strategies for acquiring partially labeled data. Wanvarie et al. (2011), Neubig et al. (2011) and Sperber et al. (2014) show the advantages of training a model on such partially labeled data. All of the above methods focus on either token or full-sequence annotation.

The most similar work to ours is perhaps that of Fang and Cohn (2017), which selects informative word types for low-resource POS tagging. However, their method requires the annotator to annotate single tokens, which is not trivially applicable to multi-word entities in practical settings.



(a) Selected spans using the ETAL strategy are highlighted for the human annotator to annotate.

(b) Human annotator correcting the span boundary and assigning the correct entity type.

(c) Human annotator assigning only the correct entity type, since the selected span boundary is already correct.

(d) Partially-annotated sequences after being annotated by the human annotator.

Figure 5: Example of the human annotation process for Hindi.


5 Conclusion

In this paper, we presented a study on how to efficiently bootstrap NER systems for low-resource languages using a combination of cross-lingual transfer learning and active learning. We conducted both simulated and human annotation experiments across different languages and found that: 1) cross-lingual transfer is a powerful tool, consistently beating systems that do not use transfer; 2) our proposed recipe works best among known active learning baselines; 3) our proposed active learning strategy saves annotators much effort while ensuring high annotation quality. In the future, to account for different levels of annotator expertise, we plan to combine proactive learning (Li et al., 2017) with our proposed method.

Acknowledgement

The authors would like to thank Sachin Kumar, Kundan Krishna, Aldrian Obaja Muis, Shirley Anugrah Hayati, Rodolfo Vega and Ramon Sanabria for participating in the human annotation experiments. This work is sponsored by the Defense Advanced Research Projects Agency Information Innovation Office (I2O), Program: Low Resource Languages for Emergent Incidents (LORELEI), issued by DARPA/I2O under Contract No. HR0011-15-C0114. This research was also supported in part by DARPA grant FA8750-18-2-0018 funded under the AIDA program. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.



References

Waleed Ammar, George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, and Noah A Smith. 2016. Massively multilingual word embeddings. arXiv preprint arXiv:1602.01925.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 451–462, Vancouver, Canada. Association for Computational Linguistics.

Jason Baldridge and Alexis Palmer. 2009. How well does active learning actually work? Time-based evaluation of cost-reduction strategies for language documentation. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 296–305, Singapore. Association for Computational Linguistics.

Kedar Bellare and Andrew McCallum. 2007. Learning extractors from unlabeled text using relevant databases. In Sixth International Workshop on Information Integration on the Web.

Akash Bharadwaj, David Mortensen, Chris Dyer, and Jaime Carbonell. 2016. Phonologically aware neural model for named entity recognition in low resource transfer settings. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1462–1472.

Aditi Chaudhary, Chunting Zhou, Lori Levin, Graham Neubig, David R. Mortensen, and Jaime Carbonell. 2018. Adapting word embeddings to new languages with morphological and phonological subword representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3285–3295. Association for Computational Linguistics.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879, Vancouver, Canada. Association for Computational Linguistics.

Yukun Chen, Thomas A Lasko, Qiaozhu Mei, Joshua C Denny, and Hua Xu. 2015. A study of active learning methods for named entity recognition in clinical text. Journal of Biomedical Informatics, 58:11–18.

Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.

Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Herve Jegou. 2017a. Word translation without parallel data. arXiv preprint arXiv:1710.04087.

Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Herve Jegou. 2017b. Word translation without parallel data. arXiv preprint arXiv:1710.04087.

Aron Culotta and Andrew McCallum. 2004. Confidence estimation for information extraction. In Proceedings of HLT-NAACL 2004: Short Papers, pages 109–112. Association for Computational Linguistics.

Aron Culotta and Andrew McCallum. 2005. Reducing labeling effort for structured prediction tasks. In Association for the Advancement of Artificial Intelligence (AAAI), volume 5, pages 746–751.

Maud Ehrmann, Marco Turchi, and Ralf Steinberger. 2011. Building a multilingual named entity-annotated corpus using annotation projection. In Proceedings of the International Conference Recent Advances in Natural Language Processing 2011, pages 118–124.

Meng Fang and Trevor Cohn. 2017. Model transfer for tagging low-resource languages using a bilingual dictionary. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 587–593, Vancouver, Canada. Association for Computational Linguistics.

Dan Garrette and Jason Baldridge. 2013. Learning a part-of-speech tagger from two hours of annotation. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 138–147, Atlanta, Georgia. Association for Computational Linguistics.

Xianpei Han and Le Sun. 2012. An entity-topic model for entity linking. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 105–115. Association for Computational Linguistics.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 260–270, San Diego, California. Association for Computational Linguistics.

Maolin Li, Nhung Nguyen, and Sophia Ananiadou. 2017. Proactive learning for named entity recognition. In BioNLP 2017, pages 117–125, Vancouver, Canada. Association for Computational Linguistics.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064–1074. Association for Computational Linguistics.

Diego Marcheggiani and Thierry Artieres. 2014. An experimental comparison of active learning strategies for partially labeled sequences. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 898–906.

Stephen Mayhew, Chen-Tse Tsai, and Dan Roth. 2017. Cheap translation for cross-lingual named entity recognition. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2536–2545.

Graham Neubig, Yosuke Nakata, and Shinsuke Mori. 2011. Pointwise prediction for robust, adaptable Japanese morphological analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, pages 529–533. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Lawrence R Rabiner. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286.

Burr Settles and Mark Craven. 2008. An analysis of active learning strategies for sequence labeling tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1070–1079. Association for Computational Linguistics.

Yanyao Shen, Hyokun Yun, Zachary Lipton, Yakov Kronrod, and Animashree Anandkumar. 2017. Deep active learning for named entity recognition. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 252–256, Vancouver, Canada. Association for Computational Linguistics.

Matthias Sperber, Mirjam Simantzik, Graham Neubig, Satoshi Nakamura, and Alex Waibel. 2014. Segmentation for efficient supervised language annotation with an explicit cost-utility tradeoff. Transactions of the Association for Computational Linguistics, 2:169–180.

Erik F. Tjong Kim Sang. 2002. Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. In Proceedings of the 6th Conference on Natural Language Learning - Volume 20, COLING-02, pages 1–4, Stroudsburg, PA, USA. Association for Computational Linguistics.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.

Yuta Tsuboi, Hisashi Kashima, Hiroki Oda, Shinsuke Mori, and Yuji Matsumoto. 2008. Training conditional random fields using incomplete annotations. In Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1, pages 897–904. Association for Computational Linguistics.

Dittaya Wanvarie, Hiroya Takamura, and Manabu Okumura. 2011. Active learning with subsequence sampling strategy for sequence labeling tasks. Information and Media Technologies, 6(3):680–700.

Jiateng Xie, Zhilin Yang, Graham Neubig, Noah A. Smith, and Jaime Carbonell. 2018. Neural cross-lingual named entity recognition with minimal resources. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 369–379, Brussels, Belgium. Association for Computational Linguistics.

Bishan Yang and Tom M. Mitchell. 2016. Joint extraction of events and entities within a document context. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 289–299, San Diego, California. Association for Computational Linguistics.

David Yarowsky, Grace Ngai, and Richard Wicentowski. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of the First International Conference on Human Language Technology Research, pages 1–8. Association for Computational Linguistics.

Imed Zitouni and Radu Florian. 2008. Mention detection crossing the language barrier. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 600–609. Association for Computational Linguistics.