
Ranking Georeferences for Efficient Crowdsourcing of Toponym Annotations in a Historical Corpus of Alpine Texts

Janis Goldzycher, University of Zurich ([email protected])
Isabel Meraner, University of Zurich ([email protected])
Martin Volk, University of Zurich ([email protected])
Simon Clematide, University of Zurich ([email protected])

Abstract

This paper presents a simple method to rank georeference candidates to optimally support the workflow of a citizen science web application for toponym annotation in historical texts. We implement the general idea of efficient crowdsourcing based on human and artificial intelligence working hand in hand. For named entity recognition, we apply recent neural NER taggers based on pretrained language models. For named entity linking to geographical knowledge bases, we report on georeference ranking experiments testing the hypothesis that textual proximity indicates geographic proximity. Simulation results with online reranking that immediately integrates user verification show further improvements.

1 Introduction

Named entity recognition (NER) in texts (Nadeau and Sekine, 2007) is an established and crucial task in Information Extraction (Tjong Kim Sang and De Meulder, 2003; Weissenbacher et al., 2019). The recognition of toponym mentions, i.e., the detection of names of geographical entities of interest such as cities, mountains, rivers, and regions, typically relies either (a) on gazetteer lookup and rule-based pattern matching techniques, which are hand-crafted by language and domain experts, or (b) on supervised machine learning methods for sequence labeling, which need annotated task-specific in-domain training material for good performance. The main problems of NER in general are insufficient coverage of gazetteers or lack of in-domain training material, geo/non-geo ambiguities, and the number of entity classes that need to be distinguished.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Named entity linking (NEL) of toponyms is normally cast as a task consecutive to NER and consists in annotating each toponym mention with a unique identifier from a domain-specific knowledge base. This linking of toponyms, also known as toponym resolution (TR) (Leidner, 2007) or Geocoding¹ (Gritta et al., 2019), "grounds" the mentions in georeferences of geographic ontologies, which in turn provide points or complex polygons in a geographic coordinate system. These shapes can then be used for geovisualization of the toponyms on a map (Figure 1).

The main problems of toponym resolution are the ambiguity of toponym names (e.g., in Switzerland alone there are 12 mountains called Schwarzhorn) and, especially in the case of historical texts, the renaming of geographical entities over time and changes in the spelling of names, which lead to insufficient coverage of name variants even in large contemporary geographical ontologies. For our corpus, additional problems arise from multilinguality, (a) because many geographical entities genuinely have more than one name due to the multilingual cultural background of Switzerland, and (b) because we are dealing with a multilingual text corpus.

The historical corpus of Alpine texts for which we aim at a complete, fine-grained and precise toponym annotation consists of the early yearbooks of the Swiss Alpine Club (SAC) published since 1864 (Göhring and Volk, 2011). Mostly written in German and French, it contains mountaineering reports, scientifically oriented contributions written for an interested lay public, and club news, thus constituting a highly valuable domain-specific resource when geographically fully indexed and publicly available.

¹ The term Geotagging corresponds to NER tagging restricted to location names.


Figure 1: Efficient in-text validation of top-ranked toponym references in the crowdsourcing web interface where people can read pages from yearbooks and annotate (verify, add, delete) georeferences. A mouse click on the unverified toponym mention "Matterhorn" (yellow background) opens a map view where in the best case a second click concludes the verification.


Given the difficulties of toponym annotation in domain-specific historical texts, automatic toponym resolution methods are not able to achieve the desired performance. For NER using a domain-specific rule-based system, Kew et al. (2019) report a recall of 63% and a precision of 88% when evaluating on 1300 sentences sampled from the full corpus (1864-2015). A modern neural NER approach (Akbik et al., 2018) reaches a recall of 71% and a precision of 87% when using the output of the rule-based system as a silver-quality training corpus together with roughly 800 manual corrections. As the quality of NEL is bounded by NER, providing more and better training material is crucial for achieving higher performance. In order to do so, crowdsourcing toponym annotations seems promising given the positive experience of asking the SAC community to crowd-correct the OCR errors in this corpus (Clematide et al., 2016).

However, the task of toponym annotation is more complex and knowledge-intensive than OCR correction. Our goal is to provide an efficient workflow that ensures that automatic pre-annotation and human correction by citizen scientists profit from each other as early as possible. For NER, this means retraining the neural NER models regularly and updating the pre-annotations without interfering with already curated material. Recent neural NER taggers (Akbik et al., 2018) with language modeling pretraining have modest requirements for task-specific training material. In the interface, we additionally adapted our original correction workflow, where NER and NEL were hitherto closely intertwined, now allowing for corrections restricted to NER (mentions and toponym types) if preferred by the user.

For NEL, it means minimizing the user's effort to identify the correct georeference of a toponym mention. Ideally, the NEL component should (a) precompute all possible georeference candidates for a mention (taking into account typical spelling variations) in order to free users from performing time-consuming knowledge base queries on their own, and (b) rank these candidates such that the true reference appears first on the list. Figure 1 illustrates the intended setup for linking the mention "Matterhorn" to its intended georeference.²

Verifying a suggested georeference candidate by a single click is far less time-consuming than searching through a long unordered list (sometimes up to 70 candidates).
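To sketch requirement (a), the following minimal Python example precomputes candidate lists offline via normalized name lookup. The normalization rules and the gazetteer structure (entries carrying a list of name variants in entry.names) are illustrative assumptions, not the lookup actually used in our pipeline:

    import unicodedata

    def normalize(name):
        """Fold case, strip diacritics, and apply a few illustrative
        historical-spelling rules, e.g. 'Loetschenthal' -> 'lotschental'."""
        folded = unicodedata.normalize("NFKD", name.lower())
        folded = "".join(c for c in folded if not unicodedata.combining(c))
        return folded.replace("oe", "o").replace("ue", "u").replace("th", "t")

    def precompute_candidates(mentions, gazetteer):
        """Map each toponym mention to all gazetteer entries that share a
        normalized name variant, so no online query is needed later."""
        index = {}
        for entry in gazetteer:        # assumed: entry.names lists all name variants
            for variant in entry.names:
                index.setdefault(normalize(variant), []).append(entry)
        return {m: index.get(normalize(m), []) for m in set(mentions)}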

The remainder of this paper reports on simple and efficient methods to optimally rank georeference candidates for toponym annotation based on the principle of textual and geographical proximity (Buscaldi and Rosso, 2008; Buscaldi, 2011).

2 (Re)Ranking Georeference Candidates

We investigate two scenarios: (a) ranking candidates using automatically computed georeference candidates as the only evidence, and (b) dynamically reranking candidates by simulating a human validation process where the automatic rankings and the human corrections serve as iteratively improving evidence. The original ranking happens offline during automatic NEL. In contrast, the reranking happens online during the annotation process and can be done in the client's web browser.

The ranking algorithms for both scenarios rely on the hypothesis that textual proximity indicates geographic proximity (Buscaldi, 2011). Both scenarios make use of this hypothesis by applying a point system that rewards the target candidate (sitting at the center of a sliding window) that is geographically closest to a toponym candidate from a context position of the sliding window. Figure 2 illustrates how georeference candidate 2 of the ambiguous toponym Alphubel is rewarded twice due to its smallest distance to all other candidates in a context window of n = 1 toponyms. In other words, each context toponym "votes" for the target toponym candidate with the smallest distance.

² Our citizen science web application https://www.geokokos.ch currently features the offline ranking of georeferences.


"... Mönchjoch, erste Ueberschreitung. Alphubel und Alphubelpass, Feegletscher, Rumpfishorn, sämmtlich erste Besteigungen."

Figure 2: Candidate ranking with window size n = 1 of the text snippet shown above. Scores are assigned by building a set of candidate pairs for each toponym in the context of the target toponym and by rewarding target-context pairs with the smallest distance.


Ranking. More formally (see Algorithm 1), given a target toponym tt_i at position i, we determine the set of context toponyms {ct_{i-n}, ..., ct_{i-1}, ct_{i+1}, ..., ct_{i+n}}. The target candidate index tc_j ranges over all admissible georeference candidates of the target toponym tt_i, and for each context toponym ct_k, the context candidate index cc_kl ranges over all its admissible georeferences. The score of every target candidate tc_j is initialized with 0, and for each context position k, the score of the tc_j with the smallest distance among all target/context pairs (tc_j, cc_kl) is incremented by 1. Thus, for a given context size n containing 2n toponyms, 2n is the maximum candidate score in a window.
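The core of this voting step can be sketched in a few lines of Python. The candidate objects with id and coord (lat, lon) attributes are illustrative assumptions, and the haversine distance stands in for whatever distance measure the implementation actually uses:

    from math import radians, sin, cos, asin, sqrt

    def haversine_km(a, b):
        """Great-circle distance in km between two (lat, lon) points."""
        lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
        h = sin((lat2 - lat1) / 2) ** 2 + \
            cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * 6371 * asin(sqrt(h))

    def window_scores(target_candidates, context_toponyms):
        """Each context toponym casts one vote for the target candidate
        that is geographically closest to any of its own candidates."""
        scores = {tc.id: 0 for tc in target_candidates}
        for context_candidates in context_toponyms:
            if not context_candidates:
                continue                      # unlinked toponym, no vote
            best, _ = min(
                ((tc, cc) for tc in target_candidates for cc in context_candidates),
                key=lambda p: haversine_km(p[0].coord, p[1].coord))
            scores[best.id] += 1
        return scores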

Aggregating all window scores over a yearbook results in a single global score for each georeference. Our final scoring normalizes the yearbook scores of each georeference into the range [0, 1] and multiplies them with the score of each candidate georeference from the local window. In this way, the overall prominence of a georeference in a yearbook (in early years, each SAC yearbook had one mountain region as a main topic) is combined with the proximity in a "local story" told within the context window. Candidates are then sorted in descending order according to the final total score in order to produce the candidate ranking.
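A minimal sketch of this combination step, assuming max-normalization of the yearbook scores (the exact normalization is not spelled out above; this is one plausible choice):

    def combined_ranking(window_scores, yearbook_scores):
        """Multiply each candidate's local window score by its yearbook
        score normalized into [0, 1]; return ids sorted best-first."""
        top = max(yearbook_scores.values(), default=1) or 1
        final = {cid: ws * (yearbook_scores.get(cid, 0) / top)
                 for cid, ws in window_scores.items()}
        return sorted(final, key=final.get, reverse=True)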

"Wir fliegen mit dem Blick über den [Tschingelgletscher] hin und einen Moment verweilen wir bei der jähen Gneistafel des Lauterbrunner Breithorns, welches, scharf in seinen breiten Graten, uns nur kahle Platten zeigt und von dessen Fuss einige sekundäre Gletscher in's Lötschenthal herabhängen. Ueber [Ebene Fluh], Grosshorn und Gletscherhorn fliegen wir neuerdings hinweg, senken wieder den Blick in den grossen Ocean des [Aletschfirns] und stehen gebannt vor den scharfen Formen der Jungfrau [...]".

Figure 3: Candidate ranking and reranking for "Breithorn" in the text snippet shown above. Detected and linkable toponyms are in italics. Toponyms in brackets were not automatically linked with candidates and are thus not included in the ranking. But they are included in the reranking if they appear in the left window of the target toponym. Two target and four context candidates are too far away to be included in the map. Context-target candidate pairs with closest distances resulting in rewards are connected with thin green lines. The yearbook scores of the target entities are indicated above the entity's marker.

Dynamic Reranking. As soon as humans correct a georeference, new information is available that can be used to update and improve existing candidate rankings on the fly and to further minimize the effort of a crowd corrector. In order to assess the expected benefit, we define the following correction simulation strategy, which assumes that the user corrects all toponyms in the reading order of the text. Each time a user verifies a reference candidate, we update the candidate ranking of the following toponym.

The dynamic reranking is also based on window and yearbook scores and only differs in the following aspects: (a) The yearbook scores are not updated by the window scores. (b) The window score rewards by 10 points instead of 1 point if a verified candidate is involved. (c) The window score rewards by 3 points if only one candidate is involved.


Algorithm 1: Candidate Ranking
Input: pages P, window size n
Output: pages P

initialize yearbook scores ys
for each page p in P do
    for each target toponym tt_i in p do
        initialize window score map ws
        tc <- get_candidates(tt_i)
        ct <- get_context_toponyms(tt_i, n)
        for each context toponym ct_k in ct do
            cc_k <- get_candidates(ct_k)
            for each context candidate cc_kl in cc_k do
                for each target candidate tc_j in tc do
                    remember tc_j as tc_j* if it is closest to cc_kl
                end for
            end for
            ws[tc_j*] += 1
        end for
        update p with ws
    end for
    increment ys by ws
end for
update P with ys
sort candidates of P by ys and ws
return P

Figure 3 shows how the candidates for Breithorn are ranked and reranked. Both candidates are assigned the same window score, but the southern central candidate has a higher yearbook score and is thus ranked first by the ranking. When a user reads the sentence and adds and verifies Tschingelgletscher, which is close to the correct candidate, the dynamic reranking updates the candidate positions and ranks the correct candidate in first place.
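The modified reward schedule can be sketched as a drop-in replacement for the plain one-point vote in the window-scoring sketch above; how verification flags are tracked is an assumption of this sketch:

    def reranking_reward(context_candidates, verified):
        """Reward for a context toponym's vote in the online reranking:
        10 if a verified reference is involved, 3 if the toponym has only
        a single (unambiguous) candidate, 1 otherwise."""
        if verified:
            return 10
        if len(context_candidates) == 1:
            return 3
        return 1

In window_scores above, the line scores[best.id] += 1 would then become scores[best.id] += reranking_reward(context_candidates, verified).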

3 Ranking and Reranking Experiments

We test the quality of our ranking method on German pages of the yearbooks 1864 and 1874. Our NEL uses two different geographical ontologies: SwissNames3D³ for toponyms within Switzerland and GeoNames⁴ for all others. We made this choice in order to achieve maximal coverage in Switzerland and to avoid linking ambiguity due to multiple knowledge bases. For linking with SwissNames3D, only 23 relevant entity types out of 103 are used⁵; for GeoNames, 127 out of 676 entity types (feature codes) are used. Table 1 reports the number of toponym candidates and their ambiguity. Note that 30% of the toponyms cannot be resolved by the NEL and therefore do not contribute to the ranking.

³ https://shop.swisstopo.admin.ch/en/products/landscape/names3D
⁴ https://www.geonames.org
⁵ We exclude field names (traditional "Flurnamen" in German) due to their extensive ambiguity.
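The type filtering described above amounts to a whitelist per knowledge base. The concrete type names below are illustrative stand-ins (the text only states the counts), though MT, PK, GLCR, LK, and STM are real GeoNames feature codes:

    # Illustrative whitelists; the real lists contain 23 and 127 entries.
    SWISSNAMES_TYPES = {"Gipfel", "Huegel", "Gletscher", "See"}
    GEONAMES_CODES = {"MT", "PK", "GLCR", "LK", "STM"}

    def admissible(entry):
        """Keep a gazetteer entry only if its entity type is whitelisted,
        preferring SwissNames3D for Swiss toponyms, GeoNames otherwise."""
        if entry.source == "SwissNames3D":
            return entry.entity_type in SWISSNAMES_TYPES
        return entry.feature_code in GEONAMES_CODES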

For our experiments, we randomly sampled 20 pages from the yearbooks of 1864 and 1874 that contain at least 4 ambiguous toponyms and manually resolved all ambiguous cases. Additionally, we invested roughly one hour per page to verify or add other toponyms on the page.

Evaluation. Systematically evaluating NEL systems is still a challenging task (Rosales-Méndez, 2019). In our case, we focus on the improvement of the candidate ranking and therefore consider only the cases where the true georeference is actually one of the proposed candidates. Deleted or newly added toponyms do not appear in our evaluation statistics.

In Table 2, we report results for 3 different ranking conditions: Randomized (rand.) is a baseline that shuffles the candidates arbitrarily. Ranking (rank.) reports the outcome of the proximity ranking algorithm. Reranking (rerank.) shows the results of our dynamic reranking derived from the correction simulation. Our evaluation measure reflects the overall frequency of a correct georeference being ranked first (labeled as rank1), second (rank2), third (rank3), or below rank three (rank4+).

Additionally, we report the relative improvement of ranking in comparison to random shuffling, the improvement of reranking in comparison to ranking, and relative error reductions. Further, for comparability, we report the mean reciprocal rank (MRR) of the true references. For a given set of ranks of true references R, the MRR is computed as

$\mathrm{MRR}(R) = \frac{1}{|R|} \sum_{i=1}^{|R|} \frac{1}{R_i}$    (1)

with R_i in [1, 4], because we map all rank positions > 4 to 4 for consistency with the absolute ranks reported (rank1 to rank4+).
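Equation 1 with the rank capping is a one-liner; a minimal sketch:

    def mrr(ranks, cap=4):
        """Mean reciprocal rank with all ranks above 4 mapped to 4 (Eq. 1)."""
        return sum(1 / min(r, cap) for r in ranks) / len(ranks)

    # e.g. mrr([1, 1, 2, 5]) == (1 + 1 + 0.5 + 0.25) / 4 == 0.6875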

An important hyperparameter of our approach is the sliding window size n. We evaluated our ranking system with values between 1 and 10 and decided to use a size of 4, which is efficient to compute and performs as well as larger windows. Table 3 shows the rank1 and MRR results for varying window sizes.
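Such a sweep is straightforward to script; rank_pages and evaluate_rank1 below are hypothetical wrappers around the ranking algorithm and the evaluation, not functions from our codebase:

    def pick_window_size(test_pages, sizes=(1, 2, 3, 4, 5, 10)):
        """Return the smallest window size whose rank1 count is within
        one point of the best observed value (hypothetical helpers)."""
        results = {n: evaluate_rank1(rank_pages(test_pages, n)) for n in sizes}
        best = max(results.values())
        return min(n for n, r in results.items() if r >= best - 1)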

Discussion. We see that the simple ranking algorithm works quite well in general. Especially for the yearbook sample of 1864, there is a stark relative improvement over random shuffling of 267%.


                       1864                   1874
                   SN     GN     Σ        SN     GN     Σ
 # topo                        2496                   3618
 topo w/o georef                749                   1078
 topo w/ georef   1594    153  1747     2102    438   2540
  -ambig          1362     69  1431     1835    246   2081
  +ambig           232     84   316      267    192    459
  +ambig (in %)     15     55    18       13     44     18

Table 1: Toponym statistics for our automatic NER and NEL in German articles of the yearbooks 1864 and 1874 using the geographical databases SwissNames3D (SN) and GeoNames (GN).

                         1864                            1874
                rand      rank     rerank      rand      rank     rerank
                 #   %     #   %     #   %      #   %     #   %     #   %
 rank1          30  34    80  90    85  96     11  22    18  37    34  71
 rank2          27  30     3   3     2   2     21  43    22  45    12  25
 rank3          12  14     1   1     1   1      5  10     3   6     0   0
 rank4+         20  23     5   6     1   1     12  25     6  12     3   4
 total          89        89        89         49        49        49
 MRR           .59       .93       .97        .53       .64       .84
 Rel. impr. rank1        267       106                  164       189
 Rel. error reduction     85        56                   18        55

Table 2: Evaluation of candidate ranking methods based on the textual and geographical proximity hypothesis. The context window size for these results is 4. Relative improvement is computed on rank1 results (comparing rand to rank and rank to rerank). Relative error reduction is analogous.

Reranking then cannot improve much more on top of that. For 1874, ranking works decently but leaves many true georeferences in second position. Reranking almost doubles the number of rank1 rankings. Reranking also reduces the number of rank3 and rank4+ cases in comparison to ranking. The poor ranking performance in 1874 is probably due to several ambiguous toponym occurrences where many of the surrounding named entities were initially not found by the NER component. The reranking based on incremental user corrections alleviates this problem.

It is interesting to note that, by qualitatively inspecting reranking errors, we could detect several errors in the initial ground truth. A next step for improving the ranking is probably the inclusion of external prominence features (population size, existence of a Wikipedia page, etc.) that are directly available from some of our geographical knowledge bases.

           1864                              1874
 n         1    2    3    4    5   10        1    2    3    4    5   10
 rank1    75   80   83   85   84   84       35   30   35   34   33   33
 MRR     .91  .94  .96  .97  .97  .97      .85  .80  .85  .84  .83  .84

Table 3: Comparison of reranking performance with increasing values for window size n.

4 Conclusion

We have shown that a simple ranking approach using a sliding window of size 4 is an effective way to place the intended georeferences in top positions in our two test sets. The quality of our automatic pre-annotation of historical texts is low enough to profit from a dynamic reranking that integrates human verification as early as possible into the georeference suggestions prominently presented to the user. In crowdsourcing, human and artificial intelligence should work hand in hand in order to efficiently produce high-quality annotations. The disambiguation of rank1 cases using a two-click verification speeds up the process and leaves more time for citizen scientists to address the difficult toponym resolution problems that need real detective work.

References

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual String Embeddings for Sequence Labeling. In Proceedings of the 27th International Conference on Computational Linguistics (COLING), pages 1638-1649, Santa Fe, NM, USA.

Davide Buscaldi. 2011. Approaches to disambiguating toponyms. SIGSPATIAL Special, 3(2):16-19.

Davide Buscaldi and Paolo Rosso. 2008. Map-based vs. Knowledge-based Toponym Disambiguation. In Proceedings of the 5th Workshop on Geographic Information Retrieval (GIR), pages 19-22.

Simon Clematide, Lenz Furrer, and Martin Volk. 2016. Crowdsourcing an OCR Gold Standard for a German and French Heritage Corpus. In Proceedings of the 10th Language Resources and Evaluation Conference (LREC), Portorož, Slovenia.

Ann Göhring and Martin Volk. 2011. The Text+Berg Corpus: An Alpine French-German Parallel Resource. In Proceedings of the 18th Traitement Automatique des Langues Naturelles Conference (TALN), Montpellier, France.

Milan Gritta, Mohammad Taher Pilehvar, and Nigel Collier. 2019. A pragmatic guide to geoparsing evaluation. Language Resources and Evaluation, pages 1-30.

Tannon Kew, Anastassia Shaitarova, Isabel Meraner, Janis Goldzycher, Simon Clematide, and Martin Volk. 2019. Geotagging a diachronic corpus of alpine texts: Comparing distinct approaches to toponym recognition. In Proceedings of the Workshop on Language Technology for Digital Historical Archives, pages 11-18, Varna, Bulgaria. INCOMA Ltd.

Jochen L. Leidner. 2007. Toponym Resolution in Text: Annotation, Evaluation and Applications of Spatial Grounding of Place Names. Ph.D. thesis, University of Edinburgh, Edinburgh, UK.

David Nadeau and Satoshi Sekine. 2007. A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3-26.

Henry Rosales-Méndez. 2019. Towards better entity linking evaluation. In Companion Proceedings of The 2019 World Wide Web Conference, WWW '19, pages 50-55, New York, NY, USA. Association for Computing Machinery.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 Shared Task: Language-independent Named Entity Recognition. In Proceedings of the 7th Conference on Natural Language Learning (CoNLL) at HLT-NAACL, Volume 4, pages 142-147, Edmonton, Canada.

Davy Weissenbacher, Arjun Magge, Karen O'Connor, Matthew Scotch, and Graciela Gonzalez-Hernandez. 2019. SemEval-2019 Task 12: Toponym resolution in scientific papers. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 907-916, Minneapolis, Minnesota, USA. Association for Computational Linguistics.