

Multi-lingual Concept Extraction with Linked Data and Human-in-the-Loop

Alfredo Alba
IBM Research Almaden, CA, US
[email protected]

Anni Coden
IBM Watson Research Lab, NY, US
[email protected]

Anna Lisa Gentile
IBM Research Almaden, CA, US
[email protected]

Daniel Gruhl
IBM Research Almaden, CA, US
[email protected]

Petar Ristoski
IBM Research Almaden, CA, US
[email protected]

Steve Welch
IBM Research Almaden, CA, US
[email protected]

ABSTRACT

Ontologies are dynamic artifacts that evolve both in structure and content. Keeping them up-to-date is a very expensive and critical operation for any application relying on semantic Web technologies. In this paper we focus on evolving the content of an ontology by extracting relevant instances of ontological concepts from text. We propose a novel technique which is (i) completely language independent, (ii) combines statistical methods with human-in-the-loop and (iii) exploits Linked Data as bootstrapping source. Our experiments on a publicly available medical corpus and on a Twitter dataset show that the proposed solution achieves comparable performances regardless of language, domain and style of text. Given that the method relies on a human-in-the-loop, our results can be safely fed directly back into Linked Data resources.

CCS CONCEPTS

• Computing methodologies → Information extraction;

ACM Reference Format:
Alfredo Alba, Anni Coden, Anna Lisa Gentile, Daniel Gruhl, Petar Ristoski, and Steve Welch. 2017. Multi-lingual Concept Extraction with Linked Data and Human-in-the-Loop. In Proceedings of K-CAP 2017: Knowledge Capture Conference (K-CAP 2017). ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3148011.3148021

1 INTRODUCTION

By their very nature real world ontologies are dynamic artifacts, and ontology evolution poses major challenges for all applications that rely on semantic Web technologies. Ontologies evolve both in their structure (the data model) and their content (instances), and keeping them up-to-date can be quite expensive. In this paper we focus on a computer/human partnership to more rapidly evolve the content of an ontology through extraction of new relevant concepts from text. The atomic operation behind this population step is the discovery of all instances that belong to each concept. A plethora of solutions have been proposed to extract relevant terminology or dictionaries from both unstructured text [9, 13, 15, 22] and semi-structured content [10, 29]. The need to constantly update ontologies, dictionaries and terminologies is well known.

K-CAP 2017, December 4–6, 2017, Austin, TX, USA
© 2017 Association for Computing Machinery.
This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in Proceedings of K-CAP 2017: Knowledge Capture Conference (K-CAP 2017), https://doi.org/10.1145/3148011.3148021.

As a motivating example, online shops must integrate new product descriptions provided by vendors on a daily basis. The features and vocabulary used to describe the products continuously evolve, with different vendors providing the product descriptions in varied writing styles and standards. Despite these differences, to fully integrate new products (e.g., be able to provide meaningful "comparison shopping" grids), merchants must correctly identify and assign equivalences to all these instances. As another example consider medical surveillance. Medical reports need to be scanned for clinical issues, e.g. for adverse drug reactions which might be caused by prescription drugs. New drugs are constantly approved and available on the market, therefore using an obsolete drug dictionary to identify them would miss all new products. These are, coincidentally, some of the most important to surveil. Another source of valuable information in terms of pharmacovigilance is user-generated content, for example focused online communities such as Ask a Patient¹ or general ones like Twitter [13], where drugs, symptoms and reactions are expressed in many varied ways by different users. The evolution of dictionaries is not confined to products (or other naturally growing sets). Even concepts that we would assume as simple and stable, for example color names, are constantly evolving. The way color names change in different languages can be quite dissimilar, given the cultural differences in how we express them in different countries. For instance, a new color name, mizu, has recently been proposed for addition to the list of Japanese basic color terms [12]. On a more practical level, capturing the right instances for a concept can also be highly task-dependent: as our users learned during the experiment, "space gray", "matte black" and "jet black" are all relevant colors for mobile phones, while "white chocolate" or "amber rose" are colors of wall paint products.

While the task of ontology population from text has been extensively addressed in previous research, we suggest that for many tasks fully automated approaches are not effective, mainly due to the negative effect of semantic drift. Moreover, the majority of Information Extraction techniques are language dependent, i.e. they rely on Natural Language Processing (NLP) operations and tools that are language specific, such as parsing, part of speech tagging, etc.

We propose glimpseLD, a novel solution that builds upon our previous work [5] and revolves around three main aspects: (i) it is a statistical method which extracts concept instances based on context patterns; (ii) it relies on human feedback on the extracted items to automatically tune scores and thresholds for the extraction patterns; (iii) it uses Linked Data (LD)² - when available, even in small quantities - to bootstrap the process.

1 http://www.askapatient.com/

The contribution of this work is threefold. First, we show that the performance of the original glimpse algorithm [5] is language independent. To do so, we design a novel testing strategy that systematically simulates the human-in-the-loop, exploiting parallel corpora. We demonstrate that glimpse has similar performances in all languages. Second, we show that glimpseLD (glimpse with Linked Data), which exploits Linked Data in the bootstrapping phase, (i) maintains the same comparable performances in all languages, (ii) is robust with respect to the choice of seeds and (iii) reduces the number of required human-in-the-loop iterations by at least half. While the subject matter expert can still provide initial seed(s), Linked Data is used to suggest further candidates to the user, easing the hurdle of the initial iterations. We show that glimpseLD significantly speeds up the concept discovery process. Finally, we demonstrate that glimpseLD is robust with respect to different text styles: specifically we prove efficacy in extracting color names from a Twitter dataset in multiple languages and show that it can extract additional color terms not already available on Linked Data. We prove that, despite the richness of Linked Data, there are always new data to be extracted from unstructured text.

The advantage of the method is that it is fully independent with respect to language, domain and style of the corpus. The human-in-the-loop has the power to drive the extraction of concepts towards their semantic interpretation, thus ensuring the domain membership of extracted concepts and promptly discarding non-pertinent members - avoiding the propagation of erroneous extraction patterns - while Linked Data serves as a powerful bootstrapping tool. Whilst we prove the robustness of the method with a synthetic experiment on a parallel corpus, we also test its efficacy in real user experiments and prove that a subject matter expert can produce quality dictionaries for different semantic classes. With respect to the available state of the art, we propose a method with a strong partnership between human and machine, and consider the human an integral part of the learning process with the power of steering the semantic drift at any point in the process.

In the following we explore related work (Section 2), present our human-in-the-loop solution for concept extraction (Section 3), which we test with extensive experiments (Section 4), and finally conclude with lessons learned and future work (Section 5).

2 STATE OF THE ART

There is a vast amount of literature devoted to ontology population from text, with a number of established initiatives to foster research on the topic, such as the Knowledge Base Population task at TAC³, the TREC Knowledge Base Acceleration track⁴, and the Open Knowledge Extraction Challenge [19] to name a few. In these initiatives, systems are compared on the basis of recognizing individuals belonging to a few selected ontology classes, spanning from the common Person, Place and Organization [31], to more specific classes such as Facility, Weapon, Vehicle [7], Role [19] or Drug [28], among others. The evaluation focus is usually on the specific sub-tasks involved in the process, such as Entity Recognition, Linking and Typing. Several solutions have been proposed in the literature, spanning from general purpose comprehensive approaches [9] to more domain-specific ones [13, 15, 22]. FRED [9] is an established example of a comprehensive solution to the problem: it converts text into linked-data-ready ontologies, transforming text into an internal ontology representation and then attempting to align it with available Linked Data. It is a general purpose machine reader, mostly based on core NLP tools, which can potentially process text from any domain. Many works rely on machine learning techniques and tailor the algorithms to certain specific domains (e.g. drugs): these methods are in general expensive, requiring an annotated corpus and/or language specific feature extraction (a comprehensive overview can be found in [21]). Other works rely on statistical methods to iteratively grow the number of ontology entries starting with seed knowledge [3, 24], but they require NLP tools, POS-tagging at a minimum, and are therefore bound to the language. Moreover, without a human-in-the-loop, iterative methods can easily generate semantic drift.

2 http://linkeddata.org/
3 http://www.nist.gov/tac/2015/KBP
4 http://trec-kba.org/

The majority of available methods operate (and are assessed) on the English language, and although specific initiatives are aimed at encouraging replicable studies in other languages [1], we argue that truly language-independent methods for this task are not yet widespread. Sahlgren et al. [27] address the task of building multilingual lexica, but their method requires aligned corpora in each language for which the lexicon is to be constructed. They achieve good results for English and German, although failing on words appearing with low frequency (< 100). Pappu et al. [20] use Conditional Random Fields (CRF) exploiting linguistic features such as part-of-speech tagging and other morphological features. They test their method for English, Spanish and Chinese, with varying results depending on the language. Ben Abacha et al. [2] also propose a CRF approach for drug name extraction based on linguistic features. The drawback of these methods is that although the proposed features can be extracted for many languages, relying on NLP tools guarantees neither out-of-the-box portability to different languages nor to different domains. Furthermore, it is not clear if the methods can be applied to un-grammatical text.

Indeed, one of the major challenges with concept extraction involves dealing with chaotic text, given the importance of user generated content, which can prove to be an extremely valuable source of information for many domains, pharmacovigilance being one of those⁵. To this end, Lee et al. [13] propose a semi-supervised model which uses a random Twitter stream as unlabeled training data and prove it successful for the recognition of Adverse Drug Reactions. Another hurdle is the fact that the dictionary to be created can be highly dependent on the task at hand, especially when dealing with positive/negative words which are highly domain-dependent [11, 22]. While completely automatic techniques are highly appealing, they need to be fine-tuned for every new task. We propose a human-in-the-loop approach where the "tuning" is an integral part of the process, i.e. the human works in partnership with the statistical method to drive the semantics of the task effectively and efficiently.

5 PSB2016 is a recent benchmarking initiative on the problem: http://diego.asu.edu/psb2016/sharedtaskeval.html.

The last contribution of our work is the usage of Linked Data to inform the extraction process. Linked Data has been vastly explored for many Information Extraction tasks and specifically for concept extraction. Dolby et al. [8] exploit type information found in Linked Data in their statistical Named Entity Recognition system, with promising results when applied to a grammatically correct English corpus. Mitzias et al. [17] exploit Linked Data from multiple sources to populate a target ontology. Their system retrieves appropriate instances and inserts them into the target ontology with a human-in-the-loop step for mapping model properties. The applicability of the model to other languages has not been explored in the paper. Moreover, the method requires that all instances are already present as Linked Data, whereas our proposed method uses Linked Data to bootstrap the process and collects more data from unstructured text. A similar bootstrapping technique has been used to extract gazetteers from semi-structured content [10], with the difference that structural Web page information has been exploited in the process.

With this work we tackle three specific aspects of concept extraction: multi-linguality, the usage of Linked Data to inform the extraction, and the organic integration of the human-in-the-loop. We show that our algorithm is truly language independent, performing with similar accuracy in all languages, as well as being robust on non-grammatical text and different semantic classes. To the best of our knowledge, ours is the first approach to exhibit such language independence.

3 CONCEPT EXTRACTION: THE GLIMPSE APPROACH

Glimpse is a statistical algorithm for dictionary extraction based on SPOT [5] with a faster underlying matching engine. The input is a large text corpus whose content is relevant to the domain of the concept to be extracted. For example, one would choose a corpus of medical documents to find drugs or a general corpus (e.g. content from Twitter) to find colors. Besides the corpus, glimpse needs one or more examples (seeds) of the concept instances to extract. Starting from these it evaluates the contexts (the set of words surrounding an item) in which the seeds occur and identifies "good" contexts. Contexts are scored retrospectively in terms of how many "good" results they generate. All contexts which score over a given clip level are kept, and the candidates that appear in the most "good" contexts are presented first (more details on the scoring function can be found in [5]). "Good" contexts are used to identify further terms or phrases in the corpus: these are presented to a human as new concept candidates - the ones accepted by the subject matter expert are (i) added to the dictionary and (ii) used as additional seeds for the next iteration. The algorithm also learns from the rejected items - their contexts are down-voted as "not good". The steps of finding new good contexts and additional candidate terms are repeated until some threshold (i.e. saturation) is reached and/or no new terms are found.
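One iteration of this loop can be sketched as follows. The whitespace tokenization, the context representation (a fixed window of surrounding words) and the hit-ratio scoring are simplified stand-ins for the actual scoring function of [5]; `seeds`, `accepted` and `rejected` are assumed to be sets of lowercase single-token terms.

```python
from collections import Counter

def glimpse_iteration(corpus, seeds, accepted, rejected, window=2, clip=0.5):
    """One sketch iteration: score contexts by how often they surround
    known-good terms, keep those over the clip level, then rank unseen
    candidates by how many good contexts they occur in."""
    good = seeds | accepted
    context_hits, context_total = Counter(), Counter()
    tokenized = [sent.lower().split() for sent in corpus]

    # Score each context (words around a token) by the fraction of its
    # occurrences that wrap an already-known good term.
    for tokens in tokenized:
        for i, tok in enumerate(tokens):
            ctx = (tuple(tokens[max(0, i - window):i]),
                   tuple(tokens[i + 1:i + 1 + window]))
            context_total[ctx] += 1
            if tok in good:
                context_hits[ctx] += 1

    good_ctx = {c for c in context_hits
                if context_hits[c] / context_total[c] >= clip}

    # Candidates are unseen, non-rejected tokens appearing in good contexts,
    # ranked by how many good contexts they occur in.
    candidates = Counter()
    for tokens in tokenized:
        for i, tok in enumerate(tokens):
            if tok in good or tok in rejected:
                continue
            ctx = (tuple(tokens[max(0, i - window):i]),
                   tuple(tokens[i + 1:i + 1 + window]))
            if ctx in good_ctx:
                candidates[tok] += 1

    return [term for term, _ in candidates.most_common()]
```

The returned ranking is what would be shown to the human (or synthetic user); accepted terms are folded into `accepted` and rejected ones into `rejected` before the next call.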

The human-in-the-loop model is quite conducive to many "real-world" scenarios. However, it is expensive to evaluate the operation of an algorithm across many different languages, semantic contexts and types of text, as many gold-standard corpora would need to be created. Towards this end we developed an automatic evaluation method which shows the "discovery growth" of dictionary items as a function of the number of iterations; this is a good approximation to the real world problem of finding concept surface forms: in a real scenario - where no gold standard is available, but the correctness of extraction is assured by the human-in-the-loop - the ratio of new correct terms added at each iteration is a useful indication of performance. In a large enough corpus (e.g., Twitter) there is not and cannot be a complete "gold standard" as language is always evolving. The best bet is to rapidly grow a dictionary to capture new terms as they emerge.

In this work we introduce a methodology to automatically test that glimpse is independent of the language and the style of the corpus, as well as of the desired concept extraction type. We reuse the concept of a "synthetic user" introduced in [4] for evaluating human-in-the-loop performance. The synthetic user "knows" all the answers (i.e., all the items belonging to the target concept) but does not share them a priori. Clearly, concept instances which are not in the corpus cannot be discovered. To define the synthetic user we utilize an established dictionary in the domain and determine the subset of it which has mentions in the corpus - the oracle. The algorithm then works as before, however it is not a human who accepts or rejects proposed candidates but the synthetic user, who only accepts terms in its oracle.
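A minimal sketch of this construction, assuming single-token terms and whitespace tokenization (the paper's matching may be richer): the oracle is the subset of the reference dictionary actually mentioned in the corpus, and the synthetic user accepts exactly those terms.

```python
def build_synthetic_user(reference_dictionary, corpus):
    """Build the oracle and the accept/reject function of a synthetic
    user. Terms absent from the corpus can never be discovered, so they
    are excluded from the oracle up front."""
    corpus_vocab = {tok for sent in corpus for tok in sent.lower().split()}
    oracle = {t for t in reference_dictionary if t.lower() in corpus_vocab}

    def synthetic_user(candidate):
        # Accept a proposed candidate only if it is in the oracle.
        return candidate.lower() in oracle

    return oracle, synthetic_user
```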

Specifically, in the multilingual experiment (which is further described in Section 4) we require that each item in the multi-language oracle has fairly similar frequencies in the target corpora. As a consequence we do not include some otherwise correct terms, therefore some "good" contexts are rejected despite being semantically correct. However, since the corpora are parallel, this "penalty" is language independent. Our experiments show that the concept discovery growth using glimpse is nearly identical for all languages.

Furthermore, we present glimpseLD, which uses knowledge from Linked Data to bootstrap the algorithm. We seed glimpseLD with initial seed terms of the same type as the target concept from LD. Increasing the number of initial and relevant seed terms can significantly improve the effectiveness and efficiency of the algorithm (i.e. allowing it to extract a higher number of terms in the corpus in fewer human iterations). It is worth noting that retrieving relevant seeds from LD can be performed in several ways: starting with a handful of user-defined seeds and searching for similar items in LD, letting the user explore the ontology model, etc. As shown in [26], there are many tools and approaches that can be used for linking string terms to a given LD dataset, such as the DBpedia Spotlight tool [16], or pattern-based and label-based approaches [25]. Furthermore, as shown in [32], Linked Data is rich with multilingual semantic resources, which can be exploited for the multilingual settings of glimpseLD. However, a deep discussion of this topic is outside the scope of this paper. For the purpose of our experiments we identify the set of relevant types, query for all their instances, and extract their labels in the user-defined language(s) to use as seed terms.
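As one illustration of the "query for instances, extract labels" step, the function below builds a SPARQL query fetching drug labels from DBpedia in a chosen language. The class dbo:Drug and the query shape are our assumptions for illustration, not necessarily the exact query used in the experiments.

```python
def drug_seed_query(lang="en", limit=500):
    """Build a SPARQL query retrieving labels of DBpedia drug instances
    in the requested language, to be used as bootstrap seeds."""
    return f"""
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?label WHERE {{
  ?drug a dbo:Drug ;
        rdfs:label ?label .
  FILTER (lang(?label) = "{lang}")
}}
LIMIT {limit}
"""
```

Running the same query with a different language tag (`"de"`, `"es"`, `"it"`) yields the seed lists for the other target languages.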



4 EXPERIMENTS

We run a series of experiments aimed at testing the different contributions of this work. First we test the language independent capabilities of glimpse. We use parallel corpora in the medical domain, with the task of creating a dictionary of drugs. We then test the performance of glimpseLD on the same parallel corpora: we use Linked Data to seed the discovery (instead of manual seeds) and we evaluate how fast we can converge to a desired dictionary for each of the considered languages⁶. Finally, we perform a true human-in-the-loop experiment to quantify, given a concept, how many new instances can be discovered, i.e. how glimpseLD can be used to assist the task of concept (or dictionary) expansion and maintenance. Additionally, we test that glimpseLD is also robust independently of the style of writing. For this purpose we use a corpus of tweets in several languages and use glimpseLD to construct a dictionary of color names. We choose "color" as concept as it is a simple enough domain for which we can recruit native speakers of the different languages for the human-in-the-loop experiment. In all the experiments we evaluate the performance with the discovery growth, i.e. at each iteration we quantify how many new instances we add to the dictionary, relative to the initial number of seeds. While accuracy can be calculated in the presence of a gold standard, the discovery growth is a useful indication of performance in a real scenario, where no gold standard is available, but correctness of extraction is assured by the human-in-the-loop.
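The discovery growth metric as described can be computed directly from the dictionary size at each iteration; this is a straightforward reading of the definition in the text, not code from the paper.

```python
def discovery_growth(dictionary_sizes, num_seeds):
    """Discovery growth per iteration: the number of instances added to
    the dictionary so far, relative to the initial number of seeds.
    `dictionary_sizes` holds the dictionary size after each iteration."""
    return [(size - num_seeds) / num_seeds for size in dictionary_sizes]
```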

In the following we give details about all the datasets that we use (Section 4.1) and describe our experimental settings (Section 4.2).

4.1 Datasets

4.1.1 EMEA dataset. EMEA⁷ (European Medicines Agency documents) is a parallel corpus comprised of PDF documents from the European Medicines Agency, related to medicinal products, and their translations into 22 official languages of the European Union. The documents have been sentence aligned within the OPUS project [30]. The strength of the EMEA corpus is that it is a nearly parallel corpus in many languages. The reason we say "nearly" is that not all documents are present in all languages. However, the desideratum for us is that the statistical properties of the terms which our algorithm is supposed to find are very similar. Note that for some semantic classes some languages use two words (e.g. blood pressure in English) versus a single word (e.g. Blutdruck in German) - so the frequency of the word blood / Blut would be quite different in the two languages even when they are translated word for word. We select the English, Spanish, Italian and German portions of the dataset and use them for the task of constructing a dictionary of drugs in the various languages. This parallel dataset has been selected with the aim of creating a "clean testing environment" and effectively obtaining a gold standard, but it is worth specifying that the technique is not bound to the existence of such resources.

4.1.2 Twitter dataset. Twitter⁸ is one of the most popular microblogging platforms on the Web and provides a huge collection of brief user generated text updates (tweets). Given the personal nature of the content and the limited size of each message (max 140 characters), the text style usually does not follow strict grammatical rules and is often cryptic and chaotic. Although the majority of tweets are in English, tweets in many different languages are also available. We build a collection of tweets in our 4 target languages, with the task of extracting a dictionary from all of them. We choose "colors" as a simple concept to extract. We collected tweets posted in the period between the 1st and the 14th of January 2016, written in English, German, Spanish and Italian⁹, which contain at least one mention of a color in the respective language (we use both Wikidata and DBpedia as gold standard lists of colors to select the tweets) - this is to create a manageable size collection and to re-create a somewhat "focused" corpus. Given that some languages (English in particular) have many more tweets than others, we make sure that the size of the datasets in different languages is balanced: we process tweets one day at a time and downsize all language chunks to the number of tweets in the smallest collection, randomly selecting the tweets in the bigger languages. The final dataset contains 155,828 tweets per language.

6 The target languages have been chosen based on the availability of a native speaker in the team who could participate in the experiments and analyze the results.
7 http://opus.lingfil.uu.se/EMEA.php
8 https://twitter.com/
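The day-by-day balancing procedure can be sketched as follows; the input layout (`day -> language -> list of tweets`) and the fixed random seed are our assumptions for illustration.

```python
import random

def balance_per_day(tweets_by_lang_day, seed=42):
    """For each day, downsample every language's tweets to the size of
    that day's smallest language collection, sampling without
    replacement, and accumulate per-language balanced corpora."""
    rng = random.Random(seed)
    balanced = {lang: [] for by_lang in tweets_by_lang_day.values()
                for lang in by_lang}
    for day, by_lang in tweets_by_lang_day.items():
        n = min(len(tweets) for tweets in by_lang.values())
        for lang, tweets in by_lang.items():
            balanced[lang].extend(rng.sample(tweets, n))
    return balanced
```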

4.1.3 Gold Standard: human-in-the-loop and synthetic user. In Section 3 we introduced the concept of the "synthetic user", i.e. an oracle acting as the human-in-the-loop to allow for exhaustive exploration of the impact of changes to the system without requiring very large human user studies. We build several synthetic users: for the drug extraction scenario we experiment both with (i) RxNorm¹⁰, a standard drug dataset, and (ii) drug collections obtained from Linked Data (specifically we use DBpedia and Wikidata). While with RxNorm we can only build an English oracle, with Linked Data we can obtain oracles in all the target languages. For the Twitter experiment, given the semantic simplicity of the domain, we perform a true human-in-the-loop evaluation for all languages.

Figure 1: Discovery growth for glimpse for English (en), Italian (it), Spanish (es) and German (de) starting with one seed (the drug irbesartan), using RxNorm as gold standard for the synthetic user. Average correlation amongst all languages r = 0.998.

9 The language is identified based on the twitter_lang field.
10 https://www.nlm.nih.gov/research/umls/rxnorm/



(a) EMEA English (r = 0.991)  (b) EMEA German (r = 0.995)  (c) EMEA Spanish (r = 0.994)  (d) EMEA Italian (r = 0.996)

Figure 2: Discovery growth for glimpseLD (with 5-fold cross validation) on the EMEA dataset using DBpedia as seeds. Each plot (2a for English, 2b for German, 2c for Spanish and 2d for Italian) shows the discovery growth for each of the 5 randomly generated folds and reports the Pearson correlation (r) amongst them.

(a) EMEA English (r = 0.991)  (b) EMEA German (r = 0.990)  (c) EMEA Spanish (r = 0.994)  (d) EMEA Italian (r = 0.990)

Figure 3: Discovery growth for glimpseLD (with 5-fold cross validation) on the EMEA dataset using Wikidata as seeds. Each plot (3a for English, 3b for German, 3c for Spanish and 3d for Italian) shows the discovery growth for each of the 5 randomly generated folds and reports the Pearson correlation (r) amongst them.

4.2 Experimental settings

4.2.1 Multilingual drug extraction with synthetic user. We use the EMEA corpus and run glimpse with a synthetic user. As the synthetic user for this experiment we build a gold standard dataset of drugs using RxNorm, a resource providing normalized names for clinical drugs and linking those names to many of the drug vocabularies commonly used in pharmacy management and drug interaction software. As RxNorm is an English dataset, from the full list of drugs we select only those that appear in EMEA in all four selected languages. This amounts to 363 terms that are the same (and have the same distribution) in all languages, although their occurrence patterns are of course language dependent.
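The filtering step — keeping only gold-standard terms attested in every language's corpus — can be sketched as follows. A simplified illustration: the function name is an assumption, and a real implementation would match on token boundaries rather than raw substrings.

```python
def terms_in_all_corpora(terms, corpora):
    """Keep only the gold-standard terms that occur in every language's
    corpus. `corpora` maps language code -> lowercased corpus text."""
    kept = []
    for term in terms:
        t = term.lower()
        # a term survives only if it appears in all language corpora
        if all(t in text for text in corpora.values()):
            kept.append(term)
    return kept
```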

The following example uses two parallel sentences from the EMEA corpus, one in English and one in Italian, and illustrates the occurrence of the term pravastatin in the two languages. While the target term is the same, its context is highly language dependent:

    Plasma elimination half-life of oral pravastatin is 1.5 to 2 hours.



    L'emivita plasmatica di eliminazione del pravastatin orale è compresa tra un'ora e mezzo e due ore.

Figure 1 shows the discovery growth in the EMEA corpus for all languages, using a synthetic user defined via RxNorm. Starting with only one seed in every language (specifically the drug irbesartan), the behavior of glimpse is homogeneous across languages, with similar concept growth at each iteration. The average Pearson correlation amongst the results in all languages is above 0.99.
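The cross-language agreement reported here is a Pearson correlation between discovery-growth curves, averaged over all language pairs. A small self-contained sketch (function names are illustrative):

```python
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def avg_pairwise_correlation(curves):
    """Average Pearson r over all pairs of per-language growth curves.
    `curves` maps language code -> list of dictionary sizes per iteration."""
    pairs = list(combinations(curves.values(), 2))
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)
```

A high average r says the dictionary grows at the same pace in every language, even though the discovered surface forms differ.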

4.2.2 Multilingual drug extraction with Linked Data and synthetic user. We build a synthetic user by crawling relevant Linked Data in all target languages, making sure we cover the same drugs in all languages. In particular, we use two of the biggest cross-domain LOD datasets, DBpedia [14] and Wikidata [33]. We select all the entities of type dbo:Drug^11 from DBpedia and all the entities of type wikidata:Q11173 from Wikidata. For all of the selected entities, we retrieve the corresponding labels in English, German, Spanish and Italian and consider this our gold standard dictionary. We then select 20% of this gold standard as seeds for each of the languages and measure the performance of recreating the remaining 80% using glimpseLD. We perform 5-fold cross validation without repetition, randomly selecting 20% of seeds at each iteration (making sure that the seeds represent the same drugs for all 4 languages), to test whether the choice of initial seeds impacts the results. Figures 2 and 3 show that the algorithm behaves the same independently of the selection of seeds, using DBpedia and Wikidata respectively as gold standard, for the four languages. Figure 4 shows the average of the 5-fold experiments for each language in a single plot. The discovery growth is comparable for all languages, with correlation always above 0.98.
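The 20%/80% split, repeated over 5 disjoint folds and kept aligned across languages, can be sketched like this. The trick is to split on language-independent drug identifiers (e.g. DBpedia/Wikidata URIs) so each fold picks the same drugs in all four languages; the function name is an assumption.

```python
import random

def aligned_folds(drug_ids, n_folds=5, seed=0):
    """Partition drug identifiers into n_folds disjoint folds. Each fold
    serves once as the ~20% seed set; the remaining ~80% is the target
    dictionary to re-discover. Splitting on language-independent IDs
    keeps the seed drugs aligned across all languages."""
    ids = list(drug_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::n_folds] for i in range(n_folds)]  # disjoint slices
    return [(fold, [d for d in ids if d not in fold]) for fold in folds]
```

Each (seeds, target) pair is then resolved to per-language labels before running glimpseLD.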

4.2.3 Drug extraction with human-in-the-loop. We wish to quantify the benefit of using glimpseLD for enriching Linked Data, i.e. given an existing Linked Dataset, how much can we add by running glimpseLD over a relevant corpus to extract new terms? We use as seeds a random 20% of the available Linked Data, as in the previous experiment, and run glimpseLD on the English documents in EMEA, involving a medical doctor as human-in-the-loop (we only use English as it is the native language of the subject expert). With the synthetic user we test glimpseLD against the already available Linked Data, so drugs that are not already there do not get counted as correct. With the human adjudicating those as correct, we can not only evaluate the true performance of glimpseLD, but also quantify the portion of new terms that we are able to add to LD. Figure 5a shows the discovery growth: in 20 iterations, which took only 57 minutes, we obtained a dictionary 10 times bigger than the initial seeds. Using the dictionary produced in this experiment as gold standard, we can closely approximate the recall of both glimpse and glimpseLD. Figure 5b compares the two methods: glimpse starts with one manually provided seed, glimpseLD with Linked Data seeds. In 10 iterations glimpseLD covers the same instances that would take more than 20 iterations with glimpse.

^11 We use the abbreviation dbo for http://dbpedia.org/ontology/ and wikidata for http://www.wikidata.org/entity/.
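The recall comparison between glimpse and glimpseLD amounts to tracking cumulative recall per iteration against the expert-built dictionary. A minimal sketch (the function name is illustrative):

```python
def recall_curve(iteration_terms, gold):
    """Cumulative recall after each iteration.
    `iteration_terms`: list of term lists, one per iteration.
    `gold`: the reference dictionary (here, the expert-adjudicated one)."""
    gold = {t.lower() for t in gold}
    found, curve = set(), []
    for terms in iteration_terms:
        # accumulate the gold terms discovered so far
        found |= {t.lower() for t in terms} & gold
        curve.append(len(found) / len(gold))
    return curve
```

Plotting the two curves side by side shows how many iterations the Linked Data seeding saves.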

Moreover, we checked the produced terms against all items already available in DBpedia and Wikidata. By using our method to expand LD we would achieve an extended coverage of 8.16% on DBpedia and 0.36% on Wikidata. Although the figure for Wikidata seems small, Wikidata already contains 156,633 different lexicalisations of drugs; even so, our subject expert identified relevant drugs and drug classes in the corpus (specifically 561) which are not yet in LD.
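As a quick sanity check, the Wikidata figure follows directly from the two counts reported above:

```python
# 561 expert-validated drugs/drug classes not yet in LD,
# against 156,633 drug lexicalisations already in Wikidata
new_terms = 561
wikidata_lexicalisations = 156_633
extension_pct = 100 * new_terms / wikidata_lexicalisations
print(f"{extension_pct:.2f}%")  # → 0.36%
```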

4.2.4 Building a color dictionary from Twitter data. Using the same methodology as before, we collect seeds from Linked Data. We select all the entities of type dbo:Colour from DBpedia, and all the entities of type wikidata:Q1075 from Wikidata. For all of the selected entities, we retrieve the corresponding labels in English, German, Spanish and Italian. We run the experiment with 4 native speakers. The experiment was run simultaneously with all participants in the same room, so that anyone who had doubts or concerns about accepting or rejecting an item could discuss it with the other participants. The users were instructed to stop after 10 iterations. It is interesting to notice that, despite seeding glimpseLD with all the colors available on DBpedia and Wikidata, we were still able to find additional colors, e.g. "azulgrana" or "rojo vivo" for Spanish, as well as capturing shortened lexicalisations of certain colors, such as "limn" in place of the color "limón". In total we found 19 new colors for English, 22 for Spanish, and 5 each for German and Italian. We then employed these human-created dictionaries as gold standards to perform a further synthetic test, starting with one seed (the color red in all languages); as we are using a manually created gold standard, this is equivalent to re-running the test with the users. For all languages, the number of iterations required to reach the same size lexicon was at least double with respect to starting with Linked Data seeds.

4.3 Comparison with state of the art

In this section we compare our approach to DBpedia Spotlight [16], Babelfy [18] and FRED. Because of strict API limits we were not able to compare Babelfy and FRED on the EMEA dataset, but only on the Twitter dataset. More precisely, we compare the number of identified lexicon terms using glimpseLD with synthetic user, glimpseLD with human-in-the-loop, and DBpedia Spotlight. The results are shown in Table 1a and Table 1b, for the EMEA and Twitter datasets respectively. In both cases and for all languages, glimpseLD with human-in-the-loop identifies the highest number of lexicon terms. Furthermore, when looking at languages other than English, glimpseLD outperforms DBpedia Spotlight on both datasets. We argue that this is an important result, as information extraction methods, especially when targeting user-generated data, have traditionally focused on English, while adaptation to new languages still remains an open issue [6]. It is worth noting that a standard NER tool might fail with arbitrary types (e.g. Color). FRED tries to link words to external knowledge and identifies concepts such as White_House, Black_Lives_Matter etc., which does not fit the particular task of identifying instances of Color.

5 CONCLUSIONS AND FUTURE WORK

The world is an inherently multicultural and multilingual place, where different countries and regions develop their own unique



(a) glimpseLD, DBpedia seeds (r = 0.997).  (b) glimpseLD, Wikidata seeds (r = 0.985).

Figure 4: Comparison of discovery growth for glimpseLD across different languages on the EMEA dataset, using seeds from DBpedia (Figure 4a) and Wikidata (Figure 4b). The Pearson correlation (r) amongst results from different languages is reported.

(a) Discovery growth for glimpseLD.  (b) Recall for glimpse vs glimpseLD.

Figure 5: Human-in-the-loop experiment with a subject matter expert (physician). Figure 5a shows the discovery growth of glimpseLD, seeded with Linked Data. Figure 5b shows the comparison of recall between glimpse (completely manually seeded) and glimpseLD (Linked Data seeded).

technologies, social concepts, fashion, language, etc. at an astounding rate. Developing semantic assets (dictionaries, taxonomies, etc.) in all these languages allows detection and interconnection of concepts between these cultures, opening the door for even more rapid discovery. Even within a single language, cultural and technological terms develop and evolve staggeringly quickly. As usual, it is these very new terms that are of most interest for inclusion in semantic assets and Linked Data resources. This paper addresses the challenge by proposing a solution to discover new instances for a specific ontology concept which is independent of (i) language, (ii) domain and (iii) text style. Our algorithm is iterative and purely statistical, hence does not require any feature extraction, which can be difficult and expensive across different languages and texts. It

organically incorporates human feedback to improve accuracy and control concept drift at every iteration cycle. Additionally, we were able to run the many experiments required to quantify this thanks to the use of an oracle to determine the growth rate for adding instances automatically. We show extremely similar discovery growth (of over 250) extracting drug names in four languages over parallel corpora of medical text, with minimal variation across initial seeds. We show similar efficacy for a second entity type (color) over non-parallel microblogging corpora. We show that by exploiting Linked Data in the bootstrapping phase we were able to maintain comparable performance in all languages, overcoming the hurdle of the slow initial iterations. Lastly, due to the tight integration of the human-in-the-loop, the very high quality instances (their surface forms) developed by our technique can be included rapidly and directly back into the Linked Data.

         gLD-S   gLD-H   DBSpot.
    en    248     822      352
    de    257      /       234
    es    239      /       109
    it    247      /       184

    (a) Lexicon size on the EMEA dataset.

         gLD-S   gLD-H   DBSpot.   Babelfy   FRED
    en     21      54       13        27       0
    de     18      32        6        14       0
    es     23      43       12        22       0
    it     18      36        8        17       0

    (b) Lexicon size on the Twitter dataset.

    Table 1: Comparison of produced lexicon sizes for the extraction task on the EMEA dataset (Table 1a) and on the Twitter dataset (Table 1b), when using glimpseLD with either the synthetic user (gLD-S) or true human-in-the-loop (gLD-H), and DBpedia Spotlight.

Future work will include using multiple semantic resources to improve the performance (e.g., identifying and employing multiple "drug" concepts in parallel). Additionally, the obvious next step is a cross-lingual alignment of the discovered instances, which is a critical aspect in the current efforts towards a multilingual semantic Web [23]. This alignment can benefit from the rich set of contexts developed by glimpse and is an important area of future interest. Finally, although we have already developed and tested a user interface to facilitate human feedback [4], we plan on improving it.

REFERENCES

[1] Pierpaolo Basile, Annalina Caputo, Anna Lisa Gentile, and Giuseppe Rizzo. 2016. Overview of the EVALITA 2016 Named Entity rEcognition and Linking in Italian Tweets (NEEL-IT) Task. In CEUR Workshop Proceedings, Vol. 1749. RWTH.
[2] Asma Ben Abacha, Md Faisal Mahbub Chowdhury, Aikaterini Karanasiou, Yassine Mrabet, Alberto Lavelli, and Pierre Zweigenbaum. 2015. Text mining for pharmacovigilance: Using machine learning for drug name recognition and drug-drug interaction extraction and classification. Journal of Biomedical Informatics 58 (2015), 122–132. https://doi.org/10.1016/j.jbi.2015.09.015
[3] Sebastian Blohm and Philipp Cimiano. 2007. Using the Web to Reduce Data Sparseness in Pattern-Based Information Extraction. In PKDD'07. Springer, 18–29. https://doi.org/10.1007/978-3-540-74976-9_6
[4] Anni Coden, Marina Danilevsky, Daniel Gruhl, Linda Kato, and Meena Nagarajan. 2017. A Method to Accelerate Human in the Loop Clustering. In SIAM 2017. 237–245. https://doi.org/10.1137/1.9781611974973.27
[5] Anni Coden, Daniel Gruhl, Neal Lewis, Michael Tanenblatt, and Joe Terdiman. 2012. SPOT the drug! An unsupervised pattern matching method to extract drug names from very large clinical corpora. HISB'12 (2012), 33–39. https://doi.org/10.1109/HISB.2012.16
[6] Leon Derczynski, Diana Maynard, Giuseppe Rizzo, Marieke van Erp, Genevieve Gorrell, Raphaël Troncy, Johann Petrak, and Kalina Bontcheva. 2015. Analysis of named entity recognition and linking for tweets. Information Processing & Management 51, 2 (2015), 32–49.
[7] George R. Doddington, Alexis Mitchell, Mark A. Przybocki, Lance A. Ramshaw, Stephanie Strassel, and Ralph M. Weischedel. 2004. The Automatic Content Extraction (ACE) Program: Tasks, Data, and Evaluation. In LREC.
[8] Julian Dolby, Achille Fokoue, Aditya Kalyanpur, Edith Schonberg, and Kavitha Srinivas. 2009. Extracting enterprise vocabularies using linked open data. The Semantic Web - ISWC 2009 (2009), 779–794.
[9] Aldo Gangemi, Valentina Presutti, Diego Reforgiato Recupero, Andrea Giovanni Nuzzolese, Francesco Draicchio, and Misael Mongiovì. 2016. Semantic web machine reading with FRED. Semantic Web (2016), 1–21.
[10] Anna Lisa Gentile, Ziqi Zhang, Isabelle Augenstein, and Fabio Ciravegna. 2013. Unsupervised Wrapper Induction Using Linked Data. In K-CAP'13. ACM, 41–48. https://doi.org/10.1145/2479832.2479845
[11] William L. Hamilton, Kevin Clark, Jure Leskovec, and Dan Jurafsky. 2016. Inducing Domain-Specific Sentiment Lexicons from Unlabeled Corpora. In EMNLP 2016. ACL, 595–605. https://aclweb.org/anthology/D16-1057
[12] Ichiro Kuriki, Ryan Lange, Yumiko Muto, Angela M. Brown, Kazuho Fukuda, Rumi Tokunaga, Delwin T. Lindsey, Keiji Uchikawa, and Satoshi Shioiri. 2017. The modern Japanese color lexicon. Journal of Vision 17, 3 (2017), 1. https://doi.org/10.1167/17.3.1
[13] Kathy Lee, Ashequl Qadir, Sadid A. Hasan, Vivek Datla, Aaditya Prakash, Joey Liu, and Oladimeji Farri. 2017. Adverse Drug Event Detection in Tweets with Semi-Supervised Convolutional Neural Networks. In WWW '17. International World Wide Web Conferences Steering Committee, 705–714. https://doi.org/10.1145/3038912.3052671
[14] Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, Sören Auer, and Christian Bizer. 2013. DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia. Semantic Web Journal (2013).
[15] Shengyu Liu, Buzhou Tang, Qingcai Chen, and Xiaolong Wang. 2015. Effects of semantic features on machine learning-based drug name recognition systems: Word embeddings vs. manually constructed dictionaries. Information (Switzerland) 6, 4 (2015), 848–865. https://doi.org/10.3390/info6040848
[16] Pablo N. Mendes, Max Jakob, Andrés García-Silva, and Christian Bizer. 2011. DBpedia Spotlight: shedding light on the web of documents. In I-Semantics'11. ACM, 1–8.
[17] Panagiotis Mitzias, Marina Riga, Efstratios Kontopoulos, Thanos G. Stavropoulos, Stelios Andreadis, Georgios Meditskos, and Ioannis Kompatsiaris. 2016. User-Driven Ontology Population from Linked Data Sources. In KESW. Springer, 31–41.
[18] Andrea Moro, Alessandro Raganato, and Roberto Navigli. 2014. Entity Linking meets Word Sense Disambiguation: a Unified Approach. TACL 2 (2014), 231–244.
[19] Andrea Giovanni Nuzzolese, Anna Lisa Gentile, Valentina Presutti, Aldo Gangemi, Darío Garigliotti, and Roberto Navigli. 2015. Open knowledge extraction challenge. In Semantic Web Evaluation Challenge. Springer, 3–15.
[20] Aasish Pappu, Roi Blanco, Yashar Mehdad, Amanda Stent, and Kapil Thadani. 2017. Lightweight Multilingual Entity Extraction and Linking. In WSDM'17. ACM, 365–374. https://doi.org/10.1145/3018661.3018724
[21] Maria Teresa Pazienza, Marco Pennacchiotti, and Fabio Massimo Zanzotto. 2005. Terminology Extraction: an analysis of linguistic and statistical approaches. Knowledge Mining, SFSC 185 (2005), 255–279. https://doi.org/10.1007/3-540-32394-5_20
[22] Nicolas Pröllochs, Stefan Feuerriegel, and Dirk Neumann. 2015. Generating Domain-Specific Dictionaries using Bayesian Learning. ECIS 2015 (2015), 0–14.
[23] Andi Rexha, Mauro Dragoni, Roman Kern, and Mark Kröll. 2016. An Information Retrieval Based Approach for Multilingual Ontology Matching. In NLDB. Springer, 433–439.
[24] Ellen Riloff and Rosie Jones. 1999. Learning Dictionaries for Information Extraction by Multi-level Bootstrapping. In AAAI '99. AAAI, 474–479. http://dl.acm.org/citation.cfm?id=315149.315364
[25] Petar Ristoski, Christian Bizer, and Heiko Paulheim. 2015. Mining the web of linked data with RapidMiner. Journal of Web Semantics 35 (2015), 142–151.
[26] Petar Ristoski and Heiko Paulheim. 2016. Semantic Web in data mining and knowledge discovery: A comprehensive survey. Journal of Web Semantics 36 (2016), 1–22.
[27] M. Sahlgren and J. Karlgren. 2005. Automatic Bilingual Lexicon Acquisition Using Random Indexing of Parallel Corpora. Nat. Lang. Eng. 11, 3 (Sept. 2005), 327–341. https://doi.org/10.1017/S1351324905003876
[28] Isabel Segura-Bedmar, Paloma Martínez, and María Herrero Zazo. 2013. SemEval-2013 Task 9: Extraction of Drug-Drug Interactions from Biomedical Texts (DDIExtraction 2013). In SemEval 2013. ACL, 341–350. http://www.aclweb.org/anthology/S13-2056
[29] Hyun-Je Song, Seong-Bae Park, and Se-Young Park. 2009. An automatic ontology population with a machine learning technique from semi-structured documents. In ICIA'09. IEEE, 534–539.
[30] J. Tiedemann. 2009. News from OPUS - A collection of multilingual parallel corpora with tools and interfaces. RANLP (2009).
[31] Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 Shared Task: Language-independent Named Entity Recognition. In CoNLL'03. ACL, 142–147. https://doi.org/10.3115/1119176.1119195
[32] Piek Vossen, Rodrigo Agerri, Itziar Aldabe, Agata Cybulska, Marieke van Erp, Antske Fokkens, Egoitz Laparra, Anne-Lyse Minard, Alessio Palmero Aprosio, German Rigau, et al. 2016. NewsReader: Using knowledge resources in a cross-lingual reading machine to generate more knowledge from massive streams of news. Knowledge-Based Systems 110 (2016), 60–85.
[33] Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledgebase. Commun. ACM 57, 10 (2014), 78–85.