
Revealing Entities From Texts With a Hybrid Approach

Julien Plu, Giuseppe Rizzo, Raphaël Troncy
{firstname.lastname}@eurecom.fr
@julienplu, @giusepperizzo, @rtroncy

3rd NLP & DBpedia International Workshop, Bethlehem, Pennsylvania, USA, 2015/10/11

On June 21st, I went to Paris to see the Eiffel Tower and to enjoy the world music day.

§ Goal: link (or disambiguate) entity mentions one can find in text to their corresponding entries in a knowledge base (e.g. DBpedia)

db:June_21, db:Paris, db:Eiffel_Tower, db:Fête_de_la_Musique

What is Entity Linking?


§ Extract entities in diverse types of textual documents:
Ø newspaper article, encyclopaedia article, micropost (tweet, status, photo caption), video subtitle, etc.
Ø deal with grammar-free and short texts that have little context

§ Adapt what can be extracted depending on guidelines or challenges:
Ø #Micropost2014 NEEL challenge: link entities that may belong to: Person, Location, Organization, Function, Amount, Animal, Event, Product, Time, and Thing (languages, ethnic groups, nationalities, religions, diseases, sports and astronomical objects)
Ø OKE2015 challenge: extract and link entities that must belong to: Person, Location, Organization, and Role

Problems


Research Question

How do we adapt an entity linking system to solve these problems?


§ Input and output in different formats:
Ø Input: plain text, NIF, Micropost2014 (pruning phase)
Ø Output: NIF, TAC (TSV format), Micropost2014 (TSV format with no offset)

§ Text is classified according to its provenance

§ Text is normalized if necessary: for micropost content, RT symbols (in the case of tweets) and emoticons are removed (see the sketch after this slide)

[Workflow diagram: input text, classified as microposts vs. newspaper article, video subtitle, encyclopaedia article, ..., goes through Text Normalization, Entity Extractor, Entity Linking (backed by the index), and Pruning inside ADEL]

ADEL Workflow

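A minimal sketch of this normalization step, assuming tweet-like input; the regular expressions (and the tiny emoticon subset they cover) are illustrative assumptions, not ADEL's actual rules:

    import re

    # Hypothetical patterns: a leading retweet marker and a small,
    # purely illustrative subset of emoticons.
    RT_RE = re.compile(r"^RT\s+", re.IGNORECASE)
    EMOTICON_RE = re.compile(r"[:;=8][-']?[)(DPpOo/\\|]")

    def normalize_micropost(text: str) -> str:
        """Strip the RT symbol and emoticons, then collapse whitespace."""
        text = RT_RE.sub("", text)
        text = EMOTICON_RE.sub("", text)
        return re.sub(r"\s+", " ", text).strip()

    print(normalize_micropost("RT @julienplu: entity linking is fun :)"))
    # -> "@julienplu: entity linking is fun"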

§ Multiple extractors can be used:
Ø Possibility to switch an extractor on and off in order to adapt the system to some guidelines
Ø Extractors can be:
F unsupervised: Dictionary, Hashtag + Mention, Number Extractor
F supervised: Date Extractor, POS Tagger, NER System

§ Overlaps are resolved by choosing the longest extracted mention (see the sketch after this slide)

[Diagram: the extractors (Date Extractor, Number Extractor, POS Tagger (NNP/NNPS), Dictionary, NER System (Stanford), Hashtag + Mention Extractor, ...) run in parallel and feed into Overlap Resolution. Example: the Date Extractor yields "June 21" and the Number Extractor yields "21"; "June 21" is kept]

Entity Extractor

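A minimal sketch of the longest-mention overlap resolution described above, assuming each extractor returns (start, end, text) character spans; an illustration, not ADEL's actual code:

    def resolve_overlaps(mentions):
        """Among overlapping spans, keep only the longest one."""
        kept = []
        # Visit the longest mentions first so shorter overlapping ones lose.
        for start, end, text in sorted(mentions, key=lambda m: m[1] - m[0], reverse=True):
            if all(end <= s or start >= e for s, e, _ in kept):
                kept.append((start, end, text))
        return sorted(kept)

    # "On June 21st, ...": the Date Extractor and the Number Extractor overlap.
    spans = [(3, 10, "June 21"), (8, 10, "21")]
    print(resolve_overlaps(spans))  # -> [(3, 10, 'June 21')]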

§ From DBpedia:
Ø PageRank
Ø Title
Ø Redirects, Disambiguation

§ From Wikipedia:
Ø Anchors
Ø Link references

For example, from the EN Wikipedia article about Xabi Alonso:

Alonso and [[Arsenal F.C.|Arsenal]] player [[Mikel Arteta]] were neighbours on the same street while growing up in [[San Sebastián]] and also lived near each other in [[Liverpool]]. Alonso convinced [[Mikel Arteta|Arteta]] to transfer to [[Everton F.C.|Everton]] after he told him how happy he was living in [[Liverpool]].

index: (Arsenal F.C., 1); (Mikel Arteta, 2); (San Sebastián, 1); (Liverpool, 2); (Everton F.C., 1)

A sketch of this counting step follows this slide.

How is the index created?

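A minimal sketch of how such anchor counts could be collected from wikitext, using the Xabi Alonso passage above; the regex and function names are illustrative assumptions, not the actual indexing pipeline:

    import re
    from collections import Counter

    # Matches [[Target]] and [[Target|anchor text]] wiki links.
    WIKILINK_RE = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]+)?\]\]")

    def count_link_targets(wikitext: str) -> Counter:
        """Count how often each article is the target of a wiki link."""
        return Counter(WIKILINK_RE.findall(wikitext))

    text = ("Alonso and [[Arsenal F.C.|Arsenal]] player [[Mikel Arteta]] ... "
            "[[San Sebastián]] ... [[Liverpool]] ... [[Mikel Arteta|Arteta]] "
            "... [[Everton F.C.|Everton]] ... [[Liverpool]].")
    print(count_link_targets(text))
    # Counter({'Mikel Arteta': 2, 'Liverpool': 2, 'Arsenal F.C.': 1,
    #          'San Sebastián': 1, 'Everton F.C.': 1})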

§ Generate candidates from a fuzzy match to the index

§ Filter candidates:
Ø Filter out candidates that are not semantically related to other entities from the same sentence

§ Score each candidate using a linear formula (a sketch follows this slide):

    score(cand) = (a * L(m, cand) + b * max(L(m, R(cand))) + c * max(L(m, D(cand)))) * PR(cand)

where L is the Levenshtein distance, R the set of redirects, D the set of disambiguation pages, and PR the PageRank; a, b and c are weights set with a > b > c and a + b + c = 1

[Diagram: mention → Candidate Generation (query against the index) → Candidate Filtering → Scoring]

Entity Linking

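A minimal sketch of the scoring formula above. Since a raw Levenshtein distance grows as strings get less similar, L is implemented here as a normalised Levenshtein similarity in [0, 1]; that normalisation, the weight values, and the PageRank value are assumptions for illustration only:

    def levenshtein(s, t):
        """Plain dynamic-programming Levenshtein edit distance."""
        prev = list(range(len(t) + 1))
        for i, cs in enumerate(s, 1):
            cur = [i]
            for j, ct in enumerate(t, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
            prev = cur
        return prev[-1]

    def L(m, s):
        """Levenshtein-based similarity, higher is better (an assumption)."""
        return 1 - levenshtein(m, s) / max(len(m), len(s))

    def score(m, cand, redirects, disamb, pagerank, a=0.5, b=0.3, c=0.2):
        # a > b > c and a + b + c = 1, as required by the formula.
        return (a * L(m, cand)
                + b * max((L(m, r) for r in redirects), default=0)
                + c * max((L(m, d) for d in disamb), default=0)) * pagerank

    # The db:Paris candidate from the example on the next slide:
    print(score("Paris", "Paris", ["Parisien", "Paname"],
                ["Paris (disambiguation)"], pagerank=0.9))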

Sentence: "I went to Paris to see the Eiffel Tower."

§ Generate candidates:
Ø Paris: db:Paris, db:Paris_Hilton, db:Paris,_Ontario, db:Notre_Dame_de_Paris
Ø Eiffel Tower: db:Eiffel_Tower, db:Eiffel_Tower_(Paris,_Tennessee)

§ Filter candidates:
Ø db:Paris, db:Paris_Hilton, db:Paris,_Ontario, db:Notre_Dame_de_Paris
Ø db:Eiffel_Tower, db:Eiffel_Tower_(Paris,_Tennessee)

§ Scoring:
Ø Score(db:Paris) = (a * L("Paris", "Paris") + b * max(L("Paris", R("Parisien", "Paname"))) + c * max(L("Paris", D("Paris (disambiguation)")))) * PR(db:Paris)
Ø Score(db:Notre_Dame_de_Paris) = (a * L("Paris", "Notre Dame de Paris") + b * max(L("Paris", R("Nôtre Dame", "Paris Cathedral"))) + c * max(L("Paris", D("Notre Dame", "Notre Dame de Paris (disambiguation)")))) * PR(db:Notre_Dame_de_Paris)

Entity Linking example


§ k-NN machine learning algorithm training process:
Ø Run the system on a training set
Ø Classify entities as true/false according to the training set gold standard
Ø Create a file with the features of each entity and its true/false classification
Ø Train a k-NN classifier with this file to get a model (a sketch follows this slide)

§ Use 10 features for the training:
• Length in number of characters
• Extracted mention
• Title
• Type
• PageRank
• HITS
• Number of inLinks
• Number of outLinks
• Number of redirects
• Linking score

[Diagram: training set → ADEL → create file of features → train k-NN]

Pruning

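A minimal sketch of the pruning classifier with scikit-learn's k-NN; the feature matrix is random stand-in data for the 10 features listed above, not a real training file:

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    # One row per extracted entity, one column per feature
    # (mention length, ..., linking score); random stand-ins here.
    X_train = rng.random((200, 10))
    y_train = rng.integers(0, 2, 200)  # true/false from the gold standard

    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X_train, y_train)

    # At prediction time, keep only the entities classified as true.
    X_new = rng.random((5, 10))
    keep = model.predict(X_new).astype(bool)
    print(keep)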

§ Tweets dataset:
Ø Training set: 2340 tweets
Ø Test set: 1165 tweets

§ Link entities that may belong to one of these ten types:
Ø Person, Location, Organization, Function, Amount, Animal, Event, Product, Time, and Thing (languages, ethnic groups, nationalities, religions, diseases, sports and astronomical objects)

§ Twitter user name dereferencing

§ Disambiguate in DBpedia 3.9

#Micropost2014 NEEL challenge


Results on #Micropost2014

§ Results of ADEL with and without pruning:

              Without pruning               With pruning
              Precision  Recall  F-measure  Precision  Recall  F-measure
Extraction    69.17      72.51   70.8       70         41.62   52.2
Linking       47.39      45.23   46.29      48.21      26.74   34.4

§ Results of other systems:

            E2E    UTwente  DataTXT  ADEL   AIDA   Hyderabad  SAP
F-measure   70.06  54.93    49.9     46.29  45.37  45.23      39.02

§ Sentences from Wikipedia:
Ø Training set: 96 sentences
Ø Test set: 101 sentences

§ Extract and link entities that must belong to one of these four types:
Ø Person, Location, Organization, and Role

§ Must disambiguate co-references

§ Allow emerging entities (NIL)

§ Disambiguate in DBpedia 3.9

OKE2015 challenge


Results on OKE2015

§ Results of ADEL with and without pruning:

              Without pruning               With pruning
              Precision  Recall  F-measure  Precision  Recall  F-measure
Extraction    78.2       65.4    71.2       83.8       9.3     16.8
Recognition   65.8       54.8    59.8       75.7       8.4     15.1
Linking       49.4       46.6    48         57.9       6.2     11.1

§ Results of other systems (https://github.com/anuzzolese/oke-challenge):

            ADEL   FOX    FRED
F-measure   60.75  49.88  34.73

#Micropost2015 NEEL challenge

§ Tweets dataset:
Ø Training set: 3498 tweets
Ø Development set: 500 tweets
Ø Test set: 2027 tweets

§ Extract and link entities that must belong to one of these seven types:
Ø Person, Location, Organization, Character, Event, Product, and Thing (languages, ethnic groups, nationalities, religions, diseases, sports and astronomical objects)

§ Twitter user name dereferencing

§ Disambiguate in DBpedia 3.9 + NIL

Results on #Micropost2015

§ Results of ADEL without pruning:

              Precision  Recall  F-measure
Extraction    68.4       75.2    71.6
Recognition   62.8       45.5    52.8
Linking       48.8       47.1    47.9

§ Results of other systems:
Ø Strong typed mention match:

            ousia  ADEL  uva   acubelab  uniba  ualberta  cen_neel
F-measure   80.7   52.8  41.2  38.8      36.7   32.9      0

Ø Strong link match (not considering the type correctness):

            ousia  acubelab  ADEL  uniba  ualberta  uva   cen_neel
F-measure   76.2   52.3      47.9  46.4   41.5      31.6  0

Error Analysis

§ Issue for the extraction:
Ø "FB is a prime number."
F FB stands for 251 in hexadecimal, but it will be extracted as the Facebook acronym by the wrong extractor

§ Issue for the filtering:
Ø "The series of HP books has been sold millions of times in France."
F There is no relation in Wikipedia between Harry Potter and France, so no filtering is applied

§ Issue for the scoring:
Ø "The Spanish football player Alonso played twice for the national team between 1954 and 1960."
F Xabi Alonso will be selected instead of Juan Alonso because of the PageRank


§ Our system makes it possible to adapt the entity linking task to different kinds of text

§ Our system makes it possible to adapt the types of extracted entities

§ Results are similar regardless of the kind of text

§ Performance at the extraction stage is similar to (or slightly better than) top state-of-the-art systems

§ Big drop in performance at the linking stage, mainly due to the unsupervised approach

Conclusion


§ Add more adaptive features: language, knowledge base

§ Improve linking by using a graph-based algorithm:
Ø finding the common entities linked to each of the extracted entities
Ø example: "Rafael Nadal is a friend of Alonso". There is no direct link between Rafael Nadal and Alonso in DBpedia (or Wikipedia), but they have the entity Spain in common

§ Improve pruning by:
Ø adding additional features:
F relatedness: compute the relation score between one entity and all the others in the text; if there are more than two, compute the average
F POS tag of the previous and the next token in the sentence
Ø using other algorithms:
F Ensemble Learning
F Unsupervised Feature Learning + Deep Learning

Future Work



http://www.slideshare.net/julienplu

http://xkcd.com/1319/