RANLP’2011, Hissar, Bulgaria, 12.09.2011

JRC-Names: A freely available, highly multilingual named entity resource

Hissar, Bulgaria, 12 September 2011

Ralf Steinberger, Bruno Pouliquen, Mijail Kabadjov, Jenya Belyaeva & Erik van der Goot

Technical details and publications: http://langtech.jrc.ec.europa.eu/

Applications: http://emm.newbrief.eu/overview.html

Agenda

• What is JRC-Names; What can it be used for

• Related work: other named entity (NE) resources

• How JRC-Names was produced

• Recognition of named entities in news reports in 20 languages

• Introduction to EMM

• Automatic mapping of name variants to the same entity

• Enrichment with Wikipedia variants

• Partial manual moderation

• Statistics on JRC-Names

• Programming details / Usage of the tool

• Solutions to capture morphological variants

• Further multilingual linguistic resources

What is JRC-Names?

• JRC-Names consists of:

• Lists of names and their many spelling variants,

• ~205,000 person and organisation names plus

• ~204,000 name spelling variants

• In 27 scripts and many more languages

• Software to recognise these names in multilingual text, with offset and unique name identifier

• Download from http://langtech.jrc.ec.europa.eu/

Possible uses of JRC-Names

• Standardise name spellings in databases, text collections and the internet for improved retrieval (Stern & Sagot 2010)

• Improve Machine Translation – names must be treated differently from other words (Babych & Hartley 2003; Steinberger & Pouliquen 2009)

• Use as input to learn automatic transliteration rules (e.g. Pouliquen 2009)

• Use output of JRC-Names as seeds to learn NER rules (e.g. Buchholz & van den Bosch 2000)

• Social networks are less biased by national viewpoints if based on information extracted from multilingual texts

• NER results are useful for other text mining tasks (opinion mining; co-reference resolution; summarisation; topic detection and tracking; cross-lingual linking of related documents across languages; …)

Related work – other multilingual (ml) NE resources

• Wentland et al. (2008) – built a multilingual NE repository based on Wikipedia links and case information; 2.5 million English names, 250K German, 3K Swahili, …

• Toral et al. (2008) – built Named Entity WordNet by searching NEs in WordNet and Wikipedia: 310K entities, including 278K persons

• Stern & Sagot (2010) – exploit French Wikipedia and GeoNames to produce a French resource: 263K person names + 883K variants.

• Maurel (2009) – produced Prolexbase mostly manually: 75K entities of all types

• Most resources are based on Wikipedia

• Strong at providing cross-lingual and cross-script variants

• Offer only a few other spelling variants

• No morphological inflections

• JRC-Names contains mostly spelling variants from real-life text, enriched with Wikipedia – up to 413 variants for the same NE.

Name variants found and used in 6 hours (!) of EMM news analysis (26.08.2011, PM)

NER on news gathered by the Europe Media Monitor (EMM)

• ~ 3600 sources (world-wide, with focus on Europe)

• ~ 3225 news sources (web portals)

• ~ 360 specialist medical sites

• ~ 20 commercial newswires

• Specialist pay-for sources (LexisMed)

• 24/7, updated every 10 minutes

• ~ 100,000 articles / day in ~ 50 languages

• Named Entity Recognition (NER) performed on 20 languages.

• Articles are fed into the various publicly accessible EMM applications:

Multilingual NER in EMM – A brief overview

• Lookup of most frequent known names and their variants in all languages

• Database currently contains about 1.18 million names + 225,000 variants (status July 2011)

• Including morphological (and other) variants by pre-generating inflection forms (Slovene example):

Tony(a|o|u|om|em|m|ju|jem|ja)?\s+Blair(a|o|u|om|em|m|ju|jem|ja)?

• Guessing new names using empirically-derived lexical patterns in 20 languages.

• President, Minister, Head of State, Sir, American

• “death of”, “[0-9]+-year-old”, …

• Known first names + uppercase words

• Identification of a current average of 1,000 unknown names per day.

• Only names found repeatedly will become known names (error reduction).
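
The pre-generated inflection pattern from the Slovene example can be tried out with a short sketch. The helper name `inflected_name_pattern` and the sample sentence are made up for illustration, and both suffix groups are treated as optional here so that the uninflected form also matches; this is not the EMM implementation.

```python
import re

# Suffix list copied from the Slovene example on the slide.
SUFFIXES = "a|o|u|om|em|m|ju|jem|ja"

def inflected_name_pattern(first: str, last: str) -> re.Pattern:
    """Build a regex matching a two-part name with optional case endings."""
    # Assumption: both suffix groups are optional, so the base form
    # "Tony Blair" matches as well as inflected forms.
    return re.compile(
        rf"\b{re.escape(first)}(?:{SUFFIXES})?\s+{re.escape(last)}(?:{SUFFIXES})?\b"
    )

pattern = inflected_name_pattern("Tony", "Blair")

# Hypothetical sample text, written only to exercise the pattern.
text = "Srečanje s Tonyjem Blairom je bilo včeraj. Tony Blair je odšel."
matches = [(m.group(0), m.start()) for m in pattern.finditer(text)]
print(matches)
```

Each match is reported with its character offset, which mirrors the kind of output (name string plus position) the slides describe for the lookup step.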

Multilingual name recognition using lexical patterns

es: asesinato del exprimer ministro Rafic al-Hariri, que la oposición atribuyó

fr: l'assassinat de l'ex-dirigeant Rafic Hariri et le départ du chef de la diplom

nl: na de moord op oud-premier Rafiq al-Hariri gingen gisteren bijna een

de: libanesischen Regierungschef Rafik Hariri vor einem Monat wichtige B

sl: danjega libanonskega premiera Rafika Haririja. Libanonska opozicija si

et: möödumisele ekspeaminister Rafik al-Hariri surma põhjustanud pommipl

en: death of former Prime Minister Rafik Hariri, blamed by many opposition

ar: اغتيال رئيس الوزراء السابق رفيق الحريري بأياد يهودية وما حدث سابقا

ru: Бывший премьер-министр Ливана Рафик Харири, который
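
A minimal sketch of trigger-based name guessing, for English only. The trigger list is taken from the slides, but the regular expression for a candidate name (two or more capitalised tokens, optionally with a short hyphenated particle) and the function name `guess_names` are assumptions made for this example; the actual EMM patterns are language-specific and not shown in the slides.

```python
import re

# A few English trigger phrases mentioned on the slides.
TRIGGERS = [r"President", r"Minister", r"Head of State", r"Sir", r"[0-9]+-year-old"]

# Assumed shape of a candidate name: two or more capitalised tokens, each
# optionally preceded by a short lowercase particle such as "al-".
NAME = r"(?:[a-z]{1,3}-)?[A-Z][\w'-]+(?:\s+(?:[a-z]{1,3}-)?[A-Z][\w'-]+)+"

# A trigger, whitespace, then the candidate name (captured).
GUESS = re.compile(rf"\b(?:{'|'.join(TRIGGERS)})\s+({NAME})")

def guess_names(text: str) -> list[str]:
    """Return candidate names that directly follow a trigger phrase."""
    return [m.group(1) for m in GUESS.finditer(text)]

print(guess_names("death of former Prime Minister Rafik Hariri, blamed by many opposition"))
print(guess_names("a 45-year-old Rafiq al-Hariri supporter"))
```

In line with the slides, a guess of this kind would only become a known name after being found repeatedly.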

Merging name variants for the same entity

• For all newly found name forms, detect whether they are a variant of an existing NE:

• Transliteration;

• Normalisation, using ~30 hand-written rules and removing vowels;

• Calculate similarity (threshold: 94%).

• Below threshold → new entity

(Diagram: similarity combined from two parts weighted 20% + 80%, subject to a condition.)
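
The normalise-then-compare idea can be sketched as follows. This is only an illustration: the three normalisation rules are invented stand-ins for the ~30 hand-written rules (which the slides do not list), the transliteration step is omitted, and `difflib.SequenceMatcher` is used as a generic similarity measure rather than the measure used by the authors. Only the 94% threshold comes from the slide.

```python
import re
import unicodedata
from difflib import SequenceMatcher

# Illustrative rules only (assumptions): ph -> f, collapse doubled
# characters, treat hyphens and apostrophes as spaces.
RULES = [(r"ph", "f"), (r"(.)\1", r"\1"), (r"[-']", " ")]
THRESHOLD = 0.94  # similarity threshold stated on the slide

def normalise(name: str) -> str:
    """Strip diacritics, lowercase, apply the rules, then remove vowels."""
    s = unicodedata.normalize("NFKD", name)
    s = "".join(c for c in s if not unicodedata.combining(c)).lower()
    for pattern, repl in RULES:
        s = re.sub(pattern, repl, s)
    s = re.sub(r"[aeiouy]", "", s)
    return " ".join(s.split())

def same_entity(a: str, b: str) -> bool:
    """True if the normalised forms are at least THRESHOLD similar."""
    ratio = SequenceMatcher(None, normalise(a), normalise(b)).ratio()
    return ratio >= THRESHOLD

print(same_entity("Hamid Karzai", "Hamid Karzaï"))
print(same_entity("Hamid Karzai", "Tony Blair"))
```

Pairs that fall below the threshold are treated as separate entities in this sketch.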

Enriching the EMM data with Wikipedia name variants

• For frequent or highly visible names, manually launch a Wikipedia mining process.

• Check for each variant of a name whether there is a Wikipedia entry.

• New name variants, in all scripts, will be recognised in new EMM articles.

Хамид Карзай

Hamid Karzai

Hamid Karzaï

Hamid Karsai

حامد كرزاي

हामिद करजई

哈米德·卡尔扎伊

http://en.wikipedia.org/wiki/Hamid_Karzai

Manual moderation of EMM name database

• Process is fully automatic, but it can be useful to make changes manually.

• Manual process only for frequent or important names (e.g. Nobel Prize winners):

• Name changes (e.g. Cardinal Josef Ratzinger → Pope Benedict XVI);

• Correct NER mistakes (e.g. Genius Report, Opfer von Diskriminierung);

• Add new stop name parts (e.g. Monday, Report);

• Merge name variants with similarity below the threshold;

• Change the display name of an entity;

• Correct the entity type (PER, ORG, T, U, …);

• Launch Wikipedia mining process;

• …

• Caveat: Name database contains errors!

Statistics on JRC-Names (1)

• JRC-Names includes names from the EMM database if any of the following hold:

• Found in 5 or more news clusters;

• Manually verified;

• Retrieved from Wikipedia;

• Number of entries (status July 2011):

• 205,000 distinct names;

• 204,000 additional variants;

• ~3.2% names of organisations / events

• Number of variants:

• 413 variants for Muammar Gaddafi (entity 262)

• 256 variants for Mikhail Saakashvili (entity 472)

• 246 variants for Mahmoud Ahmadinejad (entity 101358)

• Grows by ~230 new entities and ~430 new variants per week.

Variant forms    Entities
1                63.76%
2                22.52%
3                5.31%
10 or more       3760 entities
50 or more       242 entities
100 or more      37 entities
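
The inclusion rule stated at the top of this slide (five or more news clusters, or manually verified, or retrieved from Wikipedia) amounts to a simple disjunction. The record type and its field names below are hypothetical; the slides give the criteria, not a data schema.

```python
from dataclasses import dataclass

@dataclass
class NameRecord:
    # Hypothetical fields, chosen only to express the three criteria.
    name: str
    cluster_count: int = 0
    manually_verified: bool = False
    from_wikipedia: bool = False

def include_in_jrc_names(rec: NameRecord) -> bool:
    """A name is exported if any one of the three conditions holds."""
    return rec.cluster_count >= 5 or rec.manually_verified or rec.from_wikipedia

print(include_in_jrc_names(NameRecord("Example Name", cluster_count=7)))
print(include_in_jrc_names(NameRecord("Rare Name", cluster_count=2)))
```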

Statistics on JRC-Names (2)

• Number of scripts: 27

• Number of languages: ???

• News mentions names from around the world.

• Frequency does not reflect origin

• European Union (10101) is the most frequent entity in German, and the second most frequent in English.

• It does not matter where a name like Silvio Berlusconi comes from.

Details about the JRC-Names software

• Java-implemented demonstrator

• Finite state automaton

• Reads the NE resource file entities.gzip (frequently updated)

• Searches for known names (and their variants) in UTF-8-encoded text files. Returns:

• Numerical name identifier

• Main name for that entity

• Name string found in the text

• Position (Offset and string length)

• For any given name string, returns all variants.

• Software and NE resource file can be downloaded from http://langtech.jrc.ec.europa.eu/, section on ‘Resources’

• Free usage, according to accompanying end-user licence.
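
To make the described output concrete (identifier, main name, matched string, offset and length), here is a small sketch of a dictionary lookup using a character trie with longest-match scanning. It is written in Python for brevity and is not the Java tool or its API. The in-memory table is hypothetical: the entity identifiers 262 and 472 appear in the statistics above, but the variant spelling and the table layout are assumptions, and the real resource file format is not described in the slides. Word boundaries are not checked.

```python
# Hypothetical resource: variant string -> (entity id, main name).
VARIANTS = {
    "Muammar Gaddafi": (262, "Muammar Gaddafi"),
    "Moammar Gadhafi": (262, "Muammar Gaddafi"),  # illustrative variant
    "Mikhail Saakashvili": (472, "Mikhail Saakashvili"),
}

END = object()  # sentinel key marking the end of a name in the trie

def build_trie(variants):
    """Build a character trie whose accepting nodes carry entity info."""
    root = {}
    for variant, info in variants.items():
        node = root
        for ch in variant:
            node = node.setdefault(ch, {})
        node[END] = info
    return root

def find_names(text, trie):
    """Return (entity_id, main_name, matched_string, offset, length) tuples."""
    hits, i = [], 0
    while i < len(text):
        node, j, last = trie, i, None
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if END in node:
                last = (j, node[END])  # remember the longest match so far
        if last:
            end, (eid, main) = last
            hits.append((eid, main, text[i:end], i, end - i))
            i = end
        else:
            i += 1
    return hits

def variants_of(name, variants):
    """Return all known variants of the entity that `name` belongs to."""
    eid = variants[name][0]
    return sorted(v for v, (e, _) in variants.items() if e == eid)

trie = build_trie(VARIANTS)
hits = find_names("Talks between Moammar Gadhafi and Mikhail Saakashvili ended.", trie)
print(hits)
print(variants_of("Muammar Gaddafi", VARIANTS))
```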

Treatment of morphological inflections

• The recognition of morphological inflections used in the EMM processing chain is not currently part of JRC-Names.

• We are working on a solution to include morphological processing in a future release of JRC-Names.

• Further variants will also be included more consistently:

• Hyphenation (e.g. Yves Saint-Laurent vs. Yves Saint Laurent)

• Names with and without name ‘infixes’ (e.g. Khan al Khalil vs. Khan Khalil)

• Abbreviations (e.g. Saint vs. St.)

• …

• Current solution: add the approximately 45,000 full forms of inflected names, as found in EMM processing results since January 2011, to the resource file entities.gzip

• This helps to recognise the most frequent inflection forms of the frequent names.
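
The three variant types listed above (hyphenation, infixes, abbreviations) could be generated along the following lines. This is a sketch under assumptions: the infix and abbreviation lists are invented for the example, and the slides do not say how the planned solution will work.

```python
import re

# Illustrative lists only; not the rules used by the authors.
INFIXES = {"al", "el", "bin", "van", "de"}
ABBREVIATIONS = {"Saint": "St."}

def spelling_variants(name: str) -> set[str]:
    """Generate hyphenation, infix and abbreviation variants of a name."""
    variants = {name}
    # Hyphenation: "Saint-Laurent" -> "Saint Laurent"
    variants.add(name.replace("-", " "))
    # Infixes: drop particles such as "al" between the first and last token
    tokens = name.split()
    kept = [
        t for i, t in enumerate(tokens)
        if not (0 < i < len(tokens) - 1 and t.lower() in INFIXES)
    ]
    variants.add(" ".join(kept))
    # Abbreviations: "Saint" -> "St." in every variant collected so far
    for v in list(variants):
        for full, abbr in ABBREVIATIONS.items():
            variants.add(re.sub(rf"\b{full}\b", abbr, v))
    return variants

print(sorted(spelling_variants("Yves Saint-Laurent")))
print(sorted(spelling_variants("Khan al Khalil")))
```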

Further JRC/EC-provided multilingual linguistic resources

• JRC-Acquis (2006): 1 billion word parallel corpus in 22 languages

• DGT-TM (2007): Translation Memory in 22 languages; up to 2 million segments

• DGT-TM-2011 (forthcoming): 23 languages; 4 million segments? Yearly updates

• JEX (JRC Eurovoc Indexer) (forthcoming): software to automatically label texts according to the thousands of categories of the Eurovoc thesaurus; 23 languages.

• Further smaller resources:

• Multilingual summary evaluation data (2010): 4 clusters for each of 7 languages

• Sentiment-annotated collection of quotations (2010): English (German forthcoming)

• Multilingual Named Entity-annotated parallel corpus (forthcoming)

• Available at http://langtech.jrc.ec.europa.eu/, section on ‘Resources’
