Post on 17-Dec-2015

Ewa Rudnicka, Wojciech Witkowski, Maciej Piasecki G4.19 Research Group

Institute of Informatics,

Wrocław University of Technology

nlp.pwr.wroc.pl

plwordnet.pwr.wroc.pl

Large Polish-English Lexico-Semantic Resource Based on plWordNet - Princeton WordNet Mapping

Outline

• What is a wordnet?

• Mapping plWordNet on Princeton WordNet

• Extending Princeton WordNet

• Applications

• Conclusions

What is a wordnet? (1)

A huge electronic lexico-semantic database (a kind of thesaurus)

Basic building blocks:

- lemma – the base form that stands for the different inflectional forms and different meanings of a word

  e.g. czwórka – 'good'

- lexical unit – a lemma-sense pair (the sense is marked with a number in wordnets)

  e.g. czwórka 3 (por – 'communication')

- synset – a set of synonymous lexical units

  e.g. {czwórka 3 (por), czwóra 1 (por)}
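The three building blocks can be pictured as a small data model. The sketch below is only an illustration: the class and field names are assumptions made for this example and are not the plWordNet storage format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexicalUnit:
    """A lemma-sense pair; all field names are illustrative."""
    lemma: str   # base form, e.g. "czwórka"
    sense: int   # sense number within the wordnet
    domain: str  # domain label, e.g. "por"


@dataclass(frozen=True)
class Synset:
    """A set of synonymous lexical units."""
    units: frozenset

    def lemmas(self):
        # Lemmas of all member units, in alphabetical order.
        return sorted(unit.lemma for unit in self.units)


# The example synset from the slide: {czwórka 3 (por), czwóra 1 (por)}
czworka_3 = LexicalUnit("czwórka", 3, "por")
czwora_1 = LexicalUnit("czwóra", 1, "por")
grade_synset = Synset(frozenset({czworka_3, czwora_1}))
```

Keeping the sense number inside the lexical unit means that two senses of the same lemma are different objects, which is what allows them to sit in different synsets.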

What is a wordnet? (2)

Both lexical units and synsets are linked via lexico-semantic relations such as synonymy, near-synonymy, hypernymy/hyponymy, meronymy/holonymy and fuzzynymy.

Examples:

Lexical relations:

- czwórka 3 (por) has a derivativity relation to czwórka 4 (por)

- czwórka 3 (por) has an expressiveness relation to czwóra 1 (por)

Synset relations:

- {czwórka 3 (por), czwóra 1 (por)} is a hyponym of {stopień 3 (il), ocena 1 (il), nota 3 (il)}

Princeton WordNet

Princeton WordNet (Fellbaum 1998):

- the first wordnet ever, built on psycholinguistic principles: mapping the structure of human lexical memory (cf. Miller 1998)

- taxonomic hierarchies for nouns, entailment relations for verbs, antonym relations for adjectives

- synsets represent 'lexicalised concepts' (cf. Miller 1998)

- synsets are built of lexical units linked by the synonymy relation, understood as a conceptual relation established on the basis of linguists' intuitions and dictionary definitions

- no major changes since 2006; last version released in 2012

plWordNet - Słowosieć

• plWordNet (plWN)

  - developed fairly independently of Princeton WordNet (PWN) by applying a unique corpus-based method

  - one of the biggest existing wordnets

Number of        plWN       PWN        enWN
lemmas           156,402    155,593    157,541
lexical units    220,129    206,978    209,147
synsets          162,629    117,659    119,290

• the emphasis is on relations between lexical units; new relations were specially designed to cover the peculiarities of the morphosyntactic structure of Polish (cf. Piasecki et al. 2009, Maziarz et al. 2012)

• synsets are built of lexical units sharing the same set of constitutive relations, such as hyponymy, hypernymy, meronymy and holonymy

• partly linked to Princeton WordNet (cf. Rudnicka et al. 2012)

plWordNet vs. Princeton WordNet

Mapping plWordNet on Princeton WordNet

• Goal: linking plWordNet synsets with Princeton WordNet synsets

• Steps:

  - defining a set of inter-lingual relations and setting their hierarchy

  - designing mapping procedures for nouns and adjectives

• Mapping direction: plWordNet > Princeton WordNet

• Bottom-up approach: starting from the lowest levels in the hierarchy

• Currently mapped lexical categories: nouns (most of them), adjectives (about a half)

Automatic prompts

Two systems, based on:

1) the relaxation labeling algorithm (nouns)

2) rules relying on the network of existing intra- and inter-lingual relations (adjectives)

Resource: cascade dictionary

Generated prompts:

- visible in the form of special links in the WordNetLoom editing system

- verified by lexicographers

A set of inter-lingual relations and current statistics

• A set of inter-lingual relations between plWN and PWN

• inspired by:

  - inter-lingual relations from EuroWordNet (Vossen 2002)

  - intra-lingual relations from plWordNet (Maziarz et al. 2011)

• Statistics of the established inter-lingual links:

Relation: Nouns / Adjectives
1. Synonymy: 28 736 / 3 199
2. Partial synonymy: 2 580 / 1 003
3. Inter-register synonymy: 1 510 / 35
4. Hyponymy: 57 029 / 6 561
5. Hypernymy: 3 744 / 34
6. Meronymy: 6 034
7. Holonymy: 1 204
8. Cross-categorial synonymy: 3 891

Motivation for the extension of Princeton WordNet

The high percentage of inter-lingual hyponymy links between plWordNet and Princeton WordNet synsets.

These links were established because of a number of lexical coverage gaps in Princeton WordNet, which made it impossible to establish the much more informative and useful inter-lingual synonymy links.

They can be used as ‘pointers’ to specific Princeton WordNet gaps (‘missing’ lexical units) and to whole ‘empty nests’ (several missing co-hyponyms of one hypernym synset) in the network.
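The size of that percentage can be estimated from the link statistics reported earlier in the talk. The short calculation below is a rough sketch: it uses only the five relation types for which the transcript gives both a noun and an adjective count, so the remaining relation types are left out of the denominator, and the variable names are chosen for this example.

```python
# Noun link counts for the five relations that have counts for both
# nouns and adjectives in the statistics; the remaining relation types
# are deliberately left out of this estimate.
noun_links = {
    "synonymy": 28736,
    "partial synonymy": 2580,
    "inter-register synonymy": 1510,
    "hyponymy": 57029,
    "hypernymy": 3744,
}

total_noun_links = sum(noun_links.values())
hyponymy_share = noun_links["hyponymy"] / total_noun_links

print(f"hyponymy share of the counted noun links: {hyponymy_share:.1%}")
```

Under this restriction, hyponymy accounts for roughly three fifths of the counted noun links, about twice the number of synonymy links.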

Inter-lingual hyponymy links

General extension procedure

• The starting point: existing inter-lingual hyponymy links

• Lemmas of plWordNet synsets are translated by a cascade dictionary

  - it combines several traditional dictionaries, with the data ordered in a hierarchy of importance; the topmost dictionary gets the highest priority

• The results are filtered by the lemmas of Princeton WordNet to obtain:

  - a list of plWN lemmas whose ‘equivalent’ cascade dictionary lemmas are absent from PWN

  - a list of plWN lemmas without ‘equivalent’ cascade dictionary lemmas

  - a list of plWN lemmas whose ‘equivalent’ cascade dictionary lemmas are present in PWN
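The translation and filtering steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that the cascade dictionary returns all translations ordered by dictionary priority, and that a plWN lemma counts as "present in PWN" when at least one of its translations is a PWN lemma. The function names and the toy data are invented for the example.

```python
def cascade_translate(lemma, dictionaries):
    """Collect translations from dictionaries ordered by priority.

    Translations from higher-priority dictionaries come first;
    duplicates from lower-priority dictionaries are skipped.
    """
    seen, ordered = set(), []
    for dictionary in dictionaries:
        for translation in dictionary.get(lemma, []):
            if translation not in seen:
                seen.add(translation)
                ordered.append(translation)
    return ordered


def split_by_pwn_coverage(plwn_lemmas, dictionaries, pwn_lemmas):
    """Sort plWN lemmas into the three lists named on the slide."""
    absent, untranslated, present = [], [], []
    for lemma in plwn_lemmas:
        translations = cascade_translate(lemma, dictionaries)
        if not translations:
            untranslated.append(lemma)   # no dictionary equivalent at all
        elif any(t in pwn_lemmas for t in translations):
            present.append(lemma)        # assumption: one hit is enough
        else:
            absent.append(lemma)         # equivalents exist, none in PWN
    return absent, untranslated, present


# Toy data, invented for illustration only.
dictionaries = [
    {"kot": ["cat"]},                                  # highest priority
    {"kot": ["cat", "tomcat"], "bigos": ["bigos"]},    # lower priority
]
toy_pwn_lemmas = {"cat", "tomcat"}
absent, untranslated, present = split_by_pwn_coverage(
    ["kot", "bigos", "żurek"], dictionaries, toy_pwn_lemmas
)
```

The first list is the interesting one for the extension: its lemmas have a dictionary equivalent that Princeton WordNet does not yet contain.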

Extension procedure

• Start at the lowest level of the hierarchy

  - in order not to change the structure of the original Princeton WordNet

• Verification of the suggested English equivalent(s)

  - in corpora and other reliable sources

  - on the basis of the researcher’s knowledge, dictionaries and frequency lists from corpora

• Creation of the new Princeton WordNet synset

• The synset is linked

  - via an intra-lingual hyponymy relation to a proper PWN hypernym synset

  - via an inter-lingual synonymy relation to its direct counterpart in plWordNet
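The last two steps, creating the synset and attaching its two links, might look like this in code. The sketch assumes a wordnet held as a plain dictionary from synset identifiers to synset records; the identifiers, lemmas and function name are hypothetical and are not taken from plWordNet, Princeton WordNet or WordNetLoom.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    synset_id: str
    lemmas: list
    # Outgoing links stored as (relation name, target synset id) pairs.
    relations: list = field(default_factory=list)


def add_pwn_synset(pwn, new_id, lemmas, pwn_hypernym_id, plwn_synset_id):
    """Create a new leaf synset and attach the two links from the slide."""
    if pwn_hypernym_id not in pwn:
        raise KeyError(f"unknown PWN hypernym synset: {pwn_hypernym_id}")
    if new_id in pwn:
        raise ValueError(f"synset id already in use: {new_id}")
    synset = Synset(new_id, list(lemmas))
    # Intra-lingual link: the new synset is a hyponym of an existing one.
    synset.relations.append(("hyponymy", pwn_hypernym_id))
    # Inter-lingual link: synonymy with its plWordNet counterpart.
    synset.relations.append(("inter-lingual synonymy", plwn_synset_id))
    pwn[new_id] = synset
    return synset


# Hypothetical identifiers and lemmas, for illustration only.
pwn = {"en:dish-1": Synset("en:dish-1", ["dish"])}
new_synset = add_pwn_synset(
    pwn, "en:new-001", ["example lemma"], "en:dish-1", "pl:0001"
)
```

Because the new synset only points upwards to an existing hypernym, nothing in the original network is rewired, which matches the aim of starting at the lowest level of the hierarchy.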

Extension results

• Each added synset is provided with:

  - a definition (major source: English Wikipedia)

  - a usage example from a corpus or another reliable English source

• Total number of selected plWN synsets: 42785

• Domains selected for the first stage:

  - shape (156)

  - substance (1181)

  - quantity (547)

  - food (885)

  - property (1492)

Extension via plWN. Pros and cons

• Pros:

  - there is a definite vocabulary basis for the extension

  - new synsets can be easily and safely located in the structure of the original PWN

• Cons:

  - Polish orientation of the extension

  - addition of lexical units related to strictly Polish domains

Extension via corpus data. An alternative strategy

• This extension procedure uses frequency lists derived from:

  - British National Corpus

  - WaCky corpus

  - Corpus of Contemporary American English

  - American National Corpus

  - English Wikipedia

• Independent of plWordNet

• Criterion for inclusion of a new lexical unit: its appearance in five different texts
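The inclusion criterion amounts to a document-frequency threshold. The sketch below is one possible reading of it: it assumes that "five different texts" means at least five, that the texts are already tokenised and lemmatised, and that lemmas already in the wordnet are skipped. The function name and the toy data are invented.

```python
from collections import Counter


def candidates_for_inclusion(texts, known_lemmas, min_texts=5):
    """Return new lemmas that occur in at least `min_texts` different texts.

    `texts` is an iterable of token lists, one list per text.
    """
    text_counts = Counter()
    for tokens in texts:
        # set() makes each text count once, however often the lemma repeats.
        text_counts.update(set(tokens))
    return sorted(
        lemma
        for lemma, count in text_counts.items()
        if count >= min_texts and lemma not in known_lemmas
    )


# Toy data, invented for illustration only.
toy_texts = (
    [["selfie", "photo", "selfie"] for _ in range(5)]
    + [["vlog", "photo"] for _ in range(4)]
)
new_units = candidates_for_inclusion(toy_texts, known_lemmas={"photo"})
```

Counting texts rather than raw occurrences keeps a word that is repeated many times in a single document from passing the threshold.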

Pros and cons

• Pros:

  - English oriented

  - no Polish bias

• Cons:

  - new synsets have to be introduced at different levels of the PWN hierarchy

  - there is a risk of changing the structure of the original PWN

Cross-lingual Applications

Cross-lingual: semantic searching, semantic indexing of texts, text classification, statistical semantic analysis of corpora in different languages, information extraction, machine translation

Multi-lingual: Princeton WordNet 3.1 is linked to more than 60 languages

Conclusions

• The created bilingual resource will become a gateway to CLARIN bilingual resources

• It has a number of practical applications

• Princeton WordNet can be enriched and updated

• Extension of Princeton WordNet allows one to replace the existing inter-lingual hyponymy links between plWN and PWN synsets with more precise and useful inter-lingual synonymy links

References

Fellbaum, Ch. (ed.). 1998. WordNet: An Electronic Lexical Database. MIT Press: Cambridge, Massachusetts.

Maziarz, M., Piasecki, M. and Szpakowicz, S. 2012. Approaching plWordNet 2.0. In Proceedings of the 6th Global Wordnet Conference, Matsue.

Piasecki, M., Maziarz, M., Szpakowicz, S. and Rudnicka, E. 2014. plWordNet as the Cornerstone of a Toolkit of Lexico-semantic Resources. In Proceedings of the 7th International Global Wordnet Conference.

Princeton WordNet http://wordnet.princeton.edu/wordnet/

Rudnicka, E., Maziarz, M., Piasecki, M. and Szpakowicz, S. 2012. A Strategy of Mapping Polish WordNet onto Princeton WordNet. In Proceedings of COLING 2012. ACL.

Słowosieć http://plwordnet.pwr.wroc.pl/wordnet/

Vossen, P. (ed.). 2002. EuroWordNet. General Document. Amsterdam.
