2nd GWC, January 20th-23rd 2004 - Brno Extending WordNet with syntagmatic information Luisa Bentivogli, Emanuele Pianta ITC-irst


2nd GWC, January 20th-23rd 2004 - Brno

Extending WordNet

with syntagmatic information

Luisa Bentivogli, Emanuele Pianta

ITC-irst

Overview

• WordNet: paradigmatic vs syntagmatic information
• Recurrent Free Phrases
• Encoding RFP through Phrasets and Syntagmatic Relations
• Getting RFPs in bilingual dictionaries and corpora
• Conclusions

Paradigmatic vs Syntagmatic

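Example sentence: An international conference took place in Brno

• Paradigmatic relations (in absentia): words that could replace those in the sentence
– national, symposium, meeting, Prague, Czech Republic

• Syntagmatic relations (in presentia): relations among the words that co-occur in the sentence
– free phrase, multiword expression, semantic restriction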

Why is syntagmatic information useful?

• From a lexicographic point of view
– See examples of usage in dictionaries (and WN itself)
– Often a very short phrase
– Sometimes more useful than definitions

• From a computational point of view
– Statistics-oriented, corpus-based methods
– Crucial role of co-occurrence information
– Co-occurrence of words vs meanings

Lexical units in WordNet

• Criterion for inclusion in synsets: only lexicalized concepts

• What counts as a lexical unit

– Simple words: {tree}

– Idioms
  • non-compositional meaning
  • {rollercoaster, big_dipper, ...}

– Restricted collocations
  • compositional, reduced substitution, no literal translation
  • {criminal_record, record} (Italian: precedenti penali)

– Named entities: {Praha, capital_of_the_Czech_Republic, …}

Problems with inclusion criteria - 1

• Artificial nodes: synsets with no lexical unit
– {social_group}
– {gruppo_sociale}
– Free combinations of words (Benson et al., 1986)
  • DEF: a combination of words following only the general rules of syntax

• Restricted collocations:
– reduced substitution, no literal translation, but compositional
– ex: circulatory system (*blood, *circulation system)
– are they lexical units?
– should we include them in synsets?

• Can we “keep” the information currently contained in artificial nodes and restricted collocations without violating the criterion for inclusion in synsets?

Problems with inclusion criteria - 2

• A considerable number of expressions which are systematically used to express a concept are excluded from (Multi)WordNet as they are not lexical units

• Ex: “andare in bicicletta” [to bike]
– andare: to move by walking or using a means of locomotion
– in bicicletta: by bike

• Ex: “punta di freccia” [arrowhead]

Introducing Recurrent Free Phrases

• Recurrent free phrase (RFP): a free combination of words which is recurrently used to express a concept

• 1. Syntactically constrained: N|V|A|P phrases (cf. restricted collocations)

• 2. High frequency (“governo italiano” [Italian government])

• 3. High degree of association (“prima volta” [first time])

• 4. Salience:
– intuition of the native speaker lexicographer that a certain expression picks out a concept which is perceived as relevant and somehow unitary
– not necessarily related to frequency and word association
  • “vertice internazionale” [international summit] (high salience)
  • “coscia destra” [right thigh]

The salience criterion

• Hypothesis:
– Related to the amount of world knowledge that is attached to a certain phrase
– Such knowledge cannot be inferred from the meanings composing the phrase

• Example:
– right hand (more salient)
– right thigh

Recurrent Free Phrases for NLP

• Knowledge-based word alignment of parallel corpora
– EX: cornfield ~ campo di grano

• Word Sense Disambiguation
– campo: 12 senses in MWN
– grano: 9 senses
– both unambiguous in “campo di grano”
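
The disambiguation gain can be illustrated with a minimal sketch. The sense counts are the ones quoted above; the phraset index, the function names, and the treatment of "di" as a single-reading function word are assumptions made only for this example.

```python
from math import prod

# Sense counts quoted on the slide (MultiWordNet). Words not listed here,
# such as "di", are assumed to have a single reading in this sketch.
SENSE_COUNTS = {"campo": 12, "grano": 9}

# Hypothetical phraset index: phrase tokens -> concept label.
PHRASETS = {("campo", "di", "grano"): "cornfield"}


def sense_combinations(tokens):
    """Sense assignments to consider if each word is disambiguated on its own."""
    return prod(SENSE_COUNTS.get(tok, 1) for tok in tokens)


def readings_with_phrasets(tokens):
    """Return 1 when the whole phrase is a known RFP, else fall back to word senses."""
    if tuple(tokens) in PHRASETS:
        return 1
    return sense_combinations(tokens)


phrase = ["campo", "di", "grano"]
print(sense_combinations(phrase))      # 12 * 9 combinations word by word
print(readings_with_phrasets(phrase))  # a single reading once the RFP is recognized
```

Word by word there are 12 × 9 = 108 candidate sense pairs; recognizing the phrase as an RFP leaves one.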

Criteria for RFP selection

• RFPs expressing a concept which is not lexicalized in one language but is lexicalized in another language (lexical gaps)
– EX: andare in bicicletta [to bike]

• RFPs synonymous with a lexical unit in the same language
– EX: strofinaccio dei piatti / canovaccio [dishcloth]

• RFPs that are frequent, cohesive and salient within a corpus taken as the reference corpus
– EX: vertice internazionale [international summit]

• RFPs whose components are highly polysemous
– EX: campo di grano [cornfield]

MultiWordNet

• MultiWordNet: Italian/English lexical database
• Princeton WordNet building criteria
• Strict alignment (see expand model)
• Explicit treatment of lexical gaps
• Italian (44,000 words) and
– Hebrew (University of Haifa, just started)
– Cf. Spanish WordNet (EuroWordNet)

Introducing Phrasets

• Phraset: a set of synonymous recurrent free phrases

ENG-synset {cornfield}
ITA-synset {GAP}
ITA-phraset {campo_di_grano}

ENG-synset {toilet_roll}
ITA-synset {GAP}
ITA-phraset {rotolo_di_carta_igienica}

ENG-synset {dishcloth}
ITA-synset {canovaccio}
ITA-phraset {strofinaccio_dei_piatti, strofinaccio_da_cucina}
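
One possible in-memory representation of these aligned entries is sketched below. The class name, field names, and helper methods are hypothetical and are not taken from MultiWordNet; an empty Italian synset stands for {GAP}.

```python
from dataclasses import dataclass, field


@dataclass
class AlignedConcept:
    """One aligned English/Italian concept (all names here are hypothetical)."""
    eng_synset: set
    ita_synset: set = field(default_factory=set)   # empty set models {GAP}
    ita_phraset: set = field(default_factory=set)  # synonymous recurrent free phrases

    def is_lexical_gap(self):
        """True when Italian has no lexical unit for the concept."""
        return not self.ita_synset

    def italian_expressions(self):
        """Lexical units first, then recurrent free phrases, each sorted."""
        return sorted(self.ita_synset) + sorted(self.ita_phraset)


# The three examples from the slide.
concepts = [
    AlignedConcept({"cornfield"}, ita_phraset={"campo_di_grano"}),
    AlignedConcept({"toilet_roll"}, ita_phraset={"rotolo_di_carta_igienica"}),
    AlignedConcept({"dishcloth"}, {"canovaccio"},
                   {"strofinaccio_dei_piatti", "strofinaccio_da_cucina"}),
]

for c in concepts:
    print(sorted(c.eng_synset), c.is_lexical_gap(), c.italian_expressions())
```

Keeping the phraset separate from the synset preserves the inclusion criterion for synsets while still recording how the concept is usually expressed.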

RFPs vs definitions

RFPs are not definitions

E-synset {tree -- a tall perennial woody plant having a main trunk …}
I-synset {albero -- ogni pianta perenne con fusto legnoso ramificato}
I-phraset {}

E-synset {paperboy}
I-synset {GAP -- ragazzo che recapita i giornali}
I-phraset {ragazzo_dei_giornali}

E-synset {straphanger}
I-synset {GAP -- chi viaggia in piedi su mezzi pubblici reggendosi ad un sostegno}
I-phraset {}

Synsets vs Phrasets

[Diagram: simple words, idioms, restricted collocations and named entities are grouped under Synsets; recurrent free phrases are grouped under Phrasets; free combinations of words fall under neither.]

Syntagmatic Relations in WN

• MEANING project: using the “involve” semantic relation to encode deep selectional restrictions

• Can RFPs be encoded through semantic relations?

Encoding “campagna antifumo” - 1

Through phrasets

Synset: {campagna}, Phraset: {} (campaign)
   ↑ hypernym
Synset: {GAP}, Phraset: {campagna_antifumo} (campaign against smoking)

Encoding “campagna antifumo” - 2

Through a semantic relation

Synset: {campagna} -- has_constraint --> Synset: {antifumo} (campaign against smoking)

Pros and cons of using semantic relations for encoding RFPs

• Smart and concise

but what about

• Trigram RFPs?
• Synonymous RFPs?
• RFPs that are translation equivalents of lexical units?
• Restrictions on word order and word morphology?

Taking the best of both encodings

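• Phrasets and lexical syntagmatic relations

[Diagram: the phraset node GAP -- campo di grano (cornfield) is linked by composed-of(campo) to the synset campo (field) and by composed-of(grano) to the synset frumento, grano (corn). Three hypernym links are also drawn; the other nodes shown are cereale (cereal) and appezzamento (parcel).]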
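
The combined encoding can be sketched as a small set of relation triples. The node labels and helper functions are ad hoc, and the endpoints chosen for the two hypernym links (grano to cereale, campo to appezzamento) are an assumed reading of the diagram; the third hypernym link is left out because its endpoints are not stated.

```python
# Relation triples: (source, relation, target). Node names are ad hoc labels.
RELATIONS = [
    # Lexical syntagmatic relations from the phraset to its component synsets.
    ("campo_di_grano", "composed-of", "campo"),
    ("campo_di_grano", "composed-of", "grano"),
    # Assumed endpoints for two of the hypernym links in the diagram.
    ("grano", "hypernym", "cereale"),
    ("campo", "hypernym", "appezzamento"),
]


def targets(source, relation):
    """All nodes reached from `source` through `relation`, in listing order."""
    return [t for s, r, t in RELATIONS if s == source and r == relation]


def component_hypernyms(phrase):
    """Map each component of a phraset entry to its direct hypernyms."""
    return {comp: targets(comp, "hypernym") for comp in targets(phrase, "composed-of")}


print(targets("campo_di_grano", "composed-of"))
print(component_hypernyms("campo_di_grano"))
```

The phraset keeps the phrase as a unit (word order, synonyms, translation equivalence), while the composed-of links point to the specific senses of its components.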

RFP in Bilingual Dictionaries

• Collins bilingual dictionary (medium size)
• Italian Translation Equivalents (Bentivogli and Pianta, 2000)
– 92.2% correspond to lexical units
– 7.8% correspond to free combinations of words (lexical gaps)

• Manual check of 300 lexical gaps
– 67% correspond to RFPs

=> More than half of the synsets which are gaps in Italian potentially have an associated phraset

RFPs in corpora

• Correlation between RFPs and frequency?
• Analysis of a 32M word corpus (Repubblica, 2000-2001)
• Standard n-gram analysis package (NSP)
• All bigrams including at least one stopword excluded
• 118,464 bigrams occurring more than 3 times
• Highest rank: 5,914 occurrences (“New York”)
• Rank 4: 31,453 bigrams
• 497 distinct ranks (frequency classes)
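
The filtering steps above can be approximated with a few lines of standard-library Python. This is only a sketch of the idea and not the NSP package used in the study; the tiny stopword list, the function name, and the toy token sequence are illustrative assumptions.

```python
from collections import Counter

# Illustrative stopword list only; a real run would use a full Italian list.
STOPWORDS = {"di", "il", "la", "in", "e", "a", "per", "un", "una"}


def frequent_bigrams(tokens, stopwords=STOPWORDS, min_count=4):
    """Count adjacent word pairs, skipping any pair that contains a stopword,
    and keep those occurring more than 3 times (min_count=4)."""
    counts = Counter(
        (w1, w2)
        for w1, w2 in zip(tokens, tokens[1:])
        if w1 not in stopwords and w2 not in stopwords
    )
    return {bigram: n for bigram, n in counts.items() if n >= min_count}


# Toy input: the same five tokens repeated four times.
tokens = "vertice internazionale di new york".split() * 4
print(frequent_bigrams(tokens))
```

In the toy input, "york vertice" occurs only three times and pairs containing "di" are skipped, so only "vertice internazionale" and "new york" survive.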

RFPs in corpora cont.

• Lower ranks are systematically and densely populated

• Higher ranks are sparsely and poorly populated

• Rank groups
– A: 5,914-509 (100 bigrams)
– B: 505-257 (257)
– C: 256-129 (731)
– D: 128-65 (1,965)
– E: 64-33 (4,525)
– F: 32-17 (10,477)
– G: 16-9 (22,167)
– H: 8-5 (46,798)
– I: 4 (31,453)

• Manual check of 100 random bigrams from each rank group

RFPs in corpora cont.

Rank group (highest frequency)   Lexical units   Recurrent free phrases   Other
A (5,914)                        82              14                       4
B (505)                          79              4                        17
C (256)                          74              9                        17
D (128)                          65              14                       21
E (64)                           58              17                       25
F (32)                           55              4                        41
G (16)                           42              15                       43
H (8)                            35              3                        58
I (4)                            28              15                       57

Manual check of 100 random bigrams from each rank group

NB: similar results on trigrams
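
The trend in the table can be checked with a simple Pearson correlation between rank group position and each count. The counts are copied from the table above; using the position A=1 ... I=9 as an ordinal stand-in for frequency, and the `pearson` helper itself, are assumptions of this sketch and not the analysis reported by the authors.

```python
from math import sqrt

GROUPS = list("ABCDEFGHI")  # A = most frequent bigrams, I = least frequent
COUNTS = {
    "lexical units": [82, 79, 74, 65, 58, 55, 42, 35, 28],
    "recurrent free phrases": [14, 4, 9, 14, 17, 4, 15, 3, 15],
    "other": [4, 17, 17, 21, 25, 41, 43, 58, 57],
}


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


positions = list(range(1, len(GROUPS) + 1))  # ordinal proxy for decreasing frequency
for label, ys in COUNTS.items():
    print(f"{label}: r = {pearson(positions, ys):+.2f}")
```

With these numbers the coefficient for lexical units is strongly negative (their share falls as frequency falls), while the coefficient for RFPs is close to zero.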

Correlation between num. of RFPs and frequency in a reference corpus

[Chart: counts (y-axis, 0 to 90) of Lex Unit, R.F.P. and Other bigrams for each rank group A to I (x-axis)]

Future work

• Better characterization and classification
• Correlation with association measures
• Evaluating RFPs for WSD

Conclusions

• WordNet is poor in syntagmatic information

• We introduced Recurrent Free Phrases, Phrasets, and syntagmatic lexical relations

• RFP: a free combination of words recurrently used to express a concept

• Criteria for their selection

• Bilingual dictionaries contain many RFPs

• Corpora: no clear correlation with frequency

• Useful for:
– lexicographic work
– Word Sense Disambiguation