2nd GWC, January 20th-23rd 2004 - Brno
Extending WordNet
with syntagmatic information
Luisa Bentivogli, Emanuele Pianta
ITC-irst
Overview
• WordNet: paradigmatic vs syntagmatic information
• Recurrent Free Phrases
• Encoding RFP through Phrasets and Syntagmatic Relations
• Getting RFPs in bilingual dictionaries and corpora
• Conclusions

Paradigmatic vs Syntagmatic
An international conference took place in Brno
Paradigmatic relations (in absentia): national, symposium, meeting, Prague, Czech Republic
Syntagmatic relations (in presentia): free phrase, multiword expression, semantic restriction

Why is syntagmatic info useful
• From a lexicographic point of view
  – See examples of usage in dictionaries (and WN itself)
  – Often a very short phrase
  – Sometimes more useful than definitions
• From a computational point of view
  – statistics oriented, corpus based methods
  – crucial role of co-occurrence information
  – co-occurrence of words vs meanings

Lexical units in WordNet
• Criterion for inclusion in synsets: only lexicalized concepts
• What counts as a lexical unit
  – Simple words: {tree}
  – Idioms
    • non-compositional meaning
    • {rollercoaster, big_dipper, ...}
  – Restricted collocations
    • compositional, reduced substitution, no literal translation
    • {criminal_record, record} (Italian: precedenti penali)
  – Named entities: {Praha, capital_of_the_Czech_Republic, …}

Problems with inclusion criteria - 1
• Artificial nodes: synsets with no lexical unit
  – {social_group}
  – {gruppo_sociale}
  – Free combinations of words (Benson et al., 1986)
    • DEF: a combination of words following only the general rules of syntax
• Restricted collocations:
  – reduced substitution, no literal translation, but compositional
  – ex: circulatory system (*blood, *circulation system)
  – are they lexical units?
  – should we include them in synsets?
• Can we “keep” the information currently contained in artificial nodes and restricted collocations without violating the criterion for inclusion in synsets?

Problems with inclusion criteria - 2
• A considerable number of expressions which are systematically used to express a concept are excluded from (Multi)WordNet because they are not lexical units
• Ex: “andare in bicicletta” [to bike]
  – andare: to move by walking or using a means of locomotion
  – in bicicletta: by bike
• Ex: “punta di freccia” [arrowhead]

Introducing Recurrent Free Phrases
• Recurrent free phrase (RFP): a free combination of words which is recurrently used to express a concept
• 1. Syntactically constrained: N|V|A|P phrases (cf. restricted collocations)
• 2. High frequency (“governo italiano”, Italian government)
• 3. High degree of association (“prima volta”, first time)
• 4. Salience:
  – intuition of the native speaker lexicographer that a certain expression picks out a concept which is perceived as relevant and somehow unitary
  – not necessarily related to frequency and word association
    • “vertice internazionale”, international summit (high salience)
    • “coscia destra”, right thigh

The salience criterion
• Hypothesis:
  – Related to the amount of world knowledge that is attached to a certain phrase
  – Such knowledge cannot be inferred from the meanings of the words composing the phrase
• Example:
  – right hand (more salient)
  – right thigh

Recurrent Free Phrases for NLP
• Knowledge-based word alignment of parallel corpora
  – EX: cornfield ~ campo di grano
• Word Sense Disambiguation
  – campo: 12 senses in MWN
  – grano: 9 senses
  – both unambiguous in “campo di grano”

Criteria for RFP selection
• RFPs expressing a concept which is not lexicalized in one language but is lexicalized in another (lexical gaps)
  – EX: andare in bicicletta [to bike]
• RFPs synonymous with a lexical unit in the same language
  – EX: strofinaccio dei piatti / canovaccio [dishcloth]
• RFPs that are frequent, cohesive and salient within a corpus taken as the reference corpus
  – EX: vertice internazionale [international summit]
• RFPs whose components are highly polysemous
  – EX: campo di grano [cornfield]

MultiWordNet
• MultiWordNet: Italian/English lexical database
• Princeton WordNet building criteria
• Strict alignment (see expand model)
• Explicit treatment of lexical gaps
• Italian (44,000 words) and
  – Hebrew (University of Haifa, just started)
  – Cf. Spanish WordNet (EuroWordNet)

Introducing Phrasets
• Phraset: a set of synonymous recurrent free phrases
ENG-synset {cornfield}  ITA-synset {GAP}  ITA-phraset {campo_di_grano}
ENG-synset {toilet_roll}  ITA-synset {GAP}  ITA-phraset {rotolo_di_carta_igienica}
ENG-synset {dishcloth}  ITA-synset {canovaccio}  ITA-phraset {strofinaccio_dei_piatti, strofinaccio_da_cucina}

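The aligned entries above could be held in a record like the following. This is only a minimal Python sketch, not MultiWordNet's actual schema: the class `AlignedEntry`, its field names and the helper `italian_equivalents` are hypothetical, and a GAP is modelled here as an empty Italian synset.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AlignedEntry:
    """One English synset aligned with an Italian synset and phraset (hypothetical layout)."""
    eng_synset: List[str]
    ita_synset: List[str] = field(default_factory=list)   # empty list stands for a lexical GAP
    ita_phraset: List[str] = field(default_factory=list)  # synonymous recurrent free phrases

    @property
    def is_gap(self) -> bool:
        # A gap is a concept with no Italian lexical unit.
        return not self.ita_synset


def italian_equivalents(entry: AlignedEntry) -> List[str]:
    """Return Italian lexical units first, then recurrent free phrases."""
    return entry.ita_synset + entry.ita_phraset


# The three examples from the slide.
ENTRIES = [
    AlignedEntry(["cornfield"], [], ["campo_di_grano"]),
    AlignedEntry(["toilet_roll"], [], ["rotolo_di_carta_igienica"]),
    AlignedEntry(["dishcloth"], ["canovaccio"],
                 ["strofinaccio_dei_piatti", "strofinaccio_da_cucina"]),
]

for entry in ENTRIES:
    print(entry.eng_synset, "GAP" if entry.is_gap else "lexicalized",
          italian_equivalents(entry))
```

With this layout a gap synset still yields a translation equivalent through its phraset, which is what word alignment of parallel corpora would need (cornfield ~ campo di grano).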
RFPs vs definitions
RFPs are not definitions
E-synset {tree -- a tall perennial woody plant having a main trunk …}
I-synset {albero -- ogni pianta perenne con fusto legnoso ramificato} [any perennial plant with a branched woody stem]
I-phraset {}
E-synset {paperboy}
I-synset {GAP -- ragazzo che recapita i giornali} [boy who delivers newspapers]
I-phraset {ragazzo_dei_giornali}
E-synset {straphanger}
I-synset {GAP -- chi viaggia in piedi su mezzi pubblici reggendosi ad un sostegno} [someone who travels standing on public transport, holding on to a support]
I-phraset {}

Synsets vs Phrasets
Synsets: simple words, idioms, restricted collocations, named entities
Phrasets: recurrent free phrases
Outside both: free combinations of words

Syntagmatic Relations in WN
• MEANING project: using the “involve” semantic relation to encode deep selectional restrictions
• Can RFPs be encoded through semantic relations?

Encoding “campagna antifumo” - 1
Through phrasets
Synset: {campagna}  Phraset: {}  (campaign)
Synset: {GAP}  Phraset: {campagna_antifumo}  (campaign against smoking)
Relation: hypernym ({campagna} is the hypernym of the GAP synset)

Encoding “campagna antifumo” - 2
Through a semantic relation
Synset: {campagna}  has_constraint  Synset: {antifumo}  (campaign against smoking)

Pros and cons of using semantic relations for encoding RFPs
• Smart and concise
but what about
• trigram RFPs?
• synonymous RFPs?
• RFPs that are translation equivalents of lexical units?
• restrictions on word order and word morphology?

Taking the best of both encodings
• Phrasets and lexical syntagmatic relations
GAP -- campo di grano (cornfield)
  – composed-of (campo): campo (field), whose hypernym is appezzamento (parcel)
  – composed-of (grano): frumento, grano (corn), whose hypernym is cereale (cereal)
Relation labels in the diagram: hypernym (three links), composed-of (campo), composed-of (grano)

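The combined encoding can be sketched as a small data structure in which a phraset points to the synset of each component word. This is a hedged Python illustration only: `Synset`, `Phraset`, `composed_of` as a dictionary and `sense_in_phrase` are hypothetical names, and only the two hypernym links whose endpoints are clear from the slide are modelled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Synset:
    lemmas: List[str]
    gloss: str
    hypernym: Optional["Synset"] = None


@dataclass
class Phraset:
    phrases: List[str]
    # composed-of relations: component word -> the synset it denotes in the phrase
    composed_of: Dict[str, Synset] = field(default_factory=dict)


# Synsets from the slide (English glosses as given there).
cereale = Synset(["cereale"], "cereal")
appezzamento = Synset(["appezzamento"], "parcel")
grano = Synset(["frumento", "grano"], "corn", hypernym=cereale)
campo = Synset(["campo"], "field", hypernym=appezzamento)

# Phraset attached to the GAP synset for "cornfield".
campo_di_grano = Phraset(
    ["campo_di_grano"],
    composed_of={"campo": campo, "grano": grano},
)


def sense_in_phrase(phraset: Phraset, word: str) -> Optional[Synset]:
    """Return the synset a component word is linked to, or None if it is not a component."""
    return phraset.composed_of.get(word)


print(sense_in_phrase(campo_di_grano, "campo").gloss)   # field
print(sense_in_phrase(campo_di_grano, "grano").lemmas)  # ['frumento', 'grano']
```

Because each composed-of link targets one synset, recognising “campo di grano” in a text fixes a single sense for both “campo” (12 senses in MWN) and “grano” (9 senses), which is the Word Sense Disambiguation use mentioned earlier.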
RFP in Bilingual Dictionaries
• Collins bilingual dictionary (medium size)
• Italian Translation Equivalents (Bentivogli and Pianta, 2000)
  – 92.2% correspond to lexical units
  – 7.8% correspond to free combinations of words (lexical gaps)
• Manual check of 300 lexical gaps
  – 67% correspond to RFPs
=> More than half of the synsets which are gaps in Italian potentially have an associated phraset

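The arithmetic behind these figures, as a small Python check. The variable names are illustrative, and the last figure assumes that the 300 manually checked gaps are representative of all gaps, which the slide does not state.

```python
# Figures reported on the slide.
gap_share = 0.078           # translation equivalents that are free combinations (lexical gaps)
rfp_share_of_gaps = 0.67    # checked gaps that correspond to RFPs
checked_gaps = 300          # size of the manual check

# Number of checked gaps judged to be RFPs.
rfp_gaps_in_sample = round(checked_gaps * rfp_share_of_gaps)

# Share of ALL translation equivalents that would be RFPs,
# assuming the sample is representative (an assumption, not a reported result).
rfp_share_overall = gap_share * rfp_share_of_gaps

print(rfp_gaps_in_sample)                 # 201
print(f"{rfp_share_overall:.1%}")         # 5.2%
print(rfp_share_of_gaps > 0.5)            # True: "more than half" of the gaps
```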
RFPs in corpora
• Correlation between RFPs and frequency?
• Analysis of a 32M word corpus (Repubblica, 2000-2001)
• Standard n-gram analysis package (NSP)
• All bigrams including at least one stopword excluded
• 118,464 bigrams occurring more than 3 times
• Highest rank: 5,914 occurrences (“New York”)
• Rank 4: 31,453 bigrams
• 497 distinct ranks (frequency classes)

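The filtering steps listed above can be approximated in a few lines of Python. This is a stand-in sketch, not the NSP package used in the study: the function `frequent_bigrams`, the toy stopword list and the toy token sequence are all assumptions made for illustration.

```python
from collections import Counter

# Toy stopword list (assumption; the study's actual list is not given).
STOPWORDS = {"di", "il", "la", "in", "e", "a"}


def frequent_bigrams(tokens, stopwords=STOPWORDS, min_count=4):
    """Count adjacent word pairs, dropping any pair that contains a stopword,
    and keep those occurring more than 3 times (min_count=4)."""
    counts = Counter(
        (w1, w2)
        for w1, w2 in zip(tokens, tokens[1:])
        if w1.lower() not in stopwords and w2.lower() not in stopwords
    )
    return {bigram: n for bigram, n in counts.items() if n >= min_count}


# Tiny artificial "corpus": one clause repeated four times plus a one-off clause.
tokens = ("il governo italiano a New York " * 4 + "la prima volta in città").split()
result = frequent_bigrams(tokens)
print(result)

# Distinct frequency values correspond to the "ranks" (frequency classes) on the slide.
print(len(set(result.values())))
```

Here “prima volta” is dropped because it occurs only once, and every pair containing a stopword is excluded before counting.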
RFPs in corpora cont.
• Lower ranks are systematically and densely populated
• Higher ranks are sparsely and poorly populated
• Rank groups
  – A: 5,914-509 (100 bigrams)
  – B: 505-257 (257)
  – C: 256-129 (731)
  – D: 128-65 (1,965)
  – E: 64-33 (4,525)
  – F: 32-17 (10,477)
  – G: 16-9 (22,167)
  – H: 8-5 (46,798)
  – I: 4 (31,453)
• Manual check of 100 random bigrams from each rank group

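A sketch of how a bigram frequency could be mapped to one of these rank groups and how the check sample could be drawn. The bounds are copied from the slide as observed extremes, so a frequency between 505 and 509 falls in no group here; `rank_group` and `sample_for_check` are hypothetical helper names, and the fixed seed is only there to make the sketch repeatable.

```python
import random

# (group, lowest frequency, highest frequency) as listed on the slide.
RANK_GROUPS = [
    ("A", 509, 5914), ("B", 257, 505), ("C", 129, 256),
    ("D", 65, 128), ("E", 33, 64), ("F", 17, 32),
    ("G", 9, 16), ("H", 5, 8), ("I", 4, 4),
]


def rank_group(freq):
    """Return the rank group for a bigram frequency, or None if it falls outside all groups."""
    for name, low, high in RANK_GROUPS:
        if low <= freq <= high:
            return name
    return None


def sample_for_check(bigrams, k=100, seed=0):
    """Draw up to k bigrams at random from one rank group for manual checking."""
    rng = random.Random(seed)
    return rng.sample(bigrams, min(k, len(bigrams)))


print(rank_group(5914), rank_group(300), rank_group(4), rank_group(3))
```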
RFPs in corpora cont.
Manual check of 100 random bigrams from each rank group
Rank group              A      B    C    D    E    F    G    H    I
Top frequency           5,914  505  256  128  64   32   16   8    (4)
Lexical units           82     79   74   65   58   55   42   35   28
Recurrent free phrases  14     4    9    14   17   4    15   3    15
Other                   4      17   17   21   25   41   43   58   57
NB: similar results on trigrams

Correlation between num. of RFPs and frequency in a reference corpus
[Chart: number of bigrams (scale 0-90) per rank group A-I, with three series: Lex Unit, R.F.P., Other]

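One rough way to quantify the trend in the chart is a Pearson coefficient between rank-group position and the counts from the manual check. This is an illustrative Python sketch, not an analysis reported in the talk: treating the groups A-I as positions 0-8 is an assumption, and the counts are copied from the table of manual-check results.

```python
from math import sqrt

GROUPS = "ABCDEFGHI"  # A = most frequent bigrams, I = least frequent
LEXICAL_UNITS = [82, 79, 74, 65, 58, 55, 42, 35, 28]
RFPS = [14, 4, 9, 14, 17, 4, 15, 3, 15]


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


positions = list(range(len(GROUPS)))  # 0 for group A ... 8 for group I
r_lexical = pearson(positions, LEXICAL_UNITS)
r_rfp = pearson(positions, RFPS)

print(f"lexical units: {r_lexical:.2f}")
print(f"RFPs:          {r_rfp:.2f}")
```

On these counts the coefficient for lexical units is strongly negative (fewer lexical units as frequency drops), while the one for RFPs stays close to zero, in line with the absence of a clear correlation between RFPs and frequency.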
Future work
• Better characterization and classification
• Correlation with association measures
• Evaluating RFP for WSD

Conclusions
• WordNet is poor in syntagmatic information
• We introduced Recurrent Free Phrases, Phrasets, syntagmatic lexical relations
• RFP: free combination of words recurrently used to express a concept
• Criteria for their selection
• Bilingual dictionaries contain many RFPs
• Corpora: no clear correlation with frequency
• Useful for:
  – lexicographic work
  – Word Sense Disambiguation