
TOWARDS A CORPUS-BASED ONLINE DICTIONARY OF ITALIAN WORD COMBINATIONS

The CombiNet project

SARA CASTAGNOLI

FRANCESCA MASINI

(UNIVERSITY OF BOLOGNA)

MALVINA NISSIM

(UNIVERSITY OF GRONINGEN)

GIANLUCA E. LEBANI

ALESSANDRO LENCI

(UNIVERSITY OF PISA)

ENeL meeting @ Herstmonceux Castle, 13 August 2015

VALENTINA PIUNNO

(UNIVERSITY OF ROMA TRE)

THIS PRESENTATION

• INTRODUCING CombiNet, an ongoing project aimed at building a corpus-based, lexicographic resource for Italian Word Combinations (Universities of Roma Tre, Pisa, Bologna)

  • an innovative resource for the Italian language
  • relevance for ENeL-WG3:
    • an electronic resource
    • an integrated computational-lexicographic approach:
      1) automatic extraction of candidate WoCs from corpora
      2) manual evaluation and compilation

• OUTLINE:
  • our view of Word Combinations (WoCs)
  • AKA: extracting WoCs from corpora – methods
  • evaluation of AKA: automatic and manual

WORD COMBINATIONS (WoCs)

The whole range of combinatory possibilities associated with a word, including:

• Multiword Expressions (MWEs), i.e. a variety of WoCs characterised by different degrees of fixedness and idiomaticity that act as a single unit at some level of linguistic analysis, e.g.:

  • idioms
  • phrasal lexemes
  • collocations
  • preferred combinations

• More abstract combinations, i.e. the distributional properties of a word at the level of e.g.:

  • argument structure
  • subcategorization frames
  • selectional preferences

EXTRACTING WoCs - METHODS

Using POS PATTERNS (P-BASED methods)

- POS-tagged corpus
- list of POS patterns

NOUN PREP NOUN: punto di vista ‘point of view’

NOUN ADJ: anno accademico ‘academic year’

VER DET (ADJ) NOUN: costruire un piccolo impero ‘build a small empire’
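The P-based idea can be sketched in a few lines of Python. This is only an illustration, not the EXTra tool: the function names, the `(lemma, POS)` token format and the `?` notation for an optional slot are assumptions made for the example, and the three patterns are the ones listed above. Note that the extraction reported later in these slides used no optional slots.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (lemma, POS tag); hypothetical input format

# Patterns from the slide; a trailing "?" marks an optional slot (assumed notation).
PATTERNS = [
    ["NOUN", "PREP", "NOUN"],
    ["NOUN", "ADJ"],
    ["VER", "DET", "ADJ?", "NOUN"],
]

def match_at(tokens: List[Token], start: int, pattern: List[str]) -> Optional[int]:
    """Return the end index of a match of `pattern` beginning at `start`, else None."""
    i = start
    for slot in pattern:
        optional = slot.endswith("?")
        tag = slot.rstrip("?")
        if i < len(tokens) and tokens[i][1] == tag:
            i += 1            # slot filled by the current token
        elif not optional:
            return None       # a mandatory slot is missing
    return i

def extract_candidates(tokens: List[Token], patterns=PATTERNS) -> List[Tuple[str, str]]:
    """Collect (pattern, word sequence) pairs for every contiguous match."""
    found = []
    for start in range(len(tokens)):
        for pattern in patterns:
            end = match_at(tokens, start, pattern)
            if end is not None:
                words = " ".join(lemma for lemma, _ in tokens[start:end])
                found.append((" ".join(pattern), words))
    return found
```

Because matching is purely sequential, any material intervening between the slots breaks the match, which is why this family of methods works best for adjacent, relatively fixed combinations.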

Using SYNTACTIC INFO (S-BASED methods)

- parsed corpus
- list of syntactic relations

SUBJ – VERB: guerra – scoppiare ‘war – burst’

VERB – OBJ: perdere – vista ‘lose – (one’s) sight’

VERB – COMP_DI: parlare – di sport ‘talk – about sport’
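A comparable sketch for the S-based approach follows. It assumes the parser output has already been reduced to (head, relation, dependent) triples of lemmas; the `Dep` type and `count_pairs` function are hypothetical names for this example, not part of LexIt, and the relation labels are simply those shown on the slide.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple

class Dep(NamedTuple):
    head: str       # lemma of the governing word
    relation: str   # e.g. "SUBJ", "OBJ", "COMP_DI"
    dependent: str  # lemma of the dependent word

# Relations of interest, taken from the slide.
RELATIONS = {"SUBJ", "OBJ", "COMP_DI"}

def count_pairs(parsed_sentences: Iterable[List[Dep]]) -> Counter:
    """Count (relation, head, dependent) triples, ignoring surface word order."""
    counts = Counter()
    for sentence in parsed_sentences:
        for dep in sentence:
            if dep.relation in RELATIONS:
                counts[(dep.relation, dep.head, dep.dependent)] += 1
    return counts
```

Since only the relation and the two lemmas are counted, the same triple is recorded whether the words are adjacent or far apart in the sentence, and regardless of their order or inflection.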

COMPARING EXTRACTION METHODS

Using POS PATTERNS (P-BASED methods)

- satisfactory results for relatively fixed, adjacent, short WoCs
- patterns need to be specified a priori
- noise, even after applying association measures (AMs)
- cannot capture complex and flexible WoCs
- dismiss abstract combinatory information (e.g. argument structure)

Using SYNTACTIC INFO (S-BASED methods)

- also target discontinuous and syntactically flexible WoCs
- abstract away from information such as linear order, morphosyntactic features etc.
- provide no information about how exactly words combine
- cannot distinguish frequent but productive combinations from idiomatic ones with the very same syntactic structure

Castagnoli et al. 2015; Lenci et al. 2014, 2015

AUTOMATIC EXTRACTION OF CANDIDATE WoCs - DATA

• La Repubblica corpus (Baroni et al. 2004)

  • approx. 380M tokens, POS-tagged and dependency parsed
  • “clean” corpus, but only newspaper language

• POS-based extraction:
  • 122 POS sequences deemed representative of Italian WoCs, in 3 subsets (nominal, verbal, prepositional WoCs)
  • independent extraction rounds, using the EXTra tool
  • contiguous sequences, no optional slots, LL ranking, freq>5

• Syntax-based extraction:
  • distributional profiles, containing the syntactic slots (subject, complements, modifiers, etc.) and the combinations of slots (frames) with which words co-occur, abstracted away from their surface morphosyntactic patterns
  • each slot is associated with lexical sets formed by its most prototypical fillers
  • LexIt tool
  • contiguous and discontinuous sequences, LL ranking, freq>5
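The "LL ranking, freq>5" setting can be illustrated as follows, assuming that LL stands for the log-likelihood ratio (Dunning's G²) computed on a 2×2 contingency table for a two-word combination. The function names and the way frequencies are passed in are assumptions for this sketch; how EXTra and LexIt compute the score internally is not stated in the slides.

```python
import math

def log_likelihood(o11: int, o12: int, o21: int, o22: int) -> float:
    """G2 = 2 * sum(O * ln(O / E)) over the four cells of a 2x2 table."""
    n = o11 + o12 + o21 + o22
    rows = (o11 + o12, o21 + o22)
    cols = (o11 + o21, o12 + o22)
    observed = ((o11, o12), (o21, o22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = rows[i] * cols[j] / n
                g2 += o * math.log(o / expected)
    return 2 * g2

def rank_candidates(pair_freq, w1_freq, w2_freq, n, min_freq=5):
    """Keep pairs with frequency > min_freq and sort them by descending LL."""
    ranked = []
    for (w1, w2), f in pair_freq.items():
        if f <= min_freq:
            continue
        o12 = w1_freq[w1] - f        # w1 without w2
        o21 = w2_freq[w2] - f        # w2 without w1
        o22 = n - f - o12 - o21      # neither word
        ranked.append(((w1, w2), f, log_likelihood(f, o12, o21, o22)))
    ranked.sort(key=lambda row: row[2], reverse=True)
    return ranked
```

A table in which the two words co-occur exactly as often as chance predicts scores 0; the score grows as observed and expected counts diverge.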

DATA FOR LEXICOGRAPHERS

1) All sequences corresponding to the mentioned patterns are extracted from the corpus.

2) Lists of candidate WoCs are filtered to extract lines containing specific Target Lemmas (TLs), i.e. future headwords.

  • Headwords: the 2,100 “fundamental” words from the Senso Comune lexicon (http://www.sensocomune.it/)
  • Nouns, Verbs, Adjectives

3) Lexicographers are provided with structured lists:

  • lemmatised candidate WoCs for a given TL
  • ranked according to their LL score
  • raw frequency of each combination in the corpus
  • underlying POS pattern or syntactic relation
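Steps 2 and 3 amount to a filter followed by a sort. The sketch below assumes each candidate is stored with its lemmas, its pattern (or syntactic relation), its raw frequency and its LL score; the `Candidate` type and `candidates_for` function are hypothetical, and the optional `pattern` argument mirrors the kind of filtering a lexicographer might apply in a spreadsheet.

```python
from typing import List, NamedTuple, Optional, Tuple

class Candidate(NamedTuple):
    lemmas: Tuple[str, ...]  # lemmatised candidate WoC
    pattern: str             # underlying POS pattern or syntactic relation
    freq: int                # raw corpus frequency
    ll: float                # LL score

def candidates_for(target_lemma: str,
                   candidates: List[Candidate],
                   pattern: Optional[str] = None) -> List[Candidate]:
    """Return candidates containing the target lemma, best LL score first."""
    rows = [
        c for c in candidates
        if target_lemma in c.lemmas and (pattern is None or c.pattern == pattern)
    ]
    return sorted(rows, key=lambda c: c.ll, reverse=True)
```

For example, `candidates_for("vista", all_candidates, pattern="NOUN PREP NOUN")` would return only the nominal candidates for the headword vista, where `all_candidates` is whatever list the extraction step produced.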

POS-BASED DATA

[Figure: sample POS-based data]

SYNTAX-BASED DATA

[Figure: sample syntax-based data]

LEXICOGRAPHERS’ USE OF DATA

• Candidate lists for each TL are imported into a spreadsheet.

• As our current lexicographic layout groups WoCs on the basis of their function and syntactic configuration, lexicographers can scroll candidate lists or filter them to observe and evaluate only candidate WoCs corresponding to specific POS patterns and/or syntactic relations.


• Candidates considered as valid WoCs are manually selected and edited before being recorded in the relevant part of the lexicographic record.

LEXICOGRAPHERS’ EVALUATION - 1

(“highly impressionistic feedback from our lexicographers”)

• LL ranking is generally helpful, as most higher-ranking candidates represent (or contain, or suggest) proper WoCs which deserve inclusion in the dictionary.

• However, it is difficult to set thresholds, since WoCs which lexicographers would intuitively include in the entry also appear in the middle and lower part of the ranking.

• POS-based data are more useful to compile the entries for nominal and adjectival TLs, whereas syntax-based data would be more helpful for verbal TLs.

• No systematic evidence provided.

AUTOMATIC EVALUATION - 1

• We tested and compared the performance of the two extraction methods using an existing Italian combinatory dictionary as a benchmark (25 TLs).

• Recall, (R-)precision, thresholds, systems’ overlap

• Interesting findings supporting the lexicographers’ intuition:

• Recall is rather high for both systems

• Recall of P-based method is higher for N and A, while S-based method has higher recall for V

• Recall for P-based method appears to plateau at 2,000 hits (*)

• The P-based and S-based methods often extract (or fail to extract) the same WoCs: performance is identical for 76% of gold standard combinations (*)

• But they also extract different gold standard combinations, with a complementary distribution (P-based: N+A, S-based: V) (*)

• R-precision is higher for S-based method

• Crowdsourcing evaluation: nearly 25% of candidates are valid WoCs even if they are not included in the benchmark dictionary (*)

Castagnoli et al. 2015
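The measures named on this slide can be written down compactly. The sketch assumes a ranked list of candidates per system and a non-empty gold-standard set taken from the benchmark dictionary; R-precision is taken in its standard sense (precision at rank R, where R is the size of the gold standard), and `agreement` is a hypothetical name for the share of gold items on which two systems behave identically. The exact procedure used in the evaluation is described in Castagnoli et al. 2015, not here.

```python
from typing import List, Optional, Set

def recall(ranked: List[str], gold: Set[str], k: Optional[int] = None) -> float:
    """Share of gold items found among the top-k candidates (all if k is None)."""
    top = ranked if k is None else ranked[:k]
    return len(set(top) & gold) / len(gold)

def r_precision(ranked: List[str], gold: Set[str]) -> float:
    """Precision at rank R, where R is the number of gold items."""
    r = len(gold)
    return len(set(ranked[:r]) & gold) / r

def agreement(system_a: List[str], system_b: List[str], gold: Set[str]) -> float:
    """Share of gold items that both systems extract or both systems miss."""
    a, b = set(system_a), set(system_b)
    same = sum((item in a) == (item in b) for item in gold)
    return same / len(gold)
```

Computing `recall` for increasing values of `k` gives the kind of curve on which a plateau, such as the one reported for the P-based method, would be visible.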

LEXICOGRAPHERS’ EVALUATION - 2

• Lexicographers report adding WoCs that “should intuitively be there” but are not extracted from the corpus.

• More research is needed to:

a) analyse the nature of these WoCs

• Patterns we haven’t thought of? (Long) idioms?

b) assess the impact of extraction techniques and settings

• Min. frequency?

c) assess the impact of corpus type and size

• Limited to a single newspaper corpus

• Virtually no difference with the PAISÀ corpus (250M words, copyright-free web content)

• Maybe a huge web corpus?


OTHER LIMITATIONS

• Still a lot of manual work for lexicographers

• No automatic import / conversion of acquired data into an editing database / interface

• We are not using a proper Dictionary Writing System

• Many other ideas that came up listening to some eLex presentations…

THANK YOU!
