multiword expressions in nlp

MultiWord Expressions in NLP

Jan Odijk

LOT Summerschool

Utrecht, June 2004

Overview

• NLP

• MWEs

• MWEs in NLP

• MWE Types

• Treatment of MWEs in selected frameworks

• MWEs and the lexicon

Overview

• NLP

• MWEs

• MWEs in NLP

• MWE Types

Natural Language Processing

• Automatic processing of natural language– Generation: Semantic Repr String– Analysis: String Semantic Representation

• Example applications– Machine Translation (MT)– Information Retrieval (IR)– Cross-language Information Retrieval (CLIR)– Question-Answering

• Based on Grammars– (Popular) frameworks

• Feature structure based– Head-driven Phrase Structure Grammar (HPSG)– Lexical-Functional Grammar (LFG)

• Tree-based– Tree-Adjoining Grammar (TAG)– M-Grammar

• Based on grammar components or dedicated modules– Decompounding– PoS-tagging– Chunking– Named Entity Recognition– Name/Address grammars– Date / Amount grammars

• Based on Statistics– No explicit grammar– Statistics

• Derived from (annotated) training corpus

• Tested with test corpus

• Applied to new corpora

• Combinations of grammar and statistics

NLP Grammar

• Defines <form, meaning> pairs and structural descriptions at various levels

• Components– Semantics– Syntax– Morphology– Orthography (Phonology)

NLP Grammar

• Semantics– Defines the meaning of an utterance– usually synchronized with syntax

(compositionality)• HPSG: CONTENTS attribute• M-Grammar: in-tandem build up• Synchronous TAG: in-tandem build-up with

derivation trees• LFG: in tandem with f-structure

NLP Grammar

• Syntax– Defines the syntactic structure of an utterance– Object types: Trees, DAGs– Features: attribute-value pairs– Value: atomic or structured

NLP Grammar

• Syntax– Often surface syntax and deep syntax

(not necessarily on a separate level)

• HPSG: surface tree v. DAG

• M-Grammar: surface trees v. derivation trees

• LFG: c-structure v. f-structure

• TAG: derived tree v. derivation tree

• Alpino: surface tree v. dependency tree

NLP Grammar

• Morphology– Relates (word structure, string)– Word-internal structure build-up usually in the

syntactic component – Usually a rule system (intensional definition)– Simple Inflection: sometimes list of triples

<base form, morph prop, word form> (extensional definition)

NLP Grammar

• Orthography– Relates ([String], String)

• [he, said, :, “, come, in, !, “]• He said: “come in!”

– Usually trivial in generation– Easy in analysis (tokenization) for many

languages– Sometimes split (erop, opgebeld)– Very problematic for Chinese, Japanese, etc.

Overview

• NLP

• MWEs

• MWEs in NLP

• MWE Types

What are MWEs?

• sequence of words that has lexical, orthographic, phonological, morphological, syntactic, semantic, pragmatic or translational properties not predictable from the individual components or their normal mode of combination

What are MWEs?

• sequence of – Not necessarily contiguous in a concrete utterance

• ...omdat hij de plaat wilde poetsen

– Not necessarily always in the same order in each utterance• Hij poetste gisteren de plaat

• words – Ambiguity between type and token (intentional)– Inflected word form v. lemma– Ambiguity between

• Character sequences separated from other character sequences by spaces and other separators (Narrow interpretation)

• Abstract lexical units of the grammar (Broad interpretation)

What are MWEs?

• that has properties not predictable from the individual components and their normal mode of combination

What are MWEs?

• Lexical– De plaat poetsen– Een poging wagen / doen / *maken– Dat varkentje eens wassen– Zware / *sterke shag– Scherpe kritiek– Perdre la tête/ la boule / *la cervelle– Se creuser la tête / * la boule / la cervelle

What are MWEs?

• Orthographic– viz.

– Bijv.

– www.uilots.nl

– i.v.m.

– Yahoo!

– Groen!

– Aujourd’hui (v. l’homme)

– ‘s (avonds/morgens/middags)

What are MWEs?

• phonological, – Over de rooie/*rode (gaan/zijn/raken)

– om de dooie/*dode donder niet

– op zijn dooie akkertje/gemak

– op zijn dooie eentje

– De kwaaie/*kwade Piet toegespeeld krijgen

– Je niet in de kouwe/*koude kleren gaan zitten

– Een gouwe ouwe

– (but geen rode/rooie cent/duit (hebben))

What are MWEs?

• morphological, – Ten gevolge van– Ter wereld– Van goeden huize– Zonder aanzien des persoons– Het lood*(je) leggen– Dat varken*(tje) wassen– De *raap is / rapen zijn gaar

What are MWEs?

• Syntactic– Ten gevolge van– In opdracht van (no article)– Iemand een oor aannaaien– Rekening houden met (obligatorily indefinite)– Het bijvoeglijk(*e) naamwoord (v. een

groot/grote man)

What are MWEs?

• Semantic– De plaat poetsen– Dat varkentje wassen– Een bok schieten– Een flater slaan

What are MWEs?

• Pragmatic– Ladies and Gentlemen– Ik heb gezegd.– Eet smakelijk! (Bon appétit!, Enjoy!)– Sincerely yours

What are MWEs?

• Translational properties – Laten zien (F. montrer, E. show)– Witte wijn (P. vinho verde)– Nuclear power plant (D. atoomcentrale, G.

Kernkraftwerk)– Space probe (F. sonde spatiale)– Iemand iets laten weten

• inform someone of something

Overview

• NLP

• MWEs

• MWEs in NLP

• MWE Types

MWEs in NLP

• MWEs occur very often in natural language– Esp. in languages with little compounding

• Especially in specialized domains– Multi-word terminology

MWEs in NLP

• MT– Improves parsing and translation of the MWEs

– Also improves parsing hence translation of the sentence containing the MWEs (Nivre & Nilsson LREC 2004)

• CLIR– Nuclear power plant

• Kern- macht plant

• Kern- Macht Pflanz

• v. atoomcentrale / Kernkraftwerk

MWEs in NLP

• Problems MWEs pose for NLP– How are MWEs to be dealt with in the

grammar of an NLP system?– What lexical representation of MWEs is

required for this?– How can we obtain lexicons containing MWEs

with such lexical representations

Overview

• NLP

• MWEs

• MWEs in NLP

• MWE Types

Types of MWEs (I)

• Fixed

• Semi-flexible

• Flexible

Fixed MWEs

• Fixed MWEs– Words of the MWE in a fixed order– No variation in lexical item choice– Always contiguous (no other elements in

between)– No inflectional processes except at the edges

Fixed MWEs

• Fixed MWEs– ad hoc, stante pede, ter plaatse– Hong Kong, Kuala Lumpur, New York, San Francisco– credit card, travel agency, real estate agency

• NOT– in plaats van (cf. in plaats daarvan) (‘instead of’)– carta telefonica (cf. carte telefoniche)– de plaat poetsen (‘polish the plate’, ‘bolt’)

Semi-Flexible MWEs

• Semi-Flexible MWEs– MWEs with fixed order of elements– That are impenetrable for other words– Parts can be inflected

Semi-Flexible MWEs

• Examples:– Chambre des représentants

• House of representatives

– Patatas fritas• French fries

– Mise au point automatique• Autofocus

– Calculateur analogique• Analogue computer

Semi-Flexible MWEs

• Examples:– Cité plus haut

• Above-stated

– Résistant aux acides• Acid-proof

– Malade en altitude• Airsick

Flexible MWEs

• Flexible MWEs• Allow or require inflection in multiple parts, and

• Allow permutations of subphrases, or

• Allow intrusion by other phrases, or

• Have controlled variation (bound pronouns)

Flexible MWEs

– de plaat poetsen (‘bolt’)• Hij heeft gisteren de plaat gepoetst• …omdat hij de plaat wilde poetsen• Hij poetste gisteren de plaat

– to lose one’s temper• He lost his temper• She lost her temper

Treatment

• Fixed MWEs– No inflection: Relate single string to sequence of

strings (in Orthography)• ([ad_hoc] , [ad, hoc]) • Lexical entry for ad_hoc

– With inflection: Relate single stem to sequence of stems in Morphology

• ([real, estate, agency, Plur] -> [real_estate_agency, Plur])• Lexical entry for real_estate_agency

Treatment

• Semi-flexible MWEs– Require local syntax– Chunking may be enough

Treatment

• Flexible MWEs– Require sophisticated syntax

Types of MWEs (II)

• Verb –particle combinations (English, German, Dutch, Hungarian)– Ik sloeg hem over– I looked the passage up

Types of MWEs (II)

• Verb + prepositional complement– I looked after her– Hij heeft altijd van haar gehouden

Types of MWEs (II)

• Circumpositions (Dutch, German)– Op iemand af / ?toe / *heen– Auf jemanden *ab / zu– Over de brug heen / *af / *toe

Types of MWEs (II)

• Lexical item (from open or closed class)

• + closed class lexical item– Finite (actually small) list

• Limited variety of predictable syntactic structures

• Dealt with by almost any grammar-based NLP system

Types of MWEs (II)

• Multiword Names– Examples

• Fifth Avenue

• Koning Leopold III-laan

• Krimpen aan de IJssel

• Koninklijke Nederlandse Philips N.V.

Types of MWEs (II)

• Multiword Names– Issues

• Keys – variation– (Koning) Leopold III-laan

– Fifth (Avenue)

– ((Calle) Roberto) González

– Many different ones, continuously new ones

– Very important for correct parsing and translation• Minister Kohl Minister Cabbage

Types of MWEs(II)

• Compounds (in English)– Examples

• Real estate agency

• Nuclear power plant

• Blue cheese

• Private eye

• High school

Types of MWEs(II)

• Idioms– No or unpredictable meaning of the

components– Fixed (or very limited ) lexical item selection– Opaque

• Kick the bucket

• De plaat poetsen

• Casser sa pipe

Types of MWEs(II)

• Idioms– Semi-transparant

• `een bok schieten’– Bok (male goat) = blunder

– Schieten (shoot) = make

• `dat varkentje wassen’– Varkentje (little pig) = problem

– Wassen (wash) = address, take care of

Types of MWEs(II)

• Idioms– Usually completely normal syntactic structure– Both a literal and an idiomatic reading– Participate normally in many grammatical

processes– BUT: often restrictions on participating in

grammatical processes

Types of MWEs(II)

• Idioms (opaque)– Normal participation

• Hij poetste de plaat (V2)• Poetste hij de plaat? (V1, question formation)• ...omdat hij de plaat wilde poetsen (VR)

– But not• #De plaat heeft hij niet gepoetst (topicalization)• #Welke plaat heeft hij gepoetst? (wh-Q)• #Hij heeft de mooie plaat gepoetst (internal modification)• #Hij heeft de plaat waarschijnlijk niet gepoetst (Mid-field NP-Adv

permutation)• #De plaat die hij gepoetst heeft (Relativization)• #De plaat werd door hem gepoetst (Passive)• #Wat een plaat (independent occurrence))

Types of MWEs(II)

• Idioms (semi-transparant)– Normal participation

• Hij schoot een bok (V2)• Schoot hij een bok? (V1, question formation)• ...omdat hij een bok zou schieten (VR)

– Also• Een bok heeft hij niet geschoten (topicalization)• Wat voor een bok heeft hij nu weer geschoten? (wh-Q)• Hij heeft een enorme bok geschoten (internal modification)• Hij heeft die bok waarschijnlijk geschoten omdat ... (Mid-field NP-Adv

permutation)• De bok die hij geschoten heeft (Relativization)• Er werd door hem een enorme bok geschoten (Passive)

– But not:• Wat een bok! (independent occurrence)

Types of MWEs(II)

• Idioms– In some cases irregular syntactic structure

• Ten gevolge van (fossilized portmanteau words, noun e-form)• Het bijvoeglijk naamwoord (no –e)• Iemand de oren wassen (inalienable possession construction)

– Regular syntactic structure but not predictable from the components’ properties

• Iemand welkom heten– (`heten’ on its own can only take a pred. complement)– *Hij heet hem aardig / president / Jan

• to lose face (count noun in determinerless NP)

Types of MWEs(II)

• Idioms– Cranberry words

• Ergens de brui aan geven

Types of MWEs(II)

• Semi-idioms (collocations)– One element occurs in its normal meaning– The lexical selection of the other element is

fixed or very limited– The other element has a special meaning– Examples

• Zware tabak (heavy tobacco) `strong tobacco’• Scherpe kritiek (sharp criticism) `severe criticism’• Heavy smoker

Types of MWEs(II)

• Support verb constructions– Type I

• Een poging wagen

• Een lezing houden / geven

• To pay attention to (aandacht schenken aan)

• To take advantage of

Types of MWEs (II)

• Arguments of the noun outside the NP– De kritiek die we hadden op hem– ?de kritiek op hem die we hadden– *De kritiek die we naar voren brachten op hem– De kritiek op hem die we naar voren brachten– De kritiek op hem verstomde

Types of MWEs (II)

• Arguments of the noun outside the NP– De aandacht die we schonken aan hem– *de aandacht aan hem die we schonken– *De aandacht die we becommentarieerden aan

hem– De aandacht aan hem die we

becommentarieerden– De aandacht *aan / voor hem verflauwde

Types of MWEs (II)

• Arguments of the noun outside the NP– The attention that we paid to this subject– *The attention to this subject that we paid– *The attention that we criticised to this subject– The attention to this subject that we criticized– The attention to this subject

Types of MWEs (II)

• Arguments of the noun outside the NP– No attention was paid to this subject– This subject was paid no attention to– Advantage was taken of this proposition– This proposition was taken advantage of

Types of MWEs (II)

– Type II• Iemand een stomp geven

• Iemand een klap geven

• To give someone a kiss

– Noun signifying bodily touch– `give’ + indirect object as Patient

Types of MWEs (II)

– Type III(?) ..\Utrecht\MWEs\copulasetc\selectzijn edited Current.xls

• In de war zijn / raken / * gaan / ?komen / brengen

• In zijn nopjes zijn / raken / * komen / *brengen

• De pijp uit zijn / ?raken / gaan/ *komen / *brengen

Types of MWEs (II)

• Quasi-idioms– Characteristics: Regular meaning + something

extra (specialization)– Examples

• Huisdeur (`front door’)• Bijvoeglijk naamwoord• Fried eggs

– Terms– Compounds

Overview

• NLP

• MWEs

• MWEs in NLP

• MWE Types

Requirements

• Account for the fact that MWEs usually have `normal’ syntactic structures

• Account for the normal participation of MWEs in most syntactic processes

• Account for the restrictions on the participation in some syntactic processes

• Recognize it as an MWE and assign it the associated semantics

Tree-Adjoining Grammar (TAG)

• Originally developed by Aravind Joshi• Extended by Shieber, Schabes• Applied to French by Abeillé • Basic object: trees• Enriched with features (incl. unification)• Known parsing algorithm and complexity

properties• Defines mildly context-sensitive languages

• Lexicalized TAG (LTAG)

• Trees:– Elementary Trees

• Associated with a lexical item

• Initial– Leaves labeled by

» Terminals

» Substitution nodes

• Initial Trees (examples)– N[Jean]– S[N0↓ V[dormir]]– S[N0↓ V[aimer] N1↓]– S[N0↓ V[ressembler] PP[P[à] N1↓]

• Used for words (verbs) taking nominal, adjectival, prepositional arguments

• Auxiliary Trees– One leaf node (foot node) same category as root node

– Used for modifiers, auxiliary verbs, raising verbs, verbs taking sentential arguments

– Examples• N[A[beau] N*]

• N[Det[le] N*]

• S[N0↓ V[penser] S1’[C[que] S1*]]

• V[V[sembler] V*]

• Constraints on Elementary Trees– Lexicalization– Predicate-Argument co-occurrence– Semantic Consistency

• Operations– Substitution– Adjunction

• NO– Deletion– Movement– Permutation

• Operations– Substitution

• Substitutes a tree at a leaf node marked for substitution (↓)

• Example– S[N0↓ V[dormir]] + N[Jean] – S[N[Jean] V[dormir]]

• Operations– Adjunction

• Inserts an auxiliary tree (or a tree derived from an auxiliary tree) at any node (with the same label)

• If this node dominates a subtree, this subtree is copied under the foot node of the auxiliary tree

– Example• S[N[Jean] V[dormir]] + V[semble V*] • S[N[Jean] V[semble V[dormir]]]

• Derived tree– Tree created by substitution or adjunction– Encodes word order, inflection,

morphosyntactic features

• Derivation Tree– α-dormir[ 1/α’-Jean 2/β3-semble]– Close to dependency trees– Basis for semantic interpretation

• Lexical rules– Elementary tree elementary tree– For passive, wh-questions, cleft-constructions, relatve

clauses, cliticization– Define the lexical item’s tree family

• Examples of derived elementary trees– Passive: S[N1↓ V[être] V[aimé] PP[P[par] N0↓]– Object-cleft S[V[CI[ce] V[être] N1↓ S’[C[que] S[N0↓

V[aimer]]]]– Object-Rel.: N[N1* S’[C[que] S[N 0↓V[aimer]]]]

• Synchronous TAG

• Each elementary tree is associated with a semantic tree

• Links between nodes from the elementary tree and the semantic tree

• N-1[Jean] – T-1[jean’]

• S-2[N0↓-1 V[dormir]] – F-2[R[dormir’] T1↓-1]

• S-1[N0↓-2 V[aimer] N1↓-3]– F-1[R[aimer] T0↓-2 T1↓-3]

• S-1[N0↓-2 V[ressembler] PP[P[à] N1↓-3]– F-1[R[ressembler’] T0↓-2 T1↓-3]

• N-1[A[beau] N*]– F-1[R[beau’] T0*]

• N[Det[le] N*]– F-1[R[le’] T0*]

• S-1[N0↓-2 V[penser] S1’[C[que] S1*]]– F-1[R[penser’] T0↓-2 T1*]

• V-1[V[sembler] V*]– F-1[R[sembler’] T0*]

• Given a pair, select a link (non-deterministically) with roots A and B

• Select another pair with roots A and B• Combine the syntactic trees (at node A) and

the semantic trees (at node B), by adjunction or substitution, and remove the link between A and B

• Do this recursively

• <S[N0↓-1 V-2[dormir]], F-2[R[dormir’] T1↓-1]>, select link 1

• + <N-1[Jean], T-1[jean’]> <S[N[Jean] V-2[dormir]],

F-2[ R[dormir’] T[jean’]]

• !+ < V-1[V[sembler] V*], F-1[R[sembler’] T0*]> <S[N[Jean] V[V[sembler] V-2[dormir]]],

F[R[sembler] F-2[ R[dormir’] T[jean’]]]

• Idiomatic Expressions in LTAG– Each idiomatic expression represented by an

elementary tree

• Examples– S[N0↓ V[briser] N1[D[la] N[glace]]– S[N0↓ V[prendre] N1↓ PP[P[en] N[compte]]]– S[N0[D[des] N[ailes]] V[pousser] PP1[P[à]

N1↓]]

• Associated with a semantic tree

• F[R[briser-la-glace’] T0↓]

• F[R[prendre-en-compte’] T0↓ T1↓]

• F[R[des-ailes-pousser- à’] T0↓]

• With the appropriate links

• Lexical rules– Apply to idiom elementary trees normally– But can be individually constrained

• Links between parts of an idiomatic elementary tree and semantic trees are allowed:– internal syntactic modification can correspond

to extenal semantic modification

• MWEs are normal elementary trees

• Lexical rules apply to idiom elementary trees as usual

• Lexical rules can be restricted to apply to certain elementary trees

• Recognition and semantics: same as with single word elementary trees

• But– No restrictions on elementary trees– Idiomatic elementary trees can deviate– Elementary trees are complex (esp. features):

difficult to maintain– Restrictions on grammatical processes basically

stipulated

Overview

• NLP

• MWEs

• MWEs in NLP

• MWE Types

M-Grammar

• Developed by Jan Landsbergen• Inspired by Montague grammar• Compositional Grammars

– The meaning of an expression is a function of the meaning of its parts and the way they are combined

• Use traditional syntactic surface trees (but with relations)

M-Grammar

• Used for Machine Translation

• Research Prototype MT System– Dutch, English, Spanish

• Developed at Philips Research Labs– Rosetta project– Rosetta3 System

• Compositional Translation Method

M-Grammar

• Compositionality of Meaning– The grammars are organised in such a way that

the meaning of an expression is a function of the meaning of its parts and the way they are combined.

• Implemented by Compositional Grammars– Basic Expressions (BE), with a meaning– Rules, with a meaning (recursively applicable)

M-Grammar

• Basic Expressions– With basic meaning

• M-Rules– With meaning operation

• Basic object: S-trees• S-tree = N[r1/T1,...rn/Tn]

– N: node = CAT{a-v pairs}– Ti: S-trees– Ri: grammatical relation (subject, object, head,...)

• S-tree of a basic expression is basic S-tree

M-Grammar

• S-tree of a full utterance is created by applying M-rules to S-trees, initially basic S-trees

• Derivation history is recorded in syntactic derivation tree (syntactic D-tree)

• M-rules– Powerful rules

– Structure creation, deletion, permutations, movements, insertions

M-Grammar

• M-Rules– Relate [T1,..,Tn] to T

– Reversible• Analytic and generative versions can be derived automatically

– Measure condition • Each Ti in [T1,...,Tn] must be `smaller’ (according to some

measure) than T

– Reversibility and Measure Condition guarantee effectiveness of parsing

M-Grammar

• Syntactic D-trees contain names of basic expressions and names of rules

• Can be mapped into (isomorphic) Semantic D-trees, containing names of basic meanings and names of meaning operations

M-Grammar

• Principle of Compositionality of Translation– Two expressions are each other's translation if they are

built up from parts which are each other's translation, by means of rules which are each other’s translation.

• Implemented by tuning Compositional Grammars G1 and G2– For each BE in G1 at least one BE in G2 that is

translationally equivalent– For each rule in G1 at least one rule in G2 that is

translationally equivalent

M-Grammar

• Interlingual System• Interlingua obtained as a side-effect of

tuning compositional grammars (no independent interlingua)

• Interlingua expresses translational equivalence– B1 and B1’ are translations of each other– Not necessarily the meaning of B1/B1’

M-Grammar

• G1:– BEs: N-boek (M-book:book’), A-interessant (M-interesting:

interesting’)– Mrule R1: meaning name: IndefPlMod

• syntax– <[N, A], NP[mod/A{e}, head/N{pl}]>

• Semantics: [[A]](x) & [[N]](x)

• G2– BEs: N-livre (M-book: book’), A-intéressant( M-interesting:

interesting’)– Mrule R2 meaning name: IndefPlMod

• Syntax– <[N,A], NP[det/D-des, head/N’[head/N{pl, m}, mod/A{pl, m}]]

• Semantics– [[A]](x) & [[N]](x)

M-Grammar

• NP[mod/A{e}-interessant head/N{pl}-boek] interessante boeken

• Syn D-Tree: R1 [boek interessant• NP[det/D{pl}-des head/N’[head/N{pl,m}-livre],

mod/A{pl,m}-interéssant] des livres interéssants

• Syn D-Tree: R2 [livre interessant]• Sem D-Tree IndefPlMod[M-book M-interesting]• Semantics: book’(x) & interesting’(x)

– (ignoring plurality, indefiniteness)

M-Grammar

• Full System • Sem D-Trees (IL)• A-TRANSFER G-TRANSFER• Syn D-Trees Syn D-Trees• M-PARSER M-GENERATOR• S-Trees S-Trees• S-PARSER LEAVES• [Lexical S-Trees] [Lexical S-Trees]• A-MORPH G-MORPH • String String

M-Grammar:MWEs

• Method for dealing with idioms• Each idiom

– is a basic expression– Associated with a complex syntactic structure complex basic expressions

• Syntactic Structure of an idiom: – Canonical (abstracting from syntactic operations: passive,

topicalization, verb movements, wh-questioning, ...– Represented as a D-tree

• Much more stable part of the grammar• Much simpler than S-trees (no features etc)• Makes it easier to deal with bound variation• Better guarantee of correctness of structures

– In the lexicon a identifier to a D-tree (idiom pattern)

M-Grammar:MWEs

• Start rules:– Combine a BE with its arguments (variables)– If the BE is complex structure created by

applying the rules in the associated D-Tree• Resulting Structure runs through all the

normal rules of grammar• In analysis: incoming structure is analyzed

using the rules in the D-tree

M-Grammar:MWEs

• In analysis:– Guide the analysis by the idiom’s D-tree– If successful, and the arguments are also

correctly analyzed: extend the sentence’s D-tree with the start rule dominating a BE (for the idiom) and the D-trees for the arguments

M-Grammar:MWEs

• De pijp uit gaan• D-tree for vpid30 (simplified):

Rsubst,i[RVP [$aV_00_ga,

VAR_j RPPpost [$s_prep1286700, VAR_i ]

], RNPdef [$aV_00_pijp]]

M-Grammar:MWEs

• Start Rule 1: BE(verb) + VAR• S-Tree created in case of an idiom by applying the D-Tree for the

idiom• S[subj/VAR_j, head/VP[compl/PP [obj/NP[det/D-de head/N-pijp ] head/P-uit ] head/V-ga ] ]

M-Grammar:MWEs

• Normal participation in grammatical processes: structures for idioms are normal

• Restrictions:– Rules that affect meaningful elements cannot apply to

meaningless parts of idioms relativization, topicalization, wh-questioning, NP-

Adv order switch, exclamative formation, adjectival modification, independent occurrence, ...

– (parts of ) Rules that are purely syntactic not restricted V2, VR, V1 (if question formation and V1 are

separated)

M-Grammar:MWEs

• Deviant syntax– Portmanteau words (ten, ter)– E-form of nouns– Inalienable possession construction

• Minor Rules: rules that can only occur in an idiom D-Tree

M-Grammar:MWEs

• Idioms have normal syntactic structures generated by applying the normal M-rules following the associated idiom D-Tree

• Other M-Rules apply to these structures in the normal way• Restrictions on applications accounted for because M-Rules applying

to meaningful elements cannot apply to meaningless parts of idioms• Deviant syntactic structure can be accounted for by Minor Rules• The idiom D-trees can be derived using the system itself (putting

minor rules on in the grammar)• Recognition of idioms: by guiding the analysis along the idiom D-tree• Semantics: each idiom is (complex) basic expression with associated

basic meaning

M-Grammars and TAG

• Neither has an adequate treatment of semi-idioms (yet)

• Neither has an adequate treatment of transparent idioms (yet)

• Idiom representation as D-trees can also be done in TAG (Abeillé)

Overview

• NLP

• MWEs

• MWEs in NLP

• MWE Types

Lexical representation

• How do we obtain lists of MWEs?

• How do we represent them lexically

• How can we improve exchangeability of MWE lexical representations?

MWE Acquisition

• Lists from existing dictionaries– Always incomplete– Especially for specific domains– Do not necessarily reflect actual usage

• Semi-automatic acquisition from text corpora is called for– Especially for rapid tuning NLP system to

specific domain, company, organization

MWE Acquisition

• Use statistical properties of MWEs to acquire them (semi-)automatically

• Examples:– Mutual Information: log Pr(xy)/(Pr(x)Pr(y))– Salience: Pr(xy) * MI– Dice, chi-square, log-likelyhood, em metric, ...

MWE Acquisition

• Mutual Information:– Pr(z) estimated by F(z)/T– Pr(xy)/Pr(x)Pr(y) = (F(xy)*T)/F(x)F(y)– Experiment: (just for adjacent 2-word MWEs)

• Salience– Compensates for favoring low frequency items– Experiment: (just for adjacent 2-word MWEs)

MWE Acquisition

• Extend to 3-word MWEs, etc.

• Combine with NLP system/components– PoS-tag to obtain syntactically meaningful

combinations (yes: A N, no: Adv N)– Parsing and compute statistics on D-trees

• Allows acquisition of discontinuous MWEs

• Very lively and active research area

Lexical Representation of Idioms

• Many grammatical treatments of idioms require– Whatever is needed for single word lexical

items– Syntactic structure– Unique references to lexical items

• Highly framework / theory / implementation-specific

Lexical Representation

• M-Grammar system-specific matters:– D-trees, compatible with the specific

implementation – Unique references to items in Rosetta-

lexicon– Order of the items (`gaan uit pijp’)– Presence of items (articles absent)

Lexical Representation

• LTAG system-specific matters:– Syntactic trees, compatible with the

specific implementation – Unique references to items in

system’s lexicon– Order of the items (= canonical

surface order)– Presence of items (all present)

• Lexical Representation– Maximally theory-neutral

• Incorporation Method – Generic– Maximally reuses existing NLP system

• Core Idea:– Describes which idioms have the same structure– Structural Equivalence Classes for Idioms

• Lexical Representation= – Idiom Descriptions– Idiom Pattern Descriptions

• Idiom description=– Idiom pattern (identifier)

• Identifier for idiom structures• Used to define the equivalence classes

– Idiom Component List (ICL), with base forms• All• Any order, but the same within one equivalence class

– Example sentence containing the idiom• Same syntactic structure for each idiom of the same

equivalence class

• Idiom pattern description• Idiom pattern identifier• Comments (free text)

• Example:– Idiom Descriptions

• Idp30;De pijp uit gaan;Hij is de pijp uit gegaan• Idp30;De boot in gaan;Hij is de boot in gegaan• Idp30:Het schip in gaan;Hij is het schip in gegaan

– Idiom pattern definition• Idp30• Idiom headed by a verb taking a postpositional PP

containing a definite singular NP and one free argument as subject

• Incorporation Method– Manual part, once for each idiom pattern – Automatic Part, for each idiom

description

• Manual part (`hij is de pijp uit gegaan’)1. Parse the example sentence of an idiom description

with idiom pattern P, yielding the Reference Parse 2. Define a transformation to turn the reference parse into

the idiom structure ( Parse Transformation, PT) 3. Determine the list of unique IDs of the lexical items in

the idiom structure for the system derived from the reference parse (Idiom Component ID List, ICIL)

4. Define a transformation to relate ICL and ICIL (Idiom Component Transformation, ICT)

5. Apply the ICT to the ICL, yielding the transformed ICL (TICL) and check that each item in it equals the base form of the corresponding element on the ICIL

Automatic part, for each idiom description I(`hij is de boot in gegaan’)1. Parse example sentence (Syntactic Structure)2. Apply IPT and check identity with idiom

structure modulo the lexical items3. Select the component IDs from the parse tree, in

order to obtain the ICIL)4. Apply ICT to the ICL of I, yielding the TICL5. Check that <bf(c1),…bf(cn)>=TICL

where ICIL = <c1, …cn> ( TICL check)

• Advantages– Technically Simple– As theory/grammar/implementation-

independent as possible– No need for prescribing syntactic structures– System-specific aspects are derived from the

NLP-system itself

• Will it work?– If there are not too many different idiom

patterns, and– Sufficient number of instantiations per idiom

pattern

• Various improvements possible– Parameterization

• Over local morphosyntactic differences (sg/pl; pos/dim; pos/comp/sup; ...)

– Abstraction• Especially for large fixed parts

– Weten waar Abraham de mosterd haalt

– Use of underspecified syntactic structures• Optional, of any kind

– Guidelines for selection of example sentences

Coverage #idioms #patterns#idioms #patterns50% 7383 28 449 2160% 8853 54 539 3670% 10304 140 628 5980% 11773 481 716 9885% 12509 908 760 13490% 13245 1644 804 17895% 13981 2380 849 223100% 14716 3116 893 267

SAID-Database Dutch Minidatabase

Conclusions

• SEQCI: – Technically simple– Highly theory, grammar, implementation independent

• Can reduce/share lexicon development efforts significantly• Candidate for a standard lexical representation for idioms• Extension to other types of MWEs looks promising• Initial experiments started• More testing required, with more NLP systems

– I can provide test data– Try it out in your own NLP system!

• Possible Problems/Objections?– Manual Part

• Pattern sentence does not yield a parse• Pattern sentence yields multiple parses• Pattern corresponds to 2 or more different structures in my system

– Automatic Part• Sentence does not yield a parse• Sentence yields multiple parses• Multiple skeys result for the same base form

SEQCI: Reference Parse

Rdecl[Rperf [Rsubst(j) [Rsent [Rsubst(i)

[RVP[$aV_00_ga, RPPpost [$s_prep1286700, VAR_i ] ]RNPdef [$aN_00_pijp] ],VAR_j ],RNP[$hij_PRON] ]]

SEQCI: Idiom Structure

• IPT: IPT: Delete Rdecl, Rperf, Rsubj(j), RNP[$hij_Pron]

• D-tree for vpid30 (simplified):Rsubst,i

[RVP [$aV_00_ga, RPPpost [$s_prep1286700, VAR_i ]

], RNPdef [$aN_00_pijp]]

< $aV_00_ga, $prep1286700, $aN_00_pijp >

SEQCI: Illustration

Manual Part, applied to `de pijp uitgaan’1. Reference Parse: See D-tree next slide2. IPT: Delete Rdecl, Rperf, Rsubj(j), RNP[$hij_Pron]3. ICIL: < $aV_00_ga, $aN_00_pijp, $prep1286700>4. ICT: 1 2 3 4 => 4 3 25. TICL = ICT(<de, pijp, uit, gaan>) = <gaan, pijp,

uit> = < Bf($aV_00_ga), Bf($aN_00_pijp), Bf($prep1286700) >

ICL: <de, pijp, uit, gaan>Must be turned into: < gaan, uit, pijp>

ICT: 1 2 3 4 => 4 3 2

TICL = ICT(ICL) = ICT(<de, pijp, uit, gaan>) = <gaan, uit, pijp> = < Bf($aV_00_ga), Bf($prep1286700),

Bf($aN_00_pijp)>

Syntactic Structure

Rdecl[Rperf [Rsubst(j) [Rsent [Rsubst(i) [RVP[$aV_00_ga,

RPPpost [$s_prep1286800, VAR_i ] ], RNPdef [$aN_00_boot] ],VAR_j ],RNP[$hij_PRON] ]]

Apply IPTRsubst,i

[$aV_00_ga,

RPPpost

[$s_prep1286800,

RNPdef [$aN_00_boot]

ICIL=< $aV_00_ga , $s_prep1286800, $aN_00_boot>)

ICT(ICL) =

ICT(<de, boot, in, gaan>)=

<gaan, in, boot>

TICL check

<bf($aV_00_ga), bf($s_prep1286800), bf($aN_00_boot) > =

TICL = <gaan, in, boot>

multiword expressions in nlp

Documents

multiword expressions: between lexicographyand nlp

holistic and compositional processing in multiword...

timothy baldwin - mse...1 multiword expressions structure of...

multiword expressions and lmf€¦ · multiword expressions...

multiword expressions facilitate, not hinder, understanding...

multilingual multiword expressions - arxivbengali:...

multiword expressions at the grammar lexicon interface

identi cation of verbal multiword expressions by berna...

multiword expressions

by: matthew shaw multiword verbs. the english language has a...

1 a few words to do with multiword...

1 a few words to do with multiword...

05 multiword expressions

multiword expressions: from theory to practicum ·...

representing multiword expressions in lexical and

ahmedreyad.comahmedreyad.com/uploader/upload/file/sotf...

icall and what it's good forbit = noun she = pronoun moved =...

multiword expressions: from theory to...

i will write about: investigating multiword expressions in

sejf - a grammatical lexicon of polish multiword...