INRIA
What to do with a parser? Learn!
Éric de la Clergerie
INRIA Paris & University Paris-Diderot
http://alpage.inria.fr
NLP Meetup
Paris, November 23rd 2016
INRIA Éric de la Clergerie What to do with a parser ? Learn ! 23/11/2016 1 / 22
FRMG: a large coverage French grammar/parser
My main research topics:
- parsing technologies (symbolic, statistical, hybrid)
- FRMG, a large coverage French (meta)grammar → parser
Several output annotation schemas: richer native DepXML, but also PASSAGE, FTB/CONLL, Universal Dependencies, ...
What can be done with parsing ?
Since 2004, FRMG has become an efficient, accurate, and large coverage parser (on the journalistic French TreeBank [FTB]: LAS ∼ 88%, coverage > 97%)
but 2 main questions:
What to do with a parser?
- Information Extraction (http://passage.inria.fr/SAPIENS)
  Citation extraction from AFP news about the 2007 Presidential Campaign
- Question Answering
- ...
- Knowledge Acquisition (knowledge bottleneck)
How to continue to improve parsing?
→ knowledge injection for syntactic disambiguation
  tremblement de terre de forte magnitude (earthquake of high magnitude)
→ virtuous circle between language and knowledge
Knowledge Acquisition experiments
Two main directions explored during FUI SCRIBO (circa 2010)
Concepts
- Terminology extraction
  garde à vue
  implant chirurgical non actif
  [implant/nc]GN [chirurgical/adj]GA [non/adv]GR [actif/adj]GA
- Semantic networks
- Word clustering (synset)
- Ontological relations (e.g. hyperonymy)
  warship: destroyer, aviso
Events
- event-based verb clustering
  /transfer/ donner, offrir, céder
  /communication act/ annoncer, indiquer, affirmer
- verb-noun pairs
  déclarer/déclaration; identifier/identification; commencer/commencement/début
- relations between named entities
  appartenance(PERS,ORG)
Step 0 – Knowledge sources: parsed corpora
A large heterogeneous “general” corpus
Corpora          #Msent  #Mwords  Description
Wikipedia (fr)     18.0    178.9  504K encyclopedic pages
Wikisource (fr)     4.4     64.0  12.8K literary texts
EstRepublicain     10.5    144.9  journalistic
JRC                 3.5     66.5  European directives
EuroParl            1.6     41.5  parliamentary debates
AFP                14.0    248.3  400K news
Total ALL          52.0    744.2
But also smaller specialized corpora (some from a legal publisher)
Corpora   #Msentences  #Mwords
fiscal            7.2    145.2
social            6.8    127.5
civil             2.6     40.9
business          7.2    133.8
And several others: botanical corpus, medical, automobile, travel stories, ...
From language to meanings
“Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!”

Il était grilheure; les slictueux toves
Gyraient sur l’alloinde et vriblaient:
Tout flivoreux allaient les borogoves;
Les verchons fourgus bourniflaient.

Paul s’est cassé la binti. (Paul broke his binti.)
Sa fracture à la binti a été correctement réduite. (His binti fracture was properly set.)
Il a des douleurs dans la binti. (He has pain in his binti.)
Grouping words: distributional approach
Harris distributional hypothesis
Meanings of words are (largely) determined by their distributional patterns (Harris 1968)
You shall know a word by the company it keeps (Firth 1957)
1 attach to each word a (weighted) vector of contexts, dependency-based ones in our case
2 exploit these vectors to measure the similarity of pairs of words
3 exploit word similarity to organize/group words
Many variants on these 3 points (Lin, Pantel, Pedersen, Bourigault, ...)
But often: black box, no explanations, hard classes (no polysemy), ...
⇒ looking for a more flexible approach
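To make points 2 and 3 concrete, here is a minimal sketch of the kind of simple baseline the slide contrasts itself with: cosine similarity over weighted context vectors, followed by a greedy grouping into hard classes. The vectors, the threshold, and the function names are invented for illustration and are not taken from the talk.

```python
from math import sqrt

# Hypothetical weighted context vectors (point 1), written by hand for this example.
VECTORS = {
    "chaise":   {"<asseoir sur •>": 3.0, "<tomber sur •>": 1.0, "<• modifieur vide>": 1.0},
    "tabouret": {"<asseoir sur •>": 2.0, "<tomber sur •>": 1.0},
    "pomme":    {"<manger cod •>": 4.0, "<• modifieur rouge>": 2.0},
    "poire":    {"<manger cod •>": 3.0},
}

def cosine(u, v):
    """Point 2: cosine similarity between two sparse context vectors."""
    dot = sum(w * v[c] for c, w in u.items() if c in v)
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def group(vectors, threshold=0.5):
    """Point 3: greedy grouping; a word joins the first group holding a close word."""
    groups = []
    for word in vectors:
        for members in groups:
            if any(cosine(vectors[word], vectors[m]) >= threshold for m in members):
                members.append(word)
                break
        else:
            groups.append([word])
    return groups

print(group(VECTORS))
```

Each word ends up in exactly one group, so a polysemous word cannot belong to two classes, which is the limitation named above.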
Step 1 – Collecting and counting dependencies
<governor>     <rel>      <governee>      <freq>
----------     --------   ----------      -------
chaise_nc      et         table_nc        235
asseoir_v      sur        chaise_nc       227
chaise_nc      modifieur  long_adj        168
chaise_nc      de=        poste_nc        115
tomber_v       sur        chaise_nc       103
chaise_nc      modifieur  musical_adj     102
se_asseoir_v   sur        chaise_nc       93
prendre_v      cod        chaise_nc       87
chaise_nc      modifieur  électrique_adj  82
chaise_nc      modifieur  vide_adj        80
chaise_nc      à=         porteur_nc      80
dossier_nc     de         chaise_nc       78
avoir_v        cod        chaise_nc       71
table_nc       et         chaise_nc       62
chaise_nc      de=        paille_nc       56
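A frequency table of dependencies like the one for chaise_nc can be produced with a plain counter over parser output. The sketch below assumes a hypothetical input format, one list of (governor, relation, governee) triples per sentence; it does not reflect FRMG's actual output API.

```python
from collections import Counter

# Hypothetical parser output: one list of (governor, relation, governee) per sentence.
PARSED_SENTENCES = [
    [("asseoir_v", "sur", "chaise_nc"), ("chaise_nc", "modifieur", "vide_adj")],
    [("asseoir_v", "sur", "chaise_nc"), ("chaise_nc", "et", "table_nc")],
    [("prendre_v", "cod", "chaise_nc")],
]

def count_dependencies(sentences):
    """Count how often each dependency triple occurs across all sentences."""
    counts = Counter()
    for triples in sentences:
        counts.update(triples)
    return counts

def format_table(counts, word, limit=10):
    """List the most frequent triples involving `word`, as aligned text rows."""
    rows = [(t, n) for t, n in counts.most_common() if word in (t[0], t[2])]
    return [f"{g:<14}{r:<11}{d:<16}{n}" for (g, r, d), n in rows[:limit]]

counts = count_dependencies(PARSED_SENTENCES)
for line in format_table(counts, "chaise_nc"):
    print(line)
```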
Preprocessing dependencies
Abstracting and completing PASSAGE dependencies (at collect time):
- rectification of passives (surface subject → deep object)
- addition of se for pronominal verbs
- direct relation between an attribute and a subject
  (apple,att,red) in "the apple is red"
- abstraction of verbs in sentential arguments
  (can,object,eat) → (can,object,*sentence*)
- distribution over coordinated elements
  he takes an apple and a beer → (take,object,apple) & (take,object,beer)
- addition of potential (ambiguous) PP attachments
  terre_nc de= magnitude_nc 344
  tremblement_nc de=* magnitude_nc 357
- injection of candidate terms
  qualité_nc de= président_du_conseil 189
  tremblement_de_terre de=* magnitude_nc
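Two of the rewriting rules above can be sketched as small functions over triples. The relation name "coord", the function names, and the explicit set of verbs are assumptions made for this example; the actual collection step works on PASSAGE dependencies and is certainly richer.

```python
def distribute_coordination(triples, coord_rel="coord"):
    """For each (x, coord, y), copy every dependency (g, r, x) to (g, r, y)."""
    triples = list(triples)
    added = []
    for first, rel, second in triples:
        if rel != coord_rel:
            continue
        for gov, dep_rel, dep in triples:
            if dep == first and dep_rel != coord_rel:
                added.append((gov, dep_rel, second))
    return triples + added

def abstract_sentential(triples, verbs):
    """Replace a verbal governee of an object relation by the *sentence* placeholder."""
    return [
        (gov, rel, "*sentence*") if rel == "object" and dep in verbs else (gov, rel, dep)
        for gov, rel, dep in triples
    ]

# "he takes an apple and a beer"
raw = [("take", "subject", "he"), ("take", "object", "apple"), ("apple", "coord", "beer")]
print(distribute_coordination(raw))
print(abstract_sentential([("can", "object", "eat")], verbs={"eat"}))
```

The sketch keeps the coordination edge itself in the output, since coordinations such as chaise_nc et table_nc also appear in the counted dependencies.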
From dependencies to contexts
A dependency (to_sit, on, chair) provides
a syntactic context <to_sit on •> for word chair
and, symmetrically, <• on chair> for to_sit

Corpora     #dep (millions)  #distinct forms (thousands)  #distinct contexts (millions)
CPL                     170                         1149                              4
AFP                      93                          378                              2
Total ALL               263                         1366                              5
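The mapping from one dependency to its two symmetric contexts, and the counts of distinct forms and contexts, can be sketched as follows. The string notation for contexts mirrors the slide; the function names and the toy counts are hypothetical.

```python
from collections import Counter

def contexts_of(governor, relation, governee):
    """Return the two (word, context) pairs provided by one dependency."""
    return [
        (governee, f"<{governor} {relation} •>"),
        (governor, f"<• {relation} {governee}>"),
    ]

def context_statistics(counted_dependencies):
    """Accumulate (word, context) weights and count distinct forms and contexts."""
    weights = Counter()
    for (gov, rel, dep), freq in counted_dependencies.items():
        for word, context in contexts_of(gov, rel, dep):
            weights[(word, context)] += freq
    forms = {word for word, _ in weights}
    contexts = {context for _, context in weights}
    return weights, len(forms), len(contexts)

# Toy counts, invented for the example.
deps = {("to_sit", "on", "chair"): 3, ("to_sit", "on", "stool"): 1}
weights, n_forms, n_contexts = context_statistics(deps)
print(n_forms, n_contexts)
```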
Step 2 – Clustering algorithm
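Inspired by Markov clustering [MCL, van Dongen]: in a weighted graph connecting words to contexts, we try
- to reinforce high densities of short paths
- to weaken long paths

[Figure: graph connecting words w_i, w_j and contexts c_a, c_b, with edge weights wc_{i,a}, cw_{a,i}, cc_{a,b}, wc_{j,b}, cw_{b,j}, ww_{i,j}]

$$ ww_{i,j} = \frac{1}{Z_i} \Big( \sum_{a,b} wc_{i,a} \, cc_{a,b} \, wc_{j,b} \Big)^{\alpha} $$

$$ cc_{a,b} = \frac{1}{Z_a} \Big( \sum_{i,j} cw_{a,i} \, ww_{i,j} \, cw_{b,j} \Big)^{\alpha} $$

with inflation α > 1 (default: 2) and normalization 1/Z
⇒ strengthen high coefficients, lower weak ones!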
Matrix formulation
Compact matrix formulation:

$$ W = \Gamma_\alpha(F^t C F) \qquad C = \Gamma_\alpha(G^t W G) $$

with the inflation and normalization operator Γ_α, where:
- W = (ww_{i,j}) and C = (cc_{a,b}) are the similarity matrices to be computed
- F = (wc_{i,a}) and G = (cw_{a,i}) are parameter matrices
  - wc_{i,a}: weight of context c_a for word w_i
  - cw_{a,i}: weight of word w_i for context c_a
Recursive formulation → iterative fix-point algorithm starting from initial matrix W^(0)
Many extensions: bonus/malus, transfer words ↔ contexts (chair ∼ stool → <• on chair> ∼ <• on stool>), ...
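The fix-point iteration can be sketched in a few lines of plain Python. Several choices here are assumptions, not statements about the actual implementation: W^(0) is taken to be the identity, F and G are obtained by row-normalizing a toy count matrix, Γ_α is read as an element-wise power followed by row normalization, and a fixed number of iterations replaces a convergence test. With F stored as a words × contexts array, the product is written F C Fᵗ in the code; the slide's F^t C F corresponds to the transposed storage convention.

```python
def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def gamma(M, alpha=2.0):
    """Inflation (element-wise power alpha) followed by row normalization (1/Z)."""
    result = []
    for row in M:
        inflated = [x ** alpha for x in row]
        z = sum(inflated) or 1.0  # keep all-zero rows as they are
        result.append([x / z for x in inflated])
    return result

def fix_point(F, G, alpha=2.0, iterations=20):
    """Alternate the C and W updates, starting from W(0) = identity (assumption)."""
    n = len(F)
    W = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    C = []
    for _ in range(iterations):
        C = gamma(matmul(matmul(G, W), transpose(G)), alpha)
        W = gamma(matmul(matmul(F, C), transpose(F)), alpha)
    return W, C

# Toy word x context counts: rows chair, stool, apple; columns are three contexts.
COUNTS = [[3, 1, 0], [2, 1, 0], [0, 0, 4]]
F = gamma(COUNTS, alpha=1.0)             # wc: context weights per word
G = gamma(transpose(COUNTS), alpha=1.0)  # cw: word weights per context
W, C = fix_point(F, G)
print([round(x, 3) for x in W[0]])
```

On this toy input, chair and stool share contexts and keep a non-zero similarity, while chair and apple share none and stay at zero.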
What’s the usage of chairs?
The algo provides (weighted) explaining contexts for close words
Close word pairs (columns):
chaise   chaise    banquette  banquette  banquette
divan    tabouret  divan      canapé     chaise
Explaining contexts (rows):
se asseoir sur [•]
asseoir sur [•]
allonger sur [•]
dormir sur [•]
tomber sur [•]
monter sur [•]
place sur [•]
grimper sur [•]
installer sur [•]
poser sur [•]
Visualisation: so many bones !
Graph with about 40K edges
Visualization with TULIP (http://tulip.labri.fr/), layout BubbleTree
Others on http://alpage.inria.fr/~clerger/wnet/wnet.html
Step 3 – Validation with LIBELLEX interface
Need for local views, browsing, and validation
⇒ collaborative Web interface
http://alpage.inria.fr/Lbx (guest/guest)
Note: collaboration with startup Lingua & Machina
Topological structures
Coarse-grained view already useful to detect some topological structures:
- strongly connected bushes: very close to semantic classes
- threads: progressive sense shifts
- star-like structures: a center with many satellites
  sometimes pertinent, often not!
- some polysemous words at the junctions between bushes
  char (carriage) and chariot: <• modifieur atteler>, <promenade en •>
  char (tank) and tank: <• de combat>, <régiment de •>
Some topological classes
The bushes may be used to extract classes ⇒ 4000 classes (ALL)
<79> (a cluster of various kinds of dogs) sulky malinois fox-terrier setter cocker colley chiot fox labrador ratier griffon caniche teckel épagneul
<80> (a cluster of various kinds of soldiers and military groups) arrière-garde canonnier cavalerie carabinier tirailleur hussard panzer voltigeur blindé grenadier cuirassier avant-garde zouave lancier
<83> (a cluster of various kinds of diseases) pneumonie paludisme diphtérie pneumopathie variole dysenterie malaria botulisme poliomyélite septicémie varicelle polio rougeole méningite
Step 4 – Injecting and “reasoning”
Injecting knowledge in FRMG (similarity + contexts)
il mange une tartelette maison à la quetsche. (he eats a homemade plum tartlet.)
- tartelette close to tarte
- quetsche a kind of fruit
- aux_fruits frequent context for tarte
⇒ tartelette à la quetsche
[Figure: dependency tree for "il mange une tartelette maison à la quetsche ." with edge labels subject, det, object, N, N2, det, N2, S]
[Figure: plot comparing "FRMG raw" and "FRMG + knowledge" on ftb61, ftb62, ftb63, frwiki, europarl, emeatest; y-axis ticks from 78 to 84]
Moving to Word Embeddings (recent)
Recent buzz on word embeddings (“low-dimension” dense word vectors)
word2vec [Mikolov] and GloVe [Pennington, Socher, Manning]
≡ distributional-based approaches [Goldberg]
DepGlove: minimization of objective function J → vectors w_i
extended to syntactic dependencies r in subject, object, ... with matrices M_r

$$ J = \sum_{i,r,j} f(X_{irj}) \, \big( w_i^T M_r w_j + b_i + b_j + b_r - \log X_{irj} \big)^2 $$

$$ f(X_{irj}) = \begin{cases} \left( \dfrac{X_{irj}}{x_{max}} \right)^{\alpha} & \text{if } X_{irj} < x_{max} \\ 1 & \text{otherwise} \end{cases} $$

r extended to longer syntactic paths between words
- 2 siblings of the same governor: cat subject+eat+object mouse
- a grandfather with a grandson: cat subject/of eat, in "a large majority of cats does not eat mice"
Note: w_i^T M_r is similar to a vector for a syntactic context (word + relation)
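The weighting function f and the contribution of a single (i, r, j) triple to J can be sketched as below. This assumes the squared residual of the standard GloVe least-squares objective; the values x_max = 100 and α = 0.75 are the defaults from the GloVe paper, not values reported for DepGlove, and all names are hypothetical.

```python
from math import e, log

def weight(x, x_max=100.0, alpha=0.75):
    """f(X): (X / x_max)^alpha below x_max, 1 otherwise."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def bilinear(w_i, M_r, w_j):
    """Compute w_i^T M_r w_j; the intermediate w_i^T M_r acts as a context vector."""
    context = [
        sum(w_i[k] * M_r[k][l] for k in range(len(w_i)))
        for l in range(len(M_r[0]))
    ]
    return sum(c * x for c, x in zip(context, w_j))

def triple_loss(x_irj, w_i, M_r, w_j, b_i=0.0, b_j=0.0, b_r=0.0):
    """Weighted squared residual for one co-occurrence count X_irj > 0."""
    residual = bilinear(w_i, M_r, w_j) + b_i + b_j + b_r - log(x_irj)
    return weight(x_irj) * residual ** 2

IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
# A triple seen e times whose score is exactly 1 = log(e) contributes (almost) nothing.
print(triple_loss(e, [1.0, 0.0], IDENTITY, [1.0, 0.0]))
```

Rare triples get a weight below 1, so they pull less on the vectors than frequent ones, and the weight is capped at 1 so very frequent triples do not dominate.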
Evaluations (TOEFL)
Random generation of TOEFL-like tests from French WordNet (FWN) synsets
toutefois néanmoins complètement progressivement sensiblement
exploit prouesse offset plie bit
MCL and DepGlove not evaluated exactly the same way
(graph shortest paths for MCL, minimal cosine for DepGlove)
Many parameters to explore

categories  passage/MCL  passage/depglove  depxml/depglove  d+path/depglove
all               51.00             67.68            71.52            76.34
nouns             94.00             68.36            72.01            77.00

d = 200, w_min = 20
Influence of the algorithm: MCL << DepGlove on recall, but interest of MCL for precision
Influence of annotation schema: Passage < DepXML
Influence of collected data: dependencies < paths
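A TOEFL-like test can be scored as in the sketch below. It assumes that each item lists the query word first, the expected synonym second, and the distractors after it; that the answer is the candidate with the highest cosine similarity; and that precision is computed over answered items while recall is computed over all items. The vectors are invented, and the actual evaluation protocol may differ.

```python
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def answer(item, vectors):
    """Pick the candidate closest to the query, or None when the query is unknown."""
    query, *candidates = item
    if query not in vectors:
        return None
    scored = [(cosine(vectors[query], vectors[c]), c) for c in candidates if c in vectors]
    return max(scored)[1] if scored else None

def evaluate(items, vectors):
    """Return (precision, recall) in percent; item[1] is the expected synonym."""
    answers = [(answer(item, vectors), item[1]) for item in items]
    answered = [(got, gold) for got, gold in answers if got is not None]
    correct = sum(1 for got, gold in answered if got == gold)
    precision = 100.0 * correct / len(answered) if answered else 0.0
    recall = 100.0 * correct / len(items) if items else 0.0
    return precision, recall

ITEMS = [
    ("exploit", "prouesse", "offset", "plie", "bit"),
    ("toutefois", "néanmoins", "complètement", "progressivement", "sensiblement"),
]
# Hypothetical 2-dimensional vectors; the second item's words are deliberately missing.
VECTORS = {
    "exploit": [0.9, 0.1], "prouesse": [0.8, 0.2],
    "offset": [0.1, 0.9], "plie": [0.0, 1.0], "bit": [0.2, 0.7],
}
print(evaluate(ITEMS, VECTORS))
```

A method that answers few items but answers them correctly scores high on precision and low on recall, which is the pattern noted for MCL.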
Playing with FRMG
Long-standing effort to ease installation and use of Alpage’s tools
→ FRMGWIKI, a linguistic wiki with extended functionalities
http://alpage.inria.fr/frmgwiki
- exploring and using FRMG
- discussing syntactic phenomena with sample sentences
- access to a corpus processing service
- links to LIBELLEX and Word Vectors
  http://alpage.inria.fr/Lbx
  http://alpage.inria.fr/depglove/
process.pl
- large parsed corpora available (Wikipédia, Wikisource, ...)
- corpus indexing and querying
Thank you
Questions welcome