nederbooms...•october, 2010 – february, 2012 nederbooms •exploitation of dutch treebanks for...

45
NEDERBOOMS Treebank Mining for Data-based Linguistics Liesbeth Augustinus Vincent Vandeghinste Ineke Schuurman Frank Van Eynde Digital Humanities Summer School - September 20, 2013

Upload: others

Post on 19-Sep-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

NEDERBOOMS Treebank Mining for Data-based Linguistics

Liesbeth Augustinus Vincent Vandeghinste

Ineke Schuurman Frank Van Eynde

Digital Humanities Summer School - September 20, 2013

Page 2: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

NEDERBOOMS • Exploitation of Dutch treebanks for research in linguistics

• CLARIN-VL project

• October, 2010 – February, 2012

Page 3: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

NEDERBOOMS • Exploitation of Dutch treebanks for research in linguistics

• CLARIN-VL project

• October, 2010 – February, 2012

• Goals:

o User-friendly tools

o Access to large data files

o Fast and accurate

Page 4: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

GrETEL • Greedy Extraction of Trees for Empirical Linguistics

• Query engine for treebanks

Page 5: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

GrETEL • Greedy Extraction of Trees for Empirical Linguistics

• Query engine for treebanks

• Treebank = syntactically annotated corpus e.g. Penn Treebank (English), TüBa (German), LASSY, CGN (Dutch)

Page 6: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

TREEBANKS

CGN treebank LASSY small Spoken Dutch Written Dutch

Stylistic & regional differences

conversations vs read texts NL vs VL

Stylistic differences

Wikipedia vs legal texts

± 1M tokens ± 1M tokens

130k sentences 65k sentences

Manually corrected Manually corrected

Page 7: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

GrETEL • Greedy Extraction of Trees for Empirical Linguistics

• Query engine for treebanks

• Treebank = syntactically annotated corpus e.g. Penn Treebank (English), TüBa (German), LASSY, CGN (Dutch)

• Parser e.g. Alpino (Van Noord 2006)

Page 8: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

ALPINO PARSER

Dit is een zin. >> ALPINO parser >> “This is a sentence.”

Page 9: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

ALPINO PARSER

Dit is een zin. >> ALPINO parser >> “This is a sentence.”

XML trees

Query language: XPath

Page 10: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

XPATH

//node[@cat="smain" and

node[@rel="su" and

@pt="vnw" and @lemma="dit"]

and node[@rel="hd" and

@pt="ww" and @lemma="zijn"]

and node[@rel="predc" and

@cat="np" and

node[@rel="det" and

@pt="lid" and @lemma="een"]

and node[@rel="hd" and

@pt="n" and @lemma="zin"]]]

Page 11: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

XPATH

//node[@cat="smain" and

node[@rel="su" and

@pt="vnw" and @lemma="dit"]

and node[@rel="hd" and

@pt="ww" and @lemma="zijn"]

and node[@rel="predc" and

@cat="np" and

node[@rel="det" and

@pt="lid" and @lemma="een"]

and node[@rel="hd" and

@pt="n" and @lemma="zin"]]]

Page 12: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

XPATH

//node[@cat="smain" and

node[@rel="su" and

@pt="vnw" and @lemma="dit"]

and node[@rel="hd" and

@pt="ww" and @lemma="zijn"]

and node[@rel="predc" and

@cat="np" and

node[@rel="det" and

@pt="lid" and @lemma="een"]

and node[@rel="hd" and

@pt="n" and @lemma="zin"]]]

Page 13: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

XPATH

Page 14: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

GrETEL • Greedy Extraction of Trees for Empirical Linguistics

• Query treebanks by example

Page 15: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

GrETEL • Greedy Extraction of Trees for Empirical Linguistics

• Query treebanks by example

No or limited knowledge of data structures and/or formal query languages needed

Page 16: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

GrETEL 1. Example sentence

2. Indicate relevant items of the sentence

3. (Adapt XPath) Select treebank

4. Inspect results

• Parser (Alpino)

• Automatically generate XPath expression

• Present results

the user

Page 17: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

OUTLINE • GrETEL in a nutshell

• GrETEL demo

o Case study

o Search options

• Conclusions and future work

Page 18: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

CASE STUDY Number agreement in quantifying noun constructions

Een + noun + plural noun + single/plural verb form

o Een aantal treinen heeft vertraging. A number trains has-SG delay

o Een aantal treinen hebben vertraging. A number trains have-PL delay

“A number of trains are running late.”

Page 19: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

GrETEL ONLINE

Page 20: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

INPUT

Page 21: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

ANNOTATION MATRIX

Page 22: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

ANNOTATION GUIDELINES

Page 23: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

XPATH GENERATOR

Page 24: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

TREEBANK SELECTION

Page 25: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

RESULTS aantal + plural noun + single/plural verb form

o Een aantal treinen heeft vertraging. A number trains has-SG delay

18 matches

“A number of trains are running late.”

Page 26: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

RESULTS: table

Page 27: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

RESULTS: data

Page 28: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

RESULTS: data

“greedy” search

Page 29: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

RESULTS: trees

Page 30: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

RESULTS aantal + plural noun + single/plural verb form

o Een aantal treinen heeft vertraging. A number trains has-SG delay

18 matches

o Een aantal treinen hebben vertraging. A number trains have-PL delay 30 matches

“A number of trains are running late.”

Page 31: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

OUTLINE • GrETEL in a nutshell

• GrETEL demo

o Case study

o Search options

• Conclusions and future work

Page 32: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

SEARCH OPTIONS

Below annotation matrix

Page 33: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

SEARCH OPTIONS PP-over-V

o V + PP

o … dat hij opstond met een kater. ... that he woke-up with a hangover

o PP + V

o … dat hij met een kater opstond. … that he with a hangover woke-up

“... that he woke up with a hangover.”

Page 34: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

SEARCH OPTIONS PP-over-V

o V + PP

o … dat hij opstond met een kater. ... that he woke-up with a hangover

2,895 matches in 2,769 sentences

But: results include PP + V as well!

Page 35: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

SEARCH OPTIONS PP-over-V

o V + PP + word order option

o … dat hij opstond met een kater. ... that he woke-up with a hangover

789 matches in 777 sentences Results only include V + PP

Page 36: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

SEARCH OPTIONS

Page 37: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

SEARCH OPTIONS

Page 38: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

SEARCH OPTIONS

Page 39: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

SEARCH OPTIONS

Page 40: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

SEARCH OPTIONS

Page 41: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

SEARCH OPTIONS

Page 42: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

OUTLINE • GrETEL in a nutshell

• GrETEL demo

o Case study

o Search options

• Conclusions and future work

Page 43: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

CONCLUSIONS • GrETEL: search engine for Dutch treebanks

• Input = natural language example

• Output = sample of similar sentences

• Syntactic concordancer

• Available online (via Mozilla Firefox)

• No installation required

Page 44: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

FUTURE WORK • GrETEL 2.0

o Query very large treebanks

o Improve user interface

o Include SoNaR corpus (ca 500M tokens, 42M sentences)

• AfriBooms

o GrETEL for Afrikaans

o Include other treebank formats

Page 45: NEDERBOOMS...•October, 2010 – February, 2012 NEDERBOOMS •Exploitation of Dutch treebanks for research in linguistics •CLARIN-VL project •October, 2010 – February, 2012

Thanks for your attention!

Try it yourself at http://nederbooms.ccl.kuleuven.be/eng/gretel