Olga Pustylnikov, Alexander Mehler

Bielefeld University

A Unified Database of Dependency Treebanks

Integrating, Quantifying & Evaluating Dependency Data

SFB 673

Motivation

Exploring similarities among languages by means of syntactic treebanks

We collected a database covering 11 languages

Treebanks have been developed separately by different research projects

quantitative investigations on these treebanks -> the need for unification

Demands on the unified format of treebanks

(+) generic: able to represent as many treebanks as possible

(+) extensible to new treebanks

(+) complete: preserving all corpus-specific information

(+) transferable to other kinds of corpora

(–) complex: exhibiting minimal complexity

-> graph representations

Motivation

The Graph eXchange Language (GXL) is a graph model that represents corpora as graphs

[Figure: diagram with the labels XML, GXL, Wiki, Multimodal Data, Treebanks, Tools]

GXL (Holt et al., 2006)

GXL can be applied to any kind of corpus (see e.g. Mehler and Gleim (2005), Ferrer i Cancho et al. (2007), Pustylnikov and Mehler (2008)).

Treebanks -> eGXL

Agenda

1. eGXL

2. Data

3. Complexity Evaluation

4. Application

5. Conclusion

eGXL

2-level data model: a Types graph and a Sentences graph, connected by IDREF

<graph id="Types">
  <node id="POS" />
  <node id="t245" name="VERB" />
</graph>
<graph id="Sentences">
  <graph id="g8">
    <node id="s8_1" form="Detta" pos="t151" />
    <node id="s8_2" form="vill" pos="t245" />
    ...
    <rel>
      <relend direction="in" target="s8_2" />
      <relend direction="out" target="s8_1" />
    </rel>
    ...
  </graph>
  ...
</graph>
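To make the two-level model concrete, here is a minimal Python sketch of how the `pos` IDREF of a token could be resolved against the Types graph. The wrapping `<gxl>` root, the function name `resolve_pos`, and the `t151` entry with the name "PRON" are assumptions added only so that the fragment parses and resolves; the slides do not show them.

```python
import xml.etree.ElementTree as ET

# Fragment modelled on the slide. The <gxl> root and the t151/"PRON"
# type entry are assumptions, not part of the original example.
EGXL = """
<gxl>
  <graph id="Types">
    <node id="POS" />
    <node id="t151" name="PRON" />
    <node id="t245" name="VERB" />
  </graph>
  <graph id="Sentences">
    <graph id="g8">
      <node id="s8_1" form="Detta" pos="t151" />
      <node id="s8_2" form="vill" pos="t245" />
    </graph>
  </graph>
</gxl>
"""


def resolve_pos(xml_text):
    """Return (word form, POS name) pairs by following each pos IDREF."""
    root = ET.fromstring(xml_text)
    # Level 1: the Types graph maps type ids to tag names.
    types = root.find("graph[@id='Types']")
    names = {n.get("id"): n.get("name") for n in types.findall("node")}
    # Level 2: the Sentences graph holds one sub-graph per sentence.
    sentences = root.find("graph[@id='Sentences']")
    pairs = []
    for sentence in sentences.findall("graph"):
        for token in sentence.findall("node"):
            pairs.append((token.get("form"), names.get(token.get("pos"))))
    return pairs


tagged = resolve_pos(EGXL)
print(tagged)  # [('Detta', 'PRON'), ('vill', 'VERB')]
```

Keeping the tag inventory in its own graph means a token stores only a reference, so a treebank-specific tagset can change without touching the sentence structure.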

The eGXL Sentences-graph

[Figure: dependency tree of the example sentence, with "vill" at the top and "Detta", "bestämt", "jag", "bemöta", "." below it]

<graph id="Sentences">
  <graph id="g8">
    <node id="s8_1" form="Detta" pos="t151" />
    <node id="s8_2" form="vill" pos="t245" />
    ...
    <rel>
      <relend direction="in" target="s8_2" />
      <relend direction="out" target="s8_1" />
    </rel>
    ...
  </graph>
  ...
</graph>

Annotations:

<node>: each token of a treebank

form: word form

pos: an IDREF to the POS-node of the Types-graph

<rel>: a (syntactic) relation

<relend direction="in">: from (e.g. a head verb)

<relend direction="out">: to (e.g. a dependent argument)
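The relation encoding can be read back into head-dependent pairs with a few lines of Python. This is a sketch under the assumption, taken from the order of the annotations above, that the `direction="in"` end names the head ("from") and the `direction="out"` end names the dependent ("to"); the function name `dependency_edges` is hypothetical.

```python
import xml.etree.ElementTree as ET

# One sentence graph as shown on the slide (elided parts left out).
SENTENCE = """
<graph id="g8">
  <node id="s8_1" form="Detta" pos="t151" />
  <node id="s8_2" form="vill" pos="t245" />
  <rel>
    <relend direction="in" target="s8_2" />
    <relend direction="out" target="s8_1" />
  </rel>
</graph>
"""


def dependency_edges(sentence_xml):
    """Return (head form, dependent form) pairs for one sentence graph."""
    graph = ET.fromstring(sentence_xml)
    forms = {n.get("id"): n.get("form") for n in graph.findall("node")}
    edges = []
    for rel in graph.findall("rel"):
        # Assumption: "in" marks the head end, "out" the dependent end.
        ends = {e.get("direction"): e.get("target") for e in rel.findall("relend")}
        edges.append((forms[ends["in"]], forms[ends["out"]]))
    return edges


edges = dependency_edges(SENTENCE)
print(edges)  # [('vill', 'Detta')]
```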

11 Dependency Treebanks

7 different formats

Input vs. Output Formats

Examples from Dutch, Swedish, Italian treebanks

Unification is possible due to the separation of the core from the secondary parts

<graph id="Types">
  <node id="POS" />
  <node id="t245" name="VERB" />
</graph>
<graph id="Sentences">
  <graph id="g8">
    <node id="s8_1" form="Detta" pos="t151" />
    <node id="s8_2" form="vill" pos="t245" />
    ...
    <rel>
      <relend direction="in" target="s8_2" />
      <relend direction="out" target="s8_1" />
    </rel>
    ...
  </graph>
  ...
</graph>

Labels on the example: diversity (Types graph), commonality (Sentences graph)
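As an illustration of this separation, the following Python sketch converts a simple token table into an eGXL-like document: treebank-specific tags are collected in the Types graph, while tokens and relations go into the Sentences graph. The input row layout `(index, form, pos_tag, head_index)`, the function name `rows_to_egxl`, the `<gxl>` root, the generated type ids, and the "PRON" tag are all assumptions made for the example and are not the converters used for the database.

```python
import xml.etree.ElementTree as ET


def rows_to_egxl(sent_id, rows):
    """Build an eGXL-like tree from (index, form, pos_tag, head_index) rows.

    A head_index of 0 marks the root token (hypothetical input convention).
    """
    gxl = ET.Element("gxl")
    types = ET.SubElement(gxl, "graph", id="Types")
    ET.SubElement(types, "node", id="POS")
    sentences = ET.SubElement(gxl, "graph", id="Sentences")
    sentence = ET.SubElement(sentences, "graph", id=f"g{sent_id}")

    type_ids = {}
    for index, form, pos_tag, _head in rows:
        # Treebank-specific tags live only in the Types graph.
        if pos_tag not in type_ids:
            type_ids[pos_tag] = f"t{len(type_ids) + 1}"
            ET.SubElement(types, "node", id=type_ids[pos_tag], name=pos_tag)
        ET.SubElement(sentence, "node", id=f"s{sent_id}_{index}",
                      form=form, pos=type_ids[pos_tag])

    for index, _form, _pos, head in rows:
        if head == 0:
            continue  # the root has no incoming relation
        rel = ET.SubElement(sentence, "rel")
        ET.SubElement(rel, "relend", direction="in", target=f"s{sent_id}_{head}")
        ET.SubElement(rel, "relend", direction="out", target=f"s{sent_id}_{index}")
    return gxl


doc = rows_to_egxl(8, [(1, "Detta", "PRON", 2), (2, "vill", "VERB", 0)])
print(ET.tostring(doc, encoding="unicode"))
```

Only the small loop that fills the Types graph depends on the source tagset; the sentence structure is written the same way for every input format.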

The TreebankWiki

http://ariadne.coli.uni-bielefeld.de/wikis/treebankwiki/

Complexity of eGXL

Logical Scaling Factor (LSF): the number of logical elements (e.g. XML elements) required to represent a treebank unit (e.g. a word form, a POS tag, etc.)

[Figure: chart with the panels "node" and "rel", each comparing "eGXL" and "other"]
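One way to read the LSF definition is to count, for each unit, the XML elements needed to encode it. The Python sketch below does this for the example sentence; it assumes that every XML element counts as one logical element and that attributes are not counted, which may differ from the exact counting used in the evaluation. The function name `scaling_factors` is hypothetical.

```python
import xml.etree.ElementTree as ET
from statistics import mean

# One sentence graph as shown on the earlier slides (elided parts left out).
SENTENCE = """
<graph id="g8">
  <node id="s8_1" form="Detta" pos="t151" />
  <node id="s8_2" form="vill" pos="t245" />
  <rel>
    <relend direction="in" target="s8_2" />
    <relend direction="out" target="s8_1" />
  </rel>
</graph>
"""


def scaling_factors(sentence_xml, unit_tags=("node", "rel")):
    """Average number of XML elements used per unit, for each unit tag.

    Assumption: the unit's own element and all of its descendants each
    count as one logical element; attributes are ignored.
    """
    graph = ET.fromstring(sentence_xml)
    factors = {}
    for tag in unit_tags:
        units = graph.findall(tag)
        # Element.iter() yields the element itself plus all descendants.
        factors[tag] = mean(sum(1 for _ in unit.iter()) for unit in units)
    return factors


lsf = scaling_factors(SENTENCE)
print(lsf)
```

Under these assumptions a token costs one element, because the word form and POS are attributes, while a relation costs three: the `rel` element and its two `relend` children.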

DTDB

Conclusions

a database covering 11 languages

eGXL: a generic XML graph model adapted to syntactic treebanks

use of treebanks within a single application (Ariadne)

Thank you for your attention!