http://ufal.mff.cuni.cz/pdt2.0
PDT 2.0
Prague Dependency Treebank 2.0
Zdeněk Žabokrtský
Dept. of Formal and Applied Linguistics
Charles University, [email protected]
Outline of the talk
Introduction
Layers of annotation
Data
Software tools
Documentation
Tour through the CD-ROM
Final remarks
Introduction
treebank: a syntactically annotated corpus (a “bank” of syntactic trees)
Prague Dependency Treebank:
  a collection of linguistically annotated Czech texts (2 MW), software tools, and documentation
  morphological, surface-syntactic, and deep-syntactic dependency-oriented sentence analyses
About Czech
western group of Slavic languages
rich inflectional morphology
(relatively) free word order language
Latin alphabet extended with accents
(příliš žluťoučký kůň)
spoken in the Czech Republic
10+ million speakers
Historical background and development of PDT
1920’s – Prague Linguistic Circle founded
1930-50’s – influential dependency-oriented works of Lucien Tesnière and Vladimír Šmilauer
mid 1960’s – Petr Sgall’s Functional Generative Description
1992 – Penn Treebank
1994 – Czech National Corpus
1995 – PDT started
1998 – PDT 0.5 pre-release
2001 – PDT 1.0 released by LDC
2006 – PDT 2.0 to be released by LDC
Layered annotation scheme
tectogrammatical layer: deep-syntactic dependency tree
analytical layer: surface-syntactic dependency tree
morphological layer: morphological lemma and tag associated with each token
word layer: original text, segmented on word boundaries
He would have gone intoforest.
M-layer
sentence represented as a sequence of tokens
each token lemmatized and tagged (attributes lemma and tag)
15-character positional morphological tag:
  1. (main) POS
  2. detailed POS
  3. gender
  4. number
  5. case
  ...
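A positional tag of this kind can be unpacked with a few lines of code. The following is a minimal Python sketch that names only the five positions listed on the slide and reports the remaining ten by number. The function name `decode_tag` and the sample tag `NNMS1-----A----` are illustrative assumptions, not part of the PDT tools.

```python
# Names of the first five tag positions, as listed on the slide.
# The remaining positions are deliberately left unnamed in this sketch.
POSITION_NAMES = ["POS", "detailed POS", "gender", "number", "case"]
TAG_LENGTH = 15


def decode_tag(tag):
    """Split a 15-character positional tag into {position name: character}."""
    if len(tag) != TAG_LENGTH:
        raise ValueError(f"expected {TAG_LENGTH} characters, got {len(tag)}")
    decoded = {}
    for index, char in enumerate(tag):
        if index < len(POSITION_NAMES):
            name = POSITION_NAMES[index]
        else:
            name = f"position {index + 1}"
        decoded[name] = char
    return decoded


if __name__ == "__main__":
    # Illustrative tag; "-" is assumed to mark a position with no value.
    for name, value in decode_tag("NNMS1-----A----").items():
        print(f"{name}: {value}")
```

Because every category has a fixed position, a single character comparison is enough to filter tokens by, say, case or number.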
A-layer (1) - nodes and edges
sentence represented as a rooted ordered tree with labeled nodes and edges
edges labeled with analytical functions:
  dependency relations (Sb, Obj, Adv, Atr)
  non-dependency relations (Coord)
  auxiliary (functional) nodes (AuxP for prepositions, AuxC for subordinating conjunctions...)
special treatment of coordination constructions
A-layer (2) - coordination
intricate interplay between dependency and coordination relations
PDT solution: both conjuncts (members of coordination) and shared modifiers are attached below the coordinating conjunction (but distinguished from each other by a special attribute is_member)
direct parent vs. effective parent (conjuncts marked M in the tree figure)
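The difference between the direct parent and the effective parent can be sketched in code. The Python example below is a simplified illustration under stated assumptions: the `ANode` class and the `effective_parents` function are hypothetical names, only coordination (afun `Coord`) is handled, and other complications such as apposition or auxiliary nodes are ignored. It is not the algorithm used by the PDT tools.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class ANode:
    form: str
    afun: str                  # analytical function, e.g. "Sb", "Pred", "Coord"
    is_member: bool = False    # True for conjuncts of a coordination
    parent: Optional["ANode"] = field(default=None, repr=False)
    children: List["ANode"] = field(default_factory=list, repr=False)

    def add(self, child: "ANode") -> "ANode":
        """Attach child below this node and return it."""
        child.parent = self
        self.children.append(child)
        return child


def conjuncts(coord: ANode) -> List[ANode]:
    """Members of a coordination, expanding nested coordinations."""
    members = []
    for child in coord.children:
        if child.is_member:
            if child.afun == "Coord":
                members.extend(conjuncts(child))
            else:
                members.append(child)
    return members


def effective_parents(node: ANode) -> List[ANode]:
    """Linguistic parents of node, looking through coordination nodes."""
    current = node
    # A conjunct depends on whatever its coordination depends on.
    while current.is_member and current.parent is not None:
        current = current.parent
    parent = current.parent
    if parent is None:
        return []
    if parent.afun == "Coord":
        # A shared modifier depends on every conjunct.
        return conjuncts(parent)
    return [parent]


if __name__ == "__main__":
    # "John sleeps and snores": "John" is a shared modifier below "and".
    top = ANode("and", "Coord")
    top.add(ANode("sleeps", "Pred", is_member=True))
    top.add(ANode("snores", "Pred", is_member=True))
    john = top.add(ANode("John", "Sb"))
    print(john.parent.form, [p.form for p in effective_parents(john)])
```

In the demo, the direct parent of "John" is the conjunction, while its effective parents are both verbs, which is the distinction the is_member attribute makes recoverable.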
T-layer (1) - nodes
t-nodes:
  complex typed feature structures
  nodes represent autosemantic words
  function words do not have nodes of their own
  artificially added nodes (e.g. for pro-drops)
node attributes:
  tectogrammatical lemma
  dependency relation – functor and subfunctor
  grammateme attributes (representing morphological meanings)
  attributes for topic-focus articulation
  attributes for coreference relations
T-layer (2) - dependency relations
according to FGD, two types of functors:
actants (arguments)
  ACT – actor
  PAT – patient
  ADDR – addressee
  EFF – effect
  ORIG – origin
free modifiers (adjuncts)
  various types of temporal modifiers – TWHEN, TTIL, TSIN...
  spatial and directional modifiers – LOC, DIR1, DIR2, DIR3
  MEANS, BENeficiary, CAUSe, REGard, EXTent, MATerial, CONDition...
additional functors for representing non-dependency relations
  coordinations – CONJ, DISJ, ADVS ...
  appositions – APPS
  parenthetical constructions – PAR
  expressions in foreign language – FPHR
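The three groups of functors can be captured as simple lookup sets. The Python sketch below uses only the functor names that appear on the slide, so the sets are not exhaustive (the slide itself ends each list with an ellipsis). The function name `functor_group` is a hypothetical helper.

```python
# Functor names as listed on the slide; the real inventory is larger.
ACTANTS = {"ACT", "PAT", "ADDR", "EFF", "ORIG"}
FREE_MODIFIERS = {
    "TWHEN", "TTIL", "TSIN",          # temporal
    "LOC", "DIR1", "DIR2", "DIR3",    # spatial and directional
    "MEANS", "BEN", "CAUS", "REG", "EXT", "MAT", "COND",
}
NON_DEPENDENCY = {"CONJ", "DISJ", "ADVS", "APPS", "PAR", "FPHR"}


def functor_group(functor):
    """Return the group a functor belongs to, or 'unknown' if not listed."""
    if functor in ACTANTS:
        return "actant"
    if functor in FREE_MODIFIERS:
        return "free modifier"
    if functor in NON_DEPENDENCY:
        return "non-dependency"
    return "unknown"


if __name__ == "__main__":
    for name in ["ACT", "DIR3", "CONJ", "XYZ"]:
        print(name, "->", functor_group(name))
```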
T-layer (3) - valency
all occurrences of all verbs in t-trees are interlinked with the valency lexicon PDT-VALLEX
individual valency frames roughly correspond to individual senses of the given verb
valency frame ~ a sequence of frame slots; for each slot, its functor, obligatoriness, and possible surface realizations are specified
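A valency frame as described here maps naturally onto a small data structure. The Python sketch below is an illustration only: the class names `FrameSlot` and `ValencyFrame`, the method `violations`, and the sample frame for "give" are hypothetical and are not taken from PDT-VALLEX or its file format.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class FrameSlot:
    functor: str                  # e.g. "ACT", "PAT"
    obligatory: bool
    realizations: FrozenSet[str]  # permitted surface forms


@dataclass
class ValencyFrame:
    lemma: str
    slots: List[FrameSlot]

    def violations(self, observed: Dict[str, str]) -> List[str]:
        """Compare observed {functor: surface form} pairs with the frame."""
        problems = []
        for slot in self.slots:
            form = observed.get(slot.functor)
            if form is None:
                if slot.obligatory:
                    problems.append(f"missing obligatory {slot.functor}")
            elif form not in slot.realizations:
                problems.append(f"{slot.functor} realized as {form}")
        return problems


# Hypothetical frame for one sense of "give"; not copied from the lexicon.
GIVE = ValencyFrame(
    lemma="give",
    slots=[
        FrameSlot("ACT", True, frozenset({"nominative"})),
        FrameSlot("PAT", True, frozenset({"accusative"})),
        FrameSlot("ADDR", True, frozenset({"dative"})),
    ],
)

if __name__ == "__main__":
    print(GIVE.violations({"ACT": "nominative", "PAT": "dative"}))
```

A check of this shape is one way to compare the surface forms of verb arguments in a tree against a lexicon entry.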
T-layer (3) - coreference
two types of coreference according to FGD:
  grammatical (verbs of control, relative clauses, reflexive pronouns...)
  textual (personal pronouns, incl. elided ones)
coreference in PDT:
  binary relation between t-nodes
  depicted as a “non-tree” arc (arrow)
T-layer (4) - grammatemes
grammatemes: t-node attributes representing morphological meanings
motivation
number for nouns, tense for verbs, degree for adjectives, deontic/verb/sentence modality ...
Peter met her youngest brother.
  meet PRED tense=ant
    Peter ACT
    brother PAT number=sg
      #PersPron APP
      young RSTR degree=sup
Peter will meet her young brothers.
  meet PRED tense=post
    Peter ACT
    brother PAT number=pl
      #PersPron APP
      young RSTR degree=pos
T-layer (5) - node typing
the presence or absence of a given attribute depends on the kind of node, hence the need for node typing
two-level hierarchy of t-layer node types used in PDT 2.0:
tectogrammatical node
  node types: complex, atom, qcomplex, list, coap, dphr, fphr, root
  complex nodes: semantic nouns, semantic adjectives, semantic adverbs, semantic verbs
  semantic nouns:
    denotative
      n.denot (number, gender)
      negation: n.denot.neg (number, gender, negation)
    pronominal
      definite
        demonstrative: n.pron.def.demon (number, gender)
        personal: n.pron.def.pers (number, gender, person, politeness)
      indefinite: n.pron.indef (number, gender, person, indeftype)
    quantificative
      definite: n.quant.def (number, gender, numertype)
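Node typing makes it possible to check mechanically which grammatemes a node should carry. The Python sketch below encodes the six semantic-noun subtypes and their attribute lists as read from the slide; the table name `GRAMMATEMES_BY_TYPE` and the function `check_grammatemes` are hypothetical, and other node types are not covered.

```python
# Grammatemes applicable to each semantic-noun subtype, as shown on the slide.
GRAMMATEMES_BY_TYPE = {
    "n.denot": {"number", "gender"},
    "n.denot.neg": {"number", "gender", "negation"},
    "n.pron.def.demon": {"number", "gender"},
    "n.pron.def.pers": {"number", "gender", "person", "politeness"},
    "n.pron.indef": {"number", "gender", "person", "indeftype"},
    "n.quant.def": {"number", "gender", "numertype"},
}


def check_grammatemes(node_type, grammatemes):
    """Report grammatemes that are missing or unexpected for a node type."""
    expected = GRAMMATEMES_BY_TYPE[node_type]
    present = set(grammatemes)
    return {
        "missing": sorted(expected - present),
        "unexpected": sorted(present - expected),
    }


if __name__ == "__main__":
    # A denotative noun should not carry a "person" grammateme.
    node = {"number": "sg", "gender": "anim", "person": "3"}
    print(check_grammatemes("n.denot", node))
```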
Interlinking the layers
every unit at every layer has an ID that is unique within PDT
neighboring layers connected by top-down pointers
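Top-down pointers mean that a unit on a higher layer stores the IDs of the units it corresponds to on the layer directly below. The Python sketch below illustrates this with a tiny invented data set; all IDs, field names (`a_refs`, `m_ref`, `w_refs`), and the assumption that the token "intoforest" is one w-layer unit split into two m-layer units are hypothetical and do not reproduce the actual PDT file format.

```python
# Hypothetical miniature data: one t-node linked down to a single w-token.
W_LAYER = {"w4": {"token": "intoforest"}}
M_LAYER = {
    "m4": {"form": "into", "w_refs": ["w4"]},
    "m5": {"form": "forest", "w_refs": ["w4"]},
}
A_LAYER = {
    "a4": {"afun": "AuxP", "m_ref": "m4"},
    "a5": {"afun": "Adv", "m_ref": "m5"},
}
T_LAYER = {
    "t3": {"t_lemma": "forest", "functor": "DIR3", "a_refs": ["a4", "a5"]},
}


def m_forms(t_id):
    """Follow t -> a -> m pointers and return the morphological forms."""
    return [M_LAYER[A_LAYER[a_id]["m_ref"]]["form"]
            for a_id in T_LAYER[t_id]["a_refs"]]


def w_tokens(t_id):
    """Follow pointers down to the w-layer, keeping each token once."""
    seen = []
    for a_id in T_LAYER[t_id]["a_refs"]:
        for w_id in M_LAYER[A_LAYER[a_id]["m_ref"]]["w_refs"]:
            if w_id not in seen:
                seen.append(w_id)
    return [W_LAYER[w_id]["token"] for w_id in seen]


if __name__ == "__main__":
    print(m_forms("t3"), w_tokens("t3"))
```

Because the pointers only lead downward, going from a word to its t-node requires building a reverse index first.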
Sources of text
texts provided by the Czech National Corpus
7000 articles (or article fragments) from Czech newspapers and journals:
  Lidové noviny (daily newspaper)
  Mladá fronta Dnes (daily newspaper)
  Českomoravský profit (business weekly)
  Vesmír (scientific journal)
Amount of annotated data
m-layer data: 1.96 MW in 116 kS
a-layer data (75 % of m-layer): 1.5 MW in 88 kS
t-layer data (59 % of a-layer): 0.8 MW in 49 kS
(MW = million words, kS = thousand sentences)
Division into files
1 XML file per document and annotation layer
Train/test data
train : devtest : evaltest = 8 : 1 : 1
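An 8 : 1 : 1 division can be produced deterministically, for example by cycling through the documents in blocks of ten. The Python sketch below shows one such scheme; it is an assumption for illustration and does not claim to be the procedure by which PDT 2.0 documents were actually assigned. The function name `assign_split` is hypothetical.

```python
from collections import Counter


def assign_split(doc_index):
    """Map a document's position in the list to train, devtest, or evaltest."""
    bucket = doc_index % 10
    if bucket < 8:
        return "train"
    if bucket == 8:
        return "devtest"
    return "evaltest"


if __name__ == "__main__":
    # Hypothetical document names, used only to show the resulting proportions.
    documents = [f"doc{i:03d}" for i in range(100)]
    counts = Counter(assign_split(i) for i, _ in enumerate(documents))
    print(dict(counts))
```

Splitting by document rather than by sentence keeps all sentences of one article in the same portion.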
Full vs. sample data
sample data:
  500 sentences
  a freely available subset of the full data
  also converted to HTML (can be viewed in any WWW browser, no tree editor needed)
the whole PDT 2.0 except for the full data (i.e. the sample data, all tools, and documentation) is available on the web
the full data will be available only to licensed users who obtain the CD from the Linguistic Data Consortium
Tree editor TrEd
general customizable tree editor implemented in Perl
the main editing and browsing tool in the PDT project
Batch processing of the data
btred – batch processing version of tred
ntred – networked (parallelized) version of btred
$ btred -TNe 'print "$this->{t_lemma}\n" if $this->parent==$root and grep{$_->{functor}=~/^DIR/} $this->children()' data/sample/*.t.gz -q
Netgraph
client-server application for on-line PDT search
implemented in Java
Tools for post-annotation consistency checking
hundreds of btred scripts of various types:
technical tests, e.g.
  each sentence contains at least one token
  all identifiers are unique, all referred identifiers exist...
m-layer tests
  locative (6th case) cannot occur without a preposition
  improbable word forms (e.g. imperatives haš, tel)
a-layer tests
  not more than one subject in a clause
  attributes (afun Atr) should not appear directly below verbs
t-layer tests
  surface forms of verb arguments match the specifications in the valency lexicon
  relative pronouns in relative clauses should agree with their antecedent (in the sense of grammatical coreference)
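The technical tests are the easiest to picture in code. The Python sketch below checks that identifiers are unique and that every referenced identifier exists. It operates on plain dictionaries with hypothetical `id` and `refs` fields; the real tests are btred scripts run over the PDT files, and this sketch does not reproduce them.

```python
from collections import Counter


def check_identifiers(units):
    """Return error messages for duplicate IDs and dangling references."""
    counts = Counter(unit["id"] for unit in units)
    errors = [f"duplicate id: {uid}" for uid, n in counts.items() if n > 1]
    known = set(counts)
    for unit in units:
        for ref in unit.get("refs", []):
            if ref not in known:
                errors.append(f"{unit['id']} refers to unknown id: {ref}")
    return errors


if __name__ == "__main__":
    # Invented units: "a2" appears twice and "t1" points to a missing "a9".
    sample = [
        {"id": "a1"},
        {"id": "a2"},
        {"id": "a2"},
        {"id": "t1", "refs": ["a1", "a9"]},
    ]
    for message in check_identifiers(sample):
        print(message)
```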
Tools for automatic annotation
chain of tools for automatic text processing (from raw text to a-layer trees):
1. sentence segmentation and tokenization
2. morphological analysis
3. morphological disambiguation
4. dependency parsing (adapted Collins)
5. analytical function assignment
Tools for format conversions
conversion not only between PDT data formats, but also from other treebanks’ formats
example: constituency trees from Negra displayed in TrEd
PDT 2.0 Documentation
PDT Guide
  overview of all parts of PDT 2.0
  mirrors the directory structure of the PDT 2.0 CD-ROM
Annotation guidelines
  m-layer (~100 pages)
  a-layer (~250 pages)
  t-layer (~800 pages)
Publications
  conference and journal papers, technical reports, theses ...
Technical documentation (software tools and data formats)
Want to experiment with...
tagging? dependency parsing? semantic-role labeling? frame semantics? word-sense disambiguation? anaphora resolution? information structure? ...
Use PDT 2.0, it’s all there!!!
Annotation scheme not limited to Czech
example trees: T-layer in English, T-layer in German, A-layer in German, A-layer in Arabic, A-layer in Slovene, A-layer in Romanian