March 5, 2008 Companions Semantic Representation and Dialog Interfacing Workshop - Morphology and Surface Syntax
PDT: The Tools
Jan Hajič
Institute of Formal and Applied Linguistics
School of Computer Science
Faculty of Mathematics and Physics
Charles University, Prague
Czech Republic
Tectogrammatical Annotation Tools
  Manual annotation
    Speech Reconstruction: MEd
    Morphology (linear structure annotation): LAW
    Special graphical tool (TrEd)
      Customizable graphical tree editor
  Viewing and Searching
    TrEd, Netgraph (linear structure: also Bonito/Manatee)
  Automatic annotation
    (ASR, Segmentation), Morphology, Tagging, Parsing, Deep parsing, Co-reference, WSD, …
  Generation
    Jan Ptacek’s generation tools (rule-based, so far)
Manual annotation
  Speech reconstruction
    MEd
    z-layer, w-layer, m-layer
    Audio – annotators can listen
  Morphology
    LAW – new version for fast morphological disambiguation
  Syntax (analytical, tectogrammatical)
    TrEd
MEd: speech reconstruction viewer / annotation tool
  [Figure labels: m-layer (annotation), w-layer, z-layer, audio]
The Morphological Annotation Tool (LAW)
  Java-based
  Dictionary access
  XML-aware
  PML: m-layer
TrEd: Manual Annotation Tool
  Perl/PerlTk based, platform-independent
    Linux, Windows 95/98/2000, Solaris, ...
  Perl as the “macro” language
    “unlimited” online processing capability
  Flexibility for interactive checking
    split screen, graphical “diff” function
  Customization, printing, “plugins”, ...
  [Automatic processing: btred – no GUI]
  [Fast search (parallel processing): btred/ntred]
The “TrEd” Tree Editor Graphical tool
  TrEd main screen:
    Original sentence: [This year’s flu season is still quiet in Europe.]
    Editing window customization
    Run a macro
    Multiwindow editing/compare
TrEd
Valency Lexicon in TrEd
to write sth (about sth)
Searching the treebank
  TrEd (obviously)
    Programming possible (Perl)
    Fast search (parallelization)
  Netgraph
    Linguist-user-friendly
    Easy to write queries
    Not as flexible
    Java
Netgraph
Query
Netgraph
  Search results
Automatic annotation
  Morphological analysis
  Tagging
  Parsing (surface)
  Tectogrammatical (deep) parsing
    Tectogrammatical structure
    Co-reference
    Grammatemes
  Generation
Morphological dictionary
  Czech
    UFAL-developed
    C implementation
    800k lemmas
  English
    Open source
    Amorph-generated from data
    From WSJ
Tagging
  Czech
    10+ taggers
    Best: “MORCE”
      Averaged perceptron + unsupervised + rules, > 96%
    Testing on spoken (ASR) input
  English
    Off-the-shelf (97%)
    (…will retrain MORCE on WSJ/PTB)
    NB: within parsers (mostly)
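The averaged perceptron named above can be illustrated with a minimal sketch. This is not MORCE itself, which according to the slide combines the perceptron with unsupervised and rule-based components; it only shows the generic learning rule (update on mistakes, then average the weights over all training steps). The feature templates, the toy sentences, and all function names below are assumptions made for illustration.

```python
from collections import defaultdict


class AveragedPerceptron:
    """Multiclass perceptron whose final weights are averaged over all training steps."""

    def __init__(self, tags):
        self.tags = sorted(tags)
        self.weights = defaultdict(float)  # (feature, tag) -> current weight
        self._totals = defaultdict(float)  # (feature, tag) -> weight summed over steps
        self._stamps = defaultdict(int)    # (feature, tag) -> step of the last change
        self._step = 0

    def predict(self, feats):
        def score(tag):
            return sum(self.weights.get((f, tag), 0.0) for f in feats)
        # Ties are broken by tag name so that the result is deterministic.
        return max(self.tags, key=lambda tag: (score(tag), tag))

    def _bump(self, key, delta):
        # Credit the old weight for every step it stayed unchanged, then change it.
        self._totals[key] += (self._step - self._stamps[key]) * self.weights[key]
        self._stamps[key] = self._step
        self.weights[key] += delta

    def update(self, feats, gold, guess):
        self._step += 1
        if guess == gold:
            return
        for f in feats:
            self._bump((f, gold), 1.0)    # reward the correct tag
            self._bump((f, guess), -1.0)  # penalize the wrong guess

    def average(self):
        # Replace each weight by its mean value over all training steps.
        if self._step == 0:
            return
        for key in list(self.weights):
            total = self._totals[key] + (self._step - self._stamps[key]) * self.weights[key]
            self.weights[key] = total / self._step


def features(words, i):
    # Hypothetical feature templates: word form, two-letter suffix, previous word.
    word = words[i].lower()
    prev = words[i - 1].lower() if i > 0 else "<s>"
    return [f"w={word}", f"suf={word[-2:]}", f"prev={prev}"]


def train(sentences, epochs=5):
    model = AveragedPerceptron({tag for sent in sentences for _, tag in sent})
    for _ in range(epochs):
        for sent in sentences:
            words = [word for word, _ in sent]
            for i, (_, gold) in enumerate(sent):
                feats = features(words, i)
                model.update(feats, gold, model.predict(feats))
    model.average()
    return model


def tag(model, words):
    return [model.predict(features(words, i)) for i in range(len(words))]


# Toy training data (made up for this sketch).
TOY = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]

print(tag(train(TOY), ["the", "dog", "sleeps"]))
```

Averaging matters because the weights after the last update can overfit the most recent examples; the mean over all steps is usually more stable. A real tagger would use far richer features and a morphological dictionary to restrict the candidate tags.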
Parsing
  Czech
    McDonald et al.
      MST + MIRA, 85-86% dep. accuracy
    Labeling (afun)
      C5.0 or within parser, also ~85% accuracy
  English
    Collins / Charniak
    NB: Phrase-based
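The two figures above (dependency accuracy and afun labeling accuracy) can be computed as in the following sketch. It assumes that "dep. accuracy" means the share of tokens that receive the correct head and that labeling accuracy is the share of tokens with the correct afun; the slide does not spell out either definition. The function name and the four-token example are hypothetical.

```python
def attachment_scores(gold, predicted):
    """Return (head accuracy, label accuracy) over a flat list of tokens.

    Each argument is a list of (head_position, afun) pairs, one per token;
    head position 0 stands for the artificial root.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("no tokens to score")
    n = len(gold)
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    label_hits = sum(g[1] == p[1] for g, p in zip(gold, predicted))
    return head_hits / n, label_hits / n


# Hypothetical four-token sentence: the parser attaches the last token wrongly.
gold = [(2, "Atr"), (3, "Sb"), (0, "Pred"), (3, "Adv")]
predicted = [(2, "Atr"), (3, "Sb"), (0, "Pred"), (2, "Adv")]

head_acc, label_acc = attachment_scores(gold, predicted)
print(f"dependency accuracy: {head_acc:.2f}, afun accuracy: {label_acc:.2f}")
```

Note that the label score here is counted independently of the head score; evaluations that require both to be correct at once would give a lower combined number.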
Tectogrammatical parsing
  Czech: TrEd-implemented, 4-step process
    Starts from analytic layer
  English
    Rule-based so far
      Too little data annotated
      Annotation underway currently
    Starts from classical Collins/Charniak WSJ-type output
Tectogrammatical parsing - accuracy
  Newest results:
    4 phases
    Transformation-based learning (FnTBL)
    Largely language-independent
  Coreference: >90%
  Accuracy by attribute (m- and a-layer: manual vs. auto)
    Attribute        manual    auto
    structure        89.3 %    76.4 %
    functor          85.5 %    77.4 %
    val_frame.rf     92.3 %    90.9 %
    t_lemma          93.5 %    90.9 %
    nodetype         94.5 %    92.6 %
    gram/sempos      93.8 %    91.5 %
    a/lex.rf         96.5 %    95.1 %
    a/aux.rf         94.3 %    90.3 %
    is_member        94.3 %    89.5 %
    is_generated     96.6 %    95.2 %
    deepord          68.0 %    66.7 %
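To see which attributes suffer most when the lower layers are not hand-annotated, the per-attribute difference between the two columns can be computed as sketched below. The numbers are copied from the slide; reading "manual" and "auto" as the source of the m- and a-layer input is an assumption, and the function name is hypothetical.

```python
# Accuracy in percent as (manual, auto), copied from the slide's results table.
RESULTS = {
    "structure": (89.3, 76.4),
    "functor": (85.5, 77.4),
    "val_frame.rf": (92.3, 90.9),
    "t_lemma": (93.5, 90.9),
    "nodetype": (94.5, 92.6),
    "gram/sempos": (93.8, 91.5),
    "a/lex.rf": (96.5, 95.1),
    "a/aux.rf": (94.3, 90.3),
    "is_member": (94.3, 89.5),
    "is_generated": (96.6, 95.2),
    "deepord": (68.0, 66.7),
}


def accuracy_drops(results):
    """Return (attribute, drop in percentage points) pairs, largest drop first."""
    drops = [(name, round(manual - auto, 1)) for name, (manual, auto) in results.items()]
    return sorted(drops, key=lambda pair: pair[1], reverse=True)


for name, drop in accuracy_drops(RESULTS):
    print(f"{name:<14}{drop:5.1f}")
```

On these figures the tree structure and the functor lose the most (about 13 and 8 points), while most other attributes drop by fewer than 5 points.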
Word sense disambiguation
  For words with valency frames
    All verbs
    Some nouns, adjectives
  Valency frame ~ meaning (sense)
  Jiri Semecky’s work
    Accuracy on PDT: 70%+
  Portable to English
    No results yet
Generation
  From TR to text
  Jan Ptacek’s work (cf. review meeting)
  Rule-based
  Czech: completed
    Integrated with TTS (UWB)
  English: before completion of first version
  Results
    No metrics yet, subjectively very good
Some (more) pointers
  http://ufal.mff.cuni.cz/pdt2.0
    Current version of PDT, all three levels, 1.9/1.5/0.8 Mw
  http://ufal.mff.cuni.cz/REST/CAC/CAC.html
    The Czech Academic Corpus, v 1.0
  http://www.ldc.upenn.edu
    LDC2001T10 (PDT v1.0), LDC2004T23 (PADT 1.0), LDC2004T25 (PCEDT 1.0)
  http://www.clsp.jhu.edu: Workshop 2002
    Using TL for MT Generation