PDT: The Tools

March 5, 2008 Companions Semantic Representation and Dialog Interfacing Workshop - Morphology and Surface Syntax

PDT: The Tools

Jan Hajič

Institute of Formal and Applied Linguistics

School of Computer Science

Faculty of Mathematics and Physics

Charles University, Prague

Czech Republic

Tectogrammatical Annotation Tools

Manual annotation
  Speech reconstruction: MEd
  Morphology (linear structure annotation): LAW
  Special graphical tool (TrEd)
    Customizable graphical tree editor
Viewing and searching
  TrEd, Netgraph (linear structure: also Bonito/Manatee)
Automatic annotation
  (ASR, Segmentation), Morphology, Tagging, Parsing, Deep parsing, Co-reference, WSD, …
Generation
  Jan Ptacek’s generation tools (rule-based, so far)

Manual annotation

Speech reconstruction
  MEd
  z-layer, w-layer, m-layer
  Audio: annotators can listen
Morphology
  LAW: new version for fast morphological disambiguation
Syntax (analytical, tectogrammatical)
  TrEd

MEd: speech reconstruction viewer / annotation tool

[Screenshot labels: m-layer (annotation), w-layer, z-layer, audio]

The Morphological Annotation Tool (LAW)

Java-based
Dictionary access
XML-aware
PML: m-layer

TrEd: Manual Annotation Tool

Perl/PerlTk based, platform-independent
  Linux, Windows 95/98/2000, Solaris, ...
Perl as the “macro” language
  “unlimited” online processing capability
Flexibility for interactive checking
  split screen, graphical “diff” function
Customization, printing, “plugins”, ...
[Automatic processing: btred, no GUI]
[Fast search (parallel processing): btred/ntred]

The “TrEd” Tree Editor

Graphical tool
TrEd main screen:
  Original sentence: [This year’s flu season is still quiet in Europe.]
  Editing window customization
  Run a macro
  Multiwindow editing/compare

TrEd

Valency Lexicon in TrEd

to write sth (about sth)

Searching the treebank

TrEd (obviously)
  Programming possible (Perl)
  Fast search (parallelization)
Netgraph
  Linguist-user-friendly
  Easy to write queries
  Not as flexible
  Java
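
To make the idea of a programmatic treebank query concrete, here is a minimal sketch in Python. It is not TrEd's Perl macro interface and not Netgraph's query language; the `Node` class, its attribute names, and the toy tree are assumptions made up for illustration. It finds every parent/child edge whose two nodes satisfy given tests.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Tuple


@dataclass
class Node:
    """Hypothetical tree node; field names are illustrative, not PML's."""
    form: str
    afun: str
    children: List["Node"] = field(default_factory=list)


def walk(node: Node) -> Iterator[Node]:
    """Yield the node and all of its descendants, depth-first."""
    yield node
    for child in node.children:
        yield from walk(child)


def find(root: Node,
         parent_test: Callable[[Node], bool],
         child_test: Callable[[Node], bool]) -> List[Tuple[Node, Node]]:
    """Return (parent, child) pairs on a direct edge where both tests hold."""
    return [(p, c) for p in walk(root) for c in p.children
            if parent_test(p) and child_test(c)]


# Toy tree (invented for this example, not taken from PDT).
tree = Node("is", "Pred", [
    Node("season", "Sb", [Node("flu", "Atr")]),
    Node("quiet", "Pnom", [Node("still", "Adv")]),
])

# Query: a predicate node with a subject child.
hits = find(tree, lambda p: p.afun == "Pred", lambda c: c.afun == "Sb")
print([(p.form, c.form) for p, c in hits])  # [('is', 'season')]
```

Passing the tests as plain functions mirrors the trade-off on the slide: arbitrary code is flexible, while a dedicated query language is easier to write.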

Netgraph

Query

Netgraph

Search results

Automatic annotation

Morphological analysis
Tagging
Parsing (surface)
Tectogrammatical (deep) parsing
  Tectogrammatical structure
  Co-reference
  Grammatemes
Generation

Morphological dictionary

Czech
  UFAL-developed
  C implementation
  800k lemmas
English
  Open source
  Amorph-generated from data
  From WSJ

Tagging

Czech
  10+ taggers
  Best: “MORCE”
    Averaged perceptron + unsupervised + rules, > 96%
    Testing on spoken (ASR) input
English
  Off-the-shelf (97%)
  (…will retrain MORCE on WSJ/PTB)
  NB: within parsers (mostly)
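
For readers unfamiliar with the averaged perceptron, the following is a minimal sketch of the general technique in Python. It is not MORCE: the feature templates, the greedy left-to-right decoding, and the toy training data are all assumptions chosen for brevity, and the unsupervised and rule-based components mentioned above are not modelled.

```python
from collections import defaultdict


def features(words, i, prev_tag):
    """Hypothetical feature templates: word, 2-char suffix, previous tag, bias."""
    w = words[i].lower()
    return [f"w={w}", f"suf={w[-2:]}", f"prev={prev_tag}", "bias"]


class AveragedPerceptron:
    def __init__(self, tags):
        self.tags = sorted(tags)
        self.w = defaultdict(float)      # current weights, keyed by (feature, tag)
        self.total = defaultdict(float)  # sum of the weights over all steps
        self.steps = 0

    def predict(self, feats, weights=None):
        weights = self.w if weights is None else weights
        return max(self.tags,
                   key=lambda t: sum(weights.get((f, t), 0.0) for f in feats))

    def update(self, feats, gold, guess):
        self.steps += 1
        if gold != guess:                # standard perceptron update on error
            for f in feats:
                self.w[(f, gold)] += 1.0
                self.w[(f, guess)] -= 1.0
        for key, value in self.w.items():  # naive (slow but clear) averaging
            self.total[key] += value

    def averaged(self):
        return {k: v / self.steps for k, v in self.total.items()}


def train(sentences, epochs=5):
    model = AveragedPerceptron({t for sent in sentences for _, t in sent})
    for _ in range(epochs):
        for sent in sentences:
            words = [w for w, _ in sent]
            prev = "<s>"
            for i, (_, gold) in enumerate(sent):
                feats = features(words, i, prev)
                model.update(feats, gold, model.predict(feats))
                prev = gold              # condition on the gold previous tag
    return model


def tag(model, words):
    weights, prev, out = model.averaged(), "<s>", []
    for i in range(len(words)):
        prev = model.predict(features(words, i, prev), weights)
        out.append(prev)
    return out


# Toy data, invented for the example.
data = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
        [("a", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]]
model = train(data)
print(tag(model, ["the", "cat", "barks"]))  # ['DT', 'NN', 'VBZ']
```

Averaging the weights over all training steps, rather than using the final weights, is what distinguishes this from a plain perceptron; it tends to reduce sensitivity to the order of the last few updates.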

Parsing

Czech
  McDonald et al.
    MST + MIRA, 85-86% dependency accuracy
  Labeling (afun)
    C5.0 or within parser, also ~85% accuracy
English
  Collins / Charniak
  NB: Phrase-based
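
As a rough illustration of how figures like these are usually computed, the sketch below scores a predicted dependency tree against a gold one. It assumes that "dependency accuracy" means the share of tokens with the correct head (unlabeled attachment) and that labeling accuracy is measured together with the head (labeled attachment); the slide does not spell out either definition. The function name and the toy data are hypothetical.

```python
def attachment_scores(gold, pred):
    """Return (unlabeled, labeled) attachment accuracy.

    gold and pred are equal-length lists of (head_index, label) pairs,
    one pair per token; head_index 0 stands for the artificial root.
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    n = len(gold)
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    full_hits = sum(g == p for g, p in zip(gold, pred))
    return head_hits / n, full_hits / n


# Toy four-token sentence, invented for the example.
gold = [(2, "Atr"), (3, "Sb"), (0, "Pred"), (3, "Adv")]
pred = [(2, "Atr"), (3, "Obj"), (0, "Pred"), (2, "Adv")]
print(attachment_scores(gold, pred))  # (0.75, 0.5)
```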

Tectogrammatical parsing

Czech
  TrEd-implemented, 4-step process
  Starts from analytic layer
English
  Rule-based so far
    Too little data annotated
    Annotation underway currently
  Starts from classical Collins/Charniak WSJ-type output

Tectogrammatical parsing - accuracy

Newest results:
  4 phases
  Transformation-based learning
    fnTBL
  Largely language independent
  Coreference: >90%

                 m- and a-layer:
Attribute        manual    auto
structure        89.3 %    76.4 %
functor          85.5 %    77.4 %
val_frame.rf     92.3 %    90.9 %
t_lemma          93.5 %    90.9 %
nodetype         94.5 %    92.6 %
gram/sempos      93.8 %    91.5 %
a/lex.rf         96.5 %    95.1 %
a/aux.rf         94.3 %    90.3 %
is_member        94.3 %    89.5 %
is_generated     96.6 %    95.2 %
deepord          68.0 %    66.7 %
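
Transformation-based learning yields an ordered list of rewrite rules that are applied one after another to an initial labeling. The sketch below shows only that application step, in Python; it is not fnTBL's rule format, and the `Rule` fields, the single "previous label" context, and the abstract labels A/B/C are assumptions made for illustration.

```python
from typing import List, NamedTuple


class Rule(NamedTuple):
    """Hypothetical rule: relabel from_label as to_label after prev_label."""
    from_label: str
    to_label: str
    prev_label: str


def apply_rules(labels: List[str], rules: List[Rule]) -> List[str]:
    """Apply each rule in order; changes take effect immediately, left to right."""
    out = list(labels)
    for rule in rules:
        for i in range(1, len(out)):
            if out[i] == rule.from_label and out[i - 1] == rule.prev_label:
                out[i] = rule.to_label
    return out


rules = [Rule("B", "C", "B"), Rule("C", "A", "C")]
print(apply_rules(["A", "B", "B", "C"], rules))  # ['A', 'B', 'C', 'A']
```

The order matters: the second rule fires here only because the first one has already turned the third label into C.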

Word sense disambiguation

For words with valency frames
  All verbs
  Some nouns, adjectives
Valency frame ~ meaning (sense)
Jiri Semecky’s work
  Accuracy on PDT: 70%+
Portable to English
  No results yet

Generation

From TR to text
Jan Ptacek’s work (cf. review meeting)
Rule-based
Czech: completed
  Integrated with TTS (UWB)
English: before completion of first version
Results
  No metrics yet, subjectively very good

Some (more) pointers

http://ufal.mff.cuni.cz/pdt2.0
  Current version of PDT, all three levels, 1.9/1.5/0.8 Mw
http://ufal.mff.cuni.cz/REST/CAC/CAC.html
  The Czech Academic Corpus, v 1.0
http://www.ldc.upenn.edu
  LDC2001T10 (PDT v1.0), LDC2004T23 (PADT 1.0), LDC2004T25 (PCEDT 1.0)
http://www.clsp.jhu.edu: Workshop 2002
  Using TL for MT Generation