
Copyright 2007, Toshiba Corporation.

8 January 2007, IJCAI-07 workshop on “Shallow Parsing in South Asian Languages”

Chunking and Parsing

Sabine Buchholz

Speech Technology Group (STG) Cambridge Research Laboratory (CRL) Toshiba Research Europe Ltd (TREL)

2

Outline

• Chunking

– English only

– from PhD time: 1997-2002

– chunking proper, and chunking as preprocessing to relation finding

• Full dependency parsing (without chunking)

– many languages

– CoNLL-X (10th Conference on Computational Natural Language Learning) shared task on Multilingual Dependency Parsing, 2006

• Conclusions

3

A brief history of chunks until 1997

• Church, 1988

– inserting brackets around “simple non-recursive noun phrases”

• Abney, 1991, “Parsing by Chunks”

– importance for prosody, parser = chunker + attacher, also PP chunks

• Abney, 1996, “Chunk Stylebook”

– chunk ends with the head, types of verb chunks, no PP chunks

• Ramshaw & Marcus, 1995, “Text chunking using TBL”

– baseNPs, N-type/V-type, chunking as tagging: IOB (B = Begin/Between)

• Collins, 1996

– NP chunking, then parsing; chunker uses 5 tags (S=[, E=], B=][, I, O)

• Magerman, 1995

– bottom-up parser using 5 “extensions” similar to IOB tags

• Ratnaparkhi, 1997

– chunking, then parsing; chunker: IOB, parser: B(egin), I + complete?

4

A brief history of chunks: Summary

• Types of chunks

– noun chunks most popular

– use/definition of other types vary

• Representations

– brackets

– IOB tags (B as “begin” or “between”; 3 or 5 tags)

• Important concepts

– prosodic relevance of chunks

– chunking to make parsing more efficient (as a pre-processing step)

– chunking as a tagging problem

– using techniques similar to IOB tagging for (full) parsing

Tjong Kim Sang & Veenstra, 1999: IOB (B=Between) best, but various others not significantly worse

5

PhD thesis

• “Memory-Based Grammatical Relation Finding” (2002)

• Data

– I converted the Penn Treebank from constituent structure to chunks and grammatical relations (e.g. subject, temporal) between chunks

– conversion program available at http://ilk.uvt.nl/~sabine/homepage/software.html

• Division of work

– (Mainly) other people worked on chunking (using my converted data)

– I worked on finding grammatical relations to verb chunks

• Machine Learning framework: Memory-Based Learning

– based on k-nearest neighbor algorithm; multi-class classifier

– for numeric and symbolic features; sophisticated feature weighting

– implementation: TiMBL (Tilburg Memory-Based Learner), Daelemans et al., 2004; available at http://ilk.uvt.nl/timbl/
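The k-nearest-neighbour idea behind Memory-Based Learning can be sketched in a few lines. This is only an illustration, not TiMBL: it uses a plain, unweighted overlap distance over symbolic features, whereas the slide mentions sophisticated feature weighting. The feature layout (previous PoS, current PoS, next PoS) and the tiny memory are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[str], str]  # (symbolic feature values, class label)


def overlap_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of feature positions at which the two instances differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def knn_classify(memory: List[Example], instance: Sequence[str], k: int = 1) -> str:
    """Return the majority class among the k stored examples closest to `instance`."""
    ranked = sorted(memory, key=lambda ex: overlap_distance(ex[0], instance))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical memory: (previous PoS, current PoS, next PoS) -> chunk tag
memory = [
    (("DT", "NN", "VBZ"), "I-NP"),
    (("NN", "VBZ", "DT"), "B-VP"),
    (("VBZ", "DT", "NN"), "B-NP"),
]
predicted = knn_classify(memory, ("DT", "NN", "IN"), k=1)
```

"Training" here is just storing the examples; all the work happens at classification time, which is why the approach is called memory-based.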

6

CoNLL-2000 shared task on chunking

• Data: converted Wall Street Journal sections 15-18 (training) and 20 (testing) of the Penn Treebank

– 211,727 tokens, 106,978 chunks in training data

• 51% NP chunks, e.g. [NP the most volatile form NP]

• 20% VP chunks, e.g. [VP may not want to sell VP]

• 20% PP chunks, e.g. [PP on PP], [PP due to PP], [PP just after PP]

• 4% ADVP chunks, e.g. [ADVP earlier ADVP]

• 2% SBAR chunks, e.g. [SBAR because SBAR], [SBAR so that SBAR]

• 2% ADJP chunks, e.g. [NP 68 years NP] [ADJP old ADJP]

• 1% PRT chunks, e.g. [PRT on and off PRT]

• < 1% CONJP, INTJ, LST, UCP, e.g. [CONJP as well as CONJP], [INTJ oh INTJ]

– representation: IOB (B=Begin)

• e.g.: He/B-NP reckons/B-VP the/B-NP account/I-NP deficit/I-NP ...
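The bracket and IOB representations carry the same information. A minimal sketch of the conversion to IOB with B=Begin, assuming chunks are given as (type, words) pairs and that words outside any chunk use the type `None`; the function and type names are illustrative only.

```python
from typing import List, Optional, Tuple

Chunk = Tuple[Optional[str], List[str]]  # (chunk type, or None for "outside"; words)


def chunks_to_iob(chunks: List[Chunk]) -> List[Tuple[str, str]]:
    """Turn bracketed chunks into (word, tag) pairs using IOB with B=Begin."""
    tagged = []
    for chunk_type, words in chunks:
        for position, word in enumerate(words):
            if chunk_type is None:
                tag = "O"                    # outside any chunk
            elif position == 0:
                tag = "B-" + chunk_type      # first word of every chunk
            else:
                tag = "I-" + chunk_type      # word inside a chunk
            tagged.append((word, tag))
    return tagged


sentence = [("NP", ["He"]), ("VP", ["reckons"]), ("NP", ["the", "account", "deficit"])]
tagged = chunks_to_iob(sentence)
line = " ".join(f"{word}/{tag}" for word, tag in tagged)
```

For the slide's example, `line` is `He/B-NP reckons/B-VP the/B-NP account/I-NP deficit/I-NP`.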

7

CoNLL-2000 shared task on chunking

• Tjong Kim Sang & Buchholz, 2000

– PoS: not from treebank but by Brill tagger

– 11 groups took part

– evaluation metric:

• chunk is correct if boundaries and type are correct

• precision: percentage of predicted chunks that are correct

• recall: percentage of chunks in data that are correctly predicted

• Fβ=1 score: 2 × precision × recall / (precision + recall)

– baseline (assigning most frequent chunk tag to each PoS tag): 77.1

– best 3 systems:

• Kudoh & Matsumoto: 93.5; SVMs, pairwise classification

• van Halteren: 93.2; Weighted Probability Distribution Voting, MBL

• Tjong Kim Sang: 92.5; MBL, different representations

– all top 3 systems used several classifiers, combined by voting
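The evaluation metric above can be sketched as follows. This is not the official evaluation script, only an illustration of the definitions on the slide, assuming gold and predicted chunk tags are given as parallel lists in IOB (B=Begin) format.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (first token index, last token index, chunk type)


def extract_chunks(tags: List[str]) -> Set[Span]:
    """Collect chunk spans from IOB tags (B=Begin)."""
    chunks: Set[Span] = set()
    start, chunk_type = None, None
    for i, tag in enumerate(tags + ["O"]):          # sentinel closes a final chunk
        continues = tag.startswith("I-") and tag[2:] == chunk_type
        if start is not None and not continues:
            chunks.add((start, i - 1, chunk_type))  # close the open chunk
            start, chunk_type = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, chunk_type = i, tag[2:]          # open a new chunk
    return chunks


def precision_recall_f1(gold: List[str], predicted: List[str]) -> Tuple[float, float, float]:
    """A chunk counts as correct only if boundaries and type both match."""
    gold_chunks, pred_chunks = extract_chunks(gold), extract_chunks(predicted)
    correct = len(gold_chunks & pred_chunks)
    precision = correct / len(pred_chunks) if pred_chunks else 0.0
    recall = correct / len(gold_chunks) if gold_chunks else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ["B-NP", "B-VP", "B-NP", "I-NP", "I-NP"]
pred = ["B-NP", "B-VP", "B-NP", "I-NP", "B-NP"]   # last NP wrongly split in two
p, r, f = precision_recall_f1(gold, pred)
```

In this example 2 of 4 predicted chunks are correct (precision 0.5) and 2 of 3 gold chunks are found (recall 2/3), giving F = 4/7.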

8

Chunking as preprocessing to relation finding

Peter Miller, who organized the conference in New York, does not want to come to Paris.

• PoS tagging

NNP NNP , WP VBD DT NN IN NNP NNP , VBZ RB VB TO VB TO NNP .

• chunking, reduce each chunk to its head

[NP Miller], [NP who] [VP organized] [NP conference] [PP in] [NP York], [VP come] [PP to] [NP Paris].

NNP , WP VBD NN IN NNP , VB TO NNP .

• PNP finding; also “{PNP in London, Berlin and Paris}”

[NP Miller], [NP who] [VP organized] [NP conference] {PNP in York}, [VP come] {PNP to Paris}.

• relation finding (to verbs)

[Figure: arcs from chunks to verb chunks, labelled PP-DIR, NP-SBJ, NP-SBJ and NP-OBJ]

9


• One classifier instance for each pair of a verb chunk and another chunk (= focus) → 2 × 6 = 12 instances

verb   dist  vcs  …  focus−1         focus                       …  class
                     word   chunk    prep  word    PoS  chunk
orga.  −1    0       ,      ―        ―     who     WP   NP          NP-SBJ
come   −7    1       ―      ―        ―     Miller  NNP  NP          NP-SBJ
orga.  +2    0       conf.  NP       in    York    NNP  PNP         ―

[NP Miller], [NP who] [VP organized] [NP conference] {PNP in York}, [VP come] {PNP to Paris}. NNP , WP VBD NN IN NNP, VB TO NNP .

• Each instance has various types of features
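The instance generation can be sketched as below. It is an illustration under assumptions, not the original feature extractor: "dist" is taken to be the position of the focus minus the position of the verb (counting punctuation), and "vcs" is taken to be the number of verb chunks strictly between the two. Both readings reproduce the three rows shown on the slide; the class and function names are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Element(NamedTuple):
    head: str
    chunk: Optional[str]   # None for punctuation outside any chunk


class Instance(NamedTuple):
    verb: str
    focus: str
    dist: int              # assumed: focus position minus verb position
    vcs: int               # assumed: verb chunks strictly between verb and focus


def make_instances(elements: List[Element]) -> List[Instance]:
    """One instance per pair of a verb chunk and any other chunk (the focus)."""
    instances = []
    verb_positions = [i for i, e in enumerate(elements) if e.chunk == "VP"]
    for v in verb_positions:
        for f, focus in enumerate(elements):
            if f == v or focus.chunk is None:
                continue                       # skip the verb itself and punctuation
            lo, hi = sorted((v, f))
            vcs = sum(1 for e in elements[lo + 1:hi] if e.chunk == "VP")
            instances.append(Instance(elements[v].head, focus.head, f - v, vcs))
    return instances


sentence = [
    Element("Miller", "NP"), Element(",", None), Element("who", "NP"),
    Element("organized", "VP"), Element("conference", "NP"), Element("York", "PNP"),
    Element(",", None), Element("come", "VP"), Element("Paris", "PNP"), Element(".", None),
]
instances = make_instances(sentence)
```

With 2 verb chunks and 6 other chunks each, this yields the 12 instances mentioned on the slide.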

10


• Buchholz, Veenstra & Daelemans, 1999

– data: Penn Treebank II, WSJ sections 00-19 training, 20-24 testing

– evaluation: Fβ=1 score

– for efficiency: instance only if at most 1 verb/verb chunk between focus-verb pair, and no verb/verb chunk between verb-focus pair

– even rarer chunks with lower F-score help relation finder

Structure in input   Chunk Fβ=1   # Inst.   Av. dist.   Relation Fβ=1
Words, PoS           ―            350,091   6.1         49.1
+ NP chunks          92.3         227,995   4.2         60.4
+ VP chunks          91.8         186,364   4.5         67.2
+ AD(J/V)P chunks    66.7/77.9    185,005   4.4         67.3
+ PP chunks          96.1         184,455   4.4         68.2
+ PNPs               92.0         149,341   3.6         69.3

11


• Buchholz, 2002, PhD, Section 6.1.4

– data: Penn Treebank II, WSJ sections 10-19

• 21,747 sentences; 515,390 tokens; 320 different relations

– evaluation: Fβ=1 score, doing 10-fold cross validation

• relation is correct if relation type and both head words are correct

– input: treebank PoS tags/chunks/PNPs vs. those predicted by combined tagger/chunker and PNP finder (trained on sections 00-09)

– When applying modules in sequence (output of one module is input to next), it is important to train “later” modules on realistic input

Info in training set   Info in test set   Fβ=1
From treebank          From treebank      81.36
From treebank          Predicted          71.38
Predicted              Predicted          72.59


CoNLL-X Shared Task on Multilingual Dependency Parsing

Sabine Buchholz, Toshiba Research Europe Ltd, UK

Erwin Marsi, Tilburg University, The Netherlands

Amit Dubey, University of Edinburgh, UK

Yuval Krymolowski, University of Haifa, Israel

(Buchholz & Marsi, 2006)

13

Dependency structure

• No constituents (unlike phrase structure)

• Dependency relations between two lexical items (tokens)

• Graphical representations

[Figure: two graphical representations of the dependency structure of “This is a test .”, with arcs labelled ROOT, subj, det, comp and punc]

14

Dependency structure — terminology

• Child

• Dependent

• Modifier

[Figure: the tokens “This” and “is” connected by an arc labelled subj]

• Parent

• Governor

• Head

• Label

15

Dependency structure ↔ phrase structure

• Head table

– S: VP

– VP: V

– NP: Pronoun, N

• Function mapping

– NP-S-VP: subj

– NP-VP-V: comp

– Det-NP-N: det

[Figure: dependency structure of “This is a test .” with arcs labelled ROOT, subj, det, comp and punc]
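How a head table and a function mapping turn a phrase structure into dependencies can be sketched as follows. The sketch assumes that an entry such as “NP-S-VP: subj” reads as (non-head child, parent, head child); that reading, the tree encoding and all names are assumptions made for illustration. Punctuation is left out because the tables above do not cover it.

```python
from typing import Dict, Tuple

# Head table: for each constituent label, candidate head children in priority order
HEAD_TABLE = {"S": ["VP"], "VP": ["V"], "NP": ["Pronoun", "N"]}
# Function mapping, read here as (non-head child, parent, head child) -> label
FUNCTIONS = {("NP", "S", "VP"): "subj", ("NP", "VP", "V"): "comp", ("Det", "NP", "N"): "det"}

Deps = Dict[int, Tuple[int, str]]  # token id -> (head token id, label)


def convert(tree, deps: Deps) -> int:
    """Fill `deps` for the subtree and return the id of its lexical head token."""
    label, content = tree
    if isinstance(content, int):               # preterminal: (PoS, token id)
        return content
    head_child = next(c for cand in HEAD_TABLE[label] for c in content if c[0] == cand)
    head_token = convert(head_child, deps)
    for child in content:
        if child is head_child:
            continue
        child_token = convert(child, deps)     # head of the non-head child ...
        relation = FUNCTIONS[(child[0], label, head_child[0])]
        deps[child_token] = (head_token, relation)   # ... depends on the phrase head
    return head_token


# "This(1) is(2) a(3) test(4)"
tree = ("S", [("NP", [("Pronoun", 1)]),
              ("VP", [("V", 2), ("NP", [("Det", 3), ("N", 4)])])])
deps: Deps = {}
root = convert(tree, deps)
deps[root] = (0, "ROOT")                       # sentence head attaches to the virtual root
```

The result matches the example structure: “This” is subj of “is”, “a” is det of “test”, and “test” is comp of “is”.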

16

Dependency structures — in the shared task

• Virtual root node

• Each token except BOS has exactly one head

• More than one token can link to BOS

• Crossing arcs are allowed, i.e. structures can be non-projective

[Figure: “BOS This is a test .” with token positions 0 to 5 and arcs labelled ROOT, subj, det, comp and punc]

Examples: “Do you need it for something ?” and “What do you need it for ?”

An arc (i,j) is projective iff all nodes occurring between i and j are dominated by i (where “dominates” is the transitive closure of the arc relation)
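The projectivity definition translates directly into code. In the sketch below, `heads[d]` is the head of token d and index 0 stands for the virtual root (BOS). The head assignments for “What do you need it for ?” are a hypothetical analysis chosen for illustration (with “What” attached to “for”); the slides do not give them.

```python
from typing import List, Tuple


def dominates(heads: List[int], ancestor: int, node: int) -> bool:
    """True if `ancestor` is reached by following head links upward from `node`."""
    current = node
    for _ in range(len(heads)):        # bounded walk, so malformed input cannot loop forever
        current = heads[current]
        if current == ancestor:
            return True
        if current == 0:               # reached the virtual root without meeting `ancestor`
            return False
    return False


def is_projective(heads: List[int], head: int, dep: int) -> bool:
    """Arc (head, dep) is projective iff every node between them is dominated by head."""
    lo, hi = sorted((head, dep))
    return all(dominates(heads, head, k) for k in range(lo + 1, hi))


def non_projective_arcs(heads: List[int]) -> List[Tuple[int, int]]:
    return [(heads[d], d) for d in range(1, len(heads)) if not is_projective(heads, heads[d], d)]


# "This is a test ." with the heads from the example structure (index 0 = BOS, unused)
test_heads = [0, 2, 0, 4, 2, 2]
# Hypothetical analysis of "What do you need it for ?": What -> for, for -> need, need -> do
what_heads = [0, 6, 0, 2, 2, 4, 4, 2]

projective_example = non_projective_arcs(test_heads)
crossing_example = non_projective_arcs(what_heads)
```

Under the assumed analysis, the only non-projective arc is the one from “for” (6) to “What” (1), because “do” lies between them but is not dominated by “for”.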

17

Data format

[Figure: “BOS This is a test .” with token positions 0 to 5 and arcs labelled ROOT, subj, det, comp and punc]

ID  FORM  LEMMA  CPOSTAG  POSTAG  FEATS      HEAD  DEPREL
1   This  this   pronoun  demon   sg         2     subj
2   is    be     v        v-fin   3|sg|pres  0     ROOT
3   a     a      art      art     indef      4     det
4   test  test   n        nc      sg         2     comp
5   .     .      punc     punc    _          2     punc

18

Data format — details

• ID, FORM, CPOSTAG, POSTAG, HEAD and DEPREL guaranteed to contain a non-dummy value

• Although CPOSTAG and POSTAG may be identical

• LEMMA and FEATS allowed to contain dummy value (_)

• Unicode (UTF-8)
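Reading this format takes only a few lines. A minimal sketch, assuming one token per line with tab-separated fields and handling only the eight columns shown on the slides (any further columns in a file are ignored); the function name is illustrative.

```python
from typing import Dict, List

COLUMNS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG", "FEATS", "HEAD", "DEPREL"]


def parse_sentence(block: str) -> List[Dict[str, object]]:
    """Parse one sentence: one token per line, tab-separated fields (assumed)."""
    tokens = []
    for line in block.strip().splitlines():
        token: Dict[str, object] = dict(zip(COLUMNS, line.split("\t")))
        token["ID"] = int(token["ID"])
        token["HEAD"] = int(token["HEAD"])
        feats = token["FEATS"]
        # "_" is the dummy value; otherwise features are separated by "|"
        token["FEATS"] = [] if feats == "_" else feats.split("|")
        tokens.append(token)
    return tokens


SAMPLE = "\n".join("\t".join(row) for row in [
    ["1", "This", "this", "pronoun", "demon", "sg", "2", "subj"],
    ["2", "is", "be", "v", "v-fin", "3|sg|pres", "0", "ROOT"],
    ["3", "a", "a", "art", "art", "indef", "4", "det"],
    ["4", "test", "test", "n", "nc", "sg", "2", "comp"],
    ["5", ".", ".", "punc", "punc", "_", "2", "punc"],
])
tokens = parse_sentence(SAMPLE)
heads = [t["HEAD"] for t in tokens]
```

Files should be opened with `encoding="utf-8"`, since the data is Unicode (UTF-8).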

19

Treebanks used

Dependency format:

• Czech: Prague Dependency Treebank (PDT)

• Arabic: Prague Arabic Dependency Treebank (PADT)

• Slovene: Slovene Dependency Treebank (SDT)

• Danish: Danish Dependency Treebank (DDT)

• Swedish: Talbanken05

• Turkish: Metu-Sabancı treebank

Constituents and functions:

• German: TIGER treebank

• Japanese: Japanese Verbmobil treebank

• Portuguese: The Bosque part of the Floresta sintá(c)tica

• Dutch: Alpino treebank

• Chinese: Sinica treebank

Constituents and some functions:

• Spanish: Cast3LB

• Bulgarian: BulTreeBank

20

Data format — some examples: Chinese

ID FORM LEMMA CPOS POS FEATS HEAD DEPREL

1 也 _ D Dbb _ 2 evaluation

2 是 _ V V_11 _ 0 ROOT

3 同班 _ N Nv3 _ 4 property

4 同學 _ N Nab _ 2 range

#5:5.[39031] VP(evaluation:Dbb: 也 |Head:V_11:是 |range:NP(property:Nv3: 同班 |Head:Nab: 同學 ))# 。 (PERIODCATEGORY)

21


Data format — some examples: Arabic

1 giyAbu_غِيابُ giyAb_غِياب N N case=1|def=R 0 ExD
2 fu&Ad_فُؤاد fu&Ad_فُؤاد Z Z _ 3 Atr
3 kanoEAn_كَنْعان kanoEAn_كَنْعان Z Z _ 1 Atr

[#1,AuxS,tag=HEADLINE,#1,ord=0,comment=Sun Oct 3 05:02:28 2004 \[SyntaxFS.pl 1.06\],x_id_ord=#1_1,x_comment=ALH20010911.0001_story Wed Jul 21 12:51:09 2004 \[MorphoFS.pl 1.09\]]([غِيابُ,ExD,غِياب_1,N-------1R,غياب,ord=1,x_id_ord=#1/1_12,x_lookup=gyAb,giyAb+u,absence/disappearance + \[def.nom.\]]([كَنْعان,Atr,كَنْعان_2,Z---------,كنعان,ord=3,x_id_ord=#1/3_6,x_lookup=knEAn,kanoEAn,Kan'an]([فُؤاد,Atr,فُؤاد_2,Z---------,فؤاد,ord=2,x_id_ord=#1/2_11,x_lookup=f&Ad,fu&Ad,Fuad/Fouad])))

22


Data format — some examples: Portuguese

1 Mas mas conj conj-c _ 0 UTT
2 se se conj conj-s _ 3 SUB
3 falhar falhar v v-fin FUT|3S|SUBJ 1 ADVL
4 ? ? punc punc _ 1 PUNC

SOURCE: CETEMPúblico n=22 sec=eco sem=92a
CP22-4 Mas se falhar?
A1
UTT:acl
=CO:conj-c('mas') Mas
=ADVL:fcl
==SUB:conj-s('se') se
==P:v-fin('falhar' FUT 3S SUBJ) falhar
=?

Head table

• acl: COM, PRD, P, leftmost non-punctuation

• fcl: P, PAUX, …

23

Data format — training data and test data

• Training data: 29,000 (Slovene) – 1,250,000 (Czech) tokens

– Contains all columns

• “Blind” test data (given to participants): approx. 5,000 scoring tokens

– Contains only first six columns

• Participants predict: HEAD and DEPREL


1 This this pronoun demonstrative sg 2 subj

2 is be v v-fin 3|sg|pres 0 ROOT

3 a a art art indef 4 det

4 test test n nc sg 2 comp

5 . . punc punc _ 2 punc

24

Evaluation metric

• Official metric: Labelled attachment score (LAS)

– The percentage of “scoring” tokens (of all languages!) for which the system predicted the correct HEAD and DEPREL value

– A token is “non-scoring” if all characters of the FORM value have the Unicode category property “Punctuation”

• E.g. “.” “,” “?” “(“ “¿” “…” “--” “_” “؟_?” “%” …

• Also computed, for error analysis and system comparison:

– Unlabelled attachment score (UAS)

• The percentage of “scoring” tokens for which the system predicted the correct HEAD value

– Label accuracy

• The percentage of “scoring” tokens for which the system predicted the correct DEPREL value
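The three scores can be sketched as below. This is not the official evaluation script; it only follows the definitions on the slide, using Python's `unicodedata` to test whether every character of a FORM falls in a Unicode punctuation category (categories starting with “P”). The token tuples and the mis-predicted values are made up for illustration.

```python
import unicodedata
from typing import Dict, List, Tuple

Token = Tuple[str, int, str]  # (FORM, HEAD, DEPREL)


def is_scoring(form: str) -> bool:
    """A token is non-scoring if all characters of FORM are Unicode punctuation."""
    return not all(unicodedata.category(ch).startswith("P") for ch in form)


def attachment_scores(gold: List[Token], predicted: List[Token]) -> Dict[str, float]:
    """LAS, UAS and label accuracy in percent, over scoring tokens only."""
    pairs = [(g, p) for g, p in zip(gold, predicted) if is_scoring(g[0])]
    if not pairs:
        return {"LAS": 0.0, "UAS": 0.0, "LA": 0.0}
    n = len(pairs)
    las = sum(g[1] == p[1] and g[2] == p[2] for g, p in pairs)
    uas = sum(g[1] == p[1] for g, p in pairs)
    la = sum(g[2] == p[2] for g, p in pairs)
    return {"LAS": 100.0 * las / n, "UAS": 100.0 * uas / n, "LA": 100.0 * la / n}


gold = [("This", 2, "subj"), ("is", 0, "ROOT"), ("a", 4, "det"),
        ("test", 2, "comp"), (".", 2, "punc")]
# Hypothetical system output: wrong label on "test", wrong head on "."
pred = [("This", 2, "subj"), ("is", 0, "ROOT"), ("a", 4, "det"),
        ("test", 2, "subj"), (".", 4, "punc")]
scores = attachment_scores(gold, pred)
```

The error on “.” does not affect any score because the token is non-scoring; the label error on “test” lowers LAS and label accuracy to 75% while UAS stays at 100%.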

25

Results (http://nextens.uvt.nl/~conll/results.html)

      Ar    Ch    Cz    Da    Du    Ge    Ja    Po    Sl    Sp    Sw    Tu    Tot   SD    Bu
McD   66.9  85.9  80.2  84.8  79.2  87.3  90.7  86.8  73.4  82.3  82.6  63.2  80.3  8.4   87.6
Niv   66.7  86.9  78.4  84.8  78.6  85.8  91.7  87.6  70.3  81.3  84.6  65.7  80.2  8.5   87.4
Rie   66.7  90.0  67.4  83.6  78.6  86.2  90.5  84.4  71.2  77.4  80.7  58.6  77.9  10.1  0.0
…
Av    59.9  78.3  67.2  78.3  70.7  78.6  85.9  80.6  65.2  73.5  76.4  56.0  ―     ―     80.0
SD    6.5   8.8   8.9   5.5   6.7   7.5   7.1   5.8   6.8   8.4   6.5   7.7   ―     ―     6.3

• McDonald et al., 2006

• Nivre et al., 2006

• Riedel et al., 2006

• …

• about 30 registrations

• 17 submissions (parsed test data and paper)

26

Results

• Little difference in ranking (mostly just +/−1) when using UAS or label accuracy metric instead of LAS

• Very little difference in scores as well as rankings when scoring all tokens (i.e. including punctuation)

• “Good” parsers good on all languages

• Using all available information seems to be a good idea: FORM+LEMMA, POS+CPOS, FEATS

• Two best overall scores achieved by two very different approaches

– McDonald: Compute score of each possible pair of tokens, then clever search to find best tree; first unlabelled parsing, then labelling

– Nivre: Build dependency structure stepwise from left to right using deterministic classifier; parsing and labelling at the same time

27

The CoNLL-2007 shared task

• Organizers

– Joakim Nivre

– Johan Hall

– Sandra Kübler

– Ryan McDonald

– Jens Nilsson

– Sebastian Riedel

– Deniz Yuret

• Topic: dependency parsing

– Multilingual Track similar to last year

– Domain Adaptation Track

• http://nextens.uvt.nl/depparse-wiki/SharedTaskWebsite

• Register by emailing [email protected], by 20 Jan. 2007

28

Conclusions

• General trend seems to go away from chunk parsers

– computers got faster, more memory: efficiency less of a problem

– new algorithms, e.g. linear-time full parsers: Nivre & Scholz, 2004

– single parser more “elegant” than chunker+attacher

– corpora with full parse trees (treebanks) available for more languages

• But still useful for

– parsing huge amounts of text → Internet

– real-time applications, especially on embedded systems (low memory/computing power)

– treebank construction

• see e.g. Brants & Plaehn, 2000, for the German NEGRA corpus

• Questions?

29

References A-D

• Steven Abney (1991), “Parsing by Chunks”, Principle-Based Parsing, pp. 257-278, Kluwer

• Steven Abney (1996), “Chunk Stylebook”, Manuscript

• Sabine Buchholz (2002), “Memory-Based Grammatical Relation Finding”, PhD thesis, Tilburg University

• Sabine Buchholz, Jorn Veenstra and Walter Daelemans (1999), “Cascaded Grammatical Relation Assignment”, Proc. of EMNLP/VLC, pp. 239-246

• Sabine Buchholz and Erwin Marsi (2006), “CoNLL-X Shared Task on Multilingual Dependency Parsing”, Proc. of CoNLL-X, ACL

• Thorsten Brants and Oliver Plaehn (2000), “Interactive Corpus Annotation”, Proc. of LREC-2000

• Kenneth W. Church (1988), “A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text”, Proc. of 2nd Applied NLP, ACL

• Michael Collins (1996), “A New Statistical Parser Based on Bigram Lexical Dependencies”, Proc. of the 34th Annual Meeting of the ACL

• Walter Daelemans, Jakub Zavrel, Ko van der Sloot and Antal van den Bosch (2004), “TiMBL: Tilburg Memory Based Learner, version 5.1, Reference Guide”, ILK Technical Report Series 04-02

30

References E-N

• Jason Eisner (1996), “Three new probabilistic models for dependency parsing: An exploration”, Proc. of 16th COLING

• James P. Gee and François Grosjean (1983), “Performance Structures: A Psycholinguistic and Linguistic Appraisal”, Cognitive Psychology 15, pp. 411-458

• Taku Kudoh and Yuji Matsumoto (2000), “Use of Support Vector Learning for Chunk Identification”, Proc. of CoNLL-2000 and LLL-2000

• David M. Magerman (1995), “Statistical Decision-Tree Models for Parsing”, Proc. of the 33rd Annual Meeting of the ACL

• Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz and Britta Schasberger (1994), “The Penn Treebank: Annotating Predicate Argument Structure”, Proc. of the ARPA Human Language Technology Workshop, pp. 110-115

• Ryan McDonald, Kevin Lerman and Fernando Pereira (2006), “Multilingual Dependency Analysis with a Two-Stage Discriminative Parser”, Proc. of CoNLL-X, ACL

• Joakim Nivre, Johan Hall, Jens Nilsson, Gülşen Eryiğit, Svetoslav Marinov (2006), “Labeled Pseudo-Projective Dependency Parsing with Support Vector Machines”, Proc. of CoNLL-X, ACL

• Joakim Nivre and Mario Scholz (2004), “Deterministic Dependency Parsing of English Text”, Proc. of 20th COLING

31

References R-Z

• Lance Ramshaw and Mitchell Marcus (1995), “Text Chunking Using Transformation-Based Learning”, Proc. of 3rd VLC

• Adwait Ratnaparkhi (1997), “A Linear Observed Time Statistical Parser Based on Maximum Entropy Models”, Proc. of 2nd EMNLP

• Sebastian Riedel, Ruket Çakıcı and Ivan Meza-Ruiz (2006), “Multi-lingual Dependency Parsing with Incremental Integer Linear Programming”, Proc. of CoNLL-X, ACL

• Erik Tjong Kim Sang and Jorn Veenstra (1999), “Representing text chunks”, Proc. of EACL, pp. 173-179

• Erik Tjong Kim Sang (2000), “Text Chunking by System Combination”, Proc. of CoNLL-2000 and LLL-2000

• Erik Tjong Kim Sang and Sabine Buchholz (2000), “Introduction to the CoNLL-2000 shared task: Chunking”, Proc. of CoNLL-2000 and LLL-2000

• Antal van den Bosch and Sabine Buchholz (2002), “Shallow parsing on the basis of words only: A case study”, Proc. of the 40th Annual Meeting of the ACL, pp. 433-440

• Hans van Halteren (2000), “Chunking with WPDV Models”, Proc. of CoNLL-2000 and LLL-2000

32

References for CoNLL-X treebanks

• S. Afonso, E. Bick, R. Haber, and D. Santos (2002), “‘Floresta sintá(c)tica’: a treebank for Portuguese”, Proc. of the Third Intern. Conf. on Language

• N. B. Atalay, K. Oflazer, and B. Say (2003), “The annotation process in the Turkish treebank”, Proc. of the 4th Intern. Workshop on Linguistically Interpreted Corpora (LINC)

• A. Böhmová, J. Hajič, E. Hajičová, and B. Hladká (2003), “The PDT: a 3-level annotation scenario”, In: A. Abeillé (editor), 2003, Treebanks: Building and Using Parsed Corpora, volume 20 of Text, Speech and Language Technology, Kluwer Academic Publishers, chapter 7

• S. Brants, S. Dipper, S. Hansen, W. Lezius, and G. Smith (2002), “The TIGER treebank”, Proc. of the First Workshop on Treebanks and Linguistic Theories (TLT)

• K. Chen, C. Luo, M. Chang, F. Chen, C. Chen, C. Huang, and Z. Gao (2003), “Sinica treebank: Design criteria, representational issues and implementation”, In: A. Abeillé (editor), 2003, Treebanks: Building and Using Parsed Corpora, volume 20 of Text, Speech and Language Technology, Kluwer Academic Publishers, chapter 13

• M. Civit Torruella and Ma A. Martí Antonín (2002), “Design principles for a Spanish treebank”, Proc. of the First Workshop on Treebanks and Linguistic Theories (TLT).

33

References for CoNLL-X treebanks

• S. Džeroski, T. Erjavec, N. Ledinek, P. Pajas, Z. Žabokrtsky, and A. Žele (2006), “Towards a Slovene dependency treebank”, Proc. of the Fifth Intern. Conf. on Language Resources and Evaluation (LREC)

• J. Hajič, O. Smrž, P. Zemánek, J. Šnaidauf, and E. Beška (2004), “Prague Arabic dependency treebank: Development in data and tools”, Proc. of the NEMLAR Intern. Conf. on Arabic Language Resources and Tools, pp. 110–117

• Y. Kawata and J. Bartels (2000), “Stylebook for the Japanese treebank in VERBMOBIL”, Verbmobil-Report 240, Seminar für Sprachwissenschaft, Universität Tübingen

• M. T. Kromann (2003), “The Danish dependency treebank and the underlying linguistic theory”, Proc. of the Second Workshop on Treebanks and Linguistic Theories (TLT)

• J. Nilsson, J. Hall, and J. Nivre (2005), “MAMBA meets TIGER: Reconstructing a Swedish treebank from antiquity”, Proc. of the NODALIDA Special Session on Treebanks

• K. Oflazer, B. Say, D. Zeynep Hakkani-Tür, and G. Tür (2003), “Building a Turkish treebank”, In: A. Abeillé (editor), 2003, Treebanks: Building and Using Parsed Corpora, volume 20 of Text, Speech and Language Technology, Kluwer Academic Publishers, chapter 15

34

References for CoNLL-X treebanks

• K. Simov, P. Osenova, A. Simov, and M. Kouylekov (2005), “Design and implementation of the Bulgarian HPSG-based treebank”, Journal of Research on Language and Computation – Special Issue, pp. 495–522, Kluwer Academic Publishers

• K. Simov and P. Osenova (2003), “Practical annotation scheme for an HPSG treebank of Bulgarian”, Proc. of the 4th Intern. Workshop on Linguistically Interpreted Corpora (LINC), pp. 17–24

• L. van der Beek, G. Bouma, R. Malouf, and G. van Noord (2002), “The Alpino dependency treebank”, Computational Linguistics in the Netherlands (CLIN)


Additional slides

36

My background

• 1996: Bachelor/Master in Computational Linguistics at University of Saarbrücken, Germany

– German: about 100 million speakers (in Germany, Austria, …)

• far fewer speakers than English, Hindi, Bengali

– University of Saarbrücken: first treebank of German (NEGRA, 1998)

• 2002: PhD at Tilburg University, The Netherlands

– Dutch: about 25 million speakers (in the Netherlands, Belgium, …)

• far fewer speakers than Telugu

– Tilburg University: data-driven processing of Dutch (e.g. PoS tagging)

– However: used Penn Treebank (English) for PhD

• since 2003: working for Toshiba; on text-to-speech

• 2006: co-organized CoNLL-X shared task on Multilingual Dependency Parsing

37

A brief history of chunks

• Church, 1988

– PoS tagging, then inserting brackets around “simple non-recursive noun phrases”

• Abney, 1991, “Parsing by Chunks”

– “chunks” are connected subgraphs of the parse tree, defined by semantic heads (content words)

– related to φ-phrases (Gee & Grosjean, 1983), which are important for prosody:

– one (strong) stress per phrase

– pauses most likely between phrases

– global parse trees can also be constructed from chunks

• parser = chunker + attacher

– chunker: simple context-free grammar

– attacher: lexical information, selection restrictions

38


• Abney, 1996, “Chunk Stylebook”

– “non-recursive core of an intra-clausal constituent, extending from the beginning of the constituent to its head, but not including post-head dependents” → last element is the head

– contains explicit list of chunks: noun, verb, infinitive, present participle or gerund, past participle, adjective, and adverb chunks

– in contrast to Abney, 1991: no prepositional chunks

• Ramshaw & Marcus, 1995, “Text chunking using TBL”

– non-recursive “baseNP” chunks (NPs that contain no nested NPs)

– non-overlapping N-type and V-type chunks

• N-type: includes prepositions → Abney, 1991

• V-type: includes (predicative?) adjective phrases

– chunking as a tagging problem

• baseNPs: I, O, B (B=between)

• partitioning chunks: BN, N, BV, V, P (B=begin, P for punctuation)

39


• Collins, 1996

– NP chunking (baseNPs), then full dependency parsing (then transforming to constituent structure)

– NP chunker uses 5 tags: S(tart), C(ontinue), B(etween), E(nd), N(ull)

• Magerman, 1995

– bottom-up parser using 5 “extensions” similar to IOB tags: first child of a constituent, last child, neither first nor last child, single child, root of the sentence

• Ratnaparkhi, 1997

– chunking, then variant of shift/reduce parsing

– “flat” phrase chunk: a constituent whose children consist solely of POS tags

– chunk tags: Start X, Join X, Other = IOB

– BUILD decisions: Start X, Join X; CHECK decisions: yes (constituent is complete), no (not complete)

40

Work on the chunk data

• Buchholz, 2002, PhD, Section 6.1.4

– use pre-existing PoS tagger to tag and chunk at the same time

– e.g. “61/CD-I-NP years/NNS-I-NP old/JJ-I-ADJP”

– see later slide

• van den Bosch & Buchholz, 2002

– chunking and function tagging at the same time

– e.g. ((S (ADVP-TMP Once) (NP-SBJ he) (VP was (VP held (PP-TMP for (NP three months))))))

Once    I-ADVP_ADVP-TMP
he      I-NP_NP-SBJ
was     I-VP_NOFUNC
held    I-VP_VP/S
for     I-PP_PP-TMP
three   I-NP_NOFUNC
months  I-NP_NP

41


• van den Bosch & Buchholz, 2002 (continued)

– data: all Penn Treebank II data (WSJ, Brown, ATIS)

• 74,024 sentences; 1,637,268 tokens; 874 chunk-function-tags

– evaluation: Fβ=1 score, doing 10-fold cross validation

• chunk is correct if boundaries, type and function are correct

• split data into 10 parts, each time test on 1 part, train on all others

– input: words only, PoS only, both

• words vs. “attenuated” words (Eisner, 1996): replace words occurring 10 times or less in training set by simplified form, replace words in test set that do not occur in attenuated training set

– ends in digit → □-NUM, first letter capitalised → □-CAP

– 1–5 characters → □-SHORT, others → □-xx (last two letters)

• gold standard (=treebank) PoS tags vs. those predicted by tagger

– learning curve experiment: same test sets but only subset of training
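The attenuation step described above can be sketched as follows. The sketch assumes the four rules are tried in the order listed on the slide and that the first matching rule wins; that ordering, the function names and the toy data are assumptions. The “□” prefix is copied from the slide as printed.

```python
from collections import Counter
from typing import List, Tuple

PREFIX = "□"      # placeholder symbol as printed on the slide
THRESHOLD = 10    # words occurring this many times or less are attenuated


def simplify(word: str) -> str:
    """Map a rare (non-empty) word to its simplified form; rule order is assumed."""
    if word[-1].isdigit():
        return PREFIX + "-NUM"
    if word[0].isupper():
        return PREFIX + "-CAP"
    if len(word) <= 5:
        return PREFIX + "-SHORT"
    return PREFIX + "-" + word[-2:]      # others: keep the last two letters


def attenuate(train: List[str], test: List[str]) -> Tuple[List[str], List[str]]:
    """Attenuate rare training words, then test words absent from the attenuated training set."""
    counts = Counter(train)
    kept = {word for word, count in counts.items() if count > THRESHOLD}

    def apply(words: List[str]) -> List[str]:
        return [word if word in kept else simplify(word) for word in words]

    return apply(train), apply(test)


train_words = ["the"] * 11 + ["Cambridge", "1997", "odd", "parsing"]
test_words = ["the", "Toshiba", "chunking", "x2"]
att_train, att_test = attenuate(train_words, test_words)
```

Frequent words such as “the” stay as they are, while unseen test words like “Toshiba” and “chunking” fall back to the same simplified forms (`□-CAP`, `□-ng`) that rare training words received.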

42


• Realistic PoS help much less than gold-standard PoS (no surprise)

• The more training data, the less addition of PoS helps

• Attenuation always helps (but might depend on machine learner used...)

Input Fβ=1 chunks+functions chunks functions

Treebank PoS 73.9

Tagger PoS 72.3

Words 75.4 78.2

Words + treebank PoS 76.8

Words + tagger PoS 75.9

Att. words 77.3 91.5 80.1

Att. words + treebank PoS 79.0 94.2 80.9

Att. words + tagger PoS 77.6 92.8 79.7

43

Analysis — “easy” data sets

             Ar    Ch    Cz    Da    Du    Ge    Ja    Po    Sl    Sp    Sw    Tu    Bu
Top score    66.9  90.0  80.2  84.8  79.2  87.3  91.7  87.6  73.4  82.3  84.6  65.7  87.6
Av. score    59.9  78.3  67.2  78.3  70.7  78.6  85.9  80.6  65.2  73.5  76.4  56.0  80.0
Tokens (k)   54    337   1249  94    195   700   151   207   29    89    191   58    190
Tok./tree    37.2  5.9   17.2  18.2  14.6  17.8  8.9   22.8  18.7  27.0  17.3  11.5  14.8
DEP./CPOS    1.9   6.3   6.5   5.2   2     .88   .35   3.7   2.3   1.4   1.5   1.8   1.6
DEP./POS     1.4   .28   1.2   2.2   .09   .88   .09   2.6   .89   .55   1.5   .83   .34
%H. prec.    82.9  24.8  50.9  75.0  46.5  50.9  8.9   60.3  47.2  60.8  52.8  6.2   62.9
%H. foll.    11.6  58.2  42.4  18.6  44.6  42.7  72.5  34.6  46.9  35.1  40.7  80.4  29.2
%np trees    11.2  0.0   23.2  15.6  36.4  27.8  5.3   18.9  22.2  1.7   9.8   11.6  5.4
%new FOR.    17.3  9.3   5.2   18.1  20.7  6.5   0.96  11.6  22.0  14.7  18.0  41.4  14.5
%new LEM.    4.3   n/a   1.8   n/a   15.9  n/a   n/a   7.8   9.9   9.7   n/a   13.2  n/a

44

Analysis — difficult data sets

             Ar    Ch    Cz    Da    Du    Ge    Ja    Po    Sl    Sp    Sw    Tu    Bu
Top score    66.9  90.0  80.2  84.8  79.2  87.3  91.7  87.6  73.4  82.3  84.6  65.7  87.6
Av. score    59.9  78.3  67.2  78.3  70.7  78.6  85.9  80.6  65.2  73.5  76.4  56.0  80.0
Tokens (k)   54    337   1249  94    195   700   151   207   29    89    191   58    190
Tok./tree    37.2  5.9   17.2  18.2  14.6  17.8  8.9   22.8  18.7  27.0  17.3  11.5  14.8
DEP./CPOS    1.9   6.3   6.5   5.2   2     .88   .35   3.7   2.3   1.4   1.5   1.8   1.6
DEP./POS     1.4   .28   1.2   2.2   .09   .88   .09   2.6   .89   .55   1.5   .83   .34
%H. prec.    82.9  24.8  50.9  75.0  46.5  50.9  8.9   60.3  47.2  60.8  52.8  6.2   62.9
%H. foll.    11.6  58.2  42.4  18.6  44.6  42.7  72.5  34.6  46.9  35.1  40.7  80.4  29.2
%np trees    11.2  0.0   23.2  15.6  36.4  27.8  5.3   18.9  22.2  1.7   9.8   11.6  5.4
%new FOR.    17.3  9.3   5.2   18.1  20.7  6.5   0.96  11.6  22.0  14.7  18.0  41.4  14.5
%new LEM.    4.3   n/a   1.8   n/a   15.9  n/a   n/a   7.8   9.9   9.7   n/a   13.2  n/a
