Conversion of Penn Treebank Data to Text

Upload: juliet

Post on 07-Jan-2016

Page 1: Conversion of Penn Treebank Data to Text

Conversion of Penn Treebank Data to Text

Page 2: Conversion of Penn Treebank Data to Text

Penn TreeBank Project: “A Bank of Linguistic Trees”

(as of 11/1992)

• University of Pennsylvania, LINC Laboratory

• 4.5 million words of American English

• Annotation of naturally-occurring text for linguistic structure

Page 3: Conversion of Penn Treebank Data to Text

Tree Linguistic Components

• Tokenization
  – Treatment of punctuation, words, etc. as separate tokens
  – Example: Children’s → Children ’s
• Part-of-speech (POS) tagging
  – Text is first assigned POS tags automatically
  – Human annotators correct the first-pass POS tags
• Bracketing
  – Fidditch, a deterministic parser (Hindle 1983, 1989)
  – Two-stage parsing process made explicit with brackets
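The tokenization step can be illustrated with a rough sketch. The rules below are a simplified assumption for illustration, not the Treebank project's actual tokenizer; the function name `tokenize` and both regular expressions are hypothetical. Note that the sketch normalizes the curly apostrophe to a straight one.

```python
import re

# Hypothetical, simplified rules: split common English clitics and
# sentence punctuation into separate tokens. A real Treebank tokenizer
# covers many more cases (abbreviations, quotes, hyphens, ...).
_CLITIC = re.compile(r"(?<=\w)('s|'re|'ve|'ll|'d|'m|n't)\b", re.IGNORECASE)
_PUNCT = re.compile(r"([,;:.!?])(?=\s|$)")  # only before whitespace or end


def tokenize(text: str) -> list[str]:
    """Split text into Treebank-style tokens (rough approximation)."""
    text = text.replace("\u2019", "'")  # normalize curly apostrophes
    text = _CLITIC.sub(r" \1", text)    # Children's -> Children 's
    text = _PUNCT.sub(r" \1", text)     # detach trailing punctuation
    return text.split()


print(tokenize("Children\u2019s"))  # ['Children', "'s"]
```

Punctuation is detached only before whitespace or end of input, so tokens such as 67,000 stay in one piece.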

Page 4: Conversion of Penn Treebank Data to Text

Penn TreeBank: Brown Corpus (as of 11/1992)

• POS Tags (Tokens) 1,172,041
• Skeletal Parsing (Tokens) 1,172,041

Page 5: Conversion of Penn Treebank Data to Text

You know you’re in trouble when …

Robert MacIntyre, Programmer/Data Manager, Penn Treebank Project

ftp://ftp.cis.upenn.edu/pub/treebank/doc/faq.cd2

“0. You will always have a certain amount of error. Sometimes there is just no way to find the head of a phrase, because it is tagged or parsed completely incorrectly. (no big surprise, that)”

Page 6: Conversion of Penn Treebank Data to Text

Tree Conversion: Clean Case

( END_OF_TEXT_UNIT )
( END_OF_TEXT_UNIT )
( END_OF_TEXT_UNIT )
( (`` ``)
  (S
    (S
      (NP (PRP I) )
      (VP (VBP leave)
        (NP (DT this) (NN church) )
        (PP (IN with)
          (NP (DT a) (NN feeling)
            (SBAR (IN that)
              (S
                (NP (DT a) (JJ great) (NN weight) )
                (AUX (VBZ has) )
                (VP (VBN been)
                  (VP (VBN lifted)
                    (PP (IN off)
                      (NP (PRP$ my) (NN heart) ))))))))))
    (, ,)
    (S
      (NP (PRP I) )
      (AUX (VBP have) )
      (VP
        (VP (VBN left)
          (NP (PRP$ my) (NN grudge) )
          (PP (IN at)
            (NP (DT the) (NN altar) )))
        (CC and)
        (VP (VBN forgiven)
          (NP (PRP$ my) (NN neighbor) )))))
  ('' '') (. .) )
( END_OF_TEXT_UNIT )

cb08_42 ``I leave this church with a feeling that a great weight has been lifted off my heart, I have left my grudge at the altar and forgiven my neighbor''.
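For a clean tree like this one, the conversion amounts to collecting the leaf words in order and re-attaching punctuation. The following is a minimal sketch under stated assumptions: the tree is an abridged excerpt of cb08_42 shortened for illustration, and the names `leaves`, `detokenize` and `tree_to_text` are hypothetical, not the scripts used in the original project.

```python
import re

# Abridged excerpt of the cb08_42 tree, shortened for illustration.
TREE = """
( (`` ``)
  (S
    (NP (PRP I) )
    (AUX (VBP have) )
    (VP (VBN left)
      (NP (PRP$ my) (NN grudge) )
      (PP (IN at)
        (NP (DT the) (NN altar) ))))
  ('' '') (. .) )
( END_OF_TEXT_UNIT )
"""

# A leaf is "(TAG word)": an opening bracket directly followed by a tag,
# one space, a word, and a closing bracket.
LEAF = re.compile(r"\(([^\s()]+) ([^\s()]+)\s*\)")


def leaves(tree: str) -> list[tuple[str, str]]:
    """Return (tag, word) pairs in left-to-right order."""
    return LEAF.findall(tree)


def detokenize(words: list[str]) -> str:
    """Join tokens and re-attach quotes, punctuation and clitics."""
    text = " ".join(words)
    text = text.replace("`` ", "``")            # opening quote binds right
    text = text.replace(" ''", "''")            # closing quote binds left
    text = re.sub(r" ([,.;:!?])", r"\1", text)  # punctuation binds left
    text = re.sub(r" ('s|n't)\b", r"\1", text)  # clitics bind left
    return text.replace("$ ", "$")              # currency sign binds right


def tree_to_text(tree: str) -> str:
    return detokenize([word for _tag, word in leaves(tree)])


print(tree_to_text(TREE))  # ``I have left my grudge at the altar''.
```

The `( END_OF_TEXT_UNIT )` marker is skipped because it has a space after the opening bracket and therefore does not match the leaf pattern.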

Page 7: Conversion of Penn Treebank Data to Text

Tree Conversion: Problematic Case

( (S
    (NP (PRP He) )
    (VP (VBD reported)
      (SBAR (IN that)
        (S
          (NP
            (NP (DT the) (NN city) )
            (POS 's) (NNS contributions)
            (PP (IN for)
              (NP (NN animal) (NN care) )))
          (VP (VBD included)
            (NP
              (NP ($ $) (CD 67,000)
                (PP (TO to)
                  (NP
                    (NP (DT the) (NNS Women) )
                    (POS 's) (NN S.P.C.A.) )))
              (: ;) (: ;)
              (NP
                (NP ($ $) (CD 15,000) )
                (S
                  (NP (-NONE- T) )
                  (AUX (TO to) )
                  (VP (VB pay)
                    (NP
                      (NP (CD six) (NNS policemen) )
                      (VP (VBN assigned)
                        (PP (IN as)
                          (NP (NN dog) (NNS catchers) )))))))
              (CC and)
              (NP
                (NP ($ $) (CD 15,000) )
                (S
                  (NP (-NONE- T) )
                  (AUX (TO to) )
                  (VP (VB investigate)
                    (NP (NN dog) (NNS bites) ))))))))))
  (. .) )
( END_OF_TEXT_UNIT )

ca09_46 He reported that the city's contributions for animal care included $67,000 to the Women's S.P.C.A.;; $15,000 to pay six policemen assigned as dog catchers and $15,000 to investigate dog bites.

(NP (DT the) (NNS Women) )
(POS 's) (NN S.P.C.A.) )))
(: ;) (: ;)
(NP
  (NP ($ $) (CD 15,000) )
  (S
    (NP (-NONE- T) )
    (AUX (TO to) )
    (VP (VB pay)
      (NP
        (NP (CD six) (NNS policemen) )
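The two problems visible in this excerpt, the `-NONE-` leaf and the repeated `(: ;)`, can be filtered out before the words are joined. This is a sketch only: `clean_leaves` and `PUNCT_TAGS` are hypothetical names, and treating a repeated punctuation token as an error follows the slides' reading of the data, which may not hold for every corpus.

```python
# Tags whose immediate repetition is treated as a duplication error
# (an assumption for this sketch).
PUNCT_TAGS = {",", ".", ":"}

# (tag, word) leaves from the problematic region of ca09_46.
SAMPLE = [
    ("NNS", "Women"), ("POS", "'s"), ("NN", "S.P.C.A."),
    (":", ";"), (":", ";"),
    ("$", "$"), ("CD", "15,000"),
    ("-NONE-", "T"), ("TO", "to"), ("VB", "pay"),
]


def clean_leaves(pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Drop null elements and immediately repeated punctuation."""
    cleaned: list[tuple[str, str]] = []
    for tag, word in pairs:
        if tag == "-NONE-":
            continue  # null element: has no surface text
        if tag in PUNCT_TAGS and cleaned and cleaned[-1] == (tag, word):
            continue  # same punctuation token twice in a row
        cleaned.append((tag, word))
    return cleaned


print([word for _tag, word in clean_leaves(SAMPLE)])
# ['Women', "'s", 'S.P.C.A.', ';', '$', '15,000', 'to', 'pay']
```

Filtering on the tag rather than the word matters here: `T` and `0` can also occur as ordinary words, so only leaves tagged `-NONE-` are dropped.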

Page 8: Conversion of Penn Treebank Data to Text

Summary of Problems Encountered

• Typing errors
  – Punctuation duplication in data
• Special notation for delimiter characters
  – RRB, LRB, RSB, LSB, RCB, LCB
• Special null elements
  – ( -NONE- ) * 0 T NIL

** Conventions for final output need to consider these lessons
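The delimiter notation can be mapped back to the literal characters with a small lookup table. In this sketch the name `restore_bracket` is hypothetical, and accepting both the bare form listed above (LRB) and a hyphenated form (-LRB-) is an assumption about how the tokens appear in the data.

```python
# Treebank bracket tokens and the characters they stand for.
BRACKETS = {
    "LRB": "(", "RRB": ")",  # round brackets
    "LSB": "[", "RSB": "]",  # square brackets
    "LCB": "{", "RCB": "}",  # curly brackets
}


def restore_bracket(token: str) -> str:
    """Return the literal bracket for a bracket token, else the token."""
    return BRACKETS.get(token.strip("-"), token)


tokens = ["-LRB-", "see", "RSB", "-", "well-known"]
print([restore_bracket(t) for t in tokens])
# ['(', 'see', ']', '-', 'well-known']
```

The special notation exists because literal brackets would collide with the tree syntax, so the mapping should be applied only after the leaves have been extracted.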

Page 9: Conversion of Penn Treebank Data to Text

Future Recommendations

• Put POS tree data into a proper database
  – Increases confidence in correctness of data
  – Minimizes error
• Spend more effort upfront *once* to clean data
• SQL queries are more reusable than (write-only) Perl scripts
  – Due to random graduate student ability
• If DB option is not available
  – Avoid duplication of data in final output
  – Avoid text delimiters that exist as data tokens (“ ‘ , \s )
  – Use thoughtful labeling conventions
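As an illustration of the database recommendation, the sketch below loads a few tagged tokens into an in-memory SQLite table and uses one reusable query to find duplicated punctuation. The table name `token`, its columns, and the position numbers are hypothetical; the slides do not specify a schema or a database engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE token (
        sentence_id TEXT    NOT NULL,
        position    INTEGER NOT NULL,
        tag         TEXT    NOT NULL,
        word        TEXT    NOT NULL,
        PRIMARY KEY (sentence_id, position)
    )
""")

# A few leaves from the two example sentences; positions are illustrative.
rows = [
    ("ca09_46", 0, "NN", "S.P.C.A."),
    ("ca09_46", 1, ":", ";"),
    ("ca09_46", 2, ":", ";"),
    ("ca09_46", 3, "$", "$"),
    ("ca09_46", 4, "CD", "15,000"),
    ("cb08_42", 0, "NN", "heart"),
    ("cb08_42", 1, ",", ","),
    ("cb08_42", 2, "PRP", "I"),
]
conn.executemany("INSERT INTO token VALUES (?, ?, ?, ?)", rows)

# Self-join: a punctuation token immediately followed by an identical one.
DUPLICATED_PUNCT = """
    SELECT a.sentence_id, a.position, a.word
    FROM token AS a
    JOIN token AS b
      ON b.sentence_id = a.sentence_id
     AND b.position = a.position + 1
     AND b.tag = a.tag
     AND b.word = a.word
    WHERE a.tag IN (',', '.', ':')
    ORDER BY a.sentence_id, a.position
"""
duplicates = conn.execute(DUPLICATED_PUNCT).fetchall()
print(duplicates)  # [('ca09_46', 1, ';')]
```

Because tag and word live in separate columns, a token such as `;` or `,` never has to double as a text delimiter, which also addresses the delimiter concern in the last bullet.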