Conversion of Penn Treebank Data to Text
Penn TreeBank Project: “A Bank of Linguistic Trees”
(as of 11/1992)
• University of Pennsylvania, LINC Laboratory
• 4.5 million words of American English
• Annotation of naturally-occurring text for linguistic structure
• Tokenization
– Treatment of punctuation, words, etc. as separate tokens
– Children’s → Children ’s
• Part-of-speech (POS) tagging
– Text first assigned POS tags automatically
– Human annotators correct first-pass POS tags
• Bracketing
– Fidditch, a deterministic parser (Hindle 1983, 1989)
– Two-stage parsing process made explicit with brackets
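The tokenization step can be illustrated with a small sketch. It is not the Treebank project's tokenizer; the regular expression and the function name `tokenize` are assumptions chosen only to reproduce the "Children ’s" split and the treatment of punctuation as separate tokens.

```python
import re

# Hypothetical rule set: try the possessive clitic first, then a run of
# word characters, then any single character that is neither word nor space.
TOKEN_RE = re.compile(r"['’]s\b|\w+|[^\w\s]")

def tokenize(text):
    """Split text into word, clitic, and punctuation tokens."""
    return TOKEN_RE.findall(text)

print(tokenize("Children’s"))
print(tokenize("the city's contributions."))
```

A real tokenizer needs many more rules (abbreviations such as S.P.C.A., currency amounts, quotation marks); this sketch covers only the cases named on the slide.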
Tree Linguistic Components
Penn TreeBank: Brown Corpus (as of 11/1992)
• POS Tags (Tokens) 1,172,041
• Skeletal Parsing (Tokens) 1,172,041
You know you’re in trouble when …
Robert MacIntyre, Programmer/Data Manager, Penn Treebank Project
ftp://ftp.cis.upenn.edu/pub/treebank/doc/faq.cd2
“0. You will always have a certain amount of error. Sometimes there is just no way to find the head of a phrase, because it is tagged or parsed completely incorrectly. (no big surprise, that)”
( END_OF_TEXT_UNIT )
( END_OF_TEXT_UNIT )
( END_OF_TEXT_UNIT )
( (`` ``)
  (S
    (S
      (NP (PRP I) )
      (VP (VBP leave)
        (NP (DT this) (NN church) )
        (PP (IN with)
          (NP (DT a) (NN feeling)
            (SBAR (IN that)
              (S
                (NP (DT a) (JJ great) (NN weight) )
                (AUX (VBZ has) )
                (VP (VBN been)
                  (VP (VBN lifted)
                    (PP (IN off)
                      (NP (PRP$ my) (NN heart) ))))))))))
    (, ,)
    (S
      (NP (PRP I) )
      (AUX (VBP have) )
      (VP
        (VP (VBN left)
          (NP (PRP$ my) (NN grudge) )
          (PP (IN at)
            (NP (DT the) (NN altar) )))
        (CC and)
        (VP (VBN forgiven)
          (NP (PRP$ my) (NN neighbor) )))))
  ('' '') (. .) )
( END_OF_TEXT_UNIT )
cb08_42 ``I leave this church with a feeling that a great weight has been lifted off my heart, I have left my grudge at the altar and forgiven my neighbor''.
Tree Conversion: Clean Case
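For a clean tree, the conversion to text amounts to reading the (tag word) leaves left to right and re-attaching punctuation. The following is a minimal sketch under that assumption, not the script used for the slides; `read_leaves` and `leaves_to_text` are hypothetical names, and the spacing rules cover only what the clean-case sentence needs.

```python
import re

# A leaf (preterminal) is a parenthesis pair holding exactly two atoms,
# e.g. (PRP I) or (. .). Phrase labels such as (S or (NP never match,
# because they are followed by another opening parenthesis.
PRETERMINAL_RE = re.compile(r"\(\s*([^\s()]+)\s+([^\s()]+)\s*\)")

def read_leaves(tree_text):
    """Return the (tag, word) pairs of a bracketed tree in sentence order."""
    return PRETERMINAL_RE.findall(tree_text)

def leaves_to_text(leaves):
    """Join leaf words and undo the token spacing for simple cases."""
    text = " ".join(word for _tag, word in leaves)
    # No space before , . ; : or before the possessive 's.
    text = re.sub(r" (?=[,.;:]|'s\b)", "", text)
    # Opening and closing quote tokens hug the quoted text.
    return text.replace("`` ", "``").replace(" ''", "''")

tree = ("( (`` ``) (S (NP (PRP I) ) (VP (VBP leave) "
        "(NP (DT this) (NN church) ))) ('' '') (. .) )")
print(leaves_to_text(read_leaves(tree)))
```

Lines such as ( END_OF_TEXT_UNIT ) hold a single atom and are therefore skipped by the leaf pattern.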
( (S
    (NP (PRP He) )
    (VP (VBD reported)
      (SBAR (IN that)
        (S
          (NP
            (NP (DT the) (NN city) )
            (POS 's) (NNS contributions)
            (PP (IN for)
              (NP (NN animal) (NN care) )))
          (VP (VBD included)
            (NP
              (NP ($ $) (CD 67,000)
                (PP (TO to)
                  (NP
                    (NP (DT the) (NNS Women) )
                    (POS 's) (NN S.P.C.A.) )))
              (: ;) (: ;)
              (NP
                (NP ($ $) (CD 15,000) )
                (S
                  (NP (-NONE- T) )
                  (AUX (TO to) )
                  (VP (VB pay)
                    (NP
                      (NP (CD six) (NNS policemen) )
                      (VP (VBN assigned)
                        (PP (IN as)
                          (NP (NN dog) (NNS catchers) )))))))
              (CC and)
              (NP
                (NP ($ $) (CD 15,000) )
                (S
                  (NP (-NONE- T) )
                  (AUX (TO to) )
                  (VP (VB investigate)
                    (NP (NN dog) (NNS bites) ))))))))))
  (. .) )
( END_OF_TEXT_UNIT )
ca09_46 He reported that the city's contributions for animal care included $67,000 to the Women's S.P.C.A.;; $15,000 to pay six policemen assigned as dog catchers and $15,000 to investigate dog bites.
      (NP (DT the) (NNS Women) )
      (POS 's) (NN S.P.C.A.) )))
(: ;) (: ;)
(NP
  (NP ($ $) (CD 15,000) )
  (S
    (NP (-NONE- T) )
    (AUX (TO to) )
    (VP (VB pay)
      (NP
        (NP (CD six) (NNS policemen) )
Tree Conversion: Problematic Case
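The doubled (: ;) (: ;) in the problematic case can be found mechanically before conversion. The sketch below flags any punctuation leaf that exactly repeats the leaf before it; the function name `find_repeated_punctuation` is an assumption, and the check only reports candidates for review, since a repeat is not guaranteed to be an error.

```python
import re

# Same leaf pattern as a plain (tag word) pair, e.g. (: ;) or (CD 15,000).
PRETERMINAL_RE = re.compile(r"\(\s*([^\s()]+)\s+([^\s()]+)\s*\)")

def find_repeated_punctuation(tree_text):
    """Return (leaf_index, word) for punctuation leaves that repeat their predecessor."""
    leaves = PRETERMINAL_RE.findall(tree_text)
    hits = []
    for i in range(1, len(leaves)):
        _tag, word = leaves[i]
        is_punct = not any(ch.isalnum() for ch in word)
        if is_punct and leaves[i] == leaves[i - 1]:
            hits.append((i, word))
    return hits

fragment = "(POS 's) (NN S.P.C.A.) ))) (: ;) (: ;) (NP (NP ($ $) (CD 15,000) )"
print(find_repeated_punctuation(fragment))
```

Because the pattern matches leaves only, the check also works on unbalanced excerpts like the one above.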
Summary of Problems Encountered
• Typing Errors
– Punctuation duplication in data
• Special notation for delimiter characters
– RRB, LRB, RSB, LSB, RCB, LCB
• Special Null Elements
– ( -NONE- ) * 0 T NIL
** Conventions for final output need to consider these lessons
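Both remaining problem classes can be handled on the list of (tag, word) leaves. The sketch below assumes the delimiter codes appear in the data wrapped in hyphens (for example -LRB-), as in distributed Penn Treebank files, and also accepts the bare forms listed on the slide; the mapping table and the name `clean_leaves` are illustrative assumptions.

```python
# Codes used in place of literal bracket characters, which would otherwise
# collide with the parentheses of the tree notation itself.
BRACKETS = {
    "LRB": "(", "RRB": ")",
    "LSB": "[", "RSB": "]",
    "LCB": "{", "RCB": "}",
}

def clean_leaves(leaves):
    """Drop null elements and restore literal bracket characters."""
    words = []
    for tag, word in leaves:
        # Null elements are identified by their tag, not by their word:
        # "T", "0", "*" and "NIL" may also occur as ordinary tokens.
        if tag == "-NONE-":
            continue
        words.append(BRACKETS.get(word.strip("-"), word))
    return words

leaves = [("-LRB-", "-LRB-"), ("NN", "dog"), ("-NONE-", "T"),
          ("-RRB-", "-RRB-"), ("CD", "0")]
print(clean_leaves(leaves))
```

Filtering on the tag rather than on the word keeps a genuine token such as the numeral 0 in the output.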
Future Recommendations
• Put POS tree data into proper database
– Increases confidence in correctness of data
– Minimizes error
• Spend more effort upfront *once* to clean data
• SQL queries more reusable than (write-only) Perl scripts
• Due to random graduate student ability
• If DB option not available
– Avoid duplication of data in final output
– Avoid text delimiters that exist as data tokens (“ ‘ , \s )
– Use thoughtful labeling conventions
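One way to picture the database recommendation is a single token table keyed by sentence ID and position, loaded once and then queried with SQL. This is only a sketch using Python's built-in sqlite3 module; the table name `token`, its columns, and the function `load_sentence` are assumptions, not a schema proposed in the slides.

```python
import re
import sqlite3

PRETERMINAL_RE = re.compile(r"\(\s*([^\s()]+)\s+([^\s()]+)\s*\)")

# The primary key rejects a second load of the same sentence, which guards
# against duplicated data.
SCHEMA = """
CREATE TABLE IF NOT EXISTS token (
    sentence_id TEXT    NOT NULL,
    position    INTEGER NOT NULL,
    tag         TEXT    NOT NULL,
    word        TEXT    NOT NULL,
    PRIMARY KEY (sentence_id, position)
)
"""

def load_sentence(conn, sentence_id, tree_text):
    """Store every (tag, word) leaf of one tree; return the number of rows."""
    conn.execute(SCHEMA)
    leaves = PRETERMINAL_RE.findall(tree_text)
    rows = [(sentence_id, pos, tag, word) for pos, (tag, word) in enumerate(leaves)]
    conn.executemany("INSERT INTO token VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
count = load_sentence(
    conn, "cb08_42",
    "( (S (NP (PRP I) ) (VP (VBP leave) (NP (DT this) (NN church) ))) (. .) )",
)
# A reusable query: all noun tokens in sentence order.
nouns = conn.execute(
    "SELECT word FROM token WHERE tag LIKE 'NN%' ORDER BY position"
).fetchall()
print(count, nouns)
```

Storing tag and word in separate columns also avoids the delimiter problem, since no character in the data has to double as a field separator.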