Title: Conversion of Penn Treebank Data to Text
1. Conversion of Penn Treebank Data to Text
2. Penn TreeBank Project: A Bank of Linguistic Trees (as of 11/1992)
- University of Pennsylvania, LINC Laboratory
- 4.5 million words of American English
- Annotation of naturally-occurring text for
linguistic structure
3. Tree Linguistic Components
- Tokenization
- Treatment of punctuation, words, etc. as separate tokens
- Children's -> Children 's
- Part-of-speech (POS) tagging
- Text first assigned POS tags automatically
- Human annotators correct first-pass POS tags
- Bracketing
- Fidditch, a deterministic parser (Hindle 1983, 1989)
- Two-stage parsing process made explicit with brackets
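The tokenization step can be illustrated with a minimal sketch. It is not the Treebank project's tokenizer: the clitic list, the punctuation set, and the name `tokenize` are assumptions made for illustration, and the punctuation rule is naive (it would also break up abbreviations such as S.P.C.A.).

```python
import re

# Assumed clitic endings to split off, e.g. "Children's" -> "Children" + "'s".
_CLITIC = re.compile(r"(?i)(n't|'s|'re|'ve|'ll|'d|'m)$")
# Assumed punctuation marks to treat as separate tokens.
_PUNCT = re.compile(r"([.,;:!?()\"])")


def tokenize(text):
    """Split text into tokens, separating punctuation and clitics."""
    tokens = []
    for chunk in text.split():
        # Put spaces around punctuation so it becomes its own token.
        for piece in _PUNCT.sub(r" \1 ", chunk).split():
            match = _CLITIC.search(piece)
            if match and match.start() > 0:
                tokens.extend([piece[:match.start()], match.group(1)])
            else:
                tokens.append(piece)
    return tokens


print(tokenize("Children's toys broke."))
# ['Children', "'s", 'toys', 'broke', '.']
```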
4. Penn TreeBank Brown Corpus (as of 11/1992)
- POS Tags (Tokens): 1,172,041
- Skeletal Parsing (Tokens): 1,172,041
5. You know you're in trouble when
0. You will always have a certain amount of
error. Sometimes there is just no way to find the
head of a phrase, because it is tagged or parsed
completely incorrectly. (no big surprise, that)
- Robert MacIntyre, Programmer/Data Manager, Penn Treebank Project (robertm_at_unagi.cis.upenn.edu)
- ftp://ftp.cis.upenn.edu/pub/treebank/doc/faq.cd2
6. Tree Conversion: Clean Case
( END_OF_TEXT_UNIT )
( END_OF_TEXT_UNIT )
( END_OF_TEXT_UNIT )
( ( )
  (S
    (S
      (NP (PRP I) )
      (VP (VBP leave)
        (NP (DT this) (NN church) )
        (PP (IN with)
          (NP (DT a) (NN feeling)
            (SBAR (IN that)
              (S
                (NP (DT a) (JJ great) (NN weight) )
                (AUX (VBZ has) )
                (VP (VBN been)
                  (VP (VBN lifted)
                    (PP (IN off)
                      (NP (PRP my) (NN heart) ))))))))))
cb08_42 I leave this church with a feeling that a great weight has been lifted off my heart, I have left my grudge at the altar and forgiven my neighbor''.
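The clean case amounts to reading the words off the leaves of the tree, left to right. A minimal sketch of that idea follows. It is not the conversion script used for these slides: the regular expression and the names `leaves` and `to_text` are assumptions, and the sketch treats every innermost "(TAG word)" pair as a leaf without checking that the brackets balance.

```python
import re

# A leaf is "(TAG word)" with no nested parentheses inside it.
LEAF = re.compile(r"\(\s*([^\s()]+)\s+([^\s()]+)\s*\)")


def leaves(tree_text):
    """Return (tag, word) pairs in left-to-right order."""
    return LEAF.findall(tree_text)


def to_text(tree_text):
    """Join the leaf words with single spaces."""
    return " ".join(word for _tag, word in leaves(tree_text))


sample = """
(S
  (NP (PRP I) )
  (VP (VBP leave)
    (NP (DT this) (NN church) )))
"""
print(to_text(sample))  # I leave this church
```

Because only innermost pairs are matched, markers such as ( END_OF_TEXT_UNIT ) and phrase labels such as (NP contribute no words.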
7. Tree Conversion: Problematic Case
( (S
    (NP (PRP He) )
    (VP (VBD reported)
      (SBAR (IN that)
        (S
          (NP
            (NP (DT the) (NN city) )
            (POS 's) (NNS contributions)
            (PP (IN for)
              (NP (NN animal) (NN care) )))
          (VP (VBD included)
            (NP
              (NP ( ) (CD 67,000)
                (PP (TO to)
                  (NP
                    (NP (DT the) (NNS Women) )
                    (POS 's) (NN S.P.C.A.) )))
              ( ) ( )
              (NP
                (NP ( ) (CD 15,000) )
                (S
                  (NP (-NONE- T) )
                  (AUX (TO to) )
                  (VP (VB pay)
                    (NP
                      (NP (CD six) (NNS policemen) )
ca09_46 He reported that the city's contributions for animal care included 67,000 to the Women's S.P.C.A. 15,000 to pay six policemen assigned as dog catchers and 15,000 to investigate dog bites.
8. Summary of Problems Encountered
- Typing Errors
- Punctuation duplication in data
- Special notation for delimiter characters
- RRB, LRB, RSB, LSB, RCB, LCB
- Special Null Elements
- ( -NONE- ) 0 T NIL
Conventions for the final output need to take these lessons into account
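The problems listed above suggest a cleanup pass over the (tag, word) pairs before text is written out. The sketch below is one possible reading, not the method used in the original work. It assumes that null elements carry the tag -NONE-, that bracket tokens appear as LRB, RRB, etc. with or without surrounding hyphens, and that "punctuation duplication" means the same punctuation token appearing twice in a row. The name `clean` is hypothetical.

```python
# Assumed mapping from bracket notation back to the literal characters.
BRACKETS = {"LRB": "(", "RRB": ")", "LSB": "[", "RSB": "]", "LCB": "{", "RCB": "}"}
# Assumed set of punctuation tokens that should never appear twice in a row.
NO_REPEAT = {",", ".", ";", ":"}


def clean(pairs):
    """Turn (tag, word) pairs into a list of surface words."""
    out = []
    for tag, word in pairs:
        if tag == "-NONE-":
            continue  # null element: no surface text
        # Accept both "LRB" and "-LRB-" style tokens.
        word = BRACKETS.get(word.strip("-"), word)
        if out and word == out[-1] and word in NO_REPEAT:
            continue  # drop an immediately repeated punctuation token
        out.append(word)
    return out


pairs = [("DT", "the"), ("-NONE-", "T"), ("-LRB-", "-LRB-"), ("NN", "dog"),
         ("-RRB-", "-RRB-"), (",", ","), (",", ","), ("NN", "x")]
print(clean(pairs))  # ['the', '(', 'dog', ')', ',', 'x']
```

Dropping every repeated period would also shorten a genuine ellipsis that was tokenized as separate periods, so the NO_REPEAT set would need checking against the data.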
9. Future Recommendations
- Put POS tree data into a proper database
- Increases confidence in correctness of data
- Minimizes error
- Spend more effort upfront, once, to clean data
- SQL queries more reusable than (write-only) Perl scripts
- Due to random graduate student ability
- If DB option not available
- Avoid duplication of data in final output
- Avoid text delimiters that exist as data tokens ( , \s )
- Use thoughtful labeling conventions
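The database recommendation could look something like the following sketch, using SQLite through Python's standard `sqlite3` module. The table name `token`, its columns, and the helper `sentence_text` are assumptions for illustration, not a schema from the original project. Keeping the tag and the word in separate columns means no text delimiter has to be chosen, so none can collide with a data token.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE token (
        sent_id  TEXT    NOT NULL,  -- e.g. 'cb08_42'
        position INTEGER NOT NULL,  -- left-to-right index within the sentence
        tag      TEXT    NOT NULL,  -- POS tag, or '-NONE-' for null elements
        word     TEXT    NOT NULL,
        PRIMARY KEY (sent_id, position)
    )
""")

rows = [
    ("cb08_42", 0, "PRP", "I"),
    ("cb08_42", 1, "VBP", "leave"),
    ("cb08_42", 2, "DT", "this"),
    ("cb08_42", 3, "NN", "church"),
    ("cb08_42", 4, "-NONE-", "T"),  # made-up null element, for the demo only
]
conn.executemany("INSERT INTO token VALUES (?, ?, ?, ?)", rows)


def sentence_text(conn, sent_id):
    """Rebuild the surface text of one sentence, skipping null elements."""
    cur = conn.execute(
        "SELECT word FROM token "
        "WHERE sent_id = ? AND tag <> '-NONE-' "
        "ORDER BY position",
        (sent_id,),
    )
    return " ".join(word for (word,) in cur)


print(sentence_text(conn, "cb08_42"))  # I leave this church
```

The primary key on (sent_id, position) also rejects a token inserted twice at the same position, which is one way a database can catch duplicated data early.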