Conversion of Penn Treebank Data to Text

Provided by: sis79
Learn more at: http://www.cs.cmu.edu

1
Conversion of Penn Treebank Data to Text
2
Penn TreeBank Project: "A Bank of Linguistic Trees" (as of 11/1992)
  • University of Pennsylvania, LINC Laboratory
  • 4.5 million words of American English
  • Annotation of naturally-occurring text for
    linguistic structure

3
Tree Linguistic Components
  • Tokenization
  • Treatment of punctuation, words, etc. as separate
    tokens
  • Children's → Children 's
  • Part-of-speech (POS) tagging
  • Text first assigned POS tags automatically
  • Human annotators correct first-pass POS tags
  • Bracketing
  • Fidditch, a deterministic parser (Hindle 1983, 1989)
  • Two-stage parsing process made explicit with
    brackets
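
The tokenization step above (punctuation and the possessive clitic treated as separate tokens) can be sketched roughly as follows. This is a minimal illustration in Python, not the Treebank project's own tokenizer; the function name `tokenize` and the exact set of punctuation handled are assumptions made for the example.

```python
import re

def tokenize(text: str) -> list[str]:
    """Split text into tokens (hypothetical helper, simplified rules)."""
    # Separate the possessive clitic: "Children's" -> "Children 's"
    text = re.sub(r"(\w)'s\b", r"\1 's", text)
    # Give clause-level punctuation its own token
    text = re.sub(r"([,;:!?])", r" \1 ", text)
    # Split off a sentence-final period only, so "S.P.C.A." stays intact mid-sentence
    text = re.sub(r"\.\s*$", " .", text)
    return text.split()

print(tokenize("Children's books, all of them, arrived."))
```

A real tokenizer needs many more rules (abbreviations at the end of a sentence, quotation marks, contractions such as "n't"); the sketch only shows the kind of splitting the slide describes.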

4
Penn TreeBank: Brown Corpus (as of 11/1992)
  • POS Tags (Tokens): 1,172,041
  • Skeletal Parsing (Tokens): 1,172,041

5
You know you're in trouble when...
0. You will always have a certain amount of
error. Sometimes there is just no way to find the
head of a phrase, because it is tagged or parsed
completely incorrectly. (no big surprise, that)
  • Robert MacIntyre Programmer/Data Manager Penn
    Treebank Project robertm_at_unagi.cis.upenn.edu
  • ftp://ftp.cis.upenn.edu/pub/treebank/doc/faq.cd2

6
( END_OF_TEXT_UNIT )
( END_OF_TEXT_UNIT )
( END_OF_TEXT_UNIT )
( ( )
  (S
    (S
      (NP (PRP I) )
      (VP (VBP leave)
        (NP (DT this) (NN church) )
        (PP (IN with)
          (NP (DT a) (NN feeling)
            (SBAR (IN that)
              (S
                (NP (DT a) (JJ great) (NN weight) )
                (AUX (VBZ has) )
                (VP (VBN been)
                  (VP (VBN lifted)
                    (PP (IN off)
                      (NP (PRP my) (NN heart) ))))))))))

Tree Conversion: Clean Case
cb08_42 I leave this church with a feeling
that a great weight has been lifted off my heart,
I have left my grudge at the altar and forgiven
my neighbor''.
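
In the clean case, converting a bracketed tree to text amounts to reading off the leaves in order. The sketch below is one possible way to do that in Python, assuming every leaf has the form `(TAG word)`; the names `LEAF_RE`, `tree_to_text`, and `SAMPLE` are illustrative and do not come from the presentation, whose own scripts were written in Perl.

```python
import re

# One leaf looks like "(TAG word)", e.g. "(PRP I)". Phrase labels such as
# "(NP" are followed by another "(", so they never match this pattern.
LEAF_RE = re.compile(r"\(\s*([^\s()]+)\s+([^\s()]+)\s*\)")

def tree_to_text(tree: str) -> str:
    """Return the words of a bracketed tree in order, skipping null elements."""
    leaves = LEAF_RE.findall(tree)
    return " ".join(word for tag, word in leaves if tag != "-NONE-")

# A shortened fragment of the tree shown on this slide.
SAMPLE = """
(S (NP (PRP I) )
   (VP (VBP leave)
       (NP (DT this) (NN church) )
       (PP (IN with)
           (NP (DT a) (NN feeling) ))))
"""

print(tree_to_text(SAMPLE))  # I leave this church with a feeling
```

Because the pattern only looks at leaves, it also works on truncated or unbalanced trees, which is convenient for noisy data but means it cannot detect bracketing errors.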
7
( (S
    (NP (PRP He) )
    (VP (VBD reported)
      (SBAR (IN that)
        (S
          (NP
            (NP (DT the) (NN city) )
            (POS 's) (NNS contributions)
            (PP (IN for)
              (NP (NN animal) (NN care) )))
          (VP (VBD included)
            (NP
              (NP ( ) (CD 67,000)
                (PP (TO to)
                  (NP
                    (NP (DT the) (NNS Women) )
                    (POS 's) (NN S.P.C.A.) )))
              ( ) ( )
              (NP

Tree Conversion: Problematic Case
(NP (DT the) (NNS Women) )
(POS 's) (NN S.P.C.A.) )))
( ) ( )
(NP
  (NP ( ) (CD 15,000) )
  (S
    (NP (-NONE- T) )
    (AUX (TO to) )
    (VP (VB pay)
      (NP
        (NP (CD six) (NNS policemen) )
ca09_46 He reported that the city's
contributions for animal care included 67,000 to
the Women's S.P.C.A. 15,000 to pay six
policemen assigned as dog catchers and 15,000 to
investigate dog bites.
8
Summary of Problems Encountered
  • Typing Errors
  • Punctuation duplication in data
  • Special notation for delimiter characters
  • RRB, LRB, RSB, LSB, RCB, LCB
  • Special Null Elements
  • ( -NONE- ) 0 T NIL

Conventions for final output need to consider
these lessons
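
The delimiter notation and null elements listed above can be handled in a small post-processing pass. The following Python sketch is an assumption-laden illustration: the mapping from `LRB`, `RRB`, and so on to bracket characters follows the usual reading of those names (left/right round, square, and curly bracket), the code accepts them with or without surrounding hyphens, and the helper names are hypothetical.

```python
# Assumed meaning of the special delimiter tokens named on the slide.
BRACKET_TOKENS = {
    "LRB": "(", "RRB": ")",
    "LSB": "[", "RSB": "]",
    "LCB": "{", "RCB": "}",
}
NO_SPACE_BEFORE = {",", ".", ";", ":", "?", "!", ")", "]", "}"}

def restore_token(token: str) -> str:
    """Map a delimiter placeholder (e.g. "-LRB-" or "LRB") back to its character."""
    return BRACKET_TOKENS.get(token.strip("-"), token)

def clean_leaves(leaves: list[tuple[str, str]]) -> list[str]:
    """Drop null elements (tag -NONE-) and restore bracket characters."""
    return [restore_token(word) for tag, word in leaves if tag != "-NONE-"]

def detokenize(tokens: list[str]) -> str:
    """Join tokens, attaching clitics and closing punctuation to the previous word."""
    out: list[str] = []
    for tok in tokens:
        if out and (tok in NO_SPACE_BEFORE or tok.startswith("'")):
            out[-1] += tok
        else:
            out.append(tok)
    return " ".join(out)

leaves = [("DT", "the"), ("NNS", "Women"), ("POS", "'s"), ("NN", "S.P.C.A."),
          ("-NONE-", "T"), ("TO", "to"), ("VB", "pay")]
print(detokenize(clean_leaves(leaves)))  # the Women's S.P.C.A. to pay
```

Opening brackets and opening quotation marks are left with a trailing space here; a fuller version would attach them to the following word.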
9
Future Recommendations
  • Put POS tree data into proper database
  • Increases confidence in correctness of data
  • Minimizes error
  • Spend more effort upfront once to clean data
  • SQL queries more reusable than (write-only) perl
    scripts
  • Due to random graduate student ability
  • If DB option not available
  • Avoid duplication of data in final output
  • Avoid text delimiters that exist as data tokens
    ( , \s )
  • Use thoughtful labeling conventions
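
As a rough illustration of the database recommendation, the sketch below loads a few POS-tagged tokens into an in-memory SQLite table and rebuilds a sentence with a reusable SQL query. The schema (table `token` with columns `sentence_id`, `position`, `pos_tag`, `word`) is an assumption invented for this example, not a design from the presentation; only the sentence identifier `cb08_42` and the tagged words come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE token (
           sentence_id TEXT    NOT NULL,
           position    INTEGER NOT NULL,
           pos_tag     TEXT    NOT NULL,
           word        TEXT    NOT NULL,
           PRIMARY KEY (sentence_id, position)
       )"""
)

# One row per token, so no delimiter character is needed inside the data.
rows = [
    ("cb08_42", 0, "PRP", "I"),
    ("cb08_42", 1, "VBP", "leave"),
    ("cb08_42", 2, "DT", "this"),
    ("cb08_42", 3, "NN", "church"),
]
conn.executemany("INSERT INTO token VALUES (?, ?, ?, ?)", rows)

def sentence_text(conn: sqlite3.Connection, sentence_id: str) -> str:
    """Rebuild a sentence from its tokens, skipping null elements."""
    cur = conn.execute(
        "SELECT word FROM token "
        "WHERE sentence_id = ? AND pos_tag != '-NONE-' "
        "ORDER BY position",
        (sentence_id,),
    )
    return " ".join(word for (word,) in cur)

print(sentence_text(conn, "cb08_42"))  # I leave this church
```

Storing each token as its own row sidesteps the delimiter problem named in the list above, and the primary key rejects duplicated tokens at load time.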