1
Thoughts on Treebanks
  • Christopher Manning
  • Stanford University

2
Q1 What do you really care about when you're
building a parser?
  • Completeness of information
  • There's not much point in having a treebank if
    you end up having to do unsupervised learning
    anyway
  • You want to be adding human value
  • Classic bad example: noun compound structure in
    the Penn English Treebank (left flat, with no
    internal bracketing)
  • Consistency of information
  • If things are annotated inconsistently, you lose
    both in training (if it is widespread) and in
    evaluation
  • Bad example: "long ago" constructions, such as
    "as long ago as" and "not so long ago"
  • Mutual information
  • Categories should be as mutually informative as
    possible
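One possible reading of the mutual-information criterion can be made concrete: estimate, from counts, how informative one annotation layer is about another, e.g. a word's POS tag about its parent phrase label. A minimal sketch, where the toy tag/phrase pairs are invented purely for illustration:

```python
from collections import Counter
from math import log2

def mutual_information(pairs):
    """Estimate I(X; Y) in bits from a list of (x, y) observations."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

# Toy data: a word's POS tag paired with its parent phrase label
# (invented counts, purely for illustration).
pairs = [("NN", "NP"), ("NN", "NP"), ("VB", "VP"), ("VB", "VP"),
         ("JJ", "NP"), ("JJ", "ADJP")]
print(round(mutual_information(pairs), 3))
```

Category sets whose labels predict each other well score high on this measure; a refinement that adds labels without adding predictive value would leave it unchanged.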

3
Q3 What info (e.g., function tags, empty
categories, coindexation) is useful, what is not?
  • Information on function is definitely useful
  • Should move to always having typed dependencies.
  • Clearest example in the Penn English Treebank:
    temporal NPs
  • Empty categories don't necessarily give much
    value in the dumbed-down world of Penn English
    Treebank parsing work
  • Though it should be tried again/more
  • But definitely useful if you want to know this
    stuff!
  • Subcategorization/argument structure
    determination
  • Natural Language Understanding!!
  • Cf. Johnson, Levy and Manning, etc. work on long
    distance dependencies
  • I'm sceptical that there is a categorical
    argument/adjunct distinction to be made
  • Leave it to the real numbers
  • This means that subcategorization frames can only
    be statistical
  • Cf. Manning (2003)
  • I've got some more slides on this from another
    talk if you want

4
Q3 What info (e.g., function tags, empty
categories, coindexation) is useful, what is not?
  • Do you prefer a more refined tagset for parsing?
  • Yes. I mightn't use it, but I often do
  • The transform-detransform framework
  • RawInput → TransformedInput → Parser →
    TransformedOutput → DesiredOutput
  • I think everyone does this to some extent
  • Some, like Johnson, Klein and Manning, have
    exploited it very explicitly: NN-TMP, INT,
    NP-Poss, VP-VBG, NP-v, ...
  • Everyone else should think about it more
  • It's easy to throw away too-precise information,
    or to move information around deterministically
    (tag to phrase or vice versa), if it's
    represented completely and consistently!
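The transform-detransform framework can be sketched as a pair of deterministic tree maps: an enrichment applied before the parser sees the trees, and its inverse applied to the parser's output. A minimal sketch using the NP-Poss refinement mentioned above; the nested-list tree encoding and the POS-child trigger are assumptions for illustration:

```python
# Trees as nested lists: [label, child1, child2, ...]; leaves are strings.

def transform(tree):
    """Mark possessive NPs (an NP with a POS child) as NP-Poss."""
    if isinstance(tree, str):
        return tree
    label, children = tree[0], [transform(c) for c in tree[1:]]
    if label == "NP" and any(isinstance(c, list) and c[0] == "POS"
                             for c in children):
        label = "NP-Poss"
    return [label] + children

def detransform(tree):
    """Strip every -Poss refinement, restoring the original label set."""
    if isinstance(tree, str):
        return tree
    return [tree[0].replace("-Poss", "")] + [detransform(c) for c in tree[1:]]

t = ["NP", ["NP", ["NNP", "Stanford"], ["POS", "'s"]], ["NN", "parser"]]
enriched = transform(t)
print(enriched[1][0])  # → NP-Poss
```

Because the enrichment is a deterministic, invertible relabeling, `detransform(transform(t)) == t` holds, which is exactly what lets a parser train on the refined labels while still being scored against the original treebank.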

5
Q4 How does grammar writing interact with
treebanking?
  • In practice, they often haven't interacted much
  • I'm a great believer that they should
  • Having a grammar is a huge guide to how things
    should be parsed and to check parsing consistency
  • It also allows opportunities for analysis
    updating, etc.
  • Cf. the Redwoods Treebank, and subsequent efforts
  • The inability to automatically update treebanks
    is a growing problem
  • Current English treebanking isn't having much
    impact because of annotation differences with the
    original PTB
  • Feedback from users has only rarely been harvested

6
Q5 What methodological lessons can be drawn for
treebanking?
  • Good guidelines (loosely, a grammar!)
  • Good, trained people
  • Annotator buy-in
  • Ann Bies said all this, and I strongly agree!
  • I think there has been a real underexploitation
    of technology for treebank validation
  • Doing vertical searches/checks almost always
    turns up inconsistencies
  • Either these or a grammar should give vertical
    review
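The "vertical search" idea can be sketched as a grouping check: collect every analysis a given token sequence receives across the treebank and flag sequences that receive more than one. The record format below (a token tuple paired with a bracketing string) is an assumption for illustration; the "long ago" case echoes the inconsistency example from slide 2:

```python
from collections import defaultdict

def inconsistent_spans(records):
    """Map each token sequence to its analyses; keep the ambiguous ones.

    records: iterable of (token_tuple, analysis_string) pairs.
    """
    seen = defaultdict(set)
    for tokens, analysis in records:
        seen[tokens].add(analysis)
    return {tokens: sorted(analyses)
            for tokens, analyses in seen.items() if len(analyses) > 1}

# Invented records standing in for a real treebank extraction.
records = [
    (("long", "ago"), "(ADVP (RB long) (RB ago))"),
    (("long", "ago"), "(ADVP (JJ long) (RB ago))"),  # second analysis: flagged
    (("so", "far"), "(ADVP (RB so) (RB far))"),
]
print(inconsistent_spans(records))
```

Running a check like this over every repeated string in a treebank is cheap, and almost always surfaces annotation disagreements that neither annotator noticed.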

7
Q6 What are advantages and disadvantages of
pre-processing the data to be treebanked with an
automatic parser?
  • The economics are clear
  • You reduce annotation costs
  • The costs are clear
  • The parser places a large bias on the trees
    produced
  • Humans are lazy/reluctant to correct mistakes
  • A clear example: I think it is fair to say that
    many POS errors in the Penn English Treebank can
    be traced to the POS tagger
  • E.g., sentence-initial capitalized Separately,
    Frankly, Currently, Hopefully analyzed as NNP
  • Doesn't look like a human being's mistakes to me
  • The answer
  • More use of technology to validate and check
    humans
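That kind of validation can be sketched as a corpus filter over tagged sentences, flagging exactly the pattern described above: a sentence-initial capitalized adverb carrying the proper-noun tag NNP. The (word, tag) sentence format and the small suspect list are assumptions; a real validator would generalize, e.g. by comparing against each word's tags in non-initial positions:

```python
# A hypothetical suspect list, seeded from the error pattern above.
SUSPECTS = {"Separately", "Frankly", "Currently", "Hopefully"}

def suspicious_nnp(tagged_sentences):
    """Flag sentence-initial suspects tagged as proper nouns (NNP)."""
    hits = []
    for i, sent in enumerate(tagged_sentences):
        if sent and sent[0][1] == "NNP" and sent[0][0] in SUSPECTS:
            hits.append((i, sent[0][0]))
    return hits

# Invented tagged sentences for illustration.
corpus = [
    [("Separately", "NNP"), (",", ","), ("rates", "NNS"), ("fell", "VBD")],
    [("Frankly", "RB"), (",", ","), ("it", "PRP"), ("failed", "VBD")],
]
print(suspicious_nnp(corpus))  # → [(0, 'Separately')]
```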

8
Q7 What are the advantages of a phrase-structure
and/or a dependency treebank for parsing?
  • The current split in the literature between
    phrase-structure and dependency parsing is
    largely bogus (in my opinion)
  • The Collins/Bikel parser operates largely in the
    manner of a dependency parser
  • The Stanford parser contains a strict (untyped)
    dependency parser
  • Phrase structure parsers have the advantage of
    phrase structure labels
  • A dependency parser is just a phrase structure
    parser where you cannot refer to phrasal types or
    condition on phrasal span
  • This extra info is useful; it's silly not to use
    it
  • Labeling phrasal heads/dependencies is useful.
    Silly not to do it
  • Automatic head rules should have had their day
    by now!!
  • Scoring based on dependencies is much better than
    Parseval !!!
  • Labeling dependency types is useful
  • Especially, this will be the case in freer word
    order languages
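The claim that phrase structure plus heads yields dependencies can be sketched directly: head rules pick a head child per phrase, head words percolate up, and each non-head child's head word attaches to the phrase's head word. The tiny head table and nested-list tree encoding below are illustrative assumptions; real head-rule sets (e.g. Collins's) are far richer, and word identity stands in for token indices here:

```python
# Trees as nested lists: [label, ...children]; preterminals are [tag, word].
# A hypothetical miniature head table: for each phrase label, child
# labels to search for, in priority order.
HEAD_RULES = {"S": ["VP"], "VP": ["VBD", "VBZ", "VB"], "NP": ["NN", "NNS", "NNP"]}

def head_word(tree):
    """Percolate the lexical head word up from the head child."""
    if isinstance(tree[1], str):          # preterminal: [tag, word]
        return tree[1]
    for want in HEAD_RULES.get(tree[0], []):
        for child in tree[1:]:
            if child[0] == want:
                return head_word(child)
    return head_word(tree[1])             # fallback: leftmost child

def dependencies(tree, deps=None):
    """Collect (dependent_word, head_word) pairs from one tree."""
    if deps is None:
        deps = []
    if isinstance(tree[1], str):
        return deps
    h = head_word(tree)
    for child in tree[1:]:
        ch = head_word(child)
        if ch != h:                       # non-head child attaches to the head
            deps.append((ch, h))
        dependencies(child, deps)
    return deps

ptree = ["S",
         ["NP", ["NNP", "Kim"]],
         ["VP", ["VBD", "wrote"], ["NP", ["DT", "the"], ["NN", "grammar"]]]]
print(sorted(dependencies(ptree)))
```

Since the dependency analysis falls out of the phrase-structure tree this way, scoring parsers on these word-word attachments (rather than Parseval bracket spans) is available to phrase-structure parsers for free.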