Title: Thoughts on Treebanks
1. Thoughts on Treebanks
- Christopher Manning
- Stanford University
2. Q1 What do you really care about when you're building a parser?
- Completeness of information
  - There's not much point in having a treebank if really you're having to end up doing unsupervised learning: you want to be giving human value-add
  - Classic bad example: noun compound structure in the Penn English Treebank
- Consistency of information
  - If things are annotated inconsistently, you lose both in training (if it is widespread) and in evaluation
  - Bad example: "long ago" constructions ("as long ago as", "not so long ago")
- Mutual information
  - Categories should be as mutually informative as possible
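The "mutually informative" criterion can be made concrete with a toy computation (a sketch of my own, not from the talk): the mutual information between a word's POS tag and the phrasal category above it, estimated from hypothetical counts.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """I(X;Y) = sum over (x,y) of p(x,y) * log2(p(x,y) / (p(x) * p(y))).

    `pairs` is a list of observed (x, y) events, e.g. (POS tag, parent
    phrase label) pairs harvested from a treebank.
    """
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        pxy = c / n
        mi += pxy * math.log2(pxy / ((px[x] / n) * (py[y] / n)))
    return mi

# Hypothetical (tag, parent-label) observations.
pairs = [("NN", "NP"), ("NN", "NP"), ("DT", "NP"), ("VB", "VP"),
         ("VB", "VP"), ("NN", "VP"), ("DT", "NP"), ("VB", "VP")]
print(mutual_information(pairs))
```

Perfectly correlated category pairs give the full entropy in bits; independent ones give zero, i.e. one annotation layer tells you nothing about the other.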
3. Q3 What info (e.g., function tags, empty categories, coindexation) is useful, what is not?
- Information on function is definitely useful
  - Should move to always having typed dependencies
  - Clearest example in the Penn English Treebank: temporal NPs
- Empty categories don't necessarily give much value in the dumbed-down world of Penn English Treebank parsing work
  - Though it should be tried again/more
  - But definitely useful if you want to know this stuff!
    - Subcategorization/argument structure determination
    - Natural Language Understanding!!
  - Cf. Johnson, Levy and Manning, etc. work on long-distance dependencies
- I'm sceptical that there is a categorical argument/adjunct distinction to be made
  - Leave it to the real numbers
  - This means that subcategorization frames can only be statistical
  - Cf. Manning (2003)
  - I've got some more slides on this from another talk if you want
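A minimal sketch of what "statistical subcategorization frames" could look like, assuming hypothetical (verb, frame) observations extracted from a treebank: each verb gets a probability distribution over frames rather than a categorical argument/adjunct split.

```python
from collections import Counter, defaultdict

def frame_distribution(observations):
    """Estimate P(frame | verb) by relative frequency.

    `observations` is a list of (verb, frame-string) pairs; frames end up
    graded by probability, not divided into obligatory vs. optional.
    """
    counts = defaultdict(Counter)
    for verb, frame in observations:
        counts[verb][frame] += 1
    return {verb: {frame: c / sum(fc.values()) for frame, c in fc.items()}
            for verb, fc in counts.items()}

# Hypothetical observations; real ones would be read off treebank trees.
observations = [
    ("give", "NP NP"), ("give", "NP PP"), ("give", "NP PP"),
    ("sleep", ""), ("sleep", "PP"),
]
print(frame_distribution(observations))
```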
4. Q3 What info (e.g., function tags, empty categories, coindexation) is useful, what is not?
- Do you prefer a more refined tagset for parsing?
  - Yes. I mightn't use it, but I often do
- The transform-detransform framework
  - RawInput → TransformedInput → Parser → TransformedOutput → DesiredOutput
  - I think everyone does this to some extent
  - Some, like Johnson, Klein and Manning, have exploited it very explicitly: NN-TMP, INT, NP-Poss, VP-VBG, NP-v, ...
  - Everyone else should think about it more
  - It's easy to throw away too-precise information, or to move information around deterministically (tag to phrase or vice versa), if it's represented completely and consistently!
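The transform-detransform pipeline can be sketched as follows. This is an illustrative toy, not the actual Johnson or Klein-and-Manning code: trees are nested (label, children) tuples, the transform marks NPs dominating a (hypothetical) temporal word as NP-TMP before training/parsing, and the detransform deterministically strips the refinement so the output matches the original label set.

```python
# Toy temporal lexicon; the real annotation is richer than a word list.
TEMPORAL = {"yesterday", "today", "tomorrow"}

def words(tree):
    """Collect the leaf strings under a (label, children) tree."""
    _, children = tree
    out = []
    for c in children:
        out.extend([c] if isinstance(c, str) else words(c))
    return out

def transform(tree):
    """Enrich labels before parsing: an NP over a temporal word -> NP-TMP."""
    label, children = tree
    new_children = [c if isinstance(c, str) else transform(c) for c in children]
    if label == "NP" and any(w in TEMPORAL for w in words(tree)):
        label = "NP-TMP"
    return (label, new_children)

def detransform(tree):
    """Deterministically strip the refinement from the parser's output."""
    label, children = tree
    if label.endswith("-TMP"):
        label = label[: -len("-TMP")]
    return (label, [c if isinstance(c, str) else detransform(c) for c in children])
```

Because the refinement is added and removed deterministically, no information is lost in the round trip: `detransform(transform(t))` returns the original tree.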
5. Q4 How does grammar writing interact with treebanking?
- In practice, they often haven't interacted much
- I'm a great believer that they should
  - Having a grammar is a huge guide to how things should be parsed and to checking parsing consistency
  - It also allows opportunities for analysis updating, etc.
  - Cf. the Redwoods Treebank, and subsequent efforts
- The inability to automatically update treebanks is a growing problem
  - Current English treebanking isn't having much impact because of annotation differences with the original PTB
- Feedback from users has only rarely been harvested
6. Q5 What methodological lessons can be drawn for treebanking?
- Good guidelines (loosely, a grammar!)
- Good, trained people
- Annotator buy-in
- Ann Bies said all this; I strongly agree!
- I think there has been a real underexploitation of technology for treebank validation
  - Doing vertical searches/checks almost always turns up inconsistencies
  - Either these or a grammar should give vertical review
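One cheap form of the vertical check suggested above, assuming POS-tagged sentences represented as lists of (word, tag) pairs: group every occurrence of a word across the corpus and flag words that received more than one tag.

```python
from collections import defaultdict

def vertical_check(tagged_sents):
    """Flag words tagged more than one way across the corpus.

    `tagged_sents` is a list of sentences, each a list of (word, tag)
    pairs. Returns {word: sorted list of distinct tags} for every word
    that received multiple tags.
    """
    tags_for = defaultdict(set)
    for sent in tagged_sents:
        for word, tag in sent:
            tags_for[word.lower()].add(tag)
    return {w: sorted(ts) for w, ts in tags_for.items() if len(ts) > 1}
```

Not every hit is an error, since genuine ambiguity exists, but scanning the flagged cases is exactly the kind of vertical review that almost always turns up inconsistencies.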
7. Q6 What are advantages and disadvantages of pre-processing the data to be treebanked with an automatic parser?
- The economics are clear
  - You reduce annotation costs
- The costs are clear
  - The parser places a large bias on the trees produced
  - Humans are lazy/reluctant to correct mistakes
  - Clear example: I think it is fair to say that many POS errors in the Penn English Treebank can be traced to the POS tagger
    - E.g., sentence-initial capitalized "Separately", "Frankly", "Currently", "Hopefully" analyzed as NNP
    - Doesn't look like a human being's mistakes to me
- The answer
  - More use of technology to validate and check humans
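The NNP example suggests one such automatic check. A sketch under the assumption that annotations come as (word, tag) lists: flag sentence-initial capitalized "-ly" words tagged NNP for human review, since those are far more likely propagated tagger errors than proper nouns.

```python
def suspicious_initial_nnp(tagged_sents):
    """Flag sentence-initial capitalized -ly words tagged NNP.

    Returns (sentence index, word) pairs to queue for human review.
    """
    hits = []
    for i, sent in enumerate(tagged_sents):
        if not sent:
            continue
        word, tag = sent[0]
        if tag == "NNP" and word[0].isupper() and word.lower().endswith("ly"):
            hits.append((i, word))
    return hits
```

This is a heuristic, not a fix: a sentence starting with "Kelly" would be flagged too, which is why the output should go to a human validator rather than being auto-corrected.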
8. Q7 What are the advantages of a phrase-structure and/or a dependency treebank for parsing?
- The current split in the literature between phrase-structure and dependency parsing is largely bogus (in my opinion)
  - The Collins/Bikel parser operates largely in the manner of a dependency parser
  - The Stanford parser contains a strict (untyped) dependency parser
- Phrase-structure parsers have the advantage of phrase-structure labels
  - A dependency parser is just a phrase-structure parser where you cannot refer to phrasal types or condition on phrasal span
  - This extra info is useful; it's silly not to use it
- Labeling phrasal heads/dependencies is useful. Silly not to do it
  - Automatic head rules should have had their day by now!!
- Scoring based on dependencies is much better than Parseval!!!
- Labeling dependency types is useful
  - Especially, this will be the case in freer word order languages
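Dependency-based scoring of the kind advocated here is also simple to implement. A sketch, assuming gold and predicted analyses are given as one (head index, dependency label) pair per token: the unlabeled attachment score (UAS) checks only the head, the labeled attachment score (LAS) requires the label to match too.

```python
def attachment_scores(gold, pred):
    """Compute (UAS, LAS) for one sentence.

    `gold` and `pred` are equal-length lists with one
    (head_index, dep_label) pair per token.
    """
    assert len(gold) == len(pred) and gold, "analyses must align"
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return uas, las
```

Unlike Parseval, a single misattached word costs exactly one dependency rather than cascading through every bracket that spans it.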