Title: Thoughts on Treebanks
1. Thoughts on Treebanks
- Christopher Manning
- Stanford University
2. Q1 What do you really care about when you're building a parser?
- Completeness of information
  - There's not much point in having a treebank if really you're having to end up doing unsupervised learning: you want to be giving human value-add
  - Classic bad example: noun compound structure in the Penn English Treebank
- Consistency of information
  - If things are annotated inconsistently, you lose both in training (if it is widespread) and in evaluation
  - Bad example: "long ago" constructions ("as long ago as", "not so long ago")
- Mutual information
  - Categories should be as mutually informative as possible
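The "mutually informative" criterion can be made concrete with a toy computation (a sketch of my own, not from the talk): the mutual information between a word's POS tag and the phrasal category above it, estimated from hypothetical counts.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """I(X;Y) = sum over (x,y) of p(x,y) * log2(p(x,y) / (p(x) * p(y))).

    `pairs` is a list of observed (x, y) events, e.g. (POS tag, parent
    phrase label) pairs harvested from a treebank.
    """
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        pxy = c / n
        mi += pxy * math.log2(pxy / ((px[x] / n) * (py[y] / n)))
    return mi

# Hypothetical (tag, parent-label) observations.
pairs = [("NN", "NP"), ("NN", "NP"), ("DT", "NP"), ("VB", "VP"),
         ("VB", "VP"), ("NN", "VP"), ("DT", "NP"), ("VB", "VP")]
print(mutual_information(pairs))
```

Perfectly correlated category pairs give the full entropy in bits; independent ones give zero, i.e. one annotation layer tells you nothing about the other.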
3. Q3 What info (e.g., function tags, empty categories, coindexation) is useful, what is not?
- Information on function is definitely useful
  - Should move to always having typed dependencies
  - Clearest example in the Penn English Treebank: temporal NPs
- Empty categories don't necessarily give much value in the dumbed-down world of Penn English Treebank parsing work
  - Though it should be tried again/more
  - But definitely useful if you want to know this stuff!
    - Subcategorization/argument structure determination
    - Natural Language Understanding!!
  - Cf. Johnson, Levy and Manning, etc. work on long-distance dependencies
- I'm sceptical that there is a categorical argument/adjunct distinction to be made
  - Leave it to the real numbers
  - This means that subcategorization frames can only be statistical
  - Cf. Manning (2003)
  - I've got some more slides on this from another talk if you want
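A minimal sketch of what "statistical subcategorization frames" could look like, assuming hypothetical (verb, frame) observations extracted from a treebank: each verb gets a probability distribution over frames rather than a categorical argument/adjunct split.

```python
from collections import Counter, defaultdict

def frame_distribution(observations):
    """Estimate P(frame | verb) by relative frequency.

    `observations` is a list of (verb, frame-string) pairs; frames end up
    graded by probability, not divided into obligatory vs. optional.
    """
    counts = defaultdict(Counter)
    for verb, frame in observations:
        counts[verb][frame] += 1
    return {verb: {frame: c / sum(fc.values()) for frame, c in fc.items()}
            for verb, fc in counts.items()}

# Hypothetical observations; real ones would be read off treebank trees.
observations = [
    ("give", "NP NP"), ("give", "NP PP"), ("give", "NP PP"),
    ("sleep", ""), ("sleep", "PP"),
]
print(frame_distribution(observations))
```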
4. Q3 What info (e.g., function tags, empty categories, coindexation) is useful, what is not?
- Do you prefer a more refined tagset for parsing?
  - Yes. I mightn't use it, but I often do
- The transform-detransform framework
  - RawInput → TransformedInput → Parser → TransformedOutput → DesiredOutput
  - I think everyone does this to some extent
  - Some, like Johnson, Klein and Manning, have exploited it very explicitly: NN-TMP, INT, NP-Poss, VP-VBG, NP-v, ...
  - Everyone else should think about it more
  - It's easy to throw away too-precise information, or to move information around deterministically (tag to phrase or vice versa), if it's represented completely and consistently!
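The transform-detransform pipeline can be sketched as follows. This is an illustrative toy, not the actual Johnson or Klein-and-Manning code: trees are nested (label, children) tuples, the transform marks NPs dominating a (hypothetical) temporal word as NP-TMP before training/parsing, and the detransform deterministically strips the refinement so the output matches the original label set.

```python
# Toy temporal lexicon; the real annotation is richer than a word list.
TEMPORAL = {"yesterday", "today", "tomorrow"}

def words(tree):
    """Collect the leaf strings under a (label, children) tree."""
    _, children = tree
    out = []
    for c in children:
        out.extend([c] if isinstance(c, str) else words(c))
    return out

def transform(tree):
    """Enrich labels before parsing: an NP over a temporal word -> NP-TMP."""
    label, children = tree
    new_children = [c if isinstance(c, str) else transform(c) for c in children]
    if label == "NP" and any(w in TEMPORAL for w in words(tree)):
        label = "NP-TMP"
    return (label, new_children)

def detransform(tree):
    """Deterministically strip the refinement from the parser's output."""
    label, children = tree
    if label.endswith("-TMP"):
        label = label[: -len("-TMP")]
    return (label, [c if isinstance(c, str) else detransform(c) for c in children])
```

Because the refinement is added and removed deterministically, no information is lost in the round trip: `detransform(transform(t))` returns the original tree.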
5. Q4 How does grammar writing interact with treebanking?
- In practice, they often haven't interacted much
- I'm a great believer that they should
  - Having a grammar is a huge guide to how things should be parsed and to checking parsing consistency
  - It also allows opportunities for analysis updating, etc.
  - Cf. the Redwoods Treebank, and subsequent efforts
- The inability to automatically update treebanks is a growing problem
  - Current English treebanking isn't having much impact because of annotation differences with the original PTB
- Feedback from users has only rarely been harvested
6. Q5 What methodological lessons can be drawn for treebanking?
- Good guidelines (loosely, a grammar!)
- Good, trained people
- Annotator buy-in
- Ann Bies said all this; I strongly agree!
- I think there has been a real underexploitation of technology for treebank validation
  - Doing vertical searches/checks almost always turns up inconsistencies
  - Either these or a grammar should give vertical review
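One cheap form of the vertical check suggested above, assuming POS-tagged sentences represented as lists of (word, tag) pairs: group every occurrence of a word across the corpus and flag words that received more than one tag.

```python
from collections import defaultdict

def vertical_check(tagged_sents):
    """Flag words tagged more than one way across the corpus.

    `tagged_sents` is a list of sentences, each a list of (word, tag)
    pairs. Returns {word: sorted list of distinct tags} for every word
    that received multiple tags.
    """
    tags_for = defaultdict(set)
    for sent in tagged_sents:
        for word, tag in sent:
            tags_for[word.lower()].add(tag)
    return {w: sorted(ts) for w, ts in tags_for.items() if len(ts) > 1}
```

Not every hit is an error, since genuine ambiguity exists, but scanning the flagged cases is exactly the kind of vertical review that almost always turns up inconsistencies.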
7. Q6 What are advantages and disadvantages of pre-processing the data to be treebanked with an automatic parser?
- The economics are clear
  - You reduce annotation costs
- The costs are clear
  - The parser places a large bias on the trees produced
  - Humans are lazy/reluctant to correct mistakes
  - Clear example: I think it is fair to say that many POS errors in the Penn English Treebank can be traced to the POS tagger
    - E.g., sentence-initial capitalized "Separately", "Frankly", "Currently", "Hopefully" analyzed as NNP
    - Doesn't look like a human being's mistakes to me
- The answer
  - More use of technology to validate and check humans
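The NNP example suggests one such automatic check. A sketch under the assumption that annotations come as (word, tag) lists: flag sentence-initial capitalized "-ly" words tagged NNP for human review, since those are far more likely propagated tagger errors than proper nouns.

```python
def suspicious_initial_nnp(tagged_sents):
    """Flag sentence-initial capitalized -ly words tagged NNP.

    Returns (sentence index, word) pairs to queue for human review.
    """
    hits = []
    for i, sent in enumerate(tagged_sents):
        if not sent:
            continue
        word, tag = sent[0]
        if tag == "NNP" and word[0].isupper() and word.lower().endswith("ly"):
            hits.append((i, word))
    return hits
```

This is a heuristic, not a fix: a sentence starting with "Kelly" would be flagged too, which is why the output should go to a human validator rather than being auto-corrected.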
8. Q7 What are the advantages of a phrase-structure and/or a dependency treebank for parsing?
- The current split in the literature between phrase-structure and dependency parsing is largely bogus (in my opinion)
  - The Collins/Bikel parser operates largely in the manner of a dependency parser
  - The Stanford parser contains a strict (untyped) dependency parser
- Phrase-structure parsers have the advantage of phrase-structure labels
  - A dependency parser is just a phrase-structure parser where you cannot refer to phrasal types or condition on phrasal span
  - This extra info is useful; it's silly not to use it
- Labeling phrasal heads/dependencies is useful. Silly not to do it
  - Automatic head rules should have had their day by now!!
- Scoring based on dependencies is much better than Parseval!!!
- Labeling dependency types is useful
  - Especially, this will be the case in freer word order languages
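Dependency-based scoring of the kind advocated here is also simple to implement. A sketch, assuming gold and predicted analyses are given as one (head index, dependency label) pair per token: the unlabeled attachment score (UAS) checks only the head, the labeled attachment score (LAS) requires the label to match too.

```python
def attachment_scores(gold, pred):
    """Compute (UAS, LAS) for one sentence.

    `gold` and `pred` are equal-length lists with one
    (head_index, dep_label) pair per token.
    """
    assert len(gold) == len(pred) and gold, "analyses must align"
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return uas, las
```

Unlike Parseval, a single misattached word costs exactly one dependency rather than cascading through every bracket that spans it.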