Bootstrapping Feature-Rich Dependency Parsers with Entropic Priors presentation


1
Bootstrapping Feature-Rich Dependency Parsers
with Entropic Priors

David A. Smith
Jason Eisner
Johns Hopkins University

2
Only Connect
[Diagram: a trained (dependency) parser, learned from training trees, raw text, parallel and comparable corpora, and out-of-domain text, connected to applications: LM, textual entailment, IE (Weischedel 2004), MT (Quirk et al. 2005), and lexical semantics (Pantel & Lin 2002)]
3
Outline: Bootstrapping Parsers

What kind of parser should we train?
How should we train it semi-supervised?
Does it work? (initial experiments)
How can we incorporate other knowledge?

4
Re-estimation: EM or Viterbi EM
Trained Parser
5
Re-estimation: EM or Viterbi EM
(iterate process)
Trained Parser
Oops! Not much supervised training, so most of these parses were bad. Retraining on all of them overwhelms the good supervised data.
6
Simple Bootstrapping: Self-Training
So only retrain on good parses ...
?
Trained Parser
7
Simple Bootstrapping: Self-Training
So only retrain on good parses ...
Trained Parser
... at least, those the parser itself thinks are good. (Can we trust it? We'll see ...)
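A minimal sketch of the self-training loop on these two slides, assuming a toy one-dimensional classifier in place of a parser. The names `fit`, `predict_proba`, `self_train`, and the `threshold` and `rounds` parameters are hypothetical and chosen only for illustration; the data is made up.

```python
import math

def fit(labeled):
    """Toy stand-in for parser training: store the mean of each class."""
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    return sum(neg) / len(neg), sum(pos) / len(pos)

def predict_proba(model, x):
    """Probability of label 1; the nearer class mean gets more mass."""
    mu0, mu1 = model
    score = (x - mu0) ** 2 - (x - mu1) ** 2  # > 0 when x is nearer mu1
    return 1.0 / (1.0 + math.exp(-score))

def self_train(seed, unlabeled, threshold=0.9, rounds=5):
    """Retrain only on examples the current model labels confidently."""
    labeled = list(seed)
    pool = list(unlabeled)
    model = fit(labeled)
    for _ in range(rounds):
        confident, rest = [], []
        for x in pool:
            p = predict_proba(model, x)
            if max(p, 1.0 - p) >= threshold:
                confident.append((x, 1 if p >= 0.5 else 0))
            else:
                rest.append(x)
        if not confident:
            break  # nothing the model trusts: stop bootstrapping
        labeled += confident
        pool = rest
        model = fit(labeled)
    return model, labeled, pool

# Made-up seed set and unlabeled pool.
seed = [(-2.0, 0), (2.0, 1)]
unlabeled = [-3.0, -1.5, -0.1, 0.2, 1.0, 2.5]
model, labeled, pool = self_train(seed, unlabeled)
```

Examples near the decision boundary stay in `pool`: the model never becomes confident enough about them to retrain on them.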
8
Why Might This Work?

Sure, now we avoid harming the parser with bad
training.
But why do we learn anything new from the unsup.
data?

Trained Parser

After training, training parses have
Many features with positive weights
Few features with negative weights

But unsupervised parses have
Few positive or negative features
Mostly unknown features
Words or situations not seen in training data

Still, sometimes enough positive features to be sure it's the right parse
9
Why Might This Work?

Sure, we avoid bad guesses that harm the parser.
But why do we learn anything new from the unsup.
data?

Trained Parser
Now, retraining the weights makes the gray (and red) features greener
Still, sometimes enough positive features to be sure it's the right parse
10
Why Might This Work?

Sure, we avoid bad guesses that harm the parser.
But why do we learn anything new from the unsup.
data?

Trained Parser
Now, retraining the weights makes the gray (and red) features greener
... and makes features redder for the losing parses of this sentence (not shown)
Still, sometimes enough positive features to be sure it's the right parse
Learning!
11
This Story Requires Many Redundant Features!
More features → more chances to identify correct parse even when we're undertrained

Bootstrapping for WSD (Yarowsky 1995)
Lots of contextual features → success
Co-training for parsing (Steedman et al. 2003)
Feature-poor parsers → disappointment
Self-training for parsing (McClosky et al. 2006)
Feature-poor parsers → disappointment
Reranker with more features → success

12
This Story Requires Many Redundant Features!
More features → more chances to identify correct parse even when we're undertrained

So, let's bootstrap a feature-rich parser!
In our experiments so far, we follow McDonald et al. (2005)
Our model has 450 million features (on Czech)
Prune down to 90 million frequent features
About 200 are considered per possible edge

Note: Even more features proposed at end of talk
13
Edge-Factored Parsers (McDonald et al. 2005)

No global features of a parse
Each feature is attached to some edge
Simple: allows fast O(n²) or O(n³) parsing

Byl jasný studený dubnový den a hodiny odbíjely třináctou
14
Edge-Factored Parsers (McDonald et al. 2005)

Is this a good edge?

yes, lots of green ...
Byl jasný studený dubnový den a hodiny odbíjely třináctou
15
Edge-Factored Parsers (McDonald et al. 2005)

Is this a good edge?

jasný ← den (bright day)
Byl jasný studený dubnový den a hodiny odbíjely třináctou
16
Edge-Factored Parsers (McDonald et al. 2005)

Is this a good edge?

jasný ← N (bright NOUN)
jasný ← den (bright day)

Word  Byl   jasný  studený  dubnový  den   a   hodiny  odbíjely  třináctou
Tag   V     A      A        A        N     J   N       V         C
17
Edge-Factored Parsers (McDonald et al. 2005)

Is this a good edge?

jasný ← N (bright NOUN)
jasný ← den (bright day)
A ← N

Word  Byl   jasný  studený  dubnový  den   a   hodiny  odbíjely  třináctou
Tag   V     A      A        A        N     J   N       V         C
18
Edge-Factored Parsers (McDonald et al. 2005)

Is this a good edge?

jasný ← N (bright NOUN)
jasný ← den (bright day)
A ← N preceding conjunction
A ← N

Word  Byl   jasný  studený  dubnový  den   a   hodiny  odbíjely  třináctou
Tag   V     A      A        A        N     J   N       V         C
19
Edge-Factored Parsers (McDonald et al. 2005)

How about this competing edge?

not as good, lots of red ...
Word  Byl   jasný  studený  dubnový  den   a   hodiny  odbíjely  třináctou
Tag   V     A      A        A        N     J   N       V         C
20
Edge-Factored Parsers (McDonald et al. 2005)

How about this competing edge?

jasný ← hodiny (bright clocks)
... undertrained ...

Word  Byl   jasný  studený  dubnový  den   a   hodiny  odbíjely  třináctou
Tag   V     A      A        A        N     J   N       V         C
21
Edge-Factored Parsers (McDonald et al. 2005)

How about this competing edge?

jasn- ← hodi- (bright clock, stems only)

Word  Byl   jasný  studený  dubnový  den   a   hodiny  odbíjely  třináctou
Tag   V     A      A        A        N     J   N       V         C
Stem  být-  jasn-  stud-    dubn-    den-  a-  hodi-   odbí-     třin-
22
Edge-Factored Parsers (McDonald et al. 2005)

How about this competing edge?

jasn- ← hodi- (bright clock, stems only)
A_plural ← N_singular

Word  Byl   jasný  studený  dubnový  den   a   hodiny  odbíjely  třináctou
Tag   V     A      A        A        N     J   N       V         C
Stem  být-  jasn-  stud-    dubn-    den-  a-  hodi-   odbí-     třin-
23
Edge-Factored Parsers (McDonald et al. 2005)

How about this competing edge?

jasn- ← hodi- (bright clock, stems only)
A ← N where N follows a conjunction
A_plural ← N_singular

Word  Byl   jasný  studený  dubnový  den   a   hodiny  odbíjely  třináctou
Tag   V     A      A        A        N     J   N       V         C
Stem  být-  jasn-  stud-    dubn-    den-  a-  hodi-   odbí-     třin-
24
Edge-Factored Parsers (McDonald et al. 2005)

Which edge is better?
"bright day" or "bright clocks"?

Word  Byl   jasný  studený  dubnový  den   a   hodiny  odbíjely  třináctou
Tag   V     A      A        A        N     J   N       V         C
Stem  být-  jasn-  stud-    dubn-    den-  a-  hodi-   odbí-     třin-
25
Edge-Factored Parsers (McDonald et al. 2005)

Which edge is better?
Score of an edge e = θ · features(e)
Standard algos → valid parse with max total score

Word  Byl   jasný  studený  dubnový  den   a   hodiny  odbíjely  třináctou
Tag   V     A      A        A        N     J   N       V         C
Lemma být   jasný  studený  dubnový  den   a   hodiny  odbit     třináct
26
Edge-Factored Parsers (McDonald et al. 2005)

Which edge is better?
Score of an edge e = θ · features(e)
Standard algos → valid parse with max total score

can't have both (one parent per word)
Thus, an edge may lose (or win) because of a consensus of other edges. Retraining then learns to reduce (or increase) its score.
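A small sketch of edge-factored scoring, where an edge's score is the sum of the weights of its features. The feature templates, feature names, and weights below are made up for illustration and are not the feature set of McDonald et al. (2005). Choosing the best parent for a single word, as done here, is a simplification: a real parser runs a spanning-tree or projective algorithm over all words at once.

```python
from collections import defaultdict

def edge_features(words, tags, head, child):
    """Hypothetical feature templates for the edge head -> child."""
    return [
        f"word:{words[head]}>{words[child]}",
        f"tag:{tags[head]}>{tags[child]}",
        f"word-tag:{words[child]}<{tags[head]}",
    ]

def edge_score(theta, feats):
    """Score of an edge: sum of the weights of its features (theta . f(e))."""
    return sum(theta[f] for f in feats)

words = ["jasný", "den", "hodiny"]
tags = ["A", "N", "N"]

# Made-up weights; unseen ("gray", undertrained) features default to 0.
theta = defaultdict(float, {
    "word:den>jasný": 1.2,
    "tag:N>A": 0.8,
    "word-tag:jasný<N": 0.5,
})

child = 0  # choose a parent for "jasný"
candidates = {
    head: edge_score(theta, edge_features(words, tags, head, child))
    for head in range(len(words)) if head != child
}
best_head = max(candidates, key=candidates.get)
```

Both candidate parents share the tag-level features, so the bilexical feature alone decides between "den" and "hodiny" here. With an unseen adjective, only the shared backoff features would fire, which is the undertrained situation the slides describe.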
27
Only Connect
[Diagram: a trained parser, learned from training trees, raw text, parallel and comparable corpora, and out-of-domain text, connected to applications: LM, textual entailment, IE, MT, and lexical semantics]
28
Can we recast this declaratively?
Only retrain on good parses ...
Trained Parser
... at least, those the parser itself thinks are good.
29
Can we recast this declaratively?
Seed set
Classifier
Label Examples
Select Examples W/ High Confidence
New Labeled Set
30
Bootstrapping as Optimization
Maximize a function on supervised and unsupervised data
Entropy regularization (Brand 1999; Grandvalet & Bengio; Jiao et al.)
Yesterday's talk: how to compute these for non-projective models
See Hwa '01 for projective tree entropy
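One plausible form of such an objective, written in the general style of entropy regularization. This is a sketch, not necessarily the formula shown on the slide: the symbols θ (feature weights), γ (trade-off weight), L (supervised set), and U (unsupervised set) are assumptions introduced here.

```latex
\max_{\theta}\;
  \sum_{(x_i, y_i) \in L} \log p_\theta(y_i \mid x_i)
  \;-\; \gamma \sum_{x_j \in U} H\bigl(p_\theta(\cdot \mid x_j)\bigr),
\qquad
H(p) = -\sum_{y} p(y) \log p(y)
```

The first sum is the usual conditional log-likelihood on the labeled trees; the second penalizes uncertainty about the parse of each unlabeled sentence.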
31
Claim: Gradient descent on this objective function works like bootstrapping

When we're pretty sure the true parse is A or B, we reduce entropy H by becoming even surer (≈ retraining θ on the example)
When we're not sure, the example doesn't affect θ (≈ not retraining on the example)

not sure (H ≈ 1)
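A toy illustration of this claim, assuming only two candidate parses A and B whose scores differ by `s`, so that p(A) is the logistic function of `s`. The function name `entropy_and_grad` is hypothetical, and entropy is measured in nats here rather than bits.

```python
import math

def entropy_and_grad(s):
    """Two candidate parses A and B with score difference s = score(A) - score(B).

    Returns the Shannon entropy H (in nats) of p(A) = sigmoid(s), and dH/ds.
    """
    p = 1.0 / (1.0 + math.exp(-s))
    h = -sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0.0)
    grad = -s * p * (1.0 - p)  # chain rule: dH/dp = -s, dp/ds = p(1 - p)
    return h, grad

for s in (0.0, 0.1, 2.0):
    h, g = entropy_and_grad(s)
    print(f"s={s:4.1f}  H={h:.3f}  dH/ds={g:+.3f}")
```

When the two parses are tied (`s = 0`), the gradient is zero, so the example does not move the weights. Once the model leans toward A, descending on H pushes `s` further in the same direction, which resembles retraining on the parse the model already prefers.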
32
Claim: Gradient descent on this objective function works like bootstrapping
In the paper, we generalize: replace Shannon entropy H(·) with Rényi entropy H_α(·)

This gives us a tunable parameter α
Connect to Abney's view of bootstrapping (α = 0)
Obtain Viterbi variant (limit as α → ∞)
Obtain Gini variant (α = 2)
Still get Shannon entropy (limit as α → 1)
Also easier to compute in some circumstances
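A short sketch of Rényi entropy over an explicit list of parse probabilities, using the standard definition H_α(p) = log(Σ p^α) / (1 − α) with the Shannon and Viterbi cases handled as limits. The function name and the three-parse distribution are made up; a real parser would compute these quantities over all trees with dynamic programming rather than by enumeration.

```python
import math

def renyi_entropy(probs, alpha):
    """Rényi entropy H_alpha(p) = log(sum_y p(y)**alpha) / (1 - alpha), in nats."""
    probs = [p for p in probs if p > 0.0]
    if alpha == 1.0:           # Shannon limit
        return -sum(p * math.log(p) for p in probs)
    if math.isinf(alpha):      # Viterbi limit: only the best parse matters
        return -math.log(max(probs))
    return math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)

parse_probs = [0.7, 0.2, 0.1]  # made-up distribution over three parses
h0 = renyi_entropy(parse_probs, 0.0)         # log(number of parses)
h1 = renyi_entropy(parse_probs, 1.0)         # Shannon entropy
h2 = renyi_entropy(parse_probs, 2.0)         # -log(sum of squared probabilities)
hinf = renyi_entropy(parse_probs, math.inf)  # -log(probability of best parse)
```

Rényi entropy does not increase as α grows, so the four values come out in decreasing order for this distribution.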

33
Experimental Questions

Are confident parses (or edges) actually good for
retraining?
Does bootstrapping help accuracy?
What is being learned?

34
Experimental Design

Czech, German, and Spanish (some Bulgarian)
CoNLL-X dependency trees
Non-projective (MST) parsing
Hundreds of millions of features
Supervised training sets of 100 to 1000 trees
Unparsed but tagged sets of 2k to 70k sentences
Stochastic gradient descent
First optimize just likelihood on seed set
Then optimize likelihood + confidence criterion on all data
Stop when accuracy peaks on development data

35
Are confident parses accurate? Correlation of entropy with accuracy
Shannon entropy
Viterbi self-training
-.32
-.26
Gini: -log(expected 0/1 gain)
log(# of parses): favors short sentences; Abney's Yarowsky alg.
-.27
-.25
36
How Accurate Is Bootstrapping?
[Chart: accuracy with the 100-tree supervised set, for the baseline and for bootstrapping with 2K, 37K, and 71K additional sentences]
Significant on paired permutation test
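For readers unfamiliar with the test, here is a minimal sketch of a paired permutation (sign-flip) test on per-sentence scores. The test statistic, the number of trials, and the scores are assumptions for illustration; the slide does not specify how the test was set up.

```python
import random

def paired_permutation_test(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided sign-flip test on per-sentence score differences."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(trials):
        # Under the null hypothesis, the two systems' labels are
        # exchangeable within each pair, so each difference may flip sign.
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)

# Made-up per-sentence accuracies for a baseline and a bootstrapped parser.
baseline = [0.60, 0.55, 0.70, 0.62, 0.58, 0.66, 0.61, 0.59]
bootstrapped = [0.64, 0.57, 0.73, 0.66, 0.60, 0.69, 0.65, 0.62]
p_value = paired_permutation_test(bootstrapped, baseline)
```

A small `p_value` means that random relabelings of the pairs rarely produce a total difference as large as the observed one.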
37
How Does Bootstrapping Learn?
[Plot: precision and recall]
38
Bootstrapping vs. EM
Two ways to add unsupervised data
Compare on a feature-poor model that EM can
handle (DMV)
[Bar chart: y-axis from 0 to 90; one group each for Bulgarian, German, and Spanish]
100 training trees, 100 dev trees for model selection
39
There's No Data Like More Data
[Diagram: a trained parser, learned from training trees, raw text, parallel and comparable corpora, and out-of-domain text, connected to applications: LM, textual entailment, IE, MT, and lexical semantics]
40
Token Projection
What if some sentences have parallel text?

Project 1-best English dependencies (Hwa et al. '04)???
Imperfect or free translation
Imperfect parse
Imperfect alignment

Byl jasný studený dubnový den a hodiny odbíjely třináctou
41
Token Projection
What if some sentences have parallel text?
Probably aligns to some English link A ← N
Byl jasný studený dubnový den a hodiny odbíjely třináctou
It was a bright cold day in April and the clocks were striking thirteen
42
Token Projection
What if some sentences have parallel text?
Probably aligns to some English path N → in → N
Byl jasný studený dubnový den a hodiny odbíjely třináctou
It was a bright cold day in April and the clocks were striking thirteen
Cf. quasi-synchronous grammars (Smith & Eisner, 2006)
43
Type Projection
Can we use world knowledge, e.g., from comparable
corpora?
Probably translate as English words that usually link as N ← V when cosentential
Byl jasný studený dubnový den a hodiny odbíjely třináctou
44
Type Projection
Can we use world knowledge, e.g., from comparable
corpora?
Probably translate as English words that usually link as N ← V when cosentential
Byl jasný studený dubnový den a hodiny odbíjely třináctou
45
Conclusions

Declarative view of bootstrapping as entropy
minimization
Improvements in parser accuracy with feature-rich
models
Easily added features from alternative data
sources, e.g. comparable text
In future: consider also the WSD decision list learner. Is it important for learning robust feature weights?

46
Thanks
Noah Smith, Keith Hall, the anonymous reviewers
Ryan McDonald for making his code available
47
Extra slides
48
Dependency Treebanks
49
A Supervised CoNLL-X System
What system was this?
50
How Does Bootstrapping Learn?
Supervised iter. 10
Supervised iter. 1
Bootstrapping w/ R2
Bootstrapping w/ Rinf
51
How Does Bootstrapping Learn?
Updated      M feat.  Acc.    Updated     M feat.  Acc.
all          15.5     64.3    none        0        60.9
seed         1.4      64.1    non-seed    14.1     44.7
non-lex.     3.5      64.4    lexical     12.0     59.9
non-bilex.   12.6     64.4    bilexical   2.9      61.0
52
Review: Yarowsky's bootstrapping algorithm
table taken from Yarowsky (1995)
life (1)
target word plant
98
manufacturing(1)
53
Review: Yarowsky's bootstrapping algorithm
figure taken from Yarowsky (1995)
Should be a good classifier, unless we
accidentally learned some bad cues along the way
that polluted the original sense distinction.
54
Review: Yarowsky's bootstrapping algorithm
figure taken from Yarowsky (1995)
55
Review: Yarowsky's bootstrapping algorithm
figure taken from Yarowsky (1995)
repeat
That confidently classifies some of the remaining
examples.
repeat
56
Bootstrapping Pivot Features
Sat beside the river bank
Sat on the bank
Run on the bank
quick and sly fox
sly and crafty fox
quick gait of the sly fox
Lots of overlapping features vs. PCFG (McClosky et al.)
57
Bootstrapping as Optimization
Given a labeling distribution p, log
likelihood to max is
Abney (2004)
On labeled data, p is 1 at the label and 0
elsewhere. Thus, supervised training
58
Triangular Trade
Features
Data
Words, Tags, Translations,
Parent Prediction Inside/Outside Matrix-Tree
???
Models
Objectives
Derivational (Rényi) entropy
EM, Abney's K, entropy regularization
Globally normalized LL, projective/non-projective
