1. Dependency Parsing by Belief Propagation
- David Smith (JHU → UMass Amherst)
- Jason Eisner (JHU)
2. Great ideas in NLP: Log-linear models (Berger, Della Pietra & Della Pietra 1996; Darroch & Ratcliff 1972)
- In the beginning, we used generative models.
p(A) · p(B|A) · p(C|A,B) · p(D|A,B,C)
each choice depends on a limited part of the
history
but which dependencies to allow? What if they're all worthwhile?
3. Great ideas in NLP: Log-linear models (Berger, Della Pietra & Della Pietra 1996; Darroch & Ratcliff 1972)
- In the beginning, we used generative models.
- Solution: Log-linear (max-entropy) modeling
- Features may interact in arbitrary ways
- Iterative scaling keeps adjusting the feature weights until the model agrees with the training data!
p(A) · p(B|A) · p(C|A,B) · p(D|A,B,C)
which dependencies to allow? (given limited
training data)
(1/Z) · F(A) · F(B,A) · F(C,A) · F(C,B) · F(D,A,B) · F(D,B,C) · F(D,A,C)
throw them all in!
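To make the "throw them all in" idea concrete, here is a minimal sketch (my own toy illustration; the factor names and weights are invented, not from the paper) of a log-linear model that scores an assignment by multiplying overlapping factors and normalizing by Z:

```python
import itertools

# Hypothetical toy example: four binary variables A, B, C, D and a few
# overlapping factors.  Each factor maps its variables' values to a
# nonnegative weight; the model multiplies them all together.
factors = {
    ("A",):          lambda a: 2.0 if a else 1.0,
    ("B", "A"):      lambda b, a: 3.0 if b == a else 0.5,
    ("C", "A"):      lambda c, a: 1.5 if c and a else 1.0,
    ("D", "A", "B"): lambda d, a, b: 4.0 if d == (a and b) else 0.2,
}

def unnormalized_score(assignment):
    """Product of all factor values on one assignment (a dict var -> bool)."""
    score = 1.0
    for vars_, f in factors.items():
        score *= f(*(assignment[v] for v in vars_))
    return score

# Z sums the unnormalized score over all 2^4 assignments (fine at toy sizes).
Z = sum(unnormalized_score(dict(zip("ABCD", vals)))
        for vals in itertools.product([False, True], repeat=4))

p = unnormalized_score({"A": True, "B": True, "C": False, "D": True}) / Z
print(f"p(A=1, B=1, C=0, D=1) = {p:.3f}")
```

With only four binary variables, Z can be computed by brute force; the rest of the talk is about what to do when the variables describe a whole parse.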
4. How about structured outputs?
- Log-linear models great for n-way classification
- Also good for predicting sequences
- Also good for dependency parsing
but to allow fast dynamic programming, only use
n-gram features
but to allow fast dynamic programming or MST parsing, only use single-edge features
5. How about structured outputs?
but to allow fast dynamic programming or MST parsing, only use single-edge features
- With arbitrary features, runtime blows up
- Projective parsing: O(n³) by dynamic programming
- Non-projective: O(n²) by minimum spanning tree
6. Let's reclaim our freedom (again!)
This paper in a nutshell
- Output probability is a product of local factors
- Throw in any factors we want! (log-linear model)
- Let local factors negotiate via belief propagation
- Links (and tags) reinforce or suppress one another
- Each iteration takes total time O(n²) or O(n³)
- Converges to a pretty good (but approx.) global
parse
(1/Z) · F(A) · F(B,A) · F(C,A) · F(C,B) · F(D,A,B) · F(D,B,C) · F(D,A,C)
7. Let's reclaim our freedom (again!)
This paper in a nutshell
New!
8. Local factors in a graphical model
- First, a familiar example
- Conditional Random Field (CRF) for POS tagging
Possible tagging (i.e., assignment to remaining
variables)
[Figure: chain CRF over the observed sentence "find preferred tags" (shaded), with one tag variable per word; the tagging shown is v v v]
9. Local factors in a graphical model
- First, a familiar example
- Conditional Random Field (CRF) for POS tagging
Possible tagging (i.e., assignment to remaining variables); another possible tagging
[Figure: the tag variables over the observed sentence "find preferred tags" now assigned v a n]
10. Local factors in a graphical model
- First, a familiar example
- Conditional Random Field (CRF) for POS tagging
Binary factor that measures compatibility of 2
adjacent tags
Model reuses same parameters at this position
11. Local factors in a graphical model
- First, a familiar example
- Conditional Random Field (CRF) for POS tagging
Unary factor evaluates this tag; its values depend on the corresponding word
can't be adj
12. Local factors in a graphical model
- First, a familiar example
- Conditional Random Field (CRF) for POS tagging
Unary factor evaluates this tag; its values depend on the corresponding word
(could be made to depend on the entire observed sentence)
13. Local factors in a graphical model
- First, a familiar example
- Conditional Random Field (CRF) for POS tagging
Unary factor evaluates this tag; a different unary factor at each position
14. Local factors in a graphical model
- First, a familiar example
- Conditional Random Field (CRF) for POS tagging
p(v a n) is proportional to the product of all factors' values on v a n
15. Local factors in a graphical model
- First, a familiar example
- Conditional Random Field (CRF) for POS tagging
p(v a n) is proportional to the product of all factors' values on v a n
[Figure: the factor values along the tagging v a n of "find preferred tags" are multiplied together, e.g. 1 · 3 · 0.3 · 0.1 · 0.2]
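As a toy illustration of that product (my own sketch; the factor tables below are invented, not the numbers pictured on the slide), the unnormalized score of a tagging multiplies each unary factor's value on its tag with each binary factor's value on the adjacent tag pair:

```python
# Toy CRF scoring for the sentence "find preferred tags".
# Unary values (word, tag) and binary values (tag, next_tag) are made up.
words = ["find", "preferred", "tags"]
unary = {
    ("find", "v"): 3.0, ("find", "n"): 1.0, ("find", "a"): 0.1,
    ("preferred", "v"): 0.2, ("preferred", "n"): 0.5, ("preferred", "a"): 2.0,
    ("tags", "v"): 0.1, ("tags", "n"): 4.0, ("tags", "a"): 0.1,
}
binary = {  # compatibility of adjacent tags (same table reused at every position)
    ("v", "a"): 2.0, ("a", "n"): 3.0, ("v", "n"): 1.0,
    ("v", "v"): 0.3, ("a", "a"): 0.2, ("n", "n"): 0.5,
    ("n", "v"): 0.8, ("n", "a"): 0.4, ("a", "v"): 0.1,
}

def score(tags):
    """Unnormalized p(tags | words): product of all factor values touched."""
    s = 1.0
    for w, t in zip(words, tags):
        s *= unary[(w, t)]
    for t1, t2 in zip(tags, tags[1:]):
        s *= binary[(t1, t2)]
    return s

print(score(["v", "a", "n"]))   # = 3.0 * 2.0 * 4.0 * 2.0 * 3.0 = 144.0
print(score(["v", "v", "v"]))   # a competing tagging gets a different score
```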
16. Local factors in a graphical model
- First, a familiar example
- CRF for POS tagging
- Now let's do dependency parsing!
- O(n²) boolean variables for the possible links
17. Local factors in a graphical model
- First, a familiar example
- CRF for POS tagging
- Now let's do dependency parsing!
- O(n²) boolean variables for the possible links
Possible parse
encoded as an assignment to these vars
18. Local factors in a graphical model
- First, a familiar example
- CRF for POS tagging
- Now let's do dependency parsing!
- O(n²) boolean variables for the possible links
Possible parse
encoded as an assignment to these vars
Another possible parse
[Figure: a different true/false assignment to the link variables over "find preferred links"]
19. Local factors in a graphical model
- First, a familiar example
- CRF for POS tagging
- Now let's do dependency parsing!
- O(n²) boolean variables for the possible links
Possible parse
encoded as an assignment to these vars
Another possible parse
An illegal parse
20. Local factors in a graphical model
- First, a familiar example
- CRF for POS tagging
- Now let's do dependency parsing!
- O(n²) boolean variables for the possible links
Possible parse
encoded as an assignment to these vars
Another possible parse
An illegal parse; another illegal parse (multiple parents)
[Figure: link-variable assignments that violate the tree constraint]
21. Local factors for parsing
- So what factors shall we multiply to define parse probability?
- Unary factors to evaluate each link in isolation
But what if the best assignment isn't a tree??
as before, goodness of this link can depend on the entire observed input context
some other links aren't as good given this input sentence
22. Global factors for parsing
- So what factors shall we multiply to define parse probability?
- Unary factors to evaluate each link in isolation
- Global TREE factor to require that the links form a legal tree
- this is a hard constraint: factor is either 0 or 1
23. Global factors for parsing
- So what factors shall we multiply to define parse probability?
- Unary factors to evaluate each link in isolation
- Global TREE factor to require that the links form a legal tree
- this is a hard constraint: factor is either 0 or 1
[Figure: the TREE factor's table over the 6 link variables has 64 entries, each 0 or 1; the assignment shown gets value 1 ("we're legal!")]
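A minimal sketch of what this hard constraint computes (my own illustration, not code from the paper): the factor returns 1 exactly when the chosen links form a legal dependency tree, i.e., every word has exactly one parent and following parents always reaches the root without cycles:

```python
def tree_factor(n, links):
    """Hard TREE constraint: 1 if `links` (a set of (head, child) pairs over
    words 1..n, with 0 as the root) forms a legal dependency tree, else 0."""
    heads = {}
    for h, c in links:
        if c in heads:                       # multiple parents -> illegal
            return 0
        heads[c] = h
    if set(heads) != set(range(1, n + 1)):   # every word needs a head
        return 0
    for w in range(1, n + 1):                # follow head pointers to the root
        seen, cur = set(), w
        while cur != 0:
            if cur in seen:                  # cycle -> illegal
                return 0
            seen.add(cur)
            cur = heads[cur]
    return 1

# Words: 1=find, 2=preferred, 3=links.
print(tree_factor(3, {(0, 1), (1, 3), (3, 2)}))           # legal tree -> 1
print(tree_factor(3, {(0, 1), (1, 3), (3, 2), (2, 3)}))   # word 3 has two parents -> 0
```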
24. Local factors for parsing
- So what factors shall we multiply to define parse probability?
- Unary factors to evaluate each link in isolation
- Global TREE factor to require that the links form a legal tree
- this is a hard constraint: factor is either 0 or 1
- Second-order effects: factors on 2 variables
- grandparent
[Figure: a grandparent factor on two link variables; e.g., value 3 when both are t]
25. Local factors for parsing
- So what factors shall we multiply to define parse probability?
- Unary factors to evaluate each link in isolation
- Global TREE factor to require that the links form a legal tree
- this is a hard constraint: factor is either 0 or 1
- Second-order effects: factors on 2 variables
- grandparent
- no-cross
26. Local factors for parsing
- So what factors shall we multiply to define parse probability?
- Unary factors to evaluate each link in isolation
- Global TREE factor to require that the links form a legal tree
- this is a hard constraint: factor is either 0 or 1
- Second-order effects: factors on 2 variables
- grandparent
- no-cross
- siblings
- hidden POS tags
- subcategorization
- ...
27. Good to have lots of features, but...
- Nice model
- Shame about the NP-hardness
- Can we approximate?
- Machine Learning (aka Statistical Physics) to
the Rescue!
28. Great Ideas in ML: Message Passing
Count the soldiers
[Figure: soldiers in a line passing counts along: "1 before you" ... "5 before you" in one direction, "1 behind you" ... "5 behind you" in the other]
MacKay 2003
29. Great Ideas in ML: Message Passing
Count the soldiers
Belief: Must be 2 + 1 + 3 = 6 of us
(incoming messages: "2 before you", "3 behind you")
only see my incoming messages
MacKay 2003
30. Great Ideas in ML: Message Passing
Count the soldiers
(incoming messages: "1 before you", "4 behind you")
only see my incoming messages
MacKay 2003
31. Great Ideas in ML: Message Passing
Each soldier receives reports from all branches
of tree
3 here
7 here
11 here (= 7 + 3 + 1)
MacKay 2003
32. Great Ideas in ML: Message Passing
Each soldier receives reports from all branches
of tree
3 here
7 here (= 3 + 3 + 1)
3 here
MacKay 2003
33. Great Ideas in ML: Message Passing
Each soldier receives reports from all branches
of tree
11 here (= 7 + 3 + 1)
7 here
3 here
MacKay 2003
34. Great Ideas in ML: Message Passing
Each soldier receives reports from all branches
of tree
3 here
7 here
Belief: Must be 14 of us
3 here
MacKay 2003
35. Great Ideas in ML: Message Passing
Each soldier receives reports from all branches
of tree
3 here
7 here
Belief: Must be 14 of us
3 here
wouldn't work correctly with a loopy (cyclic) graph
MacKay 2003
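The soldier trick is exactly sum-product message passing with integer messages. A minimal sketch (my own toy tree, not the one on the slides): each node's message across an edge is 1 plus the messages arriving from its other neighbors, and its belief is 1 plus everything arriving:

```python
# Count the nodes of a tree by message passing (the soldier-counting idea).
# The tree below is invented for illustration.
neighbors = {
    "a": ["b"], "b": ["a", "c", "d"], "c": ["b"],
    "d": ["b", "e", "f"], "e": ["d"], "f": ["d"],
}

def message(src, dst):
    """Soldiers on src's side of the src->dst edge (src itself included)."""
    return 1 + sum(message(nb, src) for nb in neighbors[src] if nb != dst)

def belief(node):
    """Total count as seen from `node`: itself plus every incoming report."""
    return 1 + sum(message(nb, node) for nb in neighbors[node])

print(belief("b"))   # 6 -- every node computes the same total
print(belief("e"))   # 6
```

On a loopy (cyclic) graph this recursion would never bottom out, which is exactly the caveat on the slide above.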
36. Great ideas in ML: Forward-Backward
- In the CRF, message passing = forward-backward
[Figure: chain CRF over "find preferred tags" with forward (α) and backward (β) messages combining into a belief at each tag variable]
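A compact sketch of forward-backward as message passing on a toy chain CRF (my own illustration; the potentials are invented, not the paper's). The forward vector α (which folds in the unary potentials so far) and the backward vector β multiply to give each position's belief:

```python
import numpy as np

# Toy chain: 3 positions ("find preferred tags"), 3 tags (v, a, n).
tags = ["v", "a", "n"]
unary = np.array([[3.0, 0.1, 1.0],     # position 0 ("find")
                  [0.2, 2.0, 0.5],     # position 1 ("preferred")
                  [0.1, 0.1, 4.0]])    # position 2 ("tags")
binary = np.array([[0.3, 2.0, 1.0],    # binary[t1, t2]: adjacent-tag compatibility
                   [0.1, 0.2, 3.0],
                   [0.8, 0.4, 0.5]])

n = unary.shape[0]
alpha = np.zeros_like(unary)
beta = np.zeros_like(unary)
alpha[0] = unary[0]
for i in range(1, n):                       # forward messages
    alpha[i] = unary[i] * (alpha[i - 1] @ binary)
beta[n - 1] = 1.0
for i in range(n - 2, -1, -1):              # backward messages
    beta[i] = binary @ (unary[i + 1] * beta[i + 1])

Z = alpha[-1].sum()                         # partition function
beliefs = alpha * beta / Z                  # per-position tag marginals
for i, word in enumerate(["find", "preferred", "tags"]):
    print(word, dict(zip(tags, beliefs[i].round(3))))
```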
37. Great ideas in ML: Forward-Backward
- Extend CRF to a skip chain to capture a non-local factor
- More influences on belief
38. Great ideas in ML: Forward-Backward
- Extend CRF to a skip chain to capture a non-local factor
- More influences on belief
- Graph becomes loopy
Red messages not independent? Pretend they are!
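Here is a hedged sketch of loopy BP itself on the smallest possible loop (my own toy example: three binary variables in a cycle with invented pairwise potentials). Messages are iterated exactly as if the graph were a tree, normalizing as we go, and the resulting beliefs are approximate marginals:

```python
import numpy as np

# Loopy BP on three binary variables in a cycle, one pairwise potential per edge.
psi = {  # psi[(i, j)][xi, xj], numbers invented for illustration
    (0, 1): np.array([[2.0, 1.0], [1.0, 2.0]]),
    (1, 2): np.array([[2.0, 1.0], [1.0, 2.0]]),
    (2, 0): np.array([[1.0, 2.0], [2.0, 1.0]]),
}
edges = list(psi) + [(j, i) for i, j in psi]   # both directions
pot = {(i, j): (psi[(i, j)] if (i, j) in psi else psi[(j, i)].T) for i, j in edges}
msg = {e: np.ones(2) for e in edges}           # initialize all messages to 1

for _ in range(50):                            # iterate and hope for convergence
    new = {}
    for i, j in edges:
        others = [msg[(k, l)] for k, l in edges if l == i and k != j]
        incoming = np.prod(others, axis=0) if others else np.ones(2)
        m = pot[(i, j)].T @ incoming           # sum over x_i
        new[(i, j)] = m / m.sum()              # normalize for numerical stability
    msg = new

beliefs = {}
for i in range(3):
    b = np.prod([msg[(k, l)] for k, l in edges if l == i], axis=0)
    beliefs[i] = b / b.sum()
print(beliefs)   # exact on trees, approximate on this loop -- "pretend they are!"
```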
39. Loopy Belief Propagation for Parsing
- Higher-order factors (e.g., Grandparent) induce loops
- Let's watch a loop around one triangle
- Strong links are suppressing or promoting other links
40. Loopy Belief Propagation for Parsing
- Higher-order factors (e.g., Grandparent) induce loops
- Let's watch a loop around one triangle
- How did we compute the outgoing message to the green link?
- Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?
41. Loopy Belief Propagation for Parsing
- How did we compute the outgoing message to the green link?
- Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?
Belief propagation assumes incoming messages to
TREE are independent. So outgoing messages can be
computed with first-order parsing algorithms
(fast, no grammar constant).
42. Runtimes for each factor type (see paper)
per iteration
Additive, not multiplicative!
43. Runtimes for each factor type (see paper)
Each global factor coordinates an unbounded number of variables. Standard belief propagation would take exponential time to iterate over all configurations of those variables. See the paper for efficient propagators.
Additive, not multiplicative!
44. Experimental Details
- Decoding
- Run several iterations of belief propagation
- Get final beliefs at link variables
- Feed them into first-order parser
- This gives the Minimum Bayes Risk tree
- Training
- BP computes beliefs about each factor, too
- which gives us gradients for max conditional likelihood
- (as in forward-backward algorithm)
- Features used in experiments
- First-order: individual links, just as in McDonald et al. 2005
- Higher-order: Grandparent, Sibling bigrams, NoCross
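A sketch of the decoding recipe above (my own illustration of the idea, not the paper's code; it assumes networkx is available for the maximum spanning arborescence): the belief that each link is present becomes an edge weight, and a first-order non-projective parser picks the tree with the highest total belief, i.e., the Minimum Bayes Risk tree under per-edge loss:

```python
import networkx as nx

def mbr_parse(beliefs):
    """Pick the tree that maximizes the sum of link beliefs.

    `beliefs[(h, m)]` = BP's final belief that the link from head h to
    modifier m is present (0 = root).  Maximizing the total belief of the
    chosen edges is Minimum Bayes Risk decoding under per-edge loss.
    """
    g = nx.DiGraph()
    for (h, m), p in beliefs.items():
        g.add_edge(h, m, weight=p)
    tree = nx.maximum_spanning_arborescence(g, attr="weight")
    return sorted(tree.edges())

# Made-up beliefs for a 3-word sentence ("find preferred links"):
beliefs = {
    (0, 1): 0.9, (0, 2): 0.1, (0, 3): 0.2,
    (1, 2): 0.3, (1, 3): 0.8, (3, 2): 0.7, (2, 3): 0.1,
}
print(mbr_parse(beliefs))   # [(0, 1), (1, 3), (3, 2)]
```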
45. Dependency Accuracy: The extra, higher-order features help! (non-projective parsing)
46. Dependency Accuracy: The extra, higher-order features help! (non-projective parsing)
exact, slow
doesn't fix enough edges
47. Time vs. Projective Search Error
[Figure: projective search error vs. number of BP iterations, compared with O(n⁴) and O(n⁵) DP]
48. Freedom Regained
This paper in a nutshell
- Output probability defined as product of local and global factors
- Throw in any factors we want! (log-linear model)
- Each factor must be fast, but they run independently
- Let local factors negotiate via belief propagation
- Each bit of syntactic structure is influenced by others
- Some factors need combinatorial algorithms to compute messages fast
- e.g., existing parsing algorithms using dynamic programming
- Each iteration takes total time O(n³) or even O(n²); see paper
- Compare reranking and stacking
- Converges to a pretty good (but approximate) global parse
- Fast parsing for formerly intractable or slow models
- Extra features help accuracy
49. Future Opportunities
- Modeling hidden structure
- POS tags, link roles, secondary links (DAG-shaped parses)
- Beyond dependencies
- Constituency parsing, traces, lattice parsing
- Beyond parsing
- Alignment, translation
- Bipartite matching and network flow
- Joint decoding of parsing and other tasks (IE, MT, ...)
- Beyond text
- Image tracking and retrieval
- Social networks
50. Thanks!
51. The TREE Factor
- What is this message?
- P(3→2 link | other links)
- So, if P(2→3 link) = 1, then P(3→2 link | other links) = 0
- For projective trees: the outside probability
- For non-projective: inverse Kirchhoff
- Edge-factored parsing!
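A hedged sketch of the "inverse Kirchhoff" column for the non-projective case (my own illustration following the standard matrix-tree construction; the weights are invented, and this is the multi-root variant, so the root may take several children): the determinant of the Kirchhoff (Laplacian) matrix gives the total weight Z of all trees, and entries of its inverse give the per-edge marginals that the TREE factor's outgoing messages need:

```python
import numpy as np

def edge_marginals(root_w, w):
    """Edge marginals for edge-factored non-projective parsing via the
    matrix-tree theorem ("inverse Kirchhoff").

    root_w[m] : weight of the root -> (m+1) link
    w[h, m]   : weight of the (h+1) -> (m+1) link (diagonal is ignored)
    """
    W = np.array(w, dtype=float)
    np.fill_diagonal(W, 0.0)
    # Kirchhoff/Laplacian: column m's diagonal holds its total incoming weight.
    L = np.diag(root_w + W.sum(axis=0)) - W
    Z = np.linalg.det(L)                          # total weight of all trees
    Linv = np.linalg.inv(L)
    d = np.diag(Linv)
    root_marg = root_w * d                        # P(root -> m)
    link_marg = W * (d[np.newaxis, :] - Linv.T)   # P(h -> m)
    np.fill_diagonal(link_marg, 0.0)
    return Z, root_marg, link_marg

# Invented weights for a 3-word sentence.
root_w = np.array([10.0, 1.0, 1.0])
w = np.array([[0.0, 5.0, 2.0],
              [1.0, 0.0, 4.0],
              [1.0, 3.0, 0.0]])
Z, root_marg, link_marg = edge_marginals(root_w, w)
print(Z)            # partition function
print(root_marg)    # marginals of root attachments
print(link_marg)    # marginals of word-to-word links (each column + root sums to 1)
```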
52. Runtime: BP vs. DP
Vs. O(n⁴) DP
Vs. O(n⁵) DP