1. Dependency Parsing by Belief Propagation
- David Smith (JHU → UMass Amherst)
- Jason Eisner (JHU)
2. Great ideas in NLP: Log-linear models (Berger, Della Pietra & Della Pietra 1996; Darroch & Ratcliff 1972)
- In the beginning, we used generative models.
p(A) · p(B|A) · p(C|A,B) · p(D|A,B,C)
each choice depends on a limited part of the
history
but which dependencies to allow? What if they're all worthwhile?
3. Great ideas in NLP: Log-linear models (Berger, Della Pietra & Della Pietra 1996; Darroch & Ratcliff 1972)
- In the beginning, we used generative models.
- Solution: Log-linear (max-entropy) modeling
- Features may interact in arbitrary ways
- Iterative scaling keeps adjusting the feature weights until the model agrees with the training data!
p(A) · p(B|A) · p(C|A,B) · p(D|A,B,C)
which dependencies to allow? (given limited
training data)
(1/Z) · F(A) · F(B,A) · F(C,A) · F(C,B) · F(D,A,B) · F(D,B,C) · F(D,A,C)
throw them all in!
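To make the "throw them all in" idea concrete, here is a minimal sketch (my own toy illustration; the factor names and weights are invented, not from the paper) of a log-linear model that scores an assignment by multiplying overlapping factors and normalizing by Z:

```python
import itertools

# Hypothetical toy example: four binary variables A, B, C, D and a few
# overlapping factors.  Each factor maps its variables' values to a
# nonnegative weight; the model multiplies them all together.
factors = {
    ("A",):          lambda a: 2.0 if a else 1.0,
    ("B", "A"):      lambda b, a: 3.0 if b == a else 0.5,
    ("C", "A"):      lambda c, a: 1.5 if c and a else 1.0,
    ("D", "A", "B"): lambda d, a, b: 4.0 if d == (a and b) else 0.2,
}

def unnormalized_score(assignment):
    """Product of all factor values on one assignment (a dict var -> bool)."""
    score = 1.0
    for vars_, f in factors.items():
        score *= f(*(assignment[v] for v in vars_))
    return score

# Z sums the unnormalized score over all 2^4 assignments (fine at toy sizes).
Z = sum(unnormalized_score(dict(zip("ABCD", vals)))
        for vals in itertools.product([False, True], repeat=4))

p = unnormalized_score({"A": True, "B": True, "C": False, "D": True}) / Z
print(f"p(A=1, B=1, C=0, D=1) = {p:.3f}")
```

With only four binary variables, Z can be computed by brute force; the rest of the talk is about what to do when the variables describe a whole parse.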
4. How about structured outputs?
- Log-linear models great for n-way classification
- Also good for predicting sequences
- Also good for dependency parsing
but to allow fast dynamic programming, only use
n-gram features
but to allow fast dynamic programming or MST parsing, only use single-edge features
5. How about structured outputs?
but to allow fast dynamic programming or MST parsing, only use single-edge features
- With arbitrary features, runtime blows up
- Projective parsing: O(n³) by dynamic programming
- Non-projective: O(n²) by minimum spanning tree
6. Let's reclaim our freedom (again!)
This paper in a nutshell
- Output probability is a product of local factors
- Throw in any factors we want! (log-linear model)
- Let local factors negotiate via belief propagation
- Links (and tags) reinforce or suppress one another
- Each iteration takes total time O(n²) or O(n³)
- Converges to a pretty good (but approx.) global
parse
(1/Z) · F(A) · F(B,A) · F(C,A) · F(C,B) · F(D,A,B) · F(D,B,C) · F(D,A,C)
7. Let's reclaim our freedom (again!)
This paper in a nutshell
New!
8. Local factors in a graphical model
- First, a familiar example
- Conditional Random Field (CRF) for POS tagging
Possible tagging (i.e., assignment to remaining
variables)
[Figure: chain CRF over the observed sentence "find preferred tags" (shaded), with one tag variable per word; the tagging shown is v v v]
9. Local factors in a graphical model
- First, a familiar example
- Conditional Random Field (CRF) for POS tagging
Possible tagging (i.e., assignment to remaining variables); another possible tagging
[Figure: the tag variables over the observed sentence "find preferred tags" now assigned v a n]
10. Local factors in a graphical model
- First, a familiar example
- Conditional Random Field (CRF) for POS tagging
Binary factor that measures compatibility of 2
adjacent tags
Model reuses same parameters at this position
11. Local factors in a graphical model
- First, a familiar example
- Conditional Random Field (CRF) for POS tagging
Unary factor evaluates this tag; its values depend on the corresponding word
can't be adj
12. Local factors in a graphical model
- First, a familiar example
- Conditional Random Field (CRF) for POS tagging
Unary factor evaluates this tag; its values depend on the corresponding word
(could be made to depend on the entire observed sentence)
13. Local factors in a graphical model
- First, a familiar example
- Conditional Random Field (CRF) for POS tagging
Unary factor evaluates this tag; a different unary factor at each position
14. Local factors in a graphical model
- First, a familiar example
- Conditional Random Field (CRF) for POS tagging
p(v a n) is proportional to the product of all factors' values on v a n
15. Local factors in a graphical model
- First, a familiar example
- Conditional Random Field (CRF) for POS tagging
p(v a n) is proportional to the product of all factors' values on v a n
[Figure: the factor values along the tagging v a n of "find preferred tags" are multiplied together, e.g. 1 · 3 · 0.3 · 0.1 · 0.2]
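As a toy illustration of that product (my own sketch; the factor tables below are invented, not the numbers pictured on the slide), the unnormalized score of a tagging multiplies each unary factor's value on its tag with each binary factor's value on the adjacent tag pair:

```python
# Toy CRF scoring for the sentence "find preferred tags".
# Unary values (word, tag) and binary values (tag, next_tag) are made up.
words = ["find", "preferred", "tags"]
unary = {
    ("find", "v"): 3.0, ("find", "n"): 1.0, ("find", "a"): 0.1,
    ("preferred", "v"): 0.2, ("preferred", "n"): 0.5, ("preferred", "a"): 2.0,
    ("tags", "v"): 0.1, ("tags", "n"): 4.0, ("tags", "a"): 0.1,
}
binary = {  # compatibility of adjacent tags (same table reused at every position)
    ("v", "a"): 2.0, ("a", "n"): 3.0, ("v", "n"): 1.0,
    ("v", "v"): 0.3, ("a", "a"): 0.2, ("n", "n"): 0.5,
    ("n", "v"): 0.8, ("n", "a"): 0.4, ("a", "v"): 0.1,
}

def score(tags):
    """Unnormalized p(tags | words): product of all factor values touched."""
    s = 1.0
    for w, t in zip(words, tags):
        s *= unary[(w, t)]
    for t1, t2 in zip(tags, tags[1:]):
        s *= binary[(t1, t2)]
    return s

print(score(["v", "a", "n"]))   # = 3.0 * 2.0 * 4.0 * 2.0 * 3.0 = 144.0
print(score(["v", "v", "v"]))   # a competing tagging gets a different score
```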
16. Local factors in a graphical model
- First, a familiar example
- CRF for POS tagging
- Now let's do dependency parsing!
- O(n²) boolean variables for the possible links
17. Local factors in a graphical model
- First, a familiar example
- CRF for POS tagging
- Now let's do dependency parsing!
- O(n²) boolean variables for the possible links
Possible parse
encoded as an assignment to these vars
18. Local factors in a graphical model
- First, a familiar example
- CRF for POS tagging
- Now let's do dependency parsing!
- O(n²) boolean variables for the possible links
Possible parse
encoded as an assignment to these vars
Another possible parse
[Figure: a different true/false assignment to the link variables over "find preferred links"]
19. Local factors in a graphical model
- First, a familiar example
- CRF for POS tagging
- Now let's do dependency parsing!
- O(n²) boolean variables for the possible links
Possible parse
encoded as an assignment to these vars
Another possible parse
An illegal parse
20. Local factors in a graphical model
- First, a familiar example
- CRF for POS tagging
- Now let's do dependency parsing!
- O(n²) boolean variables for the possible links
Possible parse
encoded as an assignment to these vars
Another possible parse
An illegal parse; another illegal parse (multiple parents)
[Figure: link-variable assignments that violate the tree constraint]
21. Local factors for parsing
- So what factors shall we multiply to define parse probability?
- Unary factors to evaluate each link in isolation
But what if the best assignment isn't a tree??
as before, goodness of this link can depend on the entire observed input context
some other links aren't as good given this input sentence
22. Global factors for parsing
- So what factors shall we multiply to define parse probability?
- Unary factors to evaluate each link in isolation
- Global TREE factor to require that the links form a legal tree
- this is a hard constraint: factor is either 0 or 1
23. Global factors for parsing
- So what factors shall we multiply to define parse probability?
- Unary factors to evaluate each link in isolation
- Global TREE factor to require that the links form a legal tree
- this is a hard constraint: factor is either 0 or 1
[Figure: the TREE factor's table over the 6 link variables has 64 entries, each 0 or 1; the assignment shown gets value 1 ("we're legal!")]
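A minimal sketch of what this hard constraint computes (my own illustration, not code from the paper): the factor returns 1 exactly when the chosen links form a legal dependency tree, i.e., every word has exactly one parent and following parents always reaches the root without cycles:

```python
def tree_factor(n, links):
    """Hard TREE constraint: 1 if `links` (a set of (head, child) pairs over
    words 1..n, with 0 as the root) forms a legal dependency tree, else 0."""
    heads = {}
    for h, c in links:
        if c in heads:                       # multiple parents -> illegal
            return 0
        heads[c] = h
    if set(heads) != set(range(1, n + 1)):   # every word needs a head
        return 0
    for w in range(1, n + 1):                # follow head pointers to the root
        seen, cur = set(), w
        while cur != 0:
            if cur in seen:                  # cycle -> illegal
                return 0
            seen.add(cur)
            cur = heads[cur]
    return 1

# Words: 1=find, 2=preferred, 3=links.
print(tree_factor(3, {(0, 1), (1, 3), (3, 2)}))           # legal tree -> 1
print(tree_factor(3, {(0, 1), (1, 3), (3, 2), (2, 3)}))   # word 3 has two parents -> 0
```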
24. Local factors for parsing
- So what factors shall we multiply to define parse probability?
- Unary factors to evaluate each link in isolation
- Global TREE factor to require that the links form a legal tree
- this is a hard constraint: factor is either 0 or 1
- Second-order effects: factors on 2 variables
- grandparent
[Figure: a grandparent factor on two link variables; e.g., value 3 when both are t]
25. Local factors for parsing
- So what factors shall we multiply to define parse probability?
- Unary factors to evaluate each link in isolation
- Global TREE factor to require that the links form a legal tree
- this is a hard constraint: factor is either 0 or 1
- Second-order effects: factors on 2 variables
- grandparent
- no-cross
26. Local factors for parsing
- So what factors shall we multiply to define parse probability?
- Unary factors to evaluate each link in isolation
- Global TREE factor to require that the links form a legal tree
- this is a hard constraint: factor is either 0 or 1
- Second-order effects: factors on 2 variables
- grandparent
- no-cross
- siblings
- hidden POS tags
- subcategorization
- ...
27. Good to have lots of features, but...
- Nice model
- Shame about the NP-hardness
- Can we approximate?
- Machine Learning (aka Statistical Physics) to
the Rescue!
28. Great Ideas in ML: Message Passing
Count the soldiers
[Figure: soldiers in a line passing counts along: "1 before you" ... "5 before you" in one direction, "1 behind you" ... "5 behind you" in the other]
MacKay 2003
29. Great Ideas in ML: Message Passing
Count the soldiers
Belief: Must be 2 + 1 + 3 = 6 of us
(incoming messages: "2 before you", "3 behind you")
only see my incoming messages
MacKay 2003
30. Great Ideas in ML: Message Passing
Count the soldiers
(incoming messages: "1 before you", "4 behind you")
only see my incoming messages
MacKay 2003
31. Great Ideas in ML: Message Passing
Each soldier receives reports from all branches
of tree
3 here
7 here
11 here (= 7 + 3 + 1)
MacKay 2003
32. Great Ideas in ML: Message Passing
Each soldier receives reports from all branches
of tree
3 here
7 here (= 3 + 3 + 1)
3 here
MacKay 2003
33. Great Ideas in ML: Message Passing
Each soldier receives reports from all branches
of tree
11 here (= 7 + 3 + 1)
7 here
3 here
MacKay 2003
34. Great Ideas in ML: Message Passing
Each soldier receives reports from all branches
of tree
3 here
7 here
Belief: Must be 14 of us
3 here
MacKay 2003
35. Great Ideas in ML: Message Passing
Each soldier receives reports from all branches
of tree
3 here
7 here
Belief: Must be 14 of us
3 here
wouldn't work correctly with a loopy (cyclic) graph
MacKay 2003
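The soldier trick is exactly sum-product message passing with integer messages. A minimal sketch (my own toy tree, not the one on the slides): each node's message across an edge is 1 plus the messages arriving from its other neighbors, and its belief is 1 plus everything arriving:

```python
# Count the nodes of a tree by message passing (the soldier-counting idea).
# The tree below is invented for illustration.
neighbors = {
    "a": ["b"], "b": ["a", "c", "d"], "c": ["b"],
    "d": ["b", "e", "f"], "e": ["d"], "f": ["d"],
}

def message(src, dst):
    """Soldiers on src's side of the src->dst edge (src itself included)."""
    return 1 + sum(message(nb, src) for nb in neighbors[src] if nb != dst)

def belief(node):
    """Total count as seen from `node`: itself plus every incoming report."""
    return 1 + sum(message(nb, node) for nb in neighbors[node])

print(belief("b"))   # 6 -- every node computes the same total
print(belief("e"))   # 6
```

On a loopy (cyclic) graph this recursion would never bottom out, which is exactly the caveat on the slide above.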
36. Great ideas in ML: Forward-Backward
- In the CRF, message passing = forward-backward
[Figure: chain CRF over "find preferred tags" with forward (α) and backward (β) messages combining into a belief at each tag variable]
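A compact sketch of forward-backward as message passing on a toy chain CRF (my own illustration; the potentials are invented, not the paper's). The forward vector α (which folds in the unary potentials so far) and the backward vector β multiply to give each position's belief:

```python
import numpy as np

# Toy chain: 3 positions ("find preferred tags"), 3 tags (v, a, n).
tags = ["v", "a", "n"]
unary = np.array([[3.0, 0.1, 1.0],     # position 0 ("find")
                  [0.2, 2.0, 0.5],     # position 1 ("preferred")
                  [0.1, 0.1, 4.0]])    # position 2 ("tags")
binary = np.array([[0.3, 2.0, 1.0],    # binary[t1, t2]: adjacent-tag compatibility
                   [0.1, 0.2, 3.0],
                   [0.8, 0.4, 0.5]])

n = unary.shape[0]
alpha = np.zeros_like(unary)
beta = np.zeros_like(unary)
alpha[0] = unary[0]
for i in range(1, n):                       # forward messages
    alpha[i] = unary[i] * (alpha[i - 1] @ binary)
beta[n - 1] = 1.0
for i in range(n - 2, -1, -1):              # backward messages
    beta[i] = binary @ (unary[i + 1] * beta[i + 1])

Z = alpha[-1].sum()                         # partition function
beliefs = alpha * beta / Z                  # per-position tag marginals
for i, word in enumerate(["find", "preferred", "tags"]):
    print(word, dict(zip(tags, beliefs[i].round(3))))
```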
37. Great ideas in ML: Forward-Backward
- Extend CRF to a skip chain to capture a non-local factor
- More influences on belief
38. Great ideas in ML: Forward-Backward
- Extend CRF to a skip chain to capture a non-local factor
- More influences on belief
- Graph becomes loopy
Red messages not independent? Pretend they are!
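Here is a hedged sketch of loopy BP itself on the smallest possible loop (my own toy example: three binary variables in a cycle with invented pairwise potentials). Messages are iterated exactly as if the graph were a tree, normalizing as we go, and the resulting beliefs are approximate marginals:

```python
import numpy as np

# Loopy BP on three binary variables in a cycle, one pairwise potential per edge.
psi = {  # psi[(i, j)][xi, xj], numbers invented for illustration
    (0, 1): np.array([[2.0, 1.0], [1.0, 2.0]]),
    (1, 2): np.array([[2.0, 1.0], [1.0, 2.0]]),
    (2, 0): np.array([[1.0, 2.0], [2.0, 1.0]]),
}
edges = list(psi) + [(j, i) for i, j in psi]   # both directions
pot = {(i, j): (psi[(i, j)] if (i, j) in psi else psi[(j, i)].T) for i, j in edges}
msg = {e: np.ones(2) for e in edges}           # initialize all messages to 1

for _ in range(50):                            # iterate and hope for convergence
    new = {}
    for i, j in edges:
        others = [msg[(k, l)] for k, l in edges if l == i and k != j]
        incoming = np.prod(others, axis=0) if others else np.ones(2)
        m = pot[(i, j)].T @ incoming           # sum over x_i
        new[(i, j)] = m / m.sum()              # normalize for numerical stability
    msg = new

beliefs = {}
for i in range(3):
    b = np.prod([msg[(k, l)] for k, l in edges if l == i], axis=0)
    beliefs[i] = b / b.sum()
print(beliefs)   # exact on trees, approximate on this loop -- "pretend they are!"
```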
39. Loopy Belief Propagation for Parsing
- Higher-order factors (e.g., Grandparent) induce loops
- Let's watch a loop around one triangle
- Strong links are suppressing or promoting other links
40. Loopy Belief Propagation for Parsing
- Higher-order factors (e.g., Grandparent) induce loops
- Let's watch a loop around one triangle
- How did we compute the outgoing message to the green link?
- Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?
41. Loopy Belief Propagation for Parsing
- How did we compute the outgoing message to the green link?
- Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?
Belief propagation assumes incoming messages to
TREE are independent. So outgoing messages can be
computed with first-order parsing algorithms
(fast, no grammar constant).
42. Runtimes for each factor type (see paper)
per iteration
Additive, not multiplicative!
43. Runtimes for each factor type (see paper)
Each global factor coordinates an unbounded number of variables. Standard belief propagation would take exponential time to iterate over all configurations of those variables. See the paper for efficient propagators.
Additive, not multiplicative!
44. Experimental Details
- Decoding
- Run several iterations of belief propagation
- Get final beliefs at link variables
- Feed them into first-order parser
- This gives the Minimum Bayes Risk tree
- Training
- BP computes beliefs about each factor, too
- which gives us gradients for max conditional likelihood
- (as in forward-backward algorithm)
- Features used in experiments
- First-order: individual links, just as in McDonald et al. 2005
- Higher-order: Grandparent, Sibling bigrams, NoCross
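A sketch of the decoding recipe above (my own illustration of the idea, not the paper's code; it assumes networkx is available for the maximum spanning arborescence): the belief that each link is present becomes an edge weight, and a first-order non-projective parser picks the tree with the highest total belief, i.e., the Minimum Bayes Risk tree under per-edge loss:

```python
import networkx as nx

def mbr_parse(beliefs):
    """Pick the tree that maximizes the sum of link beliefs.

    `beliefs[(h, m)]` = BP's final belief that the link from head h to
    modifier m is present (0 = root).  Maximizing the total belief of the
    chosen edges is Minimum Bayes Risk decoding under per-edge loss.
    """
    g = nx.DiGraph()
    for (h, m), p in beliefs.items():
        g.add_edge(h, m, weight=p)
    tree = nx.maximum_spanning_arborescence(g, attr="weight")
    return sorted(tree.edges())

# Made-up beliefs for a 3-word sentence ("find preferred links"):
beliefs = {
    (0, 1): 0.9, (0, 2): 0.1, (0, 3): 0.2,
    (1, 2): 0.3, (1, 3): 0.8, (3, 2): 0.7, (2, 3): 0.1,
}
print(mbr_parse(beliefs))   # [(0, 1), (1, 3), (3, 2)]
```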
45. Dependency Accuracy: The extra, higher-order features help! (non-projective parsing)
46. Dependency Accuracy: The extra, higher-order features help! (non-projective parsing)
exact, slow
doesn't fix enough edges
47. Time vs. Projective Search Error
[Figure: projective search error vs. number of BP iterations, compared with O(n⁴) and O(n⁵) DP]
48. Freedom Regained
This paper in a nutshell
- Output probability defined as product of local and global factors
- Throw in any factors we want! (log-linear model)
- Each factor must be fast, but they run independently
- Let local factors negotiate via belief propagation
- Each bit of syntactic structure is influenced by others
- Some factors need combinatorial algorithms to compute messages fast
- e.g., existing parsing algorithms using dynamic programming
- Each iteration takes total time O(n³) or even O(n²); see paper
- Compare reranking and stacking
- Converges to a pretty good (but approximate) global parse
- Fast parsing for formerly intractable or slow models
- Extra features help accuracy
49. Future Opportunities
- Modeling hidden structure
- POS tags, link roles, secondary links (DAG-shaped parses)
- Beyond dependencies
- Constituency parsing, traces, lattice parsing
- Beyond parsing
- Alignment, translation
- Bipartite matching and network flow
- Joint decoding of parsing and other tasks (IE, MT, ...)
- Beyond text
- Image tracking and retrieval
- Social networks
50. Thanks!
51. The TREE Factor
- What is this message?
- P(3→2 link | other links)
- So, if P(2→3 link) = 1, then P(3→2 link | other links) = 0
- For projective trees: the outside probability
- For non-projective: inverse Kirchhoff
- Edge-factored parsing!
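A hedged sketch of the "inverse Kirchhoff" column for the non-projective case (my own illustration following the standard matrix-tree construction; the weights are invented, and this is the multi-root variant, so the root may take several children): the determinant of the Kirchhoff (Laplacian) matrix gives the total weight Z of all trees, and entries of its inverse give the per-edge marginals that the TREE factor's outgoing messages need:

```python
import numpy as np

def edge_marginals(root_w, w):
    """Edge marginals for edge-factored non-projective parsing via the
    matrix-tree theorem ("inverse Kirchhoff").

    root_w[m] : weight of the root -> (m+1) link
    w[h, m]   : weight of the (h+1) -> (m+1) link (diagonal is ignored)
    """
    W = np.array(w, dtype=float)
    np.fill_diagonal(W, 0.0)
    # Kirchhoff/Laplacian: column m's diagonal holds its total incoming weight.
    L = np.diag(root_w + W.sum(axis=0)) - W
    Z = np.linalg.det(L)                          # total weight of all trees
    Linv = np.linalg.inv(L)
    d = np.diag(Linv)
    root_marg = root_w * d                        # P(root -> m)
    link_marg = W * (d[np.newaxis, :] - Linv.T)   # P(h -> m)
    np.fill_diagonal(link_marg, 0.0)
    return Z, root_marg, link_marg

# Invented weights for a 3-word sentence.
root_w = np.array([10.0, 1.0, 1.0])
w = np.array([[0.0, 5.0, 2.0],
              [1.0, 0.0, 4.0],
              [1.0, 3.0, 0.0]])
Z, root_marg, link_marg = edge_marginals(root_w, w)
print(Z)            # partition function
print(root_marg)    # marginals of root attachments
print(link_marg)    # marginals of word-to-word links (each column + root sums to 1)
```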
52. Runtime: BP vs. DP
Vs. O(n⁴) DP
Vs. O(n⁵) DP