1
Dependency Parsing by Belief Propagation
  • David Smith (JHU → UMass Amherst)
  • Jason Eisner (JHU)

2
Great ideas in NLP: Log-linear models
(Berger, della Pietra & della Pietra 1996; Darroch
& Ratcliff 1972)
  • In the beginning, we used generative models.

p(A) p(B | A) p(C | A,B) p(D | A,B,C)
each choice depends on a limited part of the
history
but which dependencies to allow? what if they're
all worthwhile?
3
Great ideas in NLP: Log-linear models
(Berger, della Pietra & della Pietra 1996; Darroch
& Ratcliff 1972)
  • In the beginning, we used generative models.
  • Solution: Log-linear (max-entropy) modeling
  • Features may interact in arbitrary ways
  • Iterative scaling keeps adjusting the feature
    weights until the model agrees with the training
    data!

p(A) p(B | A) p(C | A,B) p(D | A,B,C)
which dependencies to allow? (given limited
training data)
(1/Z) F(A) F(B,A) F(C,A) F(C,B)
F(D,A,B) F(D,B,C) F(D,A,C)
throw them all in!
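To make the factored form concrete, here is a minimal Python sketch (not from the talk) of the log-linear idea: each factor contributes exp(weight × feature), the score of a joint assignment is the product of all factors, and Z normalizes over every assignment. The variables, features, and weights below are invented for illustration.

```python
import itertools
import math

# Toy log-linear model over four binary variables A, B, C, D.
# Each feature looks at a small subset of the variables -- "throw them all in!"
features = {
    ("A",):          lambda a, b, c, d: a,               # fires when A = 1
    ("B", "A"):      lambda a, b, c, d: int(a == b),      # A and B agree
    ("D", "B", "C"): lambda a, b, c, d: int(d and (b or c)),
}
theta = {("A",): 0.5, ("B", "A"): 1.2, ("D", "B", "C"): -0.7}  # made-up weights

def unnormalized(a, b, c, d):
    # product of exp(weight * feature) over all factors
    return math.exp(sum(theta[k] * f(a, b, c, d) for k, f in features.items()))

# Z sums the unnormalized score over every joint assignment (the 1/Z above)
Z = sum(unnormalized(*v) for v in itertools.product([0, 1], repeat=4))
print(unnormalized(1, 1, 0, 1) / Z)   # probability of one joint assignment
```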
4
How about structured outputs?
  • Log-linear models great for n-way classification
  • Also good for predicting sequences
  • Also good for dependency parsing

but to allow fast dynamic programming, only use
n-gram features
but to allow fast dynamic programming or MST
parsing, only use single-edge features
5
How about structured outputs?
but to allow fast dynamic programming or MST
parsing, only use single-edge features
  • With arbitrary features, runtime blows up
  • Projective parsing: O(n³) by dynamic programming
  • Non-projective: O(n²) by minimum spanning tree

6
Let's reclaim our freedom (again!)
This paper in a nutshell
  • Output probability is a product of local factors
  • Throw in any factors we want! (log-linear
    model)
  • Let local factors negotiate via belief
    propagation
  • Links (and tags) reinforce or suppress one
    another
  • Each iteration takes total time O(n²) or O(n³)
  • Converges to a pretty good (but approx.) global
    parse

(1/Z) F(A) F(B,A) F(C,A) F(C,B)
F(D,A,B) F(D,B,C) F(D,A,C)
7
Let's reclaim our freedom (again!)
This paper in a nutshell
New!
8
Local factors in a graphical model
  • First, a familiar example
  • Conditional Random Field (CRF) for POS tagging

Possible tagging (i.e., assignment to remaining
variables)


[Figure: chain-structured factor graph over the observed sentence "find preferred tags" (shaded), with the three tag variables assigned v v v]
9
Local factors in a graphical model
  • First, a familiar example
  • Conditional Random Field (CRF) for POS tagging

Another possible tagging (i.e., assignment to the
remaining variables)
[Figure: the same chain over the observed sentence "find preferred tags" (shaded), now with the tag variables assigned v a n]
10
Local factors in a graphical model
  • First, a familiar example
  • Conditional Random Field (CRF) for POS tagging

Binary factor that measures compatibility of 2
adjacent tags
Model reuses same parameters at this position


11
Local factors in a graphical model
  • First, a familiar example
  • Conditional Random Field (CRF) for POS tagging

Unary factor evaluates this tag; its values
depend on the corresponding word


can't be adj
12
Local factors in a graphical model
  • First, a familiar example
  • Conditional Random Field (CRF) for POS tagging

Unary factor evaluates this tag; its values
depend on the corresponding word


(could be made to depend on entire observed
sentence)
13
Local factors in a graphical model
  • First, a familiar example
  • Conditional Random Field (CRF) for POS tagging

Unary factor evaluates this tag; a different unary
factor appears at each position


14
Local factors in a graphical model
  • First, a familiar example
  • Conditional Random Field (CRF) for POS tagging

p(v a n) is proportional to the product of
all factors' values on v a n


15
Local factors in a graphical model
  • First, a familiar example
  • Conditional Random Field (CRF) for POS tagging

p(v a n) is proportional to the product of
all factors' values on v a n
= 1 × 3 × 0.3 × 0.1 × 0.2


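A toy Python sketch of exactly this computation: the score of the tagging v a n is the product of every factor's value on it, and dividing by Z (the sum over all taggings) gives its probability. The unary and binary factor tables below are invented placeholders, not the numbers from the slide.

```python
import itertools

words = ["find", "preferred", "tags"]
tags = ["v", "a", "n"]

unary = {   # unary factor value for (word, tag); illustrative numbers only
    ("find", "v"): 3.0, ("find", "a"): 0.1, ("find", "n"): 1.0,
    ("preferred", "v"): 0.2, ("preferred", "a"): 2.0, ("preferred", "n"): 0.5,
    ("tags", "v"): 0.3, ("tags", "a"): 0.2, ("tags", "n"): 3.0,
}
binary = {(s, t): 1.0 for s in tags for t in tags}   # adjacent-tag factor
binary[("v", "a")] = 2.0
binary[("a", "n")] = 2.5

def score(tagging):
    # product of all factor values on this joint assignment
    s = 1.0
    for w, t in zip(words, tagging):
        s *= unary[(w, t)]
    for t1, t2 in zip(tagging, tagging[1:]):
        s *= binary[(t1, t2)]
    return s

Z = sum(score(t) for t in itertools.product(tags, repeat=len(words)))
print(score(("v", "a", "n")) / Z)    # p(v a n | find preferred tags)
```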
16
Local factors in a graphical model
  • First, a familiar example
  • CRF for POS tagging
  • Now let's do dependency parsing!
  • O(n²) boolean variables for the possible links



[Figure: the sentence "find preferred links" with one boolean link variable for each ordered word pair]
17
Local factors in a graphical model
  • First, a familiar example
  • CRF for POS tagging
  • Now let's do dependency parsing!
  • O(n²) boolean variables for the possible links

Possible parse
encoded as an assignment to these vars


18
Local factors in a graphical model
  • First, a familiar example
  • CRF for POS tagging
  • Now let's do dependency parsing!
  • O(n²) boolean variables for the possible links

Possible parse
encoded as an assignment to these vars
Another possible parse
[Figure: the link variables for this parse, assigned f f t t f f]
19
Local factors in a graphical model
  • First, a familiar example
  • CRF for POS tagging
  • Now let's do dependency parsing!
  • O(n²) boolean variables for the possible links

Possible parse
encoded as an assignment to these vars
Another possible parse
An illegal parse
[Figure: the link variables for this illegal parse, assigned f t t t f f]
20
Local factors in a graphical model
  • First, a familiar example
  • CRF for POS tagging
  • Now let's do dependency parsing!
  • O(n²) boolean variables for the possible links

Possible parse
encoded as an assignment to these vars
Another possible parse
An illegal parse
Another illegal parse
[Figure: the link variables assigned f t t t f t; one word has multiple parents]
21
Local factors for parsing
  • So what factors shall we multiply to define parse
    probability?
  • Unary factors to evaluate each link in isolation

But what if the best assignment isn't a tree??
as before, goodness of this link can depend on
entire observed input context
some other links aren't as good given this
input sentence


22
Global factors for parsing
  • So what factors shall we multiply to define parse
    probability?
  • Unary factors to evaluate each link in isolation
  • Global TREE factor to require that the links form
    a legal tree
  • this is a hard constraint: factor is either 0 or 1



23
Global factors for parsing
  • So what factors shall we multiply to define parse
    probability?
  • Unary factors to evaluate each link in isolation
  • Global TREE factor to require that the links form
    a legal tree
  • this is a hard constraint: factor is either 0 or 1
    (sketched in code below)

[Figure: the TREE factor attached to all six link variables; its table has 64 entries (0/1), and the assignment shown (t f f f f t) is legal: "we're legal!"]
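A minimal sketch of what the hard TREE factor checks, assuming an explicit root node 0 and words numbered 1..n (the slide's picture may attach the factor to a slightly different set of link variables): the factor is 1 when every word has exactly one parent and no cycles occur, and 0 otherwise.

```python
def tree_factor(links, n):
    """Return 1 if the set of (head, modifier) links forms a legal
    dependency tree over words 1..n with root 0, else 0."""
    parents = {}
    for h, m in links:
        if m in parents:                      # a word with two parents: illegal
            return 0
        parents[m] = h
    if set(parents) != set(range(1, n + 1)):  # some word has no parent
        return 0
    for m in range(1, n + 1):                 # every word must reach the root
        seen, node = set(), m
        while node != 0:
            if node in seen:                  # cycle: illegal
                return 0
            seen.add(node)
            node = parents[node]
    return 1

# "find(1) preferred(2) links(3)": root->find, find->links, links->preferred
print(tree_factor({(0, 1), (1, 3), (3, 2)}, n=3))   # 1: a legal tree
print(tree_factor({(0, 1), (1, 2), (3, 2)}, n=3))   # 0: word 2 has two parents
```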
24
Local factors for parsing
  • So what factors shall we multiply to define parse
    probability?
  • Unary factors to evaluate each link in isolation
  • Global TREE factor to require that the links form
    a legal tree
  • this is a hard constraint: factor is either 0 or 1
  • Second-order effects: factors on 2 variables
  • grandparent

[Figure: a grandparent factor on a pair of link variables; its value is 3 when both links are t]
25
Local factors for parsing
  • So what factors shall we multiply to define parse
    probability?
  • Unary factors to evaluate each link in isolation
  • Global TREE factor to require that the links form
    a legal tree
  • this is a hard constraint: factor is either 0 or 1
  • Second-order effects: factors on 2 variables
  • grandparent
  • no-cross

[Figure: a no-cross factor on two candidate links that would cross; both are shown as t]
26
Local factors for parsing
  • So what factors shall we multiply to define parse
    probability?
  • Unary factors to evaluate each link in isolation
  • Global TREE factor to require that the links form
    a legal tree
  • this is a hard constraint: factor is either 0 or 1
  • Second-order effects: factors on 2 variables
  • grandparent
  • no-cross
  • siblings
  • hidden POS tags
  • subcategorization



27
Good to have lots of features, but
  • Nice model ☺
  • Shame about the NP-hardness ☹
  • Can we approximate?
  • Machine Learning (aka Statistical Physics) to
    the Rescue!

28
Great Ideas in ML: Message Passing
Count the soldiers
1 before you
2 before you
3 before you
4 before you
5 before you
3 behind you
2 behind you
1 behind you
4 behind you
5 behind you
MacKay 2003
29
Great Ideas in ML: Message Passing
Count the soldiers
Belief: Must be 2 + 1 + 3 = 6 of us
2 before you
3 behind you
only see my incoming messages
MacKay 2003
30
Great Ideas in ML: Message Passing
Count the soldiers
1 before you
4 behind you
only see my incoming messages
MacKay 2003
31
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches
of tree
3 here
7 here
11 here (= 7 + 3 + 1)
MacKay 2003
32
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches
of tree
3 here
6 here
3 here
MacKay 2003
33
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches
of tree
11 here (= 7 + 3 + 1)
7 here
3 here
MacKay 2003
34
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches
of tree
3 here
7 here
Belief: Must be 14 of us
3 here
MacKay 2003
35
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches
of tree
3 here
7 here
Belief: Must be 14 of us
3 here
wouldn't work correctly with a loopy (cyclic)
graph
MacKay 2003
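A tiny Python sketch of the soldier-counting scheme on an invented tree: the message a node sends along an edge is 1 (for itself) plus the messages it received on its other edges, and its belief is 1 plus all incoming messages. The naive recursion below assumes the graph really is a tree; on a loopy graph it would never terminate, which is the caveat noted just above.

```python
from collections import defaultdict

edges = [(0, 1), (1, 2), (1, 3), (0, 4)]     # an undirected tree of 5 soldiers
nbrs = defaultdict(set)
for u, v in edges:
    nbrs[u].add(v)
    nbrs[v].add(u)

def message(src, dst):
    # what src reports to dst: itself plus everything heard on its other edges
    return 1 + sum(message(k, src) for k in nbrs[src] if k != dst)

def belief(node):
    # total count: yourself plus all incoming reports
    return 1 + sum(message(k, node) for k in nbrs[node])

print([belief(v) for v in sorted(nbrs)])     # every soldier concludes: 5 of us
```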
36
Great ideas in ML: Forward-Backward
  • In the CRF, message passing = forward-backward

[Figure: forward messages α and backward messages β passed along the chain over "find preferred tags"; their product at each variable is the belief]
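A minimal numpy sketch of forward-backward as message passing on this chain: α messages flow left to right, β messages flow right to left, and their product at each position gives the belief (posterior tag distribution). The factor tables are invented placeholders, not trained values.

```python
import numpy as np

tags = ["v", "a", "n"]
# unary[i, t]: factor value for tag t at position i of "find preferred tags"
unary = np.array([[3.0, 0.1, 1.0],
                  [0.2, 2.0, 0.5],
                  [0.3, 0.2, 3.0]])
binary = np.ones((3, 3))      # adjacent-tag factor psi(tag_i, tag_{i+1})
binary[0, 1] = 2.0            # v -> a
binary[1, 2] = 2.5            # a -> n

n = unary.shape[0]
alpha = np.zeros((n, 3))
beta = np.zeros((n, 3))
alpha[0] = unary[0]
for i in range(1, n):                          # forward (alpha) pass
    alpha[i] = (alpha[i - 1] @ binary) * unary[i]
beta[-1] = 1.0
for i in range(n - 2, -1, -1):                 # backward (beta) pass
    beta[i] = binary @ (unary[i + 1] * beta[i + 1])

beliefs = alpha * beta                         # belief = alpha * beta
beliefs /= beliefs.sum(axis=1, keepdims=True)  # per-position tag posteriors
print(beliefs.round(3))
```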
37
Great ideas in ML: Forward-Backward
  • Extend CRF to skip chain to capture non-local
    factor
  • More influences on belief ☺

38
Great ideas in ML: Forward-Backward
  • Extend CRF to skip chain to capture non-local
    factor
  • More influences on belief ☺
  • Graph becomes loopy ☹

Red messages not independent? Pretend they are!
39
Loopy Belief Propagation for Parsing
  • Higher-order factors (e.g., Grandparent) induce
    loops
  • Let's watch a loop around one triangle
  • Strong links are suppressing or promoting other
    links



40
Loopy Belief Propagation for Parsing
  • Higher-order factors (e.g., Grandparent) induce
    loops
  • Let's watch a loop around one triangle
  • How did we compute outgoing message to green
    link?
  • Does the TREE factor think that the green link
    is probably t,given the messages it receives all
    the other links?

41
Loopy Belief Propagation for Parsing
  • How did we compute outgoing message to green
    link?
  • Does the TREE factor think that the green link
    is probably t, given the messages it receives
    from all the other links?

Belief propagation assumes incoming messages to
TREE are independent. So outgoing messages can be
computed with first-order parsing algorithms
(fast, no grammar constant).
42
Runtimes for each factor type (see paper)

[Table: per-iteration runtime for each factor type]
Additive, not multiplicative!
43
Runtimes for each factor type (see paper)

Each global factor coordinates an unbounded number
of variables. Standard belief propagation would
take exponential time to iterate over all
configurations of those variables. See paper for
efficient propagators.
Additive, not multiplicative!
44
Experimental Details
  • Decoding
  • Run several iterations of belief propagation
  • Get final beliefs at link variables
  • Feed them into first-order parser
  • This gives the Minimum Bayes Risk tree (see the
    sketch after this list)
  • Training
  • BP computes beliefs about each factor, too
  • which gives us gradients for max conditional
    likelihood.
  • (as in forward-backward algorithm)
  • Features used in experiments
  • First-order: Individual links, just as in McDonald
    et al. 2005
  • Higher-order: Grandparent, Sibling bigrams,
    NoCross
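
A toy sketch of the decoding step, under the assumption that risk is measured by per-link accuracy: pick the legal tree whose links have the highest total belief. The paper hands the beliefs to a first-order parser for this; brute force is used here only because the example is tiny, and the belief values are invented.

```python
import itertools

n = 3                               # words 1..n, 0 is the root
belief = {(0, 1): 0.9, (0, 2): 0.2, (0, 3): 0.3,   # b[(head, modifier)]
          (1, 2): 0.4, (1, 3): 0.8, (2, 1): 0.1,
          (2, 3): 0.3, (3, 1): 0.1, (3, 2): 0.7}

def is_tree(parents):               # parents[m] = head of word m
    for m in range(1, n + 1):       # every word must reach the root, no cycles
        seen, node = set(), m
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = parents[node]
    return True

best, best_score = None, float("-inf")
for heads in itertools.product(range(n + 1), repeat=n):
    parents = {m: h for m, h in zip(range(1, n + 1), heads)}
    if any(h == m for m, h in parents.items()) or not is_tree(parents):
        continue                    # skip self-loops and non-trees
    score = sum(belief[(h, m)] for m, h in parents.items())
    if score > best_score:
        best, best_score = parents, score

print(best)      # {1: 0, 2: 3, 3: 1} for these made-up beliefs
```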

45
Dependency Accuracy: The extra, higher-order
features help! (non-projective parsing)
46
Dependency Accuracy: The extra, higher-order
features help! (non-projective parsing)
exact, slow
doesn't fix enough edges
47
Time vs. Projective Search Error
[Figure: search error vs. number of BP iterations, compared with O(n⁴) and O(n⁵) dynamic programming]
48
Freedom Regained
This paper in a nutshell
  • Output probability defined as product of local
    and global factors
  • Throw in any factors we want! (log-linear
    model)
  • Each factor must be fast, but they run
    independently
  • Let local factors negotiate via belief
    propagation
  • Each bit of syntactic structure is influenced by
    others
  • Some factors need combinatorial algorithms to
    compute messages fast
  • e.g., existing parsing algorithms using dynamic
    programming
  • Each iteration takes total time O(n³) or even
    O(n²); see paper
  • Compare: reranking and stacking
  • Converges to a pretty good (but approximate)
    global parse
  • Fast parsing for formerly intractable or slow
    models
  • Extra features help accuracy

49
Future Opportunities
  • Modeling hidden structure
  • POS tags, link roles, secondary links (DAG-shaped
    parses)
  • Beyond dependencies
  • Constituency parsing, traces, lattice parsing
  • Beyond parsing
  • Alignment, translation
  • Bipartite matching and network flow
  • Joint decoding of parsing and other tasks (IE,
    MT, ...)
  • Beyond text
  • Image tracking and retrieval
  • Social networks

50
Thanks!
51
The Tree Factor
  • What is this message?
  • P(3→2 link | other links)
  • So, if P(2→3 link) = 1, then P(3→2 link | other
    links) = 0
  • For projective trees: the outside probability;
    for nonprojective trees: inverse Kirchhoff (see
    the sketch below)
  • Edge-factored parsing!



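A small numpy sketch of the nonprojective "inverse Kirchhoff" computation alluded to above, following the standard matrix-tree construction (cf. Koo et al. 2007; Smith & Smith 2007): from edge-factored weights, the determinant of a Laplacian-style matrix gives the total weight of all (multi-root) spanning trees, and entries of its inverse give the marginal probability of each link. The weights are invented, and the actual factor-to-variable message in the paper additionally divides out the incoming message for the link in question.

```python
import numpy as np

n = 3
theta = np.array([          # theta[h, m] = weight of candidate link h -> m
    [0.0, 2.0, 1.0, 1.0],   # row 0: links from the root
    [0.0, 0.0, 3.0, 0.5],
    [0.0, 0.5, 0.0, 2.0],
    [0.0, 1.0, 0.5, 0.0],
])

# Laplacian-style matrix over words 1..n:
#   L[m-1, m-1] = total weight of links into m,  L[h-1, m-1] = -theta[h, m]
L = np.zeros((n, n))
for m in range(1, n + 1):
    L[m - 1, m - 1] = theta[:, m].sum() - theta[m, m]
    for h in range(1, n + 1):
        if h != m:
            L[h - 1, m - 1] = -theta[h, m]

Z = np.linalg.det(L)        # partition function: total weight of all trees
Linv = np.linalg.inv(L)     # the "inverse Kirchhoff" step

def link_marginal(h, m):
    """P(link h -> m is in the tree), via theta * d(log Z)/d(theta)."""
    if h == 0:
        return theta[0, m] * Linv[m - 1, m - 1]
    return theta[h, m] * (Linv[m - 1, m - 1] - Linv[m - 1, h - 1])

print(round(Z, 3), round(link_marginal(1, 2), 3))
```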
52
Runtime: BP vs. DP
[Figure: BP runtime compared with O(n⁴) and O(n⁵) dynamic programming]