Title: Probabilistic CFG with Latent Annotations
1. Probabilistic CFG with Latent Annotations
- Takuya Matsuzaki
- Yusuke Miyao
- Junichi Tsujii
- University of Tokyo
2. Motivation: The independence assumption in PCFG models
(Diagram: a Treebank-PCFG is read off a treebank under the independence assumption.)
3. A wrong independence assumption
Subject NPs and object NPs have different properties (subjects tend to be short phrases or bare pronouns; objects tend to be longer phrases).
- A single symbol is used for all NPs
- The difference between subject NPs and object NPs is not captured by the model
4. The label-annotation approach (e.g., Johnson 98, Charniak 99, Collins 99, Klein & Manning 03)
- Annotating labels with features: parent labels, head words, ...
(Diagram: Treebank → Annotated-PCFG)
5. The label-annotation approach (e.g., Johnson 98, Charniak 99, Collins 99, Klein & Manning 03)
(Diagram: the resulting Annotated-PCFG)
6. Natural questions
- What types of features are effective?
- How should we combine the features?
- How many features suffice?
- Previous approach: manual feature selection
- Our approach: automatic induction of features
7. Our approach
- Annotation of labels with latent variables
- Induction of their values using an EM algorithm
(Diagram: Treebank → PCFG with Latent Annotations)
8. Our approach
A rule with different assignments to the latent variables has different rule probabilities.
Different assignments to the latent variables ↔ different features
9. Outline
- → Model Definition
- Parameter Estimation
- Parsing Algorithms
- Experiments
10. Model Definition (1/4): The PCFG-LA model
- PCFG-LA (PCFG with Latent Annotations) is
- a generative model of parse trees, and
- a latent variable model
- Observed data: CFG-style parse trees
- Complete data: parse trees with (latent) annotations
(Figure: an observed tree and the corresponding complete tree)
11. Model Definition (2/4): Generation of a Tree
Generation of a complete tree T[x]: successive applications of annotated PCFG rules
Example: T[x = (2, 1, 3)]
12. Model Definition (2/4): Generation of a Tree
P(T[x = (2, 1, 3)]) =
13. Model Definition (2/4): Generation of a Tree
P(T[x = (2, 1, 3)]) = P(S[2])
14. Model Definition (2/4): Generation of a Tree
P(T[x = (2, 1, 3)]) = P(S[2]) P(S[2] → NP[1] VP[3])
15. Model Definition (2/4): Generation of a Tree
P(T[x = (2, 1, 3)]) = P(S[2]) P(S[2] → NP[1] VP[3]) P(NP[1] → He)
16. Model Definition (2/4): Generation of a Tree
P(T[x = (2, 1, 3)]) = P(S[2]) P(S[2] → NP[1] VP[3]) P(NP[1] → He) P(VP[3] → kicked it)
17. Model Definition (3/4): Components of PCFG-LA
- Backbone CFG: a simple treebank CFG
- Parameters: rule probabilities for each rule with ALL assignments (a count is worked out after this list): P(S[1] → NP[1] VP[1]), P(S[1] → NP[1] VP[2]), P(S[1] → NP[2] VP[1]), P(S[1] → NP[2] VP[2]), P(S[2] → NP[1] VP[1]), P(S[2] → NP[1] VP[2]), ...
- Domain of latent variables
- A finite set H = {1, 2, 3, ..., N}
- |H| is chosen before training
- |H| = 16 → reasonable performance
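As a worked count (an illustration of the enumeration above): with |H| = 2, a binary backbone rule such as S → NP VP has 2 × 2 × 2 = 8 annotated versions, six of which are listed; in general, a rule with k daughters has |H|^(k+1) annotated versions.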
18. Model Definition (4/4)
Probability of a complete tree: the product of the annotated-rule probabilities used to generate it
Probability of an observed tree: the sum over all possible assignments to the latent variables
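In LaTeX, a compact restatement of these two quantities, following the worked example on slides 12-16 (the notation, with A[x_root] for the annotated root symbol, is mine):

    P(T[\mathbf{x}]) = P(A[x_{\mathrm{root}}]) \prod_{r \in T[\mathbf{x}]} P(r),
    \qquad
    P(T) = \sum_{\mathbf{x}} P(T[\mathbf{x}])

where the product runs over the annotated rules used in the complete tree T[x].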
19. Outline
- Model definition
- → Parameter estimation
- Parsing
- Experiments
20. Parameter Estimation (1/2)
- Training data: a set of parse trees (a treebank)
- Algorithm: an EM algorithm similar to the Baum-Welch algorithm
(Figure: an example tree with latent variables: S[x1] ( NP[x2] ( N[x4] He ) ) ( VP[x3] ( V[x5] kicked ) ( N[x6] it ) ))
21. Parameter Estimation (1/2)
- Training data: a set of parse trees (a treebank)
- Algorithm: an EM algorithm similar to the Baum-Welch algorithm
(Figure: the same example tree with latent variables x1 ... x6)
22. Parameter Estimation (1/2)
- Training data: a set of parse trees (a treebank)
- Algorithm: an EM algorithm similar to the Baum-Welch algorithm
23. Parameter Estimation (1/2)
(Figure: forward and backward passes over latent variables x1 ... x4 above observations w1 ... w4, as in the Baum-Welch algorithm for HMMs; a code sketch of the tree-structured analogue follows.)
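To make this concrete, here is a minimal Python sketch (not the authors' code; the Node class and the rule_prob / lex_prob dictionaries are made-up containers for this illustration) of the bottom-up ("backward") pass that computes P(T) for one fixed, binarized observed tree by summing over all assignments to its latent annotations:

    from collections import defaultdict

    class Node:
        """A node of an observed (unannotated) parse tree: either a preterminal
        emitting a word, or an internal node with exactly two children
        (assuming a binarized backbone grammar)."""
        def __init__(self, label, children=(), word=None):
            self.label, self.children, self.word = label, list(children), word

    def inside(node, H, rule_prob, lex_prob):
        """Return b with b[a] = P(subtree below `node` | annotation of `node` = a).
        rule_prob[(A, a, B, x, C, y)] and lex_prob[(A, a, word)] hold the
        annotated-rule parameters (hypothetical layouts for this sketch)."""
        b = defaultdict(float)
        if node.word is not None:                      # preterminal -> word
            for a in range(H):
                b[a] = lex_prob.get((node.label, a, node.word), 0.0)
            return b
        left, right = node.children
        bl = inside(left, H, rule_prob, lex_prob)
        br = inside(right, H, rule_prob, lex_prob)
        for a in range(H):                             # sum out the children's annotations
            b[a] = sum(rule_prob.get((node.label, a, left.label, x, right.label, y), 0.0)
                       * bl[x] * br[y]
                       for x in range(H) for y in range(H))
        return b

    def tree_prob(root, H, root_prob, rule_prob, lex_prob):
        """P(T) = sum over root annotations of P(root annotation) * inside probability."""
        b = inside(root, H, rule_prob, lex_prob)
        return sum(root_prob.get((root.label, a), 0.0) * b[a] for a in range(H))

A matching top-down ("forward"/outside) pass gives, for every node, posterior expectations of annotated-rule usage; the M-step renormalizes those expected counts into new rule probabilities.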
24. Parameter Estimation (2/2): Comparison with the Inside-Outside algorithm
- Estimation of PCFG-LA
- Training data: parsed sentences → parse trees are given
- Inside-Outside algorithm
- Training data: raw sentences → tree structures are unknown
25. Outline
- Model definition
- Parameter estimation
- → Parsing Algorithms
- Experiments
26. Parsing with PCFG-LA
- We want the most probable observable tree:
T_max = argmax_T P(T | w) = argmax_T P(T)   (1)
- However, an observable tree has exponentially many complete trees:
|H|^n complete trees (n = number of non-terminal nodes)
- We cannot use dynamic programming like the forward-backward algorithm to solve (1) (it is NP-hard)
27. Parsing by Approximation
- Method 1: Reranking of PCFG N-best parses
- Do N-best parsing using a PCFG, and
- Select the best tree among the N candidates
- Method 2: Viterbi complete tree
- Select the most probable complete tree and discard the annotation part
- Method 3: Viterbi search with an approximate distribution
- Details are in the next slides
28. Method 3: On-the-fly approximation by a simpler model
- Input: a sentence w = w1 w2 ...
- Parse w with the backbone CFG and obtain a packed parse forest F
- Break down F and make a PCFG-like distribution Q(T | w) using the fragments
- Obtain argmax_T Q(T | w) using Viterbi search
29. Method 3: The design of Q(T | w)
- Q(T | w): a product of parameters, Q(T | w) = ∏ λ_i
→ Decoding is easy
- The λ_i are determined so that the KL divergence between Q(T | w) and P(T | w) is minimized
- Q approximates P(T | w), which is a sum-of-products of parameters
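In LaTeX, a compact restatement (the index i over local selections in the forest is my notation, and the divergence is written in the P-to-Q order used on slides 47-48):

    Q(T \mid w) = \prod_{i \in \mathrm{sel}(T)} \lambda_i,
    \qquad
    \lambda^{*} = \arg\min_{\lambda} \, \mathrm{KL}\bigl(P(\cdot \mid w) \,\|\, Q(\cdot \mid w)\bigr)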
30. Outline
- Model definition
- Parameter estimation
- Parsing
- → Experiments
31. Experiments
- Data: Penn WSJ Corpus
- Backbone CFG extraction and estimation: Sections 02-21
- Dev. set: Section 22
- Test set: Section 23
- Three experiments
- Size of models (|H|) vs. parsing performance
- Comparison of the 3 approximation methods
- Results on Section 23
32. Setting: Backbone CFG
- Function tags (-SBJ, -LOC, etc.) are removed
- No feature annotations
- 4 types of binarization
- LEFT
- RIGHT
- PARENT-CENTER
- HEAD-CENTER
(Figure: example binarized trees for PARENT-CENTER and RIGHT; a generic binarization sketch follows.)
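As a rough illustration of what binarization does (a generic sketch following one common convention; it is not necessarily the exact LEFT / RIGHT / PARENT-CENTER / HEAD-CENTER schemes of this slide), an n-ary backbone rule can be folded into a chain of binary rules:

    def binarize_right(parent, children):
        """Fold the daughters of `parent -> children` from the right into binary rules.
        Intermediate labels such as 'S|VB_NP_PP' are a made-up naming convention,
        purely for illustration."""
        rules, orig = [], parent
        while len(children) > 2:
            rest = children[1:]
            new_sym = f"{orig}|{'_'.join(rest)}"
            rules.append((parent, [children[0], new_sym]))
            parent, children = new_sym, rest
        rules.append((parent, list(children)))
        return rules

    # Example: S -> NP VB NP PP becomes
    #   S -> NP S|VB_NP_PP,  S|VB_NP_PP -> VB S|NP_PP,  S|NP_PP -> NP PP
    print(binarize_right("S", ["NP", "VB", "NP", "PP"]))

The four schemes differ in how the daughter sequence is grouped into the chain of new symbols.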
33. Size of |H| vs. Performance
(Chart: parsing performance for |H| = 1, 2, 4, 8, 16)
34. Comparison of the Approximation Methods
(Chart: comparison of the approximation methods, including N-best reranking with N = 100 and N = 300)
35. Results on Section 23
- The same level of performance as Klein & Manning's extensively annotated unlexicalized PCFG
- Several points lower than the lexicalized PCFG parsers
→ The amount of learned features matches K&M's PCFG
→ Some types of lexical information are not captured
36. Conclusion
- Automatic induction of features by the PCFG-LA
- Several points lower than lexicalized parsers, but promising results: 86 F1-score
- On-the-fly approximation of a complex model by a simpler model for parsing
- Better performance than the other, more straightforward methods
- Further applications to models of a similar type (single output ↔ many derivations)
- Mixture of parsers: latent variables correspond to component parsers
- Data-Oriented Parsing
- Projection of head-lexicalized parses to dependency structures
(Suggested by an anonymous reviewer. Thank you.)
37. Thank you!
38. (No transcript)
39. Parameter Estimation (2/3)
- How can we calculate the sum of |H|^N terms?
P(T) = Σ_{x1} Σ_{x2} Σ_{x3} ...
- Forward-Backward algorithm (just as in HMMs)
(Figure: forward and backward passes over the tree)
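As a worked illustration on the example tree of slide 20 ("He kicked it"), the joint sum factors into local sums that can be computed bottom-up; in LaTeX, with the annotated-rule probabilities written P(...) as on slides 13-17 and the tree structure read off the figure on slide 20:

    P(T) = \sum_{x_1} P(S[x_1]) \sum_{x_2, x_3} P(S[x_1] \to NP[x_2]\, VP[x_3])
           \Bigl[ \sum_{x_4} P(NP[x_2] \to N[x_4])\, P(N[x_4] \to \mathrm{He}) \Bigr]
           \Bigl[ \sum_{x_5, x_6} P(VP[x_3] \to V[x_5]\, N[x_6])\, P(V[x_5] \to \mathrm{kicked})\, P(N[x_6] \to \mathrm{it}) \Bigr]

The bracketed factors depend only on x2 and x3 respectively, so they are exactly what the backward pass computes at the NP and VP nodes.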
40. Parameter Estimation (3/3)
- Intuition
- Soft assignment of values to the latent variables, or
- Soft clustering of non-terminal nodes (Klein & Manning, 03)
41. NP-hardness of PCFG-LA Parsing
- A similar situation: the DOP model
- One parse tree ↔ exponentially many derivations
- Similar (unfortunate) results
- Obtaining a most probable tree in a DOP model is NP-hard (Sima'an, 02)
- Obtaining a most probable tree in a PCFG-LA is also NP-hard (we prove it using Sima'an's result)
42. Why can't we use Dynamic Programming?
- To avoid the wrong Markov assumption,
- PCFG-LA expands rules horizontally, while
- DOP expands rules vertically
43. Why can't we use Dynamic Programming?
- As a result, every node remembers its great-grandmother node in both models
(Figure: an annotated PCFG-LA tree, with nodes such as S[a1] and NP[a2], shown beside the corresponding DOP derivation of a sentence containing "kicked")
44. (No transcript)
45. Method 3: Local selection probabilities
Q(T | w) is a product of local selection probabilities, e.g., {0.3, 0.7} at one choice point and {0.6, 0.4} at another.
(Figure: three candidate trees, with
Q(T | w) = 0.3 × 0.6,
Q(T | w) = 0.7 × 0.6,
Q(T | w) = 0.7 × 0.4)
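Reading the figure as a worked example: the Viterbi search under Q returns the tree with the largest product, here 0.7 × 0.6 = 0.42, versus 0.3 × 0.6 = 0.18 and 0.7 × 0.4 = 0.28 for the alternatives.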
46. Method 3: Minimization of the KL divergence
Minimization of the KL divergence between P and Q yields simple, closed-form solutions for the local probabilities.
→ See the paper for details
47. More about Method 3 (3/4)
Minimization of KL(P‖Q) yields simple, closed-form results for the local probabilities.
(Figure: a VP node in the packed forest with two alternative expansions over the daughters V, N, ADV, whose local selection probabilities are λ1 and λ2)
48. More about Method 3 (3/4)
Minimization of KL(P‖Q) yields simple, closed-form results for the local probabilities.
(Figure: each local selection probability λi is read off a marginal probability P(· | w) of the corresponding forest fragment)
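A sketch of the closed form the figure suggests (my notation; see the paper for the exact normalization): each local selection probability is proportional to the marginal probability, under P(· | w), that the corresponding forest fragment occurs in the tree,

    \lambda_i \propto P(\text{the fragment selected by } \lambda_i \text{ occurs in } T \mid w)

and slide 49 notes that these marginals are computed by the Inside-Outside algorithm on the packed forest.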
49. More about Method 3 (4/4)
Calculation of these marginal probabilities: the Inside-Outside algorithm on the packed forest