Title: Probabilistic CFG with Latent Annotations
1. Probabilistic CFG with Latent Annotations
- Takuya Matsuzaki
- Yusuke Miyao
- Junichi Tsujii
- University of Tokyo
2. Motivation: The independence assumption in PCFG models
(Diagram: a Treebank-PCFG is read off a treebank under the independence assumption.)
3. A wrong independence assumption
Subject NPs and object NPs have different properties (subjects tend to be short phrases or bare pronouns; objects tend to be longer phrases).
- A single symbol is used for all NPs
- The difference between subject NPs and object NPs is not captured by the model
4. The label-annotation approach (e.g., Johnson 98, Charniak 99, Collins 99, Klein & Manning 03)
- Annotating labels with features: parent labels, head words, ...
(Diagram: Treebank → Annotated-PCFG)
5. The label-annotation approach (e.g., Johnson 98, Charniak 99, Collins 99, Klein & Manning 03)
(Diagram: the resulting Annotated-PCFG)
6. Natural questions
- What types of features are effective?
- How should we combine the features?
- How many features suffice?
- Previous approach: manual feature selection
- Our approach: automatic induction of features
7. Our approach
- Annotation of labels with latent variables
- Induction of their values using an EM algorithm
(Diagram: Treebank → PCFG with Latent Annotations)
8. Our approach
A rule with different assignments to the latent variables has different rule probabilities.
Different assignments to the latent variables ↔ different features
9. Outline
- → Model Definition
- Parameter Estimation
- Parsing Algorithms
- Experiments
10. Model Definition (1/4): The PCFG-LA model
- PCFG-LA (PCFG with Latent Annotations) is
- a generative model of parse trees, and
- a latent variable model
- Observed data: CFG-style parse trees
- Complete data: parse trees with (latent) annotations
(Figure: an observed tree and the corresponding complete tree)
11. Model Definition (2/4): Generation of a Tree
Generation of a complete tree T[x]: successive applications of annotated PCFG rules
Example: T[x = (2, 1, 3)]
12. Model Definition (2/4): Generation of a Tree
P(T[x = (2, 1, 3)]) =
13. Model Definition (2/4): Generation of a Tree
P(T[x = (2, 1, 3)]) = P(S[2])
14. Model Definition (2/4): Generation of a Tree
P(T[x = (2, 1, 3)]) = P(S[2]) P(S[2] → NP[1] VP[3])
15. Model Definition (2/4): Generation of a Tree
P(T[x = (2, 1, 3)]) = P(S[2]) P(S[2] → NP[1] VP[3]) P(NP[1] → He)
16. Model Definition (2/4): Generation of a Tree
P(T[x = (2, 1, 3)]) = P(S[2]) P(S[2] → NP[1] VP[3]) P(NP[1] → He) P(VP[3] → kicked it)
17. Model Definition (3/4): Components of PCFG-LA
- Backbone CFG: a simple treebank CFG
- Parameters: rule probabilities for each rule with ALL assignments (a count is worked out after this list): P(S[1] → NP[1] VP[1]), P(S[1] → NP[1] VP[2]), P(S[1] → NP[2] VP[1]), P(S[1] → NP[2] VP[2]), P(S[2] → NP[1] VP[1]), P(S[2] → NP[1] VP[2]), ...
- Domain of latent variables
- A finite set H = {1, 2, 3, ..., N}
- |H| is chosen before training
- |H| = 16 → reasonable performance
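As a worked count (an illustration of the enumeration above): with |H| = 2, a binary backbone rule such as S → NP VP has 2 × 2 × 2 = 8 annotated versions, six of which are listed; in general, a rule with k daughters has |H|^(k+1) annotated versions.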
18. Model Definition (4/4)
Probability of a complete tree: the product of the annotated-rule probabilities used to generate it
Probability of an observed tree: the sum over all possible assignments to the latent variables
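In LaTeX, a compact restatement of these two quantities, following the worked example on slides 12-16 (the notation, with A[x_root] for the annotated root symbol, is mine):

    P(T[\mathbf{x}]) = P(A[x_{\mathrm{root}}]) \prod_{r \in T[\mathbf{x}]} P(r),
    \qquad
    P(T) = \sum_{\mathbf{x}} P(T[\mathbf{x}])

where the product runs over the annotated rules used in the complete tree T[x].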
19. Outline
- Model definition
- → Parameter estimation
- Parsing
- Experiments
20. Parameter Estimation (1/2)
- Training data: a set of parse trees (a treebank)
- Algorithm: an EM algorithm similar to the Baum-Welch algorithm
(Figure: an example tree with latent variables: S[x1] ( NP[x2] ( N[x4] He ) ) ( VP[x3] ( V[x5] kicked ) ( N[x6] it ) ))
21. Parameter Estimation (1/2)
- Training data: a set of parse trees (a treebank)
- Algorithm: an EM algorithm similar to the Baum-Welch algorithm
(Figure: the same example tree with latent variables x1 ... x6)
22. Parameter Estimation (1/2)
- Training data: a set of parse trees (a treebank)
- Algorithm: an EM algorithm similar to the Baum-Welch algorithm
23. Parameter Estimation (1/2)
(Figure: forward and backward passes over latent variables x1 ... x4 above observations w1 ... w4, as in the Baum-Welch algorithm for HMMs; a code sketch of the tree-structured analogue follows.)
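To make this concrete, here is a minimal Python sketch (not the authors' code; the Node class and the rule_prob / lex_prob dictionaries are made-up containers for this illustration) of the bottom-up ("backward") pass that computes P(T) for one fixed, binarized observed tree by summing over all assignments to its latent annotations:

    from collections import defaultdict

    class Node:
        """A node of an observed (unannotated) parse tree: either a preterminal
        emitting a word, or an internal node with exactly two children
        (assuming a binarized backbone grammar)."""
        def __init__(self, label, children=(), word=None):
            self.label, self.children, self.word = label, list(children), word

    def inside(node, H, rule_prob, lex_prob):
        """Return b with b[a] = P(subtree below `node` | annotation of `node` = a).
        rule_prob[(A, a, B, x, C, y)] and lex_prob[(A, a, word)] hold the
        annotated-rule parameters (hypothetical layouts for this sketch)."""
        b = defaultdict(float)
        if node.word is not None:                      # preterminal -> word
            for a in range(H):
                b[a] = lex_prob.get((node.label, a, node.word), 0.0)
            return b
        left, right = node.children
        bl = inside(left, H, rule_prob, lex_prob)
        br = inside(right, H, rule_prob, lex_prob)
        for a in range(H):                             # sum out the children's annotations
            b[a] = sum(rule_prob.get((node.label, a, left.label, x, right.label, y), 0.0)
                       * bl[x] * br[y]
                       for x in range(H) for y in range(H))
        return b

    def tree_prob(root, H, root_prob, rule_prob, lex_prob):
        """P(T) = sum over root annotations of P(root annotation) * inside probability."""
        b = inside(root, H, rule_prob, lex_prob)
        return sum(root_prob.get((root.label, a), 0.0) * b[a] for a in range(H))

A matching top-down ("forward"/outside) pass gives, for every node, posterior expectations of annotated-rule usage; the M-step renormalizes those expected counts into new rule probabilities.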
24. Parameter Estimation (2/2): Comparison with the Inside-Outside algorithm
- Estimation of PCFG-LA
- Training data: parsed sentences → parse trees are given
- Inside-Outside algorithm
- Training data: raw sentences → tree structures are unknown
25. Outline
- Model definition
- Parameter estimation
- → Parsing Algorithms
- Experiments
26. Parsing with PCFG-LA
- We want the most probable observable tree:
T_max = argmax_T P(T | w) = argmax_T P(T)   (1)
- However, an observable tree has exponentially many complete trees:
|H|^n complete trees (n = number of non-terminal nodes)
- We cannot use dynamic programming like the forward-backward algorithm to solve (1) (it is NP-hard)
27. Parsing by Approximation
- Method 1: Reranking of PCFG N-best parses
- Do N-best parsing using a PCFG, and
- Select the best tree among the N candidates
- Method 2: Viterbi complete tree
- Select the most probable complete tree and discard the annotation part
- Method 3: Viterbi search with an approximate distribution
- Details are in the next slides
28. Method 3: On-the-fly approximation by a simpler model
- Input: a sentence w = w1 w2 ...
- Parse w with the backbone CFG and obtain a packed parse forest F
- Break down F and make a PCFG-like distribution Q(T | w) using the fragments
- Obtain argmax_T Q(T | w) using Viterbi search
29. Method 3: The design of Q(T | w)
- Q(T | w): a product of parameters, Q(T | w) = ∏ λ_i
→ Decoding is easy
- The λ_i are determined so that the KL divergence between Q(T | w) and P(T | w) is minimized
- Q approximates P(T | w), which is a sum-of-products of parameters
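In LaTeX, a compact restatement (the index i over local selections in the forest is my notation, and the divergence is written in the P-to-Q order used on slides 47-48):

    Q(T \mid w) = \prod_{i \in \mathrm{sel}(T)} \lambda_i,
    \qquad
    \lambda^{*} = \arg\min_{\lambda} \, \mathrm{KL}\bigl(P(\cdot \mid w) \,\|\, Q(\cdot \mid w)\bigr)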
30. Outline
- Model definition
- Parameter estimation
- Parsing
- → Experiments
31. Experiments
- Data: Penn WSJ Corpus
- Backbone CFG extraction and estimation: Sections 02-21
- Dev. set: Section 22
- Test set: Section 23
- Three experiments
- Size of models (|H|) vs. parsing performance
- Comparison of the 3 approximation methods
- Results on Section 23
32. Setting: Backbone CFG
- Function tags (-SBJ, -LOC, etc.) are removed
- No feature annotations
- 4 types of binarization
- LEFT
- RIGHT
- PARENT-CENTER
- HEAD-CENTER
(Figure: example binarized trees for PARENT-CENTER and RIGHT; a generic binarization sketch follows.)
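As a rough illustration of what binarization does (a generic sketch following one common convention; it is not necessarily the exact LEFT / RIGHT / PARENT-CENTER / HEAD-CENTER schemes of this slide), an n-ary backbone rule can be folded into a chain of binary rules:

    def binarize_right(parent, children):
        """Fold the daughters of `parent -> children` from the right into binary rules.
        Intermediate labels such as 'S|VB_NP_PP' are a made-up naming convention,
        purely for illustration."""
        rules, orig = [], parent
        while len(children) > 2:
            rest = children[1:]
            new_sym = f"{orig}|{'_'.join(rest)}"
            rules.append((parent, [children[0], new_sym]))
            parent, children = new_sym, rest
        rules.append((parent, list(children)))
        return rules

    # Example: S -> NP VB NP PP becomes
    #   S -> NP S|VB_NP_PP,  S|VB_NP_PP -> VB S|NP_PP,  S|NP_PP -> NP PP
    print(binarize_right("S", ["NP", "VB", "NP", "PP"]))

The four schemes differ in how the daughter sequence is grouped into the chain of new symbols.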
33. Size of |H| vs. Performance
(Chart: parsing performance for |H| = 1, 2, 4, 8, 16)
34. Comparison of the Approximation Methods
(Chart: comparison of the approximation methods, including N-best reranking with N = 100 and N = 300)
35. Results on Section 23
- The same level of performance as Klein & Manning's extensively annotated unlexicalized PCFG
- Several points lower than the lexicalized PCFG parsers
→ The amount of learned features matches K&M's PCFG
→ Some types of lexical information are not captured
36. Conclusion
- Automatic induction of features by the PCFG-LA
- Several points lower than lexicalized parsers, but promising results: 86 F1-score
- On-the-fly approximation of a complex model by a simpler model for parsing
- Better performance than the other, more straightforward methods
- Further applications to models of a similar type (single output ↔ many derivations)
- Mixture of parsers: latent variables correspond to component parsers
- Data-Oriented Parsing
- Projection of head-lexicalized parses to dependency structures
(Suggested by an anonymous reviewer. Thank you.)
37. Thank you!
38. (No transcript)
39. Parameter Estimation (2/3)
- How can we calculate the sum of |H|^N terms?
P(T) = Σ_{x1} Σ_{x2} Σ_{x3} ...
- Forward-Backward algorithm (just as in HMMs)
(Figure: forward and backward passes over the tree)
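As a worked illustration on the example tree of slide 20 ("He kicked it"), the joint sum factors into local sums that can be computed bottom-up; in LaTeX, with the annotated-rule probabilities written P(...) as on slides 13-17 and the tree structure read off the figure on slide 20:

    P(T) = \sum_{x_1} P(S[x_1]) \sum_{x_2, x_3} P(S[x_1] \to NP[x_2]\, VP[x_3])
           \Bigl[ \sum_{x_4} P(NP[x_2] \to N[x_4])\, P(N[x_4] \to \mathrm{He}) \Bigr]
           \Bigl[ \sum_{x_5, x_6} P(VP[x_3] \to V[x_5]\, N[x_6])\, P(V[x_5] \to \mathrm{kicked})\, P(N[x_6] \to \mathrm{it}) \Bigr]

The bracketed factors depend only on x2 and x3 respectively, so they are exactly what the backward pass computes at the NP and VP nodes.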
40. Parameter Estimation (3/3)
- Intuition
- Soft assignment of values to the latent variables, or
- Soft clustering of non-terminal nodes (Klein & Manning, 03)
41. NP-hardness of PCFG-LA Parsing
- A similar situation: the DOP model
- One parse tree ↔ exponentially many derivations
- Similar (unfortunate) results
- Obtaining a most probable tree in a DOP model is NP-hard (Sima'an, 02)
- Obtaining a most probable tree in a PCFG-LA is also NP-hard (we prove it using Sima'an's result)
42. Why can't we use Dynamic Programming?
- To avoid the wrong Markov assumption,
- PCFG-LA expands rules horizontally, while
- DOP expands rules vertically
43. Why can't we use Dynamic Programming?
- As a result, every node remembers its great-grandmother node in both models
(Figure: an annotated PCFG-LA tree, with nodes such as S[a1] and NP[a2], shown beside the corresponding DOP derivation of a sentence containing "kicked")
44. (No transcript)
45. Method 3: Local selection probabilities
Q(T | w) is a product of local selection probabilities, e.g., {0.3, 0.7} at one choice point and {0.6, 0.4} at another.
(Figure: three candidate trees, with
Q(T | w) = 0.3 × 0.6,
Q(T | w) = 0.7 × 0.6,
Q(T | w) = 0.7 × 0.4)
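Reading the figure as a worked example: the Viterbi search under Q returns the tree with the largest product, here 0.7 × 0.6 = 0.42, versus 0.3 × 0.6 = 0.18 and 0.7 × 0.4 = 0.28 for the alternatives.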
46. Method 3: Minimization of the KL divergence
Minimization of the KL divergence between P and Q yields simple, closed-form solutions for the local probabilities.
→ See the paper for details
47. More about Method 3 (3/4)
Minimization of KL(P‖Q) yields simple, closed-form results for the local probabilities.
(Figure: a VP node in the packed forest with two alternative expansions over the daughters V, N, ADV, whose local selection probabilities are λ1 and λ2)
48. More about Method 3 (3/4)
Minimization of KL(P‖Q) yields simple, closed-form results for the local probabilities.
(Figure: each local selection probability λi is read off a marginal probability P(· | w) of the corresponding forest fragment)
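A sketch of the closed form the figure suggests (my notation; see the paper for the exact normalization): each local selection probability is proportional to the marginal probability, under P(· | w), that the corresponding forest fragment occurs in the tree,

    \lambda_i \propto P(\text{the fragment selected by } \lambda_i \text{ occurs in } T \mid w)

and slide 49 notes that these marginals are computed by the Inside-Outside algorithm on the packed forest.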
49. More about Method 3 (4/4)
Calculation of these marginal probabilities: the Inside-Outside algorithm on the packed forest