Title: Learning and Inference for Hierarchically Split PCFGs
1. Learning and Inference for Hierarchically Split PCFGs
- Slav Petrov
- Joint work with Dan Klein
2. Motivation (Syntax)
He was right.
- Why?
- Information Extraction
- Syntactic Machine Translation
3. Treebank Parsing
4. The Game of Designing a Grammar
- Annotation refines base treebank symbols to improve statistical fit of the grammar
- Parent annotation [Johnson 98]
5. The Game of Designing a Grammar
- Annotation refines base treebank symbols to improve statistical fit of the grammar
- Parent annotation [Johnson 98]
- Head lexicalization [Collins 99, Charniak 00]
6. The Game of Designing a Grammar
- Annotation refines base treebank symbols to improve statistical fit of the grammar
- Parent annotation [Johnson 98]
- Head lexicalization [Collins 99, Charniak 00]
- Automatic clustering?
7. Learning Accurate, Compact, and Interpretable Tree Annotation
- Slav Petrov, Leon Barrett, Romain Thibaux and Dan Klein
- ACL 2006
8. Previous Work: Manual Annotation
Klein & Manning 03
- Manually split categories
- NP: subject vs. object
- DT: determiners vs. demonstratives
- IN: sentential vs. prepositional
- Advantages
- Fairly compact grammar
- Linguistic motivations
- Disadvantages
- Performance leveled out
- Manually annotated
9. Previous Work: Automatic Annotation Induction
Matsuzaki et al. 05, Prescher 05
- Advantages
- Automatically learned
- Label all nodes with latent variables.
- Same number k of subcategories for all categories.
- Disadvantages
- Grammar gets too large
- Most categories are oversplit while others are undersplit.
10. Previous work is complementary
11. Learning Latent Annotations
- Brackets are known
- Base categories are known
- Only induce subcategories
Just like Forward-Backward for HMMs.
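A minimal sketch of the E-step here, using standard inside/outside notation (P_IN and P_OUT are my labels, not from the slide): for a node n of the observed tree with base category A, the posterior over its latent subcategories is

    P(A_x at n | w, T)  ∝  P_IN(n, A_x) · P_OUT(n, A_x)

These posteriors give expected rule counts, from which the split grammar is re-estimated in the M-step.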
12. Overview
- Hierarchical Training
- Adaptive Splitting
- Parameter Smoothing
13. Refinement of the DT tag
14. Refinement of the DT tag
15. Hierarchical refinement of the DT tag
16. Hierarchical Estimation Results
17. Refinement of the , tag
- Splitting all categories the same amount is wasteful
18. The DT tag revisited
19. Adaptive Splitting
- Want to split complex categories more
- Idea: split everything, roll back the splits that were least useful
20. Adaptive Splitting
- Want to split complex categories more
- Idea: split everything, roll back the splits that were least useful
21. Adaptive Splitting
- Evaluate the loss in likelihood from removing each split:
- Data likelihood with the split reversed
- Data likelihood with the split
- No loss in accuracy when 50% of the splits are reversed (see the sketch below)
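A rough Python sketch of one split-merge round (split_all, train_em, new_splits, likelihood_loss and merge are hypothetical placeholder helpers, not the Berkeley Parser API):

    # One round of adaptive splitting: split every subcategory, re-train,
    # then roll back the half of the splits whose removal costs the least
    # data likelihood. Helper functions are hypothetical placeholders.
    def split_merge_round(grammar, treebank, rollback_fraction=0.5):
        grammar = split_all(grammar)               # split each subcategory in two
        grammar = train_em(grammar, treebank)      # re-estimate with EM

        # Approximate, for each new split, the loss in data likelihood
        # incurred by merging its two halves back together.
        losses = {s: likelihood_loss(grammar, treebank, s) for s in new_splits(grammar)}

        # Reverse the least useful splits (about 50% of them).
        for s in sorted(losses, key=losses.get)[:int(len(losses) * rollback_fraction)]:
            grammar = merge(grammar, s)

        return train_em(grammar, treebank)         # re-train after merging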
22. Adaptive Splitting Results
23. Number of Phrasal Subcategories
24. Number of Phrasal Subcategories (chart: NP, VP, PP)
25. Number of Phrasal Subcategories (chart: NAC, X)
26. Number of Lexical Subcategories (chart: POS, TO, ,)
27. Number of Lexical Subcategories (chart: RB, VBx, IN, DT)
28. Number of Lexical Subcategories (chart: NNP, JJ, NNS, NN)
29. Smoothing
- Heavy splitting can lead to overfitting
- Idea: smoothing allows us to pool statistics
30. Linear Smoothing
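A sketch of the linear smoothing, as I reconstruct it from the ACL 2006 paper (alpha is a small interpolation weight whose value is not given on the slide): for a category A with n subcategories,

    p'(A_x → B_y C_z) = (1 - alpha) · p(A_x → B_y C_z) + alpha · (1/n) · Σ_{x'} p(A_{x'} → B_y C_z)

so each subcategory's rule probability is pulled toward the average over its siblings, letting rarely observed subcategories pool statistics with the rest.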
31. Result Overview
32. Linguistic Candy
- Proper Nouns (NNP)
- Personal pronouns (PRP)
33. Linguistic Candy
- Relative adverbs (RBR)
- Cardinal Numbers (CD)
34. Improved Inference for Unlexicalized Parsing
- Slav Petrov and Dan Klein
- NAACL 2007
35. Time to parse 1576 sentences
36. Coarse-to-Fine Parsing
Goodman 97, Charniak & Johnson 05
37. Prune?
- For each chart item X_{i,j}, compute its posterior probability
- Prune if it is < threshold
- E.g. consider the span 5 to 12
(figure: coarse vs. refined chart for this span)
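Concretely, the pruning test (my formulation in standard inside/outside notation): the item X over span (i, j) is pruned if

    P(X_{i,j} | w) = P_OUT(X, i, j) · P_IN(X, i, j) / P_IN(root, 0, n)  <  threshold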
38. Time to parse 1576 sentences
- 1621 min
- 111 min
- (no search error)
39. Hierarchical Pruning
- Consider again the span 5 to 12 (see the sketch below)
(figure: charts for the coarse grammar, then split in two, split in four, split in eight)
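A rough Python sketch of the hierarchical coarse-to-fine loop (parse_chart, posteriors and best_tree are hypothetical placeholder helpers, not the Berkeley Parser API):

    # Parse with a sequence of increasingly refined grammars; an item is only
    # built at a level if its projection to the previous (coarser) level had
    # posterior probability above the threshold there.
    def coarse_to_fine_parse(sentence, grammars, threshold=1e-4):
        allowed = None                               # the coarsest pass is not pruned
        for grammar in grammars:                     # e.g. X-Bar = G0, G1, ..., G
            chart = parse_chart(sentence, grammar, allowed)
            # Keep only the items whose posterior probability clears the threshold;
            # their projections constrain the next, more refined pass.
            allowed = {item for item, post in posteriors(chart).items()
                       if post >= threshold}
        return best_tree(chart)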
40. Intermediate Grammars
(figure: grammars from X-Bar = G0 up to the final grammar G)
41. Time to parse 1576 sentences
- 1621 min
- 111 min
- 35 min
- (no search error)
42. State Drift (DT tag)
43. Projected Grammars
(figure: the final grammar G projected back down to X-Bar = G0)
44. Estimating Projected Grammars
(figure: split nonterminals in G, e.g. S0, S1, NP0, NP1, VP0, VP1, mapped to the unsplit nonterminals in π(G))
45. Estimating Projected Grammars
S → NP VP
S1 → NP1 VP1 0.20
S1 → NP1 VP2 0.12
S1 → NP2 VP1 0.02
S1 → NP2 VP2 0.03
S2 → NP1 VP1 0.11
S2 → NP1 VP2 0.05
S2 → NP2 VP1 0.08
S2 → NP2 VP2 0.12
46. Estimating Projected Grammars
Corazza & Satta 06
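A sketch of how the projected rule probabilities can be estimated (my reconstruction following Corazza & Satta 06 and the NAACL 2007 paper; e(A_x) is the expected count of subcategory A_x in a tree of G, computed on the next slide):

    p_{π(G)}(A → B C) = Σ_{x,y,z} e(A_x) · p(A_x → B_y C_z)  /  Σ_x e(A_x)

i.e. each refined rule contributes its probability weighted by how often its parent subcategory is expected to occur, normalized over all subcategories of A.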
47. Calculating Expectations
- Nonterminals: c_k(X), the expected counts up to depth k (see the sketch below)
- Converges within 25 iterations (a few seconds)
- Rules
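A minimal, self-contained Python sketch of this fixed-point iteration (the toy grammar and probabilities below are invented for illustration; only the iteration scheme reflects the slide):

    # Expected number of occurrences of each nonterminal in a tree of the PCFG,
    # computed from the root down by iterating to a fixed point.
    rules = {
        # parent: [(rule probability, nonterminal children)]
        "S":  [(1.0, ["NP", "VP"])],
        "NP": [(0.7, []), (0.3, ["NP", "PP"])],
        "VP": [(0.6, []), (0.4, ["VP", "PP"])],
        "PP": [(1.0, ["NP"])],
    }

    counts = {X: 0.0 for X in rules}
    counts["S"] = 1.0                        # the root occurs exactly once (depth 0)

    for _ in range(25):                      # converges within ~25 iterations
        new = {X: (1.0 if X == "S" else 0.0) for X in rules}
        for parent, expansions in rules.items():
            for prob, children in expansions:
                for child in children:
                    # each expected occurrence of `parent` contributes `prob`
                    # expected occurrences of every child on the right-hand side
                    new[child] += counts[parent] * prob
        counts = new

    print(counts)                            # c_k(X) for k = 25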
48. Time to parse 1576 sentences
- 1621 min
- 111 min
- 35 min
- 15 min
- (no search error)
49. Parsing times
(chart: parsing time for each grammar, from X-Bar = G0 to G)
50. Bracket Posteriors (after G0)
51. Bracket Posteriors (after G1)
52. Bracket Posteriors (Movie; Final Chart)
53. Bracket Posteriors (Best Tree)
54. Parse Selection
- Computing the most likely unsplit tree is NP-hard
- Settle for the best derivation
- Rerank an n-best list
- Use an alternative objective function (see the sketch below)
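One such alternative objective, as I reconstruct it from the NAACL 2007 paper (a sketch; r is the posterior probability of an unsplit rule over a span, obtained by summing over its refinements):

    T* = argmax_T  Π_{(A → B C, i, k, j) ∈ T}  r(A → B C, i, k, j)

    r(A → B C, i, k, j) = Σ_{x,y,z} P_OUT(A_x, i, j) · p(A_x → B_y C_z) · P_IN(B_y, i, k) · P_IN(C_z, k, j) / P(w)

This scores each unsplit tree by the product of its rules' posterior probabilities and can be maximized with an ordinary Viterbi-style dynamic program.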
55. Final Results (Efficiency)
- Berkeley Parser
- 15 min
- 91.2 F-score
- Implemented in Java
- Charniak & Johnson 05 Parser
- 19 min
- 90.7 F-score
- Implemented in C
56. Final Results (Accuracy)
57. Extensions
- Learning Structured Models for Phone Recognition (EMNLP 07)
- The Infinite PCFG Using Hierarchical Dirichlet Processes (EMNLP 07)
- Discriminative Log-Linear Grammars with Hidden Variables (NIPS 07)
58. Conclusions
- Split & Merge Learning
- Hierarchical Training
- Adaptive Splitting
- Parameter Smoothing
- Hierarchical Coarse-to-Fine Inference
- Projections
- Marginalization
- Multi-lingual Unlexicalized Parsing
60. Inside/Outside Scores
(figure: inside and outside scores around a tree node labeled A_x)
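A sketch of these scores over an observed training tree (my notation; for a node n whose observed rule is A → B C, with latent subscripts x, y, z):

    P_IN(n, A_x) = Σ_{y,z} p(A_x → B_y C_z) · P_IN(left(n), B_y) · P_IN(right(n), C_z)

    P_OUT(left(n), B_y) = Σ_{x,z} P_OUT(n, A_x) · p(A_x → B_y C_z) · P_IN(right(n), C_z)

computed bottom-up and top-down respectively, exactly as in Forward-Backward for HMMs.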
61. Learning Latent Annotations (Details)
62. Adaptive Splitting (Details)
- True data likelihood
- Approximate likelihood with split at n reversed
- Approximate loss in likelihood
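A sketch of these three quantities as I reconstruct them from the ACL 2006 paper (the equations themselves did not survive extraction; A_1 and A_2 are the two halves of a split and p_1, p_2 their relative frequencies at node n):

    True data likelihood (via any node n):      P(w, T) = Σ_x P_IN(n, A_x) · P_OUT(n, A_x)

    Likelihood with the split at n reversed:    P_n = [p_1 · P_IN(n, A_1) + p_2 · P_IN(n, A_2)] · [P_OUT(n, A_1) + P_OUT(n, A_2)] + Σ_{x ∉ {1,2}} P_IN(n, A_x) · P_OUT(n, A_x)

    Approximate loss in likelihood:             Π_n P_n / P(w, T), taken over all nodes n where the split category occurs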