Title: Learning PCFGs: Estimating Parameters, Learning Grammar Rules
1. Learning PCFGs: Estimating Parameters, Learning Grammar Rules
- Many slides are taken or adapted from slides by Dan Klein
2. Treebanks
An example tree from the Penn Treebank
3. The Penn Treebank
- 1 million tokens
- In 50,000 sentences, each labeled with
  - A POS tag for each token
  - Labeled constituents
  - Extra information
    - Phrase annotations like TMP
    - Empty constituents for wh-movement traces, empty subjects for raising constructions
4. Supervised PCFG Learning
- Preprocess the treebank
  - Remove all extra information (empties, extra annotations)
  - Convert to Chomsky Normal Form
  - Possibly prune some punctuation, lower-case all words, compute word shapes, and do other processing to combat sparsity
- Count the occurrences of each nonterminal, c(N), and of each observed production rule, c(N → N_L N_R) and c(N → w)
- Set the probability of each rule to the MLE (a minimal sketch of this step follows the list)
  - P(N → N_L N_R) = c(N → N_L N_R) / c(N)
  - P(N → w) = c(N → w) / c(N)
- Easy, peasy, lemon-squeezy.
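As a concrete illustration of the counting and MLE step above, here is a minimal Python sketch. The nested-tuple tree encoding and the function names are assumptions made for illustration, not the format of any actual treebank reader.

```python
from collections import Counter

def mle_rule_probs(trees):
    """Relative-frequency (MLE) rule probabilities from CNF trees.

    Assumed encoding: a node is (label, word) for a lexical rule N -> w,
    or (label, left_subtree, right_subtree) for a binary rule N -> N_L N_R.
    """
    rule_counts = Counter()
    nonterminal_counts = Counter()

    def count(node):
        label = node[0]
        nonterminal_counts[label] += 1          # c(N)
        if len(node) == 2:                      # lexical rule: c(N -> w)
            rule_counts[(label, node[1])] += 1
        else:                                   # binary rule: c(N -> N_L N_R)
            rule_counts[(label, node[1][0], node[2][0])] += 1
            count(node[1])
            count(node[2])

    for tree in trees:
        count(tree)

    # P(rule) = c(rule) / c(parent nonterminal)
    return {rule: c / nonterminal_counts[rule[0]] for rule, c in rule_counts.items()}
```

For example, mle_rule_probs([("S", ("NP", "dogs"), ("VP", "bark"))]) assigns probability 1.0 to each of S → NP VP, NP → dogs, and VP → bark.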
5. Complications
- Smoothing
  - Especially for lexicalized grammars, many test productions will never be observed during training
  - We don't necessarily want to assign these productions zero probability
  - Instead, define backoff distributions, e.g. (a minimal sketch follows this list):
    P_final(VP[transmogrified] → V[transmogrified] PP[into]) =
      α · P(VP[transmogrified] → V[transmogrified] PP[into]) + (1 - α) · P(VP → V PP[into])
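A minimal sketch of the interpolation above, assuming the lexicalized and backoff rule probabilities are already stored in dictionaries; the function name and the value of alpha are illustrative, not from the slides.

```python
def smoothed_rule_prob(lex_rule, backoff_rule, lex_probs, backoff_probs, alpha=0.9):
    """P_final(rule) = alpha * P(lexicalized rule) + (1 - alpha) * P(backoff rule).

    Unseen lexicalized rules fall back (mostly) on the unlexicalized estimate
    instead of receiving probability zero.
    """
    return (alpha * lex_probs.get(lex_rule, 0.0)
            + (1 - alpha) * backoff_probs.get(backoff_rule, 0.0))
```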
6. Problems with Supervised PCFG Learning
- Coming up with labeled data is hard!
  - Time-consuming
  - Expensive
  - Hard to adapt to new domains, tasks, languages
  - Corpus availability drives research (instead of tasks driving the research)
- The Penn Treebank took many person-years to annotate manually.
7. Unsupervised Learning of PCFGs: Feasible?
8. Unsupervised Learning
- Systems take raw data and automatically detect patterns
- Why?
  - More data is available
  - Kids learn (some aspects of) language with no supervision
  - Insights into machine learning and clustering
9. Grammar Induction and Learnability
- Some have argued that learning syntax from positive data alone is impossible
  - Gold, 1967: non-identifiability in the limit
  - Chomsky, 1980: poverty of the stimulus
- Surprising result: it's possible to get entirely unsupervised parsing to work (reasonably) well.
10. Learnability
- Learnability: formal conditions under which a class of languages can be learned
- Setup
  - Class of languages ℒ
  - Algorithm H (the learner)
  - H sees a sequence X of strings x1 … xn
  - H maps sequences X to languages L in ℒ
- Question: for which classes ℒ do learners H exist?
11. Learnability [Gold, 1967]
- Criterion: identification in the limit (restated compactly after this list)
  - A presentation of L is an infinite sequence of x's from L in which each x occurs at least once
  - A learner H identifies L in the limit if, for any presentation of L, from some point n onwards, H always outputs L
  - A class ℒ is identifiable in the limit if there is some single H which correctly identifies in the limit every L in ℒ
- Example: ℒ = {{a}, {a, b}} is identifiable in the limit.
- Theorem (Gold, '67): Any ℒ which contains all finite languages and at least one infinite language (i.e., is superfinite) is unlearnable in this sense.
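Restating the criterion above compactly (a paraphrase of the slide's prose in standard notation, not an addition from the original deck):

```latex
% H identifies L in the limit iff, for every presentation x_1, x_2, \dots of L,
% H's guesses are eventually always L:
\exists n_0 \;\; \forall n \ge n_0 : \; H(x_1, \dots, x_n) = L
```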
12. Learnability [Gold, 1967]
- Proof sketch
  - Assume ℒ is superfinite and that H identifies ℒ in the limit
  - There exists a chain L1 ⊂ L2 ⊂ … ⊂ L∞
  - Construct the following misleading sequence
    - Present strings from L1 until H outputs L1
    - Present strings from L2 until H outputs L2
    - …
  - This is a presentation of L∞, but H never outputs L∞
13. Learnability [Horning, 1969]
- Problem: identification in the limit requires that H succeed on all examples, even the weird ones
- Another criterion: measure-one identification
  - Assume a distribution P_L(x) for each L
  - Assume P_L(x) puts non-zero probability on all and only the x in L
  - Assume an infinite presentation of x drawn i.i.d. from P_L(x)
  - H measure-one identifies L if the probability of drawing a sequence X from which H can identify L is 1
- Theorem (Horning, '69): PCFGs can be identified in this sense.
- Note: there can be misleading sequences, but they have to be (infinitely) unlikely
14. Learnability [Horning, 1969]
- Proof sketch
  - Assume ℒ is a recursively enumerable set of recursive languages (e.g., the set of all PCFGs)
  - Assume an ordering on all strings: x1 < x2 < …
  - Define: two sequences A and B agree through n iff for all x < xn, x is in A ⇔ x is in B
  - Define the error set E(L, n, m):
    - All sequences such that the first m elements do not agree with L through n
    - These are the sequences which contain early strings outside of L (can't happen), or which fail to contain all of the early strings in L (happens less as m increases)
  - Claim: P(E(L, n, m)) goes to 0 as m goes to ∞
  - Let d_L(n) be the smallest m such that P(E) < 2^(-n)
  - Let d(n) be the largest d_L(n) among the first n languages
  - Learner: after d(n) examples, pick the first L that agrees with the evidence through n
  - This can only fail for sequences X if X keeps showing up in E(L, n, d(n)), which happens infinitely often with probability zero.
15. Learnability
- Gold's results say little about real learners (the requirements are too strong)
- Horning's algorithm is completely impractical
  - It needs astronomical amounts of data
- Even measure-one identification doesn't say anything about tree structures
  - It only talks about learning grammatical sets
  - Strong generative vs. weak generative capacity
16. Unsupervised POS Tagging
- Some (discouraging) experiments [Merialdo 94]
- Setup
  - You know the set of allowable tags for each word (but not the frequency of each tag)
  - Learn a supervised model on k training sentences
    - Learn P(w | t) and P(t_i | t_{i-1}, t_{i-2}) on these sentences (a sketch of this step follows the list)
  - On the remaining n > k sentences, re-estimate with EM
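A minimal sketch of the supervised step (estimating P(w | t) and the trigram tag transitions from the k labeled sentences). The function name, the input format, and the sentence-start padding are assumptions for illustration, and the subsequent EM re-estimation on the unlabeled sentences is not shown.

```python
from collections import Counter

def supervised_tagger_mle(tagged_sentences):
    """MLE estimates for a Merialdo-style trigram tagger from labeled sentences.

    tagged_sentences: list of [(word, tag), ...] (assumed format).
    Returns P(w | t) and P(t_i | t_{i-1}, t_{i-2}).
    """
    emit, emit_tot = Counter(), Counter()
    trans, trans_tot = Counter(), Counter()
    for sent in tagged_sentences:
        tags = ["<s>", "<s>"] + [t for _, t in sent]   # sentence-start padding (assumption)
        for word, tag in sent:
            emit[(tag, word)] += 1
            emit_tot[tag] += 1
        for i in range(2, len(tags)):
            trans[(tags[i - 2], tags[i - 1], tags[i])] += 1
            trans_tot[(tags[i - 2], tags[i - 1])] += 1
    p_emit = {(t, w): c / emit_tot[t] for (t, w), c in emit.items()}
    p_trans = {ctx: c / trans_tot[ctx[:2]] for ctx, c in trans.items()}
    return p_emit, p_trans
```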
17. Merialdo Results
18. Grammar Induction
- Unsupervised learning of grammars and parameters
19. Right-branching Baseline
- In English (but not necessarily in other languages), trees tend to be right-branching
- A simple, English-specific baseline is to choose the right-chain structure for each sentence (a minimal sketch follows this list).
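A minimal sketch of this baseline, assuming sentences are given as lists of tokens and trees are represented as nested tuples (both illustrative choices):

```python
def right_branching_tree(words):
    """Attach each word as the left child of the subtree over the remaining words,
    producing the all-right-branching binary tree."""
    if len(words) == 1:
        return words[0]
    return (words[0], right_branching_tree(words[1:]))
```

For example, right_branching_tree("dogs chase cats".split()) returns ('dogs', ('chase', 'cats')).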
20. Distributional Clustering
21. Nearest Neighbors
22. Learn PCFGs with EM [Lari and Young, 1990]
- Setup
  - Full binary grammar with n nonterminals X1, …, Xn (that is, at the beginning, the grammar has all possible rules)
  - Parse uniformly/randomly at first
  - Re-estimate rule expectations from the parses
  - Repeat
- Their conclusion: it doesn't really work
23. EM for PCFGs: Details
1. Start with a full grammar, with all possible binary rules over our nonterminals N1 … Nk. Designate one of them as the start symbol, say N1
2. Assign some starting distribution to the rules, such as
   - Random
   - Uniform
   - Some smart initialization techniques (see assigned reading)
3. E-step: take an unannotated sentence S and compute, for all nonterminals N, N_L, N_R and all terminals w:
   E(N | S), E(N → N_L N_R, N is used | S), E(N → w, N is used | S)
4. M-step: reset rule probabilities to the MLE
   - P(N → N_L N_R) = E(N → N_L N_R | S) / E(N | S)
   - P(N → w) = E(N → w | S) / E(N | S)
5. Repeat steps 3 and 4 until the rule probabilities stabilize (converge); a loop skeleton follows this list
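A skeleton of the loop in steps 2-5, with the inside-outside computation abstracted away: expected_counts is a caller-supplied function (a hypothetical interface, not anything defined on the slides) that returns expected rule counts and expected nonterminal counts summed over the sentences.

```python
def pcfg_em(sentences, unary, binary, nonterminals, expected_counts, iterations=20):
    """EM loop sketch: repeat the E-step and M-step a fixed number of times
    (a convergence test on the rule probabilities could replace the fixed count)."""
    for _ in range(iterations):
        # E-step: E(N -> N_L N_R | S), E(N -> w | S) and E(N | S), via inside-outside.
        rule_exp, nonterm_exp = expected_counts(sentences, unary, binary, nonterminals)
        # M-step: P(rule) = E(rule | S) / E(parent nonterminal | S).
        binary = {r: c / nonterm_exp[r[0]] for r, c in rule_exp.items() if len(r) == 3}
        unary = {r: c / nonterm_exp[r[0]] for r, c in rule_exp.items() if len(r) == 2}
    return unary, binary
```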
24. Definitions
This is the sum of P(T, S | G) over all possible trees T for w_{1m} whose root is N^1 (the formula is restated below).
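The formula this caption describes (the slide's own figure is not reproduced here) is the standard definition of the sentence probability in terms of the inside probability β; the notation follows the usual inside-outside convention rather than anything shown explicitly in the extracted text:

```latex
P(S \mid G) \;=\; \beta_1(1, m) \;=\; \sum_{T} P(T, S \mid G),
\qquad
\beta_j(p, q) \;=\; P(w_{pq} \mid N^j_{pq},\, G)
```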
25. E-Step
- We can define the expectations we want in terms of the P, α, and β quantities (standard forms are restated below)
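A standard statement of these expected counts, in the usual inside-outside notation (the slide's own equations are not present in the extracted text):

```latex
E(N^j \to N^l N^r,\ N^j\ \text{used} \mid S)
  \;=\; \frac{1}{P(S \mid G)} \sum_{1 \le p \le d < q \le m}
        \alpha_j(p,q)\, P(N^j \to N^l N^r)\, \beta_l(p,d)\, \beta_r(d+1,q)

E(N^j \to w,\ N^j\ \text{used} \mid S)
  \;=\; \frac{1}{P(S \mid G)} \sum_{p \,:\, w_p = w} \alpha_j(p,p)\, P(N^j \to w)

E(N^j \mid S)
  \;=\; \frac{1}{P(S \mid G)} \sum_{1 \le p \le q \le m} \alpha_j(p,q)\, \beta_j(p,q)
```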
26. Inside Probabilities
Diagram: N^j spans w_p … w_q, expanding to N^l over w_p … w_d and N^r over w_{d+1} … w_q.
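The diagram corresponds to the standard inside recursion. Below is a minimal Python sketch of it; the dictionary-based grammar encoding and the function name are assumptions for illustration, and no care is taken with underflow (real implementations work in log space or rescale).

```python
from collections import defaultdict

def inside_probs(words, unary, binary, nonterminals):
    """Inside probabilities beta[(N, p, q)] = P(w_p ... w_q | N) for a CNF PCFG.

    unary:  dict mapping (N, word) -> P(N -> word)    (assumed encoding)
    binary: dict mapping (N, L, R) -> P(N -> L R)     (assumed encoding)
    """
    n = len(words)
    beta = defaultdict(float)
    # Base case: length-1 spans come from the lexical rules N -> w_p.
    for p in range(n):
        for N in nonterminals:
            beta[(N, p, p)] = unary.get((N, words[p]), 0.0)
    # Recursion: N^j over [p, q] splits at d into N^l over [p, d] and N^r over [d+1, q].
    for span in range(2, n + 1):
        for p in range(n - span + 1):
            q = p + span - 1
            for (N, L, R), rule_prob in binary.items():
                for d in range(p, q):
                    beta[(N, p, q)] += rule_prob * beta[(L, p, d)] * beta[(R, d + 1, q)]
    return beta
```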
27. Outside Probabilities
28. Problem: Model Symmetries
29. Distributional Syntax?
30. Problem: Identifying Constituents
31. A nested distributional model
- We'd like a model that
  - Ties spans to linear contexts (like distributional clustering)
  - Considers only proper tree structures (like PCFGs)
  - Has no symmetries to break (like a dependency model)
32. Constituent Context Model (CCM)
33. Results: Constituency
34. Results: Dependencies
35. Results: Combined Models
36. Multilingual Results