Title: Probabilistic Context Free Grammar
1. Probabilistic Context Free Grammar
2. Language structure is not linear
- Example: "The velocity of seismic waves rises to ..."
- The singular verb "rises" agrees with the distant head noun "velocity", not with the adjacent plural "waves"; such long-distance dependencies are hard for purely linear models
3. Context free grammars: a reminder
- A CFG G consists of:
- A set of terminals w^k, k = 1, ..., V
- A set of nonterminals N^i, i = 1, ..., n
- A designated start symbol, N^1
- A set of rules, N^i → ζ^j (where ζ^j is a sequence of terminals and nonterminals)
4. A very simple example
- G's rewrite rules:
- S → aSb
- S → ab
- Possible derivations:
- S ⇒ aSb ⇒ aabb
- S ⇒ aSb ⇒ aaSbb ⇒ aaabbb
- In general, G creates the language a^n b^n
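A tiny illustrative sketch (mine, not the slides'): applying S → aSb repeatedly and finishing with S → ab yields exactly the strings a^n b^n.

```python
# Minimal sketch: derive strings of the toy grammar S -> aSb | ab,
# which generates the language {a^n b^n : n >= 1}.
def derive(n: int) -> str:
    """Apply S -> aSb (n - 1) times, then S -> ab once."""
    if n == 1:
        return "ab"                        # S -> ab
    return "a" + derive(n - 1) + "b"       # S -> aSb

if __name__ == "__main__":
    for n in range(1, 5):
        print(derive(n))                   # ab, aabb, aaabbb, aaaabbbb
```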
5. Modeling natural language
- G is given by the rewrite rules:
- S → NP VP
- NP → the N | a N
- N → man | boy | dog
- VP → V NP
- V → saw | heard | sensed | sniffed
6. Recursion can be included
- G is given by the rewrite rules:
- S → NP VP
- NP → the N | a N
- N → man CP | boy CP | dog CP
- VP → V NP
- V → saw | heard | sensed | sniffed
- CP → that VP | ε
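For concreteness, a minimal generation sketch (my own dictionary encoding, not anything given on the slides) for this recursive grammar; the empty right-hand side plays the role of ε, and the CP recursion is what allows arbitrarily deep embedding.

```python
import random

# Encoding of the recursive toy grammar above (the encoding is my assumption).
# An empty right-hand side stands for the empty expansion (CP -> ε).
RULES = {
    "S":  [["NP", "VP"]],
    "NP": [["the", "N"], ["a", "N"]],
    "N":  [["man", "CP"], ["boy", "CP"], ["dog", "CP"]],
    "VP": [["V", "NP"]],
    "V":  [["saw"], ["heard"], ["sensed"], ["sniffed"]],
    "CP": [["that", "VP"], []],
}

def generate(symbol="S"):
    """Expand a symbol by picking one of its right-hand sides at random."""
    if symbol not in RULES:                # terminal: emit the word itself
        return [symbol]
    return [w for s in random.choice(RULES[symbol]) for w in generate(s)]

if __name__ == "__main__":
    for _ in range(3):
        print(" ".join(generate()))        # e.g. "the dog saw a boy"
```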
7. Probabilistic Context Free Grammars
- A PCFG G consists of:
- A set of terminals w^k, k = 1, ..., V
- A set of nonterminals N^i, i = 1, ..., n
- A designated start symbol, N^1
- A set of rules, N^i → ζ^j (where ζ^j is a sequence of terminals and nonterminals)
- A corresponding set of probabilities on rules, such that for each i the probabilities of all rules rewriting N^i sum to one
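One possible encoding (with made-up probabilities, purely for illustration): the constraint a PCFG adds over a plain CFG is that the probabilities of all rules sharing a left-hand side form a distribution.

```python
# Minimal sketch; the rule probabilities below are invented for illustration.
PCFG = {
    "S":  {("NP", "VP"): 1.0},
    "NP": {("the", "N"): 0.6, ("a", "N"): 0.4},
    "N":  {("man",): 0.4, ("boy",): 0.3, ("dog",): 0.3},
    "VP": {("V", "NP"): 1.0},
    "V":  {("saw",): 0.4, ("heard",): 0.3, ("sensed",): 0.2, ("sniffed",): 0.1},
}

# For every nonterminal, its rule probabilities must sum to one.
for lhs, rules in PCFG.items():
    assert abs(sum(rules.values()) - 1.0) < 1e-9, lhs
```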
8. Example
9. astronomers saw stars with ears
10. astronomers saw stars with ears
- P(t2) = 0.0006804
- P(w_{1..5}) = P(t1) + P(t2) = 0.0015876
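The slide shows only P(t2) and the total; the probability of the other parse follows by subtraction, and agrees with the standard textbook version of this example:
- P(t1) = P(w_{1..5}) - P(t2) = 0.0015876 - 0.0006804 = 0.0009072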
11. Training PCFGs
- Given a corpus, it is possible to estimate rule probabilities that maximize its likelihood
- This is regarded as a form of grammar induction
- However, the rules of the grammar must be given in advance
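The slides do not spell out the estimator. If parse trees are available for the corpus, the maximum-likelihood estimate is simply the relative frequency of each rule; for raw text the inside-outside (EM) algorithm mentioned below is needed. A minimal counting sketch, with a made-up nested-tuple encoding of trees:

```python
from collections import Counter

# Hypothetical treebank: each tree is a nested tuple (label, child, ...),
# leaves are plain strings.  The encoding is my assumption, not the slides'.
trees = [
    ("S", ("NP", "the", ("N", "dog")),
          ("VP", ("V", "saw"), ("NP", "a", ("N", "boy")))),
]

rule_count, lhs_count = Counter(), Counter()

def count_rules(node):
    if isinstance(node, str):                        # terminal leaf
        return
    lhs, children = node[0], node[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    rule_count[(lhs, rhs)] += 1                      # one use of the rule lhs -> rhs
    lhs_count[lhs] += 1
    for child in children:
        count_rules(child)

for tree in trees:
    count_rules(tree)

# Maximum-likelihood (relative-frequency) estimate of each rule probability.
prob = {rule: n / lhs_count[rule[0]] for rule, n in rule_count.items()}
print(prob[("NP", ("the", "N"))])                    # 0.5 in this toy treebank
```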
12. Questions for PCFGs
- What is the probability of a sentence w_{1..n} given a grammar G, i.e. P(w_{1..n} | G)?
- Calculated using dynamic programming
- What is the most likely parse for a given sentence, argmax_t P(t | w_{1..n}, G)?
- Likewise
- How can we choose rule probabilities for the grammar G that maximize the probability of a given corpus?
- The inside-outside algorithm
13. Chomsky Normal Form
- From here on we deal only with PCFGs in Chomsky Normal Form
- That means there are exactly two types of rules:
- N^i → N^j N^k
- N^i → w^j
14. Estimating string probability
- Define inside probabilities: β_j(p,q) = P(w_{p..q} | N^j spans positions p..q, G)
- We would like to calculate P(w_{1..n} | G) = β_1(1,n)
- A dynamic programming algorithm over spans
- Base step: β_j(k,k) = P(N^j → w_k)
15. Estimating string probability (continued)
- Induction step: β_j(p,q) = Σ_{r,s} Σ_{d=p..q-1} P(N^j → N^r N^s) β_r(p,d) β_s(d+1,q)
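The formulas and tables on these two slides did not survive the transcript, so here is a stand-in: a minimal Python sketch of the standard inside algorithm for a CNF PCFG. The grammar and probabilities are the usual textbook ones for the "astronomers saw stars with ears" example; they reproduce the 0.0015876 of slide 10.

```python
from collections import defaultdict

# Standard textbook grammar for this example (binary and lexical CNF rules).
binary = {("S", ("NP", "VP")): 1.0,
          ("NP", ("NP", "PP")): 0.4,
          ("PP", ("P", "NP")): 1.0,
          ("VP", ("V", "NP")): 0.7,
          ("VP", ("VP", "PP")): 0.3}
lexical = {("NP", "astronomers"): 0.1, ("NP", "ears"): 0.18,
           ("NP", "saw"): 0.04, ("NP", "stars"): 0.18,
           ("NP", "telescopes"): 0.1,
           ("V", "saw"): 1.0, ("P", "with"): 1.0}

def inside(words, start="S"):
    n = len(words)
    beta = defaultdict(float)                      # beta[(N, p, q)] = inside probability
    for k, w in enumerate(words):                  # base step: beta_j(k, k)
        for (lhs, word), prob in lexical.items():
            if word == w:
                beta[(lhs, k, k)] += prob
    for span in range(2, n + 1):                   # induction step over larger spans
        for p in range(0, n - span + 1):
            q = p + span - 1
            for (lhs, (B, C)), prob in binary.items():
                for d in range(p, q):
                    beta[(lhs, p, q)] += prob * beta[(B, p, d)] * beta[(C, d + 1, q)]
    return beta[(start, 0, n - 1)]                 # P(w_1..n | G)

print(inside("astronomers saw stars with ears".split()))   # ~0.0015876
```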
16. Drawbacks of PCFGs
- Do not factor in lexical co-occurrence
- Rewrite rules must be given in advance, according to human intuitions
- The ATIS-CFG fiasco
- The capacity of a PCFG to determine the most likely parse is very limited
- As grammars grow larger, they become increasingly ambiguous
- The following sentences look the same to a PCFG, although they suggest different parses:
- I saw the boat with the telescope
- I saw the man with the scar
17. PCFGs: some more drawbacks
- They have some inappropriate biases
- In general, the probability of a smaller tree will be larger than that of a larger one
- Yet the most frequent length for Wall Street Journal sentences is around 23 words
- Training is slow and problematic
- It converges only to a local optimum
- Non-terminals do not always resemble true syntactic classes
18. PCFGs and language models
- Because they ignore lexical co-occurrence, PCFGs are not good language models
- However, some work has been done on combining PCFGs with n-gram models
- In these hybrids, the PCFG modeled long-range syntactic constraints
- Performance generally improved
19. Is natural language a CFG?
- There is an ongoing debate on whether English is context-free
- Some languages can be shown to be more complex than CFGs can describe
- For example, Dutch
20. Dutch oddities
- Dat Jan Marie Pieter Arabisch laat zien schrijven
- THAT JAN MARIE PIETER ARABIC LET SEE WRITE
- "that Jan let Marie see Pieter write Arabic"
- However, from a purely syntactic viewpoint, this is just dat P^n V^n
21. Other languages
- Bambara (a language of Mali) has non-context-free features, of the form A^n B^m C^n D^m
- Swiss German as well
- However, CFGs seem to be a good approximation for most phenomena in most languages
22. Grammar Induction
- With ADIOS
- (Automatic DIstillation Of Structure)
23. Previous work
- Probabilistic Context Free Grammars
- Supervised induction methods
- Little work on raw data
- Mostly work on artificial CFGs
- Clustering
24. Our goal
- Given a corpus of raw text separated into sentences, we want to derive a specification of the underlying grammar
- This means we want to be able to:
- Create new unseen grammatically correct sentences
- Accept new unseen grammatically correct sentences and reject ungrammatical ones
25. What do we need to do?
- G is given by the rewrite rules:
- S → NP VP
- NP → the N | a N
- N → man | boy | dog
- VP → V NP
- V → saw | heard | sensed | sniffed
26. ADIOS in outline
- Composed of three main elements
- A representational data structure
- A segmentation criterion (MEX)
- A generalization ability
- We will consider each of these in turn
27. The Model: graph representation with words as vertices and sentences as paths
Is that a dog?
Is that a cat?
Where is the dog?
And is that a horse?
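A minimal sketch of how such a structure might be loaded (the BEGIN/END markers and the exact bookkeeping are my assumptions): words become vertices, and each sentence is stored as a path over them, so identical sub-paths across sentences line up automatically.

```python
from collections import defaultdict

sentences = ["is that a dog", "is that a cat",
             "where is the dog", "and is that a horse"]

paths = []                  # each sentence is stored as a path of vertices
edges = defaultdict(list)   # (u, v) -> indices of the paths using that edge

for i, s in enumerate(sentences):
    path = ["BEGIN"] + s.split() + ["END"]
    paths.append(path)
    for u, v in zip(path, path[1:]):
        edges[(u, v)].append(i)

print(edges[("is", "that")])    # paths 0, 1 and 3 share this edge
```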
28. ADIOS in outline
- Composed of three main elements
- A representational data structure
- A segmentation criterion (MEX)
- A generalization ability
29. Toy problem: Alice in Wonderland
a l i c e w a s b e g i n n i n g t o g e t v e r
y t i r e d o f s i t t i n g b y h e r s i s t e
r o n t h e b a n k a n d o f h a v i n g n o t h
i n g t o d o o n c e o r t w i c e s h e h a d p
e e p e d i n t o t h e b o o k h e r s i s t e r
w a s r e a d i n g b u t i t h a d n o p i c t u
r e s o r c o n v e r s a t i o n s i n i t a n d
w h a t i s t h e u s e o f a b o o k t h o u g h
t a l i c e w i t h o u t p i c t u r e s o r c o
n v e r s a t i o n
30. Detecting significant patterns
- Identifying patterns becomes easier on a graph
- Sub-paths are automatically aligned
31. Motif EXtraction (MEX)
32. The Markov Matrix
- The top-right triangle defines the P_L probabilities, the bottom-left triangle the P_R probabilities
- The matrix is path-dependent
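A sketch of how I read the two quantities (the exact formulation used by ADIOS may differ in detail): P_R is the probability that a sub-path extends one step to the right, and P_L that it extends one step to the left, both estimated from sub-path occurrence counts over all stored paths.

```python
# Occurrence-count sketch of the P_R / P_L quantities; an interpretation,
# not the authors' exact definition.
def occurrences(subpath, paths):
    """How many times `subpath` occurs as a contiguous sub-path."""
    k = len(subpath)
    return sum(1 for p in paths for t in range(len(p) - k + 1)
               if p[t:t + k] == subpath)

def p_right(path, i, j, paths):
    """P_R(e_i; e_j): chance that the sub-path e_i..e_{j-1} continues with e_j."""
    return occurrences(path[i:j + 1], paths) / occurrences(path[i:j], paths)

def p_left(path, i, j, paths):
    """P_L(e_j; e_i): chance that the sub-path e_{i+1}..e_j is preceded by e_i."""
    return occurrences(path[i:j + 1], paths) / occurrences(path[i + 1:j + 1], paths)

paths = [s.split() for s in ["is that a dog", "is that a cat",
                             "where is the dog", "and is that a horse"]]
print(p_right(paths[0], 0, 1, paths))   # P_R(is; that) = 3/4
```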
34. Example of a probability matrix
35. Rewiring the graph
Once a pattern is identified as significant, the sub-paths it subsumes are merged into a new vertex and the graph is rewired accordingly. Repeating this process leads to the formation of complex, hierarchically structured patterns.
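A minimal sketch of that rewiring step (my reading, not the authors' code): every occurrence of the significant sub-path is collapsed into a single new pattern vertex, so later passes treat the pattern as one unit.

```python
def rewire(paths, pattern, name):
    """Replace each occurrence of `pattern` (a list of vertices) by `name`."""
    k = len(pattern)
    new_paths = []
    for p in paths:
        out, t = [], 0
        while t < len(p):
            if p[t:t + k] == pattern:
                out.append(name)          # the merged pattern vertex
                t += k
            else:
                out.append(p[t])
                t += 1
        new_paths.append(out)
    return new_paths

paths = [["is", "that", "a", "dog"], ["and", "is", "that", "a", "horse"]]
print(rewire(paths, ["is", "that", "a"], "P1"))   # [['P1', 'dog'], ['and', 'P1', 'horse']]
```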
36. MEX at work
37. ALICE motifs
38. ADIOS in outline
- Composed of three main elements
- A representational data structure
- A segmentation criterion (MEX)
- A generalization ability
39. Generalization
40. Bootstrapping
41. Determining L
- Involves a tradeoff
- A larger L will demand more context sensitivity in the inference
- This will hamper generalization
- Smaller L will detect more patterns
- But many might be spurious
42. The ADIOS algorithm
- Initialization: load all data into a pseudograph
- Until no more patterns are found:
- For each path P:
- Create generalized search paths from P
- Detect significant patterns using MEX
- If found, add the best new pattern and its equivalence classes, and rewire the graph
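A very rough, hypothetical sketch of that loop; generalize() and mex_best_pattern() are placeholders for the bootstrapped search-path generalization and the MEX significance test, and rewire() could be the helper sketched after slide 35.

```python
def adios(paths, generalize, mex_best_pattern, rewire):
    """One reading of the outer loop; the three callables are placeholders."""
    patterns = []
    found_any = True
    while found_any:                                  # until no more patterns are found
        found_any = False
        for path in list(paths):
            candidates = generalize(path, paths)      # generalized search paths from P
            best = mex_best_pattern(candidates, paths)  # significant pattern via MEX
            if best is not None:
                subpath, name = best                  # best new pattern and its label
                patterns.append(best)
                paths = rewire(paths, subpath, name)  # rewire the graph
                found_any = True
    return patterns, paths
```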
43. The Model: the training process
44. (figure only: numbered graph vertices from the training-process illustration)
45. (figure only: numbered graph vertices from the training-process illustration)
46. Example
47. More Patterns
48. Evaluating performance
- In principle, we would like to compare ADIOS-generated parse trees with the true parse trees for given sentences
- Alas, the "true" parse trees are a matter of opinion
- Some approaches do not even assume parse trees
49. Evaluating performance (continued)
- Define:
- Recall: the probability of ADIOS recognizing an unseen grammatical sentence
- Precision: the proportion of ADIOS productions that are grammatical
- Recall can be assessed by holding out part of the training corpus
- Precision is trickier
- Unless we are learning a known CFG
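Stated as code (a trivial sketch; learner_accepts and is_grammatical stand in for the trained model and for a human or known-CFG judgment):

```python
def recall(learner_accepts, held_out_sentences):
    """Fraction of unseen grammatical sentences the learner accepts."""
    return sum(map(learner_accepts, held_out_sentences)) / len(held_out_sentences)

def precision(is_grammatical, generated_sentences):
    """Fraction of the learner's own productions judged grammatical."""
    return sum(map(is_grammatical, generated_sentences)) / len(generated_sentences)
```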
50. The ATIS experiments
- ATIS-NL is a 13,043-sentence corpus of natural language
- Transcribed phone calls to an airline reservation service
- ADIOS was trained on 12,700 sentences of ATIS-NL
- The remaining 343 sentences were used to assess recall
- Precision was determined with the help of 8 graduate students from Cornell University
51. The ATIS experiments (continued)
- ADIOS performance scores:
- Recall: 40%
- Precision: 70%
- For comparison, ATIS-CFG reached:
- Recall: 45%
- Precision: <1% (!)
52. ADIOS/ATIS-N comparison
53. An ADIOS drawback
- ADIOS is inherently a heuristic and greedy algorithm
- Once a pattern is created it remains forever, so errors accumulate
- Sentence ordering affects the outcome
- Running ADIOS with different orderings gives patterns that cover different parts of the grammar
54. An ad-hoc solution
- Train multiple learners on the corpus
- Each on a different sentence ordering
- Create a forest of learners
- To create a new sentence:
- Pick one learner at random
- Use it to produce the sentence
- To check the grammaticality of a given sentence:
- If any learner accepts the sentence, declare it grammatical
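A minimal sketch of the forest idea (train_adios and the learner objects with generate()/accepts() methods are placeholders, not given by the slides):

```python
import random

def train_forest(train_adios, corpus, n_learners=10, seed=0):
    """Train several learners, each on its own random ordering of the corpus."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_learners):
        ordering = corpus[:]
        rng.shuffle(ordering)
        forest.append(train_adios(ordering))
    return forest

def generate(forest, rng=random):
    return rng.choice(forest).generate()              # pick one learner at random

def is_grammatical(forest, sentence):
    return any(learner.accepts(sentence) for learner in forest)
```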
55. The effects of context window width
56. Meta-analysis of ADIOS results
- Define a pattern spectrum as the histogram of pattern types for an individual learner
- A pattern type is determined by its contents
- E.g. TT, TET, EE, PE (T = terminal, E = equivalence class, P = pattern)
- A single ADIOS learner was trained on each of 6 translations of the Bible
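A small sketch of how such a spectrum could be computed, assuming each pattern is stored as the list of its constituents and the T/E/P letters name their kinds (my reading of the abbreviations above):

```python
from collections import Counter

def pattern_type(children, kind_of):
    """A pattern whose children are terminal, class, terminal has type 'TET'."""
    return "".join(kind_of[c] for c in children)

def pattern_spectrum(patterns, kind_of):
    """Histogram of pattern types for one learner."""
    return Counter(pattern_type(children, kind_of) for children in patterns)

# Toy usage with made-up patterns:
kind_of = {"the": "T", "dog": "T", "E17": "E", "P5": "P"}
patterns = [["the", "dog"], ["the", "E17", "dog"], ["P5", "E17"]]
print(pattern_spectrum(patterns, kind_of))            # {'TT': 1, 'TET': 1, 'PE': 1}
```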
57. Pattern spectra
58. Language dendrogram
59. To be continued