Title: Probabilistic and Lexicalized Parsing
1. Probabilistic and Lexicalized Parsing
2. Probabilistic CFGs (PCFGs)
- Weighted CFGs
- Attach weights to the rules of a CFG
- Compute weights of derivations
- Use the weights to choose preferred parses
- Utility: pruning and ordering the search space, disambiguation, language modeling for ASR
- Parsing with weighted grammars: find the parse T that maximizes the weight of the derivation, over all possible parses of S
- T(S) = argmax_{T ∈ τ(S)} W(T, S)
- Probabilistic CFGs are one form of weighted CFG
3. Rule Probability
- Attach probabilities to grammar rules
- Expansions for a given non-terminal sum to 1
- R1: VP → V .55
- R2: VP → V NP .40
- R3: VP → V NP NP .05
- Estimate the probabilities from annotated corpora, e.g. the Penn Treebank
- P(R1) = count(R1) / count(VP), as in the sketch below
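A minimal sketch of this counting scheme, assuming a toy list of observed rule expansions rather than a real treebank (the rules and counts are illustrative):

```python
from collections import Counter

# Each observed rule expansion from an annotated corpus, written as (LHS, RHS).
observed_rules = [
    ("VP", ("V",)), ("VP", ("V", "NP")), ("VP", ("V", "NP")),
    ("VP", ("V", "NP", "NP")), ("VP", ("V",)),
]

rule_counts = Counter(observed_rules)                    # count(A -> beta)
lhs_counts = Counter(lhs for lhs, _ in observed_rules)   # count(A)

# P(A -> beta) = count(A -> beta) / count(A), so expansions of each A sum to 1.
rule_prob = {rule: c / lhs_counts[rule[0]] for rule, c in rule_counts.items()}

for (lhs, rhs), p in sorted(rule_prob.items()):
    print(f"{lhs} -> {' '.join(rhs)} : {p:.2f}")
```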
4. Derivation Probability
- For a derivation T = R1 ... Rn
- Probability of the derivation: the product of the probabilities of the rules expanded in the tree
- Most likely (most probable) parse: the derivation with the highest probability (see the sketch below)
- Probability of a sentence: the sum over all possible derivations of the sentence
- Note the independence assumption: a rule's probability does not change based on where in the derivation the rule is expanded
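A small sketch of the product rule, using hypothetical rule probabilities for a toy derivation of "John called Mary":

```python
import math

# Hypothetical rule probabilities (e.g. estimated as in the previous sketch).
rule_prob = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("V", "NP")): 0.40,
    ("NP", ("John",)): 0.3,
    ("NP", ("Mary",)): 0.3,
    ("V", ("called",)): 1.0,
}

def derivation_prob(rules_used):
    """P(T) = product of the probabilities of the rules expanded in the tree."""
    return math.prod(rule_prob[r] for r in rules_used)

# One derivation of "John called Mary".
T = [("S", ("NP", "VP")), ("NP", ("John",)),
     ("VP", ("V", "NP")), ("V", ("called",)), ("NP", ("Mary",))]
print(derivation_prob(T))  # 1.0 * 0.3 * 0.40 * 1.0 * 0.3 ≈ 0.036
```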
5. One Approach: CYK Parser
- Bottom-up parsing via dynamic programming
- Assign probabilities to constituents as they are completed and placed in a table
- Use the maximum probability for each constituent type going up the tree to S
- The intuition: we already know the probabilities of constituents lower in the tree, so when we construct higher-level constituents we don't need to recompute them
6. CYK (Cocke-Younger-Kasami) Parser
- Bottom-up parser with top-down filtering
- Uses dynamic programming to store intermediate results (cf. the Earley algorithm for the top-down case)
- Input: a PCFG in Chomsky Normal Form
- Rules of the form A → w or A → B C; no ε-productions
- Chart: array [i, j, A] holding the probability that non-terminal A spans input positions i to j
- Start states: (i, i+1, A) for each A → w_{i+1}
- End state: (1, n, S), where n is the input size
- Next-state rule: (i, k, B) and (k, j, C) → (i, j, A) if A → B C
- Maintain back-pointers to recover the parse (a sketch follows this list)
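A hedged sketch of the probabilistic CKY recurrence described above. The toy CNF grammar and its probabilities are invented for illustration (they cover the running example "John called Mary from Denver"); only the best probability for S over the whole input is printed, but the back-pointers would let you recover the tree.

```python
from collections import defaultdict

lexical = {  # A -> w rules (probabilities are illustrative)
    ("NP", "John"): 0.3, ("NP", "Mary"): 0.3, ("NP", "Denver"): 0.3,
    ("V", "called"): 1.0, ("P", "from"): 1.0,
}
binary = {  # A -> B C rules
    ("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 0.6, ("VP", "VP", "PP"): 0.4,
    ("NP", "NP", "PP"): 0.1, ("PP", "P", "NP"): 1.0,
}

def cky(words):
    n = len(words)
    chart = defaultdict(dict)    # (i, j) -> {A: best probability of A spanning i..j}
    back = {}                    # (i, j, A) -> (k, B, C) or the word itself
    for i, w in enumerate(words):                 # base case: A -> w
        for (A, word), p in lexical.items():
            if word == w:
                chart[(i, i + 1)][A] = p
                back[(i, i + 1, A)] = w
    for span in range(2, n + 1):                  # recursive case: A -> B C
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):             # split point
                for (A, B, C), p in binary.items():
                    if B in chart[(i, k)] and C in chart[(k, j)]:
                        prob = p * chart[(i, k)][B] * chart[(k, j)][C]
                        if prob > chart[(i, j)].get(A, 0.0):   # keep the max
                            chart[(i, j)][A] = prob
                            back[(i, j, A)] = (k, B, C)
    return chart[(0, n)].get("S", 0.0), back

prob, back = cky("John called Mary from Denver".split())
print(prob)
```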
7. Structural Ambiguity
- S → NP VP
- VP → V NP
- NP → NP PP
- VP → VP PP
- PP → P NP
- NP → John | Mary | Denver
- V → called
- P → from
John called Mary from Denver
[Parse-tree figure for "John called Mary from Denver", showing one attachment of the PP "from Denver"]
8. Example
John called Mary from Denver
9. Base Case: A → w
[CYK chart figure: the diagonal cells are filled with the lexical constituents NP (John), V (called), NP (Mary), P (from), NP (Denver)]
10. Recursive Cases: A → B C
[CYK chart figure: adjacent spans begin to be combined bottom-up; cells that admit no constituent are marked X]
11-22. [Successive CYK chart snapshots: the PP "from Denver", the VPs "called Mary" and "called Mary from Denver", the NP "Mary from Denver", and finally S are entered span by span; where two VP analyses (VP1, VP2) compete for the same cell, the higher-probability one is kept for the final S]
23. Problems with PCFGs
- The probability model is based only on the rules in the derivation
- Lexical insensitivity
- Doesn't use words in any real way
- But structural disambiguation is lexically driven
- PP attachment often depends on the verb, its object, and the preposition
- I ate pickles with a fork.
- I ate pickles with relish.
- Context insensitivity of the derivation
- Doesn't take into account where in the derivation a rule is used
- Pronouns are more often subjects than objects
- She hates Mary.
- Mary hates her.
- Solution: lexicalization
- Add lexical information to each rule, i.e. condition the rule probabilities on the actual words
24. An Example: Phrasal Heads
- Phrasal heads can take the place of whole phrases and define the most important characteristics of the phrase
- Phrases are generally identified by their heads
- The head of an NP is its noun, the head of a VP its main verb, the head of a PP its preposition
- In a lexicalized grammar, each PCFG rule's LHS shares a lexical item (its head) with a non-terminal in its RHS
25. Increase in Size of the Rule Set in a Lexicalized CFG
- If R is the set of binary-branching rules of the CFG and Σ is the lexicon, the lexicalized grammar has O(|Σ|² · |R|) rules
- For unary rules: O(|Σ| · |R|)
26. Example (correct parse)
[Parse-tree figure labeled "Attribute grammar"]
27. Example (less preferred)
[Parse-tree figure]
28. Computing Lexicalized Rule Probabilities
- We started with rule probabilities as before
- VP → V NP PP: P(rule | VP)
- E.g., the count of this rule divided by the number of VPs in a treebank
- Now we want lexicalized probabilities
- VP(dumped) → V(dumped) NP(sacks) PP(into)
- i.e., P(rule | VP, dumped is the verb, sacks is the head of the NP, into is the head of the PP)
- Not likely to have significant counts in any treebank
29. Exploit the Data You Have
- So, exploit independence assumptions and collect the statistics you can
- Focus on capturing
- Verb subcategorization: particular verbs have affinities for particular VP expansions
- Objects' affinity for their predicates (mostly their mothers and grandmothers): some objects fit better with some predicates than others
30. Verb Subcategorization
- Condition particular VP rules on their heads
- E.g. for a rule r: VP → V NP PP
- P(r | VP) becomes P(r | VP, dumped)
- How do you get the probability? Count how many times rule r was used with dumped, divided by the total number of VPs that dumped appears in (see the sketch below)
- How predictive of r is the verb dumped?
- Captures the affinity between VP heads (verbs) and VP rules
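An illustrative sketch of that counting scheme: estimate P(r | VP, verb) as the number of VPs headed by the verb that use rule r, over all VPs headed by the verb. The tiny "treebank" of (verb, VP-rule) observations below is made up.

```python
from collections import Counter

vp_observations = [          # (head verb, VP expansion used), fabricated data
    ("dumped", "VP -> V NP PP"), ("dumped", "VP -> V NP PP"),
    ("dumped", "VP -> V NP"),
    ("ate", "VP -> V NP"), ("ate", "VP -> V NP PP"),
]

pair_counts = Counter(vp_observations)
verb_counts = Counter(v for v, _ in vp_observations)

def p_rule_given_verb(rule, verb):
    """Estimate P(rule | VP, verb) = count(verb, rule) / count(VPs headed by verb)."""
    return pair_counts[(verb, rule)] / verb_counts[verb]

print(p_rule_given_verb("VP -> V NP PP", "dumped"))  # 2/3
print(p_rule_given_verb("VP -> V NP PP", "ate"))     # 1/2
```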
31. Example (correct parse)
32. Example (less preferred)
33. Affinity of Phrasal Heads for Other Heads: PP Attachment
- Verbs with prepositions vs. nouns with prepositions
- E.g. dumped with into vs. sacks with into
- How often is dumped the head of a VP that has a PP daughter headed by into, relative to other PP heads? I.e., P(into | PP, dumped is the mother VP's head)
- vs. how often is sacks the head of an NP with a PP daughter whose head is into, relative to other PP heads? I.e., P(into | PP, sacks is the mother's head)
- (A counting sketch follows below)
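A hedged sketch of this head-to-head affinity statistic: how often a given preposition heads a PP daughter under a phrase with a given head word, relative to all PP daughters under that head. The counts are fabricated for illustration.

```python
from collections import Counter

# (mother's head word, PP head preposition) pairs as observed in a treebank (made up).
pp_attachments = [
    ("dumped", "into"), ("dumped", "into"), ("dumped", "with"),
    ("sacks", "of"), ("sacks", "into"),
]

pair_counts = Counter(pp_attachments)
head_counts = Counter(h for h, _ in pp_attachments)

def p_prep_given_head(prep, head):
    """Estimate P(prep | PP, head is the mother's head)."""
    return pair_counts[(head, prep)] / head_counts[head]

print(p_prep_given_head("into", "dumped"))  # 2/3
print(p_prep_given_head("into", "sacks"))   # 1/2
```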
34. But Other Relationships Do Not Involve Heads (Hindle & Rooth 1991)
- The affinity of gusto for eat is greater than its affinity for spaghetti, while the affinity of marinara for spaghetti is greater than its affinity for ate
[Two parse-tree figures: in "ate spaghetti with marinara" the PP(with) attaches to NP(spaghetti); in "ate spaghetti with gusto" the PP(with) attaches to VP(ate)]
35. Log-linear Models for Parsing
- Why restrict the conditioning to the elements of a rule?
- Use even larger context: word sequence, word types, sub-tree context, etc.
- Compute P(y | x) ∝ exp(Σ_i λ_i f_i(x, y)), where each feature f_i(x, y) tests properties of the context and λ_i is the weight of feature i
- Use these as scores in the CKY algorithm to find the best parse (see the sketch below)
36. Supertagging: Almost Parsing
Poachers now control the underground trade
[Figure: candidate supertags (elementary tree fragments with NP, S, VP, V, N, and Adj nodes) assigned to each word of the sentence]
37. Summary
- Parsing context-free grammars
- Top-down and bottom-up parsers
- Mixed approaches (CKY, Earley parsers)
- Preferences over parses using probabilities
- Parsing with PCFGs and the probabilistic CKY algorithm
- Enriching the probability model
- Lexicalization
- Log-linear models for parsing
- Supertagging