1
Chapter 12: Probabilistic Parsing and Treebanks
  • Heshaam Faili
  • hfaili@ece.ut.ac.ir
  • University of Tehran

2
Motivation and Outline
  • Previously, we used CFGs to parse with, but
  • Some ambiguous sentences could not be
    disambiguated, and we would like to know the most
    likely parse
  • How do we get such grammars? Do we write them
    ourselves? Maybe we could use a corpus
  • Where we're going:
  • Probabilistic Context-Free Grammars (PCFGs)
  • Lexicalized PCFGs
  • Dependency Grammars

3
Statistical Parsing
  • Basic idea
  • Start with a treebank
  • a collection of sentences with syntactic
    annotation, i.e., already-parsed sentences
  • Examine which parse trees occur frequently
  • Extract grammar rules corresponding to those
    parse trees, estimating the probability of the
    grammar rule based on its frequency
  • That is, we'll have a CFG augmented with
    probabilities

4
Probabilistic Context-Free Grammars (PCFGs)
  • Definition of a CFG
  • Set of non-terminals (N)
  • Set of terminals (T)
  • Set of rules/productions (P), of the form A → β
  • Designated start symbol (S)
  • Definition of a PCFG
  • Same as a CFG, but with one more function, D
  • D assigns probabilities to each rule in P

5
Probabilities
  • The function D gives the probability that a
    non-terminal A is expanded to a sequence β
  • Written as P(A → β)
  • or as P(A → β | A)
  • The idea is that, given A as the mother
    non-terminal (LHS), what is the likelihood that β
    is the correct RHS?
  • Note that Σi P(A → βi | A) = 1
  • For example, we would augment a CFG with these
    probabilities (sketched in code below)
  • P(S → NP VP | S) = .80
  • P(S → Aux NP VP | S) = .15
  • P(S → VP | S) = .05
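
A minimal sketch of how such a rule table might be stored, assuming
plain Python dicts (names are illustrative, not from the chapter):

from math import isclose

# Each LHS non-terminal maps to its possible RHS expansions and
# their probabilities (values from the slide above).
pcfg = {
    "S": {("NP", "VP"): 0.80,
          ("Aux", "NP", "VP"): 0.15,
          ("VP",): 0.05},
}

# Sanity check: for every non-terminal A, the probabilities of all
# rules A -> beta must sum to 1.
for lhs, expansions in pcfg.items():
    assert isclose(sum(expansions.values()), 1.0), lhs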

6
Estimating Probabilities using a Treebank
  • Given a corpus of sentences annotated with
    syntactic structure (e.g., the Penn Treebank)
  • Consider all parse trees
  • (1) Each time you have a rule of the form A → β
    applied in a parse tree, increment a counter for
    that rule
  • (2) Also count the number of times A is on the
    left hand side of a rule
  • Divide (1) by (2)
  • P(A → β | A) = Count(A → β) / Count(A)
    (see the sketch below)
  • If you don't have annotated data, parse the
    corpus (as we'll describe next) and estimate the
    probabilities, which are then used to re-parse
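
A minimal sketch of this relative-frequency estimate, assuming NLTK
with its Penn Treebank sample installed; any collection of nltk.Tree
objects would work the same way:

from collections import Counter
from nltk.corpus import treebank  # requires nltk.download('treebank')

rule_count = Counter()  # Count(A -> beta) for each rule
lhs_count = Counter()   # Count(A) on a left-hand side

for tree in treebank.parsed_sents():
    for prod in tree.productions():
        rule_count[prod] += 1
        lhs_count[prod.lhs()] += 1

# P(A -> beta | A) = Count(A -> beta) / Count(A)
rule_prob = {r: c / lhs_count[r.lhs()] for r, c in rule_count.items()}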

7
An Example
8
Using Probabilities to Parse
  • P(T) = probability of a particular parse tree
  • P(T, S) = ∏n∈T p(r(n)); also P(T, S) =
    P(T) · P(S | T), but P(S | T) = 1, so
  • P(T) = ∏n∈T p(r(n))
  • i.e., the product of the probabilities of all
    the rules r used to expand each node n in the
    parse tree
  • Example: given the probabilities on p. 449,
    compute the probability of the parse tree on the
    right

9
Computing probabilities
  • We have the following rules and probabilities
    (adapted from Figure 12.1)
  • S → VP .05
  • VP → V NP .40
  • NP → Det N .20
  • V → book .30
  • Det → that .05
  • N → flight .25
  • P(T) = P(S → VP) · P(VP → V NP) · P(NP → Det N) ·
    P(V → book) · P(Det → that) · P(N → flight)
  • = .05 × .40 × .20 × .30 × .05 × .25 = .000015,
    or 1.5 × 10⁻⁵ (computed in the sketch below)
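
The same arithmetic in a couple of lines of Python:

from math import prod

# The six rule probabilities used in the tree for "book that flight"
rule_probs = [0.05, 0.40, 0.20, 0.30, 0.05, 0.25]
print(prod(rule_probs))  # ≈ 1.5e-05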

10
Using probabilities
  • So, the probability for that parse is 0.000015.
    What's the big deal?
  • Probabilities are useful for comparing with other
    probabilities
  • Whereas we couldn't decide between two parses
    using a regular CFG, we now can
  • For example, TWA flights is ambiguous between
    being two separate NPs (cf. I gave [NP John]
    [NP money]) or one NP
  • (A) book [TWA] [flights]
  • (B) book [TWA flights]
  • Probabilities allow us to choose reading (B)
    (see Figure 12.2)

11
(Figure-only slide; no transcript)
12
Obtaining the best parse
  • Call the best parse T(S), where S is your
    sentence
  • Get the tree which has the highest probability,
    i.e.
  • T(S) = argmaxT∈parse-trees(S) P(T)
  • Can use the Cocke-Younger-Kasami (CYK) algorithm
    to calculate best parse
  • CYK is a form of dynamic programming
  • CYK is a chart parser, like the Earley parser

13
The CYK algorithm
  • Base case
  • Add words to the chart
  • Store P(A → wi) for every category A in the chart
  • Recursive case (this is what makes it dynamic
    programming: we only calculate B and C once)
  • Rules must be of the form A → BC, i.e., exactly
    two items on the RHS (we call this Chomsky Normal
    Form (CNF))
  • Get the probability for A at this node by
    multiplying the probabilities for B and for C by
    P(A → BC)
  • P(B) × P(C) × P(A → BC)
  • For a given A, only keep the maximum probability
    (again, this is dynamic programming)

14
PCYK pseudo-code
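
The slide's pseudo-code itself is not in the transcript; below is a
minimal Python sketch of probabilistic CKY for a CNF grammar. The
dict layout and all names are illustrative, not from the chapter:

from collections import defaultdict

def pcky(words, lexical, binary):
    """lexical[word] -> list of (A, P(A -> word));
    binary[(B, C)] -> list of (A, P(A -> B C))."""
    n = len(words)
    # chart[(i, j)] maps category A to the best probability of A
    # spanning words[i:j]; keeping only the max is the DP step.
    chart = defaultdict(dict)

    # Base case: score each word with its lexical categories.
    for i, w in enumerate(words):
        for cat, p in lexical.get(w, []):
            if p > chart[(i, i + 1)].get(cat, 0.0):
                chart[(i, i + 1)][cat] = p

    # Recursive case: combine adjacent spans with rules A -> B C,
    # scoring each candidate as P(B) * P(C) * P(A -> B C).
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for b, pb in chart[(i, k)].items():
                    for c, pc in chart[(k, j)].items():
                        for a, pr in binary.get((b, c), []):
                            p = pb * pc * pr
                            if p > chart[(i, j)].get(a, 0.0):
                                chart[(i, j)][a] = p
    return chart

chart[(0, n)].get("S", 0.0) is then the probability of the best parse
rooted in S; a real parser would also keep backpointers to recover
the tree itself.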
15
Example: The flight includes a meal
16
Problems with PCFGs
  • It's still only a CFG, so dependencies on
    non-CFG info are not captured
  • e.g., pronouns are more likely to be subjects
    than objects
  • P(NP → Pronoun | NPsubj) >> P(NP → Pronoun |
    NPobj)

17
Problems with PCFGs
  • Ignores lexical information (statistics), which
    is usually crucial for disambiguation
  • (T1) America sent [250,000 soldiers into Iraq]
  • (T2) America sent [250,000 soldiers] [into Iraq]
  • send with an into-PP is always attached high
    (T2) in the PTB!
  • To handle lexical information, we'll turn to
    lexicalized PCFGs

18
Ignoring lexical information
  • (T2) VP → VBD NP PP
  • (T1) VP → VBD NP, with NP → NP PP
19
Lexicalized Grammars
  • Remember how we talked about head information
    being passed up in a syntactic analysis?
  • e.g., VP[head 1] → V[head 1] NP
  • Well, if you follow this down all the way to the
    bottom of a tree, you wind up with a head word
  • In some sense, we can say that Book that flight
    is not just an S, but an S rooted in book
  • Thus, book is the headword of the whole sentence
  • By adding headword information to nonterminals,
    we wind up with a lexicalized grammar

20
Lexicalized Grammars
  • Best results so far:
  • Collins Parser
  • Charniak Parser

21
Lexicalized PCFGs
  • Lexicalized Parse Trees
  • Each PCFG rule in a tree is augmented to identify
    one RHS constituent to be the head daughter
  • The headword for a node is set to the head word
    of its head daughter

(Figure: two lexicalized parse trees with the headwords book and
flight propagated up the tree)
22
Incorporating Head Probabilities: the Wrong Way
  • Simply adding a headword w to each node won't
    work
  • So, the node A becomes A(w)
  • e.g., P(A(w) → β | A(w)) =
    Count(A(w) → β) / Count(A(w))
  • The probabilities are too small, i.e., we don't
    have a big enough corpus to estimate them
    reliably
  • VP(dumped) → VBD(dumped) NP(sacks) PP(into)
    = 3 × 10⁻¹⁰
  • VP(dumped) → VBD(dumped) NP(cats) PP(into)
    = 8 × 10⁻¹¹
  • These probabilities are tiny, and many other
    combinations will never occur in the corpus at
    all

23
(Figure-only slide; no transcript)
24
Incorporating Head Probabilities: the Right Way
  • Previously, we conditioned on the mother node
    (A)
  • P(A → β | A)
  • Now, we can condition on the mother node and the
    headword of A (h(A))
  • P(A → β | A, h(A))
  • We're no longer conditioning simply on the mother
    category A, but on the mother category when h(A)
    is the head
  • e.g., P(VP → VBD NP PP | VP, dumped)

25
Calculating rule probabilities
  • We'll write the probability more generally as
  • P(r(n) | n, h(n))
  • where n = node, r = rule, and h = headword
  • We calculate this by comparing how many times the
    rule occurs with h(n) as the headword versus how
    many times the mother/headword combination appears
    in total (sketched below)
  • P(VP → VBD NP PP | VP, dumped) =
    C(VP(dumped) → VBD NP PP) / Σβ C(VP(dumped) → β)
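
A minimal sketch of this estimate from counts; the tuple layout and
function names are illustrative:

from collections import Counter

rule_count = Counter()  # counts of (mother, headword, rhs)
head_count = Counter()  # counts of (mother, headword)

def observe(mother, headword, rhs):
    rule_count[(mother, headword, rhs)] += 1
    head_count[(mother, headword)] += 1

def rule_prob(mother, headword, rhs):
    # P(r | n, h(n)) = C(n(h) -> rhs) / sum over beta of C(n(h) -> beta)
    return rule_count[(mother, headword, rhs)] / head_count[(mother, headword)]

observe("VP", "dumped", ("VBD", "NP", "PP"))
observe("VP", "dumped", ("VBD", "NP"))
print(rule_prob("VP", "dumped", ("VBD", "NP", "PP")))  # 0.5 on this toy data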

26
Adding info about word-word dependencies
  • We want to take into account one other factor:
    the probability of being a head word (in a given
    context)
  • P(h(n) = word)
  • We condition this probability on two things: (1)
    the category of the node (n), and (2) the
    headword of the mother (h(m(n)))
  • P(h(n) = word | n, h(m(n))), shortened as
    P(h(n) | n, h(m(n))) (sketched below)
  • P(sacks | NP, dumped)
  • What we're really doing is factoring in how words
    relate to each other
  • We will call this a dependency relation later:
    sacks is dependent on dumped, in this case
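
A minimal sketch of the corresponding relative-frequency estimate,
again with illustrative names:

from collections import Counter

dep_count = Counter()  # counts of (category, mother_head, head)
ctx_count = Counter()  # counts of (category, mother_head)

def observe_head(category, mother_head, head):
    dep_count[(category, mother_head, head)] += 1
    ctx_count[(category, mother_head)] += 1

def head_prob(head, category, mother_head):
    # P(h(n) | n, h(m(n)))
    return dep_count[(category, mother_head, head)] / ctx_count[(category, mother_head)]

observe_head("NP", "dumped", "sacks")
print(head_prob("sacks", "NP", "dumped"))  # 1.0 on this single toy count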

27
Putting it all together
  • See p. 459 for an example lexicalized parse tree
    for
  • workers dumped sacks into a bin
  • For rules r, category n, head h, mother m:
  • P(T) = ∏n∈T p(r(n) | n, h(n)) ·
    p(h(n) | n, h(m(n)))
  • The first factor, e.g. P(VP → VBD NP PP | VP,
    dumped), carries subcategorization info
  • The second factor, e.g. P(sacks | NP, dumped),
    carries dependency info between words
    (both combined in the sketch below)
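
A minimal sketch of scoring a lexicalized tree this way; rule_prob
and head_prob can be any estimators with the signatures of the two
sketches above, and the flat node list is an illustrative stand-in
for a real tree walk:

def tree_prob(nodes, rule_prob, head_prob):
    """nodes: one (category, headword, rhs, mother_headword) tuple per
    internal node; the root's mother headword can be a dummy like "TOP"."""
    p = 1.0
    for cat, head, rhs, mother_head in nodes:
        p *= rule_prob(cat, head, rhs)          # p(r(n) | n, h(n))
        p *= head_prob(head, cat, mother_head)  # p(h(n) | n, h(m(n)))
    return p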

28
Dependency Grammar
  • Capturing relations between words (e.g. dumped
    and sacks) is moving in the direction of
    dependency grammar (DG)
  • In DG, there is no such thing as constituency
  • The structure of a sentence is purely the binary
    relations between words
  • John loves Mary is represented as
  • LOVE → JOHN
  • LOVE → MARY
  • where A → B means that B depends on A
    (see the sketch below)
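
In code, such an analysis is nothing more than a set of
head-dependent arcs (a minimal sketch):

# The whole analysis of "John loves Mary" is just head -> dependent
# arcs; there are no constituent nodes at all.
arcs = {("LOVE", "JOHN"), ("LOVE", "MARY")}

# Every word except the root depends on exactly one head, so an
# n-word sentence has n - 1 arcs.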

29
Evaluating Parser Output
  • Dependency relations are also useful for
    comparing parser output to a treebank
  • Traditional measures of parser accuracy
  • Labeled bracketing precision
  • correct constituents in parse / constituents in
    parse
  • Labeled bracketing recall
  • correct constituents in parse / constituents in
    the treebank parse (both sketched below)
  • There are known problems with these measures, so
    people are trying to use dependency-based
    measures instead
  • How many dependency relations did the parse get
    correct?
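
A minimal sketch of the bracketing measures, assuming each parse is
given as a set of (label, start, end) constituent spans:

def bracket_scores(parse, gold):
    correct = len(parse & gold)  # constituents matching label and span
    return correct / len(parse), correct / len(gold)  # precision, recall

parse = {("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3), ("NP", 2, 3)}
gold = {("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3), ("PP", 2, 3)}
print(bracket_scores(parse, gold))  # (0.75, 0.75)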