Title: Natural Language Parsing: Graphs, the A* Algorithm, and Modularity
- Christopher Manning
- with Dan Klein and Roger Levy
- Depts. of Computer Science and Linguistics
- Stanford University
- http://nlp.stanford.edu/manning/
1. Hasn't this been solved?
- Time complexity of (general) CFG parsing is dominated by the number of traversals done.
- Traversals represent the combination of two adjacent parse items into a larger one, e.g. NP[0,2] + VP[2,3] → S[0,3].
- With grammar constant g and sentence length n, this gives O(g³n³) traversals.
Is the problem just cycles?
- Bill Gates, Remarks to Gartner Symposium, October 6, 1997:
- "Applications always become more demanding. Until the computer can speak to you in perfect English and understand everything you say to it and learn in the same way that an assistant would learn; until it has the power to do that, we need all the cycles. We need to be optimized to do the best we can. Right now linguistics are right on the edge of what the processor can do. As we get another factor of two, then speech will start to be on the edge of what it can do."
Why is Natural Language Understanding difficult?
- The hidden structure of language is highly ambiguous.
- Example: the tree for "Fed raises interest rates 0.5% in effort to control inflation" (NYT headline, 5/17/00)
Where are the ambiguities?
The bad effects of V/N ambiguities
The ambiguity of language: newspaper headlines
- Ban on Nude Dancing on Governor's Desk (from a Georgia newspaper discussing current legislation)
- Juvenile Court to Try Shooting Defendant
- Teacher Strikes Idle Kids
- Stolen Painting Found by Tree
- Local High School Dropouts Cut in Half
- Red Tape Holds Up New Bridges
- and a couple of new ones:
- China to orbit human on Oct. 15
- Moon wants to go to space
Goal: Information → Knowledge
- Lots of unstructured text/web information
- that we'd like to turn into usable knowledge
- employs(stanfordUniversity, chrisManning)
- ∃t ∃e ∃x1 ∃x2 (employing(e) ∧ employer(e, x1) ∧ employed(e, x2) ∧ name(x2, "Christopher Manning") ∧ name(x1, "Stanford University") ∧ at(e, t) ∧ t ∈ [1999, 2003])
Question answering (QA) from text
- TREC-8 Question Answering competition
- With massive collections of on-line documents, manual translation of knowledge is impractical.
- We want answers from textbases, e.g. bioinformatics (Pasca and Harabagiu 2001)
- Good IR is needed: SMART paragraph retrieval
- Large taxonomy of question types and expected answer types is crucial
- Statistical parser used to parse questions and relevant text for answers, and to build KB
Question Answering Example
- How hot does the inside of an active volcano get?
- get(TEMPERATURE, inside(volcano(active)))
- "Lava fragments belched out of the mountain were as hot as 300 degrees Fahrenheit."
- fragments(lava, TEMPERATURE(degrees(300)), belched(out, mountain))
- volcano ISA mountain
- lava ISPARTOF volcano → lava inside volcano
- fragments of lava HAVEPROPERTIESOF lava
- The needed semantic information is in WordNet definitions, and was successfully translated into a form that can be used for rough proofs
Parsing Goals
- The goal: develop grammars and parsers that are
- Accurate: produce good parses
- Model-optimal: find their model's actual best parses
- Fast: seconds to parse long sentences
- Technology exists to get any two, but not all three:
- Exhaustive parsing: not fast (chart parsing, Earley 70)
- Approximate parsing: not optimal (beam parsing, Collins 97, Charniak 01; best-first parsing, Charniak et al. 98)
- Always build right-branching structure: not accurate
- The problem involves both learning and inference
Talk Outline
- Big picture overview
- Parsing and graphs: hypergraph parsing
- A* parsing: efficient unlexicalized parsing
- A factored, lexicalized parsing model
- Accurate unlexicalized parsing
2. Parsing as Search
[Figure: edges X[h]:(i,j) in the parse search space for "Factory payrolls fell in September", from start (NN[Factory]:(0,1), NN[payrolls]:(1,2), VBD[fell]:(2,3), IN[in]:(3,4), NN[September]:(4,5)) through NP[payrolls]:(0,2), PP[in]:(3,5), and VP[fell]:(2,5) to the goal S[fell]:(0,5)]
CKY Parsing
- In CKY parsing, we visit edges by span size.
- Guarantees correctness by working inside-out:
- Build all small bits before any larger bits that could possibly require them (see the sketch below).
- Exhaustive: the goal is among the nodes with the largest span size!
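To make this loop order concrete, here is a minimal CKY sketch in Python. It is illustrative only, not the talk's parser: the lexicon (word → {tag: log-prob}) and binary grammar ((B, C) → [(A, rule log-prob)]) encodings are assumptions.

    from collections import defaultdict

    def cky(words, lexicon, grammar):
        # lexicon: word -> {tag: log-prob}; grammar: (B, C) -> [(A, log-prob)]
        n = len(words)
        chart = defaultdict(dict)              # (i, j) -> {symbol: best log-prob}
        for i, w in enumerate(words):          # span size 1: tag the words
            for tag, logp in lexicon.get(w, {}).items():
                chart[(i, i + 1)][tag] = logp
        for size in range(2, n + 1):           # visit spans strictly inside-out
            for i in range(n - size + 1):
                j = i + size
                for k in range(i + 1, j):      # every split point
                    for B, pb in chart[(i, k)].items():
                        for C, pc in chart[(k, j)].items():
                            for A, pr in grammar.get((B, C), []):
                                s = pr + pb + pc
                                if s > chart[(i, j)].get(A, float("-inf")):
                                    chart[(i, j)][A] = s
        return chart[(0, n)].get("S")          # goal: an S over the whole input

The three position loops (size, i, k) give the n³ factor and the three symbol loops give g³, matching the O(g³n³) bound from earlier.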
What can go wrong?
- We can build too many edges.
- Most edges that can be built shouldn't be.
- CKY builds them all!
- We can build in a bad order.
- Might find bad parses before good parses.
- Will trigger best-first propagation.
Speed: build promising edges first.
Correctness: keep edges on the agenda until you're sure you've seen their best parse.
Uniform-Cost Parsing
- We want to work on good parses inside-out.
- CKY does this synchronously, by span size.
- Uniform-cost orders edges by their best known score.
- Why it's correct:
- Adding structure incurs probability cost.
- Trees have lower probability than their sub-parts.
- What makes things tricky:
- We don't have a full graph to explore.
- The graph is built dynamically: correctness depends on the right bits of the graph being built before an edge is finished (see the agenda sketch below).
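A minimal sketch of that agenda discipline, assuming log-probability scores and a hypothetical combine_with(edge, score, finished, best) that yields the larger edges a newly finished edge can participate in:

    import heapq
    from itertools import count

    def uniform_cost_parse(initial, goal, combine_with):
        # initial: {edge: log-prob}. heapq is a min-heap, so negate scores;
        # the counter breaks ties without comparing edge objects.
        tie = count()
        agenda = [(-s, next(tie), e) for e, s in initial.items()]
        heapq.heapify(agenda)
        best = dict(initial)                  # best score seen for each edge
        finished = set()
        while agenda:
            neg, _, edge = heapq.heappop(agenda)
            if edge in finished:
                continue                      # stale entry; a better copy was popped
            finished.add(edge)                # popped => its best parse has been seen
            if edge == goal:
                return -neg
            # The graph grows dynamically: only finished edges may combine.
            for new_edge, score in combine_with(edge, -neg, finished, best):
                if score > best.get(new_edge, float("-inf")):
                    best[new_edge] = score
                    heapq.heappush(agenda, (-score, next(tie), new_edge))
        return None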
3. A* Search
- Problem with uniform-cost:
- Even unlikely small edges have a high score.
- We end up processing every small edge!
- Solution: A* search:
- Small edges have to fit into a full parse.
- The smaller the edge, the more the full parse will cost.
- Consider both the cost to build (α) and the cost to complete (β).
- We figure out α during parsing.
- We GUESS at β in advance (pre-processing).
Uniform-cost priority: Score = α. A* priority: Score = α + β.
A* Parsing
- The true cost-to-completion is β.
- We look for easy-to-compute bounds a, with a ≥ β (see the priority sketch below).
- The trivial bound a(E,w) = 1 gives uniform-cost search.
- The exact bound a(E,w) = β(E,w) gives a perfect search, but is impractical.
- Why is A* parsing not the standard thing to do?
- Useful admissible estimates are hard to engineer.
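Relative to the uniform-cost sketch above, A* changes only the agenda priority. A hedged sketch, where estimate stands for whichever bound a is chosen (all names are assumptions):

    def astar_priority(alpha_logp, edge, estimate):
        # Order the agenda by alpha + a(edge), where a(edge) is an optimistic
        # (admissible) completion log-prob: a >= beta. With log a = 0 (the
        # trivial bound a = 1), this degenerates to uniform-cost search.
        return -(alpha_logp + estimate(edge))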
Finding Estimates
- Challenge: find estimates which are
- Admissible/monotonic
- Informative
- Easy to precompute
- Example: Span-State (SX)
- In a sentence w1…w10, what's the best completion for NP[1,4]?
- There is a best completion for an NP using 1 left and 6 right words.
- That completion is probably not valid for this sentence.
- BUT its probability is not less than the actual best completion!
SX completion score: -11.3
True completion score: -18.1
Pre-Calculating SX
- Best way to parse an X using K words (table size |X| · maxLen):
- built bottom-up via X → Y Z, with Y over s words and Z over the remaining K - s (see the sketch below).
- Best way to parse αXβ with L left and R right outside words (table size |X| · maxLen²):
- built by growing X outward through Y and Z, splitting the outside words, e.g. R into s and R - s.
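A sketch of filling the first (inside) table, reusing the grammar encoding from the CKY sketch; tag_best (best log-prob of each preterminal over any single word) is an assumed input, and unary rules are omitted:

    def sx_inside(grammar, tag_best, symbols, max_len):
        # b[X][K]: best log-prob of any X over any K words in any sentence,
        # so the table has size |symbols| * max_len, as on the slide.
        NEG = float("-inf")
        b = {X: [NEG] * (max_len + 1) for X in symbols}
        for X, logp in tag_best.items():
            b[X][1] = logp
        for K in range(2, max_len + 1):    # X -> Y Z: s words under Y, K - s under Z
            for (Y, Z), parents in grammar.items():
                for s in range(1, K):
                    if b[Y][s] == NEG or b[Z][K - s] == NEG:
                        continue
                    for A, pr in parents:
                        cand = pr + b[Y][s] + b[Z][K - s]
                        if cand > b[A][K]:
                            b[A][K] = cand
        return b

The second table extends the same recurrence outward to (L, R) completion contexts, which is where the maxLen² factor comes from.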
Enriching Context
- The more detailed the context, the sharper our estimate.
[Figure: the SX completion example with the outside size fixed; score -11.3]
Context Summary Savings

Estimate   Time      Memory   Items Blocked (%)
NULL       0         0        11.2
S          1 min     2.5K     40.5
SX         1 min     5M       80.3
SXL        30 min    250M     83.5
SXR        30 min    250M     93.8
BEST       540 min   1G       94.6

Over WSJ sentences, length 18-26, to facilitate comparison to previous work.
Context Summary Sharpness
Adding local information changes the intercept,
but not the slope!
What to do?
- Option 1: Find global estimates
- Idea: instead of pre-building a table, build estimates tailored to each sentence.
- It had better be fast, per sentence!
- Option 2: Live dangerously
- We could just boost our estimates.
- Lose admissibility, monotonicity, correctness, the O(n³) bound.
- Need to do substantial extra work per item.
- Best-first parsing (Charniak et al. 98) resembles an inadmissible A* search.
Grammar Projection Estimates
- Alternative to context summary:
- Pre-parse the full context exhaustively, but using a bounding grammar.
- If the bounding grammar is simple enough, exhaustive parsing can be fast enough to be useful.
- In general: an equivalence map π over grammar symbols.
- Example: X-bar.
Example: the rule NP′ → CC NP CC NP in G projects to X′ → CC X CC X in π(G).
Example: Forward Filter
- Let π collapse all phrasal symbols to X.
- When can X′ → CC X CC X be completed?
- Whenever the right context includes two CCs!
- Gives an admissible lower bound on this projection that is very efficient to calculate (a projection sketch follows below).
[Figure: NP′ → CC NP CC NP projects to X′ → CC X CC X; an X′ edge can complete only if the remaining input contains two coordinators, e.g. "and", "or"]
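A sketch of building the bounding grammar through an equivalence map π; keeping the max log-prob over all rules that collapse together makes the projected scores optimistic, hence admissible (grammar encoding as in the CKY sketch):

    def project_grammar(grammar, pi):
        # pi: symbol -> equivalence class, e.g. every phrasal symbol -> "X"
        # for the forward filter above.
        projected = {}
        for (B, C), parents in grammar.items():
            key = (pi(B), pi(C))
            bucket = projected.setdefault(key, {})
            for A, logp in parents:
                pa = pi(A)
                bucket[pa] = max(bucket.get(pa, float("-inf")), logp)
        # Exhaustive parsing with this smaller grammar bounds the true scores.
        return {k: list(v.items()) for k, v in projected.items()}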
Grammar Projection Savings
Context estimates augmented by the FILTER estimate:

Estimate   Time      Memory   Items Blocked (%)
NULL       0         0        58.3
S          1 min     2.5K     77.8
SX         1 min     5M       95.3
SXL        30 min    250M     96.1
SXR        30 min    250M     96.9
BEST       540 min   1G       97.3

The price of optimality? Item thresholds for 96% parse success: Caraballo and Charniak 98: 10K; BEST+FILTER: 6K; Charniak et al. 98: 2K.
4. Lexicalized PCFG Models
- Word-to-word affinities are useful for certain ambiguities.
Modeling a Lexicalized Tree
- Task: assign a score to each local tree.
- Joint generative models (Collins 97; Charniak 97, 00):
- P(NP[payrolls] VP[fell] | S[fell]) is modeled directly, through complex back-off models.
- Is this necessary?
- In linguistics, syntax is standardly described using categories, with little mention of words.
- This is possible because acceptable syntactic configurations are independent of words.
- Conversely, lexical preferences can effectively decide attachments of arguments and modifiers without paying much attention to syntax (Church and Gale 93).
Lexicalized A* Parsing
- Grammar projections shine for lexicalized parsing.
- Use two coupled projections:
- One strips the words and leaves the PCFG backbone.
- One strips the PCFG symbols and leaves the word dependency tree.
- Each projection can be parsed exhaustively much faster than exhaustive lexicalized parsing.
- Use the projections to guide a search over the full lexicalized model.
- Works best for a special factored form of lexical models.
Factored A* Estimates
- If the score w factors over projections π_i, then for a path P: w(P) = Σ_i w_i(π_i(P)).
- Factored scores have the following natural A* estimate: β̂(e) = Σ_i β_i(π_i(e)), the sum of the exact completion costs computed in each projection.
Projecting Syntax and Semantics
- Lexicalized tree T = (C, D), with P(T) = P(C) · P(D) (see the estimate sketch below)
- Syntax C: P(C) is a standard PCFG; captures structural patterns.
- Semantics D: P(D) is a dependency grammar; captures word-word patterns.
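Under this factorization, the combined A* estimate is just the sum of the exact completion (outside) scores computed separately in the two projections. A sketch; the projection functions and outside tables are assumed precomputed:

    def factored_estimate(edge, pi_c, pi_d, outside_c, outside_d):
        # beta_hat(e) = beta_C(pi_C(e)) + beta_D(pi_D(e)). Each term is exact
        # in its projection and therefore optimistic for the full model, so
        # the sum remains admissible.
        return outside_c[pi_c(edge)] + outside_d[pi_d(edge)]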
Parsing Results: Time
- Total time dominated by calculation of the A* tables in each projection: O(n³).
Parsing Results: Nodes
- Exact lexicalized parsing is in general infeasible.
- Suppressed work is 99.997% at length 10, and further approaches 100% as length goes up.
Michael Collins (2003, COLT)
Sparseness: 1 million words is like nothing
- Much work uses bilexical statistics: likelihoods of relationships between pairs of words.
- Very sparse, even on topics central to the WSJ:
- stocks plummeted: 2 occurrences
- stocks stabilized: 1 occurrence
- stocks skyrocketed: 0 occurrences
- stocks discussed: 0 occurrences
- So far, very little success in augmenting the treebank with extra unannotated materials or using semantic classes or clusters.
- Gildea 01: you only lose 0.5% by eliminating bilexical statistics on WSJ, and nothing cross-domain.
5. Accurate Unlexicalized Parsing with PCFGs
- The symbols in a PCFG define independence assumptions:
- At any node, the material inside that node is independent of the material outside that node, given the label of that node.
- Any information that statistically connects behavior inside and outside a node must flow through that node.
[Figure: a tree with rules S → NP VP and NP → DT NN; the NP label separates its inside material from the surrounding S and VP context]
Breaking Up the Symbols
- We can relax independence assumptions by encoding dependencies into the PCFG symbols.
- A symbol like NP^NP-POS is equivalent to NP with features [parent = NP, possessive = +]: we're doing GPSG.
- Parent annotation (Johnson 98; a transform sketch follows below)
- Marking possessive NPs
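A sketch of parent annotation as a tree transform, with trees as nested lists (label first, then children); the encoding is an illustrative assumption:

    def parent_annotate(tree, parent="ROOT"):
        # ["S", ["NP", "payrolls"], ["VP", "fell"]] becomes
        # ["S^ROOT", ["NP^S", "payrolls"], ["VP^S", "fell"]]; words unchanged.
        if isinstance(tree, str):
            return tree                    # a word: leave it alone
        label, children = tree[0], tree[1:]
        return [label + "^" + parent] + [parent_annotate(c, label) for c in children]

Rules read off the transformed trees, e.g. NP^S → DT NN vs. NP^VP → DT NN, now carry the parent dependency that a plain NP → DT NN rule discards.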
Experimental Process
- We'll take a highly conservative approach:
- Annotate as sparingly as possible
- Highest accuracy with fewest symbols
- Error-driven, largely manual hill-climb, adding one annotation type at a time
Unlexicalized PCFGs
- What do we mean by an unlexicalized PCFG?
- Grammar rules are not systematically specified down to the level of lexical items:
- NP-stocks is not allowed
- NP^S-CC is fine
- Closed vs. open class words (e.g. PP^VP-for):
- Long tradition in linguistics of using function words as features or markers for selection
- Contrary to the bilexical idea of semantic heads
- Open-class selection really a proxy for semantics
- Honesty checks:
- Number of symbols: keep the grammar very small
- No smoothing: over-annotating is a real danger
Tag Splits
- Problem: Treebank tags are too coarse.
- Example: sentential, PP, and other prepositions are all marked IN.
- Partial solution: subdivide the IN tag.

Annotation  F1    Size
Previous    78.3  8.0K
SPLIT-IN    80.3  8.1K
Yield Splits
- Problem: sometimes the behavior of a category depends on something inside its future yield.
- Examples:
- Possessive NPs
- Finite vs. non-finite VPs
- Lexical heads!
- Solution: annotate future elements into nodes.

Annotation  F1    Size
Previous    82.3  9.7K
POSS-NP     83.1  9.8K
SPLIT-VP    85.7  10.5K
A Fully Annotated Tree
Unlexicalized Sec. 23 Results

Parser        LP    LR    F1    CB    0 CB
Magerman 95   84.9  84.6  84.7  1.26  56.6
Collins 96    86.3  85.8  86.0  1.14  59.9
Current Work  86.9  85.7  86.3  1.10  60.3
Charniak 97   87.4  87.5  87.4  1.00  62.1
Collins 99    88.7  88.6  88.6  0.90  67.1

- Beats first-generation lexicalized parsers.
- Much of the power of lexicalization comes from closed-class monolexicalization.
Conclusions
- Parsing as shortest-path finding is an appealing conceptual approach.
- A* parsing can give very considerable speedups while maintaining exact inference.
- A modularized lexicalized parser:
- Is fast through the use of sharp A* estimates
- Continues to provide exact inference in this more complex case
- Accurate unlexicalized parsing:
- Shows component models can be improved
- One can parse accurately without lexicalization
For more
- Papers:
- Dan Klein and Christopher D. Manning. 2002. Fast Exact Inference with a Factored Model for Natural Language Parsing. Advances in Neural Information Processing Systems 15 (NIPS 2002), December 2002.
- Dan Klein and Christopher D. Manning. 2003. Accurate Unlexicalized Parsing. ACL 2003, pp. 423-430.
- Available at http://nlp.stanford.edu/manning/papers/
- Parser:
- http://nlp.stanford.edu/
- (Chinese and German included in the box)
The End
Thank you!