Title: Natural Language Parsing: Graphs, the A* Algorithm, and Modularity
- Christopher Manning
- with Dan Klein and Roger Levy
- Depts. of Computer Science and Linguistics
- Stanford University
- http://nlp.stanford.edu/manning/
1. Hasn't this been solved?
- Time complexity of (general) CFG parsing is dominated by the number of traversals done.
- Traversals represent the combination of two adjacent parse items into a larger one, e.g. NP[0,2] + VP[2,3] → S[0,3].
- With grammar constant g and sentence length n, this gives O(g³n³) traversals.
Is the problem just cycles?
- Bill Gates, Remarks to Gartner Symposium, October 6, 1997:
- "Applications always become more demanding. Until the computer can speak to you in perfect English and understand everything you say to it and learn in the same way that an assistant would learn; until it has the power to do that, we need all the cycles. We need to be optimized to do the best we can. Right now linguistics are right on the edge of what the processor can do. As we get another factor of two, then speech will start to be on the edge of what it can do."
Why is Natural Language Understanding difficult?
- The hidden structure of language is highly ambiguous.
- Example: the tree for "Fed raises interest rates 0.5% in effort to control inflation" (NYT headline, 5/17/00)
Where are the ambiguities?
The bad effects of V/N ambiguities
The ambiguity of language: newspaper headlines
- Ban on Nude Dancing on Governor's Desk (from a Georgia newspaper discussing current legislation)
- Juvenile Court to Try Shooting Defendant
- Teacher Strikes Idle Kids
- Stolen Painting Found by Tree
- Local High School Dropouts Cut in Half
- Red Tape Holds Up New Bridges
- and a couple of new ones:
- China to orbit human on Oct. 15
- Moon wants to go to space
Goal: Information → Knowledge
- Lots of unstructured text/web information
- that we'd like to turn into usable knowledge
- employs(stanfordUniversity, chrisManning)
- ∃t ∃e ∃x1 ∃x2 (employing(e) ∧ employer(e, x1) ∧ employed(e, x2) ∧ name(x2, "Christopher Manning") ∧ name(x1, "Stanford University") ∧ at(e, t) ∧ t ∈ [1999, 2003])
Question answering (QA) from text
- TREC-8 Question Answering competition
- With massive collections of on-line documents, manual translation of knowledge is impractical.
- We want answers from textbases, e.g. bioinformatics (Pasca and Harabagiu 2001)
- Good IR is needed: SMART paragraph retrieval
- Large taxonomy of question types and expected answer types is crucial
- Statistical parser used to parse questions and relevant text for answers, and to build KB
Question Answering Example
- How hot does the inside of an active volcano get?
- get(TEMPERATURE, inside(volcano(active)))
- "Lava fragments belched out of the mountain were as hot as 300 degrees Fahrenheit."
- fragments(lava, TEMPERATURE(degrees(300)), belched(out, mountain))
- volcano ISA mountain
- lava ISPARTOF volcano → lava inside volcano
- fragments of lava HAVEPROPERTIESOF lava
- The needed semantic information is in WordNet definitions, and was successfully translated into a form that can be used for rough proofs
Parsing Goals
- The goal: develop grammars and parsers that are
- Accurate: produce good parses
- Model-optimal: find their model's actual best parses
- Fast: seconds to parse long sentences
- Technology exists to get any two, but not all three:
- Exhaustive parsing: not fast (chart parsing, Earley 70)
- Approximate parsing: not optimal (beam parsing, Collins 97, Charniak 01; best-first parsing, Charniak et al. 98)
- Always build right-branching structure: not accurate
- The problem involves both learning and inference
Talk Outline
- Big picture overview
- Parsing and graphs: hypergraph parsing
- A* parsing: efficient unlexicalized parsing
- A factored, lexicalized parsing model
- Accurate unlexicalized parsing
2. Parsing as Search
[Figure: edges X[h]:(i,j) in the parse search space for "Factory payrolls fell in September", from start (NN[Factory]:(0,1), NN[payrolls]:(1,2), VBD[fell]:(2,3), IN[in]:(3,4), NN[September]:(4,5)) through NP[payrolls]:(0,2), PP[in]:(3,5), and VP[fell]:(2,5) to the goal S[fell]:(0,5)]
CKY Parsing
- In CKY parsing, we visit edges by span size.
- Guarantees correctness by working inside-out:
- Build all small bits before any larger bits that could possibly require them (see the sketch below).
- Exhaustive: the goal is among the nodes with the largest span size!
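To make this loop order concrete, here is a minimal CKY sketch in Python. It is illustrative only, not the talk's parser: the lexicon (word → {tag: log-prob}) and binary grammar ((B, C) → [(A, rule log-prob)]) encodings are assumptions.

    from collections import defaultdict

    def cky(words, lexicon, grammar):
        # lexicon: word -> {tag: log-prob}; grammar: (B, C) -> [(A, log-prob)]
        n = len(words)
        chart = defaultdict(dict)              # (i, j) -> {symbol: best log-prob}
        for i, w in enumerate(words):          # span size 1: tag the words
            for tag, logp in lexicon.get(w, {}).items():
                chart[(i, i + 1)][tag] = logp
        for size in range(2, n + 1):           # visit spans strictly inside-out
            for i in range(n - size + 1):
                j = i + size
                for k in range(i + 1, j):      # every split point
                    for B, pb in chart[(i, k)].items():
                        for C, pc in chart[(k, j)].items():
                            for A, pr in grammar.get((B, C), []):
                                s = pr + pb + pc
                                if s > chart[(i, j)].get(A, float("-inf")):
                                    chart[(i, j)][A] = s
        return chart[(0, n)].get("S")          # goal: an S over the whole input

The three position loops (size, i, k) give the n³ factor and the three symbol loops give g³, matching the O(g³n³) bound from earlier.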
What can go wrong?
- We can build too many edges.
- Most edges that can be built shouldn't be.
- CKY builds them all!
- We can build in a bad order.
- Might find bad parses before good parses.
- Will trigger best-first propagation.
Speed: build promising edges first.
Correctness: keep edges on the agenda until you're sure you've seen their best parse.
Uniform-Cost Parsing
- We want to work on good parses inside-out.
- CKY does this synchronously, by span size.
- Uniform-cost orders edges by their best known score.
- Why it's correct:
- Adding structure incurs probability cost.
- Trees have lower probability than their sub-parts.
- What makes things tricky:
- We don't have a full graph to explore.
- The graph is built dynamically: correctness depends on the right bits of the graph being built before an edge is finished (see the agenda sketch below).
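A minimal sketch of that agenda discipline, assuming log-probability scores and a hypothetical combine_with(edge, score, finished, best) that yields the larger edges a newly finished edge can participate in:

    import heapq
    from itertools import count

    def uniform_cost_parse(initial, goal, combine_with):
        # initial: {edge: log-prob}. heapq is a min-heap, so negate scores;
        # the counter breaks ties without comparing edge objects.
        tie = count()
        agenda = [(-s, next(tie), e) for e, s in initial.items()]
        heapq.heapify(agenda)
        best = dict(initial)                  # best score seen for each edge
        finished = set()
        while agenda:
            neg, _, edge = heapq.heappop(agenda)
            if edge in finished:
                continue                      # stale entry; a better copy was popped
            finished.add(edge)                # popped => its best parse has been seen
            if edge == goal:
                return -neg
            # The graph grows dynamically: only finished edges may combine.
            for new_edge, score in combine_with(edge, -neg, finished, best):
                if score > best.get(new_edge, float("-inf")):
                    best[new_edge] = score
                    heapq.heappush(agenda, (-score, next(tie), new_edge))
        return None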
3. A* Search
- Problem with uniform-cost:
- Even unlikely small edges have a high score.
- We end up processing every small edge!
- Solution: A* search:
- Small edges have to fit into a full parse.
- The smaller the edge, the more the full parse will cost.
- Consider both the cost to build (α) and the cost to complete (β).
- We figure out α during parsing.
- We GUESS at β in advance (pre-processing).
Uniform-cost priority: Score = α. A* priority: Score = α + β.
A* Parsing
- The true cost-to-completion is β.
- We look for easy-to-compute bounds a, with a ≥ β (see the priority sketch below).
- The trivial bound a(E,w) = 1 gives uniform-cost search.
- The exact bound a(E,w) = β(E,w) gives a perfect search, but is impractical.
- Why is A* parsing not the standard thing to do?
- Useful admissible estimates are hard to engineer.
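Relative to the uniform-cost sketch above, A* changes only the agenda priority. A hedged sketch, where estimate stands for whichever bound a is chosen (all names are assumptions):

    def astar_priority(alpha_logp, edge, estimate):
        # Order the agenda by alpha + a(edge), where a(edge) is an optimistic
        # (admissible) completion log-prob: a >= beta. With log a = 0 (the
        # trivial bound a = 1), this degenerates to uniform-cost search.
        return -(alpha_logp + estimate(edge))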
Finding Estimates
- Challenge: find estimates which are
- Admissible/monotonic
- Informative
- Easy to precompute
- Example: Span-State (SX)
- In a sentence w1…w10, what's the best completion for NP[1,4]?
- There is a best completion for an NP using 1 left and 6 right words.
- That completion is probably not valid for this sentence.
- BUT its probability is not less than the actual best completion!
SX completion score: -11.3
True completion score: -18.1
Pre-Calculating SX
- Best way to parse an X using K words (table size |X| · maxLen):
- built bottom-up via X → Y Z, with Y over s words and Z over the remaining K - s (see the sketch below).
- Best way to parse αXβ with L left and R right outside words (table size |X| · maxLen²):
- built by growing X outward through Y and Z, splitting the outside words, e.g. R into s and R - s.
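A sketch of filling the first (inside) table, reusing the grammar encoding from the CKY sketch; tag_best (best log-prob of each preterminal over any single word) is an assumed input, and unary rules are omitted:

    def sx_inside(grammar, tag_best, symbols, max_len):
        # b[X][K]: best log-prob of any X over any K words in any sentence,
        # so the table has size |symbols| * max_len, as on the slide.
        NEG = float("-inf")
        b = {X: [NEG] * (max_len + 1) for X in symbols}
        for X, logp in tag_best.items():
            b[X][1] = logp
        for K in range(2, max_len + 1):    # X -> Y Z: s words under Y, K - s under Z
            for (Y, Z), parents in grammar.items():
                for s in range(1, K):
                    if b[Y][s] == NEG or b[Z][K - s] == NEG:
                        continue
                    for A, pr in parents:
                        cand = pr + b[Y][s] + b[Z][K - s]
                        if cand > b[A][K]:
                            b[A][K] = cand
        return b

The second table extends the same recurrence outward to (L, R) completion contexts, which is where the maxLen² factor comes from.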
Enriching Context
- The more detailed the context, the sharper our estimate.
[Figure: the SX completion example with the outside size fixed; score -11.3]
Context Summary Savings

Estimate   Time      Memory   Items Blocked (%)
NULL       0         0        11.2
S          1 min     2.5K     40.5
SX         1 min     5M       80.3
SXL        30 min    250M     83.5
SXR        30 min    250M     93.8
BEST       540 min   1G       94.6

Over WSJ sentences, length 18-26, to facilitate comparison to previous work.
Context Summary Sharpness
Adding local information changes the intercept,
but not the slope!
What to do?
- Option 1: Find global estimates
- Idea: instead of pre-building a table, build estimates tailored to each sentence.
- It had better be fast, per sentence!
- Option 2: Live dangerously
- We could just boost our estimates.
- Lose admissibility, monotonicity, correctness, the O(n³) bound.
- Need to do substantial extra work per item.
- Best-first parsing (Charniak et al. 98) resembles an inadmissible A* search.
Grammar Projection Estimates
- Alternative to context summary:
- Pre-parse the full context exhaustively, but using a bounding grammar.
- If the bounding grammar is simple enough, exhaustive parsing can be fast enough to be useful.
- In general: an equivalence map π over grammar symbols.
- Example: X-bar.
Example: the rule NP′ → CC NP CC NP in G projects to X′ → CC X CC X in π(G).
Example: Forward Filter
- Let π collapse all phrasal symbols to X.
- When can X′ → CC X CC X be completed?
- Whenever the right context includes two CCs!
- Gives an admissible lower bound on this projection that is very efficient to calculate (a projection sketch follows below).
[Figure: NP′ → CC NP CC NP projects to X′ → CC X CC X; an X′ edge can complete only if the remaining input contains two coordinators, e.g. "and", "or"]
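A sketch of building the bounding grammar through an equivalence map π; keeping the max log-prob over all rules that collapse together makes the projected scores optimistic, hence admissible (grammar encoding as in the CKY sketch):

    def project_grammar(grammar, pi):
        # pi: symbol -> equivalence class, e.g. every phrasal symbol -> "X"
        # for the forward filter above.
        projected = {}
        for (B, C), parents in grammar.items():
            key = (pi(B), pi(C))
            bucket = projected.setdefault(key, {})
            for A, logp in parents:
                pa = pi(A)
                bucket[pa] = max(bucket.get(pa, float("-inf")), logp)
        # Exhaustive parsing with this smaller grammar bounds the true scores.
        return {k: list(v.items()) for k, v in projected.items()}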
Grammar Projection Savings
Context estimates augmented by the FILTER estimate:

Estimate   Time      Memory   Items Blocked (%)
NULL       0         0        58.3
S          1 min     2.5K     77.8
SX         1 min     5M       95.3
SXL        30 min    250M     96.1
SXR        30 min    250M     96.9
BEST       540 min   1G       97.3

The price of optimality? Item thresholds for 96% parse success: Caraballo and Charniak 98: 10K; BEST+FILTER: 6K; Charniak et al. 98: 2K.
4. Lexicalized PCFG Models
- Word-to-word affinities are useful for certain ambiguities.
Modeling a Lexicalized Tree
- Task: assign a score to each local tree.
- Joint generative models (Collins 97; Charniak 97, 00):
- P(NP[payrolls] VP[fell] | S[fell]) is modeled directly, through complex back-off models.
- Is this necessary?
- In linguistics, syntax is standardly described using categories, with little mention of words.
- This is possible because acceptable syntactic configurations are independent of words.
- Conversely, lexical preferences can effectively decide attachments of arguments and modifiers without paying much attention to syntax (Church and Gale 93).
Lexicalized A* Parsing
- Grammar projections shine for lexicalized parsing.
- Use two coupled projections:
- One strips the words and leaves the PCFG backbone.
- One strips the PCFG symbols and leaves the word dependency tree.
- Each projection can be parsed exhaustively much faster than exhaustive lexicalized parsing.
- Use the projections to guide a search over the full lexicalized model.
- Works best for a special factored form of lexical models.
Factored A* Estimates
- If the score w factors over projections π_i, then for a path P: w(P) = Σ_i w_i(π_i(P)).
- Factored scores have the following natural A* estimate: β̂(e) = Σ_i β_i(π_i(e)), the sum of the exact completion costs computed in each projection.
Projecting Syntax and Semantics
- Lexicalized tree T = (C, D), with P(T) = P(C) · P(D) (see the estimate sketch below)
- Syntax C: P(C) is a standard PCFG; captures structural patterns.
- Semantics D: P(D) is a dependency grammar; captures word-word patterns.
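Under this factorization, the combined A* estimate is just the sum of the exact completion (outside) scores computed separately in the two projections. A sketch; the projection functions and outside tables are assumed precomputed:

    def factored_estimate(edge, pi_c, pi_d, outside_c, outside_d):
        # beta_hat(e) = beta_C(pi_C(e)) + beta_D(pi_D(e)). Each term is exact
        # in its projection and therefore optimistic for the full model, so
        # the sum remains admissible.
        return outside_c[pi_c(edge)] + outside_d[pi_d(edge)]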
Parsing Results: Time
- Total time dominated by calculation of the A* tables in each projection: O(n³).
Parsing Results: Nodes
- Exact lexicalized parsing is in general infeasible.
- Suppressed work is 99.997% at length 10, and further approaches 100% as length goes up.
Michael Collins (2003, COLT)
Sparseness: 1 million words is like nothing
- Much work uses bilexical statistics: likelihoods of relationships between pairs of words.
- Very sparse, even on topics central to the WSJ:
- stocks plummeted: 2 occurrences
- stocks stabilized: 1 occurrence
- stocks skyrocketed: 0 occurrences
- stocks discussed: 0 occurrences
- So far, very little success in augmenting the treebank with extra unannotated materials or using semantic classes or clusters.
- Gildea 01: you only lose 0.5% by eliminating bilexical statistics on WSJ, and nothing cross-domain.
5. Accurate Unlexicalized Parsing with PCFGs
- The symbols in a PCFG define independence assumptions:
- At any node, the material inside that node is independent of the material outside that node, given the label of that node.
- Any information that statistically connects behavior inside and outside a node must flow through that node.
[Figure: a tree with rules S → NP VP and NP → DT NN; the NP label separates its inside material from the surrounding S and VP context]
Breaking Up the Symbols
- We can relax independence assumptions by encoding dependencies into the PCFG symbols.
- A symbol like NP^NP-POS is equivalent to NP with features [parent = NP, possessive = +]: we're doing GPSG.
- Parent annotation (Johnson 98; a transform sketch follows below)
- Marking possessive NPs
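A sketch of parent annotation as a tree transform, with trees as nested lists (label first, then children); the encoding is an illustrative assumption:

    def parent_annotate(tree, parent="ROOT"):
        # ["S", ["NP", "payrolls"], ["VP", "fell"]] becomes
        # ["S^ROOT", ["NP^S", "payrolls"], ["VP^S", "fell"]]; words unchanged.
        if isinstance(tree, str):
            return tree                    # a word: leave it alone
        label, children = tree[0], tree[1:]
        return [label + "^" + parent] + [parent_annotate(c, label) for c in children]

Rules read off the transformed trees, e.g. NP^S → DT NN vs. NP^VP → DT NN, now carry the parent dependency that a plain NP → DT NN rule discards.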
Experimental Process
- We'll take a highly conservative approach:
- Annotate as sparingly as possible
- Highest accuracy with fewest symbols
- Error-driven, largely manual hill-climb, adding one annotation type at a time
Unlexicalized PCFGs
- What do we mean by an unlexicalized PCFG?
- Grammar rules are not systematically specified down to the level of lexical items:
- NP-stocks is not allowed
- NP^S-CC is fine
- Closed vs. open class words (e.g. PP^VP-for):
- Long tradition in linguistics of using function words as features or markers for selection
- Contrary to the bilexical idea of semantic heads
- Open-class selection really a proxy for semantics
- Honesty checks:
- Number of symbols: keep the grammar very small
- No smoothing: over-annotating is a real danger
Tag Splits
- Problem: Treebank tags are too coarse.
- Example: sentential, PP, and other prepositions are all marked IN.
- Partial solution: subdivide the IN tag.

Annotation  F1    Size
Previous    78.3  8.0K
SPLIT-IN    80.3  8.1K
Yield Splits
- Problem: sometimes the behavior of a category depends on something inside its future yield.
- Examples:
- Possessive NPs
- Finite vs. non-finite VPs
- Lexical heads!
- Solution: annotate future elements into nodes.

Annotation  F1    Size
Previous    82.3  9.7K
POSS-NP     83.1  9.8K
SPLIT-VP    85.7  10.5K
A Fully Annotated Tree
Unlexicalized Sec. 23 Results

Parser        LP    LR    F1    CB    0 CB
Magerman 95   84.9  84.6  84.7  1.26  56.6
Collins 96    86.3  85.8  86.0  1.14  59.9
Current Work  86.9  85.7  86.3  1.10  60.3
Charniak 97   87.4  87.5  87.4  1.00  62.1
Collins 99    88.7  88.6  88.6  0.90  67.1

- Beats first-generation lexicalized parsers.
- Much of the power of lexicalization comes from closed-class monolexicalization.
Conclusions
- Parsing as shortest-path finding is an appealing conceptual approach.
- A* parsing can give very considerable speedups while maintaining exact inference.
- A modularized lexicalized parser:
- Is fast through the use of sharp A* estimates
- Continues to provide exact inference in this more complex case
- Accurate unlexicalized parsing:
- Shows component models can be improved
- One can parse accurately without lexicalization
For more
- Papers:
- Dan Klein and Christopher D. Manning. 2002. Fast Exact Inference with a Factored Model for Natural Language Parsing. Advances in Neural Information Processing Systems 15 (NIPS 2002), December 2002.
- Dan Klein and Christopher D. Manning. 2003. Accurate Unlexicalized Parsing. ACL 2003, pp. 423-430.
- Available at http://nlp.stanford.edu/manning/papers/
- Parser:
- http://nlp.stanford.edu/
- (Chinese and German included in the box)
The End
Thank you!