Title: Parsing
1 Parsing
- Dr. Björn Gambäck
- SICS Swedish Institute of Computer Science AB
- Stockholm, Sweden
2 Analysis of Natural Languages
- Syntax
- the actual structure of an utterance
- Parsing
- the best possible way to analyse an utterance
- Semantics
- the representation of the meaning of an utterance
3 Syntactic Recognition and Parsing
- Recognise a sentence
- Assign some structure to it
- Goal: Find all parse trees rooted in the start symbol S, covering exactly the input words.
- Is it possible to parse a string deterministically as it is being read (e.g. from left to right)?
- What is the largest context the parser must examine to decide how to build the parse tree?
4 Artificial Language Parsing vs. Natural Language Parsing
- A parser for a computer language
- yields a unique tree for each string
- must be deterministic
- is allowed (basically) unrestricted memory
- A natural language parser
- must allow for more than one parse
- should predict which parse will most likely be offered as the first choice by a native speaker
- the short-term memory in humans is restricted
5 Parsing Natural Languages
- Highly ambiguous
- All solutions must be found
- The analysis problem is more complex
- Solutions are based on saving partial parses
6 Empty Rules and Left-Recursion
- Natural language grammars can be both
- empty: the grammar has a rule
- A → ε
- left-recursive: the grammar has a rule
- A → A B
7 Cyclicity and Infinite Ambiguity
- Natural language grammars cannot be
- cyclic: the grammar has a rule
- A → A
- infinitely ambiguous: some input sentences give an infinite number of parses
- The number of rules in an NL grammar is large!
- The length of an NL sentence is normally 10-30 words.
8 Hypothesis-Driven Parsing (Top-Down)
- Make the start symbol the first current LHS.
- Look up the current LHS as either
- a terminal: a word in the lexicon (consume a word in the input sentence)
- or a non-terminal: a grammar rule (make all RHS nodes new LHSs)
- Continue until all words have been consumed.
9 Context-Free Grammar
- name → john.
- name → mary.
- det → a.
- det → the.
- pro → it.
- n → book.
- n → books.
- v → snores.
- v → sees.
- v → book.
- v → books.
- s → np, vp.
- s → vp.
- np → name.
- np → pro.
- np → det, n.
- vp → v.
- vp → v, np.
- (a Prolog encoding of this grammar is sketched below)
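A minimal sketch of how this grammar could be encoded for the Prolog parser on slide 11. The ---> operator and the predicate names top_symbol/1 and lexicon/2 are assumptions chosen to match that parser, not something fixed by the slides:

    :- op(700, xfx, --->).   % assumed infix operator for grammar rules

    top_symbol(s).

    % Rules: LHS ---> list of RHS categories.
    s  ---> [np, vp].      s  ---> [vp].
    np ---> [name].        np ---> [pro].        np ---> [det, n].
    vp ---> [v].           vp ---> [v, np].

    % Lexicon: lexicon(Word, Category).
    lexicon(john, name).   lexicon(mary, name).
    lexicon(a, det).       lexicon(the, det).
    lexicon(it, pro).
    lexicon(book, n).      lexicon(books, n).
    lexicon(snores, v).    lexicon(sees, v).
    lexicon(book, v).      lexicon(books, v).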
10 Bottom-Up Filtering
- non-terminal: a grammar rule (make all RHS nodes new LHSs)
- depth-first search: always expand the left-most (first) daughter node
- breadth-first search: expand all daughter nodes at the same level, one after the other
- Filtering: expand only those nodes whose left corner can match the current input
11 A Top-Down Parser in Prolog
    parse(Words) :-
        top_symbol(S),
        parse(S, Words, []).

    % A terminal: look the word up in the lexicon, consuming it.
    parse(Phrase, [Word|Words], Words) :-
        lexicon(Word, Phrase).
    % A non-terminal: expand a grammar rule and parse its RHS.
    parse(Phrase, WordsIn, WordsOut) :-
        (Phrase ---> Body),
        parse_rest(Body, WordsIn, WordsOut).

    parse_rest([], Words, Words).
    parse_rest([Phrase|Rest], WordsIn, WordsOut) :-
        parse(Phrase, WordsIn, WordsMid),
        parse_rest(Rest, WordsMid, WordsOut).
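With the grammar encoding sketched on slide 9, a query might look like:

    ?- parse([john, sees, the, book]).
    true.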
12 Left Recursion
- Add rules for PP-modifiers:
- np → np, pp.
- vp → vp, pp.
- pp → p, np.
- John gave a dog to Mary.
- John saw the dog in the park.
- These left-recursive rules send the top-down parser of slide 11 into an infinite loop: np is rewritten to np pp without any input being consumed.
13 Data-Driven Parsing (Bottom-Up)
- Make the start symbol the first prediction category.
- Make a terminal matching the first word the current phrase.
- Try to link the current phrase with the prediction category
- by noting that they are identical (consume a word in the input sentence)
- or by finding a grammar rule with the phrase as its left corner (make the LHS node the new current phrase).
- When applying a grammar rule, make all following RHS nodes prediction categories for the remaining words.
- Continue until all words in the sentence have been consumed (a left-corner parser in this style is sketched below).
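A minimal left-corner parser sketch following this recipe, replacing the top-down parse/3 of slide 11; lc/4 is an illustrative name, and the ---> and lexicon/2 encoding of slide 9 is assumed:

    parse(Words) :-
        top_symbol(S),
        parse(S, Words, []).

    % Bottom-up step: a terminal matching the first word becomes
    % the current phrase, to be linked to the prediction Phrase.
    parse(Phrase, [Word|WordsIn], WordsOut) :-
        lexicon(Word, SubPhrase),
        lc(SubPhrase, Phrase, WordsIn, WordsOut).

    % Link the current phrase to the prediction category:
    lc(Phrase, Phrase, Words, Words).            % they are identical, or
    lc(SubPhrase, Phrase, WordsIn, WordsOut) :-  % a rule has it as its left corner
        (LHS ---> [SubPhrase|Rest]),
        parse_rest(Rest, WordsIn, WordsMid),     % following RHS nodes become predictions
        lc(LHS, Phrase, WordsMid, WordsOut).

    parse_rest([], Words, Words).
    parse_rest([Phrase|Rest], WordsIn, WordsOut) :-
        parse(Phrase, WordsIn, WordsMid),
        parse_rest(Rest, WordsMid, WordsOut).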
14 Empty Production Rules
- A → ε
- The LHS has no realisation in the input string
- Analysing bottom-up ⇒ empty rules are always applicable
- A bottom-up parser dreams up an infinite number of empty-rule applications anywhere
15 Top-Down Filtering
- Reduce the search space
- Avoid non-termination
- Link table
- a table recording which phrases can be left corners of which categories (an example follows below)
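A sketch of such a link table for the grammar of slide 9; link/2 is an assumed name. The left-corner parser above could call link(SubPhrase, Phrase) before trying a rule, pruning hypotheses whose left corner cannot match the current input:

    link(X, X).                                    % every category links to itself
    link(np, s).     link(vp, s).                  % from s ---> [np, vp] and s ---> [vp]
    link(name, np).  link(pro, np).  link(det, np).
    link(v, vp).
    link(name, s).   link(pro, s).   link(det, s).   link(v, s).   % transitive closure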
16 Human Language Processing (Kimball 1973, Cognition 2: 15-47)
- Ungrammaticality vs. unacceptability
- First Principle: Top-Down
- Parsing in natural language proceeds according to a top-down algorithm
- Weakest principle
- Universality?
17 Principle Two: Right Association
- Terminals associate to the lowest non-terminal node
- Sentences organize into right-branching structures
- (perceptually less complex than left-branching or center-embedded ones)
- (tree diagrams contrasting left-branching, center-embedded, and right-branching structures)
- The girl took the job that was attractive.
- Joe said that Martha expected that it would rain yesterday.
18 Principle Three: New Nodes
- A new node is signalled by a function word
- (prepositions, determiners, conjunctions, complementizers, Wh-words, auxiliaries)
- She asked him or she persuaded him to leave.
- Deleting complementizers and relative pronouns makes sentences more complex
- He knew the girl left.
- He knew that the girl left.
19 Principle Four: Two Sentences
- At most two sentences can be parsed in parallel
- That Joe left bothered Susan.
- That that Joe left bothered Susan surprised Max.
- (tree diagram: three nested S nodes, [[[that Joe left] bothered Susan] surprised Max])
- That for Joe to leave bothers Susan surprised Max.
20 Principle Five: Closure
- A phrase is closed as soon as possible
- (unless the next node is a constituent of the phrase)
- If a terminal string can be interpreted as an XP, it will be
- They knew that the girl was in the closet.
- They knew the girl ... was in the closet.
21 Principle Six: Fixed Structure
- It is costly to reorganize the constituents after a phrase has been closed
- Garden-path sentences
- The horse raced past the barn fell.
- The dog knew the cat disappeared, was rescued.
- English is a look-ahead language
- Scanned terminals occupy a small portion of STM
- The biggest restriction is on the number of S nodes held
- The allocation of storage space is more than made up for by efficiency of parsing.
22 Principle Seven: Processing
- When a phrase is closed,
- it is pushed down into a syntactic processing stage and cleared from STM (short-term memory)
- Tom saw that the cow jumped over the moon
- (tree diagram: [S [NP Tom] [VP [V saw] [S that [S [NP the cow] [VP [V jumped] [PP over the moon]]]]]])
23 Well-Formed Substring Table (WFST)
- Top-down: sequential rule application
- Substrings may be reparsed
- Smarter: save partial parses in a WFST
- Analyse as in top-down parsing, but:
- If the current LHS has already been exhaustively analysed
- ⇒ look up the solution in the table
- Else do an ordinary top-down analysis of the current LHS
- ⇒ note the partial result in the table
- If there are no more possible parses of the current LHS
- ⇒ note that it has been exhaustively analysed
24 Head Parsing
- One RHS element is identified as the head in each rule
- Parsing:
- Identify the head.
- (save the heads in a table!)
- Expand the head's left and right contexts, using the head as top-down prediction.
25 Chart Parsing (Earley's Algorithm)
- Saves partial results in a chart (a table)
- extends the WFST idea
- Not a fixed algorithm but a data structure
- Independent of parsing strategy
- and grammar formalism
- Stops the parser from trying to analyse the same input several times.
26 The Chart Data Structure
- A directed, partially cyclic graph
- The only cycles allowed are those which point back to the node they came from
- The nodes are called vertices
- The arcs are called edges
27 The Chart, Formally
- A directed graph C = ⟨V, E⟩
- V is a finite, non-empty set of nodes
- E is a finite set of edges
- Each edge consists of
- a finite set of dotted context-free rules
- a (possibly infinite) set of feature-value structures for a constraint-based grammar
28 Edges
- The chart entries are edges between various positions in the sentence, marked with phrases.
- There are passive and active edges.
- A phrase spanning two positions in the input word string is called a goal.
- The parse is complete when one has constructed a passive edge between the start and end points of the sentence, marked with the top symbol of the grammar.
29 Active Edges
- An active edge corresponds to an unproven goal,
- an unconfirmed hypothesis about the input
- (e.g., that the parser is about to find an NP).
- One can find the type of phrase indicated by the edge if one can prove the remaining subgoals.
- These subgoals are specified by the edge.
30 Passive (Inactive) Edges
- A passive edge corresponds to a proven goal,
- a confirmed hypothesis about the input
- (e.g., that the parser has found an NP)
- One knows that there is a phrase of the type indicated by the edge between its start and end points.
31 Active Chart Parsing
- Chart parsing with both active and inactive edges
- If only inactive edges are used,
- the chart corresponds to a WFST.
32 Dotted Rules
- A dot indicates to what extent the hypothesis that a rule is applicable has been verified.
- Symbols in a label to the left of the dot represent confirmed hypotheses.
- Symbols to the right of the dot represent unconfirmed hypotheses.
- E.g., in np → det • n the det has been found, and an n is still being sought.
33 The Fundamental Rule of Chart Parsing
- Describes how active and passive edges are combined
- Add a new edge to the chart each time an active edge meets a passive edge with the right category
- The new edge spans both the active and the passive edge
- This kind of combination eventually gives more passive edges (i.e., more input has been analysed); a sketch follows below
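A minimal sketch of the fundamental rule, under the assumption that an edge is represented as edge(From, To, LHS, Found, ToFind), where ToFind holds the categories to the right of the dot; the representation and the predicate name fundamental/3 are illustrative, not from the slides:

    % An active edge needing Cat next meets a passive edge for Cat
    % starting where the active edge ends: combine them into a new
    % edge spanning both, with the dot moved over Cat.
    fundamental(edge(I, J, LHS, Found, [Cat|ToFind]),   % active edge
                edge(J, K, Cat, _Daughters, []),        % passive edge (nothing left to find)
                edge(I, K, LHS, [Cat|Found], ToFind)).  % combined, wider edge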
34 Efficiency
- A lot of spurious productions are generated.
- But no work is done twice, since all intermediate results are saved and may be reused.
35 Shift-Reduce Parsing
- In each cycle, one of two actions is performed:
- A shift action consumes an input word.
- A reduce action applies a grammar rule (a sketch follows below).
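A minimal nondeterministic shift-reduce parser sketch, again assuming the ---> encoding of slide 9; Prolog backtracking stands in for the control information a real LR parser reads off its parsing table:

    parse_sr(Words) :-
        top_symbol(S),
        sr([], Words, S).

    sr([S], [], S).                  % done: only the start symbol remains on the stack
    sr(Stack, [Word|Words], S) :-    % shift: consume an input word
        lexicon(Word, Cat),
        sr([Cat|Stack], Words, S).
    sr(Stack, Words, S) :-           % reduce: a rule's (reversed) RHS tops the stack
        (LHS ---> RHS),
        reverse(RHS, ReversedRHS),
        append(ReversedRHS, Rest, Stack),
        sr([LHS|Rest], Words, S).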
36 LR Parsing
- A type of shift-reduce parsing.
- Straight shift-reduce:
- one grammar rule is attempted at a time.
- LR:
- a number of grammar rules are handled simultaneously by prefix merging.
37 The Parts of an LR Parser
- a pushdown stack
- a finite set of internal states
- a reader head scanning the input string
- from left to right, one symbol at a time
- A left-corner parsing strategy.
- L: left-to-right scanning of the input.
- R: constructing the right-most derivation in reverse.
- YACC: Yet Another Compiler-Compiler
38 The Stack
- Every other item is a grammar symbol
- Every other item is a state
- The top of the stack: the current state
- The next symbol in the input string:
- the look-ahead symbol
- (the symbol under the reader head)
39 LR Parsing and Natural Languages
- An LR parser is very efficient
- totally deterministic
- no backtracking or search
- But: it cannot treat ambiguities.
- An ambiguous grammar ⇒ a parsing table with multiple entries (conflicts).
40 Tomita's Algorithm
- Extends the standard LR algorithm with
- a graph-structured stack
- (handles ambiguities in the parsing table)
- structure sharing in the parse trees
- packing of local ambiguities
41 Tomita and Earley: Recognition
- Both are O(n³)
- Tomita is 5-10 times faster than standard Earley
- (2-3 times faster than the best version of Earley)
- Tomita might be worse for densely ambiguous grammars
- Earley is better for extremely long or short sentences
42 Tomita and Earley: Space
- Earley's space usage grows faster than Tomita's as sentences get longer.