1
Parsing
  • Dr. Björn Gambäck
  • SICS Swedish Institute of Computer Science AB
  • Stockholm, Sweden

2
Analysis of Natural Languages
  • Syntax: the actual structure of an utterance
  • Parsing: the best possible way to analyse an
    utterance
  • Semantics: a representation of the meaning of an
    utterance

3
Syntactic Recognition and Parsing
  • Recognise a sentence
  • Assign some structure to it
  • Goal: find all parse trees rooted in the start
    symbol S, covering exactly the input words.
  • Is it possible to parse a string
    deterministically as it is being read (e.g. from
    left-to-right)?
  • What is the largest context the parser must
    examine to decide how to build the parse tree?

4
Artificial Language Parsing vs. Natural Language
Parsing
  • A parser for a computer language
  • yields a unique tree for each string
  • must be deterministic
  • is allowed (basically) unrestricted memory
  • A natural language parser
  • must allow for more than one parse
  • should predict which parse will most likely be
    offered as the first choice by a native speaker
  • the short-term memory in humans is restricted

5
Parsing Natural Languages
  • Highly ambiguous
  • All solutions must be found
  • Analysis problem more complex
  • Solutions based on saving partial parses

6
Empty Rules and Left-Recursion
  • Natural language grammars can be both
  • empty: the grammar has a rule
  • A → ε
  • left-recursive: the grammar has a rule
  • A → A B

7
Cyclicity and Infinite Ambiguity
  • Natural language grammars cannot be
  • cyclic: the grammar has a rule
  • A → A
  • infinitely ambiguous: some input sentences give
    an infinite number of parses.
  • The number of rules in an NL grammar is large!
  • The length of an NL sentence is normally 10-30
    words.

8
Hypothesis-Driven Parsing (Top-Down)
  • Find the start symbol as the first current LHS
  • Look up the current LHS as either
  • a terminal: a word in the lexicon (consume a
    word in the input sentence)
  • or a non-terminal: a grammar rule
  • (make all RHS nodes new LHSs).
  • Continue until all words have been consumed.

9
Context-Free Grammar
  • name → john.
  • name → mary.
  • det → a.
  • det → the.
  • pro → it.
  • n → book.
  • n → books.
  • v → snores.
  • v → sees.
  • v → book.
  • v → books.
  • s → np, vp.
  • s → vp.
  • np → name.
  • np → pro.
  • np → det, n.
  • vp → v.
  • vp → v, np.
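  To run the Prolog parser on the next slide, this grammar can be
  stored as Prolog facts. A minimal sketch, assuming a user-declared
  ---> operator and a lexicon/2 relation (both names are
  illustrative, chosen to match the parser code):

    :- op(1200, xfx, --->).        % declare the rule operator

    top_symbol(s).

    % grammar rules: LHS ---> list of RHS categories
    s  ---> [np, vp].          np ---> [name].
    s  ---> [vp].              np ---> [pro].
    vp ---> [v].               np ---> [det, n].
    vp ---> [v, np].

    % lexicon(Word, Category)
    lexicon(john, name).       lexicon(mary, name).
    lexicon(a, det).           lexicon(the, det).
    lexicon(it, pro).
    lexicon(book, n).          lexicon(books, n).
    lexicon(snores, v).        lexicon(sees, v).
    lexicon(book, v).          lexicon(books, v).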

10
Bottom-Up Filtering
  • non-terminal: a grammar rule
  • (make all RHS nodes new LHSs).
  • depth-first search: always expand the left-most
    (first) daughter node first
  • breadth-first search: expand all daughter nodes
    at the same level, one after another
  • Filtering: expand only those nodes whose
    left corner can match the current input

11
A Top-Down Parser in Prolog
  • parse(Words) :-
  •     top_symbol(S),
  •     parse(S, Words, []).
  • parse(Phrase, [Word|Words], Words) :-
  •     lexicon(Word, Phrase).
  • parse(Phrase, WordsIn, WordsOut) :-
  •     (Phrase ---> Body),
  •     parse_rest(Body, WordsIn, WordsOut).
  • parse_rest([], Words, Words).
  • parse_rest([Phrase|Rest], WordsIn, WordsOut) :-
  •     parse(Phrase, WordsIn, WordsMid),
  •     parse_rest(Rest, WordsMid, WordsOut).
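  A usage sketch, assuming the grammar and lexicon encoding given
  with slide 9:

    ?- parse([john, sees, a, book]).
    true.

    ?- parse([book, it]).      % imperative reading via s ---> [vp]
    true.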

12
Left Recursion
  • Add rules for PP-modifiers
  • np → np, pp.
  • vp → vp, pp.
  • pp → p, np.
  • John gave a dog to Mary.
  • John saw the dog in the park.
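  With these rules the top-down parser of slide 11 no longer
  terminates: it may find a first analysis, but as soon as the
  left-recursive rule is tried (e.g. on backtracking, when all parses
  are requested) it loops. A hypothetical trace, with the names used
  above (the words here are not all in the slide 9 lexicon, but the
  loop happens before any word is consumed):

    ?- parse(np, [john, saw, the, dog], Rest).
    % selecting np ---> [np, pp] immediately calls
    % parse(np, [john, saw, the, dog], Mid)
    % on the same input: no word is consumed, so the
    % recursion never bottoms out.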

13
Data-Driven Parsing (Bottom-Up)
  • Make the start symbol the first prediction
    category.
  • Make a terminal matching the first word the
    current phrase.
  • Try to link the current phrase with the
    prediction category
  • either by noting that they are identical (consume
    a word in the input sentence)
  • or by finding a grammar rule with the phrase as
    its left corner (make the LHS node the new current
    phrase).
  • When applying a grammar rule, make all following
    RHS nodes prediction categories for the remaining
    words.
  • Continue until all words in the sentence have
    been consumed.
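  The procedure just described is essentially left-corner parsing. A
  minimal sketch in the same Prolog style as slide 11 (predicate
  names are illustrative):

    lc(Words) :-
        top_symbol(S),
        lc(S, Words, []).

    % find a left corner, then complete it up to the predicted category
    lc(Cat, [Word|WordsIn], WordsOut) :-
        lexicon(Word, SubCat),
        complete(SubCat, Cat, WordsIn, WordsOut).

    % the left corner already is the predicted category ...
    complete(Cat, Cat, Words, Words).
    % ... or it starts a rule whose remaining RHS is parsed top-down
    complete(SubCat, Cat, WordsIn, WordsOut) :-
        (LHS ---> [SubCat|Rest]),
        lc_rest(Rest, WordsIn, WordsMid),
        complete(LHS, Cat, WordsMid, WordsOut).

    lc_rest([], Words, Words).
    lc_rest([Cat|Cats], WordsIn, WordsOut) :-
        lc(Cat, WordsIn, WordsMid),
        lc_rest(Cats, WordsMid, WordsOut).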

14
Empty Production Rules
  • A → ε
  • The LHS has no realisation in the input string
  • Analysing bottom-up ⇒ empty rules are always
    applicable
  • A bottom-up parser dreams up an infinite number
    of empty rule applications anywhere

15
Top-Down Filtering
  • Reduce the search space
  • Avoid non-termination
  • Link table
  • a table recording which phrases can be left
    corners of which categories (sketched below)
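  A link table for the slide 9 grammar, and the filtered variant of
  the lc/3 clause from the sketch after slide 13 (again illustrative;
  link/2 can also be computed from the grammar):

    % link(Left, Top): Left can be the left corner of a Top phrase
    link(X, X).                        % every category links to itself
    link(name, np).   link(pro, np).   link(det, np).
    link(v, vp).
    link(np, s).      link(vp, s).     link(v, s).
    link(name, s).    link(pro, s).    link(det, s).
    link(p, pp).

    % filtered: only try left corners that can lead to Cat
    lc(Cat, [Word|WordsIn], WordsOut) :-
        lexicon(Word, SubCat),
        link(SubCat, Cat),             % top-down filter
        complete(SubCat, Cat, WordsIn, WordsOut).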

16
Human Language Processing (Kimball 1973,
Cognition 2: 15-47)
  • Ungrammaticality vs. unacceptability
  • First Principle: Top-Down
  • Parsing in natural language proceeds according to
    a top-down algorithm
  • Weakest principle
  • Universality?

17
Principle Two: Right Association
  • Terminals associate to the lowest non-terminal
    node
  • Sentences organize into right-branching
    structures
  • (perceptually less complex than left-branching or
    center-embedded)

  (diagram: left-branching, center-embedded, and
  right-branching structures)
  • The girl took the job that was attractive.
  • Joe said that Martha expected that it would rain
    yesterday.

18
Principle Three: New Nodes
  • A new node is signalled by a function word
  • (prepositions, determiners, conjunctions,
    complementizers, Wh-words, auxiliaries)
  • She asked him or she persuaded him to leave.
  • Deleting complementizers and relative pronouns
    makes sentences more complex
  • He knew the girl left.
  • He knew that the girl left.

19
Principle Four: Two Sentences
  • At most two sentences can be parsed in parallel
  • That Joe left bothered Susan.
  • That that Joe left bothered Susan surprised Max.

  (tree diagram: nested S nodes for That that Joe left
  bothered Susan surprised Max.)
  • That for Joe to leave bothers Susan surprised
    Max.

20
Principle Five: Closure
  • A phrase is closed as soon as possible
  • (unless the next node is a constituent of the
    phrase)
  • They knew that the girl was in the closet.
  • They knew the girl was in the closet.
  • If a terminal string can be interpreted as an XP,
    it will be

21
Principle Six: Fixed Structure
  • It is costly to reorganize the constituents after
    a phrase has been closed
  • Garden path sentences
  • The horse raced past the barn fell.
  • The dog knew the cat disappeared, was rescued.
  • English is a look-ahead language
  • Scanned terminals occupy a small portion of STM
  • The biggest restriction is on the number of S
    nodes held
  • The allocation of storage space is more than made
    up for by efficiency of parsing.

22
Principle Seven: Processing
  • When a phrase is closed,
  • it is pushed down into a syntactic processing
    stage and cleared from STM (short-term memory)
  • Tom saw that the cow jumped over the moon

  (parse tree: Tom saw [S that the cow jumped [PP over
  the moon]])
23
Well-Formed Substring Table (WFST)
  • Top-Down: sequential rule application
  • Substrings may be reparsed
  • Smarter: save partial parses in a WFST
  • Analyse like top-down, but
  • If the current LHS has been exhaustively analysed
  • ⇒ look up the solution in the table
  • Else ordinary top-down analysis of the current LHS
  • ⇒ note the partial result in the table
  • If there are no more possible parses of the
    current LHS
  • ⇒ note that it has been exhaustively analysed
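  In Prolog this bookkeeping can be approximated with tabling; for
  example, in SWI-Prolog a single directive memoizes the slide 11
  parser (a sketch; tabling also makes left-recursive rules
  terminate):

    :- table parse/3.   % each (category, input) goal is solved once;
                        % later calls reuse the stored solutions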

24
Head Parsing
  • One RHS element identified as the head in each
    rule
  • Parsing
  • Identify the head.
  • (save the heads in a table!)
  • Expand the head's left and right contexts, using
    the head as a top-down prediction.

25
Chart Parsing (Earley's Algorithm)
  • Saves partial results in a chart (a table)
  • - extends the WFST idea
  • Not a fixed algorithm but a data structure
  • Independent of parsing strategy and grammar
    formalism
  • Stops the parser from trying to analyse the same
    input several times.

26
The Chart Data Structure
  • A directed, partially cyclic graph
  • The only cycles allowed are those which point
    back to the node they came from
  • The nodes are called vertices
  • The arcs are called edges

27
The Chart, Formally
  • A directed graph C = ⟨V, E⟩
  • V is a finite, non-empty set of nodes
  • E is a finite set of edges
  • Each edge consists of
  • a finite set of dotted context-free rules
  • a (possibly infinite) set of feature-value
    structures for a constraint-based grammar

28
Edges
  • The chart entries are edges between various
    positions in the sentence marked with phrases.
  • There are passive and active edges.
  • A phrase spanning two positions in the input word
    string is called a goal.
  • The parse is completed when one has constructed a
    passive edge between the starting point and the
    end point of the sentence marked with the top
    symbol of the grammar.

29
Active edges
  • An active edge corresponds to an unproven goal,
  • an unconfirmed hypothesis about the input
  • (e.g., that the parser is about to find an NP).
  • One can find the type of phrase indicated by the
    edge if one can prove the remaining subgoals.
  • These subgoals are specified by the edge.

30
Passive (inactive) edges
  • A passive edge corresponds to a proven goal, a
    confirmed hypothesis about the input
  • (e.g., that the parser has found an NP)
  • One knows that there is a phrase of the type
    indicated by the edge between its starting and
    end points.

31
Active chart parsing
  • Chart parsing with both active and inactive edges
  • If only inactive edges are used,
  • the chart corresponds to a WFST.

32
Dotted Rules
  • A dot indicates to what extent the hypothesis
    that a rule is applicable has been verified.
  • Symbols in a label that are to the left of a dot
    represent one or more confirmed hypotheses.
  • Symbols to the right of a dot represent one or
    more unconfirmed hypotheses.

33
The Fundamental Rule of Chart Parsing
  • Describes how active and passive edges are
    combined
  • Add an edge to the chart each time an active
    edge meets a passive edge with the right
    category
  • The new edge spans both the active and the
    passive edge
  • Such combinations eventually yield more passive
    edges (i.e., more input has been analysed)
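  As a data-structure sketch (names illustrative): an edge can be a
  term edge(From, To, LHS, Found, ToFind), with the dot sitting
  between the Found and ToFind lists of a dotted rule. The
  fundamental rule then combines an active and a passive edge like
  this:

    % an active edge needing B next meets a passive edge for B;
    % when ToFind becomes [], the new edge is passive
    fundamental(edge(I, J, A, Found, [B|ToFind]),
                edge(J, K, B, _, []),
                edge(I, K, A, [B|Found], ToFind)).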

34
Efficiency
  • A lot of spurious productions are generated.
  • But no work is done twice, since all intermediate
    results are saved and may be reused.

35
Shift-Reduce Parsing
  • In each cycle one of two actions is performed
  • A shift action consumes an input word.
  • A reduce action applies a grammar rule.
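  A minimal nondeterministic shift-reduce recognizer in the same
  Prolog style (a sketch; a real implementation adds a table or
  oracle to choose between shifting and reducing):

    sr(Words) :-
        sr([], Words).

    % accept: input consumed, stack holds only the start symbol
    sr([S], []) :-
        top_symbol(S).
    % shift: consume a word, pushing its lexical category
    sr(Stack, [Word|Words]) :-
        lexicon(Word, Cat),
        sr([Cat|Stack], Words).
    % reduce: the top of the stack matches a rule's RHS (reversed)
    sr(Stack, Words) :-
        (LHS ---> RHS),
        reverse(RHS, RevRHS),
        append(RevRHS, Rest, Stack),
        sr([LHS|Rest], Words).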

36
LR Parsing
  • A type of shift-reduce parsing.
  • Straight shift-reduce:
  • one grammar rule is attempted at a time.
  • LR:
  • a number of grammar rules are handled
    simultaneously by prefix merging.

37
The Parts of an LR Parser
  • a pushdown stack
  • a finite set of internal states
  • a reader head scanning the input string
    from left to right, one symbol at a time
  • A left-corner parsing strategy.
  • L: left-to-right scanning of the input.
  • R: constructing the right-most derivation in
    reverse.
  • YACC: Yet Another Compiler Compiler

38
The stack
  • Every other item is a grammar symbol
  • Every other item is a state
  • The top of the stack: the current state
  • The next symbol in the input string:
  • the look-ahead symbol
  • (the symbol under the reader head)

39
LR Parsing and Natural Languages
  • An LR parser is very efficient
  • totally deterministic
  • no backtracking or search
  • But: it cannot treat ambiguities.
  • Ambiguous grammar ⇒ a parsing table with
    multiple entries (conflicts).

40
Tomita's Algorithm
  • Extends the standard LR algorithm with
  • a graph-structured stack
  • (handles ambiguities in the parsing table)
  • structure sharing in the parse trees
  • packing of local ambiguities

41
Tomita and Earley: Recognition
  • Both O(n³)
  • Tomita: 5-10 times faster than standard Earley
  • (2-3 times faster than the best version of
    Earley)
  • Tomita might be worse for densely ambiguous
    grammars
  • Earley is better for extremely long or short
    sentences

42
Tomita and Earley: Space
  • Earley's space usage grows relative to Tomita's
    as sentences get longer.