Scanner wrap-up and Introduction to Parser

1
Scanner wrap-up and Introduction to Parser
2
Automating Scanner Construction
  • RE → NFA (Thompson's construction; a sketch
    follows below)
  • Build an NFA for each term
  • Combine them with ε-moves
  • NFA → DFA (subset construction)
  • Build the simulation
  • DFA → minimal DFA (today)
  • Hopcroft's algorithm
  • DFA → RE (not really part of scanner
    construction)
  • All-pairs, all-paths problem
  • Union together paths from s0 to a final state
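To make the first step concrete, here is a minimal Python sketch of Thompson's construction. The representation (a global transition table of (label, state) pairs, with None standing for ε) is an assumption for illustration, not notation from the slides.

from collections import defaultdict

EPS = None                      # label for epsilon-moves
trans = defaultdict(list)      # state -> list of (label, state)
_next = [0]

def new_state():
    s = _next[0]; _next[0] += 1
    return s

def symbol(a):
    # NFA fragment for a single symbol a
    s, f = new_state(), new_state()
    trans[s].append((a, f))
    return s, f

def concat(n1, n2):
    # glue fragment n1 to n2 with an epsilon-move
    trans[n1[1]].append((EPS, n2[0]))
    return n1[0], n2[1]

def union(n1, n2):
    # new start/accept with epsilon-moves into and out of both fragments
    s, f = new_state(), new_state()
    trans[s] += [(EPS, n1[0]), (EPS, n2[0])]
    trans[n1[1]].append((EPS, f))
    trans[n2[1]].append((EPS, f))
    return s, f

def star(n):
    # Kleene closure: loop back, and allow skipping the fragment entirely
    s, f = new_state(), new_state()
    trans[s] += [(EPS, n[0]), (EPS, f)]
    trans[n[1]] += [(EPS, n[0]), (EPS, f)]
    return s, f

For example, the NFA for ( a | b )* abb is concat(concat(concat(star(union(symbol('a'), symbol('b'))), symbol('a')), symbol('b')), symbol('b')).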

3
DFA Minimization
  • The Big Picture
  • Discover sets of equivalent states
  • Represent each such set with just one state

4
DFA Minimization
  • The Big Picture
  • Discover sets of equivalent states
  • Represent each such set with just one state
  • Two states are equivalent if and only if
  • The set of paths leading to them are equivalent
  • ∀ α ∈ Σ, transitions on α lead to equivalent
    states (DFA)
  • α-transitions to distinct sets ⇒ states must be
    in distinct sets
  • A partition P of S
  • Each s ∈ S is in exactly one set pi ∈ P
  • The algorithm iteratively partitions the DFA's
    states

5
DFA Minimization
  • Details of the algorithm
  • Group states into maximal size sets,
    optimistically
  • Iteratively subdivide those sets, as needed
  • States that remain grouped together are
    equivalent
  • Initial partition, P0, has two sets: {F} and
    {Q–F}                        (D = (Q, Σ, δ, q0, F))
  • Splitting a set ("partitioning a set by α")
  • Assume qa, qb ∈ s, and δ(qa,α) = qx, δ(qb,α) = qy
  • If qx and qy are not in the same set, then s must
    be split
  • qa has a transition on α, qb does not ⇒ α splits s
  • One state in the final DFA cannot have two
    transitions on α

6
DFA Minimization
  • Why does this work?
  • Partition P ⊆ 2^Q
  • Start off with 2 subsets of Q: {F} and {Q–F}
  • The while loop takes Pi → Pi+1 by splitting 1 or
    more sets
  • Pi+1 is at least one step closer to the partition
    with |Q| sets
  • Maximum of |Q| splits
  • Note that
  • Partitions are never combined
  • Initial partition ensures that final states are
    intact
  • The algorithm

P ← { F, Q–F }
while ( P is still changing )
    T ← { }
    for each set S ∈ P
        for each α ∈ Σ
            partition S by α into S1 and S2
            T ← T ∪ { S1, S2 }
    if T ≠ P then P ← T
This is a fixed-point algorithm!
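Assuming the DFA is given as a complete transition table, the loop above transcribes into Python as follows. This is a sketch: it splits each set on all symbols of Σ at once rather than one α at a time, which reaches the same fixed point.

def minimize(Q, Sigma, delta, F):
    # start from the optimistic two-set partition {F, Q-F}
    P = [s for s in (set(F), set(Q) - set(F)) if s]
    while True:
        T = []
        for S in P:
            groups = {}
            for q in S:
                # signature: which set of P each transition out of q reaches
                key = tuple(find(P, delta[q][a]) for a in Sigma)
                groups.setdefault(key, set()).add(q)
            T.extend(groups.values())          # S is split as needed
        if len(T) == len(P):                   # nothing split: fixed point
            return T
        P = T

def find(P, q):
    # index of the partition set that contains state q
    return next(i for i, S in enumerate(P) if q in S)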
7
Key Idea: Splitting S around α

[Figure: the original set S has transitions on α
into three different sets R, Q, and T, so the
algorithm partitions S around α]
8
Key Idea: Splitting S around α

[Figure: S is split into S1 and S2 around α]

S2 is everything in S – S1
Could we split S2 further? Yes, but it does not
help asymptotically
9
DFA Minimization
  • Refining the algorithm
  • As written, it examines every S ∈ P on each
    iteration
  • This does a lot of unnecessary work
  • Only need to examine S if some T, reachable from
    S, has split
  • Reformulate the algorithm using a worklist
  • Start the worklist with the initial partition,
    {F} and {Q–F}
  • When it splits S into S1 and S2, place S2 on the
    worklist
  • This version looks at each S ∈ P many fewer times
  • Well-known, widely used algorithm due to John
    Hopcroft

10
Hopcroft's Algorithm
W ← { F, Q–F }               // W is the worklist
P ← { F, Q–F }               // P is the current partition
while ( W is not empty ) do begin
    select and remove S from W      // S is a set of states
    for each α in Σ do begin
        let Iα ← δα⁻¹( S )          // Iα is the set of all states
                                    // that can reach S on α
        for each R in P such that R ∩ Iα is not empty
                and R is not contained in Iα do begin
            partition R into R1 and R2 such that
                R1 ← R ∩ Iα ;  R2 ← R – R1
            replace R in P with R1 and R2
            if R ∈ W then
                replace R with R1 in W and add R2 to W
            else if |R1| ≤ |R2| then add R1 to W
            else add R2 to W
        end
    end
end
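A compact Python rendering of the worklist algorithm above, as a sketch: delta_inv, mapping (symbol, state) to that state's predecessors, is an assumed precomputed inverse of δ.

def hopcroft(Q, Sigma, delta_inv, F):
    P = {frozenset(F), frozenset(Q) - frozenset(F)} - {frozenset()}
    W = set(P)                              # worklist of splitter sets
    while W:
        S = W.pop()
        for a in Sigma:
            # I_a: every state that reaches S on symbol a
            Ia = {q for s in S for q in delta_inv.get((a, s), ())}
            for R in [R for R in P if R & Ia and not R <= Ia]:
                R1, R2 = frozenset(R & Ia), frozenset(R - Ia)
                P.remove(R)
                P.update((R1, R2))
                if R in W:                  # keep the worklist consistent
                    W.remove(R)
                    W.update((R1, R2))
                else:                       # add the smaller half
                    W.add(R1 if len(R1) <= len(R2) else R2)
    return P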
11
A Detailed Example
  • Remember ( a | b )* abb ?
  • Applying the subset construction (a sketch
    follows below)
  • Iteration 3 adds nothing to S, so the algorithm
    halts

[Table of the construction omitted; the final DFA
state contains q4, the NFA's final state]
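For concreteness, here is a sketch of the subset construction, reusing the trans/EPS representation assumed in the Thompson's-construction sketch earlier.

def eps_closure(states):
    # all states reachable from `states` using only epsilon-moves
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for label, r in trans[q]:
            if label is EPS and r not in seen:
                seen.add(r)
                stack.append(r)
    return frozenset(seen)

def subset_construction(start, accept, Sigma):
    d0 = eps_closure({start})
    dfa, work = {}, [d0]
    while work:                   # halts when an iteration adds nothing new
        S = work.pop()
        if S in dfa:
            continue
        dfa[S] = {}
        for a in Sigma:
            step = {r for q in S for (lbl, r) in trans[q] if lbl == a}
            dfa[S][a] = eps_closure(step)
            work.append(dfa[S][a])
    # any DFA state containing the NFA's final state is final
    finals = {S for S in dfa if accept in S}
    return d0, dfa, finals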
12
A Detailed Example
  • The DFA for ( a | b )* abb
  • Not much bigger than the original
  • All transitions are deterministic
  • Use same code skeleton as before

13
A Detailed Example
  • Applying the minimization algorithm to the DFA

[Figure: the minimized DFA, with its final state marked]
14
DFA Minimization
  • What about a ( b | c )* ?
  • First, the subset construction

[Figure: Thompson's-construction NFA for a ( b | c )*
with states q0 … q9, ε-moves, and transitions on a,
b, and c; the subset construction collapses it to a
DFA with states s0 … s3 and transitions on a, b, c]
15
DFA Minimization
  • Then, apply the minimization algorithm
  • To produce the minimal DFA

[Figure: the minimal DFA, with its final states marked]
Minimizing that DFA produces the one that a human
would design!
16
Limits of Regular Languages
  • Advantages of Regular Expressions
  • Simple, powerful notation for specifying
    patterns
  • Automatic construction of fast recognizers
  • Many kinds of syntax can be specified with REs
  • Example: an expression grammar
  • Term → [a-zA-Z] ( [a-zA-Z] | [0-9] )*
  • Op → + | - | * | /
  • Expr → ( Term Op )* Term
  • Of course, this would generate a DFA
  • If REs are so useful
  • Why not use them for everything?

17
Limits of Regular Languages
  • Not all languages are regular
  • RLs ⊂ CFLs ⊂ CSLs
  • You cannot construct DFAs to recognize these
    languages
  • L = { pᵏqᵏ }
    (parenthesis languages)
  • L = { wcwʳ | w ∈ Σ* }
  • Neither of these is a regular language
    (nor an RE)
  • But, this is a little subtle. You can construct
    DFAs for
  • Strings with alternating 0s and 1s
  • ( ε | 1 ) ( 01 )* ( ε | 0 )        (see the
    sketch below)
  • Strings with an even number of 0s and 1s
  • REs can count bounded sets and bounded
    differences
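As a quick sanity check of the alternating-0s-and-1s claim, the RE above transcribes directly into Python's re module (a sketch):

import re

alternating = re.compile(r'1?(01)*0?')

assert alternating.fullmatch('0101')      # alternates: accepted
assert alternating.fullmatch('10101')
assert not alternating.fullmatch('0110')  # repeated 1: rejected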

18
What can be so hard?
  • Poor language design can complicate scanning
  • Reserved words are important
  • if then then then = else else else = then
    (PL/I)
  • Insignificant blanks
    (Fortran & Algol68)
  • do 10 i = 1,25
  • do 10 i = 1.25
  • String constants with special characters
    (C, C++, Java, …)
  • newline, tab, quote, comment delimiters, …
  • Finite closures
    (Fortran 66 & Basic)
  • Limited identifier length
  • Adds states to count length

19
What can be so hard? (Fortran 66/77)
  • How does a compiler scan this?
  • First pass finds & inserts blanks
  • Can add extra words or tags to
    create a scannable language
  • Second pass is a normal scanner

Example due to Dr. F.K. Zadeck
20
Building Faster Scanners from the DFA
  • Table-driven recognizers waste effort
  • Read (& classify) the next character
  • Find the next state
  • Assign to the state variable
  • Trip through case logic in action()
  • Branch back to the top
  • We can do better
  • Encode state actions in the code
  • Do transition tests locally
  • Generate ugly, spaghetti-like code
  • Takes (many) fewer operations per input character

char ← next character
state ← s0
call action(state, char)
while (char ≠ eof)
    state ← δ(state, char)
    call action(state, char)
    char ← next character
if state ∈ final states then report acceptance
else report failure
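A runnable Python version of this skeleton, for the pattern r [0-9]+ used on the next slide (a sketch; the dictionary delta stands in for the transition table δ):

def table_scan(s):
    delta = {('s0', 'r'): 's1'}
    for d in '0123456789':
        delta[('s1', d)] = 's2'     # first digit
        delta[('s2', d)] = 's2'     # remaining digits
    state = 's0'
    for char in s:                  # the skeleton's while loop
        state = delta.get((state, char), 'se')
        if state == 'se':           # dead state: no need to read further
            return False
    return state == 's2'            # accept iff we stop in a final state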
21
Building Faster Scanners from the DFA
  • A direct-coded recognizer for r Digit Digit*
  • Many fewer operations per character
  • Almost no memory operations
  • Even faster with careful use of fall-through
    cases

     goto s0
s0:  word ← Ø
     char ← next character
     if (char = 'r')
         then goto s1
         else goto se
s1:  word ← word + char
     char ← next character
     if ('0' ≤ char ≤ '9')
         then goto s2
         else goto se
s2:  word ← word + char
     char ← next character
     if ('0' ≤ char ≤ '9')
         then goto s2
     else if (char = eof)
         then report success
         else goto se
se:  print error message
     return failure
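The same recognizer straight-lines naturally even in a language without goto. In this Python sketch, the s2 self-loop becomes a while loop and the error paths become early returns:

def direct_scan(s):
    if not s or s[0] != 'r':                 # state s0
        return None                          # goto se
    word, i = 'r', 1
    if i >= len(s) or not s[i].isdigit():    # state s1
        return None
    while i < len(s) and s[i].isdigit():     # state s2, looping on itself
        word += s[i]
        i += 1
    return word if i == len(s) else None     # accept only at end of input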
22
Building Faster Scanners
  • Hashing keywords versus encoding them directly
  • Some (well-known) compilers recognize keywords as
    identifiers and check them in a hash table
  • Encoding keywords in the DFA is a better idea
  • O(1) cost per transition
  • Avoids hash lookup on each identifier
  • It is hard to beat a well-implemented DFA scanner
    (a sketch contrasting the two approaches follows
    below)
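A sketch contrasting the two strategies (the names are illustrative). With hashed lookup, every scanned word pays a table probe before it can be classified:

KEYWORDS = {'if', 'then', 'else', 'while', 'do'}

def classify_hashed(word):
    return 'KEYWORD' if word in KEYWORDS else 'IDENTIFIER'

Folding the keywords into the regular expressions instead means the DFA's accepting state itself says "keyword" vs "identifier", so the O(1)-per-transition scan does the classification for free.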

23
Building Scanners
  • The point
  • All this technology lets us automate scanner
    construction
  • Implementer writes down the regular expressions
  • Scanner generator builds NFA, DFA, minimal DFA,
    and then writes out the (table-driven or
    direct-coded) code
  • This reliably produces fast, robust scanners
  • For most modern language features, this works
  • You should think twice before introducing a
    feature that defeats a DFA-based scanner
  • The ones we've seen (e.g., insignificant blanks,
    non-reserved keywords) have not proven
    particularly useful or long-lasting

24
Some Points of Disagreement with EaC
  • "Table-driven scanners are not fast"
  • EaC doesn't say they are slow; it says you can do
    better
  • Faster code can be generated by embedding the
    scanner in code
  • This was shown for both LR-style parsers and for
    scanners in the 1980s
  • "Hashed lookup of keywords is slow"
  • EaC doesn't say it is slow. It says that the
    effort can be folded into the scanner so that it
    has no extra cost. Compilers like GCC use hash
    lookup. A word must fail in the lookup to be
    classified as an identifier. With collisions in
    the table, this can add up. At any rate, the
    cost is unneeded, since the DFA can do it for
    O(1) cost per character.

26
Parsing
27
The Front End
  • Parser
  • Checks the stream of words and their parts of
    speech (produced by the scanner) for grammatical
    correctness
  • Determines if the input is syntactically well
    formed
  • Guides checking at deeper levels than syntax
  • Builds an IR representation of the code
  • Think of this as the mathematics of diagramming
    sentences

28
The Study of Parsing
  • The process of discovering a derivation for some
    sentence
  • Need a mathematical model of syntax: a grammar G
  • Need an algorithm for testing membership in L(G)
  • Need to keep in mind that our goal is building
    parsers, not studying the mathematics of
    arbitrary languages
  • Roadmap
  • Context-free grammars and derivations
  • Top-down parsing
  • Hand-coded recursive descent parsers
  • Bottom-up parsing
  • Generated LR(1) parsers

29
Specifying Syntax with a Grammar
  • Context-free syntax is specified with a
    context-free grammar
  • SheepNoise → SheepNoise baa
               | baa
  • This CFG defines the set of noises sheep normally
    make
  • It is written in a variant of Backus-Naur form
  • Formally, a grammar is a four-tuple, G =
    (S, N, T, P)
  • S is the start symbol
    (denotes the set of strings in L(G))
  • N is a set of non-terminal symbols
    (syntactic variables)
  • T is a set of terminal symbols
    (words)
  • P is a set of productions or rewrite rules
    (P : N → (N ∪ T)*)     (rendered as data below)
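As an illustration, the SheepNoise grammar renders directly as a four-tuple in Python (a sketch; the dictionary layout is an assumption, not notation from the slides):

G = {
    'S': 'SheepNoise',                               # start symbol
    'N': {'SheepNoise'},                             # non-terminals
    'T': {'baa'},                                    # terminals
    'P': {'SheepNoise': [['SheepNoise', 'baa'],      # productions
                         ['baa']]},
}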

30
Deriving Syntax
  • We can use the SheepNoise grammar to create
    sentences
  • use the productions as rewriting rules (sketched
    below)

And so on ...
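A sketch of rewriting with those productions, using the G defined above: repeatedly replace the leftmost non-terminal with the right-hand side of one of its productions.

import random

def derive(G, steps=3):
    form = [G['S']]
    for _ in range(steps):
        # find the leftmost non-terminal in the sentential form
        i = next((j for j, s in enumerate(form) if s in G['N']), None)
        if i is None:
            break                         # all terminals: a sentence
        form[i:i+1] = random.choice(G['P'][form[i]])
        print(' '.join(form))

derive(G)   # e.g. SheepNoise baa / SheepNoise baa baa / baa baa baa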
31
A More Useful Grammar
  • To explore the uses of CFGs, we need a more
    complex grammar
  • Such a sequence of rewrites is called a
    derivation
  • Process of discovering a derivation is called
    parsing

We denote this derivation: Expr →* id - num * id
32
Derivations
  • At each step, we choose a non-terminal to replace
  • Different choices can lead to different
    derivations
  • Two derivations are of interest
  • Leftmost derivation: replace the leftmost NT at
    each step
  • Rightmost derivation: replace the rightmost NT at
    each step
  • These are the two systematic derivations
  • (We don't care about randomly-ordered
    derivations!)
  • The example on the preceding slide was a leftmost
    derivation
  • Of course, there is also a rightmost derivation
  • Interestingly, it turns out to be different

33
The Two Derivations for x - 2 * y
  • In both cases, Expr →* id - num * id
  • The two derivations produce different parse trees
  • The parse trees imply different evaluation
    orders!

Leftmost derivation
Rightmost derivation
34
Derivations and Parse Trees
  • Leftmost derivation

This evaluates as x - ( 2 * y )
35
Derivations and Parse Trees
  • Rightmost derivation

This evaluates as ( x - 2 ) * y
36
Derivations and Precedence
  • These two derivations point out a problem with
    the grammar
  • It has no notion of precedence, or implied order
    of evaluation
  • To add precedence
  • Create a non-terminal for each level of
    precedence
  • Isolate the corresponding part of the grammar
  • Force the parser to recognize high precedence
    subexpressions first
  • For algebraic expressions
  • Multiplication and division, first
    (level one)
  • Subtraction and addition, next
    (level two)

37
Derivations and Precedence
  • Adding the standard algebraic precedence produces
    the grammar reproduced below
  • This grammar is slightly larger
  • Takes more rewriting to reach
    some of the terminal symbols
  • Encodes expected precedence
  • Produces the same parse tree
    under leftmost & rightmost
    derivations
  • Let's see how it parses x - 2 * y
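The grammar figure did not survive the transcript; the classic two-level formulation (as in EaC) is:

Goal   → Expr
Expr   → Expr + Term | Expr - Term | Term
Term   → Term * Factor | Term / Factor | Factor
Factor → ( Expr ) | num | id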

38
Derivations and Precedence
Its parse tree
The rightmost derivation
This produces x - ( 2 * y ), along with an
appropriate parse tree. Both the leftmost and
rightmost derivations give the same expression,
because the grammar directly encodes the desired
precedence.
39
Ambiguous Grammars
  • Our original expression grammar had other
    problems
  • This grammar allows multiple leftmost derivations
    for x - 2 * y
  • Hard to automate derivation if > 1 choice
  • The grammar is ambiguous

different choice than the first time
40
Two Leftmost Derivations for x - 2 * y
  • The Difference
  • Different productions chosen on the second step
  • Both derivations succeed in producing x - 2 * y

Original choice
New choice
41
Ambiguous Grammars
  • Definitions
  • If a grammar has more than one leftmost
    derivation for a single sentential form, the
    grammar is ambiguous
  • If a grammar has more than one rightmost
    derivation for a single sentential form, the
    grammar is ambiguous
  • The leftmost and rightmost derivations for a
    sentential form may differ, even in an
    unambiguous grammar
  • Classic example: the if-then-else problem
  • Stmt → if Expr then Stmt
         | if Expr then Stmt else Stmt
         | other stmts
  • This ambiguity is entirely grammatical in nature

42
Ambiguity
  • This sentential form has two derivations
  • if Expr1 then if Expr2 then Stmt1 else Stmt2

43
Ambiguity
  • Removing the ambiguity
  • Must rewrite the grammar to avoid generating the
    problem
  • Match each else to the innermost unmatched if
    (the common-sense rule, encoded in the grammar
    below)
  • With this grammar, the example has only one
    derivation
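The rewritten grammar did not survive the transcript either; the standard formulation (as in EaC) distinguishes statements that can end in an unmatched if from those that cannot:

Stmt     → if Expr then Stmt
         | if Expr then WithElse else Stmt
         | other stmts
WithElse → if Expr then WithElse else WithElse
         | other stmts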

44
Ambiguity
  • if Expr1 then if Expr2 then Stmt1 else Stmt2
  • This binds the else controlling Stmt2 to the
    inner if

45
Deeper Ambiguity
  • Ambiguity usually refers to confusion in the CFG
  • Overloading can create deeper ambiguity
  • a = f(17)
  • In many Algol-like languages, f could be either a
    function or a subscripted variable
  • Disambiguating this one requires context
  • Need values of declarations
  • Really an issue of type, not context-free syntax
  • Requires an extra-grammatical solution (not in
    CFG)
  • Must handle these with a different mechanism
  • Step outside grammar rather than use a more
    complex grammar

46
Ambiguity - the Final Word
  • Ambiguity arises from two distinct sources
  • Confusion in the context-free syntax
    (if-then-else)
  • Confusion that requires context to resolve
    (overloading)
  • Resolving ambiguity
  • To remove context-free ambiguity, rewrite the
    grammar
  • To handle context-sensitive ambiguity takes
    cooperation
  • Knowledge of declarations, types, …
  • Accept a superset of L(G) & check it by other
    means
  • This is a language design problem
  • Sometimes, the compiler writer accepts an
    ambiguous grammar
  • Parsing techniques that do the right thing
  • i.e., always select the same derivation