COS 320 Compilers - PowerPoint PPT Presentation

Slides: 70
Provided by: csPrin
1
COS 320 Compilers
  • David Walker

2
The Front End
  • Lexical Analysis: Create sequence of tokens from
    characters (Chap 2)
  • Syntax Analysis: Create abstract syntax tree from
    sequence of tokens (Chap 3)
  • Type Checking: Check program for well-formedness
    constraints

stream of characters → Lexer → stream of tokens → Parser → abstract syntax → Type Checker
3
Parsing with CFGs
  • Context-free grammars are (often) given by BNF
    expressions (Backus-Naur Form)
  • Appel Chap 3.1
  • More powerful than regular expressions
  • Matching parens
  • Nested comments
  • wait, we could do nested comments with ML-LEX!
  • CFGs are good for describing the overall
    syntactic structure of programs.

4
Context-Free Grammars
  • Context-free grammars consist of
  • Set of symbols
  • terminals, which denote token types
  • non-terminals, which denote sets of strings
  • Start symbol
  • Rules
  • left-hand side: non-terminal
  • right-hand side: terminals and/or non-terminals
  • rules explain how to rewrite non-terminals
    (beginning with start symbol) into terminals

symbol → symbol symbol ... symbol
5
Context-Free Grammars
  • A string is in the language of the CFG if and only if
    it is possible to derive that string using the
    following non-deterministic procedure
  • begin with the start symbol
  • while any non-terminals exist, pick a
    non-terminal and rewrite it using a rule
  • stop when all you have left are terminals (and
    check you arrived at the string you were hoping
    to)
  • Parsing is the process of checking that a string
    is in the CFG for your programming language. It
    is usually coupled with creating an abstract
    syntax tree.

6
  • non-terminals: S, E, Elist
  • terminals: ID, NUM, PRINT, +, :=, (, ), ;
  • rules

S → S ; S
S → ID := E
S → PRINT ( Elist )
E → ID
E → NUM
E → E + E
E → ( S , Elist )
Elist → E
Elist → Elist , E
7
  • non-terminals: S, E, Elist
  • terminals: ID, NUM, PRINT, +, :=, (, ), ;
  • rules

1. S → S ; S
2. S → ID := E
3. S → PRINT ( Elist )
4. E → ID
5. E → NUM
6. E → E + E
7. E → ( S , Elist )
8. Elist → E
9. Elist → Elist , E
Derive me!
ID := NUM ; PRINT ( NUM )
8
Derive me!
ID := NUM ; PRINT ( NUM )
S
9
Derive me!
ID := NUM ; PRINT ( NUM )
S ⇒ ID := E
10
Derive me!
ID := NUM ; PRINT ( NUM )
S ⇒ ID := E
oops, can't make progress
11
Derive me!
ID := NUM ; PRINT ( NUM )
S
12
Derive me!
ID := NUM ; PRINT ( NUM )
S ⇒ S ; S
13
Derive me!
ID := NUM ; PRINT ( NUM )
S ⇒ S ; S ⇒ ID := E ; S
14
Derive me!
ID := NUM ; PRINT ( NUM )
S ⇒ S ; S
  ⇒ ID := E ; S
  ⇒ ID := NUM ; S
  ⇒ ID := NUM ; PRINT ( Elist )
  ⇒ ID := NUM ; PRINT ( E )
  ⇒ ID := NUM ; PRINT ( NUM )
15
left-most derivation:
S ⇒ S ; S
  ⇒ ID := E ; S
  ⇒ ID := NUM ; S
  ⇒ ID := NUM ; PRINT ( Elist )
  ⇒ ID := NUM ; PRINT ( E )
  ⇒ ID := NUM ; PRINT ( NUM )
Another way to derive the same string
right-most derivation:
S ⇒ S ; S
  ⇒ S ; PRINT ( Elist )
  ⇒ S ; PRINT ( E )
  ⇒ S ; PRINT ( NUM )
  ⇒ ID := E ; PRINT ( NUM )
  ⇒ ID := NUM ; PRINT ( NUM )
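The two derivations can be replayed mechanically. The following is a minimal Python sketch (the slides themselves use ML), assuming the rule numbering 1 to 9 from the grammar slide; the names RULES and derive are illustrative and not part of the course material.

```python
# Grammar rules, numbered as on the slides: number -> (left-hand side, right-hand side).
RULES = {
    1: ("S", ["S", ";", "S"]),
    2: ("S", ["ID", ":=", "E"]),
    3: ("S", ["PRINT", "(", "Elist", ")"]),
    4: ("E", ["ID"]),
    5: ("E", ["NUM"]),
    6: ("E", ["E", "+", "E"]),
    7: ("E", ["(", "S", ",", "Elist", ")"]),
    8: ("Elist", ["E"]),
    9: ("Elist", ["Elist", ",", "E"]),
}
NONTERMINALS = {"S", "E", "Elist"}


def derive(rule_numbers, leftmost=True):
    """Replay a derivation from S, always rewriting the leftmost (or
    rightmost) non-terminal. Returns the list of sentential forms."""
    form = ["S"]
    steps = [form]
    for n in rule_numbers:
        lhs, rhs = RULES[n]
        positions = [i for i, sym in enumerate(form) if sym in NONTERMINALS]
        i = positions[0] if leftmost else positions[-1]
        if form[i] != lhs:
            raise ValueError(f"rule {n} does not rewrite {form[i]}")
        form = form[:i] + rhs + form[i + 1:]
        steps.append(form)
    return steps


left = derive([1, 2, 5, 3, 8, 5], leftmost=True)
right = derive([1, 3, 8, 5, 2, 5], leftmost=False)
for form in left:
    print(" ".join(form))
```

Both rule sequences end in the same token string, which is why the two derivations correspond to a single parse tree.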
16
Parse Trees
  • Representing derivations as trees
  • useful in compilers: parse trees correspond
    quite closely (but not exactly) with the abstract
    syntax trees we're trying to generate
  • difference: abstract syntax vs concrete (parse)
    syntax
  • each internal node is labeled with a non-terminal
  • each leaf node is labeled with a terminal
  • each use of a rule in a derivation explains how
    to generate children in the parse tree from the
    parents

17
Parse Trees
  • Example

S ⇒ S ; S
  ⇒ ID := E ; S
  ⇒ ID := NUM ; S
  ⇒ ID := NUM ; PRINT ( Elist )
  ⇒ ID := NUM ; PRINT ( E )
  ⇒ ID := NUM ; PRINT ( NUM )

S
├─ S
│  ├─ ID
│  ├─ :=
│  └─ E
│     └─ NUM
├─ ;
└─ S
   ├─ PRINT
   ├─ (
   ├─ Elist
   │  └─ E
   │     └─ NUM
   └─ )
18
Parse Trees
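  • Example: 2 derivations, but 1 tree

S ⇒ S ; S
  ⇒ ID := E ; S
  ⇒ ID := NUM ; S
  ⇒ ID := NUM ; PRINT ( Elist )
  ⇒ ID := NUM ; PRINT ( E )
  ⇒ ID := NUM ; PRINT ( NUM )

S ⇒ S ; S
  ⇒ S ; PRINT ( Elist )
  ⇒ S ; PRINT ( E )
  ⇒ S ; PRINT ( NUM )
  ⇒ ID := E ; PRINT ( NUM )
  ⇒ ID := NUM ; PRINT ( NUM )

S
├─ S
│  ├─ ID
│  ├─ :=
│  └─ E
│     └─ NUM
├─ ;
└─ S
   ├─ PRINT
   ├─ (
   ├─ Elist
   │  └─ E
   │     └─ NUM
   └─ )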
19
Parse Trees
  • parse trees have meaning.
  • order of children, nesting of subtrees is
    significant

[Figure: two parse trees built from the same symbols (S, E, L, ID, PRINT, NUM); the tree layout was not preserved]
20
Ambiguous Grammars
  • a grammar is ambiguous if the same sequence of
    tokens can give rise to two or more parse trees

21
Ambiguous Grammars
characters: 4 + 5 * 6
tokens: NUM(4) PLUS NUM(5) MULT NUM(6)
non-terminals: E
terminals: ID NUM PLUS MULT
E → ID | NUM | E + E | E * E
I like using this notation where I avoid
repeating "E →"
[Figure: one parse tree for NUM(4) PLUS NUM(5) MULT NUM(6); the tree layout was not preserved]
22
Ambiguous Grammars
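characters: 4 + 5 * 6
tokens: NUM(4) PLUS NUM(5) MULT NUM(6)
non-terminals: E
terminals: ID NUM PLUS MULT
E → ID | NUM | E + E | E * E
Two parse trees for the same tokens (written with brackets for subtrees):
E( E(NUM(4)) + E( E(NUM(5)) * E(NUM(6)) ) )
E( E( E(NUM(4)) + E(NUM(5)) ) * E(NUM(6)) )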
23
Ambiguous Grammars
  • problem: compilers use parse trees to interpret
    the meaning of parsed expressions
  • different parse trees have different meanings
  • e.g. (4 + 5) * 6 is not 4 + (5 * 6)
  • languages with ambiguous grammars are DISASTROUS.
    The meaning of programs isn't well-defined! You
    can't tell what your program might do!
  • solution: rewrite grammar to eliminate ambiguity
  • fold precedence rules into grammar to
    disambiguate
  • fold associativity rules into grammar to
    disambiguate
  • other tricks as well
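To make the problem concrete, here is a small Python sketch (an illustration, not part of the slides) that evaluates the two parse trees for 4 + 5 * 6. Trees are encoded as nested tuples (operator, left, right); that encoding and the function name evaluate are assumptions made for this example.

```python
def evaluate(tree):
    """Evaluate a tree that is either an int or a tuple (operator, left, right)."""
    if isinstance(tree, int):
        return tree
    op, left, right = tree
    if op == "+":
        return evaluate(left) + evaluate(right)
    if op == "*":
        return evaluate(left) * evaluate(right)
    raise ValueError(f"unknown operator {op!r}")


plus_at_root = ("+", 4, ("*", 5, 6))    # 4 + (5 * 6)
times_at_root = ("*", ("+", 4, 5), 6)   # (4 + 5) * 6
print(evaluate(plus_at_root), evaluate(times_at_root))  # prints "34 54"
```

One common way to fold precedence and left associativity into the grammar is to stratify it, for example E → E + T | T, T → T * F | F, F → ID | NUM, so that only the first tree can be built.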

24
Building Parsers
  • In theory classes, you might have learned about
    general mechanisms for parsing all CFGs
  • algorithms for parsing all CFGs are expensive
    (e.g. CYK and Earley take cubic time in the worst case)
  • to compile 1/10/100 million-line applications,
    compilers must be fast.
  • even for 10 thousand-line apps, speed is nice
  • sometimes 1/3 of compilation time is spent in
    parsing
  • compiler writers have developed specialized
    algorithms for parsing the kinds of CFGs that you
    need to build effective programming languages
  • LL(k) and LR(k) grammars can be parsed efficiently.

25
Recursive Descent Parsing
  • Recursive Descent Parsing (Appel Chap 3.2)
  • aka predictive parsing, top-down parsing
  • simple, efficient
  • can be coded by hand in ML quickly
  • parses many, but not all CFGs
  • parses LL(1) grammars
  • Left-to-right parse, Leftmost derivation, 1
    symbol lookahead
  • key ideas
  • one recursive function for each non-terminal
  • each production becomes one clause in the function

26
non-terminals: S, E, L
terminals: NUM, IF, THEN, ELSE, BEGIN, END, PRINT, ;, =
rules:
1. S → IF E THEN S ELSE S
2. S → BEGIN S L
3. S → PRINT E
4. L → END
5. L → ; S L
6. E → NUM = NUM
27
non-terminals: S, E, L
terminals: NUM, IF, THEN, ELSE, BEGIN, END, PRINT, ;, =
rules:
1. S → IF E THEN S ELSE S
2. S → BEGIN S L
3. S → PRINT E
4. L → END
5. L → ; S L
6. E → NUM = NUM
Step 1: Represent the tokens
datatype token = NUM | IF | THEN | ELSE | BEGIN | END | PRINT | SEMI | EQ
Step 2: build infrastructure for reading tokens
from lexing stream
val tok = ref (getToken ())
fun advance () = tok := getToken ()
fun eat t = if (!tok = t) then advance () else error ()
29
non-terminals: S, E, L
terminals: NUM, IF, THEN, ELSE, BEGIN, END, PRINT, ;, =
rules:
1. S → IF E THEN S ELSE S
2. S → BEGIN S L
3. S → PRINT E
4. L → END
5. L → ; S L
6. E → NUM = NUM
datatype token = NUM | IF | THEN | ELSE | BEGIN | END | PRINT | SEMI | EQ
val tok = ref (getToken ())
fun advance () = tok := getToken ()
fun eat t = if (!tok = t) then advance () else error ()
Step 3: write parser => one function per
non-terminal, one clause per rule
fun S () = case !tok of
    IF => (eat IF; E (); eat THEN; S (); eat ELSE; S ())
  | BEGIN => (eat BEGIN; S (); L ())
  | PRINT => (eat PRINT; E ())
and L () = case !tok of
    END => eat END
  | SEMI => (eat SEMI; S (); L ())
and E () = (eat NUM; eat EQ; eat NUM)
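For readers who want to run the three steps, the same parser can be sketched in Python (the slides use ML). This is an illustration under assumptions: tokens are plain strings, an "EOF" sentinel is appended to the input, and the names Parser, ParseError and accepts are invented for the example.

```python
class ParseError(Exception):
    pass


class Parser:
    def __init__(self, tokens):
        # "EOF" is an assumed end-of-input sentinel, not a token from the slides.
        self.tokens = list(tokens) + ["EOF"]
        self.pos = 0

    @property
    def tok(self):
        return self.tokens[self.pos]

    def eat(self, expected):
        if self.tok != expected:
            raise ParseError(f"expected {expected}, found {self.tok}")
        self.pos += 1

    def S(self):
        if self.tok == "IF":          # 1. S -> IF E THEN S ELSE S
            self.eat("IF")
            self.E()
            self.eat("THEN")
            self.S()
            self.eat("ELSE")
            self.S()
        elif self.tok == "BEGIN":     # 2. S -> BEGIN S L
            self.eat("BEGIN")
            self.S()
            self.L()
        elif self.tok == "PRINT":     # 3. S -> PRINT E
            self.eat("PRINT")
            self.E()
        else:
            raise ParseError(f"no rule for S on {self.tok}")

    def L(self):
        if self.tok == "END":         # 4. L -> END
            self.eat("END")
        elif self.tok == "SEMI":      # 5. L -> ; S L
            self.eat("SEMI")
            self.S()
            self.L()
        else:
            raise ParseError(f"no rule for L on {self.tok}")

    def E(self):                      # 6. E -> NUM = NUM
        self.eat("NUM")
        self.eat("EQ")
        self.eat("NUM")


def accepts(tokens):
    """Return True if the token list is a complete statement S."""
    parser = Parser(tokens)
    try:
        parser.S()
        parser.eat("EOF")
        return True
    except ParseError:
        return False
```

For example, accepts("BEGIN PRINT NUM EQ NUM SEMI PRINT NUM EQ NUM END".split()) succeeds, while accepts("PRINT NUM".split()) fails because E requires NUM = NUM. Each method chooses its clause by looking only at the current token, which is the one-symbol lookahead of LL(1).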
30
non-terminals: A, S, E, L
rules:
1. A → S EOF
2. S → ID := E
3. S → PRINT ( L )
4. E → ID
5. E → NUM
6. L → E
7. L → L , E
fun A () = (S (); eat EOF)
and S () = case !tok of
    ID => (eat ID; eat ASSIGN; E ())
  | PRINT => (eat PRINT; eat LPAREN; L (); eat RPAREN)
and E () = case !tok of
    ID => eat ID
  | NUM => eat NUM
and L () = case !tok of
    ID => ???
  | NUM => ???
31
problem
  • predictive parsing only works for grammars where
    the first terminal symbol of each sub-expression
    provides enough information to choose which
    production to use
  • LL(1)
  • if !tok = ID, the parser cannot determine which
    production to use

6. L → E       (E could be ID)
7. L → L , E   (L could be E could be ID)
32
solution
  • eliminate left-recursion
  • rewrite the grammar so it parses the same
    language but the rules are different

Original grammar:
A → S EOF
S → ID := E | PRINT ( L )
E → ID | NUM
L → E | L , E

Rewritten grammar:
A → S EOF
S → ID := E | PRINT ( L )
E → ID | NUM
L → E M
M → , E M | ε
33
eliminating left-recursion in general
  • Original grammar form
    X → base
    X → X repeat
    Strings: base repeat repeat ...
  • Transformed grammar
    X → base Xnew
    Xnew → repeat Xnew
    Xnew → ε
    Strings: base repeat repeat ...
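The transformation can be written as a short function. This is a Python sketch for immediate left recursion only, assuming a production is a list of symbols and the empty list stands for ε; the function name and the "new" suffix are choices made for this illustration.

```python
def eliminate_left_recursion(nonterminal, productions):
    """Rewrite  X -> base | X repeat  as
    X -> base Xnew,  Xnew -> repeat Xnew | epsilon  (epsilon is the empty list)."""
    new_name = nonterminal + "new"
    # Productions of the form X -> X repeat: keep only the "repeat" part.
    repeats = [rhs[1:] for rhs in productions if rhs[:1] == [nonterminal]]
    # All other productions are the "base" cases.
    bases = [rhs for rhs in productions if rhs[:1] != [nonterminal]]
    if not repeats:
        return {nonterminal: productions}   # nothing to do
    return {
        nonterminal: [base + [new_name] for base in bases],
        new_name: [rep + [new_name] for rep in repeats] + [[]],
    }


result = eliminate_left_recursion("L", [["E"], ["L", ",", "E"]])
print(result)
```

Applied to L → E | L , E, this yields L → E Lnew and Lnew → , E Lnew | ε, the same shape as the L → E M, M → , E M | ε rewrite shown earlier.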
34
Recursive Descent Parsing
  • Unfortunately, left factoring doesn't always work
  • Questions
  • how do we know when we can parse grammars using
    recursive descent?
  • Is there an algorithm for generating such parsers
    automatically?

35
Constructing RD Parsers
  • To construct an RD parser, we need to know what
    rule to apply when
  • we have seen a non-terminal X
  • we see the next terminal a in input
  • We apply rule X → s when
  • a is the first symbol that can be generated by
    string s, OR
  • s can derive the empty string (is nullable) and a
    is the first symbol in any string that can follow
    X

37
Constructing Predictive Parsers
1. Y 2. bb
5. Z d
3. X c 4. Y Z
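[Table: one row per non-terminal seen, one column per next terminal; each entry gives the rule to apply. The table contents were not preserved.]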
41
Constructing Predictive Parsers
  • in general, must compute
  • for each production X → s, must determine if s
    can derive the empty string.
  • if yes, X ∈ Nullable
  • for each production X → s, must determine the
    set of all first terminals Q derivable from s
  • Q ⊆ First(X)
  • for each non-terminal X, determine all terminal
    symbols Q that immediately follow X
  • Q ⊆ Follow(X)

42
Iterative Analysis
  • Many compiler algorithms are iterative
    techniques.
  • Iterative analysis applies when
  • must compute a set of objects with some property
    P
  • P is defined inductively, i.e., there are
  • base cases: objects o1, o2 obviously have
    property P
  • inductive cases: if certain objects (o3, o4)
    have property P, this implies other objects (f
    o3, f o4) have property P
  • The number of objects in the set is finite
  • or we can represent infinite collections using
    some finite notation and we can find effective
    termination conditions

43
Iterative Analysis
  • general form
  • initialize set S with base cases
  • apply inductive rules over and over until you
    reach a fixed point
  • a fixed point is a set that does not change when
    you apply an inductive rule
  • Nullable, First and Follow sets can be determined
    through iteration
  • many program optimizations use iteration
  • worst-case complexity is bad
  • average-case complexity is good: iteration
    usually terminates in a couple of rounds
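The general form fits in a few lines. The sketch below is in Python and is only an illustration: fixed_point and its arguments are names invented here, and the toy property ("even numbers below 10", with base case 0 and inductive rule n ↦ n + 2) is an assumption chosen to keep the example tiny.

```python
def fixed_point(base, step):
    """Start from the base cases and apply the inductive rules (step)
    until a round adds nothing new."""
    current = set(base)
    while True:
        new = current | step(current)
        if new == current:        # no change: fixed point reached
            return current
        current = new


def add_two(known):
    # Inductive rule: if n has the property and n + 2 < 10, so does n + 2.
    return {n + 2 for n in known if n + 2 < 10}


evens = fixed_point({0}, add_two)
print(sorted(evens))  # prints "[0, 2, 4, 6, 8]"
```

Termination relies on the set being finite: each round either adds at least one element or stops.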

44
Computing Nullable Sets
  • Non-terminal X is Nullable only if the following
    constraints are satisfied (computed using
    iterative analysis)
  • base case
  • if (X → ε) then X is Nullable
  • inductive case
  • if (X → ABC...) and A, B, C, ... are all
    Nullable then X is Nullable
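A minimal Python sketch of this computation follows. It assumes a grammar is a list of (left-hand side, right-hand side) pairs with the empty list standing for ε; the example grammar is one reading of the grammar used later in these slides (Z → X Y Z | d, Y → c | ε, X → a | b Y e), whose arrows and ε were lost in this transcript.

```python
# Assumed reconstruction of the slides' example grammar; [] stands for epsilon.
GRAMMAR = [
    ("Z", ["X", "Y", "Z"]), ("Z", ["d"]),
    ("Y", ["c"]), ("Y", []),
    ("X", ["a"]), ("X", ["b", "Y", "e"]),
]


def nullable_set(grammar):
    """Iterate to a fixed point: X is nullable if some rule X -> s has
    every symbol of s nullable (vacuously true when s is empty)."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in grammar:
            if lhs not in nullable and all(sym in nullable for sym in rhs):
                nullable.add(lhs)
                changed = True
    return nullable


print(nullable_set(GRAMMAR))  # prints "{'Y'}"
```

The base case needs no special handling: for a rule X → ε the right-hand side is empty, so the "all symbols nullable" test succeeds immediately.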

45
Computing First Sets
  • First(X) is computed iteratively
  • base case
  • if T is a terminal symbol then First (T) = { T }
  • inductive case
  • if X is a non-terminal and (X → ABC...) then
  • First (X) = First (X) ∪ First (ABC...)
  • where First(ABC...) = F1 ∪ F2 ∪ F3 ∪ ... and
  • F1 = First (A)
  • F2 = First (B), if A is Nullable
  • F3 = First (C), if A is Nullable and B is Nullable
  • ...
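These rules translate directly into an iteration. The Python sketch below takes the Nullable set as an input rather than recomputing it; the grammar encoding (a list of (lhs, rhs) pairs, [] for ε), the example grammar (a reading of the grammar used later in the slides), and the function names are assumptions of this illustration.

```python
# Assumed reconstruction of the slides' example grammar; [] stands for epsilon.
GRAMMAR = [
    ("Z", ["X", "Y", "Z"]), ("Z", ["d"]),
    ("Y", ["c"]), ("Y", []),
    ("X", ["a"]), ("X", ["b", "Y", "e"]),
]


def first_of_string(symbols, first, nullable):
    """First(ABC...) = F1 U F2 U ..., stopping at the first non-nullable symbol."""
    result = set()
    for sym in symbols:
        # A symbol without an entry in `first` is a terminal: First(T) = {T}.
        result |= first.get(sym, {sym})
        if sym not in nullable:
            break
    return result


def first_sets(grammar, nullable):
    first = {lhs: set() for lhs, _ in grammar}
    changed = True
    while changed:                      # repeat until a round changes nothing
        changed = False
        for lhs, rhs in grammar:
            new = first_of_string(rhs, first, nullable)
            if not new <= first[lhs]:
                first[lhs] |= new
                changed = True
    return first


FIRST = first_sets(GRAMMAR, nullable={"Y"})
```

For this grammar the result is First(X) = {a, b}, First(Y) = {c} and First(Z) = {a, b, d}. With the rules visited in the order listed, First(Z) only picks up a and b in the second round, after First(X) has been filled in.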

46
Computing Follow Sets
  • Follow(X) is computed iteratively
  • base case
  • initially, we assume nothing in particular
    follows X
  • (Follow (X) is initially { })
  • inductive case
  • if (Y → s1 X s2) for any strings s1, s2 then
  • Follow (X) = First (s2) ∪ Follow (X)
  • if (Y → s1 X s2) for any strings s1, s2 then
  • Follow (X) = Follow(Y) ∪ Follow (X), if s2 is
    Nullable
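The two inductive rules can be sketched in Python as follows. The Nullable and First sets are passed in as inputs; the literal values used in the example were worked out by hand for an assumed reading of the grammar used later in the slides, and no end-of-input marker is added, so Follow(Z) stays empty.

```python
# Assumed reconstruction of the slides' example grammar; [] stands for epsilon.
GRAMMAR = [
    ("Z", ["X", "Y", "Z"]), ("Z", ["d"]),
    ("Y", ["c"]), ("Y", []),
    ("X", ["a"]), ("X", ["b", "Y", "e"]),
]
NULLABLE = {"Y"}
FIRST = {"X": {"a", "b"}, "Y": {"c"}, "Z": {"a", "b", "d"}}


def follow_sets(grammar, nullable, first):
    follow = {lhs: set() for lhs, _ in grammar}   # base case: all empty

    def first_of(symbols):
        result = set()
        for sym in symbols:
            result |= first.get(sym, {sym})       # terminal: First(T) = {T}
            if sym not in nullable:
                break
        return result

    changed = True
    while changed:
        changed = False
        for lhs, rhs in grammar:                  # rule  lhs -> s1 X s2
            for i, sym in enumerate(rhs):
                if sym not in follow:             # terminals have no Follow set
                    continue
                rest = rhs[i + 1:]                # this is s2
                new = first_of(rest)
                if all(s in nullable for s in rest):
                    new |= follow[lhs]            # s2 nullable: add Follow(lhs)
                if not new <= follow[sym]:
                    follow[sym] |= new
                    changed = True
    return follow


FOLLOW = follow_sets(GRAMMAR, NULLABLE, FIRST)
```

For this grammar the sketch gives Follow(X) = {a, b, c, d}, Follow(Y) = {a, b, d, e} and Follow(Z) = { }.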

47
building a predictive parser
Y → c        Y → ε
X → a        X → b Y e
Z → X Y Z    Z → d
48
building a predictive parser
Y → c        Y → ε
X → a        X → b Y e
Z → X Y Z    Z → d
base case
49
building a predictive parser
Y → c        Y → ε
X → a        X → b Y e
Z → X Y Z    Z → d
after one round of induction, we realize we have
reached a fixed point
50
building a predictive parser
Y → c        Y → ε
X → a        X → b Y e
Z → X Y Z    Z → d
base case
51
building a predictive parser
Y → c        Y → ε
X → a        X → b Y e
Z → X Y Z    Z → d
after one round of induction, no fixed point
52
building a predictive parser
Y → c        Y → ε
X → a        X → b Y e
Z → X Y Z    Z → d
after two rounds of induction, no more changes
=> fixed point
53
building a predictive parser
Y → c        Y → ε
X → a        X → b Y e
Z → X Y Z    Z → d
base case
54
building a predictive parser
Y → c        Y → ε
X → a        X → b Y e
Z → X Y Z    Z → d
after one round of induction, no fixed point
55
building a predictive parser
Y → c        Y → ε
X → a        X → b Y e
Z → X Y Z    Z → d
after two rounds of induction, fixed point (but
notice, computing Follow(X) before Follow(Y)
would have required a 3rd round)
56
Grammar
Z → X Y Z    Z → d
Y → c        Y → ε
X → a        X → b Y e
Computed Sets
  • if T ∈ First(s) then
  • enter (X → s) in row X, col T
  • if s is Nullable and T ∈ Follow(X)
  • enter (X → s) in row X, col T

Build parsing table where row X, col T tells
parser which clause to execute in function X with
next-token T
62
Grammar
Computed Sets
Z → X Y Z    Z → d
Y → c        Y → ε
X → a        X → b Y e
What are the blanks?
63
Grammar
Computed Sets
Z → X Y Z    Z → d
Y → c        Y → ε
X → a        X → b Y e
What are the blanks? --> syntax errors
64
Grammar
Computed Sets
Z → X Y Z    Z → d
Y → c        Y → ε
X → a        X → b Y e
Is it possible to put 2 grammar rules in the same
box?
65
Grammar
Computed Sets
Z → X Y Z    Z → d    Z → d e
Y → c        Y → ε
X → a        X → b Y e
Is it possible to put 2 grammar rules in the same
box?
66
predictive parsing tables
  • if a predictive parsing table constructed this
    way contains no duplicate entries, the grammar is
    called LL(1)
  • Left-to-right parse, Left-most derivation, 1
    symbol lookahead
  • if not, the grammar is not LL(1)
  • in an LL(k) parsing table, columns include every
    k-length sequence of terminals
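The table construction and the duplicate-entry check can be sketched together in Python. The grammar encoding, the function names, and the literal Nullable, First and Follow sets below are assumptions of this illustration; the sets were worked out by hand for a reading of the example grammar and are unchanged when the rule Z → d e is added.

```python
# Assumed reconstruction of the slides' example grammar; [] stands for epsilon.
GRAMMAR = [
    ("Z", ["X", "Y", "Z"]), ("Z", ["d"]),
    ("Y", ["c"]), ("Y", []),
    ("X", ["a"]), ("X", ["b", "Y", "e"]),
]
NULLABLE = {"Y"}
FIRST = {"X": {"a", "b"}, "Y": {"c"}, "Z": {"a", "b", "d"}}
FOLLOW = {"X": {"a", "b", "c", "d"}, "Y": {"a", "b", "d", "e"}, "Z": set()}


def build_table(grammar, nullable, first, follow):
    """Map (non-terminal, next terminal) to the list of rules entered there."""
    table = {}
    for lhs, rhs in grammar:
        lookaheads = set()
        rhs_nullable = True
        for sym in rhs:
            lookaheads |= first.get(sym, {sym})   # terminal: First(T) = {T}
            if sym not in nullable:
                rhs_nullable = False
                break
        if rhs_nullable:                          # s nullable: use Follow(X)
            lookaheads |= follow[lhs]
        for terminal in lookaheads:
            table.setdefault((lhs, terminal), []).append(rhs)
    return table


def conflicts(table):
    """Cells holding more than one rule; empty means the grammar is LL(1)."""
    return {cell: rules for cell, rules in table.items() if len(rules) > 1}


ok = conflicts(build_table(GRAMMAR, NULLABLE, FIRST, FOLLOW))
bad = conflicts(build_table(GRAMMAR + [("Z", ["d", "e"])], NULLABLE, FIRST, FOLLOW))
print(ok, bad)
```

With the original rules there are no conflicts. Once Z → d e is added, row Z, column d holds both Z → d and Z → d e, so the grammar is no longer LL(1). Missing cells correspond to the blanks in the table, i.e. syntax errors.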

67
another trick
  • Previously, we saw that grammars with
    left-recursion were problematic, but could be
    transformed into LL(1) in some cases
  • the example non-LL(1) grammar we just saw
  • how do we fix it?

Z → X Y Z    Z → d    Z → d e
Y → c        Y → ε
X → a        X → b Y e
68
another trick
  • Previously, we saw that grammars with
    left-recursion were problematic, but could be
    transformed into LL(1) in some cases
  • the example non-LL(1) grammar we just saw
  • solution here is left-factoring

Z → X Y Z    Z → d    Z → d e
Y → c        Y → ε
X → a        X → b Y e
becomes
Z → X Y Z    Z → d W
Y → c        Y → ε
X → a        X → b Y e
W → ε        W → e
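Left-factoring can also be mechanized. The Python sketch below handles the simple case needed here: one group of productions for a single non-terminal that share a leading prefix. The function name, the list encoding of productions ([] for ε), and the single-group restriction are assumptions of this illustration rather than anything stated in the slides.

```python
def left_factor(nonterminal, productions, new_name):
    """Pull the common prefix of productions that start with the same symbol
    into one rule, moving the differing tails to a new non-terminal."""
    groups = {}
    for rhs in productions:
        groups.setdefault(rhs[0] if rhs else None, []).append(rhs)

    result = {nonterminal: []}
    for head, group in groups.items():
        if head is None or len(group) == 1:
            result[nonterminal].extend(group)     # nothing to factor
            continue
        if new_name in result:
            raise ValueError("sketch supports only one group to factor")
        # Longest prefix shared by every production in the group.
        prefix = group[0]
        for rhs in group[1:]:
            n = 0
            while n < min(len(prefix), len(rhs)) and prefix[n] == rhs[n]:
                n += 1
            prefix = prefix[:n]
        result[nonterminal].append(prefix + [new_name])
        result[new_name] = [rhs[len(prefix):] for rhs in group]
    return result


factored = left_factor("Z", [["X", "Y", "Z"], ["d"], ["d", "e"]], "W")
print(factored)
```

For the Z rules this produces Z → X Y Z | d W and W → ε | e, matching the rewritten grammar: the choice between d and d e is postponed until after d has been consumed.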
69
summary
  • CFGs are good at specifying programming language
    structure
  • parsing general CFGs is expensive so we define
    parsers for simple classes of CFG
  • LL(k), LR(k)
  • we can build a recursive descent parser for LL(k)
    grammars by
  • computing nullable, first and follow sets
  • constructing a parse table from the sets
  • checking for duplicate entries, which indicates
    failure
  • creating an ML program from the parse table
  • if parser construction fails we can
  • rewrite the grammar (left factoring, eliminating
    left recursion) and try again
  • try to build a parser using some other method