Title: CSCI 435 Compiler Design
1CSCI 435 Compiler Design
- Week 3 Class 2
- Section 2.2 From Tokens to Syntax Tree through Section
2.2.4.1 LL(1) Parsing (pp. 110-126)
- Ray Schneider
2Topics of the Day
- Tokens to Syntax Tree
- Parsing Methods
- Error Detection and Error Recovery
- Top Down Parsing
3Tokens to Syntax Tree
- Two ways of parsing
- TOP DOWN and BOTTOM UP
- Top Down
- 1) written by hand or 2) generated automatically
- Bottom Up
- 3) can only be generated
- In all 3 cases the syntax structure is specified using
context-free grammars
4Importance of Grammars
- 1) Imposes a structure on the linear sequence of
tokens and a framework for erecting semantics on
the nodes of the structure
- 2) Allows automatic construction of parsers
through the field of formal languages
- 3) Helps create syntactically correct programs
and provide detailed answers about syntax
5Two Ways to do Parsing
- the LL Method: deterministic left-to-right
top-down
- the LR and LALR Methods: deterministic
left-to-right bottom-up
- Left-to-right
- means the sequence of tokens (program text) is
processed from left to right, one token at a time
- Deterministic
- No searching (ideally): each token processed leads
one step closer to the final construction of the
syntax tree, which implies LINEAR TIME
- Only work for restricted classes of grammars
- Resulting grammars for deterministic parsers are
guaranteed to be non-ambiguous
- Real grammars don't always cooperate, so
transformation methods are needed to bring them
into line.
- Non-ambiguous means either exactly one syntax tree
is generated if the program is syntactically correct,
or the program contains errors.
6Parsing Methods
- Constructs the syntax tree for a given sequence
of tokens, i.e. a tree of nodes labeled with
grammar symbols, such that:
- Leaf nodes are labeled with terminals
- Inner nodes are labeled with non-terminals
- TOP NODE is labeled with the Start Symbol
- Children of an inner node labeled N correspond to
the members of an alternative of N, in the same
order as they occur in that alternative
- Terminals labeling the leaf nodes correspond to
the sequence of tokens, in the same order as they
occur in the input.
7Top Down or Bottom Up, i.e. Pre-order or Post-order
- Parsing Methods are either Top Down or
Bottom Up, depending on the order in which the nodes
of the syntax tree are constructed
TREE TRAVERSAL
- Pre-Order: visit node N and then N's sub-trees in
left-to-right order.
- Post-Order: visit N's sub-trees in left-to-right
order and then visit node N.
TERMS
- visiting a node: doing something to the node in
support of the algorithm that motivates the traversal.
- traversing a node: visiting that node and
traversing its sub-trees in some order.
- traversing a tree: traversing the top node, which
will recursively traverse the whole tree.
- Traversing belongs to the control mechanism.
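The two traversal orders can be made concrete with a small sketch. The node type, the function names and the choice of "appending the label to a buffer" as the visit action are all assumptions made for illustration; they do not come from the course text.

```c
#include <string.h>

/* Hypothetical node type: a label and at most two sub-trees. */
struct node {
    const char *label;
    const struct node *left;
    const struct node *right;
};

/* "Visiting" a node here means appending its label to the output buffer.
   The caller must supply a buffer large enough for all labels. */
static void visit(const struct node *n, char *out) {
    strcat(out, n->label);
}

/* Pre-order: visit node N, then traverse N's sub-trees left to right. */
void traverse_pre_order(const struct node *n, char *out) {
    if (n == NULL) return;
    visit(n, out);
    traverse_pre_order(n->left, out);
    traverse_pre_order(n->right, out);
}

/* Post-order: traverse N's sub-trees left to right, then visit node N. */
void traverse_post_order(const struct node *n, char *out) {
    if (n == NULL) return;
    traverse_post_order(n->left, out);
    traverse_post_order(n->right, out);
    visit(n, out);
}
```

For a top node a with children b and c, the pre-order traversal produces "abc" (the order in which a top-down parser creates nodes) and the post-order traversal produces "bca" (the order in which a bottom-up parser creates them).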
8Top Down Parser
- construct top node
- from the top node, construct its children in the
order of the chosen alternative
- determine correct alternative
- proceed down until one reaches a leftmost
terminal
- terminal then matches first token
9Bottom Up Parser
- constructs nodes in post-order
- constructs a node only when all children have
been constructed
- 1st node constructed is the top of the first
complete sub-tree it meets going left to right
through the input
10Error Detection and Error Recovery
- an error is detected when the construction of the
syntax tree fails
- since the tree is built by parsing methods which
read the tokens from left to right, failure occurs
at a SPECIFIC TOKEN, which raises two questions:
- What error message to give to the user? and
- Whether and how to proceed after the error?
- ex. x = a(p+q(-b(r-s)
- position of detection may not reflect position of
error
- We have to do error recovery to give users some
idea of how many errors there are; two strategies:
- error correction: patch and continue
- non-correcting error recovery: discard and
continue with a suffix grammar; yields reliable
error detection but is difficult to implement and
rarely found in parser generators.
[Figure label: point of detection]
11Manual Creation of a Top Down Parser
- Given non-terminal N and token t at position p in
the input: (N, t, p)
- Top Down parser must decide:
- Which alternative of N must be applied to obtain
a sub-tree headed by N with the correct sub-tree
at position p?
- How do you tell that a tree is incorrect?
- IT HAS A DIFFERENT TOKEN THAN t AS ITS LEFTMOST
LEAF AT POSITION p!
- So a correct tree (or reasonable approximation
thereto) starts with t or is empty
- Obvious implementation is a recursive Boolean
function that tests possibilities until it finds a
possible tree, called a RECURSIVE DESCENT PARSER
[Figure: token sequence t1 t2 t3 t4 t5 t6 t7 t8 t9]
12Recursive descent parsing
- Next figure shows a Recursive Descent Parser; it is
a RECOGNIZER, i.e. it lacks code for parse tree
construction
- Grammar is a simple arithmetic expression grammar
which is right associative in the '+' operator (see
example token stream)
- IDENTIFIER + (IDENTIFIER + IDENTIFIER) EOF
- Parser text shows a very direct relationship
between parser and grammar (note the utility of the
&& and || lazy Boolean operators)
- One of the attractions of Recursive Descent
Parsing
- Each rule N corresponds with an integer routine
that returns 1 if a terminal production of N was
found at the present position in the input
stream; otherwise it returns 0, i.e. no terminal
production of N was found.
13The Driver
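#include <stdio.h>    /* for printf() */
#include <stdlib.h>   /* for exit() */
#include "lex.h"      /* for start_lex(), get_next_token(), Token */

/* DRIVER (follows the parser text, which defines input() and require()) */
int main(void) {
    start_lex(); get_next_token();
    require(input());    /* call the routine for the start symbol */
    return 0;
}

void error(void) {
    printf("Error in expression\n"); exit(1);
}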
14Recursive Descent Parser for grammar
#define EoF 256
#define IDENTIFIER 257

#include "tokennumbers.h"

/* PARSER */
/* Prototypes, needed because the routines call each other recursively */
void error(void);
int expression(void);
int term(void);
int parenthesized_expression(void);
int rest_expression(void);
int token(int tk);
int require(int found);

int input(void) {
    return expression() && require(token(EoF));
}

int expression(void) {
    return term() && require(rest_expression());
}

int term(void) {
    return token(IDENTIFIER) || parenthesized_expression();
}

int parenthesized_expression(void) {
    return token('(') && require(expression()) && require(token(')'));
}

int rest_expression(void) {
    /* the "|| 1" implements the empty alternative */
    return (token('+') && require(expression())) || 1;
}

int token(int tk) {
    if (tk != Token.class) return 0;
    get_next_token(); return 1;
}

int require(int found) {
    if (!found) error();
    return 1;
}

input → expression EOF
expression → term rest_expression
term → IDENTIFIER | parenthesized_expression
parenthesized_expression → '(' expression ')'
rest_expression → '+' expression | ε

grammar of 2.53 and fig 2.4
15Three Drawbacks
- Repeated backtracking over one token due to
repeated calls to token(int tk), causing repeated
testing of Token.class
- Operationally, the method often fails to produce a
correct parser
- partial consumption of expressions causes the
parser to be stranded (see examples pg. 119)
- Recursive descent parsers cannot handle
left-recursive grammars, a serious disadvantage
- Error handling leaves a lot to be desired
(laconic error handling)
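The stranding problem can be reproduced with a small self-contained recognizer. This is a sketch under assumptions: the operator is taken to be '+', the indexed_element rule (IDENTIFIER '[' expression ']') is an illustrative extension of the grammar and may differ from the examples on pg. 119, the lexer is replaced by a hypothetical array of token classes, and require() records an error in a flag instead of exiting so the result can be inspected.

```c
enum { EoF = 256, IDENTIFIER = 257 };

static const int *next_token;   /* hypothetical stand-in for the lexer input */
static int token_class;         /* class of the current token */
static int syntax_error;        /* set by require() instead of calling exit() */

static void get_next_token(void) {
    token_class = *next_token;
    if (token_class != EoF) next_token++;   /* never read past EoF */
}

static int token(int tk) {
    if (tk != token_class) return 0;
    get_next_token();
    return 1;
}

static int require(int found) {
    if (!found) syntax_error = 1;
    return 1;
}

static int expression(void);

/* Illustrative extra rule: indexed_element -> IDENTIFIER '[' expression ']' */
static int indexed_element(void) {
    return token(IDENTIFIER) && require(token('['))
        && require(expression()) && require(token(']'));
}

static int parenthesized_expression(void) {
    return token('(') && require(expression()) && require(token(')'));
}

/* term -> IDENTIFIER | indexed_element | parenthesized_expression
   The first alternative consumes the IDENTIFIER and succeeds, so the
   second alternative is never tried: the parser is stranded before '['. */
static int term(void) {
    return token(IDENTIFIER) || indexed_element() || parenthesized_expression();
}

static int rest_expression(void) {
    return (token('+') && require(expression())) || 1;
}

static int expression(void) {
    return term() && require(rest_expression());
}

static int input(void) {
    return expression() && require(token(EoF));
}

/* Returns 1 if the EoF-terminated token array is accepted, 0 otherwise. */
int recognizes(const int *tokens) {
    next_token = tokens;
    syntax_error = 0;
    get_next_token();
    return input() && !syntax_error;
}
```

With this sketch, IDENTIFIER + ( IDENTIFIER + IDENTIFIER ) EoF is accepted, but IDENTIFIER [ IDENTIFIER ] EoF is rejected even though the extended grammar generates it, because term() has already committed to its first alternative when the '[' shows up.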
16Automatic Creation of a Top down Parser
- Grammars that allow automatic construction of a
top down parser are called LL(1) grammars
- LL(1) uses a push down automaton (section
2.2.4.4)
- Applying precomputation is based on the
observation that when a routine for N is called
with the same token t, the same sequence of
operations is performed with the same result, so we
can precompute for each N what is required for
each token t
- Don't need to call other routines to find the
answer: ORTHOGONALITY
- Avoid the search overhead since only a single
routine is called
- and, serendipitously, it provides a solution to the
problems on pages 119 and 120
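As an illustration of precomputing "for each N what is required for each token t", the following sketch stores that decision in a table and drives a small push-down automaton with it. It is not the automaton of section 2.2.4.4: the symbol numbering, the table layout, the '+' operator and the function name ll1_accepts are assumptions of this example, and the table entries were filled in by hand for the lecture's example grammar.

```c
enum symbol {
    T_EOF, T_IDENTIFIER, T_OPEN, T_CLOSE, T_PLUS, TERMINAL_COUNT,
    N_INPUT = TERMINAL_COUNT, N_EXPRESSION, N_TERM, N_PAREN_EXPR, N_REST_EXPR
};

#define END        (-1)   /* terminates a right-hand side */
#define NO_RULE    (-1)   /* table entry for "syntax error" */
#define STACK_SIZE 64

/* Right-hand sides of the grammar rules, numbered 0..6. */
static const int rules[][4] = {
    { N_EXPRESSION, T_EOF, END },             /* 0: input -> expression EOF */
    { N_TERM, N_REST_EXPR, END },             /* 1: expression -> term rest_expression */
    { T_IDENTIFIER, END },                    /* 2: term -> IDENTIFIER */
    { N_PAREN_EXPR, END },                    /* 3: term -> parenthesized_expression */
    { T_OPEN, N_EXPRESSION, T_CLOSE, END },   /* 4: parenthesized_expression -> '(' expression ')' */
    { T_PLUS, N_EXPRESSION, END },            /* 5: rest_expression -> '+' expression */
    { END },                                  /* 6: rest_expression -> (empty) */
};

/* Precomputed decisions. Rows: non-terminals.
   Columns: EOF, IDENTIFIER, '(', ')', '+'. */
static const int table[5][TERMINAL_COUNT] = {
    /* input                    */ { NO_RULE, 0,       0,       NO_RULE, NO_RULE },
    /* expression               */ { NO_RULE, 1,       1,       NO_RULE, NO_RULE },
    /* term                     */ { NO_RULE, 2,       3,       NO_RULE, NO_RULE },
    /* parenthesized_expression */ { NO_RULE, NO_RULE, 4,       NO_RULE, NO_RULE },
    /* rest_expression          */ { 6,       NO_RULE, NO_RULE, 6,       5       },
};

/* Returns 1 if the token array (terminal codes, ending in T_EOF) is
   accepted, 0 on a syntax error or if the prediction stack overflows. */
int ll1_accepts(const int *tokens) {
    int stack[STACK_SIZE];
    int top = 0;

    stack[top++] = N_INPUT;
    while (top > 0) {
        int sym = stack[--top];
        if (sym < TERMINAL_COUNT) {           /* terminal: must match the input */
            if (sym != *tokens) return 0;
            tokens++;
        } else {                              /* non-terminal: one table look-up */
            int r = table[sym - N_INPUT][*tokens];
            int len = 0;
            if (r == NO_RULE) return 0;
            while (rules[r][len] != END) len++;
            if (top + len > STACK_SIZE) return 0;
            while (len > 0) stack[top++] = rules[r][--len];  /* push right to left */
        }
    }
    return 1;
}
```

Each step needs only the current non-terminal and the current token, so no other routine has to be consulted and nothing is searched.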
17LL(1) parsing
- final decision on success or failure was made by
comparing the input token to the first token
produced by the alternatives
- so we can create FIRST sets: the sets of first
tokens produced by all alternatives in the
grammar, both for non-terminals N and for terminals
- FIRST(α), i.e. the FIRST set of the alternative
α, contains all terminals α can start with; ε,
the empty string, is included in FIRST(α) if α
can produce ε
- Trivial if α starts with a terminal, ex.
- parenthesized_expression → '(' expression ')'
- Tougher if α starts with a non-terminal N
- Then we have to find FIRST(N), the union of the
FIRST sets of its alternatives, which can be
computed with a closure algorithm
18Closure Algorithm for FIRST sets in G
- Data Definitions
- Token sets called FIRST sets for all terminals,
non-terminals and alternatives of non-terminals
in G
- A token set called FIRST for each alternative
tail in G; an alternative tail is a sequence of
zero or more grammar symbols α if Aα is an
alternative or alternative tail in G.
- Initializations
- For all terminals T, set FIRST(T) to {T}.
- For all non-terminals N, set FIRST(N) to the
empty set.
- For all non-empty alternatives and alternative
tails α, set FIRST(α) to the empty set.
- Set the FIRST set of all empty alternatives and
alternative tails to {ε}.
- Inference rules
- For each rule N→α in G, FIRST(N) must contain all
tokens in FIRST(α), including ε if FIRST(α)
contains it.
- For each alternative or alternative tail α of the
form Aβ, FIRST(α) must contain all tokens in
FIRST(A), excluding ε, should FIRST(A) contain
it.
- For each alternative or alternative tail α of the
form Aβ where FIRST(A) contains ε, FIRST(α) must
contain all tokens in FIRST(β), including ε if
FIRST(β) contains it.
figure 2.58
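A compact way to experiment with the closure algorithm is sketched below for the lecture's example grammar. It is a simplification, not a transcription of figure 2.58: FIRST sets are stored only for terminals and non-terminals, and the FIRST set of an alternative or alternative tail is computed on demand by first_of_sequence() rather than stored. The bit-set representation, the symbol names and the '+' operator are assumptions of this sketch.

```c
#include <stddef.h>

/* Symbol numbering is an assumption of this sketch: terminals first. */
enum symbol {
    T_EOF, T_IDENTIFIER, T_OPEN, T_CLOSE, T_PLUS,
    N_INPUT, N_EXPRESSION, N_TERM, N_PAREN_EXPR, N_REST_EXPR,
    SYMBOL_COUNT
};

#define BIT(s)         (1u << (s))
#define EPSILON        BIT(SYMBOL_COUNT)   /* extra bit standing for the empty string */
#define IS_TERMINAL(s) ((s) < N_INPUT)
#define END            (-1)                /* terminates a right-hand side */

struct rule { int lhs; int rhs[4]; };

static const struct rule grammar[] = {
    { N_INPUT,      { N_EXPRESSION, T_EOF, END } },
    { N_EXPRESSION, { N_TERM, N_REST_EXPR, END } },
    { N_TERM,       { T_IDENTIFIER, END } },
    { N_TERM,       { N_PAREN_EXPR, END } },
    { N_PAREN_EXPR, { T_OPEN, N_EXPRESSION, T_CLOSE, END } },
    { N_REST_EXPR,  { T_PLUS, N_EXPRESSION, END } },
    { N_REST_EXPR,  { END } },             /* the empty alternative */
};
#define RULE_COUNT (sizeof grammar / sizeof grammar[0])

unsigned first[SYMBOL_COUNT];              /* one bit set per grammar symbol */

/* FIRST of the alternative or alternative tail starting at seq. */
unsigned first_of_sequence(const int *seq) {
    unsigned set = 0;
    for (; *seq != END; seq++) {
        set |= first[*seq] & ~EPSILON;     /* tokens of FIRST(A), excluding epsilon */
        if (!(first[*seq] & EPSILON))
            return set;                    /* A cannot vanish: the tail is irrelevant */
    }
    return set | EPSILON;                  /* empty, or every symbol can produce epsilon */
}

/* Apply the inference rules until no FIRST set changes any more. */
void compute_first_sets(void) {
    int changed = 1;
    for (int s = 0; s < SYMBOL_COUNT; s++)
        first[s] = IS_TERMINAL(s) ? BIT(s) : 0;   /* the initializations */
    while (changed) {
        changed = 0;
        for (size_t r = 0; r < RULE_COUNT; r++) {
            unsigned before = first[grammar[r].lhs];
            first[grammar[r].lhs] |= first_of_sequence(grammar[r].rhs);
            if (first[grammar[r].lhs] != before) changed = 1;
        }
    }
}
```

After compute_first_sets() has run, first[N_EXPRESSION] holds { IDENTIFIER, '(' } and first[N_REST_EXPR] holds { '+', ε }. The loop terminates because sets only grow and there are finitely many tokens.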
19Initial FIRST sets of our example grammar
Rule / alternative (tail)        FIRST set
input                            { }
  expression EOF                 { }
  EOF                            { EOF }
expression                       { }
  term rest_expression           { }
  rest_expression                { }
term                             { }
  IDENTIFIER                     { IDENTIFIER }
  parenthesized_expression       { }
parenthesized_expression         { }
  '(' expression ')'             { '(' }
  expression ')'                 { }
  ')'                            { ')' }
rest_expression                  { }
  '+' expression                 { '+' }
  expression                     { }
  ε                              { ε }
20Final FIRST sets
Rule / alternative (tail)        FIRST set
input                            { IDENTIFIER '(' }
  expression EOF                 { IDENTIFIER '(' }
  EOF                            { EOF }
expression                       { IDENTIFIER '(' }
  term rest_expression           { IDENTIFIER '(' }
  rest_expression                { '+' ε }
term                             { IDENTIFIER '(' }
  IDENTIFIER                     { IDENTIFIER }
  parenthesized_expression       { '(' }
parenthesized_expression         { '(' }
  '(' expression ')'             { '(' }
  expression ')'                 { IDENTIFIER '(' }
  ')'                            { ')' }
rest_expression                  { '+' ε }
  '+' expression                 { '+' }
  expression                     { IDENTIFIER '(' }
  ε                              { ε }
21Predictive Recursive Descent Parser
- FIRST sets are used to construct a predictive
parser (probably ought to be called grammar
directed parser since it doesn't really predict
anything)
- Code for each alternative is preceded by a CASE
label based on the FIRST set
- Testing is done on tokens only (using switch
statements in C)
- Routine for a grammar rule is only called when it
is certain (if there is no syntactic error) to
produce a terminal production
22Predictive parser 1 (first half)
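/* Prototypes, needed because the routines call each other recursively */
void error(void);
void expression(void);
void term(void);
void parenthesized_expression(void);
void rest_expression(void);
void token(int tk);

void input(void) {
    switch (Token.class) {
    case IDENTIFIER: case '(':
        expression(); token(EoF); break;
    default: error();
    }
}

void expression(void) {
    switch (Token.class) {
    case IDENTIFIER: case '(':
        term(); rest_expression(); break;
    default: error();
    }
}

void term(void) {
    switch (Token.class) {
    case IDENTIFIER: token(IDENTIFIER); break;
    case '(': parenthesized_expression(); break;
    default: error();
    }
}

first part of 2.61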
23Predictive Parser 2
void parenthesized_expression(void) {
    switch (Token.class) {
    case '(': token('('); expression(); token(')'); break;
    default: error();
    }
}

void rest_expression(void) {
    switch (Token.class) {
    case '+': token('+'); expression(); break;
    case EoF: case ')': break;    /* the empty alternative */
    default: error();
    }
}

void token(int tk) {
    if (tk != Token.class) error();
    get_next_token();
}

second part of 2.61
24LL(1) parsing with nullable alternatives
- Complication: how to handle the case label for
the empty alternative, since it does not start
with any token
- Solution: when N produces an empty string we
don't see the string, but we do see a token that
can follow N
- Create the FOLLOW set: the set of tokens that can
immediately follow a given non-terminal N (see
closure algorithm fig. 2.62)
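The FOLLOW sets of the example grammar can be computed with the sketch below. It applies the usual two inference rules (for a rule N → α M β, FOLLOW(M) receives the tokens of FIRST(β) except ε, and also FOLLOW(N) when β can produce ε) and is not a transcription of fig. 2.62. The FIRST sets are hard-coded from the "Final FIRST sets" slide; the bit-set representation, the symbol names and the '+' operator are assumptions of this sketch.

```c
#include <stddef.h>

enum symbol {
    T_EOF, T_IDENTIFIER, T_OPEN, T_CLOSE, T_PLUS,
    N_INPUT, N_EXPRESSION, N_TERM, N_PAREN_EXPR, N_REST_EXPR,
    SYMBOL_COUNT
};

#define BIT(s)         (1u << (s))
#define EPSILON        BIT(SYMBOL_COUNT)   /* extra bit standing for the empty string */
#define IS_TERMINAL(s) ((s) < N_INPUT)
#define END            (-1)                /* terminates a right-hand side */

struct rule { int lhs; int rhs[4]; };

static const struct rule grammar[] = {
    { N_INPUT,      { N_EXPRESSION, T_EOF, END } },
    { N_EXPRESSION, { N_TERM, N_REST_EXPR, END } },
    { N_TERM,       { T_IDENTIFIER, END } },
    { N_TERM,       { N_PAREN_EXPR, END } },
    { N_PAREN_EXPR, { T_OPEN, N_EXPRESSION, T_CLOSE, END } },
    { N_REST_EXPR,  { T_PLUS, N_EXPRESSION, END } },
    { N_REST_EXPR,  { END } },             /* the empty alternative */
};
#define RULE_COUNT (sizeof grammar / sizeof grammar[0])

/* FIRST sets of the symbols, copied from the final FIRST set table. */
static const unsigned first[SYMBOL_COUNT] = {
    [T_EOF] = BIT(T_EOF), [T_IDENTIFIER] = BIT(T_IDENTIFIER),
    [T_OPEN] = BIT(T_OPEN), [T_CLOSE] = BIT(T_CLOSE), [T_PLUS] = BIT(T_PLUS),
    [N_INPUT]      = BIT(T_IDENTIFIER) | BIT(T_OPEN),
    [N_EXPRESSION] = BIT(T_IDENTIFIER) | BIT(T_OPEN),
    [N_TERM]       = BIT(T_IDENTIFIER) | BIT(T_OPEN),
    [N_PAREN_EXPR] = BIT(T_OPEN),
    [N_REST_EXPR]  = BIT(T_PLUS) | EPSILON,
};

unsigned follow[SYMBOL_COUNT];             /* only non-terminal entries are used */

/* FIRST of the alternative tail starting at seq. */
static unsigned first_of_sequence(const int *seq) {
    unsigned set = 0;
    for (; *seq != END; seq++) {
        set |= first[*seq] & ~EPSILON;
        if (!(first[*seq] & EPSILON)) return set;
    }
    return set | EPSILON;
}

/* Apply the two inference rules until no FOLLOW set changes any more. */
void compute_follow_sets(void) {
    int changed = 1;
    for (int s = 0; s < SYMBOL_COUNT; s++) follow[s] = 0;
    while (changed) {
        changed = 0;
        for (size_t r = 0; r < RULE_COUNT; r++) {
            for (const int *p = grammar[r].rhs; *p != END; p++) {
                unsigned tail, add;
                if (IS_TERMINAL(*p)) continue;
                tail = first_of_sequence(p + 1);
                add = tail & ~EPSILON;                 /* tokens that start the tail */
                if (tail & EPSILON)                    /* tail can vanish */
                    add |= follow[grammar[r].lhs];
                if ((follow[*p] | add) != follow[*p]) {
                    follow[*p] |= add;
                    changed = 1;
                }
            }
        }
    }
}
```

The result for rest_expression is { EOF, ')' }, which is exactly the pair of case labels guarding the empty alternative in the predictive parser's rest_expression() routine.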
25LL(1) parser/grammar
- LL(1) parser is called LL(1) because the parser
works from Left to Right, identifying nodes in
Leftmost derivation order, and '(1)' because all
choices are based on one-token look-ahead. A
grammar for which this parsing works is called an
LL(1) grammar.
- What we've seen is a strong LL(1) grammar; there
are lots of things to worry about (see the list
on page 124)
26Things to Worry About
- repetition operators in the grammar
- detecting and reporting parsing conflicts (to be
covered next time)
- including code for the generation of the syntax
tree
- including code and tables for syntax error
recovery
- optimizations
27Homework for Week 3
- Objective (two weeks): get a version of lex
running and run it on the LexByLex folder
material, importing other files as necessary, and
provide a "blow-by-blow" description of your
efforts and the result (Failure Is Not An Option)
http://csmweb2.emcweb.com/durable/2000/08/10/p19s2.htm
- problem to turn in next Monday
- 2.8 (185-186)
- Some Flex/Lex
- http://www.ug.bcc.bilkent.edu.tr/resat/Articles/article_1.htm
- http://www.monmouth.com/wstreett/lex-yacc/lex-yacc.html
28References
- Text: Modern Compiler Design (source of the figures)