Title: Agenda
1Agenda
- Scanner vs. parser
- Regular grammar vs. context-free grammar
- Grammars (context-free grammars)
- grammar rules
- derivations
- parse trees
- ambiguous grammars
- useful examples
- Reading
Chapter 2, Sections 4.1 and 4.2
2Characteristics of a Parser
- Input: a sequence of tokens from the scanner
- Output: the parse tree of the program
- the parse tree is generated (implicitly or explicitly) if the input is a legal program
- if the input is an illegal program, syntax errors are issued
- Note
- instead of a parse tree, some parsers directly produce
- an abstract syntax tree (AST) and symbol table, or
- intermediate code, or
- object code
- In the following lectures, we'll assume that a parse tree is generated.
3Comparison with Lexical Analysis
Phase              Input                  Output
Lexical Analysis   String of characters   String of tokens
Syntax Analysis    String of tokens       Parse tree
4Example
- The program
- x * y + z
- Input to parser
- ID TIMES ID PLUS ID
- we'll write tokens as follows
- id * id + id
- Output of parser
- the parse tree:

          E
        / | \
       E  +  E
     / | \   |
    E  *  E  id
    |     |
    id    id
5Why are Regular Grammars Not Enough?
- Write an automaton that accepts the strings
- a, (a), ((a)), and (((a)))
- now try: a, (a), ((a)), (((a))), ..., (^k a )^k for any k; a finite automaton cannot track an unbounded number of unmatched parentheses
6What must parser do?
- Recognizer: not all strings of tokens are programs
- must distinguish between valid and invalid strings of tokens
- Translator: must expose program structure
- e.g., associativity and precedence
- hence must return the parse tree
- We need
- a language for describing valid strings of tokens
- context-free grammars
- (analogous to regular grammars in the scanner)
- a method for distinguishing valid from invalid strings of tokens (and for building the parse tree)
- the parser
- (analogous to the state machine in the scanner)
7Context-free grammars (CFGs)
- Example: Simple Arithmetic Expressions Grammar
- In English:
- An integer is an arithmetic expression.
- If exp1 and exp2 are arithmetic expressions, then so are the following
- exp1 - exp2
- exp1 / exp2
- ( exp1 )
- The corresponding CFG (we'll write tokens as follows):
- exp → INTLITERAL            E → intlit
- exp → exp MINUS exp         E → E - E
- exp → exp DIVIDE exp        E → E / E
- exp → LPAREN exp RPAREN     E → ( E )
8Reading the CFG
- The grammar has five terminal symbols
- intlit, -, /, (, )
- terminals of a grammar = tokens returned by the scanner
- The grammar has one non-terminal symbol
- E
- non-terminals describe valid sequences of tokens
- The grammar has four productions (or rules),
- each of the form E → α
- left-hand side: a single non-terminal
- right-hand side: either
- a sequence of one or more terminals and/or non-terminals, or
- ε (an empty production)
9Example, revisited
- Note
- a more compact way to write the previous grammar:
- E → INTLITERAL | E - E | E / E | ( E )
- or
- E → INTLITERAL
-   | E - E
-   | E / E
-   | ( E )
10A formal definition of CFGs
- A CFG consists of
- a set of terminals T
- a set of non-terminals N
- a start symbol S (a non-terminal)
- a set of productions
- X → Y1 Y2 ... Yn
- where X ∈ N and each Yi ∈ T ∪ N ∪ { ε }
11Notational Conventions
- In these lecture notes
- Non-terminals are written upper-case
- Terminals are written lower-case
- The start symbol is the left-hand side of the
first production
12The Language of a CFG
- The language defined by a CFG is the set of strings that can be derived from the start symbol of the grammar.
- Derivation: read productions as rewrite rules
- X → Y1 ... Yn
- means X can be replaced by Y1 ... Yn
13Derivation key idea
- 1. Begin with a string consisting of the start symbol S
- 2. Replace any non-terminal X in the string by the right-hand side of some production X → Y1 ... Yn
- 3. Repeat (2) until there are no non-terminals in the string
14Derivation an example
- CFG
- E → id
- E → E + E
- E → E * E
- E → ( E )
- Is the string id * id + id in the language defined by the grammar?
15Terminals
- Terminals are called so because there are no rules for replacing them
- Once generated, terminals are permanent
- Therefore, terminals are the tokens of the language
16The Language of a CFG (Cont.)
- More formally, write
- X1 ... Xi-1 Xi Xi+1 ... Xn ⇒ X1 ... Xi-1 Y1 ... Ym Xi+1 ... Xn
- if there is a production
- Xi → Y1 Y2 ... Ym
17The Language of a CFG (Cont.)
- Write
- X1 ... Xn ⇒* Y1 ... Ym
- if
- X1 ... Xn ⇒ ... ⇒ Y1 ... Ym
- in 0 or more steps
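- For example, with the grammar E → id | E + E | E * E | ( E ) used in the derivation example, E ⇒* id + id, because E ⇒ E + E ⇒ id + E ⇒ id + id (three steps).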
18The Language of a CFG
- Let G be a context-free grammar with start symbol S. Then the language of G is
- { a1 a2 ... an | S ⇒* a1 a2 ... an }
- where each ai (i = 1, 2, ..., n) is a terminal symbol
19Examples
- Strings of balanced parentheses
- The grammar:
- S → ( S )
- S → ε
- the same as S → ( S ) | ε
20Arithmetic Expression Example
- Simple arithmetic expressions
- E → id | E + E | E * E | ( E )
- Some elements of the language: id, id + id, ( id ), id * id, ( id ) * id, ...
21Notes
- The idea of a CFG is a big step. But:
- membership in a language is yes or no
- we also need the parse tree of the input!
- furthermore, we must handle errors gracefully
- We need an implementation of CFGs, i.e. the parser
- we'll create the parser using a parser generator
- available generators: CUP, bison, yacc
22More Notes
- Form of the grammar is important
- many grammars generate the same language
- parsers are sensitive to the form of the grammar
- Example
- E → E - E
-   | E / E
-   | intlit
- is not suitable for an LL(1) parser (a common kind of parser).
23Derivations and Parse Trees
- A derivation is a sequence of productions
- S ⇒ ... ⇒ ... ⇒ ...
- A derivation can be drawn as a tree
- the start symbol is the tree's root
- for a production X → Y1 Y2 ..., add children Y1 Y2 ... to node X
24Derivation Example
- Grammar: E → id | E + E | E * E | ( E )
- String: id * id + id
- A (left-most) derivation:
- E ⇒ E + E ⇒ E * E + E ⇒ id * E + E ⇒ id * id + E ⇒ id * id + id
25Derivation Example (Cont.)
- The parse tree built by the derivation:

          E
        / | \
       E  +  E
     / | \   |
    E  *  E  id
    |     |
    id    id
26Notes on Derivations
- A parse tree has
- terminals at the leaves
- non-terminals at the interior nodes
- An in-order traversal of the leaves is the original input
- The parse tree shows the association of operations; the input string does not
27Left-most and Right-most Derivations
- The example is a left-most derivation
- At each step, replace the left-most non-terminal
- There is an equivalent notion of a right-most
derivation
28Derivations and Parse Trees
- Note that right-most and left-most derivations have the same parse tree
- The difference is the order in which branches are added
29Remarks on Derivation
- We are not just interested in whether s ∈ L(G)
- we need a parse tree for s (because we need to build the AST)
- A derivation defines a parse tree
- but one parse tree may have many derivations
- Left-most and right-most derivations are important in parser implementation
30Ambiguity (1)
- Recall the grammar E → id | E + E | E * E | ( E )
- and the string id * id + id
31Ambiguity (2)
- This string has two parse trees:

        E                          E
      / | \                      / | \
     E  +  E                    E  *  E
   / | \    \                   |    / | \
  E  *  E    id                id   E  +  E
  |     |                           |     |
  id    id                          id    id
32Ambiguity (3)
- for each of the two parse trees, find the corresponding left-most derivation
- for each of the two parse trees, find the corresponding right-most derivation
33Ambiguity (4)
- A grammar is ambiguous if, for some string of the language,
- it has more than one parse tree, or
- there is more than one right-most derivation, or
- there is more than one left-most derivation
- (the three conditions are equivalent)
- Ambiguity leaves the meaning of some programs ill-defined
34Dealing with Ambiguity
- There are several ways to handle ambiguity
- the most direct method is to rewrite the grammar unambiguously
- this enforces the precedence of / over -
35Removing Ambiguity
- Rewriting
- Expression Grammars
- precedence
- associativity
- IF-THEN-ELSE
- the Dangling-ELSE problem
36Handling operator precedence
- Rewrite the grammar
- use a different nonterminal for each precedence level
- start with the lowest precedence (MINUS)
- E → E - E | E / E | ( E ) | id
- rewrite to
- E → E - T | T
- T → T / F | F
- F → id | ( E )
37Example
- parse tree for the string id - id / id
- E → E - T | T
- T → T / F | F
- F → id | ( E )

          E
        / | \
       E  -  T
       |   / | \
       T  T  /  F
       |  |     |
       F  F     id
       |  |
      id  id
38Handling Operator Associativity
- The grammar captures operator precedence, but it is still ambiguous!
- it fails to express that both subtraction and division are left associative
- e.g., 5-3-2 is equivalent to ((5-3)-2) and not to (5-(3-2))
39Recursion
- A grammar is recursive in nonterminal X if
- X ⇒+ α X β
- (⇒+ means: in one or more steps, X derives a sequence of symbols that includes an X)
- A grammar is left recursive in X if
- X ⇒+ X β
- (in one or more steps, X derives a sequence of symbols that starts with an X)
- A grammar is right recursive in X if
- X ⇒+ β X
- (in one or more steps, X derives a sequence of symbols that ends with an X)
40Resolving ambiguity due to associativity
- The grammar given above is both left and right recursive in nonterminals E and T
- To correctly express operator associativity:
- for left associativity, use left recursion
- for right associativity, use right recursion
- Here's the correct grammar
- E → E - T | T
- T → T / F | F
- F → id | ( E )
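- With the left-recursive rule E → E - T, a string such as id - id - id now has only one parse, grouped to the left; the left-most derivation is
- E ⇒ E - T ⇒ E - T - T ⇒ T - T - T ⇒ F - T - T ⇒ id - T - T ⇒ id - F - T ⇒ id - id - T ⇒ id - id - F ⇒ id - id - id
- i.e., the parse tree groups the string as ((id - id) - id)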
41The Dangling Else ambiguity
- Consider the grammar
- St → if E then St
-    | if E then St else St
-    | other
- This grammar is also ambiguous
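- For example, the statement if E1 then if E2 then S1 else S2 has two readings under this grammar (E1, E2, S1, S2 are placeholder expressions and statements):
- if E1 then ( if E2 then S1 else S2 )   (the else attached to the inner then)
- if E1 then ( if E2 then S1 ) else S2   (the else attached to the outer then)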
42Resolving the dangling else
- else matches the closest unmatched then
- We can describe this in the grammar
- St → MIF        /* all then are matched */
-    | UIF        /* some then are unmatched */
- MIF → if E then MIF else MIF
-     | other
- UIF → if E then St
-     | if E then MIF else UIF
- Describes the same set of strings
43Precedence and Associativity Declarations in Parser Generators
- Instead of rewriting the grammar
- Use the more natural (ambiguous) grammar
- Along with disambiguating declarations
- Most parser generators allow precedence and
associativity declarations to disambiguate
grammars
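- For example, in yacc/bison the ambiguous expression grammar can be kept as-is and disambiguated with %left declarations (a later %left line binds tighter); the following grammar-file fragment is an illustrative sketch, not taken from the lecture:

  %token ID
  %left '-'
  %left '/'
  %%
  E : E '-' E
    | E '/' E
    | '(' E ')'
    | ID
    ;

- Here both operators are declared left-associative, and '/' gets higher precedence than '-' because it is declared later.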
44Parsing Approaches
- Top-down parsing
- build parse tree from start symbol (root)
- match terminal symbols (tokens) in the production rules with tokens in the input stream
- simple but limited in power
- Bottom-up parsing
- start from the input token stream
- build the parse tree from terminal symbols (tokens) up to the start symbol
- complex but powerful
45Top Down vs. Bottom Up
- [Figure: top-down parsing starts at the start symbol and grows the tree downward to match the input token stream; bottom-up parsing starts from the input token stream and builds the tree upward until it reaches the start symbol.]
46Top-down Parsing
- A top-down parsing algorithm parses an input string of tokens by tracing out the steps in a leftmost derivation.
- The parse tree associated with the input string is constructed using a preorder traversal, hence the name top-down.
47Top-down parsers
- There are mainly two kinds of top-down parsers
- 1. Predictive parsers
- try to make decisions about the structure of the tree below a node based on a few lookahead tokens (usually one!)
- Weakness: little program structure has been seen before predictive decisions must be made
- 2. Backtracking parsers
- solve the lookahead problem by backtracking if one decision turns out to be wrong and making a different choice
- Weakness: backtracking parsers are slow (exponential time in general)
48Recursive-descent parsing
- Main idea
- 1. Use the grammar rules as recipes for procedure code that parses the rule
- 2. Each non-terminal corresponds to a procedure
- 3. Each appearance of a terminal in the right-hand side of a rule causes a token to be matched
- 4. Each appearance of a non-terminal corresponds to a call of the associated procedure
49Example Recursive-descent Parsing
- F → ( E ) | num
- Code

  void F() {
      if (token == num) match(num);
      else {
          match( '(' );     // match token (
          E();
          match( ')' );     // match token )
      }
  }
50Example Recursive-descent Parsing (2)
- Observation
- Note how lookahead is not a problem in this example: if the token is num, go one way; if the token is (, go the other; and if the token is neither, declare an error

  void match(Token expect) {
      if (token == expect)
          token = getToken();    // get next token
      else
          error(token, expect);
  }
51Example Recursive-descent Parsing (3)
- A recursive-descent procedure can also compute values or syntax trees

  int F() {
      if (token == num) {
          int temp = atoi(lexeme);   // numeric value of the token
          match(num);
          return temp;
      }
      else {
          match( '(' );
          int temp = E();
          match( ')' );
          return temp;
      }
  }
52When Recursive Descent Does Not Work
- E → E - term | term

  void E() {
      if (token == ??) {     // cannot decide which production to use
          E();               // uh, oh!! recurses before consuming any input
          match( '-' );
          term();
      }
      else term();
  }

- A left-recursive grammar has a non-terminal A with
- A ⇒+ A α for some α
- Recursive descent does not work in such cases
53Elimination of Left Recursion
- Consider the left-recursive grammar
- A → A α | β, for some sentential forms α and β
- A generates all strings that start with a β and are followed by any number of α's
- Can rewrite the grammar using right recursion:
- A → β A'
- A' → α A' | ε
- where A' is a new nonterminal
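- As a concrete instance, consider the left-recursive rule E → E - T | T from the expression grammar: here β is T and α is "- T", so the rewrite gives
-   E  → T E'
-   E' → - T E' | ε
- A minimal recursive-descent sketch for the rewritten rules (Eprime is an illustrative name for the new nonterminal E'; token, T() and match() are assumed as in the earlier slides):

  void E() { T(); Eprime(); }

  void Eprime() {                  // E' -> - T E' | epsilon
      if (token == '-') {
          match('-');
          T();
          Eprime();
      }
      // else: take the epsilon production, i.e., do nothing
  }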
54Elimination of Left Recursion (2)
- In general
- A → A α1 | ... | A αn | β1 | ... | βm
- all strings derived from A start with one of β1, ..., βm and continue with several instances of α1, ..., αn
- Rewrite as
- A → β1 A' | ... | βm A'
- A' → α1 A' | ... | αn A' | ε
55General Left Recursion
- The grammar
- S → A α | δ
- A → S β
- is also left-recursive, because
- S ⇒+ S β α
- This left recursion can also be eliminated
- See the book, Section 4.3, for the general algorithm
56Summary of Recursive Descent with backtracking
- Simple and general parsing strategy
- left recursion must be eliminated first
- but that can be done automatically
- Unpopular because of backtracking
- thought to be too inefficient
- In practice, backtracking is eliminated by restricting the grammar
57Predictive Parsers
- Like recursive descent, but the parser can predict which production to use
- by looking at the next few tokens
- no backtracking
- Predictive parsers accept LL(k) grammars
- the first L means left-to-right scan of the input
- the second L means leftmost derivation
- k means predict based on k tokens of lookahead
- In practice, LL(1) is used
58LL(1) Languages
- In recursive descent, for each non-terminal and input token there may be a choice of production
- LL(1) means that for each non-terminal and token there is only one production
- Can be specified via 2D tables
- one dimension for the current non-terminal to expand
- one dimension for the next token
- a table entry contains one production
59Predictive Parsing and Left Factoring
- Consider the grammar
- E → T + E | T
- T → num | num * T | ( E )
- Hard to predict because
- for T, two productions start with num
- for E, it is not clear how to predict
- A grammar must be left-factored before it is used for predictive parsing
60Left-Factoring Example
- Recall the grammar
- E → T + E | T
- T → num | num * T | ( E )
- Factor out common prefixes of productions:
- E → T X
- X → + E | ε
- T → ( E ) | num Y
- Y → * T | ε
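- In general, left factoring replaces productions A → α β1 | α β2 that share a common prefix α with
-   A  → α A'
-   A' → β1 | β2
- where A' is a new nonterminal. Applied to T → num | num * T, the common prefix is num (with β1 = ε and β2 = * T), which yields T → num Y and Y → * T | ε above.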
61LL(1) Parsing Table Example
- Left-factored grammar
- E → T X              X → + E | ε
- T → ( E ) | num Y    Y → * T | ε
- The LL(1) parsing table

         num      (        +      *      )      $
  E      T X      T X
  X                        + E           ε      ε
  T      num Y    ( E )
  Y                        ε      * T    ε      ε
62LL(1) Parsing Table Example (Cont.)
- Consider the [E, num] entry
- when the current non-terminal is E and the next input is num, use production E → T X
- this production can generate a num in the first position
- Consider the [Y, +] entry
- when the current non-terminal is Y and the current token is +, get rid of Y
- Y can be followed by + only in a derivation in which Y → ε
63LL(1) Parsing Tables. Errors
- Blank entries indicate error situations
- Consider the [E, *] entry
- there is no way to derive a string starting with * from non-terminal E
64Using Parsing Tables
- Method similar to recursive descent, except
- for each non-terminal S
- we look at the next token a
- and choose the production shown at entry [S, a]
- We use a stack to keep track of pending non-terminals
- We reject when we encounter an error entry
- We accept when we encounter end-of-input
65LL(1) Parsing Algorithm
- S = the start nonterminal, $ = the end-of-input symbol
- initialize stack = < S $ > and Token = nextToken()
- repeat
-   case stack of
-     < X, rest > : if T[X, Token] == Y1 ... Yn
-                   then stack ← < Y1 ... Yn rest >
-                   else error()
-     < t, rest > : if t == Token
-                   then { stack ← < rest >; Token = nextToken() }
-                   else error()
- until stack == < >   // empty
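- A minimal C sketch of this table-driven loop, hard-coded for the left-factored expression grammar of the earlier slides; the single-character symbol encoding and the function names are illustrative assumptions, not part of the lecture:

  #include <stdio.h>
  #include <string.h>

  /* Grammar (left-factored):  E -> T X         X -> + E | eps
   *                           T -> ( E ) | n Y     Y -> * T | eps
   * Nonterminals: E X T Y; terminals: n + * ( ) $   (n stands for num). */

  static char stack[100];
  static int  top = -1;

  static void push(const char *s) {
      /* push symbols right-to-left so the leftmost symbol ends up on top */
      for (int i = (int)strlen(s) - 1; i >= 0; i--)
          stack[++top] = s[i];
  }

  /* The LL(1) table: row = nonterminal, column = lookahead token.
   * Returns the right-hand side to push ("" means epsilon), NULL on error. */
  static const char *table(char X, char a) {
      switch (X) {
      case 'E': if (a == 'n' || a == '(') return "TX";            break;
      case 'X': if (a == '+')             return "+E";
                if (a == ')' || a == '$') return "";              break;
      case 'T': if (a == 'n')             return "nY";
                if (a == '(')             return "(E)";           break;
      case 'Y': if (a == '*')             return "*T";
                if (a == '+' || a == ')' || a == '$') return "";  break;
      }
      return NULL;
  }

  static int is_nonterminal(char c) {
      return c == 'E' || c == 'X' || c == 'T' || c == 'Y';
  }

  /* Parse an input such as "n*n$"; returns 1 on accept, 0 on error. */
  int parse(const char *input) {
      top = -1;
      push("$"); push("E");               /* initialize stack = < E $ >        */
      while (top >= 0) {
          char X = stack[top--];          /* pop the top of the stack          */
          char a = *input;                /* current lookahead token           */
          if (is_nonterminal(X)) {
              const char *rhs = table(X, a);
              if (rhs == NULL) return 0;  /* blank table entry: error          */
              push(rhs);                  /* replace X by its right-hand side  */
          } else if (X == a) {
              input++;                    /* matched a terminal, advance input */
          } else {
              return 0;                   /* terminal mismatch: error          */
          }
      }
      return *input == '\0';              /* accept: stack and input consumed  */
  }

  int main(void) {
      printf("%d %d\n", parse("n*n$"), parse("n+*n$"));   /* expected: 1 0 */
      return 0;
  }

- Running parse("n*n$") visits exactly the stack/input configurations traced on the next slide.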
66LL(1) Parsing Example
  Stack            Input           Action
  E $              num * num $     T X
  T X $            num * num $     num Y
  num Y X $        num * num $     terminal
  Y X $            * num $         * T
  * T X $          * num $         terminal
  T X $            num $           num Y
  num Y X $        num $           terminal
  Y X $            $               ε
  X $              $               ε
  $                $               ACCEPT
67Constructing Parsing Tables
- LL(1) languages are those defined by a parsing table for the LL(1) algorithm
- No table entry can be multiply defined
- We want to generate parsing tables from a CFG
68Constructing Parsing Tables First and Follow sets
- If A → α, where in the row of A do we place α?
- Answer: in the column of t where t can start a string derived from α
- α ⇒* t β
- we say that t ∈ First(α)
- Also in the column of t if α is ε and t can follow an A
- S ⇒* β A t δ
- we say t ∈ Follow(A)
69Computing First Sets
- Definition: First(X) = { t | X ⇒* t α } ∪ { ε | X ⇒* ε }
- Algorithm sketch (see the book for details)
- 1. for all terminals t do First(t) ← { t }
- 2. for each production X → ε do add ε to First(X)
- 3. for each production X → A1 ... An α with ε ∈ First(Ai) for 1 ≤ i ≤ n, add First(α) to First(X)
- 4. for each production X → A1 ... An such that ε ∈ First(Ai) for 1 ≤ i ≤ n, add ε to First(X)
- repeat steps 3 and 4 until no First set can be grown
70First Sets. Example
- Recall the grammar
- E → T X              X → + E | ε
- T → ( E ) | num Y    Y → * T | ε
- First sets
- First( ( ) = { ( }           First( T ) = { num, ( }
- First( ) ) = { ) }           First( E ) = { num, ( }
- First( num ) = { num }       First( X ) = { +, ε }
- First( + ) = { + }           First( Y ) = { *, ε }
- First( * ) = { * }
71Computing Follow Sets
- Definition
- Follow(X) = { t | S ⇒* β X t δ }
- Intuition
- if S is the start symbol, then $ ∈ Follow(S)
- if X → A B, then First(B) ⊆ Follow(A) and Follow(X) ⊆ Follow(B)
- also, if B ⇒* ε, then Follow(X) ⊆ Follow(A)
72Computing Follow Sets (Cont.)
- Algorithm sketch
- 1. Follow(S) ← { $ }
- 2. for each production A → α X β, add First(β) \ { ε } to Follow(X)
- 3. for each production A → α X β where ε ∈ First(β), add Follow(A) to Follow(X)
- repeat steps 2 and 3 until no Follow set grows
73Follow Sets. Example
- Recall the grammar
- E → T X              X → + E | ε
- T → ( E ) | num Y    Y → * T | ε
- Follow sets
- Follow( + ) = { num, ( }        Follow( * ) = { num, ( }
- Follow( ( ) = { num, ( }        Follow( E ) = { ), $ }
- Follow( X ) = { $, ) }          Follow( T ) = { +, ), $ }
- Follow( ) ) = { +, ), $ }       Follow( Y ) = { +, ), $ }
- Follow( num ) = { *, +, ), $ }
74Constructing LL(1) Parsing Tables
- Construct a parsing table T for CFG G
- for each production A → α in G do
-   for each terminal t ∈ First(α) do
-     T[A, t] = α
-   if ε ∈ First(α), then for each t ∈ Follow(A) do
-     T[A, t] = α
-   if ε ∈ First(α) and $ ∈ Follow(A), then
-     T[A, $] = α
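- For example, applying these rules to the two X productions of the left-factored grammar: for X → + E, First(+ E) = { + }, so T[X, +] = + E; for X → ε, ε ∈ First(ε) and Follow(X) = { ), $ }, so T[X, )] = T[X, $] = ε. This is exactly the X row of the parsing table shown earlier.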
75Notes on LL(1) Parsing Tables
- If any entry is multiply defined, then G is not LL(1)
- in particular, G is not LL(1) if G is ambiguous
- or if G is left recursive
- or if G is not left-factored
- Most programming language grammars are not LL(1)
- There are tools that build LL(1) tables
76Review
- For some grammars there is a simple parsing strategy
- Predictive parsing
- Next time: Bottom-up parsing