Title: Parsing
1Parsing
2Parsing
- Calculate grammatical structure of program, like diagramming sentences, where
  - Tokens = "words"
  - Programs = "sentences"
For further information, read Aho, Sethi, Ullman, Compilers: Principles, Techniques, and Tools (a.k.a. the "Dragon Book")
3Outline of coverage
- Context-free grammars
- Parsing
- Tabular Parsing Methods
- One pass
- Top-down
- Bottom-up
- Yacc
4Parser extracts grammatical structure of program
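[Figure: parse tree of a function definition. The root function-def has children name (main), arguments, and stmt-list; the stmt is an expression of the form expression operator expression, built from the variable cout, the operator <<, and the string "hello, world\n".]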
5Context-free languages
- Grammatical structure defined by context-free grammar
  - statement → labeled-statement | expression-statement | compound-statement
  - labeled-statement → ident : statement | case constant-expression : statement
  - compound-statement → { declaration-list statement-list }
- "Context-free" = only one non-terminal in left-part
- Here ident, case, ":", "{" and "}" are terminals; the hyphenated names are non-terminals
6Parse trees
- Parse tree = tree labeled with grammar symbols, such that
  - If node is labeled A, and its children are labeled x1...xn, then there is a production A → x1...xn
- "Parse tree from A" = root labeled with A
- "Complete parse tree" = all leaves labeled with tokens
7Parse trees and sentences
- "Frontier" of tree = labels on leaves (in left-to-right order)
- Frontier of tree from S is a "sentential form"
- Frontier of a complete tree from S is a "sentence"
8Example
- G: L → L E | E,  E → a | b
- Syntax trees from start symbol (L)
Sentential forms
9Derivations
- Alternate definition of sentence
- Given α, β in V*, say α ⇒ β is a derivation step if α = α'Aα'' and β = α'γα'', where A → γ is a production
- β is a sentential form iff there exists a derivation (sequence of derivation steps) S ⇒ ... ⇒ β (alternatively, we say that S ⇒* β)
Two definitions are equivalent, but note that
there are many derivations corresponding to each
parse tree
10Another example
[Figure: parse trees with interior nodes labeled L and E and leaves a and b.]
11Ambiguity
- For some purposes, it is important to know whether a sentence can have more than one parse tree
- A grammar is ambiguous if there is a sentence with more than one parse tree
- Example: E → E+E | E*E | id
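The ambiguity of this example grammar can be checked mechanically. The following is a minimal sketch, assuming the grammar E → E+E | E*E | id with tokens given as a list of strings; the function name count_parse_trees is made up for illustration. It counts the parse trees of a token string by trying every operator as the root of the tree.

```python
from functools import lru_cache

def count_parse_trees(tokens):
    """Count parse trees for tokens under E -> E+E | E*E | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of ways tokens[i:j] can be derived from E.
        if j - i == 1:
            return 1 if tokens[i] == "id" else 0
        total = 0
        # Try every operator position k as the root production E -> E op E.
        for k in range(i + 1, j - 1):
            if tokens[k] in ("+", "*"):
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))

# A count above 1 means the sentence has more than one parse tree.
print(count_parse_trees("id + id * id".split()))  # 2
```

Any sentence with a count greater than 1 is a witness that the grammar is ambiguous.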
12- If e then if b then d else f
- int x y 0
- A.b.c d
- Id -> s | s.id
- E -> E+T -> E+T+T -> T+T+T -> id+T+T -> id+T*id+T -> id+id*id+T -> id+id*id+id
13Ambiguity
- Ambiguity is a function of the grammar rather than the language
- Certain ambiguous grammars may have equivalent unambiguous ones
14Grammar Transformations
- Grammars can be transformed without affecting the language generated
- Three transformations are discussed next
  - Eliminating Ambiguity
  - Eliminating Left Recursion (i.e. productions of the form A → A α)
  - Left Factoring
15Eliminating Ambiguity
- Sometimes an ambiguous grammar can be rewritten to eliminate ambiguity
- For example, expressions involving additions and products can be written as follows
  - E → E+T | T
  - T → T*id | id
- The language generated by this grammar is the same as that generated by the grammar on transparency 11. Both generate id(+id|*id)*
- However, this grammar is not ambiguous
16Eliminating Ambiguity (Cont.)
- One advantage of this grammar is that it represents the precedence between operators. In the parse tree, products appear nested within additions
17Eliminating Ambiguity (Cont.)
- An example of ambiguity in a programming language is the dangling else
- Consider
  - S → if b then S else S | if b then S | a
18Eliminating Ambiguity (Cont.)
- When there are two nested ifs and only one else, the else can be attached to either if, so the sentence has two parse trees
19Eliminating Ambiguity (Cont.)
- In most languages (including C and Java), each else is assumed to belong to the nearest if that is not already matched by an else. This association is expressed in the following (unambiguous) grammar
  - S → Matched | Unmatched
  - Matched → if b then Matched else Matched | a
  - Unmatched → if b then S | if b then Matched else Unmatched
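The "nearest unmatched if" rule can also be sketched as a small hand-written parser. This is an illustration under assumptions, not the grammar-driven method of the slides: parse_stmt is a made-up name, tokens are a plain list of strings, and trees are nested tuples. The parser greedily attaches an else to the innermost open if, which yields the same trees as the Matched/Unmatched grammar.

```python
def parse_stmt(tokens, pos=0):
    """Parse one S starting at tokens[pos]; return (tree, next_pos)."""
    if pos < len(tokens) and tokens[pos] == "a":
        return "a", pos + 1
    if tokens[pos:pos + 3] == ["if", "b", "then"]:
        then_part, pos = parse_stmt(tokens, pos + 3)
        # Greedy choice: an else seen here belongs to this (the nearest) if.
        if pos < len(tokens) and tokens[pos] == "else":
            else_part, pos = parse_stmt(tokens, pos + 1)
            return ("if", then_part, else_part), pos
        return ("if", then_part), pos
    raise SyntaxError(f"unexpected input at position {pos}")

tokens = "if b then if b then a else a".split()
tree, end = parse_stmt(tokens)
print(tree)  # ('if', ('if', 'a', 'a')): the else went to the inner if
```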
20Eliminating Ambiguity (Cont.)
- Ambiguity is a property of the grammar
- It is undecidable whether a context free grammar is ambiguous
- The proof is done by reduction from Post's correspondence problem
- Although there is no general algorithm, it is possible to isolate certain constructs in productions which lead to ambiguous grammars
21Eliminating Ambiguity (Cont.)
- For example, a grammar containing the production A → AA | α would be ambiguous, because the substring ααα has two parses
[Figure: the two parse trees for ααα, one grouping the first two α's under a common A and the other grouping the last two.]
- This ambiguity disappears if we use the productions
  - A → AB | B and B → α
- or the productions
  - A → BA | B and B → α.
22Eliminating Ambiguity (Cont.)
- Examples of ambiguous productions
  - A → AαA
  - A → αA | Aβ and
  - A → αA | αAβA
- A language generated by an ambiguous CFG is inherently ambiguous if it has no unambiguous CFG
- An example of such a language is
  - L = {a^i b^j c^m | i = j or j = m}, which can be generated by the grammar
  - S → AB | DC
  - A → aA | ε    C → cC | ε
  - B → bBc | ε   D → aDb | ε
23Elimination of Left Recursion
- A grammar is left recursive if it has a nonterminal A and a derivation A ⇒+ Aα for some string α. Top-down parsing methods (to be discussed shortly) cannot handle left-recursive grammars, so a transformation to eliminate left recursion is needed.
- Immediate left recursion (productions of the form A → A α) can be easily eliminated.
- We group the A-productions as
  - A → A α1 | A α2 | ... | A αm | β1 | β2 | ... | βn
- where no βi begins with A. Then we replace the A-productions by
  - A → β1 A' | β2 A' | ... | βn A'
  - A' → α1 A' | α2 A' | ... | αm A' | ε
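The transformation above can be sketched in a few lines. The representation is an assumption made for illustration: right-hand sides are lists of symbol strings, the empty list stands for ε, and the new nonterminal is named by appending a prime. The function name is hypothetical.

```python
def eliminate_immediate_left_recursion(nt, productions):
    """productions: right-hand sides of nt (lists of symbols); [] is epsilon.
    Returns a dict with the new productions for nt and for the fresh nt + "'"."""
    # A -> A alpha_i : keep only alpha_i
    alphas = [rhs[1:] for rhs in productions if rhs and rhs[0] == nt]
    # A -> beta_i : right-hand sides that do not begin with A
    betas = [rhs for rhs in productions if not rhs or rhs[0] != nt]
    if not alphas:
        return {nt: productions}          # nothing to eliminate
    new_nt = nt + "'"
    return {
        nt: [beta + [new_nt] for beta in betas],
        new_nt: [alpha + [new_nt] for alpha in alphas] + [[]],
    }

# E -> E + T | T   becomes   E -> T E'   and   E' -> + T E' | epsilon
result = eliminate_immediate_left_recursion("E", [["E", "+", "T"], ["T"]])
print(result)
```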
24Elimination of Left Recursion (Cont.)
- The previous transformation, however, does not eliminate left recursion involving two or more steps. For example, consider the grammar
  - S → Aa | b
  - A → Ac | Sd | ε
- S is left-recursive because S ⇒ Aa ⇒ Sda, but it is not immediately left recursive
25Elimination of Left Recursion (Cont.)
- Algorithm. Eliminate left recursion
  - Arrange nonterminals in some order A1, A2, ..., An
  - for i = 1 to n
    - for j = 1 to i-1
      - replace each production of the form Ai → Aj γ
      - by the productions Ai → δ1 γ | δ2 γ | ... | δn γ
      - where Aj → δ1 | δ2 | ... | δn are all the current Aj-productions
    - eliminate the immediate left recursion among the Ai-productions
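A minimal sketch of this algorithm follows, under assumptions chosen for illustration: the grammar is a dict from nonterminal to a list of right-hand sides (lists of symbols, [] for ε), the dict's key order plays the role of A1, ..., An, and new nonterminals are named with a prime. The function name is hypothetical.

```python
def eliminate_left_recursion(grammar):
    """grammar: dict nonterminal -> list of right-hand sides (lists of symbols),
    [] meaning epsilon. The dict's key order is the order A1, ..., An."""
    g = {nt: [list(rhs) for rhs in rhss] for nt, rhss in grammar.items()}
    order = list(g)
    for i, ai in enumerate(order):
        for aj in order[:i]:
            expanded = []
            for rhs in g[ai]:
                if rhs and rhs[0] == aj:
                    # Replace Ai -> Aj gamma by Ai -> delta gamma for each Aj -> delta.
                    expanded += [delta + rhs[1:] for delta in g[aj]]
                else:
                    expanded.append(rhs)
            g[ai] = expanded
        # Eliminate immediate left recursion among the Ai-productions.
        alphas = [rhs[1:] for rhs in g[ai] if rhs and rhs[0] == ai]
        if alphas:
            betas = [rhs for rhs in g[ai] if not rhs or rhs[0] != ai]
            new_nt = ai + "'"
            g[ai] = [beta + [new_nt] for beta in betas]
            g[new_nt] = [alpha + [new_nt] for alpha in alphas] + [[]]
    return g

# S -> Aa | b,  A -> Ac | Sd | epsilon
g = eliminate_left_recursion({"S": [["A", "a"], ["b"]],
                              "A": [["A", "c"], ["S", "d"], []]})
for nt, rhss in g.items():
    print(nt, "->", " | ".join(" ".join(rhs) or "ε" for rhs in rhss))
```

For this input the sketch produces S → A a | b, A → b d A' | A', A' → c A' | a d A' | ε. In general the algorithm is only guaranteed to work for grammars without cycles and without ε-productions; the ε-production here happens to be harmless.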
26Elimination of Left Recursion (Cont.)
- To show that the previous algorithm actually works, all we need to notice is that iteration i only changes productions with Ai on the left-hand side, and that afterwards m > i in all productions of the form Ai → Am α
- Induction proof
  - Clearly true for i = 1
  - If it is true for all i < k, then when the outer loop is executed for i = k, the inner loop will remove all productions Ai → Am α with m < i
  - Finally, with the elimination of self recursion, m in the Ai → Am α productions is forced to be > i
- So, at the end of the algorithm, all derivations of the form Ai ⇒+ Am α will have m > i and therefore left recursion would not be possible
27Left Factoring
- Left factoring helps transform a grammar for predictive parsing
- For example, if we have the two productions
  - S → if b then S else S
  -   | if b then S
- on seeing the input token if, we cannot immediately tell which production to choose to expand S
- In general, if we have A → α β1 | α β2 and the input begins with a nonempty string derived from α, we do not know (without looking further) which production to use to expand A
28Left Factoring (Cont.)
- However, we may defer the decision by expanding A to αA'
- Then after seeing the input derived from α, we may expand A' to β1 or to β2
- Left-factored, the original productions become
  - A → α A'
  - A' → β1 | β2
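One round of left factoring can be sketched as follows. The representation is assumed for illustration (right-hand sides as lists of symbols, [] for ε, primes for new nonterminals), and the names common_prefix and left_factor are made up. Productions are grouped by their first symbol, and each group with more than one member is replaced by α A' with A' deriving the remaining suffixes.

```python
def common_prefix(seqs):
    """Longest common prefix of a list of symbol lists."""
    prefix = []
    for symbols in zip(*seqs):
        if len(set(symbols)) != 1:
            break
        prefix.append(symbols[0])
    return prefix

def left_factor(nt, productions):
    """One round of left factoring for the productions of nt.
    productions: list of right-hand sides (lists of symbols); [] is epsilon."""
    groups = {}
    for rhs in productions:
        groups.setdefault(rhs[0] if rhs else None, []).append(rhs)
    result = {nt: []}
    fresh = nt
    for first, group in groups.items():
        if first is None or len(group) == 1:
            result[nt] += group           # nothing to factor in this group
            continue
        alpha = common_prefix(group)
        fresh += "'"                      # assumed naming scheme: A', A'', ...
        result[nt].append(alpha + [fresh])
        result[fresh] = [rhs[len(alpha):] for rhs in group]
    return result

factored = left_factor("S", [
    "if b then S else S".split(),
    "if b then S".split(),
    ["a"],
])
print(factored)
```

For the dangling-else productions this gives S → if b then S S' | a and S' → else S | ε. A suffix may itself need another round of factoring, which this one-round sketch does not attempt.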
29Non-Context-Free Language Constructs
- Examples of non-context-free languages are
  - L1 = {wcw | w is of the form (a|b)*}
  - L2 = {a^n b^m c^n d^m | n ≥ 1 and m ≥ 1}
  - L3 = {a^n b^n c^n | n ≥ 0}
- Languages similar to these that are context free
  - L1' = {wcw^R | w is of the form (a|b)*} (w^R stands for w reversed)
  - This language is generated by the grammar
    - S → aSa | bSb | c
  - L2' = {a^n b^m c^m d^n | n ≥ 1 and m ≥ 1}
  - This language is generated by the grammar
    - S → aSd | aAd
    - A → bAc | bc
30Non-Context-Free Language Constructs (Cont.)
- L2'' = {a^n b^n c^m d^m | n ≥ 1 and m ≥ 1}
- is generated by the grammar
  - S → AB
  - A → aAb | ab
  - B → cBd | cd
- L3' = {a^n b^n | n ≥ 1}
- is generated by the grammar
  - S → aSb | ab
- This language is not definable by any regular expression
31Non-Context-Free Language Constructs (Cont.)
- Suppose we could construct a DFSM D accepting L3'.
- D must have a finite number of states, say k.
- Consider the sequence of states s0, s1, s2, ..., sk entered by D having read ε, a, aa, ..., a^k.
- Since D only has k states, two of the states in the sequence have to be equal. Say, si = sj (i ≠ j).
- From si, a sequence of i b's leads to an accepting (final) state. Therefore, the same sequence of i b's will also lead to an accepting state from sj. Therefore D would accept a^j b^i, which means that the language accepted by D is not identical to L3'. A contradiction.
32Parsing
- The parsing problem is: Given string of tokens w, find a parse tree whose frontier is w. (Equivalently, find a derivation of w.)
- A parser for a grammar G reads a list of tokens and finds a parse tree if they form a sentence (or reports an error otherwise)
- Two classes of algorithms for parsing
  - Top-down
  - Bottom-up
33Parser generators
- A parser generator is a program that reads a grammar and produces a parser
- The best known parser generator is yacc. It produces bottom-up parsers
- Most parser generators - including yacc - do not work for every CFG; they accept a restricted class of CFGs that can be parsed efficiently using the method employed by that parser generator
34Top-down parsing
- Starting from parse tree containing just S, build tree down toward input. Expand left-most non-terminal.
- Algorithm (next slide)
35Top-down parsing (cont.)
- Let input = a1a2...an
- current sentential form (csf) = S
- loop
  - suppose csf = t1...tkAα
  - if t1...tk ≠ a1...ak, it's an error
  - based on ak+1..., choose production A → β
  - csf becomes t1...tkβα
36Top-down parsing example
- Grammar H: L → E L | E,  E → a | b
- Input: ab
- Sentential form / Input (the slide's parse-tree column is a figure)
  - L / ab
  - E L / ab
  - a L / ab
37Top-down parsing example (cont.)
- Sentential form / Input (the slide's parse-tree column is a figure)
  - a E / ab
  - a b / ab
38LL(1) parsing
- Efficient form of top-down parsing
- Use only first symbol of remaining input (ak+1) to choose next production. That is, employ a function M: Σ × N → P in "choose production" step of algorithm.
- When this works, grammar is called LL(1)
39LL(1) examples
- Example 1
  - H: L → E L | E,  E → a | b
  - Given input ab, so next symbol is a.
  - Which production to use? Can't tell.
  - ⇒ H not LL(1)
40LL(1) examples
- Example 2
  - Exp → Term Exp'
  - Exp' → $ | + Exp
  - Term → id
  - (Use $ for end-of-input symbol.)
Grammar is LL(1): Exp and Term have only one production; Exp' has two productions but only one is applicable at any time.
41Nonrecursive predictive parsing
- It is possible to build a nonrecursive predictive parser by maintaining a stack explicitly, rather than implicitly via recursive calls
- The key problem during predictive parsing is that of determining the production to be applied for a non-terminal
42Nonrecursive predictive parsing
- Algorithm. Nonrecursive predictive parsing
  - Set ip to point to the first symbol of w$.
  - repeat
    - Let X be the top of the stack symbol and a the symbol pointed to by ip
    - if X is a terminal or $ then
      - if X = a then
        - pop X from the stack and advance ip
      - else error()
    - else // X is a nonterminal
      - if M[X,a] = X → Y1 Y2 ... Yk then
        - pop X from the stack
        - push Yk, Yk-1, ..., Y1 onto the stack with Y1 on top
        - (push nothing if Y1 Y2 ... Yk is ε)
        - output the production X → Y1 Y2 ... Yk
      - else error()
  - until X = $
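The loop above can be sketched directly in Python. Everything concrete here is an assumption made for illustration: the table is a dict keyed by (nonterminal, lookahead), right-hand sides are lists with [] for ε, and the sample grammar is Exp → Term Exp', Exp' → + Exp | ε, Term → id, with $ kept as the end marker outside the grammar. The names predictive_parse and TABLE are hypothetical.

```python
def predictive_parse(table, start, tokens):
    """Table-driven LL(1) parse. table maps (nonterminal, lookahead) to a
    right-hand side (list of symbols, [] for epsilon). Returns the list of
    productions used, in leftmost-derivation order."""
    nonterminals = {nt for nt, _ in table}
    stack = ["$", start]                 # top of stack is the end of the list
    inp = list(tokens) + ["$"]
    ip = 0
    output = []
    while True:
        x, a = stack[-1], inp[ip]
        if x not in nonterminals:        # X is a terminal or $
            if x != a:
                raise SyntaxError(f"expected {x!r}, found {a!r}")
            if x == "$":
                return output            # stack and input exhausted together
            stack.pop()
            ip += 1
        else:
            if (x, a) not in table:
                raise SyntaxError(f"no rule for {x} on {a!r}")
            rhs = table[(x, a)]
            stack.pop()
            stack.extend(reversed(rhs))  # push Yk ... Y1, leaving Y1 on top
            output.append((x, rhs))

TABLE = {
    ("Exp", "id"): ["Term", "Exp'"],
    ("Exp'", "+"): ["+", "Exp"],
    ("Exp'", "$"): [],
    ("Term", "id"): ["id"],
}
steps = predictive_parse(TABLE, "Exp", ["id", "+", "id"])
for lhs, rhs in steps:
    print(lhs, "->", " ".join(rhs) or "ε")
```

The explicit list `stack` plays the role that the call stack plays in a recursive-descent parser.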
43LL(1) grammars
- No left recursion
  - A → Aα: If this production is chosen, parse makes no progress.
- No common prefixes
  - A → αβ | αγ
  - Can fix by "left factoring"
    - A → αA'
    - A' → β | γ
44LL(1) grammars (cont.)
- No ambiguity
  - Precise definition requires that production to choose be unique ("choose" function M very hard to calculate otherwise)
45Top-down Parsing
[Figure: two snapshots of a parse tree rooted at L, the start symbol, with children E0 ... En. The first has input tokens <t0, t1, ..., ti, ...>, the second the remaining input tokens <ti, ...>.]
From left to right, grow the parse tree downwards
46Checking LL(1)-ness
- For any sequence of grammar symbols α, define set FIRST(α) ⊆ Σ to be
  - FIRST(α) = { a | α ⇒* aβ for some β }
47Checking LL(1)-ness
- Define: Grammar G = (N, Σ, P, S) is LL(1) iff whenever there are two left-most derivations (in which the leftmost non-terminal is always expanded first)
  - S ⇒* wAα ⇒ wβα ⇒* wx
  - S ⇒* wAα ⇒ wγα ⇒* wy
- such that FIRST(x) = FIRST(y), it follows that β = γ
- In other words, given
  - 1. A string wAα in V* and
  - 2. The first terminal symbol to be derived from Aα, say t
- there is at most one production that can be applied to A to yield a derivation of any terminal string beginning with wt
- FIRST sets can often be calculated by inspection
48FIRST Sets
Exp → Term Exp'    Exp' → $ | + Exp    Term → id    (Use $ for end-of-input symbol)
FIRST($) = {$}    FIRST(+ Exp) = {+}
FIRST($) ∩ FIRST(+ Exp) = ∅ ⇒ grammar is LL(1)
49FIRST Sets
L → E L | E    E → a | b
FIRST(E L) = {a, b} = FIRST(E)
FIRST(E L) ∩ FIRST(E) ≠ ∅ ⇒ grammar not LL(1).
50Computing FIRST Sets
- Algorithm. Compute FIRST(X) for all grammar symbols X
  - forall X ∈ V do FIRST(X) = {}
  - forall X ∈ Σ (X is a terminal) do FIRST(X) = {X}
  - forall productions X → ε do FIRST(X) = FIRST(X) ∪ {ε}
  - repeat
    - c: forall productions X → Y1 Y2 ... Yk do
      - forall i ∈ [1,k] do
        - FIRST(X) = FIRST(X) ∪ (FIRST(Yi) - {ε})
        - if ε ∉ FIRST(Yi) then continue c
      - FIRST(X) = FIRST(X) ∪ {ε}
  - until no more terminals or ε are added to any FIRST set
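The fixed-point iteration above translates almost line for line into Python. This is a sketch under assumed conventions: the grammar is a dict from nonterminal to right-hand sides (lists of symbols, [] for ε), terminals are passed in explicitly, and the string "ε" stands for the empty string. The sample grammar A → T x | T y, T → w | ε is used only as test data.

```python
EPS = "ε"

def first_sets(grammar, terminals):
    """grammar: dict nonterminal -> list of right-hand sides (lists of symbols,
    [] for epsilon). Returns dict symbol -> FIRST set."""
    first = {t: {t} for t in terminals}
    first.update({nt: set() for nt in grammar})
    changed = True
    while changed:                        # repeat until nothing is added
        changed = False
        for x, rhss in grammar.items():
            for rhs in rhss:
                before = len(first[x])
                for y in rhs:
                    first[x] |= first[y] - {EPS}
                    if EPS not in first[y]:
                        break             # Yi cannot vanish: stop scanning this rhs
                else:
                    # every Yi can derive epsilon (or the rhs is empty)
                    first[x].add(EPS)
                changed |= len(first[x]) != before
    return first

grammar = {"A": [["T", "x"], ["T", "y"]], "T": [["w"], []]}
first = first_sets(grammar, terminals={"w", "x", "y"})
print(sorted(first["A"]), sorted(first["T"]))
```

Python's for/else takes the place of the labeled "continue c": the else branch runs only when the inner loop was not cut short by break.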
51FIRST Sets of Strings of Symbols
- FIRST(X1X2...Xn) is the union of FIRST(X1) and all FIRST(Xi) such that ε ∈ FIRST(Xk) for k = 1, 2, ..., i-1 (with ε itself left out of each of these sets)
- FIRST(X1X2...Xn) contains ε iff ε ∈ FIRST(Xk) for k = 1, 2, ..., n
52FIRST Sets do not Suffice
- Given the productions
  - A → T x
  - A → T y
  - T → w
  - T → ε
- T → w should be applied when the next input token is w.
- T → ε should be applied whenever the next terminal (the one pointed to by ip) is either x or y
53FOLLOW Sets
- For any nonterminal X, define set FOLLOW(X) ⊆ Σ as
  - FOLLOW(X) = { a | S ⇒* αXaβ }
54Computing the FOLLOW Set
- Algorithm. Compute FOLLOW(X) for all nonterminals X
  - FOLLOW(S) = {$}
  - forall productions A → αBβ do FOLLOW(B) = FOLLOW(B) ∪ (FIRST(β) - {ε})
  - repeat
    - forall productions A → αB or A → αBβ with ε ∈ FIRST(β) do
      - FOLLOW(B) = FOLLOW(B) ∪ FOLLOW(A)
  - until all FOLLOW sets remain the same
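A compact sketch of the FOLLOW computation is given below. It assumes the same illustrative conventions as elsewhere (grammar as a dict of right-hand-side lists, [] for ε, "$" as end marker), recomputes FIRST with a small helper so the block runs on its own, and merges the two phases of the algorithm into a single fixed-point loop, which gives the same sets. All function names are made up.

```python
EPS, END = "ε", "$"

def first_of(symbols, first):
    """FIRST of a string of symbols, given FIRST sets of the nonterminals.
    Symbols missing from `first` are treated as terminals."""
    result = set()
    for y in symbols:
        fy = first.get(y, {y})
        result |= fy - {EPS}
        if EPS not in fy:
            return result
    return result | {EPS}

def first_sets(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for x, rhss in grammar.items():
            for rhs in rhss:
                new = first_of(rhs, first)
                if not new <= first[x]:
                    first[x] |= new
                    changed = True
    return first

def follow_sets(grammar, start):
    first = first_sets(grammar)
    follow = {nt: set() for nt in grammar}
    follow[start].add(END)
    changed = True
    while changed:
        changed = False
        for a, rhss in grammar.items():
            for rhs in rhss:
                for i, b in enumerate(rhs):
                    if b not in grammar:
                        continue             # FOLLOW is defined for nonterminals only
                    rest = first_of(rhs[i + 1:], first)
                    new = rest - {EPS}
                    if EPS in rest:          # beta is empty or can derive epsilon
                        new |= follow[a]
                    if not new <= follow[b]:
                        follow[b] |= new
                        changed = True
    return follow

grammar = {"A": [["T", "x"], ["T", "y"]], "T": [["w"], []]}
follow = follow_sets(grammar, "A")
print(sorted(follow["A"]), sorted(follow["T"]))
```

For this grammar FOLLOW(T) = {x, y}, which is exactly the set of lookaheads on which T → ε should be chosen.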
55Construction of a predictive parsing table
- Algorithm. Construction of a predictive parsing table
  - M[:,:] = {}
  - forall productions A → α do
    - forall terminals a ∈ FIRST(α) do
      - M[A,a] = M[A,a] ∪ {A → α}
    - if ε ∈ FIRST(α) then
      - forall b ∈ FOLLOW(A) do
        - M[A,b] = M[A,b] ∪ {A → α}
  - Make all empty entries of M be error
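The table construction can be sketched as below. To keep the example short, the FIRST and FOLLOW sets of the nonterminals are written down by inspection and passed in as data; the grammar (Exp → Term Exp', Exp' → + Exp | ε, Term → id), the dict-based table, and the function names are all assumptions made for illustration. Missing keys in the table play the role of error entries.

```python
EPS = "ε"

def first_of(symbols, first):
    """FIRST of a string of symbols; symbols without an entry are terminals."""
    result = set()
    for y in symbols:
        fy = first.get(y, {y})
        result |= fy - {EPS}
        if EPS not in fy:
            return result
    return result | {EPS}

def build_table(grammar, first, follow):
    """Return dict (A, a) -> list of right-hand sides; missing keys mean error."""
    table = {}
    for a_nt, rhss in grammar.items():
        for rhs in rhss:
            f = first_of(rhs, first)
            lookaheads = f - {EPS}
            if EPS in f:                  # A -> alpha can vanish: use FOLLOW(A)
                lookaheads |= follow[a_nt]
            for t in lookaheads:
                table.setdefault((a_nt, t), []).append(rhs)
    return table

def is_ll1(table):
    """The grammar is LL(1) iff no table entry holds more than one production."""
    return all(len(entries) == 1 for entries in table.values())

grammar = {"Exp": [["Term", "Exp'"]], "Exp'": [["+", "Exp"], []], "Term": [["id"]]}
first = {"Exp": {"id"}, "Exp'": {"+", EPS}, "Term": {"id"}}
follow = {"Exp": {"$"}, "Exp'": {"$"}, "Term": {"+", "$"}}
table = build_table(grammar, first, follow)
print(table, is_ll1(table))
```

An entry with two or more productions signals a conflict, so the same table doubles as an LL(1) check.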
56Another Definition of LL(1)
- Define: Grammar G is LL(1) if for every A ∈ N with productions A → α1 | ... | αn
  - FIRST(αi FOLLOW(A)) ∩ FIRST(αj FOLLOW(A)) = ∅ for all i ≠ j
57Regular Languages
- Definition. A regular grammar is one whose productions are all of the type
  - A → aB
  - A → a
- A Regular Expression is either
  - a
  - R1 | R2
  - R1 R2
  - R*
58Nondeterministic Finite State Automaton
[Figure: nondeterministic finite state automaton with states 0, 1, 2, 3, a start arrow, and transitions labeled a and b.]
59Regular Languages
- Theorem. The classes of languages
- Generated by a regular grammar
- Expressed by a regular expression
- Recognized by a NDFS automaton
- Recognized by a DFS automaton
- coincide.
60Deterministic Finite Automaton
[Figure: scanner DFA with states START, NUM, KEYWORD, and OPERATOR; edge labels are space/tab/new line, digit, letter, and the operator characters +, *, -, /, (, ).]
61Scanner code
- state = start
- loop
  - if no input character buffered then read one, and add it to the accumulated token
  - case state of
    - start:
      - case input_char of
        - A..Z, a..z: state = id
        - 0..9: state = num
        - else ...
      - end
    - id:
      - case input_char of
        - A..Z, a..z: state = id
        - 0..9: state = id
        - else ...
      - end
    - num:
      - case input_char of
        - 0..9: ...
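A runnable sketch of the same state machine is shown below. The slide elides the "else" branches, so their behaviour here is an assumption: a non-matching character ends the current token, whitespace is skipped, and the characters + * - / ( ) become single-character operator tokens. The function name scan and the token kinds are made up, and str.isalpha also accepts non-ASCII letters, which is broader than A..Z, a..z.

```python
def scan(text):
    """Split text into (kind, lexeme) tokens with an explicit state machine."""
    tokens, state, lexeme = [], "start", ""
    i = 0
    while i <= len(text):
        ch = text[i] if i < len(text) else ""   # "" marks end of input
        if state == "start":
            if ch.isalpha():
                state, lexeme = "id", ch
            elif ch.isdigit():
                state, lexeme = "num", ch
            elif ch and ch in "+*-/()":
                tokens.append(("op", ch))
            elif ch and not ch.isspace():
                raise ValueError(f"unexpected character {ch!r}")
            i += 1
        elif state == "id":
            if ch.isalnum():
                lexeme += ch
                i += 1
            else:                  # token ends; re-examine ch in the start state
                tokens.append(("id", lexeme))
                state = "start"
        else:                      # state == "num"
            if ch.isdigit():
                lexeme += ch
                i += 1
            else:
                tokens.append(("num", lexeme))
                state = "start"
    return tokens

print(scan("x1 + 42*(y)"))
```

Not advancing `i` when a token ends corresponds to keeping the input character buffered for the next pass through the loop.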
62Table-driven DFA
63Language Classes
[Figure: nested language classes, from outermost to innermost: L0, CSL, CFL (NPA), LR(1), LL(1), RL (DFA = NFA).]
64Question
- Are regular expressions, as provided by Perl or
other languages, sufficient for parsing nested
structures, e.g. XML files?