Title: CSC 3130: Automata theory and formal languages
1. Fall 2009
The Chinese University of Hong Kong
CSC 3130 Automata theory and formal languages
Parsers for programming languages
Andrej Bogdanov, http://www.cse.cuhk.edu.hk/~andrejb/csc3130
2. CFG of the Java programming language
Identifier:
    IDENTIFIER
QualifiedIdentifier:
    Identifier { . Identifier }
Literal:
    IntegerLiteral
    FloatingPointLiteral
    CharacterLiteral
    StringLiteral
    BooleanLiteral
    NullLiteral
Expression:
    Expression1 [ AssignmentOperator Expression1 ]
AssignmentOperator:
    =  +=  -=  *=  /=  ...
from http://java.sun.com/docs/books/jls/second_edition/html/syntax.doc.html#52996
3. Parsing Java programs
    class Point2d {
        /* The X and Y coordinates of the point--instance variables */
        private double x;
        private double y;
        private boolean debug;    // A trick to help with debugging

        public Point2d (double px, double py) {    // Constructor
            x = px;
            y = py;
            debug = false;    // turn off debugging
        }

        public Point2d () {    // Default constructor
            this (0.0, 0.0);    // Invokes 2 parameter Point2D constructor
        }
        // Note that a this() invocation must be the BEGINNING of
        // statement body of constructor

        public Point2d (Point2d pt) {    // Another constructor
            x = pt.getX();
            y = pt.getY();
        ...
Simple Java program: about 1000 symbols
4. Parsing algorithms
- How long would it take to parse this?
    exhaustive algorithm: about 10^80 years (longer than life of universe)
    CYK algorithm: about 1 week!
- Can we parse faster?
- No! CYK is essentially the fastest known general-purpose parsing algorithm
5. Another way of thinking
Scientist: Find an algorithm that can parse strings in any grammar
Engineer: Design your grammar so it has a very fast parsing algorithm
6. An example
Grammar: S → Tc (1)    T → TA (2) | A (3)    A → aTb (4) | ab (5)
input: abaabbc
Stack    Input      Action
ε        abaabbc    shift
a        baabbc     shift
ab       aabbc      reduce (5)
A        aabbc      reduce (3)
T        aabbc      shift
Ta       abbc       shift
Taa      bbc        shift
Taab     bc         reduce (5)
TaA      bc         reduce (3)
TaT      bc         shift
TaTb     c          reduce (4)
TA       c          reduce (2)
T        c          shift
Tc       ε          reduce (1)
S        ε
[Figure: parse tree of abaabbc]
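The shift / reduce trace of this example can be replayed mechanically. The sketch below is only an illustration: the names PRODUCTIONS, replay and ACTIONS are made up here, and the list of actions is copied from the slide rather than computed, so this checks the trace but is not yet a parser.

```python
# Grammar from the slide, keyed by production number.
PRODUCTIONS = {
    1: ("S", "Tc"),
    2: ("T", "TA"),
    3: ("T", "A"),
    4: ("A", "aTb"),
    5: ("A", "ab"),
}

def replay(text, actions):
    """Apply a given list of shift / reduce actions and record every step."""
    stack, rest, trace = "", text, []
    for action in actions:
        trace.append((stack, rest, action))
        if action == "shift":
            # Move the next input symbol onto the stack.
            stack, rest = stack + rest[0], rest[1:]
        else:
            lhs, rhs = PRODUCTIONS[action]
            # A reduction is only legal when the right-hand side is on top of the stack.
            assert stack.endswith(rhs), f"cannot reduce by ({action}) on stack {stack!r}"
            stack = stack[: len(stack) - len(rhs)] + lhs
    trace.append((stack, rest, "done"))
    return trace

# Action sequence copied from the slide (numbers are production numbers).
ACTIONS = ["shift", "shift", 5, 3, "shift", "shift", "shift", 5, 3,
           "shift", 4, 2, "shift", 1]
TRACE = replay("abaabbc", ACTIONS)
for stack, rest, action in TRACE:
    print(f"{stack or 'ε':6} {rest or 'ε':8} {action}")
```

Each printed row corresponds to one row of the Stack / Input / Action table; the remaining slides are about how to choose the actions automatically.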
7. Items
Grammar: S → Tc (1)    T → A (3)    T → TA (2)    A → aTb (4)    A → ab (5)
Items (a production with a dot • marking how much of its right-hand side has been matched):
A → •aTb    A → a•Tb    A → aT•b    A → aTb•
A → •ab     A → a•b     A → ab•
T → •A      T → A•
S → •Tc     S → T•c     S → Tc•
T → •TA     T → T•A     T → TA•
Stack    Input      Action
ε        abaabbc    shift
a        baabbc     shift
ab       aabbc      reduce (5)
A        aabbc      reduce (3)
T        aabbc      shift
Ta       abbc       shift
Idea of parsing algorithm: Try to match complete items to top of stack
8. Some terminology
Grammar: S → Tc (1)    T → TA (2) | A (3)    A → aTb (4) | ab (5)
input: abaabbc
Stack    Input      Action
ε        abaabbc    shift
a        baabbc     shift
ab       aabbc      reduce (5)
A        aabbc      reduce (3)
T        aabbc      shift
Ta       abbc       shift
Taa      bbc        shift
Taab     bc         reduce (5)
TaA      bc         reduce (3)
TaT      bc         shift
TaTb     c          reduce (4)
TA       c          reduce (2)
T        c          shift
Tc       ε          reduce (1)
S        ε
handle
valid items aTb, ab
valid items Ta, Tc, aTb
9. Outline of LR(0) parsing algorithm
- As the string is being read, it is pushed on a stack
- Algorithm keeps track of all valid items
- Algorithm can perform two actions
    shift: no complete item is valid
    reduce: there is one valid item, and it is complete
10. Running the algorithm
Grammar: A → aAb | ab
A ⇒ aAb ⇒ aabb
Stack    Input    Valid items                               Action
ε        aabb     A → •aAb, A → •ab                         S
a        abb      A → a•Ab, A → a•b, A → •aAb, A → •ab      S
aa       bb       A → a•Ab, A → a•b, A → •aAb, A → •ab      S
aab      b        A → ab•                                   R
aA       b        A → aA•b                                  S
aAb      ε        A → aAb•                                  R
A        ε
11. How to update valid items
Notation: a, b: terminals; A, B: variables; X, Y: mixed symbols; α, β: mixed strings
- Initial set of valid items
    S → •α for every production S → α
- Updating valid items on shift b
    A → α•bβ is updated to A → αb•β
    A → α•Xβ disappears if X ≠ b
- After these updates, for every valid item A → α•Cβ and production C → δ, we also add C → •δ as a valid item
12. How to update valid items
- Updating valid items on reduce β to B
- First, we backtrack to valid items before reduce
- Then, we apply same rules as for shift B (as if B were a terminal)
    A → α•Bβ is updated to A → αB•β
    A → α•Xβ disappears if X ≠ B
    C → •δ is added for every valid item A → α•Cβ and production C → δ
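The update rules for valid items can be written as two small functions. This is a minimal sketch under some assumptions that are not in the slides: an item is stored as a tuple (lhs, rhs, dot), uppercase letters are variables, and the names close, initial_items, advance and show are chosen for illustration. The grammar is the running example A → aAb | ab.

```python
# Running example grammar: A -> aAb | ab (uppercase letters are variables).
GRAMMAR = {"A": ["aAb", "ab"]}
START = "A"

def close(items):
    """For every item A -> α•Cβ and production C -> δ, add C -> •δ, until nothing changes."""
    items = set(items)
    todo = list(items)
    while todo:
        lhs, rhs, dot = todo.pop()
        if dot < len(rhs) and rhs[dot] in GRAMMAR:
            for delta in GRAMMAR[rhs[dot]]:
                new = (rhs[dot], delta, 0)
                if new not in items:
                    items.add(new)
                    todo.append(new)
    return frozenset(items)

def initial_items():
    """S -> •α for every production S -> α of the start variable, plus added items."""
    return close((START, rhs, 0) for rhs in GRAMMAR[START])

def advance(items, symbol):
    """Update on shifting `symbol`: a terminal, or a variable right after a reduce."""
    # Items with `symbol` after the dot move the dot; all other items disappear.
    moved = {(lhs, rhs, dot + 1) for lhs, rhs, dot in items
             if dot < len(rhs) and rhs[dot] == symbol}
    return close(moved)

def show(items):
    """Render items with an explicit dot, for printing."""
    return sorted(f"{lhs} → {rhs[:dot]}•{rhs[dot:]}" for lhs, rhs, dot in items)

after_a = advance(initial_items(), "a")
print(show(after_a))
print(show(advance(after_a, "b")))
```

For a reduce, one would first go back to the item set that was valid before the handle was read and then call advance with the variable, exactly as if it were a terminal.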
13. Viable item updates by NFA
- States of NFA will be items (plus a start state q0)
- For every item S → •α we have a transition
    q0 --ε--> S → •α
- For every item A → α•Xβ we have a transition
    A → α•Xβ --X--> A → αX•β
- For every item A → α•Cβ and production C → δ
    A → α•Cβ --ε--> C → •δ
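The three transition rules can be turned into a table of NFA transitions. The sketch below assumes a representation that is not in the slides: items are tuples (lhs, rhs, dot), the start state is the string "q0", and the function name build_item_nfa is hypothetical.

```python
from collections import defaultdict

GRAMMAR = {"A": ["aAb", "ab"]}   # example grammar A -> aAb | ab
START = "A"
EPS = "ε"

def build_item_nfa(grammar, start):
    """Return {(state, label): set of next states}; a state is 'q0' or (lhs, rhs, dot)."""
    delta = defaultdict(set)
    # Rule 1: q0 --ε--> S -> •α for every production of the start variable.
    for rhs in grammar[start]:
        delta[("q0", EPS)].add((start, rhs, 0))
    for lhs, bodies in grammar.items():
        for rhs in bodies:
            for dot, symbol in enumerate(rhs):
                item = (lhs, rhs, dot)
                # Rule 2: A -> α•Xβ --X--> A -> αX•β.
                delta[(item, symbol)].add((lhs, rhs, dot + 1))
                # Rule 3: A -> α•Cβ --ε--> C -> •δ for every production C -> δ.
                for body in grammar.get(symbol, []):
                    delta[(item, EPS)].add((symbol, body, 0))
    return dict(delta)

NFA = build_item_nfa(GRAMMAR, START)
for (state, label), targets in NFA.items():
    print(state, label, sorted(targets))
```

Complete items such as A → ab• have no outgoing transitions, so they never appear as the source of an entry.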
14. Example
A → aAb | ab
[Figure: NFA whose states are q0 and the items A → •aAb, A → a•Ab, A → aA•b, A → aAb•, A → •ab, A → a•b, A → ab•]
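15. Convert NFA to DFA
State 1: A → •aAb, A → •ab
State 2: A → a•Ab, A → a•b, A → •aAb, A → •ab
State 3: A → ab•
State 4: A → aA•b
State 5: A → aAb•
Transitions: 1 --a--> 2, 2 --a--> 2, 2 --b--> 3, 2 --A--> 4, 4 --b--> 5; every other transition goes to the state "die"
states correspond to sets of valid items
transitions are labeled by variables / terminals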
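The conversion is the usual subset construction. As a rough sketch, the NFA for A → aAb | ab is written out by hand below, with items encoded as strings (an encoding chosen here, not taken from the slides); eps_closure and subset_construction are illustrative names, and missing transitions stand for the "die" state.

```python
EPS = "ε"
# Item NFA for A -> aAb | ab, written out by hand.
NFA = {
    ("q0", EPS): {"A→•aAb", "A→•ab"},
    ("A→•aAb", "a"): {"A→a•Ab"},
    ("A→a•Ab", EPS): {"A→•aAb", "A→•ab"},
    ("A→a•Ab", "A"): {"A→aA•b"},
    ("A→aA•b", "b"): {"A→aAb•"},
    ("A→•ab", "a"): {"A→a•b"},
    ("A→a•b", "b"): {"A→ab•"},
}

def eps_closure(states):
    """All NFA states reachable from `states` using ε-transitions only."""
    seen = set(states)
    todo = list(seen)
    while todo:
        for nxt in NFA.get((todo.pop(), EPS), ()):
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return frozenset(seen)

def subset_construction():
    """Return (list of DFA states, transition dict); absent entries mean 'die'."""
    # q0 has only ε-transitions, so it is dropped to leave a pure set of items.
    start = eps_closure({"q0"}) - {"q0"}
    states, todo, dfa = [start], [start], {}
    while todo:
        current = todo.pop()
        labels = {label for (state, label) in NFA
                  if state in current and label != EPS}
        for label in sorted(labels):
            moved = {t for s in current for t in NFA.get((s, label), ())}
            target = eps_closure(moved)
            dfa[(current, label)] = target
            if target not in states:
                states.append(target)
                todo.append(target)
    return states, dfa

STATES, DFA = subset_construction()
for state in STATES:
    print(sorted(state))
```

The order in which states are discovered need not match the numbering 1 to 5 used on the slide, but the item sets and transitions are the same.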
16. Shift states and reduce states
[Figure: the DFA for A → aAb | ab with states 1 to 5, as on the previous slide]
1, 2, 4 are shift states
3, 5 are reduce states
17. Attempt at parsing with DFA
Grammar: A → aAb | ab
A ⇒ aAb ⇒ aabb
Stack    Input    Valid items                               DFA state    Action
ε        aabb     A → •aAb, A → •ab                         1            S
a        abb      A → a•Ab, A → a•b, A → •aAb, A → •ab      2            S
aa       bb       A → a•Ab, A → a•b, A → •aAb, A → •ab      2            S
aab      b        A → ab•                                   3            R
aA       b        A → aA•b                                  ?
18. Remember the state in stack!
Grammar: A → aAb | ab
A ⇒ aAb ⇒ aabb
Stack      Input    Valid items                               DFA state    Action
1          aabb     A → •aAb, A → •ab                         1            S
1a2        abb      A → a•Ab, A → a•b, A → •aAb, A → •ab      2            S
1a2a2      bb       A → a•Ab, A → a•b, A → •aAb, A → •ab      2            S
1a2a2b3    b        A → ab•                                   3            R
1a2A4      b        A → aA•b                                  4            S
1a2A4b5    ε        A → aAb•                                  5            R
1A         ε
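Keeping states on the stack is what makes the parser loop mechanical. The sketch below hard-codes the DFA for A → aAb | ab from the earlier slide; the names GOTO, REDUCE and lr0_parse, and the rule "accept when the stack is back to state 1 with empty input", are assumptions made here for illustration.

```python
# DFA for A -> aAb | ab; a missing entry stands for the "die" state.
GOTO = {(1, "a"): 2, (2, "a"): 2, (2, "b"): 3, (2, "A"): 4, (4, "b"): 5}
# Reduce states, each with the production completed in that state.
REDUCE = {3: ("A", "ab"), 5: ("A", "aAb")}

def lr0_parse(text, start="A"):
    """Return the sequence of (stack, remaining input) pairs, or raise ValueError."""
    stack = [1]                  # alternates state, symbol, state, ...
    rest = text
    trace = []
    while True:
        state = stack[-1]
        trace.append(("".join(map(str, stack)), rest))
        if state in REDUCE:
            lhs, rhs = REDUCE[state]
            # Pop the handle together with the states stored above its symbols.
            del stack[len(stack) - 2 * len(rhs):]
            if stack == [1] and lhs == start and not rest:
                trace.append(("1" + lhs, rest))
                return trace
            symbol = lhs         # continue as if the variable had been shifted
        else:
            if not rest:
                raise ValueError("input ended in a shift state")
            symbol, rest = rest[0], rest[1:]
        nxt = GOTO.get((stack[-1], symbol))
        if nxt is None:
            raise ValueError(f"no transition on {symbol!r} from state {stack[-1]}")
        stack += [symbol, nxt]

for stack, rest in lr0_parse("aabb"):
    print(f"{stack:8} {rest or 'ε'}")
```

After a reduce, the state left on top of the stack tells the parser where to continue, which is exactly the information that was missing in the previous attempt.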
19. Reconstructing the parse tree
Grammar: A → aAb | ab
A ⇒ aAb ⇒ aabb
Stack    Input    Valid items                               DFA state    Action
1        aabb     A → •aAb, A → •ab                         1            S
12       abb      A → a•Ab, A → a•b, A → •aAb, A → •ab      2            S
122      bb       A → a•Ab, A → a•b, A → •aAb, A → •ab      2            S
1223     b        A → ab•                                   3            R
124      b        A → aA•b                                  4            S
1245     ε        A → aAb•                                  5            R
1        ε
[Figure: parse tree of aabb]
20. LR(0) grammars and deterministic PDAs
- The parsing procedure can be implemented by a deterministic pushdown automaton
- A PDA is deterministic if in every state there is at most one possible transition
    - for every input symbol and pop symbol, including ε
- Example: PDA for w#w^R is deterministic, but PDA for ww^R is not
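To see why a middle marker makes determinism easy, here is a small stack-based recognizer. It assumes the deterministic example is the language of strings w#w^R with w over {a, b}, where # marks the middle; the function name accepts_marked_palindrome is made up for this sketch, and a Python list stands in for the PDA stack.

```python
def accepts_marked_palindrome(text, marker="#"):
    """Deterministic stack recognizer for strings w + marker + reverse(w), w over {a, b}."""
    stack = []
    seen_marker = False
    for ch in text:
        if ch == marker:
            if seen_marker:
                return False          # a second marker can never match
            seen_marker = True        # switch from pushing to popping
        elif ch not in "ab":
            return False
        elif not seen_marker:
            stack.append(ch)          # first half: push
        elif not stack or stack.pop() != ch:
            return False              # second half: pop and compare
    return seen_marker and not stack

print(accepts_marked_palindrome("ab#ba"))
print(accepts_marked_palindrome("ab#ab"))
```

Every step has exactly one possible move. Without the marker, the machine would have to guess where the middle of the string is, which is where nondeterminism comes in.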
21. LR(0) grammars and deterministic PDAs
- Not every PDA can be made deterministic
- Since PDAs are equivalent to CFGs, LR(0) parsing algorithm must fail for some CFG, e.g.
    L = {ww^R : w ∈ {a, b}*}
- Why does LR(0) parsing algorithm fail?
22. Example 1
L = {ww^R : w ∈ {a, b}*}
A → aAa | bAb | ε
23. Example 1
L = {ww^R : w ∈ {a, b}*}
A → aAa | bAb | ε
[Figure: DFA of valid item sets for this grammar, with transitions labeled a, b and A]
24. Example 1
L = {ww^R : w ∈ {a, b}*}
A → aAa | bAb | ε
[Figure: DFA of valid item sets for this grammar, with transitions labeled a, b and A]
input: abba
25. When you can't LR(0) parse
- Algorithm can perform two actions
    shift (S): no complete item is valid
    reduce (R): there is one valid item, and it is complete
- What if
    some valid items complete, some not: S / R conflict
    more than one valid complete item: R / R conflict
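The four cases can be checked automatically once the sets of valid items are known. The sketch below builds all reachable item sets directly (by moving the dot and adding new items, without going through an explicit NFA) and classifies each one; the tuple encoding of items and the names close, item_sets and classify are assumptions made for illustration. No extra start production is added, following the slides.

```python
def close(grammar, items):
    """Add C -> •δ for every item A -> α•Cβ and production C -> δ."""
    items = set(items)
    todo = list(items)
    while todo:
        lhs, rhs, dot = todo.pop()
        if dot < len(rhs) and rhs[dot] in grammar:
            for body in grammar[rhs[dot]]:
                new = (rhs[dot], body, 0)
                if new not in items:
                    items.add(new)
                    todo.append(new)
    return frozenset(items)

def item_sets(grammar, start):
    """All reachable sets of valid items (the states of the DFA)."""
    first = close(grammar, [(start, rhs, 0) for rhs in grammar[start]])
    seen, todo = {first}, [first]
    while todo:
        current = todo.pop()
        symbols = {rhs[dot] for _, rhs, dot in current if dot < len(rhs)}
        for symbol in symbols:
            moved = [(l, r, d + 1) for l, r, d in current
                     if d < len(r) and r[d] == symbol]
            target = close(grammar, moved)
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return seen

def classify(items):
    """Label one set of valid items: shift, reduce, or the kind of conflict."""
    complete = [item for item in items if item[2] == len(item[1])]
    if not complete:
        return "shift"
    if len(complete) < len(items):
        return "S/R conflict"        # some items complete, some not
    if len(complete) > 1:
        return "R/R conflict"        # more than one complete item
    return "reduce"

NESTED = {"A": ["aAb", "ab"]}             # A -> aAb | ab
PALINDROMES = {"A": ["aAa", "bAb", ""]}   # A -> aAa | bAb | ε
for name, grammar in [("A → aAb | ab", NESTED), ("A → aAa | bAb | ε", PALINDROMES)]:
    print(name, sorted(classify(s) for s in item_sets(grammar, "A")))
```

In the second grammar the item A → • is complete from the start, so every state that also contains an incomplete item is reported as an S/R conflict.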
26. Example 2
L = {w#w^R : w ∈ {a, b}*}
A → aAa | bAb | #
[Figure: NFA with start state q0, ε-transitions, and the items A → •aAa, A → a•Aa, A → aA•a, A → aAa•, A → •bAb, A → b•Ab, A → bA•b, A → bAb•, A → •#, A → #• as states]
27. Example 2
L = {w#w^R : w ∈ {a, b}*}
A → aAa | bAb | #
[Figure: DFA for this grammar with states numbered 1 to 6 and transitions labeled a, b and A]
No S/R or R/R conflicts!
28. Example 2 parsing
Stack    State    Action
1        1        S
12       2        S
122      2        S
1226     6        R
1223     3        S
12234    4        R
123      3        S
1235     5        R
1
[Figure: parse tree with root A]
29. Hierarchy of context-free grammars
context-free grammars: parse using CYK algorithm (slow)
to be continued