Title: Lexical Analysis and Parsing
1Lexical Analysis and Parsing
2Regular Expressions
- Later definitions build on earlier ones
- Nothing defined in terms of itself (no recursion)
Regular grammar for numeric literals in Pascal:
digit -> 0 | 1 | 2 | ... | 8 | 9
unsigned_integer -> digit digit*
unsigned_number -> unsigned_integer ( ( . unsigned_integer ) | ε ) ( ( e ( + | - | ε ) unsigned_integer ) | ε )
(ε is the empty string; e is the literal exponent letter.)
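The numeric-literal definitions above can be transcribed into a Python regular expression. This is a minimal sketch, assuming the optional fraction and the optional signed exponent are the only extensions to unsigned_integer; the names `DIGIT`, `UNSIGNED_INTEGER`, `UNSIGNED_NUMBER` and `is_unsigned_number` are illustrative and not part of the slides.

```python
import re

# Later definitions build on earlier ones, and none refers to itself.
DIGIT = r"[0-9]"
UNSIGNED_INTEGER = rf"{DIGIT}{DIGIT}*"
UNSIGNED_NUMBER = (
    rf"{UNSIGNED_INTEGER}"
    rf"(\.{UNSIGNED_INTEGER})?"      # optional fraction: ( . unsigned_integer ) | ε
    rf"(e[+-]?{UNSIGNED_INTEGER})?"  # optional exponent: e ( + | - | ε ) unsigned_integer
)
PATTERN = re.compile(UNSIGNED_NUMBER)


def is_unsigned_number(text: str) -> bool:
    """Return True if the whole string is an unsigned_number."""
    return PATTERN.fullmatch(text) is not None
```

With this sketch, `is_unsigned_number("6.02e23")` holds, while `".5"` and `"1."` are rejected because a fraction needs digits on both sides of the period.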
3Regular Expression Notation
- a   an ordinary letter
- ε   the empty string
- M | N   choosing from M or N
- MN   concatenation of M and N
- M*   zero or more times (Kleene star)
- M+   one or more times
- M?   zero or one occurrence
- [a-zA-Z]   character set alternation (choice)
- .   period stands for any single character except newline
4Converting a Regular Expression to an NFA
(Figure: NFA construction for a, ε, M | N, MN, and M*.)
5Converting an NFA to a DFA
- For a set of states S, closure(S) is the set of states that can be reached from S without consuming any input.
- For a set of states S, DFAedge(S, c) is the set of states that can be reached from S by consuming exactly the input symbol c, possibly taking ε-transitions as well.
- Each set of NFA states corresponds to one DFA state (hence at most 2^n DFA states for an NFA with n states).
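The closure and DFAedge definitions above can be sketched in Python as follows. The dictionary encoding of the NFA, the `EPS` label for ε-transitions, the function names, and the small example NFA are all assumptions of this sketch rather than anything given in the slides.

```python
from collections import deque

EPS = ""  # label used in this sketch for ε-transitions


def closure(nfa, states):
    """All NFA states reachable from `states` without consuming input."""
    result = set(states)
    work = list(states)
    while work:
        s = work.pop()
        for t in nfa.get((s, EPS), ()):
            if t not in result:
                result.add(t)
                work.append(t)
    return frozenset(result)


def dfa_edge(nfa, states, c):
    """States reachable from `states` by consuming c, then any ε-moves."""
    moved = set()
    for s in states:
        moved.update(nfa.get((s, c), ()))
    return closure(nfa, moved)


def nfa_to_dfa(nfa, start, alphabet):
    """Subset construction: each DFA state is a frozenset of NFA states."""
    d0 = closure(nfa, {start})
    transitions = {}
    seen = {d0}
    queue = deque([d0])
    while queue:
        d = queue.popleft()
        for c in alphabet:
            target = dfa_edge(nfa, d, c)
            if not target:
                continue  # no move on c: the dead state is left implicit
            transitions[(d, c)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return d0, transitions


# Hypothetical NFA: state 0 loops on a and b and may also move to 1 on a;
# the ε-edge from 1 to 2 is there only to exercise closure().
NFA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, EPS): {2}}
```

Calling `nfa_to_dfa(NFA, 0, "ab")` on this three-state NFA visits only two DFA states, well below the 2^n upper bound.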
6Extended BNF (EBNF)
- Rules or productions
- Variables or non-terminals on LHS
- Terminals (the programming language's tokens)
- Start symbol (non-terminal)
- Vertical bar (|) for alternatives
- Kleene star (*) for repetition
- Meta-level parentheses of regular expressions
7Derivations and Parse Trees
Nested constructs require recursion, i.e. context-free grammars.
CFG for arithmetic expressions:
expression -> identifier | number | - expression | ( expression ) | expression operator expression
operator -> + | - | * | /
8Parse Tree for slope * x + intercept
Is this the only parse tree for this expression
and grammar?
9A Better Expression Grammar
1. expression -> term | expression add_op term
2. term -> factor | term mult_op factor
3. factor -> identifier | number | - factor | ( expression )
4. add_op -> + | -
5. mult_op -> * | /
A good grammar reflects the internal structure of programs. This grammar is unambiguous and captures (HOW?)
- operator precedence (*, / bind tighter than +, -)
- associativity (ops group left to right)
10And Better Parse Trees...
3 + 4 * 5
10 - 4 - 3
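To see how a layered expression/term/factor grammar yields these trees, here is a small Python sketch that builds nested-tuple parse trees. It is an illustration under stated assumptions: identifiers are omitted, the left-recursive rules are coded as loops (which produces the same left-to-right grouping), and the names `tokenize` and `parse` are hypothetical.

```python
import re

TOKEN = re.compile(r"\s*(\d+|[-+*/()])")


def tokenize(text):
    """Split the input into number and operator tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def parse(text):
    """Return a nested-tuple tree such as ('-', ('-', 10, 4), 3)."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def expression():
        # expression -> term | expression add_op term
        node = term()
        while peek() in ("+", "-"):
            op = peek()
            advance()
            node = (op, node, term())  # the tree built so far becomes the left child
        return node

    def term():
        # term -> factor | term mult_op factor
        node = factor()
        while peek() in ("*", "/"):
            op = peek()
            advance()
            node = (op, node, factor())
        return node

    def factor():
        # factor -> number | - factor | ( expression )
        tok = peek()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        advance()
        if tok == "-":
            return ("neg", factor())
        if tok == "(":
            node = expression()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return node
        if tok.isdigit():
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")

    tree = expression()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree
```

Because multiplication is handled one level deeper than addition, `parse("3 + 4 * 5")` groups `4 * 5` first, and because each loop extends the tree on the left, `parse("10 - 4 - 3")` groups `10 - 4` first.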
11Syntax-directed Translation
- Parser calls scanner to obtain tokens.
- Assembles tokens into parse tree.
- Passes tree to later phases of compilation.
- Scanner: deterministic finite automaton.
- Parser: pushdown automaton.
- Scanners and parsers can be generated automatically from regular expressions and CFGs (e.g. lex/yacc, see assignment 1).
12Deeper Into the Details of Scanning.
13Scanning
- Accept the longest possible token in each invocation of the scanner.
- Implementation:
- Ad hoc.
- Capture finite automaton:
- Case (switch) statements.
- Table and driver.
- Impurities:
- Handling of keywords (look up identifier in hash table).
- Need to peek (look) ahead (e.g. . vs ..).
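The points above (longest match, keyword lookup, and peeking ahead to tell . from ..) can be illustrated with a small ad hoc scanner in Python. This is a sketch, not the scanner from the slides: the token kinds, the `KEYWORDS` subset, and the function name `scan` are all assumptions.

```python
KEYWORDS = {"begin", "end", "if", "then", "while", "do"}  # illustrative subset


def scan(text):
    """Return a list of (kind, lexeme) pairs using longest-match scanning."""
    tokens, i, n = [], 0, len(text)
    while i < n:
        ch = text[i]
        if ch.isspace():
            i += 1
        elif ch.isalpha():
            j = i
            while j < n and (text[j].isalnum() or text[j] == "_"):
                j += 1
            word = text[i:j]
            # Keywords are scanned as identifiers, then looked up in a table.
            kind = "keyword" if word.lower() in KEYWORDS else "identifier"
            tokens.append((kind, word))
            i = j
        elif ch.isdigit():
            j = i
            while j < n and text[j].isdigit():
                j += 1
            # Peek ahead: a '.' starts a fraction only if a digit follows it,
            # so "3..5" scans as 3, .., 5 rather than "3." followed by ".5".
            if j + 1 < n and text[j] == "." and text[j + 1].isdigit():
                j += 1
                while j < n and text[j].isdigit():
                    j += 1
            tokens.append(("number", text[i:j]))
            i = j
        elif ch == ".":
            if i + 1 < n and text[i + 1] == ".":
                tokens.append(("dotdot", ".."))
                i += 2
            else:
                tokens.append(("dot", "."))
                i += 1
        else:
            tokens.append(("symbol", ch))
            i += 1
    return tokens
```

The two-character lookahead in the number branch is the kind of impurity the slide refers to: a pure longest-match automaton would have to back up after reading `3.` in `3..5`.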
14Scanner for Pascal
15Scanner for Pascal(case Statements)
16Scanner (Table and Driver)
17Scanner Generators
- Start with a regular expression.
- Construct an NFA from it.
- Use the set-of-subsets construction to obtain an equivalent DFA.
- Construct the minimal equivalent DFA.
18Example of scanner generation
Language: strings of 0s and 1s in which the number of 0s is even.
Regular expression: (1*01*0)*1*
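As a quick sanity check on this example, the sketch below compares the regular expression (1*01*0)*1* against a hand-written two-state DFA that tracks only the parity of the 0s seen so far. The function names `accepts` and `agree_up_to` are illustrative.

```python
import re
from itertools import product

EVEN_ZEROS = re.compile(r"(1*01*0)*1*")


def accepts(text):
    """Two-state DFA: the state records whether the count of 0s is even."""
    even = True
    for ch in text:
        if ch == "0":
            even = not even  # every 0 flips the state
        elif ch != "1":
            return False     # symbol outside the alphabet {0, 1}
        # a 1 leaves the state unchanged
    return even


def agree_up_to(max_len):
    """Check that the regex and the DFA agree on all strings up to max_len."""
    for n in range(max_len + 1):
        for chars in product("01", repeat=n):
            s = "".join(chars)
            if accepts(s) != (EVEN_ZEROS.fullmatch(s) is not None):
                return False
    return True
```

An exhaustive comparison over short strings, as in `agree_up_to(8)`, is not a proof of equivalence, but it catches most transcription mistakes in either the expression or the automaton.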
19NFA -> DFA conversion
hardcopy
- After reading a given input, a state of the DFA represents the set of states that the NFA might have reached on the same input.
- Each state of the DFA that contains a final state of the NFA is a final state of the DFA.
- The number of states of the DFA is exponential (in the worst case) in the number of states of the NFA.
20Obtaining the minimal equivalent DFA
- Initially two equivalence classes: final and nonfinal states.
- Search for an equivalence class C and an input letter a such that, with a as input, the states in C make transitions to states in k > 1 different equivalence classes.
- Partition C into k classes accordingly.
- Repeat until unable to find a class to partition.
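The partitioning procedure above can be sketched in Python as follows. The sketch assumes a complete DFA given as a dictionary from (state, letter) to state; the function name `minimize` and the two example automata are hypothetical and are not the DFA from the slides.

```python
def minimize(states, alphabet, delta, finals):
    """Return the equivalence classes of a complete DFA as frozensets."""
    # Initially two classes: final and nonfinal states (empty ones dropped).
    partition = [set(finals), set(states) - set(finals)]
    partition = [block for block in partition if block]
    changed = True
    while changed:
        changed = False
        for block in partition:
            for a in alphabet:
                # Group the states of this class by the class that
                # their a-transition enters.
                groups = {}
                for s in block:
                    target = delta[(s, a)]
                    index = next(
                        i for i, b in enumerate(partition) if target in b
                    )
                    groups.setdefault(index, set()).add(s)
                if len(groups) > 1:
                    # k > 1 target classes: split the class into k classes.
                    partition.remove(block)
                    partition.extend(groups.values())
                    changed = True
                    break
            if changed:
                break  # restart the search with the refined partition
    return {frozenset(block) for block in partition}


# Hypothetical four-state DFA for "even number of 0s" with redundant states.
EVEN_DELTA = {
    ("A", "0"): "B", ("A", "1"): "C",
    ("B", "0"): "C", ("B", "1"): "D",
    ("C", "0"): "D", ("C", "1"): "A",
    ("D", "0"): "A", ("D", "1"): "B",
}

# Hypothetical DFA over {a} accepting strings of length at least 2.
CHAIN_DELTA = {(0, "a"): 1, (1, "a"): 2, (2, "a"): 2}
```

For `EVEN_DELTA` with final states A and C, no class needs splitting, so the result has two classes; for `CHAIN_DELTA` with final state 2, the nonfinal class {0, 1} is split because 0 and 1 move to different classes on a.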
21Example of obtaining the minimal equivalent DFA
Initial classes: {A, B, E}, {C, D}. No class requires partitioning! Hence a two-state DFA is obtained.
22Example (cont.)
23Deeper Into the Details of Parsing
24Parsing approaches (for linear time performance)
- Parsing in general has O(n^3) cost.
- Need classes of grammars that can be parsed in linear time.
- Top-down or predictive parsing, or recursive descent parsing, or LL parsing (Left-to-right scan, Left-most derivation)
- Bottom-up or shift-reduce parsing, or LR parsing (Left-to-right scan, Right-most derivation)
25Top-down Parsing
- Predicts a derivation
- Matches non-terminal against token observed in
input
26LL(1) Grammar
- A grammar for which a top-down deterministic parser can be produced with one token of look-ahead.
- How can one tell whether a grammar is LL(1)?
- Define s-grammar first
- Then generalize to LL(1) grammar
27S-grammar
- The RHS of each production begins with a terminal.
- Where a nonterminal appears as the LHS of more than one production, the corresponding RHSs begin with different terminals.
28Examples
- An s-grammar
- S -> pX
- S -> qY
- X -> aXb
- X -> x
- Y -> aYd
- Y -> y
- Not an s-grammar (the RHSs of S -> R and S -> T begin with nonterminals)
- S -> R
- S -> T
- R -> pX
- T -> qY
- X -> aXb
- X -> x
- Y -> aYd
- Y -> y
29Example of Left-to-Right Leftmost Derivation
Input: paaaxbbb
Derivation:
- S
- => pX
- => paXb
- => paaXbb
- => paaaXbbb
- => paaaxbbb
Where the leftmost non-terminal can be replaced
using more than one production, the appropriate
production can be chosen by examining the next
symbol of the input.
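The choice of production by the next input symbol can be made concrete with a small stack-based predictive parser for the example s-grammar (S -> pX | qY, X -> aXb | x, Y -> aYd | y). This is a sketch: the table encoding `GRAMMAR` and the function name `leftmost_derivation` are assumptions, and single characters stand for grammar symbols.

```python
# For each nonterminal, the first terminal of a RHS selects the production.
GRAMMAR = {
    "S": {"p": "pX", "q": "qY"},
    "X": {"a": "aXb", "x": "x"},
    "Y": {"a": "aYd", "y": "y"},
}


def leftmost_derivation(text, start="S"):
    """Return the sentential forms of the leftmost derivation of `text`."""
    stack = [start]  # symbols still to be matched, top of stack at the end
    matched = ""     # input consumed so far
    steps = [start]
    pos = 0
    while stack:
        top = stack.pop()
        look = text[pos] if pos < len(text) else None
        if top in GRAMMAR:
            # Nonterminal: the next input symbol picks the production.
            rhs = GRAMMAR[top].get(look)
            if rhs is None:
                raise SyntaxError(f"no production of {top} starts with {look!r}")
            stack.extend(reversed(rhs))
            steps.append(matched + "".join(reversed(stack)))
        elif top == look:
            # Terminal: it must match the next input symbol.
            matched += top
            pos += 1
        else:
            raise SyntaxError(f"expected {top!r} but found {look!r}")
    if pos != len(text):
        raise SyntaxError("unexpected input after the end of the sentence")
    return steps
```

For the input `paaaxbbb` the returned list reproduces the derivation S, pX, paXb, paaXbb, paaaXbbb, paaaxbbb, with no backtracking at any step.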
30Starter Symbols
- A terminal a is a starter symbol for nonterminal A iff A =>* a α, where α is a string of terminals and/or nonterminals. S(A) = set of starter symbols for A.
- A terminal a is a starter symbol for a string α iff α =>* a β, where α, β are strings of terminals and/or nonterminals.
31Director Symbols
- The director symbols of nonterminal A are
- S(A), and
- If A can generate the empty string, then all the
symbols which can follow A
32LL(1) Grammar Definition
- For every nonterminal appearing in the LHS of more than one production, the sets of director symbols corresponding to the RHSs of the alternative productions are disjoint.
- All LL(1) grammars can be parsed deterministically top down.
- An algorithm exists for automatically determining whether a grammar is LL(1).
33Example
(hardcopy)
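T -> AB
A -> PQ
A -> BC
P -> pP
P -> ε
Q -> qQ
Q -> ε
B -> bB
B -> e
C -> cC
C -> f
(Here ε is the empty string, while e in B -> e is a terminal.)
Director symbols of A (from A -> PQ): {p, q, b, e}
Director symbols of A (from A -> BC): {b, e}
The two sets share b and e, so this grammar is not LL(1).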
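The director symbols in this example can be recomputed mechanically. The Python sketch below makes several assumptions that are not stated in the slides: P and Q have empty (ε) productions while B -> e uses e as an ordinary terminal, `$` is a hypothetical end-of-input marker, and all function names are illustrative. Under that reading it reproduces the two sets given above.

```python
EPS = ""   # an empty right-hand side stands for ε in this sketch
END = "$"  # hypothetical end-of-input marker

GRAMMAR = [
    ("T", "AB"), ("A", "PQ"), ("A", "BC"),
    ("P", "pP"), ("P", EPS), ("Q", "qQ"), ("Q", EPS),
    ("B", "bB"), ("B", "e"), ("C", "cC"), ("C", "f"),
]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}


def compute_nullable():
    """Nonterminals that can generate the empty string."""
    nullable, changed = set(), True
    while changed:
        changed = False
        for lhs, rhs in GRAMMAR:
            if lhs not in nullable and all(s in nullable for s in rhs):
                nullable.add(lhs)
                changed = True
    return nullable


NULLABLE = compute_nullable()


def starters_of(string, starters):
    """Starter symbols of a string of grammar symbols."""
    result = set()
    for sym in string:
        if sym not in NONTERMINALS:
            result.add(sym)
            return result
        result |= starters[sym]
        if sym not in NULLABLE:
            return result
    return result


def compute_starters():
    starters, changed = {n: set() for n in NONTERMINALS}, True
    while changed:
        changed = False
        for lhs, rhs in GRAMMAR:
            new = starters_of(rhs, starters)
            if not new <= starters[lhs]:
                starters[lhs] |= new
                changed = True
    return starters


def compute_followers(starters, start="T"):
    """Symbols that can follow each nonterminal."""
    followers = {n: set() for n in NONTERMINALS}
    followers[start].add(END)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in GRAMMAR:
            for i, sym in enumerate(rhs):
                if sym not in NONTERMINALS:
                    continue
                rest = rhs[i + 1:]
                new = starters_of(rest, starters)
                if all(s in NULLABLE for s in rest):
                    new |= followers[lhs]
                if not new <= followers[sym]:
                    followers[sym] |= new
                    changed = True
    return followers


def director_symbols(lhs, rhs, starters, followers):
    """Starters of the RHS, plus the followers of lhs if the RHS can vanish."""
    result = starters_of(rhs, starters)
    if all(s in NULLABLE for s in rhs):
        result |= followers[lhs]
    return result
```

Checking LL(1) then amounts to computing `director_symbols` for every pair of alternatives of the same nonterminal and testing that the sets are disjoint; for A the intersection is non-empty, so the test fails.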
34LL(1) Languages
- Do all languages possess an LL(1) grammar? (No).
- If not, is there an algorithm to determine whether a language is LL(1)? (No).
- The obvious grammar for most programming languages is not LL(1).
- Given a non-LL(1) grammar that describes an LL(1) language, can it be transformed into LL(1) form? (Yes, in special cases useful in practice).
35Bottom-up Parsing
- LR: left-to-right, right-most derivation (bottom-up parser)
- Shifts new leaves from the scanner into a forest of partially completed parse tree fragments
- At some point it realizes that it has a complete right-hand side, which it can reduce.
36Bottom-up parsing (2)
- The symbols joined together are called a handle.
- Keep track of the productions we might be in the middle of.
- Characteristic Finite State Machine (CFSM) in LR parsing: its states are the various possible sets of productions.
- The CFSM recognizes the grammar's viable prefixes.
37A Simple Grammar for a Comma-separated List of Identifiers
hardcopy
id_list -> id id_list_tail
id_list_tail -> , id id_list_tail
id_list_tail ->
String to be parsed: A, B, C
38Top-down/bottom-up Parsing
39Stack Contents (Roots of Partial Trees) in Bottom-up Parsing
40A More Realistic Example: the Calculator Language
41A Sum-and-average Program
hardcopy
read A
read B
sum := A + B
write sum
write sum / 2
42LL(1) Grammar for Calculator Language
hardcopy
43Recursive Descent Parser
hardcopy
44Parse Tree for Sum-and-avg Program
45LR(1) Grammar for the Calculator Language
hardcopy
46LR(1) Grammar for the Calculator Language
- LR(1) version uses
- Left recursion for stmt_list
- Left recursion for expr and term
- Key concept
- Figure out when you reach the end of a RHS.
- Keep track of the set of productions we may be in
the middle of, and where in these productions.
47Bottom-up Parsing Overview
- LR(1) parsers loop over inspection of a look-up table to find out what action to take.
- Variants differ in how to resolve conflicts.
- Most common is LALR(1).
48A Bird's-eye View of Grammar and Language Classes
49Relationships Among Grammar Classes
LALR(1) is a standard for programming languages and automatic parser generators.
50Relationships Among Language Classes
51Examples of Languages (Proofs Beyond the Scope of This Class)