Title: Chapter 4 Lexical and Syntax Analysis
1. Chapter 4 Lexical and Syntax Analysis
- CS 350 Programming Language Design
- Indiana University Purdue University Fort Wayne
- Mark Temte
2. Chapter 4 topics
- Introduction
- Lexical Analysis
- Parsing
- Recursive-Descent Parsing
- Bottom-Up Parsing
3. Introduction
- The syntax analysis portion of a compiler typically consists of two parts
  - A low-level part called a lexical analyzer
    - A deterministic finite automaton (DFA)
    - Based on a regular grammar
  - A high-level part called a parser (or syntax analyzer)
    - A push-down automaton
    - Based on a context-free grammar described with BNF
4. Introduction
- Reasons to use BNF to describe syntax
  - Provides a clear and concise syntax description
  - The parser can be based directly on the BNF
  - Parsers based on BNF are easy to maintain
- Reasons to separate lexical and syntax analysis
  - Simplicity
    - Less complex approaches can be used for lexical analysis
    - Separating them out simplifies the parser
  - Efficiency
    - Separation allows optimization of the lexical analyzer
  - Portability
    - Parts of the lexical analyzer may not be portable, but the parser always is portable
5. Lexical Analysis
- A lexical analyzer is a pattern matcher for character strings
- A lexical analyzer is a front end for the parser
  - Identifies substrings of the source program that belong together (lexemes)
  - Lexemes match a character pattern, which is associated with a lexical category called a token
    - myCount is a lexeme
    - The token for myCount might be called IDENT
6. Lexical Analysis
- A lexical analyzer also . . .
  - Skips comments
  - Skips blanks outside lexemes (see the sketch below)
  - Inserts lexemes for identifiers into a symbol table
  - Detects syntactic errors in lexemes
    - Ill-formed floating-point literals, for example
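A minimal sketch of blank and comment skipping, not taken from the slides: it assumes the raw input is a stdio FILE and that comments use a //-style line delimiter; the function name skipBlanksAndComments is also an assumption.

#include <stdio.h>
#include <ctype.h>

/* Sketch (not from the slides): skip whitespace and '//'-style line
   comments so scanning resumes at the first character of the next
   lexeme.  The comment syntax here is an illustrative assumption. */
void skipBlanksAndComments(FILE *in) {
    int c;
    for (;;) {
        /* skip whitespace */
        do { c = fgetc(in); } while (c != EOF && isspace(c));
        /* skip a // comment through the end of the line */
        if (c == '/') {
            int c2 = fgetc(in);
            if (c2 == '/') {
                while (c != EOF && c != '\n')
                    c = fgetc(in);
                continue;                    /* look for more blanks/comments */
            }
            if (c2 != EOF) ungetc(c2, in);   /* not a comment: put it back */
        }
        if (c != EOF) ungetc(c, in);         /* first character of next lexeme */
        return;
    }
}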
7. Lexical Analysis
- The lexical analyzer is typically a function that is called by the parser when it needs the next token
- Three common approaches to building a lexical analyzer
  - Write a formal description of the tokens and use a software tool that constructs table-driven lexical analyzers using the description
    - A UNIX tool that does this is lex
  - Draw a state transition diagram that describes the tokens and write a program that implements the state diagram
  - Draw a state transition diagram that describes the tokens and hand-construct a table-driven implementation of the state diagram
- We examine the second approach
8. Lexical Analysis
- State diagram design
  - A naïve state diagram would have a transition from every state on every character in the source language
    - Such a diagram would be very large!
  - Transitions can usually be combined to simplify the state diagram
    - To recognize an identifier, all uppercase and lowercase letters are equivalent
      - Use a character class that includes all letters
    - To recognize an integer literal, all digits are equivalent
      - Use a digit class
9. Lexical Analysis
- Reserved words and identifiers can be recognized together
  - Then use a table lookup to determine whether a possible identifier is in fact a reserved word
  - The alternative is to have a separate part of the diagram for each reserved word
10. Lexical Analysis
- A lexical analyzer typically has several global variables
  - Character nextChar
  - charClass (letter, digit, etc.)
  - String lexeme
- Some convenient utility subprograms (sketched below) are . . .
  - getChar
    - Gets the next character of input, puts it in nextChar, determines its class, and puts the class in charClass
  - addChar
    - Adds the character from nextChar to the lexeme string
  - lookup
    - Determines whether the string in lexeme is a reserved word
    - Returns a code
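The slides name these utilities but do not show their bodies. A minimal sketch, assuming input is read from stdin, a bounded lexeme buffer, and illustrative numeric codes (IDENT and INT_LIT are named on other slides; the reserved-word list and code values here are assumptions):

#include <stdio.h>
#include <ctype.h>
#include <string.h>

/* Character classes and token codes; IDENT and INT_LIT follow the
   slides, the numeric values and reserved-word codes are assumptions */
enum { LETTER, DIGIT, UNKNOWN, END_OF_INPUT };
enum { IDENT = 10, INT_LIT = 11, RESERVED_BASE = 100 };

int  charClass;        /* class of nextChar                       */
char nextChar;         /* most recent input character             */
char lexeme[100];      /* the lexeme being built (reset by lex()) */
int  lexLen = 0;

/* getChar: read the next input character into nextChar and record
   its character class in charClass */
void getChar(void) {
    int c = getchar();
    if (c == EOF) { charClass = END_OF_INPUT; return; }
    nextChar = (char) c;
    if (isalpha(c))      charClass = LETTER;
    else if (isdigit(c)) charClass = DIGIT;
    else                 charClass = UNKNOWN;
}

/* addChar: append nextChar to the lexeme string */
void addChar(void) {
    if (lexLen < (int) sizeof lexeme - 1) {
        lexeme[lexLen++] = nextChar;
        lexeme[lexLen] = '\0';
    }
}

/* lookup: return a reserved-word code if the lexeme is a reserved
   word, otherwise the identifier code (the word list is illustrative) */
int lookup(const char *lex) {
    static const char *reserved[] = { "if", "else", "while", "for" };
    for (int i = 0; i < 4; i++)
        if (strcmp(lex, reserved[i]) == 0)
            return RESERVED_BASE + i;
    return IDENT;
}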
11. Example state transition diagram
12. A simple lexical analyzer

/* a simple lexical analyzer */
int lex() {
  getChar();
  switch (charClass) {

    /* Parse identifiers and reserved words */
    case LETTER:
      addChar();
      getChar();
      while (charClass == LETTER || charClass == DIGIT) {
        addChar();
        getChar();
      }
      return lookup(lexeme);
      break;

13. A simple lexical analyzer

    /* Parse integer literals */
    case DIGIT:
      addChar();
      getChar();
      while (charClass == DIGIT) {
        addChar();
        getChar();
      }
      return INT_LIT;
      break;
  }  /* End of switch */
}  /* End of function lex */
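As a usage note (not in the slides): before lex() can be driven stand-alone it would also need to reset the lexeme buffer and handle operators, parentheses, and end of input. Assuming it is extended to return an END_OF_INPUT code at end of file, a small test driver could look like this:

#include <stdio.h>

/* Sketch: print every token code and its lexeme until end of input.
   END_OF_INPUT is an assumed sentinel; lexeme is the global from the
   earlier slide. */
int main(void) {
    int token;
    while ((token = lex()) != END_OF_INPUT)
        printf("token code %d, lexeme \"%s\"\n", token, lexeme);
    return 0;
}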
14. Parsing
- A parser is a recognizer for a context-free language
- Given an input program, a parser . . .
  - Finds all syntax errors
    - For each syntax error, an appropriate diagnostic message is generated
    - Recovery is attempted so that additional syntax errors can be found
  - Produces the parse tree for the program
    - Perhaps just a traversal of the nodes of the parse tree
15. Parsing
- Two categories of parsers
  - Top down
    - Produce the parse tree, beginning at the root
    - Order is that of a leftmost derivation
  - Bottom up
    - Produce the parse tree, beginning at the leaves
    - Order is that of the reverse of a rightmost derivation
- Parsers look only one token ahead in the input
16. Top-down parsers
- A top-down parser traces the parse tree in preorder
- It produces a leftmost derivation of the program
- Partway through the leftmost derivation, suppose that the sentential form xAα has been derived
  - Nonterminal A must be replaced next
  - There may be several RHSs for A
    - Call these the A-rules
17. Top-down parsers
- The parser must choose the correct A-rule to get the next sentential form in the leftmost derivation
  - The parser is guided by the single lookahead token
  - The chosen A-rule must uniquely produce the lookahead token
- The most common top-down parsing algorithms . . .
  - Recursive-descent parsers
    - A coded implementation
  - LL parsers
    - Table-driven implementation
    - Left-to-right scan of tokens produces a Leftmost derivation
18. Bottom-up parsers
- Start with the tokens of the program and work back to the start symbol
  - We end up with a rightmost derivation in reverse order
- Try to match the RHS of some production rule with a substring of tokens and replace the substring with the LHS of the production rule
  - This is called a reduction
- The goal is to find a series of reductions
  - Each reduction should produce the previous sentential form in a rightmost derivation
19. Bottom-up parsers
- Problem
  - More than one RHS may match the input
  - The correct RHS must be selected on the basis of the lookahead token alone
  - The correct RHS is called the handle
- The most common bottom-up parsing algorithms are in the LR family
  - Table-driven implementation
  - Left-to-right scan of tokens produces a Rightmost derivation
20. The Complexity of Parsing
- Parsers that work for any unambiguous grammar are complex and inefficient
  - Their complexity is O(n³), where n is the length of the input
  - General parsers often reach dead ends and must back up and reparse
- Practical parsers work only for a subset of all unambiguous grammars
  - Their complexity is O(n), where n is the input length
21. Recursive-descent parsing
- This involves a subprogram for each nonterminal in the grammar
  - The subprogram parses the sub-sentences that can be generated by that nonterminal
  - Recursive production rules lead to recursive subprograms
- EBNF is ideally suited to serve as the basis for a recursive-descent parser
  - EBNF minimizes the number of nonterminals
22. Recursive-descent parsing
- Consider a grammar for simple expressions

  <expr> → <term> {(+ | -) <term>}
  <term> → <factor> {(* | /) <factor>}
  <factor> → id | ( <expr> )

- For a production rule LHS with only one RHS . . .
  - Work through the RHS, symbol by symbol
  - For any terminal symbol, compare it with the lookahead token
    - If they match, continue; otherwise there is an error
  - For any nonterminal symbol, call that symbol's associated parsing subprogram
23. Recursive-descent parsing
- Assume we have a lexical analyzer named lex, which puts the next token code in nextToken
- This particular routine does not detect errors

/* Function expr
   Parses strings in the language generated by the rule
   <expr> -> <term> {(+ | -) <term>} */
void expr() {

  /* Parse the first term */
  term();

  /* As long as the next token is + or -, call lex to get the
     next token, and parse the next term */
  while (nextToken == PLUS_CODE || nextToken == MINUS_CODE) {
    lex();
    term();
  }
}

- Convention: term() and every other parsing subprogram leaves the next token in nextToken when it finishes (a sketch of term() follows)
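The slides show expr() but not term(); a matching sketch for <term> (MULT_CODE and DIV_CODE are assumed token-code names, in the style of PLUS_CODE and MINUS_CODE):

/* Function term (sketch, not from the slides)
   Parses strings in the language generated by the rule
   <term> -> <factor> {(* | /) <factor>} */
void term() {

  /* Parse the first factor */
  factor();

  /* As long as the next token is * or /, call lex to get the
     next token, and parse the next factor */
  while (nextToken == MULT_CODE || nextToken == DIV_CODE) {
    lex();
    factor();
  }
}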
24. Recursive-descent parsing
- A production rule LHS that has more than one RHS requires an initial step to determine which RHS to parse
  - The correct RHS is chosen on the basis of the lookahead token
  - The lookahead token is compared with the first token that can be generated by each RHS until a match is found
  - The possible tokens that each RHS can generate must be determined by analysis when the compiler is constructed
  - If no match is found, it is a syntax error
25. Recursive-descent parsing

/* Function factor
   Parses strings in the language generated by the rule
   <factor> -> id | ( <expr> ) */
void factor() {

  /* Determine which RHS */
  if (nextToken == ID_CODE)
    /* For the RHS id, just call lex */
    lex();

  /* If the RHS is ( <expr> ), call lex to pass over the left
     parenthesis, call expr, and check for the right parenthesis */
  else if (nextToken == LEFT_PAREN_CODE) {
    lex();
    expr();
    if (nextToken == RIGHT_PAREN_CODE)
      lex();
    else
      error();
  }
  else
    error();  /* Neither RHS matches */
}  /* End of function factor */
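A minimal way to start the parser (not shown in the slides), assuming lex() also stores the token code it returns in nextToken and that EOF_CODE is an assumed end-of-input token code:

/* Sketch: prime nextToken, parse one expression, and require that
   all input has been consumed */
int main(void) {
  lex();                       /* get the first token into nextToken */
  expr();                      /* parse one <expr>                   */
  if (nextToken != EOF_CODE)   /* leftover input is a syntax error   */
    error();
  return 0;
}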
26. Recursive-descent parsing
- The LL grammar class has a problem with left recursion
- If a grammar has left recursion, either direct or indirect, it cannot be the basis for a top-down parser
  - For example, no production rule may have the form
    A → A + B
  - A recursive-descent parser subprogram for A would immediately call itself, resulting in an infinite chain of recursive calls
- Fortunately, a grammar can be modified to remove left recursion (a worked example follows)
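As a brief worked example (the standard transformation; it is not shown on the slide): the directly left-recursive rules

  A → A + B | B

can be rewritten by introducing a new nonterminal A'

  A → B A'
  A' → + B A' | ε

The rewritten rules generate the same strings, but every A-derivation now begins with B, so a recursive-descent subprogram for A no longer calls itself immediately.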
27. Recursive-descent parsing
- The LL grammar class also has a problem with pairwise disjointness
- Lack of pairwise disjointness is another characteristic of grammars that disallows top-down parsing
  - It is the inability to determine the correct RHS on the basis of one token of lookahead
28. Pairwise disjointness problem
- Define the FIRST set of a symbol string α by
  FIRST(α) = { a | α =>* aβ }
  - If α =>* ε, then ε is in FIRST(α)
  - Here ε is the empty string
- Pairwise disjointness test
  - Let A be any LHS nonterminal that has more than one RHS
  - Then for each pair of rules, A → αi and A → αk, it must be true that
    FIRST(αi) ∩ FIRST(αk) = ∅
29. Pairwise disjointness problem
- Examples
  - The following group of production rules passes the pairwise disjointness test
    A → a | bB | cAb
  - The next group of production rules does not pass
    A → a | aB
- A grammar that fails the pairwise disjointness test can often be modified successfully using left factoring
30. Left factoring example
- The production rule group
  <id_list> → identifier | identifier , <id_list>
  fails the pairwise disjointness test
- Replace the group with
  <id_list> → identifier <new>
  <new> → , <id_list> | ε
- A recursive-descent sketch for the left-factored rules follows
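A minimal recursive-descent sketch for the left-factored rules (ID_CODE follows the earlier factor() example; COMMA_CODE and the subprogram names idList and newPart are assumptions):

void newPart(void);            /* parses <new> */

/* idList: parses <id_list> -> identifier <new> */
void idList(void) {
  if (nextToken == ID_CODE) {
    lex();                     /* pass over the identifier */
    newPart();
  }
  else
    error();
}

/* newPart: parses <new> -> , <id_list> | epsilon */
void newPart(void) {
  if (nextToken == COMMA_CODE) {
    lex();                     /* pass over the comma */
    idList();
  }
  /* otherwise the empty RHS: consume nothing */
}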
31. Bottom-up parsing
- Recall that a bottom-up parser produces a rightmost derivation in reverse order by reading input from left to right

  Simple grammar:
  E → E + T | T
  T → T * F | F
  F → ( E ) | id

- Rightmost derivation of the sentence a + b * c (a, b, and c are identifiers)
  - E
  - E + T
  - E + T * F
  - E + T * c
  - E + F * c
  - E + b * c
  - T + b * c
  - F + b * c
  - a + b * c
32. Bottom-up parsing
- Given a right sentential form, the bottom-up parsing problem is to find the correct RHS (the handle) to reduce to an LHS to get the previous right sentential form in a rightmost derivation
- Some handle definitions
  - Definition: β is the handle of the right sentential form γ = αβw if and only if S =>*rm αAw =>rm αβw (where =>rm marks a rightmost-derivation step)
  - Definition: β is a phrase of the right sentential form γ if and only if S =>* γ = α1Aα2 =>+ α1βα2
  - Definition: β is a simple phrase of the right sentential form γ if and only if S =>* γ = α1Aα2 => α1βα2
33. Bottom-up parsing
- Intuition about handles
  - The handle of a right sentential form is its leftmost simple phrase
  - Given a parse tree, it is now easy to find the handle
  - Of course, you are not given the parse tree in advance
- Parsing can be thought of as handle pruning
34. Bottom-up parsing
- Bottom-up parsers are often called shift-reduce parsers
- The focus of parser activity is a parse stack
- Shift and reduce activity (a worked trace follows)
  - Reduce is the action of replacing the handle on the top of the parse stack with its corresponding LHS
  - Shift is the action of moving the next input token to the top of the parse stack
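As an illustration (not on the slide), here is the sequence of shifts and reductions for a + b * c with the expression grammar shown earlier; the parser's states are omitted and only the stacked grammar symbols are shown:

  Parse stack     Remaining input     Action
  (empty)         a + b * c           shift a
  a               + b * c             reduce by F → id
  F               + b * c             reduce by T → F
  T               + b * c             reduce by E → T
  E               + b * c             shift +
  E +             b * c               shift b
  E + b           * c                 reduce by F → id
  E + F           * c                 reduce by T → F
  E + T           * c                 shift *
  E + T *         c                   shift c
  E + T * c       (empty)             reduce by F → id
  E + T * F       (empty)             reduce by T → T * F
  E + T           (empty)             reduce by E → E + T
  E               (empty)             accept

At each step, the stack contents followed by the remaining input form a right sentential form; the sequence of distinct forms is exactly the reverse of the rightmost derivation given earlier.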
35. Bottom-up parsing
- Advantages of LR parsers
  - They work for nearly all grammars that describe programming languages
  - They work on a larger class of grammars than other bottom-up algorithms, but are as efficient as any other bottom-up parser
  - They can detect syntax errors as soon as it is possible to do so
    - LL parsers also have this property
  - The LR class of grammars is a superset of the class of grammars that can be parsed by LL parsers
36. Bottom-up parsing
- LR parsers
  - Are table driven
  - It is usually not practical to construct the table by hand
  - The table must be constructed automatically from the grammar by a program
    - For example, the UNIX tool yacc does this
37. Bottom-up parsing
- LR parsing was discovered by Donald Knuth (1965)
- Knuth's insight
  - A bottom-up parser can use the entire history of the parse, up to the current point, to make parsing decisions
  - There are only a finite and relatively small number of different parse situations that could have occurred, so the history can be stored as a sequence of states on the parse stack
38. Bottom-up parsing
- An LR configuration is the entire state of an LR parser

  (S0 X1 S1 X2 S2 ... Xm Sm,  ai ai+1 ... an)

  - The uppercase letters represent the parse stack
  - The lowercase letters represent the unread input
  - There is one state S for each grammar symbol X on the parse stack
39. Bottom-up parsing
40. Bottom-up parsing
- An LR parser table has two components
  - ACTION table
    - The ACTION table specifies the action of the parser, given the parser state and the next token
    - Rows are state names
    - Columns are terminals
  - GOTO table
    - The GOTO table specifies which state to put on top of the parse stack after a reduction action has taken place
    - Rows are state names
    - Columns are nonterminals
41. Form of an LR parsing table
42. Form of an LR parsing table
- This parsing table resulted from the grammar below
- Initial LR configuration: (S0, a1 ... an)

  Grammar
  1. E → E + T
  2. E → T
  3. T → T * F
  4. T → F
  5. F → ( E )
  6. F → id
43. Parser actions
- If ACTION[Sm, ai] = Shift S, the next configuration is
  (S0 X1 S1 X2 S2 ... Xm Sm ai S,  ai+1 ... an)
- If ACTION[Sm, ai] = Reduce A → β and S = GOTO[Sm-r, A], where r is the length of β, the next configuration is
  (S0 X1 S1 X2 S2 ... Xm-r Sm-r A S,  ai ai+1 ... an)
- If ACTION[Sm, ai] = Accept, the parse is complete and no errors were found
- If ACTION[Sm, ai] = Error, the parser calls an error-handling routine
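The slides describe the actions but not the driver loop that applies them. A schematic sketch in C: the table-lookup functions action() and goTo(), the rules array, and nextInputToken() are assumed to be generated elsewhere (for example, by a tool such as yacc); only the control flow is shown, and only states are kept on the stack since the grammar symbols are implicit in them.

#include <stdio.h>
#include <stdlib.h>

typedef enum { SHIFT, REDUCE, ACCEPT, ERROR } ActKind;
typedef struct { ActKind kind; int target; } Action;  /* target: state or rule number */
typedef struct { int lhs; int rhsLength; } Rule;      /* A and r = length of beta     */

extern Action action(int state, int token);   /* ACTION[Sm, ai]           */
extern int    goTo(int state, int lhs);       /* GOTO[state, A]           */
extern Rule   rules[];                        /* the numbered productions */
extern int    nextInputToken(void);           /* the lexical analyzer     */

void lrParse(void) {
    int states[1000];                  /* parse stack of states S0 ... Sm */
    int top = 0;
    states[top] = 0;                   /* push the initial state S0       */
    int token = nextInputToken();      /* the lookahead token ai          */

    for (;;) {
        Action a = action(states[top], token);
        switch (a.kind) {
        case SHIFT:                    /* push the new state, advance the input  */
            states[++top] = a.target;
            token = nextInputToken();
            break;
        case REDUCE: {                 /* pop r states, then push GOTO[Sm-r, A]  */
            Rule r = rules[a.target];
            top -= r.rhsLength;
            int exposed = states[top];
            states[++top] = goTo(exposed, r.lhs);
            break;
        }
        case ACCEPT:                   /* the parse is complete, no errors found */
            return;
        default:                       /* ERROR entry: call an error routine     */
            fprintf(stderr, "syntax error\n");
            exit(EXIT_FAILURE);
        }
    }
}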