Title: Chapter 4 Lexical and Syntax Analysis
1. Chapter 4 Lexical and Syntax Analysis
- CS 350 Programming Language Design
- Indiana University Purdue University Fort Wayne
- Mark Temte
2. Chapter 4 topics
- Introduction
- Lexical Analysis
- Parsing
- Recursive-Descent Parsing
- Bottom-Up Parsing
3. Introduction
- The syntax analysis portion of a compiler typically consists of two parts
  - A low-level part called a lexical analyzer
    - A deterministic finite automaton (DFA)
    - Based on a regular grammar
  - A high-level part called a parser (or syntax analyzer)
    - A push-down automaton
    - Based on a context-free grammar described with BNF
4. Introduction
- Reasons to use BNF to describe syntax
  - Provides a clear and concise syntax description
  - The parser can be based directly on the BNF
  - Parsers based on BNF are easy to maintain
- Reasons to separate lexical and syntax analysis
  - Simplicity
    - Less complex approaches can be used for lexical analysis
    - Separating them out simplifies the parser
  - Efficiency
    - Separation allows optimization of the lexical analyzer
  - Portability
    - Parts of the lexical analyzer may not be portable, but the parser always is portable
5. Lexical Analysis
- A lexical analyzer is a pattern matcher for character strings
- A lexical analyzer is a front end for the parser
  - Identifies substrings of the source program that belong together (lexemes)
  - Lexemes match a character pattern, which is associated with a lexical category called a token
    - myCount is a lexeme
    - The token for myCount might be called IDENT
6. Lexical Analysis
- A lexical analyzer also . . .
  - Skips comments
  - Skips blanks outside lexemes (see the sketch below)
  - Inserts lexemes for identifiers into a symbol table
  - Detects syntactic errors in lexemes
    - Ill-formed floating-point literals, for example
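A minimal sketch of blank and comment skipping, not taken from the slides: it assumes the raw input is a stdio FILE and that comments use a //-style line delimiter; the function name skipBlanksAndComments is also an assumption.

#include <stdio.h>
#include <ctype.h>

/* Sketch (not from the slides): skip whitespace and '//'-style line
   comments so scanning resumes at the first character of the next
   lexeme.  The comment syntax here is an illustrative assumption. */
void skipBlanksAndComments(FILE *in) {
    int c;
    for (;;) {
        /* skip whitespace */
        do { c = fgetc(in); } while (c != EOF && isspace(c));
        /* skip a // comment through the end of the line */
        if (c == '/') {
            int c2 = fgetc(in);
            if (c2 == '/') {
                while (c != EOF && c != '\n')
                    c = fgetc(in);
                continue;                    /* look for more blanks/comments */
            }
            if (c2 != EOF) ungetc(c2, in);   /* not a comment: put it back */
        }
        if (c != EOF) ungetc(c, in);         /* first character of next lexeme */
        return;
    }
}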
7. Lexical Analysis
- The lexical analyzer is typically a function that is called by the parser when it needs the next token
- Three common approaches to building a lexical analyzer
  - Write a formal description of the tokens and use a software tool that constructs table-driven lexical analyzers using the description
    - A UNIX tool that does this is lex
  - Draw a state transition diagram that describes the tokens and write a program that implements the state diagram
  - Draw a state transition diagram that describes the tokens and hand-construct a table-driven implementation of the state diagram
- We examine the second approach
8. Lexical Analysis
- State diagram design
  - A naïve state diagram would have a transition from every state on every character in the source language
    - Such a diagram would be very large!
  - Transitions can usually be combined to simplify the state diagram
    - To recognize an identifier, all uppercase and lowercase letters are equivalent
      - Use a character class that includes all letters
    - To recognize an integer literal, all digits are equivalent
      - Use a digit class
9. Lexical Analysis
- Reserved words and identifiers can be recognized together
  - Then use a table lookup to determine whether a possible identifier is in fact a reserved word
  - The alternative is to have a separate part of the diagram for each reserved word
10. Lexical Analysis
- A lexical analyzer typically has several global variables
  - Character nextChar
  - charClass (letter, digit, etc.)
  - String lexeme
- Some convenient utility subprograms (sketched below) are . . .
  - getChar
    - Gets the next character of input, puts it in nextChar, determines its class, and puts the class in charClass
  - addChar
    - Adds the character from nextChar to the lexeme string
  - lookup
    - Determines whether the string in lexeme is a reserved word
    - Returns a code
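The slides name these utilities but do not show their bodies. A minimal sketch, assuming input is read from stdin, a bounded lexeme buffer, and illustrative numeric codes (IDENT and INT_LIT are named on other slides; the reserved-word list and code values here are assumptions):

#include <stdio.h>
#include <ctype.h>
#include <string.h>

/* Character classes and token codes; IDENT and INT_LIT follow the
   slides, the numeric values and reserved-word codes are assumptions */
enum { LETTER, DIGIT, UNKNOWN, END_OF_INPUT };
enum { IDENT = 10, INT_LIT = 11, RESERVED_BASE = 100 };

int  charClass;        /* class of nextChar                       */
char nextChar;         /* most recent input character             */
char lexeme[100];      /* the lexeme being built (reset by lex()) */
int  lexLen = 0;

/* getChar: read the next input character into nextChar and record
   its character class in charClass */
void getChar(void) {
    int c = getchar();
    if (c == EOF) { charClass = END_OF_INPUT; return; }
    nextChar = (char) c;
    if (isalpha(c))      charClass = LETTER;
    else if (isdigit(c)) charClass = DIGIT;
    else                 charClass = UNKNOWN;
}

/* addChar: append nextChar to the lexeme string */
void addChar(void) {
    if (lexLen < (int) sizeof lexeme - 1) {
        lexeme[lexLen++] = nextChar;
        lexeme[lexLen] = '\0';
    }
}

/* lookup: return a reserved-word code if the lexeme is a reserved
   word, otherwise the identifier code (the word list is illustrative) */
int lookup(const char *lex) {
    static const char *reserved[] = { "if", "else", "while", "for" };
    for (int i = 0; i < 4; i++)
        if (strcmp(lex, reserved[i]) == 0)
            return RESERVED_BASE + i;
    return IDENT;
}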
11. Example state transition diagram
12. A simple lexical analyzer

/* a simple lexical analyzer */
int lex() {
  getChar();
  switch (charClass) {

    /* Parse identifiers and reserved words */
    case LETTER:
      addChar();
      getChar();
      while (charClass == LETTER || charClass == DIGIT) {
        addChar();
        getChar();
      }
      return lookup(lexeme);
      break;

13. A simple lexical analyzer

    /* Parse integer literals */
    case DIGIT:
      addChar();
      getChar();
      while (charClass == DIGIT) {
        addChar();
        getChar();
      }
      return INT_LIT;
      break;
  }  /* End of switch */
}  /* End of function lex */
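As a usage note (not in the slides): before lex() can be driven stand-alone it would also need to reset the lexeme buffer and handle operators, parentheses, and end of input. Assuming it is extended to return an END_OF_INPUT code at end of file, a small test driver could look like this:

#include <stdio.h>

/* Sketch: print every token code and its lexeme until end of input.
   END_OF_INPUT is an assumed sentinel; lexeme is the global from the
   earlier slide. */
int main(void) {
    int token;
    while ((token = lex()) != END_OF_INPUT)
        printf("token code %d, lexeme \"%s\"\n", token, lexeme);
    return 0;
}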
14. Parsing
- A parser is a recognizer for a context-free language
- Given an input program, a parser . . .
  - Finds all syntax errors
    - For each syntax error, an appropriate diagnostic message is generated
    - Recovery is attempted so that additional syntax errors can be found
  - Produces the parse tree for the program
    - Perhaps just a traversal of the nodes of the parse tree
15. Parsing
- Two categories of parsers
  - Top down
    - Produce the parse tree, beginning at the root
    - Order is that of a leftmost derivation
  - Bottom up
    - Produce the parse tree, beginning at the leaves
    - Order is that of the reverse of a rightmost derivation
- Parsers look only one token ahead in the input
16. Top-down parsers
- A top-down parser traces the parse tree in preorder
- It produces a leftmost derivation of the program
- Partway through the leftmost derivation, suppose that the sentential form xAα has been derived
  - Nonterminal A must be replaced next
  - There may be several RHSs for A
    - Call these the A-rules
17. Top-down parsers
- The parser must choose the correct A-rule to get the next sentential form in the leftmost derivation
  - The parser is guided by the single lookahead token
  - The chosen A-rule must uniquely produce the lookahead token
- The most common top-down parsing algorithms . . .
  - Recursive-descent parsers
    - A coded implementation
  - LL parsers
    - Table-driven implementation
    - Left-to-right scan of tokens produces a Leftmost derivation
18. Bottom-up parsers
- Start with the tokens of the program and work back to the start symbol
  - We end up with a rightmost derivation in reverse order
- Try to match the RHS of some production rule with a substring of tokens and replace the substring with the LHS of the production rule
  - This is called a reduction
- The goal is to find a series of reductions
  - Each reduction should produce the previous sentential form in a rightmost derivation
19. Bottom-up parsers
- Problem
  - More than one RHS may match the input
  - The correct RHS must be selected on the basis of the lookahead token alone
  - The correct RHS is called the handle
- The most common bottom-up parsing algorithms are in the LR family
  - Table-driven implementation
  - Left-to-right scan of tokens produces a Rightmost derivation
20. The Complexity of Parsing
- Parsers that work for any unambiguous grammar are complex and inefficient
  - Their complexity is O(n³), where n is the length of the input
  - General parsers often reach dead ends and must back up and reparse
- Practical parsers work only for a subset of all unambiguous grammars
  - Their complexity is O(n), where n is the input length
21. Recursive-descent parsing
- This involves a subprogram for each nonterminal in the grammar
  - The subprogram parses the sub-sentences that can be generated by that nonterminal
  - Recursive production rules lead to recursive subprograms
- EBNF is ideally suited to serve as the basis for a recursive-descent parser
  - EBNF minimizes the number of nonterminals
22. Recursive-descent parsing
- Consider a grammar for simple expressions

  <expr> → <term> {(+ | -) <term>}
  <term> → <factor> {(* | /) <factor>}
  <factor> → id | ( <expr> )

- For a production rule LHS with only one RHS . . .
  - Work through the RHS, symbol by symbol
  - For any terminal symbol, compare it with the lookahead token
    - If they match, continue; otherwise there is an error
  - For any nonterminal symbol, call that symbol's associated parsing subprogram
23. Recursive-descent parsing
- Assume we have a lexical analyzer named lex, which puts the next token code in nextToken
- This particular routine does not detect errors

/* Function expr
   Parses strings in the language generated by the rule
   <expr> -> <term> {(+ | -) <term>} */
void expr() {

  /* Parse the first term */
  term();

  /* As long as the next token is + or -, call lex to get the
     next token, and parse the next term */
  while (nextToken == PLUS_CODE || nextToken == MINUS_CODE) {
    lex();
    term();
  }
}

- Convention: term() and every other parsing subprogram leaves the next token in nextToken when it finishes (a sketch of term() follows)
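The slides show expr() but not term(); a matching sketch for <term> (MULT_CODE and DIV_CODE are assumed token-code names, in the style of PLUS_CODE and MINUS_CODE):

/* Function term (sketch, not from the slides)
   Parses strings in the language generated by the rule
   <term> -> <factor> {(* | /) <factor>} */
void term() {

  /* Parse the first factor */
  factor();

  /* As long as the next token is * or /, call lex to get the
     next token, and parse the next factor */
  while (nextToken == MULT_CODE || nextToken == DIV_CODE) {
    lex();
    factor();
  }
}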
24. Recursive-descent parsing
- A production rule LHS that has more than one RHS requires an initial step to determine which RHS to parse
  - The correct RHS is chosen on the basis of the lookahead token
  - The lookahead token is compared with the first token that can be generated by each RHS until a match is found
  - The possible tokens that each RHS can generate must be determined by analysis when the compiler is constructed
  - If no match is found, it is a syntax error
25. Recursive-descent parsing

/* Function factor
   Parses strings in the language generated by the rule
   <factor> -> id | ( <expr> ) */
void factor() {

  /* Determine which RHS */
  if (nextToken == ID_CODE)
    /* For the RHS id, just call lex */
    lex();

  /* If the RHS is ( <expr> ), call lex to pass over the left
     parenthesis, call expr, and check for the right parenthesis */
  else if (nextToken == LEFT_PAREN_CODE) {
    lex();
    expr();
    if (nextToken == RIGHT_PAREN_CODE)
      lex();
    else
      error();
  }
  else
    error();  /* Neither RHS matches */
}  /* End of function factor */
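A minimal way to start the parser (not shown in the slides), assuming lex() also stores the token code it returns in nextToken and that EOF_CODE is an assumed end-of-input token code:

/* Sketch: prime nextToken, parse one expression, and require that
   all input has been consumed */
int main(void) {
  lex();                       /* get the first token into nextToken */
  expr();                      /* parse one <expr>                   */
  if (nextToken != EOF_CODE)   /* leftover input is a syntax error   */
    error();
  return 0;
}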
26. Recursive-descent parsing
- The LL grammar class has a problem with left recursion
- If a grammar has left recursion, either direct or indirect, it cannot be the basis for a top-down parser
  - For example, no production rule may have the form
    A → A + B
  - A recursive-descent parser subprogram for A would immediately call itself, resulting in an infinite chain of recursive calls
- Fortunately, a grammar can be modified to remove left recursion (a worked example follows)
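As a brief worked example (the standard transformation; it is not shown on the slide): the directly left-recursive rules

  A → A + B | B

can be rewritten by introducing a new nonterminal A'

  A → B A'
  A' → + B A' | ε

The rewritten rules generate the same strings, but every A-derivation now begins with B, so a recursive-descent subprogram for A no longer calls itself immediately.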
27. Recursive-descent parsing
- The LL grammar class also has a problem with pairwise disjointness
- Lack of pairwise disjointness is another characteristic of grammars that disallows top-down parsing
  - It is the inability to determine the correct RHS on the basis of one token of lookahead
28. Pairwise disjointness problem
- Define the FIRST set of a symbol string α by
  FIRST(α) = { a | α =>* aβ }
  - If α =>* ε, then ε is in FIRST(α)
  - Here ε is the empty string
- Pairwise disjointness test
  - Let A be any LHS nonterminal that has more than one RHS
  - Then for each pair of rules, A → αi and A → αk, it must be true that
    FIRST(αi) ∩ FIRST(αk) = ∅
29. Pairwise disjointness problem
- Examples
  - The following group of production rules passes the pairwise disjointness test
    A → a | bB | cAb
  - The next group of production rules does not pass
    A → a | aB
- A grammar that fails the pairwise disjointness test can often be modified successfully using left factoring
30. Left factoring example
- The production rule group
  <id_list> → identifier | identifier , <id_list>
  fails the pairwise disjointness test
- Replace the group with
  <id_list> → identifier <new>
  <new> → , <id_list> | ε
- A recursive-descent sketch for the left-factored rules follows
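A minimal recursive-descent sketch for the left-factored rules (ID_CODE follows the earlier factor() example; COMMA_CODE and the subprogram names idList and newPart are assumptions):

void newPart(void);            /* parses <new> */

/* idList: parses <id_list> -> identifier <new> */
void idList(void) {
  if (nextToken == ID_CODE) {
    lex();                     /* pass over the identifier */
    newPart();
  }
  else
    error();
}

/* newPart: parses <new> -> , <id_list> | epsilon */
void newPart(void) {
  if (nextToken == COMMA_CODE) {
    lex();                     /* pass over the comma */
    idList();
  }
  /* otherwise the empty RHS: consume nothing */
}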
31. Bottom-up parsing
- Recall that a bottom-up parser produces a rightmost derivation in reverse order by reading input from left to right

  Simple grammar:
  E → E + T | T
  T → T * F | F
  F → ( E ) | id

- Rightmost derivation of the sentence a + b * c (a, b, and c are identifiers)
  - E
  - E + T
  - E + T * F
  - E + T * c
  - E + F * c
  - E + b * c
  - T + b * c
  - F + b * c
  - a + b * c
32. Bottom-up parsing
- Given a right sentential form, the bottom-up parsing problem is to find the correct RHS (the handle) to reduce to an LHS to get the previous right sentential form in a rightmost derivation
- Some handle definitions
  - Definition: β is the handle of the right sentential form γ = αβw if and only if S =>*rm αAw =>rm αβw (where =>rm marks a rightmost-derivation step)
  - Definition: β is a phrase of the right sentential form γ if and only if S =>* γ = α1Aα2 =>+ α1βα2
  - Definition: β is a simple phrase of the right sentential form γ if and only if S =>* γ = α1Aα2 => α1βα2
33. Bottom-up parsing
- Intuition about handles
  - The handle of a right sentential form is its leftmost simple phrase
  - Given a parse tree, it is now easy to find the handle
  - Of course, you are not given the parse tree in advance
- Parsing can be thought of as handle pruning
34. Bottom-up parsing
- Bottom-up parsers are often called shift-reduce parsers
- The focus of parser activity is a parse stack
- Shift and reduce activity (a worked trace follows)
  - Reduce is the action of replacing the handle on the top of the parse stack with its corresponding LHS
  - Shift is the action of moving the next input token to the top of the parse stack
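As an illustration (not on the slide), here is the sequence of shifts and reductions for a + b * c with the expression grammar shown earlier; the parser's states are omitted and only the stacked grammar symbols are shown:

  Parse stack     Remaining input     Action
  (empty)         a + b * c           shift a
  a               + b * c             reduce by F → id
  F               + b * c             reduce by T → F
  T               + b * c             reduce by E → T
  E               + b * c             shift +
  E +             b * c               shift b
  E + b           * c                 reduce by F → id
  E + F           * c                 reduce by T → F
  E + T           * c                 shift *
  E + T *         c                   shift c
  E + T * c       (empty)             reduce by F → id
  E + T * F       (empty)             reduce by T → T * F
  E + T           (empty)             reduce by E → E + T
  E               (empty)             accept

At each step, the stack contents followed by the remaining input form a right sentential form; the sequence of distinct forms is exactly the reverse of the rightmost derivation given earlier.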
35. Bottom-up parsing
- Advantages of LR parsers
  - They work for nearly all grammars that describe programming languages
  - They work on a larger class of grammars than other bottom-up algorithms, but are as efficient as any other bottom-up parser
  - They can detect syntax errors as soon as it is possible to do so
    - LL parsers also have this property
  - The LR class of grammars is a superset of the class of grammars that can be parsed by LL parsers
36. Bottom-up parsing
- LR parsers
  - Are table driven
  - It is usually not practical to construct the table by hand
  - The table must be constructed automatically from the grammar by a program
    - For example, the UNIX tool yacc does this
37. Bottom-up parsing
- LR parsing was discovered by Donald Knuth (1965)
- Knuth's insight
  - A bottom-up parser can use the entire history of the parse, up to the current point, to make parsing decisions
  - There are only a finite and relatively small number of different parse situations that could have occurred, so the history can be stored as a sequence of states on the parse stack
38. Bottom-up parsing
- An LR configuration is the entire state of an LR parser

  (S0 X1 S1 X2 S2 ... Xm Sm,  ai ai+1 ... an)

  - The uppercase letters represent the parse stack
  - The lowercase letters represent the unread input
  - There is one state S for each grammar symbol X on the parse stack
39. Bottom-up parsing
40. Bottom-up parsing
- An LR parser table has two components
  - ACTION table
    - The ACTION table specifies the action of the parser, given the parser state and the next token
    - Rows are state names
    - Columns are terminals
  - GOTO table
    - The GOTO table specifies which state to put on top of the parse stack after a reduction action has taken place
    - Rows are state names
    - Columns are nonterminals
41. Form of an LR parsing table
42. Form of an LR parsing table
- This parsing table resulted from the grammar below
- Initial LR configuration: (S0, a1 ... an)

  Grammar
  1. E → E + T
  2. E → T
  3. T → T * F
  4. T → F
  5. F → ( E )
  6. F → id
43. Parser actions
- If ACTION[Sm, ai] = Shift S, the next configuration is
  (S0 X1 S1 X2 S2 ... Xm Sm ai S,  ai+1 ... an)
- If ACTION[Sm, ai] = Reduce A → β and S = GOTO[Sm-r, A], where r is the length of β, the next configuration is
  (S0 X1 S1 X2 S2 ... Xm-r Sm-r A S,  ai ai+1 ... an)
- If ACTION[Sm, ai] = Accept, the parse is complete and no errors were found
- If ACTION[Sm, ai] = Error, the parser calls an error-handling routine
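The slides describe the actions but not the driver loop that applies them. A schematic sketch in C: the table-lookup functions action() and goTo(), the rules array, and nextInputToken() are assumed to be generated elsewhere (for example, by a tool such as yacc); only the control flow is shown, and only states are kept on the stack since the grammar symbols are implicit in them.

#include <stdio.h>
#include <stdlib.h>

typedef enum { SHIFT, REDUCE, ACCEPT, ERROR } ActKind;
typedef struct { ActKind kind; int target; } Action;  /* target: state or rule number */
typedef struct { int lhs; int rhsLength; } Rule;      /* A and r = length of beta     */

extern Action action(int state, int token);   /* ACTION[Sm, ai]           */
extern int    goTo(int state, int lhs);       /* GOTO[state, A]           */
extern Rule   rules[];                        /* the numbered productions */
extern int    nextInputToken(void);           /* the lexical analyzer     */

void lrParse(void) {
    int states[1000];                  /* parse stack of states S0 ... Sm */
    int top = 0;
    states[top] = 0;                   /* push the initial state S0       */
    int token = nextInputToken();      /* the lookahead token ai          */

    for (;;) {
        Action a = action(states[top], token);
        switch (a.kind) {
        case SHIFT:                    /* push the new state, advance the input  */
            states[++top] = a.target;
            token = nextInputToken();
            break;
        case REDUCE: {                 /* pop r states, then push GOTO[Sm-r, A]  */
            Rule r = rules[a.target];
            top -= r.rhsLength;
            int exposed = states[top];
            states[++top] = goTo(exposed, r.lhs);
            break;
        }
        case ACCEPT:                   /* the parse is complete, no errors found */
            return;
        default:                       /* ERROR entry: call an error routine     */
            fprintf(stderr, "syntax error\n");
            exit(EXIT_FAILURE);
        }
    }
}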