Compiler Construction



1
Compiler Construction
  • Lexical Analysis


2
Lexical Analysis
  • get next token is a command sent from the
    parser to the lexical analyzer.
  • On receipt of the command, the lexical analyzer
    scans the input until it determines the next
    token, and returns it.
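A minimal C sketch of this interface (the type and function names Token and get_next_token are illustrative assumptions, not taken from the slides):

    typedef enum { TOK_ID, TOK_NUM, TOK_ASSIGN, TOK_EOF } Token;

    /* Implemented by the lexical analyzer: scans the input until the
       next complete token is found, then returns it. */
    Token get_next_token(void);

    void parse(void)
    {
        Token t;
        while ((t = get_next_token()) != TOK_EOF) {
            /* the parser works one token at a time,
               never touching raw characters */
        }
    }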

3
Other jobs of the lexical analyzer
  • We also want the lexer to
  • Strip out comments and white space from the
    source code.
  • Correlate parser errors with the source code
    location (the parser doesn't know what line of
    the file it's at, but the lexer does)

4
Tokens, patterns, and lexemes
  • A TOKEN is a set of strings over the source
    alphabet.
  • A PATTERN is a rule that describes that set.
  • A LEXEME is a sequence of characters matching
    that pattern.
  • E.g. in Pascal, for the statement
  • const pi = 3.1416
  • The substring pi is a lexeme for the token
    identifier

5
Example tokens, lexemes, patterns
6
Tokens
  • Together, the complete set of tokens forms the set
    of terminal symbols used in the grammar for the
    parser.
  • In most languages, the tokens fall into these
    categories
  • Keywords
  • Operators
  • Identifiers
  • Constants
  • Literal strings
  • Punctuation
  • Usually the token is represented as an integer.
  • The lexer and parser just agree on which integers
    are used for each token.

7
Token attributes
  • If there is more than one lexeme for a token, we
    have to save additional information about the
    token.
  • Example: the token number matches the lexemes 10
    and 20.
  • Code generation needs the actual number, not just
    the token.
  • With each token, we associate ATTRIBUTES.
    Normally just a pointer into the symbol table.
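A token/attribute pair might be represented in C roughly as follows (the struct layout is an assumption for illustration, not something given on the slides):

    struct token {
        int               code;   /* the integer token code agreed with the parser */
        struct sym_entry *attr;   /* pointer into the symbol table, or NULL */
    };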

8
Example attributes
  • For C source code
  • E = M * C * C
  • We have token/attribute pairs
  • <ID, ptr to symbol table entry for E>
  • <Assign_op, NULL>
  • <ID, ptr to symbol table entry for M>
  • <Mult_op, NULL>
  • <ID, ptr to symbol table entry for C>
  • <Mult_op, NULL>
  • <ID, ptr to symbol table entry for C>

9
Lexical errors
  • When errors occur, we could just crash.
  • It is better to print an error message, then
    continue.
  • Possible techniques to continue on error
  • Delete a character
  • Insert a missing character
  • Replace an incorrect character by a correct
    character
  • Transpose adjacent characters

10
Token specification
  • REGULAR EXPRESSIONS (REs) are the most common
    notation for pattern specification.
  • Every pattern specifies a set of strings, so an
    RE names a set of strings.
  • Definitions
  • The ALPHABET (often written Σ) is the set of
    legal input symbols
  • A STRING over some alphabet Σ is a finite
    sequence of symbols from Σ
  • The LENGTH of string s is written |s|
  • The EMPTY STRING is a special 0-length string
    denoted ε

11
More definitions strings and substrings
  • A PREFIX of s is formed by removing 0 or more
    trailing symbols of s
  • A SUFFIX of s is formed by removing 0 or more
    leading symbols of s
  • A SUBSTRING of s is formed by deleting a prefix
    and a suffix from s
  • A PROPER prefix, suffix, or substring is a
    nonempty string x that is, respectively, a
    prefix, suffix, or substring of s but with x ≠ s.

12
More definitions
  • A LANGUAGE is a set of strings over a fixed
    alphabet Σ.
  • Example languages
  • Ø (the empty set)
  • { ε }
  • { a, aa, aaa, aaaa }
  • The CONCATENATION of two strings x and y is
    written xy
  • String EXPONENTIATION is written s^i, where s^0 = ε
    and s^i = s^(i-1)s for i > 0.

13
Operations on languages
  • We often want to perform operations on sets of
    strings (languages). The important ones are
  • The UNION of L and M: L ∪ M = { s | s is in L OR
    s is in M }
  • The CONCATENATION of L and M: LM = { st | s is in
    L and t is in M }
  • The KLEENE CLOSURE of L
  • The POSITIVE CLOSURE of L
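    The slide gave the two closures as formulas; in the usual notation
    (a reconstruction, not copied from the slide) they are
    L* = L^0 ∪ L^1 ∪ L^2 ∪ ...   (zero or more concatenations of L)
    L+ = L^1 ∪ L^2 ∪ ... = L L*  (one or more concatenations of L)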

14
Regular expressions
  • REs let us precisely define a set of strings.
  • For C identifiers, we might use ( letter | _ ) (
    letter | digit | _ )*
  • Parentheses are for grouping, | means OR, and *
    means Kleene closure.
  • Every RE defines a language L(r).

15
Regular expressions
  • Here are the rules for writing REs over an
    alphabet Σ
  • ε is an RE denoting { ε }, the language
    containing only the empty string.
  • If a is in Σ, then a is a RE denoting { a }.
  • If r and s are REs denoting L(r) and L(s), then
  • (r)|(s) is a RE denoting L(r) ∪ L(s)
  • (r)(s) is a RE denoting L(r)L(s)
  • (r)* is a RE denoting (L(r))*
  • (r) is a RE denoting L(r)

16
Additional conventions
  • To avoid too many parentheses, we assume
  • * has the highest precedence, and is left
    associative.
  • Concatenation has the 2nd highest precedence, and
    is left associative.
  • | has the lowest precedence and is left
    associative.

17
Example REs
  • a | b
  • ( a | b ) ( a | b )
  • a*
  • ( a | b )*
  • a | a*b

18
Equivalence of REs
19
Regular definitions
  • To make our REs simpler, we can give names to
    subexpressions. A REGULAR DEFINITION is a
    sequence
  • d1 -> r1
  • d2 -> r2
  • ...
  • dn -> rn

20
Regular definitions
  • Example for identifiers in C
  • letter -> A | B | ... | Z | a | b | ... | z
  • digit -> 0 | 1 | ... | 9
  • id -> ( letter | _ ) ( letter | digit | _ )*
  • Example for numbers in Pascal
  • digit -> 0 | 1 | ... | 9
  • digits -> digit digit*
  • optional_fraction -> . digits | ε
  • optional_exponent -> ( E ( + | - | ε ) digits )
    | ε
  • num -> digits optional_fraction optional_exponent
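    As an illustration (the lexeme 31.4E2 is an assumed example, not one
    from the slides), num matches 31.4E2 with digits = 31,
    optional_fraction = .4, and optional_exponent = E2.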

21
Notational shorthand
  • To simplify out REs, we can use a few shortcuts
  • 1. + means one or more instances of,
    e.g. a+ or (ab)+
  • 2. ? means zero or one instance of, e.g.
    optional_fraction -> ( . digits )?
  • 3. [ ] creates a character class, e.g.
    [A-Za-z][A-Za-z0-9]*
  • You can prove that these shortcuts do not
    increase the representational power of REs, but
    they are convenient.

22
Token recognition
  • We now know how to specify the tokens for our
    language. But how do we write a program to
    recognize them?
  • if -> if
  • then -> then
  • else -> else
  • relop -> < | <= | = | <> | > | >=
  • id -> letter ( letter | digit )*
  • num -> digit+ ( . digit+ )? ( E (+|-)? digit+ )?

23
Token recognition
  • We also want to strip whitespace, so we need
    definitions
  • delim -> blank | tab | newline
  • ws -> delim+

24
Attribute values
25
Transition diagrams
  • Transition diagrams are also called finite
    automata.
  • We have a collection of STATES drawn as nodes in
    a graph.
  • TRANSITIONS between states are represented by
    directed edges in the graph.
  • Each transition leaving a state s is labeled with
    a set of input characters that can occur after
    state s.
  • For now, the transitions must be DETERMINISTIC.
  • Each transition diagram has a single START state
    and a set of TERMINAL STATES.
  • The label OTHER on an edge indicates all possible
    inputs not handled by the other transitions.
  • Usually, when we recognize OTHER, we need to put
    it back in the source stream since it is part of
    the next token. This action is denoted with a *
    next to the corresponding state.
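As a concrete sketch (not from the slides), the transition diagram for id -> letter ( letter | digit )* could be hand-coded in C as below; ungetc() plays the role of the * retraction on the OTHER edge:

    #include <ctype.h>
    #include <stdio.h>

    /* Recognize id -> letter ( letter | digit )* on stream in.
       Returns 1 if an identifier was consumed, 0 otherwise. */
    int match_id(FILE *in)
    {
        int c = fgetc(in);

        /* start state: an identifier must begin with a letter */
        if (!isalpha(c)) {
            if (c != EOF) ungetc(c, in);
            return 0;
        }

        /* looping state: consume letters and digits */
        while ((c = fgetc(in)) != EOF && (isalpha(c) || isdigit(c)))
            ;

        /* OTHER edge: the extra character belongs to the next token,
           so retract it (the "*" on the accepting state) */
        if (c != EOF) ungetc(c, in);
        return 1;
    }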

26
Automated lexical analyzer generation
  • Next time we discuss Lex and how it does its job
  • Given a set of regular expressions, produce C
    code to recognize the tokens.

27
Lexical Analysis
28
Lexical Analysis Example
29
Lexical Analysis With Lex
30
Lexical analysis with Lex
31
Lex source program format
  • The Lex program has three sections, separated by
    %%
  • declarations
  • translation rules
  • auxiliary code

32
Declarations section
  • Code between %{ and %} is inserted directly into
    the lex.yy.c. Should contain
  • Manifest constants (#define for each token)
  • Global variables, function declarations, typedefs
  • Outside %{ and %}, REGULAR DEFINITIONS are
    declared. Examples
  • delim   [ \t\n]
  • ws      {delim}+
  • letter  [A-Za-z]

Each definition is a name followed by a
pattern. Declared names can be used in later
patterns, if surrounded by { }.
33
Translation rules section
  • Translation rules take the form
  • p1   { action1 }
  • p2   { action2 }
  • ...
  • pn   { actionn }
  • Where pi is a regular expression and actioni is
    a C program fragment to be executed whenever pi
    is recognized in the input stream.
  • In regular expressions, references to regular
    definitions must be enclosed in { } to distinguish
    them from the corresponding character sequences.

34
Auxiliary procedures
  • Arbitrary C code can be placed in this section,
    e.g. functions to manipulate the symbol table.
  • See the complete example lex specification
    attached.
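A minimal sketch of what such a specification might look like (the token codes, the install_id() helper, and the yylval fields below are illustrative assumptions, not taken from the attached example):

    %{
    #include <stdlib.h>
    #include "tokens.h"   /* assumed header defining IF, ID, NUM, ... and yylval */
    %}

    delim   [ \t\n]
    ws      {delim}+
    letter  [A-Za-z]
    digit   [0-9]
    id      {letter}({letter}|{digit})*
    number  {digit}+(\.{digit}+)?(E[+-]?{digit}+)?

    %%

    {ws}      { /* strip whitespace, return no token */ }
    if        { return IF; }
    {id}      { yylval.sym = install_id(yytext); return ID; }
    {number}  { yylval.num = atof(yytext);       return NUM; }

    %%

    /* auxiliary procedures such as install_id() would go here */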

35
Special characters
  • Some characters have special meaning to Lex.
  • . in a RE stands for ANY character
  • * stands for Kleene closure
  • + stands for positive closure
  • ? stands for 0-or-1 instance of
  • - produces a character range (e.g. in [A-Z])
  • When you want to use these characters in a RE,
    they must be escaped
  • e.g. in the RE {digit}+(\.{digit}+)?, the . is
    escaped with \

36
Lex interface to yacc
  • The yacc parser calls a function yylex() produced
    by lex.
  • yylex() returns the next token it finds in the
    input stream.
  • yacc expects the token's attribute, if any, to be
    returned via the global variable yylval.
  • The declaration of yylval is up to you (the
    compiler writer). In our example, we use a union,
    since we have a few different kinds of attributes.
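A minimal sketch of that declaration and of how the two sides use it (the union members, token codes, and install_id() helper are assumptions for illustration):

    /* shared by the lexer and the parser, e.g. in a common header */
    typedef union {
        double            num;   /* attribute of a NUM token */
        struct sym_entry *sym;   /* attribute of an ID token  */
    } YYSTYPE;

    extern YYSTYPE yylval;       /* written by yylex(), read by the parser */
    int yylex(void);             /* returns the next token code, 0 at end of input */

    /* In a Lex action the lexer sets the attribute before returning:
           yylval.sym = install_id(yytext);  return ID;
       The parser then simply calls yylex() whenever it needs a token. */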

37
Lookahead in Lex
  • Sometimes, we don't know until looking ahead
    several characters what the next token is.
    Recognition of the DO keyword in Fortran is a
    famous example.
  • DO5I = 1.25 assigns the value 1.25 to the
    variable DO5I
  • DO5I = 1,25 is a DO loop (DO 5 I = 1,25)
  • Lex handles long-term lookahead with r1/r2
    DO/({letter}|{digit})*=({letter}|{digit})*,

(if it's followed by letters/digits, an =, more
letters/digits, followed by a ,)
Recognize keyword DO
38
Finite Automata for Lexical Analysis
39
Automatic lexical analyzer generation
  • How do Lex and similar tools do their job?
  • Lex translates regular expressions into
    transition diagrams.
  • Then it translates the transition diagrams into C
    code to recognize tokens in the input stream.
  • There are many possible algorithms.
  • The simplest algorithm is RE -> NFA -> DFA -> C
    code.

40
Finite automata (FAs) and regular languages
  • A RECOGNIZER takes language L and string x as
    input, and responds YES if x ∈ L, or NO otherwise.
  • The finite automaton (FA) is one class of
    recognizer.
  • A FA is DETERMINISTIC if there is only one
    possible transition for each <state, input> pair.
  • A FA is NONDETERMINISTIC if there is more than
    one possible transition for some <state, input> pair.
  • BUT both DFAs and NFAs recognize the same class
    of languages: the REGULAR languages, or the class of
    languages that can be written as regular
    expressions.

41
NFAs
  • A NFA is a 5-tuple < S, Σ, move, s0, F >
  • S is the set of STATES in the automaton.
  • Σ is the INPUT CHARACTER SET
  • move( s, c ) ⊆ S is the TRANSITION FUNCTION,
    specifying which states in S the automaton
    can move to on seeing input c while in state s.
  • s0 is the START STATE.
  • F is the set of FINAL, or ACCEPTING STATES

42
NFA example
The NFA
has move() function
  • and recognizes the language L = (a|b)*abb
  • (the set of all strings of a's and b's ending
    with abb)

43
The language defined by a NFA
  • An NFA ACCEPTS string x iff there exists a path
    from s0 to an accepting state, such that the edge
    labels along the path spell out x.
  • The LANGUAGE DEFINED BY a NFA N, written L(N), is
    the set of strings it accepts.

44
Another NFA example
  • This NFA accepts L = aa* | bb*

45
Deterministic FAs (DFAs)
  • The DFA is a special case of the NFA except
  • No state has an ε-transition
  • No state has more than one edge leaving it for
    the same input character.
  • The benefit of DFAs is that they are simple to
    simulate: there is only one choice for the
    machine's state after each input symbol.

46
Algorithm to simulate a DFA
  • Inputs string x terminated by EOF; DFA D =
    < S, Σ, move, s0, F >
  • Outputs YES if D accepts x, NO otherwise
  • Method
  • s = s0
  • c = nextchar
  • while ( c != EOF )
  • s = move( s, c )
  • c = nextchar
  • if ( s ∈ F ) return YES
  • else return NO
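A self-contained C sketch of this loop for a DFA accepting (a|b)*abb (the transition table, state numbering, and driver below are assumptions made for illustration):

    #include <stdio.h>

    /* Transition table for a DFA accepting (a|b)*abb:
       states 0..3, start state 0, accepting state 3;
       column 0 is input 'a', column 1 is input 'b'. */
    static const int move_tab[4][2] = {
        { 1, 0 },   /* state 0 */
        { 1, 2 },   /* state 1 */
        { 1, 3 },   /* state 2 */
        { 1, 0 },   /* state 3 (accepting) */
    };

    /* Returns 1 (YES) if the DFA accepts x, 0 (NO) otherwise. */
    int dfa_accepts(const char *x)
    {
        int s = 0;                           /* s = s0                 */
        for (; *x != '\0'; x++) {            /* while ( c != EOF )     */
            int col = (*x == 'a') ? 0 : (*x == 'b') ? 1 : -1;
            if (col < 0) return 0;           /* symbol not in alphabet */
            s = move_tab[s][col];            /* s = move( s, c )       */
        }
        return s == 3;                       /* is s in F ?            */
    }

    int main(void)
    {
        printf("%d\n", dfa_accepts("aababb"));  /* 1: ends in abb */
        printf("%d\n", dfa_accepts("abab"));    /* 0              */
        return 0;
    }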

47
DFA example
  • This DFA accepts L = (a|b)*abb

48
RE -> DFA
  • Now we know how to simulate DFAs.
  • If we can convert our REs into a DFA, we can
    automatically generate lexical analyzers.
  • BUT it is not easy to convert REs directly into a
    DFA.
  • Instead, we will convert our REs to a NFA, then
    convert the NFA to a DFA.

49
Converting a NFA to a DFA
50
NFA -> DFA
  • NFAs are ambiguous: we don't know what state a
    NFA is in after observing each input.
  • The simplest conversion method is to have the DFA
    track the SUBSET of states the NFA MIGHT be in.
  • We need three functions for the construction
  • ε-closure(s) the set of NFA states reachable
    from NFA state s on ε-transitions alone.
  • ε-closure(T) the set of NFA states reachable
    from some state s ∈ T on ε-transitions alone.
  • move(T,a) the set of NFA states to which there
    is a transition on input a from some NFA state s
    ∈ T
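A compact C sketch of these two functions over a small NFA (the bitset representation, the EPS marker, and the implicit limit of 64 NFA states are assumptions for illustration):

    #include <stdint.h>

    #define EPS -1                     /* marker for an epsilon transition       */
    typedef uint64_t stateset;         /* bit i set  <=>  NFA state i in the set */

    struct edge { int from; int sym; int to; };   /* one NFA transition */

    /* e-closure(T): states reachable from T on epsilon transitions alone. */
    stateset eps_closure(stateset T, const struct edge *e, int nedges)
    {
        stateset result = T;
        int changed = 1;
        while (changed) {                          /* iterate to a fixed point */
            changed = 0;
            for (int i = 0; i < nedges; i++) {
                stateset to = (stateset)1 << e[i].to;
                if (e[i].sym == EPS &&
                    (result & ((stateset)1 << e[i].from)) && !(result & to)) {
                    result |= to;
                    changed = 1;
                }
            }
        }
        return result;
    }

    /* move(T, a): states reachable from some state in T on input symbol a. */
    stateset move_set(stateset T, int a, const struct edge *e, int nedges)
    {
        stateset result = 0;
        for (int i = 0; i < nedges; i++)
            if (e[i].sym == a && (T & ((stateset)1 << e[i].from)))
                result |= (stateset)1 << e[i].to;
        return result;
    }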

51
Subset construction algorithm
  • Inputs a NFA N = < SN, Σ, tranN, n0, FN >
  • Outputs a DFA D = < SD, Σ, tranD, d0, FD >
  • Method
  • add a state d0 to SD
    corresponding to ε-closure(n0)
  • while there is an unexpanded state di ∈ SD
  • for each input symbol a ∈ Σ
  • dj = ε-closure(move(di,a))
  • if dj ∉ SD,
  • add dj to SD
  • tranD( di, a ) = dj
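Continuing the previous sketch (this reuses the stateset/edge types and the eps_closure()/move_set() helpers shown above; the fixed array sizes and char-indexed table are further assumptions, and no overflow check is done):

    #define MAX_DSTATES 128

    struct dfa {
        stateset subset[MAX_DSTATES];      /* NFA-state subset behind each DFA state    */
        int      tran[MAX_DSTATES][256];   /* tran[d][a] = target DFA state, -1 if none */
        int      ndstates;
    };

    static int find_dstate(const struct dfa *D, stateset T)
    {
        for (int i = 0; i < D->ndstates; i++)
            if (D->subset[i] == T) return i;
        return -1;
    }

    void subset_construct(struct dfa *D, int n0,
                          const struct edge *e, int nedges,
                          const int *alphabet, int nsyms)
    {
        D->ndstates = 0;
        /* d0 corresponds to e-closure({ n0 }) */
        D->subset[D->ndstates++] = eps_closure((stateset)1 << n0, e, nedges);

        for (int di = 0; di < D->ndstates; di++) {       /* unexpanded DFA states */
            for (int k = 0; k < nsyms; k++) {
                int a = alphabet[k];
                stateset T = eps_closure(move_set(D->subset[di], a, e, nedges),
                                         e, nedges);
                int dj = -1;
                if (T != 0) {
                    dj = find_dstate(D, T);
                    if (dj < 0) {                        /* dj not yet in SD: add it */
                        dj = D->ndstates++;
                        D->subset[dj] = T;
                    }
                }
                D->tran[di][a] = dj;                     /* tranD(di, a) = dj */
            }
        }
    }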

52
Examples: convert these NFAs
a)
b)
53
Converting a RE to a NFA
54
RE -> NFA
  • The construction is bottom up.
  • Construct NFAs to recognize ε and each element a
    ∈ Σ.
  • Recursively expand those NFAs for alternation,
    concatenation, and Kleene closure.
  • Every step introduces at most two additional NFA
    states.
  • Therefore the NFA is at most twice as large as
    the regular expression.

55
RE -gt NFA algorithm
  • Inputs A RE r over alphabet Σ
  • Outputs A NFA N accepting L(r)
  • Method Parse r.

If r = ε, then N is
If r = a ∈ Σ, then N is
If r = s | t, construct N(s) for s and N(t) for t
then N is
56
RE -gt NFA algorithm
If r = st, construct N(s) for s and N(t) for t
then N is
If r = s*, construct N(s) for s, then N is
If r = ( s ), construct N(s) then let N be N(s).
57
Example
  • Use the NFA construction algorithm to build a NFA
    for r = (a|b)*abb