Compiler Construction



1
Compiler Construction
  • Lexical Analysis


2
Lexical Analysis
  • get next token is a command sent from the
    parser to the lexical analyzer.
  • On receipt of the command, the lexical analyzer
    scans the input until it determines the next
    token, and returns it.
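A minimal C sketch of this interface (the type and function names Token and get_next_token are illustrative assumptions, not taken from the slides):

    typedef enum { TOK_ID, TOK_NUM, TOK_ASSIGN, TOK_EOF } Token;

    /* Implemented by the lexical analyzer: scans the input until the
       next complete token is found, then returns it. */
    Token get_next_token(void);

    void parse(void)
    {
        Token t;
        while ((t = get_next_token()) != TOK_EOF) {
            /* the parser works one token at a time,
               never touching raw characters */
        }
    }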

3
Other jobs of the lexical analyzer
  • We also want the lexer to
  • Strip out comments and white space from the
    source code.
  • Correlate parser errors with the source code
    location (the parser doesn't know what line of
    the file it's at, but the lexer does)

4
Tokens, patterns, and lexemes
  • A TOKEN is a set of strings over the source
    alphabet.
  • A PATTERN is a rule that describes that set.
  • A LEXEME is a sequence of characters matching
    that pattern.
  • E.g. in Pascal, for the statement
  • const pi = 3.1416
  • The substring pi is a lexeme for the token
    identifier

5
Example tokens, lexemes, patterns
6
Tokens
  • Together, the complete set of tokens forms the set
    of terminal symbols used in the grammar for the
    parser.
  • In most languages, the tokens fall into these
    categories
  • Keywords
  • Operators
  • Identifiers
  • Constants
  • Literal strings
  • Punctuation
  • Usually the token is represented as an integer.
  • The lexer and parser just agree on which integers
    are used for each token.

7
Token attributes
  • If there is more than one lexeme for a token, we
    have to save additional information about the
    token.
  • Example: the token number matches the lexemes 10
    and 20.
  • Code generation needs the actual number, not just
    the token.
  • With each token, we associate ATTRIBUTES.
    Normally just a pointer into the symbol table.
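A token/attribute pair might be represented in C roughly as follows (the struct layout is an assumption for illustration, not something given on the slides):

    struct token {
        int               code;   /* the integer token code agreed with the parser */
        struct sym_entry *attr;   /* pointer into the symbol table, or NULL */
    };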

8
Example attributes
  • For C source code
  • E = M * C * C
  • We have token/attribute pairs
  • <ID, ptr to symbol table entry for E>
  • <Assign_op, NULL>
  • <ID, ptr to symbol table entry for M>
  • <Mult_op, NULL>
  • <ID, ptr to symbol table entry for C>
  • <Mult_op, NULL>
  • <ID, ptr to symbol table entry for C>

9
Lexical errors
  • When errors occur, we could just crash.
  • It is better to print an error message, then
    continue.
  • Possible techniques to continue on error
  • Delete a character
  • Insert a missing character
  • Replace an incorrect character by a correct
    character
  • Transpose adjacent characters

10
Token specification
  • REGULAR EXPRESSIONS (REs) are the most common
    notation for pattern specification.
  • Every pattern specifies a set of strings, so an
    RE names a set of strings.
  • Definitions
  • The ALPHABET (often written Σ) is the set of
    legal input symbols
  • A STRING over some alphabet Σ is a finite
    sequence of symbols from Σ
  • The LENGTH of string s is written |s|
  • The EMPTY STRING is a special 0-length string
    denoted ε

11
More definitions strings and substrings
  • A PREFIX of s is formed by removing 0 or more
    trailing symbols of s
  • A SUFFIX of s is formed by removing 0 or more
    leading symbols of s
  • A SUBSTRING of s is formed by deleting a prefix
    and a suffix from s
  • A PROPER prefix, suffix, or substring is a
    nonempty string x that is, respectively, a
    prefix, suffix, or substring of s but with x ≠ s.

12
More definitions
  • A LANGUAGE is a set of strings over a fixed
    alphabet Σ.
  • Example languages
  • Ø (the empty set)
  • { ε }
  • { a, aa, aaa, aaaa }
  • The CONCATENATION of two strings x and y is
    written xy
  • String EXPONENTIATION is written s^i, where s^0 = ε
    and s^i = s^(i-1)s for i > 0.

13
Operations on languages
  • We often want to perform operations on sets of
    strings (languages). The important ones are
  • The UNION of L and M: L ∪ M = { s | s is in L OR
    s is in M }
  • The CONCATENATION of L and M: LM = { st | s is in
    L and t is in M }
  • The KLEENE CLOSURE of L
  • The POSITIVE CLOSURE of L
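    The slide gave the two closures as formulas; in the usual notation
    (a reconstruction, not copied from the slide) they are
    L* = L^0 ∪ L^1 ∪ L^2 ∪ ...   (zero or more concatenations of L)
    L+ = L^1 ∪ L^2 ∪ ... = L L*  (one or more concatenations of L)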

14
Regular expressions
  • REs let us precisely define a set of strings.
  • For C identifiers, we might use ( letter | _ ) (
    letter | digit | _ )*
  • Parentheses are for grouping, | means OR, and *
    means Kleene closure.
  • Every RE defines a language L(r).

15
Regular expressions
  • Here are the rules for writing REs over an
    alphabet Σ
  • ε is an RE denoting { ε }, the language
    containing only the empty string.
  • If a is in Σ, then a is a RE denoting { a }.
  • If r and s are REs denoting L(r) and L(s), then
  • (r)|(s) is a RE denoting L(r) ∪ L(s)
  • (r)(s) is a RE denoting L(r)L(s)
  • (r)* is a RE denoting (L(r))*
  • (r) is a RE denoting L(r)

16
Additional conventions
  • To avoid too many parentheses, we assume
  • * has the highest precedence, and is left
    associative.
  • Concatenation has the 2nd highest precedence, and
    is left associative.
  • | has the lowest precedence and is left
    associative.

17
Example REs
  • a | b
  • ( a | b ) ( a | b )
  • a*
  • ( a | b )*
  • a | a*b

18
Equivalence of REs
19
Regular definitions
  • To make our REs simpler, we can give names to
    subexpressions. A REGULAR DEFINITION is a
    sequence
  • d1 -> r1
  • d2 -> r2
  • ...
  • dn -> rn

20
Regular definitions
  • Example for identifiers in C
  • letter -> A | B | ... | Z | a | b | ... | z
  • digit -> 0 | 1 | ... | 9
  • id -> ( letter | _ ) ( letter | digit | _ )*
  • Example for numbers in Pascal
  • digit -> 0 | 1 | ... | 9
  • digits -> digit digit*
  • optional_fraction -> . digits | ε
  • optional_exponent -> ( E ( + | - | ε ) digits )
    | ε
  • num -> digits optional_fraction optional_exponent
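    As an illustration (the lexeme 31.4E2 is an assumed example, not one
    from the slides), num matches 31.4E2 with digits = 31,
    optional_fraction = .4, and optional_exponent = E2.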

21
Notational shorthand
  • To simplify out REs, we can use a few shortcuts
  • 1. + means one or more instances of,
    e.g. a+ or (ab)+
  • 2. ? means zero or one instance of, e.g.
    optional_fraction -> ( . digits )?
  • 3. [ ] creates a character class, e.g.
    [A-Za-z][A-Za-z0-9]*
  • You can prove that these shortcuts do not
    increase the representational power of REs, but
    they are convenient.

22
Token recognition
  • We now know how to specify the tokens for our
    language. But how do we write a program to
    recognize them?
  • if -> if
  • then -> then
  • else -> else
  • relop -> < | <= | = | <> | > | >=
  • id -> letter ( letter | digit )*
  • num -> digit+ ( . digit+ )? ( E (+|-)? digit+ )?

23
Token recognition
  • We also want to strip whitespace, so we need
    definitions
  • delim -> blank | tab | newline
  • ws -> delim+

24
Attribute values
25
Transition diagrams
  • Transition diagrams are also called finite
    automata.
  • We have a collection of STATES drawn as nodes in
    a graph.
  • TRANSITIONS between states are represented by
    directed edges in the graph.
  • Each transition leaving a state s is labeled with
    a set of input characters that can occur after
    state s.
  • For now, the transitions must be DETERMINISTIC.
  • Each transition diagram has a single START state
    and a set of TERMINAL STATES.
  • The label OTHER on an edge indicates all possible
    inputs not handled by the other transitions.
  • Usually, when we recognize OTHER, we need to put
    it back in the source stream since it is part of
    the next token. This action is denoted with a *
    next to the corresponding state.
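As a concrete sketch (not from the slides), the transition diagram for id -> letter ( letter | digit )* could be hand-coded in C as below; ungetc() plays the role of the * retraction on the OTHER edge:

    #include <ctype.h>
    #include <stdio.h>

    /* Recognize id -> letter ( letter | digit )* on stream in.
       Returns 1 if an identifier was consumed, 0 otherwise. */
    int match_id(FILE *in)
    {
        int c = fgetc(in);

        /* start state: an identifier must begin with a letter */
        if (!isalpha(c)) {
            if (c != EOF) ungetc(c, in);
            return 0;
        }

        /* looping state: consume letters and digits */
        while ((c = fgetc(in)) != EOF && (isalpha(c) || isdigit(c)))
            ;

        /* OTHER edge: the extra character belongs to the next token,
           so retract it (the "*" on the accepting state) */
        if (c != EOF) ungetc(c, in);
        return 1;
    }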

26
Automated lexical analyzer generation
  • Next time we discuss Lex and how it does its job
  • Given a set of regular expressions, produce C
    code to recognize the tokens.

27
Lexical Analysis
28
Lexical Analysis Example
29
Lexical Analysis With Lex
30
Lexical analysis with Lex
31
Lex source program format
  • The Lex program has three sections, separated by
    %%
  • declarations
  • translation rules
  • auxiliary code

32
Declarations section
  • Code between %{ and %} is inserted directly into
    the lex.yy.c. Should contain
  • Manifest constants (#define for each token)
  • Global variables, function declarations, typedefs
  • Outside %{ and %}, REGULAR DEFINITIONS are
    declared. Examples
  • delim   [ \t\n]
  • ws      {delim}+
  • letter  [A-Za-z]

Each definition is a name followed by a
pattern. Declared names can be used in later
patterns, if surrounded by { }.
33
Translation rules section
  • Translation rules take the form
  • p1   { action1 }
  • p2   { action2 }
  • ...
  • pn   { actionn }
  • Where pi is a regular expression and actioni is
    a C program fragment to be executed whenever pi
    is recognized in the input stream.
  • In regular expressions, references to regular
    definitions must be enclosed in { } to distinguish
    them from the corresponding character sequences.

34
Auxiliary procedures
  • Arbitrary C code can be placed in this section,
    e.g. functions to manipulate the symbol table.
  • See the complete example lex specification
    attached.
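A minimal sketch of what such a specification might look like (the token codes, the install_id() helper, and the yylval fields below are illustrative assumptions, not taken from the attached example):

    %{
    #include <stdlib.h>
    #include "tokens.h"   /* assumed header defining IF, ID, NUM, ... and yylval */
    %}

    delim   [ \t\n]
    ws      {delim}+
    letter  [A-Za-z]
    digit   [0-9]
    id      {letter}({letter}|{digit})*
    number  {digit}+(\.{digit}+)?(E[+-]?{digit}+)?

    %%

    {ws}      { /* strip whitespace, return no token */ }
    if        { return IF; }
    {id}      { yylval.sym = install_id(yytext); return ID; }
    {number}  { yylval.num = atof(yytext);       return NUM; }

    %%

    /* auxiliary procedures such as install_id() would go here */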

35
Special characters
  • Some characters have special meaning to Lex.
  • . in a RE stands for ANY character
  • * stands for Kleene closure
  • + stands for positive closure
  • ? stands for 0-or-1 instance of
  • - produces a character range (e.g. in [A-Z])
  • When you want to use these characters in a RE,
    they must be escaped
  • e.g. in the RE {digit}+(\.{digit}+)?, the . is
    escaped with \

36
Lex interface to yacc
  • The yacc parser calls a function yylex() produced
    by lex.
  • yylex() returns the next token it finds in the
    input stream.
  • yacc expects the token's attribute, if any, to be
    returned via the global variable yylval.
  • The declaration of yylval is up to you (the
    compiler writer). In our example, we use a union,
    since we have a few different kinds of attributes.
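A minimal sketch of that declaration and of how the two sides use it (the union members, token codes, and install_id() helper are assumptions for illustration):

    /* shared by the lexer and the parser, e.g. in a common header */
    typedef union {
        double            num;   /* attribute of a NUM token */
        struct sym_entry *sym;   /* attribute of an ID token  */
    } YYSTYPE;

    extern YYSTYPE yylval;       /* written by yylex(), read by the parser */
    int yylex(void);             /* returns the next token code, 0 at end of input */

    /* In a Lex action the lexer sets the attribute before returning:
           yylval.sym = install_id(yytext);  return ID;
       The parser then simply calls yylex() whenever it needs a token. */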

37
Lookahead in Lex
  • Sometimes, we don't know until looking ahead
    several characters what the next token is.
    Recognition of the DO keyword in Fortran is a
    famous example.
  • DO5I = 1.25 assigns the value 1.25 to the
    variable DO5I
  • DO5I = 1,25 is a DO loop (DO 5 I = 1,25)
  • Lex handles long-term lookahead with r1/r2
    DO/({letter}|{digit})*=({letter}|{digit})*,

(if it's followed by letters/digits, an =, more
letters/digits, followed by a ,)
Recognize keyword DO
38
Finite Automata for Lexical Analysis
39
Automatic lexical analyzer generation
  • How do Lex and similar tools do their job?
  • Lex translates regular expressions into
    transition diagrams.
  • Then it translates the transition diagrams into C
    code to recognize tokens in the input stream.
  • There are many possible algorithms.
  • The simplest algorithm is RE -> NFA -> DFA -> C
    code.

40
Finite automata (FAs) and regular languages
  • A RECOGNIZER takes language L and string x as
    input, and responds YES if x ∈ L, or NO otherwise.
  • The finite automaton (FA) is one class of
    recognizer.
  • A FA is DETERMINISTIC if there is only one
    possible transition for each <state, input> pair.
  • A FA is NONDETERMINISTIC if there is more than
    one possible transition for some <state, input> pair.
  • BUT both DFAs and NFAs recognize the same class
    of languages: the REGULAR languages, or the class of
    languages that can be written as regular
    expressions.

41
NFAs
  • A NFA is a 5-tuple < S, Σ, move, s0, F >
  • S is the set of STATES in the automaton.
  • Σ is the INPUT CHARACTER SET
  • move( s, c ) ⊆ S is the TRANSITION FUNCTION,
    specifying which states in S the automaton
    can move to on seeing input c while in state s.
  • s0 is the START STATE.
  • F is the set of FINAL, or ACCEPTING STATES

42
NFA example
The NFA
has move() function
  • and recognizes the language L = (a|b)*abb
  • (the set of all strings of a's and b's ending
    with abb)

43
The language defined by a NFA
  • An NFA ACCEPTS string x iff there exists a path
    from s0 to an accepting state, such that the edge
    labels along the path spell out x.
  • The LANGUAGE DEFINED BY a NFA N, written L(N), is
    the set of strings it accepts.

44
Another NFA example
  • This NFA accepts L = aa* | bb*

45
Deterministic FAs (DFAs)
  • The DFA is a special case of the NFA except
  • No state has an ε-transition
  • No state has more than one edge leaving it for
    the same input character.
  • The benefit of DFAs is that they are simple to
    simulate: there is only one choice for the
    machine's state after each input symbol.

46
Algorithm to simulate a DFA
  • Inputs string x terminated by EOF; DFA D =
    < S, Σ, move, s0, F >
  • Outputs YES if D accepts x, NO otherwise
  • Method
  • s = s0
  • c = nextchar
  • while ( c != EOF )
  • s = move( s, c )
  • c = nextchar
  • if ( s ∈ F ) return YES
  • else return NO
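A self-contained C sketch of this loop for a DFA accepting (a|b)*abb (the transition table, state numbering, and driver below are assumptions made for illustration):

    #include <stdio.h>

    /* Transition table for a DFA accepting (a|b)*abb:
       states 0..3, start state 0, accepting state 3;
       column 0 is input 'a', column 1 is input 'b'. */
    static const int move_tab[4][2] = {
        { 1, 0 },   /* state 0 */
        { 1, 2 },   /* state 1 */
        { 1, 3 },   /* state 2 */
        { 1, 0 },   /* state 3 (accepting) */
    };

    /* Returns 1 (YES) if the DFA accepts x, 0 (NO) otherwise. */
    int dfa_accepts(const char *x)
    {
        int s = 0;                           /* s = s0                 */
        for (; *x != '\0'; x++) {            /* while ( c != EOF )     */
            int col = (*x == 'a') ? 0 : (*x == 'b') ? 1 : -1;
            if (col < 0) return 0;           /* symbol not in alphabet */
            s = move_tab[s][col];            /* s = move( s, c )       */
        }
        return s == 3;                       /* is s in F ?            */
    }

    int main(void)
    {
        printf("%d\n", dfa_accepts("aababb"));  /* 1: ends in abb */
        printf("%d\n", dfa_accepts("abab"));    /* 0              */
        return 0;
    }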

47
DFA example
  • This DFA accepts L = (a|b)*abb

48
RE -> DFA
  • Now we know how to simulate DFAs.
  • If we can convert our REs into a DFA, we can
    automatically generate lexical analyzers.
  • BUT it is not easy to convert REs directly into a
    DFA.
  • Instead, we will convert our REs to a NFA, then
    convert the NFA to a DFA.

49
Converting a NFA to a DFA
50
NFA -> DFA
  • NFAs are ambiguous: we don't know what state a
    NFA is in after observing each input.
  • The simplest conversion method is to have the DFA
    track the SUBSET of states the NFA MIGHT be in.
  • We need three functions for the construction
  • ε-closure(s) the set of NFA states reachable
    from NFA state s on ε-transitions alone.
  • ε-closure(T) the set of NFA states reachable
    from some state s ∈ T on ε-transitions alone.
  • move(T,a) the set of NFA states to which there
    is a transition on input a from some NFA state s
    ∈ T
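A compact C sketch of these two functions over a small NFA (the bitset representation, the EPS marker, and the implicit limit of 64 NFA states are assumptions for illustration):

    #include <stdint.h>

    #define EPS -1                     /* marker for an epsilon transition       */
    typedef uint64_t stateset;         /* bit i set  <=>  NFA state i in the set */

    struct edge { int from; int sym; int to; };   /* one NFA transition */

    /* e-closure(T): states reachable from T on epsilon transitions alone. */
    stateset eps_closure(stateset T, const struct edge *e, int nedges)
    {
        stateset result = T;
        int changed = 1;
        while (changed) {                          /* iterate to a fixed point */
            changed = 0;
            for (int i = 0; i < nedges; i++) {
                stateset to = (stateset)1 << e[i].to;
                if (e[i].sym == EPS &&
                    (result & ((stateset)1 << e[i].from)) && !(result & to)) {
                    result |= to;
                    changed = 1;
                }
            }
        }
        return result;
    }

    /* move(T, a): states reachable from some state in T on input symbol a. */
    stateset move_set(stateset T, int a, const struct edge *e, int nedges)
    {
        stateset result = 0;
        for (int i = 0; i < nedges; i++)
            if (e[i].sym == a && (T & ((stateset)1 << e[i].from)))
                result |= (stateset)1 << e[i].to;
        return result;
    }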

51
Subset construction algorithm
  • Inputs a NFA N = < SN, Σ, tranN, n0, FN >
  • Outputs a DFA D = < SD, Σ, tranD, d0, FD >
  • Method
  • add a state d0 to SD
    corresponding to ε-closure(n0)
  • while there is an unexpanded state di ∈ SD
  • for each input symbol a ∈ Σ
  • dj = ε-closure(move(di,a))
  • if dj ∉ SD,
  • add dj to SD
  • tranD( di, a ) = dj
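Continuing the previous sketch (this reuses the stateset/edge types and the eps_closure()/move_set() helpers shown above; the fixed array sizes and char-indexed table are further assumptions, and no overflow check is done):

    #define MAX_DSTATES 128

    struct dfa {
        stateset subset[MAX_DSTATES];      /* NFA-state subset behind each DFA state    */
        int      tran[MAX_DSTATES][256];   /* tran[d][a] = target DFA state, -1 if none */
        int      ndstates;
    };

    static int find_dstate(const struct dfa *D, stateset T)
    {
        for (int i = 0; i < D->ndstates; i++)
            if (D->subset[i] == T) return i;
        return -1;
    }

    void subset_construct(struct dfa *D, int n0,
                          const struct edge *e, int nedges,
                          const int *alphabet, int nsyms)
    {
        D->ndstates = 0;
        /* d0 corresponds to e-closure({ n0 }) */
        D->subset[D->ndstates++] = eps_closure((stateset)1 << n0, e, nedges);

        for (int di = 0; di < D->ndstates; di++) {       /* unexpanded DFA states */
            for (int k = 0; k < nsyms; k++) {
                int a = alphabet[k];
                stateset T = eps_closure(move_set(D->subset[di], a, e, nedges),
                                         e, nedges);
                int dj = -1;
                if (T != 0) {
                    dj = find_dstate(D, T);
                    if (dj < 0) {                        /* dj not yet in SD: add it */
                        dj = D->ndstates++;
                        D->subset[dj] = T;
                    }
                }
                D->tran[di][a] = dj;                     /* tranD(di, a) = dj */
            }
        }
    }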

52
Examples: convert these NFAs
a)
b)
53
Converting a RE to a NFA
54
RE -> NFA
  • The construction is bottom up.
  • Construct NFAs to recognize ε and each element a
    ∈ Σ.
  • Recursively expand those NFAs for alternation,
    concatenation, and Kleene closure.
  • Every step introduces at most two additional NFA
    states.
  • Therefore the NFA is at most twice as large as
    the regular expression.

55
RE -gt NFA algorithm
  • Inputs A RE r over alphabet Σ
  • Outputs A NFA N accepting L(r)
  • Method Parse r.

If r = ε, then N is
If r = a ∈ Σ, then N is
If r = s | t, construct N(s) for s and N(t) for t
then N is
56
RE -gt NFA algorithm
If r = st, construct N(s) for s and N(t) for t
then N is
If r = s*, construct N(s) for s, then N is
If r = ( s ), construct N(s) then let N be N(s).
57
Example
  • Use the NFA construction algorithm to build a NFA
    for r = (a|b)*abb