COS 320 Compilers

1
COS 320 Compilers
  • David Walker

2
Outline
  • Last Week
  • Introduction to ML
  • Today
  • Lexical Analysis
  • Reading: Chapter 2 of Appel

3
The Front End
  • Lexical Analysis: Create sequence of tokens from
    characters
  • Syntax Analysis: Create abstract syntax tree from
    sequence of tokens
  • Type Checking: Check program for well-formedness
    constraints

stream of characters -> Lexer -> stream of tokens -> Parser -> abstract syntax -> Type Checker
4
Lexical Analysis
  • Lexical Analysis: Breaks stream of ASCII
    characters (source) into tokens
  • Token: An atomic unit of program syntax
  • i.e., a word as opposed to a sentence
  • Tokens and their types

Type      Characters Recognized    Token
ID        foo, x, listcount        ID(foo), ID(x), ...
REAL      10.45, 3.14, -2.1        REAL(10.45), REAL(3.14), ...
SEMI      ;                        SEMI
LPAREN    (                        LPAREN
NUM       50, 100                  NUM(50), NUM(100)
IF        if                       IF
5
Lexical Analysis Example
x = ( y + 4.0 ) ;
6
Lexical Analysis Example
x = ( y + 4.0 ) ;
Lexical Analysis -> ID(x)
7
Lexical Analysis Example
x = ( y + 4.0 ) ;
Lexical Analysis -> ID(x) ASSIGN
8
Lexical Analysis Example
x = ( y + 4.0 ) ;
Lexical Analysis -> ID(x) ASSIGN LPAREN ID(y) PLUS REAL(4.0) RPAREN SEMI
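The tokenization in this example can be sketched with a small regular-expression tokenizer. This is an illustration in Python rather than ML, and it is not how ml-lex works internally. The rule table, the SKIP rule for whitespace, and the function name tokenize are assumptions made for the sketch; the token names come from the slides.

```python
import re

# Token rules as (type, pattern) pairs; token names follow the slides.
# Python's alternation takes the first alternative that matches, so the
# order matters here: REAL comes before NUM, and IF before ID.
TOKEN_RULES = [
    ("REAL",   r"-?[0-9]+\.[0-9]+"),
    ("NUM",    r"[0-9]+"),
    ("IF",     r"if\b"),
    ("ID",     r"[A-Za-z][A-Za-z0-9]*"),
    ("ASSIGN", r"="),
    ("PLUS",   r"\+"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SEMI",   r";"),
    ("SKIP",   r"\s+"),   # assumption: whitespace separates tokens and is dropped
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_RULES))


def tokenize(source):
    """Turn a string of characters into a list of token strings."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise ValueError(f"illegal character {source[pos]!r} at position {pos}")
        kind = m.lastgroup
        if kind in ("ID", "REAL", "NUM"):
            tokens.append(f"{kind}({m.group()})")   # tokens that carry a value
        elif kind != "SKIP":
            tokens.append(kind)
        pos = m.end()
    return tokens


print(" ".join(tokenize("x = ( y + 4.0 ) ;")))
# ID(x) ASSIGN LPAREN ID(y) PLUS REAL(4.0) RPAREN SEMI
```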
11
Lexer Implementation
  • Implementation Options
  • Write a Lexer from scratch
  • Boring, error-prone and too much work
  • Use a Lexer Generator
  • Quick and easy. Good for lazy compiler writers.

Lexer Specification -> lexer generator -> Lexer
stream of characters -> Lexer -> stream of tokens
12
  • How do we specify the lexer?
  • Develop another language
  • We'll use a language involving regular
    expressions to specify tokens
  • What is a lexer generator?
  • Another compiler ....

13
Some Definitions
  • We will want to define the language of legal
    tokens our lexer can recognize
  • Alphabet: a collection of symbols (ASCII is an
    alphabet)
  • String: a finite sequence of symbols taken from
    our alphabet
  • Language of legal tokens: a set of strings
  • Language of ML keywords: set of all strings
    which are ML keywords (FINITE)
  • Language of ML tokens: set of all strings which
    map to ML tokens (INFINITE)
  • Some people use the word language to mean more
    general sets
  • e.g., ML Language: set of all strings
    representing correct ML programs (INFINITE).

14
Regular Expressions Construction
  • Base Cases
  • For each symbol a in alphabet, a is a RE denoting
    the set {a}
  • Epsilon (ε) denotes the set containing only the
    empty string
  • Inductive Cases (M and N are REs)
  • Alternation (M | N) denotes strings in M or N
  • (a | b) = {a, b}
  • Concatenation (M N) denotes strings in M
    concatenated with strings in N
  • (a | b) (a | c) = {aa, ac, ba, bc}
  • Kleene closure (M*) denotes strings formed by any
    number of repetitions of strings in M
  • (a | b)* = {ε, a, b, aa, ab, ba, bb, ...}
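One way to check the three inductive cases is to treat each regular expression as the set of strings it denotes. The sketch below does this in Python rather than ML. The function names and the max_reps bound are assumptions: the true Kleene closure is infinite, so only a finite prefix of it is enumerated.

```python
from itertools import product


def alternation(m, n):
    """M | N: strings in M or in N."""
    return m | n


def concatenation(m, n):
    """M N: every string of M followed by every string of N."""
    return {x + y for x, y in product(m, n)}


def kleene_closure(m, max_reps):
    """M*: strings made of 0..max_reps repetitions of strings in M.

    The real closure is infinite; max_reps is a cutoff for illustration.
    """
    result = {""}          # zero repetitions gives the empty string (epsilon)
    layer = {""}
    for _ in range(max_reps):
        layer = concatenation(layer, m)
        result |= layer
    return result


a, b, c = {"a"}, {"b"}, {"c"}
print(sorted(alternation(a, b)))
# ['a', 'b']
print(sorted(concatenation(alternation(a, b), alternation(a, c))))
# ['aa', 'ac', 'ba', 'bc']
print(sorted(kleene_closure(alternation(a, b), 2), key=lambda s: (len(s), s)))
# ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
```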

16
Regular Expressions
  • Integers begin with an optional minus sign,
    continue with a sequence of digits
  • Regular Expression
  • (- | ε) (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
    9)*
  • So writing (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
    9) and even worse (a | b | c | ...) gets
    tedious...

17
Regular Expressions
  • common abbreviations
  • [a-c] == (a | b | c)
  • . == any character except \n
  • \n == new line character
  • a+ == one or more
  • a? == zero or one
  • all abbreviations can be defined in terms of the
    standard regular expressions
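The claim that abbreviations add no expressive power can be spot-checked on the integer example. The sketch below uses Python's re module, not ML-Lex syntax, and compares an expanded pattern built only from alternation, concatenation and closure against the abbreviated form. Both patterns require at least one digit, which is an assumption about the intended meaning of "a sequence of digits". The sample strings are arbitrary.

```python
import re

DIGIT = "(0|1|2|3|4|5|6|7|8|9)"
# Optional minus sign written as (- | epsilon), then one digit, then digit*.
EXPANDED = f"(-|){DIGIT}{DIGIT}*"
# The same language written with the ?, [0-9] and + abbreviations.
ABBREVIATED = r"-?[0-9]+"


def accepts(pattern, text):
    """True if the whole of text is in the language of pattern."""
    return re.fullmatch(pattern, text) is not None


for sample in ["0", "42", "-7", "-", "", "4a", "--1"]:
    print(repr(sample), accepts(EXPANDED, sample), accepts(ABBREVIATED, sample))
```

Agreement on a handful of samples is only a sanity check, not a proof that the two expressions denote the same language.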

18
Ambiguous Token Rule Sets
  • A single expression is a completely unambiguous
    specification of a token.
  • Sometimes, when we put together a set of regular
    expressions to specify all of the tokens in a
    language, ambiguities arise
  • i.e., two regular expressions match the same
    string

19
Ambiguous Token Rule Sets
  • Example
  • Identifier tokens: [a-z] ([a-z] | [0-9])*
  • Sample keyword tokens: if, then, ...
  • How do we tokenize:
  • foobar => ID(foobar) or ID(foo) ID(bar)?
  • if => ID(if) or IF?

20
Ambiguous Token Rule Sets
  • We resolve ambiguities using two rules
  • Longest match: The regular expression that
    matches the longest string takes precedence.
  • Rule Priority: The regular expressions
    identifying tokens are written down in sequence.
    If two regular expressions match the same
    (longest) string, the first regular expression in
    the sequence takes precedence.
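The two rules can be sketched as a loop over an ordered rule list. This is a Python illustration, not the algorithm ml-lex uses. The rule list and the name next_token are assumptions, and the keyword rules are placed before the identifier rule so that rule priority favors them.

```python
import re

# Rules in priority order: earlier rules win ties.
RULES = [
    ("IF",   re.compile(r"if")),
    ("THEN", re.compile(r"then")),
    ("ID",   re.compile(r"[a-z][a-z0-9]*")),
]


def next_token(text, pos=0):
    """Return (token type, lexeme) for the token starting at pos."""
    best = None  # (length, token type, lexeme)
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # Strictly longer matches replace the current best (longest match).
        # Equal-length matches do not, so the earlier rule is kept
        # (rule priority).
        if m and (best is None or len(m.group()) > best[0]):
            best = (len(m.group()), name, m.group())
    if best is None:
        raise ValueError(f"no rule matches at position {pos}")
    return best[1], best[2]


print(next_token("foobar"))  # ('ID', 'foobar')
print(next_token("if"))      # ('IF', 'if')
print(next_token("iffy"))    # ('ID', 'iffy')
```

Each pattern here is simple enough that Python's greedy matching returns its longest match. That does not hold for arbitrary patterns containing alternation, so this sketch should not be read as a general longest-match engine.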

22
Ambiguous Token Rule Sets
  • Example
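  • Identifier tokens: [a-z] ([a-z] | [0-9])*
  • Sample keyword tokens: if, then, ...
  • How do we tokenize:
  • foobar => ID(foobar), not ID(foo) ID(bar)
    (longest match)
  • if => IF, not ID(if) (rule priority, assuming the
    keyword rules are written before the identifier
    rule)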

23
Lexer Implementation
  • Implementation Options
  • Write Lexer from scratch
  • Boring and error-prone
  • Use Lexical Analyzer Generator
  • Quick and easy
  • ml-lex is a lexical analyzer generator for ML.
  • lex and flex are lexical analyzer generators for
    C.

24
ML-Lex Specification
  • Lexical specification consists of 3 parts

User Declarations
%%
ML-LEX Definitions
%%
Rules
25
User Declarations
  • User Declarations
  • User can define various values that are available
    to the action fragments.
  • Two values must be defined in this section
  • type lexresult
  • type of the value returned by each rule action.
  • fun eof ()
  • called by lexer when end of input stream is
    reached.

26
ML-LEX Definitions
  • ML-LEX Definitions
  • User can define regular expression abbreviations
  • Define multiple lexers to work together. Each is
    given a unique name.

DIGITS=[0-9];
LETTER=[a-zA-Z];
%s LEX1 LEX2 LEX3;
28
Rules
  • Rules
  • A rule consists of a pattern and an action
  • Pattern is a regular expression.
  • Action is a fragment of ordinary ML code.
  • Longest match & rule priority used for
    disambiguation
  • Rules may be prefixed with the list of lexers
    that are allowed to use this rule.

<lexer_list> regular_expression => (action.code);
29
Rules
  • Rule actions can use any value defined in the
    User Declarations section, including
  • type lexresult
  • type of value returned by each rule action
  • val eof : unit -> lexresult
  • called by lexer when end of input stream reached
  • special variables
  • yytext: input substring matched by regular
    expression
  • yypos: file position of the beginning of matched
    string
  • continue (): used to recursively call the lexer

30
A Simple Lexer
datatype token = Num of int | Id of string | IF | THEN | ELSE | EOF
type lexresult = token   (* mandatory *)
fun eof () = EOF         (* mandatory *)
fun itos s = case Int.fromString s of
    SOME x => x
  | NONE => raise Fail "not an integer"
%%
NUM=[1-9][0-9]*;
ID=[a-zA-Z]([a-zA-Z]|{NUM})*;
%%
if => (IF);
then => (THEN);
else => (ELSE);
{NUM} => (Num (itos yytext));
{ID} => (Id yytext);
31
Using Multiple Lexers
  • Rules prefixed with a lexer name are matched only
    when that lexer is executing
  • Enter new lexer using command YYBEGIN
  • Initial lexer is called INITIAL

32
Using Multiple Lexers
type lexresult = unit   (* mandatory *)
fun eof () = ()         (* mandatory *)
%%
%s COMMENT;
%%
<INITIAL>if => ();
<INITIAL>[a-z]+ => ();
<INITIAL>"(*" => (YYBEGIN COMMENT; continue ());
<COMMENT>"*)" => (YYBEGIN INITIAL; continue ());
<COMMENT>\n|. => (continue ());
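The idea of switching between named lexers can be sketched as a state variable that selects which rule list is active. This is a Python illustration, not ML-Lex. The rule tables, action names and the function words_outside_comments are assumptions. Within a state it takes the first matching rule rather than the longest match, and it does not handle nested comments.

```python
import re

# One rule list per lexer state; each rule is (pattern, action name).
RULES = {
    "INITIAL": [
        (re.compile(r"\(\*"), "enter_comment"),   # "(*" switches to COMMENT
        (re.compile(r"[a-z]+"), "word"),
        (re.compile(r"\s+"), "skip"),
    ],
    "COMMENT": [
        (re.compile(r"\*\)"), "leave_comment"),   # "*)" switches back to INITIAL
        (re.compile(r".", re.DOTALL), "skip"),    # any character, including \n
    ],
}


def words_outside_comments(text):
    """Collect the lowercase words that appear outside (* ... *) comments."""
    state = "INITIAL"
    pos = 0
    words = []
    while pos < len(text):
        for pattern, action in RULES[state]:
            m = pattern.match(text, pos)
            if m:
                break
        else:
            raise ValueError(f"no rule matches in state {state} at position {pos}")
        if action == "enter_comment":
            state = "COMMENT"       # plays the role of YYBEGIN COMMENT
        elif action == "leave_comment":
            state = "INITIAL"       # plays the role of YYBEGIN INITIAL
        elif action == "word":
            words.append(m.group())
        pos = m.end()
    return words


print(words_outside_comments("if (* hidden words *) then"))
# ['if', 'then']
```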
33
A (Marginally) More Exciting Lexer
type lexresult = string                        (* mandatory *)
fun eof () = (print "End of file\n"; "EOF")    (* mandatory *)
%%
%s COMMENT;
INT=[1-9][0-9]*;
%%
<INITIAL>if => ("IF");
<INITIAL>then => ("THEN");
<INITIAL>{INT} => ("INT(" ^ yytext ^ ")");
<INITIAL>"(*" => (YYBEGIN COMMENT; continue ());
<COMMENT>"*)" => (YYBEGIN INITIAL; continue ());
<COMMENT>\n|. => (continue ());
34
Implementing Lexers
  • By compiling, of course
  • convert REs into non-deterministic finite
    automata
  • convert non-deterministic finite automata into
    deterministic finite automata
  • convert deterministic finite automata into a
    blazingly fast table-driven algorithm
  • you did everything but possibly the last step in
    your favorite algorithms class
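The middle step, turning an NFA into a DFA, is usually done with the subset construction. The sketch below is in Python rather than ML and makes no claim about how ml-lex implements it. The NFA encoding (a dict from state to symbol to set of states, with "" standing for an epsilon move) and the example NFA for a*b are assumptions chosen for illustration.

```python
def epsilon_closure(nfa, states):
    """All NFA states reachable from `states` using only epsilon moves."""
    stack = list(states)
    closure = set(states)
    while stack:
        s = stack.pop()
        for t in nfa.get(s, {}).get("", ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def subset_construction(nfa, start, accepting, alphabet):
    """Build a DFA whose states are sets of NFA states."""
    start_set = epsilon_closure(nfa, {start})
    dfa = {}
    worklist = [start_set]
    while worklist:
        current = worklist.pop()
        if current in dfa:
            continue
        dfa[current] = {}
        for symbol in alphabet:
            moved = {t for s in current for t in nfa.get(s, {}).get(symbol, ())}
            if not moved:
                continue                      # no transition on this symbol
            target = epsilon_closure(nfa, moved)
            dfa[current][symbol] = target
            worklist.append(target)
    dfa_accepting = {state for state in dfa if state & accepting}
    return start_set, dfa, dfa_accepting


def dfa_accepts(start, dfa, accepting, text):
    """Run the DFA over text; reject as soon as no transition applies."""
    state = start
    for ch in text:
        state = dfa[state].get(ch)
        if state is None:
            return False
    return state in accepting


# Hypothetical NFA for a*b: state 0 has epsilon moves to 1 and 2,
# 1 --a--> 0, 2 --b--> 3, and 3 is the accepting state.
NFA = {0: {"": {1, 2}}, 1: {"a": {0}}, 2: {"b": {3}}}
START, DFA, ACCEPTING = subset_construction(NFA, 0, {3}, "ab")

print(len(DFA))                                    # 2
print(dfa_accepts(START, DFA, ACCEPTING, "aab"))   # True
print(dfa_accepts(START, DFA, ACCEPTING, "abb"))   # False
```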

35
Table-driven algorithm
  • DFA Table
  • Remember start position in character stream
  • Keep reading characters and moving from state to
    state until no transitions apply
  • An auxiliary table maps final states to the token
    type identified
  • yystring = input from start to current

[Figure: DFA diagram with states 1-4 and transitions on a, b, c, shown alongside its transition table]
36
Table-driven algorithm
  • DFA
  • Detail: how to deal with longest match?
  • when reading iffy should recognize iffy as
    ID, not if as keyword and then fy as ID

[Figure: DFA with states 1 and 2 and transitions labeled a-z]
37
Table-driven algorithm
  • DFA
  • Detail: how to deal with longest match?
  • save most recent final state seen and position in
    character string
  • when no more transition can be made, revert to
    last saved legal final state
  • see Appel 2.4 for more details

[Figure: DFA with states 1 and 2 and transitions labeled a-z]
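The save-and-revert scheme can be sketched with a transition table, an auxiliary table of final states, and a record of the last final state seen. This is a Python illustration, not code generated by ml-lex. The seven-state DFA (keyword if, identifiers, integers and reals), the FINAL table and the function names are assumptions made for the sketch.

```python
import string

# Hypothetical DFA: 1 = start, 2 = "i", 3 = "if", 4 = other identifier,
# 5 = integer, 6 = integer followed by "." (not final), 7 = real.
TABLE = {s: {} for s in range(1, 8)}
for ch in string.ascii_lowercase:
    TABLE[1][ch] = 4
    for s in (2, 3, 4):
        TABLE[s][ch] = 4
for ch in string.digits:
    for s in (2, 3, 4):
        TABLE[s][ch] = 4
    TABLE[1][ch] = 5
    TABLE[5][ch] = 5
    TABLE[6][ch] = 7
    TABLE[7][ch] = 7
TABLE[1]["i"] = 2
TABLE[2]["f"] = 3
TABLE[5]["."] = 6

# Auxiliary table: final state -> token type.
FINAL = {2: "ID", 3: "IF", 4: "ID", 5: "NUM", 7: "REAL"}


def next_token(text, start):
    """Return (token type, lexeme, end position) using longest match."""
    state = 1
    pos = start
    last_final = None                      # (token type, end position)
    while pos < len(text) and text[pos] in TABLE[state]:
        state = TABLE[state][text[pos]]
        pos += 1
        if state in FINAL:
            last_final = (FINAL[state], pos)   # remember most recent final state
    if last_final is None:
        raise ValueError(f"no token at position {start}")
    kind, end = last_final                 # revert to the last saved final state
    return kind, text[start:end], end


def tokenize(text):
    """Tokenize space-separated input with repeated calls to next_token."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos] == " ":
            pos += 1
            continue
        kind, lexeme, pos = next_token(text, pos)
        tokens.append((kind, lexeme))
    return tokens


print(tokenize("if iffy 42"))     # [('IF', 'if'), ('ID', 'iffy'), ('NUM', '42')]
print(next_token("12.x", 0))      # ('NUM', '12', 2)
```

In the last call the DFA reads past "12" into the non-final state 6 on ".", finds no transition on "x", and falls back to the saved NUM match ending at position 2.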
38
Summary
  • A Lexer
  • input: stream of characters
  • output: stream of tokens
  • Writing lexers by hand is boring, so we use a
    lexer generator: ml-lex
  • lexer generators work by converting REs through
    automata theory to efficient table-driven
    algorithms.
  • theory wins again.