Title: COS 320 Compilers
Outline
- Last Week
  - Introduction to ML
- Today
  - Lexical Analysis
- Reading: Chapter 2 of Appel
The Front End
- Lexical Analysis: create a sequence of tokens from a stream of characters
- Syntax Analysis: create an abstract syntax tree from the sequence of tokens
- Type Checking: check the program for well-formedness constraints

[Pipeline diagram: stream of characters -> Lexer -> stream of tokens -> Parser -> abstract syntax -> Type Checker]
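The pipeline can be read as a pair of function types. The SML signature below is an illustrative sketch; the type and function names are placeholders, not from the course code:

    signature FRONT_END =
    sig
      type token
      type ast
      val lex       : string -> token list   (* lexical analysis *)
      val parse     : token list -> ast      (* syntax analysis *)
      val typecheck : ast -> ast             (* type checking *)
    end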
Lexical Analysis
- Lexical Analysis: breaks the stream of ASCII characters (the source) into tokens
- Token: an atomic unit of program syntax
  - i.e., a word as opposed to a sentence
- Tokens and their types (a corresponding ML datatype is sketched below):

    Type     Characters Recognized      Token
    ID       foo, x, listcount          ID(foo), ID(x), ...
    REAL     10.45, 3.14, -2.1          REAL(10.45), REAL(3.14), ...
    SEMI     ;                          SEMI
    LPAREN   (                          LPAREN
    NUM      50, 100                    NUM(50), NUM(100)
    IF       if                         IF
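A token type like the one in this table might be declared in ML as a datatype; this is a sketch, not the course's actual definition:

    datatype token
      = ID of string     (* identifiers: foo, x, listcount *)
      | REAL of real     (* real literals: 10.45, 3.14, -2.1 *)
      | SEMI             (* ; *)
      | LPAREN           (* ( *)
      | NUM of int       (* integer literals: 50, 100 *)
      | IF               (* the keyword if *)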
Lexical Analysis Example
- Source characters:

    x := ( y + 4.0 ) ;

- The lexer consumes the characters left to right, emitting one token at a time:

    ID(x)  ASSIGN  LPAREN  ID(y)  PLUS  REAL(4.0)  RPAREN  SEMI
Lexer Implementation
- Implementation Options
  - Write a lexer from scratch
    - Boring, error-prone, and too much work
  - Use a lexer generator
    - Quick and easy. Good for lazy compiler writers.

[Diagram: a Lexer Specification is fed to the lexer generator, which produces a Lexer; the Lexer maps a stream of characters to a stream of tokens]
- How do we specify the lexer?
  - Develop another language
  - We'll use a language involving regular expressions to specify tokens
- What is a lexer generator?
  - Another compiler...
Some Definitions
- We will want to define the language of legal tokens our lexer can recognize
- Alphabet: a collection of symbols (ASCII is an alphabet)
- String: a finite sequence of symbols taken from our alphabet
- Language of legal tokens: a set of strings
  - Language of ML keywords: the set of all strings that are ML keywords (FINITE)
  - Language of ML tokens: the set of all strings that map to ML tokens (INFINITE)
- Some people use the word "language" to mean more general sets
  - e.g., the ML language: the set of all strings representing correct ML programs (INFINITE)
Regular Expressions: Construction
- Base Cases
  - For each symbol a in the alphabet, a is an RE denoting the set {a}
  - Epsilon (ε) denotes {""}, the set containing only the empty string
- Inductive Cases (M and N are REs)
  - Alternation (M | N) denotes the strings in M or in N
    - (a | b) = {a, b}
  - Concatenation (M N) denotes strings in M concatenated with strings in N
    - (a | b)(a | c) = {aa, ac, ba, bc}
  - Kleene closure (M*) denotes strings formed by any number of repetitions of strings in M
    - (a | b)* = {ε, a, b, aa, ab, ba, bb, ...}
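These constructors map directly onto an ML datatype. The following sketch uses names of my own choosing, not the course's:

    datatype regexp
      = Sym of char             (* a single alphabet symbol *)
      | Eps                     (* epsilon: the empty string *)
      | Alt of regexp * regexp  (* M | N : alternation *)
      | Cat of regexp * regexp  (* M N   : concatenation *)
      | Star of regexp          (* M*    : Kleene closure *)

    (* (a | b)* as a value of this datatype *)
    val aOrBStar = Star (Alt (Sym #"a", Sym #"b"))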
Regular Expressions
- Integers begin with an optional minus sign and continue with a sequence of digits
- Regular Expression:

    (- | ε) (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9) (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)*

- Writing out (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9), and even worse (a | b | c | ...), gets tedious... (an encoding of this RE appears below)
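Using the regexp datatype sketched earlier, the integer expression could be written as follows (illustrative only):

    (* digit = (0 | 1 | ... | 9), built by folding Alt over the digits *)
    val digit =
      List.foldl (fn (c, r) => Alt (r, Sym c)) (Sym #"0")
                 (String.explode "123456789")

    (* (- | epsilon) digit digit* *)
    val integer = Cat (Alt (Sym #"-", Eps), Cat (digit, Star digit))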
Regular Expressions
- Common abbreviations:
  - [a-c]   stands for (a | b | c)
  - .       any character except \n
  - \n      the newline character
  - a+      one or more a's
  - a?      zero or one a
- All abbreviations can be defined in terms of the standard regular expressions (sketched below)
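For instance, with the regexp datatype from above, each abbreviation desugars to the core constructors. A sketch:

    (* a+  =  a a*          : one or more *)
    fun plus r = Cat (r, Star r)

    (* a?  =  (a | epsilon) : zero or one *)
    fun opt r = Alt (r, Eps)

    (* [a-c]  =  (a | b | c), for a contiguous character range *)
    fun range (lo, hi) =
      let
        val cs = List.tabulate (Char.ord hi - Char.ord lo + 1,
                                fn i => Char.chr (Char.ord lo + i))
      in
        List.foldl (fn (c, r) => Alt (r, Sym c)) (Sym lo) (List.tl cs)
      end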
Ambiguous Token Rule Sets
- A single regular expression is a completely unambiguous specification of a token.
- Sometimes, when we put together a set of regular expressions to specify all of the tokens in a language, ambiguities arise
  - i.e., two regular expressions match the same string
Ambiguous Token Rule Sets
- Example
  - Identifier tokens: [a-z]([a-z]|[0-9])*
  - Sample keyword tokens: if, then, ...
- How do we tokenize?
  - foobar → ID(foobar) or ID(foo) ID(bar)?
  - if → ID(if) or IF?
Ambiguous Token Rule Sets
- We resolve ambiguities using two rules:
  - Longest match: the regular expression that matches the longest string takes precedence.
  - Rule priority: the regular expressions identifying tokens are written down in sequence. If two regular expressions match the same (longest) string, the first regular expression in the sequence takes precedence.
- A sketch of this disambiguation scheme appears below.
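A minimal ML sketch of the two rules. The representation is mine, assumed for illustration: each token rule is a prefix-matching function that returns the length of its longest match (real generators use DFA tables, as later slides explain):

    (* A rule pairs a matcher with an action producing the token.  The
       matcher returns SOME n if the rule matches a prefix of length n. *)
    type 'tok rule = (string -> int option) * (string -> 'tok)

    (* Pick the winning rule: longest match first, earliest rule on ties. *)
    fun disambiguate (rules : 'tok rule list) (input : string) =
      List.foldl
        (fn ((matcher, action), best) =>
            case (matcher input, best) of
                (NONE, _) => best
              | (SOME n, NONE) => SOME (n, action)
              | (SOME n, SOME (n', _)) =>
                  if n > n'              (* longest match wins...        *)
                  then SOME (n, action)
                  else best)             (* ...earlier rule wins on ties *)
        NONE rules

    (* Sample matchers: a literal keyword and lowercase identifiers. *)
    fun kw s input =
      if String.isPrefix s input then SOME (String.size s) else NONE
    fun ident input =
      let fun count i =
            if i < String.size input andalso Char.isLower (String.sub (input, i))
            then count (i + 1) else i
          val n = count 0
      in if n = 0 then NONE else SOME n end

    val rules = [(kw "if", fn _ => "IF"), (ident, fn s => "ID(" ^ s ^ ")")]

    (* disambiguate rules "iffy" picks the ID rule (length 4 beats 2);
       disambiguate rules "if" picks IF (tie at 2, first rule wins).   *)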
Ambiguous Token Rule Sets
- Example
  - Identifier tokens: [a-z]([a-z]|[0-9])*
  - Sample keyword tokens: if, then, ...
- With the two rules, tokenization is unambiguous:
  - foobar → ID(foobar), by longest match
  - if → IF, by rule priority
Lexer Implementation
- Implementation Options
  - Write a lexer from scratch
    - Boring and error-prone
  - Use a lexical analyzer generator
    - Quick and easy
- ml-lex is a lexical analyzer generator for ML.
- lex and flex are lexical analyzer generators for C.
ML-Lex Specification
- A lexical specification consists of three parts (shown schematically below):
  - User Declarations
  - ML-LEX Definitions
  - Rules
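In the input file, the three parts are separated by lines containing %%. Schematically (the complete examples on later slides fill in each part):

    ... user declarations: ML code, e.g. type lexresult, fun eof () ...
    %%
    ... ML-LEX definitions: RE abbreviations, %s lexer names ...
    %%
    ... rules: <lexer_list> pattern => (action); ...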
User Declarations
- User Declarations
  - The user can define various values that are available to the rule actions.
  - Two values must be defined in this section:
    - type lexresult
      - the type of the value returned by each rule action
    - fun eof ()
      - called by the lexer when the end of the input stream is reached
ML-LEX Definitions
- ML-LEX Definitions
  - The user can define regular expression abbreviations:

      DIGITS=[0-9]+;
      LETTER=[a-zA-Z];

  - Multiple lexers can be defined to work together. Each is given a unique name:

      %s LEX1 LEX2 LEX3;
Rules
- Rules
  - A rule consists of a pattern and an action:
    - The pattern is a regular expression.
    - The action is a fragment of ordinary ML code.
  - Longest match and rule priority are used for disambiguation.
  - Rules may be prefixed with the list of lexers that are allowed to use the rule:

      <lexer_list> regular_expression => (action code);
Rules
- Rule actions can use any value defined in the User Declarations section, including:
  - type lexresult
    - the type of the value returned by each rule action
  - val eof : unit -> lexresult
    - called by the lexer when the end of the input stream is reached
- Special variables:
  - yytext: the input substring matched by the regular expression
  - yypos: the file position of the beginning of the matched string
  - continue (): recursively calls the lexer
A Simple Lexer

    datatype token = Num of int | Id of string | IF | THEN | ELSE | EOF
    type lexresult = token                  (* mandatory *)
    fun eof () = EOF                        (* mandatory *)
    fun itos s =
      case Int.fromString s of
          SOME x => x
        | NONE => raise Fail "not an int"
    %%
    NUM=[1-9][0-9]*;
    ID=[a-zA-Z]([a-zA-Z]|{NUM})*;
    %%
    if    => (IF);
    then  => (THEN);
    else  => (ELSE);
    {NUM} => (Num (itos yytext));
    {ID}  => (Id yytext);
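To use the generated lexer, one typically instantiates it with an input function. This driver is a sketch: it assumes ml-lex's default generated structure name Mlex and the token type from the specification above.

    (* Hypothetical driver: feed stdin to the generated lexer and
       pull tokens until EOF. *)
    val lexer = Mlex.makeLexer (fn n => TextIO.inputN (TextIO.stdIn, n))
    fun loop () =
      case lexer () of
          EOF => ()
        | _   => loop ()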
Using Multiple Lexers
- Rules prefixed with a lexer name are matched only when that lexer is executing.
- Enter a new lexer using the command YYBEGIN.
- The initial lexer is called INITIAL.
Using Multiple Lexers

    type lexresult = unit                   (* mandatory *)
    fun eof () = ()                         (* mandatory *)
    %%
    %s COMMENT;
    %%
    <INITIAL> if     => (());
    <INITIAL> [a-z]+ => (());
    <INITIAL> "(*"   => (YYBEGIN COMMENT; continue ());
    <COMMENT> "*)"   => (YYBEGIN INITIAL; continue ());
    <COMMENT> \n|.   => (continue ());
A (Marginally) More Exciting Lexer

    type lexresult = string                         (* mandatory *)
    fun eof () = (print "End of file\n"; "EOF")     (* mandatory *)
    %%
    %s COMMENT;
    INT=[1-9][0-9]*;
    %%
    <INITIAL> if    => ("IF");
    <INITIAL> then  => ("THEN");
    <INITIAL> {INT} => ("INT(" ^ yytext ^ ")");
    <INITIAL> "(*"  => (YYBEGIN COMMENT; continue ());
    <COMMENT> "*)"  => (YYBEGIN INITIAL; continue ());
    <COMMENT> \n|.  => (continue ());
Implementing Lexers
- By compiling, of course:
  - convert REs into nondeterministic finite automata (NFAs)
  - convert NFAs into deterministic finite automata (DFAs)
  - convert DFAs into a blazingly fast table-driven algorithm
- You did everything but possibly the last step in your favorite algorithms class.
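The overall shape of the generator can be written as a signature. The names below are hypothetical placeholders, though the algorithms named in the comments (Thompson construction, subset construction) are the standard ones:

    signature LEXER_GEN =
    sig
      type regexp   (* e.g., the regexp datatype sketched earlier *)
      type nfa
      type dfa
      type table
      val reToNfa    : regexp -> nfa    (* Thompson construction *)
      val nfaToDfa   : nfa -> dfa       (* subset construction *)
      val dfaToTable : dfa -> table     (* encode transitions as a table *)
    end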
Table-Driven Algorithm
- DFA Table
  - Remember the start position in the character stream.
  - Keep reading characters and moving from state to state until no transition applies.
  - An auxiliary table maps final states to the token type identified; the matched text is the input from the start position to the current position.
[DFA transition table and state diagram: states 1-4 with transitions on the symbols a, b, and c]
Table-Driven Algorithm
- DFA
  - Detail: how do we deal with longest match?
  - When reading iffy, the lexer should recognize iffy as an ID, not if as a keyword followed by fy as an ID.

[Two-state DFA for identifiers: state 1 steps to state 2 on a-z; state 2 loops on a-z]
Table-Driven Algorithm
- DFA
  - Save the most recent final state seen and the corresponding position in the character stream.
  - When no more transitions can be made, revert to the last saved final state.
  - See Appel, Section 2.4, for more details; a sketch of the loop follows.

[Two-state DFA for identifiers: state 1 steps to state 2 on a-z; state 2 loops on a-z]
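As an illustration of the last three slides (not Appel's code), here is a sketch of the table-driven loop with longest-match rollback. It assumes a transition function step encoding the DFA table (NONE = no transition) and a finality test final:

    (* Returns the end position of the longest match starting at
       `start`, if any, by remembering the last accepting position. *)
    fun longestMatch (step : int * char -> int option)
                     (final : int -> bool)
                     (input : string, start : int) : int option =
      let fun loop (state, i, last) =
            let val last' = if final state then SOME i else last
            in if i >= String.size input then last'
               else case step (state, String.sub (input, i)) of
                        NONE => last'                  (* stuck: roll back *)
                      | SOME s => loop (s, i + 1, last')
            end
      in loop (1, start, NONE)    (* state 1 is the start state here *)
      end

For the identifier DFA above (step (1, c) = step (2, c) = SOME 2 for lowercase c, NONE otherwise; final 2 = true), longestMatch on "iffy" from position 0 returns SOME 4, recognizing the whole identifier.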
Summary
- A Lexer:
  - input: a stream of characters
  - output: a stream of tokens
- Writing lexers by hand is boring, so we use a lexer generator: ml-lex.
- Lexer generators work by converting REs, via automata theory, into efficient table-driven algorithms.
- Theory wins again.