Title: COS 320 Compilers
Outline
- Last Week
  - Introduction to ML
- Today
  - Lexical Analysis
- Reading: Chapter 2 of Appel
The Front End
- Lexical Analysis: create a sequence of tokens from a stream of characters
- Syntax Analysis: create an abstract syntax tree from the sequence of tokens
- Type Checking: check the program for well-formedness constraints

[Pipeline diagram: stream of characters -> Lexer -> stream of tokens -> Parser -> abstract syntax -> Type Checker]
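The pipeline can be read as a pair of function types. The SML signature below is an illustrative sketch; the type and function names are placeholders, not from the course code:

    signature FRONT_END =
    sig
      type token
      type ast
      val lex       : string -> token list   (* lexical analysis *)
      val parse     : token list -> ast      (* syntax analysis *)
      val typecheck : ast -> ast             (* type checking *)
    end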
Lexical Analysis
- Lexical Analysis: breaks the stream of ASCII characters (the source) into tokens
- Token: an atomic unit of program syntax
  - i.e., a word as opposed to a sentence
- Tokens and their types (a corresponding ML datatype is sketched below):

    Type     Characters Recognized      Token
    ID       foo, x, listcount          ID(foo), ID(x), ...
    REAL     10.45, 3.14, -2.1          REAL(10.45), REAL(3.14), ...
    SEMI     ;                          SEMI
    LPAREN   (                          LPAREN
    NUM      50, 100                    NUM(50), NUM(100)
    IF       if                         IF
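A token type like the one in this table might be declared in ML as a datatype; this is a sketch, not the course's actual definition:

    datatype token
      = ID of string     (* identifiers: foo, x, listcount *)
      | REAL of real     (* real literals: 10.45, 3.14, -2.1 *)
      | SEMI             (* ; *)
      | LPAREN           (* ( *)
      | NUM of int       (* integer literals: 50, 100 *)
      | IF               (* the keyword if *)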
Lexical Analysis Example
- Source characters:

    x := ( y + 4.0 ) ;

- The lexer consumes the characters left to right, emitting one token at a time:

    ID(x)  ASSIGN  LPAREN  ID(y)  PLUS  REAL(4.0)  RPAREN  SEMI
Lexer Implementation
- Implementation Options
  - Write a lexer from scratch
    - Boring, error-prone, and too much work
  - Use a lexer generator
    - Quick and easy. Good for lazy compiler writers.

[Diagram: a Lexer Specification is fed to the lexer generator, which produces a Lexer; the Lexer maps a stream of characters to a stream of tokens]
- How do we specify the lexer?
  - Develop another language
  - We'll use a language involving regular expressions to specify tokens
- What is a lexer generator?
  - Another compiler...
Some Definitions
- We will want to define the language of legal tokens our lexer can recognize
- Alphabet: a collection of symbols (ASCII is an alphabet)
- String: a finite sequence of symbols taken from our alphabet
- Language of legal tokens: a set of strings
  - Language of ML keywords: the set of all strings that are ML keywords (FINITE)
  - Language of ML tokens: the set of all strings that map to ML tokens (INFINITE)
- Some people use the word "language" to mean more general sets
  - e.g., the ML language: the set of all strings representing correct ML programs (INFINITE)
Regular Expressions: Construction
- Base Cases
  - For each symbol a in the alphabet, a is an RE denoting the set {a}
  - Epsilon (ε) denotes {""}, the set containing only the empty string
- Inductive Cases (M and N are REs)
  - Alternation (M | N) denotes the strings in M or in N
    - (a | b) = {a, b}
  - Concatenation (M N) denotes strings in M concatenated with strings in N
    - (a | b)(a | c) = {aa, ac, ba, bc}
  - Kleene closure (M*) denotes strings formed by any number of repetitions of strings in M
    - (a | b)* = {ε, a, b, aa, ab, ba, bb, ...}
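These constructors map directly onto an ML datatype. The following sketch uses names of my own choosing, not the course's:

    datatype regexp
      = Sym of char             (* a single alphabet symbol *)
      | Eps                     (* epsilon: the empty string *)
      | Alt of regexp * regexp  (* M | N : alternation *)
      | Cat of regexp * regexp  (* M N   : concatenation *)
      | Star of regexp          (* M*    : Kleene closure *)

    (* (a | b)* as a value of this datatype *)
    val aOrBStar = Star (Alt (Sym #"a", Sym #"b"))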
Regular Expressions
- Integers begin with an optional minus sign and continue with a sequence of digits
- Regular Expression:

    (- | ε) (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9) (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)*

- Writing out (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9), and even worse (a | b | c | ...), gets tedious... (an encoding of this RE appears below)
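Using the regexp datatype sketched earlier, the integer expression could be written as follows (illustrative only):

    (* digit = (0 | 1 | ... | 9), built by folding Alt over the digits *)
    val digit =
      List.foldl (fn (c, r) => Alt (r, Sym c)) (Sym #"0")
                 (String.explode "123456789")

    (* (- | epsilon) digit digit* *)
    val integer = Cat (Alt (Sym #"-", Eps), Cat (digit, Star digit))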
Regular Expressions
- Common abbreviations:
  - [a-c]   stands for (a | b | c)
  - .       any character except \n
  - \n      the newline character
  - a+      one or more a's
  - a?      zero or one a
- All abbreviations can be defined in terms of the standard regular expressions (sketched below)
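For instance, with the regexp datatype from above, each abbreviation desugars to the core constructors. A sketch:

    (* a+  =  a a*          : one or more *)
    fun plus r = Cat (r, Star r)

    (* a?  =  (a | epsilon) : zero or one *)
    fun opt r = Alt (r, Eps)

    (* [a-c]  =  (a | b | c), for a contiguous character range *)
    fun range (lo, hi) =
      let
        val cs = List.tabulate (Char.ord hi - Char.ord lo + 1,
                                fn i => Char.chr (Char.ord lo + i))
      in
        List.foldl (fn (c, r) => Alt (r, Sym c)) (Sym lo) (List.tl cs)
      end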
Ambiguous Token Rule Sets
- A single regular expression is a completely unambiguous specification of a token.
- Sometimes, when we put together a set of regular expressions to specify all of the tokens in a language, ambiguities arise
  - i.e., two regular expressions match the same string
Ambiguous Token Rule Sets
- Example
  - Identifier tokens: [a-z]([a-z]|[0-9])*
  - Sample keyword tokens: if, then, ...
- How do we tokenize?
  - foobar → ID(foobar) or ID(foo) ID(bar)?
  - if → ID(if) or IF?
Ambiguous Token Rule Sets
- We resolve ambiguities using two rules:
  - Longest match: the regular expression that matches the longest string takes precedence.
  - Rule priority: the regular expressions identifying tokens are written down in sequence. If two regular expressions match the same (longest) string, the first regular expression in the sequence takes precedence.
- A sketch of this disambiguation scheme appears below.
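A minimal ML sketch of the two rules. The representation is mine, assumed for illustration: each token rule is a prefix-matching function that returns the length of its longest match (real generators use DFA tables, as later slides explain):

    (* A rule pairs a matcher with an action producing the token.  The
       matcher returns SOME n if the rule matches a prefix of length n. *)
    type 'tok rule = (string -> int option) * (string -> 'tok)

    (* Pick the winning rule: longest match first, earliest rule on ties. *)
    fun disambiguate (rules : 'tok rule list) (input : string) =
      List.foldl
        (fn ((matcher, action), best) =>
            case (matcher input, best) of
                (NONE, _) => best
              | (SOME n, NONE) => SOME (n, action)
              | (SOME n, SOME (n', _)) =>
                  if n > n'              (* longest match wins...        *)
                  then SOME (n, action)
                  else best)             (* ...earlier rule wins on ties *)
        NONE rules

    (* Sample matchers: a literal keyword and lowercase identifiers. *)
    fun kw s input =
      if String.isPrefix s input then SOME (String.size s) else NONE
    fun ident input =
      let fun count i =
            if i < String.size input andalso Char.isLower (String.sub (input, i))
            then count (i + 1) else i
          val n = count 0
      in if n = 0 then NONE else SOME n end

    val rules = [(kw "if", fn _ => "IF"), (ident, fn s => "ID(" ^ s ^ ")")]

    (* disambiguate rules "iffy" picks the ID rule (length 4 beats 2);
       disambiguate rules "if" picks IF (tie at 2, first rule wins).   *)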
Ambiguous Token Rule Sets
- Example
  - Identifier tokens: [a-z]([a-z]|[0-9])*
  - Sample keyword tokens: if, then, ...
- With the two rules, tokenization is unambiguous:
  - foobar → ID(foobar), by longest match
  - if → IF, by rule priority
Lexer Implementation
- Implementation Options
  - Write a lexer from scratch
    - Boring and error-prone
  - Use a lexical analyzer generator
    - Quick and easy
- ml-lex is a lexical analyzer generator for ML.
- lex and flex are lexical analyzer generators for C.
ML-Lex Specification
- A lexical specification consists of three parts (shown schematically below):
  - User Declarations
  - ML-LEX Definitions
  - Rules
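In the input file, the three parts are separated by lines containing %%. Schematically (the complete examples on later slides fill in each part):

    ... user declarations: ML code, e.g. type lexresult, fun eof () ...
    %%
    ... ML-LEX definitions: RE abbreviations, %s lexer names ...
    %%
    ... rules: <lexer_list> pattern => (action); ...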
User Declarations
- User Declarations
  - The user can define various values that are available to the rule actions.
  - Two values must be defined in this section:
    - type lexresult
      - the type of the value returned by each rule action
    - fun eof ()
      - called by the lexer when the end of the input stream is reached
ML-LEX Definitions
- ML-LEX Definitions
  - The user can define regular expression abbreviations:

      DIGITS=[0-9]+;
      LETTER=[a-zA-Z];

  - Multiple lexers can be defined to work together. Each is given a unique name:

      %s LEX1 LEX2 LEX3;
Rules
- Rules
  - A rule consists of a pattern and an action:
    - The pattern is a regular expression.
    - The action is a fragment of ordinary ML code.
  - Longest match and rule priority are used for disambiguation.
  - Rules may be prefixed with the list of lexers that are allowed to use the rule:

      <lexer_list> regular_expression => (action code);
Rules
- Rule actions can use any value defined in the User Declarations section, including:
  - type lexresult
    - the type of the value returned by each rule action
  - val eof : unit -> lexresult
    - called by the lexer when the end of the input stream is reached
- Special variables:
  - yytext: the input substring matched by the regular expression
  - yypos: the file position of the beginning of the matched string
  - continue (): recursively calls the lexer
A Simple Lexer

    datatype token = Num of int | Id of string | IF | THEN | ELSE | EOF
    type lexresult = token                  (* mandatory *)
    fun eof () = EOF                        (* mandatory *)
    fun itos s =
      case Int.fromString s of
          SOME x => x
        | NONE => raise Fail "not an int"
    %%
    NUM=[1-9][0-9]*;
    ID=[a-zA-Z]([a-zA-Z]|{NUM})*;
    %%
    if    => (IF);
    then  => (THEN);
    else  => (ELSE);
    {NUM} => (Num (itos yytext));
    {ID}  => (Id yytext);
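To use the generated lexer, one typically instantiates it with an input function. This driver is a sketch: it assumes ml-lex's default generated structure name Mlex and the token type from the specification above.

    (* Hypothetical driver: feed stdin to the generated lexer and
       pull tokens until EOF. *)
    val lexer = Mlex.makeLexer (fn n => TextIO.inputN (TextIO.stdIn, n))
    fun loop () =
      case lexer () of
          EOF => ()
        | _   => loop ()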
Using Multiple Lexers
- Rules prefixed with a lexer name are matched only when that lexer is executing.
- Enter a new lexer using the command YYBEGIN.
- The initial lexer is called INITIAL.
Using Multiple Lexers

    type lexresult = unit                   (* mandatory *)
    fun eof () = ()                         (* mandatory *)
    %%
    %s COMMENT;
    %%
    <INITIAL> if     => (());
    <INITIAL> [a-z]+ => (());
    <INITIAL> "(*"   => (YYBEGIN COMMENT; continue ());
    <COMMENT> "*)"   => (YYBEGIN INITIAL; continue ());
    <COMMENT> \n|.   => (continue ());
A (Marginally) More Exciting Lexer

    type lexresult = string                         (* mandatory *)
    fun eof () = (print "End of file\n"; "EOF")     (* mandatory *)
    %%
    %s COMMENT;
    INT=[1-9][0-9]*;
    %%
    <INITIAL> if    => ("IF");
    <INITIAL> then  => ("THEN");
    <INITIAL> {INT} => ("INT(" ^ yytext ^ ")");
    <INITIAL> "(*"  => (YYBEGIN COMMENT; continue ());
    <COMMENT> "*)"  => (YYBEGIN INITIAL; continue ());
    <COMMENT> \n|.  => (continue ());
Implementing Lexers
- By compiling, of course:
  - convert REs into nondeterministic finite automata (NFAs)
  - convert NFAs into deterministic finite automata (DFAs)
  - convert DFAs into a blazingly fast table-driven algorithm
- You did everything but possibly the last step in your favorite algorithms class.
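The overall shape of the generator can be written as a signature. The names below are hypothetical placeholders, though the algorithms named in the comments (Thompson construction, subset construction) are the standard ones:

    signature LEXER_GEN =
    sig
      type regexp   (* e.g., the regexp datatype sketched earlier *)
      type nfa
      type dfa
      type table
      val reToNfa    : regexp -> nfa    (* Thompson construction *)
      val nfaToDfa   : nfa -> dfa       (* subset construction *)
      val dfaToTable : dfa -> table     (* encode transitions as a table *)
    end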
Table-Driven Algorithm
- DFA Table
  - Remember the start position in the character stream.
  - Keep reading characters and moving from state to state until no transition applies.
  - An auxiliary table maps final states to the token type identified; the matched text is the input from the start position to the current position.
[DFA transition table and state diagram: states 1-4 with transitions on the symbols a, b, and c]
Table-Driven Algorithm
- DFA
  - Detail: how do we deal with longest match?
  - When reading iffy, the lexer should recognize iffy as an ID, not if as a keyword followed by fy as an ID.

[Two-state DFA for identifiers: state 1 steps to state 2 on a-z; state 2 loops on a-z]
Table-Driven Algorithm
- DFA
  - Save the most recent final state seen and the corresponding position in the character stream.
  - When no more transitions can be made, revert to the last saved final state.
  - See Appel, Section 2.4, for more details; a sketch of the loop follows.

[Two-state DFA for identifiers: state 1 steps to state 2 on a-z; state 2 loops on a-z]
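As an illustration of the last three slides (not Appel's code), here is a sketch of the table-driven loop with longest-match rollback. It assumes a transition function step encoding the DFA table (NONE = no transition) and a finality test final:

    (* Returns the end position of the longest match starting at
       `start`, if any, by remembering the last accepting position. *)
    fun longestMatch (step : int * char -> int option)
                     (final : int -> bool)
                     (input : string, start : int) : int option =
      let fun loop (state, i, last) =
            let val last' = if final state then SOME i else last
            in if i >= String.size input then last'
               else case step (state, String.sub (input, i)) of
                        NONE => last'                  (* stuck: roll back *)
                      | SOME s => loop (s, i + 1, last')
            end
      in loop (1, start, NONE)    (* state 1 is the start state here *)
      end

For the identifier DFA above (step (1, c) = step (2, c) = SOME 2 for lowercase c, NONE otherwise; final 2 = true), longestMatch on "iffy" from position 0 returns SOME 4, recognizing the whole identifier.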
Summary
- A Lexer:
  - input: a stream of characters
  - output: a stream of tokens
- Writing lexers by hand is boring, so we use a lexer generator: ml-lex.
- Lexer generators work by converting REs, via automata theory, into efficient table-driven algorithms.
- Theory wins again.