Title: Lexical Analysis and Scanning
1Lexical Analysis and Scanning
- Honors Compilers
- Feb 5th 2001
- Robert Dewar
2The Input
- Read string input
- Might be sequence of characters (Unix)
- Might be sequence of lines (VMS)
- Character set
- ASCII
- ISO Latin-1
- ISO 10646 (16-bit Unicode)
- Others (EBCDIC, JIS, etc)
3The Output
- A series of tokens
- Punctuation ( ) ,
- Operators + - * /
- Keywords begin end if
- Identifiers Square_Root
- String literals "hello this is a string"
- Character literals 'x'
- Numeric literals 123 4_5.23e2 16#ac#
4Free form vs Fixed form
- Free form languages
- White space does not matter
- Tabs, spaces, new lines, carriage returns
- Only the ordering of tokens is important
- Fixed format languages
- Layout is critical
- Fortran, label in cols 1-6
- COBOL, area A B
- Lexical analyzer must worry about layout
5Punctuation
- Typically individual special characters
- Such as ( ) , ;
- Lexical analyzer does not know one from another
- Sometimes double characters
- E.g. << treated as a kind of bracket
- Returned just as identity of token
- And perhaps location
- For error message and debugging purposes
6Operators
- Like punctuation
- No real difference for lexical analyzer
- Typically single or double special chars
- Operators, e.g. + - * /
- Operations, e.g. ** :=
- Returned just as identity of token
- And perhaps location
7Keywords
- Reserved identifiers
- E.g. BEGIN END in Pascal, if in C
- Maybe distinguished from identifiers
- E.g. the bold word mode vs the identifier mode in Algol-68 (stropping)
- Returned just as token identity
- With possible location information
- Unreserved keywords (e.g. PL/1)
- Handled as identifiers (parser distinguishes)
8Identifiers
- Rules differ
- Length, allowed characters, separators
- Need to build table
- So that every occurrence of junk1 is recognized as the same junk1
- Typical structure hash table
- Lexical analyzer returns token type
- And key to table entry
- Table entry includes location information
9More on Identifier Tables
- Most common structure is hash table
- With fixed number of headers
- Chain according to hash code
- Serial search on one chain
- Hash code computed from characters
- No hash code is perfect!
- Avoid any arbitrary limits
10String Literals
- Text must be stored
- Actual characters are important
- Not like identifiers
- Character set issues
- Table needed
- Lexical analyzer returns key to table
- May or may not be worth hashing
11Character Literals
- Similar issues to string literals
- Lexical Analyzer returns
- Token type
- Identity of character
- Note, cannot assume character set of host machine, may be different
12Numeric Literals
- Also need a table
- Typically record value
- E.g. 123 = 0123 = 01_23 (Ada)
- But cannot use int for values
- Because target integers may have different characteristics than the host's
- Float stuff much more complex
- Denormals, correct rounding
- Very delicate stuff
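The integer case can be sketched as below; the function name is mine, and a long accumulator is for illustration only (a real compiler would use arbitrary-precision arithmetic, precisely because host and target integers differ):

```c
/* Sketch: value of an Ada-style decimal integer literal,
   where underlines are separators with no effect on the value. */
long literal_value(const char *s)
{
    long v = 0;
    for (; *s; s++) {
        if (*s == '_')
            continue;              /* underline separates digits, contributes nothing */
        v = v * 10 + (*s - '0');   /* accumulate decimal digits */
    }
    return v;
}
```

So 123, 0123 and 01_23 all yield the same recorded value, 123.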
13Handling Comments
- Comments have no effect on program
- Can therefore be eliminated by scanner
- But may need to be retrieved by tools
- Error detection issues
- E.g. unclosed comments
- Scanner does not return comments
14Case Equivalence
- Some languages have case equivalence
- Pascal, Ada
- Some do not
- C, Java
- Lexical analyzer ignores case if needed
- This_Routine = THIS_RouTine
- Error analysis may need exact casing
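A sketch of case-insensitive identifier comparison in C (the function name is illustrative); the original spelling would still be stored in the table for error messages:

```c
#include <ctype.h>

/* Compare two identifiers ignoring case, as a case-equivalent
   language like Pascal or Ada requires. */
int same_ident(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;              /* differ at this position */
        a++;
        b++;
    }
    return *a == *b;               /* both ended: equal lengths */
}
```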
15Issues to Address
- Speed
- Lexical analysis can take a lot of time
- Minimize processing per character
- I/O is also an issue (read large blocks)
- We compile frequently
- Compilation time is important
- Especially during development
16General Approach
- Define set of token codes
- An enumeration type
- A series of integer definitions
- These are just codes (no semantics)
- Some codes associated with data
- E.g. key for identifier table
- May be useful to build tree node
- For identifiers, literals etc
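One possible shape for such token codes in C, with some codes carrying associated data; all names here are illustrative assumptions:

```c
/* Token codes are just an enumeration: no semantics attached. */
enum token_code {
    TOK_LPAREN, TOK_RPAREN, TOK_COMMA,   /* punctuation */
    TOK_PLUS, TOK_MINUS,                 /* operators */
    TOK_BEGIN, TOK_END, TOK_IF,          /* keywords */
    TOK_IDENT,                           /* carries identifier-table key */
    TOK_INT_LIT, TOK_STRING_LIT,         /* carry literal-table keys */
    TOK_EOF
};

/* A token as handed to the parser: code, optional table key,
   and location for error messages and debugging. */
struct token {
    enum token_code code;
    int key;          /* key into identifier/literal table, if any */
    int line, col;    /* source location */
};
```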
17Interface to Lexical Analyzer
- Convert entire file to a file of tokens
- Lexical analyzer is separate phase
- Parser calls lexical analyzer
- Get next token
- This approach avoids extra I/O
- Parser builds tree as we go along
18Implementation of Scanner
- Given the input text
- Generate the required tokens
- Or provide token by token on demand
- Before we describe implementations
- We take this short break
- To describe relevant formalisms
19Relevant Formalisms
- Type 3 (Regular) Grammars
- Regular Expressions
- Finite State Machines
20Regular Grammars
- Regular grammars
- Non-terminals (arbitrary names)
- Terminals (characters)
- Two forms of rules
- Non-terminal → terminal
- Non-terminal → terminal Non-terminal
- One non-terminal is the start symbol
- Regular (type 3) grammars cannot count
- No concept of matching nested parens
21Regular Grammars
- Regular grammars
- E.g. grammar of reals with no exponent
- REAL → 0 REAL1 (repeat for 1 .. 9)
- REAL1 → 0 REAL1 (repeat for 1 .. 9)
- REAL1 → . INTEGER
- INTEGER → 0 INTEGER (repeat for 1 .. 9)
- INTEGER → 0 (repeat for 1 .. 9)
- Start symbol is REAL
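The grammar above can be transcribed directly into a recognizer, one state per non-terminal; this C sketch adds one accepting state (all names are mine) to mark "at least one digit seen after the point":

```c
#include <ctype.h>

enum state { S_REAL, S_REAL1, S_INTEGER, S_ACCEPT, S_ERROR };

/* Recognize digits '.' digits, per the REAL grammar. */
int is_real(const char *s)
{
    enum state st = S_REAL;                       /* start symbol is REAL */
    for (; *s; s++) {
        switch (st) {
        case S_REAL:                              /* REAL -> digit REAL1 */
            st = isdigit((unsigned char)*s) ? S_REAL1 : S_ERROR;
            break;
        case S_REAL1:                             /* digit REAL1 | . INTEGER */
            if (isdigit((unsigned char)*s)) st = S_REAL1;
            else if (*s == '.')             st = S_INTEGER;
            else                            st = S_ERROR;
            break;
        case S_INTEGER:                           /* INTEGER -> digit INTEGER | digit */
        case S_ACCEPT:
            st = isdigit((unsigned char)*s) ? S_ACCEPT : S_ERROR;
            break;
        default:
            return 0;
        }
        if (st == S_ERROR)
            return 0;
    }
    return st == S_ACCEPT;                        /* must end after >= 1 fraction digit */
}
```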
22Regular Expressions
- Regular expressions (RE) defined by
- Any terminal character is an RE
- Alternation RE | RE
- Concatenation RE1 RE2
- Repetition RE* (zero or more REs)
- Language of REs = type 3 grammars
- Regular expressions are more convenient
23Specifying REs in Unix Tools
- Single characters a b c d \x
- Alternation [bcd] [b-z] a|b|c|d
- Match any character .
- Match sequence of characters x* y+
- Concatenation abc[d-q]
- Optional [0-9]+(\.[0-9]+)?
24Finite State Machines
- Languages and Automata
- A language is a set of strings
- An automaton is a machine
- That determines if a given string is in the language or not
- FSMs are automata that recognize regular languages (regular expressions)
25Definitions of FSM
- A set of labeled states
- Directed arcs labeled with character
- A state may be marked as terminal
- Transition from state S1 to S2
- If and only if arc from S1 to S2
- Labeled with next character (which is eaten)
- Recognized if ends up in terminal state
- One state is distinguished start state
26Building FSM from Grammar
- One state for each non-terminal
- A rule of the form
- Nont1 → terminal
- Generates transition from S1 to final state
- A rule of the form
- Nont1 → terminal Nont2
- Generates transition from S1 to S2
27Building FSMs from REs
- Every RE corresponds to a grammar
- For all regular expressions
- A natural translation to FSM exists
- We will not give details of algorithm here
28Non-Deterministic FSM
- A non-deterministic FSM
- Has at least one state
- With two arcs to two separate states
- Labeled with the same character
- Which way to go?
- Implementation requires backtracking
- Nasty!
29Deterministic FSM
- For all states S
- For all characters C
- There is either ONE or NO arcs
- From state S
- Labeled with character C
- Much easier to implement
- No backtracking!
30Dealing with ND FSM
- Construction naturally leads to ND FSM
- For example, consider FSM for
- [0-9]+ | [0-9]+\.[0-9]+
- (integer or real)
- We will naturally get a start state
- With two sets of 0-9 branches
- And thus non-deterministic
31Converting to Deterministic
- There is an algorithm for converting
- From any ND FSM
- To an equivalent deterministic FSM
- Algorithm is in the text book
- Example (given in terms of REs)
- [0-9]+ | [0-9]+\.[0-9]+
- [0-9]+(\.[0-9]+)?
32Implementing the Scanner
- Three methods
- Completely informal, just write code
- Define tokens using regular expressions
- Convert REs to ND finite state machine
- Convert ND FSM to deterministic FSM
- Program the FSM
- Use an automated program
- To achieve above three steps
33Ad Hoc Code (forget FSMs)
- Write normal hand code
- A procedure called Scan
- Normal coding techniques
- Basically scan over white space and comments till non-blank character found
- Base subsequent processing on character
- E.g. colon may be : or :=
- / may be operator or start of comment
- Return token found
- Write aggressive efficient code
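A sketch of this ad hoc style in C: skip white space, then dispatch on the first character of the token. The token names and the `:` vs `:=` example are illustrative:

```c
#include <ctype.h>

enum tok { T_IDENT, T_NUMBER, T_COLON, T_ASSIGN, T_EOF, T_ERROR };

/* Scan one token starting at *pp, advancing the pointer past it. */
enum tok scan(const char **pp)
{
    const char *p = *pp;
    while (*p == ' ' || *p == '\t' || *p == '\n')      /* skip white space */
        p++;
    if (*p == '\0') { *pp = p; return T_EOF; }
    if (isalpha((unsigned char)*p)) {                  /* identifier */
        while (isalnum((unsigned char)*p) || *p == '_')
            p++;
        *pp = p; return T_IDENT;
    }
    if (isdigit((unsigned char)*p)) {                  /* numeric literal */
        while (isdigit((unsigned char)*p))
            p++;
        *pp = p; return T_NUMBER;
    }
    if (*p == ':') {                                   /* ':' or ":=" */
        p++;
        if (*p == '=') { p++; *pp = p; return T_ASSIGN; }
        *pp = p; return T_COLON;
    }
    *pp = p + 1;                                       /* unknown character */
    return T_ERROR;
}
```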
34Using FSM Formalisms
- Start with regular grammar or RE
- Typically found in the language standard
- For example, for Ada
- Chapter 2. Lexical Elements
- digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
- decimal-literal ::= integer [. integer] [exponent]
- integer ::= digit {[underline] digit}
- exponent ::= E [+] integer | E - integer
35Using FSM formalisms, cont
- Given REs or grammar
- Convert to finite state machine
- Convert ND FSM to deterministic FSM
- Write a program to recognize
- Using the deterministic FSM
36Implementing FSM (Method 1)
- Each state is code of the form
- <<state1>>
    case Next_Character is
      when 'a' => goto state3;
      when 'b' => goto state1;
      when others => End_of_token_processing;
    end case;
- <<state2>> ...
37Implementing FSM (Method 2)
- There is a variable called State
- loop
    case State is
      when state1 =>
        case Next_Character is
          when 'a' => State := state3;
          when 'b' => State := state1;
          when others => End_token_processing;
        end case;
      when state2 => ...
    end case;
  end loop;
38Implementing FSM (Method 3)
- T : array (State, Character) of State;
  while More_Input loop
    Curstate := T (Curstate, Next_Char);
    if Curstate = Error_State then ...
  end loop;
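Method 3 filled in as a C sketch for the deterministic FSM of the earlier integer-or-real example, [0-9]+(\.[0-9]+)?; the state names and table layout are illustrative:

```c
enum { S_START, S_INT, S_DOT, S_FRAC, S_ERR, NSTATES };

static int T[NSTATES][256];                    /* T(state, character) -> state */
static const int accepting[NSTATES] = { 0, 1, 0, 1, 0 };

/* Fill the transition table; "no arc" goes to the error state. */
static void build_table(void)
{
    int s, c;
    for (s = 0; s < NSTATES; s++)
        for (c = 0; c < 256; c++)
            T[s][c] = S_ERR;
    for (c = '0'; c <= '9'; c++) {
        T[S_START][c] = S_INT;                 /* first digit */
        T[S_INT][c]   = S_INT;                 /* more integer digits */
        T[S_DOT][c]   = S_FRAC;                /* first fraction digit */
        T[S_FRAC][c]  = S_FRAC;                /* more fraction digits */
    }
    T[S_INT]['.'] = S_DOT;                     /* optional fraction starts */
}

/* Drive the table over the whole string: Curstate := T(Curstate, Next_Char). */
int recognize(const char *s)
{
    int st = S_START;
    build_table();
    for (; *s; s++)
        st = T[st][(unsigned char)*s];
    return accepting[st];
}
```

Note that minimal processing happens per character: one array index, no tests, which is the point of the table-driven method.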
39Automatic FSM Generation
- Our example, FLEX
- See home page for manual in HTML
- FLEX is given
- A set of regular expressions
- Actions associated with each RE
- It builds a scanner
- Which matches REs and executes actions
40Flex General Format
- Input to Flex is a set of rules
- Regexp actions (C statements)
- Regexp actions (C statements)
-
- Flex scans the longest matching Regexp
- And executes the corresponding actions
41An Example of a Flex scanner
- DIGIT [0-9]
- ID [a-z][a-z0-9]*
- %%
- {DIGIT}+ printf ("an integer: %s (%d)\n", yytext, atoi (yytext));
- {DIGIT}+"."{DIGIT}* printf ("a float: %s (%g)\n", yytext, atof (yytext));
- if|then|begin|end|procedure|function printf ("a keyword: %s\n", yytext);
42Flex Example (continued)
- {ID} printf ("an identifier: %s\n", yytext);
- "+"|"-"|"*"|"/" printf ("an operator: %s\n", yytext);
- "--".*\n /* eat Ada style comment */
- [ \t\n]+ /* eat white space */
- . printf ("unrecognized character");
43Assembling the flex program
- %{
- #include <math.h> /* for atof */
- %}
- <<flex text we gave goes here>>
- %%
- main (argc, argv)
- int argc;
- char **argv;
- {
-   yyin = fopen (argv[1], "r");
-   yylex ();
- }
44Running flex
- flex is a program that is executed
- The input is as we have given
- The output is a running C program
- For Ada fans
- Look at aflex (www.adapower.com)
- For C++ fans
- flex can run in C++ mode
- Generates appropriate classes
45Choice Between Methods?
- Hand written scanners
- Typically much faster execution
- And pretty easy to write
- And easier to do good error recovery
- Flex approach
- Simple to use
- Easy to modify token language
46The GNAT Scanner
- Hand written (scn.adb/scn.ads)
- Basically a call to Scan does
- Super quick scan past blanks/comments etc
- Big case statement
- Process based on first character
- Call special routines
- Namet.Get_Name for identifier (hashing)
- Keywords recognized by special hash
- Strings (stringt.ads)
- Integers (uintp.ads)
- Reals (ureal.ads)
47More on the GNAT Scanner
- Entire source read into memory
- Single contiguous block
- Source location is index into this block
- Different index range for each source file
- See sinput.adb/ads for source mgmt
- See scans.ads for definitions of tokens
48More on GNAT Scanner
- Read scn.adb code
- Very easy reading
49ASSIGNMENT TWO
- Write a flex or aflex program
- Recognize tokens of an Algol-68S program
- Print out tokens in style of flex example
- Extra credit
- Build hash table for identifiers
- Output hash table key
50Preprocessors
- Some languages allow preprocessing
- This is a separate step
- Input is source
- Output is expanded source
- Can either be done as separate phase
- Or embedded into the lexical analyzer
- Often done as separate phase
- Need to keep track of source locations
51Nasty Glitches
- Separation of tokens
- Not all languages have clear rules
- FORTRAN has optional spaces
- DO10I=1.6
- identifier operator literal
- DO10I = 1.6
- DO10I=1,6
- Keyword stmt loopvar operator literal punc literal
- DO 10 I = 1 , 6
- Modern languages avoid this kind of thing!