Title: LEXICAL ANALYSIS
1. LEXICAL ANALYSIS
2. First Step in Compilation

Source code (character stream)
  ↓ Lexical analysis
Token stream
  ↓ Parsing
Abstract syntax tree
  ↓ Intermediate Code Generation
Intermediate code
  ↓ Code Generation
Assembly code
3. Lexical Analysis

Source code (character stream): if (b == 0) a = "hi";
  ↓ Lexical analysis
Token stream
  ↓ Parsing
  ↓ Semantic Analysis
4. A Closer Look

- Lexical analysis converts a character stream to a token stream of pairs <token type, value> (modeled in the C sketch below)

if (x1 * x2 < 1.0) { ...

KEY(if) LPAREN ID(x1) OP(*) ID(x2) RELOP(<) NUM(1.0) RPAREN LBRACE ...
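A token pair like these can be modeled in C as a small tagged record; a minimal sketch (the enum members and union layout are illustrative assumptions, anticipating the enum/union representation discussed under Identifier Tokens later):

    #include <stdio.h>

    /* Token types from the example above (illustrative subset) */
    typedef enum { KEY, ID, OP, RELOP, NUM, LPAREN, RPAREN, LBRACE } TokenType;

    typedef struct {
        TokenType type;
        union {                    /* the value depends on the token type */
            const char *lexeme;    /* for KEY, ID */
            double      num;       /* for NUM */
            char        op;        /* for OP, RELOP */
        } value;
    } Token;

    int main(void) {
        Token t = { ID, { .lexeme = "x1" } };
        printf("type %d, value %s\n", t.type, t.value.lexeme);
        return 0;
    }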
5. Why a Separate Lexical Analysis Phase?

- Programs could be made directly from characters, and parse trees would go down to the character level
  - Machine specific, obfuscates parsing, cumbersome
- Lexical analysis is a firewall between program representation and parsing actions
- A prior lexical analysis phase obtains tokens consisting of a type and value
6.
grammar:
  <STMT> → IFKEY LPAREN <COND> RPAREN <STMT>
         | ID ASSIGNOP <EXPR> SEMI
  <COND> → <EXPR> RELOP <EXPR>
  <EXPR> → ID | CONSTANT

parse tree: (figure: <STMT> expands to IFKEY LPAREN <COND> RPAREN <STMT>; <COND> to <EXPR> RELOP <EXPR>; the inner <STMT> to ID ASSIGNOP <EXPR> SEMI; each <EXPR> to ID or CONSTANT)

- Parser groups tokens according to the grammar
- Lexical analyzer (phase 2) turns lexemes into tokens
- Lexical analyzer (phase 1) groups characters into lexemes
7. Lexical Analysis Terminology

- Token
  - Terminal symbol in a grammar
  - A class of character sequences with a collective meaning, e.g., IDENT
  - Constants, operators, punctuation, reserved words (keywords)
- Lexeme
  - The character sequence matched by an instance of the token, e.g., sqrt
8. Token Types

- Identifiers: x y11 elsex _i00
- Keywords: if else while
- Integers: 2 1000 -500 6663554
- Floating point: 2.0 0.00020 .02 1. 1e5 0.e-10
- Symbols: + - < > .. /
- Comments: /* don't change this */
9. Token Values

- Some token types have values associated with them, e.g., an identifier token carries its lexeme and a NUM token carries its numeric value
10. Lexical errors

- What if the user omits the space in "real f" (writing "realf")?
  - No lexical error: the single token IDENT(realf) is produced instead of the sequence REAL, IDENT(f)!
- Typically few lexical error types
  - illegal characters
  - unterminated comments
  - ill-formed constants
11. Issues

- How to describe tokens unambiguously
  - 2.e0 20.e-01 2.0000
- How to break text up into tokens
  - if (x == 0) a = x<<1;
  - iff (x == 0) a = x<1;
- How to write the lexer
12. How to Describe Tokens

- Programming language tokens can be described using regular expressions
- A regular expression R describes some set of strings L(R)
  - L(R) is the language defined by R
  - L(abc) = { "abc" }
  - L(hello | goodbye) = { "hello", "goodbye" }
  - L([1-9][0-9]*) = all positive integer constants
13. Define each kind of token using an RE

- Keywords and punctuation are easy
  - IF keyword: if
  - Left paren: (
- Identifiers and constants are a bit more complicated
  - Identifiers: letter (letter | digit)*
  - Constants: reals are more complicated
    - (+ | -)? digit+ (. digit+)?

N.B. extended (UNIX-like) RE syntax: ? stands for 0 or 1 occurrences
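One way to sanity-check the identifier RE is the POSIX regex library; a minimal C sketch, where the pattern string is an assumed translation of letter (letter | digit)* into POSIX syntax:

    #include <regex.h>
    #include <stdio.h>

    int main(void) {
        /* anchored POSIX ERE version of: letter (letter | digit)* */
        const char *pattern = "^[A-Za-z][A-Za-z0-9]*$";
        const char *samples[] = { "x1", "y11", "9lives" };
        regex_t re;
        regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);
        for (int i = 0; i < 3; i++)
            printf("%-8s %s\n", samples[i],
                   regexec(&re, samples[i], 0, NULL, 0) == 0
                       ? "identifier" : "no match");
        regfree(&re);
        return 0;
    }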
14. Implementing the Lexer

- The lexer is implemented with a finite state automaton that corresponds to the regular expressions describing the tokens
  - finite set of states
  - set of transitions between states
  - transitions taken on input symbols
  - one starting state q0 and a set of final states
- Automaton for the RE for IF (see the C sketch below)
- Combine the automata for each token type to create the lexer
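A minimal C sketch of that IF automaton (two transitions, one accepting state; the state names and test driver are assumptions):

    #include <stdio.h>

    /* States for RE "if": Q0 --i--> Q1 --f--> Q2 (accepting) */
    enum { Q0, Q1, Q2_ACCEPT, REJECT };

    int matches_if(const char *s) {
        int state = Q0;
        for (; *s != '\0'; s++) {
            switch (state) {
            case Q0: state = (*s == 'i') ? Q1 : REJECT; break;
            case Q1: state = (*s == 'f') ? Q2_ACCEPT : REJECT; break;
            default: state = REJECT; break;   /* anything after "if" */
            }
            if (state == REJECT) return 0;
        }
        return state == Q2_ACCEPT;
    }

    int main(void) {
        printf("%d %d %d\n", matches_if("if"), matches_if("iff"), matches_if("i"));
        return 0;   /* prints: 1 0 0 */
    }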
15. RE to FA

- 1. ID: letter (letter | digit)*
- 2. INTCONSTANT: digit digit*
- 4. MULOP: * | /
- 5. ADDOP: + | -
- 6. ASSNOP: :=
- 7. COLON: :
- 8. LESSTHAN: <
- 9. NOTEQUAL: <>
- LT_OR_EQUAL: <=
- GT: >
- GT_OR_EQUAL: >=
- EQUAL: =
- ENDMARKER: .
- SEMICOLON: ;
- LEFTPAREN: (
- RIGHTPAREN: )
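In a hand-written C lexer, these token classes map naturally onto an enumeration; a minimal sketch using the names above:

    /* Token classes from the RE list above */
    typedef enum {
        ID, INTCONSTANT, MULOP, ADDOP, ASSNOP, COLON,
        LESSTHAN, NOTEQUAL, LT_OR_EQUAL, GT, GT_OR_EQUAL, EQUAL,
        ENDMARKER, SEMICOLON, LEFTPAREN, RIGHTPAREN
    } TokenType;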
16. Augmenting the FA

- Following recognition of a token, a specified action provides for
  - returning the appropriate token (a type, value pair)
  - in some cases, other housekeeping

Action: return (IFKEY, -)
17. Matching Tokens

elsex = 0;

- REs alone are not enough: we need a rule for choosing among matches
- Most languages: the longest matching token wins
  - even if a shorter token is the only way to tokenize the rest of the input
  - Exception: early FORTRAN (totally whitespace-insensitive)
- Ties in length are resolved by prioritizing tokens
- REs + priorities + longest-matching-token rule = lexer definition
18. Delimiters

- The "longest matching token" rule has an impact on the REs (and FAs)
  - IF = i f delimiter
- For some tokens, delimiters are not an issue: leftparen, rightparen, comma
19. Comments and Whitespace

- Not part of tokens
  - Lexer skips over them
  - They function as delimiters: "el se" is two tokens, identifier el and identifier se
- Whitespace
  - Blanks, newlines, tabs
  - Only the first is relevant: can throw away the rest in a sequence (see the sketch below)
- Comments
  - May want to preserve for printing user source
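A minimal C sketch of skipping a whitespace run, using stdio's getchar/ungetc as a stand-in for the lexer's own character routines:

    #include <ctype.h>
    #include <stdio.h>

    /* Consume a run of blanks, tabs, and newlines: the whole run acts as
       a single delimiter, and the next token starts right after it. */
    void skipWhitespace(void) {
        int c;
        while ((c = getchar()) != EOF && isspace(c))
            ;                     /* throw away the rest of the sequence */
        if (c != EOF)
            ungetc(c, stdin);     /* push back the first significant char */
    }

    int main(void) {
        skipWhitespace();
        printf("next significant char: %c\n", getchar());
        return 0;
    }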
20. Implementation Options

- Hand-written lexer
  - Implement a finite state automaton
    - start in some initial state
    - look at each input character in sequence, updating the lexer state accordingly
    - if the state at the end of the input is an accepting state, the input string matches the RE
- Lexer generator
  - generates a tokenizer automatically (e.g., flex, JLex)
  - uses the RE to NFA to DFA algorithm
  - generates a table-driven lexer (also an FSA)
21. Hand-written lexer

Driver: calls GetNextToken; prints token type and value

GetNextToken: calls AssembleSimpleToken; changes IDs to keywords where necessary; returns the next token in the input stream

AssembleSimpleToken (the FSA): calls GetNextChar repeatedly; assembles character sequences into valid tokens; returns a simple token

GetNextChar: returns the next significant character in the input stream
22. Finite Automata

- An automaton (DFA) can be represented as
  - a transition table
  - a graph
23. A regexp matcher (table-driven method)

boolean accept_state[NSTATES];
int trans_table[NSTATES][NCHARS];

int state = 0;
while (state != ERROR_STATE) {
    int c = getNextChar();
    if (c < 0) break;              /* end of input */
    state = trans_table[state][c];
}
return accept_state[state];
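To make the transition table concrete, here is a runnable C sketch of the same loop recognizing [0-9]+ with a three-state DFA (the table contents are an illustrative assumption):

    #include <stdbool.h>
    #include <stdio.h>

    #define NSTATES 3
    #define NCHARS 128
    #define ERROR_STATE 2

    int main(void) {
        /* states: 0 = start, 1 = in digits (accepting), 2 = error */
        bool accept_state[NSTATES] = { false, true, false };
        int trans_table[NSTATES][NCHARS];

        for (int s = 0; s < NSTATES; s++)            /* default: error */
            for (int c = 0; c < NCHARS; c++)
                trans_table[s][c] = ERROR_STATE;
        for (int c = '0'; c <= '9'; c++) {           /* digit transitions */
            trans_table[0][c] = 1;
            trans_table[1][c] = 1;
        }

        const char *input = "412";
        int state = 0;
        for (const char *p = input; *p != '\0' && state != ERROR_STATE; p++)
            state = trans_table[state][(unsigned char)*p];
        printf("\"%s\": %s\n", input, accept_state[state] ? "accepted" : "rejected");
        return 0;
    }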
24. Hand-written lexer: Top-level loop (non-table-driven method)

Token nextToken() {
    char c = peekChar();     // look-ahead character, not yet consumed
    if (identifierChar(c))
        return readIdentifier();
    if (numericChar(c))
        return readNumber();
    else
        return readSymbol();
}
25. Hand-written lexer for identifiers

Token readIdentifier() {
    String id = "";
    while (true) {
        char c = getNextChar();
        if (!identifierChar(c))
            return new Token(ID, id);   // note: c was consumed; see Lookahead and Pushback
        id = id + c;
    }
}
26. Input Buffering

- The lexer should be optimized for speed
  - Buffering systems in standard languages (C, etc.) are poor
  - Copy from disk to OS buffer, OS buffer to buffer in FILE structure, FILE structure to string
- Solution: buffer input yourself
  - Get two buffers, each the size of a disk block
  - Two pointers keep track of the location in each
  - Load input into the buffers; reload one when done (see the sketch below)
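A minimal C sketch of the two-buffer scheme (the block size and the read(2) plumbing are assumptions; error handling is omitted):

    #include <unistd.h>

    #define BLOCK 4096

    static char buf[2][BLOCK];   /* two buffers, one disk block each */
    static int cur = 0;          /* buffer currently being read */
    static int pos = 0;          /* position within the current buffer */
    static int len = 0;          /* valid bytes in the current buffer */

    /* Return the next input character, reloading one buffer at a time. */
    int getNextChar(void) {
        if (pos >= len) {                         /* buffer exhausted */
            cur = 1 - cur;                        /* switch buffers */
            len = (int)read(0, buf[cur], BLOCK);  /* reload from stdin */
            pos = 0;
            if (len <= 0) return -1;              /* end of input */
        }
        return (unsigned char)buf[cur][pos++];
    }

    int main(void) {
        int c;
        while ((c = getNextChar()) != -1) {
            char ch = (char)c;
            write(1, &ch, 1);                     /* echo to prove it works */
        }
        return 0;
    }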
27. Problems

- We don't always know what kind of token we are going to read from seeing only the first character
  - if a token begins with i, is it an identifier?
  - if a token begins with <, is it "less than" or "less than or equal"?
28. Look-ahead character

- Scan the text one character at a time
- Use a look-ahead character (next) to determine what kind of token to read and when the current token ends

char next = input.read();
while (identifierChar(next)) {
    id = id + next;
    next = input.read();
}
29. Lookahead and Pushback

- In many instances you read a character or two beyond the end of a token (e.g., when you read a delimiter that could be part of the next token)
- But sometimes you don't
- Need a way to retain previously seen lookahead characters
- Simple solution: use a stack (see the sketch below)
  - Push back a character by pushing it on the stack
  - Get the next character from the stack (if not empty), else read from input
- Our lexer needs two characters of lookahead for the DOUBLEDOT
  - 5..10 vs 5.10
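A minimal C sketch of the pushback stack (depth two, matching the DOUBLEDOT case; getchar stands in for the buffered reader):

    #include <stdio.h>

    static int pushback[2];   /* two chars of lookahead suffice for 5..10 */
    static int top = 0;       /* number of pushed-back characters */

    int getNextChar(void) {
        if (top > 0)
            return pushback[--top];   /* reuse retained lookahead first */
        return getchar();
    }

    void pushBack(int c) {
        pushback[top++] = c;          /* retain c for a later getNextChar() */
    }

    int main(void) {
        int a = getNextChar();
        int b = getNextChar();        /* looked one character too far */
        pushBack(b);                  /* push back in reverse order */
        pushBack(a);
        printf("%c%c\n", getNextChar(), getNextChar());  /* a then b again */
        return 0;
    }

While scanning 5..10, the lexer reads the 5 and a dot, sees a second dot, and must push both dots back so they can be re-read as the DOUBLEDOT token.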
30. Reserved Words

- When the FAs are combined, we will not know whether a letter is the first character of an identifier or of a reserved word
- Can use the same FA for identifiers and reserved words, and check later which it is (see the sketch below)

Action: return (ID, lexeme)

Action: if keyword(lexeme) return (type, -)
        else return (ID, lexeme)

(The keyword check need not be here: it can be done during lexical identification.)
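A minimal C sketch of that keyword check (the keyword table is an illustrative assumption):

    #include <string.h>

    /* Illustrative table; a real lexer lists every reserved word. */
    static const char *keywords[] = { "if", "then", "else", "while" };

    /* After the identifier FA accepts a lexeme, decide whether it is
       really a reserved word. Returns the keyword index, or -1 for ID. */
    int keyword(const char *lexeme) {
        for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
            if (strcmp(lexeme, keywords[i]) == 0)
                return (int)i;
        return -1;
    }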
31. Error Recovery

- Not too many types of lexical errors
  - Illegal character
  - Ill-formed constant
- How is it handled?
  - Discard and print a message
- BUT
  - If a character in the middle of a lexeme is wrong, do you discard the char or the whole lexeme?
  - Try to correct?
32. Identifier Tokens

- In the final compiler, the value portion of the type-value pair for the identifier token will be a pointer to the symbol table entry for that identifier
  - For now, send the lexeme as the value
  - Implement when the symbol table routines are written
  - Token type is an enum
  - Token value is a union (symbol table pointer or string)
- Also an issue with the type of constants
  - Only the lexer knows whether a constant is real or int
  - Kludge for our compiler
    - two token types, INTCONSTANT and REALCONSTANT
    - Parser will treat them as one token type, CONSTANT
33. OPTION 2: Lexer Generator

- Input
  - a list of regular expressions describing the tokens in the language, in priority order
  - an associated action for each RE (generates the appropriate kind of token, other bookkeeping)
- Process
  - Reads the patterns
  - Builds a finite automaton to accept valid tokens
- Output
  - a C code implementation of the FA that reads an input stream and breaks it up into tokens according to the REs (or reports a lexical error: "Unexpected character")
  - Compile and link the C code, and you've got a scanner
34. How does lex build the FA?

- Programmer writes the regular expressions
- Lex generates the corresponding NFA-ε
  - Thompson's construction: 5 rules for making an NFA-ε for any regular expression
- Kleene's Theorem proves that any NFA-ε is equivalent to some NFA, which is in turn equivalent to a DFA
- So, lex can generate deterministic code
- Lex matches the longest token, then accepts

Automata theory proves you can write regular expressions, give them to a program like lex, and get a machine that accepts exactly those expressions
35. Lexer generator

- Regular expressions with attached actions

  -?[1-9][0-9]*  { return new Token(Tokens.IntConst,
                                    Integer.parseInt(yytext())); }

- Generates scanning code that decides
  - whether the input is lexically well-formed
  - what the corresponding token sequence is

Observation
- This process is equivalent to deciding whether the input is in the language of the regular expression (R1 | R2 | ... | Rn)
36. Example Input

regular expressions:

  digits      = 0|[1-9][0-9]*
  letter      = [A-Za-z]
  identifier  = {letter}({letter}|[0-9_])*
  whitespace  = [ \t\n\r]+

actions:

  {whitespace}  { /* discard */ }
  {digits}      { return new IntegerConstant(Integer.parseInt(yytext())); }
  "if"          { return new IfToken(); }
  "while"       { return new WhileToken(); }
  {identifier}  { return new IdentifierToken(yytext()); }
37. Three parts to Lex

- Declarations
  - Regular expression definitions of tokens
  - /* This is a sample Lex program written by.... */
  - digit --> [0-9]
  - number --> {digit}+
- Transition Rules
  - Regular Expression   Action when matched
  - {number}  printf("The number is %s\n", yytext);
  - junk      printf("Junk is not a valid input!\n");
  - quit      return 0;
- Auxiliary Procedures
  - Written into the C program
  - int main() is required
- %% separates the three parts
38. Example

  delim     [ \t\n]
  ws        {delim}+
  letter    [A-Za-z]
  digit     [0-9]
  id        {letter}({letter}|{digit})*
  number    {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?
  %%
  {ws}      { /* no action and no return */ }
  if        { return(IF); }
  then      { return(THEN); }
  else      { return(ELSE); }
  {id}      { yylval = install_id(); return(ID); }
  {number}  { yylval = install_num(); return(NUMBER); }
  %%
39. Lex variables and functions

- Available variables
  - yylval
  - yytext (null-terminated string holding the lexeme)
  - yyleng (length of the matching string)
  - yyin, the file handle
    - yyin = fopen(argv[1], "r");
- Available functions
  - yylex() (the primary function generated)
  - input(): returns the next character from the input
  - unput(c): pushes the character c back onto the input
  - int main(int argc, char **argv)
    - calls yylex() to perform the lexical analysis (see the driver sketch below)
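Putting these together, a minimal C driver for a lex-generated scanner might look like this sketch (printing each token number is an illustrative choice):

    #include <stdio.h>

    extern FILE *yyin;        /* input handle provided by lex */
    extern char *yytext;      /* lexeme of the most recent match */
    extern int yylex(void);   /* scanner function generated by lex */

    int main(int argc, char **argv) {
        if (argc > 1)
            yyin = fopen(argv[1], "r");   /* scan a file instead of stdin */
        int token;
        while ((token = yylex()) != 0)    /* yylex() returns 0 at end of input */
            printf("token %d: %s\n", token, yytext);
        return 0;
    }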
40. Context Checking

- Lex allows context-dependent REs
  - r/x: the regular expression r will be matched only if it is followed by an occurrence of regular expression x
- Makes it easy to deal with our ADDOP vs. UNARYPLUS problem (see the sketch below)
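Trailing context also handles the DOUBLEDOT case from earlier; a sketch of flex rules (the token macros are assumptions). Lex's longest-match comparison counts the trailing context, so for 5..10 the first rule wins, returns the integer 5, and leaves .. in the input:

    [0-9]+/".."         { return INTCONSTANT; /* the 5 in 5..10 */ }
    [0-9]+(\.[0-9]+)?   { return NUMBER;      /* ordinary 5.10 */ }
    ".."                { return DOUBLEDOT; }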
41. How to Use Lex (flex)

- Run Unix man flex for full information
- Write regular expressions and actions
- Compile using the Lex (flex) tool
  - flex <prog_name>.l
  - Results in C code (lex.yy.c)
- Compile using a C compiler, linking in the lex library
  - gcc lex.yy.c -ll
- Run the a.out file and recognize tokens
  - a.out < input.text
42. Lexer generators

- The power
  - Programmer describes tokens as regular expressions
  - Lex turns the description of the tokens into code
  - The generated code compiles into a scanner
- The pitfalls
  - Source code generated by lex is hard to debug
  - Without understanding its basis in formal languages, lex can be a quirky black box
43. Comparison of Methods

- Hand-coded scanner
  - Programmer creates types, defines data and procedures, designs flow of control, implements in the source language
- Lex-generated scanner
  - Programmer writes patterns (declarative, not procedural)
  - Lex/flex implements the flow of control
  - Much less hand-coding, but the code looks pretty alien and is tricky to debug
44. Summary

- The lexical analyzer converts a text stream to tokens
- For most languages, legal tokens are conveniently and precisely defined using regular expressions
- Two ways to write a lexer
  - Hand code it
  - Use a lexer generator to generate the lexer code automatically from the token REs and their precedence
45. APPENDIX
46. Regular Expression Notation

- a      an ordinary character stands for itself
- ε      the empty string
- R|S    any string from either L(R) or L(S)
- RS     a string from L(R) followed by one from L(S)
- R*     zero or more strings from L(R), concatenated
         L(R*) = {ε} ∪ L(R) ∪ L(RR) ∪ L(RRR) ∪ ...
47. Convenient RE Shorthand

- R+       one or more strings from L(R): R(R)*
- R?       optional R: (R | ε)
- [abce]   one of the listed characters: (a|b|c|e)
- [a-z]    one character from this range: (a|b|c|d|e|...)
- [^ab]    anything but one of the listed characters
- [^a-z]   one character not from this range
48. Examples

  Regular Expression    Strings in L(R)
  a                     "a"
  ab                    "ab"
  a | b                 "a", "b"
  (ab)*                 "", "ab", "abab", ...
  (a | ε) b             "ab", "b"
49. More Examples

  Regular Expression                     Strings in L(R)
  digit  = [0-9]                         0 1 2 3 ...
  posint = digit+                        8 412 ...
  int    = -? posint                     -42 1024 ...
  real   = int (ε | (. posint))          -1.56 12 1.0 ...
         = -?[0-9]+(ε | (.[0-9]+))
  [a-zA-Z_][a-zA-Z0-9_]*                 C identifiers