Title: LEXICAL ANALYSIS
1. LEXICAL ANALYSIS
2. First Step in Compilation

Source code (character stream)
  ↓ Lexical analysis
Token stream
  ↓ Parsing
Abstract syntax tree
  ↓ Intermediate Code Generation
Intermediate code
  ↓ Code Generation
Assembly code
3. Lexical Analysis

Source code (character stream): if (b == 0) a = "hi";
  ↓ Lexical analysis
Token stream
  ↓ Parsing
  ↓ Semantic Analysis
4. A Closer Look

- Lexical analysis converts a character stream to a token stream of pairs <token type, value> (modeled in the C sketch below)

if (x1 * x2 < 1.0) { ...

KEY(if) LPAREN ID(x1) OP(*) ID(x2) RELOP(<) NUM(1.0) RPAREN LBRACE ...
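A token pair like these can be modeled in C as a small tagged record; a minimal sketch (the enum members and union layout are illustrative assumptions, anticipating the enum/union representation discussed under Identifier Tokens later):

    #include <stdio.h>

    /* Token types from the example above (illustrative subset) */
    typedef enum { KEY, ID, OP, RELOP, NUM, LPAREN, RPAREN, LBRACE } TokenType;

    typedef struct {
        TokenType type;
        union {                    /* the value depends on the token type */
            const char *lexeme;    /* for KEY, ID */
            double      num;       /* for NUM */
            char        op;        /* for OP, RELOP */
        } value;
    } Token;

    int main(void) {
        Token t = { ID, { .lexeme = "x1" } };
        printf("type %d, value %s\n", t.type, t.value.lexeme);
        return 0;
    }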
5. Why a Separate Lexical Analysis Phase?

- Programs could be made directly from characters, and parse trees would go down to the character level
  - Machine specific, obfuscates parsing, cumbersome
- Lexical analysis is a firewall between program representation and parsing actions
- A prior lexical analysis phase obtains tokens consisting of a type and value
6.
grammar:
  <STMT> → IFKEY LPAREN <COND> RPAREN <STMT>
         | ID ASSIGNOP <EXPR> SEMI
  <COND> → <EXPR> RELOP <EXPR>
  <EXPR> → ID | CONSTANT

parse tree: (figure: <STMT> expands to IFKEY LPAREN <COND> RPAREN <STMT>; <COND> to <EXPR> RELOP <EXPR>; the inner <STMT> to ID ASSIGNOP <EXPR> SEMI; each <EXPR> to ID or CONSTANT)

- Parser groups tokens according to the grammar
- Lexical analyzer (phase 2) turns lexemes into tokens
- Lexical analyzer (phase 1) groups characters into lexemes
7. Lexical Analysis Terminology

- Token
  - Terminal symbol in a grammar
  - A class of character sequences with a collective meaning, e.g., IDENT
  - Constants, operators, punctuation, reserved words (keywords)
- Lexeme
  - The character sequence matched by an instance of the token, e.g., sqrt
8. Token Types

- Identifiers: x y11 elsex _i00
- Keywords: if else while
- Integers: 2 1000 -500 6663554
- Floating point: 2.0 0.00020 .02 1. 1e5 0.e-10
- Symbols: + - < > .. /
- Comments: /* don't change this */
9. Token Values

- Some token types have values associated with them, e.g., an identifier token carries its lexeme and a NUM token carries its numeric value
10. Lexical errors

- What if the user omits the space in "real f" (writing "realf")?
  - No lexical error: the single token IDENT(realf) is produced instead of the sequence REAL, IDENT(f)!
- Typically few lexical error types
  - illegal characters
  - unterminated comments
  - ill-formed constants
11. Issues

- How to describe tokens unambiguously
  - 2.e0 20.e-01 2.0000
- How to break text up into tokens
  - if (x == 0) a = x<<1;
  - iff (x == 0) a = x<1;
- How to write the lexer
12. How to Describe Tokens

- Programming language tokens can be described using regular expressions
- A regular expression R describes some set of strings L(R)
  - L(R) is the language defined by R
  - L(abc) = { "abc" }
  - L(hello | goodbye) = { "hello", "goodbye" }
  - L([1-9][0-9]*) = all positive integer constants
13. Define each kind of token using an RE

- Keywords and punctuation are easy
  - IF keyword: if
  - Left paren: (
- Identifiers and constants are a bit more complicated
  - Identifiers: letter (letter | digit)*
  - Constants: reals are more complicated
    - (+ | -)? digit+ (. digit+)?

N.B. extended (UNIX-like) RE syntax: ? stands for 0 or 1 occurrences
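One way to sanity-check the identifier RE is the POSIX regex library; a minimal C sketch, where the pattern string is an assumed translation of letter (letter | digit)* into POSIX syntax:

    #include <regex.h>
    #include <stdio.h>

    int main(void) {
        /* anchored POSIX ERE version of: letter (letter | digit)* */
        const char *pattern = "^[A-Za-z][A-Za-z0-9]*$";
        const char *samples[] = { "x1", "y11", "9lives" };
        regex_t re;
        regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);
        for (int i = 0; i < 3; i++)
            printf("%-8s %s\n", samples[i],
                   regexec(&re, samples[i], 0, NULL, 0) == 0
                       ? "identifier" : "no match");
        regfree(&re);
        return 0;
    }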
14. Implementing the Lexer

- The lexer is implemented with a finite state automaton that corresponds to the regular expressions describing the tokens
  - finite set of states
  - set of transitions between states
  - transitions taken on input symbols
  - one starting state q0 and a set of final states
- Automaton for the RE for IF (see the C sketch below)
- Combine the automata for each token type to create the lexer
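A minimal C sketch of that IF automaton (two transitions, one accepting state; the state names and test driver are assumptions):

    #include <stdio.h>

    /* States for RE "if": Q0 --i--> Q1 --f--> Q2 (accepting) */
    enum { Q0, Q1, Q2_ACCEPT, REJECT };

    int matches_if(const char *s) {
        int state = Q0;
        for (; *s != '\0'; s++) {
            switch (state) {
            case Q0: state = (*s == 'i') ? Q1 : REJECT; break;
            case Q1: state = (*s == 'f') ? Q2_ACCEPT : REJECT; break;
            default: state = REJECT; break;   /* anything after "if" */
            }
            if (state == REJECT) return 0;
        }
        return state == Q2_ACCEPT;
    }

    int main(void) {
        printf("%d %d %d\n", matches_if("if"), matches_if("iff"), matches_if("i"));
        return 0;   /* prints: 1 0 0 */
    }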
15. RE to FA

- 1. ID: letter (letter | digit)*
- 2. INTCONSTANT: digit digit*
- 4. MULOP: * | /
- 5. ADDOP: + | -
- 6. ASSNOP: :=
- 7. COLON: :
- 8. LESSTHAN: <
- 9. NOTEQUAL: <>
- LT_OR_EQUAL: <=
- GT: >
- GT_OR_EQUAL: >=
- EQUAL: =
- ENDMARKER: .
- SEMICOLON: ;
- LEFTPAREN: (
- RIGHTPAREN: )
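In a hand-written C lexer, these token classes map naturally onto an enumeration; a minimal sketch using the names above:

    /* Token classes from the RE list above */
    typedef enum {
        ID, INTCONSTANT, MULOP, ADDOP, ASSNOP, COLON,
        LESSTHAN, NOTEQUAL, LT_OR_EQUAL, GT, GT_OR_EQUAL, EQUAL,
        ENDMARKER, SEMICOLON, LEFTPAREN, RIGHTPAREN
    } TokenType;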
16. Augmenting the FA

- Following recognition of a token, a specified action provides for
  - returning the appropriate token (a type, value pair)
  - in some cases, other housekeeping

Action: return (IFKEY, -)
17. Matching Tokens

elsex = 0;

- REs alone are not enough: we need a rule for choosing among matches
- Most languages: the longest matching token wins
  - even if a shorter token is the only way to tokenize the rest of the input
  - Exception: early FORTRAN (totally whitespace-insensitive)
- Ties in length are resolved by prioritizing tokens
- REs + priorities + longest-matching-token rule = lexer definition
18. Delimiters

- The "longest matching token" rule has an impact on the REs (and FAs)
  - IF = i f delimiter
- For some tokens, delimiters are not an issue: leftparen, rightparen, comma
19. Comments and Whitespace

- Not part of tokens
  - Lexer skips over them
  - They function as delimiters: "el se" is two tokens, identifier el and identifier se
- Whitespace
  - Blanks, newlines, tabs
  - Only the first is relevant: can throw away the rest in a sequence (see the sketch below)
- Comments
  - May want to preserve for printing user source
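A minimal C sketch of skipping a whitespace run, using stdio's getchar/ungetc as a stand-in for the lexer's own character routines:

    #include <ctype.h>
    #include <stdio.h>

    /* Consume a run of blanks, tabs, and newlines: the whole run acts as
       a single delimiter, and the next token starts right after it. */
    void skipWhitespace(void) {
        int c;
        while ((c = getchar()) != EOF && isspace(c))
            ;                     /* throw away the rest of the sequence */
        if (c != EOF)
            ungetc(c, stdin);     /* push back the first significant char */
    }

    int main(void) {
        skipWhitespace();
        printf("next significant char: %c\n", getchar());
        return 0;
    }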
20. Implementation Options

- Hand-written lexer
  - Implement a finite state automaton
    - start in some initial state
    - look at each input character in sequence, updating the lexer state accordingly
    - if the state at the end of the input is an accepting state, the input string matches the RE
- Lexer generator
  - generates a tokenizer automatically (e.g., flex, JLex)
  - uses the RE to NFA to DFA algorithm
  - generates a table-driven lexer (also an FSA)
21. Hand-written lexer

Driver: calls GetNextToken; prints token type and value

GetNextToken: calls AssembleSimpleToken; changes IDs to keywords where necessary; returns the next token in the input stream

AssembleSimpleToken (the FSA): calls GetNextChar repeatedly; assembles character sequences into valid tokens; returns a simple token

GetNextChar: returns the next significant character in the input stream
22. Finite Automata

- An automaton (DFA) can be represented as
  - a transition table
  - a graph
23. A regexp matcher (table-driven method)

boolean accept_state[NSTATES];
int trans_table[NSTATES][NCHARS];

int state = 0;
while (state != ERROR_STATE) {
    int c = getNextChar();
    if (c < 0) break;              /* end of input */
    state = trans_table[state][c];
}
return accept_state[state];
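To make the transition table concrete, here is a runnable C sketch of the same loop recognizing [0-9]+ with a three-state DFA (the table contents are an illustrative assumption):

    #include <stdbool.h>
    #include <stdio.h>

    #define NSTATES 3
    #define NCHARS 128
    #define ERROR_STATE 2

    int main(void) {
        /* states: 0 = start, 1 = in digits (accepting), 2 = error */
        bool accept_state[NSTATES] = { false, true, false };
        int trans_table[NSTATES][NCHARS];

        for (int s = 0; s < NSTATES; s++)            /* default: error */
            for (int c = 0; c < NCHARS; c++)
                trans_table[s][c] = ERROR_STATE;
        for (int c = '0'; c <= '9'; c++) {           /* digit transitions */
            trans_table[0][c] = 1;
            trans_table[1][c] = 1;
        }

        const char *input = "412";
        int state = 0;
        for (const char *p = input; *p != '\0' && state != ERROR_STATE; p++)
            state = trans_table[state][(unsigned char)*p];
        printf("\"%s\": %s\n", input, accept_state[state] ? "accepted" : "rejected");
        return 0;
    }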
24. Hand-written lexer: Top-level loop (non-table-driven method)

Token nextToken() {
    char c = peekChar();     // look-ahead character, not yet consumed
    if (identifierChar(c))
        return readIdentifier();
    if (numericChar(c))
        return readNumber();
    else
        return readSymbol();
}
25. Hand-written lexer for identifiers

Token readIdentifier() {
    String id = "";
    while (true) {
        char c = getNextChar();
        if (!identifierChar(c))
            return new Token(ID, id);   // note: c was consumed; see Lookahead and Pushback
        id = id + c;
    }
}
26. Input Buffering

- The lexer should be optimized for speed
  - Buffering systems in standard languages (C, etc.) are poor
  - Copy from disk to OS buffer, OS buffer to buffer in FILE structure, FILE structure to string
- Solution: buffer input yourself
  - Get two buffers, each the size of a disk block
  - Two pointers keep track of the location in each
  - Load input into the buffers; reload one when done (see the sketch below)
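A minimal C sketch of the two-buffer scheme (the block size and the read(2) plumbing are assumptions; error handling is omitted):

    #include <unistd.h>

    #define BLOCK 4096

    static char buf[2][BLOCK];   /* two buffers, one disk block each */
    static int cur = 0;          /* buffer currently being read */
    static int pos = 0;          /* position within the current buffer */
    static int len = 0;          /* valid bytes in the current buffer */

    /* Return the next input character, reloading one buffer at a time. */
    int getNextChar(void) {
        if (pos >= len) {                         /* buffer exhausted */
            cur = 1 - cur;                        /* switch buffers */
            len = (int)read(0, buf[cur], BLOCK);  /* reload from stdin */
            pos = 0;
            if (len <= 0) return -1;              /* end of input */
        }
        return (unsigned char)buf[cur][pos++];
    }

    int main(void) {
        int c;
        while ((c = getNextChar()) != -1) {
            char ch = (char)c;
            write(1, &ch, 1);                     /* echo to prove it works */
        }
        return 0;
    }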
27. Problems

- We don't always know what kind of token we are going to read from seeing only the first character
  - if a token begins with i, is it an identifier?
  - if a token begins with <, is it "less than" or "less than or equal"?
28. Look-ahead character

- Scan the text one character at a time
- Use a look-ahead character (next) to determine what kind of token to read and when the current token ends

char next = input.read();
while (identifierChar(next)) {
    id = id + next;
    next = input.read();
}
29. Lookahead and Pushback

- In many instances you read a character or two beyond the end of a token (e.g., when you read a delimiter that could be part of the next token)
- But sometimes you don't
- Need a way to retain previously seen lookahead characters
- Simple solution: use a stack (see the sketch below)
  - Push back a character by pushing it on the stack
  - Get the next character from the stack (if not empty), else read from input
- Our lexer needs two characters of lookahead for the DOUBLEDOT
  - 5..10 vs 5.10
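A minimal C sketch of the pushback stack (depth two, matching the DOUBLEDOT case; getchar stands in for the buffered reader):

    #include <stdio.h>

    static int pushback[2];   /* two chars of lookahead suffice for 5..10 */
    static int top = 0;       /* number of pushed-back characters */

    int getNextChar(void) {
        if (top > 0)
            return pushback[--top];   /* reuse retained lookahead first */
        return getchar();
    }

    void pushBack(int c) {
        pushback[top++] = c;          /* retain c for a later getNextChar() */
    }

    int main(void) {
        int a = getNextChar();
        int b = getNextChar();        /* looked one character too far */
        pushBack(b);                  /* push back in reverse order */
        pushBack(a);
        printf("%c%c\n", getNextChar(), getNextChar());  /* a then b again */
        return 0;
    }

While scanning 5..10, the lexer reads the 5 and a dot, sees a second dot, and must push both dots back so they can be re-read as the DOUBLEDOT token.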
30. Reserved Words

- When the FAs are combined, we will not know whether a letter is the first character of an identifier or of a reserved word
- Can use the same FA for identifiers and reserved words, and check later which it is (see the sketch below)

Action: return (ID, lexeme)

Action: if keyword(lexeme) return (type, -)
        else return (ID, lexeme)

(The keyword check need not be here: it can be done during lexical identification.)
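A minimal C sketch of that keyword check (the keyword table is an illustrative assumption):

    #include <string.h>

    /* Illustrative table; a real lexer lists every reserved word. */
    static const char *keywords[] = { "if", "then", "else", "while" };

    /* After the identifier FA accepts a lexeme, decide whether it is
       really a reserved word. Returns the keyword index, or -1 for ID. */
    int keyword(const char *lexeme) {
        for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
            if (strcmp(lexeme, keywords[i]) == 0)
                return (int)i;
        return -1;
    }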
31. Error Recovery

- Not too many types of lexical errors
  - Illegal character
  - Ill-formed constant
- How is it handled?
  - Discard and print a message
- BUT
  - If a character in the middle of a lexeme is wrong, do you discard the char or the whole lexeme?
  - Try to correct?
32. Identifier Tokens

- In the final compiler, the value portion of the type-value pair for the identifier token will be a pointer to the symbol table entry for that identifier
  - For now, send the lexeme as the value
  - Implement when the symbol table routines are written
  - Token type is an enum
  - Token value is a union (symbol table pointer or string)
- Also an issue with the type of constants
  - Only the lexer knows whether a constant is real or int
  - Kludge for our compiler
    - two token types, INTCONSTANT and REALCONSTANT
    - Parser will treat them as one token type, CONSTANT
33. OPTION 2: Lexer Generator

- Input
  - a list of regular expressions describing the tokens in the language, in priority order
  - an associated action for each RE (generates the appropriate kind of token, other bookkeeping)
- Process
  - Reads the patterns
  - Builds a finite automaton to accept valid tokens
- Output
  - a C code implementation of the FA that reads an input stream and breaks it up into tokens according to the REs (or reports a lexical error: "Unexpected character")
  - Compile and link the C code, and you've got a scanner
34. How does lex build the FA?

- Programmer writes the regular expressions
- Lex generates the corresponding NFA-ε
  - Thompson's construction: 5 rules for making an NFA-ε for any regular expression
- Kleene's Theorem proves that any NFA-ε is equivalent to some NFA, which is in turn equivalent to a DFA
- So, lex can generate deterministic code
- Lex matches the longest token, then accepts

Automata theory proves you can write regular expressions, give them to a program like lex, and get a machine that accepts exactly those expressions
35. Lexer generator

- Regular expressions with attached actions

  -?[1-9][0-9]*  { return new Token(Tokens.IntConst,
                                    Integer.parseInt(yytext())); }

- Generates scanning code that decides
  - whether the input is lexically well-formed
  - what the corresponding token sequence is

Observation
- This process is equivalent to deciding whether the input is in the language of the regular expression (R1 | R2 | ... | Rn)
36. Example Input

regular expressions:

  digits      = 0|[1-9][0-9]*
  letter      = [A-Za-z]
  identifier  = {letter}({letter}|[0-9_])*
  whitespace  = [ \t\n\r]+

actions:

  {whitespace}  { /* discard */ }
  {digits}      { return new IntegerConstant(Integer.parseInt(yytext())); }
  "if"          { return new IfToken(); }
  "while"       { return new WhileToken(); }
  {identifier}  { return new IdentifierToken(yytext()); }
37. Three parts to Lex

- Declarations
  - Regular expression definitions of tokens
  - /* This is a sample Lex program written by.... */
  - digit --> [0-9]
  - number --> {digit}+
- Transition Rules
  - Regular Expression   Action when matched
  - {number}  printf("The number is %s\n", yytext);
  - junk      printf("Junk is not a valid input!\n");
  - quit      return 0;
- Auxiliary Procedures
  - Written into the C program
  - int main() is required
- %% separates the three parts
38. Example

  delim     [ \t\n]
  ws        {delim}+
  letter    [A-Za-z]
  digit     [0-9]
  id        {letter}({letter}|{digit})*
  number    {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?
  %%
  {ws}      { /* no action and no return */ }
  if        { return(IF); }
  then      { return(THEN); }
  else      { return(ELSE); }
  {id}      { yylval = install_id(); return(ID); }
  {number}  { yylval = install_num(); return(NUMBER); }
  %%
39. Lex variables and functions

- Available variables
  - yylval
  - yytext (null-terminated string holding the lexeme)
  - yyleng (length of the matching string)
  - yyin, the file handle
    - yyin = fopen(argv[1], "r");
- Available functions
  - yylex() (the primary function generated)
  - input(): returns the next character from the input
  - unput(c): pushes the character c back onto the input
  - int main(int argc, char **argv)
    - calls yylex() to perform the lexical analysis (see the driver sketch below)
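Putting these together, a minimal C driver for a lex-generated scanner might look like this sketch (printing each token number is an illustrative choice):

    #include <stdio.h>

    extern FILE *yyin;        /* input handle provided by lex */
    extern char *yytext;      /* lexeme of the most recent match */
    extern int yylex(void);   /* scanner function generated by lex */

    int main(int argc, char **argv) {
        if (argc > 1)
            yyin = fopen(argv[1], "r");   /* scan a file instead of stdin */
        int token;
        while ((token = yylex()) != 0)    /* yylex() returns 0 at end of input */
            printf("token %d: %s\n", token, yytext);
        return 0;
    }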
40. Context Checking

- Lex allows context-dependent REs
  - r/x: the regular expression r will be matched only if it is followed by an occurrence of regular expression x
- Makes it easy to deal with our ADDOP vs. UNARYPLUS problem (see the sketch below)
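Trailing context also handles the DOUBLEDOT case from earlier; a sketch of flex rules (the token macros are assumptions). Lex's longest-match comparison counts the trailing context, so for 5..10 the first rule wins, returns the integer 5, and leaves .. in the input:

    [0-9]+/".."         { return INTCONSTANT; /* the 5 in 5..10 */ }
    [0-9]+(\.[0-9]+)?   { return NUMBER;      /* ordinary 5.10 */ }
    ".."                { return DOUBLEDOT; }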
41. How to Use Lex (flex)

- Run Unix man flex for full information
- Write regular expressions and actions
- Compile using the Lex (flex) tool
  - flex <prog_name>.l
  - Results in C code (lex.yy.c)
- Compile using a C compiler, linking in the lex library
  - gcc lex.yy.c -ll
- Run the a.out file and recognize tokens
  - a.out < input.text
42. Lexer generators

- The power
  - Programmer describes tokens as regular expressions
  - Lex turns the description of the tokens into code
  - The generated code compiles into a scanner
- The pitfalls
  - Source code generated by lex is hard to debug
  - Without understanding its basis in formal languages, lex can be a quirky black box
43. Comparison of Methods

- Hand-coded scanner
  - Programmer creates types, defines data and procedures, designs flow of control, implements in the source language
- Lex-generated scanner
  - Programmer writes patterns (declarative, not procedural)
  - Lex/flex implements the flow of control
  - Much less hand-coding, but the code looks pretty alien and is tricky to debug
44. Summary

- The lexical analyzer converts a text stream to tokens
- For most languages, legal tokens are conveniently and precisely defined using regular expressions
- Two ways to write a lexer
  - Hand code it
  - Use a lexer generator to generate the lexer code automatically from the token REs and their precedence
45. APPENDIX
46. Regular Expression Notation

- a      an ordinary character stands for itself
- ε      the empty string
- R|S    any string from either L(R) or L(S)
- RS     a string from L(R) followed by one from L(S)
- R*     zero or more strings from L(R), concatenated
         L(R*) = {ε} ∪ L(R) ∪ L(RR) ∪ L(RRR) ∪ ...
47. Convenient RE Shorthand

- R+       one or more strings from L(R): R(R)*
- R?       optional R: (R | ε)
- [abce]   one of the listed characters: (a|b|c|e)
- [a-z]    one character from this range: (a|b|c|d|e|...)
- [^ab]    anything but one of the listed characters
- [^a-z]   one character not from this range
48. Examples

  Regular Expression    Strings in L(R)
  a                     "a"
  ab                    "ab"
  a | b                 "a", "b"
  (ab)*                 "", "ab", "abab", ...
  (a | ε) b             "ab", "b"
49. More Examples

  Regular Expression                     Strings in L(R)
  digit  = [0-9]                         0 1 2 3 ...
  posint = digit+                        8 412 ...
  int    = -? posint                     -42 1024 ...
  real   = int (ε | (. posint))          -1.56 12 1.0 ...
         = -?[0-9]+(ε | (.[0-9]+))
  [a-zA-Z_][a-zA-Z0-9_]*                 C identifiers