1
LEXICAL ANALYSIS
2
First Step in Compilation
Source code (character stream)
  ↓ Lexical analysis
Token stream
  ↓ Parsing
Abstract syntax tree
  ↓ Intermediate Code Generation
Intermediate code
  ↓ Code Generation
Assembly code
3
Lexical Analysis
Source code (character stream)
  if (b == 0) a = "hi";
  ↓ Lexical analysis
Token stream
  ↓ Parsing
  ↓ Semantic Analysis
4
A Closer Look
  • Lexical analysis converts a character stream to a
    token stream of pairs <token type, value>

if (x1 * x2 < 1.0) { y = x1 }

KEY(if) LPAREN ID(x1) OP(*) ID(x2) RELOP(<)
NUM(1.0) RPAREN LBRACE ...
5
Why a Separate Lexical Analysis Phase?
  • Programs could be made from characters, and parse
    trees would go down to the character level
  • Machine specific, obfuscates parsing, cumbersome
  • Lexical analysis is a firewall between program
    representation and parsing actions
  • A prior lexical analysis phase obtains tokens
    consisting of a type and value

6
<STMT> → IFKEY LPAREN <COND> RPAREN <STMT>
       | ID ASSIGNOP <EXPR> SEMI
<COND> → <EXPR> RELOP <EXPR>
<EXPR> → ID | CONSTANT
grammar

[Parse-tree figure: the parser groups tokens according to the
grammar -- <STMT> expands to IFKEY LPAREN <COND> RPAREN <STMT>,
<COND> to <EXPR> RELOP <EXPR>, the inner <STMT> to ID ASSIGNOP
<EXPR> SEMI, and each <EXPR> to ID or CONSTANT]

Lexical analyzer (phase 2) turns lexemes into tokens
Lexical analyzer (phase 1) groups characters into lexemes
7
Lexical Analysis Terminology
  • Token
  • Terminal symbol in a grammar
  • Classes of sequences of characters with a
    collective meaning, e.g., IDENT
  • Constants, Operators, Punctuation, Reserved words
    (keywords)
  • Lexeme
  • Character sequence matched by an instance of the
    token, e.g., sqrt

8
Token Types
  • Identifiers: x y11 elsex _i00
  • Keywords: if else while
  • Integers: 2 1000 -500 6663554
  • Floating point: 2.0 0.00020 .02 1. 1e5 0.e-10
  • Symbols: + - * / < > = ; ( ) ..
  • Comments: {don't change this}

9
Token Values
  • Some token types have values associated with them

10
Lexical errors
  • What if the user omits the space in real f
    (writing realf)?
  • No lexical error: the single token IDENT(realf) is
    produced instead of the sequence REAL, IDENT(f)!
  • Typically few lexical error types
  • illegal characters
  • unterminated comments
  • ill-formed constants

11
Issues
  • How to describe tokens unambiguously
  • 2.e0 20.e-01 2.0000
  • How to break text up into tokens
  • if (x == 0) a = x<<1;
  • iff (x == 0) a = x<1;
  • How to write the lexer

12
How to Describe Tokens
  • Programming language tokens can be described
    using regular expressions
  • A regular expression R describes some set of
    strings L(R)
  • L(R) is the language defined by R
  • L(abc) = { "abc" }
  • L(hello | goodbye) = { "hello", "goodbye" }
  • L([1-9][0-9]*) = all positive integer constants

13
Define each kind of token using a RE
  • Keywords, punctuation are easy
  • IF keyword: if
  • Left paren: (
  • Identifiers, constants a bit more complicated
  • Identifiers: letter (letter | digit)*
  • Constants: reals are more complicated
  • (+ | -)? digit (digit)* . (digit)*

N.B. extended (UNIX-like) RE syntax: ? stands for
0 or 1
14
Implementing the Lexer
  • Lexer implemented with a finite state automaton
    that corresponds to the regular expressions
    describing the tokens
  • finite set of states
  • set of transitions between states
  • transitions taken on input symbols
  • one starting state q0 and a set of final states
  • Automaton for the RE for IF (sketched in C below)
  • Combine the automata for each token type to
    create the lexer
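
A minimal hand-coded version of that automaton in C (not from the
slides; the function name and state numbering are illustrative):

    /* FA for the keyword "if": q0 --i--> q1 --f--> q2, q2 final.
       State -1 is a dead state reached on any other input. */
    #include <stdio.h>

    int matches_if(const char *input) {
        int state = 0;                          /* start in q0 */
        for (const char *p = input; *p != '\0'; p++) {
            switch (state) {
            case 0:  state = (*p == 'i') ? 1 : -1; break;
            case 1:  state = (*p == 'f') ? 2 : -1; break;
            default: state = -1; break;         /* no transition out */
            }
            if (state == -1) return 0;
        }
        return state == 2;                      /* accept only in q2 */
    }

    int main(void) {
        /* prints 1 0 0: only the exact string "if" is accepted */
        printf("%d %d %d\n",
               matches_if("if"), matches_if("i"), matches_if("iff"));
        return 0;
    }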

15
RE to FA
  • ID = letter (letter | digit)*
  • INTCONSTANT = digit digit*
  • MULOP = * | /
  • ADDOP = + | -
  • ASSNOP = :=
  • COLON = :
  • LESSTHAN = <
  • NOTEQUAL = <>
  • LT_OR_EQUAL = <=
  • GT = >
  • GT_OR_EQUAL = >=
  • EQUAL = =
  • ENDMARKER = .
  • SEMICOLON = ;
  • LEFTPAREN = (
  • RIGHTPAREN = )

16
Augmenting the FA
  • Following recognition of a token, an action is
    specified that provides for
  • returning the appropriate token (type, value
    pair)
  • In some cases, other housekeeping

Action: return (IFKEY, -)
17
Matching Tokens
elsex = 0
  • REs alone not enough: need a rule for choosing
  • Most languages: longest matching token wins
  • even if a shorter token is the only way to
    continue parsing
  • Exception: early FORTRAN (totally
    whitespace-insensitive)
  • Ties in length resolved by prioritizing tokens
  • REs + priorities + longest-matching token rule =
    lexer definition (see the sketch after this list)
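
A sketch of how the longest-match rule is usually implemented
(assumed names: trans, accepting, token_of; run the FA as far as
it will go, remember the last accepting position, then back up):

    #define NSTATES 64
    #define NCHARS  256

    extern int trans[NSTATES][NCHARS];  /* -1 means "no transition" */
    extern int accepting[NSTATES];      /* nonzero for final states */
    extern int token_of[NSTATES];       /* token type per final state */

    /* Returns the position just past the longest token starting at
       pos, or -1 on a lexical error; token type goes in *token_out. */
    int scan_longest(const char *input, int pos, int *token_out) {
        int state = 0;
        int last_pos = -1, last_state = -1;
        while (input[pos] != '\0' &&
               trans[state][(unsigned char)input[pos]] != -1) {
            state = trans[state][(unsigned char)input[pos]];
            pos++;
            if (accepting[state]) {     /* longest match so far */
                last_pos = pos;
                last_state = state;
            }
        }
        if (last_pos == -1) return -1;  /* no token matched */
        *token_out = token_of[last_state];
        return last_pos;                /* back up here and resume */
    }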

18
Delimiters
  • The longest matching token rule has an impact
    on the REs (and FAs)
  • IF = i f delimiter
  • For some tokens, delimiters are not an issue:
    leftparen, rightparen, comma

19
Comments and Whitespace
  • Not part of tokens
  • Lexer skips over them
  • Function as delimiters: el se (with a space) is
    two tokens, identifier el and identifier se
  • Whitespace
  • Blanks, newlines, tabs
  • Only the first is relevant--can throw away the
    rest in a sequence
  • Comments
  • May want to preserve for printing user source

20
Implementation Options
  • Hand-written lexer
  • Implement a finite state automaton
  • start in some initial state
  • look at each input character in sequence, update
    lexer state accordingly
  • if state at end of input is an accepting state,
    the input string matches the RE
  • Lexer generator
  • generates tokenizer automatically (e.g., flex,
    jlex)
  • Uses RE to NFA to DFA algorithm
  • Generates a table-driven lexer (also an FSA)

21
Hand-written lexer
  • Overall structure

Driver: calls GetNextToken, prints token type
and value
GetNextToken: calls AssembleSimpleToken, changes
IDs to keywords where necessary, returns the next
token in the input stream
AssembleSimpleToken (the FSA): calls GetNextChar
repeatedly, assembles char sequences into valid
tokens, returns a simple token
GetNextChar: returns the next significant
character in the input stream
22
Finite Automata
  • Automaton (DFA) can be represented as
  • transition table
  • graph

23
A regexp matcher (table-driven method)

boolean accept_state[NSTATES];
int trans_table[NSTATES][NCHARS];

int state = 0;
while (state != ERROR_STATE) {
    c = getNextChar();
    if (c < 0) break;
    state = trans_table[state][c];
}
return accept_state[state];
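
As a concrete illustration (not from the slides), the table for
ID = letter (letter | digit)* could be filled in like this; state 0
is the start, state 1 accepts, state 2 is the error state:

    #include <ctype.h>

    #define NSTATES 3
    #define NCHARS  256
    #define ERROR_STATE 2

    int accept_state[NSTATES];
    int trans_table[NSTATES][NCHARS];

    void build_id_table(void) {
        /* default every entry to the error state */
        for (int s = 0; s < NSTATES; s++)
            for (int c = 0; c < NCHARS; c++)
                trans_table[s][c] = ERROR_STATE;
        for (int c = 0; c < NCHARS; c++) {
            if (isalpha(c))
                trans_table[0][c] = 1;  /* letter starts an ID */
            if (isalpha(c) || isdigit(c))
                trans_table[1][c] = 1;  /* letter/digit continues it */
        }
        accept_state[1] = 1;            /* only state 1 is final */
    }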
24
Hand-written lexer: top-level loop (non-table-driven
method)

Token nextToken() {
    if (identifierChar(getNextChar()))
        return readIdentifier();
    else if (numericChar(getNextChar()))
        return readNumber();
    else
        return readSymbol();
}
25
Hand-written lexer: for identifiers

Token readIdentifier() {
    String id = "";
    while (true) {
        char c = getNextChar();
        if (!identifierChar(c))
            return new Token(ID, id);
        id = id + String(c);
    }
}
26
Input Buffering
  • Lexer should be optimized for speed
  • Buffering systems in standard languages (C, etc.)
    are poor
  • Copy from disk to OS buffer, OS buffer to buffer
    in FILE structure, FILE structure to string
  • Solution: buffer input yourself (sketched below)
  • Get two buffers, each the size of a disk block
  • Two pointers keep track of the location in each
  • Load input into the buffers; reload one when done
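
A sketch of the two-buffer scheme, assuming POSIX read() and a
4096-byte block; here buffer ends are marked with a '\0' sentinel
(a real lexer would use a character that cannot appear in source):

    #include <unistd.h>

    #define BLOCK 4096

    static char buf[2][BLOCK + 1];   /* +1 for the sentinel */
    static int  cur;                 /* buffer "forward" points into */
    static char *lexeme_start, *forward;   /* the two pointers */

    static void reload(int fd, int which) {
        ssize_t n = read(fd, buf[which], BLOCK);
        buf[which][n < 0 ? 0 : n] = '\0';  /* sentinel after data */
    }

    void init_input(int fd) {
        cur = 0;
        reload(fd, cur);
        lexeme_start = forward = buf[cur];
    }

    char advance(int fd) {
        if (*forward == '\0') {      /* hit sentinel: swap buffers */
            cur = 1 - cur;
            reload(fd, cur);
            forward = buf[cur];
        }
        return *forward++;
    }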

27
Problems
  • Don't always know what kind of token we are going
    to read from seeing the first character
  • if a token begins with i, is it an identifier?
  • if a token begins with <, is it "less than" or
    "less than or equal"?

28
Look-ahead character
  • Scan text one character at a time
  • Use a look-ahead character (next) to determine
    what kind of token to read and when the current
    token ends

char next;
while (identifierChar(next)) {
    id = id + String(next);
    next = input.read();
}
29
Lookahead and Pushback
  • In many instances you read a char or two beyond
    the end of a token (e.g., when you read a delimiter
    that could be part of the next token)
  • But sometimes you don't
  • Need a way to retain previously seen lookahead
    chars
  • Simple solution: use a stack (see the sketch after
    this list)
  • Push back a char by pushing it on the stack
  • Get the next char from the stack (if not empty),
    else read from input
  • Our lexer needs two-character lookahead for the
    DOUBLEDOT
  • 5..10 vs 5.10
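
A minimal pushback stack in C (names are illustrative; getNextChar
is the buffered reader from the earlier slides); two slots are
enough for the DOUBLEDOT case:

    extern char getNextChar(void);   /* the buffered reader */

    #define MAX_PUSHBACK 2           /* two-char lookahead suffices */

    static char pushback[MAX_PUSHBACK];
    static int  pushback_top;

    void unget_char(char c) {        /* retain a lookahead char */
        pushback[pushback_top++] = c;
    }

    char get_char(void) {            /* stack first, then real input */
        if (pushback_top > 0)
            return pushback[--pushback_top];
        return getNextChar();
    }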

30
Reserved Words
  • When FAs are combined, will not know if a letter
    is the first character of an identifier or a
    reserved word
  • Can use the same FA for identifiers and reserved
    words, and check later which it is (sketched below)

Action: return (ID, lexeme)
Action: if keyword(lexeme) return (type, -)
        else return (ID, lexeme)
The check need not be here -- it can be done during
lexical identification
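
A sketch of the deferred keyword check in C (the table contents and
token names are illustrative):

    #include <string.h>

    enum { ID, IFKEY, WHILEKEY };    /* assumed token types */

    static struct { const char *lexeme; int type; } keywords[] = {
        { "if", IFKEY }, { "while", WHILEKEY }, { 0, 0 }
    };

    /* After the ID automaton accepts, decide keyword vs. ID. */
    int keyword_type(const char *lexeme) {
        for (int i = 0; keywords[i].lexeme != 0; i++)
            if (strcmp(keywords[i].lexeme, lexeme) == 0)
                return keywords[i].type;
        return ID;                   /* not reserved: identifier */
    }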
31
Error Recovery
  • Not too many types of lexical errors
  • Illegal character
  • Ill-formed constant
  • How is it handled?
  • Discard and print a message
  • BUT
  • If a character in the middle of a lexeme is
    wrong, do you discard the char or the whole
    lexeme?
  • Try to correct?

32
Identifier Tokens
  • In the final compiler, the value portion of the
    type-value pair for the identifier token will be
    a pointer to the symbol table entry for that
    identifier
  • For now, send the lexeme as value
  • Implement when symbol table routines are written
  • Token type is an enum
  • Token value is a union (ST pointer or string); see
    the sketch after this list
  • Also an issue with type of constants
  • Only the lexer knows if real or int
  • Kludge for our compiler
  • two token types, INTCONSTANT and REALCONSTANT
  • Parser will treat as one token type CONSTANT
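
One way the type-value pair might look in C (a sketch; SymtabEntry
and the member names are illustrative):

    struct SymtabEntry;              /* defined with the symbol table */

    enum TokenType { ID, INTCONSTANT, REALCONSTANT, IFKEY /* ... */ };

    struct Token {
        enum TokenType type;
        union {
            struct SymtabEntry *entry;  /* later: ST pointer */
            const char *lexeme;         /* for now: send the lexeme */
        } value;
    };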

33
OPTION 2: Lexer Generator
  • Input
  • list of regular expressions describing tokens in
    language, in priority order
  • associated action for each RE (generates
    appropriate kind of token, other bookkeeping)
  • Process
  • Reads patterns
  • Builds finite automaton to accept valid tokens
  • Output
  • C code implementation of an FA that reads an input
    stream and breaks it up into tokens according to
    the REs (or reports a lexical error: "Unexpected
    character")
  • Compile and link the C code, and you've got a
    scanner

34
How does lex build the FA?
  • Programmer writes the regular expression
  • Generates the corresponding NFA-ε
  • Thompson's construction: 5 rules for making an
    NFA-ε for any regular expression
  • Kleene's Theorem proves that any NFA-ε is
    equivalent to some NFA, which is in turn
    equivalent to a DFA
  • So, lex can generate deterministic code
  • Lex matches longest token, then accepts

Automata theory proves you can write regular
expressions, give them to a program like lex,
which will generate a machine to accept exactly
those expressions
35
Lexer generator
  • Regular expression with attached actions
  • -?[1-9][0-9]*  { return new Token(Tokens.IntConst,
    Integer.parseInt(yytext())); }
  • Generates scanning code that decides
  • whether the input is lexically well-formed
  • what is the corresponding token sequence
  • Observation: this process is equivalent to
    deciding whether the input is in the language of
    the regular expression (R1 | ... | Rn)

36
Example Input
  • Regular expressions
  • digits      0|[1-9][0-9]*
  • letter      [A-Za-z]
  • identifier  {letter}({letter}|[0-9_])*
  • whitespace  [\ \t\n\r]+
  • Actions
  • {whitespace}  { /* discard */ }
  • {digits}      { return new IntegerConstant(
                      Integer.parseInt(yytext())); }
  • if            { return new IfToken(); }
  • while         { return new WhileToken(); }
  • {identifier}  { return new IdentifierToken(
                      yytext()); }
37
Three parts to Lex
  • Declarations
  • Regular expression definitions of tokens
  • /* This is a sample Lex program written by.... */
  • digit --> [0-9]
  • number --> {digit}+
  • Transition Rules
  • Regular Expression   Action when matched
  • {number}  printf("The number is %s\n", yytext);
  • junk      printf("Junk is not a valid input!\n");
  • quit      return 0;
  • Auxiliary Procedures
  • Written into the C program
  • int main() is required
  • %% separates the three parts

38
Example
  • delim    [ \t\n]
  • ws       {delim}+
  • letter   [A-Za-z]
  • digit    [0-9]
  • id       {letter}({letter}|{digit})*
  • number   {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?
  • {ws}      { /* no action and no return */ }
  • if        { return(IF); }
  • then      { return(THEN); }
  • else      { return(ELSE); }
  • {id}      { yylval = install_id(); return(ID); }
  • {number}  { yylval = install_num(); return(NUMBER); }
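
Putting the three parts together, a complete minimal .l file might
look like this (a sketch built from the fragments above; the
actions are plain C):

    %{
    /* C declarations copied verbatim into the generated scanner */
    #include <stdio.h>
    %}

    digit   [0-9]
    number  {digit}+

    %%

    {number}   { printf("The number is %s\n", yytext); }
    [ \t\n]    { /* discard whitespace */ }
    .          { printf("Junk is not a valid input!\n"); }

    %%

    int main(void) {
        yylex();                 /* run the scanner on stdin */
        return 0;
    }

    int yywrap(void) { return 1; }   /* no further input files */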

39
  • Available variables
  • yylval
  • yytext (null-terminated string holding the lexeme)
  • yyleng (length of the matching string)
  • yyin: the file handle
  • yyin = fopen(argv[0], "r");
  • Available functions
  • yylex() (the primary function generated)
  • input() - returns the next character from the
    input
  • unput(c) - puts a character back onto the input
  • int main(int argc, char **argv)
  • Calls yylex() to perform the lexical analysis
40
Context Checking
  • Lex allows context-dependent REs
  • r/x: the regular expression r will be matched
    only if it is followed by an occurrence of
    regular expression x
  • Makes it easy to deal with our ADDOP vs.
    UNARYPLUS problem

41
How to Use Lex (flex)
  • Run Unix man flex for full information
  • Write regular expressions and actions
  • Compile using the Lex (flex) tool
  • flex <prog_name>.l
  • Results in C code
  • Compile using a C compiler
  • Link to the lex library
  • gcc lex.yy.c -ll
  • Run the a.out file and recognize tokens
  • a.out < input.text

42
Lexer generators
  • The power
  • Programmer describes tokens as regular
    expressions
  • Lex turns description of tokens into code
  • Generated code compiles into a scanner
  • The pitfalls
  • Source code generated by lex is hard to debug
  • Without understanding basis in formal languages,
    lex can be a quirky black box

43
Comparison of Methods
  • Hand-coded scanner
  • Programmer creates types, defines data
    procedures, designs flow of control, implements
    in source language.
  • Lex-generated scanner
  • Programmer writes patterns
  • (Declarative, not procedural)
  • Lex/flex implements flow of control
  • Much less hand-coding, but
  • code looks pretty alien, tricky to debug

44
Summary
  • Lexical analyzer converts a text stream to tokens
  • For most languages, legal tokens conveniently,
    precisely defined using regular expressions
  • Two ways to write lexer
  • Hand code
  • Use a lexer generator to generate lexer code
    automatically from token REs and precedences

45
APPENDIX
46
Regular Expression Notation
  • a      an ordinary character stands for itself
  • ε      the empty string
  • R|S    any string from either L(R) or L(S)
  • RS     a string from L(R) followed by one
           from L(S)
  • R*     zero or more strings from L(R),
           concatenated
  •        ε | R | RR | RRR | ...

47
Convenient RE Shorthand
  • R+       one or more strings from L(R): R(R*)
  • R?       optional R: (R | ε)
  • [abce]   one of the listed characters:
             (a|b|c|e)
  • [a-z]    one character from this range:
             (a|b|c|d|e|...)
  • [^ab]    anything but one of the listed chars
  • [^a-z]   one character not from this range

48
Examples
  • Regular Expression    Strings in L(R)
  • a                     a
  • ab                    ab
  • a | b                 a  b
  • (ab)*                 ε  ab  abab  ...
  • (a | ε) b             ab  b

49
More Examples
  • Regular Expression                 Strings in L(R)
  • digit = [0-9]                      0 1 2 3 ...
  • posint = digit+                    8 412 ...
  • int = -? posint                    -42 1024 ...
  • real = int (ε | (. posint))        -1.56 12 1.0
  •      = -?[0-9]+(ε | (. [0-9]+))
  • [a-zA-Z_][a-zA-Z0-9_]*             C identifiers