Chapter 2 Lexical Analysis - PowerPoint PPT Presentation

Slides: 68
Provided by: Naiwei
1
Chapter 2 Lexical Analysis
  • Nai-Wei Lin

2
Lexical Analysis
  • Lexical analysis recognizes the vocabulary of the
    programming language and transforms a string of
    characters into a string of words or tokens
  • Lexical analysis discards white spaces and
    comments between the tokens
  • Lexical analyzer (or scanner) is the program that
    performs lexical analysis

3
Outline
  • Scanners
  • Tokens
  • Regular expressions
  • Finite automata
  • Automatic conversion from regular expressions to
    finite automata
  • Flex - a scanner generator

4
Scanners
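[Figure: the scanner reads characters and returns a token to the
parser each time the parser asks for the next token. Scanner and
parser are both connected to the symbol table.]
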
5
Tokens
  • A token is a sequence of characters that can be
    treated as a unit in the grammar of a programming
    language
  • A programming language classifies tokens into a
    finite set of token types

        Type     Examples
        ID       foo  i  n
        NUM      73  13
        IF       if
        COMMA    ,

6
Semantic Values of Tokens
  • Semantic values are used to distinguish different
    tokens in a token type
  • <ID, foo>, <ID, i>, <ID, n>
  • <NUM, 73>, <NUM, 13>
  • <IF, >
  • <COMMA, >
  • Token types affect syntax analysis and semantic
    values affect semantic analysis
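
The (token type, semantic value) pairs above could be represented as follows. This is only a sketch: the `Token` class and its field names are assumptions made for illustration and do not come from the slides.

```python
from typing import NamedTuple, Optional, Union

class Token(NamedTuple):
    """A token: its type plus an optional semantic value (hypothetical layout)."""
    type: str
    value: Optional[Union[str, int]] = None

tokens = [
    Token("ID", "foo"),
    Token("NUM", 73),
    Token("IF"),        # keywords carry no semantic value
    Token("COMMA"),
]

# Syntax analysis looks only at the token types; the values matter later.
token_types = [t.type for t in tokens]
```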

7
Scanner Generators
[Figure: a scanner definition written in a metalanguage is input to
the scanner generator, which produces a scanner. The scanner reads a
program in the programming language and outputs token types and
semantic values.]

8
Languages
  • A language is a set of strings
  • A string is a finite sequence of symbols taken
    from a finite alphabet
  • The C language is the (infinite) set of all
    strings that constitute legal C programs
  • The language of C reserved words is the (finite)
    set of all alphabetic strings that cannot be used
    as identifiers in C programs
  • Each token type is a language

9
Regular Expressions (RE)
  • A metalanguage allows us to use a finite description
    to specify a (possibly infinite) set
  • RE is the metalanguage used to define the token
    types of a programming language

10
Regular Expressions
  • ε is a RE denoting L = {ε}
  • If a ∈ alphabet, then a is a RE denoting L = {a}
  • Suppose r and s are REs denoting L(r) and L(s)
  • alternation: (r) | (s) is a RE denoting L(r) ∪
    L(s)
  • concatenation: (r) (s) is a RE denoting
    L(r)L(s)
  • repetition: (r)* is a RE denoting (L(r))*
  • (r) is a RE denoting L(r)

11
Examples
  • a | b denotes {a, b}
  • (a | b)(a | b) denotes {aa, ab, ba, bb}
  • a* denotes {ε, a, aa, aaa, ...}
  • (a | b)* denotes the set of all strings of a's and b's
  • a | a*b denotes the set containing the string a and
    all strings consisting of zero or more a's
    followed by a b
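
These examples can be checked mechanically. The sketch below uses Python's `re` module, whose syntax for alternation, concatenation, and repetition coincides with the notation on these slides. The member and non-member strings are chosen here for illustration.

```python
import re

# Each pattern is paired with strings inside and outside its language.
examples = {
    r"a|b":        (["a", "b"], ["ab", ""]),
    r"(a|b)(a|b)": (["aa", "ab", "ba", "bb"], ["a", "aba"]),
    r"a*":         (["", "a", "aaa"], ["b"]),
    r"(a|b)*":     (["", "abba"], ["abc"]),
    r"a|a*b":      (["a", "b", "aaab"], ["aa", "ba"]),
}

def in_language(pattern: str, s: str) -> bool:
    """True if the whole string s belongs to the language of pattern."""
    return re.fullmatch(pattern, s) is not None

for pattern, (members, non_members) in examples.items():
    assert all(in_language(pattern, s) for s in members)
    assert not any(in_language(pattern, s) for s in non_members)
```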

12
Regular Definitions
  • Names for regular expressions

        d1 → r1
        d2 → r2
        ...
        dn → rn

    where each ri is a RE over alphabet ∪ {d1, d2, ..., di-1}
  • Examples

        letter → A | B | ... | Z | a | b | ... | z
        digit → 0 | 1 | ... | 9
        identifier → letter ( letter | digit )*

13
Notational Abbreviations
  • One or more instances: (r)+ denoting (L(r))+
        r* = r+ | ε
        r+ = r r*
  • Zero or one instance: r? = r | ε
  • Character classes
        [abc] = a | b | c
        [a-z] = a | b | ... | z
        [^abc] = any character except a, b, or c
  • Any character except newline: .

14
Examples
  • if   { return IF; }
  • [a-z][a-z0-9]*   { return ID; }
  • [0-9]+   { return NUM; }
  • ([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)   { return REAL; }
  • ("--"[a-z]*"\n")|(" "|"\n"|"\t")+   { /* do
    nothing for white spaces and comments */ }
  • .   { error(); }

15
Completeness of REs
  • A lexical specification should be complete:
    it always matches some initial substring
    of the input

        .   { /* match any */ }

16
Disambiguation of REs (1)
  • Longest match disambiguation rule: the longest
    initial substring of the input that can match any
    regular expression is taken as the next token

        ([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)   /* REAL */

    The input 0.9 is matched as a single REAL token.

17
Disambiguation of REs (2)
  • Rule priority disambiguation rule: for a
    particular longest initial substring, the first
    regular expression that can match determines its
    token type

        if               /* IF */
        [a-z][a-z0-9]*   /* ID */

    The input if matches both rules and is classified as IF.
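
The two disambiguation rules can be sketched together with Python's `re` module. This is an illustration only: the rule list mirrors the example patterns from these slides, and `next_token` is a hypothetical helper, not how a generated scanner works internally.

```python
import re

# Rules in priority order (earlier rules win ties).
RULES = [
    ("IF",   re.compile(r"if")),
    ("ID",   re.compile(r"[a-z][a-z0-9]*")),
    ("NUM",  re.compile(r"[0-9]+")),
    ("REAL", re.compile(r"([0-9]+\.[0-9]*)|([0-9]*\.[0-9]+)")),
]

def next_token(text: str, pos: int):
    """Return (type, lexeme) for the best rule at pos, or None if nothing matches."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # Longest match: only a strictly longer lexeme replaces the current best,
        # so on a tie the earlier (higher-priority) rule is kept.
        if m and m.end() > pos and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best
```

With these rules, `if` is classified as IF by rule priority, while `iffy` becomes an ID and `0.9` a REAL because of the longest-match rule.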

18
Finite Automata
  • A finite automaton is a finite-state transition
    diagram that can be used to model the recognition
    of a token type specified by a regular expression
  • A finite automaton can be a nondeterministic
    finite automaton or a deterministic finite
    automaton

19
Nondeterministic Finite Automata (NFA)
  • An NFA consists of
  • A finite set of states
  • A finite set of input symbols
  • A transition function that maps (state, symbol)
    pairs to sets of states
  • A state distinguished as start state
  • A set of states distinguished as final states

20
An Example
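  • RE: (a | b)*abb
  • States: {1, 2, 3, 4}
  • Input symbols: {a, b}
  • Transition function: (1,a) = {1,2}, (1,b) = {1},
    (2,b) = {3}, (3,b) = {4}
  • Start state: 1
  • Final state: {4}

[Figure: transition diagram with a start arrow into state 1, a loop
on state 1 labeled a,b, and edges 1 -a-> 2, 2 -b-> 3, 3 -b-> 4.]
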
21
Acceptance of NFA
  • An NFA accepts an input string s iff there is
    some path in the finite-state transition diagram
    from the start state to some final state such
    that the edge labels along this path spell out s
  • The language recognized by an NFA is the set of
    strings it accepts
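
The acceptance condition can be sketched as a search for a path. The transition table below is the NFA for (a | b)*abb from the earlier example. The dictionary encoding and the function name `accepts` are choices made for this illustration.

```python
# NFA for (a | b)*abb: (state, symbol) -> set of next states.
DELTA = {
    (1, "a"): {1, 2},
    (1, "b"): {1},
    (2, "b"): {3},
    (3, "b"): {4},
}
START, FINALS = 1, {4}

def accepts(s: str, state: int = START) -> bool:
    """True if some path from state spells out s and ends in a final state."""
    if not s:
        return state in FINALS
    # Try every nondeterministic choice for the first symbol.
    return any(accepts(s[1:], nxt) for nxt in DELTA.get((state, s[0]), set()))
```

This direct search can take exponential time in the worst case, which is why the slides go on to simulate an NFA by tracking sets of states instead.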

22
An Example
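  • RE: (a | b)*abb
  • Input: aabb

[Figure: transition diagram of the NFA (states 1-4) tracing the
input aabb along the path 1 -a-> 1 -a-> 2 -b-> 3 -b-> 4, which ends
in final state 4.]
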
23
An Example
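  • RE: (a | b)*abb
  • Input: aaba

[Figure: transition diagram of the NFA (states 1-4) for the input
aaba. No path spelling out aaba ends in final state 4.]
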
24
Another Example
  • RE: aa* | bb*
  • States: {1, 2, 3, 4, 5}
  • Input symbols: {a, b}
  • Transition function: (1, ε) = {2, 4}, (2, a) = {3},
    (3, a) = {3}, (4, b) = {5}, (5, b) = {5}
  • Start state: 1
  • Final states: {3, 5}

25
Finite-State Transition Diagram
aa* | bb*

[Figure: transition diagram with edges 1 -ε-> 2, 1 -ε-> 4,
2 -a-> 3, 3 -a-> 3, 4 -b-> 5, 5 -b-> 5. Example input: aaa.]

26
Operations on NFA states
  • ε-closure(s): set of states reachable from a
    state s on ε-transitions alone
  • ε-closure(S): set of states reachable from some
    state s in S on ε-transitions alone
  • move(s, c): set of states to which there is a
    transition on input symbol c from a state s
  • move(S, c): set of states to which there is a
    transition on input symbol c from some state s in
    S

27
An Example
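RE: aa* | bb*, input: aaa

    S0 = {1}
    S1 = ε-closure({1}) = {1,2,4}
    S2 = move({1,2,4}, a) = {3}
    S3 = ε-closure({3}) = {3}
    S4 = move({3}, a) = {3}
    S5 = ε-closure({3}) = {3}
    S6 = move({3}, a) = {3}
    S7 = ε-closure({3}) = {3}
    3 is in {3, 5} ⇒ accept

[Figure: transition diagram of the NFA for aa* | bb* (states 1-5).]
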
28
Simulating an NFA
Input: An input string ended with eof and an NFA
with start state s0 and final states F.
Output: The answer "yes" if the NFA accepts the string, "no"
otherwise.

    begin
      S := ε-closure({s0});
      c := nextchar;
      while c <> eof do
      begin
        S := ε-closure(move(S, c));
        c := nextchar
      end;
      if S ∩ F <> ∅ then return "yes"
      else return "no"
    end.

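
A runnable sketch of this simulation, applied to the NFA for aa* | bb* from the earlier example. Encoding ε as the empty string and iterating over a Python string instead of calling nextchar until eof are choices made for this illustration.

```python
EPS = ""  # label used here for ε-transitions (an encoding choice)

# NFA for aa* | bb*: (state, symbol) -> set of next states.
DELTA = {
    (1, EPS): {2, 4},
    (2, "a"): {3},
    (3, "a"): {3},
    (4, "b"): {5},
    (5, "b"): {5},
}
START, FINALS = 1, {3, 5}

def eps_closure(states):
    """All states reachable from `states` using ε-transitions alone."""
    closure, stack = set(states), list(states)
    while stack:
        t = stack.pop()
        for u in DELTA.get((t, EPS), set()) - closure:
            closure.add(u)
            stack.append(u)
    return closure

def move(states, c):
    """All states reachable from some state in `states` on input symbol c."""
    return set().union(*(DELTA.get((s, c), set()) for s in states))

def simulate(text: str) -> bool:
    current = eps_closure({START})
    for c in text:                      # plays the role of nextchar / eof
        current = eps_closure(move(current, c))
    return bool(current & FINALS)       # accept iff S ∩ F is non-empty
```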
29
Computation of ?-closure
(a | b)*abb

[Figure: NFA for (a | b)*abb with states 1-11, start state 1.]

    ε-closure({1}) = {1,2,3,5,8}
    ε-closure({4}) = {2,3,4,5,7,8}

30
Computation of ?-closure
Input: An NFA and a set of NFA states S.
Output: T = ε-closure(S).

    begin
      push all states in S onto stack;
      T := S;
      while stack is not empty do
      begin
        pop t, the top element, off of stack;
        for each state u with an edge from t to u labeled ε do
          if u is not in T then
          begin
            add u to T;
            push u onto stack
          end
      end;
      return T
    end.

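
A minimal Python sketch of the stack-based algorithm. The ε-edges below are an assumption: the diagram of the (a | b)*abb NFA did not survive in this transcript, so the edges were reconstructed to be consistent with the closures listed in the slides.

```python
# ε-edges of the (a | b)*abb NFA (reconstructed, see the note above).
EPS_EDGES = {
    1: {2, 8},
    2: {3, 5},
    4: {7},
    6: {7},
    7: {2, 8},
}

def eps_closure(states):
    """Stack-based ε-closure of a set of NFA states."""
    closure = set(states)        # T := S
    stack = list(states)         # push all states in S onto the stack
    while stack:
        t = stack.pop()
        for u in EPS_EDGES.get(t, set()):
            if u not in closure:
                closure.add(u)
                stack.append(u)
    return closure
```

Each state is pushed at most once, so the loop terminates even though the ε-edges contain a cycle (7 → 2 → 3 → 4 → 7 is closed by a non-ε edge, and 7 → 2 → ... → 7 by ε-edges via states 4 and 6 only after an input symbol).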
31
Deterministic Finite Automata (DFA)
  • A DFA is a special case of an NFA in which
  • no state has an ε-transition
  • for each state s and input symbol a, there is at
    most one edge labeled a leaving s

32
An Example
  • RE: (a | b)*abb
  • States: {1, 2, 3, 4}
  • Input symbols: {a, b}
  • Transition function: (1,a) = 2, (2,a) = 2, (3,a) = 2,
    (4,a) = 2, (1,b) = 1, (2,b) = 3, (3,b) = 4,
    (4,b) = 1
  • Start state: 1
  • Final state: {4}

33
Finite-State Transition Diagram
34
Acceptance of DFA
  • A DFA accepts an input string s iff there is one
    path in the finite-state transition diagram from
    the start state to some final state such that the
    edge labels along this path spell out s
  • The language recognized by a DFA is the set of
    strings it accepts

35
An Example
  • RE: (a | b)*abb
  • Input: aabb
36
An Example
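  • RE: (a | b)*abb
  • Input: aaba

[Figure: transition diagram of the DFA (states 1-4). The input aaba
follows 1 -a-> 2 -a-> 2 -b-> 3 -a-> 2 and ends in state 2, which is
not a final state.]
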
37
An Example
Input: bbababb

    s = 1
    s = move(1, b) = 1
    s = move(1, b) = 1
    s = move(1, a) = 2
    s = move(2, b) = 3
    s = move(3, a) = 2
    s = move(2, b) = 3
    s = move(3, b) = 4
    4 is in {4} ⇒ accept

38
Simulating a DFA
Input: An input string ended with eof and a DFA
with start state s0 and final states F.
Output: The answer "yes" if the DFA accepts the string, "no"
otherwise.

    begin
      s := s0;
      c := nextchar;
      while c <> eof do
      begin
        s := move(s, c);
        c := nextchar
      end;
      if s is in F then return "yes"
      else return "no"
    end.

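
A short Python sketch of the same loop, using the DFA for (a | b)*abb from the earlier example. Treating a missing table entry as a dead state is an assumption added here. The pseudocode above does not say what happens for symbols outside the alphabet.

```python
# DFA for (a | b)*abb: (state, symbol) -> next state.
DTRAN = {
    (1, "a"): 2, (2, "a"): 2, (3, "a"): 2, (4, "a"): 2,
    (1, "b"): 1, (2, "b"): 3, (3, "b"): 4, (4, "b"): 1,
}
START, FINALS = 1, {4}

def dfa_accepts(text: str) -> bool:
    state = START
    for c in text:
        if (state, c) not in DTRAN:
            return False     # no transition: treated as a dead state
        state = DTRAN[(state, c)]
    return state in FINALS
```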
39
Combined Finite Automata
[Figure: three separate automata, one per token type.]

    IF:    if                                    (states 1-3)
    ID:    [a-z][a-z0-9]*                        (states 1-2)
    REAL:  ([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)   (states 1-5)

40
Combined Finite Automata
[Figure: the three automata combined into one NFA. A new start
state 1 has ε-transitions to the IF automaton (states 2-4), the ID
automaton (states 5-6), and the REAL automaton (states 7-11).]

41
Combined Finite Automata
[Figure: the equivalent DFA with states 1-8. Its final states are
labeled with the token types IF, ID, and REAL.]

42
Recognizing the Longest Match
  • The automaton must keep track of the longest
    match seen so far and the position of that match
    until a dead state is reached
  • Use two variables Last-Final (the state number of
    the most recent final state encountered) and
    Input-Position-at-Last-Final to remember the last
    time the automaton was in a final state
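
The bookkeeping with Last-Final and Input-Position-at-Last-Final can be sketched as follows. The automaton here is a cut-down assumption that models only the IF and ID parts of the combined DFA (states 1-4). The function names are hypothetical.

```python
import string

LETTERS, DIGITS = set(string.ascii_lowercase), set(string.digits)

# Final states and their token types (IF/ID part of the combined DFA only).
FINAL = {2: "ID", 3: "IF", 4: "ID"}

def step(state: int, c: str):
    """Transition function; returns None for the dead state."""
    if state == 1 and c == "i":
        return 2
    if state == 1 and c in LETTERS:
        return 4
    if state == 2 and c == "f":
        return 3
    if state in (2, 3, 4) and c in LETTERS | DIGITS:
        return 4
    return None

def longest_match(text: str, start: int = 0):
    """Return (token type, end position) of the longest match from start, or None."""
    state, last_final, pos_at_last_final = 1, None, start
    pos = start
    while pos < len(text):
        state = step(state, text[pos])
        if state is None:            # dead state: stop scanning
            break
        pos += 1
        if state in FINAL:           # remember the most recent final state
            last_final, pos_at_last_final = state, pos
    if last_final is None:
        return None
    return FINAL[last_final], pos_at_last_final
```

The scanner would then emit the token for Last-Final and resume scanning at the remembered input position.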

43
An Example
Input: iffail
(S = state, C = current character, L = Last-Final,
P = Input-Position-at-Last-Final)

    S   C   L   P
    1       0   0
    2   i   2   1
    3   f   3   2
    4   f   4   3
    4   a   4   4
    4   i   4   5
    4   l   4   6
        ?

[Figure: the combined DFA for IF, ID, and REAL (states 1-8).]

44
Scanner Generators
45
Flex: A Scanner Generator
A language for specifying scanners

    lang.l       →  Flex compiler        →  lex.yy.c
    lex.yy.c     →  C compiler (-lfl)    →  a.out
    source code  →  a.out                →  tokens

46
Flex Programs
    %{
    auxiliary declarations
    %}
    regular definitions
    %%
    translation rules
    %%
    auxiliary procedures

47
Translation Rules
    P1   action1
    P2   action2
    ...
    Pn   actionn

where Pi are regular expressions and actioni are
C program segments
48
Example 1
    %%
    username    printf( "%s", getlogin() );

By default, any text not matched by a flex
scanner is copied to the output. This scanner
copies its input file to its output with each
occurrence of username being replaced with the
user's login name.
49
Example 2
    %{
    int lines = 0, chars = 0;
    %}
    %%
    \n      { ++lines; ++chars; }
    .       { ++chars; }   /* all characters except \n */
    %%
    int main(void)
    {
        yylex();
        printf("lines = %d, chars = %d\n", lines, chars);
        return 0;
    }

50
Example 3
    %{
    #define EOF 0
    #define LE 25
    ...
    %}
    delim    [ \t\n]
    ws       {delim}+
    letter   [A-Za-z]
    digit    [0-9]
    id       {letter}({letter}|{digit})*
    number   {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?
    %%

51
Example 3
    {ws}       { /* no action and no return */ }
    if         { return (IF); }
    else       { return (ELSE); }
    {id}       { yylval = install_id(); return (ID); }
    {number}   { yylval = install_num(); return (NUMBER); }
    "<="       { yylval = LE; return (RELOP); }
    "=="       { yylval = EQ; return (RELOP); }
    ...
    <<EOF>>    { return (EOF); }
    %%
    install_id() { ... }
    install_num() { ... }

52
Functions and Variables
  • yylex(): a function implementing the lexical
    analyzer and returning the token matched
  • yytext: a global pointer variable pointing to
    the lexeme matched
  • yyleng: a global variable giving the length of
    the lexeme matched
  • yylval: an external global variable storing the
    attribute of the token

53
NFA from Flex Programs
P1 | P2 | ... | Pn
54
Rules
  • Look for the longest lexeme
  • number
  • Look for the first-listed pattern that
    matches the longest lexeme
  • keywords and identifiers
  • List frequently occurring patterns first
  • white space

55
Rules
  • View keywords as exceptions to the rule of
    identifiers
  • construct a keyword table
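
One way to read this rule: match every keyword with the identifier pattern first, then consult a table. The sketch below assumes a two-entry keyword table and the lowercase identifier pattern used earlier in these slides. `classify` is a hypothetical helper.

```python
import re

KEYWORDS = {"if": "IF", "else": "ELSE"}   # illustrative keyword table
IDENT = re.compile(r"[a-z][a-z0-9]*")

def classify(lexeme: str) -> str:
    """Token type for a lexeme already matched by the identifier pattern."""
    if not IDENT.fullmatch(lexeme):
        raise ValueError(f"not an identifier-shaped lexeme: {lexeme!r}")
    # Keywords are exceptions; anything else stays an ordinary identifier.
    return KEYWORDS.get(lexeme, "ID")
```

This keeps the automaton small, since keywords no longer need their own states.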

56
Rules
  • Start condition: <s>r matches r only in start
    condition s
  • Start conditions are declared in the first
    section using either %s or %x

        %s str

  • A start condition is activated using the BEGIN
    action

        \"            BEGIN(str);
        <str>[^"]*    { /* eat up string body */ }

  • The default start condition is INITIAL

        <str>\"       BEGIN(INITIAL);

57
Lexical Error Recovery
  • Error: no pattern matches a prefix of the
    remaining input
  • Panic mode error recovery
  • delete successive characters from the remaining
    input until the pattern-matching can continue
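
A minimal sketch of panic-mode recovery. The token patterns below are hypothetical and kept small for illustration. The scanner deletes one character at a time and records it, until some pattern matches again.

```python
import re

# Hypothetical token patterns; a real scanner would have more.
TOKEN = re.compile(r"(?P<ID>[a-z][a-z0-9]*)|(?P<NUM>[0-9]+)|(?P<WS>\s+)")

def scan(text: str):
    """Tokenize text, deleting characters that no pattern can start with."""
    tokens, errors, pos = [], [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            errors.append((pos, text[pos]))  # report and delete one character
            pos += 1                         # then retry at the next position
            continue
        if m.lastgroup != "WS":              # white space produces no token
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens, errors
```

For the input `x1 @# 42`, the two characters `@` and `#` are skipped and reported, and scanning continues with the number.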

58
Maintaining Line Number
  • Flex can maintain the number of the current
    line in the global variable yylineno when the
    following option is given in the first section

        %option yylineno

59
From a RE to an NFA
  • Thompson's construction algorithm
  • For ε, construct: start → i -ε-> f
  • For a in alphabet, construct: start → i -a-> f

60
From a RE to an NFA
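  • Suppose N(s) and N(t) are NFAs for REs s and t
  • for s | t, construct
    [Figure: a new start state i has ε-edges to the start states
    is of N(s) and it of N(t). Their final states fs and ft have
    ε-edges to a new final state f.]
  • for s t, construct
    [Figure: N(s) followed by N(t). The start state of N(s) is the
    overall start state i, and the final state fs of N(s) is merged
    with the start state it of N(t).]
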
61
From a RE to an NFA
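  • for s*, construct
    [Figure: a new start state i and a new final state f around
    N(s), with ε-edges from i to is, from fs to f, from i to f, and
    from fs back to is.]
  • for (s), use N(s)
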
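
The construction rules can be sketched as small combinator functions. This is an illustration under stated assumptions: a fragment is a (start, final, transitions) triple, ε is encoded as the empty string, and concatenation links the two fragments with an ε-edge rather than merging states as on the slide. The `accepts` helper is included only to check the result.

```python
from itertools import count

EPS = ""                      # encoding chosen here for ε
_ids = count(1)               # fresh state numbers

def symbol(a):
    """NFA fragment for a single symbol a (or ε when a == EPS)."""
    i, f = next(_ids), next(_ids)
    return i, f, {(i, a): {f}}

def _merge(*tables):
    merged = {}
    for table in tables:
        for key, targets in table.items():
            merged.setdefault(key, set()).update(targets)
    return merged

def alt(s, t):
    """Fragment for s | t: new start and final states joined by ε-edges."""
    i, f = next(_ids), next(_ids)
    (si, sf, sd), (ti, tf, td) = s, t
    return i, f, _merge(sd, td, {(i, EPS): {si, ti}},
                        {(sf, EPS): {f}}, {(tf, EPS): {f}})

def concat(s, t):
    """Fragment for s t (linked by an ε-edge instead of merging states)."""
    (si, sf, sd), (ti, tf, td) = s, t
    return si, tf, _merge(sd, td, {(sf, EPS): {ti}})

def star(s):
    """Fragment for s*: allow skipping N(s) or repeating it."""
    i, f = next(_ids), next(_ids)
    si, sf, sd = s
    return i, f, _merge(sd, {(i, EPS): {si, f}}, {(sf, EPS): {si, f}})

def accepts(fragment, text):
    """Set-based simulation of the constructed NFA."""
    start, final, delta = fragment
    def closure(states):
        seen, stack = set(states), list(states)
        while stack:
            for u in delta.get((stack.pop(), EPS), set()) - seen:
                seen.add(u)
                stack.append(u)
        return seen
    current = closure({start})
    for c in text:
        current = closure({u for s in current for u in delta.get((s, c), set())})
    return final in current

# (a | b)*abb
nfa = concat(concat(concat(star(alt(symbol("a"), symbol("b"))),
                           symbol("a")), symbol("b")), symbol("b"))
```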
62
An Example
(a | b)*abb
63
From an NFA to a DFA
Subset construction algorithm.
Input: An NFA N.
Output: A DFA D with states Dstates and
transition table Dtran.

    begin
      add ε-closure({s0}) as an unmarked state to Dstates;
      while there is an unmarked state T in Dstates do
      begin
        mark T;
        for each input symbol a do
        begin
          U := ε-closure(move(T, a));
          if U is not in Dstates then
            add U as an unmarked state to Dstates;
          Dtran[T, a] := U
        end
      end
    end.

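
A compact Python sketch of the subset construction for (a | b)*abb. The NFA edges are an assumption: they were reconstructed from the closures and moves in the worked example, since the diagram is not in the transcript. Skipping empty target sets is also a choice made here. The pseudocode above does not treat that case separately.

```python
EPS = ""  # encoding chosen here for ε

# NFA for (a | b)*abb with states 1-11 (reconstructed, see the note above).
NFA = {
    (1, EPS): {2, 8}, (2, EPS): {3, 5}, (3, "a"): {4}, (5, "b"): {6},
    (4, EPS): {7}, (6, EPS): {7}, (7, EPS): {2, 8},
    (8, "a"): {9}, (9, "b"): {10}, (10, "b"): {11},
}

def eps_closure(states):
    closure, stack = set(states), list(states)
    while stack:
        t = stack.pop()
        for u in NFA.get((t, EPS), set()) - closure:
            closure.add(u)
            stack.append(u)
    return frozenset(closure)

def move(states, a):
    return {u for s in states for u in NFA.get((s, a), set())}

def subset_construction(start, alphabet):
    """Return (Dstates in discovery order, Dtran) for the NFA above."""
    start_set = eps_closure({start})
    dstates, unmarked, dtran = [start_set], [start_set], {}
    while unmarked:
        T = unmarked.pop(0)                 # mark T
        for a in alphabet:
            U = eps_closure(move(T, a))
            if not U:
                continue                    # no transition on a from T
            if U not in dstates:
                dstates.append(U)
                unmarked.append(U)
            dtran[(T, a)] = U
    return dstates, dtran

dstates, dtran = subset_construction(1, "ab")
# Name the DFA states A, B, C, ... in the order they were discovered.
names = {S: chr(ord("A") + i) for i, S in enumerate(dstates)}
table = {(names[T], a): names[U] for (T, a), U in dtran.items()}
```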
64
An Example
(a | b)*abb

[Figure: NFA for (a | b)*abb with states 1-11, start state 1.]

65
An Example
    ε-closure({1}) = {1,2,3,5,8} = A
    ε-closure(move(A, a)) = ε-closure({4,9}) = {2,3,4,5,7,8,9} = B
    ε-closure(move(A, b)) = ε-closure({6}) = {2,3,5,6,7,8} = C
    ε-closure(move(B, a)) = ε-closure({4,9}) = B
    ε-closure(move(B, b)) = ε-closure({6,10}) = {2,3,5,6,7,8,10} = D
    ε-closure(move(C, a)) = ε-closure({4,9}) = B
    ε-closure(move(C, b)) = ε-closure({6}) = C
    ε-closure(move(D, a)) = ε-closure({4,9}) = B
    ε-closure(move(D, b)) = ε-closure({6,11}) = {2,3,5,6,7,8,11} = E
    ε-closure(move(E, a)) = ε-closure({4,9}) = B
    ε-closure(move(E, b)) = ε-closure({6}) = C

66
An Example
                              Input Symbol
    State                     a     b
    A = {1,2,3,5,8}           B     C
    B = {2,3,4,5,7,8,9}       B     D
    C = {2,3,5,6,7,8}         B     C
    D = {2,3,5,6,7,8,10}      B     E
    E = {2,3,5,6,7,8,11}      B     C

67
An Example
[Figure: transition diagram of the resulting DFA, with a start
arrow into its start state.]