Lexical Analysis

1
Lexical Analysis
  • Leonidas Fegaras

2
Lexical Analysis
  • A scanner groups input characters into tokens
  • input: x = x*(acc+123)
  • token         value
  • identifier    x
  • equal         =
  • identifier    x
  • star          *
  • left-paren    (
  • identifier    acc
  • plus          +
  • integer       123
  • right-paren   )
  • Tokens are typically represented by numbers

3
Communication with the Parser
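[Figure: the parser sends "get token" requests to the scanner, the scanner sends "get next character" requests to the source file and returns a token, and the parser produces the AST]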
  • Each time the parser needs a token, it sends a
    request to the scanner
  • the scanner reads as many characters from the
    input stream as necessary to construct a single
    token
  • when a single token is formed, the scanner is
    suspended and returns the token to the parser
  • the parser will repeatedly call the scanner to
    read all the tokens from the input stream
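
The request-per-token protocol described above can be sketched in Java. This is only an illustration, not the course's scanner: the class name PullScanner, the method names nextToken and readAll, and the "kind:lexeme" string format are all assumptions, and the sketch recognizes only identifiers, integers, and one-character symbols.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical pull-style scanner: the parser calls nextToken() whenever it
// needs a token, and the scanner reads only as many characters as it must.
final class PullScanner {
    private final String input;
    private int pos = 0;              // index of the next unread character

    PullScanner(String input) { this.input = input; }

    // Returns the next token as "kind:lexeme", or null at end of input.
    String nextToken() {
        while (pos < input.length() && Character.isWhitespace(input.charAt(pos)))
            pos++;                    // skip whitespace between tokens
        if (pos >= input.length())
            return null;
        int start = pos;
        char c = input.charAt(pos);
        if (Character.isLetter(c)) {
            while (pos < input.length() && Character.isLetterOrDigit(input.charAt(pos)))
                pos++;
            return "identifier:" + input.substring(start, pos);
        }
        if (Character.isDigit(c)) {
            while (pos < input.length() && Character.isDigit(input.charAt(pos)))
                pos++;
            return "integer:" + input.substring(start, pos);
        }
        pos++;                        // any other character is a one-character token
        return "symbol:" + c;
    }

    // Stand-in for the parser: it repeatedly asks the scanner for one token.
    static List<String> readAll(String source) {
        PullScanner scanner = new PullScanner(source);
        List<String> tokens = new ArrayList<String>();
        for (String t = scanner.nextToken(); t != null; t = scanner.nextToken())
            tokens.add(t);
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(readAll("x = x*(acc+123)"));
    }
}
```

The scanner keeps its position between calls, which plays the role of "suspending" it until the parser asks for the next token.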

4
Tasks of a Scanner
  • A typical scanner
  • recognizes the keywords of the language
  • these are the reserved words that have a special
    meaning in the language, such as the word class
    in Java
  • recognizes special characters, such as ( and ),
    or groups of special characters, such as <= and >=
  • recognizes identifiers, integers, reals,
    decimals, strings, etc.
  • ignores whitespace (tabs, blanks, etc.) and
    comments
  • recognizes and processes special directives (such
    as the #include "file" directive in C) and macros

5
Scanner Generators
  • Input: a scanner specification
  • describes every token using Regular Expressions
    (REs)
  • e.g., the RE
  • [a-z][a-zA-Z0-9]*
  • recognizes all identifiers with at least one
    alphanumeric letter whose first letter is
    lower-case alphabetic
  • handles whitespace and resolves ambiguities
  • Output: the actual scanner
  • Scanner generators compile regular expressions
    into efficient programs (finite state machines)
  • You will use a scanner generator for Java, called
    JLex, for the project

6
Regular Expressions
  • are a very convenient form of representing
    (possibly infinite) sets of strings, called
    regular sets
  • e.g., the RE (a | b)*aa represents the
    infinite set
  • {aa, aaa, baa, abaa, ...}
  • a RE is one of the following
  • name            RE       designation
  • epsilon         ε        {ε}
  • symbol          a        {a} for some character a
  • concatenation   AB       the set {rs | r ∈ L(A), s ∈ L(B)}, where
    rs is string concatenation,
  • and L(A) and L(B) are the sets designated by the
    REs A and B
  • alternation     A | B    the set L(A) ∪ L(B)
  • repetition      A*       the set {ε} ∪ L(A) ∪ L(AA) ∪ L(AAA) ∪ ...
    (an infinite set)
  • e.g., the RE (a | b)c designates {rs |
    r ∈ {a} ∪ {b}, s ∈ {c}}, which is equal to
    {ac, bc}
  • Shortcuts: P+ = PP*, P? = P | ε, [a-z] =
    (a|b|...|z)
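
The regular sets above can be explored with java.util.regex, which accepts the same core operators (concatenation, |, *, +, ?). This is a minimal sketch under stated assumptions: the class RegularSets and its method members are hypothetical names, and Java's regex dialect is richer than the formal REs of the slide. It enumerates all strings up to a given length over a small alphabet and keeps those that the RE matches entirely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Enumerates the short members of the regular set designated by a RE.
final class RegularSets {
    // Returns every string over 'alphabet' of length <= maxLen that 're'
    // matches as a whole, ordered by length and then by alphabet order.
    static List<String> members(String re, String alphabet, int maxLen) {
        Pattern p = Pattern.compile(re);
        List<String> result = new ArrayList<String>();
        List<String> level = new ArrayList<String>();
        level.add("");                              // the empty string (epsilon)
        for (int len = 0; len <= maxLen; len++) {
            List<String> next = new ArrayList<String>();
            for (String s : level) {
                if (p.matcher(s).matches())
                    result.add(s);
                for (char c : alphabet.toCharArray())
                    next.add(s + c);                // candidates one character longer
            }
            level = next;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(members("(a|b)*aa", "ab", 3));
        System.out.println(members("(a|b)c", "abc", 3));
    }
}
```

The number of candidate strings grows exponentially with maxLen, so this is only useful for checking small examples by hand.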

7
Properties
  • concatenation and alternation are associative
  • e.g., ABC means (AB)C and is equivalent to A(BC)
  • alternation is commutative
  • e.g., A | B = B | A
  • repetition is idempotent
  • e.g., A** = A*
  • concatenation distributes over alternation
  • e.g., (a | b)c = ac | bc

8
Examples
  • for-keyword = for
  • letter = [a-zA-Z]
  • digit = [0-9]
  • identifier = letter (letter | digit)*
  • sign = + | - | ε
  • integer = sign (0 | [1-9]digit*)
  • decimal = integer . digit*
  • real = (integer | decimal) E sign digit+
9
Disambiguation Rules
  • longest match rule: from all tokens that match
    the input prefix, choose the one that matches the
    most characters
  • rule priority: if more than one token has the
    longest match, choose the one listed first
  • Examples
  • for8: is it the for-keyword, the identifier f,
    the identifier fo, the identifier for, or the
    identifier for8?
  • Use rule 1: for8 matches the most
    characters.
  • for: is it the for-keyword, the identifier f,
    the identifier fo, or the identifier for?
  • Use rules 1 and 2: the for-keyword and the
    identifier for have the longest match, but the
    for-keyword is listed first.
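
The two rules can be sketched with java.util.regex. This is an illustration only, with hypothetical names (Disambiguate, firstToken); it is not how a generated scanner works internally. It relies on the assumption that each pattern here is greedy and therefore finds its own longest prefix match, which holds for these two rules but not for every Java regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Applies longest match and rule priority to the prefix of an input string.
final class Disambiguate {
    // Rules in priority order: the for-keyword is listed before the identifier.
    static final String[] NAMES = { "for-keyword", "identifier" };
    static final Pattern[] RULES = {
        Pattern.compile("for"),
        Pattern.compile("[a-z][a-z0-9]*")
    };

    // Returns "name:lexeme" for the winning rule, or null if no rule matches.
    static String firstToken(String input) {
        int bestRule = -1;
        int bestLength = 0;
        for (int i = 0; i < RULES.length; i++) {
            Matcher m = RULES[i].matcher(input);
            // lookingAt() matches a prefix of the input. Only a strictly longer
            // match replaces the current winner (rule 1), so on a tie the rule
            // listed first is kept (rule 2).
            if (m.lookingAt() && m.end() > bestLength) {
                bestRule = i;
                bestLength = m.end();
            }
        }
        if (bestRule < 0)
            return null;
        return NAMES[bestRule] + ":" + input.substring(0, bestLength);
    }

    public static void main(String[] args) {
        System.out.println(firstToken("for8"));
        System.out.println(firstToken("for"));
    }
}
```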

10
How Scanner Generators Work
  • Translate REs into a finite state machine
  • Done in three steps:
  • translate REs into a nondeterministic finite
    automaton (NFA)
  • translate the NFA into a deterministic finite
    automaton (DFA)
  • optimize the DFA (optional)

11
Deterministic Finite Automata
  • A DFA represents a finite state machine that
    recognizes a RE
  • e.g., the RE (abc+)+ is represented by the
    DFA whose transition table is given on the next slide
  • A finite automaton consists of
  • a finite set of states
  • a set of transitions (moves)
  • one start state
  • a set of final states (accepting states)
  • a DFA has a unique transition for every
    state-character combination
  • A DFA accepts a string if, starting from the start
    state and moving from state to state, each time
    following the arrow that corresponds to the current
    input character, it reaches a final state when
    the entire input string is consumed

12
DFA (cont.)
  • The error state 0 is implied
  • The transition table T gives the next state
    T[s,c] for a state s and a character c
             a   b   c
        0    0   0   0
        1    2   0   0
        2    0   3   0
        3    0   0   4
        4    2   0   4
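
A minimal Java sketch of running this transition table is shown below. The class name TableDfa and the method accepts are hypothetical, and the sketch assumes that state 1 is the start state and state 4 the only final state, which the table itself does not say (the DFA figure is missing from this transcript).

```java
// Runs the transition table from the slide; state 0 is the error state.
final class TableDfa {
    static final int ERROR = 0;
    static final int START = 1;   // assumption: state 1 is the start state
    static final int FINAL = 4;   // assumption: state 4 is the only final state

    // T[s][c]: rows are states 0..4, columns are the characters a, b, c.
    static final int[][] T = {
        { 0, 0, 0 },
        { 2, 0, 0 },
        { 0, 3, 0 },
        { 0, 0, 4 },
        { 2, 0, 4 }
    };

    // True if the DFA is in the final state after consuming the whole input.
    static boolean accepts(String input) {
        int state = START;
        for (int i = 0; i < input.length(); i++) {
            int column = input.charAt(i) - 'a';
            if (column < 0 || column > 2)
                return false;             // character outside the alphabet {a, b, c}
            state = T[state][column];
            if (state == ERROR)
                return false;             // the error state can never be left
        }
        return state == FINAL;
    }

    public static void main(String[] args) {
        System.out.println(accepts("abc"));
        System.out.println(accepts("abccabc"));
        System.out.println(accepts("ab"));
    }
}
```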

13
The DFA of a Scanner
  • for-keyword = for
  • identifier = [a-z][a-z0-9]*

14
Scanner Code
  • The scanner code that uses the transition table
    T:
        state = initial_state;
        current_character = get_next_character();
        while ( true )
        {   next_state = T[state,current_character];
            if (next_state == ERROR)
                break;
            state = next_state;
            current_character = get_next_character();
            if ( current_character == EOF )
                break;
        };
        if ( is_final_state(state) )
            `we have a valid token'
        else `report an error'

15
With Longest Match
        state = initial_state;
        final_state = ERROR;
        current_character = get_next_character();
        while ( true )
        {   next_state = T[state,current_character];
            if (next_state == ERROR)
                break;
            state = next_state;
            if ( is_final_state(state) )
                final_state = state;
            current_character = get_next_character();
            if (current_character == EOF)
                break;
        };
        if ( final_state == ERROR )
            `report an error'
        else if ( state != final_state )
            `we have a valid token but need to backtrack
             (to put characters back into the input stream)'
        else `we have a valid token'
16
Alternative Scanner Code
  • For each transition in a DFA from state s1 to
    state s2 on character c, generate code:
        s1: current_character = get_next_character();
            ...
            if ( current_character == 'c' )
                goto s2;
            ...
        s2: current_character = get_next_character();
            ...

17
Mapping a RE into an NFA
  • An NFA is similar to a DFA but it also permits
    multiple transitions over the same character and
    transitions over ε
  • The following rules construct NFAs with only one
    final state

18
Example
  • The RE (a | b)c is mapped into the NFA

19
Converting an NFA to a DFA
  • Subset construction
  • assign a number to each NFA state
  • each DFA state will be assigned a set of numbers
  • the closure of a DFA state {n1,...,nk} is the DFA
    state that contains all the NFA states that can
    be reached by zero or more empty transitions (i.e.,
    ε-transitions) from the NFA states n1, ..., or
    nk
  • so the closure of {n1,...,nk} is a superset of or
    equal to {n1,...,nk}
  • the initial DFA state is the closure of the
    initial NFA state
  • for every DFA state labelled by some set
    {n1,...,nk} and for every character c in the
    language alphabet, find all the NFA states
    reachable from n1, ..., or nk using c arrows and
    union together the closures of these states. If
    this set is not the label of any other node in
    the DFA constructed so far, create a new DFA
    node with this label
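
The subset construction can be sketched in Java as below. All names (SubsetConstruction, move, closure, build) are hypothetical. The NFA is an assumed hand-numbered encoding of (a | b)c, since the NFA figures are missing from this transcript; state 0 is taken as the initial state and state 6 as the final one. DFA states with no outgoing arrow for a character are left out, which corresponds to the implied error state.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Subset construction over a small hard-coded NFA (assumed numbering).
final class SubsetConstruction {
    static final char EPSILON = 0;    // marks an empty transition

    // NFA transitions for (a | b)c as {from, symbol, to}.
    static final int[][] NFA = {
        {0, EPSILON, 1}, {0, EPSILON, 3},
        {1, 'a', 2}, {3, 'b', 4},
        {2, EPSILON, 5}, {4, EPSILON, 5},
        {5, 'c', 6}
    };

    // NFA states reachable from 'states' by one arrow labelled 'symbol'.
    static Set<Integer> move(Set<Integer> states, int symbol) {
        Set<Integer> result = new TreeSet<Integer>();
        for (int[] t : NFA)
            if (t[1] == symbol && states.contains(t[0]))
                result.add(t[2]);
        return result;
    }

    // Closure: states reachable by zero or more epsilon transitions.
    static Set<Integer> closure(Set<Integer> states) {
        Set<Integer> result = new TreeSet<Integer>(states);
        Deque<Integer> work = new ArrayDeque<Integer>(states);
        while (!work.isEmpty()) {
            int s = work.pop();
            for (int[] t : NFA)
                if (t[0] == s && t[1] == EPSILON && result.add(t[2]))
                    work.push(t[2]);
        }
        return result;
    }

    // Returns the DFA as: DFA state (set of NFA states) -> (character -> DFA state).
    static Map<Set<Integer>, Map<Character, Set<Integer>>> build(int start, String alphabet) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa =
            new LinkedHashMap<Set<Integer>, Map<Character, Set<Integer>>>();
        Deque<Set<Integer>> work = new ArrayDeque<Set<Integer>>();
        Set<Integer> initial = closure(new TreeSet<Integer>(Collections.singleton(start)));
        dfa.put(initial, new LinkedHashMap<Character, Set<Integer>>());
        work.add(initial);
        while (!work.isEmpty()) {
            Set<Integer> current = work.remove();
            for (char c : alphabet.toCharArray()) {
                Set<Integer> target = closure(move(current, c));
                if (target.isEmpty())
                    continue;                         // no arrow: implied error state
                dfa.get(current).put(c, target);
                if (!dfa.containsKey(target)) {       // label not seen so far: new DFA node
                    dfa.put(target, new LinkedHashMap<Character, Set<Integer>>());
                    work.add(target);
                }
            }
        }
        return dfa;
    }

    public static void main(String[] args) {
        for (Map.Entry<Set<Integer>, Map<Character, Set<Integer>>> e : build(0, "abc").entrySet())
            System.out.println(e.getKey() + " -> " + e.getValue());
    }
}
```

Sets are used directly as DFA state labels, so checking whether a label already exists is a map lookup. A DFA state is final if its label contains the final NFA state (6 under the numbering assumed here).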

20
Example
21
Example
  • (a | b)*(abb | a+b)

22
JLex
  • Regular expressions (where e and f are regular
    expressions)
      c        any character c other than ? * + | ( ) ^ $ . [ ] { } " \
      \c       any character c, but \n is newline, \^c is
               control-c, etc.
      .        any character except \n
      "..."    the concatenation of all the characters in
               the string
      ef       concatenation
      e|f      alternation
      e*       Kleene closure
      e+       ee*
      e?       optional e
      {name}   macro expansion
      [...]    any character enclosed in [ ] (but only
               one character), from
                 c        a character c (or use \c)
                 ef       any character from e or from f
                 a-b      any character from a to b
                 "..."    any character in the string
      [^...]   any character except those enclosed by [ ]

23
JLex Rules
  • A JLex rule
  • RE { action }
  • where action is Java code
  • typically, the action returns a token
  • but for whitespace and comments the action
    returns nothing, so they are skipped
  • yytext() returns the part of the input that
    matches the RE
  • JLex uses longest match and rule priority
  • States and state transitions can be used for
    better control
  • the initial (default) state is YYINITIAL
  • any other state should be declared using the
    %state directive
  • now a rule can take the form
  • <s> RE { action }
  • which can match only if we are in state s
  • you jump to a state s using yybegin(s)

24
Case Study: The Calculator Scanner
  • The calculator example is available at
  • http://lambda.uta.edu/cse5317/calc.tar.gz
  • After you download it on gamma, do
  • tar xfz calc.tar.gz
  • cd calc
  • build
  • run
  • then try it with some input, e.g.,
  • 2*(3+8);
  • x:=3+4;
  • x+3;
  • define f(n) = if n=0 then 1 else n*f(n-1);
  • f(5);
  • quit;

25
Tokens are Defined in calc.cup
  • terminal LP, RP, COMMA, SEMI, ASSIGN, IF, THEN,
    ELSE, AND, OR, NOT, QUIT, PLUS, TIMES, MINUS,
    DIV, EQ, LT, GT, LE, NE, GE, FALSE, TRUE, DEFINE;
  • terminal String ID;
  • terminal Integer INT;
  • terminal Float REALN;
  • terminal String STRINGT;
  • The class constructor Symbol pairs together a
    terminal token with an optional value (a Java
    Object)
  • if a terminal is specified with a class (a
    subtype of Object), then an object of this class
    should be provided along with the token
  • e.g., Symbol(sym.ID,"x")
  • e.g., Symbol(sym.INT,new Integer(10))

26
The Calculator Scanner
      import java_cup.runtime.Symbol;
      %%
      %class CalcLex
      %public
      %line
      %char
      %cup
      DIGIT=[0-9]
      ID=[a-zA-Z][a-zA-Z0-9_]*
      %%

27
The Calculator Scanner (cont.)
      {DIGIT}+              { return new Symbol(sym.INT,new Integer(yytext())); }
      {DIGIT}+"."{DIGIT}+   { return new Symbol(sym.REALN,new Float(yytext())); }
      "("                   { return new Symbol(sym.LP); }
      ")"                   { return new Symbol(sym.RP); }
      ","                   { return new Symbol(sym.COMMA); }
      ";"                   { return new Symbol(sym.SEMI); }
      ":="                  { return new Symbol(sym.ASSIGN); }
      "define"              { return new Symbol(sym.DEFINE); }
      "quit"                { return new Symbol(sym.QUIT); }
      "if"                  { return new Symbol(sym.IF); }
      "then"                { return new Symbol(sym.THEN); }
      "else"                { return new Symbol(sym.ELSE); }
      "and"                 { return new Symbol(sym.AND); }
      "or"                  { return new Symbol(sym.OR); }
      "not"                 { return new Symbol(sym.NOT); }
      "false"               { return new Symbol(sym.FALSE); }
      "true"                { return new Symbol(sym.TRUE); }

28
The Calculator Scanner (cont.)
      "+"                   { return new Symbol(sym.PLUS); }
      "*"                   { return new Symbol(sym.TIMES); }
      "-"                   { return new Symbol(sym.MINUS); }
      "/"                   { return new Symbol(sym.DIV); }
      "="                   { return new Symbol(sym.EQ); }
      "<"                   { return new Symbol(sym.LT); }
      ">"                   { return new Symbol(sym.GT); }
      "<="                  { return new Symbol(sym.LE); }
      "!="                  { return new Symbol(sym.NE); }
      ">="                  { return new Symbol(sym.GE); }
      {ID}                  { return new Symbol(sym.ID,yytext()); }
      \"[^\"]*\"            { return new Symbol(sym.STRINGT,
                                yytext().substring(1,yytext().length()-1)); }
      [ \t\r\n\f]           { /* ignore white spaces. */ }
      .                     { System.err.println("Illegal character: "+yytext()); }