1
Lexical Analysis and Scanning
  • Honors Compilers
  • Feb 5th 2001
  • Robert Dewar

2
The Input
  • Read string input
  • Might be sequence of characters (Unix)
  • Might be sequence of lines (VMS)
  • Character set
  • ASCII
  • ISO Latin-1
  • ISO 10646 (16-bit unicode)
  • Others (EBCDIC, JIS, etc)

3
The Output
  • A series of tokens
  • Punctuation ( ) ; ,
  • Operators + - * /
  • Keywords begin end if
  • Identifiers Square_Root
  • String literals "hello this is a string"
  • Character literals 'x'
  • Numeric literals 123 4_5.23e2 16#ac#

4
Free form vs Fixed form
  • Free form languages
  • White space does not matter
  • Tabs, spaces, new lines, carriage returns
  • Only the ordering of tokens is important
  • Fixed format languages
  • Layout is critical
  • Fortran, label in cols 1-6
  • COBOL, area A B
  • Lexical analyzer must worry about layout

5
Punctuation
  • Typically individual special characters
  • Such as ( ) ; ,
  • Lexical analyzer attaches no meaning to them
  • Sometimes double characters
  • E.g. (* treated as a kind of bracket
  • Returned just as identity of token
  • And perhaps location
  • For error message and debugging purposes

6
Operators
  • Like punctuation
  • No real difference for lexical analyzer
  • Typically single or double special chars
  • Single + - * /
  • Double ** :=
  • Returned just as identity of token
  • And perhaps location

7
Keywords
  • Reserved identifiers
  • E.g. BEGIN END in Pascal, if in C
  • Maybe distinguished from identifiers
  • E.g. the keyword mode vs the identifier mode in Algol-68 (distinguished by stropping)
  • Returned just as token identity
  • With possible location information
  • Unreserved keywords (e.g. PL/1)
  • Handled as identifiers (parser distinguishes)

8
Identifiers
  • Rules differ
  • Length, allowed characters, separators
  • Need to build table
  • So that a second occurrence of junk1 is recognized as the same identifier
  • Typical structure hash table
  • Lexical analyzer returns token type
  • And key to table entry
  • Table entry includes location information

9
More on Identifier Tables
  • Most common structure is hash table
  • With fixed number of headers
  • Chain according to hash code
  • Serial search on one chain
  • Hash code computed from characters
  • No hash code is perfect!
  • Avoid any arbitrary limits

10
String Literals
  • Text must be stored
  • Actual characters are important
  • Not like identifiers
  • Character set issues
  • Table needed
  • Lexical analyzer returns key to table
  • May or may not be worth hashing

11
Character Literals
  • Similar issues to string literals
  • Lexical Analyzer returns
  • Token type
  • Identity of character
  • Note, cannot assume character set of host
    machine, may be different

12
Numeric Literals
  • Also need a table
  • Typically record value
  • E.g. 123 = 0123 = 01_23 (Ada)
  • But cannot use int for values
  • Because may have different characteristics
  • Float stuff much more complex
  • Denormals, correct rounding
  • Very delicate stuff

13
Handling Comments
  • Comments have no effect on program
  • Can therefore be eliminated by scanner
  • But may need to be retrieved by tools
  • Error detection issues
  • E.g. unclosed comments
  • Scanner does not return comments

14
Case Equivalence
  • Some languages have case equivalence
  • Pascal, Ada
  • Some do not
  • C, Java
  • Lexical analyzer ignores case if needed
  • This_Routine is the same as THIS_RouTine
  • Error analysis may need exact casing

15
Issues to Address
  • Speed
  • Lexical analysis can take a lot of time
  • Minimize processing per character
  • I/O is also an issue (read large blocks)
  • We compile frequently
  • Compilation time is important
  • Especially during development

16
General Approach
  • Define set of token codes
  • An enumeration type
  • A series of integer definitions
  • These are just codes (no semantics)
  • Some codes associated with data
  • E.g. key for identifier table
  • May be useful to build tree node
  • For identifiers, literals etc

17
Interface to Lexical Analyzer
  • One approach: convert entire file to a file of tokens
  • Lexical analyzer is then a completely separate phase
  • Alternative: parser calls lexical analyzer
  • Get next token on demand
  • This second approach avoids extra I/O
  • Parser builds tree as we go along

18
Implementation of Scanner
  • Given the input text
  • Generate the required tokens
  • Or provide token by token on demand
  • Before we describe implementations
  • We take this short break
  • To describe relevant formalisms

19
Relevant Formalisms
  • Type 3 (Regular) Grammars
  • Regular Expressions
  • Finite State Machines

20
Regular Grammars
  • Regular grammars
  • Non-terminals (arbitrary names)
  • Terminals (characters)
  • Two forms of rules
  • Non-terminal → terminal
  • Non-terminal → terminal Non-terminal
  • One non-terminal is the start symbol
  • Regular (type 3) grammars cannot count
  • No concept of matching nested parens

21
Regular Grammars
  • Regular grammars
  • E.g. grammar of reals with no exponent
  • REAL → 0 REAL1 (repeat for 1 .. 9)
  • REAL1 → 0 REAL1 (repeat for 1 .. 9)
  • REAL1 → . INTEGER
  • INTEGER → 0 INTEGER (repeat for 1 .. 9)
  • INTEGER → 0 (repeat for 1 .. 9)
  • Start symbol is REAL

22
Regular Expressions
  • Regular expressions (RE) defined by
  • Any terminal character is an RE
  • Alternation RE | RE
  • Concatenation RE1 RE2
  • Repetition RE* (zero or more REs)
  • Language of REs = language of type 3 grammars
  • Regular expressions are more convenient

23
Specifying REs in Unix Tools
  • Single characters a b c d \x
  • Alternation [bcd] [b-z] [^abcd]
  • Match any character .
  • Match sequences of characters x* y+
  • Concatenation abc[d-q]
  • Optional [0-9]+(\.[0-9]+)?

24
Finite State Machines
  • Languages and Automata
  • A language is a set of strings
  • An automaton is a machine
  • That determines if a given string is in the
    language or not.
  • FSMs are automata that recognize regular
    languages (regular expressions)

25
Definitions of FSM
  • A set of labeled states
  • Directed arcs labeled with character
  • A state may be marked as terminal
  • Transition from state S1 to S2
  • If and only if arc from S1 to S2
  • Labeled with next character (which is eaten)
  • Recognized if ends up in terminal state
  • One state is distinguished start state

26
Building FSM from Grammar
  • One state for each non-terminal
  • A rule of the form
  • Nont1 → terminal
  • Generates transition from S1 to final state
  • A rule of the form
  • Nont1 → terminal Nont2
  • Generates transition from S1 to S2

27
Building FSMs from REs
  • Every RE corresponds to a grammar
  • For all regular expressions
  • A natural translation to FSM exists
  • We will not give details of algorithm here

28
Non-Deterministic FSM
  • A non-deterministic FSM
  • Has at least one state
  • With two arcs to two separate states
  • Labeled with the same character
  • Which way to go?
  • Implementation requires backtracking
  • Nasty!

29
Deterministic FSM
  • For all states S
  • For all characters C
  • There is either ONE or NO arcs
  • From state S
  • Labeled with character C
  • Much easier to implement
  • No backtracking!

30
Dealing with ND FSM
  • Construction naturally leads to ND FSM
  • For example, consider FSM for
  • [0-9]+ | [0-9]+\.[0-9]+
  • (integer or real)
  • We will naturally get a start state
  • With two sets of 0-9 branches
  • And thus non-deterministic

31
Converting to Deterministic
  • There is an algorithm for converting
  • From any ND FSM
  • To an equivalent deterministic FSM
  • Algorithm is in the text book
  • Example (given in terms of REs)
  • [0-9]+ | [0-9]+\.[0-9]+
  • [0-9]+(\.[0-9]+)?

32
Implementing the Scanner
  • Three methods
  • Completely informal, just write code
  • Define tokens using regular expressions
  • Convert REs to ND finite state machine
  • Convert ND FSM to deterministic FSM
  • Program the FSM
  • Use an automated program
  • To achieve above three steps

33
Ad Hoc Code (forget FSMs)
  • Write normal hand code
  • A procedure called Scan
  • Normal coding techniques
  • Basically scan over white space and comments till
    non-blank character found.
  • Base subsequent processing on character
  • E.g. colon may be : or :=
  • / may be operator or start of comment
  • Return token found
  • Write aggressive efficient code

34
Using FSM Formalisms
  • Start with regular grammar or RE
  • Typically found in the language standard
  • For example, for Ada
  • Chapter 2. Lexical Elements
  • digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
  • decimal-literal ::= integer [. integer] [exponent]
  • integer ::= digit {[underline] digit}
  • exponent ::= E [+] integer | E - integer

35
Using FSM formalisms, cont
  • Given REs or grammar
  • Convert to finite state machine
  • Convert ND FSM to deterministic FSM
  • Write a program to recognize
  • Using the deterministic FSM

36
Implementing FSM (Method 1)
  • Each state is code of the form
  • <<State1>> case Next_Character is
                  when 'a' => goto State3;
                  when 'b' => goto State1;
                  when others => End_of_token_processing;
               end case;
  • <<State2>> ...

37
Implementing FSM (Method 2)
  • There is a variable called State
  • loop
       case State is
          when State1 =>
             case Next_Character is
                when 'a' => State := State3;
                when 'b' => State := State1;
                when others => End_token_processing;
             end case;
          when State2 => ...
       end case;
    end loop;

38
Implementing FSM (Method 3)
  • T : array (State, Character) of State;
    while More_Input loop
       Curstate := T (Curstate, Next_Char);
       if Curstate = Error_State then ...
    end loop;

39
Automatic FSM Generation
  • Our example, FLEX
  • See home page for manual in HTML
  • FLEX is given
  • A set of regular expressions
  • Actions associated with each RE
  • It builds a scanner
  • Which matches REs and executes actions

40
Flex General Format
  • Input to Flex is a set of rules
  • Regexp actions (C statements)
  • Regexp actions (C statements)
  • Flex scans the longest matching Regexp
  • And executes the corresponding actions

41
An Example of a Flex scanner
  • DIGIT   [0-9]
    ID      [a-z][a-z0-9]*
    %%
    {DIGIT}+             printf ("an integer: %s (%d)\n", yytext, atoi (yytext));
    {DIGIT}+"."{DIGIT}*  printf ("a float: %s (%g)\n", yytext, atof (yytext));
    if|then|begin|end|procedure|function  printf ("a keyword: %s\n", yytext);

42
Flex Example (continued)
  • {ID}              printf ("an identifier: %s\n", yytext);
    "+"|"-"|"*"|"/"   printf ("an operator: %s\n", yytext);
  • "--".*\n          /* eat Ada style comment */
  • [ \t\n]+          /* eat white space */
  • .                 printf ("unrecognized character");

43
Assembling the flex program
  • #include <math.h>  /* for atof */
  • <<flex text we gave goes here>>
  • main (argc, argv)
    int argc;
    char **argv;
  • { yyin = fopen (argv[1], "r");
  •   yylex (); }

44
Running flex
  • flex is a program that is executed
  • The input is as we have given
  • The output is a running C program
  • For Ada fans
  • Look at aflex (www.adapower.com)
  • For C++ fans
  • flex can run in C++ mode
  • Generates appropriate classes

45
Choice Between Methods?
  • Hand written scanners
  • Typically much faster execution
  • And pretty easy to write
  • And easier to do good error recovery
  • Flex approach
  • Simple to use
  • Easy to modify token language

46
The GNAT Scanner
  • Hand written (scn.adb/scn.ads)
  • Basically each call does the following
  • Super quick scan past blanks/comments etc
  • Big case statement
  • Process based on first character
  • Call special routines
  • Namet.Get_Name for identifier (hashing)
  • Keywords recognized by special hash
  • Strings (stringt.ads)
  • Integers (uintp.ads)
  • Reals (ureal.ads)

47
More on the GNAT Scanner
  • Entire source read into memory
  • Single contiguous block
  • Source location is index into this block
  • Different index range for each source file
  • See sinput.adb/ads for source mgmt
  • See scans.ads for definitions of tokens

48
More on GNAT Scanner
  • Read scn.adb code
  • Very easy reading

49
ASSIGNMENT TWO
  • Write a flex or aflex program
  • Recognize tokens of an Algol-68S program
  • Print out tokens in style of flex example
  • Extra credit
  • Build hash table for identifiers
  • Output hash table key

50
Preprocessors
  • Some languages allow preprocessing
  • This is a separate step
  • Input is source
  • Output is expanded source
  • Can either be done as separate phase
  • Or embedded into the lexical analyzer
  • Often done as separate phase
  • Need to keep track of source locations

51
Nasty Glitches
  • Separation of tokens
  • Not all languages have clear rules
  • FORTRAN has optional spaces
  • DO10I=1.6
  • identifier operator literal
  • DO10I  =  1.6
  • DO10I=1,6
  • Keyword stmt-label loopvar operator literal punct literal
  • DO  10  I  =  1  ,  6
  • Modern languages avoid this kind of thing!