Title: Lexical Analysis and Scanning
1Lexical Analysis and Scanning
- Honors Compilers
- Feb 5th 2001
- Robert Dewar
2The Input
- Read string input
- Might be sequence of characters (Unix)
- Might be sequence of lines (VMS)
- Character set
- ASCII
- ISO Latin-1
- ISO 10646 (16-bit Unicode)
- Others (EBCDIC, JIS, etc)
3The Output
- A series of tokens
- Punctuation ( ) ,
- Operators + - * /
- Keywords begin end if
- Identifiers Square_Root
- String literals "hello this is a string"
- Character literals 'x'
- Numeric literals 123 4_5.23e2 16#ac#
4Free form vs Fixed form
- Free form languages
- White space does not matter
- Tabs, spaces, new lines, carriage returns
- Only the ordering of tokens is important
- Fixed format languages
- Layout is critical
- Fortran, label in cols 1-6
- COBOL, area A B
- Lexical analyzer must worry about layout
5Punctuation
- Typically individual special characters
- Such as ( ) , ;
- Lexical analyzer does not know one from another
- Sometimes double characters
- E.g. << treated as a kind of bracket
- Returned just as identity of token
- And perhaps location
- For error message and debugging purposes
6Operators
- Like punctuation
- No real difference for lexical analyzer
- Typically single or double special chars
- Operators, e.g. + - * /
- Operations, e.g. ** :=
- Returned just as identity of token
- And perhaps location
7Keywords
- Reserved identifiers
- E.g. BEGIN END in Pascal, if in C
- Maybe distinguished from identifiers
- E.g. the bold word mode vs the identifier mode in Algol-68 (stropping)
- Returned just as token identity
- With possible location information
- Unreserved keywords (e.g. PL/1)
- Handled as identifiers (parser distinguishes)
8Identifiers
- Rules differ
- Length, allowed characters, separators
- Need to build table
- So that every occurrence of junk1 is recognized as the same junk1
- Typical structure hash table
- Lexical analyzer returns token type
- And key to table entry
- Table entry includes location information
9More on Identifier Tables
- Most common structure is hash table
- With fixed number of headers
- Chain according to hash code
- Serial search on one chain
- Hash code computed from characters
- No hash code is perfect!
- Avoid any arbitrary limits
10String Literals
- Text must be stored
- Actual characters are important
- Not like identifiers
- Character set issues
- Table needed
- Lexical analyzer returns key to table
- May or may not be worth hashing
11Character Literals
- Similar issues to string literals
- Lexical Analyzer returns
- Token type
- Identity of character
- Note, cannot assume character set of host machine, may be different
12Numeric Literals
- Also need a table
- Typically record value
- E.g. 123 = 0123 = 01_23 (Ada)
- But cannot use int for values
- Because target integers may have different characteristics than the host's
- Float stuff much more complex
- Denormals, correct rounding
- Very delicate stuff
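The integer case can be sketched as below; the function name is mine, and a long accumulator is for illustration only (a real compiler would use arbitrary-precision arithmetic, precisely because host and target integers differ):

```c
/* Sketch: value of an Ada-style decimal integer literal,
   where underlines are separators with no effect on the value. */
long literal_value(const char *s)
{
    long v = 0;
    for (; *s; s++) {
        if (*s == '_')
            continue;              /* underline separates digits, contributes nothing */
        v = v * 10 + (*s - '0');   /* accumulate decimal digits */
    }
    return v;
}
```

So 123, 0123 and 01_23 all yield the same recorded value, 123.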
13Handling Comments
- Comments have no effect on program
- Can therefore be eliminated by scanner
- But may need to be retrieved by tools
- Error detection issues
- E.g. unclosed comments
- Scanner does not return comments
14Case Equivalence
- Some languages have case equivalence
- Pascal, Ada
- Some do not
- C, Java
- Lexical analyzer ignores case if needed
- This_Routine = THIS_RouTine
- Error analysis may need exact casing
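A sketch of case-insensitive identifier comparison in C (the function name is illustrative); the original spelling would still be stored in the table for error messages:

```c
#include <ctype.h>

/* Compare two identifiers ignoring case, as a case-equivalent
   language like Pascal or Ada requires. */
int same_ident(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;              /* differ at this position */
        a++;
        b++;
    }
    return *a == *b;               /* both ended: equal lengths */
}
```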
15Issues to Address
- Speed
- Lexical analysis can take a lot of time
- Minimize processing per character
- I/O is also an issue (read large blocks)
- We compile frequently
- Compilation time is important
- Especially during development
16General Approach
- Define set of token codes
- An enumeration type
- A series of integer definitions
- These are just codes (no semantics)
- Some codes associated with data
- E.g. key for identifier table
- May be useful to build tree node
- For identifiers, literals etc
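One possible shape for such token codes in C, with some codes carrying associated data; all names here are illustrative assumptions:

```c
/* Token codes are just an enumeration: no semantics attached. */
enum token_code {
    TOK_LPAREN, TOK_RPAREN, TOK_COMMA,   /* punctuation */
    TOK_PLUS, TOK_MINUS,                 /* operators */
    TOK_BEGIN, TOK_END, TOK_IF,          /* keywords */
    TOK_IDENT,                           /* carries identifier-table key */
    TOK_INT_LIT, TOK_STRING_LIT,         /* carry literal-table keys */
    TOK_EOF
};

/* A token as handed to the parser: code, optional table key,
   and location for error messages and debugging. */
struct token {
    enum token_code code;
    int key;          /* key into identifier/literal table, if any */
    int line, col;    /* source location */
};
```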
17Interface to Lexical Analyzer
- Convert entire file to a file of tokens
- Lexical analyzer is separate phase
- Parser calls lexical analyzer
- Get next token
- This approach avoids extra I/O
- Parser builds tree as we go along
18Implementation of Scanner
- Given the input text
- Generate the required tokens
- Or provide token by token on demand
- Before we describe implementations
- We take this short break
- To describe relevant formalisms
19Relevant Formalisms
- Type 3 (Regular) Grammars
- Regular Expressions
- Finite State Machines
20Regular Grammars
- Regular grammars
- Non-terminals (arbitrary names)
- Terminals (characters)
- Two forms of rules
- Non-terminal → terminal
- Non-terminal → terminal Non-terminal
- One non-terminal is the start symbol
- Regular (type 3) grammars cannot count
- No concept of matching nested parens
21Regular Grammars
- Regular grammars
- E.g. grammar of reals with no exponent
- REAL → 0 REAL1 (repeat for 1 .. 9)
- REAL1 → 0 REAL1 (repeat for 1 .. 9)
- REAL1 → . INTEGER
- INTEGER → 0 INTEGER (repeat for 1 .. 9)
- INTEGER → 0 (repeat for 1 .. 9)
- Start symbol is REAL
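The grammar above can be transcribed directly into a recognizer, one state per non-terminal; this C sketch adds one accepting state (all names are mine) to mark "at least one digit seen after the point":

```c
#include <ctype.h>

enum state { S_REAL, S_REAL1, S_INTEGER, S_ACCEPT, S_ERROR };

/* Recognize digits '.' digits, per the REAL grammar. */
int is_real(const char *s)
{
    enum state st = S_REAL;                       /* start symbol is REAL */
    for (; *s; s++) {
        switch (st) {
        case S_REAL:                              /* REAL -> digit REAL1 */
            st = isdigit((unsigned char)*s) ? S_REAL1 : S_ERROR;
            break;
        case S_REAL1:                             /* digit REAL1 | . INTEGER */
            if (isdigit((unsigned char)*s)) st = S_REAL1;
            else if (*s == '.')             st = S_INTEGER;
            else                            st = S_ERROR;
            break;
        case S_INTEGER:                           /* INTEGER -> digit INTEGER | digit */
        case S_ACCEPT:
            st = isdigit((unsigned char)*s) ? S_ACCEPT : S_ERROR;
            break;
        default:
            return 0;
        }
        if (st == S_ERROR)
            return 0;
    }
    return st == S_ACCEPT;                        /* must end after >= 1 fraction digit */
}
```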
22Regular Expressions
- Regular expressions (RE) defined by
- Any terminal character is an RE
- Alternation RE | RE
- Concatenation RE1 RE2
- Repetition RE* (zero or more REs)
- Language of REs = type 3 grammars
- Regular expressions are more convenient
23Specifying REs in Unix Tools
- Single characters a b c d \x
- Alternation [bcd] [b-z] a|b|c|d
- Match any character .
- Match sequence of characters x* y+
- Concatenation abc[d-q]
- Optional [0-9]+(\.[0-9]+)?
24Finite State Machines
- Languages and Automata
- A language is a set of strings
- An automaton is a machine
- That determines if a given string is in the language or not
- FSMs are automata that recognize regular languages (regular expressions)
25Definitions of FSM
- A set of labeled states
- Directed arcs labeled with character
- A state may be marked as terminal
- Transition from state S1 to S2
- If and only if arc from S1 to S2
- Labeled with next character (which is eaten)
- Recognized if ends up in terminal state
- One state is distinguished start state
26Building FSM from Grammar
- One state for each non-terminal
- A rule of the form
- Nont1 → terminal
- Generates transition from S1 to final state
- A rule of the form
- Nont1 → terminal Nont2
- Generates transition from S1 to S2
27Building FSMs from REs
- Every RE corresponds to a grammar
- For all regular expressions
- A natural translation to FSM exists
- We will not give details of algorithm here
28Non-Deterministic FSM
- A non-deterministic FSM
- Has at least one state
- With two arcs to two separate states
- Labeled with the same character
- Which way to go?
- Implementation requires backtracking
- Nasty!
29Deterministic FSM
- For all states S
- For all characters C
- There is either ONE or NO arcs
- From state S
- Labeled with character C
- Much easier to implement
- No backtracking!
30Dealing with ND FSM
- Construction naturally leads to ND FSM
- For example, consider FSM for
- [0-9]+ | [0-9]+\.[0-9]+
- (integer or real)
- We will naturally get a start state
- With two sets of 0-9 branches
- And thus non-deterministic
31Converting to Deterministic
- There is an algorithm for converting
- From any ND FSM
- To an equivalent deterministic FSM
- Algorithm is in the text book
- Example (given in terms of REs)
- [0-9]+ | [0-9]+\.[0-9]+
- [0-9]+(\.[0-9]+)?
32Implementing the Scanner
- Three methods
- Completely informal, just write code
- Define tokens using regular expressions
- Convert REs to ND finite state machine
- Convert ND FSM to deterministic FSM
- Program the FSM
- Use an automated program
- To achieve above three steps
33Ad Hoc Code (forget FSMs)
- Write normal hand code
- A procedure called Scan
- Normal coding techniques
- Basically scan over white space and comments till non-blank character found
- Base subsequent processing on character
- E.g. colon may be : or :=
- / may be operator or start of comment
- Return token found
- Write aggressive efficient code
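A sketch of this ad hoc style in C: skip white space, then dispatch on the first character of the token. The token names and the `:` vs `:=` example are illustrative:

```c
#include <ctype.h>

enum tok { T_IDENT, T_NUMBER, T_COLON, T_ASSIGN, T_EOF, T_ERROR };

/* Scan one token starting at *pp, advancing the pointer past it. */
enum tok scan(const char **pp)
{
    const char *p = *pp;
    while (*p == ' ' || *p == '\t' || *p == '\n')      /* skip white space */
        p++;
    if (*p == '\0') { *pp = p; return T_EOF; }
    if (isalpha((unsigned char)*p)) {                  /* identifier */
        while (isalnum((unsigned char)*p) || *p == '_')
            p++;
        *pp = p; return T_IDENT;
    }
    if (isdigit((unsigned char)*p)) {                  /* numeric literal */
        while (isdigit((unsigned char)*p))
            p++;
        *pp = p; return T_NUMBER;
    }
    if (*p == ':') {                                   /* ':' or ":=" */
        p++;
        if (*p == '=') { p++; *pp = p; return T_ASSIGN; }
        *pp = p; return T_COLON;
    }
    *pp = p + 1;                                       /* unknown character */
    return T_ERROR;
}
```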
34Using FSM Formalisms
- Start with regular grammar or RE
- Typically found in the language standard
- For example, for Ada
- Chapter 2. Lexical Elements
- digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
- decimal-literal ::= integer [. integer] [exponent]
- integer ::= digit {[underline] digit}
- exponent ::= E [+] integer | E - integer
35Using FSM formalisms, cont
- Given REs or grammar
- Convert to finite state machine
- Convert ND FSM to deterministic FSM
- Write a program to recognize
- Using the deterministic FSM
36Implementing FSM (Method 1)
- Each state is code of the form
- <<state1>>
    case Next_Character is
      when 'a' => goto state3;
      when 'b' => goto state1;
      when others => End_of_token_processing;
    end case;
- <<state2>> ...
37Implementing FSM (Method 2)
- There is a variable called State
- loop
    case State is
      when state1 =>
        case Next_Character is
          when 'a' => State := state3;
          when 'b' => State := state1;
          when others => End_token_processing;
        end case;
      when state2 => ...
    end case;
  end loop;
38Implementing FSM (Method 3)
- T : array (State, Character) of State;
  while More_Input loop
    Curstate := T (Curstate, Next_Char);
    if Curstate = Error_State then ...
  end loop;
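Method 3 filled in as a C sketch for the deterministic FSM of the earlier integer-or-real example, [0-9]+(\.[0-9]+)?; the state names and table layout are illustrative:

```c
enum { S_START, S_INT, S_DOT, S_FRAC, S_ERR, NSTATES };

static int T[NSTATES][256];                    /* T(state, character) -> state */
static const int accepting[NSTATES] = { 0, 1, 0, 1, 0 };

/* Fill the transition table; "no arc" goes to the error state. */
static void build_table(void)
{
    int s, c;
    for (s = 0; s < NSTATES; s++)
        for (c = 0; c < 256; c++)
            T[s][c] = S_ERR;
    for (c = '0'; c <= '9'; c++) {
        T[S_START][c] = S_INT;                 /* first digit */
        T[S_INT][c]   = S_INT;                 /* more integer digits */
        T[S_DOT][c]   = S_FRAC;                /* first fraction digit */
        T[S_FRAC][c]  = S_FRAC;                /* more fraction digits */
    }
    T[S_INT]['.'] = S_DOT;                     /* optional fraction starts */
}

/* Drive the table over the whole string: Curstate := T(Curstate, Next_Char). */
int recognize(const char *s)
{
    int st = S_START;
    build_table();
    for (; *s; s++)
        st = T[st][(unsigned char)*s];
    return accepting[st];
}
```

Note that minimal processing happens per character: one array index, no tests, which is the point of the table-driven method.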
39Automatic FSM Generation
- Our example, FLEX
- See home page for manual in HTML
- FLEX is given
- A set of regular expressions
- Actions associated with each RE
- It builds a scanner
- Which matches REs and executes actions
40Flex General Format
- Input to Flex is a set of rules
- Regexp actions (C statements)
- Regexp actions (C statements)
-
- Flex scans the longest matching Regexp
- And executes the corresponding actions
41An Example of a Flex scanner
- DIGIT [0-9]
- ID [a-z][a-z0-9]*
- %%
- {DIGIT}+ printf ("an integer: %s (%d)\n", yytext, atoi (yytext));
- {DIGIT}+"."{DIGIT}* printf ("a float: %s (%g)\n", yytext, atof (yytext));
- if|then|begin|end|procedure|function printf ("a keyword: %s\n", yytext);
42Flex Example (continued)
- {ID} printf ("an identifier: %s\n", yytext);
- "+"|"-"|"*"|"/" printf ("an operator: %s\n", yytext);
- "--".*\n /* eat Ada style comment */
- [ \t\n]+ /* eat white space */
- . printf ("unrecognized character");
43Assembling the flex program
- %{
- #include <math.h> /* for atof */
- %}
- <<flex text we gave goes here>>
- %%
- main (argc, argv)
- int argc;
- char **argv;
- {
-   yyin = fopen (argv[1], "r");
-   yylex ();
- }
44Running flex
- flex is a program that is executed
- The input is as we have given
- The output is a running C program
- For Ada fans
- Look at aflex (www.adapower.com)
- For C++ fans
- flex can run in C++ mode
- Generates appropriate classes
45Choice Between Methods?
- Hand written scanners
- Typically much faster execution
- And pretty easy to write
- And easier to do good error recovery
- Flex approach
- Simple to use
- Easy to modify token language
46The GNAT Scanner
- Hand written (scn.adb/scn.ads)
- Basically a call to Scan does
- Super quick scan past blanks/comments etc
- Big case statement
- Process based on first character
- Call special routines
- Namet.Get_Name for identifier (hashing)
- Keywords recognized by special hash
- Strings (stringt.ads)
- Integers (uintp.ads)
- Reals (ureal.ads)
47More on the GNAT Scanner
- Entire source read into memory
- Single contiguous block
- Source location is index into this block
- Different index range for each source file
- See sinput.adb/ads for source mgmt
- See scans.ads for definitions of tokens
48More on GNAT Scanner
- Read scn.adb code
- Very easy reading
49ASSIGNMENT TWO
- Write a flex or aflex program
- Recognize tokens of an Algol-68S program
- Print out tokens in style of flex example
- Extra credit
- Build hash table for identifiers
- Output hash table key
50Preprocessors
- Some languages allow preprocessing
- This is a separate step
- Input is source
- Output is expanded source
- Can either be done as separate phase
- Or embedded into the lexical analyzer
- Often done as separate phase
- Need to keep track of source locations
51Nasty Glitches
- Separation of tokens
- Not all languages have clear rules
- FORTRAN has optional spaces
- DO10I=1.6
- identifier operator literal
- DO10I = 1.6
- DO10I=1,6
- Keyword stmt loopvar operator literal punc literal
- DO 10 I = 1 , 6
- Modern languages avoid this kind of thing!