Review: Regular expression

Slides: 15
Provided by: xyu
Learn more at: http://www.cs.fsu.edu

1
  • Review: regular expressions
  • How do we define them?
  • Given an alphabet Σ,
  • Base case:
  • ε is a regular expression that denotes
    {ε}, the set that contains the empty string.
  • For each a in Σ, a is a regular expression that
    denotes {a}, the set containing the string a.
  • Induction case:
  • r and s are regular expressions denoting the
    languages (sets) L(r) and L(s). Then
  • (r) | (s) is a regular expression denoting
    L(r) ∪ L(s)
  • (r)(s) is a regular expression denoting
    L(r)L(s)
  • (r)* is a regular expression denoting
    (L(r))*
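As a worked example of the inductive rules, take the alphabet Σ = {a, b} (chosen here only for illustration) and unfold the language of ((a)|(b))* one rule at a time:

```latex
\begin{align*}
L\big(((a)|(b))^{*}\big) &= \big(L((a)|(b))\big)^{*} \\
  &= \big(L(a) \cup L(b)\big)^{*} \\
  &= \{a, b\}^{*} \\
  &= \{\varepsilon, a, b, aa, ab, ba, bb, \ldots\}
\end{align*}
```

Each step applies exactly one case of the definition: closure first, then union, then the base case for single symbols.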

2
  • Lex -- a Lexical Analyzer Generator (by M.E. Lesk
    and Eric Schmidt)
  • A Lex source program has three sections,
    separated by %% lines:
  • definitions
  • rules
  • user subroutines
  • Rules: <regular expression> <action>
  • Each regular expression specifies a token.
  • Default action for anything that is not matched:
    copy to the output
  • Action: C source fragments specifying what to do
    when a token is recognized.

3
  • lex program examples: ex1.l and ex2.l
  • Running lex ex1.l produces the file lex.yy.c.
  • The int yylex() routine is the scanner that finds
    all the regular expressions specified.
  • yylex() normally returns a non-zero value
    (usually a token id).
  • yylex() returns 0 when end of file is reached.
  • Need a driver to test the routine.
  • You need to have a yywrap() function in the lex
    file (return 1).
  • yywrap() is called at end of input: returning 1
    means there is no more input, returning 0 means
    scanning continues with another input file.
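A driver is just a loop that calls yylex() until it returns 0. The sketch below shows that loop in C. Because the real yylex() only exists after lex has generated lex.yy.c, the block uses a small hand-written stand-in for it; the stand-in, the token ids WORD and OTHER, the scan_pos buffer, and run_scanner() are all assumptions made for illustration and are not part of ex1.l or ex2.l.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical stand-in for the generated scanner. In a real build,
   yylex() comes from lex.yy.c and this stub is deleted. */
static const char *scan_pos = "";   /* assumed input buffer, not a lex variable */

enum { WORD = 1, OTHER = 2 };       /* made-up token ids, all non-zero */

int yylex(void)
{
    while (*scan_pos == ' ' || *scan_pos == '\t' || *scan_pos == '\n')
        scan_pos++;                          /* skip white space */
    if (*scan_pos == '\0')
        return 0;                            /* end of input */
    if (isalpha((unsigned char)*scan_pos)) {
        while (isalnum((unsigned char)*scan_pos))
            scan_pos++;
        return WORD;
    }
    scan_pos++;                              /* any other single character */
    return OTHER;
}

/* Called at end of input; 1 means "no more input files". */
int yywrap(void) { return 1; }

/* The driver: call yylex() until it returns 0; returns the token count. */
int run_scanner(const char *text)
{
    int token, count = 0;
    scan_pos = text;
    while ((token = yylex()) != 0) {
        printf("token id %d\n", token);
        count++;
    }
    return count;
}
```

With a generated scanner, the same loop would typically sit in main() and read from standard input instead of a string.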

4
  • A Lex regular expression contains text characters
    and operators.
  • Letters of the alphabet and digits are always
    text characters.
  • The regular expression integer matches the string
    "integer".
  • Operators: " \ [ ] ^ - ? . * + | ( ) $ / { } % < >
  • When these characters appear in a regular
    expression, they have special meanings.

5
  • operators (characters that have special
    meanings): " \ [ ] ^ - ? . * + | ( ) $ / { } % < >
  • *, +, |, (, ) -- used in regular
    expressions
  • " " -- any character between the quotes is a text
    character.
  • E.g. xyz"++" == "xyz++"
  • \ -- escape character
  • To get the operators back as text: xyz\+\+
    matches the string xyz++
  • To specify special characters: \40 (the octal
    code for a blank)
  • [ and ] -- used to specify a set of
    characters
  • e.g. [a-z], [a-zA-Z]
  • Every character in it except ^, - and \ is a text
    character
  • [-+0-9], [\40-\176]
  • ^ -- not, used as the first character after the
    left bracket
  • E.g. [^abc] -- everything except a, b or c.
  • [^a-zA-Z] -- ??

6
  • operators (characters that have special
    meanings): " \ [ ] ^ - ? . * + | ( ) $ / { } % < >
  • . -- every character except newline
  • ? -- optional: ab?c matches ac or abc
  • / -- used in character lookahead
  • e.g. ab/cd -- matches ab only if it is followed
    by cd
  • { } -- enclose a regular definition
  • % -- has special meaning in lex
  • $ -- match the end of a line, ^ -- match the
    beginning of a line
  • ab$ == ab/\n
  • < > -- start condition (more context
    sensitivity support, see the paper for details).

7
  • Order of pattern matching
  • Always matches the longest pattern.
  • When multiple patterns match the same length,
    use the first pattern.
  • To override, add REJECT in the action.
  • ...
  • Ab    printf("rule 1\n");
  • Abc   printf("rule 2\n");
  • {letter}({letter}|{digit})*   printf("rule 3\n");
  • Input: Abc
  • What happens when .* is used as a pattern?
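The two matching rules (longest match first, earliest rule on a tie) can be sketched in plain C for the three patterns above. This is not how lex works internally, since lex compiles all patterns into one automaton; the functions rule1, rule2, rule3 and select_rule are hypothetical names that only mimic the selection policy.

```c
#include <ctype.h>
#include <string.h>

/* Each rule returns the length of the prefix of s that it matches,
   or 0 if it does not match at all. */
static size_t match_literal(const char *s, const char *lit)
{
    size_t n = strlen(lit);
    return strncmp(s, lit, n) == 0 ? n : 0;
}

static size_t rule1(const char *s) { return match_literal(s, "Ab"); }
static size_t rule2(const char *s) { return match_literal(s, "Abc"); }

/* A letter followed by any number of letters or digits. */
static size_t rule3(const char *s)
{
    size_t n = 0;
    if (!isalpha((unsigned char)s[0]))
        return 0;
    while (isalnum((unsigned char)s[n]))
        n++;
    return n;
}

/* Returns the 1-based number of the winning rule, or 0 if none matches.
   If match_len is not NULL it receives the length of the match. */
int select_rule(const char *s, size_t *match_len)
{
    size_t (*rules[])(const char *) = { rule1, rule2, rule3 };
    int best = 0;
    size_t best_len = 0;
    for (int i = 0; i < 3; i++) {
        size_t len = rules[i](s);
        if (len > best_len) {   /* strictly longer only: earlier rule wins ties */
            best_len = len;
            best = i + 1;
        }
    }
    if (match_len != NULL)
        *match_len = best_len;
    return best;
}
```

For the input Abc, rules 2 and 3 both match three characters, so rule 2 wins because it is listed first. For Abcd only rule 3 matches all four characters, so the longest-match rule selects it.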

8
  • Manipulate the lexeme and/or the input stream
  • yytext -- a char pointer pointing to the matched
    string
  • yyleng -- the length of the matched string
  • I/O routines to manipulate the input stream
  • input() -- get a character from the input
    stream; returns a value <= 0 when reaching the
    end of the input stream, the character otherwise
  • unput(c) -- put c back onto the input stream
  • Deal with comments ( /* .. */ )
  • "/*".*"*/" ???
  • "/*"  { int c1, c2;
  •         c2 = input();
  •         if (c2 <= 0) lex_error("unfinished
            comment");
  •         else { c1 = c2; c2 = input();
  •           while (((c1 != '*') ||
              (c2 != '/')) && (c2 > 0)) { c1 = c2; c2 =
              input(); }
  •           if (c2 <= 0)
              lex_error("...");
  •         } }
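The comment-skipping loop can be tried outside lex. The following is a minimal sketch in C that reads from a string through next_char(), a hypothetical stand-in for lex's input() that returns 0 at the end. The names src, skip_comment and skip_comment_in are assumptions made for this example.

```c
#include <stddef.h>

/* Stand-in for lex's input(): reads from a string, returns 0 at the end.
   In a real scanner the generated input() is used instead. */
static const char *src = "";
static int next_char(void) { return *src ? (unsigned char)*src++ : 0; }

/* Call after the comment opener has been matched.
   Returns 0 when the closing star-slash pair was found,
   -1 if the input ended first (an unfinished comment). */
int skip_comment(void)
{
    int c1 = 0, c2 = next_char();
    while (c2 > 0 && !(c1 == '*' && c2 == '/')) {
        c1 = c2;                 /* slide the two-character window forward */
        c2 = next_char();
    }
    return c2 > 0 ? 0 : -1;
}

/* Helper for experiments: text is what follows the comment opener.
   If rest is not NULL it receives the position after the comment. */
int skip_comment_in(const char *text, const char **rest)
{
    int status;
    src = text;
    status = skip_comment();
    if (rest != NULL)
        *rest = src;
    return status;
}
```

The characters are held in int variables so that an end-of-input value that is zero or negative is never confused with a real character.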

9
  • Reporting errors
  • What kind of errors? Not too many.
  • Characters that cannot lead to a token
  • unterminated comments (can we do it in later
    phases?)
  • unterminated string constants.
  • How to keep track of the current position (which
    line, which column)?
  • Use two global variables for this: yyline,
    yycolumn
  • int yyline = 1, yycolumn = 1;
  • ...
  • [ \t\n]   /* do
    nothing */
  • If   return
    (IFNumber);
  • "+"   return
    (PLUSNumber);
  • {letter}({letter}|{digit})*   { yylval =
    idtable_insert(yytext); return (IDNumber); }
  • ...
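One way to keep yyline and yycolumn current is to advance them over every matched lexeme. The sketch below assumes a helper named update_position(), which is not in the slides, and assumes that a tab counts as one column.

```c
/* Globals named as on the slide; both start at 1. */
int yyline = 1, yycolumn = 1;

/* Advance the position past a matched lexeme of the given length.
   A rule's action could call this as update_position(yytext, yyleng). */
void update_position(const char *text, int length)
{
    for (int i = 0; i < length; i++) {
        if (text[i] == '\n') {
            yyline++;            /* a newline starts the next line */
            yycolumn = 1;
        } else {
            yycolumn++;          /* assumption: a tab counts as one column */
        }
    }
}
```

If every action, including the one for white space, calls the helper before returning, the two variables always describe the position just after the last matched token, which is what an error message needs.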

10
  • Reporting errors
  • How to report a character that cannot lead
    to a token?
  • How to deal with an unterminated comment?
  • How to deal with an unterminated string?

11
  • Dealing with identifiers and string constants.
  • Data structures
  • A string table that stores the lexeme values.
  • To avoid inserting the same lexeme multiple
    times, we maintain an id table that records
    all identifiers found. Each id table entry has a
    pointer into the string table.
  • Implementation of the id table: hash table,
    linked list, tree, ...
  • See the hash table implementation on pages
    433-436.

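[Figure: the id table entries for cp, n, match, last, i and j point into the string table, which stores the lexemes back to back: c p \0 n \0 m a t c h \0 l a s t \0 i \0 j \0]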
12
  • Some code pieces for the id table
  • #define STRINGTABLELENGTH 20000
  • #define PRIME 997
  • struct HashItem {
  •   int index;
  •   struct HashItem *next;
  • };
  • struct HashItem *HashTable[PRIME];
  • char StringTable[STRINGTABLELENGTH];
  • int StringTableIndex = 0;
  • int HashFunction(char *s);   /* copy from
    page 436 */
  • int HashInsert(char *s);
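The declarations above leave HashFunction and HashInsert open. The following is a minimal sketch of how they might be filled in, assuming chained buckets and a string table index as the return value. The hash used here is a simple placeholder and is not the function from the page the slide refers to. The return convention (-1 on overflow) is also an assumption.

```c
#include <stdlib.h>
#include <string.h>

#define STRINGTABLELENGTH 20000
#define PRIME 997

struct HashItem {
    int index;                 /* start of the lexeme in StringTable */
    struct HashItem *next;     /* next item in the same bucket */
};

struct HashItem *HashTable[PRIME];
char StringTable[STRINGTABLELENGTH];
int StringTableIndex = 0;

/* Placeholder hash, NOT the textbook function the slide refers to. */
int HashFunction(char *s)
{
    unsigned h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return (int)(h % PRIME);
}

/* Returns the index of s in StringTable, inserting it on first sight.
   Returns -1 if the string table or memory is exhausted. */
int HashInsert(char *s)
{
    int bucket = HashFunction(s);
    size_t len = strlen(s) + 1;               /* include the '\0' */
    struct HashItem *item;

    for (item = HashTable[bucket]; item != NULL; item = item->next)
        if (strcmp(StringTable + item->index, s) == 0)
            return item->index;               /* already recorded */

    if ((size_t)StringTableIndex + len > (size_t)STRINGTABLELENGTH)
        return -1;
    item = malloc(sizeof *item);
    if (item == NULL)
        return -1;
    memcpy(StringTable + StringTableIndex, s, len);
    item->index = StringTableIndex;
    item->next = HashTable[bucket];           /* push on the bucket's chain */
    HashTable[bucket] = item;
    StringTableIndex += (int)len;
    return item->index;
}
```

Inserting cp, n, match and last in that order places them at string table offsets 0, 3, 5 and 11, and inserting cp a second time returns 0 again without growing the table.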

13
  • Internal representation of string constants
  • Needs conversion for the special characters.
  • "abc" => abc\0
  • "abc\"def" => abc"def\0
  • "abc\n" => abc\n (the two characters \ and n
    become one newline character)
  • Recognizing constant strings with special
    characters
  • Assuming a string cannot cross a line boundary.
  • Use yymore()
  • \"[^"\n]*  { int c;
  •              c = input();
  •              if (c != '"') { /* error */ }
  •              else if (yytext[yyleng-1] ==
                 '\\') {
  •                unput(c); yymore(); }
  •              else { /* found the whole
                 string, normal process */ } }
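The conversion to the internal representation can be sketched as a small C routine. The names convert_string and convert_short are hypothetical, and the set of escapes handled (\n, \t, and "any other escaped character stands for itself") is an assumption, since the slides only show \" and \n.

```c
#include <stddef.h>
#include <string.h>

/* Converts the text between the quotes of a string constant into its
   internal form. src is the raw lexeme without the surrounding quotes;
   dst must have room for strlen(src) + 1 bytes. Returns the new length. */
size_t convert_string(const char *src, char *dst)
{
    size_t n = 0;
    while (*src != '\0') {
        if (*src == '\\' && src[1] != '\0') {
            src++;                              /* look at the escaped character */
            switch (*src) {
            case 'n': dst[n++] = '\n'; break;
            case 't': dst[n++] = '\t'; break;
            default:  dst[n++] = *src; break;   /* e.g. an escaped quote */
            }
        } else {
            dst[n++] = *src;
        }
        src++;
    }
    dst[n] = '\0';
    return n;
}

/* Convenience wrapper with a fixed buffer, for short examples only. */
const char *convert_short(const char *src)
{
    static char buf[256];
    if (strlen(src) >= sizeof buf)
        return NULL;
    convert_string(src, buf);
    return buf;
}
```

The converted string is never longer than the raw text, so a scanner could also convert in place before inserting the result into the string table.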

14
Put it all together
  • Check out the token.l program.