Tutorial on Regular Expressions

Provided by: cqin

1
Tutorial on Regular Expressions
2
Matching Characters
  • ab - a followed by b
  • . wild-card character
  • a. - a followed by any character

3
Character Sets
  • [xyz] single character; only a character in the set
    x, y, z is allowed
  • [Hh]ello either Hello or hello
  • [0-9] specifies a range of allowed characters, from
    0 to 9
  • [^xyz] any character except x, y, z
  • [^a-zA-Z] any character other than a letter
  • ] include it as the 1st character in a set, e.g. []xyz]
  • [*+?()\\] no need to escape special characters inside a
    bracketed character set, except backslash
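
The sets above can be tried directly in a Tcl shell. This is a minimal sketch; the sample strings and the variable names r1 to r5 are made up for illustration.

```tcl
# Brace-quoted patterns keep Tcl from interpreting the square brackets.
set r1 [regexp {[Hh]ello} {Say hello}]   ;# h is in the set
set r2 [regexp {[Hh]ello} {Say jello}]   ;# j is not in the set
set r3 [regexp {[^a-zA-Z]} {abc}]        ;# every character is a letter
set r4 [regexp {[^a-zA-Z]} {abc1}]       ;# the digit is outside a-zA-Z
set r5 [regexp {[]x]} {a]b}]             ;# a closing bracket listed first is literal
puts "$r1 $r2 $r3 $r4 $r5"               ;# prints: 1 0 0 1 1
```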

4
Quantifiers
  • * zero or more
  • + one or more
  • ? zero or one
  • ba* b followed by zero or more a
  • (ab)+ one or more ab, specified within group ()
  • .* matches everything, including the empty string
  • .*\n all lines in input (greedy match)
  • [^\n]*\n zero or more non-newline characters followed by a
    newline, i.e. the first line
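
A short sketch of how the greedy quantifiers behave, assuming a three-line sample input; the variable names (input, m1, m2, all, firstLine) are illustrative only.

```tcl
set input "first\nsecond\nthird\n"

regexp {ba*} "xbaaay" m1            ;# m1 = baaa
regexp {(ab)+} "xababy" m2          ;# m2 = abab
regexp {.*\n} $input all            ;# greedy: extends to the last newline
regexp {[^\n]*\n} $input firstLine  ;# cannot cross a newline

puts "$m1 $m2"                            ;# baaa abab
puts [string equal $all $input]           ;# 1
puts [string equal $firstLine "first\n"]  ;# 1
```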

5
Quoting RE with Braces
  • "[^\n]*\n" won't work (inside double quotes Tcl treats
    [ ] as command substitution), but
  • {[^\n]*\n} will do
  • "\[^\n\]*\n" will also do
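
The three quoting styles can be compared side by side. This is a sketch under the assumption of a two-line sample string; failed, msg, a and b are illustrative names.

```tcl
set input "line one\nline two\n"

# Double quotes without backslashes: Tcl sees the square brackets as a
# command substitution, so the call fails before regexp ever runs.
set failed [catch {regexp "[^\n]*\n" $input} msg]

# Braces pass the pattern through untouched.
regexp {[^\n]*\n} $input a

# Double quotes work once the brackets are backslashed.
regexp "\[^\n\]*\n" $input b

puts $failed                 ;# 1
puts [string equal $a $b]    ;# 1
```

Brace quoting is usually the easier habit, since the pattern reaches the regular expression engine exactly as written.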

6
Alternation
  • Hello|hello either Hello or hello
  • (H|h)ello same as above
  • [Hh]ello also same as above
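
One way to check that the three spellings agree is to run them over the same inputs. The test strings below are assumptions chosen for illustration.

```tcl
set results {}
foreach pat {Hello|hello (H|h)ello [Hh]ello} {
    # Each pattern is tried against two matching strings and one non-match.
    lappend results [regexp $pat "Hello world"] \
                    [regexp $pat "hello world"] \
                    [regexp $pat "jello world"]
}
puts $results   ;# 1 1 0 1 1 0 1 1 0
```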

7
Anchoring a Match
  • ^ beginning of the string
  • $ end of the string
  • ^[ \t]+ leading spaces and tabs (at least one)
  • tion$ string ends with tion
  • ^abc.*xyz$ string begins with abc and ends with
    xyz
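
A small sketch of the anchors in use; the sample strings and variable names are illustrative assumptions.

```tcl
regexp {^[ \t]+} "  \tindented" lead     ;# lead = two spaces and a tab
set endsTion  [regexp {tion$} "station"]
set endsTion2 [regexp {tion$} "stationary"]
set framed    [regexp {^abc.*xyz$} "abc-123-xyz"]
set framed2   [regexp {^abc.*xyz$} "xabc-123-xyz"]
puts "[string length $lead] $endsTion $endsTion2 $framed $framed2"  ;# 3 1 0 1 0
```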

8
Backslash Quoting
  • To turn off special characters . * ? + [ ] ( ) ^ $ | \
    precede them with a backslash
  • \+ matches a literal +
  • No backslash is needed inside a bracketed expression
  • (\*|\?) is the same as [*?]
  • \\ to match a single backslash, you need two
  • \B also matches a backslash in advanced REs
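
The quoting rules can be checked with a few one-liners. This is a sketch with made-up sample strings; plus, viaGroup, viaSet and slash are illustrative names.

```tcl
set plus [regexp {\+} "1+1"]          ;# backslash turns off the quantifier
regexp {(\*|\?)} "what?" viaGroup     ;# escaped characters in an alternation
regexp {[*?]} "what?" viaSet          ;# same characters, no escapes needed
set slash [regexp {\\} {C:\temp}]     ;# two backslashes match one
puts "$plus $viaGroup $viaSet $slash" ;# 1 ? ? 1
```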

9
Matching Precedence
  • Earliest in the string
  • Longest in the string
  • Nongreedy quantifiers (such as *?) prefer the shortest match
  • ab* will match abbb on abbbx
  • ab* will match ab on ababbbx (earliest wins over longest)
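
The precedence rules can be observed by capturing the matched text. A minimal sketch; m1, m2 and m3 are illustrative variable names.

```tcl
regexp {ab*}  "abbbx"   m1   ;# longest match at the earliest position
regexp {ab*}  "ababbbx" m2   ;# earliest position wins over a longer later match
regexp {ab*?} "abbbx"   m3   ;# nongreedy: shortest match at that position
puts "$m1 $m2 $m3"           ;# abbb ab a
```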

10
Capturing Subpatterns
  • Use () parentheses to capture a subpattern
  • <td>([^<]*)</td> gets everything between the <td>
    and </td> tags
  • ([0-9]+)\.([0-9]*) will match the subpatterns before and
    after the decimal point
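
In Tcl the captured subpatterns land in the extra variables passed to regexp. A sketch, assuming made-up input strings and variable names:

```tcl
# First variable receives the whole match, the rest receive the groups.
regexp {<td>([^<]*)</td>} "<tr><td>42</td></tr>" whole cell
regexp {([0-9]+)\.([0-9]*)} "pi is 3.14159" number intPart fracPart

puts "$whole -> $cell"                  ;# <td>42</td> -> 42
puts "$number = $intPart . $fracPart"   ;# 3.14159 = 3 . 14159
```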

11
Advanced RE, Escape Sequences
  • Quote the RE pattern with braces {} to use advanced RE
  • \s space-like characters (white spaces)
  • \w letters, digit, and the underscore
  • \B backslash
  • \d digit
  • \S Non-space
  • \W Anything other than a letter, digit, or underscore
  • \D Non-digit
  • \A Beginning of a string
  • \Z Ending of a string

12
Advanced RE, Character Class
  • [:identifier:] valid only within a bracketed
    character set
  • [A-Za-z] = [[:alpha:]]
  • [[:digit:]] = [0-9] = \d
  • [[:space:]] = [ \b\f\n\r\t\v] = \s
  • [[:digit:][:alpha:]_] = [\d[:alpha:]_] =
    [[:alnum:]_] = \w
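
The equivalences can be spot-checked on a sample identifier. A sketch only; the strings and the names a to f are assumptions for illustration.

```tcl
set word "my_var1"
set a [regexp {^[[:alnum:]_]+$} $word]
set b [regexp {^[[:digit:][:alpha:]_]+$} $word]
set c [regexp {^\w+$} $word]
set d [regexp {^[[:digit:]]+$} "2024"]
set e [regexp {^\d+$} "2024"]
set f [regexp {^[[:alpha:]]+$} "Tcl8"]   ;# the digit 8 is not alphabetic
puts "$a $b $c $d $e $f"                 ;# 1 1 1 1 1 0
```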

13
Advanced RE, Nongreedy Quantifiers
  • .*?\n makes the shortest match, otherwise the longest
  • [^\n]*\n = .*?\n
  • <td>([^<]*)</td> = <td>(.*?)</td>
  • ? will match zero or one; ?? prefers zero and matches
    one only when the rest of the pattern requires it
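
The difference between greedy and nongreedy matching shows up clearly on repeated tags. A sketch with an assumed two-cell HTML fragment; the variable names are illustrative, and -> is simply a throwaway variable for the whole match.

```tcl
set html "<td>a</td><td>b</td>"

regexp {<td>(.*)</td>}  $html -> greedy   ;# runs to the last </td>
regexp {<td>(.*?)</td>} $html -> lazy     ;# stops at the first </td>

regexp {ab??}  "abb" zero   ;# ?? prefers to match no b at all
regexp {ab??c} "abc" one    ;# but takes one b when c must follow

puts $greedy        ;# a</td><td>b
puts $lazy          ;# a
puts "$zero $one"   ;# a abc
```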

14
Advanced RE, Bound Quantifiers
  • {m,n} at least m, at most n
  • {m} exactly m
  • {m,} m or more
  • Nongreedy ? can be added after them
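
A brief sketch of the bound forms on a run of digits; the sample strings and variable names are made up.

```tcl
set exact3  [regexp {^[0-9]{3}$} "123"]    ;# exactly three digits
set exact3b [regexp {^[0-9]{3}$} "1234"]   ;# four digits: no match
regexp {[0-9]{2,4}}  "1234567" upTo4       ;# greedy: takes four
regexp {[0-9]{2,}}   "1234567" atLeast2    ;# greedy: takes all seven
regexp {[0-9]{2,4}?} "1234567" lazy        ;# nongreedy: takes two
puts "$exact3 $exact3b $upTo4 $atLeast2 $lazy"   ;# 1 0 1234 1234567 12
```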

15
Advanced RE, Back References
  • \1, \2, \3, ... (count by left parentheses)
  • ('[^']*'|"[^"]*") quoted string (single or
    double quote)
  • ('|").*?\1 same as above, but simpler
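
The back reference forces the closing quote to be the same character as the opening one. A sketch; the sample sentence and the names quoted and q are assumptions.

```tcl
# The apostrophe inside the string does not end the match,
# because \1 must repeat the opening double quote.
regexp {('|").*?\1} {say "it's here" now} quoted q

puts $quoted   ;# "it's here"
puts $q        ;# "
```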

16
Advanced RE, Look-ahead
  • (?=re) peeks ahead without consuming; (?!re) is the
    negative form
  • A.*(?=\.txt) matches the filename before .txt
  • A.*(?!\.txt) matches a filename that does not end
    with .txt
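
A sketch of both look-ahead forms. The file name, the foo/bar strings and the variable names are hypothetical; -indices is used in the second call to show where the match landed.

```tcl
# Positive look-ahead: .txt must follow, but is not part of the match.
regexp {A.*(?=\.txt)} "Alpha.txt" stem

# Negative look-ahead: only the foo that is not followed by bar matches.
regexp -indices {foo(?!bar)} "foobar foobaz" idx

puts $stem   ;# Alpha
puts $idx    ;# 7 9
```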

17
Advanced RE, Character Codes
  • \nn or \mmm 8 bit Octal
  • \unnnn 16 bit Unicode
  • \Uyyyyyyyy 32 bit Unicode
  • \xnn 8 bit Hex (Pitfall! It will consume all following
    hexadecimal digits and truncate the result to 8 bits)

18
Advanced RE, Newline Sensitive Matching
  • -lineanchor option makes the ^ and $ anchors work
    relative to newlines
  • With or w/o -lineanchor, one can use \A and \Z to
    match the beginning and end of the string
  • -linestop will make . and [^ bracket expressions stop at a
    newline character, unless \n is explicitly
    specified.
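
The two options can be compared on a small multi-line string. A sketch; the text and variable names are illustrative assumptions.

```tcl
set text "one\ntwo\nthree"

set plain   [regexp {^two$} $text]               ;# anchors mean the whole string
set perLine [regexp -lineanchor {^two$} $text]   ;# anchors also apply at newlines
set atStart [regexp -lineanchor {\Atwo} $text]   ;# \A is still the start of the string

regexp {o.*} $text across              ;# . crosses newlines by default
regexp -linestop {o.*} $text oneLine   ;# . now stops at the newline

puts "$plain $perLine $atStart"     ;# 0 1 0
puts [string equal $across $text]   ;# 1
puts $oneLine                       ;# one
```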

19
Advanced RE, Expanded Syntax
  • Leading (?x) lets the RE pattern embed whitespace and
    # comments within the RE syntax
  • E.g.
      regexp {(?x)          # A pattern to match URLs
          ([^:]+):          # The protocol before the initial colon
          //([^:/]+)        # The server name
          (:([0-9]+))?      # The optional port number
          (/.*)             # The trailing pathname
      } $input

20
Regular Expression Commands
  • regexp Regular Expression
  • regsub Regular expression Substitution

21
Sample Scripts
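  • Url_Decode
      proc Url_Decode {url} {
          # Turn every + back into a space
          regsub -all {\+} $url { } url
          # Rewrite each %xx escape as a [format %c 0xXX] command
          regsub -all {%([[:xdigit:]]{2})} $url \
              {[format %c 0x\1]} url
          # Let subst run those format commands (it also expands any
          # other brackets, $ or backslashes present in url)
          return [subst $url]
      }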

22
References
  • http://www.lib.uchicago.edu/keith/tcl-course/topics/regexp.html
  • http://wiki.tcl.tk/10842 (Tcl Intro), which links to
    http://wiki.tcl.tk/986 (regexp)
  • Brent Welch et al., Practical Programming in Tcl
    and Tk, 4th ed., Chapter 11