Title: Tutorial on Regular Expressions
2. Matching Characters
- ab  a followed by b
- .  wildcard, matches any single character
- a.  a followed by any single character
3. Character Sets
- [xyz]  a single character; only x, y, or z is allowed
- [Hh]ello  either Hello or hello
- [0-9]  a range of allowed characters, from 0 to 9
- [^xyz]  any character except x, y, or z
- [^a-zA-Z]  any character other than a letter
- []xyz]  to include ] in a set, make it the first character
- [][+*?()|\\]  no need to escape special characters inside a bracketed character set, except the backslash itself
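A minimal Tcl sketch of the character sets above. The sample strings are invented for illustration, and each result is the 0 or 1 that regexp returns.

```tcl
# Patterns are braced so Tcl passes the square brackets through to regexp.
set r1 [regexp {[Hh]ello} "Say hello"]   ;# 1: h is in the set
set r2 [regexp {[Hh]ello} "Jello"]       ;# 0: J is not in the set
set r3 [regexp {[0-9]} "room 42"]        ;# 1: the string contains a digit
set r4 [regexp {[^a-zA-Z]} "abcXYZ"]     ;# 0: every character is a letter
set r5 [regexp {[]x]} "a]b"]             ;# 1: a closing bracket listed first is literal
puts "$r1 $r2 $r3 $r4 $r5"
```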
4. Quantifiers
- *  zero or more
- +  one or more
- ?  zero or one
- ba*  b followed by zero or more a
- (ab)+  one or more ab, grouped with ()
- .*  matches anything, including the empty string
- .*\n  all lines in the input (greedy match)
- [^\n]*\n  zero or more non-newline characters followed by a newline, i.e. the first line
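The difference between the last two patterns can be sketched in Tcl as follows. The two-line input string and the variable names are assumptions made for the example.

```tcl
set input "first line\nsecond line\n"

# Greedy .* runs to the last newline, so the match spans both lines.
regexp {.*\n} $input all

# A negated set cannot cross a newline, so only the first line matches.
regexp {[^\n]*\n} $input first

# (ab)+ needs at least one ab, while ba* accepts a bare b.
set n1 [regexp {^(ab)+$} "ababab"]   ;# 1
set n2 [regexp {^(ab)+$} ""]         ;# 0
set n3 [regexp {^ba*$} "b"]          ;# 1
```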
5. Quoting RE with Braces
- [^\n]*\n  won't work unquoted, because Tcl treats the brackets as command substitution, but
- {[^\n]*\n}  will do
- "\[^\\n]*\\n"  will also do
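A short Tcl sketch of the three quoting styles, assuming a made-up two-line input. The catch call only demonstrates that the unquoted form raises an error.

```tcl
set input "line one\nline two\n"

# Braces pass the pattern to regexp untouched.
regexp {[^\n]*\n} $input a

# In double quotes the opening bracket must be backslashed so that Tcl
# does not attempt command substitution; \\n reaches regexp as \n.
regexp "\[^\\n]*\\n" $input b

# Left bare, the bracket triggers command substitution and Tcl fails
# while looking for a command whose name starts with a caret.
set failed [catch {regexp [^\n]*\n $input c}]
```

Both quoted forms hand regexp the same pattern, so a and b hold the same first line, and failed is 1.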
6. Alternation
- Hello|hello  either Hello or hello
- (H|h)ello  same as above
- [Hh]ello  also same as above
7. Anchoring a Match
- ^  beginning of the string
- $  end of the string
- ^[ \t]+  leading spaces or tabs (at least one)
- tion$  string ends with tion
- ^abc.*xyz$  string begins with abc and ends with xyz
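The anchors can be tried out with a few lines of Tcl. The test strings are invented, and -indices is used here only to measure the width of the leading white space.

```tcl
# Two spaces and a tab precede the text.
set line "  \tindented"
regexp -indices {^[ \t]+} $line range
set width [expr {[lindex $range 1] - [lindex $range 0] + 1}]   ;# 3

set e1 [regexp {tion$} "station"]            ;# 1
set e2 [regexp {tion$} "stationary"]         ;# 0
set e3 [regexp {^abc.*xyz$} "abc-123-xyz"]   ;# 1
set e4 [regexp {^abc.*xyz$} "xabc-xyz"]      ;# 0
```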
8. Backslash Quoting
- Use a backslash to turn off the special characters . * ? + [ ] ( ) ^ $ | \
- \+  matches a literal +
- No backslash is needed inside a bracketed expression
- (\+|\?)  is the same as [+?]
- \\  to match a single backslash, you need two
- \B  also matches a backslash, in advanced REs
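A brief Tcl sketch of these quoting rules, using invented test strings.

```tcl
# A backslash makes the plus sign literal.
set p1 [regexp {^1\+1$} "1+1"]      ;# 1
set p2 [regexp {^1+1$} "1+1"]       ;# 0: here the plus sign is a quantifier

# Escaped alternation and a bracketed set accept the same characters.
set q1 [regexp {^(\+|\?)$} "?"]     ;# 1
set q2 [regexp {^[+?]$} "?"]        ;# 1

# Two backslashes in the pattern match one backslash in the input.
set b1 [regexp {^a\\b$} {a\b}]      ;# 1
```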
9. Matching Precedence
- Earliest in the string
- Longest in the string
- A nongreedy quantifier (trailing ?) takes the shortest match
- ab*  matches abbb in abbbx
- ab*  matches ab in ababbbx (the earliest match wins over the longer one)
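The precedence rules can be checked directly in Tcl. The variable names m1 to m3 are assumptions of this sketch.

```tcl
# The earliest match wins, then the longest one from that starting point.
regexp {ab*} "abbbx" m1     ;# m1 is abbb
regexp {ab*} "ababbbx" m2   ;# m2 is ab, although abbb occurs later

# A nongreedy quantifier prefers the shortest match instead.
regexp {ab*?} "abbbx" m3    ;# m3 is a
```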
10. Capturing Subpatterns
- Use () parentheses to capture a subpattern
- <td>([^<]*)</td>  captures everything between the <td> and </td> tags
- ([0-9]+)\.([0-9]+)  captures the digits before and after the decimal point
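A small Tcl sketch of both captures, assuming an invented HTML fragment. The variable names whole, cell, ip and fp are chosen for the example, and -> is just a throwaway variable for the full match.

```tcl
set html "<tr><td>42.75</td></tr>"

# whole holds the entire match; cell receives the text between the tags.
regexp {<td>([^<]*)</td>} $html whole cell

# ip and fp receive the digits before and after the decimal point.
regexp {([0-9]+)\.([0-9]+)} $cell -> ip fp

puts "$cell $ip $fp"
```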
11. Advanced RE, Escape Sequences
- Quote the RE pattern with braces {} to use advanced RE
- \s  space-like characters (white space)
- \w  letters, digits, and the underscore
- \B  backslash
- \d  digit
- \S  non-space
- \W  non-word character (not a letter, digit, or underscore)
- \D  non-digit
- \A  beginning of the string
- \Z  end of the string
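12. Advanced RE, Character Class
- [:identifier:]  valid only within a bracketed character set
- [[:alpha:]]  same as [A-Za-z]
- [[:digit:]]  same as [0-9] and \d
- [[:space:]]  same as [ \b\f\n\r\t\v] and \s
- [[:digit:][:alpha:]_], [\d[:alpha:]_], and [[:alnum:]_]  are all the same as \w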
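The escape sequences and the named classes can be compared side by side in Tcl. The sample string and variable names are assumptions of this sketch.

```tcl
set s "id_42 = 7"

regexp {\w+} $s word               ;# word is id_42
regexp {[[:alpha:]]+} $s letters   ;# letters is id
regexp {\s\S+$} $s tail            ;# tail is a space followed by 7

# \d and the digit class count the same three digits.
set d1 [regexp -all {\d} $s]
set d2 [regexp -all {[[:digit:]]} $s]
```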
13. Advanced RE, Nongreedy Quantifiers
- .*?\n  makes the shortest match; without the ? the match is the longest
- [^\n]*\n  can be written as .*?\n
- <td>([^<]*)</td>  can be written as <td>(.*?)</td>
- ?  matches zero or one; ?? prefers to match zero
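A Tcl sketch contrasting greedy and nongreedy matching, on invented input. The -> argument is a throwaway variable for the full match.

```tcl
set row "<td>a</td><td>b</td>"
regexp {<td>(.*)</td>} $row -> greedy    ;# greedy is a</td><td>b
regexp {<td>(.*?)</td>} $row -> lazy     ;# lazy is a

set text "one\ntwo\n"
regexp {.*?\n} $text firstLine           ;# only the first line

regexp {ab?} "ab" p                      ;# p is ab
regexp {ab??} "ab" q                     ;# q is a
```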
14. Advanced RE, Bound Quantifiers
- {m,n}  at least m, at most n
- {m}  exactly m
- {m,}  m or more
- A nongreedy ? can be added after any of them
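The bounds behave as follows in Tcl. The digit strings are invented test data.

```tcl
set z1 [regexp {^[0-9]{5}$} "60637"]   ;# 1: exactly five digits
set z2 [regexp {^[0-9]{5}$} "6063"]    ;# 0: only four digits
set z3 [regexp {^a{2,3}$} "aaaa"]      ;# 0: at most three allowed
set z4 [regexp {^a{2,}$} "aaaa"]       ;# 1: two or more

regexp {a{2,3}} "aaaa" g               ;# g is aaa
regexp {a{2,3}?} "aaaa" s              ;# s is aa
```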
15. Advanced RE, Back References
- \1, \2, \3, ...  refer to captured subpatterns (counted by left parentheses)
- ('[^']*'|"[^"]*")  quoted string (single or double quote)
- ('|").*?\1  same as above, but simpler
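Both quoted-string patterns can be compared in Tcl. The sample text is invented and deliberately contains an apostrophe inside a double-quoted string.

```tcl
set p1 {('[^']*'|"[^"]*")}
set p2 {('|").*?\1}
set text {say "it's here" now}

regexp $p1 $text m1
# \1 must repeat the opening quote, so the apostrophe does not end the match.
regexp $p2 $text m2
puts "$m1 $m2"
```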
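16. Advanced RE, Look-ahead
- (?=re) and (?!re)  peek ahead without consuming input
- ^A.*(?=\.txt)  matches the part of a filename beginning with A that comes before .txt
- ^A(?!.*\.txt$).*  matches a filename beginning with A that does not end with .txt (the negative test must come before the greedy .*, which would otherwise swallow the suffix)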
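A hedged Tcl sketch of positive and negative look-ahead on invented filenames. The negative pattern is an assumption of this sketch: it puts the look-ahead directly after the A, so that a greedy .* cannot run past the suffix before the test is made.

```tcl
# Positive look-ahead: .txt must follow but is not part of the match.
regexp {^A.*(?=\.txt)} "Alpha.txt" stem             ;# stem is Alpha

# Negative look-ahead: reject names that end with .txt.
set ok1 [regexp {^A(?!.*\.txt$).*$} "Alpha.doc"]    ;# 1
set ok2 [regexp {^A(?!.*\.txt$).*$} "Alpha.txt"]    ;# 0
```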
17. Advanced RE, Character Codes
- \nn or \mmm  8-bit octal
- \unnnn  16-bit Unicode
- \Uyyyyyyyy  32-bit Unicode
- \xnn  8-bit hex (Pitfall! It consumes all following hex digits and truncates the result to 8 bits)
18. Advanced RE, Newline Sensitive Matching
- The -lineanchor option makes the ^ and $ anchors work relative to newlines
- With or without -lineanchor, one can use \A and \Z to match the beginning and end of the string
- -linestop makes . and [^ ] sets stop at a newline character, unless \n is explicitly specified
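The two options can be tried on a small multi-line string. The text and variable names below are assumptions of this sketch.

```tcl
set text "alpha\nbeta\ngamma"

# Without -lineanchor, ^ and $ match only at the ends of the whole string.
set n1 [regexp -all {^[a-z]+$} $text]               ;# 0
# With -lineanchor they match at every line boundary.
set n2 [regexp -all -lineanchor {^[a-z]+$} $text]   ;# 3

# \A still means the start of the string, even with -lineanchor.
set a1 [regexp -lineanchor {^beta$} $text]          ;# 1
set a2 [regexp -lineanchor {\Abeta} $text]          ;# 0

# -linestop keeps the dot from crossing a newline.
regexp {.*} $text any
regexp -linestop {.*} $text line
```

Here any holds the whole string, while line holds only alpha.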
19. Advanced RE, Expanded Syntax
- A leading (?x) lets comments and white space be embedded in the RE pattern
- E.g.
    regexp {(?x)            # A pattern to match URLs
        ([^:]+):            # The protocol before the initial colon
        //([^:/]+)          # The server name
        (:([0-9]+))?        # The optional port number
        (/.*)               # The trailing pathname
    } $input
20. Regular Expression Commands
- regexp  matches a regular expression against a string
- regsub  performs regular expression substitution
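A minimal Tcl sketch of the two commands. The date string and the variable names are invented for the example.

```tcl
set date "2024-01-15"

# regexp returns 1 on a match and stores the submatches in the named variables.
set found [regexp {^(\d+)-(\d+)-(\d+)$} $date -> year month day]

# regsub writes the rewritten string to its last argument and returns the
# number of substitutions; \1 to \3 in the replacement refer to the groups.
set count [regsub {^(\d+)-(\d+)-(\d+)$} $date {\3/\2/\1} swapped]

# -all replaces every match rather than only the first.
regsub -all {\s+} "a   b \t c" " " squeezed
```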
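21. Sample Scripts
- Url_Decode
    proc Url_Decode {url} {
        # Turn each + into a space
        regsub -all {\+} $url { } url
        # Rewrite each %xx escape as a format command ...
        regsub -all {%([[:xdigit:]]{2})} $url \
            {[format %c 0x\1]} url
        # ... and let subst run those commands
        return [subst $url]
    }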
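A usage sketch for Url_Decode, with the procedure repeated so that the block runs on its own. The encoded sample string is made up for illustration.

```tcl
proc Url_Decode {url} {
    regsub -all {\+} $url { } url
    regsub -all {%([[:xdigit:]]{2})} $url {[format %c 0x\1]} url
    return [subst $url]
}

# %2C is a comma and %21 is an exclamation mark.
set decoded [Url_Decode "Hello%2C+World%21"]
puts $decoded
```

Note that subst also evaluates any brackets, dollar signs, or backslashes already present in the input, so a version meant for untrusted data would need to quote those characters first.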
22. References
- http://www.lib.uchicago.edu/keith/tcl-course/topics/regexp.html
- http://wiki.tcl.tk/10842 (Tcl Intro), which links to http://wiki.tcl.tk/986 (regexp)
- Brent Welch et al., Practical Programming in Tcl and Tk, 4th ed., Chapter 11