Title: Regular Expressions: Concepts
1. Regular Expressions: Concepts
- CSC492 Topics in Perl
- Joe Reynoldson
- 3/14/05
2. Regular Expressions (regex) Basics
- The purpose of a regex is to match a string against a pattern
- The syntax of regex is simple (although sometimes confusing), and it basically represents a whole new language
- Regex is not unique to Perl, but regex languages differ from one another
- Many (including Java) imitate Perl regex
- Regexes and globs are 2 different things
- Globs are wildcard matches, often in UNIX or DOS shell
- Example: *.pl
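To make the regex/glob distinction concrete, here is a minimal sketch. The file names are made up, and the "glob-like" pattern is only a rough regex stand-in for the shell glob *.pl (it uses an end-of-string anchor, which is an assumption beyond what this section covers).

```perl
use strict;
use warnings;

# Hypothetical file names, used only for illustration.
my @names = ('hello.pl', 'apple', 'notes.txt');

# As a regex, .pl means "any one character followed by pl", anywhere in the string.
my @regex_hits = grep { /.pl/ } @names;

# Rough regex equivalent of the glob *.pl: anything, a literal dot, then pl at the end.
my @glob_like_hits = grep { /^.*\.pl$/ } @names;

print "regex /.pl/ matches: @regex_hits\n";
print "glob-like matches:   @glob_like_hits\n";
```

Note that the regex also accepts apple (the "ppl" in the middle), which a shell glob of *.pl would never pick up.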
3. Regex explained
- Regexes divide an infinite set of input strings into 2 groups:
- Matched
- Didn't match
- There's no in between (no fuzzy matching)
- Regexes are often used as boolean expressions in conditionals (such as in if statements or while loops)
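The two-group idea can be sketched as a loop that sorts strings by whether a regex matches. The input strings here are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical input strings; any list of strings would do.
my @inputs = ('Waldo is here', 'Nobody is here', 'Where is Waldo?');

my (@matched, @unmatched);
for (@inputs) {                 # each string is aliased to $_ in turn
    if (/Waldo/) {              # the regex acts as a true/false test
        push @matched, $_;
    }
    else {
        push @unmatched, $_;
    }
}

print "Matched:      @matched\n";
print "Didn't match: @unmatched\n";
```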
4. Matching with the matching operator
- The matching operator ( m// ) matches $_ against the pattern specified between the slashes
- Most Perl hackers leave off the leading 'm' and just use slashes
    while (<IN>) {
        if ( /Waldo/ ) {
            print "I found Waldo in $_!\n";
        }
    }
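A runnable variant of the Waldo loop might look like the following. It assumes an in-memory filehandle in place of the slide's IN handle so that no real file is needed, and the sample text is invented.

```perl
use strict;
use warnings;

# In-memory "file" so the sketch runs without a real input file
# (an assumption; the slide reads from a filehandle named IN).
my $text = "Carmen was here\nWaldo was here\n";
open(my $in, '<', \$text) or die "Cannot open in-memory file: $!";

my @found;
while (<$in>) {            # each line is read into $_
    if (m/Waldo/) {        # m/Waldo/ and /Waldo/ mean the same thing
        push @found, $_;
        print "I found Waldo in $_";
    }
}
close($in);
```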
5. Metacharacters: Wildcard ( . )
- There are many characters which have special meaning in regexes
- The dot ( . ) is a wildcard which matches any character (except newline... we'll get to that)
- The pattern below matches if $_ is "I have a cat" or "I have a car"
- Also matches can, cab, cam, and ca9
    /I have a ca./
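A quick way to check the wildcard's behavior is to filter a few candidate strings through the pattern. The strings below are illustrative; the last two show the cases where the dot has nothing (or only a newline) to match.

```perl
use strict;
use warnings;

# Illustrative inputs: three that should match, two that should not.
my @inputs = (
    'I have a cat',
    'I have a car',
    'I have a ca9',
    'I have a ca',      # nothing after "ca" for the dot to match
    "I have a ca\n",    # the dot does not match a newline
);

# grep places each string in $_ and keeps those where the regex matches.
my @hits = grep { /I have a ca./ } @inputs;

print "Matched: $_\n" for @hits;
```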
6. Metacharacters: Escape ( \ )
- To match a period, you must escape it
- The escape character ( \ ) is our second metacharacter
- Use escape whack to match a whack: /\\/
- Use whack-a-mole to relieve tension
    /3\.14159/
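The difference between an escaped and an unescaped dot shows up as soon as the input contains something other than a period in that position. A small sketch, with made-up inputs:

```perl
use strict;
use warnings;

# Illustrative inputs; 'C:\\temp' in single quotes is the string C:\temp.
my @inputs = ('3.14159', '3x14159', 'C:\\temp');

# Unescaped dot: any character may sit between 3 and 14159.
my @unescaped = grep { /3.14159/ } @inputs;

# Escaped dot: only a literal period is accepted.
my @escaped = grep { /3\.14159/ } @inputs;

# Escaped backslash: matches a literal backslash.
my @whacks = grep { /\\/ } @inputs;

print "unescaped: @unescaped\n";
print "escaped:   @escaped\n";
print "whacks:    @whacks\n";
```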
7. Matching Operator Variation
- The slashes can be replaced by pretty much any character
- This is handy when matching against data with many slashes (such as the full path to a file)
- With a delimiter other than the slash, the leading 'm' is required: m#/home/j/jreynold#
- Better than the alternative, /\/home\/j\/jreynold/
    if ( m#/home/j/jreynold# ) {
        print "$_ contains my home directory\n";
    }
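As a sketch of how the delimiter choice affects only readability, the following runs the same path test three ways. The path is hypothetical, and the # and { } delimiters are just two of the many Perl accepts after a leading m.

```perl
use strict;
use warnings;

# Hypothetical path used only for illustration.
my $path = '/home/j/jreynold/notes.txt';

my %hit;
for ($path) {    # alias $path to $_ so the bare patterns apply to it
    $hit{slashes} = /\/home\/j\/jreynold/ ? 1 : 0;   # default delimiter, slashes escaped
    $hit{hashes}  = m#/home/j/jreynold#   ? 1 : 0;   # m is required with #
    $hit{braces}  = m{/home/j/jreynold}   ? 1 : 0;   # paired delimiters also work
}

print "$_: $hit{$_}\n" for sort keys %hit;
```

All three forms compile to the same pattern, so they agree on every input.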
8. A Quick CGI Example
- Suppose you want to determine if a web page was requested by a browser in the usd.edu domain
- Of course this regex fails miserably, but how? (Hint: There are an infinite number of input strings)
    $_ = $ENV{'REMOTE_HOST'};
    if ( /\.usd\.edu/ ) {
        print "You appear to be on a <b>U.</b> computer\n";
    }
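One way to explore the hint is to feed the pattern a few host names that merely contain .usd.edu. The host names below are invented, and the tightened pattern relies on the end-of-string anchor $, which is an assumption beyond what this section has covered.

```perl
use strict;
use warnings;

# Invented host names: one genuine, two that only contain ".usd.edu".
my @hosts = (
    'www.usd.edu',
    'www.usd.edu.example.com',
    'not.usd.education.example',
);

# The pattern from the slide accepts a match anywhere in the string.
my @loose = grep { /\.usd\.edu/ } @hosts;

# One possible tightening: require the match at the end of the string.
my @tight = grep { /\.usd\.edu$/ } @hosts;

print "loose: @loose\n";
print "tight: @tight\n";
```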
9. Metacharacters: Simple Quantifiers
- Quantifiers allow a programmer to specify how many matches s?he wants (hee hee)
- Here they are in no particular order:
- ? matches a pattern 0 or 1 time (/s?he/ would match she or he, and now you're in on my geeky joke! You too can be the life of the party!)
- * matches the pattern 0 or more times (/s*he/ matches he, she, sshe, ssshe, ...)
- + matches the pattern 1 or more times (/s+he/ matches she, sshe, ssshe, ...)
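To see what each quantifier actually consumes, this sketch records the matched text for /s?he/, /s*he/, and /s+he/ on three inputs. It leans on the special variable $&, which holds the text of the last successful match; that variable is not introduced in this section, so treat it as an assumption.

```perl
use strict;
use warnings;

my @rows;
for ('he', 'she', 'ssshe') {
    # $& holds the text matched by the most recent successful match.
    my $optional     = /s?he/ ? $& : 'no match';   # 0 or 1 "s"
    my $zero_or_more = /s*he/ ? $& : 'no match';   # 0 or more "s"
    my $one_or_more  = /s+he/ ? $& : 'no match';   # 1 or more "s"
    push @rows, sprintf('%s: ? -> %s, * -> %s, + -> %s',
                        $_, $optional, $zero_or_more, $one_or_more);
}

print "$_\n" for @rows;
```

On ssshe, /s?he/ still succeeds because a regex may match anywhere in the string, but it only consumes the final she.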
10. Escaping Quantifiers
- Each quantifier is a metacharacter, and therefore must be escaped if you'd like to match it literally
- /c\++/ matches c+, c++, c+++, ... (but not c)
- /get out\??/ matches get out and get out?
- /\*+bold\*+/ matches *bold*, **bold**, ***bold***, oh you get the picture...
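A short sketch of escaped quantifiers in action, using the patterns /c\++/, /get out\??/, and /\*+bold\*+/ against a few sample strings (the sample strings are chosen only for illustration):

```perl
use strict;
use warnings;

# \+ is a literal plus sign; the trailing + means "one or more of them".
my @plus = grep { /c\++/ } ('c', 'c+', 'c++');

# \? is a literal question mark; the trailing ? makes it optional.
my @question = grep { /get out\??/ } ('get out', 'get out?', 'stay in');

# \* is a literal asterisk; + requires at least one on each side of "bold".
my @stars = grep { /\*+bold\*+/ } ('bold', '*bold*', '**bold**');

print "plus:     @plus\n";
print "question: @question\n";
print "stars:    @stars\n";
```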
11. Metacharacters: Grouping Patterns
- Parentheses ( ( ) ) can group items together for the sake of quantifying
    print "Continue? (yes or no) ";
    chomp($_ = <STDIN>);
    if ( /y(es)?/ ) {   # check for y or yes
        print "Continuing!\n";
    }
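The effect of the parentheses is easiest to see by comparing /y(es)?/ with the ungrouped /yes?/, where the ? applies only to the s. This sketch again uses $& (the text of the last successful match), which is an assumption beyond this section.

```perl
use strict;
use warnings;

my @rows;
for ('y', 'ye', 'yes') {
    # Grouped: the ? makes the whole "es" optional.
    my $grouped   = /y(es)?/ ? $& : 'no match';
    # Ungrouped: the ? makes only the "s" optional, so "ye" is required.
    my $ungrouped = /yes?/   ? $& : 'no match';
    push @rows, sprintf('%s: grouped -> %s, ungrouped -> %s',
                        $_, $grouped, $ungrouped);
}

print "$_\n" for @rows;
```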
12. Metacharacters: Alternation
- Alternation is a fancy regex term for 'or'
- The vertical bar ( | ) placed between two patterns means match the left hand side or the right hand side
- /text|pdf|html/ means match the string text, or the string pdf, or the string html
- From our previous example:
    if ( /(Y|y)(es|eah|ah)?/ )
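Both alternation patterns, /text|pdf|html/ and /(Y|y)(es|eah|ah)?/, can be tried against a handful of sample strings. The file names and answers below are invented for illustration.

```perl
use strict;
use warnings;

# Invented file names: two contain one of the alternatives, one does not.
my @files = grep { /text|pdf|html/ } ('report.pdf', 'index.html', 'photo.png');

# Invented answers: anything containing Y or y counts, with an optional ending.
my @answers = grep { /(Y|y)(es|eah|ah)?/ } ('Yes', 'yeah', 'yah', 'y', 'no');

print "files:   @files\n";
print "answers: @answers\n";
```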