Title: Regular Expressions
1 $address =~ m/(\d* .*)\n(.*?), ([A-Z]{2}) (\d{5})-?(\d{0,5})/
2 Introduction to Regular Expressions
- It's all about patterns
- Character classes match any text of a certain type
- Repetition operators specify a recurring pattern
- Search flags change how the RegEx operates
- In this presentation
- green denotes a character class
- yellow denotes a repetition quantifier
- orange denotes a search flag or other symbol
- My examples use Perl syntax
3 Introduction to Regular Expressions
- Basic syntax
- All RegEx statements must begin and end with /
- /something/
- Escaping reserved characters is crucial
- /(i.e. / is invalid because ( must be closed
- However, /\(i\.e\. / is valid for finding (i.e.
- Reserved characters include
- . * ? + ( ) [ ] { } / \ |
- Also, some characters have special meanings based on their position in the statement
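The escaping rule above can be checked quickly. The slides use Perl; the sketch below uses Python's re module, which treats these reserved characters the same way. The sample sentence is made up for illustration.

```python
import re

# An unescaped "(" opens a group that is never closed, so compilation fails.
try:
    re.compile(r"(i.e. ")
    unescaped_ok = True
except re.error:
    unescaped_ok = False

# Escaping the reserved characters makes them match literally.
escaped = re.compile(r"\(i\.e\. ")
found = escaped.search("Vowels (i.e. a, e, i, o, u)")

# re.escape builds the escaped form from plain text for you.
auto_escaped = re.escape("(i.e.")

print(unescaped_ok, repr(found.group(0)), auto_escaped)
```

When the text to match comes from user input, re.escape is usually safer than escaping by hand.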
4 Regular Expression Matching
- Text Matching
- A RegEx can match plain text
- ex. if ($name =~ /Dan/) { print "match"; }
- But this will match Dan, Danny, Daniel, etc.
- Full Text Matching with Anchors
- Might want to match a whole line (or string)
- ex. if ($name =~ /^Dan$/) { print "match"; }
- This will only match Dan
- ^ anchors to the front of the line
- $ anchors to the end of the line
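As a rough illustration of the difference anchors make, here is the same comparison in Python (the slides use Perl; the list of names is invented for the example).

```python
import re

names = ["Dan", "Danny", "Daniel", "Stan"]

# Unanchored: the pattern may match anywhere inside the string.
loose = [n for n in names if re.search(r"Dan", n)]

# Anchored with ^ and $: the whole string must be exactly "Dan".
strict = [n for n in names if re.search(r"^Dan$", n)]

print(loose)
print(strict)
```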
5 Regular Expression Matching
- Order of results
- The search will begin at the start of the string
- This can be altered, don't ask yet
- Every character is important
- Any plain text in the expression is treated literally
- Nothing is neglected (close doesn't count)
- / s/ is not the same as /  s/ (whitespace counts: one space versus two)
- Far easier to write than to debug!
6 Regular Expression Char Classes
- Allows specification of only certain allowable chars
- [dofZ] matches only the letters d, o, f, and Z
- If you have a string "dog" then /[dofZ]/ would match "d" only even though "o" is also in the class
- So this expression can be stated "match one of either d, o, f, or Z."
- [A-Za-z] matches any letter
- [a-fA-F0-9] matches any hexadecimal character
- [^*$/\\] matches anything BUT *, $, /, or \
- The ^ in the front of the char class specifies "not"
- In a char class, you generally only need to escape \ ] ^ and -
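A small sketch of character classes in Python (the slides use Perl, but the class syntax is the same here). The sample strings are invented for illustration.

```python
import re

# [dofZ] matches exactly one character from the set; search stops at the first hit.
first = re.search(r"[dofZ]", "dog").group(0)

# Ranges: every character must be a hexadecimal digit for fullmatch to succeed.
is_hex = re.fullmatch(r"[a-fA-F0-9]+", "cafe") is not None
not_hex = re.fullmatch(r"[a-fA-F0-9]+", "cage") is None

# A leading ^ negates the class. Deleting everything the negated class
# matches leaves only the four excluded characters: * $ / and \
kept = re.sub(r"[^*$/\\]", "", r"a*b$c/d\e")

print(first, is_hex, not_hex, kept)
```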
7 Regular Expression Char Classes
- Special character classes match specific characters
- \d matches a single digit
- \w matches a word character (A-Z, a-z, 0-9, _)
- \b matches a word boundary /\bword\b/
- \s matches a whitespace character (spc, tab, newln)
- . wildcard matches everything except newlines
- Use very carefully, you could get anything!
- To match "anything but", capitalize the char class
- i.e. \D matches anything that isn't a digit
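The shorthand classes behave the same way in Python's re module, so a short Python sketch (not the presenter's Perl) can show each one at work. The sample text is made up.

```python
import re

text = "Room 42\tis open"

digits = re.findall(r"\d", text)          # each single digit
only_digits = re.sub(r"\D", "", text)     # \D is the negation of \d
fields = re.split(r"\s", text)            # space and tab both count as whitespace
word_ok = re.fullmatch(r"\w+", "abc_123") is not None  # letters, digits, underscore

# \b only matches between a word character and a non-word character,
# so the "is" inside "this" is skipped.
boundary_at = re.search(r"\bis\b", "this is it").start()

# The . wildcard skips the newline unless a flag says otherwise.
dot_hits = re.findall(r".", "a\nb")

print(digits, only_digits, fields, word_ok, boundary_at, dot_hits)
```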
8 Regular Expression Char Classes
- Character Class Examples
- $bodyPart =~ /e\w\w/
- Matches ear, eye, etc
- $thing = "1, 2, 3 strikes!"; $thing =~ /\s\d/
- Matches " 2" (a space followed by a digit)
- $thing = "1, 2, 3 strikes!"; $thing =~ /[\s\d]/
- Matches 1
- Not always useful to match single characters
- $phone =~ /\d\d\d-\d\d\d-\d\d\d\d/
- There's a better way
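The difference between a sequence (\s\d) and a class ([\s\d]) is easy to miss, so here is a hedged Python rendering of the examples above. The phone number and the word list are placeholders, and fullmatch is used where the slide's Perl pattern is unanchored.

```python
import re

thing = "1, 2, 3 strikes!"

# \s\d is a sequence: one whitespace character, then one digit.
sequence = re.search(r"\s\d", thing).group(0)

# [\s\d] is a class: one character that is either whitespace or a digit.
either = re.search(r"[\s\d]", thing).group(0)

# e\w\w: an "e" followed by exactly two word characters.
parts = [w for w in ["ear", "eye", "toe", "elbow"] if re.fullmatch(r"e\w\w", w)]

# Spelling out every digit works, but is tedious.
phone_ok = re.fullmatch(r"\d\d\d-\d\d\d-\d\d\d\d", "555-867-5309") is not None

print(repr(sequence), repr(either), parts, phone_ok)
```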
9 Regular Expression Repetition
- Repetition allows for flexibility
- Range of occurrences
- $weight =~ /\d{2,3}/
- Matches any weight from 10 to 999
- $name =~ /\w{5,}/
- Matches any name of 5 letters or more
- if ($SSN !~ /^\d{9}$/) { print "Invalid SSN!"; }
- Matches exactly 9 digits
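A minimal Python sketch of the counted quantifiers above. The helper name valid_ssn and the sample values are hypothetical, and fullmatch stands in for the ^...$ anchors.

```python
import re

def valid_ssn(ssn: str) -> bool:
    """Return True only for a string of exactly nine digits."""
    return re.fullmatch(r"\d{9}", ssn) is not None

# {2,3}: at least two and at most three digits.
weights = [w for w in ["7", "85", "250", "1000"] if re.fullmatch(r"\d{2,3}", w)]

# {5,}: five or more word characters, with no upper limit.
long_names = [n for n in ["Dan", "Danny", "Daniel"] if re.search(r"\w{5,}", n)]

print(weights, long_names, valid_ssn("123456789"))
```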
10 Regular Expression Repetition
- General Quantifiers
- Some more special characters
- $favoriteNumber =~ /\d*/
- Matches any size number or no number at all
- $firstName =~ /\w+/
- Matches one or more characters
- $middleInitial =~ /\w?/
- Matches one or zero characters
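The three general quantifiers can be compared side by side. This is a Python sketch with invented inputs, using fullmatch so that the whole string has to fit the pattern.

```python
import re

# *  : zero or more, so even the empty string matches.
star_empty = re.fullmatch(r"\d*", "") is not None
star_many = re.fullmatch(r"\d*", "8675309") is not None

# +  : one or more, so the empty string does not match.
plus_empty = re.fullmatch(r"\w+", "") is not None
plus_many = re.fullmatch(r"\w+", "Dan") is not None

# ?  : zero or one.
optional = [s for s in ["", "Q", "QQ"] if re.fullmatch(r"\w?", s) is not None]

print(star_empty, star_many, plus_empty, plus_many, optional)
```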
11 Regular Expression Repetition
- Greedy vs Nongreedy matching
- Greedy matching gets the longest results possible
- Nongreedy matching gets the shortest possible
- Let's say $robot = "The12thRobotIs2ndInLine"
- $robot =~ /\w*\d+/ (greedy)
- Matches The12thRobotIs2
- Maximizes the length of \w*
- $robot =~ /\w*?\d+/ (nongreedy)
- Matches The12
- Minimizes the length of \w*?
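The robot example can be reproduced in Python, whose greedy and nongreedy quantifiers follow the Perl rules. The patterns below assume the reading \w*\d+ and \w*?\d+, with a \d+ on the end of each.

```python
import re

robot = "The12thRobotIs2ndInLine"

# Greedy: \w* grabs as much as it can, then backs off until \d+ can match.
greedy = re.search(r"\w*\d+", robot).group(0)

# Nongreedy: \w*? grows one character at a time until \d+ can match.
lazy = re.search(r"\w*?\d+", robot).group(0)

print(greedy)
print(lazy)
```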
12 Regular Expression Repetition
- Greedy vs Nongreedy matching
- Suppose $txt = "something is so cool"
- $txt =~ /something/
- Matches something
- $txt =~ /so(mething)?/
- Matches something and the second so
- $txt =~ /so(mething)??/
- Matches only so and the second so
- Doesn't really make sense to do this
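To see every match rather than just the first, the same patterns can be run through Python's re.finditer (a sketch; the slides themselves use Perl).

```python
import re

txt = "something is so cool"

# Greedy optional group: take "mething" whenever it is available.
greedy = [m.group(0) for m in re.finditer(r"so(mething)?", txt)]

# Nongreedy optional group: skip "mething" whenever the match can succeed without it.
lazy = [m.group(0) for m in re.finditer(r"so(mething)??", txt)]

print(greedy)
print(lazy)
```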
13 Regular Expression Real Life Examples
- Using what you've learned so far, you can
- Validate a standard 8.3 file name
- $path =~ /^\w{1,8}\.[A-Za-z0-9]{2,3}$/
- Account for poorly spelled user input
- $answer =~ /ban{1,2}an{1,2}a/
- $iansLastName =~ /P[ae]t{1,2}ers[oe]n/
- $iansFirstName =~ /E?[Ii]?[aeo]?n/
- Matches Ian, Ean, Eian, Eon, Ien, Ein
- At least everyone gets the n right
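A quick way to gain confidence in patterns like these is to run them over a list of expected hits and misses. The Python sketch below does that; the constant names and test strings are made up, and fullmatch plays the role of the ^...$ anchors.

```python
import re

FILE_8_3 = re.compile(r"\w{1,8}\.[A-Za-z0-9]{2,3}")
BANANA = re.compile(r"ban{1,2}an{1,2}a")
IAN = re.compile(r"E?[Ii]?[aeo]?n")

def accepted(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if pattern.fullmatch(c)]

files = accepted(FILE_8_3, ["report.txt", "averylongname.txt", "notes.c"])
bananas = accepted(BANANA, ["banana", "bannana", "bananna", "bannnana"])
ians = accepted(IAN, ["Ian", "Ean", "Eian", "Eon", "Ien", "Ein", "Ann"])

print(files, bananas, ians)
```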
14 Alternation
- Alternation allows multiple possibilities
- Let $story = "He went to get his mother"
- $story =~ /(He|She)\b.*?\b(his|her)\b.*?(mother|father|brother|sister|dog)/
- Also matches "She punched her fat brother"
- Make sure the grouping is correct!
- $ans =~ /^(true|false)$/
- Matches only true or false
- $ans =~ /^true|false$/ (same as /(^true|false$)/)
- Matches "true never" or "not really false"
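The grouping pitfall is worth seeing in action. In this Python sketch the two patterns differ only in their parentheses; the variable names are illustrative.

```python
import re

# Anchors apply to the whole group: the string must be exactly "true" or "false".
strict = re.compile(r"^(true|false)$")

# Without the group, | splits the pattern into "^true" OR "false$".
loose = re.compile(r"^true|false$")

samples = ["true", "false", "true never", "not really false"]
strict_hits = [s for s in samples if strict.search(s)]
loose_hits = [s for s in samples if loose.search(s)]

story = re.search(
    r"(He|She)\b.*?\b(his|her)\b.*?(mother|father|brother|sister|dog)",
    "She punched her fat brother",
)

print(strict_hits, loose_hits, story.groups())
```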
15 Grouping for Backreferences
- Backreferences
- With all these wildcards and possible matches, we usually need to know what the expression finally ended up matching.
- Backreferences let you see what was matched
- Can be used after the expression has evaluated or even inside the expression itself
- Handled very differently in different languages
- Numbered from left to right, starting at 1
16 Grouping for Backreferences
- Perl backreferences
- Used inside the expression
- $txt =~ /\b(\w+)\s+\1\b/
- Finds any duplicated word, must use \1 here
- Used after the expression
- $class =~ /(.+?)-(\d+)/
- The text before the hyphen is stored in the Perl variable $1 (not \1) and the number goes in $2
- print "I am in class $1, section $2";
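For comparison, here is roughly how the same two uses look in Python, where \1 works inside the pattern and m.group(n) takes the place of Perl's $1 and $2. The sample strings, including the class code CS101-3, are invented.

```python
import re

# Inside the expression: \1 must repeat whatever group 1 captured.
dup = re.search(r"\b(\w+)\s+\1\b", "Paris in the the spring")

# After the expression: read the captured groups from the match object.
m = re.search(r"(.+?)-(\d+)", "CS101-3")
message = f"I am in class {m.group(1)}, section {m.group(2)}"

print(dup.group(0))
print(message)
```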
17 Grouping for Backreferences
- Java backreferences
- Annoying but still useful
- Pattern p = Pattern.compile("(.+?)-(\\d+)");
- Matcher m = p.matcher(mySchedule);
- m.find();
- System.out.println("I am in class " + m.group(1) + ", section " + m.group(2));
- Ugly, but usually better than the alternative
- m.group() returns the entire string matched
18 Grouping for Backreferences
- Javascript backreferences
- Used inside the expression
- Supported with \1, as in Perl
- Used after the expression
- /(.+?)-(\d+)/.test(myClass)
- alert(RegExp.$1)
- str = str.replace(/(\S+)\s+(\S+)/, "$2 $1")
- RegExp supports all of Perl's special backreference variables (wait a few slides)
19 Grouping for Backreferences
- PHP/Python backreferences
- Allows the use of specifically named backreferences
- Groups also maintain their numbers
- .NET backreferences
- Allows named backreferences
- If you access named groups by number, the numbering is easy to get wrong: named groups are numbered after all unnamed groups
- Check the web for info on how to use backreferences in these and other languages.
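As a brief sketch of named backreferences in Python: a group written as (?P<name>...) can be read by name and still keeps its number. The group names course and section and the sample string are assumptions for the example.

```python
import re

m = re.search(r"(?P<course>.+?)-(?P<section>\d+)", "CS101-3")

by_name = (m.group("course"), m.group("section"))
by_number = (m.group(1), m.group(2))   # named groups also keep their numbers
as_dict = m.groupdict()

print(by_name, by_number, as_dict)
```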
20 Grouping without Backreferences
- Sometimes you just need to make a group
- If important groups must be backreferenced, disable backreferencing for any unimportant groups
- $sentence =~ /(?:He|She) likes (\w+)\./
- I don't care if it's a he or she
- All I want to know is what he/she likes
- Therefore I use (?:) to forgo the backreference
- $1 will contain that thing that he/she likes
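The effect of (?:...) on group numbering can be seen by comparing the captured groups with and without it. This is a Python sketch with a made-up sentence.

```python
import re

sentence = "She likes regex."

# Capturing group for the pronoun: the liked thing ends up in group 2.
capturing = re.search(r"(He|She) likes (\w+)\.", sentence)

# Non-capturing group: the pronoun is grouped but not stored,
# so the liked thing is group 1.
non_capturing = re.search(r"(?:He|She) likes (\w+)\.", sentence)

print(capturing.groups())
print(non_capturing.groups())
```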
21 Matching Modes
- Matching has different functional modes
- Modes can be set by flags outside the expression (only in some languages' implementations)
- $name =~ /[a-z]+/i
- i turns off case sensitivity
- $xml =~ /title="([\w ]*)".*keywords="([\w ]*)"/s
- s enables . to match newlines
- $report =~ /^\s*Name[\s\S]*?The End.\s*$/m
- m lets ^ and $ match at the start and end of every line, not just of the whole string
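Python exposes the same three modes as re.I, re.S and re.M, passed outside the pattern. The sketch below contrasts each flag with the default behaviour; the sample strings are invented.

```python
import re

# i: ignore case.
no_i = re.search(r"[a-z]+", "DAN")
with_i = re.search(r"[a-z]+", "DAN", re.I)

# s: let . match newlines, so .* can span two lines.
xml = 'title="My Page"\nkeywords="perl regex"'
pattern = r'title="([\w ]*)".*keywords="([\w ]*)"'
no_s = re.search(pattern, xml)
with_s = re.search(pattern, xml, re.S)

# m: let ^ and $ match at every line boundary.
lines = "one\ntwo\nthree"
no_m = re.findall(r"^\w+$", lines)
with_m = re.findall(r"^\w+$", lines, re.M)

print(with_i.group(0), with_s.groups(), with_m)
```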
22 Matching Modes
- Matching has different functional modes
- Modes can be set by flags inside the expression (not in every language; Javascript, for one, has traditionally lacked them)
- $password =~ /^[a-z](?i)[a-jp-xz0-9]{4,11}$/
- If an insane web site specifies that your password must begin with a lowercase letter followed by 4 to 11 upper/lower alphanumeric characters excluding k through o and y.
- $element =~ /^(?i)[A-Z](?-i)[a-z]?$/
- (?i) makes the first letter case insensitive (if they type o, but meant O, we still know they mean oxygen). (?-i) makes sure the second letter is lowercase, otherwise it's 2 elements
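Inline flags are one of the places where engines differ. As far as I know, Python's re module does not accept Perl's mid-pattern (?i) ... (?-i) toggles, but it does offer a scoped form, (?i:...), that covers both examples. The sketch below is therefore an adaptation rather than a literal translation, and the names PASSWORD and ELEMENT are made up.

```python
import re

# Lowercase first letter, then 4 to 11 case-insensitive characters
# from a-j, p-x, z and 0-9 (so no k through o, and no y).
PASSWORD = re.compile(r"[a-z](?i:[a-jp-xz0-9]{4,11})")

# First letter in either case, optional second letter strictly lowercase.
ELEMENT = re.compile(r"(?i:[A-Z])[a-z]?")

passwords = [p for p in ["aBcd9", "Abcd9", "akcd9", "abcd"] if PASSWORD.fullmatch(p)]
elements = [e for e in ["o", "O", "he", "He", "HE"] if ELEMENT.fullmatch(e)]

print(passwords, elements)
```

Scoping the flag to a group keeps the rest of the pattern case sensitive without needing a matching "off" switch.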
23 Regular Expression Replacing
- Replacements simplify complex data modification
- Generally the first part of a replace command is the regular expression and the second part is what to replace the matched text with
- Usually a backreference variable can be used in the replacement text to refer to a group matched in the expression
- The RegEx engine continues searching at the point in the string following the replacement
- Replacements use all the same syntax, but have several unique features and are implemented very differently in various languages.
24 Regular Expression Replacing
- Perl replacement syntax
- $phone =~ s/\D//
- Removes the first non-digit character in a phone number
- Note that leaving the replacement blank deletes the match
- $html =~ s/^(\s*)/$1\t/
- Adds a tab to a line of HTML using backreferences
- $sample =~ s/[abc]/[ABC]/
- Might not do what is expected
- The second part is NOT a regular expression, it's a string
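A rough Python equivalent of these single replacements uses re.sub with count=1, since Perl's s/// without the g flag replaces only the first match. The sample phone number and HTML line are placeholders.

```python
import re

# Delete the first non-digit only (empty replacement deletes the match).
phone = re.sub(r"\D", "", "(555) 867-5309", count=1)

# Keep the leading whitespace (group 1) and add a tab after it.
html = re.sub(r"^(\s*)", r"\1" + "\t", "  <p>hi</p>", count=1)

# The replacement is a plain string, so the brackets are inserted literally.
sample = re.sub(r"[abc]", "[ABC]", "cab", count=1)

print(repr(phone), repr(html), repr(sample))
```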
25 Regular Expression Replacing
- Java replacement syntax (sucks)
- Pattern p = Pattern.compile("\\\\\\\\server(\\d)");
- p.matcher(netPath).replaceAll("\\\\\\\\workstation$1");
- Yes, you actually have to use 8 \s to make \\
- Any \ in the expression needs to be doubled
- Matcher should parse replacement for $1
- This has the same effect but is slightly faster than
- netPath.replaceAll("\\\\\\\\server(\\d)", "\\\\\\\\workstation$1");
- No, you can't seem to use .replace() (it does literal, non-regex replacement)
26 Replacement Modes
- Replacements can be performed singly or globally
- The examples I have been using replace only single occurrences of patterns
- Use the g flag to force the expression to scan the entire string
- $phone =~ s/\D//g
- Removes all non-digits in the phone number
- $myGarage =~ s/Jeep|Cougar/Boeing/g
- Gives me jets in exchange for cars
- Don't use it if it's not necessary
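In Python the default is reversed: re.sub replaces every match unless count limits it, so global replacement needs no flag. A minimal sketch with invented sample data:

```python
import re

raw_phone = "(555) 867-5309"

# Single replacement (Perl without g) versus global replacement (Perl with g).
first_only = re.sub(r"\D", "", raw_phone, count=1)
all_digits = re.sub(r"\D", "", raw_phone)

garage = re.sub(r"Jeep|Cougar", "Boeing", "Jeep, Cougar, bicycle")

print(first_only, all_digits, garage)
```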
27 Combining Replace and Match Modes
- Combining modes is easy
- To combine modes, just append the flags
- $alphabet =~ s/Q//gi
- Get rid of the pesky letter Q (and q too)
- $response =~ /(?im)([aeiou].*?)(?-m)(.*)/
- This example sucks. Point is you can combine modes inside the statement, too.
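In Python, flags are combined with the | operator rather than appended as letters. The sketch below shows the Q example plus one combined match; the sample strings are made up.

```python
import re

# Equivalent of s/Q//gi: re.sub is already global, re.I adds case insensitivity.
alphabet = re.sub(r"Q", "", "QUICK quiz", flags=re.I)

# Two modes at once: ignore case and treat ^ and $ as line anchors.
vowel_lines = re.findall(r"^[aeiou]\w*$", "Apple\nbanana\norange", re.I | re.M)

print(alphabet, vowel_lines)
```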
28 References for Learning More
- Tutorials for other programming languages
- http://www.regular-expressions.info/
- In-depth syntax
- http://kobesearch.cpan.org/htdocs/perl/perlreref.html
- Code Search (ex: ip address regex)
- http://www.google.com/codesearch