Title: Regular Expressions
1Regular Expressions
The hidden power language
Roy Osherove www.iserializable.com Methodology
Team System Expert Sela Group www.Sela.co.il
2Tools
- http://tools.osherove.com
- www.ISerializable.com
3The Log File
4Developer Problem: Make this log file useful
- Old log file with entries from *nix systems
- Converted to and from various formats
- Searched by users
- Format may change
- Search fields can be added, removed or renamed at runtime
Date CPUsramcpu HHmmss action user domain.machine
25/05/1998 100512x86 214912 Search Anakin Antler.Anita1
25/05/98 100512x86 215115 Update Anakin Antler.Anita1
26/05/1998 100256x86 110245 Search Darth Cydot.Uk.Gerry2k
26/05/98 100256x86 111249 Update Darth Cydot.Uk.Gerry2k
27/05/98 100512x86 153430 Search Anakin Anterl.Anita1
12/08/1998 201024x86 101453 Search Obi Monaco.Huarez
5About 15 minutes later
About 45 minutes later
Home early.
6You can be home early too!
- Regex is easier than you think
7What are Regular Expressions?
- A language to describe a language using patterns
- Think SQL or XPath for text
- Popularized by Perl and *nix shell scripting
- Many variations and frameworks exist. Only one for .NET (for now)
- Used in most languages
8Common Regex Uses
- Text Validation
- Phones, emails, address or any format requirement
- Text Manipulation
- Transform text
- Text Parsing
- Find in files, site scraping, data collection
9What .NET brings to the table
- Full object model
- Extended syntax
- Optimization techniques in the framework
10.NET Regular Expressions
- Show up in several places
- In the classes of the System.Text.RegularExpressions namespace
- Via the RegularExpressionValidator validator control (for ASP.NET)
- Sprinkled in dozens of other places
- Browser capabilities filter
- In the WSDL <match> tag
- And many more
11Key Classes within System.Text.RegularExpressions
- Regex
- Contains the pattern and matching options
- Important methods
- IsMatch() returns boolean
- Replace() returns a string
- Split() returns a string array
- Main Use
- Validation, Splitting, Replacing text
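The three methods above can be sketched roughly as follows. This is an illustration in Python's `re` module, not the .NET API itself: `search`, `sub`, and `split` are assumed here as approximate counterparts of `IsMatch()`, `Replace()`, and `Split()`, and the sample strings are made up.

```python
import re

# Compile once: the pattern plus its options, much like constructing a Regex object.
digits = re.compile(r"\d+")

# IsMatch() counterpart: is there a match anywhere in the input?
has_digits = digits.search("order 66") is not None

# Replace() counterpart: returns a new string with every match substituted.
masked = digits.sub("#", "call 555 or 777")

# Split() counterpart: returns the pieces of text found between matches.
parts = re.split(r"\s*,\s*", "a, b ,c")

print(has_digits, masked, parts)
```

The input string is never modified. Each call returns a new value, which is also how the .NET methods behave.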
12The Process
[Diagram: a Regex combines a Pattern and Options with the input Text to produce Matches, Splits, or Replaced text]
13Validation
14Syntax
- Literal text in a pattern matches exactly as written
- The pattern a will match every a in the text
- Special symbols are the exception: they must be escaped to match literally
15Enclosing Alternatives with []
- The square brackets allow you to specify a list of alternate values. Used in conjunction with the - operator, you can even specify character ranges.
- [Cc] Capital or lowercase c
- [A-Z] Any capital letter A through Z
- [A-Za-z] Any capital or lowercase letter
- [0-9] Any digit 0 through 9
- [A-Za-z0-9] Any letter or digit
- [0-9.-] Any digit or special char listed
- Notice no escape needed
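A small sketch of the bracket syntax in action, written with Python's `re` module as a stand-in for the .NET classes (the class syntax shown is the same in both). The sample text is invented for the example.

```python
import re

text = "Cat 9-c.Z"

# findall returns every non-overlapping match of the class in the text.
either_c = re.findall(r"[Cc]", text)                 # capital or lowercase c
capitals = re.findall(r"[A-Z]", text)                # any capital letter
digits_dots_dashes = re.findall(r"[0-9.-]", text)    # '.' and a trailing '-' need no escape here

print(either_c, capitals, digits_dots_dashes)
```

Inside brackets the dot is just a dot, and a dash placed last cannot be read as a range, which is why no backslashes are needed.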
16Controlling Expression Frequency with {}
- The {} operators allow you to control the frequency of the preceding expression. The expression takes one of these two forms:
- {occurrences}
- [A-Za-z]{3}
- {MinOccurrences, MaxOccurrences}
- [A-Za-z]{1,3}
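To see both forms side by side, here is a minimal sketch using Python's `re` module as a stand-in for .NET (the brace syntax is the same). `fullmatch` is used so the whole string has to fit the pattern, and the variable names are only illustrative.

```python
import re

exactly_three = re.compile(r"[A-Za-z]{3}")     # {occurrences}
one_to_three = re.compile(r"[A-Za-z]{1,3}")    # {min,max}

# fullmatch succeeds only when the entire string fits the pattern.
three_ok = exactly_three.fullmatch("abc") is not None
two_rejected = exactly_three.fullmatch("ab") is None
two_ok_in_range = one_to_three.fullmatch("ab") is not None
four_rejected = one_to_three.fullmatch("abcd") is None

print(three_ok, two_rejected, two_ok_in_range, four_rejected)
```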
17Basic Frequency Operators
- ? 0 or 1
- * 0 or more
- + 1 or more
- So,
- 3+
- Will match
- 3, 33, 3333
- but not
- 45, 678.
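The 3+ example can be checked directly. The sketch below uses Python's `re` module in place of .NET (these three operators behave the same way in both); the `colou?r` and `ab*` patterns are extra illustrations that are not from the slides.

```python
import re

candidates = ["3", "33", "3333", "45", "678"]

# + : one or more of the preceding expression
threes = [s for s in candidates if re.fullmatch(r"3+", s)]

# ? : zero or one of the preceding expression
optional_u = [s for s in ["color", "colour", "colouur"] if re.fullmatch(r"colou?r", s)]

# * : zero or more of the preceding expression
a_then_bs = [s for s in ["a", "ab", "abbb", "b"] if re.fullmatch(r"ab*", s)]

print(threes, optional_u, a_then_bs)
```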
18Wildcard Operator .
- . matches any non-newline character
- Unless single-line mode (RegexOptions.Singleline) has been turned on for the pattern
- Examples
- A. would match a capital A followed by any one character
- As a whole-string match, it will not match Abc
- A.+ would match a capital A followed by one or more non-newline characters
- \.htm.? would match ".htm" followed by an optional non-newline character
- Backslash (\) escapes characters that have reserved meanings in regular expressions
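A short sketch of the wildcard examples, using Python's `re` module rather than .NET. One assumption to note: Python's `re.DOTALL` flag is used here as the counterpart of .NET's `RegexOptions.Singleline`, the option that lets the dot match a newline. The file names are invented.

```python
import re

# A. : capital A followed by exactly one non-newline character
two_chars = re.fullmatch(r"A.", "Ab") is not None        # fits
three_chars = re.fullmatch(r"A.", "Abc") is not None     # one character too many

# A.+ : capital A followed by one or more non-newline characters
a_dot_plus = re.fullmatch(r"A.+", "Abc") is not None

# \.htm.? : a literal dot, "htm", then an optional extra character
extensions = re.findall(r"\.htm.?", "index.htm\npage.html")

# By default the dot stops at a newline; DOTALL lifts that restriction.
newline_default = re.fullmatch(r"A.", "A\n") is not None
newline_dotall = re.fullmatch(r"A.", "A\n", re.DOTALL) is not None

print(two_chars, three_chars, a_dot_plus, extensions, newline_default, newline_dotall)
```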
19Convenience Expressions
- \d
- Any digit
- \D
- Any non-digit
- Must still match a character, just not a digit
- \s
- Any whitespace character (such as a space or tab)
- \S
- Any character other than a whitespace character
- \w
- Any letter, digit, or underscore
- \W
- Any character other than a letter, digit, or underscore
Many more: Unicode categories, hex values, negative lookarounds
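The shorthand classes can be tried out on one sample string. This is a sketch in Python's `re` module, which shares these escapes with .NET; the sample text is made up. Note that the underscore counts as a word character.

```python
import re

text = "Room 42, floor_3\tEast"

digits = re.findall(r"\d", text)       # each single digit
words = re.findall(r"\w+", text)       # runs of letters, digits, or underscores
spaces = re.findall(r"\s", text)       # spaces and the tab
non_word = re.findall(r"\W", text)     # everything \w would reject

print(digits, words, spaces, non_word)
```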
20Quick Quiz!
- [A-Za-z]{3}
- 3 capital or lowercase letters
- Abc, abc, aBC, 1bc
- [A-Z][a-z]{2,4}
- A capital letter followed by at least 2 but not more than 4 lowercase letters
- Abc, Acbde, abcde, ABcde
- \w{3,8}\.\w{3}
- 3 to 8 alphanumeric characters, followed by a dot and 3 alphanumerics
- Filename.txt, d0main.com, 1234.567, 34.456
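One way to check the quiz answers is to run each candidate through its pattern. The sketch below uses Python's `re` module in place of .NET and assumes each candidate must match as a whole string (hence `fullmatch`); the dictionary names are only illustrative.

```python
import re

quiz = {
    r"[A-Za-z]{3}": ["Abc", "abc", "aBC", "1bc"],
    r"[A-Z][a-z]{2,4}": ["Abc", "Acbde", "abcde", "ABcde"],
    r"\w{3,8}\.\w{3}": ["Filename.txt", "d0main.com", "1234.567", "34.456"],
}

# For each pattern, keep only the candidates that match from start to end.
answers = {
    pattern: [s for s in candidates if re.fullmatch(pattern, s)]
    for pattern, candidates in quiz.items()
}

for pattern, matched in answers.items():
    print(pattern, "->", matched)
```

In each row exactly one candidate fails: 1bc starts with a digit, abcde and ABcde break the capital-then-lowercase rule, and 34.456 has only two characters before the dot.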
21Splitting and Manipulating
22Text Manipulation
23The Spammer
24(2) Key Classes within System.Text.RegularExpressions
- MatchCollection - Match
- MatchCollection stores all the matches found
- GroupCollection - Group
- CaptureCollection - Capture
- Regex.Match() returns Match
- Regex.Matches() returns MatchCollection
- Main Use
- Parsing, searching, collecting data
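The difference between asking for one match and asking for all of them can be sketched as follows. This uses Python's `re` module, where `search` and `finditer` are assumed as rough counterparts of `Regex.Match()` and `Regex.Matches()`; the addresses are hypothetical.

```python
import re

text = "roy@example.com wrote to anna@example.org"
email = re.compile(r"[\w.-]+@[\w.-]+\.\w{2,5}")

# Match() counterpart: the first match only, with its value and position.
first = email.search(text)
first_value, first_index = first.group(0), first.start()

# Matches() counterpart: iterate over every match in the text.
all_values = [m.group(0) for m in email.finditer(text)]

print(first_value, first_index, all_values)
```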
25Simple parsing: Parsing for emails
26Grouping (the coolest part)
27Grouping (pay attention!)
- Groups give us object models
- HTML File
- Roy@Osherove.com
- Create a capture hierarchy and use it in code
- [\w\.\-]+@[\w\.\-]+\.\w{2,5}
- (?<userName>[\w\.\-]+)@(?<domain>[\w\.\-]+\.\w{2,5})
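Reading the userName and domain groups back in code might look like the sketch below. It is written with Python's `re` module, where named groups are spelled `(?P<name>...)` instead of the .NET form `(?<name>...)`; apart from that spelling the pattern is assumed to be equivalent.

```python
import re

# Same two named groups as in the slide, in Python's named-group spelling.
email = re.compile(r"(?P<userName>[\w.-]+)@(?P<domain>[\w.-]+\.\w{2,5})")

match = email.search("Contact: Roy@Osherove.com")

# Each named group is available by name on the match object.
user = match.group("userName")
domain = match.group("domain")

print(user, domain)
```

Naming the groups means the calling code does not depend on group positions, so the pattern can be rearranged later without breaking it.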
28Grouping Emails: The Regulator
29Getting back to the first problem: Make this log file useful
- Old log file with entries from *nix systems
- Converted to and from various formats
- Searched by users
- Format may change
- Search fields can be added, removed or renamed at runtime
Date CPUsramcpu HHmmss action user domain.machine
25/05/1998 100512x86 214912 Search Anakin Antler.Anita1
25/05/98 100512x86 215115 Update Anakin Antler.Anita1
26/05/1998 100256x86 110245 Search Darth Cydot.Uk.Gerry2k
26/05/98 100256x86 111249 Update Darth Cydot.Uk.Gerry2k
27/05/98 100512x86 153430 Search Anakin Anterl.Anita1
12/08/1998 201024x86 101453 Search Obi Monaco.Huarez
30How do I start?
- Take a sample of the log file
- Recognize the data pattern for each entry
- Use groups to get each line's values
- Create a tool that uses this regex to parse a log file
- The tool will use the returned results to generate the log as XML
- Load the XML into a DataSet
- Allow the user to run Select statements on the DataSet
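The parse-then-generate-XML steps could be sketched along these lines. This is not the tool from the talk: it is a Python illustration that assumes the six fields are separated by whitespace exactly as they appear in this transcript, and the group names, the `log` and `entry` element names, and the `log_to_xml` helper are all hypothetical. The DataSet step is .NET-specific and is left out.

```python
import re
import xml.etree.ElementTree as ET

# One named group per log field (names are assumptions, not from the slides).
LOG_LINE = re.compile(
    r"(?P<date>\d{2}/\d{2}/\d{2,4})\s+"
    r"(?P<hardware>\S+)\s+"
    r"(?P<time>\d{6})\s+"
    r"(?P<action>\w+)\s+"
    r"(?P<user>\w+)\s+"
    r"(?P<machine>\S+)"
)

sample = """\
25/05/1998 100512x86 214912 Search Anakin Antler.Anita1
25/05/98 100512x86 215115 Update Anakin Antler.Anita1
"""

def log_to_xml(text):
    """Turn every matching log entry into an <entry> element with one child per group."""
    root = ET.Element("log")
    for match in LOG_LINE.finditer(text):
        entry = ET.SubElement(root, "entry")
        for name, value in match.groupdict().items():
            ET.SubElement(entry, name).text = value
    return root

root = log_to_xml(sample)
print(ET.tostring(root, encoding="unicode"))
```

Because the XML element names come straight from the group names, renaming or adding a search field only requires changing the pattern, which fits the requirement that fields can change at runtime.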
31Parsing a log file
32Regulazy
- Build simple expressions by example
- No syntax knowledge needed
- Free
- Tools.osherove.com
33When not to use Regex
- When it's easier and more readable to do it otherwise
- Not just because it's cool
- Hard to read
- Steep learning curve
- Hard to maintain
Sometimes, when confronted with a problem, you might decide to solve it with Regular Expressions for the wrong reasons. Now you've got two problems.
34Summary
- Amazing parsing flexibility
- Good skill to have anywhere
- Can save you time and nerves
- With Power comes responsibility
- Weigh the pros and cons before using
35Resources
- The Regulator tools.osherove.com
- Regulazy tools.osherove.com
- Regexlib.com Regex archive (http://www.regexlib.com)
- Cheat Sheet - http://www.regular-expressions.info
Roy Osherove Royo_at_sela.co.il Blog
www.iserializable.com
36Thank you!