Title: Formal Languages
1 Formal Languages: Regex Review
Hinrich Schütze, IMS, Uni Stuttgart, WS 2007/08
Slides borrowed from Chris Manning
2 What are regular expressions?
- They are a powerful way of finding, matching, and extracting patterns in text (strings)
- The bad news
  - You or your friends will start writing things like
  - /\b(Sun|Mon|Tues|Thurs|Fri|Sun)(\.|day)?|Wed(\.|nesday)?|Sat(\.|urday)?\b/
- The good news
  - Once you get used to it, this is actually a powerful way to do text processing
  - After a while you can actually read such things easily
3 What's an easy regular expression?
- can
- This is a regular expression which matches the word "can" in a text like "You can do it"
4 What's a harder regular expression?
- can|could
- This doesn't match the literal string "can|could" in
  - if (can|could > 0x7f) printf("Funny bit operation")
- | is a meta-character meaning OR. It lets you define a list of strings to match
- This regular expression does match "could" in
  - I thought I could do it
- It will also match "can" in "They scanned his iris"
- Why things get more subtle
5 What's a hard regular expression?
- can|could
- Will match "can" in "They scanned his iris"
- We need to stop that!
- What we want is that there is a white space or a beginning of line or something like that on both sides of the words
- \b(can|could)\b
- Regular expression languages make it easy to do things like that
- The above match would no longer happen
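A minimal sketch of the word-boundary fix, written in Python's `re` module rather than the Perl notation of the slides (that choice is an assumption made only for illustration). The parentheses matter: without them the alternation would split the pattern into `\bcan` and `could\b`, so only one side of each word would be checked.

```python
import re

plain = re.compile(r"can|could")
bounded = re.compile(r"\b(can|could)\b")

text = "They scanned his iris"

# The plain pattern finds "can" inside "scanned".
plain_hit = plain.search(text)

# The bounded pattern needs a word boundary on both sides, so it finds nothing.
bounded_hit = bounded.search(text)

# The whole word "could" is still found.
could_hit = bounded.search("I thought I could do it")

print(plain_hit.group(), bounded_hit, could_hit.group())
```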
6 What is a regular expression?
- It's a pattern that matches a string
- This is a little subtle. The pattern looks like a string (it may even be a string), but it's actually a pattern that describes a set of strings
- A regular expression describes a (perhaps infinite) set of strings
- It's easier to get a sense of this through examples.
7 Regular expressions history
- Came out of understanding of formal languages in the 1960s
  - Chomsky, Aho, Ullman, etc.
- Regular expressions describe regular languages, which can be recognized by finite state machines
- First big promoter of use was Unix
  - Tools like sed, grep, awk support regular expressions
  - But a fairly plain and limited regex language
  - Could invoke it from C, but it was painful
8 Regular expressions history
- Second, bigger promoter of use was Perl
  - Developed a greatly extended, flexible regex language which was really useful and usable
  - Regular expressions were directly supported in the language syntax
  - Everyone used them everywhere for everything
- Still one of the simplest and most natural places to use them
  - Once you get into that psychology and ignore what's happening behind the scenes
  - Though Perl does a lot to make things as fast as possible, including caching recently used regexes.
9 Regular expression libraries
- All modern programming languages have built-in regular expression libraries (Perl, Java, Python, ...)
10 Ranges of Characters
- Ranges can be specified in regular expressions
- Valid ranges
  - [A-Z] Upper case Roman alphabet
  - [a-z] Lower case Roman alphabet
  - [A-Za-z] Upper or lower case Roman alphabet
  - [A-F] Upper case A through F Roman characters
  - [A-z] Valid, but be careful (need to know ASCII!)
- Invalid ranges
  - [a-Z] Not valid
  - [F-A] Not valid
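A short sketch of why the A-z range needs care, assuming Python's `re` module and square-bracket character classes. In ASCII six punctuation characters sit between Z and a, so the range quietly accepts them. The sample string is made up for illustration.

```python
import re

sample = "aB_c[D"

# [A-Z] covers only the 26 upper case letters.
upper = re.findall(r"[A-Z]", sample)

# [A-z] is valid, but in ASCII the characters [ \ ] ^ _ ` sit between Z and a,
# so the range includes them as well.
wide = re.findall(r"[A-z]", sample)

# A reversed range such as [F-A] is rejected when the pattern is compiled.
try:
    re.compile(r"[F-A]")
    reversed_ok = True
except re.error:
    reversed_ok = False

print(upper, wide, reversed_ok)
```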
11 Ranges cont ...
- Ranges of digits can also be specified
  - [0-9] Valid
- Negating ranges
  - /[^0-9]/
    - Match anything except a digit
  - /[^a]/
    - Match anything except an a
  - /^[^A-Z]/
    - Match anything that starts with something other than a single upper case letter
    - First ^: start of line
    - Second ^: negation
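The two roles of the caret can be checked with a small sketch, assuming Python's `re` module. Outside a class the caret anchors at the start. As the first character inside a class it negates the class. The test strings are illustrative only.

```python
import re

# [^0-9] matches any single character that is not a digit.
non_digits = re.findall(r"[^0-9]", "a1b22c")

# ^[^A-Z]: the first ^ anchors at the start, the second negates the class.
starts_non_upper = re.compile(r"^[^A-Z]")

lower_start = bool(starts_non_upper.search("hello World"))
upper_start = bool(starts_non_upper.search("Hello world"))

print(non_digits, lower_start, upper_start)
```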
12 Literal Metacharacters
- Suppose that you actually want to look for all strings that equal '^' in your text
- Use the \ symbol
  - /\^/ Regular expression to search for ^
- In general \ escapes a metacharacter back to its original meaning
- Caution: there is some variation between regular expression packages. In POSIX/emacs, \| is the metacharacter and | is a pipe character. In Perl, a backslashed non-alphanumeric character is always the regular character
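A hedged sketch of escaping, assuming Python's `re` module and taking the caret as the metacharacter to be matched literally. `re.escape` is the standard-library helper that adds the backslashes for an arbitrary string. The sample texts are invented.

```python
import re

text = "x^2 + y^2"

# Unescaped, ^ is the start-of-string anchor and matches one empty position.
anchor_hits = re.findall(r"^", text)

# Escaped, \^ matches the literal caret character.
literal_hits = re.findall(r"\^", text)

# re.escape backslashes the metacharacters in a string for us.
escaped = re.escape("a|b")
pipe_hit = re.search(escaped, "regex a|b example")

print(anchor_hits, literal_hits, escaped, pipe_hit.group())
```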
13 Predefined patterns in Perl 5 regexps
- Some patterns
  - \d [0-9]
  - \w [a-zA-Z0-9_]
  - \s [ \r\t\n\f] (white space pattern)
  - \S [^ \r\t\n\f]
  - \D [^0-9]
  - \W [^a-zA-Z0-9_]
- Example: 19\d\d
  - Looks for any year in the 1900s
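The shorthand classes behave the same way in Python's `re` module, which is assumed here for a quick sketch. One difference worth knowing: on Python text strings `\d` and `\w` also accept non-ASCII digits and letters unless the `re.ASCII` flag is given. The sample sentences are made up.

```python
import re

text = "Born in 1956, moved in 2003, retired in 1999."

# 19\d\d: the literal digits 1 and 9 followed by any two digits.
years_1900s = re.findall(r"19\d\d", text)

# \w+ picks out runs of word characters (letters, digits, underscore).
words = re.findall(r"\w+", "snake_case v2 ok")

# \s+ matches runs of white space, here used as a separator.
pieces = re.split(r"\s+", "a  b\tc\nd")

print(years_1900s, words, pieces)
```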
14 Word Boundary Metacharacter
- Regular expression to match the start or the end of a 'word': \b
- This is a (zero-width) assertion (like ^ and $)
- An assertion is a statement about the position of the match pattern within a string. (Zero-width meaning?)
- Examples
  - /Jeff\b/ Match Jeff but not Jefferson
  - /Carol\b/ Match Carol but not Caroline
  - /Rollin\b/ Match Rollin but not Rolling
  - /\bform/ Match form or formation but not Information
  - /\bform\b/ Match form but neither information nor formation
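The examples above can be checked with a small sketch in Python's `re` module (an assumption, since the slides use Perl). The helper `words_matching` is hypothetical. The last line illustrates "zero-width": the boundary takes up no characters, so the match span covers only the four letters.

```python
import re

def words_matching(pattern, words):
    """Return the words in which the pattern is found somewhere."""
    return [w for w in words if re.search(pattern, w)]

names = ["Jeff", "Jefferson"]
forms = ["form", "formation", "information"]

jeff = words_matching(r"Jeff\b", names)
form_start = words_matching(r"\bform", forms)
form_only = words_matching(r"\bform\b", forms)

# \b consumes no characters: the span is exactly the length of "form".
span = re.search(r"\bform\b", "a form here").span()

print(jeff, form_start, form_only, span)
```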
15 DOT Metacharacter
- The DOT metacharacter, '.', symbolizes any character except a new line
- /b.bble/
  - Would possibly return bobble, babble, bubble
- /.oat/
  - Would possibly return boat, coat, goat
- Note: remember '.*' usually means a bunch of anything. This can be handy but also can have hidden ramifications.
- Use sparingly. Often better to be more specific
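A brief sketch of the dot and of the "hidden ramifications", assuming Python's `re` module. The tag-like string in the last example is an invented illustration of a dot-star pattern taking more than intended.

```python
import re

words = "bobble babble bubble b\nbble"

# . matches any single character except a newline, so "b\nbble" is skipped.
dot_hits = re.findall(r"b.bble", words)

# A character class is more specific: only vowels are allowed in the gap.
vowel_hits = re.findall(r"b[aeiou]bble", "bobble b7bble")

# .* means "a bunch of anything" and runs to the last possible ">".
swallowed = re.search(r"<.*>", "<b>bold</b>").group()

print(dot_hits, vowel_hits, swallowed)
```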
16 PIPE Metacharacter
- The PIPE metacharacter is used for alternation
- /Bridget (Thomson|McInnes)/
  - Match Bridget Thomson or Bridget McInnes but NOT Bridget Thomson McInnes
- /B|bridget/
  - Match B or bridget
- /^(B|b)ridget/
  - Match Bridget or bridget at the beginning of a line
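A sketch of how grouping changes the scope of the alternation, assuming Python's `re` module. The input strings are illustrative.

```python
import re

# Parentheses limit the alternation to the surname.
surname = re.compile(r"Bridget (Thomson|McInnes)")
first_hit = surname.search("Bridget Thomson McInnes").group()

# Without parentheses the alternatives are all of "B" and all of "bridget".
loose = re.findall(r"B|bridget", "Bob and bridget")

# Anchored and grouped: either capitalisation, only at the start of the line.
at_start = re.compile(r"^(B|b)ridget")
starts = bool(at_start.search("bridget went home"))
later = bool(at_start.search("Then Bridget left"))

print(first_hit, loose, starts, later)
```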
17 A Simple Example
- Now suppose that we want to get not only all words that end in 'ing' but also those that end in 'ed'.
- How would we write a regular expression to accomplish this?
18 A Simple Example
- Regular expression
  - $word =~ m/[a-z]+(ing|ed)$/
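One way to try this out, sketched in Python's `re` module. The pattern assumes lower case words, a grouped alternation, and an end-of-string anchor, which is one reading of the slide. The word list is made up and includes "red" to show that the pattern checks only spelling, not morphology.

```python
import re

# One or more lower case letters, then "ing" or "ed" at the very end.
ends_ing_or_ed = re.compile(r"[a-z]+(ing|ed)$")

words = ["walking", "walked", "walks", "singer", "red"]
selected = [w for w in words if ends_ing_or_ed.search(w)]

print(selected)
```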
19 The ? Metacharacter
- The metacharacter ? indicates that the character immediately preceding it occurs zero or one time
- Examples
  - /worl?ds/
    - Match either 'worlds' or 'words'
  - /m?ethane/
    - Match either 'methane' or 'ethane'
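A quick check of the optional character, assuming Python's `re` module. `fullmatch` is used so that the whole string has to fit the pattern. The doubled-letter strings are invented counterexamples.

```python
import re

optional_l = re.compile(r"worl?ds")
optional_m = re.compile(r"m?ethane")

# ? allows zero or one occurrence, so a doubled letter is rejected.
world_forms = [w for w in ["worlds", "words", "worllds"] if optional_l.fullmatch(w)]
gas_forms = [w for w in ["methane", "ethane", "mmethane"] if optional_m.fullmatch(w)]

print(world_forms, gas_forms)
```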
20 The * Metacharacter
- The metacharacter * indicates that the character immediately preceding it occurs zero or more times
- Example
  - /ab*c/ Match 'ac', 'abc', 'abbc', 'abbbc', etc.
  - Matches any string that starts with an a, is possibly followed by a sequence of b's, and ends with a c.
- Sometimes called Kleene star
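The star can be sketched the same way, assuming Python's `re` module and the pattern `ab*c`. The rejected strings are made-up counterexamples. The last line shows that an unanchored search also finds the pattern inside a longer string.

```python
import re

star = re.compile(r"ab*c")

# The star applies only to the b: zero or more b's between a and c.
accepted = [s for s in ["ac", "abc", "abbbc", "adc", "abcb"] if star.fullmatch(s)]

# search is unanchored, so it finds "abbc" inside a longer string.
found = star.search("xxabbcxx").group()

print(accepted, found)
```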
21 Greedy vs. Lazy Matching
- The regular expression engine does greedy matching by default. This means that it attempts to match the maximum possible number of characters, if given a choice. For example
  - $str = "The dogggg";
  - $str =~ /The (dog+)/;
  - print $1;
- This prints "dogggg" because "g+", one or more g's, is interpreted to mean the maximum possible number of g's.
- Greedy matching can cause problems.
  - Arguably it was a mistaken default, especially with hindsight
- Lazy matching matches the minimal number of characters. It is turned on by putting a question mark "?" after a quantifier. Using the example above,
  - $str =~ /The (dog+?)/; print $1;
  - prints "dog"
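The same experiment in Python's `re` module (assumed here in place of Perl), followed by a made-up quoted-string case of the kind where greedy matching typically causes problems.

```python
import re

text = "The dogggg"

# Greedy: g+ takes as many g's as it can.
greedy = re.search(r"The (dog+)", text).group(1)

# Lazy: g+? takes as few g's as it can, which is one.
lazy = re.search(r"The (dog+?)", text).group(1)

# Greedy .+ runs to the last quote, lazy .+? stops at the first one.
line = 'say "hi" and "bye"'
greedy_quote = re.search(r'"(.+)"', line).group(1)
lazy_quote = re.search(r'"(.+?)"', line).group(1)

print(greedy, lazy, greedy_quote, lazy_quote)
```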
22 Two Examples
- Finding blank lines.
- Matching letters only.
23 A Few Examples
- Finding blank lines. They might have a space or tab on them, so use /^\s*$/
- Matching letters only. The problem is, \w matches digits and underscore in addition to letters. You can use /^[A-Za-z]+$/, which requires that every character in the entire string be a letter. Or try /^[^\W\d_]+$/.
- Words in general sometimes have an apostrophe (') and hyphen (-) in them, such as o'clock and cat's and pre-evaluation. Also, there are some common numerical/letter mixed expressions: 1st for first, for example. So, \w+ by itself won't match everything that we consider a word in common English.
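A sketch of these patterns in Python's `re` module, assuming anchored forms of the blank-line and letters-only expressions. The final `word` pattern is a hypothetical extension, not from the slides, that lets apostrophes and hyphens join runs of word characters.

```python
import re

blank = re.compile(r"^\s*$")
letters_ascii = re.compile(r"^[A-Za-z]+$")
# "Not a non-word character, not a digit, not an underscore" leaves letters only.
letters_any = re.compile(r"^[^\W\d_]+$")

lines = ["text", "", " \t", "x y"]
blank_flags = [bool(blank.search(line)) for line in lines]

samples = ["Hello", "Schütze", "abc1", "snake_case"]
ascii_ok = [s for s in samples if letters_ascii.search(s)]
any_ok = [s for s in samples if letters_any.search(s)]

# Hypothetical word pattern: word characters joined by ' or -.
word = re.compile(r"\w+(?:['-]\w+)*")
tokens = word.findall("It's 1st pre-evaluation o'clock")

print(blank_flags, ascii_ok, any_ok, tokens)
```

The second letters-only pattern accepts "Schütze" because Python's `\W` is Unicode-aware on text strings, while the explicit A-Za-z class is limited to ASCII letters.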
24 More Examples
- Time of day: for example, 11:30.
25 More Examples
- [01][0-9]:[0-5][0-9]
  - If we only want a 12-hour clock, this would over-allow times such as 19:00 and 00:30.
- A more complicated construction works better: (1[012]|[1-9]):[0-5][0-9]. That is, a 1 followed by 0, 1, or 2, OR any digit 1-9.
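The two time patterns can be compared with a short sketch, assuming Python's `re` module. The `^` and `$` anchors are an addition here, so that each test string has to be a time and nothing else. The test times are invented.

```python
import re

loose_time = re.compile(r"^[01][0-9]:[0-5][0-9]$")
twelve_hour = re.compile(r"^(1[012]|[1-9]):[0-5][0-9]$")

times = ["11:30", "19:00", "00:30", "9:05", "12:60"]

# The loose pattern accepts 19:00 and 00:30, which are not 12-hour times.
loose_ok = [t for t in times if loose_time.search(t)]

# The stricter pattern allows hours 1-12 only, without a leading zero.
strict_ok = [t for t in times if twelve_hour.search(t)]

print(loose_ok, strict_ok)
```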
26 Other things I should mention
- + One or more
- Counting: X{n,m}, X{n,}, X{,m}
- Capturing group: ([A-Z])
- Unicode character properties: \p{Lu} (upper case letter)
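A hedged sketch of counting and capturing in Python's `re` module. The patterns and the sample text are made up. Python's standard `re` module does not support the `\p` Unicode property syntax, so that item is left out here.

```python
import re

# X{n,m}: between n and m repetitions of X.
two_to_three = re.compile(r"a{2,3}")
counts = [bool(two_to_three.fullmatch(s)) for s in ["a", "aa", "aaa", "aaaa"]]

# Capturing groups remember the text matched inside the parentheses.
m = re.search(r"([A-Z])\w* (\d{4})", "born Ada 1815")
initial, year = m.group(1), m.group(2)

print(counts, initial, year)
```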
27 Implementation
- Without capturing groups, a regular expression can be turned into an efficient deterministic FA
- Regular expression libraries generally don't, but use an NFA state machine
- Worst case exponential, but used with care, you're fine
28 Limitations
- Things with recursive embedding aren't finite state, and you can't do it all with regular expressions
  - Human languages
  - XML
- But the "desperate Perl hacker", as the XML community refers to regular expression users, can get a long way
- Good for structured patterns or lists
- Not so good for, e.g., people's names
29 Summary
30 Predefined character classes
- . any one character except a line terminator
- \d a digit: [0-9]
- \D a non-digit: [^0-9]
- \s a whitespace character: [ \t\n\x0B\f\r]
  - Notice the space. Spaces are significant in regular expressions!
- \S a non-whitespace character: [^\s]
- \w a word character: [a-zA-Z_0-9]
- \W a non-word character: [^\w]
31 Boundary matchers
- These patterns match the empty string if at the specified position
  - ^ the beginning of a line
  - $ the end of a line
  - \b a word boundary
  - \B not a word boundary
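A small sketch of the boundary matchers in Python's `re` module (an assumption, since other regex flavours differ in detail). In Python, `^` and `$` refer to the whole string unless the `re.MULTILINE` flag is set. The sample strings are invented.

```python
import re

text = "first line\nsecond line"

# Without MULTILINE, ^ anchors only at the start of the whole string.
starts_default = re.findall(r"^\w+", text)

# With MULTILINE, ^ and $ also anchor at every line break.
starts_multi = re.findall(r"^\w+", text, flags=re.MULTILINE)
ends_multi = re.findall(r"\w+$", text, flags=re.MULTILINE)

# \B is the opposite of \b: "in" counts only when it is inside a word.
inner = re.findall(r"\Bin\B", "in line inn")

print(starts_default, starts_multi, ends_multi, inner)
```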
32 Exercises
- Regular expressions to find lines
  - with a special character, e.g.
  - with an uppercase letter
  - with all letters in uppercase
  - with initial lowercase letter
  - with the word London
  - with non-blank white spaces
  - with IP addresses
  - with an email address
  - containing one, two or three
  - not containing o
- Find the line with the longest word
33 Exercises
- Regular expressions to find lines containing
  - a single digit
  - a word of length 13
  - a tab