Formal Languages

Slides: 34

Provided by: imsUnist

Transcript and Presenter's Notes

Title: Formal Languages

1
Formal Languages: Regex Review
Hinrich Schütze, IMS, Uni Stuttgart, WS 2007/08
Slides borrowed from Chris Manning
2
What are regular expressions?

They are a powerful way of finding, matching, and extracting patterns in text (strings)
The bad news:
You or your friends will start writing things like
/\b(Sun|Mon|Tues|Thurs|Fri|Sun)(\.|day)?|Wed(\.|nesday)?|Sat(\.|urday)?\b/
The good news:
Once you get used to it, this is actually a powerful way to do text processing
After a while you can actually read such things easily

3
What's an easy regular expression?

can
This is a regular expression which matches the word "can" in a text like "You can do it"

4
What's a harder regular expression?

can|could
This doesn't match the literal text "can|could" in
if (can|could > 0x7f) printf("Funny bit operation");
| is a meta-character meaning OR. It lets you define a list of strings to match
This regular expression does match "could" in "I thought I could do it"
It will also match "can" in "They scanned his iris"
This is why things get more subtle

5
What's a hard regular expression?

can|could
Will match "can" in "They scanned his iris"
We need to stop that!
What we want is that there is white space or a beginning of line or something like that on both sides of the words
\b(can|could)\b
The parentheses make both \b apply to either alternative
Regular expression languages make it easy to do things like that
The above match would no longer happen
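
The effect of the word boundaries can be checked with a small sketch. It uses Python's re module rather than the Perl syntax of the slides, which is an assumption made only to keep the example runnable; the sample sentences are the ones quoted above.

```python
import re

text = "They scanned his iris"

# Without word boundaries, "can" is found inside "scanned".
loose = re.search(r"can|could", text)

# With \b on both sides of the group, only whole words match.
strict = re.compile(r"\b(can|could)\b")

print(loose.group())                                       # can
print(strict.search(text))                                 # None
print(strict.search("I thought I could do it").group())    # could
```

The group matters: written as \bcan|could\b, the first \b would belong only to "can" and the second only to "could".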

6
What is a regular expression?

It's a pattern that matches a string
This is a little subtle. The pattern looks like a string, and it may even be a string, but it's actually a pattern that describes a set of strings
A regular expression describes a (perhaps infinite) set of strings
It's easier to get a sense of this through examples.

7
Regular expressions history

Came out of the study of formal languages in the 1950s and 1960s
Kleene, Chomsky, Aho, Ullman, etc.
Regular expressions describe regular languages
which can be recognized by finite state machines
First big promoter of use was Unix
Tools like sed, grep, awk support regular
expressions
But a fairly plain and limited regex language
Could invoke it from C, but it was painful

8
Regular expressions history

The second, bigger promoter of use was Perl
It developed a greatly extended, flexible regex language which was really useful and usable
Regular expressions were directly supported in the language syntax
Everyone used them everywhere for everything
Still one of the simplest and most natural places to use them
Once you get into that psychology and ignore what's happening behind the scenes
Though Perl does a lot to make things as fast as possible, including caching recently used regexes.

9
Regular expression libraries

Nearly all modern programming languages have built-in or standard regular expression libraries (Perl, Java, Python, ...)
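
As one illustration of such a library, here is a minimal sketch using Python's standard re module. The choice of Python is an assumption for the sake of a runnable example; the slides themselves use Perl notation.

```python
import re

text = "You can do it"

# re.search returns a match object for the first match, or None.
match = re.search(r"can", text)
print(match.group(), match.start())   # can 4

# re.findall returns all non-overlapping matches as a list of strings.
print(re.findall(r"o", text))         # ['o', 'o']
```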

10
Ranges of Characters

Ranges can be specified in regular expressions
Valid ranges
[A-Z] Upper case Roman alphabet
[a-z] Lower case Roman alphabet
[A-Za-z] Upper or lower case Roman alphabet
[A-F] Upper case A through F
[A-z] Valid, but be careful (need to know ASCII!): it also matches [ \ ] ^ _ and `
Invalid ranges
[a-Z] Not valid
[F-A] Not valid
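
The ASCII caveat and the invalid ranges can be tried out directly. This is a sketch in Python's re module (an assumption; other engines report invalid ranges in their own way), and the helper name is_valid is made up for the example.

```python
import re

# [A-z] spans ASCII codes 65..122, which includes [ \ ] ^ _ ` (codes 91..96).
print(re.findall(r"[A-z]", "a_Z^9"))      # ['a', '_', 'Z', '^']
print(re.findall(r"[A-Za-z]", "a_Z^9"))   # ['a', 'Z']

# A range whose start sorts after its end is rejected at compile time.
def is_valid(pattern):
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(is_valid(r"[a-Z]"))   # False
print(is_valid(r"[F-A]"))   # False
```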

11
Ranges cont ...

Ranges of digits can also be specified
[0-9] Valid
Negating ranges
/[^0-9]/
Match anything except a digit
/[^a]/
Match anything except an a
/^[^A-Z]/
Match anything that starts with something other than a single upper case letter
First ^: start of line
Second ^: negation
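
A short sketch of the two roles of the caret, again assuming Python's re module for illustration; the test strings are invented.

```python
import re

# Inside brackets, a leading ^ negates the class.
print(re.findall(r"[^0-9]", "a1b2"))     # ['a', 'b']
print(re.findall(r"[^a]", "banana"))     # ['b', 'n', 'n']

# Outside brackets, ^ anchors the match to the start of the line.
starts_non_upper = re.compile(r"^[^A-Z]")
print(bool(starts_non_upper.search("hello World")))   # True
print(bool(starts_non_upper.search("Hello world")))   # False
```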

12
Literal Metacharacters

Suppose that you actually want to look for all strings that equal '^' in your text
Use the \ symbol
/\^/ Regular expression to search for ^
In general, \ escapes a metacharacter back to its original (literal) meaning
Caution: there is some variation between regular expression packages. In POSIX/emacs, \| is the metacharacter and | is a pipe character. In Perl, a backslashed punctuation character is always the regular character
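
A small sketch of escaping, taking ^ as the metacharacter to search for (an assumption for this example) and using Python's re module, which follows the Perl convention that a backslashed punctuation character is literal.

```python
import re

text = "x^2 + y^2"

# Unescaped, ^ anchors to the start of the string and matches the empty string there.
print(re.findall(r"^", text))     # ['']

# Escaped, \^ matches the literal caret character.
print(re.findall(r"\^", text))    # ['^', '^']

# re.escape adds the backslashes when the literal text comes from elsewhere.
print(re.escape("^"))             # \^
```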

13
Predefined patterns in Perl 5 regexps

Some patterns
\d [0-9]
\w [a-zA-Z0-9_]
\s [ \r\t\n\f] (white space pattern)
\S [^ \r\t\n\f]
\D [^0-9]
\W [^a-zA-Z0-9_]
Example: 19\d\d
Looks for any year in the 1900s
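
The year example can be sketched as follows, assuming Python's re module. The sample sentence is invented, and it shows why the bare pattern may need word boundaries in practice.

```python
import re

text = "Born in 1984, moved in 2003, call 555-1912345."

# Plain 19\d\d also fires inside longer digit runs.
print(re.findall(r"19\d\d", text))        # ['1984', '1912']

# Adding \b on both sides keeps only free-standing four-digit years.
print(re.findall(r"\b19\d\d\b", text))    # ['1984']
```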

14
Word Boundary Metacharacter

Regular expression to match the start or the end of a 'word': \b
This is a (zero-width) assertion (like ^ and $)
An assertion is a statement about the position of the match pattern within a string. (Zero-width: meaning?)
Examples
/Jeff\b/ Match Jeff but not Jefferson
/Carol\b/ Match Carol but not Caroline
/Rollin\b/ Match Rollin but not Rolling
/\bform/ Match form or formation but not information
/\bform\b/ Match form but neither information nor formation
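
The examples above can be checked with a short sketch in Python's re module (an assumption for illustration; the helper name hits is made up). The last line shows what zero-width means in practice: the reported span covers only the letters, not the boundaries.

```python
import re

def hits(pattern, words):
    """Return the words in which the pattern is found."""
    return [w for w in words if re.search(pattern, w)]

names = ["Jeff", "Jefferson"]
forms = ["form", "formation", "information"]

print(hits(r"Jeff\b", names))       # ['Jeff']
print(hits(r"\bform", forms))       # ['form', 'formation']
print(hits(r"\bform\b", forms))     # ['form']

# Zero-width: the match consumes no characters at the boundary itself.
print(re.search(r"\bform\b", "a form!").span())   # (2, 6)
```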

15
DOT Metacharacter

The DOT metacharacter, '.', symbolizes any single character except a newline
/b.bble/
Would possibly return bobble, babble, bubble
/.oat/
Would possibly return boat, coat, goat
Note: remember that '.' matches just about anything, including digits, spaces, and punctuation. This can be handy but can also have hidden ramifications.
Use sparingly. It is often better to be more specific
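
A sketch of the "hidden ramifications", assuming Python's re module and an invented test string: the dot accepts a hyphen as readily as a vowel, while a character class states the intent.

```python
import re

text = "bobble babble b-bble b\nbble"

# The dot matches any single character except a newline.
print(re.findall(r"b.bble", text))       # ['bobble', 'babble', 'b-bble']

# A character class is more specific than a dot.
print(re.findall(r"b[aou]bble", text))   # ['bobble', 'babble']
```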

16
PIPE Metacharacter

The PIPE metacharacter, |, is used for alternation
/Bridget (Thomson|McInnes)/
Match Bridget Thomson or Bridget McInnes but NOT Bridget Thomson McInnes
/B|bridget/
Match B or bridget
/^(B|b)ridget/
Match Bridget or bridget at the beginning of a line
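
The scope of the alternation can be sketched like this, assuming Python's re module. The non-capturing group (?:...) is used in the last line only so that findall returns the whole match.

```python
import re

full = re.compile(r"Bridget (Thomson|McInnes)")
print(full.search("Bridget Thomson").group())            # Bridget Thomson
print(full.search("Bridget McInnes").group())            # Bridget McInnes
# On the double surname, only the first alternative is consumed.
print(full.search("Bridget Thomson McInnes").group())    # Bridget Thomson

# Without parentheses, | splits the whole pattern: "B" or "bridget".
print(re.findall(r"B|bridget", "Bridget bridget"))       # ['B', 'bridget']

# The group limits the alternation to the first letter.
print(re.findall(r"^(?:B|b)ridget", "bridget Bridget"))  # ['bridget']
```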

17
A Simple Example

Now suppose that we want to not only get all
words that end in 'ing' but also 'ed'.
How would we write a regular expression to
accomplish this?

18
A Simple Example

Now suppose that we want to not only get all
words that end in 'ing' but also 'ed'.
Regular Expression
$word =~ m/[a-z]+(ing|ed)$/
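
One way to read this answer is sketched below in Python's re module. The pattern, lowercase letters followed by "ing" or "ed" at the end of the word, and the word list are assumptions made for the example.

```python
import re

# Assumption: a "word" here is a run of lowercase ASCII letters.
suffix = re.compile(r"^[a-z]+(ing|ed)$")

words = ["walking", "walked", "walks", "ing", "red"]
print([w for w in words if suffix.search(w)])   # ['walking', 'walked', 'red']
```

Note that "red" qualifies, since the pattern only looks at spelling, while the bare string "ing" does not, because at least one letter must precede the suffix.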

19
The ? Metacharacter

The metacharacter, ?, indicates that the
character immediately preceding it occurs zero or
one time
Examples
/worl?ds/
Match either 'worlds' or 'words'
/m?ethane/
Match either 'methane' or 'ethane'
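
A quick check of the optional-character behaviour, sketched with Python's re module (an assumption; the extra test word "worllds" is invented).

```python
import re

optional_l = re.compile(r"worl?ds")
print([w for w in ["worlds", "words", "worllds"] if optional_l.fullmatch(w)])
# ['worlds', 'words']

# ? applies only to the single preceding character (or group).
print(re.findall(r"m?ethane", "methane and ethane"))   # ['methane', 'ethane']
```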

20
The * Metacharacter

The metacharacter, *, indicates that the character immediately preceding it occurs zero or more times
Example
/ab*c/ Match 'ac', 'abc', 'abbc', 'abbbc', etc.
Matches any string that starts with an a, is possibly followed by a sequence of b's, and ends with a c.
Sometimes called Kleene star
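
The zero-or-more behaviour can be sketched as follows, assuming Python's re module and an invented candidate list.

```python
import re

star = re.compile(r"ab*c")
candidates = ["ac", "abc", "abbbc", "abxc", "bc"]

# fullmatch requires the whole string to fit the pattern.
print([s for s in candidates if star.fullmatch(s)])   # ['ac', 'abc', 'abbbc']
```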

21
Greedy vs. Lazy Matching

The regular expression engine does greedy matching by default. This means that it attempts to match the maximum possible number of characters, if given a choice. For example:
$str = "The dogggg";
$str =~ /The (dog+)/;
print $1;
This prints "dogggg" because g+, one or more g's, is interpreted to mean the maximum possible number of g's.
Greedy matching can cause problems.
Arguably it was a mistaken default, especially with hindsight
Lazy matching matches the minimal number of characters. It is turned on by putting a question mark ? after a quantifier. Using the example above,
$str =~ /The (dog+?)/; print $1;
prints "dog"
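
The same comparison can be sketched in Python's re module (an assumption; the slide's own snippet is Perl). The second half uses an invented string to show a typical case where greedy matching causes problems.

```python
import re

text = "The dogggg"

greedy = re.search(r"The (dog+)", text)
lazy = re.search(r"The (dog+?)", text)

print(greedy.group(1))   # dogggg
print(lazy.group(1))     # dog

# A common place where greediness bites: tagged or quoted spans.
html = "<b>one</b> and <b>two</b>"
print(re.findall(r"<b>(.*)</b>", html))    # ['one</b> and <b>two']
print(re.findall(r"<b>(.*?)</b>", html))   # ['one', 'two']
```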

22
Two Examples

Finding blank lines.
Matching letters only.

23
A Few Examples

Finding blank lines. They might have a space or tab on them, so use /^\s*$/
Matching letters only. The problem is, \w is digits and underscore in addition to letters. You can use /^[A-Za-z]+$/, which requires that every character in the entire string be a letter. Or try /^[^\W\d_]+$/.
Words in general sometimes have an apostrophe (') or hyphen (-) in them, such as o'clock, cat's, and pre-evaluation. Also, there are some common mixed number/letter expressions: 1st for first, for example. So, \w by itself won't match everything that we consider a word in common English.
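
Both examples can be sketched in Python's re module. The anchored patterns ^\s*$, ^[A-Za-z]+$ and ^[^\W\d_]+$ are one plausible reading of the slide and should be taken as assumptions. In Python 3, \w is Unicode-aware for str patterns, so the two letter patterns differ on non-ASCII letters.

```python
import re

lines = ["first", "", "  \t", "last"]
blank = re.compile(r"^\s*$")
print([i for i, line in enumerate(lines) if blank.search(line)])   # [1, 2]

ascii_letters = re.compile(r"^[A-Za-z]+$")
any_letters = re.compile(r"^[^\W\d_]+$")   # word characters minus digits and underscore

for word in ["hello", "Schütze", "abc1", "o'clock"]:
    print(word, bool(ascii_letters.search(word)), bool(any_letters.search(word)))
# hello True True
# Schütze False True
# abc1 False False
# o'clock False False
```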

24
More Examples

Time of day. For example, 11:30.

25
More Examples

Time of day. For example, 11:30.
[01][0-9]:[0-5][0-9]
If you only want a 12-hour clock, this would over-allow times such as 19:00 and 00:30.
A more complicated construction works better: (1[012]|[1-9]):[0-5][0-9]. That is, a 1 followed by 0, 1, or 2, OR any digit 1-9.
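
The two time patterns can be compared with a short sketch in Python's re module. The patterns are written with a colon between hours and minutes, which is an assumption about the slide's intent, and the test times are invented.

```python
import re

# First pattern: two-digit hour starting with 0 or 1; accepts 19:00 and 00:30.
loose = re.compile(r"[01][0-9]:[0-5][0-9]")
# 12-hour pattern: hour is 10, 11, 12, or a single digit 1-9.
strict = re.compile(r"(1[012]|[1-9]):[0-5][0-9]")

for t in ["11:30", "9:05", "19:00", "00:30", "12:60"]:
    print(t, bool(loose.fullmatch(t)), bool(strict.fullmatch(t)))
# 11:30 True True
# 9:05 False True
# 19:00 True False
# 00:30 True False
# 12:60 False False
```

fullmatch is used on purpose: with an unanchored search, the stricter pattern would still find "9:00" inside "19:00", so in running text it needs \b or similar anchors around it.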

26
Other things I should mention

+ One or more
Counting: X{n,m}, X{n,}, X{,m}
Capturing group: ([A-Z])
Unicode blocks and categories: \p{Lu}
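
A brief sketch of the first three items, assuming Python's re module and invented sample strings. The \p{...} notation is left out because Python's built-in re does not support it.

```python
import re

# X{n,m}: between n and m repetitions; X{n,}: at least n; X{,m}: at most m.
print(bool(re.fullmatch(r"a{2,3}", "aaa")))    # True
print(bool(re.fullmatch(r"a{2,3}", "aaaa")))   # False
print(bool(re.fullmatch(r"a{2,}", "aaaa")))    # True
print(bool(re.fullmatch(r"a{,2}", "")))        # True

# + means one or more.
print(re.findall(r"o+", "book moooon"))        # ['oo', 'oooo']

# A capturing group remembers the text matched inside the parentheses.
match = re.search(r"([A-Z])\.", "Hinrich S. wrote this")
print(match.group(1))                          # S
```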

27
Implementation

Without capturing groups, a regular expression can be turned into an efficient deterministic finite automaton (DFA)
Regular expression libraries generally don't do this, but use an NFA state machine (typically with backtracking)
Worst case is exponential, but used with care, you're fine

28
Limitations

Things with recursive embedding aren't finite state, and you can't do it all with regular expressions
Human languages
XML
But the "desperate Perl hacker", as the XML community refers to regular expression users, can get a long way
Good for structured patterns or lists
Not so good for, e.g., people's names

29
Summary
30
Predefined character classes

. any one character except a line terminator
\d a digit: [0-9]
\D a non-digit: [^0-9]
\s a whitespace character: [ \t\n\x0B\f\r]

Notice the space. Spaces are significant in regular expressions!

\S a non-whitespace character: [^\s]
\w a word character: [a-zA-Z_0-9]
\W a non-word character: [^\w]

31
Boundary matchers

These patterns match the empty string if at the specified position
^ the beginning of a line
$ the end of a line
\b a word boundary
\B not a word boundary
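
The boundary matchers can be tried out with a small sketch in Python's re module (an assumption; the sample strings are invented). In Python, ^ and $ are line-based only when the re.MULTILINE flag is given.

```python
import re

text = "first line\nsecond line"

# With re.MULTILINE, ^ and $ match at the start and end of every line.
print(re.findall(r"^\w+", text, flags=re.MULTILINE))   # ['first', 'second']
print(re.findall(r"\w+$", text, flags=re.MULTILINE))   # ['line', 'line']

# Without the flag, ^ only matches at the start of the whole string.
print(re.findall(r"^\w+", text))                        # ['first']

# \B is the complement of \b: a position that is not a word boundary.
print(re.findall(r"\Bin\B", "in line linen"))           # ['in', 'in']
```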

32
Exercises

Regular expressions to find lines
with a special character, e.g.
with an uppercase letter
with all letters in uppercase
with initial lowercase letter
with the word London
with non-blank white spaces
with IP addresses
with an email address
containing one, two or three
not containing o
Find the line with the longest word
