Formal Languages

Provided by: imsUnist
1
Formal Languages: Regex Review
Hinrich Schütze, IMS, Uni Stuttgart, WS 2007/08
Slides borrowed from Chris Manning
2
What are regular expressions?
  • They are a powerful way of finding/matching/
    extracting patterns in text (strings)
  • The bad news
  • You or your friends will start writing things
    like
  • /\b(Sun|Mon|Tues|Thurs|Fri|Sun)(\.|day)?|Wed(\.|nesday)?|Sat(\.|urday)?\b/
  • The good news
  • Once you get used to it, this is actually a
    powerful way to do text processing
  • After a while you can actually read such things
    easily

3
What's an easy regular expression?
  • can
  • This is a regular expression which matches the
    word "can" in a text like "You can do it"

4
What's a harder regular expression?
  • can|could
  • This doesn't match can|could in
  • if (can|could > 0x7f) printf("Funny bit operation")
  • | is a metacharacter meaning OR. It lets you
    define a list of strings to match
  • This regular expression does match "could" in
  • I thought I could do it
  • It will also match "can" in "They scanned his
    iris"
  • Why things get more subtle

5
What's a hard regular expression?
  • can|could
  • Will match "can" in "They scanned his iris"
  • We need to stop that!
  • What we want is that there is white space or a
    beginning of line or something like that on both
    sides of the words
  • \b(can|could)\b
  • Regular expression languages make it easy to do
    things like that
  • The above match would no longer happen
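The effect of the word boundaries can be checked with a small sketch. It uses Python's `re` module, which follows the Perl-style syntax shown on these slides; the sample sentence and the non-capturing group `(?:...)` (used so that `findall` returns whole matches) are assumptions added for illustration.

```python
import re

text = "They scanned his iris, but I thought I could do it. You can."

# Without boundaries, "can" is also found inside "scanned".
loose = re.findall(r"can|could", text)

# \b on both sides of the group rejects matches inside longer words.
strict = re.findall(r"\b(?:can|could)\b", text)

print(loose)   # ['can', 'could', 'can']
print(strict)  # ['could', 'can']
```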

6
What is a regular expression?
  • It's a pattern that matches a string
  • This is a little subtle. The pattern looks like a
    string, and it may even be a string, but it's
    actually a pattern that describes a set of
    strings
  • A regular expression describes a (perhaps
    infinite) set of strings
  • It's easier to get a sense of this through
    examples.

7
Regular expressions history
  • Came out of understanding of formal languages in
    the 1960s
  • Chomsky, Aho, Ullman, etc.
  • Regular expressions describe regular languages
    which can be recognized by finite state machines
  • First big promoter of use was Unix
  • Tools like sed, grep, awk support regular
    expressions
  • But a fairly plain and limited regex language
  • Could invoke it from C, but it was painful

8
Regular expressions history
  • The second, bigger promoter of use was Perl
  • Developed a greatly extended, flexible regex
    language which was really useful and usable
  • Regular expressions were directly supported in
    the language syntax
  • Everyone used them everywhere for everything
  • Still one of the simplest and most natural places
    to use them
  • Once you get into that psychology and ignore
    what's happening behind the scenes
  • Though Perl does a lot to make things as fast as
    possible, including caching recently used regexes.

9
Regular expression libraries
  • All modern programming languages have built-in
    regular expression libraries (Perl, Java, Python,
    ...)

10
Ranges of Characters
  • Ranges can be specified in regular expressions
  • Valid ranges
  • [A-Z] Upper case Roman alphabet
  • [a-z] Lower case Roman alphabet
  • [A-Za-z] Upper or lower case Roman alphabet
  • [A-F] Upper case A through F Roman
    characters
  • [A-z] Valid, but be careful (need to know ASCII!)
  • Invalid ranges
  • [a-Z] Not valid
  • [F-A] Not valid

11
Ranges cont ...
  • Ranges of digits can also be specified
  • [0-9] Valid
  • Negating ranges
  • /[^0-9]/
  • Match anything except a digit
  • /[^a]/
  • Match anything except an a
  • /^[^A-Z]/
  • Match anything that starts with something
    other than a single upper case
    letter
  • First ^: start of line
  • Second ^: negation
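A short sketch of ranges and negation, written with Python's `re` module as an assumption (the slides are language-neutral here). The helper `is_valid` and the sample strings are hypothetical names introduced for illustration.

```python
import re

def is_valid(pattern):
    """Return True if the pattern compiles, False if the engine rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# A range inside brackets matches exactly one character from that range.
digits = re.findall(r"[0-9]", "Room 42B")
# A leading ^ inside the brackets negates the class.
non_digits = re.findall(r"[^0-9]", "42B!")
# Outside the brackets, ^ anchors the match to the start instead.
starts_non_upper = bool(re.search(r"^[^A-Z]", "hello World"))
starts_upper = bool(re.search(r"^[^A-Z]", "Hello world"))

# [A-z] compiles, but it also covers the ASCII punctuation between 'Z' and 'a'.
surprise = re.findall(r"[A-z]", "a_Z^")

print(digits, non_digits, starts_non_upper, starts_upper)
print(surprise)                              # ['a', '_', 'Z', '^']
print(is_valid("[F-A]"), is_valid("[a-Z]"))  # False False
```

Reversed ranges such as `[F-A]` are rejected because the start character sorts after the end character in the character set.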

12
Literal Metacharacters
  • Suppose that you actually want to look for all
    strings that equal '^' in your text
  • Use the \ symbol
  • /\^/ Regular expression to search for ^
  • In general, \ escapes a metacharacter back to its
    literal meaning
  • Caution: regular expression packages vary. In
    POSIX/emacs, \| is the metacharacter and | is a
    plain pipe character. In Perl, a backslashed
    punctuation character is always the literal character
13
Predefined patterns in Perl 5 regexps
  • Some Patterns
  • \d [0-9]
  • \w [a-zA-Z0-9_]
  • \s [ \r\t\n\f] (white space pattern)
  • \S [^ \r\t\n\f]
  • \D [^0-9]
  • \W [^a-zA-Z0-9_]
  • Example: 19\d\d
  • Looks for any year in the 1900s
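The predefined classes can be tried out as follows. This is a sketch in Python's `re` module, not Perl 5; the `re.ASCII` flag is passed to `\w` because Python's classes are Unicode-aware by default, and the sample strings are invented.

```python
import re

text = "Born 1987, moved 2004, retired in 1999."

# 19\d\d: the literal "19" followed by any two digits.
years_1900s = re.findall(r"19\d\d", text)

# \w+ treats letters, digits and underscore as word characters.
words = re.findall(r"\w+", "snake_case 42 o'clock", re.ASCII)

# \s+ splits on any run of white space; \D removes everything but digits.
fields = re.split(r"\s+", "a \t b\nc")
only_digits = re.sub(r"\D", "", "tel: 555-0199")

print(years_1900s)  # ['1987', '1999']
print(words)        # ['snake_case', '42', 'o', 'clock']
print(fields)       # ['a', 'b', 'c']
print(only_digits)  # 5550199
```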

14
Word Boundary Metacharacter
  • Regular Expression to match the start or the end
    of a 'word' \b
  • This is a (zero-width) assertion (like ^ and $)
  • An assertion is a statement about the position of
    the match pattern within a string. Zero-width
    means that it consumes no characters.
  • Examples
  • /Jeff\b/ Match Jeff but not Jefferson
  • /Carol\b/ Match Carol but not Caroline
  • /Rollin\b/ Match Rollin but not Rolling
  • /\bform/ Match form or formation but not
    Information
  • /\bform\b/ Match form but neither information
    nor formation
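The examples above can be verified with a short sketch, assuming Python's `re` module; the helper `matches` is a hypothetical name. The last lines show what zero-width means: the match succeeds but covers no characters.

```python
import re

def matches(pattern, text):
    """Return True if the pattern is found anywhere in the text."""
    return re.search(pattern, text) is not None

print(matches(r"Jeff\b", "Jeff"), matches(r"Jeff\b", "Jefferson"))             # True False
print(matches(r"\bform", "formation"), matches(r"\bform", "Information"))      # True False
print(matches(r"\bform\b", "form"), matches(r"\bform\b", "formation"))         # True False

# \b on its own matches a position, not a character.
boundary = re.search(r"\b", "hi")
print(repr(boundary.group()), boundary.span())  # '' (0, 0)
```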

15
DOT Metacharacter
  • The DOT Metacharacter, '.' symbolizes any
    character except a new line
  • /b.bble/
  • Would possibly return bobble, babble, bubble
  • /.oat/
  • Would possibly return boat, coat, goat
  • Note: remember that '.' matches almost any
    character. This can be handy but can also have
    hidden ramifications.
  • Use sparingly. It is often better to be more specific
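A small sketch of why the dot deserves care, assuming Python's `re` module and an invented word list: the dot accepts a literal period as well, while a character class accepts only the letters that were intended.

```python
import re

words = ["bobble", "babble", "bubble", "b.bble", "bbble", "b\nbble"]

# '.' stands for exactly one character of almost any kind (not a newline).
with_dot = [w for w in words if re.fullmatch(r"b.bble", w)]

# A character class is more specific about what may appear.
with_class = [w for w in words if re.fullmatch(r"b[aou]bble", w)]

print(with_dot)    # ['bobble', 'babble', 'bubble', 'b.bble']
print(with_class)  # ['bobble', 'babble', 'bubble']
```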

16
PIPE Metacharacter
  • The PIPE Metacharacter is used for alternation
  • /Bridget (Thomson|McInnes)/
  • Match Bridget Thomson or Bridget McInnes but
    NOT Bridget Thomson McInnes
  • /B|bridget/
  • Match B or bridget
  • /^(B|b)ridget/
  • Match Bridget or bridget at the beginning of a
    line
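The role of parentheses in alternation can be sketched like this, assuming Python's `re` module. `fullmatch` is used for the name test because a plain search would still find "Bridget Thomson" inside the longer string.

```python
import re

names = re.compile(r"Bridget (Thomson|McInnes)")
one_surname = names.fullmatch("Bridget McInnes")
two_surnames = names.fullmatch("Bridget Thomson McInnes")

# Without parentheses, | splits the whole pattern: "B" or "bridget".
ungrouped = re.findall(r"B|bridget", "Bridget bridget")

# With parentheses and ^, only the first letter alternates, at the line start.
at_start = re.search(r"^(B|b)ridget", "bridget came")
not_at_start = re.search(r"^(B|b)ridget", "said Bridget")

print(one_surname.group(1), two_surnames)  # McInnes None
print(ungrouped)                           # ['B', 'bridget']
print(at_start is not None, not_at_start)  # True None
```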

17
A Simple Example
  • Now suppose that we want to not only get all
    words that end in 'ing' but also 'ed'.
  • How would we write a regular expression to
    accomplish this?

18
A Simple Example
  • Now suppose that we want to not only get all
    words that end in 'ing' but also 'ed'.
  • Regular Expression
  • $word =~ m/[a-z]+(ing|ed)$/
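One way to try this out is sketched below in Python's `re` module, assuming the pattern is `[a-z]+(ing|ed)$` applied to one word at a time; the sentence is invented for illustration.

```python
import re

sentence = "She walked home singing while the kite was drifting"

# One or more lowercase letters, then "ing" or "ed" at the end of the word.
suffix = re.compile(r"[a-z]+(ing|ed)$")
hits = [word for word in sentence.split() if suffix.search(word)]

print(hits)  # ['walked', 'singing', 'drifting']
```

Note that short words such as "red" also satisfy this pattern, so it over-matches if the goal is to find inflected verbs only.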

19
The ? Metacharacter
  • The metacharacter, ?, indicates that the
    character immediately preceding it occurs zero or
    one time
  • Examples
  • /worl?ds/
  • Match either 'worlds' or 'words'
  • /m?ethane/
  • Match either 'methane' or 'ethane'
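A quick check of the optional character, sketched with Python's `re` module. The grouped example `(un)?happy` is an addition that is not on the slide; it shows that `?` applies to a whole group when one is present.

```python
import re

candidates = ["worlds", "words", "worllds"]
optional_l = [w for w in candidates if re.fullmatch(r"worl?ds", w)]

optional_m = [w for w in ["methane", "ethane", "mmethane"]
              if re.fullmatch(r"m?ethane", w)]

# With parentheses, the whole group becomes optional (assumed extra example).
optional_prefix = [w for w in ["happy", "unhappy", "ununhappy"]
                   if re.fullmatch(r"(un)?happy", w)]

print(optional_l)       # ['worlds', 'words']
print(optional_m)       # ['methane', 'ethane']
print(optional_prefix)  # ['happy', 'unhappy']
```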

20
The * Metacharacter
  • The metacharacter, *, indicates that the
    character immediately preceding it occurs zero
    or more times
  • Example
  • /ab*c/ Match 'ac', 'abc', 'abbc', 'abbbc',
    etc.
  • Matches any string that starts with an a, is
    optionally followed by a sequence of b's, and ends
    with a c.
  • Sometimes called Kleene star
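The Kleene star example can be tested with a sketch like the following, assuming Python's `re` module and a made-up candidate list.

```python
import re

candidates = ["ac", "abc", "abbbc", "abcb", "bc", "a"]

# fullmatch requires the whole string to fit the pattern a, b*, c.
accepted = [s for s in candidates if re.fullmatch(r"ab*c", s)]

# search finds the pattern anywhere inside a longer string.
inside = re.search(r"ab*c", "xxabbcyy").group()

print(accepted)  # ['ac', 'abc', 'abbbc']
print(inside)    # abbc
```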

21
Greedy vs. Lazy Matching
  • The regular expression engine does greedy
    matching by default. This means that it attempts
    to match the maximum possible number of
    characters, if given a choice. For example:
  • $str = "The dogggg";
  • $str =~ /The (dog+)/;
  • print $1;
  • This prints "dogggg" because g+, one or
    more g's, is interpreted to mean the maximum
    possible number of g's.
  • Greedy matching can cause problems.
  • Arguably it was a mistaken default, especially
    with hindsight
  • Lazy matching matches the minimal number of
    characters. It is turned on by putting a
    question mark ? after a quantifier. Using the
    example above,
  • $str =~ /The (dog+?)/; print $1;
    prints "dog"
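The same comparison, sketched in Python's `re` module, whose quantifiers are also greedy by default. The tag example at the end is an added illustration of the kind of problem greedy matching can cause.

```python
import re

s = "The dogggg"

greedy = re.search(r"The (dog+)", s).group(1)   # g+ takes as many g's as possible
lazy = re.search(r"The (dog+?)", s).group(1)    # g+? takes as few as possible

# Greedy .+ runs to the last '>', lazy .+? stops at the first one.
greedy_tags = re.findall(r"<.+>", "<b>bold</b>")
lazy_tags = re.findall(r"<.+?>", "<b>bold</b>")

print(greedy, lazy)  # dogggg dog
print(greedy_tags)   # ['<b>bold</b>']
print(lazy_tags)     # ['<b>', '</b>']
```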

22
Two Examples
  • Finding blank lines.
  • Matching letters only.

23
A Few Examples
  • Finding blank lines. They might have a space or
    tab on them, so use /^\s*$/
  • Matching letters only. The problem is that \w
    includes digits and underscore in addition to
    letters. You can use /^[A-Za-z]+$/, which requires
    that every character in the entire string be a
    letter. Or try /^[^\W\d_]+$/.
  • Words in general sometimes have an apostrophe (')
    or a hyphen (-) in them, such as o'clock, cat's,
    and pre-evaluation. Also, there are some common
    mixed number/letter expressions, for example 1st
    for first. So \w by itself won't
    match everything that we consider a word in
    common English.
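Both recipes can be sketched in Python's `re` module, assuming the patterns `^\s*$` and `[^\W\d_]+`. One difference from plain ASCII classes: Python's `\W` is Unicode-aware by default, so the second letters-only pattern also accepts accented letters. The sample data is invented.

```python
import re

lines = ["first", "", "  \t", "last"]

# A blank line contains nothing but optional white space.
blank_indexes = [i for i, line in enumerate(lines) if re.search(r"^\s*$", line)]

def ascii_letters_only(s):
    return re.fullmatch(r"[A-Za-z]+", s) is not None

def letters_only(s):
    # "Not a non-word character, not a digit, not an underscore" leaves letters.
    return re.fullmatch(r"[^\W\d_]+", s) is not None

print(blank_indexes)                                             # [1, 2]
print(ascii_letters_only("Hello"), ascii_letters_only("Hello2"))  # True False
print(letters_only("snake_case"), letters_only("naïve"))          # False True
print(ascii_letters_only("naïve"))                                # False
```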

24
More Examples
  • Time of day: for example, 11:30.

25
More Examples
  • Time of day: for example, 11:30.
  • [01][0-9]:[0-5][0-9]
  • If you only want a 12-hour clock, this would
    over-allow times such as 19:00 and 00:30.
  • A more complicated construction works better:
    (1[012]|[1-9]):[0-5][0-9]. That is, a 1 followed
    by 0, 1, or 2, OR any digit 1-9.
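The two time patterns can be compared with a sketch in Python's `re` module, assuming the forms `[01][0-9]:[0-5][0-9]` and `(1[012]|[1-9]):[0-5][0-9]`. The names `loose` and `strict` and the list of times are hypothetical.

```python
import re

loose = re.compile(r"[01][0-9]:[0-5][0-9]")
strict = re.compile(r"(1[012]|[1-9]):[0-5][0-9]")

times = ["11:30", "19:00", "00:30", "9:05", "12:59", "13:00"]

loose_ok = [t for t in times if loose.fullmatch(t)]
strict_ok = [t for t in times if strict.fullmatch(t)]

print(loose_ok)   # ['11:30', '19:00', '00:30', '12:59', '13:00']
print(strict_ok)  # ['11:30', '9:05', '12:59']

# Without anchoring, the strict pattern still finds "9:00" inside "19:00".
print(strict.search("19:00").group())  # 9:00
```

The last line is the reason for using `fullmatch` here: when the pattern is searched for inside longer text, it needs boundaries or anchors around it.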

26
Other things I should mention
  • One or more: +
  • Counting: X{n,m}, X{n,}, X{,m}
  • Capturing group: ([A-Z])
  • Unicode properties: \p{Lu}
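A brief sketch of `+`, counted repetition and capturing groups, assuming Python's `re` module and invented sample strings. The `\p{...}` Unicode syntax is left out because Python's standard `re` module does not support it.

```python
import re

# +: one or more of the preceding item.
numbers = re.findall(r"\d+", "7 ate 89")

# {n,m}: between n and m repetitions; {n,}: at least n.
short_words = re.findall(r"\b[a-z]{2,3}\b", "a to the moon")
at_least_two = [s for s in ["a", "aa", "aaaa"] if re.fullmatch(r"a{2,}", s)]

# Parentheses capture the text matched by each group.
date = re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", "2007-10-15")

print(numbers)        # ['7', '89']
print(short_words)    # ['to', 'the']
print(at_least_two)   # ['aa', 'aaaa']
print(date.groups())  # ('2007', '10', '15')
```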

27
Implementation
  • Without capturing groups, a regular expression can
    be turned into an efficient deterministic FA
  • Regular expression libraries generally don't, but
    use an NFA state machine
  • Worst case exponential, but used with care,
    you're fine
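To make the deterministic FA idea concrete, here is a hand-built sketch for the pattern `ab*c`. The transition table, state numbering and function name are assumptions made for illustration, not the output of any library; the automaton reads each character once and never backtracks.

```python
import re

# States: 0 = start, 1 = seen 'a' and any number of b's, 2 = accept.
TRANSITIONS = {
    (0, "a"): 1,
    (1, "b"): 1,
    (1, "c"): 2,
}
ACCEPTING = {2}

def dfa_accepts(s):
    """Run the DFA over s in a single left-to-right pass."""
    state = 0
    for ch in s:
        state = TRANSITIONS.get((state, ch))
        if state is None:  # no transition defined: reject immediately
            return False
    return state in ACCEPTING

samples = ["ac", "abc", "abbbc", "a", "abca", "bc", ""]
agrees = all(dfa_accepts(s) == bool(re.fullmatch(r"ab*c", s)) for s in samples)
print(agrees)  # True
```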

28
Limitations
  • Things with recursive embedding aren't finite
    state, and you can't do it all with regular
    expressions
  • Human languages
  • XML
  • But "the desperate Perl hacker", as the XML
    community refers to regular expression users, can
    get a long way
  • Good for structured patterns or lists
  • Not so good for, e.g., people's names
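Nested parentheses are a small instance of recursive embedding. The sketch below, in Python with hypothetical names, contrasts a regular expression that handles one level of nesting with a counter-based check that handles any depth.

```python
import re

# Handles parentheses nested at most one level deep.
shallow = re.compile(r"(?:[^()]|\([^()]*\))*")

def balanced(s):
    """Check nesting of any depth with a counter, which a finite-state pattern lacks."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its '('
                return False
    return depth == 0

print(bool(shallow.fullmatch("(a)(b)")), balanced("(a)(b)"))  # True True
print(bool(shallow.fullmatch("((a))")), balanced("((a))"))    # False True
print(balanced("(()"))                                        # False
```

The pattern could be extended by hand to two or three levels, but each fixed pattern of this kind stops at some depth.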

29
Summary
30
Predefined character classes
  • . any one character except a line terminator
  • \d a digit: [0-9]
  • \D a non-digit: [^0-9]
  • \s a whitespace character: [ \t\n\x0B\f\r]
  • Notice the space. Spaces are significant in
    regular expressions!
  • \S a non-whitespace character: [^\s]
  • \w a word character: [a-zA-Z_0-9]
  • \W a non-word character: [^\w]

31
Boundary matchers
  • These patterns match the empty string if at the
    specified position
  • ^ the beginning of a line
  • $ the end of a line
  • \b a word boundary
  • \B not a word boundary
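A sketch of the boundary matchers in Python's `re` module. One assumption to note: in Python, `^` and `$` match at every line only when the `re.MULTILINE` flag is given; otherwise `^` matches only at the start of the string. The sample text is invented.

```python
import re

text = "first line\nsecond line"

start_default = re.findall(r"^\w+", text)                 # start of string only
start_each_line = re.findall(r"^\w+", text, re.MULTILINE)
end_each_line = re.findall(r"\w+$", text, re.MULTILINE)

# \B is the opposite of \b: it matches only inside a word.
inside_word = re.findall(r"\Bcan\B", "can scanned")

print(start_default)    # ['first']
print(start_each_line)  # ['first', 'second']
print(end_each_line)    # ['line', 'line']
print(inside_word)      # ['can']
```

The single `can` found by `\Bcan\B` is the one inside "scanned"; the standalone word is rejected because it sits at word boundaries.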

32
Exercises
  • Regular expressions to find lines
  • with a special character, e.g.
  • with an uppercase letter
  • with all letters in uppercase
  • with initial lowercase letter
  • with the word London
  • with non-blank white space
  • with IP addresses
  • with an email address
  • containing one, two or three
  • not containing o
  • Find the line with the longest word

33
Exercises
  • Regular expressions to find lines containing
  • a single digit
  • a word of length 13
  • a tab