Chapter 11: Regular Expressions and Matching

Provided by: craigkn

1
  • Chapter 11 Regular Expressions and Matching
  • The match operator has the following form.
  • m/pattern/
  • A pattern can be an ordinary string or a
    generalized string containing metacharacters.
  • The binding operator, =~, is used to "bind" the
    matching operator onto a string.
  • "yesterday" =~ m/yes/
  • Here the pattern is an ordinary three character
    string.
  • The entire expression evaluates to a Boolean
    value, true (1) in this case since the pattern
    yes is a substring of "yesterday".

2
  • Since matching expressions result in Boolean
    values, they are usually used in a conditional.
  • $str = "yesterday";
  • if ($str =~ m/yes/) {
        print "The pattern yes was found in $str.\n";
    }
  • For demonstration, we will usually only show the
    matching expression.
  • Example
  • $str = "yesterday";
  • $str =~ m/ester/   # true
  • $str =~ m/Ester/   # false
  • $str =~ m/yet/     # false
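The three example matches above can be checked with a short script. This is a minimal sketch; the %result hash and the 1/0 mapping are illustrative choices, not part of the original slides.

```perl
use strict;
use warnings;

my $str = "yesterday";

# A match evaluates to true or false; map that to 1/0 so it prints cleanly.
my %result = (
    'ester' => ($str =~ m/ester/) ? 1 : 0,   # present as a substring
    'Ester' => ($str =~ m/Ester/) ? 1 : 0,   # matching is case sensitive
    'yet'   => ($str =~ m/yet/)   ? 1 : 0,   # y, e, t occur, but not consecutively
);

for my $pattern (sort keys %result) {
    print "$pattern: $result{$pattern}\n";
}
```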

3
  • Some notes
  • The !~ operator is the negated form of the
    binding operator. It returns true if the matching
    action does not find the pattern in the string.
    We will more often use the =~ operator.
  • if ($response !~ m/yes/) {
        print "yes was not found in your response.\n";
    }
  • The matching operator can be simplified
    syntactically. For example, the following two
    expressions are equivalent.
  • $str =~ m/yes/
  • $str =~ /yes/

4
  • The match operator can be bound not only onto
    string literals and variables, but also onto
    expressions that evaluate to strings.
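  • $str1 = "wilde";
  • $str2 = "beest";
  • ($str1 . $str2) =~ /debe/   # true
  • The parentheses are needed because =~ binds more
    tightly than the concatenation operator.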
  • Example A server-side "platform sniff" done by
    matching against the HTTP_USER_AGENT environment
    variable.
  • This example features the first pattern which is
    not merely a sequence of characters. The match
  • $info =~ /(Unix|Linux)/
  • is true if either Unix or Linux is a substring of
    whatever is stored in the $info variable.
  • See source file os.cgi.
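The source file os.cgi is not reproduced here, so the following is only a sketch of the two ideas on this slide. The user-agent string is a made-up sample; a real CGI script would presumably read $ENV{HTTP_USER_AGENT} instead.

```perl
use strict;
use warnings;

my $str1 = "wilde";
my $str2 = "beest";

# =~ binds more tightly than the . operator, hence the parentheses.
my $joined_ok = (($str1 . $str2) =~ /debe/) ? 1 : 0;

# Hypothetical sample value standing in for $ENV{HTTP_USER_AGENT}.
my $info = "Mozilla/5.0 (X11; Linux x86_64)";
my $is_unix_like = ($info =~ /(Unix|Linux)/) ? 1 : 0;

print "wildebeest contains debe: $joined_ok\n";
print "Unix or Linux platform: $is_unix_like\n";
```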

5
  • A regular expression is a set of rules that
    defines a generalized string.
  • For simplicity we call regular expressions
    patterns.
  • The syntax for a pattern is /pattern/.
  • A pattern is like a double quoted string in that
    variables are interpolated and escape sequences
    are interpreted.
  • But a pattern is much more powerful than a
    string and can contain wildcards, character
    classes, and quantifiers, just to name a few
    features which make patterns (regular
    expressions) much more general than ordinary
    strings.

6
  • Metacharacters
  • Characters which have special meaning in
    patterns are called metacharacters.
  • [ ] ( ) { } | \ + ? . * ^ $
  • To be used literally inside a pattern, they must
    be escaped to remove their special meaning.
  • if ($sentence =~ m/\?/) {
        print "Your sentence seems to be a question.\n";
    }

7
  • Normal characters
  • These include ordinary ASCII characters which
    are not metacharacters.
  • Normal characters include letters, numbers, the
    underscore, and a few other characters such as @,
    which are not reserved metacharacters
    in patterns.
  • Normal characters need not be escaped when
    testing for matches.
  • if ($sentence =~ m/;/) {
        print "Your sentence seems to contain an independent clause.\n";
    }

8
  • Escaped characters
  • Escaping in patterns works just like escaping
    characters in ordinary strings.
  • For example, \$ stands for one $, and \( stands
    for one (.
  • The following tests whether $str contains the
    three character string "(b)".
  • $str =~ /\(b\)/
  • Example values for $str which would yield true
    and false values in the above match.
  • true "(b)" , "(a)(b)(c)"
  • false "(ab)" , "( b )"
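The true and false cases listed above can be reproduced with a small helper. The name has_paren_b is made up for this sketch.

```perl
use strict;
use warnings;

# Returns 1 if the string contains the literal text "(b)", otherwise 0.
# The parentheses are escaped so they are not treated as grouping.
sub has_paren_b {
    my ($str) = @_;
    return ($str =~ /\(b\)/) ? 1 : 0;
}

for my $str ("(b)", "(a)(b)(c)", "(ab)", "( b )") {
    printf "%-10s => %d\n", $str, has_paren_b($str);
}
```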

9
  • Escape sequences that stand for one character
  • Some escaped characters stand literally for only
    one character, like escaped metacharacters.
  • Some stand for one invisible character, such as
    a whitespace character. Just like with ordinary
    strings \n stands for one newline character, and
    \t stands for one tab character.
  • The following tests whether $str contains two
    consecutive newline characters.
  • $str =~ /\n\n/
  • true "a\n\nb" , "a\n\n\n\tb"
  • false "\na\n" , "a\n \nb"

10
  • Escape sequences that stand for a class of
    characters
  • These represent only one character in a pattern,
    but that one character matches any character in
    the specified group.

11
  • The following tests whether $str contains a four
    character sequence that looks like a year in the
    1900s.
  • $str =~ /19\d\d/
  • true "1921" , "34192176"
  • false "191a" , "34192-76"
  • The following tests whether $str contains a
    non-whitespace character (i.e., it is not the
    empty string or merely a sequence of whitespace
    characters).
  • $str =~ /\S/
  • true "x" , "()"
  • false "" , " ", "\n"
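Both class-based tests can be wrapped in small functions and run against the sample values. The function names are illustrative only.

```perl
use strict;
use warnings;

# 1 if the string contains "19" followed by two digits, otherwise 0.
sub has_1900s_year {
    my ($str) = @_;
    return ($str =~ /19\d\d/) ? 1 : 0;
}

# 1 if the string contains at least one non-whitespace character, otherwise 0.
sub has_visible_char {
    my ($str) = @_;
    return ($str =~ /\S/) ? 1 : 0;
}

print has_1900s_year("34192176"), "\n";   # contains the digits 1921
print has_1900s_year("34192-76"), "\n";   # the dash interrupts the digits
print has_visible_char("()"), "\n";
print has_visible_char(" \n"), "\n";
```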

12
  • Wildcard
  • A period . stands for any one character, except
    a newline.
  • The following tests whether $str contains a
    three character substring that is c and t with
    anything in between, except a newline.
  • $str =~ /c.t/
  • true "cat" , "arc tangent"
  • false "ct" , "cart" , "arc\ntangent"

13
  • Escape sequences that match locations
  • These escape sequences do not actually represent
    a character in a pattern. Rather, they represent
    locations within the string being matched.

14
  • The following tests whether $str begins with T.
  • $str =~ /\AT/
  • true "Tom" , "The beest"
  • false "tom" , "ATT"
  • The following tests whether $str begins with The.
  • $str =~ /\AThe/
  • true "Thelma" , "The beest"
  • false "That" , "the beest"
  • The following tests whether $str contains the
    word cat but not as part of any bigger word.
  • $str =~ /\bcat\b/
  • true "cat" , "my cat"
  • false "cats" , "concatenate"

15
  • Note: When matching locations, the escape
    sequence does not "use up" a character. That is,
    an expression such as
  • $str =~ /ing\z/
  • only tests for the three character string ing at
    the end of $str.
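The location escapes \A, \z, and \b can be tried out as follows. This is a sketch with made-up function names; the test strings are drawn from the examples above where possible.

```perl
use strict;
use warnings;

# \A anchors the match to the start of the string.
sub begins_with_the {
    my ($str) = @_;
    return ($str =~ /\AThe/) ? 1 : 0;
}

# \b matches at a word boundary, so cat must stand alone as a word.
sub has_word_cat {
    my ($str) = @_;
    return ($str =~ /\bcat\b/) ? 1 : 0;
}

# \z anchors the match to the very end of the string.
sub ends_with_ing {
    my ($str) = @_;
    return ($str =~ /ing\z/) ? 1 : 0;
}

print begins_with_the("Thelma"), begins_with_the("the beest"), "\n";
print has_word_cat("my cat"), has_word_cat("concatenate"), "\n";
print ends_with_ing("matching"), ends_with_ing("ringer"), "\n";
```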

16
  • Character Classes
  • Square brackets in a pattern define a class.
  • The whole class matches only one character, and
    only if the character belongs to the class.
  • The following tests whether $str contains a
    three-character string beginning with one of r,
    b, or c, and followed by at.
  • $str =~ /[rbc]at/
  • true "rat" , "bat" , "cat" ,
    "concatenate" , "battery"
  • false "mat" , "at"

17
  • The escape sequences \d, \w, and \s and their
    opposites can be used inside a class.
  • A dash (-) can be used between two characters to
    denote a range of characters.
  • For example, the class
  • [\dA-F]
  • stands for one character that is either a numeric
    digit or one of the upper case letters A-F. It
    is equivalent to [0123456789ABCDEF].
  • The following tests whether $str contains a
    two-digit hexadecimal number as formatted in
    query string encoding.
  • $str =~ /%[\dA-F][\dA-F]/
  • true "%0A" , "data=Hi,%0A%0Dmy name is..."
  • false "%0a" , "%3"
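A short sketch of both character class examples follows. It assumes the query string test looks for a percent sign followed by two upper-case hexadecimal digits, as in URL encoding; the function names are invented for illustration.

```perl
use strict;
use warnings;

# One character from the class [rbc], followed by the literal text "at".
sub has_rbc_at {
    my ($str) = @_;
    return ($str =~ /[rbc]at/) ? 1 : 0;
}

# Assumed form of a URL-encoded byte: % plus two characters from [\dA-F].
sub has_encoded_byte {
    my ($str) = @_;
    return ($str =~ /%[\dA-F][\dA-F]/) ? 1 : 0;
}

print has_rbc_at("battery"), has_rbc_at("mat"), "\n";
print has_encoded_byte('data=Hi,%0A%0Dmy name is...'), "\n";
print has_encoded_byte('%0a'), "\n";   # lower-case a is not in the class
```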

18
  • Alternatives
  • The | character serves like an "or" by creating
    alternatives.
  • The following tests whether $str contains any of
    the three patterns.
  • $str =~ /cat|dog|ferret/
  • true "cat" , "dog" , "ferret" , "my cat" ,
    "cats and dogs" , "doggedly"
  • false "hamster" , "dodge the cart"
  • The alternatives are tested from left to right.
  • The alternatives themselves can be more
    complicated patterns.

19
  • Grouping and Capturing
  • Parentheses () are used for grouping in
    patterns.
  • The following tests whether $str contains one of
    the three alternatives, then a space, then
    food.
  • $str =~ /(cat|dog|ferret) food/
  • true "cat food" , "dog food" , "ferret food" ,
    "I like cat food and dog food"
  • false "cats food" , "rat food" , "dogfood"
  • With several alternatives, it is often desirable
    to capture which of the alternatives caused the
    successful match. That is, a mere truth value
    indicating a match doesn't indicate which match
    actually occurred.

20
  • The special, built-in variables $1, $2, $3, ...
    automatically capture an alternative that
    provides a successful match.
  • $str = "Do you have ferret food?";
  • $str =~ /(cat|dog|ferret) food/
  • Here, $1 is assigned the value "ferret" since
    that alternative provides the match. The rest
    are empty.
  • If more than one match is present, only the
    left-most match is recorded, since the string is
    searched from left to right and matching stops at
    the first success.
  • $str = "Do you have dog food or ferret food?";
  • $str =~ /(cat|dog|ferret) food/
  • Here, "dog" is assigned to $1, but $2 is empty
    even though there is a second match.

21
  • Multiple groups can populate more of the special
    variables.
  • $str = "Purina cat chow";
  • $str =~ /(cat|dog|ferret) (food|chow)/
  • $1 is assigned the value "cat" and $2 is
    assigned the value "chow". Captured matches are
    assigned into the special variables starting from
    the left-most grouping of alternatives.
  • Groups can be collected into a larger group.
  • $str = "Purina cat chow";
  • $str =~ /((cat|dog|ferret) (food|chow))/
  • $1 is assigned "cat chow", $2 is assigned "cat",
    and $3 is assigned "chow". The left-most
    behavior is still observed.
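The capture behavior described on the last two slides can be observed directly. In this sketch the captured values are copied into ordinary variables (whose names are made up here) right after each match, which is a common precaution since a later successful match overwrites $1, $2, and $3.

```perl
use strict;
use warnings;

# Nested groups: $1 is the outer group, $2 and $3 the inner ones,
# numbered by the position of each opening parenthesis.
my ($whole, $animal, $product) = ("", "", "");
if ("Purina cat chow" =~ /((cat|dog|ferret) (food|chow))/) {
    ($whole, $animal, $product) = ($1, $2, $3);
}

# With two possible matches in the string, only the left-most is captured.
my $first_animal = "";
if ("Do you have dog food or ferret food?" =~ /(cat|dog|ferret) food/) {
    $first_animal = $1;
}

print "whole=$whole animal=$animal product=$product\n";
print "first animal: $first_animal\n";
```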

22
  • Note: After a successful match, the special
    capturing variables are global variables within
    the program.
  • if ($data =~ /(cat|dog|ferret) (food|chow)/) {
        print "The match $1 $2 was found.";
    }
  • So if $data is "Purina cat chow is now", then
    the print statement would generate
  • The match cat chow was found.
  • As global variables, they will contain the
    captured matches throughout the rest of the
    program or until their values are replaced by
    data captured in other matches.

23
  • Other special variables
  • There is some degree of "capturing" even when
    grouping is not used.
  • $` (prematch - the part before the match),
  • $& (match - the matched part),
  • $' (postmatch - the part after the match).
  • After this is executed
  • "I like cats and bats." =~ /[rbc]at/
  • $& contains "cat"
  • $` contains "I like "
  • $' contains "s and bats."
  • In general, the original string is equivalent to
    the concatenation of the three special variables.
  • $` . $& . $'
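The three variables can be inspected right after a match, as in this sketch. The names $pre, $match, and $post are illustrative. Note that older Perl versions impose a performance cost on all matches once these variables appear anywhere in a program.

```perl
use strict;
use warnings;

my $text = "I like cats and bats.";
my ($pre, $match, $post) = ("", "", "");

if ($text =~ /[rbc]at/) {
    # Copy the special variables before another match can overwrite them.
    ($pre, $match, $post) = ($`, $&, $');
}

print "prematch:  [$pre]\n";
print "match:     [$match]\n";
print "postmatch: [$post]\n";

# The three pieces reassemble the original string.
print "reassembled ok\n" if $pre . $match . $post eq $text;
```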

24
  • Quantifiers
  • A quantifier is always put after the character
    (or class of characters) to be quantified.
  • /x+/ -- matches one or more x's in a row
  • /[aeiou]{3}/ -- matches any three vowels in a
    row
  • /c.*t/ -- matches a c followed by a t with 0 or
    more of any character in between

25
  • The following tests whether $str contains at
    least one b character in between an a and c.
  • $str =~ /ab+c/
  • true "abc", "abbc" , "abbbc" , "aabcc"
  • false "ac" , "aBc"
  • The following tests whether $str contains a
    sequence of exactly 3 b characters in between an
    a and c.
  • $str =~ /ab{3}c/
  • true "abbbc", "aabbbcc"
  • false "abbc" , "abbbbc"
  • The following tests whether $str contains a
    sequence of at least 2 b characters in between an
    a and c.
  • $str =~ /ab{2,}c/
  • true "abbc", "abbbc" , "aabbbbcc"
  • false "abc" , "aBBc"
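The three quantified patterns can be compared side by side. This sketch stores them as precompiled patterns with qr//; the %pattern hash, its keys, and the check function are invented for illustration.

```perl
use strict;
use warnings;

my %pattern = (
    'one or more b' => qr/ab+c/,
    'exactly 3 b'   => qr/ab{3}c/,
    'at least 2 b'  => qr/ab{2,}c/,
);

# Returns 1 if $str matches the named pattern, otherwise 0.
sub check {
    my ($name, $str) = @_;
    return ($str =~ $pattern{$name}) ? 1 : 0;
}

for my $str ("abc", "abbc", "abbbc", "abbbbc", "ac") {
    printf "%-7s", $str;
    printf " %s=%d", $_, check($_, $str) for sort keys %pattern;
    print "\n";
}
```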

26
  • It gets interesting when quantifiers are mixed
    with the special character classes.
  • The following tests to see if $str contains an
    alphanumeric word (chunk of consecutive
    alphanumeric characters).
  • $str =~ /\w+/
  • true "beest", "1234" , "R2D2" , "x" ,
    "xyz"
  • false "" , " "
  • The following tests to see if $str contains one
    or more consecutive digits (i.e. is there an
    integer inside).
  • $str =~ /\d+/
  • true "1", "121 Elm. St." , "R2D2" , "3.14"
  • false "a" , "" , " "

27
  • The following tests to see if $str contains a
    substring that looks like a (possibly negative)
    integer. That is, does $str contain zero or one
    - characters, followed by one or more
    consecutive digits.
  • $str =~ /-?\d+/
  • true "2", "-2" , "-3.14" , "3-21.7"
  • false "xyx" , "x-y" , "-x"
  • The following tests to see if there is at least
    one whitespace character in $str.
  • $str =~ /\s+/
  • true " ", " " , " xyy" , "The End"
  • false "" , "xyz" , "TheEnd"

28
  • The following matches any two digit hexadecimal
    number. That is, it matches any occurrence of two
    consecutive characters from the class
    [0123456789abcdefABCDEF].
  • /[\da-fA-F]{2}/
  • The quantified pattern is equivalent to the
    longer pattern /[\da-fA-F][\da-fA-F]/.
  • For the next example, suppose we have dates that
    are roughly formatted, but in the general form
  • month_name day_number, year
  • We wish to create a pattern capable of factoring
    out inconsistent formatting and capture the three
    date parts. For example, it should handle both
    dates below.
  • jan 1,2002
  • MARCH 22, 02

29
  • The following tests whether $date contains (a
    group of one or more letters, lower or
    upper-case), followed by one or more spaces,
    followed by (a group of one or more digits),
    followed by a comma and then zero or more spaces,
    followed by (a group of one or more digits).
  • $date =~ /([a-zA-Z]+)\s+(\d+),\s*(\d+)/
  • Since there are three groups, the month is
    captured into $1, the day into $2, and the year
    into $3.
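The date pattern can be tried on both sample dates from the previous slide. This is a minimal sketch; parse_date is a hypothetical helper that returns the three captures, or an empty list when the pattern does not match.

```perl
use strict;
use warnings;

# Returns (month, day, year) as captured strings, or an empty list.
sub parse_date {
    my ($date) = @_;
    if ($date =~ /([a-zA-Z]+)\s+(\d+),\s*(\d+)/) {
        return ($1, $2, $3);
    }
    return;
}

for my $date ("jan 1,2002", "MARCH 22, 02") {
    my ($month, $day, $year) = parse_date($date);
    print "month=$month day=$day year=$year\n";
}
```

The pattern only separates the parts. Normalizing them (for example, expanding a two-digit year) would be a separate step.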

30
  • Quantifiers are greedy by default
  • That means a quantified pattern will attempt to
    match as much as possible. ("Matching is
    greedy.")
  • The following expression tests for a <
    character, followed by one or more of anything
    (wildcard), followed by a > character.
  • "<h1>Title</h1>" =~ /<.+>/
  • The quantifier's greediness passes up "<h1>",
    which would otherwise be a match. So the pattern
    matches the whole string in this case.

31
  • To overcome the greediness (match as little as
    possible), an extra ? character is placed after
    the quantifier.
  • For example, to find HTML tags, the pattern
    /<.+?>/ would be used. It basically says test for
    a < until the first > character is found.
  • The following would only match "<h1>".
  • "<h1>Title</h1>" =~ /<.+?>/
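The difference between greedy and minimal matching shows up when the matched text is captured. In this sketch the sample string and its h1 tag are assumptions chosen for illustration, and the whole pattern is wrapped in a group so that the match can be assigned in list context.

```perl
use strict;
use warnings;

# Hypothetical sample markup.
my $html = "<h1>Title</h1>";

# Greedy: .+ runs to the last > in the string.
my ($greedy) = $html =~ /(<.+>)/;

# Minimal: .+? stops at the first > it can.
my ($minimal) = $html =~ /(<.+?>)/;

print "greedy:  $greedy\n";
print "minimal: $minimal\n";
```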

32
  • Command modifiers
  • The behavior of the matching operator can be
    altered by using a command modifier, which is
    placed after the operator.
  • string_expression =~ /pattern/command_modifier
  • Case insensitive matching
  • The command modifier i specifies that the
    matching should be done in a case insensitive
    fashion.
  • if ($str =~ /be/i) {
        print "The string contains either be, Be, bE, or BE.";
    }
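A brief sketch of the i modifier in use follows; contains_be is a made-up helper name.

```perl
use strict;
use warnings;

# Returns 1 if the string contains "be" in any mix of upper and lower case.
sub contains_be {
    my ($str) = @_;
    return ($str =~ /be/i) ? 1 : 0;
}

for my $str ("wildebeest", "BEEST", "Bee", "cat") {
    printf "%-12s => %d\n", $str, contains_be($str);
}
```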