Matching in list context (Chapter 11 continued) - PowerPoint PPT Presentation

Provided by: craigkn
Transcript and Presenter's Notes

Title: Matching in list context (Chapter 11 continued)


1
  • Matching in list context (Chapter 11
    continued)
  • @array = ($str =~ /pattern/);
  • This stores the list of the special ($1, $2, ...)
    capturing variables into @array only if there
    are grouped expressions in the pattern to capture
    matches. Otherwise, if there are no grouped
    expressions, either (1) or () is returned into
    @array depending upon whether the match
    succeeds or not.
  • The following results in
  • ("cat chow", "cat", "chow")
  • being assigned to @array.
  • @array = ("Purina cat chow" =~ /((cat|dog|ferret) (food|chow))/);
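The list-context behaviour described above can be tried out with a short script. This is a minimal sketch; the variable names @hit and @miss are made up for illustration and do not come from the slides.

```perl
use strict;
use warnings;

# With groups: one list element per capturing group, ordered by opening parenthesis.
my @array = ("Purina cat chow" =~ /((cat|dog|ferret) (food|chow))/);
print join(" | ", @array), "\n";

# Without groups: the list is (1) on success and the empty list on failure.
my @hit  = ("Purina cat chow" =~ /cat/);
my @miss = ("Purina cat chow" =~ /parrot/);
print "hit has ", scalar(@hit), " element(s), miss has ", scalar(@miss), "\n";
```

The outermost group comes first in the list because its opening parenthesis comes first in the pattern.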

2
  • The g command modifier causes matching to be
    done globally -- it doesn't quit after finding
    the first match.
  • @array = ($str =~ /pattern/g);
  • Global matching is simplest when there are no
    grouped expressions in the pattern. With groups,
    the list holds the captured groups from every
    match rather than the whole matches.
  • The following results in the list ("an ", "amp")
    being assigned to @array.
  • @array = ("an example" =~ /a../g);
  • In contrast the following would result in the
    one-element list ("an ") being assigned to
    @array.
  • @array = ("an example" =~ /(a..)/);
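A small script makes the difference between the two forms visible. It is only a sketch: the third match, with two groups and the g modifier, uses a made-up string to show what a global match returns when the pattern does contain groups.

```perl
use strict;
use warnings;

# Global match without groups: every whole match is returned.
my @all = ("an example" =~ /a../g);

# Non-global match with one group: only the first match, via $1.
my @first = ("an example" =~ /(a..)/);

# Global match with groups (hypothetical string): the captures of every
# match are returned in sequence, not the whole matches.
my @pairs = ("a1 b2" =~ /(\w)(\d)/g);

print scalar(@all), " ", scalar(@first), " ", scalar(@pairs), "\n";
```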

3
  • The following statement parses out all of the
    HTML tags and stores the list ("<h1>", "</h1>")
    in the @tags array.
  • @tags = ("<h1>Title</h1>" =~ /<.*?>/g);
  • Suppose $document is a (perhaps long) string
    that contains some text document, and suppose we
    want to pull out all the social security numbers
    from the document. If we assume social security
    numbers look like 123-45-6789, then a solution is
  • @soc_numbers = ($document =~ /\d{3}-\d{2}-\d{4}/g);
  • But what if the social security numbers are
    inconsistent in that some are missing the dashes?
    Then a solution is
  • @soc_numbers = ($document =~ /\d{3}-?\d{2}-?\d{4}/g);
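The two social security patterns can be compared side by side. This is a sketch on a made-up $document; the names and numbers in it are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical document text with both dashed and undashed numbers.
my $document = "Alice: 123-45-6789, Bob: 987654321, Carol: 555-12-3456.";

# Dashes required: the undashed number is skipped.
my @strict = ($document =~ /\d{3}-\d{2}-\d{4}/g);

# Dashes optional: all three numbers are found.
my @loose = ($document =~ /\d{3}-?\d{2}-?\d{4}/g);

print "strict: @strict\n";
print "loose:  @loose\n";
```

Because the looser pattern accepts any nine consecutive digits, it would also match inside a longer digit string. Adding \b at both ends of the pattern is one way to guard against that.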

4
  • Two very useful functions that take patterns and
    return lists.

split(pattern, string)
    Returns a list consisting of the fields (the substrings not used in any matches) between successful matches of the pattern against the string. Trailing empty fields are omitted.
split(pattern, string, limit)
    Returns a list with at most limit fields.
grep(pattern, list)
    Returns a list consisting of those elements of the given list that match the pattern. (The name comes from the Unix grep utility, after the ed command g/re/p: globally search for a regular expression and print.)

5
  • We have used split often, even in the decoding
    routine where we split about a one-character
    string.
  • @nameValuePairs = split(/&/, $datastring);
  • A string with more complicated delimiting
    patterns can also be split. In the following
    case, a delimiter is one or more colons.
  • $str = "23224559885";   (the colons in this string did not survive in the transcript)
  • @numbers = split(/:+/, $str);
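Here is a sketch of split with a multi-colon delimiter, with a limit, and with trailing empty fields. All three input strings are hypothetical; in particular, the colon positions in the first string are invented, since the transcript does not show them.

```perl
use strict;
use warnings;

# Hypothetical string: fields separated by runs of one or more colons.
my $str = "23:224::55:::9885";
my @numbers = split(/:+/, $str);      # four fields

# With a limit of 2, the remainder of the string stays in the last field.
my @two = split(/:+/, $str, 2);

# Trailing empty fields are dropped when no limit is given.
my @trimmed = split(/:/, "a:b::");

# Splitting about a single character, as in form decoding.
my @nameValuePairs = split(/&/, "name=Ann&age=30");

print scalar(@numbers), " ", scalar(@two), " ", scalar(@trimmed), "\n";
```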

6
  • grep (named after the Unix grep command) is
    different from split in that you send it an array
    rather than a string. It "filters" the array
    based upon the regular expression. That is, only
    those array elements which match the pattern are
    returned.
  • Suppose @domains contains some large number of
    named Web addresses. One simple call to grep can
    filter out only those addresses in the ".edu"
    domain, for example
  • @edu_sites = grep(/\.edu/, @domains);
  • Note: The period had to be escaped since it is a
    metacharacter.
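The effect of escaping the period shows up on a small list. This is a sketch with made-up host names; @sloppy and @anchored are illustrative additions, not part of the slides.

```perl
use strict;
use warnings;

# Hypothetical host names.
my @domains = (
    "www.example.edu",
    "www.example.org",
    "www.reduce.example.com",
    "cs.example.edu",
);

# Escaped period: only names containing a literal ".edu" pass the filter.
my @edu_sites = grep(/\.edu/, @domains);

# Unescaped period: "redu" in "reduce" also matches, a false positive.
my @sloppy = grep(/.edu/, @domains);

# Anchoring with $ additionally requires ".edu" at the end of the name.
my @anchored = grep(/\.edu$/, @domains);

print "@edu_sites\n";
```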

7
Example: Analyzing log files. A typical HTTP
access log. See accesslog.txt.
8
  • The 10 different fields are actually standard.
  • Results when we split out the first line (around
    delimiting spaces).
  • @fields = split(/\s/, $line);

FIELD        First Line               Meaning
$fields[0]   136.201.141.108          Address (either IP or name) of client
$fields[1]   -                        Not used anymore
$fields[2]   -                        Not used anymore
$fields[3]   [09/Nov/2001:10:34:01    Date and time
$fields[4]   -0600]                   Time zone
$fields[5]   "GET                     Request method
$fields[6]   /                        Relative part of URL (here the site root)
$fields[7]   HTTP/1.1"                HTTP version
$fields[8]   200                      Status code (success code or error code)
$fields[9]   16058                    Bytes transferred
9
  • Log file analysis can get very elaborate and
    there are many commercial and free software
    packages available for that.
  • For a simple example, we count the total number
    of hits (lines in the access log) and the total
    number of unique hits (different IP addresses).
  • Notice that requesting one page can result in
    numerous lines in the access log since all of the
    image transfers are separate HTTP transactions.
    (Some hit counters you find actually report the
    number of lines in the file!)
  • Counting lines is easy. To count the number of
    unique IP addresses, we add IP addresses to a
    hash as the keys. Thus a new hash entry can only
    originate from a new IP address. We then count
    the number of keys in the hash.
  • See source file hitcount.pl
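The counting idea can be sketched as follows. This is not the hitcount.pl file referred to above, which is not shown in the slides. It works on an in-memory list instead of a log file, and the second and third log lines are invented for illustration.

```perl
use strict;
use warnings;

# Stand-in for the lines of an access log (second and third lines are made up).
my @log_lines = (
    '136.201.141.108 - - [09/Nov/2001:10:34:01 -0600] "GET / HTTP/1.1" 200 16058',
    '136.201.141.108 - - [09/Nov/2001:10:34:02 -0600] "GET /logo.gif HTTP/1.1" 200 2326',
    '192.0.2.17 - - [09/Nov/2001:10:35:10 -0600] "GET / HTTP/1.1" 200 16058',
);

my $hits = 0;    # total number of lines
my %seen;        # keys are the client addresses seen so far

foreach my $line (@log_lines) {
    $hits++;
    my @fields = split(/\s/, $line);
    $seen{ $fields[0] } = 1;    # a repeated address reuses its existing key
}

my $unique = scalar(keys %seen);
print "Total hits: $hits\n";
print "Unique hits: $unique\n";
```

To process a real log, the foreach loop would read lines from an opened file handle instead of @log_lines.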

10
  • The substitution operator
  • $scalar_variable =~ s/pattern/replacement_string/command_modifiers
  • The binding operator =~ "binds" the substitution
    onto the string.
  • The substitution operator s/// takes two
    arguments (in contrast to the match operator m//,
    which takes one).
  • It attempts to find a match for the pattern in
    the scalar variable, and if successful, replaces
    the match with the replacement string.
  • Thus, the scalar variable is altered if a
    successful match is found. In contrast, the match
    operator does not alter the string onto which it
    is bound.

11
  • The following attempts to replace "the" with "my".
  • $str = "the cat in the hat";
  • $str =~ s/the/my/;
  • This causes $str to contain "my cat in the hat".
  • By default, only the left-most occurrence is
    replaced.
  • The g (global) command modifier causes
    substitutions to be made globally.
  • $str = "the cat in the hat";
  • $str =~ s/the/my/g;
  • This causes $str to contain "my cat in my hat".

12
  • The following results in $str having the value
    "puppy ferret category". (non-global
    substitution)
  • $str = "puppy dog category";
  • $str =~ s/(cat|dog)/ferret/;
  • A similar global substitution results in $str
    containing "puppy ferret ferretegory".
  • $str = "puppy dog category";
  • $str =~ s/(cat|dog)/ferret/g;
  • The following replaces all whitespace characters
    with the empty string, resulting in $str
    containing "hello".
  • $str = "h e l l o";
  • $str =~ s/\s//g;
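The substitution examples can be run together in one script. This is a sketch: the copy-then-substitute idiom, the use of the return value, and the \b variant are illustrative additions rather than material from the slides.

```perl
use strict;
use warnings;

my $str = "the cat in the hat";

# Copy first, then substitute on the copy, so $str itself stays unchanged.
(my $first = $str) =~ s/the/my/;     # left-most occurrence only
(my $all   = $str) =~ s/the/my/g;    # every occurrence

# s/// returns the number of substitutions it made.
my $pets  = "puppy dog category";
my $count = ($pets =~ s/(cat|dog)/ferret/g);

# Word boundaries keep "cat" inside "category" from being replaced.
my $careful = "puppy dog category";
$careful =~ s/\b(cat|dog)\b/ferret/g;

my $spaced = "h e l l o";
$spaced =~ s/\s//g;

print "$first\n$all\n$pets ($count)\n$careful\n$spaced\n";
```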

13
  • Captured matches can actually be included in
    the replacement string.
  • $str = "puppy dog category";
  • $str =~ s/(\w+)/$1s/g;
  • This results in $str having the value "puppys
    dogs categorys".
  • There is only one set of grouping parentheses
    used in this example, so we only need to use $1.
  • As each match is found, $1 is assigned that new
    match. Thus, $1 may be reused several times
    during a global substitution.
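The following sketch shows captures reused in a replacement string. It writes the capture as ${1} so the boundary between the variable and the literal "s" is explicit. The second example, which swaps two captured words, uses a made-up string.

```perl
use strict;
use warnings;

my $str = "puppy dog category";
$str =~ s/(\w+)/${1}s/g;    # $1 is reassigned for each word matched

# Hypothetical second example: two groups, reordered in the replacement.
my $name = "Doe, Jane";
$name =~ s/(\w+), (\w+)/$2 $1/;

print "$str\n$name\n";
```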

14
  • The transliteration operator
  • $scalar =~ tr/search_characters/replacement_characters/
  • This replaces the search characters with the
    corresponding replacement characters.
  • It's usually used with single characters.
  • $str = "the cat in the hat";
  • $str =~ tr/a/u/;
  • The result is "the cut in the hut".
  • Transliteration can be done using substitutions,
    but tr automatically replaces every occurrence
    and works on literal characters rather than a
    pattern, which means you don't have to escape
    regular-expression metacharacters.
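Here is a short sketch of tr in use. The price and upper-casing examples are illustrative additions, chosen to show a literal period and a character range.

```perl
use strict;
use warnings;

my $str = "the cat in the hat";

# Copy, then transliterate the copy; every "a" becomes "u".
(my $changed = $str) =~ tr/a/u/;

# tr returns the number of characters it replaced.
my $count = ($str =~ tr/a/u/);

# The period is an ordinary character to tr, so no escaping is needed.
my $price = "1.50";
$price =~ tr/./,/;

# A range maps each character to its counterpart in the other range.
my $upper = "hello";
$upper =~ tr/a-z/A-Z/;

print "$changed\n$count\n$price\n$upper\n";
```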

15
Example: Inspired by news sites which
display parts of stories and provide links
pointing to the full stories.
See partialcontent.cgi
16
  • Each story is a text file (.news)
  • Paragraphs must be separated by at least one
    blank line (\n\n)
  • The program reads the directory and prints the
    first two paragraphs of only the .news files.

17
  • Acquiring only the .news stories from the
    directory is straightforward, especially with
    the power of grep.
  • opendir(D, "$storyDataDir");
  • @storyFiles = readdir(D);
  • closedir(D);
  • @storyFiles = grep(/\.news$/, @storyFiles);
  • We then loop over the .news files and process
    each one.
  • foreach $file (@storyFiles) {
  •   if (open(STORY, "$storyDataDir$file")) {
  •     my @wholeStory = <STORY>;
  •     close(STORY);
  •     # join whole story into one string
  •     my $story = join("", @wholeStory);
  •     # ... extract and print the paragraphs (next slide) ...
  •   }
  • }

18
  • We can then extract all of the paragraphs with
    one global match!
  • @paragraphs = ($story =~ /((?:.|\n)+?\n\s*\n)/g);
  • It's then trivial to print the first two
    paragraphs.
  • But the pattern certainly needs clarification.
  • First we need to identify the space between
    paragraphs.
  • \n\s*\n matches one or more consecutive blank
    lines
  • That is, two newline characters
    with zero or more whitespace characters
    in between.
  • Since quantifiers are greedy, the pattern will
    not stop after finding the first in a sequence of
    blank lines.

19
  • Now we match paragraph content.
  • (?:.|\n)+ one or more of any character
  • (the wildcard . alone doesn't match \n characters)
  • Now the whole pattern which matches a paragraph.
  • /(?:.|\n)+?\n\s*\n/ one or more of anything,
    then a blank line(s)
  • Notes
  • One would have been tempted to identify
    paragraphs as one or more wildcard characters
    (.+). But that would miss parts of paragraphs
    containing an inadvertent hard return (\n)
    between sentences.
  • The extra metacharacter (?) after the + specifies
    non-greedy matching. Otherwise, the pattern
    would not stop after the first paragraph.
  • The inner group is written (?:...) so that it
    groups without capturing. A global match in list
    context returns every captured group, so a
    capturing inner group would add a stray
    one-character element to @paragraphs after each
    paragraph.

20
  • There are still two subtle pitfalls regarding
    the structure of the news files.
  • If the file begins with three or more newline
    characters (\n\n\n), the first \n will be
    matched as the first paragraph. (That
    is not a problem for multiple blank lines between
    paragraphs since \n\s*\n is greedy.)
  • If there are no blank lines after the last
    paragraph in the file, the last paragraph will
    not be matched (hence not captured). That
    doesn't affect this application as long as there
    are three or more paragraphs in a file.
  • How would you fix those problems?
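One possible answer, offered as a sketch rather than as the author's intended solution, is to normalize the story string before matching: strip leading whitespace and append a blank line. The $story text below is made up, and the inner group is written as non-capturing, (?:...), so that the global match returns only whole paragraphs.

```perl
use strict;
use warnings;

# Hypothetical story: leading blank lines, and no blank line after the last paragraph.
my $story = "\n\n\nFirst paragraph,\nsecond line.\n\n"
          . "Second paragraph.\n \n\n"
          . "Third paragraph, no trailing blank line.";

$story =~ s/^\s+//;    # pitfall 1: drop blank lines at the start of the file
$story .= "\n\n";      # pitfall 2: guarantee a blank line after the last paragraph

# Non-greedy paragraph content followed by one or more blank lines.
my @paragraphs = ($story =~ /((?:.|\n)+?\n\s*\n)/g);

print "Found ", scalar(@paragraphs), " paragraphs\n";
print $paragraphs[0], $paragraphs[1];    # the first two paragraphs
```

Stripping with s/^\s+// also removes any indentation on the first line of the story, which is harmless here but worth knowing if indentation matters.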