Python - PowerPoint PPT Presentation

About This Presentation
Title:

Python

Description:

Title: PowerPoint Presentation Last modified by: Preferred Customer Created Date: 1/1/1601 12:00:00 AM Document presentation format: On-screen Show – PowerPoint PPT presentation

Slides: 22
Provided by: opimWhart
Category:
Tags: bracket | python

Transcript and Presenter's Notes

Title: Python


1
Python: Pattern Matching with Regular
Expressions (REs)
  • OPIM 101

File: PythonREs.ppt
2
Foresight
  • Pattern matching
  • Literal
  • With metacharacters
  • Regular expressions (REs)
  • Using REs in Python

3
Consider dir by Itself
D:\athomepc\day\idt>dir

 Volume in drive D has no label
 Volume Serial Number is 3E4B-1609
 Directory of D:\athomepc\day\idt

.              <DIR>        01-01-02   8:16a .
..             <DIR>        01-01-02   8:16a ..
SPRING~1 PDF       180,072  01-01-02   8:17a spring02idtfront.pdf
SPRING~2 PDF       241,542  01-01-02   8:19a spring02idtpartI.pdf
SPRING~3 PDF     1,246,514  01-01-02   8:20a spring02idtpartII.pdf
SPRING~4 PDF     2,517,343  01-01-02   8:22a spring02idtpartIII.pdf
SPRING~5 PDF     3,469,138  01-01-02   8:24a spring02idtpartIV.pdf
CASE1-~1 DOC        35,328  01-01-02   8:42a case1-python.doc
LECTUR~1 PPT        78,336  01-01-02   9:45a lecture01fall01.ppt
PYTHON~1 PPT        34,816  01-01-02   9:46a Python_Intro.ppt
PYTHON~2 PPT        37,376  01-01-02   9:46a Python_Structures.ppt
LECTUR~2 PPT       154,112  01-01-02  11:51a lecture01spring02.ppt
PYTHON~3 PPT        34,816  01-01-02  11:52a PythonREs.ppt
        11 file(s)      8,029,393 bytes
         2 dir(s)        1,209.06 MB free

D:\athomepc\day\idt>
4
Now dir with a Literal Search
D:\athomepc\day\idt>dir case1-python.doc

 Volume in drive D has no label
 Volume Serial Number is 3E4B-1609
 Directory of D:\athomepc\day\idt

CASE1-~1 DOC        35,328  01-01-02   8:42a case1-python.doc
         1 file(s)         35,328 bytes
         0 dir(s)        1,209.06 MB free

D:\athomepc\day\idt>
5
Now dir with *
D:\athomepc\day\idt>dir *.doc

 Volume in drive D has no label
 Volume Serial Number is 3E4B-1609
 Directory of D:\athomepc\day\idt

CASE1-~1 DOC        35,328  01-01-02   8:42a case1-python.doc
         1 file(s)         35,328 bytes
         0 dir(s)        1,209.06 MB free

D:\athomepc\day\idt>
6
Literal vs. Pattern Searches
  • dir myfile.doc
  • Searches literally, for an exact match with
    myfile.doc
  • dir my*.doc
  • Does a pattern search. Matches any file whose name
    begins with my, followed by 0 or more
    characters of any kind, followed by .doc
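
The difference between a literal search and a pattern search can also be tried from Python. This is a minimal sketch that assumes the standard-library fnmatch module as a stand-in for dir's wildcard matching. It does not call dir, and the file names are simply copied from the earlier listing.

```python
import fnmatch

# File names copied from the directory listing shown earlier.
names = ["case1-python.doc", "Python_Intro.ppt", "PythonREs.ppt"]

# Literal search: only an exact name matches.
literal_hits = [n for n in names if n == "case1-python.doc"]

# Pattern search: * stands for zero or more characters of any kind.
doc_hits = fnmatch.filter(names, "*.doc")
ppt_hits = fnmatch.filter(names, "Python*.ppt")

print(literal_hits)  # ['case1-python.doc']
print(doc_hits)      # ['case1-python.doc']
print(ppt_hits)      # ['Python_Intro.ppt', 'PythonREs.ppt']
```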

7
MetaCharacters
  • dir treats * as a metacharacter: a character
    not taken literally, but as an instruction to match
    a certain kind of pattern (here, anything)
  • The dir metacharacter scheme is very useful

8
On Beyond *
  • ...and also very primitive and limited
  • A step up: grep in Unix/Linux, and support for RE
    searches in some text editors, e.g., TextPad
    (www.textpad.com)
  • Regular expressions (REs) use a richer language
    and a larger set of metacharacters, giving us a
    very powerful capability to extract information
    (patterns) from text
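
What grep does with an RE can be approximated in a few lines of Python. This is a rough sketch only: grep_lines is a hypothetical helper name, not part of any library, and real grep has many options this ignores.

```python
import re

def grep_lines(pattern, lines):
    """Hypothetical helper: return the lines containing a match for pattern."""
    return [line for line in lines if re.search(pattern, line)]

lines = [
    "spring02idtfront.pdf",
    "case1-python.doc",
    "Python_Intro.ppt",
    "PythonREs.ppt",
]

# [Pp]ython matches either capitalization, anywhere in the line.
python_lines = grep_lines(r"[Pp]ython", lines)
# \.ppt$ matches only lines that end in a literal .ppt
ppt_lines = grep_lines(r"\.ppt$", lines)

print(python_lines)  # ['case1-python.doc', 'Python_Intro.ppt', 'PythonREs.ppt']
print(ppt_lines)     # ['Python_Intro.ppt', 'PythonREs.ppt']
```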

9
Python's RE Metacharacters
  • Here's the complete list:
  • . ^ $ * + ? { } [ ] \ | ( )
  • No use memorizing. We'll learn by examples.
  • A natural question: But what if I want to search
    for a pattern that contains what Python's RE
    counts as metacharacters?
  • Be just a little patient

10
Load Python's re Module
>>> import re
>>> teststring = "Television is public anomie number 1."
>>> teststring
'Television is public anomie number 1.'
>>> len(teststring)
37
>>> match = re.search('anomie', teststring)
>>> match == None
0
>>> match.span()
(21, 27)
>>> teststring[21:27]
'anomie'
>>>
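
On a current Python 3 interpreter, the same literal search might be written as below. This is a sketch rather than the session from the slide: comparisons there print True or False instead of 1 or 0, and testing with "is not None" is the usual idiom.

```python
import re

teststring = "Television is public anomie number 1."

# re.search returns a match object, or None when nothing matches.
match = re.search("anomie", teststring)
if match is not None:
    start, end = match.span()
    print(start, end)             # 21 27
    print(teststring[start:end])  # anomie
else:
    print("no match")
```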
11
Now a Nonliteral Match
>>> match = re.search('Television',teststring)
>>> match == None
0
>>> match = re.search('television',teststring)
>>> match == None
1
>>> match = re.search('[tT]elevision',teststring)
>>> match.span()
(0, 10)
>>> teststring
'Television is public anomie number 1.'
>>>
12
Square Bracket Notation [...]
  • [tT] means any one of the characters t or
    T.
  • [...] is called a character class
  • Examples:
  • [abc], [a-z], [A-Z]
  • [^tT]: not t and not T
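
The character classes above can be tried with re.findall, which returns every non-overlapping match as a list. This is a small sketch; the sample word is just an illustration.

```python
import re

teststring = "Television is public anomie number 1."

# [aeiou] matches any one of the listed characters.
vowels = re.findall(r"[aeiou]", "anomie")
# [^aeiou] is the negated class: any one character NOT listed.
non_vowels = re.findall(r"[^aeiou]", "anomie")
# [A-Z] is a range: any one uppercase letter.
capitals = re.findall(r"[A-Z]", teststring)

print(vowels)      # ['a', 'o', 'i', 'e']
print(non_vowels)  # ['n', 'm']
print(capitals)    # ['T']
```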

13
Not Example
>>> teststring
'Television is public anomie number 1.'
>>> match = re.search('[^tT][a-z]+',teststring)
>>> match.span()
(1, 10)
>>> teststring[1:10]
'elevision'
>>>
Note: + means one or more of the previous; * means zero or more;
? means zero or one
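
The three quantifiers can be compared side by side on one short string. This is a minimal sketch; the sample text is made up for the purpose.

```python
import re

sample = "a ab abb"

# + : an a followed by one or more b
plus_hits = re.findall(r"ab+", sample)
# * : an a followed by zero or more b
star_hits = re.findall(r"ab*", sample)
# ? : an a followed by zero or one b
opt_hits = re.findall(r"ab?", sample)

print(plus_hits)  # ['ab', 'abb']
print(star_hits)  # ['a', 'ab', 'abb']
print(opt_hits)   # ['a', 'ab', 'ab']
```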
14
'\s\w\.' and '\s(\w)\.'
>>> teststring
'Television is public anomie number 1.'
>>> match = re.search('\s\w\.',teststring)
>>> match.span()
(34, 37)
>>> teststring[34:37]
' 1.'
>>> match = re.search('\s(\w)\.',teststring)
>>> match.span(0)
(34, 37)
>>> match.span(1)
(35, 36)
>>> teststring[35:36]
'1'
>>>
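
Parentheses create a group, and the match object can hand back each group's text directly, not just its span. This sketch uses a raw string (the r prefix) for the pattern, which is the usual practice today because recent Python versions warn about backslash sequences such as \s in ordinary string literals.

```python
import re

teststring = "Television is public anomie number 1."
match = re.search(r"\s(\w)\.", teststring)

whole = match.group(0)   # the entire match
digit = match.group(1)   # only what the parentheses captured

print(repr(whole))       # ' 1.'
print(repr(digit))       # '1'
print(match.groups())    # ('1',)
print(match.span(1))     # (35, 36)
```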
15
[.] = \.
  • Inside [...] most metacharacters are taken
    literally
  • So, [.] = \.
  • Note (again): [...] is called a character class

>>> match = re.search('\s(\w)[.]',teststring)
>>> match.span()
(34, 37)
>>>
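
This also answers the earlier question about searching for text that contains metacharacters: escape them with a backslash, or put them in a character class. The sketch below shows both, plus the standard-library function re.escape, which the slides do not mention; the sample strings are invented for illustration.

```python
import re

# An unescaped . matches any character, so this pattern also matches "125".
any_char = bool(re.search(r"1.5", "125"))
# Escaped with a backslash, or placed in a character class, . is literal.
escaped = bool(re.search(r"1\.5", "125"))
in_class = bool(re.search(r"1[.]5", "125"))
literal_ok = bool(re.search(r"1[.]5", "1.5"))

print(any_char, escaped, in_class, literal_ok)  # True False False True

# re.escape builds a literal pattern from text that contains metacharacters.
needle = "(approx.)"
found = re.search(re.escape(needle), "cost is $3 (approx.)")
print(found.group())  # (approx.)
```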
16
Avoiding Greed: ?
>>> newstring = '<div align="center">'
>>> newstring = newstring + '<i class="smaller">'
>>> newstring = newstring + '(As of 10:55 AM on 12/20/01)'
>>> newstring = newstring + '</i></div><br>'
>>> newstring
'<div align="center"><i class="smaller">(As of 10:55 AM on 12/20/01)</i></div><br>'
>>> match = re.search('<.*>',newstring)
>>> match.span()
(0, 81)
>>> match = re.search('<.*?>',newstring)
>>> match.group()
'<div align="center">'
>>>
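
The greedy and non-greedy forms can be compared directly. This sketch adds re.findall to list every tag the non-greedy pattern picks out; it is an illustration, not a recommendation to parse HTML with REs in general.

```python
import re

newstring = ('<div align="center"><i class="smaller">'
             '(As of 10:55 AM on 12/20/01)</i></div><br>')

# Greedy: .* grabs as much as it can, so the match runs to the last >.
greedy = re.search(r"<.*>", newstring)
# Non-greedy: .*? grabs as little as it can, so the match stops at the first >.
lazy = re.search(r"<.*?>", newstring)
tags = re.findall(r"<.*?>", newstring)

print(greedy.span())  # (0, 81)
print(lazy.group())   # <div align="center">
print(tags)
```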
17
More on Not Being Greedy
>>> match = re.search(r'<(\w).*?>(.*)</(\1)',newstring)
>>> match.groups()
('d', '<i class="smaller">(As of 10:55 AM on 12/20/01)</i>', 'd')
>>> match = re.search(r'<(\w).*?>([^<]+)</(\1)',newstring)
>>> match.groups()
('i', '(As of 10:55 AM on 12/20/01)', 'i')
>>>
\1 is called a backreference. It refers to group 1.
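
A backreference is easiest to see on a small case first. The sketch below uses \1 to find a doubled word (an invented example), then repeats the tag search using the string from the previous slide.

```python
import re

# (\w+) captures a word; \1 then requires the very same text again.
doubled = re.search(r"\b(\w+) \1\b", "it is is fine")
print(doubled.group())   # is is
print(doubled.group(1))  # is

newstring = ('<div align="center"><i class="smaller">'
             '(As of 10:55 AM on 12/20/01)</i></div><br>')

# The closing tag must start with the same letter captured in group 1.
tag = re.search(r"<(\w).*?>([^<]+)</(\1)", newstring)
print(tag.groups())  # ('i', '(As of 10:55 AM on 12/20/01)', 'i')
```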
18
Concluding
  • REs are a very powerful tool, very often very
    useful
  • The language notation is compact and a bit hard
    to read
  • Practice, study the examples, don't worry about
    memorization.

19
Advice on Scripting
  • Scripting, and programming in general, is a
    process
  • Successful scripts don't spring into existence
    whole
  • Scripts are built in small increments
  • Attend to:
  • Decomposition
  • Stories
  • Testing

20
Advice on Scripting
  • Decomposition
  • Solve big problems by decomposing them into small
    problems and solving them
  • Stories
  • Scripting/programming as a form of literature
  • Use comments with code to tell a clear story
    about what the code is or should be doing
  • Testing
  • Everything, whole and part, often, varying inputs

21
Readings
  • IDT book, chapter 8, Text and Pattern
    Processing
  • Further information (but beyond the scope of 101)
  • The Python online documentation on the re module
  • Regular Expression HOWTO by A.M. Kuchling at
    http://py-howto.sourceforge.net/ and also at
    http://py-howto.sourceforge.net/regex/regex.html