Making your regular expressions efficient

1
Making your regular expressions efficient
  • Eitan Schuler, ExLibris

2
Introduction
  • Executing a regex takes Perl time, even with its
    internal optimizations
  • Performance is not always an issue, but it
    doesn't hurt to keep efficiency in mind

3
Agenda
  • Compile only once (/o, qr//)
  • Regex objects
  • Match variables
  • Study()
  • Alternation
  • Backtracking
  • Memory-free (non-capturing) parentheses
  • Benchmarking

4
Compile only once with /o or qr//
  • Regular expressions without interpolated variables
    are compiled into Perl's internal form only once,
    at compile time.

foreach (@mylist) { $count += /perl/ }
5
Compile only once with /o or qr//
  • When a pattern contains interpolated variables,
    it's interpolated again, and recompiled if needed,
    every time it's used.
  • The reason: $word might have changed

chomp($word = <STDIN>);
foreach (@mylist) { $count += /$word/ }
6
Compile only once with /o or qr//
  • /o says: compile it only the first time! Even
    if $word has changed, don't compile it again!

chomp($word = <STDIN>);
foreach (@mylist) { $count += /$word/o }
7
Compile only once with /o or qr//
  • The gotcha
  • Once the day is over, it won't work any more: the
    pattern still holds the day from the first call.

sub MySub {
    my $today = (qw<sun mon tue wed thu fri sat>)[(localtime)[6]];
    while (<IN>) {
        if (m/$today/io) { . . . }
    }
}
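
The gotcha can be reproduced without waiting for midnight. The following is a minimal sketch, assuming a hypothetical helper count_with_o (not from the slides) that receives the word as an argument instead of computing the weekday:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the slides): counts list items matching $word.
# The /o flag freezes the pattern the first time this match is executed.
sub count_with_o {
    my ($word, @items) = @_;
    my $count = 0;
    for (@items) {
        $count++ if /$word/o;
    }
    return $count;
}

my @items = qw(perl python perl ruby);

my $first  = count_with_o('perl', @items);   # pattern compiled as /perl/
my $second = count_with_o('ruby', @items);   # /o: still matching /perl/

print "first=$first second=$second\n";       # prints "first=2 second=2"
```

The second call asks for "ruby" but still counts the two "perl" items, because the pattern was frozen on the first call.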
8
Compile only once with /o or qr//: Regex object
  • The fix
  • Efficient: each call compiles only once
  • Correct

sub MySub {
    my $today = (qw<sun mon tue wed thu fri sat>)[(localtime)[6]];
    my $RegExObj = qr/$today/i;    # compiles once per call
    while (<IN>) {
        if ($_ =~ $RegExObj) { . . . }
    }
}
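
The same idea can be tried in isolation. This is a small sketch, assuming a hypothetical helper count_with_qr (not from the slides) that takes the word as an argument; the regex object is built once per call, outside the loop, so each call gets a pattern that is both current and compiled only once:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the slides): counts list items matching $word.
# qr// builds the regex object once per call, before the loop starts.
sub count_with_qr {
    my ($word, @items) = @_;
    my $re    = qr/$word/i;
    my $count = 0;
    for (@items) {
        $count++ if $_ =~ $re;
    }
    return $count;
}

my @items = qw(perl Python PERL ruby);

my $perl_count = count_with_qr('perl', @items);   # matches "perl" and "PERL"
my $ruby_count = count_with_qr('ruby', @items);   # a fresh pattern for this call

print "perl=$perl_count ruby=$ruby_count\n";      # prints "perl=2 ruby=1"
```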
9
Match variables
  • $`, $&, $'

$text = "Jerusalem Perl Mongers";
if ($text =~ /Perl/) {
    print $`;    # Jerusalem
    print $&;    # Perl
    print $';    # Mongers
}
10
Match variables
  • When match variables are used anywhere in the program
    (even once), perl has to keep track of these values
    for every single pattern match in the program.
    (Perl 5.20 and later no longer pay this penalty.)

$text = "Jerusalem Perl Mongers";
print $& if $text =~ /Perl/;
while (<>) {
    push @mylist, $_ if /hello/;
}
11
Match variables
  • Solution: don't use $`, $&, $'
  • Instead of $&
  • Use $1

$text = "Jerusalem Perl Mongers";
$text =~ s/Perl/\U$&/g;
$text = "Jerusalem Perl Mongers";
$text =~ s/(Perl)/\U$1/g;
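
A runnable version of the $1 form, as a small sketch; the variable name $shouted is only illustrative, and the original string is copied so that it stays unchanged:

```perl
use strict;
use warnings;

my $text = "Jerusalem Perl Mongers";

# Capture the word and upper-case it through $1 instead of $&.
(my $shouted = $text) =~ s/(Perl)/\U$1/g;

print "$shouted\n";    # prints "Jerusalem PERL Mongers"
```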
12
Match variables
  • @- and @+ refer to the starting and ending offsets
    of the last match.
  • $-[0] and $+[0] refer to the offsets of the whole
    match.
  • $-[1] and $+[1] belong to $1, $-[2] and $+[2] to
    $2, and so on for each captured group.

13
Match variables
  • Solution: don't use $`, $&, $'

$text = "Jerusalem Perl Mongers";
$text =~ m/Perl/;
print substr($text, 0, $-[0]) . "\n";                # like $`
print substr($text, $-[0], $+[0] - $-[0]) . "\n";    # like $&
print substr($text, $+[0]) . "\n";                   # like $'
(see regex3.pl)
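
The three substr calls can be wrapped in one routine. This is a sketch under the assumption that a helper named split_around_match (hypothetical, not from the slides or regex3.pl) is acceptable; it relies only on @- and @+:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the slides): returns the text before the
# match, the match itself, and the text after it, using only @- and @+.
sub split_around_match {
    my ($text, $re) = @_;
    return unless $text =~ $re;
    my ($start, $end) = ($-[0], $+[0]);
    return (
        substr($text, 0, $start),               # what $` would hold
        substr($text, $start, $end - $start),   # what $& would hold
        substr($text, $end),                    # what $' would hold
    );
}

my ($before, $match, $after) =
    split_around_match("Jerusalem Perl Mongers", qr/Perl/);

print join("|", $before, $match, $after), "\n";   # prints "Jerusalem |Perl| Mongers"
```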
14
Study()
  • Optimizes some kinds of searches of a string.
  • After studying a string, a regex can benefit from
    the cached knowledge about the string.

15
Study()
  • Usage

$longtext = "Jerusalem Perl Mongers";
study($longtext);
if ($longtext =~ m/\sC\s/) { ... }
if ($longtext =~ m/Perl/)  { ... }
if ($longtext =~ m/PHP/)   { ... }
if ($longtext =~ m/COBOL/) { ... }
16
Study()
  • Modifying the string invalidates the study data.

$text = "Jerusalem Perl Mongers";
study($text);
if ($text =~ m/\sC\s/) { ... }
if ($text =~ m/Perl/)  { ... }
$text = "Yet another text";    # the study data is no longer valid
if ($text =~ m/PHP/)   { ... }
if ($text =~ m/COBOL/) { ... }
17
Study() - when to use
  • Matching literal text many times against the same
    string.
  • The string is long.

18
Study() - when not to use
  • When matching with /i
  • When the string is short
  • When you plan just a few matches before the string
    is modified
  • Always benchmark and see if it is worth it!
    (Since Perl 5.16, study() is a no-op.)

19
Avoid unnecessary alternation
  • Alternation is generally slow.
  • Each time an alternative fails to match, the
    regex engine has to backtrack in the string and
    try the next alternative.

20
Avoid unnecessary alternation
  • First it tries to match Motke. If that fails, it
    backs up to the beginning of the group, "(", and
    then tries Diana. If that fails, it backs up again
    and again to the same point, until an alternative
    matches or all of them fail.

while (<>) { print if /(Motke|Diana|Zachary|Issac)/ }
21
Avoid unnecessary alternation
  • Don't ever use alternation (a|b|c) instead of a
    character class [abc].
  • Incorrect
  • Correct

while (<>) { push @digits, m/(0|1|2|3|4|5|6|7|8|9)/g }
while (<>) { push @digits, m/[0-9]/g }
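
Both forms collect the same digits, which is easy to confirm on a fixed string. A minimal sketch, assuming a sample line that is not in the slides:

```perl
use strict;
use warnings;

my $line = "Room 42, floor 7";    # sample input, not from the slides

# Same result either way; the character class avoids trying ten alternatives.
my @by_alternation = $line =~ m/(0|1|2|3|4|5|6|7|8|9)/g;
my @by_class       = $line =~ m/[0-9]/g;

print "@by_alternation\n";    # prints "4 2 7"
print "@by_class\n";          # prints "4 2 7"
```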
22
Non-capturing Parentheses (?:)
  • Example: matching a number. To make \.[0-9]+
    optional we need ()
  • The problem: we captured unnecessary data into
    $1. We don't use it.

if ($number =~ m/-?[0-9]+(\.[0-9]+)?/)
23
Non-capturing Parentheses (?:)
  • The solution
  • No $1 in this case
  • More efficient
  • Less confusing later on

if ($number =~ m/-?[0-9]+(?:\.[0-9]+)?/)
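
The difference can be observed through @+, which holds one entry for the whole match plus one per capture group. A small sketch, assuming the sample value "-3.14" and the + quantifiers, neither of which is confirmed by the slides:

```perl
use strict;
use warnings;

my $number = "-3.14";    # sample value, not from the slides

# With capturing parentheses, the fraction lands in $1.
$number =~ m/-?[0-9]+(\.[0-9]+)?/ or die "no match";
my $groups_capturing = scalar(@+) - 1;    # capture groups in the last match
my $captured         = $1;

# With (?:...), the group still makes the fraction optional but stores nothing.
$number =~ m/-?[0-9]+(?:\.[0-9]+)?/ or die "no match";
my $groups_noncapturing = scalar(@+) - 1;

print "capturing: $groups_capturing group(s), \$1 = $captured\n";
print "non-capturing: $groups_noncapturing group(s)\n";
```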
24
Non-capturing Parentheses (?:)
  • Example: this|that
  • Seems bad; use this instead
  • Much faster, for 2 reasons:
  • Non-capturing
  • When the first 2 chars don't match, it never
    tries the alternation.

if ($text =~ m/(this|that)/)
if ($text =~ m/th(?:is|at)/)
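
Factoring out the common prefix does not change which strings match. A quick sketch with sample strings that are assumptions, not taken from the slides:

```perl
use strict;
use warnings;

my @samples = ("take this", "not that", "other", "thus");

# Both patterns accept the same strings; the second shares the "th" prefix
# and captures nothing.
my @with_alternation = grep { m/(this|that)/ }  @samples;
my @with_prefix      = grep { m/th(?:is|at)/ } @samples;

print join(", ", @with_alternation), "\n";    # prints "take this, not that"
print join(", ", @with_prefix), "\n";         # prints "take this, not that"
```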
25
Avoid unnecessary backtracking
  • Let's find words ending with either "a" or "b".
    Perl matches the word boundary, then, in the first
    alternative, as many word characters as possible.
    Then it looks for an "a", which it will not find,
    because the word is over. It backs up one character
    at a time, checking for a match each time. If none
    is found, it backs up to the \b and tries the 2nd
    alternative.

while (<>) { push @words, m/\b(\w*a|\w*b)\b/g }
26
Avoid unnecessary backtracking
  • Improvement: getting rid of the alternation
  • BUT still backtracking

while (<>) { push @words, m/\b(\w*[ab])\b/g }
27
Avoid unnecessary backtracking
  • Ask yourself: how would I search in real life?
  • More than likely you would look only at the end
    of each word.
  • We don't really care about the first part of the
    word.
  • No backtracking, because we simply match a single
    character.

while (<>) { push @words, m/[ab]\b/g }
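
It helps to see what each of the three patterns actually returns. A minimal sketch that runs them on the sample line from the benchmark slide; the variable names are illustrative:

```perl
use strict;
use warnings;

my $line = "web nasa England walla step car output";

my @whole_words_alt   = $line =~ m/\b(\w*a|\w*b)\b/g;   # alternation
my @whole_words_class = $line =~ m/\b(\w*[ab])\b/g;     # character class
my @last_letters      = $line =~ m/[ab]\b/g;            # final letter only

print "@whole_words_alt\n";      # prints "web nasa walla"
print "@whole_words_class\n";    # prints "web nasa walla"
print "@last_letters\n";         # prints "b a a"
```

The last pattern finds the same three words but returns only their final letters, so it suits detecting or counting such words rather than collecting them.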
28
Benchmark your regex!
  • Let's use Benchmark

use Benchmark;
timethese(20000, {
    mem  => q{ $_ = "web nasa England walla step car output";
               push @words, m/[ab]\b/g },
    mem2 => q{ $_ = "web nasa England walla step car output";
               push @words, m/\b(\w*a|\w*b)\b/g },
});
29
Benchmark Your Regex!
  • Output (regex4.pl)

Benchmark: timing 20000 iterations of mem, mem2...
       mem:  1 wallclock secs ( 0.58 usr +  0.00 sys =  0.58 CPU) @ 34423.41/s (n=20000)
      mem2:  2 wallclock secs ( 0.92 usr +  0.03 sys =  0.95 CPU) @ 21030.49/s (n=20000)
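
The Benchmark module can also print a side-by-side comparison. A sketch, assuming code references instead of the quoted strings used in regex4.pl, an arbitrary iteration count, and illustrative labels; timings will differ from machine to machine:

```perl
use strict;
use warnings;
use Benchmark qw(timethese cmpthese);

my $line = "web nasa England walla step car output";

# The iteration count (100_000) is an arbitrary choice for a quick run.
# The 'none' style keeps timethese quiet; cmpthese prints the comparison.
my $results = timethese(100_000, {
    class_only  => sub { my @found = $line =~ m/[ab]\b/g },
    alternation => sub { my @found = $line =~ m/\b(\w*a|\w*b)\b/g },
}, 'none');

cmpthese($results);
```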
30
Summary
  • Use common sense
  • Use these techniques
  • Always ask yourself: how would I search for
    this by scanning the text with my own eyes?

31
Where to Get More Information
  • Mastering Regular Expressions by Jeffrey Friedl
  • Effective Perl Programming by Joseph N. Hall
    with Randal L. Schwartz