Title: Making your regular expressions efficient
1. Making your regular expressions efficient
2. Introduction
- Executing a regex takes Perl time, even with its internal optimizations.
- Performance is not always an issue, but it doesn't hurt to keep efficiency in mind.
3. Agenda
- Compile only once (/o, qr//)
- Regex objects
- Match variables
- study()
- Alternation
- Backtracking
- Memory-free (non-capturing) parentheses
- Benchmarking
4. Compile only once with /o or qr//
- A regular expression with no interpolated variables is compiled into Perl's internal form only once, at compile time.
    foreach (@mylist) { $count++ if /perl/ }
5. Compile only once with /o or qr//
- When a pattern contains interpolated variables, the interpolation is redone every time it's used, and the pattern is recompiled if the result differs.
- The reason: $word might have changed.
    chomp($word = <STDIN>);
    foreach (@mylist) { $count++ if /$word/ }
6. Compile only once with /o or qr//
- /o says: compile it only the first time! Even if $word has changed, don't compile it again!
    chomp($word = <STDIN>);
    foreach (@mylist) { $count++ if /$word/o }
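7. Compile only once with /o or qr//
- The gotcha
- If the day changes while the program runs, it won't work any more: the pattern compiled on the first call keeps matching the old day.
    sub MySub {
        my $today = (qw<sun mon tue wed thu fri sat>)[(localtime)[6]];
        while (<IN>) {
            if (m/$today/io) { ... }
        }
    }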
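The gotcha can be reproduced in a few lines. This is a minimal sketch rather than code from the talk: `count_with_o` and the sample list are hypothetical, and the sub is simply called twice with different words.

```perl
use strict;
use warnings;

# Hypothetical helper: counts the items that match $word.
# The /o flag freezes the pattern the first time this match runs.
sub count_with_o {
    my ($word, @items) = @_;
    my $count = 0;
    for (@items) {
        $count++ if /$word/o;
    }
    return $count;
}

my @list   = qw(perl python perl ruby);
my $first  = count_with_o('perl', @list);   # pattern compiled as /perl/
my $second = count_with_o('ruby', @list);   # still uses /perl/, not /ruby/

print "first=$first second=$second\n";      # prints "first=2 second=2"
```

The second call should have counted one item, so /o is only safe when the interpolated value never changes during the run.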
8. Compile only once with /o or qr//: Regex object
- The fix
- Efficient: each call compiles only once
- Correct
    sub MySub {
        my $today = (qw<sun mon tue wed thu fri sat>)[(localtime)[6]];
        my $RegExObj = qr/$today/i;    # compiles once per call
        while (<IN>) {
            if ($_ =~ $RegExObj) { ... }
        }
    }
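A small sketch of the qr// approach, under the same assumptions as any toy example: `count_with_qr` and the word list are hypothetical names chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: counts the items that match $word, ignoring case.
sub count_with_qr {
    my ($word, @items) = @_;
    my $re = qr/$word/i;        # compiled once per call, not once per item
    my $count = 0;
    for (@items) {
        $count++ if $_ =~ $re;
    }
    return $count;
}

my @list   = qw(perl python Perl ruby);
my $first  = count_with_qr('perl', @list);
my $second = count_with_qr('ruby', @list);

print "first=$first second=$second\n";   # prints "first=2 second=1"
```

Each call builds a fresh regex object, so a changed word is picked up on the next call while the loop itself reuses the compiled pattern.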
9. Match variables
    $text = "Jerusalem Perl Mongers";
    if ($text =~ /Perl/) {
        print $`;    # "Jerusalem "
        print $&;    # "Perl"
        print $';    # " Mongers"
    }
10. Match variables
- When match variables are used anywhere in the program (even once), perl has to keep track of these values for every single pattern match in the program.
    $text = "Jerusalem Perl Mongers";
    print $& if $text =~ /Perl/;
    while (<>) { push @mylist, $_ if /hello/ }
11. Match variables
- Solution: do not use $`, $&, $'
- Instead of $&:
    $text = "Jerusalem Perl Mongers";
    $text =~ s/Perl/\U$&/g;
- Use $1:
    $text = "Jerusalem Perl Mongers";
    $text =~ s/(Perl)/\U$1/g;
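As an illustration of the capture-based style, the sketch below gets the matched text and the text around it using only parentheses. The variable names and the second pattern are assumptions made for the example.

```perl
use strict;
use warnings;

my $text = "Jerusalem Perl Mongers";

# Capture only what is needed and refer to it as $1 instead of $&.
(my $upper = $text) =~ s/(Perl)/\U$1/g;
print "$upper\n";                 # prints "Jerusalem PERL Mongers"

# The text before and after the match can be captured the same way,
# which replaces $` and $'.
my ($before, $after) = $text =~ /^(.*?)Perl(.*)$/s;
print "[$before][$after]\n";      # prints "[Jerusalem ][ Mongers]"
```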
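12. Match variables
- @- and @+ hold the starting and ending offsets of the last successful match.
- $-[0] and $+[0] are the offsets of the whole match.
- $-[1] and $+[1] are the offsets of $1, $-[2] and $+[2] those of $2, and so on for each captured group.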
13. Match variables
    $text = "Jerusalem Perl Mongers";
    $text =~ m/Perl/;
    print substr($text, 0, $-[0]) . "\n";               # same as $`
    print substr($text, $-[0], $+[0] - $-[0]) . "\n";   # same as $&
    print substr($text, $+[0]) . "\n";                  # same as $'
see regex3.pl
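The offset arithmetic can be wrapped in a small function. This is a sketch only: `match_parts` is a hypothetical helper, not part of regex3.pl.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (prematch, match, postmatch) without
# touching $`, $& or $'.
sub match_parts {
    my ($text, $re) = @_;
    return unless $text =~ $re;
    return (
        substr($text, 0, $-[0]),
        substr($text, $-[0], $+[0] - $-[0]),
        substr($text, $+[0]),
    );
}

my ($pre, $match, $post) = match_parts("Jerusalem Perl Mongers", qr/Perl/);
print "[$pre][$match][$post]\n";   # prints "[Jerusalem ][Perl][ Mongers]"

# Offsets of a captured group: $-[1] and $+[1] delimit $1.
my @group1;
if ("Jerusalem Perl Mongers" =~ /(Perl) Mon/) {
    @group1 = ($-[1], $+[1]);
}
print "@group1\n";                 # prints "10 14"
```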
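14. study()
- Optimizes some kinds of searches of a string.
- After studying a string, a regex can benefit from the cached knowledge about the string.
- Note: perlfunc documents study() as doing nothing from perl 5.16 on, so the gains described here apply to older perls.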
15. study()
    $longtext = "Jerusalem Perl Mongers";
    study($longtext);
    if ($longtext =~ m/\sC\s/) { ... }
    if ($longtext =~ m/Perl/)  { ... }
    if ($longtext =~ m/PHP/)   { ... }
    if ($longtext =~ m/COBOL/) { ... }
16. study()
- Modifying the string invalidates the study data.
    $text = "Jerusalem Perl Mongers";
    study($text);
    if ($text =~ m/\sC\s/) { ... }
    if ($text =~ m/Perl/)  { ... }
    $text = "Yet another text";    # the studied data is no longer valid
    if ($text =~ m/PHP/)   { ... }
    if ($text =~ m/COBOL/) { ... }
17. study() - when to use
- When matching literal text many times against the same string.
- When the string is long.
18. study() - when not to use
- When matching with /i
- When the string is short
- When you plan just a few matches before the string is modified
- Always benchmark and see if it is worth it!
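One way to follow the "always benchmark" advice is sketched below. The sample text, the iteration count and the `count_hits` helper are assumptions for illustration. On a perl where study() does nothing, the two rows should come out roughly equal.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Assumed sample data: a long string searched for several literals.
my $longtext = "Jerusalem Perl Mongers " x 5000;
my $studied  = $longtext;      # separate copy, so only this one is studied
study($studied);

# Hypothetical helper: runs the four searches against the referenced
# string and returns the number of patterns that matched.
sub count_hits {
    my ($ref) = @_;
    my $hits = 0;
    $hits++ if $$ref =~ m/\sC\s/;
    $hits++ if $$ref =~ m/Perl/;
    $hits++ if $$ref =~ m/PHP/;
    $hits++ if $$ref =~ m/COBOL/;
    return $hits;
}

my $plain_hits   = count_hits(\$longtext);
my $studied_hits = count_hits(\$studied);

# Time both variants; the figures depend on the machine and perl version.
timethese(500, {
    plain   => sub { count_hits(\$longtext) },
    studied => sub { count_hits(\$studied) },
});
```

Passing a reference matters here: copying the string into the helper would search an unstudied copy.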
19. Avoid unnecessary alternation
- Alternation is generally slow.
- Each time an alternative fails to match, the
regex engine has to backtrack in the string and
try the next alternative.
20. Avoid unnecessary alternation
- First it tries to match Motke. If that fails, it backs up to where it started and tries Diana. If that fails too, it backs up again for the next alternative, and repeats this at each position in the string until it matches or fails.
    while (<>) { print if /(Motke|Diana|Zachary|Issac)/ }
21. Avoid unnecessary alternation
- Don't ever use alternation (a|b|c) instead of a character class [abc].
- Incorrect:
    while (<>) { push @digits, m/(0|1|2|3|4|5|6|7|8|9)/g }
- Correct:
    while (<>) { push @digits, m/[0-9]/g }
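A quick check that the two forms extract the same digits, so the character class can replace the alternation safely. The sample line is made up for the example.

```perl
use strict;
use warnings;

my $line = "Room 42, floor 7, ext. 0915";

my @by_alternation = $line =~ m/(0|1|2|3|4|5|6|7|8|9)/g;
my @by_class       = $line =~ m/[0-9]/g;

print "@by_alternation\n";   # prints "4 2 7 0 9 1 5"
print "@by_class\n";         # prints "4 2 7 0 9 1 5"
```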
22. Non-capturing Parentheses (?:)
- Example: matching a number. To make \.[0-9]+ optional we need ().
- The problem: we captured unnecessary data into $1. We don't use it.
    if ($number =~ m/-?[0-9]+(\.[0-9]+)?/)
23. Non-capturing Parentheses (?:)
- The solution
- No $1 in this case
- More efficient
- Less confusing later on
    if ($number =~ m/-?[0-9]+(?:\.[0-9]+)?/)
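The difference shows up in what the match returns. In this sketch the sample number and the extra `(-)?` group in the last pattern are assumptions added to show why the non-capturing form is less confusing later on.

```perl
use strict;
use warnings;

my $number = "-3.14";

# Capturing group: the fraction lands in $1 although nobody asked for it.
my @captured = $number =~ m/-?[0-9]+(\.[0-9]+)?/;
print scalar(@captured), " capture: $captured[0]\n";   # prints "1 capture: .14"

# Non-capturing group: same match, nothing stored.
my $matched = ($number =~ m/-?[0-9]+(?:\.[0-9]+)?/) ? "yes" : "no";

# A capture added later is still $1, because (?:...) is not counted.
my ($sign) = $number =~ m/(-)?[0-9]+(?:\.[0-9]+)?/;
print "matched=$matched sign=$sign\n";                 # prints "matched=yes sign=-"
```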
24. Non-capturing Parentheses (?:)
- Example: this|that
    if ($text =~ m/(this|that)/)
- Seems bad; use this instead:
    if ($text =~ m/th(?:is|at)/)
- Much faster, for 2 reasons:
  - Non-capturing
  - When the first 2 chars don't match, it doesn't try the alternation.
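Factoring out the common prefix must not change what matches. The sketch below checks both patterns against a few made-up strings.

```perl
use strict;
use warnings;

my @samples = ("this one", "that one", "those", "neither");

my @results;
for my $text (@samples) {
    my $capturing = ($text =~ m/(this|that)/)  ? 1 : 0;
    my $factored  = ($text =~ m/th(?:is|at)/) ? 1 : 0;
    push @results, "$capturing$factored";
    print "$text: capturing=$capturing factored=$factored\n";
}
```

Both patterns accept the first two strings and reject the last two, so only the cost of matching differs.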
25. Avoid unnecessary backtracking
- Let's find words ending with either a or b.
- Perl matches the word boundary, then the first alternative grabs as many word characters as possible. Then it looks for an a, which it will not find, because the word is over. It backs up one character at a time, checking for a match each time. If none is found, it goes back to the \b and tries the 2nd alternative.
    while (<>) { push @words, m/\b(\w*a|\w*b)\b/g }
26. Avoid unnecessary backtracking
- Improvement: getting rid of the alternation
- BUT still backtracking
    while (<>) { push @words, m/\b(\w*[ab])\b/g }
27. Avoid unnecessary backtracking
- Ask yourself: how would I search in real life?
- More than likely you would look only at the end of each word.
- We don't really care about the first part of the word.
- No backtracking, because we simply match a single character.
    while (<>) { push @words, m/[ab]\b/g }
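Running the three patterns side by side on one sample line shows what each returns. The variable names are chosen for the example.

```perl
use strict;
use warnings;

my $line = "web nasa England walla step car output";

my @alternation = $line =~ m/\b(\w*a|\w*b)\b/g;   # two alternatives
my @class       = $line =~ m/\b(\w*[ab])\b/g;     # one character class
my @last_char   = $line =~ m/[ab]\b/g;            # only the final character

print "@alternation\n";   # prints "web nasa walla"
print "@class\n";         # prints "web nasa walla"
print "@last_char\n";     # prints "b a a"
```

The last pattern finds the same three words but returns only their final letter, so it fits when detecting or counting such words is enough.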
28. Benchmark your regex!
    use Benchmark;
    timethese(20000, {
        mem  => q{ $_ = "web nasa England walla step car output";
                   push @words, m/[ab]\b/g },
        mem2 => q{ $_ = "web nasa England walla step car output";
                   push @words, m/\b(\w*a|\w*b)\b/g },
    });
29. Benchmark your regex!
    Benchmark: timing 20000 iterations of mem, mem2...
           mem:  1 wallclock secs ( 0.58 usr +  0.00 sys =  0.58 CPU) @ 34423.41/s (n=20000)
          mem2:  2 wallclock secs ( 0.92 usr +  0.03 sys =  0.95 CPU) @ 21030.49/s (n=20000)
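The same comparison can be written with code references and `cmpthese`, which prints a relative comparison table. This is a sketch under assumptions: the wrapper subs and the iteration count are made up for the example, and the figures will differ by machine and perl version.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $line = "web nasa England walla step car output";

# Hypothetical wrappers around the two patterns being compared.
# Each returns the number of matches so the result can be checked.
sub last_char_only {
    my @found = $line =~ m/[ab]\b/g;
    return scalar @found;
}

sub with_backtrack {
    my @found = $line =~ m/\b(\w*a|\w*b)\b/g;
    return scalar @found;
}

# Run each sub 200_000 times and print the comparison table.
cmpthese(200_000, {
    mem  => \&last_char_only,
    mem2 => \&with_backtrack,
});
```

Using a lexical array inside each sub keeps the result list from growing across iterations.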
30. Summary
- Use common sense
- Use these techniques
- Always ask yourself how you would search for this if you were scanning the text with your own eyes.
31. Where to Get More Information
- Mastering Regular Expressions by Jeffrey E. F. Friedl
- Effective Perl Programming by Joseph N. Hall with Randal L. Schwartz