Regular Expressions - PowerPoint PPT Presentation

Regular Expressions

Provided by: chris520

Transcript and Presenter's Notes

1
Regular Expressions
  • You understand this if you can:
  • Identify uses of recursion, particularly
    backtracking
  • Explain the correspondence regular expression ↔
    finite state machine
  • Compare linear vs. exponential algorithms
  • http://www.codeguru.com/cpp/cpp/cpp_mfc/parsing/article.php/c4093/Three
  • http://perl.plover.com/Regex/article.html
  • http://swtch.com/~rsc/regexp/regexp1.html

2
Recursion Reminder
  • A routine that may call itself
  • Parameters to recursive call are smaller
  • Must be at least one non-recursive branch
  • Example
  • QuickSort(anArray, lower, upper)
  •   if (lower != upper)
  •     mid = partition(anArray, lower, upper)
  •     QuickSort(anArray, lower, mid)
  •     QuickSort(anArray, mid + 1, upper)
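
The outline above can be made runnable as follows. This is a minimal sketch, not the presenter's code: the slide does not say how partition works, so a Hoare-style partition around the middle element is assumed here, and the guard is written as lower < upper so that an empty range is also handled.

```python
def partition(an_array, lower, upper):
    """Hoare-style partition around the middle element (an assumed choice).

    Returns mid with lower <= mid < upper such that every element of
    an_array[lower..mid] is <= every element of an_array[mid+1..upper].
    """
    pivot = an_array[(lower + upper) // 2]
    i, j = lower - 1, upper + 1
    while True:
        i += 1
        while an_array[i] < pivot:
            i += 1
        j -= 1
        while an_array[j] > pivot:
            j -= 1
        if i >= j:
            return j
        an_array[i], an_array[j] = an_array[j], an_array[i]


def quick_sort(an_array, lower, upper):
    # Non-recursive branch: a range of size 0 or 1 is already sorted.
    if lower < upper:
        mid = partition(an_array, lower, upper)
        # Both recursive calls receive a strictly smaller range.
        quick_sort(an_array, lower, mid)
        quick_sort(an_array, mid + 1, upper)


data = [5, 2, 9, 1, 5, 6]
quick_sort(data, 0, len(data) - 1)
print(data)
```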

3
Using Recursion
  • Divide and Conquer
  • evaluate(ArithExp)
  •   if (ArithExp is expA + expB)
  •     return evaluate(expA) + evaluate(expB)
  •   if (ArithExp is expA - expB)
  •     return evaluate(expA) - evaluate(expB)
  •   if (ArithExp is a number)
  •     return number
  • Parsing
  • Identifying whether input matches a grammar
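
A small sketch of the divide-and-conquer evaluator. The slide does not define how an arithmetic expression is stored, so a nested tuple of the form (operator, expA, expB) is assumed here purely for illustration.

```python
def evaluate(arith_exp):
    """Evaluate a nested tuple such as ("+", 1, ("-", 5, 2))."""
    # Non-recursive branch: a plain number evaluates to itself.
    if isinstance(arith_exp, (int, float)):
        return arith_exp
    op, exp_a, exp_b = arith_exp
    # Divide: evaluate each half separately. Conquer: combine the results.
    if op == "+":
        return evaluate(exp_a) + evaluate(exp_b)
    if op == "-":
        return evaluate(exp_a) - evaluate(exp_b)
    raise ValueError(f"unknown operator: {op!r}")


print(evaluate(("+", 1, ("-", 5, 2))))  # 1 + (5 - 2)
```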

4
Using Recursion
  • Backtracking
  • findSolution(position)
  •   numToTry = getCountOfPositions(position)
  •   while (numToTry > 0)
  •     nextPosition = getNext(position, numToTry--)
  •     solution = findSolution(nextPosition)
  •     if (solution != null)
  •       return makeSolution(position, solution)
  •   return null

Essentially: try to find a route to the solution
from where you are. If that fails, go back to
where you came from and try an alternative.
This is a rough outline to illustrate the idea.
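
One way to turn the rough outline into something runnable is a route search in a small graph. Everything concrete here is an assumption made for illustration: the neighbours dictionary, the goal test, and the visited set that guards against loops.

```python
def find_solution(position, goal, neighbours, visited=None):
    """Return a list of positions leading from position to goal, or None."""
    if visited is None:
        visited = set()
    if position == goal:
        return [position]
    visited.add(position)  # never re-enter a position: guards against loops
    for next_position in neighbours.get(position, []):
        if next_position in visited:
            continue
        solution = find_solution(next_position, goal, neighbours, visited)
        if solution is not None:
            return [position] + solution  # plays the role of makeSolution
    # Every alternative from here failed: the caller tries its next option.
    return None


maze = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": [], "E": ["F"], "F": []}
print(find_solution("A", "F", maze))
```

The search first follows A, B, D, hits a dead end, goes back to A and then succeeds through C and E.
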
5
Backtracking
  • Exhaustive Search
  • Need to watch for loops
  • Even without loops can visit same state twice
  • Can improve by detecting visits to same state
  • Can be expensive
  • E.g.
  • 2 branches at each node and n levels
  • 2^n states to explore: exponential complexity
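
The effect of detecting visits to the same state can be sketched by counting routes on a grid, where each position has two branches and many routes reach the same position. The grid problem and the visit counters are assumptions chosen for illustration.

```python
def routes(r, c, memo, stats):
    """Count routes from (r, c) to (0, 0), moving one step up or left."""
    if memo is not None and (r, c) in memo:
        return memo[(r, c)]  # state already explored: reuse its answer
    stats["visits"] += 1
    if r == 0 or c == 0:
        result = 1  # only one straight route remains along an edge
    else:
        result = routes(r - 1, c, memo, stats) + routes(r, c - 1, memo, stats)
    if memo is not None:
        memo[(r, c)] = result
    return result


n = 8
plain, cached = {"visits": 0}, {"visits": 0}
total_plain = routes(n, n, None, plain)   # exhaustive search, no memory
total_cached = routes(n, n, {}, cached)   # remembers every visited state
print(total_plain, total_cached)
print(plain["visits"], cached["visits"])
```

Both calls return the same count, but the plain search visits far more states because it re-explores every position each time a new route reaches it.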

6
Regular Expressions
  • A pattern used to recognise a set of strings
  • Constructing Regular Expressions
  • Alternation (|)
  • cat|dog matches cat or dog
  • Grouping ()
  • Controls the scope and precedence of the
    operators.
  • pe(t|d)al matches petal or pedal
  • Quantification: repetition of the preceding
    expression
  • ? - 0 or 1
  • "colou?r" matches both color and colour.
  • * - 0 or more
  • \s*Regular Expressions\s* // \s is whitespace
    (tabs, spaces)
  • + - 1 or more
  • Regular\s+Expression
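
The operators on this slide can be tried out with Python's re module. This is only one regular expression dialect among many, and the test strings are made up for illustration.

```python
import re

checks = [
    (r"cat|dog", "dog", True),                      # alternation
    (r"pe(t|d)al", "pedal", True),                  # grouping
    (r"colou?r", "color", True),                    # ? : 0 or 1
    (r"colou?r", "colour", True),
    (r"\s*Regular Expressions\s*", "  Regular Expressions\t", True),  # * : 0 or more
    (r"Regular\s+Expression", "Regular   Expression", True),          # + : 1 or more
    (r"Regular\s+Expression", "RegularExpression", False),
]

for pattern, text, expected in checks:
    # fullmatch requires the whole string to match the pattern.
    matched = re.fullmatch(pattern, text) is not None
    print(f"{pattern!r} against {text!r}: {matched}")
    assert matched == expected
```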

7
Additional Features 1
  • Character Sets
  • [1234567890]
  • Any character
  • .
  • Not
  • [^1234567890]
  • Symbols
  • \s - whitespace
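
A short sketch of these features in Python's re module, with made-up input strings. One caveat of this dialect: by default . matches any character except a newline.

```python
import re

digits = re.compile(r"[1234567890]")       # character set, same as [0-9]
non_digits = re.compile(r"[^1234567890]")  # negated set: anything but a digit

digit_chars = digits.findall("Room 42b")
other_chars = non_digits.findall("42b!")
any_char = re.findall(r"a.c", "abc a c a-c")    # . stands for one character
words = re.split(r"\s", "tabs\tand spaces")     # \s is a whitespace character

print(digit_chars, other_chars, any_char, words)
```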

8
Additional Features 2
  • Recorded sub-expressions
  • Stored match can be used later in a pattern
  • H(.+)\sD\1
  • Matches Humpty Dumpty, Hoo Doo
  • This is a very powerful feature
  • Makes regular expressions more expressive
  • Can match additional patterns that the original
    notation can't
  • Other additional features are just notation
  • Can be implemented using original features
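
A recorded sub-expression can be tried in Python's re module, where \1 refers back to whatever the first group matched. The pattern is written here as H(.+)\sD\1, and the third test string is an added counter-example.

```python
import re

pattern = re.compile(r"H(.+)\sD\1")

results = {}
for text in ["Humpty Dumpty", "Hoo Doo", "Humpty Dumpling"]:
    # \1 must repeat exactly the text captured by (.+)
    results[text] = pattern.fullmatch(text) is not None
    print(text, results[text])
```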

9
Uses of Regular Expressions
  • Validation
  • Common on Web Sites
  • Splitting text input
  • Clever search-and-replace editing
  • Lexical Analysis
  • Efficient way of converting source into tokens
  • Numbers, String literals
  • Operators and symbols (e.g. >)
  • Keywords: if, switch
  • Identifiers
  • Faster than recursive parsing
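
Lexical analysis with regular expressions might look like the sketch below. The token names, the exact patterns and the sample source line are all assumptions made for illustration; a real language would need a fuller specification.

```python
import re

# Order matters: keywords are tried before identifiers.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("STRING", r'"[^"\n]*"'),
    ("KEYWORD", r"\b(?:if|switch)\b"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("OPERATOR", r"[-+*/<>=!]=?|[(){};]"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{regex})" for name, regex in TOKEN_SPEC))


def tokenize(source):
    """Convert source text into a list of (kind, text) tokens."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        if m.lastgroup != "SKIP":  # whitespace separates tokens but is dropped
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


print(tokenize('if (count > 10) name = "ten";'))
```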

10
Deterministic FSM
  • FSM Recogniser

[Figure: deterministic FSM recogniser with a start state, a success state and a "No Match" state; transitions are labelled a, b, c and d; the example inputs are ab and abcd.]
11
Non-deterministic FSM
  • Several transitions with the same label from a state
  • Empty transitions are allowed
  • Activate a new state with no input

[Figure: several state machines for a(bc)?, with transitions labelled a, b and c; the input abc produces a match.]
Which is the NFSM? Do they all match the same
strings?
12
RegEx, NFSM and DFSM
  • NFSM can represent Regular Expressions
  • Except for recorded sub-expressions
  • Any NFSM can be converted into a DFSM
  • An NFSM has a set of active states
  • Essentially
  • Create DFSM with extra states representing these
    sets of active states
  • Normal RegEx, DFSM and NFSM
  • Equivalent expressive power
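
The conversion can be sketched with the usual subset construction, in which each DFSM state is a set of NFSM states. The dictionary encoding, the function names and the hand-built NFSM for a(b|c)* are assumptions made for illustration; the empty string "" labels an empty transition.

```python
def epsilon_closure(nfa, states):
    """All states reachable from states using only empty transitions."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in nfa[s].get("", ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def nfa_to_dfa(nfa, start, accepting, alphabet):
    start_set = epsilon_closure(nfa, {start})
    dfa, todo = {}, [start_set]
    while todo:
        current = todo.pop()
        if current in dfa:
            continue
        dfa[current] = {}
        for symbol in alphabet:
            moved = {t for s in current for t in nfa[s].get(symbol, ())}
            if moved:
                target = epsilon_closure(nfa, moved)
                dfa[current][symbol] = target
                todo.append(target)
    # A DFSM state succeeds if it contains any succeeding NFSM state.
    return dfa, start_set, {s for s in dfa if s & accepting}


def dfa_accepts(dfa, start, accepting, text):
    state = start
    for ch in text:
        state = dfa[state].get(ch)
        if state is None:
            return False
    return state in accepting


# Hand-built NFSM for a(b|c)*: state 0 is the start, state 4 is success.
NFA = {0: {"a": {1}}, 1: {"": {2, 4}}, 2: {"b": {3}, "c": {3}},
       3: {"": {2, 4}}, 4: {}}
dfa, dfa_start, dfa_accepting = nfa_to_dfa(NFA, 0, {4}, "abc")
print(len(dfa), dfa_accepts(dfa, dfa_start, dfa_accepting, "abcc"))
```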

13
Matching Regular Expressions
  • Backtracking (supports recorded sub-exp)
  • Match(regExp)
  •   if (regExp is ExpA ExpB) return Match(ExpA) and
      Match(ExpB)
  •   pos = getInputPosition()
  •   if (regExp is ExpA | ExpB)
  •     if Match(ExpA) return true
  •     else
  •       resetInputPosition(pos) // Backtracking
  •       return Match(ExpB)
  •   if (regExp is ExpA*)
  •     while (Match(ExpA))
  •       pos = getInputPosition()
  •     resetInputPosition(pos) // Backtracking past
        failed match
  •     return true
  •   return false
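
A runnable variant of the backtracking matcher is sketched below. It is not the slide's exact algorithm: the pattern is assumed to be a nested tuple, the input position is passed as a parameter instead of being reset, and the rest of the pattern is passed as a continuation k so that a later failure can reopen an earlier choice.

```python
def match(reg_exp, text, pos, k):
    """Try to match reg_exp at text[pos:]; on success call k(new_pos).

    Returning False sends the caller on to its next alternative,
    which is the backtracking step.
    """
    kind = reg_exp[0]
    if kind == "char":
        return pos < len(text) and text[pos] == reg_exp[1] and k(pos + 1)
    if kind == "cat":   # ExpA ExpB
        return match(reg_exp[1], text, pos,
                     lambda p: match(reg_exp[2], text, p, k))
    if kind == "alt":   # ExpA | ExpB
        return match(reg_exp[1], text, pos, k) or match(reg_exp[2], text, pos, k)
    if kind == "star":  # ExpA*
        # Greedy: try one more ExpA (it must consume input), else stop here.
        return match(reg_exp[1], text, pos,
                     lambda p: p > pos and match(reg_exp, text, p, k)) or k(pos)
    raise ValueError(f"unknown node: {kind!r}")


def full_match(reg_exp, text):
    return bool(match(reg_exp, text, 0, lambda p: p == len(text)))


# a(b|c)*
A_BC_STAR = ("cat", ("char", "a"), ("star", ("alt", ("char", "b"), ("char", "c"))))
# (a|ab)c needs backtracking: the first alternative matches but then fails.
A_OR_AB_C = ("cat", ("alt", ("char", "a"), ("cat", ("char", "a"), ("char", "b"))),
             ("char", "c"))

print(full_match(A_BC_STAR, "abcc"), full_match(A_OR_AB_C, "abc"))
```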

14
Matching RegEx with FSM 1
  • Compile RegEx to NFSM
  • Match input against the NFSM
  • Keep track of multiple active states
  • What is the maximum size of the list?

[Figure: NFSM with states numbered 1 to 5 and transitions labelled a, b and c; example input abc.]
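
Tracking multiple active states can be sketched as follows. The machine is a hand-built NFSM for a(b|c)*, which is an assumption and not necessarily the machine in the slide's figure; the empty string "" labels an empty transition.

```python
def closure(nfa, states):
    """Add every state reachable through empty transitions."""
    stack, result = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in nfa[s].get("", ()):
            if t not in result:
                result.add(t)
                stack.append(t)
    return result


def nfa_match(nfa, start, accepting, text):
    """Return (matched, largest number of simultaneously active states)."""
    active = closure(nfa, {start})
    largest = len(active)
    for ch in text:
        # Advance every active state at once: no backtracking is needed.
        moved = {t for s in active for t in nfa[s].get(ch, ())}
        active = closure(nfa, moved)
        largest = max(largest, len(active))
    return bool(active & accepting), largest


# State 0 is the start, state 4 is success.
NFA = {0: {"a": {1}}, 1: {"": {2, 4}}, 2: {"b": {3}, "c": {3}},
       3: {"": {2, 4}}, 4: {}}
print(nfa_match(NFA, 0, {4}, "abc"))
```

Because the active list is a set of states, it can never hold more entries than the machine has states.
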
15
Matching RegEx with FSM 2
  • Compile Regular Expression to DFSM
  • Convert RegEx to NFSM
  • Convert NFSM to DFSM
  • Match input against the DFSM
  • Only one active state which moves with input

[Figure: DFSM with states numbered 1 to 4 and transitions labelled a, b and c.]
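
Matching against a DFSM needs only a transition table and a single current state. The two-state table below is a hand-written DFSM for a(b|c)*, assumed for illustration; it is not the four-state machine from the figure.

```python
# state -> {input character -> next state}
DFA = {
    1: {"a": 2},
    2: {"b": 2, "c": 2},
}
ACCEPTING = {2}


def dfa_match(text, start=1):
    state = start
    for ch in text:
        # Exactly one active state, which moves with each input character.
        state = DFA.get(state, {}).get(ch)
        if state is None:
            return False  # no transition for this character: no match
    return state in ACCEPTING


print([dfa_match(s) for s in ["a", "ab", "abcc", "b", ""]])
```
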
16
Converting Regex to NFSM
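[Figure: conversion of regular expression constructs to NFSM fragments, showing a transition labelled a and an empty transition.]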
17
Example
  • a(b|c)*
  • a, ab, ac, abcc, ...

[Figure: NFSM for a(b|c)* built up from the fragments a, b, c, b|c and (b|c)*.]
18
Efficiency: Backtracking
  • a?^n a^n
  • Shorthand for a?a?a?a? ... aaaa ... (n copies of
    a? followed by n copies of a)
  • Input a against a?a
  • Try to match a against a?: succeeds, but the
    match fails overall
  • Try a? matching nothing, then match a against a:
    succeeds
  • Input aaa against a?a?a?aaa
  • It can take 2^3 combinations of a?a?a? before
    the match is found
  • 111, 110, 101, ...
  • Worst-case efficiency 2^n (exponential, so slow)
  • where n = input size

Exponential complexity
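
The blow-up can be observed by counting the calls made by a naive backtracking matcher. The matcher below is a deliberately simplified sketch that only understands the two pattern elements a? and a; it is an illustration, not a general regular expression engine.

```python
def count_attempts(n):
    """Count calls made while matching a?^n a^n against the input a^n."""
    text = "a" * n
    pattern = ["a?"] * n + ["a"] * n
    calls = 0

    def match(pi, ti):
        nonlocal calls
        calls += 1
        if pi == len(pattern):
            return ti == len(text)
        if pattern[pi] == "a?":
            # Greedy: first let a? consume an a, then backtrack and skip it.
            if ti < len(text) and match(pi + 1, ti + 1):
                return True
            return match(pi + 1, ti)
        return ti < len(text) and match(pi + 1, ti + 1)

    assert match(0, 0)  # the pattern always matches in the end
    return calls


for n in (4, 8, 12, 16):
    print(n, count_attempts(n))
```

The match only succeeds once every a? has been tried both ways, so the number of calls grows at least as fast as 2^n.
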
19
Efficiency: NFSM
  • In the worst case,
  • m = maximum number of active states
  • n = input size
  • Efficiency is O(nm)
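
The O(nm) bound can be checked on the same awkward pattern, a?^n a^n against a^n, by simulating its NFSM and adding up the active states per input character. The state numbering and the work counter are assumptions made for this sketch.

```python
def nfa_steps(n):
    """Simulate the NFSM for a?^n a^n on the input a^n; return (matched, work).

    States are numbered 0..2n. A state i < n models one a?: it moves to
    i + 1 on 'a' or on an empty transition. A state n <= i < 2n moves to
    i + 1 on 'a' only. State 2n is the success state.
    """
    def closure(states):
        result = set()
        for s in states:
            result.add(s)
            if s < n:
                # Follow the chain of empty transitions up to state n.
                result.update(range(s + 1, n + 1))
        return result

    active = closure({0})
    work = 0
    for _ in range(n):  # the input is 'a' repeated n times
        work += len(active)  # one transition lookup per active state
        active = closure({s + 1 for s in active if s < 2 * n})
    return (2 * n) in active, work


for n in (4, 8, 16, 25):
    print(n, nfa_steps(n))
```

Here the work grows roughly with the square of n, since both the input size and the number of active states grow with n, instead of doubling with each extra character.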

20
Experimental results
  • http://swtch.com/~rsc/regexp/regexp1.html

Notice the difference in speed for input of size
25 characters. Will this affect the user
experience?
Which use backtracking and which use NFSM
approach?
Why would backtracking approach be used?
Does this mean that complexity analysis is a
waste of time?
Note the logarithmic time scale.
[Figure axes: regular expression and text size n; a?^n a^n matching a^n.]
21
Summary
  • Regular Expressions are useful
  • Regular Expressions ↔ Finite State Machines
  • Non-Deterministic ↔ Deterministic FSM
  • In expressive power
  • Recognition Algorithms
  • Backtracking: recursive, O(2^n) worst case
  • FSM: O(mn) = O(size of pattern × size of input)
  • Theoretically more efficient (but in practice?)
  • However, can't handle recorded sub-expressions

Similarly, Quicksort has an n^2 worst-case
complexity but on average is n log(n), which is fast