Title: Regular Expressions
1Regular Expressions
- You understand this if you can:
- Identify uses of recursion, particularly backtracking
- Explain regular expression ↔ finite state machine
- Compare linear vs. exponential algorithms
- http://www.codeguru.com/cpp/cpp/cpp_mfc/parsing/article.php/c4093/Three
- http://perl.plover.com/Regex/article.html
- http://swtch.com/rsc/regexp/regexp1.html
2Recursion Reminder
- A routine that may call itself
- Parameters to recursive call are smaller
- Must be at least one non-recursive branch
- Example
    QuickSort(anArray, lower, upper) {
      if (lower < upper) {
        mid = partition(lower, upper)
        QuickSort(anArray, lower, mid)
        QuickSort(anArray, mid+1, upper)
      }
    }
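The QuickSort outline can be sketched in Python as follows. The slide does not say how partition works, so this sketch assumes a Hoare-style partition with the middle element as pivot; the names quick_sort and partition are illustrative only.

```python
def partition(a, lower, upper):
    """Hoare-style partition (an assumption; the slide leaves partition unspecified).

    Returns an index mid such that every element of a[lower..mid] is <= every
    element of a[mid+1..upper].
    """
    pivot = a[(lower + upper) // 2]
    i, j = lower - 1, upper + 1
    while True:
        i += 1
        while a[i] < pivot:
            i += 1
        j -= 1
        while a[j] > pivot:
            j -= 1
        if i >= j:
            return j
        a[i], a[j] = a[j], a[i]


def quick_sort(a, lower, upper):
    # Non-recursive branch: a range of zero or one elements is already sorted.
    if lower < upper:
        mid = partition(a, lower, upper)
        quick_sort(a, lower, mid)        # both recursive calls get a smaller range
        quick_sort(a, mid + 1, upper)


data = [5, 2, 9, 1, 5, 6]
quick_sort(data, 0, len(data) - 1)
print(data)
```

Both recursive calls work on a strictly smaller range, which is what guarantees termination.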
3Using Recursion
- Divide and Conquer
    evaluate(ArithExp) {
      if (ArithExp is expA + expB)
        return evaluate(expA) + evaluate(expB)
      if (ArithExp is expA - expB)
        return evaluate(expA) - evaluate(expB)
      if (ArithExp is a number)
        return number
    }
- Parsing
- Identifying whether input matches a grammar
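A minimal Python sketch of the evaluate outline, assuming the expression has already been parsed into nested tuples of the form (operator, left, right). That tuple representation is an assumption made for illustration; the slide does not specify one.

```python
def evaluate(exp):
    """Recursively evaluate an arithmetic expression tree.

    A number evaluates to itself (the non-recursive branch); a tuple
    (op, left, right) is evaluated by evaluating both smaller sub-expressions.
    """
    if isinstance(exp, (int, float)):
        return exp
    op, exp_a, exp_b = exp
    if op == "+":
        return evaluate(exp_a) + evaluate(exp_b)
    if op == "-":
        return evaluate(exp_a) - evaluate(exp_b)
    raise ValueError(f"unsupported operator: {op!r}")


# (1 + 2) - 4
print(evaluate(("-", ("+", 1, 2), 4)))
```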
4Using Recursion
- Backtracking
    findSolution(position) {
      numToTry = getCountOfPositions(position)
      while (numToTry > 0) {
        nextPosition = getNext(position, numToTry--)
        solution = findSolution(nextPosition)
        if (solution != null)
          return makeSolution(position, solution)
      }
      return null
    }
Essentially: try to find a route to the solution
from where you are. If that fails, go back to
where you came from and try an alternative.
This is a rough outline to illustrate the idea.
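One way to make the outline concrete is route finding in a small graph. In this sketch the names find_route and graph, and the goal test, are hypothetical additions; the visited set is also an addition, there to stop the search from going round in circles.

```python
def find_route(graph, position, goal, visited=None):
    """Return a list of positions from position to goal, or None if there is none."""
    if visited is None:
        visited = set()
    if position == goal:
        return [position]                 # non-recursive branch: solution found
    visited.add(position)
    for next_position in graph.get(position, []):
        if next_position in visited:
            continue                      # already tried from there
        route = find_route(graph, next_position, goal, visited)
        if route is not None:
            return [position] + route     # plays the role of makeSolution
        # otherwise: back where we came from, try the next alternative
    return None


graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],   # B -> A would loop without the visited set
    "C": ["E"],
    "D": [],
    "E": ["F"],
    "F": [],
}
print(find_route(graph, "A", "F"))
```

Here the search first tries A, B, D, hits a dead end, backtracks to A and then succeeds via C.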
5Backtracking
- Exhaustive Search
- Need to watch for loops
- Even without loops can visit same state twice
- Can improve by detecting visits to same state
- Can be expensive
- E.g. 2 branches at each node and n levels
- 2^n states to explore: exponential complexity
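The effect of detecting repeat visits can be illustrated with a toy search space. The state shape (level, offset) is a made-up assumption: each node has two branches, but different branch orders can lead to the same state.

```python
def explore(level, offset, depth, seen=None):
    """Count the states actually expanded below (level, offset).

    With seen=None every path is explored independently. With a set passed
    in, a state that has been visited before is skipped.
    """
    if seen is not None:
        if (level, offset) in seen:
            return 0
        seen.add((level, offset))
    if level == depth:
        return 1
    return (1
            + explore(level + 1, offset, depth, seen)        # first branch
            + explore(level + 1, offset + 1, depth, seen))   # second branch


n = 10
print("without detection:", explore(0, 0, n))
print("with detection:   ", explore(0, 0, n, set()))
```

Without detection the count is that of a full binary tree with n levels below the root; with detection it is only the number of distinct (level, offset) pairs.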
6Regular Expressions
- A pattern used to recognise a set of strings
- Constructing Regular Expressions
- Alternation: |
- cat|dog matches cat or dog
- Grouping: ()
- Controls the scope and precedence of the operators
- pe(t|d)al matches petal or pedal
- Quantification: repetition of the preceding expression
- ? : 0 or 1
- "colou?r" matches both color and colour
- * : 0 or more
- \s*Regular Expressions\s* // \s is whitespace (tab, space)
- + : 1 or more
- Regular\s+Expression
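These operators can be tried out with Python's standard re module. The test strings below are made up for illustration; re.fullmatch requires the whole string to match the pattern.

```python
import re

EXAMPLES = [
    # (pattern, text, expected to match?)
    (r"cat|dog", "dog", True),                    # alternation
    (r"pe(t|d)al", "pedal", True),                # grouping limits the scope of |
    (r"pe(t|d)al", "pedtal", False),
    (r"colou?r", "color", True),                  # ? : 0 or 1
    (r"colou?r", "colour", True),
    (r"\s*Regular Expressions\s*", "  Regular Expressions\t", True),  # * : 0 or more
    (r"Regular\s+Expression", "Regular   Expression", True),          # + : 1 or more
    (r"Regular\s+Expression", "RegularExpression", False),
]


def matches(pattern, text):
    return re.fullmatch(pattern, text) is not None


for pattern, text, expected in EXAMPLES:
    print(f"{pattern!r:32} {text!r:28} {matches(pattern, text)}")
```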
7Additional Features 1
- Character Sets
- [1234567890]
- Any character
- .
- Not
- [^1234567890]
- Symbols
- \s - whitespace
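A short sketch of these features using Python's re module; the sample strings are invented for illustration.

```python
import re

# Character set: any one of the listed characters.
digits = re.findall(r"[1234567890]", "room 42b")

# Negated set: any one character that is not listed.
non_digits = re.findall(r"[^1234567890]", "4a2b")

# . matches any single character (other than a newline, by default).
any_char = re.fullmatch(r"c.t", "cut") is not None

# \s matches a whitespace character such as a space or a tab.
spaces = re.findall(r"\s", "a b\tc")

print(digits, non_digits, any_char, spaces)
```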
8Additional Features 2
- Recorded sub-expressions
- Stored match can be used later in a pattern
- H(.+)\sD\1
- Matches "Humpty Dumpty" and "Hoo Doo"
- This is a very powerful feature
- Makes regular expressions more expressive
- Can match additional patterns that the original notation can't
- Other additional features are just notation
- Can be implemented using original features
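A recorded sub-expression can be tried in Python's re module, where \1 refers back to whatever the first group matched. The pattern is written here as H(.+)\sD\1, which is an assumption about the slide's intended pattern; it is consistent with both example strings.

```python
import re

PATTERN = re.compile(r"H(.+)\sD\1")

for text in ["Humpty Dumpty", "Hoo Doo", "Humpty Dimpty"]:
    m = PATTERN.fullmatch(text)
    # group(1) is the recorded sub-expression that \1 must repeat exactly
    print(text, "->", m.group(1) if m else None)
```

The third string fails because the text after "D" does not repeat what the group recorded.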
9Uses of Regular Expressions
- Validation
- Common on Web Sites
- Splitting text input
- Clever search and replace editing
- Lexical Analysis
- Efficient way of converting source into tokens
- Numbers, String literals
- Operators and symbols (e.g. >)
- Keywords if, switch
- Identifiers
- Faster than recursive parsing
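Lexical analysis with regular expressions can be sketched as below. The token names and the tiny token set are assumptions chosen to mirror the list above; this is not a lexer for any particular language.

```python
import re

# Order matters: KEYWORD must be tried before IDENT.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("STRING", r'"[^"]*"'),
    ("KEYWORD", r"\b(?:if|switch)\b"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/=<>(){};]"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Convert source text into a list of (kind, text) tokens."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        if m.lastgroup != "SKIP":          # whitespace separates tokens only
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


print(tokenize('if (x > 10) y = "hi";'))
```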
10Deterministic FSM
[Figure: deterministic FSM diagram. Labels: Start State, Success State, Event, No Match, Success; transitions a, b, c, d; example inputs "ab" and "abcd".]
11Non-deterministic FSM
- Several transitions with same label from state
- Allow empty transition
- Activate a new state with no input
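[Figure: several state machine diagrams for the pattern a(bc)?, with transitions labelled a, b and c; the input "abc" is marked "Match".]
Which is the NFSM? Do they all match the same strings?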
12RegEx NFSM DFSM
- NFSM can represent Regular Expressions
- Except for recorded sub-expressions
- Any NFSM can be converted into a DFSM
- An NFSM has a set of active states
- Essentially: create a DFSM with extra states representing these sets of active states
- Normal RegEx, DFSM and NFSM: equivalent expressive power
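The conversion idea (one DFSM state per set of active NFSM states) can be sketched as follows. The example NFSM, for the pattern (a|b)*ab, is a hypothetical one chosen for illustration, and empty transitions are left out to keep the sketch short.

```python
from collections import deque

# Hypothetical NFSM for (a|b)*ab: state 0 has two transitions labelled 'a'.
NFA_DELTA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
}
NFA_START, NFA_ACCEPT = 0, {2}


def to_dfsm(start, delta, alphabet):
    """Build a DFSM whose states are frozensets of NFSM states."""
    start_set = frozenset({start})
    table = {}
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        if current in table:
            continue
        table[current] = {}
        for ch in alphabet:
            target = frozenset(t for s in current for t in delta.get((s, ch), ()))
            if target:
                table[current][ch] = target
                queue.append(target)
    return start_set, table


def dfsm_match(text, start_set, table, accept):
    state = start_set                      # exactly one active DFSM state
    for ch in text:
        state = table[state].get(ch)
        if state is None:
            return False
    return bool(state & accept)            # accepting if it contains an NFSM accept state


START_SET, TABLE = to_dfsm(NFA_START, NFA_DELTA, "ab")
print(len(TABLE), "DFSM states")
print(dfsm_match("aab", START_SET, TABLE, NFA_ACCEPT))
```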
13Matching Regular Expressions
- Backtracking (supports recorded sub-exp)
    Match(regExp) {
      if (regExp is ExpA ExpB) return Match(ExpA) and Match(ExpB)
      pos = getInputPosition()
      if (regExp is ExpA | ExpB) {
        if Match(ExpA) return true
        else {
          resetInputPosition(pos) // Backtracking
          return Match(ExpB)
        }
      }
      if (regExp is ExpA*) {
        while (Match(ExpA))
          pos = getInputPosition()
        resetInputPosition(pos) // Backtracking past failed match
        return true
      }
      return false
    }
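A runnable variant of the backtracking matcher, under some assumptions: the pattern is given as a pre-parsed tree of tuples (the node kinds "lit", "cat", "alt" and "star" are invented here), and the "rest of the match" is passed along as a function so that a failure later on can force an earlier choice to be retried. This is one common way to get full backtracking; it is not necessarily how the outline above was meant to be completed.

```python
def match_here(node, text, pos, rest):
    """Match node at text[pos:], then call rest(new_pos). Returns True or False."""
    kind = node[0]
    if kind == "lit":                                  # a single character
        return pos < len(text) and text[pos] == node[1] and rest(pos + 1)
    if kind == "cat":                                  # ExpA ExpB
        return match_here(node[1], text, pos,
                          lambda p: match_here(node[2], text, p, rest))
    if kind == "alt":                                  # ExpA | ExpB
        # If ExpA (plus everything after it) fails, backtrack and try ExpB.
        return (match_here(node[1], text, pos, rest)
                or match_here(node[2], text, pos, rest))
    if kind == "star":                                 # ExpA*
        # Try one more repetition (it must consume input), else stop repeating.
        return (match_here(node[1], text, pos,
                           lambda p: p > pos and match_here(node, text, p, rest))
                or rest(pos))
    raise ValueError(f"unknown node kind: {kind!r}")


def full_match(node, text):
    return match_here(node, text, 0, lambda p: p == len(text))


# a(b|c)* as a tree
A_BC_STAR = ("cat", ("lit", "a"),
             ("star", ("alt", ("lit", "b"), ("lit", "c"))))
print(full_match(A_BC_STAR, "abc"), full_match(A_BC_STAR, "abd"))
```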
14Matching RegEx with FSM 1
- Compile RegEx to NFSM
- Match input against the NFSM
- Keep track of multiple active states
- What is the maximum size of the list?
[Figure: NFSM with states 1 to 5 and transitions labelled a, b and c, captioned "abc".]
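Matching against an NFSM by tracking a set of active states can be sketched like this. The machine below is a hand-built, hypothetical NFSM for a(b|c)* with empty transitions; its state numbering is not taken from the slide's diagram.

```python
# Labelled transitions: (state, character) -> set of next states
DELTA = {
    (0, "a"): {1},
    (2, "b"): {3},
    (2, "c"): {3},
}
# Empty transitions: state -> set of states activated with no input
EPS = {
    1: {2, 4},
    3: {2, 4},
}
START, ACCEPT = 0, {4}


def epsilon_closure(states, eps):
    """Add every state reachable through empty transitions alone."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure


def nfsm_match(text, start=START, accept=ACCEPT, delta=DELTA, eps=EPS):
    active = epsilon_closure({start}, eps)
    for ch in text:
        moved = set()
        for s in active:
            moved |= delta.get((s, ch), set())
        active = epsilon_closure(moved, eps)
        if not active:
            return False               # no state survived this character
    return bool(active & accept)


print(nfsm_match("abc"), nfsm_match("ba"))
```

The active set can never hold more states than the machine has, which bounds the work done per input character.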
15Matching RegEx with FSM
- Compile Regular Expression to DFSM
- Convert RegEx to NFSM
- Convert NFSM to DFSM
- Match input against the DFSM
- Only one active state which moves with input
[Figure: DFSM with states 1 to 4 and transitions labelled a, b and c.]
16Converting Regex to NFSM
[Figure: NFSM fragments for a single character (a) and for an empty transition.]
17Example
[Figure: step-by-step NFSM construction, built up from the fragments a, b, c, bc and (bc) to a(bc).]
18Efficiency Backtracking
- a?^n a^n
- Shorthand for a?a?a?a? .. aaaa ..
- Input a against a?a
- Try to match a against a?, succeed but fail overall
- Try a? as empty, match a against a, succeed
- Input aaa against a?a?a?aaa
- It will take 2^3 permutations before every a? in a?a?a? is tried as empty
- 111, 110, 101 ...
- Worst-case efficiency 2^n (exponential, slow)
- where n = input size
Exponential complexity
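The blow-up can be observed by counting calls in a small backtracking matcher written just for this pattern family. The function name count_steps and the list representation of the pattern are assumptions for illustration; each a? tries to consume an a first and falls back to matching nothing.

```python
def count_steps(n):
    """Match a?^n a^n against a^n by backtracking; return (matched, calls)."""
    pattern = ["a?"] * n + ["a"] * n
    text = "a" * n
    calls = 0

    def match(i, pos):
        nonlocal calls
        calls += 1
        if i == len(pattern):
            return pos == len(text)
        has_a = pos < len(text) and text[pos] == "a"
        if pattern[i] == "a?":
            # Option 1: consume an 'a'. Option 2 (after backtracking): match nothing.
            return (has_a and match(i + 1, pos + 1)) or match(i + 1, pos)
        return has_a and match(i + 1, pos + 1)

    return match(0, 0), calls


for n in (4, 8, 12):
    print(n, count_steps(n))
```

The only successful assignment (every a? empty) is the last one tried, so every combination of the n optional items is explored first.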
19Efficiency NFSM
- In worst case,
- m = maximum number of active states
- n = input size
- Efficiency is O(nm)
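For comparison, the same a?^n a^n pattern can be run with a set of active states while counting the work done. In this sketch state i is assumed to mean "about to match item i of the pattern", and steps counts (active state, input character) pairs; both choices are illustrative.

```python
def nfsm_steps(n):
    """Simulate a?^n a^n on a^n with a set of active states; return (matched, steps)."""
    pattern = ["a?"] * n + ["a"] * n
    text = "a" * n
    end = len(pattern)

    def closure(states):
        # An optional item may be skipped without consuming input.
        result = set(states)
        stack = list(states)
        while stack:
            i = stack.pop()
            if i < end and pattern[i] == "a?" and i + 1 not in result:
                result.add(i + 1)
                stack.append(i + 1)
        return result

    active = closure({0})
    steps = 0
    for ch in text:
        steps += len(active)
        # Every item of this pattern consumes exactly one 'a'.
        moved = {i + 1 for i in active if i < end and ch == "a"}
        active = closure(moved)
    return end in active, steps


for n in (4, 8, 12, 20):
    print(n, nfsm_steps(n))
```

With m = 2n + 1 possible states here, the step count stays within n × m.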
20Experimental results
- http://swtch.com/rsc/regexp/regexp1.html
Notice the difference in speed for input of size
25 characters. Will this affect the user
experience?
Which use backtracking and which use NFSM
approach?
Why would backtracking approach be used?
Does this mean that complexity analysis is a
waste of time?
Note the logarithmic time scale.
[Figure: running time against regular expression and text size n, for a?^n a^n matching a^n.]
21Summary
- Regular Expressions are useful
- Regular Expressions ↔ Finite State Machines
- Non-Deterministic ↔ Deterministic FSM
- In expressive power
- Recognition Algorithms
- Backtracking: recursive, O(2^n) worst case
- FSM: O(mn) = O(size of pattern × size of input)
- Theoretically more efficient (but in practice?)
- However, can't handle recorded sub-expressions
Similarly, Quicksort has an n^2 worst-case
complexity but on average is n log(n): fast