Title: Pattern Matching on Strings using Regular Expressions
1Pattern Matching on Strings using Regular Expressions
Num = 0 | [1-9][0-9]*
Email = [a-z]+ "@" [a-z]+ ("." [a-z]+)*
Claus Brabrand, brabrand_at_itu.dk, IT University of Copenhagen
Jakob G. Thomsen, gedefar_at_cs.au.dk, Aarhus University
2Abstract
We show how to achieve typed and unambiguous
declarative pattern matching on strings using
regular expressions extended with a simple
recording operator. We give a characterization
of ambiguity of regular expressions that leads to
a sound and complete static analysis. The
analysis is capable of pinpointing all
ambiguities in terms of the structure of the
regular expression and reporting shortest ambiguous
strings. We also show how pattern matching can
be integrated into statically typed programming
languages for deconstructing strings and
reproducing typed and structured values. We
validate our approach by giving a full
implementation of the approach presented in this
paper. The resulting tool, reg-exp-rec, adds
typed and unambiguous pattern matching to Java in
a stand-alone and non-intrusive manner. We
evaluate the approach using several realistic
examples.
3Outline
- Pattern Matching (intro & motivation)
- The Chomsky Hierarchy (1956)
- Regular Expressions
- The Recording Construction
- Ambiguity
- Disambiguation
- Type Inference
- Usage and Examples
- Evaluation and Conclusion
4Introduction & Motivation
- Pattern matching: an indispensable problem
- Many applications need to "parse" dynamic input:
- 1) URLs
http://first.dk/index.php?id=141&view=details
(protocol, host, path, query-string (list of key-value pairs))
- 2) Log Files
13/02/2010 66.249.65.107 get /support.html
20/02/2010 42.116.32.64 post /search.html
- 3) DBLP
<article> <title>Three Models for the...</title> <author>Noam Chomsky</author> <year>1956</year> </article>
6The Chomsky Hierarchy (1956)
- Language classes (formalisms)
- Type-3 regular expressions "enough" for
- URLs, log files, DBLP, ...
- "Trade" (excess) expressivity for
- declarativity, simplicity, and static safety !
7Type-0 java.net.URL
- Turing-Complete programming (e.g., Java)
- "unrestricted grammars" (e.g., rewriting systems)
- Cyclomatic complexity (of official "java.net.URL")
- 88 bug reports on Sun's Bug Repository !
- Bug reports span more than a decade !
8Type-1 Context-Sensitivity
- Not widely used (or studied?) formalism
- Presumably because:
- Restricts expressivity w/o offering extra safety?
9Type-2 Context-Free Grammars
- Conceptually harder than regexps
- Essentially (Type-3) Regular Expressions + recursion
- The ultimate end-all scientific argument:
- We d
(conjecture!)
regexps 12 times more popular !
10Type-? Regexp Capture Groups
- Capturing groups (Perl, PHP, Java regex, ...)
- Syntax: (R) (i.e., in parentheses)
- Back-references
- Syntax: \7 (i.e., "index of" capturing group)
- Beyond regularity !
- (a*)b\1 is non-regular: { a^n b a^n | n ≥ 0 }
- In fact, not even context-free !!!
- (.*).\1 is non-context-free: { ω α ω | ω ∈ Σ*, α ∈ Σ }
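The two back-reference patterns on this slide, (a*)b\1 and (.*).\1, can be tried directly in java.util.regex, which supports capturing groups and numbered back-references. The following is a minimal sketch; the class and method names are illustrative and not part of the talk.

```java
import java.util.regex.Pattern;

class BackReferenceDemo {
    // (a*)b\1 : group 1 records a run of a's, and \1 demands the same run again.
    static final Pattern AN_B_AN = Pattern.compile("(a*)b\\1");
    // (.*).\1 : any string, one separator character, then the same string again.
    static final Pattern COPY = Pattern.compile("(.*).\\1");

    static boolean isAnBAn(String s) { return AN_B_AN.matcher(s).matches(); }
    static boolean isCopy(String s) { return COPY.matcher(s).matches(); }

    public static void main(String[] args) {
        System.out.println(isAnBAn("aabaa"));   // same number of a's on both sides
        System.out.println(isAnBAn("aabaaa"));  // one a too many on the right
        System.out.println(isCopy("abc#abc"));  // left and right halves are identical
    }
}
```

Because the second half of the input must equal whatever group 1 recorded, no finite automaton can decide these languages, which is the point the slide makes.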
11Type-? Regexp Capture Groups
- Interpretation with back-tracking
- NP-complete (exponential worst-case) :-(
- regexp "a?^n a^n" vs. string "a^n":
- 1 minute vs. 0.02 msecs
- i.e., 3,000,000:1 on strings of length 29 !!!
12Type-3 Regular Expressions
Declarative !
Safe !
Simple !
- Closure properties
- Union
- Concatenation
- Iteration
- Restriction
- Intersection
- Complement
- ...
- Decidability properties
- ...
- ...
- Containment L(R) ⊆ L(R')
- Ambiguity
- ...
- ...
14Regular Expressions
- Syntax
- Semantics
- where
- L1 · L2 is concatenation (i.e., { ω1 ω2 | ω1 ∈ L1, ω2 ∈ L2 })
- L* = ∪_{i≥0} L^i where L^0 = { ε } and L^i = L · L^(i-1)
15Common Extensions (sugar)
- Any character (aka, dot)
- "." as c1|c2|...|cn, ci ∈ Σ
- Character ranges
- "[a-z]" as a|b|...|z
- One-or-more regexps
- "R+" as R·R*
- Optional regexp
- "R?" as ε|R
- Various repetitions; e.g.:
- "R{2,3}" as R·R·R?
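That each extension is only sugar can be checked empirically: the sugared and desugared forms should accept exactly the same strings. The sketch below compares two java.util.regex patterns on all strings over the alphabet {a, b} up to a length bound. It is a bounded test rather than a proof, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class SugarDemo {
    // True if both patterns accept exactly the same strings over {a,b} of length <= maxLen.
    static boolean sameLanguage(String sugared, String desugared, int maxLen) {
        return agree(Pattern.compile(sugared), Pattern.compile(desugared), "", maxLen);
    }

    // Depth-first enumeration of all strings that extend s, up to maxLen characters.
    private static boolean agree(Pattern p, Pattern q, String s, int maxLen) {
        if (p.matcher(s).matches() != q.matcher(s).matches()) return false;
        if (s.length() == maxLen) return true;
        return agree(p, q, s + "a", maxLen) && agree(p, q, s + "b", maxLen);
    }

    public static void main(String[] args) {
        System.out.println(sameLanguage("a+", "aa*", 6));       // R+ as R·R*
        System.out.println(sameLanguage("a{2,3}", "aaa?", 6));  // R{2,3} as R·R·R?
    }
}
```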
17Recording
- Syntax
- "x" is a recording identifier
- (it "remembers" the substring it matches)
- Semantics
- Example (simplified emails):
<user = [a-z]+ > "@" <domain = [a-z]+ ("." [a-z]+)* >
- Matching against string:
"obama@whitehouse.gov"
- yields:
user = "obama"
domain = "whitehouse.gov"
- NB: cannot use DFAs / NFAs !
- only recognition (yes / no)
- not how (i.e., "the structure")
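For readers who know java.util.regex, the closest built-in analogue to a recording is a named capturing group. The sketch below reproduces the simplified email example that way. It assumes the obfuscated "_at_" in the slide text stands for "@", and it illustrates the idea only; it is not the reg-exp-rec tool. The class name is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailRecordingDemo {
    // user and domain play the role of the two recordings on the slide.
    static final Pattern EMAIL =
        Pattern.compile("(?<user>[a-z]+)@(?<domain>[a-z]+(?:\\.[a-z]+)*)");

    // Returns {user, domain}, or null if s is not a (simplified) email address.
    static String[] match(String s) {
        Matcher m = EMAIL.matcher(s);
        return m.matches() ? new String[] { m.group("user"), m.group("domain") } : null;
    }

    public static void main(String[] args) {
        String[] r = match("obama@whitehouse.gov");
        System.out.println("user = " + r[0] + ", domain = " + r[1]);
    }
}
```

Unlike the approach in the talk, nothing here checks statically that the pattern is unambiguous, and every recorded value is an untyped String.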
18Recording (structured)
- Another example (with nested recordings):
<date = <day = [0-9]{2} > "/" <month = [0-9]{2} > "/" <year = [0-9]{4} > >
- Matching against string:
"26/06/1992"
- yields:
date = 26/06/1992
date.day = 26
date.month = 06
date.year = 1992
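Nested recordings correspond roughly to nested named groups in java.util.regex. Java group names cannot contain dots, so the sketch below builds the dotted keys (date.day and so on) by hand in a map. The class name and the map representation are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateRecordingDemo {
    // The outer group "date" encloses the inner groups day, month, and year.
    static final Pattern DATE = Pattern.compile(
        "(?<date>(?<day>[0-9]{2})/(?<month>[0-9]{2})/(?<year>[0-9]{4}))");

    // Returns the recorded substrings keyed as on the slide, or null on mismatch.
    static Map<String, String> match(String s) {
        Matcher m = DATE.matcher(s);
        if (!m.matches()) return null;
        Map<String, String> r = new LinkedHashMap<>();
        r.put("date", m.group("date"));
        r.put("date.day", m.group("day"));
        r.put("date.month", m.group("month"));
        r.put("date.year", m.group("year"));
        return r;
    }

    public static void main(String[] args) {
        System.out.println(match("26/06/1992"));
    }
}
```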
19Recording (structured, lists)
- Yet another example (yielding lists):
<name = [a-z]+ > " " <name = [a-z]+ >
( <name = [a-z]+ > "\n" )*
<name = [a-z]+ > (" " <name = [a-z]+ > )*
- Matching against string:
"obama bush"
- yields a list structure:
name = [obama,bush]
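List-valued recordings have no direct counterpart in java.util.regex: a capturing group under repetition keeps only its last match. The sketch below shows that behaviour and a common workaround, which is to validate the whole string first and then collect the items with a find() loop. All names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NameListDemo {
    static final Pattern NAMES = Pattern.compile("[a-z]+(?: [a-z]+)*");
    static final Pattern NAME = Pattern.compile("[a-z]+");

    // A group inside a repetition only remembers its final iteration.
    static String lastCaptured(String s) {
        Matcher m = Pattern.compile("(?:(?<name>[a-z]+) ?)*").matcher(s);
        return m.matches() ? m.group("name") : null;
    }

    // Workaround: check the overall shape, then collect every name separately.
    static List<String> match(String s) {
        if (!NAMES.matcher(s).matches()) return null;
        List<String> names = new ArrayList<>();
        Matcher m = NAME.matcher(s);
        while (m.find()) names.add(m.group());
        return names;
    }

    public static void main(String[] args) {
        System.out.println(lastCaptured("obama bush"));
        System.out.println(match("obama bush"));
    }
}
```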
21Abstract Syntax Trees (ASTs)
22Ambiguity
- Definition:
- R ambiguous iff
- ∃T,T' ∈ AST_R : T ≠ T' ∧ ||T|| = ||T'||
- where ||·|| : AST → Σ* is the flattening
23Characterization of Ambiguity
- Theorem
- R unambiguous iff
NB: sound & complete !
R ? R?R
24Examples
- Ambiguous:
- a|a
- L(a) ∩ L(a) = { a } ≠ Ø
- a*·a*
- overlap of L(a*) and L(a*) = { a^n } ≠ Ø
- Unambiguous:
- a|aa
- L(a) ∩ L(aa) = Ø
- a*·ba*
- overlap of L(a*) and L(ba*) = Ø
25Ambiguity Examples
- a?b|(ab)
ambiguous choice: a?b <--> (ab)
shortest ambiguous string: "ab"
- (a|ab)·(ba|a)
ambiguous concatenation: (a|ab) <--> (ba|a)
shortest ambiguous string: "aba"
- (aa|aaa)*
ambiguous star: (aa|aaa)*
shortest ambiguous string: "aaaaa"
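The definition of ambiguity (two distinct ASTs that flatten to the same string) can be explored by brute force: count the parse trees of every string up to a length bound and report the first string with more than one. The sketch below does this for a tiny combinator encoding of regular expressions. It is an exhaustive bounded search, not the static analysis described in the talk; all names are hypothetical, and star assumes its body does not match the empty string.

```java
import java.util.ArrayList;
import java.util.List;

class AmbiguityCount {
    /** A regular expression, viewed as a function counting its parse trees for a string. */
    interface Re { long count(String s); }

    static Re lit(String w) { return s -> s.equals(w) ? 1L : 0L; }
    static Re alt(Re l, Re r) { return s -> l.count(s) + r.count(s); }
    static Re opt(Re r) { return alt(lit(""), r); }

    // Concatenation: sum over every split point of (left parses) * (right parses).
    static Re cat(Re l, Re r) {
        return s -> {
            long n = 0;
            for (int i = 0; i <= s.length(); i++)
                n += l.count(s.substring(0, i)) * r.count(s.substring(i));
            return n;
        };
    }

    // Star: peel off a non-empty first iteration, then recurse on the remainder.
    static Re star(Re r) {
        return new Re() {
            public long count(String s) {
                if (s.isEmpty()) return 1;
                long n = 0;
                for (int i = 1; i <= s.length(); i++)
                    n += r.count(s.substring(0, i)) * count(s.substring(i));
                return n;
            }
        };
    }

    // First string (by length, then alphabet order) with at least two parses, or null.
    static String shortestAmbiguous(Re r, String alphabet, int maxLen) {
        List<String> level = new ArrayList<>();
        level.add("");
        for (int len = 0; len <= maxLen; len++) {
            for (String s : level) if (r.count(s) >= 2) return s;
            List<String> next = new ArrayList<>();
            for (String s : level)
                for (char c : alphabet.toCharArray()) next.add(s + c);
            level = next;
        }
        return null;
    }

    public static void main(String[] args) {
        Re a = lit("a"), b = lit("b");
        System.out.println(shortestAmbiguous(cat(alt(a, lit("ab")), alt(lit("ba"), a)), "ab", 5));
    }
}
```

On the three expressions from the slide, this search finds the same shortest ambiguous strings that the tool reports.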
26Ambiguity vs. Recordings
- Ambiguities inside recordings:
- <x = a | a >
- <x = a* · a* >
- ...is not a problem!
- Contextual composition (of recordings):
- <x = a* > · a*
- <x = a > | a
- ...is a problem!
- Note: our tool tests only for these!
28Disambiguation
- 1) Manual rewriting
- Always possible :-)
- Tedious :-(
- Error-prone :-(
- Not structure-preserving :-(
- 2) Restriction
- R1 - R2
- And then encode...
- R^C as Σ* - R
- R1 & R2 as (R1^C | R2^C)^C
- 3) Disambiguators
- From characterization:
- concat '?L', '?R'
- choice 'L', 'R'
- star 'L', 'R'
- (partial-order on ASTs)
- 4) Default disambiguation
- concat, choice, and star are all left-biased (by default) !
- (Our tool does this)
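As an analogy for a default bias, java.util.regex also resolves ambiguity silently: its backtracking matcher tries alternatives from left to right and keeps the first overall match. The sketch below shows that the split of "aba" depends on the order in which the alternatives are written. This illustrates Java's behaviour only; whether the tool's left-biased default coincides with it in every case is not claimed here. The class name is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LeftBiasDemo {
    // Returns the substrings captured by groups 1 and 2, or null if s does not match.
    static String[] split(String regex, String s) {
        Matcher m = Pattern.compile(regex).matcher(s);
        return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] first = split("(a|ab)(ba|a)", "aba");
        String[] second = split("(ab|a)(a|ba)", "aba");
        System.out.println(first[0] + " + " + first[1]);
        System.out.println(second[0] + " + " + second[1]);
    }
}
```

Both patterns denote the same language, yet they record different substrings, which is why an explicit and documented disambiguation policy matters.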
29Quizzz (Restriction vs. Recording)
- Which can have recordings?
R1 - R2
R3^C as Σ* - R3
R4 & R5 as (R4^C | R5^C)^C
i.e., where do recordings make sense?
- A) R1, R2, R3, R4, and R5 can have recordings
- B) R1, R3, R4, and R5 can have recordings
- C) R1, R4, and R5 can have recordings
- D) R1 can have recordings
- E) None of them can have recordings
31Type Inference
32Examples (Type Inference)
Person = <name = [a-z]+ > " (" <age = [0-9]+ > ")"
compile (our tool):
class Person { // auto-generated
  String name;
  int age;
  static Person match(String s) { ... }
  public String toString() { ... }
}
String s = "obama (48)";
Person p = Person.match(s);
print(p.name + " is " + p.age + "y old");
33Examples (Type Inference)
Person = <name = [a-z]+ > " (" <age = [0-9]+ > ")"
People = ( Person "\n" )*
compile (our tool):
class People { // auto-generated
  String[] name;
  int[] age;
  static People match(String s) { ... }
  public String toString() { ... }
}
String s = "obama (48) \n bush (63) \n ";
People p = People.match(s);
println("Second name is " + p.name[1]);
34Examples (Type Inference)
Person = <name = [a-z]+ > " (" <age = [0-9]+ > ")"
People = ( <person = Person > "\n" )*
compile (our tool):
class People { // auto-generated
  Person[] person;
  class Person { // nested class
    String name;
    int age;
  }
  ...
}
String s = "obama (48) \n bush (63) \n ";
People people = People.match(s);
for (Person p : people.person) println(p.name);
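To make the inferred types concrete, here is a hand-written approximation of what a class for the single Person pattern might look like if it were implemented on top of java.util.regex. The slides show the generated bodies only as "...", so everything inside them, including returning null on a mismatch, is an assumption. The class is named PersonSketch to avoid suggesting it is the tool's output.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PersonSketch {
    // name is recorded as a String; age matches [0-9]+ and is therefore exposed as an int.
    private static final Pattern P =
        Pattern.compile("(?<name>[a-z]+) \\((?<age>[0-9]+)\\)");

    final String name;
    final int age;

    private PersonSketch(String name, int age) { this.name = name; this.age = age; }

    // Assumption: a failed match yields null; very long digit runs would overflow int.
    static PersonSketch match(String s) {
        Matcher m = P.matcher(s);
        if (!m.matches()) return null;
        return new PersonSketch(m.group("name"), Integer.parseInt(m.group("age")));
    }

    @Override public String toString() { return name + " (" + age + ")"; }

    public static void main(String[] args) {
        PersonSketch p = match("obama (48)");
        System.out.println(p.name + " is " + p.age + "y old");
    }
}
```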
36URLs
- URLs:
"http://www.google.com/search?q=record&hl=en"
(protocol, host, path, query-string (list of key-value pairs))
- Regexp:
Host = <host = [a-z]+ ("." [a-z]+ )* >
Path = <path = [a-z/.]* >
Query = <query = [a-z&=]* >
URL = "http://" Host "/" Path "?" Query
- Query string further structured (list of key-value pairs):
KeyVal = <key = [a-z]+ > "=" <val = [a-z]+ >
Query = KeyVal ("&" KeyVal)*
37URLs (Usage Example)
Host = <host = [a-z]+ ("." [a-z]+ )* >
Path = <path = [a-z/.]* >
KeyVal = <key = [a-z]+ > "=" <val = [a-z]+ >
Query = KeyVal ("&" KeyVal)*
URL = "http://" Host "/" Path "?" Query
String s = "http://www.google.com/search?q=record";
URL url = URL.match(s);
print("Host is " + url.host);
if (url.key.length > 0) print("1st key " + url.key[0]);
for (String val : url.val) println("value " + val);
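For comparison, the same URL decomposition can be approximated with plain java.util.regex: one pattern validates the overall shape and records host and path, and a second pass collects the key-value pairs. This sketch assumes the separators "=" and "&" in the query string. It uses lists where the slide's usage example suggests arrays, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlSketch {
    private static final String KEYVAL = "[a-z]+=[a-z]+";
    // The query group only accepts a well-formed list: KeyVal ("&" KeyVal)*.
    private static final Pattern URL = Pattern.compile(
        "http://(?<host>[a-z]+(?:\\.[a-z]+)*)/(?<path>[a-z/.]*)\\?"
        + "(?<query>" + KEYVAL + "(?:&" + KEYVAL + ")*)");
    private static final Pattern PAIR = Pattern.compile("(?<key>[a-z]+)=(?<val>[a-z]+)");

    final String host, path;
    final List<String> key = new ArrayList<>(), val = new ArrayList<>();

    private UrlSketch(String host, String path) { this.host = host; this.path = path; }

    // Returns the decomposed URL, or null if s does not have the expected shape.
    static UrlSketch match(String s) {
        Matcher m = URL.matcher(s);
        if (!m.matches()) return null;
        UrlSketch url = new UrlSketch(m.group("host"), m.group("path"));
        Matcher kv = PAIR.matcher(m.group("query"));
        while (kv.find()) {
            url.key.add(kv.group("key"));
            url.val.add(kv.group("val"));
        }
        return url;
    }

    public static void main(String[] args) {
        UrlSketch url = match("http://www.google.com/search?q=record&hl=en");
        System.out.println("Host is " + url.host + ", keys " + url.key + ", values " + url.val);
    }
}
```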
38Log Files
Format:
13/02/2010 66.249.65.107 /support.html
20/02/2010 42.116.32.64 /search.html
...
Regexp:
Date = <date = <day = Day > "/"
               <month = Month > "/"
               <year = [0-9]{4} > >
IP = <ip = [0-9]{1,3} ("." [0-9]{1,3} ){3} >
Entry = <entry = Date " " IP " " Path "\n" >
Log = Entry*
Usage:
Log log = Log.match(log_file);
for (Entry e : log.entry)
  if (e.date.month == 02 && e.date.day == 29)
    print("Access on LEAP YEAR from IP " + e.ip);
39Log Files (cont'd, ambiguity)
- Assume we forgot "/" (between day & month)
- Ambiguity:
- i.e. "1/01" (January 1) vs. "10/1" (January 10) :-)
Regexp:
Day = 0?[1-9] | [1-2][0-9] | 30 | 31
Month = 0?[1-9] | 10 | 11 | 12
Date = <date = <day = Day > // no slash !
               <month = Month > "/"
               <year = [0-9]{4} > >
Error:
ambiguous concatenation: <day> <--> <month>
shortest ambiguous string: "101"
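The reported error can be reproduced by brute force: enumerate digit strings by increasing length and count the ways each one splits into a Day followed by a Month. The sketch below uses java.util.regex for the two component languages. It is a bounded search for illustration, not the tool's analysis, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class DayMonthAmbiguity {
    static final Pattern DAY = Pattern.compile("0?[1-9]|[1-2][0-9]|30|31");
    static final Pattern MONTH = Pattern.compile("0?[1-9]|10|11|12");

    // All ways to read s as a Day immediately followed by a Month (no separator).
    static List<String> splits(String s) {
        List<String> result = new ArrayList<>();
        for (int i = 1; i < s.length(); i++) {
            String day = s.substring(0, i), month = s.substring(i);
            if (DAY.matcher(day).matches() && MONTH.matcher(month).matches())
                result.add(day + "|" + month);
        }
        return result;
    }

    // First digit string (by length, then numeric order) with two or more splits, or null.
    static String shortestAmbiguous(int maxLen) {
        for (int len = 1; len <= maxLen; len++) {
            int limit = (int) Math.pow(10, len);
            for (int n = 0; n < limit; n++) {
                String s = String.format("%0" + len + "d", n);
                if (splits(s).size() >= 2) return s;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(shortestAmbiguous(4) + " splits as " + splits("101"));
    }
}
```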
40DBLP (Format)
<article>
  <author>Noam Chomsky</author>
  <title>Three Models for the Description of Language</title>
  <year>1956</year>
  <journal>IRE Transactions on Information Theory</journal>
</article>
<article>
  <author>Claus Brabrand</author>
  <author>Jakob G Thomsen</author>
  <title>Typed and Unambiguous Pattern Matching on Strings using Regular Expressions</title>
  <year>2010</year>
  <note>Submitted</note>
</article>
...
41DBLP (Regexp)
- DBLP Regexp:
Author = "<author>" <author = [a-z]* > "</author>"
Title = "<title>" <title = [a-z]* > "</title>"
Article = "<article>" Author* Title .* "</article>"
DBLP = <pub = Article >*
- Ambiguity !
ambiguous star: <pub>*
shortest ambiguous string: "<article><title></title></article><article><title></title></article>"
- EITHER 2 publications (.* = "")
- OR 1 publication (.* = gray part) !!!
42DBLP (Disambiguated)
- DBLP Regexp:
Author = "<author>" <author = [a-z]* > "</author>"
Title = "<title>" <title = [a-z]* > "</title>"
Article = "<article>" Author* Title .* "</article>"
DBLP = <pub = Article >*
- Disambiguated (using "(R1-R2)"):
Article = "<article>" Author* Title (.* - (.* "</article>" .*)) "</article>"
- Unambiguous! :-)
43DBLP (Usage Example)
- DBLP Regexp:
Author = "<author>" <author = [a-z]* > "</author>"
Title = "<title>" <title = [a-z]* > "</title>"
Article = "<article>" Author* Title .* "</article>"
DBLP = <article = Article >*
- Usage (example):
DBLP dblp = DBLP.match(readXMLfile("DBLP.xml"));
for (Article a : dblp.article) print("Title " + a.title);
45Evaluation
- Evaluation summary
- Also, (Type-3) regexps expressive "enough"
- for URLs, Log files, DBLP, ...
(Evaluation summary table; surviving entries: MatMult, NP-Complete, Frisch & Cardelli'04)
46Type-3 vs. Type-0 (URLs)
Regexps are 8 times more concise !
47java.util.regex vs. Our approach
- Efficiency (on DBLP):
- java.util.regex:
- Exponential, O(2^n): 2,500 chars in 2 mins !
- In contrast, ours:
- Linear (on DBLP): 1,200,000 chars in 6 secs !
2 mins
10 msecs
48Related Work
- Recording (with lists in general):
- "x as R" in XDuce; "x::R" in CDuce; and "x@R" in Scala and HaRP
- Ambiguity:
- Book, Even, Greibach, Ott'71 and Hosoya'03 for XDuce, but indirectly via NFAs, not directly (syntax-directed)
- Disambiguation:
- Vansummeren'06, but with global, not local disambiguation
- Type inference:
- Exact type inference in XDuce & CDuce (soundness + completeness proof in Vansummeren'06), but not for stand-alone and non-intrusive usage (Java)
49Conclusion
- For string pattern matching, it is possible to:
"trade (excess) expressivity for safety + simplicity"
- i.e., ambiguity checking and type inference !
- stand-alone & non-intrusive language integration (Java) !
- In conclusion:
We conclude that if regular expressions are
sufficiently expressive, they provide a simple,
declarative, and safe means for pattern matching
on strings, capable of extracting highly
structural information in a statically type-safe
and unambiguous manner.
50</Talk>
http://www.cs.au.dk/gedefar/reg-exp-rec/