Title: Stack-based Algorithms for Pattern Matching on DAGs
 1Stack-based Algorithms for Pattern Matching on 
DAGs
- Li Chen, Amarnath Gupta, M. Erdem Kurul 
San Diego Supercomputer Center (SDSC), University 
of California, San Diego
VLDB05 
 2Motivation
- The graph model is important in databases and knowledge representation
- Bibliographic citations, hypertext, ontology 
- A lot of scientific data goes beyond the XML tree model
- Much of it forms directed acyclic graphs (DAGs) 
- Taxonomy of proteins, chemical compounds, organisms
- Data provenance graphs 
- Sequence data and multiple sequence alignments 
- Searching for highly similar substructures 
- Gives rise to numerous pattern matching problems 
- e.g., matching a novel (metabolic) pathway against a pathway database
3Example & Its Abstraction
Graph-structured patent citation network
- Labeled DAG 
- node: patent/article 
- label: patent/article properties 
- year, contact_author, affiliation, country, etc. 
- directed edge: uniformly "cited-by"
- Query Model 
- node: matches a certain property of data nodes 
- edges: / for direct (resp. // for indirect) cited-by
4Problem Definition
- Is this a (sub)graph isomorphism problem?
Definition: Two graphs are isomorphic if there is a one-to-one correspondence between their vertices, and there is an edge between two vertices of one graph if and only if there is an edge between the two corresponding vertices in the other.
NP-hard!
- Is this a subgraph homeomorphism problem?
Definition: A homeomorphic image of a pattern graph H in a data graph G maps the nodes of H to nodes of G and the edges of H to paths in G.
Neither!! Ours is easier (polynomial):
- acyclic graph model 
- corresponding vertices (nodes) have the same label
5Problem Definition (cont.)
- Pattern matching on DAGs 
- The input is a (virtual) single rooted DAG G 
- A total mapping from Q to G, preserving 
 parent-child / ancestor-descendant relationships
- Branches represent "and" semantics → structural join
- Return node bindings in witness structures
[Figure: a twig pattern query with nodes (c), (b), (e), shown next to DAG-structured data with nodes m1, m2, c1, c2, c4, p1, a1, b1, e1, e2]
 6Related Work
- Exact vs. inexact graph matching [Shasha PODS02]
- Exact: a total mapping from query nodes to data nodes (usually requiring label matching)
- Inexact: either a partial mapping or an approximate total mapping
- Trade-off between space and time 
- Trade space for time 
- Path index: materialize paths of fixed or parameterized length
- Transitive closure: computed for queries involving //
- Store adjacency list and compute on-the-fly 
- Q: Is there a method that is economical in both time and space?
 7Outline of The Talk
- Motivation 
- Problem Definition 
- Related Work 
- The inspiration of our idea 
- Our Approach 
- Linear-space representation for DAG 
- Stack-based algorithms for path, twig, and dag 
 queries
- Complexity analysis 
- Optimization by prefiltering 
- Experimental Evaluations 
- Conclusions and Future Work
8Inspiration from XML Pattern Matching Interval 
Encodings
Interval encoding of a tree
Implication of overlapping intervals:
- y is a descendant of x if and only if 
- y.left > x.left and y.right < x.right
[Figure: intervals of nodes x and y]
- Difficulties in directly applying interval encoding to a DAG
- Each tree node has at most one parent, while a graph node may have more
- Multiple encodings per node may be a solution, but they introduce more comparisons, so this is likely economical in neither space nor time
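The interval encoding and the containment test above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function names `encode_tree` and `is_descendant`, the counter-based numbering, and the four-node example tree are all assumptions made for the example.

```python
def encode_tree(children, root):
    """Assign a (left, right) interval to every node in one depth-first pass."""
    interval = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        left = counter              # number assigned when entering the node
        for child in children.get(node, []):
            visit(child)
        counter += 1
        interval[node] = (left, counter)  # number assigned when leaving it

    visit(root)
    return interval


def is_descendant(interval, y, x):
    """y is a descendant of x iff y's interval lies strictly inside x's."""
    y_left, y_right = interval[y]
    x_left, x_right = interval[x]
    return y_left > x_left and y_right < x_right


# Hypothetical tree: a has children b and c; b has child d.
tree = {"a": ["b", "c"], "b": ["d"]}
interval = encode_tree(tree, "a")
```

With this numbering every ancestor-descendant test is two integer comparisons, which is what the stack-based join algorithms rely on. The test breaks down as soon as a node has two parents, since one interval cannot be nested inside two unrelated intervals.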
9Inspiration from XML Pattern Matching 
Stack-based Algorithms for Holistic Joins
- Stack-based algorithm [Bruno et al. SIGMOD02]
- Build a stream and a stack corresponding to each 
 query node
- Nodes in streams are pushed into stacks in their 
 document order
- Pop a node from its stack if its interval no 
 longer overlaps the newly pushed node
- For a node pushed into a leaf stack, output all 
 root-to-leaf paths
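The four steps above can be sketched for the simplest case: tree-structured data and a path query whose edges are all ancestor-descendant (//). This is a simplified illustration under stated assumptions, not the published algorithm verbatim: the function name `path_stack`, the tuple layout, and the sample intervals are hypothetical, query labels are assumed distinct, and the loop runs until every stream is exhausted instead of stopping early.

```python
def path_stack(streams):
    """streams[i] holds (left, right, name) tuples for query node i,
    sorted by left; query node 0 is the root, the last one is the leaf."""
    n = len(streams)
    pos = [0] * n                      # read position in each stream
    stacks = [[] for _ in range(n)]    # entries: (left, right, name, ptr)
    results = []

    def expand(i, idx):
        # Enumerate all root-to-node bindings ending at stacks[i][idx].
        _, _, name, ptr = stacks[i][idx]
        if i == 0:
            return [[name]]
        out = []
        for j in range(ptr + 1):       # every parent-stack entry up to ptr
            for prefix in expand(i - 1, j):
                out.append(prefix + [name])
        return out

    while True:
        live = [i for i in range(n) if pos[i] < len(streams[i])]
        if not live:
            break
        # Next node in document order across all streams.
        q = min(live, key=lambda i: streams[i][pos[i]][0])
        left, right, name = streams[q][pos[q]]
        pos[q] += 1
        # Pop entries whose interval ends before the new node starts.
        for stack in stacks:
            while stack and stack[-1][1] < left:
                stack.pop()
        # Remember the current top of the parent stack.
        ptr = len(stacks[q - 1]) - 1 if q > 0 else -1
        stacks[q].append((left, right, name, ptr))
        if q == n - 1:                 # leaf: output solutions, then pop
            results.extend(expand(q, len(stacks[q]) - 1))
            stacks[q].pop()
    return results


# Hypothetical tree data for the query //m//c//b.
streams = [
    [(1, 12, "m1")],
    [(2, 9, "c1"), (3, 6, "c2")],
    [(4, 5, "b1"), (7, 8, "b2"), (10, 11, "b3")],
]
results = path_stack(streams)
```

Note that popped entries are simply discarded here. That is safe on trees, because a node whose interval has ended can have no further descendants, and it is exactly the assumption that fails on DAGs.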
10Challenges
- Can stack-based algorithms be extended to pattern matching on DAGs?
- If so, how? Is the result economical in space and time?
12DAG Representation
- Partial order vs. transitive closure 
- $G = (V, E, \prec)$
- node partial order $\prec$, i.e., $\forall e = \langle a, b \rangle \in E \Rightarrow b \prec a$
- transitive closure $\prec^{*}$, i.e., $\forall p = \langle x, y \rangle \in P \Rightarrow y \prec^{*} x$
- What do we do? 
- Not pre-compute and store $\prec^{*}$
- Nor store an adjacency list for $\prec$
- Instead, store the interval encoding of a tree-cover, covering part of $\prec^{*}$
- And index the remaining linkages minimally but losslessly
 13Our DAG Representation
- Decompose a DAG G into T and G_R 
- T = (V, E_T) is a tree-cover (spanning tree) 
- G_R = (V_R, E_R) is the remaining graph, E_R = E - E_T
14Properties of Our DAG Representation
- Lossless in inducing the transitive closure
- Building cost: a tree-cover traversal of G
- Procedure 
- encode each node w along the traversal of T 
- if w has surplus preds u_i in addition to its tree parent v, add each u_i to PL(w)
- if v is also in SSPI:
- add v to PL(w), if v does not have surplus preds itself
- inherit PL(v) into PL(w), otherwise 
- Linear time & space 
- in terms of |V| for interval encoding 
- in terms of |E| for populating SSPI
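The decomposition into a tree-cover plus surplus predecessors can be sketched as a single depth-first traversal. This is a simplified illustration, not the talk's exact SSPI construction: it records only the surplus predecessors of each node and leaves out the surrogate rule (adding v or inheriting PL(v)). The function name `decompose` and the five-edge example DAG are assumptions.

```python
def decompose(edges, root):
    """Split a single-rooted DAG into a tree-cover (parent map + intervals)
    and a surplus predecessor list for nodes reached by non-tree edges."""
    children = {}
    for u, w in edges:
        children.setdefault(u, []).append(w)

    parent = {root: None}   # tree-cover edges E_T, stored child -> parent
    interval = {}           # interval encoding over the tree-cover
    surplus = {}            # node -> predecessors via edges in E - E_T
    counter = 0

    def visit(v):
        nonlocal counter
        counter += 1
        left = counter
        for w in children.get(v, []):
            if w not in parent:          # first time reached: tree edge
                parent[w] = v
                visit(w)
            else:                        # already covered: surplus edge
                surplus.setdefault(w, []).append(v)
        counter += 1
        interval[v] = (left, counter)

    visit(root)
    return parent, interval, surplus


# Hypothetical DAG: c has two parents, a and b.
edges = [("r", "a"), ("r", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
parent, interval, surplus = decompose(edges, "r")
```

Every edge lands in exactly one of the two structures, so the representation stays linear in the size of the graph: one interval per node and one list entry per non-tree edge.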
15Extending Stack-based Holistic Join Algorithms
- Key ideas 
- Keep the data structures of streams and stacks 
- Add a new structure: partial solution pools 
- Put a popped node in its pool, rather than 
 discard it
- Grow partial solutions for the new-found 
- Exploit temporal properties to avoid vain attempts
 16Algorithm Extension
- SweepPartialSolutions: checking & building solutions in pools
- When? 
- A node v is popped out of its stack 
- Where? 
- Between v and the nodes in each of its children pools
- What (condition)? 
- Check if each child pool has a node w, s.t. w is a descendant of v
- How? 
- Check if $\exists u \in PL(PL(w))$ s.t. $u.L \ge v.L$ and $u.R \le v.R$
- What (action)? 
- Expand (grow) partial solutions headed by w to be headed by v
[Figure: relative positions of v and w]
 17PathStackD by Example
[Figure: step-by-step PathStackD example. (a) Data G with nodes m1, m2, c1, c2, c4, p1, a1, b1, e1, e2; (b) Query. The panels show stack and pool contents over the m, c, and b nodes, including the binding m1 c1 b1.]
 18Algorithm Analysis
- The total containment checks in pools are 
- Not |Sm_1| × |Sm_2| × ... × |Sm_n| SSPI look-ups
- |Sm_i|: size of the i-th stream, n: size of the path Q
- But much more tightly restricted due to temporal properties
- Not all stream nodes, but only child pool nodes (to the left of v)
- Not the entire SSPI is searched when checking whether w is a descendant of v
Function checkContainment(v, w)
  found = false
  while (u = next(PL(w)) and !found)
    if (u.L >= v.L and u.R <= v.R) return true
    else if (u.L > v.R) return false
    else if (u has no preds) remove u from PL(w)
    else
      found = checkContainment(v, u)
      if (!found) PL(w) = PL(w) ∪ PL(u) - {u}
  return found
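The idea behind this check, testing interval containment first and falling back to predecessor lists only when it fails, can be sketched as follows. This is a simplified variant written for illustration, not the function above: it walks up tree parents instead of using inherited lists, it omits the early exit and the list-rewriting step, and the name `reaches` plus the literal `interval`, `parent`, and `surplus` tables are assumptions (they encode a hypothetical DAG with edges r→a, r→b, a→c, b→c, c→d).

```python
# Tree-cover encoding and surplus predecessors of the hypothetical DAG.
interval = {"r": (1, 10), "a": (2, 7), "c": (3, 6), "d": (4, 5), "b": (8, 9)}
parent = {"r": None, "a": "r", "c": "a", "d": "c", "b": "r"}
surplus = {"c": ["b"]}          # b -> c is the only non-tree edge


def contains(v, w):
    """True if w's interval lies inside v's (w equal to v counts)."""
    return interval[v][0] <= interval[w][0] and interval[w][1] <= interval[v][1]


def reaches(v, w):
    """True if w is v itself or a descendant of v in the DAG."""
    if contains(v, w):          # reachable through tree edges alone
        return True
    x = w
    while x is not None:        # w and each of its tree ancestors
        for u in surplus.get(x, []):
            # Some path to w enters the tree-cover through edge u -> x.
            if reaches(v, u):
                return True
        x = parent[x]
    return False
```

The recursion terminates because each recursive call moves to a strict ancestor in an acyclic graph. The version in the talk additionally prunes and rewrites PL(w) as it goes so that repeated checks touch less of the index.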
 19PathStackD
- Theorem 1: Given a path query q and a DAG G, PathStackD correctly returns all the query answers for q, i.e., it is sound and complete.
- Theorem 2: Given a path query q and a DAG G, PathStackD has worst-case I/O and CPU time complexity $O(|q| \cdot |Sm_i|^2 + |q| \cdot |Sm_i| \cdot d + |E|)$, i.e., $\max(|E|,\ |q| \cdot |Sm_i| \cdot \max(|Sm_i|, d))$.
- $|Sm_i|$: average stream size; $|q|$: query size; $d$: diameter of G
- Optimal compared to $O(|V|^2 \cdot |q|)$
 20Additional Changes in TwigStackD
- Key changes 
- getMinSources (original) → getMissings (ours) 
- sweepPartialSolutions
getMinSources (original):
1. node with minimal left value
2. has all the required descendant types
getMissings (ours):
1. the same
2. record which required types are missing
3. check if missing types are complemented by pool nodes
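[Figure: TwigStackD example. (a) Data G with nodes m1, c1, c2, b1, e1; (b) Query with nodes m, c, b, e, each with a stream (Sm, Sc, Sb, Se) and a pool (Pm, Pc, Pb, Pe)]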
 21A Prefiltering Step
- Purpose 
- Improve efficiency by reducing the I/O factor |Sm_i|
- Basic Idea 
- Impose structural constraints of the query pattern for filtering nodes to be put in streams
e.g., for a query with nodes a, b, d, c, each query node has two QBitVecs: one captures the required downwards structural constraints, the other the required upwards structural constraints.

Query node   QBitVec (downwards)   QBitVec (upwards)
a            1111                  1000
b            0011                  1010
d            0100                  1100
c            0001                  1011
 22Two Passes for Prefiltering
- Downwards Filtering By Example 
- Traverse data DAG and aggregate the satisfied 
 descendant types
- Match the satisfied with the required
Data nodes are processed in post-order.
When exiting each edge directing from n to prev, do
  // myBitVec is the bit vector value for n
  myBitVec = bitOR(myBitVec, prevBitVec, QBit)
  // prev is query relevant if it matches a query label
  if (prev is query relevant and prev does not satisfy the structural constraint)
    then myBitVec = bitAND(myBitVec, ¬prevQBit)
  if (n is query relevant and bitAND(myBitVec, QBitVec) == QBitVec)
    then n satisfies the structural constraint; put n into the corresponding stream
[Figure: encoded data DAG (nodes a1, a2, b1, c1, c2, d1, e1, m1) annotated with the bit vectors of satisfied constraints]
- post-order: guarantees that a node is encoded before all its ancestors
- topological order: guarantees that a node is encoded before all its descendants
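The downwards pass can be sketched with plain integers as bit vectors. This is an illustrative sketch under stated assumptions, not the talk's implementation: the query is taken to be a with children b and d, and c below b; the bit assignment (a=1000, d=0100, b=0010, c=0001) is inferred from the QBitVec values; the data DAG, its edges, and the name `downward_filter` are hypothetical. One deliberate difference from the pseudocode above: the bit of an unsatisfied child is masked out of that child's contribution before the OR, so the result does not depend on the order in which children are visited.

```python
QBIT = {"a": 0b1000, "d": 0b0100, "b": 0b0010, "c": 0b0001}
# Required downward constraints: own bit plus bits of all query descendants.
QBITVEC = {"a": 0b1111, "d": 0b0100, "b": 0b0011, "c": 0b0001}


def downward_filter(children, label):
    """Return (vec, ok): aggregated bit vector per data node and whether
    the node satisfies the downward constraint of its query label."""
    vec, ok = {}, {}

    def visit(n):
        if n in vec:                      # already encoded via another parent
            return
        my = QBIT.get(label[n], 0)
        for prev in children.get(n, []):
            visit(prev)                   # post-order: children first
            contrib = vec[prev]
            if label[prev] in QBIT and not ok[prev]:
                contrib &= ~QBIT[label[prev]]   # drop unsatisfied child's bit
            my |= contrib
        vec[n] = my
        need = QBITVEC.get(label[n])
        ok[n] = need is not None and (my & need) == need

    for n in label:
        visit(n)
    return vec, ok


# Hypothetical data DAG; b2 has no c below it, so a2 lacks a valid b.
label = {"a1": "a", "a2": "a", "b1": "b", "b2": "b",
         "c1": "c", "c2": "c", "d1": "d", "e1": "e"}
children = {"a1": ["e1", "d1", "b1"], "a2": ["c2", "b2", "d1"],
            "b1": ["c1"], "e1": ["c2"]}
vec, ok = downward_filter(children, label)

# Only satisfying nodes are placed in the stream of their query label.
streams = {}
for n in sorted(label):
    if ok[n]:
        streams.setdefault(label[n], []).append(n)
```

In this example a2 is filtered out even though it has a c and a b below it, because its only b has no c descendant. The upwards pass would be the mirror image, run in topological order with the upwards QBitVecs.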
 23Summary of Our Approach
- The key ideas 
- Our DAG representation losslessly covers the entire transitive closure
- Interval encoding on tree-cover T covers part of the closure
- SSPI and tree-cover encoding together cover the complete closure
- Worst-case space is $O(|V| + |E|)$, compared to $O(|V|^2)$ if the entire transitive closure is pre-computed and stored
- *StackD algorithms leverage trade-offs between space and time
- Adopt a new structure, i.e., partial solution pools, in addition to streams and stacks
- Modify/add procedures to handle stack-popped nodes in pools, where remaining solutions can be found
- Worst-case time is $O(\max(|E|, |Sm_i|^2))$, compared to $O(|V|^2)$ if no path index is utilized
- Prefiltering further optimizes performance by reducing $|Sm_i|$
25Experimental Evaluations
- System implementation 
- Java 1.4 
- Light-weight storage engine: PSEPro from ObjectStore
- Utilize its VMMA for memory ↔ disk data structure mapping
- Experimental setups 
- Tunable synthetic DAG data generator 
- Parameters: diameter, fan-out, fan-in, number of distinct labels
- Real-life data 
- Gene ontology data, tree data from XMark benchmark augmented by random cross links
- 2.6 GHz Pentium IV PC, 1 GB MM, 2 GB VM 
26Experiment 1 
[Figure: processing time (ms) for path (PQ), twig (TQ), and DAG (DQ) queries on five synthetic data sets: n=25K, m=45K; n=50K, m=90K; n=100K, m=180K; n=200K, m=360K; n=400K, m=720K, where n = |V| and m = |E|. The query panels show PQ, TQ, and DQ over nodes labeled a to f.]
Compare processing time (including prefiltering and query execution) of StackD against Nav [Kanza PODS03]
 27Experiment 2
[Figure: processing time (ms) versus number of edges (K); n=5K, m=5K, 10K, 20K, 30K, 40K, 50K, 60K, 70K]
Evaluate the performance of both algorithms with changing characteristics (density) of the DAG
 28Experiment 3
[Figure: processing time (ms) for a series of queries of increasing size over nodes labeled a to i]
Evaluate the performance of both algorithms with changing characteristics (size) of the query
 29Experiment 4
[Figure: processing time (ms)]
Evaluate the performance of PathStackD with or without the aid of the prefiltering step
 30Experiment 5
[Figure: time breakdown for data sizes 25K, 50K, 100K, 200K, 400K; legend: BuildSSPI, TSDFilter, TSDExec, Scan, Result, NavAlgo]
100MB XML document (1.4M nodes and 1.6M edges)
PQ: //site//person//age
TQ: //site(//item//description, //category//name, //person//age)
[Figure: StackD (Scan, Result) versus NavAlgo for PQ and TQ]
 31Conclusions and Future work
- Conclusions 
- Gracefully generalized the stack-based algorithms 
 for pattern matching on DAGs
- The extended algorithms are sound and complete 
- The proposed approach is optimal among those that 
 do not rely on precomputed transitive closure
- Future Work 
- Further improvement by incorporating statistics 
 on a graph structure and/or advanced indexing
 schemes
- Allow for more general graph operations, which give rise to more challenging query optimizations
32Questions?