Title: Stack-based Algorithms for Pattern Matching on DAGs
1Stack-based Algorithms for Pattern Matching on DAGs
- Li Chen, Amarnath Gupta, M. Erdem Kurul
- San Diego Supercomputer Center (SDSC), University of California, San Diego
VLDB 2005
2Motivation
- The graph model is important in databases and knowledge representation
- Bibliographic citations, hypertext, ontology
- A lot of scientific data go beyond the XML tree model
- Many of them are directed acyclic graphs (DAGs)
- Taxonomy of proteins, chemical compounds, organisms
- Data provenance graphs
- Sequence data and multiple sequence alignments
- Searching for highly similar substructures
- Gives rise to numerous pattern matching problems
- e.g., a novel (metabolic) pathway against a pathway database
3Example & Its Abstraction
Example: graph-structured patent citation network
- Labeled DAG
- node: patent/article
- label: patent/article properties
- year, contact_author, affiliation, country, etc.
- directed edge: uniformly cited-by
- Query Model
- node: matching a certain property of data nodes
- edges: / for direct (resp. // for indirect) cited-by
4Problem Definition
- Is this a (sub)graph isomorphism problem?
Definition: Two graphs are isomorphic if there is a one-to-one correspondence between their vertices, and there is an edge between two vertices of one graph if and only if there is an edge between the two corresponding vertices in the other.
NP-hard!
- Is this a subgraph homeomorphism problem?
Definition: A homeomorphic image of a pattern graph H in a data graph G maps the nodes of H to nodes of G and the edges of H to paths in G.
Neither! Ours is easier (polynomial):
- acyclic graph model
- corresponding vertices (nodes) have the same label
5Problem Definition (cont.)
- Pattern matching on DAGs
- The input is a (virtual) single-rooted DAG G
- A total mapping from Q to G, preserving parent-child / ancestor-descendant relationships
- Branches represent AND semantics → structural join
- Return node bindings in witness structures
[Figure: DAG-structured data (nodes m1, m2, c1, c2, c4, p1, a1, b1, e1, e2) and a twig pattern query (nodes (c), (b), (e))]
6Related Work
- Exact vs. inexact graph matching [Shasha PODS02]
- Exact: a total mapping from query nodes to data nodes (usually requiring label matching)
- Inexact: either a partial mapping, or an approximated total mapping
- Trade-off between space and time
- Trade space for time
- Path index: materialize paths of fixed or parameterized length
- Transitive closure: computed for queries involving //
- Store adjacency list and compute on-the-fly
- Q: Is there a method economic in both time and space?
7Outline of The Talk
- Motivation
- Problem Definition
- Related Work
- The inspiration of our idea
- Our Approach
- Linear-space representation for DAG
- Stack-based algorithms for path, twig, and DAG queries
- Complexity analysis
- Optimization by prefiltering
- Experimental Evaluations
- Conclusions and Future Work
8Inspiration from XML Pattern Matching: Interval Encodings
Interval encoding of a tree
Implication of overlapping intervals
- y is a descendant of x if and only if the interval of y lies inside the interval of x
- i.e., y.left > x.left and y.right < x.right
[Figure: relative positions of the intervals of x and y]
- Difficulties in directly applying interval encoding to DAGs
- Each tree node has at most one parent, while a graph node may have more
- Multiple encodings may be a solution, but more comparisons are introduced, so this is likely economic in neither space nor time
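The containment test above can be illustrated with a short Python sketch. It is not taken from the talk: the function names `interval_encode` and `is_descendant` and the small example tree are assumptions made for illustration. One depth-first traversal hands out a left number on entering a node and a right number on leaving it.

```python
from typing import Dict, List, Tuple


def interval_encode(children: Dict[str, List[str]], root: str) -> Dict[str, Tuple[int, int]]:
    """Assign a (left, right) interval to every tree node in one DFS."""
    intervals: Dict[str, Tuple[int, int]] = {}
    left: Dict[str, int] = {}
    counter = 0
    stack = [(root, False)]  # (node, True once its subtree has been expanded)
    while stack:
        node, done = stack.pop()
        counter += 1
        if done:
            intervals[node] = (left[node], counter)
        else:
            left[node] = counter
            stack.append((node, True))
            # Reverse so that children are visited in their listed order.
            for child in reversed(children.get(node, [])):
                stack.append((child, False))
    return intervals


def is_descendant(y: str, x: str, intervals: Dict[str, Tuple[int, int]]) -> bool:
    """y is a proper descendant of x iff y's interval lies strictly inside x's."""
    y_left, y_right = intervals[y]
    x_left, x_right = intervals[x]
    return y_left > x_left and y_right < x_right


# Hypothetical example tree: a has children b and c, b has child d.
tree = {"a": ["b", "c"], "b": ["d"]}
enc = interval_encode(tree, "a")
print(enc)
print(is_descendant("d", "a", enc), is_descendant("d", "c", enc))
```

The test needs only two integer comparisons per pair, which is what makes the encoding attractive for trees; it does not by itself capture a node with several parents.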
9Inspiration from XML Pattern Matching: Stack-based Algorithms for Holistic Joins
- Stack-based algorithm [Bruno et al. SIGMOD02]
- Build a stream and a stack corresponding to each query node
- Nodes in streams are pushed into stacks in their document order
- Pop a node from its stack if its interval no longer overlaps the newly pushed node
- For a node pushed into a leaf stack, output all root-to-leaf paths
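To make the push/pop discipline concrete, here is a deliberately simplified Python sketch for a two-node query a//b on tree data. It is not the holistic algorithm of Bruno et al., which keeps one stack per query node and links entries across stacks; it uses a single stack for the ancestor stream only. The name `stack_join` and the sample intervals are assumptions for illustration.

```python
from typing import List, Tuple

Node = Tuple[str, int, int]  # (name, left, right)


def stack_join(ancestors: List[Node], descendants: List[Node]) -> List[Tuple[str, str]]:
    """Return all (ancestor, descendant) pairs; both inputs sorted by left value."""
    out: List[Tuple[str, str]] = []
    stack: List[Node] = []  # nested chain of still-open ancestor candidates
    i = j = 0
    while j < len(descendants):
        d = descendants[j]
        if i < len(ancestors) and ancestors[i][1] < d[1]:
            a = ancestors[i]
            # Pop candidates whose interval ended before a begins.
            while stack and stack[-1][2] < a[1]:
                stack.pop()
            stack.append(a)
            i += 1
        else:
            # Pop candidates whose interval ended before d begins.
            while stack and stack[-1][2] < d[1]:
                stack.pop()
            # Every remaining stack entry contains d.
            for a in stack:
                out.append((a[0], d[0]))
            j += 1
    return out


# Hypothetical interval-encoded tree nodes.
a_stream = [("a1", 1, 12), ("a2", 4, 9)]
b_stream = [("b1", 2, 3), ("b2", 5, 6), ("b3", 7, 8), ("b4", 10, 11)]
print(stack_join(a_stream, b_stream))
```

Each stream is scanned once and each node is pushed and popped at most once, which is the property the stack-based family relies on. The sketch assumes tree intervals (nested or disjoint), so it would miss descendants reachable only through extra DAG edges.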
10Challenges
- Are stack-based algorithms extendable to pattern matching on DAGs?
- If so, how? Is the extension economic in space and time?
12DAG Representation
- Partial order vs. transitive closure
- G = (V, E) together with a partial order on its nodes
- node partial order, i.e., every edge e = <a,b> ∈ E places b below a
- transitive closure, i.e., every path p = <x,y> ∈ P places y below x
- What do we do?
- Do not pre-compute and store the transitive closure
- Do not store an adjacency list for the partial order either
- Instead, store the interval encoding of a tree-cover, which covers part of the transitive closure
- And index the remaining linkages minimally but losslessly
13Our DAG Representation
- Decompose a DAG G into T and G_R
- T = (V, E_T) is a tree-cover (spanning tree)
- G_R = (V_R, E_R) is the remaining graph, E_R = E - E_T
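One possible way to obtain such a decomposition is a depth-first traversal in which the first edge that reaches a node becomes its tree edge and every later edge into an already visited node goes to the remaining graph. The Python sketch below follows that assumption; the slides do not say which spanning tree is chosen, and the function name `decompose` and the sample edges are illustrative only.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (parent, child)


def decompose(edges: List[Edge], root: str) -> Tuple[List[Edge], List[Edge]]:
    """Split the edges of a single-rooted DAG into tree-cover and remaining edges."""
    successors: Dict[str, List[str]] = {}
    for parent, child in edges:
        successors.setdefault(parent, []).append(child)

    tree_edges: List[Edge] = []
    remaining_edges: List[Edge] = []
    visited = {root}

    def visit(node: str) -> None:
        for child in successors.get(node, []):
            if child in visited:
                # child already has a tree parent: this edge is surplus
                remaining_edges.append((node, child))
            else:
                visited.add(child)
                tree_edges.append((node, child))
                visit(child)

    visit(root)
    return tree_edges, remaining_edges


# Hypothetical DAG: b1 has two parents, c1 and c2.
dag = [("m1", "c1"), ("m1", "c2"), ("c1", "b1"), ("c2", "b1")]
t_edges, r_edges = decompose(dag, "m1")
print(t_edges)
print(r_edges)
```

Every edge lands in exactly one of the two lists, so the two parts together still describe the whole DAG.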
14Properties of Our DAG Representation
- Lossless in inducing the transitive closure
- Building costs (a tree-cover traversal of G)
- Procedure
- encode each node w along the traversal of T
- if w has surplus preds u_i in addition to its tree parent v, add each u_i to PL(w)
- and if v is also in SSPI:
  - add v to PL(w), if v does not have surplus preds itself
  - inherit PL(v) into PL(w), otherwise
- Linear time & space
- in terms of |V| for interval encoding
- in terms of |E| for populating SSPI
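The following Python sketch builds an interval encoding and predecessor lists in the spirit of the procedure above, but with a simplified rule that is an assumption of this sketch rather than the authors' exact construction: PL(w) holds the surplus (non-tree) predecessors of w, plus the tree parent v as a surrogate whenever v itself has a non-empty list. The name `build_representation` and the sample DAG are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Edge = Tuple[str, str]  # (parent, child)


def build_representation(edges: List[Edge], root: str):
    """Return (interval, pl): tree-cover intervals and predecessor lists."""
    successors: Dict[str, List[str]] = {}
    for parent, child in edges:
        successors.setdefault(parent, []).append(child)

    interval: Dict[str, Tuple[int, int]] = {}
    tree_parent: Dict[str, Optional[str]] = {root: None}
    surplus: Dict[str, List[str]] = {}  # node -> non-tree predecessors
    counter = 0

    def visit(node: str) -> None:
        nonlocal counter
        counter += 1
        left = counter
        for child in successors.get(node, []):
            if child in tree_parent:
                surplus.setdefault(child, []).append(node)
            else:
                tree_parent[child] = node
                visit(child)
        counter += 1
        interval[node] = (left, counter)

    visit(root)

    # Second pass in pre-order, so a parent's list exists before its children's.
    pl: Dict[str, List[str]] = {}
    for w in sorted(interval, key=lambda n: interval[n][0]):
        preds = list(surplus.get(w, []))
        v = tree_parent[w]
        if v is not None and v in pl:
            preds.append(v)  # simplified surrogate rule (assumption)
        if preds:
            pl[w] = preds
    return interval, pl


# Hypothetical DAG: b1 is reachable from both c1 (tree edge) and c2 (surplus edge).
dag = [("m1", "c1"), ("m1", "c2"), ("c1", "b1"), ("c2", "b1"), ("b1", "e1")]
interval, pl = build_representation(dag, "m1")
print(interval)
print(pl)
```

Under this simplified rule each node stores its surplus predecessors plus at most one surrogate, so the lists stay linear in the number of edges and nodes.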
15Extending Stack-based Holistic Join Algorithms
- Key ideas
- Keep the data structures of streams and stacks
- Add a new structure: partial solution pools
- Put a popped node in its pool, rather than discarding it
- Grow partial solutions for the new-found
- Exploit temporal properties to avoid vain attempts
16Algorithm Extension
- SweepPartialSolutions: checking & building solutions in pools
- When?
  - A node v is popped out of its stack
- Where?
  - Between v and the nodes in each of its children pools
- What (condition)?
  - Check if each child pool has a node w, s.t. w is a descendant of v
- How?
  - Check if ∃u ∈ PL(PL(w)) s.t. u.L ≥ v.L and u.R ≤ v.R
- What (action)?
  - Expand (grow) partial solutions headed by w to be headed by v
[Figure: relative positions of v and w]
17PathStackD by Example
[Figure: PathStackD run on (a) Data G (nodes m1, m2, c1, c2, c4, p1, a1, b1, e1, e2) and (b) Query, showing stack and pool contents step by step, including the path m1 c1 b1]
18Algorithm Analysis
- The total containment checks in pools are
- Not Sm_1 x Sm_2 x ... x Sm_n times of SSPI look-ups
- Sm_i: size of the i-th stream, n: size of the path Q
- But much more tightly restricted due to temporal properties
- Not all stream nodes, but child pool nodes (to the left of v)
- Not the entire SSPI is searched for checking if w is a descendant of v
Function checkContainment(v, w)
  while (u = next(PL(w)) and !found)
    if (u.L ≥ v.L and u.R ≤ v.R) return true
    else if (u.L ≥ v.R) return false
    else if (u has no preds) remove u from PL(w)
    else
      found = checkContainment(v, u)
      if (!found) PL(w) = PL(w) ∪ PL(u) - {u}
  return found
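A runnable Python approximation of this check is sketched below. It is an interpretation, not the authors' code: it never rewrites PL(w), it uses a `seen` set instead to avoid re-checking predecessors, and it assumes the intervals come from a depth-first tree-cover so that any u starting after v ends cannot be reachable from v. The dictionaries `interval` and `pl` and the node names are hypothetical sample data; here each PL entry is assumed to be a real predecessor (a surplus parent or the tree parent acting as surrogate).

```python
from typing import Dict, List, Optional, Set, Tuple


def contains(v: str, u: str, interval: Dict[str, Tuple[int, int]]) -> bool:
    """True if u's interval lies inside v's (u == v counts as contained)."""
    return interval[v][0] <= interval[u][0] and interval[u][1] <= interval[v][1]


def check_containment(v: str, w: str,
                      interval: Dict[str, Tuple[int, int]],
                      pl: Dict[str, List[str]],
                      seen: Optional[Set[str]] = None) -> bool:
    """True if w is v itself or reachable from v via tree or surplus edges."""
    if seen is None:
        seen = set()
    if contains(v, w, interval):
        return True
    for u in pl.get(w, []):
        if u in seen:
            continue
        seen.add(u)
        # Temporal pruning (assumes DFS intervals): u starts after v ends.
        if interval[u][0] > interval[v][1]:
            continue
        if check_containment(v, u, interval, pl, seen):
            return True
    return False


# Hypothetical encoding: tree m1 -> c1 -> b1 -> e1, m1 -> c2, surplus edge c2 -> b1.
interval = {"m1": (1, 10), "c1": (2, 7), "b1": (3, 6), "e1": (4, 5), "c2": (8, 9)}
pl = {"b1": ["c2"], "e1": ["b1"]}
print(check_containment("c2", "e1", interval, pl))  # via surplus edge c2 -> b1
print(check_containment("c1", "c2", interval, pl))
```

Only the predecessor lists on the way up from w are touched, which mirrors the point that the whole index need not be searched.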
19PathStackD
- Theorem 1: Given a path query q and a DAG G, PathStackD correctly returns all the query answers for q (sound and complete).
- Theorem 2: Given a path query q and a DAG G, PathStackD has worst-case I/O and CPU time complexity O(|q|·|Sm_i|^2 + |q|·|Sm_i|·d + |E|), i.e., O(max(|E|, |q|·|Sm_i|·max(|Sm_i|, d))).
- |Sm_i|: average stream size; |q|: query size; d: diameter of G
- Optimal compared to O(|V|^2·|q|)
20Additional Changes in TwigStackD
- Key changes
- getMinSources (original) → getMissings (ours)
  - original: 1. node with minimal left value; 2. has all the required descendant types
  - ours: 1. the same; 2. record which required types are missing
- sweepPartialSolutions
  - 3. check if missing types are complemented by pool nodes
[Figure: (a) Data G (nodes m1, c1, c2, b1, e1) and (b) Query (nodes m, c, b, e), with streams Sm, Sc, Sb, Se and pools Pm, Pc, Pb, Pe]
21A Prefiltering Step
- Purpose
- Improve efficiency by reducing the I/O factor Sm_i
- Basic Idea
- Impose structural constraints of the query pattern for filtering nodes to be put in streams
e.g., for a query with nodes a, b, c, d, where each node has its own QBit:
- each downwards QBitVec captures the required downwards structural constraints
- each upwards QBitVec captures the required upwards structural constraints
node  QBitVec (downwards)  QBitVec (upwards)
a     1111                 1000
b     0011                 1010
d     0100                 1100
c     0001                 1011
22Two Passes for Prefiltering
- Downwards Filtering By Example
- Traverse the data DAG and aggregate the satisfied descendant types
- Match the satisfied with the required
Data nodes are processed in post-order
when exiting each edge directing from n to prev, do
  // myBitVec is the bitVector value for n
  myBitVec = bitOR(myBitVec, prevBitVec, QBit)
  // prev is query relevant if it matches a query label
  if (prev is query relevant and prev does not satisfy structural constraint)
  then myBitVec = bitAND(myBitVec, prevQBit)
  if (n is query relevant and bitAND(myBitVec, QBitVec) == QBitVec)
  then n satisfies structural constraint; put n into the corresponding stream
[Figure: encoded data DAG (nodes a1, a2, b1, c1, c2, d1, e1, m1) annotated with the bit vectors of satisfied constraints]
- post-order guarantees that a node is encoded before all its ancestors
- topological-order guarantees that a node is encoded before all its descendants
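The downwards pass can be sketched in Python as follows. This is one reading of the procedure, under stated assumptions: a query-relevant node contributes its own bit to its ancestors only if it satisfies its own downwards requirement, and all query edges are treated as //. The bit assignment (a=1000, d=0100, b=0010, c=0001), the query shape a(//b//c, //d), the function name `downward_filter`, and the data DAG are all illustrative choices.

```python
from typing import Dict, List

# Hypothetical query a(//b//c, //d): one bit per query node.
QBIT = {"a": 0b1000, "d": 0b0100, "b": 0b0010, "c": 0b0001}
# Required downwards bits (own bit plus all query descendants).
QBITVEC = {"a": 0b1111, "b": 0b0011, "d": 0b0100, "c": 0b0001}


def downward_filter(successors: Dict[str, List[str]],
                    label: Dict[str, str],
                    root: str) -> Dict[str, List[str]]:
    """Post-order pass; returns streams of nodes that pass the downwards filter."""
    vec: Dict[str, int] = {}            # node -> bits of satisfied types at or below it
    streams: Dict[str, List[str]] = {}

    def visit(n: str) -> int:
        if n in vec:                    # shared DAG node already encoded
            return vec[n]
        my = 0
        for child in successors.get(n, []):
            my |= visit(child)          # aggregate satisfied descendant types
        lab = label[n]
        if lab in QBIT and ((my | QBIT[lab]) & QBITVEC[lab]) == QBITVEC[lab]:
            my |= QBIT[lab]             # n satisfies its structural constraint
            streams.setdefault(lab, []).append(n)
        vec[n] = my
        return my

    visit(root)
    return streams


# Hypothetical data DAG under a virtual root r.
succ = {"r": ["a1", "a2"], "a1": ["b1", "d1"], "b1": ["c1"],
        "a2": ["e1", "b2"], "e1": ["c1"]}
lab = {"r": "root", "a1": "a", "a2": "a", "b1": "b", "b2": "b",
       "c1": "c", "d1": "d", "e1": "e"}
print(downward_filter(succ, lab, "r"))
```

In this example a2 and b2 are kept out of the streams because the descendant types they need never show up below them, which is the I/O reduction the prefiltering step aims for. The filter is a necessary condition only; surviving nodes still go through the join.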
23Summary of Our Approach
- The key ideas
- Our DAG representation losslessly covers the whole transitive closure
- Interval encoding on tree-cover T covers part of the transitive closure
- SSPI and tree-cover encoding together cover the complete transitive closure
- Worst-case space is O(|V| + |E|), compared to O(|V|^2) if pre-computing and storing the whole transitive closure
- The StackD algorithms leverage tradeoffs between space and time
- Adopt a new structure, i.e., partial solution pools, in addition to streams and stacks
- Modify/add procedures to handle stack-popped nodes in pools, where remaining solutions can be found
- Worst-case time is O(max(|E|, |Sm_i|^2)), compared to O(|V|^2) if no path index is utilized
- Prefiltering further optimizes performance by reducing Sm_i
25Experimental Evaluations
- System implementation
- Java 1.4
- Light-weight storage engine: PSEPro from ObjectStore
- Utilize its VMMA for memory/disk data structure mapping
- Experimental setups
- Tunable synthetic DAG data generator
- Parameters: diameter, fan-out, fan-in, number of distinct labels
- Real-life data
- Gene ontology data, tree data from XMark benchmark augmented by random cross links
- 2.6 GHz Pentium IV PC, 1 GB MM, 2 GB VM
26Experiment 1
[Figure: processing time (ms) for the queries PQ, TQ, and DQ, one chart per data size: n=25K, m=45K; n=50K, m=90K; n=100K, m=180K; n=200K, m=360K; n=400K, m=720K (n = |V|, m = |E|). The query patterns use node labels a to f.]
Compare processing time (including prefiltering and query execution) of StackD against Nav [Kanza PODS03]
27Experiment 2
[Figure: processing time (ms) against size (K) for n=5K and m = 5K, 10K, 20K, 30K, 40K, 50K, 60K, 70K]
Evaluate the performance of both algorithms with changing characteristics (density) of the DAG
28Experiment 3
[Figure: processing time (ms) for query patterns of increasing size, using node labels a to i]
Evaluate the performance of both algorithms with changing characteristics (size) of the query
29Experiment 4
[Figure: processing time (ms)]
Evaluate the performance of PathStack-D with or without the aid of the prefiltering step
30Experiment 5
[Figure: time breakdown (BuildSSPI, TSDFilter, TSDExec, Scan, Result) of StackD versus NavAlgo for data sizes 25K, 50K, 100K, 200K, 400K, and for the queries PQ and TQ]
100MB XML document (1.4M nodes and 1.6M edges)
PQ: //site//person//age
TQ: //site(//item//description, //category//name, //person//age)
31Conclusions and Future work
- Conclusions
- Gracefully generalized the stack-based algorithms for pattern matching on DAGs
- The extended algorithms are sound and complete
- The proposed approach is optimal among those that do not rely on precomputed transitive closure
- Future Work
- Further improvement by incorporating statistics on a graph structure and/or advanced indexing schemes
- Allow for more general graph operations, which give rise to more challenging query optimizations
32Questions?