Title: Stack-based Algorithms for Pattern Matching on DAGs

1
Stack-based Algorithms for Pattern Matching on
DAGs

Li Chen, Amarnath Gupta, M. Erdem Kurul

San Diego Supercomputer Center (SDSC), University
of California, San Diego
VLDB 2005
2
Motivation

Graph model is important in databases and
knowledge representation
Bibliographic citations, hypertext, ontology
Much scientific data goes beyond the XML tree
model
Many of them are directed acyclic graphs (DAGs)
Taxonomy of proteins, chemical compounds,
organisms
Data provenance graphs
Sequence data and multiple sequence alignments
Searching for highly similar substructures
Gives rise to numerous pattern matching problems
e.g., a novel (metabolic) pathway against a
pathway database

3
Example & Its Abstraction
Graph-structured patent citation network

Labeled DAG
node: patent/article
label: patent/article properties
year, contact_author, affiliation, country, etc.
directed edge: uniformly cited-by

Query Model
node: matching a certain property of data nodes
edges: / for direct (resp. // for indirect)
cited-by

4
Problem Definition

Is this a (sub)graph isomorphism problem?

Definition: Two graphs are isomorphic if there is
a one-to-one correspondence between their
vertices, and there is an edge between two
vertices of one graph if and only if there is an
edge between the two corresponding vertices in
the other.
Subgraph isomorphism is NP-hard!

Is this a subgraph homeomorphism problem?

Definition: In the homeomorphic image of a pattern
graph H in a data graph G, the images of nodes
in H are nodes of G, and the images of edges in H
are paths in G.
Neither! Ours is easier (polynomial):

acyclic graph model
corresponding vertices (nodes) have the same
label

5
Problem Definition (cont.)

Pattern matching on DAGs
The input is a (virtual) single rooted DAG G
A total mapping from Q to G, preserving
parent-child / ancestor-descendant relationships
Branches represent AND semantics → structural
join
Return node bindings in witness structures

[Figure: DAG-structured data (nodes m1, m2, c1, c2, c4, p1, a1, b1, e1, e2) and a twig pattern query with node labels (c), (b), (e)]
6
Related Work

Exact vs. inexact graph matching [Shasha, PODS'02]
Exact: a total mapping from query nodes to data
nodes (usually requiring label matching)
Inexact: either a partial mapping or an
approximate total mapping
Trade-off between space and time
Trade space for time
Path index: materialize paths of fixed or
parameterized length
Transitive closure: computed for queries
involving //
Store the adjacency list and compute on the fly
Q: Is there a method economical in both time and
space?
7
Outline of The Talk

Motivation
Problem Definition
Related Work
The inspiration of our idea
Our Approach
Linear-space representation for DAG
Stack-based algorithms for path, twig, and DAG
queries
Complexity analysis
Optimization by prefiltering
Experimental Evaluations
Conclusions and Future Work

8
Inspiration from XML Pattern Matching: Interval
Encodings
interval encoding of a tree
Implication of overlapping intervals

y is a descendant of x if and only if
y.left > x.left and y.right < x.right
[Figure: possible relative positions of the intervals of x and y]

Difficulties in directly applying interval
encoding to DAG
Each tree node has at most one parent, while a
graph node may have more
Multiple encodings may be a solution, but they
introduce more comparisons, so they are likely
economical in neither space nor time
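
The interval test above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `interval_encode` and `is_descendant`, the dict-of-children tree layout, and the use of a single counter for left and right values are all assumptions made for the example.

```python
def interval_encode(tree, root):
    """Assign a (left, right) interval to every node of a tree.

    tree: dict mapping a node to its list of children (assumed layout).
    """
    interval = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        left = counter              # assigned when the node is entered
        for child in tree.get(node, []):
            visit(child)
        counter += 1                # assigned when the node is left
        interval[node] = (left, counter)

    visit(root)
    return interval


def is_descendant(y, x, interval):
    """True if y is a proper descendant of x: y's interval nests inside x's."""
    (xl, xr), (yl, yr) = interval[x], interval[y]
    return yl > xl and yr < xr


# Hypothetical tree: m1 has children c1 and c4, and c1 has child b1.
tree = {"m1": ["c1", "c4"], "c1": ["b1"]}
enc = interval_encode(tree, "m1")
print(is_descendant("b1", "m1", enc), is_descendant("c4", "c1", enc))
```

With one parent per node a single interval is enough. A DAG node with several parents would need one interval per incoming path, which is the difficulty the slide points out.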

9
Inspiration from XML Pattern Matching:
Stack-based Algorithms for Holistic Joins

Stack-based Algorithm [Bruno et al., SIGMOD'02]
Build a stream and a stack corresponding to each
query node
Nodes in streams are pushed into stacks in their
document order
Pop a node from its stack if its interval no
longer overlaps the newly pushed node
For a node pushed into a leaf stack, output all
root-to-leaf paths
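
The four steps above can be sketched for the simplest case. The code below is a simplified illustration under stated assumptions, not the published algorithm: it handles only a linear path query whose edges are all //, assumes the query labels are distinct, and assumes tree-shaped data already carrying (left, right) intervals. The function name `path_stack` and the tuple layout are hypothetical.

```python
def path_stack(streams):
    """Match a path query q0//q1//...//qn-1 over interval-encoded tree nodes.

    streams[i]: list of (left, right, name) for query node i, sorted by left.
    Returns a list of tuples (name0, ..., name_n-1), one per match.
    """
    n = len(streams)
    pos = [0] * n                       # cursor into each stream
    stacks = [[] for _ in range(n)]     # entries: (node, index of top of parent stack)
    results = []

    def expand(level, idx, suffix):
        # Walk parent pointers to enumerate every root-to-leaf combination.
        node, ptr = stacks[level][idx]
        path = [node[2]] + suffix
        if level == 0:
            results.append(tuple(path))
            return
        for j in range(ptr + 1):
            expand(level - 1, j, path)

    while pos[n - 1] < len(streams[n - 1]):
        # Pick the query node whose next stream element comes first in document order.
        i = min((k for k in range(n) if pos[k] < len(streams[k])),
                key=lambda k: streams[k][pos[k]][0])
        head = streams[i][pos[i]]
        pos[i] += 1
        # Pop every stacked node whose interval ends before the new node starts.
        for stack in stacks:
            while stack and stack[-1][0][1] < head[0]:
                stack.pop()
        # Push only if an ancestor candidate exists in the parent stack.
        if i == 0 or stacks[i - 1]:
            stacks[i].append((head, len(stacks[i - 1]) - 1 if i else -1))
            if i == n - 1:              # leaf stack: output, then discard the leaf
                expand(i, len(stacks[i]) - 1, [])
                stacks[i].pop()
    return results
```

Note that a popped node is simply discarded here. That is safe on trees, and it is exactly the step that has to change once a node can have more than one parent.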

10
Challenges

Are stack-based algorithms extendable to
pattern matching on DAGs?
If so, how? Is the result economical in space and
time?

11
Outline of The Talk

Motivation
Problem Definition
Related Work
The inspiration of our idea
Our Approach
Linear-space representation for DAG
Stack-based algorithms for path, twig, and DAG
queries
Complexity analysis
Optimization by prefiltering
Experimental Evaluations
Conclusions and Future Work

12
DAG Representation

Partial order vs. transitive closure
G = (V, E, ≺)
node partial order ≺, i.e., ∀ e = <a, b> ∈ E ⇒ b ≺ a
transitive closure ≺*, i.e., ∀ p = <x, y> ∈ P ⇒ y ≺* x
What do we do?
Do not pre-compute and store ≺*
Do not store the adjacency list for ≺ either
Instead, store the interval encoding of a tree-cover,
covering part of ≺*
And index the remaining linkages minimally but
losslessly
13
Our DAG Representation

Decompose a DAG G into T and GR
T = (V, ET) is a tree-cover (spanning tree)
GR = (VR, ER) is the remaining graph, ER = E - ET

14
Properties of Our DAG Representation

Lossless in inducing the transitive closure
Building costs: a tree-cover traversal of G
Procedure
encode each node w along the traversal of T
if w has surplus preds ui in addition to its tree
parent v, add ui to PL(w)
and if v is also in SSPI,
add v to PL(w), if v does not have surplus preds
itself
inherit PL(v) into PL(w), otherwise
Linear time & space
in terms of V for interval encoding
in terms of E for populating SSPI
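
The decomposition can be illustrated with a deliberately simplified sketch. It is not the SSPI construction from the talk: instead of inheriting predecessor lists, it keeps a tree-parent map and walks up the tree cover at query time. The names `build_tree_cover` and `reaches` and the dict-based layout are assumptions made for the example.

```python
def build_tree_cover(dag, root):
    """One DFS over a single-rooted DAG.

    dag: dict mapping a node to its list of children (assumed layout).
    Returns (interval, parent, surplus): tree-cover intervals, tree parents,
    and for each node the predecessors not covered by the tree.
    """
    interval, parent, surplus = {}, {root: None}, {}
    counter = 0

    def visit(v):
        nonlocal counter
        counter += 1
        left = counter
        for w in dag.get(v, []):
            if w not in parent:          # first arrival: tree edge (v, w)
                parent[w] = v
                visit(w)
            else:                        # already covered: v is a surplus predecessor
                surplus.setdefault(w, []).append(v)
        counter += 1
        interval[v] = (left, counter)

    visit(root)
    return interval, parent, surplus


def reaches(v, w, interval, parent, surplus):
    """True if w is a proper descendant of v in the DAG."""
    (vl, vr), (wl, wr) = interval[v], interval[w]
    if vl < wl and wr < vr:              # covered by the tree encoding
        return True
    x = w
    while x is not None:                 # w and its tree ancestors
        for u in surplus.get(x, []):
            if u == v or reaches(v, u, interval, parent, surplus):
                return True
        x = parent[x]
    return False
```

Space is one interval per node plus one entry per non-tree edge, which matches the linear bound stated on the slide. The lookup itself is less refined than the inherited predecessor lists described there.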

15
Extending Stack-based Holistic Join Algorithms

Key ideas
Keep the data structures of streams and stacks
Add a new structure: partial solution pools
Put a popped node in its pool, rather than
discard it
Grow partial solutions for the new-found
Exploit temporal properties to avoid vain attempts

16
Algorithm Extension

SweepPartialSolutions: checking & building
solutions in pools
When?
A node v is popped out of its stack
Where?
Between v and the nodes in each of its children's
pools
What (condition)?
Check if each child pool has a node w s.t. w is a
descendant of v
How?
Check if ∃u ∈ PL(PL(w)) s.t. u.L ≥ v.L and u.R ≤
v.R
What (action)?
Expand/grow partial solutions headed by w to be
headed by v
[Figure: relative positions of v and w]
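
The condition and action of the sweep can be sketched as follows. This is an illustrative simplification, not the procedure from the talk: the function name `sweep_partial_solutions`, the pool layout (a dict from head node to its partial solutions), and the `is_descendant` callback are assumptions, and none of the temporal pruning is modelled.

```python
from itertools import product


def sweep_partial_solutions(v, child_pools, is_descendant):
    """Grow partial solutions when node v is popped from its stack.

    child_pools: one pool per query child of v's query node; each pool maps a
        head node w to the list of partial solutions (tuples) headed by w.
    is_descendant(w, v): containment check, True if w is a descendant of v.
    Returns the partial solutions headed by v (empty if a child pool has no match).
    """
    per_child = []
    for pool in child_pools:
        # Condition: this child pool must contain some w below v.
        matches = [sol for w, sols in pool.items()
                   if is_descendant(w, v) for sol in sols]
        if not matches:
            return []
        per_child.append(matches)
    # Action: combine one solution per child and put v at the head.
    return [(v,) + tuple(node for sol in combo for node in sol)
            for combo in product(*per_child)]


# Hypothetical data: b1 is reachable from c1 but not from c2.
below = {("b1", "c1")}
pool_b = {"b1": [("b1",)], "b9": [("b9",)]}
print(sweep_partial_solutions("c1", [pool_b], lambda w, v: (w, v) in below))
```

A node with no query children yields the single solution `(v,)`, so leaves seed the pools and inner nodes extend them.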
17
PathStackD by Example
[Figure: PathStackD by example. (a) Data G with nodes m1, m2, c1, c2, c4, p1, a1, b1, e1, e2; (b) Query; successive contents of the stacks and pools]
18
Algorithm Analysis

The total containment checks (is w a descendant of v?) in pools are
Not |Sm1| × |Sm2| × … × |Smn| times of SSPI
look-ups
|Smi|: size of the ith stream; n: size of the
path Q
But much more tightly restricted due to temporal
properties
Not all stream nodes, but only child pool nodes (to
the left of v)
Not the entire SSPI is searched when checking whether
w is a descendant of v
Function checkContainment(v, w)
  while (u = next(PL(w)) and !found)
    if (u.L ≥ v.L and u.R ≤ v.R) return true
    else if (u.L ≥ v.R) return false
    else if (u has no preds) remove u from PL(w)
    else
      found = checkContainment(v, u)
      if (!found) PL(w) = PL(w) ∪ PL(u) − {u}
  return found
19
PathStackD

Theorem 1: Given a path query q and a DAG G,
PathStackD correctly returns all the query answers
for q, i.e., it is sound and complete.

Theorem 2: Given a path query q and a DAG G,
PathStackD has worst-case I/O and CPU time
complexity O(|q|·|Smi|² + |q|·|Smi|·d + |E|),
i.e., max(|E|, |q|·|Smi|·max(|Smi|, d)).
|Smi|: average stream size; |q|: query size;
d: diameter of G
Optimal compared to O(|V|²·|q|)
20
Additional Changes in TwigStackD

Key changes
getMinSources (original) → getMissings (ours)
sweepPartialSolutions

getMinSources (original):
1. node with minimal left value
2. has all the required descendant types
getMissings (ours):
1. the same
2. record which required types are missing
3. check if missing types are complemented by
pool nodes
[Figure: (a) Data G with nodes m1, c1, c2, b1, e1; (b) Query with nodes m, c, b, e; streams Sm, Sc, Sb, Se and pools Pm, Pc, Pb, Pe]
21
A Prefiltering Step

Purpose
Improve efficiency by reducing the I/O factor
Smi
Basic Idea
Impose structural constraints of the query
pattern for filtering nodes to be put in streams

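e.g.,
each QBitVec captures the required upward
structural constraints
each QBitVec captures the required downward
structural constraints
[Figure: query with nodes a, b, c, d, shown twice, each node annotated with its QBit and QBitVec; bit-vector values shown: 1111, 1000, 0011, 0100, 1010, 1100, 0001, 1011]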
22
Two Passes for Prefiltering

Downwards Filtering By Example
Traverse the data DAG and aggregate the satisfied
descendant types
Match the satisfied types against the required ones

Data nodes are processed in post-order. When
exiting each edge directed from n to prev, do:
// myBitVec is the bitVector value for n
myBitVec = bitOR(myBitVec, prevBitVec, QBit)
// prev is query relevant if it matches a query label
if (prev is query relevant && prev does not satisfy
its structural constraint)
then myBitVec = bitAND(myBitVec, NOT prevQBit)
if (n is query relevant &&
bitAND(myBitVec, QBitVec) == QBitVec)
then n satisfies its structural constraint;
put n into the corresponding stream
[Figure: encoded data DAG with nodes a1, a2, b1, c1, c2, d1, e1, m1 and the bit vectors of their satisfied constraints]
Post-order guarantees that a node is encoded
before all its ancestors
Topological order guarantees that a node is
encoded before all its descendants
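
The downward pass can be sketched in Python under a few assumptions that are not in the slides: the query is a tree with distinct labels and only // edges, each label gets one bit, and the names `downward_filter`, `query_children` and `label` are made up for the example. It is a rough illustration of the bit-vector idea rather than the authors' implementation.

```python
def downward_filter(dag, label, query_children):
    """Keep only data nodes whose descendants cover the query's downward needs.

    dag: dict node -> list of children; label: dict node -> label.
    query_children: dict query label -> list of child labels (every label is a key).
    Returns dict label -> list of data nodes admitted to that label's stream.
    """
    bit = {lab: 1 << i for i, lab in enumerate(query_children)}   # QBit per label
    required = {}                                                 # QBitVec per label

    def need(lab):
        if lab not in required:
            r = bit[lab]
            for child in query_children[lab]:
                r |= need(child)
            required[lab] = r
        return required[lab]

    for lab in query_children:
        need(lab)

    vec, satisfied = {}, {}

    def visit(n):                       # post-order: children are encoded first
        if n in vec:
            return
        own = bit.get(label.get(n), 0)
        v = own
        for c in dag.get(n, []):
            visit(c)
            cv = vec[c]
            c_own = bit.get(label.get(c), 0)
            if c_own and not satisfied[c]:
                cv &= ~c_own            # an unsatisfied child does not count as its type
            v |= cv
        vec[n] = v
        satisfied[n] = bool(own) and (v & required[label[n]]) == required[label[n]]

    for n in label:
        visit(n)
    streams = {lab: [] for lab in query_children}
    for n, ok in satisfied.items():
        if ok:
            streams[label[n]].append(n)
    return streams
```

In a query a(//b//c, //d), a data node labelled a that has a b child without any c below it is rejected even if it also reaches a c directly, because the b bit is cleared before aggregation.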
23
Summary of Our Approach

The key ideas
Our DAG representation losslessly covers the
entire transitive closure
Interval encoding on tree-cover T covers part of the
transitive closure
SSPI and tree-cover encoding together cover the
complete transitive closure
Worst-case space is O(|V| + |E|), compared to
O(|V|²) if the entire transitive closure is
pre-computed and stored
The StackD algorithms leverage trade-offs between
space and time
Adopt a new structure, i.e., partial solution
pools, in addition to streams and stacks
Modify/add procedures to handle stack-popped
nodes in pools, where remaining solutions can be
found
Worst-case time is O(max(|E|, |Smi|²)), compared
to O(|V|²) if no path index is utilized
Prefiltering further optimizes performance by
reducing |Smi|
24
Outline of The Talk

Motivation
Problem Definition
Related Work
The inspiration of our idea
Our Approach
Linear-space representation for DAG
Stack-based algorithms for path, twig, and DAG
queries
Complexity analysis
Optimization by prefiltering
Experimental Evaluations
Conclusions and Future Work

25
Experimental Evaluations

System implementation
Java 1.4
Light-weight storage engine -- PSEPro from
ObjectStore
Utilize its VMMA for memory ↔ disk data structure
mapping
Experimental setups
Tunable synthetic DAG data generator
Parameters: diameter, fan-out, fan-in, number of
distinct labels
Real-life data
Gene ontology data, tree data from XMark
benchmark augmented by random cross links
2.6 GHz Pentium IV PC, 1 GB main memory, 2 GB virtual memory

26
Experiment 1
[Figure: processing time (ms) for path (PQ), twig (TQ), and DAG (DQ) queries over node labels a to f, on five synthetic DAGs: n=25K, m=45K; n=50K, m=90K; n=100K, m=180K; n=200K, m=360K; n=400K, m=720K, where n = |V| and m = |E|]
Compare processing time (including prefiltering
and query execution) of StackD against
Nav [Kanza, PODS'03]
27
Experiment 2
[Figure: processing time (ms) vs. number of edges (K)]
n = 5K; m = 5K, 10K, 20K, 30K, 40K, 50K, 60K,
70K
Evaluate the performance of both algorithms as
the characteristics (density) of the DAG change
28
Experiment 3
[Figure: processing time (ms) for query patterns of increasing size, built from node labels a to i]
Evaluate the performance of both algorithms as
the characteristics (size) of the query change
29
Experiment 4
[Figure: processing time (ms)]
Evaluate the performance of PathStackD with or
without the aid of the prefiltering step
30
Experiment 5
[Figure: time breakdown for PQ and TQ, StackD vs. NavAlgo; legend entries BuildSSPI, TSDFilter, TSDExec, Scan, Result, NavAlgo; sizes 25K, 50K, 100K, 200K, 400K]
100 MB XML document (1.4M nodes and 1.6M
edges)
PQ: //site//person//age
TQ: //site(//item//description, //category//name, //person//age)
31
Conclusions and Future work

Conclusions
Gracefully generalized the stack-based algorithms
for pattern matching on DAGs
The extended algorithms are sound and complete
The proposed approach is optimal among those that
do not rely on precomputed transitive closure
Future Work
Further improvement by incorporating statistics
on a graph structure and/or advanced indexing
schemes
Allow for more general graph operations, which
give rise to more challenging query optimizations

32
Questions?
