Title: Early Profile Pruning on XML-aware Publish-Subscribe Systems
1Early Profile Pruning on XML-aware
Publish-Subscribe Systems
- Mirella M. Moro, Petko Bakalov, Vassilis J. Tsotras
- University of California, Riverside
2Overview
- Motivation
- Bottom-up Filtering FSM (BUFF)
- Bounding-based XML Filtering (BoXFilter)
- Core Modules
- Filtering algorithms
- Experimental results
3Motivation
- Publish-subscribe systems: message transmission is determined by the message content
- Examples: notification websites such as hotwire.com or ticketmaster.com
[Figure: Publishers send Documents to the Matching algorithm; Subscribers Submit, Update, and Delete Profiles and receive Results]
4Publish-subscribe systems
- The data is exchanged in XML format.
- Nodes correspond to elements, attributes, or text values
- Edges represent immediate element-subelement or element-value relationships
<Bib>
  <article vol="7" no="11">
    <title>t1</title>
    <author>
      <last>DeWitt</last>
      <mi>J</mi>
      <first>David</first>
    </author>
    <journal>TPDS</journal>
    <year>1996</year>
  </article>
  <article>
    <title>t2</title>
    <author>
      <last>Florescu</last>
      <first>Daniela</first>
    </author>
    <proceedings>SIGMOD</proceedings>
    <year>2006</year>
  </article>
</Bib>
(a) Document
(b) Tree representation
5Publish-subscribe systems (cont.)
- User profiles are expressed in an XML query language (XPath, XQuery)
- An XML query contains:
- structural constraints
- value-based constraints
Structural constraints: //article[/author[@last="Smith"]]//procs[@conf="VLDB"]
Tree pattern: article, with branches author/last and proceedings/conf
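To make the two kinds of constraints concrete, here is a minimal sketch that evaluates simple profiles against the bibliography document from the earlier slide, using the limited XPath subset in Python's standard xml.etree.ElementTree. The attribute values vol="7" and no="11" are an assumed reading of the slide, and the three queries are illustrative, not profiles taken from the talk.

```python
import xml.etree.ElementTree as ET

# Bibliography document from the earlier slide (attribute syntax is assumed).
DOC = """
<Bib>
  <article vol="7" no="11">
    <title>t1</title>
    <author><last>DeWitt</last><mi>J</mi><first>David</first></author>
    <journal>TPDS</journal>
    <year>1996</year>
  </article>
  <article>
    <title>t2</title>
    <author><last>Florescu</last><first>Daniela</first></author>
    <proceedings>SIGMOD</proceedings>
    <year>2006</year>
  </article>
</Bib>
"""

root = ET.fromstring(DOC)

# Structural constraint only: every <last> under an <author> under an <article>.
structural = root.findall(".//article/author/last")

# Structural plus value-based constraints on element text.
with_values = root.findall(".//article[year='1996']/author[last='DeWitt']")

# Value-based constraint on an attribute.
by_attribute = root.findall(".//article[@vol='7']")

print([e.text for e in structural])
print(len(with_values), len(by_attribute))
```

ElementTree covers only a small XPath subset; it is used here just to show that a profile combines a path shape with value tests.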
6Related Work/Our Contribution
- Current work
- Construction of overlay network
- Dissemination/indexing of profiles (queries)
- Processing of stream of messages
- We focus on the matching process that takes place within a broker
- Improve the performance of a regular FSM by using a bottom-up evaluation of the document
- Develop an index-based filtering technique that performs early pruning of the query profiles
8Bottom-up vs. Top-down filtering
- State machines are among the most common methods for the XML matching process
- Top-down approach (i.e., in-order traversal or depth-first order): advances the state machine for each XML element (or attribute) read
- It does not consider any form of early pruning
- Bottom-up approach: takes into consideration the (usual) fact that an XML document has its more selective elements located at its leaves
9Example
- The top-down approach groups the queries according to their common prefixes
- The bottom-up approach groups them according to their common suffixes
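The difference between the two groupings can be sketched with a trie: inserting each query path as written shares common prefixes, while inserting the reversed path shares common suffixes. The three linear path queries below are hypothetical and are not the Q1 to Q6 of the figure; the nested-dict trie is only an illustration, not the state machine used in the talk.

```python
def build_trie(paths):
    """Build a nested-dict trie; the '$' key lists the queries ending at a node."""
    root = {}
    for qid, path in paths.items():
        node = root
        for tag in path:
            node = node.setdefault(tag, {})
        node.setdefault("$", []).append(qid)
    return root


# Hypothetical linear path queries, written root-to-leaf.
queries = {
    "Q1": ["a", "b", "c"],
    "Q2": ["a", "b", "d"],
    "Q3": ["e", "b", "c"],
}

# Top-down: queries share states along common prefixes (Q1 and Q2 share a/b).
top_down = build_trie(queries)

# Bottom-up: reverse each path so queries share common suffixes (Q1 and Q3 share b/c).
bottom_up = build_trie({qid: path[::-1] for qid, path in queries.items()})

print(sorted(top_down))                 # first steps when reading from the root
print(sorted(bottom_up))                # first steps when reading from the leaves
print(sorted(bottom_up["c"]["b"]))      # Q1 and Q3 diverge only after the shared suffix
```

In the bottom-up trie the first transition is taken on a leaf tag, so a document whose leaves match none of the first-level keys can be rejected without looking at the rest of its path.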
[Figure: (a) Document; (b) Queries Q1 to Q6; (c) Top-down state machine; (d) Bottom-up state machine]
10BUFF
- FSM-based Bottom-up approach for XML filtering.
- BUFF avoids translating documents and queries to Prüfer sequences (as the other algorithms do), and employs a more direct evaluation algorithm.
- The document is parsed through a SAX parser, which triggers events for specific marks (tags) in the XML document
- The machine keeps a runtime stack that stores the current document path being processed.
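The SAX-plus-stack mechanics described above can be sketched with Python's standard xml.sax module. This is not BUFF itself: the class name PathStackHandler is hypothetical, queries are assumed to be plain child-axis paths, and a naive suffix comparison stands in for the state machine.

```python
import xml.sax


class PathStackHandler(xml.sax.ContentHandler):
    """Keeps the current root-to-element path on a stack while SAX events arrive."""

    def __init__(self, queries):
        super().__init__()
        self.queries = queries   # {query id: non-empty list of tags}
        self.stack = []          # runtime stack: current document path
        self.matched = set()

    def startElement(self, name, attrs):
        self.stack.append(name)
        # Naive stand-in for the FSM: a query matches when its tags
        # are a suffix of the current document path.
        for qid, path in self.queries.items():
            if self.stack[-len(path):] == path:
                self.matched.add(qid)

    def endElement(self, name):
        self.stack.pop()


handler = PathStackHandler({
    "Q1": ["b", "c"],
    "Q2": ["a", "d"],
    "Q3": ["c", "d"],
})
xml.sax.parseString(b"<a><b><c/></b><d/></a>", handler)
print(sorted(handler.matched))
```

The stack grows on each start tag and shrinks on each end tag, so at any moment it holds exactly the path from the root to the element being processed.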
11BUFF Example
[Figure: (a) Document and BUFF, with queries Q1 and Q2 and a runtime stack holding <a>, <b>, <c>, <d>, <e>; panels (b) to (g)]
13Bounding-based XML Filtering
- Two major processes working asynchronously
- Profile Management
- Profile Matching
[Figure: system architecture with components Prüfer Sequence, Profile Manager, and Matching Module (Matching Algorithm); inputs are Input Documents and Profiles (queries); output is Matched Documents]
14Prüfer Sequence
- A unique sequential encoding of a labeled tree
- Algorithm
- Iteratively removes nodes from the tree until all nodes but the last two have been removed.
- At each iteration, the algorithm finds and removes the leaf with the smallest label and adds to the Prüfer sequence the label of that leaf's parent.
- Theorem: if a query tree Q is a subgraph of a document tree D, then the Prüfer sequence of Q is a subsequence of the Prüfer sequence of D
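The construction above can be sketched as follows. This is the textbook Prüfer construction on a tree with distinct integer node labels, given as an undirected edge list; how labels are assigned to XML nodes is not covered on the slide, so the labeling in the example is an assumption.

```python
import heapq


def prufer_sequence(edges):
    """Return the Prüfer sequence of a tree given as (u, v) edges with integer labels."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Min-heap of current leaves, so the smallest-labeled leaf is removed first.
    leaves = [node for node, nbrs in adj.items() if len(nbrs) == 1]
    heapq.heapify(leaves)

    sequence = []
    while len(adj) > 2:                 # stop when only two nodes remain
        leaf = heapq.heappop(leaves)
        (parent,) = adj[leaf]           # a leaf has exactly one neighbor
        sequence.append(parent)         # record the label of the leaf's parent
        adj[parent].remove(leaf)
        del adj[leaf]
        if len(adj[parent]) == 1:       # the parent has become a leaf
            heapq.heappush(leaves, parent)
    return sequence


# Example tree: nodes 1, 2, 3 hang off node 4; then the chain 4 - 5 - 6.
edges = [(1, 4), (2, 4), (3, 4), (4, 5), (5, 6)]
print(prufer_sequence(edges))  # [4, 4, 4, 5]
```

A tree with n nodes yields a sequence of n - 2 labels, which is why the loop stops with two nodes left.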
15Sequence Envelope
- Assume a set of k Prüfer sequences representing user profiles S1,..,Sk
- We can derive two new sequences:
- Upper bound U: for each position, take the largest element
- Lower bound L: for each position, take the smallest element
- L and U form the smallest possible bounding envelope that encompasses all members of the set of sequences from above and below.
16Example
- Assume 3 sequences with 11 symbols each
- abcabababcd
- cdcdecdcdec
- dedededebab
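For the three sequences above, the envelope can be computed position by position. This is a minimal sketch assuming equal-length sequences and plain alphabetical ordering of symbols; the function name envelope is illustrative.

```python
def envelope(sequences):
    """Return (L, U): the per-position smallest and largest symbols."""
    columns = list(zip(*sequences))     # one tuple of symbols per position
    lower = "".join(min(col) for col in columns)
    upper = "".join(max(col) for col in columns)
    return lower, upper


profiles = ["abcabababcd", "cdcdecdcdec", "dedededebab"]
L, U = envelope(profiles)
print(L)  # abcabababab
print(U)  # dedeeededed
```

Every profile lies between L and U at every position, and no tighter pair of bounds has that property.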
17Sequence Envelope (Cont.)
- A key property of the sequence envelope is that it can be used as an aggregate of the underlying set of sequences
18BoXFilter Tree
- Sequence envelopes can be nested, forming a BoXFilter tree
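Nesting can be sketched by treating a single sequence as the degenerate envelope (s, s) and merging envelopes position by position. How the real BoXFilter tree chooses which sequences to group together is not stated on the slide; the grouping below is arbitrary and merge_envelopes is a hypothetical helper.

```python
def merge_envelopes(envelopes):
    """Merge (lower, upper) pairs into the smallest envelope enclosing all of them."""
    lowers, uppers = zip(*envelopes)
    lower = "".join(min(col) for col in zip(*lowers))
    upper = "".join(max(col) for col in zip(*uppers))
    return lower, upper


# Leaf level: each leaf envelope bounds a group of profile sequences.
group1 = ["abcabababcd", "cdcdecdcdec"]
group2 = ["dedededebab"]
leaf1 = merge_envelopes([(s, s) for s in group1])
leaf2 = merge_envelopes([(s, s) for s in group2])

# Parent level: the root envelope bounds the leaf envelopes.
root = merge_envelopes([leaf1, leaf2])

print(leaf1)
print(root)
```

Because a parent envelope encloses its children, a document rejected at a parent cannot be accepted by any profile beneath it.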
19Filtering algorithms
- The profiles in the system are organized in a BoXFilter tree. Documents are passed through the tree
- There are two variations of the filtering algorithm:
- Sequential: documents are processed one by one
- Batch processing: documents are organized in a tree like the queries, and both trees are joined
- After the traversal of the BoXFilter tree, there is a verification step
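The sequential variation can be sketched as a tree traversal that prunes a subtree whenever its envelope rules out every profile beneath it, followed by a check of the surviving profiles. The pruning test used here (a greedy range-subsequence match) is an assumption chosen because it never rejects a profile that is a true subsequence; it is not necessarily the criterion used by the authors. The plain subsequence check is only a stand-in for the verification step, and the Node layout is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    lower: str
    upper: str
    children: list = field(default_factory=list)   # inner node: child Nodes
    profiles: dict = field(default_factory=dict)   # leaf node: {profile id: sequence}


def may_match(lower, upper, doc):
    """Greedy test: could some sequence within [lower, upper] be a subsequence of doc?"""
    i = 0
    for sym in doc:
        if i < len(lower) and lower[i] <= sym <= upper[i]:
            i += 1
    return i == len(lower)


def is_subsequence(query, doc):
    it = iter(doc)
    return all(sym in it for sym in query)   # each lookup consumes the iterator


def filter_document(root, doc):
    matched, todo = [], [root]
    while todo:
        node = todo.pop()
        if not may_match(node.lower, node.upper, doc):
            continue                          # early pruning of the whole subtree
        todo.extend(node.children)
        for pid, seq in node.profiles.items():
            if is_subsequence(seq, doc):      # stand-in for the verification step
                matched.append(pid)
    return sorted(matched)


leaf_a = Node("abc", "abd", profiles={"P1": "abc", "P2": "abd"})
leaf_b = Node("xyz", "xyz", profiles={"P3": "xyz"})
tree = Node("abc", "xyz", children=[leaf_a, leaf_b])

print(filter_document(tree, "aabxd"))  # ['P2']
```

In this run the envelope of leaf_b fails the test, so P3 is never examined individually, while P1 passes the envelope test but is rejected by the final check.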
21Experimental Results
- We generated datasets with 1000, 10000, and 100000 small documents (up to 8KB each)
- We generated up to 100000 queries with selectivity fixed to 50
[Figure: experimental result plots, panels (a) to (c)]
22Experimental Results (cont.)
- In this set of experiments, we vary the number of documents that match any of the profile queries (a selectivity of 1% means that one percent of the documents satisfy any of the queries).
23Thank You!