Stream Processing of XPath Queries with Predicates, A. K. Gupta and D. Suciu, ACM SIGMOD 2003

1
Stream Processing of XPath Queries with
Predicates, A. K. Gupta and D. Suciu @ ACM SIGMOD
2003

A. Burak Gürdag
burak.gurdag@boun.edu.tr

Levent Özgür
levent.ozgur@boun.edu.tr
2
Outline

XPath Definition
Problem Definition Goal
Proposed Solution
Implementation Optimizations
Experiments
Conclusion

3
XPath

XPath is a language for addressing parts of an
XML document
It may consist of
Labels
Wildcards
Predicates
Output functions (text(), sum(), etc)
Example XPath query
P = //a[b/text()=1 and .//a[@c>2]]

4
XPath (contd)

XPath grammar considered in the paper
P ::= /E | //E
E ::= label | text() | * | @* | . | E/E | E//E | E[Q]
Q ::= E | E Op Const | Q and Q | Q or Q | not(Q)
Op ::= < | ≤ | > | ≥ | = | ≠
An XPath query P is treated as a boolean filter
that matches an XML document iff P selects at
least one node when evaluated at the document's
root.
Workload: a set of XPath queries

5
Problem Definition

Process a large collection of XPath
queries (filters) on an incoming stream of XML
packets (documents) and determine, for each
packet, the set of queries that match that
packet
This is the XML stream processing problem
An instance: the XML routing problem
Example application
XML message brokers
Used in enterprise-level information exchange
infrastructures
XML messaging in general

6
Goal

Provide a scalable and high performance solution
to the XML stream processing problem
Eliminate redundant work by considering common
subexpressions in
Structure navigation part
Predicate evaluation part
Dominates the computation time when queries have
multiple predicates
Consider space complexity
e.g. avoid state explosion problem

7
Proposed Solution

XPush Machine
A special deterministic stack machine (pushdown
automaton)
Simulates the execution of a set of XPath filters
Input: a series of SAX events
Output: the set of filter IDs (oids) that match
the processed document
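
To make the input side concrete, the following is a minimal sketch of how an XML document turns into the series of SAX events that such a machine consumes. It uses Python's standard xml.sax module; the names EventRecorder and sax_events are hypothetical helpers for this illustration, and attributes are ignored here as a simplifying assumption.

```python
import xml.sax


class EventRecorder(xml.sax.ContentHandler):
    """Records SAX events as tuples (hypothetical helper, not from the paper)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append(("startDocument",))

    def startElement(self, name, attrs):
        # Attributes are ignored in this sketch.
        self.events.append(("startElement", name))

    def characters(self, content):
        # Skip whitespace-only text; a real parser may also deliver
        # long text in several chunks.
        if content.strip():
            self.events.append(("text", content.strip()))

    def endElement(self, name):
        self.events.append(("endElement", name))

    def endDocument(self):
        self.events.append(("endDocument",))


def sax_events(xml_text):
    """Parse xml_text and return the list of recorded SAX events."""
    recorder = EventRecorder()
    xml.sax.parseString(xml_text.encode("utf-8"), recorder)
    return recorder.events


if __name__ == "__main__":
    for event in sax_events("<a><b>1</b></a>"):
        print(event)
```

The machine never sees the document as a tree, only this flat sequence, which is why a stack is needed to remember the state of enclosing elements.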

8
XPush Machine (Definition)

Modified PDA
Top-down and bottom-up states, each with separate
transition functions
For optimization purposes
SAX events as inputs
An XPush machine is a tuple
(Qt, Qb, q0t, q0b, tpush, tpop, tvalue, ttadd, tbadd, taccept)

9
XPush Machine (Definition)

States
Qt set of top-down states
Qb set of bottom-up states
q0t initial top-down state
q0b initial bottom-up state
qt current top-down state
qb current bottom-up state
qst top-down state on top of the stack
qsb bottom-up state on top of the stack
s stack of states

Transition functions (tables)
tpush: Qt × Σ → Qt
tpop: Qb × Σ → Qb
tvalue: Qt × V → Qb
tbadd: Qb × Qb → Qb
ttadd: Qt × Qb → Qt
taccept: Qb → P(I)
Σ: element alphabet
V: set of atomic values
P(I): sets of filter identifiers
10
XPush Machine (Execution)

Five SAX callback functions

startDocument(): qt := q0t; qb := q0b; s := empty
text(str): qb := tvalue(qt, str)
endDocument(): return taccept(qb)
startElement(a): push(s, (qt, qb)); qt := tpush(qt, a); qb := q0b
endElement(a): qaux := tpop(qb, a); (qst, qsb) := pop(s); qb := tbadd(qsb, qaux); qt := ttadd(qst, qaux)
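
The five callbacks above can be sketched as a small driver class. This is an illustration under stated assumptions, not the authors' implementation: the six tables are passed in as plain Python callables, events are given as (callback name, argument) pairs, and the toy tables at the end are hand-written for a single hypothetical filter /a[b/text()=1] with filter ID 0.

```python
class XPushMachine:
    """Driver for the five SAX callbacks; tables are supplied as callables."""

    def __init__(self, q0t, q0b, tpush, tpop, tvalue, tbadd, ttadd, taccept):
        self.q0t, self.q0b = q0t, q0b
        self.tpush, self.tpop, self.tvalue = tpush, tpop, tvalue
        self.tbadd, self.ttadd, self.taccept = tbadd, ttadd, taccept

    def start_document(self):
        self.qt, self.qb, self.stack = self.q0t, self.q0b, []

    def start_element(self, a):
        self.stack.append((self.qt, self.qb))   # remember the parent's states
        self.qt = self.tpush(self.qt, a)
        self.qb = self.q0b

    def text(self, s):
        self.qb = self.tvalue(self.qt, s)

    def end_element(self, a):
        qaux = self.tpop(self.qb, a)
        qst, qsb = self.stack.pop()
        self.qb = self.tbadd(qsb, qaux)          # merge into the parent's state
        self.qt = self.ttadd(qst, qaux)

    def end_document(self):
        return self.taccept(self.qb)

    def run(self, events):
        self.start_document()
        for kind, arg in events:
            getattr(self, kind)(arg)
        return self.end_document()


# Toy tables (assumed for illustration) for the filter /a[b/text()=1], ID 0.
EMPTY = frozenset()


def toy_tpop(qb, a):
    if a == "b" and "text=1" in qb:
        return frozenset({"b ok"})
    if a == "a" and "b ok" in qb:
        return frozenset({"a ok"})
    return EMPTY


toy = XPushMachine(
    q0t="t", q0b=EMPTY,
    tpush=lambda qt, a: qt,                      # single top-down state
    tpop=toy_tpop,
    tvalue=lambda qt, s: frozenset({"text=1"}) if s.strip() == "1" else EMPTY,
    tbadd=lambda qsb, qaux: qsb | qaux,
    ttadd=lambda qst, qaux: qst,
    taccept=lambda qb: {0} if "a ok" in qb else set(),
)

doc = [("start_element", "a"), ("start_element", "b"), ("text", "1"),
       ("end_element", "b"), ("end_element", "a")]
print(toy.run(doc))
```

Each callback does a constant number of table lookups, so the per-event cost does not depend on the number of filters once the needed table entries exist.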
11
XPush Machine (Construction)

In two steps
Construct an alternating finite automaton (AFA)
for each filter
Construct a single bottom-up XPush machine from
all AFAs
Single top-down state, naive construction, no
optimization.

12
AFA Construction

An AFA is a nondeterministic finite automaton
with AND, OR or NOT labels on each state
Each AFA state corresponds to a subquery
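
As a rough illustration of the AND, OR and NOT labels, the sketch below evaluates whether an AFA state holds, given the set of leaf subqueries already known to match. It is a simplification made for this example: symbol-labelled transitions are left out, and the names AfaState and holds are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AfaState:
    name: str                      # the subquery this state stands for
    kind: str = "LEAF"             # "AND", "OR", "NOT" or "LEAF"
    children: list = field(default_factory=list)


def holds(state, matched):
    """Return True if the state holds, given the names of matched leaf states."""
    if state.kind == "LEAF":
        return state.name in matched
    if state.kind == "AND":
        return all(holds(child, matched) for child in state.children)
    if state.kind == "OR":
        return any(holds(child, matched) for child in state.children)
    if state.kind == "NOT":
        return not holds(state.children[0], matched)
    raise ValueError(f"unknown state kind: {state.kind}")


# Predicate of //a[b/text()=1 and .//a[@c>2]] as an AND state over two leaves.
left = AfaState("b/text()=1")
right = AfaState(".//a[@c>2]")
predicate = AfaState("b/text()=1 and .//a[@c>2]", "AND", [left, right])

print(holds(predicate, {"b/text()=1"}))
print(holds(predicate, {"b/text()=1", ".//a[@c>2]"}))
```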

13
AFA Construction (Example)

P1 = //a[b/text()=1 and .//a[@c>2]]
P2 = //a[@c>2 and b/text()=1]

14
XPush Machine Construction

Single top-down state
Naive construction
Each bottom-up state corresponds to a set of AFA
states
Considering all AFAs
Eliminate common sub-expressions between filters
→ speedup
Transition functions (tables) constructed
accordingly.

15
Implementation

Lazy computation
Compute states lazily, at run time
Otherwise there are exponentially many states
Expand only those states that are accessible for
the given XML input
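
One way to picture lazy computation is a transition table whose entries are filled in only when a lookup first needs them. The sketch below is an assumption-laden illustration (the class LazyTable and its counters are hypothetical): it wraps any function that computes a table entry and counts how often an entry was already present.

```python
class LazyTable:
    """Transition table whose entries are computed on first use, then reused."""

    def __init__(self, compute):
        self._compute = compute    # function that builds a missing entry
        self._entries = {}
        self.hits = 0
        self.misses = 0

    def lookup(self, *key):
        if key in self._entries:
            self.hits += 1
        else:
            self.misses += 1
            self._entries[key] = self._compute(*key)
        return self._entries[key]

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    def __len__(self):
        return len(self._entries)


# Example: a lazily filled tbadd table over states modelled as frozensets.
tbadd = LazyTable(lambda q1, q2: q1 | q2)
s1, s2, s3 = frozenset({1}), frozenset({2}), frozenset({3})
tbadd.lookup(s1, s2)   # miss: entry is computed
tbadd.lookup(s1, s2)   # hit
tbadd.lookup(s1, s2)   # hit
tbadd.lookup(s1, s3)   # miss
print(len(tbadd), tbadd.hit_ratio())
```

Only the key combinations that the input actually produces ever get an entry, so the table size tracks the data rather than the full cross product of states.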

16
Lazy Computation

Number of states is reduced
Do not construct states inconsistent with the DTD
A person has only one name
Exploit regularities in the data that are not in
the DTD
Usually at most two phones for a person
Do not construct states for data that does not
occur in the given data set

17
Data Structure

The XPush machine is composed of states and a stack
All discovered XPush states are stored in a hash
table by their signature
XPush state: a sorted array of AFA states plus a
32-bit signature
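
A minimal sketch of such a state store follows. It keeps each state as a sorted tuple of AFA state numbers and indexes it by a 32-bit signature. The use of CRC-32 as the signature function is an assumption made for this example (the slides do not say how the signature is computed), and StateTable and intern are hypothetical names.

```python
import zlib


class StateTable:
    """Hash table of discovered states, keyed by a 32-bit signature."""

    def __init__(self):
        self._buckets = {}   # signature -> list of (sorted tuple, state id)
        self._count = 0

    @staticmethod
    def signature(afa_states):
        # CRC-32 is an assumed choice; any 32-bit hash would do here.
        data = ",".join(str(s) for s in afa_states).encode("ascii")
        return zlib.crc32(data) & 0xFFFFFFFF

    def intern(self, afa_states):
        """Return the id of the state for this set of AFA states, adding it if new."""
        key = tuple(sorted(set(afa_states)))
        bucket = self._buckets.setdefault(self.signature(key), [])
        for stored, state_id in bucket:
            if stored == key:            # full comparison guards against collisions
                return state_id
        state_id = self._count
        self._count += 1
        bucket.append((key, state_id))
        return state_id

    def __len__(self):
        return self._count


table = StateTable()
first = table.intern([3, 1, 2])
again = table.intern([2, 3, 1])    # same set, different order
other = table.intern([1, 2])
print(first, again, other, len(table))
```

Sorting makes the representation canonical, so two states built in different orders map to the same entry, and the signature keeps most lookups to a single short bucket scan.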

18
State Precomputation

In fact, not so lazy!
Precompute some states and transition table entries
This speeds up the lazy XPush machine

19
Optimizations

Improve performance of the lazy XPush machine
during state construction
Two main classes
State Pruning
Training the XPush Machine

20
State Pruning

Top-down Pruning
The bottom-up machine may follow false leads
Keep track of enabled branches in top-down
component of the state
Order Optimization
Based on order information between elements
extracted from DTD

21
Training XPush Machine (1)

Generate training data from the queries!
Generate one XML document tree D for every XPath
query tree P
P1 = /a[(b/text()=3 and @c=4) or d/text()=5]
D1 = <a c="4"> <b> 3 </b> <d> 5 </d> </a>
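
The idea of turning a query tree into a training document can be sketched as follows. The query is assumed to be already parsed into a hypothetical nested tuple (label, attributes, text value, children), only equality predicates are handled, and boolean connectives are ignored so that every branch is emitted. None of these choices are taken from the paper.

```python
import xml.etree.ElementTree as ET


def build(node):
    """Turn one query-tree node into an XML element that satisfies it."""
    label, attrs, text, children = node
    elem = ET.Element(label, {name: str(value) for name, value in attrs.items()})
    if text is not None:
        elem.text = str(text)
    for child in children:
        elem.append(build(child))
    return elem


def training_document(query_tree):
    """Serialize the generated document as a string."""
    return ET.tostring(build(query_tree), encoding="unicode")


# Hand-built tree for a query like /a[(b/text()=3 and @c=4) or d/text()=5].
p1 = ("a", {"c": 4}, None, [
    ("b", {}, 3, []),
    ("d", {}, 5, []),
])

print(training_document(p1))
```

Running the lazy machine on such documents first would create the states that each query needs when all of its branches match.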

22
Training XPush Machine (2)

Lazy XPush machine run on the XML training data
first
Compute some of the states which can be reused

23
Analysis

Number of states in the lazy XPush machine is not
exponential
The number of accessible states in the XPush
machine is no larger than the number of cliques
in the independence graph
Low selectivity of atomic predicates reduces the
expected number of states

24
Experiments

How effective?
Memory requirements?
Ideal performance?
Optimization techniques?

25
Experimental Settings

Run experiments on two data sets
Protein
NASA
Pentium3, 700 MHz, dual processor, 2 GB memory,
RedHat 7.1

26
Effectiveness Of XPush

Between 5000 and 20000 queries
Approximately 200 000 atomic predicates
9.12 MB document
Only 1.2 seconds including parsing (Apache's
parser 2.53)

27
Memory Requirements

Number of states is below 165 000
Far from the worst case (exponential)
Slightly above linear increase as a function of
workload
Most effective when queries have many branches

28
Hit Ratio

Think of the XPush machine as a cache
It remembers configurations we have already seen
Hit ratio (successful lookups versus total
number of lookups) is above 90%

29
Effectiveness

Each optimization improves performance
Number of states reduced
Size of states reduced

30
Conclusion

Efficiently process large numbers of XPath
expressions with many predicates per query on a
stream of XML data
The XPush machine runs extremely fast
Memory requirements are manageable
The cost of laziness is recovered later!
