Title: Stream Processing of XPath Queries with Predicates, A. K. Gupta and D. Suciu, ACM SIGMOD 2003
1Stream Processing of XPath Queries with Predicates
- A. K. Gupta and D. Suciu, ACM SIGMOD 2003
- A. Burak Gürdag
- burak.gurdag_at_boun.edu.tr
- Levent Özgür
- levent.ozgur_at_boun.edu.tr
2Outline
- XPath Definition
- Problem Definition and Goal
- Proposed Solution
- Implementation and Optimizations
- Experiments
- Conclusion
3XPath
- XPath is a language for addressing parts of an XML document
- It may consist of
- Labels
- Wildcards
- Predicates
- Output functions (text(), sum(), etc.)
- Example XPath query
- P = //a[b/text()=1 and .//a[@c>2]]
4XPath (contd)
- XPath grammar considered in the paper
- P ::= /E | //E
- E ::= label | text() | * | @* | . | E/E | E//E | E[Q]
- Q ::= E | E Op Const | Q and Q | Q or Q | not(Q)
- Op ::= < | ≤ | > | ≥ | = | ≠
- An XPath query P is treated as a boolean filter that matches an XML document iff P selects at least one node when evaluated at the document's root
- A set of XPath queries is called a workload
5Problem Definition
- Process a large collection of XPath queries (filters) on an incoming stream of XML packets (documents) and determine, for each packet, the set of queries that match that packet
- This is the XML stream processing problem
- One instance: the XML routing problem
- Example application
- XML message brokers
- Used in enterprise-level information exchange infrastructures
- XML messaging in general
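For contrast with the approach in the following slides, here is a minimal sketch of the naive baseline: each filter is evaluated independently on each packet, so no work is shared between filters. It assumes Python's `xml.etree.ElementTree`, which supports only a limited XPath subset (equality tests, no `>` comparisons); the function name `matching_filters`, the filter IDs and the sample packet are made up for illustration and do not come from the paper.

```python
import xml.etree.ElementTree as ET

def matching_filters(packet_xml, filters):
    """Return the IDs of all filters that select at least one node in the packet."""
    root = ET.fromstring(packet_xml)
    matched = set()
    for oid, path in filters.items():
        # Every filter is evaluated on its own: common subexpressions
        # between filters are recomputed each time.
        if root.find(path) is not None:
            matched.add(oid)
    return matched

# Hypothetical workload of three filters, keyed by filter ID (oid).
filters = {
    1: ".//a[b='1']",   # an <a> with a child <b> whose text is "1"
    2: ".//a[@c='3']",  # an <a> whose attribute c equals "3"
    3: ".//d",          # any <d> element
}
packet = "<r><a c='3'><b>1</b></a></r>"
print(sorted(matching_filters(packet, filters)))  # [1, 2]
```

The cost of this loop grows with the number of filters for every packet, which is the redundancy the goal slide sets out to remove.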
6Goal
- Provide a scalable, high-performance solution to the XML stream processing problem
- Eliminate redundant work by considering common subexpressions in
- Structure navigation part
- Predicate evaluation part
- Predicate evaluation dominates the computation time when queries have multiple predicates
- Consider space complexity
- e.g. avoid the state explosion problem
7Proposed Solution
- XPush Machine
- A special deterministic stack machine (pushdown automaton)
- Simulates the execution of a set of XPath filters
- Input: a series of SAX events
- Output: the set of filter IDs (oids) that match the processed document
8XPush Machine (Definition)
- Modified PDA
- Top-down and bottom-up states, each with separate transition functions
- For optimization purposes
- SAX events as inputs
- An XPush machine is a tuple
- (Qt,Qb,q0t,q0b,tpush,tpop,ttadd,tbadd,taccept)
9XPush Machine (Definition)
- States
- Qt: set of top-down states
- Qb: set of bottom-up states
- q0t: initial top-down state
- q0b: initial bottom-up state
- qt: current top-down state
- qb: current bottom-up state
- qst: top-down state on top of the stack
- qsb: bottom-up state on top of the stack
- s: stack of states
- Transition functions (tables)
- tpush: Qt × Σ → Qt
- tpop: Qb × Σ → Qb
- tvalue: Qt × V → Qb
- tbadd: Qb × Qb → Qb
- ttadd: Qt × Qb → Qt
- taccept: Qb → P(I)
- Σ: element alphabet
- V: set of atomic values
- P(I): sets of filter identifiers
10XPush Machine (Execution)
- 5 SAX callback functions
- startDocument(): qt = q0t; qb = q0b; s = empty
- text(str): qb = tvalue(qt, str)
- endDocument(): return taccept(qb)
- startElement(a): push(s, (qt, qb)); qt = tpush(qt, a); qb = q0b
- endElement(a): qaux = tpop(qb, a); (qst, qsb) = pop(s); qb = tbadd(qsb, qaux); qt = ttadd(qst, qaux)
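The five callbacks above can be sketched as a Python `xml.sax.ContentHandler`. This is an illustration under assumptions, not the authors' implementation: the transition "tables" are passed in as plain functions, whitespace-only text is skipped, and the toy machine at the bottom is hand-written for the single filter //a[b/text()=1] with filter ID 1. The state names `"text=1"`, `"b_ok"` and `"match"` are invented for this example.

```python
import xml.sax

class XPushMachine(xml.sax.ContentHandler):
    """Runs the five SAX callbacks with caller-supplied transition functions."""

    def __init__(self, q0t, q0b, tpush, tpop, tvalue, tbadd, ttadd, taccept):
        super().__init__()
        self.q0t, self.q0b = q0t, q0b
        self.tpush, self.tpop, self.tvalue = tpush, tpop, tvalue
        self.tbadd, self.ttadd, self.taccept = tbadd, ttadd, taccept
        self.result = None

    def startDocument(self):
        self.qt, self.qb, self.stack = self.q0t, self.q0b, []

    def startElement(self, name, attrs):
        self.stack.append((self.qt, self.qb))
        self.qt = self.tpush(self.qt, name)
        self.qb = self.q0b

    def characters(self, content):
        # Assumption: whitespace-only text is ignored. As on the slide,
        # a text event replaces the current bottom-up state.
        if content.strip():
            self.qb = self.tvalue(self.qt, content.strip())

    def endElement(self, name):
        qaux = self.tpop(self.qb, name)
        qst, qsb = self.stack.pop()
        self.qb = self.tbadd(qsb, qaux)
        self.qt = self.ttadd(qst, qaux)

    def endDocument(self):
        self.result = self.taccept(self.qb)

# Toy transition functions for the filter //a[b/text()=1] (hypothetical encoding).
def tpop(qb, a):
    out = set()
    if a == "b" and "text=1" in qb:
        out.add("b_ok")    # this <b> has text 1
    if a == "a" and "b_ok" in qb:
        out.add("match")   # this <a> has a matching child <b>
    if "match" in qb:
        out.add("match")   # propagate a match found deeper in the tree
    return frozenset(out)

machine = XPushMachine(
    q0t="T", q0b=frozenset(),
    tpush=lambda qt, a: qt,          # single top-down state
    tpop=tpop,
    tvalue=lambda qt, s: frozenset({"text=1"}) if s == "1" else frozenset(),
    tbadd=lambda qsb, qaux: qsb | qaux,
    ttadd=lambda qst, qaux: qst,
    taccept=lambda qb: {1} if "match" in qb else set(),
)

def run(machine, xml_text):
    xml.sax.parseString(xml_text.encode("utf-8"), machine)
    return machine.result

print(run(machine, "<r><a><b>1</b></a></r>"))  # {1}
```

In a real XPush machine the six functions would be lookups in precomputed or lazily filled tables; functions are used here only to keep the sketch short.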
11XPush Machine (Construction)
- In two steps
- Construct an alternating finite automaton (AFA) for each filter
- Construct a single bottom-up XPush machine from all AFAs
- Single top-down state, naive construction, no optimization.
12AFA Construction
- An AFA is a nondeterministic finite automaton with AND, OR or NOT labels on each state
- Each AFA state corresponds to a subquery
13AFA Construction (Example)
- P1 = //a[b/text()=1 and .//a[@c>2]]
- P2 = //a[@c>2 and b/text()=1]
14XPush Machine Construction
- Single top-down state
- Naive construction
- Each bottom-up state corresponds to a set of AFA states
- Considering all AFAs
- Eliminates common subexpressions between filters
- This yields a speedup
- Transition functions (tables) are constructed accordingly.
15Implementation
- Lazy computation
- Compute states lazily, at run time
- Otherwise there are exponentially many states
- Expand only those states that are accessible for the given XML input
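The idea of expanding only accessible states can be illustrated with a generic lazy subset construction. The sketch below works on a plain string NFA rather than on AFA states and stack transitions, so it is an analogy to the lazy XPush machine and not the machine itself; the class name `LazyTable` and the small NFA are assumptions made for the example.

```python
class LazyTable:
    """Deterministic transition table that is filled in only on demand."""

    def __init__(self, nfa_delta):
        self.nfa_delta = nfa_delta  # (nfa_state, symbol) -> set of nfa_states
        self.table = {}             # (frozenset, symbol) -> frozenset
        self.hits = 0
        self.misses = 0

    def step(self, state, symbol):
        key = (state, symbol)
        if key in self.table:
            self.hits += 1          # entry was computed on an earlier input
        else:
            self.misses += 1        # first visit: compute and remember the entry
            nxt = set()
            for s in state:
                nxt |= self.nfa_delta.get((s, symbol), set())
            self.table[key] = frozenset(nxt)
        return self.table[key]

    def run(self, start, symbols):
        state = frozenset(start)
        for symbol in symbols:
            state = self.step(state, symbol)
        return state

# Hypothetical 3-state NFA over the alphabet {a, b}.
nfa_delta = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
lazy = LazyTable(nfa_delta)
lazy.run({0}, "abab")
lazy.run({0}, "abab")
print(len(lazy.table), lazy.misses, lazy.hits)  # 3 3 5
```

Only three table entries are ever built, although the full subset construction has 8 candidate states and 16 entries. The second run is served entirely from the table.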
16Lazy Computation
- Number of states reduced
- Do not construct states inconsistent with the DTD
- A person has only one name
- Exploit regularities in the data that are not in the DTD
- Usually at most two phones for a person
- Values that do not occur in a given data set create no states
17Data Structure
- The XPush machine is composed of states and a stack
- All discovered XPush states are stored in a hash table, keyed by their signature
- XPush state = sorted array of AFA states + 32-bit signature
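A minimal sketch of such a state store follows. The slides do not say how the 32-bit signature is computed, so CRC32 from Python's `zlib` is used here purely as a stand-in; the class name `StateStore` and the integer AFA state IDs are likewise assumptions.

```python
import zlib

class StateStore:
    """Keeps one canonical copy of each discovered state, looked up by signature."""

    def __init__(self):
        self.buckets = {}  # 32-bit signature -> list of states with that signature

    @staticmethod
    def signature(state):
        # Assumption: CRC32 over the sorted AFA state IDs stands in for
        # the 32-bit signature mentioned on the slide.
        data = ",".join(map(str, state)).encode("ascii")
        return zlib.crc32(data) & 0xFFFFFFFF

    def intern(self, afa_states):
        state = tuple(sorted(set(afa_states)))  # sorted array of AFA states
        bucket = self.buckets.setdefault(self.signature(state), [])
        for existing in bucket:
            if existing == state:               # resolve signature collisions
                return existing
        bucket.append(state)
        return state

store = StateStore()
s1 = store.intern([3, 1, 2])
s2 = store.intern([2, 3, 1, 1])
print(s1, s1 is s2)  # (1, 2, 3) True
```

Sorting the AFA states makes the representation canonical, so two sets with the same members always produce the same signature and compare equal.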
18State Precomputation
- In fact not so lazy!
- Precompute some states and transition table entries
- This speeds up the lazy XPush machine!
19Optimizations
- Improve performance of the lazy XPush machine during state construction
- Two main classes
- State Pruning
- Training the XPush Machine
20State Pruning
- Top-down Pruning
- The bottom-up machine may follow false leads
- Keep track of enabled branches in the top-down component of the state
- Order Optimization
- Based on order information between elements extracted from the DTD
21Training XPush Machine (1)
- Generate training data from the queries!
- Generate one XML document tree D for every XPath query tree P
- P1 = /a[(b/text()=3 and @c=4) or d/text()=5]
- D1 = <a c="4"> <b> 3 </b> <d> 5 </d> </a>
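The generation step can be sketched as follows, assuming a query whose predicates are equality tests combined with and/or. The tuple encoding of predicates and the names `fill` and `training_document` are hypothetical; the example input encodes this slide's query as a root element a, with a child b whose text is 3 and an attribute c equal to 4, or a child d whose text is 5.

```python
import xml.etree.ElementTree as ET

def fill(elem, pred):
    """Add to elem whatever is needed to satisfy every branch of pred."""
    kind = pred[0]
    if kind in ("and", "or"):
        # Both sides are materialised, even for "or", so that one
        # document exercises every branch of the query.
        fill(elem, pred[1])
        fill(elem, pred[2])
    elif kind == "child_text":
        child = ET.SubElement(elem, pred[1])
        child.text = pred[2]
    elif kind == "attr":
        elem.set(pred[1], pred[2])
    else:
        raise ValueError(f"unsupported predicate kind: {kind}")

def training_document(root_label, pred):
    root = ET.Element(root_label)
    fill(root, pred)
    return ET.tostring(root, encoding="unicode")

# Hypothetical encoding of the slide's query P1.
p1 = ("or",
      ("and", ("child_text", "b", "3"), ("attr", "c", "4")),
      ("child_text", "d", "5"))
print(training_document("a", p1))  # <a c="4"><b>3</b><d>5</d></a>
```

Inequality predicates, wildcards and descendant steps would need extra rules for choosing a satisfying value or path, which this sketch leaves out.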
22Training XPush Machine (2)
- The lazy XPush machine is run on the XML training data first
- This computes some of the states, which can then be reused
23Analysis
- The number of states in the lazy XPush machine is not exponential
- The number of accessible states in the XPush machine is no larger than the number of cliques in the independence graph
- Low selectivity of atomic predicates reduces the expected number of states
24Experiments
- How effective?
- Memory Requirements?
- Ideal Performance?
- Optimization techniques?
25Experimental Settings
- Run experiments on two data sets
- Protein
- NASA
- Pentium III, 700 MHz, dual processor, 2 GB memory, RedHat 7.1
26Effectiveness Of XPush
- Between 5,000 and 20,000 queries
- Approximately 200,000 atomic predicates
- 9.12 MB document
- Only 1.2 seconds including parsing (Apache's parser 2.53)
27Memory Requirements
- Number of states below 165,000
- Far from the worst case (exponential)
- Slightly above linear increase as a function of workload
- Most effective when queries have many branches
28Hit Ratio
- Think of the XPush machine as a cache
- It remembers configurations it has already seen
- Hit ratio (successful lookups versus total number of lookups) is above 90%
29Effectiveness
- Each optimization improves performance
- Number of states reduced
- Size of states reduced
30Conclusion
- Efficiently process large numbers of XPath expressions with many predicates per query on a stream of XML data
- The XPush machine runs extremely fast
- Memory requirements are manageable
- The cost of laziness is recovered later