Stream Processing of XPath Queries with Predicates, A. K. Gupta and D. Suciu, ACM SIGMOD 2003

1
Stream Processing of XPath Queries with
Predicates, A. K. Gupta and D. Suciu @ ACM SIGMOD
2003
  • A. Burak Gürdag (burak.gurdag@boun.edu.tr)
  • Levent Özgür (levent.ozgur@boun.edu.tr)
2
Outline
  • XPath Definition
  • Problem Definition & Goal
  • Proposed Solution
  • Implementation Optimizations
  • Experiments
  • Conclusion

3
XPath
  • XPath is a language for addressing parts of an
    XML document
  • It may consist of
  • Labels
  • Wildcards
  • Predicates
  • Output functions (text(), sum(), etc)
  • Example XPath query
  • P = //a[b/text()=1 and .//a[@c>2]]

4
XPath (contd)
  • XPath grammar considered in the paper
  • P ::= /E | //E
  • E ::= label | text() | * | @* | . | E/E |
    E//E | E[Q]
  • Q ::= E | E Op Const | Q and Q | Q or Q |
    not(Q)
  • Op ::= < | ≤ | > | ≥ | = | ≠
  • An XPath query P is treated as a boolean filter
    that matches an XML document iff P selects at
    least one node when evaluated on the document's
    root.
  • A set of XPath queries is called a workload

5
Problem Definition
  • Process a large collection of XPath queries
    (filters) on an incoming stream of XML packets
    (documents) and determine, for each packet, the
    set of queries that match that packet
  • XML stream processing problem
  • An instance: the XML routing problem
  • Example application
  • XML Message Brokers
  • Used in enterprise level information exchange
    infrastructures
  • XML messaging in general

6
Goal
  • Provide a scalable and high performance solution
    to the XML stream processing problem
  • Eliminate redundant work by considering common
    subexpressions in
  • Structure navigation part
  • Predicate evaluation part
  • Dominates the computation time when queries have
    multiple predicates
  • Consider space complexity
  • e.g. avoid state explosion problem

7
Proposed Solution
  • XPush Machine
  • A special deterministic stack machine (push down
    automaton)
  • Simulates the execution of a set of XPath filters
  • Input: a series of SAX events
  • Output: the set of filter IDs (oids) that match
    the processed document

8
XPush Machine (Definition)
  • Modified PDA
  • Top-down and bottom-up states and separate
    transition functions, accordingly.
  • For optimization purposes
  • SAX events as inputs
  • An XPush machine is a tuple
  • (Qt,Qb,q0t,q0b,tpush,tpop,ttadd,tbadd,taccept)

9
XPush Machine (Definition)
  • States
  • Qt set of top-down states
  • Qb set of bottom-up states
  • q0t initial top-down state
  • q0b initial bottom-up state
  • qt current top-down state
  • qb current bottom-up state
  • qst top-down state on top of the stack
  • qsb bottom-up state on top of the stack
  • s stack of states

Transition functions (tables):
  • tpush: Qt x Σ → Qt
  • tpop: Qb x Σ → Qb
  • tvalue: Qt x V → Qb
  • tbadd: Qb x Qb → Qb
  • ttadd: Qt x Qb → Qt
  • taccept: Qb → P(I)
  • Σ: element alphabet
  • V: set of atomic values
  • P(I): sets of filter identifiers
10
XPush Machine (Execution)
  • 5 SAX call-back functions

startDocument(): qt = q0t; qb = q0b; s = empty
text(str): qb = tvalue(qt, str)
endDocument(): return taccept(qb)
startElement(a): push(s, (qt, qb)); qt = tpush(qt, a); qb = q0b
endElement(a): qaux = tpop(qb, a); (qst, qsb) = pop(s);
  qb = tbadd(qsb, qaux); qt = ttadd(qst, qaux)
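The five call-backs can be sketched in Python on top of the standard xml.sax module. This is a minimal sketch, not the authors' implementation: the class name XPushRuntime, the helper run_toy and the hand-written toy tables (a single filter with id 1, roughly //b[text()=1], and a single top-down state 0) are all assumptions made for illustration. The sketch also ignores whitespace-only text and assumes each text value arrives in one characters() call.

```python
import xml.sax


class XPushRuntime(xml.sax.ContentHandler):
    """Runs the call-backs from the slide; the tables are passed in as callables."""

    def __init__(self, q0t, q0b, tpush, tpop, tvalue, tbadd, ttadd, taccept):
        super().__init__()
        self.q0t, self.q0b = q0t, q0b
        self.tpush, self.tpop, self.tvalue = tpush, tpop, tvalue
        self.tbadd, self.ttadd, self.taccept = tbadd, ttadd, taccept
        self.result = None

    def startDocument(self):
        self.qt, self.qb, self.stack = self.q0t, self.q0b, []

    def startElement(self, name, attrs):
        self.stack.append((self.qt, self.qb))
        self.qt = self.tpush(self.qt, name)
        self.qb = self.q0b

    def characters(self, content):
        text = content.strip()
        if text:  # assumption: whitespace-only text is skipped
            self.qb = self.tvalue(self.qt, text)

    def endElement(self, name):
        qaux = self.tpop(self.qb, name)
        qst, qsb = self.stack.pop()
        self.qb = self.tbadd(qsb, qaux)
        self.qt = self.ttadd(qst, qaux)

    def endDocument(self):
        # SAX ignores return values, so the accepted filter ids are stored.
        self.result = self.taccept(self.qb)


EMPTY = frozenset()


def toy_tpop(qb, name):
    """Toy table: closing <b> over the text 1 yields 'match'; 'match' propagates up."""
    if name == "b" and "txt1" in qb:
        return frozenset({"match"})
    return frozenset({"match"}) & qb


def run_toy(xml_bytes):
    """Run the toy machine on one document and return the matching filter ids."""
    handler = XPushRuntime(
        q0t=0,
        q0b=EMPTY,
        tpush=lambda qt, a: qt,  # single top-down state
        tpop=toy_tpop,
        tvalue=lambda qt, s: frozenset({"txt1"}) if s == "1" else EMPTY,
        tbadd=lambda q1, q2: q1 | q2,
        ttadd=lambda qt, qaux: qt,
        taccept=lambda qb: {1} if "match" in qb else set(),
    )
    xml.sax.parseString(xml_bytes, handler)
    return handler.result
```

Calling run_toy(b"<a><b>1</b><c>2</c></a>") walks the document once and reports the ids of the filters that matched. A real XPush machine would replace the lambdas with table lookups over precomputed or lazily computed states.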
11
XPush Machine (Construction)
  • In two steps
  • Construct an alternating finite automaton (AFA)
    for each filter
  • Construct a single bottom-up XPush machine from
    all AFAs
  • Single top-down state, naive construction, no
    optimization.

12
AFA Construction
  • An AFA is a nondeterministic finite automaton
    with AND, OR or NOT labels on each state
  • Each AFA state corresponds to a subquery

13
AFA Construction (Example)
  • P1 = //a[b/text()=1 and .//a[@c>2]]
  • P2 = //a[@c>2 and b/text()=1]
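The AFA figure from this slide is not part of the transcript. As a rough illustration of "each AFA state corresponds to a subquery", the sketch below encodes P1 = //a[b/text()=1 and .//a[@c>2]] as a small table of states and evaluates it bottom-up on a parsed document. The encoding (state names, the tuple layout, the helpers num, states_at and matches) is hypothetical and much simpler than the paper's automaton; it only supports a top-level state of the "desc" kind.

```python
import xml.etree.ElementTree as ET


def num(pred):
    """Wrap a numeric test so that non-numeric values simply fail it."""
    def check(value):
        try:
            return pred(float(value))
        except ValueError:
            return False
    return check


# Hypothetical encoding: one entry per subquery of P1, listed so that
# every state comes after the states it refers to.
AFA_P1 = {
    "c>2":    ("attr", "c", num(lambda v: v > 2)),  # @c>2
    "inner":  ("desc", "a", "c>2"),                 # .//a[@c>2]
    "text=1": ("text", num(lambda v: v == 1)),      # text()=1
    "b":      ("child", "b", "text=1"),             # b/text()=1
    "and":    ("and", "b", "inner"),                # the whole predicate
    "P1":     ("desc", "a", "and"),                 # //a[...]
}


def states_at(node, afa):
    """Return the set of AFA states that hold at `node`, computed bottom-up."""
    kids = [(kid.tag, states_at(kid, afa)) for kid in node]
    true = set()
    for name, rule in afa.items():
        kind = rule[0]
        if kind == "attr":
            ok = rule[1] in node.attrib and rule[2](node.attrib[rule[1]])
        elif kind == "text":
            ok = node.text is not None and rule[1](node.text)
        elif kind == "child":
            ok = any(tag == rule[1] and rule[2] in s for tag, s in kids)
        elif kind == "desc":
            # True if a child matches, or if the same state already holds below.
            ok = any((tag == rule[1] and rule[2] in s) or name in s
                     for tag, s in kids)
        elif kind == "and":
            ok = rule[1] in true and rule[2] in true
        elif kind == "or":
            ok = rule[1] in true or rule[2] in true
        elif kind == "not":
            ok = rule[1] not in true
        else:
            raise ValueError(f"unknown state kind: {kind}")
        if ok:
            true.add(name)
    return true


def matches(xml_text, afa, top):
    """Boolean filter: does the 'desc' state `top` hold at the document root?"""
    root = ET.fromstring(xml_text)
    s = states_at(root, afa)
    _, label, target = afa[top]
    return (root.tag == label and target in s) or top in s
```

For example, matches('<a><b>1</b><x><a c="3"/></x></a>', AFA_P1, "P1") evaluates the filter on a document in which the outer a has a child b with text 1 and a descendant a with c greater than 2. The set returned by states_at for a node plays the role that a bottom-up state plays in the XPush machine.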

14
XPush Machine Construction
  • Single top-down state
  • Naive construction
  • Each bottom-up state corresponds to a set of AFA
    states
  • Considering all AFAs
  • Eliminate common sub-expressions between filters
  • → speedup
  • Transition functions (tables) constructed
    accordingly.

15
Implementation
  • Lazy computation
  • Compute lazily, at run time
  • Otherwise there are exponentially many states
  • Expand only those states that are accessible for
    the given XML input
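Lazy computation can be pictured as a transition table that starts empty and is filled on first use. The sketch below is an assumption-laden illustration: the class LazyTable and its counters are hypothetical names, and the table it wraps in the usage note (tbadd as set union) is only a stand-in for a real transition function.

```python
class LazyTable:
    """A transition table whose entries are computed on first lookup."""

    def __init__(self, compute):
        self._compute = compute  # function that derives a missing entry
        self._entries = {}       # (arguments) -> resulting state
        self.lookups = 0
        self.hits = 0

    def __call__(self, *key):
        self.lookups += 1
        if key in self._entries:
            self.hits += 1
        else:
            # Only entries reachable on the actual input are ever expanded.
            self._entries[key] = self._compute(*key)
        return self._entries[key]

    def __len__(self):
        return len(self._entries)

    def hit_ratio(self):
        """Fraction of lookups answered from already computed entries."""
        return self.hits / self.lookups if self.lookups else 0.0
```

A table such as tbadd = LazyTable(lambda q1, q2: q1 | q2) can then be called like a function with frozenset states; repeated configurations are served from the stored entries, and hit_ratio() reports how often that happened.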

16
Lazy Computation
  • Number of states reduced
  • Do not construct states inconsistent with DTD
  • A person has only one name
  • Exploit regularities in the data that are not in
    DTD
  • Usually at most two phones for a person
  • Data that does not occur in a given data set

17
Data Structure
  • XPush is composed of states and a stack
  • All discovered XPush states are stored in a hash
    table by their signature
  • XPush state = sorted array of AFA states + 32-bit
    signature
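A minimal sketch of such a state store follows. The slide does not say how the 32-bit signature is computed; CRC-32 over the packed AFA state ids is an assumption made here, as are the names signature and StateStore and the use of small integers as AFA state ids.

```python
import struct
import zlib


def signature(state):
    """32-bit signature of a sorted tuple of AFA state ids (CRC-32, assumed)."""
    return zlib.crc32(struct.pack(f"<{len(state)}I", *state))


class StateStore:
    """Hash table of discovered XPush states, keyed by their signature."""

    def __init__(self):
        self._buckets = {}  # signature -> list of (state, state id)
        self.states = []    # state id -> sorted tuple of AFA state ids

    def intern(self, afa_states):
        """Return the id of the XPush state for this set of AFA states."""
        state = tuple(sorted(set(afa_states)))
        bucket = self._buckets.setdefault(signature(state), [])
        for known, state_id in bucket:
            if known == state:  # compare in full: signatures may collide
                return state_id
        state_id = len(self.states)
        self.states.append(state)
        bucket.append((state, state_id))
        return state_id
```

Interning the same set of AFA states twice, in any order, returns the same id, so transition tables can refer to XPush states by small integers instead of by the arrays themselves.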

18
State Precomputation
  • In fact not so lazy!
  • Compute some states and transition table entries
  • Speeds up the lazy XPush machine!

19
Optimizations
  • Improve performance of the lazy XPush machine
    during state construction
  • Two main classes
  • State Pruning
  • Training the XPush Machine

20
State Pruning
  • Top-down Pruning
  • The bottom-up machine may follow false leads
  • Keep track of enabled branches in top-down
    component of the state
  • Order Optimization
  • Based on order information between elements
    extracted from DTD

21
Training XPush Machine (1)
  • Generate training data from the queries!
  • Generate one XML document tree D for every XPath
    query tree P
  • P1 = /a[(b/text()=3 and @c=4) or d/text()=5]
  • D1 = <a c="4"> <b> 3 </b> <d> 5 </d> </a>
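The step from a query tree to a training document can be sketched as a simple tree walk. The nested-tuple encoding of the query, the function names fill and training_document, and the restriction to equality predicates and and/or connectives are all assumptions made for this sketch; how the paper handles other operators or not(...) is not shown on the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical query-tree encoding of
# P1 = /a[(b/text()=3 and @c=4) or d/text()=5]
P1 = ("elem", "a", [
    ("or",
     ("and", ("elem", "b", [("text", 3)]), ("attr", "c", 4)),
     ("elem", "d", [("text", 5)])),
])


def fill(parent, query):
    """Add to `parent` whatever is needed to satisfy every branch of `query`."""
    kind = query[0]
    if kind in ("and", "or"):
        # Both branches are materialised, so the document exercises each one.
        for sub in query[1:]:
            fill(parent, sub)
    elif kind == "attr":
        parent.set(query[1], str(query[2]))
    elif kind == "text":
        parent.text = str(query[1])
    elif kind == "elem":
        child = ET.SubElement(parent, query[1])
        for sub in query[2]:
            fill(child, sub)
    else:
        raise ValueError(f"unsupported query node: {kind}")


def training_document(query):
    """Build one XML document from an ("elem", label, predicates) query root."""
    _, label, predicates = query
    root = ET.Element(label)
    for sub in predicates:
        fill(root, sub)
    return ET.tostring(root, encoding="unicode")
```

For P1 this yields the document from the slide, without the extra whitespace: <a c="4"><b>3</b><d>5</d></a>.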

22
Training XPush Machine (2)
  • The lazy XPush machine is run on the XML
    training data first
  • Compute some of the states which can be reused

23
Analysis
  • The number of states in the lazy XPush machine is
    not exponential
  • The number of accessible states in the XPush
    machine is no larger than the number of cliques
    in the independence graph
  • Low selectivity of atomic predicates reduces the
    expected number of states

24
Experiments
  • How effective?
  • Memory Requirements?
  • Ideal Performance?
  • Optimization techniques?

25
Experimental Settings
  • Run experiments on two data sets
  • Protein
  • NASA
  • Pentium3, 700 MHz, dual processor, 2 GB memory,
    RedHat 7.1

26
Effectiveness Of XPush
  • Between 5,000 and 20,000 queries
  • Approximately 200,000 atomic predicates
  • 9.12 MB document
  • Only 1.2 seconds including parsing (Apache's
    parser 2.53)

27
Memory Requirements
  • Number of states below 165 000
  • Far from the worst case (exponential)
  • Slightly above linear increase as a function of
    workload
  • Most effective when queries have many branches

28
Hit Ratio
  • Think of the XPush machine as a cache
  • It remembers configurations we have just seen
  • Hit ratio (successful lookups versus total
    number of lookups) above 90%

29
Effectiveness
  • Each optimization improves performance
  • Number of states reduced
  • Size of states reduced

30
Conclusion
  • Efficiently process large numbers of XPath
    expressions with many predicates per query on a
    stream of XML data
  • The XPush machine runs extremely fast
  • Memory requirements are manageable
  • The cost of laziness is recovered later!