Title: Processing XML Streams with Deterministic Automata
1Processing XML Streams with Deterministic Automata
- Denis Mindolin
- Gaurav Chandalia
2Introduction
[Figure: an XML data stream enters the XML Stream Router, which evaluates XPath queries 1, 2, and 3 and forwards the results to Consumers 1, 2, and 3]
3Related Work
- The problem was introduced by Altinel and Franklin (2000) for the XFilter system.
- Chan et al. (2002) describe techniques for solving the problem based on a trie (XTrie).
- Diao et al. (2003) discuss a method based on optimized NFAs (YFilter).
- Green et al. (2003) show how to solve the problem using a lazy DFA.
4DFA approach in general
- Convert the set of XPath expressions into a set of NFAs
- Convert the set of NFAs into a single NFA
- Convert the single NFA into a DFA
- Process the XML data stream with the DFA (using the SAX model)
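The last step above can be sketched as a SAX handler that keeps a stack of DFA states, pushing on each start tag and popping on each end tag. This is a minimal sketch, not the authors' implementation: the class name `DfaHandler`, the helper `run_dfa`, the dictionary encoding of the transition table, and the sink state `-1` are all assumptions made for illustration, and the DFA is written by hand for the single query /a/b.

```python
import xml.sax
from typing import Dict, List, Set, Tuple


class DfaHandler(xml.sax.ContentHandler):
    """Runs a DFA over SAX events, keeping one DFA state per open element."""

    DEAD = -1  # hypothetical sink state: no query can match below it

    def __init__(self, delta: Dict[Tuple[int, str], int],
                 accepting: Dict[int, Set[str]], start: int = 0) -> None:
        super().__init__()
        self.delta = delta            # (state, element label) -> next state
        self.accepting = accepting    # state -> names of queries matched there
        self.stack: List[int] = [start]
        self.path: List[str] = []
        self.matches: List[Tuple[str, str]] = []

    def startElement(self, name, attrs):
        # Start tag: follow one transition and push the new state.
        nxt = self.delta.get((self.stack[-1], name), self.DEAD)
        self.stack.append(nxt)
        self.path.append(name)
        for query in sorted(self.accepting.get(nxt, ())):
            self.matches.append((query, "/" + "/".join(self.path)))

    def endElement(self, name):
        # End tag: return to the state of the parent element.
        self.stack.pop()
        self.path.pop()


def run_dfa(document: bytes, delta, accepting) -> List[Tuple[str, str]]:
    """Parse the document once and return (query name, matching path) pairs."""
    handler = DfaHandler(delta, accepting)
    xml.sax.parseString(document, handler)
    return handler.matches


# Hand-built DFA for the single query /a/b: 0 --a--> 1 --b--> 2 (accepting).
DELTA = {(0, "a"): 1, (1, "b"): 2}
ACCEPTING = {2: {"q1"}}
```

Calling `run_dfa(b"<a><b/><c><b/></c></a>", DELTA, ACCEPTING)` reports only the `b` that is a direct child of `a`; the nested `b` under `c` falls into the sink state.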
5DFA approach in general (cont)
- Linear XPath expression
- P ::= /N | //N | PP
- N ::= E | A | text() | text() = S | *
- where
- E - element label
- A - attribute label
- / - child axis
- // - descendant axis
- * - wild card
- S - constant string
What about predicates? They are decomposed into linear XPath expressions.
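The grammar for linear XPath expressions can be illustrated with a small tokenizer that splits an expression into (axis, node test) steps. This is a sketch under assumptions: the function name `parse_linear_xpath` is hypothetical, attribute labels are assumed to be written with a leading `@`, and string constants are assumed to be double-quoted with no spaces around `=`. Anything outside the grammar, such as a predicate, is rejected.

```python
import re
from typing import List, Tuple

# One step is (axis, node_test): axis is "/" (child) or "//" (descendant);
# node_test is an element label, "@attr", "*", "text()" or 'text()="S"'.
Step = Tuple[str, str]

_STEP = re.compile(r'(//|/)(text\(\)(?:="[^"]*")?|\*|@?[A-Za-z_][\w.-]*)')


def parse_linear_xpath(expr: str) -> List[Step]:
    """Split a linear XPath expression into its steps, left to right."""
    steps: List[Step] = []
    pos = 0
    while pos < len(expr):
        m = _STEP.match(expr, pos)
        if m is None:
            raise ValueError(f"not a linear XPath expression at offset {pos}: {expr!r}")
        steps.append((m.group(1), m.group(2)))
        pos = m.end()
    if not steps:
        raise ValueError("empty expression")
    return steps
```

For example, `parse_linear_xpath('/a//*/b')` yields the steps `("/", "a")`, `("//", "*")`, `("/", "b")`, while an expression with a bracketed predicate raises `ValueError` and would first have to be decomposed.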
6DFA approach in general (cont)
- Consider two XPath expressions
- /datasets/dataset[//tableHead//*/text()="Galaxy"]/title
- /datasets/dataset[/history]/tableHead/field
- Corresponding query tree
- D IN R/datasets/dataset
- H IN D/history
- T IN D/title (sax_f = true)
- TH IN D/tableHead (sax_f = true)
- N IN D//tableHead//*
- F IN TH/field
- V IN N/text()="Galaxy"
7Conversion of XPath expressions into NFA and DFA
Query tree
- X IN R/a
- Y IN X//*/b
- Z IN X/b/*
- U IN Z/d
[Figure: the query tree, the query NFA, and the query DFA for these expressions]
8Eager DFA vs. Lazy DFA
- A DFA is eager if it is obtained by the standard NFA-to-DFA conversion algorithm (Hopcroft and Ullman 1979).
- A DFA is lazy if it is constructed at run time, on demand. Initially it has a single state; whenever we attempt to make a transition into a missing state, we compute that state and update the transition.
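The on-demand construction can be sketched as a subset construction that computes a DFA state only when a transition into it is first attempted. This is an illustrative sketch, not the authors' code: the names `LazyDFA`, `step`, `matched`, and `run` are hypothetical, each NFA state is encoded as a pair (query index, number of steps matched), and the two queries are the ones from this deck's lazy DFA example, read here as /a//*/b and /a/b/*/d.

```python
from typing import Dict, FrozenSet, List, Tuple

Step = Tuple[str, str]         # (axis, node test); axis is "/" or "//"
NfaState = Tuple[int, int]     # (query index, number of steps matched)
DfaState = FrozenSet[NfaState]


class LazyDFA:
    """DFA over sets of NFA states, built one transition at a time."""

    def __init__(self, queries: List[List[Step]]) -> None:
        self.queries = queries
        # Initially the DFA has a single state.
        self.start: DfaState = frozenset((q, 0) for q in range(len(queries)))
        self.states = {self.start}
        self.delta: Dict[Tuple[DfaState, str], DfaState] = {}

    def step(self, state: DfaState, label: str) -> DfaState:
        key = (state, label)
        if key not in self.delta:          # missing transition: compute it now
            nxt = set()
            for q, pos in state:
                steps = self.queries[q]
                if pos == len(steps):      # query already fully matched
                    continue
                axis, test = steps[pos]
                if axis == "//":           # descendant axis: may skip this element
                    nxt.add((q, pos))
                if test in ("*", label):   # node test matches: advance one step
                    nxt.add((q, pos + 1))
            target = frozenset(nxt)
            self.states.add(target)
            self.delta[key] = target
        return self.delta[key]

    def matched(self, state: DfaState) -> List[int]:
        """Indexes of the queries that are fully matched in this state."""
        return sorted(q for q, pos in state if pos == len(self.queries[q]))


def run(dfa: LazyDFA, events: List[Tuple[str, str]]) -> List[Tuple[str, List[int]]]:
    """Feed ("start", label) / ("end", label) events; return (path, queries) hits."""
    stack, path, hits = [dfa.start], [], []
    for kind, label in events:
        if kind == "start":
            stack.append(dfa.step(stack[-1], label))
            path.append(label)
            found = dfa.matched(stack[-1])
            if found:
                hits.append(("/" + "/".join(path), found))
        else:
            stack.pop()
            path.pop()
    return hits


QUERIES = [
    [("/", "a"), ("//", "*"), ("/", "b")],             # /a//*/b
    [("/", "a"), ("/", "b"), ("/", "*"), ("/", "d")],  # /a/b/*/d
]
# Events for the document <a><b><b><d/></b></b></a>
EVENTS = [("start", "a"), ("start", "b"), ("start", "b"), ("start", "d"),
          ("end", "d"), ("end", "b"), ("end", "b"), ("end", "a")]
```

Running `run(LazyDFA(QUERIES), EVENTS)` materializes only the states that this particular document reaches; processing the same document again reuses the stored transitions and adds no new states.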
9Eager DFA
- P = p_0 // p_1 // ... // p_k
- p_i = N_1 / N_2 / ... / N_{n_i}
- k = number of //'s
- n_i = length of p_i, i = 0, ..., k
- m = max number of *'s in each p_i
- n = length (or depth) of P, i.e. n = n_0 + n_1 + ... + n_k
- s = alphabet size |Σ|
Theorem. Given a linear XPath expression P, define prefix(P) = n_0, and body(P) = [formula not preserved in this transcript] when k > 0, and body(P) = 1 when k = 0. Then the eager DFA for P has at most prefix(P) + body(P) states. In particular, if m = 0 and k ≤ 1, then the DFA has at most (n + 1) states.
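The parameters in the theorem can be read off an expression mechanically. The sketch below assumes the expression is already split into (axis, node test) steps, and the function names `eager_parameters` and `simple_eager_bound` are hypothetical. It only evaluates the special case stated in the theorem (m = 0 and k ≤ 1); the general body(P) formula is not reproduced here.

```python
from typing import List, Optional, Tuple

Step = Tuple[str, str]  # (axis, node test); axis is "/" or "//"


def eager_parameters(steps: List[Step]) -> Tuple[int, List[int], int, int]:
    """Return (k, [n_0, ..., n_k], m, n) for a linear XPath expression."""
    segments: List[List[str]] = [[]]      # p_0 may be empty if P starts with //
    for axis, test in steps:
        if axis == "//":                  # each // starts a new segment p_i
            segments.append([])
        segments[-1].append(test)
    k = len(segments) - 1
    lengths = [len(seg) for seg in segments]
    m = max(seg.count("*") for seg in segments)
    n = sum(lengths)
    return k, lengths, m, n


def simple_eager_bound(steps: List[Step]) -> Optional[int]:
    """n + 1 when m = 0 and k <= 1; None when that special case does not apply."""
    k, _, m, n = eager_parameters(steps)
    return n + 1 if m == 0 and k <= 1 else None
```

For /a/b//c this gives k = 1, n_0 = 2, n_1 = 1, m = 0, n = 3, so the special case bounds the eager DFA at 4 states; for /a//*/b the wildcard makes m = 1 and the special case no longer applies.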
10Lazy DFA. Example
Queries
- /a//*/b
- /a/b/*/d
Sample XML document
<a>
  <b>
    <b>
      <d/>
    </b>
  </b>
</a>
[Figure: the lazy DFA, with states numbered 1 to 8 and transitions labeled a, b, and d]
11Lazy DFA
Graph schema (based on DTD)
- d = the maximum number of simple cycles that a simple path can intersect
- D = the total number of nonempty, simple paths starting at the root
- d = 2, D = 13
12Lazy DFA (cont)
- Theorem. Consider a graph schema with parameters d and D, and let Q be a set of XPath expressions of maximum depth n. Then on any XML input satisfying the schema, the lazy DFA has at most 1 + D(1 + n)^d states.
- Corollary. The number of states of the lazy DFA does not depend on the number of XPath expressions, only on their depth.
- Example: n = 10 and the number of XPath expressions is 100,000.
- Eager DFA may have about 2^100,000 states
- Lazy DFA will have at most 1574 states
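The figure of 1574 states follows from the bound 1 + D(1 + n)^d with the graph schema values d = 2 and D = 13 and query depth n = 10. A short check, with the function name `lazy_dfa_state_bound` chosen only for illustration:

```python
def lazy_dfa_state_bound(D: int, d: int, n: int) -> int:
    """Upper bound 1 + D * (1 + n)**d on the number of lazy DFA states."""
    return 1 + D * (1 + n) ** d


# d = 2 and D = 13 come from the graph schema; n = 10 is the maximum query depth.
bound = lazy_dfa_state_bound(D=13, d=2, n=10)   # 1 + 13 * 11**2
```

The number of XPath expressions does not appear among the arguments, which is the content of the corollary.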
13Lazy DFA. Implementation
- To process the XML stream, the implementation uses the SAX model
- The subset of XPath considered in the implementation:
- No text() or attribute value tests
- Only child and descendant axes
- All predicates of a query must fire before the target element
14Restrictions of the implementation
XPath queries
1. All predicates fire before the target element
//courses[level]/section
2. Predicates fire between the starting and closing tags of the target element
//courses[days]/section
3. Predicates fire after the target element
//courses[credits]/section
Sample XML document
<courses>
  <course>367-203</course>
  <title>MEDIA WORKSHOP</title>
  <level>U</level>
  <section>
    <section>Se 101</section>
    <days>T</days>
    <hours>
      <start>1:30pm</start>
      <end>5:20pm</end>
    </hours>
  </section>
  <credits>1-3</credits>
</courses>
15Processing attributes
- When processing a stream, all attributes are converted into elements
Input:
<section_listing>
  <section name="Se 101" description=""/>
  <hours start="1:30pm" end="5:20pm"/>
</section_listing>
Converted:
<section_listing>
  <section>
    <_at_name>Se 101</_at_name>
    <_at_description/>
  </section>
  <hours>
    <_at_start>1:30pm</_at_start>
    <_at_end>5:20pm</_at_end>
  </hours>
</section_listing>
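One way to sketch this conversion is a SAX filter that sits between the parser and the consumer and re-emits each attribute as a child element. This is an assumption about how such a step could be written with Python's standard `xml.sax` module, not the implementation described in the slides. The class name `AttributesToElements` and the helper `convert` are hypothetical, and the `_at_` prefix simply follows the element names in the example above.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator
from xml.sax.xmlreader import AttributesImpl


class AttributesToElements(XMLFilterBase):
    """SAX filter that turns every attribute into a prefixed child element."""

    def __init__(self, parent, prefix: str = "_at_") -> None:
        super().__init__(parent)
        self.prefix = prefix  # assumed naming scheme, taken from the example

    def startElement(self, name, attrs):
        # Forward the element itself without any attributes.
        super().startElement(name, AttributesImpl({}))
        # Emit one child element per attribute, before the element's own content.
        for attr_name in attrs.getNames():
            child = self.prefix + attr_name
            super().startElement(child, AttributesImpl({}))
            super().characters(attrs.getValue(attr_name))
            super().endElement(child)


def convert(document: bytes) -> str:
    """Return the document with attributes rewritten as elements."""
    out = io.StringIO()
    reader = AttributesToElements(xml.sax.make_parser())
    reader.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    reader.parse(io.BytesIO(document))
    return out.getvalue()
```

Because the filter forwards ordinary SAX events, a downstream automaton that only understands element labels can treat attribute tests as child-element tests without further changes.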
16Testing
- Reference implementation: Galax 1.0.3.5
- Testing XML stream: World geographic database, http://www.cs.washington.edu/research/xmldatasets/data/mondial/mondial-3.0.xml (1 MB)
- Maximum XML depth of the stream was 6
- Number of queries was 14
- The depth of queries ranged from 1 to 5
- The number of predicates ranged from 0 to 3
- The depth of predicates ranged from 1 to 4

Method used    Number of states used
NFA            22
Eager DFA      87
Lazy DFA       22
17Reference
- Green, T. J. et al. 2004. Processing XML Streams with Deterministic Automata and Stream Indexes. ACM Transactions on Database Systems, 12/2004.
- Altinel, M. and Franklin, M. 2000. Efficient filtering of XML documents for selective dissemination. In Proceedings of VLDB. Cairo, Egypt.
- Chen, J. et al. 2000. NiagaraCQ: a scalable continuous query system for internet databases. In Proceedings of the ACM/SIGMOD Conference on Management of Data.
- Diao, Y. and Franklin, M. 2003. Query processing for high-volume XML message brokering. In Proceedings of VLDB. Berlin, Germany.
- Hopcroft, J. E. and Ullman, J. D. 1979. Introduction to automata theory, languages, and computation.