Efficient Querying of XML Data Using Structural Joins

1
Efficient Querying of XML Data Using Structural
Joins
2
Content
  • A quick look at XML query languages
  • Lore - an example of a native XML database
  • DB2 - an example of RDBMSs support for XML
  • On supporting containment queries in RDBMS
  • The Tree-Merge and Stack-Tree algorithms
  • The PathStack algorithm

3
XML
  • Replacement for HTML
  • Focus is on storing and processing.
  • Electronic Data Interchange
  • Querying becomes desirable.
  • People with many XML documents actually have an
    XML database.

4
XML query languages
  • XML-QL
  • Influenced by SQL
  • Submitted to W3C (lost favor to XQuery)
  • XPath
  • used in XSLT
  • the basis for path expressions in XQuery
  • XQuery
  • A W3C working draft (version 1.0)
  • Based on Quilt (which in turn was mainly
    influenced by XML-QL and Lorel)
  • No updates, limited IR features

5
XPath
  • */para
  • selects all para grandchildren of the context
    node
  • /doc/chapter[5]/section[2]
  • selects the second section of the fifth chapter
    of the doc
  • chapter//para
  • selects the para element descendants of the
    chapter element children of the context node
  • para[@type="warning"]
  • selects all para children of the context node
    that have a type attribute with value warning
  • chapter[title="Introduction"]
  • selects the chapter children of the context node
    that have one or more title children with
    string-value equal to Introduction
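
The path expressions above can be tried out with a minimal sketch like the following. It assumes Python's standard xml.etree.ElementTree module, which implements only a subset of XPath, and a small made-up document (the element contents are illustrative, not from the slides). The context node is the doc element.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
DOC = """
<doc>
  <chapter><title>Introduction</title><para type="warning">w1</para>
    <section><para>p1</para></section></chapter>
  <chapter><title>History</title><para>p2</para></chapter>
</doc>
"""
root = ET.fromstring(DOC)  # context node: <doc>

# */para : para grandchildren of the context node
grandchild_paras = [p.text for p in root.findall("*/para")]

# chapter//para : para descendants of the chapter children
descendant_paras = [p.text for p in root.findall("chapter//para")]

# para[@type="warning"], evaluated below each chapter
warnings = [p.text for p in root.findall("chapter/para[@type='warning']")]

# chapter[title="Introduction"] : chapters with a matching title child
intro = [c.find("title").text for c in root.findall("chapter[title='Introduction']")]

# chapter[2] : the second chapter child (positions are 1-based)
second = [c.find("title").text for c in root.findall("chapter[2]")]

print(grandchild_paras, descendant_paras, warnings, intro, second)
```

Note how */para skips p1, which is one level deeper, while chapter//para finds it.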

6
XQuery
  • document("books.xml")//chapter/title
  • Finds all titles of chapters in document
    books.xml
  • document("bib.xml")//book[publisher =
    "Addison-Wesley" AND @year > "1991"]
  • Finds all books in document bib.xml published by
    Addison-Wesley after 1991
  • <results>
  •   FOR $t IN distinct(document("prices.xml")/prices/book/title)
  •   LET $p := avg(document("prices.xml")/prices/book[title=$t]/price)
  •   WHERE (document("bib.xml")/book[title=$t]/publisher) = "Addison-Wesley"
  •   RETURN
  •     <result> $t <avg> $p </avg> </result>
  • </results>
  • Returns the title and average price of all books
    published by Addison-Wesley

7
XML documents as trees
  • <book year="2000">
  •   <title> XML </title>
  •   <authors>
  •     <author> Bill </author>
  •     <author> Jake </author>
  •   </authors>
  •   <chapter>
  •     <head> History </head>
  •     <section>
  •       <head> </head>
  •       <section> </section>
  •     </section>
  •     <section> </section>
  •   </chapter>
  •   <chapter> </chapter>
  • </book>
  • Order of nodes is important

8
XML documents as trees
  • <book year="2000">
  •   <title id="id1"> XML </title>
  •   <authors>
  •     <author> Bill </author>
  •     <author> Jake </author>
  •   </authors>
  •   <chapter>
  •     <head> History </head>
  •     <section>
  •       <head> </head>
  •       <section idref="id1"> </section>
  •     </section>
  •     <section> </section>
  •   </chapter>
  •   <chapter> </chapter>
  • </book>
  • Order of nodes is important

[Figure: the document drawn as a tree. The root book has children year (2000), title (XML), authors (author Bill, author Jake) and two chapter nodes; the first chapter has a head (History) and section children, which in turn have head and section children.]
9
Executing queries
  • How does one execute a complex query?
  • Parse the query (i.e. break it down to basic
    operations).
  • Let a query optimizer devise a corresponding
    physical query plan.
  • Execute the required basic operations combining
    the intermediate results as you go.
  • The most common basic operations are:
  • Finding nodes satisfying a given predicate on
    their value.
  • Finding nodes satisfying a given structural
    relationship.

10
XML databases
  • XML is semi-structured: data items may have
    missing elements or multiple occurrences of the
    same element. A document may not even have a DTD.
  • Native semi-structured databases
  • X-Hive, Lore
  • RDBMS
  • Oracle
  • SQL-Server
  • DB2
  • All added support for XML

11
Semi-structured XML databases
  • There aren't many around
  • Store XML files plus indexes
  • Usually build (and store) most or all of the tree
  • Usually solve path expressions by pointer-chasing

12
Lore: An example of a native semi-structured
database
13
Lore - sample database
  • Select x
  • From DBGroup.Member x
  • Where exists y in x.age : y < 30

14
Lore - data model
  • Called the Object Exchange Model
  • The data model is a graph (though the reference
    edges are marked as such).
  • Each vertex is an object with a unique object
    identifier.
  • Atomic objects have no outgoing edges and contain
    values (like strings, gifs, audio etc.)
  • All other objects may have outgoing edges.
  • Tag-Names (labels) are attached to the edges, not
    the vertices.
  • Objects may optionally have aliases (names).
  • As is obvious this is just another view of our
    XML tree

15
Lore - indexes
  • Vindex (value index) - implemented as a B-tree
  • Supports finding all atomic objects with a given
    incoming edge label satisfying a given predicate.
  • Lindex (label index) - implemented using
    extendible hashing
  • Supports finding all parents of a given object
    via an edge with a given label.
  • Bindex (edge index)
  • Supports finding all parent-child pairs connected
    via a given label. This is useful for locating
    edges with rare labels.
  • In addition there are some other indexes (not
    important to us).
  • Note that we need more indexes than in a
    relational database

16
Lore - statistics (partial list)
  • For each labeled path p of length ≤ k (usually
    k = 1)
  • The total number of instances of p, denoted |p|
  • The total number of distinct objects reachable
    via p, denoted |p|^d
  • The total number of l-labeled edges going out of
    p, denoted |p^l|
  • The total number of l-labeled edges coming into
    p, denoted |p_l|

17
Lore - path expressions (simplified)
  • Simple path expressions
  • x.l y
  • Path expressions
  • an ordered list of simple path expressions
  • x.l y, y.l2 z
  • Path expressions logical plan
  • x.B y, y.C z, z.D v

18
Lore - basic physical operators (slightly edited)
  • Scan(father, label, son)
  • Finds all the sons of a given father (through a
    given label).
  • Does pointer-chasing
  • Lindex(father, label, son)
  • Finds all the fathers of a given son (through a
    given label).
  • Uses the Lindex
  • Bindex(label, father, son)
  • Finds all the father-son pairs connected by a
    given label.
  • Uses the Bindex
  • Vindex(label, operator, value, atomic-object)
  • Finds all the atomic objects with a given
    incoming label satisfying the given
    predicate.
  • Uses the Vindex
  • Name(alias, node)
  • Verifies that the specified node has the given
    alias.

19
Lore - physical path subplans

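[Figure: three physical subplans, captioned "x and y are unbound", "y is bound" and "x and y are unbound"]
  • The estimated hit-rate (per x) of Scan(x, C, y)
    is |B^C| / |B|^d (the number of C-labeled edges
    going out of B, per distinct object reachable
    via B)
  • The estimated hit-rate (per y) of Lindex(x, C,
    y) is |C_B| / |C|^d (the number of B-labeled
    edges coming into C, per distinct object
    reachable via C)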

20
Lore - sample logical plan
  • Select x From DBGroup.Member x Where exists y in
    x.age : y < 30
  • Glue nodes are pivot points, they recursively
    evaluate the cost of evaluating their sons in
    left-right or right-left order.

21
Lore - sample physical subplans
  • (a) corresponds to a possible left-right plan of
    the top glue
  • (b) corresponds to a possible left-right plan of
    the right glue
  • (c) corresponds to a possible right-left plan of
    the right glue
  • (d) corresponds to a possible right-left plan of
    the top glue, using (c)

22
Lore - path expressions strategies
  • A higher level view of path expressions solving
  • Top-Down
  • Look for all Member objects in DBGroup and for
    each one look for Age subobjects with a value <
    30.
  • uses Scan
  • Bottom-up
  • Look for all atomic objects with value < 30 and
    for each one walk up the tree using only
    Age-labeled followed by Member-labeled edges.
  • uses Vindex and then Lindex
  • Hybrid
  • Do Top-Down part of the way and Bottom-Up part of
    the way.
  • Select x From DBGroup.Member x Where exists y in
    x.age : y < 30

23
Lore - path strategies (continued)
  • Top-Down is better when there are few paths
    satisfying the required structure, but many
    objects satisfying the predicate.
  • Bottom-Up is better when there are a few objects
    satisfying the predicate but many paths
    satisfying the required structure.
  • Hybrid is better when the fan-out degree (going
    down), increases at the same time the fan-in
    degree (going up) does.

24
DB2: An example of RDBMS support for XML
25
DB2 - XML support
  • XML column
  • An entire XML document is stored as a column in a
    table.
  • may be XMLCLOB, XMLVARCHAR or XMLFile.
  • You define which XML elements or attributes
    should be extracted to indexed columns in side
    tables.
  • UDFs are provided for inserting, updating and
    selecting fragments of a document.
  • XML collection
  • Compose an XML document from existing DB2 tables.
  • Decompose an XML document and retrieve some of it
    into a set of DB2 tables.
  • Basically a conversion mechanism.
  • Stored procedures automate most of the work.

26
DB2 - a nice diagram...
27
DB2 - example Data Access Definition
28
DB2 - example DAD (continued)
29
DB2 - searching XML documents
  • Well, whatever is in the side tables is queried
    using SQL.
  • What about things not in any side table?
  • A loosely coupled IR engine (part of the DB2 Text
    Extender) is called using a UDF to take care of
    this.
  • The UDFs use a syntax compatible with XPath.

30
DB2 - conclusions (in a nutshell)
  • Pros
  • Integrated solution which automates a lot of
    work.
  • We can ask queries that mix data from XML and the
    regular database tables (aka web-supported
    database queries and database-supported web
    queries).
  • Cons
  • One has to manually define the mappings between
    the XML documents and the tables.
  • Is it fast enough?

31
On Supporting Containment Queries in RDBMS
Zhang, Naughton, DeWitt, Luo, Lohman ACM SIGMOD
2001
32
Article goals
  • Given that a lot of XML data is (and will
    probably be) stored in RDBMS which is the best
    way to support containment queries?
  • Using a loosely coupled IR engine?
  • OR
  • Using the native tables and query mechanisms of
    the RDBMS?

33
Structural relationships in trees
[Figure: a 15-node tree in which every node is labeled with its preorder and postorder numbers (1 to 15)]
  • Note that x is a descendant of y if and only if
    preorder(x) > preorder(y) and postorder(x) <
    postorder(y)
  • y is the father of x if, in addition, level(x) =
    level(y) + 1
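
The numbering test above can be illustrated with a minimal sketch. The nested-tuple tree encoding and the function names are assumptions made for this example only.

```python
def number_tree(tree):
    """tree is a nested (label, [children]) tuple with unique labels.
    Returns {label: (preorder, postorder, level)}."""
    info = {}
    counter = {"pre": 0, "post": 0}

    def visit(node, level):
        label, children = node
        counter["pre"] += 1          # numbered on the way down
        pre = counter["pre"]
        for child in children:
            visit(child, level + 1)
        counter["post"] += 1         # numbered on the way back up
        info[label] = (pre, counter["post"], level)

    visit(tree, 0)
    return info


def is_descendant(x, y):
    """x, y are (pre, post, level) triples; True if x lies below y."""
    return x[0] > y[0] and x[1] < y[1]


def is_child(x, y):
    """True if y is the father of x."""
    return is_descendant(x, y) and x[2] == y[2] + 1


book = ("book", [("title", []),
                 ("authors", [("author1", []), ("author2", [])]),
                 ("chapter", [("head", []), ("section", [])])])
info = number_tree(book)
print(info["authors"], info["author1"])
```

Both tests are constant-time comparisons, which is what lets a join algorithm decide structural relationships without walking the tree.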

34
Structural relationships in XML
  • The previous observations are true even if we
    look at any monotone functions of the preorder
    and the postorder numbers.
  • The start and end position of an element in an
    XML document are exactly such monotone functions.
  • In other words, we can use a small extension of
    the regular IR inverted-index to also solve
    structural relationships!
  • Note that we have a problem of adapting the
    numbers if the document changes.

35
The inverted indexes
  • An Elements index (E-Index): holds, for each XML
    element, the docno, begin, end and level of every
    occurrence of that element.
  • A Text index (T-Index): holds, for each text word,
    the docno, wordno and level of every occurrence
    of the word.
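
A minimal sketch of how the two indexes might be built follows. It assumes that every start tag, end tag and word occupies one position in the document, that the root element has level 0, and that a word sits one level below its enclosing element. The regex tokenizer is a simplification for illustration (no self-closing tags, comments or entities), not the indexer used in the paper.

```python
import re
from collections import defaultdict

# One token per start tag, end tag, or whitespace-delimited word.
TOKEN = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^>]*>|([^\s<>]+)")


def build_indexes(docno, xml_text, e_index=None, t_index=None):
    """E-index: term -> [(docno, begin, end, level)]
    T-index: term -> [(docno, wordno, level)]"""
    e_index = defaultdict(list) if e_index is None else e_index
    t_index = defaultdict(list) if t_index is None else t_index
    open_elems = []                      # stack of (tag, begin position)
    pos = 0
    for m in TOKEN.finditer(xml_text):
        pos += 1
        closing, tag, word = m.groups()
        if word is not None:             # a text word
            t_index[word.lower()].append((docno, pos, len(open_elems)))
        elif closing:                    # an end tag closes the innermost element
            name, begin = open_elems.pop()
            e_index[name].append((docno, begin, pos, len(open_elems)))
        else:                            # a start tag
            open_elems.append((tag, pos))
    for postings in e_index.values():    # keep each list in (docno, begin) order
        postings.sort()
    return e_index, t_index


e_index, t_index = build_indexes(
    1, "<book><title>XML basics</title><author>Bill</author></book>")
print(dict(e_index))
print(dict(t_index))
```

Because begin and end come from the same counter as wordno, "word inside element" becomes a plain interval test: begin < wordno < end.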

36
Experiment plan
  • Compare the following two systems
  • An inverted list engine supporting containment
    queries on XML data.
  • The engine was built (due to lack of a commercial
    one).
  • The code was written in C++ and the
    inverted-indexes were stored in a B-tree with
    each list stored as a record.
  • Each list is in ascending order of docno, begin
    (or wordno).
  • An in-house algorithm was developed for
    evaluating simple containment queries.
  • A full RDBMS approach (tried DB2 7.1 and
    SQL-Server 7.0)
  • The E-index and T-index are stored as the
    following tables: ELEMENTS(term, docno, begin,
    end, level) and TEXTS(term, docno, wordno, level)
  • Note that we do not use the IR engine of the
    RDBMS.

37
Using the inverted indexes tables
  • E//"T"
  • select * from ELEMENTS e, TEXTS t
  • where e.term = 'E' and t.term = 'T'
  • and e.docno = t.docno
  • and e.begin < t.wordno and t.wordno < e.end
  • E="T"
  • select * from ELEMENTS e, TEXTS t
  • where e.term = 'E' and t.term = 'T'
  • and e.docno = t.docno
  • and e.begin + 1 = t.wordno and t.wordno + 1 =
    e.end
  • In a similar fashion we solve elements-only
    queries, father-son queries, and word-distance
    queries.

(how will this look for E//E ?)
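
One plausible answer to the question above, as a minimal sketch using Python's built-in sqlite3 module rather than DB2 or SQL Server: E//E becomes a self-join of ELEMENTS in which the descendant's interval lies inside the ancestor's, and E/E adds the level test. The sample rows and function names are made up for illustration. The begin and end columns are quoted because they are SQL keywords.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript('''
CREATE TABLE ELEMENTS(term TEXT, docno INT, "begin" INT, "end" INT, level INT);
CREATE TABLE TEXTS(term TEXT, docno INT, wordno INT, level INT);
''')
# Index rows for <book><title>XML basics</title><author>Bill</author></book>
con.executemany("INSERT INTO ELEMENTS VALUES (?,?,?,?,?)",
                [("book", 1, 1, 9, 0), ("title", 1, 2, 5, 1), ("author", 1, 6, 8, 1)])
con.executemany("INSERT INTO TEXTS VALUES (?,?,?,?)",
                [("xml", 1, 3, 2), ("basics", 1, 4, 2), ("bill", 1, 7, 2)])


def elem_contains_word(elem, word):
    """E//"T": returns (docno, element begin, wordno) rows."""
    return con.execute('''
        SELECT e.docno, e."begin", t.wordno FROM ELEMENTS e, TEXTS t
        WHERE e.term = ? AND t.term = ? AND e.docno = t.docno
          AND e."begin" < t.wordno AND t.wordno < e."end"
        ORDER BY e.docno, e."begin", t.wordno''', (elem, word)).fetchall()


def elem_contains_elem(anc, desc, direct=False):
    """E//E, or E/E when direct=True: (docno, anc begin, desc begin) rows."""
    sql = '''
        SELECT a.docno, a."begin", d."begin" FROM ELEMENTS a, ELEMENTS d
        WHERE a.term = ? AND d.term = ? AND a.docno = d.docno
          AND a."begin" < d."begin" AND d."end" < a."end"'''
    if direct:
        sql += " AND a.level + 1 = d.level"
    sql += ' ORDER BY a.docno, a."begin", d."begin"'
    return con.execute(sql, (anc, desc)).fetchall()


print(elem_contains_word("book", "bill"))
print(elem_contains_elem("book", "title", direct=True))
```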
38
Experiment setup
  • The data sets

39
Experiment setup (continued)
  • The queries are all simple queries of the form
    E//T, E//E, E/T or E/E

40
Experiment results
41
Results analysis
  • Why did DB2 perform better in QS4, QD4 and QG5?
  • Remember that each list in the inverted engine
    is stored as one record!
  • Why did DB2 perform worse in all the other
    queries?
  • Bad optimizer decisions?
  • Is I/O more expensive (locking, security, etc.)?
  • Other factors?
  • It turns out that the queries are CPU-bound!
  • Further investigation found out that it was the
    merge algorithm.

42
DB2 merge algorithms
  • When joining on
  • a.docno = d.docno and a.begin < d.wordno
    and d.wordno < a.end
  • Standard Merge-Join only uses the a.docno =
    d.docno predicate (since it does one comparison,
    using one index per table), and applies the rest
    of the condition to each matching pair.
  • Hash-Join only uses the a.docno = d.docno
    predicate (since it cannot handle inequalities
    anyway), and thus performs similarly to the
    classical merge join.
  • Index nested-loop join looks, for each row in the
    outer table, for all rows in the inner table
    index that lie between a start-key and
    a stop-key.
  • Assuming the outer table is ELEMENTS and the
    inner table is TEXTS
  • start-key: term = value and docno = outer.docno
    and wordno > outer.begin
  • stop-key: term = value and docno = outer.docno
    and wordno < outer.end

43
The Multi-Predicate Merge Join
  • begin_desc = Dlist->firstNode; OutputList =
    NULL
  • for (a = Alist->firstNode; a; a = a->nextNode)
  •   d = begin_desc
  •   while (d.docno < a.docno) d = d->nextNode
  •   if (a.docno < d.docno) continue
  •   while (d.begin < a.begin) d = d->nextNode
  •   begin_desc = d
  •   while (d.begin < a.end) // implies d.end <
      a.end
  •     if (a.docno < d.docno) break
  •     append (a,d) to OutputList
  •     d = d->nextNode

44
Comparison of the merge algorithms
  • It seems like the NLJ algorithm will usually
    compare less items, BUT
  • It has to spend time on index seeks!
  • It uses random access so cache utilization is
    poor.

45
MPMGJN vs. traditional joins - statistics
Note: DB2 did not choose NLJ for QG4
46
Structural Joins: A Primitive for Efficient XML
Query Pattern Matching
Al-Khalifa, Jagadish, Koudas, Patel, Srivastava,
Wu, ICDE 2002
47
Structural-Join algorithms
Based on the (docId, startPos, endPos, level)
information of XML elements and attributes. Given
two lists of potential ancestors and potential
descendants, both in ascending order of
(docId, startPos), the following structural join
algorithms are presented:
  • Tree-Merge-Anc (aka MPMGJN)
  • Tree-Merge-Desc
  • Stack-Tree-Desc
  • Stack-Tree-Anc
  • The *-Anc algorithms produce the output sorted
    by the ancestors.
  • The *-Desc algorithms produce the output sorted
    by the descendants.
  • The sorting variant to use depends on the way an
    optimizer chooses to compose a complex query.

48
Tree-Merge-Anc
  • begin_desc = Dlist->firstNode; OutputList =
    NULL
  • for (a = Alist->firstNode; a; a = a->nextNode)
  •   d = begin_desc
  •   while (d.startPos < a.startPos) d =
      d->nextNode
  •   begin_desc = d
  •   while (d.startPos < a.endPos) // implies
      d.endPos < a.endPos
  •     // father-son: append only if a.level + 1 ==
        d.level
  •     append (a,d) to OutputList
  •     d = d->nextNode
  • Note: For ease of exposition, we assume that
    Alist and Dlist have the same docId.
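
The pseudo-code above can be sketched in runnable form as follows. This is an illustration under stated assumptions, not the paper's implementation: the lists are Python lists of a hypothetical Node tuple sorted by start, all nodes share one docId, and the parent_child flag switches to the father-son variant.

```python
from typing import NamedTuple


class Node(NamedTuple):
    start: int
    end: int
    level: int
    name: str


def tree_merge_anc(alist, dlist, parent_child=False):
    """Returns (ancestor, descendant) name pairs, sorted by ancestor."""
    out = []
    begin = 0                                  # plays the role of begin_desc
    for a in alist:
        # Descendants starting before a cannot match a or any later ancestor.
        while begin < len(dlist) and dlist[begin].start < a.start:
            begin += 1
        i = begin
        while i < len(dlist) and dlist[i].start < a.end:
            d = dlist[i]
            if not parent_child or a.level + 1 == d.level:
                out.append((a.name, d.name))
            i += 1                             # always advance, even on a level mismatch
    return out


# a1 contains a2 and d3; a2 contains d1 and d2.
alist = [Node(1, 10, 1, "a1"), Node(2, 7, 2, "a2")]
dlist = [Node(3, 4, 3, "d1"), Node(5, 6, 3, "d2"), Node(8, 9, 2, "d3")]
anc_desc = tree_merge_anc(alist, dlist)
father_son = tree_merge_anc(alist, dlist, parent_child=True)
print(anc_desc)
print(father_son)
```

In the father-son case the inner loop still scans every descendant inside a, which is where the quadratic worst case comes from.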

49
Analysis of Tree-Merge-Anc
  • Ancestor-Descendant structural relationships
  • O(|Alist| + |Dlist| + |OutputList|)
  • Since first while loop increases d, and second
    while loop increases output or a.
  • Father-Son structural relationships
  • O(|Alist| · |Dlist|)

Can sub-sorting on levelNum help ?
50
Tree-Merge-Desc
  • begin_anc = Alist->firstNode; OutputList = NULL
  • for (d = Dlist->firstNode; d; d = d->nextNode)
  •   a = begin_anc
  •   while (a.endPos < d.startPos) a =
      a->nextNode
  •   begin_anc = a
  •   while (a.startPos < d.startPos)
  •     // father-son: append only if a.level + 1 ==
        d.level
  •     if (d.endPos < a.endPos) append (a,d) to
        OutputList
  •     a = a->nextNode
  • Note: For ease of exposition, we assume that
    Alist and Dlist have the same docId.
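
A runnable sketch of the same idea, under the same assumptions as the pseudo-code (one docId, both lists sorted by start). The Node tuple and the parent_child flag are illustrative names, not taken from the paper.

```python
from typing import NamedTuple


class Node(NamedTuple):
    start: int
    end: int
    level: int
    name: str


def tree_merge_desc(alist, dlist, parent_child=False):
    """Returns (ancestor, descendant) name pairs, sorted by descendant."""
    out = []
    begin = 0                                  # plays the role of begin_anc
    for d in dlist:
        # Ancestors that end before d starts cannot contain d or any later d.
        while begin < len(alist) and alist[begin].end < d.start:
            begin += 1
        i = begin
        while i < len(alist) and alist[i].start < d.start:
            a = alist[i]
            if d.end < a.end and (not parent_child or a.level + 1 == d.level):
                out.append((a.name, d.name))
            i += 1
    return out


alist = [Node(1, 10, 1, "a1"), Node(2, 7, 2, "a2")]
dlist = [Node(3, 4, 3, "d1"), Node(5, 6, 3, "d2"), Node(8, 9, 2, "d3")]
anc_desc = tree_merge_desc(alist, dlist)
father_son = tree_merge_desc(alist, dlist, parent_child=True)
print(anc_desc)
print(father_son)
```

Note that begin can only skip a prefix of Alist: one long-lived ancestor at the front pins it there, so every descendant rescans everything after it.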

51
Analysis of Tree-Merge-Desc
  • Ancestor-Descendant and Father-Son structural
    relationships
  • O(|Alist| · |Dlist|).
  • Works in linear time on most real data.

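Worst-case input (a0 contains all the other elements):

Alist   begin   end
a0      1       4n+2
a1      2       5
a2      6       9
a3      10      13
...
an      4n-2    4n+1

Dlist   begin
d1      3
d2      7
d3      11
...
dn      4n-1

[Figure: a0 drawn as the father of a1, a2, a3, ..., an, with each di the child of ai]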
52
Stack-Tree algorithms
  • Motivation
  • A depth-first traversal of a tree can be
    performed in linear time, using a stack as large
    as the height of the tree.
  • An ancestor-descendant structural relationship is
    manifested as the ancestor appearing higher on
    the stack than the descendant.
  • Unfortunately, a depth-first traversal requires
    going over all the tree.

53
Stack-Tree-Desc
  • a = Alist->firstNode; d = Dlist->firstNode;
    OutputList = NULL
  • while (lists are not empty)
  •   e = (a.startPos < d.startPos) ? a : d
  •   while (e.startPos > stack->top.endPos)
      stack->pop()
  •   if (e == a) // remember that e.startPos >
      stack->top.startPos
  •     stack->push(a)
  •     a = a->nextNode
  •   else // e == d
  •     for each s in stack // Father-Son: instead, if
        (stack->top.level + 1 == d.level)
        append (stack->top, d)
  •       append (s, d) to OutputList
  •     d = d->nextNode
  • Note: For ease of exposition, we assume that
    Alist and Dlist have the same docId.
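
The stack-based join can be sketched as runnable code like this. It is a minimal illustration assuming a single docId and lists sorted by start; the Node tuple, the parent_child flag and the early-exit loop condition are choices made for this example.

```python
from typing import NamedTuple


class Node(NamedTuple):
    start: int
    end: int
    level: int
    name: str


def stack_tree_desc(alist, dlist, parent_child=False):
    """Returns (ancestor, descendant) name pairs, sorted by descendant."""
    out, stack = [], []            # the stack holds a chain of nested ancestors
    ai = di = 0
    # Once Alist is exhausted and the stack is empty, no further d can match.
    while di < len(dlist) and (ai < len(alist) or stack):
        d = dlist[di]
        take_a = ai < len(alist) and alist[ai].start < d.start
        e = alist[ai] if take_a else d
        while stack and stack[-1].end < e.start:
            stack.pop()            # these ancestors ended before e begins
        if take_a:
            stack.append(e)
            ai += 1
        else:
            if parent_child:
                # Only the innermost ancestor can be the father.
                if stack and stack[-1].level + 1 == d.level:
                    out.append((stack[-1].name, d.name))
            else:
                out.extend((s.name, d.name) for s in stack)
            di += 1
    return out


alist = [Node(1, 10, 1, "a1"), Node(2, 7, 2, "a2")]
dlist = [Node(3, 4, 3, "d1"), Node(5, 6, 3, "d2"), Node(8, 9, 2, "d3")]
anc_desc = stack_tree_desc(alist, dlist)
father_son = stack_tree_desc(alist, dlist, parent_child=True)
print(anc_desc)
print(father_son)
```

Each input node is read once and each ancestor is pushed and popped once, so no element is ever rescanned.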

54
Stack-Tree-Desc (father-son example)
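Input, in startPos order: a1 d1 a2 d2 ... an dn dn+1 dn+2 ... d2n

[Figure: nested elements a1, a2, ..., an, where each ai has two d children, di and d(2n+1-i); the stack holds a1 (bottom) up to an (top), and entries are popped when e.startPos > stack->top.endPos]

Output, sorted by descendant:
(a1,d1)
(a2,d2)
...
(an-1,dn-1)
(an,dn)
(an,dn+1)
(an-1,dn+2)
...
(a3,d2n-2)
(a2,d2n-1)
(a1,d2n)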
55
Analysis of Stack-Tree-Desc
  • O(|Alist| + |Dlist| + |OutputList|) for
    ancestor-descendant as well as father-son
    structural relationships.
  • Each Alist element is pushed once and popped
    once, so stack operations take O(|Alist|).
  • The inner for loop outputs a new pair each
    time, so its total time is O(|OutputList|).
  • When doing father-son structural joins, we do not
    even have a for loop.
  • The algorithm is non-blocking.
  • IO complexity is O(|Alist|/P + |Dlist|/P +
    |OutputList|/P) where P is the page size.
  • Each input page is read just once (and output
    sent as soon as it is computed).
  • The stack is as large as the tree height, so it
    is very reasonable to assume that it fits in RAM.

56
Stack-Tree-Anc
  • a = Alist->firstNode; d = Dlist->firstNode;
    OutputList = NULL
  • while (lists are not empty)
  •   e = (a.startPos < d.startPos) ? a : d
  •   while (e.startPos > stack->top.endPos)
  •     temp = stack->pop()
  •     if (stack->isEmpty())
  •       append temp->selfList to OutputList; append
          temp->inheritList to OutputList
  •     else
  •       append temp->inheritList to temp->selfList;
          append temp->selfList to stack->top->inheritList
  •   if (e == a) // remember that e.startPos >
      stack->top.startPos
  •     stack->push(a); a = a->nextNode
  •   else // e == d
  •     for each s in stack
  •       if (s == stack->bottom) append (s, d) to
          OutputList
  •       else append (s, d) to selfList associated
          with s
  •     d = d->nextNode
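
A runnable sketch of the self-list and inherit-list bookkeeping follows, assuming a single docId and lists sorted by start. The Node tuple and the parent_child flag are illustrative, and Python lists stand in for the linked lists with head and tail pointers, so the concatenation here copies instead of being a constant-time append.

```python
from typing import NamedTuple


class Node(NamedTuple):
    start: int
    end: int
    level: int
    name: str


def stack_tree_anc(alist, dlist, parent_child=False):
    """Returns (ancestor, descendant) name pairs, sorted by ancestor."""
    out, stack = [], []            # stack entries: [node, self_list, inherit_list]

    def pop():
        node, self_list, inherit_list = stack.pop()
        if not stack:              # nothing below can still produce earlier pairs
            out.extend(self_list)
            out.extend(inherit_list)
        else:                      # own pairs first, then inherited ones, handed down
            stack[-1][2].extend(self_list + inherit_list)

    ai = di = 0
    while di < len(dlist) and (ai < len(alist) or stack):
        d = dlist[di]
        take_a = ai < len(alist) and alist[ai].start < d.start
        e = alist[ai] if take_a else d
        while stack and stack[-1][0].end < e.start:
            pop()
        if take_a:
            stack.append([e, [], []])
            ai += 1
        else:
            for entry in (stack[-1:] if parent_child else stack):
                if parent_child and entry[0].level + 1 != d.level:
                    continue
                pair = (entry[0].name, d.name)
                # Pairs of the bottom entry can be output right away.
                (out if entry is stack[0] else entry[1]).append(pair)
            di += 1
    while stack:                   # flush whatever is still buffered
        pop()
    return out


alist = [Node(1, 10, 1, "a1"), Node(2, 7, 2, "a2")]
dlist = [Node(3, 4, 3, "d1"), Node(5, 6, 3, "d2"), Node(8, 9, 2, "d3")]
anc_desc = stack_tree_anc(alist, dlist)
father_son = stack_tree_anc(alist, dlist, parent_child=True)
print(anc_desc)
print(father_son)
```

The pairs for a2 are found before (a1,d3), yet they must come after it in ancestor order, which is why they wait in a1's inherit list until a1 is popped.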

57
Stack-Tree-Anc (father-son example)
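Input, in startPos order: a1 d1 a2 d2 ... an dn dn+1 dn+2 ... d2n
Entries are popped when e.startPos > stack->top.endPos

[Figure: the stack from top (an) to bottom (a1), showing the selfList and inheritList of each entry]
an: selfList (an,dn), (an,dn+1)
an-1: selfList (an-1,dn-1); inheritList (an,dn), (an,dn+1)
...
a2: selfList (a2,d2), (a2,d2n-1); inheritList (a3,d3), (a3,d2n-2), ..., (an,dn), (an,dn+1)
a1: selfList (a1,d1), (a1,d2n); inheritList (a2,d2), (a2,d2n-1), ..., (an,dn), (an,dn+1)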
58
Analysis of Stack-Tree-Anc
  • O(|Alist| + |Dlist| + |OutputList|) for
    ancestor-descendant as well as father-son
    structural relationships.
  • Assuming the lists are maintained as linked lists
    with head and tail pointers.
  • The algorithm is blocking (but only partially).
  • IO complexity is O(|Alist|/P + |Dlist|/P +
    |OutputList|/P) where P is the page size.
  • We cannot assume that all the lists fit in RAM.
  • All that we do with lists (except output) is
    appending.
  • We can page out a list and we need only keep its
    tail in RAM. So we need two extra pages in memory
    per stack entry - still a reasonable assumption.
  • We only need to know the address of the head of a
    list.
  • Each list page is thus paged out at most once,
    and paged back in only for output.

59
Experiment workload
  • Experimented with real XML data as well as
    synthetic data generated by IBM XML data
    generator (with similar results).
  • Presented the results for the largest data set:
    6.3 million elements (800 MB of data).

60
Experiment results
  • Implemented the structural join algorithms, as
    well as bottom-up and top-down, on the TIMBER
    native XML query engine (built on top of SHORE).
  • Bottom-up and top-down performed poorly
  • Even on 10% of the data it took bottom-up 283.5
    seconds to run QS1, and top-down 717.8 seconds.
  • It took less than 15 seconds for any of the join
    algorithms to complete QS1 on the full data set!

61
Experiment results (continued)
  • Implemented the STJ-D as an application program
    interfacing to a commercial RDBMS through a set
    of cursors.
  • Also ran the queries using the RDBMS join
    mechanisms.

[Table: results for QS1. Notes: a combined index on (startPos, endPos) was used; "small" means up to 10% selectivity and "medium" up to 25%.]
62
Experiment results (continued)
63
Holistic Twig Joins: Optimal XML Pattern Matching
Bruno, Koudas, Srivastava, ACM SIGMOD 2002
64
Twig patterns
  • book[title = 'XML' AND year = '2000']
  • book[title = 'XML']//author[Fn = 'jane' AND Ln =
    'doe']

[Figure: an XML database tree (book elements with title, year, authors, author with Fn and Ln, chapter, head and section nodes) next to the two twig patterns above drawn as small trees]
65
Twig pattern matching
  • Given a twig pattern Q and an XML database D, a
    match is a mapping from nodes in Q to nodes in D,
    satisfying
  • Query node predicates are satisfied by their
    images.
  • The structural relationships between the query
    nodes are satisfied by their images.
  • If Q has k nodes, the result may be represented
    by a relation with k columns.

66
Twig pattern matching approaches
  • Decompose the twig into a series of binary
    structural joins, compute each (using STJ-D for
    example) and join the results.
  • Note that one may have intermediate results that
    are very big. Consider for example
    book[title = 'XML']
  • Decompose the twig into a series of rooted
    path-expressions, compute each one independently
    and merge-join the results.
  • Note that one may have intermediate results that
    are very big (but only in different
    branches). Consider for example
    book//author[Fn = 'jane' AND Ln = 'doe']
  • Decompose the twig into a series of rooted
    path-expressions, compute them simultaneously
    taking interdependencies into account, and
    merge-join the results.

67
PathStack-Desc
  • go to start of all lists; OutputList = NULL
  • while (lists are not empty)
  •   e = element with minimum startPos in all lists
  •   i = the list e was taken from; advance list i
  •   for (int j = 1; j < numLists; j++)
  •     while (e.startPos > stack[j]->top.endPos)
        stack[j]->pop()
  •   if (e is not from the leaf list) // remember
      that for every stack e.startPos >
      stack->top.startPos
  •     stack[i]->push(e, stack[i-1]->top) // if the
        (i-1) stack is not empty, of course
  •   else // e is the path query leaf
  •     let (x1, x2, ..., x(numLists-1)) be the linked
        list whose head is the top of the numLists-1
        stack.
  •     For each (y1, y2, ..., y(numLists-1), e) such
        that every yj is at or below the stack entry
        that y(j+1) points to (starting from the top
        of the numLists-1 stack) do
  •       append (y1, y2, ..., y(numLists-1), e) to
          OutputList
  • Note: For ease of exposition, we assume that
    all lists have the same docId.
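
The pseudo-code above can be sketched in runnable form for a path query q0//q1//...//qk with at least two query nodes. This is an illustration, not the authors' code: the Node tuple, the 0-based indexing and the use of list indexes as the pointers into the parent stack are assumptions of this example.

```python
from typing import NamedTuple


class Node(NamedTuple):
    start: int
    end: int
    name: str


def path_stack(lists):
    """lists[i] holds the nodes matching the i-th query node, sorted by start.
    Returns one tuple of names per match, leaf by leaf."""
    k = len(lists)
    stacks = [[] for _ in range(k - 1)]   # entries: (node, index of parent stack top)
    heads = [0] * k
    out = []

    def expand(j, top, suffix):
        # Every entry at or below `top` in stack j is an ancestor of suffix[0].
        for idx in range(top, -1, -1):
            node, ptr = stacks[j][idx]
            if j == 0:
                out.append((node.name,) + suffix)
            else:
                expand(j - 1, ptr, (node.name,) + suffix)

    while heads[-1] < len(lists[-1]):     # stop once the leaf list is exhausted
        i = min((j for j in range(k) if heads[j] < len(lists[j])),
                key=lambda j: lists[j][heads[j]].start)
        e = lists[i][heads[i]]
        heads[i] += 1
        for s in stacks:                  # drop entries that ended before e
            while s and s[-1][0].end < e.start:
                s.pop()
        if i < k - 1:
            # ptr is -1 when the parent stack is empty; such an entry yields no match.
            ptr = len(stacks[i - 1]) - 1 if i > 0 else -1
            stacks[i].append((e, ptr))
        else:
            expand(k - 2, len(stacks[k - 2]) - 1, (e.name,))
    return out


# Query a//b//c over the element sequence a1 b1 a2 b2 c1 b3 c2.
a_list = [Node(1, 20, "a1"), Node(3, 11, "a2")]
b_list = [Node(2, 12, "b1"), Node(4, 10, "b2"), Node(13, 16, "b3")]
c_list = [Node(5, 6, "c1"), Node(14, 15, "c2")]
matches = path_stack([a_list, b_list, c_list])
print(matches)
```

The stacks encode all partial matches compactly: (a2,b1,c1) is never produced because b1 was pushed before a2 and so points below it.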

68
PathStack-Desc (example)
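Input, in startPos order: a1 b1 a2 b2 c1 b3 c2
Entries are popped when e.startPos > stack->top.endPos

[Figure: the A stack holds a1 (bottom) and a2; the B stack holds b1 (bottom) and b2, later replaced by b3]

Output for c1: (a2,b2,c1) (a1,b2,c1) (a1,b1,c1)
Output for c2: (a1,b3,c2)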
69
PathStack-Desc experimental results
  • Implemented the binary join algorithms, as well
    as PathStack, in C++ using the file-system as
    the storage engine.
  • Used a synthetic data set made of 1 million nodes
    with 6 different labels (A1, A2, ..., A6) uniformly
    distributed (no information regarding other
    parameters).

70
Final remarks
  • What we did not do (partial list)
  • Look at using B-Trees with the stack algorithms.
  • Look at the TwigStack algorithm.
  • Look at Kleene-closure evaluation.
  • Conclusions
  • There is a lot more work to be done by everybody.

71
Appendix TwigStack (in a nutshell)
  • getNext(q) returns a query node such that the
    head of its list satisfies
  • It has the smallest startPos (L) of all the heads
    of its descendant and sibling lists.
  • It participates in a solution to the sub-query
    rooted at that query node.
  • If it is part of a solution involving its
    ancestors they were already read.

72
Appendix (continued)
  • Note that as long as line 09 succeeds we return
    the node whose head has the smallest startPos (L)
    of all the heads of lists in the sub-tree of q.
    When line 09 fails we float up a node whose list
    head has the smallest startPos (L) of all the
    heads of its descendant or sibling lists.
  • Once a node floats up, its father node's list
    does not contain any more ancestors of its list
    head (otherwise line 09 would not fail). Applying
    the same logic to the father, grandfather, etc.
    leads us by induction to the conclusion that if
    this node's list head is part of a solution
    involving its ancestors, these ancestors are
    already out of their lists.

73
Appendix (continued)
  • Both used ternary trees.
  • Left sub-tree in (a) has only A1A2A3A4 paths.
  • Middle sub-tree in (a) has only A1A5A6A7
    paths.
  • Right sub-tree in (a) has solutions. Its size
    varies (8% to 24% of the tree).
  • (b) left has no A2 or A3, middle has no A4 or
    A5, right has no A6 or A7.

74
Bibliography
  • Shu-Yao Chien, Zografoula Vagena, Donghui Zhang,
    Vassilis J. Tsotras, Carlo Zaniolo, Efficient
    Structural Joins on Indexed XML Documents,
    Proc. of VLDB 2002
  • Shurug Al-Khalifa, H. V. Jagadish, Nick Koudas,
    Jignesh M. Patel, Divesh Srivastava, Yuqing Wu,
    Structural Joins: A Primitive for Efficient XML
    Query Pattern Matching, ICDE 2002
  • Nicolas Bruno, Nick Koudas, Divesh Srivastava,
    Holistic Twig Joins: Optimal XML Pattern
    Matching, ACM SIGMOD 2002
  • Shu-Yao Chien, Vassilis J. Tsotras, Carlo
    Zaniolo, Donghui Zhang, Efficient Complex Query
    Support for Multiversion XML Documents, Proc. of
    VLDB 2001
  • Jason McHugh, Jennifer Widom, Query Optimization
    for XML, Proc. of VLDB 1999
  • Chun Zhang, Jeffrey Naughton, David DeWitt, Qiong
    Luo, Guy Lohman, On Supporting Containment
    Queries in Relational Database Management
    Systems, ACM SIGMOD 2001
  • Quanzhong Li, Bongki Moon, Indexing and Querying
    XML Data for Regular Path Expressions, Proc. of
    VLDB 2001
  • IBM DB2 web site: http://www-3.ibm.com/software/data/db2/
  • www.w3.org site (on XPath and XQuery)
