Chapter 10: XML - PowerPoint PPT Presentation

1 / 90
About This Presentation
Title:

Chapter 10: XML

Description:

XML-QL, Quilt, XQL, ... Tree Model of XML Data ... XQuery is derived from the Quilt query language, which itself borrows from SQL, XQL and XML-QL ... – PowerPoint PPT presentation

Number of Views:57
Avg rating:3.0/5.0
Slides: 91
Provided by: kers151
Category:
Tags: xml | chapter | quilt

less

Transcript and Presenter's Notes

Title: Chapter 10: XML


1
Chapter 10 XML
  • The world of XML

2
The Data
  • Semistructured data instance a large graph

3
The indexing problem
  • The storage problem
  • Store the graph in a relational DBMS
  • Develop a new database storage structure
  • The indexing problem
  • Input large, irregular data graph
  • Output index structure for evaluating (regular)
    path expressions, e.g.
  • bib.paper.author.firstname

4
XSet a simple index for XML
  • Part of the Ninja project at Berkeley
  • Example XML data

5
XSet a simple index for XML
  • Each node a hashtable
  • Each entry list of pointers to data nodes (not
    shown)
  • SELECT X FROM part.name X -yes
  • SELECT X FROM part.supplier.name X -yes
  • SELECT X FROM part..subpart.name X -maybe
  • SELECT X FROM .supplier.name X -maybe

6
Region Algebras
  • structured text text with tags (like XML)
  • data sequence of characters c1c2c3
  • region interval in the text
  • representation (x,y) cx,cx1, cy
  • example ltsectiongt lt/sectiongt
  • region set a set of regions
  • example all ltsectiongt regions (may be nested)
  • region algebra operators on region set,
  • s1 op s2
  • s1 intersect s2 r r? s1, r ?s2
  • s1 included s2 r r?s1, ?r ? s2, r ? r
  • s1 including s2 r r? s1, ?r ? s2, r ? r
  • s1 parent s2 r r? s1, ?r? s2, r is a parent
    of r
  • s1 child s2 r r? s1, ?r ? s2, r is child of
    r

7
Region Algebras
  • Region expressions correspond to simple XPath
    expressions
  • s1 child s2 r r? s1, ?r ? s2, r is child of
    r
  • part.name name child (part child
    root)
  • part.supplier.name name child (supplier child
    (part child root))
  • .supplier.name name child supplier
  • part..subpart.name name child (subpart
    included (part child root))

8
Efficient computation of Region Algebra Operators
  • Example s1 included s2
  • s1 (x1,x1'), (x2,x2'),
  • s2 (y1,y1'), (y2,y2'),
  • (i.e. assume each consists of disjoint regions)
  • Algorithm
  • if xi lt yj then i i 1
  • if xi' gt yj' then j j 1
  • otherwise print (xi,xi'), do i i 1
  • Can do in sub-linear time when one region is very
    small

9
Storage structures for region algebras
  • Every node is characterised by an integer pair
    (x,y)
  • This means we have a 2-d space
  • Any 2-d space data structure can be used
  • If you use a (pre-order,post-order) numbering you
    get triangular filling of 2-d
  • (to be discussed later)

10
Alternative mappings
  • Mapping the structure to the relational world
  • The Edge approach
  • The Attribute approach
  • The Universal Table approach
  • The Normalized Universal approach
  • The Monet/XML approach
  • The Dataguide approach
  • Mapping values
  • Separate value tables
  • Inlining
  • Shredding

11
Dataguide approach
  • Developed in the context of Lore, Lorel (Stanford
    Univ)
  • Predecessor of the Monet/XML model
  • Observation
  • queries in the graph-representation take a
    limited form
  • they are partial walks from the root to an object
    of interest
  • this behaviour was stressed by the query language
    Lorel, i.e. an SQL-based query language based on
    processing regular expressions

SELECT X FROM (Bib..author).(lastnamefirstname).
Abiteboul X
12
DataGuides
  • Definition
  • given a semistructured data instance DB, a
    DataGuide for DB is a graph G s.t.
  • - every path in DB also occurs in G
  • - every path in G occurs in DB
  • - every path in G is unique

13
Dataguides
  • Example

14
DataGuides
  • Multiple DataGuides for the same data

15
DataGuides
  • Definition
  • Let w, w be two words (I.e word queries) and G
    a graph
  • w ?G w if w(G) w(G)
  • Definition
  • G is a strong dataguide for a database DB if ?G
    is the same as ?DB
  • Example
  • - G1 is a strong dataguide
  • - G2 is not strong
  • person.project !?DB dept.project
  • person.project !?G2 dept.project

16
DataGuides
  • Constructing the strong DataGuide G
  • Nodes(G)root
  • Edges(G)?
  • while changes do
  • choose s in Nodes(G), a in Labels
  • add syx in s, (x -a-gty) in Edges(DB) to
    Nodes(G)
  • add (x -a-gty) to Edges(G)
  • Use hash table for Nodes(G)
  • This is precisely the powerset automaton
    construction.

17
DataGuides
  • How large are the dataguides ?
  • if DB is a tree, then size(G) lt size(DB)
  • why? answer every node is in exactly one extent
    of G
  • here dataguide XSet
  • How many nodes does the strong dataguide have for
    this DB ?

20 nodes (least common multiple of 4 and 5)
Dataguides usually fail on data with cyclic
schemas, like

18
Monet XML approach
Monet XML approach
19
Monet XML approach
20
Monet XML approach
21
Monet XML approach
22
Monet XML approach
23
  • Querying the XML world

24
Querying and Transforming XML Data
  • Standard XML querying/translation languages
  • XPath
  • Simple language consisting of path expressions
  • XSLT
  • Simple language designed for translation from XML
    to XML and XML to HTML
  • XQuery
  • An XML query language with a rich set of features
  • Wide variety of other languages have been
    proposed, and some served as basis for the Xquery
    standard
  • XML-QL, Quilt, XQL,

25
Tree Model of XML Data
  • Query and transformation languages are based on a
    tree model of XML data
  • An XML document is modeled as a tree, with nodes
    corresponding to elements and attributes
  • Element nodes have children nodes, which can be
    attributes or subelements
  • Text in an element is modeled as a text node
    child of the element
  • Children of a node are ordered according to their
    order in the XML document
  • Element and attribute nodes (except for the root
    node) have a single parent, which is an element
    node
  • The root node has a single child, which is the
    root element of the document
  • We use the terminology of nodes, children,
    parent, siblings, ancestor, descendant, etc.,
    which should be interpreted in the above tree
    model of XML data.

26
XML data with ID and IDREF attributes
  • ltbank-2gt
  • ltaccount account-numberA-401 ownersC100
    C102gt
  • ltbranch-namegt Downtown lt/branch-namegt
  • ltbranchgt500 lt/balancegt
  • lt/accountgt
  • ltcustomer customer-idC100 accountsA-401gt
  • ltcustomer-namegtJoelt/customer-namegt
  • ltcustomer-streetgtMonroelt/customer-street
    gt
  • ltcustomer-citygtMadisonlt/customer-citygt
  • lt/customergt
  • ltcustomer customer-idC102 accountsA-401
    A-402gt
  • ltcustomer-namegt Marylt/customer-namegt
  • ltcustomer-streetgt Erinlt/customer-streetgt
  • ltcustomer-citygt Newark lt/customer-citygt
  • lt/customergt
  • lt/bank-2gt

27
XPath
  • XPath is used to address (select) parts of
    documents using path expressions
  • A path expression is a sequence of steps
    separated by /
  • Think of file names in a directory hierarchy
  • Result of path expression set of values that
    along with their containing elements/attributes
    match the specified path
  • E.g. /bank-2/customer/name evaluated on
    the bank-2 data we saw earlier returns
  • ltnamegtJoelt/namegt
  • ltnamegtMarylt/namegt
  • E.g. /bank-2/customer/name/text( )
  • returns the same names, but without the
    enclosing tags

28
XPath (Cont.)
  • The initial / denotes root of the document
    (above the top-level tag)
  • Path expressions are evaluated left to right
  • Each step operates on the set of instances
    produced by the previous step
  • Selection predicates may follow any step in a
    path, in
  • E.g. /bank-2/accountbalance gt 400
  • returns account elements with a balance value
    greater than 400
  • /bank-2/accountbalance returns account
    elements containing a balance subelement
  • Attributes are accessed using _at_
  • E.g. /bank-2/accountbalance gt
    400/_at_account-number
  • returns the account numbers of those accounts
    with balance gt 400
  • IDREF attributes are not dereferenced
    automatically (more on this later)

29
Functions in XPath
  • XPath provides several functions
  • The function count() at the end of a path counts
    the number of elements in the set generated by
    the path
  • E.g. /bank-2/accountcustomer/count() gt 2
  • Returns accounts with gt 2 customers
  • Also function for testing position (1, 2, ..) of
    node w.r.t. siblings
  • Boolean connectives and and or and function not()
    can be used in predicates
  • IDREFs can be referenced using function id()
  • id() can also be applied to sets of references
    such as IDREFS and even to strings containing
    multiple references separated by blanks
  • E.g. /bank-2/account/id(_at_owner)
  • returns all customers referred to from the owners
    attribute of account elements.

30
More XPath Features
  • Operator used to implement union
  • E.g. /bank-2/account/id(_at_owner)
    /bank-2/loan/id(_at_borrower)
  • gives customers with either accounts or loans
  • However, cannot be nested inside other
    operators.
  • // can be used to skip multiple levels of nodes
  • E.g. /bank-2//name
  • finds any name element anywhere under the
    /bank-2 element, regardless of the element in
    which it is contained.
  • A step in the path can go to (13 variations in
    the standard)
  • parents, siblings, ancestors and descendants
  • of the nodes generated by the previous step, not
    just to the children
  • //, described above, is a short from for
    specifying all descendants
  • .. specifies the parent.

31
Pathfinder
  • Xpath is essential for the implementation of an
    Xquery processor. It is strongly related to the
    data structures and its primitives.
  • A state-of-the-art implementation is
    MonetDB/Pathfinder developed by Uni. Konstantz,
    Twente University, CWI

32
Pathfinder Uni Konstantz
33
Pathfinder
34
Pathfinder
35
Pathfinder
36
(No Transcript)
37
Pathfinder
38
Pathfinder
39
Pathfinder
40
pathfinder
41
Pathfinder
42
Staircase join
43
Staircase join
44
Pathfinder
45
Pathfinder
46
Pathfinder
47
Pathfinder
48
  • XQuery

49
XQuery
  • XQuery is a general purpose query language for
    XML data
  • Currently being standardized by the World Wide
    Web Consortium (W3C)
  • The textbook description is based on a March 2001
    draft of the standard. The final version may
    differ, but major features likely to stay
    unchanged.
  • Alpha version of XQuery engine
  • Galax http//db.bell-labs.com/galax/
  • IPSI-IQ
  • Xpath visualized http//www.vbxml.com/xpathvisual
    izer/
  • MonetDB/Pathfinder
  • Xhive
  • XQuery is derived from the Quilt query language,
    which itself borrows from SQL, XQL and XML-QL

50
XQuery
  • XQuery uses a for let where .. return
    syntax
  • for ? SQL from where ? SQL where
    return ? SQL select let allows temporary
    variables, and has no equivalent in SQL
  • Variables make it possible to keep the state of
    processing around and severely complicates
    optimization

51
FLWR Syntax in XQuery
  • For clause uses XPath expressions, and variables
    in the for- clause ranges over values in the set
    returned by Xpath
  • XPath is used to address (select) parts of
    documents using path expressions
  • A path expression is a sequence of steps
    separated by /
  • Result of path expression set of values that
    along with their containing elements/attributes
    match the specified path
  • E.g. /bank-2/customer/name evaluated on
    the bank-2 data we saw earlier returns
  • ltnamegtJoelt/namegt
  • ltnamegtMarylt/namegt
  • E.g. /bank-2/customer/name/text( )
  • returns the same names, but without the
    enclosing tags

52
XPath
  • XPath is used to address (select) parts of
    documents using path expressions
  • A path expression is a sequence of steps
    separated by /
  • Think of file names in a directory hierarchy
  • Result of path expression set of values that
    along with their containing elements/attributes
    match the specified path
  • E.g. /bank-2/customer/name evaluated on
    the bank-2 data we saw earlier returns
  • ltnamegtJoelt/namegt
  • ltnamegtMarylt/namegt
  • E.g. /bank-2/customer/name/text( )
  • returns the same names, but without the
    enclosing tags

53
XPath (Cont.)
  • The initial / denotes root of the document
    (above the top-level tag)
  • Path expressions are evaluated left to right
  • Each step operates on the set of instances
    produced by the previous step
  • Selection predicates may follow any step in a
    path, in
  • E.g. /bank-2/accountbalance gt 400
  • returns account elements with a balance value
    greater than 400
  • /bank-2/accountbalance returns account
    elements containing a balance subelement
  • Attributes are accessed using _at_
  • E.g. /bank-2/accountbalance gt
    400/_at_account-number
  • returns the account numbers of those accounts
    with balance gt 400
  • IDREF attributes are not dereferenced
    automatically (more on this later)

54
Functions in XPath
  • XPath provides several functions
  • The function count() at the end of a path counts
    the number of elements in the set generated by
    the path
  • E.g. /bank-2/accountcustomer/count() gt 2
  • Returns accounts with gt 2 customers
  • Also function for testing position (1, 2, ..) of
    node w.r.t. siblings
  • Boolean connectives and and or and function not()
    can be used in predicates
  • IDREFs can be referenced using function id()
  • id() can also be applied to sets of references
    such as IDREFS and even to strings containing
    multiple references separated by blanks
  • E.g. /bank-2/account/id(_at_owner)
  • returns all customers referred to from the owners
    attribute of account elements.

55
More XPath Features
  • Operator used to implement union
  • E.g. /bank-2/account/id(_at_owner)
    /bank-2/loan/id(_at_borrower)
  • gives customers with either accounts or loans
  • However, cannot be nested inside other
    operators.
  • // can be used to skip multiple levels of nodes
  • E.g. /bank-2//name
  • finds any name element anywhere under the
    /bank-2 element, regardless of the element in
    which it is contained.
  • A step in the path can go to (13 variations in
    the standard)
  • parents, siblings, ancestors and descendants
  • of the nodes generated by the previous step, not
    just to the children
  • //, described above, is a short from for
    specifying all descendants
  • .. specifies the parent.

56
FLWR Syntax in XQuery
  • Simple FLWR expression in XQuery
  • find all accounts with balance gt 400, with each
    result enclosed in an ltaccount-numbergt ..
    lt/account-numbergt tag for x in
    /bank-2/account let acctno
    x/_at_account-number where x/balance gt 400
    return ltaccount-numbergt acctno
    lt/account-numbergt
  • Let clause not really needed in this query, and
    selection can be done In XPath. Query can be
    written as
  • for x in /bank-2/accountbalancegt400 return
    ltaccount-numbergt X/_at_account-number

    lt/account-numbergt

57
Path Expressions and Functions
  • Path expressions are used to bind variables in
    the for clause, but can also be used in other
    places
  • E.g. path expressions can be used in let clause,
    to bind variables to results of path expressions
  • The function distinct( ) can be used to removed
    duplicates in path expression results
  • The function document(name) returns root of named
    document
  • E.g. document(bank-2.xml)/bank-2/account
  • Aggregate functions such as sum( ) and count( )
    can be applied to path expression results
  • XQuery does not support groupby, but the same
    effect can be got by nested queries, with nested
    FLWR expressions within a return clause
  • More on nested queries later

58
Joins
  • Joins are specified in a manner very similar to
    SQLfor b in /bank/account,
  • c in /bank/customer,
  • d in /bank/depositor
  • where a/account-number d/account-number
    and c/customer-name d/customer-name
  • return ltcust-acctgt c a lt/cust-acctgt
  • The same query can be expressed with the
    selections specified as XPath selections
  • for a in /bank/account c in
    /bank/customer d in /bank/depositor
    account-number a/account-number and
    customer-name
    c/customer-name
  • return ltcust-acctgt c alt/cust-acctgt

59
Changing Nesting Structure
  • The following query converts data from the flat
    structure for bank information into the nested
    structure used in bank-1
  • ltbank-1gt
  • for c in /bank/customer
  • return
  • ltcustomergt
  • c/
  • for d in /bank/depositorcustomer-name
    c/customer-name,
  • a in /bank/accountaccount-numberd/a
    ccount-number
  • return a
  • lt/customergt
  • lt/bank-1gt
  • c/ denotes all the children of the node to
    which c is bound, without the enclosing
    top-level tag
  • Exercise for reader write a nested query to find
    sum of accountbalances, grouped by branch.

60
XQuery Path Expressions
  • c/text() gives text content of an element
    without any subelements/tags
  • XQuery path expressions support the gt operator
    for dereferencing IDREFs
  • Equivalent to the id( ) function of XPath, but
    simpler to use
  • Can be applied to a set of IDREFs to get a set of
    results

61
Sorting in XQuery
  • Sortby clause can be used at the end of any
    expression. E.g. to return customers sorted by
    name for c in /bank/customer return
    ltcustomergt c/ lt/customergt sortby(name)
  • Can sort at multiple levels of nesting (sort by
    customer-name, and by account-number within each
    customer)
  • ltbank-1gt for c in /bank/customer
    return ltcustomergt c/ for d in
    /bank/depositorcustomer-namec/customer-name,
    a in /bank/accountaccount-numberd/ac
    count-number return ltaccountgt a/
    lt/accountgt sortby(account-number) lt/customergt
    sortby(customer-name)
  • lt/bank-1gt

62
Functions and Other XQuery Features
  • User defined functions with the type system of
    XMLSchema function balances(xsdstring c)
    returns list(xsdnumeric) for d in
    /bank/depositorcustomer-name c,
    a in /bank/accountaccount-numberd/account-numb
    er return a/balance
  • Types are optional for function parameters and
    return values
  • Universal and existential quantification in where
    clause predicates
  • some e in path satisfies P
  • every e in path satisfies P
  • XQuery also supports If-then-else clauses

63
  • Xmark http//www.xml-benchmark.org
  • Used in most experiments on Xpath and Xquery
    evaluation
  • Old figures on hand-compiled queries for the
    dataguide approach can be found in
  • http//www.cwi.nl/mk/xmarkArchive/Reports/Monet_r
    eport/monet_report.html

64
Xmark
65
XMark
66
Monet XML approach
Monet XML approach
67
XMark
  • Q1 Return the name of the person with ID
    personal
  • FOR b IN /site/people/person_at_idpersonal
  • RETURN b/name/text()

68
Do it yourself Or skip
69
Xmark queries
  • Q2 Return the initial increases of all open
    auctions.
  • This query evaluates the cost of array look-ups.
    Note that this query may actually be harder to
    evaluate than it looks especially relational
    back-ends may have to struggle with rather
    complex aggregations to select the bidder element
    with index 1.
  • Q3 Return the IDs of all open auctions whose
    current increase is at least twice as high as the
    initial increase.
  • This is a more complex application of index
    lookups. In the case of a relational DBMS, the
    query can take advantage of set-valued aggregates
    on the index attribute to accelerate the
    execution.

70
Xmark queries
  • Q4 List the reserves of those open auctions
    where a certain person issued a bid before
    another person
  • This time, we stress the textual nature of XML
    documents by querying the tag order in the source
    document
  • Q5 How many sold items cost more than 40?
  • Strings are the generic data type in XML
    documents. Queries that interpret strings will
    often need to cast strings to another data type
    that carries more semantics. This query
    challenges the DBMS in terms of the casting
    primitives it provides. Especially, if there is
    no additional schema information or just a DTD at
    hand, casts are likely to occur frequently.

71
Xmark queries
  • Q6 How many items are listed on all continents?
  • Regular path expressions are a fundamental
    building block of virtually every query language
    for XML or semi-structured data. These queries
    investigate how well the query processor can
    optimize path expressions and prune traversals of
    irrelevant parts of the tree.
  • Q7 How many pieces of prose are in our database?
  • A good evaluation engine should realize that
    there is no need to traverse the complete
    document tree to evaluate such expressions.Also
    note that the COUNT aggregation does not require
    a complete traversal of the tree. Just the
    cardinality of the respective relation is
    queried. Note that the tag ltemailgt does not
    exist in the database document.

72
Xmark queries
  • Q8 List the names of persons and the number of
    items they bought. (joins person,
    closed\_auction)
  • References are an integral part of XML as they
    allow richer relationships than just hierarchical
    element structures. This query defines
    horizontal traversals with increasing complexity.
    A good query optimizer should take advantage of
    the cardinalities of the sets to be joined.
  • Q9 List the names of persons and the names of
    the items they bought in Europe. (joins person,
    closed_auction, item)
  • References are an integral part of XML as they
    allow richer relationships than just hierarchical
    element structures. These queries define
    horizontal traversals with increasing complexity.
    A good query optimizer should take advantage of
    the cardinalities of the sets to be joined.

73
Xmark queries
  • Q10 List all persons according to their
    interest use french markup in the result.
  • Constructing new elements may put the storage
    engine under stress especially in the context of
    creating materialized document views. The
    following query reverses the structure of person
    records by grouping them according to the
    interest profile of a person. Large parts of the
    person records are repeatedly reconstructed. To
    avoid simple copying of the original database we
    translate the mark-up into french.
  • Q11 For each person, list the number of items
    currently on sale whose price does not exceed
    0.02\ of the person's income
  • This query tests the database's ability to handle
    large (intermediate) results. This time, joins
    are on the basis of values. The difference
    between these queries and the reference chasing
    queries Q8 and Q9 is that references are
    specified in the DTD and may be optimized with
    logical OIDs for example. The two queries Q11
    and Q12 cascade in thesize of the result set and
    provide various optimization opportunities.

74
Xmark queries
  • Q12 For each richer-than-average person, list
    the number of items currently on sale whose price
    does not exceed 0.02 of the person's income
  • This query tests the database's ability to handle
    large (intermediate) results. This time, joins
    are on the basis of values. The difference
    between these queries and the reference chasing
    queries Q8 and Q9 is that references are
    specified in the DTD and may be optimized with
    logical OIDs for example. The two queries Q11
    and Q12 cascade in the size of the result set and
    provide various optimization opportunities.
  • Q13 List the names of items registered in
    Australia along with their descriptions.
  • A key design for XML-gtDBMS mappings is to
    determine the fragmentation criteria. The
    complementary action is to reconstruct the
    original document from its broken-down
    representation. Query 13 tests for the ability
    of the database to reconstruct portions of the
    original XML document.

75
Xmark queries
  • Q14Return the names of all items whose
    description contains the word gold'.
  • We continue to challenge the textual nature of
    XML documents this time, we conduct a full-text
    search in the form of keyword search. Although
    full-text scanning could be studied in isolation
    we think that the interaction with structural
    mark-up is essential as the concepts are
    considered orthogonal so query Q14 is restricted
    to a subset of the document by combining content
    and structure.
  • Q15 Print the keywords in emphasis in
    annotations of closed auctions.
  • We now try to quantify the costs of long path
    traversals that don't include wildcards. We
    first descend deep into the tree (Query 15) and
    then return again (Query 16). Both queries only
    check for the existence of paths rather than
    selecting paths with predicates.

76
Xmark queries
  • Q16 Return the IDs of those auctions that have
    one or more kweywords in emphasis.
  • Q17Which persons don't have a homepage?
  • This is to test how well the query processors
    knows to deal with the semi-structured aspect of
    XML data, especially elements that are declared
    optional in the DTD.
  • Q18Convert the currency of the reserve of all
    open auctions to another currency.
  • This query puts the application of user defined
    functions (UDF) to the proof. In the XML world,
    UDFs are of particular importance because they
    allow the user to assign semantics to generic
    strings that go beyond type coercion.

77
Query optimizer challenges
  • Mapping Xquery to a RBDMS should be able
  • to deal with ordered tables
  • to skip sub-documents
  • to perform dynamic type casting
  • to avoid unnecessary construction of string
    intermediates
  • to recognize join-paths for fast access
  • to balance fragmentation and reconstruction cose

78
Xmark answers
  • Q2 Return the initial increases of all open
    auctions.
  • This query evaluates the cost of array look-ups.
    Note that this query may actually be harder to
    evaluate than it looks especially relational
    back-ends may have to struggle with rather
    complex aggregations to select the bidder element
    with index 1.
  • FOR b IN document("auction.xml")/site/open_auc
    tions/open_auction
  • RETURN ltincreasegt b/bidder1/increase/text()
    lt/increasegt

79
XMark
  • Q3 Return the IDs of all open auctions whose
    current increase is at least twice as high as the
    initial increase.
  • This is a more complex application of index
    lookups. In the case of a relational DBMS, the
    query can take advantage of set-valued aggregates
    on the index attribute to accelerate the
    execution.
  • FOR b IN document("auction.xml")/site/open_auc
    tions/open_auction
  • WHERE b/bidder0/increase/text() 2 lt
    b/bidderlast()/increase/text()
  • RETURN ltincrease firstb/bidder0/increase/text(
    )
  • lastb/bidderlast()/increase/t
    ext()/gt

80
Xmark result
  • Q4 List the reserves of those open auctions
    where a certain person issued a bid before
    another person
  • This time, we stress the textual nature of XML
    documents by querying the tag order in the source
    document
  • FOR b IN document("auction.xml")/site/open_auc
    tions/open_auction
  • WHERE b/bidder/personrefid"person18829"
    BEFORE
  • b/bidder/personrefid"person10487"
  • RETURN lthistorygt b/initial/text() lt/historygt

81
Xmark answers
  • Q5 How many sold items cost more than 40?
  • Strings are the generic data type in XML
    documents. Queries that interpret strings will
    often need to cast strings to another data type
    that carries more semantics. This query
    challenges the DBMS in terms of the casting
    primitives it provides. Especially, if there is
    no additional schema information or just a DTD at
    hand, casts are likely to occur frequently.
  • COUNT (FOR i document("auction.xml")/site/clo
    sed_auctions/closed_auction
  • WHERE i/price/text() gt 40
  • RETURN i/price)

82
Xmark results
  • Q6 How many items are listed on all continents?
  • Regular path expressions are a fundamental
    building block of virtually every query language
    for XML or semi-structured data. These queries
    investigate how well the query processor can
    optimize path expressions and prune traversals of
    irrelevant parts of the tree.
  • FOR b IN document("auction.xml")/site/regions
  • RETURN COUNT (b//item)

83
Xmark results
  • Q7 How many pieces of prose are in our database?
  • A good evaluation engine should realize that
    there is no need to traverse the complete
    document tree to evaluate such expressions.Also
    note that the COUNT aggregation does not require
    a complete traversal of the tree. Just the
    cardinality of the respective relation is
    queried. Note that the tag ltemailgt does not
    exist in the database document.
  • FOR p IN document("auction.xml")/site
  • RETURN count(p//description)
    count(p//annotation) count(p//email)

84
Xmark results
  • Q8 List the names of persons and the number of
    items they bought. (joins person,
    closed\_auction)
  • References are an integral part of XML as they
    allow richer relationships than just hierarchical
    element structures. This query defines
    horizontal traversals with increasing complexity.
    A good query optimizer should take advantage of
    the cardinalities of the sets to be joined.
  • FOR p IN document("auction.xml")/site/people/p
    erson
  • LET a FOR t IN document("auction.xml")/sit
    e/closed_auctions/closed_auction
  • WHERE t/buyer/_at_person p/_at_id
  • RETURN t
  • RETURN ltitem personp/name/text()gt COUNT (a)
    lt/itemgt

85
Xmark results
  • Q9 List the names of persons and the names of
    the items they bought in Europe. (joins person,
    closed_auction, item)
  • References are an integral part of XML as they
    allow richer relationships than just hierarchical
    element structures. These queries define
    horizontal traversals with increasing complexity.
    A good query optimizer should take advantage of
    the cardinalities of the sets to be joined.
  • FOR p IN document("auction.xml")/site/people/p
    erson
  • LET a FOR t IN document("auction.xml")/sit
    e/closed_auctions/closed_auction
  • LET n FOR t2 IN
    document("auction.xml")/site/regions/europe/item
  • WHERE t/itemref/_at_item
    t2/_at_id
  • RETURN t2
  • WHERE p/_at_id t/buyer/_at_person
  • RETURN ltitemgt n/name/text() lt/itemgt
  • RETURN ltperson namep/name/text()gt a lt/persongt

86
Xmark results
  • Q10 List all persons according to their
    interest use french markup in the result.
  • Constructing new elements may put the storage
    engine under stress especially in the context of
    creating materialized document views. The
    following query reverses the structure of person
    records by grouping them according to the
    interest profile of a person. Large parts of the
    person records are repeatedly reconstructed. To
    avoid simple copying of the original database we
    translate the mark-up into french.

87
  • FOR i IN DISTINCT document("auction.xml")/
    site/people/person/profile/interest/_at_category
  • LET p FOR t IN document("auction.xml")/sit
    e/people/person
  • WHERE t/profile/interest/_at_category
    i
  • RETURN ltpersonnegt
  • ltstatistiquesgt
  • ltsexegt t/gender/text()
    lt/sexegt,
  • ltagegt t/age/text()
    lt/agegt,
  • lteducationgt
    t/education/text()lt/educationgt,
  • ltrevenugt t/income/text()
    lt/revenugt
  • lt/statistiquesgt,

88
  • ltcoordonneesgt
  • ltnomgt t/name/text()
    lt/nomgt,
  • ltruegt t/street/text()
    lt/ruegt,
  • ltvillegt t/city/text()
    lt/villegt,
  • ltpaysgt t/country/text()
    lt/paysgt,
  • ltreseaugt
  • ltcourriergt
    t/email/text() lt/courriergt,
  • ltpagePersogt
    t/homepage/text()lt/pagePersogt
  • lt/reseaugt,
  • lt/coordonneesgt
  • ltcartePaiementgt
    t/creditcard/text()lt/cartePaiementgt
  • lt/personnegt
  • RETURN ltcategoriegt
  • ltidgt i lt/idgt,
  • p
  • lt/categoriegt

89
Xmark results
  • Q11 For each person, list the number of items
    currently on sale whose price does not exceed
    0.02\ of the person's income
  • This query tests the database's ability to handle
    large (intermediate) results. This time, joins
    are on the basis of values. The difference
    between these queries and the reference chasing
    queries Q8 and Q9 is that references are
    specified in the DTD and may be optimized with
    logical OIDs for example. The two queries Q11
    and Q12 cascade in the size of the result set and
    provide various optimization opportunities.
  • FOR p IN document("auction.xml")/site/people/pers
    on
  • LET c FOR i IN document("auction.xml")/site/o
    pen_auctions/open_auction/initial
  • WHERE p/profile/_at_income gt (5000
    i/text())
  • RETURN i
  • RETURN ltitems namep/profile/_at_incomegt COUNT (c)
    lt/itemsgt

90
Xmark results
  • 12 For each richer-than-average person, list the
    number of items currently on sale whose price
    does not exceed 0.02 of the person's income
  • This query tests the database's ability to handle
    large (intermediate) results. This time, joins
    are on the basis of values. The difference
    between these queries and the reference chasing
    queries Q8 and Q9 is that references are
    specified in the DTD and may be optimized with
    logical OIDs for example. The two queries Q11
    and Q12 cascade in the size of the result set and
    provide various optimization opportunities. FOR
    p IN document("auction.xml")/site/people/person
  • FOR p IN document("auction.xml")/site/people/pers
    on
  • LET l FOR i IN document("auction.xml")/site/o
    pen_auctions/open_auction/initial
  • WHERE p/profile/_at_income gt (5000
    i/text())
  • RETURN i
  • WHERE p/profile/_at_income gt 50000
  • RETURN ltitems incomep/profile/_at_incomegt COUNT
    (l) lt/itemsgt

91
Xmark results
  • Q13 List the names of items registered in
    Australia along with their descriptions.
  • A key design for XML-gtDBMS mappings is to
    determine the fragmentation criteria. The
    complementary action is to reconstruct the
    original document from its broken-down
    representation. Query 13 tests for the ability
    of the database to reconstruct portions of
    theoriginal XML document.
  • FOR i IN document("auction.xml")/site/regions/aus
    tralia/item RETURN ltitem namei/name/text()gt
    i/description lt/itemgt

92
Xmark results
  • Q14Return the names of all items whose
    description contains the word gold'.
  • We continue to challenge the textual nature of
    XML documents this time, we conduct a full-text
    search in the form of keyword search. Although
    full-text scanning could be studied in isolation
    we think that the interaction with structural
    mark-up is essential as the concepts are
    considered orthogonal so query Q14 is restricted
    to a subset of the document by combining content
    and structure.
  • FOR i IN document("auction.xml")/site//item
  • WHERE CONTAINS (i/description,"gold")
  • RETURN i/name/text()

93
Xmark results
  • Q15 Print the keywords in emphasis in
    annotations of closed auctions.
  • We now try to quantify the costs of long path
    traversals that don't include wildcards. We
    first descend deep into the tree (Query 15) and
    then return again (Query 16). Both queries only
    check for the existence of paths rather than
    selecting paths with predicates.
  • FOR a IN document("auction.xml")/site/closed_auct
    ions/closed_auction/annotation/description/parlist
    /listitem/parlist/listitem/text/emph/keyword/text(
    )
  • RETURN lttextgt a lt/textgt

94
Xmark results
  • Q16 Return the IDs of those auctions that have
    one or more kweywords in emphasis.
  • FOR a IN document("auction.xml")/site/closed_auct
    ions/closed_auction
  • WHERE NOT EMPTY (a/annotation/description/parlist
    /listitem/parlist/\
  • listitem/text/emph/keyword/te
    xt())
  • RETURN ltperson ida/seller/_at_person /gt

95
Xmark results
  • Q17Which persons don't have a homepage?
  • This is to test how well the query processors
    knows to deal with the semi-structured aspect of
    XML data, especially elements that are declared
    optional in the DTD.
  • FOR p IN document("auction.xml")/site/people/p
    erson
  • WHERE EMPTY(p/homepage/text())
  • RETURN ltperson namep/name/text()/gt

96
Xmark results
  • Q18Convert the currency of the reserve of all
    open auctions to another currency.
  • This query puts the application of user defined
    functions (UDF) to the proof. In the XML world,
    UDFs are of particular importance because they
    allow the user to assign semantics to generic
    strings that go beyond type coercion.
  • FUNCTION CONVERT (v)
  • RETURN 2.20371 v -- convert Dfl to Euros
  • FOR i IN document("auction.xml")/site/open_auc
    tions/open_auction/
  • RETURN CONVERT(i/reserve/text())

97
Query optimizer challenges
  • Mapping Xquery to a RBDMS should be able
  • to deal with ordered tables
  • to skip sub-documents
  • to perform dynamic type casting
  • to avoid unnecessary construction of string
    intermediates
  • to recognize join-paths for fast access
  • to balance fragmentation and reconstruction cose

98
Xmark results
Effect of loading 100Mb document into DBMS
99
Xmark results
100
Xmark results
101
Pathfinder/MonetDB 2004 implementation in seconds
Write a Comment
User Comments (0)
About PowerShow.com