Title: Chapter 10: XML
1Chapter 10 XML
2The Data
- Semistructured data instance a large graph
3The indexing problem
- The storage problem
- Store the graph in a relational DBMS
- Develop a new database storage structure
- The indexing problem
- Input large, irregular data graph
- Output index structure for evaluating (regular)
path expressions, e.g. - bib.paper.author.firstname
4XSet a simple index for XML
- Part of the Ninja project at Berkeley
- Example XML data
5XSet a simple index for XML
- Each node a hashtable
- Each entry list of pointers to data nodes (not
shown)
- SELECT X FROM part.name X -yes
- SELECT X FROM part.supplier.name X -yes
- SELECT X FROM part..subpart.name X -maybe
- SELECT X FROM .supplier.name X -maybe
6Region Algebras
- structured text text with tags (like XML)
- data sequence of characters c1c2c3
- region interval in the text
- representation (x,y) cx,cx1, cy
- example ltsectiongt lt/sectiongt
- region set a set of regions
- example all ltsectiongt regions (may be nested)
- region algebra operators on region set,
- s1 op s2
- s1 intersect s2 r r? s1, r ?s2
- s1 included s2 r r?s1, ?r ? s2, r ? r
- s1 including s2 r r? s1, ?r ? s2, r ? r
- s1 parent s2 r r? s1, ?r? s2, r is a parent
of r - s1 child s2 r r? s1, ?r ? s2, r is child of
r
7Region Algebras
- Region expressions correspond to simple XPath
expressions - s1 child s2 r r? s1, ?r ? s2, r is child of
r
- part.name name child (part child
root) - part.supplier.name name child (supplier child
(part child root)) - .supplier.name name child supplier
- part..subpart.name name child (subpart
included (part child root))
8Efficient computation of Region Algebra Operators
- Example s1 included s2
- s1 (x1,x1'), (x2,x2'),
- s2 (y1,y1'), (y2,y2'),
- (i.e. assume each consists of disjoint regions)
- Algorithm
- if xi lt yj then i i 1
- if xi' gt yj' then j j 1
- otherwise print (xi,xi'), do i i 1
- Can do in sub-linear time when one region is very
small
9Storage structures for region algebras
- Every node is characterised by an integer pair
(x,y) - This means we have a 2-d space
- Any 2-d space data structure can be used
- If you use a (pre-order,post-order) numbering you
get triangular filling of 2-d - (to be discussed later)
10Alternative mappings
- Mapping the structure to the relational world
- The Edge approach
- The Attribute approach
- The Universal Table approach
- The Normalized Universal approach
- The Monet/XML approach
- The Dataguide approach
- Mapping values
- Separate value tables
- Inlining
- Shredding
11Dataguide approach
- Developed in the context of Lore, Lorel (Stanford
Univ) - Predecessor of the Monet/XML model
- Observation
- queries in the graph-representation take a
limited form - they are partial walks from the root to an object
of interest - this behaviour was stressed by the query language
Lorel, i.e. an SQL-based query language based on
processing regular expressions
SELECT X FROM (Bib..author).(lastnamefirstname).
Abiteboul X
12DataGuides
- Definition
- given a semistructured data instance DB, a
DataGuide for DB is a graph G s.t. - - every path in DB also occurs in G
- - every path in G occurs in DB
- - every path in G is unique
13Dataguides
14DataGuides
- Multiple DataGuides for the same data
15DataGuides
- Definition
- Let w, w be two words (I.e word queries) and G
a graph - w ?G w if w(G) w(G)
- Definition
- G is a strong dataguide for a database DB if ?G
is the same as ?DB - Example
- - G1 is a strong dataguide
- - G2 is not strong
- person.project !?DB dept.project
- person.project !?G2 dept.project
16DataGuides
- Constructing the strong DataGuide G
- Nodes(G)root
- Edges(G)?
- while changes do
- choose s in Nodes(G), a in Labels
- add syx in s, (x -a-gty) in Edges(DB) to
Nodes(G) - add (x -a-gty) to Edges(G)
- Use hash table for Nodes(G)
- This is precisely the powerset automaton
construction.
17DataGuides
- How large are the dataguides ?
- if DB is a tree, then size(G) lt size(DB)
- why? answer every node is in exactly one extent
of G - here dataguide XSet
- How many nodes does the strong dataguide have for
this DB ?
20 nodes (least common multiple of 4 and 5)
Dataguides usually fail on data with cyclic
schemas, like
18Monet XML approach
Monet XML approach
19Monet XML approach
20Monet XML approach
21Monet XML approach
22Monet XML approach
23 24Querying and Transforming XML Data
- Standard XML querying/translation languages
- XPath
- Simple language consisting of path expressions
- XSLT
- Simple language designed for translation from XML
to XML and XML to HTML - XQuery
- An XML query language with a rich set of features
- Wide variety of other languages have been
proposed, and some served as basis for the Xquery
standard - XML-QL, Quilt, XQL,
25Tree Model of XML Data
- Query and transformation languages are based on a
tree model of XML data - An XML document is modeled as a tree, with nodes
corresponding to elements and attributes - Element nodes have children nodes, which can be
attributes or subelements - Text in an element is modeled as a text node
child of the element - Children of a node are ordered according to their
order in the XML document - Element and attribute nodes (except for the root
node) have a single parent, which is an element
node - The root node has a single child, which is the
root element of the document - We use the terminology of nodes, children,
parent, siblings, ancestor, descendant, etc.,
which should be interpreted in the above tree
model of XML data.
26XML data with ID and IDREF attributes
-
- ltbank-2gt
- ltaccount account-numberA-401 ownersC100
C102gt - ltbranch-namegt Downtown lt/branch-namegt
- ltbranchgt500 lt/balancegt
- lt/accountgt
- ltcustomer customer-idC100 accountsA-401gt
- ltcustomer-namegtJoelt/customer-namegt
- ltcustomer-streetgtMonroelt/customer-street
gt - ltcustomer-citygtMadisonlt/customer-citygt
- lt/customergt
- ltcustomer customer-idC102 accountsA-401
A-402gt - ltcustomer-namegt Marylt/customer-namegt
- ltcustomer-streetgt Erinlt/customer-streetgt
- ltcustomer-citygt Newark lt/customer-citygt
- lt/customergt
- lt/bank-2gt
27XPath
- XPath is used to address (select) parts of
documents using path expressions - A path expression is a sequence of steps
separated by / - Think of file names in a directory hierarchy
- Result of path expression set of values that
along with their containing elements/attributes
match the specified path - E.g. /bank-2/customer/name evaluated on
the bank-2 data we saw earlier returns - ltnamegtJoelt/namegt
- ltnamegtMarylt/namegt
- E.g. /bank-2/customer/name/text( )
- returns the same names, but without the
enclosing tags
28XPath (Cont.)
- The initial / denotes root of the document
(above the top-level tag) - Path expressions are evaluated left to right
- Each step operates on the set of instances
produced by the previous step - Selection predicates may follow any step in a
path, in - E.g. /bank-2/accountbalance gt 400
- returns account elements with a balance value
greater than 400 - /bank-2/accountbalance returns account
elements containing a balance subelement - Attributes are accessed using _at_
- E.g. /bank-2/accountbalance gt
400/_at_account-number - returns the account numbers of those accounts
with balance gt 400 - IDREF attributes are not dereferenced
automatically (more on this later)
29Functions in XPath
- XPath provides several functions
- The function count() at the end of a path counts
the number of elements in the set generated by
the path - E.g. /bank-2/accountcustomer/count() gt 2
- Returns accounts with gt 2 customers
- Also function for testing position (1, 2, ..) of
node w.r.t. siblings - Boolean connectives and and or and function not()
can be used in predicates - IDREFs can be referenced using function id()
- id() can also be applied to sets of references
such as IDREFS and even to strings containing
multiple references separated by blanks - E.g. /bank-2/account/id(_at_owner)
- returns all customers referred to from the owners
attribute of account elements.
30More XPath Features
- Operator used to implement union
- E.g. /bank-2/account/id(_at_owner)
/bank-2/loan/id(_at_borrower) - gives customers with either accounts or loans
- However, cannot be nested inside other
operators. - // can be used to skip multiple levels of nodes
- E.g. /bank-2//name
- finds any name element anywhere under the
/bank-2 element, regardless of the element in
which it is contained. - A step in the path can go to (13 variations in
the standard) - parents, siblings, ancestors and descendants
- of the nodes generated by the previous step, not
just to the children - //, described above, is a short from for
specifying all descendants - .. specifies the parent.
31Pathfinder
- Xpath is essential for the implementation of an
Xquery processor. It is strongly related to the
data structures and its primitives. - A state-of-the-art implementation is
MonetDB/Pathfinder developed by Uni. Konstantz,
Twente University, CWI
32Pathfinder Uni Konstantz
33Pathfinder
34Pathfinder
35Pathfinder
36(No Transcript)
37Pathfinder
38Pathfinder
39Pathfinder
40pathfinder
41Pathfinder
42Staircase join
43Staircase join
44Pathfinder
45Pathfinder
46Pathfinder
47Pathfinder
48 49XQuery
- XQuery is a general purpose query language for
XML data - Currently being standardized by the World Wide
Web Consortium (W3C) - The textbook description is based on a March 2001
draft of the standard. The final version may
differ, but major features likely to stay
unchanged. - Alpha version of XQuery engine
- Galax http//db.bell-labs.com/galax/
- IPSI-IQ
- Xpath visualized http//www.vbxml.com/xpathvisual
izer/ - MonetDB/Pathfinder
- Xhive
- XQuery is derived from the Quilt query language,
which itself borrows from SQL, XQL and XML-QL
50XQuery
- XQuery uses a for let where .. return
syntax - for ? SQL from where ? SQL where
return ? SQL select let allows temporary
variables, and has no equivalent in SQL - Variables make it possible to keep the state of
processing around and severely complicates
optimization
51FLWR Syntax in XQuery
- For clause uses XPath expressions, and variables
in the for- clause ranges over values in the set
returned by Xpath - XPath is used to address (select) parts of
documents using path expressions - A path expression is a sequence of steps
separated by / - Result of path expression set of values that
along with their containing elements/attributes
match the specified path - E.g. /bank-2/customer/name evaluated on
the bank-2 data we saw earlier returns - ltnamegtJoelt/namegt
- ltnamegtMarylt/namegt
- E.g. /bank-2/customer/name/text( )
- returns the same names, but without the
enclosing tags
52XPath
- XPath is used to address (select) parts of
documents using path expressions - A path expression is a sequence of steps
separated by / - Think of file names in a directory hierarchy
- Result of path expression set of values that
along with their containing elements/attributes
match the specified path - E.g. /bank-2/customer/name evaluated on
the bank-2 data we saw earlier returns - ltnamegtJoelt/namegt
- ltnamegtMarylt/namegt
- E.g. /bank-2/customer/name/text( )
- returns the same names, but without the
enclosing tags
53XPath (Cont.)
- The initial / denotes root of the document
(above the top-level tag) - Path expressions are evaluated left to right
- Each step operates on the set of instances
produced by the previous step - Selection predicates may follow any step in a
path, in - E.g. /bank-2/accountbalance gt 400
- returns account elements with a balance value
greater than 400 - /bank-2/accountbalance returns account
elements containing a balance subelement - Attributes are accessed using _at_
- E.g. /bank-2/accountbalance gt
400/_at_account-number - returns the account numbers of those accounts
with balance gt 400 - IDREF attributes are not dereferenced
automatically (more on this later)
54Functions in XPath
- XPath provides several functions
- The function count() at the end of a path counts
the number of elements in the set generated by
the path - E.g. /bank-2/accountcustomer/count() gt 2
- Returns accounts with gt 2 customers
- Also function for testing position (1, 2, ..) of
node w.r.t. siblings - Boolean connectives and and or and function not()
can be used in predicates - IDREFs can be referenced using function id()
- id() can also be applied to sets of references
such as IDREFS and even to strings containing
multiple references separated by blanks - E.g. /bank-2/account/id(_at_owner)
- returns all customers referred to from the owners
attribute of account elements.
55More XPath Features
- Operator used to implement union
- E.g. /bank-2/account/id(_at_owner)
/bank-2/loan/id(_at_borrower) - gives customers with either accounts or loans
- However, cannot be nested inside other
operators. - // can be used to skip multiple levels of nodes
- E.g. /bank-2//name
- finds any name element anywhere under the
/bank-2 element, regardless of the element in
which it is contained. - A step in the path can go to (13 variations in
the standard) - parents, siblings, ancestors and descendants
- of the nodes generated by the previous step, not
just to the children - //, described above, is a short from for
specifying all descendants - .. specifies the parent.
56FLWR Syntax in XQuery
- Simple FLWR expression in XQuery
- find all accounts with balance gt 400, with each
result enclosed in an ltaccount-numbergt ..
lt/account-numbergt tag for x in
/bank-2/account let acctno
x/_at_account-number where x/balance gt 400
return ltaccount-numbergt acctno
lt/account-numbergt - Let clause not really needed in this query, and
selection can be done In XPath. Query can be
written as - for x in /bank-2/accountbalancegt400 return
ltaccount-numbergt X/_at_account-number
lt/account-numbergt
57Path Expressions and Functions
- Path expressions are used to bind variables in
the for clause, but can also be used in other
places - E.g. path expressions can be used in let clause,
to bind variables to results of path expressions - The function distinct( ) can be used to removed
duplicates in path expression results - The function document(name) returns root of named
document - E.g. document(bank-2.xml)/bank-2/account
- Aggregate functions such as sum( ) and count( )
can be applied to path expression results - XQuery does not support groupby, but the same
effect can be got by nested queries, with nested
FLWR expressions within a return clause - More on nested queries later
58Joins
- Joins are specified in a manner very similar to
SQLfor b in /bank/account, - c in /bank/customer,
- d in /bank/depositor
- where a/account-number d/account-number
and c/customer-name d/customer-name - return ltcust-acctgt c a lt/cust-acctgt
- The same query can be expressed with the
selections specified as XPath selections - for a in /bank/account c in
/bank/customer d in /bank/depositor
account-number a/account-number and
customer-name
c/customer-name - return ltcust-acctgt c alt/cust-acctgt
59Changing Nesting Structure
- The following query converts data from the flat
structure for bank information into the nested
structure used in bank-1 - ltbank-1gt
- for c in /bank/customer
- return
- ltcustomergt
- c/
- for d in /bank/depositorcustomer-name
c/customer-name, - a in /bank/accountaccount-numberd/a
ccount-number - return a
- lt/customergt
- lt/bank-1gt
- c/ denotes all the children of the node to
which c is bound, without the enclosing
top-level tag - Exercise for reader write a nested query to find
sum of accountbalances, grouped by branch.
60XQuery Path Expressions
- c/text() gives text content of an element
without any subelements/tags - XQuery path expressions support the gt operator
for dereferencing IDREFs - Equivalent to the id( ) function of XPath, but
simpler to use - Can be applied to a set of IDREFs to get a set of
results
61Sorting in XQuery
- Sortby clause can be used at the end of any
expression. E.g. to return customers sorted by
name for c in /bank/customer return
ltcustomergt c/ lt/customergt sortby(name) - Can sort at multiple levels of nesting (sort by
customer-name, and by account-number within each
customer) - ltbank-1gt for c in /bank/customer
return ltcustomergt c/ for d in
/bank/depositorcustomer-namec/customer-name,
a in /bank/accountaccount-numberd/ac
count-number return ltaccountgt a/
lt/accountgt sortby(account-number) lt/customergt
sortby(customer-name) - lt/bank-1gt
62Functions and Other XQuery Features
- User defined functions with the type system of
XMLSchema function balances(xsdstring c)
returns list(xsdnumeric) for d in
/bank/depositorcustomer-name c,
a in /bank/accountaccount-numberd/account-numb
er return a/balance -
- Types are optional for function parameters and
return values - Universal and existential quantification in where
clause predicates - some e in path satisfies P
- every e in path satisfies P
- XQuery also supports If-then-else clauses
63- Xmark http//www.xml-benchmark.org
- Used in most experiments on Xpath and Xquery
evaluation - Old figures on hand-compiled queries for the
dataguide approach can be found in - http//www.cwi.nl/mk/xmarkArchive/Reports/Monet_r
eport/monet_report.html
64Xmark
65XMark
66Monet XML approach
Monet XML approach
67XMark
- Q1 Return the name of the person with ID
personal - FOR b IN /site/people/person_at_idpersonal
- RETURN b/name/text()
68Do it yourself Or skip
69Xmark queries
- Q2 Return the initial increases of all open
auctions. - This query evaluates the cost of array look-ups.
Note that this query may actually be harder to
evaluate than it looks especially relational
back-ends may have to struggle with rather
complex aggregations to select the bidder element
with index 1. - Q3 Return the IDs of all open auctions whose
current increase is at least twice as high as the
initial increase. - This is a more complex application of index
lookups. In the case of a relational DBMS, the
query can take advantage of set-valued aggregates
on the index attribute to accelerate the
execution.
70Xmark queries
- Q4 List the reserves of those open auctions
where a certain person issued a bid before
another person - This time, we stress the textual nature of XML
documents by querying the tag order in the source
document - Q5 How many sold items cost more than 40?
- Strings are the generic data type in XML
documents. Queries that interpret strings will
often need to cast strings to another data type
that carries more semantics. This query
challenges the DBMS in terms of the casting
primitives it provides. Especially, if there is
no additional schema information or just a DTD at
hand, casts are likely to occur frequently.
71Xmark queries
- Q6 How many items are listed on all continents?
- Regular path expressions are a fundamental
building block of virtually every query language
for XML or semi-structured data. These queries
investigate how well the query processor can
optimize path expressions and prune traversals of
irrelevant parts of the tree. - Q7 How many pieces of prose are in our database?
- A good evaluation engine should realize that
there is no need to traverse the complete
document tree to evaluate such expressions.Also
note that the COUNT aggregation does not require
a complete traversal of the tree. Just the
cardinality of the respective relation is
queried. Note that the tag ltemailgt does not
exist in the database document.
72Xmark queries
- Q8 List the names of persons and the number of
items they bought. (joins person,
closed\_auction) - References are an integral part of XML as they
allow richer relationships than just hierarchical
element structures. This query defines
horizontal traversals with increasing complexity.
A good query optimizer should take advantage of
the cardinalities of the sets to be joined. - Q9 List the names of persons and the names of
the items they bought in Europe. (joins person,
closed_auction, item) - References are an integral part of XML as they
allow richer relationships than just hierarchical
element structures. These queries define
horizontal traversals with increasing complexity.
A good query optimizer should take advantage of
the cardinalities of the sets to be joined.
73Xmark queries
- Q10 List all persons according to their
interest use french markup in the result. - Constructing new elements may put the storage
engine under stress especially in the context of
creating materialized document views. The
following query reverses the structure of person
records by grouping them according to the
interest profile of a person. Large parts of the
person records are repeatedly reconstructed. To
avoid simple copying of the original database we
translate the mark-up into french. - Q11 For each person, list the number of items
currently on sale whose price does not exceed
0.02\ of the person's income - This query tests the database's ability to handle
large (intermediate) results. This time, joins
are on the basis of values. The difference
between these queries and the reference chasing
queries Q8 and Q9 is that references are
specified in the DTD and may be optimized with
logical OIDs for example. The two queries Q11
and Q12 cascade in thesize of the result set and
provide various optimization opportunities.
74Xmark queries
- Q12 For each richer-than-average person, list
the number of items currently on sale whose price
does not exceed 0.02 of the person's income - This query tests the database's ability to handle
large (intermediate) results. This time, joins
are on the basis of values. The difference
between these queries and the reference chasing
queries Q8 and Q9 is that references are
specified in the DTD and may be optimized with
logical OIDs for example. The two queries Q11
and Q12 cascade in the size of the result set and
provide various optimization opportunities. - Q13 List the names of items registered in
Australia along with their descriptions. - A key design for XML-gtDBMS mappings is to
determine the fragmentation criteria. The
complementary action is to reconstruct the
original document from its broken-down
representation. Query 13 tests for the ability
of the database to reconstruct portions of the
original XML document.
75Xmark queries
- Q14Return the names of all items whose
description contains the word gold'. - We continue to challenge the textual nature of
XML documents this time, we conduct a full-text
search in the form of keyword search. Although
full-text scanning could be studied in isolation
we think that the interaction with structural
mark-up is essential as the concepts are
considered orthogonal so query Q14 is restricted
to a subset of the document by combining content
and structure. - Q15 Print the keywords in emphasis in
annotations of closed auctions. - We now try to quantify the costs of long path
traversals that don't include wildcards. We
first descend deep into the tree (Query 15) and
then return again (Query 16). Both queries only
check for the existence of paths rather than
selecting paths with predicates.
76Xmark queries
- Q16 Return the IDs of those auctions that have
one or more kweywords in emphasis. - Q17Which persons don't have a homepage?
- This is to test how well the query processors
knows to deal with the semi-structured aspect of
XML data, especially elements that are declared
optional in the DTD. - Q18Convert the currency of the reserve of all
open auctions to another currency. - This query puts the application of user defined
functions (UDF) to the proof. In the XML world,
UDFs are of particular importance because they
allow the user to assign semantics to generic
strings that go beyond type coercion.
77Query optimizer challenges
- Mapping Xquery to a RBDMS should be able
- to deal with ordered tables
- to skip sub-documents
- to perform dynamic type casting
- to avoid unnecessary construction of string
intermediates - to recognize join-paths for fast access
- to balance fragmentation and reconstruction cose
78Xmark answers
- Q2 Return the initial increases of all open
auctions. - This query evaluates the cost of array look-ups.
Note that this query may actually be harder to
evaluate than it looks especially relational
back-ends may have to struggle with rather
complex aggregations to select the bidder element
with index 1. - FOR b IN document("auction.xml")/site/open_auc
tions/open_auction - RETURN ltincreasegt b/bidder1/increase/text()
lt/increasegt
79XMark
- Q3 Return the IDs of all open auctions whose
current increase is at least twice as high as the
initial increase. - This is a more complex application of index
lookups. In the case of a relational DBMS, the
query can take advantage of set-valued aggregates
on the index attribute to accelerate the
execution. - FOR b IN document("auction.xml")/site/open_auc
tions/open_auction - WHERE b/bidder0/increase/text() 2 lt
b/bidderlast()/increase/text() - RETURN ltincrease firstb/bidder0/increase/text(
) - lastb/bidderlast()/increase/t
ext()/gt
80Xmark result
- Q4 List the reserves of those open auctions
where a certain person issued a bid before
another person - This time, we stress the textual nature of XML
documents by querying the tag order in the source
document - FOR b IN document("auction.xml")/site/open_auc
tions/open_auction - WHERE b/bidder/personrefid"person18829"
BEFORE - b/bidder/personrefid"person10487"
- RETURN lthistorygt b/initial/text() lt/historygt
81Xmark answers
- Q5 How many sold items cost more than 40?
- Strings are the generic data type in XML
documents. Queries that interpret strings will
often need to cast strings to another data type
that carries more semantics. This query
challenges the DBMS in terms of the casting
primitives it provides. Especially, if there is
no additional schema information or just a DTD at
hand, casts are likely to occur frequently. - COUNT (FOR i document("auction.xml")/site/clo
sed_auctions/closed_auction - WHERE i/price/text() gt 40
- RETURN i/price)
82Xmark results
- Q6 How many items are listed on all continents?
- Regular path expressions are a fundamental
building block of virtually every query language
for XML or semi-structured data. These queries
investigate how well the query processor can
optimize path expressions and prune traversals of
irrelevant parts of the tree. - FOR b IN document("auction.xml")/site/regions
- RETURN COUNT (b//item)
83Xmark results
- Q7 How many pieces of prose are in our database?
- A good evaluation engine should realize that
there is no need to traverse the complete
document tree to evaluate such expressions.Also
note that the COUNT aggregation does not require
a complete traversal of the tree. Just the
cardinality of the respective relation is
queried. Note that the tag ltemailgt does not
exist in the database document. - FOR p IN document("auction.xml")/site
- RETURN count(p//description)
count(p//annotation) count(p//email)
84Xmark results
- Q8 List the names of persons and the number of
items they bought. (joins person,
closed\_auction) - References are an integral part of XML as they
allow richer relationships than just hierarchical
element structures. This query defines
horizontal traversals with increasing complexity.
A good query optimizer should take advantage of
the cardinalities of the sets to be joined. - FOR p IN document("auction.xml")/site/people/p
erson - LET a FOR t IN document("auction.xml")/sit
e/closed_auctions/closed_auction - WHERE t/buyer/_at_person p/_at_id
- RETURN t
- RETURN ltitem personp/name/text()gt COUNT (a)
lt/itemgt
85Xmark results
- Q9 List the names of persons and the names of
the items they bought in Europe. (joins person,
closed_auction, item) - References are an integral part of XML as they
allow richer relationships than just hierarchical
element structures. These queries define
horizontal traversals with increasing complexity.
A good query optimizer should take advantage of
the cardinalities of the sets to be joined. - FOR p IN document("auction.xml")/site/people/p
erson - LET a FOR t IN document("auction.xml")/sit
e/closed_auctions/closed_auction - LET n FOR t2 IN
document("auction.xml")/site/regions/europe/item - WHERE t/itemref/_at_item
t2/_at_id - RETURN t2
- WHERE p/_at_id t/buyer/_at_person
- RETURN ltitemgt n/name/text() lt/itemgt
- RETURN ltperson namep/name/text()gt a lt/persongt
86Xmark results
- Q10 List all persons according to their
interest use french markup in the result. - Constructing new elements may put the storage
engine under stress especially in the context of
creating materialized document views. The
following query reverses the structure of person
records by grouping them according to the
interest profile of a person. Large parts of the
person records are repeatedly reconstructed. To
avoid simple copying of the original database we
translate the mark-up into french.
87- FOR i IN DISTINCT document("auction.xml")/
site/people/person/profile/interest/_at_category - LET p FOR t IN document("auction.xml")/sit
e/people/person - WHERE t/profile/interest/_at_category
i - RETURN ltpersonnegt
- ltstatistiquesgt
- ltsexegt t/gender/text()
lt/sexegt, - ltagegt t/age/text()
lt/agegt, - lteducationgt
t/education/text()lt/educationgt, - ltrevenugt t/income/text()
lt/revenugt - lt/statistiquesgt,
88- ltcoordonneesgt
- ltnomgt t/name/text()
lt/nomgt, - ltruegt t/street/text()
lt/ruegt, - ltvillegt t/city/text()
lt/villegt, - ltpaysgt t/country/text()
lt/paysgt, - ltreseaugt
- ltcourriergt
t/email/text() lt/courriergt, - ltpagePersogt
t/homepage/text()lt/pagePersogt - lt/reseaugt,
- lt/coordonneesgt
- ltcartePaiementgt
t/creditcard/text()lt/cartePaiementgt - lt/personnegt
- RETURN ltcategoriegt
- ltidgt i lt/idgt,
- p
- lt/categoriegt
89Xmark results
- Q11 For each person, list the number of items
currently on sale whose price does not exceed
0.02\ of the person's income - This query tests the database's ability to handle
large (intermediate) results. This time, joins
are on the basis of values. The difference
between these queries and the reference chasing
queries Q8 and Q9 is that references are
specified in the DTD and may be optimized with
logical OIDs for example. The two queries Q11
and Q12 cascade in the size of the result set and
provide various optimization opportunities. - FOR p IN document("auction.xml")/site/people/pers
on - LET c FOR i IN document("auction.xml")/site/o
pen_auctions/open_auction/initial - WHERE p/profile/_at_income gt (5000
i/text()) - RETURN i
- RETURN ltitems namep/profile/_at_incomegt COUNT (c)
lt/itemsgt
90Xmark results
- 12 For each richer-than-average person, list the
number of items currently on sale whose price
does not exceed 0.02 of the person's income - This query tests the database's ability to handle
large (intermediate) results. This time, joins
are on the basis of values. The difference
between these queries and the reference chasing
queries Q8 and Q9 is that references are
specified in the DTD and may be optimized with
logical OIDs for example. The two queries Q11
and Q12 cascade in the size of the result set and
provide various optimization opportunities. FOR
p IN document("auction.xml")/site/people/person - FOR p IN document("auction.xml")/site/people/pers
on - LET l FOR i IN document("auction.xml")/site/o
pen_auctions/open_auction/initial - WHERE p/profile/_at_income gt (5000
i/text()) - RETURN i
- WHERE p/profile/_at_income gt 50000
- RETURN ltitems incomep/profile/_at_incomegt COUNT
(l) lt/itemsgt
91Xmark results
- Q13 List the names of items registered in
Australia along with their descriptions. - A key design for XML-gtDBMS mappings is to
determine the fragmentation criteria. The
complementary action is to reconstruct the
original document from its broken-down
representation. Query 13 tests for the ability
of the database to reconstruct portions of
theoriginal XML document. - FOR i IN document("auction.xml")/site/regions/aus
tralia/item RETURN ltitem namei/name/text()gt
i/description lt/itemgt
92Xmark results
- Q14Return the names of all items whose
description contains the word gold'. - We continue to challenge the textual nature of
XML documents this time, we conduct a full-text
search in the form of keyword search. Although
full-text scanning could be studied in isolation
we think that the interaction with structural
mark-up is essential as the concepts are
considered orthogonal so query Q14 is restricted
to a subset of the document by combining content
and structure. - FOR i IN document("auction.xml")/site//item
- WHERE CONTAINS (i/description,"gold")
- RETURN i/name/text()
93Xmark results
- Q15 Print the keywords in emphasis in
annotations of closed auctions. - We now try to quantify the costs of long path
traversals that don't include wildcards. We
first descend deep into the tree (Query 15) and
then return again (Query 16). Both queries only
check for the existence of paths rather than
selecting paths with predicates. - FOR a IN document("auction.xml")/site/closed_auct
ions/closed_auction/annotation/description/parlist
/listitem/parlist/listitem/text/emph/keyword/text(
) - RETURN lttextgt a lt/textgt
94Xmark results
- Q16 Return the IDs of those auctions that have
one or more kweywords in emphasis. - FOR a IN document("auction.xml")/site/closed_auct
ions/closed_auction - WHERE NOT EMPTY (a/annotation/description/parlist
/listitem/parlist/\ - listitem/text/emph/keyword/te
xt()) - RETURN ltperson ida/seller/_at_person /gt
95Xmark results
- Q17Which persons don't have a homepage?
- This is to test how well the query processors
knows to deal with the semi-structured aspect of
XML data, especially elements that are declared
optional in the DTD. - FOR p IN document("auction.xml")/site/people/p
erson - WHERE EMPTY(p/homepage/text())
- RETURN ltperson namep/name/text()/gt
96Xmark results
- Q18Convert the currency of the reserve of all
open auctions to another currency. - This query puts the application of user defined
functions (UDF) to the proof. In the XML world,
UDFs are of particular importance because they
allow the user to assign semantics to generic
strings that go beyond type coercion. - FUNCTION CONVERT (v)
-
- RETURN 2.20371 v -- convert Dfl to Euros
-
- FOR i IN document("auction.xml")/site/open_auc
tions/open_auction/ - RETURN CONVERT(i/reserve/text())
97Query optimizer challenges
- Mapping Xquery to a RBDMS should be able
- to deal with ordered tables
- to skip sub-documents
- to perform dynamic type casting
- to avoid unnecessary construction of string
intermediates - to recognize join-paths for fast access
- to balance fragmentation and reconstruction cose
98Xmark results
Effect of loading 100Mb document into DBMS
99Xmark results
100Xmark results
101Pathfinder/MonetDB 2004 implementation in seconds