Chapter 10: XML

Slides: 91

Provided by: kers151

1
Chapter 10 XML

The world of XML

2
The Data

Semistructured data instance = a large graph

3
The indexing problem

The storage problem:
- Store the graph in a relational DBMS
- Develop a new database storage structure
The indexing problem:
- Input: large, irregular data graph
- Output: index structure for evaluating (regular) path expressions, e.g. bib.paper.author.firstname

4
XSet: a simple index for XML

Part of the Ninja project at Berkeley
Example XML data

5
XSet: a simple index for XML

Each node = a hashtable
Each entry = list of pointers to data nodes (not shown)

SELECT X FROM part.name X -yes
SELECT X FROM part.supplier.name X -yes
SELECT X FROM part.*.subpart.name X -maybe
SELECT X FROM *.supplier.name X -maybe

6
Region Algebras

structured text = text with tags (like XML)
data = sequence of characters [c1 c2 c3 ...]
region = interval in the text
- representation: (x,y) = [cx, cx+1, ..., cy]
- example: <section> ... </section>
region set = a set of regions
- example: all <section> regions (may be nested)
region algebra = operators on region sets, s1 op s2
- s1 intersect s2 = {r | r ∈ s1, r ∈ s2}
- s1 included s2 = {r | r ∈ s1, ∃r' ∈ s2, r ⊆ r'}
- s1 including s2 = {r | r ∈ s1, ∃r' ∈ s2, r ⊇ r'}
- s1 parent s2 = {r | r ∈ s1, ∃r' ∈ s2, r is a parent of r'}
- s1 child s2 = {r | r ∈ s1, ∃r' ∈ s2, r is a child of r'}

7
Region Algebras

Region expressions correspond to simple XPath expressions
s1 child s2 = {r | r ∈ s1, ∃r' ∈ s2, r is a child of r'}

part.name = name child (part child root)
part.supplier.name = name child (supplier child (part child root))
*.supplier.name = name child supplier
part.*.subpart.name = name child (subpart included (part child root))

8
Efficient computation of Region Algebra Operators

Example: s1 included s2
s1 = {(x1,x1'), (x2,x2'), ...}
s2 = {(y1,y1'), (y2,y2'), ...}
(i.e. assume each consists of disjoint regions)
Algorithm:
- if xi < yj then i := i + 1
- if xi' > yj' then j := j + 1
- otherwise print (xi,xi'), do i := i + 1
Can be done in sub-linear time when one of the region sets is very small
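
A minimal Python sketch of the merge step for "s1 included s2", assuming (as on the slide) that both inputs are lists of disjoint regions sorted by start position. The function name `included` and the sample regions are illustrative, not taken from the slides.

```python
def included(s1, s2):
    """Return the regions of s1 that lie inside some region of s2.

    Both inputs are lists of (start, end) pairs, sorted by start,
    and the regions within each list do not overlap.
    """
    result = []
    i = j = 0
    while i < len(s1) and j < len(s2):
        x, x_end = s1[i]
        y, y_end = s2[j]
        if x < y:
            # s1[i] starts before s2[j], so it cannot lie inside s2[j] or any later region.
            i += 1
        elif x_end > y_end:
            # s1[i] ends after s2[j]; only a later s2 region could contain it.
            j += 1
        else:
            # y <= x and x_end <= y_end: s1[i] is contained in s2[j].
            result.append((x, x_end))
            i += 1
    return result


s1 = [(1, 2), (5, 6), (12, 20), (31, 33)]
s2 = [(4, 10), (11, 15), (30, 40)]
print(included(s1, s2))
```

Each loop iteration advances one of the two cursors, so the scan is linear in the combined input size. The sub-linear variant mentioned on the slide would need an index or binary search over the larger set, which this sketch does not attempt.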

9
Storage structures for region algebras

Every node is characterised by an integer pair
(x,y)
This means we have a 2-d space
Any 2-d space data structure can be used
If you use a (pre-order,post-order) numbering you
get triangular filling of 2-d
(to be discussed later)
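
As a rough illustration of the (pre-order, post-order) numbering, the Python sketch below assigns both ranks in a single traversal and uses them for a descendant test. The tree encoding as `(label, [children])` tuples and the helper names are assumptions made for this example only.

```python
def number_tree(tree):
    """Return a list of (label, pre, post) rows for a tree given as (label, [children])."""
    rows = []
    counters = [0, 0]  # next pre-order rank, next post-order rank

    def visit(node):
        label, children = node
        pre = counters[0]
        counters[0] += 1
        for child in children:
            visit(child)
        post = counters[1]
        counters[1] += 1
        rows.append((label, pre, post))

    visit(tree)
    return sorted(rows, key=lambda row: row[1])  # document (pre-order) order


def is_descendant(v, u):
    """v is a descendant of u exactly when pre(u) < pre(v) and post(v) < post(u)."""
    return u[1] < v[1] and v[2] < u[2]


part = ("part", [("name", []), ("supplier", [("name", [])])])
rows = number_tree(part)
print(rows)
```

With this numbering, the descendants of a node occupy one rectangular region of the pre/post plane, which is why any 2-d index structure can serve structural queries.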

10
Alternative mappings

Mapping the structure to the relational world
The Edge approach
The Attribute approach
The Universal Table approach
The Normalized Universal approach
The Monet/XML approach
The Dataguide approach
Mapping values
Separate value tables
Inlining
Shredding

11
Dataguide approach

Developed in the context of Lore, Lorel (Stanford Univ)
Predecessor of the Monet/XML model
Observation:
- queries in the graph representation take a limited form
- they are partial walks from the root to an object of interest
- this behaviour was stressed by the query language Lorel, i.e. an SQL-based query language based on processing regular expressions

SELECT X FROM (Bib.*.author).(lastname|firstname).Abiteboul X
12
DataGuides

Definition
given a semistructured data instance DB, a
DataGuide for DB is a graph G s.t.
- every path in DB also occurs in G
- every path in G occurs in DB
- every path in G is unique

13
Dataguides

Example

14
DataGuides

Multiple DataGuides for the same data

15
DataGuides

Definition
Let w, w' be two words (i.e. word queries) and G a graph
w ≡G w' if w(G) = w'(G)
Definition
G is a strong dataguide for a database DB if ≡G is the same as ≡DB
Example
- G1 is a strong dataguide
- G2 is not strong
person.project ≢DB dept.project
person.project ≢G2 dept.project

16
DataGuides

Constructing the strong DataGuide G
Nodes(G) = {{root}}
Edges(G) = ∅
while changes do
  choose s in Nodes(G), a in Labels
  add s' = {y | x in s, (x -a-> y) in Edges(DB)} to Nodes(G)
  add (s -a-> s') to Edges(G)
Use hash table for Nodes(G)
This is precisely the powerset automaton construction.
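
The construction can be sketched in Python as a worklist version of the powerset construction. This is only an illustration under assumptions: the edge-triple encoding, the function name `strong_dataguide`, and the small sample graph are hypothetical and not taken from the slides.

```python
from collections import deque


def strong_dataguide(edges, root):
    """Build a strong DataGuide from (source, label, target) edge triples.

    Each DataGuide node is a frozenset of database nodes (its extent).
    """
    out = {}
    for x, a, y in edges:
        out.setdefault(x, {}).setdefault(a, set()).add(y)

    start = frozenset([root])
    nodes = {start}          # hash-based set plays the role of the hash table
    guide_edges = set()
    queue = deque([start])
    while queue:
        s = queue.popleft()
        labels = set()
        for x in s:
            labels.update(out.get(x, {}))
        for a in sorted(labels):
            # s' = {y | x in s, (x -a-> y) in Edges(DB)}
            target = frozenset(y for x in s for y in out.get(x, {}).get(a, ()))
            guide_edges.add((s, a, target))
            if target not in nodes:
                nodes.add(target)
                queue.append(target)
    return nodes, guide_edges


# Hypothetical database graph with persons, a department and shared projects.
db_edges = [
    ("root", "person", "p1"), ("root", "person", "p2"), ("root", "dept", "d1"),
    ("p1", "project", "j1"), ("p2", "project", "j2"), ("d1", "project", "j1"),
]
nodes, guide_edges = strong_dataguide(db_edges, "root")
print(len(nodes), len(guide_edges))
```

Because every label path leads to exactly one DataGuide node, the extent of a path query can be read off that node directly.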

17
DataGuides

How large are the dataguides?
if DB is a tree, then size(G) <= size(DB)
why? answer: every node is in exactly one extent of G
here dataguide = XSet
How many nodes does the strong dataguide have for this DB?

20 nodes (least common multiple of 4 and 5)
Dataguides usually fail on data with cyclic schemas, like

18
Monet XML approach
19
Monet XML approach
20
Monet XML approach
21
Monet XML approach
22
Monet XML approach
23

Querying the XML world

24
Querying and Transforming XML Data

Standard XML querying/translation languages
XPath
Simple language consisting of path expressions
XSLT
Simple language designed for translation from XML
to XML and XML to HTML
XQuery
An XML query language with a rich set of features
A wide variety of other languages have been proposed, and some served as the basis for the XQuery standard:
XML-QL, Quilt, XQL, ...

25
Tree Model of XML Data

Query and transformation languages are based on a
tree model of XML data
An XML document is modeled as a tree, with nodes
corresponding to elements and attributes
Element nodes have children nodes, which can be
attributes or subelements
Text in an element is modeled as a text node
child of the element
Children of a node are ordered according to their
order in the XML document
Element and attribute nodes (except for the root
node) have a single parent, which is an element
node
The root node has a single child, which is the
root element of the document
We use the terminology of nodes, children,
parent, siblings, ancestor, descendant, etc.,
which should be interpreted in the above tree
model of XML data.

26
XML data with ID and IDREF attributes

<bank-2>
  <account account-number="A-401" owners="C100 C102">
    <branch-name> Downtown </branch-name>
    <balance> 500 </balance>
  </account>
  <customer customer-id="C100" accounts="A-401">
    <customer-name>Joe</customer-name>
    <customer-street>Monroe</customer-street>
    <customer-city>Madison</customer-city>
  </customer>
  <customer customer-id="C102" accounts="A-401 A-402">
    <customer-name> Mary</customer-name>
    <customer-street> Erin</customer-street>
    <customer-city> Newark </customer-city>
  </customer>
</bank-2>

27
XPath

XPath is used to address (select) parts of documents using path expressions
A path expression is a sequence of steps separated by "/"
Think of file names in a directory hierarchy
Result of path expression: set of values that, along with their containing elements/attributes, match the specified path
E.g. /bank-2/customer/customer-name evaluated on the bank-2 data we saw earlier returns
<customer-name>Joe</customer-name>
<customer-name>Mary</customer-name>
E.g. /bank-2/customer/customer-name/text( )
returns the same names, but without the enclosing tags

28
XPath (Cont.)

The initial "/" denotes root of the document (above the top-level tag)
Path expressions are evaluated left to right
Each step operates on the set of instances produced by the previous step
Selection predicates may follow any step in a path, in [ ]
E.g. /bank-2/account[balance > 400]
returns account elements with a balance value greater than 400
/bank-2/account[balance] returns account elements containing a balance subelement
Attributes are accessed using "@"
E.g. /bank-2/account[balance > 400]/@account-number
returns the account numbers of those accounts with balance > 400
IDREF attributes are not dereferenced automatically (more on this later)
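
To try these path expressions outside an XQuery engine, the following Python sketch evaluates comparable paths over a trimmed copy of the bank-2 data with the standard-library `xml.etree.ElementTree` module. That module implements only a subset of XPath (no numeric comparisons in predicates), so the `balance > 400` test is applied in Python here; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed, hand-written copy of the bank-2 example data.
BANK2 = """
<bank-2>
  <account account-number="A-401" owners="C100 C102">
    <branch-name> Downtown </branch-name>
    <balance> 500 </balance>
  </account>
  <customer customer-id="C100" accounts="A-401">
    <customer-name>Joe</customer-name>
  </customer>
  <customer customer-id="C102" accounts="A-401 A-402">
    <customer-name> Mary</customer-name>
  </customer>
</bank-2>
"""

root = ET.fromstring(BANK2.strip())  # root is the <bank-2> element itself

# /bank-2/customer/customer-name/text()
names = [e.text.strip() for e in root.findall("./customer/customer-name")]

# /bank-2/account[balance]  (accounts that contain a balance subelement)
with_balance = root.findall("./account[balance]")

# /bank-2/account[balance > 400]/@account-number
# The numeric comparison is done in Python, after casting the text to a number.
rich = [a.get("account-number")
        for a in root.findall("./account")
        if float(a.findtext("balance")) > 400]

print(names, len(with_balance), rich)
```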

29
Functions in XPath

XPath provides several functions
The function count() at the end of a path counts the number of elements in the set generated by the path
E.g. /bank-2/account[customer/count() > 2]
Returns accounts with > 2 customers
Also function for testing position (1, 2, ..) of node w.r.t. siblings
Boolean connectives "and" and "or" and function not() can be used in predicates
IDREFs can be referenced using function id()
id() can also be applied to sets of references such as IDREFS and even to strings containing multiple references separated by blanks
E.g. /bank-2/account/id(@owners)
returns all customers referred to from the owners attribute of account elements.
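
What id() does can be imitated in a few lines of Python. The sketch below is an approximation under one stated assumption: without a DTD nothing declares which attribute has type ID, so `customer-id` is simply treated as the ID attribute. The helper `deref` is hypothetical.

```python
import xml.etree.ElementTree as ET

BANK2 = """<bank-2>
  <account account-number="A-401" owners="C100 C102"/>
  <customer customer-id="C100"><customer-name>Joe</customer-name></customer>
  <customer customer-id="C102"><customer-name>Mary</customer-name></customer>
</bank-2>"""

root = ET.fromstring(BANK2)

# Assumption: customer-id plays the role of the ID attribute.
by_id = {c.get("customer-id"): c for c in root.findall("./customer")}


def deref(idrefs):
    """Resolve a blank-separated IDREFS string to the referenced elements."""
    return [by_id[ref] for ref in idrefs.split() if ref in by_id]


# /bank-2/account/id(@owners)
owners = [c.findtext("customer-name")
          for a in root.findall("./account")
          for c in deref(a.get("owners", ""))]

# A count-style predicate: accounts referring to more than one owner.
shared = [a.get("account-number")
          for a in root.findall("./account")
          if len(deref(a.get("owners", ""))) > 1]

print(owners, shared)
```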

30
More XPath Features

Operator "|" used to implement union
E.g. /bank-2/account/id(@owners) | /bank-2/loan/id(@borrower)
gives customers with either accounts or loans
However, "|" cannot be nested inside other operators.
"//" can be used to skip multiple levels of nodes
E.g. /bank-2//name
finds any name element anywhere under the /bank-2 element, regardless of the element in which it is contained.
A step in the path can go to (13 variations in the standard)
parents, siblings, ancestors and descendants of the nodes generated by the previous step, not just to the children
"//", described above, is a short form for specifying "all descendants"
".." specifies the parent.

31
Pathfinder

XPath is essential for the implementation of an XQuery processor. It is strongly related to the data structures and their primitives.
A state-of-the-art implementation is MonetDB/Pathfinder, developed by Uni. Konstanz, Twente University, and CWI

32
Pathfinder Uni Konstanz
33
Pathfinder
34
Pathfinder
35
Pathfinder
36
37
Pathfinder
38
Pathfinder
39
Pathfinder
40
Pathfinder
41
Pathfinder
42
Staircase join
43
Staircase join
44
Pathfinder
45
Pathfinder
46
Pathfinder
47
Pathfinder
48

XQuery

49
XQuery

XQuery is a general purpose query language for XML data
Currently being standardized by the World Wide Web Consortium (W3C)
The textbook description is based on a March 2001 draft of the standard. The final version may differ, but major features are likely to stay unchanged.
Alpha versions of XQuery engines:
- Galax http://db.bell-labs.com/galax/
- IPSI-IQ
- XPath visualized http://www.vbxml.com/xpathvisualizer/
- MonetDB/Pathfinder
- Xhive
XQuery is derived from the Quilt query language, which itself borrows from SQL, XQL and XML-QL

50
XQuery

XQuery uses a for ... let ... where ... return ... syntax
- for ⇔ SQL from
- where ⇔ SQL where
- return ⇔ SQL select
- let allows temporary variables, and has no equivalent in SQL
Variables make it possible to keep the state of processing around, which severely complicates optimization

51
FLWR Syntax in XQuery

The for clause uses XPath expressions, and a variable in the for clause ranges over the values in the set returned by the XPath expression

56
FLWR Syntax in XQuery

Simple FLWR expression in XQuery
find all accounts with balance > 400, with each result enclosed in an <account-number> .. </account-number> tag
    for $x in /bank-2/account
    let $acctno := $x/@account-number
    where $x/balance > 400
    return <account-number> $acctno </account-number>
Let clause not really needed in this query, and selection can be done in XPath. Query can be written as
    for $x in /bank-2/account[balance > 400]
    return <account-number> $x/@account-number </account-number>
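
For readers more familiar with list comprehensions, the FLWR query above maps almost clause by clause onto Python. This is only an analogy, not how an XQuery engine evaluates the expression; the second account in the sample data is an assumption added so that the where clause filters something out.

```python
import xml.etree.ElementTree as ET

# Illustrative data: the A-402 account is made up for contrast.
BANK2 = """<bank-2>
  <account account-number="A-401"><balance>500</balance></account>
  <account account-number="A-402"><balance>300</balance></account>
</bank-2>"""

root = ET.fromstring(BANK2)

result = [
    f"<account-number>{acctno}</account-number>"   # return
    for x in root.findall("./account")             # for $x in /bank-2/account
    for acctno in [x.get("account-number")]        # let $acctno := $x/@account-number
    if float(x.findtext("balance")) > 400          # where $x/balance > 400
]
print(result)
```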

57
Path Expressions and Functions

Path expressions are used to bind variables in the for clause, but can also be used in other places
E.g. path expressions can be used in a let clause, to bind variables to results of path expressions
The function distinct( ) can be used to remove duplicates in path expression results
The function document(name) returns the root of the named document
E.g. document("bank-2.xml")/bank-2/account
Aggregate functions such as sum( ) and count( ) can be applied to path expression results
XQuery does not support group by, but the same effect can be obtained by nested queries, with nested FLWR expressions within a return clause
More on nested queries later

58
Joins

Joins are specified in a manner very similar to SQL
    for $a in /bank/account,
        $c in /bank/customer,
        $d in /bank/depositor
    where $a/account-number = $d/account-number
      and $c/customer-name = $d/customer-name
    return <cust-acct> $c $a </cust-acct>
The same query can be expressed with the selections specified as XPath selections
    for $a in /bank/account,
        $c in /bank/customer,
        $d in /bank/depositor[
              account-number = $a/account-number and
              customer-name = $c/customer-name]
    return <cust-acct> $c $a </cust-acct>
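
The three-way join can be mimicked in Python with nested iteration over the three element sets. The flat bank document below is made-up sample data in the spirit of the slides, and the result is reduced to (customer-name, account-number) pairs rather than full cust-acct elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical flat bank data: accounts, customers and a depositor relation.
BANK = """<bank>
  <account><account-number>A-101</account-number><balance>500</balance></account>
  <account><account-number>A-102</account-number><balance>400</balance></account>
  <customer><customer-name>Johnson</customer-name></customer>
  <customer><customer-name>Hayes</customer-name></customer>
  <depositor><account-number>A-101</account-number><customer-name>Johnson</customer-name></depositor>
  <depositor><account-number>A-101</account-number><customer-name>Hayes</customer-name></depositor>
  <depositor><account-number>A-102</account-number><customer-name>Hayes</customer-name></depositor>
</bank>"""

root = ET.fromstring(BANK)

pairs = [
    (c.findtext("customer-name"), a.findtext("account-number"))
    for a in root.findall("./account")        # for $a in /bank/account,
    for c in root.findall("./customer")       #     $c in /bank/customer,
    for d in root.findall("./depositor")      #     $d in /bank/depositor
    if a.findtext("account-number") == d.findtext("account-number")
    and c.findtext("customer-name") == d.findtext("customer-name")
]
print(sorted(pairs))
```

The nested loops enumerate every combination, which is exactly the behaviour a query optimizer tries to avoid by recognizing the join and using a hash or index lookup on the depositor keys.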

59
Changing Nesting Structure

The following query converts data from the flat structure for bank information into the nested structure used in bank-1
    <bank-1>
      for $c in /bank/customer
      return
        <customer>
          $c/*
          for $d in /bank/depositor[customer-name = $c/customer-name],
              $a in /bank/account[account-number = $d/account-number]
          return $a
        </customer>
    </bank-1>
$c/* denotes all the children of the node to which $c is bound, without the enclosing top-level tag
Exercise for reader: write a nested query to find the sum of account balances, grouped by branch.

60
XQuery Path Expressions

$c/text() gives text content of an element without any subelements/tags
XQuery path expressions support the "->" operator for dereferencing IDREFs
Equivalent to the id( ) function of XPath, but simpler to use
Can be applied to a set of IDREFs to get a set of results

61
Sorting in XQuery

Sortby clause can be used at the end of any expression. E.g. to return customers sorted by name
    for $c in /bank/customer
    return <customer> $c/* </customer> sortby(name)
Can sort at multiple levels of nesting (sort by customer-name, and by account-number within each customer)
    <bank-1>
      for $c in /bank/customer
      return
        <customer>
          $c/*
          for $d in /bank/depositor[customer-name = $c/customer-name],
              $a in /bank/account[account-number = $d/account-number]
          return <account> $a/* </account> sortby(account-number)
        </customer> sortby(customer-name)
    </bank-1>

62
Functions and Other XQuery Features

User defined functions with the type system of XMLSchema
    function balances(xsd:string $c) returns list(xsd:numeric) {
      for $d in /bank/depositor[customer-name = $c],
          $a in /bank/account[account-number = $d/account-number]
      return $a/balance
    }
Types are optional for function parameters and return values
Universal and existential quantification in where clause predicates
    some $e in path satisfies P
    every $e in path satisfies P
XQuery also supports if-then-else clauses
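
The quantifiers correspond closely to Python's built-in `any` and `all`, which may help when reading such predicates. The sketch below uses made-up account data; the threshold 800 and all variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical account data for the quantifier examples.
BANK = """<bank>
  <account><account-number>A-101</account-number><balance>500</balance></account>
  <account><account-number>A-102</account-number><balance>900</balance></account>
  <account><account-number>A-103</account-number><balance>350</balance></account>
</bank>"""

root = ET.fromstring(BANK)
balances = [float(b.text) for b in root.findall("./account/balance")]

# some $e in /bank/account/balance satisfies $e > 800
some_large = any(b > 800 for b in balances)

# every $e in /bank/account/balance satisfies $e > 800
every_large = all(b > 800 for b in balances)

# if-then-else, written as a Python conditional expression
label = "has a large account" if some_large else "no large account"

print(some_large, every_large, label)
```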

XMark http://www.xml-benchmark.org
Used in most experiments on XPath and XQuery evaluation
Old figures on hand-compiled queries for the dataguide approach can be found in
http://www.cwi.nl/mk/xmarkArchive/Reports/Monet_report/monet_report.html

64
Xmark
65
XMark
66
Monet XML approach
67
XMark

Q1 Return the name of the person with ID "personal"
    FOR $b IN /site/people/person[@id = "personal"]
    RETURN $b/name/text()

68
Do it yourself Or skip
69
Xmark queries

Q2 Return the initial increases of all open
auctions.
This query evaluates the cost of array look-ups.
Note that this query may actually be harder to evaluate than it looks; relational back-ends in particular may have to struggle with rather complex aggregations to select the bidder element with index 1.
Q3 Return the IDs of all open auctions whose
current increase is at least twice as high as the
initial increase.
This is a more complex application of index
lookups. In the case of a relational DBMS, the
query can take advantage of set-valued aggregates
on the index attribute to accelerate the
execution.

70
Xmark queries

Q4 List the reserves of those open auctions
where a certain person issued a bid before
another person
This time, we stress the textual nature of XML
documents by querying the tag order in the source
document
Q5 How many sold items cost more than 40?
Strings are the generic data type in XML
documents. Queries that interpret strings will
often need to cast strings to another data type
that carries more semantics. This query
challenges the DBMS in terms of the casting
primitives it provides. Especially, if there is
no additional schema information or just a DTD at
hand, casts are likely to occur frequently.

71
Xmark queries

Q6 How many items are listed on all continents?
Regular path expressions are a fundamental
building block of virtually every query language
for XML or semi-structured data. These queries
investigate how well the query processor can
optimize path expressions and prune traversals of
irrelevant parts of the tree.
Q7 How many pieces of prose are in our database?
A good evaluation engine should realize that
there is no need to traverse the complete
document tree to evaluate such expressions. Also
note that the COUNT aggregation does not require
a complete traversal of the tree. Just the
cardinality of the respective relation is
queried. Note that the tag ltemailgt does not
exist in the database document.

72
Xmark queries

Q8 List the names of persons and the number of
items they bought. (joins person,
closed_auction)
References are an integral part of XML as they
allow richer relationships than just hierarchical
element structures. This query defines
horizontal traversals with increasing complexity.
A good query optimizer should take advantage of
the cardinalities of the sets to be joined.
Q9 List the names of persons and the names of
the items they bought in Europe. (joins person,
closed_auction, item)
References are an integral part of XML as they
allow richer relationships than just hierarchical
element structures. These queries define
horizontal traversals with increasing complexity.
A good query optimizer should take advantage of
the cardinalities of the sets to be joined.

73
Xmark queries

Q10 List all persons according to their interest; use French markup in the result.
Constructing new elements may put the storage
engine under stress especially in the context of
creating materialized document views. The
following query reverses the structure of person
records by grouping them according to the
interest profile of a person. Large parts of the
person records are repeatedly reconstructed. To
avoid simple copying of the original database we
translate the mark-up into French.
Q11 For each person, list the number of items currently on sale whose price does not exceed 0.02% of the person's income
This query tests the database's ability to handle
large (intermediate) results. This time, joins
are on the basis of values. The difference
between these queries and the reference chasing
queries Q8 and Q9 is that references are
specified in the DTD and may be optimized with
logical OIDs, for example. The two queries Q11 and Q12 cascade in the size of the result set and provide various optimization opportunities.

74
Xmark queries

Q12 For each richer-than-average person, list the number of items currently on sale whose price does not exceed 0.02% of the person's income
This query tests the database's ability to handle
large (intermediate) results. This time, joins
are on the basis of values. The difference
between these queries and the reference chasing
queries Q8 and Q9 is that references are
specified in the DTD and may be optimized with
logical OIDs for example. The two queries Q11
and Q12 cascade in the size of the result set and
provide various optimization opportunities.
Q13 List the names of items registered in
Australia along with their descriptions.
A key design decision for XML->DBMS mappings is to
determine the fragmentation criteria. The
complementary action is to reconstruct the
original document from its broken-down
representation. Query 13 tests for the ability
of the database to reconstruct portions of the
original XML document.

75
Xmark queries

Q14 Return the names of all items whose description contains the word 'gold'.
We continue to challenge the textual nature of XML documents; this time, we conduct a full-text search in the form of keyword search. Although full-text scanning could be studied in isolation, we think that the interaction with structural mark-up is essential, as the concepts are considered orthogonal; so query Q14 is restricted to a subset of the document by combining content and structure.
Q15 Print the keywords in emphasis in
annotations of closed auctions.
We now try to quantify the costs of long path
traversals that don't include wildcards. We
first descend deep into the tree (Query 15) and
then return again (Query 16). Both queries only
check for the existence of paths rather than
selecting paths with predicates.

76
Xmark queries

Q16 Return the IDs of those auctions that have one or more keywords in emphasis.
Q17 Which persons don't have a homepage?
This is to test how well the query processor knows how to deal with the semi-structured aspect of XML data, especially elements that are declared optional in the DTD.
Q18 Convert the currency of the reserve of all open auctions to another currency.
This query puts the application of user defined
functions (UDF) to the proof. In the XML world,
UDFs are of particular importance because they
allow the user to assign semantics to generic
strings that go beyond type coercion.

77
Query optimizer challenges

A mapping of XQuery to an RDBMS should be able
- to deal with ordered tables
- to skip sub-documents
- to perform dynamic type casting
- to avoid unnecessary construction of string intermediates
- to recognize join-paths for fast access
- to balance fragmentation and reconstruction costs

78
Xmark answers

Q2 Return the initial increases of all open
auctions.
This query evaluates the cost of array look-ups.
Note that this query may actually be harder to evaluate than it looks; relational back-ends in particular may have to struggle with rather complex aggregations to select the bidder element with index 1.
    FOR $b IN document("auction.xml")/site/open_auctions/open_auction
    RETURN <increase> $b/bidder[1]/increase/text() </increase>

79
XMark

Q3 Return the IDs of all open auctions whose
current increase is at least twice as high as the
initial increase.
This is a more complex application of index
lookups. In the case of a relational DBMS, the
query can take advantage of set-valued aggregates
on the index attribute to accelerate the
execution.
    FOR $b IN document("auction.xml")/site/open_auctions/open_auction
    WHERE $b/bidder[0]/increase/text() * 2 <= $b/bidder[last()]/increase/text()
    RETURN <increase first=$b/bidder[0]/increase/text()
                     last=$b/bidder[last()]/increase/text()/>

80
Xmark result

Q4 List the reserves of those open auctions
where a certain person issued a bid before
another person
This time, we stress the textual nature of XML
documents by querying the tag order in the source
document
    FOR $b IN document("auction.xml")/site/open_auctions/open_auction
    WHERE $b/bidder/personref[id="person18829"] BEFORE
          $b/bidder/personref[id="person10487"]
    RETURN <history> $b/initial/text() </history>

81
Xmark answers

Q5 How many sold items cost more than 40?
Strings are the generic data type in XML
documents. Queries that interpret strings will
often need to cast strings to another data type
that carries more semantics. This query
challenges the DBMS in terms of the casting
primitives it provides. Especially, if there is
no additional schema information or just a DTD at
hand, casts are likely to occur frequently.
    COUNT (FOR $i IN document("auction.xml")/site/closed_auctions/closed_auction
           WHERE $i/price/text() > 40
           RETURN $i/price)
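
Why the cast matters can be seen in a small Python sketch of Q5 over a made-up fragment shaped like the closed_auctions part of the benchmark document. The prices are invented; the point is that comparing the raw strings gives a different answer than comparing numbers.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; real XMark documents are generated by the benchmark tool.
AUCTIONS = """<site><closed_auctions>
  <closed_auction><price>39.50</price></closed_auction>
  <closed_auction><price>40.00</price></closed_auction>
  <closed_auction><price>112.75</price></closed_auction>
</closed_auctions></site>"""

site = ET.fromstring(AUCTIONS)
prices = [ca.findtext("price") for ca in site.findall("./closed_auctions/closed_auction")]

# Q5 with an explicit cast from string to number.
sold_above_40 = sum(1 for p in prices if float(p) > 40)

# Without a cast the comparison is lexicographic and gives a different count.
string_compare = sum(1 for p in prices if p > "40")

print(sold_above_40, string_compare)
```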

82
Xmark results

Q6 How many items are listed on all continents?
Regular path expressions are a fundamental
building block of virtually every query language
for XML or semi-structured data. These queries
investigate how well the query processor can
optimize path expressions and prune traversals of
irrelevant parts of the tree.
    FOR $b IN document("auction.xml")/site/regions
    RETURN COUNT ($b//item)

83
Xmark results

Q7 How many pieces of prose are in our database?
A good evaluation engine should realize that
there is no need to traverse the complete
document tree to evaluate such expressions. Also note that the COUNT aggregation does not require a complete traversal of the tree. Just the cardinality of the respective relation is queried. Note that the tag <email> does not exist in the database document.
    FOR $p IN document("auction.xml")/site
    RETURN count($p//description) + count($p//annotation) + count($p//email)

84
Xmark results

Q8 List the names of persons and the number of
items they bought. (joins person,
closed_auction)
References are an integral part of XML as they
allow richer relationships than just hierarchical
element structures. This query defines
horizontal traversals with increasing complexity.
A good query optimizer should take advantage of
the cardinalities of the sets to be joined.
    FOR $p IN document("auction.xml")/site/people/person
    LET $a := FOR $t IN document("auction.xml")/site/closed_auctions/closed_auction
              WHERE $t/buyer/@person = $p/@id
              RETURN $t
    RETURN <item person=$p/name/text()> COUNT ($a) </item>
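
One way an evaluator might avoid re-scanning closed_auction for every person is to count buyers once and then look each person up, in effect a hash join. The Python sketch below shows that idea on invented data; it is not a description of how any particular engine executes Q8.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment with two persons and two closed auctions.
SITE = """<site>
  <people>
    <person id="person0"><name>Alice</name></person>
    <person id="person1"><name>Bob</name></person>
  </people>
  <closed_auctions>
    <closed_auction><buyer person="person0"/></closed_auction>
    <closed_auction><buyer person="person0"/></closed_auction>
  </closed_auctions>
</site>"""

site = ET.fromstring(SITE)

# Build side: one pass over closed_auction, counting purchases per buyer id.
bought = Counter(t.find("buyer").get("person")
                 for t in site.findall("./closed_auctions/closed_auction"))

# Probe side: one lookup per person; a Counter returns 0 for persons without purchases.
result = [(p.findtext("name"), bought[p.get("id")])
          for p in site.findall("./people/person")]
print(result)
```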

85
Xmark results

Q9 List the names of persons and the names of
the items they bought in Europe. (joins person,
closed_auction, item)
References are an integral part of XML as they
allow richer relationships than just hierarchical
element structures. These queries define
horizontal traversals with increasing complexity.
A good query optimizer should take advantage of
the cardinalities of the sets to be joined.
    FOR $p IN document("auction.xml")/site/people/person
    LET $a := FOR $t IN document("auction.xml")/site/closed_auctions/closed_auction
              LET $n := FOR $t2 IN document("auction.xml")/site/regions/europe/item
                        WHERE $t/itemref/@item = $t2/@id
                        RETURN $t2
              WHERE $p/@id = $t/buyer/@person
              RETURN <item> $n/name/text() </item>
    RETURN <person name=$p/name/text()> $a </person>

86
Xmark results

Q10 List all persons according to their interest; use French markup in the result.
Constructing new elements may put the storage
engine under stress especially in the context of
creating materialized document views. The
following query reverses the structure of person
records by grouping them according to the
interest profile of a person. Large parts of the
person records are repeatedly reconstructed. To
avoid simple copying of the original database we
translate the mark-up into French.

    FOR $i IN DISTINCT document("auction.xml")/site/people/person/profile/interest/@category
    LET $p := FOR $t IN document("auction.xml")/site/people/person
              WHERE $t/profile/interest/@category = $i
              RETURN <personne>
                       <statistiques>
                         <sexe> $t/gender/text() </sexe>,
                         <age> $t/age/text() </age>,
                         <education> $t/education/text() </education>,
                         <revenu> $t/income/text() </revenu>
                       </statistiques>,
                       <coordonnees>
                         <nom> $t/name/text() </nom>,
                         <rue> $t/street/text() </rue>,
                         <ville> $t/city/text() </ville>,
                         <pays> $t/country/text() </pays>,
                         <reseau>
                           <courrier> $t/email/text() </courrier>,
                           <pagePerso> $t/homepage/text() </pagePerso>
                         </reseau>,
                       </coordonnees>
                       <cartePaiement> $t/creditcard/text() </cartePaiement>
                     </personne>
    RETURN <categorie>
             <id> $i </id>,
             $p
           </categorie>

89
Xmark results

Q11 For each person, list the number of items currently on sale whose price does not exceed 0.02% of the person's income
This query tests the database's ability to handle
large (intermediate) results. This time, joins
are on the basis of values. The difference
between these queries and the reference chasing
queries Q8 and Q9 is that references are
specified in the DTD and may be optimized with
logical OIDs for example. The two queries Q11
and Q12 cascade in the size of the result set and
provide various optimization opportunities.
    FOR $p IN document("auction.xml")/site/people/person
    LET $c := FOR $i IN document("auction.xml")/site/open_auctions/open_auction/initial
              WHERE $p/profile/@income > (5000 * $i/text())
              RETURN $i
    RETURN <items name=$p/profile/@income> COUNT ($c) </items>

90
Xmark results

Q12 For each richer-than-average person, list the number of items currently on sale whose price does not exceed 0.02% of the person's income
This query tests the database's ability to handle
large (intermediate) results. This time, joins
are on the basis of values. The difference
between these queries and the reference chasing
queries Q8 and Q9 is that references are
specified in the DTD and may be optimized with
logical OIDs for example. The two queries Q11
and Q12 cascade in the size of the result set and
provide various optimization opportunities.
    FOR $p IN document("auction.xml")/site/people/person
    LET $l := FOR $i IN document("auction.xml")/site/open_auctions/open_auction/initial
              WHERE $p/profile/@income > (5000 * $i/text())
              RETURN $i
    WHERE $p/profile/@income > 50000
    RETURN <items income=$p/profile/@income> COUNT ($l) </items>

91
Xmark results

Q13 List the names of items registered in
Australia along with their descriptions.
A key design decision for XML->DBMS mappings is to
determine the fragmentation criteria. The
complementary action is to reconstruct the
original document from its broken-down
representation. Query 13 tests for the ability
of the database to reconstruct portions of the original XML document.
    FOR $i IN document("auction.xml")/site/regions/australia/item
    RETURN <item name=$i/name/text()> $i/description </item>

92
Xmark results

Q14 Return the names of all items whose description contains the word 'gold'.
We continue to challenge the textual nature of XML documents; this time, we conduct a full-text search in the form of keyword search. Although full-text scanning could be studied in isolation, we think that the interaction with structural mark-up is essential, as the concepts are considered orthogonal; so query Q14 is restricted to a subset of the document by combining content and structure.
    FOR $i IN document("auction.xml")/site//item
    WHERE CONTAINS ($i/description, "gold")
    RETURN $i/name/text()

93
Xmark results

Q15 Print the keywords in emphasis in
annotations of closed auctions.
We now try to quantify the costs of long path
traversals that don't include wildcards. We
first descend deep into the tree (Query 15) and
then return again (Query 16). Both queries only
check for the existence of paths rather than
selecting paths with predicates.
    FOR $a IN document("auction.xml")/site/closed_auctions/closed_auction/annotation/description/parlist/listitem/parlist/listitem/text/emph/keyword/text()
    RETURN <text> $a </text>

94
Xmark results

Q16 Return the IDs of those auctions that have one or more keywords in emphasis.
    FOR $a IN document("auction.xml")/site/closed_auctions/closed_auction
    WHERE NOT EMPTY ($a/annotation/description/parlist/listitem/parlist/listitem/text/emph/keyword/text())
    RETURN <person id=$a/seller/@person />

95
Xmark results

Q17 Which persons don't have a homepage?
This is to test how well the query processor knows how to deal with the semi-structured aspect of XML data, especially elements that are declared optional in the DTD.
    FOR $p IN document("auction.xml")/site/people/person
    WHERE EMPTY($p/homepage/text())
    RETURN <person name=$p/name/text()/>

96
Xmark results

Q18 Convert the currency of the reserve of all open auctions to another currency.
This query puts the application of user defined
functions (UDF) to the proof. In the XML world,
UDFs are of particular importance because they
allow the user to assign semantics to generic
strings that go beyond type coercion.
    FUNCTION CONVERT ($v)
    {
      RETURN 2.20371 * $v -- convert Dfl to Euros
    }
    FOR $i IN document("auction.xml")/site/open_auctions/open_auction/
    RETURN CONVERT($i/reserve/text())

98
Xmark results
Effect of loading 100Mb document into DBMS
99
Xmark results
100
Xmark results
101
Pathfinder/MonetDB 2004 implementation in seconds
