Title: Querying XML Documents
1Querying XML Documents
2Objectives
- How XML generalizes relational databases
- The XQuery language
- How XML may be supported in databases
3Only Some Trees are Relations
- They have height two
- The root has an unbounded number of children
- All nodes in the second layer (records) have a
fixed number of child nodes (fields) - Trees are ordered, while both rows and columns of
tables may be permuted without changing the
meaning of the data
4An XML tree
5Relational Tables
6Query Usage Scenarios
- Data-oriented
- to query XML data, like using SQL to query
relational databases - Document-oriented
- to retrieve parts of documents
- to provide dynamic indexes
- to perform context-sensitive searching
- to generate new documents as combinations of
existing documents - Programming
- to automatically generate documentation, similar
to the Javedoc tool - Hybrid
- to retrieve information from hybrid data, such as
patient records
7XQuery Design Requirements
- XML syntax and human-readable
- Declarative
- Namespace aware
- Coordinate with XML Schema
- Support simple and complex data types
- Can combine multiple documents
- Can transform and create XML trees
8What is XQuery?
- XQuery is the language for querying XML data
- XQuery 1.0 is a strict superset of XPath 2.0
- XQuery 1.0 and XPath 2.0 share the same data
model and support the same functions and
operators - XQuery has the extra expressive power to
- join information from different sources and
- generate new XML fragments
- XQuery and XSLT are both domain-specific
languages for combining and transforming XML data
from multiple sources - XQuery is designed from scratch, XSLT is an
intellectual descendant of CSS - Technically, they may emulate each other
- XQuery is defined by the W3C
- XQuery is supported by all the major database
engines (IBM, Oracle, Microsoft, etc.)
9XQuery Prolog
- XQuery expressions are evaluated relatively to a
context, which is explicitly provided by a prolog - Declarations specify various parameters for the
XQuery processor, such as - xquery version "1.0"
- declare xmlspace preserve
- declare xmlspace strip
- declare default element namespace URI
- declare default function namespace URI
- import schema at URI
- declare namespace NCName URI
10Implicit Declarations
- Declarations implicitly defined in any XQuery
implementation - declare namespace xml
- "http//www.w3.org/XML/1998/namespace"
- declare namespace xs
- "http//www.w3.org/2001/XMLSchema"
- declare namespace xsi
- "http//www.w3.org/2001/XMLSchema-instance"
- declare namespace fn
- "http//www.w3.org/2005/11/xpath-functions"
- declare namespace xdt
- "http//www.w3.org/2005/11/xpath-datatypes"
- declare namespace local
- "http//www.w3.org/2005/11/xquery-local-functions"
11XPath vs. XQuery
- XPath expressions are also XQuery expressions
- XPath expressions are required to be evaluated in
a static context, which must be provided by the
invoking application - The XQuery prolog gives the required static
context - The initial context node, position, and size are
undefined because an XQuery expression may work
on multiple XML documents - Only axes child, descendant, parent, attribute,
self, and descendant-of-self are required to be
implemented in XQuery.
12Datatype Expressions
- Same atomic values as XPath 2.0
- Also lots of primitive simple values
- xsstring("XML is fun")
- xsboolean("true")
- xsdecimal("3.1415")
- xsfloat("6.02214199E23")
- xsdateTime("1999-05-31T132000-0500")
- xstime("132000-0500")
- xsdate("1999-05-31")
- xsgYearMonth("1999-05")
- xsgYear("1999")
- xshexBinary("48656c6c6f0a")
- xsbase64Binary("SGVsbG8K")
- xsanyURI("http//www.brics.dk/ixwt/")
- xsQName("rcprecipe")
13XML Expressions
- XQuery expressions may compute new XML nodes
- Expressions may denote element, character data,
comment, and processing instruction nodes - Each node is created with a unique node identity
- Constructors may be either direct or computed
14Direct Constructors
- Uses the standard XML syntax
- The expression
- ltfoogtltbar/gtbazlt/foogt
- evaluates to the given XML fragment
- Note that nodes are created with unique identity
and therefore ltfoo/gt is ltfoo/gt evaluates to
false (operators is, ltlt, gtgt are used to compare
nodes on identity and document order)
15Same Namespace Declarations (1/3)
- declare default element namespace
"http//businesscard.org" - ltcardgt
- ltnamegtJohn Doelt/namegt
- lttitlegtCEO, Widget Inc.lt/titlegt
- ltemailgtjohn.doe_at_widget.comlt/emailgt
- ltphonegt(202) 555-1414lt/phonegt
- ltlogo uri"widget.gif"/gt
- lt/cardgt
16Same Namespace Declarations (2/3)
- declare namespace b "http//businesscard.org"
- ltbcardgt
- ltbnamegtJohn Doelt/bnamegt
- ltbtitlegtCEO, Widget Inc.lt/btitlegt
- ltbemailgtjohn.doe_at_widget.comlt/bemailgt
- ltbphonegt(202) 555-1414lt/bphonegt
- ltblogo uri"widget.gif"/gt
- lt/bcardgt
17Same Namespace Declarations (3/3)
- ltcard xmlns"http//businesscard.org"gt
- ltnamegtJohn Doelt/namegt
- lttitlegtCEO, Widget Inc.lt/titlegt
- ltemailgtjohn.doe_at_widget.comlt/emailgt
- ltphonegt(202) 555-1414lt/phonegt
- ltlogo uri"widget.gif"/gt
- lt/cardgt
18Enclosed Expressions
- Expressions create an element named numbers which
has a single character data node with value 1 2 3
4 5 (if boundary-space is set to strip) - lt numbers gt1 2 3 4 5lt/ numbers gt
- lt numbers gt1, 2, 3, 4, 5lt/ numbers gt
- lt numbers gt1, "2", 3, 4, 5lt/ numbers gt
- lt numbers gt1 to 5lt/ numbers gt
- lt numbers gt1 11 " " "3" " " 4 to 5lt/
numbers gt - Enclosed expressions are allowed inside attribute
values - ltfoo bar"1 2 3 4 5"/gt
- ltfoo bar"1, 2, 3, 4, 5"/gt
- ltfoo bar"1 2 to 4 5"/gt
19Explicit Constructors
- The constant expression
- ltcard xmlns"http//businesscard.org"gt
- ltnamegtJohn Doelt/namegt
- lttitlegtCEO, Widget Inc.lt/titlegt
- ltemailgtjohn.doe_at_widget.comlt/emailgt
- ltphonegt(202) 555-1414lt/phonegt
- ltlogo uri"widget.gif"/gt
- lt/cardgt
- Can be written as
- element card
- namespace "http//businesscard.org" ,
- element name text "John Doe" ,
- element title text "CEO, Widget Inc." ,
- element email text "john.doe_at_widget.com"
, - element phone text "(202) 555-1414" ,
- element logo attribute uri "widget.gif"
-
20Computed QNames
- Qualified names can be replaced by enclosed
expressions evaluating to equivalent strings - element "card"
- namespace "http//businesscard.org" ,
- element "name" text "John Doe" ,
- element "title" text "CEO, Widget Inc."
, - element "email" text "john.doe_at_widget.co
m" , - element "phone" text "(202) 555-1414"
, - element "logo"
- attribute "uri" "widget.gif"
-
-
21Biliingual Business Cards
- Controlled by a global variable lang
- element if (lang"Danish") then "kort" else
"card" - namespace "http//businesscard.org" ,
- element if (lang"Danish") then "navn" else
"name" - text "John Doe" ,
- element if (lang"Danish") then "titel" else
"title" - text "CEO, Widget Inc." ,
- element "email"
- text "john.doe_at_widget.inc" ,
- element if (lang"Danish") then "telefon"
else "phone" - text "(202) 456-1414" ,
- element "logo"
- attribute "uri" "widget.gif"
-
-
22FLWOR Expressions
- Used for general queries
- ltdoublesgt
- for s in fndoc("students.xml")//student
- let m s/major
- where fncount(m) ge 2
- order by s/_at_id
- return ltdoublegt
- s/name/text()
- lt/doublegt
-
- lt/doublesgt
23The Difference Between For and Let (1/4)
- A FLWOR expression
- for x in (1, 2, 3, 4)
- let y ("a", "b", "c")
- return (x, y)
- Output
- 1, a, b, c, 2, a, b, c, 3, a, b, c, 4, a, b, c
24The Difference Between For and Let (2/4)
- A FLWOR expression
- let x in (1, 2, 3, 4)
- for y ("a", "b", "c")
- return (x, y)
- Output
- 1, 2, 3, 4, a, 1, 2, 3, 4, b, 1, 2, 3, 4, c
25The Difference Between For and Let (3/4)
- A FLWOR expression
- for x in (1, 2, 3, 4)
- for y in ("a", "b", "c")
- return (x, y)
- Output
- 1, a, 1, b, 1, c, 2, a, 2, b, 2, c, 3, a, 3, b,
3, c, 4, a, 4, b, 4, c
26The Difference Between For and Let (4/4)
- A FLWOR expression
- let x (1, 2, 3, 4)
- let y ("a", "b", "c")
- return (x, y)
- Output
- 1, 2, 3, 4, a, b, c
27An XML Example Document books.xml
- lt?xml version"1.0" encoding"ISO-8859-1"?gt
- ltbookstoregt
- ltbook category"COOKING"gt
- lttitle lang"en"gtEveryday Italianlt/titlegt
- ltauthorgtGiada De Laurentiislt/authorgt
- ltyeargt2005lt/yeargt
- ltpricegt30.00lt/pricegt
- lt/bookgt
- ltbook category"CHILDREN"gt
- lttitle lang"en"gtHarry Potterlt/titlegt
- ltauthorgtJ K. Rowlinglt/authorgt
- ltyeargt2005lt/yeargt
- ltpricegt29.99lt/pricegt
- lt/bookgt
28An XML Example Document books.xml (cont.)
- ltbook category"WEB"gt
- lttitle lang"en"gtXQuery Kick Startlt/titlegt
- ltauthorgtJames McGovernlt/authorgt
- ltauthorgtPer Bothnerlt/authorgt
- ltauthorgtKurt Caglelt/authorgt
- ltauthorgtJames Linnlt/authorgt
- ltauthorgtVaidyanathan Nagarajanlt/authorgt
- ltyeargt2003lt/yeargt
- ltpricegt49.99lt/pricegt
- lt/bookgt
- ltbook category"WEB"gt
- lttitle lang"en"gtLearning XMLlt/titlegt
- ltauthorgtErik T. Raylt/authorgt
- ltyeargt2003lt/yeargt
- ltpricegt39.95lt/pricegt
- lt/bookgt
- lt/bookstoregt
29Path Expressions
- The following path expression is used to select
all the title elements in the "books.xml" file - doc("books.xml")/bookstore/book/title
- The XQuery above will extract the following
- lttitle lang"en"gtEveryday Italianlt/titlegt
- lttitle lang"en"gtHarry Potterlt/titlegt
- lttitle lang"en"gtXQuery Kick Startlt/titlegt
- lttitle lang"en"gtLearning XMLlt/titlegt
30Predicates
- The following predicate is used to select all the
book elements under the bookstore element that
have a price element with a value that is less
than 30 - doc("books.xml")/bookstore/bookpricelt30
- The XQuery above will extract the following
- ltbook category"CHILDREN"gt
- lttitle lang"en"gtHarry Potterlt/titlegt
- ltauthorgtJ K. Rowlinglt/authorgt
- ltyeargt2005lt/yeargt
- ltpricegt29.99lt/pricegt
- lt/bookgt
31FLOWR Expressions
- The following FLWOR expression will select all
the title elements under the book elements that
are under the bookstore element that have a price
element with a value that is higher than 30 - for x in doc("books.xml")/bookstore/book
- where x/pricegt30
- return x/title
- The result will be
- lttitle lang"en"gtXQuery Kick Startlt/titlegt
- lttitle lang"en"gtLearning XMLlt/titlegt
32Conditional Expressions
- Notes on the "if-then-else" syntax parentheses
around the if expression are required. else is
required, but it can be just else (). - for x in doc("books.xml")/bookstore/book
- return if (x/_at_category"CHILDREN")
- then ltchildgtdata(x/title)lt/childgt
- else ltadultgtdata(x/title)lt/adultgt
- The result of the example above will be
- ltadultgtEveryday Italianlt/adultgt
- ltchildgtHarry Potterlt/childgt
- ltadultgtLearning XMLlt/adultgt
- ltadultgtXQuery Kick Startlt/adultgt
33Comparisons
- General comparisons , !, lt, lt, gt, gt
- Value comparisons eq, ne, lt, le, gt, ge
- bookstore//book/_at_q gt 10
- The expression above returns true if any q
attributes have values greater than 10. - bookstore//book/_at_q gt 10
- The expression above returns true if there is
only one q attribute returned by the expression,
and its value is greater than 10. If more than
one q is returned, an error occurs.
34Selecting and Filtering Elements
- The at keyword can be used to count the
iteration - for x at i in doc("books.xml")/bookstore/book/t
itle - return ltbookgti. data(x)lt/bookgt
- Result
- ltbookgt1. Everyday Italianlt/bookgt
- ltbookgt2. Harry Potterlt/bookgt
- ltbookgt3. XQuery Kick Startlt/bookgt
- ltbookgt4. Learning XMLlt/bookgt
- It is also allowed with more than one in
expression in the for clause. Use comma to
separate each in expression - for x in (10,20), y in (100,200)
- return lttestgtxx and yylt/testgt
- Result
- lttestgtx10 and y100lt/testgt
- lttestgtx10 and y200lt/testgt
- lttestgtx20 and y100lt/testgt
- lttestgtx20 and y200lt/testgt
35Computing Joins
- fridge.xml
- ltfridgegt
- ltstuffgteggslt/stuffgt
- ltstuffgtolive oillt/stuffgt
- ltstuffgtketchuplt/stuffgt
- ltstuffgtunrecognizable moldy thinglt/stuffgt
- lt/fridgegt
- Find recipes that use some ingredients in the
refrigerator - declare namespace rcp "http//www.brics.dk/ixwt
/recipes" - for r in fndoc("recipes.xml")//rcprecipe
- for i in r//rcpingredient/_at_name
- for s in fndoc("fridge.xml")//stufftext()i
- return fndistinct-values(r/rcptitle/text() )
36Inverting a Relation
- declare namespace rcp "http//www.brics.dk/ixwt/
recipes" - ltingredientsgt
- for i in distinct-values(
- fndoc("recipes.xml")//rcpingredient/_at_name
- )
- return ltingredient name"i"gt
- for r in fndoc("recipes.xml")//rcprecipe
- where r//rcpingredient_at_namei
- return lttitlegtr/rcptitle/text()lt/titlegt
-
- lt/ingredientgt
-
- lt/ingredientsgt
37Sorting the Results
- declare namespace rcp "http//www.brics.dk/ixwt/
recipes" - ltingredientsgt
- for i in distinct-values(fndoc("recipes.xml")
//rcpingredient/_at_name) - order by i
- return ltingredient name"i"gt
- for r in fndoc("recipes.xml")//rcprecipe
- where r//rcpingredient_at_namei
- order by r/rcptitle/text()
- return lttitlegtr/rcptitle/text()lt/titlegt
-
- lt/ingredientgt
-
- lt/ingredientsgt
38A More Complicated Sorting
- for s in document("students.xml")//student
- order by
- fncount(
- s/results/resultfncontains(_at_grade,"A")
- ) descending,
- fncount(s/major) descending,
- xsinteger(s/age/text()) ascending
- return s/name/text()
39Using Functions
- declare function localgrade(g)
- if (g"A") then 4.0 else if (g"A-") then 3.7
else if (g"B") then 3.3 - else if (g"B") then 3.0 else if (g"B-") then
2.7 else if (g"C") then 2.3 - else if (g"C") then 2.0 else if (g"C-") then
1.7 else if (g"D") then 1.3 - else if (g"D") then 1.0 else if (g"D-") then
0.7 else 0 -
- declare function localgpa(s)
- fnavg(for g in s/results/result/_at_grade return
localgrade(g)) -
- ltgpasgt
- for s in fndoc("students.xml")//student
- return ltgpa id"s/_at_id" gpa"localgpa(s)
"/gt - lt/gpasgt
40Extend the Expressive Power
- Using recursive function to generate an XML tree
of a given height - declare function gen(n)
- if (n eq 0) ltbar/gt
- else ltfoogt gen(n - 1), gen(n - 1) lt/foogt
-
- Compute the height of an XML tree
- declare function localheight(x)
- if (fnempty(x/)) then 1
- else fnmax(for y in x/ return
localheight(y))1 -
41A Textual Outline
- Cailles en Sarcophages
- pastry
- chilled unsalted butter
- flour
- salt
- ice water
- filling
- baked chicken
- marinated chicken
- small chickens, cut up
- Herbes de Provence
- dry white wine
- orange juice
- minced garlic
- truffle oil
- ...
42Computing Textual Outlines
- declare namespace rcp "http//www.brics.dk/ixwt/
recipes" - declare function localingredients(i,p)
- fnstring-join(
- for j in i/rcpingredient
- return fnstring-join((p,j/_at_name,"x0A",
- localingredients(j,fnconcat(p,"
"))),""),"") -
- declare function localrecipes(r)
- fnconcat(r/rcptitle/text(),"x0A",localingre
dients(r," ")) -
- fnstring-join(
- for r in fndoc("recipes.xml")//rcprecipe5
- return localrecipes(r),""
- )
43XML Databases
- How can XML and databases be merged?
- Several different approaches
- extract XML views of relations
- use SQL to generate XML
- shred XML into relational databases
44Automatic XML Views (1/2)
- ltStudentsgt
- ltrecord id"100026" name"Joe Average"
age"21"/gt - ltrecord id"100078" name"Jack Doe" age"18"/gt
- lt/Studentsgt
- xmlelement(name, "Students",
- select xmlelement(name,
- "record",
- xmlattributes(s.id, s.name, s.age))
- from Students
- )
45Automatic XML Views (2/2)
- ltStudentsgt
- ltrecordgt
- ltidgt100026lt/idgt
- ltnamegtJoe Averagelt/namegt
- ltagegt21lt/agegt
- lt/recordgt
- ltrecordgt
- ltidgt100078lt/idgt
- ltnamegtJack Doelt/namegt
- ltagegt18lt/agegt
- lt/recordgt
- lt/Studentsgt
- xmlelement(name, "Students",
- select xmlelement(name,
- "record",
- xmlforest(s.id, s.name, s.age))
- from Students
- )
46Storing XML Documents
- Designing a specialized system for storing native
XML data - Using a DBMS to store the whole XML documents as
text fields - Using a DBMS to store the document contents as
data elements - It must support the XMLs ordered data model
47Using a DBMS Relational DTD
- Schema-aware
- An element that can occur at most once in its
parent is stored as a column of the table
representing its parent
48Using a DBMS Edge
- Schema-less
- A single table is used to store the entire
document - Each node is assigned an ID in depth first order
1
root
2
references
3
4
book
book
author
author
author
title
year
author
2000
Data on the Web
S. Abiteboul
D. Suciu
P. Buneman
49Using a DBMS Edge (cont.)