Title: Introduction to XML Algebra
1Introduction to XML Algebra
- Based on talk prepared for CS561 by Wan Liu and
Bintou Kane
2Data Model
- Data model: the core data structures and data types supported by a DBMS
- A relational database uses a table (set-oriented) data model
- XML uses a tree-structured, hierarchical model
3Why XML Algebra?
- It is common to translate a query language into an algebra.
- First, the algebra is used to give a semantics for the query language.
- Second, the algebra is used to support query optimization.
4XML Algebra History
- Lore Algebra (August 1999) -- Stanford University
- IBM Algebra (September 1999) -- Oracle, IBM, Microsoft Corp
- YAT Algebra (May 2000)
- ATT Algebra (June 2000) -- ATT Bell Labs
- Niagara Algebra (2001) -- University of Wisconsin-Madison
5NIAGARA
- Title: Following the Paths of XML Data: An Algebraic Framework for XML Query Evaluation
- By Leonidas Galanis, Efstratios Viglas, David J. DeWitt, Jeffrey F. Naughton, and David Maier
- Univ. of Wisconsin
6Outline
- Concepts of Niagara Algebra
- Operations
- Optimization
7Goals of Niagara Algebra
- Be independent of schema information
- Query on both structure and content
- Generate simple, flexible, yet powerful algebraic expressions
- Allow re-use of traditional optimization techniques
8Example XML Source Documents
Invoice.xml
<Invoice_Document>
  <invoice No="1">
    <account_number>2</account_number>
    <carrier>ATT</carrier>
    <total>0.25</total>
  </invoice>
  <invoice>
    <account_number>1</account_number>
    <carrier>Sprint</carrier>
    <total>1.20</total>
  </invoice>
  <invoice>
    <account_number>1</account_number>
    <carrier>ATT</carrier>
    <total>0.75</total>
  </invoice>
</Invoice_Document>
Customer.xml
<Customer_Document>
  <customer>
    <account>1</account>
    <name>Tom</name>
  </customer>
  <customer>
    <account>2</account>
    <name>George</name>
  </customer>
</Customer_Document>
9XML Data Model and Tree Graph
<Invoice_Document>
  <invoice>
    <number>2</number>
    <carrier>ATT</carrier>
    <total>0.25</total>
  </invoice>
  <invoice>
    <number>1</number>
    <carrier>Sprint</carrier>
    <total>1.20</total>
  </invoice>
</Invoice_Document>
[Figure: tree graph of this document. The root Invoice_Document has two Invoice children, each with number, carrier, and total children; the leaf values are 2, ATT, 0.25 and 1, Sprint, 1.20.]
Ordered tree graph, semi-structured data
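The ordered-tree view of an XML document can be illustrated with a minimal Python sketch. It uses the standard-library xml.etree.ElementTree parser; the helper name tree_lines and the inline document string are illustrative assumptions, not part of the original slides.

```python
import xml.etree.ElementTree as ET

# Illustrative copy of the small invoice document from this slide.
INVOICE_XML = """
<Invoice_Document>
  <invoice><number>2</number><carrier>ATT</carrier><total>0.25</total></invoice>
  <invoice><number>1</number><carrier>Sprint</carrier><total>1.20</total></invoice>
</Invoice_Document>
"""

def tree_lines(node, depth=0):
    """Return one indented line per vertex, in document order."""
    text = (node.text or "").strip()
    label = f"{node.tag}: {text}" if text else node.tag
    lines = ["  " * depth + label]
    for child in node:  # children are iterated in document order
        lines.extend(tree_lines(child, depth + 1))
    return lines

root = ET.fromstring(INVOICE_XML)
print("\n".join(tree_lines(root)))
```

Each element becomes a vertex and each nesting level becomes one level of the tree, which is the structure the algebra's operators navigate.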
10XML Data Model [GVDNM01]
- Collection of bags of vertices.
- Vertices in a bag have no order.
- Example bag: Root invoice.xml, invoice, invoice.account_number
  - invoice: <invoice> Invoice-element-content </invoice>
  - invoice.account_number: <account_number> element-content </account_number>
11Data Model
- Bag elements are reachable by path expressions.
- A path expression consists of two parts:
  - An entry point
  - A relative forward part
- Example: account_numberinvoice, where invoice is the entry point and account_number is the relative forward part
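A minimal sketch of how a bag and a path expression might be modelled in Python. Representing a bag as a dict from entry-point name to vertex, and the helper name reachable, are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<Invoice_Document>"
    "<invoice><account_number>2</account_number><carrier>ATT</carrier></invoice>"
    "</Invoice_Document>"
)

# Assumed representation: a bag maps entry-point names to vertices.
# A dict is looked up by name, so no vertex order is relied upon.
bag = {"Root invoice.xml": doc, "invoice": doc.find("invoice")}

def reachable(bag, forward_part, entry_point):
    """Vertices reached by following forward_part from the entry point's vertex."""
    return bag[entry_point].findall(forward_part)

hits = reachable(bag, "account_number", "invoice")
print([v.text for v in hits])  # ['2']
```

The entry point selects where navigation starts inside the bag, and the relative forward part is evaluated from that vertex downward.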
12Operators
- Source S, Follow Φ, Select σ, Join ⋈, Rename ρ, Expose ε, Vertex ν, Group γ, Union ∪, Intersection ∩, Difference −, Cartesian Product ×.
13 Source Operator S
- Input: a list of documents
- Output: a collection of singleton bags
- Examples:
  - S(*): all known XML documents
  - S(invoice.xml): all XML documents whose filename matches invoice.xml
  - S(*, schema.dtd): all known XML documents that conform to schema.dtd
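The source operator described above could be sketched as follows. The in-memory KNOWN_DOCS registry, the function name source, and the use of fnmatch for filename matching are assumptions for illustration; DTD conformance checking is left out.

```python
import fnmatch
import xml.etree.ElementTree as ET

# Hypothetical stand-in for "all known XML documents".
KNOWN_DOCS = {
    "invoice.xml": "<Invoice_Document><invoice/></Invoice_Document>",
    "customer.xml": "<Customer_Document><customer/></Customer_Document>",
}

def source(pattern="*"):
    """Return a collection of singleton bags, one per matching document."""
    bags = []
    for name in sorted(KNOWN_DOCS):
        if fnmatch.fnmatchcase(name, pattern):
            root = ET.fromstring(KNOWN_DOCS[name])
            bags.append({f"Root {name}": root})  # singleton bag: only the root
    return bags

print([list(bag) for bag in source()])
print([list(bag) for bag in source("invoice.xml")])
```

Every output bag holds exactly one vertex, the document root, which later follow operations use as their entry point.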
14Follow operator Φ
- Input: a path expression in entry point notation
- Functionality: extracts the vertices reachable by the path expression
- Output: a new bag that consists of the extracted vertex plus all contents of the original bag (in the case of unnesting follow)
15Follow operator (Example)
Input bag: Root invoice.xml, invoice
  - invoice: <invoice> Invoice-element-content </invoice>
Unnesting follow: Φ(carrierinvoice)
Output bag: Root invoice.xml, invoice, invoice.carrier
  - invoice: <invoice> Invoice-element-content </invoice>
  - invoice.carrier: <carrier> carrier-element-content </carrier>
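The unnesting follow shown in this example can be sketched in Python. The dict-based bag, the function name follow, and the rule for naming the new entry are assumptions; the sketch also assumes that a bag with no reachable vertex produces no output bag.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<Invoice_Document>"
    "<invoice><account_number>2</account_number><carrier>ATT</carrier></invoice>"
    "<invoice><account_number>1</account_number><carrier>Sprint</carrier></invoice>"
    "<invoice><account_number>1</account_number></invoice>"
    "</Invoice_Document>"
)

def follow(collection, forward_part, entry_point):
    """Unnesting follow: one output bag per vertex reached from the entry point."""
    # Naming assumption: vertices reached from a document root are named by the
    # forward part alone; otherwise the name is "<entry_point>.<forward_part>".
    if entry_point.startswith("Root "):
        new_name = forward_part
    else:
        new_name = f"{entry_point}.{forward_part}"
    result = []
    for bag in collection:
        for vertex in bag[entry_point].findall(forward_part):
            extended = dict(bag)         # all contents of the original bag
            extended[new_name] = vertex  # plus the extracted vertex
            result.append(extended)
    return result

docs = [{"Root invoice.xml": root}]
invoices = follow(docs, "invoice", "Root invoice.xml")
carriers = follow(invoices, "carrier", "invoice")
print(len(invoices), len(carriers))  # 3 2
print(sorted(carriers[0]))
```

The third invoice has no carrier element, so it contributes no bag to the second result.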
16Select operator σ
- Input: a set of bags
- Functionality: filters the bags of a collection using a predicate
- Output: the set of bags that conform to the predicate
- Predicate: logical operators (∧, ∨, ¬) or simple qualifications (>, <, ≥, ≤, =, ≠)
17Select operator (Example)
Input: two bags, each Root invoice.xml, invoice
  - invoice: <invoice> Invoice-element-content </invoice>
  - invoice: <invoice> Invoice-element-content </invoice>
Select: σ invoice.carrier = Sprint
Output: one bag, Root invoice.xml, invoice
  - invoice: <invoice> Invoice-element-content </invoice>
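A minimal Python sketch of the select example, assuming bags are dicts from entry-point name to vertex and predicates are ordinary Python functions. The names select, carrier_is_sprint, and expensive_sprint are illustrative.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<Invoice_Document>"
    "<invoice><carrier>ATT</carrier><total>0.25</total></invoice>"
    "<invoice><carrier>Sprint</carrier><total>1.20</total></invoice>"
    "</Invoice_Document>"
)
collection = [
    {"Root invoice.xml": root, "invoice": inv} for inv in root.findall("invoice")
]

def select(collection, predicate):
    """Keep only the bags for which the predicate holds."""
    return [bag for bag in collection if predicate(bag)]

def carrier_is_sprint(bag):
    # Simple qualification: invoice.carrier = Sprint
    return bag["invoice"].findtext("carrier") == "Sprint"

def expensive_sprint(bag):
    # Logical combination of two simple qualifications.
    return carrier_is_sprint(bag) and float(bag["invoice"].findtext("total")) > 1.0

sprint = select(collection, carrier_is_sprint)
print(len(collection), "->", len(sprint))  # 2 -> 1
```

Select never changes the contents of a bag; it only decides which bags survive.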
18Join operator
- Input: two collections of bags
- Functionality: joins the two collections based on a predicate
- Output: the concatenation of the pairs of bags that satisfy the predicate
19Join operator (Example)
Input bags: Root invoice.xml, invoice and Root customer.xml, customer
  - invoice: <invoice> Invoice-element-content </invoice>
  - customer: <customer> customer-element-content </customer>
Join: account_numberinvoice = numbercustomer
Output bag: Root invoice.xml, invoice, Root customer.xml, customer
  - invoice: <invoice> Invoice-element-content </invoice>
  - customer: <customer> customer-element-content </customer>
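The join example can be sketched as a nested-loop join over two collections of bags. The dict-based bags and the names join and same_account are assumptions; the sketch also assumes the two sides use distinct entry-point names, so concatenating a pair is a plain dict merge.

```python
import xml.etree.ElementTree as ET

invoice_doc = ET.fromstring(
    "<Invoice_Document>"
    "<invoice><account_number>2</account_number><total>0.25</total></invoice>"
    "<invoice><account_number>1</account_number><total>1.20</total></invoice>"
    "</Invoice_Document>"
)
customer_doc = ET.fromstring(
    "<Customer_Document>"
    "<customer><account>1</account><name>Tom</name></customer>"
    "<customer><account>2</account><name>George</name></customer>"
    "</Customer_Document>"
)
invoices = [{"Root invoice.xml": invoice_doc, "invoice": v}
            for v in invoice_doc.findall("invoice")]
customers = [{"Root customer.xml": customer_doc, "customer": v}
             for v in customer_doc.findall("customer")]

def join(left, right, predicate):
    """Nested-loop join: concatenate every pair of bags satisfying the predicate."""
    return [{**l, **r} for l in left for r in right if predicate(l, r)]

def same_account(l, r):
    return l["invoice"].findtext("account_number") == r["customer"].findtext("account")

joined = join(invoices, customers, same_account)
pairs = [(b["invoice"].findtext("total"), b["customer"].findtext("name"))
         for b in joined]
print(pairs)  # [('0.25', 'George'), ('1.20', 'Tom')]
```

Each output bag carries all four entries, matching the concatenated bag shown in the example.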
20Expose operator ε
- Input: a list of path expressions of vertices to be exposed
- Output: a set of bags that contain the vertices in the parameter list, in the same order
21Expose operator (Example)
Input bag: Root invoice.xml, invoice, invoice.carrier, invoice.bill_period
  - invoice: <invoice> Invoice-element-content </invoice>
  - invoice.carrier: <carrier> carrier-element-content </carrier>
  - invoice.bill_period: <bill_period> bill_period-element-content </bill_period>
Expose: ε(bill_period, carrier)
Output bag: Root invoice.xml, invoice.bill_period, invoice.carrier
  - invoice.bill_period: <bill_period> bill_period-element-content </bill_period>
  - invoice.carrier: <carrier> carrier-element-content </carrier>
22Vertex operator ν
- Creates the actual XML vertex that will encompass everything created by an expose operator
- Example: ν(Customer_invoice) ε(ν(account) invoice.account_number, ν(inv_total) invoice.total)
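One way to picture the expose and vertex operators working together is the Python sketch below. The function names expose and vertex, their signatures, and the dict-based bag are assumptions for illustration, not the algebra's actual definitions.

```python
import xml.etree.ElementTree as ET

invoice = ET.fromstring(
    "<invoice><account_number>1</account_number>"
    "<carrier>Sprint</carrier><total>1.20</total></invoice>"
)
bag = {
    "invoice": invoice,
    "invoice.account_number": invoice.find("account_number"),
    "invoice.total": invoice.find("total"),
}

def expose(bag, names):
    """Return the named vertices in the order given by the parameter list."""
    return [bag[name] for name in names]

def vertex(tag, children=(), text=None):
    """Create a new XML vertex that encompasses the given content."""
    element = ET.Element(tag)
    element.text = text
    for child in children:
        element.append(child)
    return element

account, total = expose(bag, ["invoice.account_number", "invoice.total"])
result = vertex("Customer_invoice", [
    vertex("account", text=account.text),
    vertex("inv_total", text=total.text),
])
print(ET.tostring(result, encoding="unicode"))
```

The inner vertex calls rename the exposed values to account and inv_total, and the outer call wraps them in a single Customer_invoice element.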
23Other operators
- Group γ is used for arbitrary grouping of elements based on their values
- Aggregate functions (e.g., average) can be used with the group operator
- Rename ρ changes the entry point annotation of elements of a bag
- Example: ρ(invoice.bill_period, date)
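Grouping with an aggregate, and renaming an entry point, might look like the following Python sketch. The function names group, rename, and average_total are illustrative assumptions, and returning a plain dict of aggregate values (rather than a collection of bags) is a simplification.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<Invoice_Document>"
    "<invoice><account_number>2</account_number><total>0.25</total></invoice>"
    "<invoice><account_number>1</account_number><total>1.20</total></invoice>"
    "<invoice><account_number>1</account_number><total>0.75</total></invoice>"
    "</Invoice_Document>"
)
bags = [{"invoice": v} for v in doc.findall("invoice")]

def group(collection, key, aggregate):
    """Group bags by key(bag) and apply an aggregate function to each group."""
    groups = defaultdict(list)
    for bag in collection:
        groups[key(bag)].append(bag)
    return {k: aggregate(members) for k, members in groups.items()}

def average_total(members):
    totals = [float(b["invoice"].findtext("total")) for b in members]
    return sum(totals) / len(totals)

def rename(collection, old, new):
    """Change the entry point annotation old to new in every bag."""
    return [{(new if k == old else k): v for k, v in bag.items()}
            for bag in collection]

averages = group(bags, lambda b: b["invoice"].findtext("account_number"), average_total)
print(sorted(averages))  # ['1', '2']
print(list(rename([{"invoice.bill_period": ET.Element("bill_period")}],
                  "invoice.bill_period", "date")[0]))  # ['date']
```

Here the grouping key is the account number, so the two invoices of account 1 are averaged together.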
24Example XML Source Documents
Invoice.xml
<Invoice_Document>
  <invoice>
    <account_number>2</account_number>
    <carrier>ATT</carrier>
    <total>0.25</total>
  </invoice>
  <invoice>
    <account_number>1</account_number>
    <carrier>Sprint</carrier>
    <total>1.20</total>
  </invoice>
  <invoice>
    <account_number>1</account_number>
    <total>0.75</total>
  </invoice>
  <auditor>maria</auditor>
</Invoice_Document>
Customer.xml
<Customer_Document>
  <customer>
    <account>1</account>
    <name>Tom</name>
  </customer>
  <customer>
    <account>2</account>
    <name>George</name>
  </customer>
</Customer_Document>
25XQuery Example
- List account number, customer name, and invoice total for all invoices that have carrier Sprint.
FOR $i in (invoices.xml)//invoice,
    $c in (customers.xml)//customer
WHERE $i/carrier = "Sprint" and
      $i/account_number = $c/account
RETURN
  <Sprint_invoices>
    $i/account_number,
    $c/name,
    $i/total
  </Sprint_invoices>
26Example XQuery output
<Sprint_invoices>
  <account_number>1</account_number>
  <name>Tom</name>
  <total>1.20</total>
</Sprint_invoices>
27Algebra Tree Execution
Operator tree, root first; each operator is followed by its output:
Expose (.account_number, .name, .total) -> account_number, name, total
  Join (.invoice.account_number = .customer.account) -> invoice(2), customer(1)
    Select (carrier = Sprint) -> invoice(2)
      Follow (.invoice) -> invoice(1), invoice(2), invoice(3)
        Source (Invoices.xml)
    Follow (.customer) -> customer(1), customer(2)
      Source (customers.xml)
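The whole plan can be traced end to end with a small Python sketch. The dict-based bags, the function names, and the wrapper element name Sprint_invoices (taken from the query's RETURN clause) are assumptions for illustration; this is not the Niagara implementation.

```python
import copy
import xml.etree.ElementTree as ET

INVOICES = ET.fromstring(
    "<Invoice_Document>"
    "<invoice><account_number>2</account_number><carrier>ATT</carrier><total>0.25</total></invoice>"
    "<invoice><account_number>1</account_number><carrier>Sprint</carrier><total>1.20</total></invoice>"
    "<invoice><account_number>1</account_number><total>0.75</total></invoice>"
    "</Invoice_Document>"
)
CUSTOMERS = ET.fromstring(
    "<Customer_Document>"
    "<customer><account>1</account><name>Tom</name></customer>"
    "<customer><account>2</account><name>George</name></customer>"
    "</Customer_Document>"
)

def follow(collection, forward_part, entry_point):
    """Unnesting follow; the new entry is named by the forward part (assumption)."""
    return [{**bag, forward_part: v}
            for bag in collection
            for v in bag[entry_point].findall(forward_part)]

def select(collection, predicate):
    return [bag for bag in collection if predicate(bag)]

def join(left, right, predicate):
    return [{**l, **r} for l in left for r in right if predicate(l, r)]

def expose(bag, paths, wrapper_tag):
    """Wrap copies of the requested child vertices, in the given order."""
    out = ET.Element(wrapper_tag)
    for entry_point, child_tag in paths:
        out.append(copy.deepcopy(bag[entry_point].find(child_tag)))
    return out

# Bottom-up evaluation of the operator tree.
invoices = follow([{"Root invoices.xml": INVOICES}], "invoice", "Root invoices.xml")
sprint = select(invoices, lambda b: b["invoice"].findtext("carrier") == "Sprint")
customers = follow([{"Root customers.xml": CUSTOMERS}], "customer", "Root customers.xml")
joined = join(sprint, customers,
              lambda l, r: l["invoice"].findtext("account_number")
              == r["customer"].findtext("account"))
results = [expose(b, [("invoice", "account_number"), ("customer", "name"),
                      ("invoice", "total")], "Sprint_invoices")
           for b in joined]
for r in results:
    print(ET.tostring(r, encoding="unicode"))
```

The intermediate sizes match the tree: three invoices after follow, one after select, two customers, and one joined pair.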
28Optimization with Niagara
- Optimizer based on Niagara algebra
- Use the operations more efficiently
- Produce simpler expressions by combining operations
29Language Convention
- A and B are path expressions
- A < B: path expression A is a prefix of B
- A ∩ B: the common prefix of paths A and B, taken as the greatest common prefix
- -: the null path expression
30Heuristics using Rewrite Rules
- Allow optimization based on path selectivity
- When applying the unnesting follow operation Fµ
31Interchangeability of Follow operation
- Fµ(A)Fµ(B) = Fµ(B)Fµ(A)
- True when there exists C such that C < A, C < B, and C = A ∩ B
- Or when A ∩ B = - (the null path expression)
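The side condition of this rule can be checked mechanically. The Python sketch below assumes paths are tuples of steps starting at the entry point, reads "<" as "is a proper prefix of", and models the null path expression as the empty tuple; all three are interpretive assumptions, as is the function name follows_commute.

```python
def common_prefix(a, b):
    """Greatest common prefix of two paths (the empty tuple models the null path)."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)

def is_proper_prefix(c, a):
    return len(c) < len(a) and a[:len(c)] == c

def follows_commute(a, b):
    """True when the two unnesting follows may be interchanged under the rule."""
    c = common_prefix(a, b)
    if not c:  # null common prefix: the paths are unrelated
        return True
    return is_proper_prefix(c, a) and is_proper_prefix(c, b)

print(follows_commute(("invoice", "acc_Num"), ("invoice", "carrier")))   # True
print(follows_commute(("invoice", "acc_Num"), ("customer", "acc_Num")))  # True
print(follows_commute(("invoice",), ("invoice", "carrier")))             # False
```

The last case fails because one path is a prefix of the other: the second follow depends on the vertex produced by the first, so the two cannot be swapped.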
32Application of Rule on Invoice
- Fµ(acc_Numinvoice)Fµ(carrierinvoice)
- is equivalent to
- Fµ(carrierinvoice)Fµ(acc_Numinvoice)
- Equivalent because both share the common prefix invoice.
- Case: A ∩ B = invoice
34Benefit of Rule Application
- NOTE: let us assume that acc_Num is required for each invoice element, while carrier is not required for an invoice element
- THEN
- Fµ(acc_Numinvoice)Fµ(carrierinvoice)
- is equivalent to
- Fµ(carrierinvoice)Fµ(acc_Numinvoice)
- Then which algebra tree do we prefer?
- Fµ(acc_Numinvoice)Fµ(acc_Numcustomer)
- make more sense than Why?
35Discussion
- Reduction of input size on the first sub-operation:
- Fµ(carrierinvoice)
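The input-size argument can be made concrete with a small Python experiment. It assumes that an unnesting follow drops bags with no reachable vertex, uses account_number for the slides' acc_Num, and uses illustrative names (follow, run); the three-invoice document mirrors the earlier example in which one invoice has no carrier.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<Invoice_Document>"
    "<invoice><account_number>2</account_number><carrier>ATT</carrier></invoice>"
    "<invoice><account_number>1</account_number><carrier>Sprint</carrier></invoice>"
    "<invoice><account_number>1</account_number></invoice>"
    "</Invoice_Document>"
)

def follow(collection, forward_part, entry_point):
    """Unnesting follow: bags with no reachable vertex yield no output bag."""
    return [{**bag, f"{entry_point}.{forward_part}": v}
            for bag in collection
            for v in bag[entry_point].findall(forward_part)]

def run(order):
    """Apply the follows in the given order; record each operation's input size."""
    bags = [{"invoice": v} for v in doc.findall("invoice")]
    input_sizes = []
    for forward_part in order:
        input_sizes.append(len(bags))
        bags = follow(bags, forward_part, "invoice")
    return input_sizes, len(bags)

print(run(["account_number", "carrier"]))  # ([3, 3], 2)
print(run(["carrier", "account_number"]))  # ([3, 2], 2)
```

Both orders return the same two bags, but following the optional carrier first means the second follow processes two bags instead of three.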
36- Should we/can we apply the rule below?
- Fµ(acc_Numinvoice)Fµ(acc_NumCustomer)
37- acc_Numinvoice and
- acc_Numcustomer
- are two totally different paths
- Case: A ∩ B = - (the null path expression)
- So yes, the rule is valid.