Title: Chapter 10: XML
1Chapter 10 XML
2Context
- The dawn of database technology: the 1970s
- A DBMS is a flexible store-recall system for digital information
- It provides permanent memory for structured information
3Context
- Database management technology for administrative settings was completed in the early 1980s
- Search for demanding application areas that could benefit from a database approach
- A sound data model to structure the information and maintain integrity rules
- A high-level programming language model to manipulate the data
- Separation of concerns between modelling and manipulation on the one hand, and physical storage and order of execution on the other, thanks to query optimizer technology
4Context
- Demanding areas of research in DBMS core technology
- Office information systems, e.g. document modelling and workflow
- CAD/CAM, e.g. how to manage the design of an airplane or nuclear power plant
- GIS, e.g. managing remote sensing information
- WWW, e.g. how to integrate heterogeneous sources
- Agent-based systems, e.g. reactive systems
- Multimedia, e.g. video storage/retrieval
- Data mining, e.g. discovery of client profiles
- Sensor networks, e.g. small footprint and energy-wise computing
5Context
- Demanding areas of research in DBMS core technology
- Office information systems: extensible DBMS, blobs
- CAD/CAM: object-oriented DBMS, geometry
- GIS: GIS DBMS, geometry and images
- Agent-based systems: active DBMS, triggers
- Multimedia: multimedia DBMS, feature analysis
- Data mining: data warehouse systems, cube, association rules
- Sensor networks: P2P databases, ad-hoc networking
6Context
- Application interaction with DBMS
- Proprietary application programming interface, shielding the hardware distinctions
- Use readable interfaces to improve monitoring and development
- Example: in MonetDB the interaction is based on ASCII text, with the first character indicating the message type
- > prompt, await the next request
- ! error occurred, rest is the message
- start of a tuple answer
- Language embedding to remove the impedance mismatch, i.e. avoid the cost of transforming data
- Effectively failed in the OO world
7Context
- Learning points from a database perspective
- Database systems should not be concerned with the user-interaction technology; they should be blind and deaf
- The strong requirements on schema, integrity rules and processing are a harness
- Interaction with applications should be self-descriptive as much as possible, because you cannot know a complete schema a priori
- Need for semi-structured databases
8Semi-structured data
- Properties of semi-structured databases
- The schema is not given in advance and may be implicit in the data
- The schema is relatively large and changes frequently
- The schema is descriptive rather than prescriptive; integrity rules may be violated
- The data is not strongly typed; the values of attributes may be of different types
- The Stanford Lore system is the prototypical first attempt to support semi-structured databases
9Context
- Accidentally, in the world of digital publishing there is a need for a simple data model to structure information
- SGML, HTML, XML, XHTML
- XPath, XQuery, XSLT
- By the end of the 1990s, the document world meets the database world
10Introduction
- XML: Extensible Markup Language
- Defined by the WWW Consortium (W3C)
- Originally intended as a document markup language, not a database language
- Documents have tags giving extra information about sections of the document
- E.g. <title> XML </title> <slide> Introduction </slide>
- Derived from SGML (Standard Generalized Markup Language), but simpler to use than SGML
- Extensible, unlike HTML
- Users can add new tags, and separately specify how the tag should be handled for display
11XML Introduction (Cont.)
- The ability to specify new tags, and to create nested tag structures, made XML a great way to exchange data, not just documents
- Much of the use of XML has been in data exchange applications, not as a replacement for HTML
- Tags make data (relatively) self-documenting
- E.g.
<bank>
  <account>
    <account-number> A-101 </account-number>
    <branch-name> Downtown </branch-name>
    <balance> 500 </balance>
  </account>
  <depositor>
    <account-number> A-101 </account-number>
    <customer-name> Johnson </customer-name>
  </depositor>
</bank>
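A minimal sketch of what "self-documenting" buys a receiving application: the bank fragment can be turned into records using only the tag names found in the data. The function name `records` is hypothetical, and Python's standard `xml.etree.ElementTree` module is used purely for illustration.

```python
import xml.etree.ElementTree as ET

# The bank fragment from the slide, reproduced as a string.
BANK_XML = """
<bank>
  <account>
    <account-number> A-101 </account-number>
    <branch-name> Downtown </branch-name>
    <balance> 500 </balance>
  </account>
  <depositor>
    <account-number> A-101 </account-number>
    <customer-name> Johnson </customer-name>
  </depositor>
</bank>
"""


def records(xml_text):
    """Turn each child of the root into a (tag, {subtag: text}) pair."""
    root = ET.fromstring(xml_text)
    return [
        (child.tag, {field.tag: (field.text or "").strip() for field in child})
        for child in root
    ]


for tag, fields in records(BANK_XML):
    print(tag, fields)
```

No schema was supplied to the parser; the field names come from the document itself. All values arrive as strings, so any typing (for example treating the balance as a number) is left to the application.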
12XML Motivation
- Data interchange is critical in today's networked world
- Examples
- Banking: funds transfer
- Order processing (especially inter-company orders)
- Scientific data
- Chemistry: ChemML
- Genetics: BSML (Bio-Sequence Markup Language)
- Paper flow of information between organizations is being replaced by electronic flow of information
- Each application area has its own set of standards for representing information (W3C maintains ca. 30 standards)
- XML has become the basis for all new-generation data interchange formats
13XML Motivation (Cont.)
- Each XML-based standard defines what are valid elements, using
- XML type specification languages to specify the syntax
- DTD (Document Type Definitions)
- XML Schema
- Plus textual descriptions of the semantics
- XML allows new tags to be defined as required
- However, this may be constrained by DTDs
- A wide variety of tools is available for parsing, browsing and querying XML documents/data
14Structure of XML Data
- Tag: label for a section of data
- Element: section of data beginning with <tagname> and ending with matching </tagname>
- Elements must be properly nested
- Proper nesting
- <account> <balance> ... </balance> </account>
- Improper nesting
- <account> <balance> ... </account> </balance>
- Formally: every start tag must have a unique matching end tag that is in the context of the same parent element
- Every document must have a single top-level element
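The nesting rules above are what an XML parser enforces as well-formedness. A small sketch, assuming Python's standard `xml.etree.ElementTree` parser; the helper name `is_well_formed` and the concrete balance value are illustrative only.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text as a well-formed document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Proper nesting: balance is closed before account.
proper = "<account><balance>400</balance></account>"
# Improper nesting: account is closed while balance is still open.
improper = "<account><balance>400</account></balance>"
# Two top-level elements: violates the single-root rule.
two_roots = "<account/><account/>"

for text in (proper, improper, two_roots):
    print(is_well_formed(text), text)
```

Well-formedness is independent of any DTD or schema: it only checks tag matching and the single top-level element.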
15Motivation for Nesting
- Nesting of data is useful in data transfer
- Example: elements representing customer-id, customer name, and address nested within an order element
- Nesting is not supported, or discouraged, in relational databases
- With multiple orders, customer name and address are stored redundantly
- Normalization replaces nested structures in each order by a foreign key into a table storing customer name and address information
- Nesting is supported in object-relational databases and NF2
- But nesting is appropriate when transferring data
- External application does not have direct access to data referenced by a foreign key
16Example of Nested Elements
<bank-1>
  <customer>
    <customer-name> Hayes </customer-name>
    <customer-street> Main </customer-street>
    <customer-city> Harrison </customer-city>
    <account>
      <account-number> A-102 </account-number>
      <branch-name> Perryridge </branch-name>
      <balance> 400 </balance>
    </account>
    <account>
      ...
    </account>
  </customer>
  ...
</bank-1>
17Structure of XML Data (Cont.)
- Mixture of text with sub-elements is legal in XML
- Example
<account>
  This account is seldom used any more.
  <account-number> A-102 </account-number>
  <branch-name> Perryridge </branch-name>
  <balance> 400 </balance>
</account>
- Useful for document markup, but discouraged for data representation
18Attributes
- Elements can have attributes
<account acct-type = "checking">
  <account-number> A-102 </account-number>
  <branch-name> Perryridge </branch-name>
  <balance> 400 </balance>
</account>
- Attributes are specified by name=value pairs inside the starting tag of an element
- An element may have several attributes, but each attribute name can only occur once
- <account acct-type = "checking" monthly-fee = "5">
19Attributes Vs. Subelements
- Distinction between subelement and attribute
- In the context of documents, attributes are part of markup, while subelement contents are part of the basic document contents
- In the context of data representation, the difference is unclear and may be confusing
- Same information can be represented in two ways
- <account account-number = "A-101"> ... </account>
- <account> <account-number>A-101</account-number> ... </account>
- Suggestion: use attributes for identifiers of elements, and use subelements for contents
20More on XML Syntax
- Elements without subelements or text content can be abbreviated by ending the start tag with /> and deleting the end tag
- <account number="A-101" branch="Perryridge" balance="200" />
- To store string data that may contain tags, without the tags being interpreted as subelements, use CDATA as below
- <![CDATA[<account> ... </account>]]>
- Here, <account> and </account> are treated as just strings
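A short sketch of how a parser sees these two constructs, assuming Python's standard `xml.etree.ElementTree`. The wrapper element `<note>` and the text inside the CDATA section are hypothetical, added only so that the CDATA section has a parent element.

```python
import xml.etree.ElementTree as ET

# Abbreviated empty element: all data is carried by attributes.
short = ET.fromstring('<account number="A-101" branch="Perryridge" balance="200" />')
print(short.attrib["branch"], short.text, len(short))

# CDATA section: the tags inside are plain characters, not subelements.
note = ET.fromstring("<note><![CDATA[<account> A-101 </account>]]></note>")
print(repr(note.text), len(note))


def has_duplicate_attribute_error(xml_text):
    """Return True if parsing fails, e.g. because an attribute name repeats."""
    try:
        ET.fromstring(xml_text)
        return False
    except ET.ParseError:
        return True


# Each attribute name may occur only once in a start tag.
print(has_duplicate_attribute_error('<account acct-type="a" acct-type="b"/>'))
```

The element parsed from the CDATA example has no children; the angle brackets survive as part of its text value.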
21Namespaces
- XML data has to be exchanged between organizations
- Same tag name may have different meaning in different organizations, causing confusion on exchanged documents
- Specifying a unique string as an element name avoids confusion
- Better solution: use unique-name:element-name
- Avoid using long unique names all over the document by using XML Namespaces
<bank xmlns:FB="http://www.FirstBank.com">
  <FB:branch>
    <FB:branchname>Downtown</FB:branchname>
    <FB:branchcity>Brooklyn</FB:branchcity>
  </FB:branch>
</bank>
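A sketch of how the FB prefix is resolved by a namespace-aware parser, assuming Python's standard `xml.etree.ElementTree`. The prefix is only shorthand: internally every FB tag is expanded to the full unique name.

```python
import xml.etree.ElementTree as ET

DOC = """
<bank xmlns:FB="http://www.FirstBank.com">
  <FB:branch>
    <FB:branchname>Downtown</FB:branchname>
    <FB:branchcity>Brooklyn</FB:branchcity>
  </FB:branch>
</bank>
"""

root = ET.fromstring(DOC)

# Prefix-to-URI map used on the query side; the prefix chosen here does not
# have to match the one used in the document, only the URI has to match.
NS = {"FB": "http://www.FirstBank.com"}

branch = root.find("FB:branch", NS)
branch_name = branch.find("FB:branchname", NS).text

print(branch.tag)    # expanded name: {namespace-URI}local-name
print(branch_name)
```

An unqualified lookup such as `root.find("branch")` finds nothing, which is the point: a `branch` element from another organization cannot be confused with `FB:branch`.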
22XML Document Schema
- Database schemas constrain what information can be stored, and the data types of stored values
- XML documents are not required to have an associated schema
- However, schemas are very important for XML data exchange
- Otherwise, a site cannot automatically interpret data received from another site
- Two mechanisms for specifying XML schema
- Document Type Definition (DTD)
- Widely used
- XML Schema
- Newer, not yet widely used
23Document Type Definition (DTD)
- The type of an XML document can be specified using a DTD
- DTD constrains the structure of XML data
- What elements can occur
- What attributes can/must an element have
- What subelements can/must occur inside each element, and how many times
- DTD does not constrain data types
- All values are represented as strings in XML
- DTD syntax
- <!ELEMENT element (subelements-specification) >
- <!ATTLIST element (attributes) >
24Element Specification in DTD
- Subelements can be specified as
- names of elements, or
- #PCDATA (parsed character data), i.e., character strings
- EMPTY (no subelements) or ANY (anything can be a subelement)
- Example
- <!ELEMENT depositor (customer-name, account-number)>
- <!ELEMENT customer-name (#PCDATA)>
- <!ELEMENT account-number (#PCDATA)>
- Subelement specification may have regular expressions
- <!ELEMENT bank ( ( account | customer | depositor)+)>
- Notation
- "|" - alternatives
- "+" - 1 or more occurrences
- "*" - 0 or more occurrences
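Because a content model is a regular expression over child element names, it can be checked with an ordinary regex engine. The sketch below is an illustration of that idea, not a DTD validator: the two content models are translated to Python regexes by hand, and the names `CONTENT_MODELS` and `children_match` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of two content models into regexes over the sequence of
# child tag names, where every tag name is followed by a single space.
CONTENT_MODELS = {
    # <!ELEMENT bank ((account | customer | depositor)+)>
    "bank": re.compile(r"((account|customer|depositor) )+"),
    # <!ELEMENT depositor (customer-name, account-number)>
    "depositor": re.compile(r"customer-name account-number "),
}


def children_match(element):
    """Check the direct children of element against its content model."""
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        return True  # no model recorded for this tag in this sketch
    sequence = "".join(child.tag + " " for child in element)
    return model.fullmatch(sequence) is not None


ok_bank = ET.fromstring("<bank><account/><depositor/><account/></bank>")
empty_bank = ET.fromstring("<bank/>")
swapped = ET.fromstring("<depositor><account-number/><customer-name/></depositor>")

print(children_match(ok_bank), children_match(empty_bank), children_match(swapped))
```

The empty bank fails because "+" demands at least one child, and the swapped depositor fails because "," fixes the order of subelements.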
25Bank DTD
<!DOCTYPE bank [
  <!ELEMENT bank ( ( account | customer | depositor)+)>
  <!ELEMENT account (account-number, branch-name, balance)>
  <!ELEMENT customer (customer-name, customer-street, customer-city)>
  <!ELEMENT depositor (customer-name, account-number)>
  <!ELEMENT account-number (#PCDATA)>
  <!ELEMENT branch-name (#PCDATA)>
  <!ELEMENT balance (#PCDATA)>
  <!ELEMENT customer-name (#PCDATA)>
  <!ELEMENT customer-street (#PCDATA)>
  <!ELEMENT customer-city (#PCDATA)>
]>
26Attribute Specification in DTD
- Attribute specification: for each attribute
- Name
- Type of attribute
- CDATA
- ID (identifier) or IDREF (ID reference) or IDREFS (multiple IDREFs)
- more on this later
- Whether
- mandatory (#REQUIRED)
- has a default value (value),
- or neither (#IMPLIED)
- Examples
- <!ATTLIST account acct-type CDATA "checking">
- <!ATTLIST customer
    customer-id ID #REQUIRED
    accounts IDREFS #REQUIRED >
27IDs and IDREFs
- An element can have at most one attribute of type ID
- The ID attribute value of each element in an XML document must be distinct
- Thus the ID attribute value is an object identifier
- An attribute of type IDREF must contain the ID value of an element in the same document
- An attribute of type IDREFS contains a set of (0 or more) ID values. Each ID value must contain the ID value of an element in the same document
28Bank DTD with Attributes
- Bank DTD with ID and IDREF attribute types.
<!DOCTYPE bank-2 [
  <!ELEMENT account (branch, balance)>
  <!ATTLIST account
      account-number ID     #REQUIRED
      owners         IDREFS #REQUIRED>
  <!ELEMENT customer (customer-name, customer-street, customer-city)>
  <!ATTLIST customer
      customer-id ID     #REQUIRED
      accounts    IDREFS #REQUIRED>
  ... declarations for branch, balance, customer-name, customer-street and customer-city
]>
29XML data with ID and IDREF attributes
<bank-2>
  <account account-number="A-401" owners="C100 C102">
    <branch-name> Downtown </branch-name>
    <balance> 500 </balance>
  </account>
  <customer customer-id="C100" accounts="A-401">
    <customer-name>Joe</customer-name>
    <customer-street>Monroe</customer-street>
    <customer-city>Madison</customer-city>
  </customer>
  <customer customer-id="C102" accounts="A-401 A-402">
    <customer-name>Mary</customer-name>
    <customer-street>Erin</customer-street>
    <customer-city>Newark</customer-city>
  </customer>
</bank-2>
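The ID/IDREF rules can be sketched as a simple two-pass check. This is an illustration only: Python's standard `xml.etree.ElementTree` does not read attribute types from a DTD, so the tables `ID_ATTRS` and `IDREFS_ATTRS` below hard-code them as an assumption, and `check_references` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

BANK2_XML = """
<bank-2>
  <account account-number="A-401" owners="C100 C102">
    <branch-name> Downtown </branch-name>
    <balance> 500 </balance>
  </account>
  <customer customer-id="C100" accounts="A-401">
    <customer-name>Joe</customer-name>
  </customer>
  <customer customer-id="C102" accounts="A-401 A-402">
    <customer-name>Mary</customer-name>
  </customer>
</bank-2>
"""

# Assumed attribute typing, copied by hand from the ATTLIST declarations.
ID_ATTRS = {"account": "account-number", "customer": "customer-id"}
IDREFS_ATTRS = {"account": "owners", "customer": "accounts"}


def check_references(xml_text):
    """Return (ids, duplicate_ids, dangling_refs) for the document."""
    root = ET.fromstring(xml_text)
    ids, duplicates, dangling = {}, [], []
    # Pass 1: collect ID values; each must be distinct in the document.
    for elem in root.iter():
        attr = ID_ATTRS.get(elem.tag)
        if attr and attr in elem.attrib:
            value = elem.attrib[attr]
            if value in ids:
                duplicates.append(value)
            ids[value] = elem.tag
    # Pass 2: every IDREFS token must name some ID in the same document.
    for elem in root.iter():
        attr = IDREFS_ATTRS.get(elem.tag)
        if attr:
            for ref in elem.attrib.get(attr, "").split():
                if ref not in ids:
                    dangling.append(ref)
    return ids, duplicates, dangling


ids, duplicates, dangling = check_references(BANK2_XML)
print(ids, duplicates, dangling)
```

In the fragment as shown, the reference A-402 has no matching account element, so the sketch reports it as dangling. Note that the check only asks whether a referenced ID exists, not what kind of element carries it, which mirrors the fact that IDs and IDREFs are untyped.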
30Limitations of DTDs
- No typing of text elements and attributes
- All values are strings, no integers, reals, etc.
- Difficult to specify unordered sets of subelements
- Order is usually irrelevant in databases
- (A | B)* allows specification of an unordered set, but
- Cannot ensure that each of A and B occurs only once
- IDs and IDREFs are untyped
- The owners attribute of an account may contain a reference to another account, which is meaningless
- The owners attribute should ideally be constrained to refer to customer elements
31XML Schema
- XML Schema is a more sophisticated schema language which addresses the drawbacks of DTDs. Supports
- Typing of values
- E.g. integer, string, etc.
- Also, constraints on min/max values
- User-defined types
- Is itself specified in XML syntax, unlike DTDs
- More standard representation, but verbose
- Is integrated with namespaces
- Many more features
- List types, uniqueness and foreign key constraints, inheritance ...
- BUT: significantly more complicated than DTDs, not yet widely used.
32XML Schema Version of Bank DTD
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="bank" type="BankType"/>
<xsd:element name="account">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element name="account-number" type="xsd:string"/>
      <xsd:element name="branch-name" type="xsd:string"/>
      <xsd:element name="balance" type="xsd:decimal"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>
... definitions of customer and depositor ...
<xsd:complexType name="BankType">
  <xsd:sequence>
    <xsd:element ref="account" minOccurs="0" maxOccurs="unbounded"/>
    <xsd:element ref="customer" minOccurs="0" maxOccurs="unbounded"/>
    <xsd:element ref="depositor" minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
</xsd:complexType>
</xsd:schema>
33Storage of XML Data
- XML data can be stored in
- Non-relational data stores
- Flat files
- Natural for storing XML
- But has all problems discussed in Chapter 1 (no concurrency, no recovery, ...)
- XML database
- Database built specifically for storing XML data, supporting DOM model and declarative querying
- Currently no commercial-grade scalable system
- Relational databases
- Data must be translated into relational form
- Advantage: mature database systems
- Disadvantages: overhead of translating data and queries
34Storing XML in Relational Databases
- Store as string
- E.g. store each top-level element as a string field of a tuple in a database
- Use a single relation to store all elements, or
- Use a separate relation for each top-level element type
- E.g. account, customer, depositor
- Indexing
- Store values of subelements/attributes to be indexed, such as customer-name and account-number, as extra fields of the relation, and build indices
- Oracle 9 supports function indices, which use the result of a function as the key value. Here, the function should return the value of the required subelement/attribute
- SQL Server 2005: same
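A minimal sketch of the store-as-string approach with one extracted index field, using SQLite from the Python standard library in place of a commercial DBMS. The table name `elements`, its column names, and the helper functions are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

BANK_XML = """<bank>
  <account>
    <account-number> A-101 </account-number>
    <branch-name> Downtown </branch-name>
    <balance> 500 </balance>
  </account>
  <depositor>
    <account-number> A-101 </account-number>
    <customer-name> Johnson </customer-name>
  </depositor>
</bank>"""


def store_as_strings(xml_text):
    """Store each top-level element as one string, plus an indexed key field."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE elements(tag TEXT, account_number TEXT, body TEXT)")
    conn.execute("CREATE INDEX elements_by_account ON elements(account_number)")
    for elem in ET.fromstring(xml_text):
        key = elem.findtext("account-number")
        conn.execute(
            "INSERT INTO elements VALUES (?, ?, ?)",
            (elem.tag, key.strip() if key else None,
             ET.tostring(elem, encoding="unicode")),
        )
    return conn


def balance_of(conn, account_number):
    """Find the account via the index, then parse its string to reach balance."""
    row = conn.execute(
        "SELECT body FROM elements WHERE tag = 'account' AND account_number = ?",
        (account_number,),
    ).fetchone()
    if row is None:
        return None
    return ET.fromstring(row[0]).findtext("balance").strip()


conn = store_as_strings(BANK_XML)
print(balance_of(conn, "A-101"))
```

The lookup by account number uses the relational index, but reaching the balance still requires parsing the stored string, which is the cost this approach pays for any value that was not extracted into its own field.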
35Storing XML in Relational Databases
- Store as string
- E.g. store each top-level element as a string field of a tuple in a database
- Benefits
- Can store any XML data even without DTD
- As long as there are many top-level elements in a document, strings are small compared to the full document, allowing faster access to individual elements
- Drawback: need to parse strings to access values inside the elements; parsing is slow
36OEM model
- Semi-structured and XML databases can be modelled as graph problems
- Early prototypes directly supported the graph model as the physical implementation scheme. Querying the graph model was implemented using graph traversals
- XML without IDREFS can be modelled as trees
38Storing XML as Relations (Cont.)
- Tree representation: model XML data as tree and store using relations
  nodes(id, type, label, value)
  child(child-id, parent-id)
- Each element/attribute is given a unique identifier
- Type indicates element/attribute
- Label specifies the tag name of the element/name of attribute
- Value is the text value of the element/attribute
- The relation child notes the parent-child relationships in the tree
- Can add an extra attribute to child to record ordering of children
- Benefit: can store any XML data, even without DTD
- Drawbacks
- Data is broken up into too many pieces, increasing space overheads
- Even simple queries require a large number of joins, which can be slow
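The tree representation can be sketched with SQLite from the Python standard library. Assumptions in this sketch: column names use underscores (`child_id`, `parent_id`) because hyphens are not valid in SQL identifiers, the ordering column is called `position`, and the function name `shred_tree` and the sample document are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET


def shred_tree(xml_text):
    """Load a document into nodes(id, type, label, value) and child(...)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE nodes(id INTEGER PRIMARY KEY, type TEXT, label TEXT, value TEXT)")
    conn.execute("CREATE TABLE child(child_id INTEGER, parent_id INTEGER, position INTEGER)")
    next_id = 0

    def new_node(kind, label, value, parent, position):
        nonlocal next_id
        next_id += 1
        conn.execute("INSERT INTO nodes VALUES (?, ?, ?, ?)", (next_id, kind, label, value))
        if parent is not None:
            conn.execute("INSERT INTO child VALUES (?, ?, ?)", (next_id, parent, position))
        return next_id

    def visit(elem, parent, position):
        text = (elem.text or "").strip() or None
        node_id = new_node("element", elem.tag, text, parent, position)
        pos = 0
        for name, value in elem.attrib.items():
            new_node("attribute", name, value, node_id, pos)
            pos += 1
        for sub in elem:
            visit(sub, node_id, pos)
            pos += 1

    visit(ET.fromstring(xml_text), None, 0)
    return conn


DOC = ('<bank><account acct-type="checking">'
       '<account-number>A-102</account-number><balance>400</balance>'
       '</account></bank>')
conn = shred_tree(DOC)

# "Balance of account A-102": four joins for a two-step lookup.
BALANCE_QUERY = """
SELECT b.value
FROM nodes a
JOIN child ca ON ca.parent_id = a.id JOIN nodes n ON n.id = ca.child_id
JOIN child cb ON cb.parent_id = a.id JOIN nodes b ON b.id = cb.child_id
WHERE a.label = 'account'
  AND n.label = 'account-number' AND n.value = 'A-102'
  AND b.label = 'balance'
"""
print(conn.execute(BALANCE_QUERY).fetchall())
```

The query illustrates the drawback named above: even a lookup of one value by one key already needs several self-joins over the two generic relations.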
39Storing XML in Relations (Cont.)
- Map to relations
- If DTD of document is known, you can map data to relations
- Bottom-level elements and attributes are mapped to attributes of relations
- A relation is created for each element type
- An id attribute to store a unique id for each element
- All element attributes become relation attributes
- All subelements that occur only once become attributes
- For text-valued subelements, store the text as attribute value
- For complex subelements, store the id of the subelement
- Subelements that can occur multiple times are represented in a separate table
- Similar to handling of multivalued attributes when converting ER diagrams to tables
- Benefits
- Efficient storage
- Can translate XML queries into SQL, execute efficiently, and then translate SQL results back to XML
40Alternative mappings
- Mapping the structure
- The Edge approach
- The Attribute approach
- The Universal Table approach
- The Normalized Universal approach
- The Dataguide approach
- Mapping values
- Separate value tables
- Inlining
- Shredding
41Edge approach
- Use a single Edge table to capture the graph structure
- Edge(source, ordinal, name, flag, target)
- Flag: value, reference
- Key: (source, ordinal)
- Index: source, name, target
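A sketch of the Edge approach using SQLite from the Python standard library. Several choices here are assumptions made for illustration rather than details taken from the slide: values are inlined in the `target` column as text, the flag takes the values `'value'` and `'ref'`, object id 0 stands for a virtual document node, and the index is built on (name, target). The function name `load_edges` is hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET


def load_edges(xml_text):
    """Load a document into Edge(source, ordinal, name, flag, target)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Edge(source INTEGER, ordinal INTEGER, name TEXT, "
        "flag TEXT, target TEXT, PRIMARY KEY (source, ordinal))"
    )
    conn.execute("CREATE INDEX edge_name_target ON Edge(name, target)")
    counter = 0  # object id 0 is the virtual document node

    def add(source, ordinal, name, flag, target):
        conn.execute("INSERT INTO Edge VALUES (?, ?, ?, ?, ?)",
                     (source, ordinal, name, flag, target))

    def visit(elem, source, ordinal):
        nonlocal counter
        if len(elem) == 0 and not elem.attrib:
            # Leaf element: the edge points directly at its text value.
            add(source, ordinal, elem.tag, "value", (elem.text or "").strip())
            return
        counter += 1
        oid = counter
        add(source, ordinal, elem.tag, "ref", str(oid))
        pos = 0
        for name, value in elem.attrib.items():
            pos += 1
            add(oid, pos, name, "value", value)
        for sub in elem:
            pos += 1
            visit(sub, oid, pos)

    visit(ET.fromstring(xml_text), 0, 1)
    return conn


DOC = ("<bank><account><account-number>A-102</account-number>"
       "<balance>400</balance></account></bank>")
conn = load_edges(DOC)

# Path expression bank.account.balance: one self-join per step.
PATH_QUERY = """
SELECT e3.target
FROM Edge e1
JOIN Edge e2 ON e2.source = CAST(e1.target AS INTEGER)
JOIN Edge e3 ON e3.source = CAST(e2.target AS INTEGER)
WHERE e1.source = 0 AND e1.name = 'bank' AND e1.flag = 'ref'
  AND e2.name = 'account' AND e2.flag = 'ref'
  AND e3.name = 'balance' AND e3.flag = 'value'
"""
print(conn.execute(PATH_QUERY).fetchall())
```

Every step of a path expression becomes one more self-join on the single Edge table, which is why the choice of key and indices matters so much for this mapping.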
42Attribute approach
- Group all attributes with the same name into one table
- Aname(source, ordinal, flag, target)
- Key: (source, ordinal)
- Index: target
43Universal approach
- Use the Universal table; all attributes are stored as columns
- Universal(source, ord-1, flag-1, target-1, ..., ord-n, flag-n, target-n)
- Key: source; index: target-i
44Normalized Universal
- Same as Universal, but factor out the repeating values
- Universal(source, ord-1, flag-1, target-1, ..., ord-n, flag-n, target-n)
- Overflow_n(source, ord, flag, target)
- Key: source, and (source, ord)
- Index: target-i
45Mapping values
- Separate value tables
- Use V_type(vid, value) tables, e.g. int(vid, val), str(vid, val), ...
46Mapping values
- Inlining
- As illustrated in previous mappings, inline the values in the structure relations
47Shredding
- Try to recognize repeating structures and map them to separate tables
- Handle the remainder through any of the previous methods
48Evaluation
- Some results reported by Florescu and Kossmann using a commercial DBMS on documents of 100K objects in 1999
- Database storage overhead
49Evaluation
- Some results reported by Florescu and Kossmann using a commercial DBMS on documents of 100K objects in 1999
- Bulk loading
50Evaluation
- Some results reported by Florescu and Kossmann using a commercial DBMS on documents of 100K objects in 1999
- Reconstruction
51The Data
- Semistructured data instance: a large graph
52The indexing problem
- The storage problem
- Store the graph in a relational DBMS
- Develop a new database storage structure
- The indexing problem
- Input: large, irregular data graph
- Output: index structure for evaluating (regular) path expressions, e.g. bib.paper.author.firstname
53XSet a simple index for XML
- Part of the Ninja project at Berkeley
- Example XML data
54XSet a simple index for XML
- Each node: a hashtable
- Each entry: list of pointers to data nodes (not shown)
55XSet Efficient query evaluation
- SELECT X FROM part.name X - yes
- SELECT X FROM part.supplier.name X - yes
- SELECT X FROM part.*.subpart.name X - maybe
- SELECT X FROM *.supplier.name X - maybe
Will gain when index fits in memory
56Region Algebras
- structured text: text with tags (like XML)
- data: sequence of characters $c_1 c_2 c_3 \ldots$
- region: interval in the text
- representation: $(x, y) = c_x, c_{x+1}, \ldots, c_y$
- example: <section> ... </section>
- region set: a set of regions
- example: all <section> regions (may be nested)
- region algebra: operators on region sets,
- s1 op s2
57Representation of a region set
- Example: the <subpart> region set
58Region algebra some operators
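- s1 intersect s2 = $\{ r \mid r \in s_1,\ r \in s_2 \}$
- s1 included s2 = $\{ r \mid r \in s_1,\ \exists r' \in s_2,\ r \subseteq r' \}$
- s1 including s2 = $\{ r \mid r \in s_1,\ \exists r' \in s_2,\ r \supseteq r' \}$
- s1 parent s2 = $\{ r \mid r \in s_1,\ \exists r' \in s_2,\ r \text{ is a parent of } r' \}$
- s1 child s2 = $\{ r \mid r \in s_1,\ \exists r' \in s_2,\ r \text{ is a child of } r' \}$
Examples: <subpart> included <part>, <part> including <subpart>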
59Efficient computation of Region Algebra Operators
- Example: s1 included s2
- $s_1 = \{(x_1, x_1'), (x_2, x_2'), \ldots\}$
- $s_2 = \{(y_1, y_1'), (y_2, y_2'), \ldots\}$
- (i.e. assume each consists of disjoint regions)
- Algorithm
- if $x_i < y_j$ then $i := i + 1$
- if $x_i' > y_j'$ then $j := j + 1$
- otherwise print $(x_i, x_i')$, do $i := i + 1$
- Can do in sub-linear time when one region set is very small
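The merge-style algorithm for "included" can be written out as follows. This is a sketch under the same assumption as the slide (each region set is a list of disjoint regions) plus the assumption that both lists are sorted by start position; the function name `included` and the sample regions are illustrative.

```python
def included(s1, s2):
    """Return the regions of s1 that lie inside some region of s2.

    Both arguments are lists of (start, end) pairs, sorted by start,
    and the regions within each list are pairwise disjoint.
    """
    result = []
    i = j = 0
    while i < len(s1) and j < len(s2):
        x, x_end = s1[i]
        y, y_end = s2[j]
        if x < y:
            # s1[i] starts before s2[j]; it cannot be inside s2[j] or any
            # later region of s2, so skip it.
            i += 1
        elif x_end > y_end:
            # s1[i] ends after s2[j]; s2[j] cannot contain it or any later
            # region of s1, so move on in s2.
            j += 1
        else:
            # y <= x and x_end <= y_end: s1[i] is included in s2[j].
            result.append(s1[i])
            i += 1
    return result


subparts = [(1, 2), (5, 6), (8, 12), (15, 16)]
parts = [(4, 7), (14, 20)]
print(included(subparts, parts))
```

Each iteration advances one of the two cursors, so the loop takes at most len(s1) + len(s2) steps. The sub-linear bound mentioned on the slide would need a search structure over the larger set instead of this linear scan.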
60From path expressions to region expressions
Region expressions correspond to simple XPath expressions
- part.name = name child (part child root)
- part.supplier.name = name child (supplier child (part child root))
- *.supplier.name = name child supplier
- part.*.subpart.name = name child (subpart included (part child root))
61Storage structures for region algebras
- Every node is characterised by an integer pair (x,y)
- This means we have a 2-d space
- Any 2-d space data structure can be used
- If you use a (pre-order, post-order) numbering you get a triangular filling of the 2-d space
- (to be discussed later)
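A sketch of the (pre-order, post-order) numbering, assuming Python's standard `xml.etree.ElementTree` for parsing. The function names `number_nodes` and `is_ancestor` and the sample document are hypothetical. The useful property is that ancestor tests reduce to two integer comparisons on the pair.

```python
import xml.etree.ElementTree as ET


def number_nodes(root):
    """Return (tag, pre, post) for every element, in document order."""
    rows = []
    counters = {"pre": 0, "post": 0}

    def visit(elem):
        index = len(rows)
        rows.append([elem.tag, counters["pre"], None])
        counters["pre"] += 1
        for sub in elem:
            visit(sub)
        # The post-order rank is assigned once all descendants are done.
        rows[index][2] = counters["post"]
        counters["post"] += 1

    visit(root)
    return [tuple(row) for row in rows]


def is_ancestor(a, d):
    """a and d are (tag, pre, post) rows; a is a proper ancestor of d."""
    return a[1] < d[1] and a[2] > d[2]


DOC = "<part><name/><subpart><name/></subpart></part>"
rows = number_nodes(ET.fromstring(DOC))
part, name, subpart, inner_name = rows
print(rows)
print(is_ancestor(part, inner_name), is_ancestor(name, subpart))
```

Seen as points in the plane, the descendants of a node are exactly the points with a larger pre rank and a smaller post rank, so an ancestor or descendant query becomes a 2-d range query.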
62Alternative mappings
- Mapping the structure to the relational world
- The Edge approach
- The Attribute approach
- The Universal Table approach
- The Normalized Universal approach
- The Monet/XML approach
- The Dataguide approach
- Mapping values
- Separate value tables
- Inlining
- Shredding
63Dataguide approach
- Developed in the context of Lore, Lorel (Stanford Univ.)
- Predecessor of the Monet/XML model
- Observation
- queries in the graph representation take a limited form
- they are partial walks from the root to an object of interest
- this behaviour was stressed by the query language Lorel, i.e. an SQL-based query language based on processing regular expressions
SELECT X FROM (Bib.*.author).(lastname|firstname).Abiteboul X
64DataGuides
- Definition
- given a semistructured data instance DB, a DataGuide for DB is a graph G s.t.
- every path in DB also occurs in G
- every path in G occurs in DB
- every path in G is unique
65Dataguides
66DataGuides
- Multiple DataGuides for the same data
67DataGuides
- Definition
- Let $w$, $w'$ be two words (i.e. word queries) and $G$ a graph
- $w \equiv_G w'$ if $w(G) = w'(G)$
- Definition
- $G$ is a strong dataguide for a database $DB$ if $\equiv_G$ is the same as $\equiv_{DB}$
- Example
- G1 is a strong dataguide
- G2 is not strong
- person.project $\not\equiv_{DB}$ dept.project
- person.project $\not\equiv_{G2}$ dept.project
68DataGuides
- Constructing the strong DataGuide G
- Nodes(G) = {root}
- Edges(G) = $\emptyset$
- while changes do
- choose s in Nodes(G), a in Labels
- add $s' = \{ y \mid x \in s,\ (x \xrightarrow{a} y) \in Edges(DB) \}$ to Nodes(G)
- add $(s \xrightarrow{a} s')$ to Edges(G)
- Use hash table for Nodes(G)
- This is precisely the powerset automaton construction.
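The construction can be sketched as a worklist version of the powerset construction. Assumptions in this sketch: the database graph is given as a list of (source, label, target) triples, DataGuide nodes are represented as frozensets of database objects (their extents), and a Python set plays the role of the hash table for Nodes(G). The function name `strong_dataguide` and the sample graph are hypothetical.

```python
from collections import deque


def strong_dataguide(edges, root):
    """Build the strong DataGuide of a labelled graph.

    edges: iterable of (source, label, target) triples of the database.
    Returns (nodes, guide_edges); every node is a frozenset of DB objects.
    """
    # Adjacency: object -> label -> set of target objects.
    out = {}
    for x, a, y in edges:
        out.setdefault(x, {}).setdefault(a, set()).add(y)

    start = frozenset([root])
    nodes = {start}          # hashed set of extents seen so far
    guide_edges = set()
    queue = deque([start])
    while queue:
        s = queue.popleft()
        # Group all targets reachable from any object of s by label.
        by_label = {}
        for x in s:
            for a, ys in out.get(x, {}).items():
                by_label.setdefault(a, set()).update(ys)
        for a, ys in by_label.items():
            target = frozenset(ys)
            guide_edges.add((s, a, target))
            if target not in nodes:
                nodes.add(target)
                queue.append(target)
    return nodes, guide_edges


DB_EDGES = [
    ("root", "person", "p1"), ("root", "person", "p2"),
    ("root", "dept", "d1"),
    ("p1", "project", "j1"), ("p2", "project", "j2"),
    ("d1", "project", "j1"),
]
nodes, guide_edges = strong_dataguide(DB_EDGES, "root")
print(len(nodes), len(guide_edges))
```

In this sample, person.project and dept.project lead to different extents, {j1, j2} and {j1}, so the strong DataGuide keeps them as separate nodes. The loop terminates on cyclic data as well, because only finitely many subsets of database objects exist, but the number of such subsets can grow exponentially.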
69DataGuides
- How large are the dataguides?
- if DB is a tree, then $size(G) \le size(DB)$
- why? answer: every node is in exactly one extent of G
- here: dataguide = XSet
- How many nodes does the strong dataguide have for this DB?
20 nodes (least common multiple of 4 and 5)
Dataguides usually fail on data with cyclic
schemas, like