Chapter 10: XML presentation

About This Presentation

Transcript and Presenter's Notes

Title: Chapter 10: XML

1
Chapter 10 XML

The world of XML

2
Context

The dawn of database technology 70s
A DBMS is a flexible store-recall system for
digital information
It provides permanent memory for structured
information

3
Context

Database Managements technology for
administrative settings completed in the early
80s
Search for demanding application areas that could
benefit from a database approach
A sound datamodel to structure the information
and maintain integrity rules
A high level programming language model to
manipulate the data
Separation of concerns between modelling and
manipulation, and physical storage and order of
execution thanks to query optimizer technology

4
Context

Demanding areas of research in DBMS core
technology
Office Information systems, e.g. document
modelling and workflow
CAD/CAM, e.g. how to manage the design of an
airplane or nucleur power plant
GIS, e.g. managing remote sensing information
WWW, e.g. how to integrate heterogenous sources
Agent-based systems, e.g. reactive systems
Multimedia, e.g. video storage/retrieval
Datamining, e.g. discovery of client profiles
Sensor networks, e.g. small footprint and
energy-wise computing

5
Context

Demanding areas of research in DBMS core
technology
Office Information systems, Extensible DBMS,
blobs
CAD/CAM, Object-oriented DBMS, geometry
GIS, GIS DBMS, geometry and images
Agent-based systems, Active DBMS, triggers
Multimedia, MM DBMS, feature analysis
Datamining, Datawarehouse systems, cube,
association rules
Sensor networks, P2P databases, ad-hoc networking

6
Context

Application interaction with DBMS
Proprietary application programming interface,
shielding the hardware distinctions
Use readable interfaces to improve monitoring and
development
Example in Monetdb the interaction is based on
ascii text with the first character indicative
for the message type
gt prompt, await for next request
! error occurred, rest is the message
start of a tuple answer
Language embedding to remove the impedance
mismatch, i.e. avoid cost of transforming data
Effectively failed in the OO world

7
Context

Learning points database perspective,
Database system should not be concerned with the
user-interaction technology, they should be
blind and deaf
The strong requirements on schema, integrity
rules and processing is a harness
Interaction with applications should be
self-descriptive as much as possible, because,
you can not a priori know a complete schema
Need for semi-structured databases

8
Semi-structured data

Properties of semistructured databases
The schema is not given in advance and may be
implicit in the data
The schema is relatively large and changes
frequently
The schema is descriptive rather than
prescriptive, integrity rules may be violated
The data is not strongly typed, the values of
attributes may be of different type
Stanford Lore system is the prototypical first
attempt to support semi-structured databases

9
Context

Accidentally, in the world of digital publishing
there is a need for a simple datamodel to
structure information
SMGL HTML XML XHTML
XPATH XQUERY XSLT
By the end 90s, the document world meets the
database world

10
Introduction

XML Extensible Markup Language
Defined by the WWW Consortium (W3C)
Originally intended as a document markup language
not a database language
Documents have tags giving extra information
about sections of the document
E.g. lttitlegt XML lt/titlegt ltslidegt Introduction
lt/slidegt
Derived from SGML (Standard Generalized Markup
Language), but simpler to use than SGML
Extensible, unlike HTML
Users can add new tags, and separately specify
how the tag should be handled for display

11
XML Introduction (Cont.)

The ability to specify new tags, and to create
nested tag structures made XML a great way to
exchange data, not just documents.
Much of the use of XML has been in data exchange
applications, not as a replacement for HTML
Tags make data (relatively) self-documenting
E.g. ltbankgt
ltaccountgt
ltaccount-numbergt A-101
lt/account-numbergt
ltbranch-namegt Downtown
lt/branch-namegt
ltbalancegt 500
lt/balancegt
lt/accountgt
ltdepositorgt
ltaccount-numbergt A-101
lt/account-numbergt
ltcustomer-namegt Johnson
lt/customer-namegt
lt/depositorgt
lt/bankgt

12
XML Motivation

Data interchange is critical in todays networked
world
Examples
Banking funds transfer
Order processing (especially inter-company
orders)
Scientific data
Chemistry ChemML,
Genetics BSML (Bio-Sequence Markup Language),
Paper flow of information between organizations
is being replaced by electronic flow of
information
Each application area has its own set of
standards for representing information (W3C
maintains ca 30 standards)
XML has become the basis for all new generation
data interchange formats

13
XML Motivation (Cont.)

Each XML based standard defines what are valid
elements, using
XML type specification languages to specify the
syntax
DTD (Document Type Descriptors)
XML Schema
Plus textual descriptions of the semantics
XML allows new tags to be defined as required
However, this may be constrained by DTDs
A wide variety of tools is available for parsing,
browsing and querying XML documents/data

14
Structure of XML Data

Tag label for a section of data
Element section of data beginning with lttagnamegt
and ending with matching lt/tagnamegt
Elements must be properly nested
Proper nesting
ltaccountgt ltbalancegt . lt/balancegt lt/accountgt
Improper nesting
ltaccountgt ltbalancegt . lt/accountgt lt/balancegt
Formally every start tag must have a unique
matching end tag, that is in the context of the
same parent element.
Every document must have a single top-level
element

15
Motivation for Nesting

Nesting of data is useful in data transfer
Example elements representing customer-id,
customer name, and address nested within an order
element
Nesting is not supported, or discouraged, in
relational databases
With multiple orders, customer name and address
are stored redundantly
normalization replaces nested structures in each
order by foreign key into table storing customer
name and address information
Nesting is supported in object-relational
databases and NF2
But nesting is appropriate when transferring data
External application does not have direct access
to data referenced by a foreign key

16
Example of Nested Elements

ltbank-1gt ltcustomergt
ltcustomer-namegt Hayes lt/customer-namegt
ltcustomer-streetgt Main lt/customer-streetgt
ltcustomer-citygt Harrison
lt/customer-citygt
ltaccountgt
ltaccount-numbergt A-102 lt/account-numbergt
ltbranch-namegt Perryridge
lt/branch-namegt
ltbalancegt 400 lt/balancegt
lt/accountgt
ltaccountgt
lt/accountgt
lt/customergt . .
lt/bank-1gt

17
Structure of XML Data (Cont.)

Mixture of text with sub-elements is legal in
XML.
Example
ltaccountgt
This account is seldom used any more.
ltaccount-numbergt A-102lt/account-numbergt
ltbranch-namegt Perryridgelt/branch-namegt
ltbalancegt400 lt/balancegtlt/accountgt
Useful for document markup, but discouraged for
data representation

18
Attributes

Elements can have attributes
ltaccount acct-type checking gt
ltaccount-numbergt A-102
lt/account-numbergt
ltbranch-namegt Perryridge
lt/branch-namegt
ltbalancegt 400 lt/balancegt
lt/accountgt
Attributes are specified by namevalue pairs
inside the starting tag of an element
An element may have several attributes, but each
attribute name can only occur once
ltaccount acct-type checking monthly-fee5gt

19
Attributes Vs. Subelements

Distinction between subelement and attribute
In the context of documents, attributes are part
of markup, while subelement contents are part of
the basic document contents
In the context of data representation, the
difference is unclear and may be confusing
Same information can be represented in two ways
ltaccount account-number A-101gt .
lt/accountgt
ltaccountgt ltaccount-numbergtA-101lt/account-numb
ergt lt/accountgt
Suggestion use attributes for identifiers of
elements, and use subelements for contents

20
More on XML Syntax

Elements without subelements or text content can
be abbreviated by ending the start tag with a /gt
and deleting the end tag
ltaccount numberA-101 branchPerryridge
balance200 /gt
To store string data that may contain tags,
without the tags being interpreted as
subelements, use CDATA as below
lt!CDATAltaccountgt lt/accountgtgt
Here, ltaccountgt and lt/accountgt are treated as
just strings

21
Namespaces

XML data has to be exchanged between
organizations
Same tag name may have different meaning in
different organizations, causing confusion on
exchanged documents
Specifying a unique string as an element name
avoids confusion
Better solution use unique-nameelement-name
Avoid using long unique names all over document
by using XML Namespaces
ltbank XmlnsFBhttp//www.FirstBank.comgt
ltFBbranchgt
ltFBbranchnamegtDowntownlt/FBbranchnamegt
ltFBbranchcitygt Brooklynlt/FBbranchcitygt
lt/FBbranchgt
lt/bankgt

22
XML Document Schema

Database schemas constrain what information can
be stored, and the data types of stored values
XML documents are not required to have an
associated schema
However, schemas are very important for XML data
exchange
Otherwise, a site cannot automatically interpret
data received from another site
Two mechanisms for specifying XML schema
Document Type Definition (DTD)
Widely used
XML Schema
Newer, not yet widely used

23
Document Type Definition (DTD)

The type of an XML document can be specified
using a DTD
DTD constraints structure of XML data
What elements can occur
What attributes can/must an element have
What subelements can/must occur inside each
element, and how many times.
DTD does not constrain data types
All values represented as strings in XML
DTD syntax
lt!ELEMENT element (subelements-specification) gt
lt!ATTLIST element (attributes) gt

24
Element Specification in DTD

Subelements can be specified as
names of elements, or
PCDATA (parsed character data), i.e., character
strings
EMPTY (no subelements) or ANY (anything can be a
subelement)
Example
lt! ELEMENT depositor (customer-name
account-number)gt
lt! ELEMENT customer-name(PCDATA)gt
lt! ELEMENT account-number (PCDATA)gt
Subelement specification may have regular
expressions
lt!ELEMENT bank ( ( account customer
depositor))gt
Notation
- alternatives
- 1 or more occurrences
- 0 or more occurrences

25
Bank DTD

lt!DOCTYPE bank
lt!ELEMENT bank ( ( account customer
depositor))gt
lt!ELEMENT account (account-number branch-name
balance)gt
lt! ELEMENT customer(customer-name
customer-street

customer-city)gt
lt! ELEMENT depositor (customer-name
account-number)gt
lt! ELEMENT account-number (PCDATA)gt
lt! ELEMENT branch-name (PCDATA)gt
lt! ELEMENT balance(PCDATA)gt
lt! ELEMENT customer-name(PCDATA)gt
lt! ELEMENT customer-street(PCDATA)gt
lt! ELEMENT customer-city(PCDATA)gt
gt

26
Attribute Specification in DTD

Attribute specification for each attribute
Name
Type of attribute
CDATA
ID (identifier) or IDREF (ID reference) or IDREFS
(multiple IDREFs)
more on this later
Whether
mandatory (REQUIRED)
has a default value (value),
or neither (IMPLIED)
Examples
lt!ATTLIST account acct-type CDATA checkinggt
lt!ATTLIST customer
customer-id ID REQUIRED
accounts IDREFS REQUIRED gt

27
IDs and IDREFs

An element can have at most one attribute of type
ID
The ID attribute value of each element in an XML
document must be distinct
Thus the ID attribute value is an object
identifier
An attribute of type IDREF must contain the ID
value of an element in the same document
An attribute of type IDREFS contains a set of (0
or more) ID values. Each ID value must contain
the ID value of an element in the same document

28
Bank DTD with Attributes

Bank DTD with ID and IDREF attribute types.
lt!DOCTYPE bank-2
lt!ELEMENT account (branch, balance)gt
lt!ATTLIST account
account-number ID
REQUIRED
owners IDREFS
REQUIREDgt
lt!ELEMENT customer(customer-name,
customer-street,
customer-city)gt
lt!ATTLIST customer
customer-id ID
REQUIRED
accounts IDREFS
REQUIREDgt
declarations for branch, balance,
customer-name,
customer-street and customer-citygt

29
XML data with ID and IDREF attributes

ltbank-2gt
ltaccount account-numberA-401 ownersC100
C102gt
ltbranch-namegt Downtown lt/branch-namegt
ltbranchgt500 lt/balancegt
lt/accountgt
ltcustomer customer-idC100 accountsA-401gt
ltcustomer-namegtJoelt/customer-namegt
ltcustomer-streetgtMonroelt/customer-street
gt
ltcustomer-citygtMadisonlt/customer-citygt
lt/customergt
ltcustomer customer-idC102 accountsA-401
A-402gt
ltcustomer-namegt Marylt/customer-namegt
ltcustomer-streetgt Erinlt/customer-streetgt
ltcustomer-citygt Newark lt/customer-citygt
lt/customergt
lt/bank-2gt

30
Limitations of DTDs

No typing of text elements and attributes
All values are strings, no integers, reals, etc.
Difficult to specify unordered sets of
subelements
Order is usually irrelevant in databases
(A B) allows specification of an unordered
set, but
Cannot ensure that each of A and B occurs only
once
IDs and IDREFs are untyped
The owners attribute of an account may contain a
reference to another account, which is
meaningless
owners attribute should ideally be constrained to
refer to customer elements

31
XML Schema

XML Schema is a more sophisticated schema
language which addresses the drawbacks of DTDs.
Supports
Typing of values
E.g. integer, string, etc
Also, constraints on min/max values
User defined types
Is itself specified in XML syntax, unlike DTDs
More standard representation, but verbose
Is integrated with namespaces
Many more features
List types, uniqueness and foreign key
constraints, inheritance ..
BUT significantly more complicated than DTDs,
not yet widely used.

32
XML Schema Version of Bank DTD

ltxsdschema xmlnsxsdhttp//www.w3.org/2001/XMLSc
hemagt
ltxsdelement namebank typeBankType/gt
ltxsdelement nameaccountgtltxsdcomplexTypegt
ltxsdsequencegt ltxsdelement
nameaccount-number typexsdstring/gt
ltxsdelement namebranch-name
typexsdstring/gt ltxsdelement
namebalance typexsddecimal/gt
lt/xsdsquencegtlt/xsdcomplexTypegt
lt/xsdelementgt
.. definitions of customer and depositor .
ltxsdcomplexType nameBankTypegtltxsdsquencegt
ltxsdelement refaccount minOccurs0
maxOccursunbounded/gt
ltxsdelement refcustomer minOccurs0
maxOccursunbounded/gt
ltxsdelement refdepositor minOccurs0
maxOccursunbounded/gt
lt/xsdsequencegt
lt/xsdcomplexTypegt
lt/xsdschemagt

33
Storage of XML Data

XML data can be stored in
Non-relational data stores
Flat files
Natural for storing XML
But has all problems discussed in Chapter 1 (no
concurrency, no recovery, )
XML database
Database built specifically for storing XML data,
supporting DOM model and declarative querying
Currently no commercial-grade scaleable system
Relational databases
Data must be translated into relational form
Advantage mature database systems
Disadvantages overhead of translating data and
queries

34
Storing XML in Relational Databases

Store as string
E.g. store each top level element as a string
field of a tuple in a database
Use a single relation to store all elements, or
Use a separate relation for each top-level
element type
E.g. account, customer, depositor
Indexing
Store values of subelements/attributes to be
indexed, such as customer-name and account-number
as extra fields of the relation, and build
indices
Oracle 9 supports function indices which use the
result of a function as the key value. Here, the
function should return the value of the required
subelement/attribute
SQL server 2005 same

35
Storing XML in Relational Databases

Store as string
E.g. store each top level element as a string
field of a tuple in a database
Benefits
Can store any XML data even without DTD
As long as there are many top-level elements in a
document, strings are small compared to full
document, allowing faster access to individual
elements.
Drawback Need to parse strings to access values
inside the elements parsing is slow.

36
OEM model

Semi structured and XML databases can be modelled
as graph-problems
Early prototypes directly supported the graph
model as the physical implementation scheme.
Querying the graph model was implemented using
graph traversals
XML without IDREFS can be modelled as trees

37
(No Transcript)
38
Storing XML as Relations (Cont.)

Tree representation model XML data as tree and
store using relations
nodes(id, type, label, value)
child (child-id, parent-id)
Each element/attribute is given a unique
identifier
Type indicates element/attribute
Label specifies the tag name of the element/name
of attribute
Value is the text value of the element/attribute
The relation child notes the parent-child
relationships in the tree
Can add an extra attribute to child to record
ordering of children
Benefit Can store any XML data, even without DTD
Drawbacks
Data is broken up into too many pieces,
increasing space overheads
Even simple queries require a large number of
joins, which can be slow

39
Storing XML in Relations (Cont.)

Map to relations
If DTD of document is known, you can map data to
relations
Bottom-level elements and attributes are mapped
to attributes of relations
A relation is created for each element type
An id attribute to store a unique id for each
element
all element attributes become relation attributes
All subelements that occur only once become
attributes
For text-valued subelements, store the text as
attribute value
For complex subelements, store the id of the
subelement
Subelements that can occur multiple times
represented in a separate table
Similar to handling of multivalued attributes
when converting ER diagrams to tables
Benefits
Efficient storage
Can translate XML queries into SQL, execute
efficiently, and then translate SQL results back
to XML

40
Alternative mappings

Mapping the structure
The Edge approach
The Attribute approach
The Universal Table approach
The Normalized Universal approach
The Dataguide approach
Mapping values
Separate value tables
Inlining
Shredding

41
Edge approach

Use a single Edge table to capture the graph
structure
Edge(source, ordinal, name, flag, target)
Flag value, reference
Keys source, ordinal)
Index source, name,target

42
Attribute approach

Group all attributes with the same name into one
table
Aname(source,ordinal,flag, target)
Key source,ordinal
Indextarget

43
Universal approach

Use the Universal Table, all attributes are
stored as columns
Universal(source, ord-1,flag-1,target-1,
,ord-n,flag-n,target-n)
Key source, index target-i

44
Normalized Universal

Same as Universal, but factor out the repeating
values
Universal(source, ord-1,flag-1,target-1,
,ord-n,flag-n,target-n)
Overflow_n(source,ord, flag,target)
Key source, and source,ord
Index target-i

45
Mapping values

Separate value tables
Use V_type(vid, value) tables, eg. int(vid,val),
str(vid,val),.

46
Mapping values

Inlining
As illustrated in previous mappings, inline the
values in the structure relations

47
Shredding

Try to recognize repeating structures and map
them to separate tables
Handle the remainder through any of the previous
methods

48
Evaluation

Some results reported by Florescu, Kossmann using
a commercial DBMS on documents of 100K objects in
1999
Database storage overhead

49
Evaluation

Some results reported by Florescu, Kossmann using
a commercial DBMS on documents of 100K objects in
1999
Bulk loading

50
Evaluation

Some results reported by Florescu, Kossmann using
a commercial DBMS on documents of 100K objects in
1999
Reconstruction

51
The Data

Semistructured data instance a large graph

52
The indexing problem

The storage problem
Store the graph in a relational DBMS
Develop a new database storage structure
The indexing problem
Input large, irregular data graph
Output index structure for evaluating (regular)
path expressions, e.g.
bib.paper.author.firstname

53
XSet a simple index for XML

Part of the Ninja project at Berkeley
Example XML data

54
XSet a simple index for XML

Each node a hashtable
Each entry list of pointers to data nodes (not
shown)

55
XSet Efficient query evaluation

SELECT X FROM part.name X -yes
SELECT X FROM part.supplier.name X -yes
SELECT X FROM part..subpart.name X -maybe
SELECT X FROM .supplier.name X -maybe

Will gain when index fits in memory
56
Region Algebras

structured text text with tags (like XML)
data sequence of characters c1c2c3
region interval in the text
representation (x,y) cx,cx1, cy
example ltsectiongt lt/sectiongt
region set a set of regions
example all ltsectiongt regions (may be nested)
region algebra operators on region set,
s1 op s2

57
Representation of a region set

Example the ltsubpartgt region set

58
Region algebra some operators

s1 intersect s2 r r? s1, r ?s2
s1 included s2 r r?s1, ?r ? s2, r ? r
s1 including s2 r r? s1, ?r ? s2, r ? r
s1 parent s2 r r? s1, ?r? s2, r is a parent
of r
s1 child s2 r r? s1, ?r ? s2, r is child of
r

Examples ltsubpartgt included ltpartgt ltpartgt
including ltsubpartgt
59
Efficient computation of Region Algebra Operators

Example s1 included s2
s1 (x1,x1'), (x2,x2'),
s2 (y1,y1'), (y2,y2'),
(i.e. assume each consists of disjoint regions)
Algorithm
if xi lt yj then i i 1
if xi' gt yj' then j j 1
otherwise print (xi,xi'), do i i 1
Can do in sub-linear time when one region is very
small

60
From path expressions to region expressions
Region expressions correspond to simple XPath
expressions

part.name name child (part child
root)
part.supplier.name name child (supplier child
(part child root))
.supplier.name name child supplier
part..subpart.name name child (subpart
included (part child root))

61
Storage structures for region algebras

Every node is characterised by an integer pair
(x,y)
This means we have a 2-d space
Any 2-d space data structure can be used
If you use a (pre-order,post-order) numbering you
get triangular filling of 2-d
(to be discussed later)

62
Alternative mappings

Mapping the structure to the relational world
The Edge approach
The Attribute approach
The Universal Table approach
The Normalized Universal approach
The Monet/XML approach
The Dataguide approach
Mapping values
Separate value tables
Inlining
Shredding

63
Dataguide approach

Developed in the context of Lore, Lorel (Stanford
Univ)
Predecessor of the Monet/XML model
Observation
queries in the graph-representation take a
limited form
they are partial walks from the root to an object
of interest
this behaviour was stressed by the query language
Lorel, i.e. an SQL-based query language based on
processing regular expressions

SELECT X FROM (Bib..author).(lastnamefirstname).
Abiteboul X
64
DataGuides

Definition
given a semistructured data instance DB, a
DataGuide for DB is a graph G s.t.
- every path in DB also occurs in G
- every path in G occurs in DB
- every path in G is unique

65
Dataguides

Example

66
DataGuides

Multiple DataGuides for the same data

67
DataGuides

Definition
Let w, w be two words (I.e word queries) and G
a graph
w ?G w if w(G) w(G)
Definition
G is a strong dataguide for a database DB if ?G
is the same as ?DB
Example
- G1 is a strong dataguide
- G2 is not strong
person.project !?DB dept.project
person.project !?G2 dept.project

68
DataGuides

Constructing the strong DataGuide G
Nodes(G)root
Edges(G)?
while changes do
choose s in Nodes(G), a in Labels
add syx in s, (x -a-gty) in Edges(DB) to
Nodes(G)
add (x -a-gty) to Edges(G)
Use hash table for Nodes(G)
This is precisely the powerset automaton
construction.

69
DataGuides

How large are the dataguides ?
if DB is a tree, then size(G) lt size(DB)
why? answer every node is in exactly one extent
of G
here dataguide XSet
How many nodes does the strong dataguide have for
this DB ?

20 nodes (least common multiple of 4 and 5)
Dataguides usually fail on data with cyclic
schemas, like

Write a Comment

User Comments (0)

About PowerShow.com

Chapter 10: XML PowerPoint PPT Presentation