Title: Data Storage and Retrieval in an XML World
1Data Storage and Retrieval in an XML World
2What is XML?
- eXtensible Markup Language
- Meta-markup language developed by W3C to deal
with shortcomings of HTML - HTML grew too complex and unwieldy (over 100 tags
in latest version). - XML allowed for domain specific markup.
- Semantics not document specific but application
specific. - XML is a subset of the Standardized and General
Markup Language (SGML).
3XML and Data
- XML also provided a way to describe structured
data - XML has many advantages as a data storage and
interchange format - Built in support for internationalization via
unicode. - Platform independence.
- Human readable format makes it easier to debug.
- Extensibility - new versions of a format dont
have to break apps based on older versions of the
format. - Large number of off-the-shelf tools for
processing XML documents already exist.
4Structuring XML
- Since XML is a way to describe structured data
there should be a means to specify the structure
of an XML document. - DTDs and Schemas are different mechanisms for
providing a grammar for an XML document. - An XML document that conforms to a DTD or schema
is considered to be valid.
5Document Type Definitions (DTDs)
- DTDs were inherited from SGML.
- DTDs have a different syntax from XML
- They are used to specify legal elements that can
occur in an XML document and the order they occur
in.
6Sample DTD and XML document
- lt!ELEMENT gatech_student (name, age)gt
- lt!ATTLIST gatech_student gtnum CDATAgt
- lt!ELEMENT name (PCDATA)gt
- lt!ELEMENT age (PCDATA)gt
- ltgatech_student gtnum"gt000x"gt
- ltnamegtGeorge Burdelllt/namegt
- ltagegt21lt/agegt
- lt/gatech_studentgt
7DTDs Unsatisfactory
- DTDs proved inadequate due to to a number of
reasons. The main reasons being - They used a different syntax than XML
- Non-existent support for datatypes
- Lack of control over occurrence of elements
8XML Data Reduced (XDR)
- A recommendation for XML schemas was submitted to
the W3C by the Microsoft Corporation as a
potential XML schema standard. - XDR tackled some of the problems of DTDs
- XDR schemas are XML files
- Support for a number of datatypes analogous to
those used in relational database management
systems and popular programming languages - Occurrence of elements is controllable
9Sample XDR and XML document
- ltSchema name"myschema" xmlns"urnschemas-microso
ft-comxml-data" xmlnsdt"urnschemas-microsoft-c
omdatatypes"gt - ltElementType name"age" dttype"ui1" /gt
- ltElementType name"name" dttype"string" /gt
- ltAttributeType name"gtnum" dttype"string"
/gt - ltElementType name"gatech_student"
order"seq"gt - ltelement type"name" minOccurs"1"
maxOccurs"1"/gt - ltelement type"age" minOccurs"1"
maxOccurs"1"/gt - ltattribute type"gtnum" /gt
- lt/ElementTypegt
- lt/Schemagt
- ltgatech_student gtnum"gt000x"gt
- ltnamegtGeorge Burdelllt/namegt
- ltagegt21lt/agegt
- lt/gatech_studentgt
10XML Schema Definitions (XSD)
- W3C standard
- XSD outshines XDR in the following ways
- Supports more datatypes
- Provides the ability to create custom data types
- Supports object oriented programming concepts
like inheritance and polymorphism.
11Sample XSD and XML document
- ltschema xmlns"http//www.w3.org/2001/XMLSchema"
gt - ltelement name"gatech_student"gt
- ltcomplexTypegt
- ltsequencegt
- ltelement name"name" type"string"/gt
- ltelement name"age" type"unsignedInt"/gt
- lt/sequencegt
- ltattribute name"gtnum"gt
- ltsimpleTypegt
- ltrestriction base"string"gt ltpattern
value"gt\d3A-Za-z1"/gt lt/restrictiongt - lt/simpleTypegt lt/attributegt lt/complexTypegt
lt/elementgt lt/schemagt - ltgatech_student gtnum"gt000x"gt
- ltnamegtGeorge Burdelllt/namegt
- ltagegt21lt/agegt
- lt/gatech_studentgt
12Querying XML
- It is sometimes necessary to extract subsets of
the data stored within an XML document. - A number of languages have been created for
querying XML documents including Lorel, Quilt,
UnQL, Xduce, XML-QL, Xpath, XQL, Xquery and YaTL.
- XPath is a W3C recommendation and XQuery is a W3C
working draft.
13XML Path Language (XPath)
- XPath is a language for addressing parts of an
XML document using a syntax that resembles
hierarchical paths used to address parts of a
filesystem or URL - Also provides functions for interacting with
selected data - Functions for the accessing information about
document nodes - Functions for the manipulating of strings,
numbers and booleans. - Developers can add functions to the XPath
library.
14Sample XPath queries
- /gatech_student/name
- Selects all name elements that are children of
the root element gatech_student. - //age
- Selects all age elements in the document.
- /gatech_student/
- Selects all child elements of the root element
gatech_student. - /gatech_student_at_gtnum
- Selects all gtnum attributes of the
gatech_student elements in the document. - //name()'age'
- Selects all elements that are named "age".
- /gatech_student/age/ancestor
- Selects all ancestors of all the age elements
that are children of the gatech_student element
(which should select the gatech_student element).
15XML Query Language (XQuery)
- XQuery is an attempt to provide a query language
that provides the same breadth of functionality
and underlying formalism as SQL does for
relational databases. - XQuery is a functional language where each query
is an expression. - XQuery has a sophisticated type system based on
XML schema datatypes and supports the
manipulation of the document nodes unlike XPath. - W3C is also working towards creating an alternate
version of XQuery that has the same semantics but
uses XML based syntax instead called XQueryX.
16XQuery Expressions
- path expressions
- element constructors
- FLWR expressions
- expressions involving operators and functions
- conditional expressions
- quantified expressions
- expressions that test or modify datatypes
17XML Usage Models I (Review)
- Document-centric
- Semi structured documents
- Irregular content
- Human creation and/or consumption is primary
aspect - Sample XHTML document
- lthtml xmlns "http//www.w3.org/1999/xhtml"gt
- ltheadgt
- lttitlegtSample Web Pagelt/titlegt
- lt/headgt
- ltbodygt
- ltpgt All XHTML documents must be
well-formed and valid. lt/pgt - ltimg src"http//www.example.com
/sample.jpg" height "50" width "25"/gt - ltbr /gt ltbr /gt
- lt/bodygt lt/htmlgt
18XML Usage Models II (Review)
- Data-centric
- Structured
- Appears in a regular order
- Mechanical creation and consumption.
- XML usage is incidental
- Sample SOAP message
- ltSOAP-ENVEnvelope xmlnsSOAP-ENV"http//schemas
.xmlsoap.org/soap/envelope/" SOAP-ENVencodingStyl
e"http//schemas.xmlsoap.org/soap/encoding/"gt - ltSOAP-ENVBodygt
- ltmGetLastTradePrice
xmlnsm"Some-URI"gt - ltsymbolgtDISlt/symbolgt
- lt/mGetLastTradePricegt
- lt/SOAP-ENVBodygt
- lt/SOAP-ENVEnvelopegt
19XML Storage in a data centric model
- Stored in database (typically an RDBMS)
- One may want to extract data from a database as
XML, store XML into a database or both - Primary choices for retrieving data
- Middleware components
- XML-enabled databases
20Middleware components
- Could be full blown application or an API.
- Different strategies used
- ADO same API for XML SQL access.
- jxTransformer custom queries used to specify
how the results of a SQL query should be
converted to XML. - DatabaseDOM - user created template file
contains the SQL to XML mappings for the SQL
query results. - DB2XML - default mapping of SQL results to XML
data exists that cannot be altered by the user
21Sample jxTransformer Query
- SQL
- SELECT EmployeeID, FirstName, LastName, Title,
HireDate, Salary - FROM Employees e WHERE HireDate gt d
2000-01-01 - jxTransformer custom query
- xml_document(
- xml_element(result,
- SELECT
- xml_element(Employees_Info
- xml_attribute(ID,
e.EmployeeID), - xml_element(name,
- xml_element(first,
e.FirstName), - xml_element(last,
e.LastName) ), - xml_element(title,
e.Title), - xml_element(hiredate,
e.HireDate), - xml_element(salary, e.Salary) )
- FROM EMPLOYEES e WHERE e.HireDate gt d
2000-01-01 ) )
22Sample DatabaseDOM template
- ltXMLDATABASEMAPgt
- ltTEMPLATEgt
- ltEMPLOYEE_LISTgt
- ltEMPLOYEE NO"EMPNO" GENDER"SEX"gt
- ltNAMEgt
- ltFIRSTgtFIRSTNMElt/FIRSTgt
- ltMIDDLE_INITIALgtMIDINITlt/MIDDLE_INITIA
Lgt - ltLASTgtLASTNAMElt/LASTgt lt/NAMEgt
- lt/EMPLOYEEgt
- lt/EMPLOYEE_LISTgt
- lt/TEMPLATEgt
- ltDATABASEgt
- ltMAXRETURNROWSgt1000lt/MAXRETURNROWSgt
- ltJDBCgt
- ltURLgtjdbcdb2samplelt/URLgt
- ltDRIVERgtCOM.ibm.db2.jdbc.app.DB2Driverlt/DRIVER
gt - lt/JDBCgt
- ltUSERIDgtpaullt/USERIDgt ltPASSWORDgtXXXXXXlt/PASSWOR
Dgt - ltTABLEgtEMPLOYEElt/TABLEgt
23XML-enabled databases
- An XML-enabled database understands how to
convert data to XML and back - Big 3 RDBMS vendors all have different XML
strategies. -
24XML and DB2
- Uses DB2 Extender to add XML support
- Can store an entire XML document and its DTD as a
user-defined column of an xml type - XMLCLOB
- XMLVARCHAR
- XMLFile
- Option to shred the document into multiple tables
and columns also available. - XML data can be queried with syntax that is
compliant with W3C XPath recommendation. - Updating of XML data is also possible using
stored procedures.
25SAMPLE DB2 XML EXTENDER TABLE AND QUERY
- TABLE
- TABLE mail_user user_name
- VARCHAR(20) NOT NULL PRIMARY KEY
- passwd VARCHAR(10)
- mailbox XMLVARCHAR
- QUERY
- SELECT user_name FROM mail_user WHERE
extractVarchar(mailbox,"/Mailbox/Inbox/Email/Subje
ct") LIKE "XML"
26XML and Oracle 9i
- XML documents can be stored as whole documents in
user-defined columns of type XMLType or CLOB/BLOB - Shredded documents can be reconstituted using the
XML SQL Utility. - Querying XML possible via two means
- Oracle Text (CONTAINS WITHIN clauses in SQL)
for BLOB or VARCHAR2 columns - XMLType columns can be queried via extract() and
existsNode() functions which use XPath. - Relational views of XML data possible.
27SAMPLE ORACLE 9i TABLE AND QUERY
- TABLE
- CREATE TABLE mail_user(
- user_name VARCHAR2(20),
- passwd VARCHAR2(10),
- mailbox SYS.XMLTYPE )
- QUERY
- SELECT user_name FROM mail_user m WHERE
m.mailbox.extract('/Mailbox/Inbox/Email/Subject/te
xt()').getStringVal() like 'XML'
28XML and SQL Server
- XML can be retrieved from relational rows using
FOR XML clause in SQL - RAW
- AUTO
- EXPLICIT
- XML views of relational data possible.
- Specified using XSD files for mapping
- Queried using XPath
29XML Storage in a document centric model
- Stored in a content management system
- A content management system typically consists of
a repository that stores a variety of XML
documents, an editor and an engine that provides
one or more of the following features - version, revison and access control
- ability to reuse documents in different formats
- collaboration
- web publishing facilities
- support for a variety of text editors (e.g.
Microsoft Word, Adobe Framemaker, etc) - indexing and search capabilities
30XML Storage in a Hybrid Model
- Where both data-centric and document-centric
models are in use best choice is native XML
database. - A native XML database is a database that has an
XML document as its fundamental (logical) unit of
storage and defines a (logical) model for an XML
document, as opposed to the data in that
document, and stores and retrieves documents
according to that model. - At a minimum, the model must include elements,
attributes, PCDATA, and document order.
31Tamino a commercial native XML database
- Created by Software AG.
- Features
- Storage retrieval of XML documents
- Storage retrieval relational data
- Interfacing with external applications and data
sources. - Transactional (ACID properties)
- Querying via X-Query (based on XPath NOT XQuery)
- Indexing
- GUI tools
- Web based administration
- Schema editor
- Interactive query interface
32Tamino Schemas
- Schemas in Tamino are DTD-based and are mainly
used as a way to describe how the XML data should
be indexed - Document storage choices
- Specify a pre-existing DTD which is then
converted to a Tamino schema, - store a well-formed XML document without a schema
which means that default indexing ensues - Create a schema from scratch for the XML document
being stored - Schemas are also used as a way to specify data
types which is important for type based
operations during querying (e.g. numeric ops)
33Tamino and SQL
- Tamino ships with a SQL engine
- Schemas can be used to creating mappings from SQL
to XML - XML can be retrieved from RDBMS sources either
internal (SQL engine) or external - Schemas can also be used to represent joins
across different document types (which could mean
different data sources)
34Tamino programming support
- APIs available for accessing XML store from both
Java and Microsoft's Jscript - C programmers can interact with the SQL engine
using the SQL precompiler - ODBC, OLE DB and JDBC clients can communicate
with SQL Engine - X-Tensions framework allows developers to extend
the functionality of Tamino by using C COM
objects or Java objects
35dbXML an Open Source native XML database
- Created by the dbXML group.
- Lightweight and modular
- Can easily be embedded in applications
- XML Documents arranged in hierarchical
filesystem-like manner. - Querying via XPath.
- Indexing support but no transactions or schemas.
- Command line administration tools
36dbXML programming support
- Written in Java and has implementation of XMLDB
initatives XML Database API. - Exposes CORBA API to enable access from any
language with CORBA bindings.
37Conclusion
- Paper on this topic
- http//www.25hoursaday.com/StoringAndQueryingXML.h
tml - Missed Opportunities
- eXcelon
- Questions???