Title: XML and Databases
1XML and Databases
2XML Motivation
3XML Motivation
- Huge amounts of unstructured data on the web
  - HTML documents: no structure information, only format instructions (presentation)
- Integration of data from different sources
  - Structural differences
- Closely related to semistructured data
4Semistructured Data
- Integration of heterogeneous sources
- Data sources with non-rigid structures
  - Biological data
  - Web data
- Need for more structural information than plain text, but fewer constraints on structure than in relational data
5Characteristics of Semistructured Data
- Missing or additional tuples
- Multiple attributes
- Different types in different objects
- Heterogeneous collection
- Self-describing, irregular data with no a priori structure
6HTML Document Example
    <h1> Bibliography </h1>
    <p> <i> Foundations of Databases </i>
        Abiteboul, Hull, Vianu
        <br> Addison Wesley, 1995
    <p> <i> Data on the Web </i>
        Abiteboul, Buneman, Suciu
        <br> Morgan Kaufmann, 1999
Type of information in each book entry: Title, Authors, Year
7The Idea Behind XML
- Easily support information exchange between applications / computers
- Reuse what worked in HTML
  - Human readable
  - Standard
  - Easy to generate and read
- But allow arbitrary markup
- Uniform language for semistructured data
- Data Management
8XML
9XML
- eXtensible Markup Language
- Universal standard for documents and data
- Defined by W3C
- Set of emerging technologies
  - XLink, XPointer, XSchema, DOM, SAX, XPath, XQuery, ...
10XML
- XML gives a syntax, not a semantics
- XML defines the structure of a document, not how it is processed
- Separate structural information from format instructions
11XML Example
    <bibliography>
      <book> <title> Foundations </title>
        <author> Abiteboul </author>
        <author> Hull </author>
        <author> Vianu </author>
        <publisher> Addison Wesley </publisher>
        <year> 1995 </year>
      </book>
      ...
    </bibliography>
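Because the tags name the information they enclose, a program can pull out titles, authors and years directly. A minimal sketch, assuming Python's standard `xml.etree.ElementTree` module and a one-book version of the example (the function name `book_records` is made up for illustration):

```python
import xml.etree.ElementTree as ET

# One-book version of the slide's example document.
BIB_XML = """<bibliography>
  <book>
    <title>Foundations</title>
    <author>Abiteboul</author>
    <author>Hull</author>
    <author>Vianu</author>
    <publisher>Addison Wesley</publisher>
    <year>1995</year>
  </book>
</bibliography>"""


def book_records(xml_text):
    """Return one dict per <book>, keyed by the element names in the document."""
    root = ET.fromstring(xml_text)
    records = []
    for book in root.findall("book"):
        records.append({
            "title": book.findtext("title"),
            "authors": [a.text for a in book.findall("author")],
            "publisher": book.findtext("publisher"),
            "year": int(book.findtext("year")),
        })
    return records


records = book_records(BIB_XML)
print(records)
```

The same extraction from the HTML version would have to guess that italic text is a title and that the number after the comma is a year.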
12XML Terminology
- Tags: book, title, author, ...
- Start tag: <book>
- End tag: </book>
- Elements are nested
- Empty element
  - <reviews></reviews> can be abbreviated as <reviews/>
- XML document: single root element
- An XML document is well formed if its tags match
13XML Attributes
- Attributes are <name, value> pairs that characterize an element.
    <book price = "55" currency = "USD">
      <title> Foundations of Databases </title>
      <author> Abiteboul </author>
      ...
      <year> 1995 </year>
    </book>
- Attributes can define an oid, but they are just syntax
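Attributes reach a program as plain strings, and any typing is up to the application. A small sketch with Python's standard `xml.etree.ElementTree`, using the book element from this slide (the `isbn` lookup is a hypothetical attribute, added only to show what a missing attribute returns):

```python
import xml.etree.ElementTree as ET

book = ET.fromstring(
    '<book price="55" currency="USD">'
    "<title>Foundations of Databases</title>"
    "<author>Abiteboul</author>"
    "<year>1995</year>"
    "</book>"
)

# Attributes are exposed as a name -> value mapping of strings.
attrs = dict(book.attrib)

# XML itself attaches no type to "55"; the application converts it.
price = float(book.get("price"))

# A missing attribute yields None ("isbn" is a hypothetical name).
missing = book.get("isbn")

print(attrs, price, missing)
```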
14More XML
- Text can be CDATA or PCDATA
- Entity references: &amp;, &gt;, ...
- Processing instructions: <?blink?>
- Comments: <!-- comment text -->
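Entity references exist because characters such as `&` and `<` would otherwise be read as markup. A brief sketch, assuming Python's standard `xml.sax.saxutils` and `xml.etree.ElementTree` modules; the sample strings are invented for illustration:

```python
from xml.sax.saxutils import escape, unescape
import xml.etree.ElementTree as ET

raw = "AT&T <research> & more"

# escape() replaces &, < and > with the entity references &amp; &lt; &gt;
escaped = escape(raw)

# A parser turns the references back into the original characters.
parsed_text = ET.fromstring("<note>" + escaped + "</note>").text

# Comments and processing instructions are legal in the document, but the
# default ElementTree parser does not keep them in the tree.
doc = ET.fromstring("<a><!-- comment text --><?blink?><b>x &gt; y</b></a>")
children = [child.tag for child in doc]

print(escaped, parsed_text, children, doc.find("b").text)
```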
15Well Formed XML Documents
- Elements must be properly nested
    <book><title> Foundations of Databases </title></book>
- But not
    <book><title> Foundations of Databases </book></title>
- There must be a unique root element
- Elements can be of
  - element content
  - or mixed content
    <title>This is <b>Mixed</b>Content</title>
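These rules are what a non-validating parser enforces. A minimal sketch, assuming Python's standard `xml.etree.ElementTree`, which raises `ParseError` on input that is not well formed (the helper name `is_well_formed` is made up here):

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


nested_ok = "<book><title> Foundations of Databases </title></book>"
nested_bad = "<book><title> Foundations of Databases </book></title>"
two_roots = "<book/><book/>"

print(is_well_formed(nested_ok), is_well_formed(nested_bad), is_well_formed(two_roots))

# Mixed content: text and child elements interleaved inside one element.
title = ET.fromstring("<title>This is <b>Mixed</b>Content</title>")
pieces = (title.text, title[0].text, title[0].tail)
print(pieces)
```

ElementTree represents mixed content through `text` (before the first child) and `tail` (after a child's end tag).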
16XML Potential
- Flexible enough to represent anything
- Stock market, DNA, Music, Chemicals
- Weather information
- Wireless network configuration
- Enables easy information exchange
- Between companies
- Within companies
- Standard: everybody uses the same technology
17XML Limitations
- XML is only a syntax for documents
- We need tools!
- Editors and parsers
- Programming APIs (for Java, C, etc.)
- Languages to manipulate XML (how many books?)
- Schemas (What is a book like?)
- Storage (What if you have a lot of XML?)
- Transfer protocols (How do you exchange it?)
- What about XML in Chinese?
- How can XML fit into my phone?
- Query processing?
18XML Schema Language
19DTDs: Document Type Definitions
- Similar to a schema
- Grammar describing constraints on document structure and content
- XML documents can be validated against a DTD
    <!ELEMENT Book (title, author)>
    <!ELEMENT title (#PCDATA)>
    <!ELEMENT author (name, address, age?)>
    <!ATTLIST Book id ID #REQUIRED>
    <!ATTLIST Book pub IDREF #IMPLIED>
20Shortcomings of DTDs
- Useful for documents, but not so good for data
- No support for structural re-use
  - Object-oriented-like structures aren't supported
- No support for data types
  - Can't do data validation
- Can have a single key item (ID), but
  - No support for multi-attribute keys
  - No support for foreign keys (references to other keys)
  - No constraints on IDREFs (e.g., reference only a Section)
21XSchema
- In XML format
- Includes primitive data types (integers, strings, dates, ...)
- Supports value-based constraints (integers > 100)
- Inheritance
- Foreign keys
- ...
22Example of XSchema
    <schema version="1.0" xmlns="http://www.w3.org/1999/XMLSchema">
      <element name="author" type="string" />
      <element name="date" type="date" />
      <element name="abstract">
        <type>
          ...
        </type>
      </element>
      <element name="paper">
        <type>
          <attribute name="keywords" type="string"/>
          <element ref="author" minOccurs="0" maxOccurs="*" />
          <element ref="date" />
          <element ref="abstract" minOccurs="0" maxOccurs="1" />
          <element ref="body" />
        </type>
      </element>
    </schema>
23XML Storage
24Storing XML Data
- Different approaches
  - Storing as text
  - Using an RDBMS
  - Using a native system tailored for XML (NATIX, Tamino, Ipedo, etc.)
- Performance of the various approaches depends on your application
25Storing XML as Text
- Simple
- Easy to compress
- No updates
- Need to parse the document every time it is needed
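The trade-off can be sketched as follows, assuming Python's standard `zlib` and `xml.etree.ElementTree` modules and a synthetic, highly repetitive document (real compression ratios depend on the data). The stored form is small, but even a simple count requires decompressing and parsing the whole text again:

```python
import zlib
import xml.etree.ElementTree as ET

# Synthetic document: 200 copies of one small book element.
doc = (
    "<bibliography>"
    + "<book><title>Foundations</title><year>1995</year></book>" * 200
    + "</bibliography>"
)

raw = doc.encode("utf-8")
stored = zlib.compress(raw)  # what would be written to disk


def count_books(blob):
    """Answer a query on the stored text: decompress, parse, then count."""
    text = zlib.decompress(blob).decode("utf-8")
    return len(ET.fromstring(text).findall("book"))


print(len(raw), len(stored), count_books(stored))
```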
26Storing XML in RDBMS
- Uses existing RDBMS techniques
- Costly in space, takes time to reconstruct the original document
- Example techniques
  - Schema with 2 relations: tag and value
  - Schema with n relations: 1 per element name
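One possible reading of the two-relation technique is sketched below with Python's standard `sqlite3` and `xml.etree.ElementTree`. The table and column names (`xml_tag`, `xml_value`, `node_id`, `parent_id`, `content`) are assumptions made for this illustration, not a layout given in the slides:

```python
import sqlite3
import xml.etree.ElementTree as ET

BIB_XML = (
    "<bibliography><book><title>Foundations</title>"
    "<author>Abiteboul</author><author>Hull</author><author>Vianu</author>"
    "<publisher>Addison Wesley</publisher><year>1995</year></book></bibliography>"
)


def shred(xml_text, conn):
    """Store every element in xml_tag and every text value in xml_value."""
    conn.execute(
        "CREATE TABLE xml_tag (node_id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT)"
    )
    conn.execute("CREATE TABLE xml_value (node_id INTEGER, content TEXT)")
    next_id = 0
    stack = [(ET.fromstring(xml_text), None)]
    while stack:
        elem, parent_id = stack.pop()
        next_id += 1
        node_id = next_id  # ids follow document order
        conn.execute("INSERT INTO xml_tag VALUES (?, ?, ?)", (node_id, parent_id, elem.tag))
        text = (elem.text or "").strip()
        if text:
            conn.execute("INSERT INTO xml_value VALUES (?, ?)", (node_id, text))
        # Push children in reverse so they are popped in document order.
        for child in reversed(list(elem)):
            stack.append((child, node_id))


conn = sqlite3.connect(":memory:")
shred(BIB_XML, conn)

# //book/title becomes a join between a node, its parent, and its value.
titles = [
    row[0]
    for row in conn.execute(
        "SELECT v.content FROM xml_tag t "
        "JOIN xml_tag p ON t.parent_id = p.node_id "
        "JOIN xml_value v ON v.node_id = t.node_id "
        "WHERE t.name = 'title' AND p.name = 'book'"
    )
]
node_count = conn.execute("SELECT COUNT(*) FROM xml_tag").fetchone()[0]
print(titles, node_count)
```

Even this one-step path needs a self-join, which hints at why reconstructing a whole document from such tables takes time.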
27Accessing and Querying XML Data
28XML as a Tree: DOM
- DOM: Document Object Model
- Class hierarchy serving as an API to XML trees
- Methods of those classes can be used to manipulate XML (e.g., accessing a node's children or name)
- Can be used from Java, C, etc. to develop XML applications.
- Each node has an identity (i.e., a unique identifier) in the whole document
29XML as a DOM Tree
- Class hierarchy (node, element, attribute)
    bibliography
      book
        title: Foundations of Databases
        author: Abiteboul
        author: Hull
        author: Vianu
        publisher: Addison Wesley
        year: 1995
      book
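The tree view can be tried out with Python's standard `xml.dom.minidom`, which implements a subset of the W3C DOM. This is a sketch on a one-book document; the added `reviews` element is only there to show a manipulation:

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<bibliography><book>"
    "<title>Foundations of Databases</title>"
    "<author>Abiteboul</author><author>Hull</author><author>Vianu</author>"
    "<publisher>Addison Wesley</publisher><year>1995</year>"
    "</book></bibliography>"
)

root = doc.documentElement  # the <bibliography> element node
book = root.firstChild      # first child node: <book>

# Navigate the tree: names of the children of <book>.
child_names = [node.nodeName for node in book.childNodes]

# Text is held in separate text nodes below each element.
authors = [a.firstChild.data for a in doc.getElementsByTagName("author")]

# Manipulate the tree: append a new, empty element to <book>.
book.appendChild(doc.createElement("reviews"))

print(child_names, authors)
print(root.toxml())
```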
30XML as a Stream: SAX
- An XML document is seen as a stream of events, e.g.,
  - Opening tag book
  - Opening tag title
  - Text "Foundations of databases"
  - Closing tag title
  - Opening tag author
  - Etc.
- SAX allows you to associate actions with those events to build applications
- Very efficient, since the events correspond to what happens during parsing, but not always sufficient.
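The event stream can be observed with Python's standard `xml.sax` module. In this sketch the handler class `EventLogger` is a made-up name, and its "action" is simply to record each event:

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Records the parsing events instead of building a tree."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("open", name))

    def characters(self, content):
        # A parser may deliver text in several chunks; this short input
        # arrives in one piece. Whitespace-only text is skipped.
        if content.strip():
            self.events.append(("text", content.strip()))

    def endElement(self, name):
        self.events.append(("close", name))


handler = EventLogger()
xml.sax.parseString(
    b"<book><title>Foundations of databases</title><author>Abiteboul</author></book>",
    handler,
)
for event in handler.events:
    print(event)
```

Nothing is kept in memory beyond what the handler chooses to store, which is the source of the efficiency. It is also the limitation: a query that must look back at earlier parts of the document has to buffer them itself.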
31XPath
- Language for navigating in an XML document (seen as a tree)
- One root node
- Types of nodes: root, element, text, attribute, comment, ...
- An XPath expression defines navigation in the tree following axes: child, descendant, parent, ancestor, ...
32XPath Examples
- Find all the titles of all the books
    //book/title
- Find the title of all books written by Charles Dickens
    //book[author="Charles Dickens"]/title
- Find the title of the first section in the second chapter in Great Expectations
    //book[title="Great Expectations"]/chapter[2]/section[1]/title
- Find the title of all sections that come after the second chapter in Great Expectations
    //book[title="Great Expectations"]/chapter[2]/following::section/title
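The queries above can be tried against a small document. This sketch assumes Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath: child steps, `//`, and simple predicates work, but the `following` axis does not, so the last query is emulated in plain Python. The sample document and its section titles are invented for illustration:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<library>"
    "<book><title>Great Expectations</title><author>Charles Dickens</author>"
    "<chapter><section><title>c1s1</title></section></chapter>"
    "<chapter><section><title>c2s1</title></section>"
    "<section><title>c2s2</title></section></chapter>"
    "<chapter><section><title>c3s1</title></section></chapter>"
    "</book>"
    "<book><title>Data on the Web</title><author>Abiteboul</author></book>"
    "</library>"
)

# //book/title (ElementTree paths are relative, hence the leading ".")
all_titles = [t.text for t in root.findall(".//book/title")]

# //book[author="Charles Dickens"]/title
dickens = [t.text for t in root.findall(".//book[author='Charles Dickens']/title")]

# //book[title="Great Expectations"]/chapter[2]/section[1]/title
first_section = [
    t.text
    for t in root.findall(".//book[title='Great Expectations']/chapter[2]/section[1]/title")
]

# Emulation of chapter[2]/following::section/title for this document shape:
# the sections found in the chapters after the second one.
book = root.find(".//book[title='Great Expectations']")
later_sections = [
    s.findtext("title")
    for chapter in book.findall("chapter")[2:]
    for s in chapter.iter("section")
]

print(all_titles, dickens, first_section, later_sections)
```

The emulation covers only sections inside later chapters of the same book, whereas the real `following` axis ranges over everything after the chapter in document order. A full XPath engine would be needed for the general case.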
33Querying XML Data
- Need for a language to query XML data
- Should yield XML output
- Should support standard query operations
- No schema required
- Several proposals for an XML query language: XML-QL, XQuery, ...
34XQuery
- XPath included in XQuery
- FLWR expressions: for, let, where, return
    FOR $x IN document("bib.xml")/bib/book
    WHERE $x/year > 1995
    RETURN $x/title
  Result:
    <title> abc </title>
    <title> def </title>
    <title> ghi </title>
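The meaning of the query can be mimicked in Python with a list comprehension: the iteration plays the role of FOR, the condition that of WHERE, and the produced expression that of RETURN. This is only an analogy, not an XQuery engine, and the inline document stands in for `bib.xml`, whose contents are not shown in the slides:

```python
import xml.etree.ElementTree as ET

# Stand-in for bib.xml (invented contents).
BIB = (
    "<bib>"
    "<book><title>abc</title><year>1999</year></book>"
    "<book><title>old</title><year>1990</year></book>"
    "<book><title>def</title><year>2001</year></book>"
    "</bib>"
)

bib = ET.fromstring(BIB)

result = [
    ET.tostring(x.find("title"), encoding="unicode")  # RETURN $x/title
    for x in bib.findall("book")                      # FOR $x IN .../bib/book
    if int(x.findtext("year")) > 1995                 # WHERE $x/year > 1995
]

print(result)
```

As required of an XML query language, the output is itself XML: a sequence of `title` elements.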
35How to process XML Queries?
- Use indexes
- Need to identify nodes
- Need to know relations between nodes
- Labeling schemes
  - Dewey encoding
  - Prefix-Postfix encoding
- TwigStack
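As an illustration of a labeling scheme, the sketch below assigns Dewey labels with Python's standard `xml.etree.ElementTree`: each node's label is its parent's label extended with the node's position among its siblings. The function names are made up here. With such labels, parent and ancestor relations can be decided by comparing labels, without touching the document:

```python
import xml.etree.ElementTree as ET


def dewey_labels(root):
    """Return (label, tag) pairs in document order; labels are tuples of ints."""
    labels = []

    def visit(elem, label):
        labels.append((label, elem.tag))
        for position, child in enumerate(elem, start=1):
            visit(child, label + (position,))

    visit(root, (1,))
    return labels


def is_ancestor(a, b):
    """a is a proper ancestor of b iff a is a proper prefix of b."""
    return len(a) < len(b) and b[: len(a)] == a


def is_parent(a, b):
    """a is the parent of b iff b is a extended by exactly one component."""
    return len(b) == len(a) + 1 and b[: len(a)] == a


root = ET.fromstring(
    "<bibliography>"
    "<book><title/><author/><author/></book>"
    "<book><title/></book>"
    "</bibliography>"
)
labels = dewey_labels(root)
for label, tag in labels:
    print(".".join(map(str, label)), tag)
```

An index that stores these labels per element name lets a query processor join, for example, all `book` labels with all `author` labels using only prefix tests.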
36Web Services
37What are Web Services
- Programming interfaces for application-to-application communication on the Web
  - Platform-independent
  - Language-independent
  - Object model-independent
- Possibility to activate methods on remote web servers (RPC)
- 2 main applications
  - E-commerce
  - Access to remote data
38XML and Web Services
- Exchange of information between applications is in XML
  - Input and result
- Use of SOAP to generate messages
- Descriptions of the web service functionality are given in XML, according to the WSDL schema
- Web Services standards use XML heavily
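To make the role of XML concrete, here is a sketch of what a request message could look like, built with Python's standard `xml.etree.ElementTree`. The envelope namespace is the SOAP 1.1 one. The service namespace, the `GetBookPrice` operation and its `title` parameter are hypothetical, invented only for this example:

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope
SERVICE_NS = "http://example.org/bookstore"            # hypothetical service


def build_request(title):
    """Wrap a (hypothetical) GetBookPrice call in a SOAP envelope."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}GetBookPrice")
    ET.SubElement(call, f"{{{SERVICE_NS}}}title").text = title
    return ET.tostring(envelope, encoding="unicode")


message = build_request("Foundations of Databases")
print(message)

# The receiver parses the same XML to recover the method name and argument.
parsed = ET.fromstring(message)
argument = parsed.find(
    f"{{{SOAP_NS}}}Body/{{{SERVICE_NS}}}GetBookPrice/{{{SERVICE_NS}}}title"
).text
```

In practice the message would be sent over HTTP and the available operations would be read from the service's WSDL description; both steps are left out of this sketch.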
39Conclusions
- XML: a very active area
- Many research directions
- Many applications
- Standards not finalized yet
- XQuery
- XML Schema
- Web Services
40Some Important XML Standards
- XSL/XSLT: presentation and transformation standards
- RDF: resource description framework (meta-info such as ratings, categorizations, etc.)
- XPath/XPointer/XLink: standard for linking to documents and elements within
- Namespaces: for resolving name clashes
- DOM: Document Object Model for manipulating XML documents
- SAX: Simple API for XML parsing