Title: Managing XML and Semistructured Data
1. Managing XML and Semistructured Data
Prof. Dan Suciu
Spring 2001
2. In this lecture
- XML syntax
- XML Query data model
- Comparison of XML with semistructured data
- Papers:
  - XML, Java, and the Future of the Web, by Jon Bosak, Sun Microsystems.
  - W3C XML Query Data Model, by Mary Fernandez and Jonathan Robie.
3. XML
- a W3C standard to complement HTML
- origins: structured text, SGML
- motivation:
  - HTML describes presentation
  - XML describes content
- http://www.w3.org/TR/2000/REC-xml-20001006 (XML 1.0, second edition, 10/2000)
4. From HTML to XML
HTML describes the presentation
5. HTML
    <h1> Bibliography </h1>
    <p> <i> Foundations of Databases </i>
        Abiteboul, Hull, Vianu
        <br> Addison Wesley, 1995
    <p> <i> Data on the Web </i>
        Abiteboul, Buneman, Suciu
        <br> Morgan Kaufmann, 1999
6. XML
    <bibliography>
        <book> <title> Foundations </title>
            <author> Abiteboul </author>
            <author> Hull </author>
            <author> Vianu </author>
            <publisher> Addison Wesley </publisher>
            <year> 1995 </year>
        </book>
        ...
    </bibliography>
XML describes the content
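To make "XML describes the content" concrete, here is a minimal sketch. It assumes Python's standard `xml.etree.ElementTree` module and a hypothetical helper `book_summaries`; neither appears in the lecture. It reads the bibliography above (with the elided entries left out) and recovers titles, authors, and years by tag name, something the HTML version gives no direct handle for.

```python
import xml.etree.ElementTree as ET

# The bibliography from the slide; the elided entries are simply omitted.
BIB = """
<bibliography>
  <book>
    <title> Foundations </title>
    <author> Abiteboul </author>
    <author> Hull </author>
    <author> Vianu </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
</bibliography>
"""

def book_summaries(xml_text):
    """Return (title, [authors], year) for every <book> element."""
    root = ET.fromstring(xml_text)
    summaries = []
    for book in root.findall("book"):
        title = book.findtext("title").strip()
        authors = [a.text.strip() for a in book.findall("author")]
        year = int(book.findtext("year"))
        summaries.append((title, authors, year))
    return summaries

summaries = book_summaries(BIB)
print(summaries)
```

The tags carry the meaning here: the code asks for "author" and "year" rather than guessing from italics and line breaks.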
7. XML Terminology
- tags: book, title, author, ...
- start tag: <book>, end tag: </book>
- elements: <book>...</book>, <author>...</author>
- elements are nested
- empty element: <red></red>, abbreviated <red/>
- an XML document has a single root element
A well-formed XML document is one whose tags match.
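The well-formedness rules above can be checked mechanically. The following is a minimal sketch, assuming Python's standard `xml.etree.ElementTree` parser; the helper name `is_well_formed` and the three test strings are made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """True if xml_text parses as one properly nested root element."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Matching, properly nested tags, including an abbreviated empty element.
ok = is_well_formed("<book><title>Foundations</title><red/></book>")
# End tags in the wrong order: not well formed.
mismatched = is_well_formed("<book><title>Foundations</book></title>")
# Two top-level elements: a document needs a single root.
two_roots = is_well_formed("<book/><book/>")

print(ok, mismatched, two_roots)
```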
8. More XML: Attributes
    <book price = "55" currency = "USD">
        <title> Foundations of Databases </title>
        <author> Abiteboul </author>
        ...
        <year> 1995 </year>
    </book>
attributes are alternative ways to represent data
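As an illustration of attributes being an alternative representation, the sketch below encodes the same price once as an attribute and once as a child element. It assumes Python's `xml.etree.ElementTree`; the element-based variant and the helper `price_of` are hypothetical, not from the lecture.

```python
import xml.etree.ElementTree as ET

# The slide's encoding: price and currency as attributes.
as_attributes = ET.fromstring(
    '<book price="55" currency="USD">'
    "<title>Foundations of Databases</title></book>"
)
# A hypothetical alternative encoding: the same data as child elements.
as_elements = ET.fromstring(
    "<book><price>55</price><currency>USD</currency>"
    "<title>Foundations of Databases</title></book>"
)

def price_of(book):
    """Look for the price as an attribute first, then as a child element."""
    value = book.get("price")
    if value is None:
        value = book.findtext("price")
    return float(value)

print(price_of(as_attributes), price_of(as_elements))
```

A query processor has to cope with both encodings, since nothing in XML itself says which one a document author will pick.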
9. More XML: Oids and References
    <person id="o555"> <name> Jane </name> </person>
    <person id="o456"> <name> Mary </name>
        <children idref="o123 o555"/>
    </person>
    <person id="o123" mother="o456"> <name> John </name>
    </person>
oids and references in XML are just syntax
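One way to see that oids and references are "just syntax": a parser without a DTD hands them back as ordinary attribute strings, and the application has to follow them itself. This is a minimal sketch assuming Python's `xml.etree.ElementTree`; the wrapping `<people>` root and the helpers `name_of` and `children_of` are assumptions added so the example is a single well-formed document.

```python
import xml.etree.ElementTree as ET

PEOPLE = """
<people>
  <person id="o555"> <name> Jane </name> </person>
  <person id="o456"> <name> Mary </name>
    <children idref="o123 o555"/>
  </person>
  <person id="o123" mother="o456"> <name> John </name> </person>
</people>
"""

root = ET.fromstring(PEOPLE)

# Build the oid -> element index ourselves; the parser does not do it.
by_id = {p.get("id"): p for p in root.iter("person")}

def name_of(person):
    return person.findtext("name").strip()

def children_of(person):
    """Dereference the space-separated oids in <children idref="..."/>."""
    refs = person.find("children")
    if refs is None:
        return []
    return [name_of(by_id[oid]) for oid in refs.get("idref").split()]

mary = by_id["o456"]
john = by_id["o123"]
print(children_of(mary), name_of(by_id[john.get("mother")]))
```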
10. More XML: CDATA Section
- Syntax: <![CDATA[ ... any text here ... ]]>
- Example:
    <example> <![CDATA[ some text here </notAtag> <>]]>
    </example>
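A short sketch of what a parser does with a CDATA section, assuming Python's `xml.etree.ElementTree`. The markup-looking characters inside the section arrive as plain text, exactly as if they had been escaped with entity references; the escaped variant is added here for comparison and is not on the slide.

```python
import xml.etree.ElementTree as ET

with_cdata = ET.fromstring(
    "<example><![CDATA[ some text here </notAtag> <>]]></example>"
)
# The same content written with entity references instead of CDATA.
with_entities = ET.fromstring(
    "<example> some text here &lt;/notAtag&gt; &lt;&gt;</example>"
)

cdata_text = with_cdata.text
child_count = len(with_cdata)  # </notAtag> was text, so no child elements
print(repr(cdata_text), child_count)
```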
11. More XML: Entity References
- Syntax: &entityname;
- Example: <element> this is less than &lt; </element>
- Some entities
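XML 1.0 predefines five entities: `&lt;`, `&gt;`, `&amp;`, `&apos;`, and `&quot;`. The sketch below, assuming Python's `xml.etree.ElementTree` and `xml.sax.saxutils.escape`, shows a parser expanding them and a writer producing them; the sample strings other than the slide's own example are made up.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# The slide's example: the parser expands &lt; to a literal '<'.
element = ET.fromstring("<element> this is less than &lt; </element>")
parsed_text = element.text

# All five predefined entities, expanded by the parser.
predefined = ET.fromstring("<e>&lt; &gt; &amp; &apos; &quot;</e>").text

# Going the other way: escape text before embedding it in a document.
escaped = escape("a < b & c > d")

print(repr(parsed_text), repr(predefined), escaped)
```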
12. More XML: Processing Instructions
- Syntax: <?target argument?>
- Example:
    <product> <name> Alarm Clock </name>
        <?ringBell 20?>
        <price> 19.99 </price>
    </product>
- What do they mean?
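On the question of what processing instructions mean: XML itself assigns them no meaning, so a parser can only report the target and its argument and leave interpretation to the application. A minimal sketch, assuming Python's standard `xml.dom.minidom` (chosen here because it keeps processing instructions in the tree):

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<product><name> Alarm Clock </name>"
    "<?ringBell 20?>"
    "<price> 19.99 </price></product>"
)

# Collect (target, data) for every processing instruction under <product>.
instructions = [
    (node.target, node.data)
    for node in doc.documentElement.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]
print(instructions)
```

Whether `ringBell 20` means "ring 20 times" or something else is up to whichever application recognizes the `ringBell` target.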
13. More XML: Comments
- Syntax: <!-- ... comment text ... -->
- Yes, they are part of the data model!
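That comments are part of the data model can be seen in a DOM tree, where they show up as nodes alongside elements. A minimal sketch, assuming Python's `xml.dom.minidom`; the document and its comment text are hypothetical.

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<book><!-- out of print --><title>Foundations</title></book>"
)
children = doc.documentElement.childNodes

# The comment is a child node of <book>, just like <title>.
comments = [n.data for n in children if n.nodeType == n.COMMENT_NODE]
print(len(children), comments)
```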
14. XML Namespaces
- http://www.w3.org/TR/REC-xml-names (1/99)
- name ::= [prefix:]localpart
    <book xmlns:isbn="www.isbn-org.org/def">
        <title> ... </title>
        <number> 15 </number>
        <isbn:number> ... </isbn:number>
    </book>
15. XML Namespaces
- syntactic: <number> , <isbn:number>
- semantic: provide URL for schema
    <tag xmlns:mystyle = "http://...">
        <mystyle:title> ... </mystyle:title>
        <mystyle:number> ...
    </tag>
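The syntactic role of namespaces (keeping `<number>` and `<isbn:number>` apart) can be sketched as follows, assuming Python's `xml.etree.ElementTree`, which expands each prefixed name to `{namespace}localpart`. The title and the value inside `<isbn:number>` are placeholders, since the slide elides them.

```python
import xml.etree.ElementTree as ET

ISBN_NS = "www.isbn-org.org/def"  # namespace name taken from the slide

doc = ET.fromstring(
    '<book xmlns:isbn="www.isbn-org.org/def">'
    "<title>Foundations</title>"                       # placeholder title
    "<number>15</number>"
    "<isbn:number>isbn-placeholder</isbn:number>"      # placeholder value
    "</book>"
)

tags = [child.tag for child in doc]

# The two "number" elements are distinct names once the prefix is expanded.
plain = doc.find("number")
qualified = doc.find("isbn:number", {"isbn": ISBN_NS})
print(tags, plain.text, qualified.text)
```

The prefix itself is only a local abbreviation; what identifies the name is the namespace string it is bound to.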
16. XML Data Model
- Several competing models:
- Document Object Model (DOM):
  - http://www.w3.org/TR/2001/WD-DOM-Level-3-CMLS-20010209/ (2/2001)
  - class hierarchy (node, element, attribute, ...)
  - objects have behavior
  - defines API to inspect/modify the document
- XSL data model
- Infoset
- PSV (post schema validation)
- XML Query data model (next)
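To give a feel for the DOM bullets above (objects with behavior, an API to inspect and modify), here is a minimal sketch using Python's `xml.dom.minidom`, one DOM implementation among many. The sample document is a hypothetical cut-down version of the book example.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<book price="55"><title>Foundations</title><year>1995</year></book>'
)
book = doc.documentElement

# Inspect: walk the child nodes and read an attribute.
child_tags = [n.tagName for n in book.childNodes if n.nodeType == n.ELEMENT_NODE]
price = book.getAttribute("price")

# Modify: add an attribute and insert a new <author> before <year>.
book.setAttribute("currency", "USD")
author = doc.createElement("author")
author.appendChild(doc.createTextNode("Abiteboul"))
book.insertBefore(author, book.getElementsByTagName("year")[0])

tags_after = [n.tagName for n in book.childNodes]
serialized = book.toxml()
print(child_tags, price, tags_after)
```

Contrast this mutable, object-oriented view with the functional, constructor-based notation of the XML Query data model.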
17. XML Query Data Model
- http://www.w3.org/TR/query-datamodel/ (2/2001)
- Describes XML as a tree, with specialized nodes
- Uses a functional-style notation (think ML)
18. XML Query Data Model
- Node ::= DocNode | ElemNode
         | ValueNode
         | AttrNode | NSNode
         | PINode | CommentNode
         | InfoItemNode
         | RefNode
19. XML Query Data Model
- Element node (simplified definition):
- elemNode : (QNameValue, {AttrNode}, [ElemNode | ValueNode]) -> ElemNode
- QNameValue means "a tag name"
- {...} means "set of ..."
- [...] means "list of ..."
20. XML Query Data Model
- Reads: "Give me a tag, a set of attributes, and a list of elements/values, and I will return an element."
21. XML Query Data Model
    book1 = elemNode(book, {price2, currency3},
                     [title4, author5, author6, author7, year8])
    price2 = attrNode(...)       /* next */
    currency3 = attrNode(...)
    title4 = elemNode(title, {}, [string9])

    <book price = "55" currency = "USD">
        <title> Foundations </title>
        <author> Abiteboul </author>
        <author> Hull </author>
        <author> Vianu </author>
        <year> 1995 </year>
    </book>
22. XML Query Data Model
- Attribute node:
- attrNode : (QNameValue, ValueNode) -> AttrNode
23. XML Query Data Model
    price2 = attrNode(price, string10)
    string10 = valueNode(...)    /* next */
    currency3 = attrNode(currency, string11)
    string11 = valueNode(...)

    <book price = "55" currency = "USD">
        <title> Foundations </title>
        <author> Abiteboul </author>
        <author> Hull </author>
        <author> Vianu </author>
        <year> 1995 </year>
    </book>
24. XML Query Data Model
- Value node:
- ValueNode ::= StringValue | BoolValue | FloatValue
- stringValue : string -> StringValue
- boolValue : boolean -> BoolValue
- floatValue : float -> FloatValue
25. XML Query Data Model
    price2 = attrNode(price, string10)
    string10 = valueNode(stringValue("55"))
    currency3 = attrNode(currency, string11)
    string11 = valueNode(stringValue("USD"))
    title4 = elemNode(title, {}, [string9])
    string9 = valueNode(stringValue("Foundations"))

    <book price = "55" currency = "USD">
        <title> Foundations </title>
        <author> Abiteboul </author>
        <author> Hull </author>
        <author> Vianu </author>
        <year> 1995 </year>
    </book>
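The constructors on the preceding slides can be mimicked in an ordinary programming language. The sketch below is only an approximation under stated assumptions: it uses Python frozen dataclasses in place of the W3C notation, keeps just element, attribute, and string value nodes, strips surrounding whitespace from text, and introduces a hypothetical converter `to_model` that is not part of the data model draft.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple, Union
import xml.etree.ElementTree as ET

@dataclass(frozen=True)
class ValueNode:
    value: str  # only string values are modelled in this sketch

@dataclass(frozen=True)
class AttrNode:
    name: str          # plays the role of QNameValue
    value: ValueNode

@dataclass(frozen=True)
class ElemNode:
    name: str                                            # tag name
    attributes: FrozenSet[AttrNode]                      # {AttrNode}: a set
    children: Tuple[Union["ElemNode", ValueNode], ...]   # [...]: an ordered list

def to_model(elem):
    """Hypothetical converter from an ElementTree element to ElemNode."""
    attrs = frozenset(AttrNode(k, ValueNode(v)) for k, v in elem.attrib.items())
    children = []
    if elem.text and elem.text.strip():
        children.append(ValueNode(elem.text.strip()))
    for child in elem:
        children.append(to_model(child))
        if child.tail and child.tail.strip():
            children.append(ValueNode(child.tail.strip()))
    return ElemNode(elem.tag, attrs, tuple(children))

def leaf(tag, text):
    """Element with no attributes and a single string child."""
    return ElemNode(tag, frozenset(), (ValueNode(text),))

# The slide's construction, written out by hand.
price2 = AttrNode("price", ValueNode("55"))
currency3 = AttrNode("currency", ValueNode("USD"))
title4 = leaf("title", "Foundations")
book1 = ElemNode(
    "book",
    frozenset({price2, currency3}),
    (title4, leaf("author", "Abiteboul"), leaf("author", "Hull"),
     leaf("author", "Vianu"), leaf("year", "1995")),
)

BOOK_XML = (
    '<book price="55" currency="USD">'
    "<title> Foundations </title><author> Abiteboul </author>"
    "<author> Hull </author><author> Vianu </author>"
    "<year> 1995 </year></book>"
)
parsed = to_model(ET.fromstring(BOOK_XML))
print(parsed == book1)
```

Using a set for attributes and a tuple for children mirrors the `{...}` versus `[...]` distinction: attribute order is irrelevant, child order is not.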
26. XLink
- Generalizes HTML's href
- Many types: simple, extended, locator, ...
- Discuss only simple links
    <person xmlns:xlink="http://www.w3.org/1999/xlink"
            xlink:type="simple"
            xlink:href="http://a.b.c/myhomepage.html"
            xlink:title="The Homepage"
            xlink:show="replace"
            xlink:actuate="onRequest"> ..... </person>
required attributes
optional attributes
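Because XLink attributes live in the XLink namespace, an application finds a simple link by looking for namespace-qualified attributes on any element. A minimal sketch, assuming Python's `xml.etree.ElementTree`; the helpers `xlink_attr` and `simple_link` and the element content "Jane" are hypothetical.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"

person = ET.fromstring(
    '<person xmlns:xlink="http://www.w3.org/1999/xlink" '
    'xlink:type="simple" '
    'xlink:href="http://a.b.c/myhomepage.html" '
    'xlink:title="The Homepage" '
    'xlink:show="replace" '
    'xlink:actuate="onRequest">Jane</person>'  # content is a placeholder
)

def xlink_attr(elem, local_name):
    """Read an attribute from the XLink namespace, or None if absent."""
    return elem.get("{%s}%s" % (XLINK_NS, local_name))

def simple_link(elem):
    """Return the link's attributes if elem is a simple XLink, else None."""
    if xlink_attr(elem, "type") != "simple":
        return None
    names = ("href", "title", "show", "actuate")
    return {name: xlink_attr(elem, name) for name in names}

link = simple_link(person)
print(link)
```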
27. XLink
- show attribute can be:
  - new
  - replace
  - embed
  - other
- actuate attribute can be:
  - onLoad
  - onRequest
  - other
  - none
28. XLink
- href attribute:
  - a URI, or
  - an XPointer (next)
29. XPointer
- An extension of XPath (next week)
- Usage:
  - href="www.a.b.c/document.xml#xpointerExpr"
- An XPointer expression points to:
  - a point
  - a range
30. XPointer
- Pointing to a point (XML element or character):
  - Full form, e.g. xpointer(id(3652))
  - Bare name, e.g. 3652
  - Child sequence, e.g. xpointer(/1/3/2/5), xpointer(/bib/book[3])
- Pointing to a range, e.g. xpointer(id(3652 to 44))
- Most interesting examples use XPath
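The child-sequence form can be read as "take child number n at each step". The sketch below illustrates that reading only; it assumes Python's `xml.etree.ElementTree`, counts element children only, treats the leading `/1` as the root element, and uses a hypothetical helper `resolve_child_sequence` and a made-up `<bib>` document. It is not an XPointer implementation.

```python
import xml.etree.ElementTree as ET

def resolve_child_sequence(root, sequence):
    """Follow a child sequence such as '/1/3/2'; return None if it fails."""
    steps = [int(part) for part in sequence.split("/") if part]
    if not steps or steps[0] != 1:
        return None  # a document has exactly one root element
    node = root
    for n in steps[1:]:
        children = list(node)  # element children, in document order
        if n < 1 or n > len(children):
            return None
        node = children[n - 1]  # child sequences count from 1
    return node

BIB = ET.fromstring(
    "<bib>"
    "<book><title>A</title></book>"
    "<book><title>B</title></book>"
    "<book><title>C</title><year>1999</year></book>"
    "</bib>"
)

third_book = resolve_child_sequence(BIB, "/1/3")
year = resolve_child_sequence(BIB, "/1/3/2")
missing = resolve_child_sequence(BIB, "/1/4")
print(third_book.findtext("title"), year.text, missing)
```

In this document `/1/3` selects the same element as the path-style expression `/bib/book[3]`, which ElementTree can evaluate from the root as `BIB.find("book[3]")`.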
31. XML vs. Semistructured Data
- both are best described by a graph
- both are schema-less, self-describing
32. Similarities and Differences
    <person id="o123">
        <name> Alan </name>
        <age> 42 </age>
        <email> ab@com </email>
    </person>

    { person: &o123
        { name: "Alan",
          age: 42,
          email: "ab@com" } }

    <person father="o123"> ... </person>

    { person: { father: &o123 ... } }

similar on trees, different on graphs
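A rough way to see "similar on trees, different on graphs" in code: on the tree part, an XML element maps directly onto a label-to-value structure, but a reference is an actual edge to the same object in semistructured data and only an attribute string in XML. The sketch assumes Python dictionaries as a stand-in for semistructured data and `xml.etree.ElementTree` for XML; the `<db>` root and the helper `tree_to_ssd` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Semistructured data: the father edge points at the very same object.
alan = {"name": "Alan", "age": 42, "email": "ab@com"}
ssd = {"person": [alan, {"father": alan}]}

# XML: the reference is the string "o123" stored in an attribute.
db = ET.fromstring(
    "<db>"
    '<person id="o123"><name>Alan</name><age>42</age>'
    "<email>ab@com</email></person>"
    '<person father="o123"/>'
    "</db>"
)
first, second = db.findall("person")

def tree_to_ssd(elem):
    """Map an element whose children are all leaves to label -> text."""
    return {child.tag: child.text for child in elem}

as_tree = tree_to_ssd(first)            # trees: a direct correspondence
ssd_father = ssd["person"][1]["father"]  # graphs: an object
xml_father = second.get("father")        # graphs: just a string
print(as_tree, ssd_father is alan, xml_father)
```

Note that the XML side also loses the type of `age` (the text "42" rather than the number 42), since element content is plain text.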
33. More Differences
- XML is ordered, ssd is not
- XML can mix text and elements:
    <talk> Making Java easier to type and easier to type
        <speaker> Phil Wadler </speaker>
    </talk>
- XML has lots of other stuff: entities, processing instructions, comments
Very important: these differences make XML data management harder
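Mixed content and ordering are visible as soon as one tries to list an element's content in document order. A minimal sketch, assuming Python's `xml.etree.ElementTree`, which stores text before the first child in `.text` and text after each child in that child's `.tail`; the helper `content` and the second, tiny document are hypothetical.

```python
import xml.etree.ElementTree as ET

def content(elem):
    """List an element's text pieces and child elements in document order."""
    items = []
    if elem.text and elem.text.strip():
        items.append(elem.text.strip())
    for child in elem:
        items.append(child)
        if child.tail and child.tail.strip():
            items.append(child.tail.strip())
    return items

talk = ET.fromstring(
    "<talk> Making Java easier to type and easier to type"
    " <speaker> Phil Wadler </speaker>"
    " </talk>"
)
talk_items = content(talk)

# Text on both sides of a child: order matters and must be preserved.
mixed_items = content(ET.fromstring("<t>x<s/>y</t>"))

def describe(item):
    return item if isinstance(item, str) else "<%s>" % item.tag

print([describe(i) for i in talk_items], [describe(i) for i in mixed_items])
```

A label-to-value dictionary has no natural place for the free text or for the relative order of text and elements, which is one reason XML storage and querying are harder.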
34. Summary of Data Models
- semistructured data, XML
- data is self-describing, irregular
- schema embedded with the data