Title: Semistructured Data and XML
1Chapter 29
- Semistructured Data and XML
- Transparencies
2Chapter - Objectives
- What semistructured data is.
- Concepts of the Object Exchange Model (OEM), a model for semistructured data.
- Basics of Lore, a semistructured DBMS, and its query language, Lorel.
- Main language elements of XML.
- Difference between well-formed and valid XML documents.
- How Document Type Definitions (DTDs) can be used to define the valid syntax of an XML document.
3Chapter - Objectives
- How the Document Object Model (DOM) compares with OEM.
- About other related XML technologies.
- Limitations of DTDs and how the W3C XML Schema overcomes these limitations.
- How RDF and RDF Schema provide a foundation for processing meta-data.
4DTD XML Names and NMTOKEN
- Name characters are letters, digits, hyphens, underscores, colons, or full stops.
- An NMTOKEN is any sequence of one or more name characters.
- NMTOKENS is any list of NMTOKENs separated by white space (space, tab, newline, etc.).
- Case is significant: PERSON and person are distinct names.
- Attribute and element names must be NMTOKENs that obey two further restrictions:
- Names cannot begin with a digit, hyphen, or full stop.
- Names cannot begin with xml (or any variant obtained by changing case); this prefix is reserved for the system.
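The naming rules above can be sketched as two small checks. This is an ASCII-only approximation written for illustration (the real XML rules also admit many non-ASCII letters), and the function names `is_nmtoken` and `is_name` are hypothetical.

```python
import re

# ASCII-only approximation of a name character: letter, digit, ".", "_", ":", "-".
NAME_CHAR = r"[A-Za-z0-9._:\-]"
NMTOKEN_RE = re.compile(rf"{NAME_CHAR}+")
# A name may not start with a digit, hyphen, or full stop.
NAME_RE = re.compile(rf"[A-Za-z_:]{NAME_CHAR}*")


def is_nmtoken(s: str) -> bool:
    """True if s consists only of name characters."""
    return NMTOKEN_RE.fullmatch(s) is not None


def is_name(s: str) -> bool:
    """True if s is an NMTOKEN that also obeys the two name restrictions."""
    if s[:3].lower() == "xml":  # reserved prefix, in any case variant
        return False
    return NAME_RE.fullmatch(s) is not None
```

For instance, `2003-05` passes `is_nmtoken` but fails `is_name`, because a name cannot start with a digit.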
5Element Declarations EMPTY
- The keyword ELEMENT introduces a new element: <!ELEMENT NAME CONTENT_MODEL>
- An element name must begin with a letter (or an underscore or colon) and may additionally contain digits and some punctuation, i.e. ".", "-", "_", and ":", as described earlier under NMTOKEN.
- If an element can hold no child elements and no text, it is known as an empty element and is denoted by EMPTY for CONTENT_MODEL.
- This seems trivial but it isn't, because the presence or absence of the element in an XML file can be used as a flag.
- HTML has several examples, such as HR and IMG, which never have children and include no text. Here we would write <!ELEMENT HR EMPTY>, and then <HR/> or <HR></HR> generates a horizontal line.
- EMPTY elements can have attributes, such as the SRC attribute in <IMG/>, which specifies the source of the image.
6Element Declarations ANY
- An element declared to have a content of ANY may contain any of the other elements declared in the DTD.
- This is not quite the same as having no DTD for the file.
- <!DOCTYPE fred [ <!ELEMENT fred ANY> ]>
- <fred> <people>Me and You</people> <people>Them</people> </fred>
- This gets an error because of the undeclared <people> tag.
- Adding <!ELEMENT people ANY> inside the DTD declaration produces a valid document.
7Entities
- The DTD of an XML document can contain entity declarations. These are like macro substitutions in other languages.
- Entities are defined in the DTD and come in several flavors.
- General entities are referenced as &EntName;
- Parameter entities are referenced as %EntName;
- We have already seen the character entities:
- &amp; for &
- &apos; for '
- &gt; for >
- &lt; for <
- &quot; for "
- These are built in, but you could add other such entities with <!ENTITY aitself "A">, and &aitself; would be replaced by A.
8General Entities
- As another example, we can put <!ENTITY TODAY "May 12 2003"> in the DTD, and then <comment>&TODAY; was very quiet in Irvine</comment> is parsed as <comment>May 12 2003 was very quiet in Irvine</comment>
- General entity references can be nested inside a DTD, e.g., one can write <!ENTITY YEAR "2003"> <!ENTITY TODAY "May 12 &YEAR;">
- However, one must use parameter entities and not general entities for macro substitution in other DTD declarations such as <!ATTLIST and <!ELEMENT.
- Parameter entities are defined as in <!ENTITY % CUSTARDTAGS "(NAME,DATE,ORDERS)">
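General entity expansion can be observed with any XML parser that reads the internal DTD subset. The sketch below uses Python's standard `xml.etree.ElementTree` (an assumption of this example, not a tool named in the slides) on the TODAY and YEAR entities described above.

```python
import xml.etree.ElementTree as ET

# Internal DTD subset declaring a nested general entity, as on the slide.
DOC = """<?xml version="1.0"?>
<!DOCTYPE comment [
  <!ENTITY YEAR "2003">
  <!ENTITY TODAY "May 12 &YEAR;">
]>
<comment>&TODAY; was very quiet in Irvine</comment>"""

# The parser substitutes &TODAY; (and, inside it, &YEAR;) before handing
# the text to the application.
root = ET.fromstring(DOC)
expanded_text = root.text
print(expanded_text)
```

The application never sees the entity reference itself, only the replacement text.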
9Parameter Entities
- <!ENTITY % peopletags "(firstname,lastname,dateofbirth)">
- <!ELEMENT student %peopletags;>
- <!ELEMENT teacher %peopletags;>
- <!ELEMENT administrator %peopletags;>
- This defines a group of people elements that all have the same child elements.
- Parameter entities are even more commonly used for attributes, because several elements almost always share the same attributes (often with a basic set augmented in different ways for different elements).
- This basic set can be defined in a parameter entity.
10Defining Implied Attributes
- Attributes must be declared in the DTD before they can be used.
- #IMPLIED means that the attribute is optional and has no default value.
- <!ELEMENT population (#PCDATA)>
- <!ATTLIST population year CDATA #IMPLIED>
- The attribute year may be present or absent in the element population. Valid examples:
- <population year="2000">80</population>
- <population>80</population>
11Defining Required Attributes
- <!ELEMENT population (#PCDATA)>
- <!ATTLIST population year CDATA #REQUIRED>
- The population element must contain a year attribute:
- <population year="1996">80</population>
- <!ELEMENT population (#PCDATA)>
- <!ATTLIST population year (2000|2001) #REQUIRED>
- The population element must contain a year attribute of 2000 or 2001:
- <population year="2000">80</population>
- No quotes on the enumeration values
12Defining Default Attributes
- <!ELEMENT population (#PCDATA)>
- <!ATTLIST population year CDATA "2000">
- All these are valid:
- <population year="2001">80</population>
- <population year="2000">80</population>
- <population>80</population>
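The effect of a default value can be seen from the application side: a parser that reads the DTD reports the default when the attribute is omitted. The sketch below assumes Python's standard `xml.etree.ElementTree`, whose underlying parser reads ATTLIST defaults from the internal DTD subset; `year_of` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Internal DTD subset giving the year attribute a default of "2000".
DTD = """<!DOCTYPE population [
  <!ELEMENT population (#PCDATA)>
  <!ATTLIST population year CDATA "2000">
]>"""


def year_of(element_text: str) -> str:
    """Parse one population element under the DTD and return its year."""
    return ET.fromstring(DTD + element_text).get("year")


explicit = year_of('<population year="2001">80</population>')
defaulted = year_of("<population>80</population>")
print(explicit, defaulted)
```

Note that this parser does not validate, so it applies the default but would not reject a value that a #FIXED or enumerated declaration forbids.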
13Defining Fixed Attributes
- <!ELEMENT population (#PCDATA)>
- <!ATTLIST population year CDATA #FIXED "2000">
- Invalid: <population year="2001">80</population>
- Valid: <population year="2000">80</population>
- Valid: <population>80</population>
14Defining Unique Attributes
- <!ELEMENT animal (name)>
- <!ATTLIST animal code ID #REQUIRED>
- The value of the code attribute has to be unique in the XML document:
- <animal code="T50"><name>Lion</name></animal>
- <animal code="T51"><name>Rabbit</name></animal>
15Referring Unique Attributes
- <!ELEMENT website (url)>
- <!ATTLIST website animal_refer IDREF #REQUIRED>
- The animal_refer attribute refers to a previously defined ID attribute value:
- <website animal_refer="T50">
- <url>http://www.lions.com</url>
- </website>
16Referring Multiple Unique Attributes
- <!ELEMENT website (url)>
- <!ATTLIST website contents IDREFS #REQUIRED>
- The contents attribute contains a series of IDs:
- <website contents="T50 T51">
- <url>http://www.animals.com</url>
- </website>
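What a validating parser checks for ID, IDREF, and IDREFS attributes can be sketched by hand. The code below is an illustration only: it uses Python's non-validating `xml.etree.ElementTree`, a hypothetical `zoo` wrapper element, and a hypothetical `check_ids` helper that is told which attribute names play the ID and IDREFS roles.

```python
import xml.etree.ElementTree as ET

DOC = """<zoo>
  <animal code="T50"><name>Lion</name></animal>
  <animal code="T51"><name>Rabbit</name></animal>
  <website contents="T50 T51"><url>http://www.animals.com</url></website>
</zoo>"""


def check_ids(root, id_attr, idrefs_attr):
    """Return a list of problems: duplicate IDs and references to unknown IDs."""
    seen, problems = set(), []
    # First pass: every ID value must be unique in the document.
    for el in root.iter():
        value = el.get(id_attr)
        if value is not None:
            if value in seen:
                problems.append(f"duplicate ID {value}")
            seen.add(value)
    # Second pass: every whitespace-separated reference must match some ID.
    for el in root.iter():
        for ref in (el.get(idrefs_attr) or "").split():
            if ref not in seen:
                problems.append(f"dangling reference {ref}")
    return problems


problems = check_ids(ET.fromstring(DOC), "code", "contents")
print(problems)
```

A single IDREF is simply the one-token case of the same check.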
17XML Example - the DTD
- <!ELEMENT addressBook (person)*>
- <!ELEMENT person (name, email*, link?)>
- <!ATTLIST person id ID #REQUIRED>
- <!ATTLIST person gender (male|female) #IMPLIED>
- <!ELEMENT name (#PCDATA | family | given)*>
- <!ELEMENT family (#PCDATA)>
- <!ELEMENT given (#PCDATA)>
- <!ELEMENT email (#PCDATA)>
- <!ELEMENT link EMPTY>
- <!ATTLIST link manager IDREF #IMPLIED subordinates IDREFS #IMPLIED>
18DOCTYPE declarations
- Internal: local definition of the DTD
- External: reference to an external file
- Can combine both
19Internal DTD
- <?xml version="1.0" standalone="yes" ?>
- <!--open the DOCTYPE declaration - the open square bracket indicates an internal DTD-->
- <!DOCTYPE foo [
- <!--define the internal DTD-->
- <!ELEMENT foo (#PCDATA)>
- <!--close the DOCTYPE declaration-->
- ]>
- <foo>Hello World.</foo>
20Internal DTD rules
- The document type declaration must be placed between the XML declaration and the first element (root element) in the document.
- The keyword DOCTYPE must be followed by the name of the root element in the XML document.
- The keyword DOCTYPE must be in upper case.
21External DTD
- Useful for creating a common DTD that can be shared between multiple documents.
- Any change made to the external DTD automatically applies to all the documents that reference it.
- Two types: private and public.
- Rules:
- If the XML document uses any elements, attributes, or entities that are referenced or defined in an external DTD, standalone="no" must be included in the XML declaration.
22"Private" External DTDs
- Identified by the keyword SYSTEM
- Intended for use by a single author or group of authors.
- Example:
- <!DOCTYPE root_element SYSTEM "DTD_location">
- where DTD_location is a relative or absolute URL (such as http:// and file://).
23"Private" External DTDs (cont)
- XML document:
- <?xml version="1.0" standalone="no" ?>
- <!DOCTYPE document SYSTEM "subjects.dtd">
- <document> ... </document>
- subjects.dtd:
- <!ELEMENT document ...>
24"Public" External DTDs
- Identified by the keyword PUBLIC
- Intended for broad use.
- <!DOCTYPE root_element PUBLIC "DTD_name" "DTD_location"> where
- DTD_location is a relative or absolute URL
- DTD_name follows the syntax "prefix//owner_of_the_DTD//description_of_the_DTD//ISO 639_language_identifier"
- "DTD_location" is used to find the public DTD if it cannot be located by the "DTD_name".
25"Public" External DTDs (cont)
- <?xml version="1.0" standalone="no" ?>
- <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
- <HTML>
- <HEAD>
- <TITLE>A typical HTML file</TITLE>
- </HEAD>
- <BODY>
-
- </BODY>
- </HTML>
26"Public" External DTDs (cont)
- Valid DTD_name prefixes:
- ISO: The DTD is an ISO standard. All ISO standards are approved.
- +: The DTD is an approved non-ISO standard.
- -: The DTD is an unapproved non-ISO standard.
27Combining Internal and External DTDs
- A document can use both internal and external DTD subsets.
- The internal DTD subset is specified between the square brackets of the DOCTYPE declaration.
- The declaration for the external DTD subset is placed before the square brackets, immediately after the SYSTEM keyword.
- Declaring an ELEMENT with the same name in both the internal and external DTD subsets is invalid.
28Example
- <?xml version="1.0" standalone="no" ?>
- <!DOCTYPE document SYSTEM "subjects.dtd"
- [
- <!ATTLIST assessment assessment_type (exam | assignment | prac)>
- <!ELEMENT results (#PCDATA)>
- ]>
- subjects.dtd:
- <!ELEMENT document (title,subjectID,subjectname,prerequisite?,classes,assessment,syllabus,textbooks)>
- <!ELEMENT prerequisite (subjectID,subjectname)>
29DTD Validation
- XML content can be well-formed but invalid under DTD rules.
- E.g., DTD rule: <!ELEMENT name (#PCDATA)>
- Acceptable: <name> Giancarlo Succi </name>
- Unacceptable:
- <name>
- <first_name> Giancarlo </first_name>
- <last_name> Succi </last_name>
- </name>
30Beyond DTDs
- DTD limitations
- Simple document structures
- Lack of real datatypes
- Advanced schema languages
- XML Schema
- Relax NG
31Limitations of DTDs
- No typing of text elements and attributes
- All values are strings, no integers, reals, etc.
- Difficult to specify unordered sets of subelements
- Order is usually irrelevant in databases
- (A | B)* allows specification of an unordered set, but
- cannot ensure that each of A and B occurs only once
- IDs and IDREFs are untyped
- The owners attribute of an account may contain a reference to another account, which is meaningless
- The owners attribute should ideally be constrained to refer to customer elements
32Shortcomings of DTDs
- Useful for documents, but not so good for data
- No support for structural re-use
- Object-oriented-like structures aren't supported
- No support for data types
- Can't do data validation
- Can have a single key item (ID), but
- No support for multi-attribute keys
- No support for foreign keys (references to other keys)
- No constraints on IDREFs (e.g., cannot require that a reference point only to a Section)
33XML Schema
- In XML format
- Includes primitive data types (integers, strings, dates, etc.)
- Supports value-based constraints (integers > 100)
- User-definable structured types
- Inheritance (extension or restriction)
- Foreign keys
- Element-type reference constraints
34XML Schema
- XML Schema is a more sophisticated schema language which addresses the drawbacks of DTDs. It supports:
- Typing of values
- E.g. integer, string, etc.
- Also, constraints on min/max values
- User-defined types
- Is itself specified in XML syntax, unlike DTDs
- More standard representation, but verbose
- Is integrated with namespaces
- Many more features
- List types, uniqueness and foreign key constraints, inheritance, ...
- BUT: significantly more complicated than DTDs.
35XML Schema Simple Types
- Elements that do not contain other elements or attributes are of type simpleType.
- <xsd:element name="STAFFNO" type="xsd:string"/>
- <xsd:element name="DOB" type="xsd:date"/>
- <xsd:element name="SALARY" type="xsd:decimal"/>
- Attributes must be defined last:
- <xsd:attribute name="branchNo" type="xsd:string"/>
36XML Schema Complex Types
- Elements that contain other elements are of type complexType.
- The list of children of a complex type is described by a sequence element.
- <xsd:element name="STAFFLIST">
- <xsd:complexType>
- <xsd:sequence>
- <!-- children defined here -->
- </xsd:sequence>
- </xsd:complexType>
- </xsd:element>
37Cardinality
- Cardinality of an element can be represented using the attributes minOccurs and maxOccurs.
- To represent an optional element, set minOccurs to 0; to indicate there is no maximum number of occurrences, set maxOccurs to unbounded.
- <xsd:element name="DOB" type="xsd:date" minOccurs="0"/>
- <xsd:element name="NOK" type="xsd:string" minOccurs="0" maxOccurs="3"/>
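The cardinality rule can be written out as a small predicate. This is a plain-Python illustration, not a schema validator; `occurs_ok` and `UNBOUNDED` are hypothetical names, and the defaults of 1 reflect the XML Schema rule that both attributes default to 1 when omitted.

```python
UNBOUNDED = "unbounded"


def occurs_ok(count, min_occurs=1, max_occurs=1):
    """True if an element appearing `count` times satisfies minOccurs/maxOccurs."""
    if count < min_occurs:
        return False
    return max_occurs == UNBOUNDED or count <= max_occurs


# DOB is declared with minOccurs="0" (maxOccurs defaults to 1).
dob_absent_ok = occurs_ok(0, min_occurs=0)
# NOK is declared with minOccurs="0" maxOccurs="3".
nok_four_ok = occurs_ok(4, min_occurs=0, max_occurs=3)
print(dob_absent_ok, nok_four_ok)
```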
38References
- Can use references to element and attribute definitions.
- <xsd:element name="STAFFNO" type="xsd:string"/>
- ...
- <xsd:element ref="STAFFNO"/>
- If there are many references to STAFFNO, using references keeps the definition in one place and improves the maintainability of the schema.
39Defining New Types
- Can also define new data types to create elements and attributes.
- <xsd:simpleType name="STAFFNOTYPE">
- <xsd:restriction base="xsd:string">
- <xsd:maxLength value="5"/>
- </xsd:restriction>
- </xsd:simpleType>
- The new type has been defined as a restriction of string (to have a maximum length of 5 characters).
40Groups
- Can define both groups of elements and groups of attributes. A group is not a data type but acts as a container holding a set of elements or attributes.
- <xsd:group name="StaffType">
- <xsd:sequence>
- <xsd:element name="StaffNo" type="StaffNoType"/>
- <xsd:element name="Position" type="PositionType"/>
- <xsd:element name="DOB" type="xsd:date"/>
- <xsd:element name="Salary" type="xsd:decimal"/>
- </xsd:sequence>
- </xsd:group>
41Constraints
- XML Schema provides XPath-based features for specifying uniqueness constraints and corresponding reference constraints that will hold within a certain scope.
- <xsd:unique name="NAMEDOBUNIQUE">
- <xsd:selector xpath="STAFF"/>
- <xsd:field xpath="NAME/LNAME"/>
- <xsd:field xpath="DOB"/>
- </xsd:unique>
42XML Schema Version of Bank
- <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
- <xsd:element name="bank" type="BankType"/>
- <xsd:element name="account">
- <xsd:complexType>
- <xsd:sequence>
- <xsd:element name="account-number" type="xsd:string"/>
- <xsd:element name="branch-name" type="xsd:string"/>
- <xsd:element name="balance" type="xsd:decimal"/>
- </xsd:sequence>
- </xsd:complexType>
- </xsd:element>
- ... definitions of customer and depositor ...
- <xsd:complexType name="BankType">
- <xsd:sequence>
- <xsd:element ref="account" minOccurs="0" maxOccurs="unbounded"/>
- <xsd:element ref="customer" minOccurs="0" maxOccurs="unbounded"/>
- <xsd:element ref="depositor" minOccurs="0" maxOccurs="unbounded"/>
- </xsd:sequence>
- </xsd:complexType>
- </xsd:schema>
43References
- http://www.java.sun.com/xml/docs/tutorial/TOC.html
- http://www.xml.com/pub/a/1999/09/expat/index.html
- http://xmlfiles.com/dtd/dtd_attributes.asp
- http://xmlwriter.net/xml_guide/doctype_declaration.shtml
44What is an XML Parsing API?
- Programming model for accessing an XML document
- Sits on top of an XML parsing engine
- Language/platform independent
45Java XML Parsing Specification
- The Java XML Parsing Specification is a request to include a standardised way of parsing XML in the Java standard library.
- The specification defines the following packages:
- javax.xml.parsers
- org.xml.sax
- org.xml.sax.helpers
- org.w3c.dom
- The first is an all-new pluggability layer; the others come from existing packages.
46Two ways of using XML parsers: SAX and DOM
- The Java XML Parsing Specification specifies two interfaces for XML parsers:
- Simple API for XML (SAX) is a flat, event-driven parser
- Document Object Model (DOM) is an object-oriented parser which translates the XML document into a Java object hierarchy
47SAX
- Simple API for XML
- Event-based XML parsing API
- Not governed by any standards body
- Guy named David Megginson basically owns it
- SAX is simply a programming model that the developers of individual XML parsers implement
- A SAX parser written in Java would expose the equivalent events
- "Serial access" protocol for XML
48SAX (cont)
- A SAX parser reads the XML document as a stream of XML tags: starting elements, ending elements, text sections, etc.
- Every time the parser encounters an XML tag, it calls a method in its HandlerBase object to deal with the tag.
- The HandlerBase object is usually written by the application programmer.
- The HandlerBase object is given as a parameter to the parse() method in the SAX parser. It includes all the code that defines what the XML tags actually do.
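The callback style described above can be sketched with Python's standard `xml.sax` module, used here as a stand-in for the Java API the slides discuss. `xml.sax.ContentHandler` plays the role of the handler object; the class name `AddressBookHandler`, the `events` list, and the sample e-mail address are assumptions made for this illustration.

```python
import xml.sax


class AddressBookHandler(xml.sax.ContentHandler):
    """Records each parser callback as an (event, value) pair."""

    def __init__(self):
        super().__init__()
        self.events = []
        self._text = []

    def startElement(self, name, attrs):
        # Called when the parser meets a start tag.
        self._text = []
        self.events.append(("start", name))

    def characters(self, content):
        # Text may arrive in several chunks, so collect them.
        self._text.append(content)

    def endElement(self, name):
        # Called when the parser meets an end tag.
        text = "".join(self._text).strip()
        if text:
            self.events.append(("text", text))
        self._text = []
        self.events.append(("end", name))


DOC = (b"<addressbook><person><name>John Doe</name>"
       b"<email>jdoe@example.com</email></person></addressbook>")

handler = AddressBookHandler()
xml.sax.parseString(DOC, handler)  # the handler is passed to the parse call
for event in handler.events:
    print(event)
```

No tree is built: once a callback returns, the parser keeps nothing about that part of the document, which is why this style suits very large inputs.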
49How Does SAX work?
[Figure: the parser reads the XML document below from top to bottom and hands each piece it encounters to the SAX objects as an event.]
<?xml version="1.0"?>
<addressbook>
<person>
<name>John Doe</name>
<email>jdoe_at_yahoo.com</email>
</person>
<person>
<name>Jane Doe</name>
<email>jdoe_at_mail.com</email>
</person>
</addressbook>
50SAX structure
51SAX tutorial
- http://java.sun.com/xml/jaxp/dist/1.1/docs/tutorial/sax/index.html
- Note: some files are at
- http://www.ics.uci.edu/ics185/handouts/slides13-sax/
52More info about SAX
- Read the tutorial
- http://java.sun.com/xml/jaxp/dist/1.1/docs/tutorial/sax/index.html
53Document Object Model (DOM)
- Most common XML parser API
- Tree-based API
- W3C Standard
- All DOM compliant parsers use the same object
model
54DOM (cont)
- A DOM parser is usually referred to as a document builder. It is not really a parser, more like a translator that uses a parser.
- In fact, most DOM implementations include a SAX parser within the document builder.
- A document builder reads in the XML document and outputs a hierarchy of Node objects, which corresponds to the structure of the XML document.
55How Does DOM work?
[Figure: the XML parser reads the whole XML document below and builds the corresponding tree of DOM objects.]
<?xml version="1.0"?>
<addressbook>
<person>
<name>John Doe</name>
<email>jdoe_at_yahoo.com</email>
</person>
<person>
<name>Jane Doe</name>
<email>jdoe_at_mail.com</email>
</person>
</addressbook>
56DOM Structure Model and API
- Hierarchy of Node objects
- document, element, attribute, text, comment, ...
- Language-independent programming DOM API
- get... first/last child, prev/next sibling, childNodes
- insertBefore, replace
- getElementsByTagName
- ...
- Alternative: event-based SAX API (Simple API for XML)
- does not build a parse tree (reports events when encountering begin/end tags)
- for (partially) parsing very large documents
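The tree navigation calls listed above can be tried with Python's standard `xml.dom.minidom`, a DOM implementation used here in place of the Java one. The document content and variable names are assumptions made for this sketch.

```python
from xml.dom import minidom

DOC = """<addressbook>
  <person><name>John Doe</name><email>jdoe@example.com</email></person>
  <person><name>Jane Doe</name><email>jane@example.com</email></person>
</addressbook>"""

# The document builder returns the root of a tree of Node objects.
dom = minidom.parseString(DOC)
root = dom.documentElement

# getElementsByTagName searches the whole subtree.
names = [node.firstChild.data for node in root.getElementsByTagName("name")]

# firstChild / nextSibling / parentNode walk the tree step by step.
first_person = root.getElementsByTagName("person")[0]
first_field = first_person.firstChild.tagName
second_field = first_person.firstChild.nextSibling.tagName
parent_tag = first_person.parentNode.tagName
print(names, first_field, second_field, parent_tag)
```

Unlike the event-based style, the whole document stays in memory, so any node can be revisited in any order.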
57DOM references
- Online tutorial:
- http://java.sun.com/xml/jaxp/dist/1.1/docs/tutorial/dom/index.html
- API:
- http://java.sun.com/j2se/1.4.1/docs/guide/plugin/dom/
58Validating versus Non-validating
- An XML document is well-formed if it is syntactically correct.
- An XML document is valid if it is well-formed and conforms to all constraints imposed by a DTD.
- A parser is validating if it reports whether an XML document is valid. Otherwise, it is non-validating.
- The tutorial has examples for both validating and non-validating parsers.
- All of them check the well-formedness of an XML document.
- Here we focus on non-validating parsers.
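The difference between the two checks shows up clearly with a non-validating parser. The sketch below assumes Python's standard `xml.etree.ElementTree`, which checks well-formedness only; `is_well_formed` is a hypothetical helper, and the sample document reuses the name rule from the DTD validation slide.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """True if the text parses, i.e. it is syntactically correct XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Well-formed, but invalid under the DTD rule that name holds only text.
INVALID_BUT_WELL_FORMED = """<!DOCTYPE name [ <!ELEMENT name (#PCDATA)> ]>
<name><first_name>Giancarlo</first_name><last_name>Succi</last_name></name>"""

# Not even well-formed: the end tags are in the wrong order.
NOT_WELL_FORMED = "<name><first_name>Giancarlo</name></first_name>"

print(is_well_formed(INVALID_BUT_WELL_FORMED), is_well_formed(NOT_WELL_FORMED))
```

The first document is accepted because this parser never compares the content against the DTD; a validating parser would reject it.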
59Querying XML
[Figure: an application or user submits a query over XML documents to the query engine, which returns an XML result (processed further or displayed in a browser).]
60Outline
- XML queries
- Many standards
- As an example: X-Query
61Example XML DTD
<!ELEMENT book (booktitle, author)>
<!ELEMENT booktitle (#PCDATA)>
<!ELEMENT author (name, address)>
<!ATTLIST author id ID #REQUIRED>
<!ELEMENT name (firstname?, lastname)>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT lastname (#PCDATA)>
<!ELEMENT address ANY>
<!ELEMENT article (title, author, contactauthor)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT contactauthor EMPTY>
<!ATTLIST contactauthor authorID IDREF #IMPLIED>
<!ELEMENT monograph (title, author, editor)>
<!ELEMENT editor (monograph)>
<!ATTLIST editor name CDATA #REQUIRED>
62DTD Graph
63An example XML document
- <book>
- <booktitle> Gene </booktitle>
- <author id="dawkins">
- <name>
- <firstname> Richard </firstname>
- <lastname> Dawkins </lastname>
- </name>
- <address>
- <city> Timbuktu </city>
- <zip> 99999 </zip>
- </address>
- </author>
- </book>
- Note: an XML document can be rooted at any element in the DTD!
64An XML Query Language X-Query
- Full specifications: http://www.w3.org/TR/xquery/
- FLWOR Expressions
- for
- let
- where
- order by
- return
65X-Query Example Q1
Find the last names of the authors of the book(s) titled "Gene".
- <geneLastnameList>
- {
- for $b in doc("pub.xml")//book
- let $t := $b/booktitle
- where $t = "Gene"
- return
- <geneLastname>
- { $b/author/name/lastname }
- </geneLastname>
- }
- </geneLastnameList>
Results: <geneLastnameList> <geneLastname> ... </geneLastname> <geneLastname> ... </geneLastname> </geneLastnameList>
66for versus let
- Both clauses bind variables
- for binds its variable to each item of the resulting sequence in turn, producing one tuple per item
- Tuples are iterated one by one
- let binds its variable to the entire resulting sequence, producing a single tuple
67for example
- for $s in (<one/>, <two/>, <three/>)
- return <out>{$s}</out>
- Tuple stream:
- <out>
- <one/>
- </out>
- <out>
- <two/>
- </out>
- <out>
- <three/>
- </out>
68let example
- let $s := (<one/>, <two/>, <three/>)
- return <out>{$s}</out>
- Tuple stream:
- <out>
- <one/>
- <two/>
- <three/>
- </out>
69Multiple for bindings
- for $i in (1, 2), $j in (3, 4)
- Tuple stream:
- ($i = 1, $j = 3)
- ($i = 1, $j = 4)
- ($i = 2, $j = 3)
- ($i = 2, $j = 4)
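The binding behaviour of for and let can be mimicked in ordinary code. The following Python analogy is not XQuery; it represents the three empty elements by their tag names and builds the output as strings, purely to show how many results each clause yields.

```python
from itertools import product

items = ["one", "two", "three"]

# for $s in (...) return <out>{$s}</out>: one result per item.
for_result = [f"<out><{s}/></out>" for s in items]

# let $s := (...) return <out>{$s}</out>: one result holding the whole sequence.
let_result = ["<out>" + "".join(f"<{s}/>" for s in items) + "</out>"]

# for $i in (1, 2), $j in (3, 4): one tuple per combination, with the
# first variable changing slowest.
tuples = list(product((1, 2), (3, 4)))

print(for_result)
print(let_result)
print(tuples)
```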
70Path expressions
- Use / or // to separate steps
- Starting with /
- Example: /student
- The root must be the tag student
- Starting with //
- Example: //student
- The root or one of its descendants must be the tag student.
71Evaluating E1/E2
- Expression E1 is evaluated.
- If the result is not a sequence of nodes, a dynamic error is raised.
- Each node resulting from the evaluation of E1 then serves in turn to provide an inner focus for an evaluation of E2.
- Each evaluation of E2 must result in a sequence of nodes; otherwise, a dynamic error is raised.
- The sequences of nodes resulting from all the evaluations of E2 are merged, eliminating duplicate nodes based on node identity.
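The evaluation rule for E1/E2 can be sketched for the simplest case, where E2 is a child step selecting elements by tag name. The Python below is an illustration under that assumption: `step` is a hypothetical helper, the sample tree is invented, and a full XQuery engine would also return the merged result in document order.

```python
import xml.etree.ElementTree as ET


def step(context_nodes, tag):
    """Evaluate E1/tag, where context_nodes is the result of E1."""
    result, seen = [], set()
    for node in context_nodes:  # each E1 node provides the inner focus
        for child in node:      # evaluate the child step for this focus
            # Merge the per-node results, dropping duplicates by node identity.
            if child.tag == tag and id(child) not in seen:
                seen.add(id(child))
                result.append(child)
    return result


root = ET.fromstring(
    "<a><b><c>1</c><c>8</c></b><b><c>2</c></b><b><c>3</c><c>5</c></b></a>"
)

# /a/b/c evaluated as two successive steps from the root element.
b_nodes = step([root], "b")
c_texts = [c.text for c in step(b_nodes, "c")]
print(c_texts)
```

Passing the same context node twice still yields each child once, which is the identity-based merge described in the last bullet.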
72Example
[Figure: example tree with one a node, three b nodes, five c nodes, and the values 1, 8, 2, 3, 5.]
73Example Q2
Find the names of the authors of the book(s) titled "Gene".
- <geneNameList>
- {
- for $b in doc("pub.xml")//book
- where $b/booktitle = "Gene"
- return
- <geneName>
- { $b/author/name/lastname }
- { $b/author/name/firstname }
- </geneName>
- }
- </geneNameList>
Results: <geneNameList> <geneName> ... </geneName> <geneName> ... </geneName> </geneNameList>
74Example Q3
For each author last name, list his/her address and the titles of all his/her books. Sort the results based on the last names.
- <authorInfo>
- {
- for $ln in distinct-values(doc("pub.xml")/article/author/name/lastname)
- order by $ln
- return
- <author aLastname="{$ln}">
- {
- for $b in doc("pub.xml")//book,
- $a in $b/author
- where $a/name/lastname = $ln
- return
- (<title>{ $b/booktitle }</title>,
- <addr>{ $a/address }</addr>)
- }
- </author>
- }
- </authorInfo>
Results: <authorInfo> <author aLastname="..."> <title> ... </title> <addr> ... </addr> <title> ... </title> <addr> ... </addr> </author> </authorInfo>
75Example Q4
Find all article pairs by the same author (without duplicates).
- <article-pairs>
- {
- for $ar1 in doc("pub.xml")//article,
- $a1 in $ar1/author,
- $ar2 in doc("pub.xml")//article
- where (some $a2 in $ar2/author
- satisfies ($a2/name/lastname = $a1/name/lastname and $a2/name/firstname = $a1/name/firstname))
- and $ar1/title < $ar2/title
- return
- <article-pair>
- { $ar1/title }
- { $ar2/title }
- </article-pair>
- }
- </article-pairs>
Results: <article-pairs> <article-pair> title1, title2 </article-pair> </article-pairs>
76some, every, satisfies
- True:
- some $x in (1, 2, 3) satisfies $x = 1
- some $x in (1, 2, 3), $y in (5, 6, 7) satisfies $x + $y = 8
- every $x in (1, 2, 3), $y in (5, 6, 7) satisfies $x < $y
- False:
- some $x in (1, 2, 3), $y in (5, 6, 7) satisfies $x + $y = 20
- every $x in (1, 2, 3) satisfies $x = 1
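The quantified expressions correspond closely to existential and universal tests over every combination of bindings. As a Python analogy (not XQuery), `any` and `all` over `itertools.product` reproduce the five examples above; the variable names are chosen for this sketch.

```python
from itertools import product

xs, ys = (1, 2, 3), (5, 6, 7)

# some ... satisfies: true if at least one binding passes the test.
some_x_is_1 = any(x == 1 for x in xs)
some_sum_is_8 = any(x + y == 8 for x, y in product(xs, ys))
some_sum_is_20 = any(x + y == 20 for x, y in product(xs, ys))

# every ... satisfies: true only if all bindings pass the test.
every_x_below_y = all(x < y for x, y in product(xs, ys))
every_x_is_1 = all(x == 1 for x in xs)

print(some_x_is_1, some_sum_is_8, every_x_below_y, some_sum_is_20, every_x_is_1)
```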
77Problem
- Given:
- DTDs
- Collection of XML documents conforming to the DTDs
- Query:
- Based on DTD schemas
- Over the collection of XML documents, performing selections, joins, etc.
- Producing an XML result
78RDF
- http://www.w3.org/TR/REC-rdf-syntax (2/99)
- purpose: metadata for the Web
- help search engines
- syntax: in XML
- semantics: edge-labeled graphs
79Resource Description Framework (RDF)
- Even XML Schema does not provide the support for semantic interoperability required.
- For example, when two applications exchange information using XML, both agree on the use and intended meaning of the document structure.
- Must first build a model of the domain of interest, to clarify what kind of data is to be sent from the first application to the second.
- However, as XML Schema just describes a grammar, there are many different ways to encode a specific domain model into an XML Schema, thereby losing the direct connection from the domain model to the Schema.
80Resource Description Framework (RDF)
- The problem is compounded if a third application wishes to exchange information with the other two.
- It is not sufficient to map one XML Schema to another, since the task is not to map one grammar to another grammar, but to map objects and relations from one domain of interest to another.
- Three steps are required:
- reengineer the original domain models from the XML Schema
- define mappings between the objects in the domain models
- define translation mechanisms for the XML documents, for example using XSLT.
81Resource Description Framework (RDF)
- RDF is an infrastructure that enables encoding, exchange, and reuse of structured meta-data.
- This infrastructure enables meta-data interoperability through the design of mechanisms that support common conventions of semantics, syntax, and structure.
- RDF does not stipulate semantics for each domain of interest, but instead provides the ability for these domains to define meta-data elements as required.
- RDF uses XML as a common syntax for exchange and processing of meta-data.
82RDF Data Model
- The basic RDF data model consists of three objects:
- Resource: anything that can have a URI, e.g., a Web page, a number of Web pages, or a part of a Web page, such as an XML element.
- Property: a specific attribute used to describe a resource, e.g., the attribute Author may be used to describe who produced a particular XML document.
- Statement: consists of the combination of a resource, a property, and a value.
83RDF Data Model
- These components are known as the subject, predicate, and object of an RDF statement.
- Example statement:
- "Author of http://www.dh.co.uk/staff_list.xml is John White"
- <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:s="http://www.dh.co.uk/schema/">
- <rdf:Description about="http://www.dh.co.uk/staff_list.xml">
- <s:Author>John White</s:Author>
- </rdf:Description>
- </rdf:RDF>
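The statement above can be pulled back out of its XML encoding as a (subject, predicate, object) triple. The sketch below uses Python's standard `xml.etree.ElementTree` rather than an RDF library, and handles only this simple shape (a Description with an unqualified `about` attribute and literal-valued properties); `triples` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

DOC = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:s="http://www.dh.co.uk/schema/">
  <rdf:Description about="http://www.dh.co.uk/staff_list.xml">
    <s:Author>John White</s:Author>
  </rdf:Description>
</rdf:RDF>"""


def triples(rdf_xml):
    """Return (subject, predicate, object) for each property of each Description."""
    root = ET.fromstring(rdf_xml)
    out = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get("about")  # the resource being described
        for prop in desc:            # each child element is one property
            # ElementTree writes qualified names as {namespace}local-name.
            out.append((subject, prop.tag, prop.text))
    return out


statements = triples(DOC)
print(statements)
```

The predicate comes out as a full namespace-qualified name, which is what keeps it distinct from an Author property defined in some other schema.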
84RDF Data Model
- To store descriptive information about the
author, model author as a resource.
85RDF Schema
- Specifies information about classes in a schema, including properties (attributes) and relationships between resources (classes).
- The RDF Schema mechanism provides a basic type system for use in RDF models, analogous to XML Schema.
- Defines resources and properties such as rdfs:Class and rdfs:subClassOf that are used in specifying application-specific schemas.
- Also provides a facility for specifying a small number of constraints such as cardinality.
86RDF Metadata standard
- <rdf:Description about="www.mypage.com">
- <about> birds, butterflies, snakes </about>
- <author> <rdf:Description>
- <firstname> John </firstname>
- <lastname> Smith </lastname>
- </rdf:Description>
- </author>
- </rdf:Description>
87More RDF Examples
88RDF Terminology
statement
89More RDF Containers
- bag, sequence, alternative
- <rdf:Description> <a> <rdf:Bag>
- <rdf:li> s1 </rdf:li>
- <rdf:li> s2 </rdf:li>
- </rdf:Bag>
- </a>
- </rdf:Description>
90RDF Containers (contd)
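[Figure: graph of a Bag container. An edge labelled a leads to a node that has an rdf:type edge to Bag, an rdf:_1 edge to s1, and an rdf:_2 edge to s2.]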
91More RDF Higher Order Statements
- "the author of www.thispage.com says the topic of www.thatpage.com is environment"
RDF uses reification
92What/where is the Semantic Web?
- Machine-understandable information: the Semantic Web
- A new form of Web content that is meaningful to
computers will unleash a revolution of new
possibilities
93What/where is the Semantic Web?
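- Layered on top of the existing Web (just as HTTP is built on top of TCP, which is on top of IP, which is on top of the data-link layer).
[Figure: the layer stack, with the data-link layer, IP, and TCP at the bottom, annotated with the labels "solid implementations" and "research / vapourware".]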
94Layer 1 URI
- Everything is a Resource (people, books, the attribute "title" of an Amazon book object, Web pages, the concept "laziness", ...)
- Every resource has a unique identifier: its Uniform Resource Identifier
- e.g., the URI of a Web page is its URL
- e.g., the URI of my email address is mailto:nick_at_ucd.ie
- The owner of an object can pick any URI they want as long as it is unique. It often has URL-like syntax, but that is purely convention/arbitrary
95Layer 2 XML
- Use XML as the common formatting standard for encoding data.
- (Could invent a new format for every kind of data, but why bother?)
- Data: <book><title>War &amp; Peace</title></book>
- Meta-data: <taxonomy id="amazon"> <concept superclass="thing">book</concept> <attribute class="book">title</attribute> </taxonomy>
- Meta-meta-data: <ontology> <match><source from="amazon">title</source> <dest ont="fredhanna">name</dest></match> </ontology>
Danger/Warning: made-up syntax!!
96XML Schema
- An XML Schema document is an XML document that
defines a set of XML tags (and how they may be
used)
97XML Namespaces
- An XML document may use tags defined in more than one XML Schema document
- Namespace prefixes (xxx:yyy) are used to unambiguously point to the defining XML Schema document
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description about="http://www.cs.ucd.ie/staff/nick">
<dc:title>Nick's Home Page</dc:title>
</rdf:Description>
</rdf:RDF>
98Layer 3 RDF
- All data/knowledge/facts/opinions/information is expressed on the Semantic Web as Resource Description Framework statements
- Very simple language for making assertions
- Triple: (value) (attribute) (object)
- (nick_at_ucd.ie) (is email address of) (Nick Kushmerick)
- (0140444173) (is ISBN number of) (War & Peace)
- (field 5 of database A) (is a field of type) (postal code)
99Everything is XML
- Remember: (Nick's Home Page) (is title of) (http://www.cs.ucd.ie/staff/nick) is actually encoded as some very ugly XML:
<?xml version="1.0"?>
<!DOCTYPE rdf:RDF SYSTEM "http://purl.org/dc/schemas/dcmes-xml-20000714.dtd">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description about="http://www.cs.ucd.ie/staff/nick">
<dc:title>Nick's Home Page</dc:title>
</rdf:Description>
</rdf:RDF>
100Layer 4 Ontologies (RDF Schema)
- There are lots of common RDF attribute-sets for lots of common tasks
- e.g., Dublin Core (Ohio, sorry!) defines a few dozen standard attributes for asserting statements about documents: title, author, date, version, format, owner, ...
- But suppose you want to define your own concepts/attributes
- RDF Schema: a set of RDF tags for defining a new set of RDF tags (no, this isn't circular)
101RDF Schema for Dublin Core Ontology
102Layer 4½ Mapping Between Ontologies
- Taxonomy Crisis
- How can your agent know that my "title" is your "name"?!
- How can my agent know that some of your "address" objects are post-boxes, not physical addresses?!
- How can my agent know that many Asian first names correspond to Western surnames?
- Semantic Web solution: services for translating/mapping between related ontologies.
- Suppose Amazon.com uses Dublin Core ("title"), while Fred Hanna uses its own document ontology ("name"). So far my agent is forced to choose an ontology, or must be carefully crafted to understand both languages
- A better solution: a niche now exists for an independent entity (UniversalBookInfo.com) that maps title to name, etc.
103without UniversalBookInfo.com
[Figure labels: Nick wants to buy War & Peace; Nick's very complicated agent; Programmer's bank account; Amazon ontology; Fred Hanna ontology; Amazon; Fred Hanna]
104with UniversalBookInfo.com
[Figure labels: Nick wants to buy War & Peace; Nick's agent; Joe's agent; Jane's agent; UniversalBookInfo.com; Amazon; Fred Hanna; Bank Account]
105Layer 5 Logic
DAML+OIL
- Ontologies also allow axioms
- "All people have brains"
- Expressiveness: the key challenge in formalizing axioms is being able to say anything you need to in a particular domain.
- "All people have brains, except George Bush."
- But more expressive logics mean slower inference
- Intuitively, applying a rule such as "You can't fool all of the people all of the time" could require checking everyone in the universe to determine if there exists even one foolable person.
106Integrating Services
- Sources can be services rather than data repositories
- E.g., Amazon as a composite service for book buying
- Separating line is somewhat thin
- Handling services
- Description (API; I/O spec)
- WSDL
- Composition
- Planning in general
- Execution
- Data-flow architectures
- See next part
107Who will annotate the data?
- The Semantic Web works if users annotate their pages using some existing ontology (or their own ontology, but with a mapping to other ontologies)
- But users typically do not conform to standards
- and are not patient enough for delayed gratification
- The way to force them is to act as if you are helping them write Web pages
- Currently most people don't write their own HTML code; the MS FrontPages and Claris Home Pages of the world do
- What if we change MS FrontPage/Claris Home Page so that they (slyly) add annotations?
- E.g., the Mangrove project at U. Wash.
- Help users tag their data (allow graphical editing)
- Provide instant gratification by running services that use the tags.
108Layer 6 Proofs
ugly XML encoding
Proof Verifier
Yes this proof is correct No this proof is flawed
(Easy to build once the Logic layer is fixed)
109Proofs Huh?!??!
ugly XML encoding
I would like to buy this book; please send my company an invoice
I am an employee of XYZ Corp (because it says so on this Web page, which is an XYZ Corp official document)
OK, book successfully ordered
Proof Verifier
Yes this proof is correct No this proof is flawed
(Easy to build once the Logic layer is fixed)
Sorry, we need a credit card!
110Proofs ? Trust
- In the Semantic Web, a "proof" is a procedure that can be automatically followed in order to verify an assertion.
- Believability is always relative to a set of resources that you trust
- "I own bank account 239489248234, because my Digital Signature XXXX matches the record on Web page http://bank.com/accounts, and you trust this page because you own bank.com"
111Summary
- A distributed global information ecosystem enables a wide variety of value-added information services (monitoring your online purchases, finding entertainment in which you might be interested, scheduling appointments, ...)
- But doing so is difficult or impossible if the relevant data is tied up in legacy documents intended for human eyes and common sense
- The Semantic Web as Global Database/Brain for All Humanity: probably hopelessly futile
- But within sufficiently motivated (i.e., rich) segments of the Web, today's Syntactic Web may well evolve into A Semantic Web
Rose colored glasses are never made in bi-focals because nobody wants to read small print in dreams