Title: Semi-structured Data
1Semi-structured Data
2Facts about the Web
- Growing fast
- Popular
- Semi-structured data
- Data is presented for human-processing
- Data is often self-describing (the names of the attributes are included within the data fields)
3Vision for Web data
- Object-like: it can be represented as a collection of objects of the form described by the conceptual data model
- Schemaless: not conforming to any type structure
- Self-describing: necessary for machine-readable data
4Facts about database systems
- Integration of databases with different schemas is often needed
- Sharing information between different databases on the World Wide Web is becoming more and more important for business
5Semi-structured data
- Bridging different data models (relational, object-oriented)
6Semi-structured data representation
- A database of semi-structured data is a graph with
- A set of nodes, each of which is either a leaf or an interior node
- Each interior node has a set of arcs coming out of it, each connecting it with another node; each arc has a label
- A root that does not have an arc entering it. Every node must be reachable from the root.
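The graph definition above can be sketched in code. This is a minimal Python sketch, assuming a hypothetical encoding in which each interior node maps to a list of (label, target) arcs and each leaf maps to an atomic value. The node names (`root`, `cf`, `mh`, `mv`, `cf_name`, ...) and the helper functions are illustrative and not part of the slides.

```python
from collections import deque

# Hypothetical encoding (not from the slides): interior nodes map to their
# outgoing arcs as (label, target) pairs; leaf nodes map to atomic values.
ARCS = {
    "root": [("star", "cf"), ("star", "mh"), ("movie", "mv")],
    "cf": [("name", "cf_name"), ("starsIn", "mv")],
    "mh": [("name", "mh_name"), ("starsIn", "mv")],
    "mv": [("title", "mv_title"), ("year", "mv_year"),
           ("starOf", "cf"), ("starOf", "mh")],
}
LEAVES = {
    "cf_name": "Carrie Fisher",
    "mh_name": "Mark Hamill",
    "mv_title": "Star Wars",
    "mv_year": 1977,
}


def reachable_from(root, arcs):
    """Return the set of nodes reachable from root by following arcs."""
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for _label, target in arcs.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def is_semistructured_db(root, arcs, leaves):
    """Check the root conditions: no arc enters the root, and every node
    (interior or leaf) is reachable from it."""
    all_nodes = set(arcs) | set(leaves)
    targets = {target for out in arcs.values() for _label, target in out}
    return root not in targets and reachable_from(root, arcs) == all_nodes


def follow(node, label, arcs):
    """Return the targets of all arcs leaving node that carry the given label."""
    return [target for arc_label, target in arcs.get(node, [])
            if arc_label == label]
```

For instance, `follow("root", "star", ARCS)` yields the two star nodes, and choosing `cf` as the root fails the check because arcs enter it.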
7Example of semi-structured data representing a movie and stars
[Figure: rooted graph. Node labels: Root, cf, mh, mv. Arc labels: star, movie, name, address, street, city, title, year, starsIn, starOf. Leaf values: Mark Hamill, Carrie Fisher, Oak, Maple, Locust, Hwood, Malibu, Star Wars, 1977.]
8Information integration via semi-structured data
- Simple
- Semi-structured data as an interface between users of different databases (with different schemas)
[Figure: block diagram with the boxes User, Interface, DB1, DB2, Application of DB1, and Application of DB2.]
9XML Overview
- Simplifies data exchange between software agents
- Popular thanks to the involvement of the W3C (World Wide Web Consortium, an independent organization, www.w3c.org)
10XML Characteristics
- Simple, open, widely accepted
- HTML-like (tags) but extensible by users (no fixed set of tags)
- No predefined semantics for the tags (because XML was not developed for display purposes)
- Semantics is defined by a stylesheet (later)
11XML Documents
- User-defined tags
- <tag> info </tag>
- Properly nested: <tag1> .. <tag2> </tag1> </tag2> is not valid
- Root element: an element that contains all other elements
- Processing instructions: <?command ...?>
- Comments: <!-- comment -->
- CDATA type
- DTD
12XML element
- Begins with an opening tag of the form
- <XML_element_name>
- Ends with a closing tag
- </XML_element_name>
- The text between the opening tag and the closing tag is called the content of the element
13XML element
(Slide callouts: Star element, Name element)
<Star-Movie-Data>
  <Star>
    <Name>Carrie Fisher</Name>
    <Address><Street>123 Maple St.</Street><City>Hollywood</City></Address>
    <Address><Street>5 Locust Ln.</Street><City>Malibu</City></Address>
  </Star>
  <Star>
    <Name>Mark Hamill</Name>
    <Address><Street>456 Oak Rd.</Street><City>Brentwood</City></Address>
  </Star>
  <Movie>
    <Title>Star Wars</Title><Year>1977</Year>
  </Movie>
</Star-Movie-Data>
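To illustrate how such nested elements can be read by a program, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. The document is retyped as a string (with the year written as 1977 and the street as Locust, matching the graph example earlier); the helper name `cities_by_star` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The star/movie document, retyped as a Python string.
DOC = """
<Star-Movie-Data>
  <Star>
    <Name>Carrie Fisher</Name>
    <Address><Street>123 Maple St.</Street><City>Hollywood</City></Address>
    <Address><Street>5 Locust Ln.</Street><City>Malibu</City></Address>
  </Star>
  <Star>
    <Name>Mark Hamill</Name>
    <Address><Street>456 Oak Rd.</Street><City>Brentwood</City></Address>
  </Star>
  <Movie>
    <Title>Star Wars</Title><Year>1977</Year>
  </Movie>
</Star-Movie-Data>
"""

root = ET.fromstring(DOC)


def cities_by_star(root):
    """Map each star's name to the list of cities in that star's addresses."""
    result = {}
    for star in root.findall("Star"):
        name = star.findtext("Name").strip()
        result[name] = [address.findtext("City").strip()
                        for address in star.findall("Address")]
    return result
```

Note that a star may carry any number of Address children without any schema change, which is the flexibility the semi-structured model is meant to provide.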
14XML element
(Slide callouts: name is the attribute; "Carrie Fisher" is the value of the attribute)
<Star-Movie-Data>
  <Star name="Carrie Fisher">
    ...
  </Star>
  ...
</Star-Movie-Data>
15Relationship between XML elements
- Child-parent relationship
- Elements nested directly in an element are the children of this element (Student is a child of PersonList, Name is a child of Student, etc.)
- Ancestor/descendant relationship: important for querying XML documents (extends the child/parent relationship)
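The difference between the two relationships can be sketched with Python's `xml.etree.ElementTree`. The small PersonList document below is hypothetical (only the tag names come from the slide), as are the helper names.

```python
import xml.etree.ElementTree as ET

# Hypothetical document that uses the tag names mentioned on the slide.
root = ET.fromstring(
    "<PersonList>"
    "<Student><Name>XYZ PQR</Name></Student>"
    "<Student><Name>ABC DEF</Name></Student>"
    "</PersonList>"
)


def child_tags(elem):
    """Tags of the elements nested directly inside elem."""
    return [child.tag for child in elem]


def descendant_tags(elem):
    """Tags of all elements nested at any depth inside elem, in document order."""
    return [d.tag for d in elem.iter() if d is not elem]
```

In ElementTree's path syntax, `root.findall("Name")` looks only at children and finds nothing here, while `root.findall(".//Name")` searches all descendants and finds both Name elements.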
16XML elements Database Objects
- XML elements can be converted into objects by considering the tag names of the children as attributes of the objects
- Recursive process
<Student StudentID="123"> <Name> XYZ PQR </Name> <CrsTaken> <CrsName>CS582</CrsName> <Grade>A</Grade> </CrsTaken> </Student>
Partially converted object:
(099, Name: "XYZ PQR", CrsTaken: <CrsName>CS582</CrsName> <Grade>A</Grade>)
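The recursive conversion can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the slides' own algorithm: objects are represented as dicts, XML attributes become ordinary fields, leaf elements become stripped strings, repeated child tags are collected in a list, and no object identifier (such as 099) is generated. The function name `to_object` is hypothetical.

```python
import xml.etree.ElementTree as ET


def to_object(elem):
    """Recursively convert an element into a dict-based object.

    Assumptions: leaf elements without attributes become stripped strings,
    XML attributes become fields, and repeated child tags are gathered in a
    list. Text interleaved with child elements is ignored.
    """
    if len(elem) == 0 and not elem.attrib:
        return (elem.text or "").strip()
    obj = dict(elem.attrib)
    for child in elem:
        value = to_object(child)  # recursive step
        if child.tag in obj:
            if not isinstance(obj[child.tag], list):
                obj[child.tag] = [obj[child.tag]]
            obj[child.tag].append(value)
        else:
            obj[child.tag] = value
    return obj


student = ET.fromstring(
    '<Student StudentID="123"><Name> XYZ PQR </Name>'
    "<CrsTaken><CrsName>CS582</CrsName><Grade>A</Grade></CrsTaken></Student>"
)
```

Stopping the recursion after the first level would leave CrsTaken as unconverted XML, which corresponds to the partially converted object shown above.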
17XML elements Database Objects
- Differences: additional text within XML elements
<Student StudentID="123"> <Name> XYZ PQR </Name> has taken the following course <CrsTaken> Database management system II <CrsName>CS582</CrsName> with the grade <Grade>A</Grade> </CrsTaken> </Student>
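A short Python sketch shows where this additional text ends up when the document is parsed. In `xml.etree.ElementTree`, text before the first child is stored in the parent's `text` and text following a child is stored in that child's `tail`; the helper name `free_text` is hypothetical.

```python
import xml.etree.ElementTree as ET

student = ET.fromstring(
    '<Student StudentID="123"><Name>XYZ PQR</Name>'
    " has taken the following course "
    "<CrsTaken>Database management system II "
    "<CrsName>CS582</CrsName> with the grade "
    "<Grade>A</Grade></CrsTaken></Student>"
)


def free_text(elem):
    """Collect the text that sits directly inside elem but outside its children."""
    pieces = [elem.text or ""] + [child.tail or "" for child in elem]
    return [piece.strip() for piece in pieces if piece.strip()]
```

An object with only named attributes has no natural place for these fragments, which is why mixed content does not map cleanly onto database objects.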
18XML elements Database Objects
- Differences: XML elements are ordered
<CrsTaken> <CrsName>CS582</CrsName> <Grade>A</Grade> </CrsTaken>
<CrsTaken> <Grade>A</Grade> <CrsName>CS582</CrsName> </CrsTaken>
(901, Grade: "A", CrsName: "CS582")
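The point about ordering can be checked with a small Python sketch: the two CrsTaken elements differ as ordered sequences of children, but collapse to the same object once the children are treated as unordered attributes. The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

first = ET.fromstring(
    "<CrsTaken><CrsName>CS582</CrsName><Grade>A</Grade></CrsTaken>")
second = ET.fromstring(
    "<CrsTaken><Grade>A</Grade><CrsName>CS582</CrsName></CrsTaken>")


def as_sequence(elem):
    """XML view: children form an ordered list of (tag, text) pairs."""
    return [(child.tag, child.text) for child in elem]


def as_object(elem):
    """Object view: children become attributes, whose order is irrelevant."""
    return {child.tag: child.text for child in elem}
```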
19XML Attributes
- Can occur within an element (arbitrarily many attributes, order unimportant, each attribute at most once)
- Allow a more concise representation
- Could be replaced by elements
- Less powerful than elements (only string values, no children)
- Can be declared to have a unique value, which is good for integrity constraint enforcement (next slide)
20XML Attributes
- Can be declared to be of type ID, IDREF, or IDREFS
- ID: unique value throughout the document
- IDREF: refers to a valid ID declared in the same document
- IDREFS: space-separated list of references to valid IDs
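The constraints that these attribute types express can be sketched as a hand-written check. This is a minimal Python sketch, not a DTD validator: the standard library's `xml.etree.ElementTree` does not validate, so the sets of ID and reference attributes are hard-coded here as an assumption (the attribute names are borrowed from the star/movie DTD later in these slides), and the function name `check_references` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: in a real setting these sets would be derived from the DTD's
# ATTLIST declarations; here they are hard-coded for illustration.
ID_ATTRS = {"starID", "movieID"}
REF_ATTRS = {"starredIn", "starsOf"}  # IDREF is the single-token case of IDREFS


def check_references(root):
    """Return a list of problems: duplicate IDs and dangling references."""
    problems = []
    seen = set()
    # First pass: every ID value must be unique throughout the document.
    for elem in root.iter():
        for attr in ID_ATTRS & set(elem.attrib):
            value = elem.attrib[attr]
            if value in seen:
                problems.append(f"duplicate ID {value}")
            seen.add(value)
    # Second pass: every reference must name an ID declared in the document.
    for elem in root.iter():
        for attr in REF_ATTRS & set(elem.attrib):
            for ref in elem.attrib[attr].split():
                if ref not in seen:
                    problems.append(f"dangling reference {ref}")
    return problems
```

Two passes are used so that a reference may point to an element that appears later in the document.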
21Well-formed XML Document
- It has a root element
- Every opening tag is followed by a matching closing tag; elements are properly nested
- Any attribute can occur at most once in a given opening tag; its value must be provided and quoted
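These conditions can be tested with any XML parser, since a conforming parser rejects documents that are not well-formed. A minimal Python sketch, assuming the standard library's `xml.etree.ElementTree` as the parser (the function name `is_well_formed` is hypothetical):

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts text as a well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


EXAMPLES = {
    '<a><b x="1">hi</b></a>': "well-formed",
    "<a><b></a></b>": "elements not properly nested",
    '<a x="1" x="2"/>': "attribute occurs twice in one opening tag",
    "<a x=1/>": "attribute value not quoted",
    "<a/><b/>": "more than one root element",
}
```

Well-formedness is independent of any DTD: it only concerns the syntax of tags and attributes, not which tags are allowed.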
22Document Type Definition
- Set of rules (by the user) for structuring an XML document
- Can be part of the document itself, or can be specified via a URL where the DTD can be found
- A document that conforms to a DTD is said to be valid
- Viewed as a grammar that specifies a legal XML document, based on the tags used in the document
23DTD Components
- A name: must coincide with the tag of the root element of the document conforming to the DTD
- A set of ELEMENTs: one ELEMENT for each allowed tag, including the root tag
- ATTLIST statements: specify the allowed attributes and their types for each tag
- *, +, ?: as in grammar definitions
- *: zero or finitely many
- +: at least one
- ?: zero or one
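Because these operators behave as they do in grammars and regular expressions, their meaning can be sketched by matching a sequence of child tags against a regular expression. This is an illustrative Python sketch only; the content models and the helper `matches_model` are hypothetical, and the models are translated to regular expressions by hand.

```python
import re


def matches_model(child_tags, model):
    """Check a sequence of child tags against a content model that has been
    written by hand as a regular expression over space-terminated tag names."""
    sequence = "".join(tag + " " for tag in child_tags)
    return re.fullmatch(model, sequence) is not None


# Hypothetical content models and their regular-expression translations.
ZERO_OR_MORE = r"(Person )*"    # (Person*)
ONE_OR_MORE = r"(Person )+"     # (Person+)
OPTIONAL = r"Title (Person )?"  # (Title, Person?)
```

For example, an empty list of children satisfies `(Person*)` but not `(Person+)`, and `(Title, Person?)` allows at most one Person after the Title.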
24DTD Components Element
- <!ELEMENT Name definition>
- Name: name of the element
- definition: type, element list, etc.
- definition can be EMPTY, (#PCDATA), or an element list (e1,e2,...,en), where the list (e1,e2,...,en) can be shortened using grammar-like notation
25DTD Components Element
- <!ELEMENT Name (e1,...,en)>
- Name: name of the element
- e1: 1st element
- en: nth element
- <!ELEMENT PersonList (Title,Contents)>
- <!ELEMENT Contents (Person*)>
26DTD Components Element
- <!ELEMENT Name EMPTY>
- no child for the element Name
- <!ELEMENT Name (#PCDATA)>
- value of Name is a character string
- <!ELEMENT Title EMPTY>
- <!ELEMENT Id (#PCDATA)>
27DTD Components Attribute List
- <!ATTLIST EName Att Type Property>
where
- EName: name of an element defined in the DTD
- Att: attribute name allowed to occur in the opening tag of EName
- Type: might or might not be there; specifies the type of the attribute (CDATA, ID, IDREF, IDREFS)
- Property: either #REQUIRED or #IMPLIED
28A simple DTD for the movie and star database (no integrity constraints)
<!DOCTYPE STARS [
  <!ELEMENT STARS (STAR*)>
  <!ELEMENT STAR (NAME, ADDRESS, MOVIES)>
  <!ELEMENT NAME (#PCDATA)>
  <!ELEMENT ADDRESS (STREET, CITY)>
  <!ELEMENT STREET (#PCDATA)>
  <!ELEMENT CITY (#PCDATA)>
  <!ELEMENT MOVIES (MOVIE*)>
  <!ELEMENT MOVIE (TITLE, YEAR)>
  <!ELEMENT TITLE (#PCDATA)>
  <!ELEMENT YEAR (#PCDATA)>
]>
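To show what "conforming to a DTD" means for a whole document, the element declarations of this DTD can be translated by hand into a table of content models and checked recursively. This is a rough Python sketch under stated assumptions, not a real DTD validator: the standard library does not validate against DTDs, the table `MODELS` is a manual translation, and STARS and MOVIES are assumed to hold zero or more STAR and MOVIE children respectively. The function name `conforms` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of the element declarations into regular expressions over
# space-terminated child tags. None stands for a (#PCDATA) element.
# Assumption: STARS and MOVIES allow zero or more STAR / MOVIE children.
MODELS = {
    "STARS": r"(STAR )*",
    "STAR": r"NAME ADDRESS MOVIES ",
    "NAME": None,
    "ADDRESS": r"STREET CITY ",
    "STREET": None,
    "CITY": None,
    "MOVIES": r"(MOVIE )*",
    "MOVIE": r"TITLE YEAR ",
    "TITLE": None,
    "YEAR": None,
}


def conforms(elem):
    """Recursively check that elem and all its descendants match MODELS."""
    if elem.tag not in MODELS:
        return False  # tag not declared
    model = MODELS[elem.tag]
    children = list(elem)
    if model is None:
        return not children  # character data only, no child elements
    sequence = "".join(child.tag + " " for child in children)
    if re.fullmatch(model, sequence) is None:
        return False
    return all(conforms(child) for child in children)
```

A production system would rely on a validating parser instead; the sketch only makes the grammar reading of a DTD concrete.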
29A DTD for the movie and star database with attributes and integrity constraints
<!DOCTYPE STARS-MOVIES [
  <!ELEMENT STARS-MOVIES (STAR*, MOVIE*)>
  <!ELEMENT STAR (NAME, ADDRESS)>
  <!ATTLIST STAR starID ID starredIn IDREF>
  <!ELEMENT NAME (#PCDATA)>
  <!ELEMENT ADDRESS (STREET, CITY)>
  <!ELEMENT STREET (#PCDATA)>
  <!ELEMENT CITY (#PCDATA)>
  <!ELEMENT MOVIE (TITLE, YEAR)>
  <!ATTLIST MOVIE movieID ID starsOf IDREF>
  <!ELEMENT TITLE (#PCDATA)>
  <!ELEMENT YEAR (#PCDATA)>
]>
30Homework 5 (Due Oct 23)
- 4.2.3 (Pg 146, complete book) (10pt)
- 4.4.1 (part c, Pg 164, complete book) (10pt)
- 4.5.4 (Pg 172, complete book) (10pt)