Title: Models and languages for semistructured data
1 Models and languages for semistructured data
- Bridging documents and databases
2 Lectures
- 1. Introduction to data models
- 2. Query languages for relational databases
- 3. Models and query languages for object databases
- 4. Models and query languages for semistructured data, XML
- 5. Embedded query languages
- 6. Guest lecture on Object Role Modelling
3 Why do we like types?
- Types facilitate understanding
- Types enable compact representations
- Types enable query optimisation
- Types facilitate consistency enforcement
4 Background assumptions for typed data
- The data is stable over time
- An organisational body controls the data
- Exercise: Give an example of a context where these assumptions do not hold.
5 Semistructured data
Semistructured data is schemaless and self-describing: the data and the description of the data are integrated.
6 An example
{name: {first: "John", last: "Smith"}, tel: 112233, email: "john_at_123.edu"}
7 Another example
{person: &o1{name: "Eva", age: 40, child: &o2}, person: &o2{name: "Abel", age: 20}}
An object identifier, such as &o1, placed before a structure binds the object identifier to the identity of that structure. The object identifier can then be used to refer to the structure.
8 Terminology
- The following is an ssd-expression:
- &o1{name: "Eva", age: 40, child: &o2}
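One possible way to hold such an ssd-expression in a program is sketched below in Python. The encoding is an assumption made for illustration: a structure is a list of (label, value) pairs, because a label may occur more than once, and object identifiers live in a separate table. The names `Ref`, `objects`, `deref` and `children` are hypothetical helpers, not part of any standard notation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """A reference to an object identifier such as o2."""
    oid: str


# Object table: each identifier is bound to one structure.
objects = {
    "o1": [("name", "Eva"), ("age", 40), ("child", Ref("o2"))],
    "o2": [("name", "Abel"), ("age", 20)],
}

# The top-level expression with two person edges.
db = [("person", Ref("o1")), ("person", Ref("o2"))]


def deref(value):
    """Follow a reference to the structure it identifies; other values pass through."""
    return objects[value.oid] if isinstance(value, Ref) else value


def children(value, label):
    """All values reachable from `value` over an edge carrying `label`."""
    return [deref(v) for (l, v) in deref(value) if l == label]


eva = children(db, "person")[0]
child_names = [children(c, "name")[0] for c in children(eva, "child")]
print(child_names)  # ['Abel']
```

Keeping identifiers in a table means that the same structure can be referred to from several places without being copied.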
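9 A database
[Figure: the example database drawn as an edge-labelled graph. The root db has a biblio edge to a node with three children: n1 (paper; authors Crick and Wallace, title DNA spiral, date 1956), n2 (book; author Darwin, title Origin, date 1848) and n3 (book; author Marx, title Kapital, date 1860).]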
10 Path expressions
- A path expression is a sequence of labels:
- l1.l2. ... .ln
- A path expression results in a set of nodes
- Path properties are specified by regular expressions on two levels: on the alphabet of labels, and on the alphabet of characters that make up labels
11 A path expression
biblio.book.author
[Figure: the example database graph from slide 9.]
12 A path expression
biblio.(book | paper).author
[Figure: the example database graph from slide 9.]
13 Examples of path expressions
- biblio.book.author - authors of books
- biblio.paper.author - authors of papers
- biblio.(book | paper).author - authors of books or papers
- biblio._.author - authors of anything
- biblio._*.author - nodes at the ends of paths starting with biblio, ending with author, and having an arbitrary sequence of labels in between
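To make the evaluation of such path expressions concrete, here is a minimal Python sketch. It assumes the example database is a tree stored as nested lists of (label, value) pairs, and it encodes a path as a list of steps: a label, a set of alternative labels, `"_"` for any single label, or `"_*"` for any sequence of labels. The names `DB`, `edges`, `descendants` and `evaluate` are hypothetical, and the result is a list rather than a true set.

```python
# The example database as nested lists of (label, value) pairs.
DB = [("biblio", [
    ("paper", [("author", "Crick"), ("author", "Wallace"),
               ("title", "The spiral DNA"), ("date", 1956)]),
    ("book", [("author", "Darwin"), ("title", "Origin"), ("date", 1848)]),
    ("book", [("author", "Marx"), ("title", "Kapital"), ("date", 1860)]),
])]


def edges(node):
    """Outgoing (label, child) edges; atomic values have none."""
    return node if isinstance(node, list) else []


def descendants(node):
    """The node itself plus everything below it (tree-shaped data only)."""
    found = [node]
    for _, child in edges(node):
        found.extend(descendants(child))
    return found


def evaluate(path, root):
    """Return the nodes at the ends of all paths from `root` that match `path`."""
    nodes = [root]
    for step in path:
        if step == "_*":
            nodes = [d for n in nodes for d in descendants(n)]
        else:
            nodes = [child for n in nodes for label, child in edges(n)
                     if step == "_" or label == step
                     or (isinstance(step, (set, frozenset)) and label in step)]
    return nodes


print(evaluate(["biblio", "book", "author"], DB))             # ['Darwin', 'Marx']
print(evaluate(["biblio", {"book", "paper"}, "author"], DB))  # all four authors
print(evaluate(["biblio", "_*", "author"], DB))               # all four authors
```

A graph with cycles or shared nodes would need a visited set in `descendants`; the sketch leaves that out to stay short.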
14 Example of a label pattern
- ((b | B)ook | (a | A)uthor)(s)? - matches book, Book, author, Author, books, Books, authors, Authors
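The label pattern above can be tried out with an ordinary regular expression library. The sketch below uses Python's `re` module and requires the whole label to match; the list of candidate labels is made up for illustration.

```python
import re

# The label pattern from the slide, written without the extra spaces.
pattern = re.compile(r"((b|B)ook|(a|A)uthor)(s)?")

# Hypothetical labels to test against the pattern.
labels = ["book", "Books", "authors", "title", "bookshelf", "Author"]

# fullmatch succeeds only if the entire label fits the pattern.
matching = [label for label in labels if pattern.fullmatch(label)]
print(matching)  # ['book', 'Books', 'authors', 'Author']
```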
15 An exercise
- biblio._.author.(s | Section)
- Which of the following paths match the path expression above?
- 1. Biblio.author.Section
- 2. Biblio.cat.rat.hat.author.section
- 3. Biblio.author
- 4. Biblio.cat.author.section.Section
16 A simple query
- select author: X
- from biblio.book.author X
- Result:
- {author: "Darwin", author: "Marx"}
17 A query with a condition
- select row: X
- from biblio._ X
- where "Crick" in X.author
- Result:
- {row: {author: "Crick", author: "Wallace", date: 1956, title: "The spiral DNA"}}
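The two queries above can be read as nested loops: the from clause binds variables by following paths, the where clause filters, and the select clause builds the result. The following Python sketch mirrors that reading with list comprehensions. The list-of-pairs encoding of the database and the helper `follow` are assumptions made for this illustration.

```python
# The example database as nested lists of (label, value) pairs.
DB = [("biblio", [
    ("paper", [("author", "Crick"), ("author", "Wallace"),
               ("date", 1956), ("title", "The spiral DNA")]),
    ("book", [("author", "Darwin"), ("title", "Origin"), ("date", 1848)]),
    ("book", [("author", "Marx"), ("title", "Kapital"), ("date", 1860)]),
])]


def follow(node, label=None):
    """Children of `node` over edges with `label`; None plays the role of _."""
    if not isinstance(node, list):
        return []  # atomic values have no outgoing edges
    return [child for l, child in node if label is None or l == label]


# select author: X from biblio.book.author X
simple = [("author", x)
          for b in follow(DB, "biblio")
          for book in follow(b, "book")
          for x in follow(book, "author")]

# select row: X from biblio._ X where "Crick" in X.author
conditional = [("row", x)
               for b in follow(DB, "biblio")
               for x in follow(b)
               if "Crick" in follow(x, "author")]

print(simple)       # [('author', 'Darwin'), ('author', 'Marx')]
print(conditional)  # a single row holding the paper by Crick and Wallace
```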
18 Two exercises
- select row: {title: Y, date: Z}
- from biblio.paper X, X.title Y, X.date Z
- select row: {author: Y, date: Z}
- from biblio.book X, X.author Y, X.date Z
19 A database
select row: {title: Y, date: Z} from biblio.paper X, X.title Y, X.date Z
[Figure: the example database graph from slide 9.]
20 A database
[Figure: the example database graph from slide 9.]
21 Nested queries
- select row: (select author: Y
- from X.author Y)
- from biblio.book X
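A nested query corresponds to a comprehension inside a comprehension: the inner query is evaluated once for every binding of X. The Python sketch below assumes the two book nodes of the example database have already been bound by the outer from clause; the variable name `books` and the list-of-pairs encoding are illustrative assumptions.

```python
# The nodes bound to X by "from biblio.book X" in the example database.
books = [
    [("author", "Darwin"), ("title", "Origin"), ("date", 1848)],
    [("author", "Marx"), ("title", "Kapital"), ("date", 1860)],
]

# select row: (select author: Y from X.author Y) from biblio.book X
result = [("row", [("author", y) for label, y in x if label == "author"])
          for x in books]

print(result)
# [('row', [('author', 'Darwin')]), ('row', [('author', 'Marx')])]
```

Each book yields exactly one row, and a book with several authors would give a row with several author edges.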
22 Three exercises
- Which authors have written a book or a paper in 1992?
- Which authors have written a book together with Jones?
- Which authors have written both a book and a paper?
23 Expressing relations
r1:
a b c
1 2 3
3 2 2
4 3 1
r2:
b d e
1 1 3
3 4 2
2 3 1
{r1: {row: {a: 1, b: 2, c: 2}, row: {a: 1, b: 2, c: 2}, row: {a: 1, b: 2, c: 2}},
 r2: {row: {b: 1, d: 2, e: 2}, row: {b: 1, d: 2, e: 2}, row: {b: 1, d: 2, e: 2}}}
24 Expressing relational joins
- select a: A, d: D
- from r1.row X,
- r2.row Y,
- X.a A, X.b B, Y.b B', Y.d D
- where B = B'
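The join query binds X and Y to every pair of rows and keeps the pairs whose b values agree. A rough Python equivalent is sketched below. Rows are modelled as dicts here, on the assumption that each label occurs once per row, and the row values are illustrative only.

```python
# Illustrative rows for r1(a, b, c) and r2(b, d, e).
r1 = [{"a": 1, "b": 2, "c": 3},
      {"a": 3, "b": 2, "c": 2},
      {"a": 4, "b": 3, "c": 1}]
r2 = [{"b": 1, "d": 1, "e": 3},
      {"b": 3, "d": 4, "e": 2},
      {"b": 2, "d": 3, "e": 1}]

# select a: A, d: D
# from r1.row X, r2.row Y, X.a A, X.b B, Y.b B', Y.d D
# where B = B'
joined = [{"a": x["a"], "d": y["d"]}
          for x in r1
          for y in r2
          if x["b"] == y["b"]]

print(joined)  # [{'a': 1, 'd': 3}, {'a': 3, 'd': 3}, {'a': 4, 'd': 4}]
```

This is the naive nested-loop join; nothing in the query language itself prescribes that evaluation strategy.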
25 Label variables
- select L: X
- from biblio._.L X
- where matches(".*Shakespeare.*", X)
Here L is a label variable.
[Figure: a database graph in which biblio has two book children: n2 (title Macbeth, date 1622, author Shakespeare) and n3 (title Best of Shakespeare, date 1992, author Smith).]
26 Label variables
- select L: X
- from biblio._.L X
- where matches(".*Shakespeare.*", X)
- Result:
- {author: "Shakespeare", title: "Best of Shakespeare"}
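A label variable is bound to the label of an edge rather than to a node, so in a loop-based reading it is simply the first component of each (label, value) pair. The Python sketch below reproduces the query on the two-book database of these slides. The encoding is an assumption, and `matches` is approximated with `re.fullmatch` applied to string values only.

```python
import re

# The two-book database used for the label-variable example.
DB = [("biblio", [
    ("book", [("author", "Shakespeare"), ("title", "Macbeth"), ("date", 1622)]),
    ("book", [("author", "Smith"), ("title", "Best of Shakespeare"),
              ("date", 1992)]),
])]

# select L: X from biblio._.L X where matches(".*Shakespeare.*", X)
result = [(L, X)
          for top_label, biblio in DB if top_label == "biblio"
          for _, item in biblio            # the wildcard step _
          for L, X in item                 # L is bound to the edge label
          if isinstance(X, str) and re.fullmatch(".*Shakespeare.*", X)]

print(result)
# [('author', 'Shakespeare'), ('title', 'Best of Shakespeare')]
```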
27 Turning labels into data
- select publ: {type: L, author: A}
- from biblio.L X, X.author A
- Result:
- {publ: {type: "paper", author: "Crick"}, publ: {type: "paper", author: "Wallace"}, publ: {type: "book", author: "Darwin"}}
28 An exercise
- List all publications in 1992, their types, and titles.
29 Basic XML syntax
- XML is a textual representation of data
- An element is a piece of text bounded by matching tags:
- <name> John </name>
- <name></name> can be abbreviated as <name/>
30 Basic XML syntax
- Elements may contain subelements
- <person>
- <name> John </name>
- <tel> 112233 </tel>
- <email> john_at_123.edu </email>
- </person>
31 XML attributes
- An attribute is defined by a name-value pair within a tag
- <price currency="dollar"> 500 </price>
- <length unit="cm"> 25 </length>
32 XML attributes and elements
- <product>
- <name> widget </name>
- <price> 10 </price>
- </product>
- <product price="10">
- <name> widget </name>
- </product>
- <product name="widget" price="10"/>
33 XML and ssd-expressions
- <person>
- <name> John </name>
- <tel> 112233 </tel>
- <email> john_at_123.edu </email>
- </person>
{person: {name: "John", tel: 112233, email: "john_at_123.edu"}}
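The correspondence between an XML element and an ssd-expression can be sketched as a small recursive conversion. The Python example below uses the standard `xml.etree.ElementTree` module; the function name `to_ssd` and the (label, value) pair encoding are assumptions. Attributes are ignored and all leaf values stay strings, so `tel` comes out as `"112233"` rather than as a number.

```python
import xml.etree.ElementTree as ET


def to_ssd(element):
    """Convert an element into a (label, value) pair; leaves carry their text."""
    children = list(element)
    if not children:
        return (element.tag, (element.text or "").strip())
    return (element.tag, [to_ssd(child) for child in children])


xml_text = ("<person><name> John </name><tel> 112233 </tel>"
            "<email> john_at_123.edu </email></person>")
ssd = to_ssd(ET.fromstring(xml_text))
print(ssd)
# ('person', [('name', 'John'), ('tel', '112233'), ('email', 'john_at_123.edu')])
```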
34 XML references
- <person id="p1">
- <name> John </name>
- <tel> 112233 </tel>
- </person>
- <person id="p2">
- <name> Peter </name>
- <tel> 998877 </tel>
- <boss idref="p1"/>
- </person>
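References of this kind are not followed automatically by most parsers; an application typically builds an index from id values to elements. The Python sketch below shows one way to do that with `xml.etree.ElementTree`. The enclosing `people` root element is an assumption added so that the two persons form one well-formed document.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<people>
  <person id="p1"><name>John</name><tel>112233</tel></person>
  <person id="p2"><name>Peter</name><tel>998877</tel><boss idref="p1"/></person>
</people>""")

# Index every person by its id attribute.
by_id = {person.get("id"): person for person in doc.iter("person")}

# Follow the idref on Peter's boss element to the element it refers to.
peter = by_id["p2"]
boss = by_id[peter.find("boss").get("idref")]
print(boss.findtext("name"))  # John
```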
35 Document Type Definitions
- <!DOCTYPE db [
- <!ELEMENT db (person*)>
- <!ELEMENT person (name, age, email)>
- <!ELEMENT name (#PCDATA)>
- <!ELEMENT age (#PCDATA)>
- <!ELEMENT email (#PCDATA)>
- ]>
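The content models in a DTD are regular expressions over child element names, which suggests a simple way to check them by hand. The Python sketch below is not a DTD validator: it hard-codes the content models of the DTD above as regular expressions over space-terminated tag names, which is an encoding assumed for this illustration, and it ignores attributes and text. The names `CONTENT_MODELS` and `conforms`, and the sample documents, are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Content models rewritten as regexes over space-terminated child tag names.
CONTENT_MODELS = {
    "db": r"(person )*",          # (person*)
    "person": r"name age email ",  # (name, age, email)
    "name": r"",                  # (#PCDATA): no child elements
    "age": r"",
    "email": r"",
}


def conforms(element):
    """Check an element and its descendants against the content models."""
    model = CONTENT_MODELS.get(element.tag)
    child_tags = "".join(child.tag + " " for child in element)
    return (model is not None
            and re.fullmatch(model, child_tags) is not None
            and all(conforms(child) for child in element))


good = ET.fromstring("<db><person><name>Eva</name><age>40</age>"
                     "<email>eva_at_123.edu</email></person></db>")
bad = ET.fromstring("<db><person><age>40</age><name>Eva</name>"
                    "<email>eva_at_123.edu</email></person></db>")
print(conforms(good), conforms(bad))  # True False
```

The second document fails only because age comes before name, which shows that a content model such as (name, age, email) fixes the order of the subelements.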
36 An exercise on DTDs as schemas
- <db> <r1> <a> a1 </a> <b> b1 </b> </r1>
- <r1> <a> a2 </a> <b> b2 </b> </r1>
- <r2> <c> a1 </c> <d> b1 </d> </r2>
- <r2> <c> c2 </c> <d> d2 </d> </r2>
- <r3> <a> a1 </a> <c> b1 </c> </r3>
- </db>
- Write down a DTD for the data above!
37 Attributes in DTDs
- <product>
- <name language="Swedish" department="music">
- trumpet </name>
- <price currency="dollar"> 500 </price>
- <length unit="cm"> 25 </length>
- </product>
- <!ATTLIST name language CDATA #REQUIRED
-                department CDATA #IMPLIED>
- <!ATTLIST price currency CDATA #REQUIRED>
- <!ATTLIST length unit CDATA #REQUIRED>
38 Reference attributes in DTDs
- <!DOCTYPE people [
- <!ELEMENT people (person*)>
- <!ELEMENT person (name)>
- <!ELEMENT name (#PCDATA)>
- <!ATTLIST person id ID #REQUIRED
-                  boss IDREF #REQUIRED
-                  friends IDREFS #IMPLIED>
- ]>
39 An exercise
- <people>
- <person id="sven" boss="olle">
- <name> Sven Svensson </name>
- </person>
- <person id="olle" friends="nils eva">
- <name> Olle Olsson </name>
- </person>
- <person id="pelle" boss="nils eva">
- <name> Per Persson </name>
- </person>
- </people>
- Does this XML element conform to the previous DTD?
40 Limitations of DTDs as schemas
- DTDs impose order
- No base types
- The types of IDREFs cannot be constrained
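The last limitation means that a DTD can require a boss attribute to refer to some existing ID, but not that the target is a person. Such a constraint has to be checked by the application. The Python sketch below illustrates one such check; the document, including the `department` element, and the function name `ill_typed_bosses` are made up for this example.

```python
import xml.etree.ElementTree as ET

# A hypothetical document in which one boss reference points at a department.
doc = ET.fromstring(
    '<db>'
    '<person id="p1" boss="p2"/>'
    '<person id="p2" boss="d1"/>'
    '<department id="d1"/>'
    '</db>'
)


def ill_typed_bosses(root):
    """Ids of persons whose boss attribute does not refer to a person element."""
    by_id = {e.get("id"): e for e in root.iter() if e.get("id") is not None}
    bad = []
    for person in root.iter("person"):
        target = by_id.get(person.get("boss"))
        if target is None or target.tag != "person":
            bad.append(person.get("id"))
    return bad


print(ill_typed_bosses(doc))  # ['p2']
```

Both references are legal IDREF values as far as a DTD is concerned, since d1 is an existing ID; only the extra check notices that p2 has a department as its boss.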
41 XSL - Extensible Stylesheet Language
- <bib> <book> <title> t1 </title>
- <author> a1 </author>
- <author> a2 </author>
- </book>
- <paper>
- <title> t2 </title>
- <author> a3 </author>
- <author> a4 </author>
- </paper>
- <book> <title> t3 </title>
- <author> a5 </author>
- <author> a6 </author>
- </book>
- </bib>
42 Template rules and XSL patterns
- <xsl:template>
- <xsl:apply-templates/>
- </xsl:template>
- <xsl:template match="bib//title">
- <result>
- <xsl:value-of/>
- </result>
- </xsl:template>
Result:
<result> t1 </result> <result> t2 </result> <result> t3 </result>
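The effect of the second template rule can be imitated without an XSL processor: select every title below bib and wrap its text in a result element. The Python sketch below does this with the limited XPath support of `xml.etree.ElementTree`. It only mimics this one rule and is not a general XSL implementation.

```python
import xml.etree.ElementTree as ET

bib = ET.fromstring(
    "<bib>"
    "<book><title> t1 </title><author> a1 </author><author> a2 </author></book>"
    "<paper><title> t2 </title><author> a3 </author><author> a4 </author></paper>"
    "<book><title> t3 </title><author> a5 </author><author> a6 </author></book>"
    "</bib>"
)

# ".//title" selects all title descendants of the root, like the pattern bib//title.
results = [f"<result>{title.text}</result>" for title in bib.findall(".//title")]
print(" ".join(results))
# <result> t1 </result> <result> t2 </result> <result> t3 </result>
```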
43 Two exercises
- select row: {title: Y, date: Z}
- from biblio.paper X, X.title Y, X.date Z
- {row: {title: "The spiral DNA", date: 1956, title: "Origin", date: 1848, title: "Kapital", date: 1860}}
- select row: {author: Y, date: Z}
- from biblio.book X, X.author Y, X.date Z
44 Which authors have written a book or a paper in 1992?
- select author: X
- from biblio.(book | paper) Y, Y.author X
- where Y.date = 1992
45 Which authors have written a book together with Jones?
- select author: X
- from biblio.book Y, Y.author X
- where "Jones" in Y.author
46 Which authors have written both a book and a paper?
- select author: A
- from biblio.book B, biblio.paper P, B.author A
- where B.author = P.author
- select author: A1
- from biblio.book B, biblio.paper P, B.author A1, P.author A2
- where A1 = A2
47 List all publications in 1992, their types, and titles.
- select publ: {type: L, title: T}
- from biblio.L X, X.title T
- where X.date = 1992
48
- <!DOCTYPE db [
- <!ELEMENT db (r1*, r2*, r3*)>
- <!ELEMENT r1 (a, b)>
- <!ELEMENT r2 (c, d)>
- <!ELEMENT r3 (a, c)>
- <!ELEMENT a (#PCDATA)>
- <!ELEMENT b (#PCDATA)>
- <!ELEMENT c (#PCDATA)>
- <!ELEMENT d (#PCDATA)>
- ]>
- <db> <r1> <a> a1 </a> <b> b1 </b> </r1>
- <r1> <a> a2 </a> <b> b2 </b> </r1>
- <r2> <c> a1 </c> <d> b1 </d> </r2>
- <r2> <c> c2 </c> <d> d2 </d> </r2>
- <r3> <a> a1 </a> <c> b1 </c> </r3>
- </db>