Well-designed XML Data - PowerPoint PPT Presentation

About This Presentation
Title:

Well-designed XML Data

Description:

Insertion Anomaly. GB248. 3. Computer Organization I. CSC258. Numerical Methods. CSC336 ... Avoiding Update Anomalies. Database Systems. CSC434. Computer ... – PowerPoint PPT presentation

Number of Views:24
Avg rating:3.0/5.0
Slides: 88
Provided by: marcelo68
Category:
Tags: xml | anomaly | data | designed | well

less

Transcript and Presenter's Notes

Title: Well-designed XML Data


1
Well-designed XML Data
Marcelo Arenas and Leonid Libkin University of
Toronto
2
Outline
  • Part 1 - Database Normalization from the 1970s
    and 1980s.
  • Part 2 - Classical theory revisited normalizing
    XML documents.
  • Part 3 - Classical theory re-done new
    justifications for normalization.

2
3
Part 1 Classical Normalization
  • Design decide how to represent the information
    in a particular data model.
  • Even for simple application domains there is a
    large number of ways of representing the data of
    interest.
  • We have to design the schema of the database.
  • Set of relations.
  • Set of attributes for each relation.
  • Set of data dependencies.

3
4
Designing a Database An Example
  • Attributes number, title, section, room.
  • Data dependency every course number is
    associated with only one title.
  • Relational Schema

R(number, title, section, room), number ? title
BAD alternative
GOOD alternative S(number, title), number ? title
T(number, section, room), ?
4
5
Problems with BAD Update Anomaly
number title section room
CSC258 Computer Organization 1 LP266
CSC258 Computer Organization 2 GB258
CSC258 Computer Organization 3 GB248
CSC434 Database Systems 1 GB248
Title of CSC258 is changed to Computer
Organization I.
5
6
Problems with BAD Update Anomaly
number title section room
CSC258 Computer Organization 1 LP266
CSC258 Computer Organization 2 GB258
CSC258 Computer Organization 3 GB248
CSC434 Database Systems 1 GB248
Title of CSC258 is changed to Computer
Organization I.
5
7
Problems with BAD Update Anomaly
number title section room
CSC258 Computer Organization I 1 LP266
CSC258 Computer Organization I 2 GB258
CSC258 Computer Organization I 3 GB248
CSC434 Database Systems 1 GB248
Title of CSC258 is changed to Computer
Organization I.
The instance stores redundant information.
5
8
Deletion Anomaly
number title section room
CSC258 Computer Organization I 1 LP266
CSC258 Computer Organization I 2 GB258
CSC258 Computer Organization I 3 GB248
CSC434 Database Systems 1 GB248
CSC434 is not given in this term.
6
9
Deletion Anomaly
number title section room
CSC258 Computer Organization I 1 LP266
CSC258 Computer Organization I 2 GB258
CSC258 Computer Organization I 3 GB248
CSC434 Database Systems 1 GB248
CSC434 is not given in this term.
6
10
Deletion Anomaly
number title section room
CSC258 Computer Organization I 1 LP266
CSC258 Computer Organization I 2 GB258
CSC258 Computer Organization I 3 GB248
CSC434 is not given in this term.
Additional effect all the information about
CSC434 was deleted.
6
11
Insertion Anomaly
number title section room
CSC258 Computer Organization I 1 LP266
CSC258 Computer Organization I 2 GB258
CSC258 Computer Organization I 3 GB248
A new course is created (CSC336, Numerical
Methods)
7
12
Insertion Anomaly
number title section room
CSC258 Computer Organization I 1 LP266
CSC258 Computer Organization I 2 GB258
CSC258 Computer Organization I 3 GB248

A new course is created (CSC336, Numerical
Methods)
7
13
Insertion Anomaly
number title section room
CSC258 Computer Organization I 1 LP266
CSC258 Computer Organization I 2 GB258
CSC258 Computer Organization I 3 GB248
CSC336 Numerical Methods ? ?
A new course is created (CSC336, Numerical
Methods)
The instance stores attributes that are not
directly related.
7
14
Avoiding Update Anomalies
number title
CSC258 Computer Organization
CSC434 Database Systems
number section room
CSC258 1 LP266
CSC258 2 GB258
CSC258 3 GB248
CSC434 1 GB248
Title of CSC258 is changed to Computer
Organization I.
8
15
Avoiding Update Anomalies
number title
CSC258 Computer Organization
CSC434 Database Systems
number section room
CSC258 1 LP266
CSC258 2 GB258
CSC258 3 GB248
CSC434 1 GB248
Title of CSC258 is changed to Computer
Organization I.
8
16
Avoiding Update Anomalies
number title
CSC258 Computer Organization I
CSC434 Database Systems
number section room
CSC258 1 LP266
CSC258 2 GB258
CSC258 3 GB248
CSC434 1 GB248
Title of CSC258 is changed to Computer
Organization I.
CSC434 is not given in this term.
The instance does not store redundant information.
8
17
Avoiding Update Anomalies
number title
CSC258 Computer Organization I
CSC434 Database Systems
number section room
CSC258 1 LP266
CSC258 2 GB258
CSC258 3 GB248
CSC434 1 GB248
CSC434 is not given in this term.
8
18
Avoiding Update Anomalies
number title
CSC258 Computer Organization I
CSC434 Database Systems
number section room
CSC258 1 LP266
CSC258 2 GB258
CSC258 3 GB248
CSC434 is not given in this term.
A new course is created (CSC336, Numerical
Methods)
The title of CSC434 is not removed from the
instance.
8
19
Avoiding Update Anomalies
number title
CSC258 Computer Organization I
CSC434 Database Systems

number section room
CSC258 1 LP266
CSC258 2 GB258
CSC258 3 GB248
A new course is created (CSC336, Numerical
Methods)
8
20
Avoiding Update Anomalies
number title
CSC258 Computer Organization I
CSC434 Database Systems
CSC336 Numerical Methods
number section room
CSC258 1 LP266
CSC258 2 GB258
CSC258 3 GB248
A new course is created (CSC336, Numerical
Methods)
No information about sections has to be provided.
Each relation stores attributes that are directly
related.
8
21
Normalization Theory
  • Main idea a normal form defines a condition that
    a well designed database should satisfy.
  • Normal form syntactic condition on the database
    schema.
  • Defined for a class of data dependencies.
  • Main problems
  • How to test whether a database schema is in a
    particular normal form.
  • How to transform a database schema into an
    equivalent one satisfying a particular normal
    form.

9
22
Normalization Theory Today
  • Normalization theory for relational databases was
    developed in the 70s and 80s.
  • Why do we need normalization theory today?
  • New data models have emerged XML.
  • XML documents can contain redundant information.
  • Redundant information in XML documents
  • Can be discovered if the user provides semantic
    information.
  • Can be eliminated.

10
23
Part 2 XML and Normalization
XML Document
ltcoursesgt
ltcourse cnoCSC258gt
lttaken_bygt
ltstudent snost1gt
ltnamegt Fox lt/namegt
ltgradegt B lt/gradegt
lt/studentgt
lt/taken_bygt
lt/coursegt
lt/coursesgt
11
24
Part 2 XML and Normalization
XML Document
ltcoursesgt
ltcourse cnoCSC258gt
lttaken_bygt
ltstudent snost1gt
ltnamegt Fox lt/namegt
ltgradegt B lt/gradegt
lt/studentgt
lt/taken_bygt
lt/coursegt
lt/coursesgt
11
25
Part 2 XML and Normalization
XML Document
ltcoursesgt
ltcourse cnoCSC258gt
lttaken_bygt
ltstudent snost1gt
ltnamegt Fox lt/namegt
ltgradegt B lt/gradegt
lt/studentgt
lt/taken_bygt
lt/coursegt
lt/coursesgt
11
26
Part 2 XML and Normalization
XML Document
ltcoursesgt
ltcourse cnoCSC258gt
lttaken_bygt
ltstudent snost1gt
ltnamegt Fox lt/namegt
ltgradegt B lt/gradegt
lt/studentgt
lt/taken_bygt
lt/coursegt
lt/coursesgt
11
27
Part 2 XML and Normalization
XML Document
ltcoursesgt
ltcourse cnoCSC258gt
lttaken_bygt
ltstudent snost1gt
ltnamegt Fox lt/namegt
ltgradegt B lt/gradegt
lt/studentgt
lt/taken_bygt
lt/coursegt
lt/coursesgt
11
28
Part 2 XML and Normalization
XML Document
DTD
ltcoursesgt
ltcourse cnoCSC258gt
lttaken_bygt
ltstudent snost1gt
ltnamegt Fox lt/namegt
ltgradegt B lt/gradegt
lt/studentgt
lt/taken_bygt
lt/coursegt
lt/coursesgt
courses ? course
course ? _at_cno
course ? taken_by
taken_by ? student
student ? _at_sno
student ? name, grade
name ? PCDATA
grade ? PCDATA
11
29
Part 2 XML and Normalization
XML Document
DTD
ltcoursesgt
ltcourse cnoCSC258gt
lttaken_bygt
ltstudent snost1gt
ltnamegt Fox lt/namegt
ltgradegt B lt/gradegt
lt/studentgt
lt/taken_bygt
lt/coursegt
lt/coursesgt
courses ? course
course ? _at_cno
course ? taken_by
taken_by ? student
student ? _at_sno
student ? name, grade
name ? PCDATA
grade ? PCDATA
11
30
Part 2 XML and Normalization
XML Document
DTD
ltcoursesgt
ltcourse cnoCSC258gt

lt/coursegt
ltcourse cnoCSC434gt

lt/coursegt
lt/coursesgt
courses ? course
course ? _at_cno
course ? taken_by
taken_by ? student
student ? _at_sno
student ? name, grade
name ? PCDATA
grade ? PCDATA
11
31
Part 2 XML and Normalization
XML Document
DTD
ltcoursesgt
ltcourse cnoCSC258gt
lttaken_bygt
ltstudent snost1gt
ltnamegt Fox lt/namegt
ltgradegt B lt/gradegt
lt/studentgt
lt/taken_bygt
lt/coursegt
lt/coursesgt
courses ? course
course ? _at_cno
course ? taken_by
taken_by ? student
student ? _at_sno
student ? name, grade
name ? PCDATA
grade ? PCDATA
11
32
Part 2 XML and Normalization
XML Document
DTD
ltcoursesgt
ltcourse cnoCSC258gt
lttaken_bygt
ltstudent snost1gt
ltnamegt Fox lt/namegt
ltgradegt B lt/gradegt
lt/studentgt
lt/taken_bygt
lt/coursegt
lt/coursesgt
courses ? course
course ? _at_cno
course ? taken_by
taken_by ? student
student ? _at_sno
student ? name, grade
name ? PCDATA
grade ? PCDATA
11
33
Part 2 XML and Normalization
XML Document
DTD
ltcoursesgt
ltcourse cnoCSC258gt
lttaken_bygt
ltstudent snost1gt
ltnamegt Fox lt/namegt
ltgradegt B lt/gradegt
lt/studentgt
lt/taken_bygt
lt/coursegt
lt/coursesgt
courses ? course
course ? _at_cno
course ? taken_by
taken_by ? student
student ? _at_sno
student ? name, grade
name ? PCDATA
grade ? PCDATA
11
34
Part 2 XML and Normalization
XML Document
DTD
ltcoursesgt
ltcourse cnoCSC258gt
lttaken_bygt
ltstudent snost1gt
ltnamegt Fox lt/namegt
ltgradegt B lt/gradegt
lt/studentgt
lt/taken_bygt
lt/coursegt
lt/coursesgt
courses ? course
course ? _at_cno
course ? taken_by
taken_by ? student
student ? _at_sno
student ? name, grade
name ? PCDATA
grade ? PCDATA
11
35
Part 2 XML and Normalization
XML Document
DTD
ltcoursesgt
ltcourse cnoCSC258gt
lttaken_bygt
ltstudent snost1gt
ltnamegt Fox lt/namegt
ltgradegt B lt/gradegt
lt/studentgt
lt/taken_bygt
lt/coursegt
lt/coursesgt
courses ? course
course ? _at_cno
course ? taken_by
taken_by ? student
student ? _at_sno
student ? name, grade
name ? PCDATA
grade ? PCDATA
11
36
XML Databases
XML Schema (D, ?)
D
?
Two students with the same _at_sno value must have
the same name.
courses ? course
course ? _at_cno
course ? taken_by
taken_by ? student
student ? _at_sno
student ? name, grade
12
37
Redundancy in XML
courses
course
course
info
_at_cno
_at_cno
taken_by
taken_by
name
_at_sno
CSC258
CSC434
st1
Fox
student
student
student
. . .
_at_sno
name
grade
grade
name
_at_sno
st1
st1
A
B
Fox
Fox
13
38
XML Database Normalization
DTD
Data dependency
Two students with the same _at_sno value must have
the same name.
courses ? course
course ? _at_cno
course ? taken_by
taken_by ? student
student ? _at_sno
student ? name, grade
14
39
XML Database Normalization
DTD
Data dependency
, info
_at_sno is the identifier of info elements.
courses ? course
course ? _at_cno
course ? taken_by
taken_by ? student
student ? _at_sno
student ? grade
Two students with the same _at_sno value must have
the same name.
info ? _at_sno
info ? name
14
40
A Non-relational Example
DBLP
conf
conf
title
issue
issue
ICDT
_at_year
article
article
article
_at_year
1999
2001

_at_year
title
title
_at_year
author
_at_year
title
author
1999
1999
Dong
2001
Jarke
. . .
. . .
. . .
15
41
XNF XML Normal Form
  • It eliminates two types of anomalies.
  • It was defined for XML functional dependencies
  • DBLP.conf._at_title ? DBLP.conf
  • DBLP.conf.issue ? DBLP.conf.issue.article._at_year

16
42
Problems to Address
  • Functional dependencies for XML.
  • Normal form for XML documents (XNF).
  • Generalizes BCNF.
  • Algorithm for normalizing XML documents.
  • Implication problem for functional dependencies.

17
43
Framework Paths in DTDs
  • Paths(D) all paths in a DTD D
  • courses.course
  • courses.course._at_cno
  • courses.course.student.name
  • courses.course.student.name.S
  • We distinguish three kinds of elements
    attributes (_at_), strings (S) and element types.
  • FDs are defined by means of a relational
    representation of XML documents.

18
44
Framework XML Trees
courses
v0
course
course
v1
. . .
_at_cno
student
student
v2
v5
cs100
_at_sno
grade
_at_sno
grade
name
name
v3
v4
v6
v7
123
456
S
S
S
S
Fox
B
Smith
A-
19
45
Tree Tuples
  • Relational representation tree tuples - mappings
  • t Paths(D) ? Vertices ? Strings ? ?
  • A tree tuple represents an XML tree

courses
t(courses) v0 t(courses.course)
v1 t(courses.course._at_cno) cs100 t(courses.cour
se.student) v2 t(p) ?, for the remaining paths
v0
course
v1
_at_cno
student
v2
cs100
20
46
XML Tree set of Tree Tuples
courses
courses
v0
v0
course
course
course
course
v1
. . .
v1
. . .
_at_cno
student
_at_cno
student
student
student
v2
v5
v2
v5
cs100
cs100
_at_sno
grade
_at_sno
grade
_at_sno
grade
_at_sno
grade
name
name
name
name
v3
v4
v6
v7
v3
v4
v6
v7
123
456
123
456
S
S
S
S
S
S
S
S
Fox
B
Smith
A-
Fox
B
Smith
A-
21
47
Functional Dependencies for XML
  • Expressions of the form
  • X ? Y
  • defined over a DTD D, where X, Y are finite
  • non-empty subsets of Paths(D).
  • XML tree T can be tested for satisfaction of X ?
    Y if
  • X ? Y ? Paths(T) ? Paths(D)
  • T ? X ? Y if for every pair u, v of tree tuples
    in T
  • u.X v.X and u.X ? ? implies u.Y v.Y

22
48
FD Examples
  • University DTD courses ? course
  • course ? _at_cno, student
  • student ? _at_sno, name, grade
  • Two students with the same _at_sno value must have
    the same name
  • courses.course.student._at_sno ? courses.course.stud
    ent.name.S
  • Every student can have at most one grade in every
    course
  • courses.course,
  • courses.course.student._at_sno ?
    courses.course.student.grade.S

23
49
Implication Problem for FD
  • Given a DTD D and a set of functional
    dependencies ? ? ?
  • (D, ?) ? ? if for any XML tree T conforming to D
    and satisfying ? , it is the case that T ? ?
  • (D, ?) ? (D, ?) ? ?
  • Functional dependency ? is trivial if it is
    implied by the DTD alone (D, ?) ? ?

24
50
XNF XML Normal Form
  • XML specification a DTD D and a set of
    functional dependencies ?.
  • A Relational DB is in BCNF if for every
    non-trivial functional dependency X ? Y in the
    specification, X is a key.
  • (D, ?) is in XNF if
  • For each non-trivial FD X ? p._at_l or X ? p.S in
    (D, ?), X ? p is in (D, ?).

25
51
Back to DBLP
  • DBLP is not in XNF
  • DBLP.conf.issue ? DBLP.conf.issue.article._at_year
    ? (D,?)
  • DBLP.conf.issue ?
  • DBLP.conf.issue.article ? (D,?)
  • Proposed solution is in XNF.

26
52
Normalization Algorithm
  • The algorithm applies two transformations until
    the
  • schema is in XNF.
  • If there is an anomalous FD of the form
  • DBLP.conf.issue ? DBLP.conf.issue.article._at_year
  • then apply the DBLP example rule.
  • Otherwise choose a minimal anomalous FD and
    apply the University example rule.

27
53
Normalizing XML Documents
  • Theorem The decomposition algorithm terminates
    and outputs a specification in XNF.
  • Furthermore, it does not lose information
  • Unnormalized Normalized
  • XML document XML Document
  • Q1, Q2 are XQuery core queries.

Q1
Q2
28
54
Part 3 What was Missing? Justification!
  • What is a good database design?
  • Well-known solutions BCNF, 4NF,
  • But what is it that makes a database design good?
  • Elimination of update anomalies.
  • Existence of algorithms that produce good
    designs lossless decomposition, dependency
    preservation.
  • Previous work was specific for the relational
    model.
  • Classical problems have to be revisited in the
    XML context.

29
55
Justification of Normal Forms
  • Problematic to evaluate XML normal forms.
  • No XML update language has been standardized.
  • No XML query language yet has the same
    yardstick status as relational algebra.
  • We do not even know if implication of XML FDs is
    decidable!
  • We need a different approach.
  • It must be based on some intrinsic
    characteristics of the data.
  • It must be applicable to new data models.
  • It must be independent of query/update/constraint
    issues.
  • Our approach is based on information theory.

30
56
Information Theory
  • Entropy measures the amount of information
    provided by a certain event.
  • Assume that an event can have n different
    outcomes with probabilities p1, , pn.

Amount of information gained by knowing that event i occurred
Average amount of information gained (entropy)
Entropy is maximal if each pi 1/n
31
57
Entropy and Redundancies
  • Database schema R(A,B,C), A ? B
  • Instance I
  • Pick a domain properly containing adom(I)
  • Probability distribution P(4) 0 and P(a)
    1/5, a ? 4
  • Entropy log 5 2.322

A B C
1 2 3
1 2 4
A B C
1 2 3
1 2 4
A B C
1 2
1 2 4
A B C
1 2 3
1 2 4
A B C
1 3
1 2 4
  • Pick a domain properly containing adom(I) 1,
    , 6
  • Probability distribution P(2) 1 and P(a) 0,
    a ? 2
  • Entropy log 1 0

1, , 6
32
58
Entropy and Normal Forms
  • Let ? be a set of FDs over a schema S.
  • Theorem (S,?) is in BCNF if and only if for
    every instance of (S,?) and for every domain
    properly containing adom(I), each position
    carries non-zero amount of information (entropy gt
    0).
  • A similar result holds for 4NF and MVDs.
  • This is a clean characterization of BCNF and 4NF,
    but the measure is not accurate enough ...

33
59
Problems with the Measure
  • The measure cannot distinguish between different
    types of data dependencies.
  • It cannot distinguish between different instances
    of the same schema

R(A,B,C), A ? B
A B C
1 2 3
1 2 4
1 5
A B C
1 2 3
1 4
entropy 0
entropy 0
34
60
A General Measure
A B C
1 2 3
1 2 4
Instance I of schema R(A,B,C), A ? B
35
61
A General Measure
A B C
1 2 3
1 2 4
Instance I of schema R(A,B,C), A ? B
Initial setting pick a position p ? Pos(I) and
pick k such that adom(I) ? 1, , k. For
example, k 7.
35
62
A General Measure
A B C
1 2 3
1 2 4
Instance I of schema R(A,B,C), A ? B
Initial setting pick a position p ? Pos(I) and
pick k such that adom(I) ? 1, , k. For
example, k 7.
35
63
A General Measure
A B C
1 3
1 2 4
Instance I of schema R(A,B,C), A ? B
Initial setting pick a position p ? Pos(I) and
pick k such that adom(I) ? 1, , k. For
example, k 7.
Computation for every X ? Pos(I) p, compute
probability distribution P(a X), a ? 1, , k.
35
64
A General Measure
A B C
1 3
1 2 4
Instance I of schema R(A,B,C), A ? B
Computation for every X ? Pos(I) p, compute
probability distribution P(a X), a ? 1, , k.
35
65
A General Measure
A B C
3
1 2
Instance I of schema R(A,B,C), A ? B
Computation for every X ? Pos(I) p, compute
probability distribution P(a X), a ? 1, , k.
35
66
A General Measure
A B C
3
1 2
Instance I of schema R(A,B,C), A ? B
Computation for every X ? Pos(I) p, compute
probability distribution P(a X), a ? 1, , k.
P(2 X)
35
67
A General Measure
A B C
2 3
1 2
Instance I of schema R(A,B,C), A ? B
Computation for every X ? Pos(I) p, compute
probability distribution P(a X), a ? 1, , k.
P(2 X)
35
68
A General Measure
A B C
1 2 3
1 2 1
Instance I of schema R(A,B,C), A ? B
Computation for every X ? Pos(I) p, compute
probability distribution P(a X), a ? 1, , k.
P(2 X)
35
69
A General Measure
A B C
4 2 3
1 2 7
Instance I of schema R(A,B,C), A ? B
Computation for every X ? Pos(I) p, compute
probability distribution P(a X), a ? 1, , k.
P(2 X)
35
70
A General Measure
A B C
1 2 3
1 2 3
Instance I of schema R(A,B,C), A ? B
Computation for every X ? Pos(I) p, compute
probability distribution P(a X), a ? 1, , k.
P(2 X)
48/
35
71
A General Measure
A B C
3
1 2
Instance I of schema R(A,B,C), A ? B
Computation for every X ? Pos(I) p, compute
probability distribution P(a X), a ? 1, , k.
P(2 X)
48/
For a ? 2, P(a X)
35
72
A General Measure
A B C
a 3
1 2
Instance I of schema R(A,B,C), A ? B
Computation for every X ? Pos(I) p, compute
probability distribution P(a X), a ? 1, , k.
P(2 X)
48/
For a ? 2, P(a X)
35
73
A General Measure
A B C
2 a 3
1 2 7
Instance I of schema R(A,B,C), A ? B
Computation for every X ? Pos(I) p, compute
probability distribution P(a X), a ? 1, , k.
P(2 X)
48/
For a ? 2, P(a X)
35
74
A General Measure
A B C
1 a 3
1 2 6
Instance I of schema R(A,B,C), A ? B
Computation for every X ? Pos(I) p, compute
probability distribution P(a X), a ? 1, , k.
P(2 X)
48/
(48 6 ? 42) 0.16
For a ? 2, P(a X)
42/
(48 6 ? 42) 0.14
Entropy 2.8057
(log 7 2.8073)
35
75
A General Measure
A B C
1 3
1 2 4
Instance I of schema R(A,B,C), A ? B
  • Value we consider the average over all sets
    X ? Pos(I) p.
  • Average 2.4558 lt log 7 (maximal entropy)
  • It corresponds to conditional entropy.
  • It depends on the value of k ...

35
76
A General Measure
  • Previous value
  • For each k, we consider the ratio
  • How close the given position p is to having the
    maximum possible information content.
  • General measure

36
77
Basic Properties
  • The measure is well defined
  • For every set of firstorder constraints ?
    defined over a schema S, every I ? inst(S,?), and
    every p ? Pos(I) exists.
  • Bounds

37
78
Basic Properties
  • The measure does not depend on a particular
    representation of constraints. If ?1 and ?2 are
    equivalent
  • It overcomes the limitations of the simple
    measure R(A,B,C), A ? B

A B C
1 2 3
1 2 4
1 5
A B C
1 2 3
1 4
0.875
0.781
38
79
Well-Designed Databases
  • Definition A database specification (S,?) is
    well-designed if for every I ? inst(S,?) and
    every p ? Pos(I), 1.
  • In other words, every position in every instance
    carries the maximum possible amount of
    information.
  • We would like to test this definition in the
    relational world ...

39
80
Relational Databases
  • ? is a set of data dependencies over a schema S
  • ? ? (S,?) is well-designed.
  • ? is a set of FDs (S,?) is well-designed if and
    only if (S,?) is in BCNF.
  • ? is a set of FDs and MVDs (S,?) is
    well-designed if and only if (S,?) is in 4NF.
  • ? is a set of FDs and JDs
  • If (S,?) is in PJ/NF or in 5NFR, then (S,?) is
    well-designed. The converse is not true.
  • A syntactic characterization of being
    well-designed is given in AL03.

40
81
Relational Databases
  • The problem of verifying whether a relational
    schema is well-designed is undecidable.
  • If the schema contains only universal constraints
    (FDs, MVDs, JDs, ), then the problem becomes
    decidable.
  • Now we would like to apply our definition in the
    XML world ...

41
82
XML Databases
  • XML schema (D,?).
  • D is a DTD.
  • ? is a set of data dependencies over D.
  • We would like to evaluate XML normal forms.
  • The notion of being well-designed extends from
    relations to XML.
  • The measure is robust we just need to define the
    set of positions in an XML tree T Pos(T).

42
83
Positions in an XML Tree
DBLP
conf
conf
title
issue
issue
ICDT
ICDT
article
article
article

_at_year
title
title
_at_year
author
_at_year
title
author
1999
1999
Dong
2001
Jarke
. . .
. . .
. . .
1999
1999
Dong
2001
Jarke
. . .
. . .
. . .
43
84
Well-Designed XML Data
  • We consider k such that adom(T) ? 1, ,k.
  • For each k
  • We consider the ratio
  • General measure

44
85
XNF XML Normal Form
  • For arbitrary XML data dependencies
  • Definition An XML specification (D,?) is
    well-designed if for every T ? inst(D,?) and
    every p ? Pos(T),
    1.
  • For functional dependencies
  • Theorem An XML specification (D,?) is in XNF if
    and only if (D,?) is well-designed.

45
86
Normalization Algorithms
  • The information-theoretic measure can also be
    used for reasoning about normalization
    algorithms.
  • For BCNF and XNF decomposition algorithms
  • Theorem After each step of these decomposition
    algorithms, the amount of information in each
    position does not decrease.

46
87
Future Work
  • We would like to consider more complex XML
    constraints and characterize good designs they
    give rise to.
  • We would like to characterize 3NF by using the
    measure developed in this paper.
  • In general, we would like to characterize
    non-perfect normal forms.
  • We would like to develop better characterizations
    of normalization algorithms using our measure.
  • Why is the usual BCNF decomposition algorithm
    good? Why does it always stop?

47
Write a Comment
User Comments (0)
About PowerShow.com