Title: Well-designed XML Data
1Well-designed XML Data
Marcelo Arenas and Leonid Libkin University of
Toronto
2Outline
- Part 1 - Database Normalization from the 1970s
and 1980s. - Part 2 - Classical theory revisited normalizing
XML documents. - Part 3 - Classical theory re-done new
justifications for normalization.
2
3Part 1 Classical Normalization
- Design decide how to represent the information
in a particular data model. - Even for simple application domains there is a
large number of ways of representing the data of
interest. - We have to design the schema of the database.
- Set of relations.
- Set of attributes for each relation.
- Set of data dependencies.
3
4Designing a Database An Example
- Attributes number, title, section, room.
- Data dependency every course number is
associated with only one title. - Relational Schema
R(number, title, section, room), number ? title
BAD alternative
GOOD alternative S(number, title), number ? title
T(number, section, room), ?
4
5Problems with BAD Update Anomaly
number title section room
CSC258 Computer Organization 1 LP266
CSC258 Computer Organization 2 GB258
CSC258 Computer Organization 3 GB248
CSC434 Database Systems 1 GB248
Title of CSC258 is changed to Computer
Organization I.
5
6Problems with BAD Update Anomaly
number title section room
CSC258 Computer Organization 1 LP266
CSC258 Computer Organization 2 GB258
CSC258 Computer Organization 3 GB248
CSC434 Database Systems 1 GB248
Title of CSC258 is changed to Computer
Organization I.
5
7Problems with BAD Update Anomaly
number title section room
CSC258 Computer Organization I 1 LP266
CSC258 Computer Organization I 2 GB258
CSC258 Computer Organization I 3 GB248
CSC434 Database Systems 1 GB248
Title of CSC258 is changed to Computer
Organization I.
The instance stores redundant information.
5
8Deletion Anomaly
number title section room
CSC258 Computer Organization I 1 LP266
CSC258 Computer Organization I 2 GB258
CSC258 Computer Organization I 3 GB248
CSC434 Database Systems 1 GB248
CSC434 is not given in this term.
6
9Deletion Anomaly
number title section room
CSC258 Computer Organization I 1 LP266
CSC258 Computer Organization I 2 GB258
CSC258 Computer Organization I 3 GB248
CSC434 Database Systems 1 GB248
CSC434 is not given in this term.
6
10Deletion Anomaly
number title section room
CSC258 Computer Organization I 1 LP266
CSC258 Computer Organization I 2 GB258
CSC258 Computer Organization I 3 GB248
CSC434 is not given in this term.
Additional effect all the information about
CSC434 was deleted.
6
11Insertion Anomaly
number title section room
CSC258 Computer Organization I 1 LP266
CSC258 Computer Organization I 2 GB258
CSC258 Computer Organization I 3 GB248
A new course is created (CSC336, Numerical
Methods)
7
12Insertion Anomaly
number title section room
CSC258 Computer Organization I 1 LP266
CSC258 Computer Organization I 2 GB258
CSC258 Computer Organization I 3 GB248
A new course is created (CSC336, Numerical
Methods)
7
13Insertion Anomaly
number title section room
CSC258 Computer Organization I 1 LP266
CSC258 Computer Organization I 2 GB258
CSC258 Computer Organization I 3 GB248
CSC336 Numerical Methods ? ?
A new course is created (CSC336, Numerical
Methods)
The instance stores attributes that are not
directly related.
7
14Avoiding Update Anomalies
number title
CSC258 Computer Organization
CSC434 Database Systems
number section room
CSC258 1 LP266
CSC258 2 GB258
CSC258 3 GB248
CSC434 1 GB248
Title of CSC258 is changed to Computer
Organization I.
8
15Avoiding Update Anomalies
number title
CSC258 Computer Organization
CSC434 Database Systems
number section room
CSC258 1 LP266
CSC258 2 GB258
CSC258 3 GB248
CSC434 1 GB248
Title of CSC258 is changed to Computer
Organization I.
8
16Avoiding Update Anomalies
number title
CSC258 Computer Organization I
CSC434 Database Systems
number section room
CSC258 1 LP266
CSC258 2 GB258
CSC258 3 GB248
CSC434 1 GB248
Title of CSC258 is changed to Computer
Organization I.
CSC434 is not given in this term.
The instance does not store redundant information.
8
17Avoiding Update Anomalies
number title
CSC258 Computer Organization I
CSC434 Database Systems
number section room
CSC258 1 LP266
CSC258 2 GB258
CSC258 3 GB248
CSC434 1 GB248
CSC434 is not given in this term.
8
18Avoiding Update Anomalies
number title
CSC258 Computer Organization I
CSC434 Database Systems
number section room
CSC258 1 LP266
CSC258 2 GB258
CSC258 3 GB248
CSC434 is not given in this term.
A new course is created (CSC336, Numerical
Methods)
The title of CSC434 is not removed from the
instance.
8
19Avoiding Update Anomalies
number title
CSC258 Computer Organization I
CSC434 Database Systems
number section room
CSC258 1 LP266
CSC258 2 GB258
CSC258 3 GB248
A new course is created (CSC336, Numerical
Methods)
8
20Avoiding Update Anomalies
number title
CSC258 Computer Organization I
CSC434 Database Systems
CSC336 Numerical Methods
number section room
CSC258 1 LP266
CSC258 2 GB258
CSC258 3 GB248
A new course is created (CSC336, Numerical
Methods)
No information about sections has to be provided.
Each relation stores attributes that are directly
related.
8
21Normalization Theory
- Main idea a normal form defines a condition that
a well designed database should satisfy. - Normal form syntactic condition on the database
schema. - Defined for a class of data dependencies.
- Main problems
- How to test whether a database schema is in a
particular normal form. - How to transform a database schema into an
equivalent one satisfying a particular normal
form.
9
22Normalization Theory Today
- Normalization theory for relational databases was
developed in the 70s and 80s. - Why do we need normalization theory today?
- New data models have emerged XML.
- XML documents can contain redundant information.
- Redundant information in XML documents
- Can be discovered if the user provides semantic
information. - Can be eliminated.
10
23Part 2 XML and Normalization
XML Document
ltcoursesgt
ltcourse cnoCSC258gt
lttaken_bygt
ltstudent snost1gt
ltnamegt Fox lt/namegt
ltgradegt B lt/gradegt
lt/studentgt
lt/taken_bygt
lt/coursegt
lt/coursesgt
11
24Part 2 XML and Normalization
XML Document
ltcoursesgt
ltcourse cnoCSC258gt
lttaken_bygt
ltstudent snost1gt
ltnamegt Fox lt/namegt
ltgradegt B lt/gradegt
lt/studentgt
lt/taken_bygt
lt/coursegt
lt/coursesgt
11
25Part 2 XML and Normalization
XML Document
ltcoursesgt
ltcourse cnoCSC258gt
lttaken_bygt
ltstudent snost1gt
ltnamegt Fox lt/namegt
ltgradegt B lt/gradegt
lt/studentgt
lt/taken_bygt
lt/coursegt
lt/coursesgt
11
26Part 2 XML and Normalization
XML Document
ltcoursesgt
ltcourse cnoCSC258gt
lttaken_bygt
ltstudent snost1gt
ltnamegt Fox lt/namegt
ltgradegt B lt/gradegt
lt/studentgt
lt/taken_bygt
lt/coursegt
lt/coursesgt
11
27Part 2 XML and Normalization
XML Document
ltcoursesgt
ltcourse cnoCSC258gt
lttaken_bygt
ltstudent snost1gt
ltnamegt Fox lt/namegt
ltgradegt B lt/gradegt
lt/studentgt
lt/taken_bygt
lt/coursegt
lt/coursesgt
11
28Part 2 XML and Normalization
XML Document
DTD
ltcoursesgt
ltcourse cnoCSC258gt
lttaken_bygt
ltstudent snost1gt
ltnamegt Fox lt/namegt
ltgradegt B lt/gradegt
lt/studentgt
lt/taken_bygt
lt/coursegt
lt/coursesgt
courses ? course
course ? _at_cno
course ? taken_by
taken_by ? student
student ? _at_sno
student ? name, grade
name ? PCDATA
grade ? PCDATA
11
29Part 2 XML and Normalization
XML Document
DTD
ltcoursesgt
ltcourse cnoCSC258gt
lttaken_bygt
ltstudent snost1gt
ltnamegt Fox lt/namegt
ltgradegt B lt/gradegt
lt/studentgt
lt/taken_bygt
lt/coursegt
lt/coursesgt
courses ? course
course ? _at_cno
course ? taken_by
taken_by ? student
student ? _at_sno
student ? name, grade
name ? PCDATA
grade ? PCDATA
11
30Part 2 XML and Normalization
XML Document
DTD
ltcoursesgt
ltcourse cnoCSC258gt
lt/coursegt
ltcourse cnoCSC434gt
lt/coursegt
lt/coursesgt
courses ? course
course ? _at_cno
course ? taken_by
taken_by ? student
student ? _at_sno
student ? name, grade
name ? PCDATA
grade ? PCDATA
11
31Part 2 XML and Normalization
XML Document
DTD
ltcoursesgt
ltcourse cnoCSC258gt
lttaken_bygt
ltstudent snost1gt
ltnamegt Fox lt/namegt
ltgradegt B lt/gradegt
lt/studentgt
lt/taken_bygt
lt/coursegt
lt/coursesgt
courses ? course
course ? _at_cno
course ? taken_by
taken_by ? student
student ? _at_sno
student ? name, grade
name ? PCDATA
grade ? PCDATA
11
32Part 2 XML and Normalization
XML Document
DTD
ltcoursesgt
ltcourse cnoCSC258gt
lttaken_bygt
ltstudent snost1gt
ltnamegt Fox lt/namegt
ltgradegt B lt/gradegt
lt/studentgt
lt/taken_bygt
lt/coursegt
lt/coursesgt
courses ? course
course ? _at_cno
course ? taken_by
taken_by ? student
student ? _at_sno
student ? name, grade
name ? PCDATA
grade ? PCDATA
11
33Part 2 XML and Normalization
XML Document
DTD
ltcoursesgt
ltcourse cnoCSC258gt
lttaken_bygt
ltstudent snost1gt
ltnamegt Fox lt/namegt
ltgradegt B lt/gradegt
lt/studentgt
lt/taken_bygt
lt/coursegt
lt/coursesgt
courses ? course
course ? _at_cno
course ? taken_by
taken_by ? student
student ? _at_sno
student ? name, grade
name ? PCDATA
grade ? PCDATA
11
34Part 2 XML and Normalization
XML Document
DTD
ltcoursesgt
ltcourse cnoCSC258gt
lttaken_bygt
ltstudent snost1gt
ltnamegt Fox lt/namegt
ltgradegt B lt/gradegt
lt/studentgt
lt/taken_bygt
lt/coursegt
lt/coursesgt
courses ? course
course ? _at_cno
course ? taken_by
taken_by ? student
student ? _at_sno
student ? name, grade
name ? PCDATA
grade ? PCDATA
11
35Part 2 XML and Normalization
XML Document
DTD
ltcoursesgt
ltcourse cnoCSC258gt
lttaken_bygt
ltstudent snost1gt
ltnamegt Fox lt/namegt
ltgradegt B lt/gradegt
lt/studentgt
lt/taken_bygt
lt/coursegt
lt/coursesgt
courses ? course
course ? _at_cno
course ? taken_by
taken_by ? student
student ? _at_sno
student ? name, grade
name ? PCDATA
grade ? PCDATA
11
36XML Databases
XML Schema (D, ?)
D
?
Two students with the same _at_sno value must have
the same name.
courses ? course
course ? _at_cno
course ? taken_by
taken_by ? student
student ? _at_sno
student ? name, grade
12
37Redundancy in XML
courses
course
course
info
_at_cno
_at_cno
taken_by
taken_by
name
_at_sno
CSC258
CSC434
st1
Fox
student
student
student
. . .
_at_sno
name
grade
grade
name
_at_sno
st1
st1
A
B
Fox
Fox
13
38XML Database Normalization
DTD
Data dependency
Two students with the same _at_sno value must have
the same name.
courses ? course
course ? _at_cno
course ? taken_by
taken_by ? student
student ? _at_sno
student ? name, grade
14
39XML Database Normalization
DTD
Data dependency
, info
_at_sno is the identifier of info elements.
courses ? course
course ? _at_cno
course ? taken_by
taken_by ? student
student ? _at_sno
student ? grade
Two students with the same _at_sno value must have
the same name.
info ? _at_sno
info ? name
14
40A Non-relational Example
DBLP
conf
conf
title
issue
issue
ICDT
_at_year
article
article
article
_at_year
1999
2001
_at_year
title
title
_at_year
author
_at_year
title
author
1999
1999
Dong
2001
Jarke
. . .
. . .
. . .
15
41XNF XML Normal Form
- It eliminates two types of anomalies.
- It was defined for XML functional dependencies
- DBLP.conf._at_title ? DBLP.conf
- DBLP.conf.issue ? DBLP.conf.issue.article._at_year
16
42Problems to Address
- Functional dependencies for XML.
- Normal form for XML documents (XNF).
- Generalizes BCNF.
- Algorithm for normalizing XML documents.
- Implication problem for functional dependencies.
17
43Framework Paths in DTDs
- Paths(D) all paths in a DTD D
- courses.course
- courses.course._at_cno
- courses.course.student.name
- courses.course.student.name.S
- We distinguish three kinds of elements
attributes (_at_), strings (S) and element types. - FDs are defined by means of a relational
representation of XML documents.
18
44Framework XML Trees
courses
v0
course
course
v1
. . .
_at_cno
student
student
v2
v5
cs100
_at_sno
grade
_at_sno
grade
name
name
v3
v4
v6
v7
123
456
S
S
S
S
Fox
B
Smith
A-
19
45Tree Tuples
- Relational representation tree tuples - mappings
- t Paths(D) ? Vertices ? Strings ? ?
- A tree tuple represents an XML tree
courses
t(courses) v0 t(courses.course)
v1 t(courses.course._at_cno) cs100 t(courses.cour
se.student) v2 t(p) ?, for the remaining paths
v0
course
v1
_at_cno
student
v2
cs100
20
46XML Tree set of Tree Tuples
courses
courses
v0
v0
course
course
course
course
v1
. . .
v1
. . .
_at_cno
student
_at_cno
student
student
student
v2
v5
v2
v5
cs100
cs100
_at_sno
grade
_at_sno
grade
_at_sno
grade
_at_sno
grade
name
name
name
name
v3
v4
v6
v7
v3
v4
v6
v7
123
456
123
456
S
S
S
S
S
S
S
S
Fox
B
Smith
A-
Fox
B
Smith
A-
21
47Functional Dependencies for XML
- Expressions of the form
- X ? Y
- defined over a DTD D, where X, Y are finite
- non-empty subsets of Paths(D).
- XML tree T can be tested for satisfaction of X ?
Y if - X ? Y ? Paths(T) ? Paths(D)
- T ? X ? Y if for every pair u, v of tree tuples
in T - u.X v.X and u.X ? ? implies u.Y v.Y
22
48FD Examples
- University DTD courses ? course
- course ? _at_cno, student
- student ? _at_sno, name, grade
- Two students with the same _at_sno value must have
the same name - courses.course.student._at_sno ? courses.course.stud
ent.name.S - Every student can have at most one grade in every
course - courses.course,
- courses.course.student._at_sno ?
courses.course.student.grade.S
23
49Implication Problem for FD
- Given a DTD D and a set of functional
dependencies ? ? ? - (D, ?) ? ? if for any XML tree T conforming to D
and satisfying ? , it is the case that T ? ? - (D, ?) ? (D, ?) ? ?
- Functional dependency ? is trivial if it is
implied by the DTD alone (D, ?) ? ?
24
50XNF XML Normal Form
- XML specification a DTD D and a set of
functional dependencies ?. - A Relational DB is in BCNF if for every
non-trivial functional dependency X ? Y in the
specification, X is a key. - (D, ?) is in XNF if
- For each non-trivial FD X ? p._at_l or X ? p.S in
(D, ?), X ? p is in (D, ?).
25
51Back to DBLP
- DBLP is not in XNF
- DBLP.conf.issue ? DBLP.conf.issue.article._at_year
? (D,?) - DBLP.conf.issue ?
- DBLP.conf.issue.article ? (D,?)
- Proposed solution is in XNF.
26
52Normalization Algorithm
- The algorithm applies two transformations until
the - schema is in XNF.
- If there is an anomalous FD of the form
- DBLP.conf.issue ? DBLP.conf.issue.article._at_year
- then apply the DBLP example rule.
- Otherwise choose a minimal anomalous FD and
apply the University example rule.
27
53Normalizing XML Documents
- Theorem The decomposition algorithm terminates
and outputs a specification in XNF. - Furthermore, it does not lose information
-
- Unnormalized Normalized
- XML document XML Document
- Q1, Q2 are XQuery core queries.
Q1
Q2
28
54Part 3 What was Missing? Justification!
- What is a good database design?
- Well-known solutions BCNF, 4NF,
- But what is it that makes a database design good?
- Elimination of update anomalies.
- Existence of algorithms that produce good
designs lossless decomposition, dependency
preservation. - Previous work was specific for the relational
model. - Classical problems have to be revisited in the
XML context.
29
55Justification of Normal Forms
- Problematic to evaluate XML normal forms.
- No XML update language has been standardized.
- No XML query language yet has the same
yardstick status as relational algebra. - We do not even know if implication of XML FDs is
decidable! - We need a different approach.
- It must be based on some intrinsic
characteristics of the data. - It must be applicable to new data models.
- It must be independent of query/update/constraint
issues. - Our approach is based on information theory.
30
56Information Theory
- Entropy measures the amount of information
provided by a certain event. - Assume that an event can have n different
outcomes with probabilities p1, , pn.
Amount of information gained by knowing that event i occurred
Average amount of information gained (entropy)
Entropy is maximal if each pi 1/n
31
57Entropy and Redundancies
- Database schema R(A,B,C), A ? B
- Instance I
- Pick a domain properly containing adom(I)
- Probability distribution P(4) 0 and P(a)
1/5, a ? 4 - Entropy log 5 2.322
A B C
1 2 3
1 2 4
A B C
1 2 3
1 2 4
A B C
1 2
1 2 4
A B C
1 2 3
1 2 4
A B C
1 3
1 2 4
- Pick a domain properly containing adom(I) 1,
, 6 - Probability distribution P(2) 1 and P(a) 0,
a ? 2 - Entropy log 1 0
1, , 6
32
58Entropy and Normal Forms
- Let ? be a set of FDs over a schema S.
- Theorem (S,?) is in BCNF if and only if for
every instance of (S,?) and for every domain
properly containing adom(I), each position
carries non-zero amount of information (entropy gt
0). - A similar result holds for 4NF and MVDs.
- This is a clean characterization of BCNF and 4NF,
but the measure is not accurate enough ...
33
59Problems with the Measure
- The measure cannot distinguish between different
types of data dependencies. - It cannot distinguish between different instances
of the same schema
R(A,B,C), A ? B
A B C
1 2 3
1 2 4
1 5
A B C
1 2 3
1 4
entropy 0
entropy 0
34
60A General Measure
A B C
1 2 3
1 2 4
Instance I of schema R(A,B,C), A ? B
35
61A General Measure
A B C
1 2 3
1 2 4
Instance I of schema R(A,B,C), A ? B
Initial setting pick a position p ? Pos(I) and
pick k such that adom(I) ? 1, , k. For
example, k 7.
35
62A General Measure
A B C
1 2 3
1 2 4
Instance I of schema R(A,B,C), A ? B
Initial setting pick a position p ? Pos(I) and
pick k such that adom(I) ? 1, , k. For
example, k 7.
35
63A General Measure
A B C
1 3
1 2 4
Instance I of schema R(A,B,C), A ? B
Initial setting pick a position p ? Pos(I) and
pick k such that adom(I) ? 1, , k. For
example, k 7.
Computation for every X ? Pos(I) p, compute
probability distribution P(a X), a ? 1, , k.
35
64A General Measure
A B C
1 3
1 2 4
Instance I of schema R(A,B,C), A ? B
Computation for every X ? Pos(I) p, compute
probability distribution P(a X), a ? 1, , k.
35
65A General Measure
A B C
3
1 2
Instance I of schema R(A,B,C), A ? B
Computation for every X ? Pos(I) p, compute
probability distribution P(a X), a ? 1, , k.
35
66A General Measure
A B C
3
1 2
Instance I of schema R(A,B,C), A ? B
Computation for every X ? Pos(I) p, compute
probability distribution P(a X), a ? 1, , k.
P(2 X)
35
67A General Measure
A B C
2 3
1 2
Instance I of schema R(A,B,C), A ? B
Computation for every X ? Pos(I) p, compute
probability distribution P(a X), a ? 1, , k.
P(2 X)
35
68A General Measure
A B C
1 2 3
1 2 1
Instance I of schema R(A,B,C), A ? B
Computation for every X ? Pos(I) p, compute
probability distribution P(a X), a ? 1, , k.
P(2 X)
35
69A General Measure
A B C
4 2 3
1 2 7
Instance I of schema R(A,B,C), A ? B
Computation for every X ? Pos(I) p, compute
probability distribution P(a X), a ? 1, , k.
P(2 X)
35
70A General Measure
A B C
1 2 3
1 2 3
Instance I of schema R(A,B,C), A ? B
Computation for every X ? Pos(I) p, compute
probability distribution P(a X), a ? 1, , k.
P(2 X)
48/
35
71A General Measure
A B C
3
1 2
Instance I of schema R(A,B,C), A ? B
Computation for every X ? Pos(I) p, compute
probability distribution P(a X), a ? 1, , k.
P(2 X)
48/
For a ? 2, P(a X)
35
72A General Measure
A B C
a 3
1 2
Instance I of schema R(A,B,C), A ? B
Computation for every X ? Pos(I) p, compute
probability distribution P(a X), a ? 1, , k.
P(2 X)
48/
For a ? 2, P(a X)
35
73A General Measure
A B C
2 a 3
1 2 7
Instance I of schema R(A,B,C), A ? B
Computation for every X ? Pos(I) p, compute
probability distribution P(a X), a ? 1, , k.
P(2 X)
48/
For a ? 2, P(a X)
35
74A General Measure
A B C
1 a 3
1 2 6
Instance I of schema R(A,B,C), A ? B
Computation for every X ? Pos(I) p, compute
probability distribution P(a X), a ? 1, , k.
P(2 X)
48/
(48 6 ? 42) 0.16
For a ? 2, P(a X)
42/
(48 6 ? 42) 0.14
Entropy 2.8057
(log 7 2.8073)
35
75A General Measure
A B C
1 3
1 2 4
Instance I of schema R(A,B,C), A ? B
- Value we consider the average over all sets
X ? Pos(I) p. - Average 2.4558 lt log 7 (maximal entropy)
- It corresponds to conditional entropy.
- It depends on the value of k ...
35
76A General Measure
- Previous value
- For each k, we consider the ratio
- How close the given position p is to having the
maximum possible information content. - General measure
36
77Basic Properties
- The measure is well defined
- For every set of firstorder constraints ?
defined over a schema S, every I ? inst(S,?), and
every p ? Pos(I) exists. - Bounds
37
78Basic Properties
- The measure does not depend on a particular
representation of constraints. If ?1 and ?2 are
equivalent - It overcomes the limitations of the simple
measure R(A,B,C), A ? B
A B C
1 2 3
1 2 4
1 5
A B C
1 2 3
1 4
0.875
0.781
38
79Well-Designed Databases
- Definition A database specification (S,?) is
well-designed if for every I ? inst(S,?) and
every p ? Pos(I), 1. - In other words, every position in every instance
carries the maximum possible amount of
information. - We would like to test this definition in the
relational world ...
39
80Relational Databases
- ? is a set of data dependencies over a schema S
- ? ? (S,?) is well-designed.
- ? is a set of FDs (S,?) is well-designed if and
only if (S,?) is in BCNF. - ? is a set of FDs and MVDs (S,?) is
well-designed if and only if (S,?) is in 4NF. - ? is a set of FDs and JDs
- If (S,?) is in PJ/NF or in 5NFR, then (S,?) is
well-designed. The converse is not true. - A syntactic characterization of being
well-designed is given in AL03.
40
81Relational Databases
- The problem of verifying whether a relational
schema is well-designed is undecidable. - If the schema contains only universal constraints
(FDs, MVDs, JDs, ), then the problem becomes
decidable. - Now we would like to apply our definition in the
XML world ...
41
82XML Databases
- XML schema (D,?).
- D is a DTD.
- ? is a set of data dependencies over D.
- We would like to evaluate XML normal forms.
- The notion of being well-designed extends from
relations to XML. - The measure is robust we just need to define the
set of positions in an XML tree T Pos(T).
42
83Positions in an XML Tree
DBLP
conf
conf
title
issue
issue
ICDT
ICDT
article
article
article
_at_year
title
title
_at_year
author
_at_year
title
author
1999
1999
Dong
2001
Jarke
. . .
. . .
. . .
1999
1999
Dong
2001
Jarke
. . .
. . .
. . .
43
84Well-Designed XML Data
- We consider k such that adom(T) ? 1, ,k.
- For each k
- We consider the ratio
- General measure
44
85XNF XML Normal Form
- For arbitrary XML data dependencies
- Definition An XML specification (D,?) is
well-designed if for every T ? inst(D,?) and
every p ? Pos(T),
1. - For functional dependencies
- Theorem An XML specification (D,?) is in XNF if
and only if (D,?) is well-designed.
45
86Normalization Algorithms
- The information-theoretic measure can also be
used for reasoning about normalization
algorithms. - For BCNF and XNF decomposition algorithms
- Theorem After each step of these decomposition
algorithms, the amount of information in each
position does not decrease.
46
87Future Work
- We would like to consider more complex XML
constraints and characterize good designs they
give rise to. - We would like to characterize 3NF by using the
measure developed in this paper. - In general, we would like to characterize
non-perfect normal forms. - We would like to develop better characterizations
of normalization algorithms using our measure. - Why is the usual BCNF decomposition algorithm
good? Why does it always stop?
47