CS 728 Advanced Database Systems Chapter 14 presentation

About This Presentation

Transcript and Presenter's Notes

Title: CS 728 Advanced Database Systems Chapter 14

1
CS 728 Advanced Database Systems Chapter 14

Database Design Theory
Introduction to Normalization Using Functional
Multivalued Dependencies

2
Design Guidelines for Relation Schemas

1. Attributes should have clear meanings
(semantics) and related attributes are grouped
into single entities.
2. Avoid update anomalies by reducing redundant
data.
3. Reduce the NULL values.
4. When relations are joined no spurious tuples
will be generated.

3
Semantics of Relation Attributes

Guideline 1
Design a relation schema so that it is easy to
explain its meaning. Do not combine attributes
from multiple entity types and relationship types
into a single relation.
Design I
STUDENT(STNO, Name, Address, ANO)
ADVISOR(ANO, Name, Address, Dept)
Design II
Student-Advisor(STNO, Name, Address, ANO,
A-name, A-address, Dept)
Design I is better when compared with Design II.

4
Update Anomalies

Insertion anomalies
In design II, if we add a new student to
Student-Advisor, we have to add data related to
that students advisor. This information should
be consistent with all other occurrences of that
advisor. Note that in design II all data related
to a particular advisor is repeated a number of
times which equals to the number of students
supervised by that advisor. In design I, only
advisor number is repeated.
It is difficult to add a new advisor who have no
students yet to the database. This is because we
have to assign nulls to STNO and STNO is the
primary key for Student-Advisor.

5
Update Anomalies

Deletion Anomalies
If we delete the last student associated with a
particular advisor, then that advisor cannot
exist in the database in design II any more.
Modification Anomalies
If an advisor changes his/her address, say, then
we have to modify his/her address in all tuples.
If we miss some tuples, then we will have several
addresses for the same advisor.

6
Update Anomalies

Guideline 2
Design the base relation schemas so that no
insertion, deletion, or modification anomalies
occur.
If any anomalies are present, note them clearly
so that the programs that update the database
will operate correctly.

7
NULL values in Tuples

Guideline 3
Avoid placing attributes in a base relation whose
values may be null.
If nulls are unavoidable, make sure that they
apply in exceptional cases only and do not apply
to majority of tuples in the relation.

8
NULL values in Tuples

Problems with Nulls
Waste storage space.
Have multiple interpretations (not-applicable,
not-known,).
Create ambiguities with aggregate functions
(count, avg, )
Create ambiguities with joins.

9
NULL values in Tuples

Example
If only 10 of employees have phones, then
Employee(SSN, Name,., Office-phone) is a poor
design, because 90 of the last column values
will be nulls
But
Employee(SSN, Name, )
Phone(SSN, Office-phone)
is a better design.

10
Spurious Tuples

Guideline 4
Design relation schemas so that they can be
joined with equality conditions on attributes
that are either primary keys or foreign keys in a
way which guarantees that no spurious tuples are
generated.

11
Spurious Tuples

Emp-Proj(SSN, Pno, Hours, Ename, Pname, Plocation)

12
Spurious Tuples

Emp-locs(Ename, Plocation)
Emp-Proj1(SSN, Pno, hours, Pname, Plocation)

13
Spurious Tuples

Emp-Proj(SSN, Pno, Hours, Pname, Plocation)

14
Spurious Tuples

Result ? Emp-Locs ? Emp-proj1
Then Result(Ename, Plocation, SSN, Pno, Hours,
Pname)

15
Spurious Tuples

If we combine Emp-Locs and Emp_Proj1 on Plocation
attribute, we will get spurious tuples as you
have noticed in the previous slide. This is
because Plocation is the attribute which combines
the two relations and it is neither a primary key
nor a foreign key in either Emp-Locs or
Emp-Proj1

16
Bad Tables (1)

Alternative designs for a product database
(1)
Products(Prod_no, Prod_name, Price, Manu_id)
Manufacturers (Manu_id, Manu_name, Address)
(2)
Prod_Manu (Prod_no, Prod_name, Price, Manu_id,
Manu_name, Address)

17
Bad Tables (2)

Problems with the second design
Redundancy --- the name and address of each
manufacturer will be repeated once for each
product made by the manufacturer.
more storage space needed
potential inconsistency (update anomalies)

18
Bad Tables (3)

Insertion anomalies --- a manufacturer's name and
address cannot be recorded in the database if it
does not make at least one product (because
Prod_no is part of the primary key).
Deletion anomalies --- If we delete all products
made by a manufacturer, we will unintentionally
lose track of the manufacturer's name and
address.
The first design does not have similar problems.
The challenge is to identify bad relations and
convert them into good relations.

19
Functional Dependencies

R Relation schema.
r(R) Relation instance
A1, A2, , An Attributes which belong to
universal relation.
X, Y Sets of attributes.

20
Functional Dependencies

A functional dependency (X ? Y), between X and Y
specifies a constraint on the possible tuples
that can form a relation instance r of R.
The constraint states that for any two tuples t1
and t2 in r such that t1X t2X, then t1Y
t2Y
This means that the Y component of a tuple in r
depends on (or determined by) the values of the X
component of that tuple in r.
Or the values of the X component of a tuple in r
uniquely or functionally determines the values of
Y component.

21
Functional Dependencies

X is said to functionally determine Y (or Y is
functionally dependent on X) if for every legal
relation instance r(R), for any two tuples t1 and
t2 in r(R), we have
t1X t2X , then t1Y t2Y
X ? R denotes that X is a subset of the
attributes of R
X ?Y denotes that X functionally determines Y.

22
Functional Dependencies

Note that
if X is a candidate key then X ? Y is correct
for any subset of attributes of R.
if X ? Y, this does not say whether Y ? X is
correct or not.
Student(SSN, STNO, Name, Major)
SSN ? Name
t1(91910, 980090012, Ahmed, ..)
t2(91910, 980090012, Ahmed, ..)
STNO ? Major
t10((980090012,, Math)
t12(980090012, , Math)
SSN ? SSN, Name, STNO, Major

23
Functional Dependencies

Manager(SSN, Name, . Dept)
SSN ? Name, Dept (correct)
SSN uniquely determines the Name and Dept
Dept ? SSN (correct)
Dept uniquely determines the SSN
Name ? SSN (incorrect)

24
Functional Dependencies

FD is a property of the meaning or semantics of
attributes
FD is specified as a constraint on R, all
extensions of R (i.e. r(R)) should specify that
constraint (legal extensions) otherwise r(R) are
called illegal extensions
Note that we cannot infer FDs from r(R)

25
Diagrammatic Representation of FDs

SSN ? STNO, NAME, MAJOR
STNO ? SSN, NAME, MAJOR

Student(SSN, STNO, Name, Major)
FD 1
FD 2
26
Inference Rules for Functional Dependencies

Let
F set of functional dependencies defined on R
F (Closure of F) is the set of all functional
dependencies that can be defined on R
The closure of F is the set of all FDs that are
logically implied by F
The closure of F is denoted by F
F X ? Y F X ? Y
A BIG F may be derived from a small F
For R(A, B, C) and F A ? B, B ? C
F A ? B, B ? C, A ? C, A ? A, B ? B,C ? C,
AB ? AB, AB ? A, AB ? B, ...

27
Inference Rules for Functional Dependencies

Emp-Dept(SSN, Ename, Bdate, Address, Dnumber,
Dname, MGR-SSN)
F
SSN ? Ename,Bdate,Address,Dnumber, Dnumber ?
Dname, MGR-SSN
We can infer the following FDs
SSN ? SSN (Reflexive)
SSN ? Ename (Decomposition)
SSN ?Dname, MGR-SSN (Transitive)
We usually write F X ? Y to denote that the FD
X ?Y is inferred from F
X, Y are subsets of attributes

28
Functional Dependency

Several equivalent definitions
X ? Y in R iff for any t1, t2 in r(R), if t1
and t2 have the same X-value, then t1 and t2 also
have the same Y-value
X ? Y in R iff there exist no t1, t2 in r(R)
such that t1 and t2 have the same X-value but
different Y-values
X ? Y in R iff for each X-value, there
corresponds to a unique Y-value.

29
Functional Dependency

Question if all X-values are different in all
possible r(R), does X ? Y in R?
Theorem 1
If X is a superkey of R and Y is any subset of R,
then X ? Y in R
Note that X ? Y in R is a property that must be
true for all possible legal r(R), not just for
the present r(R)

30
Functional Dependency

Example which is true?
A B C D A ? B
a1 b1 c1 d1 A ? C
a1 b2 c1 d2 C ? A
a2 b2 c2 d2 A ? D
a2 b3 c2 d3 B ? D
a3 b3 c2 d4 AB ? D
AB ? C
AB ? CD

31
Identify Functional Dependency

FD created by assertions.
Employees(SSN, Name, Years_of_emp, Salary, Bonus)
Assertion
Employees hired the same year have the same
salary
This assertion implies
Years_of_emp ? Salary

32
Inference Rules

IR1 Reflexive Rule ( ?????)
Y ? X, then X ?Y
e.g.
SSN ? SSN
P, S, Qty ? Qty
IR2 Augmentation Rule ?????))
X ? Y XZ ? YZ
e.g. F SSN ? Address
F SSN, Name ? Address, Name

33
Inference Rules

IR3 Transitive rule (??????)
X ? Y, Y ? Z X ? Z
e.g.
F SSN ? Dnumber, Dnumber ? Dname
F SSN ? Dname
IR4 Decomposition Rule (????????)
X ? Y, Z X ? Y and X ? Z
e.g.
F SSN ? Ename,BDate,Address, Dnumber
F SSN ? Ename
SSN ? Bdate
SSN ? Address
SSN ? Dnumber

34
Inference Rules

IR5 Union (Additive) Rule
X ? Y, X ? Z X ? Y, Z
e.g.
F SSN ? Ename, SSN ? Bdate
F SSN ? Ename, Bdate
IR6 Pseudo-transitive Rule
X ? Y, WY ? Z WX ? Z
e.g.
F SSN ? STNO, Major, STNO ? Name
F Major, SSN ? Name

35
Closure of Attributes

How to determine if F X ? Y is true?
Method 1
Compute F
If X ?Y ? F, then F X ?Y
Problem
Computing F could be very expensive!

36
Closure of Attributes

Method 2
Compute X the closure of X under F
X denotes the set of attributes that are
functionally determined by X under F.
X Y X ? Y ? F
Theorem
X ?Y ? F if and only if Y ? X

37
Algorithm for Computing X

Input
a set of FDs F, a set of attributes X in R
Output
X
Begin
X X
Repeat
oldX X
for each FD Y ? Z in F do
if Y ? X then X X ? Z
until (X oldX )
end

38
Algorithm for Computing X

Example
R(A, B, C, G, H, I) ABCGHI
X AG
F A ? B, CG ? HI, B ? H, A ? C
Compute X (AG)
Initialization
X AG

39
Algorithm for Computing X

1st iteration
consider A ? B
since A is a subset of X, X ABG
consider CG ? HI
since CG is not a subset of X, X ABG
consider B ? H
since B is a subset of X, X ABGH
consider A ?C
since A is a subset of X, X ABCGH
X is changed from AG to ABCGH

40
Algorithm for Computing X

2nd iteration
consider A ? B
since A is a subset of X, X ABCGH
consider CG ? HI
since CG is a subset of X, X ABCGHI
consider B ? H
since B is a subset of X, X ABCGHI
consider A ? C
since A is a subset of X, X ABCGHI
X is changed from ABCGH to ABCGHI

41
Algorithm for Computing X

3rd iteration
consider each FD in F again, but there is no
change to X, exit
Result
(AG) ABCGHI.
The performance of the algorithm is sensitive to
the order of FDs in F

42
Algorithm for Computing X

Theorem
Given R(A1, ..., An) and a set of FDs F in R, K ?
R is a
superkey if K A1, ..., An
candidate key if K is a superkey and for any
proper subset X of K, X ? A1, ..., An.

43
Algorithm for Computing X

Continue the above example
AG is a superkey of R since
(AG) ABCGHI.
Since A ABCH, G G, neither A nor G is a
superkey.
Hence, AG is a candidate key

44
Normal Forms Based on Primary Keys

Normalization of data
is a process during which unsatisfactory relation
schemas are decomposed by breaking up their
attributes into smaller relation schemas that
posses desirable properties
We normalize data for several reasons one of them
is to avoid update anomalies
We have
First Normal Form (1NF)
Second Normal Form (2NF)
Third Normal Form (3NF)
Boyce-Codd Normal Form (BCNF) (a stronger
definition of 3NF)
All the above normal forms are based on
functional dependencies.

45
Normal Forms Based on Primary Keys

Forth Normal Form (4NF) Based on multivalued
dependencies.
Fifth Normal Form (5NF) based on join
dependencies.
Aside
Student-Adv(STNO, StName, Major,, Ano, Aname, )
Has several problems
Student(STNO, StName, Major, , Ano)
Advisor(Ano, Aname, .)
Pay the price of expensive joins

46
Basic Definitions

R A1, , An)
Superkey is set of attributes S ? R and t1S
? t2S ?i 1, ..., n. S may contain redundant
attributes.
Key (K) is a superkey with no redundant
attributes, i.e. removal of any attribute from K
will no longer make it a superkey.
Candidate Key if a relation has more than one
key, each is called a candidate key. One of these
keys is arbitrarily chosen as a primary key.
Prime Attribute is an attribute which is a
member of any key (primary or candidate) other
attributes are called nonprime attributes.

47
Basic Definitions

Student(SSN, STNO, Name, Address, Salary)
Superkeys
SSN,Name/SSN,STNO,Name,Address,Salary
Candidate keys
SSN, STNO
Primary Key
SSN
Prime Attribute
SSN and STNO
Nonprime Attributes
Name, Address, Salary

48
1NF (First Normal Form)

A relation schema R is in 1NF if every attribute
of R takes only single and atomic values.
Domains of attributes must include only atomic
values and that the value of any attribute in a
tuple must be a single value from the domain of
that attribute.
In other words, multivalued and composite
attributes are disallowed.

49
1NF (First Normal Form)

Student(STNO, StName, Course(CNO, Ctitle)
The set braces identify the attribute Course
as multivalued
The set braces () identify the attribute Course
as a composite attribute
Because of the last attribute (Course), the
Student relation schema is not in 1NF.

50
1NF (First Normal Form)

Student(STNO, StName, Course(CNO, Ctitle)
To normalize it to 1NF
Student(STNO, StName, CNO, Ctitle)
this is not a good representation because it has
many disadvantages. Replication of Course
information and student information. Combining
attributes which belong to two separate entities
(namely student and course) into a single
relation schema.

51
1NF (First Normal Form)

Student(STNO, StName, Course(CNO, Ctitle)
Student(STNO, StName, CNO)
Course(CNO, Ctitle)
this design suffers from drawback that student
information has to be repeated in the student
relations several times (in fact a number of
times that is equal to the number of courses
taken by that student).

52
1NF (First Normal Form)

Student(STNO, StName, Course(CNO, Ctitle)
Student(STNO, StName)
Course(CNO, Ctitle)
Study(STNO, CNO)
This is the best representation because we have
reduced the duplication.

53
Second Normal Form (2NF)

Prime attribute --- an attribute in any candidate
key.
Y is fully functionally dependent on X if X ? Y
and no proper subset of X functionally determines
Y
FD X ? Y is a fully functional dependency if
removal of any attribute A from X means that the
dependency does not hold any more. In other words
(X - A) does not determine Y
A FD X ? Y is a partial FD if exist some
attribute A which belongs to X and (X - A) ? Y
still holds

54
Second Normal Form (2NF)

General Definition of 2NF
A relation schema R is in 2NF if every nonprime
attribute A in R is not partially dependent on
any key of R
(i.e. if every nonprime attribute is fully
functionally dependent on every key of R)

55
Second Normal Form (2NF)

Emp-Proj(SSN, Pnumber, Hours, Ename, Pname,
Plocation)
FD1 SSN, Pnumber ? Hours (FD)
FD2 SSN ? Ename (PD)
FD3 Pnumber ? Pname, Plocation (PD)
Because of FD2 and FD3 Emp-Proj is not in 2NF
2NF Normalization
Emp(SSN, Ename)
Proj(Pnumber, Pname, Plocation)
Work(SSN, Pnumber, Hours)

56
Second Normal Form (2NF)

Consider
Bank-Loans (Bank_name, Assets, Headquarter,
Loan_no, Customer_name, Amount),
FD1 Bank_name ? Assets, Headquarter
FD2 Bank_name, Loan_no ? Customer_name,
Amount
Because if FD1, Bank-Loans is not in 2NF.
2NF Normalization
Banks(Bank_name, Assets, Headquarter)
Loans(Bank_name, Loan_no, Customer_name, Amount)

57
Second Normal Form (2NF)

2NF is not good enough
A relation schema in 2NF can still have serious
redundancy problem as well as insertion and
deletion anomalies.
Consider Parts(Part_no, Name, Location,
Unit_price, Manu_id, Manu_name, Manu_Address)
It is obvious that Parts is in 2NF
Redundancy and various anomalies are introduced
by
Manu_id ? Manu_name, Manu_Address

58
Second Normal Form (2NF)

Consider
EMP_DEPT(SSN, EName, BDate, Address, DNo, DName,
DMGRSSN)
It is obvious that EMP_DEPT is in 2NF
Redundancy and various anomalies are introduced
by
DNo ? DName, DMGRSNN

59
Third Normal Form (3NF)

Transitive Dependency
a FD X ? Y in R is transitive if exist some set
of attributes Z that is not a subset of any key
of R and both X ? Z and Z ? Y hold.
A relation schema R is in 3NF if it is in 2NF and
no nonprime attribute A of R is transitively
dependent on a key of R

60
Third Normal Form (3NF)

A relation schema R is in 3NF if for every FD X ?
A, where A is a single attribute, at least one of
the following is true
(a) A ? X (Trivial dependency - Reflexive)
(b) A is a prime
(c) X is a superkey

61
Third Normal Form (3NF)

R is not in 3NF if a non-prime non-trivially
depends on a non-superkey.
If R in 3NF, it should not have a nonkey
attribute functionally determined by another
nonkey attribute (or by a set of nonkey
attributes)

62
Third Normal Form (3NF)

General Definition of 3NF
A relation schema R is in 3NF if, whenever a
nontrivial functional dependency X ? A holds,
either
(a) X is a superkey of R, or
(b) A is a prime attribute of R

63
Third Normal Form (3NF)

Emp-Dept(SSN, Ename, Bdate, Address, Dnumber,
Dname, DMGR-SSN)
FD1 SSN ? Ename, Bdate, Address, Dnumber,
Dname, DMGR-SSN
FD2 Dnumber ? Dname, DMGR-SSN
Emp-Dept is in 1NF, 2NF, but because of
SSN ? Dnumber and
Dnumber ? Dname and
Dname is a nonprime attribute, and
Dnumber is not a superkey,
Emp-Dept is not in 3NF.
To transform Emp-Dept into 3NF
Emp(Enam, SSN, Bdate, Address, Dnumber)
Dept(Dnumber, Dname, DMG-SSN)

64
Third Normal Form (3NF)

Employees (SSN, Name, Age, Salary, Dept_name,
Dept_manager_SSN).
Employees is in 2NF
since SSN is the only candidate key and every
attribute is fully dependent on it.
Employees is not in 3NF because
Dept_name ? Dept_manager_SSN

65
Third Normal Form (3NF)

LOTS(Property-ID, County-Name, Lot, Area,
Price, Tax-Rate)
FD1 Property-ID ? County-Name, Lot, Area,
Price, Tax-Rate
FD2 County-Name, Lot ? Property-ID, Area,
Price, Tax-Rate
FD3 County-Name ? Tax-Rate
FD4 Area ? Price
2 candidate keys Property-ID,County-Name,
Lot
LOTS is not in 2NF, because of County-Name ?
Tax-Rate
Tax-Rate is partially dependent on the candidate
key County-Name, Lot.
2NF
LOTS1(Property-ID, County-Name, LOT, Area,
Price)
LOTS2(County-Name, Tax-Rate)

66
Third Normal Form (3NF)

The relation LOTS1 is not in 3NF, because of Area
? Price
Area is not a superkey and Price is not prime
attribute
3NF
LOTS1A(Property-ID, County-Name, Lot, Area)
LOTS1B(Area, Price)
LOT2(County-Name, Tax-Rate)
The above relation schemas are in 3NF

67
Boyce-Codd Normal Form (BCNF)

Assume that we have thousands of lots, but
two-counties
Marion county and Liberty county.
Lot areas in Marion county are
.5, .6, .7, .8, .9 and 1 acres
Lot areas in Liberty county are
1.1,1.2, , 1.9, 2.0 acres
In this case we have AREA ? County-Name
LOTS1A(Property-ID, County-Name, Lot, Area)
FD5 Area ? County-Name
Still in 3NF , since County-Name is a prime
attribute

68
Boyce-Codd Normal Form (BCNF)

A relation schema R is in BCNF if whenever a FD X
? A holds in R, then X is a superkey of R.
R is in BCNF if for every non-trivial FD, the
left side is a superkey.
LOTS1A-X(Property-ID, Area, Lot)
LOTS1A-Y(Area, County-Name)
To describe a relation schema R as good it
should be at least in 3NF (general) or BCNF
If R is in BCNF, then R is also in 3NF
However, R in 3NF R in BCNF

69
Normal Forms Summary

Relationships between different NFs.

70
Normal Forms Summary

1NF
Attributes should be single-valued and have
atomic domain
Normalize into 1NF
Form a new relations for each non-atomic
attribute
2NF
2NF removes some insertion anomalies and deletion
anomalies.
2NF removes some redundancies, namely,
redundancies caused by partial dependencies on
key.

71
Normal Forms Summary

3NF
3NF removes all insertion anomalies and deletion
anomalies.
3NF also removes some redundancies caused by
transitive dependencies.
BCNF
achieves all achieved by 3NF.
BCNF removes all redundancies caused by FDs.

Write a Comment

User Comments (0)

About PowerShow.com

CS 728 Advanced Database Systems Chapter 14 PowerPoint PPT Presentation