Title: Databases I H.10H.11 Summary FDs and Normalisation
1Databases I (H.10/H.11)Summary FDs and
Normalisation
versie 2007
Steven KlusenerVrije Universiteit, Amsterdam
2Topics
- Functional dependencies (FDs)
- Formal definition of an FD
- Deriving FDs from others, or, constructing a
minimal cover - Showing counter examples
- Candidate and primary keys, based on FDs
- Normalisation 2 NF, 3 NF and BCNF
- Decomposition and the lossless join property
- A 3NF decomposition algorithm that satisfies the
lossless join property.
3Functional Dependency (FD)
- Given
- A relation R with attributes A1, A2, , An
- X ? A1, A2, , An and Y ? A1, A2, , An
- Y depends functionally on X (notation X ? Y)iff
for each possible extension of R it holds that - ?t1,t2 ? R (t1X t2X) ? (t1Y t2Y)
- Note that if all tuples in a R have a different
value for X, then the FD X ? Y is trivially
satisfied
4Minimal set of FDs
- Certain functional dependencies can be derived
from others - For example
- ENR ? BDATE
- (ENR value E1 (John) determines the birth date
1964-08-28) - BDATE ? ZODIAC zodiac sterrenbeeld
- (The birth date 1964-08-28 determines the zodiac
value Virgin) - Hence, each employee number determines the zodiac
sign of that employee, so we have ENR ? ZODIAC,
which follows from ENR ? BDATE and BDATE ? ZODIAC
(by transitivity as we will see later)
5Derivability of FDs, closure
- We can derive FDs from others using the
following inteference rules - Reflexive Rule the FD X ? Y is given, and hence
derivable, - for any Y ? X.
- Augmentation Rule the FD XZ ? YZ is derivable
from - the FD X ? Y.
- Transitive Rule the FD X ? Z is derivable from
- the two FDs X ? Y and Y? Z
- From these rules other interference rules can be
obtained, such as - Decomposition Rule the FD X ? Y is derivable
from - the FD X ? YZ.
- Additive Rule the composed FD X ? YZ is
derivable from - the two FDs X ? Y and X ? Z.
- From a set of FDs F we write F for the set of
all FDs that can be derived from F, F is called
the closure of F
6Minimal set of FDs
- A set F of functional dependencies is minimal
iff - Every FD in F is of the form X ? A, where A is an
attribute - We cannot make any left-hand-side smaller, so if
a rule XY ? A is simplified into X ? A then the
closure gets strictly smaller - We cannot remove FDs from F without losing FDs
in the closure of F - In general, the construction of a minimal cover
does not lead to unique one
7Keys
- Given
- A relation R with attributes A1, A2, , An
- A minimal set of FDs F
- A set of attributes X ? A1, A2, , An
- then
- X is a (candidate) key of R iff for each i in
1,..,n - 1) identification X ? Ai is derivable from F èn
- 2) minimalitity if Y?X ? Y?X then Y ? Ai is not
derivable from F
8Some terminology w.r.t. keys
- superkey (identification, but no minimality)
- Like (ENR,NAME)
- candidate key (a candidate, see previous
slide) - Like (ENR)
- primary key (the selected candidate key)
- alternate/alternative key (all remaining
candidate keys)
9Categorizing FDs, based on the key
- Given a relation R and a minimal set of FDs F
- An attribute A is a prime attribute if A ? K, for
some candidate key K, otherwise it is a non-prime
attribute - If attr. A is a non-prime attribute, then X ? A
- is a regular FD if X is a candidate key
- is a partial FD if X ? K, for some candiate key K
- is a transitive FD if all atributes in X are
non-prime - If attr. A is a prime attribute, then X ? A is a
prime FD
10Normal forms
- Given a relation R, a minimal set of FDs F
- R is in
- 2NF, if there are no partial FDs
- 3NF, there are no partial and no transitive FDs
- BCNF, if R is in 3NF, ànd if for every X ? A in F
it holds that X is a key
112 NF
- 2 NF no partial FDs are allowed, if they occur,
then the relation has to be decomposed. - EMP_PROJ(SSN,PNUMBER,HOURS,ENAME,PNAME,PLOC)
- With FDs
- SSN, PNUMBER ? HOURS
- SSN ? ENAME (partial dependency)
- PNUMBER ? PNAME, PLOC (partial dependency)
- After decomposition
- EMP_PROJ1 (SSN,PNUMBER, HOURS)
- EMP (SSN, ENAME)
- PROJ (PNUMBER, PNAME, PLOC)
- Note
- If the key contains only one attribute, the 2NF
property holds trivially - Be sure that you have lossless projection (to be
discussed later)
123 NF
- 3 NF no transitive FDs are allowed, if they
occur, then the relation has to be decomposed. - EMP_DEPT(SSN, ENAME, BDATE, DNUMBER, DNAME,
DMGRSSN) - With FDs
- SSN ? ENAME, BDATE, DNUMBER
- DNUMBER ? DNAME, DMGRSSN (transitive
dependency) - After decomposition
- EMP (SSN, ENAME, BDATE, DNUMBER)
- DEPT (DNUMBER, DNAME, DMGRSSN)
- Rephrasing the 3 NF property for every non-prime
attribute A and FD X ? A in F, X must be a
candidate key
13BCNF
- For every FD X ? A, X must be a superkey
- TEACH(STUDENT, COURSE, INSTRUCTOR)
- With FDs
- STUDENT, COURSE ? INSTRUCTOR
- INSTRUCTOR ? COURSE (INSTRUCTOR is not a key)
- Decomposition is not trivial
- S-I-1(STUDENT, INSTRUCTOR) and S-C-1(STUDENT,
COURSE) - C-I-2(INSTRUCTOR, COURSE) and C-S-2(COURSE,
STUDENT) - I-C-3(INSTRUCTOR, COURSE) and I-S-3(INSTRUCTOR,
STUDENT) - Conclusion
- FD1 is lost in all three cases (no subrelation
contains all three attributes) - I-C-3/I-S-3 is the best, because it is
non-additive (to be discussed later)
14Dependency preservation during decomposition
- During normalisation we decompose relations into
subrelations, until we arrive at the right level - However, we have to take care this decomposition
guarantees - Lossless joins we must be able to construct the
original relation from joining the subrelations - In other words, joining the subrelations may not
introduce spurious tuples
15Example Natural Join (?)
- EMP DEPT
- E ENAME BDATE D D DNAME
BUDGET - E1 John 28-08-1964 D1 D1
engineering 500,000 - E2 Joe 04-04-1968 D1 D2 sales
200,000 - E3 Jack 03-09-1969 D1
- E4 Will 21-03-1971 D2
- E5 Bridget 22-01-1972 D2
-
- EMP ? DEPT Join of EMP and DEPT, combine every
tuple from EMP and tuple of DEPT if their common
attributes (here D) have the same value - EMP ? DEPT
- E ENAME BDATE D NAME BUDGET
- E1 John 28-08-1964 D1 engineering 500,000
- E2 Joe 04-04-1968 D1 engineering 500,000
- E3 Jack 03-09-1969 D1 engineering 500,000
- E4 Will 21-03-1971 D2 sales 200,000
- E5 Bridget 22-01-1972 D2 sales 200,000
16(No Transcript)
17(No Transcript)
18(No Transcript)
19(No Transcript)
20(No Transcript)
21Dependency preservation during decomposition
- During normalisation we decompose relations into
subrelations, until we arrive at the right level - However, we have to take care this decomposition
guarantees - Lossless joins we must be able to construct the
original relation from joining the subrelations - In other words, joining the subrelations may not
introduce spurious tuples
22Lossless-join, decomp. into 2 projections
- D is a lossless join decomposition w.r.t. F iff
- (R1 ? R2) ? R1 ? F, òr
- (R1 ? R2) ? R2 ? F
- R DPD_EMP E, DPD_N, REL, EMP_N, BDATE,
Dwith F E, DPD_N ? REL, E ?
EMP_N, E ? BDATE, E ? D
R1 DPD E, DPD_N, RELR2 EMP E,
EMP_N, BDATE, DR1 ? R2 E and E ?
E, EMP_N, BDATE, D (EMP)Hence this
decomposition of DPD_EMP is lossless w.r.t. F
23Ex. lossless-join (with n projections) (1/5)
- DPD_EMP_DPM (E, DPD_N, REL, EMP_N, BDATE,
D, DPM_N, BUDGET) - with F E, DPD_N ? REL, D ? DPM_N,
E ? EMP_N, D ? BUDGET, E ?
BDATE, DPM_N ? D, E ? D, DPM_N ?
BUDGET DPD (E, DPD_N, REL)EMP (E,
EMP_N, BDATE, D)DEPT (D, DPM_N, BUDGET)
24Ex. lossless-join (with n projections) (2/5)
- DPD_EMP_DPM (E, DPD_N, REL, EMP_N,
BDATE, D, DPM_N, BUDGET) - with F E, DPD_N ? REL, D ? DPM_N, E
? EMP_N, D ? BUDGET, E ? BDATE, DPM_N
? D, E ? D, DPM_N ? BUDGET - INITIAL MATRIX ai for each relation/row that
has a certain attribute (whit index i) -
- E DPD_N REL EMP_N BDATE D DPM_N BUDGET
- DPD a1 a2 a3 b14 b15 b16 b17 b18
- EMP a1 b22 b23 a4 a5 a6 b27 b28
- DEPT b31 b32 b33 b34 b35 a6 a7
a8
25Ex. lossless-join (with n projections) (3/5)
- DPD_EMP_DPM (E, DPD_N, REL, EMP_N,
BDATE, D, DPM_N, BUDGET) - met F E, DPD_N ? REL, D ? DPM_N, E ?
EMP_N, D ? BUDGET, E ? BDATE, DPM_N ?
D, E ? D, DPM_N ? BUDGET - E DPD_N REL EMP_N BDATE D DPM_N BUDGET
- DPD a1 a2 a3 b14 b15 b16 b17 b18
- EMP a1 b22 b23 a4 a5 a6 b27 b28
- DEPT b31 b32 b33 b34 b35 a6 a7
a8 - N.B. After applying the FD E ? EMP_N, BDATE, D
26Ex. lossless-join (with n projections) (4/5)
- DPD_EMP_DPM (E, DPD_N, REL, EMP_N,
BDATE, D, DPM_N, BUDGET) - met F E, DPD_N ? REL, D ? DPM_N, E ?
EMP_N, D ? BUDGET, E ? BDATE, DPM_N ?
D, E ? D, DPM_N ? BUDGET - E DPD_N REL EMP_N BDATE D DPM_N BUDGET
- DPD a1 a2 a3 a4 a5 a6 b17 b18
- EMP a1 b22 b23 a4 a5 a6 b27 b28
- DEPT b31 b32 b33 b34 b35 a6 a7
a8 - N.B. After applying the FD E ? EMP_N, BDATE, D
27Ex. lossless-join (with n projections) (5/5)
- DPD_EMP_DPM (E, DPD_N, REL, EMP_N,
BDATE, D, DPM_N, BUDGET) - met F E, DPD_N ? REL, D ? DPM_N, E ?
EMP_N, D ? BUDGET, E ? BDATE, DPM_N ?
D, E ? D, DPM_N ? BUDGET - E DPD_N REL EMP_N BDATE D DPM_N BUDGET
- DPD a1 a2 a3 a4 a5 a6 a7 a8
- EMP a1 b22 b23 a4 a5 a6 a7 a8
- DEPT b31 b32 b33 b34 b35 a6 a7
a8 - N.B. After applying the FDs E ? EMP_N, BDATE,
D and D ?
DPM_N, BUDGET - Lossless-join eigenschap has been shown row DPD
contains only as!
28Algorithm lossless-join
- Does the decomposition R1, , Rk of R satisfy
the lossless-join property w.r.t. m.b.t. A set of
FDs F? - 1) construct the initial matrix, consisting of
- A row for each subrelation Ri
- A column for each attribute of R
- For each entry, thus for each row i and column j
- Put ai in the entry, if row i has the attribute
of column j - Otherwise, put bij in this entry
- 2) apply each FD X ? Y ? F (see below) until
- Either one row is of the form a1, , an,, and the
composition is lossless indeed - Or, applying the FDs again will not change the
matrix anymore, and the composition is not
lossless - Applying FD X ? YFor all rows that have the
same values for the attributes in X, make the
attributes in Y equal as well (first start with
the aj, otherwise try the bij)
293NF decomposition algoritme (lossless d.p.)
- Given a relation scheme R and a set of FDs F
- Construct a minimal cover of F, call it G
- For each Xi ? Ai in G, construct a subrelation
scheme XiAi - If Xi Xj for two subrelation schemes (XiAi and
XiAj), merge them into XiAiAj. - Construct one relation scheme X, where X is the
prime key of R. - Check whether there are still remaing attributes
(which are not yet covered by the earlier steps),
put them in separate subrelation schemes - N.B. Steps 2 and 3 can also be combined into one
step.
30Final remarks
- With this material one must be capable to
recognize potential redundancies, and to avoid
them to a certain extent - Normalisation can lead to a large number of
small tables (i.e., tables with a small number of
attributes) which have to be combined with others
to obtain proper data. This may cost performance
and maintenance overhead. - Hence, it is up to the database designer to
decide whether these potential redundancies have
to be resolved or not.