Title: Teknik Informatika Fakultas Teknik
1Module 5FunctionalDependencies
andNormalization
- Teknik Informatika Fakultas Teknik
- Universitas Dr. Soetomo Surabaya
2Module 5 - Motivation
- Relational Database Schema resulting from an
ER-Relational Mapping exhibits good design
characteristics - How can we measure the goodness or quality of
the design ? - Use formal theories to distinguish good relation
schemas from bad - Functional dependency information
- Normalization techniques
3Module 5 - Contents
- Functional Dependencies and Normalization
- Informal Design Guidelines
- Functional Dependency
- Normalization
- Properties of Decomposition
4Informal Design Guidelines
- Informal measures of relational database
- schema quality and design guidelines
- Semantics of the attributes
- Reducing the redundant values in tuples
- Reducing the null values in tuples
- Disallowing spurious tuples
- Apply to schemas of base relations
- (files that will be physically stored)
5Semantics of the attributes
- Grouping attributes to construct a relation
implies that a particular meaning or semantics
is associated with the attributes - Employee Smith, with unique id (SSN) 123, who
lives - at 23 Queens Street, earns salary 34K and
works in - Department Payroll
- EMPLOYEE SSN, Name, Address, Salary, Dept
- The employee with unique id (SSN) 123 Works-on
- Project Y2K for 23 hours
- WORKS-ON SSN, Proj, Hours
6Guideline 1
- Design each relation so that it is easy to
explain its meaning - Do not combine attributes from multiple entity
types and relationship types into a single
relation - Employee Smith, with unique id (SSN) 123,
- who lives at 23 Queens Street, earns salary
34K and - works in Department Payroll that is managed
by - Wyeth and is located in Perth
- EMPLOYEE SSN, Name, Address, Salary, Dept,
Manager, Location
7Redundant values in tuples
- One design goal is to minimize the storage space
that base relations occupy - The way the grouping of attributes into relation
schemas is done, has a significant effect on
storage space - In addition, an incorrect grouping may cause
update anomalies which may result in inconsistent
data or even loss of data
8Example with redundant values
- Consider relation BigEmp,
- which groups tuples from
- two distinct entity types
9Update Anomalies
- Insertion Anomalies
- Problems with inserting new tuples
- Deletion Anomalies
- Problems with deletion of tuples - may result
- in loss of information
- Modification Anomalies
- Problems with modifying values of attributes
- may result in inconsistent data
10Insertion Anomaly 1
- Insertion of new employee requires correct
department name and location details or null
values if the employee not assigned to any
department yet
11Insertion Anomaly 2
- Insertion of new department that has no employees
yet is not possible since EMP (primary key of
BigEmp) can not be null
12Deletion Anomalies
- Deletion of all employees in Department D4
results in loss of departments name and location
13Modification Anomalies
- If we change the location of the Shipping
department, many tuples must be modified to keep
the database consistent
14Guideline 2
- Design the base relation schema so that no
insertion, deletion, or modification anomalies
occur in the relations - If any do occur, ensure that all applications
that access the database update the relations in
such a way as to not compromise the integrity of
the database
15Null Values in Tuples
- Tuples without data for certain attributes are
said to contain null values - Example NULL value for dept in EMPLOYEE
- relation (123,Jones, 12 Queen St, NULL, 34K)
- When occurring in large numbers, null values
represent a waste of physical disk space - More importantly, null values can introduce
problems with the semantics of relations, and in
specification of various operations such as join,
aggregation
16Semantics of Null Values
- Nulls can have multiple interpretations
- Attribute does not apply to this relation
- Attribute value for this tuple is unknown
- The value is known but not yet recorded yet
- Having the same representation (NULL) for all
possibilities above compromises the different
meanings they may have
17Guideline 3
- As far as possible, avoid placing attributes in a
base relation whose values may be null - If nulls are unavoidable, make sure that they
apply in exceptional cases only and that they do
not apply to a majority of tuples in the relation
18Disallowing spurious tuples
- The bifurcation, or general decomposition, of a
large relation into a set of smaller ones can
result in numerous problems - One problem involves the situation where
information in the original relation is lost due
to the decomposition - Ironically, the problem becomes apparent when
the - smaller relations are joined back together
and the - result is larger than the original relation,
i.e. contains - spurious tuples
19Example Decomposition
20Join with Spurious Tuples
21Guideline 4
- Design the relation schemas so that they can be
(relationally) joined with equality conditions on
attributes that are either primary keys or
foreign keys in a way that guarantees that no
spurious tuples are generated - Do not have relations that contain matching
- attributes other than foreign key - primary
key - combinations
- If such relations are unavoidable, do not join
- them on such attributes, because the join may
- contain spurious tuples
22Module 5 - Contents
- Informal Design Guidelines
- This section presented informal
- guidelines for good relational design
- The following sections present formal
- concepts and applied theory
- Functional Dependency
- Normalization
- Properties of Decomposition
23Functional Dependency
- Single most important concept in relational
- schema design
- To understand the concept, we will ...
- introduce, motivate and formally define the
- term functional dependency
- describe basic inference rules for functional
- dependencies
- use these inference rules to derive others and
- to compute closures
24What is Functional Dependency
- A function dependency (FD) is a constraint
denoted X ? Y between two sets of attributes X
and Y from a relational schema R. - The FD specifies a restriction on the possible
tuples that can form a relation instance r of R
25The FD Constraint - Informally
- Example A relation
- DEPARTMENT
- (DNO, DNAME, DLOC)
- can have following FDs
- FD1 DNO -gt DLOC
- FD2 DNO -gt DNAME
- Functional dependency X -gt Y holds if and only if
whenever two tuples agree on their X-value, they
must necessarily agree on their Y value
26The FD Constraint - Formally
- The FD constraint is that for any two tuples t1
and t2 in the relation instance r that have - t1X t2X
- we must also have
- t1Y t2Y
- This means that the values
- of the Y component of a tuple depend on, or are
determined by, the X component - The values of the X component of a tuple uniquely
(or functionally) determine the values of the Y
component
If X -gt Y holds, then Y is functionally dependent
on X X is termed the left-hand-side (LHS)
of the FD Y is termed the right-hand-side
(RHS) of the FD
27What determines FDs
- FD is a property of the meaning or semantics of
the attributes - Humans use their understanding of these semantics
to create and specify each FD, they cannot be
accurately obtained from viewing a particular
relation instance - However, if a FD does not hold in even one
instance then the FD cannot hold on the schema
28Relation at Time T1
- Can you determine if the FD STATE -gt CODE holds
true in the database from viewing the instance
above?
29Relation at Time T2
- If the existence of the Smith tuple is legal,
then it implies that the proposed FD - STATE -gt CODE is invalid !
30Application of FDs
- The main use of FDs is to further describe (if
the semantics of the application warrant) a
relation schema R by specifying constraints on
its attributes that must hold at all times - Relation instances r(R), that satisfy the FD
constraints are called legal extensions, legal
instances or legal relation states
31Inferred FDs
- Designer specifies obvious FDs, however numerous
other FDs can be inferred - Let F be a set of FDs on R
- F Pnum -gt Pname, Plocation, Dnum,
- Dnum -gt Dname, MgrSSN, MgrStartDate
- FDs that can be inferred are
- Pnum -gt MgrStartDate
- Dnum -gt MgrSSN
- Pnum -gt Pnum
32Notation for Inferred FDs
- The notation F X -gt Y denotes that the
- FD X -gt Y is inferred from the set of
- functional dependencies F
- Example
- F Pnum -gt MgrStartDate,
- Dnum -gt MgrSSN,
- Pnum -gt Pnum
- where F Pnum -gt Pname, Plocation, Dnum,
- Dnum -gt Dname, MgrSSN, MgrStartDate
33Compact Notation for FDs
- Pnum -gt PName, PLocation, Dnum
- Dnum -gt DnaMe, MGrSSN, MgrStartDate
- ESSN, DependenT_Name -gt BDate
-
- Compact Notation
- P -gt NLD
- D -gt MGS
- ET -gt B
34Inference Rules
- It is practically impossible to specify all
- possible FDs that may hold
- To infer additional FDs from a set of valid FDs
we need a system of inference rules - There are 6 inference rules IR1 - IR6
- IR1-IR3 are referred to as Armstrongs
- Inference Rules
- The set of all dependencies that can be
- inferred from a given F is called the
- closure of F, denoted by F
35IR1 Reflexive Rule
- If X ? Y then X -gt Y
- A set of attributes always determines itself or
- any of its subsets
- If ESSN, Dependent_Name ? ESSN then
- ESSN, Dependent_Name -gt ESSN holds
36IR2 Augmentation Rule
- X -gt Y XZ ? YZ
- Adding the same set of attributes to both the
- LHS RHS of a FD results in another
- valid FD
- If SSN -gt Ename then
- SSN, Address -gt Ename, Address
- holds
37IR3 Transitive Rule
- X -gt Y, Y -gt Z X -gt Z
- FDs are transitive
- If SSN -gt Dno
- and Dno -gt Dlocation
- then SSN -gt Dlocation holds
38Armstongs Inference Rules
- The rules IR1-IR3 are sound and complete.
- Sound Given a set of FDs F holding on a relation
schema R, any FD that we can infer from F by
using IR1-IR3 holds in every legal relation
instance - Complete Repeatedly applying IR1-IR3 generates
all possible FDs that can be inferred from F
39IR4 Decomposition Rule
- X -gt YZ X -gt Y
- We can remove attributes from the RHS of a
- dependency, and decompose the FD
- If SSN -gt Ename, Dno
- then SSN -gt Ename and SSN -gt Dno
- both hold
40Proof of Decomposition Rule
- We can use IR1-IR3 to prove IR4
- 1 X -gt YZ Given
- 2 YZ -gt Y Using IR1 plus YZ ? Y
- 3 X -gt Y Using IR3 on results of 1 2
41IR5 Union Rule
- X -gt Y, X -gt Z X -gt YZ
- We can union attributes from the RHS of a
- dependency, and combine a set of FDs
- into a single FD (reverse of IR4)
- If SSN -gt Ename and SSN -gt Dno
- then SSN -gt Ename, Dno
- holds
42Proof of Union Rule
- We can use IR1-IR3 to prove IR5
- 1 X -gt Y Given
- 2 X -gt Z Given
- 3 XX -gt XY Using IR2 on 1 (adding X)
- 4 X -gt XY Since XXX (not using IRI-R3)
- 5 XY -gt YZ Using IR2 on 2 (adding Y)
- 6 X -gt YZ Using IR3 on 4 5
43IR6 Pseudotransitive Rule
- X -gt Y, WY -gt Z WX -gt Z
- Represents a variant of IR3
- If SSN -gt MgrSSN and
- MgrSSN, Dependent_Name -gt
- Relationship
- then SSN, Dependent_Name -gt Relationship
44Proof of Pseudotransitive Rule
- We can use IR1-IR3 to prove IR6
- 1 X -gt Y Given
- 2 WY -gt Z Given
- 3 WX -gt WYUsing IR2 on 1 (adding W)
- 4 WX -gt Z Using IR3 on 3 2
45Use of Inference Rules
- Step 1 Database designers first specify the set
of FDs F - that can easily be determined from
semantics - Step 2 Armstrongs rules are used to infer
additional FDs - that also hold
- Determine set of attributes X that appear on a
FDs LHS - Use Armstrongs rules to determine attributes
- determined by X (called X)
- X is the closure of X under the set F
46Computing the Closure (X)
- Algorithm to compute the closure of X under F
- X X
- repeat
- old X X
- for each FD Y -gt Z in F do
- if Y ? X then X X? Z
- until (old X X )
47Example Closure Computation
- X Pnumber
- F Pnumber -gt Pname, Plocation, Dnum
- Dnum -gt Dname, MgrSSN, MgrStartDate
- Step 1 X Pnumber
- Step 2 X Pnumber, Pname, Plocation, Dnum
- Step 3 X Pnumber, Pname, Plocation, Dnum,
- Dname, MgrSSN,
MgrStartDate
48Module 5 - Contents
- Informal Design Guidelines
- Functional Dependency
- Functional dependencies are constraints forced
on all - database instances
- The designer specifies some intuitive FDs but
others - can be inferred using rules
- The closure of a set of FDs F is the set of
all FDs F that - can be inferred from F, where as the
closure of a attribute - X under F is the set of attributes X
that are functionally - determined by X
- Normalization
- Properties of Decomposition
49Normalization
- Normalization is a process that aims at achieving
better designed relational database schemas
through the use of semantic information given by - Functional Dependencies
- Primary Keys
50The Normalization Process
- The normalization process takes a relational
schema through a series of tests to certify
whether it satisfies certain conditions - The schemas that satisfy certain condition are
- said to be in a given Normal Form.
- Unsatisfactory schemas are decomposed by
- breaking up their attributes into smaller
- relations that possess desirable properties
- (e.g., no anomalies)
51Review of Key Concepts
- Superkey - A set of attributes such that no two
- tuples have the same values for these attributes
- Key - A Minimal Superkey, called a Candidate
- Key if more than one
- Primary key - A selected candidate key
- Secondary key - Remaining candidate keys
- Prime Attribute - An attribute that is a member
of - any candidate key
- Non-prime attribute An attribute that is not a
- member of any candidate key
52First Normal Form (1NF)
- A relation schema is in 1NF if domains of
attributes - include only atomic (simple, indivisible) values
- and the value of an attribute is a single value
- from the domain of that attribute
- 1NF disallows
- having a set of values, a tuple of values, or
a - combination of both as an attribute value for
a - single tuple
- relations within relations and relations as
- attributes of tuples
- Considered part of the formal definition of
relation
53Non-1NF Relation
Non atomic values
54Relations in 1NF
Problem with this design Redundancy (Jones, 123)
55Relations in 1NF
- Problem with this design
- Too many Null values
- What if an order has gtmax (4) items ?
56Second Normal Form (2NF)
A relation schema is in 2NF if it is in 1NF, and
every non-prime attribute is fully functionally
dependent on the primary key
- A FD X -gt Y is termed partial
- if some attribute can be
- removed from X and the
- dependency still holds
- Formally, for some attribute
- A? X, (X-A) -gt Y is true
- A FD X -gt Y is termed full if
- removal of any attribute
- from X means that the FD
- no longer holds
- Formally, for any attribute
- A ? X, (X-A) -gt Y is false
57Non 2NF Relation
- Relation DEP
- DEP SSN, Dep_Name, Dep_Bdate,
- Emp_Bdate
- Functional Dependencies in DEP
- FD1 SSN, Dep_Name -gt Dep_Bdate
- FD2 SSN -gt Emp_Bdate
58Relation in 2NF
- 2NF relations can be produced by creating a
decomposition where non-prime attributes are
associated only with the part of the primary key
on which they are fully dependent - Relations after decomposition of DEP
- EMP SSN, Emp_Bdate
- NEW-DEP SSN,Dep_Name, Dep_Bdate
59Third Normal Form (3NF)
- A relation schema is in 3NF if it is in 2NF,
- and no non-prime attribute is transitively
- dependent on the primary key
- A FD X -gt Y is termed transitive if there is a
- set of attributes Z that is not a subset of any
- key and both X -gt Z and Z -gt Y hold
60Non 3NF Relation
- Relation EMP
- EMPName, State, Code
Functional Dependencies in PERSON FD1 NAME -gt
STATE, CODE FD2 CODE -gt STATE Transitive
Dependency !
61Relation in 3NF
- 3NF relations can be produced from 2NF relations
by creating a decomposition where no transitive
dependencies exist
Relations after decomposition of EMP EMP-CODE
Name, Code STATE-CODE State, Code
62Generalization of 2NF and 3NF
- 2NF and 3NF respectively require that a schema
should have neither partial nor transitive
dependencies on the primary key to avoid update
anomalies - However the dependencies could exist for
- other candidate keys !
- General definitions for 2NF and 3NF ...
63General Definition of 2NF
- A relation schema R is in 2NF if it is in 1NF,
- and every non-prime attribute is fully
- functionally dependent on every key of R
- Property PID, Shire, Lot, Size, Tax_Rate,
Price - FD1 PID -gt Shire, Lot, Size, Tax_Rate, Price
- FD2 Shire, Lot -gt PID, Size, Tax_Rate, Price
- FD3 Shire -gt Tax_Rate
- FD4 Size -gt Price
- What are the candidate keys of Property ?
64Generalized 2NF Example
- Property PID, Shire, Lot, Size, Tax_Rate,
Price - Candidate keys are PID Shire, Lot
- Is there a partial dependency on any key ?
- FD1 PID -gt Shire, Lot, Size, Tax_Rate, Price
- FD2 Shire, Lot -gt PID, Size, Tax_Rate , Price
- FD3 Shire -gt Tax_Rate
- If so, the relation is not in 2NF
- 2NF Relations after decomposition
- Property PID, Shire, Lot, Size, Price
- ShireRates Shire, Tax_Rate
65General Definition of 3NF
- A relation schema R is in3NF if for every FD
- X-gtA that holds on R, either
- a) X is a superkey of R
- OR
- b) A is a prime attribute of R
- This definition can be applied directly to test
for 3NF without having to check for 2NF first
66Generalised 3NF Example
- Property PID, Shire, Lot, Size, Price
- Primary key PID, Candidate key Shire, Lot
- FD1 PID -gt Shire, Lot, Size, Price
- FD2 Shire, Lot -gt PID, Size, Price
- FD4 Size -gt Price
- Does any FD fail the 3NF test?
- If so, the relation is not in 3NF
- 3NF Relations after decomposition
- Property PID, Shire, Lot, Size
- Cost Size, Price
67Alternate Definition of 3NF
- A relation schema R is in 3NF if every nonprime
attribute of R is - a) Fully functionally dependent on every key of
R - AND
- b) Non-transitively dependent on every key of R
68Boyce-Codd NF (BCNF)
- A relation schema R is in BCNF if for every
- FD X-gtA that holds on R, X is a superkey of R
- A stricter normal form than 3NF
- In practice, most 3NF relational schemas are
- also in BCNF
- BCNF is better than 3NF unless other
- properties (to be discussed) are lost
69BCNF Example
- Property PID, Shire, Lot, Size
- FD1 PID -gt Shire, Lot, Size
- FD2 Shire, Lot -gt PID, Size
- FD5 Size -gt Shire
- Candidate keys are PID Shire, Lot
- Size is not a super key
- But, Shire is a prime attribute
- Property is in 3NF but not in BCNF (due to FD5)
70BCNF Example (cont)
- BCNF Relations after decomposition
- PropertyPID, Size, Lot
- ShireLotSize Size, Shire
- where
- FD1 PID -gt Shire, Lot, Size
- FD2 Shire, Lot -gt PID, Size
- FD5 Size -gt Shire
- Property and ShireLotSize are in BCNF
71Limitations of Normalization
- Isolation
- Normal forms, when considered in isolation for
- individual relations, do not guarantee a good
design - Normalization must also consider additional
properties - that the relational schemas, taken
together, must - possess (loss less join and dependency
preservation) - Performance
- Ideally, relation schemas should be at least
in 3NF. - However, for performance reasons, some
designers - leave relations in lower forms
72Module 5 - Contents
- Informal Design Guidelines
- Functional Dependency
- Normalization
- Properties of Decomposition
- Dependency Preservation Property
- Loss less (non additive) Join Property
73Properties of Decompositions
- Normalization must also consider
- additional properties that the relational
- schemas, taken together, must possess
- Dependency Preservation Property
- Loss less (non additive) Join Property
74Dependency Preservation
- Informally, each FD in F must either appear
directly on one of the relation schema in the
decomposition or can be inferred from the
dependencies that appear in some - relation schema
- FDs are user-specified constraints that
capture - data semantics. The DBMS should enforce
them - efficiently
- The relational database schema must exhibit
- the dependency preservation property
for - this to occur
75Formal Definition
- A decomposition D R1,R2,,Rm of R is
dependency preserving with respect to a set of
functional dependencies F on R if the union of
projections of F on each Ri in D is equivalent to
F, that is - (p R1(F) ? p R2(F) ?p Rm(F) ) F
- where the projection of F on Ri, denoted by
pRi(F) is the set of dependencies X -gt Y in F,
such that the attributes X ? Y are contained in
Ri.
76Decomposition with Lost FD
- Property PID, Shire, Lot, Size decomposed to
- Property PID, Size, Lot
- ShireLotSize Size, Shire
- FD1 PID -gt Shire, Lot, Size
- FD2 Shire, Lot -gt PID, Size
- FD3 Size -gt Shire
- Property FP PID -gt Lot, Size
- ShireLotSize FS Size -gt Shire
- (FP ? FS) FD1,FD3
- What about FD2?
77Dep. Preserving Decomposition
- EMP EMP, STATE, CODE decomposed to
- Emp-Code(EMP,CODE)
- State-Code(STATE,CODE)
- FD1 EMP -gt STATE
- FD2 EMP -gt CODE
- FD3 CODE -gt STATE
- Emp-Code FE EMP -gt CODE
- State-Code FS CODE -gt STATE
- (FE?FS)FD2,FD3, (FE?FS)FD1,FD2,FD3
78Decomposition with Lost FD
- EMP EMP, STATE, CODE decomposed to
- Emp-Code(EMP,STATE)
- State-Code(STATE,CODE)
- FD1 EMP -gt STATE
- FD2 EMP -gt CODE
- FD3 CODE -gt STATE
- Emp-Code FE EMP -gt STATE
- State-Code FS CODE -gt STATE
- (FE?FS)FD2,FD3, (FE?FS)FD1,FD3 F2?
79Loss less (non-additive)Join
- Loss less refers to the loss of information,
- not to loss of tuples
- A decomposition which is not loss less,
- may result in spurious tuples after a join
- is applied to the decomposed relations
80Formal Definition
- A decomposition D R1,R2,,Rm of R is
- lossless with respect to a set of functional
- dependencies F on R if for every state r of
- R that satisfies F, the following holds
- p R1(r) p R2(r) p Rm(r) r
- where p R1(r) is the projection of r on Ri
- and is a join
81Decomposition with Loss
- EMP EMP, STATE, CODE decomposed to
- Emp-Code(EMP,STATE)
- State-Code(STATE,CODE)
- When joined together, it will result in spurious
tuples - Information regarding the CODE for the EMP is lost
82Join with Spurious Tuples
83Relational Database Design
- Design by Analysis
- (Top Down Strategy)
- 1. Start from attribute
- groupings as obtained
- from conceptual modeling
- and mapping activities
- 2. Analyze and decompose
- until desirable properties
- are met
- (Covered in Modules 4-5)
- Practical and widely used
- approach
- Design by Synthesis
- (Bottom Up Strategy)
- 1. Start from individual attributes and FDs
- 2. Synthesize into a 3NF (or higher) relations
- (Not Covered)
- All possible FDs must be identified in advance.
Even if a few FDs are missing, the schema can be
suboptimal - Results are dependent on the order in which the
synthesis algorithm receives the
FDs(nondeterministic) -
84From UoD to Relational Schema
85Module 5 - Review
Functional Dependency concepts and Normalization
techniques are formal theories to measure
thegoodness or quality of a relational schema
design resulting from ER-Relational mapping
- Functional Dependencies
- represent data semantics.
- Some FDs can be specified by
- database designers, others
- can be inferred. These FDs
- can be used to enforce
- constraints on data that should
- hold on all instances of a given
- schema
- Normalization techniques use FD and primary key
information to restructure (decompose) relation
schemas, such that the resulting relations
possess desirable properties and are said to be
in a given normal form. A higher NF is generally
better, however other factors such as performance
and decomposition need to be considered
86Recommended Readings
- Elmasri Navathe
- Chapter 14,
- Optional15(15.1.2, 15.1.3)
87Next ...
- Module 6
- Relational Algebra