Title: Lecture 22
1Lecture 22
- Functional Dependencies (FDs) and
- Normalization
2Schedule Change
- http//www.users.csbsju.edu/irahal/
3The Database Design Process
- Conceptual design
- Use a data model language to come up with an
accurate, high-level description of the system
requirements - Words (unstructured) ? Diagrams
- Logical design
- Map the resulting EERD into a set of relations
- Diagram ? Relations
- Physical design
- Use DDL on some DBMS to create tables
corresponding to your relations
4The Database Design Process
- Limitations of E-R Designs
- The EER model provides a set of guidelines
- Does not result in a unique database schema
- Does not provide a formal way of evaluating
alternatives - Relies largely on the common sense of the
designer - Here we try to answer
- What are the criteria for "good" base relations?
- Meaningful grouping of attributes
- When designing a relation schema, how to decide
which attributes to include? - So far, attributes are grouped to form the
relation schema by using the common sense of the
database designer
5The Database Design Process
- First discuss informal guidelines for good
relational design - Then we discuss formal concepts of functional
dependencies and normal forms - - 1NF (First Normal Form)
- - 2NF (Second Normal Form)
- - 3NF (Third Normal Form)
- - BCNF (Boyce-Codd Normal Form)
- Additional types of dependencies, further normal
forms, relational design algorithms by synthesis
are discussed in Chapter 11
6Relation Schema Informal Measures
- We have some informal measures
- Semantics of the attributes
- Reducing the redundancy values in tuples
- Disallowing the possibility of generating
spurious tuples - Reducing null values
- Not always independent of one another
7Semantics of the Relation Attributes (1)
- Any grouping of attributes to form a relation
schema must portray a certain real-world meaning - Each tuple in a relation should represent one
entity or relationship instance - Guideline 1 Design a relation schema so that it
is easy to explain its meaning - Semantics of attributes should be easy to
interpret - Attributes of different entities should not be
mixed - Only foreign keys should be used to refer to
other entities
8Relations that violate Guideline 1 by intermixing
attributes from different relations
9Relation schemas that abide by Guideline 1
10Redundant Information (2)
- One goal of schema design is to reduce redundancy
- Information is stored redundantly wasting storage
- Problems with update anomalies
- Modification anomalies
- Insertion anomalies
- Deletion anomalies
- Mixing attributes of multiple entities may cause
the above problems
11- Update anomalies
- Modification Anomalies
- Update PNAME from ProductY to
Customer-Accounting - Insert Anomalies
- Insert a new employee not assigned to known
project - Insert a new project with no working employees
- Delete Project
- Delete PNUMBER2
- Delete the sole employee of a project
-
12Modification Anomalies
- Consider the relation
- EMP_PROJ ( Emp, Proj, No_hours, Ename, Pname,
Plocation) - Modification Anomaly
- Changing the name of project number P2 from
Project Y to Customer-Accounting - May cause this update to be made for all
employees working on project P2 otherwise the DB
will become inconsistent
13Modification Anomalies
- Consider the relation
- EMP_PROJ( Emp, Proj, No_hours, Ename, Pname,
Plocation) - Insert Anomaly Cannot insert a project unless
an employee is assigned to - Inversely - Cannot insert an employee unless
he/she is assigned to a project. - Delete Anomaly When a project is deleted ?
delete all the employees who work on the project - Alternately, if an employee is the sole employee
on a project, deleting that employee would result
in deleting the corresponding project
14Guidelines to Redundant Information in Tuples and
Update Anomalies
- Guideline 2 Design a schema that does not suffer
from the insertion, deletion and update anomalies
- If there are any present, then note them so that
applications can be made to take them into
account - Might need to break the guidelines to improve
performance for certain queries - Assume that we always access employee information
only with department information - The design EMP_PROJ (Emp, Proj, No_hours,
Ename, Pname, Plocation) might be could for such
cases
15Example
ER Model
SSN Name Address
Hobby 1111 Joe 123 Main biking,
hiking
Relational Model (SSN, Hobby, Name, Address)
16Example
- Redundancy leads to anomalies
- A change in Address must be made in several
places - Suppose a person gives up all hobbies. Do we
- Set Hobby attribute to null? No, since Hobby is
part of key - Delete the entire row? No, since we lose other
information in the row - No hobby information?
- Hobby value must be supplied for any inserted row
since Hobby is part of key
17Decomposition
- Solution use two relations to store Person
information - Person1 (SSN, Name, Address)
- Hobbies (SSN, Hobby)
- People with/without hobbies can now be described
- No update anomalies
- Name and address stored once
- A hobby can be separately supplied or deleted
18Spurious Tuples (3)
- Bad designs for a relational database (or bad
decompositions) may result in erroneous results
for certain JOIN operations - Any decomposition MUST have the "lossless join"
property - No spurious tuples should be generated by doing a
natural-join of any decomposed relations - Person1 (SSN, Name, Address)
- Hobbies (SSN, Name)
- Here, loss relates to loss of information
19Spurious Tuples (3)
- Suppose we replace
- EMP_PROJ ( SSN, PNUMBER, ENAME, PNAME, PLOCATION,
HOURS) by - EMP_PROJ1(SSN, PNUMBER, PNAME, PLOCATION, HOURS)
ANDEMP_LOCS (ENAME, PLOCATION) - Guideline 3 The relations should be designed to
satisfy the lossless join condition - Avoid relations that contain matching attributes
that are not (foreign key, primary key)
combinations
20Null Values in Tuples (4)
- Guideline 4 Relations should be designed such
that their tuples will have as few NULL values as
possible - Make sure only NULLs are exceptional cases
- If many attributes do not apply to all tuples in
the relation, we end up with many nulls - Waste space
- Ambiguity in meaning
- Attribute not applicable or invalid
- Value known to exist, but unavailable
- Attribute value unknown (may or may not exist)
- Difficulty specifying JOIN operations (inner or
outer joins) - Attributes that are NULL frequently could be
placed in separate relations (with the primary
key)
21Functional Dependencies
- Functional dependencies (FDs) are used to specify
formal measures of the "goodness" of a
relational database design - FDs are constraints that are derived from the
meaning and interrelationships of the data
attributes - An FD is a constraint between two sets of
attributes X and Y - X ? Y holds if whenever two tuples have the same
value for X, they must have the same value for Y - For any two tuples t1 and t2 in any relation
instance r(R) If t1Xt2X, then t1Yt2Y
22Functional Dependencies
- X ? Y A set of attributes X functionally
determines a set of attributes Y (or Y is
functionally determined by X) if the value of X
determines a unique value for Y - X ? Y in R specifies a constraint on all
relation instances r(R) - FDs are derived from the real-world constraints
on the attributes - Property of the intention of the database
- An FD is a property of the attributes in the
schema R - The constraint must hold on every relation
instance r(R) - Can NEVER be deduced from an extension
- E.g. if in some case, all people having the same
first name are registered for the same course,
can we deduce that name ? course?
23Examples of FD Constraints
- EMP_PROJ (SSN, PNUMBER, ENAME, PNAME, PLOCATION,
HOURS) - social security number determines employee name
- SSN ? ENAME
- project number determines project name and
location - PNUMBER ? PNAME, PLOCATION
- employee SSN and project number determines the
hours per week that the employee works on the
project - SSN, PNUMBER ? HOURS
24More on FD Constraints
- Definition of a relation KEY (If K is a key of R)
- K functionally determines all attributes in R
- If X?Y is true, does that make Y?X true?
- Some FDs are always true regardless of the
relation in which they occur - State, Driver_License_Number ? SSN
- Zip ? City, State
25Inference Rules for FDs
- Given a set of FDs F, we can infer additional FDs
that hold whenever the FDs in F hold - Armstrong's inference rules
- IR1. (Reflexive) If Y ? X, then X ? Y
- (Generates trivial FDs)
- E.g. SSN, ENAME ? ENAME
- IR2. (Augmentation) If X ? Y, then XZ ? YZ (Note
that XZ stands for X U Z) - E.g. SSN ? ENAME, then SSN, PNUMBER ? ENAME,
PNUMBER - IR3. (Transitive) If X ? Y and Y ? Z, then X ? Z
- E.g. SSN ? DOB and DOB ? horoscope sign then SSN
? horoscope sign
26Inference Rules for FDs
- IR1, IR2, IR3 form a sound and complete set of
inference rules - Sound ? Any rule inferred using IR1, IR2 or IR3 a
valid FD - Complete ? All possible FDs can be generated
using them - Some additional inference rules that are useful
- IR4. (Decomposition) If X ? YZ, then X ? Y and X
? Z - SSN ? ENAME, DOB then SSN ? DOB SSN ? ENAME
- IR5. (Union) If X ? Y and X ? Z, then X ? YZ
- SSN ? DOB SSN ? ENAME then SSN ? ENAME, DOB
- IR6. (Pseudo-transitivity) If X ? Y and WY ? Z,
then WX ? Z - Can be deduced from IR1, IR2, and IR3
(completeness property) - OfficeLocation? Department Department, Ename ?
Salary-level then - OfficeLocation,Ename ? Salary-level
27Proofs
- IR1. (Reflexive) If Y ? X, then X ? Y
- For any two tuples t1 and t2 with t1X t2X
then t1Y t2Y because Y ? X - IR2. (Augmentation) If X ? Y, then XZ ? YZ
- Proof by contradiction
- If for two tuples t1 and t2 we have
- (1) t1X t2X
- (2) t1Y t2Y
- (3) t1XZ t2XZ
- (4) t1YZ ? t2YZ
- Cant be true since from (1) and (3) we have (5)
t1Z t2Z and from (2) and (5) we have t1YZ
t2YZ which contradicts (4)
28Proofs
- IR3. (Transitive) If X ? Y and Y ? Z, then X ? Z
- For any two tuples t1 and t2 with t1X t2X
then t1Y t2Y which implies that t1Z
t2Z hence X ? Z holds - IR4. (Decomposition) If X ? YZ, then X ? Y and X
? Z - X ? YZ
- YZ ? Y (Using IR1)
- X ? Y (Using IR3)
- Similarly for X ? Z
29Proofs
- IR5. (Union) If X ? Y and X ? Z, then X ? YZ
- X ? Y
- X ? Z
- X ? XY (Using IR1)
- XY ? YZ (Using IR2)
- X ? YZ (Using IR3)
- IR6. (Psuedotransitivity) If X ? Y and WY ? Z,
then WX ? Z - X ? Y
- WY ? Z
- WX ? WY (Using IR2)
- WX ? Z (Using IR3)