CS 728 Advanced Database Systems Chapter 14 - PowerPoint PPT Presentation

About This Presentation
Title:

CS 728 Advanced Database Systems Chapter 14

Description:

CS 728 Advanced Database Systems Chapter 14 Database Design Theory: Introduction to Normalization Using Functional & Multivalued Dependencies – PowerPoint PPT presentation

Number of Views:117
Avg rating:3.0/5.0
Slides: 72
Provided by: eduj71
Category:

less

Transcript and Presenter's Notes

Title: CS 728 Advanced Database Systems Chapter 14


1
CS 728 Advanced Database Systems Chapter 14
  • Database Design Theory
  • Introduction to Normalization Using Functional
    Multivalued Dependencies

2
Design Guidelines for Relation Schemas
  • 1. Attributes should have clear meanings
    (semantics) and related attributes are grouped
    into single entities.
  • 2. Avoid update anomalies by reducing redundant
    data.
  • 3. Reduce the NULL values.
  • 4. When relations are joined no spurious tuples
    will be generated.

3
Semantics of Relation Attributes
  • Guideline 1
  • Design a relation schema so that it is easy to
    explain its meaning. Do not combine attributes
    from multiple entity types and relationship types
    into a single relation.
  • Design I
  • STUDENT(STNO, Name, Address, ANO)
  • ADVISOR(ANO, Name, Address, Dept)
  • Design II
  • Student-Advisor(STNO, Name, Address, ANO,
    A-name, A-address, Dept)
  • Design I is better when compared with Design II.

4
Update Anomalies
  • Insertion anomalies
  • In design II, if we add a new student to
    Student-Advisor, we have to add data related to
    that students advisor. This information should
    be consistent with all other occurrences of that
    advisor. Note that in design II all data related
    to a particular advisor is repeated a number of
    times which equals to the number of students
    supervised by that advisor. In design I, only
    advisor number is repeated.
  • It is difficult to add a new advisor who have no
    students yet to the database. This is because we
    have to assign nulls to STNO and STNO is the
    primary key for Student-Advisor.

5
Update Anomalies
  • Deletion Anomalies
  • If we delete the last student associated with a
    particular advisor, then that advisor cannot
    exist in the database in design II any more.
  • Modification Anomalies
  • If an advisor changes his/her address, say, then
    we have to modify his/her address in all tuples.
  • If we miss some tuples, then we will have several
    addresses for the same advisor.

6
Update Anomalies
  • Guideline 2
  • Design the base relation schemas so that no
    insertion, deletion, or modification anomalies
    occur.
  • If any anomalies are present, note them clearly
    so that the programs that update the database
    will operate correctly.

7
NULL values in Tuples
  • Guideline 3
  • Avoid placing attributes in a base relation whose
    values may be null.
  • If nulls are unavoidable, make sure that they
    apply in exceptional cases only and do not apply
    to majority of tuples in the relation.

8
NULL values in Tuples
  • Problems with Nulls
  • Waste storage space.
  • Have multiple interpretations (not-applicable,
    not-known,).
  • Create ambiguities with aggregate functions
    (count, avg, )
  • Create ambiguities with joins.

9
NULL values in Tuples
  • Example
  • If only 10 of employees have phones, then
    Employee(SSN, Name,., Office-phone) is a poor
    design, because 90 of the last column values
    will be nulls
  • But
  • Employee(SSN, Name, )
  • Phone(SSN, Office-phone)
  • is a better design.

10
Spurious Tuples
  • Guideline 4
  • Design relation schemas so that they can be
    joined with equality conditions on attributes
    that are either primary keys or foreign keys in a
    way which guarantees that no spurious tuples are
    generated.

11
Spurious Tuples
  • Emp-Proj(SSN, Pno, Hours, Ename, Pname, Plocation)

12
Spurious Tuples
  • Emp-locs(Ename, Plocation)
  • Emp-Proj1(SSN, Pno, hours, Pname, Plocation)

13
Spurious Tuples
  • Emp-Proj(SSN, Pno, Hours, Pname, Plocation)

14
Spurious Tuples
  • Result ? Emp-Locs ? Emp-proj1
  • Then Result(Ename, Plocation, SSN, Pno, Hours,
    Pname)

15
Spurious Tuples
  • If we combine Emp-Locs and Emp_Proj1 on Plocation
    attribute, we will get spurious tuples as you
    have noticed in the previous slide. This is
    because Plocation is the attribute which combines
    the two relations and it is neither a primary key
    nor a foreign key in either Emp-Locs or
    Emp-Proj1

16
Bad Tables (1)
  • Alternative designs for a product database
  • (1)
  • Products(Prod_no, Prod_name, Price, Manu_id)
  • Manufacturers (Manu_id, Manu_name, Address)
  • (2)
  • Prod_Manu (Prod_no, Prod_name, Price, Manu_id,
    Manu_name, Address)

17
Bad Tables (2)
  • Problems with the second design
  • Redundancy --- the name and address of each
    manufacturer will be repeated once for each
    product made by the manufacturer.
  • more storage space needed
  • potential inconsistency (update anomalies)

18
Bad Tables (3)
  • Insertion anomalies --- a manufacturer's name and
    address cannot be recorded in the database if it
    does not make at least one product (because
    Prod_no is part of the primary key).
  • Deletion anomalies --- If we delete all products
    made by a manufacturer, we will unintentionally
    lose track of the manufacturer's name and
    address.
  • The first design does not have similar problems.
  • The challenge is to identify bad relations and
    convert them into good relations.

19
Functional Dependencies
  • R Relation schema.
  • r(R) Relation instance
  • A1, A2, , An Attributes which belong to
    universal relation.
  • X, Y Sets of attributes.

20
Functional Dependencies
  • A functional dependency (X ? Y), between X and Y
    specifies a constraint on the possible tuples
    that can form a relation instance r of R.
  • The constraint states that for any two tuples t1
    and t2 in r such that t1X t2X, then t1Y
    t2Y
  • This means that the Y component of a tuple in r
    depends on (or determined by) the values of the X
    component of that tuple in r.
  • Or the values of the X component of a tuple in r
    uniquely or functionally determines the values of
    Y component.

21
Functional Dependencies
  • X is said to functionally determine Y (or Y is
    functionally dependent on X) if for every legal
    relation instance r(R), for any two tuples t1 and
    t2 in r(R), we have
  • t1X t2X , then t1Y t2Y
  • X ? R denotes that X is a subset of the
    attributes of R
  • X ?Y denotes that X functionally determines Y.

22
Functional Dependencies
  • Note that
  • if X is a candidate key then X ? Y is correct
    for any subset of attributes of R.
  • if X ? Y, this does not say whether Y ? X is
    correct or not.
  • Student(SSN, STNO, Name, Major)
  • SSN ? Name
  • t1(91910, 980090012, Ahmed, ..)
  • t2(91910, 980090012, Ahmed, ..)
  • STNO ? Major
  • t10((980090012,, Math)
  • t12(980090012, , Math)
  • SSN ? SSN, Name, STNO, Major

23
Functional Dependencies
  • Manager(SSN, Name, . Dept)
  • SSN ? Name, Dept (correct)
  • SSN uniquely determines the Name and Dept
  • Dept ? SSN (correct)
  • Dept uniquely determines the SSN
  • Name ? SSN (incorrect)

24
Functional Dependencies
  • FD is a property of the meaning or semantics of
    attributes
  • FD is specified as a constraint on R, all
    extensions of R (i.e. r(R)) should specify that
    constraint (legal extensions) otherwise r(R) are
    called illegal extensions
  • Note that we cannot infer FDs from r(R)

25
Diagrammatic Representation of FDs
  • SSN ? STNO, NAME, MAJOR
  • STNO ? SSN, NAME, MAJOR

Student(SSN, STNO, Name, Major)
FD 1
FD 2
26
Inference Rules for Functional Dependencies
  • Let
  • F set of functional dependencies defined on R
  • F (Closure of F) is the set of all functional
    dependencies that can be defined on R
  • The closure of F is the set of all FDs that are
    logically implied by F
  • The closure of F is denoted by F
  • F X ? Y F X ? Y
  • A BIG F may be derived from a small F
  • For R(A, B, C) and F A ? B, B ? C
  • F A ? B, B ? C, A ? C, A ? A, B ? B,C ? C,
    AB ? AB, AB ? A, AB ? B, ...

27
Inference Rules for Functional Dependencies
  • Emp-Dept(SSN, Ename, Bdate, Address, Dnumber,
    Dname, MGR-SSN)
  • F
  • SSN ? Ename,Bdate,Address,Dnumber, Dnumber ?
    Dname, MGR-SSN
  • We can infer the following FDs
  • SSN ? SSN (Reflexive)
  • SSN ? Ename (Decomposition)
  • SSN ?Dname, MGR-SSN (Transitive)
  • We usually write F X ? Y to denote that the FD
    X ?Y is inferred from F
  • X, Y are subsets of attributes

28
Functional Dependency
  • Several equivalent definitions
  • X ? Y in R iff for any t1, t2 in r(R), if t1
    and t2 have the same X-value, then t1 and t2 also
    have the same Y-value
  • X ? Y in R iff there exist no t1, t2 in r(R)
    such that t1 and t2 have the same X-value but
    different Y-values
  • X ? Y in R iff for each X-value, there
    corresponds to a unique Y-value.

29
Functional Dependency
  • Question if all X-values are different in all
    possible r(R), does X ? Y in R?
  • Theorem 1
  • If X is a superkey of R and Y is any subset of R,
    then X ? Y in R
  • Note that X ? Y in R is a property that must be
    true for all possible legal r(R), not just for
    the present r(R)

30
Functional Dependency
  • Example which is true?
  • A B C D A ? B
  • a1 b1 c1 d1 A ? C
  • a1 b2 c1 d2 C ? A
  • a2 b2 c2 d2 A ? D
  • a2 b3 c2 d3 B ? D
  • a3 b3 c2 d4 AB ? D
  • AB ? C
  • AB ? CD

31
Identify Functional Dependency
  • FD created by assertions.
  • Employees(SSN, Name, Years_of_emp, Salary, Bonus)
  • Assertion
  • Employees hired the same year have the same
    salary
  • This assertion implies
  • Years_of_emp ? Salary

32
Inference Rules
  • IR1 Reflexive Rule ( ?????)
  • Y ? X, then X ?Y
  • e.g.
  • SSN ? SSN
  • P, S, Qty ? Qty
  • IR2 Augmentation Rule ?????))
  • X ? Y XZ ? YZ
  • e.g. F SSN ? Address
  • F SSN, Name ? Address, Name

33
Inference Rules
  • IR3 Transitive rule (??????)
  • X ? Y, Y ? Z X ? Z
  • e.g.
  • F SSN ? Dnumber, Dnumber ? Dname
  • F SSN ? Dname
  • IR4 Decomposition Rule (????????)
  • X ? Y, Z X ? Y and X ? Z
  • e.g.
  • F SSN ? Ename,BDate,Address, Dnumber
  • F SSN ? Ename
  • SSN ? Bdate
  • SSN ? Address
  • SSN ? Dnumber

34
Inference Rules
  • IR5 Union (Additive) Rule
  • X ? Y, X ? Z X ? Y, Z
  • e.g.
  • F SSN ? Ename, SSN ? Bdate
  • F SSN ? Ename, Bdate
  • IR6 Pseudo-transitive Rule
  • X ? Y, WY ? Z WX ? Z
  • e.g.
  • F SSN ? STNO, Major, STNO ? Name
  • F Major, SSN ? Name

35
Closure of Attributes
  • How to determine if F X ? Y is true?
  • Method 1
  • Compute F
  • If X ?Y ? F, then F X ?Y
  • Problem
  • Computing F could be very expensive!

36
Closure of Attributes
  • Method 2
  • Compute X the closure of X under F
  • X denotes the set of attributes that are
    functionally determined by X under F.
  • X Y X ? Y ? F
  • Theorem
  • X ?Y ? F if and only if Y ? X

37
Algorithm for Computing X
  • Input
  • a set of FDs F, a set of attributes X in R
  • Output
  • X
  • Begin
  • X X
  • Repeat
  • oldX X
  • for each FD Y ? Z in F do
  • if Y ? X then X X ? Z
  • until (X oldX )
  • end

38
Algorithm for Computing X
  • Example
  • R(A, B, C, G, H, I) ABCGHI
  • X AG
  • F A ? B, CG ? HI, B ? H, A ? C
  • Compute X (AG)
  • Initialization
  • X AG

39
Algorithm for Computing X
  • 1st iteration
  • consider A ? B
  • since A is a subset of X, X ABG
  • consider CG ? HI
  • since CG is not a subset of X, X ABG
  • consider B ? H
  • since B is a subset of X, X ABGH
  • consider A ?C
  • since A is a subset of X, X ABCGH
  • X is changed from AG to ABCGH

40
Algorithm for Computing X
  • 2nd iteration
  • consider A ? B
  • since A is a subset of X, X ABCGH
  • consider CG ? HI
  • since CG is a subset of X, X ABCGHI
  • consider B ? H
  • since B is a subset of X, X ABCGHI
  • consider A ? C
  • since A is a subset of X, X ABCGHI
  • X is changed from ABCGH to ABCGHI

41
Algorithm for Computing X
  • 3rd iteration
  • consider each FD in F again, but there is no
    change to X, exit
  • Result
  • (AG) ABCGHI.
  • The performance of the algorithm is sensitive to
    the order of FDs in F

42
Algorithm for Computing X
  • Theorem
  • Given R(A1, ..., An) and a set of FDs F in R, K ?
    R is a
  • superkey if K A1, ..., An
  • candidate key if K is a superkey and for any
    proper subset X of K, X ? A1, ..., An.

43
Algorithm for Computing X
  • Continue the above example
  • AG is a superkey of R since
  • (AG) ABCGHI.
  • Since A ABCH, G G, neither A nor G is a
    superkey.
  • Hence, AG is a candidate key

44
Normal Forms Based on Primary Keys
  • Normalization of data
  • is a process during which unsatisfactory relation
    schemas are decomposed by breaking up their
    attributes into smaller relation schemas that
    posses desirable properties
  • We normalize data for several reasons one of them
    is to avoid update anomalies
  • We have
  • First Normal Form (1NF)
  • Second Normal Form (2NF)
  • Third Normal Form (3NF)
  • Boyce-Codd Normal Form (BCNF) (a stronger
    definition of 3NF)
  • All the above normal forms are based on
    functional dependencies.

45
Normal Forms Based on Primary Keys
  • Forth Normal Form (4NF) Based on multivalued
    dependencies.
  • Fifth Normal Form (5NF) based on join
    dependencies.
  • Aside
  • Student-Adv(STNO, StName, Major,, Ano, Aname, )
  • Has several problems
  • Student(STNO, StName, Major, , Ano)
  • Advisor(Ano, Aname, .)
  • Pay the price of expensive joins

46
Basic Definitions
  • R A1, , An)
  • Superkey is set of attributes S ? R and t1S
    ? t2S ?i 1, ..., n. S may contain redundant
    attributes.
  • Key (K) is a superkey with no redundant
    attributes, i.e. removal of any attribute from K
    will no longer make it a superkey.
  • Candidate Key if a relation has more than one
    key, each is called a candidate key. One of these
    keys is arbitrarily chosen as a primary key.
  • Prime Attribute is an attribute which is a
    member of any key (primary or candidate) other
    attributes are called nonprime attributes.

47
Basic Definitions
  • Student(SSN, STNO, Name, Address, Salary)
  • Superkeys
  • SSN,Name/SSN,STNO,Name,Address,Salary
  • Candidate keys
  • SSN, STNO
  • Primary Key
  • SSN
  • Prime Attribute
  • SSN and STNO
  • Nonprime Attributes
  • Name, Address, Salary

48
1NF (First Normal Form)
  • A relation schema R is in 1NF if every attribute
    of R takes only single and atomic values.
  • Domains of attributes must include only atomic
    values and that the value of any attribute in a
    tuple must be a single value from the domain of
    that attribute.
  • In other words, multivalued and composite
    attributes are disallowed.

49
1NF (First Normal Form)
  • Student(STNO, StName, Course(CNO, Ctitle)
  • The set braces identify the attribute Course
    as multivalued
  • The set braces () identify the attribute Course
    as a composite attribute
  • Because of the last attribute (Course), the
    Student relation schema is not in 1NF.

50
1NF (First Normal Form)
  • Student(STNO, StName, Course(CNO, Ctitle)
  • To normalize it to 1NF
  • Student(STNO, StName, CNO, Ctitle)
  • this is not a good representation because it has
    many disadvantages. Replication of Course
    information and student information. Combining
    attributes which belong to two separate entities
    (namely student and course) into a single
    relation schema.

51
1NF (First Normal Form)
  • Student(STNO, StName, Course(CNO, Ctitle)
  • Student(STNO, StName, CNO)
  • Course(CNO, Ctitle)
  • this design suffers from drawback that student
    information has to be repeated in the student
    relations several times (in fact a number of
    times that is equal to the number of courses
    taken by that student).

52
1NF (First Normal Form)
  • Student(STNO, StName, Course(CNO, Ctitle)
  • Student(STNO, StName)
  • Course(CNO, Ctitle)
  • Study(STNO, CNO)
  • This is the best representation because we have
    reduced the duplication.

53
Second Normal Form (2NF)
  • Prime attribute --- an attribute in any candidate
    key.
  • Y is fully functionally dependent on X if X ? Y
    and no proper subset of X functionally determines
    Y
  • FD X ? Y is a fully functional dependency if
    removal of any attribute A from X means that the
    dependency does not hold any more. In other words
    (X - A) does not determine Y
  • A FD X ? Y is a partial FD if exist some
    attribute A which belongs to X and (X - A) ? Y
    still holds

54
Second Normal Form (2NF)
  • General Definition of 2NF
  • A relation schema R is in 2NF if every nonprime
    attribute A in R is not partially dependent on
    any key of R
  • (i.e. if every nonprime attribute is fully
    functionally dependent on every key of R)

55
Second Normal Form (2NF)
  • Emp-Proj(SSN, Pnumber, Hours, Ename, Pname,
    Plocation)
  • FD1 SSN, Pnumber ? Hours (FD)
  • FD2 SSN ? Ename (PD)
  • FD3 Pnumber ? Pname, Plocation (PD)
  • Because of FD2 and FD3 Emp-Proj is not in 2NF
  • 2NF Normalization
  • Emp(SSN, Ename)
  • Proj(Pnumber, Pname, Plocation)
  • Work(SSN, Pnumber, Hours)

56
Second Normal Form (2NF)
  • Consider
  • Bank-Loans (Bank_name, Assets, Headquarter,
    Loan_no, Customer_name, Amount),
  • FD1 Bank_name ? Assets, Headquarter
  • FD2 Bank_name, Loan_no ? Customer_name,
    Amount
  • Because if FD1, Bank-Loans is not in 2NF.
  • 2NF Normalization
  • Banks(Bank_name, Assets, Headquarter)
  • Loans(Bank_name, Loan_no, Customer_name, Amount)

57
Second Normal Form (2NF)
  • 2NF is not good enough
  • A relation schema in 2NF can still have serious
    redundancy problem as well as insertion and
    deletion anomalies.
  • Consider Parts(Part_no, Name, Location,
    Unit_price, Manu_id, Manu_name, Manu_Address)
  • It is obvious that Parts is in 2NF
  • Redundancy and various anomalies are introduced
    by
  • Manu_id ? Manu_name, Manu_Address

58
Second Normal Form (2NF)
  • Consider
  • EMP_DEPT(SSN, EName, BDate, Address, DNo, DName,
    DMGRSSN)
  • It is obvious that EMP_DEPT is in 2NF
  • Redundancy and various anomalies are introduced
    by
  • DNo ? DName, DMGRSNN

59
Third Normal Form (3NF)
  • Transitive Dependency
  • a FD X ? Y in R is transitive if exist some set
    of attributes Z that is not a subset of any key
    of R and both X ? Z and Z ? Y hold.
  • A relation schema R is in 3NF if it is in 2NF and
    no nonprime attribute A of R is transitively
    dependent on a key of R

60
Third Normal Form (3NF)
  • A relation schema R is in 3NF if for every FD X ?
    A, where A is a single attribute, at least one of
    the following is true
  • (a) A ? X (Trivial dependency - Reflexive)
  • (b) A is a prime
  • (c) X is a superkey

61
Third Normal Form (3NF)
  • R is not in 3NF if a non-prime non-trivially
    depends on a non-superkey.
  • If R in 3NF, it should not have a nonkey
    attribute functionally determined by another
    nonkey attribute (or by a set of nonkey
    attributes)

62
Third Normal Form (3NF)
  • General Definition of 3NF
  • A relation schema R is in 3NF if, whenever a
    nontrivial functional dependency X ? A holds,
    either
  • (a) X is a superkey of R, or
  • (b) A is a prime attribute of R

63
Third Normal Form (3NF)
  • Emp-Dept(SSN, Ename, Bdate, Address, Dnumber,
    Dname, DMGR-SSN)
  • FD1 SSN ? Ename, Bdate, Address, Dnumber,
    Dname, DMGR-SSN
  • FD2 Dnumber ? Dname, DMGR-SSN
  • Emp-Dept is in 1NF, 2NF, but because of
  • SSN ? Dnumber and
  • Dnumber ? Dname and
  • Dname is a nonprime attribute, and
  • Dnumber is not a superkey,
  • Emp-Dept is not in 3NF.
  • To transform Emp-Dept into 3NF
  • Emp(Enam, SSN, Bdate, Address, Dnumber)
  • Dept(Dnumber, Dname, DMG-SSN)

64
Third Normal Form (3NF)
  • Employees (SSN, Name, Age, Salary, Dept_name,
    Dept_manager_SSN).
  • Employees is in 2NF
  • since SSN is the only candidate key and every
    attribute is fully dependent on it.
  • Employees is not in 3NF because
  • Dept_name ? Dept_manager_SSN

65
Third Normal Form (3NF)
  • LOTS(Property-ID, County-Name, Lot, Area,
    Price, Tax-Rate)
  • FD1 Property-ID ? County-Name, Lot, Area,
    Price, Tax-Rate
  • FD2 County-Name, Lot ? Property-ID, Area,
    Price, Tax-Rate
  • FD3 County-Name ? Tax-Rate
  • FD4 Area ? Price
  • 2 candidate keys Property-ID,County-Name,
    Lot
  • LOTS is not in 2NF, because of County-Name ?
    Tax-Rate
  • Tax-Rate is partially dependent on the candidate
    key County-Name, Lot.
  • 2NF
  • LOTS1(Property-ID, County-Name, LOT, Area,
    Price)
  • LOTS2(County-Name, Tax-Rate)

66
Third Normal Form (3NF)
  • The relation LOTS1 is not in 3NF, because of Area
    ? Price
  • Area is not a superkey and Price is not prime
    attribute
  • 3NF
  • LOTS1A(Property-ID, County-Name, Lot, Area)
  • LOTS1B(Area, Price)
  • LOT2(County-Name, Tax-Rate)
  • The above relation schemas are in 3NF

67
Boyce-Codd Normal Form (BCNF)
  • Assume that we have thousands of lots, but
    two-counties
  • Marion county and Liberty county.
  • Lot areas in Marion county are
  • .5, .6, .7, .8, .9 and 1 acres
  • Lot areas in Liberty county are
  • 1.1,1.2, , 1.9, 2.0 acres
  • In this case we have AREA ? County-Name
  • LOTS1A(Property-ID, County-Name, Lot, Area)
  • FD5 Area ? County-Name
  • Still in 3NF , since County-Name is a prime
    attribute

68
Boyce-Codd Normal Form (BCNF)
  • A relation schema R is in BCNF if whenever a FD X
    ? A holds in R, then X is a superkey of R.
  • R is in BCNF if for every non-trivial FD, the
    left side is a superkey.
  • LOTS1A-X(Property-ID, Area, Lot)
  • LOTS1A-Y(Area, County-Name)
  • To describe a relation schema R as good it
    should be at least in 3NF (general) or BCNF
  • If R is in BCNF, then R is also in 3NF
  • However, R in 3NF R in BCNF

69
Normal Forms Summary
  • Relationships between different NFs.

70
Normal Forms Summary
  • 1NF
  • Attributes should be single-valued and have
    atomic domain
  • Normalize into 1NF
  • Form a new relations for each non-atomic
    attribute
  • 2NF
  • 2NF removes some insertion anomalies and deletion
    anomalies.
  • 2NF removes some redundancies, namely,
    redundancies caused by partial dependencies on
    key.

71
Normal Forms Summary
  • 3NF
  • 3NF removes all insertion anomalies and deletion
    anomalies.
  • 3NF also removes some redundancies caused by
    transitive dependencies.
  • BCNF
  • achieves all achieved by 3NF.
  • BCNF removes all redundancies caused by FDs.
Write a Comment
User Comments (0)
About PowerShow.com