Title: Normalization, Robertss Rules and Introduction to Data Modeling
1Normalization, Robertss Rules and Introduction
to Data Modeling
- CSCI 242
- February 4, 2009
2Agenda
- Robertss Rules
- Normalization
- Robertss Rules and Normalization
3Why Are We Talking About This?
- To design a database, we choose a set of entities
that models a problem - We will store data in tables corresponding to our
entity choices - The names of the entity types become embedded in
our programs - Changing later on is expensive, so we want a
stable model of the problem
4Midterm Question
- The first question on the midterm will deal with
normal forms. It will deal with the relationship
between normal forms and Robertss Rules. - This one question will count more than any other
question on the exam. - The homework assignment on normalization and
Robertss Rules looks a lot like Question 1 on
the midterm.
5Syntax and Semantics
- Syntax deals with the structure and form of a
statement or language - Semantics deals with the meaning that is conveyed
by a statement or language
6Question
- Is normalization a syntactic or a semantic
construct? - That is, does it deal with the form of
information, or is it involved with meaning?
7Intentional vs. Extensional Data
- Extensional datathe data that is actually
present - Intentional dataall the data that is allowed to
be present - Question does normalization deal with
intentional or extensional data?
8Entity and Entity Type
- An entity is something that we record information
about in the database - An entity type is a set of similar things that we
store information about - An entity instance is one example of some entity
type. - Usually we dont say entity instance and entity
type when context makes the meaning clear we
just say entity.
9Summary of Terminology
10Fact
- A value of an attribute in a row conveys one fact
about an entity instance - An attribute is a fact stating that This entity
instance has the value ltvaluegt - Consider emp(empno,ename,job,deptno)
- Each value of ename in a row states that This
persons name is ltvaluegt. - Each row of this table can be viewed as a
collection of four facts
11Robertss Rules
- Robertss Rules are a set of plain English rules
that, if followed during database design, result
in a highly normalized database design. - We will explore the relationship of Robertss
Rules to normalization, and vice versa.
12Normalization
- A set of formal rules that are intended to be a
definition of a properly-structured database - A normal form generally deals with and removes
certain anomalous behavior from the use of a
relation that is normalized.
13Examples of Anomalies
- Insert anomalies
- If we want to enter information about a new
entity in the database we need to enter
information about some other entity first - Delete anomalies
- In order to delete information about an entity we
must delete information about another entity - Update anomalies
- In order to change the value of a single fact we
may have to change many stored values in the
database
14Robertss Rules
15Rule 1
- Each relation describes exactly one entity type.
- A relation models a distinct entity type, and
each tuple of the relation models an instance of
that entity. - The relation models an entity by storing its
attributes. The attributes that identify it are
called candidate keys the other attributes are
non-key.
16Example of Rule 1
- These are OK
- Emp(empno,ename,job,sal,deptno)
- Dept(deptno,dname,loc)
- Not OK
- Emp(empno,ename,job,sal,deptno,loc)
- Dept(deptno,dname,loc,loc_sales_tax)
17Rule 2
- Each fact is represented only once in the
database. - A tuple (aka row) is a collection of facts about
an entity instance, one fact per column. A row
with 5 columns has five facts about an instance
in each row.
18Facts
- Emp(empno,ename,job,sal,deptno)
- Each row consists of five facts about an employee
- Each of these facts must be represented just once
in the database
19Duplicate Representation?
20Rule 3
- Each tuple can reside in only one relation.
- A relation is a model of an entity type, not a
station on a factory assembly line. - Instead of moving a tuple from relation to
relation, add an attribute that characterizes
status.
21Rule 3 Example
- Modeling cars on an assembly line that has three
stages - Chassisbuild(VIN,date_built,date_ord,dealer,make,m
odel) - IntFinish(VIN,date_built,date_ord,dealer,make,mode
l) - FinalAssmbl(VIN,date_built,date_ord,dealer,make,mo
del) - Instead should be one table
- Vehicle(VIN,date_built,date_ord,dealer,make,model,
prod_stage) - Note the use of one additional attribute to
indicate the stage of production
22Rule 4
- Database design must be insensitive to
cardinality of the problem, when the cardinality
is greater than one. - Its easyand very riskyto presume that the
cardinality of various entity types and subtypes
will remain the same.
23Rule 4 Examples
- Company car
- College degree
- Telephone number
- Home address
- Business address
- Email address
- How many of these does your entity have? How
many may they have some day?
24Rule 4 Restated
- Grow down, not across
- That is, when a value can repeat, make it into
its own entity type, with a separate row for each
repetition - So education becomes its table, one row for
each value
25Example of Robertss Rules
- EMP ( EMPNO, ENAME, DEPTNO, DNAME)
- DEPT (DEPTNO, DNAME, DLOC)
- This relation violates the following Robertss
rules - Rule 1. The EMP table describes employee as well
as department - Rule 2. In the EMP table, if we have the same
DEPTNO in multiple rows, DNAME will be
represented multiple times.
26Another Example
- EMP (ENAME, DEGREE1, DEGREE2, DEGREE3)
- This schema violates the following Robertss rule
Rule 4. The design assumes every employee has a
maximum of 3 degrees. If an employee has 4
degrees, then the database needs to be
restructured by adding DEGREE4 in the EMP table. - Rule 4 deals with an aspect of data independence.
It can be stated informally as - "Grow down, not across"
27Normalization
28Basic Concepts
- Entity Type a class of an object that we record
information about. Aka relation, table - Attribute a characteristic of an entity. Aka
column. - Entity Instance a single occurrence of an entity
type. Aka tuple, row
29Candidate Keys
- Candidate key a set of attributes Ai, Aj,Ak
that is a candidate key has two (time-invariant)
properties - Uniqueness no two tuples have the same value
for the candidate key - 2. Minimality if any Ai is discarded from the
candidate key, then the uniqueness property is
lost. It is the smallest set of attributes that
identifies a row. - How many candidate keys can a table have?
30Primary Key
- One of the candidate keys is selected to be the
primary identifier of rows. It is called the
primary key. - The selection is usually made based on the
usefulness of the attribute that is the primary
key.
31Functional Dependence
- R.X?R.Y or R.X FD R.Y
- Given a relation R, attribute Y of R is
functionally dependent on attribute X of R iff
each X-value in R has associated with it
precisely one Y-value in R (at any one time) - In other words, for each value of X in table R,
there is one and only one value of Y. A given X
value must always occur with the same Y value.
32Functional Dependence Examples
Does X?Y? Does Y?X?
33Anomalies
- Update anomalies If one copy of repeated data
is updated, inconsistency is created unless all
copies are similarly updated. - Insert anomalies It may not be possible to
store some information unless some other
information is stored as well. - Delete anomalies It may not be possible to
delete some information without losing some other
information as well.
34Full Functional Dependence
- Y is fully functionally dependent on X iff X?Y
and no subset of X determines Y. - That is, X is the smallest collection of columns
that determines Y.
35Normalization
36First Normal Form
37First Normal Form
- A relation is said to be in first normal form iff
every attribute of every tuple is atomic.
381NF Example
Question Is this relation in 1NF?
Question Does this relation show any anomalies?
39Second Normal Form
40Second Normal Form
- A relation is said to be in second normal form
iff it is in first normal form and every
attribute is fully functionally dependent on the
primary key.
412NF Example
Does this relation follow Robertss Rules?
Do you see any anomalies?
SID
City
Status
SNAME
422NF and RR
- What is the relationship between 2NF and
Robertss Rules? - If Rule 1 is met, is the relation in 2NF?
- What about Rule 2?
43Third Normal Form
44Third Normal Form
- A relation is said to be in third normal form iff
it is in second normal form and there are no
transitive dependencies.
45How Do We Convert To 3NF?
SID
City
SNAME
Status
City
463NF and RR
- If a relation is in 3NF, what about rules 1 and 2?
47Fourth Normal Form
48Multi-Valued Dependency
- R.X is said to multi-value determine R.Y if there
is a set of values for Y that must appear in any
relation where R.X appears. - For example, if a course has two textbooks, then
there will be an MVD between the course number
and the names of the books.
49Fourth Normal Form
- A relation is said to be in fourth normal form
iff it is in third normal form and it does not
have more than one multi-valued dependency.
50Example of 4NF
SPORT-INSTRUMENT
Instrument
MVD
SID
MVD
Sport
51Converting to 4NF
SPORT
INSTRUMENT
Instrument
MVD
SID
SID
Sport
MVD
52Boyce-Codd Normal Form
53Boyce-Codd Normal Form
- A relation is said to be in Boyce-Codd normal
form iff every determinant is a key. - BCNF deals with problems that can be caused by
overlapping candidate keys.
54Example of BCNF
How does this relation comply with Rule 1 and
Rule 2?
Are there any anomalies?
S
SNAME
QTY
P
55Converting to BCNF
S
SNAME
S
QTY
P
56Fifth Normal Form
57Join Dependency
- A relation R satisfies join dependency (R1, R2,
..., Rn) if and only if R is equal to the join of
R1, R2, ..., Rn where Ri are subsets of the set
of attributes of R.
585NF
- A relation R is in 5NF (or project-join normal
form, PJNF) if for all join dependencies at least
one of the following holds - (a) (R1, R2, ..., Rn) is a trivial
join-dependency (that is, one of Ri is R) - (b) Every Ri is a candidate key for R.
59Example of 5NF
- Dept Subject Student
- Comp. Sc. CP1000 John Smith
- Comp. Sc. CP2000 John Smith
- Comp. Sc. CP3000 Arun Kumar
- Mathematics MA1000 Reena Rani
- Chemistry CH2000 Raymond Chew
- Physics PH1000 Albert Garcia
- This relation states that Comp. Sc. offers
subjects CP1000, CP2000 and CP3000 which are
taken by a variety of students. No student takes
all the subjects and no subject has all students
enrolled in it. Therefore, all three attributes
are needed to represent the information.
60Is The Relation in 5NF?
- The relation cannot be decomposed into two
relations - (dept, subject), and (dept, student)
- without losing some important information.
- However, the relation can be decomposed into the
following three relations - (dept, subject), (dept, student) and (subject,
student) - and it can be shown that this decomposition is
lossless. So the relation is not in 5NF.
61Data Modeling
- When we design a relational database, we search
for a set of entity types that will model the
problem of interest - If we choose a robust data model, it will last a
long time without major changes, even though the
programs that use it may change - Now that we have some idea what a good data model
is, we will talk about how to design one.
62Natural vs. Generated Keys
63Definitions
- Natural keya key that occurs in the data, that
serves as a unique identifier for rows - Generated keya key that is generated in order to
make rows unique
64Generated Keys
- In some organizations, programmers have the idea
that generated keys should be included in every
relationthis is a misunderstanding of the
relational approach. - Tuples are naturally unique because each one
corresponds to a distinct instance of an entity
type in the real world. You dont have to add
anything to make them unique. - If your database design is correct, you have
unique rows even if you havent added a key to
each row.
65When to Use Generated Keys
- Sometimes you have a compound key that is very
long - Lots of queries will be complicated and tables
with foreign keys will copy most of another table - In this case, a generated key can be used
- As a guideline, the only time to use a generated
key is when the natural key has at least three
columns
66Class Project
67Potential Class Project
- A social network for arts and cultural
organizations - It will operate across the Atlanticand be called
Across Atlantic - The site will use cultural exchangesarts and
musicto break down barriers between cultures - Cultural organizations can register, and so can
their members - Language translation will be provided
68The Project
- The project has some interesting database design
problems - We can allow people to enroll to a limited degree
without review - Manual review will be required for full
membership - We can use translucent database techniques to
protect private information - We can have blogs translated into several
languages automatically
69Thank you!