Title: Relational Database M 358
1Relational Database (M 358)
- Handout 7
- Tang Kwok Wa, Billy
2Block 4 Database Development
This block concerns
Establish Requirement
data requirements
Data analysis (ER Modeling, Normalization)
Conceptual data model
Database design
Logical schema
Implementation
Schema and database
3Establishing Data Requirements
- Data requirements is normally a document written
in natural language. - Documenting data requirements should
- enable the identification of subjects
- enable the recognition of relevant connections
between subjects together with the properties of
the relations - Problems of ambiguity in data requirements
4Data Analysis Process - Entity-Relation Modeling
- Entity-Relation Modeling
- Model is for communication between people, but
not for computers. - Not influenced by design issues. The following
issues are not relevant - data format (e.g. length of strings)
- data volume (e.g. how many occurrences of
students) - data usage (what processing required)
- data access (how data may be retrieved
efficiently) - Good ER model should
- represent all users data requirements -
completeness - have no duplication - integrity
- include all constraints - integrity
- be general (not restricted by the particular
initial requirements) - flexibility - understandable (simple, with no unnecessary
complexity) - usability
5Data Analysis Process - Entity-Relation Modeling
- Entity Types
- Represents things having a meaning in a given
context and with a need to record data. - Defines the properties common to a collection of
entities - e.g. Student - How to look for the entity type? In the
University data requirement, there are words like
information about each student is initially
recorded at registration. So one would think
each student would be represented by an entity
with some information to be recorded. - Student is the subject being modeled. To identify
subjects - exclude terms that only refer to data about those
subjects, such as course code or identification
number. - exclude nouns that refer to processes, such as
registration which refers to the process of
initially recording a students details.
6Data Analysis Process - Entity-Relation Modeling
- Relationships
- Relationship name
- Assign the relationship name in one to many
direction - E.g
- staff - counsels - student, not student -
IsCounselled By - staff. - Remember
- In E-R model, the relationship counsel is simply
represented by the line connected between Staff
and Student. - The attribute counsellor_no, being a foreign key
in Student referencing to Staff for the Counsel
relationship in Relational model, is not an
attribute in the Student entity type of the E-R
model
Student(student_id, name, registration, region)
Staff (staff_no, name, region)
7Data Analysis Process - Entity-Relation Modeling
- Relationships
- mn relationship
- mn relationship is decomposed into two 1n
relationships with an entity type introduced in
between them, unless there is absolutely no
additional data to be recorded by the
relationship. - the participation conditions of the introduced
entity type in the decomposition are mandatory.
It is meaningless for any occurrence exists in
this entity type without relationship to both. - identifiers from two related entity types are
combined to give a composite identifier for the
intersection entity type - Inclusive and exclusive relationships
- two relationships are inclusive if it is required
that every occurrence of the common entity type
participates either in both relationships or in
neither - two relationships are exclusive if it is required
that no occurrence of common entity type
participates in both relationships.
8Data Analysis Process - Entity-Relation Modeling
- Attributes
- According to data requirements, one may find
specification of data directly associated with
the subject. E.g. each student would be recorded
with his/her name, then student name will be one
of the attributes for the student entity type. - Attributes should be assigned to avoid duplicated
data - E.g. Each student should be given his
counsellors telephone number for communication - if counsellor telephone numbers are recorded in
Student entity type, forming - student(student_id, counsellor_telno)
- then, counsellor_telno will have repeated values
in this student entity type since a counsellor
would counsel many students. Thus
counsellor_telno should be recorded in staff
entity type.
9Data Analysis Process - Entity-Relation Modeling
- Attributes
- Complex values
- A staff has more than one qualifications If we
just introduce a Qualification attribute in Staff
entity to cater for this requirement, then a
staff will be multivalued when the staff has more
than one qualification such as BSc, MSc and PhD - Resolved by 1n relationship.
- Derived data
- having values based on some processing deriving
from other data - should not be included in an E-R model because it
is a kind of redundancy - occasionally exists in a data model, either
because its derivation is difficult to infer, or
because it is important for the data to be made
explicit. (should include a constraint to
prevent the inconsistency that may result from
duplication.)
Obtained
Staff
Qualifications
Qualifications(StudentId, Degree) (5324,
Bsc) (5324, MSc) (5324, PhD)
10Normalization
- A process to assess and improve the quality of
relational data model by identifying duplication
in a table and reorganizing the data into two or
more tables, without duplication or loss of
information. - Used in data analysis stage during the
development of database.
- Problem of a table not in normalized form
- Amending a value may cause changes in many tuples
( in order to maintain consistency of the data) - If GPId g1 changes the name from Leila to John,
then rows 1, 2, 3, 9, 10 have to be changed
11Normalization
- Single-Valued Fact (SVF)
- is a statement of expressing the meaning of
attribute in a relation, it associates one
attribute with exactly one another attribute. - E.g. a general practitioner has a name
- It is interpreted as a general practitioner
identified by general - practitioner identifier has (exactly) a name
- Having known such SVF, we know that in a
relation GeneralPractitioner, - GPID as the primary key determines exactly one
non-primary key attribute GPname - Every non-primary-key attribute in a relation
(e.g. GPName) becomes a single-valued fact about
the primary key (e.g. GPID). - Properties of single-valued fact
- Occurrences of a SVF will be recorded just once.
4 occurrences in the above example - The first occurrence of this SVF is the
general practitioner identified by g1 has a name
Leila - The second occurrence of this SVF is the
general practitioner identified by g2 has a name
Ftizdare.. - It is not reversible. A general practitioner has
one name - it is not true that a name is
associated with just one general practitioner.
relation GeneralPractitioner
GPId GPName g1 Leila g2 Fitzdare g3 von B g4 von B
12Normalization
- Functional Dependencies
- Functional dependency is - Within a relation R,
attribute B is functionally dependent on
attribute A if and only if each value of A
determines precisely one value of B -
-
- same as what single-value fact expresses.
- SVF a general practitioner identified by general
practitioner identifier (GPID) has a name can be
expressed in a short form as functional
dependency using the notation below - GPId ? GPName
- (it means that GPId determines precisely one
value of GPName) - GPId is said to be a determinant
- Given a relation R, B is fully functionally
dependent on A if it is functionally dependent on
A and not functionally dependent on any subset of
A - E.g Appointment(PatientId, ApptDate, ApptTime,
ConsId) - PatientId, ApptDate, ApptTime ? ConsId (not
full functional dependency) - PatientId, ApptDate ? ConsId (full
functional dependency) -
13Normalization Process
- Normalization process
- first normal form - gt second normal form - gt
third normal form - gt Boyce-Codd Normal Form - Starting point of normalization process
- Given a relation in first normal form
- Given a collection of Single Value Facts (or
Functional Dependencies)
ExtendedAppointment (PatientId, PatientName,
GPId, GPName, ApptDate, ApptTime, ConsId,
ConsName, HospNo, HospName)
14Normalization Process
- First Normal Form (1NF)
- A relation is in 1NF if and only if every
non-primary key attribute is functionally
dependent (not necessarily fully functionally
dependent) on the primary key. - Any relation is in 1NF, since a primary value
will identify a tuple and thus determine exactly
one value of each of the non-primary attributes. - ExtendedAppointment is in 1NF because it is a
relation.
15Normalization Process
- Second Normal Form (2NF)
- A relation is in 2NF if and only if it is in 1NF
and every non-primary key attribute is fully
functionally dependent on the primary key. - Need to know all the FDs holds in the relation.
- Identify additional determinant from the given
relation. (FD10 -FD17) - Add Transitivity FDs
- if PatientId ? GPId and GPId ? GPName then
PatientId ? GPName also holds - Add Augmentation FDs
- if A ? B holds, then A,Z ? B, Z also holds.
- E.g. PatientId, ApptDate ? ConsId (FD 5)
- ? PatientId, ApptDate ? ConsId, ApptDate
- ConsId, ApptDate ? HospNo (FD 9)
- therefore PatientId, ApptDate ? HospNo (FD
15)
16Normalization Process
- Whether a relation is not in 2NF
- Examine whether non-primary key attributes are
fully functionally dependent on primary key - In ExtendedAppointment, FD2, FD3 and FD10,
PatientId ? PatientName, PatientId ? GPId,
PatientId ? GPName. PatientId is only subset of
primary key (which is PatientId, ApptDate).
Therefore ExtendedAppointment is not in 2NF. - Putting a relation into 2NF
- ExtendedAppointment (PatientId, PatientName,
GPId, GPName, ApptDate, ApptTime, ConsId,
ConsName, HospNo, HospName) - Separate the original relation into two
- Put the subset of primary key together with its
determined attributes into a separate relation. - Patient2(PatientId , PatientName, GPId, GPName)
- Put the primary key together with those
attributes not determined by above into another
relation - PatientAppointment2 (PatientId, ApptDate,
ApptTime, ConsId, ConsName, HospNo, HospName)
17Normalization Process
1NF
FD 2, 3, 10 PatientId ? PatientName PatientId
? GPId PatientId ? GPName
2NF
PatientAppointment2 (PatientId, ApptDate,
ApptTime, ConsId, ConsName, HospNo, HospName)
PatientA2 (PatientId, PatientName, GPId, GPName)
Producing 1n relationship through decomposition
of original relation, with the common names
(PatientId) as the foreign key
18Normalization Process
- Third Normal Form (3NF)
- A relation is in 3NF if and only if it is in 2NF
and no non-primary key attribute is transitively
dependent on the primary key. - Transitive dependency is
- Within a relation, where A, B and C are each an
attribute or combination of attributes with no
attributes in common, if C is functionally
dependent on B and B is functionally dependent on
A, then C is transitively dependent on A via B,
provided that A is not functionally dependent on
B or C. - Putting relations into 3NF
- to move the attributes of the offending FDs, into
separate relation (or relations if determinant
differ) - offending FD in a transitive dependency A ? B
and B ? C implying A ?C, B ?C is the offending
FD - In PatientA2 (PatientId, PatientName, GPId,
GPName), GPName is transitive dependent of
PatientId because of PatientId ? GPId and GPId
? GPName - thus GPId, GPName should be moved to a new
relation (see in the following page) - remove the right-hand side attributes from
original relation - create new relation for each different
determinant that occurs in the offending FDs. - primary key of the new relation(s) is the
determinant of the offending FDs used to
construct it - primary key of the relation from which the
attributes have been removed is unchanged
19Normalization Process
2NF
PatientA2
PatientAppointment2
3NF
PatientA3
FD7,8 ConsId ? ConsName HospNo ?
HospName
ConsultantA3
PatientAppointmentA3
HospitalA3
GeneralPractitionerA3
20Normalization Process
- Putting Third Normal Form to Boyce-Codd Normal
Form (BCNF) - A relation is in BCNF if and only if every
determinant is a candidate key. - So, in BCNF, all the non-primary attributes can
only be determined from the candidate keys - See if any 3NF in which a determinant is not
candidate key. If there is such determinant,
open new relation for it. - Whether a 3NF is BCNF, checking by
- only one candidate key (that is, there are no
alternate keys) - or if more than one candidate key, then
- the candidate keys are not combinations of
attributes - or if the candidate keys are combination of
attributes, then - the candidate keys do not overlap
21Normalization Process
ConsId, ApptDate ? HospNo in
PatientAppointmentA3 (FD 9), but ConsId,
ApptDate cannot determine ApptTime and ConsId,
ApptDate cannot determine PatientID, thus
ConsId, ApptDate is not a candidate key in
PatientAppointmentA3
PatientAppointmentA3
ConsultantA3
PatientA3
3NF
HospitalA3
GeneralPractitionerA3
PatientAppointmentBCNF
General PractitionerA3
ConsultantVisitBCNF
HospitalA3
ConsultantA3
ConsultantVisitBCNF
PatientA3
PatientAppointmentBCNF
22Normalization Process
- Limitation of Normalization
- it only produce relations of 1n relationships,
results in no redundancy - relations of 11 relationship cannot be produced
by normalization. - no participation conditions are considered
- no constraint conditions are considered.
23Database Design
Use a relational conceptual data model to give a
set of tables for initial database design
Do the tables represent the data in an acceptable
way according to the given criteria for
usability, efficiency, and so on?
Yes
No
- Define constraints of each table
- columns
- primary key
- foreign keys
- constraints
Revise table
Revise if necessary
Implementation
24Defining Column in SQL
- Numeric data type
- Range
- Integer 2147483647
- Smallint 32767
- Decimal numbers
- precision number of decimal digits
- scale number of digits after the decimal point
- E.g cast( input_no as decimal(5,2)) as output_no
- if input_no 999.123456 output_no 999.12
- Metadata (data about data) - any data that is
required to interpret other data as meaningful
information. - E.g. A numeric column Amount showing x.xx
would mean HK x.xx K. The actual meaning of
Amount would be interpreted as in scaling
factor of HK000 (metadata)
25Defining Column in SQL
- Character data type
- VARCHAR (variable length)
- CHAR (fixed length, adding spaces so that string
of the specified length is stored) - (Infomaker only supports VARCHAR. CHAR is
implemented as VARCHAR) - Date-time data types
- DATE
- TIME
- TIMESTAMP (combination of date and time)
- Default values
- Real values as default e.g. default is current
year., or - Non real value as default, e.g. 9999 as default
for integer column. May cause error in queries,
e.g. avg()
26Defining Column in SQL
- Using Codes
- e.g. HK representing Hong Kong
- Advantage less space, less typing mistakes
- Disadvantage Not immediately understandable
- (may need to link with a full name table)
- SQL domains not in same way of relational
domains - comparison of two SQL columns requires only that
they are comparable data type, but not same
domain. - SQL foreign keys and operations are not
constrained by domains as they are for relations.
27Defining Table in SQL
- Defining primary key
- Choose primary key according to uniqueness,
minimality and not being null. - Using surrogate - a field with distinguish values
for primary key. It has no meaning at all, just
generate by a DBMS to distinguish different row.
E.g. a numeric column as DEFAULT AUTOINCREMENT
will automatically generate a new and unique
value for the column when a new row is added. - Defining constraints in SQL
- primary key - expressed as PRIMARY KEY
- alternate key - expressed as UNIQUE
- referential constraints FOREIGN KEY
- mandatory participation condition - NOT NULL for
foreign keys, or using CHECK clause - Constraints part of the model - expressed as a
CHECK clause - Considerations in using CHECK
- Inefficient processing if CHECK statement is
complicate. - Every update involves processing a CHECK, causing
unacceptable delay.
28Revising Tables
- Avoiding Null
- E.g. Student( StudentId, Name, Telephone_no)
- Because some students might not have telephone,
telephone_no column will be null in some rows. - To avoid null, can use two tables
- Student( StudentId, Name) Telephone(StudentId,
Telephone_no) - However, query might become complex as it will
then involve join. - Denormalization
- Deliberately combining normalized tables into
one, in order to achieve efficient data retrieval
and simple query. - Beware duplicated data after denormalization.
29Additional Data
- Derived data - values that are obtained by
processing other data. - To maintain exact consistency of derived data, it
should be updated every time the data used in the
derivation is changed (can be done by trigger) - Can also be done by periodic updating of
student_count by a separate application process
(sacrifice some data consistency but relief
processing burden) - Defining Extra Tables because of
- snapshot data (e.g. to make an annual report)
- summary data for users
- historical record (for auditing purpose)
30Database Implementation
- Indexes - fast data access method to support
efficient retrieval of data. - It involves a mechanism that associates a value
in a column of a table with the stored locations
of all the rows which contain that value. - When a query is executed in which an indexed
column is referenced in the WHERE clause, it can
use the index to provide the stored locations of
the particular rows that satisfy the WHERE
condition, thus enable direct access to the
required data. - Need extra space for index file
- Index must be updated every time when there is
change in the values in the indexed column - Need to balance against the possible benefits of
an index in deciding whether one is appropriate
for a given column(s)
31Database Implementation
- Establishing a database
- Ways of populating the tables of a new database
- by import of old tables
- prepares an application program that reads in the
old data, transforms it to appropriate format and
then inserts it into the new table. - Backup (for data recovery when necessary)
- continuous copying of all changes to a database
in a log file - periodic unloading of database to make a backup
copy
32Logical Processing Sequential Processing
(quicker)
Find the best plan for executing SQL statements
FROM get the required table store as the 1st
intermediate table
FROM get a row of the required table
next row
WHERE get the 1st intermediate table retain the
rows that satisfy the where clause store as 2nd
int. table
WHERE retain the rows that satisfy the where
clause
SELECT get the 2nd intermediate table retain the
columns required produce the final table
SELECT retain the columns required
1 disk access
5 disk accesses
33Find the best plan for executing SQL statements
Using Index
- When will index become ineffective?
- For small table, single disk access would
retrieve the whole table. Index will not help
much - With many rows spreading in each blocks, most of
the blocks are likely to be accessed, so using an
index to identify the required blocks may be
slower than simply retrieving every block for
sequential plan of processing of the whole table. - Any conditions in the WHERE clause not involving
the indexed column would need to be processed
separately
With an index on tutor_no
FROM WHERE get a row that satisfies the WHERE
clause
next row
SELECT retain the columns required
34THE END