Title: Fundamentals of CS 2: Databases REVISION WEEK 2
1Fundamentals of CS 2DatabasesREVISION WEEK 2
- John Barnden
- Professor of Artificial Intelligence
- School of Computer Science
- University of Birmingham, UK
2What Is a Database? contd.
- A database is generally a regularly (but often
complexly) structured body of information about
entities of various specific, precisely defined
types. - The entities are generally in various specific
types of relationship to each other - Each entity has a specific set of (intrinsic)
attributes of interest. - The values of intrinsic attributes are generally
of fairly basic, simple sorts (e.g., numbers,
dates, names). - Generally there are many entities of some of the
types or rather the expectation is that the
numbers could get large. - The entities of a given type are typically not in
any special order.
3What Is a Database(?) contd.
- The individual data elements held, though of
simple sorts, are generally close to end-user
concepts and concerns -- they are directly
meaningful interesting to users. E.g., price of
a car. - Contrast the detailed meteorological
measurements across the country and beyond, used
in a weather forecasting system for the UK. - The data held and retrieved is generally of exact
form (no vagueness expressed) and of definite
form (no uncertainty expressed or expected). - The operations provided to users for extracting,
inserting and updating data are of conceptually
straightforward sorts, not requiring elaborate
reasoning, problem-solving or analysis. - However, aggregate/overview/statistical
information (counts, averages, maxima, graphs,
histograms, etc.) often needs to be computed from
the data.
4What We Mainly Studied
- The nature of relational databases, the central
modern type of database. Entity types represented
as tables, holding relations. - Some basic mathematical concepts underpinning
relational databases, and useful also in many
other branches of CS. - Key aspects of how to develop the
conceptual/logical design of relational
databases. - In particular, how to achieve certain types of
good structuring, to help achieve certain types
of correctness and efficiency. - How to create and manipulate databases using a
particular database language, PostgreSQL (a
version of Structured Query Language very widely
used in various forms).
5Data Redundancy, Data Anomalies, and Data
Integrity
- See Ch.1 and many other parts of book.
- Data Redundancy replicating data in different
places in a data repository. - E.g., in a recipe book, saying how to fry onions
every time fried onions are needed in a recipe. - Encourages data anomalies and lack of
integrity basically, inconsistency between
the different places. - Such problems arise with insertions, deletions
and modifications in general. - Also causes a type of inefficiency replicated
updates.
6Redundancy, etc., contd. 1
- Redundancy implies that if you want to
modify/delete a piece of information, you need to - know that there is (or is not) replication, or
check for possible replications - go to the effort of repeating changes when the
item is replicated - avoid errors in such repeated changes.
7A Conceptual Model
- Represents global view of the database
- Enterprise-wide representation of data as viewed
by high-level managers - Basis for identification and description of main
data objects and relationships, avoiding
details
8The Entity Relationship Model
- Introduced by Chen in 1976
- Most widely used conceptual model of DBs
- The ER model strictly speaking is just the
approach of thinking of things as composed of
entities, attributes and relationships it has
nothing intrinsically to do with diagrams
J.A.B. - We also say that applying this approach to a
particular body of data gives rise to an ER model
of the specific intended database - But diagrams based on the model are a widely
accepted and adapted graphical approach to data
modeling
9Entity Relationship Diagrams (ERDs)
- The ER model of a database forms the basis of an
ER diagram (or several diagrams). - The ERDs represent the database as viewed by end
users. - There are several markedly different styles of
ERD, and for each main style there are several
variants. - And the style in the module handouts will differ
somewhat from that in the textbook and these
lectures - An ERD depicts (some of) the ER models entities,
attributes and relationships, and (depending on
the diagram style) varying amounts of other info
such as connectivities, cardinalities, keys,
weakness,
10Meriting Tables
- The question of what types of thing should
correspond to entity types and hence tables
depends on the application and your design
judgment. - It all depends on things like
- what variety of data is needed about something
- how separate the pieces of data about a given
thing are - what operations are needed
- how often theyre needed.
11Restrictions on our Tables
- We will apply the following restrictions amongst
others - Regular overall shape rows all same length,
similarly columns. No division into different
regions (with a certain exception). - No labels for rows, as opposed to columns. No
significance to the order or number of rows (the
number can change). - All cells in any one column given same intuitive
interpretation. - One data item per cell (but it can be a
variable-length character string, containing
anything). - Each cells item restricted to a pre-specified,
fairly simple format, and all cells in any given
column restricted to same format. No exceptional
entries. - Uncertainty markers not recognized though
tolerated in principle. - Some columns not supposed to have empty or vague
entries. - No additional comments, footnotes, etc.
12A Bad Table
13Extra, Crucial Restriction
- No row can be repeated in a table. (I.e., no two
rows can contain be exactly the same in terms of
the values they contain.) Exception
dynamically-created, temporary tables. - Equivalent to saying
- Rows are uniquely determined (picked out) by the
values in some set of columns. That is, given
some values for those columns, there is at most
one row that has those values in those columns.
14Attribute Determination
- A collection of one or more attributes determines
another attribute A if only one value for A is
possible given the values for the former
attributes. - E.g., the collection DAY-NUMBER, MONTH and YEAR
could determine DAY-NAME. - Also write this as DAY-NUMBER, MONTH, YEAR ?
DAY-NAME - We alternatively say that DAY-NAME is
functionally dependent on DAY-NUMBER, MONTH and
YEAR.
15Keys(RC pp.82/3)
- A key for a table is a collection of one or more
attributes that determines some other
attribute(s) in that same table. - A superkey for a table is a collection of one or
more attributes that determines all the other
attributes in the table, i.e. determines a whole
row. - Trivially, the collection of all the attributes
is a superkey. - A candidate key is a minimal superkey (i.e., you
cant remove attributes from it and still have a
superkey.)
16Primary Keys
- A primary key for a table (entity type) is a
candidate key that the DB designer has chosen as
being the main way of uniquely identifying a row
(entity). Extra restriction Its attributes are
not allowed to have null values. - It could be that theres only one candidate key
in practice anyway, such as a persons ID number. - Primary keys are the main way of identifying
target entities in entity relationships, e.g.,
the way to identify someones employing
organization. - For efficiency reasons, the simpler primary keys
are the better. - Identity numbers (of people, companies, products,
courses, etc.), or combinations of them with one
or two other attributes, are the typical primary
keys in examples in RC.
17Strong and Weak Relationships
- Also called identifying and non-identifying
relationships respectively. - A relationship from entity type A to entity type
B, mediated by having As primary key (PK) as an
attribute of B, is strong when Bs PK contains
As. - E.g., A People, B Dependents, where each
entity in People is identified by a PERS_ID (As
PK) and each entity in Dependents is identified
by a PERS_ID plus a first name and a
kinship/friendship attribute. - So a PK value in B could be (1698674, Mary,
child), meaning that this entity is the child
called Mary of person 1698674. - A relationship is weak when it isnt strong.
18Strong and Weak Entity Types
- A weak entity type is one to which there is at
least one strong relationship going to it from
another entity type. - So on previous slide, Dependents is weak, because
there is a strong relationship to it from People. - A strong entity type is one that is not weak!
i.e. there are no strong relationships going to
it from another type. - A weak entity is existence-dependent on every
entity type that has a strong relationship going
to it. - Marys existence in the database as a member of
Dependents relies on the existence of person
1698674 in the database. (Doesnt mean Mary would
vanish from the planet if person 1698674 left the
database! And indeed Mary could herself be an
entity in type People even if 1698674 leaves.)
19Connectivity Cardinality
- Relationships are importantly categorized as to
uniqueness or multiplicity of entities at either
end connectivity. Has big effect on DB design. - Relationships can be further specified as to how
many entities allowed or required at either end
cardinality. - Also has significant effect on DB design.
20Relationship Participation
- Optional in a particular direction, X to Y
- an X entity occurrence does not require a
corresponding Y entity occurrence - i.e. the minimum number of Ys per X is 0
- Mandatory in a particular direction, X to Y
- an X entity occurrence requires a corresponding Y
entity occurrence - i.e. the minimum number of Ys per X is 1 or more
21Relationships and Foreign Keys
- A foreign key in a table T is a chosen collection
of attributes intended to match the attributes in
the primary key in another table. In essence, the
foreign key is that primary keys ambassador in
T. - Standardly, a relationship is represented by
means of foreign keys.
221M Connectivity between Tables
More than one employee allowed per organization,
but no more than one employer per person. NOTE
direction of use of a foreign key. Why so??
People
Organizations
23MN Connectivity between Tablesusing a Bridging
Entity Type
People
Employments
Organizations
24Creating ER Models/Diagrams
- Designing an ER model for a database is an
iterative process, because, e.g. - As you proceed, you think of new ways of
conceiving whats going on (much as in ordinary
programming) - Multivalued attributes need to be re-represented
- MN relationships can be included as such at an
early stage, but generally need to be replaced by
bridging entity types at some point - 11 relationships raise a red flag may indicate
poor design (though standard in supertype/subtype
representation) - Special supertype/subtype notation may need to be
converted into more standard diagram notation - Conversion to a Normal Form
25Examples in Two Styles of Diagram
26Multivalued Attributes in ERMs and ERDs
27Splitting the Multivalued Attribute into New
Namable Component Attributes
28OR, Replace the Attribute by a New Entity Set
with an Attribute identifying the Original
Attributes Separate Components
29Generalization Hierarchies in ERMs and ERDs
30A Generalization Hierarchy with Overlapping
Subtypes (Gs)and Disjoint Subtypes (G)
31Supertypes/Subtypes in ERDs
A supertype maintains a 11 relationship with
each subtype optional in the super-to-sub
direction and mandatory in the other
32Non-Binary and Recursive Relationships in ERMs
and ERDs
33Tables for a Ternary Relationship
CFR is just like a bridging entity type for a
binary MN relationship, but has 3 foreign keys
instead of 2
34Recursive Relationships
- Relationship can exist between occurrences of the
same entity set - E.g. marriage, management, parthood,
35Table for the 1M EMPLOYEE Manages EMPLOYEE
Recursive Relationship
36Tables for the MN Recursive PART Contains PART
Relationship
The COMPONENT entity type is just a bridging type
37Alternative Implementations of a 11 Recursive
Relationship
- As previously.
- MARRIED_VI is just a bridging entity type.
- MARRIAGE together with MARPART act similarly as
bridge.
38NORMALIZATION(Ch. 5 of RC)
39In this chapter, you learn
- What normalization is and what role it plays in
the database design process - About the normal forms 1NF, 2NF, 3NF, BCNF, and
4NF - How normal forms can be transformed from lower
normal forms to higher normal forms - That normalization and ER modeling are used
concurrently to produce a good database design,
helping to eliminate data redundancies
anomalies. - That some situations require denormalization to
generate information efficiently
40First Normal Form
- Tabular format in which
- All attributes are dependent on a chosen
primary key - Primary-key attributes do not have NULL values
- There are no repeating groups in the table
- All relational tables satisfy 1NF requirements
41A Table that is Not in 1NF because of repeating
groups
42Second Normal Form
- Table is in second normal form (2NF) if
- It is in 1NF and
- It includes no partial dependencies
- Convert to 2NF by creating a new table for each
part of the primary key that provides partial
dependencies, and moving out the corresponding
dependent attributes into the new tables.
43A Dependency Diagram for a Table that is merely
in 1NF (by having a partial dependency)
44Second Normal Form (2NF) Conversion Results
45Third Normal Form
- A table is in third normal form (3NF) if
- It is in 2NF and
- It contains no transitive dependencies
46Conversion to Third Normal Form
- For every transitive dependency, take its
determinant (the attribute or collection of
attributes on which the dependency rests) as a
PK for a new table. - Move the attributes dependent on the determinant
to that new table.
47Third Normal Form (3NF) Conversion Results
48The Boyce-Codd Normal Form (BCNF)
- A table is in BCNF if every one-attribute
determinant in the table is a candidate key - i.e., every attribute that determines any other
attribute determines all the attributes, and
therefore determines a unique row, so theres no
redundancy problem
49A Table in 3NF but not in BCNF
50Decomposition to BCNF
The middle diagram shows that changing the PK so
as to include C doesnt work
51Plusses/Minusses of Normalization
- PLUS Unnormalized tables in a production
database tend to have these defects - Data updates are less efficient because programs
that read and update tables must deal with larger
tables - Indexing is much more cumbersome
- MINUS Joining larger number of tables takes
additional disk input/output (I/O) operations and
processing logic - Reduces system speed
- Conflicts among design efficiency, information
requirements, and processing speed are often
resolved through compromises that may include
denormalization
52MATHEMATICAL VIEW
53Tuples in a Table
People
- The tuples are
- ?9568876A, Chopples, 37gt
- ?2544799Z, Blurp, 21gt
- ?1698674F, Rumpel, 88gt
54Cartesian Product of Domains
- The set of all possible tuples formed from the
domains is called the Cartesian product of the
domains. - (This applies to any sets, not just domains in
tables.) - Notation, e.g. D ? E ? F ? G ? H
- if D, E, F, G, H are the sets (domains)not
necessarily different. - In discrete mathematics, any subset at all of
that Cartesian product is called a relation on
the sets in question (D, E, ) - even the whole of the product (even if infinite)
- and even the empty set.
- (The notion of relation is very useful in CS as a
whole.)
55Relation from a Table
People
- The relation at the moment is
- ? ?9568876A, Chopples, 37gt
- ?2544799Z, Blurp, 21gt
- ?1698674F, Rumpel, 88gt ?
56A Table as a Relation?
- People loosely talk about tables being relations.
- This is mathematically inaccurate for two
reasons - The table properly speaking includes not just the
rows but also the attributes themselves, their
domains, specification of primary and foreign
keys, etc. - Its only the rows at any given moment that form
a relation. - But OK if you know what you (and they) mean.
57Relations from Somewhere to Somewhere
- Some convenient terminology of my own
- A relation R from A1, A2, A3, to B1, B2, B3,
is a subset of A1 ? A2 ? A3 ? B1 ? B2 ? B3 . - i.e., R ? A1 ? A2 ? A3 ? B1 ? B2 ? B3 .
- Same thing as a relation on A1, A2, A3, B1, B2,
B3, just different terminology.
58Totality and Partiality
- A relation R from A to B is total (on A) if it
relates everything in A to one or more things in
B. - I.e., for every member a of A, there is at least
one b in B such that ?a, bgt is in R. - A relation that is not total is (merely) partial
(on A above). - But we can always make R total by slimming A
down enough! - Can generalize
- A relation R from A, B, C, to D, E, is total
(on A, B, C, ) if for every member a of A, b of
B, c of C, etc. there is at least one d in D, e
in E, etc. such that ?a, b, c, , d, e, gt is in
R.
59Partiality of Tables
- The relation in a table, considered as a relation
from its first attribute domain to the remaining
attribute domains, will almost always be merely
partial. - Similarly, any attribute-reordered version of the
tables relation will be merely partial. - We can sum this up by saying that the relation is
partial on each of its attribute domains.
60Functional Relations
- A relation from A to B is functional if, for any
a in A, there is at most one b in B such that ?a,
bgt is in R. - A total functional relation from A to B is called
a function from A to B. - A (merely) partial functional relation from A to
B is called a (merely) partial function from A to
B. - Can generalize a relation from A1, A2, A3 to
B1, B2, B3, is functional if, for any a1, a2,
a3, in A1, A2, A3, respectively, there is at
most one b1, b2, b3, in B1, B2, B3,
respectively such that ?a1, a2, a3, , b1, b2,
b3, gt is in R.
61Functional Relations arising from Dependency
Relationships
- Suppose attribute X is functionally dependent on
( determined by) attributes A, B, in a table. - Restrict to the subtable that just has attributes
X and A, B, , with A, B, first. - Then the relation in this subtable is a partial
function from the A, B, domains to the X
domain. - In particular, each attribute X outside the
primary key of a table is functionally dependent
on the primary key, so we have a partial function
from the PK domains to the X domain.
62Other Categories of Relation
- A relation R from A to B is one-to-one (1-1) if,
for any a in A, there is at most one b in B such
that ?a, bgt is in R, AND for any b in B, there is
at most one a in A such that ?a, bgt is in R. - That is, both the relation and its inverse from B
to A are functional. - To put it another way it is functional and
different members of A map to different members
of B. - Or again Different members of A map to different
members of B and different members of B map to
different members of A.
63 - A relation R from A to B is many-to-one if it is
functional but not necessarily one-to-one i.e.,
there may be at least one case of different
members of A mapping to the same member of B. - A relation R from A to B is one-to-many if it is
NOT necessarily functional but its inverse from B
to A is functional i.e., there may be at least
one case of a member of A mapping to more than
one member of B, but each member of B can map to
at most one member of A. - A relation R from A to B is many-to-many if
neither it nor its inverse is necessarily
functional i.e., there may be at least one case
of a member of A mapping to more than one member
of B, and there may be at least one case of a
member of B mapping to more than one member of A.
64Relations from Entity Relationships
- E.g., the EMPLOYED-BY relationship from the
People entity type to the Organizations entity
type says that the database (at any moment)
stores a relation from the People entity set to
the Organizations entity set.
65Example Continued
- So at any given moment the relation might be
- ?Person1, Org1gt, ?Person2, Org1gt, ?Person3,
Org1gt, - ?Person4, Org2gt, ?Person3, Org2gt
- Each Person.. and Org.. is an entity ...
therefore represented as a row of the
corresponding table ... therefore itself
mathematically represented as a tuple of
attribute values - So ?Person1, Org1gt could be, in more detail,
- ? ?E156, Sam, Finks, I678gt, ?I678, IBM,
USAgt gt - Note the nested tuples.
66Bridging Entity Types
- Recall that bridging entity types are brought in
to represent MN relationships (and similarly
MNP relationships, etc.) - People/Organizations again the relation in the
bridging table would look like ? E156, I678gt, ?
E257, I996gt, ?E714, I678gt, . - This relation can also be said to correspond to
the original People-Organization relationship,
but is abstracted from the above relation by
replacing tuples representing entities, such as
?E156, Sam, Finks, I678gt, by the PK values in
them, such as E156.
67Connectivities
- If a relationship from an entity type to another
is 11, then at any moment the actual relation
must be one-to-one (1-1). - If the relationship is 1M then the relation at
any moment is allowed to be one-to-many (but may
by chance be one-to-one). - If the relationship is MN then the relation at
any moment is allowed to be many-to-many (but may
by chance be one-to-one, one-to-many, or
many-to-one).
68Optionality/Mandatoriness
- If a relationship from an entity type E to
another type F is mandatory then the relation at
any moment is total when restricted to the set
of entities currently in E (and NOT on the whole
entity set of E, unless all possible entities of
type E are in the E table!!). - Observation So if it is also one-to-one or
many-to-one it is in fact a function on the
current set of E entities to the set of
organizations. - If a relationship from an entity type E to
another type F is optional then the relation is
not required to be total in the above sense (but
may happen to be).
69 Maths continued towards Relational Operators
in Databases
70Some Operations on Sets in General
- Union of sets A and B
- A ? B the set of things that are in A or B (or
both). - NB no repetitions allowed!
- Intersection of sets A and B
- A ? B the set of things that are in both A and
B. - Difference of sets A and B
- A ? B the set of things that are in A but not
B. - Note also notated by a backslash instead of a
minus sign.
71General Set Operations contd.
- I need to define also the non-standard notion of
flattened Cartesian product of two sets A and
B, applicable when the members of A and B are
already tuples. Notated by the symbol ?
(underlined multiplication symbol). - A ? B the set of tuples that are the
concatenations of members of A and members of B.
E.g., if lta,b,cgt is in A and ltd,e,fgt is in B then
lta,b,c,d,e,fgt is in A ? B.
72General Set Operations contd.
- So if A is the People relation and B is the
Organizations relation, - A ? B has members of form
- ? ?E156, Sam, Finks, I678gt, ?I459,
Dell, UKgt gt - BUT
- A ? B has members of form
- ?E156, Sam, Finks, I678, I459, Dell,
UKgt
73Relational Database Operators
- Relational algebra
- Defines theoretical way of manipulating tables
using relational operators that mainly
manipulate the relations in the tables. - SELECT
- PROJECT
- JOIN (various sorts)
- INTERSECT
- Use of relational algebra operators on existing
tables produces new tables
- UNION
- DIFFERENCE
- PRODUCT
- DIVIDE
74Select
- SQL
- SELECT FROM WHERE
- Note its the WHERE part that is actually doing
the selection according to a criterion. - Maths of relations
- No particular notation for result relation. Could
express as - t ?R t satisfies C
- where R is the relation in the given table and C
is the criterion. - Relational algebra notation in handout
- Result table is ?C(T) where T is the given table.
- More compact than SQL notation.
- Problem relations and RA dont account for
duplicates of rows, if there are any in the given
table.
75Project
- SQL
- SELECT FROM
- Note its the after the SELECT that is
actually doing the projection (selection of
specified attributes)!! - Maths of relations
- No particular concise notation for result
relation. Could express as - t ?P there is a tuple u in R such that t is u
with such and such components removed - where R is the relation in the given table and P
is the Cartesian product of the domains of the
selected attributes. - Relational algebra notation in handout
- Result table is ?X(T) where T is the given table
and X is the list of selected attributes. - Problem relations and RA dont account for
duplicates of rows, if there are any in the given
or result tables.
76Union and Intersection
- SQL
- UNION, INTERSECT (UNION ALL, INTERSECT ALL)
- Maths of relations
- Result relations are R1 ? R2 and R1 ? R2
- where R1and R2 are the relations in the given
tables (under the assumption that they have their
attributes listed in the same order). - Relational algebra notation in handout
- Result tables are T1 ? T2 and T1 ? T2 where T1
and T2 are the given tables. - Problem relations and RA dont account for
duplicates of rows, so dont handle UNION ALL or
INTERSECT ALL.
77Difference
- SQL
- MINUS or EXCEPT (MINUS ALL or EXCEPT ALL)
- Maths of relations
- Result relation is R1 ? R2
- where R1and R2 are the relations in the given
tables (under the assumption that they have their
attributes listed in the same order). - Relational algebra notation in handout
- Result table is T1 ? T2 where T1 and T2 are the
given tables. - Problem relations and RA dont handle EXCEPT ALL.
78Product or Cross Join
- SQL
- SELECT FROM two or more tables
- NB its the mere listing of the tables that does
the Product, but its possible also to write - SELECT FROM T1 CROSS JOIN T2 CROSS JOIN ...
- Maths of relations
- Result relation is R1 ? R2 where R1and R2 are the
relations in the given tables. - Relational algebra notation
- Result table is T1 ? T2 where T1 and T2 are the
given tables. - Problem relations and RA dont account for
duplicates of rows, if there are any in the given
tables.
79Natural Join
- Links two tables by finding rows in each that
match on the attributes the two tables have in
common (if any), and then joining the matching
rows together in a natural way. - The common attributes or columns are called the
join attributes or columns) just the AGENT_CODE
attribute in above example - Can be thought of as the result of a three-stage
process - the PRODUCT of the tables is created
- a SELECT is performed on the resulting table to
yield only the rows for which the join-attribute
values (e.g. AGENT_CODE values) are equal - a PROJECT is now performed to yield a single copy
of each join attribute, thereby eliminating
duplicate columns
80Natural Join (continued)
- SQL
- SELECT FROM T1, T2 WHERE explicit
- statements of equality of ALL the shared
attributes ... - NB its possible also to write
- SELECT FROM T1 NATURAL JOIN T2
- and see book section 7.2 for more.
- Maths of relations
- No standard concise notation. Result relation
could be expressed as the result of Project and
Select operations on R1 ? R2 where R1 and R2 are
the relations in the given tables. - Relational algebra notation
- Result table is T1 ... T2 where T1 and T2 are the
given tables and the is a symbol I cant find
in PowerPoint!
81Other Forms of Join
- Equijoin (often what you have been doing in SQL
work) - Links tables on the basis of an equality
condition that compares SPECIFIED attributes of
each table, rather than automatically taking the
common attributes. - These specified join attributes do not have to
have the same name. - Result does not eliminate duplicate columns
- Outer Join (various forms) includes (some)
non-matching rows