Title: Database Overview
1Database Overview
- File Management vs Database Management
- Advantages of Database systems storage
persistence, - programming interface, transaction management
- Three level Data Model
- DBMS Architecture
- Database System Components
- Users classification
2File Management System Problems
- Data redundancy
- Data Access New request-new program
- Data is not isolated from the access
implementation - Format incompatible data
- Concurrent program execution on the same file
- Difficulties with security enforcement
- Integrity issues
3Advantages of Databases
- Persistent Storage Database not only provides
persistent storage but also efficient access to
large amounts of data - Programming Interface Database allows users to
access and modify data using powerful query
language. It provides flexibility in data
management - Transaction Management Database supports a
concurrent access to the data
4Three Aspects to Studying DBMS's
- 1. Modeling and design of databases.
- Allows exploration of issues before committing to
an implementation. - 2. Programming queries and DB operations like
update. - SQL intergalactic dataspeak.
- 3. DBMS implementation.
- .
5Definitions
- A database is a collection of stored operational
data used by various applications and/or users by
some particular enterprise or by a set of outside
authorized applications and authorized users - A DataBase Management System (DBMS) is a software
system that manages execution of users
applications to access and modify database data
so that the data security, data integrity, and
data reliability is guaranteed for each
application and each application is written with
an assumption that it is the only application
active in the database.
6Three Level Data View Data Abstractions
. . . . .
View1
View k
Conceptual View Of Data
Phyisal Data Storage
7DBMS Architecture
8Logical and Physical Database Components
- Data Definition Language (DDL)
- Data Manipulation Language (DML)
- Host Language Interface
- Data Administrator
- Users
- Query Processor
- Compiler
- Optimizer
- Management
- Transaction Manager
- File Manager
- Buffer Manager
- Authorization and Integrity Manager
Logical
Physical
9Database Languages
Department
Faculty
Dept
Chair
Name
Dept
SQL
- SELECT Chair
- FROM Faculty, DepartmentWHERE Faculty.name
Ken Noname AND Faculty.Dept
Department.Dept - Data definition language (DDL) like type
definitions in C or C - Data Manipulation Language (DML) Query
(SELECT) UPDATE lt relation name gt SET
ltattributegt lt new-valuegt WHERE ltconditiongt
10Logical Data Models
- A collection of tools for describing
- data
- data relationships
- data semantics
- data constraints
- Value based models ER Model, OO Model
- Record Based Models Relational Model
11Entity-Relationship Model
- The enterprise data can be described as a set of
entities and a set of relationships between them. - Entity a data that pertains to, or describes
some component of the enterprise - Each entity is characterized by a set of
attributes - Relationship describes an interconnection
between different entities - Entity Set a set of entities that are
characterized by the same entity definition - Relationship Set a set of relationships of the
same type
12Entity-Relationship Model
- Example of schema in the entity-relationship model
13Relational Model
- An enterprise is represented as a set of
relations - Domain is a set of atomic values. Each domain
has a NULL value. - Data type Description of a form that domain
values can be represented. - Relation is a subset of a cartesian product of
one or more domains - The elements of relations are called tuples. Each
element in the cartesian product is called
attribute.
14- Relational model is good for
- Large amounts of data gt simple operations
- Navigate among small number of relations
- Difficult Applications for relational model
- VLSI Design (CAD in general)
- CASE
- Graphical Data
ALU
ADDER
CPU
A
FA
Adder
ALU
ADDER
Bill of Materials or transitive closure
15Relational Model
Attributes
Street
City
gpa
Name
Student-id
- Example of tabular data in the relational model
Johnson Smith Johnson Jones Smith
192-83-7465 019-28-3746 192-83-7465 321-12-3123
019-28-3746
Alma North Alma Main North
3.6 2.7 3.2 4.0 3.45
Palo Alto Rye Palo Alto Harrison Rye
16Relational Algebra
17 Relational Model
- Basic Notions
- Fundamental Relational Algebra Operations
- Additional Relational Algebra Operations
- Extended Relational Algebra Operations
- Null Values
- Modification of the Database
- Views
- Bags and Bag operations
18Basic Structure
- Formally, given sets D1, D2, . Dn a relation r
is a subset of D1 x D2 x x DnThus,
a relation is a set of n-tuples (a1, a2, , an)
where each ai ? Di - Example
- customer_name Jones, Smith,
Curry, Lindsay customer_street Main, North,
Park customer_city Harrison, Rye,
PittsfieldThen r (Jones, Main, Harrison),
(Smith, North, Rye),
(Curry, North, Rye),
(Lindsay, Park, Pittsfield) is a relation over
- customer_name , customer_street, customer_city
19Relation Schema
- A1, A2, , An are attributes
- R (A1, A2, , An ) is a relation schema
- Example
- Customer_schema (customer_name,
customer_street, customer_city) - r(R) is a relation on the relation schema R
- Example
- customer (Customer_schema)
20Database
- A database consists of multiple relations
- Information about an enterprise is broken up into
parts, with each relation storing one part of
the information - account stores information about accounts
depositor stores information about which
customer owns which
account customer stores information
about customers - Storing all information as a single relation such
as bank(account_number, balance,
customer_name, ..)results in repetition of
information (e.g., two customers own an account)
and the need for null values (e.g., represent a
customer without an account)
21Keys
- Let K ? R
- K is a superkey of R if values for K are
sufficient to identify a unique tuple of each
possible relation r(R) - by possible r we mean a relation r that could
exist in the enterprise we are modeling. - Example customer_name, customer_street and
customer_name are both superkeys
of Customer, if no two customers can possibly
have the same name. - K is a candidate key if K is minimalExample
customer_name is a candidate key for. - Primary Key
22Select Operation Example
A
B
C
D
? ? ? ?
? ? ? ?
1 5 12 23
7 7 3 10
A
B
C
D
? ?
? ?
1 23
7 10
23Project Operation Example
A
B
C
? ? ? ?
10 20 30 40
1 1 1 2
A
C
A
C
That is, the projection of a relation on a set of
attributes is a set of tuples
? ? ? ?
1 1 1 2
? ? ?
1 1 2
24Union Operation Example
A
B
A
B
? ? ?
1 2 1
? ?
2 3
s
r
A
B
? ? ? ?
1 2 1 3
r ? s
25Set Difference Operation Example
A
B
A
B
? ? ?
1 2 1
? ?
2 3
s
r
A
B
? ?
1 1
r s
26Cartesian-Product Operation-Example
A
B
C
D
E
Relations r, s
? ?
1 2
? ? ? ?
10 10 20 10
a a b b
r
s
r x s
A
B
C
D
E
? ? ? ? ? ? ? ?
1 1 1 1 2 2 2 2
? ? ? ? ? ? ? ?
10 10 20 10 10 10 20 10
a a b b a a b b
27Additional Operations
- We define additional operations that do not add
any power - to the relational algebra, but that simplify
common queries. - Set intersection
- Natural join
- Division
- Assignment
28Set-Intersection Operation - Example
A B
A B
? ? ?
1 2 1
? ?
2 3
r
s
A B
? 2
29Natural Join Operation Example
B
D
E
A
B
C
D
1 3 1 2 3
a a a b b
? ? ? ? ?
? ? ? ? ?
1 2 4 1 2
? ? ? ? ?
a a b a b
r
s
A
B
C
D
E
? ? ? ? ?
1 1 1 1 2
? ? ? ? ?
a a a a b
? ? ? ? ?
30Division Operation Example
A
B
Relations r, s
B
? ? ? ? ? ? ? ? ? ? ?
1 2 3 1 1 1 3 4 6 1 2
1 2
s
r ? s
A
r
? ?
31Another Division Example
Relations r, s
A
B
C
D
E
D
E
? ? ? ? ? ? ? ?
a a a a a a a a
? ? ? ? ? ? ? ?
a a b a b a b b
1 1 1 1 3 1 1 1
a b
1 1
s
r
A
B
C
r ? s
? ?
a a
? ?
32Example Queries
- Find the largest account balance
- 1. Rename account relation as d
- 2. The query is
-
?balance(account) - ?account.balance
(?account.balance lt d.balance (account x rd
(account)))
33Example Queries
- Find all customers who have an account at all
branches located in Brooklyn city.
34Extended Relational-Algebra-Operations
- Generalized Projection
- Outer Join
- Aggregate Functions
35Generalized Projection
- Extends the projection operation by allowing
arithmetic functions to be used in the projection
list. ? F1, F2, , Fn(E) - E is any relational-algebra expression
- Each of F1, F2, , Fn are are arithmetic
expressions involving constants and attributes in
the schema of E. - Given relation credit-info(customer-name, limit,
credit-balance), find how much more each person
can spend - ?customer-name, limit credit-balance
(credit-info)
36Aggregate Functions and Operations
- Aggregation function takes a collection of values
and returns a single value as a result. - avg average value min minimum value max
maximum value sum sum of values count
number of values - Aggregate operation in relational algebra
- G1, G2, , Gn g F1( A1), F2( A2),, Fn( An)
(E) - E is any relational-algebra expression
- G1, G2 , Gn is a list of attributes on which to
group (can be empty) - Each Fi is an aggregate function
- Each Ai is an attribute name
37Aggregate Operation Example
A
B
C
? ? ? ?
? ? ? ?
7 7 3 10
sum-C
g sum(c) (r)
27
38Aggregate Operation Example
- Relation account grouped by branch-name
branch-name
account-number
balance
Perryridge Perryridge Brighton Brighton Redwood
A-102 A-201 A-217 A-215 A-222
400 900 750 750 700
branch-name g sum(balance) (account)
branch-name
balance
Perryridge Brighton Redwood
1300 1500 700
39Aggregate Functions
- Result of aggregation does not have a name
- Can use rename operation to give it a name
- For convenience, we permit renaming as part of
aggregate operation
branch-name g sum(balance) as sum-balance
(account)
40Outer Join Example
- Relation loan
- Employee(ename,str,city)
- Works(ename,cname,sal)
- Company(cname,city)
- Manages(ename,mname)
41 SQLLecture 3
42 SQL
- Data Definition
- Basic Query Structure
- Set Operations
- Aggregate Functions
- Null Values
- Nested Subqueries
- Complex Queries
- Views
43Data Definition Language
Allows the specification of not only a set of
relations but also information about each
relation, including
- The schema for each relation.
- The domain of values associated with each
attribute. - Integrity constraints
- The set of indices to be maintained for each
relations. - Security and authorization information for each
relation. - The physical storage structure of each relation
on disk.
44Basic Query Structure
- A typical SQL query has the form select A1, A2,
..., An from r1, r2, ..., rm where P - order by
- group by
- having Q
- Ais represent attributes
- ris represent relations
- P is a predicate.
- This query is equivalent to the relational
algebra expression. - ?A1, A2, ..., An(?P (r1 x r2 x ... x
rm)) - The result of an SQL query is a relation.
45Set Operations
- Find all customers who have a loan, an account,
or both
(select customer-name from depositor) union (sel
ect customer-name from borrower)
- Find all customers who have both a loan and
an account.
(select customer-name from depositor) intersect
(select customer-name from borrower)
- Find all customers who have an account but no
loan.
- (select customer-name from depositor) minus
- (select customer-name from borrower)
46Aggregate Functions
- These functions operate on the multiset of values
of a column of a relation, and return a value - avg average value min minimum value max
maximum value sum sum of values count
number of values
47Null Values and Aggregates
- Total all loan amounts
- select sum (amount) from loan
- Above statement ignores null amounts
- result is null if there is no non-null amount,
that is the - All aggregate operations except count() ignore
tuples with null values on the aggregated
attributes.
48Nested Subqueries
- SQL provides a mechanism for the nesting of
subqueries. - A subquery is a select-from-where expression that
is nested within another query. - A common use of subqueries is to perform tests
for set membership, set comparisons, and set
cardinality.
49Set Comparison
- Find all branches that have greater assets than
some branch located in Brooklyn.
select distinct T.branch-name from branch as T,
branch as S where T.assets gt S.assets and
S.branch-city Brooklyn
- Same query using gt some clause
select branch-name from branch where assets gt
some (select assets from branch
where branch-city Brooklyn)
50Example Query
- Find the names of all branches that have greater
assets than all branches located in Brooklyn.
select branch-name from branch where assets gt
all (select assets from branch where
branch-city Brooklyn)
51Test for Empty Relations
- The exists construct returns the value true if
the argument subquery is nonempty. - exists r ?? r ? Ø
- not exists r ?? r Ø
52Example
Name
Sport
Yuri soccer Yuri baseball Yuri
tennis Joe football Joe soccer Jane
tennis Jim tennis Yuri tennis Jim
football
Find students that play all sports
.
(Student)
.
Student
sport
53Example
- Student(name,sport)
- Find students that play all sports
- Select distinct id
- from students S
- where not exists (
- (select distinct
sport from student) - minus
- (select distinct
sport from student T - where S.id
T.id) - Find students that are playing exactly one sport
- Select id
- from (select id, count()
- from student
- group by id
- having count() 1)
-
-
54Entity-Relationship Model
55ER Model Components
- Entity Sets
- Attributes
- Relationships
56ER Model
title
year
Name
Address
Star-in
Movies
Stars
length
genre
owns
Studios
Name
Address
57Relationship
- A relationship is a cartesian product of
- n ? 2 entities
- (e1, e2, en) e1 ? E1, e2 ? E2, , en ?
Enwhere (e1, e2, , en) is a relationship
58Degree of a Relationship Set
- Refers to number of entity sets that participate
in a relationship set. - Relationship sets that involve two entity sets
are binary (or degree two). Generally, most
relationship sets in a database system are
binary. - Relationship sets may involve more than two
entity sets. - Relationships between more than two entity sets
are rare. Most relationships are binary.
- E.g. Suppose employees of a bank may have jobs
(responsibilities) at multiple branches, with
different jobs at different branches. Then there
is a ternary relationship set between entity sets
employee, job and branch
59Courses
Enrolls
Students
Instructors
- Students Courses TAs
- Ann CS43005 Jan
- Sue CS43005 Pat
- Bob CS43005 Jan
-
60Types of Binary Relationships
Many-one
Many-many
One-one
- Representation of Many-One
Many-one E/R arrow pointing to one.
61Keys for Relationship Sets
- The combination of primary keys of the
participating entity sets forms a super key of a
relationship set. - (customer-id, account-number) is the super key of
depositor - NOTE this means a pair of entity sets can have
at most one relationship in a particular
relationship set. - E.g. if we wish to track all access-dates to each
account by each customer, we cannot assume a
relationship for each access. We can use a
multivalued attribute though - Must consider the mapping cardinality of the
relationship set when deciding the what are the
candidate keys - Need to consider semantics of relationship set in
selecting the primary key in case of more than
one candidate key
62Converting Multiway to 2-Way
- Create a new connecting E.S. to represent rows of
a relationship set. - E.g., (Joe's Bar, Bud, 2.50) for the Sells
relationship. - Many-one relationships from the connecting E.S.
to the others.
BBP
The-Bar
The-Beer
The-Price
Bars
Beers
Price
63Specialization
- within an entity set that are distinctive from
other entities in the set. - These subgroupings become lower-level entity sets
that have attributes or participate in
relationships that do not apply to the
higher-level entity set. - Depicted by a triangle component labeled ISA
(E.g. Top-down design process we designate
subgroupings customer is a person). - Attribute inheritance a lower-level entity set
inherits all the attributes and relationship
participation of the higher-level entity set to
which it is linked.
64Specialization Example
65Generalization
- A bottom-up design process combine a number of
entity sets that share the same features into a
higher-level entity set. - Specialization and generalization are simple
inversions of each other they are represented in
an E-R diagram in the same way. - The terms specialization and generalization are
used interchangeably.
66Aggregation
- Consider the ternary relationship works-on,
which we saw earlier - Suppose we want to record managers for tasks
performed by an employee at a branch
67E-R Diagram With Aggregation
68Weak Entity Sets
- An entity set that does not have a primary key is
referred to as a weak entity set. - The existence of a weak entity set depends on the
existence of a identifying entity set - it must relate to the identifying entity set via
a total, one-to-many relationship set from the
identifying to the weak entity set - Identifying relationship depicted using a double
diamond - The discriminator (or partial key) of a weak
entity set is the set of attributes that
distinguishes among all the entities of a weak
entity set. - The primary key of a weak entity set is formed by
the primary key of the strong entity set on which
the weak entity set is existence dependent, plus
the weak entity sets discriminator.
69Weak Entity Sets
- We depict a weak entity set by double rectangles.
- We underline the discriminator of a weak entity
set with a dashed line. - payment-number discriminator of the payment
entity set - Primary key for payment (loan-number,
payment-number)
70Example
student
instructor
enrolls
teaches
offering
isoffered
course
requires
71Reduction of an E-R Schema to Tables
- Primary keys allow entity sets and relationship
sets to be expressed uniformly as tables which
represent the contents of the database. - A database which conforms to an E-R diagram can
be represented by a collection of tables. - For each entity set and relationship set there is
a unique table which is assigned the name of the
corresponding entity set or relationship set. - Each table has a number of columns (generally
corresponding to attributes), which have unique
names.
72Representing Entity Sets as Tables
- A strong entity set reduces to a table with the
same attributes.
73Representing Relationship Sets as Tables
- A many-to-many relationship set is represented as
a table with columns for the primary keys of the
two participating entity sets, and any
descriptive attributes of the relationship set. - E.g. table for relationship set borrower
74Additional Rules for Translating Relationship
into Relation
If one entity set participates several times in
the relationship with different roles, its key
attributes must be listed as many times and with
different names for each role. Studies(SSN,
Name) Favorite(SSN, Name) Friends(SSN1, SSN2)
Name
SSN
subject
studies
Student
friends
favorite
75Redundancy of Tables
- Many-to-one relationship sets that are total on
the many-side can be represented by adding an
extra attribute to the many side, containing the
primary key of the one side - Example We eliminate relation Favorite and we
extend relation for Student as follows - Student(SSN, Name, Subject.name)
- If, however, the relationship is many-to-many we
cannot do that since it leads to redundancy - For example relation Studies cannot be
eliminated since otherwise we may end up with - 111-222-333 John OS
- 111-222-333 John DBMS
-
76Representing Weak Entity Sets
- A weak entity set becomes a table that includes a
column for the primary key of the identifying
strong entity set
77Representing Specialization as Tables
- Form a table for the higher level entity
- Form a table for each lower level entity set,
include primary key of higher level entity set
and local attributes table table
attributesperson name, street, city
customer name, credit-ratingemployee name,
salary - Drawback getting information about, e.g.,
employee requires accessing two tables
78Relations Corresponding to Aggregation
- To represent aggregation, create a table
containing - primary key of the aggregated relationship,
- the primary key of the associated entity set
- Any descriptive attributes
79Example
ssn
name
ISA
person
passenger
age
booked
ISA
date
departure
assigned
pilot
fhrs
gate
instantof
canfly
flight
plane
dtime
atime
F
man
model
80Example
ssn
Name
name
Plays
Player
Team
age
Uses
ISA
Has
Colors
Captain
Likes-Colors
likes
Fan
name
addr
81Relational Database Design Theory
82First Normal Form
- Domain is atomic if its elements are considered
to be indivisible units - Examples of non-atomic domains
- Set of names, composite attributes
- Identification numbers like CS101 that can be
broken up into parts - A relational schema R is in first normal form if
the domains of all attributes of R are atomic - Non-atomic values complicate storage and
encourage redundant (repeated) storage of data - Example Set of accounts stored with each
customer, and set of owners stored with each
account - We assume all relations are in first normal form
83First Normal Form
- Atomicity is actually a property of how the
elements of the domain are used. - Example Strings would normally be considered
indivisible - Suppose that students are given roll numbers
which are strings of the form CS0012 or EE1127 - If the first two characters are extracted to find
the department, the domain of roll numbers is not
atomic. - Doing so is a bad idea leads to encoding of
information in application program rather than in
the database.
84Functional Dependencies
- Constraints on the set of legal relations.
- Require that the value for a certain set of
attributes determines uniquely the value for
another set of attributes. - A functional dependency is a generalization of
the notion of a key.
85Functional Dependencies
- Let R(A1, A2, .Ak) be a relational schema X and
Y are subsets of A1, A2, Ak. We say that X-gtY, - if any two tuples that agree on X, then
they agree on Y. - Example
- Student(SSN,Name,Addr,subjectTaken,favSubject,Prof
) - SSN-gtName
- SSN-gtAddr
- subjectTaken-gtProf
- Assign(Pilot,Flight,Date,Departs)
- Pilot,Date,Departs-gtFlight
- Flight,Date-gtPilot
86Functional Dependencies
- A functional dependency X-gtY is trivial if it is
satisfied by any relation that includes
attributes from X and Y - E.g.
- customer-name, loan-number ? customer-name
- customer-name ? customer-name
- In general, ? ? ? is trivial if ? ? ?
87Closure of a Set of Functional Dependencies
- Given a set F set of functional dependencies,
there are certain other functional dependencies
that are logically implied by F. - E.g. If A ? B and B ? C, then we can infer
that A ? C - The set of all functional dependencies logically
implied by F is the closure of F. - We denote the closure of F by F.
88Closure of a Set of Functional Dependencies
- An inference axiom is a rule that states if a
relation satisfies certain FDs, it must also
satisfy certain other FDs - Set of inference rules is sound if the rules lead
only to true conclusions - Set of inference rules is complete, if it can be
used to conclude every valid FD on R - We can find all of F by applying Armstrongs
Axioms - if ? ? ?, then ? ? ?
(reflexivity) - if ? ? ?, then ? ? ? ? ?
(augmentation) - if ? ? ?, and ? ? ?, then ? ? ? (transitivity)
- These rules are
- sound and complete
89Example
- R (A, B, C, G, H, I)F A ? B A ? C CG
? H CG ? I B ? H - some members of F
- A ? H
- by transitivity from A ? B and B ? H
- AG ? I
- by augmenting A ? C with G, to get AG ? CG
and then transitivity with CG ? I
90Procedure for Computing F
- To compute the closure of a set of functional
dependencies F - F Frepeat for each functional
dependency f in F apply reflexivity and
augmentation rules on f add the resulting
functional dependencies to F for each pair of
functional dependencies f1and f2 in F if
f1 and f2 can be combined using transitivity
then add the resulting functional dependency to
Funtil F does not change any further
91Closure of Attribute Sets
- Given a set of attributes a, define the closure
of a under F (denoted by a) as the set of
attributes that are functionally determined by a
under F a ? ? is in F ? ? ? a - Algorithm to compute a, the closure of a under
F result a while (changes to result)
do for each ? ? ? in F do begin if ? ?
result then result result ? ? end
92Uses of Attribute Closure
- There are several uses of the attribute closure
algorithm - Testing for superkey
- To test if ? is a superkey, we compute ?, and
check if ? contains all attributes of R. - Testing functional dependencies
- To check if a functional dependency ? ? ? holds
(or, in other words, is in F), just check if ? ?
?. - That is, we compute ? by using attribute
closure, and then check if it contains ?. - Is a simple and cheap test, and very useful
- Computing closure of F
- For each ? ? R, we find the closure ?, and for
each S ? ?, we output a functional dependency ?
? S.
93Example of Attribute Set Closure
- R (A, B, C, G, H, I)
- F A ? B, A ? C, CG ? H, CG ? I, B ? H
- (AG)
- 1. result AG
- 2. result ABCG (A ? C and A ? B)
- 3. result ABCGH (CG ? H and CG ? AGBC)
- 4. result ABCGHI (CG ? I and CG ? AGBCH)
- Is AG a key?
- Is AG a super key?
- Does AG ? R? Is (AG) ? R
- Is any subset of AG a superkey?
- Does A ? R? Is (A) ? R
- Does G ? R? Is (G) ? R
94Extraneous Attributes
- Consider a set F of functional dependencies and
the functional dependency ? ? ? in F. - Attribute A is extraneous in ? if A ? ? and
- F logically implies (? A) ? ? or
- Attribute A ? ? is extraneous in ? if A ? ?
and the set of functional dependencies (F
? ? ?) ? ? ?(? A) logically implies F. - Example Given F A ? C, AB ? C
- B is extraneous in AB ? C because A ? C
logically implies AB? C, A ?C. - Example Given F A ? C, AB ? CD
- C is extraneous in AB ? CD since AB ? D,A ?C
implies AB ? C
95Testing if an Attribute is Extraneous
- Consider a set F of functional dependencies and
the functional dependency ? ? ? in F. - To test if attribute A ? ? is extraneous in ?
- compute (? A) using the dependencies in
- F
- 2. check that (? A) contains A if it does,
A is extraneous - To test if attribute A ? ? is extraneous in ?
- compute ? using only the dependencies in
F (F ? ? ?) ? ? ?(? A), - check that ? contains A if it does, A is
extraneous
96Canonical Cover
- Sets of functional dependencies may have
redundant dependencies that can be inferred from
the others - Eg A ? C is redundant in A ? B, B ? C,
A ? C - Parts of a functional dependency may be redundant
- E.g. A ? B, B ? C, A ? CD can be
simplified to A ? B,
B ? C, A ? D - E.g. A ? B, B ? C, AC ? D can be
simplified to A ? B,
B ? C, A ? D - A canonical cover of F is a minimal set of
functional dependencies equivalent to F, having
no redundant dependencies or redundant parts of
dependencies
97Canonical Cover(Formal Definition)
- A canonical cover for F is a set of dependencies
Fc such that - F logically implies all dependencies in Fc, and
- Fc logically implies all dependencies in F, and
- No functional dependency in Fc contains an
extraneous attribute, and - Each left side of functional dependency in Fc is
unique.
98Canonical CoverComputation
- To compute a canonical cover for Frepeat Use
the union rule to replace any dependencies in
F ?1 ? ?1 and ?1 ? ?1 with ?1 ? ?1 ?2 Find a
functional dependency ? ? ? with an extraneous
attribute either in ? or in ? If an extraneous
attribute is found, delete it from ? ? ? until F
does not change
99Example of Computing a Canonical Cover
- R (A, B, C)F A ? BC B ? C A ? B AB ?
C - Combine A ? BC and A ? B into A ? BC
- A is extraneous in AB ? C
- Set is now A ? BC, B ? C
- C is extraneous in A ? BC
- Check if A ? C is logically implied by A ? B and
the other dependencies - Yes using transitivity on A ? B and B ? C.
- The canonical cover is A ? B B ? C
100Decomposition
- All attributes of an original schema (R) must
appear in the decomposition (R1, R2) - R R1 ? R2
- Lossless-join decomposition.For all possible
relations r on schema R - r ?R1 (r) ?R2 (r)
- A decomposition of R into R1 and R2 is lossless
join if and only if at least one of the following
dependencies is in F - R1 ? R2 ? R1
- R1 ? R2 ? R2
101Normalization Using Functional Dependencies
- When we decompose a relation schema R with a set
of functional dependencies F into R1, R2,.., Rn
we want - Lossless-join decomposition Otherwise
decomposition would result in information loss. - Dependency preservation Let Fi be the set of
dependencies F that include only attributes in
Ri. - (F1 ? F2 ? ? Fn) F
- .
102Example
- R (A, B, C)F A ? B, B ? C)
- Can be decomposed in two different ways
- R1 (A, B), R2 (B, C)
- Lossless-join decomposition
- R1 ? R2 B and B ? BC
- Dependency preserving
- R1 (A, B), R2 (A, C)
- Lossless-join decomposition
- R1 ? R2 A and A ? AB
- Not dependency preserving (cannot check B ? C
without computing R1 R2)
103Testing for Dependency Preservation
- To check if a dependency ??? is preserved in a
decomposition of R into R1, R2, , Rn we apply
the following simplified test (with attribute
closure done w.r.t. F) - result ?while (changes to result) do for each
Ri in the decomposition t (result ? Ri) ?
Ri result result ? t - If result contains all attributes in ?, then the
functional dependency ? ? ? is preserved. - We apply the test on all dependencies in F to
check if a decomposition is dependency preserving - This procedure takes polynomial time, instead of
the exponential time required to compute F and
(F1 ? F2 ? ? Fn)
104Boyce-Codd Normal Form
A relation schema R is in BCNF with respect to a
set F of functional dependencies if for all
functional dependencies in F of the form ??? ?,
where ? ? R and ? ? R, at least one of the
following holds
- ?? ? ? is trivial (i.e., ? ? ?)
- ? is a superkey for R
105Example
- R (A, B, C)F A ? B B ? CKey A
- R is not in BCNF
- Decomposition R1 (A, B), R2 (B, C)
- R1 and R2 in BCNF
- Lossless-join decomposition
- Dependency preserving
106Testing for BCNF
- To check if a non-trivial dependency ???? causes
a violation of BCNF - 1. compute ? (the attribute closure of ?), and
- 2. verify that it includes all attributes of R
- Using only F is incorrect when testing a relation
in a decomposition of R - E.g. Consider R (A, B, C, D), with F A ?B, B
?C - Decompose R into R1(A,B) and R2(A,C,D)
- Neither of the dependencies in F contain only
attributes from (A,C,D) so we might be mislead
into thinking R2 satisfies BCNF. - In fact, dependency A ? C in F shows R2 is not
in BCNF.
107BCNF Decomposition Algorithm
- result Rdone falsecompute Fwhile
(not done) do if (there is a schema Ri in result
that is not in BCNF) then begin let ?? ? ?
be a nontrivial functional dependency that holds
on Ri such that ?? ? Ri is not in F, and ? ? ?
?result (result Ri ) ? (Ri ?) ? (?, ?
) end else done true - Each Ri is in BCNF, and decomposition is
lossless-join.
108BCNF and Dependency Preservation
It is not always possible to get a BCNF
decomposition that is dependency preserving
- R (A, B, C)F AB ? C C ? BTwo candidate
keys AB and AC - R is not in BCNF
- Any decomposition of R will fail to preserve
- AB ? C
109Third Normal Form
- A relation schema R is in third normal form (3NF)
if for all ? ? ? in F at least one of the
following holds - ? ? ? is trivial (i.e., ? ? ?)
- ? is a superkey for R
- Each attribute A in ? ? is contained in a
candidate key for R. - If a relation is in BCNF it is in 3NF (since in
BCNF one of the first two conditions above must
hold). - Third condition is a minimal relaxation of BCNF
to ensure dependency preservation.
110Third Normal Form
- Example
- R (A,B,C)F AB ? C, C ? B
- Two candidate keys AB and AC
- R is in 3NF
- AB ? C AB is a superkey C ? B B is contained
in a candidate key - BCNF decomposition has (AC) and (BC)
- Testing for AB ? C requires a join
111Testing for 3NF
- Use attribute closure to check for each
dependency ? ? ?, if ? is a superkey. - If ? is not a superkey, we have to verify if each
attribute in ? is contained in a candidate key of
R - this test is rather more expensive, since it
involve finding candidate keys - testing for 3NF has been shown to be NP-hard
- However, decomposition into third normal form can
be done in polynomial time
1123NF Decomposition Algorithm
- Let Fc be a canonical cover for Fi 0for
each functional dependency ? ? ? in Fc do if
none of the schemas Rj, 1 ? j ? i contains ? ?
then begin i i 1 Ri ? ?
endif none of the schemas Rj, 1 ? j ? i
contains a candidate key for R then begin i
i 1 Ri any candidate key for
R end return (R1, R2, ..., Ri) -
113Storage Hierarchy
114Storage Hierarchy
- primary storage Fastest media but volatile
(cache, main memory). - secondary storage next level in hierarchy,
non-volatile, moderately fast access time - also called on-line storage
- E.g. flash memory, magnetic disks
- tertiary storage lowest level in hierarchy,
non-volatile, slow access time - also called off-line storage
- E.g. magnetic tape, optical storage
115Magnetic Disks
- Disk controller interfaces between the computer
system and the disk drive hardware. - accepts high-level commands to read or write a
sector - initiates actions such as moving the disk arm to
the right track and actually reading or writing
the data - Computes and attaches checksums to each sector to
verify that data is read back correctly - If data is corrupted, with very high probability
stored checksum wont match recomputed checksum - Ensures successful writing by reading back sector
after writing it - Performs remapping of bad sectors
116Disk Subsystem
- Multiple disks connected to a computer system
through a controller - Controllers functionality (checksum, bad sector
remapping) often carried out by individual disks
reduces load on controller - Disk interface standards families
- ATA (AT adaptor) range of standards
- SCSI (Small Computer System Interconnect) range
of standards - Several variants of each standard (different
speeds and capabilities)
117Performance Measures of Disks
- Access time the time it takes from when a read
or write request is issued to when data transfer
begins. Consists of - Seek time time it takes to reposition the arm
over the correct track. - Average seek time is 1/2 the worst case seek
time. - 4 to 10 milliseconds on typical disks
- Rotational latency time it takes for the sector
to be accessed to appear under the head. - Average latency is 1/2 of the worst case
latency. - 4 to 11 milliseconds on typical disks
- Data-transfer rate the rate at which data can
be retrieved from or stored to the disk. - 4 to 8 MB per second is typical
- Multiple disks may share a controller, so
transfer rate that controller can handle is also
important - E.g. ATA-5 66 MB/second, SCSI-3 40 MB/s
- Fiber Channel 256 MB/s
118Storage Access
- A database file is partitioned into fixed-length
storage units called blocks. Blocks are units of
both storage allocation and data transfer. - Database system seeks to minimize the number of
block transfers between the disk and memory. We
can reduce the number of disk accesses by keeping
as many blocks as possible in main memory. - Buffer portion of main memory available to
store copies of disk blocks. - Buffer manager subsystem responsible for
allocating buffer space in main memory.
119Buffer Manager
- Programs call on the buffer manager when they
need a block from disk. - If the block is already in the buffer, the
requesting program is given the address of the
block in main memory - If the block is not in the buffer,
- the buffer manager allocates space in the buffer
for the block, replacing (throwing out) some
other block, if required, to make space for the
new block. - The block that is thrown out is written back to
disk only if it was modified since the most
recent time that it was written to/fetched from
the disk. - Once space is allocated in the buffer, the buffer
manager reads the block from the disk to the
buffer, and passes the address of the block in
main memory to requester.
120Buffer-Replacement Policies
- Most operating systems replace the block least
recently used (LRU strategy) - Idea behind LRU use past pattern of block
references as a predictor of future references - Queries have well-defined access patterns (such
as sequential scans), and a database system can
use the information in a users query to predict
future references - LRU can be a bad strategy for certain access
patterns involving repeated scans of data - e.g. when computing the join of 2 relations r
and s by a nested loops for each tuple tr of r
do for each tuple ts of s do if the
tuples tr and ts match - Mixed strategy with hints on replacement strategy
providedby the query optimizer is preferable
121Buffer-Replacement Policies
- Pinned block memory block that is not allowed
to be written back to disk. - Toss-immediate strategy frees the space
occupied by a block as soon as the final tuple of
that block has been processed - Most recently used (MRU) strategy system must
pin the block currently being processed. After
the final tuple of that block has been processed,
the block is unpinned, and it becomes the most
recently used block. - Buffer manager can use statistical information
regarding the probability that a request will
reference a particular relation - E.g., the data dictionary is frequently accessed.
Heuristic keep data-dictionary blocks in main
memory buffer - Buffer managers also support forced output of
blocks for the purpose of recovery
122File Organization
- The database is stored as a collection of files.
Each file is a sequence of records. A record is
a sequence of fields. - One approach
- assume record size is fixed
- each file has records of one particular type only
- different files are used for different relations
- This case is easiest to implement will consider
variable length records later.
123Fixed-Length Records
- Simple approach
- Store record i starting from byte n ? (i 1),
where n is the size of each record. - Record access is simple but records may cross
blocks - Modification do not allow records to cross block
boundaries - Deletion of record I alternatives
- move records i 1, . . ., n to i, . . . , n 1
- move record n to i
- do not move records, but link all free records
on afree list
124Variable-Length Records Slotted Page Structure
- Slotted page header contains
- number of record entries
- end of free space in the block
- location and size of each record
- Records can be moved around within a page to keep
them contiguous with no empty space between them
entry in the header must be updated. - Pointers should not point directly to record
instead they should point to the entry for the
record in header.
125Organization of Records in Files
- Heap a record can be placed anywhere in the
file where there is space - Sequential store records in sequential order,
based on the value of the search key of each
record - Hashing a hash function computed on some
attribute of each record the result specifies in
which block of the file the record should be
placed - Records of each relation may be stored in a
separate file. In a clustering file organization
records of several different relations can be
stored in the same file - Motivation store related records on the same
block to minimize I/O
126Sequential File Organization
- Suitable for applications that require sequential
processing of the entire file - The records in the file are ordered by a
search-key
127Clustering File Organization
- Simple file structure stores each relation in a
separate file - Can instead store several relations in one file
using a clustering file organization - E.g., clustering organization of customer and
depositor - l scan using a secondary index is expensive
- each record access may fetch a new block from
disk - an entry was deleted from their parent)
- Root node then had only one child, and was
deleted and its child became the new root node