Title: Database Design and The Entity-Relationship Model
1Database Design andThe Entity-Relationship Model
- Lecture 21
- RG - Chapter 2
A relationship, I think, is like a shark, you
know? It has to constantly move forward or it
dies. And I think what we got on our hands is a
dead shark. Woody Allen (from Annie Hall, 1979)
2Administriva
- Homework 4 Due One Week From Today
- Midterm is graded
3Exam Summary
- Average 74.5
- Max 93.5
- Min 36.5
- Std Dev 11.6
4Exam Details
- Section Low High / Total (Average)
- 1 External Sorting 10 24 / 24 (18)
- 2 Query Execution 17 43 / 44 (36)
- 3 Query Optim. 6.5 29 / 32 (20)
5Section 1 External Sorting
- For all questions in this section, assume that
you have 96 pages in the buffer pool, and you
wish to sort EMP by Ename. - Using the general external sort, how many sorting
passes are required? (6pts) - passes 1 Ceil(log(B-1)(N/B))
- 1 Ceil(log(95)(10,000/96))
- 1 2
- 3
- What is the I/O cost of doing this sort? (6pts)
- I/O Cost 2N ( passes)
- 20,000 3
6Section 1 (cont)
- Database systems sometimes use blocked I/O
because reading a block of continuous pages is
more efficient that doing a separate I/O for each
page. Assume that in our computer system, it is
much more efficient to read and write blocks of
32 pages at a time, so all reads and writes from
files must be in blocks of 32 pages (and if a
file has less than 32 pages, it is padded with
blank pages). Consider doing external sort with
blocked I/O, which gets faster I/O at the expense
of more sorting passes. If the database must read
and write 32 pages at a time, how many sorting
passes are required? Show either the formula you
use, or the temp file sizes after each pass.
(4pts) -
- Pass 0 105 runs of 96 pages
- Pass 1 2-way merge 96 96 -gt result is 53 runs
of 192 pages - Pass 2 2-way merge of 192 192 -gt result is 27
runs of 384 pages - Pass 3 -gt result is 14 runs of 768 pages,
- Pass 4 -gt result is 7 runs 1536 pages,
- Pass 5 -gt result is 4 runs 3072 pages,
- Pass 6 -gt result is 2 runs of 6144,
- Pass 7 -gt result is 10,000
- 8 passes total
7Section 1 (cont)
- Blocked I/O is used because reading a 32-page
block is faster than 32 separate 1-page I/Os.
For each sorting pass, you still must read and
write every page in the file, but instead of
doing 10,000 1-page I/Os, instead you do
(10,000/32) block I/Os. Assume that in our
system, we can read a 32-page block in the time
it would normally take to do 8 single-page I/Os.
Is the blocked I/O sort faster or slower than a
regular sort from question 2? By approximately
what ratio? (4pts) - Assume that instead of a heap file, the records
from EMP are stored a clustered B-Tree index,
whose key is Ename, using alternative 1 (i.e.,
the full data records are stored in the leaves of
the tree). The B-tree has depth 3. Assuming the
B-Tree already exists, what is the approximate
I/O cost to use the B-tree to get the records in
sorted order? (4pts)
8Section 2 Relational Operators
- Consider the operation ? (EID lt 5000)EMP
- What is the I/O cost of this operation if there
is no index? (4pts) - Lecture 15 slide 9 no index -gt sequential scan,
N I/Os, meaning 10,000 I/Os - What is the reduction factor? (4pts)
- 5000/100,000 0.05
- Assume that instead of a heap file, the records
from EMP are stored in a clustered B-Tree index,
whose key is Ename, using alternative 1 (i.e.,
the full data records are stored in the leaves of
the tree). The B-tree has depth 3. Assuming
the B-Tree already exists, what is the I/O cost
of this selection operation? (4pts) - Due to a copy paste error on my part, the B-Tree
Index is not useful for the query. I gave
credit either way, whether people tried to use
the index as an index, or for sequential scan. - In considering the cost of using the index,
either as an index or for a scan, full credit
required knowing that B-Trees have an overhead of
approximately 50, i.e., there are 50 more
leaves than the number of pages in a heap file.
9Relational Operators (cont)
- Consider the query select distinct Ename from
EMP order by Ename - If you had 96 pages of memory, what algorithm
would you use to execute this query and why?
(4pts) - Order by suggests sort-merge to remove dups.
- If, instead, the query were select distinct
Ename from EMP and you had 101 pages of memory,
what algorithm would you use and why? (4pts) - Could use hashing to remove dups, which would
have a similar I/O cost but be more CPU efficient.
10Relational Operators (cont)
- Consider the join In_Dept ? Dept
- Assuming 100 pages of memory, what is the I/O
cost of this join using Blocked Nested Loops?
(4pts) - Lec 15, slide 27 Blocked nested loops M
Ceil(M/(B-2)) N - 550 Ceil(550/100)) 20 550 6 20 670
- What is the I/O cost of this using Index Nested
Loops, with an unclustered Hash index on
Dept.DID? Assume an average cost of 1.2 I/Os to
get to the right hash bucket. (4pts) - Lec 15, slide 27 index nested loops M tuples
in M index cost - 550 110,000 (1.2 1) 242,000
- Consider the join Dept ? In_Dept. Assuming 100
pages of memory, what is the I/O cost of this
using Sort-Merge-Join, optimized so that the last
merge pass is combined with the join if possible?
(4pts) - Best 1 pass over In_Dept, to make 6 runs, sort
Dept in memory, merge. Cost 5503 20 1670 - By the book 3(M N) 3570 1710
- Assuming 100 pages of memory, what is the I/O
cost of this using regular (not hybrid) hash
join? (4pts) - Best partition In_Dept (2 550), hash Dept in
memory, match. Cost 550 3 20 1670 - By the book 3(M N) 1710
11Relational Operators (cont)
- Consider the join (? (EID lt 100)EMP) ?
In_Building Assume there is no index on EMP, and
an unclustered hash index on In_Building.EID, and
plan operators are pipelined where possible. You
have 100 pages of memory. Assume that it takes
on average 1.2 I/Os get reach the appropriate
hash bucket. - What is the I/O cost of this using Blocked Nested
Loops? (4pts) - M Ceil(M/(B-2)) N
- If you do not push select down10,000
Ceil(10,000/(100-2)) 550 10,000 103550
66,650 - If you do push select, pipeline result Size of
selection in pages is S (I/100,000)10,000
I/10 10 - Join cost 10,000 Ceil(10/(100-2)) (550)
10,550 - What is the I/O cost of this using Index Nested
Loops? (4pts) - If you do push select, pipeline result Size of
selection in pages is - S (I/100,000)10,000 I/10 10
- each EID will have 1.1 matching tuples.
- Join cost M ( tuples index cost)
- 10,000 100 (1.2 1.1) 10,230
12Section 3 Query Optimization
- There are several parts of an SQL query that can
cause a column (or columns) to be considered an
"interesting order", meaning that an ordered
input can lower the cost of evaluating that part
of the query. Briefly describe three of these,
and for each one briefly explain why. (6 pts) - Order By, Group By, Joins, Distinct, some
aggregations (e.g. Max, Min) - Write the reduction factor that would be used by
a System R style query optimizer for each of the
following predicates (6pts) - Building.BID gt 150
- BID ranges from 1 to 200, so this would be ¼.
- EMP.Ename Joe
- We dont know about distribution of names, so
1/10. - IN_DEPT.EID 003
- Since there are 100,000 employees, this is
1/100,000 - Given the following query, where X is the join
operator - p(EMP.Ename) ?(Dept.Budget gt 500000) (Emp X
In_Dept X Dept) - Mark whether each of the following queries are
equivalent (True/False). - __T___p(EMP.Ename) ?(Dept.Budget gt 500000) (Emp X
Dept X In_Dept) - __T___p(EMP.Ename) ?(Dept.Budget gt 500000) (Dept
X In_Dept X Emp) - __T___p(EMP.Ename) (Emp X In_Dept X ?(Dept.Budget
gt 500000)(Dept)) - __F___?(Dept.Budget gt 500000) (p(EMP.Ename) (Emp)
X In_Dept X Dept) -
13Query Optimization (cont)
- Given the schema on the first page of the exam,
assume the following - There are unclustered hash indexes on both
Emp.EID and Dept.DID. - There are clustered B-Tree Indexes (alt 1 data
records store in the B-Tree leaves) on both
Emp.EID and Dept.DID. - There is an unclustered Btree index on
(Emp.Salary, Emp.EID) - You can assume that Btree-Indexes have heights of
3. And the cost for getting to the data record
using a hash index is 2. - What is the best access plan for the following
query (4pts). - SELECT E.eid, E.Salary
- FROM Emp E
- WHERE E.Salary gt 100,000
- Very best is unclustered B-tree, index only plan
-
- Draw all possible join orders considered by a
System-R style query optimizer for the following
query. (4pts) - SELECT E.eid, D.dname
- FROM Emp E, In_Dept I, Dept D
- WHERE E.sal 64,000 AND D.budget gt 500,000 AND
- E.eid I.eid AND I.did D.did
- (E X I) X D)
- (I X E) X D)
14Where are we?
- This week Database Design (me)
- The ER Model as a design tool
- Next Week Concurrency Control Recovery (Prof.
Roth) - Week after More Database Design (me)
15Today and Thursday The ER Model
- A different data model from Relational
- Most commonly used for database design
- Today Details of the ER Model
- Thursday Translating ER Schemas to Relational
16Databases Model the Real World
- Data Model translates real world things into
structures computers can store - Many models
- Relational, E-R, O-O, Network, Hierarchical, etc.
- Relational
- Rows Columns
- Keys Foreign Keys to link Relations
Enrolled
Students
17Database Design
- The process of modelling things in the real world
into elements of a data model. - I.E., describing things in the real world using a
data model. - E.G., describing students and enrollments using
various tables with key/foreign key relationships - The Relational model is not the only model in use
18A Problem with the Relational Model
CREATE TABLE Students (sid CHAR(20), name
CHAR(20), login CHAR(10), age INTEGER, gpa
FLOAT)
CREATE TABLE Enrolled (sid CHAR(20), cid
CHAR(20), grade CHAR(2))
- With complicated schemas, it may be hard for a
person to understand the structure from the data
definition.
Enrolled
Students
19One Solution The E-R Model
- Instead of relations, it has
- Entities and Relationships
- These are described with diagrams,
- both structure, notation more obvious to humans
- A visual language for describing schemas
20Steps in Database Design
- Requirements Analysis
- user needs what must database do?
- Conceptual Design
- high level descr (often done w/ER model)
- Logical Design
- translate ER into DBMS data model
- Schema Refinement (in 2 weeks)
- consistency, normalization
- Physical Design (discussed already)
- indexes, disk layout
- Security Design
- who accesses what, and how
21Conceptual Design
- Define enterprise entities and relationships
- What information about entities and relationships
should be in database? - What are the integrity constraints or business
rules that hold? - A database schema in the ER Model is
represented pictorially (ER diagrams). - Can map an ER diagram into a relational schema.
22ER Model Basics
- Entity
- Real-world thing, distinguishable from other
objects. - Entity described by set of attributes.
- Entity Set A collection of similar entities.
E.g., all employees. - All entities in an entity set have the same set
of attributes. (Until we consider hierarchies,
anyway!) - Each entity set has a key (underlined).
- Each attribute has a domain.
23ER Model Basics (Contd.)
- Relationship Association among two or more
entities. E.g., Attishoo works in Pharmacy
department. - relationships can have their own attributes.
- Relationship Set Collection of similar
relationships. - An n-ary relationship set R relates n entity sets
E1 ... En each relationship in R involves
entities e1 ? E1, ..., en ? En
24ER Model Basics (Cont.)
subor-dinate
super-visor
- Same entity set can participate in different
relationship sets, or in different roles in the
same set.
25Key Constraints
- An employee can work in many departments a dept
can have many employees. -
In contrast, each dept has at most one manager,
according to the key constraint on Manages.
26Participation Constraints
- Does every employee work in a department?
- If so, this is a participation constraint
- the participation of Employees in Works_In is
said to be total (vs. partial) - What if every department has an employee working
in it? - Basically means at least one
since
since
name
dname
name
dname
ssn
lot
budget
did
budget
did
Departments
Employees
Manages
Works_In
since
27Weak Entities
- A weak entity can be identified uniquely only by
considering the primary key of another (owner)
entity. - Owner entity set and weak entity set must
participate in a one-to-many relationship set
(one owner, many weak entities). - Weak entity set must have total participation in
this identifying relationship set.
Weak entities have only a partial key (dashed
underline)
28Binary vs. Ternary Relationships
- If each policy is owned by just 1 employee
Bad design
Key constraint on Policies would mean policy
can only cover 1 dependent!
- Think through all the constraints in the 2nd
diagram!
29Binary vs. Ternary Relationships (Contd.)
- Previous example illustrated case when two binary
relationships were better than one ternary
relationship. - Opposite example a ternary relation Contracts
relates entity sets Parts, Departments and
Suppliers, and has descriptive attribute qty. No
combination of binary relationships is an
adequate substitute.
30Binary vs. Ternary Relationships (Contd.)
qty
Departments
Parts
Contract
VS.
Suppliers
Parts
Departments
needs
can-supply
deals-with
Suppliers
- S can-supply P, D needs P, and D
deals-with S does not imply that D has agreed
to buy P from S. - How do we record qty?
31Summary so far
- Entities and Entity Set (boxes)
- Relationships and Relationship sets (diamonds)
- binary
- n-ary
- Key constraints (1-1,1-M, M-M, arrows on 1 side)
- Participation constraints (bold for Total)
- Weak entities - require strong entity for key
- Next, a couple more advanced concepts