Title: Physical Database Design
1Physical Database Design
- University of California, Berkeley
- School of Information
- IS 257 Database Management
2Lecture Outline
- Review
- Normalization
- Physical Database Design
- Access Methods
3Lecture Outline
- Review
- Normalization
- Physical Database Design
- Access Methods
4Database Design Process
Application 1
Application 2
Application 3
Application 4
External Model
External Model
External Model
External Model
Application 1
Conceptual requirements
Application 2
Conceptual Model
Logical Model
Conceptual requirements
Internal Model
Application 3
Conceptual requirements
Application 4
Conceptual requirements
5Normalization
- Normalization theory is based on the observation
that relations with certain properties are more
effective in inserting, updating and deleting
data than other sets of relations containing the
same data - Normalization is a multi-step process beginning
with an unnormalized relation - Hospital example from Atre, S. Data Base
Structured Techniques for Design, Performance,
and Management.
6Normal Forms
- First Normal Form (1NF)
- Second Normal Form (2NF)
- Third Normal Form (3NF)
- Boyce-Codd Normal Form (BCNF)
- Fourth Normal Form (4NF)
- Fifth Normal Form (5NF)
7Normalization
8Unnormalized Relations
- First step in normalization is to convert the
data into a two-dimensional table - In unnormalized relations data can repeat within
a column
9Unnormalized Relation
10First Normal Form
- To move to First Normal Form a relation must
contain only atomic values at each row and
column. - No repeating groups
- A column or set of columns is called a Candidate
Key when its values can uniquely identify the row
in the relation.
11First Normal Form
12Second Normal Form
- A relation is said to be in Second Normal Form
when every nonkey attribute is fully functionally
dependent on the primary key. - That is, every nonkey attribute needs the full
primary key for unique identification
13Second Normal Form
14Second Normal Form
15Second Normal Form
16Third Normal Form
- A relation is said to be in Third Normal Form if
there is no transitive functional dependency
between nonkey attributes - When one nonkey attribute can be determined with
one or more nonkey attributes there is said to be
a transitive functional dependency. - The side effect column in the Surgery table is
determined by the drug administered - Side effect is transitively functionally
dependent on drug so Surgery is not 3NF
17Third Normal Form
18Third Normal Form
19Boyce-Codd Normal Form
- Most 3NF relations are also BCNF relations.
- A 3NF relation is NOT in BCNF if
- Candidate keys in the relation are composite keys
(they are not single attributes) - There is more than one candidate key in the
relation, and - The keys are not disjoint, that is, some
attributes in the keys are common
20BCNF Relations
21Fourth Normal Form
- Any relation is in Fourth Normal Form if it is
BCNF and any multivalued dependencies are trivial - Eliminate non-trivial multivalued dependencies by
projecting into simpler tables
22Fifth Normal Form
- A relation is in 5NF if every join dependency in
the relation is implied by the keys of the
relation - Implies that relations that have been decomposed
in previous NF can be recombined via natural
joins to recreate the original relation.
23Normalization
- Normalization is performed to reduce or eliminate
Insertion, Deletion or Update anomalies. - However, a completely normalized database may not
be the most efficient or effective
implementation. - Denormalization is sometimes used to improve
efficiency.
24Denormalization
- Usually driven by the need to improve query speed
- Query speed is improved at the expense of more
complex or problematic DML (Data manipulation
language) for updates, deletions and insertions.
25Downward Denormalization
26Upward Denormalization
27Using RDBMS to help normalize
- Example database Cookie
- Database of books, libraries, publisher and
holding information for a shared (union) catalog
28Cookie relationships
29Cookie BIBFILE relation
30How to Normalize?
- Currently no way to have multiple authors for a
given book, and there is duplicate data spread
over the BIBFILE table - Can we use the DBMS to help us normalize?
- Access example
31Lecture Outline
- Review
- Normalization
- Using Relational DBs in normalization
- Physical Database Design
- Access Methods
32Database Design Process
Application 1
Application 2
Application 3
Application 4
External Model
External Model
External Model
External Model
Application 1
Conceptual requirements
Application 2
Conceptual Model
Logical Model
Conceptual requirements
Internal Model
Application 3
Conceptual requirements
Application 4
Conceptual requirements
PhysicalDesign
33Physical Database Design
- Many physical database design decisions are
implicit in the technology adopted - Also, organizations may have standards or an
information architecture that specifies
operating systems, DBMS, and data access
languages -- thus constraining the range of
possible physical implementations. - We will be concerned with some of the possible
physical implementation issues
34Physical Database Design
- The primary goal of physical database design is
data processing efficiency - We will concentrate on choices often available to
optimize performance of database services - Physical Database Design requires information
gathered during earlier stages of the design
process
35Physical Design Information
- Information needed for physical file and database
design includes - Normalized relations plus size estimates for them
- Definitions of each attribute
- Descriptions of where and when data are used
- entered, retrieved, deleted, updated, and how
often - Expectations and requirements for response time,
and data security, backup, recovery, retention
and integrity - Descriptions of the technologies used to
implement the database
36Physical Design Decisions
- There are several critical decisions that will
affect the integrity and performance of the
system - Storage Format
- Physical record composition
- Data arrangement
- Indexes
- Query optimization and performance tuning
37Storage Format
- Choosing the storage format of each field
(attribute). The DBMS provides some set of data
types that can be used for the physical storage
of fields in the database - Data Type (format) is chosen to minimize storage
space and maximize data integrity
38Objectives of data type selection
- Minimize storage space
- Represent all possible values
- Improve data integrity
- Support all data manipulations
- The correct data type should, in minimal space,
represent every possible value (but eliminate
illegal values) for the associated attribute and
can support the required data manipulations (e.g.
numerical or string operations)
39Access Data Types
- Numeric (1, 2, 4, 8 bytes, fixed or float)
- Text (255 max)
- Memo (64000 max)
- Date/Time (8 bytes)
- Currency (8 bytes, 15 digits 4 digits decimal)
- Autonumber (4 bytes)
- Yes/No (1 bit)
- OLE (limited only by disk space)
- Hyperlinks (up to 64000 chars)
40Access Numeric types
- Byte
- Stores numbers from 0 to 255 (no fractions). 1
byte - Integer
- Stores numbers from 32,768 to 32,767 (no
fractions) 2 bytes - Long Integer (Default)
- Stores numbers from 2,147,483,648 to
2,147,483,647 (no fractions). 4 bytes - Single
- Stores numbers from -3.402823E38 to 1.401298E45
for negative values and from 1.401298E45 to
3.402823E38 for positive values. 4 bytes - Double
- Stores numbers from 1.79769313486231E308 to
4.94065645841247E324 for negative values and
from 1.79769313486231E308 to 4.94065645841247E324
for positive values. 15 8 bytes - Replication ID
- Globally unique identifier (GUID) N/A 16 bytes
41Controlling Data Integrity
- Default values
- Range control
- Null value control
- Referential integrity (next time)
- Handling missing data
42Designing Physical Records
- A physical record is a group of fields stored in
adjacent memory locations and retrieved together
as a unit - Fixed Length and variable fields
43Designing Physical/Internal Model
- Overview
- terminology
- Access methods
44Physical Design
- Internal Model/Physical Model
User request
Interface 1
Interface 2
Operating System Access Methods
Interface 3
Data Base
45Physical Design
- Interface 1 User request to the DBMS. The user
presents a query, the DBMS determines which
physical DBs are needed to resolve the query - Interface 2 The DBMS uses an internal model
access method to access the data stored in a
logical database. - Interface 3 The internal model access methods
and OS access methods access the physical
records of the database.
46Physical File Design
- A Physical file is a portion of secondary storage
(disk space) allocated for the purpose of storing
physical records - Pointers - a field of data that can be used to
locate a related field or record of data - Access Methods - An operating system algorithm
for storing and locating data in secondary
storage - Pages - The amount of data read or written in one
disk input or output operation
47Internal Model Access Methods
- Many types of access methods
- Physical Sequential
- Indexed Sequential
- Indexed Random
- Inverted
- Direct
- Hashed
- Differences in
- Access Efficiency
- Storage Efficiency
48Physical Sequential
- Key values of the physical records are in logical
sequence - Main use is for dump and restore
- Access method may be used for storage as well as
retrieval - Storage Efficiency is near 100
- Access Efficiency is poor (unless fixed size
physical records)
49Indexed Sequential
- Key values of the physical records are in logical
sequence - Access method may be used for storage and
retrieval - Index of key values is maintained with entries
for the highest key values per block(s) - Access Efficiency depends on the levels of index,
storage allocated for index, number of database
records, and amount of overflow - Storage Efficiency depends on size of index and
volatility of database
50Index Sequential
51Indexed Sequential Two Levels
52Indexed Random
- Key values of the physical records are not
necessarily in logical sequence - Index may be stored and accessed with Indexed
Sequential Access Method - Index has an entry for every data base record.
These are in ascending order. The index keys are
in logical sequence. Database records are not
necessarily in ascending sequence. - Access method may be used for storage and
retrieval
53Indexed Random
54Btree
55Inverted
- Key values of the physical records are not
necessarily in logical sequence - Access Method is better used for retrieval
- An index for every field to be inverted may be
built - Access efficiency depends on number of database
records, levels of index, and storage allocated
for index
56Inverted
57Direct
- Key values of the physical records are not
necessarily in logical sequence - There is a one-to-one correspondence between a
record key and the physical address of the record - May be used for storage and retrieval
- Access efficiency always 1
- Storage efficiency depends on density of keys
- No duplicate keys permitted
58Hashing
- Key values of the physical records are not
necessarily in logical sequence - Many key values may share the same physical
address (block) - May be used for storage and retrieval
- Access efficiency depends on distribution of
keys, algorithm for key transformation and space
allocated - Storage efficiency depends on distibution of keys
and algorithm used for key transformation
59Comparative Access Methods
Indexed No wasted space for data but extra space
for index Moderately Fast Moderately Fast Very
fast with multiple indexes OK if dynamic OK if
dynamic Easy but requires Maintenance of indexes
60Database Creation in Access
- Simplest to use a design view
- wizards are available, but less flexible
- Need to watch the default values
- Helps to know what the primary key is, or if one
is to be created automatically - Automatic creation is more complex in other RDBMS
and ORDBMS - Need to make decision about the physical storage
of the data
61Database Creation in Access
62Next Time
- Indexes and when to index
- Integrity Constraints
- Referential Integrity