Title: Physical Database Design
1Physical Database Design
- Chapter 7
- MIS 2403
- Dr. Segall
- Fall 2001
2Introduction
- The purpose of physical database design is to
translate the logical description of data into
the technical specifications for storing and
retrieving data - The goal is to create a design for storing data
that will provide adequate performance, and
insure database integrity, security, and
recoverability
3Inputs to Physical Design
- Normalized relations
- Attribute definitions
- Estimations of data processing volume
- Descriptions of where and when data are entered,
retrieved, deleted, and updated - Response time expectations/requirements
- Requirements for data security, backup, recovery,
retention, and integrity - Characteristics of the DBMS to be used
4Data Volume andUsage Analysis
- Often the first step in the physical design
process - Relative numbers are what are most important(to
identify areas requiring attention to
performance) - Figures may be recorded and displayed using an
annotated entity-relationship diagram (Fig. 7-1) - Data volumes reflect number of records in tables
- Access frequencies reflect number of table record
accesses per unit of time - Note what attributes are used in table
accesses(to aid design of table indexes) - Note of creations/retrievals/updates/deletions
(to aid selection of file organization technique)
5Physical Design Decisions
- Specify the data type for each attribute from the
logical data model - Specify physical records by grouping attributes
from the logical data model - Specify the file organization technique to use
for physical storage of data records - Specify indexes to optimize data retrieval
- Specify query optimization strategies
6Designing Fields
- Choosing the field data type
- Select from available types such as text, memo,
number, date/time, currency, etc. - Seek to
- Minimize storage space
- e.g., Integer vs. Floating Point
- Represent all possible values
- e.g., Floating Point vs. Integer
- Improve data integrity (more on next slide)
- e.g., Yes/No
- Support all data manipulations
- e.g., Date/Time
7Designing Fields
- Controlling data integrity
- Default value
- e.g., value FL for State field
- Range control
- e.g., value lt100 for Test_Score field
- Null value control
- e.g., prohibit leaving Date_of_Birth field blank
- Referential integrity
- e.g., restrict valid values for Part_No field in
Order table to the contents of this field in the
Part table
8Designing Fields
- Coding
- e.g., J(onesboro), B(eebe), M(outain Home) for
the ASU branch campuses - implement by creating a look-up table
- there is a trade-off in that you must create and
store a second table and you must access this
table to look up the code value - consider using when a field has a limited number
of possible values, each of which occupies a
relatively large amount of space, and the number
of records is large and/or the number of record
accesses is small.
(see next slide also)
9Example Code Look-up Table (Figure 7-2)
10Designing Fields
- Fixed-Length Fields
- Make it easy to locate a specific record in a
file and/or a specific field in that record - Each field has its maximum length specified and
unused space in any given field is padded with
spaces (text) or leading zeros (numeric) - Variable-Length Fields
- When the need arises for a variable-length field
(e.g., a memo field), this field can be stored
separate from the rest of the record with a
pointer used to locate it when needed
11Physical Records
- Physical Record A group of fields stored in
adjacent memory locations and retrieved together
as a unit. - Page The amount of data read or written in one
secondary memory (disk) input or output
operation. - Blocking Factor The number of physical records
per page.
12Database Access Model
The goal in structuring physical records is to
minimize performance bottlenecks resulting from
disk accesses (accessing data from disk is slow
compared to main memory)
13Optimization Decisions
- Denormalization
- Partitioning
- Selection of File Organization
- Clustering Files
- Creation of Indexes
- RAID
- Other
14Denormalization
- The process of transforming normalized relations
into unnormalized physical record specifications
for the purpose of improv-ing overall database
performance.
15Denormalization
- Involves a trade-off
- Reduced disk accesses and greater performance
(due, for example, to fewer table joins) - - But -
- Introduction of anomalies (and thus redundancies)
that will necessitate extra data maintenance
16Partitioning
- Horizontal Partitioning Distributing the rows
of a table into two or more separate files - e.g., Customer table is partitioned into four
separate files, one for each geographical region - Vertical Partitioning Distributing the columns
of a table into two or more separate files - e.g., Employee table is partitioned into public
file (name, office, extension, etc.) and private
file (salary, health history, etc.) - Note the primary key is repeated in each file
17Partitioning
- Advantages of Partitioning
- Records used together are grouped together
- Each partition can be optimized for performance
- Security and recovery
- Partitions stored on different disks less
contention - Parallel processing capability
- Disadvantages of Partitioning
- Slower retrievals when across partitions
- Complexity for application programmers
- Anomalies and extra storage space requirements
due to duplication of data across partitions
18Physical Files
- Physical File A file as stored on disk
- Constructs to link two pieces of data
- Sequential storage
- Pointers
- File Organization How the files are arranged on
the disk (more on this later) - Access Method How the data can be retrieved
based on the file organization - Relative - data accessed as an offset from the
most recently referenced point in secondary
memory - Direct - data accessed as a result of a
calculation to generate the beginning address of
a record
19File Organizations
- A technique for physically arranging the records
of a file on secondary storage devices. - Goals in selecting (trade-offs exist, of
course) - Fast data retrieval
- High throughput for input and maintenance
- Efficient use of storage space
- Protection from failures or data loss
- Minimal need for reorganization
- Accommodation for growth
- Security from unauthorized use
20File Organizations
- Sequential
- Indexed
- Indexed Sequential
- Indexed Nonsequential
- Hashed (also called Direct)
- See Table 7-3 for comparison
21Sequential File Organization
- Records of the file are stored in sequence by the
primary key field values
22Comparisons of file organizations (a) Sequential
23Sequential Retrieval
- Consider a file of 10,000 records each occupying
1 page - Queries that require processing all records will
require 10,000 accesses - e.g., Find all items of type 'E'
- Many disk accesses are wasted if few records meet
the condition - However, very effective if most or all records
will be accessed (e.g., payroll)
24Indexed File Organization
- Index concept is like index in a book
- Indexed-sequential file organization The
records are stored sequentially by primary key
values and there is an index built on the primary
key field (and possibly indexes built on other
fields, also)
25(b) Indexed
26Hashed File Organization
- Hashing Algorithm Converts a primary key value
into a record address - Division-remainder method is common hashing
algorithm(more to come on this)
27(c ) Hashed
28Hashing
- A technique for reducing disk accesses for direct
access - Avoids an index
- Number of accesses per record can be close to one
- The hash field is converted to a hash address by
a hash function
29Hashing
30Shortcomings of Hashing
- Different hash fields may convert to the same
hash address - these are called Synonyms
- store the colliding record in an overflow area
- Long synonym chains degrade performance
- There can be only one hash field per record
- The file can no longer be processed sequentially
31Clustering
- In some relational DBMS, related records from
different tables that are often retrieved
together can be stored close together on disk - Since the related records are stored close to one
another on the physical disk, less time is needed
to retrieve the data - E.g., Customer data and Order data may frequently
be retrieved together - Can require substantial maintenance if the
clustered data changes frequently
32Indexing
- An index is a table file that is used to
determine the location of rows in another file
that satisfy some condition
33Querying with an Index
- Read the index into memory
- Search the index to find records meeting the
condition - Access only those records containing required
data - Disk accesses are substantially reduced when the
query involves few records
34Maintaining an Index
- Adding a record requires at least two disk
accesses - Update the file
- Update the index
- Trade-off
- Faster queries
- Slower maintenance (additions, deletions, and
updates of records) - Thus, more static databases benefit more overall
35Rules of Thumbfor Using Indexes
- 1. Indexes are most useful on larger tables
- 2. Index the primary key of each table(may be
automatic, as in Access) - 3. Indexes are useful on search fields (WHERE)
- 4. Indexes are also useful on fields used for
sorting (ORDER BY) and categorizing (GROUP BY) - 5. Most useful to index on a field when there are
many different values for that field
36Rules of Thumbfor Using Indexes
- 6. Find out the limits placed on indexing by your
DBMS (Access allows 32 indexes per table, and no
index may contain more than 10 fields) - 7. Depending on the DBMS, null values may not be
referenced from an index (thus, rows with a null
value in the field that is indexed may not be
found by a search using the index)
37RAID
- Redundant Arrays of Inexpensive Disks
- Exploits economies of scale of disk manufacturing
for the computer market - Can give greater security
- Increases fault tolerance of systems
- Not a replacement for regular backup
38RAID
- The operating system sees a set of physical
drives as one logical drive - Data are distributed across physical drives
- All levels, except 0, have data redundancy or
error-correction features - Parity codes or redundant data are used for data
recovery
39Mirroring
- Write
- Identical copies of file are written to each
drive in array - Read
- Alternate pages are read simultaneously from each
drive - Pages put together in memory
- Access time is reduced by approximately the
number of disks in the array - Read error
- Read required page from another drive
- Tradeoffs
- Provides data security
- Reduces access time
- Uses more disk space
40Mirroring
Complete Data Set
Complete Data Set
No parity
41Striping
- Three drive model
- Write
- Half of file to first drive
- Half of file to second drive
- Parity bit to third drive
- Read
- Portions from each drive are put together in
memory - Read error
- Lost bits are reconstructed from third drives
parity data - Tradeoffs
- Provides data security
- Uses less storage space than mirroring
- Not as fast as mirroring
42Striping
One-Half Data Set
One-Half Data Set
Parity Codes
43One Other Rule of Thumbfor Increasing Performance
- Consider contriving a shorter field or selecting
another candidate key to substitute for a long,
multi-field primary key (and all associated
foreign keys)
44Database Architectures
- See pages 282-285
- Hierarchical
- Network
- Relational
- Object-oriented
- Multidimensional
45Physical Database Design