File Organization and Indices

About This Presentation

Title:

File Organization and Indices

Description:

Maintaining data records in order involves high overhead on insertions and deletions. ... keys are referred to as composite search keys or concatenated keys. ... – PowerPoint PPT presentation

Number of Views:25

Avg rating:3.0/5.0

Slides: 20

Provided by: johne78

Category:

more less

Transcript and Presenter's Notes

Title: File Organization and Indices

1
File Organization and Indices

External Storage
Introduction
Indices
Clustered and unclustered indices
Dense and sparse indices
Composite indices
File Organization Comparision
Cost model
Heap files
Sorted files
Clustered B tree files
Heap file with unclustered tree index
Heap file with unclustered hash index

2
External Storage

Databases store a large amount of data
The data must be persistent.
It is stored on disk (or tape).
Accessing data from disk is costly
Data on disk is read or written in units of a
page, usually 4 or 8 KB.
e.g. 10 milliseconds for accessing a record on a
disk versus 60 nanoseconds for accessing a record
in main memory.
Note that reading pages in sequence is less
expensive than reading them one at a time.
Every record in a file is given a unique record
ID (rid). This includes a disk page address.
Disk access (time) can be improved by the use of
specific file organizations and indexing
techniques.
A file organization is a method of arranging
records in a file.
Each file organizations has it strengths and
weaknesses (in terms of file operations).

3
File Organizations

A file organization is a method of storing
records in a file when the file is on a disk.
A file will be accessed and modified in a number
of ways.
Different organizations enable different
operations to be carried out efficiently.
A file sorted by SIN helps in retrieving records
in SIN order but is not helpful in retrieving
customers in a particular range of incomes.
Files should be given organizations based on
their patterns of use.
A DBMS often has to support more than one type of
operation.
Compare a number of possible file (and index
organizations).
Heap (unordered, no index) file.
Sorted file (no index).
Clustered B tree file.
Heap file with unclustered B tree index.
Heap file with unclustered hash index.

4
Indices

An index is an additional data structure to
efficiently retrieve records based on the search
key of the index.
An index is a collection of data entries.
An index must have an efficient way to find all
data entries with search key value k.
For example store SIN and rids.
Data entry alternatives.
A data entry is a record with search key value k.
There is no need to store the data records
separately.
Referred to as an indexed file organization.
A data entry is a k, rid pair.
Independent of the file organization.
A data entry is a k, rid-list pair.
Independent of the file organization.
Good space utilization but contains variable
length data.
The last two alternatives are independent of the
underlying file organization.

5
Clustered vs. Unclustered

If a file is ordered in the same way as an index
the index is clustered.
That is, both the files records and the indexs
data entries are ordered on the same field.
If the data records are the data entries the file
must be clustered.
Maintaining data records in order involves high
overhead on insertions and deletions.
In practice (unless the index uses data records
as data entries) files are rarely fully sorted.
Initially records are sorted and some free space
is kept on pages for insertions.
Once the free space is used up insertions are put
in a linked list of overflow pages.
Periodically, when the order has degenerated the
file is sorted again.
If an index is unclustered the data records are
not ordered on the indexs search key.
Consequently retrieving a range selection on the
index search key takes longer than with a
clustered index.
If a file has more than one index only one can be
clustered.

6
Dense vs. Sparse

A dense index is one where there is (at least)
one data entry for each search key value that
appears in a data record.
If the data entries are the data records the
index is always dense.
A sparse index contains one data entry for each
page of records in a data file.
A sparse index must also be clustered or it would
be impossible to find records with search key
values between the data entry values.
A sparse index is usually much smaller than a
dense index.
Primary and secondary indices
A primary index is one where the search key
includes the primary key.
The search key of a secondary index does not.
If there is a dense secondary index on a file the
file is referred to as being inverted.
If there is a dense secondary index on each field
not in the primary key it is referred to as being
fully inverted.

7
Composite Indices

An indexs search key can contain several fields.
Such search keys are referred to as composite
search keys or concatenated keys.
For example fName, lName
For searches on equality the values for each
field in the search key must match the values in
record.
e.g. John Smith does not match John Jones or Fred
Smith.
For range queries ranges can be specified for all
fields in the search key.
If no range is specified for a field it implies
that any value is acceptable for that field.
e.g. (fName ) John will find all customer whose
first name is John regardless of their last
names.

8
Index Data Structures

Hash-based indexing
Group the records (or data entries) in buckets
consisting of a primary page and, if necessary, a
chain of overflow pages.
Records are mapped to buckets by applying a hash
function to the search key.
Supports very efficient equality searches.
The cost is typically 1 or 2 disk I/Os.
Tree indexing
The non-leaf levels of the tree provide a sparse
index to the level below.
The leaf level contains either a dense index to
the file, or the records themselves.
Tree indexes efficiently support range searches.
To find a record costs 1 disk I/O for each level
of the tree.
Therefore a tree with a high fan-out (each node
has many children) is desirable.

9
File Organization Comparison

Consider the following operations
Scan fetch all records in a file from disk into
the buffer.
Search with equality selection fetch all
records that satisfy the selection. Pages with
matching records must be fetched from disk and
the records located within the pages.
Search with a range selection fetch all records
that contain a range of values.
Insert insert a given record into a file. The
page for the insert must be located and fetched,
then modified to include the new record and
written back to the file.
Delete delete a record from a file. The page
for the record must be located, fetched, modified
and written back.
Both insert and delete may involve fetching and
modifying more than the page with the insertion
or deletion.
Note that most of the file organizations require
a search key.

10
Cost Model

To compare file organizations we need to estimate
the cost (in execution time) of different DB
operations.
Assume the following
B data pages with R records a page.
Average time to read / write a page is D.
Average time to process a record is C.
The time to map a value to a hash function is H.
Typical values for these are
D 15 milliseconds.
C and H 100 nanoseconds.
Therefore the cost of I / O dominates and will be
used as the cost metric for the comparison.
Real world systems must consider other aspects of
cost.
CPU cost and usage.
Transmission cost if the DB is distributed.
The importance of reading contiguous pages where
possible (blocked access).
etc.

11
Heap Files

Scan.
The cost is B(D RC).
Each page must be read and each record must be
processed.
Search with equality selection.
Where the selection is on a superkey.
The cost is ½ B(D RC) (if the record exists).
Only one record needs to be fetched so only half
the file needs to be scanned (on average).
Where the selection returns multiple records
The cost is B(D RC).
Search with a range selection.
The cost is B(D RC).
The entire file has to be scanned because we
dont know where qualifying records are.
Insert.
Assume records inserted at the end of the file.
The cost is 2D C (read and write and modify).
Delete.
First retrieve the record. If the rid is known
so is the page, otherwise scan the file.
The cost is therefore search cost D C

12
Sorted Files

Scan.
The cost is B(D RC) to examine all records.
Records are scanned in sort order.
Search with equality selection.
If the selection is not on the sort key.
If selection is a superkey cost is ½ B(D RC).
Otherwise, the cost is B(D RC).
If the selection is on the sort key.
Find the first page using a binary search and
then fetch all qualifying records.
The cost is Dlog2B Clog2R.
Note that the cost increases as the number of
records that match the selection increases (only
1 if the selection is on a superkey).
Search with a range selection.
If it is not on the sort key cost is B(D RC).
If the selection is on the sort key perform a
binary search as for the equality selection.
With a large range the selection may not fit on
one page.
Therefore the cost increases as additional pages
have to be retrieved.

13
Sorted Files continued

Insert.
The record has to be inserted in order so its
position must be found.
The cost of finding a record is the same as the
cost for selection on equality.
Once found and written all pages subsequent to
the page with the insert must be read and written
(to shift records down one space). On average a
record is in the middle of the file.
Modifying each page costs 2D RC
The cost is search cost B(D ½RC).
Delete.
The cost is similar to the insert cost (because
the file has to be compacted).
However, if the rid is known there is no
additional cost for finding the record (though it
still has to be fetched).
The cost is search cost B(D ½RC).

14
Clustered B Tree File

This organization consists of a B tree where the
leaf pages contain data records.
The pages are usually only 2/3 full.
There are therefore 1.5B data pages.
Scan.
The cost is 1.5B(D RC) to examine all records.
Records are scanned in sort order.
Search with equality selection.
If the selection is not on the search key.
If selection is a superkey cost is ½ 1.5B(D
RC).
Otherwise, the cost is 1.5B(D RC).
If the selection is on the search key.
Find the first page using the tree. Visit each
level of the tree (including a leaf). The height
is given by logF1.5B where F is the fan-out of a
node.
The cost is DlogF1.5B Clog2R.
Note that the cost increases as the number of
records that match the selection increases (only
1 if the selection is on a superkey).

15
Clustered B Tree Files continued

Search with a range selection.
If it is not on the search key cost is 1.5B(D
RC).
If the selection is on the search key use the
index as for equality selection and search
contiguous leaf pages until the end of the range
is reached.
Note that additional leaf pages may have to be
retrieved if the selection is relatively large.
Insert.
The record has to be inserted in the appropriate
place so find its position using the index.
Once the page is found it must be modified and
written back.
The cost is DlogF1.5B Clog2R D.
This analysis assumes that the new record can fit
on the leaf page.
Delete.
The cost is the same as the insert cost as the
record must be found and the leaf page modified.
The cost is DlogF1.5B Clog2R D.

16
Heap File with Unclustered Tree Index

With an unclustered index the leaf pages contain
data entries (search key, rid pairs).
These are typically smaller than data records.
The number of leaf pages L is therefore less than
B. Assume that there are RL data entries per
leaf page.
Scan.
The cost is B(D RC) to examine all records.
This assumes the records are scanned from the
heap file without using the index (not in order).
If the scan is to be ordered by the search key it
is better to sort the file than to use the index.
Search with equality selection.
If the selection is not on the search key the
file has to be scanned.
If the selection is on the search key.
Find the appropriate leaf page using the tree.
Use the data entry to find the associated record.
The cost is DlogFL Clog2RL D
The cost increases with the number of records
that match the selection. Because the file is a
heap file matching records may not reside on the
same page.

17
Heap File with Unclustered Tree Index

Search with a range selection.
If it is not on the search key cost is B(D RC).
If the selection is on the search key use the
index as for equality selection and search
contiguous leaf pages until the end of the range
is reached.
For each matching data entry in a leaf page
retrieve the record. As records may reside on
separate pages, using the index may become
prohibitively expensive if the range is large.
Insert.
First insert the record in the heap file (2D
C).
Then insert an entry in the index find, modify
and write it back (DlogFL Clog2RL D).
This analysis assumes that the new record can fit
on the leaf page.
Delete.
Find the record to be deleted using the index,
modify and write back the file and index pages.
The cost is DlogF1.5B Clog2R D 2D.

18
Heap File with Unclustered Hash Index

With an unclustered index the buckets contain
data entries (search key, rid pairs).
These are typically smaller than data records.
Assume that there are RB data entries per bucket.
The cost of using a hash index (H) is the cost of
making a main memory calculation (like C).
Scan.
The cost is B(D RC) to examine all records.
This assumes the records are scanned from the
heap file without using the index (not in order).
Search with equality selection.
If the selection is not on the search key the
file has to be scanned.
If the selection is on the search key.
Find the appropriate bucket using the hash
function. Retrieve the bucket (1 I/O) and then
retrieve the appropriate page of the file.
The cost is H 2D ½RB
The cost increases with the number of records
that match the selection. Because the file is a
heap file matching records may not reside on the
same page.

19
Heap File with Unclustered Hash Index

Search with a range selection.
Range selections are not supported by hash
indices.
Because a hash function is uniform and random,
entries in the range will be randomly distributed
across buckets.
A file scan is therefore required.
Insert.
First insert the record in the heap file (2D
C).
Then insert an entry in the hash index find,
modify and write it back (H 2D C).
Delete.
Find the record to be deleted using the index,
modify and write back the file and index pages.
The cost is H 2D ½RB 2D.
Overflow pages.
This analysis assumes that each bucket consists
of one primary page with no overflow pages.