File Organization and Indices - PowerPoint PPT Presentation

1 / 19
About This Presentation
Title:

File Organization and Indices

Description:

Maintaining data records in order involves high overhead on insertions and deletions. ... keys are referred to as composite search keys or concatenated keys. ... – PowerPoint PPT presentation

Number of Views:25
Avg rating:3.0/5.0
Slides: 20
Provided by: johne78
Category:

less

Transcript and Presenter's Notes

Title: File Organization and Indices


1
File Organization and Indices
  • External Storage
  • Introduction
  • Indices
  • Clustered and unclustered indices
  • Dense and sparse indices
  • Composite indices
  • File Organization Comparision
  • Cost model
  • Heap files
  • Sorted files
  • Clustered B tree files
  • Heap file with unclustered tree index
  • Heap file with unclustered hash index

2
External Storage
  • Databases store a large amount of data
  • The data must be persistent.
  • It is stored on disk (or tape).
  • Accessing data from disk is costly
  • Data on disk is read or written in units of a
    page, usually 4 or 8 KB.
  • e.g. 10 milliseconds for accessing a record on a
    disk versus 60 nanoseconds for accessing a record
    in main memory.
  • Note that reading pages in sequence is less
    expensive than reading them one at a time.
  • Every record in a file is given a unique record
    ID (rid). This includes a disk page address.
  • Disk access (time) can be improved by the use of
    specific file organizations and indexing
    techniques.
  • A file organization is a method of arranging
    records in a file.
  • Each file organizations has it strengths and
    weaknesses (in terms of file operations).

3
File Organizations
  • A file organization is a method of storing
    records in a file when the file is on a disk.
  • A file will be accessed and modified in a number
    of ways.
  • Different organizations enable different
    operations to be carried out efficiently.
  • A file sorted by SIN helps in retrieving records
    in SIN order but is not helpful in retrieving
    customers in a particular range of incomes.
  • Files should be given organizations based on
    their patterns of use.
  • A DBMS often has to support more than one type of
    operation.
  • Compare a number of possible file (and index
    organizations).
  • Heap (unordered, no index) file.
  • Sorted file (no index).
  • Clustered B tree file.
  • Heap file with unclustered B tree index.
  • Heap file with unclustered hash index.

4
Indices
  • An index is an additional data structure to
    efficiently retrieve records based on the search
    key of the index.
  • An index is a collection of data entries.
  • An index must have an efficient way to find all
    data entries with search key value k.
  • For example store SIN and rids.
  • Data entry alternatives.
  • A data entry is a record with search key value k.
  • There is no need to store the data records
    separately.
  • Referred to as an indexed file organization.
  • A data entry is a k, rid pair.
  • Independent of the file organization.
  • A data entry is a k, rid-list pair.
  • Independent of the file organization.
  • Good space utilization but contains variable
    length data.
  • The last two alternatives are independent of the
    underlying file organization.

5
Clustered vs. Unclustered
  • If a file is ordered in the same way as an index
    the index is clustered.
  • That is, both the files records and the indexs
    data entries are ordered on the same field.
  • If the data records are the data entries the file
    must be clustered.
  • Maintaining data records in order involves high
    overhead on insertions and deletions.
  • In practice (unless the index uses data records
    as data entries) files are rarely fully sorted.
  • Initially records are sorted and some free space
    is kept on pages for insertions.
  • Once the free space is used up insertions are put
    in a linked list of overflow pages.
  • Periodically, when the order has degenerated the
    file is sorted again.
  • If an index is unclustered the data records are
    not ordered on the indexs search key.
  • Consequently retrieving a range selection on the
    index search key takes longer than with a
    clustered index.
  • If a file has more than one index only one can be
    clustered.

6
Dense vs. Sparse
  • A dense index is one where there is (at least)
    one data entry for each search key value that
    appears in a data record.
  • If the data entries are the data records the
    index is always dense.
  • A sparse index contains one data entry for each
    page of records in a data file.
  • A sparse index must also be clustered or it would
    be impossible to find records with search key
    values between the data entry values.
  • A sparse index is usually much smaller than a
    dense index.
  • Primary and secondary indices
  • A primary index is one where the search key
    includes the primary key.
  • The search key of a secondary index does not.
  • If there is a dense secondary index on a file the
    file is referred to as being inverted.
  • If there is a dense secondary index on each field
    not in the primary key it is referred to as being
    fully inverted.

7
Composite Indices
  • An indexs search key can contain several fields.
  • Such search keys are referred to as composite
    search keys or concatenated keys.
  • For example fName, lName
  • For searches on equality the values for each
    field in the search key must match the values in
    record.
  • e.g. John Smith does not match John Jones or Fred
    Smith.
  • For range queries ranges can be specified for all
    fields in the search key.
  • If no range is specified for a field it implies
    that any value is acceptable for that field.
  • e.g. (fName ) John will find all customer whose
    first name is John regardless of their last
    names.

8
Index Data Structures
  • Hash-based indexing
  • Group the records (or data entries) in buckets
    consisting of a primary page and, if necessary, a
    chain of overflow pages.
  • Records are mapped to buckets by applying a hash
    function to the search key.
  • Supports very efficient equality searches.
  • The cost is typically 1 or 2 disk I/Os.
  • Tree indexing
  • The non-leaf levels of the tree provide a sparse
    index to the level below.
  • The leaf level contains either a dense index to
    the file, or the records themselves.
  • Tree indexes efficiently support range searches.
  • To find a record costs 1 disk I/O for each level
    of the tree.
  • Therefore a tree with a high fan-out (each node
    has many children) is desirable.

9
File Organization Comparison
  • Consider the following operations
  • Scan fetch all records in a file from disk into
    the buffer.
  • Search with equality selection fetch all
    records that satisfy the selection. Pages with
    matching records must be fetched from disk and
    the records located within the pages.
  • Search with a range selection fetch all records
    that contain a range of values.
  • Insert insert a given record into a file. The
    page for the insert must be located and fetched,
    then modified to include the new record and
    written back to the file.
  • Delete delete a record from a file. The page
    for the record must be located, fetched, modified
    and written back.
  • Both insert and delete may involve fetching and
    modifying more than the page with the insertion
    or deletion.
  • Note that most of the file organizations require
    a search key.

10
Cost Model
  • To compare file organizations we need to estimate
    the cost (in execution time) of different DB
    operations.
  • Assume the following
  • B data pages with R records a page.
  • Average time to read / write a page is D.
  • Average time to process a record is C.
  • The time to map a value to a hash function is H.
  • Typical values for these are
  • D 15 milliseconds.
  • C and H 100 nanoseconds.
  • Therefore the cost of I / O dominates and will be
    used as the cost metric for the comparison.
  • Real world systems must consider other aspects of
    cost.
  • CPU cost and usage.
  • Transmission cost if the DB is distributed.
  • The importance of reading contiguous pages where
    possible (blocked access).
  • etc.

11
Heap Files
  • Scan.
  • The cost is B(D RC).
  • Each page must be read and each record must be
    processed.
  • Search with equality selection.
  • Where the selection is on a superkey.
  • The cost is ½ B(D RC) (if the record exists).
  • Only one record needs to be fetched so only half
    the file needs to be scanned (on average).
  • Where the selection returns multiple records
  • The cost is B(D RC).
  • Search with a range selection.
  • The cost is B(D RC).
  • The entire file has to be scanned because we
    dont know where qualifying records are.
  • Insert.
  • Assume records inserted at the end of the file.
  • The cost is 2D C (read and write and modify).
  • Delete.
  • First retrieve the record. If the rid is known
    so is the page, otherwise scan the file.
  • The cost is therefore search cost D C

12
Sorted Files
  • Scan.
  • The cost is B(D RC) to examine all records.
  • Records are scanned in sort order.
  • Search with equality selection.
  • If the selection is not on the sort key.
  • If selection is a superkey cost is ½ B(D RC).
  • Otherwise, the cost is B(D RC).
  • If the selection is on the sort key.
  • Find the first page using a binary search and
    then fetch all qualifying records.
  • The cost is Dlog2B Clog2R.
  • Note that the cost increases as the number of
    records that match the selection increases (only
    1 if the selection is on a superkey).
  • Search with a range selection.
  • If it is not on the sort key cost is B(D RC).
  • If the selection is on the sort key perform a
    binary search as for the equality selection.
  • With a large range the selection may not fit on
    one page.
  • Therefore the cost increases as additional pages
    have to be retrieved.

13
Sorted Files continued
  • Insert.
  • The record has to be inserted in order so its
    position must be found.
  • The cost of finding a record is the same as the
    cost for selection on equality.
  • Once found and written all pages subsequent to
    the page with the insert must be read and written
    (to shift records down one space). On average a
    record is in the middle of the file.
  • Modifying each page costs 2D RC
  • The cost is search cost B(D ½RC).
  • Delete.
  • The cost is similar to the insert cost (because
    the file has to be compacted).
  • However, if the rid is known there is no
    additional cost for finding the record (though it
    still has to be fetched).
  • The cost is search cost B(D ½RC).

14
Clustered B Tree File
  • This organization consists of a B tree where the
    leaf pages contain data records.
  • The pages are usually only 2/3 full.
  • There are therefore 1.5B data pages.
  • Scan.
  • The cost is 1.5B(D RC) to examine all records.
  • Records are scanned in sort order.
  • Search with equality selection.
  • If the selection is not on the search key.
  • If selection is a superkey cost is ½ 1.5B(D
    RC).
  • Otherwise, the cost is 1.5B(D RC).
  • If the selection is on the search key.
  • Find the first page using the tree. Visit each
    level of the tree (including a leaf). The height
    is given by logF1.5B where F is the fan-out of a
    node.
  • The cost is DlogF1.5B Clog2R.
  • Note that the cost increases as the number of
    records that match the selection increases (only
    1 if the selection is on a superkey).

15
Clustered B Tree Files continued
  • Search with a range selection.
  • If it is not on the search key cost is 1.5B(D
    RC).
  • If the selection is on the search key use the
    index as for equality selection and search
    contiguous leaf pages until the end of the range
    is reached.
  • Note that additional leaf pages may have to be
    retrieved if the selection is relatively large.
  • Insert.
  • The record has to be inserted in the appropriate
    place so find its position using the index.
  • Once the page is found it must be modified and
    written back.
  • The cost is DlogF1.5B Clog2R D.
  • This analysis assumes that the new record can fit
    on the leaf page.
  • Delete.
  • The cost is the same as the insert cost as the
    record must be found and the leaf page modified.
  • The cost is DlogF1.5B Clog2R D.

16
Heap File with Unclustered Tree Index
  • With an unclustered index the leaf pages contain
    data entries (search key, rid pairs).
  • These are typically smaller than data records.
  • The number of leaf pages L is therefore less than
    B. Assume that there are RL data entries per
    leaf page.
  • Scan.
  • The cost is B(D RC) to examine all records.
  • This assumes the records are scanned from the
    heap file without using the index (not in order).
  • If the scan is to be ordered by the search key it
    is better to sort the file than to use the index.
  • Search with equality selection.
  • If the selection is not on the search key the
    file has to be scanned.
  • If the selection is on the search key.
  • Find the appropriate leaf page using the tree.
    Use the data entry to find the associated record.
  • The cost is DlogFL Clog2RL D
  • The cost increases with the number of records
    that match the selection. Because the file is a
    heap file matching records may not reside on the
    same page.

17
Heap File with Unclustered Tree Index
  • Search with a range selection.
  • If it is not on the search key cost is B(D RC).
  • If the selection is on the search key use the
    index as for equality selection and search
    contiguous leaf pages until the end of the range
    is reached.
  • For each matching data entry in a leaf page
    retrieve the record. As records may reside on
    separate pages, using the index may become
    prohibitively expensive if the range is large.
  • Insert.
  • First insert the record in the heap file (2D
    C).
  • Then insert an entry in the index find, modify
    and write it back (DlogFL Clog2RL D).
  • This analysis assumes that the new record can fit
    on the leaf page.
  • Delete.
  • Find the record to be deleted using the index,
    modify and write back the file and index pages.
  • The cost is DlogF1.5B Clog2R D 2D.

18
Heap File with Unclustered Hash Index
  • With an unclustered index the buckets contain
    data entries (search key, rid pairs).
  • These are typically smaller than data records.
    Assume that there are RB data entries per bucket.
  • The cost of using a hash index (H) is the cost of
    making a main memory calculation (like C).
  • Scan.
  • The cost is B(D RC) to examine all records.
  • This assumes the records are scanned from the
    heap file without using the index (not in order).
  • Search with equality selection.
  • If the selection is not on the search key the
    file has to be scanned.
  • If the selection is on the search key.
  • Find the appropriate bucket using the hash
    function. Retrieve the bucket (1 I/O) and then
    retrieve the appropriate page of the file.
  • The cost is H 2D ½RB
  • The cost increases with the number of records
    that match the selection. Because the file is a
    heap file matching records may not reside on the
    same page.

19
Heap File with Unclustered Hash Index
  • Search with a range selection.
  • Range selections are not supported by hash
    indices.
  • Because a hash function is uniform and random,
    entries in the range will be randomly distributed
    across buckets.
  • A file scan is therefore required.
  • Insert.
  • First insert the record in the heap file (2D
    C).
  • Then insert an entry in the hash index find,
    modify and write it back (H 2D C).
  • Delete.
  • Find the record to be deleted using the index,
    modify and write back the file and index pages.
  • The cost is H 2D ½RB 2D.
  • Overflow pages.
  • This analysis assumes that each bucket consists
    of one primary page with no overflow pages.
Write a Comment
User Comments (0)
About PowerShow.com