F23AF1: DATABASE File Organisation and Indexing

1
F23AF1 DATABASE File Organisation and Indexing
  • Miss Jenny Coady
  • Heriot-Watt University
  • Jenny@macs.hw.ac.uk
  • Room EMG37, Ext 4178

2
Data Storage Principles
  • Database relations are implemented as files of
    records.
  • This is still an abstraction: the real storage
    medium is disk, which consists of pages (ca.
    0.5-5 kbytes)
  • Pages are read from disk and written to disk:
    high-cost operations!
  • Mapping: each record has a record identity (rid),
    which identifies the page where it is stored and
    its offset on that page
  • The DBMS reads (and writes) entire pages and
    stores a number of them in a buffer pool
  • The buffer manager decides which pages to load
    into the buffer (replacement policy, e.g. least
    recently used or clock; sketched below)
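
The buffer-pool idea on this slide can be pictured with a small Python sketch. It is a minimal illustration under assumed details (the pool capacity, the read_page_from_disk callback and the page-as-list-of-records layout are hypothetical), not the DBMS's actual implementation.

```python
from collections import OrderedDict

class BufferPool:
    """Toy buffer pool with a least-recently-used (LRU) replacement policy."""

    def __init__(self, capacity, read_page_from_disk):
        self.capacity = capacity                        # number of frames in the pool
        self.read_page_from_disk = read_page_from_disk  # hypothetical disk-read callback
        self.frames = OrderedDict()                     # page_id -> page, kept in LRU order

    def get_page(self, page_id):
        if page_id in self.frames:
            self.frames.move_to_end(page_id)            # hit: mark as most recently used
            return self.frames[page_id]
        if len(self.frames) >= self.capacity:
            self.frames.popitem(last=False)             # full: evict least recently used page
        page = self.read_page_from_disk(page_id)        # miss: the expensive disk I/O
        self.frames[page_id] = page
        return page

    def fetch_record(self, rid):
        page_id, slot = rid                             # a rid names a page and an offset within it
        return self.get_page(page_id)[slot]
```

For example, with pages held as lists of records in a list called `disk`, `BufferPool(capacity=100, read_page_from_disk=lambda pid: disk[pid]).fetch_record((3, 7))` would return slot 7 of page 3, reading the page from disk only if it is not already buffered.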

3
Alternative File Organisation
  • Each alternative is good in some situations, not
    in others
  • Heap Files: no order on records. Suitable when
    typical access is a file scan retrieving all
    records.
  • Sorted Files: sorted by a specific record field
    (key). Best if records must be retrieved in some
    order, or only a range of records is needed.

4
Alternative File Organisation (2)
  • Hashed Files
  • File is a collection of buckets
  • (Bucket = primary page plus zero or more
    overflow pages)
  • Hashing function h computes h(r), the bucket into
    which record r belongs (it looks at only some of
    the fields of r, the search fields)
  • Good for equality selections

5
Heap Files
  • Heap files, sometimes called serial files, are
    not ordered in any particular way.
  • The only way to access a specific record is to
    process all the records which physically precede
    it. There are no mechanisms to enable us to
    access a particular record directly.

6
Heap Files (2)
  • The search begins at the first record and
    continues for each subsequent record until the
    required record is found, or until the end of the
    file is reached.
  • The average search time for a linear search is
    proportional to N/2, where N is the number of
    records on the file. Heap files are useful for
    storing transactions.
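
The linear search just described might look as follows in Python; the record layout (dictionaries with a key field) is assumed purely for illustration.

```python
def heap_search(records, key_value, key_field="id"):
    """Scan an unordered (heap) file from the first record onwards.

    On average about N/2 records are examined when the key is present,
    and all N when it is not.
    """
    for record in records:                 # no ordering, so every preceding record is checked
        if record[key_field] == key_value:
            return record
    return None                            # reached the end of the file without a match
```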

7
Sorted Files
  • Sorted files, sometimes known as sequential
    files, are ordered by the value of one of the
    attributes, normally the primary key. They are
    useful when the records are retrieved in that
    specific order, or when a range of records is
    required.
  • Sorted files may be searched using a technique
    known as binary search, which makes the maximum
    search time proportional to log2 N, where N is the
    number of records on the file.

8
Sorted Files 2
  • In a binary search, the first record to be
    checked is in the centre of the file. If the
    required key equals the key of the record
    retrieved, we have found what we are looking for.
    If the key we are looking for is greater than the
    key of the record retrieved then, because the
    file is sorted, we can discard the lower half of
    the file: it contains only records with keys less
    than the one we are searching for. Similarly, if
    the key of the record retrieved is greater than
    the key we are looking for, we can discard the
    top half of the file.
  • The process is repeated until we find what we are
    looking for, or until there are no more records
    left in the area we are searching. The area to be
    searched is halved each time, and thus the
    maximum search time is proportional to log2 N (a
    sketch follows below).
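
A Python sketch of the binary search described above, assuming the file is held as a list of records sorted on the key field (an illustrative layout rather than the slide's own):

```python
def binary_search(sorted_records, key_value, key_field="id"):
    """Binary search over records sorted on key_field.

    The area still to be searched is halved on every probe, so at most
    roughly log2(N) records are examined for N records.
    """
    low, high = 0, len(sorted_records) - 1
    while low <= high:                          # records remain in the search area
        mid = (low + high) // 2                 # check the record in the centre
        key = sorted_records[mid][key_field]
        if key == key_value:
            return sorted_records[mid]          # found the required record
        if key < key_value:
            low = mid + 1                       # discard the lower half
        else:
            high = mid - 1                      # discard the upper half
    return None                                 # no more records left to search
```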

9
Hashing Files
  • The organisation of a hashed file depends on a
    hashing function, which converts the value of a
    key field into the address of a page or group of
    pages, often known as a bucket, containing that
    record in the file.
  • A bucket is a piece of storage big enough to
    contain a number of records.
  • A hashing function is used to generate a valid
    address, i.e. an address within the file, from
    the value of a key. A hashing function is chosen
    in such a way as to provide a uniform
    distribution of page addresses from the values of
    frequently used keys in the database.

10
Hashing Files 2
  • For example, if you had a key with range 2000 to
    10000, and you knew that there would never be
    more records in the file than would fill ten
    pages, you would need to reduce the range of key
    values (2000-10000) to the range of available
    page addresses (0-9). We normally choose this
    target number to be a prime number. In fact, hash
    files are sparsely populated and are rarely
    filled to more than 70% capacity; after that,
    their efficiency tends to tail off. So, if we
    know that ten pages of data will be filled, then
    we need approximately 13 pages (a mod-based
    example is sketched below).
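
A mod-based hashing function is one plausible concrete reading of this example; the choice of 13 pages (a prime) and of `key % 13` as the function is an assumption for illustration, not necessarily the lecturer's exact scheme.

```python
NUM_PAGES = 13            # prime, roughly ten pages of data at 70-80% occupancy

def hash_key(key):
    """Map a key in the range 2000-10000 to a bucket address 0..NUM_PAGES-1."""
    return key % NUM_PAGES

# Records with these keys would be stored in (and later fetched from) these buckets:
for key in (2000, 4567, 9999):
    print(key, "->", hash_key(key))
```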

11
Cost Model
  • We ignore CPU costs, for simplicity
  • P: number of data pages
  • D: (average) time to read or write a disk page
  • Simplifications:
  • Measuring number of page I/Os ignores gains of
    pre-fetching blocks of pages
  • even I/O cost is only approximated
  • Average case analysis (based on several
    simplistic assumptions)
  • Good enough to show overall trends!

12
Assumptions in Our Analysis
  • Single-record insert and delete
  • Heap Files
  • Equality selection on key: exactly one match
  • Insert always at end of file
  • Sorted Files
  • Selections on sort field(s)
  • Files compacted after deletions
  • Hashed Files
  • No overflow buckets, 80% page occupancy

13
Cost of Operations
  • We shall compute the cost of some common
    operations for each type of file organisation.
    Some common database operations are as follows:
  • Scan records, e.g. display all the customer
    records
  • Equality search, e.g. find all orders for customer
    xyz
  • Range search, e.g. find all products where the
    price is greater than 5.5
  • Insert a new record
  • Delete an existing record
  • In order to maintain efficiency when new records
    are added and the overflow areas are used for
    synonyms, static hashed files normally have a
    record occupancy of approximately 70-80%, so the
    number of pages required is roughly 1.25 times
    the actual number of pages occupied.

14
Cost 2
15
Cost 3
  • For a sorted file, we can use a binary search,
    with a maximum search time of D log2 P.
  • For a heap file, a new record is inserted at the
    end of the file, and the average search time is
    proportional to half of the number of pages
    stored in the file, i.e. 0.5 P D.
  • From the data presented in the table, we can see
    that, depending on the operation to be performed,
    some file organisations are more efficient than
    others (a sketch of these formulas follows below).
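
The cost table itself is not reproduced in this transcript, but the figures quoted on the slides (a full scan of P pages, 0.5 P D for a heap equality search, D log2 P for a sorted-file search, and the 1.25 factor for hashed files) can be collected into a small sketch; treat it as a rough guide rather than the lecture's exact table.

```python
from math import log2

def estimated_costs(P, D):
    """Approximate I/O costs for P data pages with per-page I/O time D,
    using the figures quoted on these slides (CPU costs ignored)."""
    return {
        "heap scan":              P * D,         # read every data page
        "heap equality search":   0.5 * P * D,   # on average half the file is scanned
        "sorted equality search": D * log2(P),   # binary search over the pages
        "hashed file scan":       1.25 * P * D,  # hashed files occupy about 1.25x the pages
    }

# Example: 1,000 data pages and 10 ms per page I/O.
for operation, cost in estimated_costs(P=1000, D=0.010).items():
    print(f"{operation}: {cost:.2f} s")
```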

16
Exercise
  • Explain how the formulae shown in the table have
    been derived!

17
Indexes
  • An index on a file speeds up selections on the
    search key fields for the index
  • Any subset of the fields of a relation can be the
    search key for an index on the relation
  • The search key is not the same as a key (a
    minimal set of fields that uniquely identifies a
    record in a relation)
  • An index contains a collection of data entries,
    and supports efficient retrieval of all data
    entries with a given search key value K

18
Data Entries in Indexes
  • Three alternatives:
  • A data record with key value K
  • ⟨K, r⟩, where r is the rid of a record with
    search key value K
  • ⟨K, r1, ..., rn⟩, where r1, ..., rn is a list
    of rids of records with search key value K
    (sketched below)
  • The choice of alternative for data entries is
    orthogonal to the indexing technique used to
    locate data entries with a given key value K
  • Examples: B+ trees, hash-based structures
  • The index may contain auxiliary information that
    directs searches
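
Alternatives 2 and 3 can be pictured as a mapping from search-key values to rids. The dictionary-based sketch below illustrates only the data entries themselves (the search key name and the (page, slot) rid format are assumptions); a real index would organise these entries in a B+ tree or hash structure.

```python
from collections import defaultdict

def build_data_entries(records, search_key="name"):
    """Build Alternative-3 data entries: one <K, [r1, ..., rn]> entry per key value K.

    Each rid here is a (page_id, slot) pair; if every key value is unique the
    entries degenerate to Alternative 2, i.e. a single <K, r> pair per key.
    """
    entries = defaultdict(list)
    for rid, record in records:                  # records supplied as (rid, record) pairs
        entries[record[search_key]].append(rid)
    return dict(entries)

# Example: all rids whose record has search key value "Smith".
# entries = build_data_entries(file_records); entries.get("Smith", [])
```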

19
Data Entries in Indexes: Alternative 1
  • A data record with key value K
  • Means the index structure is a file organisation
    for data records (like heap files or sorted
    files)
  • For a given collection of data, at most one index
    can use this alternative
  • Otherwise there is duplication of data records:
    redundant storage, potential inconsistency
  • If data records are large:
  • High number of pages containing data entries
  • Large size of auxiliary information

20
Data Entries in Indexes: Alternatives 2 and 3
21
Index Classification
22
Clustered vs. Unclustered
23
Dense vs. Sparse
24
Summary
25
Summary (cont'd)