Chapter 12: Indexing and Hashing - PowerPoint PPT Presentation

1 / 80
About This Presentation
Title:

Chapter 12: Indexing and Hashing

Description:

Indexes speed up access to data in a table. card catalog in a library (author, ... as a disk block, typically 4 kilobytes, n is typically around 100 (assuming 40 ... – PowerPoint PPT presentation

Number of Views:43
Avg rating:3.0/5.0
Slides: 81
Provided by: marily198
Category:

less

Transcript and Presenter's Notes

Title: Chapter 12: Indexing and Hashing


1
Chapter 12 Indexing and Hashing
  • Basic Index Concepts
  • Ordered Indices
  • B-Tree Index Files
  • Hashing
  • Index Definition in SQL
  • Multiple-Key Access
  • Bitmap Indices

2
Basic Index Concepts
  • Indexes speed up access to data in a table.
  • card catalog in a library (author, title,
    subject)
  • A set of one or more attributes used to look up
    records in a table is referred to as a search
    key.
  • In the simplest case, an index file consists of
    records of the form
  • Each such record is referred to as an index entry.

3
Basic Index Concepts, Cont.
  • An index file is a file, and suffers from many of
    the same problems as a data file.
  • Index files are typically much smaller than the
    original file.
  • 10 - 25
  • Two kinds of indices, primarily
  • Ordered indices - index entries are stored in
    sorted order, based on the search key.
  • Hash indices - search keys are distributed
    uniformly across buckets using a hash
    function.

4
Index Evaluation Metrics
  • Insertion time (OLTP)
  • Deletion time (OLTP)
  • Space overhead
  • Access time (DSS) for
  • Point Queries - Records with a specified value in
    an attribute.
  • Range Queries - Records with an attribute value
    in a specified range.

5
Ordered Indices
  • An index whose search key specifies the
    sequential order of the data file is a primary
    index.
  • Also called clustering or clustered index.
  • Search key of a primary index is frequently the
    primary key.
  • An index whose search key does not specify the
    sequential order of the data file is a secondary
    index.
  • Also called a non-clustering or non-clustered
    index.
  • An ordered sequential file with a primary index
    is an index-sequential file.

6
Dense Index Files
  • An index that contains an index record for every
    search-key value in the data file is a dense
    index.

7
Dense Index Files, Cont.
  • To locate the record(s) with search-key value K
  • Find index record with search-key value K.
  • Follow pointer from the index record to the data
    record(s).
  • To delete a record
  • Locate the record in the data file, perhaps using
    the above procedure.
  • Delete the record from the data file.
  • If the deleted record was the only one with that
    search-key value, then delete the search-key from
    the index (similar to data record deletion)
  • To insert a record
  • Perform an index lookup using the records
    search-key value.
  • If the search-key value does not appear in the
    index, insert it.
  • Insert the record into the data file (exactly
    where depends).
  • Assign a pointer to the data record from the
    index entry.

8
Sparse Index Files
  • An index that contains index records but only for
    some search-key values in the data file is a
    sparse index.
  • Typically one index entry for each data file
    block.
  • Only applicable when records are sequentially
    ordered on search-key, i.e., the index is a
    primary index.

9
Sparse Index Files, Cont.
  • To locate a record with search-key value K
  • Find the index record with largest search-key
    value lt K.
  • Search file sequentially from the record to which
    the index record points.
  • To delete a record
  • Locate the record in the data file, perhaps using
    the above procedure.
  • Delete the record from the data file.
  • If the deleted record was the only record with
    its search-key value, and if an entry for the
    search key exists in the index, then replace the
    index entry with the next search-key value in the
    data file (in search-key order). If the next
    search-key value already has an index entry, the
    entry is simply deleted.
  • To insert a record
  • Perform an index lookup using the records
    search-key value suppose that the index stores
    an entry for each block of the data file.
  • If the index entry points to a block with free
    space, then simply insert the record in that
    block.
  • If the index entry points to a full block, then
    allocate a new block and insert the first
    search-key value appearing in the new block into
    the index

10
Sparse Index Files, Cont.
  • In general, sparse indices
  • Require less space and maintenance for insertions
    and deletions.
  • Are generally slower than dense indices for
    locating records, especially if there is more
    than one block per index entry.

11
Multilevel Index
  • If an index does not fit in memory, access
    becomes expensive.
  • To reduce the number of disk accesses to index
    records, treat it as a sequential file on disk
    and build a sparse index on it.
  • outer index a sparse index
  • inner index sparse or dense index
  • If the outer index is still too large to fit in
    main memory, yet another level of index can be
    created, and so on.

12
Multilevel Index, Cont.
  • Indices at all levels might require updating upon
    insertion or deletion.
  • Multilevel insertion, deletion and lookup
    algorithms are simple extensions of the
    single-level algorithms.

13
Secondary Indices
  • Frequently a table is searched using a search key
    (i.e., a set of columns) other than the one on
    which the table is sorted.
  • Suppose account is sorted by account number.
  • Searching based on branch, or searching for a
    range of balances.
  • Although the table is not sorted based on the
    search key, an index can nonetheless be
    constructed on the search key.
  • As noted previously, such an index is called a
    secondary index, i.e., an index whose search key
    does not specify the sequential order of the data
    file.

14
Secondary Indices
  • A secondary index will contain an entry for each
    search-key value.
  • Each index entry will point to either a
  • Single entry containing the search key value
    (candidate key).
  • Bucket that contains pointers to all records with
    that search-key value (non-candidate key).
  • All previous algorithms and data structures can
    be modified to apply to secondary indices.

15
Secondary Indexon balance field of account
  • Note that secondary indices have to be dense
    (why?)

16
Primary and Secondary Indices
  • Indices improve performance
  • Primary Indices
  • Point queries
  • Range queries
  • Secondary Indices
  • Point queries
  • Indices can also hurt performance
  • All indices must be updated upon insertion or
    deletion.
  • Scanning a file sequentially in secondary
    search-key order can be expensive.
  • Worst case - each record access may fetch a new
    block from disk
  • Thus, the number of tuples retrieved is an upper
    bound on the number of data blocks retrieved when
    scanning a secondary index for a range query.
  • The same upper bound applies to a primary index,
    however in that case the upper bound is not
    tight (we will return to this in query
    optimization)

17
B-Tree Index Files
  • B-tree indices are a type of multi-level index.
  • Disadvantage of traditional sparse and dense
    index files
  • Performance degrades as the index file grows,
    since many overflow blocks get created in the
    index file.
  • Periodic reorganization of entire index file is
    required.
  • Advantage of B-tree index files
  • Automatically reorganizes itself with small,
    local, changes.
  • Index reorganization is not required as
    frequently.

18
Example of a B-tree
  • Disadvantage of B-trees - extra insertion,
    deletion and space overhead.
  • Advantages of B-trees outweigh disadvantages,
    and they are used extensively.

B-tree for account file (n 3)
19
B-Tree Index Files (Cont.)
A B-tree is a rooted tree satisfying the
following properties
  • All paths from root to leaf are of the same
    length
  • Each node that is not a root or leaf must be at
    least half full
  • Between ?n/2? and n children
  • Between ?n/2? 1 and n 1 search key values.
  • A leaf node has between ?n/2? 1 and n 1 search
    key values.
  • Special cases
  • If the root is not a leaf, it can have as few as
    2 children.
  • If the root is a leaf (that is, there are no
    other nodes in the tree), it can have as few as 1
    value and no children.

20
B-Tree Node Structure
  • Node structure (leaf or internal)
  • Ki are the search-key values
  • Pi are pointers to children (for non-leaf nodes)
    or pointers to records or buckets of records (for
    leaf nodes).
  • The search-keys in a node are ordered
  • K1 lt K2 lt K3 lt . . . lt Kn1

21
Leaf Nodes in B-Trees
Properties of a leaf node
  • For i 1, 2, . . ., n1, pointer Pi either
    points to a file record with search-key value Ki,
    or to a bucket of pointers to file records, each
    record having search-key value Ki.
  • Only need bucket structure if search-key does not
    form a primary key.
  • If Li, Lj are leaf nodes and i lt j, Lis
    search-key values are less than Ljs search-key
    values
  • Pn points to next leaf node in search-key order

22
Non-Leaf Nodes in B-Trees
  • Non-leaf nodes form a multi-level sparse index on
    the leaf nodes.
  • For a non-leaf node with n pointers
  • All the search-keys in the subtree to which P1
    points are less than K1
  • For 2 ? i ? n 1, all the search-keys in the
    subtree to which Pi points have values greater
    than or equal to Ki1 and less than Ki
  • All the search-keys in the subtree to which Pn
    points are greater than Kn-1

23
Example of B-tree
B-tree for account file (n 5)
  • Leaf nodes must have between 2 and 4 values
    (?n/2? 1 and n 1, with n 5).
  • Non-leaf nodes other than root must have between
    3 and 5 children (?n/2? and n, with n 5).
  • Non-leaf notes other than the root must have
    between 2 and 4 values.
  • Root must have at least 2 children.

24
Observations about B-trees
  • A B-tree contains a relatively small number of
    levels - logarithmic in the number of search keys
    appearing in the main file, thus searches can be
    conducted efficiently.
  • Insertions and deletions to the index file can be
    handled efficiently, as the index can be
    restructured in logarithmic time (as we shall
    see).
  • Each node is typically a disk block
  • logically close blocks need not be physically
    close.

25
Queries on B-Trees
  • Searching a B tree for a given search key value
    is a straight-forward generalization of searching
    a binary search tree.
  • Find all records with a search-key value of k
    (see page 492)
  • Start with the root node
  • Examine the node for the smallest search-key
    value gt k.
  • If such a value exists, call it is Kj, then
    follow Pj to the child node
  • Otherwise k ? Km1, where there are m pointers in
    the node. Then follow Pm to the child node.
  • If the node is not a leaf, repeat the above
    procedure on that node.
  • Eventually reach a leaf node.
  • If for some i, key Ki k, follow pointer Pi to
    the desired record or bucket.
  • Otherwise no record with search-key value k
    exists.

26
Queries on B-Trees (Cont.)
  • In processing a query, a path is traversed in the
    tree from the root to some leaf node.
  • If there are K search-key values in the file, the
    path is no longer than ? log?n/2?(K)?.
  • Since a node is generally the same size as a disk
    block, typically 4 kilobytes, n is typically
    around 100 (assuming 40 bytes per index entry).
  • With 1 million search key values and n 100, at
    most log50(1,000,000) 4 nodes are accessed in
    a lookup.
  • Contrast this with a balanced binary free with 1
    million search key values around 20 nodes,
    i.e., blocks, are accessed.

27
Updates on B-Trees Insertion
  • Find the leaf node in which the search-key value
    would appear
  • If the search-key value is already in the leaf
    node
  • The record is added to data file, and
  • If necessary a pointer is inserted into the
    bucket.
  • If the search-key value is not in the leaf node
  • Add the record to the data file, and
  • Create a bucket if necessary.
  • If there is room in the leaf node, insert the
    (key-value, pointer) pair.
  • Otherwise, split the node as discussed in the
    next slide.

28
Updates on B-Trees Insertion (Cont.)
  • Splitting a node
  • Examine the n (search-key value, pointer) pairs
    (including the one being inserted) in sorted
    order.
  • Place the first ?n/2? in the original node, and
    the rest in a new node.
  • Let the new node be p, and let k be the least key
    value in p.
  • Insert (k,p) in the parent of the node being
    split (recursively).
  • If the parent is full, split it and propagate the
    split further up.
  • Splitting nodes continues up the tree until a
    node is found that is not full.
  • In the worst case the root node is split
    increasing the height of the tree.

29
Updates on B-Trees Insertion (Cont.)
B-Tree before and after insertion of Clearview
30
Updates on B-Trees Deletion
  • Find the record to be deleted, and remove it from
    the data file and from the bucket (if present).
  • Remove (search-key value, pointer) from the leaf
    node if there is no bucket or if the bucket has
    become empty
  • If the node has too few entries, and the entries
    in the node and a sibling fit into a single node,
    then
  • Insert all search-key values in the two nodes
    into a single node (the one on the left), and
    delete the other node.
  • Delete the pair (Ki1, Pi), where Pi is the
    pointer to the deleted node, from its parent,
    recursively using the above procedure.

31
Updates on B-Trees Deletion
  • Otherwise, if the node has too few entries, but
    the entries in the node and a sibling do not fit
    into a single node, then
  • Redistribute the pointers between the node and a
    sibling so that both have more than the minimum
    number of entries.
  • Update the corresponding search-key value in the
    parent node.
  • Node deletions may cascade upwards until a node
    with ?n/2? or more pointers is found, or
    redistribution with a sibling occurs.
  • If the root node has only one pointer after
    deletion, it is deleted and the sole child
    becomes the root.

32
Examples of B-Tree Deletion
  • Before and after deleting Downtown
  • Note that the removal of the leaf node containing
    Downtown did not result in its parent having
    too few pointers, so the deletions stopped.

33
Examples of B-Tree Deletion (Perryridge)
Should be Perryridge, based on the code.
  • Node with Perryridge becomes underfull
    (actually empty, in this special case) and merged
    with its sibling.
  • As a result the Perryridge nodes parent became
    under-full, and was merged with its sibling (and
    an entry was deleted from their parent).
  • Root node then had only one child, and was
    deleted.

34
Example of B-tree Deletion (Perryridge)
  • Parent of leaf containing Perryridge became
    under-full, and borrowed a pointer from its left
    sibling.
  • Search-key value in the parents parent changes
    as a result.

35
B-Tree File Organization
  • Index file degradation problem is solved by using
    B-Tree indices.
  • Data file degradation is eliminated by using
    B-Tree File Organization.
  • The leaf nodes in a B-tree file organization
    store the actual records.
  • Since records are larger than pointers, the
    maximum number of records that can be stored in a
    leaf node is less than the number of pointers in
    a non-leaf node.
  • Leaf nodes are still required to be half full.
  • Insertion and deletion are handled in the same
    way as insertion and deletion of entries in a
    B-tree index.

36
B-Tree File Organization (Cont.)
Example of B-tree File Organization
  • Good space management is needed since records use
    more space than pointers.
  • To improve space utilization, involve more
    siblings in merge and split redistribution.
  • Involving 2 siblings in redistribution (to avoid
    split / merge where possible) results in each
    node having at least entries

37
B-Tree Index Files
  • A B-tree is similar to a B tree, but it only
    allows search-key values to appear only once.
  • Search keys in non-leaf nodes appear nowhere else
    in the B-tree an additional pointer field for
    each search key in a non-leaf node is included.
  • Generalized B-tree leaf node
  • Non-leaf node pointers Bi are the bucket or file
    record pointers.

38
B-Tree Index File Example
  • B-tree and corresponding B-tree on the same data

39
B-Tree Index Files, Cont.
  • Advantages of B-Tree indices
  • May use fewer tree nodes than a corresponding
    B-Tree.
  • Sometimes possible to find search-key value
    before reaching leaf node.
  • Disadvantages of B-Tree indices
  • Only a small fraction of all search-key values
    are actually found early.
  • Non-leaf nodes contain more data, so fan-out is
    reduced, and thus, B-Trees typically have greater
    depth than corresponding B-Tree.
  • Insertion and deletion more complicated than in
    B-Trees.
  • Implementation is harder than B-Trees.
  • Typically, advantages of B-Trees do not out weigh
    disadvantages.

40
Static Hashing
  • A bucket is a unit of storage containing one or
    more records.
  • In the simplest and ideal case a bucket is a disk
    block.
  • Every bucket has an address.
  • A hash function h is a function from the set of
    all search-key values K to the set of all bucket
    addresses B.
  • Typically quick and easy to compute.
  • In a hash file organization we obtain the bucket
    of a record directly from its search-key value
    using a hash function.
  • Used to locate, insert and delete records.
  • Records with different search-key values are
    typically mapped to the same bucket.
  • Thus an entire bucket has to be searched
    sequentially to locate a record.

41
Example of Hash File Organization
  • Consider a hash file organization of the account
    table (next page)
  • 10 buckets (more typically, this is a prime
    number)
  • Search key is branch-name
  • Typically a hash function performs a computation
    on the internal binary representation of the
    search-key.
  • Example
  • Let the binary representation of the ith
    character be integer i
  • Return the sum of the binary representations of
    the characters modulo 10
  • h(Mianus) is (1391142119) 77 mod 10 7
  • h(Perryridge) 5
  • h(Round Hill) 3
  • h(Brighton) 3

42
Example of Hash File Organization
43
Hash Functions
  • Ideal hash function is uniform - each bucket is
    assigned the same number of search-key values
    from the set of all possible search keys.
  • Ideal hash function is random - on average each
    bucket will have the same number of records
    assigned to it irrespective of the actual
    distribution of search-key values in the file.
  • Only possible if each search key value is in a
    small fraction of records.
  • See the book for an example hash function that is
    uniform but not random.
  • Worst case the hash function maps all
    search-key values to the same bucket this makes
    access time proportional to the number of
    search-key values in the file.
  • Extensive research on hash functions has been
    done over the years.
  • Other textbooks cover hash functions to a much
    greater degree.

44
Bucket Overflows
  • Goal one block per bucket.
  • Bucket overflow occurs when a bucket has been
    assigned more records than it has space for.
  • Bucket overflow can occur because of
  • Insufficient buckets - lots of records
  • Skew in record distribution across buckets
  • multiple records have same search-key value
  • bad hash function (non-random distribution)
  • non-prime number of buckets
  • The probability of bucket overflow can be
    reduced, but not eliminated.
  • Overflow is handled by using overflow buckets.

45
Handling of Bucket Overflows
  • Overflow Chaining overflow blocks of a given
    bucket are a linked.
  • This is called closed hashing.
  • An alternative, called open hashing, which does
    not use overflow buckets, is not suitable for
    database applications.

46
Hash Indices
  • Hashing can be used not only for file
    organization, but also for index-structure
    creation.
  • A hash index organizes the search keys, with
    their associated record pointers, into a hash
    file structure.
  • The term hash index is typically used to refer to
    both a separate hash-organized index structure
    and hash organized files.

47
Example of a Hash Index
Data blocks probably sorted on some attribute
other than account number
  • Index blocks

48
Hash Indices, Cont.
  • The book says that strictly speaking, hash
    indices are always secondary indices.
  • This raises the following question
  • Would a hash index make sense on an attribute on
    which the table is sorted?
  • If so, then, by definition that index would be a
    primary index, thereby contradicting the above.
  • Why the heck not?
  • Perhaps the point the authors were trying to make
    is
  • If data records are assigned to blocks using
    hashing, then the table is not sorted by the
    search key, so it cant be referred to as a
    primary index.

49
Deficiencies of Static Hashing
  • In static hashing h maps search-key values to a
    fixed set of buckets.
  • If initial number of buckets is too small,
    performance will degrade due to overflows as the
    database grows.
  • If file size in the future is anticipated and the
    number of buckets allocated accordingly,
    significant amount of space will be wasted
    initially.
  • If the database shrinks, again space will be
    wasted.
  • One option is periodic re-organization of the
    file with a new hash function, but this can be
    expensive.
  • These problems can be avoided by using techniques
    that allow the number of buckets to be modified
    dynamically.

50
Dynamic Hashing
  • Good for databases that grow and shrink in size.
  • The hash function and number of buckets change
    dynamically.
  • Extendable hashing is one form of dynamic
    hashing.
  • An extendable hash table has three parts
  • Hash function
  • Buckets
  • Bucket address table

51
Dynamic Hashing Hash Function
  • Hash function
  • Generates values over a large range
  • Typically b-bit integers, with b 32.
  • At any time only a prefix of the hash function is
    used.
  • The length of the prefix used is i bits, 0 ? i ?
    b.
  • i is referred to as the global depth
  • The value of i, will grow and shrink along with
    the hash table.
  • i is used to index into the bucket address table

52
Dynamic HashingBucket Address Table
  • Bucket address table contains 2i entries,
    initially i 0.
  • Each entry contains a pointer to a bucket.
  • Multiple entries in the bucket address table may
    point to the same bucket.
  • Thus, actual number of buckets is lt 2i

53
Dynamic HashingBuckets
  • As insertions and deletions take place buckets
    are merged and split.
  • Each bucket has an associated value (address)
    called the local depth.
  • The local depth is equal to the number of bits
    that distinguish values in the bucket from the
    values in all other buckets.
  • Bucket chaining may still occur.

54
General Extendable Hash Structure
Global depth
Local depth
  • Note that it is always the case that ij lt i

55
Use of Extendable Hash Structure
  • Throughout the following
  • i is the global depth
  • ij is the local depth for bucket j.
  • To locate the bucket containing search-key Kj
  • 1. Compute h(Kj) X
  • 2. Use the first i bits of X as a displacement
    into bucket address table, and follow the pointer
    to appropriate bucket
  • To insert a record with search-key value Kj
  • Follow the above procedure to locate the bucket,
    say j.
  • If there is room in the bucket then insert the
    record.
  • Otherwise split the bucket and re-attempt
    insertion (next slide).

56
Updates in Extendable Hash Structure
To split a bucket j when inserting record with
search-key value Kj
  • If i gt ij (more than one pointer to bucket j)
  • Allocate a new bucket z, and set ij and iz to ij
    1.
  • Make the second half of the bucket address table
    entries currently pointing to j point to z.
  • Remove and reinsert each record in bucket j.
  • Re-compute the new bucket for Kj and insert the
    record in the bucket (further splitting is
    required if the bucket is still full)
  • If i ij (only one pointer to bucket j)
  • Increment i and double the size of the bucket
    address table.
  • Replace each entry in the table by two entries
    pointing to the same bucket.
  • Re-compute new bucket address table entry for
    KjNow i gt ij so use the first case above.

57
Updates in Extend. Hash Structure (Cont.)
  • When inserting, if a bucket is still full after
    several splits create an overflow bucket instead
    of splitting the bucket further.
  • To delete a record
  • Locate the record and remove it from its bucket.
  • The bucket itself can be removed if it becomes
    empty (with appropriate updates to the bucket
    address table).
  • Buckets can be coalesced.
  • Decreasing bucket address table size is also
    possible.
  • Note
  • Can only coalesce a bucket with a buddy bucket
    having same local depth (value of ij) and same ij
    1 prefix, if it is present.
  • Decreasing bucket address table size might be an
    expensive operation and should only be done if
    the number of buckets becomes much smaller than
    the size of the table.

58
Use of Extend. Hash Structure (Example)
Initial Hash structure, bucket size 2
59
Example (Cont.)
  • Hash structure after insertion of one Brighton
    and two Downtown records

60
Example (Cont.)
Hash structure after insertion of Mianus record
61
Example (Cont.)
Hash structure after insertion of three
Perryridge records
62
Example (Cont.)
  • Hash structure after insertion of Redwood and
    Round Hill records

63
Extendable Hashing vs. Other Schemes
  • Benefits of extendable hashing
  • Hash performance does not degrade with file
    growth.
  • Minimal space overhead.
  • Disadvantages of extendable hashing
  • Extra level of indirection to find desired
    record.
  • Bucket address table may itself become larger
    than memory.
  • Could impose a tree (or other) structure to
    locate an index entry.
  • Changing the size of bucket address table can be
    an expensive operation.
  • Linear hashing is an alternative mechanism which
    avoids these disadvantages at the possible cost
    of more bucket overflows.

64
Comparison of Ordered Indexing and Hashing
  • What are the expected type of queries?
  • Hashing is generally better for point queries.
  • If range queries are common, ordered indices are
    preferred.
  • Is it desirable to optimize average access time
    at the expense of worst-case access time?
  • What is the cost and frequency of insertions and
    deletions?
  • What is the cost and frequency of periodic
    re-organization?

65
Index Definition in SQL
  • Creating an index
  • create index ltindex-namegt on ltrelation-namegt
    (ltattribute-listgt)
  • Example
  • create index b-index on branch(branch-name)
  • To drop an index
  • drop index ltindex-namegt
  • Indices can also be used enforce the condition
    that a search key is a candidate key
  • create unique index

66
Multiple-Key Access
  • Multiple indices can be used for certain queries
  • select account-number
  • from account
  • where branch-name Perryridge and balance
    1000
  • Possible strategies using single-attribute
    indices
  • 1. Use index on branch-name to find accounts with
    branch-name Perryridge test balance 1000
    in memory.
  • Use index on balance to find accounts with
    balances of 1000 test branch-name
    Perryridge in memory.
  • Use three steps
  • Use branch-name index to find pointers to all
    records pertaining to the Perryridge branch.
  • Similarly use index on balance.
  • Take intersection of both sets of pointers
    obtained.

67
Indices on Multiple Attributes
  • An index can be created on more than one
    attribute.
  • A key requirement for ordered indices is the
    ability to compare search keys for , lt and gt.
  • Suppose a multi-attribute index is created on the
    attributes (branch-name, balance) in the branch
    relation.
  • How do comparisons work?
  • , lt, gt, etc.

68
Indices on Multiple Attributes, Cont.
  • A multi-attribute index can support queries such
    as
  • where branch-name Perryridge and balance
    1000
  • Using separate indices is likely to be less
    efficient many records (or pointers) could be
    retrieved that satisfy only one of the
    conditions.
  • The multi-attribute index is helpful for queries
    such as
  • where branch-name Perryridge and balance lt
    1000
  • The multi-attribute index is not helpful for
    queries such as
  • where branch-name lt Perryridge and balance
    1000
  • May fetch many records that satisfy the first
    but not the second condition.

69
Bitmap Indices
  • Bitmap indices are a special type of index
    designed for efficient querying on multiple
    search keys.
  • Not particularly useful for single attribute
    queries
  • Typically used in data warehouses
  • Applicable on attributes having a small number of
    distinct values
  • Gender, country, state,
  • Income-level (0-9999, 10000-19999, 20000-50000,
    50000-infinity)
  • Bitmap assumptions
  • Records in a relation are numbered sequentially
    from, say, 0
  • Given a number n it must be easy to retrieve
    record n
  • Particularly easy if records are of fixed size

70
Bitmap Indices (Cont.)
  • A bitmap index on an attribute has a bitmap for
    each attribute value.
  • A bitmap is simply an array of bits
  • Bitmap has as many bits as there are records in
    the file.
  • In a bitmap for value v, the bit for a record is
    1 if the record has the value v for the
    attribute, and is 0 otherwise

71
Bitmap Indices (Cont.)
  • Queries on multiple bitmap-indexed attributes use
    bitmap operations
  • Intersection (and) 100110 AND 110011 100010
  • Union (or) 100110 OR 110011 110111
  • Complementation (not) NOT 100110 011001
  • Example - Retrieve records for males with income
    level L1
  • 10010 AND 10100 10000
  • Resulting bitmap is then used to retrieve
    satisfying tuples
  • Counting the number of satisfying tuples is even
    faster

72
Bitmap Indices (Cont.)
  • How big is a bitmap index?
  • Bitmap indices are generally very small compared
    with relation size.
  • Example
  • number of records is n
  • record size 100 bytes, i.e., 800 bits (somewhat
    conservative)
  • total table size is n 800 bits
  • bitmap size is n bits
  • space for a single bitmap is 1/800 of space used
    by the relation.
  • If the number of distinct attribute values is 8,
    then bitmap index is only 1 of relation size.
  • Exercise (do the math) - What happens if each
    record is 32 bits and there are 64 distinct
    attribute values? Where is the cut-off point?

73
Bitmap Indices (Cont.)
  • How does a bitmap change if a record is deleted?
  • Could we simply put a 0 in all bitmaps for the
    record?
  • No! Consider
  • not(Av)
  • An existence bitmap could be used to indicate if
    there is a valid record at a specific location
    for negation to work properly.
  • (NOT bitmap-A-v) AND ExistenceBitmap
  • Bitmaps need to be maintained for all values,
    including null
  • To correctly handle SQL (3 value) null semantics
    for NOT(Av)
  • Intersect the above result with (NOT
    bitmap-A-Null)

74
Efficient Implementation of Bitmap Operations
  • Bitmaps are packed into words.
  • A single word and operation (a basic CPU
    instruction) computes the and of 32 or 64 bits at
    once.
  • The conjunction (and) of two bitmaps containing
    1-million-bits can be done with just 31,250
    instructions.
  • Counting the number of 1s can be done fast by a
    trick
  • Use each byte to index into a pre-computed array
    of 256 elements each storing the count of 1s in
    the binary representation.
  • Add up retrieved counts
  • Bitmaps can be used instead of Tuple-ID lists at
    leaf levels of B-trees, for values that have a
    large number of matching records
  • If a tuple-id is 64-bits, then this saves space
    if gt 1/64 of the records have a specific value
    (as an exercise, do the math).
  • This technique merges benefits of bitmap and
    B-tree indices..

75
End of Chapter
76
Partitioned Hashing
  • Hash values are split into segments that depend
    on each attribute of the search-key.
  • (A1, A2, . . . , An) for n attribute search-key
  • Example n 2, for customer, search-key being
    (customer-street, customer-city)
  • search-key value hash value (Main,
    Harrison) 101 111 (Main, Brooklyn) 101
    001 (Park, Palo Alto) 010 010 (Spring,
    Brooklyn) 001 001 (Alma, Palo Alto) 110 010
  • To answer equality query on single attribute,
    need to look up multiple buckets. Similar in
    effect to grid files.

77
Grid Files
  • A grid file is an index that supports multiple
    search-key queries involving one or more
    comparison operators.
  • The grid file consists of
  • a single n-dimensional grid array, where n is the
    number of search key attributes.
  • n linear scales, one for each search-key
    attribute.
  • Multiple cells of grid array can point to same
    bucket.

78
Example Grid File for account
79
Queries on a Grid File
  • A grid file on two attributes A and B can handle
    queries of all following forms with reasonable
    efficiency
  • (a1 ? A ? a2)
  • (b1 ? B ? b2)
  • (a1 ? A ? a2 ? b1 ? B ? b2)
  • During insertion (Similar to extendable hashing,
    but on n dimensions)
  • If a bucket becomes full then it can be split if
    more than one cell points to it.
  • If only one cell points to the bucket, either an
    overflow bucket can be created or the grid size
    can be increased
  • During deletion (Also similar to extendable
    hashing)
  • If a bucket becomes empty, the bucket can be
    merged with other buckets
  • Or the grid pointer can be set to null, etc.

80
Grid Files (Cont.)
  • Linear scales must be chosen to uniformly
    distribute records across cells.
  • Otherwise there will be too many overflow
    buckets.
  • Periodic re-organization to increase grid size
    will help.
  • But reorganization can be very expensive.
  • Space overhead of grid array can be high.
  • R-trees (Chapter 23) are an alternative
Write a Comment
User Comments (0)
About PowerShow.com