Title: Chapter 12: Indexing and Hashing
Chapter 12: Indexing and Hashing
- Basic Index Concepts
- Ordered Indices
- B+-Tree Index Files
- B-Tree Index Files
- Hashing
- Index Definition in SQL
- Multiple-Key Access
- Bitmap Indices
Basic Index Concepts
- Indices speed up access to data in a table.
  - Analogy: the card catalog in a library (author, title, subject).
- A set of one or more attributes used to look up records in a table is referred to as a search key.
- In the simplest case, an index file consists of records of the form (search-key value, pointer).
- Each such record is referred to as an index entry.

Basic Index Concepts, Cont.
- An index file is a file, and suffers from many of the same problems as a data file.
- Index files are typically much smaller than the original file, often 10-25% of its size.
- Two kinds of indices, primarily:
  - Ordered indices - index entries are stored in sorted order, based on the search key.
  - Hash indices - search keys are distributed uniformly across buckets using a hash function.
Index Evaluation Metrics
- Insertion time (OLTP)
- Deletion time (OLTP)
- Space overhead
- Access time (DSS) for:
  - Point queries - records with a specified value in an attribute.
  - Range queries - records with an attribute value in a specified range.

Ordered Indices
- An index whose search key specifies the sequential order of the data file is a primary index.
  - Also called a clustering or clustered index.
  - The search key of a primary index is frequently the primary key.
- An index whose search key does not specify the sequential order of the data file is a secondary index.
  - Also called a non-clustering or non-clustered index.
- An ordered sequential file with a primary index is an index-sequential file.
Dense Index Files
- An index that contains an index record for every search-key value in the data file is a dense index.

Dense Index Files, Cont.
- To locate the record(s) with search-key value K:
  - Find the index record with search-key value K.
  - Follow the pointer from the index record to the data record(s).
- To delete a record:
  - Locate the record in the data file, perhaps using the above procedure.
  - Delete the record from the data file.
  - If the deleted record was the only one with that search-key value, then delete the search-key from the index (similar to data record deletion).
- To insert a record:
  - Perform an index lookup using the record's search-key value.
  - If the search-key value does not appear in the index, insert it.
  - Insert the record into the data file (exactly where depends on the file organization).
  - Assign a pointer to the data record from the index entry.
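The dense-index lookup above can be sketched in a few lines, assuming (for illustration only) that the index is a sorted in-memory list of (search-key, pointer) entries and that pointers are plain record ids:

```python
import bisect

# Dense index: one (search-key, pointer) entry per search-key value,
# kept in sorted order. Record ids stand in for disk pointers.
index = [("Brighton", 1), ("Downtown", 2), ("Mianus", 3), ("Perryridge", 4)]
keys = [k for k, _ in index]

def lookup(k):
    """Return the pointer for search-key k, or None if k is absent."""
    pos = bisect.bisect_left(keys, k)       # binary search on the keys
    if pos < len(keys) and keys[pos] == k:
        return index[pos][1]
    return None
```

Because the index is dense, a failed binary search immediately proves the key is not in the data file.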
Sparse Index Files
- An index that contains index records for only some of the search-key values in the data file is a sparse index.
  - Typically one index entry for each data file block.
- Only applicable when records are sequentially ordered on the search-key, i.e., the index is a primary index.

Sparse Index Files, Cont.
- To locate a record with search-key value K:
  - Find the index record with the largest search-key value < K.
  - Search the file sequentially from the record to which the index record points.
- To delete a record:
  - Locate the record in the data file, perhaps using the above procedure.
  - Delete the record from the data file.
  - If the deleted record was the only record with its search-key value, and if an entry for the search key exists in the index, then replace the index entry with the next search-key value in the data file (in search-key order). If the next search-key value already has an index entry, the entry is simply deleted.
- To insert a record:
  - Perform an index lookup using the record's search-key value; suppose that the index stores an entry for each block of the data file.
  - If the index entry points to a block with free space, then simply insert the record in that block.
  - If the index entry points to a full block, then allocate a new block and insert the first search-key value appearing in the new block into the index.
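A minimal sketch of the sparse-index lookup, assuming blocks are just lists of keys and the index holds each block's first key (both assumptions for illustration; using <= rather than < so that a key equal to a block's first entry lands in that block):

```python
import bisect

# Sparse index: one entry per data block, holding that block's first
# search-key value. Blocks here are plain lists of keys.
blocks = [["Brighton", "Clearview"], ["Downtown", "Mianus"],
          ["Perryridge", "Redwood"]]
index_keys = [blk[0] for blk in blocks]

def lookup(k):
    """Find the index entry with the largest search-key value <= k,
    then scan that block sequentially for k."""
    pos = bisect.bisect_right(index_keys, k) - 1
    if pos < 0:
        return None                      # k precedes every index entry
    return (pos, blocks[pos].index(k)) if k in blocks[pos] else None
```

Unlike the dense case, a sparse index cannot rule a key out by itself: the target block must always be scanned.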
Sparse Index Files, Cont.
- In general, sparse indices:
  - Require less space and maintenance for insertions and deletions.
  - Are generally slower than dense indices for locating records, especially if there is more than one block per index entry.

Multilevel Index
- If an index does not fit in memory, access becomes expensive.
- To reduce the number of disk accesses to index records, treat the index as a sequential file on disk and build a sparse index on it.
  - Outer index: a sparse index.
  - Inner index: a sparse or dense index.
- If the outer index is still too large to fit in main memory, yet another level of index can be created, and so on.

Multilevel Index, Cont.
- Indices at all levels might require updating upon insertion or deletion.
- Multilevel insertion, deletion and lookup algorithms are simple extensions of the single-level algorithms.
Secondary Indices
- Frequently a table is searched using a search key (i.e., a set of columns) other than the one on which the table is sorted.
  - Suppose account is sorted by account number.
  - Searching based on branch, or searching for a range of balances.
- Although the table is not sorted based on the search key, an index can nonetheless be constructed on the search key.
- As noted previously, such an index is called a secondary index, i.e., an index whose search key does not specify the sequential order of the data file.

Secondary Indices, Cont.
- A secondary index will contain an entry for each search-key value.
- Each index entry will point to either a:
  - Single record with that search-key value (candidate key).
  - Bucket that contains pointers to all records with that search-key value (non-candidate key).
- All previous algorithms and data structures can be modified to apply to secondary indices.

Secondary Index on the balance field of account
- Note that secondary indices have to be dense (why?).

Primary and Secondary Indices
- Indices improve performance:
  - Primary indices:
    - Point queries
    - Range queries
  - Secondary indices:
    - Point queries
- Indices can also hurt performance:
  - All indices must be updated upon insertion or deletion.
  - Scanning a file sequentially in secondary search-key order can be expensive.
    - Worst case: each record access may fetch a new block from disk.
    - Thus, the number of tuples retrieved is an upper bound on the number of data blocks retrieved when scanning a secondary index for a range query.
    - The same upper bound applies to a primary index; however, in that case the upper bound is not tight (we will return to this in query optimization).
B+-Tree Index Files
- B+-tree indices are a type of multilevel index.
- Disadvantages of traditional sparse and dense index files:
  - Performance degrades as the index file grows, since many overflow blocks get created in the index file.
  - Periodic reorganization of the entire index file is required.
- Advantages of B+-tree index files:
  - A B+-tree automatically reorganizes itself with small, local changes.
  - Index reorganization is not required as frequently.

Example of a B+-tree
- Disadvantage of B+-trees: extra insertion, deletion and space overhead.
- Advantages of B+-trees outweigh the disadvantages, and they are used extensively.

B+-tree for account file (n = 3)
B+-Tree Index Files (Cont.)
A B+-tree is a rooted tree satisfying the following properties:
- All paths from the root to a leaf are of the same length.
- Each node that is not a root or a leaf must be at least half full:
  - Between ⌈n/2⌉ and n children.
  - Between ⌈n/2⌉ − 1 and n − 1 search-key values.
- A leaf node has between ⌈n/2⌉ − 1 and n − 1 search-key values.
- Special cases:
  - If the root is not a leaf, it can have as few as 2 children.
  - If the root is a leaf (that is, there are no other nodes in the tree), it can have as few as 1 value and no children.

B+-Tree Node Structure
- Node structure (leaf or internal): P1, K1, P2, K2, ..., Pn−1, Kn−1, Pn
  - The Ki are the search-key values.
  - The Pi are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes).
- The search-keys in a node are ordered: K1 < K2 < K3 < ... < Kn−1
Leaf Nodes in B+-Trees
Properties of a leaf node:
- For i = 1, 2, ..., n−1, pointer Pi either points to a file record with search-key value Ki, or to a bucket of pointers to file records, each record having search-key value Ki.
  - The bucket structure is only needed if the search-key does not form a primary key.
- If Li and Lj are leaf nodes and i < j, then Li's search-key values are less than Lj's search-key values.
- Pn points to the next leaf node in search-key order.

Non-Leaf Nodes in B+-Trees
- Non-leaf nodes form a multilevel sparse index on the leaf nodes. For a non-leaf node with n pointers:
  - All the search-keys in the subtree to which P1 points are less than K1.
  - For 2 ≤ i ≤ n − 1, all the search-keys in the subtree to which Pi points have values greater than or equal to Ki−1 and less than Ki.
  - All the search-keys in the subtree to which Pn points are greater than or equal to Kn−1.
Example of B+-tree
B+-tree for account file (n = 5)
- Leaf nodes must have between 2 and 4 values (⌈n/2⌉ − 1 and n − 1, with n = 5).
- Non-leaf nodes other than the root must have between 3 and 5 children (⌈n/2⌉ and n, with n = 5).
- Non-leaf nodes other than the root must have between 2 and 4 values.
- The root must have at least 2 children.

Observations about B+-trees
- A B+-tree contains a relatively small number of levels (logarithmic in the number of search keys appearing in the main file), thus searches can be conducted efficiently.
- Insertions and deletions to the index file can be handled efficiently, as the index can be restructured in logarithmic time (as we shall see).
- Each node is typically a disk block.
  - Logically close blocks need not be physically close.
Queries on B+-Trees
- Searching a B+-tree for a given search-key value is a straightforward generalization of searching a binary search tree.
- Find all records with a search-key value of k (see page 492):
  - Start with the root node.
    - Examine the node for the smallest search-key value > k.
    - If such a value exists, call it Kj, then follow Pj to the child node.
    - Otherwise k ≥ Km−1, where there are m pointers in the node. Then follow Pm to the child node.
  - If the node is not a leaf, repeat the above procedure on that node.
  - Eventually reach a leaf node.
    - If for some i, key Ki = k, follow pointer Pi to the desired record or bucket.
    - Otherwise no record with search-key value k exists.
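The search procedure above can be sketched as follows. This is a simplified, illustrative structure, not a full B+-tree: record pointers are plain ids and the next-leaf pointers (Pn) are omitted:

```python
# Hypothetical node layout: an internal node holds keys [K1..Km-1] and
# children [P1..Pm]; a leaf holds keys and matching record pointers.
class Node:
    def __init__(self, keys, children=None, pointers=None):
        self.keys = keys            # sorted search-key values
        self.children = children    # child nodes (internal node only)
        self.pointers = pointers    # record ids (leaf node only)

    @property
    def is_leaf(self):
        return self.children is None

def search(root, k):
    """At each internal node, branch on the smallest key > k;
    at the leaf, look for an exact match."""
    node = root
    while not node.is_leaf:
        i = 0
        while i < len(node.keys) and k >= node.keys[i]:
            i += 1                  # k >= Ki: keep moving right
        node = node.children[i]     # follow the chosen pointer
    for key, ptr in zip(node.keys, node.pointers):
        if key == k:
            return ptr
    return None                     # no record with search-key k

# A tiny example tree; leaves hold (key, record-id) data.
leaf1 = Node(["Brighton", "Clearview"], pointers=[1, 2])
leaf2 = Node(["Downtown", "Mianus"], pointers=[3, 4])
leaf3 = Node(["Perryridge", "Redwood"], pointers=[5, 6])
root = Node(["Downtown", "Perryridge"], children=[leaf1, leaf2, leaf3])
```

Note that the inner while loop implements "smallest search-key value > k": it stops at the first key greater than k and follows the pointer just before it.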
Queries on B+-Trees (Cont.)
- In processing a query, a path is traversed in the tree from the root to some leaf node.
- If there are K search-key values in the file, the path is no longer than ⌈log⌈n/2⌉(K)⌉.
- Since a node is generally the same size as a disk block, typically 4 kilobytes, n is typically around 100 (assuming 40 bytes per index entry).
- With 1 million search-key values and n = 100, at most ⌈log50(1,000,000)⌉ = 4 nodes are accessed in a lookup.
- Contrast this with a balanced binary tree with 1 million search-key values: around 20 nodes, i.e., blocks, are accessed.
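The arithmetic on this slide is easy to check directly:

```python
import math

# With K search-key values and fanout n, the path length is at most
# ceil(log base ceil(n/2) of K); here n = 100, so the base is 50.
n, K = 100, 1_000_000
height = math.ceil(math.log(K, n // 2))

# Compare with a balanced binary tree over the same number of keys.
binary_height = math.ceil(math.log2(K))
```

Four block reads versus roughly twenty is the whole argument for high-fanout trees on disk.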
Updates on B+-Trees: Insertion
- Find the leaf node in which the search-key value would appear.
- If the search-key value is already in the leaf node:
  - The record is added to the data file, and
  - If necessary, a pointer is inserted into the bucket.
- If the search-key value is not in the leaf node:
  - Add the record to the data file, and
  - Create a bucket if necessary.
  - If there is room in the leaf node, insert the (key-value, pointer) pair.
  - Otherwise, split the node as discussed in the next slide.

Updates on B+-Trees: Insertion (Cont.)
- Splitting a node:
  - Examine the n (search-key value, pointer) pairs (including the one being inserted) in sorted order.
  - Place the first ⌈n/2⌉ in the original node, and the rest in a new node.
  - Let the new node be p, and let k be the least key value in p. Insert (k, p) in the parent of the node being split (recursively).
  - If the parent is full, split it and propagate the split further up.
- Splitting of nodes continues up the tree until a node is found that is not full.
- In the worst case the root node is split, increasing the height of the tree.

Updates on B+-Trees: Insertion (Cont.)
B+-tree before and after insertion of "Clearview"
Updates on B+-Trees: Deletion
- Find the record to be deleted, and remove it from the data file and from the bucket (if present).
- Remove the (search-key value, pointer) pair from the leaf node if there is no bucket or if the bucket has become empty.
- If the node has too few entries, and the entries in the node and a sibling fit into a single node, then:
  - Insert all the search-key values in the two nodes into a single node (the one on the left), and delete the other node.
  - Delete the pair (Ki−1, Pi), where Pi is the pointer to the deleted node, from its parent, recursively using the above procedure.

Updates on B+-Trees: Deletion (Cont.)
- Otherwise, if the node has too few entries, but the entries in the node and a sibling do not fit into a single node, then:
  - Redistribute the pointers between the node and a sibling so that both have more than the minimum number of entries.
  - Update the corresponding search-key value in the parent node.
- Node deletions may cascade upwards until a node with ⌈n/2⌉ or more pointers is found, or redistribution with a sibling occurs.
- If the root node has only one pointer after deletion, it is deleted and the sole child becomes the root.
Examples of B+-Tree Deletion
- Before and after deleting "Downtown".
- Note that the removal of the leaf node containing "Downtown" did not result in its parent having too few pointers, so the deletions stopped.

Examples of B+-Tree Deletion ("Perryridge")
Should be "Perryridge", based on the code.
- The node with "Perryridge" becomes under-full (actually empty, in this special case) and is merged with its sibling.
- As a result the "Perryridge" node's parent became under-full, and was merged with its sibling (and an entry was deleted from their parent).
- The root node then had only one child, and was deleted.

Example of B+-tree Deletion ("Perryridge")
- The parent of the leaf containing "Perryridge" became under-full, and borrowed a pointer from its left sibling.
- The search-key value in the parent's parent changes as a result.
B+-Tree File Organization
- The index file degradation problem is solved by using B+-tree indices.
- Data file degradation is eliminated by using a B+-tree file organization.
- The leaf nodes in a B+-tree file organization store the actual records.
- Since records are larger than pointers, the maximum number of records that can be stored in a leaf node is less than the number of pointers in a non-leaf node.
- Leaf nodes are still required to be half full.
- Insertion and deletion are handled in the same way as insertion and deletion of entries in a B+-tree index.

B+-Tree File Organization (Cont.)
Example of B+-tree file organization
- Good space management is needed since records use more space than pointers.
- To improve space utilization, involve more siblings in merge and split redistribution.
  - Involving 2 siblings in redistribution (to avoid a split or merge where possible) results in each node having at least ⌊2n/3⌋ entries.
B-Tree Index Files
- A B-tree is similar to a B+-tree, but it allows search-key values to appear only once.
- Search keys in non-leaf nodes appear nowhere else in the B-tree; an additional pointer field for each search key in a non-leaf node is included.
- Generalized B-tree leaf node.
- Non-leaf node: the pointers Bi are the bucket or file record pointers.

B-Tree Index File Example
- B-tree and corresponding B+-tree on the same data.

B-Tree Index Files, Cont.
- Advantages of B-tree indices:
  - May use fewer tree nodes than a corresponding B+-tree.
  - Sometimes possible to find a search-key value before reaching a leaf node.
- Disadvantages of B-tree indices:
  - Only a small fraction of all search-key values are actually found early.
  - Non-leaf nodes contain more data, so fan-out is reduced; thus B-trees typically have greater depth than the corresponding B+-tree.
  - Insertion and deletion are more complicated than in B+-trees.
  - Implementation is harder than for B+-trees.
- Typically, the advantages of B-trees do not outweigh the disadvantages.
Static Hashing
- A bucket is a unit of storage containing one or more records.
  - In the simplest and ideal case a bucket is a disk block.
  - Every bucket has an address.
- A hash function h is a function from the set of all search-key values K to the set of all bucket addresses B.
  - Typically quick and easy to compute.
- In a hash file organization we obtain the bucket of a record directly from its search-key value using a hash function.
  - Used to locate, insert and delete records.
- Records with different search-key values are typically mapped to the same bucket.
  - Thus an entire bucket has to be searched sequentially to locate a record.
Example of Hash File Organization
- Consider a hash file organization of the account table (next page).
  - 10 buckets (more typically, this is a prime number).
  - The search key is branch-name.
- Typically a hash function performs a computation on the internal binary representation of the search-key. Example:
  - Let the ith letter of the alphabet be represented by the integer i.
  - Return the sum of the representations of the characters, modulo 10.
  - h(Mianus) = (13 + 9 + 1 + 14 + 21 + 19) mod 10 = 77 mod 10 = 7
  - h(Perryridge) = 5
  - h(Round Hill) = 3
  - h(Brighton) = 3
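The example hash function on this slide can be written directly (ignoring non-letter characters such as the space in "Round Hill", which is what makes the slide's values come out):

```python
def h(key, buckets=10):
    """Treat the i-th letter of the alphabet as the integer i
    (a=1, ..., z=26), sum over the letters, take the result mod buckets."""
    return sum(ord(c) - ord('a') + 1
               for c in key.lower() if c.isalpha()) % buckets
```

Note that "Round Hill" and "Brighton" collide in bucket 3, illustrating why a bucket must still be scanned sequentially after hashing.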
Example of Hash File Organization

Hash Functions
- An ideal hash function is uniform - each bucket is assigned the same number of search-key values from the set of all possible search keys.
- An ideal hash function is random - on average each bucket will have the same number of records assigned to it, irrespective of the actual distribution of search-key values in the file.
  - Only possible if each search-key value is in a small fraction of the records.
  - See the book for an example of a hash function that is uniform but not random.
- Worst case: the hash function maps all search-key values to the same bucket; this makes access time proportional to the number of search-key values in the file.
- Extensive research on hash functions has been done over the years.
  - Other textbooks cover hash functions to a much greater degree.
Bucket Overflows
- Goal: one block per bucket.
- Bucket overflow occurs when a bucket has been assigned more records than it has space for.
- Bucket overflow can occur because of:
  - Insufficient buckets - lots of records.
  - Skew in record distribution across buckets:
    - multiple records have the same search-key value
    - bad hash function (non-random distribution)
    - non-prime number of buckets
- The probability of bucket overflow can be reduced, but not eliminated.
- Overflow is handled by using overflow buckets.

Handling of Bucket Overflows
- Overflow chaining: the overflow blocks of a given bucket are chained together in a linked list.
  - This is called closed hashing.
- An alternative, called open hashing, which does not use overflow buckets, is not suitable for database applications.
Hash Indices
- Hashing can be used not only for file organization, but also for index-structure creation.
- A hash index organizes the search keys, with their associated record pointers, into a hash file structure.
- The term hash index is typically used to refer to both a separate hash-organized index structure and hash-organized files.

Example of a Hash Index
Data blocks probably sorted on some attribute other than account number.

Hash Indices, Cont.
- The book says that, strictly speaking, hash indices are always secondary indices.
- This raises the following question:
  - Would a hash index make sense on an attribute on which the table is sorted?
  - If so, then, by definition, that index would be a primary index, thereby contradicting the above.
  - Why the heck not?
- Perhaps the point the authors were trying to make is:
  - If data records are assigned to blocks using hashing, then the table is not sorted by the search key, so it can't be referred to as a primary index.
Deficiencies of Static Hashing
- In static hashing, h maps search-key values to a fixed set of buckets.
  - If the initial number of buckets is too small, performance will degrade due to overflows as the database grows.
  - If the future file size is anticipated and the number of buckets allocated accordingly, a significant amount of space will be wasted initially.
  - If the database shrinks, again space will be wasted.
- One option is periodic re-organization of the file with a new hash function, but this can be expensive.
- These problems can be avoided by using techniques that allow the number of buckets to be modified dynamically.

Dynamic Hashing
- Good for databases that grow and shrink in size.
- The hash function and the number of buckets change dynamically.
- Extendable hashing is one form of dynamic hashing.
- An extendable hash table has three parts:
  - Hash function
  - Buckets
  - Bucket address table
Dynamic Hashing: Hash Function
- Hash function:
  - Generates values over a large range - typically b-bit integers, with b = 32.
  - At any time only a prefix of the hash value is used.
- The length of the prefix used is i bits, 0 ≤ i ≤ b.
  - i is referred to as the global depth.
  - The value of i will grow and shrink along with the hash table.
  - i is used to index into the bucket address table.

Dynamic Hashing: Bucket Address Table
- The bucket address table contains 2^i entries; initially i = 0.
- Each entry contains a pointer to a bucket.
- Multiple entries in the bucket address table may point to the same bucket.
  - Thus, the actual number of buckets is ≤ 2^i.

Dynamic Hashing: Buckets
- As insertions and deletions take place, buckets are merged and split.
- Each bucket has an associated value called its local depth.
  - The local depth is equal to the number of bits that distinguish the values in the bucket from the values in all other buckets.
- Bucket chaining may still occur.
General Extendable Hash Structure
Global depth i; local depths ij.
- Note that it is always the case that ij ≤ i.

Use of Extendable Hash Structure
- Throughout the following:
  - i is the global depth.
  - ij is the local depth for bucket j.
- To locate the bucket containing search-key value Kj:
  1. Compute h(Kj) = X.
  2. Use the first i bits of X as a displacement into the bucket address table, and follow the pointer to the appropriate bucket.
- To insert a record with search-key value Kj:
  - Follow the above procedure to locate the bucket, say j.
  - If there is room in the bucket then insert the record.
  - Otherwise split the bucket and re-attempt the insertion (next slide).
Updates in Extendable Hash Structure
To split a bucket j when inserting a record with search-key value Kj:
- If i > ij (more than one pointer to bucket j):
  - Allocate a new bucket z, and set ij and iz to ij + 1.
  - Make the second half of the bucket address table entries currently pointing to j point to z.
  - Remove and reinsert each record in bucket j.
  - Re-compute the new bucket for Kj and insert the record in the bucket (further splitting is required if the bucket is still full).
- If i = ij (only one pointer to bucket j):
  - Increment i and double the size of the bucket address table.
  - Replace each entry in the table by two entries pointing to the same bucket.
  - Re-compute the bucket address table entry for Kj. Now i > ij, so use the first case above.
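The two cases above can be condensed into a small Python sketch. The simplifying assumptions are loud: keys are small integers, the "hash" is just the key truncated to 4 bits (the slides use b = 32), bucket capacity is 2, and overflow buckets for repeated splits are omitted:

```python
class Bucket:
    def __init__(self, depth, capacity=2):
        self.depth = depth        # local depth i_j
        self.capacity = capacity
        self.items = []           # (key, record) pairs

B = 4  # hash width in bits; small here so splits are easy to trace

def h(key):
    return key & ((1 << B) - 1)   # toy hash: the key itself, truncated

class ExtendableHash:
    def __init__(self):
        self.i = 0                # global depth
        self.table = [Bucket(0)]  # bucket address table, 2**i entries

    def _index(self, key):
        return h(key) >> (B - self.i)   # first i bits of the hash value

    def lookup(self, key):
        return [r for k, r in self.table[self._index(key)].items if k == key]

    def insert(self, key, rec):
        bucket = self.table[self._index(key)]
        if len(bucket.items) < bucket.capacity:
            bucket.items.append((key, rec))
            return
        if bucket.depth == self.i:
            # Case i = ij: double the table; each entry becomes two
            # entries pointing at the same bucket.
            self.i += 1
            self.table = [b for b in self.table for _ in (0, 1)]
        # Case i > ij: split the bucket, point the second half of its
        # table entries at the new bucket, and redistribute the records.
        bucket.depth += 1
        new = Bucket(bucket.depth, bucket.capacity)
        ptrs = [idx for idx, b in enumerate(self.table) if b is bucket]
        for idx in ptrs[len(ptrs) // 2:]:
            self.table[idx] = new
        old_items, bucket.items = bucket.items, []
        for k, r in old_items:
            self.table[self._index(k)].items.append((k, r))
        self.insert(key, rec)     # re-attempt; may split again
```

With high-order-prefix indexing, all table entries sharing a bucket's prefix form a contiguous run, which is why reassigning the second half of those pointers implements the slide's rule.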
Updates in Extendable Hash Structure (Cont.)
- When inserting, if a bucket is still full after several splits, create an overflow bucket instead of splitting the bucket further.
- To delete a record:
  - Locate the record and remove it from its bucket.
  - The bucket itself can be removed if it becomes empty (with appropriate updates to the bucket address table).
  - Buckets can be coalesced.
  - Decreasing the bucket address table size is also possible.
- Note:
  - A bucket can only be coalesced with a "buddy" bucket having the same local depth (value of ij) and the same ij − 1 prefix, if such a bucket is present.
  - Decreasing the bucket address table size might be an expensive operation and should only be done if the number of buckets becomes much smaller than the size of the table.

Use of Extendable Hash Structure (Example)
Initial hash structure, bucket size = 2.

Example (Cont.)
- Hash structure after insertion of one "Brighton" and two "Downtown" records.

Example (Cont.)
Hash structure after insertion of "Mianus" record.

Example (Cont.)
Hash structure after insertion of three "Perryridge" records.

Example (Cont.)
- Hash structure after insertion of "Redwood" and "Round Hill" records.
Extendable Hashing vs. Other Schemes
- Benefits of extendable hashing:
  - Hash performance does not degrade with file growth.
  - Minimal space overhead.
- Disadvantages of extendable hashing:
  - Extra level of indirection to find the desired record.
  - The bucket address table may itself become larger than memory.
    - Could impose a tree (or other) structure to locate an index entry.
  - Changing the size of the bucket address table can be an expensive operation.
- Linear hashing is an alternative mechanism which avoids these disadvantages at the possible cost of more bucket overflows.

Comparison of Ordered Indexing and Hashing
- What are the expected types of queries?
  - Hashing is generally better for point queries.
  - If range queries are common, ordered indices are preferred.
- Is it desirable to optimize average access time at the expense of worst-case access time?
- What is the cost and frequency of insertions and deletions?
- What is the cost and frequency of periodic re-organization?
Index Definition in SQL
- Creating an index:
  - create index <index-name> on <relation-name> (<attribute-list>)
- Example:
  - create index b-index on branch(branch-name)
- To drop an index:
  - drop index <index-name>
- Indices can also be used to enforce the condition that a search key is a candidate key:
  - create unique index
Multiple-Key Access
- Multiple indices can be used for certain queries:
  - select account-number
  - from account
  - where branch-name = "Perryridge" and balance = 1000
- Possible strategies using single-attribute indices:
  1. Use the index on branch-name to find accounts with branch-name = "Perryridge"; test balance = 1000 in memory.
  2. Use the index on balance to find accounts with balance = 1000; test branch-name = "Perryridge" in memory.
  3. Use three steps:
    - Use the branch-name index to find pointers to all records pertaining to the Perryridge branch.
    - Similarly use the index on balance.
    - Take the intersection of both sets of pointers obtained.

Indices on Multiple Attributes
- An index can be created on more than one attribute.
- A key requirement for ordered indices is the ability to compare search keys for =, < and >.
- Suppose a multi-attribute index is created on the attributes (branch-name, balance) in the branch relation.
- How do comparisons work? (=, <, >, etc.)
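The usual answer is lexicographic ordering, which Python's tuples implement directly; a sketch with illustrative values:

```python
# Composite search keys on (branch-name, balance); values are made up.
k1 = ("Perryridge", 1000)
k2 = ("Perryridge", 1500)
k3 = ("Redwood", 100)

# Tuples compare lexicographically: first on branch-name, and on
# balance only when the branch names are equal.
ordered = sorted([k3, k2, k1])
```

This is exactly why the index on the next slide handles an equality on the first attribute combined with a range on the second, but not the reverse.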
Indices on Multiple Attributes, Cont.
- A multi-attribute index can support queries such as:
  - where branch-name = "Perryridge" and balance = 1000
  - Using separate indices is likely to be less efficient: many records (or pointers) could be retrieved that satisfy only one of the conditions.
- The multi-attribute index is helpful for queries such as:
  - where branch-name = "Perryridge" and balance < 1000
- The multi-attribute index is not helpful for queries such as:
  - where branch-name < "Perryridge" and balance = 1000
  - It may fetch many records that satisfy the first but not the second condition.

Bitmap Indices
- Bitmap indices are a special type of index designed for efficient querying on multiple search keys.
  - Not particularly useful for single-attribute queries.
  - Typically used in data warehouses.
- Applicable on attributes having a small number of distinct values, e.g.:
  - Gender, country, state, ...
  - Income level (0-9999, 10000-19999, 20000-50000, 50000-infinity)
- Bitmap assumptions:
  - Records in a relation are numbered sequentially from, say, 0.
  - Given a number n, it must be easy to retrieve record n.
    - Particularly easy if records are of fixed size.
Bitmap Indices (Cont.)
- A bitmap index on an attribute has a bitmap for each attribute value.
- A bitmap is simply an array of bits.
  - The bitmap has as many bits as there are records in the file.
  - In the bitmap for value v, the bit for a record is 1 if the record has the value v for the attribute, and 0 otherwise.

Bitmap Indices (Cont.)
- Queries on multiple bitmap-indexed attributes use bitmap operations:
  - Intersection (and): 100110 AND 110011 = 100010
  - Union (or): 100110 OR 110011 = 110111
  - Complementation (not): NOT 100110 = 011001
- Example - retrieve records for males with income level L1:
  - 10010 AND 10100 = 10000
- The resulting bitmap is then used to retrieve the satisfying tuples.
  - Counting the number of satisfying tuples is even faster.
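The slide's 6-bit examples map directly onto integer bitwise operations; representing each bitmap as a Python int:

```python
# Bitmaps as integers; WIDTH is the number of records.
WIDTH = 6
MASK = (1 << WIDTH) - 1

a = 0b100110
b = 0b110011

intersection = a & b       # AND
union = a | b              # OR
complement = ~a & MASK     # NOT, masked back to the 6-bit width

def bits(x):
    """Render a bitmap as a fixed-width bit string."""
    return format(x, f"0{WIDTH}b")
```

The mask after complementation matters: without it, Python's unbounded integers would set infinitely many high bits.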
Bitmap Indices (Cont.)
- How big is a bitmap index?
- Bitmap indices are generally very small compared with relation size.
- Example:
  - number of records is n
  - record size = 100 bytes, i.e., 800 bits (somewhat conservative)
  - total table size is n × 800 bits
  - bitmap size is n bits
  - space for a single bitmap is 1/800 of the space used by the relation.
  - If the number of distinct attribute values is 8, then the bitmap index is only 1% of the relation size.
- Exercise (do the math): what happens if each record is 32 bits and there are 64 distinct attribute values? Where is the cut-off point?
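Spelling out the slide's arithmetic (and a possible reading of the exercise: with v distinct values and r bits per record, the index occupies v/r of the relation, so the cut-off is roughly v = r):

```python
# The slide's numbers: 100-byte records = 800 bits each.
record_bits = 100 * 8
single_bitmap_fraction = 1 / record_bits   # one bitmap: n bits vs n*800 bits
index_fraction = 8 / record_bits           # 8 distinct values -> 1%

# The exercise's numbers: 32-bit records, 64 distinct values.
exercise_fraction = 64 / 32                # index is twice the relation size
```

So for tiny records with many distinct values, a plain bitmap index stops paying for itself.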
Bitmap Indices (Cont.)
- How does a bitmap change if a record is deleted?
  - Could we simply put a 0 in all bitmaps for the record?
  - No! Consider:
    - not(A = v)
- An existence bitmap could be used to indicate whether there is a valid record at a specific location, so that negation works properly:
  - (NOT bitmap-A-v) AND ExistenceBitmap
- Bitmaps need to be maintained for all values, including null.
  - To correctly handle SQL (3-valued) null semantics for NOT(A = v), intersect the above result with (NOT bitmap-A-Null).

Efficient Implementation of Bitmap Operations
- Bitmaps are packed into words.
  - A single word and operation (a basic CPU instruction) computes the and of 32 or 64 bits at once.
  - The conjunction (and) of two bitmaps containing 1 million bits each can be done with just 31,250 instructions.
- Counting the number of 1s can be done quickly by a trick:
  - Use each byte to index into a pre-computed array of 256 elements, each storing the count of 1s in the binary representation of that byte.
  - Add up the retrieved counts.
- Bitmaps can be used instead of tuple-ID lists at the leaf levels of B+-trees, for values that have a large number of matching records.
  - If a tuple-ID is 64 bits, then this saves space if > 1/64 of the records have a specific value (as an exercise, do the math).
  - This technique merges the benefits of bitmap and B+-tree indices.
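The byte-table counting trick above can be sketched as follows, with the bitmap stored as a byte string:

```python
# Pre-computed table: POPCOUNT[b] = number of 1 bits in byte value b.
POPCOUNT = [bin(b).count("1") for b in range(256)]

def count_ones(bitmap: bytes) -> int:
    """Sum per-byte counts instead of testing bits one at a time."""
    return sum(POPCOUNT[b] for b in bitmap)
```

One table lookup replaces eight bit tests, which is the same time-space trade-off the slide describes for word-at-a-time AND.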
End of Chapter
Partitioned Hashing
- Hash values are split into segments that depend on each attribute of the search-key:
  - (A1, A2, ..., An) for an n-attribute search-key.
- Example: n = 2 for customer, the search-key being (customer-street, customer-city):

  search-key value       hash value
  (Main, Harrison)       101 111
  (Main, Brooklyn)       101 001
  (Park, Palo Alto)      010 010
  (Spring, Brooklyn)     001 001
  (Alma, Palo Alto)      110 010

- To answer an equality query on a single attribute, multiple buckets need to be looked up. Similar in effect to grid files.
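A sketch of the scheme, using hypothetical per-attribute hash segments (3 bits each) chosen to reproduce the slide's table:

```python
# Hypothetical segment hashes for each attribute (values from the slide).
H_STREET = {"Main": 0b101, "Park": 0b010, "Spring": 0b001, "Alma": 0b110}
H_CITY = {"Harrison": 0b111, "Brooklyn": 0b001, "Palo Alto": 0b010}

def h(street, city):
    """Concatenate the two 3-bit segments into one bucket address."""
    return (H_STREET[street] << 3) | H_CITY[city]

def buckets_for_city(city):
    """Equality on city alone: every street segment must be tried."""
    return [(s << 3) | H_CITY[city] for s in range(8)]
```

The second function shows the cost the slide mentions: a single-attribute query fans out to 2^3 = 8 candidate buckets.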
Grid Files
- A grid file is an index that supports multiple search-key queries involving one or more comparison operators.
- The grid file consists of:
  - a single n-dimensional grid array, where n is the number of search-key attributes.
  - n linear scales, one for each search-key attribute.
- Multiple cells of the grid array can point to the same bucket.

Example Grid File for account

Queries on a Grid File
- A grid file on two attributes A and B can handle queries of all the following forms with reasonable efficiency:
  - (a1 ≤ A ≤ a2)
  - (b1 ≤ B ≤ b2)
  - (a1 ≤ A ≤ a2 ∧ b1 ≤ B ≤ b2)
- During insertion (similar to extendable hashing, but on n dimensions):
  - If a bucket becomes full, then it can be split if more than one cell points to it.
  - If only one cell points to the bucket, either an overflow bucket can be created or the grid size can be increased.
- During deletion (also similar to extendable hashing):
  - If a bucket becomes empty, the bucket can be merged with other buckets,
  - or the grid pointer can be set to null, etc.

Grid Files (Cont.)
- Linear scales must be chosen to uniformly distribute records across cells.
  - Otherwise there will be too many overflow buckets.
- Periodic re-organization to increase grid size will help.
  - But reorganization can be very expensive.
- Space overhead of the grid array can be high.
- R-trees (Chapter 23) are an alternative.